<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/" version="3.10.0">Jekyll</generator><link href="https://amirsojoodi.github.io/feed.xml" rel="self" type="application/atom+xml" /><link href="https://amirsojoodi.github.io/" rel="alternate" type="text/html" /><updated>2026-03-26T01:13:39-04:00</updated><id>https://amirsojoodi.github.io/feed.xml</id><title type="html">Amir’s Homepage</title><subtitle>Research Assistant at Queen&apos;s University</subtitle><author><name>AmirHossein Sojoodi</name><email>amir.sojoodi@gmail.com</email></author><entry><title type="html">PhD Dissertation LaTeX Template</title><link href="https://amirsojoodi.github.io/posts/PhD-Dissertation-LaTeX-Template/" rel="alternate" type="text/html" title="PhD Dissertation LaTeX Template" /><published>2026-01-23T00:00:00-05:00</published><updated>2026-01-23T00:00:00-05:00</updated><id>https://amirsojoodi.github.io/posts/LaTeX-PhD-Template</id><content type="html" xml:base="https://amirsojoodi.github.io/posts/PhD-Dissertation-LaTeX-Template/"><![CDATA[<p>This is an unofficial LaTeX template (Jan 2026) for PhD dissertations at Queen’s University (ECE department, PPRL group).</p>

<p>You can use this template as a starting point for your own dissertation. It includes the basic structure and formatting required by the university, as well as some additional features such as a glossary and macros for common terms.</p>

<p>View it on <a href="https://github.com/amirsojoodi/Queensu-PhD-Thesis-Template">Github</a> or <a href="https://www.overleaf.com/read/pdcwwrgtjfvy#423f0e">Overleaf</a>.</p>
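
<p>For context, a local build (a sketch; I’m assuming the main file is called <code class="language-plaintext highlighter-rouge">main.tex</code>, so adjust to the template’s actual entry point) goes roughly like this:</p>

<pre><code class="language-bash"># First pass to generate the glossary auxiliary files
latexmk -pdf main.tex
# Build the glossary entries
makeglossaries main
# Rebuild so glossary references resolve
latexmk -pdf main.tex
</code></pre>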

<ul>
  <li>The glossary is created automatically. See the sample to understand how to define new terms and use them in the text.</li>
  <li>Several macros have been defined in preamble.sty to make the text more readable. Use the same format to create your own macros.</li>
  <li>To build locally, I suggest Linux or WSL, with <code class="language-plaintext highlighter-rouge">latexmk</code> and <code class="language-plaintext highlighter-rouge">makeglossaries</code>.</li>
</ul>]]></content><author><name>AmirHossein Sojoodi</name><email>amir.sojoodi@gmail.com</email></author><category term="LaTeX" /><category term="Templates" /><category term="Tips" /><summary type="html"><![CDATA[This is an unofficial LaTeX template (Jan 2026) for PhD dissertations at Queen’s University (ECE department, PPRL group).]]></summary></entry><entry><title type="html">Setting up Gemini CLI on WSL Ubuntu 24.04</title><link href="https://amirsojoodi.github.io/posts/Setting-up-Gemini-CLI/" rel="alternate" type="text/html" title="Setting up Gemini CLI on WSL Ubuntu 24.04" /><published>2025-12-29T00:00:00-05:00</published><updated>2025-12-29T00:00:00-05:00</updated><id>https://amirsojoodi.github.io/posts/Pomodoro</id><content type="html" xml:base="https://amirsojoodi.github.io/posts/Setting-up-Gemini-CLI/"><![CDATA[<p>Over the past few years, I have been under a heavy workload of studies and work, and I couldn’t really explore the new AI tools coming out these days. Today, I was searching for a pomodoro timer app to help me focus on my work, and I thought: why not build one myself? I think it was Afshin who influenced me. Check out his game (brilliant idea <em>and</em> vibecoded!): <a href="https://c3c.arefi.info/">Connect 3 Chess</a>.</p>

<p>So I finally had both the motivation and the time to explore and experiment with a vibe-coding agent. I decided to start with Gemini CLI.</p>

<h2 id="setting-up-gemini-cli-on-wsl-ubuntu-2404">Setting up Gemini CLI on WSL Ubuntu 24.04</h2>

<p>I used this <a href="https://www.zdnet.com/article/geminis-command-line-tool-is-a-productivity-game-changer-and-its-free-how-i-use-it/">guide</a> as a reference.</p>

<h3 id="step-1-install-nodejs-and-npm">Step 1: Install Node.js and npm</h3>

<p>I used <code class="language-plaintext highlighter-rouge">nvm</code>
(see <a href="https://nodejs.org/en/download">here</a> for more details):</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c"># Download and install nvm:</span>
curl <span class="nt">-o-</span> https://raw.githubusercontent.com/nvm-sh/nvm/v0.40.3/install.sh | bash

<span class="c"># in lieu of restarting the shell</span>
<span class="se">\.</span> <span class="s2">"</span><span class="nv">$HOME</span><span class="s2">/.nvm/nvm.sh"</span>

<span class="c"># Download and install Node.js:</span>
nvm <span class="nb">install </span>24

<span class="c"># Verify the Node.js version:</span>
node <span class="nt">-v</span> <span class="c"># Should print "v24.12.0".</span>

<span class="c"># Verify npm version:</span>
npm <span class="nt">-v</span> <span class="c"># Should print "11.6.2".</span>
</code></pre></div></div>

<h3 id="step-2-install-and-run-gemini-cli">Step 2: Install and run Gemini CLI</h3>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>npm <span class="nb">install</span> <span class="nt">-g</span> @google/gemini-cli
</code></pre></div></div>

<p>After installation, run the following command to start the Gemini CLI setup. One caution first: <strong>as far as I can tell, Gemini CLI reads the contents of whatever directory you launch it from, so make sure not to run it in a confidential or private path.</strong></p>
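
<p>To be on the safe side, I now start it from an empty scratch directory (the path below is just an example):</p>

<pre><code class="language-bash"># Create a clean sandbox directory so Gemini CLI has nothing sensitive to read
mkdir -p ~/scratch/gemini-sandbox
cd ~/scratch/gemini-sandbox
</code></pre>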

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span>gemini

 ███            █████████  ██████████ ██████   ██████ █████ ██████   █████ █████
░░░███         ███░░░░░███░░███░░░░░█░░██████ ██████ ░░███ ░░██████ ░░███ ░░███
  ░░░███      ███     ░░░  ░███  █ ░  ░███░█████░███  ░███  ░███░███ ░███  ░███
    ░░░███   ░███          ░██████    ░███░░███ ░███  ░███  ░███░░███░███  ░███
     ███░    ░███    █████ ░███░░█    ░███ ░░░  ░███  ░███  ░███ ░░██████  ░███
   ███░      ░░███  ░░███  ░███ ░   █ ░███      ░███  ░███  ░███  ░░█████  ░███
 ███░         ░░█████████  ██████████ █████     █████ █████ █████  ░░█████ █████
░░░            ░░░░░░░░░  ░░░░░░░░░░ ░░░░░     ░░░░░ ░░░░░ ░░░░░    ░░░░░ ░░░░░

Tips <span class="k">for </span>getting started:
1. Ask questions, edit files, or run commands.
2. Be specific <span class="k">for </span>the best results.
3. Create GEMINI.md files to customize your interactions with Gemini.
4. /help <span class="k">for </span>more information.
</code></pre></div></div>

<p>After authenticating, you can start using Gemini CLI. That’s it!</p>

<h2 id="lets-try-it-out">Let’s try it out</h2>

<p>I gave it a straightforward request.</p>

<pre><code class="language-txt">Create a desktop app for me to for the purpose of setting focused 
and DoNotDisturb timers, including breaks, work, etc. I'd like it
to have a small window, with minimal and beautiful colors. 
Whenever the timer ends, I want it to make a small bell-like 
notification. I want the focused time and break time to be 
customizable but also set as default with 3-4 presets. I don't 
want the app to stay on top to distract me. Build this app for 
running from a WSL Ubuntu terminal that launches a GUI. With Python.
</code></pre>

<p>Overall, after almost an hour, it did a pretty good job: after asking me to install several dependencies and fixing a few issues, it managed to run the app successfully. However, the bell notification did not work because of a WSL limitation: the Linux guest could not access the Windows audio hardware. Other than that, the app worked fine.
I also asked for some improvements and features, like repetitions and progress tracking, and it implemented them pretty well.</p>

<p>For my record, I had to do the following to make the app work:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">sudo </span>apt <span class="nb">install </span>python3.12-venv
pyenv virtualenv pomodoro
pyenv activate pomodoro
<span class="nb">sudo </span>apt <span class="nb">install </span>python3-tk
pip3 <span class="nb">install </span><span class="nv">playsound</span><span class="o">==</span>1.2.2
<span class="nb">sudo </span>apt-get <span class="nb">install </span>libgirepository1.0-dev libgirepository2.0-dev
<span class="nb">sudo </span>apt <span class="nb">install </span>libcairo2-dev pkg-config python3-dev
pip <span class="nb">install </span>PyGObject
</code></pre></div></div>

<p>I also installed <code class="language-plaintext highlighter-rouge">aplay</code> to test sound playback from WSL:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">sudo </span>apt <span class="nb">install </span>alsa-utils
aplay /path/to/sample.wav
</code></pre></div></div>

<p>That didn’t work either.</p>
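
<p>One workaround I have come across (a sketch I haven’t verified; the .wav path is a stock Windows sound) is to hand playback over to Windows itself through WSL interop:</p>

<pre><code class="language-bash"># Let Windows play the sound instead of the WSL guest
powershell.exe -c '(New-Object Media.SoundPlayer "C:\Windows\Media\notify.wav").PlaySync()'
</code></pre>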

<p>Then I thought: why not run it from Windows, with a clickable icon? So I asked Gemini to generate one, and it did a great job again. It created a <code class="language-plaintext highlighter-rouge">.bat</code> and a <code class="language-plaintext highlighter-rouge">.vbs</code> file to launch the app without a terminal window. I also tweaked the color scheme a bit to match my Windows dark/green theme.</p>

<p>Here is how the app looks:</p>

<p><img src="https://amirsojoodi.github.io/files/Posts/VibeCoding/2025-12-29-Pomodoro.png" alt="Pomodoro App" /></p>

<p>The code generated by Gemini CLI can be found <a href="https://pastebin.com/tA5wrw2c">here</a>.</p>

<p>See this summary of the session:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">&gt;</span> /quit
╭─────────────────────────────────────────────────────────────────────────────────╮
│                                                                                 │
│  Agent powering down. Goodbye!                                                  │
│                                                                                 │
│  Interaction Summary                                                            │
│  Session ID:                 erased                                             │
│  Tool Calls:                 9 <span class="o">(</span> ✓ 8 x 1 <span class="o">)</span>                                      │
│  Success Rate:               88.9%                                              │
│  User Agreement:             88.9% <span class="o">(</span>9 reviewed<span class="o">)</span>                                 │
│  Code Changes:               +936 <span class="nt">-50</span>                                           │
│                                                                                 │
│  Performance                                                                    │
│  Wall Time:                  9h 58m 3s                                          │
│  Agent Active:               8m 21s                                             │
│    » API Time:               5m 41s <span class="o">(</span>68.2%<span class="o">)</span>                                     │
│    » Tool Time:              2m 39s <span class="o">(</span>31.8%<span class="o">)</span>                                     │
│                                                                                 │
│                                                                                 │
│  Model Usage                 Reqs   Input Tokens   Cache Reads  Output Tokens   │
│  ────────────────────────────────────────────────────────────────────────────   │
│  gemini-2.5-flash-lite         16         58,787             0          1,734   │
│  gemini-2.5-pro                22        192,728       211,514         19,758   │
│  gemini-2.5-flash               3         37,601        11,406            562   │
│                                                                                 │
│  Savings Highlight: 222,920 <span class="o">(</span>43.5%<span class="o">)</span> of input tokens were served from the cache, │
│  reducing costs.                                                                │
╰─────────────────────────────────────────────────────────────────────────────────╯
</code></pre></div></div>

<p>That’s it for now. I’m excited to do more with these agents! Let’s see how it goes.</p>]]></content><author><name>AmirHossein Sojoodi</name><email>amir.sojoodi@gmail.com</email></author><category term="Programming" /><category term="VibeCoding" /><category term="Agents" /><category term="Ubuntu" /><category term="WSL" /><summary type="html"><![CDATA[Over the past few years, I have been under a heavy workload of studies and work, and I couldn’t really explore the new AI tools coming out these days. Today, I was searching for a pomodoro timer app to help me focus on my work, and I thought: why not build one myself? I think it was Afshin who influenced me. Check out his game (brilliant idea and vibecoded!): Connect 3 Chess.]]></summary></entry><entry><title type="html">Migrate from Ubuntu 20.04 in WSL to 24.04 LTS</title><link href="https://amirsojoodi.github.io/posts/Migrate-from-Ubuntu-20-04-to-24-04-LTS" rel="alternate" type="text/html" title="Migrate from Ubuntu 20.04 in WSL to 24.04 LTS" /><published>2025-02-12T00:00:00-05:00</published><updated>2025-02-12T00:00:00-05:00</updated><id>https://amirsojoodi.github.io/posts/WSL</id><content type="html" xml:base="https://amirsojoodi.github.io/posts/Migrate-from-Ubuntu-20-04-to-24-04-LTS"><![CDATA[<p>I have been using Ubuntu 20.04 in WSL for a while. Recently, I decided to upgrade it to the latest LTS version, 24.04. There were two options: a fresh installation, or an in-place upgrade using the <code class="language-plaintext highlighter-rouge">do-release-upgrade</code> command. Although I first went with the messy in-place upgrade, I later changed my mind and opted for a fresh installation, especially because I would have had to do two major version upgrades (20.04 -&gt; 22.04 -&gt; 24.04). Anyway, here are the steps I followed.
I am writing this post mainly for future reference, in case I need to reinstall all these packages again.</p>

<h2 id="backup">Backup</h2>

<p>First things first.</p>

<p>Either back up your entire WSL instance using:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>wsl <span class="nt">--export</span> Ubuntu-20.04 ubuntu-20.04-backup.tar
</code></pre></div></div>

<p>Or just back up your home directory:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">cd</span> ~
<span class="nb">tar</span> <span class="nt">-czvf</span> ubuntu-home-backup.tar.gz <span class="nv">$HOME</span>
<span class="nb">mv </span>ubuntu-home-backup.tar.gz /mnt/c/Users/YourWindowsUsername/some/dir/
</code></pre></div></div>

<h2 id="fresh-installation-of-ubuntu-2404-lts-in-wsl">Fresh Installation of Ubuntu 24.04 LTS in WSL</h2>

<p>Install Ubuntu 24.04 LTS from the Microsoft Store, then open the installed instance once to finish the installation.</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span><span class="nb">cat</span> /etc/lsb-release
<span class="nv">DISTRIB_ID</span><span class="o">=</span>Ubuntu
<span class="nv">DISTRIB_RELEASE</span><span class="o">=</span>24.04
<span class="nv">DISTRIB_CODENAME</span><span class="o">=</span>noble
<span class="nv">DISTRIB_DESCRIPTION</span><span class="o">=</span><span class="s2">"Ubuntu 24.04.3 LTS"</span>
</code></pre></div></div>

<p>After installation, check the installed distros in PowerShell:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>wsl <span class="nt">--list</span> <span class="nt">--verbose</span>
</code></pre></div></div>

<p>Log in to the new Ubuntu 24.04 LTS instance and restore your home directory backup if you made one:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">cd</span> ~
<span class="nb">mv</span> /mnt/c/Users/YourWindowsUsername/some/dir/ubuntu-home-backup.tar.gz <span class="nb">.</span>
<span class="nb">tar</span> <span class="nt">-xzvf</span> ubuntu-home-backup.tar.gz
</code></pre></div></div>

<p>Check that everything is in place, especially hidden files like .bashrc, .vimrc, and the .ssh directory.</p>

<h2 id="installing-packages">Installing Packages</h2>

<p>Now, reinstall the packages you need.</p>
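
<p>If you remembered to capture a package list on the old instance before unregistering it, you can bulk-reinstall instead of going from memory (a sketch; <code class="language-plaintext highlighter-rouge">apt-mark showmanual</code> lists the manually installed packages):</p>

<pre><code class="language-bash"># On the old instance: save the names of manually installed packages
apt-mark showmanual &gt; manual-packages.txt
# On the new instance: reinstall them in one go
sudo apt update
xargs -a manual-packages.txt sudo apt install -y
</code></pre>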

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">sudo </span>apt update <span class="o">&amp;&amp;</span> <span class="nb">sudo </span>apt upgrade <span class="nt">-y</span>
<span class="nb">sudo </span>apt <span class="nb">install </span>build-essential cmake git vim wget curl htop net-tools unzip zip <span class="nt">-y</span>
<span class="c"># some of my frequently used packages:</span>
<span class="nb">sudo </span>apt <span class="nb">install </span>ninja-build gdb valgrind <span class="nt">-y</span>
<span class="nb">sudo </span>apt <span class="nb">install </span>aspell tldr colordiff tree lolcat neofetch fastfetch <span class="nt">-y</span>
<span class="nb">sudo </span>apt <span class="nb">install </span>zenity evince <span class="nt">-y</span>
</code></pre></div></div>

<h2 id="other-configurations">Other Configurations</h2>

<h3 id="cuda">CUDA</h3>

<p>If you need CUDA:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2404/x86_64/cuda-keyring_1.1-1_all.deb
<span class="nb">sudo </span>dpkg <span class="nt">-i</span> cuda-keyring_1.1-1_all.deb
<span class="nb">sudo </span>apt update
<span class="nb">sudo </span>apt <span class="nb">install </span>cuda-toolkit
<span class="c"># If you need the drivers too</span>
<span class="nb">sudo </span>apt <span class="nb">install </span>cuda-drivers
</code></pre></div></div>

<h3 id="latex">LaTeX</h3>

<p>I use texlive for LaTeX documents. To install it:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">sudo </span>apt <span class="nb">install </span>texlive-latex-recommended <span class="nt">-y</span>
<span class="c"># or </span>
<span class="nb">sudo </span>apt <span class="nb">install </span>texlive-latex-extra <span class="nt">-y</span>
<span class="c"># For full installation (7 GB+)</span>
<span class="nb">sudo </span>apt <span class="nb">install </span>texlive-full <span class="nt">-y</span>
</code></pre></div></div>

<h3 id="pyenv">Pyenv</h3>

<p>Install pyenv for managing python versions:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>curl https://pyenv.run | bash
</code></pre></div></div>

<h3 id="llvm">LLVM</h3>

<p>For more info, visit <a href="https://apt.llvm.org/">here</a>.</p>

<p>To install LLVM from the official apt repository, first add the repository key and source:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>wget <span class="nt">-O</span> - https://apt.llvm.org/llvm-snapshot.gpg.key | <span class="nb">sudo </span>apt-key add -
<span class="nb">sudo </span>apt-add-repository <span class="s2">"deb http://apt.llvm.org/noble/ llvm-toolchain-noble main"</span>
</code></pre></div></div>

<p>Add the following lines to /etc/apt/sources.list.d/llvm-toolchain-noble.list if not already added:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>deb http://apt.llvm.org/noble/ llvm-toolchain-noble main
deb-src http://apt.llvm.org/noble/ llvm-toolchain-noble main
<span class="c"># 20</span>
deb http://apt.llvm.org/noble/ llvm-toolchain-noble-20 main
deb-src http://apt.llvm.org/noble/ llvm-toolchain-noble-20 main
<span class="c"># 21</span>
deb http://apt.llvm.org/noble/ llvm-toolchain-noble-21 main
deb-src http://apt.llvm.org/noble/ llvm-toolchain-noble-21 main
</code></pre></div></div>

<p>Then install LLVM 20:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">sudo </span>apt update
<span class="nb">sudo </span>apt <span class="nb">install </span>clang-20 clang-tools-20 clang-20-doc libclang-common-20-dev <span class="se">\</span>
  libclang-20-dev libclang1-20 clang-format-20 python3-clang-20 clangd-20 clang-tidy-20 <span class="se">\</span>
  libllvm-20-ocaml-dev libllvm20 llvm-20 llvm-20-dev llvm-20-doc llvm-20-examples <span class="se">\</span>
  llvm-20-runtime libomp-20-dev lld-20
</code></pre></div></div>

<h3 id="fastfetch">fastfetch</h3>

<p>At the time of writing, fastfetch is not available in the Ubuntu 24.04 repositories, and some of its docs are inconsistent. Here is what worked for me:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">sudo </span>apt <span class="nb">install </span>git cmake build-essential libpci-dev libvulkan-dev libwayland-dev libxrandr-dev libxcb-randr0-dev libdconf-dev
git clone https://github.com/LinusDierheimer/fastfetch.git
<span class="nb">cd </span>fastfetch
<span class="c"># I changed the branch to master</span>
git checkout master
<span class="nb">mkdir </span>build <span class="o">&amp;&amp;</span> <span class="nb">cd </span>build
cmake ..
cmake <span class="nt">--build</span> <span class="nb">.</span> 
<span class="nb">sudo </span>cmake <span class="nt">--install</span> <span class="nb">.</span>
<span class="c"># Now you can run fastfetch by just typing:</span>
fastfetch
<span class="c"># or </span>
flashfetch
</code></pre></div></div>

<p>Bonus tip: to make <code class="language-plaintext highlighter-rouge">ll</code> (<code class="language-plaintext highlighter-rouge">ls -alF</code>) show the year in the timestamp:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">alias </span><span class="nv">ll</span><span class="o">=</span><span class="s1">'ls -alF --time-style=long-iso'</span>
</code></pre></div></div>

<h2 id="post-migration">Post-migration</h2>

<p>After everything is done, you can remove the old WSL instance if you made a full backup:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>wsl <span class="nt">--unregister</span> Ubuntu-20.04
</code></pre></div></div>

<p>Here are some tips on removing old packages and cleaning up to make the backup smaller:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">sudo </span>apt autoremove <span class="nt">--purge</span>
<span class="nb">sudo </span>apt autoclean

<span class="c"># Check which packages are large and remove if not needed </span>
<span class="c"># This command shows the 30 largest installed packages</span>
dpkg-query <span class="nt">-Wf</span><span class="o">=</span><span class="s1">'${Installed-Size}\t${Package}\n'</span> <span class="se">\</span>
  | <span class="nb">sort</span> <span class="nt">-n</span> | <span class="nb">tail</span> <span class="nt">-n</span> 30 <span class="se">\</span>
  | <span class="nb">awk</span> <span class="s1">'{printf "%10.2f MB %s\n", $1/1024, $2}'</span>

<span class="c"># You can search for a group of packages like this:</span>
dpkg-query <span class="nt">-Wf</span><span class="o">=</span><span class="s1">'${Installed-Size}\t${Package}\n'</span> | <span class="nb">grep </span>cuda

<span class="c"># You can also add the sizes of the packages from previous command to get total size</span>
dpkg-query <span class="nt">-Wf</span><span class="o">=</span><span class="s1">'${Installed-Size}\t${Package}\n'</span> | <span class="nb">grep </span>cuda <span class="se">\</span>
  | <span class="nb">awk</span> <span class="s1">'{sum += $1} END {printf "Total size: %.2f MB\n", sum/1024}'</span>
</code></pre></div></div>

<p>To remove old log files:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">sudo </span>journalctl <span class="nt">--vacuum-time</span><span class="o">=</span>7d
<span class="nb">sudo truncate</span> <span class="nt">-s</span> 0 /var/log/syslog
<span class="nb">sudo truncate</span> <span class="nt">-s</span> 0 /var/log/auth.log
<span class="nb">sudo rm</span> <span class="nt">-f</span> /var/log/<span class="k">*</span>.gz
<span class="nb">sudo rm</span> <span class="nt">-f</span> /var/log/<span class="k">*</span>.[0-9]
<span class="nb">sudo rm</span> <span class="nt">-f</span> /var/log/<span class="k">*</span>.[0-9].gz
<span class="nb">sudo rm</span> <span class="nt">-f</span> /var/log/Xorg.pid-<span class="k">*</span>.log
</code></pre></div></div>

<h3 id="compact-in-wsl">Compact in WSL</h3>

<p>To compact the WSL virtual disk after removing old files and packages, first exit all WSL instances and then run (in powershell with admin rights):</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>wsl <span class="nt">--terminate</span> Ubuntu-20.04
<span class="c"># or </span>
wsl <span class="nt">--shutdown</span>
diskpart 
<span class="c"># in diskpart prompt:</span>
<span class="k">select </span>vdisk <span class="nv">file</span><span class="o">=</span><span class="s2">"Path</span><span class="se">\t</span><span class="s2">o</span><span class="se">\v</span><span class="s2">disk</span><span class="se">\f</span><span class="s2">ile"</span>
<span class="c"># For me it is:</span>
<span class="k">select </span>vdisk <span class="nv">file</span><span class="o">=</span><span class="s2">"C:</span><span class="se">\U</span><span class="s2">sers</span><span class="se">\a</span><span class="s2">mirs</span><span class="se">\A</span><span class="s2">ppData</span><span class="se">\L</span><span class="s2">ocal</span><span class="se">\P</span><span class="s2">ackages</span><span class="se">\C</span><span class="s2">anonicalGroupLimited.Ubuntu20.04LTS_79rhkp1fndgsc</span><span class="se">\L</span><span class="s2">ocalState</span><span class="se">\e</span><span class="s2">xt4.vhdx"</span>
compact vdisk file
</code></pre></div></div>

<p>I also had several versions of CUDA, LLVM, ROCm, HIP, and other packages installed that I no longer needed. With these steps, I was able to reduce the storage footprint from around 250 GB to about 50 GB.</p>

<p>For now, I haven’t encountered any compatibility issues with Ubuntu 24.04 LTS in WSL. Everything seems to be working fine so far. If some issues arise, I will update this post accordingly.</p>]]></content><author><name>AmirHossein Sojoodi</name><email>amir.sojoodi@gmail.com</email></author><category term="Linux" /><category term="Ubuntu" /><category term="WSL" /><category term="Windows" /><category term="Tips" /><summary type="html"><![CDATA[I have been using Ubuntu 20.04 in WSL for a while. Recently, I decided to upgrade it to the latest LTS version, 24.04. There were two options: a fresh installation, or an in-place upgrade using the do-release-upgrade command. Although I first went with the messy in-place upgrade, I later changed my mind and opted for a fresh installation, especially because I would have had to do two major version upgrades (20.04 -&gt; 22.04 -&gt; 24.04). Anyway, here are the steps I followed. I am writing this post mainly for future reference, in case I need to reinstall all these packages again.]]></summary></entry><entry><title type="html">Some Tips to Release Binaries</title><link href="https://amirsojoodi.github.io/posts/Release-Binaries" rel="alternate" type="text/html" title="Some Tips to Release Binaries" /><published>2025-01-20T00:00:00-05:00</published><updated>2025-01-20T00:00:00-05:00</updated><id>https://amirsojoodi.github.io/posts/Releases</id><content type="html" xml:base="https://amirsojoodi.github.io/posts/Release-Binaries"><![CDATA[<p>There are several things to consider when releasing a binary. Here are some tips that I found useful.</p>

<h2 id="best-practices">Best Practices</h2>

<ol>
  <li>Releases are tied to <strong>tags</strong>, not branches.</li>
  <li>Versioning - Use <a href="https://semver.org/">Semantic Versioning</a>!</li>
  <li>Security - Sign the binaries and verify the signatures.</li>
  <li>Changelog - Keep a <a href="https://keepachangelog.com/en/1.0.0/">CHANGELOG.md</a> file.</li>
  <li>License - Include a <a href="https://choosealicense.com/">LICENSE</a> file.</li>
  <li>Documentation - Include a README.md that explains how to install and use the binary.</li>
  <li>CI/CD - Automate the build, test, and release process.</li>
</ol>
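
<p>For instance, a release is typically cut by pushing an annotated tag that follows Semantic Versioning (the version number here is illustrative):</p>

<pre><code class="language-bash"># Tag the release commit and push the tag to trigger the release pipeline
git tag -a v1.2.3 -m "Release v1.2.3"
git push origin v1.2.3
</code></pre>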

<h2 id="consider-signing-the-binary">Consider Signing the Binary</h2>

<p>Signing the binary is a good practice to ensure the integrity and authenticity of the binary. You can use GPG on Linux:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c"># If you don't already have a GPG key, generate one:</span>
<span class="nv">$ </span>gpg <span class="nt">--full-generate-key</span>
<span class="c"># Choose RSA and a key size (e.g., 4096).</span>
<span class="c"># Set an expiration date (or none for permanent).</span>
<span class="c"># Enter your name and email (match your GitHub email).</span>
</code></pre></div></div>

<p>Then, export the public key:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span>gpg <span class="nt">--armor</span> <span class="nt">--export</span> YOUR_EMAIL <span class="o">&gt;</span> public.key
<span class="c"># This will create a file named public.key.</span>
</code></pre></div></div>

<p>After that, sign the binary:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span>gpg <span class="nt">--detach-sign</span> <span class="nt">--armor</span> app-linux-x86_64.tar.gz
<span class="c"># a file named app-linux-x86_64.tar.gz.asc will be created.</span>

gpg <span class="nt">--detach-sign</span> <span class="nt">--armor</span> app-windows-x64.zip
<span class="c"># a file named app-windows-x64.zip.asc will be created.</span>
</code></pre></div></div>

<p>Then, you can upload the binaries and their signatures to GitHub releases.</p>

<ul>
  <li>app-linux-x86_64.tar.gz</li>
  <li>app-linux-x86_64.tar.gz.asc</li>
  <li>app-windows-x64.zip</li>
  <li>app-windows-x64.zip.asc</li>
</ul>

<p>Then, users can verify the binaries using the public key:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span>gpg <span class="nt">--import</span> public.key
<span class="nv">$ </span>gpg <span class="nt">--verify</span> app-linux-x86_64.tar.gz.asc app-linux-x86_64.tar.gz
<span class="nv">$ </span>gpg <span class="nt">--verify</span> app-windows-x64.zip.asc app-windows-x64.zip
<span class="c"># output should be something like: "Good signature from YOUR_NAME &lt;YOUR_EMAIL&gt;"</span>
</code></pre></div></div>

<h2 id="automating-the-release-process-with-github-actions">Automating the Release Process with GitHub Actions</h2>

<p>Add a GitHub Actions workflow (<code class="language-plaintext highlighter-rouge">.github/workflows/release.yml</code>):</p>

<div class="language-yaml highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="na">name</span><span class="pi">:</span> <span class="s">Release Binaries</span>

<span class="na">on</span><span class="pi">:</span>
  <span class="na">push</span><span class="pi">:</span>
    <span class="na">tags</span><span class="pi">:</span>
      <span class="pi">-</span> <span class="s1">'</span><span class="s">v*'</span>  <span class="c1"># Triggers on versioned tags like v1.0.0</span>

<span class="na">jobs</span><span class="pi">:</span>
  <span class="na">build</span><span class="pi">:</span>
    <span class="na">runs-on</span><span class="pi">:</span> <span class="s">${{ matrix.os }}</span>
    <span class="na">strategy</span><span class="pi">:</span>
      <span class="na">matrix</span><span class="pi">:</span>
        <span class="na">os</span><span class="pi">:</span> <span class="pi">[</span><span class="nv">ubuntu-latest</span><span class="pi">,</span> <span class="nv">windows-latest</span><span class="pi">]</span>

    <span class="na">steps</span><span class="pi">:</span>
      <span class="pi">-</span> <span class="na">name</span><span class="pi">:</span> <span class="s">Checkout code</span>
        <span class="na">uses</span><span class="pi">:</span> <span class="s">actions/checkout@v3</span>

      <span class="pi">-</span> <span class="na">name</span><span class="pi">:</span> <span class="s">Set up build environment</span>
        <span class="na">run</span><span class="pi">:</span> <span class="s">echo "Setting up..."</span>

      <span class="pi">-</span> <span class="na">name</span><span class="pi">:</span> <span class="s">Build</span>
        <span class="na">run</span><span class="pi">:</span> <span class="pi">|</span>
          <span class="s"># Replace with your build commands</span>
          <span class="s">echo "Building binary for ${{ matrix.os }}"</span>

      <span class="pi">-</span> <span class="na">name</span><span class="pi">:</span> <span class="s">Upload binaries</span>
        <span class="na">uses</span><span class="pi">:</span> <span class="s">actions/upload-artifact@v3</span>
        <span class="na">with</span><span class="pi">:</span>
          <span class="na">name</span><span class="pi">:</span> <span class="s">app-${{ matrix.os }}</span>
          <span class="na">path</span><span class="pi">:</span> <span class="s">path/to/binary</span>

  <span class="na">release</span><span class="pi">:</span>
    <span class="na">needs</span><span class="pi">:</span> <span class="s">build</span>
    <span class="na">runs-on</span><span class="pi">:</span> <span class="s">ubuntu-latest</span>
    <span class="na">steps</span><span class="pi">:</span>
      <span class="pi">-</span> <span class="na">name</span><span class="pi">:</span> <span class="s">Download artifacts</span>
        <span class="na">uses</span><span class="pi">:</span> <span class="s">actions/download-artifact@v3</span>
        <span class="na">with</span><span class="pi">:</span>
          <span class="na">path</span><span class="pi">:</span> <span class="s">./artifacts</span>

      <span class="pi">-</span> <span class="na">name</span><span class="pi">:</span> <span class="s">Create GitHub Release</span>
        <span class="na">uses</span><span class="pi">:</span> <span class="s">softprops/action-gh-release@v2</span>
        <span class="na">with</span><span class="pi">:</span>
          <span class="na">files</span><span class="pi">:</span> <span class="s">./artifacts/**</span>
          <span class="na">tag_name</span><span class="pi">:</span> <span class="s">${{ github.ref_name }}</span>
</code></pre></div></div>

<h3 id="automating-the-signing-in-github-actions">Automating the Signing in GitHub Actions</h3>

<p>You can automate the signing process in GitHub Actions, assuming the exported private key is stored as a repository secret. Here are the relevant steps of an example workflow:</p>

<div class="language-yaml highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="pi">-</span> <span class="na">name</span><span class="pi">:</span> <span class="s">Import GPG Key</span>
  <span class="na">env</span><span class="pi">:</span>
    <span class="c1"># Exported private key stored as a repository secret</span>
    <span class="na">GPG_PRIVATE_KEY</span><span class="pi">:</span> <span class="s">${{ secrets.GPG_PRIVATE_KEY }}</span>
  <span class="na">run</span><span class="pi">:</span> <span class="pi">|</span>
    <span class="s">echo "$GPG_PRIVATE_KEY" | gpg --batch --import</span>
    <span class="s">echo "trusted-key $(gpg --list-keys --with-colons | grep pub | cut -d: -f5)" &gt;&gt; ~/.gnupg/gpg.conf</span>

<span class="pi">-</span> <span class="na">name</span><span class="pi">:</span> <span class="s">Sign Binary</span>
  <span class="na">run</span><span class="pi">:</span> <span class="s">gpg --detach-sign --armor myapp-linux-x86_64.tar.gz</span>
</code></pre></div></div>

<h3 id="alternative-to-gpg">Alternative to GPG</h3>

<p>If you’re distributing via containers or want an alternative to GPG, you can use <a href="https://github.com/sigstore/cosign">Sigstore Cosign</a>.</p>

<h3 id="code-signing-for-windows">Code Signing for Windows</h3>

<p>I haven’t done this myself, but you can sign the Windows binaries using <a href="https://docs.microsoft.com/en-us/windows/win32/seccrypto/signtool">SignTool</a>.</p>]]></content><author><name>AmirHossein Sojoodi</name><email>amir.sojoodi@gmail.com</email></author><category term="GPG" /><category term="Linux" /><category term="Release" /><category term="Tips" /><summary type="html"><![CDATA[There are several things to consider when releasing a binary. Here are some tips that I found useful.]]></summary></entry><entry><title type="html">Utilize PTX Just-In-Time (JIT) Compilation in CUDA</title><link href="https://amirsojoodi.github.io/posts/JIT-PTX/" rel="alternate" type="text/html" title="Utilize PTX Just-In-Time (JIT) Compilation in CUDA" /><published>2024-10-22T00:00:00-04:00</published><updated>2024-10-22T00:00:00-04:00</updated><id>https://amirsojoodi.github.io/posts/JIT-PTX</id><content type="html" xml:base="https://amirsojoodi.github.io/posts/JIT-PTX/"><![CDATA[<p>In this post I’ve written about how to utilize PTX Just-In-Time (JIT) compilation in CUDA. PTX is a low-level assembly-like language that is used to represent the GPU code. The PTX code is then compiled to the machine code by the NVIDIA driver at runtime. This process is called Just-In-Time (JIT) compilation. But before I write about how to use PTX JIT compilation, I’ll provide some background on why you might want to use it.</p>

<h2 id="background-scenario">Background Scenario</h2>

<p>In this scenario, you may want to load a CUDA kernel at runtime as a CUDA module, extract a CUDA function from it, and then query information about it, such as the number of registers used per thread, the shared memory usage, and so on. Or you may want to intelligently select the number of blocks and threads to optimize SM occupancy, which is necessary for valid inter-block synchronization via cooperative groups.</p>

<p>Furthermore, if the kernel is defined in a separate <code class="language-plaintext highlighter-rouge">.cu</code> file and you are querying this information from C or C++ code, it makes more sense to compile the kernel separately to a PTX or FATBIN file and load it at runtime.</p>

<h2 id="loading-the-cuda-module">Loading the CUDA module</h2>

<p>To create a PTX file or a FATBIN file, you can use the following command:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c"># Create a PTX file</span>
nvcc <span class="nt">-ptx</span> <span class="nt">-o</span> kernel.ptx kernel.cu
<span class="c"># Create a FATBIN file</span>
nvcc <span class="nt">-fatbin</span> <span class="nt">-o</span> kernel.fatbin kernel.cu
</code></pre></div></div>
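For reference, the kernel source being compiled above might look like the following (a hypothetical example; your kernel will differ). The `extern "C"` matters: it keeps the symbol name unmangled, so the function can later be looked up under the plain name `"kernel"`:

```cpp
// kernel.cu -- compiled to PTX/FATBIN with the nvcc commands above.
// extern "C" prevents C++ name mangling, so
// cuModuleGetFunction(&function, module, "kernel") can find it.
extern "C" __global__ void kernel(float *data) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  data[i] = 2.0f * data[i];
}
```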

<p>Then you can load the CUDA module at runtime with something like this (CUDA Driver API):</p>

<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cp">#include</span> <span class="cpf">&lt;cuda.h&gt;</span><span class="cp">
#include</span> <span class="cpf">&lt;iostream&gt;</span><span class="cp">
</span>
<span class="kt">int</span> <span class="nf">main</span><span class="p">()</span> <span class="p">{</span>
  
  <span class="n">cuInit</span><span class="p">(</span><span class="mi">0</span><span class="p">);</span>

  <span class="c1">// select the first device</span>
  <span class="n">CUdevice</span> <span class="n">device</span><span class="p">;</span>
  <span class="n">cuDeviceGet</span><span class="p">(</span><span class="o">&amp;</span><span class="n">device</span><span class="p">,</span> <span class="mi">0</span><span class="p">);</span>

  <span class="n">CUcontext</span> <span class="n">context</span><span class="p">;</span>
  <span class="n">cuCtxCreate</span><span class="p">(</span><span class="o">&amp;</span><span class="n">context</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="n">device</span><span class="p">);</span>

  <span class="n">CUmodule</span> <span class="n">module</span><span class="p">;</span>
  <span class="n">CUfunction</span> <span class="n">function</span><span class="p">;</span>
  <span class="n">CUresult</span> <span class="n">result</span><span class="p">;</span>

  <span class="n">result</span> <span class="o">=</span> <span class="n">cuModuleLoad</span><span class="p">(</span><span class="o">&amp;</span><span class="n">module</span><span class="p">,</span> <span class="s">"kernel.ptx"</span><span class="p">);</span>
  <span class="k">if</span> <span class="p">(</span><span class="n">result</span> <span class="o">!=</span> <span class="n">CUDA_SUCCESS</span><span class="p">)</span> <span class="p">{</span>
    <span class="n">std</span><span class="o">::</span><span class="n">cerr</span> <span class="o">&lt;&lt;</span> <span class="s">"Failed to load the module."</span> <span class="o">&lt;&lt;</span> <span class="n">std</span><span class="o">::</span><span class="n">endl</span><span class="p">;</span>
    <span class="k">return</span> <span class="mi">1</span><span class="p">;</span>
  <span class="p">}</span>

  <span class="n">result</span> <span class="o">=</span> <span class="n">cuModuleGetFunction</span><span class="p">(</span><span class="o">&amp;</span><span class="n">function</span><span class="p">,</span> <span class="n">module</span><span class="p">,</span> <span class="s">"kernel"</span><span class="p">);</span>
  <span class="k">if</span> <span class="p">(</span><span class="n">result</span> <span class="o">!=</span> <span class="n">CUDA_SUCCESS</span><span class="p">)</span> <span class="p">{</span>
    <span class="n">std</span><span class="o">::</span><span class="n">cerr</span> <span class="o">&lt;&lt;</span> <span class="s">"Failed to get the function."</span> <span class="o">&lt;&lt;</span> <span class="n">std</span><span class="o">::</span><span class="n">endl</span><span class="p">;</span>
    <span class="k">return</span> <span class="mi">1</span><span class="p">;</span>
  <span class="p">}</span>

  <span class="c1">// Do something with the function</span>
  <span class="c1">// ...</span>
  <span class="k">return</span> <span class="mi">0</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>Then build the main source with:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>nvcc <span class="nt">-o</span> main main.cu <span class="nt">-lcuda</span>
</code></pre></div></div>
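Once `function` is loaded, the queries mentioned in the background section look roughly like this. It is a sketch continuing `main` from the example above (Driver API calls only; error checking omitted, and it needs a GPU to run):

```cpp
// Continues main() above: `function` is the loaded CUfunction.

// Per-thread register count and static shared memory of the kernel.
int numRegs = 0, staticSmem = 0;
cuFuncGetAttribute(&numRegs, CU_FUNC_ATTRIBUTE_NUM_REGS, function);
cuFuncGetAttribute(&staticSmem, CU_FUNC_ATTRIBUTE_SHARED_SIZE_BYTES, function);
std::cout << "Registers/thread: " << numRegs
          << ", static shared memory: " << staticSmem << " bytes\n";

// How many blocks of 256 threads can be resident per SM -- the number
// you need to size a grid for valid cooperative-groups grid sync.
int blocksPerSM = 0;
cuOccupancyMaxActiveBlocksPerMultiprocessor(&blocksPerSM, function,
                                            /*blockSize=*/256,
                                            /*dynamicSMemSize=*/0);
std::cout << "Max active blocks per SM at 256 threads: " << blocksPerSM << "\n";
```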

<h3 id="thoughts-and-improvements">Thoughts and Improvements</h3>

<p>All in all, it’s a straightforward process. However, there are some issues:</p>

<ul>
  <li>The build process is a bit complicated, especially if you are using CMake.</li>
  <li>The FATBIN/PTX file must be found at the expected path at runtime, which is not always convenient.</li>
</ul>

<p>So, why not store the PTX code directly in the source file itself? This way, you can avoid the extra build step and the file management.</p>

<p>Just a reminder: <strong>the JIT process doesn’t accept actual CUDA code, only PTX code</strong>. So, you need to convert the CUDA code to PTX first. (It took me two hours to figure this out!) Here is how you can do it:</p>

<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cp">#include</span> <span class="cpf">&lt;cuda.h&gt;</span><span class="cp">
#include</span> <span class="cpf">&lt;iostream&gt;</span><span class="cp">
</span>
<span class="c1">// PTX code generated by:</span>
<span class="c1">// nvcc -ptx -o kernel.ptx kernel.cu</span>
<span class="k">const</span> <span class="kt">char</span><span class="o">*</span> <span class="n">kernel</span> <span class="o">=</span> <span class="s">R"(
  .version 6.5
  .target sm_70
  .address_size 64

  .visible .entry kernel(
    .param .u64 kernel_param_0
  )
  {
    // Kernel code here
  }
)"</span><span class="p">;</span>

<span class="kt">int</span> <span class="nf">main</span><span class="p">()</span> <span class="p">{</span>
  
  <span class="n">cuInit</span><span class="p">(</span><span class="mi">0</span><span class="p">);</span>

  <span class="c1">// select the first device</span>
  <span class="n">CUdevice</span> <span class="n">device</span><span class="p">;</span>
  <span class="n">cuDeviceGet</span><span class="p">(</span><span class="o">&amp;</span><span class="n">device</span><span class="p">,</span> <span class="mi">0</span><span class="p">);</span>

  <span class="n">CUcontext</span> <span class="n">context</span><span class="p">;</span>
  <span class="n">cuCtxCreate</span><span class="p">(</span><span class="o">&amp;</span><span class="n">context</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="n">device</span><span class="p">);</span>

  <span class="n">CUmodule</span> <span class="n">module</span><span class="p">;</span>
  <span class="n">CUfunction</span> <span class="n">function</span><span class="p">;</span>
  <span class="n">CUresult</span> <span class="n">result</span><span class="p">;</span>

  <span class="n">result</span> <span class="o">=</span> <span class="n">cuModuleLoadData</span><span class="p">(</span><span class="o">&amp;</span><span class="n">module</span><span class="p">,</span> <span class="n">kernel</span><span class="p">);</span>
  <span class="k">if</span> <span class="p">(</span><span class="n">result</span> <span class="o">!=</span> <span class="n">CUDA_SUCCESS</span><span class="p">)</span> <span class="p">{</span>
    <span class="n">std</span><span class="o">::</span><span class="n">cerr</span> <span class="o">&lt;&lt;</span> <span class="s">"Failed to load the module."</span> <span class="o">&lt;&lt;</span> <span class="n">std</span><span class="o">::</span><span class="n">endl</span><span class="p">;</span>
    <span class="k">return</span> <span class="mi">1</span><span class="p">;</span>
  <span class="p">}</span>

  <span class="n">result</span> <span class="o">=</span> <span class="n">cuModuleGetFunction</span><span class="p">(</span><span class="o">&amp;</span><span class="n">function</span><span class="p">,</span> <span class="n">module</span><span class="p">,</span> <span class="s">"kernel"</span><span class="p">);</span>
  <span class="k">if</span> <span class="p">(</span><span class="n">result</span> <span class="o">!=</span> <span class="n">CUDA_SUCCESS</span><span class="p">)</span> <span class="p">{</span>
    <span class="n">std</span><span class="o">::</span><span class="n">cerr</span> <span class="o">&lt;&lt;</span> <span class="s">"Failed to get the function."</span> <span class="o">&lt;&lt;</span> <span class="n">std</span><span class="o">::</span><span class="n">endl</span><span class="p">;</span>
    <span class="k">return</span> <span class="mi">1</span><span class="p">;</span>
  <span class="p">}</span>

  <span class="c1">// Do something with the function</span>
  <span class="c1">// ...</span>
  <span class="k">return</span> <span class="mi">0</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<h3 id="add-error-management">Add Error Management</h3>

<p>Sometimes the PTX code may contain errors, or the loading process may fail for some other reason. To handle this, you can pass <code class="language-plaintext highlighter-rouge">CUjit_option</code> values to get more information about the error. Here is an example:</p>

<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cp">#include</span> <span class="cpf">&lt;cuda.h&gt;</span><span class="cp">
#include</span> <span class="cpf">&lt;iostream&gt;</span><span class="cp">
</span>
<span class="c1">// PTX code generated by:</span>
<span class="c1">// nvcc -ptx -o kernel.ptx kernel.cu</span>
<span class="k">const</span> <span class="kt">char</span><span class="o">*</span> <span class="n">kernel</span> <span class="o">=</span> <span class="s">R"(
  .version 6.5
  .target sm_70
  .address_size 64

  .visible .entry kernel(
    .param .u64 kernel_param_0
  )
  {
    // Kernel code here
  }
)"</span><span class="p">;</span>

<span class="kt">int</span> <span class="nf">main</span><span class="p">()</span> <span class="p">{</span>

  <span class="n">cuInit</span><span class="p">(</span><span class="mi">0</span><span class="p">);</span>

  <span class="c1">// select the first device</span>
  <span class="n">CUdevice</span> <span class="n">device</span><span class="p">;</span>
  <span class="n">cuDeviceGet</span><span class="p">(</span><span class="o">&amp;</span><span class="n">device</span><span class="p">,</span> <span class="mi">0</span><span class="p">);</span>

  <span class="n">CUcontext</span> <span class="n">context</span><span class="p">;</span>
  <span class="n">cuCtxCreate</span><span class="p">(</span><span class="o">&amp;</span><span class="n">context</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="n">device</span><span class="p">);</span>

  <span class="n">CUmodule</span> <span class="n">module</span><span class="p">;</span>
  <span class="n">CUfunction</span> <span class="n">function</span><span class="p">;</span>
  <span class="n">CUresult</span> <span class="n">result</span><span class="p">;</span>

  <span class="kt">int</span> <span class="n">logBufferSize</span> <span class="o">=</span> <span class="mi">1024</span><span class="p">;</span>
  <span class="kt">char</span> <span class="n">infoLogBuffer</span><span class="p">[</span><span class="mi">1024</span><span class="p">];</span>
  <span class="kt">char</span> <span class="n">errorLogBuffer</span><span class="p">[</span><span class="mi">1024</span><span class="p">];</span>
  <span class="n">CUjit_option</span> <span class="n">options</span><span class="p">[]</span> <span class="o">=</span> <span class="p">{</span><span class="n">CU_JIT_INFO_LOG_BUFFER_SIZE_BYTES</span><span class="p">,</span> <span class="n">CU_JIT_INFO_LOG_BUFFER</span><span class="p">,</span>
                            <span class="n">CU_JIT_ERROR_LOG_BUFFER_SIZE_BYTES</span><span class="p">,</span> <span class="n">CU_JIT_ERROR_LOG_BUFFER</span><span class="p">};</span>
  <span class="kt">void</span><span class="o">*</span> <span class="n">optionValues</span><span class="p">[]</span> <span class="o">=</span> <span class="p">{(</span><span class="kt">void</span><span class="o">*</span><span class="p">)(</span><span class="kt">uintptr_t</span><span class="p">)</span><span class="n">logBufferSize</span><span class="p">,</span> <span class="n">infoLogBuffer</span><span class="p">,</span>
                          <span class="p">(</span><span class="kt">void</span><span class="o">*</span><span class="p">)(</span><span class="kt">uintptr_t</span><span class="p">)</span><span class="n">logBufferSize</span><span class="p">,</span> <span class="n">errorLogBuffer</span><span class="p">};</span> 

  <span class="n">result</span> <span class="o">=</span> <span class="n">cuModuleLoadDataEx</span><span class="p">(</span><span class="o">&amp;</span><span class="n">module</span><span class="p">,</span> <span class="n">kernel</span><span class="p">,</span> <span class="mi">4</span><span class="p">,</span> <span class="n">options</span><span class="p">,</span> <span class="n">optionValues</span><span class="p">);</span>
  <span class="k">if</span> <span class="p">(</span><span class="n">result</span> <span class="o">!=</span> <span class="n">CUDA_SUCCESS</span><span class="p">)</span> <span class="p">{</span>
    <span class="n">std</span><span class="o">::</span><span class="n">cerr</span> <span class="o">&lt;&lt;</span> <span class="s">"Failed to load the module."</span> <span class="o">&lt;&lt;</span> <span class="n">std</span><span class="o">::</span><span class="n">endl</span><span class="p">;</span>
    <span class="n">std</span><span class="o">::</span><span class="n">cerr</span> <span class="o">&lt;&lt;</span> <span class="s">"CUDA Driver API error = "</span> <span class="o">&lt;&lt;</span> <span class="n">result</span> <span class="o">&lt;&lt;</span> <span class="n">std</span><span class="o">::</span><span class="n">endl</span><span class="p">;</span>
    <span class="n">std</span><span class="o">::</span><span class="n">cerr</span> <span class="o">&lt;&lt;</span> <span class="s">"Info Log: "</span> <span class="o">&lt;&lt;</span> <span class="n">infoLogBuffer</span> <span class="o">&lt;&lt;</span> <span class="n">std</span><span class="o">::</span><span class="n">endl</span><span class="p">;</span>
    <span class="n">std</span><span class="o">::</span><span class="n">cerr</span> <span class="o">&lt;&lt;</span> <span class="s">"Error Log: "</span> <span class="o">&lt;&lt;</span> <span class="n">errorLogBuffer</span> <span class="o">&lt;&lt;</span> <span class="n">std</span><span class="o">::</span><span class="n">endl</span><span class="p">;</span>
    <span class="k">return</span> <span class="mi">1</span><span class="p">;</span>
  <span class="p">}</span>

  <span class="n">result</span> <span class="o">=</span> <span class="n">cuModuleGetFunction</span><span class="p">(</span><span class="o">&amp;</span><span class="n">function</span><span class="p">,</span> <span class="n">module</span><span class="p">,</span> <span class="s">"kernel"</span><span class="p">);</span>
  <span class="k">if</span> <span class="p">(</span><span class="n">result</span> <span class="o">!=</span> <span class="n">CUDA_SUCCESS</span><span class="p">)</span> <span class="p">{</span>
    <span class="n">std</span><span class="o">::</span><span class="n">cerr</span> <span class="o">&lt;&lt;</span> <span class="s">"Failed to get the function."</span> <span class="o">&lt;&lt;</span> <span class="n">std</span><span class="o">::</span><span class="n">endl</span><span class="p">;</span>
    <span class="k">return</span> <span class="mi">1</span><span class="p">;</span>
  <span class="p">}</span>

  <span class="c1">// Do something with the function</span>
  <span class="c1">// ...</span>
  <span class="k">return</span> <span class="mi">0</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>One last suggestion:
If your kernel is simple and doesn’t require any architecture-specific optimization, you can generate the PTX for an old architecture like <code class="language-plaintext highlighter-rouge">sm_50</code>, so that it can run on almost any GPU.</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>nvcc <span class="nt">-ptx</span> <span class="nt">-o</span> kernel.ptx <span class="nt">-arch</span><span class="o">=</span>sm_50 kernel.cu
</code></pre></div></div>

<p>P.S. Pay attention to the entry <code class="language-plaintext highlighter-rouge">.version 6.5</code> in the PTX code. If your target system’s PTX assembler is old, you’ll get a runtime error. You may want to edit that field manually, as I didn’t find a way to set it automatically.</p>]]></content><author><name>AmirHossein Sojoodi</name><email>amir.sojoodi@gmail.com</email></author><category term="Programming" /><category term="CUDA" /><category term="JIT" /><category term="PTX" /><category term="FATBIN" /><summary type="html"><![CDATA[In this post I’ve written about how to utilize PTX Just-In-Time (JIT) compilation in CUDA. PTX is a low-level assembly-like language that is used to represent the GPU code. The PTX code is then compiled to the machine code by the NVIDIA driver at runtime. This process is called Just-In-Time (JIT) compilation. But before I write about how to use PTX JIT compilation, I’ll provide some background on why you might want to use it.]]></summary></entry><entry><title type="html">Collection of my dotfiles (public version)</title><link href="https://amirsojoodi.github.io/posts/dotfiles" rel="alternate" type="text/html" title="Collection of my dotfiles (public version)" /><published>2024-10-15T00:00:00-04:00</published><updated>2024-10-15T00:00:00-04:00</updated><id>https://amirsojoodi.github.io/posts/Dotfiles</id><content type="html" xml:base="https://amirsojoodi.github.io/posts/dotfiles"><![CDATA[<p>This is a collection of my dotfiles that I use on my Linux system. I have written about them in various posts, but I thought I could put them all in one place. You can check out the GitHub repository <a href="https://github.com/amirsojoodi/dotfiles-public">here</a>, too.</p>

<h2 id="file-tree">File Tree</h2>

<p>Click on each file to see the content.</p>

<ol>
  <li><a href="https://github.com/amirsojoodi/dotfiles-public/blob/main/.bashrc">.bashrc</a></li>
  <li><a href="https://github.com/amirsojoodi/dotfiles-public/blob/main/.clang-format">.clang-format</a></li>
  <li><a href="https://github.com/amirsojoodi/dotfiles-public/blob/main/.git-prompt.sh">.git-prompt.sh</a></li>
  <li><a href="https://github.com/amirsojoodi/dotfiles-public/blob/main/.gitconfig">.gitconfig</a></li>
  <li><a href="https://github.com/amirsojoodi/dotfiles-public/blob/main/.inputrc">.inputrc</a></li>
  <li><a href="https://github.com/amirsojoodi/dotfiles-public/blob/main/.ssh/config">config</a></li>
  <li><a href="https://github.com/amirsojoodi/dotfiles-public/blob/main/.tmux.conf">.tmux.conf</a></li>
  <li><a href="https://github.com/amirsojoodi/dotfiles-public/blob/main/.vimrc">.vimrc</a></li>
  <li>Vscode <a href="https://github.com/amirsojoodi/dotfiles-public/blob/main/.vscode/sftp.json">sftp.json</a></li>
  <li>Latex <a href="https://github.com/amirsojoodi/dotfiles-public/blob/main/Latex/defaultSettings.yaml">defaultSettings.yaml</a></li>
  <li>Latex <a href="https://github.com/amirsojoodi/dotfiles-public/blob/main/Latex/indentconfig.yaml">indentconfig.yaml</a></li>
</ol>

<h2 id="some-notes">Some notes</h2>

<table>
  <thead>
    <tr>
      <th>File</th>
      <th>Location</th>
      <th>Description</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">.bashrc</code></td>
      <td><code class="language-plaintext highlighter-rouge">~/</code></td>
      <td>Bash configuration file</td>
    </tr>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">.clang-format</code></td>
      <td><code class="language-plaintext highlighter-rouge">/path/to/a/project/</code></td>
      <td>Configuration file for clang-format</td>
    </tr>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">.git-prompt.sh</code></td>
      <td><code class="language-plaintext highlighter-rouge">~/</code></td>
      <td>Git prompt configuration file which is used in <code class="language-plaintext highlighter-rouge">.bashrc</code></td>
    </tr>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">.gitconfig</code></td>
      <td><code class="language-plaintext highlighter-rouge">~/</code></td>
      <td>Git configuration file</td>
    </tr>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">.inputrc</code></td>
      <td><code class="language-plaintext highlighter-rouge">~/</code></td>
      <td>Readline configuration file</td>
    </tr>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">.ssh/config</code></td>
      <td><code class="language-plaintext highlighter-rouge">~/.ssh/</code></td>
      <td>SSH configuration file</td>
    </tr>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">.tmux.conf</code></td>
      <td><code class="language-plaintext highlighter-rouge">~/</code></td>
      <td>Tmux configuration file</td>
    </tr>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">.vimrc</code></td>
      <td><code class="language-plaintext highlighter-rouge">~/</code></td>
      <td>Vim configuration file</td>
    </tr>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">.vscode/sftp.json</code></td>
      <td><code class="language-plaintext highlighter-rouge">/usually/a/repo/.vscode/</code></td>
      <td>A configuration file for VSCode SFTP extension</td>
    </tr>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">Latex/defaultSettings.yaml</code></td>
      <td><code class="language-plaintext highlighter-rouge">/the/path/in/indentconfig.yaml</code></td>
      <td>Default settings for latexindent program</td>
    </tr>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">Latex/indentconfig.yaml</code></td>
      <td><code class="language-plaintext highlighter-rouge">~/</code></td>
      <td>Configuration file for latexindent program</td>
    </tr>
  </tbody>
</table>]]></content><author><name>AmirHossein Sojoodi</name><email>amir.sojoodi@gmail.com</email></author><category term="Bash" /><category term="Git" /><category term="Linux" /><category term="SSH" /><category term="Tmux" /><category term="Vim" /><category term="VSCode" /><summary type="html"><![CDATA[This is a collection of my dotfiles that I use on my Linux system. I have written about them in various posts, but I thought I can put them all in one place. You can checkout the github repository here, too.]]></summary></entry><entry><title type="html">Tmux Tips and Tricks</title><link href="https://amirsojoodi.github.io/posts/Tmux-Tips-and-Tricks" rel="alternate" type="text/html" title="Tmux Tips and Tricks" /><published>2024-09-27T00:00:00-04:00</published><updated>2024-09-27T00:00:00-04:00</updated><id>https://amirsojoodi.github.io/posts/tmux</id><content type="html" xml:base="https://amirsojoodi.github.io/posts/Tmux-Tips-and-Tricks"><![CDATA[<p>(If you are already familiar with <code class="language-plaintext highlighter-rouge">tmux</code>, and you are looking for a simple <code class="language-plaintext highlighter-rouge">.tmux.conf</code>, go to the end of this post.)</p>

<p>It’s been a long time since I wanted to start using <code class="language-plaintext highlighter-rouge">tmux</code> to see how it could be helpful, but honestly, I didn’t feel the need until just recently, when I had to run a long-running process on a remote server and didn’t want to keep my terminal open the whole time. So, I decided to give <code class="language-plaintext highlighter-rouge">tmux</code> a try, and I’m glad I did! It’s a powerful tool that can help you manage multiple terminal sessions in a single window, and more. Here are some tips and tricks that I found useful.</p>

<p>For getting started, you can check the tmux <a href="https://github.com/tmux/tmux/wiki/Getting-Started">official documentation</a>. This <a href="https://tmuxcheatsheet.com/">cheat sheet</a> is also good. For an <em>awesome</em> list of <code class="language-plaintext highlighter-rouge">tmux</code> resources, you can check <a href="https://github.com/rothgar/awesome-tmux">this</a>.</p>

<h2 id="about-tmux">About <code class="language-plaintext highlighter-rouge">tmux</code></h2>

<p><code class="language-plaintext highlighter-rouge">tmux</code> is a terminal multiplexer that allows you to run multiple terminal sessions in a single window. It’s similar to <code class="language-plaintext highlighter-rouge">screen</code>, but it has some additional features that make it more powerful and easier to use. With <code class="language-plaintext highlighter-rouge">tmux</code>, you can create multiple windows and panes, detach from a session and reattach later, and share sessions with other users.</p>

<p><code class="language-plaintext highlighter-rouge">tmux</code> handles everything through a client-server model. The server runs in the background and manages all the sessions, windows, and panes. The client is the terminal that you interact with. You can have multiple clients connected to the same server.</p>

<p>To install <code class="language-plaintext highlighter-rouge">tmux</code> on your system:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c"># Ubuntu</span>
<span class="nb">sudo </span>apt <span class="nb">install </span>tmux
</code></pre></div></div>

<h2 id="using-tmux-basics">Using tmux basics</h2>

<p>This section covers the basic shell commands; the following sections list the shortcuts and commands you can use inside <code class="language-plaintext highlighter-rouge">tmux</code>. I have also skipped the <code class="language-plaintext highlighter-rouge">tmux commands</code> entered at the status bar (similar to <code class="language-plaintext highlighter-rouge">vim</code> commands); if you want to see those, visit the cheat sheet mentioned above. You can enter <code class="language-plaintext highlighter-rouge">tmux</code> command mode by pressing <code class="language-plaintext highlighter-rouge">Ctrl+b :</code>.</p>

<p>One useful command is <code class="language-plaintext highlighter-rouge">: set mouse on</code>, which enables the mouse in <code class="language-plaintext highlighter-rouge">tmux</code>. You can enable it for all sessions by adding <code class="language-plaintext highlighter-rouge">-g</code> to the command (<code class="language-plaintext highlighter-rouge">tmux set -g mouse on</code>). To disable the mouse, use <code class="language-plaintext highlighter-rouge">: set mouse off</code>.</p>
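<p>If you toggle the mouse often, you can bind a key for it instead; in fairly recent <code class="language-plaintext highlighter-rouge">tmux</code> versions, <code class="language-plaintext highlighter-rouge">set</code> without a value flips an on/off option (a sketch for your <code class="language-plaintext highlighter-rouge">.tmux.conf</code>):</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Toggle mouse support with prefix + m
# (set without a value toggles a boolean option)
bind m set -g mouse
</code></pre></div></div>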

<p>Ok, back to the basics:</p>

<ul>
  <li>List all sessions with <code class="language-plaintext highlighter-rouge">tmux ls</code></li>
  <li>Start a new session with a name: <code class="language-plaintext highlighter-rouge">tmux new -s mySession</code></li>
  <li>Attach to a session:</li>
</ul>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>tmux attach <span class="nt">-t</span> mySession
<span class="c"># or short</span>
tmux a <span class="nt">-t</span> mySession
<span class="c"># Attach to the last session</span>
tmux a
</code></pre></div></div>

<ul>
  <li>Detach from a session by running <code class="language-plaintext highlighter-rouge">tmux detach</code> or pressing <code class="language-plaintext highlighter-rouge">Ctrl+b d</code> inside the session.</li>
  <li>Kill a session with <code class="language-plaintext highlighter-rouge">tmux kill-session -t mySession</code>.</li>
</ul>

<h2 id="inside-tmux">Inside tmux</h2>

<p>All commands in <code class="language-plaintext highlighter-rouge">tmux</code> start with <code class="language-plaintext highlighter-rouge">Ctrl+b</code> followed by another key (unless the prefix key is changed in the configuration file). To see the full list of commands, press <code class="language-plaintext highlighter-rouge">Ctrl+b</code> followed by <code class="language-plaintext highlighter-rouge">?</code>. The documentation uses the format <code class="language-plaintext highlighter-rouge">C-b &lt;command&gt;</code> to represent the <code class="language-plaintext highlighter-rouge">Ctrl+b</code> key combination. Likewise, <code class="language-plaintext highlighter-rouge">M-&lt;key&gt;</code> represents the <code class="language-plaintext highlighter-rouge">Alt+&lt;key&gt;</code> combination, and <code class="language-plaintext highlighter-rouge">S-&lt;key&gt;</code> the <code class="language-plaintext highlighter-rouge">Shift+&lt;key&gt;</code> combination.</p>
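<p>To explore the bindings further, <code class="language-plaintext highlighter-rouge">list-keys</code> is handy (a sketch; you can run these from any shell, or after <code class="language-plaintext highlighter-rouge">Ctrl+b :</code> without the leading <code class="language-plaintext highlighter-rouge">tmux</code>):</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code># List every binding, in C-b notation
tmux list-keys
# List only the bindings reachable after the prefix key
tmux list-keys -T prefix
# List the copy-mode bindings (vi flavour)
tmux list-keys -T copy-mode-vi
</code></pre></div></div>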

<h2 id="customizing-tmux">Customizing tmux</h2>

<p>One of the beauties of <code class="language-plaintext highlighter-rouge">tmux</code> is that you can customize it very easily. You can create a <code class="language-plaintext highlighter-rouge">.tmux.conf</code> file in your home directory and add your custom configurations, from changing the default key bindings to setting the colors of the status bar. Here is a simple <code class="language-plaintext highlighter-rouge">.tmux.conf</code> file:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c"># Set the prefix key to Ctrl+a -&gt; people usually use this key binding</span>
<span class="nb">set</span> <span class="nt">-g</span> prefix C-a
unbind C-b
<span class="nb">bind </span>C-a send-prefix

<span class="c"># Start windows and panes at 1, not 0</span>
<span class="c"># Why you ask?! because 0 is pretty far away on the keyboard</span>
<span class="nb">set</span> <span class="nt">-g</span> base-index 1
setw <span class="nt">-g</span> pane-base-index 1

<span class="c"># Enable mouse support</span>
<span class="nb">set</span> <span class="nt">-g</span> mouse on

<span class="c"># Set the status bar colors</span>
<span class="nb">set</span> <span class="nt">-g</span> status-bg black
<span class="nb">set</span> <span class="nt">-g</span> status-fg white

<span class="c"># Set the splitting commands, | for horizontal and - for vertical</span>
<span class="nb">bind</span> | split-window <span class="nt">-h</span> <span class="nt">-c</span> <span class="s2">"#{pane_current_path}"</span>
<span class="nb">bind</span> - split-window <span class="nt">-v</span> <span class="nt">-c</span> <span class="s2">"#{pane_current_path}"</span>
unbind <span class="s1">'"'</span>
unbind %

<span class="c"># open new windows in the current path</span>
<span class="nb">bind </span>c new-window <span class="nt">-c</span> <span class="s2">"#{pane_current_path}"</span>

<span class="c"># reload config file</span>
<span class="nb">bind </span>r source-file ~/.tmux.conf

<span class="c"># Use Alt-arrow keys without prefix key to switch panes</span>
<span class="nb">bind</span> <span class="nt">-n</span> M-Left <span class="k">select</span><span class="nt">-pane</span> <span class="nt">-L</span>
<span class="nb">bind</span> <span class="nt">-n</span> M-Right <span class="k">select</span><span class="nt">-pane</span> <span class="nt">-R</span>
<span class="nb">bind</span> <span class="nt">-n</span> M-Up <span class="k">select</span><span class="nt">-pane</span> <span class="nt">-U</span>
<span class="nb">bind</span> <span class="nt">-n</span> M-Down <span class="k">select</span><span class="nt">-pane</span> <span class="nt">-D</span>

<span class="c"># set default terminal mode to 256 colors</span>
<span class="nb">set</span> <span class="nt">-g</span> default-terminal <span class="s2">"xterm-256color"</span>
<span class="nb">set</span> <span class="nt">-ga</span> terminal-overrides <span class="s2">",xterm-256color:Tc"</span>

<span class="c"># Visual bell, no sounds</span>
<span class="nb">set</span> <span class="nt">-g</span> visual-bell on
<span class="nb">set</span> <span class="nt">-g</span> bell-action none
</code></pre></div></div>

<h3 id="managing-tmux-windows-and-sessions">Managing tmux windows and sessions</h3>

<p><strong>After</strong> updating the <code class="language-plaintext highlighter-rouge">.tmux.conf</code> file as above, the shortcuts in the table below apply:</p>

<table>
  <thead>
    <tr>
      <th>Command</th>
      <th>Description</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">Ctrl+a w</code></td>
      <td>List all windows</td>
    </tr>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">Ctrl+a 0-9</code></td>
      <td>Switch to window 0-9</td>
    </tr>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">Ctrl+a c</code></td>
      <td>Create a new window</td>
    </tr>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">Ctrl+a &amp;</code></td>
      <td>Close the current window</td>
    </tr>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">Ctrl+a n</code></td>
      <td>Move to the next window</td>
    </tr>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">Ctrl+a p</code></td>
      <td>Move to the previous window</td>
    </tr>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">Ctrl+a ,</code></td>
      <td>Rename the current window</td>
    </tr>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">Ctrl+a -</code></td>
      <td>Split the window vertically</td>
    </tr>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">Ctrl+a |</code></td>
      <td>Split the window horizontally</td>
    </tr>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">Ctrl+a d</code></td>
      <td>Detach from the session</td>
    </tr>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">Ctrl+a s</code></td>
      <td>List all sessions</td>
    </tr>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">Ctrl+a $</code></td>
      <td>Rename the current session</td>
    </tr>
  </tbody>
</table>

<h3 id="moving-and-resizing-panes">Moving and resizing panes</h3>

<p><strong>After</strong> updating the <code class="language-plaintext highlighter-rouge">.tmux.conf</code> file to the content above, the commands from the following table can be used:</p>

<table>
  <thead>
    <tr>
      <th>Command</th>
      <th>Description</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">Ctrl+a o</code></td>
      <td>Move to the next pane</td>
    </tr>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">Ctrl+a ;</code></td>
      <td>Move to the last active pane</td>
    </tr>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">Ctrl+a q</code></td>
      <td>Show pane numbers</td>
    </tr>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">Ctrl+a q &lt;number&gt;</code></td>
      <td>Activate the pane with the specified number</td>
    </tr>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">Ctrl+a x</code></td>
      <td>Kill the current pane</td>
    </tr>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">Ctrl+a z</code></td>
      <td>Zoom in/out the current pane</td>
    </tr>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">Ctrl+a !</code></td>
      <td>Move the current pane to a new window</td>
    </tr>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">Ctrl+a Up</code></td>
      <td>Switch to the pane above the active pane</td>
    </tr>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">Alt+Up</code></td>
      <td>Switch to the pane above the active pane (another way)</td>
    </tr>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">Ctrl+a Down</code></td>
      <td>Switch to the pane below the active pane</td>
    </tr>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">Alt+Down</code></td>
      <td>Switch to the pane below the active pane (another way)</td>
    </tr>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">Ctrl+a Left</code></td>
      <td>Switch to the pane on the left of the active pane</td>
    </tr>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">Alt+Left</code></td>
      <td>Switch to the pane on the left of the active pane (another way)</td>
    </tr>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">Ctrl+a Right</code></td>
      <td>Switch to the pane on the right of the active pane</td>
    </tr>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">Alt+Right</code></td>
      <td>Switch to the pane on the right of the active pane (another way)</td>
    </tr>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">Ctrl+a }</code></td>
      <td>Move the current pane right</td>
    </tr>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">Ctrl+a {</code></td>
      <td>Move the current pane left</td>
    </tr>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">Ctrl+a Ctrl+Up</code></td>
      <td>Resize the current pane up</td>
    </tr>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">Ctrl+a Ctrl+Down</code></td>
      <td>Resize the current pane down</td>
    </tr>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">Ctrl+a Ctrl+Left</code></td>
      <td>Resize the current pane left</td>
    </tr>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">Ctrl+a Ctrl+Right</code></td>
      <td>Resize the current pane right</td>
    </tr>
  </tbody>
</table>

<h2 id="automatic-startup">Automatic Startup</h2>

<p>You can start <code class="language-plaintext highlighter-rouge">tmux</code> automatically whenever you open a terminal. To do this, add the following snippet to your <code class="language-plaintext highlighter-rouge">.bashrc</code> file:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c"># Start tmux if not already running</span>
<span class="k">if</span> <span class="o">[</span> <span class="nt">-z</span> <span class="s2">"</span><span class="nv">$TMUX</span><span class="s2">"</span> <span class="o">]</span><span class="p">;</span> <span class="k">then
    </span>tmux attach <span class="o">||</span> tmux new
<span class="k">fi</span>
</code></pre></div></div>

<p>If you want to start <code class="language-plaintext highlighter-rouge">tmux</code> with a specific session, you can use the following command:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c"># Start tmux with a specific session</span>
<span class="k">if</span> <span class="o">[</span> <span class="nt">-z</span> <span class="s2">"</span><span class="nv">$TMUX</span><span class="s2">"</span> <span class="o">]</span><span class="p">;</span> <span class="k">then
    </span>tmux attach <span class="nt">-t</span> mySession <span class="o">||</span> tmux new <span class="nt">-s</span> mySession
<span class="k">fi</span>
</code></pre></div></div>
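<p>A slightly more defensive variant (a sketch; the helper name <code class="language-plaintext highlighter-rouge">should_autostart_tmux</code> is my own, not a standard command) only auto-starts when <code class="language-plaintext highlighter-rouge">tmux</code> is installed, the shell is interactive, and you are not already inside a session. It also uses <code class="language-plaintext highlighter-rouge">new-session -A</code>, which attaches to the session if it exists and creates it otherwise:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Decide whether auto-starting tmux makes sense in this shell
should_autostart_tmux() {
  command -v tmux 1>/dev/null 2>/dev/null || return 1  # tmux not installed
  if [ -n "${TMUX:-}" ]; then return 1; fi             # already inside tmux
  [ -n "${PS1:-}" ] || return 1                        # not an interactive shell
  return 0
}

if should_autostart_tmux; then
  # -A attaches to "main" if it exists, otherwise creates it
  tmux new-session -A -s main
fi
</code></pre></div></div>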

<h2 id="fun-part">Fun Part</h2>

<p>Now for the fun part, I have created some aliases to make it easier to manage <code class="language-plaintext highlighter-rouge">tmux</code> and its sessions. You can add these aliases to the <code class="language-plaintext highlighter-rouge">.bashrc</code> file:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c"># tmux aliases</span>
<span class="nb">alias </span><span class="nv">ta</span><span class="o">=</span><span class="s1">'tmux attach -t'</span>
<span class="nb">alias </span><span class="nv">tk</span><span class="o">=</span><span class="s1">'tmux kill-session -t'</span>
<span class="nb">alias </span><span class="nv">tls</span><span class="o">=</span><span class="s1">'tmux ls'</span>
<span class="nb">alias </span><span class="nv">tn</span><span class="o">=</span><span class="s1">'tmux new -s'</span>
<span class="nb">alias </span><span class="nv">ttop</span><span class="o">=</span><span class="s1">'tmux attach -t top || tmux new -s top "top"'</span>

<span class="c"># managing ssh sessions to remote servers</span>
<span class="c"># First check if the session exists, if not create a new one and ssh to the server</span>
<span class="c"># You can then later split the window and do whatever you want with the session, then detach and reattach later</span>
<span class="nb">alias </span><span class="nv">tpprl</span><span class="o">=</span><span class="s1">'tmux attach -t pprl || tmux new -s pprl "ssh pprl"'</span>
<span class="nb">alias </span><span class="nv">tmist</span><span class="o">=</span><span class="s1">'tmux attach -t mist || tmux new -s mist "ssh mist"'</span>
<span class="nb">alias </span><span class="nv">tnarval</span><span class="o">=</span><span class="s1">'tmux attach -t narval || tmux new -s narval "ssh narval"'</span>
<span class="nb">alias </span><span class="nv">tbeluga</span><span class="o">=</span><span class="s1">'tmux attach -t beluga || tmux new -s beluga "ssh beluga"'</span>
<span class="nb">alias </span><span class="nv">tgraham</span><span class="o">=</span><span class="s1">'tmux attach -t graham || tmux new -s graham "ssh graham"'</span>

<span class="c"># List all tmux sessions and windows at login</span>
<span class="nb">echo</span> <span class="s2">"Available tmux sessions:"</span>
tmux <span class="nb">ls</span>
</code></pre></div></div>
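<p>One small caveat: the bare <code class="language-plaintext highlighter-rouge">tmux ls</code> above prints an error when no server is running yet, and fails outright if <code class="language-plaintext highlighter-rouge">tmux</code> is not installed. A guarded version (a sketch; the helper name is mine) keeps the login output clean:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code># List sessions at login without spurious errors
list_tmux_sessions() {
  command -v tmux 1>/dev/null 2>/dev/null || return 0  # tmux missing: stay quiet
  echo "Available tmux sessions:"
  tmux ls 2>/dev/null || echo "  (none)"
}
list_tmux_sessions
</code></pre></div></div>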

<p><strong>Another fun use case:</strong> You can also combine tmux session management with the SSH control path (connection multiplexing) to avoid getting 2FA prompts every time you connect to a server. Check Ali’s <a href="https://alifara.codeberg.page/posts/compute-canada-2fa/">post</a> to understand the process.</p>

<h2 id="references">References</h2>

<ul>
  <li>A nice tmux conf <a href="https://github.com/hamvocke/dotfiles/blob/master/tmux/.tmux.conf">file</a></li>
  <li>Tmux official <a href="https://github.com/tmux/tmux/wiki/Getting-Started">guide</a></li>
  <li>Tmux <a href="https://tmuxcheatsheet.com/">cheat sheet</a></li>
  <li>Awesome tmux <a href="https://github.com/rothgar/awesome-tmux">resources</a></li>
  <li>Oh my <a href="https://github.com/gpakosz/.tmux">tmux</a>!</li>
</ul>]]></content><author><name>AmirHossein Sojoodi</name><email>amir.sojoodi@gmail.com</email></author><category term="Linux" /><category term="Bash" /><category term="Tmux" /><category term="Tips" /><summary type="html"><![CDATA[(If you are already familiar with tmux, and you are looking for a simple .tmux.conf, go to the end of this post.)]]></summary></entry><entry><title type="html">Setup LAMMPS</title><link href="https://amirsojoodi.github.io/posts/LAMMPS/" rel="alternate" type="text/html" title="Setup LAMMPS" /><published>2024-09-09T00:00:00-04:00</published><updated>2024-09-09T00:00:00-04:00</updated><id>https://amirsojoodi.github.io/posts/LAMMPS</id><content type="html" xml:base="https://amirsojoodi.github.io/posts/LAMMPS/"><![CDATA[<p>Large-scale Atomic/Molecular Massively Parallel Simulator (LAMMPS) is a classical molecular dynamics code that can be used to model atoms or, more generally, as a parallel particle simulator at various scales. The complete documentation of LAMMPS can be found <a href="https://docs.lammps.org/">here</a>. In this post, I will provide a guide on how to set up LAMMPS on a Linux machine. My setup is on a cluster with NVIDIA GPUs, UCX, and OpenMPI. Also, we have a built-in module system to load the necessary modules.</p>

<h2 id="prepare-the-environment">Prepare the Environment</h2>

<p>Prerequisites:</p>

<ul>
  <li>Git</li>
  <li>CMake</li>
  <li>An MPI library, like <a href="https://www.open-mpi.org/">OpenMPI</a></li>
  <li>For NVIDIA GPU support, <a href="https://developer.nvidia.com/cuda-toolkit">CUDA Toolkit</a> is needed.</li>
</ul>
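<p>Before starting the build, it can save time to verify that these tools are actually on the <code class="language-plaintext highlighter-rouge">PATH</code>. Here is a small helper for that (a sketch; the function name is mine, not part of LAMMPS):</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Report which of the given tools are available on PATH
check_tools() {
  missing=0
  for tool in "$@"; do
    if command -v "$tool" 1>/dev/null 2>/dev/null; then
      echo "found: $tool"
    else
      echo "missing: $tool"
      missing=1
    fi
  done
  return "$missing"
}

# Typical prerequisites for a GPU-enabled LAMMPS build
check_tools git cmake mpicc nvcc || echo "load the missing modules first"
</code></pre></div></div>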

<p>While this step may differ in various scenarios, I have the following environment variables set:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c">#! /bin/bash</span>
module <span class="nt">--force</span> purge
module load cuda

<span class="c"># If the argument is "builtin" then load the builtin modules, otherwise don't load any other modules</span>
<span class="k">if</span> <span class="o">[</span> <span class="s2">"$#"</span> <span class="nt">-eq</span> 1 <span class="o">]</span> <span class="o">&amp;&amp;</span> <span class="o">[</span> <span class="s2">"</span><span class="nv">$1</span><span class="s2">"</span> <span class="o">==</span> <span class="s2">"builtin"</span> <span class="o">]</span><span class="p">;</span> <span class="k">then
  </span><span class="nb">echo</span> <span class="s2">"using builtin modules"</span>
  module load ucx
  module load openmpi
  module list
  <span class="nb">echo</span> <span class="s2">"Built-in modules loaded"</span>
  <span class="k">return
fi

</span><span class="nb">echo</span> <span class="s2">"No additional modules loaded"</span>
module list

<span class="c"># If no argument is passed, set the root dir to the current directory,</span>
<span class="c"># else set it to the passed argument</span>
<span class="k">if</span> <span class="o">[</span> <span class="s2">"$#"</span> <span class="nt">-eq</span> 0 <span class="o">]</span><span class="p">;</span> <span class="k">then
  </span><span class="nb">export </span><span class="nv">ROOT_DIR</span><span class="o">=</span><span class="si">$(</span><span class="nb">pwd</span><span class="si">)</span>
<span class="k">else
  </span><span class="nb">export </span><span class="nv">ROOT_DIR</span><span class="o">=</span><span class="nv">$1</span>
<span class="k">fi

</span><span class="nb">export </span><span class="nv">BUILD_DIR</span><span class="o">=</span><span class="nv">$ROOT_DIR</span>/build

<span class="c">################### Some checks ###################</span>

<span class="c"># Check if LDFLAGS is bound or not</span>
<span class="k">if</span> <span class="o">[</span> <span class="nt">-z</span> <span class="k">${</span><span class="nv">LDFLAGS</span><span class="p">+x</span><span class="k">}</span> <span class="o">]</span><span class="p">;</span> <span class="k">then
  </span><span class="nb">export </span><span class="nv">LDFLAGS</span><span class="o">=</span><span class="s2">""</span>
<span class="k">fi</span>

<span class="c"># Same with LD_RUN_PATH</span>
<span class="k">if</span> <span class="o">[</span> <span class="nt">-z</span> <span class="k">${</span><span class="nv">LD_RUN_PATH</span><span class="p">+x</span><span class="k">}</span> <span class="o">]</span><span class="p">;</span> <span class="k">then
  </span><span class="nb">export </span><span class="nv">LD_RUN_PATH</span><span class="o">=</span><span class="s2">""</span>
<span class="k">fi</span>

<span class="c"># Same with CXXFLAGS</span>
<span class="k">if</span> <span class="o">[</span> <span class="nt">-z</span> <span class="k">${</span><span class="nv">CXXFLAGS</span><span class="p">+x</span><span class="k">}</span> <span class="o">]</span><span class="p">;</span> <span class="k">then
  </span><span class="nb">export </span><span class="nv">CXXFLAGS</span><span class="o">=</span><span class="s2">""</span>
<span class="k">fi</span>

<span class="c"># Same with LIBRARY_PATH</span>
<span class="k">if</span> <span class="o">[</span> <span class="nt">-z</span> <span class="k">${</span><span class="nv">LIBRARY_PATH</span><span class="p">+x</span><span class="k">}</span> <span class="o">]</span><span class="p">;</span> <span class="k">then
  </span><span class="nb">export </span><span class="nv">LIBRARY_PATH</span><span class="o">=</span><span class="s2">""</span>
<span class="k">fi</span>

<span class="c"># Same with LD_LIBRARY_PATH</span>
<span class="k">if</span> <span class="o">[</span> <span class="nt">-z</span> <span class="k">${</span><span class="nv">LD_LIBRARY_PATH</span><span class="p">+x</span><span class="k">}</span> <span class="o">]</span><span class="p">;</span> <span class="k">then
  </span><span class="nb">export </span><span class="nv">LD_LIBRARY_PATH</span><span class="o">=</span><span class="s2">""</span>
<span class="k">fi</span>

<span class="c">################### CUDA Configurations ###################</span>

<span class="c"># CUDA Configurations (mostly needed to build OpenMPI and UCX)</span>
<span class="nb">export </span><span class="nv">NVCC</span><span class="o">=</span><span class="si">$(</span>which nvcc<span class="si">)</span>
<span class="nb">export </span><span class="nv">CUDA_LIB</span><span class="o">=</span><span class="nv">$CUDA_HOME</span>/lib64/stubs

<span class="nb">export </span><span class="nv">LD_LIBRARY_PATH</span><span class="o">=</span><span class="nv">$CUDA_HOME</span>/lib64/:<span class="nv">$LD_LIBRARY_PATH</span>
<span class="nb">export </span><span class="nv">LD_LIBRARY_PATH</span><span class="o">=</span><span class="nv">$CUDA_LIB</span>:<span class="nv">$LD_LIBRARY_PATH</span>

<span class="nb">export </span><span class="nv">LIBRARY_PATH</span><span class="o">=</span><span class="nv">$CUDA_HOME</span>/lib64/:<span class="nv">$LIBRARY_PATH</span>
<span class="nb">export </span><span class="nv">LIBRARY_PATH</span><span class="o">=</span><span class="nv">$CUDA_LIB</span>:<span class="nv">$LIBRARY_PATH</span>

<span class="nb">export </span><span class="nv">LDFLAGS</span><span class="o">=</span><span class="s2">"-L</span><span class="nv">$CUDA_LIB</span><span class="s2"> -L</span><span class="nv">$CUDA_HOME</span><span class="s2">/lib64 </span><span class="nv">$LDFLAGS</span><span class="s2">"</span>
<span class="nb">export </span><span class="nv">CPATH</span><span class="o">=</span><span class="nv">$CUDA_HOME</span>/include:<span class="nv">$CPATH</span>
<span class="nb">export </span><span class="nv">LD_RUN_PATH</span><span class="o">=</span><span class="nv">$CUDA_LIB</span>:<span class="nv">$LD_RUN_PATH</span>
<span class="nb">export </span><span class="nv">CUDA_LDFLAGS</span><span class="o">=</span><span class="s2">"-lcuda -lcudart -lcudadevrt -lnvidia-ml -L</span><span class="nv">$CUDA_LIB</span><span class="s2">"</span>

<span class="nb">export </span><span class="nv">LD_LIBRARY_PATH</span><span class="o">=</span><span class="nv">$BUILD_DIR</span>/lib:<span class="nv">$LD_LIBRARY_PATH</span>
<span class="nb">export </span><span class="nv">LIBRARY_PATH</span><span class="o">=</span><span class="nv">$BUILD_DIR</span>/lib:<span class="nv">$LIBRARY_PATH</span>
<span class="nb">export </span><span class="nv">LDFLAGS</span><span class="o">=</span><span class="s2">"-L</span><span class="nv">$BUILD_DIR</span><span class="s2">/lib </span><span class="nv">$LDFLAGS</span><span class="s2">"</span>

<span class="nb">export </span><span class="nv">CPATH</span><span class="o">=</span><span class="nv">$BUILD_DIR</span>/include:<span class="nv">$CPATH</span>
<span class="nb">export </span><span class="nv">LD_RUN_PATH</span><span class="o">=</span><span class="nv">$BUILD_DIR</span>/lib:<span class="nv">$LD_RUN_PATH</span>
<span class="nb">export </span><span class="nv">PATH</span><span class="o">=</span><span class="nv">$BUILD_DIR</span>/bin/:<span class="nv">$PATH</span>

<span class="c"># Now UCX and OpenMPI can be built</span>
</code></pre></div></div>

<p>I have skipped the UCX and OpenMPI configurations, but you can find them in my previous posts or in their official documentation.</p>
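<p>For reference, the CUDA-aware builds of these two libraries usually come down to a couple of configure flags. The sketch below uses placeholder source paths; <code class="language-plaintext highlighter-rouge">--with-cuda</code> and <code class="language-plaintext highlighter-rouge">--with-ucx</code> are the standard options, but double-check the versions and flags against your cluster’s documentation:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code># UCX with CUDA support ($UCX_SRC_DIR is a placeholder)
cd $UCX_SRC_DIR
./contrib/configure-release --prefix=$BUILD_DIR --with-cuda=$CUDA_HOME
make -j 16 install

# OpenMPI on top of that UCX, also CUDA-aware ($OMPI_SRC_DIR is a placeholder)
cd $OMPI_SRC_DIR
./configure --prefix=$BUILD_DIR --with-cuda=$CUDA_HOME --with-ucx=$BUILD_DIR
make -j 16 install
</code></pre></div></div>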

<h2 id="build-lammps">Build LAMMPS</h2>

<p>The following script clones the LAMMPS repository, builds it, and runs some benchmarks. For more information about the available packages, you can check the <a href="https://docs.lammps.org/Build_package.html">LAMMPS documentation</a>.</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c">#!/bin/bash</span>
<span class="nb">set</span> <span class="nt">-eux</span>

<span class="nb">export </span><span class="nv">OPENMPI_DIR</span><span class="o">=</span><span class="s2">"/path/to/openmpi"</span>
<span class="nb">source</span> <span class="s2">"the_above_script.sh"</span> <span class="nv">$OPENMPI_DIR</span>

<span class="c"># Check the paths of the executables</span>
which nvcc
which mpicc
which mpirun

<span class="nb">export </span><span class="nv">LAMMPS_DIR</span><span class="o">=</span><span class="s2">"/path/to/lammps"</span>

<span class="c"># Perform a clean clone</span>
<span class="nb">rm</span> <span class="nt">-rf</span> <span class="nv">$LAMMPS_DIR</span>
git clone <span class="nt">--depth</span><span class="o">=</span>1 <span class="nt">-b</span> release https://github.com/lammps/lammps.git <span class="nv">$LAMMPS_DIR</span>
<span class="nb">cd</span> <span class="nv">$LAMMPS_DIR</span>

<span class="c"># Build LAMMPS</span>
<span class="nb">mkdir</span> <span class="nt">-p</span> <span class="nv">$LAMMPS_DIR</span>/build
<span class="nb">cd</span> <span class="nv">$LAMMPS_DIR</span>/build

cmake <span class="nt">-D</span> <span class="nv">CMAKE_BUILD_TYPE</span><span class="o">=</span>Release <span class="nt">-D</span> <span class="nv">CMAKE_INSTALL_PREFIX</span><span class="o">=</span><span class="nv">$LAMMPS_DIR</span>/build <span class="se">\</span>
  <span class="nt">-D</span> <span class="nv">PKG_KSPACE</span><span class="o">=</span>1 <span class="nt">-D</span> <span class="nv">PKG_MOLECULE</span><span class="o">=</span>1 <span class="nt">-D</span> <span class="nv">PKG_RIGID</span><span class="o">=</span>1 <span class="nt">-D</span> <span class="nv">PKG_MANYBODY</span><span class="o">=</span>1 <span class="se">\</span>
  <span class="nt">-D</span> <span class="nv">CMAKE_CXX_FLAGS</span><span class="o">=</span><span class="nt">-DCUDA_PROXY</span> <span class="nt">-D</span> <span class="nv">BUILD_MPI</span><span class="o">=</span>1 <span class="nt">-D</span> <span class="nv">PKG_GPU</span><span class="o">=</span>1 <span class="nt">-D</span> <span class="nv">GPU_API</span><span class="o">=</span>CUDA <span class="se">\</span>
  <span class="nt">-D</span> <span class="nv">CUDA_MPS_SUPPORT</span><span class="o">=</span>1 <span class="nv">$LAMMPS_DIR</span>/cmake

cmake <span class="nt">--build</span> <span class="nb">.</span> <span class="nt">--parallel</span> 32

<span class="c"># Some tests</span>
<span class="c"># mpirun -n 8 --mca pml ucx -x UCX_TLS=sm,cuda_copy,cuda_ipc --mca btl ^vader,tcp,openib \</span>
<span class="c">#   --mca coll ^hcoll ../lammps/build/lmp -sf gpu -pk gpu 4 -in ../lammps/bench/in.eam</span>
<span class="c"># mpirun -n 8 --mca pml ucx -x UCX_TLS=sm,cuda_copy,cuda_ipc --mca btl ^vader,tcp,openib \</span>
<span class="c">#   --mca coll ^hcoll ../lammps/build/lmp -sf gpu -pk gpu 1 -in ../lammps/bench/in.chain</span>
<span class="c"># mpirun -n 32 --mca pml ucx -x UCX_TLS=sm,cuda_copy,cuda_ipc --mca btl ^vader,tcp,openib \</span>
<span class="c">#   --mca coll ^hcoll ../lammps/build/lmp -sf gpu -pk gpu 4 -in ../lammps/bench/in.lj</span>
</code></pre></div></div>]]></content><author><name>AmirHossein Sojoodi</name><email>amir.sojoodi@gmail.com</email></author><category term="Programming" /><category term="MPI" /><category term="CUDA" /><category term="LAMMPS" /><summary type="html"><![CDATA[Large-scale Atomic/Molecular Massively Parallel Simulator (LAMMPS) is a classical molecular dynamics code that can be used to model atoms or, more generally, as a parallel particle simulator at various scales. The complete documentation of LAMMPS can be found here. In this post, I will provide a guide on how to set up LAMMPS on a Linux machine. My setup is on a cluster with NVIDIA GPUs, UCX, and OpenMPI. Also, we have a built-in module system to load the necessary modules.]]></summary></entry><entry><title type="html">MPS on Multi-Instance GPU</title><link href="https://amirsojoodi.github.io/posts/MPS+MIG/" rel="alternate" type="text/html" title="MPS on Multi-Instance GPU" /><published>2024-08-28T00:00:00-04:00</published><updated>2024-08-28T00:00:00-04:00</updated><id>https://amirsojoodi.github.io/posts/MPS-on-MIG</id><content type="html" xml:base="https://amirsojoodi.github.io/posts/MPS+MIG/"><![CDATA[<p>In previous posts (<a href="https://amirsojoodi.github.io/posts/Enabling-MPS/">MPS</a> and <a href="https://amirsojoodi.github.io/posts/MIG/">MIG</a>), I have explained how to enable MPS and MIG on NVIDIA GPUs. In this post, I will explain how to use both technologies at the same time. More specifically, I would like to enable MPS on all of the MIG instances. For more information, you can refer to the <a href="https://docs.nvidia.com/datacenter/tesla/mig-user-guide/index.html">NVIDIA document</a>.</p>

<h2 id="enabling-mps-on-mig">Enabling MPS on MIG</h2>

<p>I assume that you have already enabled MIG on your GPU(s); if not, please refer to the previous posts. As stated in the <a href="https://docs.nvidia.com/datacenter/tesla/mig-user-guide/index.html">NVIDIA document</a>, the steps for configuring MPS on MIG are as follows:</p>

<ul>
  <li>Configure the desired MIG geometry on the GPU.</li>
  <li>Setup the <code class="language-plaintext highlighter-rouge">CUDA_MPS_PIPE_DIRECTORY</code> variable to point to unique directories so that the multiple MPS servers and clients can communicate with each other using named pipes and Unix domain sockets.</li>
  <li>Launch the application by specifying the MIG device using <code class="language-plaintext highlighter-rouge">CUDA_VISIBLE_DEVICES</code>. (This step may be unnecessary if you point to the correct MPS server using <code class="language-plaintext highlighter-rouge">CUDA_MPS_PIPE_DIRECTORY</code>.)</li>
</ul>

<p>To enable MPS on MIG, I wrote a simple script that does the above steps. The script is as follows:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c">#!/bin/bash</span>

<span class="nb">set</span> <span class="nt">-eux</span>

<span class="c"># GPU_UUIDs=($(nvidia-smi -L | grep -oE "(GPU|MIG)-[0-9a-f]*-[0-9a-f]*-[0-9a-f]*-[0-9a-f]*-[0-9a-f]*"))</span>
<span class="nv">GPU_UUIDs</span><span class="o">=(</span><span class="si">$(</span>nvidia-smi <span class="nt">-L</span> | <span class="nb">grep</span> <span class="nt">-oE</span> <span class="s2">"(MIG)-[0-9a-f]*-[0-9a-f]*-[0-9a-f]*-[0-9a-f]*-[0-9a-f]*"</span><span class="si">)</span><span class="o">)</span>

<span class="k">for</span> <span class="o">((</span>index <span class="o">=</span> 0<span class="p">;</span> index &lt; <span class="k">${#</span><span class="nv">GPU_UUIDs</span><span class="p">[@]</span><span class="k">}</span><span class="p">;</span> index++<span class="o">))</span><span class="p">;</span> <span class="k">do
  </span><span class="nv">GPU</span><span class="o">=</span><span class="k">${</span><span class="nv">GPU_UUIDs</span><span class="p">[index]</span><span class="k">}</span>
  <span class="nb">rm</span> <span class="nt">-rf</span> /tmp/mps_<span class="k">${</span><span class="nv">GPU</span><span class="k">}</span>
  <span class="nb">rm</span> <span class="nt">-rf</span> /tmp/mps_log_<span class="k">${</span><span class="nv">GPU</span><span class="k">}</span>
  <span class="nb">mkdir</span> /tmp/mps_<span class="k">${</span><span class="nv">GPU</span><span class="k">}</span>
  <span class="nb">mkdir</span> /tmp/mps_log_<span class="k">${</span><span class="nv">GPU</span><span class="k">}</span>
  <span class="c"># Skip setting the GPU compute mode to Exclusive Process (not supported on MIG-enabled GPUs)</span>
  <span class="c"># nvidia-smi -i $index -c EXCLUSIVE_PROCESS</span>
  <span class="nb">export </span><span class="nv">CUDA_VISIBLE_DEVICES</span><span class="o">=</span><span class="k">${</span><span class="nv">GPU</span><span class="k">}</span>
  <span class="nb">export </span><span class="nv">CUDA_MPS_PIPE_DIRECTORY</span><span class="o">=</span>/tmp/mps_<span class="k">${</span><span class="nv">GPU</span><span class="k">}</span>
  <span class="nb">export </span><span class="nv">CUDA_MPS_LOG_DIRECTORY</span><span class="o">=</span>/tmp/mps_log_<span class="k">${</span><span class="nv">GPU</span><span class="k">}</span>
  nvidia-cuda-mps-control <span class="nt">-d</span>
<span class="k">done

</span>ps <span class="nt">-ef</span> | <span class="nb">grep </span>mps
</code></pre></div></div>

<p>In summary, the script does the following:</p>

<ul>
  <li>Gets the UUIDs of the MIG instances.</li>
  <li>Creates unique pipe and log directories for each MIG instance.</li>
  <li>Sets <code class="language-plaintext highlighter-rouge">CUDA_VISIBLE_DEVICES</code> to the MIG instance.</li>
  <li>Sets <code class="language-plaintext highlighter-rouge">CUDA_MPS_PIPE_DIRECTORY</code> and <code class="language-plaintext highlighter-rouge">CUDA_MPS_LOG_DIRECTORY</code> to those unique directories.</li>
  <li>Starts an MPS server on the specified MIG instance.</li>
  <li>Repeats these steps for all the MIG instances.</li>
  <li>Finally, lists the running MPS processes.</li>
</ul>
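<p>Since the UUID-extraction step is the one most likely to need adjusting on a different machine, below is a small sanity check of the same <code class="language-plaintext highlighter-rouge">grep</code> pattern against captured <code class="language-plaintext highlighter-rouge">nvidia-smi -L</code> output (the sample lines are borrowed from the Example section later in this post), so it can run even without a MIG-enabled GPU at hand:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#!/bin/bash

# Captured sample of `nvidia-smi -L` output; only the MIG UUIDs should match.
sample_output='GPU 0: NVIDIA A30 (UUID: GPU-8f8bff94-112e-9541-43da-cfd453333404)
  MIG 1c.2g.12gb  Device  0: (UUID: MIG-22f0c05f-5cf2-5ea2-8297-789695656dc0)
  MIG 1c.2g.12gb  Device  1: (UUID: MIG-d74dcafc-aad0-58b8-83d8-61bcd963d2e9)
GPU 1: NVIDIA A30 (UUID: GPU-0783f1eb-ab00-d6ec-92e4-8676be77de38)'

# Same pattern as in the script above: match MIG UUIDs, not whole-GPU UUIDs.
GPU_UUIDs=($(echo "${sample_output}" | grep -oE "(MIG)-[0-9a-f]*-[0-9a-f]*-[0-9a-f]*-[0-9a-f]*-[0-9a-f]*"))

echo "Found ${#GPU_UUIDs[@]} MIG instance(s)"
for uuid in "${GPU_UUIDs[@]}"; do
  echo "${uuid}"
done
</code></pre></div></div>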

<h2 id="disabling-mps-on-mig">Disabling MPS on MIG</h2>

<p>To disable MPS on MIG, you can use the following script:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c">#!/bin/bash</span>

<span class="nb">set</span> <span class="nt">-eux</span>

<span class="c"># GPU_UUIDs=($(nvidia-smi -L | grep -oE "(MIG|GPU)-[0-9a-f]*-[0-9a-f]*-[0-9a-f]*-[0-9a-f]*-[0-9a-f]*"))</span>
<span class="nv">GPU_UUIDs</span><span class="o">=(</span><span class="si">$(</span>nvidia-smi <span class="nt">-L</span> | <span class="nb">grep</span> <span class="nt">-oE</span> <span class="s2">"(MIG)-[0-9a-f]*-[0-9a-f]*-[0-9a-f]*-[0-9a-f]*-[0-9a-f]*"</span><span class="si">)</span><span class="o">)</span>

<span class="k">for</span> <span class="o">((</span>index <span class="o">=</span> 0<span class="p">;</span> index &lt; <span class="k">${#</span><span class="nv">GPU_UUIDs</span><span class="p">[@]</span><span class="k">}</span><span class="p">;</span> index++<span class="o">))</span><span class="p">;</span> <span class="k">do
  </span><span class="nv">GPU</span><span class="o">=</span><span class="k">${</span><span class="nv">GPU_UUIDs</span><span class="p">[index]</span><span class="k">}</span>
  <span class="nb">export </span><span class="nv">CUDA_VISIBLE_DEVICES</span><span class="o">=</span><span class="k">${</span><span class="nv">GPU</span><span class="k">}</span>
  <span class="nb">export </span><span class="nv">CUDA_MPS_PIPE_DIRECTORY</span><span class="o">=</span>/tmp/mps_<span class="k">${</span><span class="nv">GPU</span><span class="k">}</span>
  <span class="nb">export </span><span class="nv">CUDA_MPS_LOG_DIRECTORY</span><span class="o">=</span>/tmp/mps_log_<span class="k">${</span><span class="nv">GPU</span><span class="k">}</span>
  <span class="nb">echo</span> <span class="s2">"quit"</span> | nvidia-cuda-mps-control
  <span class="nb">rm</span> <span class="nt">-rf</span> /tmp/mps_log_<span class="k">${</span><span class="nv">GPU</span><span class="k">}</span>
  <span class="nb">rm</span> <span class="nt">-rf</span> /tmp/mps_<span class="k">${</span><span class="nv">GPU</span><span class="k">}</span>
  <span class="c"># Reset the GPU compute mode to Default (not supported on MIG-enabled GPUs)</span>
  <span class="c"># nvidia-smi -i $index -c DEFAULT</span>
<span class="k">done

</span>ps <span class="nt">-ef</span> | <span class="nb">grep </span>mps
</code></pre></div></div>

<p>In summary, the script does the following:</p>

<ul>
  <li>Gets the UUIDs of the MIG instances.</li>
  <li>Sets <code class="language-plaintext highlighter-rouge">CUDA_VISIBLE_DEVICES</code> to the MIG instance.</li>
  <li>Sets <code class="language-plaintext highlighter-rouge">CUDA_MPS_PIPE_DIRECTORY</code> and <code class="language-plaintext highlighter-rouge">CUDA_MPS_LOG_DIRECTORY</code> to the corresponding unique directories.</li>
  <li>Shuts down the MPS server on the specified MIG instance.</li>
  <li>Removes the directories.</li>
  <li>Repeats these steps for all the MIG instances.</li>
  <li>Finally, lists the running MPS processes, of which there should be none.</li>
</ul>

<p>Notice that the script does not destroy MIG configuration, and the GPUs will still be in MIG mode. If you want to see how to disable MIG, please refer to the <a href="https://amirsojoodi.github.io/posts/MIG/">previous post</a>.</p>

<h2 id="example">Example</h2>

<p>I have created 2 GPU instances, each with 2 compute instances, on my A30 GPU. This is how it looks:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span>nvidia-smi <span class="nt">-L</span>
GPU 0: NVIDIA A30 <span class="o">(</span>UUID: GPU-8f8bff94-112e-9541-43da-cfd453333404<span class="o">)</span>
  MIG 1c.2g.12gb  Device  0: <span class="o">(</span>UUID: MIG-22f0c05f-5cf2-5ea2-8297-789695656dc0<span class="o">)</span>
  MIG 1c.2g.12gb  Device  1: <span class="o">(</span>UUID: MIG-d74dcafc-aad0-58b8-83d8-61bcd963d2e9<span class="o">)</span>
  MIG 1c.2g.12gb  Device  2: <span class="o">(</span>UUID: MIG-27be287f-e2db-5526-a2f6-0bfabcf34af9<span class="o">)</span>
  MIG 1c.2g.12gb  Device  3: <span class="o">(</span>UUID: MIG-235f71be-a125-5ce0-9fe6-0cd97ae57733<span class="o">)</span>
GPU 1: NVIDIA A30 <span class="o">(</span>UUID: GPU-0783f1eb-ab00-d6ec-92e4-8676be77de38<span class="o">)</span>
GPU 2: NVIDIA A30 <span class="o">(</span>UUID: GPU-a90c6e94-391e-0fc3-8fc5-e2ef46ec6d2d<span class="o">)</span>
GPU 3: NVIDIA A30 <span class="o">(</span>UUID: GPU-46d1eefe-dfc8-2f00-16a9-95c08e019d47<span class="o">)</span>

<span class="nv">$ </span>nvidia-smi
Wed Aug 28 17:07:36 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 545.23.08              Driver Version: 545.23.08    CUDA Version: 12.3     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|<span class="o">=========================================</span>+<span class="o">======================</span>+<span class="o">======================</span>|
|   0  NVIDIA A30                     On  | 00000000:17:00.0 Off |                   On |
| N/A   28C    P0              30W / 165W |     50MiB / 24576MiB |     N/A      Default |
|                                         |                      |              Enabled |
+-----------------------------------------+----------------------+----------------------+
|   1  NVIDIA A30                     On  | 00000000:65:00.0 Off |                    0 |
| N/A   28C    P0              30W / 165W |      4MiB / 24576MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   2  NVIDIA A30                     On  | 00000000:CA:00.0 Off |                    0 |
| N/A   27C    P0              31W / 165W |      4MiB / 24576MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   3  NVIDIA A30                     On  | 00000000:E3:00.0 Off |                    0 |
| N/A   28C    P0              32W / 165W |      4MiB / 24576MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| MIG devices:                                                                          |
+------------------+--------------------------------+-----------+-----------------------+
| GPU  GI  CI  MIG |                   Memory-Usage |        Vol|      Shared           |
|      ID  ID  Dev |                     BAR1-Usage | SM     Unc| CE ENC DEC OFA JPG    |
|                  |                                |        ECC|                       |
|<span class="o">==================</span>+<span class="o">================================</span>+<span class="o">===========</span>+<span class="o">=======================</span>|
|  0    1   0   0  |              25MiB / 11968MiB  | 14      0 |  2   0    2    0    0 |
|                  |               0MiB / 16383MiB  |           |                       |
+------------------+                                +-----------+-----------------------+
|  0    1   1   1  |                                | 14      0 |  2   0    2    0    0 |
|                  |                                |           |                       |
+------------------+--------------------------------+-----------+-----------------------+
|  0    2   0   2  |              25MiB / 11968MiB  | 14      0 |  2   0    2    0    0 |
|                  |               0MiB / 16383MiB  |           |                       |
+------------------+                                +-----------+-----------------------+
|  0    2   1   3  |                                | 14      0 |  2   0    2    0    0 |
|                  |                                |           |                       |
+------------------+--------------------------------+-----------+-----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|<span class="o">=======================================================================================</span>|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+
</code></pre></div></div>

<p>Here is the output of <code class="language-plaintext highlighter-rouge">ps -ef | grep mps</code> after running the MPS-enabling script above (with sudo):</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span>ps <span class="nt">-ef</span> | <span class="nb">grep </span>mps
root      241318       1  0 17:08 ?        00:00:00 nvidia-cuda-mps-control <span class="nt">-d</span>
root      241326       1  0 17:08 ?        00:00:00 nvidia-cuda-mps-control <span class="nt">-d</span>
root      241334       1  0 17:08 ?        00:00:00 nvidia-cuda-mps-control <span class="nt">-d</span>
root      241342       1  0 17:08 ?        00:00:00 nvidia-cuda-mps-control <span class="nt">-d</span>

<span class="c"># And the content of tmp directory</span>
<span class="nv">$ </span><span class="nb">ls</span> <span class="nt">-l</span> /tmp/
total 0
drwxr-xr-x 2 root   root   120 Aug 28 17:08 mps_MIG-22f0c05f-5cf2-5ea2-8297-789695656dc0
drwxr-xr-x 2 root   root   120 Aug 28 17:08 mps_MIG-235f71be-a125-5ce0-9fe6-0cd97ae57733
drwxr-xr-x 2 root   root   120 Aug 28 17:08 mps_MIG-27be287f-e2db-5526-a2f6-0bfabcf34af9
drwxr-xr-x 2 root   root   120 Aug 28 17:08 mps_MIG-d74dcafc-aad0-58b8-83d8-61bcd963d2e9
drwxr-xr-x 2 root   root    80 Aug 28 17:08 mps_log_MIG-22f0c05f-5cf2-5ea2-8297-789695656dc0
drwxr-xr-x 2 root   root    80 Aug 28 17:08 mps_log_MIG-235f71be-a125-5ce0-9fe6-0cd97ae57733
drwxr-xr-x 2 root   root    80 Aug 28 17:08 mps_log_MIG-27be287f-e2db-5526-a2f6-0bfabcf34af9
drwxr-xr-x 2 root   root    80 Aug 28 17:08 mps_log_MIG-d74dcafc-aad0-58b8-83d8-61bcd963d2e9
</code></pre></div></div>

<p>Now, you can run your application on the MIG instances with MPS enabled. For example, you can use the following command to run <code class="language-plaintext highlighter-rouge">deviceQuery</code>:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>  <span class="nv">CUDA_MPS_PIPE_DIRECTORY</span><span class="o">=</span>/tmp/mps_MIG-22f0c05f-5cf2-5ea2-8297-789695656dc0 <span class="se">\</span>
  <span class="nv">CUDA_MPS_LOG_DIRECTORY</span><span class="o">=</span>/tmp/mps_log_MIG-22f0c05f-5cf2-5ea2-8297-789695656dc0 <span class="se">\</span>
  ./deviceQuery
</code></pre></div></div>

<p>After I ran <code class="language-plaintext highlighter-rouge">deviceQuery</code> on all the MIG instances with the correct PIPE and LOG directories, this is the output of <code class="language-plaintext highlighter-rouge">nvidia-smi</code>:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span>nvidia-smi
...
+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|<span class="o">=======================================================================================</span>|
|    0    1    0     241923      C   nvidia-cuda-mps-server                       30MiB |
|    0    1    1     254870      C   nvidia-cuda-mps-server                       30MiB |
|    0    2    0     249932      C   nvidia-cuda-mps-server                       30MiB |
|    0    2    1     242189      C   nvidia-cuda-mps-server                       30MiB |
+---------------------------------------------------------------------------------------+
</code></pre></div></div>

<p>Notice the GI and CI IDs, and how each compute instance has its own MPS server. It is worth mentioning that MPS servers are started lazily, so if you do not run any application, the corresponding MPS server will not be started.</p>

<p>By the way, I did not see any difference in the behaviour of the application when passing <code class="language-plaintext highlighter-rouge">CUDA_VISIBLE_DEVICES</code> to the command; it seems that <code class="language-plaintext highlighter-rouge">CUDA_MPS_PIPE_DIRECTORY</code> alone is enough to select the correct MPS server. However, setting <code class="language-plaintext highlighter-rouge">CUDA_VISIBLE_DEVICES</code> is still good practice to avoid confusion, since setting it to a wrong value causes the application to return an error.</p>]]></content><author><name>AmirHossein Sojoodi</name><email>amir.sojoodi@gmail.com</email></author><category term="Programming" /><category term="CUDA" /><category term="MPS" /><category term="MIG" /><summary type="html"><![CDATA[In previous posts, (MPS and MIG), I have explained how to enable MPS and MIG on NVIDIA GPUs. In this post, I will explain how to use both technologies at the same time. In more detail, I would like to enable MPS on all of the MIG instances. For more information, you can refer to the NVIDIA document.]]></summary></entry><entry><title type="html">Wasm + WebGPU example on DCP</title><link href="https://amirsojoodi.github.io/posts/DCP-Wasm-WebGPU-Example/" rel="alternate" type="text/html" title="Wasm + WebGPU example on DCP" /><published>2024-06-20T00:00:00-04:00</published><updated>2024-06-20T00:00:00-04:00</updated><id>https://amirsojoodi.github.io/posts/WebGPU-Wasm-DCP</id><content type="html" xml:base="https://amirsojoodi.github.io/posts/DCP-Wasm-WebGPU-Example/"><![CDATA[<p>This example is a follow-up to my <a href="https://amirsojoodi.github.io/posts/Cross-Platform-WebGPU">previous post</a> on how to write a cross-platform WebGPU example. In this one, I’ll demonstrate how to deploy a matmult example written in C/C++ and WebGPU in a DCP worker using WASM.
Note that for verification purposes, I also provide a <code class="language-plaintext highlighter-rouge">dawn</code>-based native test, but this example doesn’t require building or installing dawn in order to work.</p>

<h2 id="what-is-dcp">What is DCP?</h2>

<p>DCP is a secure and powerful parallel computing platform built on web technology. For more information, take a look at the <a href="https://docs.dcp.dev/">documentation</a>.</p>

<h2 id="this-example">This example</h2>

<p>We will experiment with a matrix multiplication in C/C++ using WebGPU. The matrices are stored in 1D arrays, with the dimensions as the first and second elements and the data stored afterwards. To build and run this example, we have three options:</p>

<ol>
  <li>Build to run natively using <code class="language-plaintext highlighter-rouge">dawn</code></li>
  <li>Build to run as a web worker using <code class="language-plaintext highlighter-rouge">emscripten</code></li>
  <li>Build to run as a DCP workload using <code class="language-plaintext highlighter-rouge">emscripten</code> &lt;– we are interested in this one!</li>
</ol>
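<p>The 1D matrix layout described above can be sketched with a small helper. Note that <code class="language-plaintext highlighter-rouge">packMatrix</code> is a hypothetical function for illustration, not part of the example’s source:</p>

<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#include &lt;cassert&gt;
#include &lt;vector&gt;

// Hypothetical helper (not in the example's source): pack a rows x cols
// matrix, given in row-major order, into the 1D layout this example uses:
// the first two elements hold the dimensions, and the data follows.
std::vector&lt;float&gt; packMatrix(int rows, int cols,
                              const std::vector&lt;float&gt; &amp;data) {
  std::vector&lt;float&gt; packed;
  packed.reserve(2 + rows * cols);
  packed.push_back(static_cast&lt;float&gt;(rows));
  packed.push_back(static_cast&lt;float&gt;(cols));
  packed.insert(packed.end(), data.begin(), data.end());
  return packed;
}

int main() {
  // A 2x3 matrix: dimensions first, then the six values in row-major order.
  std::vector&lt;float&gt; m = packMatrix(2, 3, {1, 2, 3, 4, 5, 6});
  assert(m.size() == 8);
  assert(m[0] == 2.0f &amp;&amp; m[1] == 3.0f); // dimensions
  assert(m[2] == 1.0f &amp;&amp; m[7] == 6.0f); // data, row-major
  return 0;
}
</code></pre></div></div>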

<h3 id="code-structure">Code structure</h3>

<p>By the end of the build process, the directory structure should look like this:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">.</span>
├── clean-and-build.sh
├── deployJob.js
├── node_modules
│   └── ...
├── package
│   ├── build-web
│   │   ├── wasm-webgpu-matmult.js
│   │   └── ...
│   ├── CMakeLists.txt
│   ├── package.dcp
│   └── src
│       ├── closebravo.js
│       ├── index.html
│       ├── openbravo.js
│       └── wasm-webgpu-matmult.cpp
├── package.json
├── package-lock.json
├── README.md
└── updateVersion.js
</code></pre></div></div>

<p>The source <code class="language-plaintext highlighter-rouge">./package/src/wasm-webgpu-matmult.cpp</code> is similar to the source from the previous post:</p>

<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cp">#include</span> <span class="cpf">&lt;cstring&gt;</span><span class="cp">
#include</span> <span class="cpf">&lt;iostream&gt;</span><span class="cp">
#include</span> <span class="cpf">&lt;iterator&gt;</span><span class="cp">
#include</span> <span class="cpf">&lt;vector&gt;</span><span class="cp">
#include</span> <span class="cpf">&lt;webgpu/webgpu_cpp.h&gt;</span><span class="cp">
</span>
<span class="cp">#ifdef __EMSCRIPTEN__
#include</span> <span class="cpf">&lt;emscripten.h&gt;</span><span class="cp">
#endif
</span>
<span class="n">wgpu</span><span class="o">::</span><span class="n">Instance</span> <span class="n">instance</span><span class="p">;</span>
<span class="n">wgpu</span><span class="o">::</span><span class="n">Adapter</span> <span class="n">adapter</span><span class="p">;</span>
<span class="n">wgpu</span><span class="o">::</span><span class="n">Device</span> <span class="n">device</span><span class="p">;</span>

<span class="n">wgpu</span><span class="o">::</span><span class="n">Buffer</span> <span class="n">gpuReadBuffer</span><span class="p">;</span>
<span class="kt">size_t</span> <span class="n">resultMatrixSize</span><span class="p">;</span>
<span class="kt">bool</span> <span class="n">work_done</span> <span class="o">=</span> <span class="nb">false</span><span class="p">;</span>

<span class="c1">// GetAdapter receives a callback function that is invoked</span>
<span class="c1">// after RequestAdapter resolves.</span>
<span class="kt">void</span> <span class="nf">GetAdapter</span><span class="p">(</span><span class="kt">void</span> <span class="p">(</span><span class="o">*</span><span class="n">callback</span><span class="p">)())</span> <span class="p">{</span>
  <span class="n">instance</span><span class="p">.</span><span class="n">RequestAdapter</span><span class="p">(</span>
      <span class="nb">nullptr</span><span class="p">,</span>
      <span class="p">[](</span><span class="n">WGPURequestAdapterStatus</span> <span class="n">status</span><span class="p">,</span> <span class="n">WGPUAdapter</span> <span class="n">cAdapter</span><span class="p">,</span>
         <span class="k">const</span> <span class="kt">char</span> <span class="o">*</span><span class="n">message</span><span class="p">,</span> <span class="kt">void</span> <span class="o">*</span><span class="n">userdata</span><span class="p">)</span> <span class="p">{</span>
        <span class="k">if</span> <span class="p">(</span><span class="n">message</span><span class="p">)</span> <span class="p">{</span>
          <span class="n">std</span><span class="o">::</span><span class="n">cout</span> <span class="o">&lt;&lt;</span> <span class="s">"RequestAdapter message: "</span> <span class="o">&lt;&lt;</span> <span class="n">message</span> <span class="o">&lt;&lt;</span> <span class="n">std</span><span class="o">::</span><span class="n">endl</span><span class="p">;</span>
        <span class="p">}</span>
        <span class="k">if</span> <span class="p">(</span><span class="n">status</span> <span class="o">!=</span> <span class="n">WGPURequestAdapterStatus_Success</span><span class="p">)</span> <span class="p">{</span>
          <span class="n">std</span><span class="o">::</span><span class="n">cout</span> <span class="o">&lt;&lt;</span> <span class="s">"AdapterRequest was not successful"</span> <span class="o">&lt;&lt;</span> <span class="n">std</span><span class="o">::</span><span class="n">endl</span><span class="p">;</span>
          <span class="n">exit</span><span class="p">(</span><span class="mi">0</span><span class="p">);</span>
        <span class="p">}</span>
        <span class="n">adapter</span> <span class="o">=</span> <span class="n">wgpu</span><span class="o">::</span><span class="n">Adapter</span><span class="o">::</span><span class="n">Acquire</span><span class="p">(</span><span class="n">cAdapter</span><span class="p">);</span>
        <span class="c1">// (2) Cast userdata back to the callback and then call it</span>
        <span class="k">reinterpret_cast</span><span class="o">&lt;</span><span class="kt">void</span> <span class="p">(</span><span class="o">*</span><span class="p">)()</span><span class="o">&gt;</span><span class="p">(</span><span class="n">userdata</span><span class="p">)();</span>
      <span class="p">},</span>
      <span class="c1">// (1) Cast the call back to void pointer and pass it in</span>
      <span class="k">reinterpret_cast</span><span class="o">&lt;</span><span class="kt">void</span> <span class="o">*&gt;</span><span class="p">(</span><span class="n">callback</span><span class="p">));</span>
<span class="p">}</span>

<span class="c1">// Similar to GetAdapter, the callback is called when RequestDevice resolves</span>
<span class="kt">void</span> <span class="n">GetDevice</span><span class="p">(</span><span class="kt">void</span> <span class="p">(</span><span class="o">*</span><span class="n">callback</span><span class="p">)())</span> <span class="p">{</span>
  <span class="n">adapter</span><span class="p">.</span><span class="n">RequestDevice</span><span class="p">(</span>
      <span class="nb">nullptr</span><span class="p">,</span>
      <span class="p">[](</span><span class="n">WGPURequestDeviceStatus</span> <span class="n">status</span><span class="p">,</span> <span class="n">WGPUDevice</span> <span class="n">cDevice</span><span class="p">,</span>
         <span class="k">const</span> <span class="kt">char</span> <span class="o">*</span><span class="n">message</span><span class="p">,</span> <span class="kt">void</span> <span class="o">*</span><span class="n">userdata</span><span class="p">)</span> <span class="p">{</span>
        <span class="k">if</span> <span class="p">(</span><span class="n">message</span><span class="p">)</span> <span class="p">{</span>
          <span class="n">std</span><span class="o">::</span><span class="n">cout</span> <span class="o">&lt;&lt;</span> <span class="s">"RequestDevice message: "</span> <span class="o">&lt;&lt;</span> <span class="n">message</span> <span class="o">&lt;&lt;</span> <span class="n">std</span><span class="o">::</span><span class="n">endl</span><span class="p">;</span>
        <span class="p">}</span>
        <span class="k">if</span> <span class="p">(</span><span class="n">status</span> <span class="o">!=</span> <span class="n">WGPURequestDeviceStatus_Success</span><span class="p">)</span> <span class="p">{</span>
          <span class="n">std</span><span class="o">::</span><span class="n">cout</span> <span class="o">&lt;&lt;</span> <span class="s">"DeviceRequest was not successful"</span> <span class="o">&lt;&lt;</span> <span class="n">std</span><span class="o">::</span><span class="n">endl</span><span class="p">;</span>
          <span class="n">exit</span><span class="p">(</span><span class="mi">0</span><span class="p">);</span>
        <span class="p">}</span>
        <span class="n">device</span> <span class="o">=</span> <span class="n">wgpu</span><span class="o">::</span><span class="n">Device</span><span class="o">::</span><span class="n">Acquire</span><span class="p">(</span><span class="n">cDevice</span><span class="p">);</span>
        <span class="n">device</span><span class="p">.</span><span class="n">SetUncapturedErrorCallback</span><span class="p">(</span>
            <span class="p">[](</span><span class="n">WGPUErrorType</span> <span class="n">type</span><span class="p">,</span> <span class="k">const</span> <span class="kt">char</span> <span class="o">*</span><span class="n">message</span><span class="p">,</span> <span class="kt">void</span> <span class="o">*</span><span class="n">userdata</span><span class="p">)</span> <span class="p">{</span>
              <span class="n">std</span><span class="o">::</span><span class="n">cout</span> <span class="o">&lt;&lt;</span> <span class="s">"Error: "</span> <span class="o">&lt;&lt;</span> <span class="n">type</span> <span class="o">&lt;&lt;</span> <span class="s">" - message: "</span> <span class="o">&lt;&lt;</span> <span class="n">message</span><span class="p">;</span>
            <span class="p">},</span>
            <span class="nb">nullptr</span><span class="p">);</span>
        <span class="c1">// (2) Cast userdata back to the callback and then call it</span>
        <span class="k">reinterpret_cast</span><span class="o">&lt;</span><span class="kt">void</span> <span class="p">(</span><span class="o">*</span><span class="p">)()</span><span class="o">&gt;</span><span class="p">(</span><span class="n">userdata</span><span class="p">)();</span>
      <span class="p">},</span>
      <span class="c1">// (1) Cast the call back to void pointer and pass it in</span>
      <span class="k">reinterpret_cast</span><span class="o">&lt;</span><span class="kt">void</span> <span class="o">*&gt;</span><span class="p">(</span><span class="n">callback</span><span class="p">));</span>
<span class="p">}</span>

<span class="k">const</span> <span class="kt">char</span> <span class="n">shaderCode</span><span class="p">[]</span> <span class="o">=</span> <span class="s">R"(
    struct Matrix {
        size : vec2&lt;f32&gt;,
        numbers: array&lt;f32&gt;,
    };

    @group(0) @binding(0) var&lt;storage, read&gt; firstMatrix : Matrix;
    @group(0) @binding(1) var&lt;storage, read&gt; secondMatrix : Matrix;
    @group(0) @binding(2) var&lt;storage, read_write&gt; resultMatrix : Matrix;

    @compute @workgroup_size(8, 8)
    fn main(@builtin(global_invocation_id) global_id : vec3&lt;u32&gt;) {
        // Guard against out-of-bounds work group sizes
        if (global_id.x &gt;= u32(firstMatrix.size.x) || global_id.y &gt;= u32(secondMatrix.size.y)) {
            return;
        }

        resultMatrix.size = vec2(firstMatrix.size.x, secondMatrix.size.y);

        let resultCell = vec2(global_id.x, global_id.y);
        var result = 0.0;
        for (var i = 0u; i &lt; u32(firstMatrix.size.y); i = i + 1u) {
            let a = i + resultCell.x * u32(firstMatrix.size.y);
            let b = resultCell.y + i * u32(secondMatrix.size.y);
            result = result + firstMatrix.numbers[a] * secondMatrix.numbers[b];
        }

        let index = resultCell.y + resultCell.x * u32(secondMatrix.size.y);
        resultMatrix.numbers[index] = result;
    }
)"</span><span class="p">;</span>

<span class="c1">// This callback is called when the last mapAsync is resolved</span>
<span class="kt">void</span> <span class="n">BufferMapCallbackFunction</span><span class="p">(</span><span class="n">WGPUBufferMapAsyncStatus</span> <span class="n">status</span><span class="p">,</span>
                               <span class="kt">void</span> <span class="o">*</span><span class="n">userdata</span><span class="p">)</span> <span class="p">{</span>

  <span class="n">std</span><span class="o">::</span><span class="n">cout</span> <span class="o">&lt;&lt;</span> <span class="s">"In buffer async callback, status: "</span> <span class="o">&lt;&lt;</span> <span class="n">status</span> <span class="o">&lt;&lt;</span> <span class="n">std</span><span class="o">::</span><span class="n">endl</span><span class="p">;</span>

  <span class="k">if</span> <span class="p">(</span><span class="n">status</span> <span class="o">==</span> <span class="n">WGPUBufferMapAsyncStatus_Success</span><span class="p">)</span> <span class="p">{</span>
    <span class="k">const</span> <span class="kt">float</span> <span class="o">*</span><span class="n">resultData</span> <span class="o">=</span> <span class="k">static_cast</span><span class="o">&lt;</span><span class="k">const</span> <span class="kt">float</span> <span class="o">*&gt;</span><span class="p">(</span>
        <span class="n">gpuReadBuffer</span><span class="p">.</span><span class="n">GetConstMappedRange</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="n">resultMatrixSize</span><span class="p">));</span>

    <span class="n">std</span><span class="o">::</span><span class="n">cout</span> <span class="o">&lt;&lt;</span> <span class="s">"Result Matrix: "</span> <span class="o">&lt;&lt;</span> <span class="n">std</span><span class="o">::</span><span class="n">endl</span><span class="p">;</span>
    <span class="k">for</span> <span class="p">(</span><span class="kt">int</span> <span class="n">i</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">i</span> <span class="o">&lt;</span> <span class="n">resultMatrixSize</span> <span class="o">/</span> <span class="k">sizeof</span><span class="p">(</span><span class="kt">float</span><span class="p">);</span> <span class="n">i</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span>
      <span class="n">std</span><span class="o">::</span><span class="n">cout</span> <span class="o">&lt;&lt;</span> <span class="n">resultData</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="o">&lt;&lt;</span> <span class="s">" "</span><span class="p">;</span>
    <span class="p">}</span>
    <span class="n">std</span><span class="o">::</span><span class="n">cout</span> <span class="o">&lt;&lt;</span> <span class="n">std</span><span class="o">::</span><span class="n">endl</span><span class="p">;</span>

    <span class="n">gpuReadBuffer</span><span class="p">.</span><span class="n">Unmap</span><span class="p">();</span>
  <span class="p">}</span> <span class="k">else</span> <span class="p">{</span>
    <span class="n">std</span><span class="o">::</span><span class="n">cout</span> <span class="o">&lt;&lt;</span> <span class="s">"Failed to map result buffer"</span> <span class="o">&lt;&lt;</span> <span class="n">std</span><span class="o">::</span><span class="n">endl</span><span class="p">;</span>
  <span class="p">}</span>
  <span class="o">*</span><span class="k">reinterpret_cast</span><span class="o">&lt;</span><span class="kt">bool</span> <span class="o">*&gt;</span><span class="p">(</span><span class="n">userdata</span><span class="p">)</span> <span class="o">=</span> <span class="nb">true</span><span class="p">;</span>
<span class="p">}</span>

<span class="kt">void</span> <span class="n">RunMatMult</span><span class="p">()</span> <span class="p">{</span>
  <span class="c1">// First Matrix</span>
  <span class="k">const</span> <span class="n">std</span><span class="o">::</span><span class="n">vector</span><span class="o">&lt;</span><span class="kt">float</span><span class="o">&gt;</span> <span class="n">firstMatrix</span> <span class="o">=</span> <span class="p">{</span><span class="mi">2</span><span class="p">,</span> <span class="mi">4</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">2</span><span class="p">,</span> <span class="mi">3</span><span class="p">,</span> <span class="mi">4</span><span class="p">,</span> <span class="mi">5</span><span class="p">,</span> <span class="mi">6</span><span class="p">,</span> <span class="mi">7</span><span class="p">,</span> <span class="mi">8</span><span class="p">};</span>
  <span class="kt">size_t</span> <span class="n">firstMatrixSize</span> <span class="o">=</span> <span class="n">firstMatrix</span><span class="p">.</span><span class="n">size</span><span class="p">()</span> <span class="o">*</span> <span class="k">sizeof</span><span class="p">(</span><span class="kt">float</span><span class="p">);</span>

  <span class="n">wgpu</span><span class="o">::</span><span class="n">Buffer</span> <span class="n">gpuBufferFirstMatrix</span> <span class="o">=</span>
      <span class="n">device</span><span class="p">.</span><span class="n">CreateBuffer</span><span class="p">(</span><span class="k">new</span> <span class="n">wgpu</span><span class="o">::</span><span class="n">BufferDescriptor</span><span class="p">{</span>
          <span class="p">.</span><span class="n">usage</span> <span class="o">=</span> <span class="n">wgpu</span><span class="o">::</span><span class="n">BufferUsage</span><span class="o">::</span><span class="n">Storage</span><span class="p">,</span>
          <span class="p">.</span><span class="n">size</span> <span class="o">=</span> <span class="n">firstMatrixSize</span><span class="p">,</span>
          <span class="p">.</span><span class="n">mappedAtCreation</span> <span class="o">=</span> <span class="nb">true</span><span class="p">,</span>
      <span class="p">});</span>
  <span class="n">std</span><span class="o">::</span><span class="n">memcpy</span><span class="p">(</span><span class="n">gpuBufferFirstMatrix</span><span class="p">.</span><span class="n">GetMappedRange</span><span class="p">(),</span> <span class="n">firstMatrix</span><span class="p">.</span><span class="n">data</span><span class="p">(),</span>
              <span class="n">firstMatrixSize</span><span class="p">);</span>
  <span class="n">gpuBufferFirstMatrix</span><span class="p">.</span><span class="n">Unmap</span><span class="p">();</span>

  <span class="n">std</span><span class="o">::</span><span class="n">cout</span> <span class="o">&lt;&lt;</span> <span class="s">"First Matrix: "</span> <span class="o">&lt;&lt;</span> <span class="n">std</span><span class="o">::</span><span class="n">endl</span><span class="p">;</span>
  <span class="n">std</span><span class="o">::</span><span class="n">copy</span><span class="p">(</span><span class="n">firstMatrix</span><span class="p">.</span><span class="n">begin</span><span class="p">(),</span> <span class="n">firstMatrix</span><span class="p">.</span><span class="n">end</span><span class="p">(),</span>
            <span class="n">std</span><span class="o">::</span><span class="n">ostream_iterator</span><span class="o">&lt;</span><span class="kt">float</span><span class="o">&gt;</span><span class="p">(</span><span class="n">std</span><span class="o">::</span><span class="n">cout</span><span class="p">,</span> <span class="s">" "</span><span class="p">));</span>
  <span class="n">std</span><span class="o">::</span><span class="n">cout</span> <span class="o">&lt;&lt;</span> <span class="n">std</span><span class="o">::</span><span class="n">endl</span><span class="p">;</span>

  <span class="c1">// Second Matrix</span>
  <span class="k">const</span> <span class="n">std</span><span class="o">::</span><span class="n">vector</span><span class="o">&lt;</span><span class="kt">float</span><span class="o">&gt;</span> <span class="n">secondMatrix</span> <span class="o">=</span> <span class="p">{</span><span class="mi">4</span><span class="p">,</span> <span class="mi">2</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">2</span><span class="p">,</span> <span class="mi">3</span><span class="p">,</span> <span class="mi">4</span><span class="p">,</span> <span class="mi">5</span><span class="p">,</span> <span class="mi">6</span><span class="p">,</span> <span class="mi">7</span><span class="p">,</span> <span class="mi">8</span><span class="p">};</span>
  <span class="kt">size_t</span> <span class="n">secondMatrixSize</span> <span class="o">=</span> <span class="n">secondMatrix</span><span class="p">.</span><span class="n">size</span><span class="p">()</span> <span class="o">*</span> <span class="k">sizeof</span><span class="p">(</span><span class="kt">float</span><span class="p">);</span>

  <span class="n">wgpu</span><span class="o">::</span><span class="n">Buffer</span> <span class="n">gpuBufferSecondMatrix</span> <span class="o">=</span>
      <span class="n">device</span><span class="p">.</span><span class="n">CreateBuffer</span><span class="p">(</span><span class="k">new</span> <span class="n">wgpu</span><span class="o">::</span><span class="n">BufferDescriptor</span><span class="p">{</span>
          <span class="p">.</span><span class="n">usage</span> <span class="o">=</span> <span class="n">wgpu</span><span class="o">::</span><span class="n">BufferUsage</span><span class="o">::</span><span class="n">Storage</span><span class="p">,</span>
          <span class="p">.</span><span class="n">size</span> <span class="o">=</span> <span class="n">secondMatrixSize</span><span class="p">,</span>
          <span class="p">.</span><span class="n">mappedAtCreation</span> <span class="o">=</span> <span class="nb">true</span><span class="p">,</span>
      <span class="p">});</span>
  <span class="n">std</span><span class="o">::</span><span class="n">memcpy</span><span class="p">(</span><span class="n">gpuBufferSecondMatrix</span><span class="p">.</span><span class="n">GetMappedRange</span><span class="p">(),</span> <span class="n">secondMatrix</span><span class="p">.</span><span class="n">data</span><span class="p">(),</span>
              <span class="n">secondMatrixSize</span><span class="p">);</span>
  <span class="n">gpuBufferSecondMatrix</span><span class="p">.</span><span class="n">Unmap</span><span class="p">();</span>

  <span class="n">std</span><span class="o">::</span><span class="n">cout</span> <span class="o">&lt;&lt;</span> <span class="s">"Second Matrix: "</span> <span class="o">&lt;&lt;</span> <span class="n">std</span><span class="o">::</span><span class="n">endl</span><span class="p">;</span>
  <span class="n">std</span><span class="o">::</span><span class="n">copy</span><span class="p">(</span><span class="n">secondMatrix</span><span class="p">.</span><span class="n">begin</span><span class="p">(),</span> <span class="n">secondMatrix</span><span class="p">.</span><span class="n">end</span><span class="p">(),</span>
            <span class="n">std</span><span class="o">::</span><span class="n">ostream_iterator</span><span class="o">&lt;</span><span class="kt">float</span><span class="o">&gt;</span><span class="p">(</span><span class="n">std</span><span class="o">::</span><span class="n">cout</span><span class="p">,</span> <span class="s">" "</span><span class="p">));</span>
  <span class="n">std</span><span class="o">::</span><span class="n">cout</span> <span class="o">&lt;&lt;</span> <span class="n">std</span><span class="o">::</span><span class="n">endl</span><span class="p">;</span>

  <span class="c1">// Result Matrix</span>
  <span class="n">resultMatrixSize</span> <span class="o">=</span>
      <span class="k">sizeof</span><span class="p">(</span><span class="kt">float</span><span class="p">)</span> <span class="o">*</span> <span class="p">(</span><span class="mi">2</span> <span class="o">+</span> <span class="k">static_cast</span><span class="o">&lt;</span><span class="kt">size_t</span><span class="o">&gt;</span><span class="p">(</span><span class="n">firstMatrix</span><span class="p">[</span><span class="mi">0</span><span class="p">])</span> <span class="o">*</span>
                               <span class="k">static_cast</span><span class="o">&lt;</span><span class="kt">size_t</span><span class="o">&gt;</span><span class="p">(</span><span class="n">secondMatrix</span><span class="p">[</span><span class="mi">1</span><span class="p">]));</span>

  <span class="n">wgpu</span><span class="o">::</span><span class="n">Buffer</span> <span class="n">resultMatrixBuffer</span> <span class="o">=</span>
      <span class="n">device</span><span class="p">.</span><span class="n">CreateBuffer</span><span class="p">(</span><span class="k">new</span> <span class="n">wgpu</span><span class="o">::</span><span class="n">BufferDescriptor</span><span class="p">{</span>
          <span class="p">.</span><span class="n">usage</span> <span class="o">=</span> <span class="n">wgpu</span><span class="o">::</span><span class="n">BufferUsage</span><span class="o">::</span><span class="n">Storage</span> <span class="o">|</span> <span class="n">wgpu</span><span class="o">::</span><span class="n">BufferUsage</span><span class="o">::</span><span class="n">CopySrc</span><span class="p">,</span>
          <span class="p">.</span><span class="n">size</span> <span class="o">=</span> <span class="n">resultMatrixSize</span><span class="p">,</span>
      <span class="p">});</span>

  <span class="c1">// Compute shader code</span>
  <span class="n">wgpu</span><span class="o">::</span><span class="n">ShaderModuleWGSLDescriptor</span> <span class="n">shaderModuleDesc</span> <span class="o">=</span> <span class="p">{};</span>
  <span class="n">shaderModuleDesc</span><span class="p">.</span><span class="n">code</span> <span class="o">=</span> <span class="n">shaderCode</span><span class="p">;</span>
  <span class="n">wgpu</span><span class="o">::</span><span class="n">ShaderModuleDescriptor</span> <span class="n">shaderModuleDescriptor</span><span class="p">{.</span><span class="n">nextInChain</span> <span class="o">=</span>
                                                          <span class="o">&amp;</span><span class="n">shaderModuleDesc</span><span class="p">};</span>
  <span class="n">wgpu</span><span class="o">::</span><span class="n">ShaderModule</span> <span class="n">shaderModule</span> <span class="o">=</span>
      <span class="n">device</span><span class="p">.</span><span class="n">CreateShaderModule</span><span class="p">(</span><span class="o">&amp;</span><span class="n">shaderModuleDescriptor</span><span class="p">);</span>
  <span class="c1">// Pipeline setup</span>
  <span class="n">wgpu</span><span class="o">::</span><span class="n">ComputePipelineDescriptor</span> <span class="n">pipelineDesc</span> <span class="o">=</span> <span class="p">{};</span>
  <span class="n">pipelineDesc</span><span class="p">.</span><span class="n">compute</span><span class="p">.</span><span class="n">module</span> <span class="o">=</span> <span class="n">shaderModule</span><span class="p">;</span>
  <span class="n">pipelineDesc</span><span class="p">.</span><span class="n">compute</span><span class="p">.</span><span class="n">entryPoint</span> <span class="o">=</span> <span class="s">"main"</span><span class="p">;</span>

  <span class="n">wgpu</span><span class="o">::</span><span class="n">ComputePipeline</span> <span class="n">computePipeline</span> <span class="o">=</span>
      <span class="n">device</span><span class="p">.</span><span class="n">CreateComputePipeline</span><span class="p">(</span><span class="o">&amp;</span><span class="n">pipelineDesc</span><span class="p">);</span>

  <span class="c1">// Bind group</span>
  <span class="n">wgpu</span><span class="o">::</span><span class="n">BindGroupDescriptor</span> <span class="n">bindGroupDesc</span> <span class="o">=</span> <span class="p">{};</span>
  <span class="n">wgpu</span><span class="o">::</span><span class="n">BindGroupEntry</span> <span class="n">entries</span><span class="p">[</span><span class="mi">3</span><span class="p">]</span> <span class="o">=</span> <span class="p">{};</span>
  <span class="n">entries</span><span class="p">[</span><span class="mi">0</span><span class="p">].</span><span class="n">binding</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
  <span class="n">entries</span><span class="p">[</span><span class="mi">0</span><span class="p">].</span><span class="n">buffer</span> <span class="o">=</span> <span class="n">gpuBufferFirstMatrix</span><span class="p">;</span>
  <span class="n">entries</span><span class="p">[</span><span class="mi">1</span><span class="p">].</span><span class="n">binding</span> <span class="o">=</span> <span class="mi">1</span><span class="p">;</span>
  <span class="n">entries</span><span class="p">[</span><span class="mi">1</span><span class="p">].</span><span class="n">buffer</span> <span class="o">=</span> <span class="n">gpuBufferSecondMatrix</span><span class="p">;</span>
  <span class="n">entries</span><span class="p">[</span><span class="mi">2</span><span class="p">].</span><span class="n">binding</span> <span class="o">=</span> <span class="mi">2</span><span class="p">;</span>
  <span class="n">entries</span><span class="p">[</span><span class="mi">2</span><span class="p">].</span><span class="n">buffer</span> <span class="o">=</span> <span class="n">resultMatrixBuffer</span><span class="p">;</span>
  <span class="n">bindGroupDesc</span><span class="p">.</span><span class="n">entryCount</span> <span class="o">=</span> <span class="mi">3</span><span class="p">;</span>
  <span class="n">bindGroupDesc</span><span class="p">.</span><span class="n">entries</span> <span class="o">=</span> <span class="n">entries</span><span class="p">;</span>
  <span class="n">bindGroupDesc</span><span class="p">.</span><span class="n">layout</span> <span class="o">=</span> <span class="n">computePipeline</span><span class="p">.</span><span class="n">GetBindGroupLayout</span><span class="p">(</span><span class="mi">0</span><span class="p">);</span>

  <span class="n">wgpu</span><span class="o">::</span><span class="n">BindGroup</span> <span class="n">bindGroup</span> <span class="o">=</span> <span class="n">device</span><span class="p">.</span><span class="n">CreateBindGroup</span><span class="p">(</span><span class="o">&amp;</span><span class="n">bindGroupDesc</span><span class="p">);</span>

  <span class="c1">// Commands submission</span>
  <span class="n">wgpu</span><span class="o">::</span><span class="n">CommandEncoder</span> <span class="n">commandEncoder</span> <span class="o">=</span> <span class="n">device</span><span class="p">.</span><span class="n">CreateCommandEncoder</span><span class="p">();</span>

  <span class="n">wgpu</span><span class="o">::</span><span class="n">ComputePassEncoder</span> <span class="n">passEncoder</span> <span class="o">=</span> <span class="n">commandEncoder</span><span class="p">.</span><span class="n">BeginComputePass</span><span class="p">();</span>
  <span class="n">passEncoder</span><span class="p">.</span><span class="n">SetPipeline</span><span class="p">(</span><span class="n">computePipeline</span><span class="p">);</span>
  <span class="n">passEncoder</span><span class="p">.</span><span class="n">SetBindGroup</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="n">bindGroup</span><span class="p">);</span>
  <span class="kt">uint32_t</span> <span class="n">workgroupCountX</span> <span class="o">=</span>
      <span class="k">static_cast</span><span class="o">&lt;</span><span class="kt">uint32_t</span><span class="o">&gt;</span><span class="p">(</span><span class="n">std</span><span class="o">::</span><span class="n">ceil</span><span class="p">(</span><span class="n">firstMatrix</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span> <span class="o">/</span> <span class="mf">8.0</span><span class="n">f</span><span class="p">));</span>
  <span class="kt">uint32_t</span> <span class="n">workgroupCountY</span> <span class="o">=</span>
      <span class="k">static_cast</span><span class="o">&lt;</span><span class="kt">uint32_t</span><span class="o">&gt;</span><span class="p">(</span><span class="n">std</span><span class="o">::</span><span class="n">ceil</span><span class="p">(</span><span class="n">secondMatrix</span><span class="p">[</span><span class="mi">1</span><span class="p">]</span> <span class="o">/</span> <span class="mf">8.0</span><span class="n">f</span><span class="p">));</span>
  <span class="n">passEncoder</span><span class="p">.</span><span class="n">DispatchWorkgroups</span><span class="p">(</span><span class="n">workgroupCountX</span><span class="p">,</span> <span class="n">workgroupCountY</span><span class="p">);</span>
  <span class="n">passEncoder</span><span class="p">.</span><span class="n">End</span><span class="p">();</span>

  <span class="c1">// Get a GPU buffer for reading in an unmapped state</span>
  <span class="n">gpuReadBuffer</span> <span class="o">=</span> <span class="n">device</span><span class="p">.</span><span class="n">CreateBuffer</span><span class="p">(</span><span class="k">new</span> <span class="n">wgpu</span><span class="o">::</span><span class="n">BufferDescriptor</span><span class="p">{</span>
      <span class="p">.</span><span class="n">usage</span> <span class="o">=</span> <span class="n">wgpu</span><span class="o">::</span><span class="n">BufferUsage</span><span class="o">::</span><span class="n">CopyDst</span> <span class="o">|</span> <span class="n">wgpu</span><span class="o">::</span><span class="n">BufferUsage</span><span class="o">::</span><span class="n">MapRead</span><span class="p">,</span>
      <span class="p">.</span><span class="n">size</span> <span class="o">=</span> <span class="n">resultMatrixSize</span><span class="p">,</span>
  <span class="p">});</span>

  <span class="c1">// Encode commands for copying buffer to buffer</span>
  <span class="n">commandEncoder</span><span class="p">.</span><span class="n">CopyBufferToBuffer</span><span class="p">(</span><span class="n">resultMatrixBuffer</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="n">gpuReadBuffer</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span>
                                    <span class="n">resultMatrixSize</span><span class="p">);</span>

  <span class="c1">// Submit GPU commands</span>
  <span class="n">wgpu</span><span class="o">::</span><span class="n">CommandBuffer</span> <span class="n">commands</span> <span class="o">=</span> <span class="n">commandEncoder</span><span class="p">.</span><span class="n">Finish</span><span class="p">();</span>
  <span class="n">device</span><span class="p">.</span><span class="n">GetQueue</span><span class="p">().</span><span class="n">Submit</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">commands</span><span class="p">);</span>

  <span class="n">std</span><span class="o">::</span><span class="n">cout</span> <span class="o">&lt;&lt;</span> <span class="s">"Commands submitted to the GPU Queue"</span> <span class="o">&lt;&lt;</span> <span class="n">std</span><span class="o">::</span><span class="n">endl</span><span class="p">;</span>

  <span class="n">gpuReadBuffer</span><span class="p">.</span><span class="n">MapAsync</span><span class="p">(</span><span class="n">wgpu</span><span class="o">::</span><span class="n">MapMode</span><span class="o">::</span><span class="n">Read</span><span class="p">,</span> <span class="p">(</span><span class="kt">size_t</span><span class="p">)</span><span class="mi">0</span><span class="p">,</span> <span class="n">resultMatrixSize</span><span class="p">,</span>
                         <span class="n">BufferMapCallbackFunction</span><span class="p">,</span>
                         <span class="k">reinterpret_cast</span><span class="o">&lt;</span><span class="kt">void</span> <span class="o">*&gt;</span><span class="p">(</span><span class="o">&amp;</span><span class="n">work_done</span><span class="p">));</span>
<span class="p">}</span>

<span class="c1">// The content of this function could be in the main()</span>
<span class="c1">// I wrote it like this to show how function export works with Emscripten.</span>
<span class="c1">// It also makes it easier to pass the necessary arguments from the JS side.</span>
<span class="k">extern</span> <span class="s">"C"</span> <span class="p">{</span>
<span class="kt">void</span> <span class="n">RunMatMultWrapper</span><span class="p">()</span> <span class="p">{</span>
  <span class="n">instance</span> <span class="o">=</span> <span class="n">wgpu</span><span class="o">::</span><span class="n">CreateInstance</span><span class="p">();</span>

  <span class="n">GetAdapter</span><span class="p">([]()</span> <span class="p">{</span>
    <span class="n">std</span><span class="o">::</span><span class="n">cout</span> <span class="o">&lt;&lt;</span> <span class="s">"GPU Adapter acquired."</span> <span class="o">&lt;&lt;</span> <span class="n">std</span><span class="o">::</span><span class="n">endl</span><span class="p">;</span>
    <span class="n">GetDevice</span><span class="p">([]()</span> <span class="p">{</span>
      <span class="n">std</span><span class="o">::</span><span class="n">cout</span> <span class="o">&lt;&lt;</span> <span class="s">"GPU Device acquired."</span> <span class="o">&lt;&lt;</span> <span class="n">std</span><span class="o">::</span><span class="n">endl</span><span class="p">;</span>
      <span class="n">RunMatMult</span><span class="p">();</span>
    <span class="p">});</span>
  <span class="p">});</span>

  <span class="c1">// https://eliemichel.github.io/LearnWebGPU/getting-started/the-command-queue.html#device-polling</span>
<span class="cp">#ifdef __EMSCRIPTEN__
</span>  <span class="k">while</span> <span class="p">(</span><span class="o">!</span><span class="n">work_done</span><span class="p">)</span> <span class="p">{</span>
    <span class="n">emscripten_sleep</span><span class="p">(</span><span class="mi">100</span><span class="p">);</span>
  <span class="p">}</span>
<span class="cp">#else
</span>  <span class="k">while</span> <span class="p">(</span><span class="o">!</span><span class="n">work_done</span><span class="p">)</span> <span class="p">{</span>
    <span class="n">instance</span><span class="p">.</span><span class="n">ProcessEvents</span><span class="p">();</span>
  <span class="p">}</span>
<span class="cp">#endif
</span><span class="p">}</span>
<span class="p">}</span>

<span class="kt">int</span> <span class="n">main</span><span class="p">()</span> <span class="p">{</span>
  <span class="n">RunMatMultWrapper</span><span class="p">();</span>
  <span class="k">return</span> <span class="mi">0</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>
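<p>As a quick sanity check, the shader's flat-index arithmetic can be mirrored on the CPU. This is an illustrative sketch only, not part of the project; <code class="language-plaintext highlighter-rouge">MatMulReference</code> is a hypothetical helper that follows the same <code class="language-plaintext highlighter-rouge">Matrix</code> layout (two size floats followed by the row-major data):</p>

```cpp
#include <cstddef>
#include <vector>

// CPU-side sketch of the shader's indexing (hypothetical reference, not the
// project's code). Layout mirrors the WGSL Matrix struct: {rows, cols, data...}.
std::vector<float> MatMulReference(const std::vector<float> &a,
                                   const std::vector<float> &b) {
  const size_t aRows = static_cast<size_t>(a[0]);
  const size_t aCols = static_cast<size_t>(a[1]);
  const size_t bCols = static_cast<size_t>(b[1]);
  std::vector<float> out(aRows * bCols, 0.0f);
  for (size_t x = 0; x < aRows; ++x) {    // plays the role of global_id.x
    for (size_t y = 0; y < bCols; ++y) {  // plays the role of global_id.y
      float result = 0.0f;
      for (size_t i = 0; i < aCols; ++i) {
        // Same flat indices as the WGSL loop, offset by 2 for the size header
        result += a[2 + i + x * aCols] * b[2 + y + i * bCols];
      }
      out[y + x * bCols] = result;
    }
  }
  return out;
}
```

<p>With the sample matrices above (a 2&times;4 times a 4&times;2), this yields {50, 60, 114, 140}, which is what the mapped read-back buffer should print.</p>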

<h2 id="requirements">Requirements</h2>

<ul>
  <li>CMake</li>
  <li>Emscripten</li>
  <li>DCP setup</li>
</ul>

<p>You can use <a href="https://github.com/emscripten-core/emsdk">Emscripten SDK</a> to install all the required tools.
Make sure to set the environment variables (either in your shell profile or every time you want to use the WASM toolchain):</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>emsdk <span class="nb">install </span>latest
emsdk activate latest
<span class="nb">source</span> <span class="s2">"/path/to/emsdk/emsdk_env.sh"</span>
</code></pre></div></div>

<p>To start your DCP journey, see the <a href="https://docs.dcp.dev/intro/getting-setup.html">DCP setup guide</a>.</p>

<h2 id="build-and-run">Build and Run</h2>

<p>The current example is tested with Emscripten <code class="language-plaintext highlighter-rouge">3.1.61</code> (for DCP) and dawn <code class="language-plaintext highlighter-rouge">chrome/6562</code> (as standalone).</p>

<h3 id="build">Build</h3>

<p>You can build the project using the <code class="language-plaintext highlighter-rouge">./clean-and-build.sh</code> script with one of these options:</p>

<ol>
  <li>DCP (to build and deploy the package)</li>
  <li>web (to test the code as a standalone web example)</li>
  <li>native (to run the binary file natively, again standalone)</li>
</ol>

<p>For <code class="language-plaintext highlighter-rouge">DCP</code> and <code class="language-plaintext highlighter-rouge">web</code> options, the script uses emscripten cmake:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>emcmake cmake <span class="nt">-B</span> build-web <span class="o">&amp;&amp;</span> cmake <span class="nt">--build</span> build-web
</code></pre></div></div>

<p>For native builds, make sure to set the correct path to the <code class="language-plaintext highlighter-rouge">dawn</code> directory in the <code class="language-plaintext highlighter-rouge">CMakeLists.txt</code>; the script then has CMake take care of building it with:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>cmake <span class="nt">-B</span> build <span class="o">&amp;&amp;</span> cmake <span class="nt">--build</span> build <span class="nt">-j4</span>

<span class="c"># For debugging, you can add the following option</span>
cmake <span class="nt">-DCMAKE_BUILD_TYPE</span><span class="o">=</span>Debug <span class="nt">-B</span> build <span class="o">&amp;&amp;</span> cmake <span class="nt">--build</span> build <span class="nt">-j4</span>
</code></pre></div></div>

<p>This is the content of the <code class="language-plaintext highlighter-rouge">./clean-and-build.sh</code> script:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c">#!/bin/bash</span>

<span class="nb">set</span> <span class="nt">-eux</span>

<span class="c"># Arg could be DCP (default), web, or native</span>
<span class="c"># Default to DCP if no argument is passed</span>
<span class="nv">MODE</span><span class="o">=</span><span class="s2">"</span><span class="k">${</span><span class="nv">1</span><span class="k">:-</span><span class="nv">DCP</span><span class="k">}</span><span class="s2">"</span>

<span class="nv">BUILD_DIR</span><span class="o">=</span>package/build
<span class="nv">BUILD_WEB_DIR</span><span class="o">=</span>package/build-web

<span class="c"># Function to prompt the user for confirmation</span>
confirm<span class="o">()</span> <span class="o">{</span>
  <span class="nb">local dir</span><span class="o">=</span><span class="s2">"</span><span class="nv">$1</span><span class="s2">"</span>
  <span class="nb">read</span> <span class="nt">-r</span> <span class="nt">-p</span> <span class="s2">"Are you sure you want to remove </span><span class="k">${</span><span class="nv">dir</span><span class="k">}</span><span class="s2">? [y/N] "</span> response
  <span class="k">case</span> <span class="s2">"</span><span class="nv">$response</span><span class="s2">"</span> <span class="k">in</span>
  <span class="o">[</span>yY][eE][sS] <span class="p">|</span> <span class="o">[</span>yY]<span class="p">)</span>
    <span class="nb">true</span>
    <span class="p">;;</span>
  <span class="k">*</span><span class="p">)</span>
    <span class="nb">false</span>
    <span class="p">;;</span>
  <span class="k">esac</span>
<span class="o">}</span>
<span class="c"># Set CMake options based on the MODE</span>
<span class="k">if</span> <span class="o">[</span> <span class="s2">"</span><span class="nv">$MODE</span><span class="s2">"</span> <span class="o">==</span> <span class="s2">"DCP"</span> <span class="o">]</span><span class="p">;</span> <span class="k">then
  </span><span class="nv">CMAKE_OPTIONS</span><span class="o">=</span><span class="s2">"-DBUILD_FOR_DCP=ON"</span>
<span class="k">elif</span> <span class="o">[</span> <span class="s2">"</span><span class="nv">$MODE</span><span class="s2">"</span> <span class="o">==</span> <span class="s2">"web"</span> <span class="o">]</span><span class="p">;</span> <span class="k">then
  </span><span class="nb">echo</span> <span class="s2">"Standalone web mode is enabled. Setting DCP to off."</span>
  <span class="nv">CMAKE_OPTIONS</span><span class="o">=</span><span class="s2">"-DBUILD_FOR_DCP=OFF"</span>
<span class="k">elif</span> <span class="o">[</span> <span class="s2">"</span><span class="nv">$MODE</span><span class="s2">"</span> <span class="o">==</span> <span class="s2">"native"</span> <span class="o">]</span><span class="p">;</span> <span class="k">then
  </span><span class="nb">echo</span> <span class="s2">"Standalone native mode is enabled. Setting DCP to off."</span>
  <span class="nv">CMAKE_OPTIONS</span><span class="o">=</span><span class="s2">"-DBUILD_FOR_DCP=OFF"</span>
<span class="k">else
  </span><span class="nb">echo</span> <span class="s2">"No valid option was passed. Options are DCP (default), web, or native. Falling back to DCP."</span>
  <span class="nv">CMAKE_OPTIONS</span><span class="o">=</span><span class="s2">"-DBUILD_FOR_DCP=ON"</span>
  <span class="nv">MODE</span><span class="o">=</span><span class="s2">"DCP"</span>
<span class="k">fi

if</span> <span class="o">[</span> <span class="s2">"</span><span class="nv">$MODE</span><span class="s2">"</span> <span class="o">==</span> <span class="s2">"native"</span> <span class="o">]</span><span class="p">;</span> <span class="k">then
  if </span>confirm <span class="s2">"</span><span class="nv">$BUILD_DIR</span><span class="s2">"</span><span class="p">;</span> <span class="k">then
    </span><span class="nb">echo</span> <span class="s2">"Doing a clean build!"</span>
    <span class="nb">rm</span> <span class="nt">-rf</span> <span class="s2">"</span><span class="nv">$BUILD_DIR</span><span class="s2">"</span>
  <span class="k">fi
  </span>cmake <span class="nt">-S</span> package <span class="nt">-B</span> package/build <span class="o">&amp;&amp;</span> cmake <span class="nt">--build</span> package/build <span class="nt">-j4</span>
<span class="k">else
  if </span>confirm <span class="s2">"</span><span class="nv">$BUILD_WEB_DIR</span><span class="s2">"</span><span class="p">;</span> <span class="k">then
    </span><span class="nb">echo</span> <span class="s2">"Doing a clean build!"</span>
    <span class="nb">rm</span> <span class="nt">-rf</span> <span class="s2">"</span><span class="nv">$BUILD_WEB_DIR</span><span class="s2">"</span>
  <span class="k">fi
  </span>emcmake cmake <span class="nt">-S</span> package <span class="nt">-B</span> package/build-web <span class="nv">$CMAKE_OPTIONS</span> <span class="o">&amp;&amp;</span>
    cmake <span class="nt">--build</span> package/build-web <span class="nt">--</span> <span class="nv">VERBOSE</span><span class="o">=</span>1
<span class="k">fi</span>

<span class="c"># Run additional commands only if MODE is DCP</span>
<span class="k">if</span> <span class="o">[</span> <span class="s2">"</span><span class="nv">$MODE</span><span class="s2">"</span> <span class="o">==</span> <span class="s2">"DCP"</span> <span class="o">]</span><span class="p">;</span> <span class="k">then
  </span>node ./updateVersion.js
  npm i <span class="nt">-g</span> dcp-client
  publish package package/package.dcp
<span class="k">fi</span>
</code></pre></div></div>

<ul>
  <li>The <code class="language-plaintext highlighter-rouge">./package/CMakeLists.txt</code> file:</li>
</ul>

<div class="language-cmake highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">cmake_minimum_required</span><span class="p">(</span>VERSION 3.13<span class="p">)</span>
<span class="nb">project</span><span class="p">(</span>wasm-webgpu-matmult LANGUAGES C CXX<span class="p">)</span>
<span class="nb">set</span><span class="p">(</span>CMAKE_CXX_STANDARD 20<span class="p">)</span>

<span class="nb">add_executable</span><span class="p">(</span>wasm-webgpu-matmult <span class="s2">"src/wasm-webgpu-matmult.cpp"</span><span class="p">)</span>

<span class="nb">if</span><span class="p">(</span>EMSCRIPTEN<span class="p">)</span>
  <span class="c1"># Create a JS file only, and not the html template file</span>
  <span class="nb">set_target_properties</span><span class="p">(</span>wasm-webgpu-matmult PROPERTIES SUFFIX <span class="s2">".js"</span><span class="p">)</span>

  <span class="nb">target_link_options</span><span class="p">(</span>wasm-webgpu-matmult PRIVATE <span class="s2">"-sSINGLE_FILE=1"</span><span class="p">)</span>

  <span class="c1"># Enable WebGPU through (webgpu/webgpu.h)</span>
  <span class="nb">target_link_options</span><span class="p">(</span>wasm-webgpu-matmult PRIVATE <span class="s2">"-sUSE_WEBGPU=1"</span><span class="p">)</span>

  <span class="c1"># Help with printing stack trace, error prevention</span>
  <span class="nb">target_link_options</span><span class="p">(</span>wasm-webgpu-matmult PRIVATE <span class="s2">"-sASSERTIONS=1"</span><span class="p">)</span>

  <span class="c1"># Enable memory growth at runtime and refrain from throwing exception</span>
  <span class="nb">target_link_options</span><span class="p">(</span>wasm-webgpu-matmult PRIVATE <span class="s2">"-sALLOW_MEMORY_GROWTH=1"</span><span class="p">)</span>

  <span class="c1"># -sWASM=0 would disable WASM module generation (everything in a JS file).</span>
  <span class="c1"># So far, passing -sWASM=0 or -sWASM=1 doesn't make any difference :-?</span>
  <span class="nb">target_link_options</span><span class="p">(</span>wasm-webgpu-matmult PRIVATE <span class="s2">"-sWASM=1"</span><span class="p">)</span>

  <span class="c1"># Whether to support async operations in the compiled code. This makes it</span>
  <span class="c1"># possible to call JS functions from synchronous-looking code in C/C++.</span>
  <span class="nb">target_link_options</span><span class="p">(</span>wasm-webgpu-matmult PRIVATE <span class="s2">"-sASYNCIFY=1"</span><span class="p">)</span>

  <span class="c1"># Enable optimization in code speed and size</span>
  <span class="nb">target_link_options</span><span class="p">(</span>wasm-webgpu-matmult PRIVATE <span class="s2">"-O3"</span><span class="p">)</span>

  <span class="nb">target_link_options</span><span class="p">(</span>
    wasm-webgpu-matmult PRIVATE
    <span class="s2">"-sEXPORTED_RUNTIME_METHODS=['ccall','cwrap','callMain']"</span>
  <span class="p">)</span>

  <span class="c1"># Symbols that are explicitly exported. These symbols are kept alive through</span>
  <span class="c1"># LLVM dead code elimination, and also made accessible outside of the</span>
  <span class="c1"># generated code even after running closure compiler (on "Module").  Native</span>
  <span class="c1"># symbols listed here require an ``_`` prefix. By default if this setting is</span>
  <span class="c1"># not specified on the command line the ``_main`` function will be implicitly</span>
  <span class="c1"># exported.  In STANDALONE_WASM mode the default export is ``__start`` (or</span>
  <span class="c1"># ``__initialize`` if --no-entry is specified). JS Library symbols can also be</span>
  <span class="c1"># added to this list (without the leading `$`). var EXPORTED_FUNCTIONS = [];</span>
  <span class="nb">target_link_options</span><span class="p">(</span>
    wasm-webgpu-matmult PRIVATE
    <span class="s2">"-sEXPORTED_FUNCTIONS=['_RunMatMultWrapper','_main']"</span>
  <span class="p">)</span>

  <span class="c1"># Whether we will run the main() function. Disable if you embed the generated</span>
  <span class="c1"># code in your own, and will call main() yourself at the right time (which you</span>
  <span class="c1"># can do with Module.callMain())</span>
  <span class="nb">if</span><span class="p">(</span>BUILD_FOR_DCP<span class="p">)</span>
    <span class="nb">target_link_options</span><span class="p">(</span>wasm-webgpu-matmult PRIVATE <span class="s2">"-sINVOKE_RUN=0"</span><span class="p">)</span>
  <span class="nb">else</span><span class="p">()</span>
    <span class="nb">target_link_options</span><span class="p">(</span>wasm-webgpu-matmult PRIVATE <span class="s2">"-sINVOKE_RUN=1"</span><span class="p">)</span>
  <span class="nb">endif</span><span class="p">()</span>

  <span class="c1"># Specify which runtime environments the JS output will be capable of running</span>
  <span class="c1"># in.  For maximum portability this can be configured to support all environments</span>
  <span class="c1"># or it can be limited to reduce overall code size.</span>
  <span class="c1"># var ENVIRONMENT = 'web,webview,worker,node';</span>
  <span class="nb">target_link_options</span><span class="p">(</span>wasm-webgpu-matmult PRIVATE <span class="s2">"-sENVIRONMENT=worker"</span><span class="p">)</span>

  <span class="c1"># If set to 0, does not build in any filesystem support. Useful if you are</span>
  <span class="c1"># just doing pure computation, but not reading files or using any streams</span>
  <span class="c1"># (including fprintf, and other stdio.h things) or anything related.</span>
  <span class="nb">target_link_options</span><span class="p">(</span>wasm-webgpu-matmult PRIVATE <span class="s2">"-sFILESYSTEM=1"</span><span class="p">)</span>

  <span class="nb">if</span><span class="p">(</span>BUILD_FOR_DCP<span class="p">)</span>
    <span class="nb">target_link_options</span><span class="p">(</span>
      wasm-webgpu-matmult PUBLIC
      <span class="s2">"--extern-pre-js=</span><span class="si">${</span><span class="nv">PROJECT_SOURCE_DIR</span><span class="si">}</span><span class="s2">/src/openbravo.js"</span>
    <span class="p">)</span>

    <span class="nb">target_link_options</span><span class="p">(</span>
      wasm-webgpu-matmult PUBLIC
      <span class="s2">"--extern-post-js=</span><span class="si">${</span><span class="nv">PROJECT_SOURCE_DIR</span><span class="si">}</span><span class="s2">/src/closebravo.js"</span>
    <span class="p">)</span>
  <span class="nb">endif</span><span class="p">()</span>
  
<span class="nb">else</span><span class="p">()</span>
  <span class="nb">set</span><span class="p">(</span>DAWN_FETCH_DEPENDENCIES ON<span class="p">)</span>
  <span class="nb">add_subdirectory</span><span class="p">(</span><span class="s2">"../../dawn"</span> <span class="s2">"build"</span> EXCLUDE_FROM_ALL<span class="p">)</span>
  <span class="nb">target_link_libraries</span><span class="p">(</span>wasm-webgpu-matmult PRIVATE webgpu_cpp webgpu_dawn<span class="p">)</span>
<span class="nb">endif</span><span class="p">()</span>
</code></pre></div></div>

<h3 id="test-in-the-browser-without-dcp">Test in the browser without DCP</h3>

<p>To test the example in the browser without DCP, simply open the file <code class="language-plaintext highlighter-rouge">index.html</code> under <code class="language-plaintext highlighter-rouge">package/src</code> in the browser.</p>

<p>Make sure to enable <code class="language-plaintext highlighter-rouge">WebGPU</code> in your browser first! For instance, on Linux with Chrome Unstable, launch the browser with the necessary flags:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>google-chrome-unstable <span class="nt">--enable-unsafe-webgpu</span> <span class="nt">--enable-features</span><span class="o">=</span>Vulkan <span class="se">\</span>
  <span class="nt">--disable-dawn-features</span><span class="o">=</span>disallow_unsafe_apis &amp;
</code></pre></div></div>

<p>This is the <code class="language-plaintext highlighter-rouge">./package/src/index.html</code> file:</p>

<div class="language-html highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cp">&lt;!doctype html&gt;</span>
<span class="nt">&lt;html</span> <span class="na">lang=</span><span class="s">"en"</span><span class="nt">&gt;</span>
  <span class="nt">&lt;head&gt;</span>
    <span class="nt">&lt;meta</span> <span class="na">charset=</span><span class="s">"UTF-8"</span> <span class="nt">/&gt;</span>
    <span class="nt">&lt;title&gt;</span>WASM + WebGPU<span class="nt">&lt;/title&gt;</span>
    <span class="nt">&lt;script </span><span class="na">type=</span><span class="s">"module"</span> <span class="na">crossorigin</span> <span class="na">src=</span><span class="s">"../build-web/wasm-webgpu-matmult.js"</span><span class="nt">&gt;&lt;/script&gt;</span>
  <span class="nt">&lt;/head&gt;</span>

  <span class="nt">&lt;body&gt;</span>
    <span class="nt">&lt;pre&gt;</span>Open the console!<span class="nt">&lt;/pre&gt;</span>
  <span class="nt">&lt;/body&gt;</span>
<span class="nt">&lt;/html&gt;</span>
</code></pre></div></div>

<h3 id="test-the-binary-natively">Test the binary natively</h3>

<p>This step is optional, but if you build the standalone native binary, the output should look something like this:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span>./package/build/wasm-webgpu-matmult

GPU Adapter acquired.
Warning: SetUncapturedErrorCallback is deprecated. Pass the callback <span class="k">in </span>the device descriptor instead.
GPU Device acquired.
First Matrix: 
2 4 1 2 3 4 5 6 7 8 
Second Matrix: 
4 2 1 2 3 4 5 6 7 8 
Commands submitted to the GPU Queue
Warning: Old MapAsync APIs are deprecated. If using C please pass a CallbackInfo struct that has two userdatas. Otherwise, <span class="k">if </span>using C++, please use templated helpers.
In Buffer async call back, status: 1
Result Matrix: 
2 2 50 60 114 140 
Warning: No Dawn device lost callback was set. This is probably not intended. If you really want to ignore device lost and suppress this message, <span class="nb">set </span>the callback explicitly.
</code></pre></div></div>

<h3 id="deploying-the-job-on-dcp">Deploying the job on DCP</h3>

<p>Again, the script <code class="language-plaintext highlighter-rouge">./clean-and-build.sh DCP</code> performs the necessary preparation steps. More specifically:</p>

<ol>
  <li>The code gets built and wrapped between <code class="language-plaintext highlighter-rouge">openbravo.js</code> and <code class="language-plaintext highlighter-rouge">closebravo.js</code> under the <code class="language-plaintext highlighter-rouge">package/src/</code> directory to make it a DCP-friendly module.</li>
  <li>The version number in <code class="language-plaintext highlighter-rouge">package/package.dcp</code> will be updated.</li>
  <li>The npm package <code class="language-plaintext highlighter-rouge">dcp-client</code> will be installed.</li>
  <li>The source <code class="language-plaintext highlighter-rouge">wasm-webgpu-matmult.js</code> under <code class="language-plaintext highlighter-rouge">package/build-web/</code> will be deployed.</li>
</ol>

<p>After that, the job can be deployed to the scheduler specified in the environment variable <code class="language-plaintext highlighter-rouge">DCP_SCHEDULER_LOCATION</code>. Make sure you have valid authentication keys. Also, note that the current deploy function in <code class="language-plaintext highlighter-rouge">deployJob.js</code> deploys the job to the default compute group.</p>
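<p>As a quick sketch (the URL below is a placeholder, not a real scheduler; substitute your own deployment's address), you can also set the scheduler location from Node itself before <code class="language-plaintext highlighter-rouge">dcp-client</code> initializes:</p>

```javascript
// Hypothetical sketch: pick the scheduler before dcp-client initializes.
// 'https://scheduler.example.com' is a placeholder -- replace it with the
// address of the scheduler you actually deploy to.
if (!process.env.DCP_SCHEDULER_LOCATION) {
  process.env.DCP_SCHEDULER_LOCATION = 'https://scheduler.example.com';
}
console.log(`Deploying to: ${process.env.DCP_SCHEDULER_LOCATION}`);
```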

<ul>
  <li>The <code class="language-plaintext highlighter-rouge">./updateVersion.js</code> script, which updates the version number in the <code class="language-plaintext highlighter-rouge">./package/package.dcp</code> file:</li>
</ul>

<div class="language-js highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kd">const</span> <span class="nx">fs</span> <span class="o">=</span> <span class="nx">require</span><span class="p">(</span><span class="dl">'</span><span class="s1">node:fs</span><span class="dl">'</span><span class="p">);</span>
<span class="kd">const</span> <span class="nx">content</span> <span class="o">=</span> <span class="nx">JSON</span><span class="p">.</span><span class="nx">parse</span><span class="p">(</span>
  <span class="nx">fs</span><span class="p">.</span><span class="nx">readFileSync</span><span class="p">(</span><span class="dl">'</span><span class="s1">./package/package.dcp</span><span class="dl">'</span><span class="p">,</span> <span class="p">{</span> <span class="na">encoding</span><span class="p">:</span> <span class="dl">'</span><span class="s1">utf8</span><span class="dl">'</span> <span class="p">}),</span>
<span class="p">);</span>

<span class="kd">const</span> <span class="nx">version</span> <span class="o">=</span> <span class="nx">content</span><span class="p">.</span><span class="nx">version</span><span class="p">.</span><span class="nx">split</span><span class="p">(</span><span class="dl">'</span><span class="s1">.</span><span class="dl">'</span><span class="p">);</span>
<span class="nx">version</span><span class="p">[</span><span class="mi">2</span><span class="p">]</span> <span class="o">=</span> <span class="o">+</span><span class="nx">version</span><span class="p">[</span><span class="mi">2</span><span class="p">]</span> <span class="o">+</span> <span class="mi">1</span><span class="p">;</span>
<span class="nx">content</span><span class="p">.</span><span class="nx">version</span> <span class="o">=</span> <span class="nx">version</span><span class="p">.</span><span class="nx">join</span><span class="p">(</span><span class="dl">'</span><span class="s1">.</span><span class="dl">'</span><span class="p">);</span>

<span class="nx">fs</span><span class="p">.</span><span class="nx">writeFileSync</span><span class="p">(</span><span class="dl">'</span><span class="s1">./package/package.dcp</span><span class="dl">'</span><span class="p">,</span> <span class="nx">JSON</span><span class="p">.</span><span class="nx">stringify</span><span class="p">(</span><span class="nx">content</span><span class="p">),</span> <span class="p">{</span>
  <span class="na">encoding</span><span class="p">:</span> <span class="dl">'</span><span class="s1">utf8</span><span class="dl">'</span><span class="p">,</span>
<span class="p">});</span>
</code></pre></div></div>
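<p>The bump itself is a plain patch-level increment; the same logic in isolation:</p>

```javascript
// Standalone sketch of the patch bump in updateVersion.js: split the
// semver string, increment the patch component numerically, re-join.
const version = '0.0.19'.split('.');
version[2] = +version[2] + 1;
console.log(version.join('.')); // → 0.0.20
```

<p>Note that <code class="language-plaintext highlighter-rouge">JSON.stringify(content)</code> writes the file back minified; pass an indent argument (e.g. <code class="language-plaintext highlighter-rouge">JSON.stringify(content, null, 2)</code>) if you want to keep it readable.</p>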

<ul>
  <li>The <code class="language-plaintext highlighter-rouge">./package/package.dcp</code> file:</li>
</ul>

<div class="language-json highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">{</span><span class="w">
  </span><span class="nl">"name"</span><span class="p">:</span><span class="w"> </span><span class="s2">"wasm-webgpu-matmult"</span><span class="p">,</span><span class="w">
  </span><span class="nl">"version"</span><span class="p">:</span><span class="w"> </span><span class="s2">"0.0.19"</span><span class="p">,</span><span class="w">
  </span><span class="nl">"files"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w">
    </span><span class="nl">"./build-web/wasm-webgpu-matmult.js"</span><span class="p">:</span><span class="w"> </span><span class="s2">"wasm-webgpu-matmult.js"</span><span class="w">
  </span><span class="p">}</span><span class="w">
</span><span class="p">}</span><span class="w">
</span></code></pre></div></div>

<ul>
  <li>The <code class="language-plaintext highlighter-rouge">./package/src/openbravo.js</code> file:</li>
</ul>

<div class="language-js highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">// file name: openbravo.js</span>

<span class="c1">// This is a BravoJS module definition, generated for DCP</span>
<span class="nx">module</span><span class="p">.</span><span class="nx">declare</span><span class="p">([],</span> <span class="kd">function</span><span class="p">(</span><span class="nx">require</span><span class="p">,</span> <span class="nx">exports</span><span class="p">,</span> <span class="nx">module</span><span class="p">)</span> <span class="p">{</span>
</code></pre></div></div>

<ul>
  <li>The <code class="language-plaintext highlighter-rouge">./package/src/closebravo.js</code> file:</li>
</ul>

<div class="language-js highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">// file name: closebravo.js</span>

<span class="nx">exports</span><span class="p">.</span><span class="nx">Module</span> <span class="o">=</span> <span class="nx">Module</span><span class="p">;</span>
<span class="nx">exports</span><span class="p">.</span><span class="nx">ccall</span> <span class="o">=</span> <span class="nx">ccall</span><span class="p">;</span>
<span class="nx">exports</span><span class="p">.</span><span class="nx">cwrap</span> <span class="o">=</span> <span class="nx">cwrap</span><span class="p">;</span>
<span class="p">});</span>

<span class="c1">// This concludes the BravoJS module definition</span>
</code></pre></div></div>

<ul>
  <li>This is the content of the <code class="language-plaintext highlighter-rouge">./deployJob.js</code> script:</li>
</ul>

<div class="language-js highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cp">#!/usr/bin/env node
</span>
<span class="k">async</span> <span class="kd">function</span> <span class="nx">workFn</span><span class="p">(</span><span class="nx">sliceNumber</span><span class="p">,</span> <span class="nx">arg</span><span class="p">)</span> <span class="p">{</span>
  <span class="nx">progress</span><span class="p">();</span>
  <span class="kd">const</span> <span class="p">{</span> <span class="nx">Module</span> <span class="p">}</span> <span class="o">=</span> <span class="nx">require</span><span class="p">(</span><span class="dl">'</span><span class="s1">wasm-webgpu-matmult.js</span><span class="dl">'</span><span class="p">);</span>

  <span class="k">async</span> <span class="kd">function</span> <span class="nx">matmult</span><span class="p">()</span> <span class="p">{</span>
    <span class="c1">// cwrap(function name, return type, arg types); a null return type means void</span>
    <span class="kd">const</span> <span class="nx">RunMatMultWrapper</span> <span class="o">=</span> <span class="nx">Module</span><span class="p">.</span><span class="nx">cwrap</span><span class="p">(</span><span class="dl">'</span><span class="s1">RunMatMultWrapper</span><span class="dl">'</span><span class="p">,</span> <span class="kc">null</span><span class="p">,</span> <span class="p">[],</span> <span class="p">{</span>
      <span class="na">async</span><span class="p">:</span> <span class="kc">true</span><span class="p">,</span>
    <span class="p">});</span>
    <span class="k">await</span> <span class="nx">RunMatMultWrapper</span><span class="p">();</span>
  <span class="p">}</span>

  <span class="k">return</span> <span class="k">new</span> <span class="nb">Promise</span><span class="p">((</span><span class="nx">res</span><span class="p">)</span> <span class="o">=&gt;</span> <span class="p">{</span>
    <span class="k">if</span> <span class="p">(</span><span class="nx">Module</span><span class="p">.</span><span class="nx">onRuntimeInitialized</span><span class="p">)</span> <span class="p">{</span>
      <span class="kd">const</span> <span class="nx">result</span> <span class="o">=</span> <span class="nx">matmult</span><span class="p">();</span>
      <span class="nx">res</span><span class="p">(</span><span class="nx">result</span><span class="p">);</span>
    <span class="p">}</span> <span class="k">else</span> <span class="p">{</span>
      <span class="nx">Module</span><span class="p">.</span><span class="nx">onRuntimeInitialized</span> <span class="o">=</span> <span class="p">()</span> <span class="o">=&gt;</span> <span class="p">{</span>
        <span class="kd">const</span> <span class="nx">result</span> <span class="o">=</span> <span class="nx">matmult</span><span class="p">();</span>
        <span class="nx">res</span><span class="p">(</span><span class="nx">result</span><span class="p">);</span>
      <span class="p">};</span>
    <span class="p">}</span>
  <span class="p">});</span>
<span class="p">}</span>

<span class="k">async</span> <span class="kd">function</span> <span class="nx">deployJob</span><span class="p">()</span> <span class="p">{</span>
  <span class="k">await</span> <span class="nx">require</span><span class="p">(</span><span class="dl">'</span><span class="s1">dcp-client</span><span class="dl">'</span><span class="p">).</span><span class="nx">init</span><span class="p">();</span>

  <span class="kd">let</span> <span class="nx">startTime</span><span class="p">;</span>

  <span class="kd">const</span> <span class="nx">compute</span> <span class="o">=</span> <span class="nx">require</span><span class="p">(</span><span class="dl">'</span><span class="s1">dcp/compute</span><span class="dl">'</span><span class="p">);</span>
  <span class="kd">const</span> <span class="nx">wallet</span> <span class="o">=</span> <span class="nx">require</span><span class="p">(</span><span class="dl">'</span><span class="s1">dcp/wallet</span><span class="dl">'</span><span class="p">);</span>

  <span class="kd">const</span> <span class="nx">job</span> <span class="o">=</span> <span class="nx">compute</span><span class="p">.</span><span class="k">for</span><span class="p">([</span><span class="mi">1</span><span class="p">],</span> <span class="nx">workFn</span><span class="p">,</span> <span class="p">[</span><span class="mi">0</span><span class="p">]);</span>

  <span class="c1">// Get the stringified message from the worker and log</span>
  <span class="nx">job</span><span class="p">.</span><span class="nx">on</span><span class="p">(</span><span class="dl">'</span><span class="s1">console</span><span class="dl">'</span><span class="p">,</span> <span class="p">(</span><span class="nx">message</span><span class="p">)</span> <span class="o">=&gt;</span> <span class="nx">console</span><span class="p">.</span><span class="nx">log</span><span class="p">(</span><span class="nx">message</span><span class="p">));</span>

  <span class="c1">// job.requirements.discrete = true;</span>
  <span class="nx">job</span><span class="p">.</span><span class="nx">on</span><span class="p">(</span><span class="dl">'</span><span class="s1">accepted</span><span class="dl">'</span><span class="p">,</span> <span class="p">()</span> <span class="o">=&gt;</span> <span class="p">{</span>
    <span class="nx">console</span><span class="p">.</span><span class="nx">log</span><span class="p">(</span><span class="dl">'</span><span class="s1"> - Job accepted by scheduler, waiting for results</span><span class="dl">'</span><span class="p">);</span>
    <span class="nx">console</span><span class="p">.</span><span class="nx">log</span><span class="p">(</span><span class="s2">` - Job has id </span><span class="p">${</span><span class="nx">job</span><span class="p">.</span><span class="nx">id</span><span class="p">}</span><span class="s2">`</span><span class="p">);</span>
    <span class="nx">startTime</span> <span class="o">=</span> <span class="nb">Date</span><span class="p">.</span><span class="nx">now</span><span class="p">();</span>
  <span class="p">});</span>

  <span class="nx">job</span><span class="p">.</span><span class="nx">on</span><span class="p">(</span><span class="dl">'</span><span class="s1">readystatechange</span><span class="dl">'</span><span class="p">,</span> <span class="p">(</span><span class="nx">arg</span><span class="p">)</span> <span class="o">=&gt;</span> <span class="p">{</span>
    <span class="nx">console</span><span class="p">.</span><span class="nx">log</span><span class="p">(</span><span class="s2">`new ready state: </span><span class="p">${</span><span class="nx">arg</span><span class="p">}</span><span class="s2">`</span><span class="p">);</span>
  <span class="p">});</span>

  <span class="nx">job</span><span class="p">.</span><span class="nx">on</span><span class="p">(</span><span class="dl">'</span><span class="s1">result</span><span class="dl">'</span><span class="p">,</span> <span class="p">(</span><span class="nx">ev</span><span class="p">)</span> <span class="o">=&gt;</span> <span class="p">{</span>
    <span class="nx">console</span><span class="p">.</span><span class="nx">log</span><span class="p">(</span>
      <span class="s2">` - Received result for slice </span><span class="p">${</span><span class="nx">ev</span><span class="p">.</span><span class="nx">sliceNumber</span><span class="p">}</span><span class="s2"> at </span><span class="p">${</span>
        <span class="nb">Math</span><span class="p">.</span><span class="nx">round</span><span class="p">((</span><span class="nb">Date</span><span class="p">.</span><span class="nx">now</span><span class="p">()</span> <span class="o">-</span> <span class="nx">startTime</span><span class="p">)</span> <span class="o">/</span> <span class="mi">100</span><span class="p">)</span> <span class="o">/</span> <span class="mi">10</span>
      <span class="p">}</span><span class="s2">s`</span><span class="p">,</span>
    <span class="p">);</span>
  <span class="p">});</span>

  <span class="nx">job</span><span class="p">.</span><span class="nx">on</span><span class="p">(</span><span class="dl">'</span><span class="s1">status</span><span class="dl">'</span><span class="p">,</span> <span class="p">(</span><span class="nx">ev</span><span class="p">)</span> <span class="o">=&gt;</span> <span class="p">{</span>
    <span class="nx">console</span><span class="p">.</span><span class="nx">log</span><span class="p">(</span><span class="dl">'</span><span class="s1">Got status update: </span><span class="dl">'</span><span class="p">,</span> <span class="nx">ev</span><span class="p">);</span>
  <span class="p">});</span>

  <span class="nx">job</span><span class="p">.</span><span class="nx">on</span><span class="p">(</span><span class="dl">'</span><span class="s1">error</span><span class="dl">'</span><span class="p">,</span> <span class="p">(</span><span class="nx">message</span><span class="p">)</span> <span class="o">=&gt;</span> <span class="nx">console</span><span class="p">.</span><span class="nx">log</span><span class="p">(</span><span class="nx">message</span><span class="p">));</span>

  <span class="kd">const</span> <span class="nx">ks</span> <span class="o">=</span> <span class="k">await</span> <span class="nx">wallet</span><span class="p">.</span><span class="kd">get</span><span class="p">();</span> <span class="cm">/* usually loads ~/.dcp/default.keystore */</span>
  <span class="nx">job</span><span class="p">.</span><span class="nx">requires</span><span class="p">([</span><span class="dl">'</span><span class="s1">wasm-webgpu-matmult/wasm-webgpu-matmult.js</span><span class="dl">'</span><span class="p">]);</span>
  <span class="nx">job</span><span class="p">.</span><span class="kr">public</span><span class="p">.</span><span class="nx">name</span> <span class="o">=</span> <span class="dl">'</span><span class="s1">wasm-webgpu-matmult</span><span class="dl">'</span><span class="p">;</span>
  <span class="nx">job</span><span class="p">.</span><span class="nx">requirements</span><span class="p">.</span><span class="nx">environment</span> <span class="o">=</span> <span class="p">{</span> <span class="na">webgpu</span><span class="p">:</span> <span class="kc">true</span> <span class="p">};</span>
  <span class="nx">job</span><span class="p">.</span><span class="nx">setPaymentAccountKeystore</span><span class="p">(</span><span class="nx">ks</span><span class="p">);</span>

  <span class="kd">const</span> <span class="nx">results</span> <span class="o">=</span> <span class="k">await</span> <span class="nx">job</span><span class="p">.</span><span class="nx">exec</span><span class="p">();</span>
  <span class="nx">console</span><span class="p">.</span><span class="nx">log</span><span class="p">(</span><span class="dl">'</span><span class="s1">results=</span><span class="dl">'</span><span class="p">,</span> <span class="nb">Array</span><span class="p">.</span><span class="k">from</span><span class="p">(</span><span class="nx">results</span><span class="p">));</span>
<span class="p">}</span>

<span class="nx">exports</span><span class="p">.</span><span class="nx">deployJob</span> <span class="o">=</span> <span class="nx">deployJob</span><span class="p">;</span>
<span class="nx">deployJob</span><span class="p">();</span>
</code></pre></div></div>

<p>To deploy the job, simply run:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>node deployJob.js
</code></pre></div></div>

<p>The current code has a lot of logging messages, so the output should look something like the following:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span>node deployJob.js
new ready state: <span class="nb">exec
</span>new ready state: init
new ready state: preauth
new ready state: deploying
new ready state: listeners
new ready state: compute-groups
new ready state: uploading
new ready state: deployed
 - Job accepted by scheduler, waiting <span class="k">for </span>results

<span class="o">{</span>
  level: <span class="s1">'log'</span>,
  message: <span class="o">[</span> <span class="s1">'GPU Adapter acquired.'</span> <span class="o">]</span>,
  sliceNumber: 1
<span class="o">}</span>
<span class="o">{</span>
  level: <span class="s1">'log'</span>,
  message: <span class="o">[</span> <span class="s1">'GPU Device acquired.'</span> <span class="o">]</span>,
  sliceNumber: 1
<span class="o">}</span>
<span class="o">{</span>
  level: <span class="s1">'log'</span>,
  message: <span class="o">[</span> <span class="s1">'First Matrix: '</span> <span class="o">]</span>,
  sliceNumber: 1
<span class="o">}</span>
<span class="o">{</span>
  level: <span class="s1">'log'</span>,
  message: <span class="o">[</span> <span class="s1">'2 4 1 2 3 4 5 6 7 8 '</span> <span class="o">]</span>,
  sliceNumber: 1
<span class="o">}</span>
<span class="o">{</span>
  level: <span class="s1">'log'</span>,
  message: <span class="o">[</span> <span class="s1">'Second Matrix: '</span> <span class="o">]</span>,
  sliceNumber: 1
<span class="o">}</span>
<span class="o">{</span>
  level: <span class="s1">'log'</span>,
  message: <span class="o">[</span> <span class="s1">'4 2 1 2 3 4 5 6 7 8 '</span> <span class="o">]</span>,
  sliceNumber: 1
<span class="o">}</span>
<span class="o">{</span>
  level: <span class="s1">'log'</span>,
  message: <span class="o">[</span> <span class="s1">'Commands submitted to the GPU Queue'</span> <span class="o">]</span>,
  sliceNumber: 1
<span class="o">}</span>
<span class="o">{</span>
  level: <span class="s1">'log'</span>,
  message: <span class="o">[</span> <span class="s1">'In Buffer async call back, status: 0'</span> <span class="o">]</span>,
  sliceNumber: 1
<span class="o">}</span>
<span class="o">{</span>
  level: <span class="s1">'log'</span>,
  message: <span class="o">[</span> <span class="s1">'Result Matrix: '</span> <span class="o">]</span>,
  sliceNumber: 1
<span class="o">}</span>
<span class="o">{</span>
  level: <span class="s1">'log'</span>,
  message: <span class="o">[</span> <span class="s1">'2 2 50 60 114 140 '</span> <span class="o">]</span>,
  sliceNumber: 1
<span class="o">}</span>
 - Received result <span class="k">for </span>slice 1 at 5.2s
Got status update:  <span class="o">{</span>
  runStatus: <span class="s1">'finished'</span>,
  total: 1,
  distributed: 1,
  computed: 1,
<span class="o">}</span>
</code></pre></div></div>

<p>Some points:</p>

<ol>
  <li>The deployJob script deploys the one-slice job with the entry point <code class="language-plaintext highlighter-rouge">workFn</code>. When a worker picks up a job slice, it executes this function.</li>
  <li>Before we can call any function from the generated WASM module, we must wait for the runtime to be initialized. This is done as follows:</li>
</ol>

<div class="language-js highlighter-rouge"><div class="highlight"><pre class="highlight"><code>  <span class="kd">const</span> <span class="p">{</span> <span class="nx">Module</span> <span class="p">}</span> <span class="o">=</span> <span class="nx">require</span><span class="p">(</span><span class="dl">'</span><span class="s1">wasm-webgpu-matmult.js</span><span class="dl">'</span><span class="p">);</span>

  <span class="k">if</span> <span class="p">(</span><span class="nx">Module</span><span class="p">.</span><span class="nx">onRuntimeInitialized</span><span class="p">)</span> <span class="p">{</span>
    <span class="c1">// the module is already initialized</span>
    <span class="c1">// ...</span>
  <span class="p">}</span> <span class="k">else</span> <span class="p">{</span>
    <span class="c1">// the module is not initialized, we will set a callback  </span>
    <span class="nx">Module</span><span class="p">.</span><span class="nx">onRuntimeInitialized</span> <span class="o">=</span> <span class="p">()</span> <span class="o">=&gt;</span> <span class="p">{</span>
      <span class="c1">// ..</span>
    <span class="p">};</span>
  <span class="p">}</span>
</code></pre></div></div>
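
<p>If the work function is itself async, the same callback can be wrapped in a small Promise helper so the rest of the function can simply <code class="language-plaintext highlighter-rouge">await</code> it. Below is a minimal sketch; the helper name <code class="language-plaintext highlighter-rouge">runtimeReady</code> is hypothetical, and it assumes the runtime has not finished initializing yet (otherwise, combine it with the check above):</p>

<div class="language-js highlighter-rouge"><div class="highlight"><pre class="highlight"><code>// Sketch: resolve a Promise once the Emscripten runtime has initialized.
// `Module` is assumed to be the object exported by the generated glue code,
// and the runtime is assumed not to have finished initializing yet.
function runtimeReady(Module) {
  return new Promise((resolve) => {
    // Chain any callback registered earlier, then resolve.
    const previous = Module.onRuntimeInitialized;
    Module.onRuntimeInitialized = () => {
      if (typeof previous === 'function') previous();
      resolve();
    };
  });
}

// Usage inside workFn:
//   await runtimeReady(Module);
//   const result = await RunMatMultWrapper();
</code></pre></div></div>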

<ol start="3">
  <li>Next, the function <code class="language-plaintext highlighter-rouge">RunMatMultWrapper</code> gets called, and since we need its return value, the function should behave synchronously. However, the C++ example <code class="language-plaintext highlighter-rouge">wasm-webgpu-matmult.cpp</code> uses callbacks everywhere (to handle device and adapter initialization, etc.). As explained <a href="https://eliemichel.github.io/LearnWebGPU/getting-started/the-command-queue.html#device-polling">here</a>, on the C/C++ side we need to wait briefly and, importantly, tick/poll the device so that it processes its pending tasks. This part of the API is not yet standardized, so we must adapt our implementation to the backend.</li>
  <li>All the options in <code class="language-plaintext highlighter-rouge">CMakeLists.txt</code> are explained to some extent. Note that we specifically need the <code class="language-plaintext highlighter-rouge">-sASYNCIFY</code> option so that we can <code class="language-plaintext highlighter-rouge">await</code> the <code class="language-plaintext highlighter-rouge">cwrap</code>ped function from our <code class="language-plaintext highlighter-rouge">workFn</code> in JS.</li>
</ol>

<div class="language-js highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nx">RunMatMultWrapper</span> <span class="o">=</span> <span class="nx">Module</span><span class="p">.</span><span class="nx">cwrap</span><span class="p">(</span><span class="dl">'</span><span class="s1">RunMatMultWrapper</span><span class="dl">'</span><span class="p">,</span> <span class="dl">'</span><span class="s1">null</span><span class="dl">'</span><span class="p">,</span> <span class="p">[</span><span class="dl">'</span><span class="s1">null</span><span class="dl">'</span><span class="p">],</span> <span class="p">{</span><span class="na">async</span><span class="p">:</span> <span class="kc">true</span><span class="p">});</span>
<span class="k">await</span> <span class="nx">RunMatMultWrapper</span><span class="p">();</span>
</code></pre></div></div>
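
<p>For context, the Asyncify-related Emscripten link options (as they might appear in <code class="language-plaintext highlighter-rouge">CMakeLists.txt</code>) look roughly like the sketch below. The target name and the exact flag set are assumptions for illustration; consult the repository’s actual <code class="language-plaintext highlighter-rouge">CMakeLists.txt</code> for the authoritative list:</p>

<div class="language-cmake highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Sketch only: illustrative Emscripten link flags (target name assumed)
target_link_options(wasm-webgpu-matmult PRIVATE
  "-sASYNCIFY"                               # lets JS await the cwrap'ped call
  "-sUSE_WEBGPU=1"                           # enable Emscripten's WebGPU bindings
  "-sEXPORTED_FUNCTIONS=_RunMatMultWrapper"  # expose the C entry point
  "-sEXPORTED_RUNTIME_METHODS=cwrap"         # make cwrap available from JS
)
</code></pre></div></div>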

<p>There you go! This is an example of how to deploy a WebGPU program on DCP using WASM. You can find the complete source of this post <a href="https://github.com/Distributive-Network/dcp-wasm-webgpu-example">here</a>.</p>]]></content><author><name>AmirHossein Sojoodi</name><email>amir.sojoodi@gmail.com</email></author><category term="Programming" /><category term="WebGPU" /><category term="WASM" /><category term="DCP" /><category term="Emscripten" /><summary type="html"><![CDATA[This example is a follow-up to my previous post on how to write a cross-platform WebGPU example. In this one, I’ll demonstrate how to deploy a matmult example written in C/C++ and WebGPU in a DCP worker using WASM. Note that for verification purposes, I also provide a Dawn-based native test, but this example doesn’t require building or installing Dawn in order to work.]]></summary></entry></feed>