MPS on Multi-Instance GPU


In previous posts (MPS and MIG), I explained how to enable MPS and MIG on NVIDIA GPUs. In this post, I will explain how to use both technologies at the same time; more specifically, I want to enable MPS on each of the MIG instances. For more details, you can refer to the NVIDIA documentation.

Enabling MPS on MIG

I assume that you have already enabled MIG on your GPU(s). If not, please refer to the previous posts. As stated in the NVIDIA documentation, the steps for configuring MPS on MIG are as follows:

  • Configure the desired MIG geometry on the GPU.
  • Set the CUDA_MPS_PIPE_DIRECTORY variable to point to unique directories so that the multiple MPS servers and clients can communicate with each other using named pipes and Unix domain sockets.
  • Launch the application by specifying the MIG device using CUDA_VISIBLE_DEVICES. (This last step might be unnecessary if you already point to the correct MPS server using CUDA_MPS_PIPE_DIRECTORY.)

To enable MPS on MIG, I wrote a simple script that performs the steps above:

#!/bin/bash

set -eux

# To also run MPS daemons for non-MIG GPUs, match full GPU UUIDs as well:
# GPU_UUIDs=($(nvidia-smi -L | grep -oE "(GPU|MIG)-[0-9a-f]*-[0-9a-f]*-[0-9a-f]*-[0-9a-f]*-[0-9a-f]*"))
# Collect the UUIDs of all MIG devices
GPU_UUIDs=($(nvidia-smi -L | grep -oE "(MIG)-[0-9a-f]*-[0-9a-f]*-[0-9a-f]*-[0-9a-f]*-[0-9a-f]*"))

for ((index = 0; index < ${#GPU_UUIDs[@]}; index++)); do
  GPU=${GPU_UUIDs[index]}
  # Recreate a dedicated pipe and log directory for this MIG instance
  rm -rf /tmp/mps_${GPU}
  rm -rf /tmp/mps_log_${GPU}
  mkdir /tmp/mps_${GPU}
  mkdir /tmp/mps_log_${GPU}
  # Skip setting the GPU compute mode to Exclusive Process (not supported on MIG-enabled GPUs)
  # nvidia-smi -i $index -c EXCLUSIVE_PROCESS
  # Bind the control daemon to this MIG instance and its directories
  export CUDA_VISIBLE_DEVICES=${GPU}
  export CUDA_MPS_PIPE_DIRECTORY=/tmp/mps_${GPU}
  export CUDA_MPS_LOG_DIRECTORY=/tmp/mps_log_${GPU}
  # Start one MPS control daemon per MIG instance
  nvidia-cuda-mps-control -d
done

ps -ef | grep mps

In summary, the script does the following:

  • gets the UUIDs of the MIG instances,
  • creates a unique pipe directory and log directory for each MIG instance,
  • sets CUDA_VISIBLE_DEVICES to the MIG instance,
  • sets CUDA_MPS_PIPE_DIRECTORY and CUDA_MPS_LOG_DIRECTORY to those unique directories,
  • starts an MPS control daemon for the specified MIG instance,
  • repeats these steps for all the MIG instances,
  • and at the end, lists the MPS processes (a quick sanity check follows below).
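If you want to double-check that each control daemon is up and reachable through its own pipe directory, a quick check could look like the sketch below. Note that get_server_list is an nvidia-cuda-mps-control command that prints the PIDs of the MPS servers a daemon manages; the list may well be empty, since servers are only started when the first client connects.

# Ask every per-MIG control daemon which MPS servers it currently manages
for pipe_dir in /tmp/mps_MIG-*; do
  echo "Control daemon at ${pipe_dir}:"
  echo get_server_list | CUDA_MPS_PIPE_DIRECTORY="${pipe_dir}" nvidia-cuda-mps-control
done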

Disabling MPS on MIG

To disable MPS on MIG, you can use the following script:

#!/bin/bash

set -eux

# To include non-MIG GPUs as well, match full GPU UUIDs too:
# GPU_UUIDs=($(nvidia-smi -L | grep -oE "(MIG|GPU)-[0-9a-f]*-[0-9a-f]*-[0-9a-f]*-[0-9a-f]*-[0-9a-f]*"))
# Collect the UUIDs of all MIG devices
GPU_UUIDs=($(nvidia-smi -L | grep -oE "(MIG)-[0-9a-f]*-[0-9a-f]*-[0-9a-f]*-[0-9a-f]*-[0-9a-f]*"))

for ((index = 0; index < ${#GPU_UUIDs[@]}; index++)); do
  GPU=${GPU_UUIDs[index]}
  # Point at the pipe/log directories of this instance's MPS control daemon
  export CUDA_VISIBLE_DEVICES=${GPU}
  export CUDA_MPS_PIPE_DIRECTORY=/tmp/mps_${GPU}
  export CUDA_MPS_LOG_DIRECTORY=/tmp/mps_log_${GPU}
  # Ask the control daemon to shut down, then remove its directories
  echo "quit" | nvidia-cuda-mps-control
  rm -rf /tmp/mps_log_${GPU}
  rm -rf /tmp/mps_${GPU}
  # Reset the GPU compute mode to Default (not supported on MIG-enabled GPUs)
  # nvidia-smi -i $index -c DEFAULT
done

ps -ef | grep mps

In summary, the script does the following:

  • gets the UUIDs of the MIG instances,
  • sets CUDA_VISIBLE_DEVICES to the MIG instance,
  • sets CUDA_MPS_PIPE_DIRECTORY and CUDA_MPS_LOG_DIRECTORY to that instance's directories,
  • shuts down the MPS control daemon of the specified MIG instance,
  • removes the directories,
  • repeats these steps for all the MIG instances,
  • and at the end, lists the MPS processes, which should show none (a single-instance variant follows below).
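If you only want to shut down MPS for a single MIG instance instead of all of them, the same quit command works as long as you point it at that instance's pipe directory. A minimal sketch (the UUID is just an example from the next section; take one of your own from nvidia-smi -L):

# Tell the control daemon of one specific MIG instance to shut down
MIG_UUID=MIG-22f0c05f-5cf2-5ea2-8297-789695656dc0  # example UUID; replace with your own
echo quit | CUDA_MPS_PIPE_DIRECTORY=/tmp/mps_${MIG_UUID} nvidia-cuda-mps-control
# Clean up its pipe and log directories
rm -rf /tmp/mps_${MIG_UUID} /tmp/mps_log_${MIG_UUID}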

Note that the script does not destroy the MIG configuration; the GPUs will still be in MIG mode. If you want to see how to disable MIG, please refer to the previous post.

Example

I have created 2 GPU instances, each with 2 compute instances, on my A30 GPU.
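For reference, this geometry can be created with commands roughly like the following (a sketch using the A30 profile names; the GPU instance IDs passed to -gi come from the -lgi listing and may differ on your system, and the previous MIG post covers the full workflow):

# Create two 2g.12gb GPU instances on GPU 0
sudo nvidia-smi mig -i 0 -cgi 2g.12gb,2g.12gb
# List the resulting GPU instance IDs (1 and 2 in my case)
sudo nvidia-smi mig -i 0 -lgi
# Create two 1c compute instances inside each GPU instance
sudo nvidia-smi mig -i 0 -gi 1 -cci 1c.2g.12gb,1c.2g.12gb
sudo nvidia-smi mig -i 0 -gi 2 -cci 1c.2g.12gb,1c.2g.12gb

This is what the resulting configuration looks like: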

$ nvidia-smi -L
GPU 0: NVIDIA A30 (UUID: GPU-8f8bff94-112e-9541-43da-cfd453333404)
  MIG 1c.2g.12gb  Device  0: (UUID: MIG-22f0c05f-5cf2-5ea2-8297-789695656dc0)
  MIG 1c.2g.12gb  Device  1: (UUID: MIG-d74dcafc-aad0-58b8-83d8-61bcd963d2e9)
  MIG 1c.2g.12gb  Device  2: (UUID: MIG-27be287f-e2db-5526-a2f6-0bfabcf34af9)
  MIG 1c.2g.12gb  Device  3: (UUID: MIG-235f71be-a125-5ce0-9fe6-0cd97ae57733)
GPU 1: NVIDIA A30 (UUID: GPU-0783f1eb-ab00-d6ec-92e4-8676be77de38)
GPU 2: NVIDIA A30 (UUID: GPU-a90c6e94-391e-0fc3-8fc5-e2ef46ec6d2d)
GPU 3: NVIDIA A30 (UUID: GPU-46d1eefe-dfc8-2f00-16a9-95c08e019d47)

$ nvidia-smi
Wed Aug 28 17:07:36 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 545.23.08              Driver Version: 545.23.08    CUDA Version: 12.3     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA A30                     On  | 00000000:17:00.0 Off |                   On |
| N/A   28C    P0              30W / 165W |     50MiB / 24576MiB |     N/A      Default |
|                                         |                      |              Enabled |
+-----------------------------------------+----------------------+----------------------+
|   1  NVIDIA A30                     On  | 00000000:65:00.0 Off |                    0 |
| N/A   28C    P0              30W / 165W |      4MiB / 24576MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   2  NVIDIA A30                     On  | 00000000:CA:00.0 Off |                    0 |
| N/A   27C    P0              31W / 165W |      4MiB / 24576MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   3  NVIDIA A30                     On  | 00000000:E3:00.0 Off |                    0 |
| N/A   28C    P0              32W / 165W |      4MiB / 24576MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| MIG devices:                                                                          |
+------------------+--------------------------------+-----------+-----------------------+
| GPU  GI  CI  MIG |                   Memory-Usage |        Vol|      Shared           |
|      ID  ID  Dev |                     BAR1-Usage | SM     Unc| CE ENC DEC OFA JPG    |
|                  |                                |        ECC|                       |
|==================+================================+===========+=======================|
|  0    1   0   0  |              25MiB / 11968MiB  | 14      0 |  2   0    2    0    0 |
|                  |               0MiB / 16383MiB  |           |                       |
+------------------+                                +-----------+-----------------------+
|  0    1   1   1  |                                | 14      0 |  2   0    2    0    0 |
|                  |                                |           |                       |
+------------------+--------------------------------+-----------+-----------------------+
|  0    2   0   2  |              25MiB / 11968MiB  | 14      0 |  2   0    2    0    0 |
|                  |               0MiB / 16383MiB  |           |                       |
+------------------+                                +-----------+-----------------------+
|  0    2   1   3  |                                | 14      0 |  2   0    2    0    0 |
|                  |                                |           |                       |
+------------------+--------------------------------+-----------+-----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+

Here is the output of ps -ef | grep mps after running the script that enables MPS on MIG (with sudo):

$ ps -ef | grep mps
root      241318       1  0 17:08 ?        00:00:00 nvidia-cuda-mps-control -d
root      241326       1  0 17:08 ?        00:00:00 nvidia-cuda-mps-control -d
root      241334       1  0 17:08 ?        00:00:00 nvidia-cuda-mps-control -d
root      241342       1  0 17:08 ?        00:00:00 nvidia-cuda-mps-control -d

# And the content of tmp directory
$ ls -l /tmp/
total 0
drwxr-xr-x 2 root   root   120 Aug 28 17:08 mps_MIG-22f0c05f-5cf2-5ea2-8297-789695656dc0
drwxr-xr-x 2 root   root   120 Aug 28 17:08 mps_MIG-235f71be-a125-5ce0-9fe6-0cd97ae57733
drwxr-xr-x 2 root   root   120 Aug 28 17:08 mps_MIG-27be287f-e2db-5526-a2f6-0bfabcf34af9
drwxr-xr-x 2 root   root   120 Aug 28 17:08 mps_MIG-d74dcafc-aad0-58b8-83d8-61bcd963d2e9
drwxr-xr-x 2 root   root    80 Aug 28 17:08 mps_log_MIG-22f0c05f-5cf2-5ea2-8297-789695656dc0
drwxr-xr-x 2 root   root    80 Aug 28 17:08 mps_log_MIG-235f71be-a125-5ce0-9fe6-0cd97ae57733
drwxr-xr-x 2 root   root    80 Aug 28 17:08 mps_log_MIG-27be287f-e2db-5526-a2f6-0bfabcf34af9
drwxr-xr-x 2 root   root    80 Aug 28 17:08 mps_log_MIG-d74dcafc-aad0-58b8-83d8-61bcd963d2e9

Now, you can run your application on the MIG instances with MPS enabled. For example, you can run deviceQuery with the following command:

  CUDA_MPS_PIPE_DIRECTORY=/tmp/mps_MIG-22f0c05f-5cf2-5ea2-8297-789695656dc0 \
  CUDA_MPS_LOG_DIRECTORY=/tmp/mps_log_MIG-22f0c05f-5cf2-5ea2-8297-789695656dc0 \
  ./deviceQuery

After I ran deviceQuery on all the MIG instances with the correct PIPE and LOG directories, this is the output of nvidia-smi:

$ nvidia-smi
...
+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0    1    0     241923      C   nvidia-cuda-mps-server                       30MiB |
|    0    1    1     254870      C   nvidia-cuda-mps-server                       30MiB |
|    0    2    0     249932      C   nvidia-cuda-mps-server                       30MiB |
|    0    2    1     242189      C   nvidia-cuda-mps-server                       30MiB |
+---------------------------------------------------------------------------------------+

Notice the GI and CI IDs, and how each combination has its own MPS server. It’s worth mentioning that MPS servers are started lazily: if you don’t run any application against an instance, its MPS server will not be started.
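If you prefer to start a server eagerly instead of waiting for the first client, the control daemon also accepts a start_server command that takes the UID the server should run as. A sketch, again using the first MIG instance’s pipe directory (uid 0 because I started the daemons as root):

# Eagerly start an MPS server for this MIG instance, running as root (uid 0)
echo "start_server -uid 0" | \
  CUDA_MPS_PIPE_DIRECTORY=/tmp/mps_MIG-22f0c05f-5cf2-5ea2-8297-789695656dc0 \
  nvidia-cuda-mps-control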

By the way, I didn’t see any difference in the application’s behaviour when passing CUDA_VISIBLE_DEVICES to the command; it seems that CUDA_MPS_PIPE_DIRECTORY alone is enough to point it to the correct MPS server. However, setting CUDA_VISIBLE_DEVICES as well is good practice to avoid confusion, since setting it to a wrong value will make the application return an error.
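So a belt-and-braces invocation that pins both the visible device and the MPS server for one MIG instance would look like this (same example UUID as above):

  CUDA_VISIBLE_DEVICES=MIG-22f0c05f-5cf2-5ea2-8297-789695656dc0 \
  CUDA_MPS_PIPE_DIRECTORY=/tmp/mps_MIG-22f0c05f-5cf2-5ea2-8297-789695656dc0 \
  CUDA_MPS_LOG_DIRECTORY=/tmp/mps_log_MIG-22f0c05f-5cf2-5ea2-8297-789695656dc0 \
  ./deviceQuery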