MPS on Multi-Instance GPU
In previous posts (MPS and MIG), I explained how to enable MPS and MIG on NVIDIA GPUs. In this post, I will explain how to use both technologies at the same time; specifically, how to enable MPS on each of the MIG instances. For more information, you can refer to the NVIDIA documentation.
Enabling MPS on MIG
I assume that you have already enabled MIG on your GPU(s). If not, please refer to the previous posts. As stated in the NVIDIA documentation, the steps for configuring MPS on MIG are as follows:
- Configure the desired MIG geometry on the GPU.
- Set the CUDA_MPS_PIPE_DIRECTORY variable to point to unique directories so that the multiple MPS servers and clients can communicate with each other using named pipes and Unix domain sockets.
- Launch the application by specifying the MIG device using CUDA_VISIBLE_DEVICES. (This step might be unnecessary if you point to the correct MPS server using CUDA_MPS_PIPE_DIRECTORY.)
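To make the steps concrete, here is a minimal sketch for a single MIG instance (the UUID is a placeholder; substitute one of your own from nvidia-smi -L):

MIG_UUID=MIG-xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx   # placeholder UUID
mkdir -p /tmp/mps_${MIG_UUID} /tmp/mps_log_${MIG_UUID}
export CUDA_VISIBLE_DEVICES=${MIG_UUID}
export CUDA_MPS_PIPE_DIRECTORY=/tmp/mps_${MIG_UUID}
export CUDA_MPS_LOG_DIRECTORY=/tmp/mps_log_${MIG_UUID}
nvidia-cuda-mps-control -d   # start the MPS control daemon for this instance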
To enable MPS on MIG, I wrote a simple script that performs these steps for every MIG instance:
#!/bin/bash
set -eux
# Collect the UUIDs of all MIG instances (the commented-out pattern also matches full GPUs).
# GPU_UUIDs=($(nvidia-smi -L | grep -oE "(GPU|MIG)-[0-9a-f]*-[0-9a-f]*-[0-9a-f]*-[0-9a-f]*-[0-9a-f]*"))
GPU_UUIDs=($(nvidia-smi -L | grep -oE "(MIG)-[0-9a-f]*-[0-9a-f]*-[0-9a-f]*-[0-9a-f]*-[0-9a-f]*"))
for ((index = 0; index < ${#GPU_UUIDs[@]}; index++)); do
    GPU=${GPU_UUIDs[index]}
    # Recreate a unique pipe directory and log directory for this MIG instance.
    rm -rf /tmp/mps_${GPU}
    rm -rf /tmp/mps_log_${GPU}
    mkdir /tmp/mps_${GPU}
    mkdir /tmp/mps_log_${GPU}
    # Skip setting the GPU compute mode to Exclusive Process (not supported on MIG-enabled GPUs)
    # nvidia-smi -i $index -c EXCLUSIVE_PROCESS
    export CUDA_VISIBLE_DEVICES=${GPU}
    export CUDA_MPS_PIPE_DIRECTORY=/tmp/mps_${GPU}
    export CUDA_MPS_LOG_DIRECTORY=/tmp/mps_log_${GPU}
    # Start an MPS control daemon bound to this MIG instance.
    nvidia-cuda-mps-control -d
done
ps -ef | grep mps
In summary, the script does the following:
- getting the UUIDs of the MIG instances.
- creating unique directories for each MIG instance.
- setting CUDA_VISIBLE_DEVICES to the MIG instance.
- setting CUDA_MPS_PIPE_DIRECTORY and CUDA_MPS_LOG_DIRECTORY to the unique directories.
- enabling the MPS server on the specified MIG instance.
- repeating the steps for all the MIG instances.
- And at the end, listing the MPS processes.
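If you want to check that each control daemon is actually responding, you can query it through its pipe directory. A small sketch, using the standard get_server_list command of nvidia-cuda-mps-control (since servers are spawned lazily, the list may be empty until a client has run):

for dir in /tmp/mps_MIG-*; do
    echo "checking ${dir}"
    # Ask the control daemon behind this pipe directory for its server PIDs.
    echo get_server_list | CUDA_MPS_PIPE_DIRECTORY=${dir} nvidia-cuda-mps-control
done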
Disabling MPS on MIG
To disable MPS on MIG, you can use the following script:
#!/bin/bash
set -eux
# Collect the UUIDs of all MIG instances (the commented-out pattern also matches full GPUs).
# GPU_UUIDs=($(nvidia-smi -L | grep -oE "(MIG|GPU)-[0-9a-f]*-[0-9a-f]*-[0-9a-f]*-[0-9a-f]*-[0-9a-f]*"))
GPU_UUIDs=($(nvidia-smi -L | grep -oE "(MIG)-[0-9a-f]*-[0-9a-f]*-[0-9a-f]*-[0-9a-f]*-[0-9a-f]*"))
for ((index = 0; index < ${#GPU_UUIDs[@]}; index++)); do
    GPU=${GPU_UUIDs[index]}
    export CUDA_VISIBLE_DEVICES=${GPU}
    export CUDA_MPS_PIPE_DIRECTORY=/tmp/mps_${GPU}
    export CUDA_MPS_LOG_DIRECTORY=/tmp/mps_log_${GPU}
    # Tell the control daemon for this MIG instance to shut down.
    echo "quit" | nvidia-cuda-mps-control
    rm -rf /tmp/mps_log_${GPU}
    rm -rf /tmp/mps_${GPU}
    # Reset the GPU compute mode to Default (not supported on MIG-enabled GPUs)
    # nvidia-smi -i $index -c DEFAULT
done
ps -ef | grep mps
In summary, the script does the following:
- getting the UUIDs of the MIG instances.
- setting CUDA_VISIBLE_DEVICES to the MIG instance.
- setting CUDA_MPS_PIPE_DIRECTORY and CUDA_MPS_LOG_DIRECTORY to the unique directories.
- disabling the MPS server on the specified MIG instance.
- removing the directories.
- repeating the steps for all the MIG instances.
- And at the end, listing the MPS processes, which should show none.
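As a quick sanity check, you can also confirm that no MPS processes or directories are left behind (a minimal sketch):

pgrep -fa nvidia-cuda-mps || echo "no MPS processes running"
ls -d /tmp/mps_* 2>/dev/null || echo "no MPS directories left"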
Notice that the script does not destroy the MIG configuration; the GPUs will remain in MIG mode. If you want to see how to disable MIG, please refer to the previous post.
Example
I have created 2 GPU instances, each with 2 compute instances, on my A30 GPU. This is what it looks like:
$ nvidia-smi -L
GPU 0: NVIDIA A30 (UUID: GPU-8f8bff94-112e-9541-43da-cfd453333404)
MIG 1c.2g.12gb Device 0: (UUID: MIG-22f0c05f-5cf2-5ea2-8297-789695656dc0)
MIG 1c.2g.12gb Device 1: (UUID: MIG-d74dcafc-aad0-58b8-83d8-61bcd963d2e9)
MIG 1c.2g.12gb Device 2: (UUID: MIG-27be287f-e2db-5526-a2f6-0bfabcf34af9)
MIG 1c.2g.12gb Device 3: (UUID: MIG-235f71be-a125-5ce0-9fe6-0cd97ae57733)
GPU 1: NVIDIA A30 (UUID: GPU-0783f1eb-ab00-d6ec-92e4-8676be77de38)
GPU 2: NVIDIA A30 (UUID: GPU-a90c6e94-391e-0fc3-8fc5-e2ef46ec6d2d)
GPU 3: NVIDIA A30 (UUID: GPU-46d1eefe-dfc8-2f00-16a9-95c08e019d47)
$ nvidia-smi
Wed Aug 28 17:07:36 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 545.23.08 Driver Version: 545.23.08 CUDA Version: 12.3 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 NVIDIA A30 On | 00000000:17:00.0 Off | On |
| N/A 28C P0 30W / 165W | 50MiB / 24576MiB | N/A Default |
| | | Enabled |
+-----------------------------------------+----------------------+----------------------+
| 1 NVIDIA A30 On | 00000000:65:00.0 Off | 0 |
| N/A 28C P0 30W / 165W | 4MiB / 24576MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+----------------------+----------------------+
| 2 NVIDIA A30 On | 00000000:CA:00.0 Off | 0 |
| N/A 27C P0 31W / 165W | 4MiB / 24576MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+----------------------+----------------------+
| 3 NVIDIA A30 On | 00000000:E3:00.0 Off | 0 |
| N/A 28C P0 32W / 165W | 4MiB / 24576MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| MIG devices: |
+------------------+--------------------------------+-----------+-----------------------+
| GPU GI CI MIG | Memory-Usage | Vol| Shared |
| ID ID Dev | BAR1-Usage | SM Unc| CE ENC DEC OFA JPG |
| | | ECC| |
|==================+================================+===========+=======================|
| 0 1 0 0 | 25MiB / 11968MiB | 14 0 | 2 0 2 0 0 |
| | 0MiB / 16383MiB | | |
+------------------+ +-----------+-----------------------+
| 0 1 1 1 | | 14 0 | 2 0 2 0 0 |
| | | | |
+------------------+--------------------------------+-----------+-----------------------+
| 0 2 0 2 | 25MiB / 11968MiB | 14 0 | 2 0 2 0 0 |
| | 0MiB / 16383MiB | | |
+------------------+ +-----------+-----------------------+
| 0 2 1 3 | | 14 0 | 2 0 2 0 0 |
| | | | |
+------------------+--------------------------------+-----------+-----------------------+
+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
| No running processes found |
+---------------------------------------------------------------------------------------+
Here is the output of ps -ef | grep mps after running the enable script above (with sudo):
$ ps -ef | grep mps
root 241318 1 0 17:08 ? 00:00:00 nvidia-cuda-mps-control -d
root 241326 1 0 17:08 ? 00:00:00 nvidia-cuda-mps-control -d
root 241334 1 0 17:08 ? 00:00:00 nvidia-cuda-mps-control -d
root 241342 1 0 17:08 ? 00:00:00 nvidia-cuda-mps-control -d
# And the content of tmp directory
$ ls -l /tmp/
total 0
drwxr-xr-x 2 root root 120 Aug 28 17:08 mps_MIG-22f0c05f-5cf2-5ea2-8297-789695656dc0
drwxr-xr-x 2 root root 120 Aug 28 17:08 mps_MIG-235f71be-a125-5ce0-9fe6-0cd97ae57733
drwxr-xr-x 2 root root 120 Aug 28 17:08 mps_MIG-27be287f-e2db-5526-a2f6-0bfabcf34af9
drwxr-xr-x 2 root root 120 Aug 28 17:08 mps_MIG-d74dcafc-aad0-58b8-83d8-61bcd963d2e9
drwxr-xr-x 2 root root 80 Aug 28 17:08 mps_log_MIG-22f0c05f-5cf2-5ea2-8297-789695656dc0
drwxr-xr-x 2 root root 80 Aug 28 17:08 mps_log_MIG-235f71be-a125-5ce0-9fe6-0cd97ae57733
drwxr-xr-x 2 root root 80 Aug 28 17:08 mps_log_MIG-27be287f-e2db-5526-a2f6-0bfabcf34af9
drwxr-xr-x 2 root root 80 Aug 28 17:08 mps_log_MIG-d74dcafc-aad0-58b8-83d8-61bcd963d2e9
Now, you can run your application on the MIG instances with MPS enabled. For example, you can run deviceQuery with the following command:
CUDA_MPS_PIPE_DIRECTORY=/tmp/mps_MIG-22f0c05f-5cf2-5ea2-8297-789695656dc0 \
CUDA_MPS_LOG_DIRECTORY=/tmp/mps_log_MIG-22f0c05f-5cf2-5ea2-8297-789695656dc0 \
./deviceQuery
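If you want to exercise all of the instances at once, you can loop over the pipe directories created by the enable script. A small sketch, assuming the deviceQuery binary from the CUDA samples is in the current directory:

for dir in /tmp/mps_MIG-*; do
    uuid=${dir#/tmp/mps_}
    # Launch one client per MIG instance, each talking to its own MPS daemon.
    CUDA_MPS_PIPE_DIRECTORY=${dir} \
    CUDA_MPS_LOG_DIRECTORY=/tmp/mps_log_${uuid} \
    ./deviceQuery &
done
wait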
After I ran deviceQuery on all the MIG instances with the correct PIPE and LOG directories, this is the output of nvidia-smi:
$ nvidia-smi
...
+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
| 0 1 0 241923 C nvidia-cuda-mps-server 30MiB |
| 0 1 1 254870 C nvidia-cuda-mps-server 30MiB |
| 0 2 0 249932 C nvidia-cuda-mps-server 30MiB |
| 0 2 1 242189 C nvidia-cuda-mps-server 30MiB |
+---------------------------------------------------------------------------------------+
Notice the GI IDs and CI IDs, and how each of them has its own MPS server. It’s worth mentioning that MPS servers are started in a lazy fashion. So if you don’t run any application, the MPS server will not be started.
By the way, I didn’t see any difference in the behaviour of the application when passing CUDA_VISIBLE_DEVICES to the command. It seems that CUDA_MPS_PIPE_DIRECTORY alone is enough to point to the correct MPS server. However, setting CUDA_VISIBLE_DEVICES is good practice to avoid confusion, as setting it to a wrong value will cause the application to return an error.
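For completeness, this is what the combined invocation would look like, reusing the first MIG UUID from the example above:

CUDA_VISIBLE_DEVICES=MIG-22f0c05f-5cf2-5ea2-8297-789695656dc0 \
CUDA_MPS_PIPE_DIRECTORY=/tmp/mps_MIG-22f0c05f-5cf2-5ea2-8297-789695656dc0 \
CUDA_MPS_LOG_DIRECTORY=/tmp/mps_log_MIG-22f0c05f-5cf2-5ea2-8297-789695656dc0 \
./deviceQuery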