Multi-Instance GPU
Multi-Instance GPU (MIG) in a nutshell
MIG (Multi-Instance GPU) is a feature of recent NVIDIA data center GPUs (the Ampere architecture and later, such as the A100 and A30) that allows a single GPU to be partitioned into multiple instances, each with its own compute, memory, and I/O resources. This is useful when several applications need GPU resources but none of them fully utilizes the GPU. It is somewhat similar to MPS (Multi-Process Service), but it works in a totally different way! The two can, however, be used together to improve GPU utilization even further.
You can read more about MIG in NVIDIA's MIG User Guide.
Enabling MIG
MIG mode is enabled on a per-GPU basis, and each GPU is addressed by its index (the GPU ID). Enabling or disabling MIG mode requires root privileges. For example, to list all available GPU IDs and enable MIG mode on one of them, run the following commands:
nvidia-smi -L
nvidia-smi -i 0 -mig 1 # Enable MIG mode on GPU 0
# To query the MIG mode status
nvidia-smi -i 0 --query-gpu=pci.bus_id,mig.mode.current --format=csv
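On a node with several GPUs it can help to filter the query output down to the GPUs that already have MIG mode enabled. The following is a minimal sketch, assuming the CSV layout produced by the query above (a header line, then one `bus_id, mode` line per GPU); the function name `mig_enabled_gpus` is made up for this example.

```shell
# Print the PCI bus IDs of GPUs whose current MIG mode is "Enabled".
# Reads the CSV produced by the query command on stdin.
# Assumption: fields are separated by a comma followed by optional spaces.
mig_enabled_gpus() {
    awk -F', *' 'NR > 1 && $2 == "Enabled" { print $1 }'
}

# Hypothetical usage on a real node:
#   nvidia-smi --query-gpu=pci.bus_id,mig.mode.current --format=csv | mig_enabled_gpus
```

The function only reads text, so it can be tried on saved output without touching the GPUs.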
If you see errors depending on your GPU model, you can take a look at this section.
Creating MIG Instances
After enabling MIG mode, you can create MIG instances. First, to see the available MIG profiles, run the following command:
nvidia-smi mig -lgip
The MIG profile names are in the format <GPU Slice Count>g.<Memory Size>gb. The +me suffix indicates that the profile includes Media Extension. For instance, MIG 1g.6gb+me indicates that the profile has 1 GPU slice and 6GB of memory and includes Media Extension.
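To make the naming scheme concrete, here is a small sketch that splits a GPU instance profile name into its parts. It assumes names look exactly like the ones described above (an optional `MIG ` prefix, integer slice and memory counts, an optional `+me` suffix); `parse_gi_profile` is a hypothetical helper, not part of nvidia-smi.

```shell
# Split a GPU instance profile name such as "MIG 1g.6gb+me" into its parts.
# Prints: <gpu slices> <memory in GB> <yes|no for Media Extension>
# Prints nothing if the name does not match the expected pattern.
parse_gi_profile() {
    printf '%s\n' "$1" |
        sed -n -E 's/^(MIG )?([0-9]+)g\.([0-9]+)gb(\+me)?$/\2 \3 \4/p' |
        awk '{ print $1, $2, ($3 == "+me" ? "yes" : "no") }'
}

# Example calls:
#   parse_gi_profile "MIG 1g.6gb+me"
#   parse_gi_profile "2g.12gb"
```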
Placements are listed in the format {<index>}:<GPU Slice Count>. The indices inside the braces are the positions on the GPU where an instance of that profile can start, and the number after the colon is how many slices the instance occupies. For instance:
$ nvidia-smi mig -lgipp
GPU 0 Profile ID 14 Placements: {0,1,2,3}:1
GPU 0 Profile ID 21 Placements: {0,1,2,3}:1
GPU 0 Profile ID 5 Placements: {0,2}:2
GPU 0 Profile ID 6 Placements: {0,2}:2
GPU 0 Profile ID 0 Placement : {0}:4
Taken together with the output of the previous command, the placement {0,1,2,3}:1 indicates that the MIG profile with ID 14 can have up to 4 instances, each occupying 1 GPU slice, and that these instances can start at GPU slice 0, 1, 2, or 3.
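The placement notation can be expanded mechanically into one start:size pair per possible position, which is the same Start:Size form that `nvidia-smi mig -lgi` prints for created instances. A minimal sketch, assuming the `{a,b,...}:n` syntax shown above; `expand_placement` is a name invented for this example.

```shell
# Expand a placement spec such as "{0,2}:2" into one "start:size" pair per line.
expand_placement() {
    printf '%s\n' "$1" | awk -F'[{}:]' '{
        # $2 holds the comma-separated start indices, $4 the slice count.
        n = split($2, starts, ",")
        for (i = 1; i <= n; i++) print starts[i] ":" $4
    }'
}

# Example call:
#   expand_placement "{0,2}:2"
```

For profile ID 5 this yields the two possible positions of a 2-slice instance on the 4-slice GPU.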
Now, it’s time to create some MIG instances!
# There are basically three ways to do this. The angle brackets below only list
# the valid values: pass a comma-separated selection that fits on the GPU, and
# do not type the brackets themselves.
# 1. By specifying a combination of the profile IDs (make sure that the GPU has enough resources)
nvidia-smi mig -cgi <14,21,5,6,0>
# 2. By specifying the short profile name
nvidia-smi mig -cgi <1g.6gb,1g.6gb+me,2g.12gb,2g.12gb+me,4g.24gb>
# 3. By specifying the full profile name (quote the argument, since the names contain spaces)
nvidia-smi mig -cgi <MIG 1g.6gb,MIG 1g.6gb+me,MIG 2g.12gb,MIG 2g.12gb+me,MIG 4g.24gb>
# Or a combination of the above
nvidia-smi mig -cgi 14,1g.6gb
Destroying MIG Instances
To destroy all the compute instances (CIs) and GPU instances (GIs), destroy the CIs first and then the GIs:
sudo nvidia-smi mig -dci # --destroy-compute-instance
sudo nvidia-smi mig -dgi # --destroy-gpu-instance
# Or to destroy a specific instance
# This one destroys the compute instances with IDs 0, 1, and 2 under the GPU instance 1
nvidia-smi mig -dci -ci 0,1,2 -gi 1
# Verify the status
nvidia-smi mig -lgi # --list-gpu-instances
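The order matters here, because a GPU instance cannot be destroyed while it still contains compute instances. The sketch below only prints the commands for one GPU instead of running them, so the plan can be reviewed first. It assumes the `-i` option to select the GPU and `-mig 0` to switch MIG mode off again; `mig_teardown_plan` is a hypothetical helper name.

```shell
# Print (not run) the commands that remove all MIG instances on one GPU
# and then disable MIG mode on it.
# $1: GPU index as shown by `nvidia-smi -L`
mig_teardown_plan() {
    gpu="$1"
    printf 'nvidia-smi mig -i %s -dci\n' "$gpu"   # 1. compute instances first
    printf 'nvidia-smi mig -i %s -dgi\n' "$gpu"   # 2. then the GPU instances
    printf 'nvidia-smi -i %s -mig 0\n' "$gpu"     # 3. finally disable MIG mode
}

# Hypothetical usage: review the plan, then run it as root.
#   mig_teardown_plan 0
#   mig_teardown_plan 0 | sudo sh
```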
An example on a node with 4 A30 GPUs
Let’s see the list of the GPUs:
$ nvidia-smi -L
GPU 0: NVIDIA A30 (UUID: GPU-d8430827-89da-89df-1d70-bde0c9883859)
GPU 1: NVIDIA A30 (UUID: GPU-d1ee693e-571b-7664-f22b-1312585630d0)
GPU 2: NVIDIA A30 (UUID: GPU-fc7c8f73-47c4-7b21-c9da-858462fa0433)
GPU 3: NVIDIA A30 (UUID: GPU-bef3638b-2c97-ea4b-83d9-b1a6b7d2fc29)
# Or more detailed information (MIG is not enabled yet)
$ nvidia-smi
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 545.23.08 Driver Version: 545.23.08 CUDA Version: 12.3 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 NVIDIA A30 On | 00000000:17:00.0 Off | 0 |
| N/A 28C P0 33W / 165W | 4MiB / 24576MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+----------------------+----------------------+
| 1 NVIDIA A30 On | 00000000:65:00.0 Off | 0 |
| N/A 28C P0 36W / 165W | 4MiB / 24576MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+----------------------+----------------------+
| 2 NVIDIA A30 On | 00000000:CA:00.0 Off | 0 |
| N/A 29C P0 30W / 165W | 4MiB / 24576MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+----------------------+----------------------+
| 3 NVIDIA A30 On | 00000000:E3:00.0 Off | 0 |
| N/A 30C P0 31W / 165W | 4MiB / 24576MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+----------------------+----------------------+
# Let's enable MIG mode on GPU 0
$ nvidia-smi -i 0 -mig 1
Enabled MIG Mode for GPU 00000000:17:00.0
All done.
$ nvidia-smi
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 545.23.08 Driver Version: 545.23.08 CUDA Version: 12.3 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 NVIDIA A30 On | 00000000:17:00.0 Off | On |
| N/A 27C P0 26W / 165W | 0MiB / 24576MiB | N/A Default |
| | | Enabled |
+-----------------------------------------+----------------------+----------------------+
| 1 NVIDIA A30 On | 00000000:65:00.0 Off | 0 |
| N/A 27C P0 27W / 165W | 4MiB / 24576MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+----------------------+----------------------+
| 2 NVIDIA A30 On | 00000000:CA:00.0 Off | 0 |
| N/A 29C P0 30W / 165W | 4MiB / 24576MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+----------------------+----------------------+
| 3 NVIDIA A30 On | 00000000:E3:00.0 Off | 0 |
| N/A 29C P0 31W / 165W | 4MiB / 24576MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| MIG devices: |
+------------------+--------------------------------+-----------+-----------------------+
| GPU GI CI MIG | Memory-Usage | Vol| Shared |
| ID ID Dev | BAR1-Usage | SM Unc| CE ENC DEC OFA JPG |
| | | ECC| |
|==================+================================+===========+=======================|
| No MIG devices found |
+---------------------------------------------------------------------------------------+
# Let's see the available MIG profiles
$ nvidia-smi mig -lgip
+-----------------------------------------------------------------------------+
| GPU instance profiles: |
| GPU Name ID Instances Memory P2P SM DEC ENC |
| Free/Total GiB CE JPEG OFA |
|=============================================================================|
| 0 MIG 1g.6gb 14 4/4 5.81 No 14 1 0 |
| 1 0 0 |
+-----------------------------------------------------------------------------+
| 0 MIG 1g.6gb+me 21 1/1 5.81 No 14 1 0 |
| 1 1 1 |
+-----------------------------------------------------------------------------+
| 0 MIG 2g.12gb 5 2/2 11.69 No 28 2 0 |
| 2 0 0 |
+-----------------------------------------------------------------------------+
| 0 MIG 2g.12gb+me 6 1/1 11.69 No 28 2 0 |
| 2 1 1 |
+-----------------------------------------------------------------------------+
| 0 MIG 4g.24gb 0 1/1 23.44 No 56 4 0 |
| 4 1 1 |
+-----------------------------------------------------------------------------+
# The table SM column shows the number of SMs available to the MIG instance.
# The table CE column shows the number of copy engines available to the MIG instance.
For instance, let’s create two instances, each with 2 GPU slices (out of the 4 available) and 12GB of memory. I can use the following command:
$ nvidia-smi mig -cgi 5,5
Successfully created GPU instance ID 1 on GPU 0 using profile MIG 2g.12gb (ID 5)
Successfully created GPU instance ID 2 on GPU 0 using profile MIG 2g.12gb (ID 5)
# To see the status of the created instances
$ nvidia-smi mig -lgi # --list-gpu-instances
+-------------------------------------------------------+
| GPU instances: |
| GPU Name Profile Instance Placement |
| ID ID Start:Size |
|=======================================================|
| 0 MIG 2g.12gb 5 1 0:2 |
+-------------------------------------------------------+
| 0 MIG 2g.12gb 5 2 2:2 |
+-------------------------------------------------------+
# However, no Compute Instance (CI) has been created yet!
$ nvidia-smi mig -lci
No compute instances found: Not Found
# And the list has not changed
$ nvidia-smi -L
GPU 0: NVIDIA A30 (UUID: GPU-d8430827-89da-89df-1d70-bde0c9883859)
GPU 1: NVIDIA A30 (UUID: GPU-d1ee693e-571b-7664-f22b-1312585630d0)
GPU 2: NVIDIA A30 (UUID: GPU-fc7c8f73-47c4-7b21-c9da-858462fa0433)
GPU 3: NVIDIA A30 (UUID: GPU-bef3638b-2c97-ea4b-83d9-b1a6b7d2fc29)
But what the hell is a Compute Instance (CI)?! In very basic terms, it is the part of a GPU instance that your CUDA code actually runs on, so a GPU instance is not usable until it contains at least one CI. CIs have to be created explicitly, either afterwards with -cci or together with the GPU instances by adding the -C flag. For the example above, where I created two GPU instances, I can update the command to create the CIs as well:
# I have deleted the previous instances!
$ nvidia-smi mig -dgi
# Now, I can create the instances with the following command
$ nvidia-smi mig -cgi 5,5 -C
Successfully created GPU instance ID 1 on GPU 0 using profile MIG 2g.12gb (ID 5)
Successfully created compute instance ID 0 on GPU 0 GPU instance ID 1 using profile MIG 2g.12gb (ID 1)
Successfully created GPU instance ID 2 on GPU 0 using profile MIG 2g.12gb (ID 5)
Successfully created compute instance ID 0 on GPU 0 GPU instance ID 2 using profile MIG 2g.12gb (ID 1)
# To see the status of the created instances (no changes yet)
$ nvidia-smi mig -lgi # --list-gpu-instances
# Now, the list of the GPUs has changed
$ nvidia-smi -L
GPU 0: NVIDIA A30 (UUID: GPU-d8430827-89da-89df-1d70-bde0c9883859)
MIG 2g.12gb Device 0: (UUID: MIG-89b757f7-3c4c-5b1c-9476-a11b14aa9308)
MIG 2g.12gb Device 1: (UUID: MIG-6df26cf8-a984-58d3-978e-acb0c808d513)
GPU 1: NVIDIA A30 (UUID: GPU-d1ee693e-571b-7664-f22b-1312585630d0)
GPU 2: NVIDIA A30 (UUID: GPU-fc7c8f73-47c4-7b21-c9da-858462fa0433)
GPU 3: NVIDIA A30 (UUID: GPU-bef3638b-2c97-ea4b-83d9-b1a6b7d2fc29)
$ nvidia-smi
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 545.23.08 Driver Version: 545.23.08 CUDA Version: 12.3 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 NVIDIA A30 On | 00000000:17:00.0 Off | On |
| N/A 27C P0 26W / 165W | 50MiB / 24576MiB | N/A Default |
| | | Enabled |
+-----------------------------------------+----------------------+----------------------+
| 1 NVIDIA A30 On | 00000000:65:00.0 Off | 0 |
| N/A 27C P0 27W / 165W | 4MiB / 24576MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+----------------------+----------------------+
| 2 NVIDIA A30 On | 00000000:CA:00.0 Off | 0 |
| N/A 29C P0 30W / 165W | 4MiB / 24576MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+----------------------+----------------------+
| 3 NVIDIA A30 On | 00000000:E3:00.0 Off | 0 |
| N/A 29C P0 31W / 165W | 4MiB / 24576MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| MIG devices: |
+------------------+--------------------------------+-----------+-----------------------+
| GPU GI CI MIG | Memory-Usage | Vol| Shared |
| ID ID Dev | BAR1-Usage | SM Unc| CE ENC DEC OFA JPG |
| | | ECC| |
|==================+================================+===========+=======================|
| 0 1 0 0 | 25MiB / 11968MiB | 28 0 | 2 0 2 0 0 |
| | 0MiB / 16383MiB | | |
+------------------+--------------------------------+-----------+-----------------------+
| 0 2 0 1 | 25MiB / 11968MiB | 28 0 | 2 0 2 0 0 |
| | 0MiB / 16383MiB | | |
+------------------+--------------------------------+-----------+-----------------------+
Phew! That was a lot! But now, I have two MIG instances, each with 2 GPU slices and 12GB of memory. I can run my CUDA code on these instances like this:
CUDA_VISIBLE_DEVICES=MIG-89b757f7-3c4c-5b1c-9476-a11b14aa9308 ./my_cuda_code &
CUDA_VISIBLE_DEVICES=MIG-6df26cf8-a984-58d3-978e-acb0c808d513 ./my_cuda_code &
Pay attention! In this case, each GPU instance has two GPU slices (not to be confused with compute instances!), so each GPU instance can be split further into two compute instances. Let’s try that!
# Deleting the previous compute instances
$ nvidia-smi mig -dci
# Let's see the compute instance profiles
$ nvidia-smi mig -lcip
+--------------------------------------------------------------------------------------+
| Compute instance profiles: |
| GPU GPU Name Profile Instances Exclusive Shared |
| Instance ID Free/Total SM DEC ENC OFA |
| ID CE JPEG |
|======================================================================================|
| 0 1 MIG 1c.2g.12gb 0 2/2 14 2 0 0 |
| 2 0 |
+--------------------------------------------------------------------------------------+
| 0 1 MIG 2g.12gb 1* 1/1 28 2 0 0 |
| 2 0 |
+--------------------------------------------------------------------------------------+
| 0 2 MIG 1c.2g.12gb 0 2/2 14 2 0 0 |
| 2 0 |
+--------------------------------------------------------------------------------------+
| 0 2 MIG 2g.12gb 1* 1/1 28 2 0 0 |
| 2 0 |
+--------------------------------------------------------------------------------------+
Consider the name MIG 1c.2g.12gb. It indicates a compute instance with 1 compute slice (1c) inside a GPU instance that has 2 GPU slices and 12GB of memory. Compute instances within the same GPU instance use their SMs exclusively (14 each here), but they share the GPU instance's memory, copy engines, and so on. Therefore, I can create two compute instances per GPU instance:
# Once again, let's see the list of the GPU instances
$ nvidia-smi mig -lgi
+-------------------------------------------------------+
| GPU instances: |
| GPU Name Profile Instance Placement |
| ID ID Start:Size |
|=======================================================|
| 0 MIG 2g.12gb 5 1 0:2 |
+-------------------------------------------------------+
| 0 MIG 2g.12gb 5 2 2:2 |
+-------------------------------------------------------+
# Now, I can create two compute instances with profile ID 0 inside GPU instance 1
$ nvidia-smi mig -cci 0,0 -gi 1
Successfully created compute instance ID 0 on GPU 0 GPU instance ID 1 using profile MIG 1c.2g.12gb (ID 0)
Successfully created compute instance ID 1 on GPU 0 GPU instance ID 1 using profile MIG 1c.2g.12gb (ID 0)
# Let's do the same for the GPU instance 2
$ nvidia-smi mig -cci 0,0 -gi 2
Successfully created compute instance ID 0 on GPU 0 GPU instance ID 2 using profile MIG 1c.2g.12gb (ID 0)
Successfully created compute instance ID 1 on GPU 0 GPU instance ID 2 using profile MIG 1c.2g.12gb (ID 0)
# Now, the list of the Compute instances:
$ nvidia-smi mig -lci
+--------------------------------------------------------------------+
| Compute instances: |
| GPU GPU Name Profile Instance Placement |
| Instance ID ID Start:Size |
| ID |
|====================================================================|
| 0 1 MIG 1c.2g.12gb 0 0 0:1 |
+--------------------------------------------------------------------+
| 0 1 MIG 1c.2g.12gb 0 1 1:1 |
+--------------------------------------------------------------------+
| 0 2 MIG 1c.2g.12gb 0 0 0:1 |
+--------------------------------------------------------------------+
| 0 2 MIG 1c.2g.12gb 0 1 1:1 |
+--------------------------------------------------------------------+
# And the list of the GPUs:
$ nvidia-smi -L
GPU 0: NVIDIA A30 (UUID: GPU-d8430827-89da-89df-1d70-bde0c9883859)
MIG 1c.2g.12gb Device 0: (UUID: MIG-89b757f7-3c4c-5b1c-9476-a11b14aa9308)
MIG 1c.2g.12gb Device 1: (UUID: MIG-bdf043cc-8668-599c-95b0-5b640d416440)
MIG 1c.2g.12gb Device 2: (UUID: MIG-6df26cf8-a984-58d3-978e-acb0c808d513)
MIG 1c.2g.12gb Device 3: (UUID: MIG-6f79e1f8-f1ed-5121-976a-4ff3f0df8634)
GPU 1: NVIDIA A30 (UUID: GPU-d1ee693e-571b-7664-f22b-1312585630d0)
GPU 2: NVIDIA A30 (UUID: GPU-fc7c8f73-47c4-7b21-c9da-858462fa0433)
GPU 3: NVIDIA A30 (UUID: GPU-bef3638b-2c97-ea4b-83d9-b1a6b7d2fc29)
# And the list of the MIG devices from nvidia-smi:
$ nvidia-smi
+---------------------------------------------------------------------------------------+
| MIG devices: |
+------------------+--------------------------------+-----------+-----------------------+
| GPU GI CI MIG | Memory-Usage | Vol| Shared |
| ID ID Dev | BAR1-Usage | SM Unc| CE ENC DEC OFA JPG |
| | | ECC| |
|==================+================================+===========+=======================|
| 0 1 0 0 | 25MiB / 11968MiB | 14 0 | 2 0 2 0 0 |
| | 0MiB / 16383MiB | | |
+------------------+ +-----------+-----------------------+
| 0 1 1 1 | | 14 0 | 2 0 2 0 0 |
| | | | |
+------------------+--------------------------------+-----------+-----------------------+
| 0 2 0 2 | 25MiB / 11968MiB | 14 0 | 2 0 2 0 0 |
| | 0MiB / 16383MiB | | |
+------------------+ +-----------+-----------------------+
| 0 2 1 3 | | 14 0 | 2 0 2 0 0 |
| | | | |
+------------------+--------------------------------+-----------+-----------------------+
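The compute instance profile names seen above (MIG 1c.2g.12gb and MIG 2g.12gb) can be taken apart in the same mechanical way as the GPU instance names. This is a sketch under the assumption that a name without the `<N>c.` prefix covers the whole GPU instance, so its compute slice count equals the GPU slice count; `parse_ci_profile` is a hypothetical helper and does not handle any suffix such as +me.

```shell
# Split a compute instance profile name into its parts.
# Prints: <compute slices> <gpu slices> <memory in GB>
# Prints nothing if the name does not match the expected pattern.
parse_ci_profile() {
    printf '%s\n' "$1" |
        sed -n -E 's/^(MIG )?(([0-9]+)c\.)?([0-9]+)g\.([0-9]+)gb$/\3 \4 \5/p' |
        awk 'NF == 2 { print $1, $1, $2 }   # no "<N>c." prefix: CI spans the GI
             NF == 3 { print }'
}

# Example calls:
#   parse_ci_profile "MIG 1c.2g.12gb"
#   parse_ci_profile "MIG 2g.12gb"
```

With 28 SMs per 2-slice GPU instance on this A30, one of two compute slices corresponds to the 14 SMs shown in the table.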
Ok, now I can run four CUDA programs concurrently, one on each of the four compute instances. The commands would look like this:
CUDA_VISIBLE_DEVICES=MIG-89b757f7-3c4c-5b1c-9476-a11b14aa9308 ./my_cuda_code &
CUDA_VISIBLE_DEVICES=MIG-bdf043cc-8668-599c-95b0-5b640d416440 ./my_cuda_code &
CUDA_VISIBLE_DEVICES=MIG-6df26cf8-a984-58d3-978e-acb0c808d513 ./my_cuda_code &
CUDA_VISIBLE_DEVICES=MIG-6f79e1f8-f1ed-5121-976a-4ff3f0df8634 ./my_cuda_code &
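Copying UUIDs by hand gets tedious, so they can also be pulled out of the `nvidia-smi -L` listing. A minimal sketch, assuming the `(UUID: MIG-...)` format shown in the listings above (other driver versions may print MIG device identifiers differently); `mig_uuids` is a name invented for this example and `./my_cuda_code` is the same placeholder binary as above.

```shell
# Print the UUID of every MIG device found in `nvidia-smi -L` output on stdin.
# Lines for whole GPUs (UUID: GPU-...) are skipped.
mig_uuids() {
    sed -n -E 's/.*\(UUID: (MIG-[0-9a-f-]+)\).*/\1/p'
}

# Hypothetical usage: start one background process per MIG device and wait.
#   for uuid in $(nvidia-smi -L | mig_uuids); do
#       CUDA_VISIBLE_DEVICES="$uuid" ./my_cuda_code &
#   done
#   wait
```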