Multi-Instance GPU

Multi-Instance GPU (MIG) in a nutshell

MIG (Multi-Instance GPU) is a feature of recent NVIDIA data center GPUs (such as the A100 and A30) that allows a single GPU to be partitioned into multiple isolated instances, each with its own compute, memory, and I/O resources. This is useful when several applications need a GPU but none of them fully utilizes it. It is somewhat similar to MPS, but it works in a totally different way: MIG partitions the hardware, whereas MPS lets several processes share one GPU through a common server process. The two can also be used together to improve GPU utilization even further.

You can read more about MIG in the NVIDIA MIG User Guide.

Enabling MIG

MIG mode is enabled on a per-GPU basis, and each GPU is addressed by its index (the GPU ID). For example, to list all available GPU IDs and enable MIG mode on one of them, run the following commands (changing the MIG mode requires root privileges):

nvidia-smi -L
nvidia-smi -i 0 -mig 1 # Enable MIG mode on GPU 0

# To query the MIG mode status
nvidia-smi -i 0 --query-gpu=pci.bus_id,mig.mode.current --format=csv

If you see errors depending on your GPU model, you can take a look at this section.
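On a node with several GPUs it can help to filter the query output down to the GPUs that actually have MIG enabled. The following is a minimal sketch: the function name `mig_enabled_gpus` is hypothetical, and it assumes the CSV layout of the query above (a header line, then one `<bus id>, <mode>` row per GPU with the mode spelled `Enabled` or `Disabled`).

```shell
# Hypothetical helper: read the CSV produced by the MIG mode query on stdin
# and print the PCI bus ID of every GPU whose current MIG mode is "Enabled".
# Assumes a header line followed by "<bus id>, <mode>" rows.
mig_enabled_gpus() {
    awk -F', *' 'NR > 1 && $2 == "Enabled" { print $1 }'
}

# Usage on a real node (requires the NVIDIA driver):
#   nvidia-smi --query-gpu=pci.bus_id,mig.mode.current --format=csv | mig_enabled_gpus
```

The function only reads text from stdin, so it can be tried on saved output without touching the GPU.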

Creating MIG Instances

After enabling MIG mode, you can create MIG instances. First, to see the available MIG profiles, run the following command:

nvidia-smi mig -lgip 

MIG profile names have the format <GPU Slice Count>g.<Memory Size>gb. The +me suffix indicates that the profile also includes the media extensions. For instance, MIG 1g.6gb+me is a profile with 1 GPU slice, 6GB of memory, and the media extensions.
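To make the naming scheme concrete, here is a small shell sketch that splits a GPU instance profile name into its parts. The function name `parse_gi_profile` is hypothetical (it is not part of nvidia-smi), and it assumes names of the form described above, with an optional `MIG ` prefix and an optional `+me` suffix.

```shell
# Hypothetical helper: split a GPU instance profile name such as
# "1g.6gb+me" into "<slices> <memory in GB> <media extensions: yes|no>".
parse_gi_profile() {
    name=${1#"MIG "}          # drop the optional "MIG " prefix
    case $name in
        *+me) me=yes; name=${name%+me} ;;
        *)    me=no ;;
    esac
    slices=${name%%g.*}       # text before "g."
    mem=${name#*g.}           # text after "g."
    mem=${mem%gb}             # drop the trailing "gb"
    printf '%s %s %s\n' "$slices" "$mem" "$me"
}

parse_gi_profile "MIG 1g.6gb+me"   # prints: 1 6 yes
parse_gi_profile "2g.12gb"         # prints: 2 12 no
```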

The placement syntax is {<start indices>}:<GPU Slice Count>. The set in braces lists the slice indices at which an instance of that profile can start, and the number after the colon is how many slices the instance occupies. For instance:

$ nvidia-smi mig -lgipp
GPU  0 Profile ID 14 Placements: {0,1,2,3}:1
GPU  0 Profile ID 21 Placements: {0,1,2,3}:1
GPU  0 Profile ID  5 Placements: {0,2}:2
GPU  0 Profile ID  6 Placements: {0,2}:2
GPU  0 Profile ID  0 Placement : {0}:4

Combined with the output of the previous command (-lgip), the placement {0,1,2,3}:1 indicates that the MIG profile with ID 14 supports up to 4 instances, each occupying 1 GPU slice, and that an instance can start at GPU slice 0, 1, 2, or 3.
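As a sanity check on how to read a placement, the sketch below expands one into the slice range each possible instance would cover. The function name `expand_placement` is hypothetical, and it assumes the `{a,b,...}:n` form printed by -lgipp.

```shell
# Hypothetical helper: expand a placement such as "{0,2}:2" into the
# range of GPU slices each possible instance would occupy.
expand_placement() {
    # Text between the braces, e.g. "0,2"
    starts=$(printf '%s\n' "$1" | sed 's/^{\(.*\)}:.*$/\1/')
    size=${1##*:}             # number after the colon
    old_ifs=$IFS
    IFS=,                     # split the start indices on commas
    for start in $starts; do
        end=$(( $start + $size - 1 ))
        printf 'slices %s-%s\n' "$start" "$end"
    done
    IFS=$old_ifs
}

expand_placement '{0,2}:2'
# prints:
# slices 0-1
# slices 2-3
```

So a profile with placement {0,2}:2 can sit on slices 0-1 or on slices 2-3, but never on slices 1-2.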

Now, it’s time to create some MIG instances!

# There are basically three ways to do this. In the examples below, the angle
# brackets list the values to choose from; they are not part of the command.
# Pick a combination that fits on the GPU (make sure it has enough resources).
# 1. By specifying a combination of the profile IDs
nvidia-smi mig -cgi <14,21,5,6,0>

# 2. By specifying the short profile names
nvidia-smi mig -cgi <1g.6gb,1g.6gb+me,2g.12gb,2g.12gb+me,4g.24gb>

# 3. By specifying the full profile names
#    (quote the argument in the shell, since the full names contain a space)
nvidia-smi mig -cgi <MIG 1g.6gb,MIG 1g.6gb+me,MIG 2g.12gb,MIG 2g.12gb+me,MIG 4g.24gb>

# Or a combination of the above
nvidia-smi mig -cgi 14,1g.6gb
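Before running -cgi, a quick way to check whether a combination has a chance of fitting is to add up the slice counts encoded in the profile names. The sketch below does only that; the function names are hypothetical, and the slice budget is passed in by hand rather than queried from the driver. Passing this check is necessary but not sufficient, because the placement rules and the per-profile instance limits also apply.

```shell
# Hypothetical helper: add up the GPU slices requested by a list of
# short profile names (e.g. 1g.6gb 2g.12gb+me).
slices_needed() {
    total=0
    for profile in "$@"; do
        count=${profile%%g.*}              # leading slice count
        total=$(( $total + $count ))
    done
    printf '%s\n' "$total"
}

# Succeeds if the profiles fit into the given slice budget.
# The budget is supplied by the caller, not read from the GPU.
fits_in_slices() {
    budget=$1
    shift
    [ "$(slices_needed "$@")" -le "$budget" ]
}

slices_needed 1g.6gb 1g.6gb 2g.12gb        # prints: 4
fits_in_slices 4 4g.24gb 1g.6gb || echo "does not fit"
```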

Destroying MIG Instances

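To destroy all the compute instances (CIs) and GPU instances (GIs), destroy the CIs first, since a GPU instance cannot be destroyed while it still contains compute instances: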

sudo nvidia-smi mig -dci # --destroy-compute-instance
sudo nvidia-smi mig -dgi # --destroy-gpu-instance

# Or to destroy a specific instance
# This one destroys the compute instances with IDs 0, 1, and 2 under the GPU instance 1
nvidia-smi mig -dci -ci 0,1,2 -gi 1

# Verify the status
nvidia-smi mig -lgi # --list-gpu-instances
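Since the order of the two steps matters, it can be convenient to wrap them in one function. This is a sketch only: the function name `mig_teardown` and its `dry-run` argument are hypothetical, and on a real node it needs to run as root (or with sudo added in front of nvidia-smi).

```shell
# Hypothetical helper: tear down every MIG instance on one GPU, compute
# instances first, then GPU instances. With "dry-run" as the second
# argument it only prints the commands instead of running them.
mig_teardown() {
    gpu=$1
    mode=${2:-run}
    for step in -dci -dgi; do
        if [ "$mode" = "dry-run" ]; then
            printf 'nvidia-smi mig -i %s %s\n' "$gpu" "$step"
        else
            nvidia-smi mig -i "$gpu" "$step"
        fi
    done
}

mig_teardown 0 dry-run
# prints:
# nvidia-smi mig -i 0 -dci
# nvidia-smi mig -i 0 -dgi
```

The dry-run mode makes it possible to check the order of the commands on a machine without a GPU.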

An example on a node with 4 A30 GPUs

Let’s see the list of the GPUs:

$ nvidia-smi -L
GPU 0: NVIDIA A30 (UUID: GPU-d8430827-89da-89df-1d70-bde0c9883859)
GPU 1: NVIDIA A30 (UUID: GPU-d1ee693e-571b-7664-f22b-1312585630d0)
GPU 2: NVIDIA A30 (UUID: GPU-fc7c8f73-47c4-7b21-c9da-858462fa0433)
GPU 3: NVIDIA A30 (UUID: GPU-bef3638b-2c97-ea4b-83d9-b1a6b7d2fc29)

# Or more detailed information (MIG is not enabled yet)
$ nvidia-smi
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 545.23.08              Driver Version: 545.23.08    CUDA Version: 12.3     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA A30                     On  | 00000000:17:00.0 Off |                    0 |
| N/A   28C    P0              33W / 165W |      4MiB / 24576MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   1  NVIDIA A30                     On  | 00000000:65:00.0 Off |                    0 |
| N/A   28C    P0              36W / 165W |      4MiB / 24576MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   2  NVIDIA A30                     On  | 00000000:CA:00.0 Off |                    0 |
| N/A   29C    P0              30W / 165W |      4MiB / 24576MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   3  NVIDIA A30                     On  | 00000000:E3:00.0 Off |                    0 |
| N/A   30C    P0              31W / 165W |      4MiB / 24576MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+

# Let's enable MIG mode on GPU 0
$ nvidia-smi -i 0 -mig 1
Enabled MIG Mode for GPU 00000000:17:00.0
All done.

$ nvidia-smi
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 545.23.08              Driver Version: 545.23.08    CUDA Version: 12.3     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA A30                     On  | 00000000:17:00.0 Off |                   On |
| N/A   27C    P0              26W / 165W |      0MiB / 24576MiB |     N/A      Default |
|                                         |                      |              Enabled |
+-----------------------------------------+----------------------+----------------------+
|   1  NVIDIA A30                     On  | 00000000:65:00.0 Off |                    0 |
| N/A   27C    P0              27W / 165W |      4MiB / 24576MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   2  NVIDIA A30                     On  | 00000000:CA:00.0 Off |                    0 |
| N/A   29C    P0              30W / 165W |      4MiB / 24576MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   3  NVIDIA A30                     On  | 00000000:E3:00.0 Off |                    0 |
| N/A   29C    P0              31W / 165W |      4MiB / 24576MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| MIG devices:                                                                          |
+------------------+--------------------------------+-----------+-----------------------+
| GPU  GI  CI  MIG |                   Memory-Usage |        Vol|      Shared           |
|      ID  ID  Dev |                     BAR1-Usage | SM     Unc| CE ENC DEC OFA JPG    |
|                  |                                |        ECC|                       |
|==================+================================+===========+=======================|
|  No MIG devices found                                                                 |
+---------------------------------------------------------------------------------------+

# Let's see the available MIG profiles
$ nvidia-smi mig -lgip 
+-----------------------------------------------------------------------------+
| GPU instance profiles:                                                      |
| GPU   Name             ID    Instances   Memory     P2P    SM    DEC   ENC  |
|                              Free/Total   GiB              CE    JPEG  OFA  |
|=============================================================================|
|   0  MIG 1g.6gb        14     4/4        5.81       No     14     1     0   |
|                                                             1     0     0   |
+-----------------------------------------------------------------------------+
|   0  MIG 1g.6gb+me     21     1/1        5.81       No     14     1     0   |
|                                                             1     1     1   |
+-----------------------------------------------------------------------------+
|   0  MIG 2g.12gb        5     2/2        11.69      No     28     2     0   |
|                                                             2     0     0   |
+-----------------------------------------------------------------------------+
|   0  MIG 2g.12gb+me     6     1/1        11.69      No     28     2     0   |
|                                                             2     1     1   |
+-----------------------------------------------------------------------------+
|   0  MIG 4g.24gb        0     1/1        23.44      No     56     4     0   |
|                                                             4     1     1   |
+-----------------------------------------------------------------------------+

# The SM column shows the number of streaming multiprocessors (SMs) available to the MIG instance.
# The CE column shows the number of copy engines available to the MIG instance.
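For scripting, the profile table can be reduced to the three columns that matter when creating instances. The sketch below is an assumption-laden parser, not an official interface: the function name `list_gi_profiles` is hypothetical, and it relies on the table layout shown above, which may differ between driver versions.

```shell
# Hypothetical helper: read `nvidia-smi mig -lgip` output on stdin and
# print "<profile ID> <short name> <free>/<total>" for each profile row.
# Rows are recognised by "MIG" in the third whitespace-separated field.
list_gi_profiles() {
    awk '$1 == "|" && $3 == "MIG" { print $5, $4, $6 }'
}

# Usage on a real node:
#   nvidia-smi mig -lgip | list_gi_profiles
```

On the table above this would print lines such as `14 1g.6gb 4/4` and `5 2g.12gb 2/2`.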

For instance, let’s create two instances, each with 2 GPU slices (out of the 4 available) and 12GB of memory. I can use the following command:

$ nvidia-smi mig -cgi 5,5
Successfully created GPU instance ID  1 on GPU  0 using profile MIG 2g.12gb (ID  5)
Successfully created GPU instance ID  2 on GPU  0 using profile MIG 2g.12gb (ID  5)

# To see the status of the created instances
$ nvidia-smi mig -lgi # --list-gpu-instances
+-------------------------------------------------------+
| GPU instances:                                        |
| GPU   Name             Profile  Instance   Placement  |
|                          ID       ID       Start:Size |
|=======================================================|
|   0  MIG 2g.12gb          5        1          0:2     |
+-------------------------------------------------------+
|   0  MIG 2g.12gb          5        2          2:2     |
+-------------------------------------------------------+

# However, no compute instances (CIs) have been created yet!
$ nvidia-smi mig -lci
No compute instances found: Not Found

# And the list has not changed
$ nvidia-smi -L
GPU 0: NVIDIA A30 (UUID: GPU-d8430827-89da-89df-1d70-bde0c9883859)
GPU 1: NVIDIA A30 (UUID: GPU-d1ee693e-571b-7664-f22b-1312585630d0)
GPU 2: NVIDIA A30 (UUID: GPU-fc7c8f73-47c4-7b21-c9da-858462fa0433)
GPU 3: NVIDIA A30 (UUID: GPU-bef3638b-2c97-ea4b-83d9-b1a6b7d2fc29)

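But what the hell is a Compute Instance (CI)?! In very basic terms, it is the part of a GPU instance that your CUDA code actually runs on: as the unchanged list above shows, a GPU instance does not show up as a usable device until it contains at least one compute instance. So, the CIs have to be created after the GPU instances, either in a separate step or together with them. In the example above, where I created two GPU instances, I can update the command to create the CIs as well: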

# I have deleted the previous instances!
$ nvidia-smi mig -dgi

# Now, I can create the instances with the following command
$ nvidia-smi mig -cgi 5,5 -C
Successfully created GPU instance ID  1 on GPU  0 using profile MIG 2g.12gb (ID  5)
Successfully created compute instance ID  0 on GPU  0 GPU instance ID  1 using profile MIG 2g.12gb (ID  1)
Successfully created GPU instance ID  2 on GPU  0 using profile MIG 2g.12gb (ID  5)
Successfully created compute instance ID  0 on GPU  0 GPU instance ID  2 using profile MIG 2g.12gb (ID  1)

# To see the status of the created instances (the GPU instances are the same as before)
$ nvidia-smi mig -lgi # --list-gpu-instances

# Now, the list of the GPUs has changed
$ nvidia-smi -L
GPU 0: NVIDIA A30 (UUID: GPU-d8430827-89da-89df-1d70-bde0c9883859)
  MIG 2g.12gb     Device  0: (UUID: MIG-89b757f7-3c4c-5b1c-9476-a11b14aa9308)
  MIG 2g.12gb     Device  1: (UUID: MIG-6df26cf8-a984-58d3-978e-acb0c808d513)
GPU 1: NVIDIA A30 (UUID: GPU-d1ee693e-571b-7664-f22b-1312585630d0)
GPU 2: NVIDIA A30 (UUID: GPU-fc7c8f73-47c4-7b21-c9da-858462fa0433)
GPU 3: NVIDIA A30 (UUID: GPU-bef3638b-2c97-ea4b-83d9-b1a6b7d2fc29)

$ nvidia-smi
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 545.23.08              Driver Version: 545.23.08    CUDA Version: 12.3     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA A30                     On  | 00000000:17:00.0 Off |                   On |
| N/A   27C    P0              26W / 165W |     50MiB / 24576MiB |     N/A      Default |
|                                         |                      |              Enabled |
+-----------------------------------------+----------------------+----------------------+
|   1  NVIDIA A30                     On  | 00000000:65:00.0 Off |                    0 |
| N/A   27C    P0              27W / 165W |      4MiB / 24576MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   2  NVIDIA A30                     On  | 00000000:CA:00.0 Off |                    0 |
| N/A   29C    P0              30W / 165W |      4MiB / 24576MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   3  NVIDIA A30                     On  | 00000000:E3:00.0 Off |                    0 |
| N/A   29C    P0              31W / 165W |      4MiB / 24576MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| MIG devices:                                                                          |
+------------------+--------------------------------+-----------+-----------------------+
| GPU  GI  CI  MIG |                   Memory-Usage |        Vol|      Shared           |
|      ID  ID  Dev |                     BAR1-Usage | SM     Unc| CE ENC DEC OFA JPG    |
|                  |                                |        ECC|                       |
|==================+================================+===========+=======================|
|  0    1   0   0  |              25MiB / 11968MiB  | 28      0 |  2   0    2    0    0 |
|                  |               0MiB / 16383MiB  |           |                       |
+------------------+--------------------------------+-----------+-----------------------+
|  0    2   0   1  |              25MiB / 11968MiB  | 28      0 |  2   0    2    0    0 |
|                  |               0MiB / 16383MiB  |           |                       |
+------------------+--------------------------------+-----------+-----------------------+

Phew! That was a lot! But now, I have two MIG instances, each with 2 GPU slices and 12GB of memory. I can run my CUDA code on these instances like this:

CUDA_VISIBLE_DEVICES=MIG-89b757f7-3c4c-5b1c-9476-a11b14aa9308 ./my_cuda_code &
CUDA_VISIBLE_DEVICES=MIG-6df26cf8-a984-58d3-978e-acb0c808d513 ./my_cuda_code &
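Copying UUIDs by hand gets tedious, so here is a sketch that pulls them out of the `nvidia-smi -L` listing. The function name `list_mig_uuids` is hypothetical, and it assumes the line layout shown above, where MIG devices are indented lines starting with `MIG` and carrying a `(UUID: MIG-...)` suffix.

```shell
# Hypothetical helper: read `nvidia-smi -L` output on stdin and print
# one MIG device UUID per line. Lines for whole GPUs are skipped.
list_mig_uuids() {
    sed -n 's/^ *MIG .*(UUID: \(MIG-[^)]*\)).*$/\1/p'
}

# Usage on a real node:
#   nvidia-smi -L | list_mig_uuids
```

Each printed UUID can then be used as the value of CUDA_VISIBLE_DEVICES.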

Pay attention! In this case, each GPU instance has two GPU slices (not to be confused with compute instances!), so we can split its SMs further into two compute instances. Let’s try that!

# Deleting the previous compute instances
$ nvidia-smi mig -dci

# Let's see the compute instance profiles
$ nvidia-smi mig -lcip
+--------------------------------------------------------------------------------------+
| Compute instance profiles:                                                           |
| GPU     GPU       Name             Profile  Instances   Exclusive       Shared       |
|       Instance                       ID     Free/Total     SM       DEC   ENC   OFA  |
|         ID                                                          CE    JPEG       |
|======================================================================================|
|   0      1       MIG 1c.2g.12gb       0      2/2           14        2     0     0   |
|                                                                      2     0         |
+--------------------------------------------------------------------------------------+
|   0      1       MIG 2g.12gb          1*     1/1           28        2     0     0   |
|                                                                      2     0         |
+--------------------------------------------------------------------------------------+
|   0      2       MIG 1c.2g.12gb       0      2/2           14        2     0     0   |
|                                                                      2     0         |
+--------------------------------------------------------------------------------------+
|   0      2       MIG 2g.12gb          1*     1/1           28        2     0     0   |
|                                                                      2     0         |
+--------------------------------------------------------------------------------------+

Consider the name MIG 1c.2g.12gb. The 1c part means that the compute instance gets 1 compute slice (14 SMs here), and the 2g.12gb part is the parent GPU instance profile, with 2 GPU slices and 12GB of memory. As the Exclusive and Shared columns show, the compute instances use their SMs exclusively, but they share the GPU memory, copy engines, etc. of the GPU instance. Therefore, I can create two compute instances per GPU instance:

# Once again, let's see the list of the GPU instances
$ nvidia-smi mig -lgi
+-------------------------------------------------------+
| GPU instances:                                        |
| GPU   Name             Profile  Instance   Placement  |
|                          ID       ID       Start:Size |
|=======================================================|
|   0  MIG 2g.12gb          5        1          0:2     |
+-------------------------------------------------------+
|   0  MIG 2g.12gb          5        2          2:2     |
+-------------------------------------------------------+

# Now, I can create two compute instances with profile ID 0 in GPU instance 1
$ nvidia-smi mig -cci 0,0 -gi 1
Successfully created compute instance ID  0 on GPU  0 GPU instance ID  1 using profile MIG 1c.2g.12gb (ID  0)
Successfully created compute instance ID  1 on GPU  0 GPU instance ID  1 using profile MIG 1c.2g.12gb (ID  0)

# Let's do the same for the GPU instance 2
$ nvidia-smi mig -cci 0,0 -gi 2
Successfully created compute instance ID  0 on GPU  0 GPU instance ID  2 using profile MIG 1c.2g.12gb (ID  0)
Successfully created compute instance ID  1 on GPU  0 GPU instance ID  2 using profile MIG 1c.2g.12gb (ID  0)

# Now, the list of the Compute instances:
$ nvidia-smi mig -lci
+--------------------------------------------------------------------+
| Compute instances:                                                 |
| GPU     GPU       Name             Profile   Instance   Placement  |
|       Instance                       ID        ID       Start:Size |
|         ID                                                         |
|====================================================================|
|   0      1       MIG 1c.2g.12gb       0         0          0:1     |
+--------------------------------------------------------------------+
|   0      1       MIG 1c.2g.12gb       0         1          1:1     |
+--------------------------------------------------------------------+
|   0      2       MIG 1c.2g.12gb       0         0          0:1     |
+--------------------------------------------------------------------+
|   0      2       MIG 1c.2g.12gb       0         1          1:1     |
+--------------------------------------------------------------------+

# And the list of the GPUs:
$ nvidia-smi -L
GPU 0: NVIDIA A30 (UUID: GPU-d8430827-89da-89df-1d70-bde0c9883859)
  MIG 1c.2g.12gb  Device  0: (UUID: MIG-89b757f7-3c4c-5b1c-9476-a11b14aa9308)
  MIG 1c.2g.12gb  Device  1: (UUID: MIG-bdf043cc-8668-599c-95b0-5b640d416440)
  MIG 1c.2g.12gb  Device  2: (UUID: MIG-6df26cf8-a984-58d3-978e-acb0c808d513)
  MIG 1c.2g.12gb  Device  3: (UUID: MIG-6f79e1f8-f1ed-5121-976a-4ff3f0df8634)
GPU 1: NVIDIA A30 (UUID: GPU-d1ee693e-571b-7664-f22b-1312585630d0)
GPU 2: NVIDIA A30 (UUID: GPU-fc7c8f73-47c4-7b21-c9da-858462fa0433)
GPU 3: NVIDIA A30 (UUID: GPU-bef3638b-2c97-ea4b-83d9-b1a6b7d2fc29)

# And the list of the MIG devices from nvidia-smi (only the MIG devices table is shown):
$ nvidia-smi
+---------------------------------------------------------------------------------------+
| MIG devices:                                                                          |
+------------------+--------------------------------+-----------+-----------------------+
| GPU  GI  CI  MIG |                   Memory-Usage |        Vol|      Shared           |
|      ID  ID  Dev |                     BAR1-Usage | SM     Unc| CE ENC DEC OFA JPG    |
|                  |                                |        ECC|                       |
|==================+================================+===========+=======================|
|  0    1   0   0  |              25MiB / 11968MiB  | 14      0 |  2   0    2    0    0 |
|                  |               0MiB / 16383MiB  |           |                       |
+------------------+                                +-----------+-----------------------+
|  0    1   1   1  |                                | 14      0 |  2   0    2    0    0 |
|                  |                                |           |                       |
+------------------+--------------------------------+-----------+-----------------------+
|  0    2   0   2  |              25MiB / 11968MiB  | 14      0 |  2   0    2    0    0 |
|                  |               0MiB / 16383MiB  |           |                       |
+------------------+                                +-----------+-----------------------+
|  0    2   1   3  |                                | 14      0 |  2   0    2    0    0 |
|                  |                                |           |                       |
+------------------+--------------------------------+-----------+-----------------------+
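The device names in these listings follow the compute instance naming scheme, which can be taken apart mechanically. The sketch below uses a hypothetical function name, `parse_ci_profile`, and assumes names of the form `<N>c.<M>g.<K>gb`. A name without the `<N>c.` prefix is treated as a compute instance that spans the whole GPU instance, which matches the 28-SM `MIG 2g.12gb` row in the -lcip table.

```shell
# Hypothetical helper: split a compute instance profile name such as
# "1c.2g.12gb" into "<compute slices> <GPU slices> <memory in GB>".
parse_ci_profile() {
    name=${1#"MIG "}                  # drop the optional "MIG " prefix
    case $name in
        *c.*g.*) compute=${name%%c.*}; name=${name#*c.} ;;
        *)       compute= ;;
    esac
    gpu=${name%%g.*}                  # GPU slices of the parent instance
    mem=${name#*g.}
    mem=${mem%gb}
    # No "<N>c." prefix: the compute instance covers all GPU slices.
    [ -n "$compute" ] || compute=$gpu
    printf '%s %s %s\n' "$compute" "$gpu" "$mem"
}

parse_ci_profile "MIG 1c.2g.12gb"    # prints: 1 2 12
parse_ci_profile "MIG 2g.12gb"       # prints: 2 2 12
```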

OK, now I can run four CUDA programs concurrently, one per compute instance. The commands look like this:

CUDA_VISIBLE_DEVICES=MIG-89b757f7-3c4c-5b1c-9476-a11b14aa9308 ./my_cuda_code &
CUDA_VISIBLE_DEVICES=MIG-bdf043cc-8668-599c-95b0-5b640d416440 ./my_cuda_code &
CUDA_VISIBLE_DEVICES=MIG-6df26cf8-a984-58d3-978e-acb0c808d513 ./my_cuda_code &
CUDA_VISIBLE_DEVICES=MIG-6f79e1f8-f1ed-5121-976a-4ff3f0df8634 ./my_cuda_code &
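The same pattern can be written as a loop over a list of UUIDs. This is a minimal sketch under stated assumptions: the function name `launch_per_mig` and its `dry-run` mode are hypothetical, and the UUIDs are read from stdin, one per line.

```shell
# Hypothetical helper: start one copy of a program per MIG device UUID.
# Reads UUIDs on stdin. The first argument is the mode: "dry-run" only
# prints what would be started, anything else launches the program in
# the background and waits for all copies to finish.
launch_per_mig() {
    mode=$1
    shift
    while IFS= read -r uuid; do
        if [ "$mode" = "dry-run" ]; then
            printf 'CUDA_VISIBLE_DEVICES=%s %s\n' "$uuid" "$*"
        else
            CUDA_VISIBLE_DEVICES=$uuid "$@" &
        fi
    done
    wait
}

printf '%s\n' \
    MIG-89b757f7-3c4c-5b1c-9476-a11b14aa9308 \
    MIG-bdf043cc-8668-599c-95b0-5b640d416440 |
    launch_per_mig dry-run ./my_cuda_code
```

Replacing `dry-run` with `run` would actually start the processes, each one seeing only its own compute instance.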