



- Graphical Processing Units (GPUs)
  - Many cores and tremendous memory bandwidth
- GPUs vs. CPUs
  - Slower
  - Throughput-oriented
  - Let's look inside!



#### Hopper Architecture

H100 GPU Key features









- 1. Shared Memory
- 2. CUDA Streams
- 3. Unified Memory
- 4. Dynamic Parallelism
- 5. Hyper-Q
- 6. Warp-Level Primitives (Shuffle Instructions)
- 7. MPS

- 8. NCCL library (It's not a feature!)
- 9. Cooperative Groups
- 10. CUDA Graphs
- 11. Multi Instance GPU (MIG)
- 12. Async-Copy
- 13. Thread Collectives
- 14. C++17 STL, templates, pointer aliasing...

#### **GPUs architecture – Simultaneous Multiprocessor (SM)**

| SI | SM                                                              |           |           |           |           |                |                |           |      |                                                                 |           |           |           |           |                 |  |  |
|----|-----------------------------------------------------------------|-----------|-----------|-----------|-----------|----------------|----------------|-----------|------|-----------------------------------------------------------------|-----------|-----------|-----------|-----------|-----------------|--|--|
|    |                                                                 |           |           |           |           |                |                | L1 Instru | ctic | on Cache                                                        |           |           |           |           |                 |  |  |
|    | L0 Instruction Cache                                            |           |           |           |           |                |                |           |      | L0 Instruction Cache                                            |           |           |           |           |                 |  |  |
|    | Warp Scheduler (32 thread/clk)<br>Dispatch Unit (32 thread/clk) |           |           |           |           |                |                |           |      | Warp Scheduler (32 thread/clk)<br>Dispatch Unit (32 thread/clk) |           |           |           |           |                 |  |  |
|    |                                                                 |           |           |           |           |                |                |           |      |                                                                 |           |           |           |           |                 |  |  |
|    | Register File (16,384 x 32-bit)                                 |           |           |           |           |                |                |           |      | Register File (16,384 x 32-bit)                                 |           |           |           |           |                 |  |  |
|    | FP64                                                            | INT       | INT       | FP32      | FP32      | $\square$      |                |           |      | FP64                                                            | INT       | INT       | FP32      | FP32      |                 |  |  |
|    | FP64                                                            | INT       | INT       | FP32      | FP32      | $\square$      |                |           |      | FP64                                                            | INT       | INT       | FP32      | FP32      |                 |  |  |
|    | FP64                                                            | INT       | INT       | FP32      | FP32      | TENSOR<br>CORE | TENSOR<br>CORE |           | FP64 | INT                                                             | INT       | FP32      | FP32      |           |                 |  |  |
|    | FP64                                                            | INT       | INT       | FP32      | FP32      |                |                |           | FP64 | INT                                                             | INT       | FP32      | FP32      | CORE      |                 |  |  |
|    | FP64                                                            | INT       | INT       | FP32      | FP32      |                |                |           | FP64 | INT                                                             | INT       | FP32 FP32 | FP32      |           | CORE            |  |  |
|    | FP64                                                            | INT       | INT       | FP32      | FP32      |                |                |           | FP64 | INT                                                             | INT       | FP32      | FP32      |           |                 |  |  |
|    | FP64                                                            | INT       | INT       | FP32      | FP32      |                |                |           | FP64 | INT                                                             | INT       | FP32      | FP32      |           |                 |  |  |
|    | FP64                                                            | INT       | INT       | FP32      | FP32      |                |                |           |      | FP64                                                            | INT       | INT       | FP32      | FP32      |                 |  |  |
|    | LD/ LD/<br>ST ST                                                | LD/<br>ST | LD/<br>ST | LD/<br>ST | LD/<br>ST | LD/<br>ST      | LD/<br>ST      | SFU       |      | LD/ LD/<br>ST ST                                                | LD/<br>ST | LD/<br>ST | LD/<br>ST | LD/<br>ST | LD/ LD<br>ST ST |  |  |











#### Solution UVA vs. Unified Memory

- Unified memory depends on UVA
- UVA does NOT move data automatically between CPU and GPU.



Advantages:

•

- Ease of programming
- Data is migrated on demand
- Very efficient with complex data structures (e.g. linked list)
- Disadvantage
  - Carefully tuned CUDA program that uses streams to efficiently overlap execution with data transfers may perform better than a CUDA program that only uses Unified Memory.



Images source: https://nichijou.co/cudaRandom-UVA/

### Cooperative Groups

- Intra-block synchronization
- Inter-block synchronization
- Tiled groups



SYNCHRONIZE AT ANY SCALE

Three Key Capabilities



## **NVIDIA Hyper-Q**

 Multiple work queues between the host and the GPU





 Enable co-operative multi-process CUDA applications, typically MPI jobs, to utilize Hyper-Q capabilities



## CUDA Async Copy

- Overlaps copying data from global to shared memory with computation
- Avoids the use of intermediate registers or the L1 cache.
- Benefits:
  - Control flow no longer traverses the memory pipeline twice
  - Not using intermediate registers can reduce

register pressure and increase occupancy

//Without async-copy

using namespace nvcuda::experimental; \_\_shared\_\_ extern int smem[];

// algorithm loop iteration
while ( ... ) {

\_\_syncthreads();

```
// load element into shared mem
for ( i = ... ) {
    // uses intermediate register
    // {int tmp=g[i]; smem[i]=tmp;}
    smem[i] = gldata[i];
}
```

//With async-copy

using namespace nvcuda::experimental; \_\_shared\_\_ extern int smem[];

pipeline pipe;

```
// algorithm loop iteration while ( \dots ) {
```

\_\_\_syncthreads();

```
// wait for async-copy to complete
pipe.commit_and_wait();
```

\_\_syncthreads();

```
/* compute on smem[] */
```

### Multi-Instance GPU

 Allows the NVIDIA A100 GPU to be securely partitioned into up to seven separate GPU Instances for CUDA applications



MULTI-INSTANCE GPU ("MIG")

## **CUDA Thread Collectives**

• CUDA 11 improvements on top of cooperative

groups

// Simple Reduction Sum
#include <cooperative\_groups/reduce.h>

```
...
const int threadId = cta.thread_rank();
int val = A[threadId];
// reduce across tiled partition
reduceArr[threadId] = cg::reduce(tile, val, cg::plus<int>());
// synchronize partition
cg::sync(cta);
// accumulate sum using a leader and return sum
```

### Intra- and Inter-node GPU-Aware Communications

GPU

Host

- CUDA-aware MPI
  - Transfer data buffers across the GPUs efficiently
- Send GPU buffers directly instead of staging GPU buffers through the host memory
- Not only P2P Communications, but also onesided and collectives.



Figure 6: CUDA-aware MPI source: <u>https://developer.nvidia.com/blog/benchmarking-cuda-aware-mpi</u>

#### **Distributive**

## Deep Learning on GPU Clusters

• Data Parallelism vs. Model Parallelism





Figure 9: Data Parallelism vs. Data Parallelism, source: <u>https://frankdenneman.nl/2020/02/19/multi-gpu-and-distributed-deep-learning/</u>

#### Solution Contension Contensi Contension Contension Contension Contension Contension Cont

- CUDA-aware MPI
  - Transfer data buffers across the GPUs efficiently
- GPUDirect RDMA (GDR)
  - Enables on-node or off-node GPUs to directly exchange data without staging it on the host memory.
- GPUDirect P2P
  - Enables the same feature between the GPUs of a node.



# Thank You 🕑

Instead of blaming darkness, let's light a candle!



## Questions, Comments, and Ideas are Welcome!

