

# Enhancing Intra-Node GPU-to-GPU Performance in MPI+UCX through Multi-Path Communication

AmirHossein Sojoodi, Yiltan Hassan Temucin, and Ahmad Afsahi

Parallel Processing Research Laboratory (PPRL), Department of Electrical and Computer Engineering

Smith Engineering, Queen's University, Kingston, Canada

ExHET 2024, Edinburgh, Scotland, March 2, 2024

| Queens | Introduction | Background | Design | Evaluation | Conclusion | <b>2</b> / 15<br>ExHET 2024 |
|--------|--------------|------------|--------|------------|------------|-----------------------------|

- Introduction
- Background
- Design and Evaluation
- Conclusion








- Utilize available interconnects to transfer data more efficiently



Figure 1: A multi-GPU data transfer







Figure 2: MPI and UCX high-level software stack





1. Multi-GPU Awareness
2. Path Selection
3. Communication Scheduling
4. Path Optimization
5. Data Integrity
6. Low Overhead
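As a rough illustration of the path-selection item above, the sketch below enumerates candidate routes between two GPUs in one node: the direct link, one route staged through each other GPU, and one staged through host memory. The device names, the `host` label, and the bottleneck model are assumptions made for this example, not part of the framework's interface.

```python
from typing import Dict, FrozenSet, List, Sequence, Tuple

Path = Tuple[str, ...]


def candidate_paths(src: str, dst: str, gpus: Sequence[str],
                    host: str = "host") -> List[Path]:
    """List the direct path plus every single-hop staged path from src to dst."""
    if src == dst:
        raise ValueError("source and destination must differ")
    paths: List[Path] = [(src, dst)]          # direct GPU-to-GPU path
    for gpu in gpus:
        if gpu not in (src, dst):
            paths.append((src, gpu, dst))     # staged through another GPU
    paths.append((src, host, dst))            # staged through host memory
    return paths


def bottleneck_bandwidth(path: Path,
                         link_bw: Dict[FrozenSet[str], float]) -> float:
    """Assumed model: a path is only as fast as its slowest hop."""
    return min(link_bw[frozenset(hop)] for hop in zip(path, path[1:]))


if __name__ == "__main__":
    for p in candidate_paths("gpu0", "gpu1", ["gpu0", "gpu1", "gpu2", "gpu3"]):
        print(" -> ".join(p))
```

On a four-GPU node this yields four candidates per GPU pair. A real implementation would query the topology rather than take it as an argument.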







Figure 3: Framework Design integrated with UCT\_CUDA module

**2-D Pipeline Design** 

The message is split dynamically.
Chunk sizes vary with the number of paths and the tuning parameters.
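A minimal sketch of the splitting step, assuming each path receives a share of the message proportional to a weight and that each share is then cut into a tunable number of pipeline chunks. The function name, the weights, and the per-path chunk counts are hypothetical and stand in for the tuning parameters mentioned above.

```python
from typing import List, NamedTuple, Sequence


class Chunk(NamedTuple):
    path: int     # index of the path that carries this chunk
    offset: int   # byte offset into the message
    length: int   # chunk length in bytes


def split_message(total_bytes: int,
                  path_weights: Sequence[float],
                  chunks_per_path: Sequence[int]) -> List[Chunk]:
    """Split a message across paths (first dimension), then into chunks (second)."""
    if total_bytes < 0:
        raise ValueError("total_bytes must be non-negative")
    if not path_weights or len(path_weights) != len(chunks_per_path):
        raise ValueError("need one weight and one chunk count per path")
    if any(w < 0 for w in path_weights) or sum(path_weights) <= 0:
        raise ValueError("weights must be non-negative with a positive sum")
    if any(n < 1 for n in chunks_per_path):
        raise ValueError("each path needs at least one chunk")

    weight_sum = sum(path_weights)
    chunks: List[Chunk] = []
    offset = 0
    remaining = total_bytes
    last = len(path_weights) - 1
    for path, (weight, count) in enumerate(zip(path_weights, chunks_per_path)):
        # The last path takes whatever rounding left over, so no byte is lost.
        share = remaining if path == last else min(
            remaining, int(total_bytes * weight / weight_sum))
        remaining -= share
        base, extra = divmod(share, count)
        for i in range(count):
            length = base + (1 if i < extra else 0)
            if length:
                chunks.append(Chunk(path, offset, length))
                offset += length
    return chunks


if __name__ == "__main__":
    for chunk in split_message(1000, [3, 1], [2, 1]):
        print(chunk)
```

The chunks are contiguous and cover the whole message, so each path can copy its pieces independently and the receiver needs no reassembly step beyond the offsets.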










### Ensuring Data Integrity

1. Data Corruption when staging
2. Data Dependency
3. Memory Ordering
4. Synchronization
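The concerns listed above can be illustrated with a CPU-only analogy, in which two threads stand in for the two hops of a staged transfer and queues stand in for the events that order them. This is a sketch under those assumptions and not the framework's mechanism: a staging buffer is overwritten only after the second hop has drained it, and the second hop reads a buffer only after the first hop has finished writing it.

```python
import queue
import threading


def staged_copy(src: bytes, chunk_size: int, num_staging_buffers: int = 2) -> bytes:
    """Copy src to a new buffer through a small pool of staging buffers."""
    if chunk_size <= 0 or num_staging_buffers <= 0:
        raise ValueError("chunk_size and num_staging_buffers must be positive")

    staging = [bytearray(chunk_size) for _ in range(num_staging_buffers)]
    free: "queue.Queue[int]" = queue.Queue()   # buffers that are safe to overwrite
    ready: "queue.Queue" = queue.Queue()       # (buffer, offset, length) filled by hop 1
    for index in range(num_staging_buffers):
        free.put(index)
    dst = bytearray(len(src))

    def hop1() -> None:
        # Source to staging buffer.
        for offset in range(0, len(src), chunk_size):
            length = min(chunk_size, len(src) - offset)
            buf = free.get()                   # wait until hop 2 released this buffer
            staging[buf][:length] = src[offset:offset + length]
            ready.put((buf, offset, length))   # publish only after the copy is done
        ready.put(None)                        # end-of-message marker

    def hop2() -> None:
        # Staging buffer to destination.
        while True:
            item = ready.get()                 # wait until hop 1 finished a chunk
            if item is None:
                break
            buf, offset, length = item
            dst[offset:offset + length] = staging[buf][:length]
            free.put(buf)                      # buffer may now be reused

    threads = [threading.Thread(target=hop1), threading.Thread(target=hop2)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return bytes(dst)


if __name__ == "__main__":
    payload = bytes(range(256)) * 10
    print(staged_copy(payload, chunk_size=100) == payload)
```

Dropping either wait (reusing a buffer early, or reading it before it is filled) is exactly the kind of corruption the list refers to.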



## Experimental Setup

- Compute Canada's Beluga
  - 4 x NVIDIA V100 per node
  - Each GPU pair has two NVLinks
- Compute Canada's Narval
  - 4 x NVIDIA A100 per node
  - Each GPU pair has four NVLinks
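For orientation, the link counts above translate into rough peak numbers as follows. The sketch assumes 25 GB/s per direction per NVLink link, the commonly quoted figure for the NVLink generations on V100 and A100; that value and the idealized multi-path bound are assumptions of this example, not measurements from the talk.

```python
# Assumption (not from the slides): 25 GB/s per direction per NVLink link.
NVLINK_GBPS_PER_LINK = 25.0


def direct_pair_bandwidth(links_per_pair: int,
                          per_link: float = NVLINK_GBPS_PER_LINK) -> float:
    """Peak one-direction bandwidth of the direct path between two GPUs."""
    return links_per_pair * per_link


def ideal_multipath_bandwidth(links_per_pair: int, num_gpus: int,
                              per_link: float = NVLINK_GBPS_PER_LINK) -> float:
    """Idealized upper bound: the direct path plus one staged path through
    each other GPU, every path limited only by its NVLink hops.
    Ignores host staging, copy overheads, and contention."""
    paths = 1 + (num_gpus - 2)
    return paths * direct_pair_bandwidth(links_per_pair, per_link)


if __name__ == "__main__":
    for name, links in (("Beluga (V100)", 2), ("Narval (A100)", 4)):
        print(name,
              direct_pair_bandwidth(links),
              ideal_multipath_bandwidth(links, num_gpus=4))
```

Achieved bandwidth will fall below these bounds, since staged paths add an extra copy and share resources with other traffic.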



Figure 5: A typical Multi-GPU node







OSU Micro-Benchmarks (OMB) Bandwidth







Jacobi Iterative Solver – Communication Pattern
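The communication pattern of a Jacobi solver can be sketched with a 1-D simplification: the grid is split into two subdomains, each keeps one halo cell copied from its neighbour, and the halos are exchanged after every sweep. The plain list assignments below stand in for the GPU-to-GPU messages; the function names and the 1-D setting are assumptions chosen to keep the example short.

```python
from typing import List


def jacobi_step(u: List[float]) -> List[float]:
    """One Jacobi sweep on a 1-D grid; the first and last entries stay fixed."""
    interior = [0.5 * (u[i - 1] + u[i + 1]) for i in range(1, len(u) - 1)]
    return [u[0]] + interior + [u[-1]]


def jacobi_two_domains(u: List[float], iterations: int) -> List[float]:
    """Same sweeps, computed on two subdomains with a halo exchange per iteration."""
    if len(u) < 2:
        raise ValueError("need at least two grid points")
    mid = len(u) // 2
    left = u[:mid] + [u[mid]]         # owned cells plus a right halo cell
    right = [u[mid - 1]] + u[mid:]    # left halo cell plus owned cells
    for _ in range(iterations):
        left = jacobi_step(left)
        right = jacobi_step(right)
        # Halo exchange: each side receives its neighbour's new boundary value.
        left[-1] = right[1]
        right[0] = left[-2]
    return left[:-1] + right[1:]


if __name__ == "__main__":
    grid = [1.0] + [0.0] * 7
    reference = grid
    for _ in range(10):
        reference = jacobi_step(reference)
    print(jacobi_two_domains(grid, 10) == reference)
```

In a multi-GPU run each halo message is one point-to-point transfer per neighbour per iteration, which is the traffic a multi-path design would split across routes.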



Figure 9: Jacobi Iterative Solver – Dual Path Communication





Figure 10: Jacobi Iterative Solver – Performance Evaluation



#### Conclusions

- Generalized Multi-Path within UCX
- Future Directions
  - Performance Modeling
  - Dynamic communication scheduling
    - Topology awareness
    - Automatically adapt to ongoing communication
    - Online tuning (e.g., through MPI\_Info objects)



#### **Acknowledgments**

- Natural Sciences and Engineering Research Council of Canada
- Digital Research Alliance of Canada

# **Thank You!**

Instead of cursing the darkness, better light a candle!



# Questions, Comments, and Ideas are Welcome!

