

### LeGendre Pairs?!

- Again, this talk is not about LeGendre Pairs!
  - Thank you, Alex for helping!
- It is about optimizing it.
  - But how to improve a program without fully understanding it?
- Only realize fundamental features, e.g.:
  - Various sections
  - Code semantics
  - Data and control flow
  - Repetition
  - Memory access pattern
  - Memory packing/unpacking
  - ...





•

•

•

٠

### Section 2017 - General Observations

| Length 45                                                     | Length 117                                                       |
|---------------------------------------------------------------|------------------------------------------------------------------|
| 3800+ lines of code                                           | • 8500+ lines of code                                            |
| Several nested loops -> huge search space                     | <ul> <li>Several nested loops -&gt; huge search space</li> </ul> |
| - 126 * 84 <sup>4</sup>                                       | - 126 * 84 <sup>12</sup>                                         |
| Very large inner-most loop.                                   | Even larger inner-most loop.                                     |
| Two main sections:                                            | Two main sections:                                               |
| <ol> <li>Populate an array =&gt; fixed length (45)</li> </ol> | 1. Populate an array => fixed length (117)                       |
| 2. Perform computation and check some conditions on them      | 2. Perform computation and check some conditions on them         |
| Loop iterations:                                              | Loop iterations:                                                 |
| <ul> <li>Imbalanced and independent</li> </ul>                | <ul> <li>Imbalanced and independent</li> </ul>                   |
| Deterministic                                                 | Deterministic                                                    |
| <ul> <li>No random variable</li> </ul>                        | <ul> <li>No random variable</li> </ul>                           |
| Several variables that can be packed                          | Several variables that can be packed                             |
|                                                               |                                                                  |

### Seminder: Multi-threaded Version and Parallel Efficiency

- How does increasing cores affect performance?
- Let's call OpenMP for help!
  - It is an API for shared-memory multiprocessing programming in C/C++, and Fortran.
- How to tell if increasing cores will definitely improve the performance?
  - Parallel efficiency

### **Reminder:** Parallel Efficiency

The parallel efficiency of a program is the ratio of the speedup factor S(n) and the number of processors. Efficiency = S(n) / n

| # | Version (LP 45)                        | No of CPU cores | Speedup vs. #1 | Parallel Efficiency |  |  |  |
|---|----------------------------------------|-----------------|----------------|---------------------|--|--|--|
| 1 | with -O3 compiler flag                 | 1               | 1              | -                   |  |  |  |
| 2 | #1 with 2 OMP threads (outermost loop) | 2               | 1.91x          | 0.95                |  |  |  |
| 3 | #1 with 4 OMP threads (outermost loop) | 4               | 3.76x          | 0.94                |  |  |  |
| 4 | #1 with 6 OMP threads (outermost loop) | 6               | 5.66x          | 0.94                |  |  |  |

• This virtually means that as long as we have free cores, we can improve the performance.

## **Reminder: GPU Processing Flow**

Grid

\*\*\*\*\*\*\*\*\*\*\*

- How can we run a code in GPU? •
- Keywords to remember: •
  - SM (Streaming Multiprocessor) -
  - ThreadBlock & GridBlock
  - Warp (SIMD execution)
  - Kernel



### Section 245 - CUDA versions

- Profile and monitor GPU codes:
  - Nvtop
  - NVIDIA Visual Profiler (nvvp)
  - NVIDIA Nsight Systems (nsys)
  - NVIDIA Nsight Compute (ncu)
- Results are gathered from NVIDIA GeForce GTX 1060 3GB, and RTX 3070 8GB
  - Need to measure data transfers, too.
  - Table of results



- Simultaneous Multiprocessors
- SIMD (or SIMT) Architectures
- Warp scheduling



| SM        | SM                                                              |           |           |           |           |           |                         |      |                                 |                                |                      |           |           |           |              |           |      |        |  |
|-----------|-----------------------------------------------------------------|-----------|-----------|-----------|-----------|-----------|-------------------------|------|---------------------------------|--------------------------------|----------------------|-----------|-----------|-----------|--------------|-----------|------|--------|--|
|           | L1 Instruction Cache                                            |           |           |           |           |           |                         |      |                                 |                                |                      |           |           |           |              |           |      |        |  |
|           | L0 Instruction Cache                                            |           |           |           |           |           |                         |      |                                 |                                | L0 Instruction Cache |           |           |           |              |           |      |        |  |
|           | Warp Scheduler (32 thread/clk)<br>Dispatch Unit (32 thread/clk) |           |           |           |           |           |                         |      |                                 | Warp Scheduler (32 thread/clk) |                      |           |           |           |              |           |      |        |  |
|           |                                                                 |           |           |           |           |           |                         |      |                                 | Dispatch Unit (32 thread/clk)  |                      |           |           |           |              |           |      |        |  |
|           | Register File (16,384 x 32-bit)                                 |           |           |           |           |           |                         |      | Register File (16,384 x 32-bit) |                                |                      |           |           |           |              |           |      |        |  |
| F         | P64                                                             | INT       | INT       | FP32      | FP32      |           |                         |      |                                 | FP64                           | INT                  | INT       | FP32      | FP32      |              |           |      |        |  |
| F         | P64                                                             | INT       | INT       | FP32      | FP32      |           |                         |      |                                 | FP64                           | INT                  | INT       | FP32      | FP32      |              |           |      |        |  |
| F         | P64                                                             | INT       | INT       | FP32      | FP32      |           |                         |      |                                 | FP64                           | INT                  | INT       | FP32      | FP32      |              |           |      |        |  |
| F         | P64                                                             | INT       | INT       | FP32      | FP32      |           | NSOR TENSOR<br>DRE CORE |      |                                 | F                              | FP64                 | INT       | INT INT F | FP32      | FP32         |           | ISOR | TENSOR |  |
| F         | P64                                                             | INT       | INT       | FP32      | FP32      | CO        |                         | CORE |                                 | FP64                           | INT                  | INT       | FP32      | FP32      | CC           | DRE       | CORE |        |  |
| F         | FP64                                                            |           | INT       | FP32      | FP32      |           |                         |      |                                 | FP64                           | INT                  | INT       | FP32      | FP32      |              |           |      |        |  |
| F         | P64                                                             | INT       | INT       | FP32      | FP32      |           |                         |      |                                 | FP64                           | INT                  | INT       | FP32      | FP32      | H            |           |      |        |  |
| F         | P64                                                             | INT       | INT       | FP32      | FP32      |           |                         |      | $\left[ \right]_{-}$            | FP64                           | INT                  | INT       | FP32      | FP32      | $\mathbb{H}$ |           |      |        |  |
| LD/<br>ST | LD/<br>ST                                                       | LD/<br>ST | LD/<br>ST | LD/<br>ST | LD/<br>ST | LD/<br>ST | LD/<br>ST               | SFU  | $\left[ \right]_{-}$            | LD/ LD/<br>ST ST               | LD/<br>ST            | LD/<br>ST | LD/<br>ST | LD/<br>ST | LD/<br>ST    | LD/<br>ST | SFU  |        |  |

#### 8

Images' source: NVIDIA and Illawarra Dragon Boat Club



• SIMT architecture and warp execution model



• Why is branch divergence important?!

### Solution – Remove Branch Divergence

- Figure out permutations: 2 options
  - Dynamically by each thread
  - Statically and inform all threads through (constant) memory
    - But some parts of constant memory is already used. Solution?

### Solution - CUDA V2, Constant Memory

- GPUs have a dedicated memory section for constant memory.
- Utilizing constant memory significantly decreases the frequent memory accesses latency
- Let's see the code.

### Security CUDA V3, Shared Memory

- Problem with local variables and register spilling
  - Compiler utilizes Global Memory to store local variables
  - Global memory is cached, but if we want to specifically cache some variable, we should use Shared Memory

### Security CUDA V4, Loop Unrolling

- Some of the operations are redundantly done by each thread.
- Unrolling and moving some loops around would be helping with that.

### Section CUDA V5, Fuse PSD Calculations

- It turns out that inner-loop computations were not random!
- Take PSDs[4] into account as an example.

### Security CUDA V6, Removing Shared Memory!

- NVIDIA Visual Profiler to the rescue
- GPU Occupancy is low because we have used up the shared memory available

### Secure CUDA V7, Cache A3 and A13

- Some of the calculations are redundant
- We can store and re-use those as intermediate computations.



### Security CUDA V8, Ctrl + Z !

- Other than some minor optimization, revert everything.
- Why did this happen?
- Curse of optimization before validation

### Several other optimizations that did not help

- Utilizing FFTW library on CPU
- Moving another dimension inside the kernel
- Move two other dimensions inside the kernel
- Moving A to global memory
- Perform reduction using shared memory.
  - This is really cool by the way, but it didn't help... pff



• Let's dive right into the code!



- The same optimizations apply to L117 version.
- Some of them are device-specific, but they have little share of performance.
- Git and Make are your friends
- Profilers and debuggers are your best friends
- Never give up when it comes to optimization!
  - Just remember the cycle for each update



# Thank You 🕑

Instead of blaming darkness, let's light a candle!