Utilize PTX Just-In-Time (JIT) Compilation in CUDA


In this post, I describe how to use PTX Just-In-Time (JIT) compilation in CUDA. PTX is a low-level, assembly-like intermediate language for GPU code. The NVIDIA driver compiles PTX to machine code for the installed GPU at runtime, a process called Just-In-Time (JIT) compilation. Before getting to the how, here is some background on why you might want to use it.

Background Scenario

In this scenario, you want to load a CUDA kernel at runtime as a CUDA module, extract a CUDA function from it, and then query that function for information such as the number of registers used per thread or the amount of shared memory it needs. Or you may want to choose the number of blocks and threads based on SM occupancy, so that all blocks are resident at once and inter-block synchronization via cooperative groups is valid.

Furthermore, if the kernel is defined in a separate .cu file and you query this information from C or C++ code, it makes more sense to compile the kernel separately to a PTX or FATBIN file and load that file at runtime.

Loading the CUDA module

To create a PTX file or a FATBIN file, you can use the following commands:

# Create a PTX file
nvcc -ptx -o kernel.ptx kernel.cu
# Create a FATBIN file
nvcc -fatbin -o kernel.fatbin kernel.cu

Then you can load the CUDA module at runtime with something like this (CUDA Driver API):

#include <cuda.h>
#include <iostream>

int main() {
  
  cuInit(0);

  // select the first device
  CUdevice device;
  cuDeviceGet(&device, 0);

  CUcontext context;
  cuCtxCreate(&context, 0, device);

  CUmodule module;
  CUfunction function;
  CUresult result;

  result = cuModuleLoad(&module, "kernel.ptx");
  if (result != CUDA_SUCCESS) {
    std::cerr << "Failed to load the module." << std::endl;
    return 1;
  }

  result = cuModuleGetFunction(&function, module, "kernel");
  if (result != CUDA_SUCCESS) {
    std::cerr << "Failed to get the function." << std::endl;
    return 1;
  }

  // Do something with the function
  // ...
  return 0;
}

Then build the main source with:

nvcc -o main main.cu -lcuda
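
The "Do something with the function" placeholder above could, for instance, run the queries mentioned in the background section. The following is a minimal sketch, not a definitive implementation: queryKernel and maxCooperativeGrid are hypothetical names, the block size is whatever you plan to launch with, and the __has_include guard is only an assumption so that the arithmetic helper still compiles on a machine without the CUDA toolkit. The driver calls themselves (cuFuncGetAttribute, cuOccupancyMaxActiveBlocksPerMultiprocessor, cuDeviceGetAttribute) are from the CUDA Driver API.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdio>

// Largest grid that can be fully resident on the device. A cooperative launch
// with grid-wide synchronization must not use more blocks than this.
int maxCooperativeGrid(int blocksPerSm, int smCount) {
  return std::max(0, blocksPerSm) * std::max(0, smCount);
}

#if defined(__has_include)
#if __has_include(<cuda.h>)
#include <cuda.h>

// Hypothetical helper: prints the resource usage of `function` and returns the
// cooperative grid limit for `blockSize`, or -1 if any driver call fails.
inline int queryKernel(CUfunction function, CUdevice device, int blockSize,
                       std::size_t dynamicSmemBytes) {
  int regs = 0, staticSmem = 0, blocksPerSm = 0, smCount = 0;
  if (cuFuncGetAttribute(&regs, CU_FUNC_ATTRIBUTE_NUM_REGS, function) != CUDA_SUCCESS ||
      cuFuncGetAttribute(&staticSmem, CU_FUNC_ATTRIBUTE_SHARED_SIZE_BYTES, function) != CUDA_SUCCESS ||
      cuOccupancyMaxActiveBlocksPerMultiprocessor(&blocksPerSm, function, blockSize,
                                                  dynamicSmemBytes) != CUDA_SUCCESS ||
      cuDeviceGetAttribute(&smCount, CU_DEVICE_ATTRIBUTE_MULTIPROCESSOR_COUNT,
                           device) != CUDA_SUCCESS) {
    return -1;
  }
  std::printf("registers/thread: %d, static shared memory: %d bytes\n", regs, staticSmem);
  std::printf("resident blocks/SM at block size %d: %d, SMs: %d\n", blockSize,
              blocksPerSm, smCount);
  return maxCooperativeGrid(blocksPerSm, smCount);
}
#endif
#endif
```

With the loader above, a call such as queryKernel(function, device, 256, 0) would go where the placeholder comment is; the value 256 is only an example block size.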

Thoughts and Improvements

All in all, it’s a straightforward process. However, there are some issues:

  • The build process is a bit complicated, especially if you are using CMake.
  • The program has to locate the FATBIN/PTX file correctly at runtime, and shipping a separate file alongside the executable is not always desirable.

So, why not store the PTX code directly in the source file itself? This way, you avoid the extra build step and the file management.

Just a reminder: the JIT process does not accept CUDA C++ source, only PTX. So you need to compile the CUDA code to PTX first. (It took me two hours to figure this out!) Here is how you can do it:

#include <cuda.h>
#include <iostream>

// PTX code generated by:
// nvcc -ptx -o kernel.ptx kernel.cu
const char* kernel = R"(
  .version 6.5
  .target sm_70
  .address_size 64

  .visible .entry kernel(
    .param .u64 kernel_param_0
  )
  {
    // Kernel code here
  }
)";

int main() {
  
  cuInit(0);

  // select the first device
  CUdevice device;
  cuDeviceGet(&device, 0);

  CUcontext context;
  cuCtxCreate(&context, 0, device);

  CUmodule module;
  CUfunction function;
  CUresult result;

  result = cuModuleLoadData(&module, kernel);
  if (result != CUDA_SUCCESS) {
    std::cerr << "Failed to load the module." << std::endl;
    return 1;
  }

  result = cuModuleGetFunction(&function, module, "kernel");
  if (result != CUDA_SUCCESS) {
    std::cerr << "Failed to get the function." << std::endl;
    return 1;
  }

  // Do something with the function
  // ...
  return 0;
}
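
Pasting PTX into the source by hand gets tedious once the kernel changes often. One possible way around that is a small generator that wraps the PTX text in a raw string literal and writes it out as a header. The sketch below shows only the wrapping step; makePtxHeader and the variable name passed to it are hypothetical, and reading kernel.ptx and writing the header file are left to the caller.

```cpp
#include <string>

// Hypothetical helper: wraps PTX text in a C++ raw string literal so it can be
// written to a header and included by the host code. The custom delimiter
// "PTX" keeps a stray )" inside the PTX from ending the literal early.
std::string makePtxHeader(const std::string& variableName, const std::string& ptx) {
  std::string out = "// Generated file, do not edit by hand.\n";
  out += "static const char " + variableName + "[] = R\"PTX(\n";
  out += ptx;
  // Make sure the closing delimiter starts on its own line.
  if (ptx.empty() || ptx.back() != '\n') out += '\n';
  out += ")PTX\";\n";
  return out;
}
```

The generated array can then be passed to cuModuleLoadData in place of the hand-written kernel string.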

Add Error Management

Sometimes the PTX code contains errors, or the JIT compilation fails for some other reason. To handle this, you can pass CUjit_option values to cuModuleLoadDataEx to get the compiler's info and error logs. Here is an example:

#include <cuda.h>
#include <cstdint>  // uintptr_t
#include <iostream>

// PTX code generated by:
// nvcc -ptx -o kernel.ptx kernel.cu
const char* kernel = R"(
  .version 6.5
  .target sm_70
  .address_size 64

  .visible .entry kernel(
    .param .u64 kernel_param_0
  )
  {
    // Kernel code here
  }
)";

int main() {

  cuInit(0);

  // select the first device
  CUdevice device;
  cuDeviceGet(&device, 0);

  CUcontext context;
  cuCtxCreate(&context, 0, device);

  CUmodule module;
  CUfunction function;
  CUresult result;

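  const int logBufferSize = 1024;
  // Zero-initialize so the buffers are valid C strings even if the driver writes nothing.
  char infoLogBuffer[logBufferSize] = {0};
  char errorLogBuffer[logBufferSize] = {0};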
  CUjit_option options[] = {CU_JIT_INFO_LOG_BUFFER_SIZE_BYTES, CU_JIT_INFO_LOG_BUFFER,
                            CU_JIT_ERROR_LOG_BUFFER_SIZE_BYTES, CU_JIT_ERROR_LOG_BUFFER};
  void* optionValues[] = {(void*)(uintptr_t)logBufferSize, infoLogBuffer,
                          (void*)(uintptr_t)logBufferSize, errorLogBuffer}; 

  result = cuModuleLoadDataEx(&module, kernel, 4, options, optionValues);
  if (result != CUDA_SUCCESS) {
    std::cerr << "Failed to load the module." << std::endl;
    std::cerr << "CUDA Driver API error = " << result << std::endl;
    std::cerr << "Info Log: " << infoLogBuffer << std::endl;
    std::cerr << "Error Log: " << errorLogBuffer << std::endl;
    return 1;
  }

  result = cuModuleGetFunction(&function, module, "kernel");
  if (result != CUDA_SUCCESS) {
    std::cerr << "Failed to get the function." << std::endl;
    return 1;
  }

  // Do something with the function
  // ...
  return 0;
}

One last suggestion: if your kernel is simple and does not need architecture-specific optimization, you can generate the PTX for an old architecture such as sm_50. PTX is forward-compatible, so the driver can JIT-compile it for that architecture and for newer ones, which covers most GPUs (well, not all of them!).

nvcc -ptx -o kernel.ptx -arch=sm_50 kernel.cu

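P.S. Pay attention to the .version 6.5 directive in the PTX code. If the driver on the target system is older than the toolkit that generated the PTX and does not support that PTX version, loading the module fails with a runtime error (typically CUDA_ERROR_UNSUPPORTED_PTX_VERSION). You may want to edit that field manually, as I didn't find a way to set it automatically.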
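
If you end up editing that field often, the edit can be scripted on the PTX string before it is handed to the driver. This is a minimal sketch under two assumptions: setPtxVersion is a hypothetical helper, and the first occurrence of .version in the text is the directive itself (not a comment). Lowering the version only helps when the PTX body uses nothing introduced after the version you write; otherwise the JIT compiler will still reject it.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: returns a copy of `ptx` in which the first
// ".version X.Y" line is replaced by ".version <newVersion>".
// The input is returned unchanged when no such directive is found.
std::string setPtxVersion(const std::string& ptx, const std::string& newVersion) {
  const std::string key = ".version";
  const std::size_t start = ptx.find(key);
  if (start == std::string::npos) return ptx;
  // The directive runs to the end of its line.
  std::size_t end = ptx.find('\n', start);
  if (end == std::string::npos) end = ptx.size();
  std::string out = ptx;
  out.replace(start, end - start, key + " " + newVersion);
  return out;
}
```

The patched string can be passed to cuModuleLoadDataEx as before, and the error log from the previous section will show whether the driver accepted it.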