NVIDIA Management Library (NVML)

Published:
4 minute read

In this post, I will provide two examples on how to use the NVIDIA Management Library (NVML) to query the GPU information. The first example is a simple C++ program that prints the GPU power and temperature, and the second one is about getting PCIe info.

NVML

The NVIDIA Management Library (NVML) is a C-based API for monitoring and managing various states of NVIDIA GPU devices. The library is used to query the GPU information, such as power, temperature, and utilization. The library is also used to manage the GPU devices, such as setting the power limit and the clock speed. The complete documentation of NVML can be found here. Some famous tools that use NVML are nvidia-smi and nvtop.

Example 1: Power and Temperature

The following C++ program prints the GPU power and temperature.

#include <iostream>
#include <nvml.h>

#define CHECK_NVML(result, message) \
    if (result != NVML_SUCCESS) { \
        std::cerr << message << ": " << nvmlErrorString(result) << std::endl; \
        return 1; \
    }

int main() {
    
    nvmlDevice_t device;
    unsigned int power;
    unsigned int temperature;

    CHECK_NVML(nvmlInit(), "Failed to initialize NVML");
    CHECK_NVML(nvmlDeviceGetHandleByIndex(0, &device), "Failed to get device handle");
    CHECK_NVML(nvmlDeviceGetPowerUsage(device, &power), "Failed to get power usage");
    CHECK_NVML(nvmlDeviceGetTemperature(device, NVML_TEMPERATURE_GPU, &temperature), "Failed to get temperature");
    
    std::cout << "Power: " << power << " mW" << std::endl;
    std::cout << "Temperature: " << temperature << " C" << std::endl;

    nvmlShutdown();
    return 0;
}

The program uses the NVML library, therefore, it should be compiled with the -lnvidia-ml flag.

g++ -o nvml_power_temp nvml_power_temp.cpp -lnvidia-ml
./nvml_power_temp
# If you want a specific GPU, you can set the CUDA_VISIBLE_DEVICES environment variable:
# CUDA_VISIBLE_DEVICES=0 ./nvml_power_temp

Example 2: Some PCIe Information

The following C++ program prints some PCIe information.

#include <iostream>
#include <nvml.h>

#define CHECK_NVML(result, message) \
    if (result != NVML_SUCCESS) { \
        std::cerr << message << ": " << nvmlErrorString(result) << std::endl; \
        return 1; \
    }

int main() {
    
    unsigned int device_count;
    
    CHECK_NVML(nvmlInit(), "Failed to initialize NVML");
    CHECK_NVML(nvmlDeviceGetCount(&device_count), "Failed to get device count");
    
    for (unsigned int i = 0; i < device_count; i++) {
        nvmlDevice_t device;
        nvmlPciInfo_t pci_info;
        unsigned int speed;
        unsigned int max_speed;
        unsigned int max_width;
        unsigned int curr_width;
        unsigned int tx_throughput;
        unsigned int rx_throughput;

        CHECK_NVML(nvmlDeviceGetHandleByIndex(i, &device), 
                    "Failed to get device handle");
        CHECK_NVML(nvmlDeviceGetPciInfo(device, &pci_info), 
                    "Failed to get PCI info");
        CHECK_NVML(nvmlDeviceGetMaxPcieLinkWidth(device, &max_width), 
                    "Failed to get PCIe link max width");
        CHECK_NVML(nvmlDeviceGetCurrPcieLinkWidth(device, &curr_width), 
                    "Failed to get PCIe link current width");
        // For whatever reasons, the following metric doesn't make sense to me.
        CHECK_NVML(nvmlDeviceGetPcieLinkMaxSpeed(device, &max_speed), 
                    "Failed to get PCIe link max speed");
        CHECK_NVML(nvmlDeviceGetPcieSpeed(device, &speed), 
                    "Failed to get PCIe link speed");
        CHECK_NVML(nvmlDeviceGetPcieThroughput(device, NVML_PCIE_UTIL_TX_BYTES, &tx_throughput), 
                    "Failed to get PCIe throughput");
        CHECK_NVML(nvmlDeviceGetPcieThroughput(device, NVML_PCIE_UTIL_RX_BYTES, &rx_throughput), 
                    "Failed to get PCIe throughput");

        std::cout << "Device " << i << ":" << std::endl;
        std::cout << "  Bus ID: " << pci_info.busId << std::endl;
        std::cout << "  PCIe Max Link Width: " << max_width << " lanes" << std::endl;
        std::cout << "  PCIe Current Link Width: " << curr_width << " lanes" << std::endl;
        std::cout << "  PCIe Max Link Speed: " << max_speed << " MB/s" << std::endl;
        std::cout << "  PCIe Link Speed: " << speed << " MB/s" << std::endl;
        std::cout << "  PCIe TX Throughput: " << tx_throughput << " KB/s" << std::endl;
        std::cout << "  PCIe RX Throughput: " << rx_throughput << " KB/s" << std::endl;
    }

    nvmlShutdown();
    return 0;
}

Again, the program uses the NVML library, therefore, it should be compiled with the -lnvidia-ml flag.