Change NUMA node on the fly
NUMA (Non-Uniform Memory Access)
NUMA is a computer memory design used in multiprocessing, where the memory access time depends on the memory location relative to the processor. Under NUMA, a processor can access its own local memory faster than non-local memory (memory local to another processor or memory shared between processors). The benefits of NUMA are limited to particular workloads, notably on servers where the data is often associated with certain tasks or users.
NUMA-aware programming is important, as it can have a significant impact on the performance of applications. Not only the memory bandwidth and latency are different between NUMA nodes, but also the communication between threads running on different NUMA nodes and specific accelerators (e.g., GPUs) can be affected. Therefore, it is important to understand the NUMA architecture and how to program for it. Good news is that it is possible to select the NUMA node on which a program runs, and to migrate a running thread to another NUMA node during runtime.
On Linux, to see the NUMA nodes available on the system, use the numactl
command:
numactl --hardware
If you have NVIDIA GPUs on the system, and you want to check their NUMA affinity, use the nvidia-smi
command:
nvidia-smi topo -map
If you have network interfaces on the system, and you want to check their NUMA affinity, use the lstopo
command:
lstopo --output-format png --output lstopo.png
NUMA static selection
If you have a program that you know it should run on a specific NUMA node, you can use the numactl
command to run the program with a specific NUMA policy. For example, to run a program on NUMA node 1, use the following command:
# Run the program on NUMA node 1
numactl --cpunodebind=1 --membind=1 ./program
NUMA dynamic selection
If you want to change the NUMA node on which a program’s thread runs during runtime, you can utilize the numa
library. The following example shows how to use this function to get the NUMA nodes and migrate a running thread to another NUMA node:
#include <iostream>
#include <vector>
#include <numa.h>
int main() {
int cpu, node, max_nodes;
std::vector<int> nodes;
if (numa_available() < 0) {
std::cerr << "NUMA not available" << std::endl;
return 1;
} else {
std::cout << "NUMA available" << std::endl;
}
cpu = sched_getcpu();
node = numa_node_of_cpu(cpu);
std::cout << "Current thread running on CPU " << cpu << ", NUMA node " << node << std::endl;
max_nodes = numa_max_node() + 1;
// Usually larger numbers of NUMA nodes are for non-physical nodes
max_nodes = std::min(max_nodes, 32);
std::cout << "Maximum ID of NUMA nodes: " << max_nodes << std::endl;
for (int i = 0; i < max_nodes; ++i) {
if (numa_bitmask_isbitset(numa_all_nodes_ptr, i)) {
std::cout << "NUMA Node " << i << " is available" << std::endl;
nodes.push_back(i);
}
}
// Get information about the available memory of each NUMA node
for(auto node : nodes) {
std::cout << "NUMA Node " << node << " has " <<
numa_node_size64(node, nullptr) / 1024 / 1024 << " MB of memory" << std::endl;
}
// Cycle through the NUMA nodes, and migrate the thread to the next NUMA node
for(auto node : nodes) {
numa_run_on_node(node);
cpu = sched_getcpu();
node = numa_node_of_cpu(cpu);
std::cout << "Thread running on CPU " << cpu << ", NUMA node " << node << std::endl;
}
return 0;
}
Compile and run the program:
g++ -lnuma -o numa numa.cpp
./numa