NUMA affinity and CPU pinning for GPU pipelines¶
Scope: binding CPU threads, host memory, and PyTorch DataLoader workers to the GPU's local NUMA node so input pipelines never pay a cross-node hop. Covers deriving a GPU's NUMA node (nvidia-smi topo -m, sysfs numa_node, NVML), pinning with numactl --cpunodebind/--membind, re-pinning workers via worker_init_fn, and the pin_memory / persistent_workers flags that keep page-locked buffers local.
What it is¶
A NUMA (Non-Uniform Memory Access) node is a physical grouping of CPU cores, memory channels, and PCIe-attached devices (GPUs, NICs) that sit close together on one socket. Access to resources inside the node is fast; access across the inter-socket link (UPI/xGMI) is slower. On a typical dual-socket DGX-class server the GPUs split across nodes, for example GPUs 0-3 on node 0, GPUs 4-7 on node 1.
CPU pinning (a.k.a. CPU affinity) forces a process or thread to run only on a chosen set of cores; memory binding forces its allocations to come from a chosen node's DRAM. NUMA-aware pinning means: for each GPU, run its feeder process and allocate its host buffers on the same node the GPU is attached to. The data path memory to CPU to GPU then stays inside one node.
The book frames this as the most common reason GPUs miss full utilization: the CPU prepares the next batch (load from disk, tokenize, transform) and dispatches kernels; if those host-side tasks are slow or badly scheduled, the GPU idles waiting for data (Fregly, Ch. 3).
Why use it¶
Crossing a NUMA boundary costs latency on every access:
- The book reports a measured ~80 ns local vs ~139 ns remote DRAM access on a dual-socket system, roughly a 75% increase, and notes access latency "can nearly double" cross-node (Fregly, Ch. 3).
- Left to itself, Linux NUMA auto-balancing may migrate a process across nodes mid-run, silently turning local accesses into remote ones. The book is explicit that built-in balancing "is usually not sufficient for performance-critical AI workloads" (Fregly, Ch. 3).
- In practice the book reports 5%-10% training-throughput improvement from eliminating cross-NUMA traffic and core migrations, plus reduced jitter and variance (Fregly, Ch. 3).
The failure mode is input starvation: a profiler shows low GPU utilization with gaps on the compute timeline while the data pipeline catches up. Pinning removes the unpredictable scheduler migration that causes a data-loading thread to jump to a remote-node core in the middle of a step. This is the host-side complement to keeping the GPU itself busy. See goodput for the system-level "useful work fraction" framing and roofline / arithmetic intensity for classifying whether you are actually input-bound versus compute-bound.
On Grace-based coherent superchips (GH200, GB200/Grace Blackwell, Vera Rubin) the CPU and GPU share a coherent address space over NVLink-C2C at up to ~900 GB/s. This reduces transfer overhead, but Linux still models CPU DRAM and GPU HBM as separate pools, so binding CPU threads to the local Grace CPU node still helps locality (Fregly, Ch. 3).
When to use it (and when not)¶
Pin when:
- GPU utilization is below target with timeline gaps and the bottleneck is the host input pipeline (data loading, tokenization, augmentation).
- You run one process per GPU (DDP / torchrun) on a multi-socket box and want each rank glued to its GPU's node.
- You observe step-time variance / jitter traceable to scheduler migrations.
- Throughput is high enough that single-digit-percent host wins are worth the cost (long training runs, large clusters).
Do not bother (or expect little) when:
- The box is single-socket / single NUMA node: numactl --hardware shows available: 1 nodes. There is no remote node to avoid.
- The workload is GPU-compute-bound with a trivial input pipeline (e.g. synthetic data, on-GPU generation); the CPU is idle anyway.
- A higher-level scheduler already enforces topology affinity. Under Kubernetes the Topology Manager plus GPU Operator (and cpuset cgroups) can place pods NUMA-locally; under Slurm use --cpu-bind. Pinning twice is redundant; verify what the scheduler already does before adding numactl.
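The single-node check can also be done from Python before any pinning logic runs. The following is a minimal sketch, assuming a Linux host where each online NUMA node appears as a nodeN directory under /sys/devices/system/node; the helper names list_numa_nodes and pinning_worthwhile are illustrative, not part of any library.

```python
from __future__ import annotations

import re
from pathlib import Path


def list_numa_nodes(sysfs_root: str = "/sys/devices/system/node") -> list[int]:
    """Return the ids of NUMA nodes that have a nodeN directory under sysfs."""
    root = Path(sysfs_root)
    if not root.is_dir():
        return []  # not Linux, or sysfs not mounted: nothing to pin against
    ids: list[int] = []
    for entry in root.iterdir():
        match = re.fullmatch(r"node(\d+)", entry.name)
        if match and entry.is_dir():
            ids.append(int(match.group(1)))
    return sorted(ids)


def pinning_worthwhile(nodes: list[int]) -> bool:
    """With zero or one node there is no remote node to avoid."""
    return len(nodes) > 1


if __name__ == "__main__":
    nodes = list_numa_nodes()
    print(f"NUMA nodes: {nodes}; pinning worthwhile: {pinning_worthwhile(nodes)}")
```

This only answers whether a remote node exists. Whether the job is input-bound, and whether a scheduler already pins it, still has to be checked separately.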
CPU pinning is independent of, but pairs well with, pin_memory (page-locked H2D buffers, below), hugepages, and vm.swappiness=0. See performance optimization and diagnostics tools for the surrounding OS-tuning surface.
Architecture¶
The layout is a two-level locality hierarchy. Each socket is one NUMA node holding a slice of the cores, a bank of local DRAM, and the GPUs and NICs on its PCIe root complex. Within a node the path memory to CPU to GPU is a short, fast hop; between nodes every access traverses the inter-socket UPI/xGMI link at higher latency. NUMA-aware pinning keeps a GPU's feeder process, its page-locked host buffers, and its DataLoader workers all on the GPU's local node, so the diagram's dashed cross-node arrow never fires.
flowchart LR
subgraph N0["NUMA node 0 (socket 0)"]
C0["CPU cores 0-23,48-71"]
M0["Local DRAM (~80 ns)"]
G0["GPU 0 (HBM)"]
NIC0["NIC"]
end
subgraph N1["NUMA node 1 (socket 1)"]
C1["CPU cores 24-47,72-95"]
M1["Local DRAM"]
G1["GPU 4 (HBM)"]
end
C0 -->|"local: memory -> CPU -> GPU"| M0
M0 -->|"pinned H2D DMA"| G0
C0 --- NIC0
N0 <-->|"UPI/xGMI inter-socket link (~139 ns, ~75% slower)"| N1
C0 -.->|"cross-node hop (avoid: remote access)"| M1
The control flow that maps onto this layout has four stages, each a HOW facet below: discover which node a GPU lives on (topology), pin a process to that node (numactl), re-pin the framework's DataLoader workers that numactl does not reach (worker_init_fn), and verify the binding actually took (numastat). The core primitive threaded through all four is deriving the exact core set for a node from its sysfs cpulist, validated below.
How to use it: discover the topology and pin a process¶
Discover the topology¶
List the nodes and the cores in each with numactl --hardware.
Map each GPU to its NUMA node and affine cores. nvidia-smi topo -m prints a matrix whose last columns are CPU Affinity and NUMA Affinity:
Here GPU0 is on NUMA node 0 with cores 0-23,48-71; GPU1 is on node 1. (SYS in the legend means the GPU-to-GPU path crosses the inter-socket link, the same boundary you are avoiding for host traffic.)
The authoritative kernel source is sysfs. Read a GPU's node directly from its PCI device:
# PCI_BUS_ID from `nvidia-smi --query-gpu=pci.bus_id --format=csv,noheader`,
# lowercased and with the 8-digit PCI domain shortened to sysfs's 4 digits,
# e.g. 00000000:03:00.0 -> 0000:03:00.0
cat /sys/bus/pci/devices/0000:03:00.0/numa_node # prints the node id, or -1 if unknown
A value of -1 means the firmware did not expose affinity; treat it as "unknown" and fall back to node 0 rather than trusting it (kernel sysfs-bus-pci ABI).
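The same lookup can be sketched in Python. This is a minimal illustration, assuming nvidia-smi reports the bus id with an 8-hex-digit domain (for example 00000000:3B:00.0) while sysfs uses the kernel's 4-digit form. The helper names to_sysfs_pci_addr and read_numa_node are hypothetical, and the sketch returns None for an unknown node so the caller can apply whichever fallback it prefers.

```python
from __future__ import annotations

from pathlib import Path


def to_sysfs_pci_addr(nvsmi_bus_id: str) -> str:
    """Convert an nvidia-smi bus id to sysfs form, e.g. '00000000:3B:00.0' -> '0000:3b:00.0'."""
    domain, rest = nvsmi_bus_id.strip().lower().split(":", 1)
    # sysfs prints the domain as at least 4 hex digits.
    return f"{int(domain, 16):04x}:{rest}"


def read_numa_node(pci_addr: str, sysfs_root: str = "/sys/bus/pci/devices") -> int | None:
    """Return the device's NUMA node, or None when it is missing, unreadable, or -1."""
    try:
        raw = int((Path(sysfs_root) / pci_addr / "numa_node").read_text().strip())
    except (OSError, ValueError):
        return None
    return raw if raw >= 0 else None


if __name__ == "__main__":
    addr = to_sysfs_pci_addr("00000000:03:00.0")
    print(addr, read_numa_node(addr))
```

Returning None rather than 0 keeps "unknown" distinguishable from "node 0" until the last moment. A caller following this page's host-side rule would then substitute node 0.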
Pin a single process with numactl¶
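When you already know the node, bind both CPU and memory by launching the command under numactl --cpunodebind=<node> --membind=<node>.
--cpunodebind=<node> restricts execution to that node's cores; --membind=<node> forces allocations from that node's DRAM and fails the allocation rather than spilling remote (numactl(8)). Both the CPU affinity mask and the memory policy are inherited by fork-ed children and preserved across execve (sched_setaffinity(2), set_mempolicy(2)), which is how numactl passes them to the command it launches. They are not immutable, though: a launcher, container runtime, or library that sets its own affinity can override them later, which is why per-worker re-pinning (in the integration section) is still worth doing.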
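When the launch is driven from Python rather than a shell, the command line can be assembled as below. This is a minimal sketch: numactl_argv is a hypothetical helper, train.py stands in for the real entry point, and an unknown node (None or negative) deliberately yields the unwrapped command so nothing is bound to a bogus node.

```python
from __future__ import annotations

import shlex


def numactl_argv(node: int | None, command: list[str]) -> list[str]:
    """Prefix `command` with a numactl CPU + memory bind, or leave it unbound if the node is unknown."""
    if node is None or node < 0:
        return list(command)
    return ["numactl", f"--cpunodebind={node}", f"--membind={node}", *command]


if __name__ == "__main__":
    # prints: numactl --cpunodebind=1 --membind=1 python train.py
    print(shlex.join(numactl_argv(1, ["python", "train.py"])))
```

The resulting list can be passed directly to subprocess.Popen; building argv as a list avoids shell quoting problems.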
Core primitive: derive a node's cores from its cpulist¶
Every step above depends on one small algorithm: turn a Linux sysfs cpulist string such as 0-23,48-71 into the exact set of core ids to pin to, and apply the -1 -> node 0 fallback for an unknown numa_node. This is pure host logic, so it is validated here with a numpy-only, self-asserting block (no torch, psutil, or libnuma required). It checks the exact matrix rows above, boundary inputs, a corruption case (a trailing comma must not inject a phantom core 0 that would mis-pin a rank onto the wrong node), equivalence to a slow reference over 2000 randomized inputs, and the node-disjointness invariant that makes pinning correct in the first place.
from __future__ import annotations

import numpy as np


def parse_cpu_list(s: str) -> list[int]:
    """Parse a Linux sysfs cpulist, e.g. '0-3,8-11' -> [0,1,2,3,8,9,10,11]."""
    cpus: list[int] = []
    for part in filter(None, (p.strip() for p in s.split(","))):
        if "-" in part:
            lo, hi = (int(x) for x in part.split("-"))
            cpus.extend(range(lo, hi + 1))
        else:
            cpus.append(int(part))
    return cpus


def resolve_numa_node(raw: int) -> int:
    """sysfs numa_node rule: return node if >= 0 else fall back to 0 (unknown)."""
    return raw if raw >= 0 else 0


def _slow_reference_parse(s: str) -> list[int]:
    """Deliberately naive reference: expand every token by brute-force loop."""
    out: list[int] = []
    for tok in s.split(","):
        tok = tok.strip()
        if not tok:
            continue
        bounds = tok.split("-")
        if len(bounds) == 1:
            out.append(int(bounds[0]))
        else:
            start, end = int(bounds[0]), int(bounds[1])
            i = start
            while i <= end:
                out.append(i)
                i += 1
    return out


# 1. Happy path: the exact matrix rows from the topo table above.
assert parse_cpu_list("0-23,48-71") == list(range(0, 24)) + list(range(48, 72))
assert parse_cpu_list("24-47,72-95") == list(range(24, 48)) + list(range(72, 96))

# 2. Boundary values.
assert parse_cpu_list("0") == [0]            # single core, lowest id
assert parse_cpu_list("5-5") == [5]          # degenerate range lo == hi
assert parse_cpu_list("") == []              # empty cpulist -> no cores
assert parse_cpu_list(" 0 , 2 ") == [0, 2]   # whitespace around tokens

# 3. Adversarial corruption: a trailing comma must NOT inject a phantom core 0.
#    Injecting 0 would silently pin a rank onto the wrong node's core.
assert parse_cpu_list("24-47,") == list(range(24, 48))
assert 0 not in parse_cpu_list("24-47,")

# 4. Equivalence to a slow reference over randomized inputs.
rng = np.random.default_rng(0)
for _ in range(2000):
    parts: list[str] = []
    for _ in range(int(rng.integers(1, 5))):
        lo = int(rng.integers(0, 128))
        hi = lo + int(rng.integers(0, 16))  # hi >= lo, so lo-hi is valid
        parts.append(f"{lo}-{hi}" if hi != lo else f"{lo}")
    s = ",".join(parts)
    assert parse_cpu_list(s) == _slow_reference_parse(s), f"mismatch on {s!r}"

# 5. Node-disjointness invariant: two nodes must expose DISJOINT core sets, or
#    binding a rank to "its" node could still land it on the other GPU's cores.
node0 = np.array(parse_cpu_list("0-23,48-71"))
node1 = np.array(parse_cpu_list("24-47,72-95"))
assert np.intersect1d(node0, node1).size == 0
assert node0.size == node1.size == 48  # 24 physical + 24 SMT siblings

# 6. numa_node resolution rule: -1 ("unknown") falls back to node 0.
assert resolve_numa_node(0) == 0
assert resolve_numa_node(1) == 1
assert resolve_numa_node(-1) == 0

print("PASS: all cpulist + numa-node assertions held")
How to integrate it: re-pin DataLoader workers in PyTorch¶
A numactl binding is inherited by child processes, but it is not guaranteed to survive in framework-managed workers: launchers, container runtimes, and libraries that set their own affinity can override it. Reassert affinity inside each worker via worker_init_fn, which PyTorch calls "on each worker subprocess with the worker id (an int in [0, num_workers - 1]) as input, after seeding and before data loading" (PyTorch DataLoader docs). Do not touch any torch.cuda.* API in a worker: the PyTorch docs advise against CUDA use in multi-process loading, and a forked worker cannot re-initialize CUDA, so pass the GPU index by closure or env var instead (PyTorch DataLoader docs).
The block below is a reference template: it requires torch, psutil, and libnuma (apt install libnuma-dev) plus a real multi-node host with sysfs, none of which run in a generic CI sandbox. Its core host-side math (the cpulist parse feeding _cpus_for_node, and the sysfs numa_node resolution feeding gpu_numa_node) is the parse_cpu_list / resolve_numa_node logic validated in the numpy block above.
# Reference template: needs torch + psutil + libnuma + a real NUMA host.
from __future__ import annotations

import ctypes
import os
from functools import partial

import psutil
import torch
from torch.utils.data import DataLoader

# libnuma for memory-policy binding (apt install libnuma-dev / numactl)
_libnuma = ctypes.CDLL("libnuma.so")
if _libnuma.numa_available() < 0:
    raise RuntimeError("NUMA not available on this system")
_libnuma.numa_run_on_node.argtypes = [ctypes.c_int]
_libnuma.numa_set_preferred.argtypes = [ctypes.c_int]


def _parse_cpu_list(s: str) -> list[int]:
    """Parse '0-3,8-11' -> [0,1,2,3,8,9,10,11]."""
    cpus: list[int] = []
    for part in filter(None, (p.strip() for p in s.split(","))):
        if "-" in part:
            lo, hi = (int(x) for x in part.split("-"))
            cpus.extend(range(lo, hi + 1))
        else:
            cpus.append(int(part))
    return cpus


def _cpus_for_node(node: int) -> list[int]:
    """Cores belonging to a NUMA node, from sysfs."""
    with open(f"/sys/devices/system/node/node{node}/cpulist") as f:
        return _parse_cpu_list(f.read().strip())


def gpu_numa_node(device: int) -> int:
    """Resolve a CUDA device's NUMA node via sysfs PCI; -1 -> 0."""
    try:
        # Assumes a PyTorch build whose device properties expose the integer
        # pci_domain_id / pci_bus_id / pci_device_id fields; builds without
        # them raise AttributeError and fall back to node 0.
        p = torch.cuda.get_device_properties(device)
        pci = f"{p.pci_domain_id:04x}:{p.pci_bus_id:02x}:{p.pci_device_id:02x}.0"
        with open(f"/sys/bus/pci/devices/{pci}/numa_node") as f:
            node = int(f.read().strip())
        return node if node >= 0 else 0
    except (AttributeError, FileNotFoundError, ValueError):
        return 0


def bind_to_node(node: int) -> list[int]:
    """Bind current process CPUs + memory policy to a NUMA node."""
    cpus = _cpus_for_node(node)
    psutil.Process(os.getpid()).cpu_affinity(cpus)
    _libnuma.numa_run_on_node(node)   # restrict execution to node
    _libnuma.numa_set_preferred(node)  # prefer node-local allocations
    return cpus


def _worker_init(worker_id: int, node: int, cpus: list[int]) -> None:
    """Reapply binding inside each worker. No CUDA calls here."""
    psutil.Process(os.getpid()).cpu_affinity(cpus)
    _libnuma.numa_run_on_node(node)
    _libnuma.numa_set_preferred(node)


def make_loader(dataset, device: int) -> DataLoader:
    node = gpu_numa_node(device)
    cpus = bind_to_node(node)
    return DataLoader(
        dataset,
        batch_size=32,
        num_workers=4,
        pin_memory=True,          # page-locked H2D buffers, allocated node-local
        persistent_workers=True,  # keep workers (and their affinity) across epochs
        prefetch_factor=2,
        worker_init_fn=partial(_worker_init, node=node, cpus=cpus),
    )
In the training loop, copy with non_blocking=True so the page-locked buffer overlaps with compute:
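A minimal sketch of that copy step, assuming each batch is an (inputs, targets) tuple of tensors produced by a DataLoader with pin_memory=True. The names copy_batch_to_device and train_one_epoch are illustrative, and step_fn is a hypothetical stand-in for the forward/backward/optimizer step.

```python
def copy_batch_to_device(batch, device):
    """Issue an asynchronous host-to-device copy for every tensor in the batch."""
    # non_blocking=True only overlaps with compute when the source is page-locked.
    return tuple(t.to(device, non_blocking=True) for t in batch)


def train_one_epoch(loader, step_fn, device):
    """Iterate the loader, move each batch to the device, and run one step per batch."""
    steps = 0
    for batch in loader:
        inputs, targets = copy_batch_to_device(batch, device)
        step_fn(inputs, targets)
        steps += 1
    return steps
```

The helper relies only on the tensors' .to(device, non_blocking=True) method, so it works with any loader that yields tensor tuples.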
Why each flag matters:
- pin_memory=True copies batches into CUDA page-locked (non-pageable) host memory, which the GPU can DMA directly; the book reports pinned-to-device copies 2-3x faster than pageable, and turning the flag on alone giving up to 10%-20% end-to-end improvement on data-heavy loops (Fregly, Ch. 3; PyTorch DataLoader docs). The flag only pays off fully if the page-locked buffer sits on the GPU's local node. PyTorch's pin-memory thread runs in the training process, so bind that process as well as its workers.
- persistent_workers=True keeps workers alive across epochs so they are not torn down and re-created (and re-pinned) at every epoch boundary (Fregly, Ch. 3; PyTorch DataLoader docs).
- non_blocking=True with pin_memory lets the H2D transfer run asynchronously on a copy stream (see CUDA streams and concurrency).
How to run it in production: per-GPU launch and locked memory¶
Per-GPU launch loop (derive node, then pin)¶
For a multi-GPU node where the mapping is not hardcoded, query the topology per GPU and launch each rank pinned to its local node:
#!/usr/bin/env bash
set -euo pipefail

for GPU in 0 1 2 3; do
  # Pull the NUMA Affinity column for this GPU's row from the topo matrix.
  # Assumes NUMA Affinity is the last column. Requiring a numeric last field
  # also skips the header row, whose first label can be GPU0 as well.
  NODE=$(nvidia-smi topo -m \
    | awk -v g="GPU${GPU}" '$1==g && $NF ~ /^-?[0-9]+$/ {print $NF}')
  # Skip if unknown (-1), empty, or not a single node id: fall back to
  # unbound rather than mis-bind.
  if [[ ! "${NODE}" =~ ^[0-9]+$ ]]; then
    echo "GPU${GPU}: NUMA node unknown, launching unbound" >&2
    CUDA_VISIBLE_DEVICES=${GPU} python train.py --gpu "${GPU}" &
  else
    numactl --cpunodebind="${NODE}" --membind="${NODE}" \
      bash -c "CUDA_VISIBLE_DEVICES=${GPU} python train.py --gpu ${GPU}" &
  fi
done
wait
This follows the book's topo-driven pattern, hardened: the book's example used nvidia-smi topo -m -i $GPU, but the robust form parses the full matrix and selects the row by GPU label, and it guards the unknown (-1) case instead of binding to a bogus node (Fregly, Ch. 3).
Raise the locked-memory limit¶
Pinned allocations count against ulimit -l (max locked memory). Large pinned buffers fail or fall back to swappable memory if the limit is too low; the book recommends setting it high or unlimited for AI/HPC workloads (Fregly, Ch. 3):
ulimit -l unlimited # interactive shell
docker run --ulimit memlock=-1 ... # container: lift the memlock cap
In Kubernetes set the equivalent via the pod security context / container limits.
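A job can also check the limit it actually inherited at startup. This is a minimal sketch using Python's standard resource module (Unix only); needed_bytes is an assumption the caller must supply, since the amount of pinned memory a pipeline needs depends on batch size, prefetch depth, and worker count.

```python
import resource


def memlock_sufficient(soft_limit: int, needed_bytes: int) -> bool:
    """True if a locked-memory soft limit (in bytes) covers the requested amount."""
    return soft_limit == resource.RLIM_INFINITY or soft_limit >= needed_bytes


def check_memlock(needed_bytes: int) -> bool:
    """Compare this process's RLIMIT_MEMLOCK soft limit against needed_bytes."""
    soft, _hard = resource.getrlimit(resource.RLIMIT_MEMLOCK)
    return memlock_sufficient(soft, needed_bytes)


if __name__ == "__main__":
    one_gib = 1 << 30  # illustrative figure, not a recommendation
    if not check_memlock(one_gib):
        print("warning: locked-memory limit is below 1 GiB; raise ulimit -l")
```

Running this inside the container or pod, rather than on the host, confirms that the limit set in the launch configuration actually reached the process.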
How to maintain it: verify the binding¶
Pinning is silent when it works and silent when it does not. Verify, do not assume.
numactl --show # physcpubind + preferred node of current proc
cat /proc/<pid>/status | grep -i cpus_allowed_list
taskset -cp <pid> # cores a running pid may use
numastat -p <pid> # per-node memory pages actually used by the pid
numastat -p <pid> is the acid test: if a process pinned to node 1 shows pages on node 0, some allocation predated the bind or escaped the policy, so re-check that the memory bind happened before the first large allocation. Cross-check GPU utilization with Nsight profiling / observability monitoring: if utilization gaps close and step variance drops after pinning, the input pipeline was the bottleneck.
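The CPU side of this check can be automated. The sketch below is a minimal illustration that parses the Cpus_allowed_list line from the text of /proc/<pid>/status and reports any allowed core that lies outside the intended node; the helper names are hypothetical, and the node core list is the example layout used on this page.

```python
from __future__ import annotations


def parse_cpulist(s: str) -> list[int]:
    """Expand a cpulist such as '0-3,8' into [0, 1, 2, 3, 8]."""
    cpus: list[int] = []
    for part in filter(None, (p.strip() for p in s.split(","))):
        lo, _, hi = part.partition("-")
        cpus.extend(range(int(lo), int(hi or lo) + 1))
    return cpus


def cpus_allowed(status_text: str) -> list[int]:
    """Extract the allowed cores from the contents of /proc/<pid>/status."""
    for line in status_text.splitlines():
        if line.startswith("Cpus_allowed_list:"):
            return parse_cpulist(line.split(":", 1)[1])
    raise ValueError("Cpus_allowed_list not found")


def stray_cores(allowed: list[int], node_cores: list[int]) -> list[int]:
    """Cores the process may run on that are NOT part of the intended node."""
    return sorted(set(allowed) - set(node_cores))


if __name__ == "__main__":
    node1 = parse_cpulist("24-47,72-95")  # example node-1 layout from this page
    with open("/proc/self/status") as f:
        print("stray cores:", stray_cores(cpus_allowed(f.read()), node1))
```

An empty result means the affinity mask is confined to the node. It says nothing about where memory landed, which is still numastat's job.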
How to scale it¶
Scaling from one GPU to a full node and then to a cluster keeps the same rule (each rank on its GPU's local node) but changes who enforces it:
- One process per GPU on a node. Run DDP / torchrun with one rank per GPU and pin each rank to its GPU's node using the per-GPU launch loop above. The mapping is not fixed across SKUs, so derive the node per GPU from nvidia-smi topo -m (or sysfs) rather than hardcoding, and guard the -1 unknown case so a rank launches unbound instead of mis-bound.
- Across a cluster, let the scheduler own topology. At cluster scale a higher-level scheduler usually places work NUMA-locally already: Kubernetes via the Topology Manager plus GPU Operator and cpuset cgroups, Slurm via --cpu-bind. Where that holds, do not pin twice; verify what the scheduler did (the maintenance section's numactl --show / numastat -p) and only add numactl where it did not.
- Coherent superchips (GH200, GB200/Grace Blackwell, Vera Rubin). The CPU and GPU share a coherent address space over NVLink-C2C at up to ~900 GB/s, which shrinks software copy overhead, but Linux still models CPU DRAM and GPU HBM as separate pools. Binding CPU threads to the local Grace CPU node still helps locality; the win is smaller because coherence removes much of the copy cost (Fregly, Ch. 3).
- Judge the payoff at scale. The reported win is single-digit percent (5%-10% throughput, plus reduced jitter and variance), so it compounds on long runs and large fleets where a few percent of GPU-hours is material, and it is not worth chasing on a single-socket box or a compute-bound job with a trivial input pipeline (Fregly, Ch. 3).
Failure modes¶
- Trusting numa_node = -1. A -1 from sysfs (or an empty NUMA Affinity column) means the firmware did not expose affinity, not "node -1". Binding to it is a mis-bind; fall back to node 0 (host code) or launch unbound (launch loop) instead (kernel sysfs-bus-pci ABI).
- Assuming numactl reaches DataLoader workers. The binding is inherited across fork and preserved across exec, but a launcher, container runtime, or library that sets its own affinity can override it, so framework worker subprocesses can run unbound while the parent looks correctly pinned. Reassert affinity inside worker_init_fn (numactl(8); PyTorch DataLoader docs).
- Calling torch.cuda.* inside a worker. PyTorch advises against CUDA use in DataLoader workers, and a forked worker cannot re-initialize CUDA, so such calls can crash the worker; pass the GPU index by closure or env var and keep every torch.cuda call out of worker_init_fn (PyTorch DataLoader docs).
- Allocating before the memory bind. numastat -p <pid> showing pages on the wrong node means a large allocation predated the bind or escaped --membind. Bind memory before the first big allocation, and remember --membind fails the allocation rather than spilling remote (numactl(8)).
- Leaving placement to Linux NUMA auto-balancing. Built-in balancing may migrate a process across nodes mid-run, silently turning local accesses into remote ones; the book states it "is usually not sufficient for performance-critical AI workloads". Pin explicitly (Fregly, Ch. 3).
- Pinning without lifting ulimit -l. Large pinned (page-locked) buffers fail to allocate or fall back to swappable memory when the locked-memory cap is too low, silently erasing the pinned-copy win. Set memlock unlimited in the shell, container, and pod (Fregly, Ch. 3).
- Pinning where there is nothing to gain. On a single-socket / single-NUMA-node box (numactl --hardware shows available: 1 nodes) there is no remote node to avoid, and on a compute-bound job with a trivial input pipeline the CPU is idle anyway. The bind adds config surface with no payoff.
- Double-pinning under a topology-aware scheduler. If Kubernetes (Topology Manager plus GPU Operator, cpuset cgroups) or Slurm (--cpu-bind) already places work NUMA-locally, adding numactl on top is redundant and can conflict with the cgroup cpuset. Verify what the scheduler did before adding your own bind.
References¶
- Chris Fregly, AI Systems Performance Engineering (O'Reilly), Chapter 3, "OS, Docker, and Kubernetes Tuning for GPU-Based Environments": sections "NUMA Awareness and CPU Pinning" and "NUMA-Friendly Memory Allocation and Memory Pinning". Source of the ~80 ns/~139 ns latency figures, the 5%-10% throughput and 2-3x / 10%-20% pinned-memory claims, the topo-driven launch loop, and the worker_init_fn re-pinning pattern.
- PyTorch, torch.utils.data.DataLoader documentation: pin_memory, persistent_workers, worker_init_fn (worker id is an int in [0, num_workers-1]), num_workers, prefetch_factor; note that CUDA tensors should not be returned from workers. https://docs.pytorch.org/docs/stable/data.html
- numactl(8) man page: --cpunodebind, --membind, --hardware, --show semantics (membind fails rather than spilling remote; policies inherited by fork). https://man7.org/linux/man-pages/man8/numactl.8.html
- Linux kernel, sysfs-bus-pci ABI: /sys/bus/pci/devices/<addr>/numa_node returns the device's node or -1 if unknown. https://www.kernel.org/doc/Documentation/ABI/testing/sysfs-bus-pci
- NVIDIA nvidia-smi: topo -m matrix with CPU Affinity / NUMA Affinity columns and connection-type legend. https://docs.nvidia.com/deploy/nvidia-smi/index.html
Related: Performance Optimization and Tuning · Goodput: Measuring Useful AI Throughput · Data loading pipeline tuning · CUDA Streams and Concurrency · Profiling GPUs: Nsight Systems and Nsight Compute · Glossary