Topology-aware GPU scheduling in Kubernetes¶
Scope: making Kubernetes co-locate the GPUs of a multi-GPU pod on the same NVLink/NUMA domain and align the pod's CPUs and memory to that domain, via the kubelet Topology Manager, GPU Feature Discovery labels, the RDMA device plugin, host networking, Guaranteed QoS, and the MIG single-node placement rule.
flowchart TB
Pod["4-GPU Guaranteed pod<br/>(requests == limits)"]
GFD["GPU Feature Discovery<br/>node labels (gpu.clique)"]
Sched["Scheduler<br/>(nodeSelector / affinity)"]
TM["Topology Manager (kubelet)<br/>policy: restricted / single-numa-node"]
DP["NVIDIA device plugin<br/>GPU NUMA hint"]
CPU["CPU Manager<br/>core hint"]
MEM["Memory Manager<br/>memory hint"]
Aligned["Aligned: GPUs + CPUs + memory<br/>on one NUMA / NVLink domain"]
Blind["Topology-blind placement<br/>job split across NVLink domains"]
Slow["Inter-GPU traffic over PCIe/IB<br/>~half NVLink bandwidth"]
Pod --> Sched
GFD --> Sched
Sched --> TM
DP --> TM
CPU --> TM
MEM --> TM
TM -->|hints align| Aligned
TM -.->|no policy / no hints| Blind
Blind --> Slow
What it is¶
Out of the box, Kubernetes treats every GPU as a fungible `nvidia.com/gpu` integer. It does not know that GPU 0 and GPU 1 share an NVLink switch and a NUMA node while GPU 0 and GPU 5 do not. On an 8-GPU node split into two 4-GPU NVLink domains, a topology-blind scheduler can hand a 4-GPU job two GPUs from each domain, forcing inter-GPU traffic across the slower PCIe/InfiniBand path instead of NVLink and roughly halving effective bandwidth.[1]
Topology-aware GPU scheduling is the set of mechanisms that fix this:
- The kubelet Topology Manager is a node-level component that collects topology hints from "hint providers" (the CPU Manager for cores, the device plugin for GPUs/NICs, the Memory Manager for memory) and admits a pod only if the resources can be aligned onto a common NUMA node, per the configured policy.[16]
- The NVIDIA device plugin advertises `nvidia.com/gpu` (and MIG resources) and reports the NUMA affinity of each GPU as a topology hint so the Topology Manager can align CPUs and memory to the allocated GPUs.[16][2]
- GPU Feature Discovery (GFD) labels nodes with GPU attributes (product, count, MIG capability, NVLink fabric clique) used by `nodeSelector`/affinity to steer pods to the right nodes.[18]
- The RDMA device plugin and host networking give pods direct, low-latency access to the InfiniBand/RoCE fabric for NCCL and MPI traffic that spans nodes.[19][3]
Why it matters¶
GPU-to-GPU bandwidth dominates the step time of collective-heavy distributed training (all-reduce, all-gather). NVLink is roughly an order of magnitude faster than the PCIe/network fallback; on an NVL72 NVLink domain the book quotes ~130 TB/s aggregate, ~1.8 TB/s per GPU, all of which is forfeited the moment the scheduler places a job across NVLink domains or racks.[4]
CPU/memory alignment matters for the same reason it does on bare metal: a data-loader thread feeding a GPU across a remote NUMA hop pays the local-vs-remote latency penalty the book measures (~80 ns local vs ~139 ns remote), plus jitter from cross-node interrupt servicing.[5] Topology Manager extends the OS-level `numactl` pinning described in NUMA Affinity and CPU Pinning for GPU Pipelines into the orchestrator, so the kubelet, not a hand-written wrapper script, guarantees that the pod's cores and its GPUs share a NUMA node.[6]
When it is needed (and when not)¶
Needed:
- Multi-GPU pods (data/tensor/pipeline parallel) on nodes with more than one NVLink or NUMA domain. This is where mis-placement silently halves bandwidth.[1]
- Latency-sensitive CPU+GPU pipelines where a remote-NUMA data loader starves the GPU.[5]
- Multi-node training over InfiniBand/RoCE that needs GPUDirect RDMA and predictable NIC affinity.[3]
Not needed / counterproductive:
- Single-GPU pods. With one GPU there is nothing to co-locate; alignment to that GPU's NUMA node is still beneficial but the placement problem is trivial.
- `single-numa-node` policy on workloads whose CPU + GPU + memory request cannot fit one NUMA node. The pod is rejected at kubelet admission with a `TopologyAffinityError` and never runs.[16]
- MIG for large multi-GPU jobs. MIG disables NVLink/PCIe peer-to-peer between instances, so distributed training that relies on GPU peer paths must not run on MIG; reserve MIG for many small, isolated inference/training jobs.[9]
How: implement, integrate, maintain¶
1. Enable the Topology Manager on the kubelet¶
Topology Manager is a kubelet setting, so it is configured per node (via the kubelet config file or flags), not in the pod spec. The policy flag is `--topology-manager-policy` (config field `topologyManagerPolicy`) with exactly four values: `none` (default), `best-effort`, `restricted`, `single-numa-node`.[16]
# /var/lib/kubelet/config.yaml (KubeletConfiguration)
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
cpuManagerPolicy: static # required to pin whole cores to Guaranteed pods
topologyManagerPolicy: restricted # none | best-effort | restricted | single-numa-node
topologyManagerScope: container # container (default) | pod
Policy semantics, per the Kubernetes docs:[16]
- `best-effort`: the kubelet stores the hint but admits the pod even if no aligned allocation exists. A safe starting point for getting topology preference without admission failures.
- `restricted`: admits the pod only if the requested resources can be aligned; otherwise the pod fails admission. The kubelet does not retry it: the pod goes to `Terminated` and a higher-level controller must recreate it.
- `single-numa-node`: the strictest policy. Every aligned resource must come from one NUMA node; otherwise the pod is rejected with a `TopologyAffinityError`.
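Because a pod rejected under `restricted` or `single-numa-node` is not retried by the kubelet, a bare Pod simply stays failed. A minimal sketch of wrapping the workload in a Deployment so that a controller recreates it; the name `trainer`, the `app: trainer` label, the image tag, and the resource sizes are illustrative assumptions, not values taken from the sources:

```yaml
# Sketch only: names, image tag, and sizes are illustrative assumptions.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: trainer
spec:
  replicas: 1
  selector:
    matchLabels:
      app: trainer
  template:
    metadata:
      labels:
        app: trainer
    spec:
      containers:
        - name: train
          image: nvcr.io/nvidia/pytorch:25.04-py3
          resources:
            requests:            # requests == limits keeps the pod Guaranteed
              cpu: "16"
              memory: 64Gi
              nvidia.com/gpu: "4"
            limits:
              cpu: "16"
              memory: 64Gi
              nvidia.com/gpu: "4"
```

The scheduler itself is not NUMA-aware, so a replacement pod may land on the same node and be rejected again; it is worth watching for repeated admission failures rather than assuming the controller resolves them.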
The `topologyManagerScope` setting chooses whether alignment is computed per container (default) or per pod (all containers share one NUMA hint). Use `pod` scope when sidecars must land on the same NUMA node as the main container.[16]
The book recommends `best-effort`, `restricted`, or (in some cases) `single-numa-node` for multi-GPU and CPU+GPU pods to cut remote-memory access, complementing OS-level NUMA tuning.[6]
Note: `cpuManagerPolicy: static` only grants exclusive whole cores to pods in the Guaranteed QoS class with integer CPU requests; fractional or Burstable pods still draw from the shared pool. Topology alignment of CPUs therefore depends on Guaranteed QoS (next section).[16]
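One detail the note above does not spell out: the `static` CPU Manager policy also needs a non-zero CPU reservation for system and kubelet work, otherwise the kubelet refuses to start. A hedged sketch of one way to provide it; the choice of cores `0-1` is an assumption and should follow the node's actual CPU layout:

```yaml
# Sketch only: the reserved core list is an assumed example.
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
cpuManagerPolicy: static
reservedSystemCPUs: "0-1"          # cores kept out of the exclusive pool (assumed)
topologyManagerPolicy: restricted
```

When switching an existing node from `none` to `static`, the Kubernetes docs describe draining the node and removing the kubelet's `cpu_manager_state` file before restarting the kubelet.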
2. Make every pod Guaranteed QoS¶
CPU pinning and meaningful CPU/memory alignment require the Guaranteed QoS class. A pod is Guaranteed only when every container sets both requests and limits, and `requests == limits`, for both CPU and memory. A high limit alone yields Burstable; no requests/limits yields BestEffort (first to be evicted).[17][7]
apiVersion: v1
kind: Pod
metadata:
  name: trainer
spec:
  # Steer to nodes with the intended GPU model (label from GFD, step 3).
  nodeSelector:
    nvidia.com/gpu.product: NVIDIA-H100-80GB-HBM3
  containers:
    - name: train
      image: nvcr.io/nvidia/pytorch:25.04-py3
      resources:
        requests:
          cpu: "16"              # integer CPU count, required for exclusive cores
          memory: 64Gi
          nvidia.com/gpu: "4"
        limits:                  # equal to requests => Guaranteed QoS
          cpu: "16"
          memory: 64Gi
          nvidia.com/gpu: "4"
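With this pod, the device plugin reports the NUMA affinity of the 4 allocated GPUs, and (under `restricted`/`single-numa-node`) the kubelet pins the 16 exclusive cores to the same NUMA node. The 64Gi of memory is aligned the same way only when the Memory Manager's `Static` policy is also enabled on the kubelet.[16][2]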
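Memory only takes part in alignment when the Memory Manager runs with its `Static` policy. A sketch of what that kubelet configuration might look like on a two-NUMA-node host; every reservation size here is an assumption. The kubelet requires the per-NUMA `reservedMemory` values to sum to `kubeReserved` + `systemReserved` + the hard eviction threshold for memory (100Mi by default):

```yaml
# Sketch only: reservation sizes are assumed; adjust to the node.
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
cpuManagerPolicy: static
topologyManagerPolicy: restricted
memoryManagerPolicy: Static
kubeReserved:
  cpu: "1"
  memory: 4Gi
systemReserved:
  cpu: "1"
  memory: 1Gi
# 4Gi + 1Gi + 100Mi (default memory.available eviction threshold) = 5220Mi
reservedMemory:
  - numaNode: 0
    limits:
      memory: 3Gi        # 3072Mi
  - numaNode: 1
    limits:
      memory: 2148Mi     # 3072Mi + 2148Mi = 5220Mi
```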
3. Steer pods to the right node with GFD labels¶
The Topology Manager aligns resources within the node a pod has already been assigned to; it does not choose the node. Node selection is the scheduler's job, driven by labels from GPU Feature Discovery (deployed by the NVIDIA GPU Operator alongside Node Feature Discovery). GFD emits labels such as `nvidia.com/gpu.product`, `nvidia.com/gpu.count`, `nvidia.com/gpu.family`, `nvidia.com/gpu.machine`, MIG labels (`nvidia.com/mig.strategy`, `nvidia.com/mig-<g>g.<gb>gb.count`), and `nvidia.com/gpu.clique` (NVLink fabric ClusterUUID + CliqueID for multi-node NVLink domains).[18]
Use these in `nodeSelector`/`nodeAffinity` (as in step 2) to land collective-heavy jobs on nodes inside the same fast NVLink domain before any cross-fabric hop.[4]
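Where a plain `nodeSelector` is too coarse, the same GFD labels can be combined in a `nodeAffinity` rule. A minimal sketch, assuming a hypothetical pod name and an H100 product label value; the commented-out clique term is a placeholder because clique values are cluster-specific and the label is only present on nodes that belong to an NVLink fabric:

```yaml
# Sketch only: pod name, product value, and sizes are illustrative assumptions.
apiVersion: v1
kind: Pod
metadata:
  name: trainer-affinity
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
          - matchExpressions:
              - key: nvidia.com/gpu.product
                operator: In
                values: ["NVIDIA-H100-80GB-HBM3"]
              - key: nvidia.com/gpu.count
                operator: Gt
                values: ["3"]          # more than 3, i.e. at least 4 GPUs on the node
              # Optional: pin to one NVLink fabric clique (value is cluster-specific).
              # - key: nvidia.com/gpu.clique
              #   operator: In
              #   values: ["<cluster-uuid>.<clique-id>"]
  containers:
    - name: train
      image: nvcr.io/nvidia/pytorch:25.04-py3
      resources:
        limits:                        # requests default to limits => Guaranteed
          cpu: "16"
          memory: 64Gi
          nvidia.com/gpu: "4"
```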
Accuracy note: the book states the GPU Operator labels each GPU with "its NUMA node and NVLink/NVSwitch ID".[13] The upstream GFD label set documents `nvidia.com/gpu.clique` for NVLink-fabric grouping but does not document a per-GPU NUMA-node label; intra-node NUMA alignment is delivered through the device plugin's topology hints to the Topology Manager, not a node label.[18][16] Treat per-GPU NUMA placement as a kubelet (Topology Manager) function, and NVLink-domain node selection as a GFD-label function.
4. Host networking + RDMA device plugin for multi-node fabric¶
For multi-node jobs, the simplest high-performance path is host networking: the pod uses the host's network namespace and InfiniBand interfaces directly, with no overlay/NAT translation, which the book notes is especially useful for MPI because it removes per-rank port mapping.[3]
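A minimal sketch of a host-networked rank pod. The pod name, the interface name `ib0`, and the resource sizes are assumptions; `dnsPolicy: ClusterFirstWithHostNet` is included so the pod can still resolve cluster Services while sharing the host's network namespace:

```yaml
# Sketch only: pod name, interface name, and sizes are illustrative assumptions.
apiVersion: v1
kind: Pod
metadata:
  name: mpi-rank
spec:
  hostNetwork: true                      # share the node's network namespace
  dnsPolicy: ClusterFirstWithHostNet     # keep cluster DNS with hostNetwork
  containers:
    - name: rank
      image: nvcr.io/nvidia/pytorch:25.04-py3
      env:
        - name: NCCL_SOCKET_IFNAME
          value: ib0                     # assumed host interface name
      resources:
        limits:                          # requests default to limits => Guaranteed
          cpu: "32"
          memory: 128Gi
          nvidia.com/gpu: "8"
```

Host networking trades isolation for simplicity: ports opened by the pod are opened on the node, so two such pods on one node must not collide on port numbers.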
When host networking is barred by policy, expose the fabric explicitly with the Mellanox/NVIDIA `k8s-rdma-shared-dev-plugin`, which advertises shared RDMA resources under the `rdma/` prefix (e.g. `rdma/hca_shared_devices_a`) for IB and RoCE HCAs, giving pods GPUDirect RDMA zero-copy endpoints.[19][12]
Enable GPUDirect RDMA in the NVIDIA driver (the NIC must support it) so GPUs exchange data with the NIC without passing through the CPU. If the job must run over an overlay network instead, the book advises setting `NCCL_SOCKET_IFNAME` and an `NCCL_PORT_RANGE` so NCCL handshakes traverse the intended interface and ports.[12]
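A sketch of a pod that requests a shared RDMA device instead of using host networking. The resource name `rdma/hca_shared_devices_a` follows the plugin's example naming and depends on how the plugin's ConfigMap is written in a given cluster; the pod name, interface name `eth0`, and sizes are assumptions. The `IPC_LOCK` capability lets the process pin memory for RDMA registration:

```yaml
# Sketch only: resource name depends on the plugin config; other values are assumed.
apiVersion: v1
kind: Pod
metadata:
  name: rdma-trainer
spec:
  containers:
    - name: train
      image: nvcr.io/nvidia/pytorch:25.04-py3
      securityContext:
        capabilities:
          add: ["IPC_LOCK"]              # allow locking memory for RDMA
      env:
        - name: NCCL_SOCKET_IFNAME
          value: eth0                    # assumed pod interface for NCCL bootstrap
      resources:
        limits:                          # requests default to limits => Guaranteed
          cpu: "16"
          memory: 64Gi
          nvidia.com/gpu: "4"
          rdma/hca_shared_devices_a: 1
```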
5. The MIG single-node placement constraint¶
MIG slices are advertised as distinct resources, e.g. `nvidia.com/mig-2g.45gb`. A pod requesting several MIG slices must find them all on one node. A pod cannot span nodes, so the scheduler will only place it if a single node has enough free slices; otherwise it stays Pending indefinitely, even if the cluster has spare MIG capacity elsewhere.[8]
Plan slice geometry to match request shapes (e.g. host three `2g.45gb` instances per GPU so two can co-reside for one pod). MIG mode toggling requires a GPU reset/node reboot, so it is static, not per-job dynamic; the GPU Operator's MIG Manager preserves slices across reboots and driver reloads, and persistence mode should stay on so slices are not rebuilt between jobs. Remember that MIG disables NVLink/PCIe P2P between instances, so keep large distributed jobs off MIG.[10][9]
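A sketch of a pod that requests two MIG slices of the same profile. It assumes the device plugin runs with the `mixed` MIG strategy, under which each profile is exposed as its own resource name; the pod name and sizes are illustrative. Both slices must be free on a single node for the pod to leave Pending:

```yaml
# Sketch only: assumes the "mixed" MIG strategy; names and sizes are illustrative.
apiVersion: v1
kind: Pod
metadata:
  name: mig-inference
spec:
  nodeSelector:
    nvidia.com/mig.strategy: mixed       # GFD label on MIG-partitioned nodes
  containers:
    - name: infer
      image: nvcr.io/nvidia/pytorch:25.04-py3
      resources:
        limits:                          # requests default to limits => Guaranteed
          cpu: "4"
          memory: 16Gi
          nvidia.com/mig-2g.45gb: "2"    # both slices must come from one node
```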
Maintain¶
- Label `mig-enabled` vs `mig-disabled` nodes and let affinity route small inference pods to MIG nodes and whole-GPU training to the rest.[11]
- Keep the device plugin and GPU Operator current. Topology-aware GPU scheduling is "still maturing"; older plugins do not emit NUMA/NVLink hints.[14]
- Verify alignment in practice: a `restricted`/`single-numa-node` rejection surfaces as a pod-level `TopologyAffinityError` event; watch for it after changing CPU/memory/GPU request ratios.[16]
- Host kernel knobs (hugepages, CPU governor, `vm.swappiness`) cannot be set from inside a container. Set them on the node image or via the GPU Operator, as covered in Host OS and Kernel Tuning for GPU Nodes.[15]
References¶
- Chris Fregly, AI Systems Performance Engineering (O'Reilly), Chapter 3, "OS, Docker, and Kubernetes Tuning for GPU-Based Environments" — topology-aware orchestration, Topology Manager policies, MIG single-node constraint, host networking, RDMA, QoS.
- Kubernetes — Control Topology Management Policies on a Node: https://kubernetes.io/docs/tasks/administer-cluster/topology-manager/
- Kubernetes — Pod Quality of Service Classes: https://kubernetes.io/docs/concepts/workloads/pods/pod-qos/
- NVIDIA GPU Feature Discovery (label reference): https://github.com/NVIDIA/k8s-device-plugin/blob/main/docs/gpu-feature-discovery/README.md
- Mellanox/NVIDIA `k8s-rdma-shared-dev-plugin`: https://github.com/Mellanox/k8s-rdma-shared-dev-plugin
- NVIDIA k8s-device-plugin (NUMA/topology hints): https://github.com/NVIDIA/k8s-device-plugin
Related: Host OS and Kernel Tuning for GPU Nodes · NUMA Affinity and CPU Pinning for GPU Pipelines · GPU Containerization Performance · Containers and Kubernetes for GPUs · NVIDIA Container Toolkit and CDI · GPU Software Stack and Node Administration · Glossary
[1] Fregly, Ch. 3, "Kubernetes for Topology-Aware Container Orchestration and Networking": an 8-GPU/two-NVLink-domain node where a topology-blind scheduler splits a 4-GPU job across domains "can cut your inter-GPU bandwidth in half."
[2] Fregly, Ch. 3: "The device plugin is topology aware ... it can prefer to allocate multiple GPUs from the same NVLink Switch or the same NUMA node for a given pod."
[3] Fregly, Ch. 3, "Optimizing Network Communication for Kubernetes": set `hostNetwork: true`; host networking "allows a container to access the InfiniBand interconnect exactly as the host does ... particularly useful for MPI jobs."
[4] Fregly, Ch. 3: NVL72 "connects 72 GPUs into a single high-bandwidth domain with a combined ~130 TB/s ... (72 GPUs * 1.8 TB/s per GPU)"; "prefer placements that keep traffic inside the fast NVLink domain before crossing the slower network fabric."
[5] Fregly, Ch. 3, "NUMA Awareness and CPU Pinning": "local NUMA node memory access latency is ~80 ns compared to remote ... ~139 ns."
[6] Fregly, Ch. 3, "Orchestrating Containers with Kubernetes Topology Manager": "configuring `--topology-manager-policy` to `best-effort`, `restricted`, or, in some cases, `single-numa-node` ... complements the OS-level NUMA tuning."
[7] Fregly, Ch. 3, "Improving Resource Guarantees": "To obtain Guaranteed QoS, every container must set `requests == limits` for both CPU and memory. Setting a high limit alone will result in a Burstable QoS, not Guaranteed."
[8] Fregly, Ch. 3, "Slicing a GPU with MIG": "the scheduler cannot split these across GPUs or nodes ... the pod remains in a Kubernetes Pending (unscheduled) state ... even if other nodes collectively have enough MIG capacity."
[9] Fregly, Ch. 3: "when a GPU is in MIG mode, GPU-to-GPU peer-to-peer communication (including NVLink) is disabled ... Large-scale training jobs ... are typically not good candidates for MIG."
[10] Fregly, Ch. 3: "the NVIDIA Kubernetes GPU Operator's MIG Manager can automatically configure and preserve MIG partitions ... across reboots and driver reloads"; "Persistence mode is recommended when using MIG."
[11] Fregly, Ch. 3: "You can label one K8s node with 'mig-enabled' and another as 'mig-disabled' and let the scheduler place jobs/pods accordingly."
[12] Fregly, Ch. 3: "install the Kubernetes RDMA device plugin from Mellanox ... exposes InfiniBand and GPUDirect RDMA endpoints"; over overlay use `NCCL_PORT_RANGE` and `NCCL_SOCKET_IFNAME`.
[13] Fregly, Ch. 3: the GPU Operator is "responsible for node labeling using NVIDIA's GPU Feature Discovery to label each GPU with its NUMA node and NVLink/NVSwitch ID." (Per-GPU NUMA label not documented upstream; see the GFD reference.)
[14] Fregly, Ch. 3: "Topology-aware GPU scheduling is still maturing."
[15] Fregly, Ch. 3, "Dealing with I/O Isolation": "containers can't change kernel parameters like hugepage settings or CPU governor limits ... cluster admins set these ... through the base OS image. Or ... the NVIDIA GPU Operator."
[16] Kubernetes docs, "Control Topology Management Policies on a Node": `--topology-manager-policy` values `none`/`best-effort`/`restricted`/`single-numa-node`; `topologyManagerScope` values `container`/`pod`; alignment of CPUs requires the CPU Manager `static` policy; `single-numa-node`/`restricted` rejections surface as `TopologyAffinityError`. https://kubernetes.io/docs/tasks/administer-cluster/topology-manager/
[17] Kubernetes docs, "Pod Quality of Service Classes": Guaranteed requires every container to set CPU and memory `requests` equal to `limits`, both > 0; classes are Guaranteed, Burstable, BestEffort. https://kubernetes.io/docs/concepts/workloads/pods/pod-qos/
[18] NVIDIA GPU Feature Discovery label reference: emits `nvidia.com/gpu.product`, `nvidia.com/gpu.count`, `nvidia.com/gpu.family`, `nvidia.com/gpu.machine`, MIG labels, and `nvidia.com/gpu.clique` (NVLink fabric grouping); no per-GPU NUMA-node label documented. https://github.com/NVIDIA/k8s-device-plugin/blob/main/docs/gpu-feature-discovery/README.md
[19] `k8s-rdma-shared-dev-plugin` advertises shared RDMA resources under the `rdma/` prefix (e.g. `rdma/hca_shared_devices_a`) for InfiniBand and RoCE HCAs. https://github.com/Mellanox/k8s-rdma-shared-dev-plugin