
Topology-aware GPU scheduling in Kubernetes

Scope: making Kubernetes co-locate the GPUs of a multi-GPU pod on the same NVLink/NUMA domain and align the pod's CPUs and memory to that domain, via the kubelet Topology Manager, GPU Feature Discovery labels, the RDMA device plugin, host networking, Guaranteed QoS, and the MIG single-node placement rule.

flowchart TB
    Pod["4-GPU Guaranteed pod<br/>(requests == limits)"]
    GFD["GPU Feature Discovery<br/>node labels (gpu.clique)"]
    Sched["Scheduler<br/>(nodeSelector / affinity)"]
    TM["Topology Manager (kubelet)<br/>policy: restricted / single-numa-node"]
    DP["NVIDIA device plugin<br/>GPU NUMA hint"]
    CPU["CPU Manager<br/>core hint"]
    MEM["Memory Manager<br/>memory hint"]
    Aligned["Aligned: GPUs + CPUs + memory<br/>on one NUMA / NVLink domain"]
    Blind["Topology-blind placement<br/>job split across NVLink domains"]
    Slow["Inter-GPU traffic over PCIe/IB<br/>~half NVLink bandwidth"]

    Pod --> Sched
    GFD --> Sched
    Sched --> TM
    DP --> TM
    CPU --> TM
    MEM --> TM
    TM -->|hints align| Aligned
    TM -.->|no policy / no hints| Blind
    Blind --> Slow

What it is

Out of the box, Kubernetes treats every GPU as a fungible nvidia.com/gpu integer. It does not know that GPU 0 and GPU 1 share an NVLink switch and a NUMA node while GPU 0 and GPU 5 do not. On an 8-GPU node split into two 4-GPU NVLink domains, a topology-blind scheduler can hand a 4-GPU job two GPUs from each domain. Inter-GPU traffic between the two halves then takes the slower PCIe/InfiniBand path instead of NVLink, roughly halving effective bandwidth.¹
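The split described above can be made concrete with a small sketch. It assumes a hypothetical GPU-to-domain map for the 8-GPU, two-domain node (GPUs 0-3 in domain "A", 4-7 in domain "B"); the map and function names are illustrative only and do not come from any Kubernetes or NVIDIA API.

```python
from collections import Counter

# Hypothetical layout (an assumption): 8 GPUs, two 4-GPU NVLink domains.
NVLINK_DOMAIN = {0: "A", 1: "A", 2: "A", 3: "A", 4: "B", 5: "B", 6: "B", 7: "B"}


def domains_used(gpu_ids):
    """Count how many of the allocated GPUs fall in each NVLink domain."""
    return Counter(NVLINK_DOMAIN[g] for g in gpu_ids)


def cross_domain_pairs(gpu_ids):
    """Number of GPU pairs whose traffic has to leave the NVLink domain."""
    ids = list(gpu_ids)
    return sum(
        1
        for i in range(len(ids))
        for j in range(i + 1, len(ids))
        if NVLINK_DOMAIN[ids[i]] != NVLINK_DOMAIN[ids[j]]
    )


print(cross_domain_pairs([0, 1, 2, 3]))  # 0: all six pairs stay on NVLink
print(cross_domain_pairs([0, 1, 4, 5]))  # 4: four of six pairs cross domains
```

With the topology-blind allocation, four of the six GPU pairs communicate over the slower path, which is why a collective such as all-reduce slows down even though the pod received the four GPUs it asked for.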

Topology-aware GPU scheduling is the set of mechanisms that fix this:

kubelet Topology Manager is a node-level component that collects topology hints from "hint providers" (the CPU Manager for cores, the device plugin for GPUs/NICs, the Memory Manager for memory) and admits a pod only if the resources can be aligned onto a common NUMA node, per the configured policy.¹⁶
NVIDIA device plugin advertises nvidia.com/gpu (and MIG resources) and reports the NUMA affinity of each GPU as a topology hint so the Topology Manager can align CPUs and memory to the allocated GPUs.¹⁶ ²
GPU Feature Discovery (GFD) labels nodes with GPU attributes (product, count, MIG capability, NVLink fabric clique) used by nodeSelector/affinity to steer pods to the right nodes.¹⁸
RDMA device plugin + host networking give pods direct, low-latency access to the InfiniBand/RoCE fabric for NCCL and MPI traffic that spans nodes.¹⁹ ³

Why it matters

GPU-to-GPU bandwidth dominates the step time of collective-heavy distributed training (all-reduce, all-gather). NVLink is roughly an order of magnitude faster than the PCIe/network fallback. On an NVL72 NVLink domain the book quotes ~130 TB/s aggregate (72 GPUs × ~1.8 TB/s per GPU), and that advantage is lost for any traffic the scheduler forces across NVLink domains or racks.⁴

CPU/memory alignment matters for the same reason it does on bare metal: a data-loader thread feeding a GPU across a remote NUMA hop pays the remote-access penalty the book measures (~80 ns local vs. ~139 ns remote), plus jitter from cross-node interrupt servicing.⁵ Topology Manager extends the OS-level numactl pinning described in NUMA Affinity and CPU Pinning for GPU Pipelines into the orchestrator, so that the kubelet, not a hand-written wrapper script, guarantees the pod's cores and its GPUs share a NUMA node.⁶

When it is needed (and when not)

Needed:

Multi-GPU pods (data/tensor/pipeline parallel) on nodes with more than one NVLink or NUMA domain. This is where mis-placement silently halves bandwidth.¹
Latency-sensitive CPU+GPU pipelines where a remote-NUMA data loader starves the GPU.⁵
Multi-node training over InfiniBand/RoCE that needs GPUDirect RDMA and predictable NIC affinity.³

Not needed / counterproductive:

Single-GPU pods. With one GPU there is nothing to co-locate; alignment to that GPU's NUMA node is still beneficial but the placement problem is trivial.
single-numa-node policy on workloads whose CPU + GPU + memory request cannot fit one NUMA node. The scheduler may still assign the pod to the node, but the kubelet rejects it at admission with a TopologyAffinityError, so it never runs there.¹⁶
MIG for large multi-GPU jobs. MIG disables NVLink/PCIe peer-to-peer between instances, so distributed training that relies on GPU peer paths must not run on MIG; reserve MIG for many small, isolated inference/training jobs.⁹

How: implement, integrate, maintain

1. Enable the Topology Manager on the kubelet

Topology Manager is a kubelet setting, so it is configured per node (via the kubelet config file or flags), not in the pod spec. The policy flag is --topology-manager-policy with exactly four values: none (default), best-effort, restricted, single-numa-node.¹⁶

# /var/lib/kubelet/config.yaml (KubeletConfiguration)
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
cpuManagerPolicy: static          # required to pin whole cores to Guaranteed pods
topologyManagerPolicy: restricted # none | best-effort | restricted | single-numa-node
topologyManagerScope: container   # container (default) | pod

Policy semantics, per the Kubernetes docs:¹⁶

best-effort: the kubelet computes the preferred NUMA alignment and applies it when it can, but admits the pod even if no aligned allocation exists. A safe starting point for getting topology preference without admission failures.
restricted: admits the pod only if the requested resources can be aligned; otherwise the pod fails admission. The kubelet does not retry it: the pod goes to a Terminated state, and a higher-level controller (for example a Deployment or ReplicaSet) must create a replacement.
single-numa-node: the strictest policy. Every aligned resource must come from one NUMA node; otherwise the pod is rejected with a TopologyAffinityError.
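To make the policy differences tangible, here is a deliberately simplified model of hint merging. It is a sketch under stated assumptions, not the kubelet's actual algorithm: a hint is modeled as a (set of NUMA nodes, preferred flag) pair, one hint is taken per provider, the node sets are intersected, and the result counts as preferred only if every input hint was preferred. The function names `merge` and `admit` are hypothetical.

```python
from itertools import product

# A hint is (numa_nodes, preferred): a frozenset of NUMA node IDs plus a flag.
# Simplified model (an assumption), not the kubelet implementation.


def merge(hints_per_provider):
    """Intersect one hint per provider; return the best merged hint or None."""
    if not hints_per_provider:
        return None
    best = None
    for combo in product(*hints_per_provider):
        nodes = frozenset.intersection(*(h[0] for h in combo))
        if not nodes:
            continue  # the providers have no NUMA node in common
        cand = (nodes, all(h[1] for h in combo))
        # Rank "preferred" hints first, then the narrowest NUMA set.
        if best is None or (not cand[1], len(cand[0])) < (not best[1], len(best[0])):
            best = cand
    return best


def admit(policy, hints_per_provider):
    """Return True if the modeled policy would admit the pod."""
    if policy in ("none", "best-effort"):
        return True  # never rejects, aligned or not
    if policy == "single-numa-node":
        # Only hints confined to exactly one NUMA node are considered.
        hints_per_provider = [
            [h for h in hints if len(h[0]) == 1] for hints in hints_per_provider
        ]
    best = merge(hints_per_provider)
    return best is not None and best[1]


gpu = [(frozenset({0}), True), (frozenset({0, 1}), False)]
cpu = [(frozenset({0}), True), (frozenset({1}), True), (frozenset({0, 1}), False)]
print(merge([gpu, cpu]))              # (frozenset({0}), True)
print(admit("restricted", [gpu, cpu]))  # True

# GPUs only on node 0, free cores only on node 1: nothing lines up.
gpu_n0 = [(frozenset({0}), True)]
cpu_n1 = [(frozenset({1}), True)]
print(admit("best-effort", [gpu_n0, cpu_n1]))       # True
print(admit("single-numa-node", [gpu_n0, cpu_n1]))  # False
```

The last two calls show the trade-off in miniature: best-effort lets the misaligned pod run, while the stricter policies turn the same situation into an admission failure that a controller has to handle.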

The topologyManagerScope setting (kubelet flag --topology-manager-scope) chooses whether alignment is computed per container (the default) or per pod, in which case all containers share one NUMA hint. Use pod scope when sidecars must land on the same NUMA node as the main container.¹⁶

The book recommends best-effort, restricted, or single-numa-node for multi-GPU and CPU+GPU pods to cut remote-memory access, complementing OS-level NUMA tuning.⁶

Note: cpuManagerPolicy: static only grants exclusive whole cores to pods in the Guaranteed QoS class with integer CPU requests; fractional or Burstable pods still draw from the shared pool. Topology alignment of CPUs therefore depends on Guaranteed QoS (next section).¹⁶

2. Make every pod Guaranteed QoS

CPU pinning and meaningful CPU/memory alignment require the Guaranteed QoS class. A pod is Guaranteed only when every container sets requests and limits for both CPU and memory, with requests == limits. A limit set higher than the request yields Burstable; setting no requests or limits at all yields BestEffort (first to be evicted).¹⁷ ⁷

apiVersion: v1
kind: Pod
metadata:
  name: trainer
spec:
  # Steer to nodes with the required GPU model (label from GFD, step 3).
  nodeSelector:
    nvidia.com/gpu.product: NVIDIA-H100-80GB-HBM3
  containers:
    - name: train
      image: nvcr.io/nvidia/pytorch:25.04-py3
      resources:
        requests:
          cpu: "16"
          memory: 64Gi
          nvidia.com/gpu: "4"
        limits:               # equal to requests => Guaranteed QoS
          cpu: "16"
          memory: 64Gi
          nvidia.com/gpu: "4"

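With this pod, the device plugin reports the NUMA affinity of the 4 allocated GPUs and, under restricted or single-numa-node, the kubelet pins the 16 exclusive cores on the same NUMA node. The 64Gi of memory is aligned as well only when the Memory Manager's Static policy is also enabled on the kubelet, which the configuration in step 1 does not set.¹⁶ ²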
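The Guaranteed rule can be checked mechanically before a manifest is submitted. The following is a minimal sketch, assuming containers are given as plain dicts with optional "requests" and "limits" maps; `qos_class` is a hypothetical helper, and it compares quantities as literal strings, so "16" and "16000m" would be treated as different even though Kubernetes considers them equal.

```python
def qos_class(containers):
    """Classify a pod from its containers' CPU/memory requests and limits.

    Simplified: quantities are compared as literal strings (an assumption).
    """
    any_set = False
    guaranteed = True
    for c in containers:
        req = c.get("requests", {})
        lim = c.get("limits", {})
        for res in ("cpu", "memory"):
            r, l = req.get(res), lim.get(res)
            if r is not None or l is not None:
                any_set = True
            # Kubernetes defaults a missing request to the limit, so the limit
            # is mandatory and any explicit request must match it.
            if l is None or (r is not None and r != l):
                guaranteed = False
    if not any_set:
        return "BestEffort"
    return "Guaranteed" if guaranteed else "Burstable"


trainer = [{
    "requests": {"cpu": "16", "memory": "64Gi", "nvidia.com/gpu": "4"},
    "limits": {"cpu": "16", "memory": "64Gi", "nvidia.com/gpu": "4"},
}]
print(qos_class(trainer))  # Guaranteed
print(qos_class([{"requests": {"cpu": "4"}, "limits": {"cpu": "16"}}]))  # Burstable
print(qos_class([{}]))  # BestEffort
```

A check like this is most useful in CI for pods with sidecars, where a single container without limits silently drops the whole pod to Burstable and with it the exclusive-core pinning.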

3. Steer pods to the right node with GFD labels

The Topology Manager aligns resources within the node a pod has already been assigned to; it does not choose the node. Node selection is the scheduler's job, driven by labels from GPU Feature Discovery (deployed by the NVIDIA GPU Operator alongside Node Feature Discovery). GFD emits labels such as nvidia.com/gpu.product, nvidia.com/gpu.count, nvidia.com/gpu.family, nvidia.com/gpu.machine, MIG labels (nvidia.com/mig.strategy, nvidia.com/mig-<g>g.<gb>gb.count), and nvidia.com/gpu.clique (NVLink fabric ClusterUUID + CliqueID for multi-node NVLink domains).¹⁸

Use these in nodeSelector/nodeAffinity (as in step 2) to land collective-heavy jobs on nodes inside the same fast NVLink domain before any cross-fabric hop.⁴

Accuracy note: the book states the GPU Operator labels each GPU with "its NUMA node and NVLink/NVSwitch ID".¹³ The upstream GFD label set documents nvidia.com/gpu.clique for NVLink-fabric grouping but does not document a per-GPU NUMA-node label; intra-node NUMA alignment is delivered through the device plugin's topology hints to the Topology Manager, not a node label.¹⁸ ¹⁶ Treat per-GPU NUMA placement as a kubelet (Topology Manager) function, and NVLink-domain node selection as a GFD-label function.
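As an illustration of NVLink-domain node selection, the sketch below groups nodes by their nvidia.com/gpu.clique label and picks a clique with enough nodes for a multi-node job. The node names and label values are made up for the example, and `nodes_by_clique` and `pick_clique` are hypothetical helpers; in a real cluster the labels would come from the Kubernetes API and the choice would be expressed as a nodeSelector or nodeAffinity term.

```python
from collections import defaultdict

CLIQUE_LABEL = "nvidia.com/gpu.clique"


def nodes_by_clique(node_labels):
    """Group node names by their NVLink-fabric clique label."""
    groups = defaultdict(list)
    for node, labels in sorted(node_labels.items()):
        clique = labels.get(CLIQUE_LABEL)
        if clique is not None:
            groups[clique].append(node)
    return dict(groups)


def pick_clique(node_labels, nodes_needed):
    """Return (clique, nodes) for the first clique large enough, else None."""
    for clique, nodes in sorted(nodes_by_clique(node_labels).items()):
        if len(nodes) >= nodes_needed:
            return clique, nodes[:nodes_needed]
    return None


# Hypothetical cluster: label values are placeholders, not real UUIDs.
cluster = {
    "node-1": {CLIQUE_LABEL: "uuid-a.1"},
    "node-2": {CLIQUE_LABEL: "uuid-a.1"},
    "node-3": {CLIQUE_LABEL: "uuid-b.1"},
    "node-4": {},  # no NVLink fabric label
}
print(pick_clique(cluster, 2))  # ('uuid-a.1', ['node-1', 'node-2'])
print(pick_clique(cluster, 3))  # None
```

The selected clique value is what a job would pin to with a nodeSelector on nvidia.com/gpu.clique, so that all of its pods land inside one NVLink domain.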

4. Host networking + RDMA device plugin for multi-node fabric

For multi-node jobs, the simplest high-performance path is host networking: the pod uses the host's network namespace and InfiniBand interfaces directly, with no overlay/NAT translation, which the book notes is especially useful for MPI because it removes per-rank port mapping.³

spec:
  hostNetwork: true
  hostIPC: true   # often required for NCCL/MPI shared-memory transport

When host networking is barred by policy, expose the fabric explicitly with the Mellanox/NVIDIA k8s-rdma-shared-dev-plugin, which advertises shared RDMA resources under the rdma/ prefix (e.g. rdma/hca_shared_devices_a) for IB and RoCE HCAs, giving pods GPUDirect RDMA zero-copy endpoints.¹⁹ ¹²

      resources:
        limits:
          rdma/hca_shared_devices_a: "1"
          nvidia.com/gpu: "4"

Enable GPUDirect RDMA in the NVIDIA driver (the NIC must support it) so GPUs exchange data with the NIC while bypassing the CPU. When running over an overlay network instead, set NCCL_SOCKET_IFNAME and NCCL_PORT_RANGE so that NCCL handshakes traverse the right interface and ports.¹²

5. The MIG single-node placement constraint

MIG slices are advertised as distinct resources, e.g. nvidia.com/mig-2g.45gb. A pod requesting several MIG slices must find them all on one node. A pod cannot span nodes, so the scheduler will only place it if a single node has enough free slices; otherwise it stays Pending indefinitely, even if the cluster has spare MIG capacity elsewhere.⁸

      resources:
        limits:
          nvidia.com/mig-2g.45gb: "2"   # both 2g.45gb slices must be free on ONE node

Plan slice geometry to match request shapes (e.g. host three 2g.45gb instances per GPU so two can co-reside for one pod). MIG mode toggling requires a GPU reset/node reboot, so it is static, not per-job dynamic; the GPU Operator's MIG Manager preserves slices across reboots and driver reloads, and persistence mode should stay on so slices are not rebuilt between jobs. Remember that MIG disables NVLink/PCIe P2P between instances, so keep large distributed jobs off MIG.¹⁰ ⁹
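The single-node constraint is easy to overlook when reading cluster-wide capacity numbers. The sketch below models it under simple assumptions: free MIG slices are given per node as a dict of resource name to count, and `schedulable_node` is a hypothetical helper that mimics only the "all slices on one node" rule, not the real scheduler.

```python
MIG_2G = "nvidia.com/mig-2g.45gb"


def schedulable_node(free_slices, resource, count):
    """Return a node that alone has `count` free slices, or None (pod stays Pending)."""
    for node in sorted(free_slices):
        if free_slices[node].get(resource, 0) >= count:
            return node
    return None


# Hypothetical cluster state: two free slices in total, but on different nodes.
free = {"node-a": {MIG_2G: 1}, "node-b": {MIG_2G: 1}}
print(sum(n.get(MIG_2G, 0) for n in free.values()))  # 2 free slices cluster-wide
print(schedulable_node(free, MIG_2G, 2))             # None: no single node has 2

free["node-c"] = {MIG_2G: 3}
print(schedulable_node(free, MIG_2G, 2))             # node-c
```

The first case is the Pending scenario from the text: the aggregate count looks sufficient, yet no node can satisfy the request by itself.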

Maintain

Label mig-enabled vs mig-disabled nodes and let affinity route small inference pods to MIG nodes and whole-GPU training to the rest.¹¹
Keep the device plugin and GPU Operator current. Topology-aware GPU scheduling is "still maturing"; older plugins do not emit NUMA/NVLink hints.¹⁴
Verify alignment in practice: a restricted/single-numa-node rejection surfaces as a pod-level TopologyAffinityError event; watch for it after changing CPU/memory/GPU request ratios.¹⁶
Host kernel knobs (hugepages, CPU governor, vm.swappiness) cannot be set from inside a container. Set them on the node image / via the GPU Operator, as covered in Host OS and Kernel Tuning for GPU Nodes.¹⁵
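Beyond watching for TopologyAffinityError events, alignment can be spot-checked from inside a running container by comparing the CPUs the process is allowed to use with each NUMA node's CPU list. The sketch below works on the Linux cpulist text format found in the Cpus_allowed_list field of /proc/self/status and in /sys/devices/system/node/node<N>/cpulist; it takes the strings as arguments so it can run anywhere, and the sample values and helper names are assumptions for illustration.

```python
def parse_cpulist(text):
    """Parse a Linux cpulist string such as "0-3,8,10-11" into a set of ints."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue
        lo, _, hi = part.partition("-")
        cpus.update(range(int(lo), int(hi or lo) + 1))
    return cpus


def numa_nodes_spanned(allowed, node_cpulists):
    """Return the sorted NUMA node IDs whose CPUs overlap the allowed set."""
    allowed_cpus = parse_cpulist(allowed)
    return sorted(
        node for node, cpulist in node_cpulists.items()
        if allowed_cpus & parse_cpulist(cpulist)
    )


# Hypothetical two-socket host with hyperthreads numbered in the upper range.
nodes = {0: "0-31,64-95", 1: "32-63,96-127"}
print(numa_nodes_spanned("2-9,66-73", nodes))  # [0]: pinned to one NUMA node
print(numa_nodes_spanned("30-33", nodes))      # [0, 1]: straddles both nodes
```

A result with more than one node for a pod that was expected to be pinned suggests the pod is not Guaranteed, requested fractional CPUs, or landed on a node whose kubelet lacks the static CPU Manager policy.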

References

Chris Fregly, AI Systems Performance Engineering (O'Reilly), Chapter 3, "OS, Docker, and Kubernetes Tuning for GPU-Based Environments" — topology-aware orchestration, Topology Manager policies, MIG single-node constraint, host networking, RDMA, QoS.
Kubernetes — Control Topology Management Policies on a Node: https://kubernetes.io/docs/tasks/administer-cluster/topology-manager/
Kubernetes — Pod Quality of Service Classes: https://kubernetes.io/docs/concepts/workloads/pods/pod-qos/
NVIDIA GPU Feature Discovery (label reference): https://github.com/NVIDIA/k8s-device-plugin/blob/main/docs/gpu-feature-discovery/README.md
Mellanox/NVIDIA k8s-rdma-shared-dev-plugin: https://github.com/Mellanox/k8s-rdma-shared-dev-plugin
NVIDIA k8s-device-plugin (NUMA/topology hints): https://github.com/NVIDIA/k8s-device-plugin

1. Fregly, Ch. 3, "Kubernetes for Topology-Aware Container Orchestration and Networking": an 8-GPU/two-NVLink-domain node where a topology-blind scheduler splits a 4-GPU job across domains "can cut your inter-GPU bandwidth in half."
2. Fregly, Ch. 3: "The device plugin is topology aware ... it can prefer to allocate multiple GPUs from the same NVLink Switch or the same NUMA node for a given pod."
3. Fregly, Ch. 3, "Optimizing Network Communication for Kubernetes": set hostNetwork: true; host networking "allows a container to access the InfiniBand interconnect exactly as the host does ... particularly useful for MPI jobs."
4. Fregly, Ch. 3: NVL72 "connects 72 GPUs into a single high-bandwidth domain with a combined ~130 TB/s ... (72 GPUs * 1.8 TB/s per GPU)"; "prefer placements that keep traffic inside the fast NVLink domain before crossing the slower network fabric."
5. Fregly, Ch. 3, "NUMA Awareness and CPU Pinning": "local NUMA node memory access latency is ~80 ns compared to remote ... ~139 ns."
6. Fregly, Ch. 3, "Orchestrating Containers with Kubernetes Topology Manager": "configuring --topology-manager-policy to best-effort, restricted, or, in some cases, single-numa-node ... complements the OS-level NUMA tuning."
7. Fregly, Ch. 3, "Improving Resource Guarantees": "To obtain Guaranteed QoS, every container must set requests == limits for both CPU and memory. Setting a high limit alone will result in a Burstable QoS, not Guaranteed."
8. Fregly, Ch. 3, "Slicing a GPU with MIG": "the scheduler cannot split these across GPUs or nodes ... the pod remains in a Kubernetes Pending (unscheduled) state ... even if other nodes collectively have enough MIG capacity."
9. Fregly, Ch. 3: "when a GPU is in MIG mode, GPU-to-GPU peer-to-peer communication (including NVLink) is disabled ... Large-scale training jobs ... are typically not good candidates for MIG."
10. Fregly, Ch. 3: "the NVIDIA Kubernetes GPU Operator's MIG Manager can automatically configure and preserve MIG partitions ... across reboots and driver reloads"; "Persistence mode is recommended when using MIG."
11. Fregly, Ch. 3: "You can label one K8s node with 'mig-enabled' and another as 'mig-disabled' and let the scheduler place jobs/pods accordingly."
12. Fregly, Ch. 3: "install the Kubernetes RDMA device plugin from Mellanox ... exposes InfiniBand and GPUDirect RDMA endpoints"; over overlay use NCCL_PORT_RANGE and NCCL_SOCKET_IFNAME.
13. Fregly, Ch. 3: the GPU Operator is "responsible for node labeling using NVIDIA's GPU Feature Discovery to label each GPU with its NUMA node and NVLink/NVSwitch ID." (Per-GPU NUMA label not documented upstream; see the GFD reference.)
14. Fregly, Ch. 3: "Topology-aware GPU scheduling is still maturing."
15. Fregly, Ch. 3, "Dealing with I/O Isolation": "containers can't change kernel parameters like hugepage settings or CPU governor limits ... cluster admins set these ... through the base OS image. Or ... the NVIDIA GPU Operator."
16. Kubernetes docs, "Control Topology Management Policies on a Node": --topology-manager-policy values none/best-effort/restricted/single-numa-node; topologyManagerScope values container/pod; alignment of CPUs requires the CPU Manager static policy; single-numa-node/restricted rejections surface as TopologyAffinityError. https://kubernetes.io/docs/tasks/administer-cluster/topology-manager/
17. Kubernetes docs, "Pod Quality of Service Classes": Guaranteed requires every container to set CPU and memory requests equal to limits, both > 0; classes are Guaranteed, Burstable, BestEffort. https://kubernetes.io/docs/concepts/workloads/pods/pod-qos/
18. NVIDIA GPU Feature Discovery label reference: emits nvidia.com/gpu.product, nvidia.com/gpu.count, nvidia.com/gpu.family, nvidia.com/gpu.machine, MIG labels, and nvidia.com/gpu.clique (NVLink fabric grouping); no per-GPU NUMA-node label documented. https://github.com/NVIDIA/k8s-device-plugin/blob/main/docs/gpu-feature-discovery/README.md
19. k8s-rdma-shared-dev-plugin advertises shared RDMA resources under the rdma/ prefix (e.g. rdma/hca_shared_devices_a) for InfiniBand and RoCE HCAs. https://github.com/Mellanox/k8s-rdma-shared-dev-plugin