Containers & Kubernetes for GPUs

Scope: running GPU workloads under containers and Kubernetes. This is the platform-engineering layer that turns a pool of provisioned nodes (provisioning and scheduling, the GPU software stack) into a self-service cluster.

Overview

Kubernetes does not understand GPUs natively: without a device plugin or DRA driver it cannot see them, and with the legacy plugin it sees only an opaque extended-resource count. Making GPUs usable means installing a stack that exposes, schedules, shares, and observes them. The platform skill is assembling that stack (operator, scheduler, topology, networking) and choosing a sharing model that matches the workload. The field is mid-transition from the legacy device plugin to DRA, so knowing both, and why DRA wins, is current and load-bearing.

Architecture: Kubernetes GPU stack

flowchart LR
  OP["GPU Operator"] --> DRV["Driver"]
  OP --> DP["Device plugin / DRA"]
  OP --> EXP["dcgm-exporter"]
  SCH["Gang scheduler: KAI / Volcano"] --> POD["Pod: requests nvidia.com/gpu"]
  POD --> DP
  NET["Network Operator"] -.->|"RDMA / GPUDirect"| POD

Common Kubernetes GPU questions

  • Device plugin vs DRA: which scheduling path? The NVIDIA device plugin advertises whole GPUs as an extended resource (nvidia.com/gpu); Dynamic Resource Allocation (DRA) is the newer, more expressive path for partial and attribute-selected devices. The scheduling-mechanisms section below compares them.
  • MIG or time-slicing to share a GPU? MIG gives hardware-isolated partitions for predictable multi-tenant QoS; time-slicing oversubscribes one GPU for bursty or dev workloads with no isolation. See the GPU-sharing section below.
  • How does this page relate to the other Kubernetes pages? This page is Kubernetes GPU scheduling: device plugin, DRA, MIG, RDMA. For Kubernetes architecture for GPU clusters see Kubernetes for GPU clusters; to install the GPU platform with Helm see Kubernetes and Helm GPU platform.

Core knowledge

Getting a GPU into a pod

  • The NVIDIA Container Toolkit (the GPU software stack) injects the driver's user-space libraries and the GPU device nodes into the container, increasingly through CDI (Container Device Interface), the runtime-neutral standard.
  • The NVIDIA GPU Operator automates the whole node stack as one Helm release: driver (host or containerised), container toolkit, device plugin, NFD (node feature discovery) + GFD (GPU feature discovery), DCGM + dcgm-exporter (observability), MIG manager, the validator, and now the DRA driver. This is the default way to make a fleet GPU-aware.

Scheduling mechanisms: device plugin vs DRA

  • Device plugin (legacy): advertises nvidia.com/gpu as a countable extended resource. Requests are whole integers only: no fractions, no attributes, no topology. Simple, ubiquitous, but expressively dead-ended.
  • DRA (Dynamic Resource Allocation): stable/GA in Kubernetes 1.34. The NVIDIA DRA driver for GPUs needs K8s 1.34.2+, driver 580+, and CDI enabled. It replaces integer counts with ResourceClaims, DeviceClasses, and structured parameters, so a claim can ask for "a GPU with ≥40 GB and MIG profile X" instead of a bare count. The same model also covers partitionable devices, device taints/tolerations, and multi-host/NVLink-attached devices. NVIDIA donated the DRA driver to the CNCF at KubeCon EU 2026. This is the direction; plan migration off the device plugin.
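
The difference between the two request shapes can be sketched by building both manifests as plain Python dictionaries. This is an illustration under stated assumptions, not a tested deployment: the `resource.k8s.io/v1` ResourceClaim layout follows the upstream DRA API as of 1.34, while the DeviceClass name `gpu.nvidia.com` and the CEL capacity expression are assumptions about what the NVIDIA DRA driver publishes and should be checked against the driver's own documentation. The helper names and the image name are hypothetical.

```python
import json


def device_plugin_pod(name: str, image: str, gpus: int) -> dict:
    """Pod that asks the legacy device plugin for whole GPUs (integers only)."""
    if not isinstance(gpus, int) or gpus < 1:
        raise ValueError("the device plugin only accepts whole GPUs (integer >= 1)")
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            "restartPolicy": "Never",
            "containers": [{
                "name": "main",
                "image": image,
                # Countable extended resource: no attributes, no fractions.
                "resources": {"limits": {"nvidia.com/gpu": gpus}},
            }],
        },
    }


def dra_claim(name: str, device_class: str, cel_expression: str) -> dict:
    """ResourceClaim that selects one device by attribute instead of by count."""
    return {
        "apiVersion": "resource.k8s.io/v1",
        "kind": "ResourceClaim",
        "metadata": {"name": name},
        "spec": {"devices": {"requests": [{
            "name": "gpu",
            "exactly": {
                "deviceClassName": device_class,
                "selectors": [{"cel": {"expression": cel_expression}}],
            },
        }]}},
    }


def dra_pod(name: str, image: str, claim_name: str) -> dict:
    """Pod that consumes an existing ResourceClaim by name."""
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            "restartPolicy": "Never",
            "resourceClaims": [{"name": "gpu", "resourceClaimName": claim_name}],
            "containers": [{
                "name": "main",
                "image": image,
                "resources": {"claims": [{"name": "gpu"}]},
            }],
        },
    }


if __name__ == "__main__":
    # ASSUMPTION: class name and capacity key are illustrative, not verified.
    expr = 'device.capacity["gpu.nvidia.com"].memory.compareTo(quantity("40Gi")) >= 0'
    claim = dra_claim("big-gpu", "gpu.nvidia.com", expr)
    pod = dra_pod("trainer", "example.invalid/trainer:latest", "big-gpu")
    print(json.dumps([claim, pod], indent=2))
```

The JSON output can be piped to `kubectl apply -f -`, since kubectl accepts JSON as well as YAML. The point of the comparison is that the device-plugin pod can only say "how many", while the claim carries a selector the scheduler evaluates against published device attributes.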

Sharing one GPU across pods

  • Time-slicing: oversubscribe a GPU; the device plugin hands the same GPU to N pods that context-switch. No memory isolation, no fault isolation, no fairness: fine for dev/bursty, dangerous for multi-tenant.
  • MPS: spatial SM sharing, higher throughput for cooperative small jobs, still no fault isolation.
  • MIG: advertise MIG instances as schedulable resources (single or mixed strategy) for hard isolation and per-tenant memory (security and multi-tenancy).
  • Combine MIG and time-slicing for hardware isolation plus flexible oversubscription: time-slice each MIG instance, so a 7-way MIG GPU with N replicas per slice advertises up to 7 × N schedulable units (GPU Operator: a MIG strategy plus a time-slicing config keyed to the MIG resource). Hard isolation between slices, soft oversubscription within each.
  • Choose deliberately: time-slicing (cheap, unsafe), MPS (throughput), MIG (isolation). Never give a single large training job a shared slice. When load varies over time, do not hand-pick one static split at all: use dynamic and fractional GPU sharing to rightsize allocations to real utilisation and scale idle replicas to zero.
  • The sharing options available depend on the node's hardware (GPU generations). MIG-backed partitioning needs a MIG-capable GPU: A100/A30, H100/H200, the Blackwell B-series, or the RTX PRO 6000 Blackwell. On consumer or other non-MIG GPUs (GeForce, A40/A10, L40S, RTX 6000 Ada) only time-slicing or MPS is available, with no hard isolation (security and multi-tenancy). The GPU Operator runs on all of them; advertise the right sharing strategy per node tier (MIG strategy where capable, otherwise time-slicing/MPS) rather than assuming MIG cluster-wide. GFD labels expose the GPU model and MIG capability so policy can target the correct nodes.
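
The 7 × N arithmetic above, together with the shape of a time-slicing config keyed to a MIG resource, can be sketched as follows. The `version` / `sharing.timeSlicing.resources` layout mirrors the NVIDIA device plugin's sharing config as best recalled and should be verified against the GPU Operator documentation before use. The function names are hypothetical, and the MIG resource name `nvidia.com/mig-1g.10gb` is only an example of a mixed-strategy resource.

```python
def advertised_units(gpus_per_node: int,
                     mig_instances_per_gpu: int = 1,
                     replicas: int = 1) -> int:
    """Schedulable units a node advertises once MIG and time-slicing are combined."""
    for label, value in (("gpus_per_node", gpus_per_node),
                         ("mig_instances_per_gpu", mig_instances_per_gpu),
                         ("replicas", replicas)):
        if value < 1:
            raise ValueError(f"{label} must be >= 1, got {value}")
    return gpus_per_node * mig_instances_per_gpu * replicas


def time_slicing_config(resource_name: str, replicas: int) -> dict:
    """Sharing config that oversubscribes one advertised resource N ways.

    ASSUMPTION: key names follow the device plugin's config file format;
    confirm them against the version you deploy.
    """
    if replicas < 2:
        raise ValueError("time-slicing needs at least 2 replicas to share anything")
    return {
        "version": "v1",
        "sharing": {
            "timeSlicing": {
                "resources": [{"name": resource_name, "replicas": replicas}],
            },
        },
    }


if __name__ == "__main__":
    # One GPU split 7 ways by MIG, each slice time-sliced 4 ways.
    print(advertised_units(1, mig_instances_per_gpu=7, replicas=4))
    # An 8-GPU node with the same policy.
    print(advertised_units(8, mig_instances_per_gpu=7, replicas=4))
    print(time_slicing_config("nvidia.com/mig-1g.10gb", 4))
```

Isolation holds only between the MIG instances. The replicas inside one instance still share memory and fault domain, so the larger unit count is not a larger number of safe tenants.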

AI-aware scheduling (gang scheduling matters)

  • The default kube-scheduler places pods one at a time. A distributed training job needs all-or-nothing placement: partial placement holds GPUs idle waiting for peers and can deadlock. Gang scheduling fixes this.
  • KAI Scheduler (NVIDIA, open-sourced Apache-2.0 from the Run:ai platform): gang scheduling, hierarchical fair-share queues, bin-packing to cut fragmentation, a built-in podgrouper, and topology-aware/hierarchical scheduling that integrates with Grove/Dynamo for disaggregated serving (inference serving).
  • Volcano (CNCF): batch/gang scheduling, queues, fair-share, the established open option.
  • Kueue (K8s-native): job queueing and quota management, composes with a gang-capable scheduler.
  • NVIDIA Run:ai: the commercial platform KAI was extracted from (fractional GPU, quotas, dashboards).
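
Why all-or-nothing placement matters can be shown with a toy simulation. It is a deliberately simplified model, not how kube-scheduler, KAI, or Volcano are implemented: every pod needs one GPU, and the pod-at-a-time scheduler is assumed to interleave pods from competing jobs round-robin. All names are hypothetical.

```python
def schedule_pod_at_a_time(free_gpus: int, jobs: dict) -> dict:
    """Place pods one by one, round-robin across jobs, until GPUs run out."""
    pending = dict(jobs)
    placed = {name: 0 for name in jobs}
    while free_gpus > 0 and any(pending.values()):
        for name in jobs:
            if pending[name] > 0 and free_gpus > 0:
                pending[name] -= 1
                placed[name] += 1
                free_gpus -= 1
    return placed


def schedule_gang(free_gpus: int, jobs: dict) -> dict:
    """Admit a job only if all of its pods fit at once; otherwise place none."""
    placed = {}
    for name, needed in jobs.items():
        if needed <= free_gpus:
            placed[name] = needed
            free_gpus -= needed
        else:
            placed[name] = 0
    return placed


def runnable(placed: dict, jobs: dict) -> list:
    """Jobs whose full set of pods is placed, so training can actually start."""
    return [name for name in jobs if placed[name] == jobs[name]]


def stranded_gpus(placed: dict, jobs: dict) -> int:
    """GPUs held by jobs that cannot start because peers are missing."""
    return sum(n for name, n in placed.items() if 0 < n < jobs[name])


if __name__ == "__main__":
    jobs = {"job-a": 4, "job-b": 4}   # two 4-pod distributed jobs
    cluster_gpus = 6                  # room for one job, not both

    naive = schedule_pod_at_a_time(cluster_gpus, jobs)
    gang = schedule_gang(cluster_gpus, jobs)

    print("pod-at-a-time:", naive, "runnable:", runnable(naive, jobs),
          "stranded:", stranded_gpus(naive, jobs))
    print("gang:         ", gang, "runnable:", runnable(gang, jobs),
          "stranded:", stranded_gpus(gang, jobs))
```

In the pod-at-a-time case each job ends up holding three GPUs, neither can start, and all six GPUs sit idle. With gang admission one job runs on four GPUs and the other waits without holding anything.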

Topology and networking

  • Topology Manager (kubelet) aligns GPU, NIC, and CPU on the same NUMA/PCIe domain so GPUDirect RDMA actually engages (GPU performance and health); GFD labels expose GPU model/MIG/NVLink so placement can be rail-aware (networking fabric).
  • Reaching the IB/RoCE fabric from a pod needs the NVIDIA Network Operator: host RDMA, Multus for multiple NICs, SR-IOV/RDMA device plugins, and GPUDirect. Without it, NCCL silently falls back to TCP sockets, an order-of-magnitude slowdown.
  • For GB200/GB300 NVL72 multi-node NVLink domains, IMEX coordinates the cross-node NVLink memory domain under Kubernetes.
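
Because the TCP fallback is silent, it is worth checking from inside the pod which transport NCCL selected. The sketch below scans captured output from a job run with `NCCL_DEBUG=INFO` for the "Using network" line. The exact log wording is an assumption based on common NCCL output and may differ across NCCL versions and network plugins, so treat the pattern as a starting point. The function names are hypothetical and the sample log lines are invented for illustration.

```python
import re

# ASSUMPTION: NCCL prints a line of the form "NCCL INFO Using network <name>".
_NETWORK_LINE = re.compile(r"NCCL INFO Using network (\S+)")


def nccl_networks(log_text: str) -> set:
    """Return every network backend name NCCL reported selecting."""
    return set(_NETWORK_LINE.findall(log_text))


def on_tcp_fallback(log_text: str) -> bool:
    """True if any rank reported the plain socket transport."""
    found = nccl_networks(log_text)
    if not found:
        raise ValueError(
            "no 'Using network' line found; was the job run with NCCL_DEBUG=INFO?"
        )
    return "Socket" in found


if __name__ == "__main__":
    # Invented sample lines, shaped like NCCL INFO output.
    good = "pod-0:71:71 [0] NCCL INFO Using network IB\n"
    bad = "pod-0:71:71 [0] NCCL INFO Using network Socket\n"
    print("good run on TCP fallback?", on_tcp_fallback(good))
    print("bad run on TCP fallback? ", on_tcp_fallback(bad))
```

A check like this fits as a gate after running nccl-tests from containers: fail the validation if any rank reports the socket transport, then inspect the RDMA device plugin and the `NCCL_IB_HCA` / `NCCL_SOCKET_IFNAME` settings.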

Don't-miss checklist

  • Install the GPU Operator rather than hand-rolling driver + plugin + exporter per node.
  • Pick a sharing model on purpose; never present time-slicing as isolation.
  • Run a gang scheduler (KAI/Volcano) for any multi-pod distributed job; integrate Kueue for quota.
  • Wire the Network Operator so pods reach IB/RoCE with GDR; confirm NCCL is not on TCP fallback (performance tuning).
  • Align GPU+NIC+CPU topology; expose NVLink/rail labels for placement.
  • Plan the device-plugin → DRA migration; validate driver 580+ and K8s 1.34.2+.

Failure modes

  • Time-slicing mistaken for isolation: noisy-neighbour and OOM with no per-tenant memory cap.
  • Default scheduler partial-places a distributed job: GPUs held idle, deadlock, collapsed utilisation.
  • Pods cannot reach the fabric (no RDMA device plugin / wrong NCCL_SOCKET_IFNAME / NCCL_IB_HCA): NCCL on TCP, training crawls.
  • Containerised driver fighting a host-installed driver; or MIG state on the node out of sync with what the plugin advertises.

Open questions & validation

  • DRA: author a ResourceClaim/DeviceClass and run the NVIDIA DRA driver end-to-end on K8s 1.34.2+ (manifests in the Kubernetes platform).
  • Gang scheduling config in KAI or Volcano for a real multi-node job, including queue/fair-share.
  • Network Operator + GPUDirect RDMA inside pods, proven with nccl-tests from containers, not just on the host.

References

  • NVIDIA GPU Operator: https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/index.html
  • NVIDIA DRA driver for GPUs: https://github.com/NVIDIA/k8s-dra-driver-gpu
  • Kubernetes DRA concept: https://kubernetes.io/docs/concepts/scheduling-eviction/dynamic-resource-allocation/
  • KAI Scheduler: https://github.com/NVIDIA/KAI-Scheduler
  • Volcano: https://volcano.sh/en/docs/
  • Kueue: https://kueue.sigs.k8s.io/
  • NVIDIA Network Operator (RDMA/GPUDirect in k8s): https://github.com/Mellanox/network-operator
  • O'Reilly Radar, Kubernetes in the Age of AI (context: Kubernetes as the default substrate for GenAI and agentic workloads): https://oreillyradar.substack.com/p/kubernetes-in-the-age-of-ai

Related: Fabric · Provisioning · Software Stack · Inference · GPU Allocation Operator · Platform Split-Plane Architecture · Optimization · Security · Glossary