Kubernetes for GPU clusters

Scope: Kubernetes as the orchestration technology under a GPU platform (its objects, control loops, and CRD model) and how training, inference, and fine-tuning land on it. This is the technology page behind the GPU-specific stacks in Kubernetes for GPUs and the Kubernetes platform; Kubernetes is one of the families surveyed in the orchestration overview.

The manifests and commands on this page are reference templates written against real APIs; pin versions and validate before production use.

What it is

Kubernetes is a declarative container orchestrator: you POST desired state (objects) to the api-server, which persists it to etcd, and controllers drive actual state toward desired state in a continuous reconcile loop. The scheduler binds Pods to nodes; the per-node kubelet runs containers via a CRI runtime (typically containerd). The unit of compute is the Pod (one or more co-scheduled containers sharing a network namespace); higher-level objects (Deployment, StatefulSet, Job, DaemonSet) are reconciled by controllers that manage Pods.

It is extensible by design: Custom Resource Definitions (CRDs) add new object kinds, and operators are controllers that reconcile them. GPU support comes entirely from add-ons. Kubernetes core has no built-in notion of a GPU, only the generic device-plugin and Dynamic Resource Allocation (DRA) interfaces; the NVIDIA GPU Operator and its device plugin / DRA driver make a node GPU-aware (Kubernetes for GPUs).

Why use it

Multi-tenant, declarative platform: namespaces, RBAC, quotas, and a single API for many teams (security and multi-tenancy).
Services and networking: stable Service VIPs, DNS, and Ingress/Gateway make it the right substrate for long-running inference (inference serving).
GitOps: desired state is YAML in git, applied by Argo CD / Flux (auditable, reproducible) (SRE and MLOps practices).
Ecosystem: a vast operator catalogue (KServe, Kubeflow, KubeRay, Volcano, KAI) means most ML systems already ship a CRD.
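
To make the quota point concrete, a per-namespace GPU cap can be sketched with a ResourceQuota. The namespace `ml` matches the other examples on this page; the object name `gpu-quota` and the numbers are illustrative assumptions, not recommendations. Extended resources such as `nvidia.com/gpu` can only be capped through the `requests.` prefix.

```yaml
# Illustrative only: caps the total GPUs that Pods in namespace "ml" may request.
apiVersion: v1
kind: ResourceQuota
metadata: { name: gpu-quota, namespace: ml }   # name is hypothetical
spec:
  hard:
    requests.nvidia.com/gpu: "8"               # extended resources are quota'd via requests.<name>
    pods: "50"                                 # assumed ceiling on Pod count
```

A Pod that would push the namespace past 8 requested GPUs is rejected at admission rather than left Pending.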

When to use it (and when not)

Use Kubernetes for a shared, multi-tenant platform mixing services, batch, and inference; for anything that benefits from GitOps and a rich operator ecosystem.
Prefer Slurm (Slurm) for tightly-coupled, topology-sensitive multi-node pretraining on bare metal. Slurm allocates all of a job's nodes at once and places them topology-aware via topology.conf natively, whereas on K8s both are add-ons.
Prefer / add Ray (Ray) for Python-native distributed workloads and RL; run it on Kubernetes via KubeRay rather than as a parallel stack.
For edge / dev / small clusters where full control-plane overhead is unwarranted, use k3s (k3s): same API, single binary.

Architecture

flowchart TB
  subgraph CP["Control plane"]
    API["api-server"]
    SCHED["scheduler (+ KAI / Volcano)"]
    ETCD["etcd"]
    CM["controller-manager"]
  end
  subgraph Node["GPU node"]
    KUBELET["kubelet"]
    GPUOP["GPU Operator"]
    DEV["device plugin / DRA"]
    POD["Pod: limits nvidia.com/gpu"]
  end
  API --- ETCD
  SCHED -->|"bind Pod"| KUBELET
  CM --> API
  KUBELET --> POD
  GPUOP --> DEV
  POD -->|"requests GPU"| DEV

How to use it

kubectl is the primary client; everything is an object you apply. A minimal GPU Pod requests the extended resource the device plugin advertises (full GPU-platform install in the Kubernetes platform):

kubectl apply -f gpu-pod.yaml           # namespace ml comes from the manifest and must already exist
kubectl get pods -n ml -o wide          # see node binding
kubectl logs -n ml gpu-smoke            # nvidia-smi output

# gpu-pod.yaml — one whole GPU via the device plugin
apiVersion: v1
kind: Pod
metadata: { name: gpu-smoke, namespace: ml }
spec:
  restartPolicy: Never
  containers:
    - name: smi
      image: nvidia/cuda:13.0.0-base-ubuntu24.04   # pin to a real tag
      command: ["nvidia-smi"]
      resources: { limits: { nvidia.com/gpu: 1 } }
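
The Pod above takes the device-plugin path. On clusters where DRA is enabled, the same request can instead be expressed as a claim. This is a sketch that assumes the `resource.k8s.io/v1` API (K8s 1.34+) and a DeviceClass named `gpu.nvidia.com` published by the NVIDIA DRA driver; the object names `single-gpu` and `gpu-smoke-dra` are hypothetical, and the class name is worth confirming with `kubectl get deviceclasses` before use.

```yaml
# Template from which a per-Pod ResourceClaim is generated.
apiVersion: resource.k8s.io/v1
kind: ResourceClaimTemplate
metadata: { name: single-gpu, namespace: ml }       # hypothetical name
spec:
  spec:
    devices:
      requests:
        - name: gpu
          exactly:
            deviceClassName: gpu.nvidia.com          # assumed DeviceClass name
---
apiVersion: v1
kind: Pod
metadata: { name: gpu-smoke-dra, namespace: ml }    # hypothetical name
spec:
  restartPolicy: Never
  resourceClaims:
    - name: gpu
      resourceClaimTemplateName: single-gpu
  containers:
    - name: smi
      image: nvidia/cuda:13.0.0-base-ubuntu24.04     # pin to a real tag
      command: ["nvidia-smi"]
      resources:
        claims:
          - name: gpu                                # refers to spec.resourceClaims[].name
```

The container no longer sets `limits: nvidia.com/gpu`; the claim is what the scheduler and the DRA driver act on.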

How to develop with it

Templating and extension are the day-to-day developer surface:

Helm packages a set of objects as a versioned, parameterised chart (helm install --version <pinned>); Kustomize overlays patches onto a base without templating. The GPU platform itself is installed this way (the Kubernetes platform).
Operators + CRDs extend the API. You author a CRD (a new kind), write a controller that watches it and reconciles, and users then declare intent in that high-level object. KServe, Kubeflow Trainer, and KubeRay are all this pattern.

helm repo add nvidia https://helm.ngc.nvidia.com/nvidia && helm repo update
helm install -n gpu-operator --create-namespace gpu-operator \
  nvidia/gpu-operator --version <pinned>            # see the kubernetes-helm-gpu-platform page for full values
kubectl get crds | grep -E 'nvidia|kserve|ray|kubeflow'
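
The operator pattern described above starts from a CRD. As a minimal sketch, the definition below registers a made-up kind, `FineTuneRun`, in the group `ml.example.com`; the kind, group, and `spec` fields are all hypothetical and exist only for illustration. A controller (not shown) would watch these objects and reconcile them into Jobs or Pods.

```yaml
apiVersion: apiextensions.k8s.io/v1
kind: CustomResourceDefinition
metadata:
  name: finetuneruns.ml.example.com      # must be <plural>.<group>
spec:
  group: ml.example.com                  # hypothetical API group
  scope: Namespaced
  names:
    kind: FineTuneRun
    plural: finetuneruns
    singular: finetunerun
  versions:
    - name: v1alpha1
      served: true
      storage: true
      schema:
        openAPIV3Schema:                 # validated by the api-server on every write
          type: object
          properties:
            spec:
              type: object
              required: ["baseModel", "gpus"]
              properties:
                baseModel: { type: string }
                gpus: { type: integer, minimum: 1 }
```

Once applied, `kubectl get finetuneruns -n ml` behaves like any built-in kind, which is why most ML systems expose their intent this way.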

How to scale it

Workload scale-out: a Deployment/StatefulSet sets replicas; a HorizontalPodAutoscaler (or KEDA on a queue/custom metric) scales Pods on load.
Node scale-out: the Cluster Autoscaler grows/shrinks pre-defined node groups when Pods are unschedulable; Karpenter provisions the exact GPU instance a pending Pod needs, directly via the cloud API. Both are SIG-Autoscaling projects (cloud and cost).
Gang scheduling for multi-node jobs: the default scheduler places Pods one at a time, so a distributed job can be partially placed and then deadlock, holding GPUs while it waits for Pods that never fit. KAI Scheduler (NVIDIA, CNCF Sandbox) or Volcano adds all-or-nothing placement and topology awareness; Kueue adds job queueing with fair-share quota (Kubernetes for GPUs, performance tuning).
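
For the workload scale-out case, a HorizontalPodAutoscaler on a per-Pod custom metric might look like the sketch below. It targets the `embed` Deployment from the cookbook; the metric name `requests_in_flight` is hypothetical and assumes a custom-metrics adapter (for example Prometheus Adapter) is installed and exporting it. The replica bounds and target value are illustrative.

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata: { name: embed, namespace: ml }
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: embed
  minReplicas: 2
  maxReplicas: 8                       # each replica needs one GPU, so this also bounds GPU use
  metrics:
    - type: Pods
      pods:
        metric:
          name: requests_in_flight     # hypothetical metric served by the custom-metrics API
        target:
          type: AverageValue
          averageValue: "4"            # scale out when the per-Pod average exceeds 4
```

Because every new replica requests a GPU, scale-out only helps if a node autoscaler can supply GPU nodes; otherwise the extra Pods stay Pending.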

Inference

Kubernetes is the standard substrate for production serving. KServe is the model-inference platform (InferenceService CRD, autoscaling incl. scale-to-zero, canary). Triton (now Dynamo-Triton) implements the KServe V2 protocol and is a drop-in runtime; NVIDIA Dynamo brings datacentre-scale, disaggregated prefill/decode serving with the Grove CRD scheduled by KAI. Serving details live in inference serving and disaggregated inference; this page only places them on K8s.
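
As a sketch of how a model lands on KServe, the InferenceService below serves an ONNX model on the Triton runtime. It assumes the `kserve-tritonserver` ClusterServingRuntime that ships with a default KServe install, and that KServe runs in its Knative-based serverless mode (needed for `minReplicas: 0`). The object name and `storageUri` are hypothetical placeholders.

```yaml
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata: { name: embed-onnx, namespace: ml }         # hypothetical name
spec:
  predictor:
    minReplicas: 0                                     # scale-to-zero when idle
    maxReplicas: 4
    model:
      modelFormat: { name: onnx }
      runtime: kserve-tritonserver                     # assumed to be installed in the cluster
      storageUri: gs://example-bucket/models/embed     # hypothetical model location
      resources: { limits: { nvidia.com/gpu: 1 } }
```

`kubectl get inferenceservice -n ml` then reports the URL and readiness of the endpoint.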

Fine-tuning

Training and post-training run as Jobs or via operators. Kubeflow Trainer (v2: TrainJob / ClusterTrainingRuntime, replacing the v1 PyTorchJob) and Volcano training jobs wrap distributed PyTorch (torchrun) with gang scheduling. RL stacks (verl, slime, …) typically run on Ray via KubeRay (Ray, RL libraries). Methods and recipes are in distributed training/fine-tuning and post-training; see also distributed-training recipes.
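
A Kubeflow Trainer v2 job can be sketched as follows. This assumes the `trainer.kubeflow.org/v1alpha1` API and a ClusterTrainingRuntime named `torch-distributed` being present in the cluster; field names should be checked against the installed Trainer version, and the job name and `train.py` entrypoint are hypothetical.

```yaml
apiVersion: trainer.kubeflow.org/v1alpha1
kind: TrainJob
metadata: { name: sft-run, namespace: ml }      # hypothetical name
spec:
  runtimeRef:
    name: torch-distributed                     # ClusterTrainingRuntime; assumed to be installed
  trainer:
    numNodes: 2                                 # 2 nodes x 8 GPUs
    image: nvcr.io/nvidia/pytorch:25.05-py3     # pin to a real NGC tag
    command: ["torchrun", "train.py"]           # launch settings are expected to come from the runtime
    resourcesPerNode:
      limits: { nvidia.com/gpu: 8 }
```

The split is deliberate: platform owners maintain the runtime (gang scheduling, launcher settings), while users only declare nodes, image, and resources.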

Optimised hardware

Reaching GPU-fabric performance from a Pod is the platform's job, not the workload's:

GPU Operator installs the driver, container toolkit, device plugin / DRA driver (DRA itself is GA as of K8s 1.34), DCGM exporter, and MIG manager as one release (Kubernetes for GPUs).
Network Operator wires host RDMA, SR-IOV/RDMA device plugins, Multus, and GPUDirect so NCCL uses IB/RoCE with GDR; without it NCCL silently falls back to TCP (performance tuning, the Kubernetes platform).
Topology Manager (kubelet) aligns GPU + NIC + CPU on one NUMA/PCIe domain so GDR engages; GPU Feature Discovery labels expose model/MIG/NVLink for rail-aware placement (networking fabric, GPU performance and health).
Verify the fast path with NCCL_DEBUG=INFO showing [GDRDMA]; keep PCIe ACS off for P2P/GDR.
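
The Topology Manager item above is a kubelet setting, not a workload setting. A minimal kubelet configuration sketch is shown below; the reserved CPU range is an illustrative assumption, and the choice of `restricted` over `single-numa-node` is a judgment call, since the stricter policy may reject Pods that cannot fit on one NUMA node (for example whole-node 8-GPU Pods on dual-socket hosts).

```yaml
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
cpuManagerPolicy: static              # exclusive cores for Guaranteed Pods with integer CPU requests
reservedSystemCPUs: "0-3"             # illustrative; the static policy needs a CPU reservation
topologyManagerPolicy: restricted     # reject placements that are not NUMA-aligned where alignment is possible
topologyManagerScope: pod             # align all containers of a Pod together
```

Alignment only applies to Pods in the Guaranteed QoS class, so GPU workloads need equal CPU and memory requests and limits to benefit.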

Cookbook (common use cases)

1. Deploy a GPU workload (Deployment + GPU request)

apiVersion: apps/v1
kind: Deployment
metadata: { name: embed, namespace: ml }
spec:
  replicas: 2
  selector: { matchLabels: { app: embed } }
  template:
    metadata: { labels: { app: embed } }
    spec:
      containers:
        - name: server
          image: myregistry/embed:1.4.0            # pin; never :latest
          resources: { limits: { nvidia.com/gpu: 1 } }

2. Gang-scheduled multi-node job (Volcano)

apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata: { name: ddp-train, namespace: ml }
spec:
  minAvailable: 2                                   # all-or-nothing: 2 Pods x 8 GPUs = 16 ranks
  schedulerName: volcano
  tasks:
    - replicas: 2                                   # one Pod per node
      name: worker
      template:
        spec:
          containers:
            - name: trainer
              image: nvcr.io/nvidia/pytorch:25.05-py3   # pin to a real NGC tag
              # multi-node rendezvous flags (--rdzv_backend, --rdzv_endpoint) omitted; set them for your cluster
              command: ["torchrun", "--nnodes=2", "--nproc_per_node=8", "train.py"]
              resources: { limits: { nvidia.com/gpu: 8, rdma/rdma_shared_device_a: 1 } }

3. Expose RDMA into a Pod (request the RDMA resource)

# Requires the Network Operator + NicClusterPolicy (see kubernetes-helm-gpu-platform); the resource name
# matches what the RDMA shared device plugin advertises.
resources:
  limits:
    nvidia.com/gpu: 8
    rdma/rdma_shared_device_a: 1
# then inside: NCCL_IB_HCA, NCCL_NET_GDR_LEVEL=SYS; confirm [GDRDMA] in NCCL_DEBUG=INFO

Gotchas & failure modes

Default scheduler partial-places a distributed job → GPUs idle, deadlock. Run a gang scheduler for any multi-Pod job (Kubernetes for GPUs).
No Network Operator / wrong NCCL_IB_HCA → NCCL on TCP, training crawls (performance tuning).
Time-slicing treated as isolation → no per-tenant memory cap, noisy-neighbour/OOM. Use MIG for hard isolation (security and multi-tenancy).
Mutating cluster state by hand instead of GitOps → drift; apply via Argo CD/Flux (SRE and MLOps practices).
etcd as the bottleneck/SPOF: run an odd number of etcd members (3 or 5) so the control plane keeps quorum through a member failure; etcd I/O latency gates the whole API.
DRA on K8s < 1.34 or driver < 580 → ResourceClaims never satisfied (the Kubernetes platform).
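
For the time-slicing gotcha, hard isolation means requesting a MIG slice instead of a shared whole GPU. The sketch below assumes the MIG manager is configured with the `mixed` strategy, under which each profile is advertised as its own resource. The profile name `mig-1g.10gb` depends on the GPU model and the configured layout, so treat it as an assumption and check `kubectl describe node` for the names actually advertised; the Pod name is hypothetical.

```yaml
apiVersion: v1
kind: Pod
metadata: { name: mig-smoke, namespace: ml }          # hypothetical name
spec:
  restartPolicy: Never
  containers:
    - name: smi
      image: nvidia/cuda:13.0.0-base-ubuntu24.04       # pin to a real tag
      command: ["nvidia-smi", "-L"]                    # lists the MIG device visible to the container
      resources: { limits: { nvidia.com/mig-1g.10gb: 1 } }   # assumed profile name
```

Unlike a time-sliced replica, the slice has its own memory and compute partition, so one tenant cannot exhaust another's GPU memory.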

References

Kubernetes docs (concepts, objects): https://kubernetes.io/docs/concepts/
DRA graduated to GA (v1.34): https://kubernetes.io/blog/2025/09/01/kubernetes-v1-34-dra-updates/
NVIDIA GPU Operator: https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/index.html
KServe: https://kserve.github.io/website/ · NVIDIA Dynamo: https://docs.nvidia.com/dynamo/latest/
KAI Scheduler: https://github.com/kai-scheduler/KAI-Scheduler · Volcano: https://volcano.sh/en/docs/ · Kueue: https://kueue.sigs.k8s.io/
Kubeflow Trainer: https://www.kubeflow.org/docs/components/trainer/ · Node autoscaling: https://kubernetes.io/docs/concepts/cluster-administration/node-autoscaling/

Related: K8s GPU · K8s Platform · Orchestration · k3s · Ray · Slurm · Glossary