Kubernetes & Helm: GPU platform
Scope: the manifests that turn a plain Kubernetes cluster into a GPU platform. The pieces: GPU Operator, network/RDMA, the sharing models, DRA, and gang scheduling with quota. This page is the runnable counterpart to Kubernetes for GPUs.
These are reference templates drawn from upstream Helm charts and CRDs. Pin chart and image versions; apply via GitOps (SRE and MLOps practices) rather than by hand in production.
flowchart TB
K8S["Plain Kubernetes"] --> GPUOP["GPU Operator"]
GPUOP --> NETOP["Network Operator"]
NETOP --> SHARE["GPU sharing model"]
SHARE --> SCHED["Gang scheduler and quota"]
SCHED --> SMOKE["Smoke tests"]
Overview
Everything here is one of two things: a Helm release that installs an operator, or a CRD that expresses intent (which GPU, how it is shared, how it is scheduled). Assemble four pieces: the GPU Operator (makes nodes GPU-aware), the Network Operator (RDMA into pods), a sharing model (whole/MIG/time-slice), and a gang scheduler plus quota (so distributed jobs and tenants behave). Validate each with a smoke test before layering the next.
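The "validate each layer before adding the next" rule can be sketched as a small gate. This is a minimal illustration, not part of any upstream tooling: run_layers and the check names passed to it are hypothetical, and each check is assumed to be a command or shell function that exits non-zero on failure.

```shell
# run_layers: run each check in order and stop at the first failure,
# so a broken layer is fixed before the next one is installed.
# Each argument is the name of a command or shell function (hypothetical).
run_layers() {
  for layer_check in "$@"; do
    if "$layer_check"; then
      echo "ok: $layer_check"
    else
      echo "FAILED: $layer_check (fix this layer before installing the next)"
      return 1
    fi
  done
}

# Example (check_* are placeholders you would write around each smoke test):
#   run_layers check_gpu_operator check_rdma check_sharing check_gang
```

Stopping at the first failure keeps the blast radius small: a later layer is never debugged on top of an unverified earlier one.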
1. GPU Operator (make the cluster GPU-aware)
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia && helm repo update
helm install --create-namespace -n gpu-operator gpu-operator nvidia/gpu-operator \
  --version <pinned> \
  --set driver.enabled=false \
  --set mig.strategy=single \
  --set dcgmExporter.enabled=true \
  --set toolkit.enabled=true
Smoke test:
apiVersion: v1
kind: Pod
metadata: { name: cuda-smoke, namespace: gpu-operator }
spec:
  restartPolicy: Never
  containers:
    - name: smi
      image: nvidia/cuda:13.0.0-base-ubuntu24.04
      command: ["nvidia-smi"]
      resources: { limits: { nvidia.com/gpu: 1 } }
kubectl logs cuda-smoke -n gpu-operator must print the GPU table.
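To turn "must print the GPU table" into a pass/fail check, a minimal sketch follows. It assumes the nvidia-smi table header contains the literal string NVIDIA-SMI; smi_table_present is a hypothetical helper name, and the kubectl lines in the comments are one possible way to feed it.

```shell
# smi_table_present: succeed if stdin looks like nvidia-smi's table.
# Assumption: the table header contains the literal string "NVIDIA-SMI".
smi_table_present() {
  grep -q "NVIDIA-SMI"
}

# Possible usage once the pod has finished:
#   kubectl wait --for=jsonpath='{.status.phase}'=Succeeded \
#     pod/cuda-smoke -n gpu-operator --timeout=120s
#   kubectl logs cuda-smoke -n gpu-operator | smi_table_present && echo PASS
```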
2. Network Operator (RDMA / GPUDirect into pods)
helm install -n nvidia-network-operator --create-namespace network-operator \
  nvidia/network-operator --version <pinned> \
  --set nfd.enabled=true \
  --set sriovNetworkOperator.enabled=false
Then apply a NicClusterPolicy (RDMA shared device plugin + secondary network) so pods get an IB/RoCE interface; without it NCCL falls back to TCP (performance tuning). Verify with a 2-node nccl-tests job (workload recipes).
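One way to spot the TCP fallback is to classify the transport in the job's NCCL_DEBUG=INFO output. The sketch below is an assumption-laden illustration: nccl_transport is a hypothetical helper, and it assumes the log marks InfiniBand/RoCE paths with NET/IB, GPUDirect RDMA with GDRDMA, and the TCP path with NET/Socket. Exact strings can differ between NCCL versions, so check against your own logs.

```shell
# nccl_transport: classify the transport seen in an NCCL_DEBUG=INFO log on stdin.
# Assumed markers (may vary by NCCL version):
#   "GDRDMA"     -> RDMA with GPUDirect
#   "NET/IB"     -> RDMA over InfiniBand/RoCE
#   "NET/Socket" -> TCP fallback
nccl_transport() (
  log=$(cat)
  case "$log" in
    *GDRDMA*)     echo "rdma+gdr" ;;
    *NET/IB*)     echo "rdma" ;;
    *NET/Socket*) echo "tcp-fallback" ;;
    *)            echo "unknown" ;;
  esac
)

# Possible usage (pod name is a placeholder):
#   kubectl logs <nccl-test-pod> | nccl_transport
```

Anything other than an RDMA result from inside the pod suggests the NicClusterPolicy or secondary network is not wired through yet.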
3. Sharing models (pick per workload)
Time-slicing (dev/bursty, no isolation). Patch the device-plugin config:
apiVersion: v1
kind: ConfigMap
metadata: { name: time-slicing-config, namespace: gpu-operator }
data:
  any: |-
    version: v1
    sharing:
      timeSlicing:
        resources:
          - name: nvidia.com/gpu
            replicas: 4   # 1 physical GPU advertised as 4 -> NO memory isolation
kubectl apply -f time-slicing-config.yaml
helm upgrade gpu-operator nvidia/gpu-operator -n gpu-operator --reuse-values \
  --set devicePlugin.config.name=time-slicing-config \
  --set devicePlugin.config.default=any
kubectl rollout restart daemonset/nvidia-device-plugin-daemonset -n gpu-operator
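After the restart, a node should advertise physical GPUs times replicas. A minimal sketch of that check, assuming the resource keeps the name nvidia.com/gpu as in the ConfigMap above; expected_gpu_slots and slots_match are hypothetical helper names.

```shell
# expected_gpu_slots: GPUs a node should advertise once time-slicing is active.
#   $1 = physical GPUs on the node, $2 = replicas from the ConfigMap
expected_gpu_slots() {
  echo $(( $1 * $2 ))
}

# slots_match: compare the expectation with what the node actually reports.
#   $1 = physical GPUs, $2 = replicas, $3 = allocatable nvidia.com/gpu
slots_match() {
  [ "$(expected_gpu_slots "$1" "$2")" -eq "$3" ]
}

# Possible usage for an 8-GPU node with replicas: 4:
#   actual=$(kubectl get node <gpu-node> \
#     -o jsonpath='{.status.allocatable.nvidia\.com/gpu}')
#   slots_match 8 4 "$actual" && echo "time-slicing active"
```

The larger number is only a scheduling count: every replica still shares the same GPU memory, which is why the comment in the ConfigMap stresses NO memory isolation.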
MIG (hard isolation). Set the node label the MIG manager watches:
kubectl label node <gpu-node> nvidia.com/mig.config=all-1g.10gb --overwrite
# single strategy: pods request nvidia.com/gpu: 1 (mixed strategy exposes nvidia.com/mig-1g.10gb)
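The strategy-to-resource mapping in the comment above can be written out as a lookup. This is only an illustration of that mapping; mig_resource_name is a hypothetical helper, not an NVIDIA tool.

```shell
# mig_resource_name: resource a pod should request for a given MIG strategy.
#   $1 = strategy (single|mixed), $2 = MIG profile, e.g. 1g.10gb
mig_resource_name() {
  case "$1" in
    single) echo "nvidia.com/gpu" ;;
    mixed)  echo "nvidia.com/mig-$2" ;;
    *)      echo "unknown strategy: $1" >&2; return 1 ;;
  esac
}

# Before scheduling pods, it may help to confirm the MIG manager finished.
# Assumption: it reports progress in the nvidia.com/mig.config.state label:
#   kubectl get node <gpu-node> \
#     -o jsonpath='{.metadata.labels.nvidia\.com/mig\.config\.state}'
```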
4. DRA: Dynamic Resource Allocation (K8s 1.34+, the successor)
# DRA-backed GPU pools must not also advertise the legacy nvidia.com/gpu device plugin.
kubectl label node <gpu-node> nvidia.com/dra-kubelet-plugin=true --overwrite
helm upgrade --install gpu-operator nvidia/gpu-operator \
  --version <pinned> \
  --create-namespace --namespace gpu-operator \
  --set devicePlugin.enabled=false \
  --set 'driver.manager.env[0].name=NODE_LABEL_FOR_GPU_POD_EVICTION' \
  --set 'driver.manager.env[0].value=nvidia.com/dra-kubelet-plugin'
cat > dra-values.yaml <<'EOF'
kubeletPlugin:
  nodeSelector:
    nvidia.com/dra-kubelet-plugin: "true"
EOF
helm upgrade --install nvidia-dra-driver-gpu nvidia/nvidia-dra-driver-gpu \
  --version <pinned> \
  --namespace nvidia-dra-driver-gpu --create-namespace \
  --set nvidiaDriverRoot=/run/nvidia/driver \
  --set gpuResourcesEnabledOverride=true \
  -f dra-values.yaml
# Claim one GPU, then reference it from a pod
apiVersion: resource.k8s.io/v1
kind: ResourceClaimTemplate
metadata: { name: single-gpu, namespace: ml }
spec:
  spec:
    devices:
      requests:
        - name: gpu
          exactly:
            deviceClassName: gpu.nvidia.com
---
apiVersion: v1
kind: Pod
metadata: { name: dra-train, namespace: ml }
spec:
  resourceClaims:
    - name: g
      resourceClaimTemplateName: single-gpu
  containers:
    - name: trainer
      image: nvcr.io/nvidia/pytorch:25.05-py3
      resources: { claims: [{ name: g }] }
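Because this page ties DRA to Kubernetes 1.34+ and driver 580+, a preflight check before applying the claim can save a confusing wait. The sketch below only compares version strings against those two floors; dra_preflight is a hypothetical helper, and the commands in the comments are one possible way to obtain the inputs.

```shell
# dra_preflight: check the version floors this page states for DRA
# (Kubernetes >= 1.34, NVIDIA driver >= 580).
#   $1 = Kubernetes version such as v1.34.2
#   $2 = driver version such as 580.65.06
dra_preflight() (
  k8s=${1#v}                    # drop a leading "v"
  major=${k8s%%.*}              # "1"
  rest=${k8s#*.}                # "34.2"
  minor=${rest%%[!0-9]*}        # "34" (also tolerates suffixes like "34+")
  driver_major=${2%%.*}         # "580"
  if [ "$major" -lt 1 ] || { [ "$major" -eq 1 ] && [ "$minor" -lt 34 ]; }; then
    echo "Kubernetes $1 is below 1.34"; exit 1
  fi
  if [ "$driver_major" -lt 580 ]; then
    echo "driver $2 is below 580"; exit 1
  fi
  echo "ok"
)

# Possible inputs:
#   kubectl get node <gpu-node> -o jsonpath='{.status.nodeInfo.kubeletVersion}'
#   nvidia-smi --query-gpu=driver_version --format=csv,noheader
```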
5. Gang scheduling + quota (distributed jobs and tenants)
Install Volcano (or KAI Scheduler) for all-or-nothing placement (Kubernetes for GPUs):
helm repo add volcano-sh https://volcano-sh.github.io/helm-charts
helm install volcano volcano-sh/volcano -n volcano-system --create-namespace \
  --version <pinned>
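To make all-or-nothing placement concrete, here is a minimal sketch that renders a Volcano Job whose minAvailable equals its replica count. render_gang_job is a hypothetical helper; the job name, namespace, image, and command are placeholders, and this is a stripped-down illustration rather than the full Job from workload recipes.

```shell
# render_gang_job: print a minimal Volcano Job whose minAvailable equals its
# replica count, so no worker starts unless all of them can be placed.
#   $1 = number of workers
# Job name, namespace, image, and command are placeholders.
render_gang_job() {
  cat <<EOF
apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata: { name: gang-demo, namespace: ml }
spec:
  schedulerName: volcano
  minAvailable: $1
  tasks:
    - name: worker
      replicas: $1
      template:
        spec:
          restartPolicy: Never
          containers:
            - name: trainer
              image: nvcr.io/nvidia/pytorch:25.05-py3
              command: ["sleep", "60"]
              resources: { limits: { nvidia.com/gpu: 1 } }
EOF
}

# Possible usage:
#   render_gang_job 4 | kubectl apply -f -
```

Keeping minAvailable and replicas in one parameter avoids the case where they drift apart and a job starts with only part of its workers.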
A gang-scheduled job declares minAvailable so it only starts when every worker can be placed (full Job in workload recipes). Add Kueue for fair-share quota across teams:
apiVersion: kueue.x-k8s.io/v1beta2
kind: ResourceFlavor
metadata: { name: h100 }
spec:
  nodeLabels:
    nvidia.com/gpu.product: NVIDIA-H100-80GB-HBM3
---
apiVersion: kueue.x-k8s.io/v1beta2
kind: ClusterQueue
metadata: { name: research }
spec:
  namespaceSelector: {}
  resourceGroups:
    - coveredResources: ["nvidia.com/gpu"]
      flavors:
        - name: h100
          resources: [{ name: "nvidia.com/gpu", nominalQuota: 64 }]
---
apiVersion: kueue.x-k8s.io/v1beta2
kind: LocalQueue
metadata: { name: gpu-queue, namespace: team-a }
spec:
  clusterQueue: research
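The nominalQuota of 64 above implies a simple admission test. The sketch below shows only that arithmetic and deliberately ignores cohorts, borrowing, and preemption, which Kueue can also apply; quota_admits is a hypothetical helper.

```shell
# quota_admits: would a workload fit inside a ClusterQueue's nominal quota?
#   $1 = nominalQuota, $2 = GPUs already admitted, $3 = GPUs requested
# Simplification: ignores cohorts, borrowing, and preemption.
quota_admits() {
  [ $(( $2 + $3 )) -le "$1" ]
}

# Example with the research queue above (64 GPUs):
#   quota_admits 64 56 8 && echo "admitted"      # 56 + 8 = 64, fits
#   quota_admits 64 60 8 || echo "stays queued"  # 60 + 8 = 68, does not fit
# Workloads select the queue with the label
#   kueue.x-k8s.io/queue-name: gpu-queue
```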
Don't-miss checklist
- Set driver.enabled=false in the GPU Operator if the driver is host-installed (Ansible bring-up); never run both.
- Pin every chart and image version; apply via Argo CD/Flux, not helm install by hand (SRE and MLOps practices).
- Wire the Network Operator and prove RDMA with nccl-tests from pods, not just the host (performance tuning).
- Run a gang scheduler before launching any multi-pod job; add Kueue for tenant quota.
- Choose the sharing model deliberately; time-slicing is not isolation (security and multi-tenancy).
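The "pin every image version" item can be enforced mechanically, for example in a CI step. A minimal sketch, assuming "pinned" means an explicit tag other than latest, or a sha256 digest; image_is_pinned is a hypothetical helper and the registry names in the comments are placeholders.

```shell
# image_is_pinned: succeed only if an image reference carries an explicit,
# non-"latest" tag or a sha256 digest.
image_is_pinned() (
  case "$1" in *@sha256:*) exit 0 ;; esac   # digest-pinned
  name=${1##*/}                             # drop registry (and port) and path
  case "$name" in
    *:latest) exit 1 ;;                     # floating tag
    *:?*)     exit 0 ;;                     # explicit tag
    *)        exit 1 ;;                     # no tag at all
  esac
)

# Examples:
#   image_is_pinned nvidia/cuda:13.0.0-base-ubuntu24.04   # succeeds
#   image_is_pinned nvidia/cuda:latest                     # fails
#   image_is_pinned registry.example:5000/team/app         # fails (port, no tag)
```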
Failure modes
- GPU Operator driver container fighting a host driver: node stuck NotReady.
- Time-slicing replicas advertised as if they were isolated GPUs: OOM/noisy-neighbour.
- Default scheduler partial-placing a distributed job: GPUs idle, deadlock.
- DRA installed on K8s <1.34 or driver <580: claims never satisfied.
Open questions & validation
- Validate the GPU Operator ClusterPolicy against the deployment model (host driver vs container) on one node first.
- Confirm NicClusterPolicy yields an RDMA device in-pod and GDR engages (NCCL_DEBUG=INFO shows [GDRDMA]).
- Exercise DRA partitionable-device claims on 1.34.2+ before depending on them.
References
- GPU Operator (Helm): https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/getting-started.html
- Network Operator: https://github.com/Mellanox/network-operator
- DRA driver for GPUs: https://github.com/NVIDIA/k8s-dra-driver-gpu
- Time-slicing / MIG in k8s: https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/gpu-sharing.html
- Volcano: https://volcano.sh/en/docs/ · Kueue: https://kueue.sigs.k8s.io/ · KAI: https://github.com/NVIDIA/KAI-Scheduler
Per-component cookbook pages
- Helm installs: GPU Operator · Network Operator · DRA driver · Volcano · Kueue
- Manifests: GPU Operator ClusterPolicy · NicClusterPolicy · time-slicing · MIG mode · DRA ResourceClaim · DCGM exporter · Volcano Job · Kueue ClusterQueue
- Platform: RBAC for operators · smoke tests
Related: Kubernetes · Optimization · Security · Ansible · Telemetry · Workloads · Glossary