
Kubernetes & Helm: GPU platform

Scope: the manifests that turn a plain Kubernetes cluster into a GPU platform. The pieces are the GPU Operator, network/RDMA, the sharing models, DRA, and gang scheduling with quota. This page is the runnable counterpart to Kubernetes for GPUs.

These are reference templates derived from upstream Helm charts and CRDs. Pin chart and image versions, and in production apply them via GitOps (see SRE and MLOps practices) rather than by hand.

flowchart TB
  K8S["Plain Kubernetes"] --> GPUOP["GPU Operator"]
  GPUOP --> NETOP["Network Operator"]
  NETOP --> SHARE["GPU sharing model"]
  SHARE --> SCHED["Gang scheduler and quota"]
  SCHED --> SMOKE["Smoke tests"]

Overview

Everything here is one of two things: a Helm release that installs an operator, or a CRD that expresses intent (which GPU, how it is shared, how it is scheduled). Assemble four pieces: the GPU Operator (makes nodes GPU-aware), the Network Operator (brings RDMA into pods), a sharing model (whole GPU, MIG, or time-slicing), and a gang scheduler plus quota (so distributed jobs and tenants behave). Validate each piece with a smoke test before layering the next.

1. GPU Operator (make the cluster GPU-aware)

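# Assumes the NVIDIA driver is already installed on the host, hence driver.enabled=false.
# Omit that flag only if the operator should manage a containerized driver instead.
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia && helm repo update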
helm install --create-namespace -n gpu-operator gpu-operator nvidia/gpu-operator \
  --version <pinned> \
  --set driver.enabled=false \
  --set mig.strategy=single \
  --set dcgmExporter.enabled=true \
  --set toolkit.enabled=true

Smoke test:

apiVersion: v1
kind: Pod
metadata: { name: cuda-smoke, namespace: gpu-operator }
spec:
  restartPolicy: Never
  containers:
    - name: smi
      image: nvidia/cuda:13.0.0-base-ubuntu24.04
      command: ["nvidia-smi"]
      resources: { limits: { nvidia.com/gpu: 1 } }

After the pod completes, kubectl logs cuda-smoke -n gpu-operator must print the nvidia-smi GPU table.
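The log check can be scripted so that it fails loudly in CI. The sketch below is a minimal example, not part of the upstream charts: looks_like_nvidia_smi is a hypothetical helper name, and it assumes only that the nvidia-smi banner line carries both "NVIDIA-SMI" and "Driver Version:".

```shell
# Hypothetical helper: succeeds when stdin contains the nvidia-smi banner line.
# Matching on "Driver Version:" avoids accepting nvidia-smi error messages,
# which can also mention "NVIDIA-SMI".
looks_like_nvidia_smi() {
  grep -Eq 'NVIDIA-SMI .*Driver Version:'
}

# Usage against the smoke-test pod (not run here):
#   kubectl wait --for=jsonpath='{.status.phase}'=Succeeded \
#     pod/cuda-smoke -n gpu-operator --timeout=180s
#   kubectl logs cuda-smoke -n gpu-operator | looks_like_nvidia_smi \
#     && echo "GPU smoke test passed"
```

Waiting for the Succeeded phase first means a crashed container is reported as a timeout rather than as a confusing empty log.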

2. Network Operator (RDMA / GPUDirect into pods)

helm install -n nvidia-network-operator --create-namespace network-operator \
  nvidia/network-operator --version <pinned> \
  --set nfd.enabled=true \
  --set sriovNetworkOperator.enabled=false

Then apply a NicClusterPolicy (RDMA shared device plugin plus a secondary network) so pods get an IB/RoCE interface; without it, NCCL falls back to TCP (see performance tuning). Verify with a 2-node nccl-tests job (see workload recipes).
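To tell an RDMA run from a TCP fallback, the NCCL_DEBUG=INFO log of the nccl-tests job can be classified mechanically. The sketch below is illustrative only: nccl_transport is a hypothetical helper, and the marker strings (GDRDMA, NET/IB, NET/Socket) are assumptions about typical NCCL log output whose exact wording can vary between NCCL versions.

```shell
# Hypothetical helper: classify an NCCL_DEBUG=INFO log read from stdin.
# The marker strings are assumptions about typical NCCL output.
nccl_transport() {
  nccl_log=$(cat)
  if printf '%s\n' "$nccl_log" | grep -q 'GDRDMA'; then
    echo gdrdma        # RDMA with a GPUDirect marker
  elif printf '%s\n' "$nccl_log" | grep -q 'NET/IB'; then
    echo ib            # RDMA transport, no GPUDirect marker seen
  elif printf '%s\n' "$nccl_log" | grep -q 'NET/Socket'; then
    echo socket        # TCP fallback
  else
    echo unknown
  fi
}

# Usage (not run here; <nccl-tests-pod> is a placeholder):
#   kubectl logs <nccl-tests-pod> | nccl_transport
```

A result of socket from inside a pod, when the host itself reports ib, usually points at the pod not receiving the RDMA device rather than at the fabric.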

3. GPU sharing model (whole / MIG / time-slice)

Time-slicing (dev/bursty, no isolation). Patch the device-plugin config:

apiVersion: v1
kind: ConfigMap
metadata: { name: time-slicing-config, namespace: gpu-operator }
data:
  any: |-
    version: v1
    sharing:
      timeSlicing:
        resources:
          - name: nvidia.com/gpu
            replicas: 4          # 1 physical GPU advertised as 4 -> NO memory isolation

kubectl apply -f time-slicing-config.yaml
helm upgrade gpu-operator nvidia/gpu-operator -n gpu-operator --reuse-values \
  --set devicePlugin.config.name=time-slicing-config \
  --set devicePlugin.config.default=any
kubectl rollout restart daemonset/nvidia-device-plugin-daemonset -n gpu-operator
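After the device plugin restarts, the node should advertise more nvidia.com/gpu resources than it has physical GPUs. The sketch below computes the expected count; expected_gpu_slots is a hypothetical helper, and it assumes every GPU on the node is time-sliced with the same replica count.

```shell
# Hypothetical helper: GPUs a node should advertise under time-slicing.
#   $1 = physical GPUs on the node, $2 = replicas from the ConfigMap
expected_gpu_slots() {
  echo $(( $1 * $2 ))
}

# Usage (not run here): compare with what the node actually advertises.
#   want=$(expected_gpu_slots 8 4)
#   got=$(kubectl get node <gpu-node> \
#     -o jsonpath='{.status.allocatable.nvidia\.com/gpu}')
#   [ "$got" = "$want" ] || echo "expected $want slots, node reports $got"
```

If the node still reports the physical count, the device plugin has most likely not picked up the new config yet.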

MIG (hard isolation). Set the node label the MIG manager watches:

kubectl label node <gpu-node> nvidia.com/mig.config=all-1g.10gb --overwrite
# single strategy: pods request nvidia.com/gpu: 1 (mixed strategy exposes nvidia.com/mig-1g.10gb)
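Repartitioning is asynchronous, so it helps to wait for the MIG manager to report the result before scheduling work. The sketch below polls the nvidia.com/mig.config.state node label, which the MIG manager is expected to set to success or failed; wait_for_mig and the MIG_POLL_SECONDS variable are hypothetical names introduced for illustration.

```shell
# Hypothetical helper: poll the MIG manager's state label on a node.
#   $1 = node name, $2 = max attempts (default 30)
# Returns 0 on success, 1 if the MIG manager reports failed, 2 on timeout.
wait_for_mig() {
  mig_node=$1
  mig_tries=${2:-30}
  mig_i=0
  while [ "$mig_i" -lt "$mig_tries" ]; do
    mig_state=$(kubectl get node "$mig_node" \
      -o jsonpath='{.metadata.labels.nvidia\.com/mig\.config\.state}')
    case "$mig_state" in
      success) return 0 ;;
      failed)  return 1 ;;
    esac
    mig_i=$((mig_i + 1))
    sleep "${MIG_POLL_SECONDS:-10}"   # pause between attempts
  done
  return 2
}

# Usage (not run here):
#   wait_for_mig <gpu-node> 60 || echo "MIG reconfiguration did not succeed"
```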

4. DRA: Dynamic Resource Allocation (K8s 1.34+, the successor to the device plugin)

# DRA-backed GPU pools must not also advertise the legacy nvidia.com/gpu device plugin.
kubectl label node <gpu-node> nvidia.com/dra-kubelet-plugin=true --overwrite

# Quote the env[0] keys so shells such as zsh do not treat the brackets as a glob.
helm upgrade --install gpu-operator nvidia/gpu-operator \
  --version <pinned> \
  --create-namespace --namespace gpu-operator \
  --set devicePlugin.enabled=false \
  --set 'driver.manager.env[0].name=NODE_LABEL_FOR_GPU_POD_EVICTION' \
  --set 'driver.manager.env[0].value=nvidia.com/dra-kubelet-plugin'

cat > dra-values.yaml <<'EOF'
kubeletPlugin:
  nodeSelector:
    nvidia.com/dra-kubelet-plugin: "true"
EOF

helm upgrade --install nvidia-dra-driver-gpu nvidia/nvidia-dra-driver-gpu \
  --version <pinned> \
  --namespace nvidia-dra-driver-gpu --create-namespace \
  --set nvidiaDriverRoot=/run/nvidia/driver \
  --set gpuResourcesEnabledOverride=true \
  -f dra-values.yaml

# Claim one GPU, then reference it from a pod
apiVersion: resource.k8s.io/v1
kind: ResourceClaimTemplate
metadata: { name: single-gpu, namespace: ml }
spec:
  spec:
    devices:
      requests:
        - name: gpu
          exactly:
            deviceClassName: gpu.nvidia.com
---
apiVersion: v1
kind: Pod
metadata: { name: dra-train, namespace: ml }
spec:
  resourceClaims:
    - name: g
      resourceClaimTemplateName: single-gpu
  containers:
    - name: trainer
      image: nvcr.io/nvidia/pytorch:25.05-py3
      resources: { claims: [{ name: g }] }
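Because claims are never satisfied on an unsupported cluster, a preflight check before installing the DRA driver can save debugging time. The sketch below encodes the two thresholds stated on this page (Kubernetes 1.34 and driver 580); dra_preflight is a hypothetical helper that takes the two version strings as arguments rather than discovering them itself.

```shell
# Hypothetical preflight: $1 = Kubernetes server version (e.g. v1.34.2),
# $2 = NVIDIA driver version (e.g. 580.65.06). Thresholds come from this page.
dra_preflight() {
  k8s=${1#v}                        # drop a leading "v"
  k8s_major=${k8s%%.*}
  k8s_rest=${k8s#*.}
  k8s_minor=${k8s_rest%%.*}
  k8s_minor=${k8s_minor%%[!0-9]*}   # "34+" -> "34"
  drv_major=${2%%.*}

  if [ "$k8s_major" -lt 1 ] ||
     { [ "$k8s_major" -eq 1 ] && [ "$k8s_minor" -lt 34 ]; }; then
    echo "Kubernetes $1 is older than 1.34"
    return 1
  fi
  if [ "$drv_major" -lt 580 ]; then
    echo "driver $2 is older than 580"
    return 1
  fi
  echo "ok"
}

# Usage (not run here): feed in the versions reported by the cluster, e.g.
# from `kubectl version -o json` and, on a GPU node,
# `nvidia-smi --query-gpu=driver_version --format=csv,noheader`.
#   dra_preflight v1.34.2 580.65.06
```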

5. Gang scheduling + quota (distributed jobs and tenants)

Install Volcano (or KAI Scheduler) for all-or-nothing placement (Kubernetes for GPUs):

helm repo add volcano-sh https://volcano-sh.github.io/helm-charts
helm install volcano volcano-sh/volcano -n volcano-system --create-namespace

A gang-scheduled job declares minAvailable so it only starts when every worker can be placed (full Job in workload recipes). Add Kueue for fair-share quota across teams:

apiVersion: kueue.x-k8s.io/v1beta2
kind: ResourceFlavor
metadata: { name: h100 }
spec:
  nodeLabels:
    nvidia.com/gpu.product: NVIDIA-H100-80GB-HBM3
---
apiVersion: kueue.x-k8s.io/v1beta2
kind: ClusterQueue
metadata: { name: research }
spec:
  namespaceSelector: {}
  resourceGroups:
    - coveredResources: ["nvidia.com/gpu"]
      flavors:
        - name: h100
          resources: [{ name: "nvidia.com/gpu", nominalQuota: 64 }]
---
apiVersion: kueue.x-k8s.io/v1beta2
kind: LocalQueue
metadata: { name: gpu-queue, namespace: team-a }
spec:
  clusterQueue: research

Don't-miss checklist

Set driver.enabled=false in the GPU Operator if the driver is host-installed (Ansible bring-up); never run both.
Pin every chart and image version; apply via Argo CD/Flux, not helm install by hand (SRE and MLOps practices).
Wire the Network Operator and prove RDMA with nccl-tests from pods, not just the host (performance tuning).
Run a gang scheduler before launching any multi-pod job; add Kueue for tenant quota.
Choose the sharing model deliberately; time-slicing is not isolation (security and multi-tenancy).

Failure modes

GPU Operator driver container fighting a host driver: node stuck NotReady.
Time-slicing replicas advertised as if they were isolated GPUs: OOM/noisy-neighbour.
Default scheduler partial-placing a distributed job: GPUs idle, deadlock.
DRA installed on K8s <1.34 or driver <580: claims never satisfied.

Open questions & validation

Validate the GPU Operator ClusterPolicy against the deployment model (host driver vs container) on one node first.
Confirm NicClusterPolicy yields an RDMA device in-pod and GDR engages (NCCL_DEBUG=INFO shows [GDRDMA]).
Exercise DRA partitionable-device claims on 1.34.2+ before depending on them.

References

GPU Operator (Helm): https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/getting-started.html
Network Operator: https://github.com/Mellanox/network-operator
DRA driver for GPUs: https://github.com/NVIDIA/k8s-dra-driver-gpu
Time-slicing / MIG in k8s: https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/gpu-sharing.html
Volcano: https://volcano.sh/en/docs/ · Kueue: https://kueue.sigs.k8s.io/ · KAI: https://github.com/NVIDIA/KAI-Scheduler

Per-component cookbook pages

Helm installs: GPU Operator · Network Operator · DRA driver · Volcano · Kueue
Manifests: GPU Operator ClusterPolicy · NicClusterPolicy · time-slicing · MIG mode · DRA ResourceClaim · DCGM exporter · Volcano Job · Kueue ClusterQueue
Platform: RBAC for operators · smoke tests

Related: Kubernetes · Optimization · Security · Ansible · Telemetry · Workloads · Glossary