Helm: NVIDIA GPU operator

Scope: helm repo add nvidia + helm install gpu-operator, the load-bearing chart values (driver.enabled, driver.version, mig.strategy, toolkit.enabled, devicePlugin, dcgmExporter.enabled, nfd.enabled), air-gapped registry overrides, and the upgrade flow. The Helm-release detail behind step 1 of Kubernetes & Helm: GPU Platform.

Reference template from the upstream chart and docs, not hardware-tested. Chart pinned to v26.3.2 (latest stable at time of writing).¹ Re-verify the chart version and driver.version against your driver tier (Driver and Feature Support by GPU Tier) before rollout, and apply via GitOps (Argo CD/Flux) rather than helm install by hand in production (SRE and MLOps practices).

What it is

The GPU Operator is a single Helm release that makes every GPU node Kubernetes-aware. It deploys and reconciles a stack of operands as DaemonSets, driven by one ClusterPolicy custom resource (apiVersion: nvidia.com/v1, see GPU Operator ClusterPolicy): the NVIDIA driver, container-toolkit, device-plugin, gpu-feature-discovery (GFD), DCGM + dcgm-exporter, node-feature-discovery (NFD), the operator-validator, and, on MIG-capable hardware, the mig-manager.¹ ² Each operand's init container gates on the previous one being healthy, so a failing driver pod leaves everything downstream stuck in Init. That is the most common operational signature on this page.⁶

It replaces the hand-rolled chain of host driver + container toolkit + a manually applied device plugin. The trade-off is a reboot-class, node-mutating controller that you must version-pin and roll one node at a time. If the driver is already host-installed (Ansible bring-up, Driver and Feature Support by GPU Tier), set driver.enabled=false and let the Operator manage only the userspace operands. Never run a container driver against a host driver.¹
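One way to pick the driver.enabled value for a node image is to check whether an NVIDIA kernel module is already loaded before the Operator is installed. The sketch below is an assumption-laden helper, not part of the chart: the function name is hypothetical, and it assumes /proc/driver/nvidia/version is readable whenever a driver is loaded, which says nothing about who installed that driver.

```shell
# Hypothetical helper: suggest a driver.enabled value for a node that has
# not yet run the GPU Operator. Assumes the NVIDIA kernel module exposes
# /proc/driver/nvidia/version whenever a driver is loaded.
suggest_driver_enabled() {
  version_file=${1:-/proc/driver/nvidia/version}
  if [ -r "$version_file" ]; then
    # A driver is already loaded: leave the driver lifecycle to the host.
    echo "driver.enabled=false"
  else
    # No driver found: the Operator can install the driver container.
    echo "driver.enabled=true"
  fi
}
```

Run it on a representative GPU node (for example over SSH) and pass the printed value to --set. On a node where the Operator has already loaded a driver container the file also exists, so the check is only meaningful before the first install.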

flowchart TB
  HELM["helm install gpu-operator (v26.3.2)"] --> OP["gpu-operator controller"]
  OP --> CP["ClusterPolicy (nvidia.com/v1)"]
  CP --> NFD["node-feature-discovery"]
  NFD --> GFD["gpu-feature-discovery"]
  CP --> DRV["nvidia-driver-daemonset (driver.enabled)"]
  DRV --> TK["nvidia-container-toolkit-daemonset"]
  TK --> DP["nvidia-device-plugin-daemonset"]
  DP --> VAL["nvidia-operator-validator"]
  CP --> DCGM["nvidia-dcgm-exporter"]
  GFD --> LABELS["node labels: nvidia.com/gpu.* , pci-10de.present"]

Prerequisites

A running Kubernetes cluster with kubectl + Helm 3 reachable, and worker nodes carrying NVIDIA GPUs (Kubernetes for GPUs).
Container runtime is containerd / CRI-O (the Operator configures the NVIDIA runtime for you via the toolkit operand).¹
Nodes not running a conflicting GPU stack: either no host driver (let the Operator install it) or a host driver with driver.enabled=false. Do not pre-install the standalone k8s-device-plugin or NFD if the Operator will manage them.¹
For MIG: MIG-capable hardware (A100/H100/H200/B200, see MIG); decide single vs mixed strategy before install.
Outbound pull access to nvcr.io and helm.ngc.nvidia.com, or a local mirror for air-gapped sites.

Install

Add the chart repo and install pinned. The repo URL is fixed by NVIDIA and gpu-operator is the namespace its install guide uses; the chart version is pinned by you.¹

helm repo add nvidia https://helm.ngc.nvidia.com/nvidia && helm repo update

Driver host-installed (recommended on managed fleets, since driver lifecycle stays with the node image):

helm install gpu-operator nvidia/gpu-operator \
  -n gpu-operator --create-namespace \
  --version v26.3.2 \
  --set driver.enabled=false \
  --set toolkit.enabled=true \
  --set mig.strategy=single \
  --set dcgmExporter.enabled=true \
  --set nfd.enabled=true \
  --wait

Operator-managed driver (Operator installs and reconciles the driver container; pin driver.version to your tier):

helm install gpu-operator nvidia/gpu-operator \
  -n gpu-operator --create-namespace \
  --version v26.3.2 \
  --set driver.enabled=true \
  --set driver.version="580.95.05" \
  --set mig.strategy=single \
  --wait

driver.version above is a reference value. Verify the exact driver branch/patch for your hardware against Driver Versions and Branches and Driver and Feature Support by GPU Tier. toolkit.enabled, dcgmExporter.enabled and nfd.enabled default to true, so they are shown only where the intent should be explicit.¹ Run helm show values nvidia/gpu-operator --version v26.3.2 to see the full surface before pinning anything.

Air-gapped registry overrides

NVIDIA's air-gapped flow mirrors every operand image to a local registry and overrides each operand's repository field, then installs from the pulled chart with a values file. There is no single global registry key; repository is set per operand.⁴ Build values-airgapped.yaml:

# values-airgapped.yaml -- point every operand at the local mirror.
# Image names/versions below are illustrative; pull the exact tags this
# chart version references and pin them. Verify against the air-gapped guide.
operator:
  repository: registry.internal.example.com:5000/nvidia
driver:
  repository: registry.internal.example.com:5000/nvidia
toolkit:
  repository: registry.internal.example.com:5000/nvidia
devicePlugin:
  repository: registry.internal.example.com:5000/nvidia
gfd:
  repository: registry.internal.example.com:5000/nvidia
dcgmExporter:
  repository: registry.internal.example.com:5000/nvidia
validator:
  repository: registry.internal.example.com:5000/nvidia
node-feature-discovery:
  image:
    repository: registry.internal.example.com:5000/nfd/node-feature-discovery

# Pull the chart on a connected host, transfer the .tgz, then:
helm install gpu-operator gpu-operator-v26.3.2.tgz \
  -n gpu-operator --create-namespace \
  -f values-airgapped.yaml --wait

If the local registry requires auth, create a pull secret in gpu-operator and set <operand>.imagePullSecrets for each operand.⁴
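As a sketch of the pull-secret wiring, the helper below prints one --set flag per operand. It is not taken from the NVIDIA guide: the function name and the regcred secret name are placeholders, and it assumes each <operand>.imagePullSecrets value is a list of secret names. Confirm that shape with helm show values before relying on it.

```shell
# Operand keys taken from values-airgapped.yaml above. The NFD subchart
# has its own values layout, so it is deliberately left out here.
OPERANDS="operator driver toolkit devicePlugin gfd dcgmExporter validator"

# Print one --set flag per operand for the given pull-secret name.
# Assumes <operand>.imagePullSecrets is a list of secret names.
pull_secret_flags() {
  secret_name=$1
  for operand in $OPERANDS; do
    printf '%s %s.imagePullSecrets={%s}\n' --set "$operand" "$secret_name"
  done
}

# Usage sketch (regcred, REG_USER and REG_PASS are placeholders):
#   kubectl create secret docker-registry regcred -n gpu-operator \
#     --docker-server=registry.internal.example.com:5000 \
#     --docker-username="$REG_USER" --docker-password="$REG_PASS"
#   helm install gpu-operator gpu-operator-v26.3.2.tgz -n gpu-operator \
#     -f values-airgapped.yaml $(pull_secret_flags regcred)
```

Keeping the operand list in one variable means a missed operand shows up as a missing flag rather than as an ImagePullBackOff later.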

Configuration

The values that change behaviour. Defaults are the chart defaults at v26.3.2.¹

| Value | Default | Effect |
| --- | --- | --- |
| `driver.enabled` | `true` | Operator installs/manages the driver container. Set `false` when the host driver is preinstalled; never run both (Driver and Feature Support by GPU Tier). |
| `driver.version` | per Operator release | Driver branch/patch the Operator deploys when `driver.enabled=true`. Pin to your tier (Driver Versions and Branches). |
| `driver.upgradePolicy.autoUpgrade` | `true` | Driver upgrade controller; rolls nodes one at a time on a version change (Rolling Driver / CUDA Upgrade).⁵ |
| `driver.upgradePolicy.drain.enable` | `false` | If pod eviction fails during upgrade, `kubectl drain` the node before driver reload.⁵ |
| `mig.strategy` | `single` | GPU advertisement under MIG. `single` advertises `nvidia.com/gpu`; `mixed` advertises `nvidia.com/mig-<slices>g.<mem>gb` (MIG).³ |
| `toolkit.enabled` | `true` | Deploy the NVIDIA Container Toolkit and CDI and wire the NVIDIA container runtime. |
| `devicePlugin.config.name` | `""` | Name of a `ConfigMap` selecting a device-plugin config (e.g. time-slicing). Empty = stock plugin.³ |
| `devicePlugin.config.default` | `""` | Default config key applied when a node has no `nvidia.com/device-plugin.config` label. |
| `dcgmExporter.enabled` | `true` | Deploy dcgm-exporter for Prometheus GPU metrics (telemetry). |
| `nfd.enabled` | `true` | Deploy Node Feature Discovery. Set `false` only if NFD is already cluster-managed (avoid double-deploy).¹ |
| `migManager.enabled` | `true` | Deploy the MIG manager that reconciles `nvidia.com/mig.config` node labels (MIG Mode). |

RBAC for the Operator and its operands is created by the chart (ServiceAccounts, ClusterRoles, bindings in the gpu-operator namespace); review it before granting in shared clusters (RBAC for GPU Platform Operators, Security, Isolation and Multi-tenancy).

Apply & verify

1. All operand pods reach Running or Completed. One-shot validation pods run to completion; DaemonSets run one pod per GPU node.¹ ⁶

kubectl get pods -n gpu-operator

Expected (one set per GPU node): nvidia-driver-daemonset-* (only if driver.enabled=true), nvidia-container-toolkit-daemonset-*, nvidia-device-plugin-daemonset-*, gpu-feature-discovery-*, nvidia-dcgm-exporter-*, nvidia-operator-validator-* (Running once its validation init containers have passed), nvidia-cuda-validator-* (Completed), and the gpu-operator + nfd controller pods.

2. DaemonSets are fully scheduled (DESIRED == READY across the GPU-node operands):

kubectl get ds -n gpu-operator
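The DESIRED == READY comparison can be scripted rather than eyeballed. This is a minimal sketch, assuming the default kubectl get ds column order (NAME, DESIRED, CURRENT, READY, ...); the function name is hypothetical.

```shell
# Hypothetical helper: read `kubectl get ds --no-headers` output on stdin
# and list every DaemonSet whose READY count differs from DESIRED.
# Assumes the default column order: NAME DESIRED CURRENT READY ...
ds_not_ready() {
  awk '
    $2 != $4 {
      printf "%s desired=%s ready=%s\n", $1, $2, $4
      bad = 1
    }
    END { exit bad + 0 }   # exit 1 if any DaemonSet is behind
  '
}

# Usage sketch:
#   kubectl get ds -n gpu-operator --no-headers | ds_not_ready
```

A DaemonSet with DESIRED 0 (for example an operand with no matching nodes) passes the check, since 0 equals 0; whether that is expected depends on your node labels.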

3. NFD/GFD node labels are present. The PCI label (10de = NVIDIA vendor ID) and the GFD GPU labels confirm discovery worked:¹

kubectl get nodes -o json | jq '.items[].metadata.labels
  | with_entries(select(.key
      | test("^nvidia\\.com/(gpu|cuda)|^feature\\.node\\.kubernetes\\.io/pci-10de")))'

Expect feature.node.kubernetes.io/pci-10de.present: "true", nvidia.com/gpu.present: "true", plus nvidia.com/gpu.product, nvidia.com/gpu.count, nvidia.com/gpu.memory, and nvidia.com/cuda.driver.major.¹

4. Operator validator reports success:

kubectl logs -n gpu-operator -l app=nvidia-operator-validator \
  -c nvidia-operator-validator | grep "all validations are successful"

5. A CUDA smoke pod runs nvidia-smi through a scheduled GPU request (the end-to-end proof: driver + toolkit + device plugin all working):⁷

apiVersion: v1
kind: Pod
metadata:
  name: cuda-smoke
  namespace: gpu-operator
spec:
  restartPolicy: Never
  containers:
    - name: smi
      image: nvcr.io/nvidia/cuda:13.0.0-base-ubuntu24.04
      command: ["nvidia-smi"]
      resources:
        limits:
          nvidia.com/gpu: 1

kubectl apply -f cuda-smoke.yaml
kubectl wait --for=jsonpath='{.status.phase}'=Succeeded pod/cuda-smoke -n gpu-operator --timeout=180s
kubectl logs cuda-smoke -n gpu-operator   # must print the GPU table
kubectl delete pod cuda-smoke -n gpu-operator

The log must show the GPU model and driver/CUDA versions. A pod stuck Pending with Insufficient nvidia.com/gpu means the device plugin has not advertised capacity yet. Recheck steps 1-3. Fuller smoke suite in Smoke Tests: GPU Platform.

Upgrade flow

Chart and driver upgrades are distinct. Pin the new chart version explicitly; never float latest.¹

Chart upgrade (keeps your prior --set values; substitute the target chart version for v26.3.2):

helm repo update
helm upgrade gpu-operator nvidia/gpu-operator \
  -n gpu-operator --version v26.3.2 --reuse-values

Driver upgrade only (Operator-managed driver). Bump driver.version to the target release (the value shown is the install-time reference, so substitute your own); the upgrade controller rolls nodes one at a time, evicting GPU pods before reloading the driver:⁵

helm upgrade gpu-operator nvidia/gpu-operator \
  -n gpu-operator --reuse-values \
  --set driver.version="580.95.05"

For full node draining on eviction failure add --set driver.upgradePolicy.drain.enable=true. Watch the roll with kubectl get pods -n gpu-operator -w and the per-node sequencing in Rolling Driver / CUDA Upgrade. Across a chart upgrade that changes CRDs, confirm the ClusterPolicy still reconciles (kubectl get clusterpolicy -o wide) before declaring success.
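The post-upgrade ClusterPolicy check can be turned into a polling gate for a pipeline. The sketch below assumes the ClusterPolicy reports readiness in .status.state with the value ready, and that the cluster has exactly one ClusterPolicy; the function name, attempt count and delay are hypothetical.

```shell
# Hypothetical helper: poll the ClusterPolicy until it reports ready.
# Assumes a single ClusterPolicy whose .status.state becomes "ready".
# Args: attempts (default 30), delay in seconds between polls (default 10).
wait_clusterpolicy_ready() {
  attempts=${1:-30}
  delay=${2:-10}
  i=0
  state=""
  while [ "$i" -lt "$attempts" ]; do
    state=$(kubectl get clusterpolicy \
      -o jsonpath='{.items[0].status.state}' 2>/dev/null) || state=""
    if [ "$state" = "ready" ]; then
      return 0
    fi
    i=$((i + 1))
    sleep "$delay"
  done
  echo "ClusterPolicy state: ${state:-unknown}" >&2
  return 1
}

# Usage sketch: helm upgrade ... && wait_clusterpolicy_ready 60 10
```

A non-zero exit leaves the last observed state on stderr, which is usually enough to decide between waiting longer and reading the operator logs.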

Failure modes

Driver pod CrashLoopBackOff, everything downstream Init. Container driver fighting a host driver, or a driver.version the kernel can't build against. Set driver.enabled=false if host-installed, or pin a supported branch (Kernel upgrade: GPU missing, Driver Versions and Branches).⁶
CUDA pod stuck Pending, Insufficient nvidia.com/gpu. Device plugin hasn't advertised capacity: GFD/NFD labels missing or the device-plugin DaemonSet not Ready. Recheck Apply & verify steps 2-3.
mixed strategy but pods request nvidia.com/gpu. Under mixed, MIG devices advertise as nvidia.com/mig-<slices>g.<mem>gb; whole-GPU requests stay Pending. Match the request to the strategy (MIG).³
Air-gapped pods ImagePullBackOff. A repository override missed an operand, a tag isn't mirrored, or the pull secret is absent. Diff helm get values gpu-operator -n gpu-operator against the mirror contents.⁴
Double NFD. nfd.enabled=true on a cluster that already runs NFD produces duplicate workers and label churn. Set nfd.enabled=false to defer to the cluster's NFD.¹
Driver upgrade wedged. Eviction blocked by a PodDisruptionBudget or a pod that won't terminate; node sits in drain-required. Enable driver.upgradePolicy.drain.enable or clear the blocker (Rolling Driver / CUDA Upgrade).⁵
MIG node labels stale after reconfig. mig.config changed but capacity not updated; see Stale MIG State and MIG Mode.
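A first-pass triage of the pod list can map each unhealthy status to the failure modes above. This is a rough sketch, assuming the default kubectl get pods column order (NAME, READY, STATUS, ...); the function name and hint wording are hypothetical, and the hints are starting points rather than diagnoses.

```shell
# Hypothetical helper: read `kubectl get pods --no-headers` output on stdin
# and print name, status and a hint for every pod that is not healthy.
# Assumes the default column order: NAME READY STATUS RESTARTS AGE.
triage_pods() {
  awk '
    $3 == "Running" || $3 == "Completed" { next }
    {
      hint = "run kubectl describe pod for details"
      if ($3 ~ /ImagePullBackOff|ErrImagePull/)
        hint = "check repository overrides, mirrored tags and pull secret"
      else if ($3 ~ /CrashLoopBackOff/)
        hint = "check for a host driver conflict or an unsupported driver.version"
      else if ($3 ~ /^Init:/)
        hint = "waiting on an upstream operand; inspect the driver pod first"
      else if ($3 == "Pending")
        hint = "check node labels and advertised nvidia.com/gpu capacity"
      printf "%s\t%s\t%s\n", $1, $3, hint
    }
  '
}

# Usage sketch:
#   kubectl get pods -n gpu-operator --no-headers | triage_pods
```

Image-pull and crash-loop states are matched before the Init: prefix so that a status such as Init:ImagePullBackOff is reported as a pull problem rather than as plain init gating.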

References

Installing the NVIDIA GPU Operator (getting started — repo add, install command, default values, node labels, operands): https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/getting-started.html
GPU Operator with MIG (mig.strategy single vs mixed, nvidia.com/mig-* resources): https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/gpu-operator-mig.html
Install in air-gapped environments (per-operand repository overrides, local mirror, pull secrets): https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/install-gpu-operator-air-gapped.html
GPU Driver Upgrades / driver.upgradePolicy (autoUpgrade, drain, one-node-at-a-time): https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/gpu-driver-upgrades.html
Upgrading the GPU Operator (helm upgrade --reuse-values, --version): https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/upgrade.html
Troubleshooting (validator logs, init-gating, operand status): https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/troubleshooting.html
GPU Operator Helm chart (NGC): https://catalog.ngc.nvidia.com/orgs/nvidia/helm-charts/gpu-operator

1. NVIDIA GPU Operator, Getting Started: helm repo add nvidia https://helm.ngc.nvidia.com/nvidia && helm repo update; install into namespace gpu-operator via --create-namespace with --version=v26.3.2; chart nvidia/gpu-operator. Defaults: driver.enabled=true, mig.strategy=single, toolkit.enabled=true, dcgmExporter.enabled=true, nfd.enabled=true, devicePlugin.config={}. --set driver.enabled=false when the driver is preinstalled. NFD/GFD node labels: feature.node.kubernetes.io/pci-10de.present=true (0x10de = NVIDIA PCI vendor ID), nvidia.com/gpu.present, nvidia.com/gpu.product, nvidia.com/gpu.count, nvidia.com/gpu.memory, nvidia.com/cuda.driver.major. https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/getting-started.html
2. GPU Operator operands deployed and reconciled from the ClusterPolicy (apiVersion: nvidia.com/v1): driver, container-toolkit, device-plugin, gpu-feature-discovery, dcgm, dcgm-exporter, node-feature-discovery, operator-validator, and mig-manager on supported hardware. https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/getting-started.html
3. GPU Operator with MIG: mig.strategy values are single (MIG enabled on all GPUs of a node; advertised as nvidia.com/gpu) and mixed (advertised as nvidia.com/mig-<slices>g.<mem>gb). Device-plugin config selection via a ConfigMap referenced by devicePlugin.config.name. https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/gpu-operator-mig.html
4. GPU Operator air-gapped install: mirror operand images to a local registry and set each operand's repository field (operator, driver, toolkit, devicePlugin, gfd, dcgmExporter, validator, node-feature-discovery) in values.yaml; install from the pulled chart .tgz with -f values.yaml. Auth registries use per-operand imagePullSecrets. https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/install-gpu-operator-air-gapped.html
5. GPU Operator driver upgrades: helm upgrade gpu-operator nvidia/gpu-operator -n gpu-operator --set driver.version="<branch.patch>" --reuse-values; driver.upgradePolicy.autoUpgrade (default true) enables the upgrade controller (one node at a time, GPU pods evicted before reload); driver.upgradePolicy.drain.enable triggers kubectl drain when pod eviction fails. https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/gpu-driver-upgrades.html
6. GPU Operator troubleshooting: operand init containers gate on upstream operands, so a failing driver pod leaves downstream operands in Init; verify with kubectl logs -n gpu-operator -l app=nvidia-operator-validator -c nvidia-operator-validator | grep "all validations are successful". https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/troubleshooting.html
7. NVIDIA CUDA images on NGC (nvcr.io/nvidia/cuda) ship -base/-runtime/-devel variants per CUDA version and Ubuntu base, e.g. 13.0.0-base-ubuntu24.04; pick a tag whose CUDA series the deployed driver supports. https://catalog.ngc.nvidia.com/orgs/nvidia/containers/cuda