Helm: NVIDIA DRA driver for GPUs

Scope: install nvidia-dra-driver-gpu via Helm so that the kubelet plugin publishes a ResourceSlice per node and pods can claim GPUs through Dynamic Resource Allocation. Covers controller and kubelet-plugin values, prerequisites (Kubernetes 1.34.2+ with DRA GA, CDI enabled in the GPU Operator, driver 580+), and ResourceSlice verification. Pairs with DRA ResourceClaim; this page makes the DRA section of the GPU platform hub runnable.

Reference templates from the upstream chart and NVIDIA docs (chart 25.12.0). Not hardware-tested here. Pin the chart and image versions and apply via GitOps rather than helm install by hand in production.

What it is

DRA (Dynamic Resource Allocation) is the Kubernetes 1.34 GA successor to the device plugin's nvidia.com/gpu integer counter. Instead of asking for N GPUs, a pod references a ResourceClaim that selects devices by attribute (memory, MIG profile, UUID) via CEL. The nvidia-dra-driver-gpu chart deploys two pieces: a controller Deployment (cluster-scoped, allocates claims) and a kubelet-plugin DaemonSet (per GPU node, discovers devices and publishes a ResourceSlice). The scheduler matches claims against ResourceSlices; the kubelet plugin prepares the device via CDI at pod admission. ¹

The chart manages two independent subsystems, toggled separately:

GPU allocation: DeviceClasses gpu.nvidia.com and mig.nvidia.com. Requires the GPU kubelet plugin, K8s 1.34.2+. ²
ComputeDomain: DeviceClasses compute-domain-daemon.nvidia.com and compute-domain-default-channel.nvidia.com, for multi-node NVLink (IMEX) fabrics. Works on K8s 1.32+. Out of scope here. ¹

flowchart TB
  CTRL["DRA controller (Deployment)"] --> CLAIM["ResourceClaim allocated"]
  KPLUGIN["GPU kubelet-plugin (DaemonSet, per node)"] --> SLICE["ResourceSlice published"]
  SLICE --> SCHED["kube-scheduler matches claim to slice"]
  CLAIM --> SCHED
  SCHED --> CDI["kubelet-plugin prepares device via CDI"]
  CDI --> POD["Pod gets GPU(s)"]

Prerequisites

Kubernetes 1.34.2 or newer for GPU allocation. DRA (resource.k8s.io/v1) is GA and enabled by default in 1.34; no feature gate flag is needed on a stock 1.34 cluster. NVIDIA recommends 1.34.2+ to avoid a known issue. ¹ (ComputeDomain-only mode works on 1.32+. ¹)
NVIDIA driver 580 or later on the GPU nodes, host-installed or managed by the GPU Operator. ¹
CDI enabled in the container runtime. It is the default in GPU Operator v25.10.0+; the DRA driver requires CDI and does not work without it. Verify the runtime exposes /var/run/cdi. ¹
GPU Operator v25.10.0 or later installed in DRA mode for GPU allocation: NVIDIA's device plugin disabled (devicePlugin.enabled=false) to avoid conflicts, CDI enabled, and driver 580+. The DRA driver consumes the Operator-managed driver root at /run/nvidia/driver; if the driver is host-installed, the root is /. ¹
DRA node label on every node that should run the GPU kubelet plugin, e.g. nvidia.com/dra-kubelet-plugin=true. Use the same label in the DRA chart kubeletPlugin.nodeSelector and in the GPU Operator driver.manager.env so driver-container upgrades evict the DRA kubelet plugin correctly. ¹
RBAC for the controller and kubelet plugin (the chart installs its own ServiceAccount, ClusterRole, and bindings; see RBAC for GPU Platform Operators).

Do not also advertise the same GPUs through the legacy device plugin. Run DRA or nvidia.com/gpu, not both, for a given device pool. Mixed advertisement double-counts.
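The version floors above can be checked before installing. The following is a minimal sketch, not part of the upstream chart: `version_ge` and `preflight` are hypothetical helper names, and the comparison assumes GNU `sort -V` is available. It covers only the Kubernetes and driver versions; CDI and the GPU Operator settings still need a manual check.

```shell
# version_ge HAVE WANT: succeed when version HAVE >= version WANT.
# A leading "v" (as in Kubernetes gitVersion strings) is stripped first.
# Assumption: GNU `sort -V` ordering is acceptable for these version strings.
version_ge() {
  local have="${1#v}" want="${2#v}"
  [ "$(printf '%s\n%s\n' "$want" "$have" | sort -V | head -n1)" = "$want" ]
}

# preflight K8S_VERSION DRIVER_VERSION: print one line per unmet floor
# (Kubernetes 1.34.2, driver 580) and return non-zero if any is unmet.
preflight() {
  local k8s="$1" driver="$2" rc=0
  version_ge "$k8s" 1.34.2 || { echo "kubernetes $k8s < 1.34.2"; rc=1; }
  version_ge "$driver" 580 || { echo "driver $driver < 580"; rc=1; }
  return "$rc"
}
```

One way to feed it is the server version from `kubectl version -o json | jq -r .serverVersion.gitVersion` and the driver version from `nvidia-smi --query-gpu=driver_version --format=csv,noheader` on a GPU node. Vendor-suffixed version strings may need trimming before comparison.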

Install

Add the NGC chart repo, label the DRA nodes, install or upgrade the GPU Operator in DRA mode, then install the DRA driver. The chart enables GPU allocation by default (resources.gpus.enabled), but gpuResourcesEnabledOverride=true is still required to fully enable GPU support, so set it explicitly. ¹

helm repo add nvidia https://helm.ngc.nvidia.com/nvidia && helm repo update

kubectl label node <gpu-node> nvidia.com/dra-kubelet-plugin=true --overwrite

# GPU allocation via DRA: disable the legacy NVIDIA device plugin and make the
# driver manager aware of the DRA kubelet-plugin node label for safe eviction.
# The env[0] paths are quoted so shells such as zsh do not treat [0] as a glob.
helm upgrade --install gpu-operator nvidia/gpu-operator \
  --version v26.3.2 \
  --namespace gpu-operator --create-namespace \
  --set devicePlugin.enabled=false \
  --set 'driver.manager.env[0].name=NODE_LABEL_FOR_GPU_POD_EVICTION' \
  --set 'driver.manager.env[0].value=nvidia.com/dra-kubelet-plugin'

# GPU allocation, GPU Operator-managed driver (driver root /run/nvidia/driver)
helm upgrade --install nvidia-dra-driver-gpu nvidia/nvidia-dra-driver-gpu \
  --version 25.12.0 \
  --namespace nvidia-dra-driver-gpu --create-namespace \
  --set nvidiaDriverRoot=/run/nvidia/driver \
  --set gpuResourcesEnabledOverride=true \
  --set resources.computeDomains.enabled=false \
  -f values.yaml

If the driver is host-installed (not via the Operator), set driver.enabled=false on the GPU Operator command above and drop nvidiaDriverRoot from the DRA command so it defaults to /. Keep the same kubeletPlugin.nodeSelector/tolerations values in a file that does not set nvidiaDriverRoot. ¹

# values-host-driver.yaml -- same DRA node selector, no nvidiaDriverRoot override
kubeletPlugin:
  nodeSelector:
    nvidia.com/dra-kubelet-plugin: "true"
  tolerations:
    - key: nvidia.com/gpu
      operator: Exists
      effect: NoSchedule

helm upgrade --install nvidia-dra-driver-gpu nvidia/nvidia-dra-driver-gpu \
  --version 25.12.0 \
  --namespace nvidia-dra-driver-gpu --create-namespace \
  --set gpuResourcesEnabledOverride=true \
  --set resources.computeDomains.enabled=false \
  -f values-host-driver.yaml
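The two install variants above differ only in the driver root and the values file. As a sketch, that choice can be wrapped in a small shell helper so one install command serves both cases. `dra_mode_args` and its `operator`/`host` mode names are hypothetical, not chart options.

```shell
# dra_mode_args MODE: print the helm arguments that differ between the two
# driver install modes described above.
#   operator -> GPU Operator-managed driver, root at /run/nvidia/driver
#   host     -> host-installed driver, nvidiaDriverRoot left at its default (/)
dra_mode_args() {
  case "$1" in
    operator) echo "--set nvidiaDriverRoot=/run/nvidia/driver -f values.yaml" ;;
    host)     echo "-f values-host-driver.yaml" ;;
    *)        echo "unknown mode: $1 (want operator or host)" >&2; return 2 ;;
  esac
}
```

Splice the result into the install command unquoted, for example `$(dra_mode_args host)` in place of the `--set nvidiaDriverRoot=...` and `-f` arguments, so the shell splits it into separate arguments.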

A minimal values.yaml that pins the kubelet plugin to the labelled GPU nodes and tolerates the GPU taint, while leaving the controller free to run on control-plane or CPU nodes. Confirm exact key paths with helm show values nvidia/nvidia-dra-driver-gpu --version 25.12.0 before applying, because Helm value layouts shift across chart releases. ¹

# values.yaml — reference template; verify keys against `helm show values` for 25.12.0
resources:
  gpus:
    enabled: true            # default; GPU DeviceClasses + kubelet plugin
  computeDomains:
    enabled: false           # multi-node NVLink (IMEX); leave off for single-node GPU claims

nvidiaDriverRoot: /run/nvidia/driver

# Alpha: per-device health via NVML. Disabled by default. Unhealthy GPUs drop out of the ResourceSlice.
featureGates:
  NVMLDeviceHealthCheck: false

controller:
  nodeSelector: {}                         # controller can run on control-plane/CPU nodes
  tolerations:
    - key: node-role.kubernetes.io/control-plane
      operator: Exists
      effect: NoSchedule

kubeletPlugin:
  nodeSelector:
    nvidia.com/dra-kubelet-plugin: "true"  # same label passed to GPU Operator driver.manager.env
  tolerations:
    - key: nvidia.com/gpu
      operator: Exists
      effect: NoSchedule

Upgrade in place. Helm does not upgrade CRDs shipped in a chart's crds/ directory, so across a major chart bump apply the CRDs first, then run helm upgrade. Do not uninstall and reinstall (that deletes the CRDs and orphans live claims): ¹

kubectl apply -f https://raw.githubusercontent.com/NVIDIA/k8s-dra-driver-gpu/refs/tags/v25.12.0/deployments/helm/nvidia-dra-driver-gpu/crds/resource.nvidia.com_computedomains.yaml
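# Repeat the install-time --set flag: passing -f makes Helm drop values from earlier --set calls.
helm upgrade --install nvidia-dra-driver-gpu nvidia/nvidia-dra-driver-gpu --version 25.12.0 -n nvidia-dra-driver-gpu \
  --set gpuResourcesEnabledOverride=true -f values.yaml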

Configuration

| Key / field | Type | Default | Purpose |
| --- | --- | --- | --- |
| `--version` (chart) | string | (none) | Pin the chart, e.g. `25.12.0`. ¹ |
| `gpuResourcesEnabledOverride` | bool | (none) | Required `true` to fully enable GPU allocation support. ¹ |
| `resources.gpus.enabled` | bool | `true` (enabled) | GPU DeviceClasses (`gpu.nvidia.com`, `mig.nvidia.com`) + GPU kubelet plugin. ¹ |
| `resources.computeDomains.enabled` | bool | enabled | Multi-node NVLink/IMEX ComputeDomain subsystem; set `false` for GPU-only. ¹ |
| `nvidiaDriverRoot` | string | `/` | Driver root. `/run/nvidia/driver` for Operator-managed; `/` for host-installed. ¹ |
| `featureGates.NVMLDeviceHealthCheck` | bool | `false` (alpha) | Per-device NVML health check; unhealthy GPUs leave the ResourceSlice. ¹ |
| `kubeletPlugin.nodeSelector` | map | (none) | Pin the DaemonSet to labelled DRA GPU nodes, e.g. `nvidia.com/dra-kubelet-plugin: "true"`. ¹ |
| `controller.tolerations` / `kubeletPlugin.tolerations` | list | (none) | Tolerate control-plane and `nvidia.com/gpu` taints. ¹ |

Objects the driver creates for claims to consume: the DeviceClasses gpu.nvidia.com and mig.nvidia.com. A claim references one via deviceClassName under spec.devices.requests[].exactly (apiVersion: resource.k8s.io/v1); the full claim is in DRA ResourceClaim. ²
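To make the deviceClassName reference concrete, here is a minimal sketch that prints a ResourceClaimTemplate and a pod consuming it. It is an illustration, not the full claim from DRA ResourceClaim: the helper name `render_gpu_claim`, the object names, the `ubuntu:22.04` image, and the toleration are assumptions. The request relies on the API default of one device per request.

```shell
# render_gpu_claim [NAME]: print a ResourceClaimTemplate plus a test pod.
# NAME, the pod name, and the image are illustrative assumptions.
render_gpu_claim() {
  local name="${1:-single-gpu}"
  cat <<EOF
apiVersion: resource.k8s.io/v1
kind: ResourceClaimTemplate
metadata:
  name: ${name}
spec:
  spec:
    devices:
      requests:
        - name: gpu
          exactly:
            deviceClassName: gpu.nvidia.com   # or mig.nvidia.com
---
apiVersion: v1
kind: Pod
metadata:
  name: ${name}-test
spec:
  restartPolicy: Never
  containers:
    - name: ctr
      image: ubuntu:22.04                     # assumption: nvidia-smi is injected via CDI
      command: ["nvidia-smi", "-L"]
      resources:
        claims:
          - name: gpu                         # must match resourceClaims[].name
  resourceClaims:
    - name: gpu
      resourceClaimTemplateName: ${name}
  tolerations:                                # only needed if GPU nodes carry this taint
    - key: nvidia.com/gpu
      operator: Exists
      effect: NoSchedule
EOF
}
```

Review the output first, then pipe it to the cluster with `render_gpu_claim demo | kubectl apply -f -`.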

Apply & verify

1. Controller and kubelet-plugin pods Running. One controller, one kubelet-plugin per GPU node: ¹

kubectl get pods -n nvidia-dra-driver-gpu

NAME                                                READY   STATUS    RESTARTS   AGE
nvidia-dra-driver-gpu-controller-67cb99d84b-5q7kj   1/1     Running   0          7m26s
nvidia-dra-driver-gpu-kubelet-plugin-h5xsn          1/1     Running   0          7m27s

2. DeviceClasses registered. ¹

kubectl get deviceclass

NAME                                        AGE
compute-domain-daemon.nvidia.com            55s
compute-domain-default-channel.nvidia.com   55s
gpu.nvidia.com                              55s
mig.nvidia.com                              55s

3. ResourceSlices published, the load-bearing signal. Expect one slice per GPU node from the gpu.nvidia.com driver. No slices means the kubelet plugin found no devices (driver/CDI/version problem); claims will stay Pending forever. ²

kubectl get resourceslices
# one entry per GPU node, NODE column = the GPU node, DRIVER = gpu.nvidia.com
kubectl get resourceslices -o jsonpath='{range .items[*]}{.spec.nodeName}{"\t"}{.spec.driver}{"\t"}{.spec.pool.name}{"\n"}{end}'

4. End-to-end claim. Apply a ResourceClaimTemplate + pod from DRA ResourceClaim, then confirm the claim is Allocated and the pod sees the GPU:

kubectl get resourceclaims -A          # STATE shows allocated,reserved once bound to the pod
kubectl logs <dra-test-pod>            # nvidia-smi prints the GPU table
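Steps 3 and 4 can be turned into scripted checks. The sketch below only parses text, so it makes no cluster calls itself. `missing_slices` and `pending_claims` are hypothetical helper names. The input formats are assumptions: tab-separated node, driver, and pool lines as produced by the jsonpath query in step 3, and `NAMESPACE NAME STATE AGE` columns from `kubectl get resourceclaims -A --no-headers`.

```shell
# missing_slices NODE...: read "node<TAB>driver<TAB>pool" lines on stdin and
# print every listed node that has no slice from the gpu.nvidia.com driver.
missing_slices() {
  local slices node
  slices="$(cat)"
  for node in "$@"; do
    printf '%s\n' "$slices" |
      awk -F'\t' -v n="$node" \
        '$1 == n && $2 == "gpu.nvidia.com" { found = 1 } END { exit !found }' ||
      echo "$node"
  done
}

# pending_claims: read resourceclaim rows (NAMESPACE NAME STATE AGE) on stdin
# and print namespace/name for each claim whose STATE lacks "allocated".
pending_claims() {
  awk 'NF >= 3 && $3 !~ /allocated/ { print $1 "/" $2 }'
}
```

For example, pipe the step 3 jsonpath output into `missing_slices` followed by the names of the labelled GPU nodes, and pipe `kubectl get resourceclaims -A --no-headers` into `pending_claims`. Empty output from both means every expected node has a slice and every claim is allocated.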

Failure modes

No ResourceSlices. Kubelet plugin running but kubectl get resourceslices empty: driver < 580, CDI not enabled in the runtime, or gpuResourcesEnabledOverride not set. Claims never schedule. Check the kubelet-plugin pod logs and that /var/run/cdi is populated. ¹²
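DRA on K8s < 1.34.2. Before 1.34, the resource.k8s.io/v1 API is absent, or DRA sits behind a disabled feature gate; the chart installs but claims are never satisfied. 1.34.0 and 1.34.1 serve the API but carry the known issue that NVIDIA's 1.34.2+ recommendation avoids. Mirrors the hub failure note: DRA on K8s <1.34 / driver <580 → claims never satisfied. ¹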
Unhealthy GPU silently removed. With NVMLDeviceHealthCheck on, an unhealthy GPU drops out of the ResourceSlice; after it recovers you must restart the DRA driver for the device to re-enter the pool. ¹
Wrong nvidiaDriverRoot. /run/nvidia/driver set while the driver is host-installed (or vice-versa): kubelet plugin can't reach the driver, publishes no/empty slices.
Double-counting with the device plugin. Same GPUs advertised as both DRA devices and nvidia.com/gpu: scheduling races and over-subscription. For DRA GPU allocation, install or upgrade the GPU Operator with devicePlugin.enabled=false for that pool.
CRDs deleted on reinstall. helm uninstall then reinstall drops the CRDs and orphans live claims; always helm upgrade in place. ¹

References

NVIDIA DRA Driver for GPUs — install & verify (GPU Operator docs): https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/dra-intro-install.html ¹
k8s-dra-driver-gpu — repository: https://github.com/NVIDIA/k8s-dra-driver-gpu
Validate setup for GPU allocation (wiki): https://github.com/NVIDIA/k8s-dra-driver-gpu/wiki/Validate-setup-for-GPU-allocation ²
Installation (wiki): https://github.com/NVIDIA/k8s-dra-driver-gpu/wiki/Installation
Kubernetes — Install Drivers and Allocate Devices with DRA: https://kubernetes.io/docs/tutorials/cluster-management/install-use-dra/

¹ NVIDIA GPU Operator docs, "NVIDIA DRA Driver for GPUs": helm 25.12.0 commands, gpuResourcesEnabledOverride/nvidiaDriverRoot/resources.*.enabled/featureGates.NVMLDeviceHealthCheck, K8s 1.34.2+ / driver 580+ / CDI default in Operator 25.10+, devicePlugin.enabled=false for GPU allocation, nvidia.com/dra-kubelet-plugin=true node label, driver.manager.env eviction label, DeviceClass and pod output. https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/dra-intro-install.html
² NVIDIA k8s-dra-driver-gpu wiki, "Validate setup for GPU allocation" and DeviceClasses gpu.nvidia.com / mig.nvidia.com via resource.k8s.io/v1. https://github.com/NVIDIA/k8s-dra-driver-gpu/wiki/Validate-setup-for-GPU-allocation