Manifest: GPU Operator ClusterPolicy
Scope: the ClusterPolicy CRD (apiVersion: nvidia.com/v1) that is the single source of truth for an NVIDIA GPU Operator install. It covers the driver, toolkit, devicePlugin, dcgmExporter, dcgm, migManager, nodeStatusExporter and validator component blocks, how most map closely to a Helm value, and how to read status.state to know the stack is healthy. Verify with kubectl get clusterpolicy (STATUS ready) and describe for the per-component conditions.
Reference template from the upstream NVIDIA GPU Operator chart (pinned v26.3.2) and its config/samples/v1_clusterpolicy.yaml / clusterpolicy_types.go. Not hardware-tested here. The Helm chart renders this CR as cluster-policy; do not hand-edit it in parallel with helm upgrade. Pin chart and image versions and apply via GitOps. Builds on Kubernetes & Helm: GPU Platform §1.
flowchart TB
HELM["helm install gpu-operator<br/>--set driver.enabled=..."] --> CP["ClusterPolicy cluster-policy<br/>nvidia.com/v1"]
CP --> DRV["driver DaemonSet"]
CP --> TK["toolkit DaemonSet"]
CP --> DP["devicePlugin DaemonSet"]
CP --> DCGM["dcgm + dcgmExporter"]
CP --> MIG["migManager"]
CP --> VAL["validator (gates readiness)"]
VAL --> ST["status.state = ready"]
What it is
ClusterPolicy is a cluster-scoped CRD in the nvidia.com/v1 group. One CR named cluster-policy declares the desired state of every component the GPU Operator manages; the operator reconciles each spec.<component> block into a DaemonSet (or Deployment) and reports aggregate health in status.state. You normally never write this YAML by hand. helm install nvidia/gpu-operator renders it from values.yaml, and most --set <component>.<field>=... flags in the hub map onto a field on this object (a few values, e.g. component version for validator/nodeStatusExporter, which track the chart appVersion, have no direct --set equivalent). Reading and patching the CR directly is the escape hatch for day-2 changes and the thing you describe when a component is wedged.
Two hard rules from upstream:
- spec: {} is invalid. An empty spec fails to deploy; the operator needs at least the component toggles. 1
- Driver management through ClusterPolicy and through the standalone NVIDIADriver CRD are mutually exclusive. Pick one driver-management model per cluster. Set driver.useNvidiaDriverCRD: true only if you are driving the driver through the separate NVIDIADriver CR. 2
The component blocks (driver, toolkit, devicePlugin, dcgmExporter, dcgm, migManager, nodeStatusExporter, validator) share a common shape: enabled, repository, image, version, imagePullPolicy, imagePullSecrets, args, env, resources, and (where applicable) config. They map onto the same Go struct fields. 3
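The first rule above lends itself to a cheap pre-apply guard. The following is a minimal sketch, not part of the upstream tooling: the function name has_empty_spec is hypothetical, and it is a plain text heuristic (it only spots a top-level `spec: {}` line), not a YAML parser.

```shell
# has_empty_spec FILE
# Exits 0 when FILE contains a top-level "spec: {}" line, which the operator
# rejects. Heuristic text match only; it does not parse YAML.
has_empty_spec() {
  grep -Eq '^spec:[[:space:]]*\{[[:space:]]*\}[[:space:]]*(#.*)?$' "$1"
}

# Hypothetical use in a CI step before `kubectl apply`:
#   if has_empty_spec clusterpolicy.yaml; then
#     echo "refusing to apply: empty ClusterPolicy spec" >&2
#     exit 1
#   fi
```

A manifest rendered by the chart never trips this check; it is only useful where someone trims the CR by hand.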
Prerequisites
- A Kubernetes cluster (operator v26.3.2 supports current upstream K8s; check the platform support matrix for your exact version). 6
- Node Feature Discovery: bundled and enabled by the chart unless you point it at an existing NFD (nfd.enabled).
- Decide the driver model first: container-installed driver (driver.enabled: true) or host/pre-installed driver (driver.enabled: false). Never run both; a container driver fighting a host driver leaves the node NotReady (hub §1, Ansible bring-up).
- Decide the MIG strategy (mig.strategy: none / single / mixed) before install; switching it later re-rolls the device plugin (MIG mode).
- kubectl access with rights to read/patch clusterpolicies.nvidia.com and to the gpu-operator namespace (RBAC for the operators).
- Helm 3 and the NVIDIA chart repo added.
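The tooling prerequisites can be checked before any install step runs. This is a small sketch under stated assumptions: missing_tools is a hypothetical helper name, and the RBAC probe in the usage comment relies on the standard kubectl auth can-i subcommand rather than anything specific to the GPU Operator.

```shell
# missing_tools TOOL...
# Prints each TOOL that is not on PATH, one per line.
# Prints nothing when every tool is present.
missing_tools() {
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
  done
}

# Hypothetical pre-flight usage:
#   missing=$(missing_tools kubectl helm)
#   [ -z "$missing" ] || { echo "missing tools: $missing" >&2; exit 1; }
#   kubectl auth can-i patch clusterpolicies.nvidia.com
```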
The manifest
The chart renders the CR; this is the equivalent explicit object, trimmed to the blocks in scope and pinned. Field names are taken verbatim from clusterpolicy_types.go and the upstream sample. 3, 4 Image coordinates are the v26.3.2-era chart defaults; pin them rather than relying on :latest. 5
# clusterpolicy.yaml (rendered by the chart as "cluster-policy")
apiVersion: nvidia.com/v1
kind: ClusterPolicy
metadata:
  name: cluster-policy
spec:
  operator:
    runtimeClass: nvidia          # RuntimeClass GPU pods select; the toolkit runtime is auto-detected (no defaultRuntime)
  # Global DaemonSet config (tolerations/priorityClass/rollout) applied to all components.
  daemonsets:
    rollingUpdate:
      maxUnavailable: "1"
    updateStrategy: RollingUpdate
  mig:
    strategy: single              # none | single | mixed -> how MIG devices are advertised
  # --- driver: installs the NVIDIA kernel driver in a container ---
  driver:
    enabled: false                # false when the driver is host-installed (set true for container driver)
    useNvidiaDriverCRD: false     # true => delegate to the standalone NVIDIADriver CRD instead
    repository: nvcr.io/nvidia
    image: driver
    version: "595.58.03"
    rdma:
      enabled: false              # GPUDirect RDMA kmods; pairs with the Network Operator
    upgradePolicy:
      autoUpgrade: true
      maxParallelUpgrades: 1
      drain:
        enable: false
  # --- toolkit: NVIDIA Container Toolkit (wires the container runtime to the GPU) ---
  toolkit:
    enabled: true
    repository: nvcr.io/nvidia/k8s
    image: container-toolkit
    version: v1.19.1
  # --- devicePlugin: advertises nvidia.com/gpu to the kubelet and allocates GPUs to pods ---
  devicePlugin:
    enabled: true
    repository: nvcr.io/nvidia
    image: k8s-device-plugin
    version: v0.19.3
    config:                       # points at a sharing ConfigMap (time-slicing, etc.)
      name: ""
      default: ""
  # --- gfd: GPU Feature Discovery (labels nodes with GPU product/memory/MIG) ---
  gfd:
    enabled: true
    repository: nvcr.io/nvidia
    image: k8s-device-plugin
    version: v0.19.3
  # --- dcgm + dcgmExporter: telemetry ---
  dcgm:
    enabled: true
    repository: nvcr.io/nvidia/cloud-native
    image: dcgm
    version: 4.5.2-1-ubuntu22.04
  dcgmExporter:
    enabled: true
    repository: nvcr.io/nvidia/k8s
    image: dcgm-exporter
    version: 4.5.3-4.8.2-distroless
    config:
      name: ""                    # ConfigMap of custom DCGM fields to export
    serviceMonitor:
      enabled: false              # set true to emit a Prometheus-Operator ServiceMonitor
  # --- migManager: applies mig-parted partition profiles ---
  migManager:
    enabled: true
    repository: nvcr.io/nvidia/cloud-native
    image: k8s-mig-manager
    version: v0.14.2
    config:
      name: ""                    # mig-parted ConfigMap (default-mig-parted-config when blank)
    gpuClientsConfig:
      name: ""
    env:
      - name: WITH_REBOOT
        value: "false"
  # --- nodeStatusExporter: exposes operator/node state metrics (off by default) ---
  nodeStatusExporter:
    enabled: false
    repository: nvcr.io/nvidia
    image: gpu-operator           # version tracks chart appVersion
  # --- validator: end-to-end gate; flips status.state to ready ---
  validator:
    repository: nvcr.io/nvidia
    image: gpu-operator           # version tracks chart appVersion
    plugin:
      env:
        - name: WITH_WORKLOAD
          value: "false"          # true => run an actual CUDA workload as part of validation
Install via the chart (preferred). The --set flags below write the corresponding fields above:
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia && helm repo update
helm upgrade --install gpu-operator nvidia/gpu-operator \
  -n gpu-operator --create-namespace \
  --version v26.3.2 \
  --set driver.enabled=false \
  --set mig.strategy=single \
  --set toolkit.enabled=true \
  --set devicePlugin.enabled=true \
  --set dcgmExporter.enabled=true \
  --set migManager.enabled=true \
  --set 'validator.plugin.env[0].name=WITH_WORKLOAD' \
  --set-string 'validator.plugin.env[0].value=false'   # --set-string keeps "false" a string; quoting stops the shell globbing on [0]
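For GitOps, the same settings are usually easier to review as a values file than as a row of flags. The sketch below writes one; it assumes the chart reads these keys from a values file exactly as it does from the command line (standard Helm behaviour), and both the helper name write_values and the file name gpu-operator-values.yaml are arbitrary choices for illustration.

```shell
# write_values FILE
# Writes a chart values file carrying the same settings as the flags above.
write_values() {
  cat > "$1" <<'EOF'
driver:
  enabled: false
mig:
  strategy: single
toolkit:
  enabled: true
devicePlugin:
  enabled: true
dcgmExporter:
  enabled: true
migManager:
  enabled: true
validator:
  plugin:
    env:
      - name: WITH_WORKLOAD
        value: "false"
EOF
}

# Usage (file name is an arbitrary choice):
#   write_values gpu-operator-values.yaml
#   helm upgrade --install gpu-operator nvidia/gpu-operator \
#     -n gpu-operator --create-namespace --version v26.3.2 \
#     -f gpu-operator-values.yaml
```

In a values file the env value stays a quoted YAML string, so there is no shell quoting or type coercion to reason about.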
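Day-2 change to a running cluster: patch the rendered CR directly. The result matches what helm upgrade --reuse-values --set dcgmExporter.serviceMonitor.enabled=true would render, but Helm does not record it, so the next helm upgrade overwrites the patch unless the same value is also set through the chart:
kubectl patch clusterpolicies.nvidia.com/cluster-policy --type merge \
  -p '{"spec":{"dcgmExporter":{"serviceMonitor":{"enabled":true}}}}'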
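Hand-writing nested JSON for a merge patch is easy to get wrong. As a rough sketch, a tiny helper can build the common case of flipping a component's enabled toggle; enabled_patch is a hypothetical name, and it only covers spec.<component>.enabled, not nested fields such as serviceMonitor.

```shell
# enabled_patch COMPONENT BOOL
# Prints a JSON merge patch that sets spec.<COMPONENT>.enabled to BOOL.
enabled_patch() {
  case "$2" in
    true|false) ;;
    *) echo "enabled_patch: BOOL must be true or false" >&2; return 1 ;;
  esac
  printf '{"spec":{"%s":{"enabled":%s}}}\n' "$1" "$2"
}

# Usage:
#   kubectl patch clusterpolicies.nvidia.com/cluster-policy --type merge \
#     -p "$(enabled_patch nodeStatusExporter true)"
```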
Configuration
Every block uses the same field set (enabled, repository, image, version, imagePullPolicy, imagePullSecrets, args, env, resources, config); the table calls out the load-bearing ones and their Helm equivalents. 3
| Field (path under spec) | Helm value | Meaning |
|---|---|---|
| operator.runtimeClass | operator.runtimeClass | Name of the RuntimeClass GPU pods select (default nvidia). The toolkit's container runtime is auto-detected; the old operator.defaultRuntime is deprecated and ignored on v26.3.x. 3 |
| mig.strategy | mig.strategy | How MIG devices are surfaced: none, single (advertise as nvidia.com/gpu), or mixed (per-profile nvidia.com/mig-*). See MIG mode. |
| driver.enabled | driver.enabled | true runs the container driver DaemonSet; false assumes a host-installed driver. Never both. |
| driver.useNvidiaDriverCRD | driver.useNvidiaDriverCRD | true hands driver lifecycle to the standalone NVIDIADriver CRD (mutually exclusive with managing it here). 2 |
| driver.{repository,image,version} | driver.* | Driver container image coordinates; version is the driver branch (e.g. 595.58.03). |
| driver.rdma.enabled | driver.rdma.enabled | Builds GPUDirect RDMA kernel modules; pair with the Network Operator. |
| toolkit.enabled | toolkit.enabled | Installs the NVIDIA Container Toolkit; required unless the host already has it (container toolkit). |
| devicePlugin.config.name | devicePlugin.config.name | ConfigMap holding a sharing config (e.g. time-slicing). |
| devicePlugin.config.default | devicePlugin.config.default | Which ConfigMap key applies cluster-wide absent a per-node override. |
| dcgmExporter.enabled | dcgmExporter.enabled | Deploys DCGM-exporter for GPU metrics (telemetry). |
| dcgmExporter.config.name | dcgmExporter.config.name | ConfigMap of custom DCGM field IDs to export. |
| dcgmExporter.serviceMonitor.enabled | dcgmExporter.serviceMonitor.enabled | Emits a Prometheus-Operator ServiceMonitor. 3 |
| migManager.enabled | migManager.enabled | Runs k8s-mig-manager to apply partition profiles. |
| migManager.config.name | migManager.config.name | mig-parted ConfigMap of named profiles (blank => default-mig-parted-config). |
| nodeStatusExporter.enabled | nodeStatusExporter.enabled | Operator/node state metrics endpoint. Off by default. |
| validator.plugin.env (WITH_WORKLOAD) | validator.plugin.env | "true" runs a real CUDA validation workload; "false" does the lightweight check. The validator flipping green is what advances status.state. 4 |
Apply & verify
Top-line health: the STATUS column must read ready:
kubectl get clusterpolicy cluster-policy \
  -o jsonpath='{.status.state}{"\n"}'   # ready
status.state is a small enum: ignored, notReady, ready, disabled. It starts notReady while DaemonSets roll and the validator runs, and may briefly flap back to notReady during an upgrade before settling on ready. 7, 8 Drill into the cause when it is not ready:
kubectl get clusterpolicy cluster-policy -o json \
  | jq '.status.state, .status.conditions'   # which component is blocking
kubectl describe clusterpolicy cluster-policy   # events + per-component status
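Because the state can flap during a rollout, a single ready reading is weak evidence. One way to script the wait, offered as a sketch rather than an upstream recipe, is to require several consecutive ready readings. The names wait_settled and get_state are hypothetical; get_state wraps the jsonpath query shown above, and the poll counts in the usage line are arbitrary.

```shell
# wait_settled NEED INTERVAL MAX_POLLS CMD...
# Runs CMD up to MAX_POLLS times, INTERVAL seconds apart, and succeeds once
# CMD has printed "ready" NEED times in a row. Any other output (or a CMD
# failure) resets the streak. Returns 1 if the polls run out first.
wait_settled() {
  need=$1; interval=$2; max_polls=$3; shift 3
  streak=0; polls=0
  while [ "$polls" -lt "$max_polls" ]; do
    state=$("$@" 2>/dev/null) || state=""
    if [ "$state" = "ready" ]; then
      streak=$((streak + 1))
      if [ "$streak" -ge "$need" ]; then return 0; fi
    else
      streak=0
    fi
    polls=$((polls + 1))
    sleep "$interval"
  done
  return 1
}

# Reads the aggregate state from the CR (same query as above).
get_state() {
  kubectl get clusterpolicy cluster-policy -o jsonpath='{.status.state}'
}

# Usage: three consecutive "ready" readings, 10 s apart, at most 60 polls:
#   wait_settled 3 10 60 get_state
```

Passing the state command as an argument keeps the loop testable with a stub and reusable for other resources.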
Confirm the components the spec turned on are actually Running (validator pods reach Completed/Running):
kubectl get pods -n gpu-operator
# nvidia-container-toolkit-daemonset-... Running
# nvidia-device-plugin-daemonset-... Running
# nvidia-dcgm-exporter-... Running
# nvidia-operator-validator-... Running
# gpu-feature-discovery-... Running
Prove the device plugin advertised GPUs (the node resource this whole CR exists to produce):
kubectl get nodes -l nvidia.com/gpu.present=true \
  -o custom-columns='NODE:.metadata.name,GPU_ALLOC:.status.allocatable.nvidia\.com/gpu'
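On a large cluster it helps to surface only the nodes that failed to advertise anything. The filter below is a minimal sketch: nodes_without_gpus is a hypothetical helper that reads the two-column output of the query above and assumes kubectl prints <none> for a missing allocatable value.

```shell
# nodes_without_gpus
# Reads "NODE GPU_ALLOC" rows (header line included) on stdin and prints the
# nodes whose allocatable nvidia.com/gpu is "<none>", "0", or absent.
nodes_without_gpus() {
  awk 'NR > 1 && NF > 0 && ($2 == "<none>" || $2 == "0" || NF < 2) { print $1 }'
}

# Usage:
#   kubectl get nodes -l nvidia.com/gpu.present=true \
#     -o custom-columns='NODE:.metadata.name,GPU_ALLOC:.status.allocatable.nvidia\.com/gpu' \
#     | nodes_without_gpus
```

Empty output means every labelled node advertises at least one GPU.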
Then run the CUDA smoke pod from the hub; its nvidia-smi table is the end-to-end signal.
Failure modes
- status.state stuck notReady. A component DaemonSet is not all-Running or the validator failed. kubectl describe clusterpolicy cluster-policy and kubectl get pods -n gpu-operator name the culprit; check that pod's logs (nvidia-operator-validator, nvidia-driver-daemonset, etc.).
- spec: {} / over-trimmed CR rejected. An empty or near-empty spec fails to deploy; the component toggles are required. Render from the chart rather than minimizing by hand. 1
- Both ClusterPolicy and NVIDIADriver present. Driver management collides. Use one model: either driver.enabled here, or driver.useNvidiaDriverCRD: true plus a separate NVIDIADriver CR, never both. 2
- Container driver vs host driver. driver.enabled: true on a node that already has a host driver wedges it NotReady. Set driver.enabled: false when the driver is host-installed (Ansible bring-up).
- Hand-edited the CR while Helm owns it. The next helm upgrade reconciles your patch away. Make persistent changes through chart values (or helm upgrade --reuse-values --set ...), reserving kubectl patch for transient debugging.
- mig.strategy mismatch. single advertises MIG slices as nvidia.com/gpu; mixed advertises nvidia.com/mig-<profile>. Pods requesting the wrong resource name stay Pending. Align the strategy with how workloads request GPUs (MIG mode).
- dcgmExporter.serviceMonitor.enabled: true with no Prometheus Operator. The ServiceMonitor CRD is absent, so the object fails to apply; install the Prometheus Operator first (telemetry).
- State flaps ready -> notReady -> ready during upgrades. Expected on multi-GPU nodes as DaemonSets roll; wait for it to settle before alerting. 8
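The mig.strategy mismatch above comes down to which resource name a pod must request. A small sketch of that mapping follows; mig_resource_name is a hypothetical helper, and 1g.5gb in the usage comment is only an example MIG profile, not one this page prescribes.

```shell
# mig_resource_name STRATEGY [PROFILE]
# Prints the extended resource name a pod should request under STRATEGY.
#   none | single -> nvidia.com/gpu
#   mixed         -> nvidia.com/mig-<PROFILE>
mig_resource_name() {
  case "$1" in
    none|single)
      printf 'nvidia.com/gpu\n' ;;
    mixed)
      [ -n "${2:-}" ] || { echo "mixed strategy needs a MIG profile" >&2; return 1; }
      printf 'nvidia.com/mig-%s\n' "$2" ;;
    *)
      echo "unknown mig.strategy: $1" >&2; return 1 ;;
  esac
}

# Usage:
#   mig_resource_name single          # nvidia.com/gpu
#   mig_resource_name mixed 1g.5gb    # nvidia.com/mig-1g.5gb
```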
References
- GPU Operator getting started / install (chart v26.3.2, Helm repo): https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/getting-started.html
- ClusterPolicy sample (config/samples/v1_clusterpolicy.yaml): https://github.com/NVIDIA/gpu-operator/blob/main/config/samples/v1_clusterpolicy.yaml
- ClusterPolicy API types (api/nvidia/v1/clusterpolicy_types.go): https://github.com/NVIDIA/gpu-operator/blob/main/api/nvidia/v1/clusterpolicy_types.go
- Chart default values (image/version coordinates): https://github.com/NVIDIA/gpu-operator/blob/main/deployments/gpu-operator/values.yaml
- NVIDIA Driver Custom Resource (vs ClusterPolicy): https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/gpu-driver-configuration.html
- Troubleshooting (pod/DaemonSet checks): https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/troubleshooting.html
- State flap during upgrade (issue #1567): https://github.com/NVIDIA/gpu-operator/issues/1567
Related: GPU Platform hub · Kubernetes for GPUs · Time-slicing · MIG mode · Container toolkit · Telemetry · Glossary
1. A ClusterPolicy with an empty spec {} fails to deploy (NVIDIA GPU Operator docs, driver configuration). https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/gpu-driver-configuration.html
2. ClusterPolicy and the NVIDIADriver custom resource cannot be used at the same time (NVIDIA GPU Operator docs, driver configuration). https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/gpu-driver-configuration.html
3. Component spec fields (enabled, repository, image, version, imagePullPolicy, imagePullSecrets, args, env, resources, config, serviceMonitor, plugin) from clusterpolicy_types.go. https://github.com/NVIDIA/gpu-operator/blob/main/api/nvidia/v1/clusterpolicy_types.go
4. validator.plugin.env with WITH_WORKLOAD, mig.strategy: single, operator.runtimeClass: nvidia, and the component enabled toggles from the upstream sample. https://github.com/NVIDIA/gpu-operator/blob/main/config/samples/v1_clusterpolicy.yaml
5. Default repository/image/version per component from the chart values.yaml. https://github.com/NVIDIA/gpu-operator/blob/main/deployments/gpu-operator/values.yaml
6. GPU Operator current release v26.3.2, Helm repo https://helm.ngc.nvidia.com/nvidia. https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/getting-started.html
7. status.state enum values (ignored, ready, notReady, disabled) are defined upstream in clusterpolicy_types.go (the State constants). https://github.com/NVIDIA/gpu-operator/blob/main/api/nvidia/v1/clusterpolicy_types.go; query example (kubectl get clusterpolicy ... -o json | jq '.status.state, .status.conditions') per https://kubernetes.recipes/recipes/configuration/gpu-operator-clusterpolicy-reference/
8. ClusterPolicy state fluctuates ready -> notReady -> ready during upgrade on multi-GPU nodes (issue #1567). https://github.com/NVIDIA/gpu-operator/issues/1567