Manifest: GPU operator ClusterPolicy

Scope: the ClusterPolicy CRD (apiVersion: nvidia.com/v1), the single source of truth for an NVIDIA GPU Operator install. This page covers the driver, toolkit, devicePlugin, dcgmExporter, dcgm, migManager, nodeStatusExporter and validator component blocks, how most of their fields map to a Helm value, and how to read status.state to know the stack is healthy. Verify with kubectl get clusterpolicy (STATUS ready), and use kubectl describe clusterpolicy for the per-component conditions.

Reference template from the upstream NVIDIA GPU Operator chart (pinned v26.3.2) and its config/samples/v1_clusterpolicy.yaml / clusterpolicy_types.go. Not hardware-tested here. The Helm chart renders this CR as cluster-policy; do not hand-edit it in parallel with helm upgrade. Pin chart and image versions and apply via GitOps. Builds on Kubernetes & Helm: GPU Platform §1.

flowchart TB
  HELM["helm install gpu-operator<br/>--set driver.enabled=..."] --> CP["ClusterPolicy cluster-policy<br/>nvidia.com/v1"]
  CP --> DRV["driver DaemonSet"]
  CP --> TK["toolkit DaemonSet"]
  CP --> DP["devicePlugin DaemonSet"]
  CP --> DCGM["dcgm + dcgmExporter"]
  CP --> MIG["migManager"]
  CP --> VAL["validator (gates readiness)"]
  VAL --> ST["status.state = ready"]

What it is

ClusterPolicy is a cluster-scoped CRD in the nvidia.com/v1 group. One CR named cluster-policy declares the desired state of every component the GPU Operator manages; the operator reconciles each spec.<component> block into a DaemonSet (or Deployment) and reports aggregate health in status.state. You rarely write this YAML by hand: helm install nvidia/gpu-operator renders it from values.yaml, and most --set <component>.<field>=... flags in the hub map onto a field on this object. A few values have no direct --set equivalent, for example the validator and nodeStatusExporter image versions, which track the chart appVersion. Reading and patching the CR directly is the escape hatch for day-2 changes, and it is the object you inspect with kubectl describe when a component is wedged.
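
One way to see the value-to-field mapping is to render the chart offline and print only the ClusterPolicy template. This is a hedged sketch rather than upstream tooling: the helper name render_clusterpolicy is hypothetical, and the template path templates/clusterpolicy.yaml is an assumption about the chart layout that should be confirmed against the pinned chart version.

```shell
# Sketch: render only the ClusterPolicy CR from the chart, without touching the cluster.
# Assumptions: the "nvidia" chart repo is already added, and the chart keeps the CR
# in templates/clusterpolicy.yaml (check this for your pinned chart version).
render_clusterpolicy() {
  helm template gpu-operator nvidia/gpu-operator \
    --namespace gpu-operator \
    --version v26.3.2 \
    --show-only templates/clusterpolicy.yaml \
    "$@"   # extra --set / -f arguments are passed straight through to Helm
}

# Example: see where a value lands in the rendered spec.
# render_clusterpolicy --set driver.enabled=false | grep -A3 '^  driver:'
```

Rendering before and after a --set change and diffing the two outputs shows exactly which spec field a Helm value controls.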

Two hard rules from upstream:

spec: {} is invalid. An empty spec fails to deploy; the operator needs at least the component toggles. ¹
ClusterPolicy and the standalone NVIDIADriver CRD are mutually exclusive. Pick one driver-management model per cluster. Set driver.useNvidiaDriverCRD: true only if you are driving the driver through the separate NVIDIADriver CR. ²
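
The second rule can be checked on a live cluster. The following is a minimal sketch, assuming the NVIDIADriver objects are served as nvidiadrivers.nvidia.com; the helper name check_driver_model is hypothetical. It only compares the useNvidiaDriverCRD flag with the number of NVIDIADriver objects and does not inspect the driver pods themselves.

```shell
# Sketch: flag a mismatch between driver.useNvidiaDriverCRD and existing NVIDIADriver objects.
check_driver_model() {
  # Empty output means the field is unset, which is treated as "not true".
  use_crd=$(kubectl get clusterpolicy cluster-policy \
    -o jsonpath='{.spec.driver.useNvidiaDriverCRD}' 2>/dev/null || true)
  # Count NVIDIADriver objects; an absent CRD simply counts as zero.
  crs=$(kubectl get nvidiadrivers.nvidia.com -o name 2>/dev/null | wc -l | tr -d ' ')

  if [ "$use_crd" = "true" ] && [ "$crs" -eq 0 ]; then
    echo "mismatch: useNvidiaDriverCRD=true but no NVIDIADriver CR exists"
    return 1
  elif [ "$use_crd" != "true" ] && [ "$crs" -gt 0 ]; then
    echo "mismatch: $crs NVIDIADriver CR(s) exist but useNvidiaDriverCRD is not true"
    return 1
  fi
  echo "ok: driver model is consistent"
}
```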

The component blocks (driver, toolkit, devicePlugin, dcgmExporter, dcgm, migManager, nodeStatusExporter, validator) share a common shape: enabled, repository, image, version, imagePullPolicy, imagePullSecrets, args, env, resources, and (where applicable) config. Each block is backed by its own spec struct in clusterpolicy_types.go, and those structs repeat the same field names, so a setting learned on one component usually carries over to the others. ³

Prerequisites

A Kubernetes cluster (operator v26.3.2 supports current upstream K8s; check the platform support matrix for your exact version). ⁶
Node Feature Discovery: bundled and enabled by the chart unless you point it at an existing NFD (nfd.enabled).
Decide the driver model first: container-installed driver (driver.enabled: true) or host/pre-installed driver (driver.enabled: false). Never run both; a container driver fighting a host driver leaves the node NotReady (hub §1, Ansible bring-up).
Decide the MIG strategy (mig.strategy: none / single / mixed) before install; switching it later re-rolls the device plugin (MIG mode).
kubectl access with rights to read/patch clusterpolicies.nvidia.com and to the gpu-operator namespace (RBAC for the operators).
Helm 3 and the NVIDIA chart repo added.
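
The access and tooling items in this list can be checked before installing. This is a rough preflight sketch with a hypothetical helper name (preflight); it covers only the RBAC verbs and the chart repo named above, not the driver or MIG decisions, which remain judgment calls.

```shell
# Sketch: check kubectl rights on ClusterPolicy and that the NVIDIA chart repo is configured.
preflight() {
  rc=0
  for verb in get patch; do
    # "kubectl auth can-i" prints yes/no and exits non-zero on "no".
    answer=$(kubectl auth can-i "$verb" clusterpolicies.nvidia.com 2>/dev/null || true)
    if [ "$answer" = "yes" ]; then
      echo "ok: can $verb clusterpolicies.nvidia.com"
    else
      echo "missing: cannot $verb clusterpolicies.nvidia.com"
      rc=1
    fi
  done
  if helm repo list 2>/dev/null | grep -q 'helm.ngc.nvidia.com/nvidia'; then
    echo "ok: NVIDIA chart repo is configured"
  else
    echo "missing: NVIDIA chart repo (helm repo add nvidia https://helm.ngc.nvidia.com/nvidia)"
    rc=1
  fi
  return "$rc"
}
```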

The manifest

The chart renders the CR; this is the equivalent explicit object, trimmed to the blocks in scope and pinned. Field names are taken verbatim from clusterpolicy_types.go and the upstream sample. ³⁴ Image coordinates are the v26.3.2-era chart defaults; pin them rather than relying on :latest. ⁵

# clusterpolicy.yaml  (rendered by the chart as "cluster-policy")
apiVersion: nvidia.com/v1
kind: ClusterPolicy
metadata:
  name: cluster-policy
spec:
  operator:
    runtimeClass: nvidia            # RuntimeClass GPU pods select; the toolkit runtime is auto-detected (no defaultRuntime)

  # Global DaemonSet config (tolerations/priorityClass/rollout) applied to all components.
  daemonsets:
    rollingUpdate:
      maxUnavailable: "1"
    updateStrategy: RollingUpdate

  mig:
    strategy: single                # none | single | mixed  -> how MIG devices are advertised

  # --- driver: installs the NVIDIA kernel driver in a container ---
  driver:
    enabled: false                  # false when the driver is host-installed (set true for container driver)
    useNvidiaDriverCRD: false       # true => delegate to the standalone NVIDIADriver CRD instead
    repository: nvcr.io/nvidia
    image: driver
    version: "595.58.03"
    rdma:
      enabled: false                # GPUDirect RDMA kmods; pairs with the Network Operator
    upgradePolicy:
      autoUpgrade: true
      maxParallelUpgrades: 1
      drain:
        enable: false

  # --- toolkit: NVIDIA Container Toolkit (wires the container runtime to the GPU) ---
  toolkit:
    enabled: true
    repository: nvcr.io/nvidia/k8s
    image: container-toolkit
    version: v1.19.1

  # --- devicePlugin: advertises nvidia.com/gpu to the kubelet and allocates GPUs to pods ---
  devicePlugin:
    enabled: true
    repository: nvcr.io/nvidia
    image: k8s-device-plugin
    version: v0.19.3
    config:                         # points at a sharing ConfigMap (time-slicing, etc.)
      name: ""
      default: ""

  # --- gfd: GPU Feature Discovery (labels nodes with GPU product/memory/MIG) ---
  gfd:
    enabled: true
    repository: nvcr.io/nvidia
    image: k8s-device-plugin
    version: v0.19.3

  # --- dcgm + dcgmExporter: telemetry ---
  dcgm:
    enabled: true
    repository: nvcr.io/nvidia/cloud-native
    image: dcgm
    version: 4.5.2-1-ubuntu22.04
  dcgmExporter:
    enabled: true
    repository: nvcr.io/nvidia/k8s
    image: dcgm-exporter
    version: 4.5.3-4.8.2-distroless
    config:
      name: ""                      # ConfigMap of custom DCGM fields to export
    serviceMonitor:
      enabled: false                # set true to emit a Prometheus-Operator ServiceMonitor

  # --- migManager: applies mig-parted partition profiles ---
  migManager:
    enabled: true
    repository: nvcr.io/nvidia/cloud-native
    image: k8s-mig-manager
    version: v0.14.2
    config:
      name: ""                      # mig-parted ConfigMap (default-mig-parted-config when blank)
    gpuClientsConfig:
      name: ""
    env:
      - name: WITH_REBOOT
        value: "false"

  # --- nodeStatusExporter: exposes operator/node state metrics (off by default) ---
  nodeStatusExporter:
    enabled: false
    repository: nvcr.io/nvidia
    image: gpu-operator             # version tracks chart appVersion

  # --- validator: end-to-end gate; flips status.state to ready ---
  validator:
    repository: nvcr.io/nvidia
    image: gpu-operator             # version tracks chart appVersion
    plugin:
      env:
        - name: WITH_WORKLOAD
          value: "false"            # true => run an actual CUDA workload as part of validation

Install via the chart (preferred). The --set flags below write the corresponding fields above:

helm repo add nvidia https://helm.ngc.nvidia.com/nvidia && helm repo update
helm upgrade --install gpu-operator nvidia/gpu-operator \
  -n gpu-operator --create-namespace \
  --version v26.3.2 \
  --set driver.enabled=false \
  --set mig.strategy=single \
  --set toolkit.enabled=true \
  --set devicePlugin.enabled=true \
  --set dcgmExporter.enabled=true \
  --set migManager.enabled=true \
  --set 'validator.plugin.env[0].name=WITH_WORKLOAD' \
  --set-string 'validator.plugin.env[0].value=false'
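
For a GitOps flow, the same settings can live in a values file instead of a row of flags. The sketch below writes one; the helper name write_gpu_values and the filename values-gpu.yaml are hypothetical, and the keys are only the ones already set by the flags above. Keeping the env value quoted in YAML preserves it as a string, which is what an env entry expects.

```shell
# Sketch: write the Helm flags above as a values file (filename is an arbitrary choice).
write_gpu_values() {
  cat > "${1:-values-gpu.yaml}" <<'EOF'
driver:
  enabled: false
mig:
  strategy: single
toolkit:
  enabled: true
devicePlugin:
  enabled: true
dcgmExporter:
  enabled: true
migManager:
  enabled: true
validator:
  plugin:
    env:
      - name: WITH_WORKLOAD
        value: "false"
EOF
}

# Example:
# write_gpu_values values-gpu.yaml
# helm upgrade --install gpu-operator nvidia/gpu-operator \
#   -n gpu-operator --create-namespace --version v26.3.2 -f values-gpu.yaml
```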

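Day-2 change to a running cluster: patch the rendered CR. The patch below sets the same field that helm upgrade --reuse-values --set dcgmExporter.serviceMonitor.enabled=true would render; because Helm owns the CR, use the Helm form for anything that must survive the next upgrade: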

kubectl patch clusterpolicies.nvidia.com/cluster-policy --type merge \
  -p '{"spec":{"dcgmExporter":{"serviceMonitor":{"enabled":true}}}}'

Configuration

Every block uses the same field set (enabled, repository, image, version, imagePullPolicy, imagePullSecrets, args, env, resources, config); the table calls out the load-bearing ones and their Helm equivalents. ³

Field (path under `spec`)	Helm value	Meaning
`operator.runtimeClass`	`operator.runtimeClass`	Name of the `RuntimeClass` GPU pods select (default `nvidia`). The toolkit's container runtime is auto-detected; the old `operator.defaultRuntime` is deprecated and ignored on `v26.3.x`. ³
`mig.strategy`	`mig.strategy`	How MIG devices are surfaced: `none`, `single` (advertise as `nvidia.com/gpu`), or `mixed` (per-profile `nvidia.com/mig-*`). See MIG mode.
`driver.enabled`	`driver.enabled`	`true` runs the container driver DaemonSet; `false` assumes a host-installed driver. Never both.
`driver.useNvidiaDriverCRD`	`driver.useNvidiaDriverCRD`	`true` hands driver lifecycle to the standalone `NVIDIADriver` CRD (mutually exclusive with managing it here). ²
`driver.{repository,image,version}`	`driver.*`	Driver container image coordinates; `version` is the full driver version (e.g. `595.58.03`), not just the branch.
`driver.rdma.enabled`	`driver.rdma.enabled`	Builds GPUDirect RDMA kernel modules; pair with the Network Operator.
`toolkit.enabled`	`toolkit.enabled`	Installs the NVIDIA Container Toolkit; required unless the host already has it (container toolkit).
`devicePlugin.config.name`	`devicePlugin.config.name`	ConfigMap holding a sharing config (e.g. time-slicing).
`devicePlugin.config.default`	`devicePlugin.config.default`	Which ConfigMap key applies cluster-wide absent a per-node override.
`dcgmExporter.enabled`	`dcgmExporter.enabled`	Deploys DCGM-exporter for GPU metrics (telemetry).
`dcgmExporter.config.name`	`dcgmExporter.config.name`	ConfigMap of custom DCGM field IDs to export.
`dcgmExporter.serviceMonitor.enabled`	`dcgmExporter.serviceMonitor.enabled`	Emits a Prometheus-Operator `ServiceMonitor`. ³
`migManager.enabled`	`migManager.enabled`	Runs k8s-mig-manager to apply partition profiles.
`migManager.config.name`	`migManager.config.name`	mig-parted ConfigMap of named profiles (blank => `default-mig-parted-config`).
`nodeStatusExporter.enabled`	`nodeStatusExporter.enabled`	Operator/node state metrics endpoint. Off by default.
`validator.plugin.env` (`WITH_WORKLOAD`)	`validator.plugin.env`	`"true"` runs a real CUDA validation workload; `"false"` does the lightweight check. The validator flipping green is what advances `status.state`. ⁴

Apply & verify

kubectl apply -f clusterpolicy.yaml     # only if managing the CR directly; otherwise Helm owns it

Top-line health: the STATUS column must read ready:

kubectl get clusterpolicy
# NAME             STATUS   AGE
# cluster-policy   ready    7m

status.state is a small enum: ignored, notReady, ready, disabled. It starts notReady while DaemonSets roll and the validator runs, and may briefly flap back to notReady during an upgrade before settling on ready. ⁷⁸ Drill into the cause when it is not ready:

kubectl get clusterpolicy cluster-policy \
  -o jsonpath='{.status.state}{"\n"}'                       # ready
kubectl get clusterpolicy cluster-policy -o json \
  | jq '.status.state, .status.conditions'                  # which component is blocking
kubectl describe clusterpolicy cluster-policy               # events + per-component status
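
Because the state can flap during an upgrade, a single ready reading is weak evidence. The polling sketch below treats the policy as settled only after several consecutive ready reads. The helper name wait_ready and its default timings are hypothetical choices, not operator behavior.

```shell
# Sketch: wait until status.state has read "ready" on several consecutive polls.
# Usage: wait_ready [timeout_seconds] [interval_seconds] [consecutive_reads]
wait_ready() {
  timeout=${1:-600}
  interval=${2:-10}
  need=${3:-3}
  elapsed=0
  streak=0
  while [ "$elapsed" -le "$timeout" ]; do
    state=$(kubectl get clusterpolicy cluster-policy \
      -o jsonpath='{.status.state}' 2>/dev/null || true)
    # Any non-ready reading (notReady, empty, error) resets the streak.
    if [ "$state" = "ready" ]; then streak=$((streak + 1)); else streak=0; fi
    echo "t=${elapsed}s state=${state:-unknown} streak=${streak}/${need}"
    if [ "$streak" -ge "$need" ]; then return 0; fi
    sleep "$interval"
    elapsed=$((elapsed + interval))
  done
  return 1
}
```

On recent kubectl versions, kubectl wait --for=jsonpath='{.status.state}'=ready clusterpolicy/cluster-policy --timeout=10m is a shorter alternative, but it returns on the first ready reading and so does not filter out a flap.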

Confirm the components the spec turned on are actually Running (validator pods reach Completed/Running):

kubectl get pods -n gpu-operator
# nvidia-container-toolkit-daemonset-...   Running
# nvidia-device-plugin-daemonset-...       Running
# nvidia-dcgm-exporter-...                 Running
# nvidia-operator-validator-...            Running
# gpu-feature-discovery-...                Running
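
On a busy namespace it helps to print only the pods that are not healthy. A minimal filter, assuming the default kubectl get pods column order (NAME, READY, STATUS, ...) and a hypothetical helper name unhealthy_pods:

```shell
# Sketch: list gpu-operator pods whose STATUS is neither Running nor Completed.
# Prints "<pod> <status>" per offender and returns non-zero if any are found.
unhealthy_pods() {
  kubectl get pods -n gpu-operator --no-headers 2>/dev/null \
    | awk '$3 != "Running" && $3 != "Completed" { print $1, $3; bad = 1 }
           END { exit (bad ? 1 : 0) }'
}
```

A pod reported here is the one whose logs and events to read first.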

Prove the device plugin advertised GPUs (the field this whole CR exists to produce):

kubectl get nodes -l nvidia.com/gpu.present=true \
  -o custom-columns='NODE:.metadata.name,GPU_ALLOC:.status.allocatable.nvidia\.com/gpu'
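
The same query can be turned into a pass/fail check. This sketch reuses the command above and flags GPU-labelled nodes whose allocatable nvidia.com/gpu is missing or zero; the helper name gpu_nodes_missing_capacity is hypothetical. With mig.strategy: mixed, MIG slices are advertised under nvidia.com/mig-<profile> instead, so a node can legitimately appear here.

```shell
# Sketch: print GPU-labelled nodes that advertise no allocatable nvidia.com/gpu.
# Returns non-zero if any such node exists.
gpu_nodes_missing_capacity() {
  kubectl get nodes -l nvidia.com/gpu.present=true --no-headers \
    -o custom-columns='NODE:.metadata.name,GPU_ALLOC:.status.allocatable.nvidia\.com/gpu' \
    2>/dev/null \
    | awk '$2 == "<none>" || $2 == "0" { print $1; bad = 1 }
           END { exit (bad ? 1 : 0) }'
}
```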

Then run the CUDA smoke pod from the hub; its nvidia-smi table is the end-to-end signal.

Failure modes

status.state stuck notReady. A component DaemonSet is not all-Running or the validator failed. kubectl describe clusterpolicy cluster-policy and kubectl get pods -n gpu-operator name the culprit; check that pod's logs (nvidia-operator-validator, nvidia-driver-daemonset, etc.).
spec: {} / over-trimmed CR rejected. An empty or near-empty spec fails to deploy; the component toggles are required. Render from the chart rather than minimizing by hand. ¹
Both ClusterPolicy and NVIDIADriver present. Driver management collides. Use one model: either driver.enabled here, or driver.useNvidiaDriverCRD: true plus a separate NVIDIADriver CR, never both. ²
Container driver vs host driver. driver.enabled: true on a node that already has a host driver wedges it NotReady. Set driver.enabled: false when the driver is host-installed (Ansible bring-up).
Hand-edited the CR while Helm owns it. The next helm upgrade reconciles your patch away. Make persistent changes through chart values (or helm upgrade --reuse-values --set ...), reserving kubectl patch for transient debugging.
mig.strategy mismatch. single advertises MIG slices as nvidia.com/gpu; mixed advertises nvidia.com/mig-<profile>. Pods requesting the wrong resource name stay Pending. Align the strategy with how workloads request GPUs (MIG mode).
dcgmExporter.serviceMonitor.enabled: true with no Prometheus Operator. The ServiceMonitor CRD is absent, so the object fails to apply; install the Prometheus Operator first (telemetry).
State flaps ready->notReady->ready during upgrades. Expected on multi-GPU nodes as DaemonSets roll; wait for it to settle before alerting. ⁸

References

GPU Operator getting started / install (chart v26.3.2, Helm repo): https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/getting-started.html
ClusterPolicy sample (config/samples/v1_clusterpolicy.yaml): https://github.com/NVIDIA/gpu-operator/blob/main/config/samples/v1_clusterpolicy.yaml
ClusterPolicy API types (api/nvidia/v1/clusterpolicy_types.go): https://github.com/NVIDIA/gpu-operator/blob/main/api/nvidia/v1/clusterpolicy_types.go
Chart default values (image/version coordinates): https://github.com/NVIDIA/gpu-operator/blob/main/deployments/gpu-operator/values.yaml
NVIDIA Driver Custom Resource (vs ClusterPolicy): https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/gpu-driver-configuration.html
Troubleshooting (pod/DaemonSet checks): https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/troubleshooting.html
State flap during upgrade (issue #1567): https://github.com/NVIDIA/gpu-operator/issues/1567

¹ A ClusterPolicy with an empty spec: {} fails to deploy (NVIDIA GPU Operator docs, driver configuration). https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/gpu-driver-configuration.html
² ClusterPolicy and the NVIDIADriver custom resource cannot be used at the same time (NVIDIA GPU Operator docs, driver configuration). https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/gpu-driver-configuration.html
³ Component spec fields (enabled, repository, image, version, imagePullPolicy, imagePullSecrets, args, env, resources, config, serviceMonitor, plugin) from clusterpolicy_types.go. https://github.com/NVIDIA/gpu-operator/blob/main/api/nvidia/v1/clusterpolicy_types.go
⁴ validator.plugin.env with WITH_WORKLOAD, mig.strategy: single, operator.runtimeClass: nvidia, and the component enabled toggles from the upstream sample. https://github.com/NVIDIA/gpu-operator/blob/main/config/samples/v1_clusterpolicy.yaml
⁵ Default repository/image/version per component from the chart values.yaml. https://github.com/NVIDIA/gpu-operator/blob/main/deployments/gpu-operator/values.yaml
⁶ GPU Operator current release v26.3.2, Helm repo https://helm.ngc.nvidia.com/nvidia. https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/getting-started.html
⁷ status.state enum values (ignored, ready, notReady, disabled) are defined upstream in clusterpolicy_types.go (the State constants). https://github.com/NVIDIA/gpu-operator/blob/main/api/nvidia/v1/clusterpolicy_types.go; query example (kubectl get clusterpolicy ... -o json | jq '.status.state, .status.conditions') per https://kubernetes.recipes/recipes/configuration/gpu-operator-clusterpolicy-reference/
⁸ ClusterPolicy state fluctuates ready->notReady->ready during upgrade on multi-GPU nodes (issue #1567). https://github.com/NVIDIA/gpu-operator/issues/1567