Manifest: DCGM exporter

Scope: deploy dcgm-exporter (via the GPU Operator or standalone), wire a Prometheus ServiceMonitor/scrape config, understand the DCGM_FI_DEV_*/DCGM_FI_PROF_* fields it exposes and the SYS_ADMIN/pod-resources requirements, supply a custom metrics CSV, and verify the /metrics endpoint plus a live Prometheus target. The exporter half of the pipeline in Telemetry, Monitoring & Alerting; part of the GPU platform.

Reference templates from the upstream dcgm-exporter chart and DaemonSet. Pin chart and image versions; apply via GitOps. Field IDs follow the deployed DCGM version. Verify against the DCGM field-ID reference. Never hardware-tested here.

flowchart LR
  GPU["NVIDIA GPU"] --> DCGM["DCGM (in-container)"]
  DCGM --> EXP["dcgm-exporter :9400/metrics"]
  KUBELET["kubelet pod-resources"] --> EXP
  EXP --> SM["ServiceMonitor"]
  SM --> PROM["Prometheus"]
  PROM --> GRAF["Grafana / Alertmanager"]

What it is

dcgm-exporter is a Prometheus exporter that polls NVIDIA Data Center GPU Manager (DCGM) and publishes GPU telemetry on :9400/metrics. It runs as a DaemonSet on GPU nodes. Two ways to get it:

  • Via the GPU Operator (dcgmExporter.enabled=true, the default). The operator owns the DaemonSet, Service, and ServiceMonitor. This is the path used by the GPU platform hub; prefer it when the Operator is already installed.
  • Standalone. The dcgm-exporter Helm chart or a hand-rolled DaemonSet. Use when there is no GPU Operator, or to pin/extend the exporter independently.

Pod-level attribution (mapping a metric to the pod holding the GPU) requires the kubelet pod-resources socket mounted into the exporter; profiling fields (DCGM_FI_PROF_*) require the SYS_ADMIN capability. Without SYS_ADMIN the *_PROF_* fields are absent or read zero.

Prerequisites

  • NVIDIA driver + container toolkit on GPU nodes; GPUs schedulable (nvidia.com/gpu).
  • Prometheus that watches ServiceMonitor/PodMonitor CRDs (e.g. kube-prometheus-stack) for the CRD path; otherwise a static scrape config. The monitoring.coreos.com/v1 CRDs must exist for kind: ServiceMonitor to apply.
  • kubelet pod-resources socket at /var/lib/kubelet/pod-resources (default) for pod-level labels.
  • RBAC for the operators that manage the exporter. See RBAC for GPU Platform Operators.
  • If the GPU Operator already deploys the exporter, do not also deploy a standalone one on the same nodes; both would poll the same GPUs and publish duplicate series, and they collide on port 9400 if either uses hostNetwork or a hostPort.

Install

Option A: GPU Operator

With the GPU Operator the exporter ships enabled. Relevant chart values:

helm repo add nvidia https://helm.ngc.nvidia.com/nvidia && helm repo update
helm upgrade --install -n gpu-operator --create-namespace gpu-operator nvidia/gpu-operator \
  --version <pinned> \
  --set dcgmExporter.enabled=true \
  --set dcgmExporter.serviceMonitor.enabled=true \
  --set dcgmExporter.enablePodLabels=true        # attaches the pods' Kubernetes labels to metrics (raises cardinality)

To supply a custom metrics CSV (which replaces the default counter set), create a ConfigMap that the Operator mounts:

# 1) author dcgm-metrics.csv (see "Configuration" for the format), then:
kubectl create configmap metrics-config -n gpu-operator --from-file=dcgm-metrics.csv

# 2) point the exporter at it
helm upgrade gpu-operator nvidia/gpu-operator -n gpu-operator --reuse-values \
  --set dcgmExporter.config.name=metrics-config \
  --set 'dcgmExporter.env[0].name=DCGM_EXPORTER_COLLECTORS' \
  --set 'dcgmExporter.env[0].value=/etc/dcgm-exporter/dcgm-metrics.csv'   # quoted so the shell does not glob [0]

The dcgmExporter.config.name and DCGM_EXPORTER_COLLECTORS key names have been stable across recent releases, but verify the exact wiring against the GPU Operator "Changing the DCGM-Exporter metrics" docs for your Operator version.

Option B: Standalone Helm chart

helm repo add gpu-helm-charts https://nvidia.github.io/dcgm-exporter/helm-charts && helm repo update
helm upgrade --install dcgm-exporter gpu-helm-charts/dcgm-exporter \
  -n monitoring --create-namespace --version <pinned> \
  --set serviceMonitor.enabled=true \
  --set serviceMonitor.interval=15s \
  --set image.repository=nvcr.io/nvidia/k8s/dcgm-exporter \
  --set image.tag=4.5.3-4.8.2-distroless

The manifest

Standalone DaemonSet + Service + ServiceMonitor, mirroring the upstream reference manifest (image, port, env, SYS_ADMIN, pod-resources hostPath). Use this when you are not running the chart or the GPU Operator.

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: dcgm-exporter
  namespace: monitoring
  labels: { app: dcgm-exporter }
spec:
  selector: { matchLabels: { app: dcgm-exporter } }
  updateStrategy: { type: RollingUpdate }
  template:
    metadata:
      labels: { app: dcgm-exporter }
    spec:
      nodeSelector: { nvidia.com/gpu.present: "true" }
      tolerations:
        - { key: nvidia.com/gpu, operator: Exists, effect: NoSchedule }
      containers:
        - name: dcgm-exporter
          image: nvcr.io/nvidia/k8s/dcgm-exporter:4.5.3-4.8.2-distroless   # pin explicitly
          args: ["-f", "/etc/dcgm-exporter/default-counters.csv"]
          env:
            - { name: DCGM_EXPORTER_LISTEN, value: ":9400" }
            - { name: DCGM_EXPORTER_KUBERNETES, value: "true" }            # attribute to pods
            - name: NODE_NAME
              valueFrom: { fieldRef: { fieldPath: spec.nodeName } }
          ports:
            - { name: metrics, containerPort: 9400 }
          securityContext:
            capabilities:
              add: ["SYS_ADMIN"]        # required for DCGM_FI_PROF_* profiling fields
              drop: ["ALL"]
          volumeMounts:
            - { name: pod-gpu-resources, mountPath: /var/lib/kubelet/pod-resources, readOnly: true }
          resources:
            requests: { cpu: 100m, memory: 128Mi }
            limits:   { cpu: 200m, memory: 512Mi }
      volumes:
        - name: pod-gpu-resources
          hostPath: { path: /var/lib/kubelet/pod-resources }
---
apiVersion: v1
kind: Service
metadata:
  name: dcgm-exporter
  namespace: monitoring
  labels: { app: dcgm-exporter }
spec:
  type: ClusterIP
  selector: { app: dcgm-exporter }
  ports:
    - { name: metrics, port: 9400, targetPort: metrics }
---
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: dcgm-exporter
  namespace: monitoring
  labels: { app: dcgm-exporter, release: kube-prom }   # match Prometheus serviceMonitorSelector
spec:
  selector:
    matchLabels: { app: dcgm-exporter }
  endpoints:
    - port: metrics
      path: /metrics
      interval: 15s

The exporter reads DCGM_FI_DEV_* (counters/gauges sampled from DCGM) and DCGM_FI_PROF_* (profiling) fields. The release: kube-prom label must match the Prometheus serviceMonitorSelector; if the target never appears, that label mismatch is the first suspect.

Note on privilege: the upstream reference DaemonSet grants only the SYS_ADMIN capability and drops ALL. It sets none of hostPID, hostNetwork, or privileged: true. Pod-resources access comes from the read-only hostPath mount, not from the host PID namespace. Do not add hostPID unless a specific field requires it.

Static scrape config (no Prometheus Operator)

If Prometheus scrapes the exporter from its own scrape configuration rather than via a ServiceMonitor:

scrape_configs:
  - job_name: dcgm-exporter
    kubernetes_sd_configs:
      - role: endpoints
    relabel_configs:
      - source_labels: [__meta_kubernetes_service_label_app]
        regex: dcgm-exporter
        action: keep
      - source_labels: [__meta_kubernetes_endpoint_port_name]
        regex: metrics
        action: keep

Configuration

Key chart values and CRD/field knobs. Standalone-chart keys are from deployment/values.yaml; GPU Operator keys are nested under dcgmExporter.*.

| Key / field | Where | Default | Purpose |
| --- | --- | --- | --- |
| image.repository | chart | nvcr.io/nvidia/k8s/dcgm-exporter | exporter image |
| image.tag | chart | 4.5.3-4.8.2-distroless | pin explicitly; format <DCGM>-<exporter>-<variant> |
| arguments / container args | chart / pod | ["-f","/etc/dcgm-exporter/default-counters.csv"] | counters CSV to collect |
| service.type / service.port | chart | ClusterIP / 9400 | metrics Service |
| serviceMonitor.enabled | chart | true | create the ServiceMonitor resource |
| serviceMonitor.interval | chart | 15s | scrape interval |
| serviceMonitor.honorLabels | chart | false | keep exporter labels over Prometheus' |
| serviceMonitor.additionalLabels | chart | {} | labels so Prometheus selects it (e.g. release: kube-prom) |
| serviceMonitor.relabelings | chart | [] | target relabeling applied before the scrape |
| securityContext.capabilities.add | chart / pod | ["SYS_ADMIN"] | unlocks DCGM_FI_PROF_* |
| kubeletPath | chart | /var/lib/kubelet/pod-resources | pod-resources socket for pod labels |
| resources.requests / resources.limits | chart | requests 100m / 128Mi; limits 200m / 512Mi | per-node footprint |
| dcgmExporter.enabled | GPU Operator | true | deploy via Operator |
| dcgmExporter.serviceMonitor.enabled | GPU Operator | true | Operator emits the ServiceMonitor |
| dcgmExporter.enablePodLabels | GPU Operator | false | attach the pods' Kubernetes labels to metrics (raises cardinality) |
| dcgmExporter.config.name | GPU Operator | (unset) | ConfigMap holding the custom metrics CSV |
| DCGM_EXPORTER_COLLECTORS (env) | GPU Operator | (unset) | path to the mounted CSV, e.g. /etc/dcgm-exporter/dcgm-metrics.csv |

Metrics exposed (selected DCGM_FI_* fields, verbatim from the default CSV)

| Field | Type | Help |
| --- | --- | --- |
| DCGM_FI_DEV_GPU_UTIL | gauge | GPU utilization (in %). |
| DCGM_FI_DEV_MEM_COPY_UTIL | gauge | Memory utilization (in %). |
| DCGM_FI_DEV_FB_USED | gauge | Framebuffer memory used (in MiB). |
| DCGM_FI_DEV_FB_FREE | gauge | Framebuffer memory free (in MiB). |
| DCGM_FI_DEV_GPU_TEMP | gauge | GPU temperature (in C). |
| DCGM_FI_DEV_MEMORY_TEMP | gauge | Memory temperature (in C). |
| DCGM_FI_DEV_POWER_USAGE | gauge | Power draw (in W). |
| DCGM_FI_DEV_TOTAL_ENERGY_CONSUMPTION | counter | Total energy consumption since boot (in mJ). |
| DCGM_FI_DEV_SM_CLOCK | gauge | SM clock frequency (in MHz). |
| DCGM_FI_DEV_MEM_CLOCK | gauge | Memory clock frequency (in MHz). |
| DCGM_FI_DEV_XID_ERRORS | gauge | Value of the last XID error encountered. |
| DCGM_FI_DEV_PCIE_REPLAY_COUNTER | counter | Total number of PCIe retries. |
| DCGM_FI_DEV_NVLINK_BANDWIDTH_TOTAL | counter | Total number of NVLink bandwidth counters for all lanes. |
| DCGM_FI_PROF_GR_ENGINE_ACTIVE | gauge | Ratio of time the graphics engine is active. |
| DCGM_FI_PROF_PIPE_TENSOR_ACTIVE | gauge | Ratio of cycles the tensor (HMMA) pipe is active. |
| DCGM_FI_PROF_DRAM_ACTIVE | gauge | Ratio of cycles the device memory interface is active. |
| DCGM_FI_PROF_PCIE_TX_BYTES / _RX_BYTES | gauge | PCIe bytes/sec transmitted / received. |

DCGM_FI_PROF_SM_ACTIVE / SM_OCCUPANCY and the ECC totals (DCGM_FI_DEV_ECC_SBE_VOL_TOTAL, DCGM_FI_DEV_ECC_DBE_VOL_TOTAL) ship commented out in the default CSV. Enable them via a custom CSV if you alert on SM-active or double-bit ECC (telemetry alert rules). The full field catalogue is the DCGM field-ID reference.
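Before writing a custom CSV it can help to see which fields a given counters file ships disabled. The following is a minimal sketch, assuming a local copy of the CSV and assuming disabled rows look like "# DCGM_FI_..., type, help"; the function name list_disabled_fields is hypothetical, not part of any NVIDIA tooling.

```shell
# list_disabled_fields
# Reads a counters CSV on stdin and prints the name of every DCGM field
# whose row is commented out (one name per line, in file order).
list_disabled_fields() {
  awk -F',' '
    /^[ \t]*#[ \t]*DCGM_FI_[A-Z0-9_]+[ \t]*,/ {
      name = $1
      gsub(/[# \t]/, "", name)   # strip the comment marker and padding
      print name
    }'
}

# Example (assumes the file was saved locally):
#   list_disabled_fields < default-counters.csv
```

Plain comment lines that do not start with a field name followed by a comma are ignored, so section headers inside the CSV do not show up in the output.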

Custom metrics CSV

Three comma-separated columns: DCGM field, Prometheus type, help string. Lines starting with # are comments. A custom CSV replaces the default set, so include everything you scrape:

# dcgm-metrics.csv — DCGM FIELD, Prometheus type, help
DCGM_FI_DEV_GPU_UTIL,                gauge,   GPU utilization (in %).
DCGM_FI_DEV_FB_USED,                 gauge,   Framebuffer memory used (in MiB).
DCGM_FI_DEV_FB_FREE,                 gauge,   Framebuffer memory free (in MiB).
DCGM_FI_DEV_GPU_TEMP,                gauge,   GPU temperature (in C).
DCGM_FI_DEV_POWER_USAGE,             gauge,   Power draw (in W).
DCGM_FI_DEV_XID_ERRORS,              gauge,   Value of the last XID error encountered.
DCGM_FI_DEV_ECC_DBE_VOL_TOTAL,       counter, Total uncorrectable (double-bit) ECC errors.
DCGM_FI_PROF_SM_ACTIVE,              gauge,   Ratio of cycles an SM has at least one warp assigned.
DCGM_FI_PROF_PIPE_TENSOR_ACTIVE,     gauge,   Ratio of cycles the tensor (HMMA) pipe is active.

Apply & verify

# Standalone manifest path:
kubectl apply -f dcgm-exporter.yaml
kubectl rollout status ds/dcgm-exporter -n monitoring
kubectl get pods -n monitoring -l app=dcgm-exporter -o wide   # one Running pod per GPU node

Port-forward to the endpoint and confirm it returns real samples:

kubectl -n monitoring port-forward ds/dcgm-exporter 9400:9400 &
curl -s localhost:9400/metrics | grep -E "DCGM_FI_DEV_GPU_UTIL|DCGM_FI_DEV_FB_USED" | head
# expected: DCGM_FI_DEV_GPU_UTIL{gpu="0",UUID="GPU-...",Hostname="..."} 0
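Eyeballing the grep output works for one or two GPUs; on denser nodes a count is quicker to compare against what the node should have. A minimal sketch, assuming the UUID label shown in the expected output above is present on every sample; count_gpus is a hypothetical helper name.

```shell
# count_gpus METRIC
# Reads Prometheus text exposition on stdin and prints the number of distinct
# UUID label values seen on samples of METRIC.
count_gpus() {
  awk -v metric="$1" '
    index($0, metric "{") == 1 {                 # sample lines for this metric only
      if (match($0, /UUID="[^"]*"/)) seen[substr($0, RSTART, RLENGTH)] = 1
    }
    END { n = 0; for (u in seen) n++; print n }
  '
}

# Example (with the port-forward still running):
#   curl -s localhost:9400/metrics | count_gpus DCGM_FI_DEV_GPU_UTIL
```

Because the port-forward reaches a single exporter pod, the number reflects one node only; the PromQL count covers the whole fleet.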

Confirm profiling fields are non-empty (proves SYS_ADMIN is in effect):

curl -s localhost:9400/metrics | grep DCGM_FI_PROF_GR_ENGINE_ACTIVE | head
# empty/absent => missing SYS_ADMIN, or MIG-mode device without per-instance profiling
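An idle GPU legitimately reports zero for profiling ratios, so presence and value are worth separating. The sketch below summarises every DCGM_FI_PROF_* series in a scrape; it assumes the sample value is the last whitespace-separated token on the line (no trailing timestamp), and prof_summary is a hypothetical helper name.

```shell
# prof_summary
# Reads Prometheus text exposition on stdin. For each DCGM_FI_PROF_* metric it
# prints "<name> samples=<n> nonzero=<m>", sorted by name.
# Prints nothing when no profiling samples are present.
prof_summary() {
  awk '
    /^DCGM_FI_PROF_/ {
      name = $0
      sub(/[{ ].*$/, "", name)                   # metric name ends at "{" or the first space
      samples[name]++
      if ($NF + 0 != 0) nonzero[name]++          # last token is taken as the sample value
    }
    END {
      for (n in samples) printf "%s samples=%d nonzero=%d\n", n, samples[n], nonzero[n] + 0
    }
  ' | sort
}

# Example:
#   curl -s localhost:9400/metrics | prof_summary
```

No output at all points at the causes listed in the comment above. Rows with samples but nonzero=0 may simply mean the GPUs are idle, so run a workload before drawing conclusions.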

Confirm Prometheus discovered the target. The expected signal is an active target whose health is "up":

# Prometheus UI: Status -> Targets -> job "dcgm-exporter" -> UP
# or via API:
curl -s "http://<prometheus>/api/v1/targets" \
  | jq '.data.activeTargets[] | select(.labels.job=="dcgm-exporter") | {health, scrapeUrl}'
# expected: { "health": "up", "scrapeUrl": "http://.../metrics" }

A quick PromQL sanity check (more in telemetry):

count(DCGM_FI_DEV_GPU_UTIL)          # == total GPUs scraped across the fleet
up{job="dcgm-exporter"}              # 1 per exporter pod

Failure modes

  • Target absent in Prometheus. ServiceMonitor labels don't match serviceMonitorSelector (commonly the missing release: <prom-release> label), or the ServiceMonitor lives in a namespace Prometheus doesn't watch.
  • DCGM_FI_PROF_* empty/zero. Missing SYS_ADMIN capability, or the GPU is in MIG mode without per-instance profiling (see MIG mode).
  • No pod-level labels. Pod-resources socket not mounted, or kubelet path differs from /var/lib/kubelet/pod-resources; set kubeletPath / fix the hostPath.
  • Two exporters fighting. The GPU Operator's DaemonSet plus a standalone one on the same nodes poll the same GPUs twice and publish duplicate series (and contend for port 9400 if either uses hostNetwork or a hostPort). Run exactly one.
  • CrashLoopBackOff / Failed to watch errors. Exporter cannot reach DCGM (driver/toolkit not ready on the node) or RBAC denies the ConfigMap mount (RBAC for GPU Platform Operators).
  • Custom CSV silently drops metrics you rely on. A custom CSV replaces, not extends, the defaults; alerts on omitted fields (XID, ECC, SM-active) go blind.
  • Cardinality blow-up. enablePodLabels=true plus per-PID fields on a busy inference node explodes series count; scope with podLabelAllowlistRegex or relabeling.
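The "two exporters" case above is quick to rule out by listing exporter DaemonSets across all namespaces. A minimal sketch, assuming both the Operator-managed and the standalone DaemonSet carry "dcgm-exporter" somewhere in their name (verify the actual names on your cluster); find_exporter_daemonsets is a hypothetical helper name.

```shell
# find_exporter_daemonsets
# Reads `kubectl get ds -A --no-headers` output on stdin (NAMESPACE NAME ...)
# and prints "namespace/name" for every DaemonSet whose name contains
# "dcgm-exporter".
find_exporter_daemonsets() {
  awk '$2 ~ /dcgm-exporter/ { print $1 "/" $2 }'
}

# Example:
#   kubectl get ds -A --no-headers | find_exporter_daemonsets
```

More than one line of output is not proof of a conflict, since the DaemonSets may target disjoint node pools, but it is the cue to compare their nodeSelectors.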

References

  • dcgm-exporter (repo + reference manifest/values): https://github.com/NVIDIA/dcgm-exporter
  • dcgm-exporter chart values (deployment/values.yaml): https://github.com/NVIDIA/dcgm-exporter/blob/main/deployment/values.yaml
  • Default counters CSV: https://github.com/NVIDIA/dcgm-exporter/blob/main/etc/default-counters.csv
  • DCGM-Exporter (NVIDIA DCGM docs): https://docs.nvidia.com/datacenter/dcgm/latest/gpu-telemetry/dcgm-exporter.html
  • DCGM field-ID reference: https://docs.nvidia.com/datacenter/dcgm/latest/dcgm-api/dcgm-api-field-ids.html
  • Standalone Helm chart repository index: https://nvidia.github.io/dcgm-exporter/helm-charts/index.yaml
  • GPU Operator (custom metrics via dcgmExporter.config.name): https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/getting-started.html
  • Prometheus Operator (ServiceMonitor): https://prometheus-operator.dev/docs/

Related: Telemetry · K8s Platform · GPU Operator (Helm) · Smoke tests · Diagnostics · Security · Glossary