Manifest: DCGM exporter¶
Scope: deploy dcgm-exporter (via the GPU Operator or standalone), wire a Prometheus ServiceMonitor/scrape config, understand the DCGM_FI_DEV_*/DCGM_FI_PROF_* fields it exposes and the SYS_ADMIN/pod-resources requirements, supply a custom metrics CSV, and verify the /metrics endpoint plus a live Prometheus target. The exporter half of the pipeline in Telemetry, Monitoring & Alerting; part of the GPU platform.
Reference templates from the upstream dcgm-exporter chart and DaemonSet. Pin chart and image versions; apply via GitOps. Field IDs follow the deployed DCGM version; verify them against the DCGM field-ID reference. These templates have not been hardware-tested here.
flowchart LR
GPU["NVIDIA GPU"] --> DCGM["DCGM (in-container)"]
DCGM --> EXP["dcgm-exporter :9400/metrics"]
KUBELET["kubelet pod-resources"] --> EXP
EXP --> SM["ServiceMonitor"]
SM --> PROM["Prometheus"]
PROM --> GRAF["Grafana / Alertmanager"]
What it is¶
dcgm-exporter is a Prometheus exporter that polls NVIDIA Data Center GPU Manager (DCGM) and publishes GPU telemetry on :9400/metrics. It runs as a DaemonSet on GPU nodes. Two ways to get it:
- Via the GPU Operator (dcgmExporter.enabled=true, the default). The operator owns the DaemonSet, Service, and ServiceMonitor. This is the path used by the GPU platform hub; prefer it when the Operator is already installed.
- Standalone. The dcgm-exporter Helm chart or a hand-rolled DaemonSet. Use when there is no GPU Operator, or to pin/extend the exporter independently.
Pod-level attribution (mapping a metric to the pod holding the GPU) requires the kubelet pod-resources socket mounted into the exporter; profiling fields (DCGM_FI_PROF_*) require the SYS_ADMIN capability. Without SYS_ADMIN, the DCGM_FI_PROF_* fields come back empty or read zero.
Prerequisites¶
- NVIDIA driver + container toolkit on GPU nodes; GPUs schedulable (nvidia.com/gpu).
- Prometheus that watches ServiceMonitor/PodMonitor CRDs (e.g. kube-prometheus-stack) for the CRD path; otherwise a static scrape config. The monitoring.coreos.com/v1 CRDs must exist for kind: ServiceMonitor to apply.
- kubelet pod-resources socket at /var/lib/kubelet/pod-resources (default) for pod-level labels.
- RBAC for the operators that manage the exporter. See RBAC for GPU Platform Operators.
- If the GPU Operator already deploys the exporter, do not also deploy a standalone one on the same nodes; two binders fight for the pod-resources socket and port 9400.
Install¶
Option A: GPU Operator (recommended when the Operator is present)¶
The exporter ships enabled. Relevant chart values:
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia && helm repo update
helm upgrade --install -n gpu-operator --create-namespace gpu-operator nvidia/gpu-operator \
--version <pinned> \
--set dcgmExporter.enabled=true \
--set dcgmExporter.serviceMonitor.enabled=true \
--set dcgmExporter.enablePodLabels=true # adds pod name/namespace as metric labels
Custom metrics CSV (replaces the default counter set) via a ConfigMap the Operator mounts:
# 1) author dcgm-metrics.csv (see "Configuration" for the format), then:
kubectl create configmap metrics-config -n gpu-operator --from-file=dcgm-metrics.csv
# 2) point the exporter at it
# quote the env[0] keys so the shell does not treat the brackets as a glob pattern
helm upgrade gpu-operator nvidia/gpu-operator -n gpu-operator --reuse-values \
  --version <pinned> \
  --set dcgmExporter.config.name=metrics-config \
  --set 'dcgmExporter.env[0].name=DCGM_EXPORTER_COLLECTORS' \
  --set 'dcgmExporter.env[0].value=/etc/dcgm-exporter/dcgm-metrics.csv'
Verify the exact dcgmExporter.config.name + DCGM_EXPORTER_COLLECTORS wiring against the GPU Operator "Changing the DCGM-Exporter metrics" docs for your Operator version; the key names are stable across recent releases.
Option B: Standalone Helm chart¶
helm repo add gpu-helm-charts https://nvidia.github.io/dcgm-exporter/helm-charts && helm repo update
helm upgrade --install dcgm-exporter gpu-helm-charts/dcgm-exporter \
-n monitoring --create-namespace --version <pinned> \
--set serviceMonitor.enabled=true \
--set serviceMonitor.interval=15s \
--set image.repository=nvcr.io/nvidia/k8s/dcgm-exporter \
--set image.tag=4.5.3-4.8.2-distroless
The manifest¶
Standalone DaemonSet + Service + ServiceMonitor, mirroring the upstream reference manifest (image, port, env, SYS_ADMIN, pod-resources hostPath). Use this when you are not running the chart or the GPU Operator.
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: dcgm-exporter
  namespace: monitoring
  labels: { app: dcgm-exporter }
spec:
  selector: { matchLabels: { app: dcgm-exporter } }
  updateStrategy: { type: RollingUpdate }
  template:
    metadata:
      labels: { app: dcgm-exporter }
    spec:
      nodeSelector: { nvidia.com/gpu.present: "true" }
      tolerations:
        - { key: nvidia.com/gpu, operator: Exists, effect: NoSchedule }
      containers:
        - name: dcgm-exporter
          image: nvcr.io/nvidia/k8s/dcgm-exporter:4.5.3-4.8.2-distroless # pin explicitly
          args: ["-f", "/etc/dcgm-exporter/default-counters.csv"]
          env:
            - { name: DCGM_EXPORTER_LISTEN, value: ":9400" }
            - { name: DCGM_EXPORTER_KUBERNETES, value: "true" } # attribute to pods
            - name: NODE_NAME
              valueFrom: { fieldRef: { fieldPath: spec.nodeName } }
          ports:
            - { name: metrics, containerPort: 9400 }
          securityContext:
            capabilities:
              add: ["SYS_ADMIN"] # required for DCGM_FI_PROF_* profiling fields
              drop: ["ALL"]
          volumeMounts:
            - { name: pod-gpu-resources, mountPath: /var/lib/kubelet/pod-resources, readOnly: true }
          resources:
            requests: { cpu: 100m, memory: 128Mi }
            limits: { cpu: 200m, memory: 512Mi }
      volumes:
        - name: pod-gpu-resources
          hostPath: { path: /var/lib/kubelet/pod-resources }
---
apiVersion: v1
kind: Service
metadata:
  name: dcgm-exporter
  namespace: monitoring
  labels: { app: dcgm-exporter }
spec:
  type: ClusterIP
  selector: { app: dcgm-exporter }
  ports:
    - { name: metrics, port: 9400, targetPort: metrics }
---
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: dcgm-exporter
  namespace: monitoring
  labels: { app: dcgm-exporter, release: kube-prom } # match Prometheus serviceMonitorSelector
spec:
  selector:
    matchLabels: { app: dcgm-exporter }
  endpoints:
    - port: metrics
      path: /metrics
      interval: 15s
The exporter reads DCGM_FI_DEV_* (counters/gauges sampled from DCGM) and DCGM_FI_PROF_* (profiling) fields. The release: kube-prom label must match the Prometheus serviceMonitorSelector; if the target never appears, that label mismatch is the first suspect.
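The label-matching rule can be checked offline before applying anything. The sketch below implements standard Kubernetes label-selector semantics (matchLabels plus matchExpressions) in plain Python. The selector value shown is hypothetical; read the real one from the Prometheus custom resource's spec.serviceMonitorSelector in your cluster.

```python
def selector_matches(selector, labels):
    """Return True if a Kubernetes label selector dict matches a label dict."""
    # matchLabels: every key must be present with exactly the wanted value.
    for key, want in (selector.get("matchLabels") or {}).items():
        if labels.get(key) != want:
            return False
    # matchExpressions: In / NotIn / Exists / DoesNotExist.
    for expr in selector.get("matchExpressions") or []:
        key, op = expr["key"], expr["operator"]
        values = expr.get("values") or []
        if op == "In" and labels.get(key) not in values:
            return False
        if op == "NotIn" and labels.get(key) in values:
            return False
        if op == "Exists" and key not in labels:
            return False
        if op == "DoesNotExist" and key in labels:
            return False
    return True


# Labels from the ServiceMonitor manifest above.
sm_labels = {"app": "dcgm-exporter", "release": "kube-prom"}
# Hypothetical selector; substitute the one from your Prometheus resource.
prom_selector = {"matchLabels": {"release": "kube-prom"}}

print(selector_matches(prom_selector, sm_labels))
```

If this prints False for your real selector, fix the ServiceMonitor labels before looking anywhere else.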
Note on privilege: the upstream reference DaemonSet grants only the SYS_ADMIN capability and drops ALL. It does not set hostPID or hostNetwork or privileged: true. Pod-resources access comes from the read-only hostPath mount, not from host PID namespace. Do not add hostPID unless a specific field requires it.
Static scrape config (no Prometheus Operator)¶
If Prometheus scrapes the exporter from its own scrape config rather than via a ServiceMonitor:
scrape_configs:
  - job_name: dcgm-exporter
    kubernetes_sd_configs:
      - role: endpoints
    relabel_configs:
      - source_labels: [__meta_kubernetes_service_label_app]
        regex: dcgm-exporter
        action: keep
      - source_labels: [__meta_kubernetes_endpoint_port_name]
        regex: metrics
        action: keep
Configuration¶
Key chart values and CRD/field knobs. Standalone-chart keys are from deployment/values.yaml; GPU Operator keys are nested under dcgmExporter.*.
| Key / field | Where | Default | Purpose |
|---|---|---|---|
| image.repository | chart | nvcr.io/nvidia/k8s/dcgm-exporter | exporter image |
| image.tag | chart | 4.5.3-4.8.2-distroless | pin explicitly; format <DCGM>-<exporter>-<variant> |
| arguments / container args | chart / pod | ["-f","/etc/dcgm-exporter/default-counters.csv"] | counters CSV to collect |
| service.type / service.port | chart | ClusterIP / 9400 | metrics Service |
| serviceMonitor.enabled | chart | true | create the ServiceMonitor object |
| serviceMonitor.interval | chart | 15s | scrape interval |
| serviceMonitor.honorLabels | chart | false | keep exporter labels over Prometheus' |
| serviceMonitor.additionalLabels | chart | {} | labels so Prometheus selects it (e.g. release: kube-prom) |
| serviceMonitor.relabelings | chart | [] | relabel before ingestion |
| securityContext.capabilities.add | chart / pod | ["SYS_ADMIN"] | unlocks DCGM_FI_PROF_* |
| kubeletPath | chart | /var/lib/kubelet/pod-resources | pod-resources socket for pod labels |
| resources.requests/limits | chart | 100m/128Mi … 200m/512Mi | per-node footprint |
| dcgmExporter.enabled | GPU Operator | true | deploy via Operator |
| dcgmExporter.serviceMonitor.enabled | GPU Operator | true | Operator emits the ServiceMonitor |
| dcgmExporter.enablePodLabels | GPU Operator | false | add pod name/namespace as labels (raises cardinality) |
| dcgmExporter.config.name | GPU Operator | (unset) | ConfigMap holding the custom metrics CSV |
| DCGM_EXPORTER_COLLECTORS (env) | GPU Operator | (unset) | path to the mounted CSV, e.g. /etc/dcgm-exporter/dcgm-metrics.csv |
Metrics exposed (selected DCGM_FI_* fields, verbatim from the default CSV)¶
| Field | Type | Help |
|---|---|---|
| DCGM_FI_DEV_GPU_UTIL | gauge | GPU utilization (in %). |
| DCGM_FI_DEV_MEM_COPY_UTIL | gauge | Memory utilization (in %). |
| DCGM_FI_DEV_FB_USED | gauge | Framebuffer memory used (in MiB). |
| DCGM_FI_DEV_FB_FREE | gauge | Framebuffer memory free (in MiB). |
| DCGM_FI_DEV_GPU_TEMP | gauge | GPU temperature (in C). |
| DCGM_FI_DEV_MEMORY_TEMP | gauge | Memory temperature (in C). |
| DCGM_FI_DEV_POWER_USAGE | gauge | Power draw (in W). |
| DCGM_FI_DEV_TOTAL_ENERGY_CONSUMPTION | counter | Total energy consumption since boot (in mJ). |
| DCGM_FI_DEV_SM_CLOCK | gauge | SM clock frequency (in MHz). |
| DCGM_FI_DEV_MEM_CLOCK | gauge | Memory clock frequency (in MHz). |
| DCGM_FI_DEV_XID_ERRORS | gauge | Value of the last XID error encountered. |
| DCGM_FI_DEV_PCIE_REPLAY_COUNTER | counter | Total number of PCIe retries. |
| DCGM_FI_DEV_NVLINK_BANDWIDTH_TOTAL | counter | Total number of NVLink bandwidth counters for all lanes. |
| DCGM_FI_PROF_GR_ENGINE_ACTIVE | gauge | Ratio of time the graphics engine is active. |
| DCGM_FI_PROF_PIPE_TENSOR_ACTIVE | gauge | Ratio of cycles the tensor (HMMA) pipe is active. |
| DCGM_FI_PROF_DRAM_ACTIVE | gauge | Ratio of cycles the device memory interface is active. |
| DCGM_FI_PROF_PCIE_TX_BYTES / _RX_BYTES | gauge | PCIe bytes/sec transmitted / received. |
DCGM_FI_PROF_SM_ACTIVE / SM_OCCUPANCY and the ECC totals (DCGM_FI_DEV_ECC_SBE_VOL_TOTAL, DCGM_FI_DEV_ECC_DBE_VOL_TOTAL) ship commented out in the default CSV. Enable them via a custom CSV if you alert on SM-active or double-bit ECC (telemetry alert rules). The full field catalogue is the DCGM field-ID reference.
Custom metrics CSV¶
Three comma-separated columns: DCGM field, Prometheus type, help string. Lines starting with # are comments. A custom CSV replaces the default set, so include everything you scrape:
# dcgm-metrics.csv — DCGM FIELD, Prometheus type, help
DCGM_FI_DEV_GPU_UTIL, gauge, GPU utilization (in %).
DCGM_FI_DEV_FB_USED, gauge, Framebuffer memory used (in MiB).
DCGM_FI_DEV_FB_FREE, gauge, Framebuffer memory free (in MiB).
DCGM_FI_DEV_GPU_TEMP, gauge, GPU temperature (in C).
DCGM_FI_DEV_POWER_USAGE, gauge, Power draw (in W).
DCGM_FI_DEV_XID_ERRORS, gauge, Value of the last XID error encountered.
DCGM_FI_DEV_ECC_DBE_VOL_TOTAL, counter, Total uncorrectable (double-bit) ECC errors.
DCGM_FI_PROF_SM_ACTIVE, gauge, Ratio of cycles an SM has at least one warp assigned.
DCGM_FI_PROF_PIPE_TENSOR_ACTIVE, gauge, Ratio of cycles the tensor (HMMA) pipe is active.
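Because a custom CSV replaces the defaults, it is worth linting before it goes into a ConfigMap. A minimal sketch, assuming the three-column format described above: it parses the CSV, rejects rows with an unexpected type, and reports fields your alerts need that the CSV omits. The helper names and the required-field list are illustrative, and accepting "label" as a type is an assumption; drop it if your exporter version rejects it.

```python
# Types accepted by this sketch; "label" is an assumption, see the note above.
ALLOWED_TYPES = {"gauge", "counter", "label"}


def parse_counters_csv(text):
    """Parse counters CSV text into (field, type, help) tuples."""
    rows = []
    for lineno, raw in enumerate(text.splitlines(), start=1):
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # blank line or comment
        # Split on the first two commas only; the help string may contain commas.
        parts = [p.strip() for p in line.split(",", 2)]
        if len(parts) != 3:
            raise ValueError(f"line {lineno}: expected 3 columns, got {len(parts)}")
        field, mtype, help_text = parts
        if mtype not in ALLOWED_TYPES:
            raise ValueError(f"line {lineno}: unknown type {mtype!r}")
        rows.append((field, mtype, help_text))
    return rows


def missing_fields(rows, required):
    """Return the required DCGM fields that the parsed CSV does not enable."""
    present = {field for field, _, _ in rows}
    return sorted(set(required) - present)


SAMPLE = """\
# dcgm-metrics.csv (trimmed example)
DCGM_FI_DEV_GPU_UTIL, gauge, GPU utilization (in %).
DCGM_FI_DEV_XID_ERRORS, gauge, Value of the last XID error encountered.
"""

rows = parse_counters_csv(SAMPLE)
# Hypothetical list of fields that alert rules depend on.
print(missing_fields(rows, ["DCGM_FI_DEV_XID_ERRORS", "DCGM_FI_DEV_ECC_DBE_VOL_TOTAL"]))
```

Running a check like this in CI against the fields your alert rules reference catches the "custom CSV silently drops metrics" case before rollout.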
Apply & verify¶
# Standalone manifest path:
kubectl apply -f dcgm-exporter.yaml
kubectl rollout status ds/dcgm-exporter -n monitoring
kubectl get pods -n monitoring -l app=dcgm-exporter -o wide # one Running pod per GPU node
Port-forward to an exporter pod and confirm the endpoint returns real samples:
kubectl -n monitoring port-forward ds/dcgm-exporter 9400:9400 &
curl -s localhost:9400/metrics | grep -E "DCGM_FI_DEV_GPU_UTIL|DCGM_FI_DEV_FB_USED" | head
# expected: DCGM_FI_DEV_GPU_UTIL{gpu="0",UUID="GPU-...",Hostname="..."} 0
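When grep output gets hard to read, the same check can be scripted. The sketch below is a small parser for the Prometheus text exposition lines the exporter emits; it is not a full implementation of the format, and the UUID and Hostname values in the sample are placeholders.

```python
import re

# name{labels} value   (labels optional; anything after the value is ignored)
SAMPLE_RE = re.compile(
    r'^(?P<name>[A-Za-z_:][A-Za-z0-9_:]*)(?:\{(?P<labels>.*)\})?\s+(?P<value>\S+)'
)
LABEL_RE = re.compile(r'(\w+)="((?:[^"\\]|\\.)*)"')


def samples(text, metric):
    """Return [(labels_dict, value)] for every sample of `metric` in `text`."""
    out = []
    for line in text.splitlines():
        if line.startswith("#"):
            continue  # HELP / TYPE comments
        m = SAMPLE_RE.match(line)
        if not m or m.group("name") != metric:
            continue
        labels = dict(LABEL_RE.findall(m.group("labels") or ""))
        out.append((labels, float(m.group("value"))))
    return out


# Placeholder /metrics excerpt; real output carries more labels and fields.
TEXT = """\
# HELP DCGM_FI_DEV_GPU_UTIL GPU utilization (in %).
# TYPE DCGM_FI_DEV_GPU_UTIL gauge
DCGM_FI_DEV_GPU_UTIL{gpu="0",UUID="GPU-aaaa",Hostname="node-a"} 0
DCGM_FI_DEV_GPU_UTIL{gpu="1",UUID="GPU-bbbb",Hostname="node-a"} 37
DCGM_FI_DEV_FB_USED{gpu="0",UUID="GPU-aaaa",Hostname="node-a"} 512
"""

util_by_gpu = {labels["gpu"]: value for labels, value in samples(TEXT, "DCGM_FI_DEV_GPU_UTIL")}
print(util_by_gpu)
```

To use it against a live pod, save the curl output to a file and pass its contents as `text`.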
Confirm profiling fields are non-empty (proves SYS_ADMIN is in effect):
curl -s localhost:9400/metrics | grep DCGM_FI_PROF_GR_ENGINE_ACTIVE | head
# empty/absent => missing SYS_ADMIN, or MIG-mode device without per-instance profiling
Confirm Prometheus discovered the target. The expected signal is state up with health="up":
# Prometheus UI: Status -> Targets -> job "dcgm-exporter" -> UP
# or via API:
curl -s "http://<prometheus>/api/v1/targets" \
| jq '.data.activeTargets[] | select(.labels.job=="dcgm-exporter") | {health, scrapeUrl}'
# expected: { "health": "up", "scrapeUrl": "http://.../metrics" }
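Where jq is unavailable, the same filter is a few lines of Python. This sketch assumes the /api/v1/targets response shape used by the jq expression above (data.activeTargets[] with labels, health, scrapeUrl) plus the lastError field; the payload here is a hand-written stand-in, not captured output.

```python
import json


def job_target_health(payload, job):
    """Summarise active targets for `job`: total count and the ones not up."""
    targets = payload.get("data", {}).get("activeTargets", [])
    mine = [t for t in targets if t.get("labels", {}).get("job") == job]
    down = [
        (t.get("scrapeUrl"), t.get("health"), t.get("lastError", ""))
        for t in mine
        if t.get("health") != "up"
    ]
    return {"total": len(mine), "down": down}


# Hand-written stand-in for the API response; addresses are placeholders.
RESPONSE = json.loads("""
{
  "status": "success",
  "data": {
    "activeTargets": [
      {"labels": {"job": "dcgm-exporter"}, "health": "up",
       "scrapeUrl": "http://10.0.0.1:9400/metrics", "lastError": ""},
      {"labels": {"job": "dcgm-exporter"}, "health": "down",
       "scrapeUrl": "http://10.0.0.2:9400/metrics", "lastError": "connection refused"},
      {"labels": {"job": "node-exporter"}, "health": "up",
       "scrapeUrl": "http://10.0.0.1:9100/metrics", "lastError": ""}
    ]
  }
}
""")

summary = job_target_health(RESPONSE, "dcgm-exporter")
print(summary["total"], summary["down"])
```

A total of 0 points at discovery (selector or namespace mismatch); a non-empty down list points at the exporter pods or network path.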
A quick PromQL sanity check (more in telemetry):
count(DCGM_FI_DEV_GPU_UTIL) # == total GPUs scraped across the fleet
up{job="dcgm-exporter"} # 1 per exporter pod
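To turn the count check into a pass/fail gate, compare the query result with the GPU count you expect. A sketch, assuming the standard instant-query response shape from /api/v1/query (a vector whose samples carry [timestamp, "value"]); the expected count and the response below are illustrative.

```python
def single_value(payload):
    """Return the first sample of an instant-query vector as float, or None if empty."""
    result = payload["data"]["result"]
    if not result:
        return None  # count() over no series yields an empty vector
    return float(result[0]["value"][1])


def gpu_count_ok(payload, expected_gpus):
    """True when count(DCGM_FI_DEV_GPU_UTIL) equals the expected fleet size."""
    value = single_value(payload)
    return value is not None and int(value) == expected_gpus


# Illustrative response for: count(DCGM_FI_DEV_GPU_UTIL)
QUERY_RESPONSE = {
    "status": "success",
    "data": {
        "resultType": "vector",
        "result": [{"metric": {}, "value": [1700000000.0, "16"]}],
    },
}

EXPECTED_GPUS = 16  # hypothetical fleet size
print(gpu_count_ok(QUERY_RESPONSE, EXPECTED_GPUS))
```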
Failure modes¶
- Target absent in Prometheus. ServiceMonitor labels don't match serviceMonitorSelector (commonly the missing release: <prom-release> label), or the ServiceMonitor lives in a namespace Prometheus doesn't watch.
- DCGM_FI_PROF_* empty/zero. Missing SYS_ADMIN capability, or the GPU is MIG-mode without per-instance profiling (MIG mode).
- No pod-level labels. Pod-resources socket not mounted, or kubelet path differs from /var/lib/kubelet/pod-resources; set kubeletPath / fix the hostPath.
- Two exporters fighting. GPU Operator's DaemonSet plus a standalone one on the same nodes contend for port 9400 / the pod-resources socket. Run exactly one.
- CrashLoopBackOff / Failed to watch errors. Exporter cannot reach DCGM (driver/toolkit not ready on the node) or RBAC denies the ConfigMap mount (RBAC for GPU Platform Operators).
- Custom CSV silently drops metrics you rely on. A custom CSV replaces, not extends, the defaults; alerts on omitted fields (XID, ECC, SM-active) go blind.
- Cardinality blow-up. enablePodLabels=true plus per-PID fields on a busy inference node explodes series count; scope with podLabelAllowlistRegex or relabeling.
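For the cardinality case, a quick way to see which metric is growing is to count series per metric name in a saved /metrics dump. A rough sketch: it treats every non-comment line as one series and does not validate the exposition format. The pod and namespace values in the sample are made up.

```python
from collections import Counter


def series_per_metric(text):
    """Count exposition lines (series) per metric name in a /metrics dump."""
    counts = Counter()
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        # Metric name ends at the first "{" (labels) or at whitespace (no labels).
        name = line.split("{", 1)[0].split(None, 1)[0]
        counts[name] += 1
    return counts


# Made-up dump: one GPU shared by three pods produces three series per field.
DUMP = """\
# TYPE DCGM_FI_DEV_GPU_UTIL gauge
DCGM_FI_DEV_GPU_UTIL{gpu="0",pod="infer-a",namespace="ml"} 12
DCGM_FI_DEV_GPU_UTIL{gpu="0",pod="infer-b",namespace="ml"} 40
DCGM_FI_DEV_GPU_UTIL{gpu="0",pod="infer-c",namespace="ml"} 7
DCGM_FI_DEV_FB_USED{gpu="0",pod="infer-a",namespace="ml"} 2048
"""

for name, n in series_per_metric(DUMP).most_common():
    print(name, n)
```

Comparing these counts before and after enabling pod labels gives a concrete number to weigh against your Prometheus series budget.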
References¶
- dcgm-exporter (repo + reference manifest/values): https://github.com/NVIDIA/dcgm-exporter
- dcgm-exporter chart values (deployment/values.yaml): https://github.com/NVIDIA/dcgm-exporter/blob/main/deployment/values.yaml
- Default counters CSV: https://github.com/NVIDIA/dcgm-exporter/blob/main/etc/default-counters.csv
- DCGM-Exporter (NVIDIA DCGM docs): https://docs.nvidia.com/datacenter/dcgm/latest/gpu-telemetry/dcgm-exporter.html
- DCGM field-ID reference: https://docs.nvidia.com/datacenter/dcgm/latest/dcgm-api/dcgm-api-field-ids.html
- Standalone Helm chart repository index: https://nvidia.github.io/dcgm-exporter/helm-charts/index.yaml
- GPU Operator (custom metrics via dcgmExporter.config.name): https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/getting-started.html
- Prometheus Operator (ServiceMonitor): https://prometheus-operator.dev/docs/
Related: Telemetry · K8s Platform · GPU Operator (Helm) · Smoke tests · Diagnostics · Security · Glossary