Telemetry, monitoring & alerting
Scope: the runnable monitoring stack (DCGM → Prometheus → Grafana → Alertmanager) with the exporters, scrape config, dashboards, alert rules, and PromQL that matter for GPUs. This page is the manifest counterpart to observability and feeds RAS (reliability and RAS) and optimisation (performance tuning).
Reference templates. Metric names follow `dcgm-exporter` field IDs (verify against the deployed DCGM version). Pin chart/image versions; apply via GitOps (SRE and MLOps practices).
Overview
The stack answers three questions: is the GPU healthy, is it doing useful work, and is it about to fail. The first and third are DCGM fields (XID, ECC, temp, throttle); the second is the SM/Tensor-active signal that the naive "GPU-util" hides (observability). Wire dcgm-exporter into Prometheus, alert on the leading indicators, and put MFU-style panels in front of operators.
Pipeline: telemetry flow
```mermaid
flowchart LR
    GPU["GPU"] --> EXP["dcgm-exporter"]
    HOST["node_exporter"] --> PROM["Prometheus"]
    EXP --> PROM
    PROM --> GRAF["Grafana"]
    PROM --> AM["Alertmanager"]
    AM --> PAGE["PagerDuty / Slack"]
```
1. Prometheus + Grafana + Alertmanager
```bash
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm install kube-prom prometheus-community/kube-prometheus-stack \
  -n monitoring --create-namespace --version <pinned> \
  --set grafana.defaultDashboardsEnabled=true
```
2. dcgm-exporter (GPU metrics)
The GPU Operator can deploy it (the Kubernetes platform); standalone DaemonSet shown for clarity:
```yaml
apiVersion: apps/v1
kind: DaemonSet
metadata: { name: dcgm-exporter, namespace: monitoring }
spec:
  selector: { matchLabels: { app: dcgm-exporter } }
  template:
    metadata: { labels: { app: dcgm-exporter } }
    spec:
      nodeSelector: { nvidia.com/gpu.present: "true" }
      tolerations: [{ key: nvidia.com/gpu, operator: Exists, effect: NoSchedule }]
      containers:
        - name: dcgm-exporter
          image: nvcr.io/nvidia/k8s/dcgm-exporter:<pinned>
          ports: [{ name: metrics, containerPort: 9400 }]
          securityContext: { capabilities: { add: ["SYS_ADMIN"] } }
---
apiVersion: v1
kind: Service
metadata:
  name: dcgm-exporter
  namespace: monitoring
  labels: { app: dcgm-exporter }
spec:
  selector: { app: dcgm-exporter }
  ports: [{ name: metrics, port: 9400 }]
---
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata: { name: dcgm-exporter, namespace: monitoring, labels: { release: kube-prom } }
spec:
  selector: { matchLabels: { app: dcgm-exporter } }
  endpoints: [{ port: metrics, interval: 15s }]
```
3. Alert rules (the leading indicators)
```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata: { name: gpu-alerts, namespace: monitoring, labels: { release: kube-prom } }
spec:
  groups:
    - name: gpu.rules
      rules:
        # DCGM_FI_DEV_XID_ERRORS is a gauge holding the last XID code, not a counter,
        # so compare it directly; $value is then the XID code itself.
        - alert: GPUXidError
          expr: DCGM_FI_DEV_XID_ERRORS > 0
          labels: { severity: critical }
          annotations: { summary: "XID {{ $value }} on {{ $labels.Hostname }} GPU {{ $labels.gpu }}" }
        - alert: GPUDoubleBitECC        # uncorrectable -> drain/RMA
          expr: increase(DCGM_FI_DEV_ECC_DBE_VOL_TOTAL[10m]) > 0
          labels: { severity: critical }
        - alert: GPUThermalThrottle
          expr: DCGM_FI_DEV_GPU_TEMP > 87
          for: 5m
          labels: { severity: warning }
        - alert: GPUDown                # exporter/GPU missing
          expr: up{job="dcgm-exporter"} == 0
          for: 2m
          labels: { severity: critical }
        - alert: GPUStarvedAtHighUtil   # "busy" but not computing -> dataloader/comms bound
          expr: |
            (avg by (Hostname,gpu) (DCGM_FI_PROF_SM_ACTIVE) < 0.20)
            and (avg by (Hostname,gpu) (DCGM_FI_DEV_GPU_UTIL) > 80)
          for: 15m
          labels: { severity: info }
```
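Most of the rules above carry no annotations, so the page that reaches an operator has little context. A minimal sketch of one rule with a summary and a runbook link follows; the rule and group names and the `runbooks.example.com` URL are hypothetical placeholders, and `runbook_url` is a common annotation convention rather than something Prometheus requires.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata: { name: gpu-alerts-annotated, namespace: monitoring, labels: { release: kube-prom } }
spec:
  groups:
    - name: gpu.rules.annotated
      rules:
        - alert: GPUDoubleBitECC
          expr: increase(DCGM_FI_DEV_ECC_DBE_VOL_TOTAL[10m]) > 0
          labels: { severity: critical }
          annotations:
            summary: "Uncorrectable ECC error on {{ $labels.Hostname }} GPU {{ $labels.gpu }}"
            # Hypothetical URL: point this at the real troubleshooting runbook.
            runbook_url: "https://runbooks.example.com/gpu/double-bit-ecc"
```

The same two annotations can be added to each of the other alerts, so that every notification names the host, the GPU, and the next step.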
4. Key PromQL (what to graph)
```promql
# Real work, not "util": fraction of SMs active per GPU
avg by (Hostname, gpu) (DCGM_FI_PROF_SM_ACTIVE)

# Tensor-core engagement (low on a training box => check precision/kernels)
avg by (Hostname) (DCGM_FI_PROF_PIPE_TENSOR_ACTIVE)

# Memory used fraction
DCGM_FI_DEV_FB_USED / (DCGM_FI_DEV_FB_USED + DCGM_FI_DEV_FB_FREE)

# Fleet power draw (kW)
sum(DCGM_FI_DEV_POWER_USAGE) / 1000

# Fleet GPU allocation vs capacity (with kube-state-metrics)
sum(kube_pod_container_resource_requests{resource="nvidia_com_gpu"})
  / count(DCGM_FI_DEV_GPU_UTIL)   # count of GPUs, not a sum of utilisation values
```
Recording rule for a cheap fleet "useful-utilisation" rollup:
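A minimal sketch of such a rule, assuming the same `release: kube-prom` label that the alert rules use so the operator picks it up. The object name, group name, and the `fleet:` / `node:` record names are illustrative choices following the usual `level:metric:operation` naming pattern, not names taken from a shipped ruleset.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata: { name: gpu-recording, namespace: monitoring, labels: { release: kube-prom } }
spec:
  groups:
    - name: gpu.recording
      interval: 1m                    # evaluate less often than the 15s scrape
      rules:
        # Fleet-wide mean SM-active (0..1): one cheap series for the top-level panel.
        - record: fleet:dcgm_sm_active:avg
          expr: avg(DCGM_FI_PROF_SM_ACTIVE)
        # Per-node mean, for drilling down from the fleet number.
        - record: node:dcgm_sm_active:avg
          expr: avg by (Hostname) (DCGM_FI_PROF_SM_ACTIVE)
```

Dashboards can then query the recorded series instead of re-aggregating every GPU series on each refresh.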
5. Grafana dashboards
- Provision the NVIDIA DCGM Exporter dashboard (grafana.com ID 12239) and the DCGM cluster view via the kube-prometheus-stack sidecar (label a ConfigMap `grafana_dashboard: "1"`).
- Build one "fleet health" board: SM-active heatmap, XID/ECC event table, temp/throttle, power, allocation vs capacity. Keep `GPU-Util` off the efficiency panels.
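The sidecar provisioning described in the first item can be sketched as a labelled ConfigMap. The ConfigMap name, file name, and the stub JSON are placeholders; in practice the `data` value would be the exported dashboard JSON (for example, the one downloaded from grafana.com for ID 12239).

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: gpu-fleet-health-dashboard    # hypothetical name
  namespace: monitoring
  labels:
    grafana_dashboard: "1"            # the label the Grafana sidecar watches for
data:
  # Placeholder body: replace with the full exported dashboard JSON.
  gpu-fleet-health.json: |
    {
      "title": "GPU fleet health",
      "panels": []
    }
```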
Don't-miss checklist
- Scrape `DCGM_FI_PROF_SM_ACTIVE` / `PIPE_TENSOR_ACTIVE`, not just `GPU_UTIL` (observability).
- Alert on XID, double-bit ECC, thermal throttle, exporter-down at minimum (reliability and RAS).
- Route critical alerts to PagerDuty/Opsgenie, warnings to Slack; tie runbooks (the troubleshooting runbook) into annotations.
- Add the `SYS_ADMIN` capability the exporter needs for profiling fields, or those metrics read zero.
- Keep all of this in git and apply via GitOps (SRE and MLOps practices).
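The routing item in the checklist can be sketched as a Helm values fragment for kube-prometheus-stack, which passes `alertmanager.config` through as the Alertmanager configuration. The receiver names, the Slack channel, and the bracketed secrets are placeholders, and the split assumes the `severity` labels used in the alert rules on this page.

```yaml
alertmanager:
  config:
    route:
      receiver: slack-warnings        # default: anything not matched below (warning, info)
      group_by: [alertname, Hostname]
      routes:
        - matchers: ['severity = "critical"']
          receiver: pagerduty-critical
    receivers:
      - name: pagerduty-critical
        pagerduty_configs:
          - routing_key: <pagerduty-integration-key>   # placeholder; keep out of git in plain text
      - name: slack-warnings
        slack_configs:
          - api_url: <slack-webhook-url>               # placeholder
            channel: "#gpu-alerts"                     # hypothetical channel
```

Setting `alertmanager.config` replaces the chart's default configuration, so any default routes that are still wanted need to be restated here.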
Failure modes
- Profiling fields (`*_PROF_*`) empty because the exporter lacks privileges or the GPU is MIG-mode without per-instance profiling.
- Alerting on `GPU_UTIL` and missing a starved fleet running at 15% SM-active.
- No XID/ECC alerts: hardware faults invisible until a job crashes (reliability and RAS).
- Cardinality blow-up from per-PID metrics on a busy inference node.
Open questions & validation
- Confirm the exact `DCGM_FI_*` field set exposed by the deployed exporter version and adjust queries.
- Validate alert thresholds (temp, ECC rate) against the specific GPU SKU and cooling envelope (datacentre readiness).
- Decide MFU computation: derive from SM-active + achieved FLOPs in the training loop (distributed training) vs proxy via SM-active.
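The two MFU options in the last item measure different things. As a rough sketch, with symbols introduced here for illustration only, the FLOPs-based definition is:

$$
\mathrm{MFU} = \frac{F_{\text{achieved}}}{N_{\text{GPU}} \cdot F_{\text{peak}}}
$$

where $F_{\text{achieved}}$ is the model FLOP/s counted in the training loop, $N_{\text{GPU}}$ is the number of GPUs in the job, and $F_{\text{peak}}$ is the per-GPU peak FLOP/s at the training precision. The proxy option substitutes the average of `DCGM_FI_PROF_SM_ACTIVE` for this ratio. SM-active only records that an SM had work resident, so it is likely to read higher than MFU computed from FLOPs and is better treated as an upper-bound trend signal than as the same number.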
References
- dcgm-exporter: https://github.com/NVIDIA/dcgm-exporter
- DCGM field IDs: https://docs.nvidia.com/datacenter/dcgm/latest/dcgm-api/dcgm-api-field-ids.html
- kube-prometheus-stack: https://github.com/prometheus-community/helm-charts/tree/main/charts/kube-prometheus-stack
- NVIDIA DCGM Grafana dashboard 12239: https://grafana.com/grafana/dashboards/12239
- Prometheus Operator (ServiceMonitor/PrometheusRule): https://prometheus-operator.dev/docs/
Related: Observability · Reliability · Optimization · Runbook · K8s Platform · Practices · Glossary