Telemetry, monitoring & alerting

Scope: the runnable monitoring stack (DCGM → Prometheus → Grafana → Alertmanager): the exporters, scrape config, dashboards, alert rules, and PromQL that matter for GPUs. This page is the manifest counterpart to observability, and its output feeds reliability and RAS and performance tuning.

Reference templates. Metric names follow dcgm-exporter field IDs (verify against the deployed DCGM version). Pin chart/image versions; apply via GitOps (SRE and MLOps practices).

Overview

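The stack answers three questions: is the GPU healthy, is it doing useful work, and is it about to fail? The first and third come from DCGM device fields (XID, ECC, temperature, throttling). The second comes from the SM-active and Tensor-active profiling fields, which the naive "GPU-util" figure hides: GPU-util only reports the share of time in which at least one kernel was running, so a GPU can read 100% while most of its SMs sit idle (observability). Wire dcgm-exporter into Prometheus, alert on the leading indicators, and put MFU-style (model FLOPs utilisation) panels in front of operators.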

Pipeline: telemetry flow

flowchart LR
  GPU["GPU"] --> EXP["dcgm-exporter"]
  HOST["node_exporter"] --> PROM["Prometheus"]
  EXP --> PROM
  PROM --> GRAF["Grafana"]
  PROM --> AM["Alertmanager"]
  AM --> PAGE["PagerDuty / Slack"]

1. Prometheus + Grafana + Alertmanager

helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm install kube-prom prometheus-community/kube-prometheus-stack \
  -n monitoring --create-namespace --version <pinned> \
  --set grafana.defaultDashboardsEnabled=true
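For GitOps, the `--set` flag above usually moves into a values file. A minimal sketch, assuming the `kube-prom` release name used on this page; the file name `values-monitoring.yaml` is hypothetical and the retention value is a placeholder. The two `...SelectorNilUsesHelmValues` keys are shown at their chart default: with them set to `true`, Prometheus only picks up ServiceMonitors and PrometheusRules that carry the `release: kube-prom` label, which is why the manifests on this page set it.

```yaml
# values-monitoring.yaml (hypothetical file name)
grafana:
  defaultDashboardsEnabled: true
  sidecar:
    dashboards:
      enabled: true
      label: grafana_dashboard        # ConfigMaps with this label are loaded as dashboards
prometheus:
  prometheusSpec:
    retention: 15d                    # placeholder; size to your storage
    serviceMonitorSelectorNilUsesHelmValues: true   # chart default: select by release label
    ruleSelectorNilUsesHelmValues: true             # chart default: select by release label
```

Apply it with `helm upgrade --install kube-prom prometheus-community/kube-prometheus-stack -n monitoring --version <pinned> -f values-monitoring.yaml`.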

2. dcgm-exporter (GPU metrics)

The GPU Operator can deploy dcgm-exporter for you (the Kubernetes platform); a standalone DaemonSet, Service, and ServiceMonitor are shown here for clarity:

apiVersion: apps/v1
kind: DaemonSet
metadata: { name: dcgm-exporter, namespace: monitoring }
spec:
  selector: { matchLabels: { app: dcgm-exporter } }
  template:
    metadata: { labels: { app: dcgm-exporter } }
    spec:
      nodeSelector: { nvidia.com/gpu.present: "true" }
      tolerations: [{ key: nvidia.com/gpu, operator: Exists, effect: NoSchedule }]
      containers:
        - name: dcgm-exporter
          image: nvcr.io/nvidia/k8s/dcgm-exporter:<pinned>
          ports: [{ name: metrics, containerPort: 9400 }]
          securityContext: { capabilities: { add: ["SYS_ADMIN"] } }
---
apiVersion: v1
kind: Service
metadata:
  name: dcgm-exporter
  namespace: monitoring
  labels: { app: dcgm-exporter }
spec:
  selector: { app: dcgm-exporter }
  ports: [{ name: metrics, port: 9400 }]
---
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata: { name: dcgm-exporter, namespace: monitoring, labels: { release: kube-prom } }
spec:
  selector: { matchLabels: { app: dcgm-exporter } }
  endpoints: [{ port: metrics, interval: 15s }]
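Which fields the exporter publishes is controlled by a CSV collectors file, and some fields used on this page (the `*_PROF_*` fields, double-bit ECC) may not be in the default set of a given image. A sketch of supplying your own list, assuming the exporter reads the path in the `DCGM_EXPORTER_COLLECTORS` environment variable; the ConfigMap name and mount path are hypothetical, and the field list should be checked against the deployed version.

```yaml
apiVersion: v1
kind: ConfigMap
metadata: { name: dcgm-exporter-counters, namespace: monitoring }   # hypothetical name
data:
  counters.csv: |
    # DCGM field, Prometheus type, help text
    DCGM_FI_DEV_GPU_UTIL, gauge, GPU utilisation (in %).
    DCGM_FI_DEV_GPU_TEMP, gauge, GPU temperature (in C).
    DCGM_FI_DEV_POWER_USAGE, gauge, Power draw (in W).
    DCGM_FI_DEV_FB_USED, gauge, Framebuffer memory used (in MiB).
    DCGM_FI_DEV_FB_FREE, gauge, Framebuffer memory free (in MiB).
    DCGM_FI_DEV_XID_ERRORS, gauge, Value of the last XID error encountered.
    DCGM_FI_DEV_ECC_DBE_VOL_TOTAL, counter, Total double-bit volatile ECC errors.
    DCGM_FI_PROF_SM_ACTIVE, gauge, Ratio of cycles an SM has at least one warp assigned.
    DCGM_FI_PROF_PIPE_TENSOR_ACTIVE, gauge, Ratio of cycles the tensor pipe is active.
---
# Strategic-merge patch for the DaemonSet above (for example via kustomize)
apiVersion: apps/v1
kind: DaemonSet
metadata: { name: dcgm-exporter, namespace: monitoring }
spec:
  template:
    spec:
      containers:
        - name: dcgm-exporter
          env:
            - { name: DCGM_EXPORTER_COLLECTORS, value: /etc/dcgm-exporter/custom/counters.csv }
          volumeMounts:
            - { name: counters, mountPath: /etc/dcgm-exporter/custom, readOnly: true }
      volumes:
        - { name: counters, configMap: { name: dcgm-exporter-counters } }
```

Keeping the list short also limits series count, since every extra field is multiplied by the number of GPUs in the fleet.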

3. Alert rules (the leading indicators)

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata: { name: gpu-alerts, namespace: monitoring, labels: { release: kube-prom } }
spec:
  groups:
    - name: gpu.rules
      rules:
        - alert: GPUXidError               # gauge holding the last XID code, not a counter, so no increase()
          expr: DCGM_FI_DEV_XID_ERRORS > 0
          labels: { severity: critical }
          annotations: { summary: "XID {{ $value }} on {{ $labels.Hostname }} GPU {{ $labels.gpu }}" }
        - alert: GPUDoubleBitECC          # uncorrectable -> drain/RMA
          expr: increase(DCGM_FI_DEV_ECC_DBE_VOL_TOTAL[10m]) > 0
          labels: { severity: critical }
        - alert: GPUThermalThrottle        # temperature as a proxy for throttling; validate 87 C per SKU
          expr: DCGM_FI_DEV_GPU_TEMP > 87
          for: 5m
          labels: { severity: warning }
        - alert: GPUDown                   # exporter scrape failing, or no dcgm-exporter targets at all
          expr: up{job="dcgm-exporter"} == 0 or absent(up{job="dcgm-exporter"})
          for: 2m
          labels: { severity: critical }
        - alert: GPUStarvedAtHighUtil      # "busy" but not computing -> dataloader/comms bound
          expr: (avg by (Hostname,gpu) (DCGM_FI_PROF_SM_ACTIVE) < 0.20)
                and (avg by (Hostname,gpu) (DCGM_FI_DEV_GPU_UTIL) > 80)
          for: 15m
          labels: { severity: info }
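Alert rules can be checked offline before they reach the cluster. A minimal sketch using `promtool test rules`, assuming the `spec.groups` content of the PrometheusRule above has been copied into a plain Prometheus rules file; both file names and the `instance` value are hypothetical. It exercises only `GPUDown`: the exporter goes down at t=3m, the alert is still pending at 4m, and it fires once the 2m `for` window has elapsed.

```yaml
# gpu-alerts.test.yaml (hypothetical); run with: promtool test rules gpu-alerts.test.yaml
rule_files:
  - gpu-alerts.rules.yaml            # hypothetical: a file whose top-level key is `groups:`
evaluation_interval: 1m
tests:
  - interval: 1m
    input_series:
      - series: 'up{job="dcgm-exporter", instance="gpu-node-1:9400"}'
        values: '1 1 1 0 0 0 0'      # samples at t=0m..6m; scrape fails from t=3m
    alert_rule_test:
      - eval_time: 4m                # only 1m into the 2m "for" window: pending, not firing
        alertname: GPUDown
        exp_alerts: []
      - eval_time: 6m                # past the "for" window: firing
        alertname: GPUDown
        exp_alerts:
          - exp_labels:
              severity: critical
              job: dcgm-exporter
              instance: gpu-node-1:9400
```

The same pattern extends to the temperature and ECC rules by feeding synthetic `DCGM_FI_*` series, which is a cheap way to validate thresholds before changing them in production.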

4. Key PromQL (what to graph)

# Real work, not "util": fraction of SMs active per GPU
avg by (Hostname, gpu) (DCGM_FI_PROF_SM_ACTIVE)
# Tensor-core engagement (low on a training box => check precision/kernels)
avg by (Hostname) (DCGM_FI_PROF_PIPE_TENSOR_ACTIVE)
# Memory used fraction
DCGM_FI_DEV_FB_USED / (DCGM_FI_DEV_FB_USED + DCGM_FI_DEV_FB_FREE)
# Fleet power draw (kW)
sum(DCGM_FI_DEV_POWER_USAGE) / 1000
# Fleet GPU allocation vs capacity (with kube-state-metrics)
sum(kube_pod_container_resource_requests{resource="nvidia_com_gpu"})
  / count(DCGM_FI_DEV_GPU_UTIL)            # count of GPUs, not a sum of utilisation values

Recording rule for a cheap fleet "useful-utilisation" rollup:

- record: fleet:gpu_sm_active:avg
  expr: avg(DCGM_FI_PROF_SM_ACTIVE)
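The recording rule above is a bare list item; to load it through the Prometheus Operator it needs a PrometheusRule wrapper. A sketch, assuming the same namespace and release label as the alert rules; the object name, group name, evaluation interval, and the extra per-node rollup are illustrative additions, not part of the original rule.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata: { name: gpu-recording, namespace: monitoring, labels: { release: kube-prom } }   # hypothetical name
spec:
  groups:
    - name: gpu.recording
      interval: 1m                   # assumption: coarser than the 15s scrape is enough for a rollup
      rules:
        - record: fleet:gpu_sm_active:avg
          expr: avg(DCGM_FI_PROF_SM_ACTIVE)
        # Illustrative addition: the same rollup per node, for heatmaps and drill-down
        - record: node:gpu_sm_active:avg
          expr: avg by (Hostname) (DCGM_FI_PROF_SM_ACTIVE)
```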

5. Grafana dashboards

Provision the NVIDIA DCGM Exporter dashboard (grafana.com ID 12239) and the DCGM cluster view via the kube-prometheus-stack sidecar (label a ConfigMap grafana_dashboard: "1").
Build one "fleet health" board: SM-active heatmap, XID/ECC event table, temp/throttle, power, allocation vs capacity. Keep GPU-Util off the efficiency panels.
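The sidecar provisioning described above can be sketched as a labelled ConfigMap carrying dashboard JSON. This is a deliberately small starting point, not the full fleet-health board: the ConfigMap name, dashboard `uid`, and panel layout are hypothetical, the panels are plain time series rather than a heatmap, and no datasource is set, so Grafana's default datasource is assumed to be the Prometheus instance.

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: gpu-fleet-health-dashboard   # hypothetical name
  namespace: monitoring
  labels: { grafana_dashboard: "1" } # picked up by the kube-prometheus-stack Grafana sidecar
data:
  gpu-fleet-health.json: |
    {
      "title": "GPU fleet health",
      "uid": "gpu-fleet-health",
      "panels": [
        {
          "id": 1, "type": "timeseries", "title": "SM active by GPU",
          "gridPos": { "h": 8, "w": 12, "x": 0, "y": 0 },
          "targets": [ { "expr": "avg by (Hostname, gpu) (DCGM_FI_PROF_SM_ACTIVE)" } ]
        },
        {
          "id": 2, "type": "timeseries", "title": "Fleet power draw (kW)",
          "gridPos": { "h": 8, "w": 12, "x": 12, "y": 0 },
          "targets": [ { "expr": "sum(DCGM_FI_DEV_POWER_USAGE) / 1000" } ]
        }
      ]
    }
```

In practice it is easier to build the board in the Grafana UI, export the JSON, and commit that export into the ConfigMap.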

Don't-miss checklist

Scrape DCGM_FI_PROF_SM_ACTIVE / PIPE_TENSOR_ACTIVE, not just GPU_UTIL (observability).
Alert on XID, double-bit ECC, thermal throttle, exporter-down at minimum (reliability and RAS).
Route critical alerts to PagerDuty/Opsgenie, warnings to Slack; tie runbooks (the troubleshooting runbook) into annotations.
Grant the exporter the SYS_ADMIN capability it needs for profiling fields; without it the *_PROF_* metrics come back empty.
Keep all of this in git and apply via GitOps (SRE and MLOps practices).
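The routing item in the checklist can be sketched as an Alertmanager configuration passed through kube-prometheus-stack values. Assumptions: the `severity` labels match the alert rules on this page, receiver names are hypothetical, the `<...>` values are placeholders for real secrets (which belong in a secret store, not in git), and the default `"null"` receiver means unmatched alerts such as `severity: info` are not delivered anywhere.

```yaml
alertmanager:
  config:
    route:
      receiver: "null"               # default: anything not matched below is not delivered
      group_by: [alertname, Hostname]
      routes:
        - matchers:
            - severity="critical"
          receiver: pagerduty-critical
        - matchers:
            - severity="warning"
          receiver: slack-warnings
    receivers:
      - name: "null"
      - name: pagerduty-critical
        pagerduty_configs:
          - routing_key: <pagerduty-integration-key>
      - name: slack-warnings
        slack_configs:
          - api_url: <slack-webhook-url>
            channel: "#gpu-alerts"   # hypothetical channel
```

Grouping by `Hostname` keeps a node with several failing GPUs to one notification instead of one per GPU.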

Failure modes

Profiling fields (*_PROF_*) empty because the exporter lacks privileges or the GPU is in MIG mode without per-instance profiling.
Alerting on GPU_UTIL and missing a starved fleet running at 15% SM-active.
No XID/ECC alerts: hardware faults invisible until a job crashes (reliability and RAS).
Cardinality blow-up from per-PID metrics on a busy inference node.
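One guard against the cardinality failure mode above is an allow-list at scrape time, so only the metrics this page actually queries are stored. A sketch as a variant of the ServiceMonitor from section 2; the regex lists the fields used on this page, and whether per-process series appear under other metric names depends on how the exporter is configured.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata: { name: dcgm-exporter, namespace: monitoring, labels: { release: kube-prom } }
spec:
  selector: { matchLabels: { app: dcgm-exporter } }
  endpoints:
    - port: metrics
      interval: 15s
      metricRelabelings:
        # Keep only the listed metric names; every other scraped series is dropped.
        # The regex is fully anchored by Prometheus.
        - sourceLabels: [__name__]
          regex: DCGM_FI_(DEV_(XID_ERRORS|ECC_DBE_VOL_TOTAL|GPU_TEMP|GPU_UTIL|FB_USED|FB_FREE|POWER_USAGE)|PROF_(SM_ACTIVE|PIPE_TENSOR_ACTIVE))
          action: keep
```

The synthetic `up` series is added by Prometheus after relabelling, so the exporter-down alert keeps working with this filter in place.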

Open questions & validation

Confirm the exact DCGM_FI_* field set exposed by the deployed exporter version and adjust queries.
Validate alert thresholds (temp, ECC rate) against the specific GPU SKU and cooling envelope (datacentre readiness).
Decide how to compute MFU: derive it from achieved FLOPs measured in the training loop (distributed training) alongside SM-active, or use SM-active alone as a proxy.
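If the training-loop option is chosen, the fleet MFU rollup could look like the recording rule below. This is a sketch under loud assumptions: `train_achieved_flops_per_second` is a hypothetical gauge that the training job would have to export itself (one series per rank), and `989e12` is a placeholder for the per-GPU peak FLOP/s, to be replaced with the vendor figure for the deployed SKU and precision.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata: { name: gpu-mfu, namespace: monitoring, labels: { release: kube-prom } }   # hypothetical name
spec:
  groups:
    - name: gpu.mfu
      rules:
        # ASSUMPTION: train_achieved_flops_per_second is a hypothetical metric from the training job.
        # ASSUMPTION: 989e12 is a placeholder per-GPU peak FLOP/s; set it per SKU and precision.
        # Denominator counts every GPU the exporter sees, so this is fleet MFU, not per-job MFU.
        - record: fleet:mfu:ratio
          expr: sum(train_achieved_flops_per_second) / (count(DCGM_FI_DEV_GPU_UTIL) * 989e12)
```

The SM-active proxy needs no extra instrumentation, but it overstates MFU because an SM counts as active even when its tensor pipes are idle.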

References

dcgm-exporter: https://github.com/NVIDIA/dcgm-exporter
DCGM field IDs: https://docs.nvidia.com/datacenter/dcgm/latest/dcgm-api/dcgm-api-field-ids.html
kube-prometheus-stack: https://github.com/prometheus-community/helm-charts/tree/main/charts/kube-prometheus-stack
NVIDIA DCGM Grafana dashboard 12239: https://grafana.com/grafana/dashboards/12239
Prometheus Operator (ServiceMonitor/PrometheusRule): https://prometheus-operator.dev/docs/

Related: Observability · Reliability · Optimization · Runbook · K8s Platform · Practices · Glossary