
GPU observability and DCGM monitoring

Scope: how to see what the GPUs and the cluster are actually doing, covering the metrics that matter, the telemetry stack, profiling, logging, and alerting. Observability is the eyes for both ops (reliability and RAS) and optimisation (performance tuning).

flowchart LR
  DCGM["DCGM exporter"] --> PROM["Prometheus"]
  PROM --> GRAFANA["Grafana"]
  PROM --> ALERTS["Alertmanager"]
  LOGS["dmesg and NCCL logs"] --> TRIAGE["Incident triage"]
  PROFILES["Nsight profiles"] --> TUNING["Performance tuning"]

Overview

The hardest part of GPU observability is that the obvious number lies. "GPU utilisation" says a kernel was resident, not that the silicon did useful work; a starved GPU reads 100% busy. The skill is collecting the signals that reflect real work (SM and Tensor-Core activity, memory bandwidth, MFU), wiring them into a stack that pages on the right events, and reaching for a profiler instead of guessing when something is slow.

Core knowledge

The metrics that matter (and the big misconception)

nvidia-smi GPU-Util / DCGM DCGM_FI_DEV_GPU_UTIL reports the percentage of the sample window during which at least one kernel was executing on the GPU. That makes it a coarse SM-activity signal, not a substitute for goodput or MFU: a GPU at 100% "util" can be 90% idle waiting on memory or comms. It still carries an actionable threshold: sustained utilisation below 50% points to small batches or unfused kernels, so raise the batch size or fuse kernels, then profile (book Table 16-1).
The real signals: SM activity and occupancy (DCGM_FI_PROF_SM_ACTIVE, DCGM_FI_PROF_SM_OCCUPANCY), Tensor-Core activity (DCGM_FI_PROF_PIPE_TENSOR_ACTIVE), memory bandwidth (DCGM_FI_PROF_DRAM_ACTIVE), NVLink/PCIe throughput, and ultimately MFU (distributed training). A reading of 100% util with 5% SM-active and 2% tensor-active means the GPU is starved, not working.
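The distinction above can be sketched as a small classifier. This is a minimal illustration, not vendor guidance: the `GpuSample` type, the label names, and every threshold except the 50% utilisation figure are assumptions chosen for the example. It also assumes GPU-Util arrives as a percentage and the DCGM profiling fields as ratios between 0 and 1.

```python
from dataclasses import dataclass


@dataclass
class GpuSample:
    gpu_util: float       # DCGM_FI_DEV_GPU_UTIL, percent (0-100)
    sm_active: float      # DCGM_FI_PROF_SM_ACTIVE, ratio (0.0-1.0)
    tensor_active: float  # DCGM_FI_PROF_PIPE_TENSOR_ACTIVE, ratio (0.0-1.0)


def classify(sample, busy_util=90.0, min_sm_active=0.30, min_tensor_active=0.10):
    """Label one sample. All thresholds except 50% util are illustrative."""
    if sample.gpu_util < 50.0:
        # Sustained low util: small batches or unfused kernels are likely.
        return "underused"
    if sample.gpu_util >= busy_util and sample.sm_active < min_sm_active:
        # A kernel is resident but the SMs are mostly waiting.
        return "starved"
    if sample.tensor_active < min_tensor_active:
        # SMs are busy, but little of the work reaches the Tensor Cores.
        return "busy-no-tensor"
    return "working"


def mfu(achieved_flops_per_s, peak_flops_per_s):
    """Model FLOPs utilisation: achieved model FLOP/s over hardware peak."""
    return achieved_flops_per_s / peak_flops_per_s
```

With the numbers from the text, `classify(GpuSample(100.0, 0.05, 0.02))` returns `"starved"`. In practice the decision should be made on a sustained window rather than a single sample.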

The telemetry stack

DCGM (Data Center GPU Manager) + dcgm-exporter: the standard fleet source for profiling metrics, health, ECC/XID, thermals, power, clocks. Deployed via the GPU Operator (Kubernetes for GPUs).
Pipeline: dcgm-exporter → Prometheus → Grafana (NVIDIA ships dashboards), node_exporter for host, Alertmanager for paging.
Logs: kernel/dmesg for XID events (reliability and RAS), framework logs, and NCCL debug (NCCL_DEBUG=INFO|WARN) for collective problems. Fabric counters come from UFM (networking fabric).
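As a rough sketch of the log side, XID events can be pulled out of kernel log lines with a regular expression. The line shape assumed here (`NVRM: Xid (PCI:...): <code>, <detail>`) matches what the NVIDIA driver commonly prints, but the exact format can vary by driver version, so treat the pattern as a starting point. The function name and the sample lines in the usage note are hypothetical.

```python
import re

# Assumed line shape (may differ across driver versions):
#   NVRM: Xid (PCI:0000:3b:00): 79, pid=1234, GPU has fallen off the bus.
XID_RE = re.compile(
    r"NVRM: Xid \((?P<pci>PCI:[0-9a-fA-F:.]+)\): (?P<xid>\d+)(?:, (?P<detail>.*))?"
)


def parse_xid_events(lines):
    """Return one dict per kernel log line that carries an XID event."""
    events = []
    for line in lines:
        match = XID_RE.search(line)
        if match is None:
            continue  # not an XID line
        events.append(
            {
                "pci": match.group("pci"),
                "xid": int(match.group("xid")),
                "detail": (match.group("detail") or "").strip(),
            }
        )
    return events
```

Feeding it the output of `dmesg` or `journalctl -k` split into lines gives a list that can be shipped to a central store or turned into a counter for alerting.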

Profiling (when a number is bad, find out why)

Nsight Systems (nsys): a system timeline covering CPU/GPU overlap, comms gaps, dataloader stalls. The first tool for "why is the training step slow".
Nsight Compute (ncu): single-kernel deep dive, roofline, occupancy limiters.
PyTorch Profiler + Holistic Trace Analysis / TensorBoard for framework-level traces; CUPTI underneath. nvidia-smi dmon/pmon for a quick live look (the GPU software stack).
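For the "why is the training step slow" case, a first Nsight Systems capture can be scripted. The sketch below only builds the argument list for `nsys profile` and does not execute anything. The helper name, the default output name, and the training command in the usage note are assumptions; the trace selection is one reasonable starting set, not a recommendation from the source.

```python
def nsys_command(target, output="step_profile", traces=("cuda", "nvtx", "osrt")):
    """Build an argv list for `nsys profile`; the caller decides whether to run it.

    target: the command to profile, as a list of strings.
    output: base name of the report file (hypothetical default).
    traces: APIs to trace; CUDA, NVTX ranges, and OS runtime calls here.
    """
    return [
        "nsys",
        "profile",
        "--trace=" + ",".join(traces),
        "--output=" + output,
        "--force-overwrite=true",
        *target,
    ]
```

For example, `nsys_command(["python", "train.py"])` yields a list that can be passed to `subprocess.run`. Limiting the run to a few dozen steps keeps the report small enough to read comfortably in the timeline view.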

What to watch in production

Thermal and power throttle reasons (clocks_throttle_reasons), ECC error counts, XID errors, power vs cap, NVLink errors, pump/fan and facility signals (datacentre readiness), per-job GPU/memory use, and fabric health (networking fabric).
Per-link NVLink throughput is not exported by default. dcgm-exporter ships only the aggregate DCGM_FI_DEV_NVLINK_BANDWIDTH_TOTAL; the per-link fields (DCGM_FI_DEV_NVLINK_TX/RX_THROUGHPUT_L*) are commented out. For link-level bandwidth, enable them in the exporter's counters file, query DCGM directly, or fall back to nvidia-smi nvlink / Nsight (book Ch. 16).
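The throttle reasons mentioned above are reported by NVML as a bitmask, and DCGM exposes the same mask as a device field (the field name varies by DCGM version). A minimal decoder might look like the following. The bit values follow the NVML `nvmlClocksThrottleReason*` constants; the short names and the choice of which reasons are worth alerting on are assumptions made for this example.

```python
# Bit values follow NVML's nvmlClocksThrottleReason* constants.
THROTTLE_BITS = {
    0x001: "gpu_idle",
    0x002: "applications_clocks_setting",
    0x004: "sw_power_cap",
    0x008: "hw_slowdown",
    0x010: "sync_boost",
    0x020: "sw_thermal_slowdown",
    0x040: "hw_thermal_slowdown",
    0x080: "hw_power_brake_slowdown",
    0x100: "display_clock_setting",
}

# Illustrative choice: idle and power-cap throttling are usually benign,
# while thermal and hardware slowdowns deserve attention.
ALERTABLE = {
    "hw_slowdown",
    "sw_thermal_slowdown",
    "hw_thermal_slowdown",
    "hw_power_brake_slowdown",
}


def decode_throttle_reasons(mask):
    """Return the names of all throttle reasons set in the bitmask."""
    return [name for bit, name in THROTTLE_BITS.items() if mask & bit]


def should_alert(mask):
    """True when any active reason is in the alertable set."""
    return any(name in ALERTABLE for name in decode_throttle_reasons(mask))
```

A mask of `0x44` decodes to a software power cap plus a hardware thermal slowdown, and only the latter triggers `should_alert`.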

The loop, and the MLOps adjacency

Closed loop: low real utilisation → profile with nsys → locate the bottleneck (dataloader/comms/kernel) → fix (performance tuning). Health degradation → gate the scheduler / drain the node (provisioning and scheduling, reliability and RAS).
Distinct but adjacent: experiment tracking (Weights & Biases, MLflow, TensorBoard) records training runs and metrics. It is an MLOps concern that sits alongside infra monitoring but is not the same thing.

Don't-miss checklist

Don't report GPU-Util as a goodput/efficiency substitute (use SM-active / tensor-active / MFU for that). Do act on it as an SM-activity signal: sustained < 50% means raise batch size or fuse kernels.
Run dcgm-exporter + Prometheus + Grafana as the baseline; alert on XID errors, ECC ramps, thermal throttling, and NVLink down events.
Centralise dmesg/XID collection; it is the first stop for hardware faults (reliability and RAS).
Reach for nsys before guessing at a slow step.
Feed health metrics into scheduling so degraded nodes drain (provisioning and scheduling).
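The alerting and drain items in this checklist can be combined into one decision function. The sketch below is an assumption-heavy illustration: the one-hour window, the error budget, the action names, and the default set of XID codes that force a drain (48 and 79 here, as examples only) are all hypothetical and should be replaced with values from your own fleet policy.

```python
def ecc_ramp(samples, window_s=3600, max_new_errors=10):
    """Detect a rising ECC error count.

    samples: list of (unix_timestamp, cumulative_error_count), oldest first.
    Returns True when more than max_new_errors accumulated within window_s
    of the latest sample. A counter reset (count going down) reads as False.
    """
    if len(samples) < 2:
        return False
    latest_ts, latest_count = samples[-1]
    in_window = [(ts, count) for ts, count in samples if latest_ts - ts <= window_s]
    baseline = in_window[0][1]
    return latest_count - baseline > max_new_errors


def node_action(xids, ecc_samples, throttle_alert, nvlink_down,
                drain_xids=frozenset({48, 79})):
    """Map health signals to 'drain', 'page', or 'ok' (hypothetical policy)."""
    if nvlink_down or any(xid in drain_xids for xid in xids):
        return "drain"  # stop scheduling new work on this node
    if ecc_ramp(ecc_samples) or throttle_alert:
        return "page"   # leading indicator: look before it becomes a crash
    return "ok"
```

The point of the split is that leading indicators page a human while hard faults gate the scheduler directly, so a degraded node stops receiving jobs without waiting for triage.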

Failure modes

Presenting "100% GPU util" as proof of efficiency while MFU is 20%.
No XID/dmesg aggregation: hardware faults invisible until a job crashes.
Alerting on noise, not on the leading indicators (ECC ramp, rising temps, NVLink flaps).
Optimising by guesswork because nothing was profiled.

Open questions & validation

Validate DCGM profiling metrics (SM-active, tensor-active, DRAM-active) against a live job; field availability varies by DCGM version and GPU.
Run Nsight Systems on a real training step end-to-end and practise reading the timeline.
Write XID/ECC alerting rules that page before a crash, not after.

References

DCGM: https://docs.nvidia.com/datacenter/dcgm/latest/index.html
dcgm-exporter: https://github.com/NVIDIA/dcgm-exporter
DCGM profiling metrics / field IDs: https://docs.nvidia.com/datacenter/dcgm/latest/dcgm-api/dcgm-api-field-ids.html
Nsight Systems: https://docs.nvidia.com/nsight-systems/ · Nsight Compute: https://docs.nvidia.com/nsight-compute/
PyTorch Profiler: https://docs.pytorch.org/docs/stable/profiler.html
dcgm-exporter default counters (per-link NVLink disabled by default): https://github.com/NVIDIA/dcgm-exporter/blob/main/etc/default-counters.csv
Chris Fregly, AI Systems Performance Engineering (O'Reilly), Ch. 16: GPU-Util as SM-activity signal and Table 16-1 thresholds; dcgm-exporter per-link NVLink caveat.

Related: Fabric · Physical · Provisioning · Software Stack · Reliability · Agentic AIOps · Optimization · Glossary