GPU diagnostics & validation

Scope: the tooling that proves a GPU node is healthy enough to take work (dcgmi diag run levels, DCGM health watches, nvbandwidth, gpu-burn, and nvidia-smi dmon/pmon), when to run each (acceptance, pre-job gate, incident triage), and how to read pass/fail. This is GPU-level validation; for the interconnect (IB/RoCE/NVLink line-rate proof) see Fabric Bring-Up, Validation and Benchmarking, and for what the failures mean see Reliability, RAS and Failure Modes.

Every command below is a reference template, not hardware-tested. DCGM test coverage, run-level timings, and plugin names vary by DCGM and driver version. Confirm against dcgmi diag --help and the DCGM docs for your installed version before scripting a gate against them. Treat all printed numbers as illustrative, not targets.

What it is

Four tools cover almost all GPU node validation:

DCGM (dcgmi), Data Center GPU Manager. Two distinct surfaces: health watches (dcgmi health), passive background monitoring of the GPU's own error channels, and active diagnostics (dcgmi diag), which run real workloads and grade the result. The diagnostic is the fleet-standard "is this GPU good?" tool; NVIDIA describes it as working "by running real workloads and analyzing the results" and notes that the active levels "require exclusive access to the target GPUs."² It ships with the datacenter driver stack (GPU Software Stack and Node Administration).
nvbandwidth (NVIDIA/nvbandwidth) measures realised memcpy bandwidth across H2D / D2H / device-to-device (P2P) paths using copy-engine or SM copy methods.³ Higher-level DCGM run levels also invoke an nvbandwidth plugin internally.¹
gpu-burn (wilicc/gpu-burn) is a third-party CUDA matrix-multiply stress test that runs the GPU hot for a set duration and checks every result for computation errors.⁴ It is not an NVIDIA tool; it complements dcgmi diag for sustained-load soak.
nvidia-smi dmon / pmon are the live device and per-process monitors built into nvidia-smi. They are not pass/fail tests; they are the watch-window you keep open while a stress runs or an incident unfolds.⁵ Full flag reference: nvidia-smi Reference.

These sit on top of the driver, kernel modules, and (on NVSwitch systems) Fabric Manager. If those are wrong, diagnostics fail for reasons that are not the silicon. Rule them out first (NVIDIA Kernel Modules, Fabric Manager, GPU Firmware and GSP).

Why it's needed (and when)

A GPU that enumerates in nvidia-smi is not necessarily a GPU that is fit for a multi-hour collective job. ECC errors can be accumulating, an HBM row-remap can be pending, a PCIe/NVLink lane can have dropped, or a board can throttle under sustained power. Validation tooling exists to catch that before a job lands, and to adjudicate after a fault whether a node returns to the pool or goes to RMA. Three moments call for three different levels of effort:

```mermaid
flowchart LR
  ACC["Acceptance / burn-in"] --> GATE["Pre-job health gate"]
  GATE --> RUN["Job running"]
  RUN --> INC["Incident triage"]
  INC --> DEC{"diag pass?"}
  DEC -->|"yes"| GATE
  DEC -->|"no"| RMA["Drain / RMA"]
```

Acceptance / burn-in (slow, thorough). New or re-racked hardware: long DCGM diagnostics plus a gpu-burn soak to clear infant mortality (the front of the bathtub curve, Reliability, RAS and Failure Modes). This is where dcgmi diag -r 3 / -r 4 and a multi-hour burn belong. Pair with the fabric proof in Fabric Bring-Up, Validation and Benchmarking. See also Commissioning and Acceptance.
Pre-job health gate (fast, non-destructive). Every scheduling decision: cheap dcgmi diag -r 1 plus a dcgmi health -c read of the watch state. Fast enough to run on node entry without materially delaying jobs; it catches a node that drifted unhealthy since last use. Health gating keeps jobs off known-bad nodes (Reliability, RAS and Failure Modes).
Incident triage (targeted). After an XID/SXID, an NCCL hang, or a throttle alert: re-run dcgmi diag -r 3 post-reset to decide return-to-pool vs RMA, and keep dmon/pmon open to see whether the failure recurs. This is the Reset -> Diag -> Healthy|RMA arc in the RAS state machine (Reliability, RAS and Failure Modes).

Rule of thumb: passive health watches run always; the cheap diagnostic gates every job; the expensive diagnostic and the burn run at acceptance and at incident close, not in the hot path.
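The rule of thumb above can be written down as a small lookup, so a gate script asks for a plan by lifecycle moment instead of hard-coding run levels. This is a minimal sketch, not a tested procedure: the function name `validation_plan`, the phase labels, the group ID `1`, and the `BURN_SECONDS` default are all illustrative placeholders. It only prints the commands; it does not run them.

```shell
# Print (not run) the validation commands for one lifecycle moment.
# Phase names, the group ID (1) and BURN_SECONDS are hypothetical placeholders.
validation_plan() {
    case "$1" in
        acceptance)
            echo "dcgmi diag -r 3"                    # or -r 4 for the extended pass
            echo "./gpu_burn ${BURN_SECONDS:-7200}"   # multi-hour soak; pick your own duration
            ;;
        prejob)
            echo "dcgmi diag -r 1"                    # cheap active check
            echo "dcgmi health -g 1 -c"               # passive watch read
            ;;
        triage)
            echo "dcgmi diag -r 3"                    # run after the GPU reset
            ;;
        *)
            echo "unknown phase: $1" >&2
            return 2
            ;;
    esac
}
```

Calling `validation_plan prejob` lists the two cheap checks; an unknown phase returns non-zero so a wrapper can fail closed.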

How it's installed & managed

DCGM ships as the datacenter-gpu-manager package from NVIDIA's CUDA repository (or bundled by the GPU Operator). The active diagnostic, EUD, and higher-level plugins need the matching DCGM build for your driver. Pin them together (Driver Install and Lifecycle). DCGM runs a host engine, nv-hostengine; dcgmi talks to it either embedded (started for the command) or against a standalone daemon. Profiling/diagnostic paths require administrator privileges.²

Reference template, not hardware-tested.

```shell
# DCGM present and the host engine reachable
dcgmi discovery --list                 # enumerate GPUs DCGM sees
nv-hostengine --version                # or: dcgmi --version
```

nvbandwidth and gpu-burn are built from source on the node (or baked into a validation container). Both need the CUDA toolkit present (CUDA Toolkit and Runtime).

```shell
# nvbandwidth (build needs CUDA, CMake 3.20+, Boost program_options)
git clone https://github.com/NVIDIA/nvbandwidth
cd nvbandwidth
sudo apt-get install -y libboost-program-options-dev
cmake . && make

# gpu-burn (third party; binary is gpu_burn with an underscore)
git clone https://github.com/wilicc/gpu-burn
cd gpu-burn
make
```

nvidia-smi dmon/pmon need nothing extra. They are part of the driver-installed nvidia-smi.

Persistence and clocks. Run validation with Persistence Mode on, so the per-job ECC-scrub re-init and clock drift do not contaminate results, and with locked clocks if you want run-to-run comparability (nvidia-smi Reference).

Exclusivity. The active dcgmi diag levels "require exclusive access to the target GPUs"²; cordon/drain the node first or the diagnostic will contend with (or fail because of) a resident job.
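Both preconditions can be checked before an active diagnostic starts. The sketch below is a hedged pre-flight, not a substitute for cordoning the node in your scheduler: the helper names are hypothetical, and each helper reads `nvidia-smi` query output on stdin so it can be dry-run without a GPU. It assumes `persistence_mode` reports `Enabled` per GPU and that an idle node produces no rows from the compute-apps query.

```shell
# Both helpers read nvidia-smi query output on stdin (hypothetical helper names).

# Succeeds only if at least one GPU is listed and every one reports "Enabled".
persistence_all_enabled() {
    awk 'NF { n++; if ($1 != "Enabled") bad++ }
         END { exit !(n > 0 && bad == 0) }'
}

# Succeeds only if no compute process is listed (no PID digits in the output).
no_compute_apps() {
    ! grep -q '[0-9]'
}

# Usage on a node (not executed here):
#   nvidia-smi --query-gpu=persistence_mode --format=csv,noheader | persistence_all_enabled
#   nvidia-smi --query-compute-apps=pid --format=csv,noheader | no_compute_apps
```

The process check only sees compute contexts known to the driver at that instant; it narrows the window for contention but does not stop a job from landing mid-diagnostic, which is what the cordon is for.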

Validated usage & tests

Reference template, not hardware-tested. Run on an idle, cordoned node; the descriptions are the expected shape of healthy output, never numbers measured on hardware.

DCGM active diagnostic: `dcgmi diag -r`

The -r argument selects a run level; higher levels run a superset of tests and take longer. NVIDIA's documented levels:¹

| Level | Name | Adds over previous (representative plugins) | Documented runtime (8-GPU) |
|---|---|---|---|
| `-r 1` | Quick (System Validation) | Software/deployment, PCIe+NVLink, GPU memory, memory bandwidth | < ~2.5 s |
| `-r 2` | Medium (Extended System Validation) | Same plugins as `-r 1`, run longer (more iterations) | < ~10.5 min |
| `-r 3` | Long (System HW Diagnostics) | + Diagnostic (EUD), Targeted Stress, Targeted Power, NVBandwidth, NCCL | < ~35 min |
| `-r 4` | Extended (Longer-running HW Diagnostics) | + Memtest, Pulse | < ~2.25 h |

Exact plugin membership and timings are version-dependent. dcgmi diag --help is authoritative for your build.¹

```shell
dcgmi diag -r 1                         # quick pre-job gate
dcgmi diag -r 3                         # acceptance / post-reset adjudication
dcgmi diag -r 3 -i 0,1                  # restrict to specific GPUs
dcgmi diag -r 3 -j                      # JSON, for a gate to parse
dcgmi diag -r pcie,diagnostic           # run named plugins only
```

Reading the result: DCGM prints a per-test table with Pass / Fail (and skips). Any Fail fails the node: do not return it to the pool. A Fail carries an error category and severity; an isolating error (DCGM_FR_* such as a fallen-off-the-bus or thermal-violation code) means drain, not retry.² Map the failing plugin to the fault class in Reliability, RAS and Failure Modes: a PCIe/NVLink failure is a bus/link fault, a memory or memtest failure points at HBM/ECC, a targeted-power or pulse failure points at board/power delivery. For machine consumption use -j and gate on the result field rather than scraping the table.
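A gate over the `-j` output might look like the sketch below. It is deliberately crude and rests on a loud assumption: that each test result carries a `"status"` key whose value is `"Pass"` or `"Fail"`. Check that against the JSON your DCGM version actually emits before relying on it; a JSON-aware parser is the more robust choice. The function name `diag_gate` is hypothetical. It fails closed: empty output, or output with no passing test at all, is treated as a drain.

```shell
# Decide pass/drain from a saved `dcgmi diag ... -j` output file.
# ASSUMPTION: results appear as "status": "Pass" / "status": "Fail" pairs.
diag_gate() {
    if [ ! -s "$1" ]; then
        echo "DRAIN: no diagnostic output"       # missing or empty file
        return 2
    fi
    if grep -Eq '"status"[[:space:]]*:[[:space:]]*"Fail"' "$1"; then
        echo "DRAIN: at least one test failed"
        return 1
    fi
    if ! grep -Eq '"status"[[:space:]]*:[[:space:]]*"Pass"' "$1"; then
        echo "DRAIN: no passing test found"      # e.g. an error document, not results
        return 2
    fi
    echo "PASS"
}

# Usage on a cordoned node (not executed here):
#   dcgmi diag -r 1 -j > /tmp/diag.json; diag_gate /tmp/diag.json
```

In a real gate it is also worth treating a non-zero exit status from `dcgmi` itself as a failure, independent of what the JSON says.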

DCGM health watches: `dcgmi health`

Health watches are the passive side: DCGM watches the GPU's error channels (PCIe, memory/ECC, InfoROM, thermal, power, NVLink, NVSwitch, driver) and reports a system-by-system verdict without running a workload.² Set the watches once on a group, then poll the check cheaply:

```shell
dcgmi health -g 1 -s mpi               # watch memory, PCIe, InfoROM on group 1 (confirm flag letters with dcgmi health --help)
dcgmi health -g 1 -c                   # check current health, non-invasive
```

Expect an overall Healthy and per-system Healthy; a Warning/Failure names the system (e.g. Memory, Thermal, NVLink) and is your pre-job stop signal. This is the watch you can run while jobs are live (it is non-invasive), whereas dcgmi diag is not.²
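The pre-job stop signal can be reduced to an exit status. The sketch below assumes the text report from the check contains a line with `Overall Health` followed by the verdict word `Healthy`; that layout is an assumption about the output format, so compare it with what your DCGM version prints. `health_gate` is a hypothetical name. Anything other than an explicit healthy verdict, including an error message with no verdict line, stops the job.

```shell
# Read `dcgmi health -g <group> -c` text output on stdin.
# ASSUMPTION: the report has an "Overall Health" line carrying the verdict word.
health_gate() {
    awk '/Overall Health/ { seen = 1; if ($0 ~ /Healthy/) ok = 1 }
         END {
             if (seen && ok) { print "HEALTHY"; exit 0 }
             print "STOP"; exit 1      # Warning, Failure, or no verdict at all
         }'
}

# Usage (not executed here; group 1 is a placeholder):
#   dcgmi health -g 1 -c | health_gate
```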

Bandwidth: `nvbandwidth`

Confirm H2D/D2H and device-to-device (P2P) paths reach the expected fraction of link rate. A path that collapses to a small fraction of its neighbours is a degraded lane or a topology that fell back to PCIe.

```shell
./nvbandwidth -l                                   # list testcases (authoritative per CUDA version)
./nvbandwidth                                      # run all
./nvbandwidth -t host_to_device_memcpy_ce          # H2D via copy engine
./nvbandwidth -t device_to_host_memcpy_ce          # D2H
./nvbandwidth -t device_to_device_memcpy_read_ce   # D2D / P2P read (NVLink path on NVSwitch nodes)
```

Run -l on the build for the exact testcase names. The set varies by CUDA version.³ Interpretation is relative, not absolute: compare each GPU/link against its peers on the same node and against the line-rate ceiling, never against a number copied from another machine. For NVLink/NVSwitch line-rate proof and the Fabric Manager prerequisite, defer to Fabric Bring-Up, Validation and Benchmarking and NVSwitch and NVLink.
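The peer-relative reading can be sketched as a filter. It does not parse nvbandwidth output; it assumes you have already extracted one `label value` pair per line (for example one row per GPU or link) from a single run on a single node. The helper name `flag_slow_links` and the default fraction of 0.8 are hypothetical placeholders, not recommended thresholds. The comparison is against the median of the peers, so no absolute number from another machine is involved.

```shell
# Read "label GB/s" pairs on stdin; print labels that fall below FRACTION of
# the peer median and exit 1 if any do. FRACTION (default 0.8) is a placeholder.
flag_slow_links() (
    fraction="${1:-0.8}"
    rows=$(awk 'NF >= 2 { print $1, $2 }')
    median=$(printf '%s\n' "$rows" | awk 'NF { print $2 }' | sort -n |
        awk '{ v[NR] = $1 }
             END {
                 if (NR == 0) exit
                 if (NR % 2) print v[(NR + 1) / 2]
                 else print (v[NR / 2] + v[NR / 2 + 1]) / 2
             }')
    # Fail closed when there is nothing to compare.
    [ -n "$median" ] || { echo "no samples" >&2; exit 2; }
    printf '%s\n' "$rows" |
        awk -v m="$median" -v f="$fraction" '
            $2 + 0 < m * f { print $1; bad = 1 }
            END { exit (bad ? 1 : 0) }'
)
```

A median only exposes outliers: if every link on the node is degraded equally, this filter stays quiet, which is why the comparison against the line-rate ceiling still matters.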

Stress / soak: `gpu-burn`

gpu-burn runs GEMMs continuously for the given number of seconds and verifies the results. It surfaces GPUs that compute incorrectly under sustained heat and current, which are the conditions under which infant-mortality and marginal-board failures show up.⁴

```shell
./gpu_burn 60          # 60-second smoke (single-precision)
./gpu_burn -d 3600     # 1-hour double-precision soak
./gpu_burn -tc 3600    # use Tensor cores where available
./gpu_burn -i 0 600    # restrict to GPU 0
```

A clean run reports each GPU completing with no computation errors (OK); any reported errors, or a thermal trip / Xid during the run, condemn the board pending Reliability, RAS and Failure Modes triage. Keep dmon open alongside it (below) to watch temperature, power, and throttle reasons climb under the load. gpu-burn is third-party. Treat it as a soak complement to dcgmi diag, not a replacement for the NVIDIA diagnostic.
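For an unattended soak, the final summary can be turned into a verdict. The sketch below assumes the run ends with one line per GPU of the form `GPU <n>: OK` or `GPU <n>: FAULTY`; that format is an assumption about gpu-burn's output and should be checked against the build you run. `burn_verdict` is a hypothetical name. A log with no `OK` line at all (a crashed or truncated run) is treated as a failure.

```shell
# Read a gpu_burn log on stdin and reduce it to PASS/FAIL.
# ASSUMPTION: the summary prints "GPU <n>: OK" or "GPU <n>: FAULTY" per device.
burn_verdict() {
    awk '/GPU [0-9]+: FAULTY/ { bad++ }
         /GPU [0-9]+: OK/     { ok++ }
         END {
             if (bad > 0 || ok == 0) { print "FAIL"; exit 1 }
             print "PASS"
         }'
}

# Usage (not executed here):
#   ./gpu_burn 3600 | tee burn.log | burn_verdict
```

This covers only computation errors reported by the tool; a thermal trip or Xid during the run still has to be picked up from the driver logs and the dmon window.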

Live watch: `nvidia-smi dmon` / `pmon`

These are not tests, just the windows you keep open during a stress or an incident. dmon rolls one line per device per sample; pmon attributes utilization and memory to PIDs. Select metric groups with -s (p power/temp, u utilization, c clocks, v power/thermal violations, m memory, e ECC + PCIe replay, t PCIe throughput), set interval with -d, bound the run with -c.⁵

```shell
nvidia-smi dmon -s pucvet -d 1     # power/temp, util, clocks, violations, ECC, PCIe throughput @1s
nvidia-smi dmon -s pe -c 120 -i 0  # GPU0: power/temp + ECC/replay, 120 samples then exit
nvidia-smi pmon -i 0,1 -s um       # per-process SM/mem utilization on GPU0,1
```

During a soak, watch gtemp/mtemp plateau (not climb to the throttle point), the violation (v) and ECC (e) columns stay zero, and clocks hold. During an incident, pmon tells you which PID owns a hung or memory-pegged GPU. Full flag and column reference: nvidia-smi Reference.

Failure modes

Diagnostic fails on a healthy GPU because the stack is wrong. Missing/mismatched kernel module, a down Fabric Manager on an NVSwitch box, or a GSP firmware/driver mismatch makes dcgmi diag fail for non-silicon reasons. Rule these out before condemning a board: Kernel upgrade, GPU missing, Fabric Manager Failure, GSP Firmware / Driver Mismatch.
Diagnostic contends with a running job. The active levels need exclusive access²; running them on a busy node yields false failures or refuses to start. Cordon/drain first.
Memory / memtest plugin fails, or ECC column ramps under dmon. This points at HBM/ECC degradation or a pending row-remap. Adjudicate and, on a remap failure, RMA. Toggle/recovery flow: ECC Toggle Recovery; fault classification: Reliability, RAS and Failure Modes.
nvbandwidth shows one link far below its peers. That is a degraded PCIe/NVLink lane or a fabric that fell back to PCIe. Confirm topology and link state, then take it to the fabric procedure (Fabric Bring-Up, Validation and Benchmarking, NVSwitch and NVLink).
gpu-burn reports computation errors or trips thermal. That is a marginal board under sustained load. Drain and RMA-triage (Reliability, RAS and Failure Modes); a thermal trip may be cooling, not silicon (cross-check the thermal path before condemning the GPU).
Trusting a green health watch as proof of fitness. dcgmi health -c is passive: it reads channels, it does not exercise the GPU. A clean health read with no recent active diagnostic is not acceptance; run dcgmi diag for that.¹
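For routing a failed node to the right procedure, the plugin-to-fault-class mapping described under the active diagnostic can be held as a lookup. This is a sketch: `fault_class` is a hypothetical name and the plugin spellings are assumptions, so take the real names from the output of your DCGM build. Anything unrecognised is reported as unclassified rather than guessed.

```shell
# Map a failing dcgmi diag plugin name to the fault class used in triage.
# Plugin spellings are assumptions; confirm them against your DCGM version.
fault_class() {
    case "$1" in
        pcie|nvlink)                     echo "bus/link" ;;
        memory|memtest)                  echo "HBM/ECC" ;;
        targeted_power|pulse|pulse_test) echo "board/power delivery" ;;
        *)                               echo "unclassified" ;;
    esac
}
```

A failure classed as bus/link or unclassified should still go through the stack checks listed above first, since a wrong driver or Fabric Manager state can fail those plugins on healthy silicon.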

References

DCGM Diagnostics (run levels 1-4, plugins, exclusive-access requirement, -r/-i/-j syntax, runtimes): https://docs.nvidia.com/datacenter/dcgm/latest/user-guide/dcgm-diagnostics.html
DCGM Feature Overview (dcgmi health watches and categories, embedded vs standalone host engine, error categories): https://docs.nvidia.com/datacenter/dcgm/latest/user-guide/feature-overview.html
DCGM NVBandwidth plugin (bandwidth tests inside the diagnostic): https://docs.nvidia.com/datacenter/dcgm/latest/user-guide/diag-nvbandwidth-plugin.html
NVIDIA/nvbandwidth (build, -l list, -t testcase names, H2D/D2H/D2D): https://github.com/NVIDIA/nvbandwidth
wilicc/gpu-burn (build, gpu_burn binary, -d/-tc/-m/-i flags, duration-in-seconds argument): https://github.com/wilicc/gpu-burn
nvidia-smi manual (dmon/pmon options, -s metric letters, default 1 s interval): https://docs.nvidia.com/deploy/nvidia-smi/index.html

1. NVIDIA DCGM Diagnostics: run levels 1 Quick / 2 Medium / 3 Long / 4 Extended with the documented per-level plugin sets (software/deployment, PCIe+NVLink, GPU memory, memory bandwidth at r1; r2 runs the same r1 plugins for longer with no new plugin; + Diagnostic/EUD, Targeted Stress, Targeted Power, NVBandwidth, NCCL at r3; + Memtest, Pulse at r4) and 8-GPU runtime bounds; -r <1-4|test_name>, -i <entity list>, -j JSON. https://docs.nvidia.com/datacenter/dcgm/latest/user-guide/dcgm-diagnostics.html
2. NVIDIA DCGM Feature Overview: dcgmi health watch system with -s watch flags and -c check; passive/background health watches are non-invasive and run while jobs are live; failure systems (Driver, PCIe, Memory, InfoROM, Thermal, Power, NVLink, NVSwitch, ConnectX) with DCGM_FR_* error codes; dcgmi against an embedded or standalone nv-hostengine; diagnostic/profiling paths need administrator privileges. https://docs.nvidia.com/datacenter/dcgm/latest/user-guide/feature-overview.html
3. NVIDIA/nvbandwidth: measures memcpy bandwidth across H2D/D2H/device-to-device paths via copy-engine (CE) or SM copy; build needs CUDA, CMake 3.20+, Boost program_options (cmake . && make); -l lists testcases (set varies by CUDA version), -t <name> runs one (e.g. host_to_device_memcpy_ce, device_to_device_memcpy_read_ce). https://github.com/NVIDIA/nvbandwidth
4. wilicc/gpu-burn: multi-GPU CUDA matrix-multiply stress test that runs for a duration in seconds and checks results for computation errors; binary gpu_burn; flags -d (double precision), -tc (Tensor cores), -m (memory amount/percent), -i N (single GPU), -l (list GPUs); e.g. gpu_burn -d 3600. https://github.com/wilicc/gpu-burn
5. nvidia-smi manual: dmon (device monitor, up to 16 devices, default 1 s) and pmon (per-process); -s metric letters p (power/temp), u (utilization), c (clocks), v (power/thermal violations), m (memory), e (ECC + PCIe replay), t (PCIe throughput); -i device list, -d interval, -c sample count. https://docs.nvidia.com/deploy/nvidia-smi/index.html