Markdown

Troubleshooting runbook¶

Scope: a triage index. When something breaks, match the symptom to the runbook that fixes it. This page is the dispatcher: the detailed step-by-step HOW lives in the focused runbooks linked below. It consolidates the diagnostic threads from the GPU software stack, observability, reliability and RAS, and performance tuning into one place to open under pressure.

flowchart LR
  SYMPTOM["Symptom"] --> CHECKS["First-line checks"]
  CHECKS --> LAYER["Identify failing layer"]
  LAYER --> PROCEDURE["Run linked procedure"]
  PROCEDURE --> VERIFY["Verify recovered metric"]

Overview¶

Most GPU incidents resolve to a short list of root causes, but the symptoms disguise them: a network fault reads as a GPU fault, a starved dataloader reads as a slow GPU, an ACS setting halves bandwidth silently. The discipline is to triage from the symptom to the layer, confirm with one command, and fix at the right layer rather than treating the symptom. Always rule out fabric and stack before blaming silicon.

First-line triage (run these first)¶

nvidia-smi: GPUs present? clocks, power, temp, throttle, ECC at a glance.
nvidia-smi -q -d ECC,ROW_REMAPPER,TEMPERATURE,CLOCK,POWER: detail on errors and throttling.
dmesg -T | grep -iE "Xid|NVRM|nvidia": XID errors, the first stop for hardware faults (reliability and RAS).
dcgmi diag -r 3: deep health check.
systemctl status nvidia-fabricmanager nvidia-persistenced: the two daemons people forget (the GPU software stack).

Symptom -> runbook (start here)¶

Find your symptom, confirm with the one-line check, then open the linked runbook for the full procedure. Each runbook owns the detailed HOW; this table is the dispatcher.

Symptom	Confirm with	Runbook
GPU not detected / missing from `nvidia-smi`	`lspci \\| grep -i nvidia`, `lsmod \\| grep nvidia`	Kernel sees no GPU / GPU missing
Driver / kernel module will not load	`dmesg` for "API mismatch" / GSP / signature errors	Driver module load failure
GPUs cannot see each other over NVLink	`nvidia-smi nvlink -s`	NVLink visibility failure
Fabric Manager down or version-mismatched	`systemctl status nvidia-fabricmanager`	Fabric Manager failure
PCIe bandwidth low / P2P broken	`nvidia-smi topo -m`, `lspci -vv` for `LnkSta`	PCIe / P2P bandwidth regression
Job will not schedule (Kubernetes / Slurm)	pending-pod reason / `squeue`	Scheduler: pending GPU job
Training OOM	model/optimizer/activation memory vs batch/seq	Training OOM
Inference OOM / KV-cache pressure	KV-cache size, max sequences/length	Inference KV-cache OOM
NCCL hangs or collectives are slow	`NCCL_DEBUG=INFO` transport, `ibstat`	NCCL hang
Low GPU utilisation / slow training (MFU low)	SM-active / tensor-active, `nsys` on one step	MFU regression
Inference latency SLO miss	TTFT vs TPOT split, KV-cache preemption	Inference SLO breach
ECC / XID hardware fault (uncorrectable, row-remap failure)	XID number and rate, `nvidia-smi -q -d ECC,ROW_REMAPPER`	GPU fault / RMA decision
Thermal throttling / overheating	`nvidia-smi -q -d TEMPERATURE,CLOCK`, `clocks_throttle_reasons`	Thermal emergency
MIG layout wrong / stale on a node	`nvidia-smi mig -lgip`, advertised MIG profiles	MIG state stale

Cross-layer rule baked into every runbook: rule out fabric and Fabric Manager before blaming silicon, and classify XID as app vs hardware before any RMA (reliability and RAS). App-XIDs (Xid 13/31/43) are job bugs, not hardware; never RMA on one.

Don't-miss checklist¶

Triage from symptom → layer; confirm with one command before acting.
Rule out fabric and Fabric Manager before blaming a GPU.
Classify XID as app vs hardware before any RMA (reliability and RAS).
For "slow", read SM-active/MFU and profile with nsys; do not guess (observability, performance tuning).
Check the cheap, high-impact hygiene first: persistence mode, ACS, PCIe link, NCCL transport.

References¶

NVIDIA GPU debug guidelines (triage flow): https://docs.nvidia.com/deploy/gpu-debug-guidelines/index.html
XID errors: https://docs.nvidia.com/deploy/xid-errors/index.html
DCGM diagnostics: https://docs.nvidia.com/datacenter/dcgm/latest/user-guide/dcgm-diagnostics.html
NCCL troubleshooting: https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/troubleshooting.html