Troubleshooting runbook
Scope: a triage index. When something breaks, match the symptom to the runbook that fixes it. This page is the dispatcher: the detailed step-by-step HOW lives in the focused runbooks linked below. It consolidates the diagnostic threads from the GPU software stack, observability, reliability and RAS, and performance tuning into one place to open under pressure.
flowchart LR
SYMPTOM["Symptom"] --> CHECKS["First-line checks"]
CHECKS --> LAYER["Identify failing layer"]
LAYER --> PROCEDURE["Run linked procedure"]
PROCEDURE --> VERIFY["Verify recovered metric"]
Overview
Most GPU incidents resolve to a short list of root causes, but the symptoms disguise them: a network fault reads as a GPU fault, a starved dataloader reads as a slow GPU, an ACS setting halves bandwidth silently. The discipline is to triage from the symptom to the layer, confirm with one command, and fix at the right layer rather than treating the symptom. Always rule out fabric and stack before blaming silicon.
First-line triage (run these first)
- `nvidia-smi`: GPUs present? Clocks, power, temperature, throttle, and ECC at a glance.
- `nvidia-smi -q -d ECC,ROW_REMAPPER,TEMPERATURE,CLOCK,POWER`: detail on errors and throttling.
- `dmesg -T | grep -iE "Xid|NVRM|nvidia"`: XID errors, the first stop for hardware faults (reliability and RAS).
- `dcgmi diag -r 3`: deep health check.
- `systemctl status nvidia-fabricmanager nvidia-persistenced`: the two daemons people forget (the GPU software stack).
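The `dmesg` check above gives raw kernel log lines; counting XIDs per GPU makes the "number and rate" easier to read. The following is a minimal sketch, assuming the common `NVRM: Xid (PCI:...): <code>, ...` line shape. The function name `count_xids` and the sample log lines are made up for illustration, and the exact log format can vary by driver version.

```python
import re
from collections import Counter

# Assumed line shape, e.g.:
# "NVRM: Xid (PCI:0000:3b:00): 79, pid=1234, GPU has fallen off the bus."
XID_LINE = re.compile(r"NVRM: Xid \((PCI:[0-9A-Fa-f:.]+)\): (\d+)")


def count_xids(dmesg_text):
    """Return a Counter keyed by (pci_address, xid_code)."""
    counts = Counter()
    for line in dmesg_text.splitlines():
        match = XID_LINE.search(line)
        if match:
            counts[(match.group(1), int(match.group(2)))] += 1
    return counts


# Illustrative sample only; not taken from a real incident.
sample = """\
[Mon Jan  1 10:00:00 2024] NVRM: Xid (PCI:0000:3b:00): 31, pid=4242, name=python, Ch 00000008
[Mon Jan  1 10:00:05 2024] NVRM: Xid (PCI:0000:3b:00): 31, pid=4242, name=python, Ch 00000008
[Mon Jan  1 10:02:11 2024] NVRM: Xid (PCI:0000:af:00): 79, pid=5151, GPU has fallen off the bus.
[Mon Jan  1 10:02:12 2024] some unrelated kernel message
"""

xid_counts = count_xids(sample)
for (pci, code), n in sorted(xid_counts.items()):
    print(f"{pci} Xid {code} x{n}")
```

In practice the input would be the captured output of `dmesg -T`; the sketch deliberately takes text rather than running the command so it can be tried on a saved log.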
Symptom -> runbook (start here)
Find your symptom, confirm with the one-line check, then open the linked runbook for the full procedure. Each runbook owns the detailed HOW; this table is the dispatcher.
| Symptom | Confirm with | Runbook |
|---|---|---|
| GPU not detected / missing from `nvidia-smi` | `lspci \| grep -i nvidia`, `lsmod \| grep nvidia` | Kernel sees no GPU / GPU missing |
| Driver / kernel module will not load | `dmesg` for "API mismatch" / GSP / signature errors | Driver module load failure |
| GPUs cannot see each other over NVLink | `nvidia-smi nvlink -s` | NVLink visibility failure |
| Fabric Manager down or version-mismatched | `systemctl status nvidia-fabricmanager` | Fabric Manager failure |
| PCIe bandwidth low / P2P broken | `nvidia-smi topo -m`, `lspci -vv` for LnkSta | PCIe / P2P bandwidth regression |
| Job will not schedule (Kubernetes / Slurm) | pending-pod reason / `squeue` | Scheduler: pending GPU job |
| Training OOM | model/optimizer/activation memory vs batch/seq | Training OOM |
| Inference OOM / KV-cache pressure | KV-cache size, max sequences/length | Inference KV-cache OOM |
| NCCL hangs or collectives are slow | `NCCL_DEBUG=INFO` transport, `ibstat` | NCCL hang |
| Low GPU utilisation / slow training (MFU low) | SM-active / tensor-active, `nsys` on one step | MFU regression |
| Inference latency SLO miss | TTFT vs TPOT split, KV-cache preemption | Inference SLO breach |
| ECC / XID hardware fault (uncorrectable, row-remap failure) | XID number and rate, `nvidia-smi -q -d ECC,ROW_REMAPPER` | GPU fault / RMA decision |
| Thermal throttling / overheating | `nvidia-smi -q -d TEMPERATURE,CLOCK`, `clocks_throttle_reasons` | Thermal emergency |
| MIG layout wrong / stale on a node | `nvidia-smi mig -lgip`, advertised MIG profiles | MIG state stale |
Cross-layer rule baked into every runbook: rule out fabric and Fabric Manager before blaming silicon, and classify each XID as app vs hardware before any RMA (reliability and RAS). App-XIDs (Xid 13/31/43) usually point to job bugs rather than hardware; do not RMA on one alone.
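The app-vs-hardware gate can be sketched as a small lookup. This is a minimal illustration, assuming only the three app-level codes named above; the names `APP_XIDS`, `classify_xid`, and `rma_candidate` are hypothetical. A code outside the list is labelled "investigate", not "hardware", because this page does not enumerate the hardware codes and the fabric and Fabric Manager checks still come first.

```python
# Codes this page treats as application-level (Xid 13/31/43).
APP_XIDS = frozenset({13, 31, 43})


def classify_xid(code):
    """Return "app" for the listed codes, "investigate" for anything else."""
    return "app" if code in APP_XIDS else "investigate"


def rma_candidate(codes):
    """True only if at least one observed code falls outside the app list.

    A True result means "keep investigating", not "RMA now".
    """
    return any(classify_xid(code) == "investigate" for code in codes)


print(classify_xid(31))         # listed app-level code
print(rma_candidate([13, 31]))  # only app-level codes seen
print(rma_candidate([31, 79]))  # one code outside the app list
```

The conservative default matters here: an empty or app-only list never flags a GPU, so a noisy job cannot trigger an RMA discussion by itself.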
Don't-miss checklist
- Triage from symptom → layer; confirm with one command before acting.
- Rule out fabric and Fabric Manager before blaming a GPU.
- Classify XID as app vs hardware before any RMA (reliability and RAS).
- For "slow", read SM-active/MFU and profile with `nsys`; do not guess (observability, performance tuning).
- Check the cheap, high-impact hygiene first: persistence mode, ACS, PCIe link, NCCL transport.
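The PCIe link item in the checklist can be checked mechanically by comparing the negotiated link (LnkSta) against the device capability (LnkCap) in `lspci -vv` output. The following is a rough sketch, assuming the usual `Speed <n>GT/s ... Width x<n>` layout of those two lines; `link_degraded` is a hypothetical helper and the sample text is illustrative. An idle GPU may report a lower speed because of link power management, so the reading is most meaningful under load.

```python
import re

# Matches "LnkCap:" and "LnkSta:" lines only (not LnkCap2/LnkSta2/LnkCtl).
LINK = re.compile(r"(LnkCap|LnkSta):.*?Speed ([\d.]+)GT/s.*?Width x(\d+)")


def link_degraded(lspci_vv_text):
    """Compare the first LnkCap/LnkSta pair found in the text.

    Returns (degraded, cap, sta) with cap and sta as (speed_gt_s, width),
    or None if either line is missing.
    """
    found = {}
    for line in lspci_vv_text.splitlines():
        match = LINK.search(line)
        if match and match.group(1) not in found:
            found[match.group(1)] = (float(match.group(2)), int(match.group(3)))
    cap, sta = found.get("LnkCap"), found.get("LnkSta")
    if cap is None or sta is None:
        return None
    degraded = sta[0] < cap[0] or sta[1] < cap[1]
    return degraded, cap, sta


# Illustrative excerpt for a single device.
sample = (
    "\t\tLnkCap:\tPort #0, Speed 16GT/s, Width x16, ASPM not supported\n"
    "\t\tLnkSta:\tSpeed 8GT/s (downgraded), Width x16 (ok)\n"
)

result = link_degraded(sample)
print(result)
```

The sketch reads one device at a time, so feed it the output of `lspci -vv -s <bus address>` for the GPU in question rather than the full listing.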
References
- NVIDIA GPU debug guidelines (triage flow): https://docs.nvidia.com/deploy/gpu-debug-guidelines/index.html
- XID errors: https://docs.nvidia.com/deploy/xid-errors/index.html
- DCGM diagnostics: https://docs.nvidia.com/datacenter/dcgm/latest/user-guide/dcgm-diagnostics.html
- NCCL troubleshooting: https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/troubleshooting.html
Related: Software Stack · Kubernetes · Observability · Reliability · Optimization · Glossary