Troubleshooting runbook

Scope: a triage index. When something breaks, match the symptom to the runbook that fixes it. This page is the dispatcher: the detailed step-by-step HOW lives in the focused runbooks linked below. It consolidates the diagnostic threads from the GPU software stack, observability, reliability and RAS, and performance tuning into one place to open under pressure.

flowchart LR
  SYMPTOM["Symptom"] --> CHECKS["First-line checks"]
  CHECKS --> LAYER["Identify failing layer"]
  LAYER --> PROCEDURE["Run linked procedure"]
  PROCEDURE --> VERIFY["Verify recovered metric"]

Overview

Most GPU incidents resolve to a short list of root causes, but the symptoms disguise them: a network fault reads as a GPU fault, a starved dataloader reads as a slow GPU, and PCIe ACS left enabled silently cuts peer-to-peer bandwidth. The discipline is to triage from the symptom to the layer, confirm with one command, and fix at the right layer rather than treating the symptom. Always rule out fabric and stack before blaming silicon.

First-line triage (run these first)

  • nvidia-smi: GPUs present? clocks, power, temp, throttle, ECC at a glance.
  • nvidia-smi -q -d ECC,ROW_REMAPPER,TEMPERATURE,CLOCK,POWER: detail on errors and throttling.
  • dmesg -T | grep -iE "Xid|NVRM|nvidia": XID errors, the first stop for hardware faults (reliability and RAS).
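  • dcgmi diag -r 3: deep health check; it is long-running, so run it on a drained node.
  • systemctl status nvidia-fabricmanager nvidia-persistenced: the two daemons people forget (the GPU software stack). Fabric Manager exists only on NVSwitch-based systems, so a missing unit elsewhere is expected.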
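
The checks above can be wrapped in one pass so nothing is skipped under pressure. The following is a minimal sketch, not a vetted tool: the function names run_check and first_line_triage are hypothetical, and dcgmi diag -r 3 is deliberately left out because it is long-running and is better started by hand once the quick checks are read.

```shell
# Hypothetical helper: run one check, skipping it cleanly if the tool is absent.
# $1 = label, remaining arguments = the command to run.
run_check() {
  rc_label=$1
  shift
  if ! command -v "$1" >/dev/null 2>&1; then
    echo "SKIP  $rc_label ($1 not found)"
    return 0
  fi
  echo "=== $rc_label"
  "$@" 2>&1 || echo "WARN  $rc_label exited non-zero"
}

# Run the quick first-line checks in the order listed in this section.
first_line_triage() {
  run_check "GPU summary" nvidia-smi
  run_check "GPU detail" nvidia-smi -q -d ECC,ROW_REMAPPER,TEMPERATURE,CLOCK,POWER
  # grep exits 1 when nothing matches; that is good news here, so do not warn.
  run_check "Kernel log" sh -c 'dmesg -T | grep -iE "Xid|NVRM|nvidia" || true'
  run_check "Daemons" systemctl status --no-pager nvidia-fabricmanager nvidia-persistenced
}
```

After sourcing the block, something like `first_line_triage 2>&1 | tee triage.log` keeps a record to attach to the incident. A WARN on the daemon check is expected on nodes without NVSwitch, where the Fabric Manager unit is not installed.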

Symptom → runbook (start here)

Find your symptom, confirm with the one-line check, then open the linked runbook for the full procedure. Each runbook owns the detailed HOW; this table is the dispatcher.

| Symptom | Confirm with | Runbook |
| --- | --- | --- |
| GPU not detected / missing from nvidia-smi | lspci \| grep -i nvidia, lsmod \| grep nvidia | Kernel sees no GPU / GPU missing |
| Driver / kernel module will not load | dmesg for "API mismatch" / GSP / signature errors | Driver module load failure |
| GPUs cannot see each other over NVLink | nvidia-smi nvlink -s | NVLink visibility failure |
| Fabric Manager down or version-mismatched | systemctl status nvidia-fabricmanager | Fabric Manager failure |
| PCIe bandwidth low / P2P broken | nvidia-smi topo -m, lspci -vv for LnkSta | PCIe / P2P bandwidth regression |
| Job will not schedule (Kubernetes / Slurm) | pending-pod reason / squeue | Scheduler: pending GPU job |
| Training OOM | model/optimizer/activation memory vs batch/seq | Training OOM |
| Inference OOM / KV-cache pressure | KV-cache size, max sequences/length | Inference KV-cache OOM |
| NCCL hangs or collectives are slow | NCCL_DEBUG=INFO transport, ibstat | NCCL hang |
| Low GPU utilisation / slow training (MFU low) | SM-active / tensor-active, nsys on one step | MFU regression |
| Inference latency SLO miss | TTFT vs TPOT split, KV-cache preemption | Inference SLO breach |
| ECC / XID hardware fault (uncorrectable, row-remap failure) | XID number and rate, nvidia-smi -q -d ECC,ROW_REMAPPER | GPU fault / RMA decision |
| Thermal throttling / overheating | nvidia-smi -q -d TEMPERATURE,CLOCK, clocks_throttle_reasons | Thermal emergency |
| MIG layout wrong / stale on a node | nvidia-smi mig -lgip, advertised MIG profiles | MIG state stale |
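
As an illustration of a one-line confirm from the table, the PCIe row's "lspci -vv for LnkSta" check can be filtered down to NVIDIA devices whose link trained below its capability. This sketch assumes a pciutils version that annotates LnkSta with "(downgraded)" and that lspci runs as root so capabilities are visible; find_downgraded_links is a hypothetical name.

```shell
# stdin: output of `lspci -vv`.
# Prints "<device line> :: <LnkSta line>" for each NVIDIA device whose
# negotiated link speed or width is reported as downgraded.
find_downgraded_links() {
  awk '
    # An unindented line starts a new device record; remember it.
    /^[0-9a-f]/ { dev = $0 }
    # "LnkSta:" does not match "LnkSta2:", so only the live link status is read.
    /LnkSta:/ && /downgraded/ && dev ~ /NVIDIA/ {
      sub(/^[ \t]+/, "")
      print dev " :: " $0
    }
  '
}
```

Typical use would be `sudo lspci -vv | find_downgraded_links`. GPUs can drop link speed at idle to save power, so a speed downgrade is only meaningful when sampled under load; a width downgrade is suspicious at any time.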

Cross-layer rule baked into every runbook: rule out fabric and Fabric Manager before blaming silicon, and classify XID as app vs hardware before any RMA (reliability and RAS). App-XIDs (Xid 13/31/43) usually indicate job bugs rather than hardware faults; never RMA on one alone.
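
The app-vs-hardware split can be sketched as a small filter over the kernel log. It assumes the common "NVRM: Xid (PCI:...): <number>, ..." line shape and treats only the three codes named above as application-level; everything else is labelled "investigate" rather than "hardware", because the Xid number alone does not prove a silicon fault. The function name classify_xids is hypothetical.

```shell
# stdin: kernel log lines (for example from `dmesg -T`).
# Prints one "xid=<n> class=<app|investigate>" line per Xid event found.
classify_xids() {
  # Extract the Xid number that follows the "(PCI:...)" device tag.
  sed -n 's/.*NVRM: Xid ([^)]*): \([0-9][0-9]*\),.*/\1/p' |
  while read -r xid; do
    case "$xid" in
      13|31|43) echo "xid=$xid class=app" ;;          # app-XIDs per this runbook
      *)        echo "xid=$xid class=investigate" ;;  # follow the GPU fault runbook
    esac
  done
}
```

Piping through `sort | uniq -c`, as in `dmesg -T | classify_xids | sort | uniq -c`, gives the per-code counts that the GPU fault / RMA decision row asks for.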

Don't-miss checklist

  • Triage from symptom → layer; confirm with one command before acting.
  • Rule out fabric and Fabric Manager before blaming a GPU.
  • Classify XID as app vs hardware before any RMA (reliability and RAS).
  • For "slow", read SM-active/MFU and profile with nsys; do not guess (observability, performance tuning).
  • Check the cheap, high-impact hygiene first: persistence mode, ACS, PCIe link, NCCL transport.
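
Two of the hygiene items in the last bullet reduce to simple text filters. This is a sketch under stated assumptions: persistence mode is read from nvidia-smi's CSV query output, and ACS is judged by the "SrcValid+" flag on ACSCtl lines, the heuristic described in the NCCL troubleshooting guide. Both function names are hypothetical.

```shell
# stdin: output of
#   nvidia-smi --query-gpu=index,persistence_mode --format=csv,noheader
# Prints the index of every GPU whose persistence mode is not "Enabled".
gpus_without_persistence() {
  awk -F', *' '$2 != "Enabled" { print $1 }'
}

# stdin: output of `lspci -vvv` (run as root so capabilities are shown).
# Prints how many ACSCtl lines have source validation switched on,
# which suggests ACS is active on that many bridges or devices.
count_acs_enabled() {
  awk '/ACSCtl:/ && /SrcValid\+/ { n++ } END { print n + 0 }'
}
```

For example, `nvidia-smi --query-gpu=index,persistence_mode --format=csv,noheader | gpus_without_persistence` should print nothing on a healthy node, and `sudo lspci -vvv | count_acs_enabled` should print 0 on bare-metal nodes where GPU peer-to-peer traffic is expected.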

References

  • NVIDIA GPU debug guidelines (triage flow): https://docs.nvidia.com/deploy/gpu-debug-guidelines/index.html
  • XID errors: https://docs.nvidia.com/deploy/xid-errors/index.html
  • DCGM diagnostics: https://docs.nvidia.com/datacenter/dcgm/latest/user-guide/dcgm-diagnostics.html
  • NCCL troubleshooting: https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/troubleshooting.html

Related: Software Stack · Kubernetes · Observability · Reliability · Optimization · Glossary