Runbook: GPU fault, drain, reset, RMA
Scope: drain, reset, and (if needed) RMA a faulted GPU after an XID/ECC alert, returning the node to service safely.
Run this when a GPU throws a fatal XID, fails a row-remap, or fails dcgmi diag. Severity: single-node fault, capacity-impacting; the question is whether the GPU recovers on reset or must be RMA'd. App-class XIDs are a workload bug, not a node fault; do not RMA on those.
Reference templates on real APIs; pin versions and validate before production use.
This is the longform GPU-fault path for the fleet. The XID taxonomy and RAS background are in reliability and RAS; the driver / Fabric-Manager / DCGM stack is in the GPU software stack; the general triage tree is in the troubleshooting runbook.
Trigger
- A fatal hardware XID in dmesg / DCGM: XID 48 (double-bit ECC), XID 79 (GPU fallen off the bus), XID 92 (high single-bit ECC rate), XID 94/95 (contained / uncontained memory error), XID 64 (row-remap failure), per the XID catalog in References and reliability and RAS.
- A dcgmi diag failure (any Fail in PCIe/NVLink, memory bandwidth, or stress at level 3).
- A pending or failed row-remapping reported by nvidia-smi -q -d ROW_REMAPPER.
App-class XIDs, namely 13 (graphics engine exception), 31 (memory page fault), and 43 (GPU stopped processing), are owned by the workload (run compute-sanitizer/cuda-gdb); the GPU stays healthy. Do not enter this runbook for those.
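The app-versus-hardware split above can be sketched as a small lookup so that alert handlers route every XID the same way. This is a minimal sketch, not part of any NVIDIA tool: the function names classify_xid and extract_xids are hypothetical, only the XIDs named in this runbook are covered, and the kernel-log line shape matched by extract_xids is an assumption to check against your driver's output.

```shell
# classify_xid: map an XID number to the class used in this runbook.
# Only the XIDs named in this runbook are listed; anything else prints
# "unknown" and should be looked up in the NVIDIA XID catalog by hand.
classify_xid() {
  case "$1" in
    13|31|43)          echo "app" ;;       # workload-owned, do not RMA
    48|64|79|92|94|95) echo "hardware" ;;  # enter this runbook
    *)                 echo "unknown" ;;
  esac
}

# extract_xids: read kernel log lines on stdin and print the distinct XID
# numbers found. Assumes lines shaped like
#   NVRM: Xid (PCI:0000:3b:00): 79, pid=..., GPU has fallen off the bus.
extract_xids() {
  sed -n 's/.*Xid ([^)]*): \([0-9][0-9]*\).*/\1/p' | sort -un
}
```

One way to use it is to pipe dmesg -T through extract_xids and call classify_xid on each number; an "unknown" result is a prompt for manual triage, not a verdict.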
Pre-checks
- Confirm the XID class first (app vs hardware) before touching the node: misclassifying an app bug as a hardware fault wastes an RMA slot (reliability and RAS).
- Record GPU identity: board serial, GPU UUID, PCI bus id, node name. These travel with the RMA.
- Capture the evidence before any reset clears it: dmesg -T | grep -i xid, the DCGM event log, and the full dcgmi diag JSON. A reset wipes volatile ECC counters.
- Check whether the GPU is NVSwitch-attached (HGX baseboard, NVL72): a single dead GPU can hold the NVLink domain, so coordinate with the Fabric Manager state (the GPU software stack, reliability and RAS).
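The identity and evidence capture described in these pre-checks could look roughly like the sketch below. The function name capture_gpu_evidence, the output file names, and the NVSMI override variable are all hypothetical. The DCGM event log and any existing dcgmi diag JSON are not collected here because their locations are site-specific, and the identity query will fail for a GPU that has already fallen off the bus.

```shell
# capture_gpu_evidence GPU_INDEX OUT_DIR
# Saves identity and fault evidence for one GPU before any reset.
# NVSMI may be overridden (for example with "echo") for a dry run.
capture_gpu_evidence() {
  gpu="$1"; out="$2"
  mkdir -p "$out" || return 1
  # Board serial, UUID, and PCI bus id travel with the RMA ticket.
  "${NVSMI:-nvidia-smi}" -i "$gpu" \
    --query-gpu=name,serial,uuid,pci.bus_id --format=csv > "$out/identity.csv"
  # Node name, as reported by the kernel.
  uname -n > "$out/node.txt"
  # Kernel-log XID lines; reading dmesg may need root on some hosts.
  dmesg -T 2>/dev/null | grep -i xid > "$out/xid.log"
  # ECC counters and row-remapper state as they stand before the reset.
  "${NVSMI:-nvidia-smi}" -i "$gpu" -q -d ECC,ROW_REMAPPER > "$out/ecc_remap.txt"
}
```

Calling capture_gpu_evidence 3 /var/tmp/gpu3-fault (path chosen only for illustration) leaves four small files that can be attached to the ticket as they are.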
Flow
stateDiagram-v2
[*] --> Triage
Triage --> Healthy: app XID 13/31/43
Triage --> Drain: hardware XID
Drain --> Inspect: workloads evicted
Inspect --> Reset: serials recorded
Reset --> Diag: nvidia-smi -r
Diag --> Healthy: diag pass, ECC stable
Diag --> RMA: diag fail or remap fail
Healthy --> [*]
RMA --> [*]
Procedure
1. Classify the XID. Application class 13 / 31 / 43 → return to the workload owner; the GPU stays healthy. Hardware class 48 / 79 / 92 / 94 / 95 / 64 → continue with step 2.
2. Cordon and drain the node (kubectl cordon, then kubectl drain) so the scheduler stops placing work and running pods leave cleanly.
3. Inspect ECC and the row remapper (nvidia-smi -q -d ECC and nvidia-smi -q -d ROW_REMAPPER). Look for uncorrectable counts and any pending or failed remap. A pending remap needs a reset to take effect; a failure means the row budget is exhausted. Record Remapped Rows, Pending (Yes means a reset is required), and Remapping Failure Occurred (Yes → RMA candidate).
4. Reset the GPU (nvidia-smi -r) to apply a pending remap and clear a transient fault. The GPU must be idle (step 2 evicted its workloads). If nvidia-smi -r cannot reset the GPU (XID 79 / fallen off the bus), a full node reboot is required; if the GPU is still off the bus after the reboot, it is an RMA.
5. Re-run the long diagnostic (dcgmi diag -r 3; level 3: PCIe/NVLink, memory bandwidth, targeted stress, NCCL).
6. Decide. Diag clean and ECC stable → uncordon. Diag Fail, remap failure, or repeat XID after reset → leave cordoned and open an RMA.
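The drain, inspect, reset, diagnose, and decide steps of this procedure can be sketched as a handful of shell helpers. This is one possible arrangement under stated assumptions, not the fleet's canonical tooling:

- Every function name and the KUBECTL, NVSMI, and DCGMI override variables are hypothetical.
- The drain flags are a common choice that may not suit every workload.
- diag_verdict is a plain text match on the word Fail, not a parser for the DCGM JSON schema.
- The "hold" outcome is an addition for states the runbook does not rule on.
- nvidia-smi -r normally needs root.

```shell
# Hypothetical helpers for the drain / inspect / reset / diag / decide steps.
# Set KUBECTL, NVSMI, or DCGMI to "echo" for a dry run that prints each
# command instead of executing it.

# Drain: stop new scheduling, then evict running pods.
# --delete-emptydir-data discards emptyDir volumes; drop it if that is unsafe.
drain_node() {
  "${KUBECTL:-kubectl}" cordon "$1" &&
    "${KUBECTL:-kubectl}" drain "$1" --ignore-daemonsets --delete-emptydir-data
}

# Inspect: ECC counters and row-remapper state for one GPU index.
inspect_gpu() {
  "${NVSMI:-nvidia-smi}" -i "$1" -q -d ECC,ROW_REMAPPER
}

# Read `nvidia-smi -q -d ROW_REMAPPER` text on stdin and print
# "failed", "pending", or "ok".
remap_state() {
  awk -F': *' '
    /Remapping Failure Occurred/ && $2 ~ /^Yes/ { failed = 1 }
    /Pending/                    && $2 ~ /^Yes/ { pending = 1 }
    END { print (failed ? "failed" : (pending ? "pending" : "ok")) }'
}

# Reset one idle GPU. A non-zero exit means fall back to a node reboot.
reset_gpu() {
  "${NVSMI:-nvidia-smi}" -i "$1" -r
}

# Long diagnostic with JSON output for the ticket.
run_diag() {
  "${DCGMI:-dcgmi}" diag -r 3 -j
}

# Read diag output on stdin; any line containing the word Fail counts as a
# failure. This is a text heuristic, not a parser for the DCGM schema.
diag_verdict() {
  if grep -qw 'Fail'; then echo "fail"; else echo "pass"; fi
}

# Decide: args are diag verdict, remap state, and whether an XID recurred
# after the reset (yes/no). "hold" is an assumption for states the runbook
# does not cover, such as a remap still pending after the reset.
decide() {
  if [ "$1" = "fail" ] || [ "$2" = "failed" ] || [ "$3" = "yes" ]; then
    echo "rma"
  elif [ "$1" = "pass" ] && [ "$2" = "ok" ] && [ "$3" = "no" ]; then
    echo "uncordon"
  else
    echo "hold"
  fi
}
```

A typical sequence would be drain_node, inspect_gpu, reset_gpu, then run_diag piped through diag_verdict, with decide turning the three observations into a single outcome. Keeping decide free of side effects means the uncordon or RMA action itself stays a deliberate operator step.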
Verification
- dcgmi diag -r 3 returns with no Fail across PCIe/NVLink, memory, and stress.
- nvidia-smi -q -d ECC shows no new uncorrectable (DBE) errors accruing under a short stress run, and ROW_REMAPPER shows Pending: No, Remapping Failure Occurred: No.
- A 2-node nccl-tests all_reduce_perf run that includes this node reaches line-rate busbw, confirming NVLink/IB health post-reset (workload recipes).
- The node re-advertises nvidia.com/gpu and the GPU Operator validator goes green (the Kubernetes platform).
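Two of the checks above lend themselves to a short sketch: comparing the uncorrectable ECC count before and after a stress run, and reading back the GPU count the node advertises. The helper names and the NVSMI / KUBECTL override variables are hypothetical. The stress run and the NCCL test are left to the site's own recipes, and the expected GPU count depends on the node type.

```shell
# dbe_count GPU_INDEX: volatile uncorrectable (double-bit) ECC total.
dbe_count() {
  "${NVSMI:-nvidia-smi}" -i "$1" \
    --query-gpu=ecc.errors.uncorrected.volatile.total --format=csv,noheader
}

# ecc_stable BEFORE AFTER: succeeds only if the count did not grow.
# Non-numeric values such as "[N/A]" (ECC disabled) fail the check on purpose.
ecc_stable() {
  [ "$2" -le "$1" ] 2>/dev/null
}

# node_gpu_count NODE: allocatable nvidia.com/gpu advertised for the node.
node_gpu_count() {
  "${KUBECTL:-kubectl}" get node "$1" \
    -o jsonpath='{.status.allocatable.nvidia\.com/gpu}'
}
```

In practice one would store dbe_count in a variable, run the short stress, read dbe_count again, and pass both values to ecc_stable; node_gpu_count should then match the number of GPUs the node is expected to expose.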
Rollback
Not applicable: this is a fault path, not a change. There is nothing to revert; the decision is recover-in-place vs replace.
If the verdict is RMA: keep the node cordoned (or cordon just the bad GPU via MIG/device-plugin exclusion), record board serial + GPU UUID + PCI bus id + the captured XID and diag log on the RMA ticket, and physically tag the unit. Return the node to the pool only after the replacement passes the full acceptance gate (commissioning). Track persistent or fleet-wide XID rates as a reliability signal (reliability and RAS, telemetry and monitoring).
Related runbooks
- the driver-upgrade runbook: Rolling driver / CUDA upgrade (a node failing diag on both branches diverts here).
- the capacity-add runbook: Add GPU capacity (a new node failing acceptance diverts here).
- the NCCL-hang runbook: NCCL hang / collective stall (drain an offending node into this runbook).
- the thermal-emergency runbook: Thermal emergency (thermal-induced faults overlap with this path).
- operational runbooks: Operational runbooks index.
References
- NVIDIA XID errors (catalog, application vs hardware classes): https://docs.nvidia.com/deploy/xid-errors/index.html
- NVIDIA XID catalog (per-XID action table): https://docs.nvidia.com/deploy/xid-errors/analyzing-xid-catalog.html
- NVIDIA GPU memory error management (row remapping, ECC): https://docs.nvidia.com/deploy/a100-gpu-mem-error-mgmt/index.html
- nvidia-smi manual (-q -d ROW_REMAPPER, ECC, -r reset): https://docs.nvidia.com/deploy/nvidia-smi/index.html
- DCGM diagnostics (run levels): https://docs.nvidia.com/datacenter/dcgm/latest/user-guide/dcgm-diagnostics.html
- Compute Sanitizer (app-class XID 13/31 debugging): https://docs.nvidia.com/compute-sanitizer/index.html