Runbook: GPU fault, drain, reset, RMA¶

Scope: drain, reset, and (if needed) RMA a faulted GPU after an XID/ECC alert, returning the node to service safely.

Run this when a GPU throws a fatal XID, fails a row-remap, or fails dcgmi diag. Severity: single-node fault, capacity-impacting; the question is whether the GPU recovers on reset or must be RMA'd. App-class XIDs are a workload bug, not a node fault; do not RMA on those.

Reference templates on real APIs; pin versions and validate before production use.

This is the longform GPU-fault path for the fleet. The XID taxonomy and RAS background are in reliability and RAS; the driver / Fabric-Manager / DCGM stack is in the GPU software stack; the general triage tree is in the troubleshooting runbook.

Trigger¶

A fatal hardware XID in dmesg / DCGM: XID 48 (double-bit ECC), XID 79 (GPU fallen off the bus), XID 92 (high single-bit ECC rate), XID 94/95 (contained / uncontained memory error), XID 64 (row-remap failure), per the XID catalog in References and reliability and RAS.
A dcgmi diag failure (any Fail in PCIe/NVLink, memory bandwidth, or stress at level 3).
A pending or failed row-remapping reported by nvidia-smi -q -d ROW_REMAPPER.

App-class XIDs, namely 13 (graphics engine exception), 31 (memory page fault), and 43 (GPU stopped processing), are owned by the workload (run compute-sanitizer/cuda-gdb); the GPU stays healthy. Do not enter this runbook for those.

Pre-checks¶

Confirm the XID class first (app vs hardware) before touching the node: misclassifying an app bug as a hardware fault wastes an RMA slot (reliability and RAS).
Record GPU identity: board serial, GPU UUID, PCI bus id, node name. These travel with the RMA.
```
ssh "$NODE" nvidia-smi --query-gpu=index,uuid,serial,pci.bus_id --format=csv
```
Capture the evidence before any reset clears it: dmesg -T | grep -i xid, the DCGM event log, and the full dcgmi diag JSON. A reset wipes volatile ECC counters.
Check whether the GPU is NVSwitch-attached (HGX baseboard, NVL72): a single dead GPU can hold the NVLink domain, so coordinate with the Fabric Manager state (the GPU software stack, reliability and RAS).

Flow¶

stateDiagram-v2
    [*] --> Triage
    Triage --> Healthy: app XID 13/31/43
    Triage --> Drain: hardware XID
    Drain --> Inspect: workloads evicted
    Inspect --> Reset: serials recorded
    Reset --> Diag: nvidia-smi -r
    Diag --> Healthy: diag pass, ECC stable
    Diag --> RMA: diag fail or remap fail
    Healthy --> [*]
    RMA --> [*]

Procedure¶

NODE=gpu-19.dc1.internal
GPU=3                              # GPU index reporting the fault

Classify the XID. Application class 13 / 31 / 43 → return to the workload owner, GPU stays healthy. Hardware class 48 / 79 / 92 / 94 / 95 / 64 → continue:
```
ssh "$NODE" 'dmesg -T | grep -i "NVRM: Xid"'
```

Cordon and drain so the scheduler stops placing work and running pods leave cleanly:

kubectl cordon "$NODE"
kubectl drain "$NODE" --ignore-daemonsets --delete-emptydir-data --timeout=15m
# Slurm: scontrol update nodename="$NODE" state=drain reason="XID hardware fault"

Inspect ECC and the row remapper. Look for uncorrectable counts and any pending or failed remap. A pending remap needs a reset to take effect; a failure means the row budget is exhausted:
```
ssh "$NODE" nvidia-smi -i "$GPU" -q -d ECC,ROW_REMAPPER
```
Record Remapped Rows, Pending (Yes means reset required), and Remapping Failure Occurred (Yes → RMA candidate).
Reset the GPU to apply a pending remap and clear a transient fault. The GPU must be idle (step 2 evicted its workloads):
```
ssh "$NODE" nvidia-smi -i "$GPU" -r
```
If nvidia-smi -r cannot reset (XID 79 / fallen off bus), a full node reboot is required; if it survives the reboot still off-bus, it is an RMA.
Re-run the long diagnostic (level 3: PCIe/NVLink, memory bandwidth, targeted stress, NCCL):
```
ssh "$NODE" 'dcgmi diag -r 3'
```
Decide. Diag clean and ECC stable → uncordon. Diag Fail, remap failure, or repeat XID after reset → leave cordoned and open an RMA:
```
kubectl uncordon "$NODE"          # only on a clean pass
```

Verification¶

dcgmi diag -r 3 returns with no Fail across PCIe/NVLink, memory, and stress.
nvidia-smi -q -d ECC shows no new uncorrectable (DBE) errors accruing under a short stress run, and ROW_REMAPPER shows Pending: No, Remapping Failure Occurred: No.
A 2-node nccl-tests all_reduce_perf lands on the node at line-rate busbw, confirming NVLink/IB health post-reset (workload recipes).
The node re-advertises nvidia.com/gpu and the GPU Operator validator goes green (the Kubernetes platform).

Rollback¶

Not applicable: this is a fault path, not a change. There is nothing to revert; the decision is recover-in-place vs replace.

If the verdict is RMA: keep the node cordoned (or cordon just the bad GPU via MIG/device-plugin exclusion), record board serial + GPU UUID + PCI bus id + the captured XID and diag log on the RMA ticket, and physically tag the unit. Return the node to the pool only after the replacement passes the full acceptance gate (commissioning). Track persistent or fleet-wide XID rates as a reliability signal (reliability and RAS, telemetry and monitoring).

the driver-upgrade runbook: Rolling driver / CUDA upgrade (a node failing diag on both branches diverts here).
the capacity-add runbook: Add GPU capacity (a new node failing acceptance diverts here).
the NCCL-hang runbook: NCCL hang / collective stall (drain an offending node into this runbook).
the thermal-emergency runbook: Thermal emergency (thermal-induced faults overlap with this path).
operational runbooks: Operational runbooks index.

References¶

NVIDIA XID errors (catalog, application vs hardware classes): https://docs.nvidia.com/deploy/xid-errors/index.html
NVIDIA XID catalog (per-XID action table): https://docs.nvidia.com/deploy/xid-errors/analyzing-xid-catalog.html
NVIDIA GPU memory error management (row remapping, ECC): https://docs.nvidia.com/deploy/a100-gpu-mem-error-mgmt/index.html
nvidia-smi manual (-q -d ROW_REMAPPER,ECC, -r reset): https://docs.nvidia.com/deploy/nvidia-smi/index.html
DCGM diagnostics (run levels): https://docs.nvidia.com/datacenter/dcgm/latest/user-guide/dcgm-diagnostics.html
Compute Sanitizer (app-class XID 13/31 debugging): https://docs.nvidia.com/compute-sanitizer/index.html