Reliability, RAS & failure modes

Scope: what goes wrong with GPUs at scale and how it is detected, classified, and remediated: XID/SXID errors, ECC, HBM row remapping, thermal and bus failures, fleet failure rates in practice, and fault-tolerant operation. This is the SRE heart of running a cluster, and it sits downstream of observability.

Overview

RAS = Reliability, Availability, Serviceability. At thousands of GPUs, hardware failure is a steady-state condition, not an exception: a large training job can lose a GPU every few hours, so the question is not "if" but "how fast do you detect, classify, and recover". The skill is reading the GPU's error channels precisely, knowing which faults are user bugs and which are dying silicon, then automating the drain/reset/RMA loop so jobs survive.

State machine: fault remediation

stateDiagram-v2
  [*] --> Healthy
  Healthy --> Triage: XID or ECC alert
  Triage --> Healthy: app XID 13/31/43
  Triage --> Drain: hardware XID 48/79/94-95
  Drain --> Reset: nvidia-smi -r
  Reset --> Diag: dcgmi diag -r 3
  Diag --> Healthy: pass
  Diag --> RMA: fail or row-remap failure
  RMA --> [*]

Core knowledge

XID errors (the primary GPU error channel)

XIDs are reported in dmesg/syslog as lines of the form NVRM: Xid (PCI:...): <number>, ... and the number classifies the fault. The single most important distinction is application error versus hardware fault: do not RMA a GPU for a user bug.
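Application-caused (not a hardware fault): Xid 13 (graphics engine exception, usually an illegal memory access by the application), Xid 31 (GPU memory page fault, typically an illegal address access in a CUDA kernel), Xid 43 (a process hit an exception and the GPU stopped processing it). These follow the buggy job from node to node, and the GPU is usually fine; suspect hardware only if the same GPU keeps raising them across unrelated jobs.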
Hardware / fatal: Xid 48 (double-bit ECC, uncorrectable; RMA if it recurs within a week), Xid 79 (GPU has fallen off the PCIe bus; reseat and try another slot, then RMA if it persists), Xid 74 (NVLink error).
ECC bookkeeping and containment: Xid 63 records an ECC page-retirement or row-remapping event, and Xid 64 records a failure to record one. On A100 and later (including Hopper/Blackwell), Xid 94 (contained uncorrectable ECC) means the hardware confined the error to the affected application, so do not RMA on occasional 94s; Xid 95 (uncontained ECC), especially alongside a rising rate of 94, means degradation past containment: RMA.
GSP-related faults surface as their own XIDs (e.g. 119/120) and usually point at firmware or the driver rather than the silicon (see Software Stack).
SXID: the NVSwitch analogue of XID, reported by Fabric Manager (see Software Stack).
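
The app-versus-hardware split above can be sketched as a small log classifier. This is a minimal illustration, not a complete catalogue: the category names, the `XidEvent` type, and the sample log line in the usage note are hypothetical, and the code sets contain only the XIDs named in this section.

```python
import re
from typing import NamedTuple, Optional

# Only the XIDs named in this section; extend from the NVIDIA XID catalogue.
APP_XIDS = {13, 31, 43}          # follow the job, not the GPU
HARDWARE_XIDS = {48, 74, 79, 95}  # drain and investigate the GPU/node
ECC_EVENT_XIDS = {63, 64, 94}     # inspect ECC / row-remapper state
FIRMWARE_XIDS = {119, 120}        # GSP: look at driver/firmware first

# Matches the "NVRM: Xid (PCI:<bus id>): <number>, ..." shape described above.
XID_RE = re.compile(r"NVRM: Xid \(PCI:(?P<pci>[0-9a-fA-F:.]+)\): (?P<xid>\d+)")


class XidEvent(NamedTuple):
    pci: str
    xid: int
    category: str


def classify(xid: int) -> str:
    """Map an XID number to a coarse triage category (hypothetical names)."""
    if xid in APP_XIDS:
        return "application"
    if xid in HARDWARE_XIDS:
        return "hardware"
    if xid in ECC_EVENT_XIDS:
        return "ecc_event"
    if xid in FIRMWARE_XIDS:
        return "firmware"
    return "unknown"


def parse_line(line: str) -> Optional[XidEvent]:
    """Return an XidEvent for a kernel log line, or None if it carries no XID."""
    match = XID_RE.search(line)
    if match is None:
        return None
    xid = int(match.group("xid"))
    return XidEvent(match.group("pci"), xid, classify(xid))
```

For example, feeding it an illustrative line such as `NVRM: Xid (PCI:0000:3b:00): 79, GPU has fallen off the bus.` yields an event in the `hardware` category. Unknown codes are deliberately not guessed at: they fall through to `unknown` so a human consults the catalogue.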

ECC and HBM row remapping

Correctable (single-bit, SBE): logged and corrected. Treat a rising rate as a leading indicator, not an emergency.
Uncorrectable (double-bit, DBE): kills the job; may force page retirement and a reset.
Row remapping (HBM, A100 and later, including Hopper/Blackwell): the GPU remaps a bad row to a spare and logs it; a pending remap takes effect at the next GPU reset. Inspect with nvidia-smi -q -d ROW_REMAPPER,ECC. A remapping failure, or pending remaps that persist after reset, means RMA: the spare rows are exhausted.
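
The RMA rule for row remapping can be written down as a small decision function. This is a sketch under stated assumptions: `RowRemapState` and the action strings are hypothetical, and populating the fields from `nvidia-smi -q -d ROW_REMAPPER` output is left to the caller because the exact output format depends on the driver version.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RowRemapState:
    """Row-remapper counters for one GPU (hypothetical container type)."""
    correctable: int    # rows remapped after correctable errors
    uncorrectable: int  # rows remapped after uncorrectable errors
    pending: bool       # a remap is waiting for the next GPU reset
    failure: bool       # a remap could not be performed


def row_remap_action(state: RowRemapState, after_reset: bool) -> str:
    """Apply the rule above: failure, or pending that survives a reset, means RMA."""
    if state.failure:
        return "rma"
    if state.pending:
        # A pending remap should clear on reset; if it did not, treat as exhausted.
        return "rma" if after_reset else "drain_and_reset"
    return "healthy"
```

Calling it once before and once after the reset keeps the two cases explicit: a pending remap before reset is routine maintenance, the same flag after reset is an RMA signal.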

Thermal, power, bus, fabric

Throttling shows in clocks_throttle_reasons (see Observability); thermal slowdown precedes thermal shutdown. A liquid-cooling fault (CDU or coolant flow; see datacentre readiness) typically appears as throttling first, then shutdown.
Fallen off the bus (Xid 79): the GPU vanishes from PCIe for a power, thermal, seating, or hardware reason. Drain, reset, and reseat first; RMA if the fault persists.
NVLink/IB: replay/symbol errors and link flaps degrade collectives; a down Fabric Manager isolates GPUs (see networking fabric).
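
Throttle reasons are exposed by NVML as a bitmask, so a watcher typically decodes the mask before alerting. The sketch below uses bit values as I recall them from the `nvmlClocksThrottleReason*` constants in `nvml.h`; treat them as an assumption and verify against the header shipped with your driver. The short names and the choice of which reasons count as alert-worthy are hypothetical.

```python
from typing import List

# Bit values recalled from nvml.h (nvmlClocksThrottleReason*); verify before use.
THROTTLE_REASONS = {
    0x001: "gpu_idle",
    0x002: "applications_clocks_setting",
    0x004: "sw_power_cap",
    0x008: "hw_slowdown",
    0x010: "sync_boost",
    0x020: "sw_thermal_slowdown",
    0x040: "hw_thermal_slowdown",
    0x080: "hw_power_brake_slowdown",
    0x100: "display_clock_setting",
}

# Reasons that suggest a cooling or power problem rather than normal operation
# (an illustrative policy choice, not an NVIDIA recommendation).
ALERT_REASONS = {
    "hw_slowdown",
    "sw_thermal_slowdown",
    "hw_thermal_slowdown",
    "hw_power_brake_slowdown",
}


def decode_throttle_mask(mask: int) -> List[str]:
    """Return the names of all throttle reasons set in the bitmask."""
    return [name for bit, name in THROTTLE_REASONS.items() if mask & bit]


def needs_attention(mask: int) -> bool:
    """True if any active reason points at thermal or power trouble."""
    return any(name in ALERT_REASONS for name in decode_throttle_mask(mask))
```

An idle GPU or a software power cap is normal and should not page anyone; a thermal slowdown bit is the early warning that precedes shutdown.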

Detection → remediation pipeline

DCGM health checks plus an XID watcher (see Observability) → classify (app vs hardware) → cordon/drain the node (see provisioning and scheduling, Kubernetes for GPUs) → reset (nvidia-smi -r) or reboot → re-run dcgmi diag -r 3 → return to pool or RMA. Health gating keeps jobs off known-bad nodes. Automate this; do not hand-nurse nodes at fleet scale.
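
The pipeline above maps naturally onto an explicit transition table, which makes illegal shortcuts (for example, returning a node to the pool without a diagnostic pass) impossible by construction. This is a minimal sketch: the event names are hypothetical, and the side effects (cordoning, `nvidia-smi -r`, `dcgmi diag -r 3`) are assumed to be carried out by the caller, which then reports the outcome as an event.

```python
from enum import Enum
from typing import Iterable


class State(Enum):
    HEALTHY = "healthy"
    TRIAGE = "triage"
    DRAIN = "drain"
    RESET = "reset"
    DIAG = "diag"
    RMA = "rma"


# (current state, event) -> next state; event names are illustrative.
TRANSITIONS = {
    (State.HEALTHY, "alert"): State.TRIAGE,        # XID or ECC alert
    (State.TRIAGE, "app_xid"): State.HEALTHY,      # user bug, GPU is fine
    (State.TRIAGE, "hardware_xid"): State.DRAIN,   # cordon and drain the node
    (State.DRAIN, "drained"): State.RESET,         # caller runs the reset
    (State.RESET, "reset_done"): State.DIAG,       # caller runs diagnostics
    (State.DIAG, "diag_pass"): State.HEALTHY,      # return to pool
    (State.DIAG, "diag_fail"): State.RMA,          # includes row-remap failure
}


def step(state: State, event: str) -> State:
    """Advance one transition; reject anything not in the table."""
    try:
        return TRANSITIONS[(state, event)]
    except KeyError:
        raise ValueError(f"illegal transition: {state.name} on {event!r}") from None


def run(events: Iterable[str], start: State = State.HEALTHY) -> State:
    """Replay a sequence of events and return the final state."""
    state = start
    for event in events:
        state = step(state, event)
    return state
```

Keeping the table as data rather than nested conditionals makes it easy to audit against the runbook and to log every transition per node.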

Fault-tolerant operation (the math)

With N GPUs, each failing independently at rate λ, job MTBF ≈ 1/(Nλ). The checkpoint interval must be shorter than job MTBF (see storage and data, distributed training); otherwise most failures strike before the next checkpoint completes and the job keeps redoing the same work.
Elastic restart, hot spares, redundant replicas (torchft), and straggler eviction keep large jobs alive. Burn-in (see Commissioning) clears infant mortality, the front of the bathtub curve. DGX/SuperPOD ship NVIDIA's own RAS / Mission Control health framework.
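
A worked example makes the 1/(Nλ) relationship concrete. The numbers below are hypothetical, chosen only so the arithmetic is easy to follow. The second function is Young's first-order approximation for a checkpoint interval, sqrt(2 × checkpoint cost × MTBF); it is not part of the text above and is included only as one common way to pick an interval well below job MTBF.

```python
import math


def job_mtbf_hours(n_gpus: int, gpu_mtbf_hours: float) -> float:
    """Job MTBF = 1 / (N * lambda), with lambda = 1 / per-GPU MTBF."""
    failure_rate_per_gpu = 1.0 / gpu_mtbf_hours
    return 1.0 / (n_gpus * failure_rate_per_gpu)


def young_interval_hours(checkpoint_cost_hours: float, mtbf_hours: float) -> float:
    """Young's approximation: interval ~ sqrt(2 * checkpoint cost * MTBF)."""
    return math.sqrt(2.0 * checkpoint_cost_hours * mtbf_hours)


# Hypothetical fleet: 4096 GPUs, each with a 40,960-hour MTBF.
mtbf = job_mtbf_hours(4096, 40_960.0)       # about 10 hours between job failures
# Hypothetical checkpoint cost of 3 minutes (0.05 h).
interval = young_interval_hours(0.05, mtbf)  # about 1 hour between checkpoints
print(f"job MTBF = {mtbf:.1f} h, checkpoint every {interval:.1f} h")
```

The point of the exercise is the scaling: a per-GPU MTBF of several years still produces a job-level failure every few hours once thousands of GPUs must all stay up at once.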

Don't-miss checklist

Centralise and classify XID; separate app-XIDs (13/31/43) from hardware-XIDs (48/79/94-95) before acting.
Watch correctable-ECC ramp as a leading indicator; act immediately on uncorrectable or row-remap failure.
Automate drain → reset → dcgmi diag → return/RMA.
Checkpoint interval < job MTBF; build jobs to resume cleanly (see distributed training).
Treat Xid 79 as a hardware or platform fault, never an application bug; rule out power, thermal, and seating causes before RMA.

Failure modes

RMA-ing a GPU for an Xid 13/31 that was a user illegal-memory-access bug.
Ignoring a correctable-ECC ramp until a double-bit kills a multi-hour run.
No health gating: jobs repeatedly land on the same degraded node.
Checkpoint cadence longer than MTBF: hours of work lost on every failure.
A Fabric Manager / NVLink fault misread as a GPU fault.

Open questions & validation

The XID catalogue from memory: which codes are app vs hardware, and the action for each.
The row-remapping inspection and RMA-decision workflow on a real GPU.
An automated remediation pipeline design (watcher → cordon → reset → diag → RMA).

References

NVIDIA XID errors: https://docs.nvidia.com/deploy/xid-errors/index.html
NVIDIA GPU debug guidelines (triage flow): https://docs.nvidia.com/deploy/gpu-debug-guidelines/index.html
GPU memory error management (ECC, row remapping): https://docs.nvidia.com/deploy/a100-gpu-mem-error-mgmt/index.html
DCGM diagnostics: https://docs.nvidia.com/datacenter/dcgm/latest/user-guide/dcgm-diagnostics.html

Related: Commissioning · Software Stack · Storage · Training · Observability · Runbook · Glossary