ECC (error-correcting code)

Scope: ECC on GPU memory across the tiers: what carries it, how nvidia-smi toggles and reports it, the volatile-vs-aggregate counters, and row remapping on Ampere-and-later silicon; toggle/recovery procedure lives in ECC Toggle Recovery.

What it is

ECC adds redundant parity to memory so the GPU can detect and correct single-bit errors (SBE) and detect double-bit / uncorrectable errors (DBE) rather than silently returning corrupt data. On datacenter GPUs it covers both the HBM frame buffer (DRAM) and on-die SRAM structures: register file, L1/L2, shared memory. NVML/nvidia-smi splits the counts accordingly: on Turing and later you see SRAM Correctable / SRAM Uncorrectable / DRAM Correctable / DRAM Uncorrectable.¹²

Two orthogonal mechanisms sit under "ECC":

In-line correction. SBEs are corrected transparently; DBEs are flagged. A DBE/uncorrectable event is fatal to the affected work and surfaces as Xid 48 (and, on A100 and later GPUs that support error containment, including Hopper/Blackwell, as the contained/uncontained pair Xid 94/95); see Reliability, RAS and Failure Modes.
Repair. Bad memory is retired so it is not reused: row remapping on Ampere-and-later HBM, dynamic page retirement / offlining on older parts. Covered below.²³

Counters come in two scopes, used constantly in triage:¹

Volatile. Errors since the last driver load (cleared on reload/reset/reboot).
Aggregate. A lifetime counter, persisted in the GPU's inforom.

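ECC is not always free: on GDDR-based parts, enabling it reserves a fraction of frame-buffer capacity and costs a little bandwidth, while HBM parts store the check bits in dedicated capacity, so the overhead there is much smaller. That trade is correct for any multi-hour training or production inference job; the alternative is silent corruption.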

Why it's needed (and when)

At fleet scale, memory faults are steady-state, not exceptional: a long training run reads and writes terabytes through HBM, and an uncorrected bit-flip either crashes the job or, worse, corrupts a checkpoint without telling you. ECC turns "mystery NaN three hours in" into a logged, attributable event, and a rising correctable-error rate becomes a leading indicator: you can drain the node before a DBE kills the run (Reliability, RAS and Failure Modes).

When it matters by tier (the GPU determines whether ECC even exists):

Datacenter (HBM): A100, H100/H200, B-series. HBM ECC, on by default. Leave it on for shared and production fleets.
RTX PRO / professional workstation. The RTX PRO 6000 Blackwell ships 96 GB of GDDR7 with ECC.⁵ That gives you error correction on GDDR7, the same memory technology plain GeForce uses without ECC. Treat it as a server: ECC on, persistence on (Persistence Mode).
GeForce (RTX 50/40): no ECC. Consumer cards do not expose an ECC toggle; nvidia-smi -q -d ECC reports the feature as unavailable. Do not plan datacenter integrity controls around them (Driver and Feature Support by GPU Tier).

Rule of thumb: if the workload writes state you cannot cheaply re-derive (model weights, optimizer state, KV cache backing a paid SLO), you want ECC and a GPU that has it.

How it's installed & managed

ECC needs no separate package; it is a driver/firmware feature exposed through nvidia-smi (NVML). Management is two operations: read state/counters, and toggle the mode (which is not live, see the reset caveat). Only the write path (toggling the mode) requires root.

The commands below are reference templates built on real nvidia-smi options; they have not been tested on hardware. Confirm the behaviour against the nvidia-smi documentation and the GPU memory error management guide for your driver branch.

Check the current ECC mode and whether it is even supported:

# Current vs pending ECC mode, per GPU (scriptable)
nvidia-smi --query-gpu=index,name,ecc.mode.current,ecc.mode.pending --format=csv
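The same query can feed a small filter that labels each GPU as enabled, disabled, pending a change, or unsupported. This is a minimal sketch, not a tested tool: the function name `ecc_mode_check` is hypothetical, it drops the `name` column to keep parsing simple, and it assumes unsupported fields print as a bracketed placeholder such as `[N/A]`, which may vary by driver version.

```shell
# Classify each GPU from "index, current, pending" CSV rows on stdin.
# Assumption: a GPU without ECC prints a bracketed placeholder such as [N/A].
ecc_mode_check() {
  awk -F', *' '
    $2 ~ /^\[/ { print $1 ": unsupported"; next }                    # no ECC on this GPU
    $2 != $3   { print $1 ": pending (" $2 " -> " $3 ")"; next }     # staged, not applied
               { print $1 ": " tolower($2) }                         # current == pending
  '
}

# Usage (on a GPU host):
#   nvidia-smi --query-gpu=index,ecc.mode.current,ecc.mode.pending \
#     --format=csv,noheader | ecc_mode_check
```

A `pending` line is the state described under the reset caveat: the mode has been staged but not yet applied.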

Enable or disable ECC. CONFIG is 0|DISABLED or 1|ENABLED; the change is persistent but takes effect after the next reboot, not immediately:¹

# Enable ECC on all GPUs (root). -i <id> targets a single GPU.
sudo nvidia-smi -e 1        # 1|ENABLED ; use 0 to disable

The reset caveat (the detail that bites). nvidia-smi -e only stages the mode; you must apply it. The supported in-place mechanism is a GPU reset, but the documentation is explicit that "Changes to ECC mode require a reboot," and on Linux a reset may fail to flip a pending ECC mode, so a reboot is the reliable apply step.¹ The GPU must be idle (no CUDA contexts, persistence/nvidia-persistenced stopped, MIG disabled) for a reset to succeed.

# Apply staged ECC mode in place (FLR). GPU must be idle; may not flip pending mode on Linux.
sudo nvidia-smi -r          # -i <id> to target one GPU; -r bus for a bus reset
# If pending mode does not clear, reboot the node.

Sequencing in a maintenance window: cordon/drain → stop persistenced → nvidia-smi -e <0|1> → reboot → verify ecc.mode.current → uncordon. The full procedure (and recovery when a node comes back with the wrong mode or a stuck pending state) is ECC Toggle Recovery. Toggling ECC fleet-wide is a planned change, batched to preserve quorum, the same discipline as a driver roll (Rolling Driver / CUDA Upgrade).
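The sequence above can be sketched as a dry-run wrapper that prints each step before anything is executed. Everything environment-specific here is an assumption rather than part of the documented procedure: the function names `run` and `ecc_toggle_node`, the use of kubectl for cordon/drain, ssh access to the node, and the `nvidia-persistenced` systemd unit name.

```shell
# Print (default) or execute one command. DRY_RUN=0 executes for real.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then echo "+ $*"; else "$@"; fi
}

# Stage an ECC mode change on one node and reboot it.
# Hypothetical helper: kubectl, ssh and the unit name are assumptions.
ecc_toggle_node() {
  node="$1"; mode="$2"
  case "$mode" in
    0|1) ;;
    *) echo "mode must be 0 or 1" >&2; return 2 ;;
  esac
  run kubectl cordon "$node"
  run kubectl drain "$node" --ignore-daemonsets --delete-emptydir-data
  run ssh "$node" sudo systemctl stop nvidia-persistenced
  run ssh "$node" sudo nvidia-smi -e "$mode"
  run ssh "$node" sudo reboot
  echo "# after reboot: verify ecc.mode.current on $node, then kubectl uncordon $node"
}

# Usage: ecc_toggle_node gpu-node-01 1      (prints the plan; nothing is executed)
```

Defaulting to a dry run keeps the script reviewable as a change plan; the verify and uncordon steps are left manual because they depend on the node coming back.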

Validated usage & tests

Inspect ECC state and error counters. Expect a block split into Volatile and Aggregate, each with SBE/DBE (or SRAM/DRAM correctable/uncorrectable) sub-counts:

nvidia-smi -q -d ECC

Expected shape (counts are device-specific, so do not assume any particular number): a top-level ECC mode (Current / Pending: Enabled or Disabled), then Volatile and Aggregate sections each listing correctable and uncorrectable totals broken out by memory structure. On a healthy GPU the uncorrectable counts are zero; a non-zero, rising correctable rate is the signal to watch, not panic.
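To watch the correctable rate rather than a single reading, compare two snapshots of the aggregate counter. The following is a minimal sketch under stated assumptions: the function name `ecc_correctable_delta` and the snapshot file names are hypothetical, and each snapshot is assumed to hold `uuid, count` rows from the `--query-gpu` call shown in the comment.

```shell
# Print per-GPU growth in aggregate correctable errors between two snapshots.
# Each snapshot file holds "uuid, count" rows, for example from:
#   nvidia-smi --query-gpu=uuid,ecc.errors.corrected.aggregate.total \
#     --format=csv,noheader > snapshot.csv
ecc_correctable_delta() {
  awk -F', *' '
    NR == FNR { prev[$1] = $2; next }   # first file: remember the old counts
    ($1 in prev) && ($2 + 0 > prev[$1] + 0) {
      print $1 ": +" ($2 - prev[$1])    # second file: report only growth
    }
  ' "$1" "$2"
}

# Usage: ecc_correctable_delta prev.csv curr.csv
```

GPUs whose count did not grow produce no output, so an empty result means no new correctable errors between the two snapshots.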

Query row remapping (Ampere-and-later HBM). This is the repair state that decides drain-vs-RMA:³

nvidia-smi -q -d ROW_REMAPPER

Expected fields: Correctable Error and Uncorrectable Error (rows remapped due to each), Pending, and Remapping Failure Occurred.³⁶ Interpretation:

Pending : Yes means a row is remapped but the change is not yet live; the GPU needs a reset (nvidia-smi -r, then reboot to confirm) for it to take effect. After a successful apply, Pending returns to No.³⁶
Remapping Failure Occurred : Yes means repair could not be applied; the GPU is degraded past what remapping can hide. This is an RMA signal, not a reset-and-return.⁴
Row remapping is persistent for the life of the GPU and does not roll back.³
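The interpretation above reduces to a three-way decision per GPU. A minimal sketch follows; the function name `remap_action` is hypothetical, and it accepts both `0`/`1` and `No`/`Yes` because the exact CSV rendering of these flags is an assumption that may differ between driver versions.

```shell
# Map "bus_id, pending, failure" CSV rows on stdin to an action per GPU.
# Assumption: flags print as 0/1 or No/Yes depending on driver version.
remap_action() {
  awk -F', *' '
    function flag(v) { v = tolower(v); return (v == "1" || v == "yes") }
    flag($3) { print $1 ": rma"; next }     # remapping failure outranks pending
    flag($2) { print $1 ": reset"; next }   # remap staged, needs a reset to apply
             { print $1 ": ok" }
  '
}

# Usage (on a GPU host):
#   nvidia-smi --query-remapped-rows=gpu_bus_id,remapped_rows.pending,remapped_rows.failure \
#     --format=csv,noheader | remap_action
```

Checking the failure flag first matters: a GPU that reports both flags should go to RMA, not back into service after a reset.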

Scriptable health gate for a fleet (label/drain on any non-clean field):

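# Non-zero error counts, or anything other than 0 / No on the remap fields, warrant attention.
# Each --query-* selector needs its own invocation, so run the two queries separately.
nvidia-smi --query-gpu=index,ecc.mode.current,ecc.errors.uncorrected.aggregate.total \
  --format=csv,noheader
nvidia-smi --query-remapped-rows=gpu_bus_id,remapped_rows.pending,remapped_rows.failure \
  --format=csv,noheader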

Older GPUs without row remapping expose the analogous page retirement view instead (nvidia-smi -q -d PAGE_RETIREMENT): a page is retired after one DBE or after two SBEs at the same address.¹² dcgmi diag exercises memory and surfaces ECC/remap health in its health checks (GPU Diagnostics and Validation).
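On those older parts, a per-cause tally of retired pages is a quick way to see whether retirements come from repeated SBEs or from DBEs. This is a rough sketch: the function name `retired_pages_by_cause` is hypothetical, the cause strings in the sample usage are whatever the driver prints (the code does not assume specific values), and the `--query-retired-pages` field list should be checked against your driver's help output.

```shell
# Tally retired pages per GPU and cause from "uuid, address, cause" rows on stdin.
retired_pages_by_cause() {
  awk -F', *' '
    NF >= 3 { n[$1 " / " $3]++ }                 # key on GPU and cause, skip short rows
    END     { for (k in n) print k ": " n[k] }
  ' | sort
}

# Usage (on a GPU host):
#   nvidia-smi --query-retired-pages=gpu_uuid,retired_pages.address,retired_pages.cause \
#     --format=csv,noheader | retired_pages_by_cause
```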

Failure modes

ECC toggled but never applied. nvidia-smi -e was run, the node was never rebooted (or the reset silently failed to flip the pending mode), and the GPU is still in the old mode, so ecc.mode.current disagrees with ecc.mode.pending. Recovery: ECC Toggle Recovery.
Pending row remap left stuck. ROW_REMAPPER shows Pending : Yes that never clears because the node was never reset/rebooted; the bad row stays in service. Reset and confirm Pending returns to No.⁶
Remapping failure / exhausted repair → RMA, not reset. Per NVIDIA's policy the failure flag is set on: a remap for an uncorrectable error on a bank that already has eight uncorrectable-error rows remapped; a remap of an already-remapped row; or after 512 total uncorrectable remappings. (On Blackwell a third attempt may invoke HBM channel repair if a spare channel exists, deferring RMA.) Field-diagnose, then RMA.⁴
Treating a correctable-error ramp as noise. Ignoring a rising SBE rate until a DBE (Xid 48 / uncontained Xid 95) kills a multi-hour run (Reliability, RAS and Failure Modes).
Assuming ECC where there is none. Planning integrity controls on GeForce, which has no ECC toggle (Driver and Feature Support by GPU Tier).
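The remapping-failure conditions listed above can be written out as a small decision function, which helps when reasoning about how close a GPU is to RMA. This is an illustration only: the function name `remap_would_fail` and its three inputs are hypothetical (nvidia-smi does not hand them over in this form), the comparisons are one reading of the thresholds as worded above, and the Blackwell channel-repair exception is deliberately ignored.

```shell
# Would a new uncorrectable-error remap set the failure flag?
# Args (all hypothetical inputs):
#   $1 uncorrectable-error rows already remapped in the affected bank
#   $2 total uncorrectable remappings on the GPU
#   $3 1 if the failing row is itself an already-remapped row, else 0
remap_would_fail() {
  bank_rows="$1"; total_rows="$2"; already_remapped="$3"
  if [ "$already_remapped" -eq 1 ]; then
    echo "fail: row was already remapped"
  elif [ "$bank_rows" -ge 8 ]; then
    echo "fail: bank already has 8 uncorrectable-error rows remapped"
  elif [ "$total_rows" -ge 512 ]; then
    echo "fail: 512 total uncorrectable remappings reached"
  else
    echo "ok: remap can proceed"
  fi
}

# Usage: remap_would_fail 3 40 0
```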

References

nvidia-smi documentation (-e/--ecc-config, -r/--gpu-reset, -q -d ECC|ROW_REMAPPER|PAGE_RETIREMENT, volatile vs aggregate): https://docs.nvidia.com/deploy/nvidia-smi/index.html
NVIDIA GPU Memory Error Management (ECC, row remapping, dynamic page offlining): https://docs.nvidia.com/deploy/a100-gpu-mem-error-mgmt/index.html
Row Remapping (Ampere+, field names, reset-to-apply): https://docs.nvidia.com/deploy/a100-gpu-mem-error-mgmt/row-remapping.html
RMA Policy / thresholds for row remapping (8-rows / already-remapped / 512-total, Blackwell channel repair): https://docs.nvidia.com/deploy/a100-gpu-mem-error-mgmt/rma-policy-thresholds-for-row-remapping.html
Dynamic Page Retirement (older GPUs): https://docs.nvidia.com/deploy/dynamic-page-retirement/index.html
NVIDIA RTX PRO 6000 Blackwell ("96GB GDDR7 with error-correction code (ECC)"): https://www.nvidia.com/en-us/products/workstations/professional-desktop-gpus/rtx-pro-6000/
nvidia-smi(1) man page (Arch mirror): https://man.archlinux.org/man/nvidia-smi.1.en

1. nvidia-smi documentation: -e/--ecc-config (0|DISABLED, 1|ENABLED; "takes effect after the next reboot and is persistent"; "Changes to ECC mode require a reboot"), -r/--gpu-reset, -q -d ECC|ROW_REMAPPER|PAGE_RETIREMENT, and "Volatile error counters track the number of errors detected since the last driver load. Aggregate error counts persist indefinitely and thus act as a lifetime counter." https://docs.nvidia.com/deploy/nvidia-smi/index.html
2. NVIDIA GPU Memory Error Management: SRAM/DRAM error categories, dynamic page offlining, repair flow. https://docs.nvidia.com/deploy/a100-gpu-mem-error-mgmt/index.html
3. NVIDIA GPU Memory Error Management, Row Remapping: "a hardware mechanism to improve the reliability of frame buffer memory on GPUs starting with the NVIDIA Ampere architecture"; fields Correctable Error, Uncorrectable Error, Pending, Remapping Failure Occurred; "requires a GPU reset to take effect and will remain persistent throughout the life of the GPU." https://docs.nvidia.com/deploy/a100-gpu-mem-error-mgmt/row-remapping.html
4. NVIDIA GPU Memory Error Management, RMA Policy: failure flag set on a remap for an uncorrectable error on a bank with eight uncorrectable-error rows already remapped, on a remap of an already-remapped row, or after 512 total uncorrectable remappings; Blackwell may invoke HBM channel repair on a third attempt. https://docs.nvidia.com/deploy/a100-gpu-mem-error-mgmt/rma-policy-thresholds-for-row-remapping.html
5. NVIDIA RTX PRO 6000 Blackwell Workstation Edition: "GPU Memory: 96GB GDDR7 with error-correction code (ECC)". https://www.nvidia.com/en-us/products/workstations/professional-desktop-gpus/rtx-pro-6000/
6. Crusoe Cloud troubleshooting: on Pending : Yes, reset the GPU with nvidia-smi -r then reboot to confirm Row Remapping succeeded (Pending and Remapping Failure Occurred return to No). https://docs.crusoecloud.com/resources/troubleshooting/