ECC (error-correcting code)¶
Scope: ECC on GPU memory across the tiers: what carries it, how nvidia-smi toggles and reports it, the volatile-vs-aggregate counters, and row remapping on Ampere-and-later silicon; toggle/recovery procedure lives in ECC Toggle Recovery.
What it is¶
ECC adds redundant parity to memory so the GPU can detect and correct single-bit errors (SBE) and detect double-bit / uncorrectable errors (DBE) rather than silently returning corrupt data. On datacenter GPUs it covers both the HBM frame buffer (DRAM) and on-die SRAM structures: register file, L1/L2, shared memory. NVML/nvidia-smi splits the counts accordingly: on Turing and later you see SRAM Correctable / SRAM Uncorrectable / DRAM Correctable / DRAM Uncorrectable [1][2].
Two orthogonal mechanisms sit under "ECC":
- In-line correction. SBEs are corrected transparently; DBEs are flagged. A DBE/uncorrectable event is fatal to the affected work and surfaces as Xid 48 (and, on Ampere-and-later datacenter GPUs, as the contained/uncontained pair Xid 94/95); see Reliability, RAS and Failure Modes.
- Repair. Bad memory is retired so it is not reused: row remapping on Ampere-and-later HBM, dynamic page retirement / offlining on older parts. Covered below [2][3].
Counters come in two scopes, used constantly in triage [1]:
- Volatile. Errors since the last driver load (cleared on reload/reset/reboot).
- Aggregate. A lifetime counter, persisted in the GPU's inforom.
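A minimal sketch of reading both scopes side by side. It assumes the driver exposes the `ecc.errors.*.total` query fields (confirm with `nvidia-smi --help-query-gpu` on your branch); the function name `ecc_counters` is hypothetical.

```shell
# Print volatile and aggregate ECC totals per GPU as CSV.
# Assumption: these --query-gpu field names exist on your driver branch.
ecc_counters() {
  nvidia-smi \
    --query-gpu=index,ecc.errors.corrected.volatile.total,ecc.errors.uncorrected.volatile.total,ecc.errors.corrected.aggregate.total,ecc.errors.uncorrected.aggregate.total \
    --format=csv
}
```

Run `ecc_counters` on the node. A GPU with a non-zero aggregate total and a zero volatile total has error history but nothing logged since the last driver load.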
ECC is not free: enabling it reserves a fraction of frame-buffer capacity and costs a little bandwidth. That trade is correct for any multi-hour training or production inference job; the alternative is silent corruption.
Why it's needed (and when)¶
At fleet scale, memory faults are steady-state, not exceptional: a long training run reads and writes terabytes through HBM, and an uncorrected bit-flip either crashes the job or, worse, corrupts a checkpoint without telling you. ECC turns "mystery NaN three hours in" into a logged, attributable event, and a rising correctable-error rate becomes a leading indicator: you can drain the node before a DBE kills the run (Reliability, RAS and Failure Modes).
When it matters by tier (the GPU determines whether ECC even exists):
- Datacenter (HBM): A100, H100/H200, B-series. HBM ECC, on by default. Leave it on for shared and production fleets.
- RTX PRO / professional workstation. The RTX PRO 6000 Blackwell ships 96 GB GDDR7 with error-correction code (ECC). That is ECC on consumer-class GDDR7, which plain GeForce does not provide [5]. Treat it as a server: ECC on, persistence on (Persistence Mode).
- GeForce (RTX 50/40): no ECC. Consumer cards do not expose an ECC toggle; nvidia-smi -q -d ECC reports the feature as unavailable. Do not plan datacenter integrity controls around them (Driver and Feature Support by GPU Tier).
Rule of thumb: if the workload writes state you cannot cheaply re-derive (model weights, optimizer state, KV cache backing a paid SLO), you want ECC and a GPU that has it.
How it's installed & managed¶
ECC needs no separate package; it is a driver/firmware feature exposed through nvidia-smi (NVML). Management is two operations: read state/counters, and toggle the mode (which is not live; see the reset caveat). Only the toggle (the write path) requires root.
Reference templates built on real nvidia-smi options; not hardware-tested. Pin behaviour against the nvidia-smi documentation and the GPU memory error management guide for your driver branch.
Check the current ECC mode and whether it is even supported:
# Current vs pending ECC mode, per GPU (scriptable)
nvidia-smi --query-gpu=index,name,ecc.mode.current,ecc.mode.pending --format=csv
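To turn that query into a per-GPU verdict, a small filter can compare the current and pending columns. This is a sketch under two assumptions: the query is run without the `name` column and with `noheader`, and GPUs without ECC print a placeholder such as `[N/A]` rather than `Enabled`/`Disabled`. The function name `ecc_mode_state` is hypothetical.

```shell
# Classify each GPU from `index, current, pending` CSV rows on stdin.
# Assumption: anything other than Enabled/Disabled in the current column
# (for example [N/A]) means ECC is not supported on that GPU.
ecc_mode_state() {
  awk -F', *' '
    $2 != "Enabled" && $2 != "Disabled" { print $1 ": unsupported"; next }
    $2 != $3 { print $1 ": staged " $3 " (current " $2 ")"; next }
             { print $1 ": " $2 }
  '
}
```

Usage: `nvidia-smi --query-gpu=index,ecc.mode.current,ecc.mode.pending --format=csv,noheader | ecc_mode_state`. A `staged` line means a mode change was requested but has not been applied yet.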
Enable or disable ECC. CONFIG is 0|DISABLED or 1|ENABLED; the change is persistent but takes effect after the next reboot, not immediately [1]:
# Enable ECC on all GPUs (root). -i <id> targets a single GPU.
sudo nvidia-smi -e 1 # 1|ENABLED ; use 0 to disable
The reset caveat (the detail that bites). nvidia-smi -e only stages the mode; you must apply it. The supported in-place mechanism is a GPU reset, but the documentation is explicit that "Changes to ECC mode require a reboot," and on Linux a reset may fail to flip a pending ECC mode, so a reboot is the reliable apply step [1]. The GPU must be idle (no CUDA contexts, persistence/nvidia-persistenced stopped, MIG disabled) for a reset to succeed.
# Apply staged ECC mode in place (FLR). GPU must be idle; may not flip pending mode on Linux.
sudo nvidia-smi -r # -i <id> to target one GPU; -r bus for a bus reset
# If pending mode does not clear, reboot the node.
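Because a reset needs an idle GPU, a preflight check before `nvidia-smi -r` can save a failed attempt. The sketch below covers only the "no CUDA contexts" condition, assuming an empty `--query-compute-apps` result means no compute processes; it does not check persistence or MIG state. The function name `gpu_is_idle` is hypothetical.

```shell
# Return 0 if GPU $1 lists no compute processes, 1 if it lists any,
# 2 if the query itself fails.
# Assumption: empty output from --query-compute-apps means no CUDA contexts.
gpu_is_idle() {
  local apps
  apps="$(nvidia-smi -i "$1" --query-compute-apps=pid --format=csv,noheader)" || return 2
  [ -z "$apps" ]
}
```

Usage: `gpu_is_idle 0 && sudo nvidia-smi -r -i 0`.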
Sequencing in a maintenance window: cordon/drain → stop persistenced → nvidia-smi -e <0|1> → reboot → verify ecc.mode.current → uncordon. The full procedure (and recovery when a node comes back with the wrong mode or a stuck pending state) is ECC Toggle Recovery. Toggling ECC fleet-wide is a planned change, batched to preserve quorum, the same discipline as a driver roll (Rolling Driver / CUDA Upgrade).
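The sequence above can be written down as a dry-run plan that prints the commands instead of running them. This is a sketch with loud assumptions: the node is managed by Kubernetes (`kubectl`), the persistence daemon runs as the `nvidia-persistenced` systemd unit, and the `kubectl` lines run from an admin host while the rest run on the node. The function name `ecc_toggle_plan` and the node name in the usage line are hypothetical.

```shell
# Print (do not execute) the ECC-toggle maintenance steps for one node.
# $1 = node name, $2 = target ECC mode (0 or 1).
# Assumptions: Kubernetes scheduling, systemd-managed nvidia-persistenced.
ecc_toggle_plan() {
  local node="$1" mode="$2"
  case "$mode" in
    0|1) ;;
    *) echo "mode must be 0 or 1" >&2; return 1 ;;
  esac
  cat <<EOF
kubectl cordon $node
kubectl drain $node --ignore-daemonsets --delete-emptydir-data
sudo systemctl stop nvidia-persistenced
sudo nvidia-smi -e $mode
sudo reboot
# after the node is back:
nvidia-smi --query-gpu=index,ecc.mode.current,ecc.mode.pending --format=csv
kubectl uncordon $node
EOF
}
```

Usage: `ecc_toggle_plan gpu-node-01 1` prints the plan for review; uncordon only after the verify step shows the expected current mode.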
Validated usage & tests¶
Inspect ECC state and error counters. Expect a block split into Volatile and Aggregate, each with SBE/DBE (or SRAM/DRAM correctable/uncorrectable) sub-counts:
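The report comes from the `-q -d ECC` display filter listed in the references. A minimal wrapper, with the hypothetical function name `ecc_report`, might look like:

```shell
# Full ECC report (mode plus volatile/aggregate counters).
# $1: optional GPU index; omit it to report every GPU.
ecc_report() {
  if [ -n "${1:-}" ]; then
    nvidia-smi -q -d ECC -i "$1"
  else
    nvidia-smi -q -d ECC
  fi
}
```

Usage: `ecc_report` for all GPUs, `ecc_report 0` for GPU 0.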
Expected shape (counts are device-specific, so do not assume any particular number): a top-level ECC mode (Current / Pending: Enabled or Disabled), then Volatile and Aggregate sections each listing correctable and uncorrectable totals broken out by memory structure. On a healthy GPU the uncorrectable counts are zero; a non-zero, rising correctable rate is the signal to watch, not panic.
Query row remapping (Ampere-and-later HBM). This is the repair state that decides drain-vs-RMA [3]:
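The row-remapping view uses the `-q -d ROW_REMAPPER` display filter listed in the references. A minimal wrapper, with the hypothetical function name `row_remap_report`, might look like:

```shell
# Row-remapping report for Ampere-and-later GPUs.
# $1: optional GPU index; omit it to report every GPU.
row_remap_report() {
  if [ -n "${1:-}" ]; then
    nvidia-smi -q -d ROW_REMAPPER -i "$1"
  else
    nvidia-smi -q -d ROW_REMAPPER
  fi
}
```

Usage: `row_remap_report 0`.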
Expected fields: Correctable Error and Uncorrectable Error (rows remapped due to each), Pending, and Remapping Failure Occurred [3][6]. Interpretation:
- Pending : Yes means a row is remapped but the change is not yet live; the GPU needs a reset (nvidia-smi -r, then reboot to confirm) for it to take effect. After a successful apply, Pending returns to No [3][6].
- Remapping Failure Occurred : Yes means repair could not be applied; the GPU is degraded past what remapping can hide. This is an RMA signal, not a reset-and-return [4].
- Row remapping is persistent for the life of the GPU and does not roll back [3].
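The interpretation above can be sketched as a filter that maps the two flags to an action. It assumes CSV rows of `gpu_bus_id, pending, failure` from `--query-remapped-rows`, and that the flags may print as `0`/`1` or `No`/`Yes` depending on the driver; both spellings are accepted and anything else is reported as unknown. The function name `remap_verdict` is hypothetical.

```shell
# Map row-remap flags to an action.
# stdin: `gpu_bus_id, remapped_rows.pending, remapped_rows.failure` CSV rows.
# Assumption: flags are spelled 0/1 or No/Yes; other values are "unknown".
remap_verdict() {
  awk -F', *' '
    function flag(v) {
      v = tolower(v)
      if (v == "1" || v == "yes") return 1
      if (v == "0" || v == "no")  return 0
      return -1
    }
    flag($3) == 1 { print $1 ": RMA (remapping failure)"; next }
    flag($2) == 1 { print $1 ": reset required (remap pending)"; next }
    flag($2) == 0 && flag($3) == 0 { print $1 ": ok"; next }
    { print $1 ": unknown (row remapping not reported)" }
  '
}
```

Usage: `nvidia-smi --query-remapped-rows=gpu_bus_id,remapped_rows.pending,remapped_rows.failure --format=csv,noheader | remap_verdict`. The failure flag is checked first because it overrides a pending remap.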
Scriptable health gate for a fleet (label/drain on any non-clean field):
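# Expect ECC mode Enabled; any counter or flag other than 0 / No warrants attention.
# --query-gpu and --query-remapped-rows are separate selective queries: one call each.
nvidia-smi --query-gpu=index,ecc.mode.current,ecc.errors.uncorrected.aggregate.total \
--format=csv,noheader
nvidia-smi --query-remapped-rows=gpu_bus_id,remapped_rows.pending,remapped_rows.failure \
--format=csv,noheader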
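To make the gate usable from a drain script, the `--query-gpu` fields can be reduced to an exit status. This is a sketch assuming `noheader` CSV rows of `index, ecc.mode.current, ecc.errors.uncorrected.aggregate.total`; treating every non-zero lifetime uncorrectable count as a failure is a strict policy choice, not an NVIDIA rule. The function name `ecc_gate` is hypothetical.

```shell
# Exit non-zero if any GPU has ECC off or a non-zero lifetime
# uncorrectable count; print one line per finding.
# stdin: `index, ecc.mode.current, ecc.errors.uncorrected.aggregate.total`.
ecc_gate() {
  awk -F', *' '
    BEGIN           { bad = 0 }
    $2 != "Enabled" { print "GPU " $1 ": ECC mode is " $2; bad = 1 }
    $3 != "0"       { print "GPU " $1 ": uncorrectable aggregate = " $3; bad = 1 }
    END             { exit bad }
  '
}
```

Usage: `nvidia-smi --query-gpu=index,ecc.mode.current,ecc.errors.uncorrected.aggregate.total --format=csv,noheader | ecc_gate || echo "label node for drain"`.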
Older GPUs without row remapping expose the analogous page retirement view instead (nvidia-smi -q -d PAGE_RETIREMENT), retiring an individual page after a DBE or after two SBEs at the same address [1][2]. dcgmi diag exercises memory and surfaces ECC/remap health in the health checks (GPU Diagnostics and Validation).
Failure modes¶
- ECC toggled but never applied. nvidia-smi -e was run, the node was never rebooted (or the reset silently failed to flip the pending mode), and the GPU is still in the old mode, so ecc.mode.current disagrees with ecc.mode.pending. Recovery: ECC Toggle Recovery.
- Pending row remap left stuck. ROW_REMAPPER shows Pending : Yes that never clears because the node was never reset/rebooted; the bad row stays in service. Reset and confirm Pending returns to No [6].
- Remapping failure / exhausted repair → RMA, not reset. Per NVIDIA's policy the failure flag is set on: a remap for an uncorrectable error on a bank that already has eight uncorrectable-error rows remapped; a remap of an already-remapped row; or after 512 total uncorrectable remappings. (On Blackwell a third attempt may invoke HBM channel repair if a spare channel exists, deferring RMA.) Field-diagnose, then RMA [4].
- Treating a correctable-error ramp as noise. Ignoring a rising SBE rate until a DBE (Xid 48 / uncontained Xid 95) kills a multi-hour run (Reliability, RAS and Failure Modes).
- Assuming ECC where there is none. Planning integrity controls on GeForce, which has no ECC toggle (Driver and Feature Support by GPU Tier).
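One way to avoid the "correctable ramp as noise" failure is to diff periodic snapshots of the lifetime correctable count. The sketch below assumes two non-empty files of `index, ecc.errors.corrected.aggregate.total` rows taken at different times; what growth should trigger a drain is a site policy decision that this code does not make. The function name `sbe_growth` and the snapshot file names are hypothetical.

```shell
# Report GPUs whose lifetime correctable count grew between two snapshots.
# $1 = older snapshot file, $2 = newer snapshot file, each holding
# `index, ecc.errors.corrected.aggregate.total` CSV rows (no header).
sbe_growth() {
  awk -F', *' '
    NR == FNR { prev[$1] = $2; next }   # first file: remember old counts
    ($1 in prev) && ($2 + 0) > (prev[$1] + 0) {
      print "GPU " $1 ": correctable +" ($2 - prev[$1])
    }
  ' "$1" "$2"
}
```

Usage: write `nvidia-smi --query-gpu=index,ecc.errors.corrected.aggregate.total --format=csv,noheader` to `snap-old.csv` and later to `snap-new.csv`, then run `sbe_growth snap-old.csv snap-new.csv`.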
References¶
- nvidia-smi documentation (-e/--ecc-config, -r/--gpu-reset, -q -d ECC|ROW_REMAPPER|PAGE_RETIREMENT, volatile vs aggregate): https://docs.nvidia.com/deploy/nvidia-smi/index.html
- NVIDIA GPU Memory Error Management (ECC, row remapping, dynamic page offlining): https://docs.nvidia.com/deploy/a100-gpu-mem-error-mgmt/index.html
- Row Remapping (Ampere+, field names, reset-to-apply): https://docs.nvidia.com/deploy/a100-gpu-mem-error-mgmt/row-remapping.html
- RMA Policy / thresholds for row remapping (8-rows / already-remapped / 512-total, Blackwell channel repair): https://docs.nvidia.com/deploy/a100-gpu-mem-error-mgmt/rma-policy-thresholds-for-row-remapping.html
- Dynamic Page Retirement (older GPUs): https://docs.nvidia.com/deploy/dynamic-page-retirement/index.html
- NVIDIA RTX PRO 6000 Blackwell ("96GB GDDR7 with error-correction code (ECC)"): https://www.nvidia.com/en-us/products/workstations/professional-desktop-gpus/rtx-pro-6000/
- nvidia-smi(1) man page (Arch mirror): https://man.archlinux.org/man/nvidia-smi.1.en
Related: Reliability, RAS and Failure Modes · GPU Software Stack and Node Administration · Glossary
1. nvidia-smi documentation: -e/--ecc-config (0|DISABLED, 1|ENABLED; "takes effect after the next reboot and is persistent"; "Changes to ECC mode require a reboot"), -r/--gpu-reset, -q -d ECC|ROW_REMAPPER|PAGE_RETIREMENT, and "Volatile error counters track the number of errors detected since the last driver load. Aggregate error counts persist indefinitely and thus act as a lifetime counter." https://docs.nvidia.com/deploy/nvidia-smi/index.html
2. NVIDIA GPU Memory Error Management: SRAM/DRAM error categories, dynamic page offlining, repair flow. https://docs.nvidia.com/deploy/a100-gpu-mem-error-mgmt/index.html
3. NVIDIA GPU Memory Error Management, Row Remapping: "a hardware mechanism to improve the reliability of frame buffer memory on GPUs starting with the NVIDIA Ampere architecture"; fields Correctable Error, Uncorrectable Error, Pending, Remapping Failure Occurred; "requires a GPU reset to take effect and will remain persistent throughout the life of the GPU." https://docs.nvidia.com/deploy/a100-gpu-mem-error-mgmt/row-remapping.html
4. NVIDIA GPU Memory Error Management, RMA Policy: failure flag set on a remap for an uncorrectable error on a bank with eight uncorrectable-error rows already remapped, on a remap of an already-remapped row, or after 512 total uncorrectable remappings; Blackwell may invoke HBM channel repair on a third attempt. https://docs.nvidia.com/deploy/a100-gpu-mem-error-mgmt/rma-policy-thresholds-for-row-remapping.html
5. NVIDIA RTX PRO 6000 Blackwell Workstation Edition: "GPU Memory: 96GB GDDR7 with error-correction code (ECC)". https://www.nvidia.com/en-us/products/workstations/professional-desktop-gpus/rtx-pro-6000/
6. Crusoe Cloud troubleshooting: on Pending : Yes, reset the GPU with nvidia-smi -r then reboot to confirm Row Remapping succeeded (Pending and Remapping Failure Occurred return to No). https://docs.crusoecloud.com/resources/troubleshooting/