Runbook: ECC toggle recovery

Scope: recover a datacenter GPU whose ECC mode was toggled but is stuck with Current disagreeing with Pending, or that needs a reset/reboot after enabling ECC, or that is accumulating ECC errors and a pending row remap. Background and field semantics live in ECC.

These are reference templates built on real nvidia-smi / dcgmi options; they have not been hardware-tested. Verify behaviour against the nvidia-smi documentation and the GPU memory error management guide for your driver branch before running on production hardware.

This is the longform recovery procedure for the ECC path in operational runbooks. It assumes datacenter silicon that has ECC (A100, H100/H200, B-series, RTX PRO); GeForce has no ECC toggle and is out of scope (ECC).

Trigger

Stuck pending mode. nvidia-smi -q -d ECC (or the scriptable form) shows Current and Pending disagree, e.g. nvidia-smi -e 1 was issued in a prior window but the node was never rebooted, or a reset silently failed to flip the pending mode. The mode change to ECC takes effect only after the next reboot and is persistent.¹
Reset needed after enabling ECC. ECC was just enabled and the GPU has to be brought into the new mode before it readmits work.
ECC errors accumulating. Volatile correctable counts are rising, or an uncorrectable event has fired, typically surfaced as Xid 48 (DBE), optionally followed by Xid 63 (row-remap / page-retire entry recorded) or Xid 64 (recording failed).⁴ A row remap may now be Pending : Yes, which is not live until the GPU is reset.²

A rising correctable rate is a leading indicator: drain on it before a DBE kills a multi-hour job. An uncorrectable event or Remapping Failure Occurred : Yes is an escalation, not a reset-and-return (reliability and RAS).
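
The mode and remap triggers above reduce to a small decision over four fields. The sketch below is one way to encode it, assuming the `Enabled`/`Disabled` and `Yes`/`No` strings that the Pre-checks queries print. The function name `classify_ecc_trigger` and its action labels are hypothetical, not part of nvidia-smi, and the rising-correctable-rate trigger is not covered because it needs a time series rather than a single read.

```shell
# classify_ecc_trigger CURRENT PENDING REMAP_PENDING REMAP_FAILURE
# Hypothetical helper: maps one read of the ECC mode and row-remapper fields
# to the runbook action. Inputs are assumed to be the literal strings
# Enabled/Disabled and Yes/No.
classify_ecc_trigger() {
  local current="$1" pending="$2" remap_pending="$3" remap_failure="$4"
  if [ "$remap_failure" = "Yes" ]; then
    echo "escalate-rma"          # repair exhausted: escalation, not reset-and-return
  elif [ "$current" != "$pending" ]; then
    echo "apply-pending-mode"    # stuck pending mode: reset, then reboot if needed
  elif [ "$remap_pending" = "Yes" ]; then
    echo "apply-row-remap"       # remap recorded but not live until the GPU is reset
  else
    echo "no-action"
  fi
}
```

The remap failure check comes first on purpose: a GPU with a failed remap should be diverted to RMA even when its ECC mode also happens to be stuck.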

Pre-checks

Read state before mutating anything. nvidia-smi -e and -r require root.¹

Capture current vs pending ECC mode and the error counters (the central evidence for this runbook):
```
NODE=gpu-07.dc1.internal
GPU=0                                   # target one GPU; omit -i to act on all

ssh "$NODE" nvidia-smi -q -d ECC        # Current vs Pending mode; Volatile + Aggregate counters
ssh "$NODE" nvidia-smi --query-gpu=index,name,ecc.mode.current,ecc.mode.pending \
  --format=csv
```
Read the ECC Mode block (Current / Pending: Enabled or Disabled), then the Volatile (since last driver load) and Aggregate (lifetime, persisted in inforom) sections, each split into SRAM/DRAM correctable and uncorrectable on Turing-and-later.¹ Record the numbers. They are your before/after baseline.
Check the repair state to decide reset-and-return vs RMA before touching the node:
```
ssh "$NODE" nvidia-smi -q -d ROW_REMAPPER
ssh "$NODE" nvidia-smi --query-remapped-rows=remapped_rows.pending,remapped_rows.failure \
  --format=csv,noheader
```
Pending : Yes means a row is remapped but not yet live and a reset will apply it.² Remapping Failure Occurred : Yes means repair is exhausted. Stop and divert to RMA (the GPU-fault runbook); a reset will not fix it.³
Confirm the GPU can be reset. A reset needs the device fully idle: no CUDA contexts, no X server, no second nvidia-smi holding it, nvidia-persistenced stopped, and MIG disabled (Persistence Mode, MIG).¹
Confirm reset support. Some platforms (notably NVSwitch HGX baseboards and NVL72) do not support a single-GPU in-band reset and require a node reboot. Plan the window for a reboot, not a reset (Fabric Manager, NVSwitch and NVLink).
Change ticket raised; batch sized so capacity stays above the healthy quorum. Toggling ECC is a planned change, batched per node behind cordon/drain, the same discipline as a driver roll (Rolling Driver / CUDA Upgrade).
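
On a multi-GPU node it helps to reduce the scriptable mode query to just the GPUs that need attention. The filter below is a minimal sketch, assuming the CSV layout of the `--query-gpu=index,name,ecc.mode.current,ecc.mode.pending --format=csv` command shown above (one header row, fields separated by a comma and a space). The name `ecc_mode_mismatches` is hypothetical.

```shell
# ecc_mode_mismatches: read the CSV from
#   nvidia-smi --query-gpu=index,name,ecc.mode.current,ecc.mode.pending --format=csv
# on stdin and print the index of every GPU whose Current and Pending differ.
# Assumes a header row is present (it is skipped) and that the GPU name
# contains no commas.
ecc_mode_mismatches() {
  awk -F' *, *' 'NR > 1 && $3 != $4 { print $1 }'
}
```

Piping the query from the capture step into `ecc_mode_mismatches` prints nothing when every GPU is consistent. With `--format=csv,noheader` the first GPU would be skipped, so keep the header for this filter.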

Procedure

Run per node, in batches that preserve the healthy quorum. Cordon and drain before mutating ECC mode or resetting: a reset tears down every CUDA context on the device.
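
Sizing a batch against the quorum is simple arithmetic: the number of nodes that can be out at once is the count of currently healthy nodes minus the quorum, floored at zero. A sketch with a hypothetical helper name, taking both numbers as inputs because how they are obtained depends on your scheduler:

```shell
# max_drain_batch HEALTHY QUORUM
# Hypothetical helper: how many nodes can be cordoned at once while keeping
# at least QUORUM healthy nodes in service. Prints 0 when there is no headroom.
max_drain_batch() {
  local healthy="$1" quorum="$2"
  local spare=$((healthy - quorum))
  if [ "$spare" -gt 0 ]; then
    echo "$spare"
  else
    echo 0
  fi
}
```

For example, with 16 healthy nodes and a quorum of 12, `max_drain_batch 16 12` prints 4; a result of 0 means the change should wait until capacity recovers.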

```mermaid
flowchart LR
    A["Cordon + Drain"] --> B["Stop persistenced; disable MIG"]
    B --> C["nvidia-smi -e 1 (stage mode)"]
    C --> D["nvidia-smi -r (apply in place)"]
    D --> E{"Pending cleared?"}
    E -- "yes" --> G["Verify Current Enabled"]
    E -- "no / reset blocked" --> F["Reboot node"]
    F --> G
    G --> H["Uncordon"]
```

Cordon so the scheduler stops placing work (Slurm: scontrol update nodename=<n> state=drain reason="ecc toggle"):
```
kubectl cordon "$NODE"
```

Drain running pods (keep DaemonSets; clear emptyDir):

```
kubectl drain "$NODE" --ignore-daemonsets --delete-emptydir-data --timeout=15m
```

Quiesce the device so the reset can succeed: stop persistence and tear down any MIG layout:

```
ssh "$NODE" sudo systemctl stop nvidia-persistenced
ssh "$NODE" sudo nvidia-smi -i "$GPU" -mig 0      # only if MIG was enabled
```

Stage the ECC mode. CONFIG is 1|ENABLED or 0|DISABLED; the write is persistent but takes effect only after a reboot.¹ Skip this step if the mode is already correct and you are only applying a pending row remap.
```
ssh "$NODE" sudo nvidia-smi -i "$GPU" -e 1         # enable ECC; -e 0 to disable
```
Apply in place with a GPU reset. This also commits a pending row remap. The GPU must be idle.¹²
```
ssh "$NODE" sudo nvidia-smi -i "$GPU" -r
```
Re-read mode and decide. If Current now matches the intended mode and ROW_REMAPPER Pending is No, go to Verification.
```
ssh "$NODE" nvidia-smi --query-gpu=index,ecc.mode.current,ecc.mode.pending --format=csv
```
Reboot if the reset did not flip the pending mode. On Linux a GPU reset may fail to change a pending ECC mode; NVIDIA states plainly that changes to ECC mode require a reboot, so a reboot is the reliable apply step.¹ A reboot is also mandatory when the reset is blocked or unsupported (busy device, NVSwitch baseboard), and when a row remap is still Pending after the reset (NVSwitch and NVLink).
```
ssh "$NODE" sudo reboot
```

Restore services the drain/quiesce stopped, after the node is back:

```
ssh "$NODE" sudo systemctl start nvidia-persistenced
ssh "$NODE" systemctl is-active nvidia-persistenced
```
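
After a reboot the node is unreachable for a while, so the restore commands need to wait for it to come back. The helper below is a generic bounded-retry sketch rather than anything specific to nvidia-smi; the name `wait_until` and the retry counts in the usage comment are assumptions to adjust for your boot times.

```shell
# wait_until TRIES DELAY CMD [ARGS...]
# Run CMD until it succeeds, at most TRIES times, sleeping DELAY seconds
# between attempts. Returns 0 on the first success, 1 if every attempt fails.
wait_until() {
  local tries="$1" delay="$2" attempt=1
  shift 2
  while [ "$attempt" -le "$tries" ]; do
    if "$@"; then
      return 0
    fi
    if [ "$attempt" -lt "$tries" ]; then
      sleep "$delay"             # no sleep after the final failed attempt
    fi
    attempt=$((attempt + 1))
  done
  return 1
}

# Usage sketch (assumes key-based ssh; BatchMode and ConnectTimeout are
# standard OpenSSH client options):
#   wait_until 60 10 ssh -o BatchMode=yes -o ConnectTimeout=5 "$NODE" true
```

A non-zero return after the last attempt is a signal to stop and investigate the node out of band rather than continue with the restore steps.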

Verification

Confirm the mode actually changed, the volatile counters are clean, and no repair is left pending. Do not uncordon until all three pass.

```
# 1. Mode is now what you set, with no leftover pending delta
ssh "$NODE" nvidia-smi --query-gpu=index,ecc.mode.current,ecc.mode.pending --format=csv
#    Expect: ecc.mode.current == ecc.mode.pending == Enabled

# 2. Full ECC block: Current Enabled, Volatile correctable/uncorrectable clear
ssh "$NODE" nvidia-smi -q -d ECC
#    Volatile SRAM/DRAM uncorrectable == 0 (volatile resets on reboot/reset).
#    Aggregate is a lifetime counter and does NOT clear; non-zero aggregate is expected and not a failure by itself.

# 3. No pending or failed row remap
ssh "$NODE" nvidia-smi --query-remapped-rows=remapped_rows.pending,remapped_rows.failure \
  --format=csv,noheader
#    Expect: No, No
```

Then run a real hardware diagnostic that exercises memory and surfaces ECC/remap health before readmitting work:

```
ssh "$NODE" 'dcgmi diag -r 2'    # medium HW diag incl. memory checks; -r 3 for the long stress run
```

dcgmi diag must not report Fail on the memory/ECC checks.⁵ When dcgmi is unavailable, cuda-samples/nccl-tests give a workload-level confirmation that the GPU computes correctly after the reset. But nvidia-smi -q -d ECC showing Current : Enabled with clean volatile counters is the load-bearing proof for this runbook.

Row remap needs a reboot to be confirmed, not just a reset. Even after nvidia-smi -r clears Pending, the documentation's reliable confirmation that a row remap took effect is a reboot followed by re-reading ROW_REMAPPER with Pending : No and Remapping Failure Occurred : No. Treat a remap as fully applied only after the post-reboot read.²
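
Checks 1 and 3 above are mechanical enough to gate the uncordon in a script. The function below is a sketch under stated assumptions: it takes one data row from each query (without the CSV header), expects the `Enabled`/`Disabled` and `Yes`/`No` strings shown in the expected output above, and uses a hypothetical name, `ecc_verify_gate`. It does not inspect the volatile counters from check 2 or the dcgmi result, which still need to be read separately.

```shell
# ecc_verify_gate INTENDED MODE_LINE REMAP_LINE
#   INTENDED    Enabled or Disabled
#   MODE_LINE   one data row of: index,ecc.mode.current,ecc.mode.pending
#   REMAP_LINE  one data row of: remapped_rows.pending,remapped_rows.failure
# Prints PASS and returns 0 only when the mode matches and no remap is
# pending or failed; otherwise prints the failing fields and returns 1.
ecc_verify_gate() {
  local intended="$1" mode remap current pending remap_pending remap_failure
  mode=$(printf '%s\n' "$2" | tr -d ' ')      # strip the spaces after commas
  remap=$(printf '%s\n' "$3" | tr -d ' ')
  current=$(printf '%s\n' "$mode" | cut -d, -f2)
  pending=$(printf '%s\n' "$mode" | cut -d, -f3)
  remap_pending=$(printf '%s\n' "$remap" | cut -d, -f1)
  remap_failure=$(printf '%s\n' "$remap" | cut -d, -f2)
  if [ "$current" != "$intended" ] || [ "$pending" != "$intended" ]; then
    echo "FAIL mode: current=$current pending=$pending want=$intended"
    return 1
  fi
  if [ "$remap_pending" != "No" ] || [ "$remap_failure" != "No" ]; then
    echo "FAIL remap: pending=$remap_pending failure=$remap_failure"
    return 1
  fi
  echo "PASS"
}
```

Calling it as `ecc_verify_gate Enabled "0, Enabled, Enabled" "No, No"` prints `PASS`; any other combination returns non-zero, so a wrapper can refuse to uncordon the node.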

Rollback

ECC mode is a single persistent setting. To revert, stage the opposite mode and apply the same way (reset, then reboot if the reset does not flip pending).

```
ssh "$NODE" sudo systemctl stop nvidia-persistenced
ssh "$NODE" sudo nvidia-smi -i "$GPU" -e 0      # revert: disable ECC (use -e 1 to re-enable)
ssh "$NODE" sudo nvidia-smi -i "$GPU" -r        # apply in place; GPU must be idle
ssh "$NODE" nvidia-smi --query-gpu=ecc.mode.current,ecc.mode.pending --format=csv
# Only if Current did not flip: reboot the node (ECC mode changes require a reboot),
# then wait for it to come back before running the remaining commands.
ssh "$NODE" sudo reboot
ssh "$NODE" sudo systemctl start nvidia-persistenced
# Uncordon only after the Verification checks pass for the reverted mode.
kubectl uncordon "$NODE"
```

Notes that change the decision:

Row remapping does not roll back. It is persistent for the life of the GPU; disabling ECC does not un-retire remapped rows.²
A pending row remap requires a reboot to be confirmed: if rollback leaves ROW_REMAPPER Pending : Yes, reboot before returning the node.²
Remapping Failure Occurred : Yes is not recoverable here. Repair is exhausted; do not loop on reset/reboot. RMA the GPU (the GPU-fault runbook).³
If the node fails dcgmi diag after both the toggle and the rollback, treat it as hardware and divert to the GPU-fault / RMA path (the GPU-fault runbook).

the GPU-fault runbook: GPU fault, drain, reset, RMA (escalation when remapping has failed or diag fails on both modes).
the driver-upgrade runbook: Rolling driver / CUDA upgrade (same cordon/drain/reboot primitives; ECC toggles are batched the same way).
operational runbooks: Operational runbooks index.
ECC: ECC concepts, counters, and row-remapping field reference behind this procedure.

References

nvidia-smi documentation (-e/--ecc-config: "takes effect after the next reboot and is persistent"; "Changes to ECC mode require a reboot"; -r/--gpu-reset requirements; -q -d ECC|ROW_REMAPPER|PAGE_RETIREMENT; Volatile vs Aggregate counters): https://docs.nvidia.com/deploy/nvidia-smi/index.html
NVIDIA GPU Memory Error Management (ECC, SRAM/DRAM categories, row remapping, dynamic page offlining): https://docs.nvidia.com/deploy/a100-gpu-mem-error-mgmt/index.html
Row Remapping — fields and "requires a GPU reset to take effect ... persistent throughout the life of the GPU": https://docs.nvidia.com/deploy/a100-gpu-mem-error-mgmt/row-remapping.html
RMA Policy / thresholds for row remapping (8-rows / already-remapped / 512-total): https://docs.nvidia.com/deploy/a100-gpu-mem-error-mgmt/rma-policy-thresholds-for-row-remapping.html
NVIDIA Xid errors (Xid 48 DBE; Xid 63 remap/retire recorded; Xid 64 failure): https://docs.nvidia.com/deploy/xid-errors/index.html
DCGM diagnostics (run levels -r): https://docs.nvidia.com/datacenter/dcgm/latest/user-guide/dcgm-diagnostics.html

¹ nvidia-smi documentation: -e, --ecc-config=CONFIG ("Set the ECC mode for the target GPUs ... Requires root ... This setting takes effect after the next reboot and is persistent"), ECC Mode "Current"/"Pending" fields ("Changes to ECC mode require a reboot"), -r, --gpu-reset ("Requires root. There can't be any applications using these devices"), and "Volatile error counters track the number of errors detected since the last driver load. Aggregate error counts persist indefinitely and thus act as a lifetime counter." https://docs.nvidia.com/deploy/nvidia-smi/index.html
² NVIDIA GPU Memory Error Management, Row Remapping: fields Correctable Error, Uncorrectable Error, Pending, Remapping Failure Occurred; "requires a GPU reset to take effect and will remain persistent throughout the life of the GPU." Pending : Yes indicates a row pending remapping that is not yet live. https://docs.nvidia.com/deploy/a100-gpu-mem-error-mgmt/row-remapping.html
³ NVIDIA GPU Memory Error Management, RMA Policy: Remapping Failure Occurred is set on a remap for an uncorrectable error on a bank with eight uncorrectable-error rows already remapped, on a remap of an already-remapped row, or after 512 total uncorrectable remappings. https://docs.nvidia.com/deploy/a100-gpu-mem-error-mgmt/rma-policy-thresholds-for-row-remapping.html
⁴ NVIDIA Xid errors: Xid 48 "A DBE (Double Bit Error) has occurred"; Xid 63 records a row-remapping entry to the inforom (A100+) or a page successfully retired (earlier GPUs); Xid 64 indicates that recording failed. https://docs.nvidia.com/deploy/xid-errors/index.html
⁵ NVIDIA DCGM diagnostics: run levels selected with -r (1/short, 2/medium, 3/long, 4/extended); medium and above exercise GPU memory. https://docs.nvidia.com/datacenter/dcgm/latest/user-guide/dcgm-diagnostics.html