
Runbook: image drift across fleet¶

Scope: converge a GPU fleet back to one pinned software baseline when non-reproducible failures trace to nodes running different driver / CUDA / GSP-firmware / toolkit versions than the golden image. Inventory the fleet, diff against the baseline, cordon the drifted nodes, re-image or config-converge them to the pin, reboot/validate, then enforce the pin in git so drift cannot silently return. Severity: fleet-wide hygiene, executed node-by-node behind cordon/drain inside a maintenance window, never a big-bang reconverge.

All commands below are reference templates, not hardware-tested. Package names, field availability, and dcgmi test coverage vary by distro, driver branch, and DCGM build, so confirm against the cited docs in References before running anything in production. Pin the exact baseline your fleet is standardised on; the version numbers shown here are placeholders.

Drift is the failure mode behind a large share of "works on most nodes, fails on some" tickets. The stack on a GPU node is a set of versioned pieces that must agree: driver, CUDA driver (libcuda.so), GSP firmware shipped inside the driver, container toolkit, DCGM, and on NVSwitch systems a branch-matched Fabric Manager (see Driver Install and Lifecycle). When one node carries a different combination, a collective job that spans it gets non-deterministic NCCL behaviour, a CUDA-version error, or a silently slower rank. This runbook is the corrective converge. The baseline definition and how images are built and pinned live in Image and Config Management; a planned branch move, as opposed to unplanned drift, is covered by the rolling driver upgrade runbook.

Trigger¶

A failure that is not reproducible across the fleet and correlates with which node the work landed on, not the job itself. Any of:

A multi-node training/inference job fails or hangs only when a specific node is in the placement set, and re-running without that node succeeds (runbook: NCCL hang once you have the node).
nvidia-smi / NVML reports a Driver/library version mismatch on some nodes but not others, or a CUDA app errors with a version/forward-compat message on a subset (runbook: GSP firmware mismatch for the single-node firmware case).
A fleet audit (below) shows nodes off the pinned driver_version, CUDA toolkit, container-toolkit, DCGM, or VBIOS, typically because of an unattended-upgrades run, a one-off manual fix, a re-image from a stale template, or a missed driver upgrade batch.
Health gating (GPU Diagnostics and Validation) or telemetry shows a cohort of nodes that diverge from the rest on a version label.

This runbook is fleet convergence. If a single node is broken and the fleet is otherwise consistent, use the targeted runbook for that fault (GSP/driver mismatch, kernel/GPU-missing, Fabric Manager) rather than reconverging the whole fleet.

Pre-checks¶

You cannot fix drift you have not measured. Establish the golden baseline as the source of truth (it lives in git / Image and Config Management), inventory every node, then diff. Do this read-only before mutating anything.

Define the baseline (the pin)¶

The pinned set is the authoritative version of every drift-prone component. Keep it in git so the diff target is a single reviewed artifact, not tribal memory (Image and Config Management, Driver Install and Lifecycle):

# baseline.env -- the pinned fleet baseline, version-controlled. Values are placeholders;
# substitute the exact versions your fleet is standardised on and verify against the
# NVIDIA install/release notes before adopting.
DRIVER_VERSION=580.65.06          # nvidia-smi driver_version
CUDA_TOOLKIT=13.0                 # nvcc --version / cuda-toolkit package
CONTAINER_TOOLKIT=1.17.0          # nvidia-container-toolkit
DCGM_VERSION=4.1.0                # datacenter-gpu-manager
FABRIC_MANAGER=580.65.06          # NVSwitch systems only; branch-matched to driver
VBIOS_VERSION=                    # per-GPU-SKU; populate from a known-good node
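
Before diffing against the pin, it can help to confirm the pin itself is fully populated, since an empty value (such as the VBIOS_VERSION placeholder above) would make every node look converged or drifted by accident. The following is a minimal sketch, not part of the original tooling: `check_baseline` is a hypothetical helper name, and the key list is assumed to mirror the baseline.env shown above.

```shell
# check_baseline FILE: print "UNSET <KEY>" for every pinned key that is
# missing or empty in FILE, and exit non-zero if any were found.
# Hypothetical helper; the key list is an assumption taken from baseline.env.
check_baseline() (
  file=$1
  status=0
  for key in DRIVER_VERSION CUDA_TOOLKIT CONTAINER_TOOLKIT DCGM_VERSION \
             FABRIC_MANAGER VBIOS_VERSION; do
    # Keep the text after KEY=, drop a trailing "# comment" and all whitespace.
    value=$(sed -n "s/^${key}=\([^#]*\).*/\1/p" "$file" | tr -d '[:space:]')
    if [ -z "$value" ]; then
      echo "UNSET $key"
      status=1
    fi
  done
  exit "$status"
)

# Usage: check_baseline baseline.env || echo "fix the pin before diffing"
```

On fleets without NVSwitch, FABRIC_MANAGER is legitimately unused, so drop it from the key list rather than inventing a value.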

Inventory scan across the fleet¶

Collect the actual versions per node. nvidia-smi --query-gpu takes a comma-separated field list and requires --format=csv; add noheader,nounits for clean parsing.¹² Driver, VBIOS, and GPU name come from there; CUDA driver and firmware from nvidia-smi -q³; OS packages from dpkg -l. Run across nodes with your existing fan-out (Ansible ad-hoc shown; pdsh/clush work too):

# Per-node GPU stack: driver + VBIOS straight from the query interface.
# vbios_version and driver_version are valid --query-gpu fields. [smi][queries]
ansible gpu -i inventory/hosts.ini -m shell -a \
  'nvidia-smi --query-gpu=index,name,driver_version,vbios_version --format=csv,noheader'

# CUDA *driver* version + GSP firmware version (full per-GPU detail; -q dumps these). [smiq]
ansible gpu -i inventory/hosts.ini -m shell -a \
  'nvidia-smi -q | grep -iE "CUDA Version|GSP Firmware Version|VBIOS Version"'

# OS package state: driver, CUDA toolkit, container toolkit, DCGM, Fabric Manager.
ansible gpu -i inventory/hosts.ini -m shell -a \
  'dpkg -l | grep -iE "nvidia|cuda|datacenter-gpu-manager|libnvidia" || true'   # Debian/Ubuntu
# RHEL/Rocky equivalent: rpm -qa | grep -iE "nvidia|cuda|datacenter-gpu-manager"

For containerised stacks also capture the CUDA toolkit the node actually exposes and the DCGM build, since these drift independently of the kernel driver (NVIDIA Container Toolkit and CDI, GPU Diagnostics and Validation):

ansible gpu -i inventory/hosts.ini -m shell -a 'nvcc --version | tail -1 || true'
ansible gpu -i inventory/hosts.ini -m shell -a 'dcgmi --version || nv-hostengine --version || true'
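
The grep output above is easy to read but awkward to diff. One way to normalise it is to reduce `nvidia-smi -q` to one KEY=VALUE line per distinct value. This is a sketch under assumptions: `smi_q_to_kv` and the key names it emits are hypothetical, and the "Label : value" layout of `nvidia-smi -q` is assumed to match your driver branch, so check it against a real node first.

```shell
# smi_q_to_kv: read `nvidia-smi -q` text on stdin and emit KEY=VALUE lines.
# Key names are illustrative, not an NVIDIA convention.
smi_q_to_kv() {
  awk -F' *: *' '
    /^ *Driver Version/       { print "DRIVER_VERSION=" $2 }
    /^ *CUDA Version/         { print "CUDA_DRIVER=" $2 }
    /^ *VBIOS Version/        { print "VBIOS_VERSION=" $2 }
    /^ *GSP Firmware Version/ { print "GSP_FIRMWARE=" $2 }
  ' | sort -u   # per-GPU duplicates collapse; mixed values stay visible
}

# Usage on a node: nvidia-smi -q | smi_q_to_kv
```

Because identical per-GPU values collapse under `sort -u`, a key that appears twice in the result means the GPUs inside one node disagree with each other, which is worth flagging before any fleet-level diff.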

Diff vs the golden baseline¶

Reduce the scan to a per-node pass/fail against baseline.env. The drifted set is your work list:

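# Quick scalar check: which nodes are NOT on the pinned driver. Extend per component.
# Ad-hoc shell output prints a "host | CHANGED | rc=0 >>" header line per node,
# followed by that node's command output, so the header is what carries the node name.
source baseline.env
ansible gpu -i inventory/hosts.ini -m shell -a \
  "nvidia-smi --query-gpu=driver_version --format=csv,noheader | sort -u" \
  | awk -v want="$DRIVER_VERSION" '/ \| .*>>$/{node=$1; next} /^[0-9]/{if($0!=want) print node" DRIFT "$0}'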
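
Extending the scalar check to every component can be sketched as a generic comparison of observed KEY=VALUE lines against the pin file. The helper name `diff_against_pin` is hypothetical, and it assumes the observed keys are named exactly like the keys in baseline.env; keys with no pinned value are skipped rather than reported.

```shell
# diff_against_pin NODE BASELINE_FILE: read observed KEY=VALUE lines on stdin
# and print "NODE DRIFT KEY got=... want=..." for each value off the pin.
# Exits non-zero if any drift was found. Hypothetical helper name.
diff_against_pin() (
  node=$1
  baseline=$2
  status=0
  while IFS='=' read -r key got; do
    [ -n "$key" ] || continue
    # Pinned value: text after KEY=, minus trailing comment and whitespace.
    want=$(sed -n "s/^${key}=\([^#]*\).*/\1/p" "$baseline" | tr -d '[:space:]')
    [ -n "$want" ] || continue          # key not pinned: nothing to compare
    if [ "$got" != "$want" ]; then
      echo "$node DRIFT $key got=$got want=$want"
      status=1
    fi
  done
  exit "$status"
)

# Usage: printf 'DRIVER_VERSION=%s\n' "$observed" | diff_against_pin gpu-07 baseline.env
```

The non-zero exit status makes the helper usable as a gate in a loop over nodes, with the printed lines forming the work list.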

Confirm before converging:

The baseline is correct and adopted. Diffing against an out-of-date pin just churns the fleet. The baseline is reviewed in git (Image and Config Management, SRE, Platform and MLOps Practices).
Drift is real, not a query artifact. A node where nvidia-smi itself fails is a broken node (route to GSP/driver-mismatch or kernel/GPU-missing), not a drift node. Drift means the node is healthy but on the wrong version.
Batch size preserves the healthy quorum. Pick a batch so draining the drifted nodes does not drop fleet capacity below what running jobs and SLOs need. Converge in batches, not all at once.
Maintenance window agreed; rollback variable identified: the previous baseline (PREV_DRIVER_VERSION etc.) is recorded before you start.
One driver source per node. If a node is managed by the GPU Operator's driver containers, do not also converge a host driver; pick one (Driver Install and Lifecycle, NVIDIA Container Toolkit and CDI).
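
The batch-size check above is simple arithmetic: the largest safe batch is whatever capacity remains after subtracting the nodes that running jobs and SLOs need and the nodes already out of service. A minimal sketch, with `max_batch` as a hypothetical helper and all three inputs supplied by you from your scheduler's view of the fleet:

```shell
# max_batch TOTAL REQUIRED UNAVAILABLE: print the largest number of nodes
# that can be drained at once without dropping below REQUIRED healthy nodes.
# Hypothetical helper; the inputs are counts of nodes, not GPUs.
max_batch() (
  total=$1
  required=$2
  unavailable=$3
  spare=$(( total - required - unavailable ))
  [ "$spare" -gt 0 ] || spare=0     # never report a negative batch
  echo "$spare"
)

# Example: 64 nodes, 52 needed for running jobs and SLOs, 2 already down.
max_batch 64 52 2    # prints 10
```

A result of 0 means the window cannot start yet: either wait for load to drop or renegotiate the maintenance window rather than shrinking the healthy quorum.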

Procedure¶

Run per drifted node, in batches that preserve the healthy quorum. Cordon and drain before mutating. Re-imaging or reinstalling the driver under a live workload corrupts in-flight jobs and can wedge the module unload. NODE is the Kubernetes node name; the Slurm equivalent of each scheduler step is given inline.

NODE=gpu-07.dc1.internal
source baseline.env               # DRIVER_VERSION, FABRIC_MANAGER, ... = the pin
DRIVER_BRANCH=580                 # branch metapackage for the pinned driver; do NOT guess

Cordon so the scheduler stops placing work on the drifted node:
```
kubectl cordon "$NODE"
```
Slurm: scontrol update NodeName="$NODE" State=DRAIN Reason="image drift converge".⁸
Drain running pods, keeping DaemonSets (GPU Operator, DCGM exporter) and clearing emptyDir:
```
kubectl drain "$NODE" --ignore-daemonsets --delete-emptydir-data --timeout=15m
```
On Slurm, the node moves DRAIN -> DRAINED once running jobs finish; wait for DRAINED before mutating.⁸
Converge to the baseline. Two valid paths; pick the one your fleet uses, never both on the same node:

(a) Re-image to the golden image (cattle, not pets; preferred when you have PXE/network install). Reprovision the node from the pinned image so every component is the baseline by construction; the image build and PXE/provisioning path are owned by Image and Config Management and Provisioning and Scheduling. After the node reboots into the fresh image, skip the reboot step and continue at "Bring services up".

(b) Config-converge with Ansible to the pinned versions in place (when re-imaging is not available). An idempotent role reinstalls the driver, branch-matched Fabric Manager, container toolkit and DCGM to the pin and rebuilds DKMS (Driver Install and Lifecycle). Preview the change first with --check --diff. Check mode is a dry run that reports what would change, and --diff shows the before/after, so you see exactly the drift being corrected before applying:⁷

# Dry run: show the drift this converge will correct, change nothing.
ansible-playbook -i inventory/hosts.ini site.yml --limit "$NODE" \
  -e "driver_branch=$DRIVER_BRANCH" --check --diff

# Apply the converge to the pinned baseline.
ansible-playbook -i inventory/hosts.ini site.yml --limit "$NODE" \
  -e "driver_branch=$DRIVER_BRANCH"

If the scan showed two half-installed driver branches on the node, purge the stray branch before converging so the result is a single clean branch. Inspect first, never a blanket purge on a shared image (runbook: GSP firmware mismatch has the purge pattern).

Reboot to load the realigned kernel module and firmware cleanly from a known state (config-converge path; the re-image path already booted fresh):
```
ansible "$NODE" -i inventory/hosts.ini -b -m reboot
```

Bring services up (the package install normally enables them; confirm explicitly) before validating:

ssh "$NODE" 'sudo systemctl enable --now nvidia-persistenced'
ssh "$NODE" 'sudo systemctl enable --now nvidia-fabricmanager'   # NVSwitch/HGX/NVL only

Uncordon only after Verification passes (next section):

kubectl uncordon "$NODE"
# Slurm: scontrol update NodeName="$NODE" State=RESUME   # DRAIN -> IDLE [slurm]
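
On Slurm, the "wait for DRAINED before mutating" rule in the drain step can be sketched as a small polling helper. This is an illustration under assumptions: the function names are hypothetical, and it relies on `sinfo -h -n NODE -o '%T'` printing the extended state name ("draining" while jobs still run, "drained" once they finish), which you should confirm against your Slurm version.

```shell
# slurm_ready_to_mutate STATE: succeed only when the node is fully drained.
# STATE is the extended state string printed by `sinfo -o '%T'`.
slurm_ready_to_mutate() {
  case $1 in
    drained*) return 0 ;;   # no jobs left; safe to converge
    *)        return 1 ;;   # draining, allocated, idle, down, ...
  esac
}

# wait_drained NODE [INTERVAL_SECONDS]: poll until the node reports drained.
wait_drained() {
  until slurm_ready_to_mutate "$(sinfo -h -n "$1" -o '%T')"; do
    sleep "${2:-30}"
  done
}

# Usage: wait_drained "$NODE" 60 && echo "$NODE is drained; safe to converge"
```

The loop has no upper bound on purpose, because a long-running job can legitimately hold the node in draining; wrap the call in `timeout` if the maintenance window needs a hard limit.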

Verification¶

Do not readmit on a green nvidia-smi alone. Prove two things: the node now matches the baseline exactly, and the driver/NVML/CUDA stack is sound. Run on the converged node after reboot.

Versions equal the pin. Re-run the same inventory query and assert it matches baseline.env; this is the direct proof the drift is gone:

source baseline.env
GOT=$(ssh "$NODE" 'nvidia-smi --query-gpu=driver_version --format=csv,noheader | sort -u')
test "$GOT" = "$DRIVER_VERSION" && echo "DRIVER OK ($GOT)" || echo "STILL DRIFTED: $GOT != $DRIVER_VERSION"
ssh "$NODE" 'nvidia-smi -q | grep -iE "CUDA Version|GSP Firmware Version|VBIOS Version"'
ssh "$NODE" 'dpkg -l | grep -iE "nvidia-fabricmanager|datacenter-gpu-manager|nvidia-container-toolkit"'

Every drift-prone component (driver, CUDA driver, GSP firmware, Fabric Manager, container toolkit, DCGM, VBIOS) must read the pinned value. A single field still off the baseline means the converge did not take, so do not uncordon.

DCGM software/deployment validation, the real proof the stack is healthy. dcgmi diag -r 1 is the Quick / System Validation level (seconds) and runs the software/deployment plugin, which checks exactly the layer drift corrupts: NVML library access and compatibility, CUDA library availability and versions, Nouveau-driver conflicts, device-node permissions, pending page retirement, and row-remap state.⁵⁴ The active diagnostic requires exclusive access to the GPUs, but the node is already cordoned/drained, so it will not contend with a job (GPU Diagnostics and Validation):⁶
```
ssh "$NODE" 'dcgmi diag -r 1'
```
Every line must read Pass (or Skip where not applicable), no Fail. A clean -r 1 confirms the deployment layer is correct on the converged versions.
Hardware diagnostic before returning to service. -r 1 is software-only; once clean, run the long hardware diagnostic to clear the node for collective workloads (PCIe/NVLink, memory bandwidth, NCCL, targeted stress):⁴
```
ssh "$NODE" 'dcgmi diag -r 3'
```
Must contain no Fail. On NVSwitch systems also confirm the fabric formed: ssh "$NODE" 'systemctl is-active nvidia-fabricmanager' reads active (Fabric Manager). A node that fails -r 3 on the correct baseline is suspect hardware, not drift; divert to GPU fault / RMA.
Smoke the original failure. Land a short GPU job on the converged node; for the multi-node case that triggered this, re-run a 2-node nccl-tests all_reduce_perf including the formerly-drifted node and confirm busbw at line rate. The failure that was non-reproducible must now be gone before the node rejoins the pool.¹⁰

Repeat per batch; do not advance to the next batch until the current one is green on all four checks.
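
The "no Fail" rule for the DCGM checks can be enforced mechanically instead of by eye. The sketch below assumes the default text output of `dcgmi diag`, where each result is a table row of the form `| Test name | Pass |`; that layout is an assumption that can differ between DCGM builds, and `diag_gate` is a hypothetical helper name. It treats empty output as a failure so a diagnostic that never ran is not read as a pass.

```shell
# diag_gate: read `dcgmi diag` text output on stdin.
# Prints any failing rows and exits non-zero if a result cell reads Fail,
# or if no Pass row is present at all (empty or truncated output).
diag_gate() (
  out=$(cat)
  if printf '%s\n' "$out" | grep -E '\|[^|]*\|[[:space:]]*Fail'; then
    exit 1                      # failing rows were printed above
  fi
  # Guard against empty or truncated output being read as success.
  printf '%s\n' "$out" | grep -Eq '\|[^|]*\|[[:space:]]*Pass'
)

# Usage: ssh "$NODE" 'dcgmi diag -r 1' | diag_gate && echo "deployment layer OK"
```

Skip rows pass through untouched, matching the rule that Skip is acceptable where a test does not apply.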

Rollback¶

If the pinned baseline will not come up clean on a node (DKMS build fails, dcgmi diag -r 1 still fails, GPUs do not enumerate at the new versions), roll that node back to the previous baseline (the same single-variable converge, just the prior version set), then reboot and re-verify. Pin the previous baseline so config management does not re-push the failing one.

PREV_DRIVER_BRANCH=535            # the last baseline this node ran clean
kubectl cordon "$NODE" && kubectl drain "$NODE" --ignore-daemonsets --delete-emptydir-data
ansible-playbook -i inventory/hosts.ini site.yml --limit "$NODE" \
  -e "driver_branch=$PREV_DRIVER_BRANCH"
ansible "$NODE" -i inventory/hosts.ini -b -m reboot
ssh "$NODE" 'dcgmi diag -r 1' && kubectl uncordon "$NODE"
# Slurm readmit: scontrol update NodeName="$NODE" State=RESUME  [slurm]

Then:

Pin the node to the previous baseline in git (inventory / image tag) so the next converge run does not re-push the broken pin, and open a follow-up to fix the new baseline before retrying the move (Image and Config Management, runbook: driver upgrade).
Enforce the pin going forward. Drift recurs when nothing reconciles state. Declare the baseline in git and let it self-heal: hold the packages so unattended-upgrades cannot silently move a node (apt-mark hold, see Driver Install and Lifecycle), and run the converge play (or the GitOps reconciler, Argo CD / Flux for the Kubernetes-managed surface) on a schedule, using ansible-playbook --check --diff as a periodic drift detector that reports any node that wandered off the pin without changing it.⁷⁹ The fix for drift is not a one-off reconverge; it is making the pinned baseline the reconciled state.
Escalate, do not keep reconverging, if a node fails dcgmi diag on both the new and the previous baseline: treat it as hardware and divert to GPU fault / RMA.
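
The scheduled drift detector described above can be sketched by reading the PLAY RECAP that `ansible-playbook --check` prints: in check mode, a host with changed greater than zero is a host the converge would modify, which is to say a drifted node. The helper name `drifted_from_recap` is hypothetical, and the recap line layout (`host : ok=N changed=N ...`) is assumed to be the default callback output.

```shell
# drifted_from_recap: read `ansible-playbook --check` output on stdin and
# print each host whose PLAY RECAP line reports changed > 0.
drifted_from_recap() {
  awk '
    /^PLAY RECAP/ { in_recap = 1; next }
    in_recap && match($0, /changed=[0-9]+/) {
      # "changed=" is 8 characters; the digits follow it.
      n = substr($0, RSTART + 8, RLENGTH - 8)
      if (n + 0 > 0) print $1
    }
  '
}

# Usage (scheduled, read-only):
#   ansible-playbook -i inventory/hosts.ini site.yml --check --diff | drifted_from_recap
```

Hosts that are unreachable or failed are not reported here; they are broken nodes rather than drifted ones and belong to the targeted runbooks.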

Related runbooks¶

Rolling driver / CUDA upgrade: the planned fleet-wide version move; same cordon/drain + single-variable converge primitives. Drift is what happens when a node misses one of its batches.
GSP firmware / driver mismatch: the single-node case where driver and firmware are on different branches (Failed to initialize NVML); the targeted fix when drift is one broken node, not a version cohort.
Kernel modules / GPU missing: when a node's GPUs are absent entirely (not merely the wrong version).
Fabric Manager failure: when a converged node's modules load clean but the NVLink/NVSwitch domain will not form.
GPU fault / RMA: escalation when a node fails diag on both baselines.
Operational runbooks: runbook index.

References¶

1. NVIDIA System Management Interface manual: --query-gpu takes a comma-separated field list and requires --format=csv (mandatory); -q/--query dumps full per-GPU detail; general options -i/--id, -L/--list-gpus, -f/--filename. https://docs.nvidia.com/deploy/nvidia-smi/index.html
2. NVIDIA, "Useful nvidia-smi Queries": full list of --query-gpu fields via nvidia-smi --help-query-gpu; driver_version and vbios_version are valid query fields (nvidia-smi --query-gpu=gpu_name,vbios_version --format=csv); add noheader,nounits for parseable output. https://nvidia.custhelp.com/app/answers/detail/a_id/3751/~/useful-nvidia-smi-queries
3. nvidia-smi manual: -q/--query reports per-GPU "CUDA Version", "GSP Firmware Version" (an alphanumeric string), and "VBIOS Version" (the BIOS of the GPU board); use these for the components not exposed as --query-gpu scalars. https://docs.nvidia.com/deploy/nvidia-smi/index.html
4. NVIDIA DCGM Diagnostics: run levels are -r 1 Quick / System Validation (seconds), -r 2 Medium / Extended System Validation, -r 3 Long / System HW Diagnostics (PCIe/NVLink, memory bandwidth, NCCL, targeted stress), -r 4 Extra Long / Extended HW Diagnostics (memory stress, power). https://docs.nvidia.com/datacenter/dcgm/latest/user-guide/dcgm-diagnostics.html
5. NVIDIA DCGM Diagnostics: the software/deployment plugin validates compute-environment readiness: NVML library access and compatibility, CUDA library availability and versions, Nouveau-driver conflicts, device-node accessibility and permissions, pending page retirement, and row-remap (pending/failed) state. A YAML config (-c custom-diag-tests.yaml) and JSON output (-j) are supported. https://docs.nvidia.com/datacenter/dcgm/latest/user-guide/dcgm-diagnostics.html
6. NVIDIA DCGM: the active dcgmi diag diagnostic works "by running real workloads and analyzing the results"; the active levels "require exclusive access to the target GPUs," so the node must be cordoned/drained first. https://docs.nvidia.com/datacenter/dcgm/latest/user-guide/feature-overview.html
7. Ansible Community Documentation, "Validating tasks: check mode and diff mode": --check runs the playbook as a dry run, reporting changes without applying them (drift preview); --diff shows before/after for changed resources; a converged, idempotent play run again reports no changes, which makes --check --diff usable as a scheduled drift detector. https://docs.ansible.com/projects/ansible/latest/playbook_guide/playbooks_checkmode.html
8. Slurm scontrol manual: scontrol update NodeName=<n> State=DRAIN Reason="..." stops new jobs and lets running jobs finish (DRAIN -> DRAINED); State=RESUME transitions a node from DRAIN/DRAINING/DOWN/REBOOT back to IDLE. https://slurm.schedmd.com/scontrol.html
9. In-KB definition. GitOps: cluster state declared in git and reconciled by Argo CD / Flux, so drift self-heals (Glossary). https://argo-cd.readthedocs.io/en/stable/ and https://fluxcd.io/flux/concepts/
10. NVIDIA nccl-tests (e.g. all_reduce_perf, busbw at line rate). https://github.com/NVIDIA/nccl-tests