
Runbook: GSP firmware / driver mismatch

Scope: recover a node where a partial driver change left the kernel modules and the GSP (GPU System Processor) firmware on different branches, so nvidia-smi fails with "Failed to initialize NVML", the modules will not load, and dmesg shows a GSP/RM firmware-load failure. Realign driver and GSP to one branch behind cordon/drain, rebuild DKMS, reboot, verify. Severity: single-node fault, one node at a time; on a fleet this is the cleanup after a botched rolling driver upgrade.

All commands below are a reference template and have not been hardware-tested. Pin the exact branch your fleet is standardised on and validate on one node before touching others. Package names differ by distro and driver branch, so confirm against the install guides in References.

The concept this runbook recovers is described in GPU firmware and GSP: GSP firmware ships inside the driver package and is not versioned separately, so the firmware blob the driver loads at init must match the driver branch.¹ Branch policy (which LTSB/Production branch you should be on) lives in driver versions and branches.

Trigger

A node that was healthy fails after a driver touch: an interrupted upgrade, an unattended-upgrades run, a runfile installed over a packaged driver, or a kernel bump that rebuilt modules against a firmware directory that no longer matches. Symptoms, any of:

nvidia-smi exits non-zero with Failed to initialize NVML: Driver/library version mismatch: the loaded kernel module and the userspace NVML/library are on different versions.³
nvidia-smi reports "No devices were found", or the modules will not load at all.
dmesg shows a GSP firmware load failure and RmInitAdapter failing. The authoritative signature, from the NVIDIA open-kernel-modules tracker (an H200 node where 3 of 8 GPUs failed to init after an upgrade left the firmware path incomplete):⁴

```
nvidia 0000:06:00.0: loading /lib/firmware/nvidia/570.172.08/gsp_ga10x.bin failed with error -4
nvidia 0000:06:00.0: Direct firmware load for nvidia/570.172.08/gsp_ga10x.bin failed with error -4
NVRM: RmFetchGspRmImages: No firmware image found
NVRM: GPU 0000:06:00.0: RmInitAdapter failed! (0x61:0x56:1770)
NVRM: GPU 0000:06:00.0: rm_init_adapter failed, device minor number 5
```

Error -4 or -2 on the gsp_*.bin path means the driver could not load the matching GSP blob: the /lib/firmware/nvidia/<driver-version>/ directory for the currently loaded module version is missing or incomplete.¹⁴
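To turn that signature into a concrete list of missing files, the failed paths can be pulled out of the kernel log. The following is a minimal sketch, assuming the "Direct firmware load for ... failed" message format quoted above; the function name failed_gsp_paths is hypothetical and not part of any NVIDIA tooling.

```shell
# Print each distinct firmware path the kernel failed to load, relative to
# /lib/firmware (e.g. nvidia/570.172.08/gsp_ga10x.bin).
# Reads kernel log text on stdin. Hypothetical helper, not an NVIDIA tool.
failed_gsp_paths() {
  sed -n 's|.*Direct firmware load for \(nvidia/[^ ]*gsp_[^ ]*\.bin\) failed.*|\1|p' \
    | sort -u
}

# Example wiring on a node (assumes dmesg is readable, possibly via sudo):
#   sudo dmesg | failed_gsp_paths | while read -r p; do
#     [ -e "/lib/firmware/$p" ] || echo "missing: /lib/firmware/$p"
#   done
```

Each path printed names a driver version directory the loaded module expected, which is the version to compare against the installed packages in the pre-checks.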

If instead the GPUs are physically absent from lspci, or modules build but no nvidia device nodes appear, this is the wrong runbook; go to kernel modules / GPU missing. If the modules load and nvidia-smi is clean but the NVLink domain will not form on an NVSwitch system, go to Fabric Manager failure.

Pre-checks

Confirm the failure is a branch mismatch (not dead hardware, a Secure Boot signing failure, or a missing kernel-headers build) before reinstalling. Run on the affected node.

```
# 1. The userspace error itself (records the symptom in the ticket)
nvidia-smi || true

# 2. Loaded kernel module version vs the firmware directories actually present.
#    The directory named for THIS version must exist and hold the arch blob.
#    /sys/module/nvidia/version does not exist when the module is not loaded.
MODVER="$(cat /sys/module/nvidia/version 2>/dev/null)"   # e.g. 570.172.08
echo "loaded module version: ${MODVER:-<module not loaded>}"
ls -1 /lib/firmware/nvidia/               # one dir per installed driver version
ls -1 "/lib/firmware/nvidia/${MODVER}/" 2>&1 \
  | grep -E 'gsp_.*\.bin' || echo "NO MATCHING GSP FIRMWARE DIR FOR LOADED MODULE"

# 3. The kernel's own account of the failure (firmware path + RmInitAdapter).
#    sudo because many distros restrict dmesg to root.
sudo dmesg | grep -iE 'NVRM|gsp|RmInitAdapter|firmware' | tail -40

# 4. Firmware versions as the driver reports them (when nvidia-smi still runs at all).
#    -q dumps full per-GPU detail incl. the GSP Firmware Version field. [smi]
nvidia-smi -q | grep -iE 'Driver Version|VBIOS Version|GSP Firmware Version' || true

# 5. Package state: which driver/DKMS packages dpkg believes are installed.
dpkg -l | grep -iE 'nvidia|cuda-drivers|libnvidia' || true   # Debian/Ubuntu
# rpm -qa | grep -iE 'nvidia|cuda-drivers'                    # RHEL/Rocky

# 6. DKMS build status for the nvidia module against the running kernel.
dkms status 2>/dev/null | grep -i nvidia || true
uname -r                                  # running kernel the module must match
```

What you are confirming, and the decision it drives:

nvidia-smi -q GSP Firmware Version is an alphanumeric string that tracks the driver branch (e.g. GSP Firmware Version : 570.133.07 on a 570 driver); it reads N/A only when GSP is disabled/unsupported.² If nvidia-smi -q itself fails, fall back to the kernel views in steps 2–3.
Mismatch confirmed when any of the following holds: /sys/module/nvidia/version names a version with no matching, complete /lib/firmware/nvidia/<version>/ directory; dpkg -l shows two driver branches half-installed; or dkms status shows the module not built against uname -r. In any of these cases, realign the branch (Procedure below).
Not this runbook if dmesg shows a Secure Boot signature rejection (Loading of unsigned module ... / module verification failed): that is a signing problem, not a GSP mismatch; re-sign or enroll MOK, do not churn the driver. Note that DKMS-built modules are not signed with Canonical's key and will fail under Secure Boot unless you sign them.⁷
Do not "fix" this by disabling GSP. NVreg_EnableGpuFirmware=0 is the documented switch to disable GSP firmware,¹ but on Turing-and-later datacenter GPUs GSP is the default operating model, so disabling it is a support-directed debugging probe, not a recovery. The fix is to make driver and firmware agree, not to turn off the firmware.
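The first mismatch condition above (a loaded module version with no complete firmware directory) can be checked mechanically. This is a rough sketch, assuming the /lib/firmware/nvidia/<version>/ layout described in this runbook; gsp_dir_status is a hypothetical helper name, and "ok" only means at least one gsp_*.bin is present, not that the blob for this GPU architecture is.

```shell
# Classify the firmware directory for one driver version.
# Args: $1 = firmware root (normally /lib/firmware/nvidia)
#       $2 = loaded module version (empty when the module is not loaded)
# Prints one of: not-loaded | missing-dir | no-blob | ok
# Hypothetical helper; "ok" means some gsp_*.bin exists, nothing more.
gsp_dir_status() {
  fw_root=$1
  ver=$2
  if [ -z "$ver" ]; then echo "not-loaded"; return 0; fi
  if [ ! -d "$fw_root/$ver" ]; then echo "missing-dir"; return 0; fi
  for blob in "$fw_root/$ver"/gsp_*.bin; do
    # An unmatched glob stays literal, so test that the file really exists.
    if [ -f "$blob" ]; then echo "ok"; return 0; fi
  done
  echo "no-blob"
}

# Example wiring on a node:
#   gsp_dir_status /lib/firmware/nvidia "$(cat /sys/module/nvidia/version 2>/dev/null)"
```

A result of missing-dir or no-blob matches the mismatch condition; not-loaded means the kernel view is unavailable and the dpkg and dkms checks have to carry the decision.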

Procedure

One node at a time. Cordon and drain before mutating the driver, because reinstalling kernel modules under a live workload corrupts in-flight jobs and can wedge the module unload. Set the target branch to the one your fleet is standardised on in driver versions and branches (mid-2026 standing LTSB target is R580; confirm before running).⁸

```
NODE=gpu-07.dc1.internal
DRIVER_BRANCH=580                 # the fleet's pinned branch; do NOT guess
```
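Because every later command interpolates these two variables, it can help to fail fast when either is unset or malformed. A minimal sketch, assuming branch numbers are three digits (true of the 535, 570, 580 and 590 branches named in this runbook, but an assumption in general); valid_branch and require_vars are hypothetical names.

```shell
# Return 0 only if $1 looks like a driver branch number (three digits, e.g. 580).
# The three-digit rule is an assumption based on the branches named in this runbook.
valid_branch() {
  case $1 in
    [0-9][0-9][0-9]) return 0 ;;
    *) return 1 ;;
  esac
}

# Fail fast when NODE or DRIVER_BRANCH is unset or malformed.
require_vars() {
  [ -n "${NODE:-}" ] || { echo "NODE is not set" >&2; return 1; }
  valid_branch "${DRIVER_BRANCH:-}" \
    || { echo "DRIVER_BRANCH must be a 3-digit branch, got '${DRIVER_BRANCH:-}'" >&2; return 1; }
}

# Example: require_vars || exit 1
```

This does not check that the branch is the fleet's pinned one; that still has to come from the branch policy.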

Cordon so the scheduler stops placing work on the node:
```
kubectl cordon "$NODE"
```
Slurm equivalent: scontrol update nodename="$NODE" state=drain reason="gsp/driver mismatch".
Drain running pods, keeping DaemonSets (GPU Operator, DCGM exporter) and clearing emptyDir:
```
kubectl drain "$NODE" --ignore-daemonsets --delete-emptydir-data --timeout=15m
```
Stop GPU services and unload the stale modules. Do this from the node (SSH). If a process holds the GPU, rmmod fails with "Module nvidia is in use"; find and stop the holder before retrying, and do not force.
```
sudo systemctl stop nvidia-fabricmanager nvidia-persistenced 2>/dev/null || true
sudo fuser -k /dev/nvidia* 2>/dev/null || true           # last resort: kill GPU holders
sudo rmmod nvidia_drm nvidia_modeset nvidia_uvm nvidia || true   # stderr left visible so "in use" errors are seen
lsmod | grep -i nvidia || echo "modules unloaded"
```
If the modules will not unload (GPU wedged), defer the unload to the reboot step later in this procedure and continue; the reinstall still corrects the on-disk package and firmware state.
Reinstall the driver as a single unit, pinned to the target branch. Reinstalling the whole package set is what brings the kernel modules and the matching gsp_*.bin back into lockstep; never hand-copy firmware files.¹ Pick the path for your distro and module flavour (open vs proprietary):

```
# Debian/Ubuntu, open kernel modules (the KB default for current branches):
sudo apt-get update
sudo apt-get install --reinstall -y "nvidia-open-${DRIVER_BRANCH}"

# Debian/Ubuntu, proprietary modules via the CUDA repo meta-package:
# sudo apt-get install --reinstall -y "cuda-drivers-${DRIVER_BRANCH}"

# RHEL/Rocky (dnf module stream), open modules:
# sudo dnf module reset -y nvidia-driver
# sudo dnf module install -y "nvidia-driver:${DRIVER_BRANCH}-open"
```

Package naming and the open-vs-proprietary split are branch- and distro-specific (NVIDIA renamed the Ubuntu packages from branch 590 onward), so verify the exact names against the NVIDIA Ubuntu install guide before running.⁷

If dpkg -l in pre-checks showed two half-installed branches, purge the foreign packages first so the reinstall is a single clean branch, then reinstall the target:

```
# Inspect, then purge ONLY the stray branch's packages; never a blanket purge on a shared image.
sudo apt-get purge -y 'nvidia-*-<stray-branch>' 'libnvidia-*-<stray-branch>'
sudo apt-get install -y "nvidia-open-${DRIVER_BRANCH}"
```
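To see which branch is the stray one before purging, the branch numbers can be read out of the installed package names. A sketch under stated assumptions: it reads `dpkg -l` text, treats any row whose status starts with `i` as installed or half-installed, and assumes the branch appears as a three-digit component of the package name; installed_branches is a hypothetical name.

```shell
# List the distinct driver branches named in installed NVIDIA package names.
# Reads `dpkg -l` text on stdin. Rows whose status column starts with "i"
# (ii, iU, iF, iH ...) count as installed or half-installed; "rc" rows are skipped.
# Assumes the branch is a three-digit component of the package name.
installed_branches() {
  awk '$1 ~ /^i/ && $2 ~ /^(nvidia|libnvidia|cuda-drivers)/ {
         n = split($2, parts, "[-:]")
         for (i = 1; i <= n; i++)
           if (parts[i] ~ /^[0-9][0-9][0-9]$/) print parts[i]
       }' | sort -u
}

# Example: dpkg -l | installed_branches
```

More than one line of output means two branches are present on disk; the one that is not the fleet's pinned branch is the candidate for the purge.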

Rebuild and confirm the DKMS module against the running kernel, so modules and firmware are consistent on the next boot. The package install normally triggers this; force and verify it explicitly:
```
sudo dkms autoinstall
dkms status | grep -i nvidia            # expect: installed, for the target branch + uname -r
```
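The "expect: installed" check can be scripted rather than eyeballed. A minimal sketch, assuming `dkms status` prints either `nvidia/<version>, <kernel>, <arch>: <state>` or the older `nvidia, <version>, <kernel>, <arch>: <state>` layout; dkms_nvidia_installed_for is a hypothetical helper name.

```shell
# Succeed only if `dkms status` text on stdin has an nvidia line for kernel $1
# whose state starts with "installed". Handles both the "nvidia/<ver>, <kernel>"
# and the older "nvidia, <ver>, <kernel>" layouts by scanning the middle fields.
dkms_nvidia_installed_for() {
  kernel=$1
  awk -v k="$kernel" -F'[,:] *' '
    $1 ~ /^nvidia/ {
      state = $NF
      for (i = 2; i < NF; i++)
        if ($i == k && state ~ /^installed/) found = 1
    }
    END { exit (found ? 0 : 1) }'
}

# Example wiring on a node:
#   dkms status | dkms_nvidia_installed_for "$(uname -r)" || echo "nvidia not built for running kernel"
```

A state of built or added does not count, since the module must be installed for the running kernel to load it after the reboot.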
Confirm the firmware directory for the branch you just installed now exists and holds the architecture blob:
```
ls -1 "/lib/firmware/nvidia/${DRIVER_BRANCH}".*/ 2>/dev/null | grep -E 'gsp_.*\.bin'
```
Refresh the initramfs so the boot environment does not reinsert a stale module/firmware view (especially after a kernel bump). Distro-specific:
```
sudo update-initramfs -u            # Debian/Ubuntu
# sudo dracut -f                    # RHEL/Rocky
```
Reboot to load the realigned modules cleanly from a known state:
```
sudo systemctl reboot
```

Bring services back and clear the drain only after Verification passes (next section):

```
# on the node:
sudo systemctl start nvidia-persistenced
sudo systemctl start nvidia-fabricmanager     # NVSwitch/HGX/NVL systems only
# from the admin host:
kubectl uncordon "$NODE"
```

Verification

Do not uncordon on green nvidia-smi alone; prove the driver/NVML stack is actually healthy. Run on the node after reboot.

Modules loaded, NVML up, all GPUs enumerated, firmware aligned.
```
nvidia-smi --query-gpu=index,name,driver_version --format=csv,noheader
nvidia-smi -q | grep -iE 'Driver Version|GSP Firmware Version'
```
Expect every GPU listed (no "Failed to initialize NVML"), a single consistent Driver Version, and a populated GSP Firmware Version whose branch matches the driver, not N/A on hardware that runs GSP.² Cross-check the kernel's independent view, useful when nvidia-smi is still suspect:¹
```
cat /proc/driver/nvidia/gpus/*/information | grep -E 'Model|GSP Firmware'
dmesg | grep -iE 'NVRM|RmInitAdapter' | tail -20    # expect NO RmInitAdapter failures
```
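The alignment check can be expressed as a pass/fail test on the `nvidia-smi -q` text. This is a sketch only, assuming the `Field : value` layout shown in the pre-checks example and that the Driver Version line precedes the per-GPU GSP Firmware Version lines; it compares the branch (the part before the first dot), and gsp_matches_driver is a hypothetical name.

```shell
# Succeed only if every "GSP Firmware Version" in `nvidia-smi -q` text on stdin
# is on the same branch as "Driver Version". N/A or a missing field fails.
# Assumes "Field : value" lines with Driver Version appearing first.
gsp_matches_driver() {
  awk -F' *: *' '
    function branch(v) { sub(/\..*/, "", v); return v }
    /^ *Driver Version/       { drv = branch($2) }
    /^ *GSP Firmware Version/ { n++; if (branch($2) != drv) bad = 1 }
    END { exit ((drv != "" && n > 0 && !bad) ? 0 : 1) }'
}

# Example wiring on a node:
#   nvidia-smi -q | gsp_matches_driver && echo "driver and GSP aligned"
```

On hardware where GSP is legitimately disabled or unsupported this check fails by design, so it only applies to nodes expected to run GSP.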
DCGM software/deployment validation, the real proof the NVML/driver stack is sound. dcgmi diag -r 1 is the "Quick"/System Validation level (runs in seconds) and executes the software/deployment plugin, which checks the NVML Library, CUDA Main Library, Denylist, Persistence Mode, Page Retirement/Row Remap and Inforom, i.e. exactly the layer a driver mismatch breaks.⁵⁶
```
dcgmi diag -r 1
```
Every line must read Pass (or Skip where not applicable), no Fail. A clean -r 1 confirms the deployment layer the mismatch corrupted is now correct.
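The "no Fail" rule can be applied to captured diag output when the result needs to gate an automated uncordon. A deliberately loose sketch, assuming the plain-text report contains the words Pass and Fail as result values; any occurrence of "Fail" anywhere in the text counts as a failure, which errs on the safe side. diag_clean is a hypothetical name.

```shell
# Succeed only if dcgmi diag text on stdin shows at least one Pass and no Fail.
# Loose by design: any "Fail" substring anywhere fails the check.
diag_clean() {
  awk '/Fail/ { fail = 1 }
       /Pass/ { pass = 1 }
       END { exit ((pass && !fail) ? 0 : 1) }'
}

# Example wiring on a node:
#   dcgmi diag -r 1 | diag_clean || echo "diag not clean; do not uncordon"
```

Empty output also fails, so a dcgmi that cannot reach the host engine is not mistaken for a clean run.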
Confirm hardware before returning to service. -r 1 is software validation only; once it is clean, run the longer hardware diagnostic to clear the node for workloads:
```
dcgmi diag -r 3        # System HW Diagnostics: PCIe/NVLink, memory BW, NCCL, targeted stress
```
dcgmi diag -r 3 must contain no Fail.⁵ On NVSwitch systems also confirm the fabric formed: systemctl is-active nvidia-fabricmanager reports active (see Fabric Manager).
Smoke a real workload. Land a short GPU job on the node; on multi-node fabrics run a 2-node nccl-tests all_reduce_perf and confirm busbw at the expected line rate before readmitting the node to the pool.⁹

Rollback

If the target branch will not come up clean (DKMS build fails, -r 1 still fails, or the GPUs do not enumerate), roll the node back to the previously known-good branch (same single-variable reinstall, just a different branch number), then reboot and re-verify.

```
PREV_BRANCH=535                          # the last branch this node ran clean
kubectl cordon "$NODE" && kubectl drain "$NODE" --ignore-daemonsets --delete-emptydir-data
# on the node:
sudo systemctl stop nvidia-fabricmanager nvidia-persistenced 2>/dev/null || true
sudo apt-get install -y "nvidia-open-${PREV_BRANCH}"
sudo dkms autoinstall && sudo update-initramfs -u && sudo systemctl reboot
# after reboot, on the node:
nvidia-smi && dcgmi diag -r 1
```

Then uncordon only on a clean dcgmi diag -r 1. Pin the node to PREV_BRANCH in inventory so config management does not re-push the broken branch, and open a follow-up to fix the upgrade path before retrying the move (rolling driver upgrade).

Escalate, do not keep reinstalling, if:

The node fails dcgmi diag on both the target and the previous branch: treat as hardware and divert to the GPU fault / RMA path.
dmesg shows VBIOS/FWSEC errors (e.g. failed VBIOS image preparation) rather than a missing-blob gsp_*.bin path: that points at board firmware, not the driver package; see GPU firmware and GSP and engage NVIDIA support. Reinstalling the host driver does not reflash a board.
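The split between a missing-blob failure and a board-firmware failure can be roughed out from the kernel log before deciding whether to reinstall or escalate. A sketch only: the patterns are the strings named in this runbook (VBIOS, FWSEC, RmFetchGspRmImages, the gsp_*.bin load failure), real logs may word these differently, and classify_dmesg is a hypothetical name.

```shell
# Classify kernel log text on stdin using the signatures named in this runbook.
# Prints one of: board-firmware | gsp-missing | unknown
# VBIOS/FWSEC is checked first so that a mixed log leans towards escalation.
classify_dmesg() {
  log=$(cat)
  if printf '%s\n' "$log" | grep -qE 'VBIOS|FWSEC'; then
    echo "board-firmware"
  elif printf '%s\n' "$log" | grep -qE 'RmFetchGspRmImages|gsp_[^ ]*\.bin failed'; then
    echo "gsp-missing"
  else
    echo "unknown"
  fi
}

# Example wiring on a node:
#   sudo dmesg | grep -iE 'NVRM|nvidia' | classify_dmesg
```

Only gsp-missing is in scope for the reinstall path; board-firmware goes to NVIDIA support, and unknown means the log needs reading by hand.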

Related runbooks

Rolling driver / CUDA upgrade: the planned, fleet-wide change this runbook cleans up after; same cordon/drain + single-variable reinstall primitives.
Kernel modules / GPU missing: when the GPUs are absent from lspci or no device nodes appear (not a firmware-branch mismatch).
Fabric Manager failure: when modules load clean but the NVLink/NVSwitch domain will not form.
GPU fault / RMA: escalation when a node fails diag on both branches.
Operational runbooks: runbook index.

References

1. NVIDIA driver README, GSP Firmware chapter: gsp_*.bin installed in /lib/firmware/nvidia/<version>/; "The GSP firmware will be used by default for all Turing and later GPUs"; NVreg_EnableGpuFirmware=0/1; per-GPU /proc/driver/nvidia/gpus/<PCI-BUS-ID>/information node. Verified identical across the 570 and 580 branches: https://download.nvidia.com/XFree86/Linux-x86_64/580.65.06/README/gsp.html and https://download.nvidia.com/XFree86/Linux-x86_64/570.86.16/README/gsp.html
2. nvidia-smi manual: -q/--query dumps full per-GPU attributes; "GSP Firmware Version: Firmware version of GSP. This is an alphanumeric string"; "VBIOS Version: The BIOS of the GPU board": https://docs.nvidia.com/deploy/nvidia-smi/index.html
3. NVIDIA Developer Forums: the user-facing error for a kernel-module vs userspace-library version skew is Failed to initialize NVML: Driver/library version mismatch; common cause is an incomplete/automatic driver upgrade leaving mismatched components, fixed by removing and reinstalling the matching driver: https://forums.developer.nvidia.com/t/failed-to-initialize-nvml-driver-library-version-mismatch/255340
4. NVIDIA open-gpu-kernel-modules issue #943: verbatim dmesg signature of the GSP load failure (loading /lib/firmware/nvidia/570.172.08/gsp_ga10x.bin failed with error -4, NVRM: RmFetchGspRmImages: No firmware image found, NVRM: GPU ...: RmInitAdapter failed!) on H200 after a driver change: https://github.com/NVIDIA/open-gpu-kernel-modules/issues/943
5. DCGM Diagnostics, run levels: -r 1 "Quick"/System Validation (seconds), -r 2 Medium, -r 3 "Long"/System HW Diagnostics (PCIe/NVLink, memory bandwidth, NCCL, targeted stress/power), -r 4 Extended: https://docs.nvidia.com/datacenter/dcgm/latest/user-guide/dcgm-diagnostics.html
6. DCGM Diagnostics, software/deployment plugin: verifies the environment can run CUDA and load NVML; checks include Denylist, NVML Library, CUDA Main Library, Persistence Mode, Page Retirement/Row Remap, Inforom: https://docs.nvidia.com/datacenter/dcgm/latest/user-guide/dcgm-diagnostics.html
7. NVIDIA Driver Installation Guide (Ubuntu): open (nvidia-open / nvidia-dkms-open) vs proprietary (cuda-drivers / nvidia-dkms) packaging; packages renamed from branch 590 onward; DKMS modules are unsigned w.r.t. Canonical's key and do not satisfy Secure Boot unless signed: https://docs.nvidia.com/datacenter/tesla/driver-installation-guide/ubuntu.html
8. In-KB branch policy, cross-checked to NVIDIA release notes and endoflife.date: branch format <branch>.<minor>.<patch>; mid-2026 standing LTSB target is R580 (CUDA 13.x), with R570/R535 at or past EOL: driver versions and branches
9. NVIDIA nccl-tests (e.g. all_reduce_perf busbw): https://github.com/NVIDIA/nccl-tests