Runbook: GSP firmware / driver mismatch
Scope: recover a node where a partial driver change left the kernel modules and the GSP (GPU System Processor) firmware on different branches, so nvidia-smi fails with "Failed to initialize NVML", the modules will not load, and dmesg shows a GSP/RM firmware-load failure. Realign driver and GSP to one branch behind cordon/drain, rebuild DKMS, reboot, verify. Severity: single-node fault, one node at a time; on a fleet this is the cleanup after a botched rolling driver upgrade.
All commands below are a reference template and have not been hardware-tested. Pin the exact branch your fleet is standardised on and validate on one node before touching others. Package names differ by distro and driver branch, so confirm against the install guides in References.
The concept this runbook recovers is described in GPU firmware and GSP: GSP firmware ships inside the driver package and is not versioned separately, so the firmware blob the driver loads at init must match the driver branch.1 Branch policy (which LTSB/Production branch you should be on) lives in driver versions and branches.
Trigger
A node that was healthy fails after a driver touch: an interrupted upgrade, an unattended-upgrades run, a runfile installed over a packaged driver, or a kernel bump that rebuilt modules against a firmware directory that no longer matches. Symptoms, any of:
- nvidia-smi exits non-zero with Failed to initialize NVML: Driver/library version mismatch: the loaded kernel module and the userspace NVML/library are on different versions.3
- nvidia-smi reports "No devices were found", or the modules will not load at all.
- dmesg shows a GSP firmware load failure and RmInitAdapter failing. The authoritative signature, from the NVIDIA open-kernel-modules tracker (an H200 node where 3 of 8 GPUs failed to init after an upgrade left the firmware path incomplete):4
nvidia 0000:06:00.0: loading /lib/firmware/nvidia/570.172.08/gsp_ga10x.bin failed with error -4
nvidia 0000:06:00.0: Direct firmware load for nvidia/570.172.08/gsp_ga10x.bin failed with error -4
NVRM: RmFetchGspRmImages: No firmware image found
NVRM: GPU 0000:06:00.0: RmInitAdapter failed! (0x61:0x56:1770)
NVRM: GPU 0000:06:00.0: rm_init_adapter failed, device minor number 5
Error -4 or -2 on the gsp_*.bin path means the driver cannot find the matching GSP blob: the /lib/firmware/nvidia/<driver-version>/ directory for the module version being loaded is missing or incomplete.1,4
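On a node with a long kernel log it helps to reduce the signature above to just the driver version named in the failing firmware path, so it can be compared against the directories under /lib/firmware/nvidia/. The following is a minimal sketch, not a tested tool: the function name gsp_failed_versions is hypothetical, and the pattern assumes log lines keep the "nvidia/<version>/gsp_*.bin failed with error" shape quoted above.

```shell
# Hypothetical helper (the name is an assumption, not an NVIDIA tool).
# Reads kernel log text on stdin and prints each driver version whose
# gsp_*.bin load failed, de-duplicated, one version per line.
gsp_failed_versions() {
  grep -oE 'nvidia/[0-9]+(\.[0-9]+)+/gsp_[a-z0-9]+\.bin failed with error -[0-9]+' \
    | sed -E 's#^nvidia/([0-9.]+)/.*#\1#' \
    | sort -u
}

# Usage on the affected node (not executed here):
#   dmesg | gsp_failed_versions
```

An empty result means no GSP load failure of this shape was logged, which is a hint to re-read the Trigger section before reinstalling anything.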
If instead the GPUs are physically absent from lspci, or modules build but no nvidia device nodes appear, this is the wrong runbook; go to kernel modules / GPU missing. If the modules load and nvidia-smi is clean but the NVLink domain will not form on an NVSwitch system, go to Fabric Manager failure.
Pre-checks
Confirm the failure is a branch mismatch (not dead hardware, a Secure Boot signing failure, or a missing kernel-headers build) before reinstalling. Run on the affected node.
# 1. The userspace error itself (records the symptom in the ticket)
nvidia-smi || true
# 2. Loaded kernel module version vs the firmware directories actually present.
# The directory named for THIS version must exist and hold the arch blob.
cat /sys/module/nvidia/version # e.g. 570.172.08 (empty => module not loaded)
ls -1 /lib/firmware/nvidia/ # one dir per installed driver version
ls -1 "/lib/firmware/nvidia/$(cat /sys/module/nvidia/version 2>/dev/null)/" 2>&1 \
| grep -E 'gsp_.*\.bin' || echo "NO MATCHING GSP FIRMWARE DIR FOR LOADED MODULE"
# 3. The kernel's own account of the failure (firmware path + RmInitAdapter).
dmesg | grep -iE 'NVRM|gsp|RmInitAdapter|firmware' | tail -40
# 4. Firmware versions as the driver reports them (when nvidia-smi still runs at all).
# -q dumps full per-GPU detail incl. the GSP Firmware Version field. [smi]
nvidia-smi -q | grep -iE 'Driver Version|VBIOS Version|GSP Firmware Version' || true
# 5. Package state: which driver/DKMS packages dpkg believes are installed.
dpkg -l | grep -iE 'nvidia|cuda-drivers|libnvidia' || true # Debian/Ubuntu
# rpm -qa | grep -iE 'nvidia|cuda-drivers' # RHEL/Rocky
# 6. DKMS build status for the nvidia module against the running kernel.
dkms status 2>/dev/null | grep -i nvidia || true
uname -r # running kernel the module must match
What you are confirming, and the decision it drives:
- nvidia-smi -q: GSP Firmware Version is an alphanumeric string that tracks the driver branch (e.g. GSP Firmware Version : 570.133.07 on a 570 driver); it reads N/A only when GSP is disabled/unsupported.2 If nvidia-smi -q itself fails, fall back to the kernel views in steps 2–3.
- Mismatch confirmed when /sys/module/nvidia/version names a branch with no matching, complete /lib/firmware/nvidia/<version>/ directory, or when dpkg -l shows two driver branches half-installed, or when dkms status shows the module not built against uname -r. Any of these = realign the branch (Procedure below).
- Not this runbook if dmesg shows a Secure Boot signature rejection (Loading of unsigned module ... / module verification failed): that is a signing problem, not a GSP mismatch; re-sign or enroll MOK, do not churn the driver. Note that DKMS-built modules are not signed with Canonical's key and will fail under Secure Boot unless you sign them.7
- Do not "fix" this by disabling GSP. NVreg_EnableGpuFirmware=0 is the documented switch to disable GSP firmware,1 but on Turing-and-later datacenter GPUs GSP is the default operating model, so disabling it is a support-directed debugging probe, not a recovery. The fix is to make driver and firmware agree, not to turn off the firmware.
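Two of the mismatch criteria above (no complete firmware directory for the module version, and two branches half-installed) can be checked mechanically. This is a rough sketch under stated assumptions: both function names are hypothetical, the firmware blobs are assumed to be stored uncompressed as gsp_*.bin, and the branch is assumed to appear as a three-digit field in the package name, which may not hold for every distro or branch.

```shell
# Hypothetical helpers; the names are assumptions, not NVIDIA or distro tools.

# Succeeds when <fw_root>/<version>/ holds at least one gsp_*.bin.
# Exit status 2 means no version was supplied (module not loaded).
gsp_dir_ok() {
  gsp_version="$1"
  gsp_fw_root="${2:-/lib/firmware/nvidia}"
  [ -n "$gsp_version" ] || return 2
  for gsp_blob in "$gsp_fw_root/$gsp_version"/gsp_*.bin; do
    [ -f "$gsp_blob" ] && return 0
  done
  return 1
}

# Reads `dpkg -l` text on stdin and prints each distinct three-digit branch
# found in nvidia package names; more than one line suggests a mixed install.
installed_branches() {
  awk '$2 ~ /nvidia/ { print $2 }' \
    | grep -oE -- '-[0-9]{3}(-|:|$)' \
    | tr -dc '0-9\n' \
    | sort -u
}

# Usage on the affected node (not executed here):
#   gsp_dir_ok "$(cat /sys/module/nvidia/version 2>/dev/null)" || echo "realign the branch"
#   dpkg -l | installed_branches
```

Passing the firmware root as a second argument keeps the check testable against a scratch directory before it is pointed at a real node.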
Procedure
One node at a time. Cordon and drain before mutating the driver, because reinstalling kernel modules under a live workload corrupts in-flight jobs and can wedge the module unload. Set the target branch to the one your fleet is standardised on in driver versions and branches (mid-2026 standing LTSB target is R580; confirm before running).8
1. Cordon so the scheduler stops placing work on the node (kubectl cordon "$NODE", as in the Rollback block). Slurm equivalent: scontrol update nodename="$NODE" state=drain reason="gsp/driver mismatch".
2. Drain running pods, keeping DaemonSets (GPU Operator, DCGM exporter) and clearing emptyDir (kubectl drain "$NODE" --ignore-daemonsets --delete-emptydir-data).
3. Stop GPU services (sudo systemctl stop nvidia-fabricmanager nvidia-persistenced, as in the Rollback block) and unload the stale modules. Do this from the node (SSH). If a process holds the GPU, rmmod fails with "Module nvidia is in use"; find and stop the holder before retrying, and do not force. If the modules will not unload (GPU wedged), defer the unload to the reboot in step 7 and continue; the reinstall still corrects the on-disk package and firmware state.
4. Reinstall the driver as a single unit, pinned to the target branch. Reinstalling the whole package set is what brings the kernel modules and the matching gsp_*.bin back into lockstep; never hand-copy firmware files.1 Pick the path for your distro and module flavour (open vs proprietary):
# Debian/Ubuntu, open kernel modules (the KB default for current branches):
sudo apt-get update
sudo apt-get install --reinstall -y "nvidia-open-${DRIVER_BRANCH}"
# Debian/Ubuntu, proprietary modules via the CUDA repo meta-package:
# sudo apt-get install --reinstall -y cuda-drivers-${DRIVER_BRANCH}
# RHEL/Rocky (dnf module stream), open modules:
# sudo dnf module reset -y nvidia-driver
# sudo dnf module install -y "nvidia-driver:${DRIVER_BRANCH}-open"
If dpkg -l in pre-checks showed two half-installed branches, purge the foreign packages first so the reinstall is a single clean branch, then reinstall the target:
# Inspect, then purge ONLY the stray branch's packages — never a blanket purge on a shared image.
sudo apt-get purge -y 'nvidia-*-<stray-branch>' 'libnvidia-*-<stray-branch>'
sudo apt-get install -y "nvidia-open-${DRIVER_BRANCH}"
5. Rebuild and confirm the DKMS module against the running kernel, so modules and firmware are consistent on the next boot. The package install normally triggers this; force it with sudo dkms autoinstall and verify it with dkms status. Then confirm the firmware directory for the branch you just installed now exists and holds the architecture blob: ls -1 /lib/firmware/nvidia/<version>/ must list gsp_*.bin.
6. Refresh the initramfs so the boot environment does not reinsert a stale module/firmware view (especially after a kernel bump). This is distro-specific: sudo update-initramfs -u on Debian/Ubuntu; use the dracut equivalent on RHEL/Rocky.
7. Reboot to load the realigned modules cleanly from a known state (sudo systemctl reboot).
8. Bring services back and clear the drain (kubectl uncordon "$NODE") only after Verification passes (next section).
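Because every command in this runbook is an untested template, it can help to rehearse the steps in a print-only mode before running them on the first node. The sketch below wraps the non-package steps in functions that only echo their command while DRY_RUN=1. It assumes Kubernetes scheduling and a Debian/Ubuntu host; the function names, the run wrapper, the default node name, and the rmmod module list are illustrative assumptions rather than part of the original procedure.

```shell
# Print-only rehearsal of the procedure steps. Nothing is executed
# unless DRY_RUN is set to something other than 1.
DRY_RUN="${DRY_RUN:-1}"
NODE="${NODE:-gpu-node-01}"   # hypothetical node name; set your own

# Echo the command in dry-run mode, otherwise execute it.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    printf '+ %s\n' "$*"
  else
    "$@"
  fi
}

# Steps 1-2 run from an admin host with cluster access.
step1_cordon() { run kubectl cordon "$NODE"; }
step2_drain()  { run kubectl drain "$NODE" --ignore-daemonsets --delete-emptydir-data; }

# Steps 3 and 5-7 run on the node itself.
step3_unload() {
  run sudo systemctl stop nvidia-fabricmanager nvidia-persistenced || true
  # Assumed unload order: dependents first, core module last. Never force.
  run sudo rmmod nvidia_uvm nvidia_drm nvidia_modeset nvidia
}
# Step 4 (the package reinstall) is shown in full in the procedure.
step5_dkms() {
  run sudo dkms autoinstall
  run dkms status
}
step6_initramfs() { run sudo update-initramfs -u; }   # Debian/Ubuntu only
step7_reboot()    { run sudo systemctl reboot; }

# Step 8 runs only after Verification passes.
step8_readmit() {
  run sudo systemctl start nvidia-persistenced || true
  run kubectl uncordon "$NODE"
}
```

Calling step1_cordon with the defaults prints "+ kubectl cordon gpu-node-01" and touches nothing; exporting DRY_RUN=0 makes the same functions execute for real, so leave the default in place until the printed sequence has been reviewed.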
Verification
Do not uncordon on a green nvidia-smi alone; prove the driver/NVML stack is actually healthy. Run on the node after reboot.
1. Modules loaded, NVML up, all GPUs enumerated, firmware aligned.
   nvidia-smi --query-gpu=index,name,driver_version --format=csv,noheader
   nvidia-smi -q | grep -iE 'Driver Version|GSP Firmware Version'
   Expect every GPU listed (no "Failed to initialize NVML"), a single consistent Driver Version, and a populated GSP Firmware Version whose branch matches the driver, not N/A on hardware that runs GSP.2 Cross-check the kernel's independent view, useful when nvidia-smi is still suspect: /sys/module/nvidia/version and the per-GPU /proc/driver/nvidia/gpus/<PCI-BUS-ID>/information node.1
2. DCGM software/deployment validation, the real proof the NVML/driver stack is sound. dcgmi diag -r 1 is the "Quick"/System Validation level (runs in seconds) and executes the software/deployment plugin, which checks the NVML Library, CUDA Main Library, Denylist, Persistence Mode, Page Retirement/Row Remap and Inforom, i.e. exactly the layer a driver mismatch breaks.5,6 Every line must read Pass (or Skip where not applicable), no Fail. A clean -r 1 confirms the deployment layer the mismatch corrupted is now correct.
3. Confirm hardware before returning to service. -r 1 is software validation only; once it is clean, run the longer hardware diagnostic to clear the node for workloads: dcgmi diag -r 3 must contain no Fail.5 On NVSwitch systems also confirm the fabric formed: systemctl is-active nvidia-fabricmanager reports active (see Fabric Manager).
4. Smoke a real workload. Land a short GPU job on the node; on multi-node fabrics run a 2-node nccl-tests all_reduce_perf and confirm busbw at the expected line rate before readmitting the node to the pool.9
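The first two verification checks can be turned into pass/fail gates so the uncordon decision does not rest on eyeballing output. This is a sketch only: the function names are hypothetical, and both parsers assume output shaped like the fields quoted in this runbook ("Driver Version : ...", "GSP Firmware Version : ...", and Pass/Skip/Fail result cells), which should be confirmed against real output on your driver and DCGM versions.

```shell
# Hypothetical gates; names and parsing assumptions are illustrative.

# Reads `nvidia-smi -q` text on stdin. Succeeds only if at least one
# GSP Firmware Version is present and every one shares the driver's branch
# (the leading numeric field, e.g. 570 in 570.133.07). N/A therefore fails.
gsp_matches_driver() {
  awk -F': *' '
    /Driver Version/       { split($2, d, "."); drv = d[1] }
    /GSP Firmware Version/ { n++; split($2, g, "."); if (g[1] != drv) bad++ }
    END { exit !(n > 0 && bad == 0 && drv != "") }
  '
}

# Reads `dcgmi diag` text on stdin. Succeeds only if at least one result
# reads Pass and none reads Fail, so empty output is never treated as clean.
diag_clean() {
  awk '
    /(^|[^A-Za-z])Fail([^A-Za-z]|$)/ { fail = 1 }
    /Pass/                           { pass = 1 }
    END { exit !(pass && !fail) }
  '
}

# Usage on the node after reboot (not executed here):
#   nvidia-smi -q | gsp_matches_driver && dcgmi diag -r 1 | diag_clean \
#     && echo "software layer clean; continue with -r 3"
```

Both gates fail closed: missing fields or empty input return non-zero, which keeps a node cordoned when the tool itself could not run.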
Rollback
If the target branch will not come up clean (DKMS build fails, -r 1 still fails, or the GPUs do not enumerate), roll the node back to the previously known-good branch (same single-variable reinstall, just a different branch number), then reboot and re-verify.
PREV_BRANCH=535 # the last branch this node ran clean
kubectl cordon "$NODE" && kubectl drain "$NODE" --ignore-daemonsets --delete-emptydir-data
# on the node:
sudo systemctl stop nvidia-fabricmanager nvidia-persistenced 2>/dev/null || true
sudo apt-get install -y "nvidia-open-${PREV_BRANCH}"
sudo dkms autoinstall && sudo update-initramfs -u && sudo systemctl reboot
# after reboot, on the node:
nvidia-smi && dcgmi diag -r 1
Then uncordon only on a clean dcgmi diag -r 1. Pin the node to PREV_BRANCH in inventory so config management does not re-push the broken branch, and open a follow-up to fix the upgrade path before retrying the move (rolling driver upgrade).
Escalate, do not keep reinstalling, if:
- The node fails dcgmi diag on both the target and the previous branch: treat as hardware and divert to the GPU fault / RMA path.
- dmesg shows VBIOS/FWSEC errors (e.g. failed VBIOS image preparation) rather than a missing-blob gsp_*.bin path: that points at board firmware, not the driver package; see GPU firmware and GSP and engage NVIDIA support. Reinstalling the host driver does not reflash a board.
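The routing decisions in this runbook (realign the branch, treat as a signing problem, or escalate as board firmware) can be approximated by a first-pass classifier over the kernel log. This is a hedged sketch: the function name and output labels are hypothetical, and the patterns are built only from the strings quoted in this runbook, so an unrelated NVRM line that mentions VBIOS would be misrouted. Treat the result as a hint for the ticket, not a diagnosis.

```shell
# Hypothetical triage helper. Reads kernel log text on stdin and prints
# one label: board-firmware, secure-boot, gsp-mismatch, or unknown.
triage_dmesg() {
  triage_log="$(cat)"
  if printf '%s\n' "$triage_log" | grep -qiE 'NVRM:.*(VBIOS|FWSEC)'; then
    echo "board-firmware"      # escalate; a driver reinstall will not help
  elif printf '%s\n' "$triage_log" \
      | grep -qiE 'module verification failed|Loading of unsigned module'; then
    echo "secure-boot"         # sign the modules or enroll MOK
  elif printf '%s\n' "$triage_log" \
      | grep -qE 'RmFetchGspRmImages: No firmware image found|gsp_[a-z0-9]+\.bin failed with error'; then
    echo "gsp-mismatch"        # this runbook: realign the driver branch
  else
    echo "unknown"
  fi
}

# Usage on the affected node (not executed here):
#   dmesg | triage_dmesg
```

Board-firmware is checked first on purpose: when both kinds of line are present, escalation is the safer route than another reinstall.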
Related runbooks
- Rolling driver / CUDA upgrade: the planned, fleet-wide change this runbook cleans up after; same cordon/drain + single-variable reinstall primitives.
- Kernel modules / GPU missing: when the GPUs are absent from lspci or no device nodes appear (not a firmware-branch mismatch).
- Fabric Manager failure: when modules load clean but the NVLink/NVSwitch domain will not form.
- GPU fault / RMA: escalation when a node fails diag on both branches.
- Operational runbooks: runbook index.
References
Related: Runbook: Driver Upgrade · GPU Firmware & GSP · Driver Versions & Branches · Kernel Modules · Diagnostics Tools · Glossary
1. NVIDIA driver README, GSP Firmware chapter: gsp_*.bin installed in /lib/firmware/nvidia/<version>/; "The GSP firmware will be used by default for all Turing and later GPUs"; NVreg_EnableGpuFirmware=0/1; per-GPU /proc/driver/nvidia/gpus/<PCI-BUS-ID>/information node. Verified identical across the 570 and 580 branches: https://download.nvidia.com/XFree86/Linux-x86_64/580.65.06/README/gsp.html and https://download.nvidia.com/XFree86/Linux-x86_64/570.86.16/README/gsp.html
2. nvidia-smi manual: -q/--query dumps full per-GPU attributes; "GSP Firmware Version: Firmware version of GSP. This is an alphanumeric string"; "VBIOS Version: The BIOS of the GPU board": https://docs.nvidia.com/deploy/nvidia-smi/index.html
3. NVIDIA Developer Forums: the user-facing error for a kernel-module vs userspace-library version skew is Failed to initialize NVML: Driver/library version mismatch; common cause is an incomplete/automatic driver upgrade leaving mismatched components, fixed by removing and reinstalling the matching driver: https://forums.developer.nvidia.com/t/failed-to-initialize-nvml-driver-library-version-mismatch/255340
4. NVIDIA open-gpu-kernel-modules issue #943: verbatim dmesg signature of the GSP load failure (loading /lib/firmware/nvidia/570.172.08/gsp_ga10x.bin failed with error -4, NVRM: RmFetchGspRmImages: No firmware image found, NVRM: GPU ...: RmInitAdapter failed!) on H200 after a driver change: https://github.com/NVIDIA/open-gpu-kernel-modules/issues/943
5. DCGM Diagnostics, run levels: -r 1 "Quick"/System Validation (seconds), -r 2 Medium, -r 3 "Long"/System HW Diagnostics (PCIe/NVLink, memory bandwidth, NCCL, targeted stress/power), -r 4 Extended: https://docs.nvidia.com/datacenter/dcgm/latest/user-guide/dcgm-diagnostics.html
6. DCGM Diagnostics, software/deployment plugin: verifies the environment can run CUDA and load NVML; checks include Denylist, NVML Library, CUDA Main Library, Persistence Mode, Page Retirement/Row Remap, Inforom: https://docs.nvidia.com/datacenter/dcgm/latest/user-guide/dcgm-diagnostics.html
7. NVIDIA Driver Installation Guide (Ubuntu): open (nvidia-open / nvidia-dkms-open) vs proprietary (cuda-drivers / nvidia-dkms) packaging; packages renamed from branch 590 onward; DKMS modules are unsigned w.r.t. Canonical's key and do not satisfy Secure Boot unless signed: https://docs.nvidia.com/datacenter/tesla/driver-installation-guide/ubuntu.html
8. In-KB branch policy, cross-checked to NVIDIA release notes and endoflife.date: branch format <branch>.<minor>.<patch>; mid-2026 standing LTSB target is R580 (CUDA 13.x), with R570/R535 at or past EOL: driver versions and branches
9. NVIDIA nccl-tests (e.g. all_reduce_perf busbw): https://github.com/NVIDIA/nccl-tests