Runbook: driver / kernel module load failure
Scope: recover a node where the NVIDIA kernel modules fail to load or refuse to bind. Either modprobe nvidia errors, or nvidia-smi returns Failed to initialize NVML: Driver/library version mismatch because the resident kernel module and the userspace libraries are different driver versions (an in-place driver upgrade that never reloaded the module, a half-applied package transaction, or a wedged nvidia_uvm). Distinct from a clean kernel upgrade with no driver present (No devices were found), which is runbook: kernel upgrade, GPU missing. Single-node procedure; run it behind cordon/drain.
Run this when
nvidia-smi on a node prints Failed to initialize NVML: Driver/library version mismatch or modprobe nvidia fails, the GPU is still on the PCIe bus (lspci lists it), and a DKMS module exists for the running kernel, i.e. the driver is installed but the loaded module and the userspace stack disagree, or the module will not bind. Severity: node-down, single node; the scheduler routes around it once it stops advertising GPUs.
Reference templates, not hardware-tested. Pin exact versions and package names to your adopted driver branch and validate on a canary before fleet use (runbook: driver upgrade).
The version-mismatch case is specific: a driver package upgrade (or unattended-upgrade) replaced libnvidia-ml.so and the rest of userspace without unloading the old in-core module, so the running module's version no longer matches NVML. Until the old module is unloaded and the new one loaded, every NVML caller fails identically. The kernel-space layer (the five nvidia* modules, DKMS, open-vs-proprietary flavor) is described in kernel modules; the package install/upgrade mechanics that produce this state are in the GPU software stack. On NVSwitch systems a module reload also tears down and rebuilds the NVLink domain, so Fabric Manager must be restarted in step (runbook: fabric manager failure).
Trigger
- nvidia-smi returns Failed to initialize NVML: Driver/library version mismatch: the loaded kernel module version and the NVML userspace library version differ (commonly after an in-place driver upgrade with no reboot/reload).
- modprobe nvidia errors (ERROR: could not insert 'nvidia', Invalid module format, Exec format error, or Unknown symbol): the module exists but will not bind to the running kernel/driver.
- The GPU device-plugin stops advertising nvidia.com/gpu; DCGM exporter metrics flatline for the node (gpu health gating, troubleshooting).
This is not the "GPU missing after a kernel bump" case: here a module is built and present for uname -r (dkms status shows installed); the problem is a stale resident module or a failed bind, not a missing build. If dkms status has no entry for the running kernel, or nvidia-smi says No devices were found, go to runbook: kernel upgrade, GPU missing. If the module loads but against the wrong GSP blob, go to runbook: GSP firmware mismatch.
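The split between this runbook and the GPU-missing one comes down to whether dkms status has an installed row for the running kernel. A minimal sketch of that check follows. The function name dkms_has_nvidia_for_kernel is hypothetical, and the parsing assumes rows shaped like nvidia/<VER>, <kernel>[, <arch>]: installed (the older nvidia, <VER>, <kernel>, ... shape also matches). Like the rest of this page it is a reference template, not hardware-tested.

```shell
# Hypothetical helper: succeeds when the `dkms status` text on stdin contains
# an "installed" nvidia row for the given kernel release.
# Assumed row shape: nvidia/<VER>, <kernel>[, <arch>]: installed
dkms_has_nvidia_for_kernel() {
    kernel="$1"
    grep -E '^nvidia' \
        | grep -F -e ", ${kernel}," -e ", ${kernel}:" \
        | grep -q ': installed'
}

# Usage on the node (not executed here):
#   dkms status | dkms_has_nvidia_for_kernel "$(uname -r)" \
#       || echo "no build for this kernel: use the kernel-upgrade GPU-missing runbook"
```

Fixed-string matching (grep -F) is used for the kernel release so that dots and plus signs in the release name are not treated as regex metacharacters.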
Pre-checks
Run these read-only first; they confirm this is a reload/bind problem and not a missing build or hardware fault. None mutate the node.
NODE=gpu-07.dc1.internal # set to the affected node
# 1. The authoritative symptom and the two version strings:
nvidia-smi # expect: Failed to initialize NVML: Driver/library version mismatch
cat /proc/driver/nvidia/version # version baked into the RESIDENT kernel module
modinfo -F version nvidia # version of the ON-DISK module that would load next
# 2. Is a (mismatched) module still resident?
lsmod | grep -E 'nvidia' # nvidia / nvidia_uvm / nvidia_modeset / nvidia_drm refcounts
# 3. Is a module BUILT for the running kernel? (distinguishes from the GPU-missing runbook)
uname -r
dkms status # want nvidia/<VER>, <running-kernel>: installed
modinfo -F vermagic nvidia # leading token must equal `uname -r`
# 4. GPU still on the bus? (rules out a seating/hardware fault)
lspci | grep -i nvidia
# 5. Why did a load attempt fail? (the authoritative line)
dmesg | grep -iE 'NVRM|nvidia|Invalid module format|Unknown symbol|disagrees about version'
Read the signal:
- /proc/driver/nvidia/version and modinfo -F version nvidia differ → classic version mismatch: an upgrade swapped userspace + the on-disk module but the old module is still resident. Fix = unload the resident module and load the new one (Procedure). The first value is the running module; the second is what will load after an unload.
- dmesg shows disagrees about version of symbol / Invalid module format / Exec format error → the on-disk module was built against a different kernel/driver than is running; rebuild it via DKMS (mirrors runbook: kernel upgrade, GPU missing step) before reload.
- dkms status has no installed row for uname -r → wrong runbook: runbook: kernel upgrade, GPU missing.
- lspci does not list the GPU → hardware/seating fault, not a module problem: the GPU-fault runbook.
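The first signal, resident version versus on-disk version, can be reduced to a small comparison. The sketch below is illustrative only: resident_nvidia_version and classify_module_state are hypothetical names, and the parser assumes the NVRM version line in /proc/driver/nvidia/version carries exactly one dotted version number.

```shell
# Hypothetical helper: extract the driver version from the text of
# /proc/driver/nvidia/version supplied on stdin.
# Assumes the "NVRM version:" line contains one dotted version number.
resident_nvidia_version() {
    grep '^NVRM version' \
        | grep -oE '[0-9]+\.[0-9]+(\.[0-9]+)?' \
        | head -n 1
}

# Hypothetical helper: compare the resident and on-disk module versions.
#   $1 = resident version (empty if no module is loaded)
#   $2 = on-disk version, e.g. from `modinfo -F version nvidia`
classify_module_state() {
    resident="$1"
    ondisk="$2"
    if [ -z "$resident" ]; then
        echo "not-loaded"        # nothing resident: a plain load or bind problem
    elif [ "$resident" = "$ondisk" ]; then
        echo "match"             # versions agree: look at dmesg instead
    else
        echo "stale-resident"    # classic mismatch: unload, then load
    fi
}

# Usage on the node (not executed here):
#   resident=$(cat /proc/driver/nvidia/version 2>/dev/null | resident_nvidia_version)
#   classify_module_state "$resident" "$(modinfo -F version nvidia)"
```

A result of stale-resident corresponds to the first bullet above; match or not-loaded means the dmesg line is the next thing to read.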
Flow
flowchart TB
A["nvidia-smi: NVML version mismatch / modprobe fails"] --> B{"lspci lists the GPU?"}
B -->|"no"| Z["GPU fault / RMA runbook"]
B -->|"yes"| C{"dkms 'installed' for uname -r?"}
C -->|"no"| Y["Kernel-upgrade GPU-missing runbook"]
C -->|"yes"| D{"module version == userspace (NVML) version?"}
D -->|"differ: stale resident module"| E["Cordon + drain"]
D -->|"dmesg: disagrees about symbol / Invalid module format"| R["Rebuild module via DKMS"]
R --> E
E --> F["Stop GPU consumers: persistenced, FM, MPS, containers"]
F --> G["Unload nvidia_uvm, nvidia_drm, nvidia_modeset, nvidia_peermem, nvidia"]
G --> H{"all modules unloaded?"}
H -->|"in use"| I["fuser -k /dev/nvidia*; recheck lsmod"]
I --> G
H -->|"clean"| J["modprobe nvidia + restart persistenced/FM"]
J --> K["Verify: versions match, dcgmi diag"]
K -->|"pass"| L["Uncordon"]
K -->|"fail / cannot unload"| M["Reboot node, then verify"]
M --> L
Procedure
Cordon and drain before unloading modules: unloading nvidia while a job holds /dev/nvidia* will fail, and you do not want the scheduler placing work on a half-fixed node. NODE is the Kubernetes node name (Slurm equivalent: scontrol update nodename=<n> state=drain reason="driver module reload").
1. Cordon so nothing new lands:
2. Drain running pods, keeping DaemonSets (GPU operator / device-plugin / DCGM exporter):
From here run steps 3–7 on the node (ssh "$NODE", as root); the node keeps CPU and console access while the GPU driver is down.
3. If dmesg showed a build mismatch (disagrees about version of symbol / Invalid module format), rebuild the module for the running kernel before reloading; otherwise skip to step 4 (kernel modules):
4. Stop every GPU consumer so the kernel module refcount can reach zero. Persistence mode (and nvidia-persistenced) deliberately holds the module loaded, so it must be stopped first (runbook: persistence mode, runbook: fabric manager failure):
   nvidia-smi -pm 0 # disable persistence mode (best-effort; may itself error on mismatch)
   systemctl stop nvidia-persistenced # releases the persistent /dev/nvidia* handle
   systemctl stop nvidia-fabricmanager # NVSwitch systems only; FM holds the module
   # stop MPS / containers / any process still holding the device:
   echo quit | nvidia-cuda-mps-control 2>/dev/null || true
   fuser -v /dev/nvidia* # list remaining holders (jobs, exporters)
5. Unload the resident modules in dependency order (uvm and the display modules depend on nvidia, so they unload first). This is the step that clears the stale version:
   modprobe -r nvidia_uvm nvidia_drm nvidia_modeset nvidia_peermem nvidia
   # if any is still in use, find and clear the holder, then retry:
   # lsof /dev/nvidia* # processes holding the device
   # fuser -k /dev/nvidia* # kill them (you already drained; this clears strays)
   lsmod | grep -E 'nvidia' || echo "all nvidia modules unloaded"
   If modprobe -r still reports Module nvidia is in use after clearing holders, the safe path is a reboot (Rollback); do not force-remove a live GPU module.
6. Load the new module and restart the services. A bare modprobe nvidia loads only the core module; nvidia_uvm is then loaded on demand, normally on first CUDA use rather than by nvidia-smi (the GPU software stack):
7. Confirm the versions now agree before returning to service; this is the proof the mismatch is gone:
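The steps above whose commands are not spelled out (cordon, drain, rebuild, load) could look like the following sketch. Everything here is an assumption to validate on a canary: the function names and the DRY_RUN and HAS_NVSWITCH variables are hypothetical, the drain flags are common kubectl choices rather than this fleet's policy, and dkms autoinstall is only one way to rebuild for the running kernel.

```shell
# Hypothetical wrapper: with DRY_RUN=1, print each command instead of running it.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        printf '+ %s\n' "$*"
    else
        "$@"
    fi
}

# Steps 1-2, from a host with kubectl access. $1 = Kubernetes node name.
cordon_and_drain() {
    node="$1"
    run kubectl cordon "$node"
    # Assumed flags: keep DaemonSets, allow eviction of pods using emptyDir.
    run kubectl drain "$node" --ignore-daemonsets --delete-emptydir-data
}

# Step 3, on the node, only when dmesg showed a build mismatch.
rebuild_module() {
    run dkms autoinstall -k "$(uname -r)"
}

# Step 6, on the node, after all nvidia modules are unloaded.
load_and_restart() {
    run modprobe nvidia
    run systemctl start nvidia-persistenced
    # NVSwitch systems only (set HAS_NVSWITCH=1): restart Fabric Manager too.
    if [ "${HAS_NVSWITCH:-0}" = "1" ]; then
        run systemctl start nvidia-fabricmanager
    fi
}

# Usage (not executed here):
#   DRY_RUN=1; cordon_and_drain "$NODE"     # review the plan first
#   DRY_RUN=0; cordon_and_drain "$NODE"
```

Running once with DRY_RUN=1 gives a reviewable command list before anything mutates the node. Step 7 is the same comparison of /proc/driver/nvidia/version against modinfo -F version nvidia used in the pre-checks.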
Verification
The node is fixed only when the resident module and userspace report the same version, GPUs enumerate, and a real compute proof passes, not just "nvidia-smi printed something".
# 1. NVML no longer errors; all GPUs enumerate on one driver version:
ssh "$NODE" nvidia-smi --query-gpu=index,name,driver_version,pstate --format=csv,noheader
# 2. Resident module == on-disk module == NVML (the thing that broke):
ssh "$NODE" 'cat /proc/driver/nvidia/version' # the NVRM line carries the resident version; compare against:
ssh "$NODE" 'modinfo -F version nvidia' # must be the same version number
# 3. No driver errors left in the ring buffer:
ssh "$NODE" "dmesg | grep -iE 'NVRM|Xid|disagrees about version' || echo clean"
# 4. Persistence (and FM on NVSwitch) back up:
ssh "$NODE" systemctl is-active nvidia-persistenced
ssh "$NODE" systemctl is-active nvidia-fabricmanager # NVSwitch systems only -> active
Then run a real hardware proof; at least one must pass before uncordon:
# DCGM diagnostic (run level 2 = medium; level 3 adds the long stress/NCCL pass):
ssh "$NODE" 'dcgmi diag -r 2' # must contain no "Fail"
# A short NCCL collective confirms the CUDA path end-to-end (single node):
ssh "$NODE" 'all_reduce_perf -b 8 -e 256M -f 2 -g <NUM_GPUS>' # busbw plausible, no error
dcgmi diag exercising compute is the authoritative "the driver works" signal; matching version strings alone only prove the reload took (gpu health gating). Return to service:
kubectl uncordon "$NODE"
kubectl describe node "$NODE" | grep -E 'nvidia.com/gpu' # GPUs re-advertised
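The "must contain no Fail" rule for dcgmi diag can be turned into a gate in front of the uncordon. This is a minimal sketch under stated assumptions: diag_passed is a hypothetical name, it only scans the text report for the word Fail, and it treats empty output as a failure so that a dcgmi that never ran cannot pass the gate.

```shell
# Hypothetical gate: succeeds only when the diag report on stdin is non-empty
# and contains no "Fail" marker.
diag_passed() {
    out=$(cat)                      # read the whole report from stdin
    [ -n "$out" ] || return 1       # assumption: no output is not a pass
    if printf '%s\n' "$out" | grep -q 'Fail'; then
        return 1
    fi
    return 0
}

# Usage (not executed here):
#   ssh "$NODE" 'dcgmi diag -r 2' | diag_passed && kubectl uncordon "$NODE"
```

Scanning text is deliberately crude; if your DCGM version offers a structured output mode, parsing that is likely more robust.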
Rollback
The module reload is non-destructive (no package or kernel change), so there is little to "revert"; the fallbacks are escalating recovery, not undo:
- Reboot the node if the modules cannot be unloaded (Module nvidia is in use after clearing holders) or do not load cleanly after reinstall. A reboot loads the matched module from a clean state and resolves the mismatch deterministically; it is the documented fix when a live unload is not possible:
- Reinstall the driver package if step 3's rebuild and a reboot both leave a version mismatch, since a half-applied package transaction can leave the on-disk module and userspace inconsistent. Reinstall pins both to your adopted branch, then reboot (the GPU software stack, runbook: driver upgrade):
- If a reboot + reinstall still does not enumerate the GPUs, the fault is not a module reload, so divert: GSP/driver mismatch → runbook: GSP firmware mismatch; suspected hardware → the GPU-fault runbook.
Permanent prevention (so the ticket does not recur): hold unattended driver upgrades on GPU nodes, and bake a module reload (or scheduled reboot behind cordon/drain) into the post-upgrade hook so userspace and the resident module never drift. The planned, fleet-wide version of this is runbook: driver upgrade.
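One way to hold driver upgrades on a GPU node is to mark the driver packages as held, which keeps both manual and unattended upgrades from replacing userspace under a resident module. The sketch assumes a Debian/Ubuntu node managed with apt; the function names are hypothetical and the nvidia- / libnvidia- prefixes are an assumption to replace with the exact package names of your adopted driver branch.

```shell
# Hypothetical filter: from a list of package names on stdin, keep the ones
# that look like NVIDIA driver packages. The prefixes are an assumption.
nvidia_packages() {
    grep -E '^(nvidia-|libnvidia-)'
}

# Hypothetical helper: hold every installed driver package (run as root).
hold_driver_packages() {
    pkgs=$(dpkg-query -W -f='${Package}\n' | nvidia_packages)
    [ -n "$pkgs" ] || return 0      # nothing to hold
    # Word splitting is intended: one argument per package name.
    apt-mark hold $pkgs
}

# Usage on the node (not executed here):
#   hold_driver_packages
#   apt-mark showhold               # confirm what is held
```

Release the hold with apt-mark unhold only as part of the planned driver upgrade, so the package change and the module reload happen together.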
Related runbooks
- runbook: kernel upgrade, GPU missing: No devices were found after a kernel bump (no module built); this runbook is the version-mismatch / failed-bind sibling where a module is built.
- runbook: driver upgrade: planned, fleet-wide driver/CUDA roll; doing it correctly (reload/reboot per node) prevents this mismatch.
- runbook: GSP firmware mismatch: module loads but against the wrong GSP blob; same family, different layer.
- runbook: fabric manager failure: FM holds the module and must be stopped/restarted around the reload on NVSwitch systems.
- runbook: persistence mode: persistence/nvidia-persistenced holds the module loaded; stop it before unload, re-arm after.
- operational runbooks: runbook index and the shared trigger→verify→rollback shape.
References
- NVIDIA Driver Installation Guide, Kernel Modules (the five nvidia* modules, DKMS, flavors): https://docs.nvidia.com/datacenter/tesla/driver-installation-guide/kernel-modules.html
- NVIDIA Driver Installation Guide, Advanced Options (reinstall, flavor switching): https://docs.nvidia.com/datacenter/tesla/driver-installation-guide/advanced-options.html
- NVIDIA Driver Persistence (nvidia-smi -pm, nvidia-persistenced holds the module loaded): https://docs.nvidia.com/deploy/driver-persistence/index.html
- NVIDIA Fabric Manager user guide (FM/driver lifecycle on NVSwitch systems): https://docs.nvidia.com/datacenter/tesla/fabric-manager-user-guide/index.html
- DKMS manual (dkms status, autoinstall, vermagic): https://github.com/dell/dkms
- modprobe(8), modprobe -r module removal and dependency ordering: https://manpages.ubuntu.com/manpages/noble/man8/modprobe.8.html
- fuser(1) / lsof(8), find and clear processes holding /dev/nvidia*: https://manpages.ubuntu.com/manpages/noble/man1/fuser.1.html
- DCGM diagnostics (run levels for the verification proof): https://docs.nvidia.com/datacenter/dcgm/latest/user-guide/dcgm-diagnostics.html
- nccl-tests (all_reduce_perf): https://github.com/NVIDIA/nccl-tests
- kubectl drain (safely drain a node): https://kubernetes.io/docs/tasks/administer-cluster/safely-drain-node/
Related: Kernel Modules · Software Stack · Fabric Manager · GPU Health Gating · Kernel Upgrade — GPU Missing · Driver Upgrade · Operational Runbooks · Glossary