Runbook: driver / kernel module load failure

Scope: recover a node where the NVIDIA kernel modules fail to load or refuse to bind. Either modprobe nvidia errors, or nvidia-smi returns Failed to initialize NVML: Driver/library version mismatch because the resident kernel module and the userspace libraries come from different driver versions. Typical causes are an in-place driver upgrade that never reloaded the module, a half-applied package transaction, or a wedged nvidia_uvm. This is distinct from a clean kernel upgrade with no driver present (No devices were found), which is covered by runbook: kernel upgrade, GPU missing. Single-node procedure; run it behind cordon/drain.

Run this when all of the following hold: nvidia-smi on the node prints Failed to initialize NVML: Driver/library version mismatch or modprobe nvidia fails; the GPU is still on the PCIe bus (lspci lists it); and a DKMS module exists for the running kernel. In other words, the driver is installed, but either the loaded module and the userspace stack disagree or the module will not bind. Severity: node-down, single node; the scheduler routes around the node once it stops advertising GPUs.

Reference templates, not hardware-tested. Pin exact versions and package names to your adopted driver branch and validate on a canary before fleet use (runbook: driver upgrade).

The version-mismatch case is specific: a driver package upgrade (or unattended-upgrade) replaced libnvidia-ml.so and the rest of userspace without unloading the old in-core module, so the running module's version no longer matches NVML. Until the old module is unloaded and the new one loaded, every NVML caller fails identically. The kernel-space layer (the five nvidia* modules, DKMS, open-vs-proprietary flavor) is described in kernel modules; the package install/upgrade mechanics that produce this state are in the GPU software stack. On NVSwitch systems a module reload also tears down and rebuilds the NVLink domain, so Fabric Manager must be restarted in step (runbook: fabric manager failure).

Trigger

nvidia-smi returns Failed to initialize NVML: Driver/library version mismatch: the loaded kernel module version and the NVML userspace library version differ (commonly after an in-place driver upgrade with no reboot/reload).
modprobe nvidia errors (ERROR: could not insert 'nvidia', Invalid module format, Exec format error, or Unknown symbol): the module exists but will not bind to the running kernel/driver.
The GPU device-plugin stops advertising nvidia.com/gpu; DCGM exporter metrics flatline for the node (gpu health gating, troubleshooting).

This is not the "GPU missing after a kernel bump" case: here a module is built and present for uname -r (dkms status shows installed); the problem is a stale resident module or a failed bind, not a missing build. If dkms status has no entry for the running kernel, or nvidia-smi says No devices were found, go to runbook: kernel upgrade, GPU missing. If the module loads but against the wrong GSP blob, go to runbook: GSP firmware mismatch.

Pre-checks

Run these read-only first; they confirm this is a reload/bind problem and not a missing build or hardware fault. None mutate the node.

```
NODE=gpu-07.dc1.internal     # set to the affected node

# 1. The authoritative symptom and the two version strings:
nvidia-smi                                          # expect: Failed to initialize NVML: Driver/library version mismatch
cat /proc/driver/nvidia/version                     # version baked into the RESIDENT kernel module
modinfo -F version nvidia                            # version of the ON-DISK module that would load next

# 2. Is a (mismatched) module still resident?
lsmod | grep -E 'nvidia'                             # nvidia / nvidia_uvm / nvidia_modeset / nvidia_drm refcounts

# 3. Is a module BUILT for the running kernel? (distinguishes from the GPU-missing runbook)
uname -r
dkms status                                          # want nvidia/<VER>, <running-kernel>: installed
modinfo -F vermagic nvidia                           # leading token must equal `uname -r`

# 4. GPU still on the bus? (rules out a seating/hardware fault)
lspci | grep -i nvidia

# 5. Why did a load attempt fail? (the authoritative line)
dmesg | grep -iE 'NVRM|nvidia|Invalid module format|Unknown symbol|disagrees about version'
```

Read the signal:

/proc/driver/nvidia/version and modinfo -F version nvidia report different versions → classic version mismatch: an upgrade swapped userspace and the on-disk module, but the old module is still resident. The first prints a banner line that contains the running module's version; the second prints the bare version that will load after an unload, so compare the version numbers, not the raw strings. Fix = unload the resident module and load the new one (Procedure).
dmesg shows disagrees about version of symbol / Invalid module format / Exec format error → the on-disk module was built against a different kernel/driver than the one running; rebuild it via DKMS (the same rebuild as in runbook: kernel upgrade, GPU missing) before reloading.
dkms status has no installed row for uname -r → wrong runbook: runbook: kernel upgrade, GPU missing.
lspci does not list the GPU → hardware/seating fault, not a module problem: the GPU-fault runbook.
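
The decision rules above can be sketched as a single shell function. This is a minimal illustration under stated assumptions, not part of the tested procedure: `classify_node`, its arguments, and the labels it prints are hypothetical names, and reading the resident version from `/sys/module/nvidia/version` assumes the module exports its version there.

```shell
# Hypothetical helper (not part of the tested procedure): map the pre-check
# results to a branch of this runbook. The labels it prints are illustrative.
# Args: <gpu_on_bus: yes|no> <dkms_installed: yes|no> <resident_ver> <ondisk_ver>
classify_node() (
  gpu_on_bus=$1
  dkms_installed=$2
  resident=$3
  ondisk=$4
  if [ "$gpu_on_bus" != yes ]; then
    echo "gpu-fault"                      # lspci does not list the GPU
  elif [ "$dkms_installed" != yes ]; then
    echo "kernel-upgrade-gpu-missing"     # no module built for uname -r
  elif [ -z "$resident" ]; then
    echo "load-failed"                    # nothing resident: read dmesg, maybe rebuild
  elif [ "$resident" != "$ondisk" ]; then
    echo "stale-resident-module"          # unload the old module, load the new one
  else
    echo "versions-match"                 # look elsewhere (dmesg, GSP, hardware)
  fi
)

# Gathering the inputs on the node (commented out so the sketch runs anywhere):
# gpu=no;  lspci | grep -qi nvidia && gpu=yes
# dkms=no; dkms status | grep -F "$(uname -r)" | grep -q installed && dkms=yes
# classify_node "$gpu" "$dkms" \
#   "$(cat /sys/module/nvidia/version 2>/dev/null)" \
#   "$(modinfo -F version nvidia 2>/dev/null)"
```

The checks run in the same order as the flow below, so a hardware fault or a missing build is ruled out before the version strings are compared.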

Flow

```mermaid
flowchart TB
    A["nvidia-smi: NVML version mismatch / modprobe fails"] --> B{"lspci lists the GPU?"}
    B -->|"no"| Z["GPU fault / RMA runbook"]
    B -->|"yes"| C{"dkms 'installed' for uname -r?"}
    C -->|"no"| Y["Kernel-upgrade GPU-missing runbook"]
    C -->|"yes"| D{"module version == userspace (NVML) version?"}
    D -->|"differ: stale resident module"| E["Cordon + drain"]
    D -->|"dmesg: disagrees about symbol / Invalid module format"| R["Rebuild module via DKMS"]
    R --> E
    E --> F["Stop GPU consumers: persistenced, FM, MPS, containers"]
    F --> G["Unload nvidia_uvm, nvidia_drm, nvidia_modeset, nvidia_peermem, nvidia"]
    G --> H{"all modules unloaded?"}
    H -->|"in use"| I["fuser -k /dev/nvidia*; recheck lsmod"]
    I --> G
    H -->|"clean"| J["modprobe nvidia + restart persistenced/FM"]
    J --> K["Verify: versions match, dcgmi diag"]
    K -->|"pass"| L["Uncordon"]
    K -->|"fail / cannot unload"| M["Reboot node, then verify"]
    M --> L
```

Procedure

Cordon and drain before unloading modules: unloading nvidia while a job holds /dev/nvidia* will fail, and you do not want the scheduler placing work on a half-fixed node. NODE is the Kubernetes node name (Slurm equivalent: scontrol update nodename=<n> state=drain reason="driver module reload").

1. Cordon so nothing new lands:
```
kubectl cordon "$NODE"
```
2. Drain running pods, keeping DaemonSets (GPU operator / device-plugin / DCGM exporter):
```
kubectl drain "$NODE" --ignore-daemonsets --delete-emptydir-data --timeout=15m
```
From here, run steps 3–7 on the node itself (ssh "$NODE", as root); the node keeps CPU and console access while the GPU driver is down.

3. If dmesg showed a build mismatch (disagrees about version of symbol / Invalid module format), rebuild the module for the running kernel before reloading; otherwise skip to step 4 (kernel modules):

```
dkms autoinstall                                  # rebuild+install every registered module for `uname -r`
modinfo -F vermagic nvidia | grep "$(uname -r)"   # confirm the on-disk module now matches the kernel
```
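
The `grep "$(uname -r)"` confirmation treats the kernel release as a regular expression and also accepts substring matches (a release ending in `-10` would match a vermagic for `-105`). A stricter check compares only the leading vermagic token, as the pre-checks describe. The sketch below assumes nothing beyond POSIX shell; `vermagic_matches` is a hypothetical helper name.

```shell
# Hypothetical helper: succeed only if the leading token of a vermagic string
# equals the given kernel release exactly (no substring or regex matching).
# Args: <vermagic string> <kernel release>
vermagic_matches() (
  vermagic=$1
  kernel=$2
  [ "${vermagic%% *}" = "$kernel" ]   # strip everything after the first space
)

# On the node (commented out so the sketch runs anywhere):
# if vermagic_matches "$(modinfo -F vermagic nvidia)" "$(uname -r)"; then
#   echo "on-disk module matches the running kernel"
# else
#   echo "on-disk module was built for another kernel: rebuild via DKMS"
# fi
```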

4. Stop every GPU consumer so the kernel module refcount can reach zero. Persistence mode and nvidia-persistenced deliberately hold the module loaded, so they must be stopped first (runbook: persistence mode, runbook: fabric manager failure):

```
nvidia-smi -pm 0                                  # disable persistence mode (best-effort; may itself error on mismatch)
systemctl stop nvidia-persistenced                # releases the persistent /dev/nvidia* handle
systemctl stop nvidia-fabricmanager               # NVSwitch systems only; FM holds the module
# stop MPS / containers / any process still holding the device:
echo quit | nvidia-cuda-mps-control 2>/dev/null || true
fuser -v /dev/nvidia*                             # list remaining holders (jobs, exporters)
```
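
To see whether the refcounts have actually dropped, the "Used by" column of `lsmod` can be filtered for the nvidia modules. This is a small sketch; `busy_nvidia_modules` is a hypothetical helper name, and it relies only on the standard three-column `lsmod` layout (module, size, use count).

```shell
# Hypothetical helper: read `lsmod` output on stdin and print every nvidia*
# module whose use count is nonzero, as "<module> <count>".
busy_nvidia_modules() {
  awk 'NR > 1 && $1 ~ /^nvidia/ && $3 > 0 { print $1, $3 }'
}

# On the node (commented out so the sketch runs anywhere):
# lsmod | busy_nvidia_modules
```

The count on the core `nvidia` module includes the dependent modules that are still loaded, so a nonzero value there is expected until `nvidia_uvm`, `nvidia_drm`, `nvidia_modeset`, and `nvidia_peermem` are gone; a nonzero count on those dependents is what points at a remaining holder.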

5. Unload the resident modules in dependency order (uvm and the display modules depend on nvidia, so they unload first). This is the step that clears the stale version:

```
modprobe -r nvidia_uvm nvidia_drm nvidia_modeset nvidia_peermem nvidia
# if any is still in use, find and clear the holder, then retry:
#   lsof /dev/nvidia*            # processes holding the device
#   fuser -k /dev/nvidia*        # kill them (you already drained; this clears strays)
lsmod | grep -E 'nvidia' || echo "all nvidia modules unloaded"
```

If modprobe -r still reports Module nvidia is in use after clearing holders, the safe path is a reboot (Rollback); do not force-remove a live GPU module.
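
A bounded retry keeps the unload attempt from turning into an open-ended loop before the reboot decision. The sketch below is one way to express that in POSIX shell; `retry` is a hypothetical helper, and the attempt count and pause in the usage line are arbitrary illustration values.

```shell
# Hypothetical helper: run a command up to <attempts> times, sleeping <pause>
# seconds between failed attempts. Returns 0 on the first success, 1 otherwise.
# Args: <attempts> <pause_seconds> <command...>
retry() (
  attempts=$1
  pause=$2
  shift 2
  i=1
  while [ "$i" -le "$attempts" ]; do
    "$@" && exit 0
    [ "$i" -lt "$attempts" ] && sleep "$pause"
    i=$((i + 1))
  done
  exit 1
)

# On the node (commented out so the sketch runs anywhere):
# retry 3 5 modprobe -r nvidia_uvm nvidia_drm nvidia_modeset nvidia_peermem nvidia \
#   || echo "modules still in use after 3 attempts: reboot the node (see Rollback)"
```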

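6. Load the new module and restart the services. A bare modprobe nvidia loads only the core module; nvidia_uvm is normally loaded on demand by the first CUDA client through the nvidia-modprobe helper rather than by nvidia-smi, so load it explicitly with modprobe nvidia_uvm if anything on the node needs it before the first CUDA process starts (the GPU software stack):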
```
modprobe nvidia
systemctl start nvidia-persistenced
systemctl start nvidia-fabricmanager              # NVSwitch systems only
```

7. Confirm the versions now agree before returning to service; this is the proof the mismatch is gone:

```
cat /proc/driver/nvidia/version                   # resident module version
modinfo -F version nvidia                          # on-disk module version  -> the two version numbers MUST match now
nvidia-smi --query-gpu=index,name,driver_version --format=csv,noheader   # enumerates, no NVML error
```
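
Because `/proc/driver/nvidia/version` prints a banner rather than a bare version, comparing it with `modinfo -F version nvidia` by eye is error-prone. The following sketch extracts the version token from the banner and compares the two values; `nvrm_version` and `versions_agree` are hypothetical helper names, and the parser assumes the version appears as a dotted numeric token on the line starting with `NVRM version:`.

```shell
# Hypothetical helper: read /proc/driver/nvidia/version on stdin and print the
# first dotted numeric token (e.g. X.Y or X.Y.Z) on the "NVRM version:" line.
nvrm_version() {
  awk '/^NVRM version:/ {
         for (i = 1; i <= NF; i++)
           if ($i ~ /^[0-9]+\.[0-9]+(\.[0-9]+)?$/) { print $i; exit }
       }'
}

# Hypothetical helper: succeed only if both versions are non-empty and equal.
versions_agree() {
  [ -n "$1" ] && [ "$1" = "$2" ]
}

# On the node (commented out so the sketch runs anywhere):
# resident=$(nvrm_version < /proc/driver/nvidia/version)
# ondisk=$(modinfo -F version nvidia)
# if versions_agree "$resident" "$ondisk"; then
#   echo "match: $resident"
# else
#   echo "MISMATCH: resident=$resident on-disk=$ondisk"
# fi
```

Treating an empty value as a failure means an unloaded module (no `/proc` file) is never reported as a match.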

Verification

The node is fixed only when the resident module and userspace report the same version, GPUs enumerate, and a real compute proof passes, not just "nvidia-smi printed something".

```
# 1. NVML no longer errors; all GPUs enumerate on one driver version:
ssh "$NODE" nvidia-smi --query-gpu=index,name,driver_version,pstate --format=csv,noheader

# 2. Resident module == on-disk module == NVML (the thing that broke):
ssh "$NODE" 'cat /proc/driver/nvidia/version'        # compare the version number against:
ssh "$NODE" 'modinfo -F version nvidia'              # version numbers must be identical

# 3. No driver errors in the ring buffer. The NVRM load banner is normal, so match
#    error patterns only; lines from before the reload may persist, so check timestamps:
ssh "$NODE" "dmesg | grep -iE 'Xid|disagrees about version|NVRM:.*(fail|error|mismatch)' || echo clean"

# 4. Persistence (and FM on NVSwitch) back up:
ssh "$NODE" systemctl is-active nvidia-persistenced
ssh "$NODE" systemctl is-active nvidia-fabricmanager   # NVSwitch systems only -> active
```

Then run a real hardware proof; at least one must pass before uncordon:

```
# DCGM diagnostic (run level 2 = medium; level 3 adds the longer stress tests):
ssh "$NODE" 'dcgmi diag -r 2'        # must contain no "Fail"
# A short NCCL collective confirms the CUDA path end-to-end (single node):
ssh "$NODE" 'all_reduce_perf -b 8 -e 256M -f 2 -g <NUM_GPUS>'   # busbw plausible, no error
```
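
The "no Fail" rule can be turned into a gate so that uncordon only happens on a clean diagnostic. This is a sketch under stated assumptions: `diag_clean` is a hypothetical helper that only scans text for the string `Fail`, and it deliberately treats empty output as a failure in case the diagnostic never ran.

```shell
# Hypothetical helper: read diagnostic output on stdin; fail if any line
# contains "Fail", or if there was no output at all.
diag_clean() {
  awk '/Fail/ { bad = 1 } END { exit (NR == 0 || bad) ? 1 : 0 }'
}

# From the admin host (commented out so the sketch runs anywhere):
# if ssh "$NODE" 'dcgmi diag -r 2' | diag_clean; then
#   echo "diagnostic clean: safe to uncordon"
# else
#   echo "diagnostic failed or produced no output: keep the node cordoned"
# fi
```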

dcgmi diag exercising compute is the authoritative "the driver works" signal; matching version strings alone only prove the reload took (gpu health gating). Return to service:

```
kubectl uncordon "$NODE"
kubectl describe node "$NODE" | grep -E 'nvidia.com/gpu'       # GPUs re-advertised
```

Rollback

The module reload is non-destructive (no package or kernel change), so there is little to "revert". The fallbacks are escalating recovery steps, not an undo:

Reboot the node if the modules cannot be unloaded (Module nvidia is in use after clearing holders) or the new module does not load cleanly after the reload. A reboot loads the matched module from a clean state and resolves the mismatch deterministically; it is the documented fix when a live unload is not possible:
```
ssh "$NODE" sudo reboot
# then re-run the Verification block before uncordon
```
Reinstall the driver package if step 3's rebuild and a reboot both leave a version mismatch, since a half-applied package transaction can leave the on-disk module and userspace inconsistent. Reinstall pins both to your adopted branch, then reboot (the GPU software stack, runbook: driver upgrade):
```
# Debian/Ubuntu (pin <BRANCH>; the -open flavor is supported on Turing and newer, and required on Blackwell and Grace Hopper):
ssh "$NODE" 'sudo apt-get install --reinstall -y "nvidia-driver-<BRANCH>-open"'
# RHEL/Rocky:
ssh "$NODE" 'sudo dnf reinstall -y "nvidia-driver" "kmod-nvidia-open-dkms"'
ssh "$NODE" sudo reboot
```
If a reboot + reinstall still does not enumerate the GPUs, the fault is not a module reload, so divert: GSP/driver mismatch → runbook: GSP firmware mismatch; suspected hardware → the GPU-fault runbook.

Permanent prevention (so the ticket does not recur): hold unattended driver upgrades on GPU nodes, and bake a module reload (or scheduled reboot behind cordon/drain) into the post-upgrade hook so userspace and the resident module never drift. The planned, fleet-wide version of this is runbook: driver upgrade.
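
On Debian/Ubuntu nodes, one way to hold unattended driver upgrades is to put the installed driver packages on hold with `apt-mark`. The sketch below is an assumption-laden example rather than a tested policy: `select_nvidia_packages` is a hypothetical helper, and its name pattern is a guess that should be adjusted to the package set of your adopted branch (it also matches other `nvidia-` packages such as container tooling).

```shell
# Hypothetical helper: from a list of package names on stdin, keep the ones
# whose names start with "nvidia-" or "libnvidia-". The pattern is an
# assumption; adjust it to your adopted driver branch.
select_nvidia_packages() {
  grep -E '^(lib)?nvidia-' || true    # no match is not an error
}

# On a Debian/Ubuntu node, as root (commented out so the sketch runs anywhere):
# dpkg-query -W -f='${Package}\n' | select_nvidia_packages | xargs -r apt-mark hold
# apt-mark showhold       # confirm; use `apt-mark unhold` during a planned upgrade
```

Held packages stay at their installed version until they are released, so the planned driver-upgrade runbook has to unhold them as part of its per-node reload or reboot.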

runbook: kernel upgrade, GPU missing: No devices were found after a kernel bump (no module built); this runbook is the version-mismatch / failed-bind sibling where a module is built.
runbook: driver upgrade: planned, fleet-wide driver/CUDA roll; doing it correctly (reload/reboot per node) prevents this mismatch.
runbook: GSP firmware mismatch: module loads but against the wrong GSP blob; same family, different layer.
runbook: fabric manager failure: FM holds the module and must be stopped/restarted around the reload on NVSwitch systems.
runbook: persistence mode: persistence/nvidia-persistenced holds the module loaded; stop it before unload, re-arm after.
operational runbooks: runbook index and the shared trigger→verify→rollback shape.

References

NVIDIA Driver Installation Guide — Kernel Modules (the five nvidia* modules, DKMS, flavors): https://docs.nvidia.com/datacenter/tesla/driver-installation-guide/kernel-modules.html
NVIDIA Driver Installation Guide — Advanced Options (reinstall, flavor switching): https://docs.nvidia.com/datacenter/tesla/driver-installation-guide/advanced-options.html
NVIDIA Driver Persistence (nvidia-smi -pm, nvidia-persistenced holds the module loaded): https://docs.nvidia.com/deploy/driver-persistence/index.html
NVIDIA Fabric Manager user guide (FM/driver lifecycle on NVSwitch systems): https://docs.nvidia.com/datacenter/tesla/fabric-manager-user-guide/index.html
DKMS manual (dkms status, autoinstall, vermagic): https://github.com/dell/dkms
modprobe(8) — modprobe -r module removal and dependency ordering: https://manpages.ubuntu.com/manpages/noble/man8/modprobe.8.html
fuser(1) / lsof(8) — find and clear processes holding /dev/nvidia*: https://manpages.ubuntu.com/manpages/noble/man1/fuser.1.html
DCGM diagnostics (run levels for the verification proof): https://docs.nvidia.com/datacenter/dcgm/latest/user-guide/dcgm-diagnostics.html
nccl-tests (all_reduce_perf): https://github.com/NVIDIA/nccl-tests
kubectl drain (safely drain a node): https://kubernetes.io/docs/tasks/administer-cluster/safely-drain-node/