Runbook: kernel upgrade, GPU missing
Scope: a single node that booted into a new kernel and now shows no GPUs (nvidia-smi: No devices were found) because the NVIDIA kernel module was never rebuilt for the running kernel, nouveau won the boot race, or Secure Boot rejected a module that is unsigned or signed with a key that is not enrolled. Recover the node and re-arm DKMS so the next kernel bump is a non-event. This is a single-node procedure; run it behind cordon/drain.
Reference templates, not hardware-tested. Pin exact versions and package names to your adopted driver branch and validate on a canary before fleet use (runbook: driver upgrade).
The kernel-space layer this runbook repairs (the five nvidia* modules, the open-vs-proprietary flavor, DKMS, nouveau blacklisting, MOK signing) is described in kernel modules; the package-level install/upgrade mechanics are in install lifecycle. This page is the "kernel changed, GPUs vanished" procedure those pages point at.
Trigger
- After an unattended-upgrade, apt upgrade/dnf update, or an image roll bumped the kernel, the node reboots and nvidia-smi returns No devices were found (or NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver). The GPU hardware is on the PCIe bus (lspci still lists it); only the driver is gone.
- The GPU device-plugin stops advertising nvidia.com/gpu; the node drops to zero allocatable GPUs and the scheduler routes around it (Kubernetes GPU).
- DCGM exporter / node GPU metrics flatline for the node (telemetry and monitoring).
Two distinct root causes hide behind the same symptom, and the pre-checks disambiguate them: (a) DKMS did not build the module for the new kernel (missing headers or a build failure), so there is no nvidia.ko for uname -r; (b) the module exists but cannot load because nouveau claimed the device first, or Secure Boot rejected an unsigned module (dmesg: Key was rejected by service). A third, adjacent failure (modules load against the wrong GSP firmware blob) presents similarly but is repaired in runbook: GSP firmware mismatch, not here.
Pre-checks
Run these read-only first; they tell you which of the two root causes you have and therefore which branch of the Procedure to take. None mutate the node.
NODE=gpu-07.dc1.internal # set to the affected node
# 1. Which kernel is actually running now?
uname -r # e.g. 6.8.0-51-generic
# 2. Is there a DKMS-built NVIDIA module for THAT kernel?
dkms status # want: nvidia/<VER>, <running-kernel>, x86_64: installed
# states: "added" (registered, not built) / "built" (compiled, not installed) / "installed" (in place)
# 3. What is loaded right now — nvidia, or nouveau, or neither?
lsmod | grep -E 'nvidia|nouveau' # nouveau present == it won the boot race
# 4. Is the GPU even on the bus? (rules out a hardware/seating fault)
lspci | grep -i nvidia # expect one line per GPU
# 5. Boot args — did someone pass nomodeset / rd.driver.blacklist, etc.?
cat /proc/cmdline
# 6. Secure Boot on? (decides whether modules must be signed)
mokutil --sb-state # "SecureBoot enabled" / "disabled"
# 7. Why did the module fail to load? (the authoritative line)
dmesg | grep -iE 'nvidia|nouveau|NVRM|taint|Key was rejected'
Read the signal:
- dkms status lists no entry for the running kernel, or shows added/built but not installed → root cause (a), a missing/failed DKMS build. Confirm the matching headers are present: dpkg -l | grep "linux-headers-$(uname -r)" (Debian/Ubuntu) or rpm -q "kernel-devel-$(uname -r)" (RHEL-family). A DKMS build silently no-ops without them.
- dkms status shows installed for the running kernel but lsmod has nouveau (and no nvidia) → root cause (b), the nouveau race; blacklist and rebuild initramfs.
- installed + Secure Boot enabled + dmesg shows Key was rejected by service → root cause (b), an unsigned/unenrolled module; (re-)sign via the MOK.
- /proc/cmdline contains nomodeset or rd.driver.blacklist=nvidia, or lacks an expected nouveau.modeset=0 → a boot-arg problem; fix the bootloader entry, not the driver.
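The first three signals can be turned into a small triage function. This is a minimal sketch, assuming the read-only pre-check outputs are captured as strings first; the function name classify_gpu_missing and the branch labels it prints are hypothetical, and the boot-arg case is left to a manual read of /proc/cmdline.

```shell
# Hypothetical helper: map pre-check outputs to a Procedure branch.
# Args: 1=running kernel   2=`dkms status` text   3=`lsmod` text
#       4=`mokutil --sb-state` text   5=`dmesg` text
classify_gpu_missing() {
  local kernel="$1" dkms_out="$2" lsmod_out="$3" sb_out="$4" dmesg_out="$5"

  # (a) no "installed" nvidia entry for the running kernel -> DKMS rebuild
  if ! printf '%s\n' "$dkms_out" | grep -F -- ", $kernel," |
       grep -i 'nvidia' | grep -q ': installed'; then
    echo "dkms-build"; return 0
  fi
  # (b) nouveau loaded and nvidia not -> blacklist + initramfs rebuild
  if printf '%s\n' "$lsmod_out" | grep -q '^nouveau' &&
     ! printf '%s\n' "$lsmod_out" | grep -q '^nvidia'; then
    echo "nouveau-race"; return 0
  fi
  # (b) Secure Boot on and the kernel rejected the key -> sign / enroll MOK
  if printf '%s\n' "$sb_out" | grep -qi 'SecureBoot enabled' &&
     printf '%s\n' "$dmesg_out" | grep -q 'Key was rejected'; then
    echo "secure-boot-signing"; return 0
  fi
  echo "unclassified"
}

# Usage on the node:
# classify_gpu_missing "$(uname -r)" "$(dkms status)" "$(lsmod)" \
#   "$(mokutil --sb-state 2>&1)" "$(dmesg)"
```

An "unclassified" result means none of the three patterns matched, which is the cue to read dmesg and /proc/cmdline by hand rather than guess a branch.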
Verify the driver version still installed (so step 4 of the Procedure pins the right one):
# Debian/Ubuntu
dpkg -l | grep -E 'nvidia-(driver|dkms)'
# RHEL-family
rpm -qa | grep -E 'nvidia|kmod-nvidia'
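When the package query is ambiguous, the <VER> that DKMS itself has registered can be read from dkms status. The sketch below is one way to do that, assuming the module is registered as nvidia (or nvidia-open) and that the output uses either the "nvidia/<VER>, <kernel>, ..." layout or the older "nvidia, <VER>, <kernel>, ..." one; the function name is hypothetical.

```shell
# Hypothetical helper: print each NVIDIA version registered with DKMS, once.
# Reads `dkms status` text on stdin. Entries in state "added" count too, since
# in root cause (a) the running kernel may have no line of its own.
nvidia_dkms_versions() {
  awk -F'[,:] *' '
    { sub("/", ", ") }                              # normalize name/VER to name, VER
    $1 ~ /^nvidia(-open)?$/ && !seen[$2]++ { print $2 }
  '
}

# Usage on the node:
# dkms status | nvidia_dkms_versions
```

More than one line of output means several driver versions are registered; settle which one is the adopted branch before rebuilding.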
Procedure
Cordon and drain before touching modules or rebooting: the rebuild may need a second reboot, and you do not want the scheduler placing work on a half-fixed node. NODE is the Kubernetes node name (Slurm equivalent: scontrol update nodename=<n> state=drain reason="kernel/driver rebuild").
1. Cordon the node (kubectl cordon "$NODE") so nothing new lands on it.
2. Drain running pods (kubectl drain), keeping DaemonSets (GPU operator / device-plugin / DCGM exporter).
   From here, run steps 3–8 on the node (ssh "$NODE", as root). It still has CPU/console even with the GPU driver down.
3. Install the matching kernel headers: linux-headers-$(uname -r) on Debian/Ubuntu, kernel-devel-$(uname -r) on RHEL-family. The build is a no-op without them; this is the single most common reason dkms autoinstall "succeeds" but builds nothing.
4. Rebuild and install the NVIDIA module for the running kernel. Prefer DKMS; it is what the -dkms/--dkms packages register and what keeps future upgrades automatic (kernel modules).
   sudo dkms autoinstall # build+install every registered module for `uname -r`
   # or pin the exact module explicitly (VER from the Pre-checks):
   sudo dkms install "nvidia/<VER>" -k "$(uname -r)"
   If DKMS is not registered or the build is broken, reinstall the driver package, which re-triggers the DKMS hook:
   # Debian / Ubuntu (pin <BRANCH> to your adopted branch; -open on Turing+/Blackwell/Grace Hopper)
   sudo apt-get install --reinstall -y "nvidia-driver-<BRANCH>-open"
   # RHEL / Rocky
   sudo dnf reinstall -y "nvidia-driver" "kmod-nvidia-open-dkms"
   .run-installer nodes instead re-run the installer against the running kernel (it re-registers DKMS with --dkms).
   If a stale half-built tree is fighting you, remove it before rebuilding: sudo dkms remove "nvidia/<VER>" --all (package path) or sudo nvidia-installer --uninstall (.run path), then redo step 4.
5. Blacklist nouveau if the Pre-checks showed it loaded (root cause b), then rebuild the initramfs. Datacenter packages usually do this, but a kernel/initramfs change can drop the config.
6. Re-sign for Secure Boot if mokutil --sb-state reported enabled. A freshly DKMS-built module is unsigned unless DKMS is configured to sign it; an unsigned module under Secure Boot fails to load and looks identical to "GPU missing". Configure DKMS to sign every rebuild, then confirm the key is enrolled:
   # one-time: generate/enroll a Machine Owner Key, set a one-time password, enroll at next boot
   sudo mokutil --import /var/lib/dkms/mok.pub
   # point DKMS at the signing key+cert so EVERY future rebuild is signed automatically:
   # in /etc/dkms/framework.conf -> mok_signing_key="..." and mok_certificate="..."
   # verify the just-built module actually carries a signature:
   modinfo nvidia | grep -i 'sig' # expect sig_id / signer / sig_key fields populated
   mokutil --list-enrolled | grep -i 'DKMS\|nvidia'
   On Ubuntu, the prebuilt signed packages from ubuntu-drivers avoid manual MOK handling entirely; prefer them when available (kernel modules, install lifecycle).
7. Confirm the build before rebooting; it is cheaper than a failed reboot. The module's vermagic (modinfo -F vermagic nvidia) must match the running kernel.
8. Reboot to load the freshly built module cleanly (and to pass the MOK-enrollment screen if step 6 ran).
   On Secure Boot nodes the first reboot after mokutil --import shows the blue MOK Manager: choose Enroll MOK and enter the one-time password, or the new module still will not load.
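Several of the steps above (cordon, drain, headers, nouveau blacklist, vermagic check, reboot) carry no inline command. The sketch below prints one plausible command sequence for them without executing anything, so it can be reviewed first. The function name, the 10-minute drain timeout, and the blacklist file name /etc/modprobe.d/blacklist-nouveau.conf are assumptions for illustration; adjust them to your fleet's conventions.

```shell
# Hypothetical dry-run helper: PRINT (never execute) commands for steps 1-3, 5, 7, 8.
# Args: 1=distro family ("debian" or "rhel")  2=running kernel  3=node name
recovery_plan() {
  local family="$1" kernel="$2" node="$3" headers initramfs
  case "$family" in
    debian) headers="sudo apt-get install -y linux-headers-$kernel"
            initramfs="sudo update-initramfs -u -k $kernel" ;;
    rhel)   headers="sudo dnf install -y kernel-devel-$kernel"
            initramfs="sudo dracut --force --kver $kernel" ;;
    *)      echo "unknown family: $family" >&2; return 2 ;;
  esac
  cat <<EOF
# steps 1-2 (from a host with cluster access)
kubectl cordon $node
kubectl drain $node --ignore-daemonsets --delete-emptydir-data --timeout=10m
# step 3 (on the node): headers for the running kernel
$headers
# step 5: keep nouveau out of the boot race, then rebuild the initramfs
printf 'blacklist nouveau\noptions nouveau modeset=0\n' | sudo tee /etc/modprobe.d/blacklist-nouveau.conf
$initramfs
# step 7: the first field of vermagic must be the running kernel
modinfo -F vermagic nvidia
# step 8
sudo systemctl reboot
EOF
}

# Usage: recovery_plan debian "$(uname -r)" "$NODE"
```

Printing rather than executing is deliberate: steps 4 and 6 sit between these commands and depend on which root cause the Pre-checks found, so the plan is a checklist to paste from, not a script to pipe into a shell.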
Verification
The node is fixed only when the GPUs enumerate, the module matches the kernel, and a real workload-facing proof passes, not just "nvidia-smi printed something".
# 1. All GPUs enumerate, on the new driver:
ssh "$NODE" nvidia-smi --query-gpu=index,name,driver_version,pstate --format=csv,noheader
# 2. The loaded module was built for THIS kernel (the thing that broke):
ssh "$NODE" 'modinfo -F vermagic nvidia | grep "$(uname -r)"' # non-empty == match
# 3. nouveau is gone; nvidia is loaded:
ssh "$NODE" "lsmod | grep -E 'nvidia|nouveau'" # nvidia present, nouveau absent
# 4. No driver errors in the ring buffer:
ssh "$NODE" "dmesg | grep -iE 'NVRM|Xid|Key was rejected' || echo clean"
# 5. Re-arm + confirm PERSISTENCE so the GPUs hold context (see persistence-mode):
ssh "$NODE" systemctl is-active nvidia-persistenced # active
ssh "$NODE" nvidia-smi --query-gpu=persistence_mode --format=csv,noheader # Enabled
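Check 2 uses a substring grep, which is deliberately quick but loose: a kernel release that is a prefix of another would also match. If you want an exact comparison, a sketch like the following compares only the first field of vermagic against the running kernel; the function name is hypothetical.

```shell
# Hypothetical helper: succeed only when the kernel field of vermagic equals the
# running kernel exactly. vermagic looks like "<kernel> SMP preempt mod_unload ...".
vermagic_matches() {
  local vermagic="$1" kernel="$2"
  [ "${vermagic%% *}" = "$kernel" ]   # strip everything after the first space
}

# Usage on the node:
# vermagic_matches "$(modinfo -F vermagic nvidia)" "$(uname -r)" && echo match
```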
Then a real hardware proof, in increasing strength. At least one must pass before uncordon:
# DCGM diagnostic (run level 2 = medium; level 3 adds the long stress/NCCL pass):
ssh "$NODE" 'dcgmi diag -r 2' # must contain no "Fail"
# A short NCCL collective confirms the CUDA path end-to-end (single node here):
ssh "$NODE" 'all_reduce_perf -b 8 -e 256M -f 2 -g <NUM_GPUS>' # busbw plausible, no error
dcgmi diag exercising compute is the authoritative "the driver works" signal; nvidia-smi alone only proves the module loaded (diagnostics tools, fabric bring-up & benchmarking). On NVSwitch systems also confirm Fabric Manager came back (systemctl is-active nvidia-fabricmanager) before returning the node.
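The "must contain no Fail" rule can be made a hard gate in front of uncordon. This is a minimal sketch, assuming the diagnostic prints a per-test result word such as Pass or Fail in its output; the function name is hypothetical, and empty output is treated as a failure so that a dead ssh session cannot pass the gate.

```shell
# Hypothetical gate: read `dcgmi diag` output on stdin; succeed only if the
# output is non-empty and no line contains the whole word "Fail".
diag_clean() {
  local out
  out=$(cat)
  [ -n "$out" ] || return 1
  ! printf '%s\n' "$out" | grep -qw 'Fail'
}

# Usage:
# ssh "$NODE" 'dcgmi diag -r 2' | diag_clean && kubectl uncordon "$NODE"
```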
Return to service:
kubectl uncordon "$NODE"
kubectl describe node "$NODE" | grep -E 'nvidia.com/gpu' # GPUs re-advertised
Rollback
The change here is additive (rebuild a module), so the fast rollback is to boot the previous, known-good kernel (the one whose DKMS module is still present) and recover capacity while you debug the new-kernel build offline.
# List installed kernels and the matching DKMS modules:
ssh "$NODE" 'dkms status' # find the kernel that still shows "installed"
# Boot the prior kernel ONCE (Debian/Ubuntu; takes the GRUB submenu path and needs GRUB_DEFAULT=saved in /etc/default/grub):
ssh "$NODE" 'sudo grub-reboot "Advanced options for Ubuntu>Ubuntu, with Linux <PRIOR-KERNEL>"'
ssh "$NODE" sudo reboot
# RHEL-family equivalent: sudo grubby --set-default "/boot/vmlinuz-<PRIOR-KERNEL>" && reboot
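Choosing <PRIOR-KERNEL> by eye from dkms status is error-prone on nodes that carry several kernels. The sketch below lists the kernels, other than the running one, that still have an installed NVIDIA module. It assumes the module is registered as nvidia or nvidia-open and that the state is the last field of each line; the function name is hypothetical.

```shell
# Hypothetical helper: list kernels (excluding the running one) whose NVIDIA DKMS
# module is in state "installed". Reads `dkms status` text on stdin.
prior_kernels_with_nvidia() {
  local running="$1"
  awk -F'[,:] *' -v run="$running" '
    { sub("/", ", ") }                               # normalize name/VER to name, VER
    $1 ~ /^nvidia(-open)?$/ && $NF == "installed" && $3 != run { print $3 }
  ' | sort -Vu                                       # version order, newest last
}

# Usage on the node:
# dkms status | prior_kernels_with_nvidia "$(uname -r)"
```

The last line of output is the newest candidate; confirm that /boot/vmlinuz-<kernel> still exists for it before pointing the bootloader there, since the kernel image may have been autoremoved while the DKMS entry lingered.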
After it comes up on the old kernel, re-run the Verification block; if green, kubectl uncordon "$NODE" to restore capacity. This is a holding action: grub-reboot applies to one boot only, so the next reboot returns to the still-broken new kernel, while grubby --set-default pins the old kernel until you reset it. Either way, still fix the DKMS build (Procedure steps 3–4) and clear any pin.
If the GPUs do not come back on the prior kernel either, the fault is not the kernel/DKMS rebuild; treat it as hardware/firmware and divert: GSP/driver mismatch → runbook: GSP firmware mismatch; suspected hardware → the GPU-fault runbook.
Permanent prevention (do this so the ticket does not recur): keep linux-headers/kernel-devel pinned alongside the kernel, hold unattended kernel upgrades on GPU nodes, and confirm dkms autoinstall runs in the post-kernel hook, covered in install lifecycle.
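One way to catch the missing-headers case before the next reboot is to check every installed kernel for a resolvable build directory. This sketch assumes that /lib/modules/<kernel>/build exists (and is not a dangling symlink) only when that kernel's headers package is installed, which is the usual layout on Debian/Ubuntu and RHEL-family systems; the function name is hypothetical.

```shell
# Hypothetical preflight: print each kernel under a modules root whose "build"
# entry is missing or dangling, i.e. DKMS would have no headers to build against.
kernels_missing_headers() {
  local root="${1:-/lib/modules}" dir
  for dir in "$root"/*/; do
    [ -d "$dir" ] || continue                  # glob matched nothing
    [ -e "${dir}build" ] || basename "$dir"    # -e follows symlinks; dangling == missing
  done
}

# Usage on the node (empty output means every kernel has headers):
# kernels_missing_headers
```

Run from a periodic check or the post-kernel hook, any output is a node that will lose its GPUs on the next boot into the listed kernel.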
Related runbooks
- runbook: driver upgrade: planned, fleet-wide driver/CUDA roll (this runbook is the unplanned single-node recovery of the same stack).
- runbook: GSP firmware mismatch: modules load but against the wrong GSP blob; same "no GPU after reboot" symptom, different fix.
- runbook: persistence mode: persistence/nvidia-persistenced not enabled after recovery.
- the GPU-fault runbook: escalation when the GPU is absent even on the prior kernel (hardware path).
- operational runbooks: runbook index and the shared trigger→verify→rollback shape.
References
- NVIDIA Driver Installation Guide, Kernel Modules (flavors, DKMS, nouveau, Secure Boot): https://docs.nvidia.com/datacenter/tesla/driver-installation-guide/kernel-modules.html
- NVIDIA Driver Installation Guide, Advanced Options (runfile -m=kernel-open, flavor switching): https://docs.nvidia.com/datacenter/tesla/driver-installation-guide/advanced-options.html
- NVIDIA Linux x86_64 driver README, nvidia-installer (--uninstall, module signing), DKMS, nouveau: https://download.nvidia.com/XFree86/Linux-x86_64/580.105.08/README/installdriver.html
- NVIDIA runfile README, Open Linux Kernel Modules (-m=kernel/-m=kernel-open): https://download.nvidia.com/XFree86/Linux-x86_64/580.105.08/README/kernel_open.html
- DKMS manual (dkms status, autoinstall, install, remove, framework.conf signing): https://github.com/dell/dkms
- mokutil manual (--sb-state, --import, --list-enrolled for Secure Boot / MOK): https://manpages.ubuntu.com/manpages/noble/man1/mokutil.1.html
- Ubuntu Server, install NVIDIA drivers (ubuntu-drivers, prebuilt signed under Secure Boot): https://ubuntu.com/server/docs/how-to/graphics/install-nvidia-drivers/
- DCGM diagnostics (run levels for the verification proof): https://docs.nvidia.com/datacenter/dcgm/latest/user-guide/dcgm-diagnostics.html
- nccl-tests (all_reduce_perf): https://github.com/NVIDIA/nccl-tests
- kubectl drain (safely drain a node): https://kubernetes.io/docs/tasks/administer-cluster/safely-drain-node/
Related: Kernel Modules · Install Lifecycle · Glossary