Runbook: kernel upgrade, GPU missing

Scope: a single node that booted into a new kernel and now shows no GPUs (nvidia-smi: No devices were found). The cause is one of three: the NVIDIA kernel module was never rebuilt for the running kernel, nouveau won the boot race, or Secure Boot rejected a module that is unsigned or signed with a key that is not enrolled. Recover the node and re-arm DKMS so the next kernel bump is a non-event. This is a single-node procedure; run it behind cordon/drain.

Reference templates, not hardware-tested. Pin exact versions and package names to your adopted driver branch and validate on a canary before fleet use (runbook: driver upgrade).

The kernel-space layer this runbook repairs (the five nvidia* modules, the open-vs-proprietary flavor, DKMS, nouveau blacklisting, MOK signing) is described in kernel modules; the package-level install/upgrade mechanics are in install lifecycle. This page is the "kernel changed, GPUs vanished" procedure those pages point at.

Trigger

After an unattended-upgrade, apt upgrade/dnf update, or an image roll bumped the kernel, the node reboots and nvidia-smi returns No devices were found (or NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver). The GPU hardware is on the PCIe bus (lspci still lists it); only the driver is gone.
The GPU device-plugin stops advertising nvidia.com/gpu; the node drops to zero allocatable GPUs and the scheduler routes around it (Kubernetes GPU).
DCGM exporter / node GPU metrics flatline for the node (telemetry and monitoring).

Two distinct root causes hide behind the same symptom, and the pre-checks disambiguate them: (a) DKMS did not build the module for the new kernel (missing headers or a build failure), so there is no nvidia.ko for uname -r; (b) the module exists but cannot load because nouveau claimed the device first, or Secure Boot rejected an unsigned module (dmesg: Key was rejected by service). A third, adjacent failure (modules load against the wrong GSP firmware blob) presents similarly but is repaired in runbook: GSP firmware mismatch, not here.

Pre-checks

Run these read-only first; they tell you which of the two root causes you have and therefore which branch of the Procedure to take. None mutate the node.

NODE=gpu-07.dc1.internal     # set to the affected node

# 1. Which kernel is actually running now?
uname -r                                    # e.g. 6.8.0-51-generic

# 2. Is there a DKMS-built NVIDIA module for THAT kernel?
dkms status                                 # want: nvidia/<VER>, <running-kernel>, x86_64: installed
# states: "added" (registered, not built) / "built" (compiled, not installed) / "installed" (in place)

# 3. What is loaded right now — nvidia, or nouveau, or neither?
lsmod | grep -E 'nvidia|nouveau'            # nouveau present == it won the boot race

# 4. Is the GPU even on the bus? (rules out a hardware/seating fault)
lspci | grep -i nvidia                      # expect one line per GPU

# 5. Boot args — did someone pass nomodeset / rd.driver.blacklist, etc.?
cat /proc/cmdline

# 6. Secure Boot on? (decides whether modules must be signed)
mokutil --sb-state                          # "SecureBoot enabled" / "disabled"

# 7. Why did the module fail to load? (the authoritative line)
dmesg | grep -iE 'nvidia|nouveau|NVRM|taint|Key was rejected'

Read the signal:

dkms status lists no entry for the running kernel, or shows added/built but not installed → root cause (a), a missing/failed DKMS build. Confirm the matching headers are present: dpkg -l | grep "linux-headers-$(uname -r)" (Debian/Ubuntu) or rpm -q "kernel-devel-$(uname -r)" (RHEL-family). A DKMS build silently no-ops without them.
dkms status shows installed for the running kernel but lsmod has nouveau (and no nvidia) → root cause (b), the nouveau race; blacklist and rebuild initramfs.
installed + Secure Boot enabled + dmesg shows Key was rejected by service → root cause (b), an unsigned/unenrolled module; (re-)sign via the MOK.
/proc/cmdline contains nomodeset or rd.driver.blacklist=nvidia, or lacks a nouveau.modeset=0 that you rely on → a boot-arg problem; fix the bootloader entry, not the driver.
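The decision table above can be sketched as a small shell function that takes the captured pre-check output and names the branch to follow. This is a minimal sketch, not part of any NVIDIA or DKMS tooling: `classify_gpu_loss` and its output labels are hypothetical, and it assumes `dkms status` prints the kernel release between commas, as in the example line in the Pre-checks.

```shell
# classify_gpu_loss KERNEL DKMS_STATUS LSMOD SB_STATE DMESG
# Hypothetical helper: every argument is text captured from the read-only
# pre-checks, so the function itself never touches the node.
classify_gpu_loss() {
    kernel=$1 dkms_out=$2 lsmod_out=$3 sb_out=$4 dmesg_out=$5

    # (a) no nvidia entry for this kernel in state "installed"
    if ! printf '%s\n' "$dkms_out" \
        | grep -E '^nvidia' | grep -F -- ", $kernel," | grep -q 'installed'; then
        echo "a: dkms-build"
        return 0
    fi

    # (b) module is installed, but nouveau is loaded and nvidia is not
    if printf '%s\n' "$lsmod_out" | grep -q '^nouveau' \
        && ! printf '%s\n' "$lsmod_out" | grep -q '^nvidia'; then
        echo "b: nouveau-race"
        return 0
    fi

    # (b) Secure Boot is on and the kernel rejected the module signature
    if printf '%s\n' "$sb_out" | grep -qi 'SecureBoot enabled' \
        && printf '%s\n' "$dmesg_out" | grep -q 'Key was rejected by service'; then
        echo "b: secure-boot-signature"
        return 0
    fi

    # Nothing matched: boot args or an adjacent failure need a manual look.
    echo "unclassified"
}
```

On the node it could be called as `classify_gpu_loss "$(uname -r)" "$(dkms status)" "$(lsmod)" "$(mokutil --sb-state)" "$(dmesg)"`. An `unclassified` result means the boot arguments and the full dmesg output still need to be read by hand.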

Check which driver version is still installed (so step 4 of the Procedure pins the right one):

# Debian/Ubuntu
dpkg -l | grep -E 'nvidia-(driver|dkms)'
# RHEL-family
rpm -qa | grep -E 'nvidia|kmod-nvidia'
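Where DKMS already has the module registered, the `<VER>` placeholder can also be read from `dkms status` rather than from the package list. The sketch below assumes two output shapes: `nvidia/<VER>, <kernel>, ...` (as shown in the Pre-checks) and the older `nvidia, <VER>, <kernel>, ...` form printed by earlier DKMS releases. `nvidia_dkms_versions` is a hypothetical helper name.

```shell
# nvidia_dkms_versions: reads `dkms status` text on stdin and prints each
# distinct nvidia module version once, one per line.
nvidia_dkms_versions() {
    # Accept both "nvidia/<VER>, ..." and the older "nvidia, <VER>, ..." layout.
    sed -n -E 's#^nvidia[/,] *([0-9][0-9.]*).*#\1#p' | sort -u
}
```

Usage: `VER=$(dkms status | nvidia_dkms_versions)`. If this prints more than one line, several driver versions are registered with DKMS; settle which one is the adopted branch before pinning it in the rebuild step.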

Procedure

Cordon and drain before touching modules or rebooting: the rebuild may need a second reboot, and you do not want the scheduler placing work on a half-fixed node. NODE is the Kubernetes node name (Slurm equivalent: scontrol update nodename=<n> state=drain reason="kernel/driver rebuild").

1. Cordon so nothing new lands on the node:
```
kubectl cordon "$NODE"
```
2. Drain running pods, keeping DaemonSets (GPU operator / device-plugin / DCGM exporter):
```
kubectl drain "$NODE" --ignore-daemonsets --delete-emptydir-data --timeout=15m
```
From here, run steps 3–8 on the node (ssh "$NODE", as root). It still has CPU/console even with the GPU driver down.

3. Install the matching kernel headers (the build is a no-op without them; this is the single most common reason dkms autoinstall "succeeds" but builds nothing):

# Debian / Ubuntu
sudo apt-get update && sudo apt-get install -y "linux-headers-$(uname -r)"
# RHEL / Rocky
sudo dnf install -y "kernel-devel-$(uname -r)" "kernel-headers-$(uname -r)"
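A quick guard before the rebuild is to confirm that the headers landed where the module build expects them. The sketch below assumes the usual layout in which `/lib/modules/<kernel>/build` is a directory or a symlink into the headers tree; `headers_present` is a hypothetical helper, and the optional second argument exists only so it can be pointed at a different root.

```shell
# headers_present KERNEL [MODULES_ROOT]
# Succeeds only when <root>/<kernel>/build resolves to a real directory.
# `[ -d ]` follows symlinks, so a dangling "build" link (headers package
# removed or never installed) counts as missing.
headers_present() {
    kernel=$1
    root=${2:-/lib/modules}
    [ -d "$root/$kernel/build" ]
}
```

Usage: `headers_present "$(uname -r)" || echo "headers missing for $(uname -r)"`. A pass here says only that the tree exists, not that the DKMS build will succeed.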

4. Rebuild and install the NVIDIA module for the running kernel. Prefer DKMS; it is what the -dkms/--dkms packages register and what keeps future upgrades automatic (kernel modules).
```
sudo dkms autoinstall                       # build+install every registered module for `uname -r`
# or pin the exact module explicitly (VER from the Pre-checks):
sudo dkms install "nvidia/<VER>" -k "$(uname -r)"
```
If DKMS is not registered or the build is broken, reinstall the driver package, which re-triggers the DKMS hook:
```
# Debian / Ubuntu (pin <BRANCH> to your adopted branch; -open is supported on Turing and newer, required on Blackwell and Grace Hopper)
sudo apt-get install --reinstall -y "nvidia-driver-<BRANCH>-open"
# RHEL / Rocky
sudo dnf reinstall -y "nvidia-driver" "kmod-nvidia-open-dkms"
```
.run-installer nodes instead re-run the installer against the running kernel (it re-registers DKMS with --dkms):
```
sudo sh ./NVIDIA-Linux-x86_64-<VERSION>.run -m=kernel-open --dkms --silent
```
If a stale half-built tree is fighting you, remove it before rebuilding: sudo dkms remove "nvidia/<VER>" --all (package path) or sudo nvidia-installer --uninstall (.run path), then redo step 4.

5. Blacklist nouveau if the Pre-checks showed it loaded (root cause b). Datacenter packages usually do this, but a kernel/initramfs change can drop the config:

printf 'blacklist nouveau\noptions nouveau modeset=0\n' \
  | sudo tee /etc/modprobe.d/blacklist-nouveau.conf
# rebuild the initramfs so the blacklist applies at next boot (run only the line for your distro):
sudo update-initramfs -u            # Debian / Ubuntu
sudo dracut --force                 # RHEL / Rocky
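Because a kernel or initramfs change can drop the blacklist, it is worth checking that an active (uncommented) `blacklist nouveau` line is actually on disk. The sketch below is a hypothetical helper, `nouveau_blacklisted`; the default directory list is an assumption about where modprobe configuration commonly lives.

```shell
# nouveau_blacklisted [DIR...]
# Succeeds when any file under the given directories holds an uncommented
# "blacklist nouveau" line. With no arguments it searches the usual
# modprobe.d locations (assumed defaults; adjust for your image).
nouveau_blacklisted() {
    [ "$#" -gt 0 ] || set -- /etc/modprobe.d /usr/lib/modprobe.d /lib/modprobe.d
    # -s hides "no such directory" noise; -q stops at the first match.
    grep -rEqs '^[[:space:]]*blacklist[[:space:]]+nouveau([[:space:]]|$)' "$@"
}
```

This checks the configuration on disk only. Whether the rebuilt initramfs carries the same file is a separate question; `lsinitramfs` (Debian/Ubuntu) or `lsinitrd` (RHEL-family) can list its contents.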

6. Re-sign for Secure Boot if mokutil --sb-state reported enabled. A freshly DKMS-built module is unsigned unless DKMS is configured to sign it; an unsigned module under Secure Boot fails to load and looks identical to "GPU missing". Configure DKMS to sign every rebuild, then confirm the key is enrolled:

# one-time: generate/enroll a Machine Owner Key, set a one-time password, enroll at next boot
sudo mokutil --import /var/lib/dkms/mok.pub
# point DKMS at the signing key+cert so EVERY future rebuild is signed automatically:
#   in /etc/dkms/framework.conf  ->  mok_signing_key="..."  and  mok_certificate="..."
# verify the just-built module actually carries a signature:
modinfo nvidia | grep -i 'sig'      # expect sig_id / signer / sig_key fields populated
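mokutil --list-new                                 # before the reboot: the imported key is still pending enrollment
mokutil --list-enrolled | grep -i 'DKMS\|nvidia'   # after the MOK Manager reboot: the key is enrolled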

On Ubuntu, the prebuilt signed packages from ubuntu-drivers avoid manual MOK handling entirely; prefer them when available (kernel modules, install lifecycle).
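The signature check in this step can be scripted by reading the `signer` field from `modinfo` output. This is a minimal sketch with a hypothetical helper name, `module_signer`; it assumes an unsigned module has no `signer:` line or an empty one.

```shell
# module_signer: reads `modinfo <module>` output on stdin.
# Prints the signer name and succeeds if the module carries a signature;
# prints nothing and fails if the signer field is absent or empty.
module_signer() {
    signer=$(sed -n -E 's/^signer:[[:space:]]*(.+)$/\1/p' | head -n 1)
    [ -n "$signer" ] && printf '%s\n' "$signer"
}
```

Usage: `modinfo nvidia | module_signer || echo "nvidia.ko is unsigned"`. A populated signer proves only that the module was signed; the signing key must still appear in `mokutil --list-enrolled` for Secure Boot to accept it.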

7. Confirm the build before rebooting; a check here is cheaper than a failed reboot. The module's vermagic must match the running kernel:

dkms status | grep "$(uname -r)"                  # nvidia/<VER>, <running-kernel>, x86_64: installed
modinfo -F vermagic nvidia                         # leading token must equal `uname -r`
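The "leading token must equal `uname -r`" rule can be made an exact comparison rather than an eyeball check. A minimal sketch follows; `vermagic_matches` is a hypothetical helper name.

```shell
# vermagic_matches VERMAGIC KERNEL
# vermagic looks like "<kernel-release> SMP preempt mod_unload ...".
# Strip everything from the first space onward and compare exactly, so a
# release that merely contains the running one as a substring does not pass.
vermagic_matches() {
    first=${1%% *}
    [ "$first" = "$2" ]
}
```

Usage: `vermagic_matches "$(modinfo -F vermagic nvidia)" "$(uname -r)" || echo "module was built for a different kernel"`.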

8. Reboot to load the freshly built module cleanly (and to pass the MOK-enrollment screen if step 6 ran):
```
sudo reboot
```
On Secure Boot nodes the first reboot after mokutil --import shows the blue MOK Manager: choose Enroll MOK and enter the one-time password, or the new module still will not load.

Verification

The node is fixed only when the GPUs enumerate, the module matches the kernel, and a real workload-facing proof passes, not just "nvidia-smi printed something".

# 1. All GPUs enumerate, on the expected driver version:
ssh "$NODE" nvidia-smi --query-gpu=index,name,driver_version,pstate --format=csv,noheader

# 2. The loaded module was built for THIS kernel (the thing that broke):
ssh "$NODE" 'modinfo -F vermagic nvidia | grep "$(uname -r)"'   # non-empty == match

# 3. nouveau is gone; nvidia is loaded:
ssh "$NODE" "lsmod | grep -E 'nvidia|nouveau'"                  # nvidia present, nouveau absent

# 4. No driver errors in the ring buffer (the NVRM module-load banner is normal, so plain NVRM lines are not matched):
ssh "$NODE" "dmesg | grep -iE 'Xid|Key was rejected|NVRM:.*(fail|error)' || echo clean"

# 5. Confirm persistence mode is re-armed so the GPUs hold context (see persistence-mode):
ssh "$NODE" systemctl is-active nvidia-persistenced            # active
ssh "$NODE" nvidia-smi --query-gpu=persistence_mode --format=csv,noheader   # Enabled

Then a real hardware proof, in increasing strength. At least one must pass before uncordon:

# DCGM diagnostic (run level 2 = medium; level 3 = long, adds the extended stress tests):
ssh "$NODE" 'dcgmi diag -r 2'        # must contain no "Fail"
# A short NCCL collective confirms the CUDA path end-to-end (single node here):
ssh "$NODE" 'all_reduce_perf -b 8 -e 256M -f 2 -g <NUM_GPUS>'   # busbw plausible, no error

dcgmi diag exercising compute is the authoritative "the driver works" signal; nvidia-smi alone only proves the module loaded (diagnostics tools, fabric bring-up & benchmarking). On NVSwitch systems also confirm Fabric Manager came back (systemctl is-active nvidia-fabricmanager) before returning the node.
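The "must contain no Fail" rule can gate the uncordon mechanically. The sketch below is a hypothetical helper, `diag_clean`. It treats empty output as a failure, on the assumption that a diagnostic which printed nothing (for example because the ssh session dropped) has proved nothing.

```shell
# diag_clean: reads diagnostic output on stdin.
# Succeeds only if there is some output and no line contains "Fail".
diag_clean() {
    out=$(cat)
    [ -n "$out" ] || return 1          # empty output is not a pass
    ! printf '%s\n' "$out" | grep -q 'Fail'
}
```

Usage: `ssh "$NODE" 'dcgmi diag -r 2' | diag_clean && kubectl uncordon "$NODE"`. The exit status of the ssh command is lost in the pipeline; the non-empty check covers only the case where nothing came back.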

Return to service:

kubectl uncordon "$NODE"
kubectl describe node "$NODE" | grep -E 'nvidia.com/gpu'       # GPUs re-advertised

Rollback

The change here is additive (rebuild a module), so the fast rollback is to boot the previous, known-good kernel (the one whose DKMS module is still present) and recover capacity while you debug the new-kernel build offline.

# List installed kernels and the matching DKMS modules:
ssh "$NODE" 'dkms status'                         # find the kernel that still shows "installed"

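# Boot the prior kernel ONCE (Debian/Ubuntu; the argument is the GRUB menu-entry path,
# and grub-reboot only takes effect when GRUB_DEFAULT=saved is set in /etc/default/grub):
ssh "$NODE" 'sudo grub-reboot "Advanced options for Ubuntu>Ubuntu, with Linux <PRIOR-KERNEL>"'
ssh "$NODE" sudo reboot
# RHEL-family: grubby changes the default persistently, not for a single boot:
#   sudo grubby --set-default "/boot/vmlinuz-<PRIOR-KERNEL>" && sudo reboot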

After it comes up on the old kernel, re-run the Verification block; if green, kubectl uncordon "$NODE" to restore capacity. This is a holding action. With grub-reboot the override lasts one boot, so the node breaks again on its next reboot into the new kernel; with grubby --set-default the node stays pinned to the old kernel until you reset the default. Either way, still fix the DKMS build (Procedure steps 3–4) and clear any pin.
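Choosing `<PRIOR-KERNEL>` can be sketched from the same `dkms status` output: take the newest kernel other than the running one that still has an installed nvidia module. `prior_good_kernel` is a hypothetical helper. It assumes the kernel release is the second-to-last comma-separated field of each status line, and that the highest version is the right candidate, which is a heuristic and not a guarantee that the kernel was known-good.

```shell
# prior_good_kernel RUNNING_KERNEL   (reads `dkms status` text on stdin)
# Prints the highest-versioned kernel, other than the running one, that has
# an nvidia module in state "installed". Fails if there is none.
prior_good_kernel() {
    running=$1
    prior=$(grep -E '^nvidia' | grep 'installed' \
        | awk -F', *' 'NF >= 3 { print $(NF-1) }' \
        | grep -v -x -F -- "$running" \
        | sort -V | tail -n 1)
    [ -n "$prior" ] && printf '%s\n' "$prior"
}
```

Usage: `PRIOR=$(dkms status | prior_good_kernel "$(uname -r)")`. Confirm that `/boot/vmlinuz-$PRIOR` still exists before rebooting; DKMS can list a module for a kernel whose image has since been removed.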

If the GPUs do not come back on the prior kernel either, the fault is not the kernel/DKMS rebuild; treat it as hardware/firmware and divert: GSP/driver mismatch → runbook: GSP firmware mismatch; suspected hardware → the GPU-fault runbook.

Permanent prevention (do this so the ticket does not recur): keep linux-headers/kernel-devel pinned alongside the kernel, hold unattended kernel upgrades on GPU nodes, and confirm dkms autoinstall runs in the post-kernel hook, covered in install lifecycle.

runbook: driver upgrade: planned, fleet-wide driver/CUDA roll (this runbook is the unplanned single-node recovery of the same stack).
runbook: GSP firmware mismatch: modules load but against the wrong GSP blob; same "no GPU after reboot" symptom, different fix.
runbook: persistence mode: persistence/nvidia-persistenced not enabled after recovery.
the GPU-fault runbook: escalation when the GPU is absent even on the prior kernel (hardware path).
operational runbooks: runbook index and the shared trigger→verify→rollback shape.

References

NVIDIA Driver Installation Guide — Kernel Modules (flavors, DKMS, nouveau, Secure Boot): https://docs.nvidia.com/datacenter/tesla/driver-installation-guide/kernel-modules.html
NVIDIA Driver Installation Guide — Advanced Options (runfile -m=kernel-open, flavor switching): https://docs.nvidia.com/datacenter/tesla/driver-installation-guide/advanced-options.html
NVIDIA Linux x86_64 driver README — nvidia-installer (--uninstall, module signing), DKMS, nouveau: https://download.nvidia.com/XFree86/Linux-x86_64/580.105.08/README/installdriver.html
NVIDIA runfile README — Open Linux Kernel Modules (-m=kernel / -m=kernel-open): https://download.nvidia.com/XFree86/Linux-x86_64/580.105.08/README/kernel_open.html
DKMS manual (dkms status, autoinstall, install, remove, framework.conf signing): https://github.com/dell/dkms
mokutil manual (--sb-state, --import, --list-enrolled for Secure Boot / MOK): https://manpages.ubuntu.com/manpages/noble/man1/mokutil.1.html
Ubuntu Server — install NVIDIA drivers (ubuntu-drivers, prebuilt signed under Secure Boot): https://ubuntu.com/server/docs/how-to/graphics/install-nvidia-drivers/
DCGM diagnostics (run levels for the verification proof): https://docs.nvidia.com/datacenter/dcgm/latest/user-guide/dcgm-diagnostics.html
nccl-tests (all_reduce_perf): https://github.com/NVIDIA/nccl-tests
kubectl drain (safely drain a node): https://kubernetes.io/docs/tasks/administer-cluster/safely-drain-node/

Related: Kernel Modules · Install Lifecycle · Glossary