Driver install & lifecycle

Scope: how the NVIDIA driver and surrounding stack get onto a GPU node and stay correct over its life. Covers apt network repo vs runfile, open vs proprietary kernel modules, DKMS, Secure Boot signing, reboot/validate, kernel upgrades, and fleet rollout behind cordon/drain.

All command snippets below are reference templates, not hardware-tested. Pin the exact versions for your target GPU and driver branch against the cited NVIDIA docs before running anything in production.

What it is

Installing a single GPU node means landing a stack of versioned pieces that must agree: GSP firmware (ships inside the driver), the kernel modules (nvidia, nvidia-uvm, nvidia-modeset, nvidia-peermem), the CUDA driver (libcuda.so, versioned with the driver), and on NVSwitch systems a branch-matched Fabric Manager. The stack itself is described in the GPU software stack; this page covers only getting it installed and keeping it consistent across reboots, kernel bumps, and fleet-wide branch moves.

Two install mechanisms exist, and you should pick one and stick to it per fleet:

The apt network repository (cuda-keyring plus the NVIDIA CUDA repo) is declarative: the package manager owns the modules, and the install works with apt holds and pins under unattended-upgrade. This is the default for managed fleets.
The .run runfile installer is a self-contained binary that builds and installs the modules directly. It is useful for air-gapped or odd-distro nodes, but it sits outside the package database, so kernel-upgrade and removal hygiene is on you.

On any node you also choose a kernel module flavor, open (nvidia-open) or proprietary (cuda-drivers), and that choice is constrained by the GPU architecture (see below).

Why it's needed (and when)

A GPU node does nothing until the driver loads cleanly and matches the silicon, the kernel, and (on NVSwitch boxes) Fabric Manager. Most "GPU disappeared" tickets are install/lifecycle failures, not dead hardware: a kernel upgrade with no module rebuild, a GSP-firmware/driver mismatch from a partial install, persistence mode left off, Secure Boot rejecting an unsigned module, or version drift across the fleet. When a GPU is missing after boot, work the kernel/GPU-missing runbook.

You touch this path when:

Provisioning a new node (Driver Install and Lifecycle is the bring-up step; automation lives in Ansible bring-up).
Adopting a new driver branch fleet-wide, driven as a rolling change in the driver-upgrade runbook.
A kernel update lands (security or LTS HWE), which forces a module rebuild.
A framework or CUDA toolkit requirement moves the minimum driver (Frameworks).

Branch selection (LTS vs Production), CUDA forward-compat, and per-tier driver rules are out of scope here. See driver versions & branches and driver by tier. Pick one branch and hold the whole fleet on it.

How it's installed & managed

apt network repository (recommended)

The canonical sequence for Ubuntu installs the cuda-keyring package, which wires up the signed NVIDIA CUDA apt repository. $distro is the dotless release id (ubuntu2204, ubuntu2404); the arch path is x86_64 for amd64 and sbsa for arm64/Grace.¹

# Kernel headers must be present before the driver builds its modules
sudo apt-get install -y linux-headers-$(uname -r)

# On hosts previously set up with the pre-2022 apt key, remove the stale key first
sudo apt-key del 7fa2af80 || true

# Install the NVIDIA CUDA repo keyring (amd64 shown; use sbsa for arm64)
distro=ubuntu2404
wget https://developer.download.nvidia.com/compute/cuda/repos/${distro}/x86_64/cuda-keyring_1.1-1_all.deb
sudo dpkg -i cuda-keyring_1.1-1_all.deb
sudo apt-get update

The apt-key del 7fa2af80 step is NVIDIA's documented fix for hosts carrying the rotated-out GPG key; skip it on fresh nodes.² The keyring package installs the current signing key to /usr/share/keyrings/cuda-archive-keyring.gpg.¹
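The URL pieces described above ($distro and the arch directory) can be derived rather than typed by hand. The following is a minimal sketch, not NVIDIA's tooling: the function names are hypothetical, and KEYRING_VERSION simply mirrors the file name used in the snippet above, so confirm it against the install guide before use.

```shell
#!/bin/sh
# Assumption: mirrors the keyring file name shown above; verify against the install guide.
KEYRING_VERSION="1.1-1"

# ubuntu + 24.04 -> ubuntu2404 (the dotless release id)
repo_distro() {
    # $1 = ID from /etc/os-release, $2 = VERSION_ID from /etc/os-release
    printf '%s%s\n' "$1" "$(printf '%s' "$2" | tr -d '.')"
}

# Debian architecture name -> NVIDIA repo arch directory
repo_arch() {
    case "$1" in
        amd64) echo "x86_64" ;;
        arm64) echo "sbsa" ;;
        *) echo "unsupported architecture: $1" >&2; return 1 ;;
    esac
}

# Print the keyring download URL for a distro id and Debian architecture
cuda_keyring_url() {
    # $1 = distro id (e.g. ubuntu2404), $2 = Debian arch (amd64 | arm64)
    arch_dir=$(repo_arch "$2") || return 1
    printf 'https://developer.download.nvidia.com/compute/cuda/repos/%s/%s/cuda-keyring_%s_all.deb\n' \
        "$1" "$arch_dir" "$KEYRING_VERSION"
}

# Usage on a real node (not run here):
#   . /etc/os-release
#   wget "$(cuda_keyring_url "$(repo_distro "$ID" "$VERSION_ID")" "$(dpkg --print-architecture)")"
```

Failing on an unknown architecture, instead of defaulting to x86_64, keeps a mis-detected node from silently pulling the wrong repo.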

Then install one driver flavor:

# Open kernel modules (Turing and newer) -- the modern default
sudo apt-get install -y nvidia-open

# Proprietary kernel modules (Maxwell/Pascal/Volta, or where mandated)
sudo apt-get install -y cuda-drivers

nvidia-open and cuda-drivers are the metapackages NVIDIA documents for the two flavors; both pull DKMS so the modules rebuild on kernel changes.³ To pin a specific branch rather than track latest, install the branch metapackage and hold it (the install guide ships an nvidia-driver-pinning-<branch> package, and apt pins work too):³

# Example: hold the fleet on a named branch (substitute your adopted branch)
sudo apt-get install -y nvidia-open-580          # branch-pinned open metapackage
sudo apt-mark hold nvidia-open-580 cuda-drivers  # block silent unattended bumps

For compute-only nodes (no display stack) NVIDIA documents a slimmer set; the open variant is:³

sudo apt -V install libnvidia-compute nvidia-dkms-open

On NVSwitch systems (HGX/DGX 8-GPU baseboards, NVL72) install the branch-matched Fabric Manager in the same step. A driver-only bump that strands an incompatible Fabric Manager will not form the NVLink domain (Fabric Manager, nvswitch & NVLink). The package is versioned and the service must be enabled:⁸

sudo apt-get install -y nvidia-fabricmanager-580   # match the driver branch
sudo systemctl enable --now nvidia-fabricmanager

Fabric Manager requires a compatible driver starting at R450; current Blackwell B200/B300 platforms need 570.xx or newer. Verify the minimum for your GPU on the FM compatibility page.⁸
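The branch-match rule above can be checked mechanically before a node is admitted. This is a minimal sketch with hypothetical function names; it compares only the leading branch number, which is an assumption of this example. If the Fabric Manager guide for your release calls for a full version match, compare the complete strings instead.

```shell
#!/bin/sh
# Leading branch number of a driver-style version string: 580.10.01 -> 580
branch_of() {
    printf '%s\n' "${1%%.*}"
}

# Succeeds only if both versions are non-empty and share a branch
fm_matches_driver() {
    # $1 = driver version, $2 = Fabric Manager package version
    [ -n "$1" ] && [ -n "$2" ] && [ "$(branch_of "$1")" = "$(branch_of "$2")" ]
}

# Usage on a real node (not run here; package name follows the example above):
#   drv=$(nvidia-smi --query-gpu=driver_version --format=csv,noheader | head -n1)
#   fm=$(dpkg-query -W -f='${Version}' nvidia-fabricmanager-580)
#   fm_matches_driver "$drv" "$fm" || echo "Fabric Manager / driver branch mismatch" >&2
```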

Open vs proprietary kernel modules

The flavor is constrained by architecture, not preference:

Open (nvidia-open) supports Turing and later GPUs and is the default starting in the R560 driver release.³⁴
Grace Hopper and Blackwell: NVIDIA states you must use the open kernel modules.⁴
Maxwell, Pascal, Volta: the open modules are not compatible; stay on the proprietary driver.³⁴

Open and proprietary are built from the same upstream source; the open flavor is dual-licensed MIT/GPLv2.⁴ More detail in kernel modules and GSP firmware.
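The architecture rules above reduce to a small lookup. The sketch below is illustrative only: the function name and the lower-case architecture labels are assumptions of this example (Ampere, Ada, and Hopper are included as "Turing and later"), and how a node's architecture is detected is left to the caller.

```shell
#!/bin/sh
# Map a GPU architecture label to the metapackage for the flavor it needs.
driver_metapackage_for() {
    case "$1" in
        maxwell|pascal|volta)
            echo "cuda-drivers" ;;   # open modules are not compatible
        turing|ampere|ada|hopper)
            echo "nvidia-open" ;;    # Turing and later: open is supported and the default
        grace-hopper|blackwell)
            echo "nvidia-open" ;;    # open modules are required
        *)
            echo "unknown architecture: $1" >&2
            return 1 ;;
    esac
}

# Usage (not run here):
#   sudo apt-get install -y "$(driver_metapackage_for blackwell)"
```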

The .run runfile installer

For nodes that cannot use the apt repo. Download the runfile for your driver version, then register the build with DKMS and select the module flavor non-interactively. The datacenter install guide uses --kernel-module-type=open|proprietary; the driver README documents the equivalent short form -m=kernel-open.³⁵

# reference template, not hardware-tested -- substitute the exact driver version
sudo sh ./NVIDIA-Linux-x86_64-<DRIVER_VERSION>.run \
  --dkms \
  --kernel-module-type=open \
  --silent

--dkms registers the module with DKMS so it rebuilds on kernel upgrade; --silent (-s) drives an unattended install for automation.⁵ On a Secure Boot host an interactive run of the runfile offers to generate a signing keypair (see below); a --silent run has no prompt to rely on, so arrange module signing beforehand. The runfile does not manage Fabric Manager or the toolkit; install those separately.
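In automation it helps to build the runfile invocation from validated inputs so a mistyped flavor fails early. A minimal sketch, assuming the x86_64 runfile naming shown above; the function name is hypothetical and it only prints the command, it does not run it.

```shell
#!/bin/sh
# Print the runfile invocation for a driver version and module flavor.
runfile_command() {
    # $1 = driver version, $2 = open | proprietary
    [ -n "$1" ] || { echo "driver version is required" >&2; return 1; }
    case "$2" in
        open|proprietary) ;;
        *) echo "flavor must be open or proprietary, got: $2" >&2; return 1 ;;
    esac
    printf 'sh ./NVIDIA-Linux-x86_64-%s.run --dkms --kernel-module-type=%s --silent\n' "$1" "$2"
}

# Usage (not run here): review the printed line, then execute it with sudo.
#   runfile_command "<DRIVER_VERSION>" open
```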

DKMS (kernel rebuilds)

DKMS is what keeps the module alive across kernel upgrades. With the apt packages it is wired automatically; with the runfile you opt in via --dkms. Confirm registration and that the module is built for the running kernel:

dkms status                         # expect: nvidia/<version>, <kernel>, x86_64: installed
sudo dkms autoinstall               # build and install any registered module missing for the current kernel

A kernel bump with no matching rebuild is the classic "no GPU after reboot": the modules built for the old kernel will not load. Keep the matching linux-headers-$(uname -r) installed for every kernel you boot.
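The dkms status check above can be scripted so a pre-reboot gate catches a missing build. This sketch assumes the DKMS module is registered under the name nvidia and that the status line follows one of the two layouts noted in the comment; the function name is hypothetical.

```shell
#!/bin/sh
# Succeeds if the given `dkms status` text shows nvidia installed for the kernel.
# Accepts both "nvidia/<ver>, <kernel>, ..." and the older "nvidia, <ver>, <kernel>, ..." layouts.
dkms_nvidia_installed_for() {
    # $1 = output of `dkms status`, $2 = kernel release (uname -r)
    printf '%s\n' "$1" \
        | grep -E '^nvidia[/,]' \
        | grep -F -- ", $2," \
        | grep -q ': installed'
}

# Usage on a real node (not run here):
#   dkms_nvidia_installed_for "$(dkms status)" "$(uname -r)" \
#       || echo "nvidia module is not built for $(uname -r)" >&2
```

To gate a kernel upgrade, pass the release of the kernel you are about to boot instead of $(uname -r).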

Secure Boot and MOK enrolment

With Secure Boot enabled, the kernel refuses unsigned out-of-tree modules, so the rebuilt nvidia.ko must be signed by a key the firmware trusts. The mechanism is a Machine Owner Key (MOK): a key/cert pair that signs the module and is enrolled into the firmware via mokutil.⁶

DKMS can sign automatically on each rebuild if you point its framework.conf at the key/cert (mok_signing_key, mok_certificate); enrolment of the certificate is the one manual step:⁶

# Enrol the signing certificate; mokutil prompts for a one-time password,
# which MokManager asks for on the NEXT reboot (blue enrolment screen)
sudo mokutil --import /path/to/MOK.der
sudo reboot

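# After enrolment, confirm the key is trusted. The grep only matches if the
# certificate subject mentions "nvidia"; --test-key checks the file directly.
mokutil --list-enrolled | grep -i nvidia || true
mokutil --test-key /path/to/MOK.der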

If the cert is not enrolled, the module simply will not load and the GPU is absent after boot. A Secure Boot failure looks identical to a missing-driver fault, so check mokutil --sb-state and dmesg when chasing it (kernel/GPU-missing runbook). The package installs handle DKMS signing on most distros; the burden is on the runfile and custom-key paths.
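Because a Secure Boot rejection looks like a missing driver, it can help to classify the node from two observations: the Secure Boot state and whether the module carries a signature. A minimal sketch; the function name and verdict strings are hypothetical, and a signature being present does not prove its certificate is enrolled.

```shell
#!/bin/sh
# Classify a node from the Secure Boot state and the module's signer field.
secure_boot_verdict() {
    # $1 = output of `mokutil --sb-state`
    # $2 = output of `modinfo -F signer nvidia` (empty when the module is unsigned)
    case "$1" in
        *"SecureBoot enabled"*)
            if [ -n "$2" ]; then
                echo "signed-check-enrolment"      # signer's cert must still be enrolled
            else
                echo "unsigned-will-be-rejected"
            fi ;;
        *"SecureBoot disabled"*)
            echo "signature-not-required" ;;
        *)
            echo "unknown-sb-state" ;;
    esac
}

# Usage on a real node (not run here):
#   secure_boot_verdict "$(mokutil --sb-state 2>&1)" "$(modinfo -F signer nvidia 2>/dev/null)"
```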

Reboot and persistence

Both flavors require a reboot to load the new modules cleanly. After reboot, turn on persistence so the driver state stays resident and the GPU does not re-initialise and clock down between jobs (persistence mode). NVIDIA documents that the legacy nvidia-smi -pm 1 flag "will be deprecated and removed in favor of the NVIDIA Persistence Daemon," so prefer the daemon (/usr/bin/nvidia-persistenced, shipped with the driver):⁷

sudo systemctl enable --now nvidia-persistenced

Pinning and lifecycle hygiene

Pin the branch. Hold the driver/Fabric Manager packages (apt-mark hold, or an apt pin) so unattended-upgrade does not silently move a node off the fleet branch, a real source of FM/driver mismatch on Ubuntu.³
Do not double-install. If the node is managed by the GPU Operator's driver containers, do not also install a host driver; pick one (container toolkit).
One branch per fleet. Driver / CUDA driver / Fabric Manager / GSP must stay consistent across every node; drift is the root cause behind a large share of fabric and collective faults (reliability & RAS).
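The "pin the branch" rule above is easy to audit: compare the packages you expect to be held against apt-mark showhold. A minimal sketch with a hypothetical function name; the package names in the usage line follow the examples earlier on this page and should be replaced with your adopted branch.

```shell
#!/bin/sh
# Print every required package that is absent from the held set.
missing_holds() {
    # $1 = output of `apt-mark showhold` (one package per line)
    # remaining arguments = packages that must be held
    held=$1
    shift
    for pkg in "$@"; do
        # -x matches the whole line, -F treats the name literally
        printf '%s\n' "$held" | grep -qxF -- "$pkg" || printf '%s\n' "$pkg"
    done
}

# Usage on a real node (not run here):
#   missing_holds "$(apt-mark showhold)" nvidia-open-580 nvidia-fabricmanager-580
```

Empty output means every listed package is held; anything printed is a node that unattended-upgrade could still move.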

Validated usage & tests

After install and reboot, validate on the node. These are reference checks: the expected shape of the output is described, and no numbers are invented.

nvidia-smi

Expect a populated table: every installed GPU listed, a non-ERR! driver version and CUDA-driver version in the header, and persistence mode reading On once the daemon is enabled. An empty list, No devices were found, or Failed to initialize NVML means the module did not load. Go to the kernel/GPU-missing runbook. Confirm the loaded driver and module flavor:

nvidia-smi --query-gpu=name,driver_version,persistence_mode --format=csv,noheader
modinfo nvidia | grep -E '^version|^license'   # license MIT/GPL on the open flavor
lsmod | grep -E '^nvidia'                       # nvidia, nvidia_uvm, nvidia_modeset present

The modinfo license line distinguishes open (dual MIT/GPL) from proprietary; the version must equal the branch you installed. On NVSwitch systems also confirm Fabric Manager and persistence are up before admitting work:

systemctl is-active nvidia-fabricmanager nvidia-persistenced   # expect: active

nvidia-fabricmanager reporting anything but active on an NVSwitch box means the NVLink domain has not formed. Collectives will fall back to PCIe or fail (Fabric Manager, runbook). For a deeper, node-level acceptance pass (PCIe/NVLink, memory bandwidth, NCCL, stress) run the DCGM diagnostics rather than relying on nvidia-smi alone (diagnostics tools, fabric bring-up & benchmarking):

dcgmi diag -r 3

Expect no Fail lines. Treat a node that fails this on a known-good branch as suspect hardware, not an install problem.
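The nvidia-smi CSV query shown earlier in this section lends itself to an automated gate: every GPU present, all on the fleet branch, persistence on. The sketch below assumes the three-column output of that query and that persistence_mode reads Enabled when on; the function name and arguments are hypothetical.

```shell
#!/bin/sh
# Read "name, driver_version, persistence_mode" rows on stdin and check them.
check_gpu_rows() {
    # $1 = expected GPU count, $2 = expected driver branch (e.g. 580)
    awk -F', *' -v want="$1" -v branch="$2" '
        BEGIN { rows = 0; bad = 0 }
        NF >= 3 {
            split($2, v, ".")
            if (v[1] != branch) {
                printf "GPU %d: driver %s is off branch %s\n", rows, $2, branch
                bad = 1
            }
            if ($3 != "Enabled") {
                printf "GPU %d: persistence mode is %s\n", rows, $3
                bad = 1
            }
            rows++
        }
        END {
            if (rows != want) {
                printf "expected %d GPUs, saw %d\n", want, rows
                bad = 1
            }
            exit bad
        }'
}

# Usage on a real node (not run here):
#   nvidia-smi --query-gpu=name,driver_version,persistence_mode --format=csv,noheader \
#       | check_gpu_rows 8 580
```

A non-zero exit status with the reasons on stdout is enough for a config-management task or CI step to fail the node.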

Fleet rollout (cordon/drain)

Never big-bang a fleet. Roll behind cordon/drain inside a maintenance window, one node at a time or in small batches that preserve a healthy quorum. The end-to-end rolling procedure (Ansible reinstall, branch-matched Fabric Manager, reboot, validate, and rollback to the previous branch) lives in the driver-upgrade runbook. The minimal Kubernetes shape:

NODE=gpu-07.dc1.internal
kubectl cordon "$NODE"
kubectl drain "$NODE" --ignore-daemonsets --delete-emptydir-data --timeout=15m
# ... reinstall via your config-management role, then reboot ...
ssh "$NODE" 'systemctl is-active nvidia-fabricmanager nvidia-persistenced && dcgmi diag -r 3'
kubectl uncordon "$NODE"

Validate green before advancing to the next batch. The Slurm equivalent of cordon is scontrol update nodename=<n> state=drain reason="driver upgrade"; return the node to service with state=resume once it validates.
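The Kubernetes sequence above can be wrapped so that it stops at the first failing step and leaves a bad node cordoned. This is a sketch under stated assumptions, not the runbook's procedure: reinstall_node is a hypothetical hook you must supply (it is expected to reinstall, reboot, and wait for the node to return), and DRY_RUN=1 only prints the commands.

```shell
#!/bin/sh
# Run a command, or only print it when DRY_RUN=1.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        printf '+ %s\n' "$*"
    else
        "$@"
    fi
}

# Roll a single node; any failing step aborts before uncordon.
roll_node() {
    node=$1
    run kubectl cordon "$node" || return 1
    run kubectl drain "$node" --ignore-daemonsets --delete-emptydir-data --timeout=15m || return 1
    # Hypothetical hook: your config-management reinstall + reboot + wait-for-ssh
    run reinstall_node "$node" || return 1
    run ssh "$node" 'systemctl is-active nvidia-fabricmanager nvidia-persistenced && dcgmi diag -r 3' || return 1
    run kubectl uncordon "$node"
}

# Usage (not run here): rehearse first, then run for real.
#   DRY_RUN=1; roll_node gpu-07.dc1.internal
```

Rehearsing with DRY_RUN=1 shows exactly which commands a batch would issue before anything is drained.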

Failure modes

Brief; each links its runbook.

No GPU after reboot: kernel upgrade without module rebuild. Modules built for the old kernel will not load. Check dkms status and linux-headers-$(uname -r). → kernel/GPU-missing runbook.
No GPU after reboot: Secure Boot rejecting an unsigned module. Looks identical to a missing driver. Check mokutil --sb-state and dmesg for module-signature rejections; re-enrol the MOK. → kernel/GPU-missing runbook.
Modules will not load: GSP firmware / driver mismatch after a partial install. Reinstall the full driver package so firmware and modules match. → GSP firmware-mismatch runbook.
NVLink domain will not form: Fabric Manager down or version-mismatched to the driver after a driver-only bump. → Fabric Manager failure runbook.
Slow job starts / bouncing clocks: persistence off. Enable nvidia-persistenced. → persistence-mode runbook.
Silent fleet drift: unattended-upgrade moved a node off the branch. Hold the packages. → driver-upgrade runbook.
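For the "no GPU after reboot" entries in the list above, the order of checks can be written down as a small decision function. This is an illustrative sketch only: the function name, the yes/no inputs, and the verdict strings are hypothetical, and the final branch is a pointer to keep investigating, not a diagnosis.

```shell
#!/bin/sh
# Map four yes/no observations to the most likely failure mode.
triage_missing_gpu() {
    # $1 = headers present for the running kernel
    # $2 = dkms status shows nvidia installed for the running kernel
    # $3 = Secure Boot enabled
    # $4 = nvidia module carries a signature
    if [ "$1" != "yes" ] || [ "$2" != "yes" ]; then
        echo "kernel-upgrade-without-rebuild"
    elif [ "$3" = "yes" ] && [ "$4" != "yes" ]; then
        echo "secure-boot-unsigned-module"
    else
        echo "suspect-gsp-or-partial-install"
    fi
}

# Usage (not run here): gather the four answers with the checks named in the
# list above (linux-headers, dkms status, mokutil --sb-state, dmesg), then:
#   triage_missing_gpu yes yes yes no
```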

References

NVIDIA Driver Installation Guide for Linux — Ubuntu (cuda-keyring, headers, install commands): https://docs.nvidia.com/datacenter/tesla/driver-installation-guide/ubuntu.html
NVIDIA Driver Installation Guide — Kernel Modules (open vs proprietary, metapackages, DKMS, runfile --kernel-module-type): https://docs.nvidia.com/datacenter/tesla/driver-installation-guide/kernel-modules.html
NVIDIA Driver Installation Guide — Advanced Options (switching flavors, pinning, meta packages): https://docs.nvidia.com/datacenter/tesla/driver-installation-guide/advanced-options.html
"NVIDIA Transitions Fully Towards Open-Source GPU Kernel Modules" (R560 default; Grace Hopper/Blackwell must use open; Maxwell/Pascal/Volta proprietary): https://developer.nvidia.com/blog/nvidia-transitions-fully-towards-open-source-gpu-kernel-modules/
NVIDIA/open-gpu-kernel-modules (supported "Turing or later", licensing): https://github.com/NVIDIA/open-gpu-kernel-modules
NVIDIA Linux driver README — runfile -m=kernel-open, --dkms, silent install: https://download.nvidia.com/XFree86/Linux-x86_64/545.29.02/README/kernel_open.html
"Updating the CUDA Linux GPG Repository Key" (apt-key del 7fa2af80): https://developer.nvidia.com/blog/updating-the-cuda-linux-gpg-repository-key/
NVIDIA Driver Persistence (legacy persistence mode deprecation; nvidia-persistenced daemon): https://docs.nvidia.com/deploy/driver-persistence/index.html
NVIDIA Fabric Manager User Guide (package name, nvidia-fabricmanager service, driver compatibility R450+/570.xx): https://docs.nvidia.com/datacenter/tesla/fabric-manager-user-guide/index.html
DKMS Secure Boot module signing (mokutil --import, framework.conf mok_signing_key/mok_certificate): https://github.com/dell/dkms

1. NVIDIA Driver Installation Guide for Linux, Ubuntu network-repo section: wget .../repos/$distro/{x86_64,sbsa}/cuda-keyring_1.1-1_all.deb, dpkg -i, apt update, apt install linux-headers-$(uname -r); $distro examples ubuntu2204/ubuntu2404; keyring at /usr/share/keyrings/cuda-archive-keyring.gpg. https://docs.nvidia.com/datacenter/tesla/driver-installation-guide/ubuntu.html
2. NVIDIA, "Updating the CUDA Linux GPG Repository Key": remove the rotated-out key with sudo apt-key del 7fa2af80 on previously-configured Debian/Ubuntu/WSL hosts before installing cuda-keyring. https://developer.nvidia.com/blog/updating-the-cuda-linux-gpg-repository-key/
3. NVIDIA Driver Installation Guide, Kernel Modules / Advanced Options: open via apt install nvidia-open, proprietary via apt install cuda-drivers; compute-only open apt -V install libnvidia-compute nvidia-dkms-open; pinning via nvidia-driver-pinning-<branch>; runfile flavor --kernel-module-type=open|proprietary; open supports Turing and newer. https://docs.nvidia.com/datacenter/tesla/driver-installation-guide/kernel-modules.html
4. NVIDIA Technical Blog, "NVIDIA Transitions Fully Towards Open-Source GPU Kernel Modules": "For cutting-edge platforms such as NVIDIA Grace Hopper or NVIDIA Blackwell, you must use the open-source GPU kernel modules"; open became default in R560; Maxwell/Pascal/Volta remain on proprietary; dual MIT/GPLv2. https://developer.nvidia.com/blog/nvidia-transitions-fully-towards-open-source-gpu-kernel-modules/
5. NVIDIA Linux x86_64 driver README: open kernel modules support "Turing or later GPUs"; runfile selects open with -m=kernel-open; --dkms registers DKMS; -s/--silent for unattended install. https://download.nvidia.com/XFree86/Linux-x86_64/545.29.02/README/kernel_open.html
6. Module signing under Secure Boot uses a Machine Owner Key enrolled via mokutil --import (password confirmed by MokManager on next reboot); DKMS can auto-sign each rebuild via mok_signing_key/mok_certificate in framework.conf. https://github.com/dell/dkms
7. NVIDIA Driver Persistence docs: the legacy persistence mode "will be deprecated and removed in favor of the NVIDIA Persistence Daemon"; the daemon ships with the driver (since 319) at /usr/bin/nvidia-persistenced and is the recommended approach. https://docs.nvidia.com/deploy/driver-persistence/index.html
8. NVIDIA Fabric Manager User Guide: package nvidia-fabricmanager-<version>, service nvidia-fabricmanager (systemctl enable/start); requires a compatible driver from R450 (e.g. 570.xx for B200/B300); must be branch-matched to the driver on NVSwitch systems. https://docs.nvidia.com/datacenter/tesla/fabric-manager-user-guide/index.html