NVIDIA kernel modules

Scope: the kernel-space half of the GPU driver on a single Linux node. This page covers the five nvidia* modules, the open-vs-proprietary flavor choice, and the install-time mechanics (DKMS rebuilds, nouveau blacklisting, Secure Boot signing) that decide whether a node sees its GPUs after the next reboot.

What it is

The NVIDIA Linux GPU driver is split into kernel-space modules (the subject of this page) and the components shipped alongside them: the user-space libcuda.so (the CUDA driver library), the GSP firmware blob (which runs on the GPU itself), and the management tools. The kernel side consists of five modules (NVIDIA Driver Installation Guide, Kernel Modules):

| Module | Loads as | Role |
| --- | --- | --- |
| `nvidia.ko` | `nvidia` | Core driver. Device init, register access, GPU memory management, talks to the on-GPU GSP firmware. Everything else depends on it. |
| `nvidia-uvm.ko` | `nvidia_uvm` | Unified Virtual Memory: backs `cudaMallocManaged`, CUDA demand paging, and managed-memory migration. Required for CUDA compute. |
| `nvidia-modeset.ko` | `nvidia_modeset` | Display mode-setting core; sits under the DRM module. |
| `nvidia-drm.ko` | `nvidia_drm` | DRM/KMS interface to the kernel display stack (kernel mode-setting, PRIME, Wayland). |
| `nvidia-peermem.ko` | `nvidia_peermem` | Exposes GPU video memory to third-party RDMA NICs for GPUDirect RDMA. Not auto-loaded. |

Note the naming asymmetry: the on-disk module file uses a hyphen (nvidia-uvm.ko), the loaded module name uses an underscore (nvidia_uvm). lsmod, modinfo, and /sys/module/ all use the underscore form; modprobe accepts either.
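The hyphen-to-underscore mapping is easy to get wrong in scripts, so a small helper can normalize names before checking /sys/module/. This is a minimal sketch; the function names loaded_name and is_loaded are hypothetical and not part of any NVIDIA tooling.

```shell
# Convert an on-disk module file name (e.g. nvidia-uvm.ko) to the name
# the kernel reports once the module is loaded (e.g. nvidia_uvm).
loaded_name() {
    printf '%s\n' "${1%.ko}" | sed 's/-/_/g'
}

# Succeed if the module is currently loaded: a loaded module has a
# directory under /sys/module/ named after its underscore form.
is_loaded() {
    [ -d "/sys/module/$(loaded_name "$1")" ]
}
```

For example, `is_loaded nvidia-uvm.ko && echo "UVM is up"` checks /sys/module/nvidia_uvm regardless of which spelling the caller used.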

For a headless compute node, nvidia and nvidia_uvm are the load-bearing pair; nvidia_drm/nvidia_modeset matter only when the node drives a display or a Wayland session. nvidia_peermem is loaded explicitly on RDMA fabric nodes (see below).

Open vs proprietary flavor

Since the 515 driver series the modules ship in two mutually exclusive flavors (Kernel Modules):

Open (nvidia-open, source-published, dual-licensed MIT/GPLv2). Supported on Turing and newer only, because the open modules depend on the GSP, first shipped in Turing. The open-module source tree states plainly: "The NVIDIA open kernel modules can be used on any Turing or later GPU" and "must be used with GSP firmware and user-space NVIDIA GPU driver components from a corresponding ... driver release" (NVIDIA/open-gpu-kernel-modules README).
Proprietary (closed-source). Required for the older Maxwell, Pascal, and Volta architectures, which predate the GSP (Kernel Modules).

Open became the default and recommended flavor in the R560 release (Kernel Modules; NVIDIA: Transitions Fully Towards Open-Source GPU Kernel Modules). For the architecture-to-branch mapping see driver versions & branches and driver by tier.
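The flavor rules on this page reduce to a small lookup. The sketch below is illustrative only: the function name flavor_for_arch, the architecture spellings it accepts, and the open-required label are assumptions made for this example, and the Blackwell / Grace Hopper branch reflects the NVIDIA statement quoted in the next section. Treat the driver-by-tier mapping as authoritative.

```shell
# Print the module flavor expected for a GPU architecture name.
# Labels (proprietary / open / open-required) are hypothetical.
flavor_for_arch() {
    arch=$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')
    case $arch in
        maxwell|pascal|volta)     echo proprietary ;;    # predate the GSP
        blackwell|grace-hopper)   echo open-required ;;  # proprietary unsupported
        turing|ampere|ada|hopper) echo open ;;           # default flavor since R560
        *)                        echo unknown; return 1 ;;
    esac
}
```

A provisioning script could call `flavor_for_arch "$ARCH"` and refuse to install a proprietary package when the answer is open-required.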

Why it's needed (and when)

You touch this layer directly in three situations, and each is a common "the GPUs are gone" ticket:

Blackwell and Grace Hopper force the open flavor. Per NVIDIA: "For cutting-edge platforms such as NVIDIA Grace Hopper or NVIDIA Blackwell, you must use the open-source GPU kernel modules. The proprietary drivers are unsupported on these platforms." (Transitions Fully blog). Install the proprietary flavor on a B-series node and the modules will not bind the devices; there is no fallback. Confirm the flavor before declaring a Blackwell node healthy.
GPUDirect RDMA needs nvidia_peermem loaded. It is not loaded automatically; collectives over InfiniBand/RoCE quietly miss the GPU-direct path until it is up. See Validated usage.
A kernel upgrade can strand the driver. The modules are built against a specific kernel. Bump the kernel without rebuilding (no DKMS, or a DKMS failure) and the node boots with no GPU. This is the single most common cause behind runbook: kernel present but GPU missing.

Two adjacent failure surfaces share this layer: a GSP firmware / driver mismatch (the modules load against the wrong firmware blob), and Secure Boot rejecting unsigned modules (below). Both present as "no GPU after reboot" and are diagnosed here, not in CUDA.

How it's installed & managed

Reference templates, not hardware-tested. Pin exact versions to your adopted driver branch and validate on a canary node before fleet rollout (runbook: driver upgrade).

Selecting the flavor

Package manager (datacenter driver, the fleet path). The flavor is encoded in the package. NVIDIA's guide installs the open driver as the nvidia-open meta-package; the CUDA repo also ships an explicit DKMS kmod (Kernel Modules, Advanced Options). Exact package names track the OS release; confirm against the guide for your distro version:

# Debian / Ubuntu — open flavor
sudo apt-get install -y nvidia-open
# (recent Ubuntu also pairs it with nvidia-driver-open)

# RHEL / Rocky — open flavor
sudo dnf install -y --allowerasing nvidia-open
# explicit DKMS kmod from the CUDA repo:
sudo dnf install -y --allowerasing kmod-nvidia-open-dkms

Ubuntu archives additionally expose branch-pinned packages of the form nvidia-driver-<BRANCH>-open (for example nvidia-driver-580-open); pin one to hold the fleet on a single driver branch.

Runfile (.run) installer. As the runfile README puts it, "The two flavors are mutually exclusive: they cannot be used within the kernel at the same time". Pick one at install time with --kernel-module-type (short form -M), set to either open or proprietary (NVIDIA runfile README, Open Linux Kernel Modules). Confirm the exact flag against the README for the driver version you are installing; older 5xx-series READMEs spell this -m=kernel-open:

# --kernel-module-type=open         -> open modules (-M=open)
# --kernel-module-type=proprietary  -> proprietary modules (-M=proprietary)
sudo sh ./NVIDIA-Linux-x86_64-<VERSION>.run --kernel-module-type=open --dkms

Switching flavor means removing the installed one first. After install, confirm which is loaded:

modinfo -F license nvidia        # open: "Dual MIT/GPL"; proprietary: "NVIDIA"
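For fleet checks it can help to turn the license string into a flavor label a script can compare. A minimal sketch, assuming the two license strings noted above are the only ones of interest; flavor_from_license is a hypothetical helper name.

```shell
# Map the `modinfo -F license nvidia` string to a flavor label.
flavor_from_license() {
    case $1 in
        "Dual MIT/GPL") echo open ;;
        NVIDIA)         echo proprietary ;;
        *)              echo unknown; return 1 ;;
    esac
}

# Usage on a node with the driver installed (not run here):
#   flavor_from_license "$(modinfo -F license nvidia)"
```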

DKMS: survive kernel upgrades

DKMS rebuilds the modules automatically when the kernel changes, which is what keeps a node's GPUs alive across apt upgrade/dnf update. The -open-dkms packages and the runfile's --dkms flag register the module source with DKMS automatically; verify and inspect:

dkms status                      # expect: nvidia/<VERSION>, <KERNEL>, x86_64: installed
sudo dkms autoinstall            # build and install any registered module missing for the running kernel

A node booted into a kernel with no matching DKMS-built module is the classic post-reboot GPU-missing case. The build needs the matching linux-headers (Debian/Ubuntu) or kernel-devel (RHEL) for the running kernel installed first.
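Both preconditions can be checked before a reboot. The sketch below parses `dkms status` text from stdin and looks for the conventional headers link; it assumes the DKMS module name starts with nvidia and that the status line contains the kernel release between commas, which holds for the output shape shown above but may differ across DKMS versions. The function names are hypothetical.

```shell
# Read `dkms status` output on stdin; succeed if an nvidia module is
# recorded as installed for the given kernel release.
dkms_nvidia_installed_for() {
    kernel=$1
    grep -E '^nvidia' | grep -F -- ", $kernel," | grep -q ': installed'
}

# Succeed if build headers for the given kernel release are present.
# /lib/modules/<release>/build is the conventional link to them.
headers_present_for() {
    [ -e "/lib/modules/$1/build" ]
}

# Usage on a real node (not run here):
#   dkms status | dkms_nvidia_installed_for "$(uname -r)" || echo "no module for running kernel"
#   headers_present_for "$(uname -r)" || echo "install kernel headers first"
```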

Blacklist nouveau

The in-tree open-source nouveau driver binds NVIDIA GPUs on boot and conflicts with the NVIDIA modules. Datacenter driver packages handle this, but verify it: a nouveau that wins the race blocks nvidia from binding. To blacklist explicitly:

# Debian / Ubuntu
printf 'blacklist nouveau\noptions nouveau modeset=0\n' | sudo tee /etc/modprobe.d/blacklist-nouveau.conf
sudo update-initramfs -u
sudo reboot

On RHEL-family rebuild initramfs with sudo dracut --force instead of update-initramfs. Confirm afterward that lsmod | grep nouveau is empty.
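To verify the blacklist rather than assume the package wrote it, a script can scan the modprobe configuration directory. A minimal sketch with a hypothetical function name; it only inspects the directory passed in (default /etc/modprobe.d), so a blacklist shipped under /usr/lib/modprobe.d or /lib/modprobe.d needs a second call.

```shell
# Succeed if any *.conf file in the given directory contains a
# "blacklist nouveau" line. Defaults to /etc/modprobe.d.
nouveau_blacklisted() {
    dir=${1:-/etc/modprobe.d}
    grep -qsE '^[[:space:]]*blacklist[[:space:]]+nouveau[[:space:]]*$' "$dir"/*.conf
}

# Usage on a real node (not run here):
#   nouveau_blacklisted || echo "nouveau is not blacklisted in /etc/modprobe.d"
```

A passing check only proves the file exists; the initramfs still has to be rebuilt for the blacklist to apply at early boot.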

Secure Boot: MOK signing

With Secure Boot enabled, the kernel refuses to load unsigned third-party modules. Every NVIDIA module you load (nvidia, nvidia_drm, nvidia_modeset, nvidia_uvm, and nvidia_peermem where used) must be signed by a key the boot chain trusts, typically a Machine Owner Key (MOK) enrolled through shim. Two viable paths:

Prebuilt signed packages. On Ubuntu, ubuntu-drivers installs prebuilt, signed modules that work under Secure Boot without manual key handling (Ubuntu Server, install NVIDIA drivers). Preferred when available.
Enroll your own MOK (required for DKMS-built modules). Generate a key pair, enroll the public key, then have DKMS sign every rebuild:

sudo mokutil --import /var/lib/dkms/mok.pub   # set a one-time password
sudo reboot                                    # enroll via the MOK Manager (blue) screen on boot
mokutil --test-key /var/lib/dkms/mok.pub       # verify after reboot: should report the key is already enrolled

Configure DKMS to sign automatically (mok_signing_key / mok_certificate in /etc/dkms/framework.conf) so kernel upgrades do not silently produce unsigned, unloadable modules. An unsigned module under Secure Boot fails to load with the modprobe error Key was rejected by service; seen only through nvidia-smi, that failure looks identical to a missing GPU.
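Because the symptom is ambiguous, it helps to test the two conditions directly: is Secure Boot on, and does the module carry a signature at all. The sketch below takes the text of `mokutil --sb-state` and `modinfo -F signer nvidia` as arguments; the function name is hypothetical, and it only detects a missing signature, not a module signed with a key that was never enrolled.

```shell
# Succeed (exit 0) when Secure Boot is enabled and the module has no signer.
#   $1: output of `mokutil --sb-state`
#   $2: output of `modinfo -F signer nvidia` (empty for an unsigned module)
unsigned_under_secure_boot() {
    sb_state=$1
    signer=$2
    case $sb_state in
        *"SecureBoot enabled"*) [ -z "$signer" ] ;;
        *) return 1 ;;
    esac
}

# Usage on a real node (not run here):
#   unsigned_under_secure_boot "$(mokutil --sb-state 2>/dev/null)" \
#       "$(modinfo -F signer nvidia 2>/dev/null)" && echo "nvidia is unsigned; it will not load"
```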

Validated usage & tests

The following are reference templates; described output is the documented/expected shape, not measured values from a specific machine.

Confirm the modules are loaded. nvidia and nvidia_uvm should be present on any compute node; Used by shows the dependency chain (nvidia_uvm, nvidia_modeset, nvidia_drm all reference nvidia):

lsmod | grep nvidia

Expected: lines for nvidia_uvm, nvidia_drm, nvidia_modeset, and nvidia (the largest), with the Used by column wiring them to nvidia. Absence of nvidia here is the headline symptom of no-GPU-after-reboot.
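The same check can be scripted against /proc/modules, which lists one loaded module per line with the name in the first column. A minimal sketch; missing_modules is a hypothetical helper, and the file path is a parameter so the logic can be exercised against a saved copy.

```shell
# Print each named module that does NOT appear in the first column
# of the given modules file (normally /proc/modules).
missing_modules() {
    modules_file=$1
    shift
    for m in "$@"; do
        awk -v m="$m" '$1 == m { found = 1 } END { exit !found }' "$modules_file" \
            || printf '%s\n' "$m"
    done
}

# Usage on a compute node (not run here):
#   missing_modules /proc/modules nvidia nvidia_uvm
```

Empty output means the load-bearing pair is present; any printed name is a module to investigate.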

Inspect a module (version, license, parameters). modinfo reads metadata without loading anything:

modinfo nvidia | grep -E '^(version|license|vermagic):'

Expected: a version matching the installed driver, license of Dual MIT/GPL (open) or NVIDIA (proprietary), and a vermagic whose kernel string matches the running kernel (uname -r). A vermagic mismatch is exactly the case DKMS exists to prevent.
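The vermagic comparison is simple to automate because the kernel release is the first space-separated field of the vermagic string. A minimal sketch with a hypothetical function name:

```shell
# Succeed if the kernel release at the start of a vermagic string
# equals the given running kernel release.
#   $1: output of `modinfo -F vermagic <module>`
#   $2: output of `uname -r`
vermagic_matches() {
    vermagic=$1
    running=$2
    [ "${vermagic%% *}" = "$running" ]
}

# Usage on a real node (not run here):
#   vermagic_matches "$(modinfo -F vermagic nvidia)" "$(uname -r)" || echo "module built for another kernel"
```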

Load nvidia_peermem for GPUDirect RDMA. Per the module README, "there is no service to automatically load nvidia-peermem". Load it manually and persist it (NVIDIA GPUDirect RDMA Peer Memory Client):

sudo modprobe nvidia-peermem
lsmod | grep nvidia_peermem            # expect a non-empty line once loaded
echo nvidia-peermem | sudo tee /etc/modules-load.d/nvidia-peermem.conf   # load on boot

Ordering caveat from the same README: "If the NVIDIA GPU driver is installed before MOFED, the GPU driver must be uninstalled and installed again to make sure nvidia-peermem is compiled with the RDMA APIs". Install the InfiniBand stack (MOFED/DOCA-OFED) before the GPU driver, or reinstall the driver afterward. A modprobe nvidia-peermem returning Invalid argument typically means the driver was built without the RDMA peer-memory APIs (no OFED present at build time). On nodes where rebuilding against per-kernel OFED is painful, NVIDIA's newer DMA-BUF path provides GPUDirect RDMA without this module; verify support for your driver and NIC against the GPUDirect RDMA docs.
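Since nothing loads the module automatically, it is worth confirming that the persistence file written above actually exists on every RDMA node. A minimal sketch, assuming the systemd modules-load.d convention of one module name per line; peermem_persisted is a hypothetical helper and accepts either spelling of the module name.

```shell
# Succeed if any *.conf file in the given directory lists nvidia-peermem
# (or nvidia_peermem) on a line of its own. Defaults to /etc/modules-load.d.
peermem_persisted() {
    dir=${1:-/etc/modules-load.d}
    grep -qsxE '[[:space:]]*nvidia[-_]peermem[[:space:]]*' "$dir"/*.conf
}

# Usage on an RDMA node (not run here):
#   peermem_persisted || echo "nvidia-peermem will not load on next boot"
```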

End-to-end check. Once modules are loaded and (if RDMA) nvidia_peermem is up, nvidia-smi should enumerate every GPU; see nvidia-smi reference. Module health is necessary but not sufficient; also confirm persistence mode and, on NVSwitch systems, Fabric Manager.

Failure modes

Brief; each links the matching runbook.

No nvidia in lsmod after a kernel upgrade. DKMS did not rebuild (missing headers, build error) or modules are unsigned under Secure Boot. Runbook: kernel present but GPU missing.
Proprietary flavor on Blackwell / Grace Hopper. Modules will not bind the devices; the open flavor is mandatory there. See driver by tier and install lifecycle.
nouveau won the boot race. nvidia cannot bind; lsmod | grep nouveau is non-empty. Re-blacklist and rebuild initramfs. Runbook: kernel present but GPU missing.
Secure Boot rejects unsigned modules. modprobe fails with Key was rejected by service; enroll the MOK and configure DKMS signing. See install lifecycle.
nvidia_peermem absent or modprobe fails. GPUDirect RDMA silently disabled, or driver built without OFED RDMA APIs. See NVSwitch/NVLink and RDMA.
Modules load against the wrong GSP firmware blob. Partial or mismatched upgrade. Runbook: GSP firmware mismatch.
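The list above can be read as a decision order for a first-pass triage. The sketch below encodes one possible ordering as a function over yes/no observations gathered by other checks; the function name, argument order, and message wording are assumptions for illustration, and the runbooks remain the authoritative procedure.

```shell
# Print the most likely failure mode given four yes/no observations:
#   $1 nvidia module loaded?            $2 nouveau module loaded?
#   $3 DKMS module installed for the running kernel?
#   $4 module unsigned while Secure Boot is enabled?
triage() {
    nvidia_loaded=$1
    nouveau_loaded=$2
    dkms_installed=$3
    unsigned_under_sb=$4
    if [ "$nvidia_loaded" = yes ]; then
        echo "modules loaded: check GSP firmware match, nvidia_peermem, nvidia-smi"
    elif [ "$nouveau_loaded" = yes ]; then
        echo "nouveau bound the GPU: re-blacklist and rebuild initramfs"
    elif [ "$dkms_installed" != yes ]; then
        echo "no module for this kernel: install headers and rebuild with DKMS"
    elif [ "$unsigned_under_sb" = yes ]; then
        echo "module unsigned under Secure Boot: enroll the MOK and configure DKMS signing"
    else
        echo "module present but not loaded: check flavor against GPU architecture"
    fi
}
```

For example, `triage no no no no` points at the DKMS rebuild, which this page identifies as the most common cause after a kernel upgrade.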

References

NVIDIA Driver Installation Guide — Kernel Modules: https://docs.nvidia.com/datacenter/tesla/driver-installation-guide/kernel-modules.html
NVIDIA Driver Installation Guide — Advanced Options (flavor switching): https://docs.nvidia.com/datacenter/tesla/driver-installation-guide/advanced-options.html
NVIDIA/open-gpu-kernel-modules (source, Turing+ / GSP requirement): https://github.com/NVIDIA/open-gpu-kernel-modules
NVIDIA blog — Transitions Fully Towards Open-Source GPU Kernel Modules (Blackwell/Grace Hopper require open; R560 default): https://developer.nvidia.com/blog/nvidia-transitions-fully-towards-open-source-gpu-kernel-modules/
NVIDIA runfile README — Open Linux Kernel Modules (--kernel-module-type / -M, values open|proprietary; legacy form -m=kernel-open): https://download.nvidia.com/XFree86/Linux-x86_64/580.105.08/README/kernel_open.html
NVIDIA README — GPUDirect RDMA Peer Memory Client (nvidia-peermem, manual load, MOFED ordering): https://download.nvidia.com/XFree86/Linux-x86_64/510.39.01/README/nvidia-peermem.html
GPUDirect RDMA documentation: https://docs.nvidia.com/cuda/gpudirect-rdma/
Ubuntu Server — install NVIDIA drivers (prebuilt signed under Secure Boot): https://ubuntu.com/server/docs/how-to/graphics/install-nvidia-drivers/