Driver versions & branches
Scope: NVIDIA data center (Tesla) GPU driver branches, Long Term Support (LTSB) vs Production, their support windows, how to choose and pin a fleet to one branch, and CUDA forward/minor-version compatibility so a newer CUDA toolkit runs on an older driver. This is the version-policy layer under the GPU software stack; the rolling-change mechanics live in the driver-upgrade runbook.
Reference templates on real NVIDIA APIs and package names. Not hardware-tested. Pin exact versions and validate on a canary node before any production rollout.
What it is
NVIDIA ships the data center GPU driver as numbered branches (R535, R570, R580, R590, R595, R610, …). The leading number is the branch; the full version is <branch>.<minor>.<patch> for Linux (e.g. 580.82.07, 580.167.08). Each branch is classified into one of three lifecycle types, and that type is the main input to fleet planning:
- Long Term Support Branch (LTSB): "A production branch that will be supported and maintained for a much longer time than a normal production branch." Quarterly (or as-needed) bug and critical-security updates for 3 years.1
- Production Branch (PB): "Qualified for use in production for enterprise/data center GPUs. Bug fixes and security updates are provided for up to 1 year," with quarterly releases.1
- New Feature Branch (NFB): earliest features, shortest life; for evaluation, not fleet standardisation.1
Cadence: NVIDIA releases roughly two production branches per year (~every six months) and an LTSB at least once per hardware architecture.1 A branch is the unit you standardise on; minor/patch releases roll within it.
The driver branch determines the CUDA driver version (libcuda.so, the driver API) and therefore the maximum CUDA toolkit the host can run natively. It is versioned with the driver, not with the toolkit (the GPU software stack).
Live branch status (verify against the support matrix2 before quoting; dates below are as of 2026-06):
| Branch | Type | Latest (Linux) | Max CUDA | Support ends |
|---|---|---|---|---|
| R535 | LTSB | 535.309.01 | 12.x | 01 Jun 2026 (reached EOL) |
| R570 | Production | 570.211.01 | 12.x | 27 Jan 2026 (reached EOL) |
| R580 | LTSB | 580.167.08 | 13.x | 04 Aug 2028 |
| R590 | Production | 590.48.01 | 13.x | 22 Dec 2026 |
| R595 | Production | 595.71.05 | 13.x | 24 Mar 2027 |
| R610 | New Feature | — | 13.x | (NFB) |
Source: endoflife.date/nvidia3 cross-checked against the NVIDIA support matrix.2 Note the matrix HTML rounds R580 to "June 2028" while the R580 release notes and endoflife.date give 04 Aug 2028 (branch QA started 04 Aug 2025; +3 yr LTSB). Treat August 2028 as authoritative and re-verify near the date.3
Operational read as of mid-2026: R570 (Jan 2026) and R535 (Jun 2026) are at or past EOL. The standing LTSB target is R580 (CUDA 13.x, supported into Aug 2028). R535 is the prior LTSB for CUDA 12.x fleets that cannot yet move to CUDA 13; it is now end-of-life, so plan the migration.
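The EOL read above can be checked mechanically. This is a minimal sketch, not hardware-tested: the dates are copied from the branch table (as of 2026-06) and must be re-verified against the support matrix, and the function names branch_eol and eol_status are illustrative, not part of any NVIDIA tool.

```shell
# Support-end dates copied from the branch table above (as of 2026-06).
# Assumption: re-verify against the NVIDIA support matrix before relying on them.
branch_eol() {
  case "$1" in
    535) echo 2026-06-01 ;;
    570) echo 2026-01-27 ;;
    580) echo 2028-08-04 ;;
    590) echo 2026-12-22 ;;
    595) echo 2027-03-24 ;;
    *)   return 1 ;;            # branch not in the table
  esac
}

# eol_status <branch> [YYYY-MM-DD]; the date defaults to today.
# The support-end date itself still counts as "supported" here.
eol_status() {
  eol=$(branch_eol "$1") || { echo unknown; return 0; }
  today=${2:-$(date +%F)}
  # Dropping the hyphens turns ISO dates into comparable integers.
  if [ "$(echo "$today" | tr -d '-')" -gt "$(echo "$eol" | tr -d '-')" ]; then
    echo eol
  else
    echo supported
  fi
}

eol_status 570 2026-06-15   # prints "eol"
eol_status 580 2026-06-15   # prints "supported"
```

Passing the date explicitly keeps the check reproducible in CI; omit it to evaluate against the current day.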
Why it's needed (and when)
A GPU fleet is only as stable as its least-consistent node. Branch policy exists to make the driver a single, pinned, fleet-wide variable so that security response, framework qualification, and failure triage are tractable.
Pick a branch type by the workload's tolerance for change:
- LTSB: the default for a production training/inference fleet. The three-year window means you qualify a branch once and then absorb only quarterly security/bug patches within it, rather than re-qualifying against a base that keeps gaining features. Choose LTSB when you value a stable ABI and a long security runway over new features.
- Production: when you need a feature or hardware-enablement that only a newer branch carries and can accept re-qualifying within ~12 months. Two PBs land per year; budget the re-pin.
- NFB: evaluation/canary only. Never the fleet standard.
When the branch choice bites:
- New silicon. A new GPU architecture is enabled by a specific minimum branch; older branches will not initialise it. Match the branch to the generation (GPU generations).
- Legacy silicon drop. Each branch eventually drops older architectures. NVIDIA states R470 was the last branch to support data-center Kepler GPUs;4 a heterogeneous fleet mixing old and new GPUs can be forced onto a branch that still carries the oldest card. Confirm your oldest SKU is still supported on the target branch (verify the per-architecture support list in the branch release notes4 before committing). Do not assume a branch supports a card just because it is recent.
- CVE response. A driver/container-toolkit advisory forces a patch within the pinned branch (preferred) or a branch move if the branch is EOL, which is why running an EOL branch (R570/R535 today) is a liability: no further fixes ship.1
- CUDA requirement. A framework needs a newer CUDA toolkit than the pinned driver natively supports, handled by compatibility (below), not necessarily a driver bump.
How it's installed & managed
Install from the CUDA network repository (.deb/.rpm) or the runfile, and pin the branch explicitly so unattended upgrades cannot drift the fleet. Use DKMS so kernel-mode modules rebuild on kernel upgrade; a kernel bump with no matching module rebuild leaves the node with no GPU after reboot (the GPU software stack).
The metapackage names below select a branch at install time. Verify the exact available package names against the install guide for your distro and target branch.7
# Debian/Ubuntu: install a specific data center driver branch (R580 LTSB) and hold it.
# Open kernel modules are the default/required flavour on recent architectures.
sudo apt-get update
sudo apt-get install -y nvidia-open-580 # branch-pinned open kernel modules + driver
sudo apt-mark hold 'nvidia-*' 'libnvidia-*' # freeze the branch; no silent drift
# RHEL/Rocky (dnf module): select and lock the 580 stream.
sudo dnf module enable -y nvidia-driver:580-open
sudo dnf module install -y nvidia-driver:580-open
sudo dnf versionlock add 'nvidia-driver*' # requires the versionlock plugin
Fleet-pinning discipline:
- One branch across the whole fleet. Keep the target driver_branch in version control and roll it as a single variable per node, behind cordon/drain, in batches, never a big-bang. The end-to-end procedure (cordon → drain → reinstall → reboot → validate → uncordon, plus rollback) is the driver-upgrade runbook.
- Fabric Manager is lockstep with the driver. On NVSwitch systems (HGX 8-GPU baseboards, NVL72), reinstall the branch-matched nv-fabricmanager in the same step; a driver-only bump that strands an incompatible Fabric Manager will not form the NVLink domain (reliability and RAS, fabric bring-up).
- GSP firmware ships inside the driver package, not versioned separately, but a partial install (new modules, stale firmware) breaks module load (the GPU software stack).
- Host driver vs GPU Operator driver containers: pick one. Do not run a host-installed driver and the Operator's driver containers on the same node.
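The one-branch rule can be checked from a simple inventory. This is a minimal sketch, not hardware-tested. It assumes you already have one "<hostname> <driver_version>" pair per line; how you gather that (ssh loop, exporter, CMDB) is left open. The helper names branch_counts and fleet_on_branch are hypothetical.

```shell
# Input on stdin: one "<hostname> <driver_version>" pair per line.
# Prints "<branch> <node count>" for every branch seen, lowest branch first.
branch_counts() {
  awk '{ split($2, v, "."); n[v[1]]++ } END { for (b in n) print b, n[b] }' | sort -n
}

# fleet_on_branch <pinned branch>: exit status 0 only when every node on stdin
# reports that branch, non-zero as soon as one node has drifted.
fleet_on_branch() {
  awk -v want="$1" '{ split($2, v, "."); if (v[1] != want) bad = 1 } END { exit bad + 0 }'
}

printf 'n1 580.82.07\nn2 580.167.08\nn3 535.309.01\n' | branch_counts
# prints:
# 535 1
# 580 2
```

More than one output line from branch_counts means the fleet has drifted. fleet_on_branch is meant as a gate, for example before uncordoning a batch.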
CUDA compatibility: newer CUDA on an older driver
Two distinct mechanisms let a newer toolkit run without bumping the driver branch. Know which one applies.
1. Minor-version compatibility (no extra package). From CUDA 11 onward, an application compiled with any toolkit inside a CUDA major family runs on any driver in that same major family that meets the minimum driver version:5
| CUDA toolkit | Minimum Linux driver |
|---|---|
| 13.x | >= 580 (e.g. 580.65.06) |
| 12.x | >= 525 (525.60.13) |
| 11.x | >= 450 (450.80.02) |
So a CUDA 12.4 app runs on a 525/535/550-class driver unchanged; a CUDA 13.2 app runs on any 580+ driver. Caveats: features that span toolkit and driver can return cudaErrorCallRequiresNewerDriver on an older driver, and device code emitted only as PTX will not run on a driver older than the one that introduced that PTX ISA. Compile with an explicit nvcc -arch=sm_XX for your hardware.5
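The minimum-driver rule can be expressed as a version comparison. This is a minimal sketch, not hardware-tested. The floors are copied from the table above, and for 13.x it treats the table's example version (580.65.06) as the floor, which is an assumption. The function names are illustrative, and the comparison relies on GNU sort -V.

```shell
# Minimum Linux driver per CUDA major family, copied from the table above.
# Assumption: 580.65.06 (the table's example) is used as the 13.x floor.
min_driver_for_cuda() {
  case "${1%%.*}" in            # keep only the major part of e.g. "12.4"
    13) echo 580.65.06 ;;
    12) echo 525.60.13 ;;
    11) echo 450.80.02 ;;
    *)  return 1 ;;
  esac
}

# ver_ge A B: succeeds when version A >= version B (sort -V ordering).
ver_ge() {
  [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# driver_runs_toolkit <driver_version> <toolkit_version>
# Exit 0: driver meets the floor. Exit 1: too old. Exit 2: CUDA major not in the table.
driver_runs_toolkit() {
  min=$(min_driver_for_cuda "$2") || return 2
  ver_ge "$1" "$min"
}

driver_runs_toolkit 535.309.01 12.4 && echo "12.4 runs on 535.309.01"
driver_runs_toolkit 535.309.01 13.0 || echo "13.0 needs a newer driver or cuda-compat"
```

The check covers only the minimum-driver rule. It does not detect the feature or PTX caveats described above.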
2. Forward compatibility (the cuda-compat package), crosses major families. To run a toolkit from a newer major release on a driver that does not meet that major's minimum (e.g. CUDA 13.x on an R535 driver), install the matching cuda-compat-<major>-<minor> package. It places a newer user-mode libcuda alongside the older kernel module; the loader picks it up via LD_LIBRARY_PATH.6
# reference template, not hardware-tested: run a CUDA 13.0 build on an older data center driver
sudo apt-get install -y cuda-compat-13-0 # installs to /usr/local/cuda-13.0/compat/
# prepend the compat dir; the ${VAR:+...} form avoids a trailing ':' (an empty entry means the current directory)
export LD_LIBRARY_PATH=/usr/local/cuda-13.0/compat${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}
# ldconfig -p lists the system cache only and ignores LD_LIBRARY_PATH; it shows the system libcuda the compat copy must shadow
ldconfig -p | grep -m1 libcuda.so
nvidia-smi # driver/branch the kernel module reports
Hard constraints on forward compatibility, which is why it is a tactical bridge, not a fleet strategy:
- Data-center GPUs only. Supported on NVIDIA data center GPUs (plus select NGC-Server-Ready RTX SKUs and Jetson); explicitly not desktop/GeForce GPUs.6
- Older cuda-compat packages are not supported on newer drivers; compat goes new-toolkit-on-old-driver, never the reverse.6
- Some features need kernel-mode support and will not work through compat (e.g. CUDA–OpenGL/Vulkan interop is unsupported).6
- It pins you to managing an extra moving part in LD_LIBRARY_PATH on every affected node. Prefer pulling the fleet onto an LTSB whose native CUDA meets the requirement; use cuda-compat to unblock, then upgrade the branch.
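One way to keep that extra moving part contained is to derive the package name and path from the toolkit version and scope the library path to a single command. This is a minimal sketch, not hardware-tested. It follows the cuda-compat-<major>-<minor> naming and the /usr/local/cuda-<ver>/compat location used above; the helper names compat_pkg, compat_dir and with_compat are hypothetical.

```shell
# compat_pkg 13.0 -> cuda-compat-13-0   (naming pattern from the text above)
compat_pkg() { echo "cuda-compat-$(echo "$1" | tr . -)"; }

# compat_dir 13.0 -> /usr/local/cuda-13.0/compat   (install path from the text above)
compat_dir() { echo "/usr/local/cuda-$1/compat"; }

# with_compat <toolkit_version> <command...>
# Runs one command with the compat directory first in LD_LIBRARY_PATH,
# leaving the calling shell's environment untouched.
with_compat() {
  ver=$1
  shift
  LD_LIBRARY_PATH="$(compat_dir "$ver")${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}" "$@"
}

compat_pkg 13.0   # prints "cuda-compat-13-0"
compat_dir 13.0   # prints "/usr/local/cuda-13.0/compat"
```

Scoping the variable to one command, rather than exporting it node-wide, makes the compat layer easier to remove once the branch is upgraded.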
Validated usage & tests
Reference templates on real commands. Not hardware-tested. Expected output is described, never invented.
Confirm the branch a node is actually running, fleet-wide:
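One way to run this check on a node is nvidia-smi's query interface. This is a minimal sketch, not hardware-tested. It assumes nvidia-smi is on PATH; the wrapper gpu_driver_versions and the helper branch_of are hypothetical names.

```shell
# Branch = leading field of the full driver version (580.82.07 -> 580).
branch_of() { echo "${1%%.*}"; }

# One CSV line per GPU: index, product name, full driver version.
gpu_driver_versions() {
  nvidia-smi --query-gpu=index,name,driver_version --format=csv,noheader
}

# On a GPU node:  gpu_driver_versions
branch_of 580.82.07   # prints "580"
```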
Expect one line per GPU; the driver_version field is the full 5xx.yy.zz string. The leading three digits are the branch. Every node in the fleet must report the same branch. The CUDA version in the nvidia-smi header banner is the maximum CUDA the driver supports (driver API), not the installed toolkit.
Confirm the installed toolkit (runtime) version, which is independent of the driver:
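One common way to read the toolkit version is the nvcc banner. This is a minimal sketch, not hardware-tested. It assumes nvcc is on PATH and prints its usual "release X.Y" line; toolkit_release is a hypothetical helper name.

```shell
# Read nvcc --version output on stdin and print the X.Y that follows "release".
toolkit_release() {
  sed -n 's/.*release \([0-9][0-9]*\.[0-9][0-9]*\).*/\1/p' | head -n 1
}

# On a node with the toolkit installed:  nvcc --version | toolkit_release
echo 'Cuda compilation tools, release 12.4, V12.4.131' | toolkit_release   # prints "12.4"
```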
Expect a release 13.x (or 12.x) line. A node can show toolkit 12.x under a 580 driver, or, with cuda-compat, run a 13.x build under an older driver.
Confirm the branch is pinned/held (will not silently upgrade):
apt-mark showhold | grep -E 'nvidia|libnvidia' # Debian/Ubuntu
dnf versionlock list | grep nvidia # RHEL/Rocky
Verify a forward-compat build resolves the compat libcuda rather than the system one:
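A static approximation of the loader's search is to walk LD_LIBRARY_PATH in order and report the first directory that holds libcuda.so.1. This is a minimal sketch, not hardware-tested. first_libcuda_dir is a hypothetical helper, and it inspects LD_LIBRARY_PATH only, not the ld.so cache the loader falls back to.

```shell
# first_libcuda_dir [path list]: print the first entry of a ':'-separated path
# list (default: $LD_LIBRARY_PATH) that contains libcuda.so.1.
# Returns non-zero when no entry has it, i.e. the system libcuda would be used.
first_libcuda_dir() {
  old_ifs=$IFS
  IFS=:
  for d in ${1-${LD_LIBRARY_PATH-}}; do
    if [ -n "$d" ] && [ -e "$d/libcuda.so.1" ]; then
      IFS=$old_ifs
      echo "$d"
      return 0
    fi
  done
  IFS=$old_ifs
  return 1
}

# On a node using forward compatibility:  first_libcuda_dir
# For the loader's actual decision, glibc can trace it (./app is a placeholder):
#   LD_DEBUG=libs ./app 2>&1 | grep libcuda
```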
Expect /usr/local/cuda-<ver>/compat/ to appear before any system libcuda.so.1. If the system library resolves first, the compat layer is inert and a newer-major build will fail with an insufficient-driver error.
End-to-end branch validation on a canary before any fleet roll: dcgmi diag -r 3 clean and a real smoke job, per the driver-upgrade runbook.
Failure modes
Brief; each links its handling path.
- Branch drift across the fleet: nodes on mixed branches; triage and CVE response become per-node. Pin and hold every node to one branch (above); reconcile via the driver-upgrade runbook.
- Running an EOL branch: R570 (Jan 2026) / R535 (Jun 2026) receive no further security or bug fixes; an advisory then forces an unplanned branch migration.1 Track EOL3 and move to the current LTSB ahead of the date.
- Driver bumped, Fabric Manager not: version-mismatched nv-fabricmanager cannot form the NVLink domain; GPUs isolate and collectives degrade. Reinstall FM in lockstep (reliability and RAS, the driver-upgrade runbook).
- Kernel upgrade without DKMS: node reboots with no GPU because the module did not rebuild for the new kernel (the GPU software stack).
- cudaErrorCallRequiresNewerDriver / insufficient-driver: the app uses a feature or PTX newer than the driver; either the minimum-driver rule is unmet or a needed cuda-compat is absent or not first in LD_LIBRARY_PATH.5 6
- Forward-compat misused on GeForce: cuda-compat is unsupported on desktop GPUs and will not provide a valid driver path there.6
- Target branch missing the oldest GPU: a heterogeneous fleet's oldest SKU is no longer supported on the chosen branch; verify per-architecture support in the branch release notes4 before pinning (GPU generations).
References
- NVIDIA Data Center Driver Lifecycle Policy (LTSB/PB/NFB definitions, 3-yr/1-yr windows, cadence): https://docs.nvidia.com/datacenter/tesla/drivers/driver-lifecycle.html
- Supported Drivers and CUDA Toolkit Versions (branch ↔ CUDA ↔ EOL matrix): https://docs.nvidia.com/datacenter/tesla/drivers/supported-drivers-and-cuda-toolkit-versions.html
- NVIDIA Data Center Driver 580.82.07 release notes (R580, CUDA 13.x; Kepler/R470 note): https://docs.nvidia.com/datacenter/tesla/tesla-release-notes-580-82-07/index.html
- CUDA Compatibility: Minor Version Compatibility (CUDA 11+ minimum-driver rule, version table): https://docs.nvidia.com/deploy/cuda-compatibility/minor-version-compatibility.html
- CUDA Compatibility: Forward Compatibility (cuda-compat, path, data-center-only constraint): https://docs.nvidia.com/deploy/cuda-compatibility/forward-compatibility.html
- NVIDIA driver install guide (metapackages, repo/runfile): https://docs.nvidia.com/datacenter/tesla/tesla-installation-notes/index.html
- endoflife.date: NVIDIA driver branches (cross-check of EOL dates and latest versions): https://endoflife.date/nvidia
Related: Software Stack · Driver Upgrade Runbook · GPU Generations · Reliability & RAS · Glossary
1. NVIDIA Data Center Driver Lifecycle Policy. https://docs.nvidia.com/datacenter/tesla/drivers/driver-lifecycle.html
2. Supported Drivers and CUDA Toolkit Versions. https://docs.nvidia.com/datacenter/tesla/drivers/supported-drivers-and-cuda-toolkit-versions.html
3. endoflife.date: NVIDIA driver branches. https://endoflife.date/nvidia
4. NVIDIA Data Center Driver 580.82.07 release notes. https://docs.nvidia.com/datacenter/tesla/tesla-release-notes-580-82-07/index.html
5. CUDA Compatibility: Minor Version Compatibility. https://docs.nvidia.com/deploy/cuda-compatibility/minor-version-compatibility.html
6. CUDA Compatibility: Forward Compatibility. https://docs.nvidia.com/deploy/cuda-compatibility/forward-compatibility.html
7. NVIDIA driver installation guide. https://docs.nvidia.com/datacenter/tesla/tesla-installation-notes/index.html