CUDA toolkit and runtime

Scope: the difference between the CUDA Toolkit (compiler, headers, static libs), the CUDA runtime that ships inside applications, and the driver underneath them; how to install and pin the toolkit on a managed node, and how forward and minor-version compatibility let app and driver versions diverge on purpose.

Shell and Python blocks below are runnable diagnostics or reference templates. The two Python blocks that model CUDA version compatibility are self-checking and need no GPU (the first uses only the standard library, the second only numpy); the torch block is a labelled reference template. Pin versions and validate against your own nodes before production use.

What it is

Three things called "CUDA" sit at different layers and version independently. Getting a node right means knowing which is which.

CUDA driver (libcuda.so, the driver API): part of the GPU driver package, not the toolkit. Its version is the maximum CUDA the box can run. nvidia-smi's "CUDA Version" header reports this driver-API ceiling, not anything you installed with the toolkit (nvidia-smi reference, CUDA driver). ¹
CUDA Toolkit: the build-time SDK. It bundles nvcc, headers (cuda_runtime.h), static libraries (libcudart_static.a, libcublas_static.a), cuda-gdb, compute-sanitizer, and Nsight. Installed under /usr/local/cuda-13.3/ with a /usr/local/cuda symlink. nvcc --version reports the toolkit's runtime-API version. ¹
CUDA runtime (libcudart.so.<major>, the runtime API): the thin layer apps actually link against. It is bundled with the application: every PyTorch/JAX/TensorRT wheel carries its own libcudart plus cuBLAS/cuDNN/NCCL (CUDA libraries, GPU frameworks). The host toolkit version is irrelevant to a wheel that ships its own.

The load-bearing consequence: nvidia-smi (driver ceiling) and nvcc (installed toolkit) can and routinely do report different CUDA versions on a healthy node. That is expected, not a fault. ¹
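A minimal sketch of reading the two numbers side by side, so a disagreement is recorded rather than treated as a fault. The sample strings are illustrative text in the shape each tool prints, not output captured from a real node, and the helper names `driver_ceiling` and `toolkit_release` are hypothetical.

```python
import re

# Illustrative text in the shape each tool prints; not captured from a real node.
SAMPLE_SMI_HEADER = "| NVIDIA-SMI 580.65.06    Driver Version: 580.65.06    CUDA Version: 13.0 |"
SAMPLE_NVCC_LINE = "Cuda compilation tools, release 12.4, V12.4.131"


def driver_ceiling(smi_text: str) -> str:
    """CUDA version from the nvidia-smi header: the driver-API ceiling."""
    match = re.search(r"CUDA Version:\s*(\d+\.\d+)", smi_text)
    if match is None:
        raise ValueError("no 'CUDA Version' field in nvidia-smi text")
    return match.group(1)


def toolkit_release(nvcc_text: str) -> str:
    """Toolkit release from `nvcc --version` output: the installed toolkit."""
    match = re.search(r"release\s+(\d+\.\d+)", nvcc_text)
    if match is None:
        raise ValueError("no 'release X.Y' field in nvcc text")
    return match.group(1)


ceiling = driver_ceiling(SAMPLE_SMI_HEADER)
toolkit = toolkit_release(SAMPLE_NVCC_LINE)
# A mismatch here is the normal case described above, not an error condition.
print(f"driver ceiling {ceiling}, toolkit {toolkit}, differ: {ceiling != toolkit}")
```

On a real node the two input strings would come from running `nvidia-smi` and `nvcc --version`; the parsing stays the same.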

Why use it

The operational reasons to care about the split between toolkit, runtime, and driver: ¹

You need the Toolkit only to compile: building a custom kernel, a CUDA C++ extension, TensorRT plugins, or anything that calls nvcc/ptxas. A pure inference or training node running prebuilt wheels often needs no system toolkit at all, only a driver new enough for the wheels' bundled runtime.
Minor-version compatibility decouples app and driver within a CUDA major release. An app built against any CUDA 13.x toolkit runs on any driver in the 13.x family at or above the minimum: CUDA 13.x requires driver >= 580; CUDA 12.x requires >= 525; CUDA 11.x requires >= 450. ² This is why a fleet can hold one driver branch and still consume newer framework releases.
Forward compatibility goes further: run a newer-major toolkit's apps on an older base driver, via the cuda-compat-<major>-<minor> package, but only on datacenter-class systems (see How to run it in production). Use it when the driver branch is frozen by qualification but you must ship a CUDA-N+1 app. ³
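The choice between the two compatibility mechanisms can be sketched as a small classifier. This is a simplified model under stated assumptions: the function name `compat_path` and its return labels are hypothetical, and the sketch does not model which base driver branches a given cuda-compat package itself supports.

```python
# Minimum driver branch per CUDA major, as listed in the text above.
MIN_DRIVER = {11: 450, 12: 525, 13: 580}


def compat_path(app_major: int, driver_major: int, driver_branch: int,
                datacenter: bool) -> str:
    """Classify how an app of `app_major` can run on this driver (simplified)."""
    if app_major not in MIN_DRIVER:
        raise ValueError(f"unknown CUDA major: {app_major}")
    # Driver covers the app's major and meets its floor: nothing extra needed.
    if driver_major >= app_major and driver_branch >= MIN_DRIVER[app_major]:
        return "runs as-is"
    # Driver is a major behind: forward compatibility, datacenter-class only.
    if driver_major < app_major and datacenter:
        return "forward-compat package"
    return "upgrade driver"


print(compat_path(13, 13, 580, datacenter=False))  # same major, floor met
print(compat_path(13, 12, 535, datacenter=True))   # frozen older driver, datacenter
print(compat_path(13, 12, 535, datacenter=False))  # no forward-compat path
```

Keeping the datacenter check as an explicit argument makes the hardware restriction visible at the call site instead of hiding it in a lookup.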

When to use it (and when not)

Install a system toolkit when the node compiles CUDA: custom kernels, C++ extensions, TensorRT plugins, or profiling/debugging builds that need nvcc, ptxas, cuda-gdb, or compute-sanitizer on the host.
Skip the system toolkit when every workload is a container or a wheel. Those carry their own runtime and libraries; the host only owes them a sufficiently new driver. Re-introducing a host nvcc only invites a third version into the mix and more drift to police (install lifecycle, driver versions and branches).
Reach for forward compatibility only on datacenter-class hardware, and only when a qualification freeze pins the driver below what a required CUDA-N+1 app needs. It is not a substitute for a driver upgrade on general-purpose fleets, and not available on GeForce. ³
Prefer a minor-pinned meta-package (cuda-toolkit-13-3) over a floating one on any managed node, so an apt upgrade cannot silently jump majors or drag in a driver.

Architecture

The three layers stack bottom-up: the driver defines the ceiling, the toolkit is consumed only at build time, and the runtime the toolkit (or a wheel) produced is what actually loads against the driver at run time.

flowchart LR
  DRV["NVIDIA driver: libcuda.so (driver API, max CUDA = nvidia-smi header)"]
  TK["CUDA Toolkit: nvcc, headers, static libs (nvcc --version)"]
  APP["Application / framework wheel"]
  RT["App-bundled runtime: libcudart.so (+ cuBLAS/cuDNN/NCCL)"]
  COMPAT["cuda-compat-<ver>: forward-compat driver libs (datacenter only)"]

  TK -->|"build time only"| APP
  APP --> RT
  RT -->|"runs on"| DRV
  COMPAT -.->|"opt-in via LD_LIBRARY_PATH"| RT

Read the diagram as two independent gates, both of which must pass for an app to run: the driver's CUDA-major ceiling (what nvidia-smi shows) must cover the runtime's major, and the installed driver branch must meet that major's documented minimum. The next block encodes exactly that logic and checks it, including boundary and adversarial cases.

# Runnable, self-checking model of CUDA minor-version compatibility (numpy not required).
# Encodes the rule this page teaches: an app built against a CUDA MAJOR runs on any
# driver in that family AT OR ABOVE the minimum driver branch.
# Minimums (References): CUDA 13.x -> driver >= 580, 12.x -> 525, 11.x -> 450.
from __future__ import annotations

MIN_DRIVER = {11: 450, 12: 525, 13: 580}   # documented minimum driver branch per CUDA major


def cuda_major(version: str) -> int:
    """Major from a 'MAJOR.MINOR' CUDA string, e.g. '13.3' -> 13."""
    major_text = version.split(".", 1)[0]
    if not major_text.isdigit():
        raise ValueError(f"not a CUDA version: {version!r}")
    return int(major_text)


def app_runs(runtime_cuda: str, driver_max_cuda: str, driver_branch: int) -> bool:
    """Will a wheel built with `runtime_cuda` run under this driver?

    Two independent gates from NVIDIA's compatibility model:
      1. the driver's max supported CUDA major (nvidia-smi header) must be
         >= the runtime's CUDA major, and
      2. the installed driver branch must be >= the documented minimum for
         that major.
    """
    need = cuda_major(runtime_cuda)
    ceiling = cuda_major(driver_max_cuda)
    if need not in MIN_DRIVER:
        raise ValueError(f"unknown CUDA major: {need}")
    return ceiling >= need and driver_branch >= MIN_DRIVER[need]


# Happy path: a CUDA 13.3 wheel on a 13.x / 580 driver runs.
assert app_runs("13.3", "13.0", 580) is True
assert app_runs("12.4", "12.6", 560) is True

# Boundary: exactly the minimum branch is allowed; one below is not.
assert app_runs("13.0", "13.0", 580) is True     # inclusive floor
assert app_runs("13.0", "13.0", 579) is False    # one below the 580 floor
assert app_runs("11.8", "11.8", 450) is True      # 11.x floor
assert app_runs("11.8", "11.8", 449) is False     # just under 450

# Adversarial: driver major ceiling too low (the classic
# "CUDA driver version is insufficient for CUDA runtime version").
assert app_runs("13.3", "12.6", 600) is False     # 600 >= 580 but ceiling major 12 < 13
assert app_runs("12.1", "11.8", 999) is False     # huge branch, still major-blocked

# nvcc vs nvidia-smi: a toolkit differing from the driver ceiling is NORMAL and
# never by itself blocks a wheel that ships its own runtime.
toolkit_nvcc = "13.3"   # what `nvcc --version` reports
driver_smi = "13.0"     # what `nvidia-smi` header reports (driver ceiling)
assert app_runs("12.4", driver_smi, 580) is True  # 12.4 wheel runs though nvcc says 13.3

# Corruption: malformed input is rejected, not silently passed.
for bad in ("", "abc", "x.3", "cuda13"):
    try:
        cuda_major(bad)
    except ValueError:
        pass
    else:
        raise AssertionError(f"expected ValueError for {bad!r}")

# Unknown major (a future 14.x with no recorded minimum) must raise, not guess a floor.
try:
    app_runs("14.0", "14.0", 600)
except ValueError:
    pass
else:
    raise AssertionError("expected ValueError for unknown CUDA major 14")

print("compat_logic: all asserts passed")

Running this prints compat_logic: all asserts passed.

How to install and pin it

apt network repo (Ubuntu 24.04, recommended for fleets). Install the keyring, then a pinned meta-package. ⁴

# 1. NVIDIA CUDA apt repo keyring (Ubuntu 24.04 / x86_64)
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2404/x86_64/cuda-keyring_1.1-1_all.deb
sudo dpkg -i cuda-keyring_1.1-1_all.deb
sudo apt-get update

# 2. Toolkit ONLY, capped at the 13.x series (no driver, no rolling major bump)
sudo apt-get install -y cuda-toolkit-13

Pick the meta-package deliberately; the names encode upgrade policy, and the wrong one drags in a driver or jumps majors on the next apt upgrade: ⁴

cuda-toolkit-13: all toolkit packages, will not upgrade beyond the 13.x series. The right choice for a node that should track 13.x but not silently move to 14.
cuda-toolkit: toolkit, but follows the next major when released. Avoid on managed nodes.
cuda / cuda-13-3: full installs of all CUDA Toolkit packages. cuda does so "with a full desktop experience" and follows the next major when released; cuda-13-3 stays at the named minor. Because these are full installs rather than the toolkit-only pin, prefer cuda-toolkit-* where the driver and footprint are managed separately (CUDA driver, driver by tier).
cuda-runtime-13-3: runtime libraries and driver, no compiler or desktop ("specific for compute nodes"). Installs a driver, so only use it if you want NVIDIA's apt repo to own the driver too.

For a minor-pinned toolkit, cuda-toolkit-13-3 installs all CUDA Toolkit packages at the named version (13.3) and stays there. Other minor-pinned names also exist (cuda-13-3, cuda-libraries-13-3, cuda-runtime-13-3); of these the Meta Packages table documents a driver only for cuda-runtime-13-3 ("and driver ... Specific for compute nodes"). Verify the exact set against that table for the release you deploy. ⁴
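The naming rules above are regular enough to check mechanically. The following is a sketch, not an official parser: `classify` is a hypothetical helper, the policy labels are this page's wording, and the `pulls_driver` flag encodes only this page's reading of the Meta Packages table (a driver documented for cuda-runtime-* alone), which should be re-verified for the release you deploy.

```python
import re

# Matches cuda, cuda-toolkit, cuda-runtime, cuda-libraries with optional -MAJOR[-MINOR].
_META = re.compile(r"(cuda(?:-toolkit|-runtime|-libraries)?)(?:-(\d+)(?:-(\d+))?)?")


def classify(pkg: str) -> dict:
    """Upgrade policy implied by a CUDA meta-package name."""
    match = _META.fullmatch(pkg)
    if match is None:
        raise ValueError(f"not a recognised CUDA meta-package: {pkg!r}")
    base, major, minor = match.groups()
    if minor is not None:
        policy = "minor-pinned"      # e.g. cuda-toolkit-13-3 stays at 13.3
    elif major is not None:
        policy = "major-capped"      # e.g. cuda-toolkit-13 stays within 13.x
    else:
        policy = "floating"          # e.g. cuda-toolkit follows the next major
    # Assumption from this page: only cuda-runtime-* is documented to add a driver.
    return {"policy": policy, "pulls_driver": base == "cuda-runtime"}


for name in ("cuda-toolkit-13-3", "cuda-toolkit-13", "cuda-toolkit", "cuda-runtime-13-3"):
    print(name, classify(name))
```

A fleet audit could run this over the installed package list and flag anything that is floating or that pulls a driver.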

Post-install PATH and library setup (apt drops files under /usr/local/cuda-13.3/): ⁴

export PATH=/usr/local/cuda-13.3/bin:${PATH}
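# ${VAR:+:${VAR}} avoids a trailing ":" (which would add the current directory) when the variable is unset
export LD_LIBRARY_PATH=/usr/local/cuda-13.3/lib64${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}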

Runfile (air-gapped or a specific minor). The runfile bundles a driver (its version is in the filename, e.g. cuda_13.3.0_<driver>_linux.run) and presents an ncurses menu to select components. On a node whose driver is managed elsewhere, install the toolkit only and never let the runfile touch the driver: ⁴

# Toolkit only, non-interactive; does NOT install the bundled driver
sudo sh cuda_13.3.0_<driver>_linux.run --silent --toolkit

Confirm the exact <driver> string and any per-version flags against the runfile's --help and the install guide before running. Mixing a runfile driver with a DKMS or apt-managed driver is a classic way to strand modules (kernel modules, install lifecycle).
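Because the bundled driver version is encoded in the runfile name, it can be read before anything is executed. A minimal sketch, assuming the `cuda_<toolkit>_<driver>_linux.run` naming shown above; `parse_runfile` is a hypothetical helper, the filename in the example is illustrative, and other platform suffixes are deliberately rejected rather than guessed at.

```python
from __future__ import annotations

import re

_RUNFILE = re.compile(r"cuda_(\d+\.\d+\.\d+)_(\d+(?:\.\d+)+)_linux\.run")


def parse_runfile(name: str) -> tuple[str, str]:
    """Return (toolkit_version, bundled_driver_version) from a runfile name."""
    match = _RUNFILE.fullmatch(name)
    if match is None:
        raise ValueError(f"unrecognised runfile name: {name!r}")
    return match.group(1), match.group(2)


def driver_branch(driver_version: str) -> int:
    """Driver branch is the leading component, e.g. '550.54.14' -> 550."""
    return int(driver_version.split(".", 1)[0])


# Illustrative filename in the documented pattern.
toolkit, bundled = parse_runfile("cuda_12.4.0_550.54.14_linux.run")
print(f"toolkit {toolkit}, bundled driver {bundled} (branch {driver_branch(bundled)})")
```

Comparing the bundled branch with the managed driver's branch is a cheap pre-flight check that makes an accidental driver install easier to spot in review.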

How to use it

Once installed, characterise the node's CUDA layers before trusting a build or a wheel. These are diagnostics; the described output is the shape to expect, not specific measured values.

nvcc --version

Prints the installed toolkit release ("Cuda compilation tools, release 13.x") and build date. If nvcc is not found, no system toolkit is installed, which is fine for a wheel-only node. ¹

nvidia-smi

The top-right "CUDA Version" is the driver's maximum supported CUDA (driver API), independent of nvcc. Expect its major to be >= the major of the toolkit you build against; within a major, minor-version compatibility allows it to be lower than the toolkit's minor, so it is frequently a different number than nvcc reports. ¹ ² For the driver-version detail see nvidia-smi reference.

cat /usr/local/cuda/version.json   # toolkit version metadata, when installed
ls -l /usr/local/cuda              # symlink -> /usr/local/cuda-13.3

Confirms which minor the /usr/local/cuda symlink resolves to, that is, the version a bare nvcc or a -I/usr/local/cuda/include build will pick up. ⁴
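The same check can be scripted for inventory collection. This sketch assumes version.json holds a top-level "cuda" object with a "version" field; that layout is an assumption to confirm against the file on your own node, and both helper names are hypothetical. The sample JSON is illustrative, not read from a real install.

```python
from __future__ import annotations

import json
import os

# Illustrative content in the assumed layout; not read from a real install.
SAMPLE_VERSION_JSON = '{"cuda": {"name": "CUDA SDK", "version": "13.3.0"}}'


def toolkit_version(version_json_text: str) -> str:
    """Toolkit version from version.json text (assumed layout, see lead-in)."""
    data = json.loads(version_json_text)
    try:
        return data["cuda"]["version"]
    except (KeyError, TypeError) as exc:
        raise ValueError("version.json does not have the assumed layout") from exc


def active_toolkit_dir(link: str = "/usr/local/cuda") -> str | None:
    """Resolved target of the toolkit symlink, or None if no toolkit is installed."""
    return os.path.realpath(link) if os.path.exists(link) else None


print("toolkit version:", toolkit_version(SAMPLE_VERSION_JSON))
print("active toolkit dir:", active_toolkit_dir())
```

Returning None for a missing symlink keeps a wheel-only node, which has no system toolkit, from being reported as an error.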

How to integrate with frameworks

A framework wheel carries its own runtime, so the number that matters for it is the CUDA it was built with, not the host toolkit. This block reads that:

# Reference template, not hardware-tested (needs torch + a GPU). App-bundled runtime vs driver ceiling.
import torch
print(torch.version.cuda)              # CUDA the wheel was BUILT with (its bundled runtime)
print(torch.cuda.is_available())       # False if no GPU is visible or the driver is too old for that runtime

torch.version.cuda is the wheel's runtime, set at build time and unrelated to any host toolkit; given a visible GPU, availability depends on the driver meeting the minimum for that CUDA major, not on a host toolkit. ² If it is False on a node with a healthy nvidia-smi, suspect a too-old driver for the wheel's CUDA major, not a missing toolkit (GPU frameworks).

The driver-side arithmetic behind that is_available() gate is small and testable without a GPU. This numpy-only block validates it across a whole fleet, checked against a slow scalar reference plus boundary and adversarial cases:

# numpy-only validation of the driver gate the torch template illustrates:
# with a GPU visible, torch.cuda.is_available() requires the node's driver branch to meet
# the documented minimum for the wheel's CUDA major. We vectorise that gate across a fleet
# and check it against a slow scalar reference.
from __future__ import annotations

import numpy as np

MIN_DRIVER = {11: 450, 12: 525, 13: 580}


def fleet_available(driver_branches: np.ndarray, wheel_cuda_major: int) -> np.ndarray:
    """Vectorised availability: which nodes can run a wheel of this CUDA major."""
    if wheel_cuda_major not in MIN_DRIVER:
        raise ValueError(f"unknown CUDA major: {wheel_cuda_major}")
    floor = MIN_DRIVER[wheel_cuda_major]
    return np.asarray(driver_branches) >= floor


def _reference(driver_branches, wheel_cuda_major: int) -> list[bool]:
    """Deliberately-slow scalar reference to check the vectorised version."""
    floor = MIN_DRIVER[wheel_cuda_major]
    return [bool(int(d) >= floor) for d in driver_branches]


# A fleet spanning the 13.x floor (580), one just below it, and older branches.
branches = np.array([525, 570, 579, 580, 581, 600, 470])

got = fleet_available(branches, 13)
assert got.tolist() == _reference(branches, 13)                       # equivalence to slow ref
assert got.tolist() == [False, False, False, True, True, True, False]  # explicit expected mask

# Boundary is inclusive: exactly 580 passes, 579 fails.
assert fleet_available(np.array([580]), 13).item() is True
assert fleet_available(np.array([579]), 13).item() is False

# Same fleet, older wheel (CUDA 11.x, floor 450) -> far more nodes qualify.
got11 = fleet_available(branches, 11)
assert got11.tolist() == _reference(branches, 11)
assert got11.tolist() == [True, True, True, True, True, True, True]

# Adversarial: an unknown CUDA major must raise, never assume a floor.
try:
    fleet_available(branches, 14)
except ValueError:
    pass
else:
    raise AssertionError("expected ValueError for unknown CUDA major 14")

# Empty fleet is a valid length-0 result, not a crash.
assert fleet_available(np.array([], dtype=int), 13).tolist() == []

print("fleet_avail: all asserts passed; qualifying nodes for CUDA 13.x =",
      int(fleet_available(branches, 13).sum()), "of", branches.size)

Running this prints fleet_avail: all asserts passed; qualifying nodes for CUDA 13.x = 3 of 7.

How to run it in production

Host vs container. In containers the host normally provides only the driver; the NVIDIA Container Toolkit injects libcuda.so and the device nodes, while the image carries the toolkit, runtime, and libraries it needs. Do not bake a system toolkit into a node just to run containers; let the image own it, and keep the host to driver plus container toolkit.

Forward-compatibility package (datacenter only). cuda-compat-<major>-<minor> ships driver libraries (libcuda.so.*, libnvidia-ptxjitcompiler.so.*, libnvidia-nvvm.so.*, and so on) into /usr/local/cuda-13.3/compat/; it does not configure the loader. Point individual apps at it via LD_LIBRARY_PATH. An ld.so.conf.d entry also works but applies to every process, so do not add one (or prepend the path system-wide) unless overriding the installed driver libraries for the whole node is the intent. ³

sudo apt-get install -y cuda-compat-13-3
# Opt a specific app in, e.g.:  export LD_LIBRARY_PATH=/usr/local/cuda-13.3/compat${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}

Forward compatibility is "applicable only for systems with: NVIDIA Data Center GPUs[,] Select NGC Server Ready SKUs of RTX cards[,] Jetson boards". It is not a path for GeForce (driver by tier, driver versions and branches). ³

To prove forward compatibility end-to-end you would build a sample with a newer toolkit, run it on an older datacenter driver with the compat libraries on LD_LIBRARY_PATH, and confirm it loads; treat the numbers from any such run as environment-specific. ³
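One way to keep the compat libraries scoped to a single app is to build the environment for that one process instead of exporting it in a shell profile. A minimal sketch; `compat_env` is a hypothetical helper, and the compat directory is the path used on this page.

```python
from __future__ import annotations

import os


def compat_env(compat_dir: str, base: dict[str, str] | None = None) -> dict[str, str]:
    """Copy of the environment with compat_dir prepended to LD_LIBRARY_PATH.

    The caller's environment is not modified, so only the process launched
    with the returned mapping sees the forward-compat libraries.
    """
    env = dict(os.environ if base is None else base)
    current = env.get("LD_LIBRARY_PATH", "")
    # Avoid a trailing ":" when the variable was empty or unset.
    env["LD_LIBRARY_PATH"] = compat_dir + (":" + current if current else "")
    return env


env = compat_env("/usr/local/cuda-13.3/compat", base={"LD_LIBRARY_PATH": "/opt/lib"})
print(env["LD_LIBRARY_PATH"])
```

The returned mapping would be passed as the `env` argument of `subprocess.run` when launching the one app that needs forward compatibility; every other process on the node keeps loading the installed driver's libraries.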

How to maintain it

Keep the driver and the toolkit on separate upgrade tracks. Pin the toolkit with a minor meta-package (cuda-toolkit-13-3) and let the driver be owned by DKMS/apt or the container toolkit, never by a runfile that also drops a driver (install lifecycle, kernel modules).
Re-check nvcc against nvidia-smi after any change. They are allowed to differ; what you are watching for is the driver dropping below an app's CUDA-major minimum (>= 580 for 13.x). ²
Remove stale forward-compat packages on system upgrades. Older cuda-compat packages are not supported on new driver versions, so a leftover one can shadow or conflict with the upgraded driver; drop it as part of the upgrade. ³
Watch the meta-package on apt upgrade. A floating name (cuda, cuda-toolkit) will follow the next major; audit installed CUDA meta-packages so a node cannot silently move majors (driver versions and branches).
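The stale forward-compat check can be sketched as a filter over installed package names. The staleness rule used here is an assumption for illustration: a cuda-compat package is flagged when the installed driver's own CUDA ceiling already covers its version. `stale_compat` is a hypothetical helper, and the package list in the example is made up.

```python
from __future__ import annotations

import re

_COMPAT = re.compile(r"cuda-compat-(\d+)-(\d+)")


def stale_compat(packages: list[str], driver_ceiling: str) -> list[str]:
    """cuda-compat packages whose CUDA version the driver already covers.

    Assumption: once the driver's own ceiling (nvidia-smi header) is at or
    above a compat package's version, that package is a removal candidate.
    """
    major_text, minor_text = driver_ceiling.split(".", 1)
    ceiling = (int(major_text), int(minor_text))
    stale = []
    for pkg in packages:
        match = _COMPAT.fullmatch(pkg)
        if match and (int(match.group(1)), int(match.group(2))) <= ceiling:
            stale.append(pkg)
    return stale


installed = ["cuda-compat-12-8", "cuda-compat-13-3", "cuda-toolkit-13-3"]
print(stale_compat(installed, driver_ceiling="13.0"))
```

Running a filter like this as part of the driver-upgrade playbook turns "drop it as part of the upgrade" into a reviewable list rather than a manual step.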

How to scale it across a fleet

Standardise on one driver branch, let frameworks float within it. Minor-version compatibility means a single qualified driver (>= the family minimum) serves every wheel in that CUDA major, so you upgrade PyTorch/JAX without touching the driver. ² The fleet_available block above is the exact predicate for "which nodes can take this wheel".
HPC / Lmod nodes: expose the toolkit as modules, not a system default. On shared clusters the toolkit is usually offered as environment modules so users select a version per job. The exact module name is site-defined; verify with module avail:

module avail cuda
module load cuda/13.3      # site-defined name; sets PATH, LD_LIBRARY_PATH, CUDA_HOME
nvcc --version

Container fleets: version lives in the image. With the host reduced to driver plus NVIDIA Container Toolkit, scaling CUDA versions is a matter of image tags, and the host driver only has to satisfy the newest major any image ships.
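For a container fleet, the host driver floor follows directly from the images in use. A sketch using the minimum branches quoted on this page; `required_branch` is a hypothetical helper and the image CUDA versions in the example are illustrative.

```python
# Minimum driver branch per CUDA major, as listed on this page.
MIN_DRIVER = {11: 450, 12: 525, 13: 580}


def required_branch(image_cuda_versions: list[str]) -> int:
    """Lowest host driver branch that satisfies every image's CUDA major."""
    majors = {int(version.split(".", 1)[0]) for version in image_cuda_versions}
    if not majors:
        raise ValueError("no image CUDA versions given")
    unknown = majors - set(MIN_DRIVER)
    if unknown:
        # Refuse to guess a floor for a major with no recorded minimum.
        raise ValueError(f"unknown CUDA major(s): {sorted(unknown)}")
    # The newest major dominates, because floors rise with each major.
    return max(MIN_DRIVER[major] for major in majors)


print(required_branch(["11.8", "12.4", "13.0"]))
print(required_branch(["11.8", "12.1"]))
```

Evaluating this over the image registry before retiring or freezing a driver branch shows which images, if any, the freeze would strand.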

Failure modes

Brief; each links its runbook.

nvcc and nvidia-smi disagree: expected behaviour, not a bug. nvcc reports the installed toolkit, nvidia-smi the driver ceiling. ¹ It is only a problem if the driver is below the app's CUDA-major minimum (>= 580 for 13.x). ²
App built with newer CUDA than the driver supports: CUDA driver version is insufficient for CUDA runtime version. Fix by raising the driver to the minimum for that major, or applying forward compatibility on datacenter hardware. ²³ See runbook: kernel / GPU missing for the broader "GPU not usable after change" path.
Forward-compat package shadowing the real driver: a system-wide compat path makes every process load the compat package's libcuda.so instead of the installed driver's; scope it per-app instead. ³
Wrong meta-package pulled a driver: the Meta Packages table documents a driver for cuda-runtime-13-3 ("specific for compute nodes"), which can collide with the DKMS or apt-managed one; prefer cuda-toolkit-* to avoid pulling a driver at all. ⁴ Recovery via runbook: driver upgrade.
Runfile driver vs managed driver: a runfile that installed its bundled driver strands modules on the next kernel bump if DKMS owns a different one (kernel modules); see runbook: kernel / GPU missing.

References

CUDA Installation Guide for Linux (apt/runfile commands, meta-packages, post-install PATH): https://docs.nvidia.com/cuda/cuda-installation-guide-linux/index.html
CUDA Compatibility (overview, minimum-driver and runtime-vs-driver distinction): https://docs.nvidia.com/deploy/cuda-compatibility/index.html
Minor Version Compatibility (CUDA 13.x >= 580, 12.x >= 525, 11.x >= 450; nvcc vs nvidia-smi): https://docs.nvidia.com/deploy/cuda-compatibility/minor-version-compatibility.html
Forward Compatibility (cuda-compat package, datacenter-only, /usr/local/cuda-<ver>/compat): https://docs.nvidia.com/deploy/cuda-compatibility/forward-compatibility.html
CUDA Toolkit 13.3 Release Notes (toolkit version, bundled/minimum driver): https://docs.nvidia.com/cuda/cuda-toolkit-release-notes/index.html
Supported Drivers and CUDA Toolkit Versions: https://docs.nvidia.com/datacenter/tesla/drivers/supported-drivers-and-cuda-toolkit-versions.html

¹ nvidia-smi reports the driver-API maximum CUDA (from the driver package); nvcc reports the installed toolkit's runtime-API version; the two can differ on a healthy node. CUDA Installation Guide for Linux and Minor Version Compatibility: https://docs.nvidia.com/cuda/cuda-installation-guide-linux/index.html, https://docs.nvidia.com/deploy/cuda-compatibility/minor-version-compatibility.html
² Minor-version compatibility: apps built with a CUDA major-release toolkit run on any driver in that family at or above the minimum (13.x >= 580, 12.x >= 525, 11.x >= 450). https://docs.nvidia.com/deploy/cuda-compatibility/minor-version-compatibility.html
³ Forward Compatibility uses cuda-compat-<major>-<minor>, installs driver libs to /usr/local/cuda-<ver>/compat/, does not configure the loader (use LD_LIBRARY_PATH/ld.so.conf), and is applicable only to NVIDIA Data Center GPUs, select NGC Server Ready RTX SKUs, and Jetson; older forward-compatibility packages are not supported on new driver versions, so remove them on system upgrades. https://docs.nvidia.com/deploy/cuda-compatibility/forward-compatibility.html
⁴ apt network-repo flow (cuda-keyring_1.1-1_all.deb, apt-get install cuda-toolkit-13), meta-package semantics (cuda-toolkit-13 caps at 13.x; cuda-toolkit-13-3 is the minor-pinned toolkit-only package; the Meta Packages table ties a driver only to cuda-runtime-13-3, "and driver ... Specific for compute nodes"), runfile --silent --toolkit, and /usr/local/cuda-13.3 PATH/LD_LIBRARY_PATH. CUDA Installation Guide for Linux: https://docs.nvidia.com/cuda/cuda-installation-guide-linux/index.html