
TensorRT)¶

Scope: how ML frameworks ship their own CUDA/cuDNN/NCCL inside wheels and containers, why this decouples the application stack from the host stack, the driver floor that still gates them, and the handful of checks that confirm a framework actually sees the GPUs.

What it is¶

PyTorch, JAX and TensorRT form the application layer that sits on top of the node software stack (GPU Software Stack and Node Administration). The operationally important fact is that a modern framework does not consume the host CUDA toolkit. The pip wheel (or the NGC container) carries its own copy of the CUDA runtime (libcudart), cuDNN, cuBLAS and NCCL. The only host component the framework loads at runtime is the CUDA driver library (libcuda.so, the driver API), which is installed with the host NVIDIA driver alongside the kernel module and is versioned independently of the toolkit, with its own forward- and backward-compatibility rules (CUDA Driver, CUDA Toolkit and Runtime).

Consequence: the version that matters for "will this framework run on this node" is the host driver branch, not whatever CUDA toolkit (if any) is installed on the host. Two pods on the same node can run PyTorch built against CUDA 12.x and CUDA 13.x simultaneously, because each brings its own runtime; both only require the host driver to satisfy the higher of the two minimum-driver requirements.

Why use it¶

Decoupling cadence. Framework releases move faster than the fleet driver. Bundling the runtime lets a team adopt a new PyTorch (and the CUDA features it needs) without a fleet-wide driver upgrade (Driver Install and Lifecycle, Driver Versions and Branches). The driver upgrade (the disruptive, reboot-class change) is decoupled from the framework upgrade, which is a pip install or an image tag bump.
Reproducibility. The bundled stack pins exact cuDNN/NCCL versions, so a build is reproducible across nodes regardless of host drift. This is the entire argument for NGC containers: a single tag pins a tested PyTorch + CUDA + cuDNN + NCCL + Ubuntu combination (see How to run it in production).
One boundary to reason about. Because everything above the driver ships inside the wheel or image, the only host variable is the driver branch. That collapses "does my stack work here" from a matrix of toolkit/cuDNN/NCCL versions to a single scalar comparison, which is exactly the check under Architecture.

When to use it (and when not)¶

Use bundled wheels and NGC containers as the default for every training and inference workload: they are how you decouple the framework cadence from the driver cadence. The two items below are the exceptions to plan for: the driver floor, where the host still constrains you, and the forward-compatibility escape hatch, which applies only to specific branches.

The host driver still gates you. Decoupling has one floor: each CUDA major series has a minimum driver version. The R580 data center driver is the branch released alongside CUDA Toolkit 13.x,¹ and a CUDA-13 wheel or container needs a host driver >= 580 on Linux.² A CUDA-12 wheel runs on older branches. If the host driver is below the floor, the framework reports no devices even though the GPUs are healthy (Kernel Upgrade: GPU Missing).
Datacenter forward-compatibility (escape hatch), used sparingly. When you genuinely cannot move the host driver, the CUDA Forward Compatibility package lets a newer-CUDA container run on an older datacenter driver branch, but only specific branches qualify. NVIDIA's support matrix explicitly excludes the old branches: "users should upgrade from all R418, R440, R450, R460, R510, R520, R530, R545, R555, and R560 drivers, which are not forward-compatible with CUDA 12.8."³ Treat forward-compat as a stopgap, not a fleet strategy; prefer pinning one LTS branch (Driver and Feature Support by GPU Tier).

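Eligibility for that escape hatch is a set-membership test against the exclusion list, not "is the branch recent enough": R560 is a higher branch number than R545, yet both are excluded. The block below encodes the native floor plus the exclusion list and asserts the decision, including the adversarial case that a "recent but excluded" branch (R560) must be rejected even with the package installed. It is a simplified model: it applies the exclusion list quoted for CUDA 12.8 to a CUDA 13 container and treats every unlisted below-floor branch as eligible, so confirm the qualifying branches for your target CUDA release in the support matrix.³ Validated with numpy 2.4.6 under python3: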

# Runnable on system python3 (standard library only). Core decision the page teaches:
# the CUDA Forward Compatibility escape hatch lets a newer-CUDA container run on an
# OLDER data-center driver branch, but ONLY specific branches qualify. NVIDIA's support
# matrix excludes R418/R440/R450/R460/R510/R520/R530/R545/R555/R560 for CUDA 12.8.
# Eligibility is NOT "is the branch recent enough"; it is set-membership exclusion.

# The exact exclusion list quoted above from NVIDIA's support matrix.
FWD_COMPAT_EXCLUDED = frozenset({418, 440, 450, 460, 510, 520, 530, 545, 555, 560})
# Native floor: without forward-compat, a CUDA-13 container needs driver >= 580.
NATIVE_CUDA13_FLOOR = 580


def can_run_cuda13(host_branch: int, forward_compat_installed: bool) -> bool:
    """Can a CUDA-13 container run here? Native path OR a qualifying forward-compat branch."""
    if host_branch >= NATIVE_CUDA13_FLOOR:
        return True                                   # native, no escape hatch needed
    if not forward_compat_installed:
        return False
    return host_branch not in FWD_COMPAT_EXCLUDED     # excluded branches never qualify


# 1. Native path dominates: an R580 host runs CUDA 13 with or without the package.
assert can_run_cuda13(580, forward_compat_installed=False) is True
assert can_run_cuda13(580, forward_compat_installed=True) is True

# 2. Escape hatch: a below-floor branch NOT on the exclusion list can run CUDA 13
#    only when the forward-compat package is present.
assert can_run_cuda13(570, forward_compat_installed=False) is False   # below floor, no hatch
assert can_run_cuda13(570, forward_compat_installed=True) is True     # 570 not excluded

# 3. Adversarial / the trap this section warns about: several EXCLUDED branches are
#    numerically "recent" (e.g. R560, R555) yet must be rejected even WITH the
#    package. "Recent enough" is the wrong mental model; membership is what counts.
for excluded in sorted(FWD_COMPAT_EXCLUDED):
    assert can_run_cuda13(excluded, forward_compat_installed=True) is False, excluded
# R560 (excluded) is a HIGHER number than R555/R545 yet still ineligible: prove that
# a naive "branch >= some threshold" rule would wrongly admit it.
naive_threshold = 500
assert 560 >= naive_threshold                          # a naive rule would admit R560
assert can_run_cuda13(560, forward_compat_installed=True) is False, \
    "R560 is on the exclusion list and must be rejected despite being 'recent'"

# 4. Equivalence to an explicit reference over a full branch sweep: the decision must
#    match (native OR (package AND not-excluded)) computed independently.
def reference(host_branch: int, pkg: bool) -> bool:
    native = host_branch >= NATIVE_CUDA13_FLOOR
    hatch = pkg and (host_branch not in FWD_COMPAT_EXCLUDED)
    return native or hatch

branches = list(range(410, 600, 5)) + sorted(FWD_COMPAT_EXCLUDED)
for b in branches:
    for pkg in (False, True):
        assert can_run_cuda13(b, pkg) == reference(b, pkg), (b, pkg)

# 5. Monotonic sanity: with the package installed, once a non-excluded branch is at
#    or above the native floor it stays runnable as the branch number rises.
above_floor = [can_run_cuda13(b, True) for b in range(580, 600)]
assert all(above_floor), "every branch >= 580 runs CUDA 13 natively"

n_excluded = len(FWD_COMPAT_EXCLUDED)
print(f"forward-compat gating OK: {n_excluded} excluded branches all rejected even with the",
      "package; R560 (recent but excluded) rejected; matches reference over full sweep")

Architecture¶

The runtime dependency is a single boundary: the host supplies only the driver (libcuda.so); the wheel or container supplies the CUDA runtime, cuDNN and NCCL above it.

flowchart LR
  HD["Host driver libcuda.so R570 / R580"] --> WHEEL["pip wheel or NGC container"]
  WHEEL --> RT["Bundled CUDA runtime libcudart"]
  WHEEL --> CUDNN["Bundled cuDNN"]
  WHEEL --> NCCL["Bundled NCCL"]
  RT --> FW["PyTorch / JAX / TensorRT"]
  CUDNN --> FW
  NCCL --> FW

This is the same boundary the NVIDIA Container Toolkit and CDI enforces: the toolkit injects the host driver into the container; the container supplies everything above it.

Because the driver is the only host component in play, "will this framework run on this node" reduces to one comparison: is the host driver branch at or above the bundled CUDA series' minimum-driver floor? When several containers share a node, the host driver must clear the highest floor among them; then all of them run. The block below encodes that decision, checks it against a per-wheel reference over 2000 random cases, and asserts the adversarial case that a host with only an old CUDA toolkit (or none) changes nothing because the toolkit is not on the runtime path. Validated with numpy 2.4.6 under python3:

# Runnable on system python3 (numpy). Core decision the page teaches: "will this
# framework run on this node?" is gated by the HOST DRIVER branch vs each wheel's
# bundled-CUDA minimum-driver floor, NOT by any host CUDA toolkit. When several
# containers share a node, the host driver must satisfy the MAXIMUM floor in use.
import numpy as np

# Documented CUDA-major -> minimum host driver branch (Linux data-center driver).
# CUDA 13.x needs driver >= 580; CUDA 12.x runs on older branches. (JAX install
# page + R580 release notes; the >= 580 floor for CUDA 13 is stated above.)
CUDA_MAJOR_TO_MIN_DRIVER = {12: 525, 13: 580}


def framework_runs(host_driver: int, wheel_cuda_major: int) -> bool:
    """A single wheel/container runs iff the host driver meets its CUDA floor."""
    floor = CUDA_MAJOR_TO_MIN_DRIVER[wheel_cuda_major]
    return host_driver >= floor


def node_ok(host_driver: int, wheel_cuda_majors: list[int]) -> bool:
    """Co-scheduled wheels on ONE node: host driver must clear the highest floor."""
    assert wheel_cuda_majors, "at least one wheel"
    required = max(CUDA_MAJOR_TO_MIN_DRIVER[m] for m in wheel_cuda_majors)
    return host_driver >= required


# 1. The headline example: a CUDA-12 and a CUDA-13 pod share a node. The host driver
#    only has to satisfy the HIGHER floor (580), after which BOTH run.
assert node_ok(580, [12, 13]) is True
assert framework_runs(580, 12) and framework_runs(580, 13)

# 2. Boundary: exactly at the floor passes; one branch below fails (">=", not ">").
assert framework_runs(580, 13) is True            # 580 == floor
assert framework_runs(579, 13) is False           # 579 < 580 -> no devices
assert framework_runs(525, 12) is True and framework_runs(524, 12) is False

# 3. The asymmetric failure the "Failure modes" section calls out: a CUDA-13 image
#    fails on an R570 host while the CUDA-12 image on the SAME host still works
#    ("works in one container, fails in another"). node_ok must reject the mix.
assert framework_runs(570, 12) is True
assert framework_runs(570, 13) is False
assert node_ok(570, [12, 13]) is False, "R570 clears the 12-floor but not the 13-floor"

# 4. Equivalence to a slow reference: node_ok(d, majors) must equal ANDing the
#    per-wheel decision over every wheel, for a sweep of drivers and wheel sets.
def node_ok_ref(host_driver: int, majors: list[int]) -> bool:
    return all(framework_runs(host_driver, m) for m in majors)

rng = np.random.default_rng(0)
for _ in range(2000):
    d = int(rng.integers(500, 620))
    majors = [int(m) for m in rng.choice([12, 13], size=int(rng.integers(1, 4)))]
    assert node_ok(d, majors) == node_ok_ref(d, majors), (d, majors)

# 5. Adversarial: the classic misdiagnosis is "the host CUDA toolkit is too old."
#    framework_runs takes no toolkit argument, so the verdict cannot depend on it;
#    the loop below only makes that explicit by sweeping an unused decoy value.
for _host_toolkit in (None, 11, 12, 13):           # decoy variable, never consulted
    assert framework_runs(580, 13) is True
    assert framework_runs(579, 13) is False

# 6. Monotonicity: raising the host driver never turns a runnable node unrunnable.
for majors in ([12], [13], [12, 13]):
    ok = [node_ok(d, majors) for d in range(500, 620)]
    assert ok == sorted(ok), (majors, "runnable is monotone non-decreasing in driver")

print("driver-floor gating OK: mixed [12,13] needs >=580; 579 fails, 580 passes;",
      "verdict matches per-wheel reference over 2000 cases; host toolkit irrelevant")

How to use it¶

Two delivery mechanisms. Pick one per environment and do not mix host-CUDA assumptions into either.

Pip wheels (bundled CUDA, the default)¶

PyTorch and JAX wheels carry their own CUDA runtime; no host toolkit required, only a sufficient host driver.

# PyTorch: CUDA 13.x build (bundles cudart/cuDNN/NCCL); needs host driver >= 580
pip install torch --index-url https://download.pytorch.org/whl/cu130

# JAX: CUDA 13 wheels (Linux only); needs host driver >= 580
pip install --upgrade "jax[cuda13]"

The jax[cuda13] extra pulls the CUDA/cuDNN pip packages; jax[cuda13-local] instead uses a CUDA toolkit you installed on the host, for sites that must use a system CUDA.² On CUDA 13, JAX supports GPUs with SM (compute capability) 7.5 or newer.² Do not hardcode a wheel index URL or extra from memory; confirm it against the official install pages listed under References.
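Before installing a CUDA 13 wheel, the driver floor can be checked against the host driver version string. The following is a minimal sketch, not an official tool: the helper names (`driver_branch`, `meets_floor`, `host_driver_version`) are hypothetical, the floors are the ones stated on this page, and it assumes the version string has the usual `BRANCH.MINOR.PATCH` shape that `nvidia-smi --query-gpu=driver_version --format=csv,noheader` prints.

```python
import subprocess

# Floors as stated on this page (Linux): CUDA 12.x -> 525, CUDA 13.x -> 580.
MIN_DRIVER_BRANCH = {12: 525, 13: 580}


def driver_branch(version: str) -> int:
    """Return the branch (leading integer) of a driver version like '580.65.06'."""
    return int(version.strip().split(".")[0])


def meets_floor(version: str, cuda_major: int) -> bool:
    """True if the driver version clears the floor for the wheel's CUDA major."""
    return driver_branch(version) >= MIN_DRIVER_BRANCH[cuda_major]


def host_driver_version() -> str:
    """Ask nvidia-smi for the first GPU's driver version (only works on a GPU host)."""
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=driver_version", "--format=csv,noheader"],
        check=True, capture_output=True, text=True,
    ).stdout
    return out.splitlines()[0].strip()


# Illustrative check with a literal version string; on a real node, pass
# host_driver_version() instead.
for major in sorted(MIN_DRIVER_BRANCH):
    print(f"CUDA {major} wheel on driver 580.65.06:", meets_floor("580.65.06", major))
```

Comparing only the branch number is a simplification; if a floor is ever specified down to a minor version, compare the full dotted version instead.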

Smoke test: does the framework see the GPUs?¶

The single most useful framework smoke test answers "does the framework see the GPUs, and with which bundled stack?" Run inside the wheel/container environment.

python - <<'PY'
import torch
print("torch", torch.__version__)
print("cuda available", torch.cuda.is_available())   # bool: driver + runtime usable
print("torch.version.cuda", torch.version.cuda)       # CUDA the wheel was built with, or None for CPU build
print("cudnn", torch.backends.cudnn.version())        # int, see decoding below
print("nccl", torch.cuda.nccl.version())              # tuple, e.g. (2, 18, 1)
print("device count", torch.cuda.device_count())
print("device 0", torch.cuda.get_device_name(0) if torch.cuda.is_available() else "none")
PY

Expected (healthy node): cuda available is True; torch.version.cuda is the build's CUDA string (e.g. a 12.x or 13.x value), not the host driver's CUDA; device count equals the number of GPUs the container/host exposes; get_device_name(0) is the marketed board name.

Interpreting failure: if torch.cuda.is_available() is False but nvidia-smi lists healthy GPUs, the fault is almost always the boundary, not the GPU: wrong/too-old host driver for the bundled CUDA, missing device injection into the container, or a CPU-only wheel (torch.version.cuda is None). Cross-check with nvidia-smi Reference on the host first.
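The interpretation rules above can be written down as a small decision function. This is a sketch under stated assumptions: `triage` is a hypothetical helper, its inputs are the values printed by the smoke test plus a manual "does nvidia-smi list the GPUs" observation, and the order of checks is one reasonable choice rather than a prescribed procedure.

```python
from typing import Optional


def triage(is_available: bool, build_cuda: Optional[str], smi_lists_gpus: bool) -> str:
    """Map smoke-test outputs to the most likely fault class."""
    if is_available:
        return "ok"
    if build_cuda is None:
        # torch.version.cuda is None: a CPU-only wheel was installed.
        return "cpu-only wheel"
    if not smi_lists_gpus:
        # The host itself does not list healthy GPUs: fix the host first.
        return "host fault"
    # Host GPUs are healthy and the wheel has CUDA, yet nothing is visible.
    return "boundary: driver floor or device injection"


# Illustrative inputs; in practice pass torch.cuda.is_available() and
# torch.version.cuda from inside the wheel/container environment.
print(triage(False, "13.0", True))
```

The function deliberately returns a fault class, not a fix: the boundary case still needs the driver-floor comparison and a check that devices were injected into the container.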

JAX has the analogous check:

python -c "import jax; print(jax.devices())"

Expected: a list of CudaDevice (or gpu) entries, one per visible GPU. A list containing only CPU devices (JAX's CPU fallback) or a backend initialization error means the driver floor or device visibility is wrong, the same boundary diagnosis as above.²

How to integrate with it¶

The container path pins the same boundary as the wheel path: everything above the driver ships in the image, and the NVIDIA Container Toolkit and CDI injects the host driver at run time. Do not install a host CUDA toolkit to "help" the container; it is not on the framework's runtime path (the Architecture block asserts exactly this).

TensorRT / Torch-TensorRT (optimised inference) integrates at the model-artifact level. Standalone TensorRT consumes an ONNX model and emits a serialized engine; trtexec is the fastest way to confirm a working install and to benchmark.⁶

# Build an engine from ONNX (add --fp16 for half precision), then load and run it
trtexec --onnx=model.onnx --saveEngine=model.engine
trtexec --loadEngine=model.engine --shapes=input:1x3x224x224

Expected: trtexec reports successful engine build then throughput/latency for the benchmark run; treat the absolute numbers as environment-specific, not as published figures. Torch-TensorRT keeps the PyTorch frontend and uses TensorRT as a torch.compile backend, so it integrates into an existing PyTorch graph without an ONNX export step:⁷

# Reference template (needs torch + torch_tensorrt + an NVIDIA GPU). Not executed here.
import torch, torch_tensorrt  # noqa: F401  (import registers the backend)
opt = torch.compile(model, backend="torch_tensorrt")

Reference template, not hardware-tested. Shapes, model names and precision flags are illustrative; supply your own model and validate accuracy after optimisation. The core "will it run" math this integration depends on (the driver floor) is validated by the numpy block under Architecture.

How to run it in production¶

Prefer NGC containers for production: a pinned, tested stack removes host drift as a variable.

NGC containers (pinned, tested stack). Each monthly NGC tag pins one validated combination. NVIDIA states the container "is released monthly to provide you with the latest NVIDIA deep learning software libraries ... The libraries and contributions have all been tested, tuned, and optimized."⁴ For example, the PyTorch 26.05 image bundles PyTorch 2.12.0a0 and CUDA 13.2.1 on Ubuntu 24.04.⁴

# Pull and run a pinned NGC PyTorch image; --gpus all uses the host driver via the container toolkit
docker run --gpus all --rm -it nvcr.io/nvidia/pytorch:26.05-py3

The container only needs the host driver injected by the NVIDIA Container Toolkit and CDI; it does not need a host CUDA toolkit. Driver requirements for any given tag defer to the CUDA Compatibility Guide; confirm the host branch satisfies the image's CUDA series before rollout.⁴ ³ On NVSwitch systems the same node prerequisites still apply: Fabric Manager up and version-matched, Persistence Mode on.

Reference template, not hardware-tested. Image tags, wheel indices and driver floors must be re-verified against the official pages in References before use.
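Because the production argument rests on a pinned tag, a deployment pipeline can reject image references that float. The sketch below is illustrative only: `image_tag` and `is_pinned` are hypothetical helpers, and the parsing assumes the common `registry/path/name:tag` or `name@sha256:...` reference shapes rather than the full image-reference grammar.

```python
from typing import Optional


def image_tag(image: str) -> Optional[str]:
    """Return the tag of an image reference, or None if no tag is given."""
    name = image.split("@", 1)[0]          # drop any digest suffix
    last = name.rsplit("/", 1)[-1]         # ignore a registry port such as host:5000
    if ":" not in last:
        return None
    return last.split(":", 1)[1]


def is_pinned(image: str) -> bool:
    """True if the reference names a digest or an explicit tag other than 'latest'."""
    if "@sha256:" in image:
        return True
    tag = image_tag(image)
    return tag is not None and tag != "latest"


print(is_pinned("nvcr.io/nvidia/pytorch:26.05-py3"))
print(is_pinned("nvcr.io/nvidia/pytorch"))
```

A tag is still mutable in principle; sites that need a stronger guarantee can require the digest form instead.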

How to maintain it¶

Maintenance is mostly version hygiene: log the bundled stack a build actually loaded, and re-run the smoke test after any host change.

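Decode the cuDNN version explicitly. torch.backends.cudnn.version() returns cuDNN's packed CUDNN_VERSION integer. For cuDNN 9 that is MAJOR*10000 + MINOR*100 + PATCH, so 90100 decodes to cuDNN 9.1.0 (matching NVIDIA's macro and the sibling CUDA Libraries page).⁵ cuDNN 8 and earlier packed the major as MAJOR*1000 (for example 8902 for 8.9.2), so the decode below applies to cuDNN 9 builds only. Decode it, do not eyeball it: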

python -c "v=__import__('torch').backends.cudnn.version(); print(v//10000, (v%10000)//100, v%100)"

That divisor is load-bearing: the wrong one (//1000) mis-reads 90100 as major 90. The block below decodes and round-trips the packed integer, and asserts two adversarial corruptions (the wrong divisor, and a "first digit" shortcut that breaks on a future two-digit major) both diverge from the correct decode. Validated with numpy 2.4.6 under python3:

# Runnable on system python3 (standard library only). Core algorithm the page teaches:
# cuDNN's packed version integer CUDNN_VERSION = MAJOR*10000 + MINOR*100 + PATCH, as
# returned by torch.backends.cudnn.version(). Decode it correctly, round-trip it,
# and prove the wrong divisor (//1000) and the leading-digit shortcut both corrupt.


def decode_cudnn(packed: int) -> tuple[int, int, int]:
    """Split the packed cuDNN integer into (major, minor, patch)."""
    assert packed >= 0
    major = packed // 10000
    minor = (packed % 10000) // 100
    patch = packed % 100
    return major, minor, patch


def encode_cudnn(major: int, minor: int, patch: int) -> int:
    """Inverse: pack (major, minor, patch) back into the CUDNN_VERSION integer."""
    assert 0 <= minor < 100 and 0 <= patch < 100
    return major * 10000 + minor * 100 + patch


# 1. Documented anchor: 90100 decodes to cuDNN 9.1.0 (matches NVIDIA's CUDNN_VERSION
#    macro and the sibling cuda-libraries.md page).
assert decode_cudnn(90100) == (9, 1, 0), decode_cudnn(90100)

# 2. Round-trip equivalence against a reference: encode then decode is identity
#    for every plausible (major, minor, patch), checked exhaustively.
for major in range(0, 12):
    for minor in range(0, 100):
        for patch in range(0, 100):
            packed = encode_cudnn(major, minor, patch)
            assert decode_cudnn(packed) == (major, minor, patch), packed

# 3. Boundary values: lowest, a patch-carry edge, and a two-digit minor.
assert decode_cudnn(0) == (0, 0, 0)
assert decode_cudnn(90099) == (9, 0, 99)           # 9.0.99, one below 90100
assert decode_cudnn(encode_cudnn(9, 12, 45)) == (9, 12, 45)

# 4. Adversarial / corruption detection A: the WRONG divisor (//1000) mis-decodes
#    90100 as major 90. Prove it diverges from truth.
def buggy_divisor(packed: int) -> tuple[int, int, int]:
    return packed // 1000, (packed % 1000) // 100, packed % 100

assert buggy_divisor(90100) == (90, 1, 0)          # nonsensical "cuDNN 90.1.0"
assert buggy_divisor(90100) != decode_cudnn(90100), \
    "//1000 divisor MUST corrupt the decode; the correct divisor is //10000"

# 5. Adversarial / corruption detection B: "major = first digit of str(packed)"
#    silently mis-reads a two-digit major (a future cuDNN 10.x -> 100000).
def buggy_leading_digit(packed: int) -> int:
    return int(str(packed)[0])

ten_x = encode_cudnn(10, 0, 0)                      # 100000
assert decode_cudnn(ten_x)[0] == 10
assert buggy_leading_digit(ten_x) != decode_cudnn(ten_x)[0], \
    "leading-digit shortcut MUST corrupt a two-digit major"

# 6. Encode rejects out-of-range minor/patch (the field widths are fixed at 2 digits).
for bad in ((9, 100, 0), (9, 0, 100)):
    try:
        encode_cudnn(*bad)
        raise SystemExit(f"expected AssertionError for {bad}")
    except AssertionError:
        pass

maj, mnr, pat = decode_cudnn(90100)
print(f"cuDNN decode OK: 90100 -> {maj}.{mnr}.{pat}; round-trip verified for 12x100x100 triples;",
      "//1000 divisor and leading-digit shortcut both proven to corrupt")

The companion values need no decoding: torch.cuda.nccl.version() returns a version tuple such as (2, 18, 1), and torch.version.cuda is the build's toolkit string, such as a 13.x value. After any host driver change, re-run the smoke test above and confirm the bundled versions match what you intended to pin; a mismatch means the wheel or image bundled something other than expected.
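Confirming that the loaded stack matches the intended pins can be automated by comparing two small dictionaries. This is a hedged sketch: `decode_cudnn_any` and `stack_drift` are hypothetical helpers, the pinned values are illustrative, and the decoder assumes that only cuDNN 9 and later produce packed values of 10000 or more (cuDNN 8 and earlier used MAJOR*1000, for example 8902 for 8.9.2).

```python
def decode_cudnn_any(packed: int) -> tuple:
    """Decode the packed cuDNN integer under either packing scheme."""
    if packed >= 10000:  # assumption: cuDNN 9+ packing, MAJOR*10000
        return packed // 10000, (packed % 10000) // 100, packed % 100
    return packed // 1000, (packed % 1000) // 100, packed % 100  # cuDNN 8 and earlier


def stack_drift(expected: dict, actual: dict) -> dict:
    """Return {key: (expected, actual)} for every pinned key whose value differs."""
    return {
        key: (want, actual.get(key))
        for key, want in expected.items()
        if actual.get(key) != want
    }


# Illustrative pins; in practice fill `actual` from torch.version.cuda,
# torch.backends.cudnn.version() and torch.cuda.nccl.version().
expected = {"cuda": "13.0", "cudnn": (9, 1, 0), "nccl": (2, 18, 1)}
actual = {"cuda": "13.0", "cudnn": decode_cudnn_any(90100), "nccl": (2, 19, 3)}
print(stack_drift(expected, actual))
```

An empty result means the build loaded exactly what was pinned; any entry names the component to investigate first.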

How to scale it¶

Scaling out is where the co-scheduling floor and the fabric below the framework start to matter.

Co-scheduled CUDA series on one node. When you pack CUDA-12 and CUDA-13 workloads onto the same host, the host driver must clear the highest floor in use (the node_ok decision validated under Architecture). Align the fleet on a single LTS driver branch that meets the highest CUDA series you run, rather than tracking per-image floors (Driver Versions and Branches).
Multi-GPU collectives. The bundled NCCL drives all-reduce / all-gather across GPUs; at scale, throughput and hangs are usually a property of the fabric below the framework, not the wheel. On NVSwitch systems the node prerequisites gate the collective: Fabric Manager up and version-matched, Persistence Mode on, and the fabric validated with NCCL tests before blaming the bundled NCCL (NVSwitch and NVLink).
Inference optimisation. For latency-bound serving, the TensorRT / Torch-TensorRT path above trades a build step for lower per-request cost; scale the engine build into your image pipeline rather than building on the serving host.
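For the co-scheduling rule above, a fleet-level view is often more useful than a per-node verdict: which nodes block the CUDA series you intend to run? The sketch below assumes the floors stated on this page; `fleet_floor`, `nodes_below_floor` and the node names are hypothetical.

```python
# Floors as stated on this page (Linux): CUDA 12.x -> 525, CUDA 13.x -> 580.
MIN_DRIVER_BRANCH = {12: 525, 13: 580}


def fleet_floor(cuda_majors) -> int:
    """Highest driver floor among the CUDA series the fleet must run."""
    return max(MIN_DRIVER_BRANCH[major] for major in cuda_majors)


def nodes_below_floor(node_branches: dict, cuda_majors) -> list:
    """Names of nodes whose driver branch cannot host every requested series."""
    floor = fleet_floor(cuda_majors)
    return sorted(name for name, branch in node_branches.items() if branch < floor)


# Hypothetical inventory: node name -> host driver branch.
inventory = {"gpu-a": 580, "gpu-b": 570, "gpu-c": 535}
print(nodes_below_floor(inventory, {12, 13}))
print(nodes_below_floor(inventory, {12}))
```

The list of blocking nodes is the upgrade backlog for aligning the fleet on one branch; once it is empty, per-image floors no longer need tracking.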

Failure modes¶

Framework reports no GPU, nvidia-smi is fine. Host driver below the bundled CUDA's minimum (CUDA 13.x needs R580+), or devices not injected into the container. Diagnose host-side, then the boundary: Kernel Upgrade: GPU Missing.
CPU-only wheel. torch.version.cuda is None and is_available() is False; the wrong wheel/index was installed. Reinstall the CUDA build (Driver Install and Lifecycle).
NCCL hang / collective failure under multi-GPU. Often a fabric/transport problem below the framework, not the wheel; on NVSwitch systems check Fabric Manager and NVSwitch and NVLink before the bundled NCCL: Fabric Manager Failure.
Works in one container, fails in another on the same host. Two images pin different CUDA series; the host driver satisfies one floor but not the other. Align on a host driver that meets the highest CUDA series in use (Driver Versions and Branches).
Driver upgrade breaks a previously-working image. The host change, not the image, is the variable; follow Rolling Driver / CUDA Upgrade and re-run the smoke test.

References¶

PyTorch CUDA semantics (bundled runtime; driver requirement): https://docs.pytorch.org/docs/2.12/notes/cuda.html
PyTorch torch.backends reference (cudnn.version()): https://docs.pytorch.org/docs/stable/backends.html
PyTorch install selector (wheel index URLs): https://pytorch.org/get-started/locally/
JAX installation (GPU wheels, jax[cuda13], driver >= 580, SM 7.5+): https://docs.jax.dev/en/latest/installation.html
NGC PyTorch container release notes (tested/tuned stack; 26.05 contents): https://docs.nvidia.com/deeplearning/frameworks/pytorch-release-notes/index.html
NGC Frameworks support matrix (driver requirements, forward-compat exclusions): https://docs.nvidia.com/deeplearning/frameworks/support-matrix/index.html
CUDA Compatibility (forward/backward, minor-version): https://docs.nvidia.com/deploy/cuda-compatibility/index.html
NVIDIA data center driver R580 release notes (branch released alongside CUDA Toolkit 13.x): https://docs.nvidia.com/datacenter/tesla/tesla-release-notes-580-65-06/index.html
TensorRT Quick Start Guide (trtexec, ONNX -> engine): https://docs.nvidia.com/deeplearning/tensorrt/latest/getting-started/quick-start-guide.html
Torch-TensorRT (torch.compile backend, ir options): https://docs.pytorch.org/TensorRT/
cuDNN version macro (CUDNN_VERSION = CUDNN_MAJOR*10000 + CUDNN_MINOR*100 + CUDNN_PATCHLEVEL): https://docs.nvidia.com/deeplearning/cudnn/backend/latest/developer/misc.html

¹ NVIDIA Data Center GPU Driver release notes, R580 (580.65.06): software versions list "CUDA Toolkit 13: 13.x", i.e. R580 is the driver branch released alongside CUDA Toolkit 13.x. The >= 580 driver floor for CUDA 13 is documented on the JAX install page (²) and the CUDA Compatibility Guide. https://docs.nvidia.com/datacenter/tesla/tesla-release-notes-580-65-06/index.html
² JAX Installation: pip install --upgrade "jax[cuda13]"; driver version must be >= 580 for CUDA 13 on Linux; CUDA 13 supports SM 7.5+; jax[cuda13-local] uses a host CUDA install. https://docs.jax.dev/en/latest/installation.html
³ NVIDIA NGC Frameworks Support Matrix: forward-compatibility package excludes R418/R440/R450/R460/R510/R520/R530/R545/R555/R560. https://docs.nvidia.com/deeplearning/frameworks/support-matrix/index.html
⁴ NGC PyTorch container release notes: monthly releases are "tested, tuned, and optimized"; release 26.05 = PyTorch 2.12.0a0, CUDA 13.2.1, Ubuntu 24.04. https://docs.nvidia.com/deeplearning/frameworks/pytorch-release-notes/index.html
⁵ cuDNN version integer encoding: cuDNN's cudnnGetVersion() returns the CUDNN_VERSION macro CUDNN_MAJOR*10000 + CUDNN_MINOR*100 + CUDNN_PATCHLEVEL (e.g. 90100 = 9.1.0), and torch.backends.cudnn.version() returns that same integer for a cuDNN 9 build. https://docs.nvidia.com/deeplearning/cudnn/backend/latest/developer/misc.html
⁶ NVIDIA TensorRT Quick Start Guide: trtexec --onnx=... --saveEngine=... builds an engine from ONNX; --loadEngine ... --shapes=... runs it. https://docs.nvidia.com/deeplearning/tensorrt/latest/getting-started/quick-start-guide.html
⁷ Torch-TensorRT: invoke the backend with torch.compile(model, backend="torch_tensorrt"); torch_tensorrt.compile accepts an ir (torch_compile/dynamo). https://docs.pytorch.org/TensorRT/