CUDA math & communication libraries

Scope: the math and collective-communication libraries that sit between the CUDA runtime and the frameworks (cuBLAS, cuDNN, NCCL, CUTLASS, cuFFT): how they are packaged, how they are pinned to a framework, and how to verify the right ones are loaded at runtime.

All snippets below are reference templates, not hardware-tested. Treat package versions and SONAMEs as illustrative; resolve the exact pins against the official matrix for your CUDA toolkit and framework build before deploying.

What it is

A thin layer of NVIDIA libraries that the framework links against. They sit on top of the CUDA driver (libcuda.so) and runtime (libcudart.so) and are consumed by PyTorch, JAX, and TensorRT (see Frameworks). For their place in the wider stack, see GPU Software Stack and Node Administration.

cuBLAS: dense linear algebra (GEMM, GEMV, batched GEMM). The matmul engine under every dense layer. Ships inside the CUDA Toolkit (CUDA Toolkit and Runtime); SONAME libcublas.so.12 on CUDA 12. cuBLASLt is the lightweight GEMM-only companion API, with finer control over algorithm selection and epilogue fusion. (cuBLAS docs)
cuDNN: deep-learning primitives (convolution, attention, normalization, pooling, the graph API). Distributed separately from the toolkit, versioned on its own major (cuDNN 9.x), SONAME libcudnn.so.9. (cuDNN install)
NCCL: multi-GPU / multi-node collectives (all-reduce, all-gather, reduce-scatter, broadcast, reduce). The data plane for every data-parallel and tensor-parallel job. Distributed separately, SONAME libnccl.so.2. Topology- and transport-aware: NVLink/NVSwitch (see NVSwitch and NVLink), PCIe P2P, and IB/RoCE with GPUDirect RDMA. (NCCL env)
CUTLASS: header-only CUDA C++ template library (now with Python DSLs) for high-performance GEMM, convolution and related linear algebra; targets Volta through Blackwell. Not a runtime .so you install; applications include its headers and compile their own kernels. This is what custom/fused kernels (and parts of frameworks) build on. (CUTLASS)
cuFFT: fast Fourier transforms (1D/2D/3D, batched). Ships inside the CUDA Toolkit; the SONAME carries cuFFT's own major version, which tracks the toolkit release but is not the same number (for example libcufft.so.11 on CUDA 12). Used by signal/spectral and some scientific workloads. (cuFFT wheel)

Where they sit

flowchart LR
  DRV["CUDA driver: libcuda.so"] --> RT["CUDA runtime: libcudart.so"]
  RT --> BLAS["cuBLAS / cuBLASLt"]
  RT --> DNN["cuDNN 9"]
  RT --> NCCL["NCCL 2"]
  RT --> FFT["cuFFT"]
  CUTLASS["CUTLASS headers"] -.->|"compiled into"| FW
  BLAS --> FW["Frameworks: PyTorch, JAX, TensorRT"]
  DNN --> FW
  NCCL --> FW
  FFT --> FW
  NCCL ==>|"NVLink / IB / RoCE"| NET["Other GPUs and nodes"]

Why it's needed (and when)

These libraries are where the FLOPs and the bytes actually go. The framework is glue; cuBLAS/cuDNN do the compute and NCCL moves the gradients.

cuBLAS / cuDNN: every training and inference step. cuDNN convolution/attention kernels and cuBLAS GEMM dominate single-GPU throughput, so the version that actually gets loaded has a direct effect on model FLOPs utilization (MFU).
NCCL: the moment a job spans more than one GPU. Data-parallel, FSDP, tensor- and pipeline-parallel all bottleneck on NCCL collectives over the fabric. NCCL tuning (transport selection, IB HCA binding) is a first-order lever on multi-node scaling efficiency; see Fabric Bring-Up, Validation and Benchmarking.
CUTLASS: for when stock library kernels are not enough (custom fused/quantized GEMM, new dtypes, kernel research). You reach for it deliberately, not by default.
cuFFT: spectral workloads; rare in pure LLM clusters, common in scientific/HPC tenants.

The operational rule: you do not pick these versions independently; the framework build pins them. A PyTorch wheel is compiled against a specific cuDNN and NCCL ABI and ships them in its own site-packages/nvidia/*/lib. Mixing a system cuDNN/NCCL with a framework that bundled a different one is the root cause of a large share of "works on my node" failures.
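To see what a framework wheel actually bundled, a minimal sketch along these lines can list the shared objects under site-packages/nvidia/*/lib. The function name bundled_nvidia_libs is hypothetical, and the sketch assumes the wheels landed in the interpreter's platlib directory, which may not hold for every layout.

```python
"""List shared objects that pip wheels placed under site-packages/nvidia/*/lib."""
import sysconfig
from pathlib import Path


def bundled_nvidia_libs(site_packages=None):
    """Map each nvidia/<component> directory to the .so files found in its lib/."""
    # Assumption: binary wheels install into this interpreter's platlib directory.
    base = Path(site_packages) if site_packages else Path(sysconfig.get_paths()["platlib"])
    root = base / "nvidia"
    found = {}
    if not root.is_dir():
        return found  # no NVIDIA wheels in this environment
    for lib_dir in sorted(root.glob("*/lib")):
        names = sorted(p.name for p in lib_dir.glob("*.so*"))
        if names:
            found[lib_dir.parent.name] = names
    return found


if __name__ == "__main__":
    for component, libs in bundled_nvidia_libs().items():
        print(component, libs)
```

An empty result means nothing was bundled in this environment, so any cuDNN/NCCL the framework finds must come from the system path.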

How it's installed & managed

Three independent supply chains. Pick one per concern and do not let them collide on the library search path.

1. pip wheels (recommended for framework-pinned installs)

The standard path. Installing a CUDA-enabled framework pulls the matching library wheels transitively; you rarely name them by hand.

# PyTorch (cu124 build) pulls its own cuDNN/NCCL/cuBLAS wheels transitively
pip install torch --index-url https://download.pytorch.org/whl/cu124

# The underlying NVIDIA runtime wheels (normally resolved automatically):
pip install nvidia-cudnn-cu12 nvidia-nccl-cu12 nvidia-cublas-cu12 nvidia-cufft-cu12

Package names confirmed on PyPI: nvidia-cudnn-cu12, nvidia-nccl-cu12, nvidia-cublas-cu12, nvidia-cufft-cu12. CUDA 13 builds use the -cu13 suffix (e.g. nvidia-cudnn-cu13). CUTLASS is header-only; its wheel is nvidia-cutlass.

Pin everything in the lockfile and let the framework wheel be the single source of truth for cuDNN/NCCL; do not also apt-install them into the same image.
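One way to check that the environment still matches the lockfile is to read the installed nvidia-* wheel versions from package metadata. This is a sketch using only the standard library; nvidia_wheel_versions and drift are hypothetical helper names, and the expected pins would come from your own lockfile.

```python
"""Compare installed nvidia-* wheels against a set of expected pins."""
from importlib import metadata


def nvidia_wheel_versions():
    """Return {distribution name: version} for every installed nvidia-* wheel."""
    versions = {}
    for dist in metadata.distributions():
        name = dist.metadata.get("Name")
        if name and name.lower().startswith("nvidia-"):
            versions[name.lower()] = dist.version
    return versions


def drift(expected, installed):
    """Return {name: (expected, installed)} for each pin that does not match."""
    return {
        name: (want, installed.get(name))
        for name, want in expected.items()
        if installed.get(name) != want
    }


if __name__ == "__main__":
    for name, version in sorted(nvidia_wheel_versions().items()):
        print(f"{name}=={version}")
```

Feeding drift() the pins parsed from the lockfile and the result of nvidia_wheel_versions() yields an empty dict when nothing has moved.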

2. apt (system-wide, for bare-metal nodes and base images)

Use the CUDA network repo and the keyring. cuBLAS and cuFFT arrive with the toolkit; cuDNN and NCCL are explicit packages.

# CUDA network repo keyring (substitute distro/arch, e.g. ubuntu2404/x86_64)
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2404/x86_64/cuda-keyring_1.1-1_all.deb
sudo dpkg -i cuda-keyring_1.1-1_all.deb
sudo apt-get update

# cuDNN 9 for CUDA 12 (meta-package); use cudnn9-cuda-13 on CUDA 13
sudo apt-get -y install cudnn9-cuda-12

# NCCL runtime + headers
sudo apt-get -y install libnccl2 libnccl-dev

cuDNN package breakdown (CUDA 12 variants; substitute -cuda-13 for CUDA 13). The meta-package cudnn9-cuda-12 pulls in the runtime (libcudnn9-cuda-12), the headers (libcudnn9-headers-cuda-12) and the dev package (libcudnn9-dev-cuda-12). cuDNN 9 can be installed for only one CUDA major version at a time. (cuDNN install guide)

Pin to a known-good version and freeze it, so an unrelated apt upgrade cannot bump the ABI under a framework:

# Pin exact NCCL build (pattern: <pkg>=<x.y.z>-1+cuda<A.B>); resolve the live
# version from the NCCL download page before committing this.
sudo apt-get -y install libnccl2=2.30.7-1+cuda12.9 libnccl-dev=2.30.7-1+cuda12.9
sudo apt-mark hold libnccl2 libnccl-dev

The exact 2.x.y-1+cudaA.B string must come from the NCCL install guide / download page for your toolkit; do not assume the value above is current.
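The pin pattern <x.y.z>-1+cuda<A.B> can be validated before it reaches an image build. The sketch below splits such a string into its NCCL and CUDA parts; parse_nccl_apt_version is a hypothetical helper, and the regular expression assumes the pattern shown above with no extra suffixes.

```python
"""Split an NCCL apt version pin into its NCCL and CUDA components."""
import re

# Pattern from the pin above: <x.y.z>-<revision>+cuda<A.B>
_NCCL_APT = re.compile(r"^(\d+)\.(\d+)\.(\d+)-(\d+)\+cuda(\d+)\.(\d+)$")


def parse_nccl_apt_version(text):
    """Return the NCCL version, package revision and CUDA version of a pin."""
    match = _NCCL_APT.match(text.strip())
    if match is None:
        raise ValueError(f"not an NCCL apt version string: {text!r}")
    major, minor, patch, revision, cuda_major, cuda_minor = map(int, match.groups())
    return {
        "nccl": (major, minor, patch),
        "revision": revision,
        "cuda": (cuda_major, cuda_minor),
    }


if __name__ == "__main__":
    # Illustrative value taken from the snippet above, not a recommendation.
    print(parse_nccl_apt_version("2.30.7-1+cuda12.9"))
```

A check like this catches a pin whose +cuda suffix does not match the toolkit major the image is built for.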

3. NGC containers (the usual production answer)

NGC framework images (e.g. nvcr.io/nvidia/pytorch:<YY.MM>-py3) ship a tested, internally-consistent set of driver-userspace, CUDA, cuBLAS, cuDNN and NCCL. This sidesteps the three-way version problem entirely and is the recommended base for cluster workloads; the container userspace still rides on the host CUDA Driver via the NVIDIA Container Toolkit and CDI. Match the NGC tag to a known driver floor rather than mixing in pip/apt CUDA libraries on top.

NCCL environment variables that matter

NCCL is configured at runtime by environment, not config files. The load-bearing ones, all from the NCCL env-var reference:

NCCL_DEBUG: log verbosity. VERSION prints the NCCL version at startup; WARN prints an explicit message on any error; INFO prints which transports/rings were selected (the first thing to set when debugging); TRACE is replayable per-call tracing.
NCCL_DEBUG_SUBSYS: scope the debug log to subsystems (INIT, COLL, P2P, NET, GRAPH, TUNING, …); prefix a name with ^ to exclude it. Default is INIT,BOOTSTRAP,ENV. Pair with NCCL_DEBUG=INFO.
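NCCL_IB_HCA: which RDMA (InfiniBand/RoCE) interfaces NCCL may use; a comma-separated list of name prefixes, optionally with a port after a colon. A leading ^ makes it an exclude list, and a leading = switches from prefix matching to exact names. Examples: mlx5 (all ports of every card whose name starts with mlx5), =mlx5_0:1,mlx5_1:1 (port 1 of exactly those two cards), ^=mlx5_1,mlx5_4 (skip exactly those two cards). Without the =, mlx5_1 is a prefix and would also match mlx5_10 and up. Getting this right pins each rank to the correct rail-aligned NIC.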
NCCL_P2P_LEVEL: maximum topological distance at which GPU-to-GPU P2P transport is used. Accepted values, increasing reach: LOC (never use P2P, always disabled), NVL (only over NVLink), PIX (same PCI switch), PXB (across PCI switches), PHB (same NUMA node, via the CPU), SYS (across NUMA nodes / SMP interconnect). Use it to force or forbid a transport when topology auto-detection misbehaves.
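The NCCL_IB_HCA filter rules are easy to get wrong, so the sketch below models them in plain Python. It is an illustrative model of the semantics described in the NCCL env-var reference (^ for exclude, = for exact names, prefix matching otherwise, optional :port), not NCCL's own parser. The function names are hypothetical and the mlx5_* device names are placeholders.

```python
"""Illustrative model of NCCL_IB_HCA matching; not NCCL's actual parser."""


def parse_ib_hca(value):
    """Split a filter string into (exclude, exact, [(name, port or None), ...])."""
    exclude = value.startswith("^")
    if exclude:
        value = value[1:]
    exact = value.startswith("=")
    if exact:
        value = value[1:]
    entries = []
    for token in filter(None, value.split(",")):
        name, _, port = token.partition(":")
        entries.append((name, int(port) if port else None))
    return exclude, exact, entries


def hca_allowed(value, device, port):
    """Return True if (device, port) passes the filter under this model."""
    exclude, exact, entries = parse_ib_hca(value)

    def hit(name, want_port):
        name_ok = device == name if exact else device.startswith(name)
        return name_ok and (want_port is None or want_port == port)

    matched = any(hit(name, want_port) for name, want_port in entries)
    return not matched if exclude else matched


if __name__ == "__main__":
    print(hca_allowed("mlx5_1", "mlx5_10", 1))   # prefix match
    print(hca_allowed("=mlx5_1", "mlx5_10", 1))  # exact match
```

The two calls at the bottom show why the = prefix matters on nodes with ten or more HCAs.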

Set these in the job environment (Slurm --export, K8s pod env:, or the launcher), not globally on the node. A standard debugging opener:

export NCCL_DEBUG=INFO
export NCCL_DEBUG_SUBSYS=INIT,NET,GRAPH
# then launch the multi-GPU job and read which transports NCCL chose
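When the job is started from a Python launcher instead of a shell, the same scoping can be kept by building a per-job environment and leaving the parent process untouched. This is a sketch; nccl_debug_env is a hypothetical helper, and the variable values mirror the opener above.

```python
"""Build a per-job environment with NCCL debug settings, without global exports."""
import os


def nccl_debug_env(base=None, subsystems=("INIT", "NET", "GRAPH")):
    """Return a copy of the base environment with NCCL debug logging enabled."""
    env = dict(os.environ if base is None else base)  # copy, never mutate the source
    env["NCCL_DEBUG"] = "INFO"
    env["NCCL_DEBUG_SUBSYS"] = ",".join(subsystems)
    return env


if __name__ == "__main__":
    env = nccl_debug_env()
    print(env["NCCL_DEBUG"], env["NCCL_DEBUG_SUBSYS"])
```

Passing the returned dict as env= to subprocess.run keeps the settings local to that one launch.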

Validated usage & tests

Reference templates, not hardware-tested. Each check answers a different question: is the library on the loader path, which version will the framework actually use, and does a collective run end to end.

1. Is the shared object visible to the dynamic linker? ldconfig -p prints the linker cache; grep filters it for the libraries of interest.

ldconfig -p | grep -E 'libcudnn|libnccl|libcublas|libcufft'

Expect one line per library mapping the SONAME to a resolved path, e.g. libcudnn.so.9 (libc6,x86-64) => /usr/lib/x86_64-linux-gnu/libcudnn.so.9 and libnccl.so.2 => …. If a library is missing here it was installed only inside a Python venv (site-packages/nvidia/*/lib, not in the global cache) or not at all; in that case the framework still finds it via its package, but a non-Python consumer will not. Absence of libcudnn.so.9 for an app that needs it is the exact cause of the common ImportError: libcudnn.so.9: cannot open shared object file.
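The same question can be asked from inside a process. The sketch below uses ctypes to attempt a dlopen of each SONAME named in this section through the normal loader search path, which is roughly what a non-Python consumer would see. The SONAMES tuple and the function names are illustrative; adjust the list to the majors you expect.

```python
"""Probe whether the dynamic loader can resolve the library SONAMEs."""
import ctypes

# SONAMEs as listed in this section for CUDA 12 / cuDNN 9 / NCCL 2.
SONAMES = ("libcublas.so.12", "libcudnn.so.9", "libnccl.so.2")


def loadable(soname):
    """Return True if dlopen() resolves the SONAME on the default search path."""
    try:
        ctypes.CDLL(soname)
    except OSError:
        return False
    return True


def probe(sonames=SONAMES):
    """Return {soname: bool} for each requested library."""
    return {name: loadable(name) for name in sonames}


if __name__ == "__main__":
    for name, ok in probe().items():
        print(f"{name}: {'found' if ok else 'NOT found'}")
```

A library that the framework finds in site-packages but that reports NOT found here is visible only to the framework, which matches the venv-only case described above.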

2. Which versions will the framework load? This is the authoritative check, because the framework's bundled copies win over system ones.

python -c "import torch; \
print('cuda', torch.version.cuda); \
print('cudnn', torch.backends.cudnn.version()); \
print('nccl', torch.cuda.nccl.version())"

Expected forms (do not treat the numbers as fixed): torch.backends.cudnn.version() returns an integer encoding MAJOR*10000 + … so a cuDNN 9.x build reports as a five-digit 9xxxx; torch.cuda.nccl.version() returns a version tuple such as (2, X, Y); torch.version.cuda is the toolkit string like 12.x. Confirm these match the cuDNN/NCCL you intended to pin; a mismatch here means the framework bundled something other than what you apt/pip-installed system-wide. (cuDNN's own cudnnGetVersion() returns the CUDNN_VERSION macro, which for cuDNN 9.x is CUDNN_MAJOR*10000 + CUDNN_MINOR*100 + CUDNN_PATCHLEVEL (e.g. 90100 for 9.1.0), matching PyTorch's five-digit value for a cuDNN 9 build.) (cuDNN version macro)
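The integer encoding can be unpacked with two divisions. This sketch applies the cuDNN 9 formula quoted above (CUDNN_MAJOR*10000 + CUDNN_MINOR*100 + CUDNN_PATCHLEVEL); decode_cudnn9_version is a hypothetical helper and should not be used on values from other cuDNN majors without checking their encoding.

```python
"""Decode the integer returned for a cuDNN 9 build into (major, minor, patch)."""


def decode_cudnn9_version(value):
    """Invert MAJOR*10000 + MINOR*100 + PATCHLEVEL (cuDNN 9 encoding)."""
    major, rest = divmod(int(value), 10000)
    minor, patch = divmod(rest, 100)
    return major, minor, patch


if __name__ == "__main__":
    # 90100 is the example from the text above and decodes to 9.1.0.
    print(".".join(map(str, decode_cudnn9_version(90100))))
```

The input would typically be the value of torch.backends.cudnn.version() from the check above.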

3. Does a collective actually run across GPUs? A minimal NCCL all-reduce over the visible devices, with NCCL logging on, exercises transport selection end to end.

NCCL_DEBUG=INFO python -c "
import torch, torch.distributed as dist
dist.init_process_group('nccl')
x = torch.ones(1, device=f'cuda:{dist.get_rank() % torch.cuda.device_count()}')
dist.all_reduce(x)
print('rank', dist.get_rank(), 'sum', x.item())
dist.destroy_process_group()
"

The snippet relies on the rank and rendezvous variables that torchrun sets, so save the Python body to a file and launch that file with NCCL_DEBUG=INFO torchrun --nproc_per_node=<N>. Expect every rank to print a reduced value equal to the world size (each rank contributes 1 and the default all-reduce op is a sum), and the NCCL INFO lines to show the chosen transport (NVLink/NVLS, P2P, or NET via the IB HCA). The build-level NCCL/cuFFT self-tests (nccl-tests, cuFFT samples) are the next step for throughput rather than correctness; for fabric-scale all-reduce bandwidth see Fabric Bring-Up, Validation and Benchmarking and GPU Diagnostics and Validation.

Failure modes

Most are version/path mismatches, not bugs.

ABI / version mismatch: system cuDNN or NCCL differs from what the framework bundled; symptoms range from undefined symbol to silent wrong results. Fix by pinning to the framework (let the wheel or NGC image own the library) and apt-mark hold anything installed system-wide.
libcudnn.so.9 / libnccl.so.2 not found: the library exists only inside a venv, or LD_LIBRARY_PATH is not set for a non-Python consumer. Confirm with ldconfig -p (test 1). For generic kernel/GPU-not-present faults see Kernel Upgrade: GPU Missing.
NCCL hang or timeout on a collective: wrong transport, a bad/blocked IB HCA, or a single slow rank stalling the all-reduce. Turn on NCCL_DEBUG=INFO, check NCCL_IB_HCA rail binding and NCCL_P2P_LEVEL, and confirm NVLink/Fabric Manager health (Fabric Manager, NVSwitch and NVLink). Full procedure: NCCL Hang / Collective Stall.
Collectives silently slow: NCCL fell back from NVLink/IB to PCIe or TCP because the fast transport was unavailable; NCCL_DEBUG=INFO shows the selected path. Diagnose against fabric bring-up: Fabric Bring-Up, Validation and Benchmarking.
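Two CUDA library copies on the path (pip and apt/conda): which copy loads depends on RPATH/RUNPATH, LD_LIBRARY_PATH and the ldconfig cache order, so the result can differ between nodes and shells. Remove one supply chain and prefer the framework-pinned set.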
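The duplicate-copy case can be spotted by scanning candidate directories for the same library file name appearing more than once. This is a sketch; duplicate_libs is a hypothetical helper, and which directories to scan (LD_LIBRARY_PATH entries, the system lib directory, site-packages/nvidia/*/lib) is left to the caller.

```python
"""Report CUDA library file names that appear in more than one directory."""
from collections import defaultdict
from pathlib import Path

PREFIXES = ("libcudnn", "libnccl", "libcublas", "libcufft")


def duplicate_libs(directories):
    """Return {file name: [directories]} for libraries found in several places."""
    seen = defaultdict(set)
    for entry in directories:
        if not entry:
            continue  # skip empty LD_LIBRARY_PATH segments
        directory = Path(entry)
        if not directory.is_dir():
            continue
        for path in directory.iterdir():
            if path.name.startswith(PREFIXES) and ".so" in path.name:
                seen[path.name].add(str(directory))
    return {name: sorted(dirs) for name, dirs in seen.items() if len(dirs) > 1}


if __name__ == "__main__":
    import os

    dirs = os.environ.get("LD_LIBRARY_PATH", "").split(":")
    for name, where in sorted(duplicate_libs(dirs).items()):
        print(name, "->", where)
```

Any non-empty result is a candidate for the load-order ambiguity described in the last item above.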

References

cuBLAS documentation: https://docs.nvidia.com/cuda/cublas/
cuDNN install guide (Linux, package names): https://docs.nvidia.com/deeplearning/cudnn/installation/latest/linux.html
cuDNN version macro (CUDNN_VERSION = CUDNN_MAJOR*10000 + CUDNN_MINOR*100 + CUDNN_PATCHLEVEL): https://docs.nvidia.com/deeplearning/cudnn/backend/latest/developer/misc.html
NCCL install guide (apt, version pinning): https://docs.nvidia.com/deeplearning/nccl/install-guide/index.html
NCCL environment variables (NCCL_DEBUG, NCCL_IB_HCA, NCCL_P2P_LEVEL): https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/env.html
NCCL download page: https://developer.nvidia.com/nccl
CUTLASS (header-only templates + Python DSLs): https://github.com/NVIDIA/cutlass
PyPI nvidia-nccl-cu12: https://pypi.org/project/nvidia-nccl-cu12/
PyPI nvidia-cublas-cu12: https://pypi.org/project/nvidia-cublas-cu12/
PyPI nvidia-cufft-cu12: https://pypi.org/project/nvidia-cufft-cu12/
PyPI nvidia-cutlass: https://pypi.org/project/nvidia-cutlass/