
GPU software stack & node administration

Scope: the software stack on a single GPU node and its day-to-day administration: the driver, CUDA, the management interfaces, GPU partitioning, and the levers a sysadmin or GPU server engineer actually pulls. This is the layer between racked hardware (commissioning) and the platform (Kubernetes for GPUs).

Overview

Once a node powers on and the fabric is clean, it still does nothing useful until the GPU software stack is correct and consistent. Most "GPU is broken" tickets are really stack tickets: a driver/firmware mismatch, a dead Fabric Manager, persistence mode off, or version drift across the fleet. The skill is knowing the layers, how they version against each other, and the handful of commands that diagnose and control them.

Architecture: the GPU software stack

flowchart TB
  FW["GPU firmware: VBIOS + GSP"] --> KM["Kernel modules: nvidia, uvm, peermem"]
  KM --> CDRV["CUDA driver (libcuda.so)"]
  CDRV --> CT["CUDA toolkit / runtime"]
  CT --> LIB["Libraries: cuDNN, cuBLAS, NCCL"]
  LIB --> FR["Frameworks: PyTorch, JAX, TensorRT"]
  FM["Fabric Manager"] -.->|"programs NVSwitch"| KM

Core knowledge

The stack, bottom-up

GPU firmware: VBIOS plus GSP (GPU System Processor) firmware. Hopper and Blackwell offload large parts of driver logic onto an on-GPU RISC-V GSP core; the kernel driver loads a GSP firmware blob that must match the driver version. Mismatch is a hard module-load failure.
Kernel modules: nvidia.ko plus nvidia-uvm (unified memory), nvidia-modeset, and nvidia-peermem (the module that exposes GPU memory for GPUDirect RDMA, see GPU performance and health). On Blackwell the open GPU kernel modules (nvidia-open) are required, not merely the default; the legacy proprietary modules do not support the newest architectures.
CUDA driver (libcuda.so, the driver API): versioned with the driver, not the toolkit.
CUDA toolkit / runtime (libcudart): versioned independently of the driver.
Math/comm libraries: cuBLAS, cuDNN, NCCL, CUTLASS, cuFFT.
Frameworks: PyTorch, JAX, TensorRT. They ship their own bundled CUDA/cuDNN/NCCL, which is why a framework can run a newer CUDA than the host toolkit.

Versioning rules that bite

A driver supports its own CUDA toolkit version and all older ones (backward compatible). A newer toolkit on an older driver needs the CUDA forward-compatibility package, which is supported only with datacenter GPUs and their driver branches, not GeForce; verify the live compatibility matrix for the exact branches.
Architecture compatibility is separate from version compatibility. A CUBIN (compiled SASS) is architecture-specific and is not forward-compatible to a newer GPU architecture. Forward compatibility comes from embedded PTX, which the driver JIT-compiles for a newer arch at load time. Ship fat binaries (PTX for forward compatibility plus SASS for current archs) so a binary built today still runs on a future GPU. (Fregly, AI Systems Performance Engineering, O'Reilly, ch. 3; PTX ISA.)
Datacenter drivers ship as numbered branches designated Long-Term-Support (LTS) or Production (shorter-lived). As of mid-2026 R535 and R580 are LTS branches (R580 current, on CUDA 13.x, supported into 2028) while R570 is a Production branch (end-of-life early 2026); verify the live support matrix. Pick a branch and hold the fleet on it; do not let nodes drift.
Fabric Manager and the driver are lockstep: same version, always.
GSP firmware ships inside the driver package; you do not version it separately, but a partial install (new modules, stale firmware) breaks it.
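The architecture-compatibility rule above can be sketched as a small decision function. This is an illustration only, not NVIDIA's loader logic: the name `load_path` and the tuple representation are hypothetical, it assumes the CUDA programming guide's rule that a cubin runs on GPUs with the same major compute capability and an equal or higher minor revision, and it ignores architecture-specific targets such as sm_90a.

```python
from typing import Iterable, Optional, Tuple

ComputeCapability = Tuple[int, int]  # (major, minor), e.g. (9, 0) for Hopper


def load_path(
    gpu: ComputeCapability,
    sass_targets: Iterable[ComputeCapability],
    ptx_targets: Iterable[ComputeCapability],
) -> Optional[str]:
    """Return how a fat binary could run on `gpu`: "sass", "ptx-jit", or None."""
    # SASS (cubin) is only usable within the same major architecture,
    # on a GPU whose minor revision is equal or higher.
    for major, minor in sass_targets:
        if major == gpu[0] and minor <= gpu[1]:
            return "sass"
    # PTX is a virtual ISA: the driver can JIT-compile it for any GPU whose
    # compute capability is at least the PTX target.
    for target in ptx_targets:
        if tuple(target) <= tuple(gpu):
            return "ptx-jit"
    return None


# An Ampere-only cubin cannot run on Hopper...
print(load_path((9, 0), sass_targets=[(8, 0)], ptx_targets=[]))        # None
# ...but the same binary with embedded compute_80 PTX can be JIT-compiled.
print(load_path((9, 0), sass_targets=[(8, 0)], ptx_targets=[(8, 0)]))  # ptx-jit
```

The second call is the fat-binary case: SASS serves the architectures known at build time, and the embedded PTX is what keeps the binary loadable on a later GPU.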

nvidia-smi: the primary instrument

Inspect: nvidia-smi, -q (full detail), --query-gpu=... --format=csv (scriptable), dmon (rolling device metrics), pmon (per-process). See observability for fleet-scale.
Control: -pm 1 (persistence mode), -pl <W> (power limit), -lgc/-lmc (lock graphics/memory clocks), -c (compute mode: default / exclusive-process / prohibited), -e 1|0 (ECC, needs reset), -r / --gpu-reset (reset a GPU; the short and long forms of the same option).
Persistence mode (or the nvidia-persistenced daemon) keeps driver state resident so the GPU does not re-initialise and clock down between jobs. It is off by default on a fresh install; turn it on for any server. Forgetting this is a classic cause of high job-start latency and erratic clocks.
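The scriptable query form lends itself to a simple persistence audit. The sketch below parses the CSV that `nvidia-smi --query-gpu=index,name,persistence_mode --format=csv,noheader` prints and lists the GPUs where persistence is not enabled. The function name and the sample text are hypothetical; the sample is illustrative, not captured from a real node.

```python
import csv
import io
from typing import List

# The command whose output this sketch expects (run separately on the node).
QUERY = "nvidia-smi --query-gpu=index,name,persistence_mode --format=csv,noheader"


def gpus_without_persistence(csv_text: str) -> List[str]:
    """Return the indices of GPUs whose persistence_mode is not "Enabled"."""
    offenders = []
    reader = csv.reader(io.StringIO(csv_text), skipinitialspace=True)
    for row in reader:
        if len(row) != 3:
            continue  # skip blank or unexpected lines
        index, _name, persistence = (field.strip() for field in row)
        if persistence != "Enabled":
            offenders.append(index)
    return offenders


# Illustrative sample output (assumed shape: index, name, persistence_mode).
SAMPLE = """0, NVIDIA H100 80GB HBM3, Enabled
1, NVIDIA H100 80GB HBM3, Disabled
"""
print(gpus_without_persistence(SAMPLE))  # ['1']
```

On a real node the text would come from running the command in `QUERY`, for example via `subprocess.run(QUERY.split(), capture_output=True, text=True, check=True).stdout`.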

Fabric Manager (NVSwitch systems)

Required on any NVSwitch-based system: HGX/DGX 8-GPU baseboards and GB200/GB300 NVL72 racks. The nv-fabricmanager daemon (typically run as the nvidia-fabricmanager systemd unit) programs the NVSwitches so the GPUs form a single NVLink domain (see the Blackwell platform).
If it is down, version-mismatched, or failed: GPUs cannot see each other over NVLink, collectives fall back to PCIe or error out, and you see "system is not ready" / NVLink-inactive states. Check Fabric Manager before ever blaming NVLink or NCCL.
For multi-node NVLink (NVL72 spanning nodes) the IMEX (Internode Memory Exchange) service coordinates the cross-node NVLink memory domain. See Kubernetes for GPUs.
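The "check Fabric Manager first" rule can be captured as a pre-check that runs before any NVLink or NCCL investigation. This is a minimal sketch under stated assumptions: the function name is hypothetical, the caller is assumed to have collected the service state (for example from `systemctl is-active nvidia-fabricmanager`, where the unit name is an assumption about the packaging) and the two version strings, and the versions in the usage lines are illustrative.

```python
from typing import List


def fabric_manager_problems(
    service_state: str, fm_version: str, driver_version: str
) -> List[str]:
    """List reasons Fabric Manager should be fixed before blaming NVLink/NCCL."""
    problems = []
    state = service_state.strip()
    if state != "active":
        problems.append(f"Fabric Manager service is {state or 'unknown'}")
    # Fabric Manager and the driver are lockstep: the versions must be identical.
    fm, drv = fm_version.strip(), driver_version.strip()
    if fm != drv:
        problems.append(f"version mismatch: Fabric Manager {fm} vs driver {drv}")
    return problems


# Illustrative inputs only.
print(fabric_manager_problems("active\n", "580.65.06", "580.65.06"))  # []
print(len(fabric_manager_problems("failed", "570.86.15", "580.65.06")))  # 2
```

An empty list means Fabric Manager is not the obvious culprit and the NVLink fault hunt can proceed.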

Driver & feature differences by GPU tier

The stack above is not uniform across the NVIDIA range; the driver branch and several node-management features depend on the GPU tier. Match the branch to the silicon, and do not assume a feature exists.

Driver branch by tier. Datacenter GPUs (A100, H100/H200, B-series) run the datacenter (Tesla) driver as numbered Production/LTS branches (R580-class as of mid-2026, verify the live matrix). GeForce cards (RTX 50/40) run the GeForce driver, whose licence (§2.8) states the software "is not licensed for datacenter deployment", a real compliance differentiator, not just a support boundary. Professional RTX PRO / workstation GPUs run the NVIDIA RTX Enterprise Production Branch driver (ISV certification, long life-cycle, security updates). DGX systems run DGX OS, a customized Ubuntu with NVIDIA drivers, CUDA, and diagnostics preinstalled.
Fabric Manager applies only to NVSwitch systems. The Fabric Manager guide scopes it to "NVSwitch-based single-node HGX and DGX systems" (8-GPU HGX/DGX baseboards; the NVL72 racks extend this with IMEX). PCIe-attached datacenter cards, all GeForce, and the no-NVLink workstation GPUs have no NVSwitch and never run nv-fabricmanager.
MIG is tier-restricted. MIG is supported only on A100, A30, H100, H200, the B-series (B200/GB200/B300), and the RTX PRO 6000 Blackwell (NVIDIA's supported list; "starting with Ampere, compute capability >= 8.0"). It is not on GeForce (RTX 50/40), nor on A40/A10/L40S/RTX 6000 Ada. The L40S datasheet states "MIG Support: No". Compute capability alone is not sufficient; check the supported-GPUs list.
Persistence & ECC. Persistence mode (nvidia-persistenced) and ECC apply to datacenter, RTX PRO/workstation, and DGX GPUs; turn persistence on for every server (off by default). GeForce has no ECC (and no -e toggle); RTX PRO 6000 Blackwell carries GDDR7 with ECC, unlike consumer GDDR7. Toggling ECC still needs a GPU reset where supported.

These differences flow into the Blackwell platform and the bring-up automation in Ansible bring-up; per-feature terms are in the Glossary.
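Automation that must not assume a feature exists can encode the MIG list from this section as a lookup. The sketch below is hypothetical throughout: the function name, the token-matching approach, and the example product-name strings are assumptions, and the model set simply restates the list above, so it goes stale when NVIDIA's supported-GPUs page changes.

```python
import re

# Models this page lists as MIG-capable; verify against the supported-GPUs list.
MIG_MODELS = {"A100", "A30", "H100", "H200", "B200", "GB200", "B300"}
MIG_MODEL_PHRASES = ("RTX PRO 6000 BLACKWELL",)


def supports_mig(product_name: str) -> bool:
    """Guess MIG support from a product name such as nvidia-smi reports."""
    upper = product_name.upper()
    if any(phrase in upper for phrase in MIG_MODEL_PHRASES):
        return True
    # Match whole tokens so "A1000" or "A10" is not mistaken for "A100".
    tokens = set(re.split(r"[^A-Z0-9]+", upper))
    return bool(tokens & MIG_MODELS)


# Illustrative product names.
for name in ("NVIDIA A100-SXM4-80GB", "NVIDIA L40S", "NVIDIA GeForce RTX 4090"):
    print(name, supports_mig(name))
```

A name-based table is only a planning aid. On a live node the driver is the authority; querying `nvidia-smi --query-gpu=mig.mode.current --format=csv` is one way to ask it directly.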

Partitioning: MIG and MPS

MIG (Multi-Instance GPU): hard-partitions one GPU into up to seven instances, each with dedicated SM slices, L2, memory, and fault isolation. Profiles are named by compute/memory fraction (1g.Xgb … 7g.Ygb). Enable MIG mode with nvidia-smi -i <id> -mig 1 (the GPU must be idle and may need a reset), then create instances with nvidia-smi mig -i <id> -cgi <profile> -C. Use for inference, multi-tenant, and dev sharing where isolation matters (security and multi-tenancy). Supported only on the tier-restricted GPU list above. MIG mode disables GPU-to-GPU peer-to-peer (both NVLink and PCIe P2P), so it is a poor fit for large-scale distributed training and sparse-MoE expert-parallel inference, which depend on fast GPU-to-GPU paths; CUDA IPC across instances is also limited. (Fregly, AI Systems Performance Engineering, O'Reilly, ch. 3; MIG user guide.)
MPS (Multi-Process Service): lets several processes share one GPU context and spatially share SMs without hard partitioning or fault isolation. Good for packing many small jobs; one process faulting can take down the rest. Managed by nvidia-cuda-mps-control.
Rule of thumb: MIG for isolation/multi-tenant, MPS for throughput on cooperative workloads, neither for a single large training job (give it the whole GPU).
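The profile naming scheme makes a first-pass capacity check easy to script. The sketch below parses names of the form 1g.Xgb and tests whether a proposed mix stays within seven compute slices and the GPU's memory. It is a necessary-but-not-sufficient check under stated assumptions: the function names are hypothetical, suffixed or compute-instance profile names are not handled, and actual placement rules are enforced by the driver (`nvidia-smi mig -lgip` lists the profiles a GPU really offers).

```python
import re
from typing import Iterable, Tuple

PROFILE_RE = re.compile(r"^(\d+)g\.(\d+)gb$")
MAX_COMPUTE_SLICES = 7  # "up to seven instances" per GPU


def parse_profile(name: str) -> Tuple[int, int]:
    """Split a profile like "3g.40gb" into (compute_slices, memory_gb)."""
    match = PROFILE_RE.match(name.strip().lower())
    if not match:
        raise ValueError(f"not a MIG profile name: {name!r}")
    return int(match.group(1)), int(match.group(2))


def plan_may_fit(profiles: Iterable[str], gpu_memory_gb: int) -> bool:
    """Coarse check: total slices and memory must not exceed the GPU's."""
    parsed = [parse_profile(p) for p in profiles]
    compute = sum(slices for slices, _ in parsed)
    memory = sum(mem for _, mem in parsed)
    return compute <= MAX_COMPUTE_SLICES and memory <= gpu_memory_gb


print(parse_profile("3g.40gb"))                     # (3, 40)
print(plan_may_fit(["3g.40gb", "3g.40gb"], 80))     # True
print(plan_may_fit(["4g.40gb", "4g.40gb"], 80))     # False: 8 compute slices
```

A plan that fails this check cannot work; a plan that passes still needs validating on the target GPU.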

Install and lifecycle

Install via the CUDA apt/yum repo (.deb/.rpm) or the runfile; use DKMS so modules rebuild on kernel upgrade. A kernel bump with no matching module rebuild = no GPU after reboot.
The NVIDIA Container Toolkit (nvidia-ctk) injects the driver and devices into containers, increasingly via CDI (Container Device Interface). This is the hand-off to Kubernetes for GPUs.
CUDA-in-container driver rule. The container ships the CUDA runtime; the host ships the driver. The host driver must be at least the minimum the container's CUDA needs: R580+ for CUDA 13.x, R525+ for CUDA 12.x. A newer CUDA runtime on an older host driver fails CUDA init at container start; verify the live matrix. (Fregly, AI Systems Performance Engineering, O'Reilly, ch. 3; supported drivers & CUDA versions.)
Node validation: dcgmi diag -r 3 (deep), nvbandwidth, gpu-burn, and the field tools in commissioning/observability.
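The CUDA-in-container rule reduces to a comparison of two version strings, which makes it a cheap admission check before a job is scheduled. The sketch below uses only the two minimums quoted above (R580+ for CUDA 13.x, R525+ for CUDA 12.x). The function name is hypothetical, and the check is deliberately coarse: it compares driver branch numbers only, and ignores exact minimum patch releases and the forward-compatibility package.

```python
# Minimum host driver branch per container CUDA major version, as quoted above.
MIN_DRIVER_BRANCH = {13: 580, 12: 525}


def host_driver_ok(host_driver: str, container_cuda: str) -> bool:
    """True if the host driver branch meets the container's CUDA minimum."""
    cuda_major = int(container_cuda.split(".")[0])
    if cuda_major not in MIN_DRIVER_BRANCH:
        raise ValueError(f"no minimum recorded for CUDA {cuda_major}.x")
    driver_branch = int(host_driver.split(".")[0])
    return driver_branch >= MIN_DRIVER_BRANCH[cuda_major]


# Illustrative version strings.
print(host_driver_ok("580.65.06", "13.0"))   # True
print(host_driver_ok("570.86.15", "13.0"))   # False: CUDA 13.x needs R580+
print(host_driver_ok("535.129.03", "12.4"))  # True
```

Treat the table as a snapshot and refresh it from the live support matrix when a new CUDA major version lands.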

Don't-miss checklist

Persistence mode on for every server; confirm nvidia-persistenced is running.
Fabric Manager running and version-matched to the driver on NVSwitch systems, checked before any NVLink fault hunt.
Driver / CUDA / Fabric Manager / GSP all consistent across the whole fleet; pin one LTS branch (provisioning and scheduling).
Open kernel modules on Blackwell.
Lock clocks for reproducible benchmarks; unlock (or set boost) for production throughput.
DKMS in place so a kernel upgrade does not strand the driver.
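The fleet-consistency item is the easiest to automate: compare each node's reported versions against the pinned set and list every deviation. A minimal sketch, assuming an inventory dictionary that some other tool has already gathered; the function name, node names, component keys, and version strings are all hypothetical.

```python
from typing import Dict, List


def drift_report(
    inventory: Dict[str, Dict[str, str]], pinned: Dict[str, str]
) -> Dict[str, List[str]]:
    """Map each drifted node to the components that differ from the pin."""
    report: Dict[str, List[str]] = {}
    for node, versions in sorted(inventory.items()):
        diffs = [
            f"{component}: {versions.get(component, 'missing')} != {wanted}"
            for component, wanted in pinned.items()
            if versions.get(component) != wanted
        ]
        if diffs:
            report[node] = diffs
    return report


# Illustrative data only.
PINNED = {"driver": "580.65.06", "fabric_manager": "580.65.06"}
INVENTORY = {
    "node-a": {"driver": "580.65.06", "fabric_manager": "580.65.06"},
    "node-b": {"driver": "570.86.15", "fabric_manager": "580.65.06"},
}
print(drift_report(INVENTORY, PINNED))
```

In this example only node-b is reported, and its entry also exposes a driver that is out of lockstep with Fabric Manager.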

Failure modes

GSP firmware / driver mismatch after a partial upgrade: modules will not load.
Fabric Manager down or mismatched: GPUs isolated, NVLink not formed, collectives fall back to PCIe or error out.
Kernel upgrade without DKMS rebuild: node boots with no GPU.
Persistence mode off: slow job starts, clocks bouncing, flaky benchmarks.
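ECC toggled without the required reset: the change stays pending and the GPU keeps running in its previous ECC mode.
Stale MIG state: leftover instances or a half-applied MIG mode confuse the scheduler (Kubernetes for GPUs).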

Open questions & validation

Fabric Manager and IMEX on an NVSwitch baseboard or NVL72: practise bring-up and the failure signature when the service is down (playbooks in Ansible bring-up).
MIG profile planning for a real multi-tenant case: which profiles, how scheduled (Kubernetes for GPUs, security and multi-tenancy).
Open-vs-proprietary kernel module nuances and the forward-compatibility package on the chosen LTS branch.

Deep-dive pages

Each component has its own page: what it is, why and when it is needed, how it is installed and managed, validated commands, and failure modes.

CUDA stack: CUDA driver · toolkit & runtime · math & comms libraries · frameworks (PyTorch/JAX/TensorRT)
Modules & firmware: kernel modules · GPU firmware & GSP · container toolkit & CDI
Drivers: versions & branches · support by tier · install & lifecycle
Instrument & tune: nvidia-smi reference · persistence mode · ECC · diagnostics & validation
Interconnect & partitioning: Fabric Manager · NVSwitch & NVLink · MIG · MPS
Runbooks: GSP/driver mismatch · Fabric Manager failure · kernel upgrade, GPU missing · persistence / clock bounce · ECC toggle recovery · stale MIG state

References

CUDA / driver compatibility matrix: https://docs.nvidia.com/datacenter/tesla/drivers/index.html
Supported drivers and CUDA toolkit versions: https://docs.nvidia.com/datacenter/tesla/drivers/supported-drivers-and-cuda-toolkit-versions.html
Fabric Manager user guide: https://docs.nvidia.com/datacenter/tesla/fabric-manager-user-guide/index.html
MIG user guide: https://docs.nvidia.com/datacenter/tesla/mig-user-guide/index.html
MIG supported GPUs: https://docs.nvidia.com/datacenter/tesla/mig-user-guide/supported-gpus.html
Multi-Process Service (MPS): https://docs.nvidia.com/deploy/mps/index.html
NVIDIA Container Toolkit: https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/index.html
GeForce driver licence (datacenter-deployment restriction, §2.8): https://www.nvidia.com/en-us/drivers/geforce-license/
NVIDIA RTX Enterprise / professional drivers: https://www.nvidia.com/en-us/drivers/
DGX OS 7 user guide: https://docs.nvidia.com/dgx/dgx-os-7-user-guide/index.html
L40S datasheet (MIG/NVLink "No"): https://www.nvidia.com/en-us/data-center/l40s/

Related: Commissioning · Provisioning · Performance · Kubernetes · Observability · Glossary