Ansible role: nvidia_stack
Scope: install the NVIDIA GPU software stack on a prepared node, namely the kernel driver (tier-aware package, branch-pinned), the CUDA toolkit, nvidia-persistenced, nvidia-fabricmanager (only on NVSwitch baseboards), DCGM for telemetry, and the NVIDIA Container Toolkit wired into containerd via CDI. This is the second role in the bring-up hub's site.yml; it runs after base_tuning (which has already blacklisted nouveau and set the kernel cmdline) and before rdma_fabric. It is a reference template and has not been hardware-tested: pin every branch to your platform's qualified driver and validate on one canary node before a fleet roll. For a maintained, batteries-included alternative see NVIDIA DeepOps, the GPU Operator, and the nvidia.nvidia_driver Ansible collection (References).
```mermaid
flowchart LR
    PREP["base_tuning done: nouveau out, cmdline set"] --> PKG["Select driver package by gpu_tier"]
    PKG --> DRV["Driver + CUDA toolkit + DCGM (branch-pinned)"]
    DRV --> FM["Fabric Manager (NVSwitch only)"]
    FM --> PERS["nvidia-persistenced"]
    PERS --> CTK["Container Toolkit + CDI -> containerd"]
    CTK --> READY["Reboot once, ready for rdma_fabric"]
```
What it does
nvidia_stack brings a node from "driver-less but prepared" to "GPUs visible, schedulable, and exposed to the container runtime." It is the role that actually binds the nvidia kernel modules, so it depends on base_tuning having already evicted nouveau and rebuilt the initramfs; without that, the driver install compiles but the modules lose the boot race. Six concerns, in order:
- Tier-aware driver install. The driver package is selected from gpu_tier, not hard-coded. Datacenter nodes (HGX/DGX, NVL, A/H/B-series, RTX PRO Blackwell servers) take the Enterprise Ready Driver open-kernel-module package; workstation and consumer tiers take a different package and skip Fabric Manager (no NVSwitch). NVIDIA recommends the open kernel modules for Turing and newer, and is making them the default going forward; the proprietary modules remain only for pre-Turing silicon.[^1][^2] The branch is pinned (driver_branch) so update_cache cannot silently jump the driver.
- CUDA toolkit. The runtime/compiler toolkit is installed from the CUDA network repo as a versioned meta-package (cuda-toolkit-<cuda_branch>), which pulls the toolkit without dragging in a second copy of the driver. The cuda meta-package would install both and fight the tier-selected driver.[^5]
- nvidia-fabricmanager, NVSwitch only. Fabric Manager configures the NVSwitches into a single NVLink memory fabric; it is scoped to NVSwitch-based HGX/DGX and NVL systems and is installed/started only when nvidia_nvswitch is true. Its version must match the installed driver branch exactly: at start-up the FM service checks the loaded driver stack and aborts on a mismatch.[^6] See Fabric Manager.
- nvidia-persistenced. Enables driver persistence so kernel state survives the absence of GPU clients, cutting cold-start latency and keeping ECC/clocks initialised. Preferred over the deprecated nvidia-smi -pm 1 persistence-mode flag.[^7]
- DCGM. Installs the Data Center GPU Manager so reliability/RAS and the validate role can run dcgmi diag and stream health telemetry. The package carries a -cuda<major> suffix that must match the installed CUDA user-mode driver major version.[^8]
- Container Toolkit + CDI. Installs nvidia-container-toolkit and generates a Container Device Interface spec so containerd injects GPUs declaratively, rather than via the legacy nvidia OCI hook.[^9][^10] See the GPU software stack for where this sits under Kubernetes.
Each package set that needs a fresh boot to bind cleanly notifies the shared reboot node handler; a clean re-run is a no-op and reboots nothing. nvidia_stack deliberately does not install OFED, load nvidia_peermem, or write /etc/nccl.conf; those belong to rdma_fabric, which must run after the driver so the peer-memory client builds against the RDMA APIs.
Variables
Inventory-level keys (gpu_tier, driver_branch, cuda_branch, nvidia_nvswitch) are set on the [gpu_nodes:vars] group in the hub inventory; role-only knobs live in roles/nvidia_stack/defaults/main.yml. The driver package and whether Fabric Manager runs are both functions of gpu_tier and nvidia_nvswitch.
| Variable | Scope | Default | Purpose |
|---|---|---|---|
| gpu_tier | inventory | datacenter | datacenter -> ERD open-module driver + Fabric Manager (if NVSwitch); workstation -> RTX/PRO driver, no FM; consumer -> GeForce driver, no FM. Drives nvidia_driver_pkg. |
| driver_branch | inventory | 580 | Driver branch pinned into the package name and matched by Fabric Manager. From branch 590 the Ubuntu packages dropped the branch suffix and pinning moves to version-locking; verify naming for your branch.[^4] |
| cuda_branch | inventory | 13-0 | CUDA toolkit branch -> cuda-toolkit-{{ cuda_branch }} from the CUDA network repo. Hyphenated form (13-0), as the meta-package uses. |
| nvidia_nvswitch | inventory | false | True only on NVSwitch baseboards (HGX/DGX 8-GPU, NVL). Gates nvidia-fabricmanager install + start. False for PCIe/RTX/consumer. |
| nvidia_dcgm_cuda_major | role | 13 | DCGM CUDA-major suffix -> datacenter-gpu-manager-4-cuda{{ nvidia_dcgm_cuda_major }}. Must match the installed CUDA user-mode driver major; Maxwell/Volta/Pascal on driver 580 need 12.[^8] |
| nvidia_driver_pkg | role | computed | Resolved driver package (see the set_fact task). Override to set an exact package (e.g. a -server-open ERD variant) if the default mapping does not fit your distro. |
| nvidia_fabricmanager_pkg | role | nvidia-fabricmanager-{{ driver_branch }} | FM package. Use cuda-drivers-fabricmanager-{{ driver_branch }} to let it pull a matching driver, or nvlink5-{{ driver_branch }} on 4th-gen NVSwitch (B200/B300).[^6] |
| nvidia_container_runtime | role | containerd | Runtime configured by nvidia-ctk runtime configure. containerd for K8s nodes; docker for Docker hosts. |
| nvidia_cdi_enabled | role | true | Generate a CDI spec with nvidia-ctk cdi generate. Runtime configuration is a separate nvidia-ctk runtime configure --runtime=... step. |
| nvidia_reboot_timeout | role | 1200 | Seconds the shared reboot node handler waits for the node to return (driver DKMS builds are slow). |
```yaml
# roles/nvidia_stack/defaults/main.yml
gpu_tier: datacenter                 # datacenter | workstation | consumer
driver_branch: "580"                 # pin to your platform's qualified branch
cuda_branch: "13-0"                  # -> cuda-toolkit-13-0
nvidia_nvswitch: false               # true only on NVSwitch baseboards (HGX/DGX/NVL)
nvidia_dcgm_cuda_major: "13"         # DCGM -cuda<major>; 12 for Maxwell/Volta/Pascal on r580
nvidia_fabricmanager_pkg: "nvidia-fabricmanager-{{ driver_branch }}"
nvidia_container_runtime: containerd
nvidia_cdi_enabled: true
nvidia_reboot_timeout: 1200
```
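As a hedged illustration of how these defaults might be overridden for an NVSwitch host group, the sketch below uses a YAML group_vars file rather than the [gpu_nodes:vars] INI section. The file path and the hgx_nodes group name are hypothetical; the variable names are the role's own.

```yaml
# inventory/group_vars/hgx_nodes.yml  (hypothetical path and group name)
gpu_tier: datacenter
driver_branch: "580"      # quoted so YAML keeps a string, not an integer
cuda_branch: "13-0"
nvidia_nvswitch: true     # NVSwitch baseboard: gates the Fabric Manager install + start
```

Only the keys that differ from the role defaults need to appear; anything omitted falls back to defaults/main.yml.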
The tier -> driver-package mapping is resolved once in vars/main.yml so the task list stays declarative. Datacenter uses the Enterprise Ready Driver open-kernel-module package; the -server suffix marks the ERD line and -open the open modules.[^1][^3]
```yaml
# roles/nvidia_stack/vars/main.yml
# Datacenter = Enterprise Ready Driver, open kernel modules (-server-open).
# Workstation/consumer use the UDA package; -open for Turing+.
# Verify exact names against your distro/branch (see References).
_nvidia_driver_pkg_map:
  datacenter: "nvidia-driver-{{ driver_branch }}-server-open"
  workstation: "nvidia-driver-{{ driver_branch }}-open"
  consumer: "nvidia-driver-{{ driver_branch }}-open"
nvidia_driver_pkg: "{{ _nvidia_driver_pkg_map[gpu_tier] }}"
```
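A lookup against this map fails with a generic templating error when gpu_tier is misspelled. One possible way to surface a clearer message is a pre-flight assertion; the sketch below is an assumption, not part of the role as written, and the task name and its position at the top of tasks/main.yml are hypothetical.

```yaml
# Hypothetical pre-flight task; it would sit first in roles/nvidia_stack/tasks/main.yml
- name: Validate gpu_tier before resolving the driver package
  ansible.builtin.assert:
    that:
      - gpu_tier is defined
      - gpu_tier in _nvidia_driver_pkg_map      # membership test on the map's keys
    fail_msg: >-
      gpu_tier must be one of {{ _nvidia_driver_pkg_map.keys() | list | join(', ') }}
      (got '{{ gpu_tier | default('undefined') }}')
    quiet: true
```

Because the allowed values are read from the map itself, adding a tier to _nvidia_driver_pkg_map would extend the check without touching the task.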
Tasks
The tasks/main.yml below is idempotent and uses stock modules only (ansible.builtin.*). It assumes the CUDA toolkit repo, the NVIDIA Container Toolkit repo keyrings, and the distro/NVIDIA driver repo that serves your selected nvidia_driver_pkg are already configured (by a repo-setup task, omitted here, or by your base image). The default package map uses Ubuntu-style driver package names; if your fleet installs from NVIDIA's CUDA network repo, where the current guide uses nvidia-open / cuda-drivers, override nvidia_driver_pkg and pin with apt preferences or version locks. Role vars and set_fact outrank inventory variables in Ansible's precedence order, so pass that override as an extra var (-e) or change the map itself. The set_fact fails loudly if gpu_tier is not a key in the map.
```yaml
# roles/nvidia_stack/tasks/main.yml
- name: Resolve driver package for this gpu_tier
  ansible.builtin.set_fact:
    nvidia_driver_pkg: "{{ _nvidia_driver_pkg_map[gpu_tier] }}"
  # An undefined-variable error here means gpu_tier is not one of datacenter|workstation|consumer

- name: Install GPU driver, CUDA toolkit, DCGM (branch-pinned)
  ansible.builtin.apt:
    name:
      - "{{ nvidia_driver_pkg }}"                                     # tier-selected, branch-pinned
      - "cuda-toolkit-{{ cuda_branch }}"                              # toolkit only, not the cuda meta-package
      - "datacenter-gpu-manager-4-cuda{{ nvidia_dcgm_cuda_major }}"   # DCGM 4.x, -cuda<major>
      - "nvidia-container-toolkit"                                    # bridge to the container runtime
    state: present
    update_cache: true
    allow_change_held_packages: false   # respect apt-mark holds on a pinned branch
  notify: reboot node

- name: Install Fabric Manager (NVSwitch baseboards only)
  ansible.builtin.apt:
    name: "{{ nvidia_fabricmanager_pkg }}"   # version must match driver_branch exactly
    state: present
    update_cache: true
  when: nvidia_nvswitch | bool
  notify: reboot node

- name: Enable and start Fabric Manager (NVSwitch baseboards only)
  ansible.builtin.systemd_service:
    name: nvidia-fabricmanager
    enabled: true
    state: started
  when: nvidia_nvswitch | bool
  # FM aborts at start-up if its version does not match the loaded driver stack

- name: Enable and start nvidia-persistenced (driver persistence)
  ansible.builtin.systemd_service:
    name: nvidia-persistenced
    enabled: true
    state: started

- name: Check whether containerd already has the NVIDIA runtime stanza
  ansible.builtin.command:
    cmd: grep -q 'nvidia-container-runtime' /etc/containerd/config.toml
  register: nvidia_runtime_configured
  changed_when: false
  failed_when: false
  when: nvidia_container_runtime == 'containerd'

- name: Configure the container runtime for NVIDIA
  ansible.builtin.command:
    cmd: nvidia-ctk runtime configure --runtime={{ nvidia_container_runtime }}
  register: nvidia_runtime_configure
  changed_when: nvidia_runtime_configure.rc == 0
  when:
    - nvidia_container_runtime != 'containerd' or nvidia_runtime_configured.rc != 0
  notify: restart containerd

- name: Generate the CDI specification for the GPUs
  ansible.builtin.command:
    cmd: nvidia-ctk cdi generate --output=/etc/cdi/nvidia.yaml
    creates: /etc/cdi/nvidia.yaml
  when: nvidia_cdi_enabled | bool
  # On Container Toolkit >= 1.18.0 the nvidia-cdi-refresh systemd unit regenerates
  # this on install/upgrade; the `creates` guard keeps this task idempotent.
```
Idempotency notes. ansible.builtin.apt with state: present is convergent: an already-installed branch reports ok, so the reboot node notify does not fire on a clean re-run; pinning driver_branch (and an apt-mark hold) is what stops update_cache from silently upgrading the driver out from under DKMS.[^11] ansible.builtin.systemd_service reports changed only when the unit's enabled/active state actually changes.[^12] nvidia-ctk runtime configure is guarded by a read-only check for the containerd runtime stanza; if you use Docker or a nonstandard containerd config path, adjust that probe rather than keying it to /etc/cdi/nvidia.yaml. The CDI generator is separately guarded with creates: /etc/cdi/nvidia.yaml; on toolkit >= 1.18.0 the nvidia-cdi-refresh unit regenerates it on install/upgrade.[^10]
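The apt-mark hold mentioned above is not applied by the tasks as listed. A minimal sketch of how it could be expressed with a stock module follows; the task names and their placement after the install tasks are assumptions, and holding the top-level package may not freeze every dependency on every distro, so verify the result with apt-mark showhold.

```yaml
# Hypothetical follow-up tasks, placed after the install tasks
- name: Hold the driver package on the pinned branch
  ansible.builtin.dpkg_selections:
    name: "{{ nvidia_driver_pkg }}"
    selection: hold

- name: Hold Fabric Manager alongside the driver (NVSwitch baseboards only)
  ansible.builtin.dpkg_selections:
    name: "{{ nvidia_fabricmanager_pkg }}"
    selection: hold
  when: nvidia_nvswitch | bool
```

A deliberate branch change would then need the hold released first (selection: install) or the apt task run with allow_change_held_packages: true.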
Handlers
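Two handlers. reboot node is the shared handler name used by base_tuning and rdma_fabric. A notified handler runs at most once no matter how many tasks notify it, so notifications from all three roles collapse to a single reboot at the end of the play. Handler names share one namespace per play: when several roles define a handler with the same name, only the last one loaded runs, so keep the definitions (and the timeout) consistent across roles. restart containerd bounces the runtime so it picks up the NVIDIA/CDI configuration without a full reboot.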
```yaml
# roles/nvidia_stack/handlers/main.yml
- name: reboot node
  ansible.builtin.reboot:
    reboot_timeout: "{{ nvidia_reboot_timeout }}"
  # shared name with base_tuning/rdma_fabric; Ansible de-dupes to one reboot per play

- name: restart containerd
  ansible.builtin.systemd_service:
    name: containerd
    state: restarted
```
nvidia_stack uses only ansible.builtin modules, so no extra collection is required for this role (the sibling rdma_fabric pulls community.general for modprobe, and base_tuning pulls ansible.posix for sysctl). Pin those collection versions in requirements.yml, and pin Ansible's own version alongside them wherever you manage the control node's Python packages.
Apply & verify
Run the whole node bring-up (the hub site.yml applies nvidia_stack after base_tuning), or target just this role with a tag:
```bash
# whole bring-up
ansible-playbook -i inventory/hosts.ini site.yml --limit gpu-07.dc1.internal

# this role only, if tagged `nvidia` in site.yml
ansible-playbook -i inventory/hosts.ini site.yml --tags nvidia --limit gpu-07.dc1.internal
```
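For the --tags form to select only this role, site.yml has to attach the tag at the role level. The hub's real site.yml is not reproduced here; the following is a minimal sketch of what it might look like, assuming the gpu_nodes group from the inventory and hypothetical base and rdma tag names for the sibling roles.

```yaml
# site.yml (sketch; the base and rdma tag names are hypothetical)
- name: GPU node bring-up
  hosts: gpu_nodes
  become: true
  roles:
    - role: base_tuning      # evicts nouveau, sets the kernel cmdline
      tags: [base]
    - role: nvidia_stack     # this role
      tags: [nvidia]
    - role: rdma_fabric      # must run after the driver is installed
      tags: [rdma]
```

Roles listed this way run in order, which is what keeps nvidia_stack between base_tuning and rdma_fabric.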
After the role reboots the node, confirm the signals it owns. Validation is read-only; fold the same checks into role: validate_health so a regressed node fails the play instead of drifting silently.
```bash
NODE=gpu-07.dc1.internal

# 1. Driver bound, GPUs enumerated, persistence on.
ssh "$NODE" "nvidia-smi --query-gpu=name,driver_version,persistence_mode --format=csv,noheader"
# expect one row per GPU, driver_version starting with your driver_branch (e.g. 580.x), persistence_mode == Enabled

# 2. Fabric Manager active -- ONLY on NVSwitch baseboards (nvidia_nvswitch=true).
ssh "$NODE" "systemctl is-active nvidia-fabricmanager"
# expect: active (on PCIe/RTX/consumer this service is absent -- that is correct)

# 3. Container Toolkit present and the CDI spec generated.
ssh "$NODE" "nvidia-ctk --version"
ssh "$NODE" "test -f /etc/cdi/nvidia.yaml && echo cdi-spec-present"
ssh "$NODE" "nvidia-ctk cdi list"     # lists nvidia.com/gpu=0, ...=all devices

# 4. DCGM up; Level-1 diagnostic is the fast smoke test (use -r 3 for the long run).
ssh "$NODE" "dcgmi discovery -l"      # enumerates GPUs DCGM can see
ssh "$NODE" "dcgmi diag -r 1"         # expect all checks Pass
```
Expected signal, together: nvidia-smi lists every GPU at the pinned driver_version with persistence_mode Enabled; on NVSwitch hosts nvidia-fabricmanager is active (and rightly absent elsewhere); /etc/cdi/nvidia.yaml exists and nvidia-ctk cdi list shows the GPU devices; dcgmi diag -r 1 passes. The container path is only fully proven once a GPU workload runs through containerd via CDI; see the GPU software stack and install lifecycle.
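The first check above could be folded into the validate_health role roughly as follows. This is a sketch under assumptions: the file path and the registered variable name are hypothetical, and it relies on nvidia-smi's CSV output placing the driver version first on each line.

```yaml
# roles/validate_health/tasks/main.yml (hypothetical path; read-only checks)
- name: Query driver version and persistence mode per GPU
  ansible.builtin.command:
    cmd: nvidia-smi --query-gpu=driver_version,persistence_mode --format=csv,noheader
  register: nvidia_smi_state      # hypothetical variable name
  changed_when: false

- name: Assert every GPU is on the pinned branch with persistence enabled
  ansible.builtin.assert:
    that:
      - nvidia_smi_state.stdout_lines | length > 0
      # no line may fail to start with "<driver_branch>."
      - nvidia_smi_state.stdout_lines | reject('match', '^' ~ driver_branch ~ '[.]') | list | length == 0
      # no line may fail to end with "Enabled"
      - nvidia_smi_state.stdout_lines | reject('search', 'Enabled$') | list | length == 0
    fail_msg: "GPU driver branch or persistence mode drifted from the pinned state"
```

Rejecting the lines that match and asserting that nothing is left makes a single regressed GPU fail the play, rather than passing because most GPUs look healthy.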
Failure modes
| Symptom | Likely cause | Runbook |
|---|---|---|
| nvidia-smi reports "No devices were found" / "couldn't communicate with the NVIDIA driver" after reboot | nouveau still bound (initramfs not rebuilt in base_tuning), or the DKMS module failed to build against the running kernel. | kernel/GPU missing |
| nvidia-fabricmanager fails to start; GPUs do not form the NVLink domain | FM package version does not match the installed driver branch: the FM service checks the loaded driver stack at start-up and aborts on mismatch.[^6] | fabric-manager failure |
| Fabric Manager installed/started on a PCIe or RTX node | nvidia_nvswitch set true on a non-NVSwitch host; there is no switch to configure, so FM errors or no-ops. Set nvidia_nvswitch: false. | fabric-manager failure |
| DCGM installs but dcgmi errors on load / version mismatch | -cuda<major> suffix does not match the installed CUDA user-mode driver major (e.g. -cuda13 on a Maxwell/Volta/Pascal r580 host that needs -cuda12).[^8] | kernel/GPU missing |
| Container sees no GPU / nvidia.com/gpu unknown to containerd | /etc/cdi/nvidia.yaml missing or stale, or containerd not restarted after nvidia-ctk runtime configure. Regenerate the spec and bounce containerd. | kernel/GPU missing |
| Driver silently upgraded; DKMS rebuild on a kernel bump breaks the stack | driver_branch not pinned / not apt-mark hold-ed, so update_cache pulled a newer branch. Pin the branch and re-run after kernel upgrades so DKMS rebuilds are captured. | kernel/GPU missing |
| Both host driver and the GPU Operator's driver container present | Host-installed driver and the Operator's driver pod conflict. Choose one path; do not run both. | Software stack |
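For the "Container sees no GPU" row, the remedy (regenerate the spec, bounce containerd) might be scripted as the ad-hoc tasks below. This is a hedged sketch rather than part of the role: the task names are hypothetical, and the spec is rewritten unconditionally on purpose, bypassing the creates guard that the role's own task uses.

```yaml
# Hypothetical remediation tasks for a stale or missing CDI spec
- name: Regenerate the CDI specification unconditionally
  ansible.builtin.command:
    cmd: nvidia-ctk cdi generate --output=/etc/cdi/nvidia.yaml
  changed_when: true    # this task always rewrites the spec

- name: Restart containerd so it reloads the runtime and CDI configuration
  ansible.builtin.systemd_service:
    name: containerd
    state: restarted
```

Running nvidia-ctk cdi list afterwards, as in the verification steps, would confirm that the GPU devices are listed again.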
References
- NVIDIA Driver Installation Guide, Ubuntu (nvidia-open, cuda-drivers, package naming): https://docs.nvidia.com/datacenter/tesla/driver-installation-guide/latest/ubuntu.html
- NVIDIA: open-source GPU kernel modules become the default (Turing+ open modules): https://developer.nvidia.com/blog/nvidia-transitions-fully-towards-open-source-gpu-kernel-modules/
- Ubuntu Server: NVIDIA drivers (-server ERD vs -open suffixes, e.g. nvidia-driver-550-server-open): https://ubuntu.com/server/docs/how-to/graphics/install-nvidia-drivers/
- CUDA Installation Guide for Linux (network repo; cuda-toolkit meta-package installs the toolkit without the driver): https://docs.nvidia.com/cuda/cuda-installation-guide-linux/
- NVIDIA Fabric Manager user guide (NVSwitch HGX/DGX scope; nvidia-fabricmanager / cuda-drivers-fabricmanager-<branch> / nvlink5-<branch>; version-must-match-driver): https://docs.nvidia.com/datacenter/tesla/fabric-manager-user-guide/index.html
- NVIDIA Fabric Manager apt packaging (nvidia-fabricmanager-XXX, cuda-drivers-fabricmanager-XXX): https://github.com/NVIDIA/apt-packaging-fabric-manager
- NVIDIA Persistence Daemon (nvidia-persistenced, preferred over nvidia-smi -pm): https://docs.nvidia.com/deploy/driver-persistence/persistence-daemon.html
- NVIDIA DCGM: Getting Started (datacenter-gpu-manager-4-cuda<major>, dcgmi diag): https://docs.nvidia.com/datacenter/dcgm/latest/user-guide/getting-started.html
- NVIDIA Container Toolkit: install guide (nvidia-container-toolkit, nvidia-ctk runtime configure): https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html
- NVIDIA Container Toolkit: Container Device Interface (nvidia-ctk cdi generate, nvidia-cdi-refresh): https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/cdi-support.html
- ansible.builtin.apt (state, update_cache, allow_change_held_packages): https://docs.ansible.com/ansible/latest/collections/ansible/builtin/apt_module.html
- ansible.builtin.systemd_service: https://docs.ansible.com/ansible/latest/collections/ansible/builtin/systemd_service_module.html
- ansible.builtin.command (creates for idempotency): https://docs.ansible.com/ansible/latest/collections/ansible/builtin/command_module.html
- ansible.builtin.reboot: https://docs.ansible.com/ansible/latest/collections/ansible/builtin/reboot_module.html
- NVIDIA DeepOps (Ansible/Kubespray cluster deploy): https://github.com/NVIDIA/deepops
- nvidia.nvidia_driver Ansible collection: https://galaxy.ansible.com/ui/repo/published/nvidia/nvidia_driver/
Related: Ansible bring-up · base_tuning · rdma_fabric · install lifecycle · Fabric Manager · Glossary
[^1]: NVIDIA Driver Installation Guide, Ubuntu. Open kernel modules via apt install nvidia-open; proprietary via apt install cuda-drivers. https://docs.nvidia.com/datacenter/tesla/driver-installation-guide/latest/ubuntu.html
[^2]: NVIDIA, "Transitions Fully Towards Open-Source GPU Kernel Modules": open modules recommended for Turing and newer and becoming the install default. https://developer.nvidia.com/blog/nvidia-transitions-fully-towards-open-source-gpu-kernel-modules/
[^3]: Ubuntu Server docs. Enterprise Ready Drivers carry the -server suffix; -open denotes the open kernel module variant (e.g. nvidia-driver-550-server-open). https://ubuntu.com/server/docs/how-to/graphics/install-nvidia-drivers/
[^4]: NVIDIA Driver Installation Guide. From branch 590 the Ubuntu packages drop the branch designation from the name and switching branches is handled via version-locking (pinning). https://docs.nvidia.com/datacenter/tesla/driver-installation-guide/latest/ubuntu.html
[^5]: CUDA Installation Guide for Linux. Use the cuda-toolkit meta-package to install the toolkit without the bundled driver (the cuda meta-package installs both). https://docs.nvidia.com/cuda/cuda-installation-guide-linux/
[^6]: NVIDIA Fabric Manager user guide. Install cuda-drivers-fabricmanager-<driver-branch> (or nvidia-fabricmanager-<branch>; nvlink5-<branch> for 4th-gen NVSwitch B200/B300); service nvidia-fabricmanager; the FM version must match the driver and the service aborts at start-up on a mismatch; scoped to NVSwitch HGX/DGX/NVL systems. https://docs.nvidia.com/datacenter/tesla/fabric-manager-user-guide/index.html
[^7]: NVIDIA Driver Persistence. The Persistence Daemon (nvidia-persistenced) is the recommended mechanism, superseding persistence mode set via nvidia-smi -pm. https://docs.nvidia.com/deploy/driver-persistence/persistence-daemon.html
[^8]: NVIDIA DCGM, Getting Started. The DCGM package carries a -cuda<major> suffix matching the CUDA user-mode driver major (datacenter-gpu-manager-4-cuda13); Maxwell/Volta/Pascal on driver 580 must use the -cuda12 build. https://docs.nvidia.com/datacenter/dcgm/latest/user-guide/getting-started.html
[^9]: NVIDIA Container Toolkit, Installing the NVIDIA Container Toolkit. Package nvidia-container-toolkit; configure containerd with nvidia-ctk runtime configure --runtime=containerd. https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html
[^10]: NVIDIA Container Toolkit, Support for Container Device Interface. nvidia-ctk cdi generate --output=... writes the spec; as of toolkit v1.18.0 the nvidia-cdi-refresh systemd service regenerates it on install/upgrade. https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/cdi-support.html
[^11]: ansible.builtin.apt. state: present (default) installs if absent; update_cache: true runs apt-get update first; allow_change_held_packages governs whether held packages may change. https://docs.ansible.com/ansible/latest/collections/ansible/builtin/apt_module.html
[^12]: ansible.builtin.systemd_service. Manages unit enabled/state, reporting changed only on an actual state transition. https://docs.ansible.com/ansible/latest/collections/ansible/builtin/systemd_service_module.html