Ansible role: nvidia_stack

Scope: install the NVIDIA GPU software stack on a prepared node, namely the kernel driver (tier-aware package, branch-pinned), the CUDA toolkit, nvidia-persistenced, nvidia-fabricmanager (only on NVSwitch baseboards), DCGM for telemetry, and the NVIDIA Container Toolkit wired into containerd via CDI. This is the second role in the bring-up hub's site.yml; it runs after base_tuning (which has already blacklisted nouveau and set the kernel cmdline) and before rdma_fabric. It is a reference template and has not been hardware-tested: pin every branch to your platform's qualified driver and validate on one canary node before a fleet roll. For maintained, batteries-included alternatives, see NVIDIA DeepOps, the GPU Operator, and the nvidia.nvidia_driver Ansible collection (References).

flowchart LR
  PREP["base_tuning done: nouveau out, cmdline set"] --> PKG["Select driver package by gpu_tier"]
  PKG --> DRV["Driver + CUDA toolkit + DCGM (branch-pinned)"]
  DRV --> FM["Fabric Manager (NVSwitch only)"]
  FM --> PERS["nvidia-persistenced"]
  PERS --> CTK["Container Toolkit + CDI -> containerd"]
  CTK --> READY["Reboot once, ready for rdma_fabric"]

What it does

nvidia_stack brings a node from "driver-less but prepared" to "GPUs visible, schedulable, and exposed to the container runtime." It is the role that actually binds the nvidia kernel modules, so it depends on base_tuning having already evicted nouveau and rebuilt the initramfs; without that, the driver install compiles but the modules lose the boot race. Six concerns, in order:

Tier-aware driver install. The driver package is selected from gpu_tier, not hard-coded. Datacenter nodes (HGX/DGX, NVL, A/H/B-series, RTX PRO Blackwell servers) take the Enterprise Ready Driver open-kernel-module package; workstation and consumer tiers take a different package and skip Fabric Manager (no NVSwitch). NVIDIA recommends the open kernel modules for Turing and newer, and is making them the default going forward; the proprietary modules remain only for pre-Turing silicon.¹ ² The branch is pinned (driver_branch) so a refreshed package index cannot silently move the node to a different driver branch.
CUDA toolkit. The runtime/compiler toolkit is installed from the CUDA network repo as a versioned meta-package (cuda-toolkit-<cuda_branch>), which pulls the toolkit without dragging in a second copy of the driver. The cuda meta-package would install both and fight the tier-selected driver.⁵
nvidia-fabricmanager, NVSwitch only. Fabric Manager configures the NVSwitches into a single NVLink memory fabric; it is scoped to NVSwitch-based HGX/DGX and NVL systems and is installed/started only when nvidia_nvswitch is true. Its version must match the installed driver branch exactly: at start-up the FM service checks the loaded driver stack and aborts on a mismatch.⁶ See Fabric Manager.
nvidia-persistenced. Enables driver persistence so kernel state survives the absence of GPU clients, cutting cold-start latency and keeping ECC/clocks initialised. Preferred over the deprecated nvidia-smi -pm 1 persistence-mode flag.⁷
DCGM. Installs the Data Center GPU Manager so reliability/RAS and the validate role can run dcgmi diag and stream health telemetry. The package carries a -cuda<major> suffix that must match the installed CUDA user-mode driver major version.⁸
Container Toolkit + CDI. Installs nvidia-container-toolkit and generates a Container Device Interface spec so containerd injects GPUs declaratively, rather than via the legacy nvidia OCI hook.⁹¹⁰ See the GPU software stack for where this sits under Kubernetes.

Each package set that needs a fresh boot to bind cleanly notifies the shared reboot node handler; a clean re-run is a no-op and reboots nothing. nvidia_stack deliberately does not install OFED, load nvidia_peermem, or write /etc/nccl.conf; those belong to rdma_fabric, which must run after the driver so the peer-memory client builds against the RDMA APIs.

Variables

Inventory-level keys (gpu_tier, driver_branch, cuda_branch, nvidia_nvswitch) are set on the [gpu_nodes:vars] group in the hub inventory; role-only knobs live in roles/nvidia_stack/defaults/main.yml. The driver package and whether Fabric Manager runs are both functions of gpu_tier and nvidia_nvswitch.

Variable	Scope	Default	Purpose
`gpu_tier`	inventory	`datacenter`	`datacenter` -> ERD open-module driver + Fabric Manager (if NVSwitch); `workstation` -> RTX/PRO driver, no FM; `consumer` -> GeForce driver, no FM. Drives `nvidia_driver_pkg`.
`driver_branch`	inventory	`580`	Driver branch pinned into the package name and matched by Fabric Manager. From branch 590 the Ubuntu packages dropped the branch suffix and pinning moves to version-locking — verify naming for your branch.⁴
`cuda_branch`	inventory	`13-0`	CUDA toolkit branch -> `cuda-toolkit-{{ cuda_branch }}` from the CUDA network repo. Hyphenated form (`13-0`), as the meta-package uses.
`nvidia_nvswitch`	inventory	`false`	True only on NVSwitch baseboards (HGX/DGX 8-GPU, NVL). Gates `nvidia-fabricmanager` install + start. False for PCIe/RTX/consumer.
`nvidia_dcgm_cuda_major`	role	`13`	DCGM CUDA-major suffix -> `datacenter-gpu-manager-4-cuda{{ nvidia_dcgm_cuda_major }}`. Must match the installed CUDA user-mode driver major; Maxwell/Volta/Pascal on driver 580 need `12`.⁸
`nvidia_driver_pkg`	role	computed	Resolved driver package (see the `set_fact` task). Override to set an exact package (e.g. a `-server-open` ERD variant) if the default mapping does not fit your distro.
`nvidia_fabricmanager_pkg`	role	`nvidia-fabricmanager-{{ driver_branch }}`	FM package. Use `cuda-drivers-fabricmanager-{{ driver_branch }}` to let it pull a matching driver, or `nvlink5-{{ driver_branch }}` on 4th-gen NVSwitch (B200/B300).⁶
`nvidia_container_runtime`	role	`containerd`	Runtime configured by `nvidia-ctk runtime configure`. `containerd` for K8s nodes; `docker` for Docker hosts.
`nvidia_cdi_enabled`	role	`true`	Generate a CDI spec with `nvidia-ctk cdi generate`. Runtime configuration is a separate `nvidia-ctk runtime configure --runtime=...` step.
`nvidia_reboot_timeout`	role	`1200`	Seconds the shared `reboot node` handler waits for the node to return (driver DKMS builds are slow).

# roles/nvidia_stack/defaults/main.yml
gpu_tier: datacenter            # datacenter | workstation | consumer
driver_branch: "580"           # pin to your platform's qualified branch
cuda_branch: "13-0"            # -> cuda-toolkit-13-0
nvidia_nvswitch: false         # true only on NVSwitch baseboards (HGX/DGX/NVL)
nvidia_dcgm_cuda_major: "13"   # DCGM -cuda<major>; 12 for Maxwell/Volta/Pascal on r580
nvidia_fabricmanager_pkg: "nvidia-fabricmanager-{{ driver_branch }}"
nvidia_container_runtime: containerd
nvidia_cdi_enabled: true
nvidia_reboot_timeout: 1200
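
To illustrate the split between inventory-level keys and role defaults, the inventory keys could be set per group in a YAML inventory. This is a sketch under assumptions: the hub is described as using an INI inventory with a [gpu_nodes:vars] section, so the file name, the two child group names, the second host, and the choice of which host sits on an NVSwitch baseboard are all hypothetical.

```yaml
# inventory/hosts.yml -- hypothetical YAML equivalent of the hub's INI inventory
all:
  children:
    gpu_nodes:
      vars:
        gpu_tier: datacenter
        driver_branch: "580"      # quoted so it stays a string
        cuda_branch: "13-0"
        nvidia_nvswitch: false    # safe default for PCIe nodes
      children:
        gpu_nodes_nvswitch:       # hypothetical group name
          vars:
            nvidia_nvswitch: true # gates the Fabric Manager tasks
          hosts:
            gpu-07.dc1.internal:  # assumed here to be an NVSwitch baseboard
        gpu_nodes_pcie:           # hypothetical group name
          hosts:
            gpu-21.dc1.internal:  # hypothetical host
```

Variables on a child group take precedence over those on its parent, so only the NVSwitch group needs to flip nvidia_nvswitch; everything else is inherited from gpu_nodes.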

The tier -> driver-package mapping is resolved once in vars/main.yml so the task list stays declarative. Datacenter uses the Enterprise Ready Driver open-kernel-module package; the -server suffix marks the ERD line and -open the open modules.¹ ³

# roles/nvidia_stack/vars/main.yml
# Datacenter = Enterprise Ready Driver, open kernel modules (-server-open).
# Workstation/consumer use the UDA package; -open for Turing+.
# Verify exact names against your distro/branch (see References).
_nvidia_driver_pkg_map:
  datacenter:  "nvidia-driver-{{ driver_branch }}-server-open"
  workstation: "nvidia-driver-{{ driver_branch }}-open"
  consumer:    "nvidia-driver-{{ driver_branch }}-open"
nvidia_driver_pkg: "{{ _nvidia_driver_pkg_map[gpu_tier] }}"
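
A mistyped gpu_tier otherwise surfaces as a templating error on the map lookup. One way to get a clearer message, sketched here as a hypothetical pre-flight task that is not part of the original task list, is an explicit assertion placed before the package is resolved:

```yaml
# Hypothetical pre-flight task for roles/nvidia_stack/tasks/main.yml
- name: Validate gpu_tier and driver_branch before resolving the driver package
  ansible.builtin.assert:
    that:
      # gpu_tier must be a key of the map defined in vars/main.yml
      - gpu_tier in _nvidia_driver_pkg_map
      # driver_branch is interpolated into package names, so expect digits only
      - (driver_branch | string) is match('^[0-9]+$')
    fail_msg: >-
      gpu_tier must be one of
      {{ _nvidia_driver_pkg_map.keys() | list | join(', ') }}
      and driver_branch must be numeric;
      got gpu_tier='{{ gpu_tier }}', driver_branch='{{ driver_branch }}'.
    quiet: true
```

The digits-only check fits the branch-suffixed package names used in this map; relax it if you override nvidia_driver_pkg for a naming scheme without a branch suffix.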

Tasks

Real, idempotent tasks/main.yml. Stock modules only (ansible.builtin.*); assumes the CUDA toolkit repo, NVIDIA Container Toolkit repo keyrings, and the distro/NVIDIA driver repo that serves your selected nvidia_driver_pkg are already configured (a repo-setup task, omitted here, or your base image). The default package map uses Ubuntu-style driver package names; if your fleet installs from NVIDIA's CUDA network repo where the current guide uses nvidia-open / cuda-drivers, override nvidia_driver_pkg and pin with apt preferences or version locks. The set_fact fails loudly if gpu_tier is not a key in the map.

# roles/nvidia_stack/tasks/main.yml
- name: Resolve driver package for this gpu_tier
  ansible.builtin.set_fact:
    nvidia_driver_pkg: "{{ _nvidia_driver_pkg_map[gpu_tier] }}"
  # An undefined-variable error here means gpu_tier is not one of datacenter|workstation|consumer

- name: Install GPU driver, CUDA toolkit, DCGM (branch-pinned)
  ansible.builtin.apt:
    name:
      - "{{ nvidia_driver_pkg }}"                                  # tier-selected, branch-pinned
      - "cuda-toolkit-{{ cuda_branch }}"                          # toolkit only, not the cuda meta-package
      - "datacenter-gpu-manager-4-cuda{{ nvidia_dcgm_cuda_major }}"  # DCGM 4.x, -cuda<major>
      - "nvidia-container-toolkit"                                 # bridge to the container runtime
    state: present
    update_cache: true
    allow_change_held_packages: false   # respect apt-mark holds on a pinned branch
  notify: reboot node

- name: Install Fabric Manager (NVSwitch baseboards only)
  ansible.builtin.apt:
    name: "{{ nvidia_fabricmanager_pkg }}"   # version must match driver_branch exactly
    state: present
    update_cache: true
  when: nvidia_nvswitch | bool
  notify: reboot node

- name: Enable and start Fabric Manager (NVSwitch baseboards only)
  ansible.builtin.systemd_service:
    name: nvidia-fabricmanager
    enabled: true
    state: started
  when: nvidia_nvswitch | bool
  # FM aborts at start-up if its version does not match the loaded driver stack

- name: Enable and start nvidia-persistenced (driver persistence)
  ansible.builtin.systemd_service:
    name: nvidia-persistenced
    enabled: true
    state: started

- name: Check whether containerd already has the NVIDIA runtime stanza
  ansible.builtin.command:
    cmd: grep -q 'nvidia-container-runtime' /etc/containerd/config.toml
  register: nvidia_runtime_configured
  changed_when: false
  failed_when: false
  when: nvidia_container_runtime == 'containerd'

- name: Configure the container runtime for NVIDIA
  ansible.builtin.command:
    cmd: nvidia-ctk runtime configure --runtime={{ nvidia_container_runtime }}
  register: nvidia_runtime_configure
  changed_when: nvidia_runtime_configure.rc == 0
  when:
    - nvidia_container_runtime != 'containerd' or nvidia_runtime_configured.rc != 0
  notify: restart containerd   # restarts containerd only; a docker runtime needs its own restart handler

- name: Generate the CDI specification for the GPUs
  ansible.builtin.command:
    cmd: nvidia-ctk cdi generate --output=/etc/cdi/nvidia.yaml
    creates: /etc/cdi/nvidia.yaml
  when: nvidia_cdi_enabled | bool
  # On Container Toolkit >= 1.18.0 the nvidia-cdi-refresh systemd unit regenerates
  # this on install/upgrade; the `creates` guard keeps this task idempotent.
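
One caveat worth planning for: notified handlers normally run only after the tasks in the play have finished, so on a first install the service tasks above may execute before the reboot has loaded the new kernel modules. A possible mitigation, offered as an assumption rather than as part of the role, is to flush handlers immediately after the two package-install tasks:

```yaml
# Hypothetical placement: directly after the Fabric Manager install task,
# before any task that starts a service depending on the loaded driver.
- name: Run pending handlers now (reboots only if a package task notified)
  ansible.builtin.meta: flush_handlers
```

The trade-off is that a later role notifying reboot node again in the same run would trigger a second reboot, so validate the ordering on the canary node first.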

Idempotency notes. ansible.builtin.apt with state: present is convergent: an already-installed branch reports ok, so the reboot node notify does not fire on a clean re-run. update_cache only refreshes the package index; it is the pinned driver_branch (plus an apt-mark hold) that keeps a later upgrade from silently moving the driver out from under DKMS.¹¹ ansible.builtin.systemd_service reports changed only when the unit's enabled/active state actually changes.¹² nvidia-ctk runtime configure is guarded by a read-only check for the containerd runtime stanza; if you use Docker or a nonstandard containerd config path, adjust that probe rather than keying it to /etc/cdi/nvidia.yaml. The CDI generator is separately guarded with creates: /etc/cdi/nvidia.yaml; on toolkit >= 1.18.0 the nvidia-cdi-refresh unit regenerates it on install/upgrade.¹⁰
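
The apt-mark hold mentioned above can also be expressed with a stock module. The following is a minimal sketch, assuming the hold is applied right after the install tasks; the tasks are hypothetical additions, not part of the original task list.

```yaml
# Hypothetical follow-up tasks: equivalent to `apt-mark hold <package>`
- name: Hold the tier-selected driver package
  ansible.builtin.dpkg_selections:
    name: "{{ nvidia_driver_pkg }}"
    selection: hold

- name: Hold Fabric Manager alongside the driver (NVSwitch baseboards only)
  ansible.builtin.dpkg_selections:
    name: "{{ nvidia_fabricmanager_pkg }}"
    selection: hold
  when: nvidia_nvswitch | bool
```

A hold applies only to the named package, not to its dependencies, so check on the canary node whether the kernel-module packages it pulls in need holds of their own.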

Handlers

Two handlers. reboot node is the shared handler name used by base_tuning and rdma_fabric; Ansible runs a notified handler once no matter how many tasks notify it, so notifications from all three roles collapse to a single reboot when handlers run at the end of the play. restart containerd bounces the runtime so it picks up the NVIDIA/CDI configuration without a full reboot.

# roles/nvidia_stack/handlers/main.yml
- name: reboot node
  ansible.builtin.reboot:
    reboot_timeout: "{{ nvidia_reboot_timeout }}"
  # shared name with base_tuning/rdma_fabric; Ansible de-dupes to one reboot per play

- name: restart containerd
  ansible.builtin.systemd_service:
    name: containerd
    state: restarted

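nvidia_stack uses only ansible.builtin modules, so no extra collection is required for this role (the sibling rdma_fabric pulls community.general for modprobe, and base_tuning pulls ansible.posix for sysctl). Pin those collection versions in requirements.yml, and pin ansible-core itself in your Python requirements, since requirements.yml covers only roles and collections; the ansible.builtin.systemd_service name used here needs ansible-core 2.14 or newer.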
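
A minimal sketch of a collections file covering the sibling roles' dependencies. The version ranges are placeholders chosen for illustration, not tested pins; substitute the versions your fleet has qualified.

```yaml
# requirements.yml -- collections only; the ansible-core version is pinned elsewhere
collections:
  - name: community.general      # used by rdma_fabric (modprobe)
    version: ">=8.0.0,<9.0.0"    # placeholder range
  - name: ansible.posix          # used by base_tuning (sysctl)
    version: ">=1.5.0,<2.0.0"    # placeholder range
```

Install it with ansible-galaxy collection install -r requirements.yml before the first run.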

Apply & verify

Run the whole node bring-up (the hub site.yml applies nvidia_stack after base_tuning), or target just this role with a tag:

# whole bring-up
ansible-playbook -i inventory/hosts.ini site.yml --limit gpu-07.dc1.internal

# this role only, if tagged `nvidia` in site.yml
ansible-playbook -i inventory/hosts.ini site.yml --tags nvidia --limit gpu-07.dc1.internal
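
For the --tags invocation to select only this role, the hub play has to tag it. The excerpt below is a guess at what such a play could look like: the play name, hosts pattern, become setting, and every tag other than nvidia are assumptions, and the hub's real site.yml may differ.

```yaml
# site.yml -- hypothetical excerpt showing role order and tags
- name: GPU node bring-up
  hosts: gpu_nodes
  become: true
  roles:
    - role: base_tuning     # evicts nouveau, sets the kernel cmdline
      tags: [base]
    - role: nvidia_stack    # this role
      tags: [nvidia]
    - role: rdma_fabric     # runs after the driver is installed
      tags: [rdma]
```

Tags set on a role entry are inherited by every task in that role, which is what lets --tags nvidia skip the other two roles.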

After the role reboots the node, confirm the signals it owns. Validation is read-only; fold the same checks into role: validate_health so a regressed node fails the play instead of drifting silently.

NODE=gpu-07.dc1.internal

# 1. Driver bound, GPUs enumerated, persistence on.
ssh "$NODE" "nvidia-smi --query-gpu=name,driver_version,persistence_mode --format=csv,noheader"
#   expect one row per GPU, driver_version beginning with your driver_branch (e.g. 580.x), persistence_mode == Enabled

# 2. Fabric Manager active -- ONLY on NVSwitch baseboards (nvidia_nvswitch=true).
ssh "$NODE" "systemctl is-active nvidia-fabricmanager"
#   expect: active   (on PCIe/RTX/consumer this service is absent -- that is correct)

# 3. Container Toolkit present and the CDI spec generated.
ssh "$NODE" "nvidia-ctk --version"
ssh "$NODE" "test -f /etc/cdi/nvidia.yaml && echo cdi-spec-present"
ssh "$NODE" "nvidia-ctk cdi list"           # lists nvidia.com/gpu=0, ...=all devices

# 4. DCGM up; Level-1 diagnostic is the fast smoke test (use -r 3 for the long run).
ssh "$NODE" "dcgmi discovery -l"            # enumerates GPUs DCGM can see
ssh "$NODE" "dcgmi diag -r 1"               # expect all checks Pass

Expected signal, together: nvidia-smi lists every GPU with a driver_version on the pinned driver_branch and persistence_mode Enabled; on NVSwitch hosts nvidia-fabricmanager is active (and rightly absent elsewhere); /etc/cdi/nvidia.yaml exists and nvidia-ctk cdi list shows the GPU devices; dcgmi diag -r 1 passes. The container path is only fully proven once a GPU workload runs through containerd via CDI; see the GPU software stack and install lifecycle.
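
The same read-only checks could be folded into the validation role as tasks. This is a sketch, not the hub's actual validation role: the file name and register names are hypothetical, and the DCGM task assumes dcgmi diag exits non-zero when a check fails, which is worth confirming on your DCGM version.

```yaml
# roles/validate_health/tasks/nvidia_stack.yml  (hypothetical file name)
- name: Query driver version and persistence mode for every GPU
  ansible.builtin.command:
    cmd: nvidia-smi --query-gpu=driver_version,persistence_mode --format=csv,noheader
  register: nvidia_smi_rows
  changed_when: false

- name: Assert each GPU is on the pinned branch with persistence enabled
  ansible.builtin.assert:
    that:
      - nvidia_smi_rows.stdout_lines | length > 0
      # driver_version looks like "<branch>.<minor>...", so match on the prefix
      - nvidia_smi_rows.stdout_lines | reject('match', '^' ~ driver_branch ~ '[.]') | list | length == 0
      - nvidia_smi_rows.stdout_lines | reject('search', 'Enabled$') | list | length == 0
    fail_msg: "Unexpected nvidia-smi output: {{ nvidia_smi_rows.stdout_lines }}"

- name: Check that Fabric Manager is active (NVSwitch baseboards only)
  ansible.builtin.command:
    cmd: systemctl is-active nvidia-fabricmanager   # non-zero exit fails the task
  changed_when: false
  when: nvidia_nvswitch | bool

- name: Check that the CDI spec exists
  ansible.builtin.stat:
    path: /etc/cdi/nvidia.yaml
  register: nvidia_cdi_spec
  failed_when: not nvidia_cdi_spec.stat.exists

- name: Run the DCGM Level-1 diagnostic
  ansible.builtin.command:
    cmd: dcgmi diag -r 1
  changed_when: false
```

Every task either reports ok or fails the play, so a regressed node stops the run instead of drifting silently.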

Failure modes

Symptom	Likely cause	Runbook
`nvidia-smi` reports "No devices were found" / "couldn't communicate with the NVIDIA driver" after reboot	`nouveau` still bound (initramfs not rebuilt in `base_tuning`), or the DKMS module failed to build against the running kernel.	kernel/GPU missing
`nvidia-fabricmanager` fails to start; GPUs do not form the NVLink domain	FM package version does not match the installed driver branch — the FM service checks the loaded driver stack at start-up and aborts on mismatch.⁶	fabric-manager failure
Fabric Manager installed/started on a PCIe or RTX node	`nvidia_nvswitch` set true on a non-NVSwitch host; there is no switch to configure, so FM errors or no-ops. Set `nvidia_nvswitch: false`.	fabric-manager failure
DCGM installs but `dcgmi` errors on load / version mismatch	`-cuda<major>` suffix does not match the installed CUDA user-mode driver major (e.g. `-cuda13` on a Maxwell/Volta/Pascal r580 host that needs `-cuda12`).⁸	kernel/GPU missing
Container sees no GPU / `nvidia.com/gpu` unknown to containerd	`/etc/cdi/nvidia.yaml` missing or stale, or containerd not restarted after `nvidia-ctk runtime configure`. Regenerate the spec and bounce containerd.	kernel/GPU missing
Driver silently upgraded; DKMS rebuild on a kernel bump breaks the stack	`driver_branch` not pinned / not `apt-mark hold`-ed, so a routine package upgrade pulled a newer driver. Pin the branch, hold the package, and re-run after kernel upgrades so DKMS rebuilds are captured.	kernel/GPU missing
Both host driver and the GPU Operator's driver container present	Host-installed driver and the Operator's driver pod conflict. Choose one path; do not run both.	Software stack
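
For the stale-CDI row above, the creates guard in the role means a plain re-run will not rewrite an existing spec. A hypothetical remediation task, assuming the role's restart containerd handler is in scope and using an illustrative tag name, could force the regeneration on demand:

```yaml
# Hypothetical remediation task; the `never` tag keeps it out of normal runs.
- name: Regenerate the CDI specification after a driver or GPU change
  ansible.builtin.command:
    cmd: nvidia-ctk cdi generate --output=/etc/cdi/nvidia.yaml
  changed_when: true          # the spec is rewritten on every invocation
  notify: restart containerd
  tags: [never, cdi_regen]    # cdi_regen is an illustrative tag name
```

Because of the never tag, the task runs only when cdi_regen is requested explicitly with --tags, so routine converge runs stay idempotent.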

References

NVIDIA Driver Installation Guide — Ubuntu (nvidia-open, cuda-drivers, package naming): https://docs.nvidia.com/datacenter/tesla/driver-installation-guide/latest/ubuntu.html
NVIDIA — open-source GPU kernel modules become the default (Turing+ open modules): https://developer.nvidia.com/blog/nvidia-transitions-fully-towards-open-source-gpu-kernel-modules/
Ubuntu Server — NVIDIA drivers (-server ERD vs -open suffixes, e.g. nvidia-driver-550-server-open): https://ubuntu.com/server/docs/how-to/graphics/install-nvidia-drivers/
CUDA Installation Guide for Linux (network repo; cuda-toolkit meta-package installs the toolkit without the driver): https://docs.nvidia.com/cuda/cuda-installation-guide-linux/
NVIDIA Fabric Manager user guide (NVSwitch HGX/DGX scope; nvidia-fabricmanager/cuda-drivers-fabricmanager-<branch>/nvlink5-<branch>; version-must-match-driver): https://docs.nvidia.com/datacenter/tesla/fabric-manager-user-guide/index.html
NVIDIA Fabric Manager apt packaging (nvidia-fabricmanager-XXX, cuda-drivers-fabricmanager-XXX): https://github.com/NVIDIA/apt-packaging-fabric-manager
NVIDIA Persistence Daemon (nvidia-persistenced, preferred over nvidia-smi -pm): https://docs.nvidia.com/deploy/driver-persistence/persistence-daemon.html
NVIDIA DCGM — Getting Started (datacenter-gpu-manager-4-cuda<major>, dcgmi diag): https://docs.nvidia.com/datacenter/dcgm/latest/user-guide/getting-started.html
NVIDIA Container Toolkit — install guide (nvidia-container-toolkit, nvidia-ctk runtime configure): https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html
NVIDIA Container Toolkit — Container Device Interface (nvidia-ctk cdi generate, nvidia-cdi-refresh): https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/cdi-support.html
ansible.builtin.apt (state, update_cache, allow_change_held_packages): https://docs.ansible.com/ansible/latest/collections/ansible/builtin/apt_module.html
ansible.builtin.systemd_service: https://docs.ansible.com/ansible/latest/collections/ansible/builtin/systemd_service_module.html
ansible.builtin.command (creates for idempotency): https://docs.ansible.com/ansible/latest/collections/ansible/builtin/command_module.html
ansible.builtin.reboot: https://docs.ansible.com/ansible/latest/collections/ansible/builtin/reboot_module.html
NVIDIA DeepOps (Ansible/Kubespray cluster deploy): https://github.com/NVIDIA/deepops
nvidia.nvidia_driver Ansible collection: https://galaxy.ansible.com/ui/repo/published/nvidia/nvidia_driver/

1. NVIDIA Driver Installation Guide, Ubuntu. Open kernel modules via apt install nvidia-open; proprietary via apt install cuda-drivers. https://docs.nvidia.com/datacenter/tesla/driver-installation-guide/latest/ubuntu.html
2. NVIDIA, "Transitions Fully Towards Open-Source GPU Kernel Modules": open modules recommended for Turing and newer and becoming the install default. https://developer.nvidia.com/blog/nvidia-transitions-fully-towards-open-source-gpu-kernel-modules/
3. Ubuntu Server docs: Enterprise Ready Drivers carry the -server suffix; -open denotes the open kernel module variant (e.g. nvidia-driver-550-server-open). https://ubuntu.com/server/docs/how-to/graphics/install-nvidia-drivers/
4. NVIDIA Driver Installation Guide: from branch 590 the Ubuntu packages drop the branch designation from the name and switching branches is handled via version-locking (pinning). https://docs.nvidia.com/datacenter/tesla/driver-installation-guide/latest/ubuntu.html
5. CUDA Installation Guide for Linux: use the cuda-toolkit meta-package to install the toolkit without the bundled driver (the cuda meta-package installs both). https://docs.nvidia.com/cuda/cuda-installation-guide-linux/
6. NVIDIA Fabric Manager user guide: install cuda-drivers-fabricmanager-<driver-branch> (or nvidia-fabricmanager-<branch>; nvlink5-<branch> for 4th-gen NVSwitch B200/B300); service nvidia-fabricmanager; the FM version must match the driver and the service aborts at start-up on a mismatch; scoped to NVSwitch HGX/DGX/NVL systems. https://docs.nvidia.com/datacenter/tesla/fabric-manager-user-guide/index.html
7. NVIDIA Driver Persistence: the Persistence Daemon (nvidia-persistenced) is the recommended mechanism, superseding persistence mode set via nvidia-smi -pm. https://docs.nvidia.com/deploy/driver-persistence/persistence-daemon.html
8. NVIDIA DCGM, Getting Started. The DCGM package carries a -cuda<major> suffix matching the CUDA user-mode driver major (datacenter-gpu-manager-4-cuda13); Maxwell/Volta/Pascal on driver 580 must use the -cuda12 build. https://docs.nvidia.com/datacenter/dcgm/latest/user-guide/getting-started.html
9. NVIDIA Container Toolkit, Installing the NVIDIA Container Toolkit. Package nvidia-container-toolkit; configure containerd with nvidia-ctk runtime configure --runtime=containerd. https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html
10. NVIDIA Container Toolkit, Support for Container Device Interface. nvidia-ctk cdi generate --output=... writes the spec; as of toolkit v1.18.0 the nvidia-cdi-refresh systemd service regenerates it on install/upgrade. https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/cdi-support.html
11. ansible.builtin.apt: state: present (default) installs if absent; update_cache: true runs apt-get update first; allow_change_held_packages governs whether held packages may change. https://docs.ansible.com/ansible/latest/collections/ansible/builtin/apt_module.html
12. ansible.builtin.systemd_service: manages unit enabled/state, reporting changed only on an actual state transition. https://docs.ansible.com/ansible/latest/collections/ansible/builtin/systemd_service_module.html