Ansible: node & fabric bring-up¶
Scope: Ansible playbooks to take a freshly-imaged GPU node from bare OS to "ready to join Kubernetes or Slurm": driver stack, Fabric Manager, InfiniBand/RoCE, host tuning, MIG. The automation counterpart to the GPU software stack and provisioning and scheduling.
Reference templates, drawn from upstream NVIDIA/DOCA/OFED docs. Pin versions, set your own driver_branch, and validate on one node before a fleet roll. For a maintained, batteries-included version see NVIDIA DeepOps and the nvidia.nvidia_driver Ansible collection (References).
flowchart LR
IMAGE["Fresh OS image"] --> BASE["Base tuning"]
BASE --> ACS["ACS-disable service"]
ACS --> OFED["DOCA-OFED"]
OFED --> STACK["Driver, CUDA, Fabric Manager"]
STACK --> RDMA["nvidia_peermem and NCCL fabric"]
RDMA --> MIG["Optional MIG state"]
MIG --> VALIDATE["Health validation"]
Overview¶
Uniform, repeatable node state is the root of cluster reliability; image drift is the source of most intermittent faults (reliability and RAS). Ansible (push-based, agentless over SSH) is the common tool for the metal layer; the same playbooks slot under a GitOps pipeline (SRE and MLOps practices). The goal: one site.yml that is idempotent, fleet-uniform, and ends in a verifiable healthy node.
Inventory & layout¶
# inventory/hosts.ini
[gpu_nodes]
gpu-[01:16].dc1.internal

[gpu_nodes:vars]
# Comments sit on their own lines: an inline "# ..." in a :vars section may be read as part of the value.
# gpu_tier: datacenter | workstation | consumer -> selects the driver package below
gpu_tier=datacenter
driver_branch=580
cuda_branch=13-0
# nvidia_nvswitch: true for HGX/DGX 8-GPU baseboards or NVL72 (needs Fabric Manager); false for PCIe/RTX/consumer
nvidia_nvswitch=true
# mig_enabled: only datacenter (A100/A30, H100/H200, B-series) or RTX PRO 6000 Blackwell
mig_enabled=false
The driver package and whether Fabric Manager runs both depend on gpu_tier. Datacenter nodes take the -open datacenter driver plus nvidia-fabricmanager (started only when nvidia_nvswitch); GeForce/consumer and RTX PRO/workstation nodes use a different driver package and skip Fabric Manager (no NVSwitch; the Fabric Manager guide scopes it to "NVSwitch-based single-node HGX and DGX systems"). See the GPU software stack for the per-tier driver and feature matrix.
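Because the tier variables decide which packages and services a node gets, it can help to reject inconsistent combinations before any role runs. The following is a minimal sketch, assuming the variable names from the inventory above; the task itself and its message are illustrative, not part of the original roles.

```yaml
# Hypothetical pre-flight task, e.g. the first entry under tasks: in site.yml
- name: Validate tier variables before touching the node
  assert:
    that:
      - gpu_tier in ['datacenter', 'workstation', 'consumer']
      - driver_branch is defined
      # Fabric Manager only makes sense on datacenter NVSwitch baseboards
      - not (nvidia_nvswitch | bool) or gpu_tier == 'datacenter'
      # MIG is limited to datacenter parts and RTX PRO 6000 Blackwell (workstation tier)
      - not (mig_enabled | bool) or gpu_tier in ['datacenter', 'workstation']
    fail_msg: "Inconsistent gpu_tier / nvidia_nvswitch / mig_enabled for {{ inventory_hostname }}"
```

The `| bool` filter matters here because values declared in an INI `:vars` section may arrive as strings.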
# site.yml (stage roles so reboot handlers flush between kernel/driver/fabric steps)
- hosts: gpu_nodes
  become: true
  tasks:
    - import_role: { name: base_tuning }     # kernel params, hugepages, governor, nouveau
    - meta: flush_handlers                   # reboot before package stages
    - import_role: { name: acs_disable }     # runtime ACS redirect-bit clearing
    - import_role: { name: rdma_fabric }
      vars: { rdma_stage: ofed }             # DOCA-OFED before GPU driver build
    - meta: flush_handlers                   # reboot after OFED if changed
    - import_role: { name: nvidia_stack }    # driver, CUDA, Fabric Manager, toolkit, DCGM
    - meta: flush_handlers                   # driver live before peer-memory load
    - import_role: { name: rdma_fabric }
      vars: { rdma_stage: peermem }          # nvidia_peermem, NCCL, IB/RoCE settings
    - import_role: { name: mig }             # optional, runs only when mig_enabled
      when: mig_enabled | bool
    - meta: flush_handlers                   # apply any MIG reset before validation
    - import_role: { name: validate }        # assert healthy
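To follow the "validate on one node before a fleet roll" advice inside the playbook itself, the play header can batch hosts. This is a sketch under the assumption that the staged role list stays as in site.yml; the batch sizes are illustrative.

```yaml
# site.yml play header (sketch): canary one node, then widen the batch
- hosts: gpu_nodes
  become: true
  serial:
    - 1          # first batch: a single canary node
    - "25%"      # later batches: 25% of the play's hosts at a time
  max_fail_percentage: 0   # abort the rollout as soon as any host in a batch fails
  tasks:
    - import_role: { name: base_tuning }
    # ... remaining stages exactly as in site.yml ...
    - import_role: { name: validate }
```

With `serial`, each batch runs the whole task list, including `validate`, before the next batch starts, so a failed health check on the canary stops the roll before the rest of the fleet is touched.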
Role: base_tuning (host prep)¶
# roles/base_tuning/tasks/main.yml
- name: Blacklist nouveau
  copy:
    dest: /etc/modprobe.d/blacklist-nouveau.conf
    content: |
      blacklist nouveau
      options nouveau modeset=0
  notify: rebuild initramfs

# This replaces the whole GRUB_CMDLINE_LINUX line, so merge in any parameters the image already sets.
# intel_iommu=on applies to Intel hosts only; adjust for AMD platforms.
- name: Kernel cmdline for IOMMU passthrough and large BAR
  lineinfile:
    path: /etc/default/grub
    regexp: '^GRUB_CMDLINE_LINUX='
    line: 'GRUB_CMDLINE_LINUX="iommu=pt intel_iommu=on pci=realloc"'
  notify: [update grub, reboot node]

- name: CPU governor = performance
  copy:
    dest: /etc/systemd/system/cpu-performance.service
    content: |
      [Unit]
      Description=Set CPU governor to performance
      [Service]
      Type=oneshot
      # "$$" passes a literal "$" through systemd's own variable expansion to bash.
      ExecStart=/usr/bin/bash -c 'for g in /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor; do echo performance > "$$g"; done'
      [Install]
      WantedBy=multi-user.target
  notify: enable cpu-performance

- name: Hugepages for RDMA/large models
  sysctl: { name: vm.nr_hugepages, value: "2048", sysctl_set: true, reload: true }
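The tasks above notify four handlers that are not listed on this page. A minimal sketch of a matching handlers file follows, assuming Ubuntu tooling (`update-initramfs`, `update-grub`); the file path follows the usual role layout and the timeout value is an assumption.

```yaml
# roles/base_tuning/handlers/main.yml (sketch; Ubuntu commands assumed)
- name: rebuild initramfs
  command: update-initramfs -u

- name: update grub
  command: update-grub

- name: enable cpu-performance
  systemd:
    name: cpu-performance.service
    daemon_reload: true
    enabled: true
    state: started

# Defined last: handlers run in definition order, so the reboot follows the rebuilds.
- name: reboot node
  reboot:
    reboot_timeout: 900   # assumed value; large GPU servers can take minutes to POST
```

Handlers run in the order they are defined, not the order they are notified, which is why the reboot sits at the end of the file.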
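Disabling ACS. PCIe Access Control Services (ACS) redirects peer-to-peer traffic through the root complex, which breaks or slows P2P/GPUDirect (see performance tuning). On kernels without a usable ACS boot parameter (pcie_acs_override is an out-of-tree patch), disable it per bridge at runtime with a service: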
# roles/acs_disable/tasks/main.yml
- name: Install ACS-disable script (run after every boot, before workloads)
  copy:
    dest: /usr/local/sbin/disable-acs.sh
    mode: "0755"
    content: |
      #!/usr/bin/env bash
      # Clear ACS SrcValid, ReqRedir, CmpltRedir and UpstreamFwd (mask 0x1d) on PCIe bridges
      # so GPU<->NIC P2P works. Bridges without an ACS capability are skipped.
      set -euo pipefail
      for dev in $(lspci -D | awk '/PCI bridge/ {print $1}'); do
        cap=$(setpci -s "$dev" ECAP_ACS+0x6.w 2>/dev/null) || continue
        setpci -s "$dev" ECAP_ACS+0x6.w=$(printf '%04x' $((0x$cap & ~0x1d))) || true
      done
  notify: enable disable-acs
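The script only takes effect on every boot if a unit runs it, and the `enable disable-acs` handler notified above is not shown here. The sketch below is one way to wire it up; the unit name matches the handler name, while the `Before=` targets (kubelet, slurmd) are assumptions about which services start workloads on the node.

```yaml
# roles/acs_disable/tasks/main.yml (continued, sketch)
- name: Install the disable-acs systemd unit
  copy:
    dest: /etc/systemd/system/disable-acs.service
    content: |
      [Unit]
      Description=Clear PCIe ACS control bits for GPU/NIC peer-to-peer
      # Assumed workload units; list whatever launches jobs on this node.
      Before=kubelet.service slurmd.service

      [Service]
      Type=oneshot
      RemainAfterExit=yes
      ExecStart=/usr/local/sbin/disable-acs.sh

      [Install]
      WantedBy=multi-user.target
  notify: enable disable-acs

# roles/acs_disable/handlers/main.yml (sketch)
- name: enable disable-acs
  systemd:
    name: disable-acs.service
    daemon_reload: true
    enabled: true
    state: started
```

`Type=oneshot` with `RemainAfterExit=yes` keeps the unit reported as active after the script exits, which makes it easy to assert on later.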
Role: nvidia_stack (driver, Fabric Manager, toolkit, DCGM)¶
# roles/nvidia_stack/tasks/main.yml (Ubuntu/apt; verify exact package names for the configured repo)
# Tier-dependent driver package: datacenter uses the -open datacenter driver;
# RTX PRO/workstation and GeForce/consumer use a different package and skip Fabric Manager.
# Verify exact package names for your distro/branch. Comments stay outside the folded scalar:
# inside ">-" a trailing "# ..." would become part of the package name.
- name: Select driver package by gpu_tier
  set_fact:
    driver_pkg: >-
      {{ 'nvidia-driver-' ~ driver_branch ~ '-server-open' if gpu_tier == 'datacenter'
         else 'nvidia-driver-' ~ driver_branch ~ '-open' }}
- name: Install GPU driver + DCGM (pinned branch)
  apt:
    name:
      - "{{ driver_pkg }}"
      - "datacenter-gpu-manager-4-cuda13"   # DCGM 4.x (-cuda<major> suffix), feeds telemetry
      - "nvidia-container-toolkit"          # bridge to the container runtime
    state: present
    update_cache: true
  notify: reboot node

- name: Install Fabric Manager (datacenter NVSwitch baseboards only)
  apt: { name: "nvidia-fabricmanager-{{ driver_branch }}", state: present, update_cache: true }
  when: gpu_tier == 'datacenter' and nvidia_nvswitch | bool

# Reboot now if the driver changed, so the services below start against a loaded driver.
- meta: flush_handlers

- name: Enable Fabric Manager on NVSwitch systems
  systemd: { name: nvidia-fabricmanager, enabled: true, state: started }
  when: gpu_tier == 'datacenter' and nvidia_nvswitch | bool

- name: Persistence mode via nvidia-persistenced
  systemd: { name: nvidia-persistenced, enabled: true, state: started }
# The command task reports "changed" on every run, so containerd restarts on each play.
- name: Configure containerd for the NVIDIA runtime
  command: nvidia-ctk runtime configure --runtime=containerd
  notify: restart containerd

- name: Generate the CDI specification for the GPUs
  command:
    cmd: nvidia-ctk cdi generate --output=/etc/cdi/nvidia.yaml
    creates: /etc/cdi/nvidia.yaml   # skip when the spec already exists
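Installing with `state: present` selects a branch but does not stop a later `apt upgrade` from moving the driver. One possible way to hold the version, sketched here with the built-in `dpkg_selections` module and the `driver_pkg` fact set earlier in this role, is to mark the packages as held; whether a hold on the top-level package is enough depends on how the repo splits its dependencies, so treat this as a starting point.

```yaml
# roles/nvidia_stack/tasks/main.yml (continued, sketch)
- name: Hold the GPU driver at the installed version
  dpkg_selections: { name: "{{ driver_pkg }}", selection: hold }

# Fabric Manager has to match the driver version, so hold it together with the driver.
- name: Hold Fabric Manager alongside the driver (NVSwitch nodes only)
  dpkg_selections: { name: "nvidia-fabricmanager-{{ driver_branch }}", selection: hold }
  when: gpu_tier == 'datacenter' and nvidia_nvswitch | bool
```

A planned upgrade then becomes an explicit change to `driver_branch` in the inventory, with the hold released as part of that change.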
Role: rdma_fabric (InfiniBand / RoCE)¶
# roles/rdma_fabric/tasks/main.yml
- name: Install DOCA-OFED (RDMA stack, supersedes MLNX_OFED)
  apt: { name: doca-ofed, state: present, update_cache: true }
  notify: reboot node
  when: rdma_stage | default('all') in ['ofed', 'all']

- name: Load peer-memory module for GPUDirect RDMA
  copy: { dest: /etc/modules-load.d/nvidia-peermem.conf, content: "nvidia_peermem\n" }
  when: rdma_stage | default('all') in ['peermem', 'all']

- name: Persist NCCL fabric defaults (override per-job as needed)
  copy:
    dest: /etc/nccl.conf
    content: |
      NCCL_IB_HCA=mlx5
      NCCL_IB_GID_INDEX=3
      NCCL_NET_GDR_LEVEL=SYS
  when: rdma_stage | default('all') in ['peermem', 'all']
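The `mig` role imported by site.yml is not listed on this page. As a rough sketch of what its first tasks might look like (not the actual role, which has its own cookbook page), the following enables MIG mode only when the inventory asks for it and a GPU reports it disabled. It assumes the `reboot node` handler from the earlier roles is available in the same play.

```yaml
# roles/mig/tasks/main.yml (hypothetical sketch)
- name: Read current MIG mode per GPU
  command: nvidia-smi --query-gpu=mig.mode.current --format=csv,noheader
  register: mig_mode
  changed_when: false
  when: mig_enabled | bool

- name: Enable MIG mode on all GPUs when any GPU reports Disabled
  command: nvidia-smi -mig 1
  when:
    - mig_enabled | bool
    - "'Disabled' in mig_mode.stdout"
  notify: reboot node   # a reboot (or GPU reset) applies a pending mode change
```

Reading the mode first keeps the task idempotent: a node that already has MIG enabled reports no change and triggers no reboot.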
Role: validate (fail the play if a node is not healthy)¶
# roles/validate/tasks/main.yml
- name: GPUs visible
  command: nvidia-smi --query-gpu=name,driver_version,pstate --format=csv,noheader
  register: smi
  changed_when: false
  failed_when: smi.rc != 0

- name: Fabric Manager active on datacenter NVSwitch systems
  command: systemctl is-active nvidia-fabricmanager
  register: fm
  changed_when: false
  failed_when: gpu_tier == 'datacenter' and nvidia_nvswitch | bool and fm.stdout != "active"

- name: InfiniBand ports are ACTIVE
  shell: "ibstat | grep -c 'State: Active'"
  register: ibactive
  changed_when: false
  failed_when: ibactive.stdout | int < 1

# Check the return code too: if dcgmi cannot run at all, stdout contains no 'Fail'.
- name: DCGM Level-3 diagnostic passes
  command: dcgmi diag -r 3
  register: diag
  changed_when: false
  failed_when: diag.rc != 0 or 'Fail' in diag.stdout
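Two of the failure modes listed further down (a missing `nvidia_peermem` module and ACS switched back on) are not covered by the checks above. A possible extension is sketched here; the ACS check relies on the `ACSCtl:` line that `lspci -vvv` prints when run as root, and it flags every device with those bits set, which may be stricter than a given platform needs.

```yaml
# roles/validate/tasks/main.yml (continued, sketch)
- name: nvidia_peermem is loaded (GPUDirect RDMA path available)
  command: grep -q '^nvidia_peermem ' /proc/modules
  changed_when: false

- name: Count PCIe functions that still have ACS redirect bits set
  shell: "lspci -vvv | grep -cE 'ACSCtl:.*(SrcValid|ReqRedir|CmpltRedir|UpstreamFwd)[+]' || true"
  register: acs_on
  changed_when: false
  failed_when: acs_on.stdout | int > 0
```

The first task fails on a non-zero `grep` exit code; the second prints a count and fails only when that count is above zero.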
Don't-miss checklist¶
- Pin driver_branch/cuda_branch and the DOCA-OFED version; commit the inventory to git (SRE and MLOps practices).
- Run the ACS-disable service on every boot, before workloads start (performance tuning).
- Start Fabric Manager only on NVSwitch systems, and assert it active in validate (the GPU software stack).
- Gate fleet readiness on dcgmi diag and ibstat, not just driver presence (commissioning).
- Re-run site.yml after any kernel upgrade so DKMS rebuilds are captured.
Failure modes¶
- Driver installed but nvidia-fabricmanager masked/stopped: GPUs do not form the NVLink domain (reliability and RAS).
- ACS re-enabled by a firmware/BIOS reset, silently halving P2P bandwidth.
- nvidia_peermem not loaded: GPUDirect RDMA falls back to host-staged copies.
- Non-idempotent shell tasks causing drift instead of convergence.
Open questions & validation¶
- Confirm package names and repo for the target distro (apt vs dnf) and pin exact versions.
- Verify the ACS-disable mask against the specific PCIe bridges on the platform before fleet-wide use.
- Decide host-installed driver vs the GPU Operator's driver containers (the Kubernetes platform); do not run both.
References¶
- NVIDIA DeepOps (Ansible/Kubespray cluster deploy): https://github.com/NVIDIA/deepops
- nvidia.nvidia_driver Ansible collection: https://galaxy.ansible.com/ui/repo/published/nvidia/nvidia_driver/
- NVIDIA driver install guide: https://docs.nvidia.com/datacenter/tesla/tesla-installation-notes/index.html
- Fabric Manager user guide (NVSwitch HGX/DGX scope): https://docs.nvidia.com/datacenter/tesla/fabric-manager-user-guide/index.html
- NVIDIA RTX Enterprise / professional drivers (workstation tier): https://www.nvidia.com/en-us/drivers/
- DOCA-OFED / host install: https://docs.nvidia.com/doca/sdk/doca-host+installation+and+upgrade/index.html
- GPUDirect / ACS notes: https://docs.nvidia.com/cuda/gpudirect-rdma/index.html
Per-role cookbook pages¶
Each role and the orchestration has its own page — variables, the real tasks, how to apply and verify, and failure modes.
- Roles: base_tuning · nvidia_stack · rdma_fabric · mig · validate
- Inventory & services: inventory & variables · PCIe ACS-disable service
- Orchestration: site playbook
Related: Provisioning · Software Stack · Optimization · K8s Platform · Practices · Glossary