Ansible: node & fabric bring-up

Scope: Ansible playbooks that take a freshly imaged GPU node from bare OS to "ready to join Kubernetes or Slurm": driver stack, Fabric Manager, InfiniBand/RoCE, host tuning, and MIG. This page is the automation counterpart to the GPU software stack and to provisioning and scheduling.

These are reference templates drawn from upstream NVIDIA/DOCA/OFED docs. Pin versions, set your own driver_branch, and validate on one node before rolling out to the fleet. For a maintained, batteries-included alternative, see NVIDIA DeepOps and the nvidia.nvidia_driver Ansible collection (References).

flowchart LR
  IMAGE["Fresh OS image"] --> BASE["Base tuning"]
  BASE --> ACS["ACS-disable service"]
  ACS --> OFED["DOCA-OFED"]
  OFED --> STACK["Driver, CUDA, Fabric Manager"]
  STACK --> RDMA["nvidia_peermem and NCCL fabric"]
  RDMA --> MIG["Optional MIG state"]
  MIG --> VALIDATE["Health validation"]

Overview

Uniform, repeatable node state underpins cluster reliability; image drift is a common source of intermittent faults (reliability and RAS). Ansible (push-based, agentless over SSH) is the common tool for the metal layer; the same playbooks slot under a GitOps pipeline (SRE and MLOps practices). The goal is one site.yml that is idempotent, fleet-uniform, and ends in a verifiably healthy node.

Inventory & layout

# inventory/hosts.ini
[gpu_nodes]
gpu-[01:16].dc1.internal

[gpu_nodes:vars]
# datacenter | workstation | consumer -> selects driver package below
gpu_tier=datacenter
driver_branch=580
cuda_branch=13-0
# HGX/DGX 8-GPU baseboard or NVL72 -> needs Fabric Manager; false for PCIe/RTX/consumer
nvidia_nvswitch=true
# only datacenter (A100/A30, H100/H200, B-series) or RTX PRO 6000 Blackwell
mig_enabled=false

The driver package and whether Fabric Manager runs both depend on gpu_tier. Datacenter nodes take the -open datacenter driver plus nvidia-fabricmanager (started only when nvidia_nvswitch); GeForce/consumer and RTX PRO/workstation nodes use a different driver package and skip Fabric Manager (no NVSwitch; the Fabric Manager guide scopes it to "NVSwitch-based single-node HGX and DGX systems"). See the GPU software stack for the per-tier driver and feature matrix.
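Because several later tasks branch on these inventory variables, a pre-flight check can catch a mistyped tier before any package is installed. The task below is a minimal sketch and not part of the roles on this page; where it lives (for example at the top of base_tuning) is an assumption. INI inventory values may arrive as strings, which is why the boolean goes through the bool filter.

```yaml
# Hypothetical pre-flight task; place it before the first role in site.yml.
- name: Check that tier-dependent inventory variables are consistent
  ansible.builtin.assert:
    that:
      # gpu_tier must be one of the three documented values
      - gpu_tier in ['datacenter', 'workstation', 'consumer']
      # NVSwitch (and therefore Fabric Manager) only makes sense on the datacenter tier
      - (gpu_tier == 'datacenter') or not (nvidia_nvswitch | bool)
      # driver_branch must be set, since package names are built from it
      - driver_branch | string | length > 0
    fail_msg: "Inconsistent gpu_tier / nvidia_nvswitch / driver_branch on {{ inventory_hostname }}"
    quiet: true
```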

# site.yml (stage roles so reboot handlers flush between kernel/driver/fabric steps)
- hosts: gpu_nodes
  become: true
  tasks:
    - import_role: { name: base_tuning }              # kernel params, hugepages, governor, nouveau
    - meta: flush_handlers                            # reboot before package stages
    - import_role: { name: acs_disable }              # runtime ACS redirect-bit clearing
    - import_role: { name: rdma_fabric }
      vars: { rdma_stage: ofed }                      # DOCA-OFED before GPU driver build
    - meta: flush_handlers                            # reboot after OFED if changed
    - import_role: { name: nvidia_stack }             # driver, CUDA, Fabric Manager, toolkit, DCGM
    - meta: flush_handlers                            # driver live before peer-memory load
    - import_role: { name: rdma_fabric }
      vars: { rdma_stage: peermem }                   # nvidia_peermem, NCCL, IB/RoCE settings
    - import_role: { name: mig }                      # optional, when mig_enabled
    - meta: flush_handlers                            # apply any MIG reset before validation
    - import_role: { name: validate }                 # assert healthy
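The advice to validate on one node before a fleet roll can be expressed in the play header itself. The sketch below uses Ansible's serial and max_fail_percentage play keywords; the batch sizes are assumptions to adjust for the site, and the empty task list is a placeholder for the import_role sequence above.

```yaml
# Play header only; batch sizes are illustrative assumptions.
- hosts: gpu_nodes
  become: true
  serial:
    - 1          # canary: a single node goes first
    - "25%"      # later batches: a quarter of the hosts in the play at a time
  max_fail_percentage: 0   # any failed host in a batch stops the remaining batches
  tasks: []      # placeholder: the import_role / flush_handlers list from site.yml
```

For a one-off canary run without editing the play, `ansible-playbook -i inventory/hosts.ini site.yml --limit gpu-01.dc1.internal` restricts the run to a single host.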

Role: base_tuning (host prep)

# roles/base_tuning/tasks/main.yml
- name: Blacklist nouveau
  copy:
    dest: /etc/modprobe.d/blacklist-nouveau.conf
    content: |
      blacklist nouveau
      options nouveau modeset=0
  notify: rebuild initramfs

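# Note: this replaces the whole GRUB_CMDLINE_LINUX line; merge in any parameters the image already sets.
# intel_iommu=on applies to Intel hosts only.
- name: Kernel cmdline for IOMMU passthrough and large BAR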
  lineinfile:
    path: /etc/default/grub
    regexp: '^GRUB_CMDLINE_LINUX='
    line: 'GRUB_CMDLINE_LINUX="iommu=pt intel_iommu=on pci=realloc"'
  notify: [update grub, reboot node]

- name: CPU governor = performance
  copy:
    dest: /etc/systemd/system/cpu-performance.service
    content: |
      [Unit]
      Description=Set CPU governor to performance
      [Service]
      Type=oneshot
      ExecStart=/usr/bin/bash -c 'for g in /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor; do echo performance > $g; done'
      [Install]
      WantedBy=multi-user.target
  notify: enable cpu-performance

- name: Hugepages for RDMA/large models
  sysctl: { name: vm.nr_hugepages, value: "2048", sysctl_set: true, reload: true }
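The tasks above notify four handlers that this page does not list. A minimal sketch of what they might look like follows; the file path, the Ubuntu/Debian commands, and the reboot timeout are assumptions. Handlers run in the order they are defined in the file, not the order they are notified, so the reboot is placed last.

```yaml
# roles/base_tuning/handlers/main.yml  (assumed path; Ubuntu/Debian commands)
- name: rebuild initramfs
  ansible.builtin.command: update-initramfs -u

- name: update grub
  ansible.builtin.command: update-grub

- name: enable cpu-performance
  ansible.builtin.systemd:
    name: cpu-performance.service
    enabled: true
    state: started
    daemon_reload: true      # pick up the freshly written unit file

- name: reboot node
  ansible.builtin.reboot:
    reboot_timeout: 1800     # assumed value; large GPU servers can take many minutes to POST
```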

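Disabling ACS: PCIe Access Control Services redirects peer-to-peer traffic through the root complex, which breaks P2P/GPUDirect (see performance tuning). Mainline kernels have historically offered no boot parameter to override ACS (pcie_acs_override is an out-of-tree patch), so disable it per-bridge at runtime with a service: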

- name: Install ACS-disable script (run after every boot, before workloads)
  copy:
    dest: /usr/local/sbin/disable-acs.sh
    mode: "0755"
    content: |
      #!/usr/bin/env bash
      # Clear the ACS control bits SrcValid, ReqRedir, CmpltRedir and UpstreamFwd (mask 0x1d)
      # on PCIe bridges so GPU<->NIC P2P works.
      set -euo pipefail
      for dev in $(lspci -D | awk '/PCI bridge/ {print $1}'); do
        cap=$(setpci -s "$dev" ECAP_ACS+0x6.w 2>/dev/null) || continue
        setpci -s "$dev" ECAP_ACS+0x6.w=$(printf '%04x' $((0x$cap & ~0x1d))) || true
      done
  notify: enable disable-acs
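The script only takes effect on every boot if a unit runs it, and the "enable disable-acs" handler needs a unit to enable. The following is a sketch of that missing piece, not the role's actual content: the unit name, the handler file path, and the workload units named in Before= are assumptions.

```yaml
# Hypothetical continuation of roles/acs_disable/tasks/main.yml
- name: Install the disable-acs systemd unit
  ansible.builtin.copy:
    dest: /etc/systemd/system/disable-acs.service
    mode: "0644"
    content: |
      [Unit]
      Description=Clear PCIe ACS control bits for GPU/NIC peer-to-peer
      # Assumed workload units; keep whichever applies to the node.
      Before=kubelet.service slurmd.service
      [Service]
      Type=oneshot
      RemainAfterExit=yes
      ExecStart=/usr/local/sbin/disable-acs.sh
      [Install]
      WantedBy=multi-user.target
  notify: enable disable-acs

# roles/acs_disable/handlers/main.yml  (assumed path; a separate file from the task above)
- name: enable disable-acs
  ansible.builtin.systemd:
    name: disable-acs.service
    enabled: true
    state: started
    daemon_reload: true
```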

Role: nvidia_stack (driver, Fabric Manager, toolkit, DCGM)

# roles/nvidia_stack/tasks/main.yml  (Ubuntu/apt; verify exact package names for the configured repo)
# Tier-dependent driver package: datacenter uses the -open datacenter driver;
# RTX PRO/workstation and GeForce/consumer use a different package and skip Fabric Manager.
- name: Select driver package by gpu_tier
  set_fact:
    # Verify exact package names for your distro/branch. The note stays outside the
    # folded scalar, where "#" text would become part of the package name.
    driver_pkg: >-
      {{ 'nvidia-driver-' ~ driver_branch ~ '-server-open' if gpu_tier == 'datacenter'
         else 'nvidia-driver-' ~ driver_branch ~ '-open' }}

- name: Install GPU driver + DCGM (pinned branch)
  apt:
    name:
      - "{{ driver_pkg }}"
      - "datacenter-gpu-manager-4-cuda13"              # DCGM 4.x (-cuda<major> suffix), feeds telemetry
      - "nvidia-container-toolkit"                      # bridge to the container runtime
    state: present
    update_cache: true
  notify: reboot node

- name: Install Fabric Manager (datacenter NVSwitch baseboards only)
  apt: { name: "nvidia-fabricmanager-{{ driver_branch }}", state: present, update_cache: true }
  when: gpu_tier == 'datacenter' and nvidia_nvswitch | bool

- name: Enable Fabric Manager on NVSwitch systems
  systemd: { name: nvidia-fabricmanager, enabled: true, state: started }
  when: gpu_tier == 'datacenter' and nvidia_nvswitch | bool

- name: Persistence mode via nvidia-persistenced
  systemd: { name: nvidia-persistenced, enabled: true, state: started }

- name: Configure containerd for the NVIDIA runtime
  command: nvidia-ctk runtime configure --runtime=containerd
  notify: restart containerd

- name: Generate the CDI specification for the GPUs
  command:
    cmd: nvidia-ctk cdi generate --output=/etc/cdi/nvidia.yaml
    creates: /etc/cdi/nvidia.yaml                     # module argument: skip when the spec already exists
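Installing with state: present pins the branch but does not stop a later apt upgrade from moving the driver underneath a running fleet. One possible way to hold the installed packages is sketched below with ansible.builtin.dpkg_selections. It assumes the driver_pkg fact set earlier in this role and an apt-based distro; holding the metapackage is only a first step, and dependent packages may need the same treatment.

```yaml
# Hypothetical additions to roles/nvidia_stack/tasks/main.yml (apt-based hosts only)
- name: Hold the GPU driver package at the installed version
  ansible.builtin.dpkg_selections:
    name: "{{ driver_pkg }}"
    selection: hold

- name: Hold Fabric Manager alongside the driver (NVSwitch systems only)
  ansible.builtin.dpkg_selections:
    name: "nvidia-fabricmanager-{{ driver_branch }}"
    selection: hold
  when: gpu_tier == 'datacenter' and nvidia_nvswitch | bool
```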

Role: rdma_fabric (InfiniBand / RoCE)

# roles/rdma_fabric/tasks/main.yml
- name: Install DOCA-OFED (RDMA stack, supersedes MLNX_OFED)
  apt: { name: doca-ofed, state: present, update_cache: true }
  notify: reboot node
  when: rdma_stage | default('all') in ['ofed', 'all']

- name: Load peer-memory module for GPUDirect RDMA
  copy: { dest: /etc/modules-load.d/nvidia-peermem.conf, content: "nvidia_peermem\n" }
  when: rdma_stage | default('all') in ['peermem', 'all']

- name: Persist NCCL fabric defaults (override per-job as needed)
  copy:
    dest: /etc/nccl.conf
    content: |
      NCCL_IB_HCA=mlx5
      NCCL_IB_GID_INDEX=3
      NCCL_NET_GDR_LEVEL=SYS
  when: rdma_stage | default('all') in ['peermem', 'all']
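The modules-load.d entry only takes effect at the next boot. To have nvidia_peermem loaded in the same run, a task along the following lines could be added; it is a sketch that assumes the community.general collection is installed and that the GPU driver and DOCA-OFED are already live on the node.

```yaml
# Hypothetical addition to the peermem stage of roles/rdma_fabric/tasks/main.yml
- name: Load nvidia_peermem now (the modules-load.d entry covers later boots)
  community.general.modprobe:
    name: nvidia_peermem
    state: present
  when: rdma_stage | default('all') in ['peermem', 'all']
```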

Role: validate (fail the play if a node is not healthy)

# roles/validate/tasks/main.yml
- name: GPUs visible
  command: nvidia-smi --query-gpu=name,driver_version,pstate --format=csv,noheader
  register: smi
  changed_when: false
  failed_when: smi.rc != 0

- name: Fabric Manager active on datacenter NVSwitch systems
  command: systemctl is-active nvidia-fabricmanager
  register: fm
  changed_when: false
  failed_when: gpu_tier == 'datacenter' and nvidia_nvswitch | bool and fm.stdout != "active"

- name: InfiniBand ports are ACTIVE
  shell: "ibstat | grep -c 'State: Active'"
  register: ibactive
  changed_when: false
  failed_when: ibactive.stdout | int < 1

- name: DCGM Level-3 diagnostic passes
  command: dcgmi diag -r 3
  register: diag
  changed_when: false
  failed_when: diag.rc != 0 or 'Fail' in diag.stdout   # also fail when dcgmi itself errors out

Don't-miss checklist

  • Pin driver_branch/cuda_branch and the DOCA-OFED version; commit the inventory to git (SRE and MLOps practices).
  • Run the ACS-disable service on every boot, before workloads start (performance tuning).
  • Start Fabric Manager only on NVSwitch systems, and assert it active in validate (the GPU software stack).
  • Gate fleet readiness on dcgmi diag and ibstat, not just driver presence (commissioning).
  • Re-run site.yml after any kernel upgrade so DKMS rebuilds are captured.

Failure modes

  • Driver installed but nvidia-fabricmanager masked/stopped: GPUs do not form the NVLink domain (reliability and RAS).
  • ACS re-enabled by a firmware/BIOS reset, silently cutting P2P bandwidth.
  • nvidia_peermem not loaded: GPUDirect RDMA falls back to host-staged copies.
  • Non-idempotent shell tasks causing drift instead of convergence.
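Several of these failure modes can be turned into explicit checks in the validate role. The tasks below are a sketch under stated assumptions, not the role's actual content: they assume a systemd host, that lspci -vvv prints an ACSCtl line with "+" for enabled bits, and that every device reporting ACS should have SrcValid and ReqRedir cleared (multifunction endpoints may need excluding on some platforms).

```yaml
# Hypothetical additions to roles/validate/tasks/main.yml
- name: Look for the loaded nvidia_peermem module
  ansible.builtin.stat:
    path: /sys/module/nvidia_peermem
  register: peermem_sysfs

- name: Fail if GPUDirect RDMA would fall back to host-staged copies
  ansible.builtin.assert:
    that: peermem_sysfs.stat.exists
    fail_msg: "nvidia_peermem is not loaded on {{ inventory_hostname }}"

- name: Fabric Manager is enabled (not masked or disabled) on NVSwitch systems
  ansible.builtin.command: systemctl is-enabled nvidia-fabricmanager
  register: fm_enabled
  changed_when: false
  failed_when: fm_enabled.stdout | trim != 'enabled'
  when: gpu_tier == 'datacenter' and nvidia_nvswitch | bool

- name: No PCIe device still has ACS source validation or request redirect enabled
  ansible.builtin.shell: |
    lspci -vvv | grep -cE 'ACSCtl:.*(SrcValid|ReqRedir)\+' || true
  register: acs_on
  changed_when: false
  failed_when: acs_on.stdout | int > 0
```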

Open questions & validation

  • Confirm package names and repo for the target distro (apt vs dnf) and pin exact versions.
  • Verify the ACS-disable mask against the specific PCIe bridges on the platform before fleet-wide use.
  • Decide host-installed driver vs the GPU Operator's driver containers (the Kubernetes platform); do not run both.

References

  • NVIDIA DeepOps (Ansible/Kubespray cluster deploy): https://github.com/NVIDIA/deepops
  • nvidia.nvidia_driver Ansible collection: https://galaxy.ansible.com/ui/repo/published/nvidia/nvidia_driver/
  • NVIDIA driver install guide: https://docs.nvidia.com/datacenter/tesla/tesla-installation-notes/index.html
  • Fabric Manager user guide (NVSwitch HGX/DGX scope): https://docs.nvidia.com/datacenter/tesla/fabric-manager-user-guide/index.html
  • NVIDIA RTX Enterprise / professional drivers (workstation tier): https://www.nvidia.com/en-us/drivers/
  • DOCA-OFED / host install: https://docs.nvidia.com/doca/sdk/doca-host+installation+and+upgrade/index.html
  • GPUDirect / ACS notes: https://docs.nvidia.com/cuda/gpudirect-rdma/index.html

Per-role cookbook pages

Each role, and the orchestration, has its own page: variables, the real tasks, how to apply and verify, and failure modes.

Related: Provisioning · Software Stack · Optimization · K8s Platform · Practices · Glossary