
Ansible: node & fabric bring-up

Scope: Ansible playbooks that take a freshly imaged GPU node from bare OS to "ready to join Kubernetes or Slurm": driver stack, Fabric Manager, InfiniBand/RoCE, host tuning, MIG. This page is the automation counterpart to the GPU software stack and to provisioning and scheduling.

These are reference templates drawn from upstream NVIDIA/DOCA/OFED docs, not drop-in roles. Pin versions, set your own driver_branch, and validate on one node before a fleet roll. For a maintained, batteries-included version see NVIDIA DeepOps and the nvidia.nvidia_driver Ansible collection (References).

flowchart LR
  IMAGE["Fresh OS image"] --> BASE["Base tuning"]
  BASE --> ACS["ACS-disable service"]
  ACS --> OFED["DOCA-OFED"]
  OFED --> STACK["Driver, CUDA, Fabric Manager"]
  STACK --> RDMA["nvidia_peermem and NCCL fabric"]
  RDMA --> MIG["Optional MIG state"]
  MIG --> VALIDATE["Health validation"]

Overview

Uniform, repeatable node state is the root of cluster reliability; image drift is the source of most intermittent faults (reliability and RAS). Ansible (push-based, agentless over SSH) is the common tool for the metal layer; the same playbooks slot under a GitOps pipeline (SRE and MLOps practices). The goal: one site.yml that is idempotent, fleet-uniform, and ends in a verifiable healthy node.

Inventory & layout

# inventory/hosts.ini
[gpu_nodes]
gpu-[01:16].dc1.internal

[gpu_nodes:vars]
# Comments sit on their own lines: an inline '#' after a :vars value can end up inside the string.
# gpu_tier: datacenter | workstation | consumer -> selects the driver package below
gpu_tier=datacenter
driver_branch=580
cuda_branch=13-0
# nvidia_nvswitch: HGX/DGX 8-GPU baseboard or NVL72 -> needs Fabric Manager; false for PCIe/RTX/consumer
nvidia_nvswitch=true
# mig_enabled: only datacenter (A100/A30, H100/H200, B-series) or RTX PRO 6000 Blackwell
mig_enabled=false

The driver package and whether Fabric Manager runs both depend on gpu_tier. Datacenter nodes take the -open datacenter driver plus nvidia-fabricmanager (started only when nvidia_nvswitch); GeForce/consumer and RTX PRO/workstation nodes use a different driver package and skip Fabric Manager (no NVSwitch; the Fabric Manager guide scopes it to "NVSwitch-based single-node HGX and DGX systems"). See the GPU software stack for the per-tier driver and feature matrix.
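
The tier rules above can be checked before any role touches the node. The following is a minimal sketch, assuming a hypothetical pre-flight task file (the path roles/preflight/tasks/main.yml is not part of the original layout); it only uses the inventory variables already defined on this page.

```yaml
# roles/preflight/tasks/main.yml  (hypothetical role name, shown for illustration)
- name: Validate inventory variables before changing the node
  assert:
    that:
      # gpu_tier must be one of the three documented tiers
      - gpu_tier in ['datacenter', 'workstation', 'consumer']
      # driver_branch is a bare branch number such as 580
      - (driver_branch | string) is match('^[0-9]+$')
      # Fabric Manager only applies to datacenter NVSwitch baseboards
      - not (nvidia_nvswitch | bool) or gpu_tier == 'datacenter'
      # MIG is not available on the consumer tier
      - not (mig_enabled | bool) or gpu_tier != 'consumer'
    fail_msg: "Inventory variables are inconsistent for {{ inventory_hostname }}"
```

A failed assertion stops the host before any package is installed, which is cheaper to debug than a half-configured driver stack.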

# site.yml (stage roles so reboot handlers flush between kernel/driver/fabric steps)
- hosts: gpu_nodes
  become: true
  tasks:
    - import_role: { name: base_tuning }              # kernel params, hugepages, governor, nouveau
    - meta: flush_handlers                            # reboot before package stages
    - import_role: { name: acs_disable }              # runtime ACS redirect-bit clearing
    - import_role: { name: rdma_fabric }
      vars: { rdma_stage: ofed }                      # DOCA-OFED before GPU driver build
    - meta: flush_handlers                            # reboot after OFED if changed
    - import_role: { name: nvidia_stack }             # driver, CUDA, Fabric Manager, toolkit, DCGM
    - meta: flush_handlers                            # driver live before peer-memory load
    - import_role: { name: rdma_fabric }
      vars: { rdma_stage: peermem }                   # nvidia_peermem, NCCL, IB/RoCE settings
    - import_role: { name: mig }                      # optional, when mig_enabled
    - meta: flush_handlers                            # apply any MIG reset before validation
    - import_role: { name: validate }                 # assert healthy
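
To follow the "validate on one node before a fleet roll" advice, the play header can roll a single canary first. This is a sketch of one possible variant, not the layout the playbook above requires; the batch sizes are assumptions to adjust per site.

```yaml
# site.yml header variant (sketch): one canary node, then quarter-fleet batches
- hosts: gpu_nodes
  become: true
  serial:
    - 1        # canary batch: a single node runs every stage first
    - "25%"    # remaining nodes follow in batches (assumed size)
  max_fail_percentage: 0   # any failed node in a batch stops the roll
  tasks:
    - import_role: { name: base_tuning }
    # remaining stages unchanged from the site.yml above
```

For a one-off trial, running the unmodified playbook with `--limit gpu-01.dc1.internal` achieves the same canary effect without editing the play.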

Role: base_tuning (host prep)

# roles/base_tuning/tasks/main.yml
- name: Blacklist nouveau
  copy:
    dest: /etc/modprobe.d/blacklist-nouveau.conf
    content: |
      blacklist nouveau
      options nouveau modeset=0
  notify: rebuild initramfs

- name: Kernel cmdline for IOMMU passthrough and large BAR
  lineinfile:
    path: /etc/default/grub
    regexp: '^GRUB_CMDLINE_LINUX='
    line: 'GRUB_CMDLINE_LINUX="iommu=pt intel_iommu=on pci=realloc"'
  notify: [update grub, reboot node]

- name: CPU governor = performance
  copy:
    dest: /etc/systemd/system/cpu-performance.service
    content: |
      [Unit]
      Description=Set CPU governor to performance
      [Service]
      Type=oneshot
      ExecStart=/usr/bin/bash -c 'for g in /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor; do echo performance > $g; done'
      [Install]
      WantedBy=multi-user.target
  notify: enable cpu-performance

- name: Hugepages for RDMA/large models
  sysctl: { name: vm.nr_hugepages, value: "2048", sysctl_set: true, reload: true }
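
The tasks above notify four handlers that this page does not show. A minimal sketch of a matching handlers file follows, assuming an Ubuntu host (update-initramfs and update-grub are the Debian/Ubuntu tools) and an assumed reboot timeout. Handlers run in the order they are defined here, so the reboot is listed last.

```yaml
# roles/base_tuning/handlers/main.yml  (sketch; Ubuntu tooling assumed)
- name: rebuild initramfs
  command: update-initramfs -u      # picks up the nouveau blacklist

- name: update grub
  command: update-grub              # regenerates grub.cfg from /etc/default/grub

- name: enable cpu-performance
  systemd:
    name: cpu-performance.service
    enabled: true
    state: started
    daemon_reload: true             # the unit file was just written

- name: reboot node
  reboot:
    reboot_timeout: 900             # assumed value; large GPU nodes can POST slowly
```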

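Disabling ACS: PCIe Access Control Services (ACS) can redirect peer-to-peer traffic up through the root complex, which breaks or slows P2P/GPUDirect (see performance tuning). The often-quoted pcie_acs_override boot parameter is an out-of-tree patch, and mainline's pci=disable_acs_redir= needs an explicit device list, so this role clears the bits per bridge at runtime with a service: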

- name: Install ACS-disable script (run after every boot, before workloads)
  copy:
    dest: /usr/local/sbin/disable-acs.sh
    mode: "0755"
    content: |
      #!/usr/bin/env bash
      # Clear ACS Source Validation, P2P Request/Completion Redirect and Upstream Forwarding
      # (mask 0x1d) on PCIe bridges so GPU<->NIC P2P works.
      set -euo pipefail
      for dev in $(lspci -D | awk '/PCI bridge/ {print $1}'); do
        cap=$(setpci -s "$dev" ECAP_ACS+0x6.w 2>/dev/null) || continue
        setpci -s "$dev" ECAP_ACS+0x6.w=$(printf '%04x' $((0x$cap & ~0x1d))) || true
      done
  notify: enable disable-acs
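
The script only helps if something runs it on every boot. Below is a sketch of the systemd unit and the handler the task above notifies; the unit name, the file paths, and the Before= ordering against kubelet and slurmd are assumptions chosen to match the "before workloads" requirement.

```yaml
# roles/acs_disable/tasks/main.yml (continued, sketch)
- name: Install the disable-acs systemd unit
  copy:
    dest: /etc/systemd/system/disable-acs.service
    content: |
      [Unit]
      Description=Clear PCIe ACS redirect bits for GPU/NIC P2P
      # Assumed workload services; ordering is ignored for units that do not exist.
      Before=kubelet.service slurmd.service
      [Service]
      Type=oneshot
      RemainAfterExit=yes
      ExecStart=/usr/local/sbin/disable-acs.sh
      [Install]
      WantedBy=multi-user.target
  notify: enable disable-acs

# roles/acs_disable/handlers/main.yml (sketch)
- name: enable disable-acs
  systemd:
    name: disable-acs.service
    enabled: true
    state: started
    daemon_reload: true
```

Type=oneshot with RemainAfterExit=yes lets `systemctl is-active disable-acs` report whether the script ran on the current boot.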

Role: nvidia_stack (driver, Fabric Manager, toolkit, DCGM)

# roles/nvidia_stack/tasks/main.yml  (Ubuntu/apt; verify exact package names for the configured repo)
# Tier-dependent driver package: datacenter uses the -open datacenter driver;
# RTX PRO/workstation and GeForce/consumer use a different package and skip Fabric Manager.
# Verify exact package names for your distro/branch. Keep this comment outside the
# folded scalar below: a '#' inside a block scalar becomes part of the value.
- name: Select driver package by gpu_tier
  set_fact:
    driver_pkg: >-
      {{ 'nvidia-driver-' ~ driver_branch ~ '-server-open' if gpu_tier == 'datacenter'
         else 'nvidia-driver-' ~ driver_branch ~ '-open' }}

- name: Install GPU driver + DCGM (pinned branch)
  apt:
    name:
      - "{{ driver_pkg }}"
      - "datacenter-gpu-manager-4-cuda13"              # DCGM 4.x (-cuda<major> suffix), feeds telemetry
      - "nvidia-container-toolkit"                      # bridge to the container runtime
    state: present
    update_cache: true
  notify: reboot node

- name: Install Fabric Manager (datacenter NVSwitch baseboards only)
  apt: { name: "nvidia-fabricmanager-{{ driver_branch }}", state: present, update_cache: true }
  when: gpu_tier == 'datacenter' and nvidia_nvswitch | bool

- name: Enable Fabric Manager on NVSwitch systems
  systemd: { name: nvidia-fabricmanager, enabled: true, state: started }
  when: gpu_tier == 'datacenter' and nvidia_nvswitch | bool

- name: Persistence mode via nvidia-persistenced
  systemd: { name: nvidia-persistenced, enabled: true, state: started }

- name: Configure containerd for the NVIDIA runtime
  command: nvidia-ctk runtime configure --runtime=containerd
  notify: restart containerd

- name: Generate the CDI specification for the GPUs
  command:
    cmd: nvidia-ctk cdi generate --output=/etc/cdi/nvidia.yaml
    creates: /etc/cdi/nvidia.yaml      # module argument, not a task keyword; skips the run once the spec exists

Role: rdma_fabric (InfiniBand / RoCE)

# roles/rdma_fabric/tasks/main.yml
- name: Install DOCA-OFED (RDMA stack, supersedes MLNX_OFED)
  apt: { name: doca-ofed, state: present, update_cache: true }
  notify: reboot node
  when: rdma_stage | default('all') in ['ofed', 'all']

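- name: Persist peer-memory module load for GPUDirect RDMA across reboots
  copy: { dest: /etc/modules-load.d/nvidia-peermem.conf, content: "nvidia_peermem\n" }
  when: rdma_stage | default('all') in ['peermem', 'all']

# modules-load.d only applies at the next boot, so load the module now as well
# (requires the community.general collection).
- name: Load nvidia_peermem immediately
  community.general.modprobe: { name: nvidia_peermem, state: present }
  when: rdma_stage | default('all') in ['peermem', 'all']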

- name: Persist NCCL fabric defaults (override per-job as needed)
  copy:
    dest: /etc/nccl.conf
    content: |
      NCCL_IB_HCA=mlx5
      NCCL_IB_GID_INDEX=3
      NCCL_NET_GDR_LEVEL=SYS
  when: rdma_stage | default('all') in ['peermem', 'all']

Role: validate (fail the play if a node is not healthy)

# roles/validate/tasks/main.yml
- name: GPUs visible
  command: nvidia-smi --query-gpu=name,driver_version,pstate --format=csv,noheader
  register: smi
  changed_when: false
  failed_when: smi.rc != 0

- name: Fabric Manager active on datacenter NVSwitch systems
  command: systemctl is-active nvidia-fabricmanager
  register: fm
  changed_when: false
  failed_when: gpu_tier == 'datacenter' and nvidia_nvswitch | bool and fm.stdout != "active"

- name: InfiniBand ports are ACTIVE
  shell: "ibstat | grep -c 'State: Active'"
  register: ibactive
  changed_when: false
  failed_when: ibactive.stdout | int < 1

- name: DCGM Level-3 diagnostic passes
  command: dcgmi diag -r 3
  register: diag
  changed_when: false
  failed_when: diag.rc != 0 or 'Fail' in diag.stdout   # also fail when dcgmi itself errors out

Don't-miss checklist

Pin driver_branch/cuda_branch and the DOCA-OFED version; commit the inventory to git (SRE and MLOps practices).
Run the ACS-disable service on every boot, before workloads start (performance tuning).
Start Fabric Manager only on NVSwitch systems, and assert it active in validate (the GPU software stack).
Gate fleet readiness on dcgmi diag and ibstat, not just driver presence (commissioning).
Re-run site.yml after any kernel upgrade so DKMS rebuilds are captured.
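
The pinning item in the checklist can be backed by a package hold on apt-based hosts. This is a sketch, assuming the driver_pkg fact set in nvidia_stack and the doca-ofed package name used earlier on this page; exact-version pinning would still be handled by the apt tasks themselves.

```yaml
# Sketch: hold the pinned packages so routine upgrades cannot move the branch
- name: Hold driver and DOCA-OFED packages
  dpkg_selections:
    name: "{{ item }}"
    selection: hold
  loop:
    - "{{ driver_pkg }}"
    - doca-ofed
```

A deliberate branch change then needs the hold released first (selection: install), which makes the upgrade an explicit, reviewable change.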

Failure modes

Driver installed but nvidia-fabricmanager masked/stopped: GPUs do not form the NVLink domain (reliability and RAS).
ACS re-enabled by a firmware/BIOS reset, silently halving P2P bandwidth.
nvidia_peermem not loaded: GPUDirect RDMA falls back to host-staged copies.
Non-idempotent shell tasks causing drift instead of convergence.
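
The first three failure modes above can each be turned into a check in the validate role. The tasks below are a sketch under the same assumptions as the rest of this page (the 0x1d mask from the ACS script, bash available on the node); verify them on one node before relying on them.

```yaml
# roles/validate/tasks/main.yml (additional checks, sketch)
- name: Fabric Manager is enabled, not masked or disabled
  command: systemctl is-enabled nvidia-fabricmanager
  register: fm_enabled
  changed_when: false
  failed_when: gpu_tier == 'datacenter' and nvidia_nvswitch | bool and fm_enabled.stdout != "enabled"

- name: No PCIe bridge still has ACS redirect bits set
  shell: |
    for dev in $(lspci -D | awk '/PCI bridge/ {print $1}'); do
      # Bridges without an ACS capability make setpci fail; skip them.
      cap=$(setpci -s "$dev" ECAP_ACS+0x6.w 2>/dev/null) || continue
      # Print the bridge address if any bit in the 0x1d mask is still set.
      [ $((0x$cap & 0x1d)) -eq 0 ] || echo "$dev"
    done
  args:
    executable: /bin/bash
  register: acs_left
  changed_when: false
  failed_when: acs_left.stdout | trim | length > 0

- name: nvidia_peermem is loaded
  command: grep -q '^nvidia_peermem ' /proc/modules
  register: peermem
  changed_when: false
  failed_when: peermem.rc != 0
```

All three tasks are read-only and set changed_when: false, so they can be re-run at any time without causing drift.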

Open questions & validation

Confirm package names and repo for the target distro (apt vs dnf) and pin exact versions.
Verify the ACS-disable mask against the specific PCIe bridges on the platform before fleet-wide use.
Decide host-installed driver vs the GPU Operator's driver containers (the Kubernetes platform); do not run both.

References

NVIDIA DeepOps (Ansible/Kubespray cluster deploy): https://github.com/NVIDIA/deepops
nvidia.nvidia_driver Ansible collection: https://galaxy.ansible.com/ui/repo/published/nvidia/nvidia_driver/
NVIDIA driver install guide: https://docs.nvidia.com/datacenter/tesla/tesla-installation-notes/index.html
Fabric Manager user guide (NVSwitch HGX/DGX scope): https://docs.nvidia.com/datacenter/tesla/fabric-manager-user-guide/index.html
NVIDIA RTX Enterprise / professional drivers (workstation tier): https://www.nvidia.com/en-us/drivers/
DOCA-OFED / host install: https://docs.nvidia.com/doca/sdk/doca-host+installation+and+upgrade/index.html
GPUDirect / ACS notes: https://docs.nvidia.com/cuda/gpudirect-rdma/index.html

Per-role cookbook pages

Each role, and the orchestration, has its own page covering variables, the real tasks, how to apply and verify, and failure modes.

Roles: base_tuning · nvidia_stack · rdma_fabric · mig · validate
Inventory & services: inventory & variables · PCIe ACS-disable service
Orchestration: site playbook

Related: Provisioning · Software Stack · Optimization · K8s Platform · Practices · Glossary