
Ansible site playbook

Scope: the site.yml that orchestrates a fleet bring-up. It covers host selection, privilege escalation, staged role order (base_tuning -> acs_disable -> rdma_fabric OFED stage -> nvidia_stack -> rdma_fabric peer-memory stage -> mig -> validate), handler flushing, pre-flight gates (OS, repo reachability), and rolling serial/max_fail_percentage batching, so the fleet converges in waves rather than in one big-bang pass. This page is the orchestration layer above the per-role pages; the node-level skeleton it builds on is in Ansible node & fabric bring-up.

Reference templates, drawn from the upstream Ansible playbook/keyword docs and the NVIDIA driver/Fabric Manager/DCGM docs. Nothing here was executed on hardware. Pin your driver_branch/cuda_branch, set the inventory vars from Ansible Inventory and Variables, and run the play against one canary host before a fleet roll. For a maintained, batteries-included deployer see NVIDIA DeepOps (References).

```mermaid
flowchart TB
  PRE["pre_tasks: OS + repo pre-flight (fail fast)"] --> B1["batch 1 (canary): serial 1"]
  B1 --> BASE["base_tuning"]
  BASE --> H1["flush: grub/initramfs/reboot"]
  H1 --> ACS["acs_disable"]
  ACS --> OFED["rdma_fabric stage=ofed"]
  OFED --> H2["flush: OFED reboot"]
  H2 --> STACK["nvidia_stack"]
  STACK --> H3["flush: driver reboot"]
  H3 --> PEER["rdma_fabric stage=peermem"]
  PEER --> MIG["mig"]
  MIG --> H4["flush: pending MIG reboot"]
  H4 --> VALIDATE["validate"]
  VALIDATE --> POST["post_tasks: assert node healthy"]
  POST --> B2["batch 2..N: serial 25%, max_fail_percentage gate"]
  B2 --> BASE
```

What it does

site.yml is a single play that targets the gpu_nodes group and applies the roles in dependency-ordered stages. The order is load-bearing: base_tuning (nouveau blacklist, kernel cmdline, governor, hugepages) must converge and reboot before the package-heavy roles run; acs_disable must run before workloads; DOCA-OFED must be present before the NVIDIA driver is built if nvidia_peermem is expected to link cleanly against the RDMA APIs; the NVIDIA driver and Fabric Manager must then be live before the peer-memory load; mig runs only where mig_enabled is true; validate runs last and fails the play on any unhealthy node. The play uses tasks: with ansible.builtin.import_role instead of a single roles: list because handlers must flush between stages, not only after all roles have run.¹ ⁶

Three orchestration concerns live here, not in the roles:

Pre-flight, fail-fast. pre_tasks assert the OS is a supported distro and that the package repos are reachable before any role mutates the node. A play that discovers an unreachable CUDA repo halfway through nvidia_stack leaves the node half-converged; pre-flight turns that into a clean abort.
Rolling batches. serial bounds how many hosts are mutated at once so the fleet is taken through bring-up in waves. Without it Ansible runs each task across every host in the play in parallel, limited only by forks (default 5),² which is exactly the big-bang a fleet operator must avoid: a bad driver branch could break every node in a single pass.
Batch failure gating. max_fail_percentage (with serial) aborts the rest of the play once more than the given percentage of hosts in the current batch fail (the threshold must be exceeded, not just met), so a systemic fault (wrong package name, repo outage, bad firmware) stops the roll instead of marching through the fleet.³ any_errors_fatal is the stricter variant: on the first failure Ansible finishes the failing task on the hosts in the current batch and then stops the play on all hosts.⁴
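
To make the batching and gating arithmetic concrete, the fragment below sketches only the play header, with a worked example in the comments. The 16-host fleet size is an assumption chosen for easy arithmetic and the placeholder task is hypothetical. The batch math follows the documented behaviour of serial (a percentage is taken of the total host count in the play, and the last list entry repeats until no hosts remain) and of max_fail_percentage (the threshold must be exceeded, not equalled).

```yaml
# Sketch only: play header with rolling batches and a failure gate.
# Assumed fleet: 16 hosts in gpu_nodes (illustrative number, not from this page).
- name: GPU node bring-up (batching sketch)
  hosts: gpu_nodes
  # Batches for 16 hosts: 1 canary, then 25% of 16 = 4 per wave -> 1, 4, 4, 4, 3.
  serial: [1, "25%"]
  # The gate trips when failed / batch_size * 100 is greater than 20:
  #   batch of 1: 1 failure = 100% -> abort
  #   batch of 4: 1 failure = 25%  -> abort
  #   batch of 3: 1 failure = 33%  -> abort
  max_fail_percentage: 20
  # Stricter alternative: the first failed task stops the play on all hosts.
  # any_errors_fatal: true
  tasks:
    - name: Placeholder task so the sketch parses as a complete play
      ansible.builtin.ping:
```

With these values a single failure aborts any batch of four or fewer hosts. In a batch of five, one failure is exactly 20%, which does not exceed the threshold, so the roll would continue.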

Handlers (rebuild initramfs, update grub, reboot, restart containerd, enable services) are defined once at the play level and notified by the roles. By default a notified handler runs after each of pre_tasks, roles/tasks, and post_tasks, in that order.⁵ That is too late for this chain: the base reboot must happen before OFED/driver work, OFED must reboot before the driver build when required, and the driver reboot must complete before nvidia_peermem and validate. The play therefore flushes handlers explicitly with meta: flush_handlers between stages.⁶ force_handlers: true guarantees a queued reboot/initramfs handler still runs on a host whose later task failed, so a node is never left with a pending-but-undelivered kernel change.⁷
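
As a rough illustration of how a role task reaches the play-level handlers, here is a minimal sketch of a task that could live in base_tuning. The file path, file contents, and task name are assumptions for illustration, and the real role may differ; only the handler name rebuild initramfs comes from this page.

```yaml
# roles/base_tuning/tasks/main.yml (hypothetical excerpt; the real role may differ)
- name: Blacklist nouveau so the NVIDIA driver can bind
  ansible.builtin.copy:
    dest: /etc/modprobe.d/blacklist-nouveau.conf
    owner: root
    group: root
    mode: "0644"
    content: |
      blacklist nouveau
      options nouveau modeset=0
  # Queued only when the file actually changes. The handler itself is defined
  # in site.yml and runs at the next meta: flush_handlers (or at the end of
  # the tasks section if nothing flushes earlier).
  notify: rebuild initramfs
```

On a second run the file is already in place, the task reports ok rather than changed, and no handler is queued, so no reboot follows.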

Variables

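Play-level and inventory variables this playbook reads. Node-shape vars (gpu_tier, driver_branch, cuda_branch, nvidia_nvswitch, mig_enabled) are owned by the inventory and documented in Ansible Inventory and Variables; the rest are orchestration knobs. bringup_serial and bringup_max_fail_pct feed the play-level serial and max_fail_percentage keywords, which Ansible templates before any host is in scope, so pass them as extra vars (-e) on the command line; inventory group_vars are not visible at that point. The pre-flight knobs (supported_distros, cuda_repo_url, repo_probe_timeout) are declared in the play's vars: block, which outranks inventory group_vars in Ansible's variable precedence, so override those with -e as well.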

| Variable | Scope | Default | Purpose |
| --- | --- | --- | --- |
| `gpu_nodes` (inventory group) | inventory | n/a | Target hosts; the play's `hosts:` value. |
| `gpu_tier` | inventory var | `datacenter` | `datacenter \| workstation`; selects driver package and whether FM runs (Ansible Inventory and Variables). |
| `driver_branch` | inventory var | `580` | Pinned NVIDIA driver branch (Driver Versions and Branches). |
| `cuda_branch` | inventory var | `13-0` | Pinned CUDA repo branch. |
| `nvidia_nvswitch` | inventory var | `true` | NVSwitch baseboard present -> Fabric Manager required (Fabric Manager). |
| `mig_enabled` | inventory var | `false` | Gate for the `mig` role (MIG). |
| `bringup_serial` | play var | `[1, "25%"]` | `serial` batches: 1 canary host, then 25% waves. List form is documented.² |
| `bringup_max_fail_pct` | play var | `20` | `max_fail_percentage` per batch; abort the roll past this.³ |
| `supported_distros` | play var | `["Ubuntu", "Debian"]` | Allowed `ansible_distribution` values for the pre-flight assert.⁸ |
| `cuda_repo_url` | play var | `https://developer.download.nvidia.com/compute/cuda/repos` | Repo base URL probed for reachability in pre-flight. |
| `repo_probe_timeout` | play var | `10` | Socket timeout (s) for the `uri` reachability probe.⁹ |
| `become` | play keyword | `true` | Privilege escalation; all roles mutate system state.¹ |
| `gather_facts` | play keyword | `true` | Run the `setup` module so `ansible_distribution` etc. are populated for pre-flight.⁸ |
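
One way to adjust the orchestration knobs without editing site.yml is an extra-vars file, sketched below. The file name and the specific values are hypothetical; the variable names are the ones in the table above. Extra vars have the highest precedence in Ansible, so they override both the defaults in the play and anything in inventory.

```yaml
# bringup-overrides.yml (hypothetical file name and values)
# Pass with: ansible-playbook -i inventory/hosts.ini site.yml -e @bringup-overrides.yml
bringup_serial: [1, "10%"]     # canary first, then smaller 10% waves
bringup_max_fail_pct: 0        # any failure in a batch exceeds 0% and aborts the roll
repo_probe_timeout: 20         # slower mirror: allow 20 s for the reachability probe
```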

Tasks

The orchestration play. The roles themselves (base_tuning, acs_disable, nvidia_stack, rdma_fabric, mig, validate) carry the idempotent task bodies. See the per-role pages. Every module below is ansible.builtin.*; the YAML is idempotent (assertions and probes are read-only, role tasks gate on when/creates/changed_when).

```yaml
# site.yml: fleet GPU-node bring-up orchestration.
- name: GPU node bring-up
  hosts: gpu_nodes
  become: true
  gather_facts: true

  # Rolling waves: one canary first, then 25% batches. Bound the blast radius.
  serial: "{{ bringup_serial | default([1, '25%']) }}"
  max_fail_percentage: "{{ bringup_max_fail_pct | default(20) }}"
  force_handlers: true            # deliver queued reboot/initramfs even if a later task failed

  vars:
    supported_distros: ["Ubuntu", "Debian"]
    cuda_repo_url: "https://developer.download.nvidia.com/compute/cuda/repos"
    repo_probe_timeout: 10

  pre_tasks:
    - name: Pre-flight - supported OS only
      ansible.builtin.assert:
        that:
          - ansible_distribution in supported_distros
        fail_msg: >-
          Unsupported distro {{ ansible_distribution }} {{ ansible_distribution_version }};
          this playbook targets {{ supported_distros | join(', ') }}.
        success_msg: "OS {{ ansible_distribution }} {{ ansible_distribution_version }} supported"
        quiet: true

    - name: Pre-flight - required inventory vars are set
      ansible.builtin.assert:
        that:
          - gpu_tier in ['datacenter', 'workstation']
          - driver_branch is defined
          - cuda_branch is defined
        fail_msg: "gpu_tier/driver_branch/cuda_branch must be set in inventory"
        quiet: true

    - name: Pre-flight - CUDA repo is reachable (fail fast before any mutation)
      ansible.builtin.uri:
        url: "{{ cuda_repo_url }}/"
        method: GET
        status_code: [200, 301, 302, 403]   # index may 403; we only prove the host answers
        timeout: "{{ repo_probe_timeout }}"
        validate_certs: true
      changed_when: false

  tasks:
    - name: Stage 1 - base tuning (kernel cmdline, nouveau, governor, hugepages)
      ansible.builtin.import_role:
        name: base_tuning

    - name: Apply base-tuning handlers before package stages
      ansible.builtin.meta: flush_handlers

    - name: Stage 2 - clear PCIe ACS redirect bits for this boot
      ansible.builtin.import_role:
        name: acs_disable

    - name: Stage 3 - install DOCA-OFED before NVIDIA driver build
      ansible.builtin.import_role:
        name: rdma_fabric
      vars:
        rdma_stage: ofed

    - name: Apply OFED reboot before the GPU driver is installed
      ansible.builtin.meta: flush_handlers

    - name: Stage 4 - NVIDIA driver, CUDA, Fabric Manager, toolkit, DCGM
      ansible.builtin.import_role:
        name: nvidia_stack

    - name: Apply driver/container-runtime handlers before peer-memory validation
      ansible.builtin.meta: flush_handlers

    - name: Stage 5 - load nvidia_peermem and write NCCL fabric defaults
      ansible.builtin.import_role:
        name: rdma_fabric
      vars:
        rdma_stage: peermem

    - name: Stage 6 - optional MIG geometry
      ansible.builtin.import_role:
        name: mig
      # With import_role this condition is applied to every task in the role.
      when: mig_enabled | default(false) | bool

    - name: Apply any MIG reset/reboot before validation
      ansible.builtin.meta: flush_handlers

    - name: Stage 7 - assert node health
      ansible.builtin.import_role:
        name: validate

  post_tasks:
    - name: Sign-off - GPUs visible after convergence
      ansible.builtin.command: nvidia-smi --query-gpu=count --format=csv,noheader
      register: smi_count
      changed_when: false
      # nvidia-smi prints one row per GPU, each carrying the total count,
      # so read the first row instead of casting the whole multi-line output.
      failed_when: >-
        smi_count.rc != 0 or
        (smi_count.stdout_lines | first | default('0') | trim | int) < 1

  handlers:
    - name: update grub
      ansible.builtin.command: update-grub
      changed_when: true
      notify: reboot node

    - name: rebuild initramfs
      ansible.builtin.command: update-initramfs -u
      changed_when: true
      notify: reboot node

    - name: reboot node
      ansible.builtin.reboot:
        reboot_timeout: 900
        post_reboot_delay: 30
        test_command: /bin/true       # shared handler also runs before the GPU driver exists

    - name: restart containerd
      ansible.builtin.systemd:
        name: containerd
        state: restarted

    - name: enable cpu-performance
      ansible.builtin.systemd:
        name: cpu-performance
        enabled: true
        state: started
        daemon_reload: true

    - name: enable disable-acs
      ansible.builtin.systemd:
        name: disable-acs
        enabled: true
        state: started
        daemon_reload: true
```

Notes that keep this correct and idempotent:

serial, max_fail_percentage, force_handlers, pre_tasks, tasks, post_tasks, and handlers are all valid play-level keywords.¹ meta: flush_handlers is a task action, hence its placement between staged import_role tasks.⁶
The repo probe accepts 403 because the directory index of the CUDA repo can refuse listing while the host is up; the goal is to prove the repo host answers, not to fetch the index. Narrow status_code to [200] if your mirror serves an index.
A single reboot node handler is notified by both update grub and rebuild initramfs; Ansible de-duplicates notifications, so the node reboots once per flush even if both fired.⁵
nvidia.nvidia_driver (the upstream collection) ships its own driver role; swap nvidia_stack for it if you prefer a vendor-maintained role and keep the rest of this orchestration (References).

Apply & verify

Always converge one canary first (bringup_serial's leading 1), confirm it, then let the 25% waves proceed.

```bash
# Syntax + reference resolution, no host changes.
ansible-playbook -i inventory/hosts.ini site.yml --syntax-check

# Dry run against the canary only (predict changes, mutate nothing).
ansible-playbook -i inventory/hosts.ini site.yml --limit gpu-01.dc1.internal --check --diff

# Real canary roll.
ansible-playbook -i inventory/hosts.ini site.yml --limit gpu-01.dc1.internal

# Fleet roll in waves once the canary is proven.
ansible-playbook -i inventory/hosts.ini site.yml
```

Expected signal of a clean run, per host:

The PLAY RECAP shows failed=0 and unreachable=0 for every host in the batch; changed>0 on the first roll (grub/initramfs/driver/reboot), trending to changed=0 on a re-run of an already-converged node; that convergence-to-zero is the idempotency proof.
The validate role tasks pass: nvidia-smi lists every GPU; on gpu_tier == datacenter with nvidia_nvswitch, systemctl is-active nvidia-fabricmanager is active; ibstat shows IB ports State: Active; dcgmi diag -r 3 reports no Fail.¹⁰
The post_tasks sign-off (nvidia-smi --query-gpu=count) returns a count >= 1 after the final reboot.

max_fail_percentage/any_errors_fatal are the inverse signal: if the canary or a wave trips a systemic fault, the roll aborts and the remaining fleet is left untouched. Read the failing task, fix the root cause (usually a package name, repo URL, or pinned branch), and re-run.

Failure modes

Big-bang roll (no serial). A bad driver_branch or repo state mutates the entire fleet in one parallel pass. Always set bringup_serial with a leading canary 1; gate waves with max_fail_percentage.² ³
Half-converged node from a repo outage. nvidia_stack fails mid-install because the CUDA repo is unreachable, leaving DKMS half-built. The pre_tasks uri probe is the guard; if it is removed, expect this. Recover with runbook: kernel upgrade, GPU missing.
Driver installed but never live (missed reboot). A role notified reboot node but the handler did not flush before validate, or a later task failed and swallowed it. force_handlers: true plus the explicit meta: flush_handlers prevent this; the symptom (modules built, GPUs absent until reboot) is the same as runbook: kernel upgrade, GPU missing.⁷ ⁶
FM stopped on an NVSwitch node after the driver rev. validate asserts nvidia-fabricmanager active; a version-skewed FM aborts on its compatibility check and GPUs do not form the NVLink domain. Diagnose with runbook: Fabric Manager failure.
Wave aborts on every host (systemic config error). Wrong package name, bad pinned branch, or unreachable mirror trips max_fail_percentage on the first batch. This is the intended behaviour: fix the inventory/role and re-run; it is not a node fault.
Re-run shows persistent changed. A role task is non-idempotent (a command/shell without creates/changed_when). Convergence must trend to changed=0; a task that reports changed every pass is drift, not convergence; fix it in the role (base-tuning).
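
The persistent-changed case above can be sketched as a before/after pair. The governor task is a hypothetical stand-in for whatever command a role runs, not the actual base_tuning code, and the sysfs path assumes a host that exposes cpufreq for cpu0.

```yaml
# Non-idempotent form: a bare command reports changed on every run.
- name: Set CPU governor (reports changed every pass)
  ansible.builtin.command: cpupower frequency-set -g performance

# Idempotent form: a read-only probe, then a mutation gated on its result.
- name: Read current CPU governor
  ansible.builtin.command: cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor
  register: governor
  changed_when: false            # a probe never counts as a change

- name: Set CPU governor only when it differs
  ansible.builtin.command: cpupower frequency-set -g performance
  when: (governor.stdout | trim) != 'performance'
```

On a converged node the second form skips the mutation, so the PLAY RECAP can reach changed=0; the first form can never get there.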

References

Ansible playbook strategies — serial rolling batches (number, percentage, list): https://docs.ansible.com/ansible/latest/playbook_guide/playbooks_strategies.html
Ansible error handling — max_fail_percentage, any_errors_fatal, force_handlers: https://docs.ansible.com/ansible/latest/playbook_guide/playbooks_error_handling.html
Ansible handlers — notify, default flush order, meta: flush_handlers: https://docs.ansible.com/ansible/latest/playbook_guide/playbooks_handlers.html
Ansible playbook keywords reference (play- and task-level keyword list): https://docs.ansible.com/ansible/latest/reference_appendices/playbooks_keywords.html
ansible.builtin.assert module (that, fail_msg, success_msg, quiet): https://docs.ansible.com/ansible/latest/collections/ansible/builtin/assert_module.html
ansible.builtin.uri module (url, method, status_code, timeout, validate_certs): https://docs.ansible.com/ansible/latest/collections/ansible/builtin/uri_module.html
ansible.builtin.reboot module (reboot_timeout, post_reboot_delay, test_command): https://docs.ansible.com/ansible/latest/collections/ansible/builtin/reboot_module.html
Ansible facts (ansible_distribution, ansible_distribution_version): https://docs.ansible.com/ansible/latest/playbook_guide/playbooks_vars_facts.html
NVIDIA DeepOps (Ansible/Kubespray cluster deploy): https://github.com/NVIDIA/deepops
nvidia.nvidia_driver Ansible collection: https://galaxy.ansible.com/ui/repo/published/nvidia/nvidia_driver/
NVIDIA Fabric Manager user guide (NVSwitch scope, version check): https://docs.nvidia.com/datacenter/tesla/fabric-manager-user-guide/index.html

1. Ansible playbook keywords reference: hosts, become, gather_facts, serial, max_fail_percentage, any_errors_fatal, force_handlers, pre_tasks, roles ("list of roles to be imported into the play"), post_tasks ("a list of tasks to execute after the tasks section"), handlers, and vars are play-level keywords. https://docs.ansible.com/ansible/latest/reference_appendices/playbooks_keywords.html
2. Ansible strategies: by default Ansible runs each task on all hosts in parallel; serial "completes the play on the specified number or percentage of hosts before starting the next batch", and accepts a number, a percentage ("30%"), or a list of batch sizes (e.g. [1, 5, 10]). Setting serial scopes failures to the batch. https://docs.ansible.com/ansible/latest/playbook_guide/playbooks_strategies.html
3. Ansible error handling: max_fail_percentage "applies to each batch when you use it with serial"; if more than the given percentage of hosts in a batch fail, the rest of the play is aborted. https://docs.ansible.com/ansible/latest/playbook_guide/playbooks_error_handling.html
4. Ansible error handling: with any_errors_fatal, when a task returns an error Ansible "finishes the fatal task on all hosts in the current batch and then stops executing the play on all hosts". https://docs.ansible.com/ansible/latest/playbook_guide/playbooks_error_handling.html
5. Ansible handlers: tasks notify handlers with notify; by default notified handlers run after all tasks in a play complete, automatically after pre_tasks, roles/tasks, and post_tasks in that order. Duplicate notifications of the same handler are de-duplicated. https://docs.ansible.com/ansible/latest/playbook_guide/playbooks_handlers.html
6. Ansible handlers: "the meta: flush_handlers task triggers any handlers that have been notified at that point in the play", running them immediately instead of at end-of-play. https://docs.ansible.com/ansible/latest/playbook_guide/playbooks_handlers.html
7. Ansible error handling: when handlers are forced (force_handlers: true / --force-handlers), "Ansible will run all notified handlers on all hosts, even hosts with failed tasks". https://docs.ansible.com/ansible/latest/playbook_guide/playbooks_error_handling.html
8. Ansible facts: with gather_facts: true the play runs the setup module at the start, populating standard facts including ansible_distribution, ansible_distribution_version, and ansible_distribution_major_version (also available as ansible_facts['distribution']). https://docs.ansible.com/ansible/latest/playbook_guide/playbooks_vars_facts.html
9. ansible.builtin.uri: url, method (default GET), status_code (list of HTTP codes that signify success, default [200]), timeout (socket timeout seconds, default 30), validate_certs (default true). https://docs.ansible.com/ansible/latest/collections/ansible/builtin/uri_module.html
10. Validation signals match the node-level validate role: nvidia-smi lists GPUs; systemctl is-active nvidia-fabricmanager is active on datacenter NVSwitch systems; ibstat ports State: Active; dcgmi diag -r 3 with no Fail (Ansible: Node and Fabric Bring-Up, GPU Diagnostics and Validation). FM is required only on NVSwitch baseboards. https://docs.nvidia.com/datacenter/tesla/fabric-manager-user-guide/index.html