Ansible site playbook

Scope: the site.yml that orchestrates a fleet bring-up. It covers host selection, privilege escalation, staged role order (base_tuning -> acs_disable -> rdma_fabric OFED stage -> nvidia_stack -> rdma_fabric peer-memory stage -> mig -> validate), handler flushing, pre-flight gates (OS, repo reachability), and rolling serial/max_fail_percentage batching so the fleet converges in waves, never big-bang. This is the orchestration layer above the per-role pages; the node-level skeleton it builds on is in Ansible node & fabric bring-up.

Reference templates, drawn from the upstream Ansible playbook/keyword docs and the NVIDIA driver/Fabric Manager/DCGM docs. Nothing here was executed on hardware. Pin your driver_branch/cuda_branch, set the inventory vars from Ansible Inventory and Variables, and run the play against one canary host before a fleet roll. For a maintained, batteries-included deployer see NVIDIA DeepOps (References).

flowchart TB
  PRE["pre_tasks: OS + repo pre-flight (fail fast)"] --> B1["batch 1 (canary): serial 1"]
  B1 --> BASE["base_tuning"]
  BASE --> H1["flush: grub/initramfs/reboot"]
  H1 --> ACS["acs_disable"]
  ACS --> OFED["rdma_fabric stage=ofed"]
  OFED --> H2["flush: OFED reboot"]
  H2 --> STACK["nvidia_stack"]
  STACK --> H3["flush: driver reboot"]
  H3 --> PEER["rdma_fabric stage=peermem"]
  PEER --> MIG["mig"]
  MIG --> H4["flush: pending MIG reboot"]
  H4 --> VALIDATE["validate"]
  VALIDATE --> POST["post_tasks: assert node healthy"]
  POST --> B2["batch 2..N: serial 25%, max_fail_percentage gate"]
  B2 --> BASE

What it does

site.yml is a single play that targets the gpu_nodes group and applies the roles in dependency-ordered stages. The order is load-bearing: base_tuning (nouveau blacklist, kernel cmdline, governor, hugepages) must converge and reboot before package-heavy roles; acs_disable must run before workloads; DOCA-OFED must be present before the NVIDIA driver is built if nvidia_peermem is expected to link cleanly against the RDMA APIs; the NVIDIA driver and Fabric Manager must then be live before the peer-memory load; mig only runs where mig_enabled; validate runs last and fails the play on any unhealthy node. The play uses tasks: + ansible.builtin.import_role instead of one roles: list because handlers must flush between stages, not only after all roles have run.[1][6]

Three orchestration concerns live here, not in the roles:

  • Pre-flight, fail-fast. pre_tasks assert the OS is a supported distro and that the package repos are reachable before any role mutates the node. A play that discovers an unreachable CUDA repo halfway through nvidia_stack leaves the node half-converged; pre-flight turns that into a clean abort.
  • Rolling batches. serial bounds how many hosts are mutated at once so the fleet is taken through bring-up in waves. Without it Ansible runs every task on all hosts in parallel, limited only by the forks setting,[2] which is exactly the big-bang a fleet operator must avoid: a bad driver branch would brick every node in one pass.
  • Batch failure gating. max_fail_percentage (with serial) aborts the whole play once more than the given percentage of hosts in the current batch fail, so a systemic fault (wrong package name, repo outage, bad firmware) stops the roll instead of marching through the fleet.[3] any_errors_fatal is the stricter variant: on the first failure Ansible finishes that task across the current batch, then stops the play on all hosts.[4]
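
The wave behaviour described above can be previewed without touching a node. A minimal sketch, assuming a hypothetical helper file named preview-batches.yml and the same gpu_nodes group; ansible_play_batch and ansible_play_hosts_all are standard Ansible magic variables, and run_once combined with serial runs the task once per batch:

```yaml
# preview-batches.yml (hypothetical helper, not part of site.yml)
# Prints the hosts in each serial wave; it changes nothing on the nodes.
- name: Preview bring-up waves
  hosts: gpu_nodes
  gather_facts: false          # no facts needed just to list batches
  become: false
  serial: [1, "25%"]           # same shape as the bringup_serial default

  tasks:
    - name: Show the hosts in this wave
      ansible.builtin.debug:
        msg: >-
          wave of {{ ansible_play_batch | length }} host(s)
          out of {{ ansible_play_hosts_all | length }}:
          {{ ansible_play_batch | join(', ') }}
      run_once: true           # with serial, this runs once per batch
```

With a list value, the last entry repeats until the group is exhausted, so the output should show one single-host canary wave followed by the percentage waves.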

Handlers (rebuild initramfs, update grub, reboot, restart containerd, enable services) are defined once at the play level and notified by the roles. By default a notified handler runs after each of pre_tasks, roles/tasks, and post_tasks, in that order.[5] That is too late for this chain: the base reboot must happen before OFED/driver work, OFED must reboot before the driver build when required, and the driver reboot must complete before nvidia_peermem and validate. The play therefore flushes handlers explicitly with meta: flush_handlers between stages.[6] force_handlers: true guarantees a queued reboot/initramfs handler still runs on a host whose later task failed, so a node is never left with a pending-but-undelivered kernel change.[7]

Variables

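Play-level and inventory variables this playbook reads. Node-shape vars (gpu_tier, driver_branch, cuda_branch, nvidia_nvswitch, mig_enabled) are owned by the inventory and documented in Ansible Inventory and Variables. The rest are orchestration knobs set in the play's vars: block or with -e on the command line. Inventory group_vars cannot override them: play vars outrank group_vars, and serial and max_fail_percentage are templated before any host context exists.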

| Variable | Scope | Default | Purpose |
| --- | --- | --- | --- |
| gpu_nodes (inventory group) | inventory | | Target hosts; the play's hosts: value. |
| gpu_tier | inventory var | datacenter | datacenter \| workstation; selects driver package and whether FM runs (Ansible Inventory and Variables). |
| driver_branch | inventory var | 580 | Pinned NVIDIA driver branch (Driver Versions and Branches). |
| cuda_branch | inventory var | 13-0 | Pinned CUDA repo branch. |
| nvidia_nvswitch | inventory var | true | NVSwitch baseboard present -> Fabric Manager required (Fabric Manager). |
| mig_enabled | inventory var | false | Gate for the mig role (MIG). |
| bringup_serial | play var | [1, "25%"] | serial batches: 1 canary host, then 25% waves. List form is documented.[2] |
| bringup_max_fail_pct | play var | 20 | max_fail_percentage per batch; abort the roll past this.[3] |
| supported_distros | play var | ["Ubuntu", "Debian"] | Allowed ansible_distribution values for the pre-flight assert.[8] |
| cuda_repo_url | play var | https://developer.download.nvidia.com/compute/cuda/repos | Repo base URL probed for reachability in pre-flight. |
| repo_probe_timeout | play var | 10 | Socket timeout (s) for the uri reachability probe.[9] |
| become | play keyword | true | Privilege escalation; all roles mutate system state.[1] |
| gather_facts | play keyword | true | Run the setup module so ansible_distribution etc. are populated for pre-flight.[8] |
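
One way to override the orchestration knobs without editing the play is an extra-vars file, since extra vars take the highest precedence in Ansible. A sketch, assuming a hypothetical file name bringup-overrides.yml and purely illustrative values; the variable names are the ones listed in the table above:

```yaml
# bringup-overrides.yml (hypothetical file name; values are illustrative only)
# Pass with: ansible-playbook -i inventory/hosts.ini site.yml -e @bringup-overrides.yml
bringup_serial: [1, "10%"]      # canary first, then smaller waves than the 25% default
bringup_max_fail_pct: 10        # abort the roll earlier than the default 20
repo_probe_timeout: 20          # seconds; gives a slower mirror time to answer the probe
supported_distros: ["Ubuntu"]   # narrow the pre-flight assert to a single distro
```

Keeping overrides in a file rather than inline -e key=value pairs preserves the list type of bringup_serial and leaves a reviewable record of what a given roll used.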

Tasks

The orchestration play. The roles themselves (base_tuning, acs_disable, nvidia_stack, rdma_fabric, mig, validate) carry the idempotent task bodies. See the per-role pages. Every module below is ansible.builtin.*; the YAML is idempotent (assertions and probes are read-only, role tasks gate on when/creates/changed_when).

# site.yml — fleet GPU-node bring-up orchestration.
- name: GPU node bring-up
  hosts: gpu_nodes
  become: true
  gather_facts: true

  # Rolling waves: one canary first, then 25% batches. Bound the blast radius.
  serial: "{{ bringup_serial | default([1, '25%']) }}"
  max_fail_percentage: "{{ bringup_max_fail_pct | default(20) }}"
  force_handlers: true            # deliver queued reboot/initramfs even if a later task failed

  vars:
    supported_distros: ["Ubuntu", "Debian"]
    cuda_repo_url: "https://developer.download.nvidia.com/compute/cuda/repos"
    repo_probe_timeout: 10

  pre_tasks:
    - name: Pre-flight - supported OS only
      ansible.builtin.assert:
        that:
          - ansible_distribution in supported_distros
        fail_msg: >-
          Unsupported distro {{ ansible_distribution }} {{ ansible_distribution_version }};
          this playbook targets {{ supported_distros | join(', ') }}.
        success_msg: "OS {{ ansible_distribution }} {{ ansible_distribution_version }} supported"
        quiet: true

    - name: Pre-flight - required inventory vars are set
      ansible.builtin.assert:
        that:
          - gpu_tier in ['datacenter', 'workstation']
          - driver_branch is defined
          - cuda_branch is defined
        fail_msg: "gpu_tier/driver_branch/cuda_branch must be set in inventory"
        quiet: true

    - name: Pre-flight - CUDA repo is reachable (fail fast before any mutation)
      ansible.builtin.uri:
        url: "{{ cuda_repo_url }}/"
        method: GET
        status_code: [200, 301, 302, 403]   # index may 403; we only prove the host answers
        timeout: "{{ repo_probe_timeout }}"
        validate_certs: true
      changed_when: false

  tasks:
    - name: Stage 1 - base tuning (kernel cmdline, nouveau, governor, hugepages)
      ansible.builtin.import_role:
        name: base_tuning

    - name: Apply base-tuning handlers before package stages
      ansible.builtin.meta: flush_handlers

    - name: Stage 2 - clear PCIe ACS redirect bits for this boot
      ansible.builtin.import_role:
        name: acs_disable

    - name: Stage 3 - install DOCA-OFED before NVIDIA driver build
      ansible.builtin.import_role:
        name: rdma_fabric
      vars:
        rdma_stage: ofed

    - name: Apply OFED reboot before the GPU driver is installed
      ansible.builtin.meta: flush_handlers

    - name: Stage 4 - NVIDIA driver, CUDA, Fabric Manager, toolkit, DCGM
      ansible.builtin.import_role:
        name: nvidia_stack

    - name: Apply driver/container-runtime handlers before peer-memory validation
      ansible.builtin.meta: flush_handlers

    - name: Stage 5 - load nvidia_peermem and write NCCL fabric defaults
      ansible.builtin.import_role:
        name: rdma_fabric
      vars:
        rdma_stage: peermem

    - name: Stage 6 - optional MIG geometry
      ansible.builtin.import_role:
        name: mig
      when: mig_enabled | default(false) | bool   # gate at the play so mig only runs where mig_enabled

    - name: Apply any MIG reset/reboot before validation
      ansible.builtin.meta: flush_handlers

    - name: Stage 7 - assert node health
      ansible.builtin.import_role:
        name: validate

  post_tasks:
    # nvidia-smi prints one row per GPU, so count rows; casting multi-line stdout to int yields 0.
    - name: Sign-off - GPUs visible after convergence
      ansible.builtin.command: nvidia-smi --query-gpu=count --format=csv,noheader
      register: smi_count
      changed_when: false
      failed_when: smi_count.rc != 0 or (smi_count.stdout_lines | length) < 1

  handlers:
    - name: update grub
      ansible.builtin.command: update-grub
      changed_when: true
      notify: reboot node

    - name: rebuild initramfs
      ansible.builtin.command: update-initramfs -u
      changed_when: true
      notify: reboot node

    - name: reboot node
      ansible.builtin.reboot:
        reboot_timeout: 900
        post_reboot_delay: 30
        test_command: /bin/true       # shared handler also runs before the GPU driver exists

    - name: restart containerd
      ansible.builtin.systemd:
        name: containerd
        state: restarted

    - name: enable cpu-performance
      ansible.builtin.systemd:
        name: cpu-performance
        enabled: true
        state: started
        daemon_reload: true

    - name: enable disable-acs
      ansible.builtin.systemd:
        name: disable-acs
        enabled: true
        state: started
        daemon_reload: true

Notes that keep this correct and idempotent:

  • serial, max_fail_percentage, force_handlers, pre_tasks, tasks, post_tasks, and handlers are all valid play-level keywords.[1] meta: flush_handlers is a task action, hence its placement between staged import_role tasks.[6]
  • The repo probe accepts 403 because the directory index of the CUDA repo can refuse listing while the host is up; the goal is to prove the repo host answers, not to fetch the index. Narrow status_code to [200] if your mirror serves an index.
  • A single reboot node handler is notified by both update grub and rebuild initramfs; Ansible de-duplicates notifications, so the node reboots once per flush even if both fired.[5]
  • nvidia.nvidia_driver (the upstream collection) ships its own driver role; swap nvidia_stack for it if you prefer a vendor-maintained role and keep the rest of this orchestration (References).

Apply & verify

Always converge one canary first (bringup_serial's leading 1), confirm it, then let the 25% waves proceed.

# Syntax + reference resolution, no host changes.
ansible-playbook -i inventory/hosts.ini site.yml --syntax-check

# Dry run against the canary only (predict changes, mutate nothing).
ansible-playbook -i inventory/hosts.ini site.yml --limit gpu-01.dc1.internal --check --diff

# Real canary roll.
ansible-playbook -i inventory/hosts.ini site.yml --limit gpu-01.dc1.internal

# Fleet roll in waves once the canary is proven.
ansible-playbook -i inventory/hosts.ini site.yml

Expected signal of a clean run, per host:

  • The PLAY RECAP shows failed=0 and unreachable=0 for every host in the batch; changed>0 on the first roll (grub/initramfs/driver/reboot), trending to changed=0 on a re-run of an already-converged node; that convergence-to-zero is the idempotency proof.
  • The validate role tasks pass: nvidia-smi lists every GPU; on gpu_tier == datacenter with nvidia_nvswitch, systemctl is-active nvidia-fabricmanager is active; ibstat shows IB ports State: Active; dcgmi diag -r 3 reports no Fail.[10]
  • The post_tasks sign-off (nvidia-smi --query-gpu=count) returns a count >= 1 after the final reboot.
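
The signals listed above could be expressed as read-only tasks along these lines. This is a sketch under stated assumptions, not the validate role itself: the task layout and the registered variable names (fm_state, ib_state, dcgm_diag) are hypothetical, and the string matches simply mirror the wording of the expected output above.

```yaml
# Hypothetical validate-style checks; the real ones live in the validate role.
- name: Fabric Manager is active on NVSwitch datacenter nodes
  ansible.builtin.command: systemctl is-active nvidia-fabricmanager
  register: fm_state
  changed_when: false
  failed_when: (fm_state.stdout | trim) != 'active'
  when:
    - gpu_tier == 'datacenter'
    - nvidia_nvswitch | default(false) | bool

- name: At least one InfiniBand port is Active
  ansible.builtin.command: ibstat
  register: ib_state
  changed_when: false
  # Quoted because the expression contains a colon followed by a space.
  failed_when: "ib_state.rc != 0 or 'State: Active' not in ib_state.stdout"

- name: DCGM level-3 diagnostic reports no Fail
  ansible.builtin.command: dcgmi diag -r 3
  register: dcgm_diag
  changed_when: false
  failed_when: dcgm_diag.rc != 0 or 'Fail' in dcgm_diag.stdout
```

Every task sets changed_when: false, so these checks do not disturb the changed=0 signal on a re-run.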

max_fail_percentage/any_errors_fatal are the inverse signal: if the canary or a wave trips a systemic fault, the roll aborts and the remaining fleet is left untouched. Read the failing task, fix the root cause (usually a package name, repo URL, or pinned branch), and re-run.
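
Where a single failure should stop the roll outright, any_errors_fatal can stand in for the percentage gate. A sketch of the play header only, assuming the same group and role names as site.yml; the play name is hypothetical and the remaining stages are elided:

```yaml
# Hypothetical strict variant of the site.yml header; not the documented default.
- name: GPU node bring-up (strict gating sketch)
  hosts: gpu_nodes
  become: true
  gather_facts: true
  serial: [1, "25%"]           # still roll in waves, canary first
  any_errors_fatal: true       # first failed host stops the play for every host
  force_handlers: true         # queued reboot/initramfs handlers still run

  tasks:
    - name: Stage 1 - base tuning
      ansible.builtin.import_role:
        name: base_tuning
    # remaining stages as in site.yml
```

The trade-off is that one flaky node halts the whole fleet roll, which suits a rehearsal or a small fleet better than a large one.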

Failure modes

  • Big-bang roll (no serial). A bad driver_branch or repo state mutates the entire fleet in one parallel pass. Always set bringup_serial with a leading canary 1; gate waves with max_fail_percentage.[2][3]
  • Half-converged node from a repo outage. nvidia_stack fails mid-install because the CUDA repo is unreachable, leaving DKMS half-built. The pre_tasks uri probe is the guard; if it is removed, expect this. Recover with runbook: kernel upgrade, GPU missing.
  • Driver installed but never live (missed reboot). A role notified reboot node but the handler did not flush before validate, or a later task failed and swallowed it. force_handlers: true plus the explicit meta: flush_handlers prevent this; the symptom (modules built, GPUs absent until reboot) is the same as runbook: kernel upgrade, GPU missing.[6][7]
  • FM stopped on an NVSwitch node after the driver rev. validate asserts nvidia-fabricmanager active; a version-skewed FM aborts on its compatibility check and GPUs do not form the NVLink domain. Diagnose with runbook: Fabric Manager failure.
  • Wave aborts on every host (systemic config error). Wrong package name, bad pinned branch, or unreachable mirror trips max_fail_percentage on the first batch. This is the intended behaviour: fix the inventory/role and re-run; it is not a node fault.
  • Re-run shows persistent changed. A role task is non-idempotent (a command/shell without creates/changed_when). Convergence must trend to changed=0; a task that reports changed on every pass is drift, not convergence. Fix it in the role (base_tuning).
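
The last failure mode is usually fixed with one of three guards. A minimal sketch with hypothetical task bodies; the script path and marker file are illustrative assumptions, and only the rebuild initramfs handler name comes from the play above:

```yaml
# Hypothetical role tasks showing three idempotency guards.
- name: Blacklist nouveau (declarative module; changed only when content differs)
  ansible.builtin.copy:
    dest: /etc/modprobe.d/blacklist-nouveau.conf
    content: |
      blacklist nouveau
      options nouveau modeset=0
    owner: root
    group: root
    mode: "0644"
  notify: rebuild initramfs

- name: Run a one-time setup script (skipped once the marker file exists)
  ansible.builtin.command:
    cmd: /usr/local/sbin/example-setup.sh      # hypothetical script
    creates: /var/lib/example-setup.done       # hypothetical marker the script writes

- name: Read the CPU governor (read-only probe never reports changed)
  ansible.builtin.command: cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor
  register: governor
  changed_when: false
```

Preferring a declarative module over command where one exists keeps the handler notification tied to a real change, so a converged node does not reboot again.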

References

  • Ansible playbook strategies — serial rolling batches (number, percentage, list): https://docs.ansible.com/ansible/latest/playbook_guide/playbooks_strategies.html
  • Ansible error handling — max_fail_percentage, any_errors_fatal, force_handlers: https://docs.ansible.com/ansible/latest/playbook_guide/playbooks_error_handling.html
  • Ansible handlers — notify, default flush order, meta: flush_handlers: https://docs.ansible.com/ansible/latest/playbook_guide/playbooks_handlers.html
  • Ansible playbook keywords reference (play- and task-level keyword list): https://docs.ansible.com/ansible/latest/reference_appendices/playbooks_keywords.html
  • ansible.builtin.assert module (that, fail_msg, success_msg, quiet): https://docs.ansible.com/ansible/latest/collections/ansible/builtin/assert_module.html
  • ansible.builtin.uri module (url, method, status_code, timeout, validate_certs): https://docs.ansible.com/ansible/latest/collections/ansible/builtin/uri_module.html
  • ansible.builtin.reboot module (reboot_timeout, post_reboot_delay, test_command): https://docs.ansible.com/ansible/latest/collections/ansible/builtin/reboot_module.html
  • Ansible facts (ansible_distribution, ansible_distribution_version): https://docs.ansible.com/ansible/latest/playbook_guide/playbooks_vars_facts.html
  • NVIDIA DeepOps (Ansible/Kubespray cluster deploy): https://github.com/NVIDIA/deepops
  • nvidia.nvidia_driver Ansible collection: https://galaxy.ansible.com/ui/repo/published/nvidia/nvidia_driver/
  • NVIDIA Fabric Manager user guide (NVSwitch scope, version check): https://docs.nvidia.com/datacenter/tesla/fabric-manager-user-guide/index.html

Related: Ansible Node & Fabric Bring-Up · Inventory & Node Model · Base Tuning Role · NVIDIA Stack Role · MIG Configuration Role · Validate Health Role · Fabric Manager Failure · Kernel/GPU Missing · Glossary


  1. Ansible playbook keywords reference: hosts, become, gather_facts, serial, max_fail_percentage, any_errors_fatal, force_handlers, pre_tasks, roles ("list of roles to be imported into the play"), post_tasks ("a list of tasks to execute after the tasks section"), handlers, and vars are play-level keywords. https://docs.ansible.com/ansible/latest/reference_appendices/playbooks_keywords.html 

  2. Ansible strategies: by default Ansible runs each task on all hosts in parallel; serial "completes the play on the specified number or percentage of hosts before starting the next batch", and accepts a number, a percentage ("30%"), or a list of batch sizes (e.g. [1, 5, 10]). Setting serial scopes failures to the batch. https://docs.ansible.com/ansible/latest/playbook_guide/playbooks_strategies.html 

  3. Ansible error handling: max_fail_percentage "applies to each batch when you use it with serial" — if more than the given percentage of hosts in a batch fail, the rest of the play is aborted. https://docs.ansible.com/ansible/latest/playbook_guide/playbooks_error_handling.html 

  4. Ansible error handling: with any_errors_fatal, when a task returns an error Ansible "finishes the fatal task on all hosts in the current batch and then stops executing the play on all hosts". https://docs.ansible.com/ansible/latest/playbook_guide/playbooks_error_handling.html 

  5. Ansible handlers: tasks notify handlers with notify; by default notified handlers run after all tasks in a play complete, automatically after pre_tasks, roles/tasks, and post_tasks in that order. Duplicate notifications of the same handler are de-duplicated. https://docs.ansible.com/ansible/latest/playbook_guide/playbooks_handlers.html 

  6. Ansible handlers: "the meta: flush_handlers task triggers any handlers that have been notified at that point in the play", running them immediately instead of at end-of-play. https://docs.ansible.com/ansible/latest/playbook_guide/playbooks_handlers.html 

  7. Ansible error handling: when handlers are forced (force_handlers: true / --force-handlers), "Ansible will run all notified handlers on all hosts, even hosts with failed tasks". https://docs.ansible.com/ansible/latest/playbook_guide/playbooks_error_handling.html 

  8. Ansible facts: with gather_facts: true the play runs the setup module at the start, populating standard facts including ansible_distribution, ansible_distribution_version, and ansible_distribution_major_version (also available as ansible_facts['distribution']). https://docs.ansible.com/ansible/latest/playbook_guide/playbooks_vars_facts.html 

  9. ansible.builtin.uri: url, method (default GET), status_code (list of HTTP codes that signify success, default [200]), timeout (socket timeout seconds, default 30), validate_certs (default true). https://docs.ansible.com/ansible/latest/collections/ansible/builtin/uri_module.html 

  10. Validation signals match the node-level validate role: nvidia-smi lists GPUs; systemctl is-active nvidia-fabricmanager is active on datacenter NVSwitch systems; ibstat ports State: Active; dcgmi diag -r 3 with no Fail (Ansible: Node and Fabric Bring-Up, GPU Diagnostics and Validation). FM is required only on NVSwitch baseboards. https://docs.nvidia.com/datacenter/tesla/fabric-manager-user-guide/index.html