# Ansible site playbook
Scope: the `site.yml` play that orchestrates a fleet bring-up. It covers host selection, privilege escalation, staged role order (`base_tuning` -> `acs_disable` -> `rdma_fabric` OFED stage -> `nvidia_stack` -> `rdma_fabric` peer-memory stage -> `mig` -> `validate`), handler flushing, pre-flight gates (OS, repo reachability), and rolling `serial`/`max_fail_percentage` batching so the fleet is converged in waves, never big-bang. This is the orchestration layer above the per-role pages; the node-level skeleton it builds on is in Ansible node & fabric bring-up.
These are reference templates, drawn from the upstream Ansible playbook/keyword docs and the NVIDIA driver/Fabric Manager/DCGM docs. Nothing here was executed on hardware. Pin your `driver_branch`/`cuda_branch`, set the inventory vars from Ansible Inventory and Variables, and run the play against one canary host before a fleet roll. For a maintained, batteries-included deployer see NVIDIA DeepOps (References).
```mermaid
flowchart TB
    PRE["pre_tasks: OS + repo pre-flight (fail fast)"] --> B1["batch 1 (canary): serial 1"]
    B1 --> BASE["base_tuning"]
    BASE --> H1["flush: grub/initramfs/reboot"]
    H1 --> ACS["acs_disable"]
    ACS --> OFED["rdma_fabric stage=ofed"]
    OFED --> H2["flush: OFED reboot"]
    H2 --> STACK["nvidia_stack"]
    STACK --> H3["flush: driver reboot"]
    H3 --> PEER["rdma_fabric stage=peermem"]
    PEER --> MIG["mig"]
    MIG --> H4["flush: pending MIG reboot"]
    H4 --> VALIDATE["validate"]
    VALIDATE --> POST["post_tasks: assert node healthy"]
    POST --> B2["batch 2..N: serial 25%, max_fail_percentage gate"]
    B2 --> BASE
```
## What it does
`site.yml` is a single play that targets the `gpu_nodes` group and applies the roles in dependency-ordered stages. The order is load-bearing: `base_tuning` (nouveau blacklist, kernel cmdline, governor, hugepages) must converge and reboot before package-heavy roles; `acs_disable` must run before workloads; DOCA-OFED must be present before the NVIDIA driver is built if `nvidia_peermem` is expected to link cleanly against the RDMA APIs; the NVIDIA driver and Fabric Manager must then be live before the peer-memory load; `mig` only runs where `mig_enabled` is true; `validate` runs last and fails the play on any unhealthy node. The play uses `tasks:` + `ansible.builtin.import_role` instead of one `roles:` list because handlers must flush between stages, not only after all roles have run.[^1][^6]
Three orchestration concerns live here, not in the roles:
- Pre-flight, fail-fast. `pre_tasks` assert the OS is a supported distro and that the package repos are reachable before any role mutates the node. A play that discovers an unreachable CUDA repo halfway through `nvidia_stack` leaves the node half-converged; pre-flight turns that into a clean abort.
- Rolling batches. `serial` bounds how many hosts are mutated at once so the fleet is taken through bring-up in waves. Without it Ansible runs every task on all hosts in parallel (limited only by the fork count),[^2] which is exactly the big-bang a fleet operator must avoid: a bad driver branch would brick every node in one pass.
- Batch failure gating. `max_fail_percentage` (with `serial`) aborts the whole play once too many hosts in the current batch fail, so a systemic fault (wrong package name, repo outage, bad firmware) stops the roll instead of marching through the fleet.[^3] `any_errors_fatal` is the stricter variant: on the first failure Ansible finishes the failing task across the current batch and then stops the play on all hosts.[^4]
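As a worked illustration of the batching arithmetic, the sketch below assumes a hypothetical 16-host `gpu_nodes` group and the default knobs. The wave sizes in the comments follow the documented `serial` and `max_fail_percentage` rules (a percentage is taken of the play's host count, and the failure threshold must be exceeded, not equaled); they are reasoning from the docs, not output captured from a run.

```yaml
# Sketch only: assumes 16 hosts in gpu_nodes (hypothetical fleet size).
- name: Batching illustration
  hosts: gpu_nodes
  gather_facts: false
  serial: [1, "25%"]       # 25% of 16 hosts = 4 hosts per wave
  max_fail_percentage: 20  # must be exceeded, not equaled
  # Expected waves: 1 (canary), 4, 4, 4, 3 (the last wave takes the remainder).
  # Canary wave: 1 failure of 1 host  = 100% > 20% -> play aborts.
  # 4-host wave: 1 failure of 4 hosts = 25%  > 20% -> play aborts.
  # 5-host wave: 1 failure of 5 hosts = 20%, not exceeded -> roll continues.
  tasks:
    - name: Report the size of the current wave
      ansible.builtin.debug:
        msg: "wave of {{ ansible_play_batch | length }} host(s)"
      run_once: true
```

Under these assumptions a single failed host stops the roll in any wave of four or fewer hosts, so the 20% threshold only begins to tolerate failures once waves reach five hosts or more.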
Handlers (rebuild initramfs, update grub, reboot, restart containerd, enable services) are defined once at the play level and notified by the roles. By default notified handlers run at the end of each of the `pre_tasks`, `roles`/`tasks`, and `post_tasks` sections, in that order.[^5] That is too late for this chain: the base reboot must happen before OFED/driver work, OFED must reboot before the driver build when required, and the driver reboot must complete before `nvidia_peermem` and `validate`. The play therefore flushes handlers explicitly with `meta: flush_handlers` between stages.[^6] `force_handlers: true` guarantees a queued reboot/initramfs handler still runs on a host whose later task failed, so a node is never left with a pending-but-undelivered kernel change.[^7]
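A minimal sketch of the notify-then-flush pattern described above. The role file path, the task, and the `base_tuning_cmdline` variable are hypothetical; only the handler name `update grub` and the placement of `meta: flush_handlers` come from this page.

```yaml
# roles/base_tuning/tasks/main.yml (hypothetical excerpt)
- name: Set the kernel command line
  ansible.builtin.lineinfile:
    path: /etc/default/grub
    regexp: '^GRUB_CMDLINE_LINUX_DEFAULT='
    line: 'GRUB_CMDLINE_LINUX_DEFAULT="{{ base_tuning_cmdline }}"'
  notify: update grub  # queued only when the file actually changes

# site.yml tasks (excerpt): run the queued handlers now, not at the end of tasks
- name: Stage 1 - base tuning
  ansible.builtin.import_role:
    name: base_tuning

- name: Apply base-tuning handlers before package stages
  ansible.builtin.meta: flush_handlers
```

If the `lineinfile` task reports no change, nothing is queued and the flush is a no-op, which is what lets a re-run of a converged node skip the reboot.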
## Variables
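Play-level and inventory variables this playbook reads. Node-shape vars (`gpu_tier`, `driver_branch`, `cuda_branch`, `nvidia_nvswitch`, `mig_enabled`) are owned by the inventory and documented in Ansible Inventory and Variables; the rest are orchestration knobs set in `group_vars/gpu_nodes.yml` or on the command line. The two knobs that feed play keywords (`bringup_serial`, `bringup_max_fail_pct`) are templated when the play is set up, before any host is in scope, so an inventory-only value may not be visible there; prefer `-e` or play `vars` for those two.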
| Variable | Scope | Default | Purpose |
|---|---|---|---|
| `gpu_nodes` (inventory group) | inventory | n/a | Target hosts; the play's `hosts:` value. |
| `gpu_tier` | inventory var | `datacenter` | `datacenter \| workstation`; selects driver package and whether FM runs (Ansible Inventory and Variables). |
| `driver_branch` | inventory var | `580` | Pinned NVIDIA driver branch (Driver Versions and Branches). |
| `cuda_branch` | inventory var | `13-0` | Pinned CUDA repo branch. |
| `nvidia_nvswitch` | inventory var | `true` | NVSwitch baseboard present -> Fabric Manager required (Fabric Manager). |
| `mig_enabled` | inventory var | `false` | Gate for the `mig` role (MIG). |
| `bringup_serial` | play var | `[1, "25%"]` | `serial` batches: 1 canary host, then 25% waves. List form is documented.[^2] |
| `bringup_max_fail_pct` | play var | `20` | `max_fail_percentage` per batch; abort the roll past this.[^3] |
| `supported_distros` | play var | `["Ubuntu", "Debian"]` | Allowed `ansible_distribution` values for the pre-flight assert.[^8] |
| `cuda_repo_url` | play var | `https://developer.download.nvidia.com/compute/cuda/repos` | Repo base URL probed for reachability in pre-flight. |
| `repo_probe_timeout` | play var | `10` | Socket timeout (s) for the `uri` reachability probe.[^9] |
| `become` | play keyword | `true` | Privilege escalation; all roles mutate system state.[^1] |
| `gather_facts` | play keyword | `true` | Run the `setup` module so `ansible_distribution` etc. are populated for pre-flight.[^8] |
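For orientation, a sketch of where these values might live. The node-shape values simply repeat the defaults from the table; the overrides file name `bringup_overrides.yml` and the override values are hypothetical.

```yaml
# group_vars/gpu_nodes.yml: node-shape vars (values repeat the table defaults)
gpu_tier: datacenter
driver_branch: "580"     # quoted so it stays a string (an assumption about how roles consume it)
cuda_branch: "13-0"
nvidia_nvswitch: true
mig_enabled: false

# bringup_overrides.yml (hypothetical name): orchestration knobs for one roll,
# passed as: ansible-playbook -i inventory/hosts.ini site.yml -e @bringup_overrides.yml
# bringup_serial: [1, "10%"]   # smaller waves than the default 25%
# bringup_max_fail_pct: 0      # any failed host in a wave aborts the roll
```

The second group is shown commented out because it belongs in a separate file; it is placed here only to keep the two scopes side by side.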
## Tasks
The orchestration play. The roles themselves (base_tuning, acs_disable, nvidia_stack, rdma_fabric, mig, validate) carry the idempotent task bodies. See the per-role pages. Every module below is ansible.builtin.*; the YAML is idempotent (assertions and probes are read-only, role tasks gate on when/creates/changed_when).
```yaml
# site.yml: fleet GPU-node bring-up orchestration.
- name: GPU node bring-up
  hosts: gpu_nodes
  become: true
  gather_facts: true
  # Rolling waves: one canary first, then 25% batches. Bound the blast radius.
  serial: "{{ bringup_serial | default([1, '25%']) }}"
  max_fail_percentage: "{{ bringup_max_fail_pct | default(20) }}"
  force_handlers: true  # deliver queued reboot/initramfs even if a later task failed

  vars:
    supported_distros: ["Ubuntu", "Debian"]
    cuda_repo_url: "https://developer.download.nvidia.com/compute/cuda/repos"
    repo_probe_timeout: 10

  pre_tasks:
    - name: Pre-flight - supported OS only
      ansible.builtin.assert:
        that:
          - ansible_distribution in supported_distros
        fail_msg: >-
          Unsupported distro {{ ansible_distribution }} {{ ansible_distribution_version }};
          this playbook targets {{ supported_distros | join(', ') }}.
        success_msg: "OS {{ ansible_distribution }} {{ ansible_distribution_version }} supported"
        quiet: true

    - name: Pre-flight - required inventory vars are set
      ansible.builtin.assert:
        that:
          - gpu_tier in ['datacenter', 'workstation']
          - driver_branch is defined
          - cuda_branch is defined
        fail_msg: "gpu_tier/driver_branch/cuda_branch must be set in inventory"
        quiet: true

    - name: Pre-flight - CUDA repo is reachable (fail fast before any mutation)
      ansible.builtin.uri:
        url: "{{ cuda_repo_url }}/"
        method: GET
        status_code: [200, 301, 302, 403]  # index may 403; we only prove the host answers
        timeout: "{{ repo_probe_timeout }}"
        validate_certs: true
      changed_when: false

  tasks:
    - name: Stage 1 - base tuning (kernel cmdline, nouveau, governor, hugepages)
      ansible.builtin.import_role:
        name: base_tuning

    - name: Apply base-tuning handlers before package stages
      ansible.builtin.meta: flush_handlers

    - name: Stage 2 - clear PCIe ACS redirect bits for this boot
      ansible.builtin.import_role:
        name: acs_disable

    - name: Stage 3 - install DOCA-OFED before NVIDIA driver build
      ansible.builtin.import_role:
        name: rdma_fabric
      vars:
        rdma_stage: ofed

    - name: Apply OFED reboot before the GPU driver is installed
      ansible.builtin.meta: flush_handlers

    - name: Stage 4 - NVIDIA driver, CUDA, Fabric Manager, toolkit, DCGM
      ansible.builtin.import_role:
        name: nvidia_stack

    - name: Apply driver/container-runtime handlers before peer-memory validation
      ansible.builtin.meta: flush_handlers

    - name: Stage 5 - load nvidia_peermem and write NCCL fabric defaults
      ansible.builtin.import_role:
        name: rdma_fabric
      vars:
        rdma_stage: peermem

    - name: Stage 6 - optional MIG geometry
      ansible.builtin.import_role:
        name: mig

    - name: Apply any MIG reset/reboot before validation
      ansible.builtin.meta: flush_handlers

    - name: Stage 7 - assert node health
      ansible.builtin.import_role:
        name: validate

  post_tasks:
    - name: Sign-off - GPUs visible after convergence
      ansible.builtin.command: nvidia-smi --query-gpu=count --format=csv,noheader
      register: smi_count
      changed_when: false
      # nvidia-smi prints the count once per GPU row, so parse the first line only;
      # casting the whole multi-line stdout to int would yield 0 on multi-GPU nodes.
      failed_when: >-
        smi_count.rc != 0 or
        (smi_count.stdout_lines | first | default('0') | trim | int) < 1

  handlers:
    - name: update grub
      ansible.builtin.command: update-grub
      changed_when: true
      notify: reboot node

    - name: rebuild initramfs
      ansible.builtin.command: update-initramfs -u
      changed_when: true
      notify: reboot node

    - name: reboot node
      ansible.builtin.reboot:
        reboot_timeout: 900
        post_reboot_delay: 30
        test_command: /bin/true  # shared handler also runs before the GPU driver exists

    - name: restart containerd
      ansible.builtin.systemd:
        name: containerd
        state: restarted

    - name: enable cpu-performance
      ansible.builtin.systemd:
        name: cpu-performance
        enabled: true
        state: started
        daemon_reload: true

    - name: enable disable-acs
      ansible.builtin.systemd:
        name: disable-acs
        enabled: true
        state: started
        daemon_reload: true
```
Notes that keep this correct and idempotent:
- `serial`, `max_fail_percentage`, `force_handlers`, `pre_tasks`, `tasks`, `post_tasks`, and `handlers` are all valid play-level keywords.[^1] `meta: flush_handlers` is a task action, hence its placement between staged `import_role` tasks.[^6]
- The repo probe accepts `403` because the directory index of the CUDA repo can refuse listing while the host is up; the goal is to prove the repo host answers, not to fetch the index. Narrow `status_code` to `[200]` if your mirror serves an index.
- A single `reboot node` handler is notified by both `update grub` and `rebuild initramfs`; Ansible de-duplicates notifications, so the node reboots once per flush even if both fired.[^5]
- `nvidia.nvidia_driver` (the upstream collection) ships its own driver role; swap `nvidia_stack` for it if you prefer a vendor-maintained role and keep the rest of this orchestration (References).
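The play imports `rdma_fabric` twice with a different `rdma_stage` value each time. One way a role could honour that variable is sketched below; the file names `ofed.yml` and `peermem.yml` and the guard task are assumptions for illustration, not the role's actual layout.

```yaml
# roles/rdma_fabric/tasks/main.yml (hypothetical layout)
- name: Require a known rdma_stage
  ansible.builtin.assert:
    that:
      - rdma_stage is defined
      - rdma_stage in ['ofed', 'peermem']
    fail_msg: "rdma_stage must be 'ofed' or 'peermem'"
    quiet: true

- name: OFED stage (install DOCA-OFED, notify a reboot if required)
  ansible.builtin.include_tasks: ofed.yml
  when: rdma_stage == 'ofed'

- name: Peer-memory stage (load nvidia_peermem, write NCCL defaults)
  ansible.builtin.include_tasks: peermem.yml
  when: rdma_stage == 'peermem'
```

Failing early on an unknown stage keeps a typo in `site.yml` from silently skipping both halves of the role.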
## Apply & verify
Always converge one canary first (bringup_serial's leading 1), confirm it, then let the 25% waves proceed.
```bash
# Syntax + reference resolution, no host changes.
ansible-playbook -i inventory/hosts.ini site.yml --syntax-check

# Dry run against the canary only (predict changes, mutate nothing).
ansible-playbook -i inventory/hosts.ini site.yml --limit gpu-01.dc1.internal --check --diff

# Real canary roll.
ansible-playbook -i inventory/hosts.ini site.yml --limit gpu-01.dc1.internal

# Fleet roll in waves once the canary is proven.
ansible-playbook -i inventory/hosts.ini site.yml
```
Expected signal of a clean run, per host:
- The `PLAY RECAP` shows `failed=0` and `unreachable=0` for every host in the batch; `changed>0` on the first roll (grub/initramfs/driver/reboot), trending to `changed=0` on a re-run of an already-converged node; that convergence-to-zero is the idempotency proof.
- The `validate` role tasks pass: `nvidia-smi` lists every GPU; on `gpu_tier == datacenter` with `nvidia_nvswitch`, `systemctl is-active nvidia-fabricmanager` is `active`; `ibstat` shows IB ports `State: Active`; `dcgmi diag -r 3` reports no `Fail`.[^10]
- The `post_tasks` sign-off (`nvidia-smi --query-gpu=count`) returns a count >= 1 after the final reboot.
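A sketch of how two of the signals above might be expressed as read-only tasks. These are illustrative tasks, not the contents of the `validate` role; the registered variable names are hypothetical.

```yaml
- name: Every GPU is listed
  ansible.builtin.command: nvidia-smi -L
  register: smi_list
  changed_when: false
  failed_when: smi_list.rc != 0 or (smi_list.stdout_lines | length) < 1

- name: Fabric Manager is active on NVSwitch datacenter nodes
  ansible.builtin.command: systemctl is-active nvidia-fabricmanager
  register: fm_state
  changed_when: false
  failed_when: (fm_state.stdout | trim) != 'active'
  when:
    - gpu_tier == 'datacenter'
    - nvidia_nvswitch | bool
```

Both tasks set `changed_when: false`, so they never count against the `changed=0` convergence signal.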
max_fail_percentage/any_errors_fatal are the inverse signal: if the canary or a wave trips a systemic fault, the roll aborts and the remaining fleet is left untouched. Read the failing task, fix the root cause (usually a package name, repo URL, or pinned branch), and re-run.
## Failure modes
- Big-bang roll (no `serial`). A bad `driver_branch` or repo state mutates the entire fleet in one parallel pass. Always set `bringup_serial` with a leading canary `1`; gate waves with `max_fail_percentage`.[^2][^3]
- Half-converged node from a repo outage. `nvidia_stack` fails mid-install because the CUDA repo is unreachable, leaving DKMS half-built. The `pre_tasks` `uri` probe is the guard; if it is removed, expect this. Recover with runbook: kernel upgrade, GPU missing.
- Driver installed but never live (missed reboot). A role notified `reboot node` but the handler did not flush before `validate`, or a later task failed and swallowed it. `force_handlers: true` plus the explicit `meta: flush_handlers` prevent this; the symptom (modules built, GPUs absent until reboot) is the same as in runbook: kernel upgrade, GPU missing.[^7][^6]
- FM stopped on an NVSwitch node after the driver rev. `validate` asserts `nvidia-fabricmanager` active; a version-skewed FM aborts on its compatibility check and GPUs do not form the NVLink domain. Diagnose with runbook: Fabric Manager failure.
- Wave aborts on every host (systemic config error). Wrong package name, bad pinned branch, or unreachable mirror trips `max_fail_percentage` on the first batch. This is the intended behaviour: fix the inventory/role and re-run; it is not a node fault.
- Re-run shows persistent `changed`. A role task is non-idempotent (a `command`/`shell` without `creates`/`changed_when`). Convergence must trend to `changed=0`; a task that reports changed every pass is drift, not convergence; fix it in the role (base-tuning).
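To make the last failure mode concrete, here is a hedged before/after sketch. The script path and marker file are hypothetical, and the `creates` variant assumes the script writes that marker on success; the point is only the `creates` and `changed_when` guards.

```yaml
# Before: reports changed on every run, so the PLAY RECAP never reaches changed=0.
- name: Run one-off tuning script (non-idempotent)
  ansible.builtin.command: /usr/local/sbin/apply-tuning.sh

# After, option 1: skip the command once its marker file exists.
- name: Run one-off tuning script (guarded by creates)
  ansible.builtin.command:
    cmd: /usr/local/sbin/apply-tuning.sh
    creates: /var/lib/bringup/tuning.done

# After, option 2: a read-only query never reports a change.
- name: Read the active CPU governor
  ansible.builtin.command: cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor
  register: governor
  changed_when: false
```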
## References
- Ansible playbook strategies: `serial` rolling batches (number, percentage, list): https://docs.ansible.com/ansible/latest/playbook_guide/playbooks_strategies.html
- Ansible error handling: `max_fail_percentage`, `any_errors_fatal`, `force_handlers`: https://docs.ansible.com/ansible/latest/playbook_guide/playbooks_error_handling.html
- Ansible handlers: `notify`, default flush order, `meta: flush_handlers`: https://docs.ansible.com/ansible/latest/playbook_guide/playbooks_handlers.html
- Ansible playbook keywords reference (play- and task-level keyword list): https://docs.ansible.com/ansible/latest/reference_appendices/playbooks_keywords.html
- `ansible.builtin.assert` module (that, fail_msg, success_msg, quiet): https://docs.ansible.com/ansible/latest/collections/ansible/builtin/assert_module.html
- `ansible.builtin.uri` module (url, method, status_code, timeout, validate_certs): https://docs.ansible.com/ansible/latest/collections/ansible/builtin/uri_module.html
- `ansible.builtin.reboot` module (reboot_timeout, post_reboot_delay, test_command): https://docs.ansible.com/ansible/latest/collections/ansible/builtin/reboot_module.html
- Ansible facts (`ansible_distribution`, `ansible_distribution_version`): https://docs.ansible.com/ansible/latest/playbook_guide/playbooks_vars_facts.html
- NVIDIA DeepOps (Ansible/Kubespray cluster deploy): https://github.com/NVIDIA/deepops
- `nvidia.nvidia_driver` Ansible collection: https://galaxy.ansible.com/ui/repo/published/nvidia/nvidia_driver/
- NVIDIA Fabric Manager user guide (NVSwitch scope, version check): https://docs.nvidia.com/datacenter/tesla/fabric-manager-user-guide/index.html
Related: Ansible Node & Fabric Bring-Up · Inventory & Node Model · Base Tuning Role · NVIDIA Stack Role · MIG Configuration Role · Validate Health Role · Fabric Manager Failure · Kernel/GPU Missing · Glossary
[^1]: Ansible playbook keywords reference: `hosts`, `become`, `gather_facts`, `serial`, `max_fail_percentage`, `any_errors_fatal`, `force_handlers`, `pre_tasks`, `roles` ("list of roles to be imported into the play"), `post_tasks` ("a list of tasks to execute after the tasks section"), `handlers`, and `vars` are play-level keywords. https://docs.ansible.com/ansible/latest/reference_appendices/playbooks_keywords.html
[^2]: Ansible strategies: by default Ansible runs each task on all hosts in parallel; `serial` "completes the play on the specified number or percentage of hosts before starting the next batch", and accepts a number, a percentage (`"30%"`), or a list of batch sizes (e.g. `[1, 5, 10]`). Setting `serial` scopes failures to the batch. https://docs.ansible.com/ansible/latest/playbook_guide/playbooks_strategies.html
[^3]: Ansible error handling: `max_fail_percentage` "applies to each batch when you use it with serial"; if more than the given percentage of hosts in a batch fail, the rest of the play is aborted. https://docs.ansible.com/ansible/latest/playbook_guide/playbooks_error_handling.html
[^4]: Ansible error handling: with `any_errors_fatal`, when a task returns an error Ansible "finishes the fatal task on all hosts in the current batch and then stops executing the play on all hosts". https://docs.ansible.com/ansible/latest/playbook_guide/playbooks_error_handling.html
[^5]: Ansible handlers: tasks notify handlers with `notify`; by default notified handlers run after all tasks in a play complete, automatically after `pre_tasks`, `roles`/`tasks`, and `post_tasks` in that order. Duplicate notifications of the same handler are de-duplicated. https://docs.ansible.com/ansible/latest/playbook_guide/playbooks_handlers.html
[^6]: Ansible handlers: "the `meta: flush_handlers` task triggers any handlers that have been notified at that point in the play", running them immediately instead of at end-of-play. https://docs.ansible.com/ansible/latest/playbook_guide/playbooks_handlers.html
[^7]: Ansible error handling: when handlers are forced (`force_handlers: true` / `--force-handlers`), "Ansible will run all notified handlers on all hosts, even hosts with failed tasks". https://docs.ansible.com/ansible/latest/playbook_guide/playbooks_error_handling.html
[^8]: Ansible facts: with `gather_facts: true` the play runs the `setup` module at the start, populating standard facts including `ansible_distribution`, `ansible_distribution_version`, and `ansible_distribution_major_version` (also available as `ansible_facts['distribution']`). https://docs.ansible.com/ansible/latest/playbook_guide/playbooks_vars_facts.html
[^9]: `ansible.builtin.uri`: `url`, `method` (default GET), `status_code` (list of HTTP codes that signify success, default `[200]`), `timeout` (socket timeout seconds, default 30), `validate_certs` (default true). https://docs.ansible.com/ansible/latest/collections/ansible/builtin/uri_module.html
[^10]: Validation signals match the node-level `validate` role: `nvidia-smi` lists GPUs; `systemctl is-active nvidia-fabricmanager` is `active` on datacenter NVSwitch systems; `ibstat` ports `State: Active`; `dcgmi diag -r 3` with no `Fail` (Ansible: Node and Fabric Bring-Up, GPU Diagnostics and Validation). FM is required only on NVSwitch baseboards. https://docs.nvidia.com/datacenter/tesla/fabric-manager-user-guide/index.html