
Ansible inventory & variables

Scope: the inventory model that the bring-up roles read. It covers the gpu_nodes group, the group_vars/ and host_vars/ layout, the per-tier variables (gpu_tier, driver_branch, cuda_branch, nvidia_nvswitch, mig_enabled, mig_profile) and how they drive each role, plus a multi-tier inventory example (datacenter + workstation). This is the single source of truth that the roles in the bring-up hub consume; it does not run tasks itself.

Reference template, drawn from upstream Ansible inventory/variable-precedence docs and the variable names used by the hub playbook (Ansible: Node and Fabric Bring-Up). Not hardware-tested. The custom variables below are this KB's convention; the nvidia.nvidia_driver role's own variables (e.g. nvidia_driver_branch) are noted where they map. Validate names against the cited NVIDIA pages before a fleet roll.

flowchart TD
  INV["inventory/hosts.ini  (gpu_nodes group)"] --> GA["group_vars/all.yml  (fleet defaults)"]
  GA --> GT["group_vars/gpu_nodes.yml"]
  GT --> GDC["group_vars/datacenter.yml"]
  GT --> GWS["group_vars/workstation.yml"]
  GDC --> HV["host_vars/gpu-h100-01.dc1.internal.yml  (per-node overrides)"]
  GWS --> HV
  HV --> ROLES["roles read the merged vars: base_tuning, acs_disable, rdma_fabric, nvidia_stack, mig, validate"]

What it does

The inventory is the declarative state the roles converge a node to; it carries no logic. One gpu_nodes group names every GPU host; tier groups (datacenter, workstation) layer the per-class facts; host_vars/ holds the rare per-node exception. Roles never hard-code a branch or a MIG shape. They read driver_branch, nvidia_nvswitch, mig_enabled, etc., so the same role tree serves a DGX H100 baseboard and an RTX PRO 6000 workstation from different variable files.

Variable resolution follows Ansible's documented precedence (low to high): role defaults < inventory group_vars/all < inventory group_vars/* < inventory host_vars/* < play vars < extra-vars (-e) [1]. Within the same level, groups merge alphabetically; a child group overrides its parent, and ansible_group_priority (higher merges later and wins) breaks ties when the alphabetical order would pick the wrong winner [2]. The default hash behaviour is replace, not deep-merge: a later file replaces a whole variable rather than merging dict keys [1]. Net effect: set fleet-wide defaults once in group_vars/all.yml, specialise per tier, and override a single host in host_vars/ without touching a role.
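As a sketch of the tie-break, assume a hypothetical canary group (not part of this KB's inventory) that sits beside datacenter under gpu_nodes and shares a host with it. Alphabetically canary merges first, so datacenter would win; raising ansible_group_priority on canary reverses that. The fragment uses Ansible's YAML inventory format rather than hosts.ini purely for illustration. It sets the priority in the inventory source because the upstream inventory guide states that this variable is not read from group_vars/ files.

```yaml
# Illustrative YAML-format inventory; "canary" is a hypothetical group.
all:
  children:
    gpu_nodes:
      children:
        datacenter:
          hosts:
            gpu-h100-01.dc1.internal:
        canary:
          hosts:
            gpu-h100-01.dc1.internal:
          vars:
            # Unset groups default to priority 1; a higher value merges later
            # and therefore wins over same-depth siblings such as datacenter.
            ansible_group_priority: 10
```

The priority only reorders groups at the same depth in the group tree, so the sketch places canary under gpu_nodes alongside datacenter; a child group still overrides its parent regardless of priority.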

The host_group_vars plugin discovers the layout below, resolving it relative to the inventory file (or, with ansible-playbook, also relative to the playbook dir; playbook-relative wins on conflict). File names match the group or host exactly; valid extensions are .yml, .yaml, .json, or none [3].

inventory/
  hosts.ini                 # groups + membership only (no per-tier vars inline)
  group_vars/
    all.yml                 # fleet-wide defaults (cuda_branch, NCCL, hugepages)
    gpu_nodes.yml           # everything common to GPU hosts
    datacenter.yml          # gpu_tier=datacenter, nvswitch, MIG policy
    workstation.yml         # gpu_tier=workstation, no Fabric Manager
  host_vars/
    gpu-h100-01.dc1.internal.yml   # one-off overrides for a single node (file name = full inventory host name)

Variables

Custom inventory variables this KB defines (left) and, where one exists, the upstream nvidia.nvidia_driver role variable they map onto (right). Defaults are the values set in group_vars/all.yml below; tier groups override them.

| Variable | Scope / where set | Default | Drives | Upstream equivalent |
| --- | --- | --- | --- | --- |
| gpu_tier | tier group (datacenter/workstation) | none in all.yml (set by each tier group) | driver package selection, whether Fabric Manager + MIG apply | none (KB convention) |
| driver_branch | group_vars/all, override per tier/host | "580" | driver metapackage branch, Fabric Manager branch (lockstep) | nvidia_driver_branch (default "515") [4] |
| cuda_branch | group_vars/all | "13-0" | CUDA toolkit metapackage suffix (cuda-toolkit-13-0; use the exact suffix your CUDA repo publishes) | none (set via CUDA repo) |
| nvidia_nvswitch | tier group / host | false | install + enable nvidia-fabricmanager; assert it active in validate | none (KB convention) |
| mig_enabled | tier group / host | false | run the mig role at all | none (KB convention) |
| mig_profile | host or tier group | "" (unset) | MIG geometry to create when mig_enabled, e.g. 3g.40gb | none (KB convention) |
| driver_module_type | group_vars/all | open | open vs proprietary kernel modules (open is the default/suggested flavour on Turing and newer; proprietary only for Maxwell/Pascal/Volta) [5] | repo selection via nvidia_driver_ubuntu_install_from_cuda_repo [4] |
| cpu_governor | group_vars/all | performance | base_tuning governor unit | none (KB convention) |
| nr_hugepages | group_vars/all / host | 2048 | vm.nr_hugepages for RDMA/large models | none (KB convention) |

mig_profile values are MIG profile names of the form <SM-slices>g.<memory-GB>gb (e.g. 1g.10gb, 3g.40gb, 7g.80gb); the exact set is per-GPU and must be taken from the MIG profile tables, not assumed (MIG) [6]. driver_branch and cuda_branch are quoted strings so that YAML does not coerce a bare value such as 580 or 13 into an integer. A hyphenated suffix such as 13-0 or 13-3 already parses as a string, but quoting it keeps the type the same whichever form the branch takes.
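The two rules above can be checked mechanically. The play below is a minimal sketch, not part of the hub playbook: it assumes the variable names from the table, and the regular expression is an assumption that only tests the general shape of a profile name (with an optional suffix such as +me). It cannot tell whether the profile exists on a given GPU; that still has to come from the per-GPU tables.

```yaml
# Hypothetical pre-flight shape checks; not a substitute for the per-GPU MIG tables.
- name: Check variable shapes
  hosts: gpu_nodes
  gather_facts: false
  tasks:
    - name: Branch variables are strings, not YAML integers
      ansible.builtin.assert:
        that:
          - driver_branch is string
          - cuda_branch is string
        fail_msg: "Quote driver_branch and cuda_branch in the inventory for {{ inventory_hostname }}"
        quiet: true

    - name: mig_profile matches the <slices>g.<memory>gb naming pattern when set
      ansible.builtin.assert:
        that:
          # Assumed pattern: digits + "g." + digits + "gb", then optional +/- suffixes.
          - "mig_profile is match('^[0-9]+g[.][0-9]+gb([+-][a-z0-9.]+)*$')"
        fail_msg: "mig_profile '{{ mig_profile }}' does not look like a MIG profile name"
        quiet: true
      when: mig_profile | default('') | length > 0
```

An unquoted driver_branch: 580 in a YAML vars file loads as an integer and fails the first assertion, which is the case the quoting convention is meant to prevent.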

Tasks

This page ships inventory and variables, not tasks. The roles in Ansible: Node and Fabric Bring-Up consume these files. The artefacts below are the inventory itself plus the one validating play that proves the model loaded correctly.

# inventory/hosts.ini  -- membership only; per-tier facts live in group_vars/
[datacenter]
gpu-h100-[01:16].dc1.internal

[workstation]
ws-rtxpro-[01:04].lab.internal

# gpu_nodes is the union the roles target
[gpu_nodes:children]
datacenter
workstation

# inventory/group_vars/all.yml  -- fleet-wide defaults (lowest custom precedence)
cuda_branch: "13-0"           # cuda-toolkit-13-0; set the exact suffix your CUDA repo publishes
driver_branch: "580"
driver_module_type: open      # default/suggested on Turing+; proprietary only for Maxwell/Pascal/Volta
cpu_governor: performance
nr_hugepages: 2048
# safe defaults; tiers/hosts turn these on
nvidia_nvswitch: false
mig_enabled: false
mig_profile: ""

# inventory/group_vars/datacenter.yml  -- HGX/DGX 8-GPU baseboards
gpu_tier: datacenter
nvidia_nvswitch: true         # NVSwitch baseboard -> Fabric Manager required
mig_enabled: false            # flip per cohort; set mig_profile when true

# inventory/group_vars/workstation.yml  -- RTX PRO 6000 Blackwell desktops
gpu_tier: workstation
nvidia_nvswitch: false        # no NVSwitch -> nvidia_stack skips Fabric Manager
mig_enabled: false            # RTX PRO Blackwell can MIG; default off

# inventory/host_vars/gpu-h100-01.dc1.internal.yml  -- one node carved into MIG slices
# (file name must equal the full inventory host name)
mig_enabled: true
mig_profile: "3g.40gb"        # verify the profile exists on this SKU first

The roles read these directly. The hub's nvidia_stack already keys on them: datacenter takes the -open datacenter driver plus nvidia-fabricmanager-{{ driver_branch }}, started only when: gpu_tier == 'datacenter' and nvidia_nvswitch | bool; workstation takes a different package and skips Fabric Manager (Ansible: Node and Fabric Bring-Up). The mig role runs when: mig_enabled | bool and applies mig_profile.
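To make the gating concrete, here is a minimal sketch of how such a role could consume the variables. It is illustrative only and is not the hub's actual nvidia_stack task file: the file path and task layout are assumptions, while the package name pattern, the service name, and the when condition are the ones quoted above.

```yaml
# roles/nvidia_stack/tasks/main.yml (hypothetical path and layout; sketch only)
- name: Install Fabric Manager on the same branch as the driver
  ansible.builtin.package:
    name: "nvidia-fabricmanager-{{ driver_branch }}"
    state: present
  when: gpu_tier == 'datacenter' and nvidia_nvswitch | bool

- name: Enable and start Fabric Manager
  ansible.builtin.service:
    name: nvidia-fabricmanager
    enabled: true
    state: started
  when: gpu_tier == 'datacenter' and nvidia_nvswitch | bool
```

Because the branch and the NVSwitch flag both come from the inventory, moving a host between tiers or bumping driver_branch changes what these tasks do without any edit to the role.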

# play: assert the inventory model is internally consistent before any role runs.
# Pure validation -- assert never reports changed and does not mutate the host.
- name: Validate inventory model
  hosts: gpu_nodes
  gather_facts: false
  tasks:
    - name: Required per-tier variables are defined
      ansible.builtin.assert:
        that:
          - gpu_tier in ['datacenter', 'workstation']
          - driver_branch | string | length > 0
          - cuda_branch  | string | length > 0
        fail_msg: "gpu_tier/driver_branch/cuda_branch must be set for {{ inventory_hostname }}"
        quiet: true

    - name: Fabric Manager only claimed on datacenter NVSwitch nodes
      ansible.builtin.assert:
        that: "not (nvidia_nvswitch | bool) or gpu_tier == 'datacenter'"
        fail_msg: "nvidia_nvswitch=true on a non-datacenter tier ({{ gpu_tier }})"
        quiet: true

    - name: MIG profile is set whenever MIG is enabled
      ansible.builtin.assert:
        that: "not (mig_enabled | bool) or (mig_profile | length > 0)"
        fail_msg: "mig_enabled=true but mig_profile is empty on {{ inventory_hostname }}"
        quiet: true

ansible.builtin.assert is read-only and reports ok/failed (never changed), so this play is safe to run on every converge as a pre-flight gate [7].
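One way to wire the gate in, sketched under assumptions: site.yml is a hypothetical top-level playbook name, and the role order simply follows the order in which the roles are listed at the top of this page. validate-inventory.yml is the validation play saved under the file name used in the verify step.

```yaml
# site.yml (hypothetical file name); role order here is an assumption.
- name: Pre-flight, validate the inventory model
  ansible.builtin.import_playbook: validate-inventory.yml

- name: Bring up GPU nodes
  hosts: gpu_nodes
  roles:
    - base_tuning
    - acs_disable
    - rdma_fabric
    - nvidia_stack
    - mig
    - validate
```

A host that fails an assertion is dropped from the remaining plays of that run while the other hosts continue. If a single bad host should stop the whole roll, adding any_errors_fatal: true to the validation play is the usual switch.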

Apply & verify

Resolve and inspect the merged variables for one host before running any role. ansible-inventory renders the exact precedence-resolved view the roles will see:

# Show the fully-merged variables for one node (what the roles actually read).
ansible-inventory -i inventory/hosts.ini --host gpu-h100-01.dc1.internal

# List the gpu_nodes group and its tier children.
ansible-inventory -i inventory/hosts.ini --graph gpu_nodes

--host prints the variable hash after precedence and group merging are applied; --graph prints the group tree [8]. Then dry-run the consistency play:

ansible-playbook -i inventory/hosts.ini validate-inventory.yml --check

Expected signal: --host gpu-h100-01... shows "mig_enabled": true and "mig_profile": "3g.40gb" (host_vars winning over the tier default), while a workstation host shows "gpu_tier": "workstation", "nvidia_nvswitch": false. --graph gpu_nodes lists both datacenter and workstation as children with their members. The validate play ends failed=0; any misconfigured host fails its assert with the fail_msg naming the host.

Failure modes

  • Per-tier var set inline in hosts.ini instead of group_vars/. Inventory inline group vars sit below group_vars/* in precedence [1], so a group_vars/datacenter.yml value silently wins over an inline one, drift that only shows under ansible-inventory --host. Keep hosts.ini membership-only. Runbook: Image Drift Across Fleet.
  • Expecting dict deep-merge across files. Default hash behaviour is replace; a host_vars dict replaces the whole group_vars dict, not just the changed key [1]. Override scalars, or compose the dict in one place.
  • mig_enabled: true with empty mig_profile. The mig role has no geometry to create; the assert play above catches it pre-flight. If MIG state ends up stale or half-applied: Stale MIG State.
  • nvidia_nvswitch: true on a node with no NVSwitch (wrong tier/host). nvidia_stack installs and enables Fabric Manager, which then fails to reach a fabric. Gate it on tier (assert above) and follow Fabric Manager Failure.
  • driver_branch unquoted or mistyped -> wrong/absent metapackage. A bad branch yields no matching package and the node never gets a driver. If GPUs are missing post-converge: Kernel Upgrade, GPU Missing.
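  • Same-level group collision resolved the wrong way. Two same-level groups setting the same var merge alphabetically; if the loser should win, set ansible_group_priority on that group [2]. Set it in the inventory source (hosts.ini), not in group_vars/: the upstream inventory guide notes that the variable is consumed while group_vars are being loaded, so it is an exception to the membership-only rule for hosts.ini.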
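For the dict deep-merge failure mode, "compose the dict in one place" can look like the sketch below. The nccl_env* variables are hypothetical and not part of this KB's variable set; the pattern keeps the base and the override under different names so neither replaces the other, then merges them once with the combine filter.

```yaml
# group_vars/gpu_nodes.yml -- base dict under its own name (hypothetical variable)
nccl_env_base:
  NCCL_DEBUG: WARN
  NCCL_SOCKET_IFNAME: eth0

# host_vars/<host>.yml -- only the keys that differ on this node
nccl_env_override:
  NCCL_DEBUG: INFO

# group_vars/all.yml -- composed once; roles read nccl_env only
nccl_env: "{{ nccl_env_base | default({}) | combine(nccl_env_override | default({})) }}"
```

With these three files the host resolves nccl_env to both keys, with NCCL_DEBUG taken from the override. Had the host file redefined nccl_env_base directly, the default replace behaviour would have dropped NCCL_SOCKET_IFNAME.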

References

  • Ansible: How to build your inventory (group_vars/ and host_vars/ layout, [group:children], child-group override, group merge order): https://docs.ansible.com/ansible/latest/inventory_guide/intro_inventory.html
  • Ansible — Using variables / variable precedence (role defaults < group_vars/all < group_vars/ < host_vars/ < play vars < extra-vars; default hash_behaviour=replace): https://docs.ansible.com/ansible/latest/playbook_guide/playbooks_variables.html
  • Ansible — ansible.builtin.host_group_vars (file-name = group/host name; .yml/.yaml/.json/none; inventory- vs playbook-relative search): https://docs.ansible.com/ansible/latest/collections/ansible/builtin/host_group_vars_vars.html
  • Ansible — ansible.builtin.assert module (read-only assertions, that/fail_msg/quiet): https://docs.ansible.com/ansible/latest/collections/ansible/builtin/assert_module.html
  • Ansible — ansible-inventory CLI (--host, --graph to render merged vars/groups): https://docs.ansible.com/ansible/latest/cli/ansible-inventory.html
  • NVIDIA ansible-role-nvidia-driver (nvidia_driver_branch, nvidia_driver_package_state, nvidia_driver_ubuntu_install_from_cuda_repo): https://github.com/NVIDIA/ansible-role-nvidia-driver
  • NVIDIA MIG User Guide — Supported MIG Profiles (<slices>g.<mem>gb naming, per-GPU tables): https://docs.nvidia.com/datacenter/tesla/mig-user-guide/supported-mig-profiles.html
  • NVIDIA Driver Installation Guide — Kernel Modules (open = default/suggested on Turing+; proprietary required for Maxwell/Pascal/Volta): https://docs.nvidia.com/datacenter/tesla/driver-installation-guide/kernel-modules.html

Related: Node & Fabric Bring-Up (hub) · Driver by Tier · Install Lifecycle · MIG Partitioning · Fabric Manager · Glossary


  1. Ansible, "Using variables" — variable precedence, lowest to highest: role defaults, inventory group_vars/all, inventory group_vars/*, inventory host_vars/*, play vars, extra-vars (-e, always wins); inventory inline group/host vars rank below the corresponding group_vars/*/host_vars/* files; default hash_behaviour is replace (whole-variable replacement, not deep merge). https://docs.ansible.com/ansible/latest/playbook_guide/playbooks_variables.html 

  2. Ansible, "How to build your inventory" — a child group's variables override its parent's; same-level groups merge alphabetically by group name (last loaded wins); ansible_group_priority (higher = merged later = higher precedence) overrides the alphabetical order. https://docs.ansible.com/ansible/latest/inventory_guide/intro_inventory.html 

  3. Ansible, ansible.builtin.host_group_vars vars plugin: loads group_vars/ and host_vars/ relative to the inventory source and (under ansible-playbook) the playbook dir, playbook-relative overriding inventory-relative; file names match the group/host name; valid extensions .yml, .yaml, .json, or none; directory contents read in lexicographical order. https://docs.ansible.com/ansible/latest/collections/ansible/builtin/host_group_vars_vars.html 

  4. NVIDIA ansible-role-nvidia-driver — exposes nvidia_driver_branch (default "515"), nvidia_driver_package_state (default present), nvidia_driver_persistence_mode_on, and nvidia_driver_ubuntu_install_from_cuda_repo (Ubuntu CUDA-repo vs Canonical repo selection). The KB's driver_branch maps to nvidia_driver_branch. https://github.com/NVIDIA/ansible-role-nvidia-driver 

  5. NVIDIA Driver Installation Guide, Kernel Modules — open kernel modules are supported only on Turing and newer, and from the 560 driver series the open flavour is "the default and suggested installation"; the proprietary flavour is "required for older GPUs from the Maxwell, Pascal, or Volta architectures". Drives the driver_module_type default of open. (The page does not state a Grace Hopper/Blackwell requirement; treat that as out of scope for this citation.) https://docs.nvidia.com/datacenter/tesla/driver-installation-guide/kernel-modules.html 

  6. NVIDIA MIG User Guide, Supported MIG Profiles — profile names encode <SM-slices>g.<memory-GB>gb (e.g. 1g.10gb, 3g.40gb, 7g.80gb); the available set is per-GPU and listed in the per-SKU tables. mig_profile must be taken from these tables for the target GPU. https://docs.nvidia.com/datacenter/tesla/mig-user-guide/supported-mig-profiles.html 

  7. Ansible, ansible.builtin.assert module — evaluates the conditions in that; reports ok/failed only (never changed), so it is safe in an idempotent pre-flight play; fail_msg sets the failure message, quiet: true suppresses per-assertion success output. https://docs.ansible.com/ansible/latest/collections/ansible/builtin/assert_module.html 

  8. Ansible, ansible-inventory CLI — --host <name> outputs the variables for a single host after precedence/group merging; --graph [group] renders the group/child tree. https://docs.ansible.com/ansible/latest/cli/ansible-inventory.html