Ansible inventory & variables

Scope: the inventory model the bring-up roles read. It covers the gpu_nodes group, the group_vars/ and host_vars/ layout, the per-tier variables (gpu_tier, driver_branch, cuda_branch, nvidia_nvswitch, mig_enabled, mig_profile) and how they drive each role, plus a multi-tier inventory example (datacenter + workstation). This is the single source of truth the roles in the bring-up hub consume; it does not run tasks itself.

Reference template, drawn from the upstream Ansible inventory and variable-precedence docs and from the variable names used by the hub playbook (Ansible: Node and Fabric Bring-Up). Not hardware-tested. The custom variables below are this KB's convention; where they map onto the variables of NVIDIA's nvidia.nvidia_driver role (e.g. nvidia_driver_branch), the mapping is noted. Validate names against the cited NVIDIA pages before a fleet roll.

flowchart TD
  INV["inventory/hosts.ini  (gpu_nodes group)"] --> GA["group_vars/all.yml  (fleet defaults)"]
  GA --> GT["group_vars/gpu_nodes.yml"]
  GT --> GDC["group_vars/datacenter.yml"]
  GT --> GWS["group_vars/workstation.yml"]
  GDC --> HV["host_vars/gpu-h100-01.dc1.internal.yml  (per-node overrides)"]
  GWS --> HV
  HV --> ROLES["roles read the merged vars: base_tuning, acs_disable, rdma_fabric, nvidia_stack, mig, validate"]

What it does

The inventory is the declarative state the roles converge a node to; it carries no logic. One gpu_nodes group names every GPU host; tier groups (datacenter, workstation) layer the per-class facts; host_vars/ holds the rare per-node exception. Roles never hard-code a branch or a MIG shape. They read driver_branch, nvidia_nvswitch, mig_enabled, etc., so the same role tree serves a DGX H100 baseboard and an RTX PRO 6000 workstation from different variable files.

Variable resolution follows Ansible's documented precedence, abridged here to the levels this layout uses (low to high): role defaults < inventory group_vars/all < inventory group_vars/* < inventory host_vars/* < play vars < extra-vars (-e).¹ Groups at the same depth merge in alphabetical order of group name, and the last one loaded wins; a child group overrides its parent, and ansible_group_priority (higher merges later and wins) settles a tie when alphabetical order gives the wrong result.² Ansible honours ansible_group_priority only when it is set in the inventory source itself, not in a group_vars/ file. The default hash behaviour is replace, not deep-merge: a later file replaces a whole variable and does not merge dict keys.¹ Net effect: set fleet-wide defaults once in group_vars/all.yml, specialise per tier, and override a single host in host_vars/ without touching a role.
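
Because the default is whole-variable replacement, a dict defined at two levels is not merged key by key. The fragment below is a minimal sketch of one way around that, assuming a hypothetical nccl_env dict that is not part of this KB's variable set: keep the defaults and the per-host overrides under separate names and compose them once with the combine filter.

```yaml
# inventory/group_vars/all.yml  (hypothetical variable names, illustration only)
nccl_env_defaults:
  NCCL_DEBUG: WARN
  NCCL_IB_DISABLE: "0"
# Composed per host at template time; nccl_env_overrides is optional.
nccl_env: "{{ nccl_env_defaults | combine(nccl_env_overrides | default({})) }}"

# inventory/host_vars/<host>.yml  (only the keys that differ)
nccl_env_overrides:
  NCCL_DEBUG: INFO

# Resolved nccl_env on that host: NCCL_DEBUG=INFO, NCCL_IB_DISABLE="0".
# Had host_vars redefined nccl_env_defaults instead, the whole dict would be
# replaced and NCCL_IB_DISABLE would be lost.
```

For nested dicts, combine accepts recursive=True; for the flat scalars this page uses (driver_branch, mig_enabled and so on) plain overriding is enough and no composition is needed.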

The host_group_vars plugin discovers the layout below. It resolves group_vars/ and host_vars/ relative to the inventory file and, under ansible-playbook, also relative to the playbook directory; playbook-relative values win on conflict. File names must match the group or host name exactly; valid extensions are .yml, .yaml, .json, or none.³

inventory/
  hosts.ini                 # groups + membership only (no per-tier vars inline)
  group_vars/
    all.yml                 # fleet-wide defaults (cuda_branch, NCCL, hugepages)
    gpu_nodes.yml           # everything common to GPU hosts
    datacenter.yml          # gpu_tier=datacenter, nvswitch, MIG policy
    workstation.yml         # gpu_tier=workstation, no Fabric Manager
  host_vars/
    gpu-h100-01.dc1.internal.yml   # one-off overrides for a single node (file name = full inventory hostname)

Variables

Custom inventory variables this KB defines and, where one exists, the variable of the upstream nvidia.nvidia_driver role they map onto (last column). Defaults are the values set in group_vars/all.yml below; tier groups override them.

| Variable | Scope / where set | Default | Drives | Upstream equivalent |
|---|---|---|---|---|
| `gpu_tier` | tier group (`datacenter`/`workstation`) | none; each tier group sets it | driver package selection, whether Fabric Manager + MIG apply | n/a (KB convention) |
| `driver_branch` | `group_vars/all`, override per tier/host | `"580"` | driver metapackage branch, Fabric Manager branch (lockstep) | `nvidia_driver_branch` (default `"515"`)⁴ |
| `cuda_branch` | `group_vars/all` | `"13-0"` | CUDA toolkit metapackage suffix (`cuda-toolkit-13-0`; use the exact suffix your CUDA repo publishes) | n/a (set via CUDA repo) |
| `nvidia_nvswitch` | tier group / host | `false` | install + enable `nvidia-fabricmanager`; assert it active in `validate` | n/a (KB convention) |
| `mig_enabled` | tier group / host | `false` | run the `mig` role at all | n/a (KB convention) |
| `mig_profile` | host or tier group | `""` (unset) | MIG geometry to create when `mig_enabled`, e.g. `3g.40gb` | n/a (KB convention) |
| `driver_module_type` | `group_vars/all` | `open` | open vs proprietary kernel modules (open is the default/suggested flavour on Turing and newer; proprietary only for Maxwell/Pascal/Volta)⁵ | repo selection via `nvidia_driver_ubuntu_install_from_cuda_repo`⁴ |
| `cpu_governor` | `group_vars/all` | `performance` | `base_tuning` governor unit | n/a (KB convention) |
| `nr_hugepages` | `group_vars/all` / host | `2048` | `vm.nr_hugepages` for RDMA/large models | n/a (KB convention) |

mig_profile values are MIG profile names of the form <SM-slices>g.<memory-GB>gb (e.g. 1g.10gb, 3g.40gb, 7g.80gb); the exact set is per-GPU and must be taken from the MIG profile tables, not assumed (MIG).⁶ driver_branch and cuda_branch are quoted strings so that YAML never coerces a bare value such as 580 or 13 into an integer, or reads a leading-zero value as octal. A hyphenated suffix such as 13-0 already parses as a string; it is quoted anyway so every branch value is handled the same way.
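
The naming rule above can be checked mechanically before a run. The task below is a hedged sketch of an extra pre-flight assertion (it is not part of the hub's roles) that tests only the shape of mig_profile with Ansible's match test. It cannot tell whether the target GPU actually offers the profile, and suffixed variants that some profile tables list would need a looser pattern.

```yaml
# Hypothetical extra pre-flight task: validates the *shape* of mig_profile only.
# Whether the profile exists on a given SKU still has to come from the MIG tables.
- name: mig_profile looks like <slices>g.<memory>gb when MIG is enabled
  ansible.builtin.assert:
    that:
      # [.] is a literal dot; avoids backslash escaping inside the Jinja string
      - not (mig_enabled | bool) or mig_profile is match('^[0-9]+g[.][0-9]+gb$')
    fail_msg: "mig_profile '{{ mig_profile }}' is not of the form <N>g.<M>gb"
    quiet: true
```

With the defaults on this page (mig_enabled: false, mig_profile: ""), the first half of the condition is true and the task passes without inspecting the empty string.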

Tasks

This page ships inventory and variables, not tasks. The roles in Ansible: Node and Fabric Bring-Up consume these files. The artefacts below are the inventory itself plus the one validating play that proves the model loaded correctly.

# inventory/hosts.ini  -- membership only; per-tier facts live in group_vars/
[datacenter]
gpu-h100-[01:16].dc1.internal

[workstation]
ws-rtxpro-[01:04].lab.internal

# gpu_nodes is the union the roles target
[gpu_nodes:children]
datacenter
workstation

# inventory/group_vars/all.yml  -- fleet-wide defaults (lowest custom precedence)
cuda_branch: "13-0"           # cuda-toolkit-13-0; set the exact suffix your CUDA repo publishes
driver_branch: "580"
driver_module_type: open      # default/suggested on Turing+; proprietary only for Maxwell/Pascal/Volta
cpu_governor: performance
nr_hugepages: 2048
# safe defaults; tiers/hosts turn these on
nvidia_nvswitch: false
mig_enabled: false
mig_profile: ""

# inventory/group_vars/datacenter.yml  -- HGX/DGX 8-GPU baseboards
gpu_tier: datacenter
nvidia_nvswitch: true         # NVSwitch baseboard -> Fabric Manager required
mig_enabled: false            # flip per cohort; set mig_profile when true

# inventory/group_vars/workstation.yml  -- RTX PRO 6000 Blackwell desktops
gpu_tier: workstation
nvidia_nvswitch: false        # no NVSwitch -> nvidia_stack skips Fabric Manager
mig_enabled: false            # RTX PRO Blackwell can MIG; default off

# inventory/host_vars/gpu-h100-01.dc1.internal.yml  -- one node carved into MIG slices (file name must equal the inventory hostname)
mig_enabled: true
mig_profile: "3g.40gb"        # verify the profile exists on this SKU first

The roles read these directly. The hub's nvidia_stack already keys on them: datacenter takes the -open datacenter driver plus nvidia-fabricmanager-{{ driver_branch }}, started only when: gpu_tier == 'datacenter' and nvidia_nvswitch | bool; workstation takes a different package and skips Fabric Manager (Ansible: Node and Fabric Bring-Up). The mig role runs when: mig_enabled | bool and applies mig_profile.
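
To make the gating concrete, here is an illustrative sketch of how role tasks could key on these variables. It is not the hub's actual task file: the file path, the use of apt (which assumes a Debian/Ubuntu host) and the include_role hand-off are assumptions made for the example, while the package name pattern and the when: conditions are the ones quoted above.

```yaml
# roles/nvidia_stack/tasks/main.yml  (hypothetical path; sketch, not the hub's tasks)
- name: Install Fabric Manager matching the driver branch
  ansible.builtin.apt:                # assumption: Debian/Ubuntu package manager
    name: "nvidia-fabricmanager-{{ driver_branch }}"
    state: present
  when: gpu_tier == 'datacenter' and nvidia_nvswitch | bool

- name: Enable and start Fabric Manager
  ansible.builtin.service:
    name: nvidia-fabricmanager
    state: started
    enabled: true
  when: gpu_tier == 'datacenter' and nvidia_nvswitch | bool

# Workstation hosts fall through both tasks above as "skipped".
- name: Hand off to the mig role only where MIG is requested
  ansible.builtin.include_role:
    name: mig
  when: mig_enabled | bool
```

The | bool filter matters because a value passed with -e on the command line arrives as a string, so a bare truthiness test on "false" would succeed.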

# play: assert the inventory model is internally consistent before any role runs.
# Pure validation -- assert never reports changed and does not touch the host.
- name: Validate inventory model
  hosts: gpu_nodes
  gather_facts: false
  tasks:
    - name: Required per-tier variables are defined
      ansible.builtin.assert:
        that:
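          # default('') makes an undefined variable fail with fail_msg
          # instead of aborting with an undefined-variable error
          - gpu_tier | default('') in ['datacenter', 'workstation']
          - driver_branch | default('') | string | length > 0
          - cuda_branch | default('') | string | length > 0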
        fail_msg: "gpu_tier/driver_branch/cuda_branch must be set for {{ inventory_hostname }}"
        quiet: true

    - name: Fabric Manager only claimed on datacenter NVSwitch nodes
      ansible.builtin.assert:
        that: "not (nvidia_nvswitch | bool) or gpu_tier == 'datacenter'"
        fail_msg: "nvidia_nvswitch=true on a non-datacenter tier ({{ gpu_tier }})"
        quiet: true

    - name: MIG profile is set whenever MIG is enabled
      ansible.builtin.assert:
        that: "not (mig_enabled | bool) or (mig_profile | length > 0)"
        fail_msg: "mig_enabled=true but mig_profile is empty on {{ inventory_hostname }}"
        quiet: true

ansible.builtin.assert is read-only and reports ok/failed (never changed), so this play is safe to run on every converge as a pre-flight gate.⁷

Apply & verify

Resolve and inspect the merged variables for one host before running any role. ansible-inventory renders the exact precedence-resolved view the roles will see:

# Show the fully-merged variables for one node (what the roles actually read).
ansible-inventory -i inventory/hosts.ini --host gpu-h100-01.dc1.internal

# List the gpu_nodes group and its tier children.
ansible-inventory -i inventory/hosts.ini --graph gpu_nodes

--host prints the variable hash after precedence and group merging are applied; --graph prints the group tree.⁸ Then dry-run the consistency play:

ansible-playbook -i inventory/hosts.ini validate-inventory.yml --check

Expected signal: --host gpu-h100-01... shows "mig_enabled": true and "mig_profile": "3g.40gb" (host_vars winning over the tier default), while a workstation host shows "gpu_tier": "workstation", "nvidia_nvswitch": false. --graph gpu_nodes lists both datacenter and workstation as children with their members. The validate play ends failed=0; any misconfigured host fails its assert with the fail_msg naming the host.

Failure modes

Per-tier var set inline in hosts.ini instead of group_vars/. Inventory inline group vars sit below group_vars/* in precedence,¹ so a group_vars/datacenter.yml value silently wins over an inline one, drift that only shows under ansible-inventory --host. Keep hosts.ini membership-only. Runbook: Image Drift Across Fleet.
Expecting dict deep-merge across files. Default hash behaviour is replace; a host_vars dict replaces the whole group_vars dict, not just the changed key.¹ Override scalars, or compose the dict in one place.
mig_enabled: true with empty mig_profile. The mig role has no geometry to create; the assert play above catches it pre-flight. If MIG state ends up stale or half-applied: Stale MIG State.
nvidia_nvswitch: true on a node with no NVSwitch (wrong tier/host). nvidia_stack installs and enables Fabric Manager, which then fails to reach a fabric. Gate it on tier (assert above) and follow Fabric Manager Failure.
driver_branch unquoted or mistyped -> wrong/absent metapackage. A bad branch yields no matching package and the node never gets a driver. If GPUs are missing post-converge: Kernel Upgrade, GPU Missing.
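Same-level group collision resolved the wrong way. Two same-level groups setting the same var merge alphabetically; if the loser should win, set ansible_group_priority on it.² Ansible honours that variable only when it is set in the inventory source (for example a [group:vars] section in hosts.ini), not in a group_vars/ file, so it is the one deliberate exception to keeping hosts.ini membership-only.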

References

Ansible, "How to build your inventory" (group_vars/ and host_vars/ layout, [group:children], child-group override, group merge order): https://docs.ansible.com/ansible/latest/inventory_guide/intro_inventory.html
Ansible, "Using variables" (variable precedence: role defaults < group_vars/all < group_vars/* < host_vars/* < play vars < extra-vars; default hash_behaviour=replace): https://docs.ansible.com/ansible/latest/playbook_guide/playbooks_variables.html
Ansible, ansible.builtin.host_group_vars vars plugin (file name = group/host name; .yml/.yaml/.json/none; inventory- vs playbook-relative search): https://docs.ansible.com/ansible/latest/collections/ansible/builtin/host_group_vars_vars.html
Ansible, ansible.builtin.assert module (read-only assertions, that/fail_msg/quiet): https://docs.ansible.com/ansible/latest/collections/ansible/builtin/assert_module.html
Ansible, ansible-inventory CLI (--host, --graph to render merged vars/groups): https://docs.ansible.com/ansible/latest/cli/ansible-inventory.html
NVIDIA ansible-role-nvidia-driver (nvidia_driver_branch, nvidia_driver_package_state, nvidia_driver_ubuntu_install_from_cuda_repo): https://github.com/NVIDIA/ansible-role-nvidia-driver
NVIDIA MIG User Guide, "Supported MIG Profiles" (<slices>g.<mem>gb naming, per-GPU tables): https://docs.nvidia.com/datacenter/tesla/mig-user-guide/supported-mig-profiles.html
NVIDIA Driver Installation Guide, "Kernel Modules" (open = default/suggested on Turing+; proprietary required for Maxwell/Pascal/Volta): https://docs.nvidia.com/datacenter/tesla/driver-installation-guide/kernel-modules.html

¹ Ansible, "Using variables". Variable precedence, lowest to highest: role defaults, inventory group_vars/all, inventory group_vars/*, inventory host_vars/*, play vars, extra-vars (-e, always wins); inventory inline group/host vars rank below the corresponding group_vars/* and host_vars/* files; default hash_behaviour is replace (whole-variable replacement, not deep merge). https://docs.ansible.com/ansible/latest/playbook_guide/playbooks_variables.html
² Ansible, "How to build your inventory". A child group's variables override its parent's; same-level groups merge alphabetically by group name (last loaded wins); ansible_group_priority (higher = merged later = higher precedence) overrides the alphabetical order. https://docs.ansible.com/ansible/latest/inventory_guide/intro_inventory.html
³ Ansible, ansible.builtin.host_group_vars vars plugin. Loads group_vars/ and host_vars/ relative to the inventory source and (under ansible-playbook) the playbook dir, playbook-relative overriding inventory-relative; file names match the group/host name; valid extensions .yml, .yaml, .json, or none; directory contents read in lexicographical order. https://docs.ansible.com/ansible/latest/collections/ansible/builtin/host_group_vars_vars.html
⁴ NVIDIA ansible-role-nvidia-driver. Exposes nvidia_driver_branch (default "515"), nvidia_driver_package_state (default present), nvidia_driver_persistence_mode_on, and nvidia_driver_ubuntu_install_from_cuda_repo (Ubuntu CUDA-repo vs Canonical repo selection). The KB's driver_branch maps to nvidia_driver_branch. https://github.com/NVIDIA/ansible-role-nvidia-driver
⁵ NVIDIA Driver Installation Guide, Kernel Modules. Open kernel modules are supported only on Turing and newer, and from the 560 driver series the open flavour is "the default and suggested installation"; the proprietary flavour is "required for older GPUs from the Maxwell, Pascal, or Volta architectures". Drives the driver_module_type default of open. (The page does not state a Grace Hopper/Blackwell requirement; treat that as out of scope for this citation.) https://docs.nvidia.com/datacenter/tesla/driver-installation-guide/kernel-modules.html
⁶ NVIDIA MIG User Guide, Supported MIG Profiles. Profile names encode <SM-slices>g.<memory-GB>gb (e.g. 1g.10gb, 3g.40gb, 7g.80gb); the available set is per-GPU and listed in the per-SKU tables. mig_profile must be taken from these tables for the target GPU. https://docs.nvidia.com/datacenter/tesla/mig-user-guide/supported-mig-profiles.html
⁷ Ansible, ansible.builtin.assert module. Evaluates the conditions in that; reports ok/failed only (never changed), so it is safe in an idempotent pre-flight play; fail_msg sets the failure message, quiet: true suppresses per-assertion success output. https://docs.ansible.com/ansible/latest/collections/ansible/builtin/assert_module.html
⁸ Ansible, ansible-inventory CLI. --host <name> outputs the variables for a single host after precedence/group merging; --graph [group] renders the group/child tree. https://docs.ansible.com/ansible/latest/cli/ansible-inventory.html