Ansible role: validate
Scope: fail the play if a node is not fit to take work. The validate role is the last role in site.yml: it asserts GPUs enumerate, persistence and (on NVSwitch nodes) Fabric Manager are active, InfiniBand ports are Active, and dcgmi diag -r 3 passes, then dumps dmesg on any failure for triage. It does not remediate; a red validate means the node never reaches the ready pool. The deeper "what each tool proves and how to read it" reference is GPU Diagnostics and Validation; this page is just the Ansible gate.
These are reference templates, drawn from upstream Ansible module docs and NVIDIA driver/DCGM/InfiniBand docs. Nothing here was executed on hardware. Pin package versions to your driver branch, confirm dcgmi diag --help and nvidia-smi --help-query-gpu against your installed build, and validate on one node before a fleet roll.
This builds directly on the validate role sketched in the bring-up hub: the same inventory variables (gpu_tier, nvidia_nvswitch) and the same probes (nvidia-smi --query-gpu, systemctl is-active, ibstat, dcgmi diag -r 3), expanded into a full idempotent tasks/main.yml with assert/fail and on-failure dmesg capture. It does not contradict the hub; it is the hub's role, fleshed out.
```mermaid
flowchart LR
    SMI["nvidia-smi query"] --> SVC["persistenced + fabricmanager active"]
    SVC --> IB["ibstat Active port count"]
    IB --> DIAG["dcgmi diag -r 3"]
    DIAG --> VERDICT{"all asserts pass?"}
    VERDICT -->|"yes"| READY["Node ready"]
    VERDICT -->|"no"| DUMP["Collect dmesg, fail play"]
```
What it does
A node that boots and shows GPUs in nvidia-smi is not a node fit for a multi-hour collective job: persistence may be off, Fabric Manager may be masked, an IB port may be Down, ECC may be degrading, a row-remap may be pending (Reliability, RAS and Failure Modes). The validate role runs the cheap-to-expensive checks in order and fails the play on the first unmet assertion, so a bad node is excluded from the ready pool instead of silently joining it and sinking the next job that lands on it.
Order matters: probe the stack bottom-up so the failure points at the real cause. Driver/enumeration first, then the daemons that program persistence and the NVLink fabric, then the IB plane, then the active DCGM diagnostic that exercises the silicon. A dcgmi diag failure on a node whose Fabric Manager is down is a stack fault, not a silicon fault; the ordering surfaces that. On any failure the role collects dmesg to a host-local file so the triaging engineer has the kernel ring buffer (NVRM/Xid lines) without re-attaching to a node that may get reset.
The role is read-only and idempotent: every probe is a command/shell with changed_when: false, every verdict is assert/fail. It reports ok/failed, never changed, on a healthy node, so it is safe to re-run as a standalone gate (--tags validate) at acceptance, after a kernel bump, or post-incident. Active dcgmi diag levels contend with resident work, so run the role on an idle, cordoned/drained node (GPU Health Gating); a busy GPU yields false failures [1].
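The bottom-up, stop-at-first-failure ordering can be sketched outside Ansible as a short Python loop. This is a minimal illustration, not the role's implementation: run_gate and the lambda probes are hypothetical stand-ins for the real commands, and their results are made up to model a node whose Fabric Manager is down.

```python
from typing import Callable, List, Optional, Tuple

# Each probe returns (ok, detail). Names and results below are illustrative only.
Check = Tuple[str, Callable[[], Tuple[bool, str]]]


def run_gate(checks: List[Check]) -> Tuple[bool, Optional[str], List[str]]:
    """Run checks in order and stop at the first failure.

    Returns (passed, name_of_first_failure, names_of_checks_that_ran).
    """
    ran: List[str] = []
    for name, probe in checks:
        ran.append(name)
        ok, _detail = probe()
        if not ok:
            return False, name, ran
    return True, None, ran


# Hypothetical node: driver and persistence are fine, Fabric Manager is down.
checks: List[Check] = [
    ("nvidia-smi enumeration", lambda: (True, "8 GPUs")),
    ("nvidia-persistenced", lambda: (True, "active")),
    ("nvidia-fabricmanager", lambda: (False, "inactive")),
    ("ibstat active ports", lambda: (True, "8 Active")),
    ("dcgmi diag", lambda: (False, "would fail, but never runs")),
]

passed, first_failure, ran = run_gate(checks)
print(passed, first_failure, ran)
```

Because the loop stops at nvidia-fabricmanager, the diagnostic never runs and the reported cause is the stack fault rather than a misleading silicon failure.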
Variables
Role behaviour is driven by inventory variables set on the gpu_nodes group in the hub's inventory (gpu_tier, nvidia_nvswitch) plus role defaults in roles/validate/defaults/main.yml. The Fabric Manager assertion is gated exactly as the hub gates installing it: datacenter NVSwitch nodes only.
| Variable | Source | Default | Meaning |
|---|---|---|---|
| gpu_tier | inventory (gpu_nodes:vars) | datacenter | One of datacenter, workstation, consumer. Only datacenter + nvidia_nvswitch asserts Fabric Manager. |
| nvidia_nvswitch | inventory (gpu_nodes:vars) | true | NVSwitch baseboard (HGX/DGX 8-GPU, NVL72). Gates the Fabric Manager check. PCIe/consumer nodes set false. |
| validate_expected_gpu_count | defaults | 0 | Expected GPU count per node; 0 disables the exact-count assert (only requires >= 1). Set per host (e.g. 8) to catch a fallen-off-the-bus GPU. |
| validate_min_ib_active | defaults | 1 | Minimum ibstat ports in State: Active. Set to the node's planned IB/RoCE port count to catch a dropped link. 0 skips the IB assert (no-RDMA nodes). |
| validate_dcgm_level | defaults | 3 | dcgmi diag -r run level (1..4 or named tests). 3 = Long, the acceptance/post-reset level. Use 1 for a fast gate [1]. |
| validate_run_dcgm | defaults | true | Master switch for the dcgmi diag step. Set false on nodes without DCGM installed, or to gate solely on enumeration + services + IB. |
| validate_check_remapped_rows | defaults | true | Assert no pending/failed HBM row remaps (Ampere+). Auto-skips when nvidia-smi --query-remapped-rows is unsupported [2]. |
| validate_dmesg_dir | defaults | /var/log/alexandria-validate | Host-local directory the on-failure dmesg dump is written to. |
```yaml
# roles/validate/defaults/main.yml
validate_expected_gpu_count: 0        # 0 = require >=1; set 8 on a DGX/HGX node
validate_min_ib_active: 1             # set to planned IB port count; 0 to skip
validate_dcgm_level: 3                # dcgmi diag -r <level>
validate_run_dcgm: true
validate_check_remapped_rows: true
validate_dmesg_dir: /var/log/alexandria-validate
```
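How the defaults shape the gate is easy to misread, so here is a small Python sketch of the decision logic the variables imply. It is an illustration under assumptions, not code from the role: ValidateVars and the three helper functions are hypothetical names, and only the four variables that change which asserts run are modelled.

```python
from dataclasses import dataclass


@dataclass
class ValidateVars:
    # Field names mirror the role/inventory variables; values mirror their defaults.
    gpu_tier: str = "datacenter"
    nvidia_nvswitch: bool = True
    validate_expected_gpu_count: int = 0
    validate_min_ib_active: int = 1


def fabric_manager_asserted(v: ValidateVars) -> bool:
    """Fabric Manager is only checked on datacenter NVSwitch nodes."""
    return v.gpu_tier == "datacenter" and v.nvidia_nvswitch


def gpu_count_ok(v: ValidateVars, found: int) -> bool:
    """At least one GPU is always required; an exact match only when a count is set."""
    if found < 1:
        return False
    if v.validate_expected_gpu_count > 0:
        return found == v.validate_expected_gpu_count
    return True


def ib_ok(v: ValidateVars, active_ports: int) -> bool:
    """0 skips the IB assert; otherwise require at least the configured minimum."""
    if v.validate_min_ib_active == 0:
        return True
    return active_ports >= v.validate_min_ib_active


defaults = ValidateVars()
pinned = ValidateVars(validate_expected_gpu_count=8, validate_min_ib_active=8)

# A node that lost one GPU and seven IB links still passes the default gate.
print(gpu_count_ok(defaults, 7), gpu_count_ok(pinned, 7))  # True False
print(ib_ok(defaults, 1), ib_ok(pinned, 1))                # True False
print(fabric_manager_asserted(ValidateVars(gpu_tier="workstation")))  # False
```

The point of the example is that the shipped defaults are permissive; pinning the per-host counts is what turns the gate into a check against the planned topology.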
Tasks
Real roles/validate/tasks/main.yml. Fully-qualified ansible.builtin.* module names. Every probe is read-only (changed_when: false); check_mode: false on the block makes those probes execute even when the surrounding play is run with --check. The block/rescue collects dmesg and re-raises so the play still fails. The IB step skips only when validate_min_ib_active: 0; if IB validation is required and ibstat is missing, the role fails closed. The remapped-rows step self-skips when unsupported, so the same role runs on an NVSwitch DGX and a single-GPU PCIe box without edits.
```yaml
# roles/validate/tasks/main.yml
# Read-only health gate. Asserts node fitness; collects dmesg and fails on any unmet check.
# Run on an idle, cordoned/drained node: dcgmi diag needs exclusive GPU access.
- name: Validate node health
  check_mode: false  # read-only probes must still execute under --check
  block:
    # 1. GPUs enumerate and the driver answers (nvidia-smi exits non-zero if it cannot
    #    talk to the driver: "couldn't communicate with the NVIDIA driver").
    - name: Query GPUs (name, driver, pstate)
      ansible.builtin.command:
        cmd: nvidia-smi --query-gpu=name,driver_version,pstate --format=csv,noheader
      register: smi
      changed_when: false
      failed_when: smi.rc != 0

    - name: Assert at least one GPU is visible
      ansible.builtin.assert:
        that:
          - smi.stdout_lines | length >= 1
        fail_msg: "No GPUs enumerated by nvidia-smi on {{ inventory_hostname }}"
        quiet: true

    - name: Assert the expected GPU count (when validate_expected_gpu_count > 0)
      ansible.builtin.assert:
        that:
          - smi.stdout_lines | length == validate_expected_gpu_count
        fail_msg: >-
          Expected {{ validate_expected_gpu_count }} GPUs, found
          {{ smi.stdout_lines | length }} (GPU fell off the bus?)
        quiet: true
      when: validate_expected_gpu_count | int > 0

    # 2. Persistence daemon active (avoids per-job re-init latency / clock-down).
    - name: Check nvidia-persistenced is active
      ansible.builtin.command:
        cmd: systemctl is-active nvidia-persistenced
      register: persistenced
      changed_when: false
      failed_when: false  # is-active returns non-zero when inactive; assert below

    - name: Assert nvidia-persistenced is active
      ansible.builtin.assert:
        that:
          - persistenced.stdout == "active"
        fail_msg: "nvidia-persistenced is '{{ persistenced.stdout }}' (expected active)"
        quiet: true

    # 3. Fabric Manager active on datacenter NVSwitch baseboards only (same gate the
    #    hub uses to install it). PCIe/consumer nodes never run nv-fabricmanager.
    - name: Check nvidia-fabricmanager is active
      ansible.builtin.command:
        cmd: systemctl is-active nvidia-fabricmanager
      register: fm
      changed_when: false
      failed_when: false
      when: gpu_tier == 'datacenter' and nvidia_nvswitch | bool

    - name: Assert Fabric Manager is active on NVSwitch systems
      ansible.builtin.assert:
        that:
          - fm.stdout == "active"
        fail_msg: >-
          nvidia-fabricmanager is '{{ fm.stdout }}' on an NVSwitch node;
          GPUs will not form the NVLink domain. See runbook-fabric-manager-failure.
        quiet: true
      when: gpu_tier == 'datacenter' and nvidia_nvswitch | bool

    # 4. InfiniBand ports ACTIVE. ibstat prints "State: Active" per up port; count them.
    #    Skip only when validate_min_ib_active == 0 (no-RDMA nodes).
    - name: Locate ibstat
      ansible.builtin.command:
        cmd: which ibstat
      register: ibstat_bin
      changed_when: false
      failed_when: false
      when: validate_min_ib_active | int > 0

    - name: Assert ibstat is installed when IB validation is required
      ansible.builtin.assert:
        that:
          - ibstat_bin.rc == 0
        fail_msg: "ibstat is missing but validate_min_ib_active={{ validate_min_ib_active }}"
        quiet: true
      when: validate_min_ib_active | int > 0

    - name: Count InfiniBand ports in State Active
      ansible.builtin.shell:
        cmd: "set -o pipefail; ibstat | grep -c 'State: Active'"
      args:
        executable: /bin/bash
      register: ib_active
      changed_when: false
      failed_when: false  # grep -c exits 1 on zero matches; assert below
      when:
        - validate_min_ib_active | int > 0

    - name: Assert enough InfiniBand ports are Active
      ansible.builtin.assert:
        that:
          - (ib_active.stdout | default('0') | int) >= (validate_min_ib_active | int)
        fail_msg: >-
          Only {{ ib_active.stdout | default('0') }} IB port(s) Active,
          need {{ validate_min_ib_active }}. Check ibstat / subnet manager.
        quiet: true
      when:
        - validate_min_ib_active | int > 0

    # 5. HBM row-remap health (Ampere+). A pending/failed remap means RMA-triage, not pool.
    - name: Query remapped rows (pending/failure)
      ansible.builtin.command:
        cmd: >-
          nvidia-smi
          --query-remapped-rows=remapped_rows.pending,remapped_rows.failure
          --format=csv,noheader
      register: remap
      changed_when: false
      failed_when: false  # pre-Ampere prints [N/A] (exit 0, not non-zero); skip via N/A guard below
      when: validate_check_remapped_rows | bool

    - name: Assert no pending or failed row remaps
      ansible.builtin.assert:
        # nvidia-smi does not document the literal CSV value for these fields, and it
        # varies by driver (some emit Yes/No, some a numeric count). Be format-agnostic:
        # a healthy board reports only "No" or "0" for every field; anything else condemns.
        that:
          - remap.stdout | regex_replace('[\r\n]+', ',') | split(',') | map('trim') | map('lower') | reject('in', ['no', '0']) | list | length == 0
        fail_msg: "Pending/failed HBM row remap on {{ inventory_hostname }}: {{ remap.stdout }}"
        quiet: true
      when:
        - validate_check_remapped_rows | bool
        - "'N/A' not in (remap.stdout | default(''))"  # pre-Ampere prints [N/A] at exit 0; skip when unsupported

    # 6. Active DCGM diagnostic. dcgmi diag exits non-zero on any plugin failure
    #    (e.g. DCGM_ST_NVVS_ERROR / 226). Needs exclusive GPU access -> idle node only.
    - name: Run dcgmi diag (active diagnostic)
      ansible.builtin.command:
        cmd: "dcgmi diag -r {{ validate_dcgm_level }}"
      register: diag
      changed_when: false
      failed_when: diag.rc != 0 or ('Fail' in diag.stdout)
      when: validate_run_dcgm | bool

  rescue:
    # On any failure above, capture the kernel ring buffer for triage, then re-raise.
    - name: Ensure dmesg capture directory exists
      ansible.builtin.file:
        path: "{{ validate_dmesg_dir }}"
        state: directory
        mode: "0750"

    - name: Collect dmesg to a host-local file
      ansible.builtin.shell:
        cmd: >-
          dmesg --ctime > "{{ validate_dmesg_dir }}/dmesg-{{ ansible_date_time.iso8601_basic_short }}.log"
      args:
        executable: /bin/bash
      changed_when: true

    - name: Fail the play after collecting diagnostics
      ansible.builtin.fail:
        msg: >-
          validate failed on {{ inventory_hostname }}; dmesg saved under
          {{ validate_dmesg_dir }}. Triage with diagnostics-tools / the matching runbook.
```
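To make the two text-parsing steps easier to reason about, the sketch below re-expresses them in Python: the State: Active line count from step 4 and the format-agnostic remapped-rows verdict from step 5. The function names are hypothetical, and the sample strings are hand-written approximations of ibstat and nvidia-smi output, not captured from hardware.

```python
import re


def count_active_ib_ports(ibstat_output: str) -> int:
    """Count lines containing 'State: Active', like grep -c 'State: Active'."""
    return sum(1 for line in ibstat_output.splitlines() if "State: Active" in line)


def remap_unsupported(stdout: str) -> bool:
    """Mirror the role's guard: unsupported fields show up as N/A at exit 0."""
    return "N/A" in stdout


def remap_clean(stdout: str) -> bool:
    """Mirror the Jinja chain: every CSV field must be 'no' or '0' (case-insensitive)."""
    fields = re.sub(r"[\r\n]+", ",", stdout).split(",")
    offending = [f for f in (x.strip().lower() for x in fields) if f not in ("no", "0")]
    return len(offending) == 0


# Hand-written excerpt in the shape of ibstat output (two HCAs, one port down).
IBSTAT_SAMPLE = """\
CA 'mlx5_0'
    Port 1:
        State: Active
        Physical state: LinkUp
CA 'mlx5_1'
    Port 1:
        State: Down
        Physical state: Disabled
"""

print(count_active_ib_ports(IBSTAT_SAMPLE))   # 1
print(remap_clean("No, No\nNo, No"))          # True
print(remap_clean("0, 0"))                    # True
print(remap_clean("Yes, No"))                 # False
print(remap_unsupported("[N/A], [N/A]"))      # True
```

Note that an empty result is treated as unhealthy by this logic, just as in the Jinja expression: an empty field is neither no nor 0, so the check fails closed.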
Idempotency notes: the probe tasks are changed_when: false, so a healthy run reports zero changes. failed_when: false on systemctl is-active, which ibstat, grep -c, and the remapped-rows query keeps the command step from aborting on a non-zero exit that the following assert is meant to adjudicate, so each verdict lives in exactly one place. Two probes carry their own verdict instead: the nvidia-smi query fails on a non-zero exit, and dcgmi diag fails on a non-zero exit or a Fail in its output. The dmesg dump only runs inside rescue, i.e. only on failure, and is reported changed deliberately (it writes a file). Apart from those two failed_when conditions, assert/fail are the only failure surfaces.
Apply & verify
Run the whole bring-up, or just the gate, from the hub's site.yml. Tag the role so it can be invoked standalone.
```bash
# Full bring-up (validate runs last):
ansible-playbook -i inventory/hosts.ini site.yml --limit gpu-07.dc1.internal

# Just the gate, on an idle/cordoned node (tag the role include in site.yml as 'validate'):
ansible-playbook -i inventory/hosts.ini site.yml --tags validate --limit gpu-07.dc1.internal

# Run the read-only checks while using --check for the rest of the play;
# the validate block sets check_mode: false so probes execute:
ansible-playbook -i inventory/hosts.ini site.yml --tags validate --check --limit gpu-07.dc1.internal
```
Expected signal on a healthy node: the PLAY RECAP shows failed=0 and the validate tasks report ok, never changed (every probe is changed_when: false). The dcgmi diag task ends ok only when the command exited zero and no Fail appears in its table.
On an unhealthy node the play stops at the first failed assert/fail, the recap shows failed=1, the fail_msg names the unmet check, and a dmesg-<timestamp>.log appears under validate_dmesg_dir. Confirm the gate is real by forcing a failure and reading the output:
```bash
# Force a failure path to prove the gate fails closed (run on a throwaway/lab node):
sudo systemctl stop nvidia-persistenced    # expect: assert fails, dmesg captured, failed=1
ls -1 /var/log/alexandria-validate/        # expect: dmesg-<ts>.log present
sudo systemctl start nvidia-persistenced   # restore; re-run validate -> failed=0
```
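Once a dmesg-<ts>.log exists, the first triage step is usually to pull out the NVRM Xid lines. The following is a minimal sketch of that scan, assuming the commonly seen "NVRM: Xid (PCI:<bus id>): <number>, ..." line shape; find_xids is a hypothetical helper and the sample log lines are hand-written, not taken from a real node.

```python
import re
from typing import List, Tuple

# Assumed line shape: "NVRM: Xid (PCI:0000:3b:00): 79, ..." (verify against your driver's output).
XID_RE = re.compile(r"NVRM: Xid \(PCI:(?P<pci>[^)]+)\): (?P<xid>\d+)")


def find_xids(dmesg_text: str) -> List[Tuple[str, int]]:
    """Return (pci_bus_id, xid_number) for every NVRM Xid line in a dmesg dump."""
    hits: List[Tuple[str, int]] = []
    for line in dmesg_text.splitlines():
        match = XID_RE.search(line)
        if match:
            hits.append((match.group("pci"), int(match.group("xid"))))
    return hits


# Hand-written sample in the style of `dmesg --ctime` output.
SAMPLE = """\
[Tue Mar  4 10:15:01 2025] usb 1-1: new high-speed USB device number 2
[Tue Mar  4 10:15:02 2025] NVRM: Xid (PCI:0000:3b:00): 79, pid='<unknown>', name=<unknown>, GPU has fallen off the bus.
[Tue Mar  4 10:15:03 2025] EXT4-fs (sda1): mounted filesystem
"""

print(find_xids(SAMPLE))
```

In practice the text would come from reading the captured file under validate_dmesg_dir rather than from an inline string.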
For the machine-readable variant (a Slurm HealthCheckProgram or a Kubernetes validator wrapping the same probes), gate on dcgmi diag -r {{ validate_dcgm_level }} -j and parse the per-test status field (tests[].results[].status, or the roll-up tests[].test_summary.status; values Pass/Fail) rather than scraping the table; see GPU Health Gating for wiring a verdict to a scheduler state change [1].
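A hedged sketch of that JSON gate in Python follows. It relies only on the keys named above (tests, name, results, status, test_summary) and walks the document recursively, so it does not depend on where the tests lists sit. The top-level keys and test names in SAMPLE are assumptions made for illustration, not real dcgmi output; check the structure against dcgmi diag -j on your installed build.

```python
import json
from typing import Any, Dict, Iterator, List


def iter_tests(node: Any) -> Iterator[Dict[str, Any]]:
    """Yield every dict found inside any 'tests' list, at any depth."""
    if isinstance(node, dict):
        for key, value in node.items():
            if key == "tests" and isinstance(value, list):
                for test in value:
                    if isinstance(test, dict):
                        yield test
            else:
                yield from iter_tests(value)
    elif isinstance(node, list):
        for item in node:
            yield from iter_tests(item)


def failed_tests(report: Any) -> List[str]:
    """Names of tests with a 'Fail' in any result status or in the roll-up summary."""
    failed: List[str] = []
    for test in iter_tests(report):
        statuses = [r.get("status") for r in test.get("results", []) if isinstance(r, dict)]
        summary = test.get("test_summary")
        if isinstance(summary, dict):
            statuses.append(summary.get("status"))
        if "Fail" in statuses:
            failed.append(test.get("name", "<unnamed>"))
    return failed


# Hand-written sample; the outer key names are assumed, not taken from DCGM docs.
SAMPLE = """
{"DCGM Diagnostic": {"test_categories": [
  {"category": "Deployment", "tests": [
    {"name": "software", "results": [{"status": "Pass"}],
     "test_summary": {"status": "Pass"}}]},
  {"category": "Hardware", "tests": [
    {"name": "memory", "results": [{"status": "Pass"}, {"status": "Fail"}],
     "test_summary": {"status": "Fail"}}]}
]}}
"""

failures = failed_tests(json.loads(SAMPLE))
print(failures)  # ['memory']
```

This sketch treats any status other than Fail as non-failing; a stricter gate could instead require Pass for every result, and should still honour the command's exit code.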
Failure modes
- dcgmi diag fails for a non-silicon reason. A missing/mismatched kernel module, a down Fabric Manager, or a GSP/driver mismatch makes the diagnostic fail though the silicon is fine. The role's bottom-up order surfaces the real cause: enumeration and service asserts trip before dcgmi diag if the stack is broken. Triage: Kernel upgrade, GPU missing, Fabric Manager Failure. Tool-level reading of the failure: GPU Diagnostics and Validation.
- dcgmi diag run on a busy node. Active levels contend with resident work and return false failures [1]. Cordon/drain first (GPU Health Gating); never run validate against live jobs.
- Fabric Manager active but the fabric is dead. FM_STAY_RESIDENT_ON_FAILURES=1 lets the daemon stay active while the system is uninitialized, so systemctl is-active passes but CUDA still fails with cudaErrorSystemNotReady. The service assert alone is insufficient; dcgmi diag (NCCL/NVLink plugins at -r 3) is what actually exercises the fabric. Diagnosis: Fabric Manager Failure.
- IB assert passes with fewer ports than planned. validate_min_ib_active defaults to 1; a node that should have 8 IB ports up but has 1 passes a default gate. Set validate_min_ib_active to the node's planned port count to catch a dropped link.
- Pending row-remap missed on pre-Ampere or with the check disabled. --query-remapped-rows is Ampere+ only; the step self-skips elsewhere. A pending/failed remap requires a GPU reset and RMA-triage, not a return to pool. Persistence Mode / Clock Bounce does not cover this case; use Reliability, RAS and Failure Modes for the fault taxonomy.
- dmesg requires privilege. dmesg may need become: true (the hub runs the play with become: true), and on some kernels kernel.dmesg_restrict=1 blocks non-root reads; the capture runs as root via the play's privilege escalation.
References
- Ansible ansible.builtin.assert (that list, fail_msg/msg, success_msg, quiet): https://docs.ansible.com/projects/ansible/latest/collections/ansible/builtin/assert_module.html
- Ansible ansible.builtin.fail (msg, conditional fail via when): https://docs.ansible.com/projects/ansible/latest/collections/ansible/builtin/fail_module.html
- Ansible ansible.builtin.command (register, rc, stdout, changed_when, failed_when): https://docs.ansible.com/projects/ansible/latest/collections/ansible/builtin/command_module.html
- Ansible error handling (failed_when, changed_when, block/rescue): https://docs.ansible.com/projects/ansible/latest/playbook_guide/playbooks_error_handling.html
- NVIDIA DCGM Diagnostics (dcgmi diag -r <1-4|test>, -j JSON, non-zero exit on plugin failure, run levels): https://docs.nvidia.com/datacenter/dcgm/latest/user-guide/dcgm-diagnostics.html
- NVIDIA DCGM Feature Overview (active diagnostics need exclusive GPU access; dcgmi health): https://docs.nvidia.com/datacenter/dcgm/latest/user-guide/feature-overview.html
- nvidia-smi manual (--query-gpu, --query-remapped-rows, --format=csv,noheader): https://docs.nvidia.com/deploy/nvidia-smi/index.html
- NVIDIA Driver Persistence / nvidia-persistenced daemon and systemd unit: https://docs.nvidia.com/deploy/driver-persistence/index.html
- NVIDIA Fabric Manager user guide (nvidia-fabricmanager service, cudaErrorSystemNotReady): https://docs.nvidia.com/datacenter/tesla/fabric-manager-user-guide/index.html
- InfiniBand port states (State: Active, Physical state: LinkUp via ibstat): https://www.advancedclustering.com/act_kb/infiniband-port-states/
Related: Ansible: Node and Fabric Bring-Up · GPU Diagnostics and Validation · GPU Health Gating · Fabric Manager Failure · Kernel Upgrade — GPU Missing · Glossary
[1] NVIDIA DCGM Diagnostics: dcgmi diag -r <1|2|3|4|test_name> selects the run level (higher includes lower), -j/--json prints parseable JSON whose per-test/per-result status field carries Pass/Fail (keys category/tests/name/results/status/test_summary; there is no result field), and the program returns a non-zero exit code matching dcgmReturn_t on a plugin failure (e.g. DCGM_ST_NVVS_ERROR = 226). The active levels exercise real workloads and fail if other graphics processes run on the target GPU(s), so they require an idle/exclusive GPU. https://docs.nvidia.com/datacenter/dcgm/latest/user-guide/dcgm-diagnostics.html · exclusivity and health watches: https://docs.nvidia.com/datacenter/dcgm/latest/user-guide/feature-overview.html

[2] NVIDIA nvidia-smi: --query-remapped-rows=remapped_rows.pending,remapped_rows.failure,remapped_rows.correctable,remapped_rows.uncorrectable --format=csv reports HBM row-remap state on Ampere and newer; a pending remap requires a GPU reset to take effect, and a remap failure condemns the board. On pre-Ampere parts the field is unsupported: per the nvidia-smi RETURN VALUE section the exit code stays 0 (the documented codes are 0/2/3) and "Any unsupported data is indicated by a 'N/A' in the output", so the value reads [N/A] rather than failing; the task therefore skips on 'N/A' in remap.stdout, not on a non-zero rc. The docs describe pending/failure only as indicating "whether or not" a remap is pending/failed and do not fix the literal CSV string, so the assert treats anything other than No/0 as a condemned board. https://docs.nvidia.com/deploy/nvidia-smi/index.html