Ansible role: validate

Scope: fail the play if a node is not fit to take work. The validate role is the last role in site.yml. It asserts that GPUs enumerate, that persistence and (on NVSwitch nodes) Fabric Manager are active, that InfiniBand ports are Active, that no HBM row remap is pending or failed, and that dcgmi diag -r 3 passes; on any failure it dumps dmesg for triage. It does not remediate: a red validate means the node never reaches the ready pool. The deeper "what each tool proves and how to read it" reference is GPU Diagnostics and Validation; this page covers only the Ansible gate.

These are reference templates, drawn from upstream Ansible module docs and NVIDIA driver/DCGM/InfiniBand docs. Nothing here was executed on hardware. Pin package versions to your driver branch, confirm dcgmi diag --help and nvidia-smi --help-query-gpu against your installed build, and validate on one node before a fleet roll.

This builds directly on the validate role sketched in the bring-up hub: the same inventory variables (gpu_tier, nvidia_nvswitch) and the same probes (nvidia-smi --query-gpu, systemctl is-active, ibstat, dcgmi diag -r 3), expanded into a full idempotent tasks/main.yml with assert/fail and on-failure dmesg capture. It does not contradict the hub; it is the hub's role, fleshed out.

flowchart LR
  SMI["nvidia-smi query"] --> SVC["persistenced + fabricmanager active"]
  SVC --> IB["ibstat Active port count"]
  IB --> DIAG["dcgmi diag -r 3"]
  DIAG --> VERDICT{"all asserts pass?"}
  VERDICT -->|"yes"| READY["Node ready"]
  VERDICT -->|"no"| DUMP["Collect dmesg, fail play"]

What it does

A node that boots and shows GPUs in nvidia-smi is not a node fit for a multi-hour collective job: persistence may be off, Fabric Manager may be masked, an IB port may be Down, ECC may be degrading, a row-remap may be pending (Reliability, RAS and Failure Modes). The validate role runs the cheap-to-expensive checks in order and fails the play on the first unmet assertion, so a bad node is excluded from the ready pool instead of silently joining it and sinking the next job that lands on it.

Order matters: probe the stack bottom-up so the failure points at the real cause. Driver/enumeration first, then the daemons that program persistence and the NVLink fabric, then the IB plane, then the active DCGM diagnostic that exercises the silicon. A dcgmi diag failure on a node whose Fabric Manager is down is a stack fault, not a silicon fault; the ordering surfaces that. On any failure the role collects dmesg to a host-local file so the triaging engineer has the kernel ring buffer (NVRM/Xid lines) without re-attaching to a node that may get reset.

The role is read-only and idempotent on a healthy node: every probe is a command/shell task with changed_when: false, and every verdict is an assert/fail or an explicit failed_when on the probe. It reports ok/failed, never changed, on a healthy node, so it is safe to re-run as a standalone gate (--tags validate) at acceptance, after a kernel bump, or post-incident. The only write is the on-failure dmesg dump. Active dcgmi diag levels contend with resident work, so run the role on an idle, cordoned/drained node (GPU Health Gating); a busy GPU yields false failures.¹

Variables

Role behaviour is driven by inventory variables set on the gpu_nodes group in the hub's inventory (gpu_tier, nvidia_nvswitch) plus role defaults in roles/validate/defaults/main.yml. The Fabric Manager assertion is gated exactly as the hub gates installing it: datacenter NVSwitch nodes only.

| Variable | Source | Default | Meaning |
| --- | --- | --- | --- |
| `gpu_tier` | inventory (`gpu_nodes:vars`) | `datacenter` | `datacenter` \| `workstation` \| `consumer`. Only `datacenter` + `nvidia_nvswitch` asserts Fabric Manager. |
| `nvidia_nvswitch` | inventory (`gpu_nodes:vars`) | `true` | NVSwitch baseboard (HGX/DGX 8-GPU, NVL72). Gates the Fabric Manager check. PCIe/consumer nodes set `false`. |
| `validate_expected_gpu_count` | defaults | `0` | Expected GPU count per node; `0` disables the exact-count assert (only requires `>= 1`). Set per host (e.g. `8`) to catch a fallen-off-the-bus GPU. |
| `validate_min_ib_active` | defaults | `1` | Minimum `ibstat` ports in `State: Active`. Set to the node's planned IB/RoCE port count to catch a dropped link. `0` skips the IB assert (no-RDMA nodes). |
| `validate_dcgm_level` | defaults | `3` | `dcgmi diag -r` run level (`1`..`4` or named tests). `3` = Long, the acceptance/post-reset level. Use `1` for a fast gate.¹ |
| `validate_run_dcgm` | defaults | `true` | Master switch for the `dcgmi diag` step. Set `false` on nodes without DCGM installed, or to gate solely on enumeration + services + IB. |
| `validate_check_remapped_rows` | defaults | `true` | Assert no pending/failed HBM row remaps (Ampere+). Auto-skips when `nvidia-smi --query-remapped-rows` is unsupported.² |
| `validate_dmesg_dir` | defaults | `/var/log/alexandria-validate` | Host-local directory the on-failure `dmesg` dump is written to. |

# roles/validate/defaults/main.yml
validate_expected_gpu_count: 0      # 0 = require >=1; set 8 on a DGX/HGX node
validate_min_ib_active: 1           # set to planned IB port count; 0 to skip
validate_dcgm_level: 3              # dcgmi diag -r <level>
validate_run_dcgm: true
validate_check_remapped_rows: true
validate_dmesg_dir: /var/log/alexandria-validate

Tasks

The full roles/validate/tasks/main.yml follows, using fully-qualified ansible.builtin.* module names. Every probe is read-only (changed_when: false); check_mode: false on the block makes those probes execute even when the surrounding play is run with --check. The block/rescue collects dmesg and re-raises so the play still fails. The IB step skips only when validate_min_ib_active: 0; if IB validation is required and ibstat is missing, the role fails closed. The remapped-rows step self-skips when unsupported, so the same role runs on an NVSwitch DGX and a single-GPU PCIe box without edits.

# roles/validate/tasks/main.yml
# Read-only health gate. Asserts node fitness; collects dmesg and fails on any unmet check.
# Run on an idle, cordoned/drained node: dcgmi diag needs exclusive GPU access.
- name: Validate node health
  check_mode: false            # read-only probes must still execute under --check
  block:
    # 1. GPUs enumerate and the driver answers (nvidia-smi exits non-zero if it cannot
    #    talk to the driver: "couldn't communicate with the NVIDIA driver").
    - name: Query GPUs (name, driver, pstate)
      ansible.builtin.command:
        cmd: nvidia-smi --query-gpu=name,driver_version,pstate --format=csv,noheader
      register: smi
      changed_when: false
      failed_when: smi.rc != 0

    - name: Assert at least one GPU is visible
      ansible.builtin.assert:
        that:
          - smi.stdout_lines | length >= 1
        fail_msg: "No GPUs enumerated by nvidia-smi on {{ inventory_hostname }}"
        quiet: true

    - name: Assert the expected GPU count (when validate_expected_gpu_count > 0)
      ansible.builtin.assert:
        that:
          - smi.stdout_lines | length == validate_expected_gpu_count
        fail_msg: >-
          Expected {{ validate_expected_gpu_count }} GPUs, found
          {{ smi.stdout_lines | length }} (GPU fell off the bus?)
        quiet: true
      when: validate_expected_gpu_count | int > 0

    # 2. Persistence daemon active (avoids per-job re-init latency / clock-down).
    - name: Check nvidia-persistenced is active
      ansible.builtin.command:
        cmd: systemctl is-active nvidia-persistenced
      register: persistenced
      changed_when: false
      failed_when: false        # is-active returns non-zero when inactive; assert below

    - name: Assert nvidia-persistenced is active
      ansible.builtin.assert:
        that:
          - persistenced.stdout == "active"
        fail_msg: "nvidia-persistenced is '{{ persistenced.stdout }}' (expected active)"
        quiet: true

    # 3. Fabric Manager active on datacenter NVSwitch baseboards only (same gate the
    #    hub uses to install it). PCIe/consumer nodes never run nv-fabricmanager.
    - name: Check nvidia-fabricmanager is active
      ansible.builtin.command:
        cmd: systemctl is-active nvidia-fabricmanager
      register: fm
      changed_when: false
      failed_when: false
      when: gpu_tier == 'datacenter' and nvidia_nvswitch | bool

    - name: Assert Fabric Manager is active on NVSwitch systems
      ansible.builtin.assert:
        that:
          - fm.stdout == "active"
        fail_msg: >-
          nvidia-fabricmanager is '{{ fm.stdout }}' on an NVSwitch node;
          GPUs will not form the NVLink domain. See runbook-fabric-manager-failure.
        quiet: true
      when: gpu_tier == 'datacenter' and nvidia_nvswitch | bool

    # 4. InfiniBand ports ACTIVE. ibstat prints "State: Active" per up port; count them.
    #    Skip only when validate_min_ib_active == 0 (no-RDMA nodes).
    - name: Locate ibstat
      ansible.builtin.command:
        cmd: which ibstat
      register: ibstat_bin
      changed_when: false
      failed_when: false
      when: validate_min_ib_active | int > 0

    - name: Assert ibstat is installed when IB validation is required
      ansible.builtin.assert:
        that:
          - ibstat_bin.rc == 0
        fail_msg: "ibstat is missing but validate_min_ib_active={{ validate_min_ib_active }}"
        quiet: true
      when: validate_min_ib_active | int > 0

    - name: Count InfiniBand ports in State Active
      ansible.builtin.shell:
        cmd: "set -o pipefail; ibstat | grep -c 'State: Active'"
      args:
        executable: /bin/bash
      register: ib_active
      changed_when: false
      failed_when: false        # grep -c exits 1 on zero matches; assert below
      when:
        - validate_min_ib_active | int > 0

    - name: Assert enough InfiniBand ports are Active
      ansible.builtin.assert:
        that:
          - (ib_active.stdout | default('0') | int) >= (validate_min_ib_active | int)
        fail_msg: >-
          Only {{ ib_active.stdout | default('0') }} IB port(s) Active,
          need {{ validate_min_ib_active }}. Check ibstat / subnet manager.
        quiet: true
      when:
        - validate_min_ib_active | int > 0

    # 5. HBM row-remap health (Ampere+). A pending/failed remap means RMA-triage, not pool.
    - name: Query remapped rows (pending/failure)
      ansible.builtin.command:
        cmd: >-
          nvidia-smi
          --query-remapped-rows=remapped_rows.pending,remapped_rows.failure
          --format=csv,noheader
      register: remap
      changed_when: false
      failed_when: false        # pre-Ampere prints [N/A] (exit 0, not non-zero); skip via N/A guard below
      when: validate_check_remapped_rows | bool

    - name: Assert no pending or failed row remaps
      ansible.builtin.assert:
        # nvidia-smi does not document the literal CSV value for these fields, and it
        # varies by driver (some emit Yes/No, some a numeric count). Be format-agnostic:
        # a healthy board reports only "No" or "0" for every field; anything else condemns.
        that:
          - remap.stdout | regex_replace('[\r\n]+', ',') | split(',') | map('trim') | map('lower') | reject('in', ['no', '0']) | list | length == 0
        fail_msg: "Pending/failed HBM row remap on {{ inventory_hostname }}: {{ remap.stdout }}"
        quiet: true
      when:
        - validate_check_remapped_rows | bool
        - "'N/A' not in (remap.stdout | default(''))"   # pre-Ampere prints [N/A] at exit 0; skip when unsupported

    # 6. Active DCGM diagnostic. dcgmi diag exits non-zero on any plugin failure
    #    (e.g. DCGM_ST_NVVS_ERROR / 226). Needs exclusive GPU access -> idle node only.
    - name: Run dcgmi diag (active diagnostic)
      ansible.builtin.command:
        cmd: "dcgmi diag -r {{ validate_dcgm_level }}"
      register: diag
      changed_when: false
      failed_when: diag.rc != 0 or ('Fail' in diag.stdout)
      when: validate_run_dcgm | bool

  rescue:
    # On any failure above, capture the kernel ring buffer for triage, then re-raise.
    - name: Ensure dmesg capture directory exists
      ansible.builtin.file:
        path: "{{ validate_dmesg_dir }}"
        state: directory
        mode: "0750"

    - name: Collect dmesg to a host-local file
      ansible.builtin.shell:
        cmd: >-
          dmesg --ctime > "{{ validate_dmesg_dir }}/dmesg-{{ ansible_date_time.iso8601_basic_short }}.log"
      args:
        executable: /bin/bash
      changed_when: true

    - name: Fail the play after collecting diagnostics
      ansible.builtin.fail:
        msg: >-
          validate failed on {{ inventory_hostname }}; dmesg saved under
          {{ validate_dmesg_dir }}. Triage with diagnostics-tools / the matching runbook.

Idempotency notes: the probe tasks are changed_when: false, so a healthy run reports zero changes. failed_when: false on the systemctl is-active and grep -c probes stops the command step from aborting on a non-zero exit that the following assert is meant to adjudicate; on the remapped-rows query it likewise defers the verdict to the assert. Each verdict therefore lives in exactly one place. The dmesg dump runs only inside rescue, i.e. only on failure, and is deliberately reported as changed because it writes a file. Apart from the nvidia-smi query and dcgmi diag tasks, which fail on their own failed_when, assert/fail are the only failure surfaces.
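The grep -c exit status that failed_when: false absorbs can be reproduced off-node. The following is a minimal shell sketch, not part of the role: count_active_ports is a hypothetical helper, and the sample text only imitates the shape of ibstat port stanzas; it was not captured from hardware.

```shell
#!/bin/sh
# Hypothetical helper: counts "State: Active" lines on stdin, the same
# pattern the role applies to ibstat output.
count_active_ports() {
    grep -c 'State: Active'
}

# Illustrative text shaped like one ibstat port stanza with the link down.
sample_down='Port 1:
    State: Down
    Physical state: Polling'

# grep -c still prints the count (0) but exits 1 when nothing matched,
# so the exit status alone cannot be treated as a probe failure.
rc=0
count=$(printf '%s\n' "$sample_down" | count_active_ports) || rc=$?
echo "count=$count rc=$rc"   # prints: count=0 rc=1
```

The count on stdout is what carries the information; the role therefore ignores the exit status on the probe and compares the count against validate_min_ib_active in the following assert.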

Apply & verify

Run the whole bring-up, or just the gate, from the hub's site.yml. Tag the role so it can be invoked standalone.

# Full bring-up (validate runs last):
ansible-playbook -i inventory/hosts.ini site.yml --limit gpu-07.dc1.internal

# Just the gate, on an idle/cordoned node (tag the role include in site.yml as 'validate'):
ansible-playbook -i inventory/hosts.ini site.yml --tags validate --limit gpu-07.dc1.internal

# Run the read-only checks while using --check for the rest of the play; the validate block sets check_mode:false so probes execute:
ansible-playbook -i inventory/hosts.ini site.yml --tags validate --check --limit gpu-07.dc1.internal

Expected signal on a healthy node: the PLAY RECAP shows failed=0 and the validate tasks report ok, never changed (every probe is changed_when: false). The dcgmi diag task ends ok only when the command exited zero and no Fail appears in its table.

On an unhealthy node the play stops at the first failed assert/fail, the recap shows failed=1, the fail_msg names the unmet check, and a dmesg-<timestamp>.log appears under validate_dmesg_dir. Confirm the gate is real by forcing a failure and reading the output:

# Force a failure path to prove the gate fails closed (run on a throwaway/lab node):
sudo systemctl stop nvidia-persistenced        # expect: assert fails, dmesg captured, failed=1
ls -1 /var/log/alexandria-validate/            # expect: dmesg-<ts>.log present
sudo systemctl start nvidia-persistenced       # restore; re-run validate -> failed=0

For the machine-readable variant (a Slurm HealthCheckProgram or a Kubernetes validator wrapping the same probes), gate on dcgmi diag -r {{ validate_dcgm_level }} -j instead of scraping the table. Parse the per-test status field (tests[].results[].status, or the roll-up tests[].test_summary.status; values Pass/Fail). See GPU Health Gating for wiring a verdict to a scheduler state change.¹
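As a hedged illustration of the JSON gate, the sketch below assumes jq is installed and that a failing test carries "status": "Fail" somewhere in the document. diag_json_passed is a hypothetical helper, and the sample JSON is hand-written from the key names listed on this page (tests, name, results, status, test_summary); it is not real dcgmi output. The filter walks the whole tree, so it does not depend on the exact nesting of your DCGM build.

```shell
#!/bin/sh
# Hypothetical helper: reads `dcgmi diag -r <level> -j` output on stdin and
# succeeds only if no JSON object anywhere has "status": "Fail".
# jq -e maps a final `false` to exit 1; invalid or empty input also exits non-zero,
# so the check fails closed.
diag_json_passed() {
    jq -e '[.. | objects | select(.status == "Fail")] | length == 0' >/dev/null
}

# Hand-written sample using the key names from this page; NOT real dcgmi output.
sample_fail='{"tests":[
  {"name":"memory","results":[{"status":"Pass"}],"test_summary":{"status":"Pass"}},
  {"name":"pcie","results":[{"status":"Fail"}],"test_summary":{"status":"Fail"}}]}'

if command -v jq >/dev/null 2>&1; then
    if printf '%s\n' "$sample_fail" | diag_json_passed; then
        echo "fit"
    else
        echo "unfit"
    fi
fi

# Usage sketch on a real node (not executed here); keeps the dcgmi exit code in play:
#   out=$(dcgmi diag -r 3 -j) && printf '%s\n' "$out" | diag_json_passed || echo "node unfit"
```

Checking both the exit code and the parsed status mirrors the role's failed_when, which also requires a zero exit and no Fail in the output.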

Failure modes

dcgmi diag fails for a non-silicon reason. A missing/mismatched kernel module, a down Fabric Manager, or a GSP/driver mismatch makes the diagnostic fail though the silicon is fine. The role's bottom-up order surfaces the real cause: enumeration and service asserts trip before dcgmi diag if the stack is broken. Triage: Kernel upgrade, GPU missing, Fabric Manager Failure. Tool-level reading of the failure: GPU Diagnostics and Validation.
dcgmi diag run on a busy node. Active levels contend with resident work and return false failures.¹ Cordon/drain first (GPU Health Gating); never run validate against live jobs.
Fabric Manager active but the fabric is dead. FM_STAY_RESIDENT_ON_FAILURES=1 lets the daemon stay active while the system is uninitialized, so systemctl is-active passes but CUDA still fails with cudaErrorSystemNotReady. The service assert alone is insufficient; dcgmi diag (NCCL/NVLink plugins at -r 3) is what actually exercises the fabric. Diagnosis: Fabric Manager Failure.
IB assert passes with fewer ports than planned. validate_min_ib_active defaults to 1; a node that should have 8 IB ports up but has 1 passes a default gate. Set validate_min_ib_active to the node's planned port count to catch a dropped link.
Pending row-remap missed on pre-Ampere or with the check disabled. --query-remapped-rows is Ampere+ only; the step self-skips elsewhere. Pre-Ampere parts use page retirement instead of row remapping, and this role does not check it. A pending/failed remap requires a GPU reset and RMA-triage, not a return to pool. The Persistence Mode / Clock Bounce recovery flow does not apply here; use Reliability, RAS and Failure Modes for the fault taxonomy.
dmesg requires privilege. On kernels with kernel.dmesg_restrict=1, non-root reads of the ring buffer are blocked, so the capture needs become: true. The hub runs the play with become: true, so the capture runs as root via the play's privilege escalation.
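The format-agnostic row-remap verdict used by the role (skip on N/A, pass only when every field is No or 0, condemn otherwise) can be exercised without a GPU. This is a minimal sketch under those same assumptions: remap_verdict is a hypothetical helper, and the input lines are illustrative strings, not captured nvidia-smi output.

```shell
#!/bin/sh
# Hypothetical helper: reads the CSV printed by
#   nvidia-smi --query-remapped-rows=remapped_rows.pending,remapped_rows.failure --format=csv,noheader
# on stdin and prints skip | clean | condemn. Returns 1 only for condemn.
remap_verdict() {
    out=$(cat)
    case "$out" in
        *N/A*) echo "skip"; return 0 ;;   # unsupported field: same guard as the role
    esac
    if printf '%s\n' "$out" | awk -F',' '
        {
            for (i = 1; i <= NF; i++) {
                v = tolower($i)
                gsub(/[ \t\r]/, "", v)       # trim whitespace around each field
                seen++
                if (v != "no" && v != "0") bad++
            }
        }
        # Empty output counts as unhealthy, so the check fails closed.
        END { if (seen > 0 && bad == 0) exit 0; exit 1 }
    '; then
        echo "clean"
    else
        echo "condemn"
        return 1
    fi
}

# Illustrative inputs only.
printf 'No, No\nNo, No\n' | remap_verdict       # prints: clean
printf 'Yes, No\n' | remap_verdict || true      # prints: condemn
```

Treating anything other than No/0 as a failure trades occasional false condemnations on an unfamiliar driver format for never passing a board whose state could not be read.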

References

Ansible ansible.builtin.assert (that list, fail_msg/msg, success_msg, quiet): https://docs.ansible.com/projects/ansible/latest/collections/ansible/builtin/assert_module.html
Ansible ansible.builtin.fail (msg, conditional fail via when): https://docs.ansible.com/projects/ansible/latest/collections/ansible/builtin/fail_module.html
Ansible ansible.builtin.command (register, rc, stdout, changed_when, failed_when): https://docs.ansible.com/projects/ansible/latest/collections/ansible/builtin/command_module.html
Ansible error handling (failed_when, changed_when, block/rescue): https://docs.ansible.com/projects/ansible/latest/playbook_guide/playbooks_error_handling.html
NVIDIA DCGM Diagnostics (dcgmi diag -r <1-4|test>, -j JSON, non-zero exit on plugin failure, run levels): https://docs.nvidia.com/datacenter/dcgm/latest/user-guide/dcgm-diagnostics.html
NVIDIA DCGM Feature Overview (active diagnostics need exclusive GPU access; dcgmi health): https://docs.nvidia.com/datacenter/dcgm/latest/user-guide/feature-overview.html
nvidia-smi manual (--query-gpu, --query-remapped-rows, --format=csv,noheader): https://docs.nvidia.com/deploy/nvidia-smi/index.html
NVIDIA Driver Persistence / nvidia-persistenced daemon and systemd unit: https://docs.nvidia.com/deploy/driver-persistence/index.html
NVIDIA Fabric Manager user guide (nvidia-fabricmanager service, cudaErrorSystemNotReady): https://docs.nvidia.com/datacenter/tesla/fabric-manager-user-guide/index.html
InfiniBand port states (State: Active, Physical state: LinkUp via ibstat): https://www.advancedclustering.com/act_kb/infiniband-port-states/

¹ NVIDIA DCGM Diagnostics: dcgmi diag -r <1|2|3|4|test_name> selects run level (higher includes lower), -j/--json prints parseable JSON whose per-test/per-result status field carries Pass/Fail (keys category/tests/name/results/status/test_summary; there is no result field), and the program returns a non-zero exit code matching dcgmReturn_t on a plugin failure (e.g. DCGM_ST_NVVS_ERROR = 226). The active levels exercise real workloads and fail if other graphics processes run on the target GPU(s), so they require an idle/exclusive GPU. https://docs.nvidia.com/datacenter/dcgm/latest/user-guide/dcgm-diagnostics.html · exclusivity and health watches: https://docs.nvidia.com/datacenter/dcgm/latest/user-guide/feature-overview.html
² NVIDIA nvidia-smi: --query-remapped-rows=remapped_rows.pending,remapped_rows.failure,remapped_rows.correctable,remapped_rows.uncorrectable --format=csv reports HBM row-remap state on Ampere and newer; a pending remap requires a GPU reset to take effect, and a remap failure condemns the board. On pre-Ampere parts the field is unsupported: per the nvidia-smi RETURN VALUE section the exit code stays 0 (the documented codes are 0/2/3) and "Any unsupported data is indicated by a 'N/A' in the output", so the value reads [N/A] rather than failing; the task therefore skips on 'N/A' in remap.stdout, not on a non-zero rc. The docs describe pending/failure only as indicating "whether or not" a remap is pending/failed and do not fix the literal CSV string, so the assert treats anything other than No/0 as a condemned board. https://docs.nvidia.com/deploy/nvidia-smi/index.html