GPU cluster commissioning and acceptance testing

Scope: the bring-up sequence from racked hardware to production sign-off, and the acceptance tests that justify declaring the cluster ready. This builds directly on existing provisioning and validation work.

flowchart TB
  PHYSICAL["Physical install"] --> OOB["OOB and firmware"]
  OOB --> STACK["OS, drivers, fabric"]
  STACK --> NODE["Single-node tests"]
  NODE --> MULTI["Multi-node collectives"]
  MULTI --> SIGNOFF["Recorded sign-off"]

Overview

Commissioning is the ordered process of turning installed hardware into a validated, production-ready cluster, ending in a defensible "done." The discipline is sequence and evidence: each layer is proven before the next is trusted, and sign-off rests on recorded acceptance criteria, not impressions.

Core knowledge

Bring-up sequence (layered, bottom up)

Physical and power-on: rack, cable, power. Confirm clean power-on, no alarms, fans/pumps nominal.
Out-of-band / BMC: reach every node over BMC/IPMI/Redfish, confirm inventory, set baselines (see provisioning and scheduling).
Firmware and BIOS: bring firmware/BIOS/BMC to the qualified baseline across the fleet; consistency matters as much as version.
OS and drivers: provision OS, GPU drivers, CUDA stack, OFED/network drivers, NCCL.
Fabric: subnet manager up, fabric validated and clean before any application test (see networking fabric).
Single-node validation: per-node GPU health, memory, NVLink, local bandwidth.
Multi-node validation: collectives across the fabric at realistic message sizes.
Burn-in and acceptance: sustained load to surface infant-mortality failures, then the formal acceptance suite.
Sign-off: recorded results against criteria, handover.
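
The gating implied by this sequence can be sketched as a runner that stops trusting later layers as soon as one layer fails. This is a minimal illustration, not a commissioning tool: the stage names mirror the list above, and the check functions are hypothetical placeholders that a site would replace with its own probes.

```python
from typing import Callable, List, Tuple

# (stage name, check function). The checks here are hypothetical stand-ins.
Stage = Tuple[str, Callable[[], bool]]


def run_bringup(stages: List[Stage]) -> List[Tuple[str, str]]:
    """Run stages in order; after the first failure, mark the rest as skipped."""
    results = []
    blocked = False
    for name, check in stages:
        if blocked:
            # A higher layer is not tested on top of an unproven lower layer.
            results.append((name, "skipped"))
            continue
        passed = check()
        results.append((name, "pass" if passed else "fail"))
        if not passed:
            blocked = True
    return results


# Illustrative run with a simulated fabric fault.
example_stages: List[Stage] = [
    ("physical", lambda: True),
    ("oob_bmc", lambda: True),
    ("firmware", lambda: True),
    ("os_drivers", lambda: True),
    ("fabric", lambda: False),
    ("single_node", lambda: True),
    ("multi_node", lambda: True),
]

if __name__ == "__main__":
    for name, status in run_bringup(example_stages):
        print(f"{name:12s} {status}")
```

With the simulated fabric fault, the single-node and multi-node stages are reported as skipped rather than run, which keeps a fabric problem from showing up later as a misleading benchmark result.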

Acceptance test suite (what proves "done")

GPU health: DCGM diagnostics, memory tests, thermal behaviour under load.
Single-GPU compute: stress and thermal soak; confirm no throttling within the cooling envelope (see datacentre readiness).
Intra-node: NVLink bandwidth and topology checks (nvidia-smi nvlink, NVLink bandwidth tests).
Fabric: ibdiagnet clean, link widths/speeds correct.
Collectives: nccl-tests (all_reduce, all_gather, reduce_scatter) at a sweep of sizes; compare achieved bus bandwidth against expected for the topology.
Thermal soak: sustained full-load run watching for throttling, pump/CDU behaviour, and power transient handling.
Resilience: pull a link or a node and confirm graceful behaviour and recovery.
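
For the collectives item, the comparison against expected bandwidth can be sketched as follows. The conversion from algorithm bandwidth to bus bandwidth uses the factors documented by nccl-tests: 2(n-1)/n for all_reduce and (n-1)/n for all_gather and reduce_scatter. The function names, the expected figure, and the acceptance fraction are assumptions for illustration only; the real expected value depends on the topology.

```python
# Bus-bandwidth correction factors as documented by nccl-tests, per collective.
BUS_FACTORS = {
    "all_reduce": lambda n: 2 * (n - 1) / n,
    "all_gather": lambda n: (n - 1) / n,
    "reduce_scatter": lambda n: (n - 1) / n,
}


def bus_bandwidth_gbs(collective: str, size_bytes: float, seconds: float, n_ranks: int) -> float:
    """Convert a measured size and time into bus bandwidth in GB/s."""
    algbw = size_bytes / seconds / 1e9  # algorithm bandwidth, GB/s
    return algbw * BUS_FACTORS[collective](n_ranks)


def meets_floor(busbw_gbs: float, expected_gbs: float, min_fraction: float) -> bool:
    """True if achieved bus bandwidth reaches the agreed fraction of expected."""
    return busbw_gbs >= expected_gbs * min_fraction


if __name__ == "__main__":
    # Hypothetical measurement: 1e9 bytes all-reduced across 8 ranks in 5 ms.
    busbw = bus_bandwidth_gbs("all_reduce", 1e9, 0.005, 8)
    # Hypothetical expectation and floor; substitute values agreed for the topology.
    ok = meets_floor(busbw, expected_gbs=400.0, min_fraction=0.85)
    print(f"busbw={busbw:.1f} GB/s pass={ok}")
```

Running the check across the whole size sweep, not only the largest message, makes both latency-bound and bandwidth-bound regressions visible.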

Defining acceptance criteria

Pre-agree the numbers: minimum achieved NCCL bus bandwidth, maximum GPU temperature under soak, zero fabric errors reported by ibdiagnet, all nodes present and healthy. Sign-off is a recorded pass/fail against these numbers.
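
One way to make the pass/fail decision mechanical is to hold the criteria as data and evaluate measurements against them, emitting a record that can be filed with the sign-off. This is a sketch under assumptions: the criterion names and every threshold value below are hypothetical placeholders, not recommended numbers.

```python
import json
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Criterion:
    name: str
    threshold: float
    kind: str  # "min": measured must be >= threshold; "max": measured must be <= threshold


def evaluate(criteria: List[Criterion], measured: Dict[str, float]) -> dict:
    """Compare each measurement with its threshold and return a recordable report."""
    records = []
    for c in criteria:
        if c.kind not in ("min", "max"):
            raise ValueError(f"unknown criterion kind: {c.kind}")
        value: Optional[float] = measured.get(c.name)
        if value is None:
            passed = False  # a missing measurement counts as a fail, not a skip
        elif c.kind == "min":
            passed = value >= c.threshold
        else:
            passed = value <= c.threshold
        records.append({"name": c.name, "kind": c.kind, "threshold": c.threshold,
                        "measured": value, "passed": passed})
    return {"signed_off": all(r["passed"] for r in records), "records": records}


# Hypothetical thresholds and measurements, for illustration only.
CRITERIA = [
    Criterion("nccl_allreduce_busbw_gbs", 300.0, "min"),
    Criterion("max_gpu_temp_c", 85.0, "max"),
    Criterion("ibdiagnet_errors", 0, "max"),
    Criterion("healthy_nodes", 64, "min"),
]
MEASURED = {"nccl_allreduce_busbw_gbs": 342.5, "max_gpu_temp_c": 81.0, "ibdiagnet_errors": 0}

if __name__ == "__main__":
    print(json.dumps(evaluate(CRITERIA, MEASURED), indent=2))
```

In this example the node count was never measured, so the report refuses sign-off even though every recorded number passes. Treating a missing value as a failure keeps "not tested" from being read as "fine".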

Don't-miss checklist

Firmware/BIOS baseline is uniform across the whole fleet, not just spot-checked.
Fabric proven clean before any collective benchmark, so a network fault is not misread as a GPU fault.
NCCL results compared against expected topology bandwidth, not just "it ran".
Thermal soak long enough to catch throttling and CDU behaviour, not a short burst.
A failure-injection test in the plan, not just happy-path.
Every acceptance number recorded against a pre-agreed threshold.
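
The first checklist item, uniformity across the whole fleet, can be sketched as a comparison of every expected node against the qualified baseline. The inventory structure, component names, and version strings are hypothetical; in practice the inventory would come from whatever out-of-band collection the site already runs.

```python
from typing import Dict, List


def firmware_drift(
    expected_nodes: List[str],
    baseline: Dict[str, str],
    inventory: Dict[str, Dict[str, str]],
) -> Dict[str, List[str]]:
    """Return {node: [problems]} for every node that deviates from the baseline.

    Iterating over expected_nodes, not over inventory keys, means a node that
    reported nothing is flagged instead of silently passing.
    """
    drift: Dict[str, List[str]] = {}
    for node in expected_nodes:
        reported = inventory.get(node)
        if reported is None:
            drift[node] = ["no inventory reported"]
            continue
        problems = [
            f"{component}: {reported.get(component, 'missing')} != {wanted}"
            for component, wanted in baseline.items()
            if reported.get(component) != wanted
        ]
        if problems:
            drift[node] = problems
    return drift


# Hypothetical baseline and inventory, for illustration only.
BASELINE = {"bios": "1.2.3", "bmc": "4.5.6"}
INVENTORY = {
    "node01": {"bios": "1.2.3", "bmc": "4.5.6"},
    "node02": {"bios": "1.2.2", "bmc": "4.5.6"},  # older BIOS
}

if __name__ == "__main__":
    nodes = ["node01", "node02", "node03"]  # node03 never reported
    for node, problems in firmware_drift(nodes, BASELINE, INVENTORY).items():
        print(node, problems)
```

An empty result is the pass condition: every expected node reported, and every component matched the baseline.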

Failure modes

Skipping fabric validation, then chasing a "GPU" problem that is a degraded link.
Mixed firmware levels causing intermittent, hard-to-localise faults.
Passing point-to-point but failing at-scale collectives due to spine under-provisioning.
Short burn-in missing thermal throttling that only appears after sustained load.
Sign-off without recorded criteria, leaving "done" undefined.
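
The short burn-in failure mode can be made concrete with a small sketch that finds when throttling first appears in a clock trace and asks whether a soak of a given length would have seen it. The sample data, the baseline clock, and the 95% floor are hypothetical; real traces would come from the site's GPU telemetry.

```python
from typing import List, Optional, Tuple

Sample = Tuple[float, float]  # (elapsed seconds, SM clock in MHz)


def first_throttle_time(
    samples: List[Sample], baseline_mhz: float, min_fraction: float = 0.95
) -> Optional[float]:
    """Return the elapsed time of the first sample below the clock floor, or None."""
    floor = baseline_mhz * min_fraction
    for elapsed, clock in samples:
        if clock < floor:
            return elapsed
    return None


def soak_would_catch(
    samples: List[Sample], baseline_mhz: float, soak_seconds: float, min_fraction: float = 0.95
) -> bool:
    """True if throttling starts within the soak window."""
    onset = first_throttle_time(samples, baseline_mhz, min_fraction)
    return onset is not None and onset <= soak_seconds


# Hypothetical trace: clocks hold for 25 minutes, then drop at 30 minutes.
TRACE: List[Sample] = [(0, 1900), (300, 1900), (600, 1900), (1500, 1900), (1800, 1700), (3600, 1700)]

if __name__ == "__main__":
    print("10-minute soak catches it:", soak_would_catch(TRACE, 1900, 600))
    print("60-minute soak catches it:", soak_would_catch(TRACE, 1900, 3600))
```

In this trace a 10-minute burn-in ends before the clocks drop, so the node would pass a short test and throttle in production.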

Open questions & validation

Keep a one-page commissioning runbook with pre-agreed acceptance thresholds: NCCL bus-bandwidth floor, maximum soak temperature, zero errors reported by ibdiagnet, all nodes healthy.
Align the acceptance suite with the deployed health stack (DCGM field metrics, observability).

References

DGX SuperPOD B300 reference architecture (validation context): https://docs.nvidia.com/dgx-superpod/reference-architecture/scalable-infrastructure-b300-xdr/latest/dgx-superpod-components.html
GB300 NVL72 architecture and NCCL all-reduce test context: https://verda.com/blog/gb300-nvl72-architecture

Related: Fabric · Physical · Provisioning · Performance · Glossary