GPU cluster commissioning and acceptance testing
Scope: the bring-up sequence from racked hardware to production sign-off, and the acceptance tests that justify signing off on readiness. This builds directly on existing provisioning and validation work.
flowchart TB
PHYSICAL["Physical install"] --> OOB["OOB and firmware"]
OOB --> STACK["OS, drivers, fabric"]
STACK --> NODE["Single-node tests"]
NODE --> MULTI["Multi-node collectives"]
MULTI --> SIGNOFF["Recorded sign-off"]
Overview
Commissioning is the ordered process of turning installed hardware into a validated, production-ready cluster, ending in a defensible "done." The discipline is sequence and evidence: each layer is proven before the next is trusted, and sign-off rests on recorded acceptance criteria, not impressions.
Core knowledge
Bring-up sequence (layered, bottom up)
- Physical and power-on: rack, cable, power. Confirm clean power-on, no alarms, fans/pumps nominal.
- Out-of-band / BMC: reach every node over BMC/IPMI/Redfish, confirm inventory, set baselines (see provisioning and scheduling).
- Firmware and BIOS: bring firmware/BIOS/BMC to the qualified baseline across the fleet; consistency matters as much as version.
- OS and drivers: provision OS, GPU drivers, CUDA stack, OFED/network drivers, NCCL.
- Fabric: subnet manager up, fabric validated and clean before any application test (see networking fabric).
- Single-node validation: per-node GPU health, memory, NVLink, local bandwidth.
- Multi-node validation: collectives across the fabric at realistic message sizes.
- Burn-in and acceptance: sustained load to surface infant-mortality failures, then the formal acceptance suite.
- Sign-off: recorded results against criteria, handover.
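The ordering above can be sketched as a gated pipeline in which a layer is only attempted once every layer beneath it has passed. This is a minimal illustration, not a real commissioning tool: the `Stage` type, the stage names, and the `simulated` helper are all hypothetical, and each check is a stand-in for whatever real validation a site runs at that layer.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class Stage:
    name: str
    check: Callable[[], bool]  # returns True when this layer is proven


def run_bring_up(stages: List[Stage]) -> Tuple[List[str], Optional[str]]:
    """Run stages in order and stop at the first layer that fails.

    Returns (names of stages that passed, name of the failing stage or None).
    """
    passed: List[str] = []
    for stage in stages:
        if not stage.check():
            return passed, stage.name
        passed.append(stage.name)
    return passed, None


# Hypothetical demo: record which checks actually ran.
ran: List[str] = []


def simulated(name: str, ok: bool) -> Stage:
    def check() -> bool:
        ran.append(name)
        return ok
    return Stage(name, check)


stages = [
    simulated("physical", True),
    simulated("oob_bmc", True),
    simulated("firmware", True),
    simulated("os_drivers", True),
    simulated("fabric", False),       # simulated fabric fault
    simulated("single_node", True),
    simulated("multi_node", True),
]

passed, failed_at = run_bring_up(stages)
print("passed:", passed)
print("stopped at:", failed_at)
```

Because the run halts at the failing fabric stage, the single-node and multi-node checks never execute, so no result from an upper layer can be attributed to an unproven lower one.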
Acceptance test suite (what proves "done")
- GPU health: DCGM diagnostics, memory tests, thermal behaviour under load.
- Single-GPU compute: stress and thermal soak; confirm no throttling within the cooling envelope (datacentre readiness).
- Intra-node: NVLink bandwidth and topology checks (`nvidia-smi nvlink`, NVLink bandwidth tests).
- Fabric: `ibdiagnet` clean, link widths/speeds correct.
- Collectives: `nccl-tests` (all_reduce, all_gather, reduce_scatter) at a sweep of sizes; compare achieved bus bandwidth against expected for the topology.
- Thermal soak: sustained full-load run watching for throttling, pump/CDU behaviour, and power transient handling.
- Resilience: pull a link or a node and confirm graceful behaviour and recovery.
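For the collectives item, it helps to be explicit about what "bus bandwidth" means. The sketch below follows the convention described in the nccl-tests performance notes, where algorithm bandwidth is data size divided by time and bus bandwidth applies a per-collective factor: 2(n-1)/n for all_reduce, and (n-1)/n for all_gather and reduce_scatter. The function names, the sample measurement, the expected figure, and the acceptance fraction are illustrative assumptions, not values from any reference architecture.

```python
# Per-collective correction factors, as a function of the number of ranks n.
BUS_FACTORS = {
    "all_reduce": lambda n: 2 * (n - 1) / n,
    "all_gather": lambda n: (n - 1) / n,
    "reduce_scatter": lambda n: (n - 1) / n,
}


def bus_bandwidth_gb_s(collective: str, size_bytes: float,
                       time_s: float, n_ranks: int) -> float:
    """Convert a measured (size, time) pair into bus bandwidth in GB/s."""
    if n_ranks < 2:
        raise ValueError("a collective needs at least two ranks")
    alg_bw = size_bytes / time_s / 1e9          # algorithm bandwidth, GB/s
    return alg_bw * BUS_FACTORS[collective](n_ranks)


def meets_floor(bus_bw: float, expected_bw: float, min_fraction: float) -> bool:
    """Pass if achieved bus bandwidth reaches the agreed fraction of expected."""
    return bus_bw >= expected_bw * min_fraction


# Hypothetical measurement: 8 GB all_reduce across 8 ranks in 0.1 s.
busbw = bus_bandwidth_gb_s("all_reduce", 8e9, 0.1, 8)

# Hypothetical expectation for the topology and hypothetical pass fraction.
ok = meets_floor(busbw, expected_bw=180.0, min_fraction=0.75)
print(f"busbw = {busbw:.1f} GB/s, pass = {ok}")
```

Comparing against an expected figure for the topology, rather than simply checking that the run completed, is what turns the benchmark into an acceptance test.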
Defining acceptance criteria
- Pre-agree the numbers: minimum achieved NCCL bus bandwidth, maximum GPU temperature under soak, zero fabric errors after `ibdiagnet`, all nodes present and healthy. Sign-off is pass/fail against these, recorded.
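One way to make "pass/fail against these, recorded" concrete is to keep the thresholds and the measurements as separate records and derive the verdict mechanically. The following is a sketch under stated assumptions: the `Criteria` and `Measured` types, their field names, and every number are hypothetical placeholders for whatever a site actually agrees in advance.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class Criteria:
    min_nccl_busbw_gb_s: float   # agreed floor for achieved bus bandwidth
    max_gpu_temp_c: float        # agreed ceiling under thermal soak
    max_fabric_errors: int       # zero means a clean fabric report
    expected_nodes: int          # every node must be present and healthy


@dataclass(frozen=True)
class Measured:
    nccl_busbw_gb_s: float
    peak_gpu_temp_c: float
    fabric_errors: int
    healthy_nodes: int


def evaluate(c: Criteria, m: Measured) -> Dict[str, bool]:
    """Return a per-criterion pass/fail record."""
    return {
        "nccl_busbw": m.nccl_busbw_gb_s >= c.min_nccl_busbw_gb_s,
        "gpu_temp": m.peak_gpu_temp_c <= c.max_gpu_temp_c,
        "fabric_errors": m.fabric_errors <= c.max_fabric_errors,
        "nodes_healthy": m.healthy_nodes == c.expected_nodes,
    }


def signed_off(results: Dict[str, bool]) -> bool:
    """Sign-off requires every criterion to pass."""
    return all(results.values())


# Hypothetical thresholds and measurements, for illustration only.
criteria = Criteria(min_nccl_busbw_gb_s=135.0, max_gpu_temp_c=85.0,
                    max_fabric_errors=0, expected_nodes=32)
measured = Measured(nccl_busbw_gb_s=140.0, peak_gpu_temp_c=81.0,
                    fabric_errors=0, healthy_nodes=31)

results = evaluate(criteria, measured)
print(results)
print("signed off:", signed_off(results))
```

In this example three criteria pass but one node is missing, so the overall verdict is a fail; the per-criterion record shows exactly which number blocked sign-off.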
Don't-miss checklist
- Firmware/BIOS baseline is uniform across the whole fleet, not just spot-checked.
- Fabric proven clean before any collective benchmark, so a network fault is not misread as a GPU fault.
- NCCL results compared against expected topology bandwidth, not just "it ran".
- Thermal soak long enough to catch throttling and CDU behaviour, not a short burst.
- A failure-injection test in the plan, not just happy-path.
- Every acceptance number recorded against a pre-agreed threshold.
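The first checklist item, uniformity across the whole fleet rather than a spot check, lends itself to a simple exhaustive comparison. The sketch below assumes the inventory has already been collected (for example over the BMC) into a mapping of node name to component versions; the `firmware_outliers` function, the node names, and the version strings are all made-up placeholders.

```python
from typing import Dict, List


def firmware_outliers(inventory: Dict[str, Dict[str, str]],
                      baseline: Dict[str, str]) -> Dict[str, List[str]]:
    """Return {node: [components off baseline]} after checking every node.

    A component that is missing from a node's inventory counts as off baseline.
    """
    outliers: Dict[str, List[str]] = {}
    for node, versions in sorted(inventory.items()):
        off = [comp for comp, want in baseline.items()
               if versions.get(comp) != want]
        if off:
            outliers[node] = off
    return outliers


# Hypothetical qualified baseline and fleet inventory.
baseline = {"bios": "1.2.3", "bmc": "4.5.6"}
inventory = {
    "node01": {"bios": "1.2.3", "bmc": "4.5.6"},
    "node02": {"bios": "1.2.3", "bmc": "4.5.6"},
    "node03": {"bios": "1.2.3", "bmc": "4.5.0"},  # one BMC left behind
}

outliers = firmware_outliers(inventory, baseline)
print(outliers or "fleet is uniform")
```

Checking every node against the baseline, rather than comparing nodes to each other, also catches the case where the whole fleet is consistent but on the wrong version.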
Failure modes
- Skipping fabric validation, then chasing a "GPU" problem that is a degraded link.
- Mixed firmware levels causing intermittent, hard-to-localise faults.
- Passing point-to-point but failing at-scale collectives due to spine under-provisioning.
- Short burn-in missing thermal throttling that only appears after sustained load.
- Sign-off without recorded criteria, leaving "done" undefined.
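The short burn-in failure mode can be illustrated with a toy check on soak telemetry. This sketch uses a drop in mean GPU clock between the start and end of the run as a rough proxy for throttling, which is an assumption; a real suite would more likely read throttle indicators from the health stack. The function name, window size, threshold, and sample clocks are all hypothetical.

```python
from statistics import mean
from typing import Sequence


def late_clock_drop(clocks_mhz: Sequence[float], window: int,
                    max_drop_fraction: float) -> bool:
    """True if the mean clock over the final window has fallen below the
    mean over the first window by more than max_drop_fraction."""
    if len(clocks_mhz) < 2 * window:
        raise ValueError("soak too short for the requested windows")
    early = mean(clocks_mhz[:window])
    late = mean(clocks_mhz[-window:])
    return (early - late) / early > max_drop_fraction


# Hypothetical per-interval clock samples from a sustained run:
# stable at first, throttling only near the end.
soak = [1900] * 6 + [1890] * 2 + [1700] * 4

full_run = late_clock_drop(soak, window=4, max_drop_fraction=0.05)
short_run = late_clock_drop(soak[:8], window=4, max_drop_fraction=0.05)
print("full soak flags throttling:", full_run)
print("short burn-in flags throttling:", short_run)
```

The same hardware looks healthy when only the first eight samples are examined and unhealthy over the full run, which is why soak duration belongs among the pre-agreed criteria.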
Open questions & validation
- Keep a one-page commissioning runbook with pre-agreed acceptance thresholds: NCCL bus-bandwidth floor, max soak temperature, zero post-`ibdiagnet` errors, all nodes healthy.
- Align the acceptance suite with the deployed health stack (DCGM field metrics, observability).
References
- DGX SuperPOD B300 reference architecture (validation context): https://docs.nvidia.com/dgx-superpod/reference-architecture/scalable-infrastructure-b300-xdr/latest/dgx-superpod-components.html
- GB300 NVL72 architecture and NCCL all-reduce test context: https://verda.com/blog/gb300-nvl72-architecture
Related: Fabric · Physical · Provisioning · Performance · Glossary