GPU cluster commissioning and acceptance testing

Scope: the bring-up sequence from racked hardware to production sign-off, and the acceptance tests that justify declaring the cluster ready. Builds directly on existing provisioning and validation work.

flowchart TB
  PHYSICAL["Physical install"] --> OOB["OOB and firmware"]
  OOB --> STACK["OS, drivers, fabric"]
  STACK --> NODE["Single-node tests"]
  NODE --> MULTI["Multi-node collectives"]
  MULTI --> SIGNOFF["Recorded sign-off"]

Overview

Commissioning is the ordered process of turning installed hardware into a validated, production-ready cluster, ending in a defensible "done." The discipline is sequence and evidence: each layer is proven before the next is trusted, and sign-off rests on recorded acceptance criteria, not impressions.

Core knowledge

Bring-up sequence (layered, bottom up)

  1. Physical and power-on: rack, cable, power. Confirm clean power-on, no alarms, fans/pumps nominal.
  2. Out-of-band / BMC: reach every node over BMC/IPMI/Redfish, confirm inventory, set baselines (see provisioning and scheduling).
  3. Firmware and BIOS: bring firmware/BIOS/BMC to the qualified baseline across the fleet; consistency matters as much as version.
  4. OS and drivers: provision OS, GPU drivers, CUDA stack, OFED/network drivers, NCCL.
  5. Fabric: subnet manager up, fabric validated and clean before any application test (see networking fabric).
  6. Single-node validation: per-node GPU health, memory, NVLink, local bandwidth.
  7. Multi-node validation: collectives across the fabric at realistic message sizes.
  8. Burn-in and acceptance: sustained load to surface infant-mortality failures, then the formal acceptance suite.
  9. Sign-off: recorded results against criteria, handover.
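The "each layer is proven before the next is trusted" rule can be sketched as an ordered gate. This is a minimal illustration, not a real tool: the `Stage` type, the stage names, and the lambda checks are hypothetical placeholders for whatever scripts actually validate each layer.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Stage:
    name: str
    check: Callable[[], bool]  # hypothetical hook: returns True once the layer is proven


def run_bring_up(stages: List[Stage]) -> List[Tuple[str, str]]:
    """Run stages bottom-up; once one fails, later stages are skipped, not run."""
    results = []
    blocked = False
    for stage in stages:
        if blocked:
            results.append((stage.name, "skipped"))
            continue
        ok = stage.check()
        results.append((stage.name, "pass" if ok else "fail"))
        blocked = not ok
    return results


# Placeholder checks; here the fabric stage is made to fail for illustration.
stages = [
    Stage("physical", lambda: True),
    Stage("oob_bmc", lambda: True),
    Stage("firmware", lambda: True),
    Stage("os_drivers", lambda: True),
    Stage("fabric", lambda: False),
    Stage("single_node", lambda: True),
    Stage("multi_node", lambda: True),
    Stage("burn_in_acceptance", lambda: True),
]

bring_up_results = run_bring_up(stages)
for name, status in bring_up_results:
    print(f"{name:20s} {status}")
```

Marking later stages as "skipped" rather than running them keeps a fabric fault from showing up as a misleading multi-node failure in the record.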

Acceptance test suite (what proves "done")

  • GPU health: DCGM diagnostics, memory tests, thermal behaviour under load.
  • Single-GPU compute: stress and thermal soak; confirm no throttling within the cooling envelope (see datacentre readiness).
  • Intra-node: NVLink bandwidth and topology checks (nvidia-smi nvlink, NVLink bandwidth tests).
  • Fabric: ibdiagnet clean, link widths/speeds correct.
  • Collectives: nccl-tests (all_reduce, all_gather, reduce_scatter) at a sweep of sizes; compare achieved bus bandwidth against expected for the topology.
  • Thermal soak: sustained full-load run watching for throttling, pump/CDU behaviour, and power transient handling.
  • Resilience: pull a link or a node and confirm graceful behaviour and recovery.
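For the collectives item, it helps to be explicit about what "bus bandwidth" means. nccl-tests derives it from the algorithm bandwidth (bytes moved divided by time) using a per-collective correction factor, so that the figure can be compared against link speed regardless of rank count. The sketch below applies those factors; the function names, the sample timing, and the 200 GB/s expected figure are illustrative assumptions, not measurements or vendor numbers.

```python
def busbw_factor(collective: str, n_ranks: int) -> float:
    """Correction factor from algorithm bandwidth to bus bandwidth."""
    if n_ranks < 2:
        raise ValueError("a collective needs at least 2 ranks")
    if collective == "all_reduce":
        return 2 * (n_ranks - 1) / n_ranks
    if collective in ("all_gather", "reduce_scatter"):
        return (n_ranks - 1) / n_ranks
    raise ValueError(f"unsupported collective: {collective}")


def bus_bandwidth_gbs(collective: str, size_bytes: float, time_s: float, n_ranks: int) -> float:
    """Bus bandwidth in GB/s for one measured operation."""
    alg_bw_gbs = size_bytes / time_s / 1e9
    return alg_bw_gbs * busbw_factor(collective, n_ranks)


# Illustrative numbers only: 1 GB all_reduce across 8 ranks in 10 ms.
achieved = bus_bandwidth_gbs("all_reduce", size_bytes=1e9, time_s=0.01, n_ranks=8)

# Hypothetical expected figure for the topology; take the real one from the design.
expected_gbs = 200.0
efficiency = achieved / expected_gbs
print(f"achieved {achieved:.1f} GB/s, {efficiency:.1%} of expected")
```

Comparing the ratio rather than the raw number makes the same check reusable across message sizes in the sweep.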

Defining acceptance criteria

  • Pre-agree the numbers: minimum achieved NCCL bus bandwidth, maximum GPU temperature under soak, zero errors reported by ibdiagnet, all nodes present and healthy. Sign-off is a recorded pass/fail against these thresholds.
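A minimal sketch of pass/fail evaluation against pre-agreed thresholds follows. The criterion names, threshold values, and measured values are all placeholders chosen for illustration; real thresholds come from the agreed acceptance plan.

```python
import json
from dataclasses import dataclass, asdict
from typing import Dict, List, Optional


@dataclass
class Criterion:
    name: str
    threshold: float
    kind: str  # "min": measured must be >= threshold; "max": measured must be <= threshold


def evaluate(criteria: List[Criterion], measured: Dict[str, float]) -> List[dict]:
    """Return one record per criterion; a missing measurement counts as a fail."""
    records = []
    for c in criteria:
        if c.kind not in ("min", "max"):
            raise ValueError(f"unknown criterion kind: {c.kind}")
        value: Optional[float] = measured.get(c.name)
        if value is None:
            passed = False
        elif c.kind == "min":
            passed = value >= c.threshold
        else:
            passed = value <= c.threshold
        records.append({**asdict(c), "measured": value, "passed": passed})
    return records


# Placeholder thresholds, not recommendations.
criteria = [
    Criterion("nccl_busbw_gbs", 300.0, "min"),
    Criterion("max_gpu_temp_c", 85.0, "max"),
    Criterion("ibdiagnet_errors", 0.0, "max"),
    Criterion("healthy_nodes", 64.0, "min"),
]

# Placeholder measurements: one node is missing, so sign-off should fail.
measured = {
    "nccl_busbw_gbs": 312.4,
    "max_gpu_temp_c": 81.0,
    "ibdiagnet_errors": 0.0,
    "healthy_nodes": 63.0,
}

records = evaluate(criteria, measured)
signed_off = all(r["passed"] for r in records)
print(json.dumps({"signed_off": signed_off, "records": records}, indent=2))
```

Emitting the full record, not only the verdict, is what makes the sign-off auditable later.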

Don't-miss checklist

  • Firmware/BIOS baseline is uniform across the whole fleet, not just spot-checked.
  • Fabric proven clean before any collective benchmark, so a network fault is not misread as a GPU fault.
  • NCCL results compared against expected topology bandwidth, not just "it ran".
  • Thermal soak long enough to catch throttling and CDU behaviour, not a short burst.
  • A failure-injection test in the plan, not just happy-path.
  • Every acceptance number recorded against a pre-agreed threshold.
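The first checklist item, fleet-wide firmware uniformity, is easy to automate once inventory has been collected over the BMC. The sketch below assumes the inventory has already been gathered into a plain dictionary; the node names, component names, and version strings are invented for illustration.

```python
from typing import Dict, Optional


def firmware_outliers(
    inventory: Dict[str, Dict[str, str]],
    baseline: Dict[str, str],
) -> Dict[str, Dict[str, Optional[str]]]:
    """Map each non-conforming node to the components that differ from baseline.

    A component missing from a node's inventory is reported as None.
    """
    outliers: Dict[str, Dict[str, Optional[str]]] = {}
    for node, versions in inventory.items():
        diffs = {
            component: versions.get(component)
            for component, expected in baseline.items()
            if versions.get(component) != expected
        }
        if diffs:
            outliers[node] = diffs
    return outliers


# Hypothetical baseline and inventory.
baseline = {"bios": "1.4.2", "bmc": "3.10", "nic": "28.39.1002"}
inventory = {
    "node001": {"bios": "1.4.2", "bmc": "3.10", "nic": "28.39.1002"},
    "node002": {"bios": "1.4.1", "bmc": "3.10", "nic": "28.39.1002"},
    "node003": {"bios": "1.4.2", "bmc": "3.10"},  # NIC firmware not reported
}

drift = firmware_outliers(inventory, baseline)
for node, diffs in sorted(drift.items()):
    print(node, diffs)
```

Checking every node against the baseline, rather than sampling, is the point: a single drifted node is enough to cause the intermittent faults that are hardest to localise.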

Failure modes

  • Skipping fabric validation, then chasing a "GPU" problem that is a degraded link.
  • Mixed firmware levels causing intermittent, hard-to-localise faults.
  • Passing point-to-point but failing at-scale collectives due to spine under-provisioning.
  • Short burn-in missing thermal throttling that only appears after sustained load.
  • Sign-off without recorded criteria, leaving "done" undefined.
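The short burn-in failure mode can be illustrated with a simple heuristic over a clock-speed time series: take the median clock during an early warm-up window as the reference, then report the first later sample that falls below a fraction of it. This is a sketch under stated assumptions; the function name, the 300-second warm-up, the 95% threshold, and the synthetic samples are all illustrative, and a real check would also read the throttle reasons reported by the GPU health tooling.

```python
from statistics import median
from typing import List, Optional, Tuple


def first_throttle_time(
    samples: List[Tuple[int, float]],
    warmup_s: int = 300,
    drop_fraction: float = 0.95,
) -> Optional[int]:
    """samples: (elapsed_seconds, sm_clock_mhz) pairs in time order.

    Returns the elapsed time of the first post-warm-up sample whose clock is
    below drop_fraction * median warm-up clock, or None if there is none.
    """
    warm = [clk for t, clk in samples if t <= warmup_s]
    if not warm:
        raise ValueError("no samples inside the warm-up window")
    reference = median(warm)
    for t, clk in samples:
        if t > warmup_s and clk < drop_fraction * reference:
            return t
    return None


# Synthetic two-hour soak sampled every 60 s: clocks sag after 90 minutes.
soak = [(t, 1900.0 if t < 5400 else 1650.0) for t in range(0, 7201, 60)]

# The same run truncated to a 10-minute burst.
burst = [s for s in soak if s[0] <= 600]

full_result = first_throttle_time(soak)
burst_result = first_throttle_time(burst)
print("full soak:", full_result)
print("10-minute burst:", burst_result)
```

With this synthetic data the truncated run reports no throttling at all, which is exactly how a short burn-in produces a false pass.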

Open questions & validation

  • Keep a one-page commissioning runbook with pre-agreed acceptance thresholds: NCCL bus-bandwidth floor, max soak temperature, zero ibdiagnet errors, all nodes healthy.
  • Align the acceptance suite with the deployed health stack (DCGM field metrics, observability).

References

  • DGX SuperPOD B300 reference architecture (validation context): https://docs.nvidia.com/dgx-superpod/reference-architecture/scalable-infrastructure-b300-xdr/latest/dgx-superpod-components.html
  • GB300 NVL72 architecture and NCCL all-reduce test context: https://verda.com/blog/gb300-nvl72-architecture

Related: Fabric · Physical · Provisioning · Performance · Glossary