Skip to content
Markdown

GPU cluster BOM validation checklist

Scope: reviewing a GPU-cluster bill of materials and catching errors before procurement, one of the most hardware-specific skills and the cheapest place to catch a mistake.

flowchart LR
  BOM["Vendor BOM"] --> COUNT["Node, NIC, switch, cable counts"]
  BOM --> MEDIA["Optic and cable compatibility"]
  BOM --> POWER["Power and cooling fit"]
  COUNT --> SIGNOFF["Procurement sign-off"]
  MEDIA --> SIGNOFF
  POWER --> SIGNOFF

Overview

A BOM error caught on paper costs an email. The same error caught on the loading dock costs weeks and a re-order. The skill is knowing what a complete, correct cluster BOM contains and where the recurring mistakes hide, especially in the interconnect, power, and mechanical lines that software architects rarely touch.

Core knowledge

What a GPU-cluster BOM contains

  • Compute: GPU nodes (DGX/HGX B300 8-GPU units, or GB300 NVL72 racks), GPU SKU and count, CPU, system memory, local NVMe.
  • Network adapters: SuperNICs (ConnectX-8 for 800G XDR), count per node, plus any BlueField DPUs.
  • Switches: compute fabric (Quantum-X800 Q3400/Q3200), storage fabric (often NDR Quantum-2, e.g. MQM9700), in-band and OOB management switches (e.g. SN2201).
  • Cables and optics: this is where most BOM errors live. OSFP transceivers and cables, matched to generation, form factor, and reach.
  • Power: PDUs, rack power shelves, power supplies, busways, whips, the correct phase and connector.
  • Cooling: CDUs, manifolds, cold plates, hoses, rear-door heat exchangers, quick-disconnects (for liquid-cooled racks).
  • Mechanical: racks, rail kits, blanking panels, cable management, seismic/anchor hardware.
  • Storage: high-throughput storage nodes sized to feed the fabric (SuperPOD targets >80 GB/s per-node I/O).

The interconnect traps (highest-yield review area)

  • Form factor: switch OSFP cages are IHS (closed/finned, twin-port, higher thermal); NIC OSFP cages are RHS (flat-top, single-port). They are not interchangeable. Order IHS for Quantum-X800 switches, RHS for ConnectX-8 NICs.
  • Generation: NDR uses 100G-PAM4 per lane, XDR uses 200G-PAM4 per lane. Modules are not interchangeable despite both being OSFP.
  • Media by distance: copper (DAC) for short in-rack, multimode fibre for short reach, single-mode for longer runs. XDR at scale leans on 800G single-mode. Get this wrong and the link either will not reach or will not fit.
  • Breakout: twin-port switch modules with splitter cables to connect 800G switches down to 400G CX7 NICs. Count splitter cables correctly.

Power line items to sanity-check

  • Per-GPU TDP for B300 is 1,400 W; a GB300 NVL72 rack draws roughly 120 to 140 kW. PDU and feed sizing must match, with headroom for transient spikes (see datacentre readiness).
  • Confirm rack power shelf count, redundancy scheme (N+1, 2N), and that the facility feed and connector type match the procured PDUs.

Don't-miss checklist

  • Cross-check NIC count against switch port count against cable count. The three must reconcile.
  • Verify every optic against switch-side vs NIC-side form factor and against generation.
  • Verify cable reach against actual rack-to-rack and rack-to-spine distances from the floor plan (datacentre readiness).
  • Confirm rail kits and anchor hardware exist for the actual rack and the rack weight.
  • Confirm CDU/manifold/cold-plate lines are present and sized for liquid-cooled racks.
  • Confirm power connector and phase match between PDU, whip, and facility feed.
  • Check for the quiet omissions: blanking panels, management switches, OOB cabling, spares.

Failure modes

  • IHS/RHS or NDR/XDR optic mismatch (the classic).
  • NIC count not matching switch radix, leaving nodes unconnectable.
  • Cable reach too short for the real topology distances.
  • Liquid-cooling consumables missing because the reviewer treated the rack as air-cooled.
  • PDU connector or phase mismatch with the facility.
  • No spares line for optics and cables, which fail on a predictable curve.

Open questions & validation

  • Build and memorise a reference BOM for one B300 node and one GB300 rack, so a seeded error stands out fast.
  • Cross-check a real vendor BOM against the reconciliation rules here (NIC count = switch ports = cable count; optic form factor and reach per networking fabric).

References

  • Transceiver compatibility (IHS/RHS, NDR/XDR, media by distance): https://www.vitextech.com/blogs/blog/800g-transceiver-compatibility-for-nvidia-platforms-spectrum-4-quantum-x800-connectx-8-and-bluefield-3
  • DGX SuperPOD B300 components and fabrics: https://docs.nvidia.com/dgx-superpod/reference-architecture/scalable-infrastructure-b300-xdr/latest/dgx-superpod-components.html
  • GB300 power/cooling/cabling field notes: https://introl.com/blog/why-nvidia-gb300-nvl72-blackwell-ultra-matters

Related: Fabric · Physical · Platform · Glossary