GPU cluster BOM validation checklist¶
Scope: reviewing a GPU-cluster bill of materials and catching errors before procurement, one of the most hardware-specific skills and the cheapest place to catch a mistake.
flowchart LR
BOM["Vendor BOM"] --> COUNT["Node, NIC, switch, cable counts"]
BOM --> MEDIA["Optic and cable compatibility"]
BOM --> POWER["Power and cooling fit"]
COUNT --> SIGNOFF["Procurement sign-off"]
MEDIA --> SIGNOFF
POWER --> SIGNOFF
Overview¶
A BOM error caught on paper costs an email. The same error caught on the loading dock costs weeks and a re-order. The skill is knowing what a complete, correct cluster BOM contains and where the recurring mistakes hide, especially in the interconnect, power, and mechanical lines that software architects rarely touch.
Core knowledge¶
What a GPU-cluster BOM contains¶
- Compute: GPU nodes (DGX/HGX B300 8-GPU units, or GB300 NVL72 racks), GPU SKU and count, CPU, system memory, local NVMe.
- Network adapters: SuperNICs (ConnectX-8 for 800G XDR), count per node, plus any BlueField DPUs.
- Switches: compute fabric (Quantum-X800 Q3400/Q3200), storage fabric (often NDR Quantum-2, e.g. MQM9700), in-band and OOB management switches (e.g. SN2201).
- Cables and optics: this is where most BOM errors live. OSFP transceivers and cables, matched to generation, form factor, and reach.
- Power: PDUs, rack power shelves, power supplies, busways, whips, the correct phase and connector.
- Cooling: CDUs, manifolds, cold plates, hoses, rear-door heat exchangers, quick-disconnects (for liquid-cooled racks).
- Mechanical: racks, rail kits, blanking panels, cable management, seismic/anchor hardware.
- Storage: high-throughput storage nodes sized to feed the fabric (SuperPOD targets >80 GB/s per-node I/O).
The interconnect traps (highest-yield review area)¶
- Form factor: switch OSFP cages are IHS (closed/finned, twin-port, higher thermal); NIC OSFP cages are RHS (flat-top, single-port). They are not interchangeable. Order IHS for Quantum-X800 switches, RHS for ConnectX-8 NICs.
- Generation: NDR uses 100G-PAM4 per lane, XDR uses 200G-PAM4 per lane. Modules are not interchangeable despite both being OSFP.
- Media by distance: copper (DAC) for short in-rack, multimode fibre for short reach, single-mode for longer runs. XDR at scale leans on 800G single-mode. Get this wrong and the link either will not reach or will not fit.
- Breakout: twin-port switch modules with splitter cables to connect 800G switches down to 400G CX7 NICs. Count splitter cables correctly.
Power line items to sanity-check¶
- Per-GPU TDP for B300 is 1,400 W; a GB300 NVL72 rack draws roughly 120 to 140 kW. PDU and feed sizing must match, with headroom for transient spikes (see datacentre readiness).
- Confirm rack power shelf count, redundancy scheme (N+1, 2N), and that the facility feed and connector type match the procured PDUs.
Don't-miss checklist¶
- Cross-check NIC count against switch port count against cable count. The three must reconcile.
- Verify every optic against switch-side vs NIC-side form factor and against generation.
- Verify cable reach against actual rack-to-rack and rack-to-spine distances from the floor plan (datacentre readiness).
- Confirm rail kits and anchor hardware exist for the actual rack and the rack weight.
- Confirm CDU/manifold/cold-plate lines are present and sized for liquid-cooled racks.
- Confirm power connector and phase match between PDU, whip, and facility feed.
- Check for the quiet omissions: blanking panels, management switches, OOB cabling, spares.
Failure modes¶
- IHS/RHS or NDR/XDR optic mismatch (the classic).
- NIC count not matching switch radix, leaving nodes unconnectable.
- Cable reach too short for the real topology distances.
- Liquid-cooling consumables missing because the reviewer treated the rack as air-cooled.
- PDU connector or phase mismatch with the facility.
- No spares line for optics and cables, which fail on a predictable curve.
Open questions & validation¶
- Build and memorise a reference BOM for one B300 node and one GB300 rack, so a seeded error stands out fast.
- Cross-check a real vendor BOM against the reconciliation rules here (NIC count = switch ports = cable count; optic form factor and reach per networking fabric).
References¶
- Transceiver compatibility (IHS/RHS, NDR/XDR, media by distance): https://www.vitextech.com/blogs/blog/800g-transceiver-compatibility-for-nvidia-platforms-spectrum-4-quantum-x800-connectx-8-and-bluefield-3
- DGX SuperPOD B300 components and fabrics: https://docs.nvidia.com/dgx-superpod/reference-architecture/scalable-infrastructure-b300-xdr/latest/dgx-superpod-components.html
- GB300 power/cooling/cabling field notes: https://introl.com/blog/why-nvidia-gb300-nvl72-blackwell-ultra-matters