HPC networking fabric¶
Scope: the GPU-to-GPU interconnect and the validation of it. InfiniBand and high-speed Ethernet, topologies, fabric management, and how a fabric is proven healthy at bring-up.
Overview¶
In a training cluster the fabric is the bottleneck that decides whether GPUs sit idle waiting on collectives. The deployment skill is twofold: design a topology that delivers non-blocking bandwidth where it is needed, and validate the built fabric against line rate before sign-off.
Architecture: two-tier fat-tree (rail-aligned)
flowchart TB
S1["Spine"] --- L1["Leaf"]
S1 --- L2["Leaf"]
S2["Spine"] --- L1
S2 --- L2
L1 --- N1["GPU node (8x ConnectX-8)"]
L1 --- N2["GPU node"]
L2 --- N3["GPU node"]
L2 --- N4["GPU node"]
GPU cluster networking deep dives¶
This page is the pillar for the GPU cluster network fabric. Route to the focused pages for tuning, validation, and incident response:
- Intra-node interconnect. NVSwitch and NVLink.
- RDMA transport tuning. RDMA and RoCE tuning, BlueField DPU networking.
- Collective communication. NCCL collectives and algorithms, SHARP in-network reduction, NVSHMEM GPU-initiated communication, compute and communication overlap.
- Bring-up and validation. fabric bring-up, validation and benchmarking, fabric validation with nccl-tests.
- Fabric runbooks. NCCL hang and collective stall, NVLink visibility and P2P failure, PCIe and P2P bandwidth regression, Fabric Manager failure.
InfiniBand vs RoCE, and validating the fabric¶
- InfiniBand vs RoCE for GPU training and inference? InfiniBand (NVIDIA Quantum) is the default for large training fabrics, lossless and SHARP-capable by design; RoCEv2 on Spectrum-X Ethernet is the lower-cost alternative but needs careful PFC/ECN tuning to stay lossless (RDMA and RoCE tuning). The Core knowledge section below compares them by GPU generation.
- How do you validate GPU fabric bandwidth? Prove the built fabric against line rate with nccl-tests bus bandwidth before sign-off, not just link-up (fabric validation with nccl-tests, fabric bring-up and benchmarking).
Core knowledge¶
InfiniBand generations¶
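Per-port, per-direction line rate doubles each generation: EDR 100, HDR 200, NDR 400, XDR 800 Gb/s. Adapters and switches of different generations negotiate down to a common speed only where the hardware supports backward compatibility; the Q3400-RA, for example, has none for NDR, so check compatibility per switch model rather than assuming it.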
Interconnect differences by GPU generation/tier¶
The scale-up interconnect is not uniform across the range, and this drives the fabric design.
- NVLink generation (datacenter SXM only). Across GPU generations: Ampere uses 3rd-gen NVLink at 600 GB/s per GPU; Hopper uses 4th-gen NVLink at 900 GB/s ("fourth-generation NVLink, which offers 900 gigabytes per second"); Blackwell uses 5th-gen NVLink at 1.8 TB/s ("fifth-generation NVIDIA NVLink ... 1.8 TB/s of GPU-to-GPU interconnect"). Rack-scale NVLink domains built on the NVLink Switch System (NVSwitch), the all-to-all scale-up interconnect (e.g. the 72-GPU domain in GB200 NVL72), exist only on SXM datacenter parts, where Fabric Manager runs. PCIe datacenter cards (A100/H100 PCIe) expose only an optional 2-way NVLink Bridge and no NVSwitch domain; consumer and most workstation cards have no NVLink at all (next point).
- Consumer and most workstation cards have NO NVLink. GeForce (RTX 4090/5090) and the RTX workstation Blackwell and Ada cards expose only PCIe (Gen5 on Blackwell, Gen4 on Ada). Multi-GPU is therefore PCIe peer-to-peer only: there is no NVSwitch domain and no Fabric Manager, and NCCL falls back to PCIe P2P or host staging rather than NVLink/NVLS (tune via PCIe/ACS, see performance tuning).
- GPUDirect RDMA is datacenter/workstation-pro only. NVIDIA states GPUDirect RDMA "is available on both Tesla and Quadro GPUs", i.e. the datacenter (A100/H100/H200/B-series) and professional/RTX-Enterprise lines. It is not supported on GeForce, so consumer cards cannot drive NIC-to-GPU RDMA over the fabric.
- InfiniBand generation pairs with the platform. Ampere-era clusters pair with HDR (200 Gb/s); Hopper with NDR (400 Gb/s, Quantum-2); Blackwell with XDR (800 Gb/s, Quantum-X800). The per-generation NIC/switch line rates and the IB switch families are detailed below.
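The gap between scale-up and scale-out bandwidth is what drives the fabric design, and it can be made concrete with a little arithmetic. The sketch below uses only the figures quoted above. The dictionary layout and function name are illustrative. It assumes one NIC per GPU and treats the headline NVLink numbers as aggregate over both directions (which is how NVIDIA usually quotes them), halving them to compare against the per-direction InfiniBand rate.

```python
# Figures come from the bullets above; names and layout are illustrative
# assumptions, not an NVIDIA API.
PLATFORMS = {
    "Ampere":    {"nvlink_gbytes_per_s": 600,  "ib_gen": "HDR", "ib_gbits_per_s": 200},
    "Hopper":    {"nvlink_gbytes_per_s": 900,  "ib_gen": "NDR", "ib_gbits_per_s": 400},
    "Blackwell": {"nvlink_gbytes_per_s": 1800, "ib_gen": "XDR", "ib_gbits_per_s": 800},
}


def scale_up_to_scale_out_ratio(platform: str, nvlink_is_bidirectional: bool = True) -> float:
    """Per-direction NVLink bandwidth divided by per-direction NIC bandwidth.

    Assumes one NIC per GPU. If the NVLink figure is an aggregate of both
    directions (the assumed default), it is halved before comparing.
    """
    p = PLATFORMS[platform]
    nvlink = p["nvlink_gbytes_per_s"]
    if nvlink_is_bidirectional:
        nvlink = nvlink / 2
    nic_gbytes_per_s = p["ib_gbits_per_s"] / 8  # Gb/s to GB/s
    return nvlink / nic_gbytes_per_s


if __name__ == "__main__":
    for name in PLATFORMS:
        ratio = scale_up_to_scale_out_ratio(name)
        print(f"{name}: NVLink is about {ratio:.0f}x the per-GPU NIC bandwidth")
```

Under these assumptions the scale-up domain has roughly an order of magnitude more bandwidth per GPU than the fabric, which is why collectives that leave the NVLink domain are the ones worth validating.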
NVIDIA Quantum InfiniBand (current)¶
- Quantum-2 is the NDR (400 Gb/s) generation, still used for Hopper-era and storage fabrics.
- Quantum-X800 is the XDR (800 Gb/s) generation, built on the Quantum-3 ASIC and 200 Gb/s-per-lane SerDes, for Blackwell-scale clusters.
- Q3400-RA (4U): 144 ports of 800 Gb/s across 72 twin-port OSFP cages. High radix supports a two-level fat-tree connecting up to ~10,368 NICs. No NDR backward compatibility.
- Q3200-RA (2U): two switches in one enclosure, 36x 800G each (72 total), supports NDR backward compatibility, so it is the bridge to existing NDR infrastructure.
- Q3450-LD: silicon-photonics / co-packaged optics variant, MPO direct connect, lower insertion loss.
- In-network features: SHARP v4 (in-network reduction for collectives), Adaptive Routing (AR), telemetry-based congestion control, SHIELD self-healing. Sub-100 ns port-to-port latency.
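The ~10,368 figure for the Q3400-RA follows from the standard non-blocking fat-tree bound. With radix-k switches, each leaf splits its ports evenly between hosts and uplinks, so a two-level tree tops out at k²/2 endpoints. A minimal sketch of that arithmetic (the function name is illustrative, and real designs often give up some ports to management or storage):

```python
def fat_tree_max_endpoints(radix: int, tiers: int) -> int:
    """Upper bound on endpoints in a non-blocking fat-tree.

    Every non-top switch uses half its ports downward and half upward,
    which gives 2 * (radix / 2) ** tiers endpoints.
    """
    if radix < 2 or radix % 2 != 0:
        raise ValueError("radix must be an even number >= 2")
    if tiers < 1:
        raise ValueError("tiers must be >= 1")
    return 2 * (radix // 2) ** tiers


if __name__ == "__main__":
    radix = 144  # Q3400-RA: 144 ports of 800 Gb/s
    print(fat_tree_max_endpoints(radix, 2))  # two-level leaf/spine
    print(fat_tree_max_endpoints(radix, 3))  # with a super-spine tier
```

For a radix of 144 the two-level bound is 144 × 144 / 2 = 10,368, matching the figure above.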
Adapters¶
- ConnectX-8 SuperNIC: 800 Gb/s XDR InfiniBand or 2x 400 GbE through one OSFP port. PCIe Gen6 x16, integrated PCIe switch (removes the external PCIe switch the GB200/CX7 generation needed).
- ConnectX-7: 400 Gb/s NDR, the most widely installed NIC in existing clusters. Mixed CX7/CX8 estates are common.
Subnet manager and fabric management¶
- The subnet manager (SM) is the core IB controller: network initialisation, routing, partitioning (PKeys). Runs on a managed switch or on a Linux host with OFED drivers. OpenSM is the open implementation.
- UFM (Unified Fabric Manager) is NVIDIA's management platform: provisioning, real-time monitoring, diagnostics, troubleshooting. The SM is one component of UFM. Variants: Telemetry, Enterprise, Cyber-AI (anomaly detection). Quantum-X800 switches carry a dedicated OSFP in-band management port for UFM, separated from data ports.
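A fabric should have exactly one master SM, with any others on standby. As a rough illustration of the check, the sketch below works on an already-collected list of SMs. The `SubnetManager` record, its field names, and the state strings are hypothetical, not the output of any real tool. It assumes the usual InfiniBand election rule, where the highest priority wins and ties go to the lowest GUID.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SubnetManager:
    """Hypothetical record for one SM seen on the fabric."""
    guid: int
    priority: int  # 0-15, higher wins the election
    state: str     # "MASTER" or "STANDBY" in this sketch


def expected_master(sms: list[SubnetManager]) -> SubnetManager:
    """Highest priority wins; ties are broken by lowest GUID (assumed rule)."""
    return min(sms, key=lambda sm: (-sm.priority, sm.guid))


def check_sm_health(sms: list[SubnetManager]) -> list[str]:
    """Return a list of human-readable problems; empty means healthy."""
    problems = []
    masters = [sm for sm in sms if sm.state == "MASTER"]
    if not masters:
        problems.append("no master SM found")
    elif len(masters) > 1:
        problems.append(f"{len(masters)} SMs claim master; expected exactly 1")
    elif masters[0] != expected_master(sms):
        problems.append("master is not the highest-priority SM; check priorities")
    return problems
```

An empty result means one master that is also the SM the priorities say should win. Anything else is worth resolving before running benchmarks.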
Ethernet alternative¶
- Spectrum-X (Spectrum-4) is the Ethernet path for AI, with RoCEv2 for RDMA. Rule of thumb: InfiniBand for training/HPC where GPU-to-GPU latency dominates; Ethernet for inference and multi-tenant cloud. Above ~10,000 GPUs a super-spine tier is added.
- Skyway is the IB-to-Ethernet gateway appliance.
Topology¶
- Fat-tree (leaf/spine) is the standard non-blocking topology. Two-level fat-tree scales to thousands of NICs; a third (super-spine) tier extends to tens of thousands.
- Rail-aligned design: in a SuperPOD, each GPU rail connects so that nodes within a Scalable Unit (SU, 72 nodes) are one hop apart; inter-SU and inter-rail traffic crosses the spine. See the Blackwell platform for SU structure.
- A SuperPOD runs four separate fabrics: compute (XDR), storage (often NDR), in-band management, and out-of-band management (SN2201 switches, OOB carried as a VXLAN).
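The rail-aligned rule can be expressed as a small mapping: rail r of every node in an SU lands on the same leaf, so same-rail traffic inside an SU stays on one switch and everything else crosses the spine. The sketch below is a simplified model. The 72-node SU comes from the text, while the 8 rails per node, the one-leaf-per-rail-per-SU layout, and the function names are assumptions made for illustration.

```python
NODES_PER_SU = 72   # SU size stated in the text
RAILS_PER_NODE = 8  # assumption: one rail per NIC, 8 NICs per node


def leaf_id(node: int, rail: int) -> tuple[int, int]:
    """Leaf serving a given (node, rail), identified as (su_index, rail)."""
    if node < 0:
        raise ValueError("node index must be non-negative")
    if not 0 <= rail < RAILS_PER_NODE:
        raise ValueError("rail out of range")
    return (node // NODES_PER_SU, rail)


def crosses_spine(src: tuple[int, int], dst: tuple[int, int]) -> bool:
    """True when traffic between two (node, rail) endpoints leaves the leaf."""
    return leaf_id(*src) != leaf_id(*dst)


if __name__ == "__main__":
    print(crosses_spine((0, 3), (71, 3)))  # same SU, same rail
    print(crosses_spine((0, 3), (72, 3)))  # next SU, same rail
    print(crosses_spine((0, 3), (1, 4)))   # same SU, different rail
```

A model like this is handy when choosing "representative paths" for validation: pick at least one pair that stays on a leaf and one that crosses the spine.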
Fabric validation (what "validated" means)¶
- Physical: every link up at the expected width and speed, no degraded lanes, transceiver and cable health clean.
- Routing: SM has converged, routing tables correct, no isolated nodes, partitioning correct.
- Performance: bandwidth at line rate and latency within spec across representative paths; congestion control and AR behaving under load.
- Tooling to know: ibstat, ibstatus, iblinkinfo, ibdiagnet, perfquery, plus UFM dashboards and nccl-tests (all_reduce, all_gather) as the application-level proof.
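To compare an nccl-tests result against line rate, it helps to know how bus bandwidth is derived. nccl-tests computes algorithm bandwidth as size divided by time, then scales it by a per-collective factor: 2(n-1)/n for all_reduce and (n-1)/n for all_gather and reduce_scatter, where n is the number of ranks. The sketch below reproduces that arithmetic. The function names are illustrative, and any pass threshold applied to the result is a site-specific choice rather than something nccl-tests defines.

```python
# Bus-bandwidth correction factors as documented by nccl-tests.
BUS_FACTORS = {
    "all_reduce": lambda n: 2 * (n - 1) / n,
    "all_gather": lambda n: (n - 1) / n,
    "reduce_scatter": lambda n: (n - 1) / n,
}


def bus_bandwidth_gbytes(collective: str, size_bytes: int, seconds: float, ranks: int) -> float:
    """Bus bandwidth in GB/s (1 GB = 1e9 bytes) from size, time and rank count."""
    if ranks < 2:
        raise ValueError("need at least 2 ranks")
    algbw = size_bytes / seconds / 1e9
    return algbw * BUS_FACTORS[collective](ranks)


def fraction_of_line_rate(busbw_gbytes: float, link_gbits: float) -> float:
    """Bus bandwidth as a fraction of one link's per-direction line rate."""
    return busbw_gbytes / (link_gbits / 8)


if __name__ == "__main__":
    # Made-up measurement: 1 GB all_reduce across 16 ranks in 20 ms.
    busbw = bus_bandwidth_gbytes("all_reduce", 1_000_000_000, 0.02, 16)
    print(f"busbw = {busbw:.2f} GB/s, "
          f"{fraction_of_line_rate(busbw, 800):.1%} of an 800 Gb/s link")
```

The numbers in the usage example are invented for illustration. With few ranks or single-node runs, NVLink traffic can push bus bandwidth above NIC line rate, so compare like with like.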
Don't-miss checklist¶
- Confirm SM is running and singular (no duplicate SMs fighting).
- Walk the full link inventory for width/speed downgrades before trusting any benchmark.
- Run ibdiagnet and clear all errors before application tests.
- Prove it with nccl-tests at realistic message sizes, not just point-to-point.
- Confirm cable reach and media type match the run length (copper vs MMF vs SMF).
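Walking the link inventory amounts to comparing every link against the width and speed it should have come up at. The sketch below assumes the inventory has already been collected into simple records. The `LinkRecord` type and its field names are hypothetical and do not correspond to the output format of any specific tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LinkRecord:
    """Hypothetical, already-parsed description of one fabric link."""
    name: str        # e.g. "leaf01/p12 <-> node007/mlx5_3"
    state: str       # "Active" when the link is up, in this sketch
    width: int       # number of lanes, e.g. 4
    rate_gbits: int  # negotiated rate in Gb/s, e.g. 800


def find_downgrades(links: list[LinkRecord], expected_width: int,
                    expected_rate_gbits: int) -> list[str]:
    """Return one message per link that is down, narrow, or slow."""
    findings = []
    for link in links:
        if link.state != "Active":
            findings.append(f"{link.name}: state is {link.state}")
        elif link.width < expected_width:
            findings.append(f"{link.name}: width {link.width}x, expected {expected_width}x")
        elif link.rate_gbits < expected_rate_gbits:
            findings.append(
                f"{link.name}: rate {link.rate_gbits} Gb/s, expected {expected_rate_gbits} Gb/s"
            )
    return findings


if __name__ == "__main__":
    inventory = [
        LinkRecord("leaf01/p1", "Active", 4, 800),
        LinkRecord("leaf01/p2", "Active", 4, 400),  # negotiated down
        LinkRecord("leaf01/p3", "Down", 4, 800),
    ]
    for finding in find_downgrades(inventory, expected_width=4, expected_rate_gbits=800):
        print(finding)
```

An empty result is the precondition for trusting a benchmark: a single link that negotiated down can drag a collective well below line rate.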
Failure modes¶
- Optics/form-factor mismatch (see BOM validation): finned-top IHS (integrated heat sink) modules for switch cages and flat-top RHS (riding heat sink) modules for NIC cages are not interchangeable; NDR and XDR OSFP modules are not interchangeable despite the shared form factor.
- Duplicate or mis-prioritised subnet managers causing routing churn.
- Under-provisioned spine causing blocking on inter-SU collectives.
- Mixed CX7/CX8 estate negotiating down to NDR unexpectedly.
Open questions & validation¶
- IB-specific tooling. Validate OpenSM/UFM, ibdiagnet, and partitioning in a test fabric before relying on them; RoCE/Ethernet estates use a different toolchain, with NCCL-over-RoCE tuning covered in performance tuning. Proving a fabric healthy at line rate is the highest-impact networking skill.
- Spectrum-X vs Quantum-X800 selection: InfiniBand for latency-bound training, Ethernet/RoCE for multi-tenant inference and cloud, applied per deployment.
References¶
- NVIDIA Quantum-X800 platform: https://www.nvidia.com/en-us/networking/products/infiniband/quantum-x800/
- NVIDIA InfiniBand switching overview (UFM, Q3200 bridging): https://www.nvidia.com/en-us/networking/infiniband-switching/
- DGX SuperPOD B300 / XDR network fabrics reference: https://docs.nvidia.com/dgx-superpod/reference-architecture/scalable-infrastructure-b300-xdr/latest/network-fabrics.html
- XDR switch hardware manual: https://docs.nvidia.com/networking/display/XDRSwitchesHWUM/Introduction
- Transceiver compatibility (IHS/RHS, NDR/XDR): https://www.vitextech.com/blogs/blog/800g-transceiver-compatibility-for-nvidia-platforms-spectrum-4-quantum-x800-connectx-8-and-bluefield-3
- NVIDIA NVLink and NVLink Switch (generation bandwidth, NVSwitch, domains): https://www.nvidia.com/en-us/data-center/nvlink/
- NVIDIA H100 (4th-gen NVLink 900 GB/s, NDR pairing): https://www.nvidia.com/en-us/data-center/h100/
- NVIDIA A100 (3rd-gen NVLink 600 GB/s): https://www.nvidia.com/en-us/data-center/a100/
- NVIDIA GB200 NVL72 (5th-gen NVLink 1.8 TB/s, NVLink Switch System): https://www.nvidia.com/en-us/data-center/gb200-nvl72/
- NVIDIA GPUDirect RDMA (Tesla/Quadro support, ACS/P2P): https://docs.nvidia.com/cuda/gpudirect-rdma/index.html
Related: BOM · Platform · Performance · Overlay & Mesh Networking · Glossary