HPC networking fabric

Scope: the GPU-to-GPU interconnect and its validation. InfiniBand and high-speed Ethernet, topologies, fabric management, and how a fabric is proven healthy at bring-up.

Overview

In a training cluster the fabric is the bottleneck that decides whether GPUs sit idle waiting on collectives. The deployment skill is twofold: design a topology that delivers non-blocking bandwidth where it is needed, and validate the built fabric against line rate before sign-off.

Architecture: two-tier fat-tree (rail-aligned)

```mermaid
flowchart TB
  S1["Spine"] --- L1["Leaf"]
  S1 --- L2["Leaf"]
  S2["Spine"] --- L1
  S2 --- L2
  L1 --- N1["GPU node (8x ConnectX-8)"]
  L1 --- N2["GPU node"]
  L2 --- N3["GPU node"]
  L2 --- N4["GPU node"]
```

GPU cluster networking deep dives

This page is the pillar for the GPU cluster network fabric. Route to the focused pages for tuning, validation, and incident response:

Intra-node interconnect. NVSwitch and NVLink.
RDMA transport tuning. RDMA and RoCE tuning, BlueField DPU networking.
Collective communication. NCCL collectives and algorithms, SHARP in-network reduction, NVSHMEM GPU-initiated communication, compute and communication overlap.
Bring-up and validation. fabric bring-up, validation and benchmarking, fabric validation with nccl-tests.
Fabric runbooks. NCCL hang and collective stall, NVLink visibility and P2P failure, PCIe and P2P bandwidth regression, Fabric Manager failure.

InfiniBand vs RoCE, and validating the fabric

InfiniBand vs RoCE for GPU training and inference? InfiniBand (NVIDIA Quantum) is the default for large training fabrics, lossless and SHARP-capable by design; RoCEv2 on Spectrum-X Ethernet is the lower-cost alternative but needs careful PFC/ECN tuning to stay lossless (RDMA and RoCE tuning). The Core knowledge section below compares them by GPU generation.
How do you validate GPU fabric bandwidth? Prove the built fabric against line rate with nccl-tests bus bandwidth before sign-off, not just link-up (fabric validation with nccl-tests, fabric bring-up and benchmarking).
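
As a rough sketch of what "prove against line rate" means numerically: nccl-tests reports a bus bandwidth derived from the measured algorithm bandwidth, and for all-reduce the correction factor is 2(n-1)/n for n ranks. The function names and the 90% sign-off threshold below are hypothetical choices for illustration, not values taken from this page or from NVIDIA.

```python
def allreduce_bus_bandwidth(message_bytes: int, seconds: float, ranks: int) -> float:
    """Return all-reduce bus bandwidth in GB/s (1 GB = 1e9 bytes)."""
    if ranks < 2 or seconds <= 0:
        raise ValueError("need at least 2 ranks and a positive time")
    alg_bw = message_bytes / seconds / 1e9      # algorithm bandwidth, GB/s
    return alg_bw * 2 * (ranks - 1) / ranks     # all-reduce correction factor


def line_rate_gbytes(gbits_per_s: float) -> float:
    """Convert a per-direction link rate in Gb/s to GB/s."""
    return gbits_per_s / 8


def passes_sign_off(bus_bw: float, gbits_per_s: float, min_fraction: float = 0.9) -> bool:
    """Hypothetical gate: bus bandwidth must reach min_fraction of line rate."""
    return bus_bw >= min_fraction * line_rate_gbytes(gbits_per_s)


# 8 GB reduced across 16 ranks in 0.5 s: 16 GB/s algorithm bandwidth.
bus_bw = allreduce_bus_bandwidth(8_000_000_000, 0.5, 16)
print(bus_bw, passes_sign_off(bus_bw, 400))
```

Comparing against the per-GPU NIC rate assumes the run is inter-node and bound by the fabric; a run that stays inside one NVLink domain measures NVLink instead.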

Core knowledge

InfiniBand generations

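Per-port, per-direction line rate doubles each generation: EDR 100, HDR 200, NDR 400, XDR 800 Gb/s. Adapters and switches of different generations generally negotiate down to a common speed, but only where the switch and optics support the older rate: the Q3400-RA, for example, has no NDR backward compatibility.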
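
The line rates above convert to bytes as follows. This is a small sketch; the default of 8 NICs per node is an assumption borrowed from the "8x ConnectX-8" node in the architecture diagram, and the function names are illustrative.

```python
# Per-port, per-direction line rates in Gb/s, as listed above.
LINE_RATE_GBPS = {"EDR": 100, "HDR": 200, "NDR": 400, "XDR": 800}


def per_direction_gbytes(generation: str) -> float:
    """Line rate of one port in GB/s, per direction."""
    return LINE_RATE_GBPS[generation] / 8


def node_injection_gbytes(generation: str, nics_per_node: int = 8) -> float:
    """Aggregate per-direction injection bandwidth of one node in GB/s."""
    return per_direction_gbytes(generation) * nics_per_node


def doubles_each_generation() -> bool:
    """Check that each generation is exactly twice the previous one."""
    rates = list(LINE_RATE_GBPS.values())
    return all(b == 2 * a for a, b in zip(rates, rates[1:]))


print(per_direction_gbytes("NDR"), node_injection_gbytes("XDR"))
```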

Interconnect differences by GPU generation/tier

The scale-up interconnect is not uniform across the range, and this drives the fabric design.

NVLink generation (datacenter SXM only). Across GPU generations: Ampere uses 3rd-gen NVLink at 600 GB/s per GPU; Hopper uses 4th-gen NVLink at 900 GB/s ("fourth-generation NVLink, which offers 900 gigabytes per second"); Blackwell uses 5th-gen NVLink at 1.8 TB/s ("fifth-generation NVIDIA NVLink ... 1.8 TB/s of GPU-to-GPU interconnect"). Rack-scale NVLink domains built on the NVLink Switch System (NVSwitch), the all-to-all scale-up interconnect (e.g. the 72-GPU domain in GB200 NVL72), exist only on SXM datacenter parts, where Fabric Manager runs. PCIe datacenter cards (A100/H100 PCIe) expose only an optional 2-way NVLink Bridge and no NVSwitch domain; consumer and most workstation cards have no NVLink at all (next point).
Consumer and most workstation cards have NO NVLink. GeForce (RTX 4090/5090) and the workstation RTX Blackwell and Ada cards expose only PCIe (Gen5 on Blackwell, Gen4 on Ada). Multi-GPU is therefore PCIe peer-to-peer only: there is no NVSwitch domain and no Fabric Manager, and NCCL falls back to PCIe P2P or host staging rather than NVLink/NVLS (tune via PCIe/ACS, see performance tuning).
GPUDirect RDMA is datacenter/workstation-pro only. NVIDIA states GPUDirect RDMA "is available on both Tesla and Quadro GPUs", i.e. the datacenter (A100/H100/H200/B-series) and professional/RTX-Enterprise lines. It is not supported on GeForce, so consumer cards cannot drive NIC-to-GPU RDMA over the fabric.
InfiniBand generation pairs with the platform. Ampere-era clusters pair with HDR (200 Gb/s); Hopper with NDR (400 Gb/s, Quantum-2); Blackwell with XDR (800 Gb/s, Quantum-X800). The per-generation NIC/switch line rates and the IB switch families are detailed below.
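
To see why the scale-up and scale-out tiers are designed separately, the per-GPU NVLink figures can be set against the paired InfiniBand rate. This is a rough sketch that assumes one NIC per GPU; the NVLink numbers are the vendor-quoted per-GPU figures while the InfiniBand rates are per direction, so the two are not measured on the same basis and the ratio is only an order-of-magnitude indicator.

```python
# Per-GPU NVLink bandwidth in GB/s, as quoted in the list above.
NVLINK_GBYTES = {"Ampere": 600, "Hopper": 900, "Blackwell": 1800}

# Paired InfiniBand line rate in Gb/s (HDR, NDR, XDR), per direction.
IB_PAIRING_GBITS = {"Ampere": 200, "Hopper": 400, "Blackwell": 800}


def scale_up_to_scale_out_ratio(platform: str) -> float:
    """Ratio of quoted NVLink bandwidth to one NIC's line rate (assumes 1 NIC per GPU)."""
    nic_gbytes = IB_PAIRING_GBITS[platform] / 8
    return NVLINK_GBYTES[platform] / nic_gbytes


for name in NVLINK_GBYTES:
    print(name, scale_up_to_scale_out_ratio(name))
```

The gap of roughly an order of magnitude is the reason collectives are kept inside the NVLink domain where possible and only the remainder crosses the fabric.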

NVIDIA Quantum InfiniBand (current)

Quantum-2 is the NDR (400 Gb/s) generation, still used for Hopper-era and storage fabrics.
Quantum-X800 is the XDR (800 Gb/s) generation, built on the Quantum-3 ASIC and 200 Gb/s-per-lane SerDes, for Blackwell-scale clusters.
Q3400-RA (4U): 144 ports of 800 Gb/s across 72 twin-port OSFP cages. High radix supports a two-level fat-tree connecting up to 10,368 NICs (144 leaves x 72 downlinks each). No NDR backward compatibility.
Q3200-RA (2U): two switches in one enclosure, 36x 800G each (72 total), supports NDR backward compatibility, so it is the bridge to existing NDR infrastructure.
Q3450-LD: silicon-photonics / co-packaged optics variant, MPO direct connect, lower insertion loss.
In-network features: SHARP v4 (in-network reduction for collectives), Adaptive Routing (AR), telemetry-based congestion control, SHIELD self-healing. Sub-100 ns port-to-port latency.
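
The two-level fat-tree figure quoted for the Q3400-RA follows from the switch radix. A minimal sketch, assuming a full non-blocking design in which every leaf splits its ports evenly between downlinks and uplinks (the function name is illustrative):

```python
def two_level_fat_tree(radix: int) -> dict:
    """Size a full non-blocking two-level fat-tree built from radix-port switches."""
    if radix <= 0 or radix % 2:
        raise ValueError("radix must be a positive even number")
    downlinks_per_leaf = radix // 2      # the other half of each leaf goes up
    leaves = radix                       # each spine port reaches one leaf
    spines = radix // 2                  # one spine per leaf uplink
    return {
        "leaves": leaves,
        "spines": spines,
        "endpoints": leaves * downlinks_per_leaf,
    }


print(two_level_fat_tree(144))   # endpoints: 144 * 72 = 10368
```

The same arithmetic applied to one 36-port Q3200-RA switch gives 648 endpoints, which shows how strongly radix drives the size a two-level design can reach.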

Adapters

ConnectX-8 SuperNIC: 800 Gb/s XDR InfiniBand or 2x 400 GbE through one OSFP port. PCIe Gen6 x16, integrated PCIe switch (removes the external PCIe switch the GB200/CX7 generation needed).
ConnectX-7: 400 Gb/s NDR, the most widely installed NIC in existing clusters. Mixed CX7/CX8 estates are common.

Subnet manager and fabric management

The subnet manager (SM) is the core IB controller: network initialisation, routing, partitioning (PKeys). Runs on a managed switch or on a Linux host with OFED drivers. OpenSM is the open implementation.
UFM (Unified Fabric Manager) is NVIDIA's management platform: provisioning, real-time monitoring, diagnostics, troubleshooting. The SM is one component of UFM. Variants: Telemetry, Enterprise, Cyber-AI (anomaly detection). Quantum-X800 switches carry a dedicated OSFP in-band management port for UFM, separated from data ports.
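
When several SMs are present, one becomes master and the rest stand by. The sketch below models the commonly described election rule (highest priority wins, lowest GUID breaks a tie) to show why priorities should be set deliberately. It is a toy model under that assumption and does not query a real fabric; the class and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SubnetManager:
    guid: int
    priority: int   # 0..15, higher wins


def elect_master(sms):
    """Pick the master: highest priority first, lowest GUID as tie-break."""
    if not sms:
        raise ValueError("no subnet manager present")
    return min(sms, key=lambda sm: (-sm.priority, sm.guid))


def top_priority_contested(sms) -> bool:
    """True when more than one SM shares the highest priority."""
    top = max(sm.priority for sm in sms)
    return sum(sm.priority == top for sm in sms) > 1


sms = [SubnetManager(0x20, 15), SubnetManager(0x10, 15), SubnetManager(0x05, 1)]
print(hex(elect_master(sms).guid), top_priority_contested(sms))
```

A contested top priority means mastership is decided by GUID rather than by intent, which is one way the "mis-prioritised" case arises.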

Ethernet alternative

Spectrum-X (Spectrum-4) is the Ethernet path for AI, with RoCEv2 for RDMA. Rule of thumb: InfiniBand for training/HPC where GPU-to-GPU latency dominates; Ethernet for inference and multi-tenant cloud. Above ~10,000 GPUs a super-spine tier is added.
Skyway is the IB-to-Ethernet gateway appliance.

Topology

Fat-tree (leaf/spine) is the standard non-blocking topology. Two-level fat-tree scales to thousands of NICs; a third (super-spine) tier extends to tens of thousands.
Rail-aligned design: in a SuperPOD, the same-numbered NIC of every node in a Scalable Unit (SU, 72 nodes) connects to the same rail leaf, so same-rail traffic within an SU is one switch hop; inter-SU and inter-rail traffic crosses the spine. See the Blackwell platform for SU structure.
A SuperPOD runs four separate fabrics: compute (XDR), storage (often NDR), in-band management, and out-of-band management (SN2201 switches, OOB carried as a VXLAN).
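
The rail-aligned rule can be sketched as a mapping from (node, rail) to a leaf. This assumes a two-tier fabric with one leaf per rail per SU and 8 rails per node; both are simplifications for illustration, and the names are hypothetical.

```python
NODES_PER_SU = 72   # SU size as stated above
RAILS = 8           # assumed: one rail per NIC, 8 NICs per node


def leaf_for(node: int, rail: int):
    """Identify the leaf a NIC lands on as (su_index, rail)."""
    if not 0 <= rail < RAILS:
        raise ValueError("rail out of range")
    return (node // NODES_PER_SU, rail)


def switch_hops(node_a: int, rail_a: int, node_b: int, rail_b: int) -> int:
    """Switches traversed between two NICs: 1 on a shared leaf, else leaf-spine-leaf."""
    return 1 if leaf_for(node_a, rail_a) == leaf_for(node_b, rail_b) else 3


print(switch_hops(0, 0, 71, 0))   # same SU, same rail
print(switch_hops(0, 0, 72, 0))   # next SU, same rail
print(switch_hops(0, 0, 1, 3))    # same SU, different rail
```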

Fabric validation (what "validated" means)

Physical: every link up at the expected width and speed, no degraded lanes, transceiver and cable health clean.
Routing: SM has converged, routing tables correct, no isolated nodes, partitioning correct.
Performance: bandwidth at line rate and latency within spec across representative paths; congestion control and AR behaving under load.
Tooling to know: ibstat, ibstatus, iblinkinfo, ibdiagnet, perfquery, plus UFM dashboards and nccl-tests (all_reduce_perf, all_gather_perf) as the application-level proof.
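
As one example of the physical check, the sketch below parses text in the shape that ibstat prints and flags ports that are not Active/LinkUp at the expected rate. The sample is illustrative and trimmed (real output carries more fields), the helper names are hypothetical, and ibstat does not report link width, so width still has to come from iblinkinfo.

```python
import re

# Illustrative, trimmed sample in the shape ibstat prints.
SAMPLE_IBSTAT = """
CA 'mlx5_0'
    Number of ports: 1
    Port 1:
        State: Active
        Physical state: LinkUp
        Rate: 400
        Link layer: InfiniBand
CA 'mlx5_1'
    Number of ports: 1
    Port 1:
        State: Active
        Physical state: LinkUp
        Rate: 200
        Link layer: InfiniBand
"""


def parse_ports(text):
    """Return one dict per port with its CA name, port number, and fields."""
    ports, ca, current = [], None, None
    for raw in text.splitlines():
        line = raw.strip()
        ca_match = re.match(r"CA '(.+)'", line)
        port_match = re.match(r"Port (\d+):", line)
        if ca_match:
            ca, current = ca_match.group(1), None
        elif port_match:
            current = {"ca": ca, "port": int(port_match.group(1))}
            ports.append(current)
        elif current is not None and ":" in line:
            key, _, value = line.partition(":")
            current[key.strip()] = value.strip()
    return ports


def degraded_ports(ports, expected_rate_gbps):
    """List (ca, port) pairs that are down or below the expected rate."""
    bad = []
    for port in ports:
        healthy = (
            port.get("State") == "Active"
            and port.get("Physical state") == "LinkUp"
            and float(port.get("Rate", 0)) >= expected_rate_gbps
        )
        if not healthy:
            bad.append((port["ca"], port["port"]))
    return bad


print(degraded_ports(parse_ports(SAMPLE_IBSTAT), expected_rate_gbps=400))
# prints [('mlx5_1', 1)]
```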

Don't-miss checklist

Confirm SM is running and singular (no duplicate SMs fighting).
Walk the full link inventory for width/speed downgrades before trusting any benchmark.
Run ibdiagnet and clear all errors before application tests.
Prove it with nccl-tests at realistic message sizes, not just point-to-point.
Confirm cable reach and media type match the run length (copper vs MMF vs SMF).
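
For the "realistic message sizes" item, a sweep is usually expressed with the nccl-tests flags -b (minimum bytes), -e (maximum bytes), -f (step factor) and -g (GPUs per process). The sketch below builds such an argument list and enumerates the sizes it would cover. Reading the K/M/G suffixes as powers of 1024 is an assumption, the defaults are illustrative, and the launcher (mpirun, srun, or similar) is deliberately left out.

```python
UNITS = {"": 1, "K": 1024, "M": 1024 ** 2, "G": 1024 ** 3}


def parse_size(text: str) -> int:
    """Parse sizes such as '8', '128M' or '8G' into bytes (binary units assumed)."""
    text = text.strip().upper()
    suffix = text[-1] if text[-1] in "KMG" else ""
    number = text[:-1] if suffix else text
    return int(number) * UNITS[suffix]


def sweep_sizes(min_bytes: int, max_bytes: int, factor: int = 2):
    """Message sizes from min to max, multiplying by factor each step."""
    if factor < 2 or min_bytes <= 0:
        raise ValueError("factor must be >= 2 and min_bytes positive")
    sizes, size = [], min_bytes
    while size <= max_bytes:
        sizes.append(size)
        size *= factor
    return sizes


def all_reduce_perf_args(min_size="8", max_size="8G", factor=2, gpus_per_process=1):
    """Argument list for the all_reduce_perf binary, without a launcher."""
    return ["all_reduce_perf", "-b", min_size, "-e", max_size,
            "-f", str(factor), "-g", str(gpus_per_process)]


sizes = sweep_sizes(parse_size("8"), parse_size("8G"))
print(len(sizes), sizes[-1], " ".join(all_reduce_perf_args()))
```

Sweeping from bytes to gigabytes matters because small messages expose latency while large ones expose bandwidth, and a fabric can pass one and fail the other.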

Failure modes

Optics/form-factor mismatch (see BOM validation): IHS (integrated heat sink, finned-top) modules for switch cages and RHS (riding heat sink, flat-top) modules for NIC cages are not interchangeable; NDR and XDR OSFP modules are not interchangeable despite the shared form factor.
Duplicate or mis-prioritised subnet managers causing routing churn.
Under-provisioned spine causing blocking on inter-SU collectives.
Mixed CX7/CX8 estate negotiating down to NDR unexpectedly.
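
The under-provisioned spine case can be caught on paper before any cabling. A minimal sketch of the leaf oversubscription ratio, with hypothetical function names; a ratio above 1.0 means the leaf can receive more from its nodes than it can pass up to the spine.

```python
def oversubscription_ratio(downlinks: int, uplinks: int,
                           downlink_gbps: float, uplink_gbps: float = None) -> float:
    """Downlink capacity divided by uplink capacity for one leaf switch."""
    if uplink_gbps is None:
        uplink_gbps = downlink_gbps          # same speed on both sides by default
    if downlinks <= 0 or uplinks <= 0:
        raise ValueError("link counts must be positive")
    return (downlinks * downlink_gbps) / (uplinks * uplink_gbps)


def is_non_blocking(downlinks: int, uplinks: int,
                    downlink_gbps: float, uplink_gbps: float = None) -> bool:
    """A leaf is non-blocking when uplink capacity covers downlink capacity."""
    return oversubscription_ratio(downlinks, uplinks, downlink_gbps, uplink_gbps) <= 1.0


print(oversubscription_ratio(72, 72, 800))   # 1.0, non-blocking
print(oversubscription_ratio(72, 36, 800))   # 2.0, blocks under inter-SU load
```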

Open questions & validation

IB-specific tooling. Validate OpenSM/UFM, ibdiagnet, and partitioning in a test fabric before relying on them; RoCE/Ethernet estates use a different toolchain, with NCCL-over-RoCE tuning covered in performance tuning. Proving a fabric healthy at line rate is the highest-impact networking skill.
Spectrum-X vs Quantum-X800 selection: InfiniBand for latency-bound training, Ethernet/RoCE for multi-tenant inference and cloud, applied per deployment.

References

NVIDIA Quantum-X800 platform: https://www.nvidia.com/en-us/networking/products/infiniband/quantum-x800/
NVIDIA InfiniBand switching overview (UFM, Q3200 bridging): https://www.nvidia.com/en-us/networking/infiniband-switching/
DGX SuperPOD B300 / XDR network fabrics reference: https://docs.nvidia.com/dgx-superpod/reference-architecture/scalable-infrastructure-b300-xdr/latest/network-fabrics.html
XDR switch hardware manual: https://docs.nvidia.com/networking/display/XDRSwitchesHWUM/Introduction
Transceiver compatibility (IHS/RHS, NDR/XDR): https://www.vitextech.com/blogs/blog/800g-transceiver-compatibility-for-nvidia-platforms-spectrum-4-quantum-x800-connectx-8-and-bluefield-3
NVIDIA NVLink and NVLink Switch (generation bandwidth, NVSwitch, domains): https://www.nvidia.com/en-us/data-center/nvlink/
NVIDIA H100 (4th-gen NVLink 900 GB/s, NDR pairing): https://www.nvidia.com/en-us/data-center/h100/
NVIDIA A100 (3rd-gen NVLink 600 GB/s): https://www.nvidia.com/en-us/data-center/a100/
NVIDIA GB200 NVL72 (5th-gen NVLink 1.8 TB/s, NVLink Switch System): https://www.nvidia.com/en-us/data-center/gb200-nvl72/
NVIDIA GPUDirect RDMA (Tesla/Quadro support, ACS/P2P): https://docs.nvidia.com/cuda/gpudirect-rdma/index.html

Related: BOM · Platform · Performance · Overlay & Mesh Networking · Glossary