HPC networking fabric

Scope: the GPU-to-GPU interconnect and the validation of it. InfiniBand and high-speed Ethernet, topologies, fabric management, and how a fabric is proven healthy at bring-up.

Overview

In a training cluster the fabric is typically the bottleneck: it decides whether GPUs sit idle waiting on collectives. The deployment skill is twofold: design a topology that delivers non-blocking bandwidth where it is needed, and validate the built fabric against line rate before sign-off.

Architecture: two-tier fat-tree (rail-aligned)

flowchart TB
  S1["Spine"] --- L1["Leaf"]
  S1 --- L2["Leaf"]
  S2["Spine"] --- L1
  S2 --- L2
  L1 --- N1["GPU node (8x ConnectX-8)"]
  L1 --- N2["GPU node"]
  L2 --- N3["GPU node"]
  L2 --- N4["GPU node"]

GPU cluster networking deep dives

This page is the pillar for the GPU cluster network fabric. Route to the focused pages for tuning, validation, and incident response:

InfiniBand vs RoCE, and validating the fabric

  • InfiniBand vs RoCE for GPU training and inference? InfiniBand (NVIDIA Quantum) is the default for large training fabrics: it is lossless and SHARP-capable by design. RoCEv2 on Spectrum-X Ethernet is the lower-cost alternative, but it needs careful PFC/ECN tuning to stay lossless (see RDMA and RoCE tuning). The Core knowledge section below covers interconnect differences by GPU generation.
  • How do you validate GPU fabric bandwidth? Prove the built fabric against line rate with nccl-tests bus bandwidth before sign-off, not just link-up (see fabric validation with nccl-tests, and fabric bring-up and benchmarking).

Core knowledge

InfiniBand generations

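Per-port, per-direction line rate doubles each generation: EDR 100, HDR 200, NDR 400, XDR 800 Gb/s. Adapters and switches from different generations can often negotiate down to a common speed, but not universally: the Q3400-RA has no NDR backward compatibility, and NDR and XDR OSFP modules are not interchangeable (see Failure modes).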

Interconnect differences by GPU generation/tier

The scale-up interconnect is not uniform across the range, and this drives the fabric design.

  • NVLink generation (NVSwitch domains on datacenter SXM parts only). Across GPU generations: Ampere uses 3rd-gen NVLink at 600 GB/s per GPU; Hopper uses 4th-gen NVLink at 900 GB/s ("fourth-generation NVLink, which offers 900 gigabytes per second"); Blackwell uses 5th-gen NVLink at 1.8 TB/s ("fifth-generation NVIDIA NVLink ... 1.8 TB/s of GPU-to-GPU interconnect"). Rack-scale NVLink domains built on the NVLink Switch System (NVSwitch), the all-to-all scale-up interconnect (e.g. the 72-GPU domain in GB200 NVL72), exist only on SXM datacenter parts, where Fabric Manager runs. PCIe datacenter cards (A100/H100 PCIe) expose only an optional 2-way NVLink Bridge and no NVSwitch domain; consumer and most workstation cards have no NVLink at all (next point).
  • Consumer and most workstation cards have NO NVLink. GeForce (RTX 4090/5090) and the Ada- and Blackwell-generation RTX workstation cards expose only PCIe (Gen4 on Ada, Gen5 on Blackwell). Multi-GPU is therefore PCIe peer-to-peer only: there is no NVSwitch domain and no Fabric Manager, and NCCL falls back to PCIe P2P or host staging rather than NVLink/NVLS (tune via PCIe/ACS, see performance tuning).
  • GPUDirect RDMA is datacenter/workstation-pro only. NVIDIA states GPUDirect RDMA "is available on both Tesla and Quadro GPUs", i.e. the datacenter (A100/H100/H200/B-series) and professional/RTX-Enterprise lines. It is not supported on GeForce, so consumer cards cannot drive NIC-to-GPU RDMA over the fabric.
  • InfiniBand generation pairs with the platform. Ampere-era clusters pair with HDR (200 Gb/s); Hopper with NDR (400 Gb/s, Quantum-2); Blackwell with XDR (800 Gb/s, Quantum-X800). The per-generation NIC/switch line rates and the IB switch families are detailed below.
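
To make the scale-up versus scale-out gap concrete, the sketch below compares the NVLink figure quoted for each GPU generation with one InfiniBand port of the paired generation. It rests on two assumptions that are not stated on this page: the NVLink figures are bidirectional per-GPU totals, and each GPU is served by a single NIC port. The names are illustrative only.

```python
# Figures quoted in the list above. Assumptions (not from this page): the
# NVLink numbers are bidirectional totals per GPU, and each GPU has one NIC port.
NVLINK_GB_PER_S = {"Ampere": 600, "Hopper": 900, "Blackwell": 1800}
IB_GBIT_PER_S = {"Ampere": 200, "Hopper": 400, "Blackwell": 800}  # HDR, NDR, XDR


def scale_out_gb_per_s(gbit_per_direction: float) -> float:
    """Bidirectional GB/s of one IB port: two directions, 8 bits per byte."""
    return 2 * gbit_per_direction / 8


def scale_up_ratio(generation: str) -> float:
    """How many times faster the NVLink domain is than one IB port."""
    ib = scale_out_gb_per_s(IB_GBIT_PER_S[generation])
    return NVLINK_GB_PER_S[generation] / ib


for gen in NVLINK_GB_PER_S:
    print(f"{gen}: NVLink is about {scale_up_ratio(gen):.0f}x one IB port")
```

Under these assumptions the gap is roughly an order of magnitude in every generation, which is why collectives that leave the NVLink domain are the ones the fabric design has to protect.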

NVIDIA Quantum InfiniBand (current)

  • Quantum-2 is the NDR (400 Gb/s) generation, still used for Hopper-era and storage fabrics.
  • Quantum-X800 is the XDR (800 Gb/s) generation, built on the Quantum-3 ASIC and 200 Gb/s-per-lane SerDes, for Blackwell-scale clusters.
  • Q3400-RA (4U): 144 ports of 800 Gb/s across 72 twin-port OSFP cages. High radix supports a two-level fat-tree connecting up to ~10,368 NICs. No NDR backward compatibility.
  • Q3200-RA (2U): two switches in one enclosure, 36x 800G each (72 total), supports NDR backward compatibility, so it is the bridge to existing NDR infrastructure.
  • Q3450-LD: silicon-photonics / co-packaged optics variant, MPO direct connect, lower insertion loss.
  • In-network features: SHARP v4 (in-network reduction for collectives), Adaptive Routing (AR), telemetry-based congestion control, SHIELD self-healing. Sub-100 ns port-to-port latency.
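
The ~10,368 figure for the Q3400-RA follows from simple fat-tree arithmetic. The sketch below assumes a non-blocking two-level fat-tree built from identical switches, with each leaf splitting its ports evenly between endpoints and spines and one link per leaf-spine pair; the function name is illustrative.

```python
def two_level_fat_tree_endpoints(radix: int) -> int:
    """Maximum endpoints of a non-blocking two-level fat-tree.

    Each leaf uses radix/2 ports down (endpoints) and radix/2 ports up (spines).
    Each spine port reaches one leaf, so there can be at most `radix` leaves.
    """
    if radix <= 0 or radix % 2:
        raise ValueError("radix must be a positive even number")
    leaves = radix
    endpoints_per_leaf = radix // 2
    return leaves * endpoints_per_leaf


print(two_level_fat_tree_endpoints(144))  # 10368
```

Anything beyond that count needs a third (super-spine) tier or an oversubscribed leaf layer.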

Adapters

  • ConnectX-8 SuperNIC: 800 Gb/s XDR InfiniBand or 2x 400 GbE through one OSFP port. PCIe Gen6 x16, integrated PCIe switch (removes the external PCIe switch the GB200/CX7 generation needed).
  • ConnectX-7: 400 Gb/s NDR, the most widely installed NIC in existing clusters. Mixed CX7/CX8 estates are common.

Subnet manager and fabric management

  • The subnet manager (SM) is the core IB controller: network initialisation, routing, partitioning (PKeys). Runs on a managed switch or on a Linux host with OFED drivers. OpenSM is the open implementation.
  • UFM (Unified Fabric Manager) is NVIDIA's management platform: provisioning, real-time monitoring, diagnostics, troubleshooting. The SM is one component of UFM. Variants: Telemetry, Enterprise, Cyber-AI (anomaly detection). Quantum-X800 switches carry a dedicated OSFP in-band management port for UFM, separated from data ports.
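
A fabric should have exactly one master SM, and under the InfiniBand election rule that master is the SM with the highest priority, with ties broken by the lowest GUID. The sketch below applies that rule to a list of SM records. The record type and its fields are hypothetical: in practice the data would be collected from the fabric (for example via UFM or the diagnostic tools), and this code does not parse any real tool output.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SubnetManager:  # hypothetical record, not a real tool's output format
    guid: int
    priority: int  # 0..15, higher wins the election
    state: str     # e.g. "MASTER" or "STANDBY"


def expected_master(sms: list[SubnetManager]) -> SubnetManager:
    """Highest priority wins; ties go to the lowest GUID."""
    return min(sms, key=lambda sm: (-sm.priority, sm.guid))


def check_sm_health(sms: list[SubnetManager]) -> list[str]:
    """Return a list of problems; an empty list means the SM set looks sane."""
    if not sms:
        return ["no subnet manager found"]
    masters = [sm for sm in sms if sm.state == "MASTER"]
    if len(masters) != 1:
        return [f"expected exactly one MASTER, found {len(masters)}"]
    if masters[0] != expected_master(sms):
        return ["MASTER is not the highest-priority SM"]
    return []
```

An unexpected master is not always a fault (a handover may be in progress), so treat the second finding as a prompt to inspect priorities rather than as proof of misconfiguration.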

Ethernet alternative

  • Spectrum-X (Spectrum-4) is the Ethernet path for AI, with RoCEv2 for RDMA. Rule of thumb: InfiniBand for training/HPC where GPU-to-GPU latency dominates; Ethernet for inference and multi-tenant cloud. Above ~10,000 GPUs a super-spine tier is added.
  • Skyway is the IB-to-Ethernet gateway appliance.

Topology

  • Fat-tree (leaf/spine) is the standard non-blocking topology. Two-level fat-tree scales to thousands of NICs; a third (super-spine) tier extends to tens of thousands.
  • Rail-aligned design: in a SuperPOD, each GPU rail (the NIC at the same index on every node) connects to a common leaf switch, so same-rail NICs within a Scalable Unit (SU, 72 nodes) are one switch hop apart; inter-SU and inter-rail traffic crosses the spine. See the Blackwell platform for SU structure.
  • A SuperPOD runs four separate fabrics: compute (XDR), storage (often NDR), in-band management, and out-of-band management (SN2201 switches, OOB carried as a VXLAN).
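
The rail-aligned rule can be expressed as a small placement model. This is a sketch under stated assumptions: one leaf switch per rail per SU, eight rails per node, and nodes numbered contiguously SU by SU. None of these names come from NVIDIA tooling.

```python
from dataclasses import dataclass

SU_NODES = 72  # Scalable Unit size quoted in the text


@dataclass(frozen=True)
class Endpoint:
    node: int  # zero-based node index, numbered SU by SU (assumption)
    rail: int  # NIC/GPU index within the node, 0..7 (assumption)


def leaf_of(ep: Endpoint) -> tuple[int, int]:
    """Identify the leaf as (SU index, rail): one leaf per rail per SU (assumption)."""
    return (ep.node // SU_NODES, ep.rail)


def crosses_spine(a: Endpoint, b: Endpoint) -> bool:
    """True when traffic between a and b must leave the shared leaf."""
    return leaf_of(a) != leaf_of(b)


print(crosses_spine(Endpoint(0, 3), Endpoint(71, 3)))  # same SU, same rail
print(crosses_spine(Endpoint(0, 3), Endpoint(72, 3)))  # next SU, same rail
```

A model like this is useful when choosing node pairs for validation runs, so that both leaf-local and spine-crossing paths are exercised.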

Fabric validation (what "validated" means)

  • Physical: every link up at the expected width and speed, no degraded lanes, transceiver and cable health clean.
  • Routing: SM has converged, routing tables correct, no isolated nodes, partitioning correct.
  • Performance: bandwidth at line rate and latency within spec across representative paths; congestion control and AR behaving under load.
  • Tooling to know: ibstat, ibstatus, iblinkinfo, ibdiagnet, perfquery, plus UFM dashboards and nccl-tests (all_reduce, all_gather) as the application-level proof.
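
nccl-tests reports both algorithm bandwidth (bytes moved divided by time) and bus bandwidth, which rescales by a per-collective factor so the number can be compared with hardware line rate regardless of rank count. The sketch below reproduces that conversion for all_reduce and all_gather and expresses the result as a fraction of a port's per-direction line rate. The function names are illustrative, and the comparison against NIC line rate only makes sense for multi-node runs whose traffic actually crosses the fabric.

```python
# Bus-bandwidth correction factors as documented by nccl-tests.
BUS_FACTOR = {
    "all_reduce": lambda n: 2 * (n - 1) / n,
    "all_gather": lambda n: (n - 1) / n,
}


def bus_bandwidth_gb_s(op: str, size_bytes: float, seconds: float, ranks: int) -> float:
    """Convert a measured collective into bus bandwidth in GB/s (1 GB = 1e9 bytes)."""
    if ranks < 2:
        raise ValueError("a collective needs at least two ranks")
    alg_bw = size_bytes / seconds / 1e9
    return alg_bw * BUS_FACTOR[op](ranks)


def fraction_of_line_rate(bus_bw_gb_s: float, line_rate_gbit_s: float) -> float:
    """Compare bus bandwidth with a per-direction line rate given in Gb/s."""
    return bus_bw_gb_s / (line_rate_gbit_s / 8)


# Illustrative numbers only: 8 GB all_reduce across 16 ranks in 0.16 s on XDR.
bw = bus_bandwidth_gb_s("all_reduce", 8e9, 0.16, 16)
print(f"busbw {bw:.2f} GB/s, {fraction_of_line_rate(bw, 800):.1%} of line rate")
```

What fraction counts as a pass is a site decision; the point is to fix the threshold before the run rather than after seeing the result.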

Don't-miss checklist

  • Confirm SM is running and singular (no duplicate SMs fighting).
  • Walk the full link inventory for width/speed downgrades before trusting any benchmark.
  • Run ibdiagnet and clear all errors before application tests.
  • Prove it with nccl-tests at realistic message sizes, not just point-to-point.
  • Confirm cable reach and media type match the run length (copper vs MMF vs SMF).
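
The link-inventory walk is easy to automate once link state has been collected. The sketch below flags any link running below the expected lane count or per-lane speed, using the XDR values from this page (four lanes of 200 Gb/s for an 800 Gb/s port) as defaults. The record type is hypothetical; populate it from whatever inventory source the site uses.

```python
from typing import NamedTuple


class Link(NamedTuple):  # hypothetical inventory record
    port: str        # human-readable port label
    width: int       # active lane count
    lane_gbps: int   # active speed per lane, in Gb/s


def downgraded(links: list[Link], expected_width: int = 4,
               expected_lane_gbps: int = 200) -> list[Link]:
    """Return links running below the expected width or per-lane speed."""
    return [
        link for link in links
        if link.width < expected_width or link.lane_gbps < expected_lane_gbps
    ]


inventory = [
    Link("leaf1/1", 4, 200),  # healthy XDR link
    Link("leaf1/2", 2, 200),  # width downgrade
    Link("leaf1/3", 4, 100),  # speed downgrade (NDR lane rate)
]
for link in downgraded(inventory):
    print(f"{link.port}: {link.width}x at {link.lane_gbps} Gb/s per lane")
```

Any non-empty result should block benchmarking, since a single degraded link can depress collective bandwidth across the whole job.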

Failure modes

  • Optics/form-factor mismatch (see BOM validation): IHS (integrated heat sink, finned-top) OSFP modules for switch cages and RHS (riding heat sink, flat-top) modules for NIC cages are not interchangeable; NDR and XDR OSFP modules are not interchangeable despite the shared form factor.
  • Duplicate or mis-prioritised subnet managers causing routing churn.
  • Under-provisioned spine causing blocking on inter-SU collectives.
  • Mixed CX7/CX8 estate negotiating down to NDR unexpectedly.
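
An under-provisioned spine shows up as an oversubscription ratio above 1 at the leaf. The sketch below computes that ratio from a leaf's port split. It is a simplification that assumes uniform link speeds within each direction and ignores routing effects; the function name and the example port splits are illustrative.

```python
def oversubscription(downlinks: int, uplinks: int,
                     down_gbps: float = 800, up_gbps: float = 800) -> float:
    """Ratio of endpoint-facing to spine-facing bandwidth on one leaf.

    1.0 means non-blocking; above 1.0 the spine can block under full load.
    """
    if downlinks <= 0 or uplinks <= 0:
        raise ValueError("link counts must be positive")
    return (downlinks * down_gbps) / (uplinks * up_gbps)


print(oversubscription(72, 72))  # even split of a 144-port leaf: 1.0
print(oversubscription(96, 48))  # more endpoints than uplinks: 2.0
```

A ratio above 1.0 is sometimes a deliberate cost trade-off, but it should be a recorded design decision, not something discovered during inter-SU collective tests.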

Open questions & validation

  • IB-specific tooling. Validate OpenSM/UFM, ibdiagnet, and partitioning in a test fabric before relying on them; RoCE/Ethernet estates use a different toolchain, with NCCL-over-RoCE tuning covered in performance tuning. Proving a fabric healthy at line rate is the highest-impact networking skill.
  • Spectrum-X vs Quantum-X800 selection: InfiniBand for latency-bound training, Ethernet/RoCE for multi-tenant inference and cloud, applied per deployment.

References

  • NVIDIA Quantum-X800 platform: https://www.nvidia.com/en-us/networking/products/infiniband/quantum-x800/
  • NVIDIA InfiniBand switching overview (UFM, Q3200 bridging): https://www.nvidia.com/en-us/networking/infiniband-switching/
  • DGX SuperPOD B300 / XDR network fabrics reference: https://docs.nvidia.com/dgx-superpod/reference-architecture/scalable-infrastructure-b300-xdr/latest/network-fabrics.html
  • XDR switch hardware manual: https://docs.nvidia.com/networking/display/XDRSwitchesHWUM/Introduction
  • Transceiver compatibility (IHS/RHS, NDR/XDR): https://www.vitextech.com/blogs/blog/800g-transceiver-compatibility-for-nvidia-platforms-spectrum-4-quantum-x800-connectx-8-and-bluefield-3
  • NVIDIA NVLink and NVLink Switch (generation bandwidth, NVSwitch, domains): https://www.nvidia.com/en-us/data-center/nvlink/
  • NVIDIA H100 (4th-gen NVLink 900 GB/s, NDR pairing): https://www.nvidia.com/en-us/data-center/h100/
  • NVIDIA A100 (3rd-gen NVLink 600 GB/s): https://www.nvidia.com/en-us/data-center/a100/
  • NVIDIA GB200 NVL72 (5th-gen NVLink 1.8 TB/s, NVLink Switch System): https://www.nvidia.com/en-us/data-center/gb200-nvl72/
  • NVIDIA GPUDirect RDMA (Tesla/Quadro support, ACS/P2P): https://docs.nvidia.com/cuda/gpudirect-rdma/index.html

Related: BOM · Platform · Performance · Overlay & Mesh Networking · Glossary