NVSwitch & NVLink
Scope: the GPU-to-GPU interconnect (NVLink links, NVSwitch ASICs, the Fabric Manager that fuses them into one all-to-all domain) across both an 8-GPU baseboard and a rack-scale NVL72, and how to validate it.
```mermaid
flowchart TB
    subgraph DOMAIN["NVLink domain (8-GPU HGX baseboard)"]
        G0["GPU 0<br/>18 NVLink ports"]
        G1["GPU 1<br/>18 NVLink ports"]
        G7["GPU 7<br/>18 NVLink ports"]
        SW0["NVSwitch 0<br/>(crossbar ASIC)"]
        SW1["NVSwitch 1<br/>(crossbar ASIC)"]
        SW2["NVSwitch 2<br/>(crossbar ASIC)"]
        SW3["NVSwitch 3<br/>(crossbar ASIC)"]
        G0 --- SW0 & SW1 & SW2 & SW3
        G1 --- SW0 & SW1 & SW2 & SW3
        G7 --- SW0 & SW1 & SW2 & SW3
    end
    FM["Fabric Manager (nv-fabricmanager):<br/>trains links, assigns Clique ID"] -.-> DOMAIN
```
Reference templates, drawn from the NVIDIA nvidia-smi manual, the Fabric Manager and MNNVL user guides, and the NVIDIA/nvbandwidth repo. Nothing here was executed on hardware. Pin every version against your driver/firmware release, substitute real GPU indices and PCI IDs, and validate on one GPU pair before trusting a fleet. Bandwidth numbers are vendor aggregates or line-rate ceilings; achieved bandwidth is always lower.
What it is
NVLink is NVIDIA's point-to-point GPU interconnect: a mesh of high-speed links that lets GPUs read and write each other's HBM directly, bypassing PCIe and the CPU. NVSwitch is the switch ASIC that turns those point-to-point links into a non-blocking, all-to-all fabric: every GPU reaches every other GPU at full NVLink bandwidth instead of being limited to a fixed number of direct neighbours. The set of GPUs that can talk over NVLink is an NVLink domain (NVIDIA's term in `nvidia-smi`: a clique).[^9]
Per-GPU aggregate bandwidth by generation (NVIDIA prints the bidirectional aggregate; per-direction is half):
| NVLink gen | Architecture | Links/GPU | Per-GPU aggregate |
|---|---|---|---|
| 3rd | Ampere / A100 | 12 | 600 GB/s[^1] |
| 4th | Hopper / H100 | 18 | 900 GB/s[^2] |
| 5th | Blackwell / B200, GB200 | 18 | 1,800 GB/s[^2] |
The 5th-gen doubling comes from signaling, not link count: each of the same 18 links runs at twice the rate of its 4th-gen counterpart (100 GB/s versus 50 GB/s bidirectional per link), so NVLink 5 is 18 links x 100 GB/s bidirectional = 1.8 TB/s per GPU.[^2][^4] Note that `nvidia-smi nvlink --status` reports a per-link figure (historically the per-direction half), so on a Blackwell node the per-link numbers multiply up to roughly the per-direction total (18 x ~50 GB/s, about 900 GB/s), not to the 1.8 TB/s headline; see Validated usage. Do not equate the per-link `--status` number with the marketing aggregate.
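The arithmetic behind the table can be checked in a few lines. This is a minimal sketch: the per-link figures are derived by dividing the published aggregate by the link count (they are not separately quoted here), and the dictionary and function names are illustrative.

```python
# (links per GPU, bidirectional aggregate in GB/s), copied from the table above.
NVLINK_GENERATIONS = {
    "3rd (A100)": (12, 600),
    "4th (H100)": (18, 900),
    "5th (B200/GB200)": (18, 1800),
}


def per_link_bandwidth(links: int, aggregate_gbs: float) -> tuple[float, float]:
    """Return (bidirectional, per-direction) GB/s for a single link."""
    bidirectional = aggregate_gbs / links
    return bidirectional, bidirectional / 2


for name, (links, aggregate) in NVLINK_GENERATIONS.items():
    bidir, unidir = per_link_bandwidth(links, aggregate)
    print(f"{name}: {bidir:.0f} GB/s per link bidirectional, {unidir:.0f} GB/s per direction")
```

Read this way, the 3rd-to-4th-gen step adds links at the same per-link rate, while the 4th-to-5th-gen step keeps 18 links and doubles the rate of each.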
Two domain scales matter operationally:
- Intra-node (8-GPU baseboard). An HGX/DGX H100 8-GPU board carries four third-generation NVSwitch chips; any H100 reaches any other at 900 GB/s, giving 3.6 TB/s bisection bandwidth and 450 GB/s for in-network (SHARP) reductions.[^5] Blackwell HGX B200/B300 boards follow the same pattern with 5th-gen NVLink. The domain is the single server; Fabric Manager runs locally on that host.
- Rack-scale (NVL72). GB200/GB300 NVL72 is a single NVLink domain spanning 72 Blackwell GPUs (paired with 36 Grace CPUs as GB200 Superchips: 1 Grace + 2 Blackwell each), wired through external NVLink Switch trays into "the largest NVLink domain ever offered": 130 TB/s of aggregate GPU-to-GPU bandwidth, 1.8 TB/s per GPU.[^3] Each GPU's 18 links fan out across nine NVLink switch trays.[^6] A multi-node domain like this adds IMEX (Internode Memory Exchange) on top of Fabric Manager to coordinate the cross-node memory fabric.[^11]
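The domain-level figures quoted for both scales follow from the per-GPU aggregate. The sketch below shows one way to reproduce them; the formulas are an assumption about how the published numbers are derived (half the GPUs driving the other half for bisection, a plain sum for the rack aggregate), and the function names are illustrative.

```python
def bisection_tbs(gpus: int, per_gpu_aggregate_gbs: float) -> float:
    """Bisection bandwidth in TB/s: half the GPUs exchanging with the other half."""
    return (gpus / 2) * per_gpu_aggregate_gbs / 1000


def domain_aggregate_tbs(gpus: int, per_gpu_aggregate_gbs: float) -> float:
    """Sum of per-GPU aggregate bandwidth across the whole domain, in TB/s."""
    return gpus * per_gpu_aggregate_gbs / 1000


print(f"HGX H100 bisection:   {bisection_tbs(8, 900):.1f} TB/s")
print(f"GB200 NVL72 aggregate: {domain_aggregate_tbs(72, 1800):.1f} TB/s")
```

The second value lands slightly under the published 130 TB/s, which is consistent with the vendor figure being rounded.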
For the per-architecture numbers see GPU generations and the Blackwell platform page; this page is the interconnect and validation reference they point to.
Why it's needed (and when)
PCIe is the wrong fabric for tensor-parallel and expert-parallel traffic. A PCIe Gen5 x16 link is ~64 GB/s per direction (~128 GB/s bidirectional); NVLink 5 is 1,800 GB/s bidirectional aggregate per GPU, roughly 14x on a like-for-like basis, and, crucially, it is switched all-to-all rather than a tree through the CPU root complex. Collective primitives (all-reduce, all-to-all, reduce-scatter) that dominate large-model training and MoE inference are bandwidth-bound on exactly this fabric.
You need NVSwitch/NVLink when:
- Tensor parallelism splits a layer across GPUs every forward/backward step, putting the intra-node interconnect on the critical path of every token.
- Expert parallelism / MoE routes tokens all-to-all across GPUs; NVLink all-to-all is what keeps it from collapsing to PCIe speeds.
- KV-cache or weight sharding for large-model inference spreads state across a domain; NVLink read latency to a peer's HBM is what makes that viable.
- All-reduce-heavy collectives can use NVLink SHARP (NVLS), which offloads reductions into the switch and cuts the data moved.[^12]
You do not get NVLink on every card: L40S, RTX PRO 6000, and GeForce have no NVLink, so multi-GPU there is PCIe-only.[^13] Choosing SXM (socketed, NVLink/NVSwitch) versus PCIe form factor is therefore an interconnect decision, not just a power/cooling one (see GPU software stack).
Rule of thumb: if a job spans more than one GPU and is not embarrassingly parallel, the NVLink domain is the first thing to size and the first thing to validate. Inter-node scaling beyond the domain falls to InfiniBand/RoCE (networking fabric); the boundary between "NVLink domain" and "network" is the single most important topology fact for placement.
How it's installed & managed
NVLink links and the NVSwitch hardware are driven by the standard GPU driver stack (CUDA Driver, NVIDIA Kernel Modules). What turns the switches into a domain is Fabric Manager: a userspace daemon (`nv-fabricmanager`) that trains the NVSwitch-to-NVSwitch and NVSwitch-to-GPU links and registers each GPU into the fabric.[^14] On any NVSwitch system it must be installed and running, and it is lockstep-versioned with the driver: the package version has to match the loaded driver, and FM aborts on a mismatch.[^7]
Install the package that matches your driver branch (Ubuntu/Debian example; reference template, not hardware-tested):
```bash
# Fabric Manager must match the installed driver branch exactly.
nvidia-smi --query-gpu=driver_version --format=csv,noheader   # read the loaded driver, e.g. 570.xx
sudo apt-get install -y nvidia-fabricmanager-570              # match the major branch to the driver
sudo systemctl enable --now nvidia-fabricmanager              # start now and on boot
systemctl status nvidia-fabricmanager                         # expect: active (running)
```
Minimum driver branches per platform: HGX A100 needs 450.xx+, HGX H100 needs 525.xx+, HGX B200/B300/B100 needs 570.xx+.[^7] Operating mode is set by `FABRIC_MODE` in the config: 0 bare-metal / full passthrough (default), 1 Shared NVSwitch multi-tenancy, 2 vGPU multi-tenancy.[^7]
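The lockstep rule and the minimum-branch table lend themselves to a pre-flight check. The sketch below is illustrative only: it assumes you have already obtained the driver and Fabric Manager versions as bare strings (any distro package suffix stripped), and the function names and the version strings in the usage lines are made up.

```python
# Minimum driver branch per platform, as listed above.
MIN_DRIVER_BRANCH = {"HGX A100": 450, "HGX H100": 525, "HGX B200/B300/B100": 570}


def branch(version: str) -> int:
    """Major branch of a dotted version string, e.g. '570.10.01' -> 570."""
    return int(version.split(".")[0])


def check_fabric_manager(platform: str, driver: str, fabric_manager: str) -> list[str]:
    """Return a list of problems; an empty list means the pairing looks sane."""
    problems = []
    minimum = MIN_DRIVER_BRANCH[platform]
    if branch(driver) < minimum:
        problems.append(f"driver {driver} is below the {minimum}.xx minimum for {platform}")
    if driver != fabric_manager:
        problems.append(f"Fabric Manager {fabric_manager} does not match driver {driver}")
    return problems


# Hypothetical version strings, for illustration only.
print(check_fabric_manager("HGX B200/B300/B100", "570.10.01", "570.10.01"))
print(check_fabric_manager("HGX H100", "525.10.01", "525.20.02"))
```

Comparing the full version string rather than only the branch is the stricter reading of "lockstep"; relax it only if the Fabric Manager guide for your release says a branch match is enough.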
Where it lives and how to inspect it:
```bash
# Service logs (start failures, link training errors, driver-version mismatch)
sudo journalctl -u nvidia-fabricmanager --no-pager | tail -n 50
sudo tail -n 100 /var/log/fabricmanager.log   # FM's own log file

# Config (FABRIC_MODE, bind interface, log level)
sudo grep -vE '^\s*#|^\s*$' /usr/share/nvidia/nvswitch/fabricmanager.cfg
```
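If you want to read the operating mode programmatically rather than by eye, a small parser is enough. This sketch assumes the config is plain `KEY=VALUE` lines with `#` comments, which is what the grep filter above implies; the sample text is an illustrative excerpt, not a complete config, and the helper name is hypothetical.

```python
# FABRIC_MODE values as described above.
FABRIC_MODES = {
    0: "bare-metal / full passthrough",
    1: "Shared NVSwitch multi-tenancy",
    2: "vGPU multi-tenancy",
}


def parse_fm_config(text: str) -> dict[str, str]:
    """Parse KEY=VALUE lines, skipping blank lines and '#' comments."""
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        settings[key.strip()] = value.strip()
    return settings


# Illustrative excerpt only.
SAMPLE = """\
# Fabric Manager operating mode
FABRIC_MODE=0
"""

settings = parse_fm_config(SAMPLE)
mode = int(settings.get("FABRIC_MODE", "0"))  # 0 is the documented default
print(f"FABRIC_MODE={mode}: {FABRIC_MODES[mode]}")
```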
On a healthy node the GPUs register and report a completed fabric handshake. FM start order matters: it must come up before any CUDA process, or GPUs report a fabric state of `Not Started` and peer access fails. On GPU-Operator deployments FM runs as a child daemon inside the driver container rather than as a host systemd unit, but the state model is identical.[^7]
For NVL72 racks, most of this is owned by the platform manager (NVIDIA Mission Control / Base Command Manager) and IMEX/NMX rather than run by hand; you verify, you do not bootstrap.[^6] Treat the by-hand commands above as the 8-GPU-node case and the diagnostic commands below as universal.
Detailed FM operations and the recovery flow live on Fabric Manager; the rack bring-up sequence is Fabric Bring-Up, Validation and Benchmarking.
Validated usage & tests
Validate in order: links up → fabric registered → bandwidth at expectation. Do not run a bandwidth test before the fabric handshake is Completed, or a fabric fault gets misread as a slow GPU.
1. Link state and per-link speed. `--status` shows each link and, if active, its bandwidth.[^8]
```bash
nvidia-smi nvlink --status -i 0         # links + speed for GPU 0
nvidia-smi nvlink --capabilities -i 0   # P2P / system-memory / atomics support per link
```
Expect every populated link to read active with a per-link bandwidth printed (the per-direction figure, e.g. ~25 GB/s on Hopper, ~50 GB/s on the Blackwell switch fabric). Any link showing `<inactive>` means the driver cannot reach a peer over that link; on an NVL72 the MNNVL guide's bar is explicit: all 18 links up, none inactive.[^6] Count matters: a Hopper or Blackwell GPU should show 18 links; a short count means a degraded or untrained link.
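Counting links by eye does not scale past a handful of GPUs. The sketch below tallies active and inactive links from captured `--status` text for a single GPU. The line layout it matches (`Link N: <value>`) and the sample text are assumptions about the output format, so confirm them against your driver's actual output before relying on the result; the function name is hypothetical.

```python
import re

# Matches lines such as "Link 3: 26.562 GB/s" or "Link 3: <inactive>" (assumed layout).
LINK_RE = re.compile(r"^\s*Link\s+(\d+):\s*(.+?)\s*$")


def summarize_links(status_text: str, expected_links: int = 18) -> dict:
    """Count active and inactive links in one GPU's --status output."""
    active, inactive = [], []
    for line in status_text.splitlines():
        match = LINK_RE.match(line)
        if match is None:
            continue  # GPU header or unrelated line
        link_id, value = int(match.group(1)), match.group(2)
        if "inactive" in value.lower():
            inactive.append(link_id)
        else:
            active.append(link_id)
    return {
        "active": len(active),
        "inactive": inactive,
        "healthy": len(active) == expected_links and not inactive,
    }


# Illustrative input only: two links shown, one of them down.
SAMPLE = """\
GPU 0: Example GPU (UUID: GPU-00000000-0000-0000-0000-000000000000)
\t Link 0: 26.562 GB/s
\t Link 1: <inactive>
"""
print(summarize_links(SAMPLE))
```

Pass `expected_links=12` for an A100, per the generation table.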
2. Fabric registration (NVSwitch systems). The GPU must have completed its handshake with Fabric Manager before peer traffic works; query it with `nvidia-smi -q | grep 'Fabric' -A 4`.[^6][^9]
Expect State: Completed and Status: Success on every GPU. In Progress means FM is still training; Not Started means FM is not running or the GPU has not registered (check the service and `/var/log/fabricmanager.log`). The Clique ID identifies which NVLink domain the GPU belongs to; GPUs that must communicate over NVLink have to share a clique.[^9]
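The same check can be automated over captured `nvidia-smi -q` text. This is a sketch under assumptions: it treats each line reading exactly `Fabric` as the start of one GPU's block and collects the `key : value` lines that follow, and the sample text (including the second block's `Status` value) is illustrative rather than real output. Function names are hypothetical.

```python
def fabric_blocks(query_text: str) -> list[dict[str, str]]:
    """Collect 'key : value' fields following each 'Fabric' header, one dict per GPU."""
    blocks = []
    for line in query_text.splitlines():
        stripped = line.strip()
        if stripped == "Fabric":
            blocks.append({})
        elif blocks and ":" in stripped:
            key, value = stripped.split(":", 1)
            blocks[-1][key.strip()] = value.strip()
    return blocks


def all_registered(blocks: list[dict[str, str]]) -> bool:
    """True only if at least one GPU was found and every GPU is Completed / Success."""
    return bool(blocks) and all(
        b.get("State") == "Completed" and b.get("Status") == "Success" for b in blocks
    )


# Illustrative input only: the second GPU has not finished its handshake.
SAMPLE = """\
    Fabric
        State                   : Completed
        Status                  : Success
--
    Fabric
        State                   : In Progress
        Status                  : N/A
"""
blocks = fabric_blocks(SAMPLE)
print(len(blocks), "GPUs parsed; all registered:", all_registered(blocks))
```

Returning `False` on an empty parse is deliberate: no Fabric blocks at all should block a bandwidth run, not wave it through.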
3. Throughput counters (live traffic). Counters are cumulative; sample twice and difference to get a rate. Use `-gt` (the older `-g` counter-control path is deprecated[^8]):
```bash
# d = tx/rx data payload in KiB; r = payload + protocol overhead in KiB (if supported)
nvidia-smi nvlink -gt d -i 0   # data-payload counters for GPU 0
nvidia-smi nvlink -gt r -i 0   # payload + overhead
```
Expect monotonically increasing KiB totals per link under load; flat counters on a link carrying traffic point at a misrouted or down link. These are counters, not a benchmark; for an actual bandwidth figure use nvbandwidth.
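Turning two counter samples into a rate is simple arithmetic, sketched below. The function name is illustrative and the sample values are made up; the only facts taken from the text are that counters are cumulative and reported in KiB.

```python
def link_rate_gbs(kib_before: int, kib_after: int, interval_s: float) -> float:
    """Average rate in GB/s (10^9 bytes/s) between two cumulative KiB samples."""
    if interval_s <= 0:
        raise ValueError("interval must be positive")
    if kib_after < kib_before:
        raise ValueError("counter went backwards (reset or wrap); resample")
    return (kib_after - kib_before) * 1024 / interval_s / 1e9


# Made-up samples taken 2 seconds apart on one link.
print(f"{link_rate_gbs(1_000_000, 49_828_125, 2.0):.1f} GB/s")
```

Note the unit change: the counters are binary KiB while bandwidth figures elsewhere on this page are decimal GB/s, so the factor of 1024 matters.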
4. Achieved bandwidth, nvbandwidth. NVIDIA's NVIDIA/nvbandwidth measures device-to-device NVLink bandwidth directly. Build it from source (reference template, not hardware-tested):[^10]
```bash
# Prereqs: CUDA 11.x+ (multinode needs 12.3+), Boost program_options, CMake 3.20+, GCC 7.x+ (C++17).
git clone https://github.com/NVIDIA/nvbandwidth.git
cd nvbandwidth
cmake .   # add -DMULTINODE=1 for a multi-node (NVL72-class) build
make

./nvbandwidth -l                                     # list all testcases
./nvbandwidth -t device_to_device_memcpy_read_ce     # P2P read over NVLink, copy-engine
./nvbandwidth -t device_to_device_memcpy_write_ce    # P2P write over NVLink, copy-engine
./nvbandwidth                                        # run the full default set
```
Expect a per-GPU-pair matrix of GB/s. Read it relatively: every NVSwitch-connected pair should land in the same band (a non-blocking fabric is symmetric), and that band should approach, never reach, the generation's per-direction ceiling, with protocol overhead taking the rest. An asymmetric matrix where one pair is far below the others is the signature of a degraded link or a GPU that landed in the wrong clique. Do not chase a specific headline number; chase symmetry and the absence of outliers, then compare against your own commissioning baseline (Fabric Bring-Up, Validation and Benchmarking).
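"Chase symmetry" can be made mechanical once the matrix is in memory. The sketch below flags GPU pairs that fall well below the median of all off-diagonal entries. The 10% tolerance is an arbitrary starting point, not a vendor threshold, the matrix values are invented, and getting the matrix out of nvbandwidth's output is left to you.

```python
from statistics import median


def find_outlier_pairs(matrix, tolerance=0.10):
    """Return (src, dst, GB/s) for off-diagonal pairs more than `tolerance` below the median."""
    pairs = [
        (i, j, bw)
        for i, row in enumerate(matrix)
        for j, bw in enumerate(row)
        if i != j and bw is not None
    ]
    if not pairs:
        return []
    floor = median(bw for _, _, bw in pairs) * (1 - tolerance)
    return [(i, j, bw) for i, j, bw in pairs if bw < floor]


# Invented 4-GPU matrix in GB/s; None marks the diagonal.
MATRIX = [
    [None, 400.0, 401.0, 399.0],
    [400.0, None, 400.0, 210.0],
    [402.0, 400.0, None, 400.0],
    [399.0, 212.0, 400.0, None],
]
for src, dst, bw in find_outlier_pairs(MATRIX):
    print(f"GPU {src} -> GPU {dst}: {bw} GB/s is below the symmetric band")
```

Using the median rather than the mean keeps a single bad pair from dragging the reference band down with it.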
The end-to-end fabric procedure (links, subnet manager, P2P, GPUDirect RDMA, NCCL collectives, then NVLink) is Fabric Bring-Up, Validation and Benchmarking. Day-to-day flags are in nvidia-smi Reference.
Failure modes
- Fabric Manager won't start / version mismatch. FM checks the loaded driver at init and aborts if incompatible; the node comes up with no NVLink domain.[^7] Symptom: `nvidia-fabricmanager` inactive, GPUs `Not Started`. Runbook: Fabric Manager Failure.
- Links inactive / short link count. One or more links `<inactive>`, or fewer than 18 links on a Hopper/Blackwell GPU, so peer access silently degrades to PCIe or fails outright.[^6] Triage with `nvidia-smi nvlink --status`; correlate with NVSwitch SXID errors in `dmesg` (the NVSwitch analogue of GPU XIDs).[^15] See reliability & RAS.
- Stale / split clique. A GPU registers into the wrong clique or a domain partitions after a transient, so peer traffic that should be NVLink falls back or hangs. Check `Clique ID` consistency across the domain via `nvidia-smi -q | grep 'Fabric' -A 4`. Runbook: Fabric Manager Failure.
- NVL72 multi-node domain gaps. On rack-scale systems IMEX/NMX coordinates the cross-node memory fabric; a control-plane fault leaves intra-node NVLink fine but cross-node peer access broken.[^6] This is Mission Control's domain; verify with the MNNVL checks above before escalating.
These are interconnect faults; distinguish them from a sick GPU before opening a hardware RMA. The fastest discriminator: a GPU that is healthy in compute but isolated on NVLink is a fabric problem, not a GPU problem.
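For the stale / split clique case, the consistency check reduces to grouping GPUs by clique and confirming there is exactly one group. A minimal sketch, assuming you have already collected each GPU's Clique ID into a mapping; the IDs and function name below are invented.

```python
from collections import defaultdict


def group_by_clique(clique_by_gpu: dict[int, str]) -> dict[str, list[int]]:
    """Group GPU indices by the Clique ID each one reports."""
    groups = defaultdict(list)
    for gpu, clique in sorted(clique_by_gpu.items()):
        groups[clique].append(gpu)
    return dict(groups)


# Invented IDs: GPU 5 has landed in a different clique from its neighbours.
REPORTED = {0: "100", 1: "100", 2: "100", 3: "100", 4: "100", 5: "101", 6: "100", 7: "100"}

groups = group_by_clique(REPORTED)
if len(groups) == 1:
    print("single NVLink domain")
else:
    for clique, gpus in groups.items():
        print(f"clique {clique}: GPUs {gpus}")
```

A GPU that sits alone in its own group is the one to inspect first; it is usually the one that re-registered after the transient.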
References
- NVIDIA System Management Interface (nvidia-smi) manual (`nvlink` subcommand, `-s/--status`, `-c/--capabilities`, `-gt/--getthroughput` (`d`/`r`), `-R`, `-i`, `-l`): https://docs.nvidia.com/deploy/nvidia-smi/index.html
- NVIDIA Fabric Manager User Guide (service, fabric state, FABRIC_MODE, driver-version lockstep): https://docs.nvidia.com/datacenter/tesla/fabric-manager-user-guide/index.html
- NVIDIA MNNVL (multi-node NVLink) User Guide (verifying NVL72 links and fabric state): https://docs.nvidia.com/multi-node-nvlink-systems/mnnvl-user-guide/verifying.html
- NVIDIA NVLink & NVLink Switch product page (per-generation aggregate bandwidth, NVL72 130 TB/s): https://www.nvidia.com/en-us/data-center/nvlink/
- NVIDIA GB200 NVL72 page (72-GPU single NVLink domain, 1.8 TB/s/GPU, GB200 Superchip composition): https://www.nvidia.com/en-us/data-center/gb200-nvl72/
- NVIDIA HGX H100 platform blog (4x 3rd-gen NVSwitch, 3.6 TB/s bisection, 450 GB/s reduction): https://developer.nvidia.com/blog/introducing-nvidia-hgx-h100-an-accelerated-server-platform-for-ai-and-high-performance-computing/
- NVIDIA/nvbandwidth (build prerequisites, `-l`, `-t`, testcase names): https://github.com/NVIDIA/nvbandwidth/blob/main/README.md
- NVIDIA Ampere architecture whitepaper (A100 NVLink 12 links x 25 GB/s = 600 GB/s): https://images.nvidia.com/aem-dam/en-zz/Solutions/data-center/nvidia-ampere-architecture-whitepaper.pdf
Related: Fabric Manager · Fabric bring-up & benchmarking · Glossary
[^1]: NVIDIA A100 product page lists NVLink 600 GB/s; the Ampere whitepaper states 25 GB/s per direction x 12 links. https://www.nvidia.com/en-us/data-center/a100/ , https://images.nvidia.com/aem-dam/en-zz/Solutions/data-center/nvidia-ampere-architecture-whitepaper.pdf
[^2]: NVIDIA NVLink & NVLink Switch page: 4th-gen (H100) 900 GB/s/GPU; 5th-gen (Blackwell) 1,800 GB/s/GPU at 18 links; per-direction halves are inferred, not printed by NVIDIA. https://www.nvidia.com/en-us/data-center/nvlink/
[^3]: NVIDIA GB200 NVL72: 72 Blackwell GPUs + 36 Grace CPUs, single 72-GPU NVLink domain, 130 TB/s total GPU communication, 1.8 TB/s per GPU; GB200 Superchip = 1 Grace + 2 Blackwell. https://www.nvidia.com/en-us/data-center/gb200-nvl72/
[^4]: Chris Fregly, AI Systems Performance Engineering (O'Reilly): "full NVLink 5 bandwidth of 1.8 TB/s bidirectional per GPU (18 x 100 GB/s links)"; "Each Blackwell GPU exposes 18 NVLink 5 ports ... 1.8 TB/s per GPU (18 NVLink links x 100 GB/s bidirectional)"; rack-scale NVLink Switch System provides about 130 TB/s aggregate. Cross-checks the NVIDIA NVLink page.
[^5]: NVIDIA HGX H100 platform: eight H100 + four third-generation NVSwitch, any-to-any at 900 GB/s, 3.6 TB/s bisection, 450 GB/s for reductions. https://developer.nvidia.com/blog/introducing-nvidia-hgx-h100-an-accelerated-server-platform-for-ai-and-high-performance-computing/
[^6]: NVIDIA MNNVL User Guide (Verifying): `nvidia-smi nvlink --status` should show 18 links active to the nine switch trays; `nvidia-smi -q | grep 'Fabric' -A 4` expects Completed/Success; `<inactive>` links mean the GPU cannot interact with peers. https://docs.nvidia.com/multi-node-nvlink-systems/mnnvl-user-guide/verifying.html
[^7]: NVIDIA Fabric Manager User Guide: `nvidia-fabricmanager` systemd service; lockstep driver-version check aborts on mismatch; min branches A100 450.xx+, H100 525.xx+, B200/B300/B100 570.xx+; FABRIC_MODE 0 bare-metal / 1 Shared NVSwitch / 2 vGPU; log at /var/log/fabricmanager.log. https://docs.nvidia.com/datacenter/tesla/fabric-manager-user-guide/index.html
[^8]: nvidia-smi manual, `nvlink` subcommand: `-s/--status`, `-c/--capabilities`, `-gt/--getthroughput` with counter types `d` (tx/rx data payload, KiB) and `r` (payload + protocol overhead, KiB); the older `-g` counter-control path is deprecated in favour of `-gt`. https://docs.nvidia.com/deploy/nvidia-smi/index.html
[^9]: nvidia-smi GPU attributes, Fabric: State (Not Started / In Progress / Completed) reflects the handshake with nvidia-fabricmanager; Clique ID is the set of GPUs that can communicate over NVLink. https://docs.nvidia.com/deploy/nvidia-smi/index.html
[^10]: NVIDIA/nvbandwidth README: prerequisites CUDA 11.x+ (multinode 12.3+), Boost program_options, CMake 3.20+, GCC 7.x+ (C++17); build `cmake . && make` (`-DMULTINODE=1` for multinode); `-l` lists testcases, `-t <name>` runs one; testcases include device_to_device_memcpy_read_ce / device_to_device_memcpy_write_ce. https://github.com/NVIDIA/nvbandwidth/blob/main/README.md
[^11]: IMEX (Internode Memory Exchange) coordinates the multi-node NVLink memory domain on NVL72 (Glossary).
[^12]: NVLS: the NVLink SHARP variant, in-network reduction over NVLink (Glossary).
[^13]: L40S, RTX PRO 6000, and GeForce have no NVLink; multi-GPU is PCIe-only (Glossary).
[^14]: Fabric Manager (`nv-fabricmanager`) programs the NVSwitch fabric so GPUs form one NVLink domain; lockstep-versioned with the driver (Glossary).
[^15]: SXID is the NVSwitch equivalent of a GPU XID error code (Glossary).