
NVSwitch & NVLink

Scope: the GPU-to-GPU interconnect (NVLink links, NVSwitch ASICs, the Fabric Manager that fuses them into one all-to-all domain) across both an 8-GPU baseboard and a rack-scale NVL72, and how to validate it.

flowchart TB
    subgraph DOMAIN["NVLink domain (8-GPU HGX baseboard)"]
        G0["GPU 0<br/>18 NVLink ports"]
        G1["GPU 1<br/>18 NVLink ports"]
        G7["GPU 7<br/>18 NVLink ports"]
        SW0["NVSwitch 0<br/>(crossbar ASIC)"]
        SW1["NVSwitch 1<br/>(crossbar ASIC)"]
        SW2["NVSwitch 2<br/>(crossbar ASIC)"]
        SW3["NVSwitch 3<br/>(crossbar ASIC)"]
        G0 --- SW0 & SW1 & SW2 & SW3
        G1 --- SW0 & SW1 & SW2 & SW3
        G7 --- SW0 & SW1 & SW2 & SW3
    end
    FM["Fabric Manager (nv-fabricmanager):<br/>trains links, assigns Clique ID"] -.-> DOMAIN

The commands and configs on this page are reference templates, drawn from the NVIDIA nvidia-smi manual, the Fabric Manager and MNNVL user guides, and the NVIDIA/nvbandwidth repo. Nothing here was executed on hardware. Pin every version against your driver/firmware release, substitute real GPU indices and PCI IDs, and validate on one GPU pair before trusting a fleet. Bandwidth numbers are vendor aggregates or line-rate ceilings; achieved bandwidth is always lower.

What it is

NVLink is NVIDIA's point-to-point GPU interconnect: a mesh of high-speed links that lets GPUs read and write each other's HBM directly, bypassing PCIe and the CPU. NVSwitch is the switch ASIC that turns those point-to-point links into a non-blocking, all-to-all fabric: every GPU reaches every other GPU at full NVLink bandwidth instead of being limited to a fixed number of direct neighbours. The set of GPUs that can talk over NVLink is an NVLink domain (NVIDIA's term in nvidia-smi: a clique).⁹

Per-GPU aggregate bandwidth by generation (NVIDIA prints the bidirectional aggregate; per-direction is half):

| NVLink gen | Architecture | Links/GPU | Per-GPU aggregate (bidirectional) |
| --- | --- | --- | --- |
| 3rd | Ampere / A100 | 12 | 600 GB/s¹ |
| 4th | Hopper / H100 | 18 | 900 GB/s² |
| 5th | Blackwell / B200, GB200 | 18 | 1,800 GB/s² |

The 5th-gen doubling comes from signaling: the same 18 links run at ~100 GT/s instead of ~50 GT/s, doubling per-link throughput without adding links, so NVLink 5 is composed of 18 links x 100 GB/s bidirectional = 1.8 TB/s per GPU.²,⁴ Note that nvidia-smi nvlink --status reports a per-link figure (historically the unidirectional half), so on a Blackwell node the per-link numbers sum to the per-direction total (18 x ~50 GB/s = ~900 GB/s), which is half the 1.8 TB/s aggregate; see Validated usage. Do not equate the per-link --status number with the marketing aggregate.
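
The per-link versus aggregate arithmetic above can be sketched in a few lines of shell. This is a minimal illustration, not a vendor tool: the function name nvlink_totals is hypothetical, and the per-link inputs are the nominal per-direction figures quoted on this page (25 GB/s for 3rd and 4th gen, 50 GB/s for 5th gen).

```shell
# nvlink_totals LINKS PER_LINK_PER_DIRECTION_GBS
# Prints "<per-direction total> <bidirectional aggregate>" in GB/s.
# Hypothetical helper: inputs are nominal figures, not measured values.
nvlink_totals() {
    nt_links="$1"
    nt_per_link="$2"
    nt_per_direction=$((nt_links * nt_per_link))
    echo "$nt_per_direction $((nt_per_direction * 2))"
}

nvlink_totals 12 25   # Ampere    -> 300 600
nvlink_totals 18 25   # Hopper    -> 450 900
nvlink_totals 18 50   # Blackwell -> 900 1800
```

The second number in each output line is the figure NVIDIA prints in the table; the first is what the per-link --status values add up to.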

Two domain scales matter operationally:

Intra-node (8-GPU baseboard). An HGX/DGX H100 8-GPU board carries four third-generation NVSwitch chips; any H100 reaches any other at 900 GB/s, giving 3.6 TB/s bisection bandwidth and 450 GB/s for in-network (SHARP) reductions.⁵ Blackwell HGX B200/B300 boards follow the same pattern with 5th-gen NVLink. The domain is the single server; Fabric Manager runs locally on that host.
Rack-scale (NVL72). GB200/GB300 NVL72 is a single NVLink domain spanning 72 Blackwell GPUs (paired with 36 Grace CPUs as GB200 Superchips: 1 Grace + 2 Blackwell each), wired through external NVLink Switch trays into "the largest NVLink domain ever offered": 130 TB/s of aggregate GPU-to-GPU bandwidth, 1.8 TB/s per GPU.³ Each GPU's 18 links fan out across nine NVLink switch trays.⁶ A multi-node domain like this adds IMEX (Internode Memory Exchange) on top of Fabric Manager to coordinate the cross-node memory fabric.¹¹

For the per-architecture numbers see GPU generations and the Blackwell platform page; this page is the interconnect and validation reference they point to.

Why it's needed (and when)

PCIe is the wrong fabric for tensor-parallel and expert-parallel traffic. A PCIe Gen5 x16 link is ~64 GB/s per direction (~128 GB/s bidirectional); NVLink 5 is 900 GB/s per direction, 1,800 GB/s aggregate per GPU (roughly 14x on a like-for-like basis, more than an order of magnitude) and, crucially, it is switched all-to-all rather than a tree through the CPU root complex. Collective primitives (all-reduce, all-to-all, reduce-scatter) that dominate large-model training and MoE inference are bandwidth-bound on exactly this fabric.

You need NVSwitch/NVLink when:

Tensor parallelism splits a layer across GPUs every forward/backward step, putting the intra-node interconnect on the critical path of every token.
Expert parallelism / MoE routes tokens all-to-all across GPUs; NVLink all-to-all is what keeps it from collapsing to PCIe speeds.
KV-cache or weight sharding for large-model inference spreads state across a domain; NVLink read latency to a peer's HBM is what makes that viable.
NVLink SHARP (NVLS) offloads reductions into the switch, cutting the data moved for all-reduce.¹²

You do not get NVLink on every card: L40S, RTX PRO 6000, and GeForce have no NVLink, so multi-GPU there is PCIe-only.¹³ Choosing SXM (socketed, NVLink/NVSwitch) versus PCIe form factor is therefore an interconnect decision, not just a power/cooling one (GPU software stack).

Rule of thumb: if a job spans more than one GPU and is not embarrassingly parallel, the NVLink domain is the first thing to size and the first thing to validate. Inter-node scaling beyond the domain falls to InfiniBand/RoCE (networking fabric); the boundary between "NVLink domain" and "network" is the single most important topology fact for placement.

How it's installed & managed

NVLink links and the NVSwitch hardware are driven by the standard GPU driver stack (CUDA Driver, NVIDIA Kernel Modules). What turns the switches into a domain is Fabric Manager: a userspace daemon (nv-fabricmanager) that trains the NVSwitch-to-NVSwitch and NVSwitch-to-GPU links and registers each GPU into the fabric.¹⁴ On any NVSwitch system it must be installed and running, and it is lockstep-versioned with the driver: the package version has to match the loaded driver, and FM aborts on a mismatch.⁷

Install the package that matches your driver branch (Ubuntu/Debian example; reference template, not hardware-tested):

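# Fabric Manager must match the loaded driver version, not just the major branch.
nvidia-smi --query-gpu=driver_version --format=csv,noheader   # read the loaded driver, e.g. 570.xx
apt-cache policy nvidia-fabricmanager-570                     # confirm the candidate version equals the driver version
sudo apt-get install -y nvidia-fabricmanager-570              # branch package; pin with =<version> if the candidate differs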

sudo systemctl enable --now nvidia-fabricmanager             # start now and on boot
systemctl status nvidia-fabricmanager                        # expect: active (running)

Minimum driver branches per platform: HGX A100 needs 450.xx+, HGX H100 needs 525.xx+, HGX B200/B300/B100 needs 570.xx+.⁷ Operating mode is set by FABRIC_MODE in the config: 0 bare-metal / full passthrough (default), 1 Shared NVSwitch multi-tenancy, 2 vGPU multi-tenancy.⁷
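
The lockstep rule and the minimum branches above can be turned into a pre-flight check. The sketch below is an assumption-laden illustration: the function names and the lowercase platform keys (a100, h100, b100, b200, b300) are hypothetical, strict string equality is one reading of "lockstep", and the caller supplies both version strings however the host reports them.

```shell
# fm_matches_driver DRIVER_VERSION FM_VERSION
# Succeeds only when both version strings are non-empty and identical.
fm_matches_driver() {
    [ -n "$1" ] && [ "$1" = "$2" ]
}

# min_branch_for PLATFORM
# Prints the minimum driver branch quoted in the text; platform keys are hypothetical.
min_branch_for() {
    case "$1" in
        a100) echo 450 ;;
        h100) echo 525 ;;
        b100|b200|b300) echo 570 ;;
        *) return 1 ;;
    esac
}

# driver_meets_minimum PLATFORM DRIVER_VERSION
# Returns 0 if the driver branch is new enough, 1 if too old, 2 if the platform is unknown.
driver_meets_minimum() {
    dm_min=$(min_branch_for "$1") || return 2
    dm_branch=${2%%.*}            # "570.86.15" -> "570"
    [ "$dm_branch" -ge "$dm_min" ]
}
```

A typical use would be to read the driver version with the nvidia-smi query shown earlier, pass it to driver_meets_minimum with the platform key, and refuse to start the service if fm_matches_driver fails.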

Where it lives and how to inspect it:

# Service logs (start failures, link training errors, driver-version mismatch)
sudo journalctl -u nvidia-fabricmanager --no-pager | tail -n 50
sudo tail -n 100 /var/log/fabricmanager.log         # FM's own log file

# Config (FABRIC_MODE, bind interface, log level)
sudo grep -vE '^\s*#|^\s*$' /usr/share/nvidia/nvswitch/fabricmanager.cfg
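
To read the operating mode out of that config without eyeballing it, a small parser is enough. This sketch assumes the config uses plain KEY=VALUE lines (for example FABRIC_MODE=0) and treats a missing key as the default mode 0 described above; the function name fabric_mode_name is hypothetical.

```shell
# fabric_mode_name: reads fabricmanager.cfg text on stdin and prints "<value> <meaning>".
# Assumes KEY=VALUE syntax; commented-out lines are ignored.
fabric_mode_name() {
    awk -F= '
        /^[[:space:]]*FABRIC_MODE[[:space:]]*=/ {
            mode = $2
            gsub(/[[:space:]]/, "", mode)
        }
        END {
            if (mode == "") mode = "0"      # key absent: fall back to the documented default
            name["0"] = "bare-metal / full passthrough"
            name["1"] = "Shared NVSwitch multi-tenancy"
            name["2"] = "vGPU multi-tenancy"
            if (mode in name) {
                print mode, name[mode]
            } else {
                print mode, "unknown"
                exit 1
            }
        }'
}
```

Usage, under the same assumptions: fabric_mode_name < /usr/share/nvidia/nvswitch/fabricmanager.cfg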

On a healthy node the GPUs register and report a completed fabric handshake. FM start order matters: it must come up before any CUDA process, or GPUs will report Fabric State Not Started and peer access fails. On GPU-Operator deployments FM runs as a child daemon inside the driver container rather than as a host systemd unit, but the state model is identical.⁷
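
One way to respect that start order in a job wrapper is to gate the workload on a readiness probe. The sketch below is a generic polling loop, not something the Fabric Manager guide prescribes; wait_until is a hypothetical name, and the probe is any command or shell function you supply that exits 0 once the fabric handshake is complete.

```shell
# wait_until PROBE [TRIES] [DELAY_SECONDS]
# Runs PROBE up to TRIES times, sleeping DELAY_SECONDS between attempts.
# Returns 0 as soon as PROBE succeeds, 1 if it never does.
wait_until() {
    wu_probe="$1"
    wu_tries="${2:-30}"
    wu_delay="${3:-2}"
    wu_i=0
    while [ "$wu_i" -lt "$wu_tries" ]; do
        if "$wu_probe"; then
            return 0
        fi
        wu_i=$((wu_i + 1))
        sleep "$wu_delay"
    done
    return 1
}
```

For example, with a hypothetical fabric_ready function that checks the per-GPU fabric state, wait_until fabric_ready 60 5 would poll for up to five minutes before the wrapper gives up instead of launching a CUDA process into an unregistered fabric.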

For NVL72 racks, most of this is owned by the platform manager (NVIDIA Mission Control / Base Command Manager) and IMEX/NMX rather than run by hand; you verify, you do not bootstrap.⁶ Treat the by-hand commands above as the 8-GPU-node case and the diagnostic commands below as universal.

Detailed FM operations and the recovery flow live on Fabric Manager; the rack bring-up sequence is Fabric Bring-Up, Validation and Benchmarking.

Validated usage & tests

Validate in order: links up → fabric registered → bandwidth at expectation. Do not run a bandwidth test before the fabric handshake is Completed, or a fabric fault gets misread as a slow GPU.

1. Link state and per-link speed. --status shows each link and, if active, its bandwidth.⁸

nvidia-smi nvlink --status -i 0     # links + speed for GPU 0
nvidia-smi nvlink --capabilities -i 0   # P2P / system-memory / atomics support per link

Expect every populated link to read active with a per-link bandwidth printed (the per-direction figure, e.g. ~25 GB/s on Hopper, ~50 GB/s on the Blackwell switch fabric). Any link showing <inactive> means the driver cannot reach a peer over that link; on an NVL72 the MNNVL guide's bar is explicit: all 18 links up, none inactive.⁶ Count matters: a Hopper or Blackwell GPU should show 18 links; a short count means a degraded or untrained link.
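
The link-count check can be scripted by counting lines in the --status output. This is a sketch under a stated assumption: each link is printed as a line of the form "Link N: <value> GB/s" or "Link N: <inactive>". Verify that layout against your driver release before relying on it. The function names are hypothetical.

```shell
# count_links: reads `nvidia-smi nvlink --status -i N` output on stdin,
# prints "<active> <inactive>". Assumes one "Link N: ..." line per link.
count_links() {
    awk '
        /^[[:space:]]*Link [0-9]+:/ {
            if ($0 ~ /<inactive>/) inactive++
            else if ($0 ~ /GB\/s/) active++
        }
        END { printf "%d %d\n", active, inactive }'
}

# links_healthy EXPECTED: succeeds when exactly EXPECTED links are active and none are inactive.
links_healthy() {
    lh_expected="$1"
    lh_counts=$(count_links)
    lh_active=${lh_counts% *}
    lh_inactive=${lh_counts#* }
    [ "$lh_active" -eq "$lh_expected" ] && [ "$lh_inactive" -eq 0 ]
}
```

On a Hopper or Blackwell GPU the check would be nvidia-smi nvlink --status -i 0 | links_healthy 18, repeated per GPU index.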

2. Fabric registration (NVSwitch systems). The GPU must have completed its handshake with Fabric Manager before peer traffic works.⁹

nvidia-smi -q | grep 'Fabric' -A 4      # per-GPU Fabric State + Status; Clique ID

Expect State: Completed and Status: Success on every GPU. In Progress means FM is still training; Not Started means FM is not running or the GPU has not registered (check the service and /var/log/fabricmanager.log). The Clique ID identifies which NVLink domain the GPU belongs to; GPUs that must communicate over NVLink have to share a clique.⁹
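
The same expectation can be checked mechanically. The sketch below assumes the filtered output has one "State : <value>" line followed by one "Status : <value>" line per GPU, as in the MNNVL guide's example; the function name fabric_summary is hypothetical and the field layout should be confirmed on your driver.

```shell
# fabric_summary: reads `nvidia-smi -q | grep 'Fabric' -A 4` output on stdin,
# prints "<gpus seen> <gpus not ready>" and exits non-zero if any GPU is not
# Completed/Success or if no GPU was seen at all.
fabric_summary() {
    awk -F: '
        {
            key = $1; val = $2
            gsub(/[[:space:]]/, "", key)
            gsub(/^[[:space:]]+|[[:space:]]+$/, "", val)
        }
        key == "State" {
            gpus++
            state_ok = (val == "Completed")
            if (!state_ok) bad++
        }
        key == "Status" {
            if (state_ok && val != "Success") bad++
        }
        END {
            printf "%d %d\n", gpus, bad
            exit (gpus == 0 || bad > 0)
        }'
}
```

Because the exit status is non-zero for an empty input, a missing nvidia-smi or an empty grep result is reported as a failure rather than silently passing.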

3. Throughput counters (live traffic). Counters are cumulative; sample twice and difference to get a rate. Use -gt (the older -g counter-control path is deprecated⁸):

# d = tx/rx data payload in KiB; r = payload + protocol overhead in KiB (if supported)
nvidia-smi nvlink -gt d -i 0        # data-payload counters for GPU 0
nvidia-smi nvlink -gt r -i 0        # payload + overhead

Expect monotonically increasing KiB totals per link under load; flat counters on a link carrying traffic point at a misrouted or down link. These are counters, not a benchmark; for an actual bandwidth figure use nvbandwidth.
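
"Sample twice and difference" looks like this in practice. The sketch assumes the -gt d output prints one "Data Tx: <n> KiB" entry per link, which you should confirm on your driver; both function names are hypothetical, and GB/s here means 10^9 bytes per second.

```shell
# sum_tx_kib: reads `nvidia-smi nvlink -gt d -i N` output on stdin and
# sums the Tx KiB counters over all links (assumed "Data Tx: <n> KiB" layout).
sum_tx_kib() {
    awk '
        /Data Tx:/ {
            for (i = 1; i < NF; i++)
                if ($i == "Tx:") total += $(i + 1)
        }
        END { printf "%.0f\n", total }'
}

# rate_gbs KIB_T0 KIB_T1 SECONDS
# Average rate between two cumulative samples, in GB/s.
rate_gbs() {
    awk -v a="$1" -v b="$2" -v s="$3" \
        'BEGIN { printf "%.2f\n", (b - a) * 1024 / 1e9 / s }'
}
```

A run would capture t0=$(nvidia-smi nvlink -gt d -i 0 | sum_tx_kib), sleep for a known interval, capture t1 the same way, and call rate_gbs "$t0" "$t1" with that interval. The result is an average over the window, so keep the interval short relative to the traffic phase you care about.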

4. Achieved bandwidth, nvbandwidth. NVIDIA's NVIDIA/nvbandwidth measures device-to-device NVLink bandwidth directly. Build it from source (reference template, not hardware-tested):¹⁰

# Prereqs: CUDA 11.x+ (multinode needs 12.3+), Boost program_options, CMake 3.20+, GCC 7.x+ (C++17).
git clone https://github.com/NVIDIA/nvbandwidth.git
cd nvbandwidth
cmake .          # add -DMULTINODE=1 for a multi-node (NVL72-class) build
make

./nvbandwidth -l                                  # list all testcases
./nvbandwidth -t device_to_device_memcpy_read_ce  # P2P read over NVLink, copy-engine
./nvbandwidth -t device_to_device_memcpy_write_ce # P2P write over NVLink, copy-engine
./nvbandwidth                                     # run the full default set

Expect a per-GPU-pair matrix of GB/s. Read it relatively: every NVSwitch-connected pair should land in the same band (a non-blocking fabric is symmetric), and that band should approach, never reach, the generation's per-direction ceiling, with protocol overhead taking the rest. An asymmetric matrix where one pair is far below the others is the signature of a degraded link or a GPU that landed in the wrong clique. Do not chase a specific headline number; chase symmetry and the absence of outliers, then compare against your own commissioning baseline (Fabric Bring-Up, Validation and Benchmarking).
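
Chasing symmetry can be automated by flagging any pair that falls well below the best pair in the matrix. This is a sketch, not part of nvbandwidth: the function name slow_pairs and the 0.9 default threshold are arbitrary choices, and it assumes matrix rows start with the source GPU index and mark the diagonal as N/A.

```shell
# slow_pairs [FRACTION]: reads an nvbandwidth device-to-device matrix on stdin and
# prints "src->dst GB/s" for every pair below FRACTION (default 0.9) of the best pair.
# Assumes data rows look like " 1  276.14  N/A  276.14" (row index, then one value per column).
slow_pairs() {
    sp_frac="${1:-0.9}"
    awk -v frac="$sp_frac" '
        $1 ~ /^[0-9]+$/ && /N\/A/ {
            for (i = 2; i <= NF; i++) {
                if ($i == "N/A") continue
                n++
                src[n] = $1; dst[n] = i - 2; bw[n] = $i + 0
                if (bw[n] > best) best = bw[n]
            }
        }
        END {
            for (k = 1; k <= n; k++)
                if (bw[k] < frac * best)
                    printf "%d->%d %.2f\n", src[k], dst[k], bw[k]
        }'
}
```

Empty output means every pair is within the band. Comparing against the best pair rather than a headline number keeps the check valid across generations; tighten or loosen the fraction to match the spread in your own commissioning baseline.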

The end-to-end fabric procedure (links, subnet manager, P2P, GPUDirect RDMA, NCCL collectives, then NVLink) is Fabric Bring-Up, Validation and Benchmarking. Day-to-day flags are in nvidia-smi Reference.

Failure modes

Fabric Manager won't start / version mismatch. FM checks the loaded driver at init and aborts if incompatible; the node comes up with no NVLink domain.⁷ Symptom: nvidia-fabricmanager inactive, GPUs Not Started. Runbook: Fabric Manager Failure.
Links inactive / short link count. One or more links <inactive>, or fewer than 18 links on a Hopper/Blackwell GPU, so peer access silently degrades to PCIe or fails outright.⁶ Triage with nvidia-smi nvlink --status; correlate with NVSwitch SXID errors in dmesg (the NVSwitch analogue of GPU XIDs).¹⁵ See reliability & RAS.
Stale / split clique. A GPU registers into the wrong clique or a domain partitions after a transient, so peer traffic that should be NVLink falls back or hangs. Check Clique ID consistency across the domain via nvidia-smi -q | grep 'Fabric' -A 4. Runbook: Fabric Manager Failure.
NVL72 multi-node domain gaps. On rack-scale systems IMEX/NMX coordinates the cross-node memory fabric; a control-plane fault leaves intra-node NVLink fine but cross-node peer access broken.⁶ This is Mission Control's domain; verify with the MNNVL checks above before escalating.
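
For the stale or split clique case, the consistency check reduces to counting distinct clique IDs across the GPUs that are supposed to share a domain. The sketch below assumes the clique field is printed as a "label : value" line whose label contains the word "Clique" (the exact label spelling varies between documents, so the match is deliberately loose); distinct_cliques is a hypothetical name.

```shell
# distinct_cliques: reads `nvidia-smi -q` (or the grep-filtered Fabric block) on stdin
# and prints how many different clique IDs appear.
distinct_cliques() {
    awk -F: '
        tolower($1) ~ /clique/ {
            id = $2
            gsub(/[[:space:]]/, "", id)
            if (!(id in seen)) { seen[id] = 1; n++ }
        }
        END { printf "%d\n", n }'
}
```

On a single-domain node the expected result is 1. A result of 0 means no clique field was found (Fabric Manager not running, or a different output layout), and anything above 1 means GPUs on the same host landed in different cliques.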

These are interconnect faults; distinguish them from a sick GPU before opening a hardware RMA. The fastest discriminator: a GPU that is healthy in compute but isolated on NVLink is a fabric problem, not a GPU problem.

References

NVIDIA System Management Interface (nvidia-smi) manual — nvlink subcommand, -s/--status, -c/--capabilities, -gt/--getthroughput (d/r), -R, -i, -l: https://docs.nvidia.com/deploy/nvidia-smi/index.html
NVIDIA Fabric Manager User Guide — service, fabric state, FABRIC_MODE, driver-version lockstep: https://docs.nvidia.com/datacenter/tesla/fabric-manager-user-guide/index.html
NVIDIA MNNVL (multi-node NVLink) User Guide — verifying NVL72 links and fabric state: https://docs.nvidia.com/multi-node-nvlink-systems/mnnvl-user-guide/verifying.html
NVIDIA NVLink & NVLink Switch product page — per-generation aggregate bandwidth, NVL72 130 TB/s: https://www.nvidia.com/en-us/data-center/nvlink/
NVIDIA GB200 NVL72 page — 72-GPU single NVLink domain, 1.8 TB/s/GPU, GB200 Superchip composition: https://www.nvidia.com/en-us/data-center/gb200-nvl72/
NVIDIA HGX H100 platform blog — 4x 3rd-gen NVSwitch, 3.6 TB/s bisection, 450 GB/s reduction: https://developer.nvidia.com/blog/introducing-nvidia-hgx-h100-an-accelerated-server-platform-for-ai-and-high-performance-computing/
NVIDIA/nvbandwidth — build prerequisites, -l, -t, testcase names: https://github.com/NVIDIA/nvbandwidth/blob/main/README.md
NVIDIA Ampere architecture whitepaper — A100 NVLink 12 links x 25 GB/s = 600 GB/s: https://images.nvidia.com/aem-dam/en-zz/Solutions/data-center/nvidia-ampere-architecture-whitepaper.pdf

1. NVIDIA A100 product page lists NVLink 600 GB/s; the Ampere whitepaper states 25 GB/s per direction x 12 links. https://www.nvidia.com/en-us/data-center/a100/ , https://images.nvidia.com/aem-dam/en-zz/Solutions/data-center/nvidia-ampere-architecture-whitepaper.pdf
2. NVIDIA NVLink & NVLink Switch page: 4th-gen (H100) 900 GB/s/GPU; 5th-gen (Blackwell) 1,800 GB/s/GPU at 18 links; per-direction halves are inferred, not printed by NVIDIA. https://www.nvidia.com/en-us/data-center/nvlink/
3. NVIDIA GB200 NVL72: 72 Blackwell GPUs + 36 Grace CPUs, single 72-GPU NVLink domain, 130 TB/s total GPU communication, 1.8 TB/s per GPU; GB200 Superchip = 1 Grace + 2 Blackwell. https://www.nvidia.com/en-us/data-center/gb200-nvl72/
4. Chris Fregly, AI Systems Performance Engineering (O'Reilly): "full NVLink 5 bandwidth of 1.8 TB/s bidirectional per GPU (18 x 100 GB/s links)"; "Each Blackwell GPU exposes 18 NVLink 5 ports ... 1.8 TB/s per GPU (18 NVLink links x 100 GB/s bidirectional)"; rack-scale NVLink Switch System provides about 130 TB/s aggregate. Cross-checks the NVIDIA NVLink page.
5. NVIDIA HGX H100 platform: eight H100 + four third-generation NVSwitch, any-to-any at 900 GB/s, 3.6 TB/s bisection, 450 GB/s for reductions. https://developer.nvidia.com/blog/introducing-nvidia-hgx-h100-an-accelerated-server-platform-for-ai-and-high-performance-computing/
6. NVIDIA MNNVL User Guide (Verifying): nvidia-smi nvlink --status should show 18 links active to the nine switch trays; nvidia-smi -q | grep 'Fabric' -A 4 expects Completed/Success; <inactive> links mean the GPU cannot interact with peers. https://docs.nvidia.com/multi-node-nvlink-systems/mnnvl-user-guide/verifying.html
7. NVIDIA Fabric Manager User Guide: nvidia-fabricmanager systemd service; lockstep driver-version check aborts on mismatch; min branches A100 450.xx+, H100 525.xx+, B200/B300/B100 570.xx+; FABRIC_MODE 0 bare-metal / 1 Shared NVSwitch / 2 vGPU; log at /var/log/fabricmanager.log. https://docs.nvidia.com/datacenter/tesla/fabric-manager-user-guide/index.html
8. nvidia-smi manual, nvlink subcommand: -s/--status, -c/--capabilities, -gt/--getthroughput with counter types d (tx/rx data payload, KiB) and r (payload + protocol overhead, KiB); the older -g counter-control path is deprecated in favour of -gt. https://docs.nvidia.com/deploy/nvidia-smi/index.html
9. nvidia-smi GPU attributes, Fabric: State (Not Started / In Progress / Completed) reflects the handshake with nvidia-fabricmanager; Clique ID is the set of GPUs that can communicate over NVLink. https://docs.nvidia.com/deploy/nvidia-smi/index.html
10. NVIDIA/nvbandwidth README: prerequisites CUDA 11.x+ (multinode 12.3+), Boost program_options, CMake 3.20+, GCC 7.x+ (C++17); build cmake . && make (-DMULTINODE=1 for multinode); -l lists testcases, -t <name> runs one; testcases include device_to_device_memcpy_read_ce / device_to_device_memcpy_write_ce. https://github.com/NVIDIA/nvbandwidth/blob/main/README.md
11. IMEX (Internode Memory Exchange) coordinates the multi-node NVLink memory domain on NVL72 (Glossary).
12. NVLS: the NVLink SHARP variant, in-network reduction over NVLink (Glossary).
13. L40S, RTX PRO 6000, and GeForce have no NVLink; multi-GPU is PCIe-only (Glossary).
14. Fabric Manager (nv-fabricmanager) programs the NVSwitch fabric so GPUs form one NVLink domain; lockstep-versioned with the driver (Glossary).
15. SXID is the NVSwitch equivalent of a GPU XID error code (Glossary).