NVSwitch & NVLink

Scope: the GPU-to-GPU interconnect (NVLink links, NVSwitch ASICs, the Fabric Manager that fuses them into one all-to-all domain) across both an 8-GPU baseboard and a rack-scale NVL72, and how to validate it.

flowchart TB
    subgraph DOMAIN["NVLink domain (8-GPU HGX baseboard)"]
        G0["GPU 0<br/>18 NVLink ports"]
        G1["GPU 1<br/>18 NVLink ports"]
        G7["GPU 7<br/>18 NVLink ports"]
        SW0["NVSwitch 0<br/>(crossbar ASIC)"]
        SW1["NVSwitch 1<br/>(crossbar ASIC)"]
        SW2["NVSwitch 2<br/>(crossbar ASIC)"]
        SW3["NVSwitch 3<br/>(crossbar ASIC)"]
        G0 --- SW0 & SW1 & SW2 & SW3
        G1 --- SW0 & SW1 & SW2 & SW3
        G7 --- SW0 & SW1 & SW2 & SW3
    end
    FM["Fabric Manager (nv-fabricmanager):<br/>trains links, assigns Clique ID"] -.-> DOMAIN

The commands on this page are reference templates drawn from the NVIDIA nvidia-smi manual, the Fabric Manager and MNNVL user guides, and the NVIDIA/nvbandwidth repo. Nothing here was executed on hardware. Pin every version against your driver/firmware release, substitute real GPU indices and PCI IDs, and validate on one GPU pair before trusting a fleet. Bandwidth numbers are vendor aggregates or line-rate ceilings; achieved bandwidth is always lower.

What it is

NVLink is NVIDIA's point-to-point GPU interconnect: a mesh of high-speed links that lets GPUs read and write each other's HBM directly, bypassing PCIe and the CPU. NVSwitch is the switch ASIC that turns those point-to-point links into a non-blocking, all-to-all fabric: every GPU reaches every other GPU at full NVLink bandwidth instead of being limited to a fixed number of direct neighbours. The set of GPUs that can talk over NVLink is an NVLink domain (NVIDIA's term in nvidia-smi: a clique).[9]

Per-GPU aggregate bandwidth by generation (NVIDIA prints the bidirectional aggregate; per-direction is half):

| NVLink gen | Architecture | Links/GPU | Per-GPU aggregate |
|---|---|---|---|
| 3rd | Ampere / A100 | 12 | 600 GB/s [1] |
| 4th | Hopper / H100 | 18 | 900 GB/s [2] |
| 5th | Blackwell / B200, GB200 | 18 | 1,800 GB/s [2] |

The 5th-gen doubling comes from signaling: the same 18 links run at ~100 GT/s instead of ~50 GT/s, doubling per-link throughput without adding links, so NVLink 5 is composed of 18 links x 100 GB/s bidirectional = 1.8 TB/s per GPU.[2][4] Note that nvidia-smi nvlink --status reports a per-link figure (historically the unidirectional half), so on a Blackwell node the per-link numbers sum to roughly the per-direction half of the aggregate (18 x ~50 GB/s ≈ 900 GB/s), not to 1.8 TB/s; see Validated usage. Do not equate the per-link --status number with the marketing aggregate.
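The relationship between link count, per-link rate, and the table's aggregate is plain multiplication. A minimal sketch of that arithmetic follows; the helper name nvlink_aggregate is hypothetical, and the per-link bidirectional values are simply each table aggregate divided by its link count.

```shell
# Hypothetical helper: aggregate = links x per-link bidirectional GB/s.
# Prints "<aggregate> <per-direction>" in GB/s using integer arithmetic.
nvlink_aggregate() {
    links=$1
    per_link_bidir=$2
    aggregate=$((links * per_link_bidir))
    echo "$aggregate $((aggregate / 2))"
}

nvlink_aggregate 12 50    # NVLink 3 (A100)      -> 600 300
nvlink_aggregate 18 50    # NVLink 4 (H100)      -> 900 450
nvlink_aggregate 18 100   # NVLink 5 (Blackwell) -> 1800 900
```

The second number on each line is the per-direction half, which is the scale to keep in mind when reading per-link output from nvidia-smi.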

Two domain scales matter operationally:

  • Intra-node (8-GPU baseboard). An HGX/DGX H100 8-GPU board carries four third-generation NVSwitch chips; any H100 reaches any other at 900 GB/s, giving 3.6 TB/s bisection bandwidth and 450 GB/s for in-network (SHARP) reductions.[5] Blackwell HGX B200/B300 boards follow the same pattern with 5th-gen NVLink. The domain is the single server; Fabric Manager runs locally on that host.
  • Rack-scale (NVL72). GB200/GB300 NVL72 is a single NVLink domain spanning 72 Blackwell GPUs (paired with 36 Grace CPUs as GB200 Superchips: 1 Grace + 2 Blackwell each), wired through external NVLink Switch trays into "the largest NVLink domain ever offered": 130 TB/s of aggregate GPU-to-GPU bandwidth, 1.8 TB/s per GPU.[3] Each GPU's 18 links fan out across nine NVLink switch trays.[6] A multi-node domain like this adds IMEX (Internode Memory Exchange) on top of Fabric Manager to coordinate the cross-node memory fabric.[11]
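The domain-level figures in the two bullets can be sanity-checked from the per-GPU numbers. The sketch below is a rough check, not NVIDIA's derivation: the helper names are hypothetical, and reading bisection bandwidth as "half the GPUs each driving their full per-GPU aggregate across the cut" is an assumption that happens to reproduce the published 3.6 TB/s.

```shell
# Hypothetical helper: sum of per-GPU aggregates across a domain, in TB/s.
domain_aggregate_tbs() {
    awk -v gpus="$1" -v per_gpu="$2" 'BEGIN { printf "%.1f\n", gpus * per_gpu }'
}

# Hypothetical helper: bisection estimate in GB/s, assuming half the GPUs
# each push their per-GPU aggregate across the cut.
bisection_gbs() {
    echo $(( $1 / 2 * $2 ))
}

domain_aggregate_tbs 72 1.8   # NVL72: -> 129.6 (published as a rounded 130 TB/s)
bisection_gbs 8 900           # HGX H100: -> 3600 (3.6 TB/s)
```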

For the per-architecture numbers see GPU generations and the Blackwell platform page; this page is the interconnect and validation reference they point to.

Why it's needed (and when)

PCIe is the wrong fabric for tensor-parallel and expert-parallel traffic. A PCIe Gen5 x16 link is ~64 GB/s per direction; NVLink 5 is 900 GB/s per direction per GPU (1,800 GB/s aggregate), roughly 14x, and, crucially, it is switched all-to-all rather than a tree through the CPU root complex. Collective primitives (all-reduce, all-to-all, reduce-scatter) that dominate large-model training and MoE inference are bandwidth-bound on exactly this fabric.

You need NVSwitch/NVLink when:

  • Tensor parallelism splits a layer across GPUs every forward/backward step, putting the intra-node interconnect on the critical path of every token.
  • Expert parallelism / MoE routes tokens all-to-all across GPUs; NVLink all-to-all is what keeps it from collapsing to PCIe speeds.
  • KV-cache or weight sharding for large-model inference spreads state across a domain; NVLink read latency to a peer's HBM is what makes that viable.
  • Collectives can use NVLink SHARP (NVLS), which offloads reductions into the switch and cuts the data moved for all-reduce.[12]

You do not get NVLink on every card: L40S, RTX PRO 6000, and GeForce have no NVLink, so multi-GPU there is PCIe-only.[13] Choosing SXM (socketed, NVLink/NVSwitch) versus PCIe form factor is therefore an interconnect decision, not just a power/cooling one (see GPU software stack).

Rule of thumb: if a job spans more than one GPU and is not embarrassingly parallel, the NVLink domain is the first thing to size and the first thing to validate. Inter-node scaling beyond the domain falls to InfiniBand/RoCE (networking fabric); the boundary between "NVLink domain" and "network" is the single most important topology fact for placement.

How it's installed & managed

NVLink links and the NVSwitch hardware are driven by the standard GPU driver stack (CUDA Driver, NVIDIA Kernel Modules). What turns the switches into a domain is Fabric Manager: a userspace daemon (nv-fabricmanager) that trains the NVSwitch-to-NVSwitch and NVSwitch-to-GPU links and registers each GPU into the fabric.[14] On any NVSwitch system it must be installed and running, and it is lockstep-versioned with the driver: the package version has to match the loaded driver, and FM aborts on a mismatch.[7]

Install the package that matches your driver branch (Ubuntu/Debian example; reference template, not hardware-tested):

# Fabric Manager must match the loaded driver version exactly; the package name carries only the branch.
nvidia-smi --query-gpu=driver_version --format=csv,noheader   # read the loaded driver, e.g. 570.xx
sudo apt-get install -y nvidia-fabricmanager-570              # pick the branch package, then confirm its version equals the driver's

sudo systemctl enable --now nvidia-fabricmanager             # start now and on boot
systemctl status nvidia-fabricmanager                        # expect: active (running)
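Because Fabric Manager aborts on a driver mismatch, it can be worth comparing the two version strings before starting the service. A minimal sketch, assuming a Debian-style package version such as "570.124.06-1" where the text after the first "-" is the packaging revision; the helper name fm_matches_driver and the version strings are illustrative, not taken from a real host.

```shell
# Hypothetical helper: succeed only if the Fabric Manager package version
# (with its packaging revision stripped) equals the loaded driver version.
fm_matches_driver() {
    driver=$1
    fm_pkg=${2%%-*}      # "570.124.06-1" -> "570.124.06"
    [ "$driver" = "$fm_pkg" ]
}

# On a real host the inputs would come from something like:
#   driver=$(nvidia-smi --query-gpu=driver_version --format=csv,noheader | head -n 1)
#   fm_pkg=$(dpkg-query -W -f='${Version}' nvidia-fabricmanager-570)
if fm_matches_driver "570.124.06" "570.124.06-1"; then
    echo "match"
else
    echo "MISMATCH: expect Fabric Manager to abort at start"
fi
```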

Minimum driver branches per platform: HGX A100 needs 450.xx+, HGX H100 needs 525.xx+, HGX B200/B300/B100 needs 570.xx+.[7] Operating mode is set by FABRIC_MODE in the config: 0 bare-metal / full passthrough (default), 1 Shared NVSwitch multi-tenancy, 2 vGPU multi-tenancy.[7]
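The branch minimums and mode values above lend themselves to a small pre-flight check. This is a sketch under assumptions: both helper names are hypothetical, the platform keys (a100, h100, b100, b200, b300) are labels invented for this example, and the config is assumed to carry the setting as an uncommented FABRIC_MODE=<n> line.

```shell
# Hypothetical helper: read fabricmanager.cfg text on stdin and name the mode.
fabric_mode_name() {
    mode=$(sed -n 's/^[[:space:]]*FABRIC_MODE[[:space:]]*=[[:space:]]*\([0-9][0-9]*\).*/\1/p' | tail -n 1)
    case "$mode" in
        0) echo "bare-metal / full passthrough" ;;
        1) echo "Shared NVSwitch multi-tenancy" ;;
        2) echo "vGPU multi-tenancy" ;;
        *) echo "unknown" ;;
    esac
}

# Hypothetical helper: does a driver version meet the platform's minimum branch?
# Returns 0 if it does, 1 if it does not, 2 for an unrecognised platform key.
min_branch_ok() {
    case "$1" in
        a100)           min=450 ;;
        h100)           min=525 ;;
        b100|b200|b300) min=570 ;;
        *)              return 2 ;;
    esac
    [ "${2%%.*}" -ge "$min" ]   # compare only the leading branch number
}

# Usage on a real host (path as used elsewhere on this page):
#   fabric_mode_name < /usr/share/nvidia/nvswitch/fabricmanager.cfg
#   min_branch_ok h100 "$driver_version" || echo "driver branch too old"
```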

Where it lives and how to inspect it:

# Service logs (start failures, link training errors, driver-version mismatch)
sudo journalctl -u nvidia-fabricmanager --no-pager | tail -n 50
sudo tail -n 100 /var/log/fabricmanager.log         # FM's own log file

# Config (FABRIC_MODE, bind interface, log level)
sudo grep -vE '^\s*#|^\s*$' /usr/share/nvidia/nvswitch/fabricmanager.cfg

On a healthy node the GPUs register and report a completed fabric handshake. FM start order matters: it must come up before any CUDA process, or GPUs will report Fabric State Not Started and peer access fails. On GPU-Operator deployments FM runs as a child daemon inside the driver container rather than as a host systemd unit, but the state model is identical.[7]

For NVL72 racks, most of this is owned by the platform manager (NVIDIA Mission Control / Base Command Manager) and IMEX/NMX rather than run by hand; you verify, you do not bootstrap.[6] Treat the by-hand commands above as the 8-GPU-node case and the diagnostic commands below as universal.

Detailed FM operations and the recovery flow live on the Fabric Manager page; the rack bring-up sequence is on Fabric Bring-Up, Validation and Benchmarking.

Validated usage & tests

Validate in order: links up → fabric registered → bandwidth at expectation. Do not run a bandwidth test before the fabric handshake is Completed, or a fabric fault gets misread as a slow GPU.

1. Link state and per-link speed. --status shows each link and, if active, its bandwidth.[8]

nvidia-smi nvlink --status -i 0     # links + speed for GPU 0
nvidia-smi nvlink --capabilities -i 0   # P2P / system-memory / atomics support per link

Expect every populated link to read active with a per-link bandwidth printed (the per-direction figure, e.g. ~25 GB/s on Hopper, ~50 GB/s on the Blackwell switch fabric). Any link showing <inactive> means the driver cannot reach a peer over that link; on an NVL72 the MNNVL guide's bar is explicit: all 18 links up, none inactive.[6] Count matters: a Hopper or Blackwell GPU should show 18 links; a short count means a degraded or untrained link.
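Counting links by eye across a fleet is error-prone, so the check can be scripted. A minimal sketch, assuming the --status output prints one line per link shaped like "Link <n>: <speed> GB/s" or "Link <n>: <inactive>"; that layout and the helper name nvlink_summary are assumptions to verify against your driver's actual output.

```shell
# Hypothetical helper: summarise `nvidia-smi nvlink --status -i <gpu>` text
# read from stdin. Argument 1 is the expected link count (default 18).
# Exit status is 0 only when every expected link is active and none inactive.
nvlink_summary() {
    awk -v want="${1:-18}" '
        /Link [0-9]+: *<inactive>/ { inactive++; next }
        /Link [0-9]+: .*GB\/s/     { active++ }
        END {
            printf "active=%d inactive=%d expected=%d\n", active, inactive, want
            status = (active + 0 == want + 0 && inactive + 0 == 0) ? 0 : 1
            exit status
        }'
}

# Usage on a real host:
#   nvidia-smi nvlink --status -i 0 | nvlink_summary 18 || echo "GPU 0: link fault"
```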

2. Fabric registration (NVSwitch systems). The GPU must have completed its handshake with Fabric Manager before peer traffic works.[9]

nvidia-smi -q | grep 'Fabric' -A 4      # per-GPU Fabric State + Status; Clique ID

Expect State: Completed and Status: Success on every GPU. In Progress means FM is still training; Not Started means FM is not running or the GPU has not registered (check the service and /var/log/fabricmanager.log). The Clique ID identifies which NVLink domain the GPU belongs to; GPUs that must communicate over NVLink have to share a clique.[9]
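The Completed/Success expectation can be turned into a gate that runs before any bandwidth test. A minimal sketch, assuming each GPU's section in nvidia-smi -q contains a line reading just "Fabric" followed within four lines by "State : ..." and "Status : ..."; the helper name fabric_not_ready and that layout are assumptions, so check the field labels on your driver.

```shell
# Hypothetical helper: read `nvidia-smi -q` text on stdin and print how many
# GPUs have a Fabric section that is not both State=Completed and Status=Success.
fabric_not_ready() {
    awk '
        # Close the current Fabric section and count it if either field was wrong.
        function tally() { if (open && !(state_ok && status_ok)) bad++; open = 0 }
        /^[[:space:]]*Fabric[[:space:]]*$/ {
            tally(); open = 1; left = 4; state_ok = 0; status_ok = 0; next
        }
        open && left > 0 {
            left--
            if ($0 ~ /State[[:space:]]*:[[:space:]]*Completed/) state_ok = 1
            if ($0 ~ /Status[[:space:]]*:[[:space:]]*Success/)  status_ok = 1
        }
        END { tally(); print bad + 0 }'
}

# Usage on a real host:
#   [ "$(nvidia-smi -q | fabric_not_ready)" -eq 0 ] || echo "fabric not ready"
```

A result of 0 on a host with no Fabric sections at all is indistinguishable from a healthy host, so pair this with a check that the expected number of GPUs is present.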

3. Throughput counters (live traffic). Counters are cumulative; sample twice and difference to get a rate. Use -gt (the older -g counter-control path is deprecated [8]):

# d = tx/rx data payload in KiB; r = payload + protocol overhead in KiB (if supported)
nvidia-smi nvlink -gt d -i 0        # data-payload counters for GPU 0
nvidia-smi nvlink -gt r -i 0        # payload + overhead

Expect monotonically increasing KiB totals per link under load; flat counters on a link carrying traffic point at a misrouted or down link. These are counters, not a benchmark; for an actual bandwidth figure use nvbandwidth.
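Turning two cumulative readings into a rate is a subtraction and a unit conversion. A minimal sketch of that step; the helper name kib_delta_to_gbs is hypothetical, the readings are illustrative rather than measured, and extracting the per-link KiB values from the -gt output is left to the reader because the exact layout varies by driver.

```shell
# Hypothetical helper: convert two cumulative KiB readings taken <seconds>
# apart into a rate in decimal GB/s (1 KiB = 1024 bytes, 1 GB = 1e9 bytes).
kib_delta_to_gbs() {
    awk -v before="$1" -v after="$2" -v secs="$3" \
        'BEGIN { printf "%.3f\n", (after - before) * 1024 / secs / 1e9 }'
}

# Illustrative readings one second apart:
kib_delta_to_gbs 1000000 49828125 1   # -> 50.000
```

Sampling over several seconds rather than one smooths out bursts and reduces the effect of the time spent running nvidia-smi itself.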

4. Achieved bandwidth, nvbandwidth. NVIDIA's NVIDIA/nvbandwidth measures device-to-device NVLink bandwidth directly. Build it from source (reference template, not hardware-tested) [10]:

# Prereqs: CUDA 11.x+ (multinode needs 12.3+), Boost program_options, CMake 3.20+, GCC 7.x+ (C++17).
git clone https://github.com/NVIDIA/nvbandwidth.git
cd nvbandwidth
cmake .          # add -DMULTINODE=1 for a multi-node (NVL72-class) build
make

./nvbandwidth -l                                  # list all testcases
./nvbandwidth -t device_to_device_memcpy_read_ce  # P2P read over NVLink, copy-engine
./nvbandwidth -t device_to_device_memcpy_write_ce # P2P write over NVLink, copy-engine
./nvbandwidth                                     # run the full default set

Expect a per-GPU-pair matrix of GB/s. Read it relatively: every NVSwitch-connected pair should land in the same band (a non-blocking fabric is symmetric), and that band should approach, never reach, the generation's per-direction ceiling, with protocol overhead taking the rest. An asymmetric matrix where one pair is far below the others is the signature of a degraded link or a GPU that landed in the wrong clique. Do not chase a specific headline number; chase symmetry and the absence of outliers, then compare against your own commissioning baseline (Fabric Bring-Up, Validation and Benchmarking).
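The "chase symmetry, not a headline number" rule can be applied mechanically to the matrix. A minimal sketch, assuming the matrix rows have been extracted as "<row index> <value> <value> ..." with N/A on the diagonal and the column header stripped, and that column position equals GPU index; the helper name bw_outliers, the 0.8 threshold, and the sample values are all illustrative, not NVIDIA figures.

```shell
# Hypothetical helper: read matrix rows on stdin and print every pair whose
# bandwidth is below <fraction> (default 0.8) of the best pair in the matrix.
bw_outliers() {
    awk -v frac="${1:-0.8}" '
        BEGIN { n = 0; max = 0 }
        $1 ~ /^[0-9]+$/ && NF > 1 {
            row[n] = $1
            cols[n] = NF - 1
            for (c = 2; c <= NF; c++) {
                val[n, c - 2] = $c
                if ($c != "N/A" && $c + 0 > max) max = $c + 0
            }
            n++
        }
        END {
            for (r = 0; r < n; r++)
                for (c = 0; c < cols[r]; c++)
                    if (val[r, c] != "N/A" && val[r, c] + 0 < frac * max)
                        printf "outlier %s->%d %s GB/s\n", row[r], c, val[r, c]
        }'
}

# Illustrative 3-GPU matrix with one degraded pair:
bw_outliers 0.8 <<'EOF'
0 N/A 380.1 379.8
1 380.3 N/A 190.2
2 379.9 380.0 N/A
EOF
# -> outlier 1->2 190.2 GB/s
```

Comparing against the best pair rather than a fixed number keeps the check generation-agnostic; a commissioning baseline is still needed to catch a fabric that is uniformly slow.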

The end-to-end fabric procedure (links, subnet manager, P2P, GPUDirect RDMA, NCCL collectives, then NVLink) is Fabric Bring-Up, Validation and Benchmarking. Day-to-day flags are in nvidia-smi Reference.

Failure modes

  • Fabric Manager won't start / version mismatch. FM checks the loaded driver at init and aborts if incompatible; the node comes up with no NVLink domain.[7] Symptom: nvidia-fabricmanager inactive, GPUs Not Started. Runbook: Fabric Manager Failure.
  • Links inactive / short link count. One or more links <inactive>, or fewer than 18 links on a Hopper/Blackwell GPU, so peer access silently degrades to PCIe or fails outright.[6] Triage with nvidia-smi nvlink --status; correlate with NVSwitch SXID errors in dmesg (the NVSwitch analogue of GPU XIDs).[15] See reliability & RAS.
  • Stale / split clique. A GPU registers into the wrong clique or a domain partitions after a transient, so peer traffic that should be NVLink falls back or hangs. Check Clique ID consistency across the domain via nvidia-smi -q | grep 'Fabric' -A 4. Runbook: Fabric Manager Failure.
  • NVL72 multi-node domain gaps. On rack-scale systems IMEX/NMX coordinates the cross-node memory fabric; a control-plane fault leaves intra-node NVLink fine but cross-node peer access broken.[6] This is Mission Control's domain; verify with the MNNVL checks above before escalating.
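For the stale or split clique case, the quickest triage is to list the distinct clique IDs a host reports. A minimal sketch, assuming each GPU's Fabric section carries a line labelled either "CliqueId" or "Clique ID" followed by a colon and the value; the helper name clique_ids and the label spelling are assumptions to confirm on your driver.

```shell
# Hypothetical helper: print the distinct clique IDs found in `nvidia-smi -q`
# text read from stdin, one per line, sorted.
clique_ids() {
    awk -F ':' '/Clique *I[dD][[:space:]]*:/ {
        gsub(/[[:space:]]/, "", $2)   # trim padding around the value
        print $2
    }' | sort -u
}

# Usage on a real host: more than one line means the local GPUs are split.
#   [ "$(nvidia-smi -q | clique_ids | wc -l)" -eq 1 ] || echo "split clique"
```

This only sees the GPUs on the host where it runs; on a multi-node domain the output has to be collected from every node and compared.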

These are interconnect faults; distinguish them from a sick GPU before opening a hardware RMA. The fastest discriminator: a GPU that is healthy in compute but isolated on NVLink is a fabric problem, not a GPU problem.

References

Related: Fabric Manager · Fabric bring-up & benchmarking · Glossary


  1. NVIDIA A100 product page lists NVLink 600 GB/s; the Ampere whitepaper states 25 GB/s per direction x 12 links. https://www.nvidia.com/en-us/data-center/a100/ , https://images.nvidia.com/aem-dam/en-zz/Solutions/data-center/nvidia-ampere-architecture-whitepaper.pdf 

  2. NVIDIA NVLink & NVLink Switch page: 4th-gen (H100) 900 GB/s/GPU; 5th-gen (Blackwell) 1,800 GB/s/GPU at 18 links; per-direction halves are inferred, not printed by NVIDIA. https://www.nvidia.com/en-us/data-center/nvlink/ 

  3. NVIDIA GB200 NVL72: 72 Blackwell GPUs + 36 Grace CPUs, single 72-GPU NVLink domain, 130 TB/s total GPU communication, 1.8 TB/s per GPU; GB200 Superchip = 1 Grace + 2 Blackwell. https://www.nvidia.com/en-us/data-center/gb200-nvl72/ 

  4. Chris Fregly, AI Systems Performance Engineering (O'Reilly): "full NVLink 5 bandwidth of 1.8 TB/s bidirectional per GPU (18 x 100 GB/s links)"; "Each Blackwell GPU exposes 18 NVLink 5 ports ... 1.8 TB/s per GPU (18 NVLink links x 100 GB/s bidirectional)"; rack-scale NVLink Switch System provides about 130 TB/s aggregate. Cross-checks the NVIDIA NVLink page. 

  5. NVIDIA HGX H100 platform: eight H100 + four third-generation NVSwitch, any-to-any at 900 GB/s, 3.6 TB/s bisection, 450 GB/s for reductions. https://developer.nvidia.com/blog/introducing-nvidia-hgx-h100-an-accelerated-server-platform-for-ai-and-high-performance-computing/ 

  6. NVIDIA MNNVL User Guide (Verifying): nvidia-smi nvlink --status should show 18 links active to the nine switch trays; nvidia-smi -q | grep 'Fabric' -A 4 expects Completed/Success; <inactive> links mean the GPU cannot interact with peers. https://docs.nvidia.com/multi-node-nvlink-systems/mnnvl-user-guide/verifying.html 

  7. NVIDIA Fabric Manager User Guide: nvidia-fabricmanager systemd service; lockstep driver-version check aborts on mismatch; min branches A100 450.xx+, H100 525.xx+, B200/B300/B100 570.xx+; FABRIC_MODE 0 bare-metal / 1 Shared NVSwitch / 2 vGPU; log at /var/log/fabricmanager.log. https://docs.nvidia.com/datacenter/tesla/fabric-manager-user-guide/index.html 

  8. nvidia-smi manual, nvlink subcommand: -s/--status, -c/--capabilities, -gt/--getthroughput with counter types d (tx/rx data payload, KiB) and r (payload + protocol overhead, KiB); the older -g counter-control path is deprecated in favour of -gt. https://docs.nvidia.com/deploy/nvidia-smi/index.html 

  9. nvidia-smi GPU attributes — Fabric: State (Not Started / In Progress / Completed) reflects the handshake with nvidia-fabricmanager; Clique ID is the set of GPUs that can communicate over NVLink. https://docs.nvidia.com/deploy/nvidia-smi/index.html 

  10. NVIDIA/nvbandwidth README: prerequisites CUDA 11.x+ (multinode 12.3+), Boost program_options, CMake 3.20+, GCC 7.x+ (C++17); build cmake . && make (-DMULTINODE=1 for multinode); -l lists testcases, -t <name> runs one; testcases include device_to_device_memcpy_read_ce / device_to_device_memcpy_write_ce. https://github.com/NVIDIA/nvbandwidth/blob/main/README.md 

  11. IMEX (Internode Memory Exchange) coordinates the multi-node NVLink memory domain on NVL72 (Glossary). 

  12. NVLS — the NVLink SHARP variant, in-network reduction over NVLink (Glossary). 

  14. Fabric Manager (nv-fabricmanager) programs the NVSwitch fabric so GPUs form one NVLink domain; lockstep-versioned with the driver (Glossary).

  15. SXID is the NVSwitch equivalent of a GPU XID error code (Glossary).