Fabric bring-up, validation & benchmarking
Scope: the shared procedure for bringing a GPU interconnect (InfiniBand, RoCE, and NVLink) up, proving it healthy, and benchmarking it against line rate before any workload runs.
Reference templates, drawn from NVIDIA DOCA/MLNX_OFED docs, the linux-rdma/perftest project, and the NVIDIA/nccl-tests and NVIDIA/nvbandwidth repos. Nothing here was executed on hardware. Pin versions to your DOCA/driver release, substitute real device names, and validate on one link/node pair before a fleet roll. Numbers are line-rate ceilings or vendor figures; achieved bandwidth is always lower (protocol overhead).
flowchart LR
STACK["1-2 Install stack and SM"] --> LINKS["3 Verify links and topology"]
LINKS --> P2P["4 Point-to-point BW and latency"]
P2P --> GDR["5 GPUDirect RDMA"]
GDR --> NCCL["6 nccl-tests collectives"]
NCCL --> NVL["7 NVLink and NVSwitch"]
NVL --> ROCE["8 RoCE specifics"]
What this covers
This is the procedure the per-GPU hardware pages reference instead of repeating networking ops. It takes a fabric that is already built (cabled, powered, drivers present; see Ansible bring-up) and proves it: every link up at the right speed, the subnet manager converged, point-to-point bandwidth at line rate, GPUDirect RDMA actually engaged, collectives at the expected bus bandwidth, and NVLink intact on NVSwitch systems. Use it at commissioning (commissioning and acceptance), after any driver/firmware change (driver upgrade runbook), and as the first triage step when a "GPU" problem might be a degraded link.
The ordering is deliberate and bottom-up: do not run a collective benchmark before the fabric is clean, or a network fault gets misread as a GPU fault. Concepts and topology live in networking fabric; this page is the runnable counterpart.
Prerequisites
- GPU driver + datacenter stack present and healthy: nvidia-smi lists every GPU, persistence mode on, and on NVSwitch systems Fabric Manager active (Section 7). See GPU software stack.
- NIC family. The fabric generation pairs with the adapter and switch:
  - ConnectX-6. HDR InfiniBand / 200 GbE.
  - ConnectX-7. NDR InfiniBand 400 Gb/s or 400 GbE.1
  - ConnectX-8 SuperNIC. XDR InfiniBand 800 Gb/s or 2x400 GbE, PCIe Gen6.2
  - BlueField-3. DPU/SuperNIC at NDR 400 Gb/s IB or 400 GbE; the SuperNIC variant is the host adapter in Spectrum-X.3
- GPUDirect RDMA path. The nvidia-peermem kernel module exposes GPU memory to the NIC for RDMA; load it (or use the newer dma-buf path) before GDR tests. NCCL enables GDR automatically when the topology permits and the module is present.4 GPUDirect RDMA is a datacenter/workstation-pro capability, not available on GeForce (networking fabric).
# Confirm the peer-memory module is loaded (GPUDirect RDMA prerequisite)
lsmod | grep nvidia_peermem || sudo modprobe nvidia_peermem
1. Install the fabric stack
NVIDIA's host RDMA stack is DOCA-OFED (the successor to MLNX_OFED; the last standalone MLNX_OFED was the October 2024 LTS, and from January 2025 new features ship in DOCA-OFED only).5 Install on Ubuntu from the DOCA-Host repo, then pick a profile.6
# DOCA-Host on Ubuntu: install the repo package, refresh, install a profile.
sudo dpkg -i doca-host_<version>_ubuntu<rel>_amd64.deb # repo package from NVIDIA
sudo apt-get update
sudo apt-get install -y doca-ofed # full RDMA stack profile
# doca-ofed-userspace installs userspace only (no DKMS kernel modules);
# doca-all is the everything profile. Choose per node role.
sudo apt-get install -y mlnx-fw-updater # firmware updater (pulls matching FW)
Verify the stack and the kernel module:
ofed_info -s # prints the DOCA/MLNX_OFED version string, e.g. MLNX_OFED_LINUX-24.07-...
modinfo mlx5_core # confirm the driver: version, srcversion, signature
ibv_devinfo # RDMA devices, fw_ver, port state, link_layer (InfiniBand|Ethernet)
ofed_info -s prints the installed OFED/DOCA version; ofed_info (no flag) lists every component version.7 modinfo mlx5_core reports the installed ConnectX driver module's version, srcversion, and signing metadata; the adapter firmware version is the fw_ver field in ibv_devinfo.
Firmware. Query and, if needed, update adapter firmware with the MFT tools (mft package), or rely on mlnx-fw-updater above to apply the version matched to the installed stack:89
mlxfwmanager -d <pci_id> --query # device type, PSID, current vs available FW, e.g. -d 09:00.0
flint -d <pci_id> q # low-level query (flint burns/queries a single device)
mstflint -d <pci_id> q # open-source MFT subset, equivalent query
Firmware mismatch across a fleet is a classic source of links that negotiate down or flap (Section: Failure modes). Bring every adapter to the same qualified baseline.
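The baseline check can be scripted per node. The sketch below reads ibv_devinfo text on stdin and prints every device whose fw_ver differs from the version you pass in. The helper name fw_mismatches and the version 28.39.1002 are illustrative assumptions, and the parser assumes the usual "hca_id:" / "fw_ver:" layout of ibv_devinfo output; verify against your release.

```shell
#!/bin/sh
# Sketch only: fw_mismatches is a hypothetical helper, not an NVIDIA tool.
# Reads `ibv_devinfo` output on stdin; assumes "hca_id:" and "fw_ver:" lines.
fw_mismatches() {
    # $1 = qualified firmware version for the fleet (example: 28.39.1002)
    awk -v want="$1" '
        $1 == "hca_id:" { dev = $2 }
        # Append "" to force a string comparison of the version fields.
        $1 == "fw_ver:" && ($2 "") != (want "") {
            printf "%s fw_ver=%s expected=%s\n", dev, $2, want
            bad = 1
        }
        END { exit bad + 0 }   # non-zero exit when any device is off-baseline
    '
}

# On a node (needs the RDMA stack installed):
#   ibv_devinfo | fw_mismatches 28.39.1002 || echo "firmware drift on $(hostname)"
```

The function exits non-zero on any mismatch, so it can gate a fleet roll from Ansible or a Slurm prolog.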
2. Subnet manager
An InfiniBand fabric needs exactly one master subnet manager (SM) to initialise the fabric, assign LIDs, and program routing; any additional SMs should sit in standby at lower priority. The SM runs on a managed switch, in UFM (NVIDIA's Unified Fabric Manager, where the SM is one component), or as OpenSM on a Linux host.10 On managed Quantum switches the SM typically lives on the switch; on unmanaged fabrics you run OpenSM on a host.
# Run OpenSM as a daemon (or enable the opensmd service instead).
sudo opensm -B # -B = run in background (daemon)
# OpenSM reads /etc/opensm/opensm.conf; set SM priority there for HA with multiple SMs.
Confirm the SM is up and singular:
sminfo # queries the SM: prints sm lid, sm state (e.g. SMINFO_MASTER), priority
ibhosts # lists every host (CA) node the SM sees on the fabric
sminfo reports the master SM's LID and state; ibhosts enumerates the discovered host channel adapters.11 If sminfo shows no master, no SM has converged and nothing below will pass. If two SMs both claim master, they will fight over routing; keep one master and set the rest to lower priority (UFM/OpenSM is detailed in networking fabric).
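As a gate before the later steps, the sminfo check can be reduced to a pass/fail. This is a minimal sketch: sm_state is a hypothetical helper, and it assumes sminfo prints a single line containing "sm lid <n>" and the state name SMINFO_MASTER, as described above. Confirm the exact output format on your stack.

```shell
#!/bin/sh
# Sketch only: sm_state is a hypothetical helper. Reads `sminfo` output on stdin.
sm_state() {
    awk '
        /SMINFO_MASTER/ {
            # Find the token after the first "lid" on the line.
            for (i = 1; i < NF; i++)
                if ($i == "lid") { print "master lid=" $(i + 1); found = 1; exit }
        }
        END { if (!found) { print "no-master"; exit 1 } }
    '
}

# On a fabric node:
#   sminfo 2>&1 | sm_state || echo "stop: no master SM, do not benchmark yet"
```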
3. Verify links & topology
Prove every link is up at the expected width and speed, and the topology matches the design, before any benchmark.
ibstat # per-port: State, Physical state, Rate, Base lid, link_layer
ibstatus # compact per-port state/rate/link_layer
iblinkinfo # every link in the fabric with width/speed (e.g. 4X NDR) and peer
ibnetdiscover # full topology dump: switches, CAs, and the links between them
Healthy ibstat for an active IB port shows State: Active, Physical state: LinkUp, and the expected Rate (e.g. Rate: 400 for NDR 4x).12 A port stuck at State: Init means the SM has not finished configuring it; Physical state: Polling or Disabled means no link partner / an administratively down port. iblinkinfo exposes width/speed downgrades a per-port check can hide: a 4X link that came up at 1X, or NDR negotiated down to a lower rate.
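The per-port criteria above (State, Physical state, Rate) can be checked mechanically. The sketch below parses ibstat text from stdin and lists every port that is not Active/LinkUp at the expected rate. check_ibstat is a hypothetical helper name, and the parser assumes the "CA '<name>'" / "Port <n>:" / "State:" / "Physical state:" / "Rate:" layout that ibstat normally prints.

```shell
#!/bin/sh
# Sketch only: check_ibstat is a hypothetical helper. Reads `ibstat` output on stdin.
check_ibstat() {
    # $1 = expected Rate in Gb/s (e.g. 400 for NDR 4x)
    awk -v want="$1" '
        function report() {
            if (port != "" && (state != "Active" || phys != "LinkUp" || rate != want)) {
                printf "%s port %s: state=%s phys=%s rate=%s\n", ca, port, state, phys, rate
                bad = 1
            }
            port = ""
        }
        $1 == "CA" { report(); ca = $2; gsub(/[^A-Za-z0-9_]/, "", ca) }
        $1 == "Port" && $2 ~ /^[0-9]+:$/ {
            report(); port = $2; sub(/:/, "", port)
            state = phys = rate = "?"
        }
        $1 == "State:" { state = $2 }
        $1 == "Physical" && $2 == "state:" { phys = $3 }
        $1 == "Rate:" { rate = $2 }
        END { report(); exit bad + 0 }
    '
}

# On a node: ibstat | check_ibstat 400 || echo "link check failed"
```

This only covers the local host's ports; width/speed downgrades elsewhere in the fabric still need iblinkinfo.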
Run the fabric diagnostic and clear all errors before application tests:
ibdiagnet # discover the fabric and write the diagnostic report; review its errors and warnings summary
ibdiagnet is NVIDIA's primary fabric diagnostic; it discovers the fabric, reports link errors and configuration dumps, and validates unicast/adaptive/multicast routing for credit-loop-free correctness. It ships in the ibutils2 package within MLNX_OFED/DOCA and UFM.13 Treat any non-zero error counters or routing warnings as a stop: fix them before trusting a bandwidth number.
4. Point-to-point bandwidth & latency
Use perftest (the linux-rdma/perftest verbs benchmarks) between two hosts to measure raw RDMA bandwidth and latency. Each tool runs as a server (no host argument) on one host and as a client (given the server's address) on the other:14
ib_write_bw -d mlx5_0 -F            # Host A (server): RDMA write bandwidth
ib_write_bw -d mlx5_0 -F <hostA>    # Host B (client)
ib_read_bw -d mlx5_0 -F             # Host A (server): RDMA read bandwidth
ib_read_bw -d mlx5_0 -F <hostA>     # Host B (client)
ib_write_lat -d mlx5_0 -F           # Host A (server): RDMA write latency
ib_write_lat -d mlx5_0 -F <hostA>   # Host B (client)
Useful flags (all from the perftest README): -d <dev> selects the IB device; -i <port> the port; -F suppresses the cpufreq governor warning; -a sweeps message sizes from 2 up to 2^23 bytes; -b measures bidirectional bandwidth; -R connects QPs via rdma_cm (needed for RoCE); -q <n> runs multiple QPs; -D <sec> runs for a fixed duration.14
Reading the result. ib_*_bw prints BW peak, BW average, and MsgRate. By default bandwidth is reported in MB/sec; perftest's --report_gbits option reports Gbit/sec directly, otherwise convert by hand (1 GB/s = 8 Gb/s).15 Compare the large-message BW average against the port line rate: a single QP rarely saturates the link, so achieved bandwidth sits below the nominal rate (NDR 400 Gb/s is ~50 GB/s line rate; the practical achieved figure is lower from protocol overhead, see the Expected numbers table). A link delivering only a small fraction of line rate is the signature of a width/speed downgrade (Section 3) or a PCIe bottleneck on the host, not of perftest itself. ib_*_lat reports min/max/typical latency in microseconds; sub-microsecond to low-single-digit-microsecond latency is normal for a clean IB link.
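The comparison against line rate is simple arithmetic, sketched here. pct_of_line_rate is a hypothetical helper, and 48000 MB/sec is an illustrative input, not a measured result. The conversion assumes 1 MB = 10^6 bytes; if your perftest build reports MiB (2^20 bytes), scale the input by 1.048576 first.

```shell
#!/bin/sh
# Sketch only: pct_of_line_rate is a hypothetical helper.
# $1 = perftest "BW average" in MB/sec, $2 = port line rate in Gb/s.
pct_of_line_rate() {
    awk -v mb="$1" -v line="$2" 'BEGIN {
        gbps = mb * 8 / 1000                      # MB/s -> Gb/s (decimal units assumed)
        printf "%.1f Gb/s = %.1f%% of %s Gb/s\n", gbps, 100 * gbps / line, line
    }'
}

pct_of_line_rate 48000 400   # prints: 384.0 Gb/s = 96.0% of 400 Gb/s
```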
5. GPUDirect RDMA verification
GPUDirect RDMA (GDR) lets the NIC DMA directly to/from GPU memory, bypassing a host bounce buffer. perftest exercises it with --use_cuda, which is supported by ib_write_bw, ib_read_bw, ib_send_bw, ib_read_lat, and ib_send_lat.16 Build perftest with CUDA support first:
# Build perftest with GPUDirect support (CUDA build).
./autogen.sh
./configure CUDA_H_PATH=/usr/local/cuda/include/cuda.h # release 25.07+ auto-detects cuda.h
make -j
Run the same server/client pair, pinning each side to a GPU. --use_cuda=<gpu_index> registers a GPU buffer for the transfer; add --use_cuda_dmabuf to use the dma-buf path instead of nvidia-peermem:16
ib_write_bw -d mlx5_0 -F --use_cuda=0 -a            # Host A (server)
ib_write_bw -d mlx5_0 -F --use_cuda=0 -a <hostA>    # Host B (client)
Confirm GDR is engaged, not bypassed. A CUDA-enabled run prints CUDA initialisation and a GPU-buffer allocation line such as cuMemAlloc() of a <N> bytes GPU buffer: the memory region is on the GPU, so the path is GPU-to-NIC.17 If you instead see Couldn't allocate MR, GDR failed to register the GPU buffer; the documented workaround is to disable Scatter-to-CQE by prefixing the command with MLX5_SCATTER_TO_CQE=0.16 A GDR-engaged transfer should land near the host's non-GPU ib_write_bw figure; if it is dramatically lower, the GPU-to-NIC traffic is probably taking a slow PCIe path through the CPU root complex instead of a direct peer-to-peer route (often ACS left enabled, see performance optimization).
6. Collective benchmarks with nccl-tests
perftest proves the wire; nccl-tests proves the application-level collective path the trainer actually uses. Build from NVIDIA/nccl-tests:18
# Single-node build (links against an installed NCCL + CUDA).
make CUDA_HOME=/usr/local/cuda NCCL_HOME=/usr/local/nccl
# Multi-node build with MPI for the launcher:
make MPI=1 MPI_HOME=/path/to/mpi CUDA_HOME=/usr/local/cuda NCCL_HOME=/usr/local/nccl
Run an all-reduce sweep across the GPUs:
# Sweep 8 bytes -> 8 GiB, doubling each step, across N GPUs in this process.
./build/all_reduce_perf -b 8 -e 8G -f 2 -g <N>
Arguments (from the nccl-tests README): -b min size, -e max size, -f step factor, -g GPUs per thread, -t threads per process, -n iterations, -w warmup iterations.18 For multi-node, launch one rank per GPU under mpirun/Slurm with -g 1 rather than a large -g.
Read busbw, not just algbw. The output prints, per size, time plus algbw (algorithm bandwidth = size / time) and busbw (bus bandwidth). For AllReduce, busbw = algbw * 2*(n-1)/n where n is the number of ranks; busbw applies that correction so the figure reflects hardware utilisation independent of rank count and can be compared against the interconnect peak.19 busbw is the number you hold against the topology expectation: on an 8-GPU NVLink node it should approach the NVLink bus bandwidth at large sizes; across nodes it is bounded by the per-port IB/RoCE rate.
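A worked instance of the AllReduce correction makes the relationship concrete. allreduce_busbw is a hypothetical helper that applies busbw = algbw * 2*(n-1)/n; the 100 GB/s input is an illustrative value, not a measurement.

```shell
#!/bin/sh
# Sketch only: allreduce_busbw is a hypothetical helper.
# $1 = algbw in GB/s, $2 = number of ranks n.
allreduce_busbw() {
    awk -v algbw="$1" -v n="$2" 'BEGIN { printf "%.2f\n", algbw * 2 * (n - 1) / n }'
}

allreduce_busbw 100 8   # prints 175.00 (factor 2*7/8 = 1.75)
allreduce_busbw 100 2   # prints 100.00 (factor 2*1/2 = 1)
```

The factor tends to 2 as n grows, which is why busbw rather than algbw is the figure to hold against the link peak.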
Confirm the transport. Set NCCL_DEBUG=INFO to log the selected transport. With GDR active NCCL logs GPU Direct RDMA Enabled for GPU <id> / HCA <id> and connection lines of the form NET/IB/<n>/GDRDMA; absence of GDRDMA (or a NET/Socket line) means it fell back to TCP/host-staged copies, typically a misconfigured HCA or interface.20 Key NCCL env vars for the fabric:21
- NCCL_IB_HCA filters which HCAs NCCL uses, e.g. mlx5 or =mlx5_0:1,mlx5_1:1.
- NCCL_NET_GDR_LEVEL sets the topological cutoff for using GDR (LOC, PIX, PXB, PHB, SYS); SYS is the most permissive.
- NCCL_IB_GID_INDEX is the RoCE GID index (Section 8); default -1.
- NCCL_SOCKET_IFNAME restricts the bootstrap/TCP interface, e.g. =eth0, ^docker.
- NCCL_IB_DISABLE=1 forces IP sockets (diagnostic only; expect a large slowdown).
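The transport check on an NCCL_DEBUG=INFO log can be scripted. This is a minimal sketch: nccl_transport is a hypothetical helper, and it assumes connection lines contain NET/IB/<n>/GDRDMA, NET/IB/<n>, or NET/Socket as described above. Log wording can change between NCCL releases, so treat the patterns as assumptions to verify.

```shell
#!/bin/sh
# Sketch only: nccl_transport is a hypothetical helper.
# Reads an NCCL_DEBUG=INFO log on stdin and classifies the network transport.
nccl_transport() {
    awk '
        /NET\/IB\/[0-9]+\/GDRDMA/ { gdr++ }
        /NET\/IB\//               { ib++ }
        /NET\/Socket/             { sock++ }
        END {
            if (sock)     print "socket-fallback"
            else if (gdr) print "ib-gdrdma"
            else if (ib)  print "ib-without-gdr"
            else          print "no-net-transport-logged"
        }
    '
}

# Example: NCCL_DEBUG=INFO ./build/all_reduce_perf -b 8 -e 8G -f 2 -g 1 2>&1 | nccl_transport
```

A single-node run over NVLink logs no NET lines at all, so "no-net-transport-logged" is expected there and only meaningful for multi-node runs.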
7. NVLink / NVSwitch validation
On SXM datacenter parts, validate the scale-up NVLink fabric separately from the NIC fabric.
nvidia-smi nvlink --status # per-link state; if Active, the link's rated bandwidth is shown
nvidia-smi nvlink --capabilities # per-link capabilities
nvidia-smi nvlink -gt d # throughput counters: tx/rx data payload in KiB (-gt r adds overhead)
nvidia-smi nvlink --status (-s) reports each link's state and, when active, its rated bandwidth, not live throughput. --capabilities (-c) lists link capabilities. -gt/--getthroughput reads the throughput counters; the d argument selects tx/rx data payload in KiB (r adds protocol overhead). -gt supersedes the deprecated -g counter interface.22 A link reporting inactive, or a node showing fewer links than the GPU's NVLink count, points at a seated-but-degraded board or a baseboard fault.
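Counting active links per GPU can be automated. The sketch below is hypothetical: nvlink_short is an invented helper name, and the parser assumes nvidia-smi nvlink --status prints a "GPU <n>: ..." header followed by "Link <k>: <rate> GB/s" for active links (inactive links carry no GB/s figure). Check that assumption against the output of your driver version before relying on it.

```shell
#!/bin/sh
# Sketch only: nvlink_short is a hypothetical helper.
# Reads `nvidia-smi nvlink --status` output on stdin.
nvlink_short() {
    # $1 = expected active links per GPU (e.g. 12 on A100, per the table on this page)
    awk -v want="$1" '
        function report() {
            if (gpu != "" && active != want) {
                printf "GPU %s: %d/%d links active\n", gpu, active, want
                bad = 1
            }
        }
        $1 == "GPU" { report(); gpu = $2; sub(/:$/, "", gpu); active = 0 }
        $1 == "Link" && /GB\/s/ { active++ }
        END { report(); exit bad + 0 }
    '
}

# On a node: nvidia-smi nvlink --status | nvlink_short 12 || echo "NVLink degraded"
```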
Measure realised NVLink bandwidth with nvbandwidth (NVIDIA/nvbandwidth), which benchmarks memcpy patterns across links using copy-engine (CE) or SM copy methods:23
# Build (needs CUDA, CMake 3.20+, Boost program_options).
sudo apt-get install -y libboost-program-options-dev
cmake .
make
./nvbandwidth -l # list available testcases
./nvbandwidth # run all tests
./nvbandwidth -t device_to_device_memcpy_read_ce # a single device-to-device test
The full test list varies by CUDA version (run -l on the build for the authoritative set); device-to-device read/write CE tests are the NVLink bandwidth checks.23
NVSwitch systems (HGX/DGX 8-GPU baseboards, NVL72 racks) require Fabric Manager to program the GPUs into one NVLink domain. Confirm it is running:24
systemctl status nvidia-fabricmanager # service must be active (running) before any CUDA application starts
Fabric Manager (nv-fabricmanager) is mandatory on NVSwitch baseboards: if an application launches before FM has initialised the system, or FM fails to initialise it, CUDA initialisation fails with cudaErrorSystemNotReady. FM also checks the loaded driver-stack version at startup and aborts on an incompatible version, so FM and the driver must be the matching release.24 A stopped or version-mismatched FM is a common "GPUs present but unusable" failure on these systems (reliability and RAS).
8. RoCE specifics
On Ethernet fabrics (Spectrum-X / Spectrum-4, with the BlueField-3 SuperNIC) RDMA runs as RoCEv2, RDMA over UDP, destination port 4791.25 RoCE needs a (near-)lossless network: PFC (Priority Flow Control, IEEE 802.1Qbb, per-priority PAUSE to avoid drops) plus ECN (Explicit Congestion Notification, where switches mark the IP-header ECN bits on congestion). DCQCN is the de-facto RoCEv2 congestion-control algorithm and combines ECN-based rate control with PFC.2627 NVIDIA Spectrum-X adds RoCE adaptive routing and end-to-end congestion control for AI Ethernet fabrics.28
Pick the RoCEv2 GID index. NCCL needs NCCL_IB_GID_INDEX set to the RoCEv2 GID, which you read from show_gids:2921
show_gids # lists the GID table: DEV, PORT, INDEX, GID, IPv4, VER, and the net device
Choose the row where VER is v2 and the IPv4 column is populated (the RoCEv2 IPv4 entry); that INDEX is the value for NCCL_IB_GID_INDEX. The NCCL docs state explicitly to consult show_gids to set this variable.21
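The row selection can be scripted. This is a sketch under stated assumptions: rocev2_gid_index is a hypothetical helper, and it assumes show_gids prints whitespace-separated columns DEV, PORT, INDEX, GID, IPv4, VER, net device, so that rows with a populated IPv4 column have seven fields. The device name, port, and addresses in the usage line are placeholders.

```shell
#!/bin/sh
# Sketch only: rocev2_gid_index is a hypothetical helper.
# Reads `show_gids` output on stdin; prints the INDEX of the first RoCEv2 IPv4 GID.
rocev2_gid_index() {
    # $1 = RDMA device (e.g. mlx5_0), $2 = port number (e.g. 1)
    awk -v dev="$1" -v port="$2" '
        # Seven fields means the IPv4 column is populated; $6 is then VER.
        $1 == dev && $2 == port && NF == 7 && $6 == "v2" { print $3; found = 1; exit }
        END { exit !found }
    '
}

# On a RoCE node:
#   export NCCL_IB_GID_INDEX=$(show_gids | rocev2_gid_index mlx5_0 1)
```

The function exits non-zero when no matching row exists, which usually means the interface has no IPv4 address or the port is not in RoCEv2 mode.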
Lossless config notes (verify against your switch/NIC QoS policy; exact priorities and DSCP marks are site/vendor-specific). NVIDIA's own RoCE setup script configures PFC and trust mode with mlnx_qos:30
# Enable PFC on priority 3 and trust the IP DSCP field (RoCEv2/L3 classification).
mlnx_qos -i <iface> --trust dscp --pfc 0,0,0,1,0,0,0,0
# Set the RoCE version per device/port (-m 2 = RoCEv2):
cma_roce_mode -d mlx5_0 -p 1 -m 2
Keep PFC, ECN, and the DSCP-to-priority mapping consistent end to end across NICs and switches; a mismatch is what turns RoCE into pause storms or silent drops (Failure modes). RoCE/Ethernet uses a different toolchain from IB; NCCL-over-RoCE tuning is in performance optimization.
Expected numbers by generation
Line-rate ceilings and vendor figures. Achieved bandwidth is always lower (single-QP/protocol overhead); use these as the target to compare against, not as a pass threshold.
| Link | Generation / adapter | Per-port / per-GPU bandwidth | Framing |
|---|---|---|---|
| InfiniBand EDR | ConnectX-5 era | 100 Gb/s (4x) | per port, per direction31 |
| InfiniBand HDR | ConnectX-6 | 200 Gb/s (4x, 50G-PAM4/lane) | per port, per direction31 |
| InfiniBand NDR | ConnectX-7 / Quantum-2 | 400 Gb/s (4x, 100G-PAM4/lane) | per port, per direction3132 |
| InfiniBand XDR | ConnectX-8 / Quantum-X800 | 800 Gb/s (4x, 200G-PAM4/lane) | per port, per direction312 |
| NVLink 3 (Ampere/A100) | 3rd-gen | 600 GB/s per GPU | aggregate of both directions (12 links x 25 GB/s/direction x 2)33 |
| NVLink 4 (Hopper/H100) | 4th-gen | 900 GB/s per GPU | aggregate/bidirectional (per-direction ~450 GB/s inferred)34 |
| NVLink 5 (Blackwell/B200) | 5th-gen | 1.8 TB/s (1800 GB/s) per GPU | aggregate; GB200 Superchip 3.6 TB/s across 2 GPUs = 1.8 TB/s per GPU3435 |
Per-direction caveat: NVIDIA publishes the per-GPU totals above. Only the A100 has an NVIDIA-sourced per-direction figure (25 GB/s per link per direction, in the Ampere whitepaper).33 The H100 (~450 GB/s) and B200 (~900 GB/s) per-direction halves are arithmetic (total / 2), not printed on the official pages, so treat them as inferred. Do not confuse the B200 GPU-to-GPU NVLink with NVLink-C2C (the Grace CPU-to-GPU chip-to-chip link, 900 GB/s), which is a different interconnect.10
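The NVLink arithmetic behind the table and the caveat can be reproduced in a few lines. nvlink_aggregate and per_direction are hypothetical helper names; the inputs are the figures quoted above, and the H100/B200 halves remain inferred values, not NVIDIA-published ones.

```shell
#!/bin/sh
# Sketch only: hypothetical helpers reproducing the table arithmetic.
# nvlink_aggregate: links x GB/s per link per direction x 2 directions.
nvlink_aggregate() { echo $(( $1 * $2 * 2 )); }
# per_direction: half of a published per-GPU aggregate (an inference, total / 2).
per_direction() { echo $(( $1 / 2 )); }

nvlink_aggregate 12 25   # A100: prints 600 (GB/s aggregate)
per_direction 900        # H100, inferred: prints 450
per_direction 1800       # B200, inferred: prints 900
```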
Failure modes
- SM not converged. sminfo shows no master, or ports stay State: Init. Nothing below passes; routing is unprogrammed. Bring up exactly one master SM and confirm with sminfo/ibhosts before benchmarking.
- Duplicate subnet managers. Two SMs both claiming master cause routing churn. Keep one master, set others lower priority in opensm.conf/UFM (networking fabric).
- GDR silently bypassed → PCIe fallback. No cuMemAlloc()/GDRDMA line; bandwidth collapses to a host-staged path. Usual cause: nvidia_peermem not loaded, or ACS left enabled routing P2P through the root complex (performance optimization).
- NCCL on TCP fallback. NCCL_DEBUG=INFO shows NET/Socket instead of NET/IB/.../GDRDMA; collectives run an order of magnitude slower. Fix the HCA/interface selection (NCCL_IB_HCA, NCCL_SOCKET_IFNAME) (NCCL hang runbook).
- PFC/ECN misconfig (RoCE). Inconsistent priority/DSCP mapping across NICs and switches causes pause storms (head-of-line blocking) or silent drops. Verify the lossless config end to end with show_gids, mlnx_qos, and switch QoS.
- Firmware mismatch. Adapters on different FW negotiate down, flap, or behave inconsistently. Bring every NIC to one baseline with the MFT tools / mlnx-fw-updater; on NVSwitch systems a driver/FM version drift fails CUDA init outright (Section 7).
- NUMA / PCIe rail misalignment. A GPU bound to a NIC across the wrong NUMA node, or a PCIe link trained down to x8 or an older Gen, can silently halve bandwidth and shows up as a perftest/nccl-tests result well under line rate. Check nvidia-smi topo -m and PCIe link state (performance optimization).
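The PCIe link-state check in the last item can be read from sysfs. The sketch below compares a device's trained speed and width with its maximum using the Linux sysfs attributes current_link_speed, max_link_speed, current_link_width, and max_link_width. pcie_link_check is a hypothetical helper, the PCI address in the usage line is a placeholder, and the maximum reflects the device's own capability, so a slower slot also shows as degraded.

```shell
#!/bin/sh
# Sketch only: pcie_link_check is a hypothetical helper.
# $1 = sysfs device directory, e.g. /sys/bus/pci/devices/0000:09:00.0 (placeholder address)
pcie_link_check() {
    d=$1
    cur_s=$(cat "$d/current_link_speed"); max_s=$(cat "$d/max_link_speed")
    cur_w=$(cat "$d/current_link_width"); max_w=$(cat "$d/max_link_width")
    if [ "$cur_s" = "$max_s" ] && [ "$cur_w" = "$max_w" ]; then
        echo "ok: $cur_s x$cur_w"
    else
        echo "degraded: $cur_s x$cur_w (capable of $max_s x$max_w)"
        return 1
    fi
}

# On a node: pcie_link_check /sys/bus/pci/devices/0000:09:00.0
```

Run it for both the NIC and the GPU it is paired with; either end training down is enough to cap the result.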
References
- linux-rdma perftest (README, flags, --use_cuda): https://github.com/linux-rdma/perftest/blob/master/README
- NVIDIA/nccl-tests (build, args): https://github.com/NVIDIA/nccl-tests/blob/master/README.md
- nccl-tests performance metrics (algbw/busbw, AllReduce formula): https://github.com/NVIDIA/nccl-tests/blob/master/doc/PERFORMANCE.md
- NVIDIA/nvbandwidth (build, tests): https://github.com/NVIDIA/nvbandwidth
- NCCL environment variables (NCCL_IB_HCA, NCCL_NET_GDR_LEVEL, NCCL_IB_GID_INDEX): https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/env.html
- NCCL networking troubleshooting (GPUDirect RDMA, nvidia-peermem, GDRDMA log): https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/troubleshooting/networking_troubleshooting.html
- DOCA-Host installation and upgrade (DOCA-OFED profiles, mlnx-fw-updater): https://networking-docs.nvidia.com/doca/sdk/DOCA-Host-Installation-and-Upgrade
- MLNX_OFED to DOCA-OFED transition guide: https://docs.nvidia.com/doca/sdk/nvidia-mlnx-ofed-to-doca-ofed-transition-guide.pdf
- NVIDIA Firmware Tools (MFT) — mlxfwmanager, flint: https://docs.nvidia.com/networking/display/mft/mlxfwmanager-%E2%80%93-firmware-update-and-query-tool
- ibdiagnet InfiniBand Fabric Diagnostic Tool (ibutils2, routing validation): https://docs.nvidia.com/networking/display/ibdiagnetusermanualv221
- OpenSM man page (-B, opensm.conf): https://linux.die.net/man/8/opensm
- nvidia-smi manual (nvlink --status, --capabilities, -gt): https://docs.nvidia.com/deploy/nvidia-smi/index.html
- NVIDIA NVLink & NVSwitch (per-GPU bandwidth by generation): https://www.nvidia.com/en-us/data-center/nvlink/
- NVIDIA A100 (NVLink 600 GB/s) + Ampere architecture whitepaper (25 GB/s/link/direction): https://www.nvidia.com/en-us/data-center/a100/ , https://images.nvidia.com/aem-dam/en-zz/Solutions/data-center/nvidia-ampere-architecture-whitepaper.pdf
- NVIDIA GB200 NVL72 (NVLink 3.6 TB/s per Superchip): https://www.nvidia.com/en-us/data-center/gb200-nvl72/
- NVIDIA Fabric Manager user guide (NVSwitch scope, cudaErrorSystemNotReady, version check): https://docs.nvidia.com/datacenter/tesla/fabric-manager-user-guide/index.html
- NVIDIA ConnectX-7 NDR 400G adapter datasheet: https://www.nvidia.com/content/dam/en-zz/Solutions/networking/infiniband-adapters/infiniband-connectx7-data-sheet.pdf
- NVIDIA BlueField-3 DPU datasheet (NDR 400G / 400 GbE): https://www.nvidia.com/content/dam/en-zz/Solutions/Data-Center/documents/datasheet-nvidia-bluefield-3-dpu.pdf
- NVIDIA DGX SuperPOD cabling — NDR overview (400 Gb/s = 4x100G): https://docs.nvidia.com/dgx-superpod/design-guide-cabling-data-centers/latest/ndr-overview.html
- NVIDIA Spectrum-X (RoCE, adaptive routing, congestion control): https://www.nvidia.com/en-us/networking/spectrumx/
- NVIDIA RoCE configuration (cma_roce_mode, RoCEv2 UDP 4791): https://networking-docs.nvidia.com/mlnxofedswum/23070512/rdma-over-converged-ethernet-roce
- NVIDIA doroce-linux (mlnx_qos --trust/--pfc, NCCL_IB_GID_INDEX setup): https://github.com/NVIDIA/doroce-linux/blob/main/doRoCE.sh
- Oracle: Benchmark GPUDirect RDMA with ib_write_bw --use_cuda (cuMemAlloc GPU buffer): https://docs.oracle.com/en/learn/gpudirect-rdma-ib-write-bw/index.html
Related: Networking Fabric · GPU software stack · Commissioning & Acceptance · Performance Optimization · Glossary
1. NVIDIA ConnectX-7 operates at 400 Gb/s in InfiniBand NDR and 400 GbE modes. https://www.nvidia.com/content/dam/en-zz/Solutions/networking/infiniband-adapters/infiniband-connectx7-data-sheet.pdf
2. ConnectX-8 SuperNIC: 800 Gb/s XDR InfiniBand or 2x400 GbE, PCIe Gen6 (networking fabric, citing NVIDIA Quantum-X800 / ConnectX-8 materials).
3. NVIDIA BlueField-3 DPU: 400 Gb/s Ethernet or NDR 400 Gb/s InfiniBand. https://www.nvidia.com/content/dam/en-zz/Solutions/Data-Center/documents/datasheet-nvidia-bluefield-3-dpu.pdf
4. NCCL enables GPUDirect RDMA when the topology permits and the nvidia-peermem module is loaded; the module exposes GPU memory to the NIC. https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/troubleshooting/networking_troubleshooting.html
5. MLNX_OFED transitioned to DOCA-OFED; the last standalone MLNX_OFED was the October 2024 LTS and from January 2025 new features ship in DOCA-OFED only. https://docs.nvidia.com/doca/sdk/nvidia-mlnx-ofed-to-doca-ofed-transition-guide.pdf
6. DOCA-Host install on deb-based distros: dpkg -i <repo>.deb, apt-get update, apt install -y <profile> (e.g. doca-all, doca-ofed); apt install -y mlnx-fw-updater for firmware. https://networking-docs.nvidia.com/doca/sdk/DOCA-Host-Installation-and-Upgrade
7. ofed_info reports the software versions of each OFED component; -s prints the version string (e.g. MLNX_OFED_LINUX-24.07-...). https://networking-docs.nvidia.com/doca/sdk/DOCA-Host-Installation-and-Upgrade
8. NVIDIA Firmware Tools (MFT): mlxfwmanager -d <pci> --query reports device type/PSID/current vs available firmware. https://docs.nvidia.com/networking/display/mft/mlxfwmanager-%E2%80%93-firmware-update-and-query-tool
9. flint burns/queries firmware on a single device (flint -d <dev> q); mstflint is the open-source MFT subset (mstflint -d <pci> q). https://github.com/Mellanox/mstflint
10. Subnet manager, OpenSM, UFM, and the NVLink-C2C distinction are covered in networking fabric. https://www.nvidia.com/en-us/data-center/nvlink/
11. OpenSM man page: -B runs OpenSM as a daemon; configuration in opensm.conf. sminfo queries the master SM (LID/state); ibhosts lists host CAs. https://linux.die.net/man/8/opensm
12. ibstat reports per-port State (Active/Init/...), Physical state (LinkUp/Polling/...), and Rate; the InfiniBand fabric utilities document these tools. https://docs.nvidia.com/networking/display/MLNXOFEDv531001/InfiniBand+Fabric+Utilities
13. ibdiagnet is a primary InfiniBand fabric discovery/diagnostic tool; it reports link errors and validates unicast/adaptive/multicast routing, and is distributed in the ibutils2 package within MLNX_OFED/UFM. https://docs.nvidia.com/networking/display/ibdiagnetusermanualv221
14. linux-rdma perftest README: server (no host arg) + client (server address) model; -d device, -i port, -F cpufreq, -a size sweep to 2^23, -b bidirectional, -R rdma_cm, -q QPs, -D duration. https://github.com/linux-rdma/perftest/blob/master/README
15. perftest reports bandwidth in MB/sec by default; its --report_gbits option reports Gbit/sec instead. https://github.com/linux-rdma/perftest/blob/master/README
16. perftest CUDA build: ./autogen.sh && ./configure CUDA_H_PATH=/usr/local/cuda/include/cuda.h && make -j (25.07+ auto-detects); --use_cuda=<gpu_index> is supported by ib_write_bw, ib_read_bw, ib_send_bw, ib_read_lat, ib_send_lat; --use_cuda_dmabuf uses the dma-buf path; MLX5_SCATTER_TO_CQE=0 works around "Couldn't allocate MR". https://github.com/linux-rdma/perftest/blob/master/README
17. A CUDA-enabled ib_write_bw run prints CUDA init and cuMemAlloc() of a <N> bytes GPU buffer, confirming the buffer is in GPU memory (GDR path). https://docs.oracle.com/en/learn/gpudirect-rdma-ib-write-bw/index.html
18. NVIDIA/nccl-tests build: make CUDA_HOME=... NCCL_HOME=... (add MPI=1 MPI_HOME=... for MPI); args -b min, -e max, -f step factor, -g GPUs/thread, -t threads, -n iters, -w warmup. https://github.com/NVIDIA/nccl-tests/blob/master/README.md
19. nccl-tests PERFORMANCE.md: algbw = size/time; for AllReduce busbw = algbw * 2*(n-1)/n; busbw reflects hardware utilisation independent of rank count, for comparison against hardware peak. https://github.com/NVIDIA/nccl-tests/blob/master/doc/PERFORMANCE.md
20. With GDR active NCCL logs GPU Direct RDMA Enabled for GPU <id> / HCA <id> and connections of the form NET/IB/<n>/GDRDMA. https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/troubleshooting/networking_troubleshooting.html
21. NCCL env vars: NCCL_IB_HCA filters HCAs; NCCL_NET_GDR_LEVEL sets the GDR topological cutoff (LOC/PIX/PXB/PHB/SYS); NCCL_IB_GID_INDEX is the RoCE GID index (default -1, "see the InfiniBand show_gids command"); NCCL_SOCKET_IFNAME restricts interfaces; NCCL_IB_DISABLE=1 forces IP sockets. https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/env.html
22. nvidia-smi manual: nvlink -s/--status (state + rated bandwidth if active), -c/--capabilities, -gt/--getthroughput with arg d (tx/rx data payload in KiB) or r (adds overhead); -gt supersedes the deprecated -g counters. https://docs.nvidia.com/deploy/nvidia-smi/index.html
23. NVIDIA/nvbandwidth measures memcpy bandwidth across links (CE or SM copy); build needs CUDA, CMake 3.20+, and Boost program_options; cmake . && make; -l lists tests, -t <name> runs one (e.g. device_to_device_memcpy_read_ce). https://github.com/NVIDIA/nvbandwidth
24. Fabric Manager user guide: systemctl status nvidia-fabricmanager; required on NVSwitch DGX/HGX systems; CUDA init fails with cudaErrorSystemNotReady if launched before FM initialises the system; FM checks the loaded driver-stack version and aborts on incompatibility. https://docs.nvidia.com/datacenter/tesla/fabric-manager-user-guide/index.html
25. NVIDIA RoCE doc: RoCEv2 uses UDP on dedicated port 4791; cma_roce_mode -d <dev> -p <port> -m <1|2> sets the RoCE version (2 = RoCEv2). https://networking-docs.nvidia.com/mlnxofedswum/23070512/rdma-over-converged-ethernet-roce
26. NVIDIA QoS-for-RoCE / RoCEv2 congestion management docs: PFC (802.1Qbb) pauses per-priority to avoid drops; ECN marks the IP-header ECN bits on congestion. https://enterprise-support.nvidia.com/s/article/understanding-qos-configuration-for-roce
27. DCQCN is the de-facto RoCEv2 congestion-control algorithm, combining ECN-based rate control with PFC. https://www.wwt.com/article/understanding-data-center-quantized-congestion-notification-dcqcn
28. NVIDIA Spectrum-X: RoCE adaptive routing and end-to-end congestion control for AI Ethernet, pairing the Spectrum-4 switch with the BlueField-3 SuperNIC. https://www.nvidia.com/en-us/networking/spectrumx/
29. show_gids lists the GID table (DEV/PORT/INDEX/GID/IPv4/VER); the RoCEv2 IPv4 GID is the row with VER=v2 and a populated IPv4; its INDEX is used for NCCL_IB_GID_INDEX. https://docs.cloud.google.com/ai-hypercomputer/docs/nccl/collect-and-understand
30. NVIDIA's doRoCE.sh configures lossless RoCE with mlnx_qos -i <iface> --trust <mode> --pfc <mask> and writes NCCL_IB_GID_INDEX/NCCL_IB_TC; exact priorities/DSCP are site-specific. https://github.com/NVIDIA/doroce-linux/blob/main/doRoCE.sh
31. InfiniBand 4x per-port rates: EDR 100, HDR 200 (50G-PAM4/lane), NDR 400 (100G-PAM4/lane), XDR 800 (200G-PAM4/lane) Gb/s, per port per direction (Glossary, networking fabric).
32. NVIDIA DGX SuperPOD cabling guide: NDR = 400 Gbps (four lanes of 100 Gbps). https://docs.nvidia.com/dgx-superpod/design-guide-cabling-data-centers/latest/ndr-overview.html
33. NVIDIA A100 product page lists NVLink 600 GB/s; the Ampere architecture whitepaper states each link carries 25 GB/s per direction and A100 has 12 links, totalling 600 GB/s (aggregate of both directions). https://www.nvidia.com/en-us/data-center/a100/ , https://images.nvidia.com/aem-dam/en-zz/Solutions/data-center/nvidia-ampere-architecture-whitepaper.pdf
34. NVIDIA NVLink page: 4th-gen (H100) = 900 GB/s per GPU; 5th-gen (B200) = 1,800 GB/s per GPU. Per-direction halves are inferred (total / 2), not printed by NVIDIA. https://www.nvidia.com/en-us/data-center/nvlink/
35. NVIDIA GB200 NVL72 page lists NVLink bandwidth 3.6 TB/s for the 2-GPU GB200 Superchip = 1.8 TB/s per GPU. https://www.nvidia.com/en-us/data-center/gb200-nvl72/