Markdown

Fabric bring-up, validation & benchmarking¶

Scope: the shared procedure for bringing a GPU interconnect up, proving it healthy, and benchmarking it to line rate (InfiniBand, RoCE, and NVLink) before any workload runs.

Reference templates, drawn from NVIDIA DOCA/MLNX_OFED docs, the linux-rdma/perftest project, and the NVIDIA/nccl-tests and NVIDIA/nvbandwidth repos. Nothing here was executed on hardware. Pin versions to your DOCA/driver release, substitute real device names, and validate on one link/node pair before a fleet roll. Numbers are line-rate ceilings or vendor figures; achieved bandwidth is always lower (protocol overhead).

flowchart LR
  STACK["1-2 Install stack and SM"] --> LINKS["3 Verify links and topology"]
  LINKS --> P2P["4 Point-to-point BW and latency"]
  P2P --> GDR["5 GPUDirect RDMA"]
  GDR --> NCCL["6 nccl-tests collectives"]
  NCCL --> NVL["7 NVLink and NVSwitch"]
  NVL --> ROCE["8 RoCE specifics"]

What this covers¶

This is the procedure the per-GPU hardware pages reference instead of repeating networking ops. It takes a built fabric, cabled, powered, drivers present (Ansible bring-up), and proves it: every link up at the right speed, the subnet manager converged, point-to-point bandwidth at line rate, GPUDirect RDMA actually engaged, collectives at the expected bus bandwidth, and NVLink intact on NVSwitch systems. Use it at commissioning (commissioning and acceptance), after any driver/firmware change (driver upgrade runbook), and as the first triage when a "GPU" problem might be a degraded link.

The ordering is deliberate and bottom-up: do not run a collective benchmark before the fabric is clean, or a network fault gets misread as a GPU fault. Concepts and topology live in networking fabric; this page is the runnable counterpart.

Prerequisites¶

GPU driver + datacenter stack present and healthy: nvidia-smi lists every GPU, persistence mode on, and on NVSwitch systems Fabric Manager active (Section 7). See GPU software stack.
NIC family. The fabric generation pairs with the adapter and switch:
ConnectX-6. HDR InfiniBand / 200 GbE.
ConnectX-7. NDR InfiniBand 400 Gb/s or 400 GbE.¹
ConnectX-8 SuperNIC. XDR InfiniBand 800 Gb/s or 2x400 GbE, PCIe Gen6.²
BlueField-3. DPU/SuperNIC at NDR 400 Gb/s IB or 400 GbE; the SuperNIC variant is the host adapter in Spectrum-X.³
GPUDirect RDMA path. The nvidia-peermem kernel module exposes GPU memory to the NIC for RDMA; load it (or use the newer dma-buf path) before GDR tests. NCCL enables GDR automatically when the topology permits and the module is present.⁴ GPUDirect RDMA is a datacenter/workstation-pro capability, not available on GeForce (networking fabric).

# Confirm the peer-memory module is loaded (GPUDirect RDMA prerequisite)
lsmod | grep nvidia_peermem || sudo modprobe nvidia_peermem

1. Install the fabric stack¶

NVIDIA's host RDMA stack is DOCA-OFED (the successor to MLNX_OFED; the last standalone MLNX_OFED was the October 2024 LTS, and from January 2025 new features ship in DOCA-OFED only).⁵ Install on Ubuntu from the DOCA-Host repo, then pick a profile.⁶

# DOCA-Host on Ubuntu: install the repo package, refresh, install a profile.
sudo dpkg -i doca-host_<version>_ubuntu<rel>_amd64.deb   # repo package from NVIDIA
sudo apt-get update
sudo apt-get install -y doca-ofed                          # full RDMA stack profile
# doca-ofed-userspace installs userspace only (no DKMS kernel modules);
# doca-all is the everything profile. Choose per node role.
sudo apt-get install -y mlnx-fw-updater                    # firmware updater (pulls matching FW)

Verify the stack and the kernel module:

ofed_info -s                 # prints the DOCA/MLNX_OFED version string, e.g. MLNX_OFED_LINUX-24.07-...
modinfo mlx5_core            # confirm the driver: version, srcversion, signature
ibv_devinfo                  # RDMA devices, fw_ver, port state, link_layer (InfiniBand|Ethernet)

ofed_info -s prints the installed OFED/DOCA version; ofed_info (no flag) lists every component version.⁷ modinfo mlx5_core confirms the loaded ConnectX driver and its firmware/signing metadata.

Firmware. Query and, if needed, update adapter firmware with the MFT tools (mft package), or rely on mlnx-fw-updater above to apply the version matched to the installed stack:⁸⁹

mlxfwmanager -d <pci_id> --query     # device type, PSID, current vs available FW, e.g. -d 09:00.0
flint -d <pci_id> q                  # low-level query (flint burns/queries a single device)
mstflint -d <pci_id> q               # open-source MFT subset, equivalent query

Firmware mismatch across a fleet is a classic source of links that negotiate down or flap (Section: Failure modes). Bring every adapter to the same qualified baseline.

2. Subnet manager¶

An InfiniBand fabric needs exactly one active subnet manager (SM) to initialise the fabric, assign LIDs, and program routing. The SM runs on a managed switch, in UFM (NVIDIA's Unified Fabric Manager, where the SM is one component), or as OpenSM on a Linux host.¹⁰ On managed Quantum switches the SM typically lives on the switch; on unmanaged fabrics you run OpenSM on a host.

# Run OpenSM as a daemon (or enable the opensmd service instead).
sudo opensm -B                     # -B = run in background (daemon)
# OpenSM reads /etc/opensm/opensm.conf; set SM priority there for HA with multiple SMs.

Confirm the SM is up and singular:

sminfo                  # queries the SM: prints sm lid, sm state (e.g. SMINFO_MASTER), priority
ibhosts                 # lists every host (CA) node the SM sees on the fabric

sminfo reports the master SM's LID and state; ibhosts enumerates the discovered host channel adapters.¹¹ If sminfo shows no master, no SM has converged and nothing below will pass. If two SMs both claim master, they will fight over routing; keep one master and set the rest to lower priority (UFM/OpenSM is detailed in networking fabric).

3. Verify links & topology¶

Prove every link is up at the expected width and speed, and the topology matches the design, before any benchmark.

ibstat                  # per-port: State, Physical state, Rate, Base lid, link_layer
ibstatus                # compact per-port state/rate/link_layer
iblinkinfo              # every link in the fabric with width/speed (e.g. 4X NDR) and peer
ibnetdiscover           # full topology dump: switches, CAs, and the links between them

Healthy ibstat for an active IB port shows State: Active, Physical state: LinkUp, and the expected Rate (e.g. Rate: 400 for NDR 4x).¹² A port stuck at State: Init means the SM has not finished configuring it; Physical state: Polling or Disabled means no link partner / an administratively down port. iblinkinfo exposes width/speed downgrades a per-port check can hide: a 4X link that came up at 1X, or NDR negotiated down to a lower rate.

Run the fabric diagnostic and clear all errors before application tests:

ibdiagnet               # full fabric scan: link errors, speed/width, routing, SM, cable/PSU health

ibdiagnet is NVIDIA's primary fabric diagnostic; it discovers the fabric, reports link errors and configuration dumps, and validates unicast/adaptive/multicast routing for credit-loop-free correctness. It ships in the ibutils2 package within MLNX_OFED/DOCA and UFM.¹³ Treat any non-zero error counters or routing warnings as a stop: fix them before trusting a bandwidth number.

4. Point-to-point bandwidth & latency¶

Use perftest (the linux-rdma/perftest verbs benchmarks) between two hosts to measure raw RDMA bandwidth and latency. Each tool is a server (no host argument) and a client (the server's address):¹⁴

# Host A (server)              # Host B (client)
ib_write_bw  -d mlx5_0 -F      ib_write_bw  -d mlx5_0 -F <hostA>   # RDMA write bandwidth
ib_read_bw   -d mlx5_0 -F      ib_read_bw   -d mlx5_0 -F <hostA>   # RDMA read bandwidth
ib_write_lat -d mlx5_0 -F      ib_write_lat -d mlx5_0 -F <hostA>   # RDMA write latency

Useful flags (all from the perftest README): -d <dev> selects the IB device; -i <port> the port; -F suppresses the cpufreq governor warning; -a sweeps message sizes from 2 up to 2^23 bytes; -b measures bidirectional bandwidth; -R connects QPs via rdma_cm (needed for RoCE); -q <n> runs multiple QPs; -D <sec> runs for a fixed duration.¹⁴

Reading the result. ib_*_bw prints BW peak, BW average, and MsgRate. By default bandwidth is reported in MB/sec; there is no documented --report_gbits flag in the README, so convert if you want Gb/s (1 GB/s = 8 Gb/s).¹⁵ Compare the large-message BW average against the port line rate: a single QP rarely saturates the link, so achieved bandwidth sits below the nominal rate (NDR 400 Gb/s is ~50 GB/s line rate; the practical achieved figure is lower from protocol overhead, see the Expected numbers table). A link delivering a small fraction of line rate is the signature of a width/speed downgrade (Section 3) or a PCIe bottleneck on the host, not of perftest itself. ib_*_lat reports min/max/typical latency in microseconds; sub-microsecond to low-single-digit-microsecond latency is normal for a clean IB link.

5. GPUDirect RDMA verification¶

GPUDirect RDMA (GDR) lets the NIC DMA directly to/from GPU memory, bypassing a host bounce buffer. perftest exercises it with --use_cuda, which is supported by ib_write_bw, ib_read_bw, ib_send_bw, ib_read_lat, and ib_send_lat.¹⁶ Build perftest with CUDA support first:

# Build perftest with GPUDirect support (CUDA build).
./autogen.sh
./configure CUDA_H_PATH=/usr/local/cuda/include/cuda.h   # release 25.07+ auto-detects cuda.h
make -j

Run the same server/client pair, pinning each side to a GPU. --use_cuda=<gpu_index> registers a GPU buffer for the transfer; add --use_cuda_dmabuf to use the dma-buf path instead of nvidia-peermem:¹⁶

# Host A (server)                                  # Host B (client)
ib_write_bw -d mlx5_0 -F --use_cuda=0 -a           ib_write_bw -d mlx5_0 -F --use_cuda=0 -a <hostA>

Confirm GDR is engaged, not bypassed. A CUDA-enabled run prints CUDA initialisation and a GPU-buffer allocation line such as cuMemAlloc() of a <N> bytes GPU buffer: the memory region is on the GPU, so the path is GPU-to-NIC.¹⁷ If you instead see Couldn't allocate MR, GDR failed to register the GPU buffer; the documented workaround is to disable Scatter-to-CQE: prefix the command with MLX5_SCATTER_TO_CQE=0.¹⁶ A GDR-engaged transfer should land near the host's non-GPU ib_write_bw figure; if it is dramatically lower, the path likely fell back to a host bounce buffer over PCIe (often ACS left enabled, see performance optimization).

6. Collective benchmarks with nccl-tests¶

perftest proves the wire; nccl-tests proves the application-level collective path the trainer actually uses. Build from NVIDIA/nccl-tests:¹⁸

# Single-node build (links against an installed NCCL + CUDA).
make CUDA_HOME=/usr/local/cuda NCCL_HOME=/usr/local/nccl
# Multi-node build with MPI for the launcher:
make MPI=1 MPI_HOME=/path/to/mpi CUDA_HOME=/usr/local/cuda NCCL_HOME=/usr/local/nccl

Run an all-reduce sweep across the GPUs:

# Sweep 8 bytes -> 8 GiB, doubling each step, across N GPUs in this process.
./build/all_reduce_perf -b 8 -e 8G -f 2 -g <N>

Arguments (from the nccl-tests README): -b min size, -e max size, -f step factor, -g GPUs per thread, -t threads per process, -n iterations, -w warmup iterations.¹⁸ For multi-node, launch one rank per GPU under mpirun/Slurm with -g 1 rather than a large -g.

Read busbw, not just algbw. The output prints, per size, time plus algbw (algorithm bandwidth = size / time) and busbw (bus bandwidth). For AllReduce, busbw = algbw * 2*(n-1)/n where n is the number of ranks; busbw applies that correction so the figure reflects hardware utilisation independent of rank count and can be compared against the interconnect peak.¹⁹ busbw is the number you hold against the topology expectation: on an 8-GPU NVLink node it should approach the NVLink bus bandwidth at large sizes; across nodes it is bounded by the per-port IB/RoCE rate.

Confirm the transport. Set NCCL_DEBUG=INFO to log the selected transport. With GDR active NCCL logs GPU Direct RDMA Enabled for GPU <id> / HCA <id> and connection lines of the form NET/IB/<n>/GDRDMA; absence of GDRDMA (or a NET/Socket line) means it fell back to TCP/host-staged copies, typically a misconfigured HCA or interface.²⁰ Key NCCL env vars for the fabric:²¹

NCCL_IB_HCA filters which HCAs NCCL uses, e.g. mlx5 or =mlx5_0:1,mlx5_1:1.
NCCL_NET_GDR_LEVEL sets the topological cutoff for using GDR (LOC, PIX, PXB, PHB, SYS); SYS is the most permissive.
NCCL_IB_GID_INDEX is the RoCE GID index (Section 8); default -1.
NCCL_SOCKET_IFNAME restricts the bootstrap/TCP interface, e.g. =eth0, ^docker.
NCCL_IB_DISABLE=1 forces IP sockets (diagnostic only; expect a large slowdown).

7. NVLink / NVSwitch validation¶

On SXM datacenter parts, validate the scale-up NVLink fabric separately from the NIC fabric.

nvidia-smi nvlink --status        # per-link state; if Active, the link's rated bandwidth is shown
nvidia-smi nvlink --capabilities  # per-link capabilities
nvidia-smi nvlink -gt d           # throughput counters: tx/rx data payload in KiB (-gt r adds overhead)

nvidia-smi nvlink --status (-s) reports each link's state and, when active, its rated bandwidth, not live throughput. --capabilities (-c) lists link capabilities. -gt/--getthroughput reads the throughput counters; the d argument selects tx/rx data payload in KiB (r adds protocol overhead). -gt supersedes the deprecated -g counter interface.²² A link reporting inactive, or a node showing fewer links than the GPU's NVLink count, points at a seated-but-degraded board or a baseboard fault.

Measure realised NVLink bandwidth with nvbandwidth (NVIDIA/nvbandwidth), which benchmarks memcpy patterns across links using copy-engine (CE) or SM copy methods:²³

# Build (needs CUDA, CMake 3.20+, Boost program_options).
sudo apt-get install -y libboost-program-options-dev
cmake .
make
./nvbandwidth -l                                   # list available testcases
./nvbandwidth                                      # run all tests
./nvbandwidth -t device_to_device_memcpy_read_ce   # a single device-to-device test

The full test list varies by CUDA version (run -l on the build for the authoritative set); device-to-device read/write CE tests are the NVLink bandwidth checks.²³

NVSwitch systems (HGX/DGX 8-GPU baseboards, NVL72 racks) require Fabric Manager to program the GPUs into one NVLink domain. Confirm it is running:²⁴

sudo systemctl status nvidia-fabricmanager     # service for nv-fabricmanager

Fabric Manager (nv-fabricmanager) is mandatory on NVSwitch baseboards: if an application launches before FM has initialised the system, or FM fails to initialise it, CUDA initialisation fails with cudaErrorSystemNotReady. FM also checks the loaded driver-stack version at startup and aborts on an incompatible version, so FM and the driver must be the matching release.²⁴ A stopped or version-mismatched FM is a common "GPUs present but unusable" failure on these systems (reliability and RAS).

8. RoCE specifics¶

On Ethernet fabrics (Spectrum-X / Spectrum-4, with the BlueField-3 SuperNIC) RDMA runs as RoCEv2, RDMA over UDP, destination port 4791.²⁵ RoCE needs a (near-)lossless network: PFC (Priority Flow Control, IEEE 802.1Qbb, per-priority PAUSE to avoid drops) plus ECN (Explicit Congestion Notification, where switches mark the IP-header ECN bits on congestion). DCQCN is the de-facto RoCEv2 congestion-control algorithm and combines ECN-based rate control with PFC.²⁶²⁷ NVIDIA Spectrum-X adds RoCE adaptive routing and end-to-end congestion control for AI Ethernet fabrics.²⁸

Pick the RoCEv2 GID index. NCCL needs NCCL_IB_GID_INDEX set to the RoCEv2 GID, which you read from show_gids:²⁹²¹

show_gids        # columns: DEV PORT INDEX GID IPv4 VER (v1|v2) DEV(netdev)

Choose the row where VER is v2 and the IPv4 column is populated (the RoCEv2 IPv4 entry); that INDEX is the value for NCCL_IB_GID_INDEX. The NCCL docs state explicitly to consult show_gids to set this variable.²¹

Lossless config notes (verify against your switch/NIC QoS policy; exact priorities and DSCP marks are site/vendor-specific). NVIDIA's own RoCE setup script configures PFC and trust mode with mlnx_qos:³⁰

# Enable PFC on priority 3 and trust the IP DSCP field (RoCEv2/L3 classification).
mlnx_qos -i <iface> --trust dscp --pfc 0,0,0,1,0,0,0,0
# Set the RoCE version per device/port (-m 2 = RoCEv2):
cma_roce_mode -d mlx5_0 -p 1 -m 2

Keep PFC, ECN, and the DSCP-to-priority mapping consistent end to end across NICs and switches; a mismatch is what turns RoCE into pause storms or silent drops (Failure modes). RoCE/Ethernet uses a different toolchain from IB; NCCL-over-RoCE tuning is in performance optimization.

Expected numbers by generation¶

Line-rate ceilings and vendor figures. Achieved bandwidth is always lower (single-QP/protocol overhead); use these as the target to compare against, not as a pass threshold.

Link	Generation / adapter	Per-port / per-GPU bandwidth	Framing
InfiniBand EDR	ConnectX-5 era	100 Gb/s (4x)	per port, per direction³¹
InfiniBand HDR	ConnectX-6	200 Gb/s (4x, 50G-PAM4/lane)	per port, per direction³¹
InfiniBand NDR	ConnectX-7 / Quantum-2	400 Gb/s (4x, 100G-PAM4/lane)	per port, per direction³¹³²
InfiniBand XDR	ConnectX-8 / Quantum-X800	800 Gb/s (4x, 200G-PAM4/lane)	per port, per direction³¹²
NVLink 3 (Ampere/A100)	3rd-gen	600 GB/s per GPU	aggregate of both directions (12 links x 25 GB/s/direction x 2)³³
NVLink 4 (Hopper/H100)	4th-gen	900 GB/s per GPU	aggregate/bidirectional (per-direction ~450 GB/s inferred)³⁴
NVLink 5 (Blackwell/B200)	5th-gen	1.8 TB/s (1800 GB/s) per GPU	aggregate; GB200 Superchip 3.6 TB/s = 1.8 TB/s/GPU³⁴³⁵

Per-direction caveat: NVIDIA publishes the per-GPU totals above. Only the A100 has an NVIDIA-sourced per-direction figure (25 GB/s per link per direction, in the Ampere whitepaper).³³ The H100 (~450 GB/s) and B200 (~900 GB/s) per-direction halves are arithmetic (total / 2), not printed on the official pages, so treat them as inferred. Do not confuse the B200 GPU-to-GPU NVLink with NVLink-C2C (the Grace CPU-to-GPU chip-to-chip link, 900 GB/s), which is a different interconnect.¹⁰

Failure modes¶

SM not converged. sminfo shows no master, or ports stay State: Init. Nothing below passes; routing is unprogrammed. Bring up exactly one SM and confirm with sminfo/ibhosts before benchmarking.
Duplicate subnet managers. Two SMs both claiming master cause routing churn. Keep one master, set others lower priority in opensm.conf/UFM (networking fabric).
GDR silently bypassed → PCIe fallback. No cuMemAlloc()/GDRDMA line; bandwidth collapses to a host-staged path. Usual cause: nvidia_peermem not loaded, or ACS left enabled routing P2P through the root complex (performance optimization).
NCCL on TCP fallback. NCCL_DEBUG=INFO shows NET/Socket instead of NET/IB/.../GDRDMA; collectives run an order of magnitude slow. Fix the HCA/interface selection (NCCL_IB_HCA, NCCL_SOCKET_IFNAME) (NCCL hang runbook).
PFC/ECN misconfig (RoCE). Inconsistent priority/DSCP mapping across NICs and switches causes pause storms (head-of-line blocking) or silent drops. Verify the lossless config end to end with show_gids, mlnx_qos, and switch QoS.
Firmware mismatch. Adapters on different FW negotiate down, flap, or behave inconsistently. Bring every NIC to one baseline with the MFT tools / mlnx-fw-updater; on NVSwitch systems a driver/FM version drift fails CUDA init outright (Section 7).
NUMA / PCIe rail misalignment. A GPU bound to a NIC across the wrong NUMA node, or a PCIe link trained down to x8/an older Gen, silently halves bandwidth and shows up as a perftest/nccl-tests result well under line rate. Check nvidia-smi topo -m and PCIe link state (performance optimization).

References¶

linux-rdma perftest (README, flags, --use_cuda): https://github.com/linux-rdma/perftest/blob/master/README
NVIDIA/nccl-tests (build, args): https://github.com/NVIDIA/nccl-tests/blob/master/README.md
nccl-tests performance metrics (algbw/busbw, AllReduce formula): https://github.com/NVIDIA/nccl-tests/blob/master/doc/PERFORMANCE.md
NVIDIA/nvbandwidth (build, tests): https://github.com/NVIDIA/nvbandwidth
NCCL environment variables (NCCL_IB_HCA, NCCL_NET_GDR_LEVEL, NCCL_IB_GID_INDEX): https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/env.html
NCCL networking troubleshooting (GPUDirect RDMA, nvidia-peermem, GDRDMA log): https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/troubleshooting/networking_troubleshooting.html
DOCA-Host installation and upgrade (DOCA-OFED profiles, mlnx-fw-updater): https://networking-docs.nvidia.com/doca/sdk/DOCA-Host-Installation-and-Upgrade
MLNX_OFED to DOCA-OFED transition guide: https://docs.nvidia.com/doca/sdk/nvidia-mlnx-ofed-to-doca-ofed-transition-guide.pdf
NVIDIA Firmware Tools (MFT) — mlxfwmanager, flint: https://docs.nvidia.com/networking/display/mft/mlxfwmanager-%E2%80%93-firmware-update-and-query-tool
ibdiagnet InfiniBand Fabric Diagnostic Tool (ibutils2, routing validation): https://docs.nvidia.com/networking/display/ibdiagnetusermanualv221
OpenSM man page (-B, opensm.conf): https://linux.die.net/man/8/opensm
nvidia-smi manual (nvlink --status, --capabilities, -gt): https://docs.nvidia.com/deploy/nvidia-smi/index.html
NVIDIA NVLink & NVSwitch (per-GPU bandwidth by generation): https://www.nvidia.com/en-us/data-center/nvlink/
NVIDIA A100 (NVLink 600 GB/s) + Ampere architecture whitepaper (25 GB/s/link/direction): https://www.nvidia.com/en-us/data-center/a100/ , https://images.nvidia.com/aem-dam/en-zz/Solutions/data-center/nvidia-ampere-architecture-whitepaper.pdf
NVIDIA GB200 NVL72 (NVLink 3.6 TB/s per Superchip): https://www.nvidia.com/en-us/data-center/gb200-nvl72/
NVIDIA Fabric Manager user guide (NVSwitch scope, cudaErrorSystemNotReady, version check): https://docs.nvidia.com/datacenter/tesla/fabric-manager-user-guide/index.html
NVIDIA ConnectX-7 NDR 400G adapter datasheet: https://www.nvidia.com/content/dam/en-zz/Solutions/networking/infiniband-adapters/infiniband-connectx7-data-sheet.pdf
NVIDIA BlueField-3 DPU datasheet (NDR 400G / 400 GbE): https://www.nvidia.com/content/dam/en-zz/Solutions/Data-Center/documents/datasheet-nvidia-bluefield-3-dpu.pdf
NVIDIA DGX SuperPOD cabling — NDR overview (400 Gb/s = 4x100G): https://docs.nvidia.com/dgx-superpod/design-guide-cabling-data-centers/latest/ndr-overview.html
NVIDIA Spectrum-X (RoCE, adaptive routing, congestion control): https://www.nvidia.com/en-us/networking/spectrumx/
NVIDIA RoCE configuration (cma_roce_mode, RoCEv2 UDP 4791): https://networking-docs.nvidia.com/mlnxofedswum/23070512/rdma-over-converged-ethernet-roce
NVIDIA doroce-linux (mlnx_qos --trust/--pfc, NCCL_IB_GID_INDEX setup): https://github.com/NVIDIA/doroce-linux/blob/main/doRoCE.sh
Oracle: Benchmark GPUDirect RDMA with ib_write_bw --use_cuda (cuMemAlloc GPU buffer): https://docs.oracle.com/en/learn/gpudirect-rdma-ib-write-bw/index.html

NVIDIA ConnectX-7 operates at 400 Gb/s in InfiniBand NDR and 400 GbE modes. https://www.nvidia.com/content/dam/en-zz/Solutions/networking/infiniband-adapters/infiniband-connectx7-data-sheet.pdf ↩
ConnectX-8 SuperNIC: 800 Gb/s XDR InfiniBand or 2x400 GbE, PCIe Gen6 (networking fabric, citing NVIDIA Quantum-X800 / ConnectX-8 materials). ↩↩
NVIDIA BlueField-3 DPU: 400 Gb/s Ethernet or NDR 400 Gb/s InfiniBand. https://www.nvidia.com/content/dam/en-zz/Solutions/Data-Center/documents/datasheet-nvidia-bluefield-3-dpu.pdf ↩
NCCL enables GPUDirect RDMA when the topology permits and the nvidia-peermem module is loaded; the module exposes GPU memory to the NIC. https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/troubleshooting/networking_troubleshooting.html ↩
MLNX_OFED transitioned to DOCA-OFED; the last standalone MLNX_OFED was the October 2024 LTS and from January 2025 new features ship in DOCA-OFED only. https://docs.nvidia.com/doca/sdk/nvidia-mlnx-ofed-to-doca-ofed-transition-guide.pdf ↩
DOCA-Host install on deb-based distros: dpkg -i <repo>.deb, apt-get update, apt install -y <profile> (e.g. doca-all, doca-ofed); apt install -y mlnx-fw-updater for firmware. https://networking-docs.nvidia.com/doca/sdk/DOCA-Host-Installation-and-Upgrade ↩
ofed_info reports the software versions of each OFED component; -s prints the version string (e.g. MLNX_OFED_LINUX-24.07-...). https://networking-docs.nvidia.com/doca/sdk/DOCA-Host-Installation-and-Upgrade ↩
NVIDIA Firmware Tools (MFT): mlxfwmanager -d <pci> --query reports device type/PSID/current vs available firmware. https://docs.nvidia.com/networking/display/mft/mlxfwmanager-%E2%80%93-firmware-update-and-query-tool ↩
flint burns/queries firmware on a single device (flint -d <dev> q); mstflint is the open-source MFT subset (mstflint -d <pci> q). https://github.com/Mellanox/mstflint ↩
Subnet manager, OpenSM, UFM, and the NVLink-C2C distinction are covered in networking fabric. https://www.nvidia.com/en-us/data-center/nvlink/ ↩↩
OpenSM man page: -B runs OpenSM as a daemon; configuration in opensm.conf. sminfo queries the master SM (LID/state); ibhosts lists host CAs. https://linux.die.net/man/8/opensm ↩
ibstat reports per-port State (Active/Init/...), Physical state (LinkUp/Polling/...), and Rate; the InfiniBand fabric utilities document these tools. https://docs.nvidia.com/networking/display/MLNXOFEDv531001/InfiniBand+Fabric+Utilities ↩
ibdiagnet is a primary InfiniBand fabric discovery/diagnostic tool; it reports link errors and validates unicast/adaptive/multicast routing, and is distributed in the ibutils2 package within MLNX_OFED/UFM. https://docs.nvidia.com/networking/display/ibdiagnetusermanualv221 ↩
linux-rdma perftest README: server (no host arg) + client (server address) model; -d device, -i port, -F cpufreq, -a size sweep to 2^23, -b bidirectional, -R rdma_cm, -q QPs, -D duration. https://github.com/linux-rdma/perftest/blob/master/README ↩↩
The perftest README does not document a --report_gbits flag; bandwidth is reported in MB/sec by default. https://github.com/linux-rdma/perftest/blob/master/README ↩
perftest CUDA build: ./autogen.sh && ./configure CUDA_H_PATH=/usr/local/cuda/include/cuda.h && make -j (25.07+ auto-detects); --use_cuda=<gpu_index> is supported by ib_write_bw, ib_read_bw, ib_send_bw, ib_read_lat, ib_send_lat; --use_cuda_dmabuf uses the dma-buf path; MLX5_SCATTER_TO_CQE=0 works around "Couldn't allocate MR". https://github.com/linux-rdma/perftest/blob/master/README ↩↩↩
A CUDA-enabled ib_write_bw run prints CUDA init and cuMemAlloc() of a <N> bytes GPU buffer, confirming the buffer is in GPU memory (GDR path). https://docs.oracle.com/en/learn/gpudirect-rdma-ib-write-bw/index.html ↩
NVIDIA/nccl-tests build: make CUDA_HOME=... NCCL_HOME=... (add MPI=1 MPI_HOME=... for MPI); args -b min, -e max, -f step factor, -g GPUs/thread, -t threads, -n iters, -w warmup. https://github.com/NVIDIA/nccl-tests/blob/master/README.md ↩↩
nccl-tests PERFORMANCE.md: algbw = size/time; for AllReduce busbw = algbw * 2*(n-1)/n; busbw reflects hardware utilisation independent of rank count, for comparison against hardware peak. https://github.com/NVIDIA/nccl-tests/blob/master/doc/PERFORMANCE.md ↩
With GDR active NCCL logs GPU Direct RDMA Enabled for GPU <id> / HCA <id> and connections of the form NET/IB/<n>/GDRDMA. https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/troubleshooting/networking_troubleshooting.html ↩
NCCL env vars: NCCL_IB_HCA filters HCAs; NCCL_NET_GDR_LEVEL sets the GDR topological cutoff (LOC/PIX/PXB/PHB/SYS); NCCL_IB_GID_INDEX is the RoCE GID index (default -1, "see the InfiniBand show_gids command"); NCCL_SOCKET_IFNAME restricts interfaces; NCCL_IB_DISABLE=1 forces IP sockets. https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/env.html ↩↩↩
nvidia-smi manual: nvlink -s/--status (state + rated bandwidth if active), -c/--capabilities, -gt/--getthroughput with arg d (tx/rx data payload in KiB) or r (adds overhead); -gt supersedes the deprecated -g counters. https://docs.nvidia.com/deploy/nvidia-smi/index.html ↩
NVIDIA/nvbandwidth measures memcpy bandwidth across links (CE or SM copy); build needs CUDA, CMake 3.20+, and Boost program_options; cmake . && make; -l lists tests, -t <name> runs one (e.g. device_to_device_memcpy_read_ce). https://github.com/NVIDIA/nvbandwidth ↩↩
Fabric Manager user guide: systemctl status nvidia-fabricmanager; required on NVSwitch DGX/HGX systems; CUDA init fails with cudaErrorSystemNotReady if launched before FM initialises the system; FM checks the loaded driver-stack version and aborts on incompatibility. https://docs.nvidia.com/datacenter/tesla/fabric-manager-user-guide/index.html ↩↩
NVIDIA RoCE doc: RoCEv2 uses UDP on dedicated port 4791; cma_roce_mode -d <dev> -p <port> -m <1|2> sets the RoCE version (2 = RoCEv2). https://networking-docs.nvidia.com/mlnxofedswum/23070512/rdma-over-converged-ethernet-roce ↩
NVIDIA QoS-for-RoCE / RoCEv2 congestion management docs: PFC (802.1Qbb) pauses per-priority to avoid drops; ECN marks the IP-header ECN bits on congestion. https://enterprise-support.nvidia.com/s/article/understanding-qos-configuration-for-roce ↩
DCQCN is the de-facto RoCEv2 congestion-control algorithm, combining ECN-based rate control with PFC. https://www.wwt.com/article/understanding-data-center-quantized-congestion-notification-dcqcn ↩
NVIDIA Spectrum-X: RoCE adaptive routing and end-to-end congestion control for AI Ethernet, pairing the Spectrum-4 switch with the BlueField-3 SuperNIC. https://www.nvidia.com/en-us/networking/spectrumx/ ↩
show_gids lists the GID table (DEV/PORT/INDEX/GID/IPv4/VER); the RoCEv2 IPv4 GID is the row with VER=v2 and a populated IPv4 — its INDEX is used for NCCL_IB_GID_INDEX. https://docs.cloud.google.com/ai-hypercomputer/docs/nccl/collect-and-understand ↩
NVIDIA's doRoCE.sh configures lossless RoCE with mlnx_qos -i <iface> --trust <mode> --pfc <mask> and writes NCCL_IB_GID_INDEX/NCCL_IB_TC; exact priorities/DSCP are site-specific. https://github.com/NVIDIA/doroce-linux/blob/main/doRoCE.sh ↩
InfiniBand 4x per-port rates: EDR 100, HDR 200 (50G-PAM4/lane), NDR 400 (100G-PAM4/lane), XDR 800 (200G-PAM4/lane) Gb/s, per port per direction (Glossary, networking fabric). ↩↩↩↩
NVIDIA DGX SuperPOD cabling guide: NDR = 400 Gbps (four lanes of 100 Gbps). https://docs.nvidia.com/dgx-superpod/design-guide-cabling-data-centers/latest/ndr-overview.html ↩
NVIDIA A100 product page lists NVLink 600 GB/s; the Ampere architecture whitepaper states each link carries 25 GB/s per direction and A100 has 12 links, totalling 600 GB/s (aggregate of both directions). https://www.nvidia.com/en-us/data-center/a100/ , https://images.nvidia.com/aem-dam/en-zz/Solutions/data-center/nvidia-ampere-architecture-whitepaper.pdf ↩↩
NVIDIA NVLink page: 4th-gen (H100) = 900 GB/s per GPU; 5th-gen (B200) = 1,800 GB/s per GPU. Per-direction halves are inferred (total / 2), not printed by NVIDIA. https://www.nvidia.com/en-us/data-center/nvlink/ ↩↩
NVIDIA GB200 NVL72 page lists NVLink bandwidth 3.6 TB/s for the 2-GPU GB200 Superchip = 1.8 TB/s per GPU. https://www.nvidia.com/en-us/data-center/gb200-nvl72/ ↩