RDMA and RoCE performance tuning¶
Scope: tuning RDMA over Converged Ethernet (RoCEv2) for GPU clusters, covering GPUDirect RDMA (NIC-to-HBM direct DMA), the lossless fabric (PFC + ECN/DCQCN), adaptive routing, GID/traffic-class selection, and the NCCL_IB_* knobs that drive the data path, contrasted with native InfiniBand.
What it is¶
RDMA lets an RDMA-capable NIC read and write application memory directly, bypassing most of the kernel network stack: no per-packet CPU involvement, no context switches, no buffer copies.1 GPUDirect RDMA is NVIDIA's GPU variant: an InfiniBand or RoCE NIC performs DMA directly to and from GPU device memory, so data moves between GPUs on two servers without passing through the host CPU or system RAM. GPU buffers are registered with the NIC, which then services one-sided RDMA reads/writes between remote GPUs, minimising both latency and CPU overhead in multinode training.2
RoCEv2 carries the same RDMA verbs over routable UDP/IP Ethernet instead of an InfiniBand fabric. You get RDMA-like zero-copy transfers over Ethernet assuming the network gear supports RDMA and is configured for it.3 The hard part is not the NIC; it is making Ethernet behave like a lossless fabric, because RoCE requires a lossless network: any drop triggers a retransmission that cascades across the whole synchronised training job.4 Three mechanisms cooperate:
- PFC (Priority Flow Control, 802.1Qbb): link-layer pause on a specific 802.1p/priority class; the switch tells its upstream neighbour to stop sending on that class before its buffer overflows.5
- ECN (Explicit Congestion Notification): when a switch egress queue crosses a threshold (Kmin..Kmax), it marks the CE bits in the IP header instead of dropping. The receiver echoes the mark back to the sender as a Congestion Notification Packet (CNP).6
- DCQCN (Data Center Quantized Congestion Notification): the reference RoCEv2 congestion-control loop that combines ECN marking with a sender rate-adjustment algorithm; PFC is the last-resort backstop while DCQCN does the steady-state rate control.7
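To make the Kmin..Kmax threshold concrete, here is a minimal sketch of a marking profile. It assumes the RED-style ramp commonly paired with DCQCN (no marking below Kmin, a linear rise to a maximum probability Pmax at Kmax, every packet marked above Kmax). The function name, the Pmax parameter, and the threshold values are illustrative assumptions, not taken from any switch configuration.

```shell
# ecn_mark_prob QUEUE KMIN KMAX PMAX
# Prints the probability that a packet is ECN-marked at a given egress queue depth.
# Assumed profile (RED-style ramp; an illustration, not a vendor default):
#   depth <= KMIN          -> 0
#   KMIN < depth <= KMAX   -> linear ramp from 0 up to PMAX
#   depth > KMAX           -> 1 (every packet marked)
ecn_mark_prob() {
  awk -v q="$1" -v kmin="$2" -v kmax="$3" -v pmax="$4" 'BEGIN {
    if (q <= kmin)      p = 0
    else if (q <= kmax) p = pmax * (q - kmin) / (kmax - kmin)
    else                p = 1
    printf "%.3f\n", p
  }'
}

# Hypothetical thresholds in KB: Kmin=100, Kmax=400, Pmax=0.2
ecn_mark_prob  50 100 400 0.2   # below Kmin: never marked
ecn_mark_prob 250 100 400 0.2   # halfway up the ramp
ecn_mark_prob 500 100 400 0.2   # above Kmax: always marked
```

A lower Kmin makes senders back off earlier at the cost of some throughput; a higher one leans harder on PFC as the backstop.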
Native InfiniBand provides lossless delivery, credit-based flow control, and adaptive routing in the fabric itself, plus in-network compute (SHARP). RoCE estates generally lack SHARP, which is one reason many ultrascale systems still prefer InfiniBand; SHARP should be used when available.8 NVIDIA's Ethernet answer is Spectrum-X (Spectrum-4 switch + BlueField-3/ConnectX-8 SuperNIC), which ports lossless delivery, adaptive routing, and per-flow telemetry onto Ethernet and offloads transport/congestion control to the SuperNIC.9
Why it matters¶
The performance gap between RDMA and a TCP/IP fallback is one to two orders of magnitude, and the failure mode is silent. A modern InfiniBand link delivers a few microseconds of small-message latency where TCP over Ethernet incurs 5–10x more; for large bandwidth-bound transfers RDMA sustains hundreds of Gbps versus TCP's typical ceiling.10 In container environments (Docker, Kubernetes) without direct access to the host's InfiniBand devices (/dev/infiniband), NCCL silently falls back to TCP sockets instead of GPUDirect RDMA: throughput drops from tens of GB/s to a few Gb/s with no error message.11 A related trap: mismatched GID assignments between container and host block GPUDirect registration and force CPU-driven RDMA copies instead of true GPU RDMA.12
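Because the container fallback produces no error, a preflight check is cheap insurance. The sketch below only tests whether any uverbs device node is visible under /dev/infiniband; the function name is hypothetical, and passing the check does not prove that GPUDirect RDMA registration will succeed.

```shell
# rdma_devs_visible [DIR]
# Succeeds if at least one uverbs device node exists under DIR
# (default /dev/infiniband). Inside a container, failure usually means the
# host's RDMA devices were not passed through.
rdma_devs_visible() {
  dir="${1:-/dev/infiniband}"
  for node in "$dir"/uverbs*; do
    # An unmatched glob stays literal, so test that the path really exists.
    [ -e "$node" ] && return 0
  done
  return 1
}

if rdma_devs_visible; then
  echo "RDMA device nodes visible"
else
  echo "no /dev/infiniband/uverbs* nodes: expect NCCL to fall back to TCP sockets"
fi
```

Running this as an init step in the job's entrypoint turns a silent slowdown into a visible log line.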
Concrete: the book's worked example moves 400 MB over an 800 Gb/s (100 GB/s) interconnect. A CPU-bound Gloo backend takes ~200 ms (≈2 GB/s, CPU near 100%); NCCL with GPUDirect RDMA takes ~4 ms (≈100 GB/s, line rate), a 50x difference.13 These numbers are the book's illustrative figures, not hardware-tested here.
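The arithmetic behind those figures is simply size divided by effective rate. A small sketch, using decimal units and a hypothetical helper name, reproduces it:

```shell
# xfer_ms SIZE_MB RATE_GB_PER_S -> transfer time in milliseconds
# Decimal units: 1 GB = 1000 MB. RATE is in gigabytes (not gigabits) per second,
# so an 800 Gb/s link is passed as 100.
xfer_ms() {
  awk -v mb="$1" -v gbps="$2" 'BEGIN { printf "%.0f\n", mb / (gbps * 1000) * 1000 }'
}

gloo_ms=$(xfer_ms 400 2)     # CPU-bound path at ~2 GB/s
nccl_ms=$(xfer_ms 400 100)   # GPUDirect RDMA at line rate
echo "Gloo: ${gloo_ms} ms, NCCL+GDR: ${nccl_ms} ms, ratio: $((gloo_ms / nccl_ms))x"
```

With the book's inputs this gives 200 ms against 4 ms, a ratio of 50.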
When it is needed (and when not)¶
- Use RoCE tuning when the cluster's scale-out fabric is Ethernet and you need RDMA all-reduce/all-gather for distributed training or disaggregated inference KV-cache movement. RoCE at 200+ Gb/s vastly outperforms a 10–25 Gb/s TCP network for all-reduce traffic.10
- Prefer native InfiniBand for latency-bound training at scale: lossless-by-design, adaptive routing in-fabric, and SHARP offload. RoCE is the multi-tenant / cloud / cost-driven choice. See networking fabric for the platform-level selection.
- Neither is the fastest path intra-node. Keep traffic on NVLink/NVSwitch whenever possible. An NVL72 NVLink domain delivers ~1.8 TB/s per-GPU aggregate at sub-microsecond latency, far above any inter-node RDMA. Only reach for RDMA when crossing node boundaries. See NVSwitch / NVLink.14
- Skip the PFC/DCQCN rabbit hole if you are on InfiniBand. That lossless tuning is an Ethernet-only problem. On IB, the equivalent levers are service level and adaptive routing, below.
How: implement, integrate, maintain¶
1. Confirm GPUDirect RDMA is actually active¶
Never assume. The fallback is silent. Verify the peer-memory kernel module is loaded and check init in the kernel log:
# nvidia_peermem exports GPU memory to the RDMA stack (replaces legacy nv_peer_mem)
lsmod | grep nvidia_peermem
dmesg | grep -i nvidia_peermem # confirm initialization
Run NCCL with debug output to confirm it selected the IB/RoCE net transport, then validate true GPU-to-GPU DMA with a CUDA-aware perftest:15
export NCCL_DEBUG=INFO # look for "NET/IB" path lines in the log
# GPUDirect RDMA bandwidth test, GPU buffers on both ends:
ib_write_bw --use_cuda=0 -d mlx5_0 # server
ib_write_bw --use_cuda=0 -d mlx5_0 <server_ip> # client
A red flag for silent fallback: during all-reduce, GPU utilisation drops while CPU utilisation spikes. The CPU is copying data for communications.11
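The log check can be scripted so it runs on every job. The sketch below classifies a captured NCCL_DEBUG=INFO log by substring. It assumes that "NET/IB" marks the verbs transport, "GDRDMA" marks a channel using GPUDirect RDMA, and "NET/Socket" marks the TCP transport; these markers vary across NCCL versions, so confirm them against your own logs. The function name and the sample line are illustrative.

```shell
# nccl_transport < nccl.log
# Classifies the inter-node transport from NCCL_DEBUG=INFO output on stdin.
# Assumed markers (version-dependent; verify against your own logs):
#   "NET/IB"      -> IB/RoCE verbs transport selected
#   "GDRDMA"      -> a channel is using GPUDirect RDMA
#   "NET/Socket"  -> TCP socket transport
nccl_transport() {
  log=$(cat)
  case "$log" in
    *NET/IB*)
      case "$log" in
        *GDRDMA*) echo "ib+gpudirect" ;;
        *)        echo "ib-without-gpudirect" ;;
      esac ;;
    *NET/Socket*) echo "tcp-fallback" ;;
    *)            echo "unknown" ;;
  esac
}

# Illustrative log line, not captured from a real run:
printf '%s\n' 'NCCL INFO Channel 00 : 0[0] -> 1[0] [send] via NET/IB/0/GDRDMA' | nccl_transport
```

Anything other than `ib+gpudirect` on a cluster that should be running GPUDirect RDMA is worth failing the job on before it burns GPU hours.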
2. Select the RoCE data path with NCCL_IB_* (verify every value against your show_gids)¶
# Pin the HCA(s). Comma-separated; each entry is <hca>[:<port>]. Match `ibstat` device names.
export NCCL_IB_HCA=mlx5_0,mlx5_1
# RoCE GID index. Default is -1 (NCCL auto-selects). For RoCEv2/IPv4 you usually
# need an explicit index — read it from `show_gids` (pick the RoCEv2 IPv4 row).
export NCCL_IB_GID_INDEX=3
# GPUDirect RDMA distance gate between NIC and GPU. If unset, NCCL picks per topology.
# Values (increasing distance): LOC, PIX, PXB, PHB, SYS.
export NCCL_NET_GDR_LEVEL=PXB
# Bootstrap/out-of-band handshake interface. On multi-NIC hosts pin it to the HCA's
# IP interface so NCCL bootstraps on the fast fabric, then hands off to GPUDirect RDMA.
# ib0 is an IPoIB-style name; on RoCE, substitute the NIC's Ethernet netdev name.
export NCCL_SOCKET_IFNAME=ib0
NCCL_IB_GID_INDEX defines the Global ID index used in RoCE mode; its default is -1 and you set it from the InfiniBand show_gids command.16 NCCL_NET_GDR_LEVEL controls when GPUDirect RDMA is used by the maximum allowed NIC↔GPU distance: LOC disables it, PIX requires same PCI switch, PXB allows multiple PCI switches, PHB allows same NUMA node (traffic via CPU), SYS allows crossing the inter-NUMA interconnect; if unspecified NCCL selects a value from the architecture.17 NCCL_IB_HCA selects which HCA interfaces NCCL uses.18
NCCL_NET_GDR_LEVEL replaced the older NCCL_IB_GDR_LEVEL name. Set the current name; prefer the official docs if a tutorial disagrees on naming.17
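Picking the GID index by eye is error-prone, so a filter over `show_gids` helps. The sketch below assumes the usual whitespace-separated columns (DEV PORT INDEX GID IPv4 VER DEV), where rows without an IPv4 address have one field fewer. The function name and the sample rows are illustrative; check the column layout on your own hosts before relying on it.

```shell
# rocev2_ipv4_gids < show_gids output
# Prints "<device> <port> <gid-index>" for every RoCE v2 entry with an IPv4 address.
# Assumed columns: DEV PORT INDEX GID IPv4 VER DEV. Rows lacking an IPv4 address
# have only six fields, so NF == 7 keeps the IPv4 rows.
rocev2_ipv4_gids() {
  awk 'NF == 7 && $6 == "v2" { print $1, $2, $3 }'
}

# Illustrative sample, not real output. On a host: show_gids | rocev2_ipv4_gids
rocev2_ipv4_gids <<'EOF'
DEV     PORT  INDEX  GID                                      IPv4      VER  DEV
mlx5_0  1     0      fe80:0000:0000:0000:0000:0000:0000:0001            v1   eth2
mlx5_0  1     1      fe80:0000:0000:0000:0000:0000:0000:0001            v2   eth2
mlx5_0  1     2      0000:0000:0000:0000:0000:ffff:0a00:0001  10.0.0.1  v1   eth2
mlx5_0  1     3      0000:0000:0000:0000:0000:ffff:0a00:0001  10.0.0.1  v2   eth2
EOF
```

For the sample above the only match is index 3 on mlx5_0 port 1, which is the value that would go into NCCL_IB_GID_INDEX.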
3. Traffic class, service level, and adaptive routing¶
RoCE QoS depends on the NIC's DSCP/traffic-class marking lining up with the switch priority that PFC and ECN are configured on. A mismatch sends RDMA on the wrong queue and breaks losslessness. The NCCL knobs:
# IB traffic class -> RoCE DSCP. Default 0. Set to match the switch's RDMA priority/DSCP
# (NVIDIA QoS examples map RDMA to a dedicated priority; CNP rides its own priority).
export NCCL_IB_TC=106 # example only: derive from your fabric's DSCP plan
# IB service level. Default 0. Maps to the SL->VL / priority used by the fabric QoS policy.
export NCCL_IB_SL=0
# Adaptive routing: ON by default on InfiniBand, OFF by default on RoCE.
# Enable explicitly on a RoCE fabric that supports it (e.g. Spectrum-X) to spread flowlets.
export NCCL_IB_ADAPTIVE_ROUTING=1
# Verbs ack timeout, computed as 4.096us * 2^value. Default 20 (NCCL >= 2.23; was 18).
# Raise only on large fabrics seeing spurious retransmits.
export NCCL_IB_TIMEOUT=20
# Queue pairs per rank-to-rank connection. Default 1. More QPs can spread load across
# adaptive-routing paths on large RoCE fabrics; raise stepwise and measure.
export NCCL_IB_QPS_PER_CONNECTION=4
NCCL_IB_TC defines the InfiniBand traffic class field (default 0); NCCL_IB_SL defines the service level (default 0); NCCL_IB_ADAPTIVE_ROUTING enables adaptive-routing-capable transfers and is enabled (1) by default on IB, disabled (0) by default on RoCE; NCCL_IB_TIMEOUT default is 20 (since 2.23; 18 since 2.14) with timeout = 4.096 µs · 2^value; NCCL_IB_QPS_PER_CONNECTION default is 1.19 The exact NCCL_IB_TC value is not a constant. On RoCE the traffic class is the 8-bit IP ToS byte, with DSCP in the upper six bits and ECN in the lower two, so it must encode the DSCP your switch QoS assigns to RDMA traffic (106, for instance, is DSCP 26 with ECT(0) set). Derive it from your fabric's QoS plan; do not copy a number.20
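Two of these knobs are encoded values, and it helps to decode them before setting them. The sketch below computes a traffic-class byte from a DSCP value (DSCP shifted into the upper six bits, ECN codepoint in the lower two) and converts an NCCL_IB_TIMEOUT value into seconds using the 4.096 µs · 2^value formula. The helper names and the DSCP 26 example are assumptions for illustration, not a recommended plan.

```shell
# roce_tc DSCP [ECN] -> 8-bit traffic class: DSCP in the upper six bits, ECN in the lower two.
# ECN defaults to 2 (binary 10, the ECT(0) codepoint, meaning "ECN-capable transport").
roce_tc() {
  echo $(( ($1 << 2) | ${2:-2} ))
}

# ib_ack_timeout_s VALUE -> verbs ack timeout in seconds: 4.096 us * 2^VALUE
ib_ack_timeout_s() {
  awk -v v="$1" 'BEGIN { printf "%.3f\n", 4.096e-6 * 2 ^ v }'
}

roce_tc 26            # hypothetical fabric plan: RDMA on DSCP 26
ib_ack_timeout_s 18   # older NCCL default
ib_ack_timeout_s 20   # default since NCCL 2.23
```

Each increment of the timeout value doubles the wait, so moving from 18 to 20 raises it from roughly 1.07 s to roughly 4.29 s per retry.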
4. Build the lossless fabric (RoCE only)¶
This is switch- and NIC-side configuration, not an NCCL knob. The end-to-end recipe (NVIDIA's ConnectX + Spectrum reference): enable PFC on the RDMA priority on every hop, set ECN marking thresholds on switch egress queues, and let DCQCN drive the sender rate from the resulting CNPs.21 Keep RDMA, CNP, and TCP on separate priorities so a congested flow cannot head-of-line-block unrelated traffic. HoL blocking from a shared priority class is the classic PFC misconfiguration.5 On Spectrum-X, NVIDIA's congestion control (NCC) and SuperNIC offload react faster than stock DCQCN.9
5. Host-side: TCP fallback hygiene, MTU, NUMA pinning¶
Even with RDMA the CPU still sets up transfers and handles completion events, so pin the NIC's interrupt/polling threads to a core on the same NUMA node as the NIC (and ideally the GPU) to cut cross-NUMA latency. If an HCA is on NUMA node 0, bind its IRQ affinity to node 0.22 For any path that can fall back to TCP, use jumbo frames (MTU 9000) so transfers send fewer, larger packets, and raise the socket buffer limits so TCP can keep a high-bandwidth link full:23
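# Jumbo frames on the fabric interface (ib0 is a placeholder; on RoCE use the NIC's Ethernet netdev).
# Every switch port on the path must accept frames of this size.
sudo ip link set dev ib0 mtu 9000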
# Socket buffer ceilings + autotuning ranges (verify against your link bandwidth-delay product)
sudo sysctl -w net.core.rmem_max=268435456
sudo sysctl -w net.core.wmem_max=268435456
sudo sysctl -w net.ipv4.tcp_rmem="4096 87380 268435456"
sudo sysctl -w net.ipv4.tcp_wmem="4096 65536 268435456"
# On links with a high bandwidth-delay product, consider BBR over the default CUBIC; always validate that it helps
sysctl net.ipv4.tcp_congestion_control   # prints the algorithm currently in use
The book's guidance: on a well-engineered dedicated cluster network the default CUBIC is usually adequate; reach for BBR only on links with a high bandwidth-delay product, and validate that the defaults are not capping throughput.24
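The socket buffer ceilings above only make sense relative to the link's bandwidth-delay product (BDP), the number of bytes that must be in flight to keep the pipe full. A minimal sketch of that calculation follows; the helper name and the example link figures are hypothetical.

```shell
# bdp_bytes GBIT_PER_S RTT_MS -> bandwidth-delay product in bytes
# bytes = (Gbit/s * 1e9 / 8) * (RTT_ms / 1000)
bdp_bytes() {
  awk -v gbit="$1" -v rtt="$2" 'BEGIN { printf "%.0f\n", gbit * 1e9 / 8 * rtt / 1000 }'
}

# Hypothetical links: 100 Gb/s at 1 ms RTT and at 20 ms RTT
bdp_bytes 100 1
bdp_bytes 100 20
```

At 100 Gb/s and 20 ms the BDP is 250,000,000 bytes, which is just under the 268435456-byte ceiling set above; a faster or longer path would need a larger limit for a single TCP stream to fill it.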
Data path¶
flowchart LR
GA["GPU A HBM"] -->|"GPUDirect RDMA register"| NA["NIC A (RoCE)"]
NA -->|"RoCEv2 UDP/IP"| SW["Lossless Ethernet switch (PFC + ECN, adaptive routing)"]
SW -->|"RoCEv2 UDP/IP"| NB["NIC B (RoCE)"]
NB -->|"DMA to device memory"| GB["GPU B HBM"]
SW -.->|"CNP on congestion"| NA
Maintain¶
Monitor continuously: fallbacks are silent and links flap. Use ibstat / ibstatus for HCA link state, ethtool -S <iface> and ip -s link show <iface> for byte/packet/error counters, and nvidia-smi dmon for NVLink/PCIe/network stats.25 Watch switch ECN-marked and PFC-pause counters: sustained PFC pause means DCQCN is not holding the rate and you are one step from HoL blocking. Re-verify nvidia_peermem and the NET/IB path after every driver/NCCL/CUDA upgrade. For collective-level hangs and straggler diagnosis, see runbook: NCCL hang.
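On the host side, PFC pause activity can be tracked from the `ethtool -S` output mentioned above. The sketch below sums per-priority pause-frame counters. It assumes mlx5-style counter names such as rx_prio3_pause and tx_prio3_pause; other drivers use different names, and the function name and sample values are illustrative.

```shell
# pfc_pause_total < `ethtool -S <iface>` output
# Sums per-priority pause-frame counters (assumed mlx5-style names:
# rx_prioN_pause / tx_prioN_pause). Related counters such as
# rx_prioN_pause_duration are deliberately excluded by the anchored pattern.
pfc_pause_total() {
  awk -F: '
    { gsub(/[ \t]/, "", $1) }
    $1 ~ /^[rt]x_prio[0-7]_pause$/ { total += $2 }
    END { print total + 0 }
  '
}

# Illustrative sample, not real output:
pfc_pause_total <<'EOF'
     rx_prio3_pause: 120
     rx_prio3_pause_duration: 4500
     tx_prio3_pause: 30
     rx_packets: 999
EOF

# On a host (hypothetical interface eth2), sample twice and compare:
#   before=$(ethtool -S eth2 | pfc_pause_total); sleep 10
#   after=$(ethtool -S eth2 | pfc_pause_total)
#   echo "pause frames in 10 s: $((after - before))"
```

A delta that keeps growing under steady load is the host-side view of the sustained PFC pause condition described above.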
References¶
- Chris Fregly, AI Systems Performance Engineering (O'Reilly), Chapter 4, "Tuning Distributed Networking Communication" — GPUDirect RDMA, RoCE, silent TCP fallback, GID/container pitfalls, NUMA pinning, MTU/TCP tuning, the Gloo-vs-NCCL throughput example.
- NCCL Environment Variables (official): https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/env.html
- NVIDIA GPUDirect RDMA: https://docs.nvidia.com/cuda/gpudirect-rdma/index.html
- NVIDIA — HowTo Configure RoCE over a Lossless Fabric (PFC + ECN) End-to-End, ConnectX-4 + Spectrum: https://enterprise-support.nvidia.com/s/article/how-to-configure-roce-over-a-lossless-fabric--pfc---ecn--end-to-end-using-connectx-4-and-spectrum--trust-l2-x
- NVIDIA — Understanding QoS Configuration for RoCE (DSCP/priority mapping): https://enterprise-support.nvidia.com/s/article/understanding-qos-configuration-for-roce
- NVIDIA Spectrum-X (Spectrum-4 + BlueField-3/ConnectX-8 SuperNIC, NCC, adaptive routing): https://www.nvidia.com/en-us/networking/spectrum-x/
Related: Networking fabric · Role of the RDMA fabric · BlueField DPU networking · NCCL collectives & algorithms · SHARP in-network reduction · NVSwitch / NVLink · Distributed training · Runbook: NCCL hang · Glossary
1. Fregly, Ch. 4, "High-Speed, Low-Overhead Data Transfers with RDMA": RDMA bypasses the kernel network stack and lets the NIC read/write application memory directly, avoiding per-packet CPU involvement, context switches, and buffer copies.
2. Fregly, Ch. 4: GPUDirect RDMA lets an IB/RoCE NIC DMA directly to/from a remote GPU's device memory across two servers, bypassing host CPU and system RAM; GPU buffers are registered with the NIC for one-sided reads/writes. Cf. NVIDIA GPUDirect RDMA docs.
3. Fregly, Ch. 4: "With RoCE, you get RDMA-like zero-copy transfers over Ethernet, assuming the network gear supports RDMA and is properly configured for it"; requires NVIDIA OFED drivers.
4. Spheron / NVIDIA Spectrum-X material: RoCE requires a lossless network because any drop triggers retransmission that cascades across the synchronised job. https://www.nvidia.com/en-us/networking/spectrum-x/
5. PFC is a link-layer pause on an 802.1p priority class; a shared priority causes head-of-line blocking. WWT, "Using PFC and ECN queuing methods to create lossless fabrics for AI/ML."
6. ECN marks the CE bit when a switch egress queue exceeds a configured threshold (Kmin..Kmax) instead of dropping; the receiver echoes the mark via CNP.
7. DCQCN is the reference RoCEv2 congestion-control algorithm combining ECN marking with a sender rate-adjustment loop; PFC is the last-resort backstop. NVIDIA QoS-for-RoCE article.
8. Fregly, Ch. 4 (Magnum IO / in-network compute): "Ethernet-based GPU clusters rely on technologies like RoCEv2 for RDMA but generally lack features like SHARP ... many ultrascale AI systems use InfiniBand ... SHARP should be utilized when available."
9. NVIDIA Spectrum-X: Spectrum-4 switch + BlueField-3/ConnectX-8 SuperNIC; ports lossless delivery, adaptive routing, and telemetry to Ethernet; SuperNIC offloads RoCEv2 transport and congestion control; NCC reacts faster than stock DCQCN. https://www.nvidia.com/en-us/networking/spectrum-x/
10. Fregly, Ch. 4: IB small-message latency a few microseconds vs TCP 5–10x higher; RDMA sustains hundreds of Gbps; 200+ Gb/s RoCE beats 10–25 Gb/s TCP for all-reduce.
11. Fregly, Ch. 4: without container access to /dev/infiniband, NCCL silently falls back to TCP, and throughput drops from tens of GB/s to a few Gb/s with no error; a red flag is GPU utilisation dropping while CPU spikes during all-reduce.
12. Fregly, Ch. 4: mismatched container/host GID assignments (some "rdma-shared" images) block GPUDirect registration and force CPU-driven RDMA copies instead of true GPU RDMA.
13. Fregly, Ch. 4: 400 MB all-reduce on 800 Gb/s hardware: Gloo ≈200 ms (≈2 GB/s, CPU ~100%) vs NCCL ≈4 ms (≈100 GB/s, line rate). Illustrative figures, not hardware-tested here.
14. Fregly, Ch. 4: each Blackwell GPU in a GB200/GB300 NVL72 has 18 NVLink 5 links at ~100 GB/s for ~1.8 TB/s aggregate; keep traffic on NVLink/NVSwitch whenever possible.
15. Fregly, Ch. 4: verify with lsmod | grep nvidia_peermem, check dmesg, run NCCL with NCCL_DEBUG=INFO to confirm NET/IB paths, and use RDMA perftests with --use_cuda to validate GPU-to-GPU transfers.
16. NCCL docs: NCCL_IB_GID_INDEX "defines the Global ID index used in RoCE mode ... The default value is -1"; set from show_gids. https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/env.html
17. NCCL docs: NCCL_NET_GDR_LEVEL controls when to use GPUDirect RDMA by max NIC↔GPU distance (LOC/PIX/PXB/PHB/SYS); auto-selected if unspecified; replaces the deprecated NCCL_IB_GDR_LEVEL. https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/env.html
18. NCCL docs: NCCL_IB_HCA specifies which Host Channel Adapter (RDMA) interfaces to use, comma-separated <hca>[:<port>]. https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/env.html
19. NCCL docs: NCCL_IB_TC (IB traffic class, default 0), NCCL_IB_SL (service level, default 0), NCCL_IB_ADAPTIVE_ROUTING (enabled by default on IB, disabled by default on RoCE), NCCL_IB_TIMEOUT (default 20 since 2.23, was 18 since 2.14; timeout = 4.096 µs · 2^value), NCCL_IB_QPS_PER_CONNECTION (default 1). https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/env.html
20. NVIDIA "Understanding QoS Configuration for RoCE": RDMA/CNP/TCP map to distinct priorities/DSCP; the NIC traffic-class value must match the switch QoS plan. https://enterprise-support.nvidia.com/s/article/understanding-qos-configuration-for-roce
21. NVIDIA "HowTo Configure RoCE over a Lossless Fabric (PFC + ECN) End-to-End Using ConnectX-4 and Spectrum (Trust L2)". https://enterprise-support.nvidia.com/s/article/how-to-configure-roce-over-a-lossless-fabric--pfc---ecn--end-to-end-using-connectx-4-and-spectrum--trust-l2-x
22. Fregly, Ch. 4: even with RDMA the host sets up transfers and handles completion events; pin NIC interrupt/polling threads to the NIC's (and ideally GPU's) NUMA node.
23. Fregly, Ch. 4: use jumbo frames (MTU 9000) to send fewer large packets; raise net.core.rmem_max/wmem_max and net.ipv4.tcp_rmem/tcp_wmem to fully use a high-bandwidth link.
24. Fregly, Ch. 4: default CUBIC is usually adequate on dedicated cluster networks; consider BBR on high latency-bandwidth links and inspect net.ipv4.tcp_congestion_control; always validate defaults are not limiting throughput.
25. Fregly, Ch. 4: monitor with ibstat/ifstat, ethtool -S <iface>, ip -s link show <iface>, and nvidia-smi dmon for NVLink/PCIe/network statistics.