RDMA and RoCE performance tuning

Scope: tuning RDMA over Converged Ethernet (RoCEv2) for GPU clusters, covering GPUDirect RDMA (NIC-to-HBM direct DMA), the lossless fabric (PFC + ECN/DCQCN), adaptive routing, GID/traffic-class selection, and the NCCL_IB_* knobs that drive the data path, contrasted with native InfiniBand.

What it is

RDMA lets an RDMA-capable NIC read and write application memory directly, bypassing most of the kernel network stack: no per-packet CPU involvement, no context switches, no buffer copies.¹ GPUDirect RDMA is NVIDIA's GPU variant: an InfiniBand or RoCE NIC performs DMA directly to and from a remote GPU's device memory across two servers, bypassing host CPU and system RAM entirely. GPU buffers are registered with the NIC, which then services one-sided RDMA reads/writes between remote GPUs, minimising both latency and CPU overhead in multinode training.²

RoCEv2 carries the same RDMA verbs over routable UDP/IP Ethernet instead of an InfiniBand fabric. You get RDMA-like zero-copy transfers over Ethernet, provided the network gear supports RDMA and is configured for it.³ The hard part is not the NIC but making Ethernet behave like a lossless fabric. RoCE requires one: a dropped packet forces a retransmission, and because collectives are synchronous, the stalled rank holds up every other rank in the training job.⁴ Three mechanisms cooperate:

PFC (Priority Flow Control, 802.1Qbb): link-layer pause on a specific 802.1p/priority class; the switch tells its upstream neighbour to stop sending on that class before its buffer overflows.⁵
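ECN (Explicit Congestion Notification): when a switch egress queue grows past a configured threshold (marking starts at Kmin and covers every packet above Kmax), the switch sets the CE (Congestion Experienced) codepoint in the IP header's ECN field instead of dropping the packet. The receiving NIC then sends a Congestion Notification Packet (CNP) back to the sender.⁶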
DCQCN (Data Center Quantized Congestion Notification): the reference RoCEv2 congestion-control loop that combines ECN marking with a sender rate-adjustment algorithm; PFC is the last-resort backstop while DCQCN does the steady-state rate control.⁷
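
The Kmin..Kmax marking behaviour can be sketched as a small function. This is a minimal illustration that assumes the RED-style profile commonly used for DCQCN marking: no marking below Kmin, a linear ramp up to a maximum probability Pmax at Kmax, and every packet marked above Kmax. The function name, the Pmax parameter, and the threshold values are illustrative assumptions, not values from any switch configuration.

```shell
# ecn_mark_probability QUEUE_KB KMIN_KB KMAX_KB PMAX
# Prints the probability that a packet is CE-marked at the given egress queue depth.
ecn_mark_probability() {
  awk -v q="$1" -v kmin="$2" -v kmax="$3" -v pmax="$4" 'BEGIN {
    if (q <= kmin)      p = 0                                  # below Kmin: never mark
    else if (q >= kmax) p = 1                                  # at or above Kmax: mark everything
    else                p = pmax * (q - kmin) / (kmax - kmin)  # linear ramp in between
    printf "%.2f\n", p
  }'
}

# Hypothetical thresholds: Kmin=100 KB, Kmax=400 KB, Pmax=0.2
ecn_mark_probability  50 100 400 0.2   # queue below Kmin
ecn_mark_probability 250 100 400 0.2   # halfway up the ramp
ecn_mark_probability 500 100 400 0.2   # queue above Kmax
```

A wider Kmin..Kmax gap gives DCQCN more room to slow senders before PFC has to pause the link.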

Native InfiniBand provides lossless delivery, credit-based flow control, and adaptive routing in the fabric itself, plus in-network compute (SHARP). RoCE estates generally lack SHARP, which is one reason many ultrascale systems still prefer InfiniBand; SHARP should be used when available.⁸ NVIDIA's Ethernet answer is Spectrum-X (Spectrum-4 switch + BlueField-3/ConnectX-8 SuperNIC), which ports lossless delivery, adaptive routing, and per-flow telemetry onto Ethernet and offloads transport/congestion control to the SuperNIC.⁹

Why it matters

The performance gap between RDMA and a TCP/IP fallback is one to two orders of magnitude, and the failure mode is silent. A modern InfiniBand link delivers small-message latency of a few microseconds, where TCP over Ethernet incurs 5–10x more; for large bandwidth-bound transfers RDMA sustains hundreds of Gbps, well above what a TCP path typically reaches.¹⁰ In container environments (Docker, Kubernetes) without direct access to the host's InfiniBand devices (/dev/infiniband), NCCL silently falls back to TCP sockets instead of GPUDirect RDMA: throughput drops from tens of GB/s to a few Gb/s with no error message.¹¹ A related trap: mismatched GID assignments between container and host block GPUDirect registration and force CPU-driven RDMA copies instead of true GPU RDMA.¹²

A concrete case: the book's worked example moves 400 MB over an 800 Gb/s (100 GB/s) interconnect. A CPU-bound Gloo backend takes ~200 ms (≈2 GB/s, CPU near 100%); NCCL with GPUDirect RDMA takes ~4 ms (≈100 GB/s, line rate), a 50x difference.¹³ These numbers are the book's illustrative figures, not hardware-tested here.
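
The arithmetic behind these figures can be checked in a few lines of shell. This is only a sanity check of the quoted numbers, not a benchmark; the helper name `transfer_ms` is made up for this sketch.

```shell
# transfer_ms SIZE_MB RATE_GB_PER_S: time to move SIZE_MB at RATE_GB_PER_S, in milliseconds
transfer_ms() {
  awk -v mb="$1" -v gbs="$2" 'BEGIN { printf "%.0f\n", (mb / 1000) / gbs * 1000 }'
}

line_rate_gbs=$(( 800 / 8 ))                  # 800 Gb/s link = 100 GB/s
gloo_ms=$(transfer_ms 400 2)                  # CPU-bound path at ~2 GB/s
nccl_ms=$(transfer_ms 400 "$line_rate_gbs")   # GPUDirect RDMA at line rate
echo "Gloo: ${gloo_ms} ms, NCCL: ${nccl_ms} ms, speedup: $(( gloo_ms / nccl_ms ))x"
```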

When it is needed (and when not)

Use RoCE tuning when the cluster's scale-out fabric is Ethernet and you need RDMA all-reduce/all-gather for distributed training or disaggregated inference KV-cache movement. RoCE at 200+ Gb/s vastly outperforms a 10–25 Gb/s TCP network for all-reduce traffic.¹⁰
Prefer native InfiniBand for latency-bound training at scale: lossless-by-design, adaptive routing in-fabric, and SHARP offload. RoCE is the multi-tenant / cloud / cost-driven choice. See networking fabric for the platform-level selection.
Neither is the fastest path intra-node. Keep traffic on NVLink/NVSwitch whenever possible. An NVL72 NVLink domain delivers ~1.8 TB/s per-GPU aggregate at sub-microsecond latency, far above any inter-node RDMA. Only reach for RDMA when crossing node boundaries. See NVSwitch / NVLink.¹⁴
Skip the PFC/DCQCN rabbit hole if you are on InfiniBand. That lossless tuning is an Ethernet-only problem. On IB, the equivalent levers are service level and adaptive routing, below.

How: implement, integrate, maintain

1. Confirm GPUDirect RDMA is actually active

Never assume. The fallback is silent. Verify the peer-memory kernel module is loaded and check init in the kernel log:

# nvidia_peermem exports GPU memory to the RDMA stack (replaces legacy nv_peer_mem)
lsmod | grep nvidia_peermem
dmesg | grep -i nvidia_peermem   # confirm initialization
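
Inside a container, the kernel module can be loaded on the host while the verbs device nodes are still missing from the container's /dev. A minimal pre-flight sketch, assuming the usual `uverbs*` node naming under /dev/infiniband; the function name and its optional directory argument are illustrative:

```shell
# has_rdma_device_nodes [DIR]: succeed if DIR (default /dev/infiniband) holds uverbs nodes
has_rdma_device_nodes() {
  local dir="${1:-/dev/infiniband}"
  local node
  for node in "$dir"/uverbs*; do
    [ -e "$node" ] && return 0   # at least one verbs device is visible
  done
  return 1                       # glob matched nothing: directory missing or empty
}

if has_rdma_device_nodes; then
  echo "RDMA device nodes visible"
else
  echo "no /dev/infiniband/uverbs* nodes: NCCL will likely fall back to TCP sockets"
fi
```

Running this in the container entrypoint turns the silent fallback into a visible warning before the job starts.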

Run NCCL with debug output to confirm it selected the IB/RoCE net transport, then validate true GPU-to-GPU DMA with a CUDA-aware perftest:¹⁵

export NCCL_DEBUG=INFO          # look for "NET/IB" path lines in the log
# GPUDirect RDMA bandwidth test, GPU buffers on both ends:
ib_write_bw --use_cuda=0 -d mlx5_0      # server
ib_write_bw --use_cuda=0 -d mlx5_0 <server_ip>   # client

A red flag for silent fallback: during all-reduce, GPU utilisation drops while CPU utilisation spikes. The CPU is copying data for communications.¹¹
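
The log check can be scripted so it runs after every job. The sketch below classifies a captured NCCL_DEBUG=INFO log by the transport strings it contains. Only "NET/IB" is named in the text above; the "GDRDMA" and "NET/Socket" markers are assumptions about NCCL's log wording, which can change between versions, so confirm them against your own logs first.

```shell
# nccl_transport_from_log: read NCCL_DEBUG=INFO output on stdin, print the detected path
nccl_transport_from_log() {
  awk '
    /NET\/IB/     { ib = 1 }       # RDMA transport selected (InfiniBand or RoCE)
    /GDRDMA/      { gdr = 1 }      # assumed marker for GPUDirect RDMA channels
    /NET\/Socket/ { sock = 1 }     # assumed marker for the TCP socket transport
    END {
      if (ib && gdr)  print "rdma+gpudirect"
      else if (ib)    print "rdma-host-staged"
      else if (sock)  print "tcp-fallback"
      else            print "unknown"
    }'
}

# Usage (train.log is a hypothetical file captured with NCCL_DEBUG=INFO):
#   nccl_transport_from_log < train.log
```

Anything other than `rdma+gpudirect` on a cluster that should be using GPUDirect RDMA is worth investigating before trusting throughput numbers.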

2. Select the RoCE data path with NCCL_IB_* (verify every value against your `show_gids`)

# Pin the HCA(s). Comma-separated; each entry is <hca>[:<port>]. Match `ibstat` device names.
export NCCL_IB_HCA=mlx5_0,mlx5_1

# RoCE GID index. Default is -1 (NCCL auto-selects). For RoCEv2/IPv4 you usually
# need an explicit index — read it from `show_gids` (pick the RoCEv2 IPv4 row).
export NCCL_IB_GID_INDEX=3

# GPUDirect RDMA distance gate between NIC and GPU. If unset, NCCL picks per topology.
# Values (increasing distance): LOC, PIX, PXB, PHB, SYS.
export NCCL_NET_GDR_LEVEL=PXB

# Bootstrap/out-of-band handshake interface. On multi-NIC hosts pin it to the HCA's
# IP interface so NCCL bootstraps on the fast fabric, then hands off to GPUDirect RDMA.
# ib0 is an IPoIB-style name; on a RoCE host use the NIC's Ethernet netdev name instead.
export NCCL_SOCKET_IFNAME=ib0

NCCL_IB_GID_INDEX defines the Global ID index used in RoCE mode; its default is -1 and you set it from the InfiniBand show_gids command.¹⁶ NCCL_NET_GDR_LEVEL controls when GPUDirect RDMA is used by the maximum allowed NIC↔GPU distance: LOC disables it, PIX requires same PCI switch, PXB allows multiple PCI switches, PHB allows same NUMA node (traffic via CPU), SYS allows crossing the inter-NUMA interconnect; if unspecified NCCL selects a value from the architecture.¹⁷ NCCL_IB_HCA selects which HCA interfaces NCCL uses.¹⁸

NCCL_NET_GDR_LEVEL replaced the older NCCL_IB_GDR_LEVEL name. Set the current name; prefer the official docs if a tutorial disagrees on naming.¹⁷
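
Reading the index from `show_gids` by eye is error-prone on hosts with many GID rows. The sketch below picks the first RoCE v2 row with an IPv4 address for a given device and port. It assumes whitespace-separated output whose first three columns are device, port, and index, with a `v2` version column and a dotted IPv4 column further right; check that layout against your own `show_gids` output before relying on it. The function name is made up for this example.

```shell
# rocev2_ipv4_gid_index DEV PORT: read `show_gids` output on stdin and print the INDEX
# of the first RoCE v2 row that carries an IPv4 address for the given device and port.
rocev2_ipv4_gid_index() {
  awk -v dev="$1" -v port="$2" '
    $1 == dev && $2 == port {
      has_v2 = 0; has_ipv4 = 0
      for (i = 4; i <= NF; i++) {
        if ($i == "v2") has_v2 = 1
        if ($i ~ /^[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+$/) has_ipv4 = 1
      }
      if (has_v2 && has_ipv4) { print $3; exit }
    }'
}

# Usage (prints nothing if no matching row exists, so check the result before exporting):
#   idx=$(show_gids | rocev2_ipv4_gid_index mlx5_0 1)
#   [ -n "$idx" ] && export NCCL_IB_GID_INDEX="$idx"
```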

3. Traffic class, service level, and adaptive routing

RoCE QoS depends on the NIC's DSCP/traffic-class marking lining up with the switch priority that PFC and ECN are configured on. A mismatch sends RDMA on the wrong queue and breaks losslessness. The NCCL knobs:

# IB traffic class; on RoCE it becomes the IP ToS byte (DSCP in the upper 6 bits, ECN in
# the lower 2). Default 0. Set it to match the switch's RDMA priority/DSCP
# (NVIDIA QoS examples map RDMA to a dedicated priority; CNP rides its own priority).
export NCCL_IB_TC=106            # example only (DSCP 26): derive from your fabric's DSCP plan

# IB service level. Default 0. Maps to the SL->VL / priority used by the fabric QoS policy.
export NCCL_IB_SL=0

# Adaptive routing: ON by default on InfiniBand, OFF by default on RoCE.
# Enable explicitly on a RoCE fabric that supports it (e.g. Spectrum-X) to spread flowlets.
export NCCL_IB_ADAPTIVE_ROUTING=1

# Verbs ack timeout, computed as 4.096us * 2^value. Default 20 (NCCL >= 2.23; was 18).
# Raise only on large fabrics seeing spurious retransmits.
export NCCL_IB_TIMEOUT=20

# Queue pairs per rank-to-rank connection. Default 1. More QPs can spread load across
# adaptive-routing paths on large RoCE fabrics; raise stepwise and measure.
export NCCL_IB_QPS_PER_CONNECTION=4

NCCL_IB_TC defines the InfiniBand traffic class field (default 0); NCCL_IB_SL defines the service level (default 0); NCCL_IB_ADAPTIVE_ROUTING enables adaptive-routing-capable transfers and is enabled (1) by default on IB, disabled (0) by default on RoCE; NCCL_IB_TIMEOUT defaults to 20 (since 2.23; 18 since 2.14) with timeout = 4.096 µs · 2^value; NCCL_IB_QPS_PER_CONNECTION defaults to 1.¹⁹ The exact NCCL_IB_TC value is not a constant. On RoCEv2 the traffic class is the 8-bit IP ToS byte, with the DSCP in its upper six bits and the ECN field in the lower two, so the value must encode the DSCP your switch QoS assigns to RDMA traffic (the example 106 is DSCP 26 with ECN bits 10). Derive it from your fabric's QoS plan; do not copy a number.²⁰
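
Two of these values are easier to reason about with the arithmetic spelled out. The sketch below assumes the standard IP layout in which the 8-bit traffic class holds the 6-bit DSCP in its upper bits and the 2-bit ECN field in its lower bits, and it applies the timeout formula quoted above. The helper names are made up for this example.

```shell
# dscp_to_tc DSCP [ECN]: pack a 6-bit DSCP and a 2-bit ECN field into the 8-bit traffic class
dscp_to_tc() {
  echo $(( ($1 << 2) | ${2:-0} ))
}

# tc_to_dscp TC: recover the DSCP from a traffic-class value
tc_to_dscp() {
  echo $(( $1 >> 2 ))
}

# ib_timeout_seconds VALUE: ack timeout = 4.096 us * 2^VALUE, printed in seconds
ib_timeout_seconds() {
  awk -v v="$1" 'BEGIN { printf "%.3f\n", 4.096e-6 * 2 ^ v }'
}

tc_to_dscp 106           # DSCP carried by the example NCCL_IB_TC=106
dscp_to_tc 26 2          # DSCP 26 with ECN bits 10 (ECT(0))
ib_timeout_seconds 18    # pre-2.23 default
ib_timeout_seconds 20    # current default
```

Each step of NCCL_IB_TIMEOUT doubles the wait, so moving from 18 to 20 quadruples the time before a retransmit fires.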

4. Build the lossless fabric (RoCE only)

This is switch- and NIC-side configuration, not an NCCL knob. The end-to-end recipe (NVIDIA's ConnectX + Spectrum reference): enable PFC on the RDMA priority on every hop, set ECN marking thresholds on switch egress queues, and let DCQCN drive the sender rate from the resulting CNPs.²¹ Keep RDMA, CNP, and TCP on separate priorities so a congested flow cannot head-of-line-block unrelated traffic. HoL blocking from a shared priority class is the classic PFC misconfiguration.⁵ On Spectrum-X, NVIDIA's congestion control (NCC) and SuperNIC offload react faster than stock DCQCN.⁹

5. Host-side: TCP fallback hygiene, MTU, NUMA pinning

Even with RDMA the CPU still sets up transfers and handles completion events, so pin the NIC's interrupt/polling threads to a core on the same NUMA node as the NIC (and ideally the GPU) to cut cross-NUMA latency. If an HCA is on NUMA node 0, bind its IRQ affinity to node 0.²² For any path that can fall back to TCP, use jumbo frames (MTU 9000) so transfers send fewer, larger packets, and raise the socket buffers so TCP can fill a high-bandwidth link:²³

# Jumbo frames on the fabric interface (substitute your netdev for ib0; every switch port
# on the path must allow the same MTU)
sudo ip link set dev ib0 mtu 9000

# Socket buffer ceilings + autotuning ranges (verify against your link bandwidth-delay product)
sudo sysctl -w net.core.rmem_max=268435456
sudo sysctl -w net.core.wmem_max=268435456
sudo sysctl -w net.ipv4.tcp_rmem="4096 87380 268435456"
sudo sysctl -w net.ipv4.tcp_wmem="4096 65536 268435456"

# Show the active TCP congestion-control algorithm. On links with a high bandwidth-delay
# product, consider BBR over the default CUBIC; always validate that it helps.
sysctl net.ipv4.tcp_congestion_control

The book's guidance: on a well-engineered dedicated cluster network the default CUBIC is usually adequate; reach for BBR only on links with a high bandwidth-delay product, and validate that the defaults are not capping throughput.²⁴
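
To check the buffer ceilings against the bandwidth-delay product (BDP), compute the BDP for the fallback path and compare it with rmem_max/wmem_max. The link speed and round-trip time below are hypothetical placeholders; measure your own RTT before drawing conclusions.

```shell
# bdp_bytes LINK_GBPS RTT_US: bandwidth-delay product in bytes
# (LINK_GBPS * 1e9 / 8) bytes/s * (RTT_US * 1e-6) s = LINK_GBPS * RTT_US * 125
bdp_bytes() {
  echo $(( $1 * $2 * 125 ))
}

rmem_max=268435456                 # ceiling from the sysctl lines above (256 MiB)
bdp=$(bdp_bytes 200 100)           # hypothetical 200 Gb/s link with a 100 us RTT
echo "BDP: ${bdp} bytes"
if [ "$bdp" -le "$rmem_max" ]; then
  echo "socket buffer ceiling covers one BDP"
else
  echo "raise rmem_max/wmem_max to at least ${bdp}"
fi
```

A single TCP flow cannot keep the link full with a window smaller than one BDP, which is why the ceilings matter more as RTT grows.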

Data path

flowchart LR
  GA["GPU A HBM"] -->|"GPUDirect RDMA register"| NA["NIC A (RoCE)"]
  NA -->|"RoCEv2 UDP/IP"| SW["Lossless Ethernet switch (PFC + ECN, adaptive routing)"]
  SW -->|"RoCEv2 UDP/IP"| NB["NIC B (RoCE)"]
  NB -->|"DMA to device memory"| GB["GPU B HBM"]
  SW -.->|"CNP on congestion"| NA

Maintain

Monitor continuously: fallbacks are silent and links flap. Use ibstat / ibstatus for HCA link state, ethtool -S <iface> and ip -s link show <iface> for byte/packet/error counters, and nvidia-smi dmon for NVLink/PCIe/network stats.²⁵ Watch switch ECN-marked and PFC-pause counters: sustained PFC pause means DCQCN is not holding the rate and you are one step from HoL blocking. Re-verify nvidia_peermem and the NET/IB path after every driver/NCCL/CUDA upgrade. For collective-level hangs and straggler diagnosis, see runbook: NCCL hang.
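
On the host side, PFC pause activity can be tracked from the NIC counters as well as from the switch. The sketch below sums per-priority pause counters from `ethtool -S` output. Counter names are driver-specific: the `rx_prio<N>_pause` / `tx_prio<N>_pause` pattern used here is an assumption, so list your NIC's counters first and adjust the pattern to match.

```shell
# pause_frames_total: sum per-priority pause counters from `ethtool -S <iface>` on stdin.
# The counter-name pattern is an assumption; adapt it to what your driver reports.
pause_frames_total() {
  awk -F: '
    $1 ~ /(rx|tx)_prio[0-7]_pause$/ { total += $2 }   # skip *_pause_duration and others
    END { printf "%.0f\n", total }'
}

# Usage: sample twice and compare; a growing total means PFC is actively pausing.
#   before=$(ethtool -S <iface> | pause_frames_total)
#   sleep 10
#   after=$(ethtool -S <iface> | pause_frames_total)
#   echo "pause frames in 10 s: $(( after - before ))"
```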

References

Chris Fregly, AI Systems Performance Engineering (O'Reilly), Chapter 4, "Tuning Distributed Networking Communication" — GPUDirect RDMA, RoCE, silent TCP fallback, GID/container pitfalls, NUMA pinning, MTU/TCP tuning, the Gloo-vs-NCCL throughput example.
NCCL Environment Variables (official): https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/env.html
NVIDIA GPUDirect RDMA: https://docs.nvidia.com/cuda/gpudirect-rdma/index.html
NVIDIA — HowTo Configure RoCE over a Lossless Fabric (PFC + ECN) End-to-End, ConnectX-4 + Spectrum: https://enterprise-support.nvidia.com/s/article/how-to-configure-roce-over-a-lossless-fabric--pfc---ecn--end-to-end-using-connectx-4-and-spectrum--trust-l2-x
NVIDIA — Understanding QoS Configuration for RoCE (DSCP/priority mapping): https://enterprise-support.nvidia.com/s/article/understanding-qos-configuration-for-roce
NVIDIA Spectrum-X (Spectrum-4 + BlueField-3/ConnectX-8 SuperNIC, NCC, adaptive routing): https://www.nvidia.com/en-us/networking/spectrum-x/

1. Fregly, Ch. 4, "High-Speed, Low-Overhead Data Transfers with RDMA": RDMA bypasses the kernel network stack and lets the NIC read/write application memory directly, avoiding per-packet CPU involvement, context switches, and buffer copies.
2. Fregly, Ch. 4: GPUDirect RDMA lets an IB/RoCE NIC DMA directly to/from a remote GPU's device memory across two servers, bypassing host CPU and system RAM; GPU buffers are registered with the NIC for one-sided reads/writes. Cf. NVIDIA GPUDirect RDMA docs.
3. Fregly, Ch. 4: "With RoCE, you get RDMA-like zero-copy transfers over Ethernet, assuming the network gear supports RDMA and is properly configured for it"; requires NVIDIA OFED drivers.
4. Spheron / NVIDIA Spectrum-X material: RoCE requires a lossless network because any drop triggers retransmission that cascades across the synchronised job. https://www.nvidia.com/en-us/networking/spectrum-x/
5. PFC is a link-layer pause on an 802.1p priority class; a shared priority causes head-of-line blocking. WWT, "Using PFC and ECN queuing methods to create lossless fabrics for AI/ML."
6. ECN marks the CE bit when a switch egress queue exceeds a configured threshold (Kmin..Kmax) instead of dropping; the receiver echoes the mark via CNP.
7. DCQCN is the reference RoCEv2 congestion-control algorithm combining ECN marking with a sender rate-adjustment loop; PFC is the last-resort backstop. NVIDIA QoS-for-RoCE article.
8. Fregly, Ch. 4 (Magnum IO / in-network compute): "Ethernet-based GPU clusters rely on technologies like RoCEv2 for RDMA but generally lack features like SHARP ... many ultrascale AI systems use InfiniBand ... SHARP should be utilized when available."
9. NVIDIA Spectrum-X: Spectrum-4 switch + BlueField-3/ConnectX-8 SuperNIC; ports lossless delivery, adaptive routing, and telemetry to Ethernet; SuperNIC offloads RoCEv2 transport and congestion control; NCC reacts faster than stock DCQCN. https://www.nvidia.com/en-us/networking/spectrum-x/
10. Fregly, Ch. 4: IB small-message latency a few microseconds vs TCP 5–10x higher; RDMA sustains hundreds of Gbps; 200+ Gb/s RoCE beats 10–25 Gb/s TCP for all-reduce.
11. Fregly, Ch. 4: without container access to /dev/infiniband, NCCL silently falls back to TCP: throughput drops from tens of GB/s to a few Gb/s with no error; a red flag is GPU utilisation dropping while CPU spikes during all-reduce.
12. Fregly, Ch. 4: mismatched container/host GID assignments (some "rdma-shared" images) block GPUDirect registration and force CPU-driven RDMA copies instead of true GPU RDMA.
13. Fregly, Ch. 4: 400 MB all-reduce on 800 Gb/s hardware: Gloo ≈200 ms (≈2 GB/s, CPU ~100%) vs NCCL ≈4 ms (≈100 GB/s, line rate). Illustrative figures, not hardware-tested here.
14. Fregly, Ch. 4: each Blackwell GPU in a GB200/GB300 NVL72 has 18 NVLink 5 links at ~100 GB/s for ~1.8 TB/s aggregate; keep traffic on NVLink/NVSwitch whenever possible.
15. Fregly, Ch. 4: verify with lsmod | grep nvidia_peermem, check dmesg, run NCCL with NCCL_DEBUG=INFO to confirm NET/IB paths, and use RDMA perftests with --use_cuda to validate GPU-to-GPU transfers.
16. NCCL docs: NCCL_IB_GID_INDEX "defines the Global ID index used in RoCE mode ... The default value is -1"; set from show_gids. https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/env.html
17. NCCL docs: NCCL_NET_GDR_LEVEL controls when to use GPUDirect RDMA by max NIC↔GPU distance (LOC/PIX/PXB/PHB/SYS); auto-selected if unspecified; replaces the deprecated NCCL_IB_GDR_LEVEL. https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/env.html
18. NCCL docs: NCCL_IB_HCA specifies which Host Channel Adapter (RDMA) interfaces to use, comma-separated <hca>[:<port>]. https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/env.html
19. NCCL docs: NCCL_IB_TC (IB traffic class, default 0), NCCL_IB_SL (service level, default 0), NCCL_IB_ADAPTIVE_ROUTING (enabled by default on IB, disabled by default on RoCE), NCCL_IB_TIMEOUT (default 20 since 2.23, was 18 since 2.14; timeout = 4.096 µs · 2^value), NCCL_IB_QPS_PER_CONNECTION (default 1). https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/env.html
20. NVIDIA "Understanding QoS Configuration for RoCE": RDMA/CNP/TCP map to distinct priorities/DSCP; the NIC traffic-class value must match the switch QoS plan. https://enterprise-support.nvidia.com/s/article/understanding-qos-configuration-for-roce
21. NVIDIA "HowTo Configure RoCE over a Lossless Fabric (PFC + ECN) End-to-End Using ConnectX-4 and Spectrum (Trust L2)". https://enterprise-support.nvidia.com/s/article/how-to-configure-roce-over-a-lossless-fabric--pfc---ecn--end-to-end-using-connectx-4-and-spectrum--trust-l2-x
22. Fregly, Ch. 4: even with RDMA the host sets up transfers and handles completion events; pin NIC interrupt/polling threads to the NIC's (and ideally GPU's) NUMA node.
23. Fregly, Ch. 4: use jumbo frames (MTU 9000) to send fewer large packets; raise net.core.rmem_max/wmem_max and net.ipv4.tcp_rmem/tcp_wmem to fully use a high-bandwidth link.
24. Fregly, Ch. 4: default CUBIC is usually adequate on dedicated cluster networks; consider BBR on high latency-bandwidth links and inspect net.ipv4.tcp_congestion_control; always validate defaults are not limiting throughput.
25. Fregly, Ch. 4: monitor with ibstat/ifstat, ethtool -S <iface>, ip -s link show <iface>, and nvidia-smi dmon for NVLink/PCIe/network statistics.