NCCL collectives and algorithm selection

Scope: how NCCL implements all-reduce, all-gather, reduce-scatter, and broadcast, how it selects an algorithm (Ring/Tree/CollNet/NVLS) and protocol (Simple/LL/LL128) by message size and topology, the key tuning env vars (NCCL_ALGO, NCCL_PROTO, NCCL_NTHREADS, NCCL_BUFFSIZE), and how to validate with nccl-tests bus bandwidth.

flowchart TB
    AR["All-reduce request<br/>(reduce-scatter then all-gather)"]
    SIZE{"Message size /<br/>scale?"}
    AR --> SIZE
    SIZE -->|"small msg / many ranks<br/>(latency-dominated)"| TREE["Tree / NVLSTree<br/>O(log N) steps, latency-optimal"]
    SIZE -->|"large msg<br/>(bandwidth-dominated)"| RING["Ring<br/>bandwidth-optimal, latency O(N) hops"]
    SIZE -->|"multi-node, fast intra-domain"| COLL["CollNet / PAT<br/>hierarchical, RDMA + SHARP offload"]
    TREE --> NVLS["NVLS / SHARP<br/>in-network reduction (NVSwitch / IB)"]
    PROTO{"Protocol<br/>(latency vs bandwidth)?"}
    TREE --> PROTO
    RING --> PROTO
    PROTO -->|"smallest msg"| LL["LL / LL128<br/>low latency"]
    PROTO -->|"large msg"| SIMPLE["Simple<br/>peak bandwidth"]

What it is

NCCL (NVIDIA Collective Communications Library) is a many-to-many communication library providing optimized collectives (all-reduce, all-gather, broadcast, reduce-scatter) used by groups of GPUs to share data.¹ It underpins most multi-GPU training in the NVIDIA ecosystem: each GPU computes gradients on its data shard, then NCCL all-reduces those gradients so every GPU updates weights with the averaged result.¹

The four collectives in scope:

all-reduce: reduce a tensor across all ranks (typically a sum); every rank ends with the full reduced result. The gradient-sync primitive for data-parallel training. An all-reduce is equivalent to a reduce-scatter followed by an all-gather.¹
all-gather: every rank contributes a shard; every rank ends with the concatenation of all shards. Used in FSDP to re-materialize sharded parameters.
reduce-scatter: reduce across ranks, but each rank keeps only its slice of the result. The reduction half of FSDP gradient handling.
broadcast: one root rank sends identical data to all ranks (e.g. updated weights). NCCL can use NVSwitch hardware multicast for one-hop broadcast inside an NVLink domain.¹
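The semantics of the four collectives can be sketched on plain Python lists, one list per rank. This is a minimal single-process model, not NCCL code: the function names are illustrative, and the reduction is assumed to be a sum. It also checks that reduce-scatter followed by all-gather gives the same result as all-reduce.

```python
from typing import List

Rank = List[int]  # one rank's buffer, modeled as a plain list


def all_reduce(inputs: List[Rank]) -> List[Rank]:
    """Every rank ends with the element-wise sum of all inputs."""
    total = [sum(column) for column in zip(*inputs)]
    return [list(total) for _ in inputs]


def reduce_scatter(inputs: List[Rank]) -> List[Rank]:
    """Rank r ends with only slice r of the element-wise sum."""
    n, size = len(inputs), len(inputs[0])
    assert size % n == 0, "buffer length must divide evenly across ranks"
    shard = size // n
    total = [sum(column) for column in zip(*inputs)]
    return [total[r * shard:(r + 1) * shard] for r in range(n)]


def all_gather(shards: List[Rank]) -> List[Rank]:
    """Every rank ends with the concatenation of all shards, in rank order."""
    full = [x for shard in shards for x in shard]
    return [list(full) for _ in shards]


def broadcast(inputs: List[Rank], root: int = 0) -> List[Rank]:
    """Every rank ends with a copy of the root rank's buffer."""
    return [list(inputs[root]) for _ in inputs]


# Two ranks, four elements each.
ranks = [[1, 2, 3, 4], [10, 20, 30, 40]]
assert all_gather(reduce_scatter(ranks)) == all_reduce(ranks)
```

The model ignores everything that makes the real library fast (chunking, pipelining, topology); it only pins down what each rank holds before and after the call.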

NCCL runs over PCIe, NVLink, NVSwitch, InfiniBand, and TCP sockets, and automatically chooses the fastest path between any two GPUs.¹ At communicator init it detects the interconnect topology and GPU generation; for each collective call it then combines that topology with the message size to pick the algorithm+protocol combination.¹

Why it matters

Communication, not compute, is the scaling wall. The same all-reduce can run at tens of GB/s or hundreds of GB/s depending purely on whether NCCL routes over NVLink versus PCIe.¹ A topology-unaware single ring across four GPUs split over two PCIe switches forces a stage over the slow inter-switch link; a hierarchical approach keeps the bulk on NVLink. In the book's profiled example this is the difference between 60% SM utilization at 100 ms/iter and 90% SM utilization at 70 ms/iter.¹

Algorithm and protocol choice is message-size-dependent. Small messages are latency-dominated (per-step startup cost dominates); large messages are bandwidth-dominated (byte movement dominates).¹ Picking the wrong combination, or letting a misconfiguration force a fallback, does not crash: it silently costs throughput, by as much as an order of magnitude.

When it is needed (and when not)

Needed whenever multiple GPUs must agree on a tensor: DDP/FSDP gradient sync, tensor-parallel partial-sum reduction, weight broadcast at init. NCCL is the correct backend for any NVIDIA multi-GPU collective; it is PyTorch's default.¹

Tuning the selection (overriding NCCL_ALGO/NCCL_PROTO) is rarely needed. NCCL's automatic selection is good; manual override is for troubleshooting, research experiments, or a profiled, confirmed pathology such as unexpectedly high cross-node latency.¹ Set explicit values to pin behavior across NCCL upgrades; defaults can change between versions and are hard to debug when they do.¹

Not the right tool for point-to-point inference transfers (KV-cache movement): NCCL send()/recv() exist but are less optimized than NIXL for one-to-one tail latency.¹ See Disaggregated Inference. Never use the CPU-bound Gloo backend for GPU training; it stages data through host memory over TCP and runs an order of magnitude slower.¹

How: implement, integrate, maintain

Algorithm selection (topology- and size-driven)

NCCL's primary collective algorithms:¹

Ring: GPUs form a logical ring; each chunk circulates, accumulating partial sums. Load is perfectly balanced (for all-reduce, every link carries 2 x (num_gpus - 1) x (data_size / num_gpus) bytes in total, one data_size / num_gpus chunk per step) and the algorithm is bandwidth-optimal, but latency scales with hop count, O(N). Best for large messages (bandwidth-dominated).¹
Tree / NVLSTree: reduce up a spanning tree to the root, then broadcast the result back down, completing all-reduce in O(log N) steps. Lower latency for small messages; may not saturate all links on large ones. NVLSTree enables NVLink SHARP (NVLS) offload.¹
CollNet / CollTree: two-level hierarchical collectives: a high-throughput local algorithm inside each fast domain (node / NVSwitch island), then one leader per group joins a second-level tree across groups over RDMA, pipelined. Low internode latency plus full intranode bandwidth; can offload to InfiniBand SHARP when the NCCL-SHARP plugin is enabled.¹
PAT (Parallel Aggregated Tree): pipelined ring/tree hybrid: splits the message into chunks and staggers per-chunk tree reductions, achieving near-ring throughput with tree-level O(log N) per-segment latency.¹
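To make the ring description concrete, here is a minimal single-process simulation of a ring all-reduce on Python lists. It is a teaching sketch, not NCCL's implementation: the name `ring_all_reduce` and the per-rank `sent` counter are illustrative, and the buffer length is assumed to divide evenly by the rank count. It runs N-1 reduce-scatter steps followed by N-1 all-gather steps and counts how many elements each rank pushes over its outgoing link.

```python
from typing import List, Tuple


def ring_all_reduce(inputs: List[List[int]]) -> Tuple[List[List[int]], List[int]]:
    """Simulate ring all-reduce (sum). Returns (per-rank results, elements sent per rank)."""
    n, size = len(inputs), len(inputs[0])
    assert size % n == 0, "buffer length must divide evenly across ranks"
    chunk = size // n
    data = [list(buf) for buf in inputs]
    sent = [0] * n  # elements each rank sent to its right-hand neighbor

    def span(c: int) -> range:
        return range(c * chunk, (c + 1) * chunk)

    # Phase 1, reduce-scatter: at step s, rank r sends chunk (r - s) mod n
    # to rank r + 1, which adds it to its own copy of that chunk.
    for step in range(n - 1):
        # Snapshot outgoing chunks first so all sends in a step are simultaneous.
        outgoing = [[data[r][i] for i in span((r - step) % n)] for r in range(n)]
        for r in range(n):
            dst = (r + 1) % n
            for k, i in enumerate(span((r - step) % n)):
                data[dst][i] += outgoing[r][k]
            sent[r] += chunk

    # Phase 2, all-gather: rank r now owns fully reduced chunk (r + 1) mod n.
    # At step s it forwards chunk (r + 1 - s) mod n; the receiver overwrites.
    for step in range(n - 1):
        outgoing = [[data[r][i] for i in span((r + 1 - step) % n)] for r in range(n)]
        for r in range(n):
            dst = (r + 1) % n
            for k, i in enumerate(span((r + 1 - step) % n)):
                data[dst][i] = outgoing[r][k]
            sent[r] += chunk

    return data, sent
```

With 4 ranks and 8 elements per rank, each rank sends 2 x 3 x 2 = 12 elements, which is 2 x (N-1)/N times the buffer size: the same factor nccl-tests uses to convert all-reduce algorithm bandwidth into bus bandwidth.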

Rule of thumb: small messages (tens of MB) favor trees (fewer steps); large messages favor ring (bandwidth).¹ Keep as much traffic as possible on the fastest interconnect (NVLink/NVSwitch intranode); minimize PCIe and inter-NUMA hops.¹
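The latency-versus-bandwidth trade-off behind this rule of thumb can be illustrated with a textbook alpha-beta cost model. This is a deliberately simplified sketch, not NCCL's internal tuning model: `alpha` (per-step latency in seconds) and `beta` (seconds per byte) are assumed constants, the tree is assumed to be binary and perfectly pipelined, and the function names are hypothetical.

```python
import math


def ring_time(size_bytes: float, n: int, alpha: float, beta: float) -> float:
    """2*(n-1) steps, each moving one size/n chunk."""
    steps = 2 * (n - 1)
    return steps * alpha + steps * (size_bytes / n) * beta


def tree_time(size_bytes: float, n: int, alpha: float, beta: float) -> float:
    """Reduce up and broadcast down a binary tree of depth ceil(log2 n).

    Assumes ideal pipelining, so the bandwidth term is 2 * size * beta
    regardless of depth.
    """
    depth = math.ceil(math.log2(n))
    return 2 * depth * alpha + 2 * size_bytes * beta


def crossover_bytes(n: int, alpha: float, beta: float) -> float:
    """Message size at which the two model curves intersect."""
    depth = math.ceil(math.log2(n))
    return alpha * n * (n - 1 - depth) / beta


# Assumed numbers for illustration only: 8 ranks, 5 us per step, 100 GB/s links.
N, ALPHA, BETA = 8, 5e-6, 1e-11
small, large = 1024, 1e9
assert tree_time(small, N, ALPHA, BETA) < ring_time(small, N, ALPHA, BETA)
assert ring_time(large, N, ALPHA, BETA) < tree_time(large, N, ALPHA, BETA)
```

In this model the tree wins below the crossover because it pays fewer latency terms, and the ring wins above it because it moves fewer bytes per link. Real crossover points depend on fabric, protocol, and channel count, so they should be measured rather than derived from this formula.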

The valid NCCL_ALGO values per the official NCCL docs are Ring, Tree, CollnetChain, CollnetDirect, NVLS, NVLSTree, PAT; a ^ prefix excludes rather than includes. Unset (the default) lets NCCL choose from node topology and architecture.² (The book uses illustrative spellings such as NVLSTree,PAT; prefer the official token list on disagreement.¹)

# Override only for troubleshooting / A-B testing. Set BEFORE ncclCommInitRank.
export NCCL_ALGO=Tree          # force tree (latency-dominated small messages)
export NCCL_ALGO=^Ring         # exclude ring, let NCCL pick among the rest

Protocol selection (Simple / LL / LL128)

Independent of algorithm, NCCL picks a wire protocol that trades latency against bandwidth. Valid NCCL_PROTO values per the official docs are LL, LL128, Simple (^ excludes):²

LL ("low latency"): lowest latency, lowest peak bandwidth; for the smallest messages.
LL128: low latency tuned for 128-byte granularity; high bandwidth on NVLink-class links. Only available on platforms that support it.
Simple: highest peak bandwidth, higher fixed latency; for large messages.

Default is unset, which enables all supported protocols: LL,LL128,Simple where LL128 is supported, LL,Simple otherwise.²

export NCCL_PROTO=Simple        # force the bandwidth-optimal protocol
export NCCL_PROTO=^LL128         # exclude LL128 (e.g. suspected LL128 corruption)
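The include/exclude semantics of NCCL_ALGO and NCCL_PROTO can be modeled with a small helper that resolves a setting to the set of choices NCCL is still allowed to pick from. This is a sketch of the behavior described above, not NCCL's own parser: `allowed_tokens` is a hypothetical name, case-insensitive matching is an assumption, and any extended syntax beyond a comma-separated list with an optional leading ^ is out of scope.

```python
from typing import List, Optional, Sequence

# Token lists as given in this section (from the official env-var docs).
NCCL_ALGOS = ("Ring", "Tree", "CollnetChain", "CollnetDirect", "NVLS", "NVLSTree", "PAT")
NCCL_PROTOS = ("LL", "LL128", "Simple")


def allowed_tokens(spec: Optional[str], universe: Sequence[str]) -> List[str]:
    """Resolve an NCCL_ALGO / NCCL_PROTO style value to the allowed choices.

    Unset or empty means "everything"; a leading ^ turns the list into an
    exclusion list. Unknown tokens raise ValueError so typos are caught early.
    """
    if not spec:
        return list(universe)
    exclude = spec.startswith("^")
    names = [t.strip() for t in spec.lstrip("^").split(",") if t.strip()]
    lookup = {u.lower(): u for u in universe}  # assumption: case-insensitive
    unknown = [t for t in names if t.lower() not in lookup]
    if unknown:
        raise ValueError(f"unknown token(s): {unknown}")
    chosen = {lookup[t.lower()] for t in names}
    # Keep listed tokens when including, unlisted tokens when excluding.
    return [u for u in universe if (u not in chosen) == exclude]
```

For example, `allowed_tokens("^LL128", NCCL_PROTOS)` leaves LL and Simple available, which matches the intent of the exclusion example above. Whether a given platform actually supports each remaining choice is still decided by NCCL at run time.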

Key tuning env vars

| Variable | Purpose | Default (official) |
| --- | --- | --- |
| `NCCL_ALGO` | Allowed collective algorithm(s) | unset → auto by topology² |
| `NCCL_PROTO` | Allowed protocol(s) | unset → all supported² |
| `NCCL_NTHREADS` | CUDA threads per block; one block per channel | 512 on recent GPUs, 256 on some older ones² |
| `NCCL_BUFFSIZE` | Per-GPU-pair communication buffer, bytes | 4194304 (4 MiB)² |

NCCL_NTHREADS valid values are 64, 128, 256, 512.² NCCL_BUFFSIZE takes integer bytes (powers of 2 recommended).² The book cautions that raising NCCL_BUFFSIZE can improve large-all-reduce bandwidth but must be sized carefully: too high causes GPU memory pressure; start at 4 MiB and increase stepwise while monitoring memory.¹

export NCCL_NTHREADS=512
export NCCL_BUFFSIZE=4194304     # 4 MiB; raise stepwise, watch GPU memory
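A launcher script can sanity-check these two variables before starting a job. The helper below is a minimal sketch based only on the constraints stated above (the four valid thread counts, integer bytes, powers of two recommended); `check_nccl_env` is a hypothetical name, and NCCL itself may handle out-of-range values differently.

```python
from typing import List, Mapping

VALID_NTHREADS = (64, 128, 256, 512)


def check_nccl_env(env: Mapping[str, str]) -> List[str]:
    """Return human-readable problems; an empty list means nothing was flagged."""
    problems = []

    nthreads = env.get("NCCL_NTHREADS")
    if nthreads is not None:
        if not nthreads.isdigit() or int(nthreads) not in VALID_NTHREADS:
            problems.append(f"NCCL_NTHREADS={nthreads!r} is not one of {VALID_NTHREADS}")

    buffsize = env.get("NCCL_BUFFSIZE")
    if buffsize is not None:
        if not buffsize.isdigit() or int(buffsize) == 0:
            problems.append(f"NCCL_BUFFSIZE={buffsize!r} is not a positive integer byte count")
        elif int(buffsize) & (int(buffsize) - 1):
            # Non-zero only when more than one bit is set, i.e. not a power of two.
            problems.append(f"NCCL_BUFFSIZE={buffsize} is not a power of two (recommended)")

    return problems
```

Passing `os.environ` checks the current process environment; unset variables are skipped because the defaults apply.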

Related channel tunables (leave at default unless profiled): NCCL_MIN_NCHANNELS / NCCL_MAX_NCHANNELS control how many subrings/channels NCCL uses; each channel is one CUDA block, so more channels cost more GPU resources. NCCL_MIN_NCHANNELS accepts integers up to 32 (NCCL 2.5+); both are platform-dependent by default.² On NVSwitch systems NCCL auto-tunes channel count by topology and message size.¹

Integrate: confirm the path before trusting throughput

NCCL falls back silently. Always confirm the intended algorithm/protocol/transport is actually active:¹

export NCCL_DEBUG=INFO                 # log NET/IB paths, algo/proto, fallbacks
export NCCL_TOPO_DUMP_FILE=/tmp/nccl_topo.xml   # dump detected topology
export NCCL_SOCKET_IFNAME=ib0          # bootstrap handshake over the IB HCA

A red flag for a silent fallback: during all-reduce, GPU utilization drops while CPU utilization spikes, because the CPU is copying data through host memory instead of the NIC moving it via GPUDirect RDMA.¹ Do not leave debug-only kill switches (NCCL_P2P_DISABLE=1, NCCL_SHM_DISABLE=1) set in production; they force host-staged copies and collapse intranode bandwidth from hundreds of GB/s to tens.¹ See RDMA and RoCE Performance Tuning and NCCL Hang / Collective Stall.
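One cheap guard against the leftover-switch problem is a pre-launch check that refuses to start a production job while either variable is set to 1. The sketch below covers only the two variables named above; `leftover_debug_switches` is a hypothetical helper name.

```python
from typing import List, Mapping

# Debug-only switches named in this section.
DEBUG_ONLY_SWITCHES = ("NCCL_P2P_DISABLE", "NCCL_SHM_DISABLE")


def leftover_debug_switches(env: Mapping[str, str]) -> List[str]:
    """Return the debug-only switches that are set to 1 in the given environment."""
    return [name for name in DEBUG_ONLY_SWITCHES if env.get(name, "0").strip() == "1"]
```

A launcher could call `leftover_debug_switches(os.environ)` and abort with an error if the returned list is non-empty.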

Maintain: validate with nccl-tests bus bandwidth

nccl-tests reports two bandwidths. Algorithm bandwidth (algbw) = input size / time. Bus bandwidth (busbw) corrects algbw for the number of ranks so the result is comparable to hardware peak independent of rank count.³ The official correction factors are:³

| Collective | busbw = algbw x |
| --- | --- |
| AllReduce | `2*(n-1)/n` |
| AllGather | `(n-1)/n` |
| ReduceScatter | `(n-1)/n` |
| Broadcast | `1` |

where n is the number of ranks. Compare measured busbw against the link's hardware peak; a large gap means a degraded path or a suboptimal algorithm/protocol.
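The correction factors translate directly into code. The helper below is a small sketch for recomputing busbw by hand from a measured size and time; the function and dictionary names are illustrative, and GB/s is taken as 1e9 bytes per second.

```python
from typing import Callable, Dict

# busbw = algbw * factor(n), with n the number of ranks.
BUSBW_FACTOR: Dict[str, Callable[[int], float]] = {
    "all_reduce": lambda n: 2 * (n - 1) / n,
    "all_gather": lambda n: (n - 1) / n,
    "reduce_scatter": lambda n: (n - 1) / n,
    "broadcast": lambda n: 1.0,
}


def algbw_gbps(size_bytes: float, seconds: float) -> float:
    """Algorithm bandwidth: bytes moved per second, in GB/s."""
    return size_bytes / seconds / 1e9


def busbw_gbps(collective: str, algbw: float, n_ranks: int) -> float:
    """Bus bandwidth: algbw scaled by the collective's correction factor."""
    return algbw * BUSBW_FACTOR[collective](n_ranks)
```

For an 8-rank all-reduce the factor is 2 x 7/8 = 1.75, so an algbw of 100 GB/s corresponds to a busbw of 175 GB/s; that busbw figure is the one to compare against link peak.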

Run a size sweep on one node (8 GPUs), then across nodes via MPI (binary must be built with MPI=1):⁴

# Single node, 8 GPUs: sweep 8 B -> 128 MiB, doubling each step
./build/all_reduce_perf -b 8 -e 128M -f 2 -g 8

# 64 GPUs across 8 nodes (8 GPUs/node), 1 GPU per process
mpirun -np 64 -N 8 ./build/all_reduce_perf -b 8 -e 8G -f 2 -g 1

Flags: -b minimum size, -e maximum size, -f size multiplication factor, -g GPUs per thread.⁴ The sweep exposes the size-dependent crossover where NCCL shifts from latency-oriented (tree, LL/LL128) to bandwidth-oriented (ring, Simple) behavior; watch busbw rise toward link peak as message size grows.

Treat NCCL warnings as actionable: `unable to enable P2P, falling back to copy` and `NET/Socket: using Ethernet interface eth0` both indicate the fast path was not taken; track down the cause rather than ignoring it.¹ Re-validate busbw after every NCCL upgrade; performance usually improves but defaults can shift and require retuning.¹
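Re-validation after an upgrade is easier to act on when it is a mechanical comparison. The sketch below assumes busbw values have already been collected per message size from two nccl-tests sweeps (how they are extracted from the output is left open); `busbw_regressions` is a hypothetical name and the 5% tolerance is an arbitrary starting point, not a recommendation from the sources.

```python
from typing import Dict, List, Tuple


def busbw_regressions(
    baseline: Dict[int, float],
    current: Dict[int, float],
    tolerance: float = 0.05,
) -> List[Tuple[int, float, float]]:
    """Return (size_bytes, old_busbw, new_busbw) for sizes that got slower.

    Only sizes present in both sweeps are compared. A size is flagged when
    busbw dropped by more than `tolerance` as a fraction of the baseline.
    """
    flagged = []
    for size in sorted(baseline.keys() & current.keys()):
        old, new = baseline[size], current[size]
        if old > 0 and (old - new) / old > tolerance:
            flagged.append((size, old, new))
    return flagged
```

Comparing per size rather than only at the largest message matters because a changed default can move the latency-to-bandwidth crossover and hurt mid-sized messages while leaving peak busbw intact.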

References

Chris Fregly, AI Systems Performance Engineering (O'Reilly, 2026), Chapter 4, "Tuning Distributed Networking Communication" — NCCL collectives, topology awareness, communication algorithms (Ring/Tree/CollTree/CollNet/PAT), NVLS/SHARP, and environment-variable gotchas.
NVIDIA NCCL — Environment Variables: https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/env.html (NCCL_ALGO, NCCL_PROTO, NCCL_NTHREADS, NCCL_BUFFSIZE, NCCL_MIN_NCHANNELS values and defaults).
NVIDIA nccl-tests — Performance metrics (busbw correction factors): https://github.com/NVIDIA/nccl-tests/blob/master/doc/PERFORMANCE.md
NVIDIA nccl-tests — Usage (all_reduce_perf flags, MPI invocation): https://github.com/NVIDIA/nccl-tests/blob/master/README.md

Reference templates only. The commands, env vars, and bandwidth formulas here are transcribed from the book and official NVIDIA documentation; they have not been hardware-tested in this knowledge base. Validate on your own fabric before relying on any number.

1. Fregly, AI Systems Performance Engineering, Ch. 4.
2. NVIDIA NCCL Environment Variables, https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/env.html
3. NVIDIA nccl-tests PERFORMANCE.md, https://github.com/NVIDIA/nccl-tests/blob/master/doc/PERFORMANCE.md
4. NVIDIA nccl-tests README.md, https://github.com/NVIDIA/nccl-tests/blob/master/README.md