NCCL collectives and algorithm selection
Scope: how NCCL implements all-reduce, all-gather, reduce-scatter, and broadcast, how it selects an algorithm (Ring/Tree/CollNet/NVLS) and protocol (Simple/LL/LL128) by message size and topology, the key tuning env vars (NCCL_ALGO, NCCL_PROTO, NCCL_NTHREADS, NCCL_BUFFSIZE), and how to validate with nccl-tests bus bandwidth.
flowchart TB
AR["All-reduce request<br/>(reduce-scatter then all-gather)"]
SIZE{"Message size /<br/>scale?"}
AR --> SIZE
SIZE -->|"small msg / many ranks<br/>(latency-dominated)"| TREE["Tree / NVLSTree<br/>O(log N) steps, latency-optimal"]
SIZE -->|"large msg<br/>(bandwidth-dominated)"| RING["Ring<br/>bandwidth-optimal, latency O(N) hops"]
SIZE -->|"multi-node, fast intra-domain"| COLL["CollNet / PAT<br/>hierarchical, RDMA + SHARP offload"]
TREE --> NVLS["NVLS / SHARP<br/>in-network reduction (NVSwitch / IB)"]
PROTO{"Protocol<br/>(latency vs bandwidth)?"}
TREE --> PROTO
RING --> PROTO
PROTO -->|"smallest msg"| LL["LL / LL128<br/>low latency"]
PROTO -->|"large msg"| SIMPLE["Simple<br/>peak bandwidth"]
What it is
NCCL (NVIDIA Collective Communications Library) is a many-to-many communication library providing optimized collectives (all-reduce, all-gather, broadcast, reduce-scatter) used by groups of GPUs to share data.1 It underpins most multi-GPU training in the NVIDIA ecosystem: each GPU computes gradients on its data shard, then NCCL all-reduces those gradients so every GPU updates weights with the averaged result.1
The four collectives in scope:
- all-reduce: sum (or otherwise reduce) a tensor across all ranks; every rank ends with the full reduced result. The gradient-sync primitive for data-parallel training. An all-reduce is equivalent to a reduce-scatter followed by an all-gather.1
- all-gather: every rank contributes a shard; every rank ends with the concatenation of all shards. Used in FSDP to re-materialize sharded parameters.
- reduce-scatter: reduce across ranks, but each rank keeps only its slice of the result. The reduction half of FSDP gradient handling.
- broadcast: one root rank sends identical data to all ranks (e.g. updated weights). NCCL can use NVSwitch hardware multicast for one-hop broadcast inside an NVLink domain.1
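The semantics of the four collectives can be sketched on plain Python lists, one list per rank. This is an illustration of what each rank holds before and after the call, not of how NCCL moves the data; the function names are hypothetical and the sketch assumes the tensor length divides evenly by the rank count.

```python
from typing import List

Vector = List[float]


def reduce_scatter(inputs: List[Vector]) -> List[Vector]:
    """Rank r ends with slice r of the element-wise sum of all inputs."""
    n = len(inputs)
    length = len(inputs[0])
    assert length % n == 0, "sketch assumes an even split across ranks"
    chunk = length // n
    summed = [sum(vals) for vals in zip(*inputs)]
    return [summed[r * chunk:(r + 1) * chunk] for r in range(n)]


def all_gather(shards: List[Vector]) -> List[Vector]:
    """Every rank ends with the concatenation of all shards, in rank order."""
    full = [x for shard in shards for x in shard]
    return [list(full) for _ in shards]


def all_reduce(inputs: List[Vector]) -> List[Vector]:
    """All-reduce expressed as reduce-scatter followed by all-gather."""
    return all_gather(reduce_scatter(inputs))


def broadcast(inputs: List[Vector], root: int = 0) -> List[Vector]:
    """Every rank ends with a copy of the root rank's data."""
    return [list(inputs[root]) for _ in inputs]


ranks = [[1.0, 2.0, 3.0, 4.0], [10.0, 20.0, 30.0, 40.0]]
print(reduce_scatter(ranks))  # [[11.0, 22.0], [33.0, 44.0]]
print(all_reduce(ranks))      # both ranks hold [11.0, 22.0, 33.0, 44.0]
```

The composition in `all_reduce` is the same decomposition FSDP relies on: reduce-scatter for gradients, all-gather for parameters.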
NCCL runs over PCIe, NVLink, NVSwitch, InfiniBand, and TCP sockets, and automatically chooses the fastest path between any two GPUs.1 It detects interconnect topology and GPU generation at communicator init, then combines that with message size to pick the algorithm+protocol combination for each collective.1
Why it matters
Communication, not compute, is the scaling wall. The same all-reduce can run at tens of GB/s or hundreds of GB/s depending purely on whether NCCL routes over NVLink versus PCIe.1 A topology-unaware single ring across four GPUs split over two PCIe switches forces a stage over the slow inter-switch link; a hierarchical approach keeps the bulk on NVLink. In the book's profiled example this is the difference between 60% SM utilization at 100 ms/iter and 90% SM utilization at 70 ms/iter.1
Algorithm and protocol choice is message-size-dependent. Small messages are latency-dominated (startup cost dominates); large messages are bandwidth-dominated (byte movement dominates).1 Picking the wrong combination, or letting a misconfiguration force a fallback, can silently cost an order of magnitude in throughput without crashing.
When it is needed (and when not)
Needed whenever multiple GPUs must agree on a tensor: DDP/FSDP gradient sync, tensor-parallel partial-sum reduction, weight broadcast at init. NCCL is the correct backend for any NVIDIA multi-GPU collective; it is PyTorch's default.1
Tuning the selection (overriding NCCL_ALGO/NCCL_PROTO) is rarely needed. NCCL's automatic selection is good; manual override is for troubleshooting, research experiments, or a profiled, confirmed pathology such as unexpectedly high cross-node latency.1 Set explicit values to pin behavior across NCCL upgrades; defaults can change between versions and are hard to debug when they do.1
Not the right tool for point-to-point inference transfers (KV-cache movement): NCCL send()/recv() exist but are less optimized than NIXL for one-to-one tail latency.1 See Disaggregated Inference. Never use the CPU-bound Gloo backend for GPU training; it stages data through host memory over TCP and runs an order of magnitude slower.1
How: implement, integrate, maintain
Algorithm selection (topology- and size-driven)
NCCL's primary collective algorithms:1
- Ring: GPUs form a logical ring; each chunk circulates, accumulating partial sums. Perfectly balances load (for all-reduce, every link moves 2 x (num_gpus - 1) x (data_size / num_gpus) bytes, which approaches 2 x data_size as the ring grows) and is bandwidth-optimal, but latency scales with hop count O(N). Best for large messages (bandwidth-dominated).1
- Tree / NVLSTree: reduce-scatter then broadcast over a spanning tree, completing all-reduce in O(log N) steps. Lower latency for small messages; may not saturate all links on large ones. NVLSTree enables NVLink SHARP (NVLS) offload.1
- CollNet / CollTree: two-level hierarchical collectives: a high-throughput local algorithm inside each fast domain (node / NVSwitch island), then one leader per group joins a second-level tree across groups over RDMA, pipelined. Low internode latency plus full intranode bandwidth; can offload to InfiniBand SHARP when the NCCL-SHARP plugin is enabled.1
- PAT (Parallel Aggregated Tree): pipelined ring/tree hybrid: splits the message into chunks and staggers per-chunk tree reductions, achieving near-ring throughput with tree-level O(log N) per-segment latency.1
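The ring schedule can be made concrete with a small pure-Python simulation. This is a minimal sketch of the textbook ring all-reduce (a reduce-scatter pass followed by an all-gather pass), not NCCL's kernel; `ring_all_reduce` and its bookkeeping are hypothetical names. It also counts how many elements each rank pushes onto its outgoing link, which works out to 2 x (N-1)/N of the tensor, the same factor nccl-tests applies to all-reduce bus bandwidth.

```python
from typing import List, Tuple


def ring_all_reduce(inputs: List[List[float]]) -> Tuple[List[List[float]], List[int]]:
    """Simulate ring all-reduce; return per-rank results and elements sent per rank."""
    n = len(inputs)
    length = len(inputs[0])
    assert length % n == 0, "sketch assumes an even split into n chunks"
    chunk = length // n
    # bufs[r][c] is rank r's current copy of chunk c.
    bufs = [[list(vec[c * chunk:(c + 1) * chunk]) for c in range(n)] for vec in inputs]
    sent = [0] * n

    for phase in ("reduce_scatter", "all_gather"):
        offset = 0 if phase == "reduce_scatter" else 1
        for step in range(n - 1):
            # Snapshot every rank's outgoing chunk first: all sends in a step are concurrent.
            messages = []
            for r in range(n):
                c = (r + offset - step) % n
                messages.append((r, c, list(bufs[r][c])))
            for r, c, payload in messages:
                dst = (r + 1) % n
                if phase == "reduce_scatter":
                    # Accumulate the neighbour's partial sum into the local chunk.
                    bufs[dst][c] = [a + b for a, b in zip(bufs[dst][c], payload)]
                else:
                    # Overwrite with the fully reduced chunk.
                    bufs[dst][c] = payload
                sent[r] += len(payload)

    results = [[x for c in range(n) for x in bufs[r][c]] for r in range(n)]
    return results, sent


data = [[float(r + 1)] * 8 for r in range(4)]  # 4 ranks, 8 elements each
results, sent = ring_all_reduce(data)
print(results[0])  # [10.0, 10.0, 10.0, 10.0, 10.0, 10.0, 10.0, 10.0]
print(sent)        # [12, 12, 12, 12], i.e. 2 * (4 - 1) / 4 * 8 elements per link
```

Every rank sends the same amount, which is the load-balancing property noted for Ring; the 2 x (N-1) sequential steps are the O(N) latency cost.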
Rule of thumb: small messages (tens of MB) favor trees (fewer steps); large messages favor ring (bandwidth).1 Keep as much traffic as possible on the fastest interconnect (NVLink/NVSwitch intranode); minimize PCIe and inter-NUMA hops.1
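The latency-versus-bandwidth trade-off behind this rule of thumb can be sketched with the standard alpha-beta cost model (alpha = per-step latency, beta = seconds per byte). This is an assumption-laden illustration, not NCCL's internal tuning model: the formulas are the textbook ones for a ring and for an idealized, fully pipelined tree, and the alpha and beta values below are hypothetical.

```python
import math


def ring_time(size_bytes: float, n: int, alpha: float, beta: float) -> float:
    """2*(n-1) sequential steps, each moving size/n bytes per link."""
    return 2 * (n - 1) * alpha + 2 * (n - 1) / n * size_bytes * beta


def tree_time(size_bytes: float, n: int, alpha: float, beta: float) -> float:
    """Reduce up and broadcast down a tree of depth ceil(log2 n); pipelining assumed."""
    return 2 * math.ceil(math.log2(n)) * alpha + 2 * size_bytes * beta


def faster(size_bytes: float, n: int, alpha: float, beta: float) -> str:
    """Name the algorithm with the lower modelled time for this message size."""
    ring = ring_time(size_bytes, n, alpha, beta)
    tree = tree_time(size_bytes, n, alpha, beta)
    return "ring" if ring < tree else "tree"


ALPHA = 5e-6          # hypothetical 5 microseconds per step
BETA = 1.0 / 100e9    # hypothetical 100 GB/s link

for size in (1024, 1 << 20, 1 << 30):
    print(size, faster(size, n=8, alpha=ALPHA, beta=BETA))
```

Under these assumed constants the tree wins at 1 KiB and the ring wins at 1 GiB; where the crossover actually falls on a given fabric is something to measure with a size sweep rather than read off a model.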
The valid NCCL_ALGO values per the official NCCL docs are Ring, Tree, CollnetChain, CollnetDirect, NVLS, NVLSTree, PAT; a ^ prefix excludes rather than includes. Unset (the default) lets NCCL choose from node topology and architecture.2 (The book uses illustrative spellings such as NVLSTree,PAT; prefer the official token list on disagreement.1)
# Override only for troubleshooting / A-B testing. Set BEFORE ncclCommInitRank.
# Pick one of the two forms; a later export overwrites an earlier one.
export NCCL_ALGO=Tree # force tree (latency-dominated small messages)
# export NCCL_ALGO=^Ring # exclude ring, let NCCL pick among the rest
Protocol selection (Simple / LL / LL128)
Independent of algorithm, NCCL picks a wire protocol that trades latency against bandwidth. Valid NCCL_PROTO values per the official docs are LL, LL128, Simple (^ excludes):2
- LL ("low latency"): lowest latency, lowest peak bandwidth; for the smallest messages.
- LL128: low latency tuned for 128-byte granularity; high bandwidth on NVLink-class links. Only available on platforms that support it.
- Simple: highest peak bandwidth, higher fixed latency; for large messages.
Default is unset, which enables all supported protocols: LL,LL128,Simple where LL128 is supported, LL,Simple otherwise.2
# Pick one of the two forms; a later export overwrites an earlier one.
export NCCL_PROTO=Simple # force the bandwidth-optimal protocol
# export NCCL_PROTO=^LL128 # exclude LL128 (e.g. suspected LL128 corruption)
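A typo in NCCL_ALGO or NCCL_PROTO is easy to miss, so a launcher script might check the value before exporting it. The sketch below is a hypothetical pre-flight helper, not part of NCCL: it only understands the simple comma-separated list with an optional leading ^ described in this section, and it is deliberately strict about spelling (whether NCCL itself is case-sensitive is not something the sources here state).

```python
from typing import List, Set

# Token lists as quoted from the official docs in this section.
VALID_ALGOS = {"Ring", "Tree", "CollnetChain", "CollnetDirect", "NVLS", "NVLSTree", "PAT"}
VALID_PROTOS = {"LL", "LL128", "Simple"}


def effective_selection(value: str, valid: Set[str]) -> List[str]:
    """Return the tokens NCCL would be allowed to use for this setting.

    A leading ^ means "exclude these"; otherwise the list is an allow-list.
    Raises ValueError on any token that is not in `valid`.
    """
    exclude = value.startswith("^")
    body = value[1:] if exclude else value
    tokens = [t.strip() for t in body.split(",") if t.strip()]
    unknown = [t for t in tokens if t not in valid]
    if unknown:
        raise ValueError(f"unknown token(s) {unknown}; expected one of {sorted(valid)}")
    if exclude:
        return sorted(valid - set(tokens))
    return sorted(tokens)


print(effective_selection("Tree", VALID_ALGOS))     # ['Tree']
print(effective_selection("^LL128", VALID_PROTOS))  # ['LL', 'Simple']
```

Printing the effective set makes an exclusion such as ^Ring explicit in job logs, which helps when comparing A/B runs.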
Key tuning env vars
| Variable | Purpose | Default (official) |
|---|---|---|
| NCCL_ALGO | Allowed collective algorithm(s) | unset → auto by topology2 |
| NCCL_PROTO | Allowed protocol(s) | unset → all supported2 |
| NCCL_NTHREADS | CUDA threads per block; one block per channel | 512 on recent GPUs, 256 on some older ones2 |
| NCCL_BUFFSIZE | Per-GPU-pair communication buffer, bytes | 4194304 (4 MiB)2 |
NCCL_NTHREADS valid values are 64, 128, 256, 512.2 NCCL_BUFFSIZE takes integer bytes (powers of 2 recommended).2 The book cautions that raising NCCL_BUFFSIZE can improve large-all-reduce bandwidth but must be sized carefully: too high causes GPU memory pressure; start at 4 MiB and increase stepwise while monitoring memory.1
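The constraints on these two variables are simple enough to check before launch. The following is a minimal sketch of such a check; `check_tuning` is a hypothetical helper, and the power-of-two test reflects the recommendation above rather than a hard NCCL requirement.

```python
from typing import Dict, List

VALID_NTHREADS = (64, 128, 256, 512)


def check_tuning(env: Dict[str, str]) -> List[str]:
    """Return human-readable warnings for NCCL_NTHREADS / NCCL_BUFFSIZE settings."""
    warnings = []

    nthreads = env.get("NCCL_NTHREADS")
    if nthreads is not None and int(nthreads) not in VALID_NTHREADS:
        warnings.append(f"NCCL_NTHREADS={nthreads} is not one of {VALID_NTHREADS}")

    buffsize = env.get("NCCL_BUFFSIZE")
    if buffsize is not None:
        size = int(buffsize)
        # A positive integer is a power of two when it has exactly one bit set.
        if size <= 0 or size & (size - 1) != 0:
            warnings.append(f"NCCL_BUFFSIZE={buffsize} is not a power of two")

    return warnings


print(check_tuning({"NCCL_NTHREADS": "512", "NCCL_BUFFSIZE": "4194304"}))  # []
print(check_tuning({"NCCL_NTHREADS": "100", "NCCL_BUFFSIZE": "5000000"}))
```

In practice the dictionary would be `os.environ`; a plain dict is used here so the check can be exercised without touching the process environment.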
Related channel tunables (leave at default unless profiled): NCCL_MIN_NCHANNELS / NCCL_MAX_NCHANNELS control how many subrings/channels NCCL uses; each channel is one CUDA block, so more channels cost more GPU resources. NCCL_MIN_NCHANNELS accepts integers up to 32 (NCCL 2.5+); both are platform-dependent by default.2 On NVSwitch systems NCCL auto-tunes channel count by topology and message size.1
Integrate: confirm the path before trusting throughput
NCCL falls back silently. Always confirm the intended algorithm/protocol/transport is actually active:1
export NCCL_DEBUG=INFO # log NET/IB paths, algo/proto, fallbacks
export NCCL_TOPO_DUMP_FILE=/tmp/nccl_topo.xml # dump detected topology
export NCCL_SOCKET_IFNAME=ib0 # bootstrap handshake over the IB HCA
A red flag for a silent fallback: during all-reduce, GPU utilization drops and CPU utilization spikes, which means the CPU is copying data instead of the transfer going over GPUDirect RDMA.1 Do not leave debug-only kill switches (NCCL_P2P_DISABLE=1, NCCL_SHM_DISABLE=1) set in production; they force host-staged copies and collapse intranode bandwidth from hundreds of GB/s to tens.1 See RDMA and RoCE Performance Tuning and NCCL Hang / Collective Stall.
Maintain: validate with nccl-tests bus bandwidth
nccl-tests reports two bandwidths. Algorithm bandwidth (algbw) = input size / time. Bus bandwidth (busbw) corrects algbw for the number of ranks so the result is comparable to hardware peak independent of rank count.3 The official correction factors are:3
| Collective | busbw = algbw x |
|---|---|
| AllReduce | 2*(n-1)/n |
| AllGather | (n-1)/n |
| ReduceScatter | (n-1)/n |
| Broadcast | 1 |
where n is the number of ranks. Compare measured busbw against the link's hardware peak; a large gap means a degraded path or a suboptimal algorithm/protocol.
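The correction factors in the table translate directly into a small helper. This is a sketch for post-processing your own measurements; the function and dictionary names are hypothetical, and the 100 GB/s figure in the example is made up purely to show the arithmetic.

```python
from typing import Callable, Dict

# busbw = algbw * factor(n), with factors taken from the table above.
BUSBW_FACTOR: Dict[str, Callable[[int], float]] = {
    "all_reduce": lambda n: 2 * (n - 1) / n,
    "all_gather": lambda n: (n - 1) / n,
    "reduce_scatter": lambda n: (n - 1) / n,
    "broadcast": lambda n: 1.0,
}


def algorithm_bandwidth(size_bytes: float, seconds: float) -> float:
    """algbw in bytes per second: input size divided by elapsed time."""
    return size_bytes / seconds


def bus_bandwidth(collective: str, algbw: float, n: int) -> float:
    """Convert algbw to busbw for the given collective and rank count n."""
    return algbw * BUSBW_FACTOR[collective](n)


# Example with an invented measurement: 100 GB/s algbw on 8 ranks.
print(bus_bandwidth("all_reduce", 100.0, 8))  # 175.0
print(bus_bandwidth("all_gather", 100.0, 8))  # 87.5
```

Because the all-reduce factor approaches 2 as n grows, an all-reduce busbw can exceed its algbw; that is expected and is what makes busbw comparable to link peak.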
Run a size sweep on one node (8 GPUs), then across nodes via MPI (binary must be built with MPI=1):4
# Single node, 8 GPUs: sweep 8 B -> 128 MiB, doubling each step
./build/all_reduce_perf -b 8 -e 128M -f 2 -g 8
# 64 GPUs across 8 nodes (8 GPUs/node), 1 GPU per process
mpirun -np 64 -N 8 ./build/all_reduce_perf -b 8 -e 8G -f 2 -g 1
Flags: -b minimum size, -e maximum size, -f size multiplication factor, -g GPUs per thread.4 The sweep exposes the size-dependent crossover where NCCL shifts from latency-oriented (tree, LL/LL128) to bandwidth-oriented (ring, Simple) behavior; watch busbw rise toward link peak as message size grows.
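To see which message sizes a given -b/-e/-f combination covers, the sweep can be enumerated offline. The sketch below assumes that size suffixes are binary multiples (consistent with the "128 MiB" reading of `-e 128M` above) and that each step multiplies the previous size by the factor; `parse_size` and `sweep_sizes` are hypothetical helpers, not part of nccl-tests.

```python
from typing import List

UNITS = {"K": 1024, "M": 1024 ** 2, "G": 1024 ** 3}


def parse_size(text: str) -> int:
    """Parse sizes like '8', '128M', or '8G' into bytes (binary multiples assumed)."""
    suffix = text[-1].upper()
    if suffix in UNITS:
        return int(text[:-1]) * UNITS[suffix]
    return int(text)


def sweep_sizes(min_bytes: int, max_bytes: int, factor: int) -> List[int]:
    """List the message sizes visited when multiplying by `factor` each step."""
    assert factor > 1, "sketch covers the multiplicative sweep only"
    sizes = []
    size = min_bytes
    while size <= max_bytes:
        sizes.append(size)
        size *= factor
    return sizes


sizes = sweep_sizes(parse_size("8"), parse_size("128M"), 2)
print(len(sizes), sizes[0], sizes[-1])  # 25 8 134217728
```

Twenty-five points from 8 B to 128 MiB is usually enough to see where busbw starts climbing; extending -e to 8G, as in the multi-node command, adds six more doublings.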
Treat NCCL warnings as actionable: unable to enable P2P, falling back to copy and NET/Socket: using Ethernet interface eth0 both indicate the fast path was not taken; track down the cause rather than ignoring it.1 Re-validate busbw after every NCCL upgrade; performance usually improves but defaults can shift and require retuning.1
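Scanning job logs for these messages can be automated. The sketch below searches for the two phrases quoted above as plain substrings; the exact wording and prefix of real NCCL log lines varies by version, so treat the patterns as a starting point. The sample log lines and the `find_red_flags` name are hypothetical.

```python
from typing import Iterable, List, Tuple

# Substrings taken from the warnings quoted in this section.
RED_FLAGS = (
    "unable to enable P2P, falling back to copy",
    "NET/Socket: using Ethernet interface",
)


def find_red_flags(log_lines: Iterable[str]) -> List[Tuple[int, str]]:
    """Return (line_number, matched_phrase) for every red-flag occurrence."""
    hits = []
    for lineno, line in enumerate(log_lines, start=1):
        lowered = line.lower()
        for flag in RED_FLAGS:
            if flag.lower() in lowered:
                hits.append((lineno, flag))
    return hits


# Hypothetical log excerpt, for illustration only.
sample_log = [
    "node0:1234 NCCL INFO Bootstrap : using ib0",
    "node0:1234 NCCL WARN unable to enable P2P, falling back to copy",
    "node0:1234 NCCL INFO NET/Socket: using Ethernet interface eth0",
]

for lineno, flag in find_red_flags(sample_log):
    print(f"line {lineno}: {flag}")
```

A check like this fits naturally into the post-upgrade busbw re-validation: fail the benchmark job if any red flag appears, even when the bandwidth number looks acceptable.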
References
- Chris Fregly, AI Systems Performance Engineering (O'Reilly, 2026), Chapter 4, "Tuning Distributed Networking Communication": NCCL collectives, topology awareness, communication algorithms (Ring/Tree/CollTree/CollNet/PAT), NVLS/SHARP, and environment-variable gotchas.
- NVIDIA NCCL, Environment Variables (NCCL_ALGO, NCCL_PROTO, NCCL_NTHREADS, NCCL_BUFFSIZE, NCCL_MIN_NCHANNELS values and defaults): https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/env.html
- NVIDIA nccl-tests, Performance metrics (busbw correction factors): https://github.com/NVIDIA/nccl-tests/blob/master/doc/PERFORMANCE.md
- NVIDIA nccl-tests, Usage (all_reduce_perf flags, MPI invocation): https://github.com/NVIDIA/nccl-tests/blob/master/README.md
Reference templates only. The commands, env vars, and bandwidth formulas here are transcribed from the book and official NVIDIA documentation; they have not been hardware-tested in this knowledge base. Validate on your own fabric before relying on any number.
Related: SHARP: In-Network Reduction · NVSHMEM: GPU-Initiated Communication · Communication-Computation Overlap · RDMA and RoCE Performance Tuning · BlueField DPUs for AI Networking · HPC Networking Fabric · Fabric Bring-Up, Validation and Benchmarking · Continuous NCCL Fabric Benchmarking · NVSwitch and NVLink · Distributed Training Platform · FSDP · Tensor Parallelism · NCCL Hang / Collective Stall · Glossary
1. Fregly, AI Systems Performance Engineering, Ch. 4.
2. NVIDIA NCCL Environment Variables, https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/env.html
3. NVIDIA nccl-tests PERFORMANCE.md, https://github.com/NVIDIA/nccl-tests/blob/master/doc/PERFORMANCE.md
4. NVIDIA nccl-tests README.md, https://github.com/NVIDIA/nccl-tests/blob/master/README.md