SHARP: in-network reduction

Scope: NVIDIA SHARP offloading all-reduce / reduce-scatter / all-gather into the InfiniBand switch ASIC (and NVLink SHARP / NVLS in the NVSwitch) to halve endpoint data movement and free GPU SMs: what it is, when it pays off, and the exact NCCL flags and fabric services that turn it on.

What it is

SHARP (Scalable Hierarchical Aggregation and Reduction Protocol) performs the arithmetic of a collective inside the network silicon instead of on the GPUs. As partial results from multiple endpoints flow into a switch, the switch reduces (e.g. sums) them and returns only the aggregated result, so each GPU stops shuttling intermediate buffers around the fabric.¹,¹²

Two distinct hardware paths share the SHARP name:

IB SHARP: switch-resident reduction inside Quantum-class InfiniBand switches. Reached from NCCL through the CollNet algorithm provided by the nccl-rdma-sharp-plugins package.¹,¹⁰
NVLink SHARP (NVLS): the analogous capability inside the NVSwitch fabric of an NVLink domain (e.g. a 72-GPU GB200/GB300 NVL72 rack). NVLS accelerates collectives and enables efficient all-to-all and broadcast across the domain, reached through the NVLS / NVLSTree algorithms.¹,²,¹³

SHARPv1 shipped on EDR switches; SHARPv2 added streaming aggregation for large-vector reductions at line rate, 16/32/64-bit integer and floating-point ops, and GPUDirect RDMA.¹² On the fabric side, a SHARP Aggregation Manager (sharp_am) runs on a management server alongside the Subnet Manager and builds the in-network reduction (aggregation) trees that the switches execute.³,¹¹

flowchart LR
  G0["GPU 0"] --> SW["IB switch ASIC<br/>(SHARP reduction engine)"]
  G1["GPU 1"] --> SW
  G2["GPU 2"] --> SW
  G3["GPU 3"] --> SW
  SW -->|"aggregated result B/n"| G0
  SW -->|"aggregated result B/n"| G1
  SW -->|"aggregated result B/n"| G2
  SW -->|"aggregated result B/n"| G3
  AM["sharp_am + Subnet Manager<br/>(management host)"] -.->|"builds reduction tree"| SW

Why it matters

The win is less data moved per endpoint, not just faster links. In a ring reduce-scatter of a B-byte buffer across n GPUs, each GPU normally receives B*(n-1)/n bytes over n-1 hops; with in-network reduction the switch aggregates and returns only the B/n result to each GPU, roughly 1/(n-1) of the full-ring per-endpoint receive volume.¹ For all-gather, NVLS hardware multicast lets each GPU send its B/n segment once while the network replicates it, cutting the sender's volume to 1/(n-1) of the full-ring figure.¹ Overlapping a multicast all-gather with an in-network reduce-scatter can cut the bandwidth-bound phase time roughly in half, since wall time becomes the maximum of the two operations instead of their sum.¹

All-gather has no arithmetic reduction, so its NVLS benefit comes purely from multicast replication and is smaller than the gains for all-reduce and reduce-scatter.¹
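The per-endpoint volumes above can be checked with a little integer arithmetic. The sketch below is illustrative only: the function names are made up for this example, and the 1 GiB buffer and 8-GPU count are arbitrary inputs, not measurements.

```shell
# Illustrative helpers (hypothetical names, not part of NCCL or SHARP).
# B = total reduce-scatter buffer in bytes, n = number of GPUs.

# Ring reduce-scatter: each GPU receives B*(n-1)/n bytes over n-1 hops.
ring_recv_bytes() {
  local B=$1 n=$2
  echo $(( B * (n - 1) / n ))
}

# In-network reduction: the switch returns only the reduced B/n result.
sharp_recv_bytes() {
  local B=$1 n=$2
  echo $(( B / n ))
}

B=$(( 1024 * 1024 * 1024 ))   # example input: 1 GiB buffer
n=8                           # example input: 8 GPUs
ring=$(ring_recv_bytes "$B" "$n")
sharp=$(sharp_recv_bytes "$B" "$n")
echo "ring:  $ring bytes received per GPU"
echo "sharp: $sharp bytes received per GPU"
echo "ratio: 1/$(( ring / sharp ))"
```

With these inputs the ratio line prints 1/7, which is the 1/(n-1) factor for n = 8.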

The effect is to move reduction off the GPU SMs and onto the switch. NCCL 2.27 extended SHARP to AllGather and ReduceScatter across both NVLink and IB fabrics, cutting the SM footprint of those collectives from 16 SMs to 6 or fewer per GPU.¹⁴ NVIDIA reports 2x–5x speedups for large-scale all-reduce with SHARP.¹,⁹

Gains scale with the cluster. On two-to-four-node jobs the improvement may be marginal; at 32 nodes and beyond, where the network is the bottleneck, SHARP cuts the number of serialized communication steps and the latency falls substantially.¹

When it is needed (and when not)

Needed:

Large-scale all-reduce / reduce-scatter on a Quantum-class InfiniBand fabric where the network, not the GPU, bounds the collective.¹
NVLink-domain collectives on NVSwitch (NVL72 and similar) where NVLS offload and multicast apply.¹,¹³

Not worth it (or unavailable):

Ethernet / RoCE fabrics. As of writing SHARP is primarily an InfiniBand technology. Spectrum-X improves all-reduce with congestion control and adaptive routing but does not expose switch-resident reduction engines analogous to SHARP.¹,⁵
Small clusters (two-to-four nodes), where the network is not the bottleneck and the offload barely registers.¹
Very large messages that exceed switch buffer limits. Switch reduction buffers are finite; collectives in the many-MB/GB range may fall back to regular (non-SHARP) methods. Monitor NCCL logs and alert if SHARP fallbacks begin appearing due to switch memory pressure.⁶

How: implement, integrate, maintain

Fabric prerequisites. IB SHARP needs three things in place: SHARP firmware on the switches, a sharp_am Aggregation Manager running on a management host alongside the Subnet Manager, and the nccl-rdma-sharp-plugins package discoverable by NCCL. Each host must also load the GPUDirect RDMA kernel module so collectives reach GPU memory directly.³,⁴

Confirm the GPUDirect RDMA module is loaded:

lsmod | grep nvidia_peermem

If the module is absent, NCCL may use CPU-staged RDMA copies instead of true GPUDirect.⁴
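The lsmod check above can be wrapped so a provisioning or health-check script gets a clear status line. This is a minimal sketch with hypothetical function names; it reads lsmod-style text on stdin and matches the module name in the first column exactly, so similarly named modules do not produce a false positive.

```shell
# Hypothetical helper: succeeds if lsmod-style text on stdin lists nvidia_peermem.
peermem_loaded() {
  awk '$1 == "nvidia_peermem" { found = 1 } END { exit !found }'
}

# Print a one-line status for this host, given lsmod-style text on stdin.
report_peermem() {
  if peermem_loaded; then
    echo "nvidia_peermem loaded: GPUDirect RDMA path available"
  else
    echo "nvidia_peermem missing: NCCL may use CPU-staged RDMA copies"
  fi
}
```

On a real host this would be called as `lsmod | report_peermem`. If the module is reported missing, loading it (typically `sudo modprobe nvidia_peermem`, assuming the NVIDIA driver package on the host ships it) is the usual next step.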

Enable IB SHARP (CollNet). With the plugin present, the documented variables that route eligible collectives through the switch are:¹⁰

export NCCL_COLLNET_ENABLE=1     # allow the CollNet (SHARP) algorithm
export NCCL_ALGO=CollNet         # force CollNet for eligible collectives

NCCL_ALGO is normally left unset so NCCL auto-selects per message size and topology; pin it only to force or A/B-test the SHARP path.²,¹³
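Because a forgotten NCCL_ALGO pin can silently persist from a test session into production, a small pre-launch check can help. The sketch below is an assumption about how you might structure such a check; the function name is hypothetical and it only inspects the two variables named above.

```shell
# Hypothetical pre-launch check: report the NCCL settings discussed above.
nccl_preflight() {
  echo "NCCL_COLLNET_ENABLE=${NCCL_COLLNET_ENABLE:-<unset>}"
  if [ -n "${NCCL_ALGO:-}" ]; then
    # A pinned algorithm restricts what NCCL may choose per collective.
    echo "warning: NCCL_ALGO is pinned to '${NCCL_ALGO}'; unset it for production"
  else
    echo "NCCL_ALGO unset: NCCL auto-selects per message size and topology"
  fi
}
```

Calling `nccl_preflight` at the top of a job script prints the effective settings into the job log, which makes later A/B comparisons easier to audit.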

Enable NVLink SHARP (NVLS). On third-generation NVSwitch (NVLink4, Hopper and later), NVLS is selected via the NCCL_ALGO NVLS / NVLSTree options and gated by NCCL_NVLS_ENABLE. The documented default is 2 (use NVLS where supported; do not fail if unsupported, but fail if resources cannot be allocated); 0 disables it. Note the default was 1 in NCCL 2.17–2.20.¹³

export NCCL_NVLS_ENABLE=2        # default: enable NVLS where the NVSwitch domain supports it
# optional, to force the NVLS path for inspection:
export NCCL_ALGO=NVLSTree

NVL72-class symmetric-memory low-latency kernels (NCCL 2.27+) apply within a single NVLink domain and are gated separately by NCCL_WIN_ENABLE (default on); set NCCL_WIN_ENABLE=0 to disable.¹⁴

Disable for A/B testing. SHARP is not enabled by default; it takes effect only with the plugin and environment settings above. To measure its impact, turn it off explicitly:¹

export NCCL_SHARP_DISABLE=1      # bypass SHARP; revert (=0 or unset) for production
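An A/B comparison is easiest to trust when both arms run the same command and differ only in the SHARP switch. The wrapper below is a minimal sketch under that assumption: `run_ab` is a hypothetical name, and it uses only the variables already named in this section.

```shell
# Hypothetical A/B wrapper: run the same command once per arm, changing only
# NCCL_SHARP_DISABLE. Pass your own benchmark or training command as arguments.
run_ab() {
  echo "== arm: sharp =="
  NCCL_COLLNET_ENABLE=1 NCCL_SHARP_DISABLE=0 "$@"
  echo "== arm: baseline =="
  NCCL_COLLNET_ENABLE=1 NCCL_SHARP_DISABLE=1 "$@"
}
```

For example, `run_ab ./my_benchmark.sh`, where my_benchmark.sh is a placeholder for whatever launches your collective benchmark. The variables are set only in the environment of the launched command, so the launcher must forward them to every rank; with Open MPI's mpirun that means adding `-x NCCL_COLLNET_ENABLE -x NCCL_SHARP_DISABLE`.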

Integration. Using SHARP requires no application code changes; it is a fabric-and-environment configuration. PyTorch DDP/FSDP and any NCCL-based stack inherit it once the fabric and env are configured.¹ For best results, register persistent NCCL user buffers (ncclCommRegister / ncclCommDeregister): zero-copy registration is essential to the best SHARP paths for both on-node NVLS and off-node IB, and it cuts SM/channel usage.⁷

Verify and maintain. SHARP usage shows up in NCCL logs:

export NCCL_DEBUG=INFO           # logs will name SHARP / CollNet / NVLS when active

Device-level support can be checked with ibv_devinfo.⁸ Continuously monitor the logs and alert on SHARP-to-non-SHARP fallbacks, which signal switch buffer pressure or a degraded Aggregation Manager.⁶
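The log monitoring described above can start as a coarse text scan. The sketch below uses hypothetical function names and assumes only what this section states: that NCCL_DEBUG=INFO output names SHARP, CollNet, or NVLS when those paths are active. Exact log wording varies by NCCL version, so this is a first-pass signal, not a parser.

```shell
# Hypothetical helper: count log lines that mention an offload path.
# grep -c prints 0 and exits non-zero when nothing matches, hence the "|| true".
count_offload_lines() {
  grep -ciE 'sharp|collnet|nvls' "$1" || true
}

# Print a one-line verdict for a captured NCCL log file.
check_offload_log() {
  local hits
  hits=$(count_offload_lines "$1")
  hits=${hits:-0}
  if [ "$hits" -gt 0 ]; then
    echo "offload path mentioned on $hits line(s)"
  else
    echo "no SHARP/CollNet/NVLS mention: possible fallback"
  fi
}
```

A job script might capture stderr to a file and call `check_offload_log job.log` afterwards, where job.log is a placeholder path. A line reporting that SHARP was disabled would also match these patterns, so read the matching lines before wiring the check to an alert.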

Reference-template guidance only. The flags, defaults, and prerequisites above are drawn from the cited book chapter and NVIDIA documentation; they have not been hardware-validated here. Confirm exact behavior against your NCCL version's release notes and your fabric's SHARP/UFM configuration before relying on them.

References

Chris Fregly, AI Systems Performance Engineering (O'Reilly), Chapter 4, "Tuning Distributed Networking Communication" — In-Network SHARP Aggregation, NCCL Communication Algorithms, Magnum IO in-network compute, and Key Takeaways.
NCCL-RDMA-SHARP Plugins — NVIDIA HPC-X Software Toolkit (NCCL_COLLNET_ENABLE=1, NCCL_ALGO=CollNet).
Using NVIDIA SHARP with NVIDIA NCCL — NVIDIA Networking Docs.
NVIDIA SHARP user manual (rev 3.5.2 LTS) (sharp_am Aggregation Manager).
NCCL Environment Variables — NCCL documentation (NCCL_NVLS_ENABLE, NCCL_ALGO NVLS/NVLSTree).
Enabling Fast Inference and Resilient Training with NCCL 2.27 — NVIDIA Technical Blog (SHARP for AllGather/ReduceScatter, symmetric memory, NCCL_WIN_ENABLE).
Advancing Performance with NVIDIA SHARP In-Network Computing — NVIDIA Technical Blog (SHARPv1/v2, streaming aggregation).

1. Fregly, Ch. 4, "In-Network SHARP Aggregation."
2. Fregly, Ch. 4, "NCCL Communication Algorithms" (Tree/NVLSTree, CollNet, CollTree; NCCL_ALGO override).
3. Fregly, Ch. 4: NCCL offloads via the NCCL RDMA SHARP plugin with SHARP firmware on the switches and a SHARP Aggregation Manager running alongside the Subnet Manager.
4. Fregly, Ch. 4: each host must load the GPUDirect RDMA kernel module; verify with lsmod | grep nvidia_peermem.
5. Fregly, Ch. 4: SHARP is primarily an InfiniBand technology; Spectrum-X Ethernet does not expose switch-resident reduction engines analogous to SHARP.
6. Fregly, Ch. 4: finite switch reduction buffers; very large collectives may fall back to regular methods; monitor NCCL logs and alert on fallbacks.
7. Fregly, Ch. 4, "Persistent NCCL User Buffers and Zero-Copy Registration" (ncclCommRegister / ncclCommDeregister).
8. Fregly, Ch. 4: verify with NCCL logs (NCCL_DEBUG=INFO) and ibv_devinfo; disable with NCCL_SHARP_DISABLE=1 for A/B testing.
9. Fregly, Ch. 4, "Key Takeaways": in-network computing like SHARP can accelerate collectives by 2x–5x, especially at scale.
10. NVIDIA HPC-X, "NCCL-RDMA-SHARP Plugins": NCCL_COLLNET_ENABLE=1 and NCCL_ALGO=CollNet enable SHARP aggregation with NCCL via the plugin.
11. NVIDIA SHARP user manual: the Aggregation Manager (sharp_am) receives SHARP job requests and manages reduction trees.
12. NVIDIA, "Advancing Performance with NVIDIA SHARP In-Network Computing": SHARPv2 streaming aggregation, 16/32/64-bit integer and FP ops, GPUDirect RDMA.
13. NCCL documentation, Environment Variables: NCCL_NVLS_ENABLE (default 2; 1 in NCCL 2.17–2.20), NVLS/NVLSTree via NCCL_ALGO.
14. NVIDIA, "Enabling Fast Inference and Resilient Training with NCCL 2.27": SHARP for AllGather and ReduceScatter on NVLink and IB fabrics; SM usage 16 -> 6 or fewer; symmetric memory and NCCL_WIN_ENABLE.