Markdown

Communication-computation overlap¶

Scope: hiding collective latency behind compute by overlapping gradient all-reduce with backprop (DDP gradient bucketing), prefetching FSDP all-gather, masking the pipeline-parallel bubble (DeepSeek DualPipe), and tensor-parallel comm overlap, plus the stream/event mechanics that make it work and how to confirm overlap on a profiler timeline. For the underlying stream/event primitives see CUDA streams and concurrency; for the collectives being hidden see NCCL collectives and algorithm selection.

The timing model, the DDP bucketing rule, and the DualPipe bubble formulas below are executed and asserted with numpy (system python3). The DDP, FSDP, and Megatron code blocks require torch or Megatron-LM (not installed here); they are labelled reference templates, and the core math each one relies on is validated separately in a numpy block that is run and asserted.

What it is¶

Overlap (pipelining) is the technique of running collective communication concurrently with compute so that when one stage finishes, the data for the next is already in flight or delivered. The goal is to keep streaming multiprocessors (SMs) busy and spend less wall-clock time waiting on the network. The mechanism is asynchronous execution on multiple CUDA streams: one stream runs compute kernels (matrix multiplies) while a separate stream runs communication (NCCL all-reduce, all-gather), and cross-stream ordering is enforced with CUDA events rather than device-wide synchronization.¹

The canonical instance is data-parallel gradient reduction. PyTorch DistributedDataParallel (DDP) installs autograd hooks on the backward pass so that each gradient bucket triggers an asynchronous NCCL all-reduce on a dedicated communication stream while the default stream keeps computing gradients for earlier layers.¹ The DDP Reducer "organizes parameter gradients into buckets and reduces one bucket at a time"; parameters are assigned to buckets "in roughly the reverse order of Model.parameters()" because gradients become ready last-layer-first in backprop.² This last-layer-first readiness is what makes overlap possible: fc3's gradients reduce while fc2 and fc1 are still computing.¹ The strategy is often called wait-free backpropagation (WFBP).¹

The same idea generalizes across every parallelism axis:

FSDP: prefetch the next layer's parameter all-gather before the current layer's compute finishes, so the unshard collective overlaps the previous layer's matmul.⁴
Pipeline parallel: schedule micro-batches so forward and backward chunks (and their cross-node collectives) overlap, shrinking the pipeline bubble (DeepSeek DualPipe).⁵
Tensor parallel: overlap the all-gather and reduce-scatter introduced by sequence parallelism with the GEMMs that consume or produce their data (Megatron --tp-comm-overlap).⁶

Why use it¶

Communication that does not overlap adds directly to iteration time. The book's worked example: if forward plus backward compute takes 10 ms and gradient all-reduce takes 12 ms, the serial (no-overlap) iteration is about 22 ms, whereas a fully overlapped iteration approaches the max of the two, about 12 ms. The all-reduce is almost entirely hidden under compute.¹ On the example two-GPU workload, DDP overlap yields roughly a 30% iteration-time improvement; larger models with more buckets overlap more.¹

The diagnostic signature of good overlap: total iteration time is lower than compute_time + comm_time, and the GPU is rarely idle during the communication phase. On the profiler timeline the no-overlap case shows a clean two-phase split (all backward kernels, then all NCCL all-reduce kernels with the GPU otherwise idle), whereas the overlapped case shows a sawtooth of interleaved compute and NCCL kernels.¹

The numbers above are illustrative of the mechanism, not benchmarked hardware results; treat them as relative, not absolute. The ## How to scale it section reproduces this 22 ms versus about 12 ms result with a runnable timing model.

When to use it (and when not)¶

Overlap is the default win whenever a distributed job spends a non-trivial fraction of iteration time in collectives, which is almost every multi-GPU training job and most disaggregated-inference KV-cache transfers. See Distributed training platform and Disaggregated inference.

It is not a free lever in these cases:

The collective already overlaps fully (compute-bound). If comm is already hidden under compute, tuning buckets or prefetch depth buys nothing; the tail bucket that finishes after the last compute kernel is often the only unhidden portion.¹
Communication is bandwidth-saturated. Overlap hides latency behind compute; it does not create bandwidth. If the link is the wall, fix the fabric first: confirm RDMA/GPUDirect is active and not silently falling back to TCP. See RDMA and RoCE performance tuning and HPC networking fabric.
One large bucket or single collective. If a model is small enough that all gradients fit one bucket, DDP issues a single all-reduce at the end that overlaps almost nothing.¹
Memory is tight. Gradient accumulation (fewer, larger all-reduces) and FSDP prefetch (more parameters resident) both trade memory for overlap; DualPipe holds 2x parameter copies.¹⁵ When memory is the constraint, these can backfire. See GPU memory hierarchy.

A correctness caveat that silently eliminates overlap: any op that forces a device sync between backward and the next iteration stalls compute until all in-flight collectives finish. The book calls out .item() and CPU-bound prints/logs (which move a tensor GPU to CPU) and stray torch.cuda.synchronize() as the usual culprits, "disastrous for performance." Place at most one sync at the very end of an iteration for timing, never mid-pipeline.¹

Architecture¶

Overlap rests on three CUDA primitives (see CUDA streams and concurrency): kernels on different streams may run concurrently; a CUDA event recorded on one stream and waited on by another enforces just the ordering the data dependency needs; and NCCL collectives run on their own stream so they do not serialize behind compute. DDP records an event when a bucket's gradient is ready, launches that bucket's all-reduce on the comm stream, and the optimizer step waits on the comm stream's events, never on a device-wide barrier. The result on the timeline is two stream lanes running at once (a compute lane and an NCCL lane) instead of one lane that alternates compute and idle.

flowchart TB
  subgraph SERIAL["No overlap (serial): one lane, GPU idles during comm"]
    direction LR
    A["backward: fc3 -> fc2 -> fc1"] --> B["all-reduce fc1,fc2,fc3 (GPU idle)"]
  end
  subgraph OVL["Overlap (DDP / WFBP): compute lane + comm lane, event-ordered"]
    direction LR
    subgraph CS["compute stream (default)"]
      direction LR
      C["backward fc3"] --> D["backward fc2"] --> E["backward fc1"] --> OPT["optimizer.step (waits on comm events)"]
    end
    subgraph MS["comm stream (NCCL)"]
      direction LR
      F["all-reduce fc3"] --> G["all-reduce fc2"] --> H["all-reduce fc1"]
    end
    C -.->|"event: fc3 grad ready"| F
    D -.->|"event: fc2 grad ready"| G
    E -.->|"event: fc1 grad ready"| H
    H -.->|"event: comm done"| OPT
  end

The comm stream is a single serial queue: buckets all-reduce in ready order, and a bucket cannot start until both its gradient is ready and the stream is free. That serialization is exactly what the timing model in ## How to scale it encodes, and it is why the last bucket's all-reduce (which has no later compute to hide behind) is usually the only unhidden work.

How to use it (DDP gradient bucketing, data parallel)¶

DDP overlaps by default: wrap the model and let the reducer schedule per-bucket all-reduces. No manual dist.all_reduce after loss.backward(). This is a reference template (needs torch):

# Reference template (requires torch + a live NCCL process group; not run here).
# The core bucketing math it relies on is validated in the numpy block below.
import torch
import torch.nn as nn
import torch.distributed as dist

dist.init_process_group("nccl", init_method="env://")
torch.cuda.set_device(local_rank)

model = MyModel().to(local_rank)
ddp_model = nn.parallel.DistributedDataParallel(
    model,
    device_ids=[local_rank],
    bucket_cap_mb=25,        # default; raise for few-large-layer models
    gradient_as_bucket_view=True,  # avoids a grad copy into bucket buffers
)

output = ddp_model(data)
loss = nn.functional.mse_loss(output, target)
loss.backward()   # per-bucket NCCL all-reduce fires on the comm stream during backward
optimizer.step()

The DDP default bucket_cap_mb is 25 MiB (PyTorch uses 25 MiB when the value is None).³¹ Bucket sizing is a trade-off: very large buckets maximize bandwidth but delay the start of communication (you wait for more gradients before kicking off the all-reduce); very small buckets start sooner but pay per-call NCCL overhead. Profile a few sizes: raise it for models with very large layers, lower it for many small layers.¹

The bucketing rule is the load-bearing part, and it is pure arithmetic. This numpy block reproduces it and is executed and asserted: parameters bucket in reverse Model.parameters() order (last layer first), each bucket sealed at the cap, with an oversized layer isolated into its own bucket, a tiny model collapsing to one bucket (the "single collective" caveat), and the adversarial trade-off that a larger cap delays the first all-reduce.

# DDP gradient bucketing math, validated (system python3, numpy). Reproduces the
# Reducer's reverse-order assignment + bucket_cap_mb sealing, and quantifies the
# sizing trade-off. Run: python3 bucketing.py
import numpy as np

MIB = 1024 * 1024


def assign_buckets(param_bytes_fwd, cap_mb=25):
    """param_bytes_fwd: per-parameter grad size in FORWARD (registration) order.
    DDP buckets in reverse order because backward readies the last layer first.
    Returns buckets (lists of original indices) in fill order = all-reduce order."""
    cap = cap_mb * MIB
    buckets, cur, cur_bytes = [], [], 0
    for idx in reversed(range(len(param_bytes_fwd))):   # last layer first
        b = param_bytes_fwd[idx]
        if cur and cur_bytes + b > cap:                 # seal: do not exceed the cap
            buckets.append(cur); cur, cur_bytes = [], 0
        cur.append(idx); cur_bytes += b
    if cur:
        buckets.append(cur)
    return buckets


def first_allreduce_start_ms(param_bytes_fwd, cap_mb, compute_ms_per_mib):
    """When can the FIRST bucket's all-reduce start? Not until its bucket fills, and
    buckets fill last-layer-first during backward. Bigger cap -> more gradients must
    accumulate first -> later start. The 'large buckets delay comm start' trade-off."""
    buckets = assign_buckets(param_bytes_fwd, cap_mb)
    first_bucket_bytes = sum(param_bytes_fwd[i] for i in buckets[0])
    return (first_bucket_bytes / MIB) * compute_ms_per_mib


# happy path: reverse order + exact partition
sizes = [1 * MIB, 2 * MIB, 4 * MIB, 8 * MIB, 16 * MIB, 8 * MIB]   # forward order
buckets = assign_buckets(sizes, cap_mb=25)
flat = [i for b in buckets for i in b]
assert sorted(flat) == list(range(len(sizes)))          # every param placed once
assert len(flat) == len(set(flat))                      # no duplicates
assert flat[0] == len(sizes) - 1                        # last layer reduces first (WFBP)
for b in buckets:
    assert sum(sizes[i] for i in b) <= 25 * MIB or len(b) == 1   # cap respected

# edge 1: an oversized layer gets its OWN bucket, nothing dropped
bk = assign_buckets([40 * MIB, 1 * MIB], cap_mb=25)
assert [0] in bk and [1] in bk and sum(len(b) for b in bk) == 2

# edge 2: whole model fits one bucket -> a single collective, ~no overlap
assert len(assign_buckets([1 * MIB, 1 * MIB, 1 * MIB], cap_mb=25)) == 1

# edge 3 (adversarial): a bigger cap must NOT start its first all-reduce earlier
model = [3 * MIB] * 40
start_small = first_allreduce_start_ms(model, cap_mb=6, compute_ms_per_mib=0.1)
start_large = first_allreduce_start_ms(model, cap_mb=24, compute_ms_per_mib=0.1)
assert start_large > start_small, (start_small, start_large)
assert len(assign_buckets(model, 6)) > len(assign_buckets(model, 24))   # smaller cap, more buckets

print(f"first-reduced param idx={flat[0]} (last layer); "
      f"cap6 start={start_small:.2f}ms < cap24 start={start_large:.2f}ms")

Running it prints first-reduced param idx=5 (last layer); cap6 start=0.60ms < cap24 start=2.40ms and all asserts pass.

Reduce communication frequency with gradient accumulation: sum gradients over N micro-batches, then one all-reduce, cutting all-reduce frequency by N times at the cost of a larger effective batch and held-unreduced-gradient memory. Use ddp_model.no_sync() on the accumulation steps to suppress the per-step reduction.¹ See DDP.

How to integrate it across parallelism axes¶

The same overlap principle applies beyond data parallel; each axis has its own collective to hide.

FSDP all-gather prefetch (sharded data parallel)¶

FSDP overlaps the parameter all-gather (unshard) of the next unit with the compute of the current one. In FSDP2 the backward path always prefetches (BACKWARD_PRE) because that is the only correct way to overlap collectives in backward.⁴ For CPU-bound (CPU-launch-latency-bound) workloads, enable explicit forward prefetch so the next forward all-gather issues before the current forward compute. Reference template (needs torch):

# Reference template (requires torch FSDP; not run here).
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP, BackwardPrefetch

model = FSDP(
    model,
    backward_prefetch=BackwardPrefetch.BACKWARD_PRE,  # max overlap, higher peak memory
    forward_prefetch=True,                            # only helps CPU-bound launch
)

forward_prefetch=True "explicitly prefetches the next forward-pass all-gather before the current forward computation" and is useful only for CPU-bound workloads; BACKWARD_PRE "enables the most overlap but increases memory usage the most."⁴ Prefetch depth costs memory: more parameter shards are resident at once. See FSDP.

Pipeline-parallel bubble masking (DualPipe)¶

In pipeline parallelism the bubble is idle time at the schedule's fill and drain edges. DeepSeek DualPipe overlaps forward and backward chunks bidirectionally to cut it. The README's bubble comparison (PP = pipeline stages; F, B, W = forward, full-backward, weight-gradient times; F&B = overlapped forward-and-backward):⁵

Method	Bubble	Parameter memory
1F1B	`(PP-1)(F+B)`	1x
ZB1P	`(PP-1)(F+B-2W)`	1x
DualPipe	`(PP/2-1)(F&B+B-3W)`	2x

DualPipe requires an even number of pipeline stages and holds 2x parameter copies for the bidirectional schedule.⁵ Within a chunk it overlaps four components (attention, all-to-all dispatch, MLP, all-to-all combine), so cross-node expert-parallel communication hides under compute.⁵ This is an algorithmic schedule, not a drop-in flag. These bubble formulas are pure arithmetic; this numpy block encodes them and is executed and asserted, including the even-stage guard (an adversarial odd-PP case that must be rejected) and the no-in-chunk-overlap case:

# DualPipe bubble formulas, validated (system python3, numpy). Run: python3 dualpipe.py
import numpy as np

def bubble_1f1b(pp, F, B, W):     return (pp - 1) * (F + B)
def bubble_zb1p(pp, F, B, W):     return (pp - 1) * (F + B - 2 * W)
def bubble_dualpipe(pp, F, B, W, FB):
    assert pp % 2 == 0, "DualPipe requires an even number of pipeline stages"
    return (pp // 2 - 1) * (FB + B - 3 * W)

# happy path: PP=16, F=1 B=2 W=1, overlapped F&B=2.4 (< F+B=3, > max(F,B)=2)
PP, F, B, W, FB = 16, 1.0, 2.0, 1.0, 2.4
b1, bz, bd = bubble_1f1b(PP, F, B, W), bubble_zb1p(PP, F, B, W), bubble_dualpipe(PP, F, B, W, FB)
assert np.isclose(b1, 45.0) and np.isclose(bz, 15.0) and np.isclose(bd, 7 * 1.4)
assert bd < bz < b1, (bd, bz, b1)                        # DualPipe wins on bubble

# invariant: the PP/2 prefactor ~halves the depth term vs (PP-1)
assert bubble_dualpipe(64, F, B, W, FB) / bubble_1f1b(64, F, B, W) < 0.5

# adversarial edge: an ODD stage count must be rejected
try:
    bubble_dualpipe(15, F, B, W, FB); raise SystemExit("odd PP not rejected")
except AssertionError as e:
    assert "even" in str(e)

# edge: no in-chunk overlap (F&B == F+B) is strictly worse than real overlap
assert bubble_dualpipe(PP, F, B, W, F + B) > bd
print(f"PP=16 bubbles: 1F1B={b1} ZB1P={bz} DualPipe={bd:.1f}; odd-PP rejected: OK")

Running it prints PP=16 bubbles: 1F1B=45.0 ZB1P=15.0 DualPipe=9.8; odd-PP rejected: OK and all asserts pass. See Pipeline parallelism, Tensor parallelism, and Distributed training recipes.

Tensor-parallel comm overlap (Megatron)¶

Tensor parallel with sequence parallelism introduces an activation all-gather (forward) and reduce-scatter (backward) per layer; these can overlap the GEMMs that bound them. In Megatron-LM, enable --tp-comm-overlap.⁶ Reference template (needs Megatron-LM):

# Reference template (requires NVIDIA Megatron-LM; not run here).
export CUDA_DEVICE_MAX_CONNECTIONS=1   # required by sequence parallelism
torchrun ... pretrain_gpt.py \
  --tensor-model-parallel-size 8 \
  --sequence-parallel \
  --tp-comm-overlap

Sequence parallelism "requires setting the environment variable CUDA_DEVICE_MAX_CONNECTIONS to 1," which forces in-order kernel issue so the scheduler can place the collective before its dependent GEMM.⁷ Note the conflict: FSDP generally wants CUDA_DEVICE_MAX_CONNECTIONS unset for parallel launch, so combining TP-with-SP and FSDP needs care.⁶ The collectives "should be scheduled before compute kernels to overlap the communication with the computation, which is necessary for a speedup but not for correctness."⁶

How to run it in production and verify overlap¶

Overlap is invisible without profiling. Confirm it, do not assume it. Capture with the PyTorch profiler or Nsight Systems:

nsys profile --trace=cuda,nvtx,osrt -o overlap_check python train.py

On the timeline, read the stream lanes: NCCL collective kernels (for example ncclDevKernel_AllReduce...) must appear on a separate stream lane that runs concurrently with backward compute kernels. The sawtooth of interleaved compute and NCCL kernels is the signature of overlap. A clean two-phase split (all compute, then all NCCL, GPU idle in between) means overlap is broken.¹ PyTorch auto-emits NVTX ranges around ops, and Holistic Trace Analysis (HTA) reports the fraction of NCCL time that overlaps compute and per-rank idle-time breakdown.⁸ Annotate phases with torch.cuda.nvtx.range_push/range_pop to locate stalls. See Nsight profiling workflow.

Before trusting any overlap number in production, confirm the collective is even on the fast path: add NCCL_DEBUG=INFO to verify collectives run on the IB/NVLink path and not a silent TCP fallback that would dwarf any overlap gain. See NCCL hang / collective stall.

How to maintain it¶

Watch for accidental syncs (.item(), prints, torch.cuda.synchronize()) creeping into the hot loop; they collapse the sawtooth back to a two-phase split.¹
Keep NCCL_DEBUG=INFO spot-checks in the pipeline so a driver or fabric change that silently drops the collective to TCP is caught before it erases the overlap gain. See RDMA and RoCE performance tuning.
Re-profile after changing bucket size, prefetch depth, or parallelism degree; the optimal overlap point is topology- and model-specific.¹

How to scale it¶

The scaling question is how much iteration time overlap actually removes, and where it stops paying. It removes the hidden fraction of comm and leaves the tail: the last bucket's all-reduce has no later compute to hide behind. This numpy block encodes the WFBP timing model (a single serial comm stream servicing buckets in ready order), reproduces the book's 22 ms serial versus about 12 ms overlapped result, and is executed and asserted. The adversarial checks are an independent event-driven reference simulator (a different control flow) that must agree with the closed form over 200 random bucket sets, plus three edges: a single bucket overlaps nothing, a compute-bound run leaves only a tiny tail, and a mid-iteration device sync collapses overlap back to fully serial.

# WFBP overlap timing model, validated (system python3, numpy). Run: python3 overlap.py
import numpy as np


def overlap_finish(compute_ms, comm_ms):
    """End time of a bucketed backward with ONE serial comm stream. Bucket i's
    compute ends at the running compute sum (ready order = last layer first); its
    all-reduce starts at max(ready, comm-stream-free), so comm serializes behind both."""
    compute_ms = np.asarray(compute_ms, np.float64)
    comm_ms = np.asarray(comm_ms, np.float64)
    assert compute_ms.shape == comm_ms.shape and compute_ms.ndim == 1
    ready = np.cumsum(compute_ms)
    comm_free = 0.0
    for i in range(compute_ms.size):
        comm_free = max(ready[i], comm_free) + comm_ms[i]   # wait for compute AND stream
    return max(ready[-1], comm_free)                        # backward done + last comm drained


def overlap_finish_ref(compute_ms, comm_ms):
    """Independent event-driven reference: walk the comm stream bucket by bucket,
    idling until each is ready. Exact (no time-stepping); a structurally different
    cross-check of overlap_finish."""
    ready = np.cumsum(np.asarray(compute_ms, np.float64))
    comm_ms = np.asarray(comm_ms, np.float64)
    clock = 0.0
    for i in range(ready.size):
        clock = max(clock, float(ready[i])) + float(comm_ms[i])
    return max(float(ready[-1]), clock)


def serial_finish(compute_ms, comm_ms):     # no overlap: all backward, THEN all comm
    return float(np.sum(compute_ms) + np.sum(comm_ms))


# happy path: reproduce the worked example (10 ms compute, 12 ms comm, many buckets)
K = 100
compute, comm = np.full(K, 10.0 / K), np.full(K, 12.0 / K)
serial, overlapped = serial_finish(compute, comm), overlap_finish(compute, comm)
assert abs(serial - 22.0) < 1e-9, serial                    # serial = compute + comm = 22
assert 12.0 <= overlapped <= 12.0 + 10.0 / K + 1e-9         # floor is comm(12) + one tail bucket
assert (serial - overlapped) / serial > 0.44                # ~45% here; corroborates >30% claim

# adversarial: closed form == independent reference over 200 random bucket sets
rng = np.random.default_rng(7)
for _ in range(200):
    m = int(rng.integers(1, 12))
    c, q = rng.uniform(0.05, 3.0, m), rng.uniform(0.05, 3.0, m)
    assert abs(overlap_finish(c, q) - overlap_finish_ref(c, q)) < 1e-9

# edge 1: a single bucket overlaps NOTHING -> serial == overlapped
assert overlap_finish([10.0], [12.0]) == serial_finish([10.0], [12.0]) == 22.0

# edge 2: compute-bound -> only the final comm bucket is unhidden
cc, qq = np.full(8, 5.0), np.full(8, 0.5)                    # 40 ms compute, 4 ms comm
assert abs(overlap_finish(cc, qq) - (40.0 + 0.5)) < 1e-9    # 40.5 ms: one 0.5 ms tail

# edge 3 (adversarial): a mid-iteration sync forces each bucket to fully drain comm
def finish_with_sync(compute_ms, comm_ms):   # the .item()/synchronize() anti-pattern
    return serial_finish(compute_ms, comm_ms)
csync, qsync = np.full(4, 3.0), np.full(4, 2.0)
assert finish_with_sync(csync, qsync) > overlap_finish(csync, qsync)   # sync is strictly worse

print(f"serial={serial:.1f}ms overlapped={overlapped:.3f}ms "
      f"improvement={(serial-overlapped)/serial*100:.1f}%")

Running it prints serial=22.0ms overlapped=12.100ms improvement=45.0% and all asserts pass. The model makes the scaling limits concrete: more, smaller buckets shrink the unhidden tail toward one bucket's comm time, but once comm is fully hidden (edge 2) the only lever left is the fabric. Overlap hides latency; it never manufactures bandwidth, so a bandwidth-saturated link (see the ## When to use it bandwidth caveat) caps the win no matter how the buckets are tuned. Scaling the parallelism degree changes both the collective size and the compute per rank, so the optimal bucket size and prefetch depth are topology-specific and must be re-profiled. See Performance optimization and tuning and Scaling to 100T parameters.

Failure modes¶

A device sync in the hot loop kills overlap. .item(), CPU-bound prints and logging (a GPU-to-CPU tensor move), or a stray torch.cuda.synchronize() between backward and the next iteration stalls compute until every in-flight collective drains, collapsing the sawtooth to a two-phase split. Keep at most one sync at the very end of an iteration, for timing.¹
Silent TCP fallback. If the collective drops off the IB/NVLink path onto TCP, the comm cost balloons and dwarfs any overlap gain. NCCL_DEBUG=INFO confirms the transport; do not read overlap numbers until it does. See NCCL hang / collective stall.
One large bucket / single collective. A model small enough to fit one bucket issues a single end-of-backward all-reduce that overlaps almost nothing; the timing model's single-bucket edge shows serial equals overlapped.¹
Overlap traded for OOM. Gradient accumulation holds unreduced gradients, FSDP prefetch keeps more shards resident, and DualPipe holds 2x parameter copies; when memory is the binding constraint these overlap levers backfire. See GPU memory hierarchy and Training OOM runbook.
Assuming overlap without profiling. Overlap is invisible in wall-clock aggregate; a clean two-phase timeline (all compute, then all NCCL, GPU idle between) means it is broken even if the code looks correct. Confirm the two concurrent stream lanes on the profiler.¹
CUDA_DEVICE_MAX_CONNECTIONS conflict. Megatron TP-with-SP requires it set to 1 (in-order issue so the collective precedes its GEMM); FSDP generally wants it unset for parallel launch. Combining the two without reconciling this setting breaks one path's overlap.⁶⁷

References¶

Chris Fregly, AI Systems Performance Engineering (O'Reilly), Chapter 4, "Tuning Distributed Networking Communication": "Overlapping Communication and Computation (Pipelining)," "Asynchronous Execution with Streams," "Reducing Communication Frequency and Volume," and "Achieving Maximal Overlap in Practice." Source of the DDP/WFBP overlap mechanism, the 10 ms / 12 ms / 22 ms versus about 12 ms worked example, the roughly 30% iteration improvement, the 25 MiB default bucket, the sawtooth-versus-two-phase profiler signature, and the .item()/synchronize() overlap-killing caveat. ↩↩↩↩↩↩↩↩↩↩↩↩↩↩↩↩↩↩↩↩
PyTorch, "Distributed Data Parallel" design note (Reducer bucketing and reverse-Model.parameters() bucket order). https://docs.pytorch.org/docs/stable/notes/ddp.html ↩
PyTorch, torch.nn.parallel.DistributedDataParallel API (bucket_cap_mb defaults to 25 MiB when None; gradient_as_bucket_view). https://docs.pytorch.org/docs/stable/generated/torch.nn.parallel.DistributedDataParallel.html ↩
PyTorch, FullyShardedDataParallel / FSDP2 docs: backward_prefetch=BACKWARD_PRE (only correct in-backward overlap; most overlap, most memory) and forward_prefetch (explicit next-forward all-gather, useful only for CPU-bound workloads). https://docs.pytorch.org/docs/main/fsdp.html and https://github.com/pytorch/torchtitan/blob/main/docs/fsdp.md ↩↩↩
DeepSeek-AI, DualPipe: bidirectional pipeline parallelism for computation-communication overlap (DeepSeek-V3/R1). Bubble formulas (PP-1)(F+B) / (PP-1)(F+B-2W) / (PP/2-1)(F&B+B-3W), even-stage requirement, 2x parameter memory, four overlapped chunk components. https://github.com/deepseek-ai/DualPipe and DeepSeek-V3 Technical Report https://arxiv.org/abs/2412.19437 ↩↩↩↩↩
NVIDIA Megatron-LM: --tp-comm-overlap overlaps TP+SP activation collectives with compute; collectives scheduled before compute kernels are required for the speedup but not for correctness; TP/CP-with-FSDP needs different CUDA_DEVICE_MAX_CONNECTIONS settings. https://github.com/NVIDIA/Megatron-LM/blob/main/megatron/training/arguments.py and https://docs.nvidia.com/megatron-core/developer-guide/latest/user-guide/parallelism-guide.html ↩↩↩↩↩
NVIDIA Megatron Core, core.tensor_parallel.layers: sequence parallelism requires CUDA_DEVICE_MAX_CONNECTIONS=1. https://docs.nvidia.com/megatron-core/developer-guide/latest/apidocs/core/core.tensor_parallel.layers.html ↩↩
NVIDIA Nsight Systems + PyTorch profiler: nsys profile --trace=cuda,nvtx, NVTX op annotations, and Holistic Trace Analysis all-reduce-overlap / idle-time breakdown for confirming compute-comm overlap on the timeline. https://docs.pytorch.org/tutorials/intermediate/FSDP_tutorial.html ↩