Markdown

Expert parallelism for MoE inference¶

Scope: sharding MoE experts across GPUs (expert parallelism, EP), the all-to-all dispatch/combine it forces at every MoE layer, keeping that all-to-all on NVLink/RDMA (DeepEP), overlapping it with compute, and combining EP with TP/DP for large MoE serving.

EP is the strategy that makes trillion-parameter MoE servable at all: it spreads the expert weights across GPUs so total parameters scale with GPU count, while each token still touches only its top-k experts. The cost you pay for that is a communication-heavy all-to-all at every MoE layer. Get the all-to-all on the fabric and overlapped with compute, or it dominates layer time and starves the Tensor Cores (Fregly, Ch. 15).

What it is¶

A MoE layer is many parallel feed-forward sub-networks ("experts") plus a gating network that, per token, selects the top-k experts (typically top-2). Expert parallelism places different experts on different GPUs so the expert weights are sharded rather than replicated. With 16 experts over 4 GPUs, each GPU hosts 4 experts; when a token's gate picks 2 experts, the token's activation vector is shipped to whichever GPUs own those experts, computed, and shipped back (Fregly, Ch. 15).

That shipping is the defining feature of EP: an all-to-all dispatch scatters each token to the GPU(s) holding its assigned experts, and after the expert FFN runs, an all-to-all combine returns the outputs to their original positions. This happens at every MoE layer of every forward pass (Fregly, Ch. 15). The EP "world size" in vLLM is TP x DP: expert layers shard over all ranks while attention is either replicated (TP=1, DP-attention) or TP-sharded (TP>1) (vLLM EP docs).

flowchart LR
  subgraph G0["GPU 0 (experts 0-3)"]
    A0["tokens in"]
  end
  subgraph G1["GPU 1 (experts 4-7)"]
    A1["tokens in"]
  end
  A0 -->|"dispatch (all-to-all)"| X["route by gate top-k"]
  A1 -->|"dispatch (all-to-all)"| X
  X --> E0["expert FFN on owner GPU"]
  E0 -->|"combine (all-to-all)"| O["reorder to original tokens"]

Two routing facts shape everything downstream. First, although one token hits only k experts, in aggregate across all concurrent tokens essentially all experts are active, contending for all GPUs at once (Fregly, Ch. 15). Second, gating is data-dependent, so token counts per expert are uneven, the load-balancing problem covered in MoE routing & load balancing.

Why it matters¶

It is what lets the model fit. EP shards expert parameters across GPUs, so total model capacity scales nearly linearly with GPU count while per-token compute stays close to a much smaller dense model. A 100-expert model over 100 GPUs with top-2 gating costs ~2 experts of compute per token (Fregly, Ch. 15).
The all-to-all is the bottleneck. Dynamic routing injects a communication-heavy step at each MoE layer that "can potentially dominate inference time if not handled efficiently" and can drop SM efficiency to very low values if naive (Fregly, Ch. 15). This is the whole reason DeepEP and butterfly/hierarchical schedules exist.
Imbalance stalls everyone. The straggler effect: a hot expert that receives more tokens than its peers stalls the layer because all expert computations must complete before the layer advances, leaving other expert GPUs idle (Fregly, Ch. 15). See MoE routing & load balancing.
Each GPU needs enough tokens per expert to amortize the communication. EP is only efficient when batch/concurrency is high enough to keep every expert busy (Fregly, Ch. 15).

When it is needed (and when not)¶

Use EP when you are serving a large MoE (DeepSeek-V3/R1, Llama-4, Mixtral-class and up) whose expert weights do not fit, or fit only wastefully, on the TP group you would otherwise use. EP scales model capacity; pair it with high concurrency so experts stay fed.

Prefer TP instead (or first) when a single expert or dense layer is itself too wide for one GPU. TP splits within a layer and is the right tool for wide matmuls. The book's guiding order: use TP up to diminishing returns (within a node / NVLink island), use PP minimally just to fit depth, then maximize EP to spread experts, then add DP replicas for throughput; layer CP on top only for very long contexts (Fregly, Ch. 15).

Avoid / de-prioritize EP when concurrency is low (experts starve, all-to-all cost is not amortized), when the model is dense (no experts to shard, since EP is MoE-specific), or when your fabric cannot keep the all-to-all off the host. With weak interconnect the all-to-all dominates and EP loses to a smaller-footprint layout.

EP is orthogonal to disaggregated inference: in disaggregated serving, the prefill pool typically runs throughput-tuned EP and the decode pool latency-tuned EP (different DeepEP kernels, see below).

How: implement, integrate, maintain¶

Keep the all-to-all on the fabric with DeepEP¶

DeepEP is DeepSeek's expert-parallel communication library: high-throughput and low-latency all-to-all GPU kernels, "also known as MoE dispatch and combine" (DeepEP). It exposes two kernel families:

Normal / high-throughput kernels. Optimized for asymmetric-domain forwarding (NVLink domain to RDMA domain), with SM-count control. For training and inference prefilling (DeepEP).
Low-latency kernels. Pure-RDMA, minimal delay, for inference decoding; these support adaptive routing while normal kernels do not (DeepEP).

Published DeepEP throughput (illustrative, vendor-reported, not tested here): single-node EP8 over NVLink ~726 GB/s dispatch / ~740 GB/s combine; internode 8x2 over CX7 RDMA ~90 GB/s dispatch / ~81 GB/s combine (DeepEP). DeepEP historically built on NVSHMEM for one-sided GPU-initiated RDMA; the maintained V2 line has moved its primary path to a lighter-weight NCCL-based backend and keeps NVSHMEM for legacy kernels, so pin and verify the version you deploy (DeepEP).

In vLLM, DeepEP is selected as the all-to-all backend; it is one of several pluggable backends (naive, pplx, deepep_high_throughput, deepep_low_latency, allgather_reducescatter, FlashInfer variants) (vLLM EP docs).

Run it: vLLM EP + DP across two nodes¶

EP in vLLM rides on data parallelism: the all-to-all is only triggered when dp_size > 1 and EP is enabled, and DP enables DP-attention. Example, DeepSeek-V3 over 2 x 8 GPUs, decode-tuned (deepep_low_latency), exact flags per the vLLM docs (vLLM EP docs):

# Node 1 (primary, hosts the API server)
vllm serve deepseek-ai/DeepSeek-V3-0324 \
    --all2all-backend deepep_low_latency \
    --enable-expert-parallel \
    --tensor-parallel-size 1 \
    --data-parallel-size 16 \
    --data-parallel-size-local 8 \
    --data-parallel-address 192.168.1.100 \
    --data-parallel-rpc-port 13345 \
    --api-server-count=8

# Node 2 (secondary, headless worker-only)
vllm serve deepseek-ai/DeepSeek-V3-0324 \
    --all2all-backend deepep_low_latency \
    --enable-expert-parallel \
    --tensor-parallel-size 1 \
    --data-parallel-size 16 \
    --data-parallel-size-local 8 \
    --data-parallel-start-rank 8 \
    --data-parallel-address 192.168.1.100 \
    --data-parallel-rpc-port 13345 \
    --headless

Here EP world size = TP x DP = 1 x 16 = 16-way EP; --tensor-parallel-size 1 means attention is replicated per DP rank (DP-attention). The backend can also be set with VLLM_ALL2ALL_BACKEND=deepep_low_latency (vLLM EP docs). For disaggregated serving, run the prefill instance with deepep_high_throughput and the decode instance with deepep_low_latency (vLLM EP docs); the prefill backend follows the throughput-tuned normal kernels and decode the low-latency kernels (DeepEP).

Backend flag values and exact CLI surface track the vLLM release, so confirm against the docs for your version before deploying. The names above are from current vLLM EP documentation; older releases exposed a smaller set.

Overlap the all-to-all with compute¶

The all-to-all is hideable. The book's prescription: double-buffer so that while one batch of tokens is in flight, the previous batch's expert FFNs run; this pipelining hides most of the communication latency. Drive copy and compute on separate CUDA streams and events and verify with Nsight Systems that token-step kernels overlap NIC/NVLink transfers rather than serializing on one stream (Fregly, Ch. 15). DeepEP exposes exactly this: an event/overlap interface to run dispatch/combine asynchronously against a compute stream so background communication does not block the FFN (DeepEP). See comms/compute overlap and NVSHMEM GPU communication.

Cut the all-to-all volume¶

Hierarchical (two-stage) all-to-all. Route intra-node first over NVSwitch/NVLink (fast), then spill across nodes over RDMA only for the residual tokens needing non-local experts. This cuts internode traffic (Fregly, Ch. 15).
Butterfly / shifted all-to-all schedule. Break the exchange into phased rounds so every NVLink/NIC stays busy on partial exchanges instead of stalling on one global barrier; raises link utilization, beats naive global collectives (Fregly, Ch. 15).
Expert collocation. Place frequently co-activated experts (e.g., experts 5 and 7 often picked together) on the same GPU/node to eliminate an all-to-all hop; use gating-frequency analysis to find the pairings (Fregly, Ch. 15).
Compress activations. Cast dispatched activations to FP8/NVFP4 via Transformer Engine before the all-to-all to cut NIC load; cast/pack overhead is small relative to transfer cost (Fregly, Ch. 15). DeepEP's dispatch accepts FP8 inputs directly (DeepEP). See Tensor Cores & mixed precision.
Collective + transport hygiene. Use the topology-appropriate NCCL all-to-all (or grouped send/recv), confirm GPUDirect RDMA is enabled on internode paths, and bond IB links so multiple ports act as one logical channel (Fregly, Ch. 15). Book guidance: 1 NIC per GPU improves MoE all-to-all performance.

Combine EP with TP, DP, PP¶

No single strategy suffices for multi-trillion-parameter MoE; combine them (Fregly, Ch. 15):

Strategy	Splits	Use for MoE serving
TP	within a layer (weight matrices)	wide experts/attention too big for one GPU; keep TP group inside one NVLink island (train: tensor parallel)
PP	across layers	depth that does not fit; use minimally — decode bubbles hurt (train: pipeline parallel)
EP	experts across GPUs	scale MoE capacity; pays the all-to-all
DP	full-replica per group	throughput / more concurrent requests

Worked example from the book: 64 GPUs as 16 groups of 4; a 60-layer / 64-expert MoE runs 4-way PP (15 layers/stage), 2-way TP within each stage, EP spreading 16 experts per group with top-2 gating, and DP deploying two such 64-GPU replicas to double throughput (Fregly, Ch. 15). Add CP for very long contexts and you are at "5D" parallelism. Always align groups to topology: keep TP/EP groups inside the NVLink/NVSwitch domain (e.g., within an NVL72), and keep TP local to each 8-GPU node when nodes are only IB-connected (Fregly, Ch. 15). See inference parallelism strategies and MoE sparsity & scaling.

Maintain¶

Profile the all-to-all, not just the GEMMs. In Nsight Systems, watch the expert GPUs' all-to-all timeline; a GPU whose segment is much longer is processing more tokens, a hotspot (Fregly, Ch. 15).
Verify the overlap. Confirm NIC/NVLink transfers overlap compute and that links are saturated while NVLink shuffles run in the background; measure both internode NIC and intranode NVLink utilization (Fregly, Ch. 15).
Watch the kernel/backend split in disaggregated deployments. Prefill on throughput kernels, decode on low-latency kernels; a mismatch silently costs TTFT or TPOT.
Pin DeepEP + NVSHMEM/NCCL versions and re-validate after upgrades, since the backend has changed transports between major versions (DeepEP).
Tie EP to QoS. Hot-expert stalls show up as TPOT-tail regressions, so wire them into inference QoS / admission control, the SLO/SLI catalog, and the inference SLO-breach runbook.

References¶

Chris Fregly, AI Systems Performance Engineering (O'Reilly), Chapter 15: "Multinode Inference, Parallelism, Decoding, and Routing Optimizations" — expert parallelism, all-to-all dispatch/combine, hierarchical and butterfly schedules, expert collocation, activation compression, hybrid TP/PP/EP/DP, profiling guidance.
vLLM — Expert Parallel Deployment: https://docs.vllm.ai/en/latest/serving/expert_parallel_deployment/ (EP CLI flags, DP+EP/DP-attention semantics, all-to-all backends including deepep_high_throughput / deepep_low_latency, multi-node DeepSeek launch).
DeepEP (deepseek-ai) — efficient expert-parallel communication library: https://github.com/deepseek-ai/DeepEP (normal/high-throughput vs low-latency kernels, NVLink/RDMA domains, NVSHMEM/NCCL backend, FP8 dispatch, compute-communication overlap, reported bandwidths).
NVSHMEM (NVIDIA) — GPU-initiated one-sided communication underlying DeepEP's RDMA path: https://docs.nvidia.com/nvshmem/

Reference templates only. Bandwidth and speedup figures are book- or vendor-reported and have not been hardware-validated here; benchmark on your own fabric, topology, and model before relying on any number. On any disagreement between the book and official docs, the official docs win and the difference is noted inline (e.g., DeepEP's transport changing across versions).