Inference parallelism: TP, PP, EP, DP for serving
Scope: choosing a parallelism layout for serving (distinct from training), tensor parallelism inside the NVLink domain for latency, pipeline parallelism across nodes for models too big for one node, expert and data parallelism for MoE capacity and throughput, and how each choice trades TTFT/TPOT against aggregate throughput and GPU count.
Serving parallelism is not training parallelism. Training optimizes steps/sec at a fixed global batch; serving optimizes TTFT and TPOT under an SLO at the lowest GPU count. The same axes (TP/PP/EP/DP) exist, but the goals, and therefore the right sizes, differ. The templates below follow the vLLM/SGLang/TensorRT-LLM docs but were not run on hardware here; validate them on your topology.
What it is
Four parallelism axes split a model and its request stream across GPUs. For serving they decompose as:
| Axis | What it partitions | Communication | Serving role |
|---|---|---|---|
| TP (tensor) | weight matrices within each layer (attention projections, MLP) | all-reduce of activations every layer | reduce per-request latency; fit a wide layer/expert |
| PP (pipeline) | the layer stack across GPUs (e.g. GPU0 = layers 1-20) | point-to-point activation handoff between stages | fit a model too deep for one node; bubbles hurt decode |
| EP (expert) | MoE experts across GPUs (sparse, per-token) | all-to-all token shuffle every MoE layer | shard MoE parameters so the model fits at all |
| DP (data) | the request stream — full model replica per group | none during inference (replicas independent) | scale throughput linearly; no per-query latency win |
A fifth axis, context (sequence) parallelism (CP), partitions a single long prompt's tokens across GPUs to cut prefill latency and per-GPU KV memory for 100k+ token inputs; it does not accelerate the token-by-token decode phase.1 It is layered on only when extreme context lengths demand it.
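The per-GPU KV memory effect of CP can be estimated with simple arithmetic. The sketch below uses the usual KV-cache size formula (K and V, per layer, per KV head) and an idealized even split of the sequence across CP ranks. The model shape (80 layers, 8 KV heads, head dimension 128, 2-byte cache entries) is an assumption chosen for illustration, not something taken from the text above; substitute the values from your model's config. It ignores any further sharding of KV heads by TP.

```python
GIB = 1024 ** 3


def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, bytes_per_elem=2):
    """KV-cache bytes for ONE sequence: K and V for every layer and KV head."""
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_elem


def kv_cache_bytes_per_gpu(total_bytes, cp_size):
    """Idealized even split of the sequence dimension across cp_size GPUs."""
    return total_bytes / cp_size


# Assumed shape (illustrative only): 80 layers, 8 KV heads, head_dim 128, 16-bit cache.
total = kv_cache_bytes(num_layers=80, num_kv_heads=8, head_dim=128, seq_len=128 * 1024)

for cp in (1, 2, 4, 8):
    per_gpu = kv_cache_bytes_per_gpu(total, cp)
    print(f"CP={cp}: {per_gpu / GIB:.1f} GiB of KV cache per GPU for one 128k-token prompt")
```

Under these assumptions a single 128k-token prompt needs 40 GiB of KV cache, which is why splitting it across CP ranks matters for memory even before any latency consideration.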
flowchart TB
M["Model + request stream"] --> TP["TP: split each layer's matmuls (all-reduce/layer)"]
M --> PP["PP: split layer stack across stages (microbatched)"]
M --> EP["EP: MoE experts across GPUs (all-to-all/layer)"]
M --> DP["DP: replicate model, shard requests"]
TP --> H["Hybrid: TP x PP x EP x DP (+CP)"]
PP --> H
EP --> H
DP --> H
Why it matters
The axes have opposite latency/throughput signatures, so the layout, not just the GPU count, sets whether you hit your SLO:
- TP buys latency on compute-bound layers: near-linear speedup as long as the per-layer all-reduce is cheap relative to compute. That holds only on high-bandwidth NVLink/NVSwitch; activations for one token are small, so the all-reduce is tolerable intra-node.2 Push TP across nodes over InfiniBand/Ethernet and the all-reduce latency dominates, so efficiency drops. Keep TP groups inside one NVLink domain.6
- PP buys capacity, costs latency. It fits a deep model in memory but adds fill/flush bubbles; for one-token-at-a-time decode each token still traverses every stage sequentially, so pure PP adds end-to-end latency. Chosen for memory reasons, not latency.3
- EP buys MoE capacity (total parameters scale ~linearly with GPUs since each token hits only its top-k experts) but adds an all-to-all per MoE layer that can dominate layer time if not overlapped, and is exposed to expert load imbalance.4
- DP buys throughput, not latency. Replicas run independent forward passes behind a load balancer; 8 replicas ≈ 8x throughput but 8x memory/cost and zero single-request speedup.5
Get this wrong and you either miss TTFT (too much PP, or TP stretched across a slow fabric) or burn GPUs (DP when the bottleneck was per-query latency).
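To see why stretching TP across a slower fabric hurts decode, here is a back-of-envelope sketch using the textbook alpha-beta cost model for a ring all-reduce. Everything numeric in it is an assumption: the fabric bandwidths and per-hop latencies are placeholders rather than measurements, two all-reduces per layer assumes a Megatron-style TP split, and real communication libraries often pick other algorithms for small messages. Treat the output as a relative comparison only.

```python
def ring_allreduce_seconds(message_bytes, n_gpus, link_bytes_per_s, hop_latency_s):
    """Alpha-beta estimate for a ring all-reduce of message_bytes across n_gpus."""
    if n_gpus <= 1:
        return 0.0
    steps = 2 * (n_gpus - 1)  # reduce-scatter phase + all-gather phase
    latency_term = steps * hop_latency_s
    bandwidth_term = (steps / n_gpus) * message_bytes / link_bytes_per_s
    return latency_term + bandwidth_term


def tp_comm_seconds_per_decode_step(hidden_size, num_layers, tp_size,
                                    link_bytes_per_s, hop_latency_s,
                                    batch_tokens=1, bytes_per_elem=2,
                                    allreduces_per_layer=2):
    """Total all-reduce time for one decode step through every layer."""
    message_bytes = hidden_size * batch_tokens * bytes_per_elem
    one = ring_allreduce_seconds(message_bytes, tp_size, link_bytes_per_s, hop_latency_s)
    return num_layers * allreduces_per_layer * one


# Placeholder fabric numbers (assumptions, NOT measurements).
FABRICS = {
    "intra-node NVLink (assumed)": dict(link_bytes_per_s=450e9, hop_latency_s=1e-6),
    "inter-node InfiniBand (assumed)": dict(link_bytes_per_s=50e9, hop_latency_s=5e-6),
}

for name, fabric in FABRICS.items():
    t = tp_comm_seconds_per_decode_step(hidden_size=8192, num_layers=80, tp_size=8, **fabric)
    print(f"{name}: ~{t * 1e3:.2f} ms of all-reduce per decode step")
```

Because a single token's activations are only a few kilobytes, the latency term dominates the bandwidth term in this model, so the per-hop latency of the fabric, multiplied by every layer, is what ends up in TPOT.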
When it is needed (and when not)
Decision order (the book's guiding principle, restated for serving):7
- TP first, up to the point of diminishing returns, within a node or a tightly coupled NVLink island (e.g. an NVL72 chassis). Approaching the full domain, all-reduce latency erodes the win; many production systems keep TP groups smaller and topology-aware even inside NVL72.
- PP minimally: just enough to fit the model in memory across nodes when no single node holds it.
- EP maximally for MoE: distribute experts across GPUs; scale total parameters with GPU count.
- DP last: add replicas to absorb concurrency once a single instance's latency is acceptable.
- CP only when prompts are extreme (100k+ tokens) and TTFT-bound.
Align the layout to the fabric: on an NVL72 all 72 GPUs are one NVLink domain (fifth-gen NVLink, 1.8 TB/s per GPU, ~130 TB/s aggregate NVLink bandwidth),10 so TP and EP groups can form within it; on two 8-GPU nodes joined by InfiniBand, keep TP local to each node and span nodes only with PP/EP/DP.8
Skip / minimize when:
- Model fits on one GPU and latency is fine → no TP/PP; use DP only to scale concurrency.
- Dense (non-MoE) model → no EP.
- Throughput is the only goal and per-query latency is already met → DP, not more TP.
- Short prompts → no CP (its per-layer boundary communication is pure overhead on short inputs).9
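The decision order and skip rules above can be sketched as a small memory-fit heuristic. This is a toy under loud assumptions: the function name `suggest_layout`, the `Layout` fields, and the `weight_fraction` knob (the share of GPU memory budgeted for weights, leaving the rest for KV cache and activations) are all hypothetical. It sizes TP and PP only so that the weights fit, and it does not model latency targets or the point of diminishing returns for TP, so use it as a first guess to profile and not as a recommendation.

```python
import math
from dataclasses import dataclass


@dataclass
class Layout:
    tp: int       # tensor-parallel size, kept inside one NVLink domain
    pp: int       # pipeline stages, only as many as needed to fit
    dp: int       # replicas that fit in the remaining GPUs
    use_ep: bool  # expert parallelism only for MoE models
    use_cp: bool  # context parallelism only for extreme prompts


def suggest_layout(weights_gib, gpu_gib, gpus_per_nvlink_domain, total_gpus,
                   is_moe=False, max_prompt_tokens=0, weight_fraction=0.6):
    budget = gpu_gib * weight_fraction  # GiB per GPU assumed available for weights

    # TP first: grow in powers of two, but never beyond the NVLink domain.
    tp = 1
    while weights_gib / tp > budget and tp * 2 <= gpus_per_nvlink_domain:
        tp *= 2

    # PP minimally: just enough stages for the weights to fit.
    pp = max(1, math.ceil(weights_gib / (budget * tp)))

    instance_gpus = tp * pp
    if instance_gpus > total_gpus:
        raise ValueError(f"need {instance_gpus} GPUs for one instance, have {total_gpus}")

    # DP last: spend whatever is left on replicas.
    dp = total_gpus // instance_gpus
    return Layout(tp=tp, pp=pp, dp=dp, use_ep=is_moe,
                  use_cp=max_prompt_tokens >= 100_000)


# Illustrative inputs: ~130 GiB of weights on one node of eight 80 GiB GPUs.
print(suggest_layout(weights_gib=130, gpu_gib=80, gpus_per_nvlink_domain=8, total_gpus=8))
```

A real choice would then adjust TP upward if per-request latency is still too high, or trade TP for DP if latency is already met and throughput is the goal.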
How: implement, integrate, maintain
Flag names below are taken verbatim from each engine's current docs. They differ per engine; do not assume portability.
vLLM. TP and PP are independent flags; EP for MoE is derived from TP x DP when you add --enable-expert-parallel:11
# Dense model, single 8-GPU NVLink node: TP for latency
vllm serve meta-llama/Llama-3.1-70B-Instruct \
--tensor-parallel-size 8
# Two 8-GPU nodes, model too deep for one node: TP within node, PP across nodes
# (run on the head node; see vLLM multi-node docs for the Ray/worker setup)
vllm serve <big-model> \
--tensor-parallel-size 8 \
--pipeline-parallel-size 2
# MoE (DeepSeek-V3) on 8 GPUs: DP attention + expert parallelism.
# EP_SIZE = TP_SIZE x DP_SIZE = 1 x 8 = 8; experts sharded, attention replicated per DP rank.
vllm serve deepseek-ai/DeepSeek-V3-0324 \
--tensor-parallel-size 1 \
--data-parallel-size 8 \
--enable-expert-parallel
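The EP sizing rule in the last vLLM example is easy to check before launching. The sketch below only does the arithmetic from the rule quoted above (EP_SIZE = TP_SIZE x DP_SIZE) and assembles the same command line; it does not call vLLM. The helper names are hypothetical, the even split of experts across ranks is a simplifying assumption, and the expert count of 256 is an assumption to confirm against the model's own config.

```python
import shlex


def vllm_ep_size(tensor_parallel_size, data_parallel_size):
    """EP size implied by --enable-expert-parallel: TP_SIZE x DP_SIZE."""
    return tensor_parallel_size * data_parallel_size


def experts_per_rank(num_experts, ep_size):
    """Experts held by each EP rank, assuming an even split (a simplification)."""
    if num_experts % ep_size:
        raise ValueError(f"{num_experts} experts do not split evenly over {ep_size} ranks")
    return num_experts // ep_size


def vllm_serve_command(model, tp, dp, expert_parallel=True):
    """Build the argv for the serve command using only the flags shown above."""
    cmd = ["vllm", "serve", model,
           "--tensor-parallel-size", str(tp),
           "--data-parallel-size", str(dp)]
    if expert_parallel:
        cmd.append("--enable-expert-parallel")
    return cmd


ep = vllm_ep_size(tensor_parallel_size=1, data_parallel_size=8)
print("EP size:", ep)
# 256 routed experts is an assumption; read the real value from the model config.
print("experts per rank:", experts_per_rank(num_experts=256, ep_size=ep))
print(shlex.join(vllm_serve_command("deepseek-ai/DeepSeek-V3-0324", tp=1, dp=8)))
```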
SGLang. Separate size flags; for MoE the all-to-all backend and an optional load balancer are explicit:12
# MoE with expert parallelism. Note: DeepEP/NIXL-EP backends require ep_size == tp_size.
python -m sglang.launch_server \
--model-path deepseek-ai/DeepSeek-V3 \
--tp-size 8 \
--ep-size 8 \
--enable-dp-attention --dp-size 8 \
--enable-eplb # expert-parallelism load balancer
TensorRT-LLM (trtllm-serve). Hybrid TP/EP for MoE is expressed as a factorization where moe_tensor_parallel_size * moe_expert_parallel_size == tensor_parallel_size:13
# extra_llm_api_config.yaml -> trtllm-serve <model> --extra_llm_api_options this.yaml
tensor_parallel_size: 8
pipeline_parallel_size: 1
moe_tensor_parallel_size: 4 # split each expert's weights across 4 GPUs
moe_expert_parallel_size: 2 # distribute experts across 2 groups (4 x 2 = 8)
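The factorization constraint above leaves only a few valid choices for a given TP size, and listing them makes the trade-off explicit. This sketch enumerates every (moe_tensor_parallel_size, moe_expert_parallel_size) pair whose product equals tensor_parallel_size and rejects anything else; the function names are hypothetical and nothing here calls TensorRT-LLM.

```python
def moe_factorizations(tensor_parallel_size):
    """All (moe_tp, moe_ep) pairs with moe_tp * moe_ep == tensor_parallel_size."""
    return [(moe_tp, tensor_parallel_size // moe_tp)
            for moe_tp in range(1, tensor_parallel_size + 1)
            if tensor_parallel_size % moe_tp == 0]


def check_moe_config(tensor_parallel_size, moe_tensor_parallel_size, moe_expert_parallel_size):
    """Raise if the MoE split does not multiply out to the TP size."""
    product = moe_tensor_parallel_size * moe_expert_parallel_size
    if product != tensor_parallel_size:
        raise ValueError(
            f"moe_tp x moe_ep = {product}, expected tensor_parallel_size = {tensor_parallel_size}")


for moe_tp, moe_ep in moe_factorizations(8):
    print(f"moe_tensor_parallel_size: {moe_tp}, moe_expert_parallel_size: {moe_ep}")

check_moe_config(8, 4, 2)  # the configuration shown above passes
```

Pairs with a larger moe_expert_parallel_size place whole experts on fewer GPUs each, while pairs with a larger moe_tensor_parallel_size split each expert's weights more finely; which end works better is workload-dependent.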
Integrate. This layer sits under the serving stack: pair it with continuous batching and KV management in inference serving, with disaggregated prefill/decode (which itself lets prefill and decode pick different parallel layouts), and with the MoE routing / load-balancing that keeps EP all-to-all from stalling on hot experts. EP rides the same NVLink/NVSwitch and RDMA fabric as NVSHMEM GPU communication; TP all-reduce and MoE all-to-all payloads can be cast to the FP8/NVFP4 formats that Tensor Cores consume, which cuts transfer bytes. CUDA graphs cut the per-step launch overhead that otherwise inflates TPOT under these layouts.
Maintain. Profile with Nsight Systems (end-to-end traces) and Nsight Compute (kernel metrics); confirm TP all-reduce and MoE all-to-all overlap with compute rather than serializing on one stream, and watch per-link NVLink/NIC utilization.14 Put TTFT/TPOT SLOs and burn-rate alerts on every layout change (SLO/SLI catalog, inference SLO-breach runbook). Re-tune the mix when the traffic's prompt/output shape shifts; the optimal layout is workload- and hardware-dependent, never a constant.
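As a minimal sketch of the SLO check mentioned above, the code below derives TTFT and TPOT from per-request timestamps and turns the violation rate into a burn rate (observed violation rate divided by the error budget). The `RequestTiming` record and its field names are hypothetical; in practice these values would come from the serving engine's metrics, and the thresholds are placeholders.

```python
from dataclasses import dataclass


@dataclass
class RequestTiming:
    arrival_s: float      # when the request was received
    first_token_s: float  # when the first output token was emitted
    last_token_s: float   # when the final output token was emitted
    output_tokens: int


def ttft(r):
    """Time to first token."""
    return r.first_token_s - r.arrival_s


def tpot(r):
    """Mean time per output token after the first; None if fewer than 2 tokens."""
    if r.output_tokens < 2:
        return None
    return (r.last_token_s - r.first_token_s) / (r.output_tokens - 1)


def violation_rate(timings, ttft_slo_s, tpot_slo_s):
    """Fraction of requests that miss either the TTFT or the TPOT threshold."""
    bad = 0
    for r in timings:
        t = tpot(r)
        if ttft(r) > ttft_slo_s or (t is not None and t > tpot_slo_s):
            bad += 1
    return bad / len(timings)


def burn_rate(observed_violation_rate, slo_target):
    """How fast the error budget (1 - slo_target) is being consumed."""
    return observed_violation_rate / (1.0 - slo_target)


timings = [
    RequestTiming(0.0, 0.25, 1.25, 11),  # TTFT 0.25 s, TPOT 0.1 s
    RequestTiming(0.0, 1.50, 2.50, 11),  # TTFT 1.5 s misses a 0.5 s threshold
]
rate = violation_rate(timings, ttft_slo_s=0.5, tpot_slo_s=0.2)
print(f"violation rate: {rate:.2f}, burn rate: {burn_rate(rate, slo_target=0.75):.1f}")
```

Comparing these numbers before and after a layout change, on the same traffic shape, is what tells you whether the new TP/PP/EP/DP mix helped.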
Worked hybrid (from the book), 64 GPUs, 60-layer / 64-expert MoE: 16 groups of 4 GPUs; PP=4 (15 layers/stage) for depth, TP=2 within each stage for width, EP spreading 16 experts/group under top-2 gating, then DP=2 replicas of the 64-GPU unit (128 GPUs in total) to double throughput, giving a 4D layout.15 Treat it as a starting point to profile, not a prescription.
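The arithmetic in the worked hybrid can be checked in a few lines. This sketch only reproduces the numbers stated above (layers per stage, GPUs per replica, total GPUs once DP is applied); the function name is hypothetical, and it deliberately does not model how experts map onto the groups, since the text gives only the per-group count.

```python
def hybrid_summary(num_layers, pp, groups, gpus_per_group, dp):
    """Derive layers per pipeline stage and GPU totals for a hybrid layout."""
    if num_layers % pp:
        raise ValueError(f"{num_layers} layers do not split evenly over {pp} stages")
    gpus_per_replica = groups * gpus_per_group
    return {
        "layers_per_stage": num_layers // pp,
        "gpus_per_replica": gpus_per_replica,
        "total_gpus": gpus_per_replica * dp,
    }


# Values from the worked example: 60 layers, PP=4, 16 groups of 4 GPUs, DP=2.
summary = hybrid_summary(num_layers=60, pp=4, groups=16, gpus_per_group=4, dp=2)
for key, value in summary.items():
    print(f"{key}: {value}")
```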
References
- Chris Fregly, "AI Systems Performance Engineering" (O'Reilly), Chapter 15 — "Multinode Inference, Parallelism, Decoding, and Routing Optimizations" (parallelism strategies table; TP/PP/EP/DP/CP; hybrid layout; guiding principle).
- vLLM — Parallelism and Scaling: https://docs.vllm.ai/en/stable/serving/parallelism_scaling/
- vLLM — Expert Parallel Deployment: https://docs.vllm.ai/en/latest/serving/expert_parallel_deployment/
- vLLM — Data Parallel Deployment: https://docs.vllm.ai/en/latest/serving/data_parallel_deployment/
- SGLang — Expert Parallelism: https://docs.sglang.io/advanced_features/expert_parallelism.html · Server arguments: https://github.com/sgl-project/sglang/blob/main/docs/advanced_features/server_arguments.md
- TensorRT-LLM — Parallelism strategies: https://nvidia.github.io/TensorRT-LLM/features/parallel-strategy.html · Expert parallelism: https://nvidia.github.io/TensorRT-LLM/advanced/expert-parallelism.html
- NVIDIA GB200 NVL72 (NVLink domain, bandwidth): https://www.nvidia.com/en-us/data-center/gb200-nvl72/
Related: Inference Serving · Serving OSS Models · Disaggregated Inference · MoE Routing & Load Balancing · Speculative Decoding · FlashAttention/MLA · NVSHMEM Comms · Tensor Cores · CUDA Graphs · Train TP · Train PP · SLO/SLI · Glossary
-
Chris Fregly, "AI Systems Performance Engineering" (O'Reilly), Ch. 15 — Context Parallelism: "CP handles contexts larger than a single GPU's memory by splitting the KV cache across GPUs... it can shorten TTFT for extremely long prompts" but "doesn't speed up the sequential, token-by-token decode phase." ↩
-
Fregly, Ch. 15 — Tensor Parallelism: "Since the activations for one token are relatively small, those all-reduce communications are not too costly on NVLink... it's best to use TP for intranode communication — or between nodes in an NVLink domain." ↩
-
Fregly, Ch. 15 — Pipeline Parallelism: "PP tends to increase end-to-end latency for a single item due to transfer overhead between pipeline stages. As such, PP is usually chosen for model capacity reasons — rather than latency reasons." ↩
-
Fregly, Ch. 15 — Expert Parallelism: "This dynamic routing introduces a communication-heavy step at each MoE layer... The upside is that expert parallelism allows the total model capacity to scale almost linearly with the number of GPUs." ↩
-
Fregly, Ch. 15 — Data Parallelism: "DP does not reduce latency for any single request. It does, however, improve throughput... serving a model with eight replicas uses 8x the GPU memory." ↩
-
vLLM Parallelism and Scaling — "set the tensor parallel size to the number of GPUs in each node, and the pipeline parallel size to the number of nodes"; for nodes lacking NVLink, "leverage pipeline parallelism instead of tensor parallelism." https://docs.vllm.ai/en/stable/serving/parallelism_scaling/ ↩
-
Fregly, Ch. 15 — Hybrid Parallelism: "use TP up to the point of diminishing returns — usually within a node or in a tightly coupled unit like an NVL72 chassis. You would then use pipeline parallelism minimally — just enough to fit the model in memory. Then, you want to maximize EP... Finally, you add data parallel replicas." ↩
-
Fregly, Ch. 15 — "a smaller cluster of two 8-GPU nodes connected with InfiniBand has full NVLink connectivity only within each node... it's best to keep TP local to each 8-GPU node — and avoid internode parallelism if possible." ↩
-
Fregly, Ch. 15 — "for short prompts, CP often adds outsized overhead. But for very long inputs and strict TTFT goals, CP can reduce prefill latency and memory per GPU." ↩
-
NVIDIA GB200 NVL72 — 72-GPU NVLink domain, fifth-generation NVLink at 1.8 TB/s per GPU, ~130 TB/s aggregate bandwidth in the domain. https://www.nvidia.com/en-us/data-center/gb200-nvl72/ ↩
-
vLLM Expert Parallel Deployment: --enable-expert-parallel; "EP_SIZE = TP_SIZE x DP_SIZE"; example: vllm serve deepseek-ai/DeepSeek-V3-0324 --tensor-parallel-size 1 --data-parallel-size 8 --enable-expert-parallel. https://docs.vllm.ai/en/latest/serving/expert_parallel_deployment/ ↩
-
SGLang server arguments and Expert Parallelism: --tp-size, --dp-size, --ep-size, --enable-dp-attention, --enable-eplb; DeepEP/Mooncake/NIXL-EP backends "only support cases where ep_size = tp_size." https://docs.sglang.io/advanced_features/expert_parallelism.html · https://github.com/sgl-project/sglang/blob/main/docs/advanced_features/server_arguments.md ↩
-
TensorRT-LLM Parallelism / Expert Parallelism: "The product of moe_tensor_parallel_size and moe_expert_parallel_size must equal tensor_parallel_size"; configured via the LLM API / trtllm-serve YAML. https://nvidia.github.io/TensorRT-LLM/features/parallel-strategy.html · https://nvidia.github.io/TensorRT-LLM/advanced/expert-parallelism.html ↩
-
Fregly, Ch. 15 — "you can use Nsight Systems for end-to-end traces and Nsight Compute for kernel-specific GPU metrics... always verify interconnect traffic, Tensor Core utilization, and other performance-enhancing mechanisms before and after making each change." ↩
-
Fregly, Ch. 15 — Hybrid Parallelism: 64-GPU example, "4-way pipeline parallelism with 15 layers per stage... within each stage (15 layers), we can use 2-way TP... use EP to spread 16 experts per group using top-2 gating... data parallelism can deploy two such 64-GPU replicas." ↩