
Inference disaggregation in practice: rate matching and the Pareto frontier

Scope: the decision layer of prefill/decode disaggregation, namely when splitting the phases beats co-located serving, how to size the two pools (rate matching), and how to read the throughput-interactivity Pareto frontier. The material is distilled from the NVIDIA study "Beyond the Buzz: A Pragmatic Take on Inference Disaggregation" (arXiv 2506.05508), which simulates hundreds of thousands of design points across workloads, traffic patterns, and hardware configurations. The mechanics of running a disaggregated stack (PD split, KV transfer with NIXL, Dynamo/vLLM wiring) are in disaggregated inference; the latency SLOs this page trades against are defined in inference serving SLOs.

All study numbers on this page are simulated results from the paper (produced with a proprietary high-fidelity datacenter-scale GPU simulator modeling Blackwell systems at FP4 precision); the authors report most of them in normalized form to convey trends, not absolute performance claims. The numpy example is executed and asserted; it is a first-order fluid model of the balance condition, not a rerun of the study.

flowchart TB
  W["Workload profile: P50 ISL and OSL, request rate, FTL and TTL SLAs"] --> Q{"Prefill-heavy (ISL much larger than OSL) and model above ~10B?"}
  Q -->|"no: small model or generation-heavy"| CO["Co-located serving (piggybacked chunking often enough)"]
  Q -->|"yes"| D["Disaggregate prefill and decode"]
  D --> PM["Partition each pool separately (prefill: CPP; decode: EP or TP as TTL tightens)"]
  PM --> RM["Rate match: pick ctx:gen instance ratio so phase throughputs balance"]
  RM --> PARETO["Operating point on the throughput-interactivity Pareto frontier"]
  PARETO -->|"SLA or traffic mix shifts"| RM

What it is

Disaggregation splits autoregressive inference into a prefill (context) pool and a decode (generation) pool so each phase can adopt its own model partitioning and batching. The open question has never been the mechanism but the payoff surface: for which models, traffic mixes, and latency targets does the split beat a well-tuned co-located baseline with in-flight batching and chunked-prefill piggybacking. The NVIDIA study answers that empirically at datacenter scale, framing every configuration as a point on a throughput-interactivity Pareto frontier: system throughput (tokens/s/GPU across all GPUs deployed) against interactivity (tokens/s/user, the reciprocal of token-to-token latency).¹

Two service metrics anchor the frontier: first token latency (FTL) constrains only the prefill pool, and token-to-token latency (TTL) constrains only the decode pool. That separation is the core advantage over co-location, where one instance must optimize both metrics at once and a strict TTL SLA artificially slows prefill.¹ The study's headline findings: disaggregation is most effective for prefill-heavy traffic (input sequence length much larger than output sequence length) and larger models (above roughly 10B parameters), and realizing the gains requires dynamic rate matching of the prefill-to-decode GPU ratio rather than a fixed split.¹

Rate matching is the sizing step specific to disaggregation: given an optimal partitioning for each pool, choose the ratio of prefill to decode instances so the two phases sustain the same request rate, minimizing total GPUs subject to the FTL and TTL constraints. Every point on the study's disaggregated frontiers is the output of one rate-matching step.²

Why use it

Prefill-heavy traffic is where the split pays. Across four traffic patterns on DeepSeek-R1, the benefits of disaggregation are most pronounced for prefill-heavy workloads, where any co-located mapping that protects decoding speed sacrifices prefill throughput. Conversely, piggybacked chunking is most promising on decode-heavy traffic.³
Larger models benefit more. Comparing Llama 8B, 70B, and 405B, the advantage of disaggregation grows with model size: larger models span more GPUs, opening a richer space of distinct prefill and decode mappings to exploit.¹
Decode pools chase tight TTL more aggressively. Freed from prefill, decode instances shift to smaller batches and wider tensor parallelism as TTL tightens (Llama-3.1-70B scales TP from 2x to 64x; DeepSeek-R1 keeps expert parallelism inside the NVLink domain while attention moves from data parallel to tensor parallel), which is why the medium-latency regime favors disaggregation.¹
Prefill has its own optimal shape. Chunked pipeline parallelism (CPP) splits long contexts into chunks and pipelines them across GPUs, holding FTL within SLA on long inputs without resorting to very wide tensor parallelism (shown for DeepSeek-R1 at 256K ISL on 64 GPUs with EP x PP = 64).¹
Bigger NVLink domains widen the win. Larger NVLink domains consistently improve disaggregated serving by allowing wider expert and tensor parallelism during generation.¹

When to use it (and when not)

Disaggregate when the P50 traffic mix is prefill-heavy (long prompts, short generations: retrieval, summarization, agent context assembly), the model is large (above roughly 10B parameters), and the SLA sits in the medium-interactivity band where the studied frontiers separate most.
Stay co-located for small models or generation-heavy traffic; the study explicitly lists both as scenarios where disaggregation offers limited benefit.¹
Mind the attention architecture for the baseline. Piggybacked chunking on DeepSeek-R1 pays an MLA-specific overhead: the down and up projections of multi-latent attention are recomputed per prefill chunk (mitigable by caching up-projected KV across chunks). Context chunking effectiveness is highly sensitive to MLA vs GQA, and is most useful under relaxed latency and generation-heavy traffic.¹
Size on P50, not on fear of tails. The study validates that simulating the closest power-of-two of P50 ISL/OSL reproduces the Pareto frontier of full dynamic traffic; a deployment can plan pool ratios from two percentile numbers.⁵
Do not lock the ratio. The optimal context-to-generation GPU ratio varies with model and target latency; a ratio of 3.5 is performant at the most relaxed latency target but degrades as latency tightens, while 0.5 favors tight latency and suffers under relaxed targets. Small deployments feel this hardest, because limited GPUs restrict the reachable ratios.⁴
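The P50 sizing rule above can be sketched in a few lines. This is a minimal illustration, not the study's tooling: `nearest_pow2` and `planning_profile` are hypothetical helper names, the trace values are made up, and "closest" is assumed to mean closest by absolute token distance (the study may round differently).

```python
import math
import statistics


def nearest_pow2(x: float) -> int:
    """Closest power of two to x by absolute distance (ties round down).

    Assumption: linear distance; a log-scale notion of "closest" would
    move the tie point from 1.5 * lo to sqrt(2) * lo.
    """
    if x < 1:
        raise ValueError("sequence lengths must be at least 1 token")
    lo = 2 ** math.floor(math.log2(x))
    hi = 2 * lo
    return lo if (x - lo) <= (hi - x) else hi


def planning_profile(isl_trace, osl_trace):
    """(ISL, OSL) planning pair: P50 of each trace rounded to a power of two."""
    return (nearest_pow2(statistics.median(isl_trace)),
            nearest_pow2(statistics.median(osl_trace)))


# Made-up trace samples, for illustration only.
isl_trace = [900, 3500, 4100, 5200, 16000]
osl_trace = [120, 300, 480, 700, 2100]
print(planning_profile(isl_trace, osl_trace))   # (4096, 512)
```

The long-tail requests (16000 input tokens, 2100 output tokens) do not move the planning pair, which is the point of sizing on the median rather than on the tail.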

Architecture

The study's pipeline for producing one disaggregated design point: first fix the cheapest prefill mapping that satisfies the FTL cutoff (design points with FTL above 10 seconds are excluded), then for each candidate decode mapping run a rate-matching algorithm with an integer solver that balances the two phases' request throughput while minimizing total GPU count under the TTL constraint. Partitioning strategies searched include tensor parallelism, expert parallelism, pipeline parallelism, chunked pipeline parallelism, and TEP (tensor-parallel attention with expert-parallel FFNs), across a wide batch-size range.²

The Appendix B arithmetic is simple enough to internalize. Prefill instance throughput in requests/s/GPU is B_prefill / (FTL x G_prefill). Decode instance throughput in tokens/s/GPU is B_decode / (TTL x G_decode), which becomes request throughput after dividing by OSL - 1. The ratio of the two request throughputs, alpha, is rounded to a rational within a 3% tolerance and sets the instance counts (expand the fraction so both pools serve the same request rate). Overall system throughput divides decode throughput by 1 + alpha to account for all GPUs deployed.² That normalization only holds if alpha counts prefill GPUs per decode GPU, that is, decode request throughput per GPU over prefill request throughput per GPU. The simulation assumes the KV cache produced at each layer is transferred immediately, overlapping with the compute of subsequent layers, which is why the bandwidth analysis under "Running it in production" below is the binding check.
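A worked version of that arithmetic, as a sketch under stated assumptions rather than the paper's solver: `rate_match` is a hypothetical function, the input numbers are invented, alpha is taken as prefill GPUs per decode GPU, the rational is the smallest-denominator fraction inside the tolerance, and system throughput is normalized with the rounded ratio actually deployed. The study's integer solver may make any of these choices differently.

```python
from fractions import Fraction


def rate_match(b_prefill, ftl_s, g_prefill, b_decode, ttl_s, g_decode, osl,
               tol=0.03, max_den=64):
    """Instance counts that balance prefill and decode request throughput."""
    prefill_rps_gpu = b_prefill / (ftl_s * g_prefill)      # requests/s/GPU
    decode_tps_gpu = b_decode / (ttl_s * g_decode)         # tokens/s/GPU
    decode_rps_gpu = decode_tps_gpu / (osl - 1)            # requests/s/GPU
    # Assumption: alpha = prefill GPUs needed per decode GPU.
    alpha = decode_rps_gpu / prefill_rps_gpu
    ratio = None
    for den in range(1, max_den + 1):                      # smallest denominator first
        cand = Fraction(round(alpha * den), den)
        if abs(float(cand) - alpha) <= tol * alpha:
            ratio = cand
            break
    if ratio is None:
        raise ValueError("no rational within tolerance; raise max_den")
    # Convert the GPU ratio into an instance ratio (instances differ in GPU count).
    inst = ratio * Fraction(g_decode, g_prefill)
    return {
        "alpha": alpha,
        "gpu_ratio": ratio,
        "n_prefill": inst.numerator,
        "n_decode": inst.denominator,
        "system_tps_per_gpu": decode_tps_gpu / (1 + float(ratio)),
    }


# Invented inputs: 4-GPU prefill instances (batch 4, FTL 2 s) and
# 8-GPU decode instances (batch 64, TTL 20 ms), OSL 513.
r = rate_match(b_prefill=4, ftl_s=2.0, g_prefill=4,
               b_decode=64, ttl_s=0.02, g_decode=8, osl=513)
print(r["gpu_ratio"], r["n_prefill"], r["n_decode"])       # 8/5 16 5
```

With these inputs alpha is about 1.5625; 3/2 misses the 3% tolerance and 8/5 is the first fraction inside it, giving 16 prefill instances (64 GPUs) to 5 decode instances (40 GPUs).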

How to use it: sizing the two pools

The balance condition is worth validating in miniature before trusting any solver. The executed model below sizes pools from per-instance throughputs and a traffic profile, then simulates deterministic token backlogs. It asserts the four behaviors that matter operationally: a rate-matched configuration keeps both backlogs bounded, undersizing prefill diverges, oversizing decode buys nothing, and a shifted traffic mix flips which pool saturates under the old sizing:

# rate_matching.py - validated: first-order rate-matching arithmetic for sizing
# disaggregated prefill and decode pools. Deterministic fluid model (no RNG), a
# teaching model of the balance condition, not a rerun of the paper's simulator.
import math

import numpy as np


def required_instances(rate_rps: float, tokens_per_req: float, inst_tps: float) -> int:
    """Instances needed so pool token capacity covers the offered token rate."""
    assert rate_rps > 0 and tokens_per_req > 0 and inst_tps > 0
    return math.ceil(rate_rps * tokens_per_req / inst_tps)


def simulate(rate_rps: float, isl: int, osl: int, n_prefill: int, n_decode: int,
             p_tps: float, d_tps: float, steps: int = 2000) -> np.ndarray:
    """Token backlogs of both pools over 1 s steps; column 0 prefill, 1 decode."""
    q_p, q_d = 0.0, 0.0
    hist = np.zeros((steps, 2))
    for t in range(steps):
        q_p += rate_rps * isl                       # prompt tokens arriving
        served = min(q_p, n_prefill * p_tps)        # prefill pool capacity
        q_p -= served
        q_d += (served / isl) * osl                 # finished prefills feed decode
        q_d -= min(q_d, n_decode * d_tps)           # decode pool capacity
        hist[t] = (q_p, q_d)
    return hist


P_TPS, D_TPS = 20_000.0, 2_000.0                    # per-instance tokens/s
RATE, ISL, OSL = 8.0, 4096, 512                     # req/s, prefill-heavy mix

n_p = required_instances(RATE, ISL, P_TPS)          # ceil(8*4096/20000) = 2
n_d = required_instances(RATE, OSL, D_TPS)          # ceil(8*512/2000)   = 3
assert (n_p, n_d) == (2, 3)
print(f"matched pools: {n_p} prefill : {n_d} decode "
      f"(offered {RATE * ISL:.0f} prefill tok/s vs {n_p * P_TPS:.0f} capacity, "
      f"{RATE * OSL:.0f} decode tok/s vs {n_d * D_TPS:.0f} capacity)")

# 1) Rate-matched: both backlogs stay bounded (they drain within every step).
h = simulate(RATE, ISL, OSL, n_p, n_d, P_TPS, D_TPS)
assert h[:, 0].max() == 0.0 and h[:, 1].max() == 0.0

# 2) Undersized prefill: backlog grows monotonically, without bound.
h_under = simulate(RATE, ISL, OSL, 1, n_d, P_TPS, D_TPS)
growth = np.diff(h_under[:, 0])
assert np.all(growth > 0) and h_under[-1, 0] > 1e6
print(f"undersized prefill: backlog after 2000 s = {h_under[-1, 0]:.0f} tokens, "
      f"+{growth[0]:.0f}/s every step")

# 3) Oversized decode: doubling decode instances changes nothing downstream.
h_over = simulate(RATE, ISL, OSL, n_p, 2 * n_d, P_TPS, D_TPS)
assert np.allclose(h_over, h)                        # same (zero) steady state

# 4) Swapping the mix (ISL 512 / OSL 4096) flips which pool saturates under the
# tight config from case 2: prefill recovers, decode now grows without bound.
h_flip = simulate(RATE, OSL, ISL, 1, n_d, P_TPS, D_TPS)
assert h_flip[:, 0].max() == 0.0                     # prefill no longer saturated
assert np.all(np.diff(h_flip[:, 1]) > 0)             # decode backlog diverges
print(f"mix swap under old sizing: decode backlog after 2000 s = "
      f"{h_flip[-1, 1]:.0f} tokens")

# Adversarial: sizing with a zero rate must fail fast, not return 0 instances.
try:
    required_instances(0.0, ISL, P_TPS)
    raise SystemExit("zero-rate sizing must be rejected")
except AssertionError:
    pass
print("all rate-matching assertions passed")

Output of the run: a matched 2 prefill : 3 decode split (32768 prefill tok/s offered against 40000 capacity, 4096 decode tok/s against 6000), an undersized prefill pool accumulating 25536000 backlog tokens over 2000 s (+12768 tokens per 1 s step), a decode backlog of 53536000 tokens after the mix swap, and the final line "all rate-matching assertions passed". Case 4 is the study's Figure 10 lesson in miniature: a ratio tuned for one mix saturates a different pool when the mix moves.

How to develop with it: reading the frontier

Treat the Pareto frontier as the product interface between capacity planning and SLOs. Moving right (higher tokens/s/user) means tighter TTL: configurations shift toward smaller batches and wider tensor parallelism, throughput per GPU falls, and the disaggregation advantage concentrates in the medium-latency band. When comparing serving modes, compare whole frontiers at the same SLO point; a co-located and a disaggregated deployment can each look better if measured at different interactivity targets. Build the baseline honestly: the study's co-located curves are the superposition of piggybacked and non-piggybacked configurations, since neither dominates everywhere.¹ For workload characterization, extract P50 ISL and OSL from production traces, round to powers of two, and plan against those; the study shows this approximation tracks the dynamic-traffic frontier closely.⁵
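Building and querying a frontier from measured or simulated design points can be sketched as follows. The helper names and the sample points are hypothetical; the lookup is a conservative step function (best throughput among configurations that meet or exceed the interactivity target) and does not interpolate between design points.

```python
def pareto_frontier(points):
    """Non-dominated (interactivity, throughput) points, sorted by interactivity.

    A point survives if no other point has both higher-or-equal interactivity
    and higher-or-equal throughput.
    """
    frontier, best_thr = [], float("-inf")
    # Walk from highest interactivity down; keep points that raise throughput.
    for x, y in sorted(points, key=lambda p: (-p[0], -p[1])):
        if y > best_thr:
            frontier.append((x, y))
            best_thr = y
    return frontier[::-1]


def throughput_at(frontier, min_interactivity):
    """Best throughput among frontier points meeting the interactivity target."""
    ok = [y for x, y in frontier if x >= min_interactivity]
    return max(ok) if ok else None


# Invented (tokens/s/user, tokens/s/GPU) points for two co-located variants.
piggybacked = [(10, 100), (20, 80), (40, 30)]
non_piggybacked = [(10, 90), (20, 85), (40, 20), (60, 10)]

# The co-located baseline is the frontier of the union of both variants.
baseline = pareto_frontier(piggybacked + non_piggybacked)
print(baseline)                      # [(10, 100), (20, 85), (40, 30), (60, 10)]
print(throughput_at(baseline, 30))   # 30
```

Comparing a disaggregated frontier against this baseline then means calling `throughput_at` on both with the same target, rather than comparing each mode at its own best point.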

How to maintain it: dynamic rate matching

The ratio is a live control variable, not an install-time constant. Concretely: re-derive the ctx:gen ratio whenever the SLA tier, the model, or the P50 traffic mix changes; alert on sustained queue growth in either pool (the executed model above shows divergence is monotone and fast); and prefer platforms that support elastic reallocation of instances between pools, which is what the study means by dynamic rate matching and elastic scaling being critical to Pareto-optimal performance.⁴ In small deployments, accept that the reachable ratios are coarse (integer instances) and bias toward the pool whose SLA is contractual. The orchestration layers that implement this reallocation (Dynamo, llm-d) are covered in disaggregated inference.
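One way to express the "sustained queue growth" alert is a small sliding-window check per pool. This is a sketch with a hypothetical class name and an arbitrary window length; a production rule would more likely live in the monitoring system and tolerate noise differently.

```python
from collections import deque


class BacklogGrowthAlert:
    """Fires when queue depth has risen on `window` consecutive samples."""

    def __init__(self, window: int = 5):
        if window < 1:
            raise ValueError("window must be at least 1")
        self.window = window
        self.samples = deque(maxlen=window + 1)   # window increases need window+1 samples

    def observe(self, depth: float) -> bool:
        self.samples.append(depth)
        if len(self.samples) <= self.window:
            return False                          # not enough history yet
        s = list(self.samples)
        return all(later > earlier for earlier, later in zip(s, s[1:]))


# A backlog growing by a fixed amount per sample trips the alert once the
# window fills; a bounded, oscillating queue never does.
growing = BacklogGrowthAlert(window=5)
print([growing.observe(12768 * t) for t in range(8)])

bounded = BacklogGrowthAlert(window=5)
print(any(bounded.observe(d) for d in [0, 5, 3, 6, 2, 7, 1, 8, 0]))
```

Running one detector per pool keeps the signal aligned with the remedy: an alert on exactly one pool points at the ratio, while alerts on both point at total capacity.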

Running it in production: KV-transfer bandwidth

Disaggregation adds a per-request KV-cache transfer from prefill to decode GPUs. The study bounds the required per-GPU bandwidth analytically: egress from a prefill GPU must move N_layers x BS_prefill x ISL x d_head x N_kv_heads x bytes_per_element within the FTL window (normalized by the prefill GPUs that uniquely shard the KV), and decode ingress must land the same tensor within TTL x OSL per GPU.⁶ Three scaling consequences follow, all favorable: egress requirements fall as ISL grows (FTL grows superlinearly with ISL due to quadratic attention while KV grows linearly); on the decode side ISL cancels (KV size and TTL both scale linearly with it) and ingress is inversely proportional to OSL; and tighter TTL adds decode GPUs, which lowers per-GPU ingress. One bookkeeping trap: parallelism schemes that replicate KV rather than shard it (tensor parallelism wider than the KV-head count duplicates the cache by TP / N_kv_heads) must be excluded from the normalization. The study concludes existing provisioned datacenter bandwidth is sufficient for KV transfer under layer-wise overlap, so in practice the transfer plumbing (KV cache transfer with NIXL) is an engineering problem, not a capacity wall.⁶ Note the study reports larger models with efficient attention (MLA in DeepSeek-R1) can need less egress bandwidth than smaller models with heavier per-token KV.
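The bounds above translate directly into a few functions. This is a sketch that follows the expressions exactly as quoted on this page; the function names and the example dimensions are illustrative assumptions, and applying the same replication correction to the decode side as to the prefill side is also an assumption.

```python
import math


def kv_bytes(n_layers, batch, isl, d_head, n_kv_heads, bytes_per_elem):
    """KV payload per the page's expression (taken as stated, no extra factors)."""
    return n_layers * batch * isl * d_head * n_kv_heads * bytes_per_elem


def unique_shards(n_gpus, tp, n_kv_heads):
    """GPUs that hold distinct KV; TP wider than the KV-head count duplicates it."""
    duplication = max(1.0, tp / n_kv_heads)
    return n_gpus / duplication


def egress_bw(kv, ftl_s, n_gpus_prefill, tp, n_kv_heads):
    """Bytes/s each prefill GPU must send to finish within the FTL window."""
    return kv / (ftl_s * unique_shards(n_gpus_prefill, tp, n_kv_heads))


def ingress_bw(kv, ttl_s, osl, n_gpus_decode, tp, n_kv_heads):
    """Bytes/s each decode GPU must receive within TTL x OSL."""
    return kv / (ttl_s * osl * unique_shards(n_gpus_decode, tp, n_kv_heads))


# Illustrative dimensions only: 80 layers, 8 KV heads of size 128, 1 byte/element.
kv = kv_bytes(n_layers=80, batch=1, isl=8192, d_head=128, n_kv_heads=8, bytes_per_elem=1)
print(egress_bw(kv, ftl_s=1.0, n_gpus_prefill=4, tp=4, n_kv_heads=8))

# Doubling OSL halves the per-GPU ingress requirement.
short = ingress_bw(kv, ttl_s=0.02, osl=1024, n_gpus_decode=8, tp=8, n_kv_heads=8)
long_ = ingress_bw(kv, ttl_s=0.02, osl=2048, n_gpus_decode=8, tp=8, n_kv_heads=8)
assert math.isclose(short, 2 * long_)
```

The replication trap shows up in `unique_shards`: 16 GPUs at TP 16 over 8 KV heads count as only 8 unique shards, so the per-GPU requirement is the same as for 8 GPUs at TP 8.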

Failure modes

Mismatched pool ratio. One pool saturates while the other idles; the saturated pool's backlog grows monotonically (validated above) and either FTL or TTL blows through SLA. Symptom: queue depth diverging in one pool only. Fix: re-run rate matching, not general scale-out.
Fixed ratio under a moving SLA. The study's fixed-ratio curves (3.5 vs 0.5) each collapse outside their home latency regime; a deployment that cannot re-balance instances between pools loses most of the Pareto area it paid for.⁴
Decode-heavy traffic negating gains. When OSL rivals or exceeds ISL, co-located piggybacking already balances the phases; disaggregation adds transfer and orchestration cost for little frontier gain.³
KV-transfer bandwidth as a hidden bottleneck. The favorable bounds assume layer-by-layer transfer overlapped with compute; a stack that transfers the whole KV after prefill completes concentrates the same bytes into a stall window on the critical path. Verify overlap before trusting the paper's sufficiency conclusion.⁶
KV replication miscounted. Sizing bandwidth or memory as if TP ranks shard the KV when TP exceeds the KV-head count undercounts traffic by the duplication factor.
Comparing modes at different SLO points. Disaggregated-vs-colocated benchmarks that pick each mode's favorite interactivity target manufacture a winner; hold TTL and FTL fixed and compare throughput per GPU there.

References

Mitra et al. (NVIDIA), Beyond the Buzz: A Pragmatic Take on Inference Disaggregation (arXiv 2506.05508): https://arxiv.org/abs/2506.05508
NVIDIA Dynamo (disaggregated serving orchestration): https://github.com/ai-dynamo/dynamo
vLLM, Disaggregated Prefilling (experimental): https://docs.vllm.ai/en/stable/features/disagg_prefill.html
Zhong et al., DistServe: Disaggregating Prefill and Decoding for Goodput-optimized LLM Serving (OSDI 2024): https://www.usenix.org/conference/osdi24/presentation/zhong-yinmin
Patel et al., Splitwise: Efficient Generative LLM Inference Using Phase Splitting (arXiv 2311.18677): https://arxiv.org/abs/2311.18677
Qin et al., Mooncake: A KVCache-centric Architecture for Serving LLM Chatbot (FAST 2025): https://www.usenix.org/conference/fast25/presentation/qin
Agrawal et al., SARATHI: Efficient LLM Inference by Piggybacking Decodes with Chunked Prefills (arXiv 2308.16369): https://arxiv.org/abs/2308.16369

¹ arXiv 2506.05508: first systematic study of disaggregated inference at scale; hundreds of thousands of simulated design points (proprietary datacenter-scale GPU simulator, Blackwell, FP4); disaggregation most effective for prefill-heavy traffic and models above ~10B parameters; Llama 8B/70B/405B size sweep; DeepSeek-R1 (ISL 16k/OSL 2k) prefers EP in the NVLink domain with attention DP at high throughput shifting to TP under tight TTL; Llama-3.1-70B TP 2x to 64x; CPP for 256K-ISL prefill on 64 GPUs (EP x PP = 64); MLA chunking overhead in piggybacked co-location; larger NVLink domains consistently help; results presented normalized.
² Study Section 3.2 and Appendix B: fix the best prefill mapping under the FTL cutoff (prefill throughput B/(FTL x G) requests/s/GPU), then for each decode mapping compute decode throughput B/(TTL x G) tokens/s/GPU, convert to requests via OSL - 1, take the ratio alpha (rounded to a rational, tolerance 0.03), expand to instance counts, and report overall throughput decode_throughput / (1 + alpha); integer solver minimizes total GPUs; FTL > 10 s design points excluded; KV transfer assumed layer-wise and overlapped.
³ Study Section 4.2: four traffic patterns on DeepSeek-R1; disaggregation gains concentrate in prefill-heavy mixes; piggybacking is most promising on decode-heavy traffic.
⁴ Study Section 4.3: the optimal ctx-to-gen GPU ratio varies across models and target latencies; a fixed ratio of 3.5 performs at the most relaxed latency target and degrades as latency tightens, while 0.5 favors tight latency and suffers under relaxed targets; limited GPU counts restrict the reachable ratios in small deployments.
⁵ Study Appendix C: simulating the closest power-of-two of P50 ISL and OSL reproduces the Pareto frontier of the full dynamic traffic distribution closely.
⁶ Study Section 5.1: egress (N_layers x BS_prefill x ISL x d_head x N_kv_heads x bytes) / (FTL x NumGPU_prefill); ingress (N_layers x BS_decode x ISL x d_head x N_kv_heads x bytes) / (TTL x OSL x NumGPU_decode); KV duplicated (not sharded) when TP exceeds KV heads, factor TP / N_kv_heads; egress requirement decreases with ISL, ingress inversely proportional to OSL and unaffected by ISL; provisioned datacenter bandwidth deemed sufficient under layer-wise overlap.