
Disaggregated inference (prefill/decode)

Scope: splitting LLM inference into separate prefill and decode pools that scale independently, the KV-cache transfer that connects them, and how to run it with vLLM, NIXL, and Dynamo. The scale-out architecture for serving (inference serving, serving open-weight models).

Disaggregation is the highest-impact serving change at scale and the most operationally complex. As of mid-2026 it is production-proven (Meta, Hugging Face) but the vLLM-native path is still maturing, so orchestrate with Dynamo or llm-d for anything beyond a manual two-instance test.

Paradigm: the two phases have opposite appetites

LLM inference has two phases with different bottlenecks:

Prefill: process the prompt, build the KV cache. Compute-bound, bursty, short.
Decode: generate tokens one at a time. Memory-bandwidth-bound, long-running, latency-sensitive.

Co-locating them forces a single GPU pool to compromise on both: a long prefill stalls in-flight decodes (spiking inter-token latency) and head-of-line blocks the prompts queued behind it (spiking TTFT). Disaggregation runs prefill and decode on separate GPU pools, each sized and tuned for its phase, and transfers the KV cache from prefill to decode. Result: independent scaling, smoother TTFT, and higher goodput per GPU (NVIDIA reports up to ~7x MoE throughput on GB200 NVL72 with Dynamo disaggregation plus wide expert parallelism, inference serving).
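
The "opposite appetites" can be made concrete with a rough roofline estimate. The sketch below is a simplification, not a performance model: it assumes a dense transformer doing about 2 FLOPs per parameter per token, counts only weight reads (no attention or KV-cache traffic), and uses a hypothetical accelerator with 1000 TFLOP/s of compute and 4 TB/s of memory bandwidth.

```python
def arithmetic_intensity(tokens_per_pass: int, bytes_per_param: float = 2.0) -> float:
    """FLOPs performed per byte of weights read in one forward pass.

    Simplified model: ~2 FLOPs per parameter per token, and every parameter
    is read once per pass, so the parameter count cancels out.
    """
    return 2.0 * tokens_per_pass / bytes_per_param


# Hypothetical accelerator (assumed figures, not a specific product).
PEAK_FLOPS = 1000e12          # 1000 TFLOP/s
PEAK_BYTES_PER_S = 4e12       # 4 TB/s
RIDGE = PEAK_FLOPS / PEAK_BYTES_PER_S  # FLOPs/byte needed to saturate compute


def bound(intensity: float) -> str:
    """Classify a pass as compute-bound or bandwidth-bound against the ridge point."""
    return "compute-bound" if intensity >= RIDGE else "bandwidth-bound"


prefill = arithmetic_intensity(tokens_per_pass=4096)  # one 4096-token prompt
decode = arithmetic_intensity(tokens_per_pass=32)     # 32 sequences, 1 new token each

print(f"ridge   {RIDGE:.0f} FLOPs/byte")
print(f"prefill {prefill:.0f} FLOPs/byte -> {bound(prefill)}")
print(f"decode  {decode:.0f} FLOPs/byte -> {bound(decode)}")
```

Under these assumptions prefill sits far above the ridge point and decode far below it, which is why one pool configuration rarely suits both.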

Architecture: disaggregated prefill/decode

flowchart LR
  C["Client"] --> R["KV-aware router"]
  R --> PP["Prefill pool (compute-tuned)"]
  PP -->|"KV cache via NIXL / RDMA"| DP["Decode pool (memory-tuned)"]
  DP -->|"tokens"| C

The load-bearing piece: KV-cache transfer

The KV cache for a long prompt is large; moving it from prefill to decode must be near-free or the win evaporates. NIXL (NVIDIA Inference Xfer Library) is the transport: it moves KV tensors GPU→GPU at near wire speed over RDMA (InfiniBand or RoCE, via UCX), falls back to TCP when RDMA is unavailable, and can also target storage backends such as NVMe-oF or S3. vLLM selects the connector via --kv-transfer-config (connectors include NixlConnector, LMCache, and shared-memory for co-located instances).
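
To see how large "large" is, the KV footprint can be estimated from the model shape. This is a minimal sketch assuming a standard multi-head or grouped-query attention layout with an uncompressed cache; the shape used (80 layers, 8 KV heads, head dimension 128, 2-byte elements) is an illustrative assumption, and models with compressed KV layouts will come out smaller.

```python
def kv_cache_bytes(tokens: int, layers: int, kv_heads: int, head_dim: int,
                   bytes_per_elem: int = 2) -> int:
    """Bytes of KV cache for `tokens` tokens of one sequence.

    The leading 2 accounts for storing both a key and a value vector
    per layer, per KV head, per token.
    """
    return 2 * layers * kv_heads * head_dim * bytes_per_elem * tokens


# Illustrative model shape (an assumption, not taken from the recipes here).
SHAPE = dict(layers=80, kv_heads=8, head_dim=128, bytes_per_elem=2)

per_token = kv_cache_bytes(1, **SHAPE)
long_prompt = kv_cache_bytes(32_768, **SHAPE)

print(f"{per_token / 1024:.0f} KiB per token")
print(f"{long_prompt / 2**30:.1f} GiB for a 32,768-token prompt")
```

With this shape a single 32K-token prompt produces about 10 GiB of KV state that has to reach the decode pool before the first decode step can run.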

Networking is the enabler (networking fabric, performance tuning): KV transfer wants GPUDirect RDMA (GDR) over the IB/RoCE fabric (or NVLink when pools share a rack). Without GDR, KV transfer crawls over TCP and disaggregation loses to co-location. Prove [GDRDMA] is engaged.
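
A back-of-the-envelope transfer time shows why the fallback path hurts. The sketch below assumes a hypothetical 10 GiB KV cache for one long prompt, and the link speeds and efficiency factors are illustrative guesses rather than measurements; substitute figures from your own fabric.

```python
def transfer_seconds(num_bytes: int, link_gbit_per_s: float,
                     efficiency: float = 0.8) -> float:
    """Time to move `num_bytes` over a link, given its nominal rate in Gbit/s
    and the fraction of that rate the transfer actually achieves."""
    usable_bytes_per_s = link_gbit_per_s * 1e9 / 8 * efficiency
    return num_bytes / usable_bytes_per_s


KV_BYTES = 10 * 2**30  # hypothetical KV cache for one long prompt

# Assumed links: a 400 Gbit/s RDMA NIC vs. a 25 Gbit/s TCP path.
rdma_s = transfer_seconds(KV_BYTES, link_gbit_per_s=400, efficiency=0.8)
tcp_s = transfer_seconds(KV_BYTES, link_gbit_per_s=25, efficiency=0.5)

print(f"RDMA: {rdma_s:.2f} s")
print(f"TCP:  {tcp_s:.2f} s ({tcp_s / rdma_s:.1f}x slower)")
```

Under these assumptions the RDMA path adds roughly a quarter of a second to TTFT while the TCP path adds several seconds, which is more than a co-located prefill of the same prompt would typically cost.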

Recipe A: vLLM, manual two-instance (proof of concept)

# Prefill instance (producer): writes KV into the transfer layer
vllm serve deepseek-ai/DeepSeek-V3 --tensor-parallel-size 8 --trust-remote-code \
  --kv-transfer-config '{"kv_connector":"NixlConnector","kv_role":"kv_producer"}' --port 8100

# Decode instance (consumer): reads KV, runs generation
vllm serve deepseek-ai/DeepSeek-V3 --tensor-parallel-size 8 --trust-remote-code \
  --kv-transfer-config '{"kv_connector":"NixlConnector","kv_role":"kv_consumer"}' --port 8200

A proxy/router sends the prompt to a prefill worker, then hands the request (with the KV handle) to a decode worker. The exact connector JSON evolves, so track the vLLM disaggregation docs.
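
The two-step hand-off can be sketched as a small function. This is an illustration of the flow only: the `kv_transfer_params` and `do_remote_decode` field names are assumptions modeled on vLLM's example proxy and may not match the version you run, and `post` is a hypothetical callable standing in for an HTTP client so the sketch stays self-contained.

```python
import copy
from typing import Callable

# post(url, json_body) -> json_response; inject a real HTTP client in practice.
Post = Callable[[str, dict], dict]


def route_disaggregated(request: dict, post: Post,
                        prefill_url: str, decode_url: str) -> dict:
    """Run prefill on one worker, then hand the request to a decode worker."""
    # Step 1: prefill only. Cap generation at one token so the prefill worker
    # builds the KV cache and returns quickly.
    prefill_req = copy.deepcopy(request)
    prefill_req["max_tokens"] = 1
    prefill_req["stream"] = False
    prefill_req["kv_transfer_params"] = {"do_remote_decode": True}  # assumed field
    prefill_resp = post(prefill_url, prefill_req)

    # Step 2: decode. Forward the original request plus whatever KV handle
    # the prefill worker returned, so the decode worker can pull the cache.
    decode_req = copy.deepcopy(request)
    handle = prefill_resp.get("kv_transfer_params")
    if handle is not None:
        decode_req["kv_transfer_params"] = handle
    return post(decode_url, decode_req)


if __name__ == "__main__":
    # Stub transport for demonstration; no network access is needed.
    def stub_post(url: str, body: dict) -> dict:
        if url.endswith(":8100/v1/completions"):
            return {"kv_transfer_params": {"handle": "opaque-kv-handle"}}
        return {"choices": [{"text": "..."}], "received": body}

    reply = route_disaggregated(
        {"prompt": "Hello", "max_tokens": 64}, stub_post,
        "http://localhost:8100/v1/completions",
        "http://localhost:8200/v1/completions",
    )
    print(reply["received"]["kv_transfer_params"])
```

A production router also needs worker selection, retries, and streaming, which is the gap the orchestrators in Recipe B fill.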

Recipe B: Dynamo on Kubernetes (production)

Dynamo (inference serving) provides the routing, KV-aware scheduling, and lifecycle that vLLM alone does not. It runs prefill and decode as separate worker pools, routes by KV locality, and uses NIXL for transfer; Grove places the pools on rack-scale NVLink systems. Dynamo is engine-agnostic: the same disaggregation, KV-aware routing, and SLA planning work across vLLM, SGLang, and TensorRT-LLM backends.¹

# Conceptual Dynamo graph (see Dynamo docs for the current CRD/operator schema)
apiVersion: nvidia.com/v1alpha1
kind: DynamoGraphDeployment
metadata: { name: deepseek-disagg, namespace: serving }
spec:
  backend: vllm
  model: deepseek-ai/DeepSeek-V3
  prefill: { replicas: 2, resources: { gpu: 8 } }   # compute-tuned pool
  decode:  { replicas: 6, resources: { gpu: 8 } }    # memory/latency-tuned pool, scaled higher
  kvTransfer: { connector: nixl }                     # RDMA KV move via NIXL
  router: { policy: kv-aware }

Install via the Dynamo operator/Helm; front with a gang scheduler so each pool's workers co-schedule (Kubernetes for GPUs, the Kubernetes platform).
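
Routing by KV locality can be illustrated with a toy scorer. This is not Dynamo's algorithm; it is a minimal sketch assuming each worker advertises the hashes of the prefix blocks it has cached, and that the router trades cached-prefix length against current load. The `Worker` class, block size, and `load_weight` are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Worker:
    name: str
    cached_blocks: set = field(default_factory=set)  # prefix-chained block hashes
    active_requests: int = 0


def prefix_block_hashes(token_ids: list, block_size: int = 16) -> list:
    """Hash each full block chained with the hash of everything before it,
    so equal hashes imply an equal prefix up to that block."""
    hashes, prev = [], 0
    for start in range(0, len(token_ids) - block_size + 1, block_size):
        prev = hash((prev, tuple(token_ids[start:start + block_size])))
        hashes.append(prev)
    return hashes


def cached_prefix_blocks(worker: Worker, hashes: list) -> int:
    """Count leading blocks of the request already cached on the worker."""
    count = 0
    for h in hashes:
        if h not in worker.cached_blocks:
            break
        count += 1
    return count


def pick_worker(workers: list, token_ids: list,
                load_weight: float = 1.0, block_size: int = 16) -> Worker:
    """Prefer workers holding more of the prefix, penalized by current load."""
    hashes = prefix_block_hashes(token_ids, block_size)
    return max(
        workers,
        key=lambda w: cached_prefix_blocks(w, hashes) - load_weight * w.active_requests,
    )


if __name__ == "__main__":
    prompt = list(range(40))                      # 2 full blocks of 16 tokens
    warm = Worker("warm", set(prefix_block_hashes(prompt)), active_requests=1)
    cold = Worker("cold")
    print(pick_worker([warm, cold], prompt).name)                    # cache reuse wins
    print(pick_worker([warm, cold], prompt, load_weight=3.0).name)   # load dominates
```

The `load_weight` knob captures the core trade-off: reusing cached KV saves prefill work, but not if the worker holding it is already saturated.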

When to use it (and when not)

Use when: high request volume, long/variable prompts, strict TTFT SLO, large models already multi-GPU. The prefill/decode ratio can be tuned to the traffic shape.
Skip when: low/bursty traffic or small models, where the KV-transfer overhead and operational complexity outweigh the gain. Start co-located (inference serving), disaggregate when goodput/SLO demands it.
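
Tuning the ratio to the traffic shape starts from simple token-rate arithmetic. The sketch below is a first-cut estimate only: the per-replica throughput figures and the 70% utilization headroom are hypothetical inputs that should come from your own benchmarks, and real sizing must be validated under load.

```python
import math


def pool_sizes(req_per_s: float, mean_prompt_tokens: float, mean_output_tokens: float,
               prefill_tok_per_s: float, decode_tok_per_s: float,
               headroom: float = 0.7) -> tuple:
    """Replicas needed per pool so token demand fits within `headroom`
    of each replica's measured throughput."""
    prefill_demand = req_per_s * mean_prompt_tokens   # prompt tokens/s to process
    decode_demand = req_per_s * mean_output_tokens    # output tokens/s to generate
    prefill = math.ceil(prefill_demand / (prefill_tok_per_s * headroom))
    decode = math.ceil(decode_demand / (decode_tok_per_s * headroom))
    return max(prefill, 1), max(decode, 1)


# Hypothetical traffic and per-replica throughput (assumed, not measured).
prefill_replicas, decode_replicas = pool_sizes(
    req_per_s=20, mean_prompt_tokens=4000, mean_output_tokens=500,
    prefill_tok_per_s=60_000, decode_tok_per_s=2_500,
)
print(f"prefill:decode = {prefill_replicas}:{decode_replicas}")
```

Longer outputs shift the ratio toward decode and longer prompts toward prefill, so the estimate should be re-run whenever the prompt/output length mix changes.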

Don't-miss checklist

Confirm GDR/RDMA KV transfer ([GDRDMA]); without it disaggregation is slower than co-location (performance tuning).
Size the prefill:decode pool ratio to the traffic's prompt/output length mix; revisit under load.
Orchestrate with Dynamo/llm-d in production; the manual two-instance path is a PoC.
Keep prefill and decode pools within fast-interconnect reach (same rail/rack where possible, networking fabric).
Put TTFT/TPOT SLOs and burn-rate alerts on it (the SLO/SLI catalog).
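
For the burn-rate item above, the underlying arithmetic is short. This is a minimal sketch assuming a request-based SLO (for example, "99% of requests have TTFT under the threshold"); the 14.4 paging threshold and the two-window check follow a commonly cited multi-window convention and are assumptions to adjust for your own SLO window.

```python
def burn_rate(bad_requests: int, total_requests: int, slo_target: float) -> float:
    """How fast the error budget is being consumed.

    1.0 means the budget lasts exactly the SLO window; higher burns it faster.
    """
    if total_requests == 0:
        return 0.0
    error_budget = 1.0 - slo_target
    return (bad_requests / total_requests) / error_budget


def should_page(long_window_burn: float, short_window_burn: float,
                threshold: float = 14.4) -> bool:
    """Page only when both the long and short windows exceed the threshold,
    which filters out blips that have already recovered."""
    return long_window_burn >= threshold and short_window_burn >= threshold


# Hypothetical hour of traffic: 500 of 10,000 requests breached the TTFT SLO.
hourly = burn_rate(bad_requests=500, total_requests=10_000, slo_target=0.99)
print(f"burn rate: {hourly:.1f}x")
print("page:", should_page(long_window_burn=hourly, short_window_burn=hourly))
```

Running the same calculation separately for TTFT and TPOT helps attribute a burn to the prefill pool or the decode pool.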

Failure modes

KV transfer on TCP fallback: disaggregation slower than co-location.
Pool ratio mismatched to traffic: prefill or decode pool idle while the other saturates.
Manual two-instance setup taken to production without routing/failover: no resilience.
Disaggregation added for low traffic: pure overhead.

Open questions & validation

Verify the current vLLM --kv-transfer-config connector schema and disaggregation maturity for the target version.
Validate the Dynamo deployment CRD/operator schema against the installed Dynamo release.
A/B disaggregated vs co-located on real traffic: measure goodput and TTFT before committing.
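
For the A/B comparison, goodput can be computed from per-request timings. The sketch below assumes goodput means requests per second per GPU that meet both the TTFT and TPOT SLOs, and that TPOT is the mean time per output token after the first; the `RequestStats` record and the SLO values are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class RequestStats:
    ttft_s: float        # time to first token
    total_s: float       # time from request to last token
    output_tokens: int


def tpot_s(r: RequestStats) -> float:
    """Mean time per output token, excluding the first token."""
    if r.output_tokens <= 1:
        return 0.0
    return (r.total_s - r.ttft_s) / (r.output_tokens - 1)


def goodput(requests: list, window_s: float, gpus: int,
            ttft_slo_s: float, tpot_slo_s: float) -> float:
    """SLO-meeting requests per second per GPU over the measurement window."""
    good = sum(
        1 for r in requests
        if r.ttft_s <= ttft_slo_s and tpot_s(r) <= tpot_slo_s
    )
    return good / window_s / gpus


# Hypothetical sample: one request meets both SLOs, one misses TTFT, one misses TPOT.
sample = [
    RequestStats(ttft_s=0.2, total_s=2.2, output_tokens=101),
    RequestStats(ttft_s=1.5, total_s=3.5, output_tokens=101),
    RequestStats(ttft_s=0.3, total_s=10.3, output_tokens=101),
]
print(goodput(sample, window_s=10.0, gpus=2, ttft_slo_s=0.5, tpot_slo_s=0.05))
```

Normalizing by GPU count is what makes the comparison fair when the disaggregated and co-located arms use different amounts of hardware.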

References

vLLM disaggregated prefilling: https://docs.vllm.ai/en/latest/features/disagg_prefill.html
NVIDIA NIXL (KV transfer): https://github.com/ai-dynamo/nixl
NVIDIA Dynamo (disaggregated serving): https://docs.dynamo.nvidia.com/ · repo: https://github.com/ai-dynamo/dynamo
llm-d (Kubernetes-native distributed inference): https://github.com/llm-d/llm-d
LMCache (KV connector/offload): https://github.com/LMCache/LMCache

Related: Fabric · Inference · Optimization · K8s Platform · OSS Models · SLO/SLI · Glossary

¹ NVIDIA Dynamo supports vLLM, SGLang, and TensorRT-LLM as interchangeable inference backends, each with disaggregated serving and KV-aware routing. Source: NVIDIA Dynamo (https://github.com/ai-dynamo/dynamo) backend feature matrix; Chris Fregly, "AI Systems Performance Engineering" (O'Reilly), Ch. 17: "modern inference systems like vLLM, SGLang, and NVIDIA Dynamo," and NIXL plugins for NVLink, UCX-based fabric, and GPUDirect Storage.