
KV cache transfer with NIXL

Scope: moving the KV cache from prefill workers to decode workers (and across memory/storage tiers) in disaggregated serving, using NVIDIA's Inference Xfer Library (NIXL), its UCX / GPUDirect RDMA / GPUDirect Storage backends, layer-wise overlapped transfer, and how Dynamo wires it across vLLM, SGLang, and TensorRT-LLM.

What it is

In disaggregated prefill/decode serving, the prefill (context) worker computes the prompt's KV cache, and a separate decode (generation) worker consumes it to produce tokens. The KV cache produced by prefill must be moved to whichever GPU runs decode. NIXL is the transfer library that performs that move.

NIXL is an open-source, point-to-point data-movement library with a modular plug-in backend architecture. It abstracts memory and storage behind a single API: callers register memory regions (GPU VRAM, CPU DRAM, or storage) with a NIXL agent as descriptors, then issue transfers. NIXL picks a backend from the memory types involved and the backends common to both agents. Currently supported backends are UCX and NVIDIA Magnum IO GPUDirect Storage (GDS); additional file/block/object backends are in development. Concretely: a DRAM->VRAM transfer may use UCX, while a VRAM->parallel-filesystem transfer may use the GPUDirect Storage path.²³
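
The selection rule described above (memory types involved, intersected with the backends both agents share) can be sketched as follows. This is an illustration only: the capability table, the memory-type labels, and the `pick_backend` helper are hypothetical and do not reflect NIXL's internal logic or API.

```python
from typing import Iterable, Optional

# Hypothetical capability table: which backend can move data between which
# (source, destination) memory types. Illustrative, not NIXL's real table.
BACKEND_CAPS = {
    "UCX": {("DRAM", "DRAM"), ("DRAM", "VRAM"), ("VRAM", "DRAM"), ("VRAM", "VRAM")},
    "GDS": {("VRAM", "FILE"), ("FILE", "VRAM")},
}


def pick_backend(
    src_mem: str,
    dst_mem: str,
    local_backends: Iterable[str],
    remote_backends: Iterable[str],
) -> Optional[str]:
    """Return a backend both agents support that can serve this memory pair."""
    common = set(local_backends) & set(remote_backends)
    for name in sorted(common):  # sorted only to make the choice deterministic
        if (src_mem, dst_mem) in BACKEND_CAPS.get(name, set()):
            return name
    return None  # no shared backend can serve this transfer


print(pick_backend("DRAM", "VRAM", ["UCX", "GDS"], ["UCX"]))
print(pick_backend("VRAM", "FILE", ["UCX", "GDS"], ["UCX", "GDS"]))
```

The point of the sketch is that the caller names memory regions, not transports; a transfer for which no shared backend exists fails at selection time rather than mid-copy.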

The book frames NIXL as the GPU-to-GPU transport for disaggregation: "Both roles will use NIXL and GPUDirect RDMA to transfer the KV cache blocks. NIXL abstracts transport for GPU-to-GPU data movement over NVLink and RDMA NICs. It also provides connectors for GPUDirect Storage so that KV cache pages can be read from (or written to) different storage tiers."¹

Under the hood the transfer is one-sided RDMA: "each decode worker registers a region of its GPU memory so that prefill workers can write directly into it using RDMA." Memory-registration metadata (NIXL descriptors) is exchanged at startup or on first contact, so per-request control messages carry only a small buffer identifier rather than a full address structure.¹
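
A minimal sketch of why the control plane stays small: descriptors are exchanged once per peer, and each request afterwards names only a buffer ID. The `RegionDescriptor`, `DescriptorCache`, and `make_request` names are hypothetical stand-ins, not NIXL types, and the fields are simplified.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RegionDescriptor:
    """Hypothetical registration metadata for one pre-registered GPU buffer."""
    base_addr: int
    length: int
    rkey: int  # remote access key, as used in RDMA memory registration


class DescriptorCache:
    """Prefill-side cache of descriptors, filled once per decode worker."""

    def __init__(self) -> None:
        self._by_worker: dict[str, dict[int, RegionDescriptor]] = {}
        self.handshakes = 0

    def handshake(self, worker_id: str, descriptors: dict[int, RegionDescriptor]) -> None:
        # Happens at startup or on first contact, not per request.
        self._by_worker[worker_id] = dict(descriptors)
        self.handshakes += 1

    def resolve(self, worker_id: str, buffer_id: int) -> RegionDescriptor:
        return self._by_worker[worker_id][buffer_id]


def make_request(worker_id: str, buffer_id: int, num_tokens: int) -> dict:
    # The per-request control message carries an ID, never an address structure.
    return {"worker": worker_id, "buffer_id": buffer_id, "num_tokens": num_tokens}


cache = DescriptorCache()
cache.handshake("decode-0", {7: RegionDescriptor(base_addr=0x1000, length=4096, rkey=42)})
req = make_request("decode-0", buffer_id=7, num_tokens=128)
target = cache.resolve(req["worker"], req["buffer_id"])
print(req, hex(target.base_addr))
```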

Why it matters

Decode is memory-bandwidth-bound and latency-sensitive (time per output token, TPOT); prefill is compute-bound and dominates time to first token (TTFT). Disaggregation pays off only if the cross-phase KV handoff is cheap enough not to reintroduce the latency it was meant to remove. NIXL exists to make that handoff a zero-copy, non-blocking, GPU-to-GPU operation:

Zero-copy, no CPU in the path. Direct remote GPU reads/writes over the chosen transport avoid CPU copies, so "while one GPU is transferring KV data, it can also service other forward-pass requests without waiting for the transfer to complete." This overlap is what keeps decode TPOT flat during a transfer.¹
Transport abstraction. One API spans NVLink/C2C/NVSwitch, InfiniBand, RoCE, and Ethernet, plus GPUDirect Storage to NVMe/PFS tiers. This is what lets Dynamo offload colder KV blocks to DRAM/NVMe instead of recomputing them.²
Lightweight control plane. Pre-exchanged descriptors mean a prefill request carries "just an ID for the target KV buffer." Dynamo uses etcd for worker discovery and leases; workers register memory handles so peers can fetch descriptors on first use.¹

When it is needed (and when not)

Needed:

You run disaggregated prefill/decode with prefill and decode on different GPUs/nodes, and KV must cross a network or NVLink fabric.
You want KV tiering (DRAM/NVMe/object) for prefix reuse beyond GPU HBM. NIXL's GDS/storage backends provide the read/write path.²

Not needed / skip the remote path:

Monolithic (colocated) serving, where prefill and decode share the GPU and KV never leaves local HBM.
Large prefix-cache hits, where most of the prompt's KV already lives on the decode worker. The book is explicit: "A large prefix hit ... means a lot of KV data would have to be transferred to a prefill worker and back, which is pointless. Hence, such requests are kept local."¹
Short prompts, where remote-prefill overhead exceeds the compute saved. The decode-side router (should_offload_prefill) gates remote transfer on effective prompt length and prefill-queue depth, so transfer only happens when it helps. See Disaggregated Inference for the routing policy.
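
The gating logic in the last two items can be sketched as below. Only the name `should_offload_prefill` and its two inputs (effective prompt length, prefill-queue depth) come from the text; the signature, the `OffloadPolicy` class, and both threshold values are assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OffloadPolicy:
    # Both thresholds are hypothetical placeholders to be tuned per deployment.
    min_effective_tokens: int = 512
    max_prefill_queue_depth: int = 8


def should_offload_prefill(
    prompt_tokens: int,
    prefix_hit_tokens: int,
    prefill_queue_depth: int,
    policy: OffloadPolicy = OffloadPolicy(),
) -> bool:
    """Offload only when enough uncached prompt remains and the queue is light."""
    # Effective length = tokens that still need prefill after the prefix hit.
    effective = max(prompt_tokens - prefix_hit_tokens, 0)
    if effective < policy.min_effective_tokens:
        return False  # short prompt or large prefix hit: keep it local
    if prefill_queue_depth > policy.max_prefill_queue_depth:
        return False  # remote prefill is backed up: keep it local
    return True


print(should_offload_prefill(4096, 0, 2))     # long prompt, light queue
print(should_offload_prefill(4096, 4000, 2))  # large prefix hit
```

Folding the prefix hit into the effective length lets one threshold cover both the short-prompt case and the large-prefix-hit case.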

How: implement, integrate, maintain

Transfer shape: coalesce, overlap, layout-transform

Move KV in payloads large enough to amortize RDMA setup. The book's guidance: "coalesce multiple PagedAttention blocks into ~128-token payloads before RDMA (note: vLLM defaults to 16 tokens per block on CUDA)." Use pre-registered peer memory with large pinned windows to minimize re-registration churn, and overlap transfer with compute so decode never stalls.¹
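
A sketch of the coalescing step, assuming 16-token blocks and a 128-token target as quoted above. The `coalesce_blocks` helper is hypothetical, and it assumes that only physically contiguous block IDs can share one payload, which may not match how a given engine lays out its block pool.

```python
def coalesce_blocks(
    block_ids: list[int],
    tokens_per_block: int = 16,
    target_tokens: int = 128,
) -> list[tuple[int, int]]:
    """Group contiguous block IDs into (first_block, block_count) payloads."""
    blocks_per_payload = max(target_tokens // tokens_per_block, 1)
    payloads: list[tuple[int, int]] = []
    current: list[int] = []
    for block_id in block_ids:
        contiguous = bool(current) and block_id == current[-1] + 1
        # Flush when the run breaks or the payload reaches its target size.
        if current and (not contiguous or len(current) == blocks_per_payload):
            payloads.append((current[0], len(current)))
            current = []
        current.append(block_id)
    if current:
        payloads.append((current[0], len(current)))
    return payloads


print(coalesce_blocks(list(range(16))))   # two full 8-block payloads
print(coalesce_blocks([0, 1, 2, 10, 11])) # a gap forces a smaller payload
```

With these defaults each full payload covers 8 blocks, so a 16-block prompt moves in two transfers instead of sixteen.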

When prefill and decode use different tensor-parallel layouts, insert a layout-transform kernel on the receiver side, after the NIXL read and before the KV is used, to realign each block to the decode kernel's expected layout. The book notes this transform "is latency-insignificant compared to network transfer and avoids re-prefill."¹
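
The index plan behind such a transform can be sketched in plain Python. This assumes KV heads are sharded evenly and contiguously across tensor-parallel ranks on both sides, and it ignores head replication when the TP degree exceeds the number of KV heads; `kv_head_sources` is a hypothetical helper, and the real work is done by a GPU kernel.

```python
def kv_head_sources(
    n_kv_heads: int,
    prefill_tp: int,
    decode_tp: int,
    decode_rank: int,
) -> list[tuple[int, int]]:
    """For each KV head owned by decode_rank, return (prefill_rank, local_head)."""
    if n_kv_heads % prefill_tp or n_kv_heads % decode_tp:
        raise ValueError("sketch assumes KV heads divide evenly across ranks")
    heads_per_prefill = n_kv_heads // prefill_tp
    heads_per_decode = n_kv_heads // decode_tp
    first = decode_rank * heads_per_decode
    return [
        (head // heads_per_prefill, head % heads_per_prefill)
        for head in range(first, first + heads_per_decode)
    ]


# Prefill TP=2, decode TP=4, 8 KV heads: decode rank 1 owns global heads 2 and 3,
# which both live on prefill rank 0 at local positions 2 and 3.
print(kv_head_sources(8, prefill_tp=2, decode_tp=4, decode_rank=1))
```

Each tuple tells the receiver which slice of which sender's block to place at the next local head position; the transform kernel applies that mapping per block.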

vLLM: NixlConnector

vLLM's disaggregated prefilling uses NixlConnector for fully asynchronous KV send/receive. Configure it through --kv-transfer-config; the prefill instance is the producer, the decode instance the consumer.⁴

# Prefill (producer)
CUDA_VISIBLE_DEVICES=0 \
VLLM_NIXL_SIDE_CHANNEL_PORT=5600 \
vllm serve Qwen/Qwen3-0.6B \
  --port 8100 \
  --kv-transfer-config '{"kv_connector":"NixlConnector","kv_role":"kv_producer","kv_load_failure_policy":"fail"}'

# Decode (consumer)
CUDA_VISIBLE_DEVICES=1 \
VLLM_NIXL_SIDE_CHANNEL_PORT=5601 \
vllm serve Qwen/Qwen3-0.6B \
  --port 8200 \
  --kv-transfer-config '{"kv_connector":"NixlConnector","kv_role":"kv_consumer","kv_load_failure_policy":"fail"}'

Notes from the vLLM docs: kv_role takes kv_producer or kv_consumer (kv_both is deprecated). VLLM_NIXL_SIDE_CHANNEL_PORT (default 5600) carries the descriptor handshake and must be unique per worker on a host; set VLLM_NIXL_SIDE_CHANNEL_HOST when instances span machines. Transport is tuned through UCX env vars UCX_TLS and UCX_NET_DEVICES (e.g. mlx5_0:1,mlx5_1:1). A non-default fabric is selected via kv_connector_extra_config.backends (e.g. ["LIBFABRIC"]).⁴
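
When several workers are launched from a script, building the JSON with a serializer avoids shell-quoting mistakes and makes the per-worker side-channel port explicit. The sketch below uses only the keys and variables named above; the `nixl_launch` helper itself is hypothetical and simply returns the environment and argument list without starting anything.

```python
import json
from typing import Optional


def nixl_launch(
    role: str,
    http_port: int,
    side_channel_port: int,
    backends: Optional[list[str]] = None,
    model: str = "Qwen/Qwen3-0.6B",
) -> tuple[dict[str, str], list[str]]:
    """Return (extra_env, argv) for one vLLM instance using NixlConnector."""
    if role not in ("kv_producer", "kv_consumer"):
        raise ValueError(f"unsupported kv_role: {role}")
    config = {
        "kv_connector": "NixlConnector",
        "kv_role": role,
        "kv_load_failure_policy": "fail",
    }
    if backends:
        config["kv_connector_extra_config"] = {"backends": list(backends)}
    env = {"VLLM_NIXL_SIDE_CHANNEL_PORT": str(side_channel_port)}
    argv = [
        "vllm", "serve", model,
        "--port", str(http_port),
        "--kv-transfer-config", json.dumps(config, separators=(",", ":")),
    ]
    return env, argv


# Two workers on one host need distinct side-channel ports.
for role, port, side in [("kv_producer", 8100, 5600), ("kv_consumer", 8200, 5601)]:
    env, argv = nixl_launch(role, port, side)
    print(env, " ".join(argv))
```

The returned pair can be handed to `subprocess.Popen(argv, env={**os.environ, **env})` or printed for a launch script.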

TensorRT-LLM / Dynamo: cache_transceiver_config

In Dynamo's TensorRT-LLM backend, the KV transfer backend is set on cache_transceiver_config. Valid backend values are DEFAULT, UCX, NIXL, and MPI; NIXL is the default. Set max_tokens_in_buffer to at least the maximum input sequence length across requests for best performance.⁵

cache_transceiver_config:
  backend: NIXL
  max_tokens_in_buffer: 8192
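
A small pre-flight check can catch the two mistakes this section warns about: an unknown backend value and a buffer smaller than the longest expected input. The sketch assumes the config has already been loaded into a dictionary; `check_cache_transceiver` is a hypothetical helper, not part of TensorRT-LLM or Dynamo.

```python
VALID_BACKENDS = {"DEFAULT", "UCX", "NIXL", "MPI"}


def check_cache_transceiver(config: dict, max_input_len: int) -> list[str]:
    """Return a list of human-readable problems; empty means the config passes."""
    problems = []
    backend = config.get("backend")
    if backend not in VALID_BACKENDS:
        problems.append(f"backend {backend!r} is not one of {sorted(VALID_BACKENDS)}")
    buffer_tokens = config.get("max_tokens_in_buffer")
    if buffer_tokens is None:
        problems.append("max_tokens_in_buffer is not set")
    elif buffer_tokens < max_input_len:
        problems.append(
            f"max_tokens_in_buffer={buffer_tokens} is below the longest "
            f"expected input ({max_input_len} tokens)"
        )
    return problems


config = {"backend": "NIXL", "max_tokens_in_buffer": 8192}
print(check_cache_transceiver(config, max_input_len=8192))   # passes
print(check_cache_transceiver(config, max_input_len=16384))  # buffer too small
```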

The TRT-LLM container also exposes env-var switches: a NIXL-built container sets TRTLLM_USE_NIXL_KVCACHE=1; to fall back to UCX, unset it and export TRTLLM_USE_UCX_KVCACHE=1. When NIXL is active, its underlying transport is chosen via TRTLLM_NIXL_KVCACHE_BACKEND (UCX default, or LIBFABRIC on v0.16.0+).⁶⁵
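
Because the two `TRTLLM_USE_*` switches are meant to be set one at a time, a launcher can clear both before setting the one it wants. The sketch below uses only the variable names given above; the `kv_transport_env` helper is hypothetical, and whether a container honors these variables depends on how it was built.

```python
def kv_transport_env(base_env: dict, use_nixl: bool, nixl_backend: str = "UCX") -> dict:
    """Return a copy of base_env with exactly one KV transport switch set."""
    if nixl_backend not in ("UCX", "LIBFABRIC"):
        raise ValueError(f"unsupported NIXL transport: {nixl_backend}")
    env = dict(base_env)
    # Clear every related switch first so stale values cannot conflict.
    for name in (
        "TRTLLM_USE_NIXL_KVCACHE",
        "TRTLLM_USE_UCX_KVCACHE",
        "TRTLLM_NIXL_KVCACHE_BACKEND",
    ):
        env.pop(name, None)
    if use_nixl:
        env["TRTLLM_USE_NIXL_KVCACHE"] = "1"
        env["TRTLLM_NIXL_KVCACHE_BACKEND"] = nixl_backend
    else:
        env["TRTLLM_USE_UCX_KVCACHE"] = "1"
    return env


inherited = {"TRTLLM_USE_NIXL_KVCACHE": "1", "PATH": "/usr/bin"}
print(kv_transport_env(inherited, use_nixl=False))  # falls back to UCX
```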

SGLang

Dynamo's disaggregation supports SGLang alongside vLLM and TRT-LLM, with NIXL moving KV directly from prefill-engine VRAM to decode-engine VRAM via RDMA; each engine has backend-specific component flags documented per backend. (Exact SGLang flag names are not reproduced here; consult the Dynamo SGLang backend docs.)⁷

Dynamo orchestration

Dynamo registers prefill and decode roles (see the cluster config in Disaggregated Inference), each using NIXL + GPUDirect RDMA. It is decode-centric: requests arrive at the decode worker, which decides via its router whether to offload prefill remotely. The prefill worker reads any cached prefix, computes prefill, and writes KV blocks back into the decode GPU's registered memory; the decode scheduler then enqueues the decode step, overlapping data movement with compute. etcd provides worker discovery/leases so the descriptor exchange and lightweight ID-based control messages work across a fluctuating worker set.¹⁷
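
The overlap described here can be illustrated with a toy asyncio simulation: the decode loop keeps stepping its current batch while a KV transfer for a new request is in flight, and the new request joins the batch only once its KV has landed. Everything in this sketch is hypothetical (function names, chunk counts, request IDs); `asyncio.sleep(0)` merely stands in for a non-blocking RDMA write or a GPU decode step.

```python
import asyncio


async def kv_transfer(request_id: str, chunks: int, ready: asyncio.Queue, log: list) -> None:
    """Simulate a non-blocking KV write arriving in several chunks."""
    for i in range(chunks):
        await asyncio.sleep(0)  # yield control: the transfer does not block decode
        log.append(f"xfer {request_id} chunk {i}")
    ready.put_nowait(request_id)  # signal that the KV is fully in place


async def decode_loop(active: list, ready: asyncio.Queue, steps: int, log: list) -> None:
    """Simulate continuous batching: admit ready requests between steps."""
    for step in range(steps):
        while not ready.empty():
            active.append(ready.get_nowait())
        log.append(f"decode step {step} batch={sorted(active)}")
        await asyncio.sleep(0)  # stands in for one decode iteration


async def main() -> list:
    log: list = []
    ready: asyncio.Queue = asyncio.Queue()
    active = ["req-A"]  # already decoding when req-B's prefill finishes
    await asyncio.gather(
        decode_loop(active, ready, steps=4, log=log),
        kv_transfer("req-B", chunks=2, ready=ready, log=log),
    )
    return log


for line in asyncio.run(main()):
    print(line)
```

In the printed log, decode steps for req-A interleave with req-B's transfer chunks, which is the pattern to look for on a real profiler timeline.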

flowchart LR
  C["Client request"] --> D["Decode worker (ingress + router)"]
  D -->|"offload? (long prompt, queue light)"| Q["Prefill queue"]
  Q --> P["Prefill worker: read prefix cache, compute KV"]
  P -->|"NIXL one-sided RDMA write (~128-token payloads)"| D
  D --> X["Layout transform (if TP layouts differ)"]
  X --> G["Continuous-batched decode -> tokens"]
  P -.->|"GDS backend"| T["KV tiers: DRAM / NVMe / object"]

Maintain / validate

Use the Nsight Systems timeline to validate that transfers are truly zero-copy and overlap with compute. The book's unified profiling command traces CUDA, UCX, and GPUDirect Storage, and adds NIC/IB-switch telemetry:¹

nsys profile --trace=cuda-hw,osrt,nvtx,ucx,gds \
  --trace-fork-before-exec=true \
  --cuda-event-trace=true \
  --cuda-graph-trace=node \
  --cuda-memory-usage=true \
  --sample=cpu \
  --gpu-metrics-device=all \
  --nic-metrics=true \
  --ib-switch-metrics-device=<GUIDs> \
  --storage-metrics --storage-devices=all \
  --gds-metrics=driver \
  -o nsys_reports/prefill_decode \
  <your_launch_here>

Operational notes: keep peer memory pre-registered with large pinned windows to avoid re-registration churn; correlate UCX activity, GPU metrics, and IB-switch counters to confirm KV transfer overlaps decode kernels rather than blocking them. Sizing the KV payload itself follows bytes_per_token = 2 × n_layers × n_kv_heads × head_dim × bytes_per_element (the factor of 2 covers keys and values); GQA/MQA/MLA and FP8/FP4 KV all shrink what NIXL has to move. See KV Cache Management: PagedAttention and Prefix Caching.¹
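
As a worked instance of the sizing formula, the sketch below computes the per-token and per-payload footprint for a hypothetical model shape (32 layers, 8 KV heads, head dimension 128). The shape and the 128-token payload are illustrative inputs, not measurements.

```python
def kv_bytes_per_token(
    n_layers: int,
    n_kv_heads: int,
    head_dim: int,
    bytes_per_element: int,
) -> int:
    """KV cache bytes per token; the leading 2 covers keys and values."""
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_element


# Hypothetical GQA model shape, FP16 KV (2 bytes per element).
fp16_token = kv_bytes_per_token(n_layers=32, n_kv_heads=8, head_dim=128, bytes_per_element=2)
# Same shape with FP8 KV (1 byte per element).
fp8_token = kv_bytes_per_token(n_layers=32, n_kv_heads=8, head_dim=128, bytes_per_element=1)

payload_tokens = 128
print(fp16_token, fp16_token * payload_tokens)  # 131072 bytes/token, 16777216 bytes/payload
print(fp8_token, fp8_token * payload_tokens)    # 65536 bytes/token, 8388608 bytes/payload
```

For this shape a 128-token payload is 16 MiB in FP16 and 8 MiB in FP8, so halving the element width halves what the transfer has to move.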

Caveat: the NIXL flags, env vars, payload sizes, and profiling command above are reproduced from the book and the official vLLM/TensorRT-LLM/Dynamo docs. None were exercised on hardware for this writeup. Where the book and the official docs differ, the official docs win. For example, the book describes NIXL generically as the transfer abstraction for the Dynamo + TRT-LLM path, whereas the TRT-LLM docs specify that NIXL is the default cache_transceiver_config.backend (with UCX as the legacy default for the raw TRTLLM_USE_* env switches).⁵⁶

References

Chris Fregly, AI Systems Performance Engineering (O'Reilly), Chapter 17, "Scaling Disaggregated Prefill and Decode for Inference."¹
NVIDIA Technical Blog, "Enhancing Distributed Inference Performance with the NVIDIA Inference Transfer Library." https://developer.nvidia.com/blog/enhancing-distributed-inference-performance-with-the-nvidia-inference-transfer-library/ ²
NIXL design doc (ai-dynamo/nixl). https://github.com/ai-dynamo/nixl/blob/main/docs/nixl.md ³
vLLM docs, "NixlConnector Usage Guide." https://docs.vllm.ai/en/stable/features/nixl_connector_usage/ ⁴
TensorRT-LLM docs, "Disaggregated Serving" (cache_transceiver_config). https://nvidia.github.io/TensorRT-LLM/features/disagg-serving.html ⁵
NVIDIA Dynamo docs, "KV Cache Transfer in Disaggregated Serving" (TensorRT-LLM backend, TRTLLM_USE_NIXL_KVCACHE / TRTLLM_USE_UCX_KVCACHE). https://docs.nvidia.com/dynamo/latest/backends/trtllm/kv-cache-transfer.html ⁶
NVIDIA Dynamo docs, "Disaggregated Serving." https://docs.dynamo.nvidia.com/dynamo/design-docs/disaggregated-serving ⁷
