KV cache management: PagedAttention and prefix caching¶
Scope: managing the KV cache that dominates decode memory, covering PagedAttention block tables to kill fragmentation, prefix caching to reuse shared prefixes across requests, KV-memory sizing per concurrent sequence, and quantized/compressed KV. These are the levers that set max concurrency on a decode pool.
Code blocks below come in two kinds. The
vllm serve/ SGLang launch snippets are reference templates on real vendor CLIs (those engines are not installed here); pin versions and validate before production. The numpy blocks are runnable, self-checking validations of the core math the page teaches (KV sizing, the paged block table, prefix-block hashing with LRU eviction); each runs on a stockpython3with numpy and asserts its result, including adversarial cases (corruption detection, boundary values, equivalence to a slow reference).
What it is¶
During decode every transformer layer attends over the keys and values of all prior tokens. Those tensors (the KV cache) are stored, not recomputed, so KV memory grows linearly with 2 (K and V) × num_layers × sequence_length × num_kv_heads × head_dim × dtype_bytes per sequence ("AI Systems Performance Engineering", ch. 18).1 Decode is memory-bandwidth-bound and the KV cache, not the weights, is what runs you out of HBM as concurrency and context length climb.
Two mechanisms govern it:
- PagedAttention (vLLM) stores the KV cache in fixed-size blocks (pages) tracked by a per-sequence block table, exactly like OS virtual memory. Logical token positions map to physical blocks that need not be contiguous, so the allocator never has to find one large contiguous slab. This eliminates the internal and external fragmentation that a naive "reserve max_context per sequence" allocator suffers, and lets blocks be shared across sequences.2
- Prefix caching reuses the KV of a shared prefix (system prompt, few-shot preamble, prior conversation turns, an attached document) across requests instead of recomputing prefill for it. The book frames this as trading memory for compute: reusing KV from HBM or DRAM is far cheaper than re-running the O(N²) prefill attention.1
The single load-bearing quantity is the KV footprint per token, because it (times context length) sets how many sequences a fixed KV pool can hold. Everything else on this page (paging, prefix reuse, FP8, GQA/MLA) exists to shrink that footprint or stop it being wasted.
Why use it¶
KV footprint is the gating resource for concurrency. The book's worked example: a 70B model, 80 layers, hidden num_heads × head_dim = 4096, FP16. That is 2 × 4096 = 8192 floats per layer per token, 655,360 floats × 2 bytes ≈ 1.31 MB per token; a 250k-token context is ~328 GB just for KV ("AI Systems Performance Engineering", ch. 18).1
Consequences:
- Fragmentation wastes HBM. A non-paged allocator pre-reserving
max_model_lenper slot strands memory in sequences that finish early or never reach max length. PagedAttention's blocks recover that, raising the number of sequences you can batch, which directly raises throughput.2 - Prefix cache hits convert prefill into a near-free copy. For agent traffic, RAG, and multi-turn chat (all dominated by a large repeated prefix), hits cut TTFT and prefill GPU-seconds substantially.
- KV memory is the max-concurrency knob. Reachable concurrency ≈
(HBM − weights − activations − overhead) / KV_bytes_per_sequence. ShrinkingKV_bytes_per_sequence(FP8 KV, GQA/MLA, fewer cached layers) is the cheapest way to batch more sequences and lift goodput, with no architectural change.1
The concurrency arithmetic above is the core algorithm the page teaches. The block below is exactly that math (KV bytes per token, then per sequence, then how many sequences a pool holds). It reproduces the page's quoted figures, proves FP8 halves the FP16 footprint and so exactly doubles concurrency, cross-checks against a slow element-by-element reference, and asserts two adversarial cases: a zero head count (a mis-parsed config) must raise rather than report a zero-byte cache, and the floor division must not be off-by-one at the exact-fit boundary.
# Runnable on system python3 (numpy). Core algorithm: KV bytes per token -> per sequence
# -> reachable concurrency, the arithmetic that sets how many sequences a KV pool can hold.
import numpy as np
def kv_bytes_per_token(layers: int, kv_heads: int, head_dim: int, dtype_bytes: int) -> int:
"""2 (K and V) * layers * kv_heads * head_dim * dtype_bytes."""
assert min(layers, kv_heads, head_dim, dtype_bytes) > 0, "all dims must be positive"
return 2 * layers * kv_heads * head_dim * dtype_bytes
def max_concurrent(kv_pool_bytes: int, bytes_per_token: int, ctx_len: int) -> int:
"""How many full-context sequences fit in the KV pool (floor division)."""
assert bytes_per_token > 0 and ctx_len > 0
return kv_pool_bytes // (bytes_per_token * ctx_len)
# Llama-3.1-70B: 80 layers, GQA -> 8 KV heads, head_dim 128.
fp16 = kv_bytes_per_token(80, 8, 128, 2)
fp8 = kv_bytes_per_token(80, 8, 128, 1)
# 1. The exact figures the page quotes.
assert fp16 == 327_680, fp16 # ~0.31 MiB / token
assert fp8 == 163_840, fp8 # ~0.16 MiB / token
assert abs(fp16 / 1024**2 - 0.3125) < 1e-9
assert max_concurrent(40 * 1024**3, fp8, 8192) == 32
# 2. Halving the dtype halves the footprint and (at fixed pool/ctx) doubles concurrency:
# the reason FP8 KV lifts batching. Assert the exact 2x, not just ">".
assert fp8 * 2 == fp16
assert max_concurrent(40 * 1024**3, fp8, 8192) == 2 * max_concurrent(40 * 1024**3, fp16, 8192)
# 3. GQA vs MHA: 8 KV heads instead of 64 query heads is an exact 8x KV reduction.
mha = kv_bytes_per_token(80, 64, 128, 2)
assert mha == fp16 * 8
# 4. Slow reference: compute the same total by explicitly summing per-token, per-layer,
# per-head, K-and-V element bytes for a whole sequence. Must equal bytes_per_token*ctx.
def slow_seq_bytes(layers, kv_heads, head_dim, dtype_bytes, ctx_len):
total = 0
for _tok in range(ctx_len):
for _layer in range(layers):
for _kv in range(2): # K and V
total += kv_heads * head_dim * dtype_bytes
return total
ctx = 37
assert slow_seq_bytes(80, 8, 128, 1, ctx) == fp8 * ctx
# 5. Adversarial: a zero head count (e.g. a mis-parsed config) must raise, not silently
# report a zero-byte KV cache that would make max_concurrent divide-by-zero downstream.
raised = False
try:
kv_bytes_per_token(80, 0, 128, 2)
except AssertionError:
raised = True
assert raised, "zero kv_heads must raise, not report a 0-byte cache"
# 6. Adversarial boundary: a context one token larger than what fits must drop concurrency
# by (at least) one; the floor division is not off-by-one at the exact-fit boundary.
pool = fp8 * 8192 * 32 # exactly 32 sequences of 8k
assert max_concurrent(pool, fp8, 8192) == 32
assert max_concurrent(pool, fp8, 8192 + 1) == 31 # one extra token per seq spills a slot
print("sizing OK:", f"fp16={fp16}B fp8={fp8}B", "conc(40GiB,fp8,8k)=%d" % max_concurrent(40*1024**3, fp8, 8192))
When to use it (and when not)¶
Paged KV management is on by default in modern engines (vLLM, SGLang, TensorRT-LLM); you don't opt out, you tune it. Decisions worth making:
- Enable prefix caching when prefixes repeat: shared system prompts, few-shot templates, multi-turn sessions, document Q&A, agent loops. vLLM V1 turns automatic prefix caching on by default; on older paths use
--enable-prefix-caching.3 - Skip or de-prioritize prefix caching when prompts are near-unique (every request a distinct long document with no shared head). Cache management overhead and occupied blocks then buy little; spend that HBM on batch size instead.
- Quantize KV to FP8 when you are KV-memory-bound and the accuracy budget allows. The book notes FP8 KV is "widely adopted in engines such as vLLM to reduce footprint and increase batching opportunities."1
- Do not compress or evict prefix KV on layers that retain full attention over the whole context (or have retrieval hooks) without per-model eval. The book flags eviction and compression as safe only under sliding-window or otherwise restricted-attention patterns.1
- Tier to CPU/NVMe (offload) only when contexts exceed HBM; the active keys and values for latency-critical decode should stay in HBM. Cross-node KV pooling and prefill-to-decode transfer are covered in disaggregated inference and KV-cache transfer with NIXL.
Architecture¶
A sequence's logical token positions do not live in one contiguous region of HBM. The block table maps each 16-token span to a physical KV page that the allocator hands out from a free list, so pages for one sequence can be scattered anywhere in the pool. That indirection is what removes fragmentation and enables cross-sequence sharing.
Block table → physical KV pages
flowchart LR
subgraph LOG ["Logical sequence"]
T0["tokens 0-15"]
T1["tokens 16-31"]
T2["tokens 32-47"]
end
BT["Block table"]
subgraph PHYS ["Physical KV blocks (HBM)"]
PA["block 7"]
PB["block 3"]
PC["block 9"]
end
T0 --> BT
T1 --> BT
T2 --> BT
BT --> PA
BT --> PB
BT --> PC
The block below is that block table made concrete. It models a free-list allocator with no contiguity requirement, hands it a deliberately fragmented free list so the pages it returns are non-adjacent, then gathers logical token order back out of the scattered physical pages and asserts the reconstruction is exact. It then proves the fragmentation win numerically (a paged pool fits 12 short sequences where a contiguous reserve-max allocator fits only 2), and asserts three adversarial guards: a double-free (which would alias one physical block into two sequences) must raise, an allocation past the pool (the real preemption trigger) must raise, and internal fragmentation is at most one partial page at a block boundary.
# Runnable on system python3 (numpy). Core idea: a block table maps logical token
# positions to NON-contiguous physical KV pages (PagedAttention), so an allocator that
# never needs a contiguous slab wastes no memory to fragmentation and fits more sequences
# than a "reserve max_model_len per slot" allocator.
import numpy as np
BLOCK = 16 # tokens per KV page; vLLM default
class BlockAllocator:
"""Free-list of physical block ids. No contiguity requirement."""
def __init__(self, n_blocks: int):
self.free = list(range(n_blocks)) # every block starts free
self.n_blocks = n_blocks
def alloc(self, n: int) -> list[int]:
assert n <= len(self.free), "out of blocks" # caller must handle exhaustion
got, self.free = self.free[:n], self.free[n:]
return got
def freeing(self, blocks: list[int]) -> None:
for b in blocks:
assert b not in self.free, f"double-free of block {b}" # corruption guard
self.free.append(b)
def n_blocks_for(seq_len: int) -> int:
return (seq_len + BLOCK - 1) // BLOCK # ceil-div, last page partly filled
def gather(kv_store: np.ndarray, block_table: list[int], seq_len: int) -> np.ndarray:
"""Reconstruct logical [0..seq_len) KV from possibly non-contiguous physical pages."""
out = np.empty(seq_len, dtype=kv_store.dtype)
for logical in range(seq_len):
page = block_table[logical // BLOCK]
out[logical] = kv_store[page * BLOCK + (logical % BLOCK)]
return out
# Physical KV store: value at slot p = p, so we can check the gather reordered correctly.
N_BLOCKS = 64
kv_store = np.arange(N_BLOCKS * BLOCK)
# 1. Non-contiguous blocks still serve a sequence in the right logical order. Hand the
# allocator a fragmented free list (evens first) so pages are guaranteed non-adjacent.
alloc = BlockAllocator(N_BLOCKS)
alloc.free = list(range(0, N_BLOCKS, 2)) + list(range(1, N_BLOCKS, 2))
seq_len = 47 # 3 pages (16+16+15)
bt = alloc.alloc(n_blocks_for(seq_len))
assert bt == [0, 2, 4], bt # non-contiguous physical pages
logical = gather(kv_store, bt, seq_len)
# Logical position i must map to physical (page*16 + offset); check the first of each page.
assert logical[0] == 0 and logical[16] == 2 * BLOCK and logical[32] == 4 * BLOCK
assert np.array_equal(logical, np.array([bt[i // BLOCK] * BLOCK + i % BLOCK for i in range(seq_len)]))
# 2. Fragmentation: paged packs more sequences than reserve-max. 64 blocks = 1024 token
# slots. Workload: many short sequences in a serving mix where max_model_len is large.
max_model_len = 512 # what a naive allocator reserves PER slot
seq_lens = [40, 55, 33, 61, 48, 39, 44, 52, 30, 58, 41, 47] # real (short) lengths
paged_blocks = sum(n_blocks_for(s) for s in seq_lens)
assert paged_blocks <= N_BLOCKS, paged_blocks # all 12 fit under paging
reserve_slots = N_BLOCKS * BLOCK // max_model_len # contiguous reserve-max capacity
assert reserve_slots == 2 # only 2 sequences fit if you reserve max
assert len(seq_lens) > reserve_slots # paging fits 12 vs 2: the fragmentation win
# 3. Free-then-reuse: blocks from a finished sequence return to the pool and are handed to
# the next one. No leak: total accounted blocks is invariant.
before = len(alloc.free) + len(bt)
alloc.freeing(bt)
assert len(alloc.free) == before # every block returned
bt2 = alloc.alloc(3)
assert len(bt2) == 3
# 4. Adversarial: double-free is corruption and must be caught, not silently duplicate a
# block into two sequences (which would alias unrelated KV).
raised = False
try:
alloc.freeing([bt2[0], bt2[0]])
except AssertionError:
raised = True
assert raised, "double-free must raise"
# 5. Adversarial: allocating past the pool must raise (this is exactly the preemption
# trigger in a real engine), never hand back a short/garbage list.
raised = False
try:
BlockAllocator(4).alloc(5)
except AssertionError:
raised = True
assert raised, "over-allocation past the block pool must raise"
# 6. Boundary: an exact multiple of BLOCK uses no wasted partial page; one token more
# needs exactly one extra page (internal fragmentation is at most one block).
assert n_blocks_for(BLOCK) == 1 and n_blocks_for(BLOCK + 1) == 2
assert n_blocks_for(0) == 0
print("paged OK:", f"paged fits {len(seq_lens)} seqs vs reserve-max {reserve_slots};",
"block_table", bt, "gathered logical order verified")
Prefix caching layers a content-addressed index on top of these blocks. vLLM V1 keeps KV blocks in a global hash table keyed by content: each block hash is computed from the parent block's hash + the block's token IDs + extra keys (LoRA ID, multimodal input hash, cache salt), so a hit requires an exact matching prefix and any divergence breaks the chain from that block onward. Eviction is LRU over the free-block queue.3 The block below is that chained hashing and LRU, made concrete: it asserts an exact 3-block prefix hit, that flipping one token inside the second block drops the reuse to one block (the parent-hash chain propagates the change), that a differing LoRA id / cache salt isolates the cache (so KV from another adapter is never reused), and that the LRU pool evicts the least-recently-used block rather than an arbitrary one. Adversarial cases prove the hash actually depends on token content and order, not just length.
# Runnable on system python3 (numpy + stdlib). Core mechanism: vLLM V1 automatic prefix
# caching. Each block's hash chains from parent_hash + block token ids + extra keys, so a
# cache HIT requires an exact matching prefix; divergence at any block breaks the chain and
# every downstream hash. Eviction is LRU over the free-block queue.
import hashlib
from collections import OrderedDict
import numpy as np
BLOCK = 16
def block_hash(parent: bytes, token_ids: tuple[int, ...], extra_keys: tuple = ()) -> bytes:
"""Chained hash: parent hash + this block's tokens + extra keys (LoRA id, salt, ...)."""
h = hashlib.sha256()
h.update(parent)
h.update(np.asarray(token_ids, dtype=np.int64).tobytes())
for k in extra_keys:
h.update(str(k).encode())
return h.digest()
ROOT = b"\x00" * 32 # hash of the empty prefix
def hash_chain(tokens, extra_keys=()) -> list[bytes]:
"""Per-block chained hashes for a full token sequence (only whole blocks are cached)."""
hashes, parent = [], ROOT
n_full = len(tokens) // BLOCK
for i in range(n_full):
blk = tuple(tokens[i * BLOCK:(i + 1) * BLOCK])
parent = block_hash(parent, blk, extra_keys)
hashes.append(parent)
return hashes
def prefix_hit_blocks(cache: set, tokens, extra_keys=()) -> int:
"""Number of leading cached blocks (the reused prefix); stops at first miss."""
hit = 0
for hsh in hash_chain(tokens, extra_keys):
if hsh not in cache:
break # first divergence ends the reusable prefix
hit += 1
return hit
rng = np.random.default_rng(11)
shared = list(rng.integers(0, 50_000, size=3 * BLOCK)) # 3-block shared system prompt
req_a = shared + list(rng.integers(0, 50_000, size=2 * BLOCK))
cache = set(hash_chain(req_a)) # request A populates the cache
# 1. Exact-prefix hit: request B that shares the 3-block prefix reuses exactly 3 blocks,
# then diverges (its own tokens are not cached).
req_b = shared + list(rng.integers(0, 50_000, size=BLOCK))
assert prefix_hit_blocks(cache, req_b) == 3
# 2. Divergence breaks the chain: flip ONE token inside the 2nd block; block 0 stays a hit
# but block 1 and everything after miss (the parent-hash chain propagates the change).
req_c = list(req_b)
req_c[BLOCK + 3] = (req_c[BLOCK + 3] + 1) % 50_000
assert prefix_hit_blocks(cache, req_c) == 1, prefix_hit_blocks(cache, req_c)
# 3. A first token change misses immediately (block 0 differs -> 0 reused).
req_d = list(req_b); req_d[0] = (req_d[0] + 1) % 50_000
assert prefix_hit_blocks(cache, req_d) == 0
# 4. Extra keys are part of the hash: the SAME tokens under a different LoRA id / cache salt
# must NOT hit the base-model cache (else KV from another adapter would be reused).
assert prefix_hit_blocks(cache, req_b, extra_keys=("lora=7",)) == 0
# ...and identical tokens WITH the same extra key are a full hit in a matching cache.
cache_lora = set(hash_chain(req_a, extra_keys=("lora=7",)))
assert prefix_hit_blocks(cache_lora, req_b, extra_keys=("lora=7",)) == 3
# 5. Adversarial: two different token blocks must not collide to the same hash (sanity that
# the hash actually depends on token content, not just length).
assert block_hash(ROOT, tuple([1] * BLOCK)) != block_hash(ROOT, tuple([2] * BLOCK))
# ...and order matters: a permuted block is a different prefix.
b = list(range(BLOCK))
assert block_hash(ROOT, tuple(b)) != block_hash(ROOT, tuple(reversed(b)))
class LRUBlockPool:
"""Free/cached blocks evicted least-recently-used first, like vLLM's free-block queue."""
def __init__(self, capacity: int):
self.capacity = capacity
self.slots: OrderedDict[bytes, int] = OrderedDict() # hash -> physical block id
self.next_id = 0
def touch(self, hsh: bytes) -> None:
self.slots.move_to_end(hsh) # mark most-recently-used
def insert(self, hsh: bytes) -> bytes | None:
evicted = None
if hsh in self.slots:
self.slots.move_to_end(hsh)
return None
if len(self.slots) >= self.capacity:
evicted, _ = self.slots.popitem(last=False) # pop LRU (front)
self.slots[hsh] = self.next_id
self.next_id += 1
return evicted
# 6. LRU eviction order: fill to capacity, touch block 0 to make block 1 the LRU, insert a
# new block, and assert block 1 (not 0) is the victim.
pool = LRUBlockPool(capacity=3)
h = [block_hash(ROOT, (i,)) for i in range(4)]
for i in range(3):
assert pool.insert(h[i]) is None # fills without eviction
pool.touch(h[0]) # 0 now MRU, so 1 is LRU
victim = pool.insert(h[3])
assert victim == h[1], "LRU must evict the least-recently-used block (1), not 0"
assert h[0] in pool.slots and h[3] in pool.slots and h[1] not in pool.slots
print("prefix OK:", "hit(B)=3 hit(diverge@blk1)=1 hit(diverge@blk0)=0;",
"extra-key isolates cache; LRU evicted the least-recently-used block")
How to use it¶
PagedAttention is intrinsic to modern engines; you tune it rather than enable it. The relevant levers are block size, the HBM fraction given to KV, optional CPU swap space, and KV dtype. The snippet below is a reference template on the real vLLM CLI (vLLM is not installed here); pin the version and validate the flags against your build before rollout.
# Reference template (needs vllm installed + GPUs). Not executed here.
# Llama-3.1-70B, TP=4, FP8 KV cache, prefix caching, paged allocator.
vllm serve meta-llama/Llama-3.1-70B-Instruct \
--tensor-parallel-size 4 \
--block-size 16 \ # tokens per KV page; vLLM default is 16
--gpu-memory-utilization 0.92 \ # fraction of HBM for weights + KV
--kv-cache-dtype fp8 \ # fp8 == fp8_e4m3; fp8_e5m2 also valid (CUDA 11.8+)
--enable-prefix-caching \ # default-on in V1; explicit for clarity
--swap-space 8 # GiB CPU swap for preempted blocks
--block-sizeis the page granularity in tokens; the vLLM default is 16, and that 16-token block is also the unit of prefix-cache hashing on the legacy path (engine args).5--kv-cache-dtypeacceptsauto,fp8(=fp8_e4m3),fp8_e4m3,fp8_e5m2. By default KV scales are1.0; for real accuracy use dataset-calibrated scales viallm-compressor(per-tensor or per-attention-head; per-head is Flash-Attention-backend only) (quantized KV cache).4--gpu-memory-utilizationsets the HBM fraction; after weights and activations the remainder becomes the KV block pool. Newer vLLM also exposeskv_cache_memory_bytesfor direct sizing (it overrides the utilization fraction when set) (cache config).5
The concurrency this buys is exactly the sizing block under Why use it: compute kv_bytes_per_token, multiply by target context, and divide the KV pool to get the sequences you can batch.
How to integrate it¶
Prefix caching and its cross-engine variants are the integration surface: what gets reused, how a hit is keyed, and where the reused KV lives.
Automatic prefix caching (vLLM V1). The engine keeps KV blocks in a global hash table keyed by content. Each block hash is computed from the parent block's hash + the block's token IDs + extra keys (LoRA ID, multimodal input hash, cache salt), so a hit requires an exact matching prefix. Eviction is LRU over the free-block queue. The default hash is sha256 (not reproducible across Python/vLLM versions); --prefix-caching-hash-algo selects sha256_cbor, xxhash, or xxhash_cbor (prefix caching design).3 This matches the book's description of "a global hash table of KV pages, each unique block of context has a hash," and it is the mechanism validated bit-for-bit in the prefix-hash block under Architecture.
SGLang: RadixAttention. SGLang stores prefix KV in a radix tree (not a flat block hash table), so it reuses overlapping prefixes, not only whole-block exact matches, and is LRU-evicted. It is on by default; disable for debugging with --disable-radix-cache. HiCache extends the radix tree across tiers (L1 GPU / L2 host RAM / L3 distributed store) for memory-constrained or long-context serving (RadixAttention, HiCache).6 The book's "global prompt tree" framing corresponds to RadixAttention.
# Reference template (needs sglang installed + GPUs). Not executed here.
python3 -m sglang.launch_server --model-path meta-llama/Llama-3.1-70B-Instruct \
--tp 4 # RadixAttention is on by default
Shrinking the per-token footprint at the architecture level. GQA already shrinks kv_heads (for Llama-3.1-70B, 8 KV heads vs 64 query heads, an 8× KV reduction versus MHA, asserted in the sizing block), and MLA (DeepSeek) compresses KV into a latent further still. Both multiply with FP8. See FlashAttention / MLA and tensor cores & mixed precision. The 1.31 MB/token figure in the book is for MHA at hidden 4096; GQA/MLA models are far smaller per token.1 Compose prefix caching carefully with routing: a hit only happens on the replica that built the prefix, so integrate KV-aware (locality) routing (inference serving) to send a request to the node holding its prefix.
How to run it in production¶
- Watch the KV-cache hit rate and block-pool utilization. vLLM exports
vllm:prefix_cache_hits_totalandvllm:prefix_cache_queries_totalplus GPU-cache-usage gauges; a low hit rate on prefix-heavy traffic means routing is not sticky. Route a request to the node that built its prefix (locality-aware / KV-aware routing) to maximize hits, per the book and inference serving.1 - Handle preemption and swapping under pressure. When the block pool is exhausted vLLM preempts sequences (recompute or swap to
--swap-space); sustained preemption means you over-subscribed concurrency. Lower it, shrink context, or enable FP8 KV.3 - Do not double-count HBM.
--gpu-memory-utilizationbudgets weights and KV; raising it past headroom causes OOM under burst, not more cache. - The KV reservation blocks GPU sharing, not just memory.
--gpu-memory-utilizationclaims its HBM fraction at startup and holds it for the life of the server, so another pod cannot co-locate on that GPU even when the server's compute sits idle between requests. On a shared GPU it is the memory reservation, not SM contention, that prevents packing more work: size it to real KV demand, or lower it deliberately to leave room for a co-tenant. See MPS and dynamic and fractional GPU sharing. - Confirm kernel and memory behaviour with Nsight Systems/Compute as the book recommends for decode kernels, especially after changing block size or KV dtype.1
How to maintain it¶
- Validate FP8 KV accuracy against an eval set before production; uncalibrated unit scales (the default
1.0) degrade quality. Use dataset-calibrated scales viallm-compressorand re-run the eval whenever the model or calibration set changes.4 - Pin the prefix-cache hash algorithm where reproducibility matters. The
sha256default is not stable across Python/vLLM versions; pin--prefix-caching-hash-algo(for examplexxhash_cbor) so cache behaviour does not shift silently on an engine upgrade.3 - Re-derive the sizing after any config change. Layer count,
kv_heads(GQA/MLA), head dim, and KV dtype all feedkv_bytes_per_token; a model, quantization, or attention-variant swap invalidates the old concurrency budget. Recompute it (the sizing block) and re-check--gpu-memory-utilizationagainst real KV demand. - Keep eviction and compression off full-attention layers unless a per-model eval clears it; the book restricts safe eviction to sliding-window or otherwise restricted-attention patterns. Re-validate this whenever the model architecture changes.1
How to scale it¶
- Shrink
KV_bytes_per_sequencefirst. It is the denominator of reachable concurrency, so FP8 KV, GQA/MLA, and caching fewer layers each raise batch size with no architectural change and no extra hardware. FP8 exactly halves the FP16 footprint (asserted in the sizing block), and GQA/MLA multiply with it.1 - Maximise prefix reuse with sticky routing. Reuse only happens on the replica that built the prefix, so route requests sharing a prefix (a system prompt, a document, a conversation) to the same replica to keep its blocks hot, the same locality argument as adapter-aware routing in multi-LoRA serving and KV-aware routing in inference serving.
- Tier the cache when contexts exceed HBM. SGLang HiCache spills the radix tree to host RAM (L2) and a distributed store (L3); keep the active decode working set in HBM (L1) and tier only the cold prefix, since offload adds latency to a bandwidth-bound phase.6
- Pool and transfer KV across nodes for disaggregated prefill/decode rather than growing a single node's pool. Cross-node KV pooling and the prefill-to-decode handoff are covered in disaggregated inference and KV-cache transfer with NIXL.
Failure modes¶
- HBM exhausted by KV, not weights. As concurrency and context climb, the KV cache (not the weights) runs the GPU out of memory; a 250k-token MHA-70B context is ~328 GB of KV alone. Size concurrency from
kv_bytes_per_token, and shrink it with FP8 / GQA-MLA before adding GPUs. See runbook: inference KV-cache OOM.1 - Fragmentation from a non-paged allocator. Reserving
max_model_lenper slot strands HBM in sequences that finish early or never reach max length (the paged block proves paging fits 12 short sequences where reserve-max fits 2). Use a paged engine; do not hand-roll a contiguous allocator.2 - Sustained preemption / swap thrash. When the block pool is exhausted the engine preempts (recompute or swap to
--swap-space); continuous preemption means concurrency is over-subscribed. Lower concurrency, shrink context, or enable FP8 KV, rather than pushing--swap-spacehigher.3 - OOM from over-set
--gpu-memory-utilization. The fraction budgets weights and KV; setting it past real headroom does not buy more cache, it OOMs under burst. Size it to measured KV demand. - Uncalibrated FP8 KV quality regression. Default unit scales (
1.0) can degrade output quality; ship dataset-calibrated scales and gate FP8 KV on an eval, not just on the memory saving.4 - Prefix cache that never hits. Near-unique prompts (every request a distinct document with no shared head) occupy blocks and add management overhead while buying no reuse. De-prioritize prefix caching there and spend the HBM on batch size instead.
- Hit rate collapses under non-sticky routing. A prefix built on one replica is not reused on another; load-balancing that ignores prefix locality tanks the hit rate. Add KV-aware / locality routing (inference serving).1
- Eviction or compression on a full-attention layer. Evicting/compressing prefix KV on layers that attend over the whole context corrupts long-context quality; the book flags eviction as safe only under sliding-window or restricted attention. Gate it on a per-model eval.1
- Offload latency on the decode path. Tiering active keys and values to CPU/NVMe adds latency to a memory-bandwidth-bound phase; keep the active decode working set in HBM and tier only cold prefix (disaggregated inference).
- Non-reproducible prefix hashing across upgrades. The
sha256default is not stable across Python/vLLM versions, so cache behaviour can shift silently on an engine bump; pin--prefix-caching-hash-algo.3
References¶
- Chris Fregly, AI Systems Performance Engineering (O'Reilly), Ch. 18 "Advanced Prefill-Decode and KV Cache Tuning": per-token KV sizing, paged KV cache, prefix reuse via global hash table / prompt tree, LRU eviction, FP8 KV adoption, sliding-window-only eviction caveat.
- vLLM, Automatic Prefix Caching (design): block hash = parent hash + token IDs + extra keys; LRU eviction; sha256 default;
--prefix-caching-hash-algo. - vLLM, Quantized KV Cache:
--kv-cache-dtypevalues, FP8 calibration viallm-compressor. - vLLM, Engine arguments and cache config:
--block-size(default 16),--gpu-memory-utilization,kv_cache_memory_bytes,--swap-space,--enable-prefix-caching. - Kwon et al., Efficient Memory Management for LLM Serving with PagedAttention (SOSP 2023): the PagedAttention block-table design.
- SGLang, RadixAttention / HiCache docs: radix-tree prefix cache,
--disable-radix-cache, hierarchical GPU/host/remote tiers.
Related: disaggregated inference · KV-cache transfer (NIXL) · continuous batching internals · rollout redundancy & cascade attention · FlashAttention / MLA · inference serving · multi-LoRA serving · runbook: inference KV-cache OOM · Glossary
-
Fregly, AI Systems Performance Engineering (O'Reilly), Ch. 18 "Advanced Prefill-Decode and KV Cache Tuning." Per-token KV
2 × layers × seq_len × kv_heads × head_dim × dtype_bytes; 70B/80-layer/hidden-4096/FP16 worked example (655,360 floats/token, ~1.31 MB/token, ~328 GB at 250k tokens); decode is memory-bandwidth-bound and KV, not weights, exhausts HBM; prefix reuse trades memory for O(N²) prefill compute via a global hash table / prompt tree with LRU eviction; FP8 KV "widely adopted in engines such as vLLM"; eviction/compression safe only under sliding-window or restricted attention; Nsight Systems/Compute for decode kernels; KV-aware routing. ↩↩↩↩↩↩↩↩↩↩↩↩↩↩ -
Kwon et al., "Efficient Memory Management for LLM Serving with PagedAttention," SOSP 2023. Fixed-size KV blocks tracked by a per-sequence block table (OS-virtual-memory analogy); non-contiguous physical pages remove internal/external fragmentation and enable cross-sequence block sharing. ↩↩↩
-
vLLM, "Automatic Prefix Caching (design)" https://docs.vllm.ai/en/stable/design/prefix_caching/. Global content-keyed KV hash table; block hash = parent hash + block token IDs + extra keys (LoRA ID, multimodal hash, cache salt); LRU eviction over the free-block queue; default
sha256(not reproducible across Python/vLLM versions);--prefix-caching-hash-algoin {sha256,sha256_cbor,xxhash,xxhash_cbor}; on-by-default in V1,--enable-prefix-cachingon older paths; preemption recompute-or-swap under pool exhaustion. ↩↩↩↩↩↩↩ -
vLLM, "Quantized KV Cache" https://docs.vllm.ai/en/latest/features/quantization/quantized_kvcache/.
--kv-cache-dtypein {auto,fp8(=fp8_e4m3),fp8_e4m3,fp8_e5m2}; default KV scales1.0; dataset-calibrated per-tensor or per-attention-head scales viallm-compressor(per-head is Flash-Attention-backend only). ↩↩↩ -
vLLM, "Engine arguments" https://docs.vllm.ai/en/stable/serving/engine_args.html and "Cache config" https://docs.vllm.ai/en/stable/api/vllm/config/cache/.
--block-sizedefault 16 (also the legacy prefix-cache hashing unit);--gpu-memory-utilizationsets the HBM fraction for weights + KV, remainder becomes the KV block pool;kv_cache_memory_bytessizes the pool directly and overrides the utilization fraction;--swap-spaceGiB CPU swap for preempted blocks. ↩↩ -
SGLang, "RadixAttention / HiCache" https://docs.sglang.ai/. Prefix KV in a radix tree (reuses overlapping prefixes, not only whole-block exact matches), LRU-evicted, on by default,
--disable-radix-cache; HiCache tiers the tree across L1 GPU / L2 host RAM / L3 distributed store for memory-constrained or long-context serving. ↩↩