How the KV cache speeds up LLM inference on GPUs¶
Scope: the founding optimisation of LLM serving, from first principles to fleet scale. The page derives the quadratic-recompute arithmetic the KV cache eliminates and validates it by counting every multiply-accumulate both ways, shows why the trade converts decode from a compute problem into a memory-bandwidth problem, and then builds the capacity model that turns the speedup into vLLM's production growth ladder: single-GPU knobs, FP8 KV, tensor parallelism sized by KV headroom, data-parallel replicas, and disaggregation. Block-table and prefix-reuse mechanics live in KV cache management; the per-step scheduler in continuous batching internals; dropping cached tokens in KV cache token eviction; the prefill/decode split in disaggregated inference.
Code blocks below come in two kinds. The
vllm servelaunch snippets are reference templates on the real vendor CLI (vLLM is not installed here); pin the image tag and validate flags against the release you deploy. The Python blocks are runnable, self-checking validations of the core math the page teaches (the cached/uncached equivalence with counted work, using numpy, and the pure-Python decode capacity model); each runs on a stockpython3and asserts its result, including adversarial cases (a corrupted cache entry, a stale cache, an over-capacity batch, a model that "fits" but cannot serve).
What it is¶
An autoregressive transformer emits one token per step, and every attention layer computes that token's query against the keys and values of all previous tokens. Under causal masking with frozen inference weights, the key and value vectors of past tokens never change: token 500's K and V are mathematically identical whether they were computed at step 500 or step 5,000 (bitwise reproducibility on real kernels additionally requires batch-invariant kernels; the cache sidesteps that by storing values once instead of recomputing them). The KV cache exploits that invariance. Each token's K and V are computed exactly once (during prefill for the prompt, during its own decode step for generated tokens), stored in GPU memory, and re-read on every later step instead of recomputed.
Without the cache, emitting the token at position s requires re-running the forward pass over all s tokens (this is literally what use_cache=False does in a Hugging Face generate loop). With the cache, the same step processes one token: project one query/key/value, append K and V to the cache, attend over the stored keys. The step cost drops from linear-in-s projections plus quadratic-in-s attention to constant projections plus linear-in-s attention reads.
The trade is explicit: memory for compute. For a Qwen3-8B-class model (36 layers, 8 KV heads via GQA, head dim 128, FP16) the cache costs 2 x 36 x 8 x 128 x 2 = 147,456 bytes, about 0.14 MiB, per token; an 8k-token sequence holds ~1.2 GB of KV. That footprint, multiplied by concurrency, is what fills HBM and why paging, prefix reuse, eviction, and FP8/NVFP4 KV exist. Kwon et al. measured that pre-paging systems kept only 20.4%-38.2% of KV memory holding actual token state, and recovering that waste (plus the recompute the cache avoids) is worth 2-4x throughput at equal latency.1
Why use it¶
The per-step FLOP ratio between uncached and cached decode is roughly the sequence length itself. A cached decode step costs about 2 x params FLOPs per sequence (every weight participates in one multiply-accumulate per token); an uncached step at position s costs about 2 x params x s because the whole prefix is re-processed. At an 8k context that is nearly four orders of magnitude more compute per token, and the gap widens every step. Summed over a generation, uncached work grows quadratically in total length for the projection/MLP terms and cubically for attention score terms, while cached work grows linearly plus a quadratic attention-over-cache term whose FLOPs are real but so few per byte of KV read that it binds on memory bandwidth, not compute. The executed block under How it works counts both totals exactly on a real attention layer and asserts the closed forms.
What the cache does not remove is the per-step read of the stored K and V: attending over s cached tokens still moves s x kv_bytes_per_token from HBM every step, per sequence. The cache therefore converts decode from compute-bound to memory-bandwidth-bound: each step reads all weights once plus all live KV, and performs comparatively trivial arithmetic. Pope et al. formalise this regime (small-batch generation is bandwidth-limited; KV memory caps batch size; multiquery attention shrinks KV enough to unlock up to 32x longer contexts)2, and it is the reason every production lever on this page is a bytes lever, not a FLOPs lever:
- Batching re-uses the one-per-step weight read across
Bconcurrent sequences, so total tokens/s rises near-linearly until KV traffic dominates. The KV cache is what makes batched decode possible at all: without it, each sequence would re-run its full prefix each step and the GPU would saturate on redundant compute at batch 1. - KV footprint cuts (FP8 KV, GQA/MLA) pay twice: more sequences fit (higher concurrency ceiling) and each step moves fewer bytes (faster steps).
- More HBM (tensor parallelism across GPUs) buys a bigger KV pool, which is usually the real reason to shard a model that technically fits.
When to use it (and when not)¶
The cache is on by default in every serving engine and every sane generate loop; the practical question is when it buys nothing or when its memory cost is deliberately traded back.
- Prefill-only workloads gain nothing. Embedding models, rerankers, classifiers, reward-model scoring, and single-shot logprob evaluation run one forward pass over a full sequence and emit no incremental tokens. There is no second step to reuse the cache; allocating KV for them wastes pool.
- Preemption is the engine choosing recompute on purpose. When the KV pool is exhausted, vLLM evicts a request and later re-runs its prefill (
RECOMPUTEis the V1 default). That is the cache trade executed in reverse, dynamically: the engine decides that recomputing one victim is cheaper than stalling the fleet.3 - Architecture changes the arithmetic. Sliding-window attention caps the cache at the window length; MLA stores a compressed latent instead of full K/V heads (FlashAttention and MLA); state-space and linear-attention layers carry O(1) recurrent state, so there is no growing cache to manage in those layers. Hybrid models mix both regimes and size their pools accordingly.
- Shared prefixes compound the win. When many requests share a system prompt, few-shot preamble, or conversation history, prefix caching reuses KV across requests, and provider prompt caching sells the same mechanism as an API contract. Parallel sampling and beam search share the prompt's KV blocks copy-on-write.1
Architecture¶
flowchart TB
subgraph life["One request"]
PF["Prefill: one pass over the prompt,<br/>K and V computed for every token"] --> DL["Decode loop: one token per step"]
DL -->|"append one K,V; read all cached K,V"| DL
end
subgraph pool["KV pool = budgeted HBM minus weights"]
BA["seq A blocks"]
BB["seq B blocks"]
FR["free blocks"]
end
PF -->|"write KV blocks"| pool
DL -->|"read KV blocks"| pool
pool -->|"pool exhausted"| PRE["preempt victim,<br/>recompute its prefill later"]
PRE -.->|"the uncached cost returns"| PF
subgraph ladder["vLLM growth ladder"]
L1["1 tune one GPU:<br/>memory budget, seqs, token budget"] --> L2["2 shrink KV:<br/>FP8 KV, prefix reuse"]
L2 --> L3["3 scale up: TP sized<br/>by KV headroom"]
L3 --> L4["4 scale out: replicas / DP<br/>+ prefix-aware routing"]
L4 --> L5["5 disaggregate<br/>prefill and decode"]
end
Prefill and decode are different machines sharing one cache. Prefill processes the whole prompt in parallel: large matrix multiplies, compute-bound, sets time-to-first-token, and writes the prompt's KV into the pool. Decode iterates one token per sequence per step: reads all weights plus all live KV, bandwidth-bound, sets inter-token latency. The pool between them is the budgeted fraction of HBM left after weights (vLLM's --gpu-memory-utilization times device memory, minus weights and activation slack), carved into fixed-size blocks by the paged allocator. Concurrency is capped by how many sequences' KV fits in that pool; overflow triggers preemption, which re-imports the very recompute cost the cache exists to remove. The growth ladder in the diagram is the production path: each rung either shrinks KV bytes, adds HBM, or adds independent engines.
How it works (validated core math)¶
The block builds one attention layer and decodes the same sequence twice: uncached (full re-forward each step, all rows) and cached (one prefill, then one-token steps). It instruments every matrix multiply, so the speedup is not asserted from theory but counted, then checked against the closed forms 4sD^2 + 2s^2D per full forward and 4D^2 + 2sD per cached step. The adversarial cases prove the equivalence test has teeth: a single corrupted cache float and a cache missing its newest token (the classic off-by-one append bug) must both change the output.
# Runnable on system python3 (numpy). Core claim of the page: caching K and V turns
# autoregressive decode from a full re-forward per token (quadratic per step, cubic per
# generation in attention) into a single-token incremental step (linear per step). The
# block builds one attention layer both ways, proves the outputs match to float64
# round-off (and that the counted work differs by 32x), counts
# every multiply-accumulate, checks the counts against closed-form cost formulas, and
# asserts the cache is load-bearing: corrupting or under-filling it changes the output.
import numpy as np
D = 64 # model width of the toy layer
rng = np.random.default_rng(7)
Wq, Wk, Wv, Wo = (rng.standard_normal((D, D)) / np.sqrt(D) for _ in range(4))
MACS = {"n": 0} # multiply-accumulate counter, incremented by every matmul
def mm(a: np.ndarray, b: np.ndarray) -> np.ndarray:
MACS["n"] += a.shape[0] * a.shape[1] * b.shape[1]
return a @ b
def softmax_rows(s: np.ndarray) -> np.ndarray:
w = np.exp(s - s.max(axis=-1, keepdims=True))
return w / w.sum(axis=-1, keepdims=True)
def forward_full(X: np.ndarray) -> np.ndarray:
"""Full causal forward over all s tokens: what use_cache=False re-runs every step."""
s = X.shape[0]
Q, K, V = mm(X, Wq), mm(X, Wk), mm(X, Wv)
scores = mm(Q, K.T) / np.sqrt(D)
scores[np.triu(np.ones((s, s), dtype=bool), 1)] = -np.inf
return mm(mm(softmax_rows(scores), V), Wo)
def generate_uncached(prompt: np.ndarray, n: int) -> np.ndarray:
"""No KV cache: re-forward the whole sequence to emit each token."""
X = prompt.copy()
for _ in range(n):
y = forward_full(X)
X = np.vstack([X, np.tanh(y[-1:])]) # deterministic autoregression
return X
def generate_cached(prompt: np.ndarray, n: int, corrupt: bool = False,
skip_append: bool = False) -> np.ndarray:
"""KV cache: one prefill that writes the cache, then one-token incremental steps."""
s = prompt.shape[0]
Q, Kc, Vc = mm(prompt, Wq), mm(prompt, Wk), mm(prompt, Wv) # projected exactly once
scores = mm(Q, Kc.T) / np.sqrt(D)
scores[np.triu(np.ones((s, s), dtype=bool), 1)] = -np.inf
y = mm(mm(softmax_rows(scores), Vc), Wo) # prefill output; Kc/Vc ARE the cache
if corrupt:
Kc[0, 0] += 1e-3 # one flipped cache entry
X = np.vstack([prompt, np.tanh(y[-1:])])
for _ in range(n - 1):
x_new = X[-1:]
q, k, v = mm(x_new, Wq), mm(x_new, Wk), mm(x_new, Wv)
if not skip_append: # skip = stale cache, newest token missing
Kc, Vc = np.vstack([Kc, k]), np.vstack([Vc, v])
s = mm(q, Kc.T) / np.sqrt(D)
out = mm(mm(softmax_rows(s), Vc), Wo)
X = np.vstack([X, np.tanh(out)])
return X
P, N = 16, 48
prompt = rng.standard_normal((P, D))
# 1. Equivalence: cached and uncached decode produce the SAME tokens (float64, tight tol).
MACS["n"] = 0
ref = generate_uncached(prompt, N)
macs_uncached = MACS["n"]
MACS["n"] = 0
out = generate_cached(prompt, N)
macs_cached = MACS["n"]
assert out.shape == ref.shape == (P + N, D)
assert np.allclose(ref, out, atol=1e-10), "cache changed the model's output"
# 2. Counted work matches the closed forms.
# Full forward at length s: 4*s*D^2 (QKV+output projections) + 2*s^2*D (scores + AV).
# Incremental step at length s: 4*D^2 + 2*s*D.
expect_uncached = sum(4 * s * D**2 + 2 * s**2 * D for s in range(P, P + N))
expect_cached = (4 * P * D**2 + 2 * P**2 * D) + sum(
4 * D**2 + 2 * s * D for s in range(P + 1, P + N))
assert macs_uncached == expect_uncached, (macs_uncached, expect_uncached)
assert macs_cached == expect_cached, (macs_cached, expect_cached)
# 3. The headline speedup for this config: ~32x less work end to end, and the LAST step
# alone is 63 tokens of recompute vs 1, a 63x per-step gap that keeps widening.
ratio = macs_uncached / macs_cached
assert ratio > 30, ratio
last_full = 4 * 63 * D**2 + 2 * 63**2 * D
last_inc = 4 * D**2 + 2 * 63 * D
assert last_full / last_inc > 60
# 4. The gap GROWS with generation length (linear in tokens): more decode, more win.
def ratio_at(n: int) -> float:
MACS["n"] = 0
generate_uncached(prompt, n)
u = MACS["n"]
MACS["n"] = 0
generate_cached(prompt, n)
return u / MACS["n"]
r12, r24, r48 = ratio_at(12), ratio_at(24), ratio_at(48)
assert r12 < r24 < r48, (r12, r24, r48)
# 5. Adversarial: the equivalence test is sensitive. One corrupted cache entry (1e-3 in
# one float) or a cache that misses the newest token (an off-by-one append bug) must
# change the output, or assertion 1 would prove nothing.
bad = generate_cached(prompt, N, corrupt=True)
assert not np.allclose(ref, bad, atol=1e-10), "corrupt cache went undetected"
stale = generate_cached(prompt, N, skip_append=True)
assert not np.allclose(ref, stale, atol=1e-10), "stale cache went undetected"
# 6. Boundary: N=1 is pure prefill; both paths do identical work and identical math.
MACS["n"] = 0
a = generate_uncached(prompt, 1)
u1 = MACS["n"]
MACS["n"] = 0
b = generate_cached(prompt, 1)
assert u1 == MACS["n"] and np.allclose(a, b)
print(f"equivalent outputs; MACs uncached={macs_uncached:,} cached={macs_cached:,} "
f"ratio={ratio:.1f}x; ratio grows {r12:.1f} -> {r24:.1f} -> {r48:.1f} OK")
# equivalent outputs; MACs uncached=41,829,376 cached=1,305,600 ratio=32.0x;
# ratio grows 9.8 -> 17.6 -> 32.0 OK
Two readings of the result matter at production scale. First, the 32x is a toy number: the ratio scales with sequence length, so at an 8k context the per-step gap is roughly 8,000x in FLOPs. Second, case 5 is a real bug class, not test theatre: cache-position off-by-ones and stale appends are exactly how broken RoPE indexing or a mis-wired speculative-decoding rollback corrupts generation while every kernel runs "successfully".
How to use it¶
In vLLM the cache is not a feature to enable; it is the engine's central data structure. Using it well means budgeting it deliberately. The template pins the knobs that size and spend the pool; the table after it says what each one trades.
# Reference template (vLLM V1 engine, written against the 0.19-series; one 80 GB GPU;
# not executed here).
# Verify flags against the release you deploy: vLLM renames/retires flags across minors.
# The numbers implement the capacity model below for a Qwen3-8B-class model at 8k ctx.
vllm serve Qwen/Qwen3-8B \
--gpu-memory-utilization 0.90 \
--max-model-len 8192 \
--max-num-seqs 44 \
--max-num-batched-tokens 8192 \
--kv-cache-dtype fp8 \
--port 8000
| Knob | What it sizes | Trade |
|---|---|---|
--gpu-memory-utilization |
Total HBM budget; what remains after weights and activations becomes the KV block pool | Higher = bigger pool, more concurrency; too high risks allocator OOM on activation spikes |
--max-model-len |
Worst-case KV per sequence the scheduler must assume | Lower = more admissible sequences for the same pool |
--max-num-seqs |
Concurrency ceiling per engine step | Above the pool's real capacity it converts overload into preemption churn instead of queueing |
--max-num-batched-tokens |
Per-step token budget shared by chunked prefill and decode | ~2048 favours inter-token latency; >8192 favours TTFT/throughput3 |
--kv-cache-dtype fp8 |
KV bytes per token | Halves footprint: doubles the concurrency ceiling and cuts per-step KV reads (details) |
On startup the engine logs the pool it actually got ("GPU KV cache size: N tokens"); that number divided by your mean context length is the honest concurrency ceiling, and vllm:kv_cache_usage_perc tracks how much of it is live. A log line saying a request "is preempted by PreemptionMode.RECOMPUTE" is the pool telling you it is too small for the offered load.3
How to develop with it¶
The capacity model below is the sizing tool to run before touching flags: it computes the KV pool, the concurrency ceiling, and the decode roofline for a model/GPU pair, and it encodes the four facts the growth ladder stands on. It is first-order on purpose (it ignores CUDA-graph memory, MLP-vs-attention split, scheduler overhead, and uses full-context KV for every sequence); calibrate its constants against one measured deployment and it becomes a fleet-planning function.
# Runnable on system python3 (pure Python). Core algorithm: the decode capacity model that
# turns the KV-cache speedup into a fleet-sizing tool. Decode step time is the max of a
# bandwidth term (weights read once per step + KV read per sequence) and a compute term
# (2*params FLOPs per token per sequence); throughput is batch/step_time; the KV pool
# caps the batch. The block asserts the properties the vLLM growth ladder relies on:
# batching amortizes weight reads, decode never leaves the bandwidth wall, FP8 KV
# doubles concurrency, and TP is sized by KV headroom, not by "does it fit".
HBM = 80e9 # H100 SXM: 80 GB
BW = 3.35e12 # 3.35 TB/s HBM3 (NVIDIA spec page)
FLOPS = 989e12 # ~989 dense BF16 TFLOPS (spec page's 1,979 is with sparsity)
UTIL = 0.9 # vLLM --gpu-memory-utilization
ACT_RESERVE = 2e9 # activations / CUDA graphs / fragmentation slack (approximate)
def kv_bytes_per_token(layers: int, kv_heads: int, head_dim: int, dtype_bytes: int) -> int:
assert min(layers, kv_heads, head_dim, dtype_bytes) > 0
return 2 * layers * kv_heads * head_dim * dtype_bytes
def kv_pool_bytes(params: float, weight_bytes: int, tp: int, util: float = UTIL,
reserve: float = ACT_RESERVE) -> float:
"""KV pool across the TP group: budgeted HBM minus sharded weights and slack."""
assert tp >= 1
pool = tp * (util * HBM - reserve) - params * weight_bytes
if pool <= 0:
raise MemoryError(f"model does not fit at TP={tp}; raise TP or shrink weights")
return pool
def max_seqs(pool: float, kv_tok: int, ctx: int) -> int:
"""Concurrency ceiling: how many ctx-length sequences the KV pool holds."""
return int(pool // (kv_tok * ctx))
def decode_step_time(batch: int, ctx: int, params: float, kv_tok: int, tp: int,
pool: float) -> tuple[float, str]:
"""One decode step: every weight byte is read once, every live KV byte is read once,
and each sequence needs ~2*params FLOPs. Returns (seconds, binding_resource)."""
assert batch >= 1, "batch must be at least 1"
if batch > max_seqs(pool, kv_tok, ctx):
raise MemoryError("batch exceeds KV pool: vLLM would preempt (recompute/swap)")
t_mem = (params * 2 + batch * ctx * kv_tok) / (tp * BW)
t_compute = (2 * params * batch) / (tp * FLOPS)
return max(t_mem, t_compute), ("compute" if t_compute > t_mem else "bandwidth")
# --- Qwen3-8B-class model on one H100: the single-GPU growth story -------------------
params = 8.19e9
kv16 = kv_bytes_per_token(36, 8, 128, 2) # FP16/BF16 KV
kv8 = kv_bytes_per_token(36, 8, 128, 1) # FP8 KV
assert kv16 == 147_456 and kv8 * 2 == kv16 # 0.14 MiB/token; FP8 halves it
CTX = 8192
pool = kv_pool_bytes(params, 2, tp=1)
cap16, cap8 = max_seqs(pool, kv16, CTX), max_seqs(pool, kv8, CTX)
assert cap16 == 44 and cap8 == 88, (cap16, cap8) # FP8 KV doubles concurrency here
assert cap8 == 2 * cap16 # exact 2x for THIS config; floor() need not double in general
def tput(batch: int, kv_tok: int = kv16, tp: int = 1, p: float = params,
pl: float = pool) -> float:
t, _ = decode_step_time(batch, CTX, p, kv_tok, tp, pl)
return batch / t
# 1. Single stream: decode reads all 16.4 GB of weights (plus its own KV) to emit ONE
# token, so a lone request gets ~190 tok/s while >99% of the FLOPs sit idle.
t1 = tput(1)
assert 180 < t1 < 210, t1
# 2. Batching amortizes the weight read: full-pool batch lifts total tokens/s ~11x while
# per-step time grows far less than 44x. Throughput is strictly monotone in batch.
sweep = [tput(b) for b in range(1, cap16 + 1)]
assert all(b > a for a, b in zip(sweep, sweep[1:]))
assert sweep[-1] / t1 > 9, sweep[-1] / t1
# 3. ...but with diminishing returns: at high batch the KV bytes, not the weights,
# dominate the step, so each extra sequence buys less than the one before.
assert (tput(2) - tput(1)) > (tput(cap16) - tput(cap16 - 1))
# 4. Decode never leaves the bandwidth wall: even at the largest batch the pool allows,
# the compute term is far below the memory term. Extra FLOPs cannot help decode; only
# more bandwidth, fewer KV bytes, or more GPUs can.
for b in (1, 8, cap16):
_, bound = decode_step_time(b, CTX, params, kv16, 1, pool)
assert bound == "bandwidth", (b, bound)
# 5. FP8 KV at the same batch also runs FASTER (half the KV bytes per step) on top of
# doubling the ceiling: footprint cuts pay twice, in capacity and in step time.
assert tput(cap16, kv_tok=kv8) > tput(cap16, kv_tok=kv16)
# 6. Adversarial: a batch one past the ceiling must raise (that is vLLM's preemption
# regime, not a throughput data point), and an oversized model must raise, not size.
try:
decode_step_time(cap16 + 1, CTX, params, kv16, 1, pool)
raise AssertionError("over-capacity batch was silently accepted")
except MemoryError:
pass
# --- Llama-3.1-70B-class model: TP is sized by KV headroom, not by fit ---------------
p70 = 70.6e9
kv70 = kv_bytes_per_token(80, 8, 128, 2)
for tp in (1, 2): # 141.2 GB of BF16 weights: no KV pool at TP=1 or TP=2
try: # under the default budget (0.9 util, 2 GB reserve)
kv_pool_bytes(p70, 2, tp=tp)
raise AssertionError(f"70B BF16 must not size at TP={tp} with default budget")
except MemoryError:
pass
# Even the most generous budget (0.95 util, zero reserve) "fits" TP=2 with only a
# handful of 8k sequences: the deployment works in a demo and starves under load.
pool2 = kv_pool_bytes(p70, 2, tp=2, util=0.95, reserve=0)
seqs2 = max_seqs(pool2, kv70, CTX)
assert 1 <= seqs2 <= 4, seqs2
pool4 = kv_pool_bytes(p70, 2, tp=4)
seqs4 = max_seqs(pool4, kv70, CTX)
assert seqs4 >= 50, seqs4 # TP=4 exists for CONCURRENCY, not for fit
print(f"1xH100 8B: 1-seq {t1:.0f} tok/s -> batch {cap16} {sweep[-1]:.0f} tok/s "
f"(x{sweep[-1]/t1:.1f}); FP8 KV cap {cap8}; 70B: TP2 holds {seqs2} seq, "
f"TP4 holds {seqs4} OK")
# 1xH100 8B: 1-seq 190 tok/s -> batch 44 2120 tok/s (x11.1); FP8 KV cap 88;
# 70B: TP2 holds 4 seq, TP4 holds 51 OK
The four asserted properties are the whole production argument in miniature. A lone request wastes the GPU (190 tok/s against ~2,120 available), so concurrency is not optional at fleet economics. Batching pays until KV bytes dominate, so footprint cuts are the next lever. Decode never becomes compute-bound at any feasible batch, so faster GPUs without more bandwidth or memory do not fix a saturated decode pool. And "the model fits" is the wrong sizing question: the 70B case shows TP=2 technically loading and then serving four 8k sequences, while TP=4 serves fifty-one.
How to run it in production: the growth ladder¶
Each rung below is what to do when the previous one saturates, with the metric that tells you it has. The metric names are vLLM's Prometheus surface at /metrics.6
- Exhaust one GPU first. Raise
--gpu-memory-utilizationuntil the startup KV-pool log stops growing safely, set--max-num-seqsfrom the capacity model, and pick--max-num-batched-tokensfor your SLO (2048-class for tight inter-token latency, 8192+ for throughput; chunked prefill is on by default in V1 and decodes are scheduled ahead of prefills).3 Saturation signature:vllm:kv_cache_usage_percpinned near 1.0,vllm:num_requests_waitingnon-zero, preemptions rising (vllm:num_preemptions, scraped asvllm:num_preemptions_total). That signature with headroom on step 2 or 3 means move down the ladder, not add GPUs; see the KV-cache OOM runbook. - Shrink KV bytes before buying silicon.
--kv-cache-dtype fp8doubles the ceiling and speeds every step (validate quality on long-context evals first; details and calibration). Enable prefix reuse for shared-prefix traffic and watchvllm:prefix_cache_hits(vllm:prefix_cache_hits_totalin PromQL): agent and RAG fleets routinely serve most prompt tokens from cache, which offloads prefill, not decode. - Scale up with tensor parallelism, sized by KV headroom. vLLM's own rule: TP across the GPUs of one node, pipeline parallelism only when spanning nodes.4 Choose the TP degree from the capacity model's concurrency output, not from whether weights load; the 70B example above is the canonical trap.
- Scale out with replicas or data parallelism. For dense models, N independent engines behind a load balancer are equivalent to vLLM's native DP and simpler to operate; native DP (
--data-parallel-size, internal/hybrid/external LB modes) earns its complexity for MoE models, where attention runs data-parallel while expert layers form one EP/TP group across ranks (--enable-expert-parallel).5 Route prefix-sticky traffic with prefix-aware request routing so rung-2 hit rates survive the fan-out, and autoscale onvllm:num_requests_waiting, never on GPU utilisation, which reads near-100% under any kernel activity and can read low when the bottleneck is off-GPU, so it carries no backlog signal either way (the deploy recipe wires the HPA). - Disaggregate prefill from decode. When one pool cannot hold both phases' SLOs (prefill bursts inflate inter-token latency, or decode concurrency starves TTFT), split them and transfer KV between pools (disaggregated inference, rate matching, NIXL transfer).
How to maintain it¶
- Steady-state invariants. Preemption rate ~0 over sustained windows;
vllm:kv_cache_usage_percbelow ~0.9 at peak withvllm:num_requests_waitingnear zero; prefix hit rate stable for the traffic mix. Alert on trend breaks, not single spikes; the SLO page has the burn-rate wiring. - Re-run the capacity model on every change of model, dtype, context ceiling, or GPU SKU. Each of those moves the ceiling multiplicatively; a config bump from 8k to 32k
--max-model-lenquarters worst-case concurrency and is a classic silent capacity regression. - Pin and re-verify on upgrades. vLLM's metric and flag surface drifts across versions: the retired V0 engine's KV gauge was
vllm:gpu_cache_usage_perc(V1:vllm:kv_cache_usage_perc), and V1 documents its counters without the_totalsuffix that Prometheus exposition appends6; dashboards and HPAs break quietly. Re-benchmark tokens/s and TTFT/ITL on each engine upgrade before rolling the fleet. - Keep the eval gate on KV compression. FP8 KV and eviction schemes are accuracy-relevant; wire a long-context regression suite into the rollout, and treat "compression shipped, quality dipped" as a rollback trigger.
Failure modes¶
- Preemption storm. Offered concurrency exceeds the pool; every preemption re-buys the prefill the cache had already paid for, so goodput collapses super-linearly under overload. Signature: preemptions rising while tokens/s falls. Fix per the ladder rungs 1-2, or admission-control (QoS page).
- KV OOM at startup or under spike. Budget set above what activations allow, or context ceiling raised without re-sizing. The KV-cache OOM runbook is the triage path.
- "Fits but cannot serve." A TP degree chosen for weight fit leaves a KV pool of a few sequences; the demo works, production queues. The capacity model's 70B case reproduces it; size TP by concurrency.
- Token-budget mis-set.
--max-num-batched-tokenstoo high lets prefill chunks crowd decode steps (inter-token latency spikes on every long-prompt arrival); too low starves TTFT under prompt-heavy load.3 - Autoscaling on the wrong signal. GPU utilisation cannot distinguish five queued requests from five hundred: it pins high under any kernel activity and says nothing about backlog. Scale on
vllm:num_requests_waiting. - Cache-integrity bugs. Wrong RoPE position on append, missed rollback after rejected speculative tokens, or cross-request block aliasing produce fluent garbage with zero errors; the executed block's corruption/staleness cases are the unit-level guard, and prefix-cache hashing covers the cross-request variant (LoRA and multimodal inputs must be part of the hash).
- Quality regressions from KV compression. FP8 KV or eviction acceptable at 4k can degrade at 64k; gate with long-context evals per maintenance above, and see token eviction for why most eviction schemes also fail to return memory.
References¶
- Kwon et al., "Efficient Memory Management for Large Language Model Serving with PagedAttention" (SOSP 2023): https://arxiv.org/abs/2309.06180
- Pope et al., "Efficiently Scaling Transformer Inference" (arXiv 2211.05102): https://arxiv.org/abs/2211.05102
- Vaswani et al., "Attention Is All You Need" (arXiv 1706.03762): https://arxiv.org/abs/1706.03762
- vLLM, "Optimization and Tuning" (preemption, chunked prefill,
max_num_batched_tokensguidance): https://docs.vllm.ai/en/latest/configuration/optimization.html - vLLM, "Parallelism and Scaling" (TP within node, PP across nodes): https://docs.vllm.ai/en/latest/serving/parallelism_scaling.html
- vLLM, "Data Parallel Deployment" (DP modes, MoE DP attention + EP experts): https://docs.vllm.ai/en/latest/serving/data_parallel_deployment.html
- vLLM, "Production Metrics" (
vllm:*metric names): https://docs.vllm.ai/en/latest/usage/metrics.html - NVIDIA H100 specifications (80 GB, 3.35 TB/s; the 1,979 FP16/BF16 tensor TFLOPS carries a "with sparsity" footnote, so the dense figure used here is half): https://www.nvidia.com/en-us/data-center/h100/
- Chris Fregly, "AI Systems Performance Engineering" (O'Reilly), ch. 18: prefill/decode economics and KV sizing.
Related: KV cache management · continuous batching internals · KV cache token eviction · LLM inference efficiency · roofline and arithmetic intensity · disaggregated inference · prompt caching (provider APIs) · inference serving · LLM request routing · recipe: vLLM inference deployment · SLOs for inference serving · runbook: inference KV-cache OOM · Glossary
-
Kwon et al., SOSP 2023 (arXiv 2309.06180). Abstract: vLLM "improves the throughput of popular LLMs by 2-4x with the same level of latency" vs FasterTransformer/Orca, with "near-zero waste in KV cache memory"; sec. 1: "only 20.4%-38.2% of the KV cache memory is used to store the actual token states in the existing systems." ↩↩
-
Pope et al., arXiv 2211.05102. Generation is memory-bandwidth-limited at low batch; KV cache size constrains batch; "the lower memory requirements of multiquery attention ... enables scaling up to 32x larger context lengths"; 29 ms/token low-batch decode (with int8 weight quantization) and 76% MFU high-batch prefill on PaLM 540B. ↩
-
vLLM, "Optimization and Tuning" https://docs.vllm.ai/en/latest/configuration/optimization.html. Default preemption is
RECOMPUTEin V1 with the quoted log line; mitigations: raisegpu_memory_utilization, lowermax_num_seqs/max_num_batched_tokens, or raise parallelism. Chunked prefill on by default; decodes prioritised over prefills; ~2048 token budget favours ITL, >8192 favours throughput; with chunked prefill disabledmax_num_batched_tokensmust exceedmax_model_len. ↩↩↩↩↩ -
vLLM, "Parallelism and Scaling" https://docs.vllm.ai/en/latest/serving/parallelism_scaling.html. Single GPU: no distribution needed; single node multi-GPU: tensor parallelism; multi-node: TP per node combined with pipeline parallelism across nodes;
--distributed-executor-backend {mp,ray}. ↩ -
vLLM, "Data Parallel Deployment" https://docs.vllm.ai/en/latest/serving/data_parallel_deployment.html.
--data-parallel-size,--data-parallel-size-local,--data-parallel-hybrid-lb, external-LB mode with per-rank endpoints; for non-MoE models independent replicas behind a load balancer are equivalent; MoE: DP attention with expert layers grouped as TP of size DPxTP or EP via--enable-expert-parallel. ↩ -
vLLM, "Production Metrics" https://docs.vllm.ai/en/latest/usage/metrics.html.
vllm:kv_cache_usage_perc(1.0 = full),vllm:num_requests_running,vllm:num_requests_waiting,vllm:num_preemptions,vllm:prefix_cache_hits/queries,vllm:time_to_first_token_secondshistogram. Counters are documented without the_totalsuffix the Prometheus client appends at exposition, so PromQL usesvllm:num_preemptions_totalandvllm:prefix_cache_hits_total. The V0 gauge namevllm:gpu_cache_usage_percis attested in the historical V1-metrics design doc shipped with v0.8-era releases, not on the current metrics page; treat any cross-version dashboard as suspect until re-checked. ↩↩