KV cache token eviction and compaction¶
Scope: token-eviction KV cache compression for long-generation (reasoning) inference: the method families (sink plus recency, cumulative attention, observation window, redundancy, geometry scoring), the two infrastructure walls that keep most of them out of production (FlashAttention never materializes attention scores; a paged allocator frees only fully-empty blocks), and the compaction pass that makes eviction actually return paged memory. Dtype compression (FP8 KV) and paged-block mechanics live in KV cache management; this page covers dropping cached tokens outright.
Code blocks below are runnable, self-checking validations of the core mechanisms this page teaches (block-granular reclamation, compaction, online softmax). Each runs on a stock
python3with numpy and asserts its result, including adversarial cases. Method-specific behaviour (TriAttention, H2O reference code) is described from the cited sources, not executed here; pin versions and validate against the repos before production use.
What it is¶
Token eviction bounds KV cache growth by deleting cached K/V rows judged unimportant, instead of (or on top of) shrinking bytes per token with FP8 or GQA/MLA. The justification is attention sparsity: a small fraction of cached tokens collects most of the attention weight, so a token that is never attended to can in principle be dropped without changing outputs. These "heavy-hitter" tokens are few enough that keeping roughly 20% of the KV preserves most of the attention weight, and the first few positions act as "attention sinks" that soak up weight regardless of content.34
The method families differ in how they pick victims:1
- Sink plus recency (StreamingLLM). Keep a handful of initial sink tokens and a sliding window of recent tokens; evict everything else. Needs no scoring signal at all, but permanently loses everything outside the window.3
- Cumulative attention scoring (H2O, Scissorhands, TOVA). Maintain a running sum of the attention each cached token receives at every decode step; evict the lowest-scoring token each step to hold a fixed budget.456
- Observation-window snapshot (SnapKV, PyramidKV, Ada-KV). Score once at the end of prefill: the attention the last W prompt tokens pay to the full history decides which tokens stay. PyramidKV and Ada-KV split the budget per layer and per head.789
- Redundancy-aware (R-KV). Built for reasoning traces: combine importance with redundancy so near-duplicate chain-of-thought steps are evicted first.11
- Page selection without eviction (Quest). Keep the full cache and let each query read only the most relevant pages. Saves attention compute and bandwidth, not memory: the cache still grows with context length.10
- Geometry scoring (TriAttention). Score tokens from the model's pre-RoPE Q/K geometry, learned offline per head, so no runtime attention score is ever needed; pair it with a periodic compaction pass that physically consolidates survivors so whole blocks empty out.21
Why use it¶
Reasoning models make the KV cache the binding constraint on single-GPU serving. Every generated thinking token appends K/V rows across all layers; nothing shrinks the cache while generation runs. NVIDIA's worked example: Qwen3-32B with 4-bit quantized weights crashes a 24 GB GPU with an out-of-memory error after roughly 24,000 generated tokens, short of the 32K-token traces reasoning models routinely produce on hard problems.1 Dtype and architecture levers (FP8 KV, GQA, MLA) shrink bytes per token but leave growth linear in tokens; eviction is the only lever that bounds the token count itself.
The payoff, when the infrastructure problems are actually solved, is large. TriAttention on Qwen3-8B with 32K-token generation at a 3,072-token KV budget matches full attention's AIME 2025 accuracy (40.8%) while decoding 2.5x faster (563.5 vs 222.8 tokens/s) and holding 10.7x less KV memory.12
The catch is that most published methods do not solve them. Accuracy at a 2,048-token budget on AIME 2024: full attention 57.1%, SnapKV 34.6%, R-KV 25.4%, TriAttention 42.1%. On AIME 2025: 40.8% / 20.0% / 17.5% / 32.9%.1 And the headline memory numbers are often measured outside the serving stack: R-KV reports 90% memory savings on pre-allocated contiguous tensors, not under vLLM; integrating such eviction with a paged allocator is the open problem the NVIDIA blog identifies.111
When to use it (and when not)¶
- Use it for long-generation workloads (reasoning traces, agent loops) on memory-constrained GPUs, where the cache, not the weights, is what OOMs, and a bounded accuracy loss at the chosen budget has been measured and accepted.
- Exhaust the free levers first. FP8 KV halves bytes per token at pool-sizing time and is immune to both walls; GQA/MLA models shrink the footprint 8x or more at model-selection time; demand caps (
--max-num-seqs,--max-model-len) bound the working set with zero quality risk. See KV cache management and the KV-cache OOM runbook. - Do not adopt a score-based evictor on a FlashAttention stack (H2O-style cumulative scoring, and any method whose scoring input is "the attention matrix"). The kernel never writes the scores; reference implementations fall back to eager attention and materialize the N x N matrix, which raises memory and lowers throughput.112
- Do not expect memory back from an evictor with no compaction pass. Marking tokens dead frees nothing in a paged engine until whole blocks empty (validated below). Quest-style page selection is a compute optimization; size the KV pool as if uncompressed.10
- Do not evict on full-attention layers without a per-model eval. Eviction is safe under sliding-window or otherwise restricted attention patterns; on layers attending over the whole context it corrupts long-context quality (KV cache management).
- StreamingLLM-style sink plus recency fits unbounded streams where losing mid-context information is acceptable by construction, and needs neither scores nor per-token bookkeeping.3
Architecture¶
Two walls sit between a paper eviction method and a production serving stack, and every method hits, dodges, or clears each one.1
Wall 1: the scoring signal. Production attention kernels (FlashAttention) tile Q/K/V through SRAM and keep only a running max, denominator, and output accumulator per query; the N x N score matrix is never written to HBM. Methods that score by observed attention need exactly what the kernel refuses to keep. The reference H2O implementation resolves this by abandoning FlashAttention for eager attention.1241
Wall 2: block-granular reclamation. Paged allocators free a block only when every slot in it is dead (about 16 tokens per block). Decode-time eviction scatters survivors, so after evicting 14,400 of 16,000 tokens the 1,600 survivors still occupy nearly all of the roughly 1,000 original blocks, and the allocator reclaims almost nothing. A compaction pass (run every ~128 decoded tokens in TriAttention) physically consolidates survivors until tail blocks empty out and return to the allocator.12
flowchart TB
M["Token eviction method"] --> Q1{"Scoring signal"}
Q1 -->|"observed attention scores"| W1["Wall 1: FlashAttention never materializes the score matrix"]
W1 --> F1["Eager fallback: N x N scores land in HBM, memory up, throughput down"]
Q1 -->|"score-free: sink+recency or pre-RoPE geometry"| K1["Fused kernels stay intact"]
K1 --> Q2{"How is memory freed"}
Q2 -->|"mark tokens dead, survivors scattered"| W2["Wall 2: paged block freed only when fully empty"]
W2 --> F2["~0 blocks reclaimed, VRAM and pool usage unchanged"]
Q2 -->|"compaction: order-preserving repack or hole-filling"| K2["Whole tail blocks returned to the allocator"]
K2 --> R["Real KV memory reduction in paged serving"]
The compaction pass has two implemented shapes:12
- Order-preserving repack: every survivor slides forward so the cache stays dense and in original token order. One gather-clone-scatter pass (read all survivors before writing any, so overlapping ranges cannot corrupt), minimal engine changes, more data movement.
- Hole-filling: history survivors stay put; the newest round's survivors drop into the slots eviction opened. Far fewer copies per pass, but physical order comes out scrambled, so each entry's logical position must be tracked explicitly and fed to position-dependent code (RoPE indices, block hashing).
The block below validates the wall-2 arithmetic and both compaction strategies: scattered 90% eviction frees zero of 1,000 blocks; repack returns 900; an aligned whole-block eviction does free its block (reclaim is block-granular, not absent); the worst case of one survivor per block pins the entire pool at 93.75% evicted; and on one decode round the two strategies free the same block at 18 vs 3 slot copies, with hole-filling scrambling physical order but preserving the survivor set.
# Runnable on system python3 (numpy). Core mechanism: paged attention reclaims a KV block
# only when every slot in it is dead, so scattered token eviction frees ~nothing until a
# compaction pass consolidates survivors into whole empty blocks.
import numpy as np
def blocks_freed_no_compaction(alive: np.ndarray, block: int) -> int:
"""Blocks the paged allocator can reclaim: those whose every slot is dead."""
assert block > 0 and alive.dtype == bool
n_blocks = int(np.ceil(len(alive) / block))
padded = np.pad(alive, (0, n_blocks * block - len(alive)), constant_values=False)
return int((~padded.reshape(n_blocks, block)).all(axis=1).sum())
def repack(tokens: np.ndarray, alive: np.ndarray, block: int) -> tuple[np.ndarray, int, int]:
"""Order-preserving compaction: survivors slide forward (gather-clone-scatter).
Returns (dense survivor array, blocks freed, slot copies performed)."""
survivors = tokens[alive] # gather: read all before any write
n_blocks = int(np.ceil(len(tokens) / block))
occupied = int(np.ceil(len(survivors) / block))
copies = int((survivors != tokens[: len(survivors)]).sum()) # slots whose content moves
return survivors, n_blocks - occupied, copies
def hole_fill(tokens: np.ndarray, alive: np.ndarray, round_start: int,
block: int) -> tuple[np.ndarray, int, int]:
"""Hole-filling compaction: history survivors stay put; only the newest round's
survivors drop into the slots eviction opened in the history region.
Returns (physical layout, blocks freed, slot copies performed)."""
holes = np.flatnonzero(~alive[:round_start])
new_survivors = tokens[round_start:][alive[round_start:]]
assert len(new_survivors) <= len(holes), "new round must fit the holes this round"
physical = tokens[:round_start].copy()
physical[holes[: len(new_survivors)]] = new_survivors
keep = np.ones(round_start, dtype=bool)
keep[holes[len(new_survivors):]] = False # unfilled holes stay dead
n_blocks = int(np.ceil(len(tokens) / block))
occupied = int(np.ceil(round_start / block)) # history blocks stay allocated
return physical[keep], n_blocks - occupied, len(new_survivors)
# 1. The production-scale failure: 16,000 cached tokens in 1,000 16-token blocks, evict
# 90% spread evenly (keep every 10th token, an H2O/R-KV-style scatter). Every block
# keeps at least one survivor, so the allocator reclaims ZERO blocks.
tokens = np.arange(16_000)
alive = np.zeros(16_000, dtype=bool)
alive[::10] = True # 1,600 survivors, uniformly scattered
assert alive.sum() == 1_600
assert blocks_freed_no_compaction(alive, block=16) == 0
# 2. Order-preserving repack on the same eviction: 1,600 survivors need only
# ceil(1600/16) = 100 blocks, so 900 of 1,000 blocks come back; the block-reclaim
# ratio now matches the eviction ratio.
survivors, freed, _ = repack(tokens, alive, block=16)
assert freed == 900 and len(survivors) == 1_600
# 3. Repack must preserve logical token order exactly (positions stay monotone):
# the dense array equals the survivors listed in original order.
assert np.array_equal(survivors, tokens[alive])
assert np.all(np.diff(survivors) > 0)
# 4. Boundary: evicting one whole aligned block of 16 contiguous tokens DOES free
# exactly that block even without compaction; reclaim is block-granular, not absent.
aligned = np.ones(16_000, dtype=bool)
aligned[32:48] = False # block 2 exactly
assert blocks_freed_no_compaction(aligned, block=16) == 1
# 5. Adversarial worst case: one survivor per block pins the entire pool however
# aggressive the eviction; 93.75% evicted still frees zero blocks.
worst = np.zeros(16_000, dtype=bool)
worst[::16] = True # exactly one survivor per block
assert worst.sum() == 1_000
assert blocks_freed_no_compaction(worst, block=16) == 0
# 6. Both compaction strategies on one round: 6 blocks x 4 slots, history T0-T19, new
# round T20-T23, evict T2, T9, T13, T21. Both free the same 1 block; hole-filling
# does it with 3 copies instead of 18, at the cost of scrambled physical order.
toy = np.arange(24)
toy_alive = np.ones(24, dtype=bool)
toy_alive[[2, 9, 13, 21]] = False
_, freed_rp, copies_rp = repack(toy, toy_alive, block=4)
layout_hf, freed_hf, copies_hf = hole_fill(toy, toy_alive, round_start=20, block=4)
assert (freed_rp, copies_rp) == (1, 18)
assert (freed_hf, copies_hf) == (1, 3)
# hole-filling scrambles physical order (T20 lands in T2's old slot) ...
assert layout_hf[2] == 20 and not np.all(np.diff(layout_hf) > 0)
# ... but holds the same survivor set, so an explicit position map recovers the logical view.
assert np.array_equal(np.sort(layout_hf), toy[toy_alive])
print("reclaim OK:", "scattered 90% evict frees 0/1000 blocks;",
"repack frees 900/1000;", f"toy: repack {copies_rp} copies vs hole-fill {copies_hf}")
Wall 1 is just as mechanical. The block below builds a FlashAttention-style online softmax and proves the three facts that matter: it matches eager attention to float64 round-off while its peak live score buffer is 16x smaller than the full matrix; the running-max trick keeps it finite where a naive streaming softmax overflows; and the cumulative per-token scores H2O needs are not a function of anything the kernel retains, shown by two caches with identical outputs and identical kernel state but different eviction decisions.
# Runnable on system python3 (numpy). Core mechanism: a FlashAttention-style online
# softmax produces attention outputs identical to eager attention while never holding
# the N x N score matrix, and the per-token scores H2O-style eviction needs are provably
# not recoverable from the state the kernel does keep.
import numpy as np
rng = np.random.default_rng(7)
def reference_attention(q: np.ndarray, k: np.ndarray, v: np.ndarray) -> tuple[np.ndarray, np.ndarray]:
"""Eager attention: materializes the full score matrix (what H2O reads)."""
s = q @ k.T / np.sqrt(q.shape[1])
p = np.exp(s - s.max(axis=1, keepdims=True))
p /= p.sum(axis=1, keepdims=True)
return p @ v, p # output AND the N x N matrix
def online_attention(q: np.ndarray, k: np.ndarray, v: np.ndarray,
tile: int) -> tuple[np.ndarray, int]:
"""FlashAttention-style streaming softmax over K/V tiles.
Keeps a running max m, denominator l, and unnormalized accumulator acc per query;
scores exist only tile-by-tile. Returns (output, peak live score floats)."""
nq, d = q.shape
m = np.full(nq, -np.inf)
l = np.zeros(nq)
acc = np.zeros((nq, v.shape[1]))
peak = 0
for j0 in range(0, len(k), tile):
s = q @ k[j0 : j0 + tile].T / np.sqrt(d) # this tile's scores, then discarded
peak = max(peak, s.size)
m_new = np.maximum(m, s.max(axis=1))
scale = np.exp(m - m_new)
p = np.exp(s - m_new[:, None])
l = l * scale + p.sum(axis=1)
acc = acc * scale[:, None] + p @ v[j0 : j0 + tile]
m = m_new
return acc / l[:, None], peak
N, D, TILE = 512, 64, 32
q, k, v = rng.standard_normal((3, N, D))
# 1. Exactness: the online path matches eager attention to float64 round-off while its
# peak live score buffer is 16x smaller than the matrix H2O needs (512x32 vs 512x512).
out_ref, p_full = reference_attention(q, k, v)
out_online, peak = online_attention(q, k, v, TILE)
assert np.allclose(out_online, out_ref, atol=1e-12)
assert peak == N * TILE and peak * 16 == N * N
# 2. H2O's signal is a reduction over that full matrix: cumulative attention per cached
# token = column sums of p. Check the evictor against a slow per-element reference.
h2o_score = p_full.sum(axis=0)
slow = np.zeros(N)
for i in range(N):
for j in range(N):
slow[j] += p_full[i, j]
assert np.allclose(h2o_score, slow)
evict = np.argsort(h2o_score)[: N // 2] # bottom half by cumulative attention
assert h2o_score[evict].max() <= h2o_score[np.setdiff1d(np.arange(N), evict)].min()
# 3. Adversarial: the running-max trick is what makes the streaming path safe. A naive
# streaming softmax without max subtraction overflows on large logits; the online
# path stays finite AND still matches the (max-subtracted) eager reference.
q_hot = q + 400.0 # drives raw logits past exp overflow
with np.errstate(over="ignore"):
naive = np.exp(q_hot @ k.T / np.sqrt(D)) # no max subtraction
assert not np.isfinite(naive).all() # overflow: inf appears
out_hot, _ = online_attention(q_hot, k, v, TILE)
assert np.isfinite(out_hot).all()
assert np.allclose(out_hot, reference_attention(q_hot, k, v)[0], atol=1e-12)
# 4. Irrecoverability: two caches that differ only by swapping the keys of two tokens
# with IDENTICAL value vectors produce the same outputs and the same retained kernel
# state (m, l, acc are symmetric functions of the tile scores), yet DIFFERENT H2O
# scores. Whatever the kernel keeps, the evictor's signal is not a function of it.
v2 = v.copy()
v2[7] = v2[300] # make the two values identical
k_swap = k.copy()
k_swap[[7, 300]] = k_swap[[300, 7]] # swap their keys
out_a, pa = reference_attention(q, k, v2)
out_b, pb = reference_attention(q, k_swap, v2)
assert np.allclose(out_a, out_b, atol=1e-12) # kernel-visible state identical
sa, sb = pa.sum(axis=0), pb.sum(axis=0)
assert np.allclose(sa[[7, 300]], sb[[300, 7]]) # the two tokens' scores swapped
assert abs(sa[7] - sa[300]) > 1e-3 # and they genuinely differ
# An evictor ranking token 7 vs 300 flips its decision between the two caches even
# though every output the model produces is identical.
print("scores OK:", f"online==eager at peak {peak} vs {N*N} score floats;",
"H2O signal unrecoverable from kernel state")
Observation-window methods dodge wall 1 partially (they score once, at the end of prefill, where a single eager pass is affordable), but the window cannot grow: RoPE orients each query by its position, so only roughly the 25 most recent queries reflect where the model is currently attending. Anything that draws no attention during that snapshot is evicted, even if later reasoning depends on it.17
How to use it¶
Order of operations for a memory-bound long-generation deployment:
- Take the free wins first.
--kv-cache-dtype fp8(halves bytes per token at pool-sizing time, no fragmentation exposure), a GQA/MLA model if the choice is open, prefix caching for shared prefixes, and demand caps. All of this is stock vLLM; see KV cache management and quantization for inference. - Confirm eviction is even the right lever. If the server OOMs at startup or under burst, that is pool sizing, not cache growth (the KV-cache OOM runbook). Eviction pays off when single-sequence generation length is what exhausts the budget.
- Pick a method that clears both walls. As of mid-2026 vLLM has no in-tree decode-time token evictor (verify against the current release; the feature surface moves). TriAttention ships its own paged-attention integration with both compaction variants; treat it as a pinned external dependency and validate on one replica before any fleet exposure.2
- Set the budget from measured accuracy, not from memory targets. The published operating points: budget 2,048 loses 15 points of AIME 2024 accuracy even for the best method; 3,072 matches full attention on AIME 2025 for TriAttention on Qwen3-8B. Re-measure on the deployed model and task mix; the numbers do not transfer across models.1
How to integrate it¶
An evictor touches a serving engine at exactly two surfaces; both need explicit acceptance tests.
- The scoring hook. Score-free methods (sink plus recency, pre-RoPE geometry) read no runtime attention state and leave the fused kernel untouched. If a candidate method's scoring input is attention weights, the integration question is already answered: it will not run on the production kernel. Test: assert the engine's chosen attention backend is unchanged after enabling the method (backend selection is logged at engine start), and that per-step memory does not scale O(N x N).
- The compaction pass. Marking tokens dead must be followed by physical consolidation, and the block table, position bookkeeping, and any content-addressed block hashes must be updated together. Order-preserving repack keeps positions monotone and needs no extra mapping; hole-filling needs an explicit logical-to-physical position map that RoPE indexing and prefix-cache hashing consume. Test: after a synthetic eviction pass at a known ratio, the engine's free-block gauge must rise by the matching block count (the numpy block above is the oracle), and generation from a compacted cache must be token-identical to generation from an uncompacted cache holding the same survivor set.
- Reference counting. Blocks shared across sequences (prefix caching, beam candidates) cannot be moved or freed per-sequence; compaction must respect block reference counts or be restricted to unshared decode-tail blocks. Test: run compaction under prefix-cache load and assert hit rate and correctness are unaffected.
- Expose method metrics. Blocks freed per compaction pass, slot copies per pass, tokens evicted, and budget occupancy. Without these the runbook below cannot distinguish a working evictor from a silent no-op.
How to run it in production¶
- Judge success by the free-block gauge, never by
nvidia-smi. The engine claims its HBM fraction at startup and holds it for the life of the server; device memory reads the same at 5% and 100% cache occupancy. The gauges that move arevllm:kv_cache_usage_percand the free-block count; blocks freed per compaction pass is the direct proof of effect (KV cache management, the no-savings runbook). - Amortize compaction. A pass every ~128 decoded tokens keeps copy overhead marginal; per-step compaction turns a bandwidth-bound decode into a copy loop. Watch inter-token latency percentiles when tuning cadence.2
- Canary with an accuracy probe. Eviction failures are silent quality regressions, not errors. Route a fixed eval slice (a few AIME/MATH-style tasks for reasoning workloads) through the canary and gate promotion on score parity within the accepted budget loss, in addition to the usual TTFT/TPOT SLOs (SLOs for inference serving).
- Keep the demand-side guards on. Eviction bounds per-sequence growth; it does not repeal concurrency arithmetic.
--max-num-seqsand preemption alerts from the KV-cache OOM runbook stay in place.
How to maintain it¶
- Re-run the accuracy gate on every model swap. Head behaviour (local, sink, range-specific) is what geometry scoring learns; a new checkpoint invalidates both the learned curves and the budget choice. Re-derive, re-measure, re-canary.2
- Pin the method version and the engine version together. Compaction touches block-table internals that engines refactor freely; an engine upgrade under a pinned evictor (or vice versa) is a change to test on one replica, not absorb silently.
- Re-check the walls on engine upgrades. Native evictor support, block-size changes, or prefix-cache hashing changes in the serving engine can obsolete or break an out-of-tree integration; diff release notes for cache-management changes before bumping.
- Watch for silent backend regression. A dependency bump that flips the attention backend to eager (wall 1's failure mode) shows up as a simultaneous throughput drop and memory rise; alert on backend selection at startup.
Failure modes¶
- Success measured at
nvidia-smi. The KV pool is a static reservation; no eviction method can move device-level memory. Judged at the wrong layer, every method looks broken (or a broken one looks fine because "memory was always flat anyway"). Use the free-block and cache-usage gauges (the no-savings runbook). - Scattered survivors pin the pool. Eviction without compaction frees zero blocks at any eviction ratio once each block keeps one survivor (validated above: 93.75% evicted, 0 blocks freed). This is the default outcome of bolting a paper evictor onto a paged engine.1
- Eager-attention fallback. Score-based evictors that cannot read FlashAttention's internals materialize the N x N matrix instead: memory rises and throughput falls exactly when the method was supposed to save both.14
- Lab-vs-paged measurement gap. Savings reported on contiguous pre-allocated tensors (R-KV's 90%) do not transfer to block-granular allocators; treat any paper number not measured under a paged engine as a compute result, not a memory result.111
- Observation-window blindness. RoPE limits informative queries to roughly the 25 most recent, so snapshot methods evict information that a later reasoning phase needs; accuracy collapses at tight budgets (SnapKV 20.0% vs full attention's 40.8% on AIME 2025 at budget 2,048).1
- Page selection mistaken for memory savings. Quest-style methods keep the whole cache; deploying one against a memory problem changes nothing about OOM behaviour.10
- Scrambled order without position bookkeeping. Hole-filling compaction breaks physical monotonicity; if RoPE indices or block hashes read physical order, outputs corrupt. The position map is part of the method, not an optimization.2
- Compaction run too often. Per-step consolidation adds a copy pass to a bandwidth-bound phase; inter-token latency degrades while memory metrics look perfect.
- Eviction on full-attention layers without eval. Long-context quality corrupts even when memory behaves; gate on a per-model eval and keep restricted-attention scoping (KV cache management).
References¶
- NVIDIA Efficient AI Lab, "KV Cache Compression and Its Infra Problems" (June 2026): https://research.nvidia.com/labs/eai/blogs/kv-cache-compression-and-its-infra-problems/
- TriAttention: Efficient Long Reasoning with Trigonometric KV Compression, preprint (arXiv:2604.04921), code: https://github.com/WeianMao/triattention
- Xiao et al., "Efficient Streaming Language Models with Attention Sinks" (StreamingLLM, ICLR 2024): https://arxiv.org/abs/2309.17453
- Zhang et al., "H2O: Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models" (NeurIPS 2023): https://arxiv.org/abs/2306.14048
- Liu et al., "Scissorhands: Exploiting the Persistence of Importance Hypothesis for LLM KV Cache Compression at Test Time" (NeurIPS 2023): https://arxiv.org/abs/2305.17118
- Oren et al., "Transformers are Multi-State RNNs" (TOVA, EMNLP 2024): https://arxiv.org/abs/2401.06104
- Li et al., "SnapKV: LLM Knows What You Are Looking for Before Generation" (NeurIPS 2024): https://arxiv.org/abs/2404.14469
- Cai et al., "PyramidKV: Dynamic KV Cache Compression based on Pyramidal Information Funneling": https://arxiv.org/abs/2406.02069
- Feng et al., "Ada-KV: Optimizing KV Cache Eviction by Adaptive Budget Allocation for Efficient LLM Inference" (NeurIPS 2025): https://arxiv.org/abs/2407.11550
- Tang et al., "Quest: Query-Aware Sparsity for Efficient Long-Context LLM Inference" (ICML 2024): https://arxiv.org/abs/2406.10774
- Cai et al., "R-KV: Redundancy-aware KV Cache Compression for Reasoning Models" (NeurIPS 2025): https://arxiv.org/abs/2505.24133
- Dao et al., "FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness": https://arxiv.org/abs/2205.14135
- Kwon et al., "Efficient Memory Management for Large Language Model Serving with PagedAttention" (SOSP 2023): https://arxiv.org/abs/2309.06180
- vLLM production metrics: https://docs.vllm.ai/en/latest/usage/metrics.html
- vLLM quantized KV cache (
--kv-cache-dtype): https://docs.vllm.ai/en/latest/features/quantization/quantized_kvcache/
Related: KV-Cache Management · No-Savings Runbook · Inference KV-Cache OOM · FlashAttention / MLA · Continuous Batching Internals · Quantization for Inference · LLM Inference Efficiency · Inference Serving · Glossary
-
NVIDIA Efficient AI Lab, "KV Cache Compression and Its Infra Problems" (June 2026). Qwen3-32B 4-bit OOM at ~24,000 generated tokens on 24 GB; the two infrastructure problems; the 14,400-of-16,000 eviction example; the ~25-query observation-window limit under RoPE; the AIME 2024/2025 and MATH500 accuracy table at budgets 2,048/512; the accuracy-matched 3,072-budget result (40.8%, 2.5x throughput, 10.7x KV memory reduction, Qwen3-8B, 32K generation). ↩↩↩↩↩↩↩↩↩↩↩↩↩↩↩↩↩
-
Mao et al., "TriAttention: Efficient Long Reasoning with Trigonometric KV Compression," preprint (arXiv:2604.04921). Pre-RoPE geometric scoring (no runtime attention scores); compaction every ~128 decoded tokens; order-preserving repack and hole-filling variants. https://github.com/WeianMao/triattention ↩↩↩↩↩↩↩↩
-
Xiao et al., "Efficient Streaming Language Models with Attention Sinks," ICLR 2024. https://arxiv.org/abs/2309.17453 ↩↩↩
-
Zhang et al., "H2O: Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models," NeurIPS 2023. Cumulative attention scoring; ~20% of tokens collect ~80% of attention weight; the reference implementation materializes attention scores (eager attention). https://arxiv.org/abs/2306.14048 ↩↩↩↩
-
Liu et al., "Scissorhands," NeurIPS 2023. Persistence-of-importance hypothesis. https://arxiv.org/abs/2305.17118 ↩
-
Oren et al., "Transformers are Multi-State RNNs" (TOVA), EMNLP 2024. https://arxiv.org/abs/2401.06104 ↩
-
Li et al., "SnapKV," NeurIPS 2024. Observation window of recent prompt tokens scores the history once at end of prefill. https://arxiv.org/abs/2404.14469 ↩↩
-
Cai et al., "PyramidKV," 2024. Per-layer budget allocation. https://arxiv.org/abs/2406.02069 ↩
-
Feng et al., "Ada-KV," NeurIPS 2025. Per-head adaptive budget allocation. https://arxiv.org/abs/2407.11550 ↩
-
Tang et al., "Quest: Query-Aware Sparsity for Efficient Long-Context LLM Inference," ICML 2024. Full cache retained; per-query page selection; KV memory still grows with context. https://arxiv.org/abs/2406.10774 ↩↩↩
-
Cai et al., "R-KV: Redundancy-aware KV Cache Compression for Reasoning Models," NeurIPS 2025. 90% memory savings measured on pre-allocated contiguous tensors, not under vLLM. https://arxiv.org/abs/2505.24133 ↩↩↩
-
Dao et al., "FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness." Tiled attention through SRAM; the N x N score matrix is never materialized in HBM. https://arxiv.org/abs/2205.14135 ↩↩