
Runbook: KV compression enabled, no memory savings

Scope: diagnose and fix a serving deployment where a KV-cache compression or token-eviction method was enabled but GPU memory relief never materialized: device memory flat (expected), engine cache gauges still pinned, OOM or preemption unchanged, or memory and throughput actually worse after enabling.

Run this when a KV compression rollout did not deliver: the method is on, yet vllm:kv_cache_usage_perc stays at the ceiling, preemptions keep climbing, replicas still OOM on long generations, or throughput dropped since the change. Severity: capacity and cost, page-worthy only if the rollout also regressed the SLO. The usual root causes are measuring at the wrong layer, an eager-attention fallback, or eviction without compaction.

The commands below are reference templates built on real APIs; pin versions and validate before production use.

Three different layers can be read as "KV memory", and a compression method can only ever move the inner two. Device memory (nvidia-smi) shows the engine's startup reservation from --gpu-memory-utilization and holds constant for the life of the server regardless of cache contents. Pool occupancy (vllm:kv_cache_usage_perc) shows how full the block pool is. The free-block count shows what the allocator can actually hand to new sequences, and it rises only when whole blocks (about 16 tokens each) become completely empty. Token eviction that scatters survivors across blocks frees nothing as long as every block keeps at least one survivor, which can remain true even at a 90% eviction ratio; the mechanics and the compaction fix are covered under KV cache token eviction, and general pool sizing under KV cache management.
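As a rough illustration of how the two inner layers relate, the sketch below converts a pool-occupancy reading into an approximate free-block count. It assumes occupancy is the fraction of pool blocks currently in use, so free is about N × (1 − usage); `approx_free_blocks` is a hypothetical helper, and N is the block count from the engine-start log.

```shell
# approx_free_blocks TOTAL_BLOCKS USAGE
#   TOTAL_BLOCKS: block count reported in the engine-start log
#   USAGE:        vllm:kv_cache_usage_perc as a 0..1 fraction
# Assumption: usage = used_blocks / total_blocks.
approx_free_blocks() {
  awk -v n="$1" -v u="$2" 'BEGIN { printf "%d\n", n * (1 - u) + 0.5 }'
}

approx_free_blocks 10000 0.97   # about 300 blocks left for new sequences
```

A replica reading 0.97 on a 10000-block pool has only a few hundred blocks to hand out, whatever nvidia-smi reports.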

Trigger

Gauges never moved. A compression/eviction method shipped, but vllm:kv_cache_usage_perc still sits at ~1.0 under the same load, and long-generation requests still OOM or preempt at the same token counts.
Memory or latency regressed after enabling. Per-replica throughput dropped and memory rose since the rollout: the signature of a score-based evictor forcing an eager-attention fallback.
Paper-vs-fleet gap. The method's reported savings (for example R-KV's 90%) do not appear under the paged serving engine.
nvidia-smi unchanged is not by itself a trigger; device memory cannot move and no fix in this runbook will change it.

Pre-checks

Classify the method before touching the fleet. The fix depends on the class (KV cache token eviction):
quantized KV (--kv-cache-dtype fp8): acts at pool sizing, needs a restart to take effect;
token eviction with compaction (TriAttention-style): should free whole blocks at runtime;
token eviction without compaction (H2O/SnapKV/R-KV-style bolt-on): cannot free paged memory;
page selection (Quest-style): a compute optimization, never frees memory by design.
Capture the baseline. Engine-start log lines for the KV block count (# GPU blocks: N) and the selected attention backend; current vllm:kv_cache_usage_perc, vllm:num_preemptions, and throughput under representative load.
Confirm the engine is paged. Lab harnesses with contiguous pre-allocated caches reclaim memory differently; results measured there do not transfer (KV cache token eviction).
Rule out plain KV pressure. If compression was added to paper over under-sized capacity, the sizing branch belongs to the KV-cache OOM runbook; this runbook only establishes whether the compression method itself works.
Cordon and drain one replica for any test that restarts or reconfigures the engine; never experiment on an in-rotation pod (the KV-cache OOM runbook, step 2).
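One way to capture the baseline is to save a filtered scrape to a timestamped file so later steps have something to compare against. This is a minimal sketch: `snapshot_gauges` is a hypothetical helper, and the file naming is only a suggestion.

```shell
# snapshot_gauges: read Prometheus text exposition on stdin and print only the
# series this runbook tracks. Anchoring at line start also drops the
# "# HELP" / "# TYPE" comment lines.
snapshot_gauges() {
  grep -E '^vllm:(kv_cache_usage_perc|num_preemptions|num_requests_running|num_requests_waiting)'
}

# Usage under representative load (EP and NS are the deployment and namespace):
#   curl -s "http://$EP.$NS:8000/metrics" | snapshot_gauges > "baseline-$(date +%Y%m%dT%H%M%S).prom"
```

Store the file next to the engine-start log excerpt so the block count and the gauges are recorded together.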

Flow

flowchart TB
    A["Compression on, no memory relief"] --> B{"Which gauge was read?"}
    B -->|"nvidia-smi"| C["Expected: static reservation. Re-measure on engine gauges"]
    C --> D
    B -->|"engine gauges"| D{"Throughput down / memory up since enabling?"}
    D -->|"yes"| E["Wall 1: check for eager-attention fallback"]
    E --> F["Disable score-based evictor or adopt score-free scoring"]
    D -->|"no"| G{"Free blocks rise after an eviction pass?"}
    G -->|"no"| H["Wall 2: scattered survivors, no compaction"]
    H --> I["Adopt compaction-capable method, or drop eviction for fp8 KV + demand caps"]
    G -->|"yes"| J{"Still OOM / preempting?"}
    J -->|"yes"| K["Capacity sizing: divert to KV-cache OOM runbook"]
    J -->|"no"| L["Verify: gauges, throughput, accuracy gate"]
    F --> L
    I --> L

Procedure

EP=inference-llm                 # deployment / service
NS=serving

Re-measure at the engine layer. Scrape the gauges that can actually move and record them against the baseline:
```
curl -s http://$EP.$NS:8000/metrics | grep -E \
  'vllm:kv_cache_usage_perc|vllm:num_preemptions|vllm:num_requests_running|vllm:num_requests_waiting'
kubectl -n $NS logs deploy/$EP | grep -E 'GPU blocks|backend'   # block budget + attention backend
```
If the only "failure" was a flat nvidia-smi, and cache usage, preemptions, and OOM behaviour did improve, the method works; close here and fix the dashboard, not the server.
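Recording "against the baseline" can be done by differencing two saved scrapes. The sketch below assumes each file holds Prometheus text lines with the value as the last field and no trailing timestamp; `gauge_value` and `gauge_delta` are hypothetical helpers.

```shell
# gauge_value NAME FILE: print the value of the first series whose line starts with NAME.
gauge_value() {
  awk -v name="$1" 'index($0, name) == 1 { print $NF; exit }' "$2"
}

# gauge_delta NAME BASE_FILE NOW_FILE: print current minus baseline, three decimals.
gauge_delta() {
  awk -v b="$(gauge_value "$1" "$2")" -v n="$(gauge_value "$1" "$3")" \
    'BEGIN { printf "%.3f\n", n - b }'
}

# Usage:
#   gauge_delta vllm:kv_cache_usage_perc baseline.prom now.prom
```

A negative delta on cache usage under the same load is evidence the method works even when device memory is flat.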
Check for the eager-attention fallback (wall 1). A method that scores tokens by observed attention cannot read FlashAttention's internals; integrations commonly fall back to eager attention and materialize the N x N score matrix. Confirm on the drained replica:
the attention backend logged at engine start changed after enabling the method, or
per-request memory now scales with the square of context length, or
throughput dropped simultaneously with a memory rise.
Any of these confirms the fallback: disable the method (single flag revert) and restart the drained replica. The replacement path is a method that keeps the fused kernel: score-free scoring, or sink-plus-recency rules (KV cache token eviction).
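To judge whether an observed memory rise is plausibly the materialized score matrix, it helps to size that matrix. The sketch below computes one layer's N x N scores for a single sequence; the 32-head count and FP16 element size are illustrative assumptions, and whether an integration holds all heads at once depends on its implementation.

```shell
# eager_scores_gib N HEADS BYTES: GiB for one layer's N x N attention-score
# matrix across HEADS heads at BYTES per element (2 for FP16/BF16), batch of 1.
eager_scores_gib() {
  awk -v n="$1" -v h="$2" -v b="$3" \
    'BEGIN { printf "%.2f\n", n * n * h * b / (1024 ^ 3) }'
}

eager_scores_gib 8192 32 2    # 4.00
eager_scores_gib 32768 32 2   # 64.00
```

Quadrupling the context multiplies the figure by sixteen, which is the quadratic signature to look for in per-request memory.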
Prove or disprove block reclamation (wall 2). On the drained replica, run one long-generation request (a reasoning prompt driving tens of thousands of output tokens) and watch the free-block gauge across the method's eviction passes:
Free blocks never rise while the method reports tokens evicted: survivors are scattered and no compaction runs; the method cannot free paged memory. Marking 90% of tokens dead frees zero blocks once each ~16-token block keeps one survivor.
Free blocks rise proportionally to the eviction ratio: the method compacts correctly; the remaining pressure is capacity sizing, divert to the KV-cache OOM runbook.
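The two outcomes above follow from block arithmetic alone. The sketch below compares a worst-case scattered layout (survivors spread one per block until they run out) with a dense repack; `blocks_freed` is a hypothetical helper and the 16-token block size is the approximate figure used in this runbook.

```shell
# blocks_freed TOKENS BLOCK_SIZE EVICT_RATIO MODE
#   MODE=scattered: worst case, each block keeps a survivor while survivors last
#   MODE=compacted: survivors are repacked densely into as few blocks as possible
blocks_freed() {
  awk -v t="$1" -v bs="$2" -v r="$3" -v mode="$4" 'BEGIN {
    total = int((t + bs - 1) / bs)      # blocks held before eviction
    keep  = int(t * (1 - r) + 0.5)      # surviving tokens
    if (mode == "compacted") used = int((keep + bs - 1) / bs)
    else                     used = (keep < total) ? keep : total
    print total - used
  }'
}

blocks_freed 32000 16 0.9 scattered   # 0
blocks_freed 32000 16 0.9 compacted   # 1800
```

With 32000 cached tokens and 90% evicted, 3200 survivors are enough to pin all 2000 blocks when scattered, while compaction returns 1800 of them to the allocator.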
Apply the fix matching the diagnosis.
No compaction: replace the bolt-on evictor with a compaction-capable integration (order-preserving repack or hole-filling; TriAttention ships both), or drop token eviction entirely and take the guaranteed levers: --kv-cache-dtype fp8 plus demand caps (--max-num-seqs, --max-model-len).
Page-selection method: reclassify it as a latency/compute optimization and size the KV pool as if uncompressed; pair with fp8 KV if memory is the constraint.
Quantized KV showing no effect: the pool is sized at startup; confirm the flag is actually set on the running process (ps / pod spec, not just the manifest) and that the block count logged at start roughly doubled versus the FP16 baseline.
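The "roughly doubled" check for quantized KV can be scripted against the engine-start logs. This sketch assumes the log line has the form `# GPU blocks: N` quoted in the pre-checks, which may differ between engine versions; `gpu_blocks` and `block_ratio` are hypothetical helpers.

```shell
# gpu_blocks: read engine-start log text on stdin, print N from the first
# line containing "# GPU blocks: N".
gpu_blocks() {
  sed -n 's/.*# GPU blocks: \([0-9][0-9]*\).*/\1/p' | head -n 1
}

# block_ratio BASELINE_N CURRENT_N: CURRENT_N / BASELINE_N, two decimals.
block_ratio() {
  awk -v b="$1" -v c="$2" 'BEGIN { printf "%.2f\n", c / b }'
}

# Usage:
#   now=$(kubectl -n "$NS" logs deploy/"$EP" | gpu_blocks)
#   block_ratio "$baseline_blocks" "$now"    # expect a value near 2 with fp8 KV
```

A ratio near 1.00 after the restart suggests the flag never reached the running process.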
Gate on accuracy before returning to rotation. Token eviction trades quality for memory; at tight budgets the loss is severe even for the best methods. Run the deployment's eval slice at the configured budget and compare against the recorded full-attention baseline; promote only within the accepted loss (KV cache token eviction).
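The promotion rule can be expressed as a small gate that a rollout script calls. This is a sketch only: `within_budget` is a hypothetical helper, and the scores and accepted drop in the usage line are placeholder values, not measurements.

```shell
# within_budget BASELINE SCORE MAX_DROP: exit 0 if SCORE is at most MAX_DROP
# below BASELINE (all in the eval metric's own units), otherwise exit 1.
within_budget() {
  awk -v b="$1" -v s="$2" -v d="$3" 'BEGIN { exit !((b - s) <= d) }'
}

# Usage (placeholder numbers):
#   within_budget 71.4 70.9 1.0 && echo "promote" || echo "hold: accuracy outside budget"
```

Keeping the gate as an exit code lets the same check run in CI and by hand on the drained replica.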

Verification

Blocks come back. During a sustained long-generation load test, the free-block gauge rises after each compaction pass, proportional to the eviction ratio; vllm:kv_cache_usage_perc stays off the ceiling.
The original symptom is gone. The token count that previously OOMed a replica now completes; rate(vllm:num_preemptions_total[5m]) stays ~0 (Prometheus exposes the preemption counter with the _total suffix).
Throughput at parity or better. Tokens/s under the same load matches or beats the pre-compression baseline; a persistent drop means wall 1 is still being paid somewhere.
Accuracy within budget. The eval slice scores within the accepted delta of the full-attention baseline, recorded next to the config that produced it.
No new OOM. A sustained load test completes without torch.OutOfMemoryError or worker kills.

Rollback

Disable the method with a single flag revert to the recorded last-good engine config and restart the drained replica; do not stack fixes with the method still enabled.
Fall back to the guaranteed configuration if the incident must close before a compaction-capable integration is validated: --kv-cache-dtype fp8, demand caps, and the supply-side levers of the KV-cache OOM runbook.
Revert any pool-sizing changes (--gpu-memory-utilization) made while testing so the steady-state footprint matches the recorded baseline (SRE and MLOps practices).

the KV-cache OOM runbook: KV pressure and preemption thrash (capacity sizing; this runbook's sibling for when the cache is simply too small).
the inference-SLO-breach runbook: the parent TTFT/TPOT runbook if the rollout regressed latency.
the training-OOM runbook: the training-side memory triage analogue.
operational runbooks: operational runbooks index.

References

NVIDIA Efficient AI Lab, "KV Cache Compression and Its Infra Problems" (June 2026): https://research.nvidia.com/labs/eai/blogs/kv-cache-compression-and-its-infra-problems/
TriAttention (compaction-capable eviction, order-preserving repack and hole-filling): https://github.com/WeianMao/triattention
vLLM production metrics (vllm:kv_cache_usage_perc, vllm:num_preemptions): https://docs.vllm.ai/en/latest/usage/metrics.html
vLLM engine arguments (--gpu-memory-utilization, --kv-cache-dtype, --max-num-seqs, --max-model-len): https://docs.vllm.ai/en/latest/configuration/engine_args/
vLLM quantized KV cache: https://docs.vllm.ai/en/latest/features/quantization/quantized_kvcache/
Zhang et al., "H2O" (reference implementation materializes attention scores): https://arxiv.org/abs/2306.14048
Cai et al., "R-KV" (savings measured on contiguous tensors; paged integration open): https://arxiv.org/abs/2505.24133
Tang et al., "Quest" (page selection; keeps the full cache): https://arxiv.org/abs/2406.10774