Runbook: inference KV-cache OOM / preemption thrash
Scope: stabilize an LLM server thrashing on KV-cache pressure (preemptions/recompute, requests evicted, latency spikes) by sizing KV memory, capping concurrency/context, and tuning the scheduler.
Run this when an LLM endpoint is thrashing on KV-cache: num_preemptions climbing, requests evicted and recomputed, KV-cache usage pinned near 100%, and TTFT/TPOT spiking under load. Severity: user-facing degradation, page-worthy when it pushes the SLO. Stabilise concurrency first, then size KV memory correctly so the engine stops evicting. The commands below are reference templates against real APIs; pin versions and validate before production use.
KV-cache is the per-request decode state held in GPU memory; when the running set needs more KV blocks than the cache holds, the scheduler preempts requests, either swapping their KV to CPU or recomputing it on resume, and that recompute/swap churn is the thrash. This is the KV-pressure branch of the inference-SLO-breach runbook, taken to its hardware root cause. KV-cache mechanics (paged blocks, swap vs recompute, prefix reuse) are in KV-cache management; the prefill/decode two-phase model is in inference serving. Metric names below are vLLM V1 (vllm:kv_cache_usage_perc, vllm:num_preemptions); legacy V0 used vllm:gpu_cache_usage_perc and vllm:num_preemptions_total, so confirm against the running engine version.
Trigger
- Preemptions climbing. vllm:num_preemptions rate is non-zero and rising; the engine logs "Sequence group ... is preempted by ... mode" repeatedly.
- KV-cache pinned. vllm:kv_cache_usage_perc sits at ~1.0 (100%) while vllm:num_requests_waiting stays non-zero, so the running batch cannot grow.
- Recompute/swap churn. TTFT spikes on resumed requests (recompute redoes prefill); goodput drops while raw GPU utilisation stays high.
- Often triggered by a traffic shift to longer prompts / longer max_tokens, a concurrency bump, or a deploy that raised --max-num-seqs or --max-model-len.
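The trigger conditions above can be checked mechanically from two /metrics scrapes taken some seconds apart. The sketch below is illustrative only: the helper names `metric_sum` and `kv_thrash_trigger` are hypothetical, the 0.95 threshold is an assumed reading of "~1.0", and it assumes one series per metric per scrape with no trailing timestamps.

```shell
# metric_sum FILE NAME: sum every sample of NAME in a Prometheus text scrape.
# Accepts the bare name or the counter's _total form, with or without labels.
metric_sum() {
  awk -v name="$2" '
    /^#/ { next }                                   # skip HELP/TYPE lines
    {
      key = $1; sub(/[{].*/, "", key)               # drop the {label="..."} part
      if (key == name || key == (name "_total")) sum += $NF
    }
    END { print sum + 0 }
  ' "$1"
}

# kv_thrash_trigger OLD NEW: OLD and NEW are scrape files from the same replica.
# Prints TRIGGER when preemptions rose, KV usage is pinned, and a queue exists.
kv_thrash_trigger() {
  local p0 p1 usage waiting
  p0=$(metric_sum "$1" vllm:num_preemptions)
  p1=$(metric_sum "$2" vllm:num_preemptions)
  usage=$(metric_sum "$2" vllm:kv_cache_usage_perc)
  waiting=$(metric_sum "$2" vllm:num_requests_waiting)
  awk -v p0="$p0" -v p1="$p1" -v u="$usage" -v w="$waiting" 'BEGIN {
    # 0.95 is an assumed threshold for "pinned near 100%"
    if (p1 > p0 && u >= 0.95 && w > 0) print "TRIGGER"; else print "no-trigger"
  }'
}
```

To use it, save two scrapes (for example `curl -s http://$EP.$NS:8000/metrics > old.prom`, wait, then scrape again into `new.prom`) and run `kv_thrash_trigger old.prom new.prom`.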
Pre-checks
- Confirm it is KV pressure, not a node fault. A throttling or ECC-degraded GPU (gpu health gating) presents as the same latency regression. Rule it out before tuning the server:

      nvidia-smi --query-gpu=name,memory.used,memory.total,clocks_event_reasons.active,temperature.gpu,ecc.errors.uncorrected.volatile.total --format=csv,noheader

  A non-0x0000000000000000 throttle reason diverts to the thermal-emergency runbook or the GPU-fault runbook.
- Confirm true OOM vs pressure. A CUDA OOM at startup or mid-serve (torch.OutOfMemoryError, killed worker) is a sizing failure (--gpu-memory-utilization set too high for the resident model + activations), not steady-state preemption. OOM crashes the replica; preemption degrades it.
- Correlate with change. Did a deploy in the burn window raise --max-num-seqs or --max-model-len, or change quantisation / --kv-cache-dtype? A deploy-correlated breach short-circuits to Rollback (SRE and MLOps practices).
- Scope it. One replica or the whole fleet? Check per-replica vllm:kv_cache_usage_perc and preemption rate.
- Note the available KV blocks logged at engine start (# GPU blocks: N); this is the KV budget you are tuning against (KV-cache management).
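As a sketch of the node-health pre-check, the filter below reads the CSV rows produced by the nvidia-smi query in the first pre-check and prints only the GPUs that look like a node fault. The function name `gpu_fault_rows` is hypothetical, and the field positions assume exactly that query's column order. An idle GPU may also report a non-zero event-reason mask, so sample while the replica is under load.

```shell
# gpu_fault_rows: stdin = rows from
#   nvidia-smi --query-gpu=name,memory.used,memory.total,clocks_event_reasons.active,\
#     temperature.gpu,ecc.errors.uncorrected.volatile.total --format=csv,noheader
# Prints suspect rows and exits 1 if any GPU has a clock event reason set or a
# non-zero uncorrected ECC count; exits 0 when every row is clean.
gpu_fault_rows() {
  awk -F', *' '
    {
      reasons = $4; ecc = $6
      throttled = (reasons !~ /^0x0+$/)                 # any event-reason bit set
      ecc_bad   = (ecc ~ /^[0-9]+$/ && ecc + 0 > 0)     # skips values like "[N/A]"
      if (throttled || ecc_bad) { print "SUSPECT: " $0; bad = 1 }
    }
    END { exit bad + 0 }
  '
}
```

A non-empty result means the replica should be diverted to the thermal-emergency or GPU-fault runbook before any KV tuning.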
Flow
flowchart TB
A["Preemptions rising, KV usage ~100%"] --> B{"Node healthy?"}
B -->|"throttle / ECC"| C["Divert: thermal or GPU-fault runbook"]
B -->|"healthy"| D{"Replica OOM-crashing?"}
D -->|"yes, CUDA OOM"| E["Lower --gpu-memory-utilization; restart"]
D -->|"no, steady preemption"| F{"Deploy in burn window?"}
F -->|"yes"| G["Rollback config / revision"]
F -->|"no"| H["Cordon + drain one replica"]
H --> I["Cap concurrency: lower --max-num-seqs / --max-model-len"]
I --> J{"Recovered?"}
J -->|"no"| K["Add KV headroom: raise gpu-mem-util, prefix cache, chunked prefill"]
J -->|"yes"| L["Verify: preemptions ~0, KV usage off ceiling"]
K --> M{"Still pressured?"}
M -->|"yes"| N["Scale replicas out / disaggregate"]
M -->|"no"| L
N --> L
Procedure
KV pressure is fixed by shrinking demand (fewer/shorter concurrent sequences) or growing supply (more KV blocks). Do the safe demand-side cap first to stop the bleed, then size supply. Restart a serving replica only after cordon + drain so in-flight requests finish elsewhere.
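The demand-versus-supply framing can be made concrete with block arithmetic. This is a rough sketch rather than vLLM's own accounting: the helper names are hypothetical, the block size of 16 tokens is an assumed default (check --block-size on the running engine), and it treats every sequence as reaching its worst-case length with no prefix sharing.

```shell
# kv_tokens_budget BLOCKS [BLOCK_SIZE]: token capacity of the KV pool (supply).
kv_tokens_budget() { echo $(( $1 * ${2:-16} )); }

# blocks_per_seq TOKENS [BLOCK_SIZE]: blocks one sequence pins (ceiling division).
blocks_per_seq() { local bs=${2:-16}; echo $(( ($1 + bs - 1) / bs )); }

# max_seqs_that_fit BLOCKS TOKENS_PER_SEQ [BLOCK_SIZE]: largest running set whose
# worst-case demand never exceeds the pool, i.e. a ceiling for --max-num-seqs.
max_seqs_that_fit() { echo $(( $1 / $(blocks_per_seq "$2" "${3:-16}") )); }
```

For example, with 20000 GPU blocks logged at engine start and sequences that can reach 4096 tokens (prompt plus max_tokens), `max_seqs_that_fit 20000 4096` gives 78. A --max-num-seqs well above that figure means the scheduler can admit more sequences than the pool can hold at full length.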
1. Read the engine metrics to confirm the diagnosis and capture a baseline (KV-cache management). On vLLM, scrape /metrics:

       curl -s "http://$EP.$NS:8000/metrics" | grep -E \
         'vllm:kv_cache_usage_perc|vllm:num_requests_waiting|vllm:num_requests_running|vllm:num_preemptions|vllm:prefix_cache_hits|vllm:prefix_cache_queries'

   kv_cache_usage_perc near 1.0 + non-zero num_requests_waiting + rising num_preemptions confirms KV-cache preemption thrash.
2. Cordon and drain one replica before mutating it. Never restart a serving pod that is taking traffic. Remove it from the router first so in-flight requests finish on healthy replicas (inference serving):

       kubectl -n $NS scale deployment/$EP --replicas=$(($(kubectl -n $NS get deploy/$EP -o jsonpath='{.spec.replicas}')+1))
       kubectl -n $NS rollout status deployment/$EP   # bring up a spare first
       # then take the target pod out of rotation and let it drain
       kubectl -n $NS label pod/<target-pod> serving=draining --overwrite
       kubectl -n $NS delete pod/<target-pod> --grace-period=60

   This keeps capacity flat while you reconfigure; do not edit a live, in-rotation replica.
3. Cap concurrency to stop preemption (demand side). Lower the running-set size so the KV budget covers it without eviction (KV-cache management):
   - Lower --max-num-seqs. Fewer concurrent sequences directly cut KV-block demand; this is the fastest lever to stop preemption.
   - Lower --max-model-len if the traffic does not need the full context. KV footprint scales with sequence length, so capping context bounds worst-case per-request KV.
   - Lower --max-num-batched-tokens to bound prefill chunk size so a long prompt cannot grab the whole KV pool at once.
4. Grow KV supply (supply side) once demand is bounded (KV-cache management):
   - Raise --gpu-memory-utilization toward (not to) the limit. vLLM sizes the KV-block pool from the GPU memory left after model weights + activations, so a higher fraction yields more KV blocks. Leave headroom for activation spikes or you trade preemption for a CUDA OOM.
   - Enable --enable-prefix-caching for prefix-heavy traffic: shared prompt prefixes reuse cached KV blocks instead of recomputing, cutting both prefill cost and KV demand. Confirm with vllm:prefix_cache_hits / vllm:prefix_cache_queries.
   - Enable --enable-chunked-prefill so long prefills interleave with decodes instead of monopolising a step and forcing decodes to preempt.
5. Provision CPU swap if recompute is the cost (KV-cache management). vLLM's default preemption recomputes evicted KV on resume; with --swap-space <GiB> it can swap KV to CPU and copy back, which can be cheaper than full recompute for long sequences. Swap-based preemption depends on the engine version (the V1 engine favours recompute), so confirm the running version honours the flag. Set per-GPU swap space deliberately; it is a tradeoff (PCIe copy vs recompute), not a free win. Validate the effect on TTFT before keeping it.
6. Out of single-replica headroom: scale or disaggregate (inference serving). If demand is capped and supply maxed but the fleet is still pressured, add decode capacity:

       kubectl -n $NS scale deployment/$EP --replicas=<N+delta>
       kubectl -n $NS rollout status deployment/$EP

   At high volume with long/variable prompts, disaggregating prefill and decode into separately-scaled pools is the structural fix; see step 4 of the inference-SLO-breach runbook. It is a planned change, not an incident lever.
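For the supply-side lever, the number of KV blocks gained by raising --gpu-memory-utilization can be estimated before restarting anything. The sketch below uses the standard per-token KV size (keys plus values, per layer, per KV head) and assumes weights and activation memory stay constant so that all added memory goes to the KV pool. The helper names are hypothetical, the block size of 16 is an assumed default, and with tensor parallelism the KV-head count should be the per-GPU share.

```shell
# kv_bytes_per_token LAYERS KV_HEADS HEAD_DIM DTYPE_BYTES
# 2x for keys and values; DTYPE_BYTES is 2 for fp16/bf16, 1 for an fp8 KV cache.
kv_bytes_per_token() { echo $(( 2 * $1 * $2 * $3 * $4 )); }

# extra_kv_blocks TOTAL_MIB FROM_PCT TO_PCT BYTES_PER_TOKEN [BLOCK_SIZE]
# Blocks gained when utilization moves from FROM_PCT to TO_PCT (integer percent).
extra_kv_blocks() {
  local extra_bytes=$(( $1 * 1024 * 1024 * ($3 - $2) / 100 ))
  echo $(( extra_bytes / ($4 * ${5:-16}) ))
}
```

For an illustrative model with 32 layers, 8 KV heads, head dimension 128 and a 2-byte KV dtype, `kv_bytes_per_token 32 8 128 2` is 131072 bytes. On an 81920 MiB GPU, moving from 85% to 90% then yields roughly `extra_kv_blocks 81920 85 90 131072` = 2048 more blocks. Compare the estimate with the # GPU blocks line after restart, since the engine's own memory profiling decides the real figure.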
Verification
- Preemptions return to ~0 and stay there. The rate of vllm:num_preemptions flattens to zero across a sustained load window.
- KV-cache usage off the ceiling. vllm:kv_cache_usage_perc sits below ~0.9 at steady state with vllm:num_requests_waiting draining to near zero, showing that the running set fits the KV budget.
- Latency back under SLO. TTFT and TPOT recover on the histograms; resumed-request TTFT spikes disappear because nothing is being evicted (the inference-SLO-breach runbook).
- No new OOM. The reconfigured replica runs a sustained load test without torch.OutOfMemoryError or worker kills; confirm in nvidia-smi that memory.used leaves headroom below memory.total.
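The first two checks can be sketched as a gate over samples collected during the load window. The input format here is an assumption made for illustration: one line per sample holding the preemption counter, the KV usage fraction, and the waiting-queue depth, separated by spaces. The function name `kv_window_ok` is hypothetical, and the 0.9 ceiling follows the figure in the list above.

```shell
# kv_window_ok [MAX_WAITING]: stdin = samples, one per line:
#   "<num_preemptions> <kv_cache_usage_perc> <num_requests_waiting>"
# PASS needs at least two samples, a flat preemption counter, peak KV usage
# below 0.9, and a final queue depth at or under MAX_WAITING (default 0).
kv_window_ok() {
  awk -v maxw="${1:-0}" '
    NR == 1 { first = $1 + 0; peak = $2 + 0 }
    {
      last = $1 + 0; w = $3 + 0
      if ($2 + 0 > peak) peak = $2 + 0
    }
    END {
      ok = (NR >= 2 && last == first && peak < 0.9 && w <= maxw)
      print (ok ? "PASS" : "FAIL")
      exit !ok
    }
  '
}
```

A counter that moves at all during the window fails the gate, which matches the "return to ~0 and stay there" criterion. Loosen MAX_WAITING if a small residual queue is acceptable for the endpoint.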
Rollback
KV pressure is fixed by config changes; rollback means reverting them one variable at a time to the recorded last-good values (SRE and MLOps practices).
- If the breach is deploy-correlated, revert the revision the way it shipped; keep a warm previous canary so the shift is instant rather than a cold redeploy.
- If a tuning lever overshot (e.g. raising --gpu-memory-utilization traded preemption for a CUDA OOM), revert that single flag to its last-good value and restart the drained replica (step 2), not the live fleet.
- Remove any temporarily added replicas and swap space once the root cause (config or traffic) is addressed, so the steady-state footprint matches the recorded baseline.
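For the deploy-correlated case, a revision revert might look like the sketch below. It assumes the endpoint shipped as a Kubernetes Deployment named as in the procedure's other commands and that the previous revision is still in the rollout history; if it shipped through Helm or a GitOps pipeline, revert there instead. The function name `rollback_deploy` and the KUBECTL override are hypothetical conveniences for a dry run.

```shell
# rollback_deploy NS DEPLOY [REVISION]: show history, revert, wait for rollout.
# Set KUBECTL="echo kubectl" to print the commands instead of running them.
rollback_deploy() {
  local ns=$1 dep=$2 rev=${3:-}
  local k=${KUBECTL:-kubectl}
  $k -n "$ns" rollout history "deployment/$dep" || return 1
  if [ -n "$rev" ]; then
    # revert to an explicit, recorded last-good revision
    $k -n "$ns" rollout undo "deployment/$dep" --to-revision="$rev" || return 1
  else
    # no revision given: revert to the immediately previous one
    $k -n "$ns" rollout undo "deployment/$dep" || return 1
  fi
  $k -n "$ns" rollout status "deployment/$dep" --timeout=10m
}
```

Running `KUBECTL="echo kubectl" rollback_deploy "$NS" "$EP" 7` first prints the three commands for review; dropping the override executes them.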
Related runbooks
- the inference-SLO-breach runbook: Inference SLO breach (the parent TTFT/TPOT runbook; this is its KV-pressure root-cause branch).
- the MFU-regression runbook: Training MFU regression (the training-side performance-regression analogue).
- the thermal-emergency runbook: Thermal / cooling emergency (when a throttling GPU masquerades as KV-pressure latency).
- the GPU-fault runbook: GPU fault, drain, reset, RMA (when a replica's GPU is degraded).
- operational runbooks: Operational runbooks index.
References
- vLLM production metrics (vllm:kv_cache_usage_perc, vllm:num_preemptions, vllm:num_requests_waiting, prefix-cache hits/queries): https://docs.vllm.ai/en/latest/usage/metrics.html
- vLLM engine arguments (--gpu-memory-utilization, --max-num-seqs, --max-model-len, --max-num-batched-tokens, --swap-space, --kv-cache-dtype, --block-size): https://docs.vllm.ai/en/latest/configuration/engine_args/
- vLLM optimization and tuning (preemption, recompute vs swap, prefix caching, chunked prefill): https://docs.vllm.ai/en/stable/configuration/optimization/
- vLLM automatic prefix caching: https://docs.vllm.ai/en/latest/features/automatic_prefix_caching.html
- Kubernetes Deployments (scale, rollout undo, drain): https://kubernetes.io/docs/concepts/workloads/controllers/deployment/
- nvidia-smi (clocks_event_reasons, ECC, memory query): https://docs.nvidia.com/deploy/nvidia-smi/index.html
Related: KV-Cache Management · Inference Serving · Inference SLO Breach · GPU Health Gating · Thermal Emergency · GPU Fault / RMA · Operational Runbooks · Glossary