Runbook: inference KV-cache OOM / preemption thrash

Scope: stabilize an LLM server thrashing on KV-cache pressure (preemptions/recompute, requests evicted, latency spikes) by sizing KV memory, capping concurrency/context, and tuning the scheduler.

Run this when an LLM endpoint is thrashing on KV-cache: num_preemptions climbing, requests evicted and recomputed, KV-cache usage pinned near 100%, and TTFT/TPOT spiking under load. Severity: user-facing degradation, page-worthy when it pushes the SLO. Stabilise concurrency first, then size KV memory correctly so the engine stops evicting.

The commands and flags below are reference templates built on real APIs; pin versions and validate them before production use.

The KV-cache holds each request's attention keys and values (prompt plus generated tokens) in GPU memory. When the running set needs more KV blocks than the cache holds, the scheduler preempts requests, either swapping their KV to CPU or recomputing it on resume; that recompute/swap churn is the thrash. This is the KV-pressure branch of the inference-SLO-breach runbook, taken to its hardware root cause. KV-cache mechanics (paged blocks, swap vs recompute, prefix reuse) are in KV-cache management; the prefill/decode two-phase model is in inference serving.

Metric names below are vLLM V1 (vllm:kv_cache_usage_perc, vllm:num_preemptions); legacy V0 used vllm:gpu_cache_usage_perc and vllm:num_preemptions_total, so confirm against the running engine version.

Trigger

Preemptions climbing. The rate of vllm:num_preemptions is non-zero and rising; the engine repeatedly logs "Sequence group ... is preempted by ... mode".
KV-cache pinned. vllm:kv_cache_usage_perc sits at ~1.0 (100%) while vllm:num_requests_waiting stays non-zero, so the running batch cannot grow.
Recompute/swap churn. TTFT spikes on resumed requests (recompute redoes prefill); goodput drops while raw GPU utilisation stays high.
Often triggered by a traffic shift to longer prompts / longer max_tokens, a concurrency bump, or a deploy that raised --max-num-seqs or --max-model-len.

Pre-checks

Confirm it is KV pressure, not a node fault. A throttling or ECC-degraded GPU (GPU health gating) presents as the same latency regression. Rule it out before tuning the server:
```
nvidia-smi --query-gpu=name,memory.used,memory.total,clocks_event_reasons.active,\
temperature.gpu,ecc.errors.uncorrected.volatile.total --format=csv,noheader
```
A throttle-reason bitmask other than 0x0000000000000000 (beyond benign bits such as GPU idle) diverts to the thermal-emergency runbook; a non-zero uncorrected ECC count diverts to the GPU-fault runbook.
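The bitmask check above can be sketched in Python. This is a minimal illustration, not an official decoder: the bit values follow NVML's clock event reason constants and should be verified against the installed driver, and the `BENIGN` set is a hypothetical local policy, not something the source defines.

```python
# Bit values follow NVML's nvmlClocksEventReason* constants; verify against
# the installed driver's nvml.h before relying on them.
CLOCK_EVENT_REASONS = {
    0x01: "gpu_idle",
    0x02: "applications_clocks_setting",
    0x04: "sw_power_cap",
    0x08: "hw_slowdown",
    0x10: "sync_boost",
    0x20: "sw_thermal_slowdown",
    0x40: "hw_thermal_slowdown",
    0x80: "hw_power_brake_slowdown",
    0x100: "display_clock_setting",
}

# Assumption: which reasons count as benign is a local policy choice.
BENIGN = {"gpu_idle", "applications_clocks_setting"}


def decode_reasons(mask_hex: str) -> list:
    """Turn the hex mask printed by nvidia-smi into a list of reason names."""
    mask = int(mask_hex, 16)  # int() accepts the 0x prefix with base 16
    return [name for bit, name in CLOCK_EVENT_REASONS.items() if mask & bit]


def should_divert(mask_hex: str, uncorrected_ecc: str) -> bool:
    """True when the node looks unhealthy and tuning the server is premature."""
    reasons = set(decode_reasons(mask_hex))
    # GPUs without ECC report a non-numeric value such as "[N/A]"; treat as 0.
    ecc_field = uncorrected_ecc.strip()
    ecc = int(ecc_field) if ecc_field.isdigit() else 0
    return bool(reasons - BENIGN) or ecc > 0
```

Feed it the `clocks_event_reasons.active` and `ecc.errors.uncorrected.volatile.total` columns from the CSV output; a `True` result means leave this runbook and follow the node-health one.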
Confirm true OOM vs pressure. A CUDA OOM at startup or mid-serve (torch.OutOfMemoryError, killed worker) is a sizing failure (--gpu-memory-utilization set too high for the resident model + activations), not steady-state preemption. OOM crashes the replica; preemption degrades it.
Correlate with change. Did a deploy in the burn window raise --max-num-seqs, --max-model-len, or change quantisation / --kv-cache-dtype? A deploy-correlated breach short-circuits to Rollback (SRE and MLOps practices).
Scope it. One replica or the whole fleet? Check per-replica vllm:kv_cache_usage_perc and preemption rate.
Note the available KV blocks logged at engine start (# GPU blocks: N); this is the KV budget you are tuning against (KV-cache management).
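To turn the logged block count into a budget you can reason about, a rough sketch follows. It assumes a block size of 16 tokens (check `--block-size` on the running engine); the block count of 27000 in the usage note is a hypothetical figure, not a measured one.

```python
def kv_token_budget(num_gpu_blocks: int, block_size: int = 16) -> int:
    """Total tokens of KV the cache can hold across all running sequences."""
    return num_gpu_blocks * block_size


def max_full_length_seqs(num_gpu_blocks: int, max_model_len: int,
                         block_size: int = 16) -> int:
    """How many sequences fit if every one grows to max_model_len (worst case)."""
    blocks_per_seq = -(-max_model_len // block_size)  # ceiling division
    return num_gpu_blocks // blocks_per_seq
```

With a hypothetical 27000 blocks, `kv_token_budget(27000)` is 432000 tokens, and `max_full_length_seqs(27000, 8192)` is 52. If --max-num-seqs is far above that number and traffic really uses long contexts, preemption is the expected outcome rather than a surprise.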

Flow

```mermaid
flowchart TB
    A["Preemptions rising, KV usage ~100%"] --> B{"Node healthy?"}
    B -->|"throttle / ECC"| C["Divert: thermal or GPU-fault runbook"]
    B -->|"healthy"| D{"Replica OOM-crashing?"}
    D -->|"yes, CUDA OOM"| E["Lower --gpu-memory-utilization; restart"]
    D -->|"no, steady preemption"| F{"Deploy in burn window?"}
    F -->|"yes"| G["Rollback config / revision"]
    F -->|"no"| H["Cordon + drain one replica"]
    H --> I["Cap concurrency: lower --max-num-seqs / --max-model-len"]
    I --> J{"Recovered?"}
    J -->|"no"| K["Add KV headroom: raise gpu-mem-util, prefix cache, chunked prefill"]
    J -->|"yes"| L["Verify: preemptions ~0, KV usage off ceiling"]
    K --> M{"Still pressured?"}
    M -->|"yes"| N["Scale replicas out / disaggregate"]
    M -->|"no"| L
    N --> L
```
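For on-call tooling, the same decision flow can be written as a small function. This is only a sketch of the branches in the flowchart; the function name, parameter names, and returned strings are illustrative and not part of any existing tool.

```python
from typing import Optional


def next_action(node_healthy: bool,
                oom_crashing: bool,
                deploy_in_burn_window: bool,
                recovered_after_cap: Optional[bool] = None,
                still_pressured_after_supply: Optional[bool] = None) -> str:
    """Return the next runbook step for the observations so far.

    None means "not yet observed", in which case the step to try is returned.
    """
    if not node_healthy:
        return "divert: thermal or GPU-fault runbook"
    if oom_crashing:
        return "lower --gpu-memory-utilization; restart"
    if deploy_in_burn_window:
        return "rollback config / revision"
    if recovered_after_cap is None:
        return "cordon + drain one replica, then cap concurrency"
    if recovered_after_cap:
        return "verify"
    if still_pressured_after_supply is None:
        return "add KV headroom"
    if still_pressured_after_supply:
        return "scale replicas out / disaggregate, then verify"
    return "verify"
```

The ordering matters: node health and OOM are checked before any tuning, so a degraded GPU or a crashing replica never reaches the concurrency levers.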

Procedure

KV pressure is fixed by shrinking demand (fewer/shorter concurrent sequences) or growing supply (more KV blocks). Do the safe demand-side cap first to stop the bleed, then size supply. Restart a serving replica only after cordon + drain so in-flight requests finish elsewhere.

```
EP=inference-llm                 # deployment / service
NS=serving
```

Read the engine metrics to confirm the diagnosis and capture a baseline (KV-cache management). On vLLM, scrape /metrics:
```
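# The Service address load-balances across replicas; for one replica, run
# kubectl -n $NS port-forward pod/<target-pod> 8000:8000 and curl localhost:8000 instead.
curl -sf "http://$EP.$NS:8000/metrics" | grep -E \
  'vllm:kv_cache_usage_perc|vllm:num_requests_waiting|vllm:num_requests_running|vllm:num_preemptions|vllm:prefix_cache_hits|vllm:prefix_cache_queries'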
```
kv_cache_usage_perc near 1.0 + non-zero num_requests_waiting + rising num_preemptions confirms KV-cache preemption thrash.
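The three-signal confirmation can be automated from two scrapes of /metrics taken some seconds apart. The sketch below assumes the standard Prometheus text exposition format and one engine per endpoint (values are summed per metric name); the 0.95 ceiling is an illustrative threshold, and both counter spellings are accepted because the exposed name may carry a `_total` suffix.

```python
PREEMPTION_NAMES = ("vllm:num_preemptions", "vllm:num_preemptions_total")


def parse_metrics(text: str) -> dict:
    """Sum sample values per metric name from Prometheus text exposition."""
    totals = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and HELP/TYPE comments
        series, _, value = line.rpartition(" ")
        name = series.split("{", 1)[0]  # drop the label set
        try:
            totals[name] = totals.get(name, 0.0) + float(value)
        except ValueError:
            continue  # ignore lines that do not end in a number
    return totals


def preemptions(metrics: dict) -> float:
    """Cumulative preemption count under whichever name the engine exposes."""
    return sum(metrics.get(name, 0.0) for name in PREEMPTION_NAMES)


def looks_like_kv_thrash(prev: dict, curr: dict,
                         usage_ceiling: float = 0.95) -> bool:
    """KV pinned near the ceiling, a queue waiting, and preemptions rising."""
    return (curr.get("vllm:kv_cache_usage_perc", 0.0) >= usage_ceiling
            and curr.get("vllm:num_requests_waiting", 0.0) > 0
            and preemptions(curr) > preemptions(prev))
```

Call `parse_metrics` on each scrape body and pass the earlier and later results to `looks_like_kv_thrash`; record the later snapshot as the baseline for verification.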

Cordon and drain one replica before mutating it. Never restart a serving pod that is taking traffic. Remove it from the router first so in-flight requests finish on healthy replicas (inference serving):

```
# Bring up a spare first so capacity stays flat.
CUR=$(kubectl -n $NS get deploy/$EP -o jsonpath='{.spec.replicas}')
kubectl -n $NS scale deployment/$EP --replicas=$((CUR + 1))
kubectl -n $NS rollout status deployment/$EP
# Then take the target pod out of rotation and let it drain.
# Assumes the Service/router selector excludes pods labelled serving=draining.
kubectl -n $NS label pod/<target-pod> serving=draining --overwrite
kubectl -n $NS delete pod/<target-pod> --grace-period=60   # 60 s for in-flight requests to finish
```

This keeps capacity flat while you reconfigure; do not edit a live, in-rotation replica.

Cap concurrency to stop preemption (demand side). Lower the running-set size so the KV budget covers it without eviction (KV-cache management):
Lower --max-num-seqs. Fewer concurrent sequences directly cut KV-block demand; this is the fastest lever to stop preemption.
Lower --max-model-len if the traffic does not need the full context. KV footprint scales with sequence length, so capping context bounds worst-case per-request KV.
Lower --max-num-batched-tokens to bound prefill chunk size so a long prompt cannot grab the whole KV pool at once.
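To see why these two caps bound demand, a back-of-the-envelope estimate helps. The sketch assumes standard multi-head or grouped-query attention, where each token stores one key and one value vector per layer per KV head; it does not cover architectures with compressed or sliding-window KV. The model dimensions in the usage note are hypothetical.

```python
def kv_bytes_per_token(num_layers: int, num_kv_heads: int, head_dim: int,
                       dtype_bytes: int = 2) -> int:
    """Bytes of KV per token: keys and values (x2) across all layers and KV heads."""
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes


def worst_case_kv_gib(max_num_seqs: int, max_model_len: int,
                      bytes_per_token: int) -> float:
    """KV demand in GiB if every running sequence reaches max_model_len."""
    return max_num_seqs * max_model_len * bytes_per_token / 2**30
```

For a hypothetical model with 32 layers, 8 KV heads, head dimension 128, and 2-byte KV entries, `kv_bytes_per_token(32, 8, 128)` is 131072 bytes (128 KiB). At --max-num-seqs 256 and --max-model-len 8192 the worst case is 256 GiB; halving either flag halves it. Real traffic rarely hits the worst case, so treat this as an upper bound for choosing caps, not a prediction.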
Grow KV supply (supply side) once demand is bounded (KV-cache management):
Raise --gpu-memory-utilization toward (not to) the limit. vLLM sizes the KV-block pool from the GPU memory left after model weights + activations, so a higher fraction yields more KV blocks. Leave headroom for activation spikes or you trade preemption for a CUDA OOM.
Enable --enable-prefix-caching for prefix-heavy traffic: shared prompt prefixes reuse cached KV blocks instead of recomputing, cutting both prefill cost and KV demand. Confirm with vllm:prefix_cache_hits / vllm:prefix_cache_queries.
Enable --enable-chunked-prefill so long prefills interleave with decodes instead of monopolising a step and forcing decodes to preempt.
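The effect of --gpu-memory-utilization on the KV pool can be approximated as below. This is a simplified model of the sizing described above, not the engine's actual profiling logic: the weights and activation-peak figures are inputs you must measure, and the 16-token block size and 131072 bytes per token in the usage note are hypothetical.

```python
def estimate_kv_blocks(total_gib: float, gpu_memory_utilization: float,
                       weights_gib: float, activation_peak_gib: float,
                       bytes_per_token: int, block_size: int = 16) -> int:
    """Approximate KV blocks left after weights and peak activations."""
    budget_gib = (total_gib * gpu_memory_utilization
                  - weights_gib - activation_peak_gib)
    if budget_gib <= 0:
        return 0  # nothing left for KV: the engine would fail to start
    block_bytes = bytes_per_token * block_size
    return int(budget_gib * 2**30 // block_bytes)
```

On a hypothetical 80 GiB GPU with 16 GiB of weights and a 4 GiB activation peak, utilization 0.90 gives about 26624 blocks and 0.95 gives about 28672, roughly 8% more KV. The memory outside the fraction is the only cushion against activation spikes, which is why the flag is raised toward the limit rather than to it.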
Provision CPU swap if recompute is the cost (KV-cache management). vLLM's default preemption recomputes evicted KV on resume; where the engine supports swap-based preemption, --swap-space <GiB> lets it swap KV to CPU and copy it back, which can be cheaper than full recompute for long sequences. Swap support differs between engine versions (V1 defaults to recompute), so confirm it against the running engine. Set per-GPU swap space deliberately; it is a tradeoff (PCIe copy vs recompute), not a free win. Validate the effect on TTFT before keeping it.
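The PCIe-copy versus recompute tradeoff can be estimated before touching the flag. The sketch below is a first-order model only: it ignores scheduling overhead and contention, and the bandwidth, prefill throughput, and bytes-per-token figures in the usage note are hypothetical placeholders to replace with measured values.

```python
def swap_round_trip_seconds(seq_tokens: int, bytes_per_token: int,
                            pcie_gib_per_s: float) -> float:
    """Time to copy a sequence's KV out to CPU and back in over PCIe."""
    one_way = seq_tokens * bytes_per_token / (pcie_gib_per_s * 2**30)
    return 2 * one_way


def recompute_seconds(seq_tokens: int, prefill_tokens_per_s: float) -> float:
    """Time to rebuild the KV by redoing prefill over the whole sequence."""
    return seq_tokens / prefill_tokens_per_s


def swap_is_cheaper(seq_tokens: int, bytes_per_token: int,
                    pcie_gib_per_s: float, prefill_tokens_per_s: float) -> bool:
    return (swap_round_trip_seconds(seq_tokens, bytes_per_token, pcie_gib_per_s)
            < recompute_seconds(seq_tokens, prefill_tokens_per_s))
```

With 8192 tokens at 131072 bytes each (1 GiB of KV), a hypothetical 20 GiB/s effective PCIe rate, and 10000 prefill tokens/s, the round trip is about 0.1 s against roughly 0.8 s of recompute. The balance shifts with small KV per token or fast prefill, so measure TTFT on resumed requests rather than trusting the estimate.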
Out of single-replica headroom: scale or disaggregate (inference serving). If demand is capped and supply maxed but the fleet is still pressured, add decode capacity:
```
kubectl -n $NS scale deployment/$EP --replicas=<N+delta>
kubectl -n $NS rollout status deployment/$EP
```
At high volume with long/variable prompts, disaggregating prefill and decode into separately-scaled pools is the structural fix; see step 4 of the inference-SLO-breach runbook. It is a planned change, not an incident lever.

Verification

Preemptions return to ~0 and stay there. The rate of vllm:num_preemptions flattens to zero across a sustained load window:
```
rate(vllm:num_preemptions[5m])     # target ~0; use the series name /metrics exposes (V0 engine: vllm:num_preemptions_total)
```
KV-cache usage off the ceiling. vllm:kv_cache_usage_perc sits below ~0.9 at steady state with vllm:num_requests_waiting draining to near zero, proving the running set fits the KV budget.
Latency back under SLO. TTFT and TPOT recover on the histograms; resumed-request TTFT spikes disappear because nothing is being evicted (the inference-SLO-breach runbook).
No new OOM. The reconfigured replica runs a sustained load test without torch.OutOfMemoryError or worker kills; confirm nvidia-smi memory is not at the edge.
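If Prometheus is not to hand, the first two checks can be approximated from periodic scrapes collected during the load window. This is a sketch under assumptions: the `Sample` type and its field names are hypothetical, the 0.9 usage limit comes from the text above, and the zero-waiting and zero-rate defaults are deliberately strict choices to loosen per SLO.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    ts: float           # scrape time, unix seconds
    preemptions: float  # cumulative preemption counter
    kv_usage: float     # KV-cache usage, 0.0 to 1.0
    waiting: float      # requests waiting in the queue


def preemption_rate(samples: list) -> float:
    """Per-second counter increase; a drop is treated as a counter reset."""
    increase = 0.0
    for a, b in zip(samples, samples[1:]):
        if b.preemptions >= a.preemptions:
            increase += b.preemptions - a.preemptions
        else:
            increase += b.preemptions  # replica restarted, counter began at 0
    elapsed = samples[-1].ts - samples[0].ts
    return increase / elapsed if elapsed > 0 else 0.0


def recovered(samples: list, usage_limit: float = 0.9,
              max_waiting: float = 0.0, max_rate: float = 0.0) -> bool:
    """Preemptions flat, KV usage off the ceiling, queue drained."""
    return (preemption_rate(samples) <= max_rate
            and max(s.kv_usage for s in samples) < usage_limit
            and samples[-1].waiting <= max_waiting)
```

Collect samples across a sustained load window, not an idle one; a quiet replica passes every check trivially.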

Rollback

KV pressure is fixed by config changes; rollback means reverting them one variable at a time to the recorded last-good values (SRE and MLOps practices).

If the breach is deploy-correlated, revert the revision the way it shipped; keep a warm previous canary so the shift is instant rather than a cold redeploy:

```
kubectl -n $NS rollout undo deployment/$EP        # or: git revert <sha> && argocd app sync $EP
kubectl -n $NS rollout status deployment/$EP
```

If a tuning lever overshot (e.g. raising --gpu-memory-utilization traded preemption for a CUDA OOM), revert that single flag to its last-good value and restart the drained replica (the cordon-and-drain step of the Procedure, step 2), not the live fleet.
Restore any temporarily-added replicas / swap space once the root cause (config or traffic) is addressed, so the steady-state footprint matches the recorded baseline.

the inference-SLO-breach runbook: Inference SLO breach (the parent TTFT/TPOT runbook; this is its KV-pressure root-cause branch).
the MFU-regression runbook: Training MFU regression (the training-side performance-regression analogue).
the thermal-emergency runbook: Thermal / cooling emergency (when a throttling GPU masquerades as KV-pressure latency).
the GPU-fault runbook: GPU fault, drain, reset, RMA (when a replica's GPU is degraded).
operational runbooks: Operational runbooks index.

References

vLLM production metrics (vllm:kv_cache_usage_perc, vllm:num_preemptions, vllm:num_requests_waiting, prefix-cache hits/queries): https://docs.vllm.ai/en/latest/usage/metrics.html
vLLM engine arguments (--gpu-memory-utilization, --max-num-seqs, --max-model-len, --max-num-batched-tokens, --swap-space, --kv-cache-dtype, --block-size): https://docs.vllm.ai/en/latest/configuration/engine_args/
vLLM optimization and tuning (preemption, recompute vs swap, prefix caching, chunked prefill): https://docs.vllm.ai/en/stable/configuration/optimization/
vLLM automatic prefix caching: https://docs.vllm.ai/en/latest/features/automatic_prefix_caching.html
Kubernetes Deployments (scale, rollout undo, drain): https://kubernetes.io/docs/concepts/workloads/controllers/deployment/
nvidia-smi (clocks_event_reasons, ECC, memory query): https://docs.nvidia.com/deploy/nvidia-smi/index.html