Runbook: inference SLO breach
Scope: diagnose and remediate an inference SLO breach (TTFT/TPOT burn-rate) without taking the service down.
Run this when a serving endpoint misses its latency SLO: TTFT or TPOT over target, or a multi-window burn-rate alert firing on the inference error budget (the SLO/SLI catalog). Severity: user-facing, page-worthy; act on the fast burn before the budget is gone. Stabilise first (shed load / scale), diagnose the phase, then fix the cause.
The commands below are reference templates against real APIs; pin versions and validate them before production use.
This is the longform procedure for RB-6 in operational runbooks. The two-phase model (prefill/decode), the engines, and the optimisations referenced below are described in inference serving; the SLIs/SLOs and the burn-rate alert that pages you are defined in the SLO/SLI catalog; disaggregation as a structural lever is covered in disaggregated inference.
Trigger
- TTFT over SLO: e.g. the time_to_first_token p99 crosses the target (reference 99% ≤ 500 ms, the SLO/SLI catalog).
- TPOT/ITL over SLO: per-output-token latency (inter-token latency) crosses target (reference 99% ≤ 50 ms).
- A multi-window multi-burn-rate alert on the inference error budget: fast burn (e.g. 14.4× over 1h/5m) pages, slow burn tickets.
- Often correlated with a deploy (model/engine/config change) or a traffic shift (longer prompts, a burst).
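The multi-window condition in the burn-rate trigger can be sketched as follows. This is a minimal illustration, not the alerting rule itself: the function names and the request counts are hypothetical, and the 99% objective and 14.4× threshold are the reference values quoted above. Burn rate is taken as the observed bad-request ratio divided by the allowed ratio (1 − SLO), following the Google SRE Workbook; 14.4× sustained for one hour consumes 2% of a 30-day budget.

```python
def burn_rate(bad: int, total: int, slo: float) -> float:
    """Observed bad-event ratio divided by the allowed ratio (1 - slo)."""
    if total == 0:
        return 0.0
    return (bad / total) / (1.0 - slo)


def fast_burn_pages(long_window, short_window, slo=0.99, threshold=14.4) -> bool:
    """Page only when BOTH windows exceed the threshold.

    Each window is a (bad_requests, total_requests) pair; the long window
    stands for 1h and the short window for 5m in the reference alert.
    """
    return (burn_rate(*long_window, slo) > threshold
            and burn_rate(*short_window, slo) > threshold)


# Hypothetical counts: (requests over the latency target, total requests)
one_hour = (1800, 10000)   # 18% of requests missed the target
five_min = (160, 800)      # 20% of requests missed the target
recovered_five_min = (5, 800)

print(fast_burn_pages(one_hour, five_min))            # both windows burning
print(fast_burn_pages(one_hour, recovered_five_min))  # short window has recovered
```

Requiring the short window as well is what stops the alert from paging on a breach that has already ended but still dominates the one-hour average.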
Pre-checks
- Confirm it is real, not an artefact: the SLI is on a histogram/quantile, not an average (the SLO/SLI catalog); the alert fired on both windows; the dashboard agrees with synthetic probes. A single blip on one window is not this runbook.
- Scope it: one model/endpoint or fleet-wide? One replica or all? Check per-replica latency and error rate (telemetry and monitoring).
- Correlate with change: did a deploy land in the burn window (new model revision, engine version, --max-num-seqs / --max-model-len, quantisation)? Check the rollout history (SRE and MLOps practices). A deploy-correlated breach short-circuits to Rollback.
- Rule out the node underneath before tuning the server (see step 5): a throttling or ECC-degraded GPU (reliability and RAS) presents as a latency regression with no serving-config cause.
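The first pre-check (quantile, not average) is easier to see with numbers. The sketch below estimates a quantile from cumulative histogram buckets by linear interpolation, in the style of Prometheus `le` buckets; the bucket bounds, counts, and latency sum are made-up illustration data, and `quantile_from_buckets` is a hypothetical helper rather than any engine's API.

```python
import math


def quantile_from_buckets(q, buckets):
    """Estimate a quantile from cumulative (upper_bound_s, count) buckets.

    Buckets must be sorted by upper bound. The estimate interpolates linearly
    inside the bucket that contains the target rank.
    """
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                return prev_bound  # cannot interpolate into +Inf; report last finite bound
            in_bucket = count - prev_count
            if in_bucket == 0:
                return bound
            return prev_bound + (bound - prev_bound) * (rank - prev_count) / in_bucket
        prev_bound, prev_count = bound, count
    return prev_bound


# Hypothetical TTFT histogram: 90% of requests are fast, with a long tail.
ttft_buckets = [(0.1, 900), (0.25, 950), (0.5, 970), (1.0, 985), (2.0, 1000), (math.inf, 1000)]
ttft_sum_s = 95.0  # hypothetical sum of all observed TTFTs

mean = ttft_sum_s / ttft_buckets[-1][1]
p50 = quantile_from_buckets(0.50, ttft_buckets)
p99 = quantile_from_buckets(0.99, ttft_buckets)
print(f"mean={mean:.3f}s p50={p50:.3f}s p99={p99:.3f}s")
```

With this data the mean sits well under a 500 ms target while the p99 is well over it, which is why an SLI built on an average can hide the breach entirely.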
Flow
stateDiagram-v2
[*] --> Triage
Triage --> Deploy: change in burn window
Triage --> Prefill: TTFT over SLO
Triage --> Decode: TPOT over SLO
Deploy --> Rollback: revert revision
Prefill --> Levers: reduce batch, prefix cache
Decode --> Levers: scale replicas, KV headroom
Levers --> Hardware: still breached
Hardware --> Verify: GPU healthy
Levers --> Verify: SLO recovered
Rollback --> Verify: SLO recovered
Verify --> [*]: TTFT and TPOT under SLO
Procedure
Split the symptom first: TTFT is prefill-bound, TPOT is decode-bound (inference serving), then pull the matching lever. EP is the endpoint/service and NS is its namespace; metric names below are vLLM's (vllm:*); verify exact names against the engine's metrics docs.
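The triage split can be written down as a small decision function. This is a sketch of the flow described in this runbook, not tooling that ships with any engine: `Snapshot`, `triage`, and the branch labels are hypothetical names, and the default targets are the reference values from the Trigger section.

```python
from dataclasses import dataclass


@dataclass
class Snapshot:
    ttft_p99_s: float            # p99 time to first token, seconds
    tpot_p99_s: float            # p99 time per output token, seconds
    deploy_in_burn_window: bool  # did a rollout land inside the burn window?


def triage(s: Snapshot, ttft_target_s: float = 0.5, tpot_target_s: float = 0.05) -> list:
    """Return the branches to work, in order; an empty list means no breach."""
    if s.deploy_in_burn_window:
        # A deploy-correlated breach short-circuits to Rollback.
        return ["rollback"]
    branches = []
    if s.ttft_p99_s > ttft_target_s:
        branches.append("prefill")   # TTFT is prefill-bound
    if s.tpot_p99_s > tpot_target_s:
        branches.append("decode")    # TPOT is decode-bound
    return branches


print(triage(Snapshot(ttft_p99_s=1.3, tpot_p99_s=0.03, deploy_in_burn_window=False)))
print(triage(Snapshot(ttft_p99_s=1.3, tpot_p99_s=0.09, deploy_in_burn_window=True)))
```

Both branches can be returned at once; KV-cache preemption thrash typically breaches TTFT and TPOT together, in which case the metric read in step 1 decides which lever to pull first.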
1. Read the engine metrics to localise the bottleneck (telemetry and monitoring). On vLLM, scrape /metrics:

       # KV-cache pressure, queue depth, preemptions, running batch
       curl -s http://$EP.$NS:8000/metrics | grep -E \
         'vllm:gpu_cache_usage_perc|vllm:num_requests_waiting|vllm:num_requests_running|vllm:num_preemptions_total'

   High gpu_cache_usage_perc (>~90%) + non-zero num_requests_waiting + rising num_preemptions_total is KV-cache pressure causing preemption thrash, the classic cause of both TTFT and TPOT spikes (inference serving). (The preemption metric name is version-dependent; confirm in the metrics docs.)
2. TTFT over SLO (prefill-bound): a long prefill head-of-line blocks decodes and spikes TTFT (inference serving, disaggregated inference):
   - Reduce the running batch so prefill chunks land sooner: lower --max-num-seqs (and --max-num-batched-tokens); ensure chunked prefill is enabled so long prompts interleave with decodes.
   - Enable prefix caching (vLLM --enable-prefix-caching; SGLang RadixAttention) for prefix-heavy traffic: it skips recompute of shared prompt prefixes.
3. TPOT over SLO (decode-bound): decode is memory-bandwidth-bound and gated by KV headroom and per-GPU contention (inference serving):
   - Scale replicas out (HPA / KServe) on queue depth or the SLI, adding decode capacity.
   - Give KV cache headroom: raise --gpu-memory-utilization toward (not to) the limit, or lower --max-model-len / --max-num-seqs to cut KV footprint and stop preemption (reliability and RAS). Watch for the cold-start trap if scaling from zero: a multi-minute model load misses the SLO on the first request.
4. At scale, disaggregate (disaggregated inference): if prefill and decode are fighting over one pool and the traffic is high-volume with long/variable prompts, split them into separate, independently scaled pools with KV transfer over NIXL/RDMA, orchestrated by Dynamo. This is a structural change, not an incident lever; reach for it when goodput/SLO demands it, and only with GDR proven (NET/IB/.../GDRDMA in the NCCL log, performance tuning).
5. Rule out the GPU underneath (reliability and RAS): confirm no thermal throttle or ECC degradation is masquerading as a serving regression:

       nvidia-smi --query-gpu=name,clocks_throttle_reasons.active,temperature.gpu,ecc.errors.uncorrected.volatile.total --format=csv,noheader
       # newer driver branches expose this as clocks_event_reasons.active (throttle name deprecated)

   A non-0x0000000000000000 throttle reason (thermal/power/HW slowdown) means the cause is the node, not the engine: divert to the thermal path (the thermal-emergency runbook) or the GPU-fault path (the GPU-fault runbook).
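The KV-cache pressure reading from step 1 can be scripted against two scrapes of /metrics taken a short interval apart. The sketch below is a minimal parser for the Prometheus text format, not an official client: `parse_metrics` and `kv_pressure` are hypothetical helpers, the sample payloads are invented, and the metric names are the vLLM names quoted in step 1, which should be confirmed against the running engine version. It assumes gpu_cache_usage_perc is reported as a fraction between 0 and 1, and that label values contain no `}` character.

```python
import re

# name, optional {labels}, whitespace, value
SAMPLE = re.compile(r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{[^}]*\})?\s+(?P<value>\S+)')


def parse_metrics(text: str) -> dict:
    """Collect sample values per metric name from Prometheus text exposition."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and HELP/TYPE comments
        m = SAMPLE.match(line)
        if not m:
            continue
        try:
            value = float(m.group("value"))
        except ValueError:
            continue
        values.setdefault(m.group("name"), []).append(value)
    return values


def kv_pressure(before: dict, after: dict, cache_threshold: float = 0.90) -> bool:
    """True when cache usage is high, requests are queued, and preemptions rose."""
    cache = max(after.get("vllm:gpu_cache_usage_perc", [0.0]))
    waiting = sum(after.get("vllm:num_requests_waiting", [0.0]))
    preempt_delta = (sum(after.get("vllm:num_preemptions_total", [0.0]))
                     - sum(before.get("vllm:num_preemptions_total", [0.0])))
    return cache > cache_threshold and waiting > 0 and preempt_delta > 0


# Invented payloads standing in for two scrapes of the same replica.
scrape_1 = '''
# TYPE vllm:gpu_cache_usage_perc gauge
vllm:gpu_cache_usage_perc{model_name="m"} 0.93
vllm:num_requests_waiting{model_name="m"} 4
vllm:num_preemptions_total{model_name="m"} 120
'''
scrape_2 = '''
vllm:gpu_cache_usage_perc{model_name="m"} 0.97
vllm:num_requests_waiting{model_name="m"} 9
vllm:num_preemptions_total{model_name="m"} 141
'''

print(kv_pressure(parse_metrics(scrape_1), parse_metrics(scrape_2)))
```

Comparing the preemption counter across two scrapes matters because it is cumulative: a large absolute value says only that preemptions happened at some point since the process started.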
Verification
- TTFT and TPOT back under SLO on the histograms/quantiles, confirmed across both alert windows so the burn rate falls (the SLO/SLI catalog).
- Goodput restored: tokens served within SLO recovers to baseline (inference serving); num_preemptions_total rate returns to ~0 and num_requests_waiting drains.
- The burn-rate alert clears and the error budget stops bleeding (the SLO/SLI catalog); synthetic probes confirm from the client side.
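The goodput check can be approximated from per-request records. The sketch below is one possible definition, assumed for illustration: goodput is the share of output tokens that came from requests meeting both latency targets. The record fields (`ttft_s`, `tpot_s`, `output_tokens`) and the sample values are hypothetical, and applying the targets per request is a simplification of the p99-based SLO.

```python
def goodput_fraction(requests, ttft_target_s=0.5, tpot_target_s=0.05) -> float:
    """Share of output tokens served by requests that met both latency targets."""
    total = sum(r["output_tokens"] for r in requests)
    if total == 0:
        return 0.0
    good = sum(
        r["output_tokens"]
        for r in requests
        if r["ttft_s"] <= ttft_target_s and r["tpot_s"] <= tpot_target_s
    )
    return good / total


# Hypothetical request log: the third request misses the TTFT target.
log = [
    {"ttft_s": 0.21, "tpot_s": 0.031, "output_tokens": 100},
    {"ttft_s": 0.35, "tpot_s": 0.042, "output_tokens": 200},
    {"ttft_s": 1.40, "tpot_s": 0.038, "output_tokens": 100},
]
print(goodput_fraction(log))
```

Compare the result against the value recorded before the incident rather than against a fixed number; the runbook's criterion is recovery to baseline.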
Rollback
If the breach is deploy-correlated, revert the model, engine, or config change via GitOps, one variable at a time, the same way it shipped (SRE and MLOps practices):
# revert the serving manifest to the last-good revision
kubectl -n $NS rollout undo deployment/$EP # or: git revert <sha> && argocd app sync $EP
kubectl -n $NS rollout status deployment/$EP
Keep a warm previous canary so rollback is instant: hold the last-good revision running at low replica count behind the router, and shift traffic back to it rather than waiting for a cold redeploy (SRE and MLOps practices, inference serving). If config levers (batch size, KV headroom) were the change, revert them to the recorded last-good values.
Related runbooks
- the MFU-regression runbook: Training MFU regression (the training-side analogue: a performance regression hunt).
- the thermal-emergency runbook: Thermal / cooling emergency (when the cause is a throttling GPU under the endpoint).
- the GPU-fault runbook: GPU fault, drain, reset, RMA (when a replica's GPU is degraded).
- operational runbooks: Operational runbooks index (RB-6).
References
- vLLM production metrics (TTFT/TPOT, cache usage, preemptions, queue): https://docs.vllm.ai/en/latest/usage/metrics.html
- vLLM disaggregated prefilling (--kv-transfer-config): https://docs.vllm.ai/en/latest/features/disagg_prefill.html
- Google SRE Workbook, Alerting on SLOs (multi-window burn rate): https://sre.google/workbook/alerting-on-slos/
- KServe (autoscaling, scale-to-zero): https://kserve.github.io/website/
- Kubernetes Horizontal Pod Autoscaler: https://kubernetes.io/docs/tasks/run-application/horizontal-pod-autoscale/
- NVIDIA Dynamo (disaggregated serving at scale): https://docs.dynamo.nvidia.com/