Runbook: inference SLO breach

Scope: diagnose and remediate an inference SLO breach (TTFT/TPOT burn-rate) without taking the service down.

Run this when a serving endpoint misses its latency SLO: TTFT or TPOT over target, or a multi-window burn-rate alert firing on the inference error budget (the SLO/SLI catalog). Severity: user-facing, page-worthy; act on the fast burn before the budget is gone. Stabilise first (shed load / scale), diagnose the phase, then fix the cause.

The commands below are reference templates built on real APIs; pin versions and validate them before production use.

This is the longform procedure for RB-6 in operational runbooks. The two-phase model (prefill/decode), the engines, and the optimisations referenced below are described in inference serving; the SLIs/SLOs and the burn-rate alert that pages you are defined in the SLO/SLI catalog; disaggregation as a structural lever is covered in disaggregated inference.

Trigger

  • TTFT over SLO: e.g. the time_to_first_token p99 crosses the target (reference 99% ≤ 500 ms, the SLO/SLI catalog).
  • TPOT/ITL over SLO: per-output-token latency (inter-token latency) crosses target (reference 99% ≤ 50 ms).
  • A multi-window multi-burn-rate alert on the inference error budget: fast burn (e.g. 14.4× over 1h/5m) pages, slow burn tickets.
  • Often correlated with a deploy (model/engine/config change) or a traffic shift (longer prompts, a burst).
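The burn-rate figure in the alert trigger is plain arithmetic: the observed bad-request fraction divided by the fraction the SLO allows. A minimal sketch, assuming a 99% target and a 30-day (720 h) budget period; the function names `burn_rate` and `budget_spent` are illustrative and not part of any tool:

```shell
# burn_rate BAD_FRACTION SLO_TARGET
# Prints how many times faster than budgeted the error budget is being spent.
burn_rate() {
  awk -v bad="$1" -v slo="$2" 'BEGIN { printf "%.1f\n", bad / (1 - slo) }'
}

# budget_spent BURN WINDOW_HOURS PERIOD_HOURS
# Prints the percentage of the whole period budget consumed in one window.
budget_spent() {
  awk -v b="$1" -v w="$2" -v p="$3" 'BEGIN { printf "%.1f\n", 100 * b * w / p }'
}

# hypothetical reading: 14.4% of requests over the TTFT target in the last hour
burn_rate 0.144 0.99      # prints 14.4
budget_spent 14.4 1 720   # prints 2.0
```

At 14.4× a single hour spends 2% of a 720 h budget, which is why that multiple is a common fast-burn paging threshold.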

Pre-checks

  • Confirm it is real, not an artefact: the SLI is on a histogram/quantile, not an average (the SLO/SLI catalog); the alert fired on both windows; the dashboard agrees with synthetic probes. A single blip on one window is not this runbook.
  • Scope it: one model/endpoint or fleet-wide? One replica or all? Check per-replica latency and error rate (telemetry and monitoring).
  • Correlate with change: did a deploy land in the burn window (new model revision, engine version, --max-num-seqs/--max-model-len, quantisation)? Check the rollout history (SRE and MLOps practices). A deploy-correlated breach short-circuits to Rollback.
  • Rule out the node underneath before tuning the server (see step 5): a throttling or ECC-degraded GPU (reliability and RAS) presents as a latency regression with no serving-config cause.
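The "fired on both windows" pre-check can be expressed as a small guard: treat the breach as real only when the long and the short window both sit at or above the burn threshold. A sketch under assumptions: the burn values would come from your alerting stack, and the function name `both_windows_burning` and the readings below are hypothetical.

```shell
# both_windows_burning LONG_BURN SHORT_BURN THRESHOLD
# Exit status 0 only if BOTH windows are at or above the threshold.
both_windows_burning() {
  awk -v l="$1" -v s="$2" -v t="$3" 'BEGIN { exit !(l >= t && s >= t) }'
}

# hypothetical readings: 1h window at 16.2x, 5m window at 15.0x, threshold 14.4x
if both_windows_burning 16.2 15.0 14.4; then
  echo "page: sustained fast burn"
else
  echo "hold: single-window blip"
fi
```

Requiring the short window as well is what lets the alert reset quickly once the burn stops, and it keeps a spike that has already passed from paging.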

Flow

stateDiagram-v2
    [*] --> Triage
    Triage --> Deploy: change in burn window
    Triage --> Prefill: TTFT over SLO
    Triage --> Decode: TPOT over SLO
    Deploy --> Rollback: revert revision
    Prefill --> Levers: reduce batch, prefix cache
    Decode --> Levers: scale replicas, KV headroom
    Levers --> Hardware: still breached
    Hardware --> Verify: GPU healthy
    Levers --> Verify: SLO recovered
    Rollback --> Verify: SLO recovered
    Verify --> [*]: TTFT and TPOT under SLO

Procedure

Split the symptom first: TTFT is prefill-bound, TPOT is decode-bound (inference serving), then pull the matching lever. EP is the endpoint/service; metric names below are vLLM's (vllm:*); verify exact names against the engine's metrics docs.

EP=inference-deepseek            # service / deployment
NS=serving                       # namespace

  1. Read the engine metrics to localise the bottleneck (telemetry and monitoring). On vLLM, scrape /metrics:

    # KV-cache pressure, queue depth, preemptions, running batch
    curl -s http://$EP.$NS:8000/metrics | grep -E \
      'vllm:gpu_cache_usage_perc|vllm:num_requests_waiting|vllm:num_requests_running|vllm:num_preemptions_total'
    
    High gpu_cache_usage_perc (above ~0.9; the gauge is reported as a 0 to 1 fraction despite the name) + non-zero num_requests_waiting + rising num_preemptions_total is KV-cache pressure causing preemption thrash, the classic cause of both TTFT and TPOT spikes (inference serving). (Metric names are version-dependent, the cache-usage and preemption names in particular; confirm in the metrics docs.)

  2. TTFT over SLO (prefill-bound): a long prefill monopolises the batch, stalling in-flight decodes and making queued prompts wait, which spikes TTFT (inference serving, disaggregated inference):

      • Reduce the running batch so prefill chunks land sooner: lower --max-num-seqs (and --max-num-batched-tokens); ensure chunked prefill is enabled so long prompts interleave with decodes.
      • Enable prefix caching (vLLM --enable-prefix-caching; SGLang RadixAttention) for prefix-heavy traffic: it skips recompute of shared prompt prefixes.

  3. TPOT over SLO (decode-bound): decode is memory-bandwidth-bound and gated by KV headroom and per-GPU contention (inference serving):

      • Scale replicas out (HPA / KServe) on queue depth or the SLI, adding decode capacity:
        kubectl -n $NS scale deployment/$EP --replicas=<N+delta>
        # or autoscale; kubectl autoscale creates a CPU-utilisation HPA only, a weak signal for GPU-bound serving
        kubectl -n $NS autoscale deployment/$EP --min=<N> --max=<M> --cpu-percent=<...>
        # scaling on a custom metric (queue depth / SLO violation rate) needs an autoscaling/v2 HPA manifest plus a metrics adapter

      • Give KV cache headroom: raise --gpu-memory-utilization toward (not to) the limit, or lower --max-model-len/--max-num-seqs to cut KV footprint and stop preemption (reliability and RAS). Watch for the cold-start trap if scaling from zero: a multi-minute model load misses the SLO on the first request.

  4. At scale, disaggregate (disaggregated inference): if prefill and decode are fighting over one pool and the traffic is high-volume with long/variable prompts, split them into separate, independently scaled pools with KV transfer over NIXL/RDMA, orchestrated by Dynamo. This is a structural change, not an incident lever; reach for it when goodput/SLO demands it, and only with GDR proven (NET/IB/.../GDRDMA in the NCCL log, performance tuning).

  5. Rule out the GPU underneath (reliability and RAS): confirm no thermal throttle or ECC degradation is masquerading as a serving regression:

    nvidia-smi --query-gpu=name,clocks_throttle_reasons.active,temperature.gpu,\
    ecc.errors.uncorrected.volatile.total --format=csv,noheader
    # newer driver branches expose this as clocks_event_reasons.active (throttle name deprecated)
    
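    The value is a bitmask, so a non-0x0000000000000000 reading is not automatically a fault: the idle and application-clock bits are benign. A thermal, power, or HW-slowdown reason while the GPU is under load means the cause is the node, not the engine: divert to the thermal path (the thermal-emergency runbook) or the GPU-fault path (the GPU-fault runbook).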
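The throttle field printed by nvidia-smi is a hexadecimal bitmask, so it helps to decode it before deciding which path to take. A sketch, assuming the bit values of NVML's nvmlClocksThrottleReason* constants (check nvml.h for the installed driver); the function name `decode_throttle` and the short reason labels are illustrative:

```shell
# decode_throttle MASK
# Prints one line per set bit in the clocks throttle/event reasons bitmask,
# or "none" when the mask is zero. Bit values assumed from NVML's constants.
decode_throttle() {
  mask=$(( $1 ))
  if [ "$mask" -eq 0 ]; then echo none; return 0; fi
  for pair in 1:gpu_idle 2:applications_clocks_setting 4:sw_power_cap \
              8:hw_slowdown 16:sync_boost 32:sw_thermal_slowdown \
              64:hw_thermal_slowdown 128:hw_power_brake_slowdown \
              256:display_clock_setting; do
    bit=${pair%%:*}                       # numeric bit value before the colon
    if [ $(( mask & bit )) -ne 0 ]; then echo "${pair#*:}"; fi
  done
}

# hypothetical reading copied from the nvidia-smi query output
decode_throttle 0x0000000000000044
```

For the sample mask this prints sw_power_cap and hw_thermal_slowdown, one per line. Reasons such as gpu_idle or applications_clocks_setting on their own do not indicate a node problem; the thermal, power-brake, and HW-slowdown bits are the ones that justify diverting.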

Verification

  • TTFT and TPOT back under SLO on the histograms/quantiles, confirmed across both alert windows so the burn rate falls (the SLO/SLI catalog):
    # TTFT SLI: fraction of requests under 500ms (target 99%)
    sum(rate(vllm:time_to_first_token_seconds_bucket{le="0.5"}[5m]))
      / sum(rate(vllm:time_to_first_token_seconds_count[5m]))
    
  • Goodput restored: tokens served within SLO recovers to baseline (inference serving); num_preemptions_total rate returns to ~0 and num_requests_waiting drains.
  • The burn-rate alert clears and the error budget stops bleeding (the SLO/SLI catalog); synthetic probes confirm from the client side.
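The KV-pressure signals used for triage can be re-read after remediation to confirm recovery. A minimal sketch that reduces one /metrics scrape to a single verdict line. Assumptions: the vLLM metric names used earlier in this runbook, a cache gauge reported as a 0 to 1 fraction, no timestamp after the sample value, and the ~0.9 threshold from step 1; the function name `kv_pressure` is illustrative.

```shell
# kv_pressure: read a vLLM /metrics scrape on stdin, print a one-line verdict.
# Takes the highest cache usage and sums queue depth and preemptions across label sets.
kv_pressure() {
  awk '
    /^#/ { next }                         # skip HELP and TYPE lines
    {
      name = $1; sub(/[{].*/, "", name)   # strip the label set
      val = $NF + 0                       # sample value is the last field
    }
    name == "vllm:gpu_cache_usage_perc" && val > cache { cache = val }
    name == "vllm:num_requests_waiting"   { waiting += val }
    name == "vllm:num_preemptions_total"  { preempt += val }
    END {
      verdict = (cache > 0.9 && waiting > 0) ? "kv-pressure" : "ok"
      printf "%s cache=%.2f waiting=%d preemptions=%d\n", verdict, cache, waiting, preempt
    }'
}

# live use (needs cluster access): curl -s http://$EP.$NS:8000/metrics | kv_pressure
# sample scrape, values hypothetical:
printf '%s\n' \
  'vllm:gpu_cache_usage_perc{model_name="m"} 0.95' \
  'vllm:num_requests_waiting{model_name="m"} 7.0' \
  'vllm:num_preemptions_total{model_name="m"} 12.0' | kv_pressure
```

The sample prints `kv-pressure cache=0.95 waiting=7 preemptions=12`. After a successful fix the verdict should read ok, and because the preemption metric is a cumulative counter, its figure should stop growing between two scrapes taken a minute or so apart.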

Rollback

If the breach is deploy-correlated, revert the model, engine, or config revision through GitOps, the same way it shipped, and change one variable at a time so the culprit stays identifiable (SRE and MLOps practices):

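# revert the serving manifest to the last-good revision
# GitOps-managed: revert in Git, since a bare rollout undo drifts from Git and can be synced away
git revert <sha> && git push && argocd app sync $EP
# not GitOps-managed: roll the Deployment back directly
kubectl -n $NS rollout undo deployment/$EP
kubectl -n $NS rollout status deployment/$EP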

Keep the last-good revision warm so rollback is instant: hold it running at a low replica count behind the router, and shift traffic back to it rather than waiting for a cold redeploy (SRE and MLOps practices, inference serving). If config levers (batch size, KV headroom) were the change, revert them to the recorded last-good values.

References

  • vLLM production metrics (TTFT/TPOT, cache usage, preemptions, queue): https://docs.vllm.ai/en/latest/usage/metrics.html
  • vLLM disaggregated prefilling (--kv-transfer-config): https://docs.vllm.ai/en/latest/features/disagg_prefill.html
  • Google SRE Workbook — Alerting on SLOs (multi-window burn rate): https://sre.google/workbook/alerting-on-slos/
  • KServe (autoscaling, scale-to-zero): https://kserve.github.io/website/
  • Kubernetes Horizontal Pod Autoscaler: https://kubernetes.io/docs/tasks/run-application/horizontal-pod-autoscale/
  • NVIDIA Dynamo (disaggregated serving at scale): https://docs.dynamo.nvidia.com/

Related: Inference · Telemetry · Practices · Disaggregated Inference · SLO/SLI · GPU Fault/RMA · Thermal Emergency · Glossary