
SLOs for inference serving

Scope: defining and measuring inference SLOs (TTFT, TPOT/ITL, throughput, error rate) with concrete PromQL SLIs, target-setting guidance, and the linkage to burn-rate alerts and the SLO-breach runbook.

The snippets below are reference templates built on real APIs. Set your own targets with stakeholders, pin versions, and validate before production use. The PromQL assumes vLLM's Prometheus metrics, whose names may shift between versions. The Prometheus client library exposes counters with a _total suffix and histograms as _bucket/_count/_sum series.

What it is

An inference SLO is a target on a user-facing service-level indicator (SLI) over a window, e.g. "99% of requests have time-to-first-token under 500 ms over 28 days". For LLM serving the SLIs that matter are latency split by phase plus a success ratio:

TTFT (time to first token): prefill latency; how long until the stream starts. Driven by prompt length, queueing, and batch admission.
TPOT / ITL (time per output token / inter-token latency): decode latency; the steady-state streaming cadence. Driven by batch size, KV-cache pressure, and tensor-parallel comms.
Throughput / goodput: tokens served per second, and the share served within the latency SLO (goodput, see goodput in AI systems).
Error rate / availability: failed vs total requests, best measured at the gateway, not the engine.
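To make the phase split concrete, the sketch below derives TTFT, TPOT, and a goodput share from per-request token arrival times. It is illustrative only: `RequestTrace` is a hypothetical client-side record (not a vLLM type), the 0.5 s and 0.05 s defaults mirror the example targets used on this page, and "goodput" here is taken to mean the share of output tokens from requests that met both latency targets, which is one of several reasonable definitions.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class RequestTrace:
    """Hypothetical client-side trace of one streamed request (not a vLLM type)."""
    sent_at: float            # seconds, when the request was sent
    token_times: List[float]  # seconds, arrival time of each output token
    failed: bool = False


def ttft(trace: RequestTrace) -> Optional[float]:
    """Time to first token: first token arrival minus send time."""
    if not trace.token_times:
        return None
    return trace.token_times[0] - trace.sent_at


def tpot(trace: RequestTrace) -> Optional[float]:
    """Mean time per output token over the decode phase (tokens after the first)."""
    if len(trace.token_times) < 2:
        return None
    decode_span = trace.token_times[-1] - trace.token_times[0]
    return decode_span / (len(trace.token_times) - 1)


def goodput_share(traces: List[RequestTrace],
                  ttft_slo: float = 0.5,
                  tpot_slo: float = 0.05) -> float:
    """Share of output tokens that came from requests meeting both latency targets."""
    total = sum(len(t.token_times) for t in traces)
    if total == 0:
        return 0.0
    good = 0
    for t in traces:
        first, per_token = ttft(t), tpot(t)
        if t.failed or first is None or first > ttft_slo:
            continue
        if per_token is not None and per_token > tpot_slo:
            continue
        good += len(t.token_times)
    return good / total
```

A request whose first token arrives late counts against TTFT even if its decode cadence is fine, which is why the two indicators are tracked separately.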

This page is the inference-specific cut of the broader SLO/SLI catalog: same SLI → SLO → error-budget → burn-rate paradigm, narrowed to serving.

Why it matters

Raw GPU utilisation is not a user outcome. A node can be 100% busy while p99 TTFT is unacceptable. SLOs put the contract on what the caller experiences. TTFT and TPOT must be tracked separately because they fail for different reasons and have different fixes: TTFT regresses on queueing and long prompts; TPOT regresses on oversized decode batches and KV pressure (continuous batching internals). A single "latency" SLO hides which phase broke, and an SLI on averages hides the tail entirely; always measure on histograms/quantiles. The error budget (1 − SLO) then arbitrates change velocity, and multi-window burn-rate alerts decide when to page versus ticket, routing to the inference SLO-breach runbook.
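A small numeric illustration of why averages hide the tail, using synthetic TTFT samples (the numbers are made up for the example, not measurements). The mean sits comfortably under a 500 ms target while the good-request ratio misses a 99% SLO.

```python
import math


def percentile(samples, q):
    """Nearest-rank percentile for q in (0, 1]."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(q * len(ordered)))
    return ordered[rank - 1]


# Synthetic TTFT samples in seconds: 97 fast requests and 3 stuck in the queue.
ttft_samples = [0.2] * 97 + [4.0] * 3

mean_ttft = sum(ttft_samples) / len(ttft_samples)
p99_ttft = percentile(ttft_samples, 0.99)
good_ratio = sum(1 for s in ttft_samples if s <= 0.5) / len(ttft_samples)

print(f"mean={mean_ttft:.3f}s p99={p99_ttft:.1f}s good_ratio={good_ratio:.2f}")
```

Here the mean is about 0.31 s, yet only 97% of requests meet the 500 ms threshold, so an SLI built on the mean would report healthy while the SLO is being missed.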

When it is needed (and when not)

Needed when:

A serving endpoint has external callers or an internal latency contract (interactive chat, agentic loops, RAG).
You run autoscaling or capacity planning off latency headroom (GPU capacity planning, inference serving).
You operate multiple model families/replicas and need per-model SLIs to scope incidents (serving open-weight models).

Not needed (or lighter) when:

Pure offline/batch inference with no interactivity: track throughput and cost, not TTFT; a job-success SLO from the SLO/SLI catalog fits better.
Pre-production benchmarking: use load tests and the smoke suite (smoke tests for a GPU platform), not a paging SLO, until traffic is real.
Single-user dev endpoints: observability without alerts.

How: implement, integrate, maintain

Pipeline: vLLM exposes metrics → Prometheus scrapes → recording rules compute the SLI ratio per window → multi-window burn-rate alerts page/ticket → alert annotation links the runbook.

flowchart LR
  ENGINE["vLLM /metrics"] --> SCRAPE["Prometheus scrape"]
  SCRAPE --> SLI["Recording rule: good/total per window"]
  SLI --> BURN["Multi-window burn-rate"]
  BURN -->|"fast burn 14.4x"| PAGE["Page (critical)"]
  BURN -->|"slow burn"| TICKET["Ticket (warning)"]
  PAGE --> RUNBOOK["SLO-breach runbook"]
  TICKET --> RUNBOOK

1. SLIs as PromQL

# TTFT SLI: fraction of requests with TTFT <= 500 ms (histogram)
sum(rate(vllm:time_to_first_token_seconds_bucket{le="0.5"}[5m]))
  / sum(rate(vllm:time_to_first_token_seconds_count[5m]))

# TPOT/ITL SLI: fraction with per-output-token latency <= 50 ms
sum(rate(vllm:request_time_per_output_token_seconds_bucket{le="0.05"}[5m]))
  / sum(rate(vllm:request_time_per_output_token_seconds_count[5m]))

# TTFT p99 (for dashboards / target-setting, not the SLI ratio)
histogram_quantile(0.99, sum by (le) (rate(vllm:time_to_first_token_seconds_bucket[5m])))

# Error ratio (engine-completion signal; true availability is at the gateway)
1 - (
  sum(rate(vllm:request_success_total[5m]))
  / sum(rate(vllm:e2e_request_latency_seconds_count[5m]))
)

# Throughput: output tokens/sec
sum(rate(vllm:generation_tokens_total[5m]))

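vllm:request_success is the counter (exposed as vllm:request_success_total when scraped); vllm:e2e_request_latency_seconds_count serves as the total-requests denominator. The resulting ratio is only an engine-completion signal: requests can fail before reaching the engine, so measure true availability at the HTTP/gateway boundary. Recent vLLM releases label the counter with finished_reason (for example stop, length, abort); check for your version whether aborted requests should count as successes.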
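The histogram queries above reduce to simple arithmetic on cumulative bucket counts. The sketch below mirrors that arithmetic in Python so the behaviour can be reasoned about offline. It is a simplified model rather than Prometheus's implementation: it works on one snapshot of counts instead of rates, and the bucket values in the usage note are invented.

```python
import math


def sli_ratio(cumulative, threshold):
    """Fraction of observations at or below `threshold`.

    `cumulative` maps each bucket upper bound (le) to its cumulative count and
    must include math.inf. `threshold` must be an exact bucket boundary;
    a missing boundary raises KeyError here.
    """
    total = cumulative[math.inf]
    if total == 0:
        return float("nan")
    return cumulative[threshold] / total


def estimate_quantile(cumulative, q):
    """Estimate quantile q by linear interpolation inside the matching bucket."""
    total = cumulative[math.inf]
    rank = q * total
    prev_bound, prev_count = 0.0, 0  # lowest bucket is assumed to start at 0
    for bound in sorted(cumulative):
        count = cumulative[bound]
        if count >= rank:
            if math.isinf(bound):
                # Cannot interpolate into +Inf: fall back to the highest finite bound.
                return prev_bound
            if count == prev_count:
                return bound
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return float("nan")
```

For example, with `buckets = {0.1: 40, 0.25: 80, 0.5: 95, 1.0: 99, math.inf: 100}`, `sli_ratio(buckets, 0.5)` gives 0.95, and `estimate_quantile(buckets, 0.9)` interpolates between 0.25 and 0.5. The quantile is only as precise as the bucket layout, which is one reason to use it for dashboards and target-setting rather than as the SLI itself.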

2. Target-setting guidance

Set targets from observed p99 under representative load, not aspiration: measure first with the p99 query above, then set the SLO just above the achievable tail with headroom.
TTFT scales with prompt length: bucket SLIs by input-length class if prompts vary widely, or set the target for the dominant traffic profile.
TPOT target follows the perceived-speed requirement (e.g. ~50 ms/token ≈ 20 tok/s, faster than reading speed). Decode batch size trades TPOT for throughput.
Choose the window with stakeholders (28d is common); start from the Google SRE workbook burn-rate factors below and tune to your page load.
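The guidance above can be sketched as two small helpers: converting a TPOT target into a streaming rate, and turning an observed p99 into a threshold that lines up with an existing histogram boundary (the SLI ratio needs an exact `le` value). The 1.2 headroom factor and the bucket list in the usage note are assumptions for illustration, not recommendations or vLLM defaults.

```python
def tokens_per_second(tpot_seconds):
    """Streaming rate implied by a per-output-token latency."""
    return 1.0 / tpot_seconds


def pick_threshold(observed_p99, bucket_bounds, headroom=1.2):
    """Smallest histogram boundary at or above observed_p99 * headroom.

    Returns None when no finite boundary is large enough, which signals that
    the bucket layout (or the workload) needs attention before an SLO is set.
    """
    wanted = observed_p99 * headroom
    for bound in sorted(bucket_bounds):
        if bound >= wanted:
            return bound
    return None
```

With hypothetical boundaries `[0.1, 0.25, 0.5, 0.75, 1.0, 2.5]` and an observed p99 of 0.38 s, `pick_threshold` returns 0.5, and `tokens_per_second(0.05)` gives roughly 20 tokens per second.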

3. Burn-rate alerts (recording + multi-window)

Per the Google SRE workbook, alert when a short and a long window both confirm a high burn. For a 99% TTFT SLO (budget 0.01):

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: slo-inference-ttft
  namespace: monitoring
  labels: { release: kube-prom }
spec:
  groups:
    - name: slo.inference.ttft
      rules:
        - record: slo:inference_ttft_error:ratio_rate5m
          expr: 1 - (
            sum(rate(vllm:time_to_first_token_seconds_bucket{le="0.5"}[5m]))
            / sum(rate(vllm:time_to_first_token_seconds_count[5m])))
        - record: slo:inference_ttft_error:ratio_rate1h
          expr: 1 - (
            sum(rate(vllm:time_to_first_token_seconds_bucket{le="0.5"}[1h]))
            / sum(rate(vllm:time_to_first_token_seconds_count[1h])))
        - record: slo:inference_ttft_error:ratio_rate30m
          expr: 1 - (
            sum(rate(vllm:time_to_first_token_seconds_bucket{le="0.5"}[30m]))
            / sum(rate(vllm:time_to_first_token_seconds_count[30m])))
        - record: slo:inference_ttft_error:ratio_rate6h
          expr: 1 - (
            sum(rate(vllm:time_to_first_token_seconds_bucket{le="0.5"}[6h]))
            / sum(rate(vllm:time_to_first_token_seconds_count[6h])))
        - alert: InferenceTTFTFastBurn        # 14.4x over 1h/5m -> page
          expr: slo:inference_ttft_error:ratio_rate5m > (14.4 * 0.01)
            and slo:inference_ttft_error:ratio_rate1h > (14.4 * 0.01)
          for: 2m
          labels: { severity: critical }
          annotations:
            runbook: "runbook-inference-slo-breach: inference SLO breach (TTFT)"
        - alert: InferenceTTFTSlowBurn        # 6x over 6h/30m -> warning here (the workbook pages on this tier)
          expr: slo:inference_ttft_error:ratio_rate30m > (6 * 0.01)
            and slo:inference_ttft_error:ratio_rate6h > (6 * 0.01)
          for: 15m
          labels: { severity: warning }
          annotations:
            runbook: "runbook-inference-slo-breach: inference SLO breach (TTFT)"

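Workbook burn-rate factors (starting point, tune to your page load): 14.4× (1h/5m, page, 2% budget), 6× (6h/30m, page, 5%), 1× (3d/6h, ticket, 10%). The workbook derives these for a 30-day window and pages on the 6× tier, whereas the rule above routes that tier as a warning; choose one policy deliberately and keep the diagram, rule, and runbook consistent. Replicate the same pattern for the TPOT SLI with its own threshold.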
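The factors follow from simple arithmetic: a burn rate is the share of the error budget to be consumed, multiplied by the SLO period and divided by the long alert window. The sketch below reproduces the workbook numbers for a 30-day period and shows how they shift for a 28-day window. The function names are illustrative, and `should_fire` models only the two-window threshold check, not the `for:` duration.

```python
def burn_rate(budget_fraction, slo_period_hours, long_window_hours):
    """Burn rate that consumes `budget_fraction` of the budget within the long window."""
    return budget_fraction * slo_period_hours / long_window_hours


def should_fire(short_error_ratio, long_error_ratio, factor, slo_target):
    """Multi-window condition: both windows must exceed factor * error budget."""
    threshold = factor * (1.0 - slo_target)
    return short_error_ratio > threshold and long_error_ratio > threshold


HOURS_30D = 30 * 24  # 720
HOURS_28D = 28 * 24  # 672

fast = burn_rate(0.02, HOURS_30D, 1)        # 2% of budget in 1h
slow = burn_rate(0.05, HOURS_30D, 6)        # 5% of budget in 6h
ticket = burn_rate(0.10, HOURS_30D, 72)     # 10% of budget in 3d
fast_28d = burn_rate(0.02, HOURS_28D, 1)    # same budget share, 28-day window

print(f"30d: fast={fast:.2f} slow={slow:.2f} ticket={ticket:.2f}; 28d fast={fast_28d:.2f}")
```

For a 99% SLO the fast-burn threshold is 14.4 × 0.01 = 0.144, so `should_fire(0.20, 0.16, 14.4, 0.99)` is true while `should_fire(0.20, 0.05, 14.4, 0.99)` is false: a short spike alone does not page unless the long window confirms it.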

4. Integrate and maintain

Apply via the monitoring stack (telemetry, monitoring, alerting); route the annotation to the inference SLO-breach runbook. The runbook's trigger consumes exactly these alerts.
Generate the recording + burn-rate rules from one SLO-as-code spec (Sloth/OpenSLO) in git so windows and thresholds are reviewed, not hand-edited.
Validate the alert fires on a synthetic burn (e.g. inject latency or 5xx in staging) before trusting it. See smoke tests for a GPU platform.
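Confirm vLLM metric names and histogram bucket boundaries match your engine version after every upgrade. If the le value in a rule is not an actual bucket boundary, the selector matches no series, the SLI returns no data, and the burn-rate alert silently never fires.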
Keep latency SLIs on histograms, never averages; review SLO targets and error-budget policy with stakeholders on a fixed cadence.

# Validate the metrics exist before wiring rules
curl -s http://<vllm-host>:8000/metrics \
  | grep -E 'vllm:(time_to_first_token|request_time_per_output_token|e2e_request_latency|request_success|generation_tokens)'
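The grep confirms that the metric families exist but not that the `le` boundaries used in the rules are present. A minimal sketch of that second check follows, run against the text returned by the /metrics endpoint. The regexes are a simplification (they assume no `}` inside label values), the sample text is synthetic, and the comparison is on the raw `le` string because the rule's selector is a string match; verify that the stored label value matches what is scraped in your setup.

```python
import re

BUCKET_LINE = re.compile(
    r'^(?P<name>[A-Za-z_:][A-Za-z0-9_:]*)_bucket\{(?P<labels>[^}]*)\}\s+\S+'
)
LE_LABEL = re.compile(r'le="([^"]+)"')


def bucket_bounds(metrics_text, histogram):
    """Collect the le label values exposed for one histogram."""
    bounds = set()
    for line in metrics_text.splitlines():
        match = BUCKET_LINE.match(line)
        if match and match.group("name") == histogram:
            le = LE_LABEL.search(match.group("labels"))
            if le:
                bounds.add(le.group(1))
    return bounds


def missing_thresholds(metrics_text, required):
    """Return the {histogram: le} pairs from `required` that are not exposed."""
    return {
        name: le
        for name, le in required.items()
        if le not in bucket_bounds(metrics_text, name)
    }


# Synthetic exposition text for illustration only.
SAMPLE = """\
# TYPE vllm:time_to_first_token_seconds histogram
vllm:time_to_first_token_seconds_bucket{le="0.25",model_name="m"} 80.0
vllm:time_to_first_token_seconds_bucket{le="0.5",model_name="m"} 95.0
vllm:time_to_first_token_seconds_bucket{le="+Inf",model_name="m"} 100.0
vllm:time_to_first_token_seconds_count{model_name="m"} 100.0
"""

print(missing_thresholds(SAMPLE, {"vllm:time_to_first_token_seconds": "0.5"}))
```

An empty result means every threshold referenced by the rules has a matching bucket; any returned pair points at a rule that would evaluate to no data.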

References

vLLM production metrics: https://docs.vllm.ai/en/latest/usage/metrics.html
Google SRE Workbook — Alerting on SLOs (multiwindow burn rate): https://sre.google/workbook/alerting-on-slos/
Prometheus histogram_quantile: https://prometheus.io/docs/prometheus/latest/querying/functions/#histogram_quantile
Prometheus recording rules: https://prometheus.io/docs/prometheus/latest/configuration/recording_rules/
prometheus-operator PrometheusRule CRD: https://prometheus-operator.dev/docs/operator/api/#monitoring.coreos.com/v1.PrometheusRule
OpenSLO spec: https://github.com/OpenSLO/OpenSLO · Sloth: https://sloth.dev/