SLO/SLI catalog & error-budget alerts

Scope: index and decision page for service-level indicators and objectives across GPU services. It frames the SLI → SLO → error-budget → burn-rate paradigm, holds the cross-domain catalog table, and points to the focused pages that carry the per-domain PromQL and the alerting rules. It turns the SRE principles from SRE and MLOps practices into queries and rules wired to the telemetry stack (telemetry and monitoring).

Focused pages

  • Inference serving SLOs: use this when you need the availability/TTFT/TPOT/goodput SLIs, their vLLM PromQL, and the inference burn-rate rules.
  • Training platform SLOs: use this when you need job-success, queue-wait, GPU-availability, and checkpoint SLIs with their queries.
  • Cluster / fabric SLOs: use this when you need allocatable-ratio and fabric-error SLIs at the cluster and network-fabric layer.
  • Burn-rate alerting rules: use this when you need the multi-window multi-burn-rate PrometheusRule patterns, thresholds, and SLO-as-code generation.

Reference targets: the SLO values on this page are reference values only; set your own SLOs with stakeholders. The PromQL assumes dcgm-exporter (telemetry and monitoring), kube-state-metrics, and the inference engine's metrics (vLLM exposes Prometheus metrics; the names used are vLLM's and may shift between versions). Note that vLLM's vllm:request_success_total is an engine-completion signal, not a full service success rate. True availability is best measured at the HTTP/gateway boundary, since requests can fail before reaching the engine.

flowchart LR
  SLI["SLI"] --> SLO["SLO"]
  SLO --> BUDGET["Error budget"]
  BUDGET --> BURN["Burn rate"]
  BURN --> ALERT["Page or ticket"]
  ALERT --> RUNBOOK["Runbook"]

Paradigm: SLI → SLO → error budget → burn rate

  • SLI is a measured ratio of good events to total events (e.g. requests under the latency target / all requests). Keep SLIs user-facing.
  • SLO is the target for an SLI over a window (e.g. 99.5% over 28 days).
  • Error budget is 1 − SLO: the fraction of events allowed to fail within the window. When it is spent, freeze risky change and spend effort on reliability (SRE and MLOps practices).
  • Burn rate is how fast the budget is being consumed: burn = observed_error_rate / (1 − SLO). A sustained burn of 1× exhausts the budget exactly at window end; 14.4× exhausts a 30-day budget in about 2 days (30 / 14.4 ≈ 2.1). Multi-window multi-burn-rate alerts page quickly on fast burns and open tickets on slow ones.
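The arithmetic in the list above can be sketched in a few lines of Python. This is a minimal illustration, not part of any alerting stack: the function names are hypothetical, and the 7.2% observed error rate is an assumed value chosen so that the burn comes out at 14.4× against a 99.5% SLO.

```python
def error_budget(slo: float) -> float:
    """Allowed failure fraction for an SLO expressed as a ratio (0.995 = 99.5%)."""
    return 1.0 - slo


def burn_rate(observed_error_rate: float, slo: float) -> float:
    """How many times faster than 'exactly on budget' errors are occurring."""
    return observed_error_rate / error_budget(slo)


def days_to_exhaustion(burn: float, window_days: float) -> float:
    """Days until the whole budget is gone if the burn stays constant."""
    return window_days / burn


slo = 0.995                      # 99.5% availability target
burn = burn_rate(0.072, slo)     # assumed: 7.2% of requests currently failing
print(f"burn rate: {burn:.1f}x")                                         # burn rate: 14.4x
print(f"30-day budget gone in {days_to_exhaustion(burn, 30):.1f} days")  # 2.1 days
print(f"28-day budget gone in {days_to_exhaustion(burn, 28):.1f} days")  # 1.9 days
```

Note that the time to exhaustion depends on the SLO window, so the same 14.4× burn spends a 28-day budget slightly sooner than a 30-day one.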

SLI/SLO catalog

Inference service

| SLI | Definition | Reference SLO |
| --- | --- | --- |
| Availability | successful / total requests | 99.5% / 28d |
| TTFT latency | requests with TTFT ≤ target / total | 99% ≤ 500 ms |
| TPOT latency | requests with per-token ≤ target / total | 99% ≤ 50 ms |
| Goodput | tokens served within SLO (inference serving) | tracked, not paged |

Training platform

| SLI | Definition | Reference SLO |
| --- | --- | --- |
| Job success | jobs completing without infra failure | 99% / 28d |
| Scheduler queue wait | jobs scheduled ≤ target / total | 95% ≤ 10 min |
| GPU availability | healthy allocatable / total GPUs (reliability and RAS) | 99% / 28d |
| Checkpoint success | checkpoints written ok (storage and data) | 99.9% |

Cluster / fabric

| SLI | Definition | Reference SLO |
| --- | --- | --- |
| GPU allocatable ratio | schedulable / total GPUs | ≥ 98% |
| Fabric error rate | links without errors / total (networking fabric) | ≥ 99.9% |

SLIs as PromQL

The per-domain PromQL for each SLI in the catalog now lives on its focused page: inference availability/TTFT/TPOT on inference serving SLOs, job-success/queue-wait/GPU-availability/checkpoint on training platform SLOs, and allocatable-ratio/fabric-error on cluster / fabric SLOs.

Multi-window burn-rate alert (the SRE pattern)

Record the error ratio, then alert when both a short and long window confirm a high burn. The full PrometheusRule patterns, the per-domain thresholds, and the SLO-as-code (OpenSLO/Sloth) generation live on burn-rate alerting rules.

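Burn-rate thresholds (following the Google SRE workbook pattern, as long window/short window): 14.4× (1h/5m, page), 6× (6h/30m, page), 3× (1d/2h, ticket), 1× (3d/6h, ticket). These factors assume a 30-day SLO window; 14.4× over 1 hour corresponds to spending 2% of the budget.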
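The "both windows must confirm" logic can be sketched as plain Python, independent of Prometheus. This is a minimal illustration under stated assumptions: `BurnTier`, `TIERS`, and `evaluate` are hypothetical names, the tier factors (14.4, 6, 3, 1) are the commonly cited values for a 30-day window rather than tuned thresholds, and the error ratios are made-up inputs standing in for recorded series.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class BurnTier:
    factor: float       # burn-rate multiple of the error budget
    long_window: str    # window that shows the burn is significant
    short_window: str   # window that shows the burn is still happening
    severity: str       # "page" or "ticket"


# Assumed tiers for illustration; agree on real thresholds with stakeholders.
TIERS: List[BurnTier] = [
    BurnTier(14.4, "1h", "5m", "page"),
    BurnTier(6.0, "6h", "30m", "page"),
    BurnTier(3.0, "1d", "2h", "ticket"),
    BurnTier(1.0, "3d", "6h", "ticket"),
]


def evaluate(error_ratio: Dict[str, float], slo: float) -> Optional[str]:
    """Return the severity of the first tier whose long AND short windows both exceed it."""
    budget = 1.0 - slo
    for tier in TIERS:
        threshold = tier.factor * budget
        if (error_ratio[tier.long_window] > threshold
                and error_ratio[tier.short_window] > threshold):
            return tier.severity
    return None


# Hypothetical error ratios per window for a 99.5% SLO.
spike = {"5m": 0.10, "30m": 0.09, "1h": 0.08, "2h": 0.05, "6h": 0.02, "1d": 0.01, "3d": 0.004}
blip = {"5m": 0.10, "30m": 0.02, "1h": 0.01, "2h": 0.006, "6h": 0.004, "1d": 0.002, "3d": 0.001}
slow = {"5m": 0.008, "30m": 0.008, "1h": 0.008, "2h": 0.008, "6h": 0.008, "1d": 0.008, "3d": 0.007}

print(evaluate(spike, 0.995))  # page
print(evaluate(blip, 0.995))   # None
print(evaluate(slow, 0.995))   # ticket
```

The blip case shows why the long window matters: a 5-minute spike alone does not alert. The slow case shows why the low-factor tiers matter: a steady 0.8% error rate never reaches the paging tiers but still spends the budget.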

Don't-miss checklist

  • SLIs measure the user's experience (TTFT/TPOT/job-success), not raw GPU-util (observability).
  • Alert on multi-window burn rate, not on a single threshold: fast-page big burns, ticket slow ones.
  • Every page links to a runbook (the troubleshooting runbook, operational runbooks).
  • Generate rules from an SLO-as-code spec in git; review SLOs with stakeholders (SRE and MLOps practices).
  • Let the error budget arbitrate change velocity.

Failure modes

  • SLOs on infra metrics (GPU temp) instead of user outcomes: green dashboards, unhappy users.
  • Single-threshold alerts: flapping on blips, or missing slow steady burns.
  • No error-budget policy: SLOs measured but never acted on.
  • Latency SLI on averages, not histograms/quantiles: tail latency hidden.
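The last failure mode can be illustrated with a small Python sketch. It assumes a Prometheus-style cumulative histogram (each bucket counts observations less than or equal to its upper bound, with an infinite bucket holding the total); the bucket counts, the latency sum, and the function name `latency_sli` are hypothetical values chosen for illustration.

```python
from typing import Dict

INF = float("inf")


def latency_sli(cumulative_buckets: Dict[float, int], target_s: float) -> float:
    """Fraction of requests at or under target_s.

    The target must coincide with a bucket upper bound; otherwise the
    lookup raises KeyError rather than silently interpolating.
    """
    total = cumulative_buckets[INF]
    return cumulative_buckets[target_s] / total


# Hypothetical TTFT histogram: upper bound in seconds -> cumulative request count.
ttft_buckets = {0.1: 700, 0.25: 900, 0.5: 950, 1.0: 960, 5.0: 990, INF: 1000}
ttft_sum_s = 300.0  # assumed sum of all observed latencies, in seconds

average = ttft_sum_s / ttft_buckets[INF]
sli = latency_sli(ttft_buckets, 0.5)

print(f"average TTFT: {average:.2f} s")     # average TTFT: 0.30 s
print(f"SLI (TTFT <= 500 ms): {sli:.1%}")   # SLI (TTFT <= 500 ms): 95.0%
```

Here the average sits comfortably under the 500 ms target while only 95% of requests meet it, well short of a 99% objective. This is also why the histogram buckets should include the SLO target as a boundary.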

Open questions & validation

  • Confirm the inference engine's exact metric names (vLLM vllm:*) and histogram buckets.
  • Set real SLO targets and burn thresholds with stakeholders; validate alerts fire on a synthetic burn.
  • Decide the error-budget policy: what freezes when the budget is spent (SRE and MLOps practices).

References

  • Google SRE Workbook — Alerting on SLOs (burn rate): https://sre.google/workbook/alerting-on-slos/
  • OpenSLO spec: https://github.com/OpenSLO/OpenSLO · Sloth (SLO generator): https://sloth.dev/
  • vLLM production metrics: https://docs.vllm.ai/en/latest/usage/metrics.html
  • Prometheus histogram_quantile: https://prometheus.io/docs/prometheus/latest/querying/functions/#histogram_quantile

Related: Inference · Observability · Reliability · Telemetry · Practices · Runbooks · Glossary