SLO/SLI catalog & error-budget alerts¶
Scope: index and decision page for service-level indicators and objectives across GPU services. It frames the SLI → SLO → error-budget → burn-rate paradigm, holds the cross-domain catalog table, and points to the focused pages that carry the per-domain PromQL and the alerting rules. It turns the SRE principles in SRE and MLOps practices into queries and rules wired to the telemetry stack (telemetry and monitoring).
Focused pages¶
- Inference serving SLOs: use this when you need the availability/TTFT/TPOT/goodput SLIs, their vLLM PromQL, and the inference burn-rate rules.
- Training platform SLOs: use this when you need job-success, queue-wait, GPU-availability, and checkpoint SLIs with their queries.
- Cluster / fabric SLOs: use this when you need allocatable-ratio and fabric-error SLIs at the cluster and network-fabric layer.
- Burn-rate alerting rules: use this when you need the multi-window multi-burn-rate PrometheusRule patterns, thresholds, and SLO-as-code generation.
Reference targets: set your own SLOs with stakeholders. The PromQL on the focused pages assumes dcgm-exporter (telemetry and monitoring), kube-state-metrics, and the inference engine's metrics (vLLM exposes Prometheus metrics; the metric names used are vLLM's and may shift between versions). Note that vLLM's request_success_total is an engine-completion signal, not a full service success rate. True availability is best measured at the HTTP/gateway boundary, since requests can fail before reaching the engine.
flowchart LR
SLI["SLI"] --> SLO["SLO"]
SLO --> BUDGET["Error budget"]
BUDGET --> BURN["Burn rate"]
BURN --> ALERT["Page or ticket"]
ALERT --> RUNBOOK["Runbook"]
Paradigm: SLI → SLO → error budget → burn rate¶
- SLI is a measured ratio of good events to total events (e.g. requests under latency target / all requests). Keep SLIs user-facing.
- SLO is the target for an SLI over a window (e.g. 99.5% over 28 days).
- Error budget is 1 − SLO: the allowed failure. When spent, freeze risky change and spend effort on reliability (SRE and MLOps practices).
- Burn rate is how fast the budget is being consumed: burn = observed_error_rate / (1 − SLO). Burn 1× exhausts the budget exactly at window end; 14.4× exhausts a 30-day budget in ~2 days. Multi-window multi-burn-rate alerts page fast on big burns and ticket on slow ones.
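As a worked example of the arithmetic in the list above, the following Python sketch computes the error budget, the burn rate, and the time to exhaustion. The function names are illustrative and not taken from any library, and the 7.2% observed error rate is a made-up input chosen so that it lands on the 14.4× burn quoted above.

```python
import math


def error_budget(slo: float) -> float:
    """Allowed failure fraction: 1 - SLO (e.g. 0.995 -> 0.005)."""
    return 1.0 - slo


def burn_rate(observed_error_rate: float, slo: float) -> float:
    """burn = observed_error_rate / (1 - SLO)."""
    return observed_error_rate / error_budget(slo)


def days_to_exhaustion(burn: float, window_days: float) -> float:
    """Days until the whole budget is spent if the burn stays constant."""
    if burn <= 0:
        return math.inf  # nothing is being consumed
    return window_days / burn


def budget_minutes(slo: float, window_days: float) -> float:
    """Time-based view of the budget: minutes of full outage the window
    tolerates. Only meaningful for time-based SLIs."""
    return error_budget(slo) * window_days * 24 * 60


if __name__ == "__main__":
    slo = 0.995            # reference availability target from the catalog
    observed = 0.072       # hypothetical: 7.2% of requests failing right now
    burn = burn_rate(observed, slo)
    print(f"error budget: {error_budget(slo):.3%}")
    print(f"burn rate: {burn:.1f}x")
    print(f"30-day budget gone in {days_to_exhaustion(burn, 30):.2f} days")
    print(f"28-day budget: {budget_minutes(slo, 28):.1f} outage-minutes")
```

A burn of exactly 1 returns the full window length, which matches the statement that 1× exhausts the budget at window end.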
SLI/SLO catalog¶
Inference service¶
| SLI | Definition | Reference SLO |
|---|---|---|
| Availability | successful / total requests | 99.5% / 28d |
| TTFT latency | requests with TTFT ≤ target / total | 99% ≤ 500 ms |
| TPOT latency | requests with per-token ≤ target / total | 99% ≤ 50 ms |
| Goodput | tokens served within SLO (inference serving) | tracked, not paged |
Training platform¶
| SLI | Definition | Reference SLO |
|---|---|---|
| Job success | jobs completing without infra failure | 99% / 28d |
| Scheduler queue wait | jobs scheduled ≤ target / total | 95% ≤ 10 min |
| GPU availability | healthy allocatable / total GPUs (reliability and RAS) | 99% / 28d |
| Checkpoint success | checkpoints written ok (storage and data) | 99.9% |
Cluster / fabric¶
| SLI | Definition | Reference SLO |
|---|---|---|
| GPU allocatable ratio | schedulable / total GPUs | ≥ 98% |
| Fabric error rate | links without errors / total (networking fabric) | ≥ 99.9% |
SLIs as PromQL¶
The per-domain PromQL for each SLI in the catalog now lives on its focused page: inference availability/TTFT/TPOT on inference serving SLOs, job-success/queue-wait/GPU-availability/checkpoint on training platform SLOs, and allocatable-ratio/fabric-error on cluster / fabric SLOs.
Multi-window burn-rate alert (the SRE pattern)¶
Record the error ratio, then alert when both a short and long window confirm a high burn. The full PrometheusRule patterns, the per-domain thresholds, and the SLO-as-code (OpenSLO/Sloth) generation live on burn-rate alerting rules.
Burn-rate thresholds (per the Google SRE workbook), written as burn × (long window / short window, action): 14.4× (1h/5m, page), 6× (6h/30m, page), 3× (1d/2h, ticket), 1× (3d/6h, ticket).
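The decision logic behind these tiers can be sketched in Python. This illustrates the "both windows must confirm" rule only and is not a substitute for the PrometheusRule patterns on the burn-rate alerting rules page. `BurnTier`, `TIERS`, and `classify` are hypothetical names, and the input is assumed to be a mapping from window label to the error ratio already recorded for that window.

```python
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class BurnTier:
    burn: float          # burn-rate multiple of the error budget
    long_window: str     # window that proves the burn is significant
    short_window: str    # window that proves the burn is still happening
    severity: str        # "page" or "ticket"


# Tiers as listed in the text, ordered from most to least urgent.
TIERS = (
    BurnTier(14.4, "1h", "5m", "page"),
    BurnTier(6.0, "6h", "30m", "page"),
    BurnTier(3.0, "1d", "2h", "ticket"),
    BurnTier(1.0, "3d", "6h", "ticket"),
)


def classify(error_ratio: Mapping[str, float], slo: float) -> Optional[str]:
    """Return the severity of the first tier whose long AND short windows
    both exceed burn * (1 - SLO), or None if no tier fires."""
    budget = 1.0 - slo
    for tier in TIERS:
        threshold = tier.burn * budget
        if (error_ratio[tier.long_window] > threshold
                and error_ratio[tier.short_window] > threshold):
            return tier.severity
    return None


if __name__ == "__main__":
    windows = ("5m", "30m", "1h", "2h", "6h", "1d", "3d")
    sustained_outage = {w: 0.10 for w in windows}   # 10% errors everywhere
    slow_leak = {w: 0.006 for w in windows}         # just over 1x at 99.5%
    print(classify(sustained_outage, slo=0.995))
    print(classify(slow_leak, slo=0.995))
```

Requiring the short window as well means a five-minute spike that the one-hour ratio does not confirm fires nothing, while an incident that has already recovered stops alerting soon after the short window clears.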
Don't-miss checklist¶
- SLIs measure the user's experience (TTFT/TPOT/job-success), not raw GPU-util (observability).
- Alert on multi-window burn rate, not on a single threshold: fast-page big burns, ticket slow ones.
- Every page links to a runbook (the troubleshooting runbook, operational runbooks).
- Generate rules from an SLO-as-code spec in git; review SLOs with stakeholders (SRE and MLOps practices).
- Let the error budget arbitrate change velocity.
Failure modes¶
- SLOs on infra metrics (GPU temp) instead of user outcomes: green dashboards, unhappy users.
- Single-threshold alerts: flapping on blips, or missing slow steady burns.
- No error-budget policy: SLOs measured but never acted on.
- Latency SLI on averages, not histograms/quantiles: tail latency hidden.
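The last failure mode is easy to see with numbers. The Python sketch below uses made-up TTFT samples (90 fast requests, 10 slow ones) to compare the mean against a good-events ratio at the catalog's 500 ms reference target. `good_ratio` is a hypothetical helper, not part of any monitoring library.

```python
from statistics import mean
from typing import Sequence


def good_ratio(latencies_ms: Sequence[float], target_ms: float) -> float:
    """SLI as a ratio: requests at or under the target / all requests."""
    if not latencies_ms:
        raise ValueError("no samples")
    good = sum(1 for value in latencies_ms if value <= target_ms)
    return good / len(latencies_ms)


# Hypothetical sample: 90 requests at 100 ms, 10 requests at 3000 ms.
SAMPLE_TTFT_MS = [100.0] * 90 + [3000.0] * 10

if __name__ == "__main__":
    target_ms = 500.0
    print(f"mean TTFT: {mean(SAMPLE_TTFT_MS):.0f} ms")
    print(f"good ratio: {good_ratio(SAMPLE_TTFT_MS, target_ms):.0%}")
```

The mean sits under the 500 ms target even though one request in ten is six times over it, so an average-based SLI would report healthy while a 99% ratio SLO is clearly missed.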
Open questions & validation¶
- Confirm the inference engine's exact metric names (vLLM vllm:*) and histogram buckets.
- Set real SLO targets and burn thresholds with stakeholders; validate alerts fire on a synthetic burn.
- Decide the error-budget policy: what freezes when the budget is spent (SRE and MLOps practices).
References¶
- Google SRE Workbook — Alerting on SLOs (burn rate): https://sre.google/workbook/alerting-on-slos/
- OpenSLO spec: https://github.com/OpenSLO/OpenSLO · Sloth (SLO generator): https://sloth.dev/
- vLLM production metrics: https://docs.vllm.ai/en/latest/usage/metrics.html
- Prometheus histogram_quantile: https://prometheus.io/docs/prometheus/latest/querying/functions/#histogram_quantile
Related: Inference · Observability · Reliability · Telemetry · Practices · Runbooks · Glossary