SLO/SLI catalog & error-budget alerts

Scope: index and decision page for service-level indicators and objectives across GPU services. It frames the SLI → SLO → error-budget → burn-rate paradigm, holds the cross-domain catalog table, and points to the focused pages that carry the per-domain PromQL and the alerting rules. Together, these pages turn the SRE principles from SRE and MLOps practices into queries and rules wired to the telemetry stack (telemetry and monitoring).

Focused pages

Inference serving SLOs: use this when you need the availability/TTFT/TPOT/goodput SLIs, their vLLM PromQL, and the inference burn-rate rules.
Training platform SLOs: use this when you need job-success, queue-wait, GPU-availability, and checkpoint SLIs with their queries.
Cluster / fabric SLOs: use this when you need allocatable-ratio and fabric-error SLIs at the cluster and network-fabric layer.
Burn-rate alerting rules: use this when you need the multi-window multi-burn-rate PrometheusRule patterns, thresholds, and SLO-as-code generation.

Reference targets: the SLO values in the catalog are starting points; set your own with stakeholders. The PromQL on the focused pages assumes dcgm-exporter (telemetry and monitoring), kube-state-metrics, and the inference engine's metrics (vLLM exposes Prometheus metrics; the metric names used are vLLM's and may shift between versions). Note that vLLM's vllm:request_success_total is an engine-completion signal, not a full service success rate. True availability is best measured at the HTTP/gateway boundary, since requests can fail before reaching the engine.

flowchart LR
  SLI["SLI"] --> SLO["SLO"]
  SLO --> BUDGET["Error budget"]
  BUDGET --> BURN["Burn rate"]
  BURN --> ALERT["Page or ticket"]
  ALERT --> RUNBOOK["Runbook"]

Paradigm: SLI → SLO → error budget → burn rate

An SLI is a measured ratio of good events to total events (e.g. requests under the latency target / all requests). Keep SLIs user-facing.
An SLO is the target for an SLI over a window (e.g. 99.5% over 28 days).
The error budget is 1 − SLO: the fraction of events allowed to fail over the window. When it is spent, freeze risky changes and spend effort on reliability (SRE and MLOps practices).
Burn rate is how fast the budget is being consumed: burn = observed_error_rate / (1 − SLO). A sustained burn of 1× exhausts the budget exactly at window end; 14.4× exhausts a 30-day budget in ~2 days (30 / 14.4 ≈ 2.1). Multi-window multi-burn-rate alerts page fast on big burns and ticket on slow ones.
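
The arithmetic above can be sketched in a few lines of Python. This is a minimal illustration, not part of the alerting stack; the Slo class and the function names are hypothetical, and the 7.2% error ratio is an invented input chosen to land on the 14.4× figure.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Slo:
    """Hypothetical container for an SLO target and its window."""
    target: float       # e.g. 0.995 for 99.5%
    window_days: float  # e.g. 28 or 30

    @property
    def error_budget(self) -> float:
        # Error budget is 1 - SLO: the allowed fraction of bad events.
        return 1.0 - self.target


def burn_rate(observed_error_ratio: float, slo: Slo) -> float:
    # burn = observed_error_rate / (1 - SLO)
    return observed_error_ratio / slo.error_budget


def days_to_exhaustion(rate: float, slo: Slo) -> float:
    # At a constant burn rate, the whole budget lasts window / rate.
    if rate <= 0:
        return float("inf")
    return slo.window_days / rate


slo = Slo(target=0.995, window_days=30)
rate = burn_rate(0.072, slo)  # 7.2% errors against a 0.5% budget
print(f"burn rate: {rate:.1f}x")
print(f"budget exhausted in: {days_to_exhaustion(rate, slo):.2f} days")
```

With a 28-day window, as used in the catalog, the same burn rate exhausts the budget slightly sooner (28 / 14.4 ≈ 1.9 days).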

SLI/SLO catalog

Inference service

| SLI | Definition | Reference SLO |
| --- | --- | --- |
| Availability | successful / total requests | 99.5% / 28d |
| TTFT latency | requests with TTFT ≤ target / total | 99% ≤ 500 ms |
| TPOT latency | requests with per-token ≤ target / total | 99% ≤ 50 ms |
| Goodput | tokens served within SLO (inference serving) | tracked, not paged |
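
As an illustration of how a latency SLI such as TTFT becomes a good/total ratio, the sketch below reads it off cumulative histogram buckets, which is the shape Prometheus histograms expose (each bucket counts observations less than or equal to its upper bound). The function name and the bucket counts are invented for the example; the real queries live on the inference serving SLOs page.

```python
def latency_sli(cumulative_buckets: dict[float, int], target_s: float) -> float:
    """Fraction of requests at or under target_s.

    cumulative_buckets maps an upper bound in seconds to the cumulative
    count of observations <= that bound, and must include float("inf")
    (the total). The target has to coincide with a bucket boundary:
    the histogram cannot say exactly how many requests fell under a
    value that lies between two boundaries.
    """
    total = cumulative_buckets[float("inf")]
    if total == 0:
        # Assumption for this sketch: no traffic counts as meeting the SLO.
        return 1.0
    if target_s not in cumulative_buckets:
        raise ValueError(f"no bucket boundary at {target_s}s")
    return cumulative_buckets[target_s] / total


# Invented TTFT counts for one evaluation window.
ttft_buckets = {0.1: 700, 0.25: 900, 0.5: 985, 1.0: 998, float("inf"): 1000}

sli = latency_sli(ttft_buckets, target_s=0.5)
print(f"TTFT SLI: {sli:.3f}, meets 99% objective: {sli >= 0.99}")
```

This is why the histogram bucket boundaries matter: a 500 ms objective can only be measured exactly if the engine's histogram has a bucket edge at 0.5 s.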

Training platform

| SLI | Definition | Reference SLO |
| --- | --- | --- |
| Job success | jobs completing without infra failure | 99% / 28d |
| Scheduler queue wait | jobs scheduled ≤ target / total | 95% ≤ 10 min |
| GPU availability | healthy allocatable / total GPUs (reliability and RAS) | 99% / 28d |
| Checkpoint success | checkpoints written ok (storage and data) | 99.9% |

Cluster / fabric

| SLI | Definition | Reference SLO |
| --- | --- | --- |
| GPU allocatable ratio | schedulable / total GPUs | ≥ 98% |
| Fabric error rate | links without errors / total (networking fabric) | ≥ 99.9% |

SLIs as PromQL

The per-domain PromQL for each SLI in the catalog lives on its focused page: inference availability/TTFT/TPOT on inference serving SLOs, job-success/queue-wait/GPU-availability/checkpoint on training platform SLOs, and allocatable-ratio/fabric-error on cluster / fabric SLOs.

Multi-window burn-rate alert (the SRE pattern)

Record the error ratio, then alert when both a short and long window confirm a high burn. The full PrometheusRule patterns, the per-domain thresholds, and the SLO-as-code (OpenSLO/Sloth) generation live on burn-rate alerting rules.

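Burn-rate thresholds (long window/short window, action): 14.4× (1h/5m, page), 6× (6h/30m, page), 3× (1d/2h, ticket), 1× (3d/6h, ticket). The 14.4×, 6×, and 1× tiers are the Google SRE Workbook's recommended starting points for a 30-day window; the 3× (1d/2h) tier is an extra ticket tier commonly added by SLO generators such as Sloth.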
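
The decision logic behind these tiers can be sketched in plain Python. This is an illustration of the pattern only: the Tier class, the evaluate function, and the sample error ratios are hypothetical, and in production the same comparison is expressed as Prometheus alerting rules over recorded error ratios (see burn-rate alerting rules).

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Tier:
    burn: float        # burn-rate multiple of the error budget
    long_window: str   # window that proves the burn is significant
    short_window: str  # window that proves the burn is still happening
    action: str        # "page" or "ticket"


# Tiers from the text, ordered most severe first.
TIERS = [
    Tier(14.4, "1h", "5m", "page"),
    Tier(6.0, "6h", "30m", "page"),
    Tier(3.0, "1d", "2h", "ticket"),
    Tier(1.0, "3d", "6h", "ticket"),
]


def evaluate(error_ratio: dict[str, float], slo_target: float) -> Optional[Tier]:
    """Return the most severe tier whose long AND short windows both exceed
    burn * (1 - SLO), or None if no tier fires."""
    budget = 1.0 - slo_target
    for tier in TIERS:
        threshold = tier.burn * budget
        if (error_ratio[tier.long_window] > threshold
                and error_ratio[tier.short_window] > threshold):
            return tier
    return None


# Invented error ratios per window for a 99.5% SLO.
fast_burn = {"5m": 0.10, "30m": 0.09, "1h": 0.08, "2h": 0.05,
             "6h": 0.02, "1d": 0.01, "3d": 0.004}
slow_burn = dict.fromkeys(fast_burn, 0.006)
recovered = {"5m": 0.0, "30m": 0.01, "1h": 0.08, "2h": 0.04,
             "6h": 0.02, "1d": 0.004, "3d": 0.002}

for name, ratios in [("fast", fast_burn), ("slow", slow_burn), ("recovered", recovered)]:
    tier = evaluate(ratios, slo_target=0.995)
    print(name, "->", f"{tier.action} ({tier.burn}x)" if tier else "no alert")
```

The short window is what lets the alert reset quickly: in the recovered case the 1h ratio is still above the 14.4× threshold, but the 5m ratio has dropped to zero, so nothing fires.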

Don't-miss checklist

SLIs measure the user's experience (TTFT/TPOT/job-success), not raw GPU utilization (observability).
Alert on multi-window burn rate, not on a single threshold: fast-page big burns, ticket slow ones.
Every page links to a runbook (the troubleshooting runbook, operational runbooks).
Generate rules from an SLO-as-code spec in git; review SLOs with stakeholders (SRE and MLOps practices).
Let the error budget arbitrate change velocity.

Failure modes

SLOs on infra metrics (GPU temp) instead of user outcomes: green dashboards, unhappy users.
Single-threshold alerts: flapping on blips, or missing slow steady burns.
No error-budget policy: SLOs measured but never acted on.
Latency SLI on averages, not histograms/quantiles: tail latency hidden.
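
The last failure mode is easy to show with numbers. In the sketch below, the latency samples are invented: 95 fast requests and 5 very slow ones, judged against a 500 ms target and a 99% objective.

```python
import statistics

# Invented sample: 95 requests at 120 ms, 5 requests at 2500 ms.
latencies_ms = [120.0] * 95 + [2500.0] * 5
target_ms = 500.0

# The average sits comfortably under the target...
mean_ms = statistics.fmean(latencies_ms)

# ...while the good/total ratio shows 5% of users had a bad experience.
good_ratio = sum(1 for x in latencies_ms if x <= target_ms) / len(latencies_ms)

print(f"mean latency: {mean_ms:.0f} ms (target {target_ms:.0f} ms)")
print(f"requests within target: {good_ratio:.0%} (objective 99%)")
```

An alert on the mean would stay green here, while the ratio-based SLI misses its objective by four percentage points.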

Open questions & validation

Confirm the inference engine's exact metric names (vLLM vllm:*) and histogram buckets.
Set real SLO targets and burn thresholds with stakeholders; validate alerts fire on a synthetic burn.
Decide the error-budget policy: what freezes when the budget is spent (SRE and MLOps practices).

References

Google SRE Workbook — Alerting on SLOs (burn rate): https://sre.google/workbook/alerting-on-slos/
OpenSLO spec: https://github.com/OpenSLO/OpenSLO · Sloth (SLO generator): https://sloth.dev/
vLLM production metrics: https://docs.vllm.ai/en/latest/usage/metrics.html
Prometheus histogram_quantile: https://prometheus.io/docs/prometheus/latest/querying/functions/#histogram_quantile

Related: Inference · Observability · Reliability · Telemetry · Practices · Runbooks · Glossary