Multi-window burn-rate alerting
Scope: a reusable multi-window, multi-burn-rate SLO alerting pattern. It covers how to derive burn rate from an error budget, the fast and slow window pairing, the Prometheus recording and alerting rules, and the tuning that trades alert noise against miss rate. It lifts and generalizes the burn-rate section of the SLO/SLI catalog so that any SLI (inference availability, latency, job success, GPU availability) can reuse one rule shape.
What it is
Burn-rate alerting pages on how fast you are spending the error budget, not on a raw error-rate threshold. Definitions, consistent with the catalog:
- Error budget = `1 - SLO`. For a 99.9% SLO the budget is 0.1% of events over the SLO window (commonly 30 days).
- Burn rate = `observed_error_rate / error_budget`. Burn `1x` exhausts the entire window's budget exactly at window end; burn `14.4x` exhausts it in `30d / 14.4 ≈ 50h`, i.e. consumes ~2% of a 30-day budget in 1 hour.
- Multi-window = require both a short and a long window to exceed the same burn threshold (logical `AND`). The long window guarantees the burn is sustained (precision); the short window resets the alert quickly once the incident clears (low reset time).
- Multi-burn-rate = run several `(burn, window-pair)` tiers in parallel: a fast tier pages on sharp burns, a slow tier tickets on shallow steady burns.
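The burn-rate definition can be checked with a few lines of arithmetic. This is a minimal sketch, assuming a 30-day SLO window and steady traffic; the function names are illustrative and not part of any library.

```python
SLO_WINDOW_HOURS = 30 * 24  # 30-day SLO window, as in the definitions above


def burn_rate(observed_error_rate: float, slo: float) -> float:
    """Burn rate = observed error rate divided by the error budget (1 - SLO)."""
    return observed_error_rate / (1.0 - slo)


def hours_to_exhaustion(burn: float, window_hours: float = SLO_WINDOW_HOURS) -> float:
    """Hours until the whole window's budget is spent at a constant burn."""
    return window_hours / burn


if __name__ == "__main__":
    r = burn_rate(observed_error_rate=0.0144, slo=0.999)
    print(f"burn rate: {r:.1f}x")
    print(f"hours to exhaustion: {hours_to_exhaustion(r):.0f}")
```

A 1.44% error rate against a 99.9% SLO is a 14.4x burn, which matches the 50-hour exhaustion figure quoted above.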
The canonical tiers (Google SRE Workbook, Table 5-8) for a 99.9% SLO:
| Severity | Long window | Short window | Burn rate | Budget consumed when it fires |
|---|---|---|---|---|
| Page | 1 h | 5 m | 14.4x | ~2% |
| Page | 6 h | 30 m | 6x | ~5% |
| Ticket | 3 d | 6 h | 1x | ~10% |
A common extension adds a 3x (1 d / 2 h) ticket tier between the 6x page and the 1x ticket for slow-burn coverage; it is optional, not in the canonical table.
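The "budget consumed" column follows from `burn * long_window / SLO_window`. The sketch below recomputes it for the three canonical tiers, assuming a 30-day SLO window; the `Tier` class is a hypothetical helper for illustration.

```python
from dataclasses import dataclass

SLO_WINDOW_MIN = 30 * 24 * 60  # 30-day SLO window in minutes


@dataclass(frozen=True)
class Tier:
    severity: str
    long_min: int   # long window, minutes
    short_min: int  # short window, minutes
    burn: float

    def budget_consumed(self) -> float:
        """Budget fraction spent if the burn holds for one full long window."""
        return self.burn * self.long_min / SLO_WINDOW_MIN


TIERS = [
    Tier("page", 60, 5, 14.4),
    Tier("page", 6 * 60, 30, 6.0),
    Tier("ticket", 3 * 24 * 60, 6 * 60, 1.0),
]

for t in TIERS:
    print(f"{t.severity:<6} {t.burn:>5}x  long/short={t.long_min / t.short_min:.0f}  "
          f"consumed={t.budget_consumed():.0%}")
```

Each tier also keeps the long window at 12 times the short window.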
Why it matters
A single-threshold alert (e.g. "error rate > 1%") forces a bad choice. Set it tight and you flap on every transient blip and train responders to ignore the page. Set it loose and a slow, steady burn (2x for two weeks) never trips, silently draining the budget until the SLO is already missed. Neither reacts proportionally to budget impact.
Burn-rate alerting fixes both:
- Precision: each tier fires only after a fixed, meaningful fraction of the budget is already gone (~2%, ~5%, ~10% in the canonical tiers), so a page corresponds to real budget impact rather than a transient blip.
- Recall: the slow 1x/3d tier catches low-grade chronic burns that the fast tiers miss.
- Reset time: the short window in the `AND` drops the alert within roughly one short window of recovery (minutes for the fast tier) instead of waiting for the long window to drain, so responders are not paged for a problem that has already self-healed.
- Reusability: the same rule shape applies to any SLI; only the `error_query` and `objective` change.
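The slow-burn failure mode of a flat threshold can be made concrete. This sketch assumes a 99.9% SLO over 30 days and the steady 2x burn for two weeks mentioned above. Because the burn is steady, the short and long windows see the same ratio, so one comparison stands in for the `AND`.

```python
SLO = 0.999
BUDGET = 1.0 - SLO
SLO_WINDOW_DAYS = 30

error_rate = 2 * BUDGET  # steady 2x burn (0.2% of requests failing)
days = 14

# Flat alert from the text: "error rate > 1%".
flat_threshold_fires = error_rate > 0.01
# 1x slow tier: error ratio must exceed 1 * budget on both windows.
slow_tier_fires = error_rate > 1 * BUDGET
# Budget spent after two weeks at this burn.
budget_spent = (error_rate / BUDGET) * days / SLO_WINDOW_DAYS

print(f"flat threshold fires: {flat_threshold_fires}")
print(f"slow burn tier fires: {slow_tier_fires}")
print(f"budget spent after {days} days: {budget_spent:.0%}")
```

The flat threshold stays silent while most of the budget drains; the 1x tier raises a ticket.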
When it is needed (and when not)
Needed when:
- You have a user-facing SLI with a numeric SLO and an error-budget policy (see SLO/SLI catalog).
- The signal has enough event volume that a 5-minute window is statistically meaningful (high-QPS inference gateways, see inference serving, serving OSS models).
- Page fatigue or missed slow burns are observed with flat-threshold alerts.
Not needed / not appropriate when:
- Low event volume. At a few requests per minute, short-window ratios are dominated by noise; prefer longer windows, count-based alerts, or synthetic probes. Gate fast tiers with a minimum-traffic clause (shown below).
- Binary infra faults: a GPU falling off the bus, a fabric-manager crash, a failed health gate (GPU health gating). Page directly on the condition; there is no budget to burn.
- Capacity/saturation trends: alert on forecast/headroom, not error budget.
- Pre-production / bring-up, where SLOs are not yet committed (workload bring-up recipes, smoke tests).
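To see why low event volume makes short windows noisy, count how many failed requests it takes to push a window over the fast-tier threshold. This is a rough sketch assuming a 99.9% SLO and the 14.4x tier; `failures_to_trip` is an illustrative helper, and the request rates are made-up examples.

```python
import math

THRESHOLD = 14.4 * (1.0 - 0.999)  # fast-tier error-ratio threshold (~0.0144)


def failures_to_trip(requests_in_window: int, threshold: float = THRESHOLD) -> int:
    """Smallest number of failed requests whose ratio exceeds the threshold."""
    return math.floor(requests_in_window * threshold) + 1


for label, per_minute in (("3 req/min", 3), ("20 req/s", 20 * 60)):
    in_5m = per_minute * 5
    in_1h = per_minute * 60
    print(f"{label}: 5m window trips at {failures_to_trip(in_5m)} failure(s), "
          f"1h window at {failures_to_trip(in_1h)}")
```

At a few requests per minute a single failure already exceeds the 5-minute threshold, which is the case the minimum-traffic clause is meant to exclude.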
How: implement, integrate, maintain
```mermaid
flowchart LR
    SLI["SLI: good/total events"] --> ERR["error ratio = 1 - good/total"]
    ERR --> SHORT["short window (5m/30m/6h)"]
    ERR --> LONG["long window (1h/6h/3d)"]
    SHORT --> AND{"both > burn x budget?"}
    LONG --> AND
    AND -->|"fast tier 14.4x"| PAGE["page -> on-call"]
    AND -->|"slow tier 1x"| TICKET["ticket -> backlog"]
    AND -->|"no"| OK["no alert"]
    PAGE --> RUNBOOK["runbook"]
```
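The decision in the flowchart can be sketched as a small routing function. This is an illustration only, assuming per-window error ratios are already computed and keyed by window name; `route` and the dictionary layout are hypothetical, not part of Prometheus.

```python
# Tiers from the canonical table: (burn, long window, short window, route).
TIERS = [
    (14.4, "1h", "5m", "page"),
    (6.0, "6h", "30m", "page"),
    (1.0, "3d", "6h", "ticket"),
]


def route(error_ratio: dict[str, float], objective: float = 0.999) -> str:
    """Return 'page', 'ticket', or 'none' for per-window error ratios."""
    budget = 1.0 - objective
    for burn, long_w, short_w, action in TIERS:
        threshold = burn * budget
        # Logical AND: both windows must exceed the same threshold.
        if error_ratio[long_w] > threshold and error_ratio[short_w] > threshold:
            return action
    return "none"


sharp = {w: 0.05 for w in ("5m", "30m", "1h", "6h", "3d")}
cleared = {"5m": 0.0, "30m": 0.0, "1h": 0.03, "6h": 0.005, "3d": 0.0005}
chronic = {w: 0.002 for w in ("5m", "30m", "1h", "6h", "3d")}

for name, ratios in (("sharp", sharp), ("cleared", cleared), ("chronic", chronic)):
    print(name, "->", route(ratios))
```

In the `cleared` case the 1h window is still above the fast threshold, but the 5m window has recovered, so the `AND` no longer holds and nothing fires.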
Implement: derive the burn threshold
For an SLO objective `o` (as a fraction, e.g. `0.999`), the error budget is `b = 1 - o`. A tier with burn rate `r` fires when the observed error ratio over a window exceeds `r * b`. Time to exhaust the whole SLO window's budget at burn `r` is `SLO_window / r`. Worked example, 99.9% SLO, 30-day window:
- `b = 0.001`. Fast page tier `r = 14.4` -> threshold `0.0144` error ratio. A sustained 14.4x burn exhausts the 30-day budget in `30d / 14.4 ≈ 50h`; by the time the 1h window crosses the threshold, ~2% of the budget is already spent.
- Slow ticket tier `r = 1` -> threshold `0.001`. A sustained 1x burn over 3 days has spent ~10% of the budget.
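The same `r * b` derivation applies to other objectives. The sketch below tabulates thresholds for a few example objectives (the 99.5% and 90% values are illustrative, not from the catalog). It also flags tiers whose threshold reaches 1: an error ratio cannot exceed 1, so such a tier can never fire.

```python
BURN_RATES = (14.4, 6.0, 1.0)


def tier_thresholds(objective: float) -> dict[float, float]:
    """Map each burn rate r to its error-ratio threshold r * (1 - objective)."""
    budget = 1.0 - objective
    return {burn: burn * budget for burn in BURN_RATES}


for objective in (0.999, 0.995, 0.9):
    for burn, threshold in tier_thresholds(objective).items():
        note = "" if threshold < 1.0 else "  (unreachable: error ratio cannot exceed 1)"
        print(f"SLO {objective:.1%}  burn {burn:>4}x  threshold {threshold:.4f}{note}")
```

For loose objectives the fast tier's burn rate has to be lowered so that the threshold stays below 1.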
Integrate: Prometheus recording + alerting rules
Record the error ratio for each window once, then have every tier reference the recordings. Generalize with service/slo labels so one ruleset covers many SLIs. The example below is inference availability at a 99.9% SLO; to retarget another SLI, swap the `expr` of the `error_ratio_rate*` records.
```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: slo-burn-inference-availability
  namespace: monitoring
  labels: { release: kube-prometheus-stack }
spec:
  groups:
    - name: slo.inference.availability.records
      interval: 30s
      rules:
        # Reusable shape: error ratio = 1 - good/total, per window.
        # Retarget any SLI by editing only these exprs.
        # The "> 0" filter drops the sample when a window has no traffic,
        # so an empty window yields no series (and no alert).
        - record: slo:inference_availability:error_ratio_rate5m
          expr: |
            1 - (
              sum(rate(vllm:request_success_total[5m]))
              / (sum(rate(vllm:e2e_request_latency_seconds_count[5m])) > 0)
            )
        - record: slo:inference_availability:error_ratio_rate1h
          expr: |
            1 - (
              sum(rate(vllm:request_success_total[1h]))
              / (sum(rate(vllm:e2e_request_latency_seconds_count[1h])) > 0)
            )
        - record: slo:inference_availability:error_ratio_rate30m
          expr: |
            1 - (
              sum(rate(vllm:request_success_total[30m]))
              / (sum(rate(vllm:e2e_request_latency_seconds_count[30m])) > 0)
            )
        - record: slo:inference_availability:error_ratio_rate6h
          expr: |
            1 - (
              sum(rate(vllm:request_success_total[6h]))
              / (sum(rate(vllm:e2e_request_latency_seconds_count[6h])) > 0)
            )
        - record: slo:inference_availability:error_ratio_rate3d
          expr: |
            1 - (
              sum(rate(vllm:request_success_total[3d]))
              / (sum(rate(vllm:e2e_request_latency_seconds_count[3d])) > 0)
            )
        # Request rate gates fast tiers off when traffic is too low to be meaningful.
        - record: slo:inference_availability:req_rate5m
          expr: sum(rate(vllm:e2e_request_latency_seconds_count[5m]))
    - name: slo.inference.availability.alerts
      rules:
        # objective 0.999 -> budget 0.001. threshold = burn * 0.001
        - alert: InferenceAvailabilityFastBurn # 14.4x, ~2% budget in 1h
          expr: |
            slo:inference_availability:error_ratio_rate1h > (14.4 * 0.001)
            and slo:inference_availability:error_ratio_rate5m > (14.4 * 0.001)
            and slo:inference_availability:req_rate5m > 1
          for: 2m
          labels: { severity: critical, slo: inference_availability }
          annotations:
            summary: "Inference availability burning budget at 14.4x"
            runbook: "runbook-inference-slo-breach.md"
        - alert: InferenceAvailabilityMidBurn # 6x, ~5% budget in 6h
          expr: |
            slo:inference_availability:error_ratio_rate6h > (6 * 0.001)
            and slo:inference_availability:error_ratio_rate30m > (6 * 0.001)
          for: 5m
          labels: { severity: critical, slo: inference_availability }
          annotations:
            summary: "Inference availability burning budget at 6x"
            runbook: "runbook-inference-slo-breach.md"
        - alert: InferenceAvailabilitySlowBurn # 1x, ~10% budget in 3d
          expr: |
            slo:inference_availability:error_ratio_rate3d > (1 * 0.001)
            and slo:inference_availability:error_ratio_rate6h > (1 * 0.001)
          for: 15m
          labels: { severity: warning, slo: inference_availability }
          annotations:
            summary: "Inference availability slow-burning error budget (1x)"
            runbook: "runbook-inference-slo-breach.md"
```
Notes on the rules:
- The `> 0` filter on the denominator prevents divide-by-zero and `NaN` ratios: a window with no traffic produces no sample, so no tier can fire on it. Avoid flooring the denominator with `clamp_min(..., 1)`. On a per-second rate it treats every window below 1 req/s as if it carried 1 req/s, which inflates the error ratio (to 1.0 at zero traffic) and would fire the ungated tiers.
- `req_rate5m > 1` gates the fast page tier. Raise the floor to match real QPS so low-traffic blips do not page. Slow tiers omit it because their long windows already average out short-lived noise.
- `request_success_total` is an engine-completion signal, not true availability; measure at the HTTP/gateway boundary where possible (see the catalog's note). The rule shape is unchanged.
Validate the deployment
```shell
# 1. Lint the rules before applying. promtool reads plain rule files
#    (top-level "groups:"), not the PrometheusRule wrapper, so extract .spec
#    first (yq v4 syntax shown; any equivalent extraction works).
yq '.spec' burn-rate-rules.yaml > burn-rate-groups.yaml
promtool check rules burn-rate-groups.yaml
# 2. Apply and confirm the PrometheusRule object exists.
kubectl apply -f burn-rate-rules.yaml
kubectl -n monitoring get prometheusrule slo-burn-inference-availability
# 3. Confirm recording rules are producing series (Prometheus HTTP API;
#    assumes Prometheus is reachable on localhost:9090, e.g. via port-forward).
curl -sG http://localhost:9090/api/v1/query \
  --data-urlencode 'query=slo:inference_availability:error_ratio_rate5m' | jq '.data.result'
# 4. Unit-test the alert logic with synthetic samples (no live burn needed).
promtool test rules burn-rate-tests.yaml
```
```yaml
# burn-rate-tests.yaml -> promtool test rules
rule_files:
  - burn-rate-groups.yaml
evaluation_interval: 1m
tests:
  - interval: 1m
    input_series:
      # 5% error ratio sustained -> above 14.4x*0.001 (0.0144) on both windows
      - series: 'slo:inference_availability:error_ratio_rate5m'
        values: '0.05x90'
      - series: 'slo:inference_availability:error_ratio_rate1h'
        values: '0.05x90'
      - series: 'slo:inference_availability:req_rate5m'
        values: '50x90'
    alert_rule_test:
      - eval_time: 10m
        alertname: InferenceAvailabilityFastBurn
        exp_alerts:
          - exp_labels: { severity: critical, slo: inference_availability }
            # Annotations are compared too, so list the ones the rule sets.
            exp_annotations:
              summary: "Inference availability burning budget at 14.4x"
              runbook: "runbook-inference-slo-breach.md"
```
Maintain: tune noise vs miss rate
- Window ratio. Keep short = long / 12 (Google's guideline): 5m/1h, 30m/6h, 6h/3d. A shorter short window (larger long:short ratio) resets faster but lets brief spikes satisfy the `AND` more easily, admitting more false positives; a longer one is steadier but slower to clear.
- `for:` duration. A short `for` (1-2m) on fast tiers limits the added detection delay; a longer `for` on slow tiers suppresses flapping. Detection delay ≈ time for the long window to cross the threshold + `for`; reset delay is bounded by roughly the short window.
- Traffic gate. Tune the `req_rate` floor per service. Too low: low-volume noise pages. Too high: real low-traffic incidents are masked.
- Long-window cost. A `3d` window scans 3 days of samples on each evaluation. Alert rules should read the recording rule's single precomputed series rather than re-running the range query. The recording itself still rescans the range every `interval` (30s in the example), so consider a longer interval for the long-window records.
- Generate, don't hand-write. Declare SLOs in OpenSLO/Sloth and generate these rules from one spec in git; this eliminates per-tier arithmetic mistakes and keeps thresholds in sync with the objective (see SLO/SLI catalog).
- Validate against a real burn. Drive a synthetic burn (fault injection or a load test that fails a fraction of requests) and confirm the fast tier pages and clears; see runbook: inference SLO breach and runbook: MFU regression.
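When tuning windows and `for:`, a back-of-the-envelope detection delay helps. This is a simplified sketch, assuming steady traffic, a constant error rate that starts from a clean window, and no scrape or evaluation lag. Under those assumptions the long window's ratio grows linearly and crosses the threshold after `long_window * threshold / error_rate`. The function name is illustrative.

```python
import math


def detection_delay_min(error_rate: float, burn: float, objective: float,
                        long_window_min: float, for_min: float) -> float:
    """Approximate minutes from incident start until the alert fires."""
    threshold = burn * (1.0 - objective)
    if error_rate <= threshold:
        return math.inf  # the long window never exceeds the threshold
    # Time for the long window to cross the threshold, plus the `for:` hold.
    return long_window_min * threshold / error_rate + for_min


# Fast tier (14.4x, 1h long window, for: 2m) at a 99.9% SLO.
for error_rate in (1.0, 0.0288, 0.001):
    delay = detection_delay_min(error_rate, 14.4, 0.999, 60, 2)
    print(f"error rate {error_rate:.2%}: fires after {delay:.1f} min")
```

A full outage fires in about three minutes, most of it the `for:` hold, while a burn just above the threshold takes close to the whole long window.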
The threshold arithmetic (burn * budget) and rule structure are vendor-neutral and template-only here; the example metric names are vLLM's and shift between versions. Confirm names against your engine and wire the rules into your stack (telemetry and monitoring, observability). Not hardware-tested.
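As an illustration of how template-only the rule shape is, the sketch below generates the three alert expressions from an SLO name and objective. It is a hypothetical generator, not the output format of OpenSLO or Sloth. It assumes recordings named `slo:<name>:error_ratio_rate<window>` as in the example rules, and it omits the fast tier's traffic gate and the annotations for brevity.

```python
import json

TIERS = [
    # (alert suffix, burn, long window, short window, severity, for)
    ("FastBurn", 14.4, "1h", "5m", "critical", "2m"),
    ("MidBurn", 6.0, "6h", "30m", "critical", "5m"),
    ("SlowBurn", 1.0, "3d", "6h", "warning", "15m"),
]


def alert_rules(slo_name: str, alert_prefix: str, objective: float) -> list[dict]:
    """Build one alert rule per tier with threshold = burn * (1 - objective)."""
    budget = 1.0 - objective
    rules = []
    for suffix, burn, long_w, short_w, severity, hold in TIERS:
        threshold = f"{burn * budget:.6g}"  # rounded to hide float noise
        expr = (
            f"slo:{slo_name}:error_ratio_rate{long_w} > {threshold}\n"
            f"and slo:{slo_name}:error_ratio_rate{short_w} > {threshold}"
        )
        rules.append({
            "alert": f"{alert_prefix}{suffix}",
            "expr": expr,
            "for": hold,
            "labels": {"severity": severity, "slo": slo_name},
        })
    return rules


rules = alert_rules("inference_availability", "InferenceAvailability", 0.999)
print(json.dumps(rules, indent=2))
```

Changing the objective in one place updates every tier's threshold, which is the main reason to generate rather than hand-write.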
References
- Google SRE Workbook, Alerting on SLOs (multiwindow, multi-burn-rate): https://sre.google/workbook/alerting-on-slos/
- Prometheus, Alerting rules: https://prometheus.io/docs/prometheus/latest/configuration/alerting_rules/
- Prometheus, Recording rules: https://prometheus.io/docs/prometheus/latest/configuration/recording_rules/
- Prometheus, `promtool` unit testing for rules: https://prometheus.io/docs/prometheus/latest/configuration/unit_testing_rules/
- Prometheus Operator, PrometheusRule CRD: https://prometheus-operator.dev/docs/developer/alerting/
- OpenSLO spec: https://github.com/OpenSLO/OpenSLO
- Sloth (SLO/burn-rate generator): https://sloth.dev/
- vLLM production metrics: https://docs.vllm.ai/en/latest/usage/metrics.html
Related: SLO/SLI catalog · Telemetry · Observability · Inference serving · Serving OSS models · Runbook: inference SLO breach · GPU health gating · Glossary