Multi-window burn-rate alerting

Scope: a reusable multi-window, multi-burn-rate SLO alerting pattern. Covers how to derive burn rate from an error budget, the fast and slow window pairing, the Prometheus recording and alerting rules, and the tuning that trades alert noise against miss rate. It lifts and generalizes the burn-rate section of the SLO/SLI catalog so that any SLI (inference availability, latency, job success, GPU availability) can reuse one rule shape.

What it is

Burn-rate alerting pages on how fast you are spending the error budget, not on a raw error-rate threshold. Definitions, consistent with the catalog:

Error budget = 1 - SLO. For a 99.9% SLO the budget is 0.1% of events over the SLO window (commonly 30 days).
Burn rate = observed_error_rate / error_budget. Burn 1x exhausts the entire window's budget exactly at window end; burn 14.4x exhausts it in 30d / 14.4 ≈ 50h, i.e. consumes ~2% of a 30-day budget in 1 hour.
Multi-window = require both a short and a long window to exceed the same burn threshold (logical AND). The long window guarantees the burn is sustained (precision); the short window resets the alert quickly once the incident clears (low reset time).
Multi-burn-rate = run several (burn, window-pair) tiers in parallel: a fast tier pages on sharp burns, a slow tier tickets on shallow steady burns.
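
The definitions above reduce to three one-line formulas. A minimal Python sketch, assuming a 30-day SLO window and uniform traffic; the function names are illustrative and not taken from the catalog:

```python
SLO_WINDOW_HOURS = 30 * 24  # assumption: 30-day SLO window


def burn_rate(error_rate: float, objective: float) -> float:
    """Observed error rate divided by the error budget (1 - objective)."""
    return error_rate / (1.0 - objective)


def hours_to_exhaust(burn: float, window_hours: float = SLO_WINDOW_HOURS) -> float:
    """Time until the whole budget is gone if the burn rate stays constant."""
    return window_hours / burn


def budget_consumed(burn: float, elapsed_hours: float,
                    window_hours: float = SLO_WINDOW_HOURS) -> float:
    """Fraction of the SLO-window budget spent after elapsed_hours at this burn."""
    return burn * elapsed_hours / window_hours


if __name__ == "__main__":
    burn = burn_rate(error_rate=0.0144, objective=0.999)
    print(f"burn rate          : {burn:.1f}x")
    print(f"hours to exhaustion: {hours_to_exhaust(burn):.0f}")
    print(f"budget spent in 1h : {budget_consumed(burn, 1):.1%}")
```

Feeding in a 1.44% error rate against a 99.9% objective reproduces the 14.4x, 50 h and 2% figures quoted in the burn-rate definition.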

The canonical tiers (Google SRE Workbook, Table 5-8) for a 99.9% SLO:

Severity	Long window	Short window	Burn rate	Budget consumed when it fires
Page	1 h	5 m	14.4x	~2%
Page	6 h	30 m	6x	~5%
Ticket	3 d	6 h	1x	~10%

A common extension adds a 3x (1 d / 2 h) ticket tier between the 6x page and the 1x ticket for slow-burn coverage; it is optional, not in the canonical table.
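
The "budget consumed" column follows from the burn rate and the long window. As a worked check, assuming a 30-day SLO window ($W = 720$ h), with $w$ the long window, $r$ the burn rate and $f$ the fraction of budget consumed once a burn of $r$ has lasted for $w$:

```latex
\[
f = \frac{r\,w}{W}
\qquad\Longleftrightarrow\qquad
r = \frac{f\,W}{w}
\]
\[
\begin{aligned}
r = 14.4,\ w = 1\ \text{h}:  &\quad f = \frac{14.4 \times 1}{720} = 0.02 \\
r = 6,\ w = 6\ \text{h}:     &\quad f = \frac{6 \times 6}{720} = 0.05 \\
r = 1,\ w = 72\ \text{h}:    &\quad f = \frac{1 \times 72}{720} = 0.10 \\
r = 3,\ w = 24\ \text{h}:    &\quad f = \frac{3 \times 24}{720} = 0.10
\end{aligned}
\]
```

The same relation gives the burn rate for any other (budget fraction, window) pair, so a different SLO window changes $r$ but not the shape of the table.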

Why it matters

A single-threshold alert (e.g. "error rate > 1%") forces a bad choice. Set it tight and you flap on every transient blip and train responders to ignore the page. Set it loose and a slow, steady burn (2x for two weeks) never trips, silently draining the budget until the SLO is already missed. Neither reacts proportionally to budget impact.

Burn-rate alerting fixes both:

Precision: each tier fires only after a fixed, meaningful fraction of the budget is already gone, so a page always means a real, budget-relevant problem.
Recall: the slow 1x/3d tier catches low-grade chronic burns that fast tiers miss.
Reset time: the short window in the AND drops the alert minutes after recovery, not hours; responders are not paged on a problem that already self-healed.
Reusability: the same rule shape applies to any SLI; only the error_query and objective change.
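
The reset-time claim can be checked with a small simulation. This is a sketch over a synthetic per-minute error series, not production data: the 30-minute incident, the 50% error ratio and the helper names are all illustrative assumptions. It compares an alert on the 1h window alone with the 1h AND 5m pairing.

```python
def window_mean(series, t, width):
    """Mean of the last `width` samples ending at index t (inclusive)."""
    window = series[max(0, t - width + 1): t + 1]
    return sum(window) / len(window)


def firing_minutes(series, threshold, long_w, short_w=None):
    """Minutes at which the alert condition holds.

    With short_w set, both windows must exceed the threshold (logical AND);
    with short_w=None only the long window is checked.
    """
    fired = []
    for t in range(len(series)):
        long_ok = window_mean(series, t, long_w) > threshold
        short_ok = short_w is None or window_mean(series, t, short_w) > threshold
        if long_ok and short_ok:
            fired.append(t)
    return fired


# Synthetic per-minute error ratio: clean, then a 30-minute incident at 50%
# errors (minutes 60-89), then clean again. Values are illustrative.
errors = [0.5 if 60 <= minute < 90 else 0.0 for minute in range(240)]
threshold = 14.4 * 0.001  # fast tier for a 99.9% SLO

long_only = firing_minutes(errors, threshold, long_w=60)
multi_window = firing_minutes(errors, threshold, long_w=60, short_w=5)

if __name__ == "__main__":
    recovered_at = 90  # first clean minute after the incident
    print("long window only: clears",
          long_only[-1] + 1 - recovered_at, "min after recovery")
    print("long AND short  : clears",
          multi_window[-1] + 1 - recovered_at, "min after recovery")
```

Under these assumptions both variants start firing at the same minute, but the long-window-only alert stays active for roughly an hour after recovery while the AND with the 5-minute window clears within a few minutes.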

When it is needed (and when not)

Needed when:

You have a user-facing SLI with a numeric SLO and an error-budget policy (see SLO/SLI catalog).
The signal has enough event volume that a 5-minute window is statistically meaningful (high-QPS inference gateways, see inference serving, serving OSS models).
Page fatigue or missed slow burns are observed with flat-threshold alerts.

Not needed / not appropriate when:

Low event volume. At a few requests per minute, short-window ratios are dominated by noise; prefer longer windows, count-based alerts, or synthetic probes. Gate fast tiers with a minimum-traffic clause (shown below).
Binary infra faults: a GPU falling off the bus, a fabric-manager crash, a failed health gate (GPU health gating). Page directly on the condition; there is no budget to burn.
Capacity/saturation trends: alert on forecast/headroom, not error budget.
Pre-production / bring-up, where SLOs are not yet committed (workload bring-up recipes, smoke tests).
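
To see why low event volume breaks short windows, it helps to compute the burn rate implied by a single failed request. A minimal sketch, assuming uniform traffic; the function name and the sample request rates are illustrative:

```python
def burn_from_one_failure(requests_per_second: float, window_seconds: float,
                          objective: float) -> float:
    """Burn rate implied by exactly one failed request in the window."""
    requests_in_window = requests_per_second * window_seconds
    error_ratio = 1.0 / requests_in_window
    return error_ratio / (1.0 - objective)


if __name__ == "__main__":
    for label, rps in [("2 req/min", 2 / 60), ("1 req/s", 1.0), ("100 req/s", 100.0)]:
        burn = burn_from_one_failure(rps, window_seconds=300, objective=0.999)
        print(f"{label:>10}: one failure in 5m = {burn:8.2f}x burn")
```

At 2 requests per minute a single failure in the 5-minute window already reads as a burn far above 14.4x, and two failures in an hour (2 of 120 requests) are enough to push the 1h window over 0.0144 as well. At 100 req/s the same single failure is well below 1x, which is why the fast tier is only meaningful with a traffic floor.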

How: implement, integrate, maintain

flowchart LR
  SLI["SLI: good/total events"] --> ERR["error ratio = 1 - good/total"]
  ERR --> SHORT["short window (5m/30m/6h)"]
  ERR --> LONG["long window (1h/6h/3d)"]
  SHORT --> AND{"both > burn x budget?"}
  LONG --> AND
  AND -->|"fast tier 14.4x"| PAGE["page -> on-call"]
  AND -->|"slow tier 1x"| TICKET["ticket -> backlog"]
  AND -->|"no"| OK["no alert"]
  PAGE --> RUNBOOK["runbook"]

Implement: derive the burn threshold

For an SLO objective o (as a fraction, e.g. 0.999), the error budget is b = 1 - o. A tier with burn rate r fires when the observed error ratio over a window exceeds r * b. Time to exhaust the whole SLO window at burn r is SLO_window / r. Worked example, 99.9% SLO, 30-day window:

b = 0.001. Fast page tier r = 14.4 -> threshold 0.0144 error ratio. Exhausts the 30-day budget in 30d / 14.4 ≈ 50h; the 1h window crossing it means ~2% of budget already spent.
Slow ticket tier r = 1 -> threshold 0.001. A sustained 1x burn over 3 days has spent ~10% of the budget.
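
The same arithmetic can be tabulated for every tier and for other objectives. A minimal sketch; the Tier dataclass and the 99.5% objective are illustrative additions, while the tier values are the canonical ones from the table earlier on this page:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tier:
    severity: str
    long_window: str
    short_window: str
    burn: float


# Canonical tiers (severity, long window, short window, burn rate).
TIERS = [
    Tier("page", "1h", "5m", 14.4),
    Tier("page", "6h", "30m", 6.0),
    Tier("ticket", "3d", "6h", 1.0),
]


def error_ratio_threshold(objective: float, burn: float) -> float:
    """Error ratio at which a tier fires: r * b, with b = 1 - o."""
    return burn * (1.0 - objective)


if __name__ == "__main__":
    for objective in (0.999, 0.995):
        print(f"objective {objective}")
        for tier in TIERS:
            threshold = error_ratio_threshold(objective, tier.burn)
            print(f"  {tier.severity:<6} {tier.long_window:>2}/{tier.short_window:<3} "
                  f"burn {tier.burn:>4}x -> error ratio > {threshold:.4f}")
```

Only the objective changes between SLIs; the burn rates and window pairs stay fixed, which is what makes the rule shape reusable.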

Integrate: Prometheus recording + alerting rules

Record the error ratio at every window once, then have all tiers reference the recordings. Generalize by service/slo labels so one ruleset covers many SLIs. The example below is inference availability at a 99.9% SLO; swap the expr of the :error_ratio_rate* records to retarget any SLI.

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: slo-burn-inference-availability
  namespace: monitoring
  labels: { release: kube-prometheus-stack }
spec:
  groups:
    - name: slo.inference.availability.records
      interval: 30s
      rules:
        # Reusable shape: error ratio = 1 - good/total, per window.
        # Retarget any SLI by editing only these exprs.
        # The "> 0" filter drops idle windows, so no sample is recorded instead of
        # a false error ratio of 1 (which clamping the denominator to 1 would give).
        - record: slo:inference_availability:error_ratio_rate5m
          expr: |
            1 - (
              sum(rate(vllm:request_success_total[5m]))
              / (sum(rate(vllm:e2e_request_latency_seconds_count[5m])) > 0)
            )
        - record: slo:inference_availability:error_ratio_rate1h
          expr: |
            1 - (
              sum(rate(vllm:request_success_total[1h]))
              / (sum(rate(vllm:e2e_request_latency_seconds_count[1h])) > 0)
            )
        - record: slo:inference_availability:error_ratio_rate30m
          expr: |
            1 - (
              sum(rate(vllm:request_success_total[30m]))
              / (sum(rate(vllm:e2e_request_latency_seconds_count[30m])) > 0)
            )
        - record: slo:inference_availability:error_ratio_rate6h
          expr: |
            1 - (
              sum(rate(vllm:request_success_total[6h]))
              / (sum(rate(vllm:e2e_request_latency_seconds_count[6h])) > 0)
            )
        - record: slo:inference_availability:error_ratio_rate3d
          expr: |
            1 - (
              sum(rate(vllm:request_success_total[3d]))
              / (sum(rate(vllm:e2e_request_latency_seconds_count[3d])) > 0)
            )
        # Request rate gates fast tiers off when traffic is too low to be meaningful.
        - record: slo:inference_availability:req_rate5m
          expr: sum(rate(vllm:e2e_request_latency_seconds_count[5m]))

    - name: slo.inference.availability.alerts
      rules:
        # objective 0.999 -> budget 0.001. threshold = burn * 0.001
        - alert: InferenceAvailabilityFastBurn          # 14.4x, ~2% budget in 1h
          expr: |
            slo:inference_availability:error_ratio_rate1h  > (14.4 * 0.001)
            and slo:inference_availability:error_ratio_rate5m > (14.4 * 0.001)
            and slo:inference_availability:req_rate5m > 1
          for: 2m
          labels: { severity: critical, slo: inference_availability }
          annotations:
            summary: "Inference availability burning budget at 14.4x"
            runbook: "runbook-inference-slo-breach.md"
        - alert: InferenceAvailabilityMidBurn           # 6x, ~5% budget in 6h
          expr: |
            slo:inference_availability:error_ratio_rate6h  > (6 * 0.001)
            and slo:inference_availability:error_ratio_rate30m > (6 * 0.001)
          for: 5m
          labels: { severity: critical, slo: inference_availability }
          annotations:
            summary: "Inference availability burning budget at 6x"
            runbook: "runbook-inference-slo-breach.md"
        - alert: InferenceAvailabilitySlowBurn          # 1x, ~10% budget in 3d
          expr: |
            slo:inference_availability:error_ratio_rate3d > (1 * 0.001)
            and slo:inference_availability:error_ratio_rate6h > (1 * 0.001)
          for: 15m
          labels: { severity: warning, slo: inference_availability }
          annotations:
            summary: "Inference availability slow-burning error budget (1x)"
            runbook: "runbook-inference-slo-breach.md"

Notes on the rules:

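Zero-traffic guard. Filter the denominator with > 0 rather than clamp_min(..., 1): with the 1 - good/total shape, clamping the request rate to 1 req/s turns an idle window into an error ratio of 1 and overstates errors whenever traffic is below 1 req/s. With the > 0 filter an idle window produces no sample, so no tier can fire on it.
req_rate5m > 1 gates the fast page tier. Raise the floor to match real QPS so low-traffic blips do not page. The 6x and 1x tiers omit it because their longer windows average over far more events.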
request_success_total is an engine-completion signal, not true availability; measure at the HTTP/gateway boundary where possible (see the catalog's note). The rule shape is unchanged.
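
Since the three alert expressions differ only in burn rate, window pair and the optional traffic gate, they can be produced from one function. This is a hand-rolled sketch, not the output format of OpenSLO or Sloth; alert_expr and its parameters are hypothetical names, and the record naming simply mirrors the example rules above.

```python
from typing import Optional


def alert_expr(slo: str, objective: float, burn: float,
               long_window: str, short_window: str,
               min_req_rate: Optional[float] = None) -> str:
    """Build one tier's multi-window expression from the objective."""
    # Round so the budget prints as 0.001 rather than 0.0010000000000000009.
    budget = round(1.0 - objective, 10)
    threshold = f"({burn:g} * {budget:g})"
    prefix = f"slo:{slo}:error_ratio_rate"
    clauses = [
        f"{prefix}{long_window} > {threshold}",
        f"{prefix}{short_window} > {threshold}",
    ]
    if min_req_rate is not None:
        # Optional traffic gate on the short window's request rate.
        clauses.append(f"slo:{slo}:req_rate{short_window} > {min_req_rate:g}")
    return "\nand ".join(clauses)


if __name__ == "__main__":
    tiers = [  # (burn, long window, short window, traffic floor)
        (14.4, "1h", "5m", 1.0),
        (6.0, "6h", "30m", None),
        (1.0, "3d", "6h", None),
    ]
    for burn, long_w, short_w, floor in tiers:
        print(alert_expr("inference_availability", 0.999, burn, long_w, short_w, floor))
        print()
```

Changing the objective in one place then updates every threshold, which removes the risk of editing one tier's arithmetic and forgetting another.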

Validate the deployment

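# 1. Lint the rules before applying. promtool reads a plain rule file
#    (top-level "groups:"), not the PrometheusRule CRD, so extract .spec first.
#    Assumes mikefarah yq v4; burn-rate-groups.yaml is an illustrative name.
yq '.spec' burn-rate-rules.yaml > burn-rate-groups.yaml
promtool check rules burn-rate-groups.yaml

# 2. Apply, then confirm the object exists and Prometheus loaded the groups.
#    Assumes Prometheus is reachable on localhost:9090 (e.g. via port-forward).
kubectl apply -f burn-rate-rules.yaml
kubectl -n monitoring get prometheusrule slo-burn-inference-availability
curl -s http://localhost:9090/api/v1/rules | jq -r '.data.groups[].name' | grep '^slo\.inference'

# 3. Confirm recording rules are producing series (Prometheus HTTP API).
curl -sG http://localhost:9090/api/v1/query \
  --data-urlencode 'query=slo:inference_availability:error_ratio_rate5m' | jq '.data.result'

# 4. Unit-test the alert logic with synthetic samples (no live burn needed).
promtool test rules burn-rate-tests.yaml

# burn-rate-tests.yaml  -> promtool test rules
# rule_files points at the plain rule file extracted from the CRD's .spec.
rule_files:
  - burn-rate-groups.yaml
evaluation_interval: 1m
tests:
  - interval: 1m
    input_series:
      # 5% error ratio sustained -> above 14.4x*0.001 (0.0144) on both windows
      - series: 'slo:inference_availability:error_ratio_rate5m'
        values: '0.05x90'
      - series: 'slo:inference_availability:error_ratio_rate1h'
        values: '0.05x90'
      - series: 'slo:inference_availability:req_rate5m'
        values: '50x90'
    alert_rule_test:
      - eval_time: 10m
        alertname: InferenceAvailabilityFastBurn
        exp_alerts:
          # promtool compares annotations as well as labels, so list both.
          - exp_labels: { severity: critical, slo: inference_availability }
            exp_annotations:
              summary: "Inference availability burning budget at 14.4x"
              runbook: "runbook-inference-slo-breach.md"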

Maintain: tune noise vs miss rate

Window ratio. Keep short = long / 12 (Google's guideline): 5m/1h, 30m/6h, 6h/3d. A shorter short window (a larger long:short ratio) resets faster but admits more false positives; a longer one is steadier but slower to clear.
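for: duration. A short for (1-2m) on fast tiers keeps detection delay low; a longer for on slow tiers suppresses flapping. Detection delay ≈ time for the long window to cross the threshold + for. The crossing time depends on severity: about 52 s for the 1h window at 14.4x on a 99.9% SLO during a total outage, up to the full long window for a burn just above the tier's rate.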
Traffic gate. Tune req_rate floor per service. Too low: low-volume noise pages. Too high: real low-traffic incidents are masked.
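Long-window cost. A 3d range scans 3 days of samples on every evaluation, and the recording rule does not change that; it only ensures the scan runs once per interval rather than once per alert, so alert rules read a cheap single series. If evaluation cost matters, move the long-window records into their own group with an interval longer than 30s.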
Generate, don't hand-write. Declare SLOs in OpenSLO/Sloth and generate these rules from one spec in git; it eliminates per-tier arithmetic mistakes and keeps thresholds in sync with the objective (see SLO/SLI catalog).
Validate against a real burn. Drive a synthetic burn (fault injection or a load test that fails a fraction of requests) and confirm the fast tier pages and clears; see runbook: inference SLO breach and runbook: MFU regression.
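
When tuning for: and window sizes it helps to estimate how long a tier takes to fire at a given severity. A minimal sketch under simplifying assumptions (no errors before the incident, a constant error ratio during it, uniform traffic, evaluation interval ignored); the function name is illustrative:

```python
def seconds_to_detect(error_ratio: float, objective: float, burn: float,
                      long_window_s: float, for_s: float = 0.0):
    """Seconds from incident start until the tier fires, or None if it never does.

    The long-window error ratio grows linearly as the incident fills the
    window, so it crosses the threshold after threshold * window / error_ratio.
    The short window crosses earlier, so the long window sets the delay.
    """
    threshold = burn * (1.0 - objective)
    if error_ratio <= threshold:
        return None  # the window mean can never exceed the threshold
    return threshold * long_window_s / error_ratio + for_s


if __name__ == "__main__":
    for ratio in (1.0, 0.05, 0.01):
        delay = seconds_to_detect(ratio, objective=0.999, burn=14.4,
                                  long_window_s=3600, for_s=120)
        shown = "never (below this tier's threshold)" if delay is None else f"{delay:.0f} s"
        print(f"error ratio {ratio:>5.0%}: fast tier fires after {shown}")
```

An error ratio below a tier's threshold never fires that tier at all, which is the gap the slower tiers are there to cover.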

The threshold arithmetic (burn * budget) and rule structure are vendor-neutral and template-only here; the example metric names are vLLM's and shift between versions. Confirm names against your engine and wire the rules into your stack (telemetry and monitoring, observability). Not hardware-tested.

References

Google SRE Workbook — Alerting on SLOs (multiwindow, multi-burn-rate): https://sre.google/workbook/alerting-on-slos/
Prometheus — Alerting rules: https://prometheus.io/docs/prometheus/latest/configuration/alerting_rules/
Prometheus — Recording rules: https://prometheus.io/docs/prometheus/latest/configuration/recording_rules/
Prometheus — promtool unit testing for rules: https://prometheus.io/docs/prometheus/latest/configuration/unit_testing_rules/
Prometheus Operator — PrometheusRule CRD: https://prometheus-operator.dev/docs/developer/alerting/
OpenSLO spec: https://github.com/OpenSLO/OpenSLO · Sloth (SLO/burn-rate generator): https://sloth.dev/
vLLM production metrics: https://docs.vllm.ai/en/latest/usage/metrics.html