
SLOs for the training platform

Scope: training-platform SLOs distinct from inference (scheduler queue wait, job success rate, goodput/MFU, checkpoint success, and infra-failure rate), with PromQL SLIs and how they drive capacity and reliability work.

What it is

A training platform serves the ML engineer running jobs, whereas an inference service serves individual requests. The SLIs therefore measure batch outcomes (does my job get scheduled, run to completion, and make useful progress), not per-request latency. Five training-platform SLIs carry the load:

| SLI | What it measures | Good event |
| --- | --- | --- |
| Scheduler queue wait | time from submit to first GPU allocation | admitted within target |
| Job success rate | jobs ending without infra failure | exit not caused by platform |
| Goodput / MFU | useful FLOPs vs. wall-clock peak | step committed an optimizer update |
| Checkpoint success | checkpoint writes that durably persist | write completed and fsynced |
| Infra-failure rate | job-terminating faults per GPU-hour | absence of node/fabric/host fault |

These extend the training rows of the SLO/SLI catalog into a standalone error-budget surface. Goodput and MFU are defined in goodput; this page covers how to turn them, along with the other three signals, into SLIs that drive alerting, capacity, and reliability work.

The distinction from inference SLOs matters: an inference availability breach is a customer outage measured in seconds; a training queue-wait breach is a capacity signal measured in minutes-to-hours, and a goodput regression is an efficiency signal that bleeds dollars without ever throwing an error. They belong to different error budgets with different owners and different remediation loops (SLO/SLI catalog).

flowchart LR
  SUBMIT["Job submitted"] --> QWAIT["SLI: queue wait <= target"]
  QWAIT --> RUN["Job running"]
  RUN --> GP["SLI: goodput / MFU"]
  RUN --> CKPT["SLI: checkpoint success"]
  RUN --> DONE["Job ends"]
  DONE --> SUCCESS["SLI: success = not infra-fault"]
  QWAIT --> CAPACITY["Breach -> capacity work"]
  GP --> CAPACITY
  SUCCESS --> RELIABILITY["Breach -> reliability work"]
  CKPT --> RELIABILITY

Why it matters

Two budgets, two remediation loops. Capacity-bound SLIs (queue wait, goodput) call for buying capacity or scheduling better: sustained queue-wait burn means the cluster is oversubscribed or the scheduler is fragmenting GPUs, which routes to GPU capacity planning and quota tuning. Reliability-bound SLIs (job success, checkpoint success, infra-failure rate) call for fixing the fleet: they route to GPU health gating and the failure runbooks.

The economic argument is the goodput argument (goodput): at scale, infra-failure rate and checkpoint success directly set ETTR (effective training time ratio, the share of wall-clock time spent on productive training). A job that restarts from a stale checkpoint after a node fault re-burns every token since the last good write. That lost work is invisible to job-success-rate but devastating to goodput. Measuring all five together is what lets you attribute an MFU regression (runbook: MFU regression) to a hardware cause instead of a code cause.
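
To make the cost of stale checkpoints concrete, here is a rough sketch of how failures and checkpoint spacing eat into ETTR. It assumes a simplified model that is not from the goodput page: each failure loses, on average, half a checkpoint interval of work plus a fixed restart overhead. The function name and parameters are hypothetical.

```python
def estimate_ettr(
    wall_clock_h: float,
    failures: int,
    ckpt_interval_h: float,
    restart_overhead_h: float,
) -> float:
    """Estimate productive time / wall-clock time under a simplified model.

    Assumption (illustrative only): a failure lands uniformly at random
    between checkpoints, so it loses half an interval of work on average,
    plus a fixed restart overhead.
    """
    if wall_clock_h <= 0:
        raise ValueError("wall_clock_h must be positive")
    lost_per_failure = ckpt_interval_h / 2.0 + restart_overhead_h
    # Lost time cannot exceed the total wall-clock time.
    lost = min(wall_clock_h, failures * lost_per_failure)
    return (wall_clock_h - lost) / wall_clock_h


# Same run and failure count; only the effective checkpoint spacing differs.
healthy = estimate_ettr(100.0, 4, ckpt_interval_h=1.0, restart_overhead_h=0.25)
stale = estimate_ettr(100.0, 4, ckpt_interval_h=6.0, restart_overhead_h=0.25)
```

Under these assumptions the job-success rate is identical in both cases, but the run with the longer effective interval loses far more productive time, which is why checkpoint success is tracked as its own SLI.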

Infra-failure rate is the SLI that separates the platform's fault from the user's. A job that OOMs on a bad batch size is not a platform failure; a job killed by an ECC double-bit error, an NVLink drop, or a node going NotReady is. Only the latter spends the platform's error budget. Getting that attribution right is the whole point: it is what keeps the platform team accountable for fleet reliability without owning every user bug.
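
The attribution rule above can be sketched as a small classifier. The reason strings are hypothetical placeholders; the real values depend on what your job wrapper and node-health tooling report. Unknown causes are kept separate so they get triaged instead of silently spending (or sparing) the platform budget.

```python
from __future__ import annotations

# Hypothetical reason strings, for illustration only.
INFRA_REASONS = {"ecc_double_bit", "nvlink_down", "node_not_ready"}
USER_REASONS = {"oom_killed", "nonzero_exit", "bad_config"}


def classify_termination(reason: str) -> str:
    """Return 'infra', 'user', or 'unknown' for a terminal reason string."""
    if reason in INFRA_REASONS:
        return "infra"
    if reason in USER_REASONS:
        return "user"
    return "unknown"  # triage manually rather than guessing a budget


def spends_platform_budget(reason: str) -> bool:
    """Only infra-attributed terminations spend the platform's error budget."""
    return classify_termination(reason) == "infra"
```

A wrapper could use the returned value as the `cause` label on a termination counter, so the infra-failure SLI filters on `cause="infra"`.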

When it is needed (and when not)

Adopt these SLOs when:

Skip or simplify when:

Do not set a queue-wait SLO before the cluster has stable quota. Do not page on goodput either: track it and open tickets for regressions (runbook: MFU regression). Low goodput has too many legitimate causes (for example, a genuinely communication-bound model) to be a paging signal.

How: implement, integrate, maintain

Reference templates below, not hardware-tested. Metric names track Kueue, Volcano, and kube-state-metrics at the versions cited; re-check against the linked docs before relying on any rule, since exact names shift between versions. Targets are reference values to set with stakeholders, not guarantees.

SLIs as PromQL

Source signals from the scheduler (Kueue or Volcano), kube-state-metrics for job outcomes, DCGM for fleet health, and your training loop for goodput/MFU (telemetry and monitoring, observability).

# 1. Scheduler queue wait p95 (Kueue admission wait histogram), seconds
histogram_quantile(0.95,
  sum by (le) (rate(kueue_admission_wait_time_seconds_bucket[1h])))

# 1b. Queue-wait SLI: fraction of workloads admitted within 600s
sum(rate(kueue_admission_wait_time_seconds_bucket{le="600"}[1h]))
  / sum(rate(kueue_admission_wait_time_seconds_count[1h]))

# 1c. Volcano alternative: pending PodGroups per queue (capacity pressure)
sum by (queue_name) (volcano_queue_pod_group_pending_count)

# 2. Job success rate (kube-state-metrics): completed / (completed + failed)
sum(increase(kube_job_status_succeeded[28d]))
  / (sum(increase(kube_job_status_succeeded[28d]))
     + sum(increase(kube_job_status_failed[28d])))

# 3. Goodput proxy: cluster MFU (training loop emits realized vs peak FLOPs)
sum(rate(training_realized_flops_total[15m]))
  / sum(training_peak_flops)            # 0..1, tracked not paged

# 4. Checkpoint success rate (training loop emits attempt + success counters)
sum(increase(checkpoint_write_success_total[7d]))
  / sum(increase(checkpoint_write_attempt_total[7d]))

# 5. Infra-failure rate: infra-attributed job kills per 1000 GPU-hours
1000 * sum(increase(job_terminations_total{cause="infra"}[28d]))
  / sum(increase(gpu_busy_seconds_total[28d]) / 3600)

kube_job_status_succeeded and kube_job_status_failed can both read 1 for a job that failed then succeeded on retry, and both are gauges, so increase() over them is only an approximation. Prefer counting on the Complete/Failed job conditions, or emit your own terminal-state counter from the job wrapper, to avoid double-counting ([kube-state-metrics #2642]).
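
As an illustration of the terminal-state approach, the sketch below counts each job once by its last observed state. The event shape (`job_id`, `state`) and the state names are assumptions for the example, not a kube-state-metrics or Kubernetes API.

```python
from collections import Counter
from typing import Iterable, Tuple


def terminal_counts(events: Iterable[Tuple[str, str]]) -> Counter:
    """Count jobs by final state.

    events: (job_id, state) pairs in time order; the last state per job wins,
    so a job that failed and then succeeded on retry counts once, as succeeded.
    """
    final = {}
    for job_id, state in events:
        final[job_id] = state
    return Counter(final.values())


events = [
    ("job-a", "failed"),     # first attempt
    ("job-a", "succeeded"),  # retry
    ("job-b", "failed"),
]
counts = terminal_counts(events)
success_rate = counts["succeeded"] / (counts["succeeded"] + counts["failed"])
```

Counting raw attempts here would report two failures and one success; counting terminal states reports one of each.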

Burn-rate alerts for the reliability budget

Page on two of the reliability-bound SLIs: job success and checkpoint success. Infra-failure rate is trended rather than paged until it has a baseline. Use the same multi-window multi-burn-rate pattern as inference (SLO/SLI catalog). For a 99% job-success SLO the budget is 1%.

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: slo-training-platform
  namespace: monitoring
  labels: { release: kube-prom }
spec:
  groups:
    - name: slo.training
      rules:
        # Record the job-failure ratio over two windows.
        - record: slo:training_job_fail:ratio_rate1h
          expr: |
            sum(increase(kube_job_status_failed[1h]))
              / clamp_min(
                  sum(increase(kube_job_status_succeeded[1h]))
                    + sum(increase(kube_job_status_failed[1h])), 1)
        - record: slo:training_job_fail:ratio_rate6h
          expr: |
            sum(increase(kube_job_status_failed[6h]))
              / clamp_min(
                  sum(increase(kube_job_status_succeeded[6h]))
                    + sum(increase(kube_job_status_failed[6h])), 1)
        # 6x burn on a 1% budget -> page (6h/1h windows).
        - alert: TrainingJobSuccessFastBurn
          expr: |
            slo:training_job_fail:ratio_rate1h > (6 * 0.01)
              and slo:training_job_fail:ratio_rate6h > (6 * 0.01)
          for: 10m
          labels: { severity: critical }
          annotations:
            runbook: "runbook-mfu-regression / gpu-health-gating: triage infra-attributed failures"
        # Checkpoint success below 99.9% over 6h -> page; lost checkpoints destroy ETTR.
        - alert: CheckpointSuccessLow
          expr: |
            sum(increase(checkpoint_write_success_total[6h]))
              / clamp_min(sum(increase(checkpoint_write_attempt_total[6h])), 1) < 0.999
          for: 15m
          labels: { severity: critical }
        # Queue wait: ticket, do not page -> capacity signal.
        - alert: SchedulerQueueWaitHigh
          expr: |
            histogram_quantile(0.95,
              sum by (le) (rate(kueue_admission_wait_time_seconds_bucket[1h]))) > 600
          for: 30m
          labels: { severity: warning }
          annotations:
            runbook: "review quota and capacity: cloud-neoclouds-cost"
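
The `6 * 0.01` threshold in the fast-burn alert above follows from simple burn-rate arithmetic. A minimal sketch, using the 99% target, 6x burn rate, and 28-day window from this page (the function names are hypothetical):

```python
def burn_threshold(slo_target: float, burn_rate: float) -> float:
    """Failure ratio at which the budget burns `burn_rate` times too fast."""
    return burn_rate * (1.0 - slo_target)


def days_to_exhaust(window_days: float, burn_rate: float) -> float:
    """Days until the whole budget is gone if the burn rate is sustained."""
    return window_days / burn_rate


threshold = burn_threshold(0.99, 6)  # failure ratio that triggers the page
runway = days_to_exhaust(28, 6)      # days of budget left at a sustained 6x burn
```

At a sustained 6x burn, a 28-day budget lasts under five days, which is the argument for paging rather than ticketing.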

Integrate: emit the goodput and checkpoint counters

The scheduler and kube-state-metrics give you queue wait, job success, and infra signals for free. Goodput, MFU, and checkpoint success require the training loop to export counters (goodput, telemetry and monitoring).

from __future__ import annotations

from prometheus_client import Counter, Gauge

REALIZED_FLOPS = Counter(
    "training_realized_flops_total", "Useful FLOPs from committed steps", ["job"]
)
# Must be a rate (FLOP/s): the MFU query divides rate(realized FLOPs) by this gauge.
PEAK_FLOPS = Gauge("training_peak_flops", "Peak FLOP/s at run precision", ["job"])
CKPT_ATTEMPT = Counter("checkpoint_write_attempt_total", "Checkpoint writes started", ["job"])
CKPT_SUCCESS = Counter("checkpoint_write_success_total", "Checkpoint writes durably persisted", ["job"])


def record_step(job: str, params: int, tokens: int) -> None:
    """Count only steps that committed an optimizer update (see goodput page)."""
    REALIZED_FLOPS.labels(job=job).inc(6.0 * params * tokens)  # ~6*N*D dense fwd+bwd


def write_checkpoint(job: str, save_fn) -> None:
    CKPT_ATTEMPT.labels(job=job).inc()
    save_fn()  # must fsync / confirm durable object-store PUT before counting success
    CKPT_SUCCESS.labels(job=job).inc()

Tokens reprocessed after a restore are not productive runtime. Exclude them, exactly as ETTR requires (goodput). Count CKPT_SUCCESS only after the write is durable (fsync on local FS, confirmed PUT on object store), or the SLI lies about recoverability.
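
For the local-filesystem case, one way to make a write durable before counting it is the temp-file, fsync, rename pattern. This is a sketch under assumptions: POSIX semantics, a checkpoint small enough to pass as bytes, and a hypothetical helper name `durable_write`. Object stores need a confirmed PUT instead.

```python
import os
import tempfile


def durable_write(path: str, data: bytes) -> None:
    """Write data to path via temp file + fsync + atomic rename (POSIX)."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".ckpt-")
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
            f.flush()
            os.fsync(f.fileno())  # force file contents to stable storage
        os.replace(tmp_path, path)  # atomic rename within one filesystem
    except BaseException:
        # Leave no partial temp file behind on failure.
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
    # fsync the directory so the rename itself is persisted.
    dir_fd = os.open(directory, os.O_RDONLY)
    try:
        os.fsync(dir_fd)
    finally:
        os.close(dir_fd)
```

Passing something like `lambda: durable_write(path, blob)` as `save_fn` means `CKPT_SUCCESS` only increments after the fsyncs return; if any step raises, the attempt is counted and the success is not.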

How a breach drives work

  • Queue-wait burn (capacity) -> inspect kueue_pending_workloads / volcano_queue_pod_group_pending_count by queue; is it quota starvation, GPU fragmentation, or genuine oversubscription? Route to quota tuning and capacity/cost planning.
  • Job-success burn (reliability) -> filter job_terminations_total{cause="infra"} by node; correlate with DCGM ECC/NVLink/thermal events (gpu health gating); cordon and RMA bad nodes.
  • Checkpoint-success burn (reliability) -> storage path or object-store throttling; verify durability and restart-from-checkpoint actually works (distributed training recipes).
  • Goodput/MFU regression (efficiency, ticket) -> runbook: MFU regression: bisect code vs. data-loader vs. fabric.

Maintain

Generate the rules from one SLO-as-code spec in git (OpenSLO/Sloth) so targets are reviewed with stakeholders, not edited ad hoc (SLO/SLI catalog). Re-baseline queue-wait and infra-failure targets after each capacity change, since adding nodes shifts both. Validate that alerts fire on a synthetic burn (drain a node, force a checkpoint failure) before trusting them. Wire every page to a runbook and review the error-budget policy: which changes freeze when the reliability budget is spent.
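
To show the idea of deriving alert expressions from a single reviewed spec, here is a toy generator. The spec shape is a hypothetical dict, not OpenSLO or Sloth syntax; the recording-rule names match the ones used in the PrometheusRule on this page.

```python
# Hypothetical spec shape for illustration; real tools define their own schema.
SPEC = {
    "name": "training_job_fail",
    "target": 0.99,
    "burn_rate": 6,
    "windows": ["1h", "6h"],
}


def alert_expr(spec: dict) -> str:
    """Build a multi-window burn-rate alert expression from the spec."""
    # Round to avoid float noise such as 0.06000000000000005 in the output.
    threshold = round(spec["burn_rate"] * (1 - spec["target"]), 6)
    terms = [
        f"slo:{spec['name']}:ratio_rate{window} > {threshold}"
        for window in spec["windows"]
    ]
    return " and ".join(terms)


expr = alert_expr(SPEC)
```

Changing the target in the spec changes every derived threshold in one reviewed commit, which is the property the SLO-as-code approach is after.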

Reference targets (set your own with stakeholders): queue wait 95% ≤ 10 min / 28d; job success 99% / 28d; checkpoint success 99.9%; goodput/MFU tracked, ticket on regression; infra-failure rate trended, no fixed SLO until baselined.

References

  • Kueue Prometheus metrics (admission_wait_time_seconds, pending_workloads, admitted/evicted totals): https://kueue.sigs.k8s.io/docs/reference/metrics/
  • Kueue common Grafana queries: https://kueue.sigs.k8s.io/docs/tasks/manage/observability/common_grafana_queries/
  • Volcano scheduler metrics (e2e_scheduling_latency, queue_pod_group_pending_count): https://github.com/volcano-sh/volcano/blob/master/docs/design/metrics.md
  • kube-state-metrics Job metrics (kube_job_status_succeeded/failed); double-count caveat: https://github.com/kubernetes/kube-state-metrics/blob/main/docs/metrics/workload/job-metrics.md · https://github.com/kubernetes/kube-state-metrics/issues/2642
  • Google SRE Workbook — Alerting on SLOs (multi-window burn rate): https://sre.google/workbook/alerting-on-slos/
  • Prometheus histogram_quantile: https://prometheus.io/docs/prometheus/latest/querying/functions/#histogram_quantile
  • prometheus_client (Python) Counter/Gauge: https://prometheus.github.io/client_python/

Related: SLO/SLI catalog · Goodput · MFU regression runbook · GPU health gating · Cloud and neoclouds cost · Telemetry · Glossary