
Runbook: RL training API observability bring-up

Scope: standing up the observability stack for a managed RL training API, specifically the layers a plain training platform does not have: per-run algorithmic health (reward, KL, entropy, eval gates), per-tenant accounting and failure attribution, and the customer-facing run status derived from both. The platform-level SLIs underneath (queue wait, job success, goodput/MFU, checkpoint success, infra-failure rate) are already defined with PromQL in training-platform SLOs and are reused here, not re-derived; the collection stack is covered in telemetry, monitoring and alerting.

Run this before GA of an RL training service, or after the incident that typically motivates it: a customer run burned its GPU budget while learning nothing, and nobody (neither the customer nor the on-call) was told. Severity: cost and trust rather than availability; the failure mode is silence, so the deliverable is a signal catalog in which a dead, diverging, or reward-hacked run cannot stay quiet.

Reference templates on real APIs; pin versions and validate before production use. Alert thresholds below are illustrative config defaults to tune per fleet, not measured constants.

An RL run can be perfectly healthy at the infrastructure layer and completely broken as a training run: GPUs busy, steps committing, checkpoints landing, while the policy collapses to a single mode or optimizes a reward the held-out eval refutes. The converse also holds, which is why the layers must stay separate: a platform incident (slow weight sync) shows up as run-mechanics degradation, never as an algorithmic alert. The signal vocabulary for the algorithmic layer comes from the training pages: per-token KL to the frozen reference and clip fraction from PPO, entropy bands and the reward-versus-held-out divergence rule from GRPO variants, and the frozen-evaluator discipline from evaluation integrity.

Trigger

A managed RL training API is approaching GA and the only dashboards are platform-level (queue, job success, MFU).
A silent-failure incident: a run finished "successfully" with an unusable model, or burned budget while flat-lined, and the first detection was the customer.
Reward hacking reached a customer deliverable: training reward rose while held-out quality fell, and nothing fired.
A goodput or MFU regression alone is not this runbook; that path is the MFU-regression runbook.

Pre-checks

Platform SLIs exist and page. Queue wait, job success, checkpoint success, and infra-failure attribution from training-platform SLOs are deployed; this runbook stacks on top of them, not instead of them.
The trainer exports counters. The training loop already emits step and checkpoint counters (the training_realized_flops_total / checkpoint_write_* pattern); if not, start there.
An eval gate exists per run. Algorithmic health needs a periodic held-out score the optimizer cannot touch (evaluation integrity, LLM evaluation harness); without it, reward hacking is undetectable by construction.
Run identity is stable. Every metric can carry a run_id and tenant label from the job spec; without stable identity, per-tenant accounting and per-run alerting cannot be built.
Decide the customer-visibility boundary up front with product: which signals are shown raw, which are summarized into run states, and which stay internal (see step 5); retrofitting this after GA is a breaking API change.

Flow

flowchart TB
    A["RL training API needs observability"] --> B["Layer 1: infra SLIs<br/>(reuse training-platform SLOs)"]
    B --> C["Layer 2: run mechanics<br/>step-time breakdown, queues, sync"]
    C --> D["Layer 3: algorithmic health<br/>reward, KL, entropy, eval gate"]
    D --> E["Alert catalog wired to detectors"]
    E --> F["Per-tenant accounting +<br/>failure attribution"]
    F --> G["Customer-facing run states<br/>and charts"]
    G --> H["Canary fault injection:<br/>every alert proven to fire"]
    H -->|"alert silent"| E
    H -->|"all fire"| I["GA gate passed"]

Procedure

1. Inventory the three layers and assign owners. Build the dashboard skeleton in this order, one row per layer, before writing any alert:
Infra (platform team owns): the five SLIs from training-platform SLOs, filtered to the API's job class. Nothing new to build.
Run mechanics (service team owns): per-step wall-clock decomposed into rollout generation, reward scoring, weight sync, and optimizer time; trainer idle fraction; rollout queue depth; checkpoint cadence and success per run. These are service-defined metrics the trainer and rollout workers must emit (names like rlapi_step_phase_seconds{phase="rollout|score|sync|optim"} are this runbook's convention, not a standard; follow the Prometheus naming practices). The rollout fleet side reuses the serving gauges of the inference engine (vLLM metrics; queue depth and preemptions matter most here).
Algorithmic health (shared: service detects, customer decides): per-run reward mean and std, KL to the frozen reference, policy entropy, gradient norm, clip fraction, advantage statistics, and the periodic eval-gate score. These come from the training loop at low frequency (per step or per N steps) and are the only layer with customer-facing meaning.
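
The step-time decomposition in the run-mechanics layer can be sketched as a small timer that the training loop wraps around each phase. This is a minimal illustration, not the service's implementation: `StepPhaseTimer`, `FakeClock`, and the summary field names are hypothetical, and "trainer idle fraction" is assumed here to mean the share of a step spent outside the optimizer phase. The phase names follow the label values of the runbook's `rlapi_step_phase_seconds` convention.

```python
# Sketch only: per-step phase timing for the run-mechanics layer. Class and
# field names are hypothetical; stdlib only.
import time
from contextlib import contextmanager

# Phase label values follow the runbook's rlapi_step_phase_seconds convention.
PHASES = ("rollout", "score", "sync", "optim")


class StepPhaseTimer:
    """Accumulates wall-clock seconds per phase for a single training step."""

    def __init__(self, clock=time.perf_counter):
        self._clock = clock
        self._step_start = clock()
        self.seconds = {phase: 0.0 for phase in PHASES}

    @contextmanager
    def phase(self, name):
        """Time one phase; elapsed time is recorded even if the body raises."""
        if name not in self.seconds:
            raise ValueError(f"unknown phase: {name!r}")
        start = self._clock()
        try:
            yield
        finally:
            self.seconds[name] += self._clock() - start

    def summary(self):
        wall = self._clock() - self._step_start
        accounted = sum(self.seconds.values())
        return {
            "wall_seconds": wall,
            "phase_seconds": dict(self.seconds),
            # Share of the step not covered by any phase.
            "unaccounted_fraction": (wall - accounted) / wall if wall > 0 else 0.0,
            # Assumption: "trainer idle" = share of the step outside the optimizer phase.
            "trainer_idle_fraction": 1.0 - self.seconds["optim"] / wall if wall > 0 else 0.0,
        }


class FakeClock:
    """Deterministic clock so the example does not depend on real time."""

    def __init__(self):
        self.now = 0.0

    def __call__(self):
        return self.now

    def advance(self, seconds):
        self.now += seconds


clock = FakeClock()
timer = StepPhaseTimer(clock=clock)
for name, duration in [("rollout", 6.0), ("score", 1.5), ("sync", 0.5), ("optim", 2.0)]:
    with timer.phase(name):
        clock.advance(duration)
clock.advance(0.2)  # untimed overhead between phases
report = timer.summary()
print(report)
```

In a real trainer the default `time.perf_counter` clock would be used and the per-phase seconds exported as a metric; the unaccounted fraction is the quantity the verification section expects to stay within a few percent.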
2. Wire the run-health alert catalog. Each detector below is per-run, evaluated over windows of steps (not wall-clock), and labelled with its first-response pointer. All thresholds are heuristics to tune on replayed historical runs:

Alert	Detection logic (heuristic)	First response
Reward explosion or collapse	reward mean leaves a robust band (for example, median +/- k*IQR of its own trailing window) in either direction	freeze promotion; inspect reward-scorer logs; reward design
KL blowup	KL to reference exceeds a per-algorithm ceiling or its slope exceeds a per-window cap	check LR and KL coefficient; the policy ran away from the reference (PPO)
Entropy collapse	EWMA of entropy falls faster than a configured rate for k consecutive windows (executed detector below)	clip-higher or KL term per GRPO variants; mode collapse in progress
Reward hacking signature	reward slope positive while eval-gate slope negative over the same window (executed detector below)	stop the run; audit the reward function; evaluation integrity
Dead run	reward and KL both flat within noise for M consecutive windows while GPU spend accrues	check reward-function output distribution and LR; this is budget burning with no learning signal
Gradient-norm spike	grad norm exceeds its trailing robust band; repeated spikes with clipping engaged	check batch composition and LR schedule; correlate with loss scale if mixed precision
Rollout-quality drift	generation length, truncation rate, or non-parseable-output rate shifts materially between windows	check sampling params and stop tokens; drift here often precedes reward artifacts
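
Two of the simpler rows in the table, the robust band and the dead-run check, can be sketched with the standard library alone. This is an illustrative sketch under stated assumptions: the function names, the `min_iqr` floor, and every default (`k`, `eps`, `m`, `window`) are hypothetical, and "flat within noise" is interpreted here as consecutive window means moving by at most `eps`.

```python
# Sketch only: two heuristic detectors from the alert table. Function names and
# default thresholds are hypothetical; stdlib only.
import statistics


def robust_band(history, k=3.0, min_iqr=1e-6):
    """(low, high) = median -/+ k * IQR over a trailing window of values."""
    q1, _, q3 = statistics.quantiles(history, n=4)
    # The floor keeps a constant history from producing a zero-width band.
    iqr = max(q3 - q1, min_iqr)
    median = statistics.median(history)
    return median - k * iqr, median + k * iqr


def leaves_band(history, value, k=3.0):
    """True if value falls outside the robust band in either direction."""
    low, high = robust_band(history, k)
    return value < low or value > high


def window_means(series, window):
    """Means of consecutive non-overlapping windows of steps."""
    return [statistics.fmean(series[i:i + window])
            for i in range(0, len(series) - window + 1, window)]


def dead_run(reward, kl, gpu_seconds, window=50, eps=0.005, m=3):
    """True if reward and KL window means both move by at most eps for m
    consecutive window transitions while GPU-seconds are still being spent.
    gpu_seconds holds one spend figure per window."""
    r, q = window_means(reward, window), window_means(kl, window)
    streak = 0
    for i in range(1, min(len(r), len(q), len(gpu_seconds))):
        flat = abs(r[i] - r[i - 1]) <= eps and abs(q[i] - q[i - 1]) <= eps
        streak = streak + 1 if flat and gpu_seconds[i] > 0 else 0
        if streak >= m:
            return True
    return False


# Reward band: a trailing window of per-window reward means.
history = [0.50, 0.52, 0.49, 0.51, 0.50, 0.53, 0.48, 0.51]
inside = leaves_band(history, 0.51)
explosion = leaves_band(history, 5.0)
collapse = leaves_band(history, 0.1)

# Dead run: constant reward and KL while spend accrues, versus a learning run.
spend = [3600.0] * 5
dead = dead_run([0.30] * 250, [0.0] * 250, spend)
learning = dead_run([0.2 + 0.002 * i for i in range(250)],
                    [0.0001 * i for i in range(250)], spend)
idle = dead_run([0.30] * 250, [0.0] * 250, [0.0] * 5)
print(inside, explosion, collapse, dead, learning, idle)
```

The band is computed from the run's own trailing window, so it adapts to each reward scale; the dead-run check deliberately ignores flat windows with no spend, since a paused run is not burning budget.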

3. Validate the two subtle detectors in miniature. The hacking-divergence and entropy-collapse detectors are the two that gate customer trust, and their logic is small enough to test without a training run. This block is executed and asserted: the divergence detector fires on all three windows of a constructed hacking segment and stays silent on the healthy co-rising half (and on a reward-vs-itself control), and the entropy detector flags a constructed collapse while ignoring noisy-flat and constant series:

# run_health.py - validated: two run-health detectors for a managed RL training
# API, on constructed synthetic series. Executed and asserted; thresholds are
# illustrative config defaults, not tuned production values. numpy only.
import numpy as np


def slope(y: np.ndarray) -> float:
    """Least-squares slope per step over a window."""
    x = np.arange(y.size, dtype=np.float64)
    return float(np.polyfit(x, y, 1)[0])


def hacking_windows(reward: np.ndarray, evals: np.ndarray,
                    window: int = 50, tau: float = 0.002) -> list[int]:
    """Reward-up, eval-down divergence: the reward-hacking signature. Returns
    window start indices where reward slope > +tau while eval slope < -tau."""
    assert reward.shape == evals.shape and reward.size >= window
    fired = []
    for s in range(0, reward.size - window + 1, window):
        r, e = reward[s:s + window], evals[s:s + window]
        if slope(r) > tau and slope(e) < -tau:
            fired.append(s)
    return fired


def entropy_collapse_step(entropy: np.ndarray, alpha: float = 0.2,
                          drop: float = 0.004, k: int = 3,
                          window: int = 25) -> int | None:
    """First step index where the EWMA of policy entropy falls faster than
    `drop` per step for k consecutive windows; None if it never does."""
    assert entropy.size >= (k + 1) * window
    ewma = np.empty_like(entropy)
    ewma[0] = entropy[0]
    for i in range(1, entropy.size):
        ewma[i] = alpha * entropy[i] + (1 - alpha) * ewma[i - 1]
    consecutive = 0
    for s in range(window, entropy.size - window + 1, window):
        rate = (ewma[s + window - 1] - ewma[s]) / window
        consecutive = consecutive + 1 if rate < -drop else 0
        if consecutive >= k:
            return s + window - 1
    return None


rng = np.random.default_rng(0)
steps = 300
noise = lambda scale: rng.normal(0.0, scale, steps // 2)

# 1) Divergence detector: healthy first half (reward and eval co-rise), then a
# constructed hacking segment (reward keeps rising, held-out eval decays).
reward = np.concatenate([np.linspace(0.2, 0.5, steps // 2) + noise(0.01),
                         np.linspace(0.5, 0.9, steps // 2) + noise(0.01)])
evals = np.concatenate([np.linspace(0.30, 0.45, steps // 2) + noise(0.01),
                        np.linspace(0.45, 0.10, steps // 2) + noise(0.01)])
fired = hacking_windows(reward, evals)
assert fired and min(fired) >= steps // 2, fired          # silent on healthy half
assert len(fired) == 3, fired                             # 3 of 3 hacking windows
assert not hacking_windows(reward, reward.copy())         # co-rising never fires

# 2) Entropy-collapse detector: noisy-flat entropy never fires; a constructed
# exponential collapse after step 150 fires inside the collapse region.
flat = 2.0 + rng.normal(0.0, 0.05, steps)
assert entropy_collapse_step(flat) is None
collapse = np.concatenate([2.0 + rng.normal(0.0, 0.05, steps // 2),
                           2.0 * np.exp(-np.arange(steps // 2) / 40.0)
                           + rng.normal(0.0, 0.01, steps // 2)])
at = entropy_collapse_step(collapse)
assert at is not None and at >= steps // 2, at
assert entropy_collapse_step(np.full(steps, 2.0)) is None  # constant never fires

print(f"hacking windows fired at steps {fired} (constructed segment starts at {steps // 2})")
print(f"entropy collapse flagged at step {at} (collapse injected at {steps // 2})")
print("all run-health detector assertions passed")

Output of the run: hacking windows fired at steps [150, 200, 250] (constructed segment starts at 150), entropy collapse flagged at step 224 (collapse injected at 150), all run-health detector assertions passed. The detection lag on the collapse (steps 150 to 224) is the EWMA-plus-consecutive-windows price paid for not firing on the noisy-flat control; tune k, window, and drop against replayed runs from your own fleet, and record the false-positive rate per alert as part of its definition.
4. Build per-tenant accounting on the run labels. Meter GPU-seconds per run split by phase (rollout fleet vs trainer pool), tokens generated and trained per run, and derive cost per committed optimizer step; this is what makes the dead-run alert a money number instead of a curiosity. Attribute every terminal run state to one of three classes, extending the infra-vs-user split of training-platform SLOs: platform fault (spends the platform error budget), tenant code fault (reward function or dataset errors; surfaced to the customer with logs), or algorithmic non-convergence (no fault; surfaced with the health history). Where the service itself makes LLM calls for scoring or judging, instrument them under the GenAI conventions so judge spend is attributable per run (GenAI observability with OpenTelemetry).
5. Define the customer-facing story last, from the layers. A run state machine (QUEUED, PROVISIONING, RUNNING, DEGRADED(reason), PAUSED_BY_GATE(reason), SUCCEEDED, FAILED(class)) driven by the alert catalog: a firing hacking or KL alert moves the run to DEGRADED or pauses it at the next checkpoint, with the alert's plain-language reason attached. Expose the minimal chart set per run (reward curve, KL curve, eval-gate scores, spend so far); keep gradient norms, clip fractions, and fleet mechanics internal. Every failure message a customer sees must name the class from step 4, because "training failed" without attribution generates a support ticket and "your reward function raised on 3% of samples, examples attached" does not. Track run history in the experiment tracker so post-hoc analysis has the full series (experiment tracking and model registry; W&B or MLflow both fit).
6. Contain cardinality before it contains you. run_id is an unbounded label: keep high-frequency series (per-step mechanics) in short retention, roll algorithmic health up to per-N-step resolution for long retention, and never put run_id on fleet-level metrics. Per-tenant aggregates keep bounded cardinality and long retention. The split mirrors the general guidance in observability and monitoring and the instrumentation practices.
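
The run state machine from step 5, together with the three failure classes from step 4, can be sketched as below. This is a hedged illustration rather than the service's API: the alert names, the `ALERT_POLICY` mapping (which alert pauses and which only degrades is a product decision the runbook leaves open), the reason strings, and the helper names are all hypothetical. Only the state names and the three classes come from the text.

```python
# Sketch only: customer-facing run states driven by alerts. Alert names, the
# policy mapping, and reason strings are hypothetical.
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class RunState(Enum):
    QUEUED = "QUEUED"
    PROVISIONING = "PROVISIONING"
    RUNNING = "RUNNING"
    DEGRADED = "DEGRADED"
    PAUSED_BY_GATE = "PAUSED_BY_GATE"
    SUCCEEDED = "SUCCEEDED"
    FAILED = "FAILED"


class FailureClass(Enum):
    PLATFORM_FAULT = "platform_fault"
    TENANT_CODE_FAULT = "tenant_code_fault"
    ALGORITHMIC_NON_CONVERGENCE = "algorithmic_non_convergence"


# Assumption: hacking pauses the run, KL blowup only degrades it.
ALERT_POLICY = {
    "reward_hacking": (RunState.PAUSED_BY_GATE,
                       "Training reward is rising while held-out evaluation is falling."),
    "kl_blowup": (RunState.DEGRADED,
                  "The policy is drifting far from the reference model."),
}


@dataclass(frozen=True)
class RunStatus:
    state: RunState
    reason: Optional[str] = None
    failure_class: Optional[FailureClass] = None


def on_alert(status, alert):
    """Apply a firing alert; only active runs and customer-visible alerts move state."""
    if status.state not in (RunState.RUNNING, RunState.DEGRADED):
        return status
    if alert not in ALERT_POLICY:
        return status  # internal-only signal, e.g. a gradient-norm spike
    state, reason = ALERT_POLICY[alert]
    return RunStatus(state, reason)


def fail(failure_class, reason):
    """Build a FAILED status; refuses to fail a run without an attributed class."""
    if not isinstance(failure_class, FailureClass):
        raise ValueError("a FAILED run must name exactly one failure class")
    return RunStatus(RunState.FAILED, reason, failure_class)


running = RunStatus(RunState.RUNNING)
degraded = on_alert(running, "kl_blowup")
paused = on_alert(degraded, "reward_hacking")
unchanged = on_alert(running, "grad_norm_spike")
failed = fail(FailureClass.TENANT_CODE_FAULT,
              "Reward function raised on 3% of samples; examples attached.")
still_failed = on_alert(failed, "kl_blowup")
print(degraded, paused, failed, sep="\n")
```

Making `fail` reject a missing class is one way to enforce the rule that no customer ever sees an unattributed "training failed".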

Verification

Every alert fires under injected failure. On a canary run: freeze weight sync (mechanics alert plus staleness growth), feed a constant reward (dead-run alert), set an oversized LR (KL blowup, usually with entropy collapse following), and swap in a gameable reward on a toy task (hacking divergence). Each must fire and each must move the customer-facing state accordingly.
The silent-failure replay passes. Replaying the motivating incident's metric series through the catalog raises at least one page before the historical detection time.
Step-time breakdown renders live. A production-shaped run shows the rollout/score/sync/optimizer decomposition and trainer idle fraction on the dashboard, and the phases sum to within a few percent of wall-clock step time.
Attribution adds up. Over a test week, every terminal run carries exactly one failure class, and platform-fault count matches the infra-failure SLI from training-platform SLOs.
False-positive budget respected. Healthy replayed runs stay quiet; record the per-alert false-positive rate next to its threshold config.
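
The false-positive check in the last item can be sketched as a replay harness: run one detector over a set of runs known to be healthy and record the fraction on which it fires. The sketch below is illustrative only; `false_positive_rate`, `kl_ceiling_detector`, the record fields, and the tiny hand-written KL series are hypothetical stand-ins for real replayed runs and real detectors.

```python
# Sketch only: per-alert false-positive rate over replayed healthy runs.
# Names and the toy series are hypothetical; stdlib only.


def false_positive_rate(detector, healthy_runs):
    """Fraction of known-healthy runs on which the detector fires."""
    if not healthy_runs:
        raise ValueError("need at least one healthy run to replay")
    fired = sum(1 for run in healthy_runs if detector(run))
    return fired / len(healthy_runs)


def kl_ceiling_detector(ceiling):
    """Toy KL-blowup detector: fires if any KL value exceeds the ceiling."""
    def detect(kl_series):
        return max(kl_series) > ceiling
    return detect


# Stand-ins for replayed healthy runs (per-window KL to the reference).
healthy_runs = [
    [0.01, 0.02, 0.03, 0.03],
    [0.02, 0.04, 0.05, 0.05],
    [0.01, 0.03, 0.08, 0.06],  # benign transient that a tight ceiling would page on
    [0.02, 0.02, 0.03, 0.04],
]

# Record the rate next to the threshold it was measured for.
records = [
    {
        "alert": "kl_blowup",
        "ceiling": ceiling,
        "false_positive_rate": false_positive_rate(kl_ceiling_detector(ceiling), healthy_runs),
        "n_runs": len(healthy_runs),
    }
    for ceiling in (0.05, 0.10)
]
for record in records:
    print(record)
```

Storing the measured rate and the replay-set size in the same record as the threshold keeps the number honest when the threshold is later retuned.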

Rollback

Thresholds are config, not code. Revert to the previous threshold set if a new catalog pages too much; keep threshold history versioned with the alert definitions (burn-rate alerting hygiene applies).
Detectors degrade to advisory. Any run-health alert can be demoted from paging to ticket per algorithm or per tenant while being retuned; never disable the underlying metric emission, only the routing.
Customer-facing states are append-only. If a state proves noisy (DEGRADED flapping), stop deriving it from the noisy alert rather than removing the state from the API; clients already depend on the enum.
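
The first two rollback rules can be sketched as a versioned threshold store with separate routing. This is an assumption-laden illustration, not a real alerting API: `AlertCatalog`, its methods, the `"*"` wildcard scope, and the example alert and tenant names are hypothetical. It shows thresholds reverting without rewriting history, and an alert being demoted from page to ticket for one scope while its thresholds (and therefore its evaluation) stay untouched.

```python
# Sketch only: versioned thresholds plus per-scope routing. All names are
# hypothetical; stdlib only.
import copy


class AlertCatalog:
    def __init__(self, thresholds):
        self._versions = [copy.deepcopy(thresholds)]
        self._routing = {}  # (alert, scope) -> "page" or "ticket"

    @property
    def version(self):
        return len(self._versions)

    def current(self):
        return copy.deepcopy(self._versions[-1])

    def publish(self, thresholds):
        """Append a new threshold set; earlier versions are kept."""
        self._versions.append(copy.deepcopy(thresholds))
        return self.version

    def revert(self):
        """Re-publish the previous set as a new version so history is append-only."""
        if len(self._versions) < 2:
            raise ValueError("no previous threshold set to revert to")
        self._versions.append(copy.deepcopy(self._versions[-2]))
        return self.version

    def demote(self, alert, scope="*"):
        """Route an alert to a ticket for one scope (tenant or algorithm)."""
        self._routing[(alert, scope)] = "ticket"

    def route(self, alert, scope):
        """Scope-specific routing, then wildcard, then the paging default."""
        return self._routing.get((alert, scope),
                                 self._routing.get((alert, "*"), "page"))


catalog = AlertCatalog({"kl_blowup": {"ceiling": 0.10}})
catalog.publish({"kl_blowup": {"ceiling": 0.05}})  # new set pages too much
catalog.revert()                                   # back to the previous set
catalog.demote("entropy_collapse", scope="tenant-a")
print(catalog.version, catalog.current())
print(catalog.route("entropy_collapse", "tenant-a"),
      catalog.route("entropy_collapse", "tenant-b"))
```

Keeping routing outside the threshold sets means a demotion never changes what is measured, only who is woken up.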

Related runbooks

MFU regression: the efficiency triage this runbook's mechanics layer feeds.
Checkpoint recovery: when checkpoint-success alerts fire.
NCCL hang: wedged-step triage when the mechanics layer flatlines.
Inference SLO breach: the rollout fleet is an inference service; its latency incidents route there.
Operational runbooks: index.

References

Google SRE Workbook, Alerting on SLOs (multi-window burn rate): https://sre.google/workbook/alerting-on-slos/
Prometheus metric and label naming practices: https://prometheus.io/docs/practices/naming/
Prometheus instrumentation practices (cardinality, what to instrument): https://prometheus.io/docs/practices/instrumentation/
vLLM production metrics (rollout-fleet gauges): https://docs.vllm.ai/en/latest/usage/metrics.html
Weights and Biases experiment tracking: https://docs.wandb.ai/guides/track/
MLflow tracking: https://mlflow.org/docs/latest/tracking.html