Runbook: RL training API contract design review

Scope: the joint platform/ML design review that fixes the contract of a managed (serverless) RL training API: what a job is, which knobs customers see, how tenants share hardware, and the five interfaces that connect the platform to the training loop (checkpoints, weight sync, reward functions, eval gates, the run-state machine). Run it as a working session between the platform owner and the ML owner; the output is one recorded decision per step, each with an owner and a revisit condition, plus a contract linter that makes the decisions executable.

Run this before building or materially changing the service's public API or the internal platform/ML interface, and again when onboarding a new training backend behind the same API. Severity: design-time; skipping it converts each unmade decision into a production incident later (orphaned rollout fleets, unloadable checkpoints, retry storms on algorithmic failures). The comparable to study first is Tinker, the shipping training-as-a-service API.

The examples are reference templates built on real APIs; pin versions and validate them before production use. Numeric values in examples (budgets, token counts) are illustrative, not sizing guidance.

The core tension the review resolves: the ML side wants the API to admit arbitrary training loops, while the platform side wants every admitted job to have a predictable resource shape and failure story. The resolution is a small set of admitted methods whose resource shapes are known (fine-tuning and post-training), a knob surface chosen deliberately rather than leaked from the trainer's config, and interface contracts pinned tightly enough that either side can swap its implementation without renegotiating. Tinker demonstrates one coherent answer (four primitives, LoRA-only training, multi-tenant clock cycles, checkpoint export to HuggingFace format); this review decides where your service copies that answer and where it deliberately diverges.

Trigger

  • A managed RL training API is about to be designed, or its public surface is about to change (new method, new knob, new tier).
  • A second training backend (a new framework, a new model family) must be onboarded behind the existing API.
  • Recurring cross-team friction: the platform team and the ML team keep re-litigating who owns retries, checkpoint formats, or GPU sizing per job.
  • Post-incident: a production failure traced to an interface nobody had pinned (a checkpoint the serving engine could not load, a reward function that hung the trainer).

Pre-checks

  • Both owners in the room. The review only works with the platform owner and the ML owner present; every decision below gets exactly one owner and a revisit trigger, written down (ADR-shaped, one per step).
  • Read the comparable. Skim Tinker for the shipped reference: four primitives (forward_backward, optim_step, save_state, sample), LoRA-only training (as of mid-2026) against a hosted base-model lineup, per-compute billing on multi-tenant clock cycles, checkpoint export to HuggingFace/PEFT formats.
  • Inventory the substrate. Which scheduler and quota system the jobs land on (Kueue quota, the training-platform service), and which serving engines must load the output checkpoints (weight loading, multi-LoRA serving).
  • Name the customer. Internal teams tolerate sharper edges than external tenants; the knob surface (step 2) and the adversarial pass (step 7) change accordingly (security and multi-tenancy).

Flow

flowchart TB
    A["Contract review scheduled"] --> B["1. Job unit: admitted methods + resource shapes"]
    B --> C["2. Knob surface: exposed / auto-tuned / fixed"]
    C --> D["3. Tenancy tiers: LoRA-first, preemptible rollouts"]
    D --> E["4. Five interface contracts pinned"]
    E --> F["5. Serverless promises written down"]
    F --> G["6. Contract linter encodes decisions"]
    G --> H["7. Adversarial tenant pass"]
    H --> I{"Walking skeleton passes per method?"}
    I -->|"yes"| J["Contract v1 recorded, ADRs filed"]
    I -->|"no"| E

Procedure

  1. Fix the job unit and the admitted methods. A submission is a base-model reference, a dataset reference, a reward specification, a budget, and an eval set. Decide which method families the API admits, because each implies a different resource shape the platform must provision:
     • SFT (SFT and LoRA): trainer only; no rollout fleet, no reward infrastructure.
     • DPO (DPO): trainer plus a frozen reference model (extra forward, roughly 1.5x FLOPs in Tinker's implementation); no rollouts, no reward model.
     • GRPO-class RL (GRPO): trainer plus a rollout fleet sized by group_size, plus reward execution; the critic is traded for rollout multiplicity, so the rollout pool dominates the GPU bill.

     Where to diverge from Tinker is itself a decision: Tinker exposes training primitives and lets customers write the loop; a narrower service exposes recipes (method + dataset + budget) and owns the loop. Primitives maximize flexibility and support burden; recipes invert that. Record which end of that axis the product sits on, and why. Skipping this step: the API accretes per-customer special cases and every new method becomes an emergency.
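The method-to-shape mapping in step 1 can be written down as a lookup table that the platform provisions from. This is a minimal sketch: the component names and the `shape_for` helper are hypothetical, chosen for illustration, and the table only restates what step 1 lists (a KL-regularized GRPO variant may need an additional reference model, which is not modeled here).

```python
# Hypothetical resource-shape table: one entry per admitted method family.
# Component names are illustrative labels, not identifiers from any real API.
RESOURCE_SHAPES = {
    "sft": frozenset({"trainer"}),
    "dpo": frozenset({"trainer", "reference_model"}),
    "grpo": frozenset({"trainer", "rollout_fleet", "reward_execution"}),
}


def shape_for(method: str) -> frozenset:
    """Return the components the platform must provision for a method."""
    try:
        return RESOURCE_SHAPES[method]
    except KeyError:
        admitted = sorted(RESOURCE_SHAPES)
        raise ValueError(f"method {method!r} not admitted; admitted: {admitted}") from None


def rollout_completions_per_step(batch_prompts: int, group_size: int) -> int:
    """GRPO samples group_size completions per prompt, so rollout work scales with both."""
    return batch_prompts * group_size


for method in ("sft", "dpo", "grpo"):
    print(method, sorted(shape_for(method)))
print("grpo completions per step:", rollout_completions_per_step(8, 8))
```

Keeping the table as data rather than branching logic means that admitting a new method is a reviewed one-line change, which is the point of recording the decision.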

  2. Choose the knob surface: exposed, auto-tuned, or fixed. Every exposed knob is support surface (customers will set it badly and file tickets); every hidden knob is trust surface (customers must believe the auto-tuner). A defensible starting split: expose learning rate, KL coefficient, group size, and budget caps with validated ranges and defaults from calibrated rules (LoRA hyperparameter scaling: the 10x LoRA learning-rate multiplier, rank from dataset capacity); auto-tune batch sizing, parallelism layout, and rollout:trainer ratio; fix optimizer family and precision policy. Record the escape hatch policy (an advanced namespace that voids the run-quality SLO is one honest option). Skipping this step: the trainer's whole config leaks into the public API on day one and can never be removed.
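One way to make the three-way split concrete is a knob registry that the API layer consults before a value ever reaches the trainer. The sketch below is an assumption about shape, not a product schema: the knob names, the `Exposure` enum, the `resolve` helper, and every default and range are hypothetical and illustrative.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Dict, Optional


class Exposure(Enum):
    EXPOSED = "exposed"        # customer-settable within a validated range
    AUTO = "auto-tuned"        # platform chooses per job
    FIXED = "fixed"            # same for every job


@dataclass(frozen=True)
class Knob:
    exposure: Exposure
    default: Optional[float] = None
    lo: Optional[float] = None
    hi: Optional[float] = None


# Illustrative values only; real defaults would come from calibrated rules.
KNOBS: Dict[str, Knob] = {
    "learning_rate": Knob(Exposure.EXPOSED, default=1e-4, lo=1e-6, hi=1e-2),
    "kl_coef": Knob(Exposure.EXPOSED, default=0.05, lo=0.0, hi=1.0),
    "group_size": Knob(Exposure.EXPOSED, default=8, lo=2, hi=64),
    "budget_tokens": Knob(Exposure.EXPOSED, default=10_000_000, lo=1, hi=1e12),
    "batch_size": Knob(Exposure.AUTO),
    "parallelism_layout": Knob(Exposure.AUTO),
    "rollout_trainer_ratio": Knob(Exposure.AUTO),
    "optimizer_family": Knob(Exposure.FIXED),
    "precision_policy": Knob(Exposure.FIXED),
}


def resolve(overrides: Dict[str, float]) -> Dict[str, float]:
    """Merge customer overrides onto defaults, rejecting anything not exposed."""
    resolved = {name: knob.default for name, knob in KNOBS.items()
                if knob.exposure is Exposure.EXPOSED}
    for name, value in overrides.items():
        knob = KNOBS.get(name)
        if knob is None:
            raise ValueError(f"unknown knob {name!r}")
        if knob.exposure is not Exposure.EXPOSED:
            raise ValueError(f"{name!r} is {knob.exposure.value}, not customer-settable")
        if not (knob.lo <= value <= knob.hi):
            raise ValueError(f"{name!r}={value} outside validated range [{knob.lo}, {knob.hi}]")
        resolved[name] = value
    return resolved


print(resolve({"group_size": 16}))
```

An escape-hatch namespace, if the review admits one, would be a second registry consulted only when the customer has explicitly waived the run-quality SLO.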

  3. Decide tenancy and economics. LoRA-first is the density play: adapters share a hosted base model, checkpoints are megabytes not hundreds of gigabytes, and outputs serve at high density (multi-LoRA serving); Tinker ships LoRA-only for exactly these reasons. Full fine-tuning, if admitted, is a separate tier with its own quota, storage, and serving story. Split the pools: the rollout fleet is stateless-per-step and preemption-tolerant (run it on the cheap/preemptible pool), the trainer holds optimizer state and is protected. Fair-share and quota ride the cluster's existing machinery (Kueue quota); per-tenant billing follows compute actually consumed, which multi-tenant scheduling makes efficient for small jobs (the clock-cycle argument on the Tinker page). Skipping this step: one tenant's full-FT job starves the fleet and the cost model never closes.
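The "megabytes, not hundreds of gigabytes" claim follows from the LoRA parameter count: a rank-r adapter on a d_out x d_in weight matrix adds r * (d_in + d_out) parameters. The arithmetic below works that through for a hypothetical model shape (hidden size 4096, 32 layers, four adapted square projections per layer, rank 16, 2 bytes per parameter); all of these numbers are assumptions for illustration, and optimizer state and file metadata are excluded.

```python
def lora_params(d_in: int, d_out: int, rank: int) -> int:
    """Parameters added by one rank-`rank` LoRA adapter (A: rank x d_in, B: d_out x rank)."""
    return rank * (d_in + d_out)


# Hypothetical model shape, chosen only to make the arithmetic concrete.
HIDDEN = 4096
LAYERS = 32
ADAPTED_MATRICES_PER_LAYER = 4
RANK = 16
BYTES_PER_PARAM = 2  # 16-bit weights

n_matrices = LAYERS * ADAPTED_MATRICES_PER_LAYER
adapter_params = n_matrices * lora_params(HIDDEN, HIDDEN, RANK)
dense_params = n_matrices * HIDDEN * HIDDEN  # the same matrices, fully trained

adapter_mib = adapter_params * BYTES_PER_PARAM / 2**20
dense_gib = dense_params * BYTES_PER_PARAM / 2**30

print(f"adapter: {adapter_params:,} params, {adapter_mib:.1f} MiB")
print(f"dense (adapted matrices only): {dense_params:,} params, {dense_gib:.1f} GiB")
print(f"ratio: {dense_params // adapter_params}x")
```

For this shape the adapter is 32 MiB against 4 GiB for just the adapted matrices, a 128x gap before counting the rest of the model, which is what makes per-tenant checkpoint storage and multi-adapter serving cheap.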

  4. Pin the five interface contracts. Each gets a format, an owner, and a version field:
     • Checkpoints: what is stored (adapter vs full weights, optimizer state, RNG streams), who may read it, and the export guarantee: a customer-visible checkpoint must load in the serving engines (weight loading); Tinker's answer is export to HuggingFace/PEFT format. Resume semantics belong to the resume-validation runbook.
     • Weight sync: how trainer weights reach the rollout fleet (full broadcast vs sparse delta), its latency budget, and its failure semantics; the technique menu is on delta weight sync.
     • Reward-function ABI: batched scoring interface, versioned, with the sandbox contract (timeouts, resource caps, failure classification) from the reward-sandboxing runbook; a reward failure must be attributable to the tenant, not absorbed as an infra retry.
     • Eval gate: where the evaluation harness hooks the run (pre-promotion, on-schedule, on-demand) and who defines the pass bar.
     • Run-state machine and retry ownership: the platform retries infrastructure failures (node loss, preemption) silently; algorithmic failures (KL blowup, reward collapse) surface to the customer with the run-health evidence from the observability runbook. Write the state machine down; every state transition is either platform-owned or customer-visible, never both.

     Data-plane isolation wraps all five: tenant datasets, adapters, and reward code never share a trust domain (security and multi-tenancy). Skipping this step: the interfaces get pinned implicitly by whoever writes the first integration, and every later backend inherits accidents.
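The "platform-owned or customer-visible, never both" rule for the run-state machine is easiest to enforce when the transition table itself carries the ownership. The sketch below assumes a small set of hypothetical state names (`queued`, `running`, `recovering`, `parked`, `budget_exhausted`, `succeeded`) and failure labels; only the classification of node loss and preemption as platform retries, and KL blowup and reward collapse as customer-visible, comes from the contract above.

```python
from enum import Enum
from typing import Dict, Tuple


class Owner(Enum):
    PLATFORM = "platform"   # retried silently, never shown to the customer
    CUSTOMER = "customer"   # surfaced with run-health evidence


# One owner per transition: a dict key cannot map to two owners,
# so "never both" holds by construction. State names are hypothetical.
TRANSITIONS: Dict[Tuple[str, str], Owner] = {
    ("queued", "running"): Owner.CUSTOMER,
    ("running", "recovering"): Owner.PLATFORM,
    ("recovering", "running"): Owner.PLATFORM,
    ("running", "parked"): Owner.CUSTOMER,
    ("running", "budget_exhausted"): Owner.CUSTOMER,
    ("running", "succeeded"): Owner.CUSTOMER,
}

# Failure classification: infrastructure vs algorithmic.
FAILURE_TARGET = {
    "node_loss": "recovering",
    "preemption": "recovering",
    "kl_blowup": "parked",
    "reward_collapse": "parked",
}


def on_failure(state: str, failure: str) -> Tuple[str, Owner]:
    """Map a failure during a run to its next state and the side that owns it."""
    if failure not in FAILURE_TARGET:
        raise ValueError(f"unclassified failure {failure!r}: classify it before shipping")
    target = FAILURE_TARGET[failure]
    if (state, target) not in TRANSITIONS:
        raise ValueError(f"no transition {state!r} -> {target!r}")
    return target, TRANSITIONS[(state, target)]


def customer_view(state: str) -> str:
    """Platform-owned recovery is invisible: the customer keeps seeing 'running'."""
    return "running" if state == "recovering" else state


print(on_failure("running", "preemption"))
print(on_failure("running", "kl_blowup"))
print(customer_view("recovering"))
```

Rejecting unclassified failures outright, rather than defaulting them to a retry, is a deliberate choice in this sketch: a default retry is how algorithmic failures turn into retry storms.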

  5. Write down what serverless means here. Promise: no capacity management by the customer, a time-to-first-step target, billing granularity tied to consumed compute, and durable run state across preemptions. Do not promise: bitwise reproducibility across resumes (that is a paid tier with pinned kernels and deterministic dataloading; see resume validation), or small-batch step latency for small jobs on a shared pool (Tinker documents the same trade: shared clock cycles mean a small job sees pool-scale step latency while paying only for its own compute). Skipping this step: sales promises what the scheduler cannot deliver.

  6. Encode the decisions in a contract linter. The review discipline is "make invalid states unrepresentable at the API boundary". The block below is executed and asserted, using only the standard library: it admits two coherent specs and rejects five incoherent ones (a DPO job requesting a rollout fleet, a GRPO job without a reward, a GRPO job with a degenerate group, a budget below one optimizer step, and an unknown method), each with the violated decision named:

# contract_linter.py - executed: a job-spec linter encoding the contract review's
# decisions, so incoherent submissions are rejected at the API boundary instead of
# failing half a run in. Review-discipline model in pure stdlib, not a product schema.
from dataclasses import dataclass
from typing import Optional

METHODS = ("sft", "dpo", "grpo")


@dataclass(frozen=True)
class JobSpec:
    method: str
    base_model: str
    dataset_ref: str
    budget_tokens: int
    adapter: str = "lora"                 # decision 3: LoRA-first tiers
    reward_spec: Optional[str] = None     # required for grpo only
    group_size: int = 0                   # grpo rollout multiplicity
    rollout_gpus: int = 0                 # requested rollout pool
    prompt_len: int = 512                 # example workload shape
    output_len: int = 512


def min_budget_one_step(spec: JobSpec, batch_prompts: int = 8) -> int:
    """Tokens consumed by ONE optimizer step under the method's resource shape."""
    if spec.method == "grpo":             # group_size completions per prompt
        return batch_prompts * spec.group_size * (spec.prompt_len + spec.output_len)
    if spec.method == "dpo":              # chosen+rejected pair, policy+reference
        return batch_prompts * 2 * 2 * (spec.prompt_len + spec.output_len)
    return batch_prompts * (spec.prompt_len + spec.output_len)


def validate(spec: JobSpec) -> str:
    """Admit or reject a submission; every rejection names the violated decision."""
    if spec.method not in METHODS:
        raise ValueError(f"unknown method {spec.method!r}: contract admits {METHODS}")
    if spec.adapter not in ("lora", "full"):
        raise ValueError(f"unknown adapter tier {spec.adapter!r}")
    if spec.method == "dpo" and spec.rollout_gpus > 0:
        raise ValueError("DPO has no rollout phase: rollout_gpus must be 0")
    if spec.method == "grpo" and spec.reward_spec is None:
        raise ValueError("GRPO requires a reward_spec (verifier or reward model)")
    if spec.method == "grpo" and spec.group_size < 2:
        raise ValueError("GRPO requires group_size >= 2 for group-relative advantages")
    floor = min_budget_one_step(spec)
    if spec.budget_tokens < floor:
        raise ValueError(f"budget {spec.budget_tokens} below one-step minimum {floor}")
    return f"admitted: {spec.method} on {spec.base_model} ({spec.adapter})"


def expect_reject(spec: JobSpec, fragment: str) -> str:
    try:
        validate(spec)
    except ValueError as err:
        assert fragment in str(err), (fragment, str(err))
        return str(err)
    raise AssertionError(f"spec should have been rejected: {spec}")


ok_sft = JobSpec("sft", "qwen3.5-9b", "s3://d/sft.jsonl", budget_tokens=10_000_000)
ok_grpo = JobSpec("grpo", "qwen3.5-9b", "s3://d/prompts.jsonl", budget_tokens=50_000_000,
                  reward_spec="verifier:math", group_size=8)
print(validate(ok_sft))
print(validate(ok_grpo))

r1 = expect_reject(JobSpec("dpo", "m", "d", 10_000_000, rollout_gpus=4), "no rollout phase")
r2 = expect_reject(JobSpec("grpo", "m", "d", 50_000_000, group_size=8), "requires a reward_spec")
r3 = expect_reject(JobSpec("grpo", "m", "d", 50_000_000, reward_spec="v", group_size=1), "group_size >= 2")
r4 = expect_reject(JobSpec("grpo", "m", "d", 1_000, reward_spec="v", group_size=8),
                   "below one-step minimum 65536")
r5 = expect_reject(JobSpec("ppo_custom", "m", "d", 10_000_000), "unknown method")
for r in (r1, r2, r3, r4, r5):
    print("rejected:", r)
print("all contract-linter assertions passed")

Output of the run: both valid specs admitted, all five incoherent specs rejected with the violated decision named (the GRPO one-step floor computes to 65,536 tokens for the example shape), and all contract-linter assertions passed.

  7. Run the adversarial tenant pass. For each contract from step 4, ask what a hostile or confused tenant does, and record the designed behavior next to the contract:
     • Uploads a 500 GB "checkpoint": admission-time size and format validation on the checkpoint contract, not a trainer crash.
     • Ships a reward that always returns 1.0: zero-variance advantages make GRPO updates vanish; the reward-integrity gate flags degenerate constant scores at scoring time (reward-function sandboxing, step 5) and the dead-run detector parks the run after flat windows, both surfaced as customer-actionable (observability runbook, evaluation integrity).
     • Exhausts the budget mid-epoch: the state machine defines the terminal state (checkpoint at last completed step, partial-run billing, resumable if topped up).
     • Submits a reward function that sleeps forever: the sandbox timeout converts it to a tenant-attributed failure (reward sandboxing).

     Skipping this step: the first hostile tenant performs the review for you, in production.
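The constant-reward case is worth seeing numerically. The sketch below assumes the common group-relative formulation (subtract the group mean, divide by the group standard deviation plus a small epsilon); whether a given trainer normalizes by the standard deviation is an assumption, and `is_degenerate` is a hypothetical stand-in for the reward-integrity gate, not its actual implementation.

```python
from statistics import mean, pstdev
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Advantage of each completion relative to its own group (assumed formulation)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def is_degenerate(rewards: List[float], tol: float = 1e-9) -> bool:
    """Flag a group whose scores are effectively constant (hypothetical gate)."""
    return max(rewards) - min(rewards) <= tol


constant = [1.0] * 8                 # a reward that always returns 1.0
mixed = [1.0, 0.0, 0.0, 1.0]         # a reward that discriminates

adv_constant = group_relative_advantages(constant)
adv_mixed = group_relative_advantages(mixed)

print("constant:", adv_constant, "degenerate:", is_degenerate(constant))
print("mixed:   ", [round(a, 3) for a in adv_mixed], "degenerate:", is_degenerate(mixed))
```

With every advantage at zero the policy-gradient term contributes nothing, so the run consumes budget without learning; catching the constant score at scoring time is cheaper than waiting for the dead-run detector to notice flat windows.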

Verification

  • Every decision recorded. Steps 1 through 5 each produced a written decision with one owner and a revisit trigger (for example: "LoRA-only until serving density drops below X; owner: platform; revisit: full-FT tier demand").
  • The linter encodes the contract. Every admission rule in the recorded decisions exists as a linter check, and the linter runs at the API boundary (the executed block above is the shape).
  • A walking skeleton passes per method. One minimal job of each admitted method (SFT, DPO, GRPO) runs end to end through submission, training, checkpoint export, and a serving-engine load of the exported artifact (weight loading).
  • Retry ownership is observable. Kill a rollout worker mid-run (platform retry, invisible to the customer) and inject a KL blowup (customer-visible parked run): each lands on the correct side of the state machine.
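The first verification bullet can itself be checked mechanically if decisions are recorded as structured data. This is a minimal sketch under assumed names: the `Decision` record, the `audit_decisions` helper, and the owner labels `platform` and `ml` are hypothetical, and the example record paraphrases the illustration in the bullet above.

```python
from dataclasses import dataclass
from typing import Iterable, List

OWNERS = ("platform", "ml")  # the two owners in the review


@dataclass(frozen=True)
class Decision:
    step: int       # procedure step the decision belongs to
    summary: str
    owner: str      # exactly one owner
    revisit: str    # the condition that reopens the decision


def audit_decisions(records: Iterable[Decision], required_steps=range(1, 6)) -> List[str]:
    """Return one problem string per missing or incomplete decision record."""
    by_step = {}
    for rec in records:
        by_step.setdefault(rec.step, []).append(rec)
    problems = []
    for step in required_steps:
        recs = by_step.get(step, [])
        if not recs:
            problems.append(f"step {step}: no decision recorded")
        for rec in recs:
            if rec.owner not in OWNERS:
                problems.append(f"step {step}: owner {rec.owner!r} is not one of {OWNERS}")
            if not rec.revisit.strip():
                problems.append(f"step {step}: no revisit trigger")
    return problems


records = [
    Decision(3, "LoRA-only until serving density drops below X", "platform",
             "full-FT tier demand"),
    Decision(2, "expose learning rate, KL coefficient, group size, budget caps", "ml", ""),
]
for problem in audit_decisions(records):
    print(problem)
```

Run against the two example records, the audit reports steps 1, 4, and 5 as unrecorded and step 2 as missing its revisit trigger, which is the list the review still has to close.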

Rollback

  • Contracts are versioned. A contract revision that breaks a walking-skeleton job reverts to the prior contract version; the ADR records the failed revision and why.
  • Knob exposure is one-way in practice. Treat exposing a knob as irreversible (customers pin it in automation); the rollback for a bad exposure is a deprecation window plus auto-tune shadowing, so prefer under-exposing in v1.
  • Backend swaps ride the contracts. If a new training backend cannot satisfy a pinned interface (checkpoint format, weight-sync latency), the onboarding rolls back and the gap is filed against the backend, not patched into the contract.

References

  • Tinker (Thinking Machines), the shipping training-as-a-service comparable: https://thinkingmachines.ai/tinker/ and docs: https://tinker-docs.thinkingmachines.ai
  • OpenAI platform, model optimization / fine-tuning API (a public managed-fine-tuning contract to compare knob surfaces against): https://platform.openai.com/docs/guides/model-optimization
  • Kueue documentation (quota, fair share, preemption for the tenancy decision): https://kueue.sigs.k8s.io/docs/
  • vLLM LoRA serving (the density story behind LoRA-first tenancy): https://docs.vllm.ai/en/latest/features/lora.html

Related: Tinker (training-as-a-service) · Fine-tuning and post-training · SFT and LoRA/QLoRA · LoRA hyperparameter scaling · Multi-LoRA serving · Delta weight sync · Model weight loading · LLM evaluation harness · Security and multi-tenancy · Distributed training platform service · SLOs: training platform · Glossary