
Tinker (training-as-a-service)

Scope: Thinking Machines' managed fine-tuning API (tinker) and its open-source recipe library (tinker-cookbook): the four training primitives, the multi-tenant clock-cycle execution model, the RL/SFT/DPO/distillation abstractions, the loss-function family (importance sampling, PPO, CISPO), the evaluation framework, and how to operate it well (pipelining, checkpoints, weight export). It trades cluster ownership for an API; compare it with the self-hosted stacks in RL libraries.

The python blocks that call tinker or tinker_cookbook are reference templates against the real APIs: they need an API key and the installed SDK and are not run in this doc's CI, so pin versions and verify field names before production use. Each core mechanism is paired with a dependency-free numpy/stdlib block that validates the underlying math; every one of those was executed with its asserts passing.

What it is

Tinker (Thinking Machines Lab, announced October 2025) is a training-as-a-service API: your training loop runs as ordinary Python on a CPU box, and every GPU-touching operation is an API call. The service exposes four primitives, and everything in post-training composes out of them:

import tinker
service_client = tinker.ServiceClient()          # TINKER_API_KEY
training_client = service_client.create_lora_training_client(
    base_model="Qwen/Qwen3.5-9B", rank=32)
training_client.forward_backward(data, loss_fn="cross_entropy")  # grads
training_client.optim_step(adam_params)                          # update
training_client.save_state(name)                                 # checkpoint
sampling_client = training_client.save_weights_and_get_sampling_client()
sampling_client.sample(prompt, num_samples, sampling_params)     # rollouts

Training is LoRA-only (as of mid-2026): jobs attach adapters to a hosted lineup of open-weight base models (Llama 3.x, Qwen3/3.5/3.6, DeepSeek-V3.1, Kimi-K2.x, gpt-oss, Nemotron-3; authoritative list via get_server_capabilities()). The service time-shares a multi-tenant worker pool across many users' LoRA adapters, stepping in lock-step "clock cycles"; billing follows compute used, and small-batch jobs stay efficient because the pool never idles. forward_backward accepts loss functions cross_entropy, importance_sampling, ppo, cispo, and dro, plus arbitrary differentiable losses on logprobs via forward_backward_custom (how DPO is implemented, at ~1.5x FLOPs for the extra forward).

tinker-cookbook (Apache-2.0 on GitHub, PyPI tinker-cookbook) is the open library on top: renderers (chat rendering and loss masking), Env/EnvGroupBuilder/RLDataset RL abstractions, hyperparameter formulas (LoRA hyperparameter scaling), an eval framework with 12 verified benchmarks, and full recipes: chat SFT, math/code RL, DPO and a three-stage RLHF pipeline, on-policy and off-policy distillation, SDFT, tool-use RL (Search-R1 replication), multi-agent self-play, rubric-graded RL, and VLM classification.

Why use it

No cluster to operate. The GPU fleet, distributed training, fault tolerance, and weight hosting are the service's problem; your side is a Python loop and data. This removes the entire training-platform layer for post-training work.
Primitives, not pipelines. Unlike one-trainer-per-method frameworks, the four primitives make custom algorithms (new RL losses, distillation variants, preference schemes) a loop you write, with the cookbook as worked examples rather than a framework to fight.
Small-batch economics. Multi-tenant clock cycles amortize the pool across users, so sample-efficient small-batch training is not wasteful, and MoE models price by active parameters (Qwen3.5-35B-A3B trains cheaper than dense 27B).
Portable results. Checkpoints download as archives and export to HuggingFace format or PEFT adapters for vLLM/SGLang serving; you are not locked into the service for inference.

When to use it (and when not)

Use Tinker for research-grade post-training (SFT, RL with verifiable rewards, DPO/RLHF, distillation) when you have no GPU cluster, when iteration speed on algorithm code matters more than infrastructure control, or when LoRA capacity suffices (LoRA hyperparameter scaling: all RL, small-to-medium SFT).
Prefer self-hosted stacks when you need full fine-tuning or custom architectures, when data governance forbids shipping training data to an external service, when you already own the GPUs (verl, slime, TRL), or when rollouts must hit in-house systems the service cannot reach at scale.
Latency-sensitive tight loops are a mismatch: every step rides a shared clock cycle, so a small job sees big-batch step latency even though it is billed only for its own compute.

Architecture

flowchart LR
  subgraph Client["Your process (CPU)"]
    LOOP["Cookbook loop: SFT / RL / DPO / distill"]
    ENVS["Envs, rewards, renderers"]
  end
  subgraph Service["Tinker service (GPU pools)"]
    POOL["Multi-tenant LoRA worker pool (clock cycles)"]
    SAMP["Sampling workers"]
    CKPT["Checkpoint store (state / sampler, TTL)"]
  end
  LOOP -->|"forward_backward + optim_step"| POOL
  LOOP -->|"save_state / save_weights_for_sampler"| CKPT
  ENVS <-->|"sample / compute_logprobs"| SAMP
  CKPT -->|"download, export to HF"| SERVE["vLLM / SGLang serving"]

How to use it

SFT is cross_entropy over rendered (tokens, weights) datums (chat rendering and loss masking); the cookbook's sl_loop.py is the minimal reference. RL is rollouts through a sampling client, group mean-centered advantages, and one of the RL losses. Reference template (needs tinker, tinker-cookbook):

from tinker_cookbook.recipes import rl_loop   # minimal GRPO-style reference
# python -m tinker_cookbook.recipes.math_rl.train \
#   model_name=Qwen/Qwen3.5-9B group_size=16 groups_per_batch=64 \
#   learning_rate=2e-5 max_tokens=512     # MATH: test correct ~0.84 @ 180 steps
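
For the SFT half of the description above, a minimal numpy sketch of cross-entropy over (tokens, weights) pairs may help. It assumes the loss is a weighted sum of per-token negative log-probabilities, with weight 0 on prompt positions and 1 on completion positions. `masked_cross_entropy` and the toy logits are illustrative and not part of the tinker SDK; whether the service sums or averages should be checked against the losses docs.

```python
import numpy as np


def masked_cross_entropy(logits, targets, weights):
    """Weighted next-token cross-entropy: sum_t w_t * -log p(target_t)."""
    logits = np.asarray(logits, dtype=float)
    targets = np.asarray(targets)
    # Numerically stable log-softmax over the vocabulary axis.
    shifted = logits - logits.max(axis=-1, keepdims=True)
    logprobs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    token_logprobs = logprobs[np.arange(len(targets)), targets]
    return float(-(np.asarray(weights, dtype=float) * token_logprobs).sum())


# Toy sequence: 4 positions over a 3-token vocabulary (illustrative values).
logits = np.array([[2.0, 0.0, 0.0],
                   [0.0, 2.0, 0.0],
                   [0.0, 0.0, 2.0],
                   [1.0, 1.0, 1.0]])
targets = np.array([0, 1, 2, 0])
weights = np.array([0.0, 0.0, 1.0, 1.0])  # prompt positions masked out

loss = masked_cross_entropy(logits, targets, weights)

# Masked (prompt) positions cannot move the loss, whatever their logits are.
perturbed = logits.copy()
perturbed[:2] = np.random.default_rng(0).normal(size=(2, 3))
assert np.isclose(loss, masked_cross_entropy(perturbed, targets, weights))

# A uniform prediction over 3 tokens costs log(3) at the last position.
last_only = masked_cross_entropy(logits, targets, [0.0, 0.0, 0.0, 1.0])
assert np.isclose(last_only, np.log(3.0))
```

The weights vector is what a renderer's loss mask amounts to in this simplified picture: changing which positions carry weight changes what the model is trained to imitate, without touching the tokens themselves.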

The RL loss family is the load-bearing choice. Tinker's default is per-token importance sampling; PPO clips the ratio; CISPO applies the clipped ratio as a detached coefficient on the log-prob so a token's gradient is capped but never zeroed (MiniMax-M1). The executed block validates the gradient behaviour of all three, the mean-centering (std division deliberately omitted per Dr. GRPO), and the KL-penalty shaping used against a reference model:

import numpy as np


def grad_importance_sampling(ratio, adv):
    # surrogate L = -ratio * A; dL/dlogp = -ratio * A (ratio = exp(logp - logq))
    return -ratio * adv


def grad_ppo(ratio, adv, lo=0.8, hi=1.2):
    # L = -min(ratio*A, clip(ratio, lo, hi)*A); gradient is zero when clipped.
    unclipped = ratio * adv
    clipped = np.clip(ratio, lo, hi) * adv
    return -ratio * adv if unclipped <= clipped else 0.0


def grad_cispo(ratio, adv, lo=0.0, hi=4.0):
    # L = -stopgrad(clip(ratio, lo, hi)) * logp * A; the clipped ratio is a
    # detached coefficient, so the gradient is never zeroed, only capped.
    return -np.clip(ratio, lo, hi) * adv


def centered_advantages(rewards):
    # Group mean-centering only; no std division (Dr. GRPO rationale).
    r = np.asarray(rewards, dtype=float)
    return r - r.mean()


# On-policy (ratio 1) all three objectives give the identical REINFORCE update.
for g in (grad_importance_sampling, grad_ppo, grad_cispo):
    assert abs(g(1.0, 2.0) - (-2.0)) < 1e-12

# Off-policy, ratio 2, positive advantage: PPO zeroes the token's gradient
# (clipped); IS and CISPO both pass the raw ratio, since 2 is under CISPO's cap of 4.
assert grad_ppo(2.0, 1.0) == 0.0
assert grad_importance_sampling(2.0, 1.0) == -2.0
assert grad_cispo(2.0, 1.0) == -2.0

# Runaway ratio 8: IS explodes 8x, CISPO saturates at its cap of 4.
assert grad_importance_sampling(8.0, 1.0) == -8.0
assert grad_cispo(8.0, 1.0) == -4.0

# Mean-centered advantages: shift-invariant (reward r -> r + b cancels), but
# NOT scale-invariant; std-normalization would erase difficulty information.
adv = centered_advantages([0.0, 0.0, 1.0, 1.0])
assert np.allclose(adv, centered_advantages([5.0, 5.0, 6.0, 6.0]))
assert not np.allclose(centered_advantages([0.0, 2.0]), centered_advantages([0.0, 4.0]))

# Degenerate group (all rewards equal) has zero advantage everywhere: no
# gradient, which is why constant-reward groups are dropped before training.
assert np.allclose(centered_advantages([1.0, 1.0, 1.0]), 0.0)

# KL penalty as used in the RL loop: adjusted = A + coef * (avg_kl - kl_t).
# The adjustment is zero-sum across tokens and hits the most divergent token.
kl_t = np.array([0.001, 0.002, 0.050, 0.003])  # logp_policy - logp_reference
adjust = 0.05 * (kl_t.mean() - kl_t)
assert abs(adjust.sum()) < 1e-15 and adjust.argmin() == 2

print("RL loss family + advantages: OK")

Working defaults from the cookbook's own guidance: start cross_entropy for SFT and importance_sampling for RL, switching to ppo or cispo on instability; RL LR 1e-5 to 4e-5, group_size 4-16, kl_penalty_coef 0.05 as a moderate starting point, temperature 1.0 (non-1 temperatures interact badly with the KL penalty); more than one optimizer substep per batch requires the PPO objective; sampler-vs-trainer KL under ~0.01 indicates stable training.
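
The sampler-vs-trainer KL mentioned above can be estimated from the log-probabilities of the sampled tokens alone. The sketch below is one common way to do it, not necessarily how the cookbook computes its metric: `sampler_trainer_kl` is a hypothetical helper, and the two toy distributions are made up for illustration. It returns the plain estimator mean(log q - log p) and a lower-variance, non-negative alternative.

```python
import numpy as np


def sampler_trainer_kl(sampler_logprobs, trainer_logprobs):
    """Estimate KL(sampler || trainer) from tokens drawn by the sampler.

    k1 = mean(log q - log p) is unbiased but can go negative on small samples;
    k3 = mean(r - 1 - log r), with r = p / q, is non-negative for every token.
    """
    logq = np.asarray(sampler_logprobs, dtype=float)
    logp = np.asarray(trainer_logprobs, dtype=float)
    log_ratio = logp - logq
    k1 = float(-log_ratio.mean())
    k3 = float((np.exp(log_ratio) - 1.0 - log_ratio).mean())
    return k1, k3


rng = np.random.default_rng(0)
q = np.array([0.5, 0.3, 0.2])     # sampler's next-token distribution (toy)
p = np.array([0.45, 0.35, 0.2])   # trainer's distribution after a small drift
tokens = rng.choice(len(q), size=200_000, p=q)

k1, k3 = sampler_trainer_kl(np.log(q)[tokens], np.log(p)[tokens])
exact = float((q * np.log(q / p)).sum())

# Both estimators agree with the exact KL, which here sits under the ~0.01 guideline.
assert abs(k1 - exact) < 2e-3 and abs(k3 - exact) < 2e-3
assert k3 >= 0.0 and exact < 0.01
```

Logging either estimate every step makes renderer mismatches and stale samplers visible early, since both show up as a KL that is large before any training has happened.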

How to develop with it

The cookbook's abstractions are the generalizable part; they port to any rollout-based stack:

Env (single-use episode: initial_observation(), step(action)) and EnvGroupBuilder (makes the group of envs whose rewards are centered together). Group-level reward computation is first-class, which is what makes pairwise preference tournaments and zero-sum self-play drop into the same GRPO machinery as verifiable rewards.
RLDataset batches env-group builders; builders must be picklable so rollout work can fan out to processes or Ray.
Completers split the policy interface: TokenCompleter (tokens + logprobs, what RL needs) vs MessageCompleter (parsed messages, what evals and tool loops need).
ProblemEnv for single-turn verifiable tasks scores format_coef * (format - 1) + correct, gating format reward on a clean stop-sequence termination so length-limit truncations cannot collect it (reward design, RLVR).
Preference and distillation reuse the RL plumbing: DPO runs on forward_backward_custom with the frozen initial weights as the implicit reference (DPO); on-policy distillation is the RL loop with per-token reverse KL against a teacher's compute_logprobs as the only reward (on-policy distillation); off-policy distillation is standard supervised fine-tuning (hard-target cross-entropy) on the teacher's sampled reasoning traces.
Evals: run_benchmarks(["gsm8k", "mmlu_pro", "ifeval"], ...) with 12 benchmarks verified against published scores, resumable by content hash, plus inline evaluators on the training cadence (LLM evaluation harness, LLM benchmarks). The cookbook's own warning generalizes: max_tokens, temperature, system prompt, and timeouts move scores 10-30% (GSM8K 84.7% to 95.6% from one \boxed{} instruction), so document the exact eval config next to every number.
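
Three of the formulas in the list above are small enough to write out. The stdlib sketch below is illustrative only: the function names, the `format_coef=0.1` and `beta=0.1` defaults, and the boolean `clean_stop` flag are assumptions rather than the cookbook's actual signatures. It shows the ProblemEnv-style reward, the standard DPO loss on sequence log-probabilities, and the per-token reverse-KL reward used for on-policy distillation.

```python
import math


def problem_env_reward(correct, format_ok, clean_stop, format_coef=0.1):
    """format_coef * (format - 1) + correct, with format gated on a clean stop."""
    fmt = 1.0 if (format_ok and clean_stop) else 0.0
    return format_coef * (fmt - 1.0) + (1.0 if correct else 0.0)


def dpo_loss(logp_chosen, logp_rejected, ref_chosen, ref_rejected, beta=0.1):
    """-log sigmoid(beta * (policy margin - reference margin))."""
    margin = (logp_chosen - ref_chosen) - (logp_rejected - ref_rejected)
    return math.log1p(math.exp(-beta * margin))


def reverse_kl_rewards(student_logprobs, teacher_logprobs):
    """Per-token reward -(log p_student - log p_teacher) on student-sampled tokens."""
    return [t - s for s, t in zip(student_logprobs, teacher_logprobs)]


# A correct answer that was truncated by the length limit loses the format term.
assert math.isclose(problem_env_reward(True, True, True), 1.0)
assert math.isclose(problem_env_reward(True, True, False), 0.9)
assert math.isclose(problem_env_reward(False, False, False), -0.1)

# With the policy equal to the frozen reference the DPO loss is log(2);
# raising the chosen sequence's log-prob relative to the reference lowers it.
assert math.isclose(dpo_loss(-10.0, -12.0, -10.0, -12.0), math.log(2.0))
assert dpo_loss(-8.0, -12.0, -10.0, -12.0) < math.log(2.0)

# Tokens the teacher finds unlikely are penalized; agreement gives zero reward.
rewards = reverse_kl_rewards([-0.1, -0.1], [-0.1, -3.0])
assert rewards[0] == 0.0 and rewards[1] < 0.0
```

Because the format term is a penalty of at most `format_coef`, a correct but badly formatted answer still outscores any incorrect one as long as `format_coef` stays below 1.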

How to maintain it

Pin the client stack and track known-bad versions. The cookbook pins around known-bad releases, pydantic >= 2.13.0b1 (serialization stalls) and transformers == 5.3.0 (DeepSeek tokenizer bug); keep the tinker SDK at the latest stable and re-run a smoke test (create a small rank-32 client) after upgrades.
Checkpoints have TTLs. Default persistence is 7 days for periodic checkpoints, with cheap rolling resume-only checkpoints (state, no sampler export) in between and final checkpoints kept indefinitely; download anything you cannot regenerate. Resume needs save_state (weights + optimizer); sampling and export need sampler checkpoints.
Recreate sampling clients after weight saves. A stale sampling client silently samples old weights; the cookbook's "sampler desync" is the canonical silent bug.
Record the renderer with the run. Checkpoint metadata carries the renderer name so later sampling and serving reuse the exact template (chat rendering and loss masking).
Model deprecations are real: the hosted lineup rotates; the docs publish a deprecation schedule, so export weights for anything long-lived.
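
One way to make the stale-sampler rule above hard to violate is to tie every sampling client to the weights version it was created from. The sketch below is a hypothetical pattern, not a cookbook class: `FreshSamplerGuard`, `SamplerHandle`, and the `make_sampler` callable are invented for illustration, and in a real loop `make_sampler` would be whatever call produces a sampling client from the current weights (for example save_weights_and_get_sampling_client).

```python
class SamplerHandle:
    """Pairs a sampling client with the weights version it was created from."""

    def __init__(self, client, weights_version):
        self.client = client
        self.weights_version = weights_version


class FreshSamplerGuard:
    """Tracks the trainer's weights version and recreates the sampler when it is stale."""

    def __init__(self, make_sampler):
        self._make_sampler = make_sampler  # callable: version -> sampling client
        self._version = 0
        self._handle = None

    def on_optim_step(self):
        # Weights changed, so any cached sampling client is now stale.
        self._version += 1

    def sampler(self):
        stale = self._handle is None or self._handle.weights_version != self._version
        if stale:
            client = self._make_sampler(self._version)
            self._handle = SamplerHandle(client, self._version)
        return self._handle.client


created = []


def fake_make_sampler(version):
    # Stand-in for the real client factory; records each creation.
    created.append(version)
    return f"sampler@v{version}"


guard = FreshSamplerGuard(fake_make_sampler)
first = guard.sampler()
assert guard.sampler() is first          # no weight change: cached client reused
guard.on_optim_step()
assert guard.sampler() == "sampler@v1"   # weights moved: client recreated
assert created == [0, 1]
```

Routing every rollout through `guard.sampler()` keeps the decision in one place, so a forgotten refresh shows up as a missing `on_optim_step()` call rather than as silently off-policy data.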

How to run it in production

The performance model is the clock cycle: the worker pool steps in lock-step, requests queued before a cycle run in it, and results land just after the next cycle starts. Throughput is therefore a scheduling property of your client code, validated here (executed, asserts passing):

import math

EPS = 0.1  # network latency: results land just after the next cycle starts


def execute(submit_time):
    """A request submitted at time t runs in the first cycle starting at or
    after t; its result reaches the client just after that cycle ends."""
    cycle = math.ceil(submit_time)
    return cycle, cycle + 1 + EPS


# One naive step (submit forward_backward, await, then submit optim_step)
# burns 3 clock cycles: fb runs on cycle 0, optim on cycle 2, done at 3.1.
c_fb, t = execute(0.0)
c_opt, t_done = execute(t)
assert (c_fb, c_opt) == (0, 2) and math.floor(t_done) == 3

# Submitting both back-to-back before awaiting lands them on the same cycle.
c_fb, _ = execute(0.0)
c_opt, t_done = execute(0.0)
assert c_fb == c_opt == 0 and math.floor(t_done) == 1

# Await-then-submit still wastes every other cycle across a run: each result
# arrives just after a cycle boundary, so steps execute on cycles 0,2,4,...
t, cycles = 0.0, []
for _ in range(8):
    c, t = execute(t)
    cycles.append(c)
assert cycles == [0, 2, 4, 6, 8, 10, 12, 14]

# Pipelined (submit batch k+1 before awaiting batch k): every cycle does work.
cycles = [execute(float(k))[0] for k in range(8)]
assert cycles == [0, 1, 2, 3, 4, 5, 6, 7]

print("clock-cycle pipelining: OK")

Operational consequences:

Always submit forward_backward and optim_step together before awaiting either (the cookbook loops do this via submit_ahead); sequential awaiting is the number-one performance mistake and costs 2-3x wall clock.
Overlap rollouts and training for RL with stream-minibatch mode (train as soon as a minibatch of trajectories lands) or async mode with a bounded max_steps_off_policy, which requeues stale samples and holds batch size constant so the LR needs no adjustment (async RL systems).
Beware the GIL. The SDK's background thread shares the interpreter with your loop; CPU-heavy reward functions block heartbeats and look like service hangs. Move heavy work out of the loop or into subprocess sampling mode.
Export for serving. Download sampler checkpoints and either merge to HF format (merge in float32; MoE fused-expert layouts differ per family, and a wrong gate/up convention silently corrupts weights) or serve the unmerged adapter via vLLM --lora-modules (multi-LoRA serving, serving open-weight models, engine weight loading).
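
The advice to merge in float32 can be illustrated with a toy LoRA merge, W + (alpha / rank) * B @ A. The numpy sketch below is an assumption-laden illustration rather than the cookbook's export code: numpy has no bfloat16, so `to_bf16` emulates it by truncating float32 values to their top 16 bits (real hardware rounds to nearest, which gives the same result in this example), and the weight and adapter values are invented.

```python
import numpy as np


def to_bf16(x):
    """Emulate bfloat16 storage by truncating float32 values to their top 16 bits."""
    bits = np.ascontiguousarray(x, dtype=np.float32).view(np.uint32)
    return (bits & np.uint32(0xFFFF0000)).view(np.float32)


def merge_lora(weight, lora_a, lora_b, alpha, rank, dtype):
    """Return W + (alpha / rank) * B @ A, rounded as `dtype` ('float32' or 'bf16')."""
    delta = (alpha / rank) * (lora_b.astype(np.float32) @ lora_a.astype(np.float32))
    if dtype == "bf16":
        # Round the base weight, the delta, and their sum to bf16 precision.
        return to_bf16(to_bf16(weight) + to_bf16(delta))
    return weight.astype(np.float32) + delta


rank, alpha = 2, 2.0
weight = np.ones((4, 4), dtype=np.float32)            # toy base weights
lora_a = np.full((rank, 4), 0.01, dtype=np.float32)   # toy adapter factors
lora_b = np.full((4, rank), 0.05, dtype=np.float32)

merged_fp32 = merge_lora(weight, lora_a, lora_b, alpha, rank, "float32")
merged_bf16 = merge_lora(weight, lora_a, lora_b, alpha, rank, "bf16")

# The adapter adds 0.001 to every entry. float32 keeps it; near 1.0 the bf16
# grid is 2**-7 wide, so the low-precision merge returns the base weights.
assert np.allclose(merged_fp32, 1.001, atol=1e-6)
assert np.array_equal(merged_bf16, weight)
```

The merged tensor loads without complaint in both cases, which is why comparing exported-model samples against the service's own samples is the practical check.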

Failure modes

Sequential API calls: each step burns 2-3 clock cycles instead of 1; throughput collapses with no error anywhere.
Stale sampling client after a weight save: rollouts silently come from old weights; RL appears to stop learning.
Wrong renderer for the model (thinking vs non-thinking variants especially): loss goes down while outputs degrade; high sampler/trainer KL at step 0 is the tell (chat rendering and loss masking).
Constant-reward groups left in the batch: zero advantage, zero gradient, wasted rollouts; drop them before training (GRPO stacks surface the fraction of such groups as the frac_reward_zero_std metric; GRPO variants).
Checkpoint TTL expiry: a 7-day default quietly deletes the run's artifacts; export or extend TTLs for anything that matters.
Merge-convention bugs on export: MoE gate/up fusion order and bf16 merge arithmetic produce a model that loads fine and scores garbage; verify exported weights against the service's own samples before serving.
Service dependency: rate limits, model deprecations, and clock-cycle latency are outside your control; keep the loop code portable (the cookbook abstractions run against any backend that implements the four primitives).
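
The constant-reward failure mode above is cheap to guard against before any datums are built. A minimal numpy sketch follows; `drop_constant_groups` and its tolerance are hypothetical, and the returned fraction is meant to play the same role as the frac_reward_zero_std metric rather than reproduce any library's exact definition.

```python
import numpy as np


def drop_constant_groups(group_rewards, atol=1e-12):
    """Return (kept_groups, frac_dropped); a group is dropped when all its rewards are equal."""
    kept = []
    for rewards in group_rewards:
        r = np.asarray(rewards, dtype=float)
        if r.max() - r.min() > atol:   # some spread, so centered advantages are nonzero
            kept.append(rewards)
    frac_dropped = 1.0 - len(kept) / len(group_rewards)
    return kept, frac_dropped


groups = [
    [1.0, 1.0, 1.0, 1.0],   # every rollout solved it: no learning signal
    [0.0, 1.0, 0.0, 1.0],
    [0.0, 0.0, 0.0, 0.0],   # every rollout failed: no learning signal
    [0.0, 0.0, 1.0, 0.0],
]
kept, frac_dropped = drop_constant_groups(groups)

assert kept == [[0.0, 1.0, 0.0, 1.0], [0.0, 0.0, 1.0, 0.0]]
assert frac_dropped == 0.5
```

A dropped fraction that climbs over training usually means the task mix has become too easy or too hard for the current policy, which is a data problem rather than an optimizer problem.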

References

Tinker: https://thinkingmachines.ai/tinker/ · announcement: https://thinkingmachines.ai/blog/announcing-tinker/
Tinker docs: https://tinker-docs.thinkingmachines.ai · under the hood (clock cycles): https://tinker-docs.thinkingmachines.ai/tinker/under-the-hood/ · losses: https://tinker-docs.thinkingmachines.ai/tinker/losses/
tinker-cookbook: https://github.com/thinking-machines-lab/tinker-cookbook · SDK: https://github.com/thinking-machines-lab/tinker · PyPI: https://pypi.org/project/tinker-cookbook/
Dr. GRPO (Liu et al., 2025): https://arxiv.org/abs/2503.20783 · CISPO / MiniMax-M1: https://arxiv.org/abs/2506.13585
On-policy distillation (Thinking Machines): https://thinkingmachines.ai/blog/on-policy-distillation/