Tinker (training-as-a-service)
Scope: Thinking Machines' managed fine-tuning API (tinker) and its open-source recipe library (tinker-cookbook): the four training primitives, the multi-tenant clock-cycle execution model, the RL/SFT/DPO/distillation abstractions, the loss-function family (importance sampling, PPO, CISPO), evaluation framework, and how to operate it well (pipelining, checkpoints, weight export). It trades cluster ownership for an API: compare the self-hosted stacks in RL libraries.
The python blocks that call tinker or tinker_cookbook are reference templates on real APIs (they need an API key and the installed SDK; not runnable in this doc's CI): pin versions and verify field names before production use. Each core mechanism is paired with a dependency-free numpy/stdlib block validating the underlying math, and every one of those was executed with asserts passing.
What it is
Tinker (Thinking Machines Lab, announced October 2025) is a training-as-a-service API: your training loop runs as ordinary Python on a CPU box, and every GPU-touching operation is an API call. The service exposes four primitives, and everything in post-training composes out of them:
import tinker
service_client = tinker.ServiceClient() # TINKER_API_KEY
training_client = service_client.create_lora_training_client(
    base_model="Qwen/Qwen3.5-9B", rank=32)
training_client.forward_backward(data, loss_fn="cross_entropy") # grads
training_client.optim_step(adam_params) # update
training_client.save_state(name) # checkpoint
sampling_client = training_client.save_weights_and_get_sampling_client()
sampling_client.sample(prompt, num_samples, sampling_params) # rollouts
Training is LoRA-only (as of mid-2026): jobs attach adapters to a hosted lineup of open-weight base models (Llama 3.x, Qwen3/3.5/3.6, DeepSeek-V3.1, Kimi-K2.x, gpt-oss, Nemotron-3; authoritative list via get_server_capabilities()). The service time-shares a multi-tenant worker pool across many users' LoRA adapters, stepping in lock-step "clock cycles"; billing follows compute used, and small-batch jobs stay efficient because the pool never idles. forward_backward accepts loss functions cross_entropy, importance_sampling, ppo, cispo, and dro, plus arbitrary differentiable losses on logprobs via forward_backward_custom (how DPO is implemented, at ~1.5x FLOPs for the extra forward).
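To make "arbitrary differentiable losses on logprobs" concrete, here is a dependency-free sketch of the standard DPO objective computed from summed sequence log-probabilities. The function name dpo_loss, the beta value, and the numbers are illustrative assumptions; the sketch does not call forward_backward_custom, whose exact signature should be checked against the Tinker docs.

```python
import numpy as np

def dpo_loss(logp_chosen, logp_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Standard DPO loss for one preference pair, from summed sequence log-probs.

    beta=0.1 is an illustrative value, not a documented Tinker default.
    """
    # Implicit reward margin: how much more the policy prefers the chosen
    # completion over the rejected one than the frozen reference does.
    margin = (logp_chosen - ref_chosen) - (logp_rejected - ref_rejected)
    # -log(sigmoid(beta * margin)), written with logaddexp for stability.
    return float(np.logaddexp(0.0, -beta * margin))

# Policy identical to the reference: margin is 0, so the loss is log(2).
at_init = dpo_loss(-10.0, -12.0, -10.0, -12.0)
# Policy moved toward the chosen completion: the loss drops.
improved = dpo_loss(-9.0, -13.0, -10.0, -12.0)
# Policy moved toward the rejected completion: the loss rises.
regressed = dpo_loss(-11.0, -11.0, -10.0, -12.0)
print(at_init, improved, regressed)
```

The loss needs log-probabilities from both the policy and a reference for every pair, so the reference pass is where the extra forward cost of a custom loss would come from in a setup like this.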
tinker-cookbook (Apache-2.0 on GitHub, PyPI tinker-cookbook) is the open library on top: renderers (chat rendering and loss masking), Env/EnvGroupBuilder/RLDataset RL abstractions, hyperparameter formulas (LoRA hyperparameter scaling), an eval framework with 12 verified benchmarks, and full recipes: chat SFT, math/code RL, DPO and a three-stage RLHF pipeline, on-policy and off-policy distillation, SDFT, tool-use RL (Search-R1 replication), multi-agent self-play, rubric-graded RL, and VLM classification.
Why use it
- No cluster to operate. The GPU fleet, distributed training, fault tolerance, and weight hosting are the service's problem; your side is a Python loop and data. This removes the entire training-platform layer for post-training work.
- Primitives, not pipelines. Unlike one-trainer-per-method frameworks, the four primitives make custom algorithms (new RL losses, distillation variants, preference schemes) a loop you write, with the cookbook as worked examples rather than a framework to fight.
- Small-batch economics. Multi-tenant clock cycles amortize the pool across users, so sample-efficient small-batch training is not wasteful, and MoE models price by active parameters (Qwen3.5-35B-A3B trains cheaper than dense 27B).
- Portable results. Checkpoints download as archives and export to HuggingFace format or PEFT adapters for vLLM/SGLang serving; you are not locked into the service for inference.
When to use it (and when not)
- Use Tinker for research-grade post-training (SFT, RL with verifiable rewards, DPO/RLHF, distillation) when you have no GPU cluster, when iteration speed on algorithm code matters more than infrastructure control, or when LoRA capacity suffices (LoRA hyperparameter scaling: all RL, small-to-medium SFT).
- Prefer self-hosted stacks when you need full fine-tuning or custom architectures, when data governance forbids shipping training data to an external service, when you already own the GPUs (verl, slime, TRL), or when rollouts must hit in-house systems the service cannot reach at scale.
- Latency-sensitive tight loops are a mismatch: every step rides a shared clock cycle, so a small job sees big-batch step latency even though it is billed only for its own compute.
Architecture
flowchart LR
subgraph Client["Your process (CPU)"]
LOOP["Cookbook loop: SFT / RL / DPO / distill"]
ENVS["Envs, rewards, renderers"]
end
subgraph Service["Tinker service (GPU pools)"]
POOL["Multi-tenant LoRA worker pool (clock cycles)"]
SAMP["Sampling workers"]
CKPT["Checkpoint store (state / sampler, TTL)"]
end
LOOP -->|"forward_backward + optim_step"| POOL
LOOP -->|"save_state / save_weights_for_sampler"| CKPT
ENVS <-->|"sample / compute_logprobs"| SAMP
CKPT -->|"download, export to HF"| SERVE["vLLM / SGLang serving"]
How to use it
SFT is cross_entropy over rendered (tokens, weights) datums (chat rendering and loss masking); the cookbook's sl_loop.py is the minimal reference. RL is rollouts through a sampling client, group mean-centered advantages, and one of the RL losses. Reference template (needs tinker, tinker-cookbook):
from tinker_cookbook.recipes import rl_loop # minimal GRPO-style reference
# python -m tinker_cookbook.recipes.math_rl.train \
# model_name=Qwen/Qwen3.5-9B group_size=16 groups_per_batch=64 \
# learning_rate=2e-5 max_tokens=512 # MATH: test correct ~0.84 @ 180 steps
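The SFT objective described above, cross-entropy over (tokens, weights) datums, can be sketched without the SDK. This is a minimal numpy illustration of per-token loss weighting; the function name and the sum reduction are assumptions made for the example, not the service's documented behaviour.

```python
import numpy as np

def weighted_cross_entropy(logits, targets, weights):
    """Sum of per-token negative log-likelihoods scaled by per-token loss weights."""
    logits = np.asarray(logits, dtype=float)
    # Numerically stable log-softmax over the vocabulary axis.
    shifted = logits - logits.max(axis=-1, keepdims=True)
    logprobs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    # Pick out the log-prob of each target token.
    token_nll = -logprobs[np.arange(len(targets)), targets]
    return float((token_nll * np.asarray(weights, dtype=float)).sum())

# Toy sequence: 4 positions, vocabulary of 3. Weight 0 marks prompt tokens
# (masked out of the loss); weight 1 marks completion tokens.
logits = np.zeros((4, 3))                 # uniform predictions: NLL = log(3) per token
targets = np.array([0, 1, 2, 1])
weights = np.array([0.0, 0.0, 1.0, 1.0])
loss = weighted_cross_entropy(logits, targets, weights)

# Changing predictions at masked positions must not change the loss.
perturbed = logits.copy()
perturbed[0] = [5.0, -5.0, 0.0]
loss_perturbed = weighted_cross_entropy(perturbed, targets, weights)
```

Only the two completion positions contribute, which is the property the renderer's loss mask is there to guarantee.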
The RL loss family is the load-bearing choice. Tinker's default is per-token importance sampling; PPO clips the ratio; CISPO applies the clipped ratio as a detached coefficient on the log-prob so a token's gradient is capped but never zeroed (MiniMax-M1). The executed block validates the gradient behaviour of all three, the mean-centering (std division deliberately omitted per Dr. GRPO), and the KL-penalty shaping used against a reference model:
import numpy as np

def grad_importance_sampling(ratio, adv):
    # surrogate L = -ratio * A; dL/dlogp = -ratio * A (ratio = exp(logp - logq))
    return -ratio * adv

def grad_ppo(ratio, adv, lo=0.8, hi=1.2):
    # L = -min(ratio*A, clip(ratio, lo, hi)*A); gradient is zero when clipped.
    unclipped = ratio * adv
    clipped = np.clip(ratio, lo, hi) * adv
    return -ratio * adv if unclipped <= clipped else 0.0

def grad_cispo(ratio, adv, lo=0.0, hi=4.0):
    # L = -stopgrad(clip(ratio, lo, hi)) * logp * A; the clipped ratio is a
    # detached coefficient, so the gradient is never zeroed, only capped.
    return -np.clip(ratio, lo, hi) * adv

def centered_advantages(rewards):
    # Group mean-centering only; no std division (Dr. GRPO rationale).
    r = np.asarray(rewards, dtype=float)
    return r - r.mean()

# On-policy (ratio 1) all three objectives give the identical REINFORCE update.
for g in (grad_importance_sampling, grad_ppo, grad_cispo):
    assert abs(g(1.0, 2.0) - (-2.0)) < 1e-12

# Off-policy, ratio 2, positive advantage: PPO zeroes the token's gradient
# (clipped), IS follows the raw ratio unbounded, CISPO caps but keeps learning.
assert grad_ppo(2.0, 1.0) == 0.0
assert grad_importance_sampling(2.0, 1.0) == -2.0
assert grad_cispo(2.0, 1.0) == -2.0

# Runaway ratio 8: IS explodes 8x, CISPO saturates at its cap of 4.
assert grad_importance_sampling(8.0, 1.0) == -8.0
assert grad_cispo(8.0, 1.0) == -4.0

# Mean-centered advantages: shift-invariant (reward r -> r + b cancels), but
# NOT scale-invariant; std-normalization would erase difficulty information.
adv = centered_advantages([0.0, 0.0, 1.0, 1.0])
assert np.allclose(adv, centered_advantages([5.0, 5.0, 6.0, 6.0]))
assert not np.allclose(centered_advantages([0.0, 2.0]), centered_advantages([0.0, 4.0]))

# Degenerate group (all rewards equal) has zero advantage everywhere: no
# gradient, which is why constant-reward groups are dropped before training.
assert np.allclose(centered_advantages([1.0, 1.0, 1.0]), 0.0)

# KL penalty as used in the RL loop: adjusted = A + coef * (avg_kl - kl_t).
# The adjustment is zero-sum across tokens and hits the most divergent token.
kl_t = np.array([0.001, 0.002, 0.050, 0.003])  # logp_policy - logp_reference
adjust = 0.05 * (kl_t.mean() - kl_t)
assert abs(adjust.sum()) < 1e-15 and adjust.argmin() == 2

print("RL loss family + advantages: OK")
Working defaults from the cookbook's own guidance: start cross_entropy for SFT and importance_sampling for RL, switching to ppo or cispo on instability; RL LR 1e-5 to 4e-5, group_size 4-16, kl_penalty_coef 0.05 as a moderate starting point, temperature 1.0 (non-1 temperatures interact badly with the KL penalty); more than one optimizer substep per batch requires the PPO objective; sampler-vs-trainer KL under ~0.01 indicates stable training.
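The sampler-vs-trainer KL check mentioned above can be approximated from log-probabilities alone. The sketch below uses the simplest Monte Carlo estimator, the mean of logp_sampler - logp_trainer over tokens the sampler actually drew; whether the cookbook reports this exact estimator is an assumption, and the function name and numbers are made up for illustration.

```python
import numpy as np

KL_STABLE_THRESHOLD = 0.01  # the "~0.01" rule of thumb quoted above

def sampler_trainer_kl(sampler_logprobs, trainer_logprobs):
    """Monte Carlo estimate of KL(sampler || trainer) on sampler-drawn tokens.

    Both arrays hold the log-prob each side assigns to the SAME sampled tokens.
    Small batches can give slightly negative estimates; that is sampling noise.
    """
    s = np.asarray(sampler_logprobs, dtype=float)
    t = np.asarray(trainer_logprobs, dtype=float)
    return float((s - t).mean())

# Hypothetical log-probs for four sampled tokens.
sampler_lp = [-1.20, -0.35, -2.10, -0.80]
trainer_lp = [-1.21, -0.34, -2.12, -0.80]
kl = sampler_trainer_kl(sampler_lp, trainer_lp)
stable = kl < KL_STABLE_THRESHOLD

# A large gap on every token (e.g. a mismatched renderer) trips the check.
mismatched = sampler_trainer_kl(sampler_lp, [lp - 0.5 for lp in sampler_lp])
```

Logging this value at step 0, before any update has been applied, separates a setup problem from genuine off-policy drift later in the run.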
How to develop with it
The cookbook's abstractions are the generalizable part; they port to any rollout-based stack:
- Env (single-use episode: initial_observation(), step(action)) and EnvGroupBuilder (makes the group of envs whose rewards are centered together). Group-level reward computation is first-class, which is what makes pairwise preference tournaments and zero-sum self-play drop into the same GRPO machinery as verifiable rewards.
- RLDataset batches env-group builders; builders must be picklable so rollout work can fan out to processes or Ray.
- Completers split the policy interface: TokenCompleter (tokens + logprobs, what RL needs) vs MessageCompleter (parsed messages, what evals and tool loops need).
- ProblemEnv for single-turn verifiable tasks scores format_coef * (format - 1) + correct, gating format reward on a clean stop-sequence termination so length-limit truncations cannot collect it (reward design, RLVR).
- Preference and distillation reuse the RL plumbing: DPO runs on forward_backward_custom with the frozen initial weights as the implicit reference (DPO); on-policy distillation is the RL loop with per-token reverse KL against a teacher's compute_logprobs as the only reward (on-policy distillation); off-policy distillation is standard supervised fine-tuning (hard-target cross-entropy) on the teacher's sampled reasoning traces.
- Evals: run_benchmarks(["gsm8k", "mmlu_pro", "ifeval"], ...) with 12 benchmarks verified against published scores, resumable by content hash, plus inline evaluators on the training cadence (LLM evaluation harness, LLM benchmarks). Their own warning generalizes: max_tokens, temperature, system prompt, and timeouts move scores 10-30% (GSM8K 84.7% to 95.6% from one \boxed{} instruction), so document the exact eval config next to every number.
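Two of the reward rules in the list above are small enough to write out. The sketch below implements the ProblemEnv scoring formula and the per-token reverse-KL reward used for on-policy distillation. The function names, the format_coef value of 0.1, and the split of the format check into well_formed and stopped_cleanly are assumptions for illustration, not the cookbook's actual signatures.

```python
import numpy as np

def problem_reward(correct, well_formed, stopped_cleanly, format_coef=0.1):
    """format_coef * (format - 1) + correct, with format gated on a clean stop."""
    # Format credit requires BOTH a parseable answer and termination on a stop
    # sequence, so a length-limit truncation can never collect it.
    fmt = 1.0 if (well_formed and stopped_cleanly) else 0.0
    return format_coef * (fmt - 1.0) + (1.0 if correct else 0.0)

def reverse_kl_rewards(student_logprobs, teacher_logprobs):
    """Per-token reward = -(logp_student - logp_teacher) on student-sampled tokens."""
    s = np.asarray(student_logprobs, dtype=float)
    t = np.asarray(teacher_logprobs, dtype=float)
    return t - s

r_clean = problem_reward(correct=True, well_formed=True, stopped_cleanly=True)
r_truncated = problem_reward(correct=True, well_formed=True, stopped_cleanly=False)
r_wrong_bad_format = problem_reward(correct=False, well_formed=False, stopped_cleanly=True)

# Student is overconfident on token 1 relative to the teacher: negative reward there.
kl_rewards = reverse_kl_rewards([-0.5, -0.1, -2.0], [-0.5, -1.1, -2.0])
```

Under this formula the format term is only ever a penalty (zero at best), so correctness stays the sole source of positive reward.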
How to maintain it
- Pin the client stack and track known-bad versions. The cookbook pins around pydantic >= 2.13.0b1 (serialization stalls) and transformers == 5.3.0 (DeepSeek tokenizer bug); keep the tinker SDK at the latest stable and re-run a smoke test (create a small rank-32 client) after upgrades.
- Checkpoints have TTLs. Default persistence is 7 days for periodic checkpoints, with cheap rolling resume-only checkpoints (state, no sampler export) in between and final checkpoints kept indefinitely; download anything you cannot regenerate. Resume needs save_state (weights + optimizer); sampling and export need sampler checkpoints.
- Recreate sampling clients after weight saves. A stale sampling client silently samples old weights; the cookbook's "sampler desync" is the canonical silent bug.
- Record the renderer with the run. Checkpoint metadata carries the renderer name so later sampling and serving reuse the exact template (chat rendering and loss masking).
- Model deprecations are real: the hosted lineup rotates; the docs publish a deprecation schedule, so export weights for anything long-lived.
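The TTL rule in the list above lends itself to a small client-side guard. This is a stdlib-only sketch that flags checkpoints close to the 7-day default; the checkpoint names, timestamps, and the dict layout are hypothetical, and real expiry times should come from the service's own checkpoint metadata rather than from local arithmetic.

```python
from datetime import datetime, timedelta, timezone

DEFAULT_TTL = timedelta(days=7)  # periodic-checkpoint default quoted above

def expiring_soon(checkpoints, now, warn_within=timedelta(days=2), ttl=DEFAULT_TTL):
    """Return sorted names of checkpoints whose TTL ends within `warn_within`.

    `checkpoints` maps name -> (created_at, keep_forever). Already-expired
    entries are flagged too, since their remaining time is negative.
    """
    flagged = []
    for name, (created_at, keep_forever) in checkpoints.items():
        if keep_forever:
            continue  # final checkpoints are kept indefinitely
        remaining = created_at + ttl - now
        if remaining <= warn_within:
            flagged.append(name)
    return sorted(flagged)

utc = timezone.utc
now = datetime(2026, 6, 10, tzinfo=utc)
checkpoints = {
    "step_100": (datetime(2026, 6, 4, tzinfo=utc), False),  # expires in 1 day
    "step_200": (datetime(2026, 6, 9, tzinfo=utc), False),  # expires in 6 days
    "final":    (datetime(2026, 6, 1, tzinfo=utc), True),   # kept indefinitely
}
to_download = expiring_soon(checkpoints, now)
```

Running a check like this on the training cadence turns a silent deletion into an explicit download or TTL-extension task.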
How to run it in production
The performance model is the clock cycle: the worker pool steps in lock-step, requests queued before a cycle run in it, and results land just after the next cycle starts. Throughput is therefore a scheduling property of your client code, validated here (executed, asserts passing):
import math

EPS = 0.1  # network latency: results land just after the next cycle starts

def execute(submit_time):
    """A request submitted at time t runs in the first cycle starting at or
    after t; its result reaches the client just after that cycle ends."""
    cycle = math.ceil(submit_time)
    return cycle, cycle + 1 + EPS

# One naive step (submit forward_backward, await, then submit optim_step)
# burns 3 clock cycles: fb runs on cycle 0, optim on cycle 2, done at 3.1.
c_fb, t = execute(0.0)
c_opt, t_done = execute(t)
assert (c_fb, c_opt) == (0, 2) and math.floor(t_done) == 3

# Submitting both back-to-back before awaiting lands them on the same cycle.
c_fb, _ = execute(0.0)
c_opt, t_done = execute(0.0)
assert c_fb == c_opt == 0 and math.floor(t_done) == 1

# Await-then-submit still wastes every other cycle across a run: each result
# arrives just after a cycle boundary, so steps execute on cycles 0,2,4,...
t, cycles = 0.0, []
for _ in range(8):
    c, t = execute(t)
    cycles.append(c)
assert cycles == [0, 2, 4, 6, 8, 10, 12, 14]

# Pipelined (submit batch k+1 before awaiting batch k): every cycle does work.
cycles = [execute(float(k))[0] for k in range(8)]
assert cycles == [0, 1, 2, 3, 4, 5, 6, 7]

print("clock-cycle pipelining: OK")
Operational consequences:
- Always submit forward_backward and optim_step together before awaiting either (the cookbook loops do this via submit_ahead); sequential awaiting is the number-one performance mistake and costs 2-3x wall clock.
- Overlap rollouts and training for RL with stream-minibatch mode (train as soon as a minibatch of trajectories lands) or async mode with a bounded max_steps_off_policy, which requeues stale samples and holds batch size constant so the LR needs no adjustment (async RL systems).
- Beware the GIL. The SDK's background thread shares the interpreter with your loop; CPU-heavy reward functions block heartbeats and look like service hangs. Move heavy work out of the loop or into subprocess sampling mode.
- Export for serving. Download sampler checkpoints and either merge to HF format (merge in float32; MoE fused-expert layouts differ per family, and a wrong gate/up convention silently corrupts weights) or serve the unmerged adapter via vLLM --lora-modules (multi-LoRA serving, serving open-weight models, engine weight loading).
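The "merge in float32" advice in the export item above can be checked numerically. The sketch below merges a toy LoRA adapter into a base weight two ways and compares both against a float64 reference. It assumes the common convention W' = W + (alpha / r) * B @ A, uses float16 as a stand-in for bf16 because NumPy has no bfloat16 dtype, and all shapes and values are made up; it says nothing about MoE fused-expert layouts.

```python
import numpy as np

rng = np.random.default_rng(0)
d, r, alpha = 64, 8, 16.0
scale = alpha / r  # assumed LoRA scaling convention

W16 = rng.standard_normal((d, d)).astype(np.float16)        # base weight as stored
B = (0.1 * rng.standard_normal((d, r))).astype(np.float32)  # adapter factors
A = (0.1 * rng.standard_normal((r, d))).astype(np.float32)

# Float64 reference for measuring error.
reference = W16.astype(np.float64) + scale * (B.astype(np.float64) @ A.astype(np.float64))

# Recommended path: upcast, merge in float32, round to storage precision once.
merged_fp32 = (W16.astype(np.float32) + np.float32(scale) * (B @ A)).astype(np.float16)

# Low-precision path: round the factors and the delta to float16 before adding,
# so rounding error is introduced at several steps instead of one.
B16, A16 = B.astype(np.float16), A.astype(np.float16)
delta16 = (np.float32(scale) * (B16.astype(np.float32) @ A16.astype(np.float32))).astype(np.float16)
merged_low = W16 + delta16

err_fp32 = float(np.abs(merged_fp32.astype(np.float64) - reference).mean())
err_low = float(np.abs(merged_low.astype(np.float64) - reference).mean())
print(f"mean abs error  fp32 merge: {err_fp32:.3e}  low-precision merge: {err_low:.3e}")
```

The float32 path rounds once at the end and so sits at the floor set by the storage format; the low-precision path can only match or exceed that error.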
Failure modes
- Sequential API calls: each step burns 2-3 clock cycles instead of 1; throughput collapses with no error anywhere.
- Stale sampling client after a weight save: rollouts silently come from old weights; RL appears to stop learning.
- Wrong renderer for the model (thinking vs non-thinking variants especially): loss goes down while outputs degrade; high sampler/trainer KL at step 0 is the tell (chat rendering and loss masking).
- Constant-reward groups left in the batch: zero advantage, zero gradient, wasted rollouts; drop them (and expect the metric frac_reward_zero_std in GRPO stacks, GRPO variants).
- Checkpoint TTL expiry: a 7-day default quietly deletes the run's artifacts; export or extend TTLs for anything that matters.
- Merge-convention bugs on export: MoE gate/up fusion order and bf16 merge arithmetic produce a model that loads fine and scores garbage; verify exported weights against the service's own samples before serving.
- Service dependency: rate limits, model deprecations, and clock-cycle latency are outside your control; keep the loop code portable (the cookbook abstractions run against any backend that implements the four primitives).
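The constant-reward failure mode in the list above is cheap to guard against on the client side. This is a minimal numpy sketch of the filter and the accompanying metric; the function name and return shape are assumptions for illustration, and the metric is computed here as the fraction of groups whose rewards are all equal.

```python
import numpy as np

def drop_constant_groups(group_rewards):
    """Return (groups with reward variation, fraction of groups with zero std)."""
    kept = []
    for rewards in group_rewards:
        r = np.asarray(rewards, dtype=float)
        # A group with max == min has zero mean-centered advantage everywhere.
        if r.max() != r.min():
            kept.append(r)
    frac_zero_std = 1.0 - len(kept) / len(group_rewards)
    return kept, frac_zero_std

groups = [
    [1.0, 1.0, 1.0, 1.0],  # solved by every sample: no learning signal
    [0.0, 1.0, 0.0, 1.0],
    [0.0, 0.0, 0.0, 0.0],  # failed by every sample: no learning signal
    [1.0, 0.0, 0.0, 0.0],
]
kept, frac_zero_std = drop_constant_groups(groups)
```

A frac_zero_std that climbs toward 1.0 over a run suggests the task mix has become too easy or too hard for the current policy.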
References
- Tinker: https://thinkingmachines.ai/tinker/ · announcement: https://thinkingmachines.ai/blog/announcing-tinker/
- Tinker docs: https://tinker-docs.thinkingmachines.ai · under the hood (clock cycles): https://tinker-docs.thinkingmachines.ai/tinker/under-the-hood/ · losses: https://tinker-docs.thinkingmachines.ai/tinker/losses/
- tinker-cookbook: https://github.com/thinking-machines-lab/tinker-cookbook · SDK: https://github.com/thinking-machines-lab/tinker · PyPI: https://pypi.org/project/tinker-cookbook/
- Dr. GRPO (Liu et al., 2025): https://arxiv.org/abs/2503.20783 · CISPO / MiniMax-M1: https://arxiv.org/abs/2506.13585
- On-policy distillation (Thinking Machines): https://thinkingmachines.ai/blog/on-policy-distillation/
Related: RL Libraries · Chat Rendering & Loss Masking · LoRA Hyperparameter Scaling · Fine-tuning · GRPO · DPO · RLVR · On-Policy Distillation · Async RL · LLM Evaluation Harness · Glossary