Tinker (training-as-a-service)

Scope: Thinking Machines' managed fine-tuning API (tinker) and its open-source recipe library (tinker-cookbook): the four training primitives, the multi-tenant clock-cycle execution model, the RL/SFT/DPO/distillation abstractions, the loss-function family (importance sampling, PPO, CISPO), evaluation framework, and how to operate it well (pipelining, checkpoints, weight export). It trades cluster ownership for an API: compare the self-hosted stacks in RL libraries.

The python blocks that call tinker or tinker_cookbook are reference templates against real APIs: they need an API key and the installed SDK and are not run in this doc's CI, so pin versions and verify field names before production use. Each core mechanism is paired with a numpy/stdlib block (no SDK dependency) that validates the underlying math, and every one of those was executed with asserts passing.

What it is

Tinker (Thinking Machines Lab, announced October 2025) is a training-as-a-service API: your training loop runs as ordinary Python on a CPU box, and every GPU-touching operation is an API call. The service exposes four primitives, and everything in post-training composes out of them:

import tinker

# Reference template: data, adam_params, name, prompt, num_samples, and
# sampling_params are placeholders to define before running.
service_client = tinker.ServiceClient()          # needs TINKER_API_KEY
training_client = service_client.create_lora_training_client(
    base_model="Qwen/Qwen3.5-9B", rank=32)
training_client.forward_backward(data, loss_fn="cross_entropy")  # grads
training_client.optim_step(adam_params)                          # update
training_client.save_state(name)                                 # checkpoint
sampling_client = training_client.save_weights_and_get_sampling_client()
sampling_client.sample(prompt, num_samples, sampling_params)     # rollouts

Training is LoRA-only (as of mid-2026): jobs attach adapters to a hosted lineup of open-weight base models (Llama 3.x, Qwen3/3.5/3.6, DeepSeek-V3.1, Kimi-K2.x, gpt-oss, Nemotron-3; get_server_capabilities() returns the authoritative list). The service time-shares a multi-tenant worker pool across many users' LoRA adapters, stepping in lock-step "clock cycles". Billing follows compute used, and small-batch jobs stay efficient because the pool never idles. forward_backward accepts the loss functions cross_entropy, importance_sampling, ppo, cispo, and dro; forward_backward_custom accepts arbitrary differentiable losses on logprobs, which is how DPO is implemented, at roughly 1.5x the FLOPs because of the extra forward pass.
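To make "attaching an adapter" concrete, here is a minimal numpy sketch of the standard LoRA parameterization (a frozen weight plus a scaled low-rank product). The sizes, the alpha value, and the function name lora_forward are illustrative assumptions; nothing here reflects how the service implements or initializes adapters internally.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out, rank, alpha = 64, 48, 8, 16.0   # illustrative sizes, not service defaults

W = rng.normal(size=(d_out, d_in))            # frozen base weight
A = rng.normal(size=(rank, d_in)) * 0.01      # trainable down-projection
B = np.zeros((d_out, rank))                   # trainable up-projection, zero-init


def lora_forward(x, W, A, B, alpha, rank):
    # Base path plus scaled low-rank path: W x + (alpha / r) * B (A x).
    return W @ x + (alpha / rank) * (B @ (A @ x))


x = rng.normal(size=d_in)

# With B = 0 the adapter is a no-op, so training starts from the base model.
assert np.allclose(lora_forward(x, W, A, B, alpha, rank), W @ x)

# After an update, the adapter equals a dense weight delta of rank <= r.
B = rng.normal(size=(d_out, rank)) * 0.01
delta = (alpha / rank) * (B @ A)
assert np.allclose(lora_forward(x, W, A, B, alpha, rank), (W + delta) @ x)
assert np.linalg.matrix_rank(delta) <= rank

full_params = d_out * d_in
lora_params = rank * (d_in + d_out)
print(f"trainable: {lora_params} of {full_params} ({lora_params / full_params:.1%})")
```

Because only A and B are trained per job, many adapters can share one copy of the base weights, which is the property the multi-tenant pool relies on.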

tinker-cookbook (Apache-2.0 on GitHub, PyPI tinker-cookbook) is the open library on top: renderers (chat rendering and loss masking), Env/EnvGroupBuilder/RLDataset RL abstractions, hyperparameter formulas (LoRA hyperparameter scaling), an eval framework with 12 verified benchmarks, and full recipes: chat SFT, math/code RL, DPO and a three-stage RLHF pipeline, on-policy and off-policy distillation, SDFT, tool-use RL (Search-R1 replication), multi-agent self-play, rubric-graded RL, and VLM classification.

Why use it

  • No cluster to operate. The GPU fleet, distributed training, fault tolerance, and weight hosting are the service's problem; your side is a Python loop and data. This removes the entire training-platform layer for post-training work.
  • Primitives, not pipelines. Unlike one-trainer-per-method frameworks, the four primitives make custom algorithms (new RL losses, distillation variants, preference schemes) a loop you write, with the cookbook as worked examples rather than a framework to fight.
  • Small-batch economics. Multi-tenant clock cycles amortize the pool across users, so sample-efficient small-batch training is not wasteful, and MoE models price by active parameters (Qwen3.5-35B-A3B trains cheaper than dense 27B).
  • Portable results. Checkpoints download as archives and export to HuggingFace format or PEFT adapters for vLLM/SGLang serving; you are not locked into the service for inference.

When to use it (and when not)

  • Use Tinker for research-grade post-training (SFT, RL with verifiable rewards, DPO/RLHF, distillation) when you have no GPU cluster, when iteration speed on algorithm code matters more than infrastructure control, or when LoRA capacity suffices, which covers all RL and small-to-medium SFT (LoRA hyperparameter scaling).
  • Prefer self-hosted stacks when you need full fine-tuning or custom architectures, when data governance forbids shipping training data to an external service, when you already own the GPUs (verl, slime, TRL), or when rollouts must hit in-house systems the service cannot reach at scale.
  • Latency-sensitive tight loops are a mismatch: every step rides a shared clock cycle, so a small job sees big-batch step latency even though it is billed only for its own compute.

Architecture

flowchart LR
  subgraph Client["Your process (CPU)"]
    LOOP["Cookbook loop: SFT / RL / DPO / distill"]
    ENVS["Envs, rewards, renderers"]
  end
  subgraph Service["Tinker service (GPU pools)"]
    POOL["Multi-tenant LoRA worker pool (clock cycles)"]
    SAMP["Sampling workers"]
    CKPT["Checkpoint store (state / sampler, TTL)"]
  end
  LOOP -->|"forward_backward + optim_step"| POOL
  LOOP -->|"save_state / save_weights_for_sampler"| CKPT
  ENVS <-->|"sample / compute_logprobs"| SAMP
  CKPT -->|"download, export to HF"| SERVE["vLLM / SGLang serving"]

How to use it

SFT is cross_entropy over rendered (tokens, weights) datums (chat rendering and loss masking); the cookbook's sl_loop.py is the minimal reference. RL is rollouts through a sampling client, group mean-centered advantages, and one of the RL losses. Reference template (needs tinker, tinker-cookbook):

from tinker_cookbook.recipes import rl_loop   # minimal GRPO-style reference
# python -m tinker_cookbook.recipes.math_rl.train \
#   model_name=Qwen/Qwen3.5-9B group_size=16 groups_per_batch=64 \
#   learning_rate=2e-5 max_tokens=512     # MATH: test correct ~0.84 @ 180 steps
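The SFT half of the paragraph above (cross-entropy over tokens with per-token weights) can be sketched without the SDK. This is a hedged illustration: the function names are hypothetical, and the sum reduction is an assumption, so check the service's loss documentation for the exact normalization.

```python
import numpy as np


def log_softmax(logits):
    # Numerically stable log-softmax over the vocabulary axis.
    z = logits - logits.max(axis=-1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=-1, keepdims=True))


def weighted_cross_entropy(logits, targets, weights):
    # Sum of -weight_t * log p(target_t); weight 0 masks a token out.
    logp = log_softmax(logits)
    token_logp = logp[np.arange(len(targets)), targets]
    return -(weights * token_logp).sum()


rng = np.random.default_rng(0)
logits = rng.normal(size=(5, 11))               # 5 positions, toy vocab of 11
targets = np.array([3, 7, 1, 4, 9])
weights = np.array([0.0, 0.0, 1.0, 1.0, 1.0])   # first two tokens are prompt

loss = weighted_cross_entropy(logits, targets, weights)

# Changing logits at masked positions leaves the loss unchanged.
perturbed = logits.copy()
perturbed[:2] += rng.normal(size=(2, 11))
assert np.isclose(loss, weighted_cross_entropy(perturbed, targets, weights))

# Uniform logits cost log(vocab) per weighted token.
uniform = np.zeros((5, 11))
assert np.isclose(weighted_cross_entropy(uniform, targets, weights), 3 * np.log(11))
```

The weights vector is where the renderer's loss mask lands: prompt tokens get weight 0 and assistant tokens get weight 1, so a wrong mask changes what the model is trained to imitate without raising any error.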

The RL loss family is the load-bearing choice. Tinker's default is per-token importance sampling; PPO clips the ratio; CISPO applies the clipped ratio as a detached coefficient on the log-prob so a token's gradient is capped but never zeroed (MiniMax-M1). The executed block validates the gradient behaviour of all three, the mean-centering (std division deliberately omitted per Dr. GRPO), and the KL-penalty shaping used against a reference model:

import numpy as np


def grad_importance_sampling(ratio, adv):
    # surrogate L = -ratio * A; dL/dlogp = -ratio * A (ratio = exp(logp - logq))
    return -ratio * adv


def grad_ppo(ratio, adv, lo=0.8, hi=1.2):
    # L = -min(ratio*A, clip(ratio, lo, hi)*A); gradient is zero when clipped.
    unclipped = ratio * adv
    clipped = np.clip(ratio, lo, hi) * adv
    return -ratio * adv if unclipped <= clipped else 0.0


def grad_cispo(ratio, adv, lo=0.0, hi=4.0):
    # L = -stopgrad(clip(ratio, lo, hi)) * logp * A; the clipped ratio is a
    # detached coefficient, so the gradient is never zeroed, only capped.
    return -np.clip(ratio, lo, hi) * adv


def centered_advantages(rewards):
    # Group mean-centering only; no std division (Dr. GRPO rationale).
    r = np.asarray(rewards, dtype=float)
    return r - r.mean()


# On-policy (ratio 1) all three objectives give the identical REINFORCE update.
for g in (grad_importance_sampling, grad_ppo, grad_cispo):
    assert abs(g(1.0, 2.0) - (-2.0)) < 1e-12

# Off-policy, ratio 2, positive advantage: PPO zeroes the token's gradient
# (clipped), IS follows the raw ratio unbounded, CISPO caps but keeps learning.
assert grad_ppo(2.0, 1.0) == 0.0
assert grad_importance_sampling(2.0, 1.0) == -2.0
assert grad_cispo(2.0, 1.0) == -2.0

# Runaway ratio 8: IS explodes 8x, CISPO saturates at its cap of 4.
assert grad_importance_sampling(8.0, 1.0) == -8.0
assert grad_cispo(8.0, 1.0) == -4.0

# Mean-centered advantages: shift-invariant (reward r -> r + b cancels), but
# NOT scale-invariant; std-normalization would erase difficulty information.
adv = centered_advantages([0.0, 0.0, 1.0, 1.0])
assert np.allclose(adv, centered_advantages([5.0, 5.0, 6.0, 6.0]))
assert not np.allclose(centered_advantages([0.0, 2.0]), centered_advantages([0.0, 4.0]))

# Degenerate group (all rewards equal) has zero advantage everywhere: no
# gradient, which is why constant-reward groups are dropped before training.
assert np.allclose(centered_advantages([1.0, 1.0, 1.0]), 0.0)

# KL penalty as used in the RL loop: adjusted = A + coef * (avg_kl - kl_t).
# The adjustment is zero-sum across tokens and hits the most divergent token.
kl_t = np.array([0.001, 0.002, 0.050, 0.003])  # logp_policy - logp_reference
adjust = 0.05 * (kl_t.mean() - kl_t)
assert abs(adjust.sum()) < 1e-15 and adjust.argmin() == 2

print("RL loss family + advantages: OK")

Working defaults from the cookbook's own guidance: start with cross_entropy for SFT and importance_sampling for RL, and switch to ppo or cispo if training becomes unstable. Use an RL learning rate of 1e-5 to 4e-5, group_size 4-16, kl_penalty_coef 0.05 as a moderate starting point, and temperature 1.0 (temperatures other than 1.0 interact badly with the KL penalty). Taking more than one optimizer substep per batch requires the PPO objective, and a sampler-vs-trainer KL under ~0.01 indicates stable training.
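The sampler-vs-trainer KL check can be estimated from per-token logprobs alone. The sketch below uses two standard Monte Carlo estimators (often called k1 and k2); the function name and the STABLE_KL constant are illustrative, and the cookbook's own metric definitions may differ.

```python
import numpy as np

STABLE_KL = 0.01  # the ~0.01 threshold quoted in the guidance above


def kl_estimates(sampler_logprobs, trainer_logprobs):
    # Tokens were drawn from the sampler, so the mean of (logq - logp)
    # over those tokens estimates KL(sampler || trainer).
    diff = np.asarray(sampler_logprobs) - np.asarray(trainer_logprobs)
    k1 = diff.mean()                 # unbiased, can go negative on small samples
    k2 = 0.5 * (diff ** 2).mean()    # biased, never negative, lower variance
    return k1, k2


lp = np.log([0.5, 0.25, 0.125])

# Identical logprobs: sampler and trainer agree exactly.
assert kl_estimates(lp, lp) == (0.0, 0.0)

# Small numerical drift stays under the threshold.
_, k2_small = kl_estimates(lp, lp + np.array([0.01, -0.02, 0.01]))
assert k2_small < STABLE_KL

# Large per-token disagreement trips it, even though k1 is negative here.
k1_big, k2_big = kl_estimates(lp, lp + np.array([0.3, -0.4, 0.5]))
assert k1_big < 0 and k2_big > STABLE_KL
```

Logging a non-negative estimator alongside the signed one is a reasonable design choice, since a handful of tokens can make the signed estimate look healthy when it is not.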

How to develop with it

The cookbook's abstractions are the generalizable part; they port to any rollout-based stack:

  • Env (single-use episode: initial_observation(), step(action)) and EnvGroupBuilder (makes the group of envs whose rewards are centered together). Group-level reward computation is first-class, which is what makes pairwise preference tournaments and zero-sum self-play drop into the same GRPO machinery as verifiable rewards.
  • RLDataset batches env-group builders; builders must be picklable so rollout work can fan out to processes or Ray.
  • Completers split the policy interface: TokenCompleter (tokens + logprobs, what RL needs) vs MessageCompleter (parsed messages, what evals and tool loops need).
  • ProblemEnv for single-turn verifiable tasks scores format_coef * (format - 1) + correct, gating format reward on a clean stop-sequence termination so length-limit truncations cannot collect it (reward design, RLVR).
  • Preference and distillation reuse the RL plumbing: DPO runs on forward_backward_custom with the frozen initial weights as the implicit reference (DPO); on-policy distillation is the RL loop with per-token reverse KL against a teacher's compute_logprobs as the only reward (on-policy distillation); off-policy distillation is standard supervised fine-tuning (hard-target cross-entropy) on the teacher's sampled reasoning traces.
  • Evals: run_benchmarks(["gsm8k", "mmlu_pro", "ifeval"], ...) with 12 benchmarks verified against published scores, resumable by content hash, plus inline evaluators on the training cadence (LLM evaluation harness, LLM benchmarks). The cookbook's own warning generalizes: max_tokens, temperature, system prompt, and timeouts move scores 10-30% (GSM8K goes from 84.7% to 95.6% with one \boxed{} instruction), so document the exact eval config next to every number.
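Three formulas from the list above are small enough to check by hand: the ProblemEnv reward, the DPO loss on sequence logprobs, and the per-token reverse-KL reward used for on-policy distillation. This is a stdlib sketch under stated assumptions: the function names, format_coef=0.1, and beta=0.1 are illustrative values, not the cookbook's defaults.

```python
import math


def problem_env_reward(correct, well_formatted, format_coef=0.1):
    # format_coef * (format - 1) + correct: bad formatting is a penalty,
    # good formatting is never a bonus on its own.
    return format_coef * (float(well_formatted) - 1.0) + float(correct)


def dpo_loss(pol_chosen, pol_rejected, ref_chosen, ref_rejected, beta=0.1):
    # -log sigmoid(beta * [(pol_c - ref_c) - (pol_r - ref_r)]).
    margin = beta * ((pol_chosen - ref_chosen) - (pol_rejected - ref_rejected))
    return math.log1p(math.exp(-margin))


def reverse_kl_rewards(student_logprobs, teacher_logprobs):
    # Reward per sampled token: -(log p_student - log p_teacher).
    return [t - s for s, t in zip(student_logprobs, teacher_logprobs)]


# Reward: a correct, well-formatted answer scores 1; a truncated one loses format_coef.
assert math.isclose(problem_env_reward(True, True), 1.0)
assert math.isclose(problem_env_reward(True, False), 0.9)
assert math.isclose(problem_env_reward(False, False), -0.1)

# DPO: with policy == reference (step 0) the loss is log 2 for every pair.
assert math.isclose(dpo_loss(-12.0, -15.0, -12.0, -15.0), math.log(2))
# Raising the chosen logprob relative to the reference lowers the loss.
assert dpo_loss(-10.0, -15.0, -12.0, -15.0) < math.log(2)

# Distillation: tokens the student likes more than the teacher are penalized.
rewards = reverse_kl_rewards([-0.1, -2.0], [-1.0, -0.5])
assert rewards[0] < 0 < rewards[1]
```

The log 2 starting value is a useful smoke test for a DPO run: if the first logged loss is far from about 0.693, the policy and reference logprobs are probably not being computed on the same rendered tokens.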

How to maintain it

  • Pin the client stack and track known-bad versions. The cookbook's pins exclude pydantic >= 2.13.0b1 (serialization stalls) and transformers == 5.3.0 (DeepSeek tokenizer bug); keep the tinker SDK at the latest stable release and re-run a smoke test (create a small rank-32 client) after upgrades.
  • Checkpoints have TTLs. Default persistence is 7 days for periodic checkpoints, with cheap rolling resume-only checkpoints (state, no sampler export) in between and final checkpoints kept indefinitely; download anything you cannot regenerate. Resume needs save_state (weights + optimizer); sampling and export need sampler checkpoints.
  • Recreate sampling clients after weight saves. A stale sampling client silently samples old weights; the cookbook's "sampler desync" is the canonical silent bug.
  • Record the renderer with the run. Checkpoint metadata carries the renderer name so later sampling and serving reuse the exact template (chat rendering and loss masking).
  • Model deprecations are real: the hosted lineup rotates; the docs publish a deprecation schedule, so export weights for anything long-lived.
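The stale-sampler bullet above describes a bug that raises no error, so it is worth guarding against in code. The sketch below uses fake stand-in classes (FakeTrainer, FakeSampler) and a hypothetical FreshSampler wrapper; none of these names exist in the SDK, and a real guard would key on whatever checkpoint identifier your loop already tracks.

```python
class FakeSampler:
    """Stand-in for a sampling client: a snapshot of one weight version."""

    def __init__(self, version):
        self.version = version


class FakeTrainer:
    """Stand-in for a training client; only tracks a weight version."""

    def __init__(self):
        self.version = 0

    def optim_step(self):
        self.version += 1

    def make_sampler(self):
        # The sampler keeps the version it was created from.
        return FakeSampler(self.version)


class FreshSampler:
    """Hypothetical guard: recreate the sampler whenever weights have moved."""

    def __init__(self, trainer):
        self.trainer = trainer
        self.sampler = trainer.make_sampler()

    def get(self):
        if self.sampler.version != self.trainer.version:
            self.sampler = self.trainer.make_sampler()
        return self.sampler


trainer = FakeTrainer()
stale = trainer.make_sampler()
guard = FreshSampler(trainer)

trainer.optim_step()
assert stale.version == 0            # silently one step behind
assert guard.get().version == 1      # guard picked up the new weights
```

Routing every rollout through a single accessor like get() makes the desync impossible to reintroduce by accident, at the cost of one version comparison per call.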

How to run it in production

The performance model is the clock cycle: the worker pool steps in lock-step, requests queued before a cycle run in it, and results land just after the next cycle starts. Throughput is therefore a scheduling property of your client code, validated here (executed, asserts passing):

import math

EPS = 0.1  # network latency: results land just after the next cycle starts


def execute(submit_time):
    """A request submitted at time t runs in the first cycle starting at or
    after t; its result reaches the client just after that cycle ends."""
    cycle = math.ceil(submit_time)
    return cycle, cycle + 1 + EPS


# One naive step (submit forward_backward, await, then submit optim_step)
# burns 3 clock cycles: fb runs on cycle 0, optim on cycle 2, done at 3.1.
c_fb, t = execute(0.0)
c_opt, t_done = execute(t)
assert (c_fb, c_opt) == (0, 2) and math.floor(t_done) == 3

# Submitting both back-to-back before awaiting lands them on the same cycle.
c_fb, _ = execute(0.0)
c_opt, t_done = execute(0.0)
assert c_fb == c_opt == 0 and math.floor(t_done) == 1

# Await-then-submit still wastes every other cycle across a run: each result
# arrives just after a cycle boundary, so steps execute on cycles 0,2,4,...
t, cycles = 0.0, []
for _ in range(8):
    c, t = execute(t)
    cycles.append(c)
assert cycles == [0, 2, 4, 6, 8, 10, 12, 14]

# Pipelined (submit batch k+1 before awaiting batch k): every cycle does work.
cycles = [execute(float(k))[0] for k in range(8)]
assert cycles == [0, 1, 2, 3, 4, 5, 6, 7]

print("clock-cycle pipelining: OK")

Operational consequences:

  • Always submit forward_backward and optim_step together before awaiting either (the cookbook loops do this via submit_ahead); sequential awaiting is the number-one performance mistake and costs 2-3x wall clock.
  • Overlap rollouts and training for RL with stream-minibatch mode (train as soon as a minibatch of trajectories lands) or async mode with a bounded max_steps_off_policy, which requeues stale samples and holds batch size constant so the LR needs no adjustment (async RL systems).
  • Beware the GIL. The SDK's background thread shares the interpreter with your loop; CPU-heavy reward functions block heartbeats and look like service hangs. Move heavy work out of the loop or into subprocess sampling mode.
  • Export for serving. Download sampler checkpoints and either merge to HF format (merge in float32; MoE fused-expert layouts differ per family, and a wrong gate/up convention silently corrupts weights) or serve the unmerged adapter via vLLM --lora-modules (multi-LoRA serving, serving open-weight models, engine weight loading).
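The advice to merge in float32 can be checked numerically. The sketch below merges a toy adapter in float32 and in half precision and compares both against a float64 reference. Assumptions: float16 stands in for bf16 because numpy has no bf16 type (bf16 keeps even fewer mantissa bits), the sizes and scale are illustrative, and MoE layout conventions are not modeled at all.

```python
import numpy as np

rng = np.random.default_rng(0)
d, rank, scale = 256, 8, 2.0          # illustrative sizes; scale plays alpha / r

W = rng.normal(size=(d, d)).astype(np.float32)
A = (rng.normal(size=(rank, d)) * 1e-2).astype(np.float32)
B = (rng.normal(size=(d, rank)) * 1e-2).astype(np.float32)

# Reference merge in float64: W + scale * B A.
exact = W.astype(np.float64) + scale * (B.astype(np.float64) @ A.astype(np.float64))

# Merge in float32.
merged_fp32 = W + np.float32(scale) * (B @ A)

# Merge with every operand and every intermediate in half precision.
W16, A16, B16 = (m.astype(np.float16) for m in (W, A, B))
merged_fp16 = W16 + np.float16(scale) * (B16 @ A16)

err32 = np.abs(merged_fp32 - exact).max()
err16 = np.abs(merged_fp16.astype(np.float64) - exact).max()

# The adapter delta is small relative to W, so low-precision rounding of the
# sum swamps it; float32 keeps the error orders of magnitude lower.
assert err32 < err16
print(f"max abs error  float32: {err32:.2e}   half: {err16:.2e}")
```

A check of this shape also works as an export test: compare the merged model's logprobs against the service's own samples on a few prompts before serving.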

Failure modes

  • Sequential API calls: each step burns 2-3 clock cycles instead of 1; throughput collapses with no error anywhere.
  • Stale sampling client after a weight save: rollouts silently come from old weights; RL appears to stop learning.
  • Wrong renderer for the model (thinking vs non-thinking variants especially): loss goes down while outputs degrade; high sampler/trainer KL at step 0 is the tell (chat rendering and loss masking).
  • Constant-reward groups left in the batch: zero advantage, zero gradient, wasted rollouts. Drop them, and watch the frac_reward_zero_std metric that GRPO stacks report (GRPO variants).
  • Checkpoint TTL expiry: a 7-day default quietly deletes the run's artifacts; export or extend TTLs for anything that matters.
  • Merge-convention bugs on export: MoE gate/up fusion order and bf16 merge arithmetic produce a model that loads fine and scores garbage; verify exported weights against the service's own samples before serving.
  • Service dependency: rate limits, model deprecations, and clock-cycle latency are outside your control; keep the loop code portable (the cookbook abstractions run against any backend that implements the four primitives).
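The constant-reward failure mode above is cheap to guard against before a batch reaches forward_backward. This is a minimal stdlib sketch; drop_constant_groups is a hypothetical helper, and the returned fraction only mirrors the idea behind frac_reward_zero_std rather than any library's exact definition.

```python
def drop_constant_groups(reward_groups):
    # Keep groups whose rewards vary; report the fraction that carried no signal.
    kept = [g for g in reward_groups if max(g) != min(g)]
    frac_zero_std = 1.0 - len(kept) / len(reward_groups)
    return kept, frac_zero_std


groups = [
    [1.0, 1.0, 1.0, 1.0],   # every rollout solved it: no signal
    [0.0, 0.0, 0.0, 0.0],   # no rollout solved it: no signal
    [0.0, 1.0, 0.0, 1.0],   # mixed: usable
    [0.0, 0.0, 0.0, 1.0],   # mixed: usable
]
kept, frac = drop_constant_groups(groups)
assert len(kept) == 2 and frac == 0.5

# Every surviving group has at least one nonzero mean-centered advantage.
for g in kept:
    mean = sum(g) / len(g)
    assert any(r - mean != 0.0 for r in g)
```

A rising fraction over a run usually means the task mix has become too easy or too hard for the current policy, which is a data-curriculum signal rather than an optimizer problem.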

References

  • Tinker: https://thinkingmachines.ai/tinker/ · announcement: https://thinkingmachines.ai/blog/announcing-tinker/
  • Tinker docs: https://tinker-docs.thinkingmachines.ai · under the hood (clock cycles): https://tinker-docs.thinkingmachines.ai/tinker/under-the-hood/ · losses: https://tinker-docs.thinkingmachines.ai/tinker/losses/
  • tinker-cookbook: https://github.com/thinking-machines-lab/tinker-cookbook · SDK: https://github.com/thinking-machines-lab/tinker · PyPI: https://pypi.org/project/tinker-cookbook/
  • Dr. GRPO (Liu et al., 2025): https://arxiv.org/abs/2503.20783 · CISPO / MiniMax-M1: https://arxiv.org/abs/2506.13585
  • On-policy distillation (Thinking Machines): https://thinkingmachines.ai/blog/on-policy-distillation/

Related: RL Libraries · Chat Rendering & Loss Masking · LoRA Hyperparameter Scaling · Fine-tuning · GRPO · DPO · RLVR · On-Policy Distillation · Async RL · LLM Evaluation Harness · Glossary