Chat rendering and token loss masking
Scope: the renderer layer that sits between structured chat messages and token sequences in every post-training stack: how conversations become supervised examples with per-token loss weights, how generation prompts and sampled tokens round-trip back into messages, which masking policy to train on, and the failure modes (template mismatch, missing stop tokens, BOS duplication, thinking-block stripping) that silently corrupt fine-tuning. Patterns generalized from the tinker-cookbook renderers subsystem and HuggingFace chat templates.
The numpy/stdlib blocks below were executed with all asserts passing; they validate the masking and round-trip mechanics on a toy renderer. Blocks that call `tinker_cookbook` or `transformers` are reference templates on real APIs (not runnable in this doc's CI): pin versions and verify before production use.
What it is
A renderer (tinker-cookbook's term; HuggingFace calls the template half a chat template) is the bidirectional converter between the structured view of a conversation (a list of {role, content} messages, optionally with tool calls and thinking blocks) and the flat token view the model actually consumes. It owns three obligations:
- `build_generation_prompt(messages)`: render history plus the next-role header into tokens for sampling, with the model-specific special tokens (`<|im_start|>`, `<|start_header_id|>`, Harmony channels) and stop sequences.
- `build_supervised_example(messages)`: render the same conversation into `(tokens, weights)`, where the per-token weight decides which positions contribute to the loss. Each message splits into a header (role delimiters the model sees but never generates, weight 0) and an output (content plus the end-of-turn token, weight 1 when trainable).
- `parse_response(tokens)`: convert sampled tokens back into a message plus a termination status (`stop_sequence`, `eos`, or `malformed`), extracting tool calls and thinking blocks.
The invariant tying them together: the weight-0 prefix of a supervised example must be token-identical to the generation prompt for the same context, so the model trains on exactly the distribution it sees at inference.
Why use it
- Template mismatch is a silent killer. Training with one token layout and sampling with another does not crash; it degrades quality and, in RL, inflates the sampler-vs-trainer KL from step 0, corrupting importance ratios in PPO/GRPO-family losses. tinker-cookbook's guidance: encoding a chat prompt with raw `tokenizer.encode()` instead of the template can inflate per-token logprob KL by 5x or more.
- Loss placement decides what is learned. Masking prompt tokens stops the model from spending gradient on memorizing questions; leaving the end-of-turn token trainable is what teaches the model to stop. Both come from the weight vector, not the optimizer.
- Round-trip parsing is the API contract for RL and agents. A reward function or tool loop needs sampled tokens parsed back into messages reliably, including the truncated and malformed cases.
When to use it (and when not)
- Always for SFT, DPO data preparation, RL rollouts, evals, and serving of chat-tuned models: one renderer implementation shared by all of them.
- Choose the masking policy by task: train on the last assistant message for single-target SFT; on all assistant messages for multi-turn distillation (only when the extension property below holds); on all tokens for continued pretraining on conversation-shaped corpora.
- Not needed for base-model continued pretraining on raw text: `tokenizer.encode()` with uniform weights is correct there, and a `role_colon`-style plain-text renderer covers base models used in chat-shaped experiments.
Architecture
```mermaid
flowchart LR
    MSG["Messages: system / user / assistant (+tools, +thinking)"]
    subgraph REN["Renderer (one per model family)"]
        SUP["build_supervised_example: header w=0, output w=1"]
        GEN["build_generation_prompt: history + next-role header"]
        PAR["parse_response: message + termination status"]
    end
    MSG --> SUP --> TRAIN["Trainer: loss on w=1 tokens only"]
    MSG --> GEN --> ENGINE["Inference engine (stop sequences)"]
    ENGINE -->|"sampled tokens"| PAR --> MSG2["Message: content, tool_calls, thinking"]
    SUP -.->|"w=0 prefix == generation prompt"| GEN
```
How to use it
The masking policies (tinker-cookbook's TrainOnWhat enum; other stacks expose subsets, e.g. TRL's assistant-only loss):
| Policy | Loss lands on | Use for |
|---|---|---|
| `last_assistant_message` | final assistant message only | single-target SFT, preference data |
| `last_assistant_turn` | assistant messages after the last user turn (incl. tool calls) | agentic SFT on the final turn |
| `all_assistant_messages` | every assistant message | multi-turn SFT/distillation (needs extension property) |
| `all_messages` | all message outputs, headers still masked | conversation-shaped pretraining |
| `all_tokens` | everything, headers included | raw continued pretraining |
| `customized` | per-message trainable flag | mixed-quality transcripts |
A supervised example is the concatenation BOS + header_0 + output_0 + ... + header_n + output_n; headers get weight 0 (except under all_tokens), BOS gets weight 0, and the end-of-turn token belongs to the trainable output. The executed block below validates the full mechanics, including the adversarial cases:
```python
import numpy as np

EOT = 0


def encode(s):
    return [ord(c) for c in s]


def decode(toks):
    return "".join(chr(t) for t in toks)


def render_message(msg):
    # header: role delimiters the model sees but never generates.
    # output: content plus end-of-turn token, the part the model must produce.
    header = encode(f"<{msg['role']}>")
    output = encode(msg["content"]) + [EOT]
    return header, output


def build_supervised_example(messages, train_on_what="last_assistant_message"):
    tokens, weights = [], []
    for i, msg in enumerate(messages):
        header, output = render_message(msg)
        is_last = i == len(messages) - 1
        is_assistant = msg["role"] == "assistant"
        if train_on_what == "last_assistant_message":
            trainable = is_last and is_assistant
        elif train_on_what == "all_assistant_messages":
            trainable = is_assistant
        elif train_on_what == "all_tokens":
            trainable = True
        else:
            raise ValueError(train_on_what)
        header_w = 1 if train_on_what == "all_tokens" else 0
        tokens += header + output
        weights += [header_w] * len(header) + [int(trainable)] * len(output)
    return np.array(tokens), np.array(weights)


def build_generation_prompt(messages, role="assistant"):
    tokens = []
    for msg in messages:
        header, output = render_message(msg)
        tokens += header + output
    return tokens + encode(f"<{role}>")


def parse_response(tokens):
    n_stop = tokens.count(EOT)
    if n_stop == 0:
        return decode(tokens), "malformed"  # truncated: never emitted stop
    if n_stop > 1:
        raise ValueError("wrong stop tokens configured for sampling")
    return decode(tokens[: tokens.index(EOT)]), "stop_sequence"


convo = [
    {"role": "system", "content": "be terse"},
    {"role": "user", "content": "2+2?"},
    {"role": "assistant", "content": "4"},
]
tokens, weights = build_supervised_example(convo)

# 1. Only the assistant reply carries loss; prompt content and headers are masked.
_, asst_output = render_message(convo[-1])
n_target = len(asst_output)
assert weights.sum() == n_target
assert (weights[-n_target:] == 1).all() and (weights[:-n_target] == 0).all()

# 2. The end-of-turn token is trainable: the model must learn to stop.
assert tokens[-1] == EOT and weights[-1] == 1

# 3. Weights form 0...01...1 and the masked prefix equals the generation prompt
#    for the same context: training and inference see identical token streams.
split = int(np.argmax(weights))
assert (weights[:split] == 0).all() and (weights[split:] == 1).all()
assert tokens[:split].tolist() == build_generation_prompt(convo[:-1])

# 4. Round trip: parsing the trainable span recovers the original message.
content, termination = parse_response(tokens[split:].tolist())
assert content == "4" and termination == "stop_sequence"

# 5. A truncated sample (no stop token) must parse as malformed, not crash.
assert parse_response(encode("4"))[1] == "malformed"

# 6. Two stop tokens means the sampler ran with the wrong stop list: hard error.
try:
    parse_response([52, EOT, 52, EOT])
    raise AssertionError("should have raised")
except ValueError:
    pass

# 7. Extension property: prompt(turn k) + completion(turn k) is a prefix of
#    prompt(turn k+1), enabling KV-cache reuse and single-datum multi-turn SFT.
convo2 = convo + [{"role": "user", "content": "3+3?"}, {"role": "assistant", "content": "6"}]
p1 = build_generation_prompt(convo2[:2])
full1 = p1 + encode("4") + [EOT]
p2 = build_generation_prompt(convo2[:4])
assert p2[: len(full1)] == full1

# 8. Adversarial: stripping reasoning blocks from history (as thinking-model
#    templates do) breaks the prefix property, so KV reuse must be disabled.
think_convo = [convo2[0], convo2[1], {"role": "assistant", "content": "<think>2+2=4</think>4"}, convo2[3]]
p1t = build_generation_prompt(think_convo[:2])
full1t = p1t + encode("<think>2+2=4</think>4") + [EOT]
stripped = dict(think_convo[2], content="4")
p2t = build_generation_prompt([think_convo[0], think_convo[1], stripped, think_convo[3]])
assert p2t[: len(full1t)] != full1t

print("renderer mechanics: OK")
```
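The toy renderer implements three of the six policies in the table. As a rough sketch of how the remaining ones could map to per-message trainable flags, the helper below reads the table's "Loss lands on" column literally. The function name `trainable_flags` and the per-message `trainable` key used for `customized` are assumptions made for illustration, not tinker-cookbook's API; header weights (which differ between `all_messages` and `all_tokens`) are outside its scope.

```python
def trainable_flags(messages, train_on_what):
    """Return one bool per message: does that message's output carry loss?"""
    roles = [m["role"] for m in messages]
    is_asst = [r == "assistant" for r in roles]
    n = len(messages)
    if train_on_what == "last_assistant_message":
        return [is_asst[i] and i == n - 1 for i in range(n)]
    if train_on_what == "last_assistant_turn":
        # Every assistant message produced after the final user message.
        last_user = max((i for i, r in enumerate(roles) if r == "user"), default=-1)
        return [is_asst[i] and i > last_user for i in range(n)]
    if train_on_what == "all_assistant_messages":
        return is_asst
    if train_on_what in ("all_messages", "all_tokens"):
        # Both train every output; they differ only in the header weight.
        return [True] * n
    if train_on_what == "customized":
        # Assumed convention: an optional boolean "trainable" key per message.
        return [bool(m.get("trainable", False)) for m in messages]
    raise ValueError(f"unknown policy: {train_on_what}")


agent_convo = [
    {"role": "user", "content": "hi"},
    {"role": "assistant", "content": "hello", "trainable": False},
    {"role": "user", "content": "weather in Oslo?"},
    {"role": "assistant", "content": "calling the weather tool", "trainable": False},
    {"role": "tool", "content": "3C"},
    {"role": "assistant", "content": "It is 3C.", "trainable": True},
]
for policy in ("last_assistant_message", "last_assistant_turn",
               "all_assistant_messages", "all_messages", "customized"):
    print(policy, trainable_flags(agent_convo, policy))
```

On this transcript `last_assistant_turn` selects the tool-calling message and the final answer but not the greeting, which is the difference from `all_assistant_messages`.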
What the mask does to the loss, and the guard for a broken policy (executed, asserts passing):
```python
import numpy as np


def masked_mean_nll(token_logprobs, weights):
    # Token-mean NLL over weighted positions; nan when nothing is trainable
    # (mirrors tinker-cookbook's compute_mean_nll contract).
    w = np.asarray(weights, dtype=float)
    lp = np.asarray(token_logprobs, dtype=float)
    if w.sum() == 0:
        return float("nan")
    return float(-(lp * w).sum() / w.sum())


# 8-token example: 6 prompt tokens the model finds surprising (a rare question
# it should not memorize), 2 answer tokens it predicts well.
logprobs = [-4.0, -3.5, -5.0, -4.2, -3.8, -4.5, -0.1, -0.05]
weights = [0, 0, 0, 0, 0, 0, 1, 1]
masked = masked_mean_nll(logprobs, weights)
unmasked = masked_mean_nll(logprobs, [1] * 8)

# Masked loss reflects answer quality only; unmasked loss is dominated by
# prompt tokens and would spend gradient on memorizing the question.
assert abs(masked - 0.075) < 1e-12
assert unmasked > 3.0 and unmasked > 40 * masked

# num_loss_tokens (what training logs report) is the weight sum, not len().
assert sum(weights) == 2

# Adversarial: an all-masked datum yields nan, not a silent 0.0 that would
# hide a broken train_on_what setting.
assert np.isnan(masked_mean_nll(logprobs, [0] * 8))

print("masked NLL: OK")
```
The termination status has a consumer policy: eval grading accepts is_clean (stop sequence or EOS), while strict RL format rewards gate on stop_sequence only, so a model cannot collect format reward by running into the length limit.
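A minimal sketch of that consumer policy, assuming the three status strings named earlier (`stop_sequence`, `eos`, `malformed`). The helper names `is_clean` and `gated_reward` are illustrative, not a library API.

```python
CLEAN_STATUSES = {"stop_sequence", "eos"}


def is_clean(termination):
    # Eval grading: the generation ended on its own, by stop sequence or EOS.
    return termination in CLEAN_STATUSES


def gated_reward(task_reward, termination, strict=True):
    # strict=True: only an explicit stop sequence earns reward, so running
    # into the length limit can never collect format reward.
    ok = termination == "stop_sequence" if strict else is_clean(termination)
    return task_reward if ok else 0.0


for status in ("stop_sequence", "eos", "malformed"):
    print(status, is_clean(status), gated_reward(1.0, status), gated_reward(1.0, status, strict=False))
```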
How to develop with it
Implementing a renderer for a new model family means encoding four decisions, then testing four properties.
The decisions:
- Special-token layout: e.g. Llama 3 (`<|start_header_id|>role<|end_header_id|>...<|eot_id|>`), ChatML/Qwen (`<|im_start|>role...<|im_end|>`), DeepSeek V3 (full-width `<|User|>`/`<|Assistant|>` markers), OpenAI Harmony for gpt-oss (`<|start|>role<|channel|>analysis|commentary|final<|message|>...<|end|>` with two stop tokens, `<|return|>` and `<|call|>`).
- Thinking-block semantics: reasoning models strip `<think>...</think>` from historical assistant turns but keep it in the final target; hybrid models ship paired renderers (`qwen3` vs `qwen3_disable_thinking`), and picking the wrong one silently corrupts training. Some templates prefill `<think>` into the generation prompt, so the parser must re-inject it before parsing (the prefill/normalize asymmetry).
- Tool-call syntax: JSON in `<tool_call>` tags (Qwen3), XML function blocks (Qwen3.5), delimiter-chained calls (DeepSeek), typed channels (Harmony). Llama 3's bare-JSON convention is unparseable in the general case, which is why tinker-cookbook refuses to support tool calls for it.
- Loss-mask boundaries: which tokens are header vs output, per the table above.
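To make the thinking and tool-call decisions concrete, here is a rough text-level sketch of a parser for a Qwen3-style layout (`<think>` blocks, JSON inside `<tool_call>` tags). It works on decoded text rather than tokens, and the function name `parse_assistant_text`, the `prefill` argument, and the `unparsed_tool_calls` field are assumptions made for illustration, not any library's API.

```python
import json
import re

THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)
TOOL_RE = re.compile(r"<tool_call>(.*?)</tool_call>", re.DOTALL)


def parse_assistant_text(sampled, prefill=""):
    # Re-inject whatever the generation prompt prefilled, so a sample that
    # starts mid-thought regains its opening <think> tag before parsing.
    text = prefill + sampled
    thinking = [t.strip() for t in THINK_RE.findall(text)]
    text = THINK_RE.sub("", text)
    tool_calls, unparsed = [], []
    for raw in TOOL_RE.findall(text):
        try:
            call = json.loads(raw)
            if not isinstance(call, dict) or "name" not in call:
                raise ValueError("tool call must be a JSON object with a name")
            tool_calls.append(call)
        except ValueError:  # json.JSONDecodeError is a ValueError subclass
            unparsed.append(raw.strip())  # keep the raw text, do not drop it
    return {
        "content": TOOL_RE.sub("", text).strip(),
        "thinking": thinking,
        "tool_calls": tool_calls,
        "unparsed_tool_calls": unparsed,
    }


# The template prefilled "<think>", so the sample begins inside the block.
with_prefill = parse_assistant_text("2+2=4</think>4", prefill="<think>")
without_prefill = parse_assistant_text("2+2=4</think>4")
calls = parse_assistant_text(
    'ok<tool_call>{"name": "add", "arguments": {"a": 2, "b": 2}}</tool_call>'
    '<tool_call>{"name": </tool_call>'
)
print(with_prefill["content"], with_prefill["thinking"])
print(repr(without_prefill["content"]))
print(calls["tool_calls"], calls["unparsed_tool_calls"])
```

Without the re-injected prefill the reasoning text leaks into `content`, which is the asymmetry the thinking-block bullet describes; the second tool call fails JSON parsing and is kept as raw text instead of disappearing.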
The test properties (all four are enforced in tinker-cookbook's renderer suite and are worth copying into any stack):
- `build_generation_prompt` matches `tokenizer.apply_chat_template(..., add_generation_prompt=True, tokenize=True)` token for token.
- The weight-0 prefix of a supervised example equals the generation prompt of the same context (weights are `0...01...1`).
- `parse_response(trainable_tokens)` returns the original final message.
- The extension property holds where claimed: each turn's prompt-plus-completion is a prefix of the next turn's prompt.
Reference template for the parity test (needs transformers and tinker-cookbook; pin both):
```python
from tinker_cookbook import renderers, model_info, tokenizer_utils

model = "Qwen/Qwen3-8B"
tokenizer = tokenizer_utils.get_tokenizer(model)
renderer = renderers.get_renderer(
    model_info.get_recommended_renderer_name(model), tokenizer)

messages = [{"role": "user", "content": "2+2?"}]
ours = renderer.build_generation_prompt(messages).to_ints()
hf = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, tokenize=True)
assert ours == hf  # any drift here is a training-vs-inference mismatch
```
HuggingFace-only stacks can get the supervised mask from templates that carry {% generation %} markers via apply_chat_template(..., return_assistant_tokens_mask=True); verify the installed template actually defines the markers before relying on it.
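One way to run that check without loading a model is to scan the template string itself. The sketch below assumes the template is available as a Jinja string (on a HuggingFace tokenizer this is normally the `chat_template` attribute); the helper name `has_generation_markers` is hypothetical, and the regexes only look for balanced `{% generation %}` / `{% endgeneration %}` tags, not for whether they wrap the right spans.

```python
import re

# Jinja allows whitespace-control dashes: {%- generation -%}
GENERATION_OPEN = re.compile(r"\{%-?\s*generation\s*-?%\}")
GENERATION_CLOSE = re.compile(r"\{%-?\s*endgeneration\s*-?%\}")


def has_generation_markers(chat_template):
    if not chat_template:
        return False
    opens = len(GENERATION_OPEN.findall(chat_template))
    closes = len(GENERATION_CLOSE.findall(chat_template))
    return opens > 0 and opens == closes


marked = (
    "{% for m in messages %}<|im_start|>{{ m.role }}\n"
    "{% if m.role == 'assistant' %}{% generation %}{{ m.content }}<|im_end|>"
    "{% endgeneration %}{% else %}{{ m.content }}<|im_end|>{% endif %}\n"
    "{% endfor %}"
)
unmarked = "{% for m in messages %}<|im_start|>{{ m.role }}\n{{ m.content }}<|im_end|>\n{% endfor %}"
print(has_generation_markers(marked), has_generation_markers(unmarked))
```

In a real pipeline the argument would be the installed tokenizer's template string; a `False` result suggests falling back to a renderer-computed mask.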
How to maintain it
- Pin and test the tokenizer stack. Template behaviour rides on `transformers` versions; tinker-cookbook pins around known-bad releases (a 5.3.0 DeepSeek tokenizer decode bug, pre-5.0 VLM image-token miscounts). Re-run the four-property suite on every `transformers` or model upgrade.
- Track template drift across model revisions. A new checkpoint of the same family can change thinking defaults or tool syntax (Qwen3 JSON tool calls became XML in Qwen3.5); the renderer name belongs in run metadata so a checkpoint can be re-served with the template it was trained with.
- Keep one renderer registry. A factory keyed by model name (`get_renderer(name, tokenizer)` plus a custom-renderer registry) prevents ad-hoc per-script templates; renderers should be picklable so rollout workers can reconstruct them from `(renderer_name, model_name)`.
- Watch the boundary conditions. Tokenizing text in chunks can produce different tokens than tokenizing the joined string even when both decode identically; merge adjacent text parts before encoding. Multi-byte UTF-8 characters split across tokens need a buffering decoder.
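Both boundary conditions can be reproduced with the standard library alone. The sketch below uses a hypothetical greedy longest-match tokenizer over a three-entry vocabulary to show the chunking effect, and `codecs.getincrementaldecoder` as one possible buffering decoder; it assumes each token can be mapped to its raw bytes, which depends on the tokenizer in use.

```python
import codecs


def toy_tokenize(text, vocab=("ab", "a", "b")):
    # Greedy longest-match over a tiny made-up vocabulary.
    ordered = sorted(vocab, key=len, reverse=True)
    out, i = [], 0
    while i < len(text):
        piece = next(v for v in ordered if text.startswith(v, i))
        out.append(piece)
        i += len(piece)
    return out


# Same decoded string, different token sequences.
chunked = toy_tokenize("a") + toy_tokenize("b")
joined = toy_tokenize("a" + "b")
print(chunked, joined)


def stream_decode(byte_chunks):
    # Buffers incomplete UTF-8 sequences until the remaining bytes arrive.
    decoder = codecs.getincrementaldecoder("utf-8")(errors="strict")
    for chunk in byte_chunks:
        text = decoder.decode(chunk)
        if text:
            yield text
    tail = decoder.decode(b"", final=True)  # raises if bytes are left dangling
    if tail:
        yield tail


data = "é🙂".encode("utf-8")
pieces = [data[:1], data[1:3], data[3:]]  # splits fall inside both characters
buffered = "".join(stream_decode(pieces))
naive = "".join(p.decode("utf-8", errors="replace") for p in pieces)
print(buffered == "é🙂", "\ufffd" in naive)
```

Decoding each piece independently yields replacement characters, while the buffering decoder recovers the original text.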
How to run it in production
- Same template at train and serve. OpenAI-compatible servers (vLLM, SGLang) apply the `chat_template` shipped in `tokenizer_config.json` server-side; a fine-tuned checkpoint must ship the template it was trained with, or clients silently reproduce the mismatch the training stack avoided (serving open-weight models).
- Exploit the extension property for agentic workloads. When each turn's prompt extends the previous one, a T-turn trajectory trains as one merged datum and samples with KV-cache reuse: O(T) prefill instead of O(T^2). Thinking-history stripping breaks this, which is a real throughput cost of reasoning templates in multi-turn RL (agentic RL, KV cache management).
- Gate rewards and evals on termination status. Truncated generations (`malformed`) should score 0 in strict-format RL and be reported separately in evals; benchmark scores move by tens of points on truncation rates alone (LLM benchmarks, evaluation integrity).
- Monitor step-0 KL in RL. High sampler-vs-trainer KL on the very first step is the classic renderer-mismatch signature: the prompt tokens the trainer scores are not the ones the sampler saw (GRPO, fine-tuning and post-training).
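A minimal sketch of the step-0 KL check, assuming per-token logprobs for the sampled completion tokens are available from both the sampler and the trainer, aligned position by position. The function name `sampler_trainer_kl` is illustrative; `k1` and `k3` are two common sample-based KL estimators, and any alert threshold is a project-specific choice.

```python
import numpy as np


def sampler_trainer_kl(sampler_logprobs, trainer_logprobs):
    # Estimate KL(sampler || trainer) from tokens drawn from the sampler.
    p = np.asarray(sampler_logprobs, dtype=float)
    q = np.asarray(trainer_logprobs, dtype=float)
    if p.shape != q.shape:
        raise ValueError("logprob arrays must align token for token")
    log_ratio = q - p  # log(q/p) at each sampled token
    k1 = float(np.mean(-log_ratio))  # unbiased; can go negative on small batches
    k3 = float(np.mean(np.expm1(log_ratio) - log_ratio))  # always >= 0
    return k1, k3


sampled = [-0.2, -0.1, -0.3]
matched = sampler_trainer_kl(sampled, sampled)
# Trainer scored the same tokens after a differently rendered prompt.
mismatched = sampler_trainer_kl(sampled, [-1.2, -0.9, -2.3])
print(matched, mismatched)
```

With identical renderers both estimators should sit near zero at step 0, up to numerical differences between the sampling and training kernels; a clearly positive value before any policy update points at the prompt tokens rather than the policy.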
Failure modes
- Raw `tokenizer.encode()` on a chat model: skips the template, produces out-of-distribution prompt tokens, inflates sampler/trainer KL and corrupts importance ratios; correct only for base-model completion training.
- Loss on prompt tokens: an unmasked or mis-masked dataset trains the model to memorize questions; watch `num_loss_tokens` per example, and treat an all-zero mask as an error (nan), not a zero loss.
- Stop token missing from the trainable output: the model never learns to stop and generations run to the length limit; conversely, sampling with the wrong stop list yields multi-stop responses that must hard-error, not parse.
- BOS duplication: adding special tokens on every chunk encode instead of once per sequence; the double-BOS prompt is out-of-distribution and evades string-level inspection.
- Wrong thinking variant on hybrid models: training a thinking model with the non-thinking renderer (or vice versa) silently degrades quality; pair each checkpoint with its renderer name.
- `all_assistant_messages` without the extension property: earlier assistant turns train on a prefix that generation would never produce (history had thinking stripped); split the conversation into per-turn examples instead.
- Tool-call format drift: parsing assumes one syntax while the model emits another family's; malformed tool calls should surface as unparsed-call records with the raw text, not as dropped messages.
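Several of these failures are visible in the `(tokens, weights)` pair before training starts. The sketch below is one possible pre-flight check covering the all-zero mask, BOS duplication, and a missing stop token; the function name `check_datum` is hypothetical, and it assumes BOS and end-of-turn are distinct token ids, which does not hold for every tokenizer.

```python
import numpy as np


def check_datum(tokens, weights, bos_id, eot_id):
    """Return a list of human-readable problems; empty means the datum passed."""
    tokens = np.asarray(tokens)
    weights = np.asarray(weights)
    if tokens.shape != weights.shape:
        return ["tokens and weights differ in length"]
    problems = []
    if weights.sum() == 0:
        problems.append("all-zero mask: nothing is trainable")
    # Assumes bos_id != eot_id; otherwise every turn end would be counted.
    n_bos = int((tokens == bos_id).sum())
    if n_bos > 1:
        problems.append(f"BOS appears {n_bos} times")
    # Every contiguous trainable span should end on the end-of-turn token.
    trainable = weights > 0
    followed_by_trainable = np.append(trainable[1:], False)
    for end in np.flatnonzero(trainable & ~followed_by_trainable):
        if tokens[end] != eot_id:
            problems.append(f"trainable span ending at {end} lacks end-of-turn")
    return problems


BOS, EOT = 1, 0
print(check_datum([BOS, 10, 11, 12, EOT], [0, 0, 0, 1, 1], BOS, EOT))
print(check_datum([BOS, BOS, 10, 12, EOT], [0, 0, 0, 1, 1], BOS, EOT))
print(check_datum([BOS, 10, 11, 12], [0, 0, 1, 1], BOS, EOT))
print(check_datum([BOS, 10, EOT], [0, 0, 0], BOS, EOT))
```

Because the checks run on token ids, they catch a double BOS that string-level inspection of the decoded prompt would miss.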
References
- tinker-cookbook renderers: https://github.com/thinking-machines-lab/tinker-cookbook/tree/main/tinker_cookbook/renderers
- Tinker docs, rendering tutorial: https://tinker-docs.thinkingmachines.ai/tutorials/core-concepts/rendering/
- HuggingFace chat templating: https://huggingface.co/docs/transformers/main/en/chat_templating
- OpenAI Harmony response format (gpt-oss): https://cookbook.openai.com/articles/openai-harmony
- Tinker docs, quickstart (renderer usage in training loops): https://tinker-docs.thinkingmachines.ai/tinker/quickstart/
Related: Tinker · SFT & LoRA · Fine-tuning · GRPO · Agentic RL · Tools & Function Calling · LLM Benchmarks · Serving OSS Models · Glossary