Chat rendering and token loss masking

Scope: the renderer layer that sits between structured chat messages and token sequences in every post-training stack: how conversations become supervised examples with per-token loss weights, how generation prompts and sampled tokens round-trip back into messages, which masking policy to train on, and the failure modes (template mismatch, missing stop tokens, BOS duplication, thinking-block stripping) that silently corrupt fine-tuning. Patterns generalized from the tinker-cookbook renderers subsystem and HuggingFace chat templates.

The numpy/stdlib blocks below were executed with all asserts passing; they validate the masking and round-trip mechanics on a toy renderer. Blocks that call tinker_cookbook or transformers are reference templates on real APIs (not runnable in this doc's CI): pin versions and verify before production use.

What it is

A renderer (tinker-cookbook's term; HuggingFace calls the template half a chat template) is the bidirectional converter between the structured view of a conversation (a list of {role, content} messages, optionally with tool calls and thinking blocks) and the flat token view the model actually consumes. It owns three obligations:

build_generation_prompt(messages): render history plus the next-role header into tokens for sampling, with the model-specific special tokens (<|im_start|>, <|start_header_id|>, Harmony channels) and stop sequences.
build_supervised_example(messages): render the same conversation into (tokens, weights) where the per-token weight decides which positions contribute to the loss. Each message splits into a header (role delimiters the model sees but never generates, weight 0) and an output (content plus the end-of-turn token, weight 1 when trainable).
parse_response(tokens): convert sampled tokens back into a message plus a termination status (stop_sequence, eos, or malformed), extracting tool calls and thinking blocks.

The invariant tying them together: the weight-0 prefix of a supervised example must be token-identical to the generation prompt for the same context, so the model trains on exactly the distribution it sees at inference.

Why use it

Template mismatch is a silent killer. Training with one token layout and sampling with another does not crash; it degrades quality and, in RL, inflates the sampler-vs-trainer KL from step 0, corrupting importance ratios in PPO/GRPO-family losses. tinker-cookbook's guidance: encoding a chat prompt with raw tokenizer.encode() instead of the template can inflate per-token logprob KL by 5x or more.
Loss placement decides what is learned. Masking prompt tokens stops the model from spending gradient on memorizing questions; leaving the end-of-turn token trainable is what teaches the model to stop. Both come from the weight vector, not the optimizer.
Round-trip parsing is the API contract for RL and agents. A reward function or tool loop needs sampled tokens parsed back into messages reliably, including the truncated and malformed cases.

When to use it (and when not)

Always for SFT, DPO data preparation, RL rollouts, evals, and serving of chat-tuned models: one renderer implementation shared by all of them.
Choose the masking policy by task: train on the last assistant message for single-target SFT; on all assistant messages for multi-turn distillation (only when the extension property below holds); on all tokens for continued pretraining on conversation-shaped corpora.
Not needed for base-model continued pretraining on raw text: tokenizer.encode() with uniform weights is correct there, and a role_colon-style plain-text renderer covers base models used in chat-shaped experiments.

Architecture

```mermaid
flowchart LR
  MSG["Messages: system / user / assistant (+tools, +thinking)"]
  subgraph REN["Renderer (one per model family)"]
    SUP["build_supervised_example: header w=0, output w=1"]
    GEN["build_generation_prompt: history + next-role header"]
    PAR["parse_response: message + termination status"]
  end
  MSG --> SUP --> TRAIN["Trainer: loss on w=1 tokens only"]
  MSG --> GEN --> ENGINE["Inference engine (stop sequences)"]
  ENGINE -->|"sampled tokens"| PAR --> MSG2["Message: content, tool_calls, thinking"]
  SUP -.->|"w=0 prefix == generation prompt"| GEN
```

How to use it

The masking policies (tinker-cookbook's TrainOnWhat enum; other stacks expose subsets, e.g. TRL's assistant-only loss):

| Policy | Loss lands on | Use for |
| --- | --- | --- |
| `last_assistant_message` | final assistant message only | single-target SFT, preference data |
| `last_assistant_turn` | assistant messages after the last user turn (incl. tool calls) | agentic SFT on the final turn |
| `all_assistant_messages` | every assistant message | multi-turn SFT/distillation (needs extension property) |
| `all_messages` | all message outputs, headers still masked | conversation-shaped pretraining |
| `all_tokens` | everything, headers included | raw continued pretraining |
| `customized` | per-message `trainable` flag | mixed-quality transcripts |
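
The per-message half of the table can be sketched as a pure function from (messages, policy) to one boolean per message. This is an illustration of the table rows, not tinker-cookbook's implementation: the function names `trainable_messages` and `headers_trainable` are hypothetical, and treating a missing `trainable` flag as "not trainable" is an assumption.

```python
def trainable_messages(messages, policy):
    """Return one bool per message: does that message's output carry loss?"""
    n = len(messages)
    is_asst = [m["role"] == "assistant" for m in messages]
    if policy == "last_assistant_message":
        # Only the final message, and only if it is an assistant message.
        return [is_asst[i] and i == n - 1 for i in range(n)]
    if policy == "last_assistant_turn":
        # Everything after the last user message belongs to the final turn.
        last_user = max(
            (i for i, m in enumerate(messages) if m["role"] == "user"), default=-1
        )
        return [is_asst[i] and i > last_user for i in range(n)]
    if policy == "all_assistant_messages":
        return is_asst
    if policy in ("all_messages", "all_tokens"):
        return [True] * n
    if policy == "customized":
        # Assumption: an absent flag means "not trainable".
        return [bool(m.get("trainable", False)) for m in messages]
    raise ValueError(f"unknown policy: {policy}")


def headers_trainable(policy):
    """Headers carry loss only under all_tokens."""
    return policy == "all_tokens"


agent_convo = [
    {"role": "system", "content": "use tools"},
    {"role": "user", "content": "weather in Paris?"},
    {"role": "assistant", "content": "call get_weather(Paris)"},
    {"role": "tool", "content": "18C"},
    {"role": "assistant", "content": "18C in Paris", "trainable": True},
]

for policy in ("last_assistant_message", "last_assistant_turn",
               "all_assistant_messages", "all_messages", "customized"):
    print(policy, trainable_messages(agent_convo, policy))
```

On this agentic transcript the difference between `last_assistant_message` and `last_assistant_turn` is the tool-calling message: only the latter policy puts loss on it.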

A supervised example is the concatenation BOS + header_0 + output_0 + ... + header_n + output_n; headers get weight 0 (except under all_tokens), BOS gets weight 0, and the end-of-turn token belongs to the trainable output. The executed block below validates these mechanics on a toy renderer (character-level tokens, no BOS, three of the six policies), including the adversarial cases:

```python
import numpy as np

EOT = 0


def encode(s):
    return [ord(c) for c in s]


def decode(toks):
    return "".join(chr(t) for t in toks)


def render_message(msg):
    # header: role delimiters the model sees but never generates.
    # output: content plus end-of-turn token, the part the model must produce.
    header = encode(f"<{msg['role']}>")
    output = encode(msg["content"]) + [EOT]
    return header, output


def build_supervised_example(messages, train_on_what="last_assistant_message"):
    tokens, weights = [], []
    for i, msg in enumerate(messages):
        header, output = render_message(msg)
        is_last = i == len(messages) - 1
        is_assistant = msg["role"] == "assistant"
        if train_on_what == "last_assistant_message":
            trainable = is_last and is_assistant
        elif train_on_what == "all_assistant_messages":
            trainable = is_assistant
        elif train_on_what == "all_tokens":
            trainable = True
        else:
            raise ValueError(train_on_what)
        header_w = 1 if train_on_what == "all_tokens" else 0
        tokens += header + output
        weights += [header_w] * len(header) + [int(trainable)] * len(output)
    return np.array(tokens), np.array(weights)


def build_generation_prompt(messages, role="assistant"):
    tokens = []
    for msg in messages:
        header, output = render_message(msg)
        tokens += header + output
    return tokens + encode(f"<{role}>")


def parse_response(tokens):
    n_stop = tokens.count(EOT)
    if n_stop == 0:
        return decode(tokens), "malformed"  # truncated: never emitted stop
    if n_stop > 1:
        raise ValueError("wrong stop tokens configured for sampling")
    return decode(tokens[: tokens.index(EOT)]), "stop_sequence"


convo = [
    {"role": "system", "content": "be terse"},
    {"role": "user", "content": "2+2?"},
    {"role": "assistant", "content": "4"},
]
tokens, weights = build_supervised_example(convo)

# 1. Only the assistant reply carries loss; prompt content and headers are masked.
_, asst_output = render_message(convo[-1])
n_target = len(asst_output)
assert weights.sum() == n_target
assert (weights[-n_target:] == 1).all() and (weights[:-n_target] == 0).all()

# 2. The end-of-turn token is trainable: the model must learn to stop.
assert tokens[-1] == EOT and weights[-1] == 1

# 3. Weights form 0...01...1 and the masked prefix equals the generation prompt
#    for the same context: training and inference see identical token streams.
split = int(np.argmax(weights))
assert (weights[:split] == 0).all() and (weights[split:] == 1).all()
assert tokens[:split].tolist() == build_generation_prompt(convo[:-1])

# 4. Round trip: parsing the trainable span recovers the original message.
content, termination = parse_response(tokens[split:].tolist())
assert content == "4" and termination == "stop_sequence"

# 5. A truncated sample (no stop token) must parse as malformed, not crash.
assert parse_response(encode("4"))[1] == "malformed"

# 6. Two stop tokens means the sampler ran with the wrong stop list: hard error.
try:
    parse_response([52, EOT, 52, EOT])
    raise AssertionError("should have raised")
except ValueError:
    pass

# 7. Extension property: prompt(turn k) + completion(turn k) is a prefix of
#    prompt(turn k+1), enabling KV-cache reuse and single-datum multi-turn SFT.
convo2 = convo + [{"role": "user", "content": "3+3?"}, {"role": "assistant", "content": "6"}]
p1 = build_generation_prompt(convo2[:2])
full1 = p1 + encode("4") + [EOT]
p2 = build_generation_prompt(convo2[:4])
assert p2[: len(full1)] == full1

# 8. Adversarial: stripping reasoning blocks from history (as thinking-model
#    templates do) breaks the prefix property, so KV reuse must be disabled.
think_convo = [convo2[0], convo2[1], {"role": "assistant", "content": "<think>2+2=4</think>4"}, convo2[3]]
p1t = build_generation_prompt(think_convo[:2])
full1t = p1t + encode("<think>2+2=4</think>4") + [EOT]
stripped = dict(think_convo[2], content="4")
p2t = build_generation_prompt([think_convo[0], think_convo[1], stripped, think_convo[3]])
assert p2t[: len(full1t)] != full1t

print("renderer mechanics: OK")
```

What the mask does to the loss, and the guard for a broken policy (executed, asserts passing):

```python
import numpy as np


def masked_mean_nll(token_logprobs, weights):
    # Token-mean NLL over weighted positions; nan when nothing is trainable
    # (mirrors tinker-cookbook's compute_mean_nll contract).
    w = np.asarray(weights, dtype=float)
    lp = np.asarray(token_logprobs, dtype=float)
    if w.sum() == 0:
        return float("nan")
    return float(-(lp * w).sum() / w.sum())


# 8-token example: 6 prompt tokens the model finds surprising (a rare question
# it should not memorize), 2 answer tokens it predicts well.
logprobs = [-4.0, -3.5, -5.0, -4.2, -3.8, -4.5, -0.1, -0.05]
weights = [0, 0, 0, 0, 0, 0, 1, 1]

masked = masked_mean_nll(logprobs, weights)
unmasked = masked_mean_nll(logprobs, [1] * 8)

# Masked loss reflects answer quality only; unmasked loss is dominated by
# prompt tokens and would spend gradient on memorizing the question.
assert abs(masked - 0.075) < 1e-12
assert unmasked > 3.0 and unmasked > 40 * masked

# num_loss_tokens (what training logs report) is the weight sum, not len().
assert sum(weights) == 2

# Adversarial: an all-masked datum yields nan, not a silent 0.0 that would
# hide a broken train_on_what setting.
assert np.isnan(masked_mean_nll(logprobs, [0] * 8))

print("masked NLL: OK")
```

The termination status has a consumer policy: eval grading accepts is_clean (stop sequence or EOS), while strict RL format rewards gate on stop_sequence only, so a model cannot collect format reward by running into the length limit.
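
The two consumer policies can be written as a pair of small predicates over the status strings listed earlier (stop_sequence, eos, malformed). A minimal sketch: the function names `is_clean` and `strict_format_reward` and the 1.0/0.0 reward values are illustrative assumptions, not an API from any library.

```python
CLEAN_STATUSES = {"stop_sequence", "eos"}


def is_clean(termination):
    """Eval grading: accept any generation that ended on its own."""
    return termination in CLEAN_STATUSES


def strict_format_reward(termination):
    """RL format reward: only an explicit stop sequence earns credit, so a
    generation that runs into the length limit scores 0."""
    return 1.0 if termination == "stop_sequence" else 0.0


for status in ("stop_sequence", "eos", "malformed"):
    print(status, is_clean(status), strict_format_reward(status))
```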

How to develop with it

Implementing a renderer for a new model family means encoding four decisions, then testing four properties.

The decisions:

Special-token layout: e.g. Llama 3 (<|start_header_id|>role<|end_header_id|>...<|eot_id|>), ChatML/Qwen (<|im_start|>role...<|im_end|>), DeepSeek V3 (<｜User｜>/<｜Assistant｜> markers written with full-width bars, U+FF5C, not ASCII pipes), OpenAI Harmony for gpt-oss (<|start|>role<|channel|>channel<|message|>...<|end|>, where channel is analysis, commentary, or final, with two stop tokens, <|return|> and <|call|>).
Thinking-block semantics: reasoning models strip <think>...</think> from historical assistant turns but keep it in the final target; hybrid models ship paired renderers (qwen3 vs qwen3_disable_thinking) and picking the wrong one silently corrupts training. Some templates prefill <think> into the generation prompt, so the parser must re-inject it before parsing (the prefill/normalize asymmetry).
Tool-call syntax: JSON in <tool_call> tags (Qwen3), XML function blocks (Qwen3.5), delimiter-chained calls (DeepSeek), typed channels (Harmony). Llama 3's bare-JSON convention is unparseable in the general case, which is why tinker-cookbook refuses to support tool calls for it.
Loss-mask boundaries: which tokens are header vs output, per the table above.
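
The thinking-block decision has two halves that are easy to get out of sync: stripping reasoning from history, and restoring a prefilled opening tag before parsing. The sketch below shows both on plain strings, assuming `<think>...</think>` delimiters; the helper names are hypothetical, and a real renderer works on tokens and follows its model's template rather than a regex.

```python
import re

THINK_BLOCK = re.compile(r"<think>.*?</think>\s*", re.DOTALL)
THINK_SPLIT = re.compile(r"<think>(.*?)</think>\s*(.*)", re.DOTALL)


def strip_history_thinking(messages):
    """Drop <think> blocks from assistant messages that are history.

    An assistant message in the final position is the training target and
    keeps its reasoning; every earlier assistant message is stripped.
    """
    last = len(messages) - 1
    out = []
    for i, msg in enumerate(messages):
        if msg["role"] == "assistant" and i != last:
            msg = dict(msg, content=THINK_BLOCK.sub("", msg["content"]))
        out.append(msg)
    return out


def split_thinking(text):
    """Return (thinking, answer); thinking is None when no block is present."""
    m = THINK_SPLIT.match(text)
    if m is None:
        return None, text
    return m.group(1), m.group(2)


def parse_with_prefill(sampled_text, prefill="<think>"):
    """When the generation prompt already ended with `prefill`, the sampled
    text starts mid-block; re-inject the tag so the parser sees a full block."""
    return split_thinking(prefill + sampled_text)


history = [
    {"role": "user", "content": "2+2?"},
    {"role": "assistant", "content": "<think>2+2=4</think>4"},
    {"role": "user", "content": "3+3?"},
    {"role": "assistant", "content": "<think>3+3=6</think>6"},
]
rendered = strip_history_thinking(history)
print(rendered[1]["content"], "|", rendered[3]["content"])
print(parse_with_prefill("2+2=4</think>4"))
```

Parsing the same sampled text without re-injection finds no block at all, which is the prefill/normalize asymmetry in miniature.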

The test properties (all four are enforced in tinker-cookbook's renderer suite and are worth copying into any stack):

build_generation_prompt matches tokenizer.apply_chat_template(..., add_generation_prompt=True, tokenize=True) token for token.
The weight-0 prefix of a supervised example equals the generation prompt of the same context (weights are 0...01...1).
parse_response(trainable_tokens) returns the original final message.
The extension property holds where claimed: each turn's prompt-plus-completion is a prefix of the next turn's prompt.

Reference template for the parity test (needs transformers and tinker-cookbook; pin both):

```python
from tinker_cookbook import renderers, model_info, tokenizer_utils

model = "Qwen/Qwen3-8B"
tokenizer = tokenizer_utils.get_tokenizer(model)
renderer = renderers.get_renderer(
    model_info.get_recommended_renderer_name(model), tokenizer)

messages = [{"role": "user", "content": "2+2?"}]
ours = renderer.build_generation_prompt(messages).to_ints()
hf = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, tokenize=True)
assert ours == hf  # any drift here is a training-vs-inference mismatch
```

HuggingFace-only stacks can get the supervised mask from templates that carry {% generation %} markers via apply_chat_template(..., return_assistant_tokens_mask=True); verify the installed template actually defines the markers before relying on it.
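
One way to do that verification is a string-level check on the template source before asking for the mask. The sketch below is stdlib-only and assumes the template is available as a single Jinja string (for a loaded tokenizer this is typically `tokenizer.chat_template`, though some tokenizers store several named templates); `has_generation_markers` is a hypothetical helper, and the two template strings are made-up examples.

```python
import re

# Jinja allows whitespace-control dashes, e.g. {%- generation -%}.
GENERATION_OPEN = re.compile(r"\{%-?\s*generation\s*-?%\}")
GENERATION_CLOSE = re.compile(r"\{%-?\s*endgeneration\s*-?%\}")


def has_generation_markers(chat_template):
    """True when the template has at least one generation block and the
    opening and closing markers are balanced."""
    n_open = len(GENERATION_OPEN.findall(chat_template))
    n_close = len(GENERATION_CLOSE.findall(chat_template))
    return n_open > 0 and n_open == n_close


WITH_MARKERS = (
    "{% for m in messages %}"
    "{% if m['role'] == 'assistant' %}"
    "{% generation %}{{ m['content'] }}{% endgeneration %}"
    "{% else %}{{ m['content'] }}{% endif %}"
    "{% endfor %}"
)
WITHOUT_MARKERS = "{% for m in messages %}{{ m['content'] }}{% endfor %}"

print(has_generation_markers(WITH_MARKERS), has_generation_markers(WITHOUT_MARKERS))

# With transformers installed, the mask itself would then come from a call like
#   tokenizer.apply_chat_template(messages, tokenize=True, return_dict=True,
#                                 return_assistant_tokens_mask=True)
# Check the returned keys against the pinned version's documentation.
```

A template without the markers does not produce a usable assistant mask, so gating on this check avoids silently training on an all-zero mask.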

How to maintain it

Pin and test the tokenizer stack. Template behaviour rides on transformers versions; tinker-cookbook pins around known-bad releases (a 5.3.0 DeepSeek tokenizer decode bug, pre-5.0 VLM image-token miscounts). Re-run the four-property suite on every transformers or model upgrade.
Track template drift across model revisions. A new checkpoint of the same family can change thinking defaults or tool syntax (Qwen3 JSON tool calls became XML in Qwen3.5); the renderer name belongs in run metadata so a checkpoint can be re-served with the template it was trained with.
Keep one renderer registry. A factory keyed by model name (get_renderer(name, tokenizer) plus a custom-renderer registry) prevents ad-hoc per-script templates; renderers should be picklable so rollout workers can reconstruct them from (renderer_name, model_name).
Watch the boundary conditions. Tokenizing text in chunks can produce different tokens than tokenizing the joined string even when both decode identically; merge adjacent text parts before encoding. Multi-byte UTF-8 characters split across tokens need a buffering decoder.
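
Both boundary conditions can be reproduced without a real tokenizer. The sketch below uses a toy longest-match tokenizer with a three-entry vocabulary as a stand-in for BPE merges, and the standard library's incremental UTF-8 decoder as the buffering decoder; all names are illustrative, and real token-to-bytes mappings depend on the tokenizer.

```python
import codecs

VOCAB = {"a", "b", "ab"}  # toy vocabulary: "ab" plays the role of a merge


def greedy_tokenize(text):
    """Longest-match-first toy tokenizer."""
    out, i = [], 0
    while i < len(text):
        for length in range(len(text) - i, 0, -1):
            piece = text[i:i + length]
            if piece in VOCAB:
                out.append(piece)
                i += length
                break
        else:
            raise ValueError(f"cannot tokenize at position {i}")
    return out


# Same string, different tokens: the merge only fires on the joined text.
chunked = greedy_tokenize("a") + greedy_tokenize("b")
joined = greedy_tokenize("a" + "b")
print(chunked, joined)


def merge_adjacent_text(parts):
    """Join consecutive string parts so each text run is encoded once."""
    merged = []
    for part in parts:
        if merged and isinstance(part, str) and isinstance(merged[-1], str):
            merged[-1] += part
        else:
            merged.append(part)
    return merged


def stream_decode(token_bytes):
    """Yield text as soon as it is complete, buffering partial characters."""
    decoder = codecs.getincrementaldecoder("utf-8")(errors="replace")
    for chunk in token_bytes:
        text = decoder.decode(chunk)
        if text:
            yield text
    tail = decoder.decode(b"", final=True)  # flush dangling bytes, if any
    if tail:
        yield tail


# "é" is two bytes in UTF-8; here they arrive in separate tokens.
print(list(stream_decode([b"caf", b"\xc3", b"\xa9"])))
```

Decoding each token's bytes on its own would raise on the lone `b"\xc3"`; the incremental decoder holds it until the continuation byte arrives.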

How to run it in production

Same template at train and serve. OpenAI-compatible servers (vLLM, SGLang) apply the chat_template shipped in tokenizer_config.json server-side; a fine-tuned checkpoint must ship the template it was trained with, or clients silently reproduce the mismatch the training stack avoided. See also: serving open-weight models.
Exploit the extension property for agentic workloads. When each turn's prompt extends the previous one, a T-turn trajectory trains as one merged datum and samples with KV-cache reuse: O(T) prefill instead of O(T^2). Thinking-history stripping breaks this, which is a real throughput cost of reasoning templates in multi-turn RL. See also: agentic RL, KV cache management.
Gate rewards and evals on termination status. Truncated generations (malformed) should score 0 in strict-format RL and be reported separately in evals; benchmark scores move by tens of points on truncation rates alone. See also: LLM benchmarks, evaluation integrity.
Monitor step-0 KL in RL. High sampler-vs-trainer KL on the very first step is the classic renderer-mismatch signature: the prompt tokens the trainer scores are not the ones the sampler saw. See also: GRPO, fine-tuning and post-training.
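
A step-0 KL monitor needs only the per-token logprobs the sampler reported and the ones the trainer recomputes for the same sampled tokens. The sketch below uses the plain Monte Carlo estimate of KL(sampler || trainer), the mean of the logprob difference over sampled tokens; the estimator choice, the function names, the example numbers, and the alert threshold are all assumptions for illustration.

```python
import numpy as np

STEP0_KL_ALERT = 0.01  # hypothetical threshold; tune per setup


def sampler_trainer_kl(sampler_logprobs, trainer_logprobs):
    """Estimate KL(sampler || trainer) from tokens drawn by the sampler."""
    s = np.asarray(sampler_logprobs, dtype=float)
    t = np.asarray(trainer_logprobs, dtype=float)
    return float(np.mean(s - t))


def importance_ratios(sampler_logprobs, trainer_logprobs):
    """Per-token trainer/sampler probability ratios used by PPO-style losses."""
    s = np.asarray(sampler_logprobs, dtype=float)
    t = np.asarray(trainer_logprobs, dtype=float)
    return np.exp(t - s)


sampler = [-0.5, -1.0, -0.2]

# Same weights, same prompt tokens: only numerical noise separates the two.
trainer_matched = [-0.501, -0.999, -0.2]
# Trainer scored the completion after a differently rendered prompt.
trainer_mismatched = [-2.5, -3.0, -1.4]

kl_ok = sampler_trainer_kl(sampler, trainer_matched)
kl_bad = sampler_trainer_kl(sampler, trainer_mismatched)
print(kl_ok, kl_bad, kl_bad > STEP0_KL_ALERT)
print(importance_ratios(sampler, trainer_mismatched))
```

At step 0 the sampler and trainer share weights, so ratios far from 1 point at the inputs (rendering, stop tokens, BOS handling) rather than at the policy update.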

Failure modes

Raw tokenizer.encode() on a chat model: skips the template, produces out-of-distribution prompt tokens, inflates sampler/trainer KL and corrupts importance ratios; correct only for base-model completion training.
Loss on prompt tokens: an unmasked or mis-masked dataset trains the model to memorize questions; watch num_loss_tokens per example, and treat an all-zero mask as an error (nan), not a zero loss.
Stop token missing from the trainable output: the model never learns to stop and generations run to the length limit; conversely, sampling with the wrong stop list yields multi-stop responses that must hard-error, not parse.
BOS duplication: adding special tokens on every chunk encode instead of once per sequence; the double-BOS prompt is out-of-distribution and evades string-level inspection.
Wrong thinking variant on hybrid models: training a thinking model with the non-thinking renderer (or vice versa) silently degrades quality; pair each checkpoint with its renderer name.
all_assistant_messages without the extension property: earlier assistant turns train on a prefix that generation would never produce (history had thinking stripped); split the conversation into per-turn examples instead.
Tool-call format drift: parsing assumes one syntax while the model emits another family's; malformed tool calls should surface as unparsed-call records with the raw text, not as dropped messages.
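
Several of these failures are detectable on a single (tokens, weights) datum before it reaches the trainer. A minimal sketch of such a check, assuming integer token ids; `validate_datum` and the BOS/end-of-turn ids used here are hypothetical, and the check covers only the double-BOS, all-zero-mask, and missing-stop-token cases.

```python
def validate_datum(tokens, weights, bos_id, eot_id):
    """Return a list of problems found in one supervised example."""
    if len(tokens) != len(weights):
        return ["tokens and weights differ in length"]
    problems = []
    # BOS duplication: special tokens added on more than one encode call.
    if len(tokens) >= 2 and tokens[0] == bos_id and tokens[1] == bos_id:
        problems.append("double BOS")
    trainable = [i for i, w in enumerate(weights) if w > 0]
    if not trainable:
        # Nothing to learn from: treat as an error, not a zero loss.
        problems.append("all-zero loss mask")
    elif tokens[trainable[-1]] != eot_id:
        # The model is never shown where to stop.
        problems.append("last trainable token is not end-of-turn")
    return problems


BOS, EOT = 1, 2  # hypothetical ids for illustration

good = validate_datum([BOS, 10, 11, 20, EOT], [0, 0, 0, 1, 1], BOS, EOT)
double_bos = validate_datum([BOS, BOS, 10, 20, EOT], [0, 0, 0, 1, 1], BOS, EOT)
all_masked = validate_datum([BOS, 10, 11, 20, EOT], [0, 0, 0, 0, 0], BOS, EOT)
no_stop = validate_datum([BOS, 10, 11, 20, 21], [0, 0, 0, 1, 1], BOS, EOT)
print(good, double_bos, all_masked, no_stop)
```

Running a check like this over a sample of the dataset, alongside a histogram of num_loss_tokens, catches mask and special-token mistakes that string-level inspection misses.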

References

tinker-cookbook renderers: https://github.com/thinking-machines-lab/tinker-cookbook/tree/main/tinker_cookbook/renderers
Tinker docs, rendering tutorial: https://tinker-docs.thinkingmachines.ai/tutorials/core-concepts/rendering/
HuggingFace chat templating: https://huggingface.co/docs/transformers/main/en/chat_templating
OpenAI Harmony response format (gpt-oss): https://cookbook.openai.com/articles/openai-harmony
Tinker docs, quickstart (renderer usage in training loops): https://tinker-docs.thinkingmachines.ai/tinker/quickstart/