Chat rendering and token loss masking

Scope: the renderer layer that sits between structured chat messages and token sequences in every post-training stack: how conversations become supervised examples with per-token loss weights, how generation prompts and sampled tokens round-trip back into messages, which masking policy to train on, and the failure modes (template mismatch, missing stop tokens, BOS duplication, thinking-block stripping) that silently corrupt fine-tuning. Patterns generalized from the tinker-cookbook renderers subsystem and HuggingFace chat templates.

The numpy/stdlib blocks below were executed with all asserts passing; they validate the masking and round-trip mechanics on a toy renderer. Blocks that call tinker_cookbook or transformers are reference templates on real APIs (not runnable in this doc's CI): pin versions and verify before production use.

What it is

A renderer (tinker-cookbook's term; HuggingFace calls the template half a chat template) is the bidirectional converter between the structured view of a conversation (a list of {role, content} messages, optionally with tool calls and thinking blocks) and the flat token view the model actually consumes. It owns three obligations:

  • build_generation_prompt(messages): render history plus the next-role header into tokens for sampling, with the model-specific special tokens (<|im_start|>, <|start_header_id|>, Harmony channels) and stop sequences.
  • build_supervised_example(messages): render the same conversation into (tokens, weights) where the per-token weight decides which positions contribute to the loss. Each message splits into a header (role delimiters the model sees but never generates, weight 0) and an output (content plus the end-of-turn token, weight 1 when trainable).
  • parse_response(tokens): convert sampled tokens back into a message plus a termination status (stop_sequence, eos, or malformed), extracting tool calls and thinking blocks.

The invariant tying them together: the weight-0 prefix of a supervised example must be token-identical to the generation prompt for the same context, so the model trains on exactly the distribution it sees at inference.

Why use it

  • Template mismatch is a silent killer. Training with one token layout and sampling with another does not crash; it degrades quality and, in RL, inflates the sampler-vs-trainer KL from step 0, corrupting importance ratios in PPO/GRPO-family losses. tinker-cookbook's guidance: encoding a chat prompt with raw tokenizer.encode() instead of the template can inflate per-token logprob KL by 5x or more.
  • Loss placement decides what is learned. Masking prompt tokens stops the model from spending gradient on memorizing questions; leaving the end-of-turn token trainable is what teaches the model to stop. Both come from the weight vector, not the optimizer.
  • Round-trip parsing is the API contract for RL and agents. A reward function or tool loop needs sampled tokens parsed back into messages reliably, including the truncated and malformed cases.

When to use it (and when not)

  • Always for SFT, DPO data preparation, RL rollouts, evals, and serving of chat-tuned models: one renderer implementation shared by all of them.
  • Choose the masking policy by task: train on the last assistant message for single-target SFT; on all assistant messages for multi-turn distillation (only when the extension property below holds); on all messages (or all tokens, headers included) for continued pretraining on conversation-shaped corpora.
  • Not needed for base-model continued pretraining on raw text: tokenizer.encode() with uniform weights is correct there, and a role_colon-style plain-text renderer covers base models used in chat-shaped experiments.
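
For orientation, a role_colon-style renderer can be as small as the sketch below. The exact delimiter layout ("Role: content" blocks separated by blank lines) and both function names are assumptions made for illustration, not tinker-cookbook's implementation.

```python
def render_role_colon(messages, next_role="assistant"):
    """Plain-text prompt: 'Role: content' blocks plus an open header for the next speaker."""
    parts = [f"{m['role'].capitalize()}: {m['content']}\n\n" for m in messages]
    return "".join(parts) + f"{next_role.capitalize()}:"


def role_colon_stop_sequences(roles=("user", "system")):
    # A base model has no end-of-turn token, so sampling stops when the
    # model starts writing the next speaker's header instead.
    return [f"\n\n{r.capitalize()}:" for r in roles]


prompt = render_role_colon([{"role": "user", "content": "2+2?"}])
assert prompt == "User: 2+2?\n\nAssistant:"
assert "\n\nUser:" in role_colon_stop_sequences()
```

Because there are no special tokens, the stop condition is a string match on the next header, which is why this style suits base models rather than chat-tuned ones.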

Architecture

flowchart LR
  MSG["Messages: system / user / assistant (+tools, +thinking)"]
  subgraph REN["Renderer (one per model family)"]
    SUP["build_supervised_example: header w=0, output w=1"]
    GEN["build_generation_prompt: history + next-role header"]
    PAR["parse_response: message + termination status"]
  end
  MSG --> SUP --> TRAIN["Trainer: loss on w=1 tokens only"]
  MSG --> GEN --> ENGINE["Inference engine (stop sequences)"]
  ENGINE -->|"sampled tokens"| PAR --> MSG2["Message: content, tool_calls, thinking"]
  SUP -.->|"w=0 prefix == generation prompt"| GEN

How to use it

The masking policies (tinker-cookbook's TrainOnWhat enum; other stacks expose subsets, e.g. TRL's assistant-only loss):

| Policy | Loss lands on | Use for |
| --- | --- | --- |
| last_assistant_message | final assistant message only | single-target SFT, preference data |
| last_assistant_turn | assistant messages after the last user turn (incl. tool calls) | agentic SFT on the final turn |
| all_assistant_messages | every assistant message | multi-turn SFT/distillation (needs extension property) |
| all_messages | all message outputs, headers still masked | conversation-shaped pretraining |
| all_tokens | everything, headers included | raw continued pretraining |
| customized | per-message trainable flag | mixed-quality transcripts |

A supervised example is the concatenation BOS + header_0 + output_0 + ... + header_n + output_n. Headers get weight 0 (except under all_tokens), BOS gets weight 0, and the end-of-turn token belongs to the trainable output. The executed block below validates these mechanics on a toy character-level renderer (it has no BOS token and implements three of the six policies), including the adversarial cases:

import numpy as np

EOT = 0


def encode(s):
    return [ord(c) for c in s]


def decode(toks):
    return "".join(chr(t) for t in toks)


def render_message(msg):
    # header: role delimiters the model sees but never generates.
    # output: content plus end-of-turn token, the part the model must produce.
    header = encode(f"<{msg['role']}>")
    output = encode(msg["content"]) + [EOT]
    return header, output


def build_supervised_example(messages, train_on_what="last_assistant_message"):
    tokens, weights = [], []
    for i, msg in enumerate(messages):
        header, output = render_message(msg)
        is_last = i == len(messages) - 1
        is_assistant = msg["role"] == "assistant"
        if train_on_what == "last_assistant_message":
            trainable = is_last and is_assistant
        elif train_on_what == "all_assistant_messages":
            trainable = is_assistant
        elif train_on_what == "all_tokens":
            trainable = True
        else:
            raise ValueError(train_on_what)
        header_w = 1 if train_on_what == "all_tokens" else 0
        tokens += header + output
        weights += [header_w] * len(header) + [int(trainable)] * len(output)
    return np.array(tokens), np.array(weights)


def build_generation_prompt(messages, role="assistant"):
    tokens = []
    for msg in messages:
        header, output = render_message(msg)
        tokens += header + output
    return tokens + encode(f"<{role}>")


def parse_response(tokens):
    n_stop = tokens.count(EOT)
    if n_stop == 0:
        return decode(tokens), "malformed"  # truncated: never emitted stop
    if n_stop > 1:
        raise ValueError("wrong stop tokens configured for sampling")
    return decode(tokens[: tokens.index(EOT)]), "stop_sequence"


convo = [
    {"role": "system", "content": "be terse"},
    {"role": "user", "content": "2+2?"},
    {"role": "assistant", "content": "4"},
]
tokens, weights = build_supervised_example(convo)

# 1. Only the assistant reply carries loss; prompt content and headers are masked.
_, asst_output = render_message(convo[-1])
n_target = len(asst_output)
assert weights.sum() == n_target
assert (weights[-n_target:] == 1).all() and (weights[:-n_target] == 0).all()

# 2. The end-of-turn token is trainable: the model must learn to stop.
assert tokens[-1] == EOT and weights[-1] == 1

# 3. Weights form 0...01...1 and the masked prefix equals the generation prompt
#    for the same context: training and inference see identical token streams.
split = int(np.argmax(weights))
assert (weights[:split] == 0).all() and (weights[split:] == 1).all()
assert tokens[:split].tolist() == build_generation_prompt(convo[:-1])

# 4. Round trip: parsing the trainable span recovers the original message.
content, termination = parse_response(tokens[split:].tolist())
assert content == "4" and termination == "stop_sequence"

# 5. A truncated sample (no stop token) must parse as malformed, not crash.
assert parse_response(encode("4"))[1] == "malformed"

# 6. Two stop tokens means the sampler ran with the wrong stop list: hard error.
try:
    parse_response([52, EOT, 52, EOT])
    raise AssertionError("should have raised")
except ValueError:
    pass

# 7. Extension property: prompt(turn k) + completion(turn k) is a prefix of
#    prompt(turn k+1), enabling KV-cache reuse and single-datum multi-turn SFT.
convo2 = convo + [{"role": "user", "content": "3+3?"}, {"role": "assistant", "content": "6"}]
p1 = build_generation_prompt(convo2[:2])
full1 = p1 + encode("4") + [EOT]
p2 = build_generation_prompt(convo2[:4])
assert p2[: len(full1)] == full1

# 8. Adversarial: stripping reasoning blocks from history (as thinking-model
#    templates do) breaks the prefix property, so KV reuse must be disabled.
think_convo = [convo2[0], convo2[1], {"role": "assistant", "content": "<think>2+2=4</think>4"}, convo2[3]]
p1t = build_generation_prompt(think_convo[:2])
full1t = p1t + encode("<think>2+2=4</think>4") + [EOT]
stripped = dict(think_convo[2], content="4")
p2t = build_generation_prompt([think_convo[0], think_convo[1], stripped, think_convo[3]])
assert p2t[: len(full1t)] != full1t

print("renderer mechanics: OK")
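
The toy renderer covers three policies. The remaining rows of the policy table can be sketched as a per-message flag function, shown below. This follows the table's wording rather than tinker-cookbook's source, and the `trainable` message key used for `customized` is an assumed convention.

```python
def trainable_flags(messages, train_on_what):
    """Per-message 'output is trainable' flags for each policy in the table."""
    roles = [m["role"] for m in messages]
    n = len(messages)
    is_asst = [r == "assistant" for r in roles]
    if train_on_what == "last_assistant_message":
        return [is_asst[i] and i == n - 1 for i in range(n)]
    if train_on_what == "last_assistant_turn":
        # Every assistant message after the final user message, so
        # tool-calling messages inside the last turn are included.
        user_idx = [i for i, r in enumerate(roles) if r == "user"]
        last_user = user_idx[-1] if user_idx else -1
        return [is_asst[i] and i > last_user for i in range(n)]
    if train_on_what == "all_assistant_messages":
        return is_asst
    if train_on_what in ("all_messages", "all_tokens"):
        return [True] * n  # all_tokens additionally unmasks the headers
    if train_on_what == "customized":
        # Assumed convention: an optional boolean 'trainable' key per message.
        return [bool(m.get("trainable", False)) for m in messages]
    raise ValueError(train_on_what)


agent = [
    {"role": "user", "content": "weather in Oslo?"},
    {"role": "assistant", "content": "call get_weather"},
    {"role": "tool", "content": "4C"},
    {"role": "assistant", "content": "4C in Oslo"},
    {"role": "user", "content": "and Rome?"},
    {"role": "assistant", "content": "call get_weather"},
    {"role": "tool", "content": "18C"},
    {"role": "assistant", "content": "18C in Rome"},
]
assert trainable_flags(agent, "last_assistant_message") == [False] * 7 + [True]
assert trainable_flags(agent, "last_assistant_turn") == [False] * 5 + [True, False, True]
assert sum(trainable_flags(agent, "all_assistant_messages")) == 4
```

The difference between the first two policies only shows up in agentic transcripts: last_assistant_turn also trains the tool call that precedes the final answer.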

What the mask does to the loss, and the guard for a broken policy (executed, asserts passing):

import numpy as np


def masked_mean_nll(token_logprobs, weights):
    # Token-mean NLL over weighted positions; nan when nothing is trainable
    # (mirrors tinker-cookbook's compute_mean_nll contract).
    w = np.asarray(weights, dtype=float)
    lp = np.asarray(token_logprobs, dtype=float)
    if w.sum() == 0:
        return float("nan")
    return float(-(lp * w).sum() / w.sum())


# 8-token example: 6 prompt tokens the model finds surprising (a rare question
# it should not memorize), 2 answer tokens it predicts well.
logprobs = [-4.0, -3.5, -5.0, -4.2, -3.8, -4.5, -0.1, -0.05]
weights = [0, 0, 0, 0, 0, 0, 1, 1]

masked = masked_mean_nll(logprobs, weights)
unmasked = masked_mean_nll(logprobs, [1] * 8)

# Masked loss reflects answer quality only; unmasked loss is dominated by
# prompt tokens and would spend gradient on memorizing the question.
assert abs(masked - 0.075) < 1e-12
assert unmasked > 3.0 and unmasked > 40 * masked

# num_loss_tokens (what training logs report) is the weight sum, not len().
assert sum(weights) == 2

# Adversarial: an all-masked datum yields nan, not a silent 0.0 that would
# hide a broken train_on_what setting.
assert np.isnan(masked_mean_nll(logprobs, [0] * 8))

print("masked NLL: OK")

Consumers apply different policies to the termination status: eval grading accepts any is_clean response (stop sequence or EOS), while strict RL format rewards gate on stop_sequence only, so a model cannot collect format reward by running into the length limit.
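
A minimal sketch of those two consumer policies follows. The `Termination` enum, the `format_reward` helper, and the `bonus` parameter are hypothetical names introduced here; only the three status values and the is_clean rule come from the text above.

```python
from enum import Enum


class Termination(Enum):
    STOP_SEQUENCE = "stop_sequence"
    EOS = "eos"
    MALFORMED = "malformed"


def is_clean(status):
    # Eval grading: the model ended the turn on its own, by either route.
    return status in (Termination.STOP_SEQUENCE, Termination.EOS)


def format_reward(status, bonus=1.0):
    # Strict RL gate: only an explicit stop sequence earns the bonus, so a
    # generation that ran into the length limit is never rewarded.
    return bonus if status is Termination.STOP_SEQUENCE else 0.0


assert is_clean(Termination.EOS) and not is_clean(Termination.MALFORMED)
assert format_reward(Termination.EOS) == 0.0
assert format_reward(Termination.STOP_SEQUENCE) == 1.0
```

Keeping the two predicates separate makes the asymmetry explicit: an EOS-terminated answer is gradable in an eval but earns no format reward under the strict gate.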

How to develop with it

Implementing a renderer for a new model family means encoding four decisions, then testing four properties.

The decisions:

  • Special-token layout: e.g. Llama 3 (<|start_header_id|>role<|end_header_id|>...<|eot_id|>), ChatML/Qwen (<|im_start|>role...<|im_end|>), DeepSeek V3 (full-width <｜User｜>/<｜Assistant｜> markers), OpenAI Harmony for gpt-oss (<|start|>role<|channel|>analysis|commentary|final<|message|>...<|end|> with two stop tokens, <|return|> and <|call|>).
  • Thinking-block semantics: reasoning models strip <think>...</think> from historical assistant turns but keep it in the final target; hybrid models ship paired renderers (qwen3 vs qwen3_disable_thinking) and picking the wrong one silently corrupts training. Some templates prefill <think> into the generation prompt, so the parser must re-inject it before parsing (the prefill/normalize asymmetry).
  • Tool-call syntax: JSON in <tool_call> tags (Qwen3), XML function blocks (Qwen3.5), delimiter-chained calls (DeepSeek), typed channels (Harmony). Llama 3's bare-JSON convention is unparseable in the general case, which is why tinker-cookbook refuses to support tool calls for it.
  • Loss-mask boundaries: which tokens are header vs output, per the table above.
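
The prefill/normalize asymmetry from the thinking-block decision can be sketched at the string level as below. The function names and the returned dict shape are illustrative assumptions; a real renderer works on token ids and its model family's own tags.

```python
import re

THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def parse_with_prefill(sampled_text, prefill="<think>"):
    """Re-attach text the template prefilled into the prompt, then split
    the reasoning block from the visible content."""
    if sampled_text.startswith(prefill):
        full = sampled_text  # template did not prefill; nothing to re-inject
    else:
        full = prefill + sampled_text
    m = THINK_RE.search(full)
    if m is None:
        # Opening tag without a close: the reasoning was truncated.
        return {"thinking": None, "content": full, "ok": False}
    return {"thinking": m.group(1).strip(), "content": full[m.end():].strip(), "ok": True}


def strip_thinking(content):
    """History normalization: drop reasoning blocks from earlier assistant turns."""
    return THINK_RE.sub("", content).strip()


# The sampler only returns what came after the prefilled "<think>".
parsed = parse_with_prefill("2+2=4</think>4")
assert parsed == {"thinking": "2+2=4", "content": "4", "ok": True}
assert strip_thinking("<think>2+2=4</think>4") == "4"
```

Without the re-injection step, the sampled text would contain a closing tag with no opener, and the reasoning would leak into the parsed content.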

The test properties (all four are enforced in tinker-cookbook's renderer suite and are worth copying into any stack):

  1. build_generation_prompt matches tokenizer.apply_chat_template(..., add_generation_prompt=True, tokenize=True) token for token.
  2. The weight-0 prefix of a supervised example equals the generation prompt of the same context (weights are 0...01...1).
  3. parse_response(trainable_tokens) returns the original final message.
  4. The extension property holds where claimed: each turn's prompt-plus-completion is a prefix of the next turn's prompt.

Reference template for the parity test (needs transformers and tinker-cookbook; pin both):

from tinker_cookbook import renderers, model_info, tokenizer_utils

model = "Qwen/Qwen3-8B"
tokenizer = tokenizer_utils.get_tokenizer(model)
renderer = renderers.get_renderer(
    model_info.get_recommended_renderer_name(model), tokenizer)

messages = [{"role": "user", "content": "2+2?"}]
ours = renderer.build_generation_prompt(messages).to_ints()
hf = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, tokenize=True)
assert ours == hf  # any drift here is a training-vs-inference mismatch

HuggingFace-only stacks can get the supervised mask from templates that carry {% generation %} markers via apply_chat_template(..., return_assistant_tokens_mask=True); verify the installed template actually defines the markers before relying on it.
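
One way to run that verification is a plain string check on the template source before requesting the mask. The helper below is a hypothetical sketch, not a transformers API; it assumes the template is available as a single Jinja string (for example from `tokenizer.chat_template`).

```python
import re

# Jinja tags may carry whitespace-control dashes:
# {% generation %}, {%- generation -%}, {%- endgeneration %}, ...
_GEN_OPEN = re.compile(r"\{%-?\s*generation\s*-?%\}")
_GEN_CLOSE = re.compile(r"\{%-?\s*endgeneration\s*-?%\}")


def template_has_generation_markers(chat_template):
    """True when the Jinja source contains at least one balanced
    generation ... endgeneration pair."""
    if not chat_template:
        return False
    opens = len(_GEN_OPEN.findall(chat_template))
    closes = len(_GEN_CLOSE.findall(chat_template))
    return opens > 0 and opens == closes


with_markers = (
    "{% for m in messages %}<{{ m['role'] }}>"
    "{% generation %}{{ m['content'] }}{% endgeneration %}{% endfor %}"
)
without_markers = "{% for m in messages %}<{{ m['role'] }}>{{ m['content'] }}{% endfor %}"
assert template_has_generation_markers(with_markers)
assert not template_has_generation_markers(without_markers)
assert not template_has_generation_markers(None)
```

A template that lacks the markers is the case this guard exists for: failing loudly at data-preparation time is preferable to training on a mask that marks no assistant tokens.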

How to maintain it

  • Pin and test the tokenizer stack. Template behaviour rides on transformers versions; tinker-cookbook pins around known-bad releases (a 5.3.0 DeepSeek tokenizer decode bug, pre-5.0 VLM image-token miscounts). Re-run the four-property suite on every transformers or model upgrade.
  • Track template drift across model revisions. A new checkpoint of the same family can change thinking defaults or tool syntax (Qwen3 JSON tool calls became XML in Qwen3.5); the renderer name belongs in run metadata so a checkpoint can be re-served with the template it was trained with.
  • Keep one renderer registry. A factory keyed by model name (get_renderer(name, tokenizer) plus a custom-renderer registry) prevents ad-hoc per-script templates; renderers should be picklable so rollout workers can reconstruct them from (renderer_name, model_name).
  • Watch the boundary conditions. Tokenizing text in chunks can produce different tokens than tokenizing the joined string even when both decode identically; merge adjacent text parts before encoding. Multi-byte UTF-8 characters split across tokens need a buffering decoder.
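
Both boundary conditions can be reproduced with the standard library alone. In the sketch below the greedy longest-match tokenizer and its five-entry vocabulary are toy assumptions standing in for a real BPE tokenizer; the buffering decoder uses Python's `codecs` incremental UTF-8 decoder.

```python
import codecs

VOCAB = ["ab", "a", "b", "c", "bc"]  # hypothetical toy vocabulary


def toy_encode(text):
    """Greedy longest-match tokenizer: enough to show chunk-boundary effects."""
    out, i = [], 0
    while i < len(text):
        piece = max((v for v in VOCAB if text.startswith(v, i)), key=len)
        out.append(piece)
        i += len(piece)
    return out


# Chunked vs joined encoding: same text, different tokens.
parts = ["a", "bc"]
chunked = [t for p in parts for t in toy_encode(p)]
joined = toy_encode("".join(parts))
assert chunked == ["a", "bc"] and joined == ["ab", "c"]
assert "".join(chunked) == "".join(joined)


def stream_decode(byte_tokens):
    """Decode byte-level tokens incrementally, holding back incomplete characters."""
    dec = codecs.getincrementaldecoder("utf-8")()
    pieces = [dec.decode(tok) for tok in byte_tokens]
    pieces.append(dec.decode(b"", final=True))  # raises if the stream ends mid-character
    return pieces


euro = "€".encode("utf-8")  # three bytes, split here across two tokens
pieces = stream_decode([b"5", euro[:2], euro[2:]])
assert pieces == ["5", "", "€", ""]
```

The empty string in the middle of `pieces` is the buffering at work: the decoder emits nothing until the character's final byte arrives.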

How to run it in production

  • Same template at train and serve. OpenAI-compatible servers (vLLM, SGLang) apply the chat_template shipped in tokenizer_config.json server-side; a fine-tuned checkpoint must ship the template it was trained with, or clients silently reproduce the mismatch the training stack avoided. See also: serving open-weight models.
  • Exploit the extension property for agentic workloads. When each turn's prompt extends the previous one, a T-turn trajectory trains as one merged datum and samples with KV-cache reuse, so prefill cost grows as O(T) in the number of turns instead of O(T^2). Thinking-history stripping breaks this, which is a real throughput cost of reasoning templates in multi-turn RL. See also: agentic RL, KV cache management.
  • Gate rewards and evals on termination status. Truncated generations (malformed) should score 0 in strict-format RL and be reported separately in evals; benchmark scores can move by tens of points on truncation rates alone. See also: LLM benchmarks, evaluation integrity.
  • Monitor step-0 KL in RL. High sampler-vs-trainer KL on the very first step is the classic renderer-mismatch signature: the prompt tokens the trainer scores are not the ones the sampler saw. See also: GRPO, fine-tuning and post-training.
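
A minimal sketch of the step-0 KL check, assuming the sampler's per-token logprobs and the trainer's logprobs for the same sampled tokens are both available. The function name is hypothetical, and the estimator is the simple mean log-ratio over sampled tokens; other stacks may log a different KL estimator.

```python
import numpy as np


def sampler_trainer_kl(sampler_logprobs, trainer_logprobs):
    """Estimate KL(sampler || trainer) on sampled tokens and return it
    together with the per-token importance ratios."""
    s = np.asarray(sampler_logprobs, dtype=float)
    t = np.asarray(trainer_logprobs, dtype=float)
    ratios = np.exp(t - s)      # pi_trainer / pi_sampler, about 1 at step 0
    kl = float(np.mean(s - t))  # mean of log pi_sampler - log pi_trainer
    return kl, ratios


sampled = [-0.5, -1.2, -0.3]

# Same weights, same prompt tokens: logprobs agree up to numerical noise.
kl_ok, ratios_ok = sampler_trainer_kl(sampled, [-0.5001, -1.2001, -0.2999])
assert abs(kl_ok) < 1e-3 and np.allclose(ratios_ok, 1.0, atol=1e-3)

# Trainer scored the completion after a differently rendered prompt:
# the sampled tokens look far less likely and every ratio collapses.
kl_bad, ratios_bad = sampler_trainer_kl(sampled, [-2.5, -3.0, -1.9])
assert abs(kl_bad - 1.8) < 1e-9 and (ratios_bad < 0.25).all()
```

At step 0 the sampler and trainer share weights, so any sizeable gap points at the inputs rather than the policy, which is what makes this metric a useful renderer check.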

Failure modes

  • Raw tokenizer.encode() on a chat model: skips the template, produces out-of-distribution prompt tokens, inflates sampler/trainer KL and corrupts importance ratios; correct only for base-model completion training.
  • Loss on prompt tokens: an unmasked or mis-masked dataset trains the model to memorize questions; watch num_loss_tokens per example, and treat an all-zero mask as an error (nan), not a zero loss.
  • Stop token missing from the trainable output: the model never learns to stop and generations run to the length limit; conversely, sampling with the wrong stop list yields multi-stop responses that must hard-error, not parse.
  • BOS duplication: adding special tokens on every chunk encode instead of once per sequence; the double-BOS prompt is out-of-distribution and evades string-level inspection.
  • Wrong thinking variant on hybrid models: training a thinking model with the non-thinking renderer (or vice versa) silently degrades quality; pair each checkpoint with its renderer name.
  • all_assistant_messages without the extension property: earlier assistant turns train on a prefix that generation would never produce (history had thinking stripped); split the conversation into per-turn examples instead.
  • Tool-call format drift: parsing assumes one syntax while the model emits another family's; malformed tool calls should surface as unparsed-call records with the raw text, not as dropped messages.
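
Several of these failure modes are cheap to catch per datum before training. The validator below is a hypothetical sketch: the function name, the message strings, and the choice of checks are illustrative, and `bos_id` and `eot_id` must come from the actual tokenizer.

```python
def validate_datum(tokens, weights, bos_id, eot_id):
    """Return a list of problems; an empty list means these checks passed."""
    problems = []
    if len(tokens) != len(weights):
        return ["tokens and weights differ in length"]
    if len(tokens) >= 2 and tokens[0] == bos_id and tokens[1] == bos_id:
        problems.append("duplicated BOS at sequence start")
    if sum(weights) == 0:
        problems.append("no trainable tokens (check train_on_what)")
    elif not any(t == eot_id and w > 0 for t, w in zip(tokens, weights)):
        problems.append("no trainable end-of-turn token: model cannot learn to stop")
    return problems


BOS, EOT = 1, 2
assert validate_datum([BOS, 10, 11, EOT], [0, 0, 1, 1], BOS, EOT) == []
assert validate_datum([BOS, BOS, 10, EOT], [0, 0, 1, 1], BOS, EOT) == [
    "duplicated BOS at sequence start"
]
assert validate_datum([BOS, 10, 11, EOT], [0, 0, 0, 0], BOS, EOT) == [
    "no trainable tokens (check train_on_what)"
]
```

Checking token ids rather than decoded text matters for the BOS case, since a duplicated BOS can decode to a string that looks identical to the correct prompt.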

References

  • tinker-cookbook renderers: https://github.com/thinking-machines-lab/tinker-cookbook/tree/main/tinker_cookbook/renderers
  • Tinker docs, rendering tutorial: https://tinker-docs.thinkingmachines.ai/tutorials/core-concepts/rendering/
  • HuggingFace chat templating: https://huggingface.co/docs/transformers/main/en/chat_templating
  • OpenAI Harmony response format (gpt-oss): https://cookbook.openai.com/articles/openai-harmony
  • Tinker docs, quickstart (renderer usage in training loops): https://tinker-docs.thinkingmachines.ai/tinker/quickstart/

Related: Tinker · SFT & LoRA · Fine-tuning · GRPO · Agentic RL · Tools & Function Calling · LLM Benchmarks · Serving OSS Models · Glossary