Synthetic data generation
Scope: using LLMs to produce post-training data, covering teacher/distillation data, instruction synthesis (Self-Instruct, Evol-Instruct, Magpie), and AI feedback (RLAIF), plus the filtering, dedup, and decontamination that stop it from poisoning the model. This is the data engine behind SFT, preference optimization, and distillation, and often the highest-leverage part of the post-training stack.
Blocks that need vLLM or transformers are labelled reference templates: pin versions and validate before production use. Every core algorithm on this page also ships a runnable, asserted numpy or pure-python check you can run with python3.
What it is
Synthetic data is training examples generated by a model rather than written by humans. Three families dominate post-training:
- Instruction synthesis. Bootstrap SFT data from a model's own generations: Self-Instruct grows a seed set of instructions into a large dataset with light filtering;[1] Evol-Instruct (WizardLM) iteratively rewrites instructions into harder, more diverse ones along depth and breadth;[2] Magpie prompts an aligned model with only the chat-template prefix so that it auto-regressively emits both the instruction and the response, yielding alignment data "from nothing".[3]
- Distillation data. Generate outputs from a strong teacher and use them as SFT targets for a smaller student. This is the off-policy, sequence-level side of distillation, where the teacher's completions are the dataset.
- AI feedback (RLAIF). An LLM judge ranks or critiques responses to build preference pairs for DPO or a reward model, replacing human labelling of (chosen, rejected).
Frameworks: distilabel (pipeline-based synthetic data + AI feedback) and NVIDIA NeMo Curator. Much synthetic-data generation, though, is plain batch inference on an inference engine.
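The "light filtering" in Self-Instruct is mostly a novelty check: the paper describes adding a generated instruction to the pool only when its ROUGE-L similarity to every instruction already in the pool is below 0.7. The sketch below illustrates that idea in pure Python. It is not the paper's code: `rouge_l_f1` here is a simplified ROUGE-L F-measure on lowercased whitespace tokens, and `grow_pool` is a hypothetical helper name.

```python
# self_instruct_filter.py: sketch of a Self-Instruct-style novelty filter (illustrative only).

def lcs_len(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, 1):
            cur.append(prev[j - 1] + 1 if x == y else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]

def rouge_l_f1(a, b):
    """Simplified ROUGE-L F-measure on lowercased whitespace tokens."""
    ta, tb = a.lower().split(), b.lower().split()
    if not ta or not tb:
        return 0.0
    l = lcs_len(ta, tb)
    if l == 0:
        return 0.0
    prec, rec = l / len(ta), l / len(tb)
    return 2 * prec * rec / (prec + rec)

def grow_pool(pool, candidates, threshold=0.7):
    """Keep a candidate only if it is below `threshold` similarity to everything kept so far."""
    pool = list(pool)
    for c in candidates:
        if all(rouge_l_f1(c, p) < threshold for p in pool):
            pool.append(c)
    return pool

seeds = ["Write a haiku about autumn."]
cands = [
    "Write a haiku about autumn leaves.",   # near-duplicate of the seed
    "Explain how a binary search works.",   # novel
    "Explain how a binary search works.",   # exact repeat of the previous candidate
]
pool = grow_pool(seeds, cands)
print(pool)
```

Because each accepted candidate joins the pool immediately, later candidates are compared against earlier generations as well as the seeds, which is what keeps the set from filling up with paraphrases of itself.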
Why use it
- Cheap and scalable. Self-Instruct is "almost annotation-free" and improved GPT-3 by 33% (absolute) on Super-NaturalInstructions;[1] Magpie synthesized 4M instruction-response pairs, curated them to 300K, and models fine-tuned on that subset performed comparably to the official Llama-3-8B-Instruct on some tasks.[3]
- Targeted distribution. Generate exactly the domain, format, difficulty, and language you need, instead of hoping a scraped corpus contains it.
- Distillation transfer. Teacher data moves capability into a smaller, cheaper-to-serve student, the standard way to make small models punch above their weight.
- Frontier default. Modern post-training pipelines rely heavily on synthetic data, so the constraint has shifted from collecting data to filtering it.
When to use it (and when not)
- Use it to bootstrap SFT/preference data when human data is scarce or expensive, or to distill a stronger teacher into a smaller model.
- Keep humans for the hardest quality- and safety-critical slices; synthetic data amplifies the generator's blind spots and errors.
- Licensing caveat. Generating data from a proprietary model (GPT/Claude/Gemini) to train a competing model may violate that provider's terms of service. Check the terms before distilling closed models; open-weight teachers avoid the question.
- Avoid unfiltered recursion. Recursively training on unfiltered self-generated data causes model collapse: the distribution's tails disappear and quality degrades irreversibly.[4]
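The tail-loss mechanism behind collapse can be seen in a toy setting. The sketch below is an illustration under stated assumptions, not a reproduction of the cited paper's experiments: the "model" is a five-token categorical distribution, each generation is re-estimated from 50 samples of the previous one, and the mixing weight of 0.2 is an arbitrary choice. Once a rare token draws zero samples its probability is exactly zero and it can never come back, so the support can only shrink; mixing the real distribution back in keeps every token alive.

```python
# collapse_toy.py: toy illustration of tail loss under recursive self-training.
import numpy as np

def refit(p, n, rng, real=None, mix=0.0):
    """One generation: sample n tokens from p, re-estimate from counts, optionally mix in real data."""
    est = rng.multinomial(n, p) / n
    if real is not None:
        est = (1 - mix) * est + mix * real
    return est

rng = np.random.default_rng(0)
real = np.array([0.5, 0.3, 0.15, 0.04, 0.01])  # two rare "tail" tokens

# Pure recursion: each generation trains only on the previous generation's samples.
p, support = real.copy(), []
for _ in range(200):
    p = refit(p, n=50, rng=rng)
    support.append(int(np.count_nonzero(p)))

# Same loop, but 20% of every generation's distribution comes from real data.
q, support_mixed = real.copy(), []
for _ in range(200):
    q = refit(q, n=50, rng=rng, real=real, mix=0.2)
    support_mixed.append(int(np.count_nonzero(q)))

print("support without real data:", support[0], "->", support[-1])
print("support with real data:   ", support_mixed[0], "->", support_mixed[-1])
```

The toy captures only the sampling-error part of the story, but it shows why the loss is one-way: no amount of further self-training reintroduces a token whose estimated probability has hit zero.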
Architecture
flowchart LR
SEED["Seed prompts / personas"] --> GEN["Generator (teacher LLM, vLLM batch)"]
GEN --> CAND["Candidate examples"]
CAND --> FILT{"Filter"}
FILT -->|"dedup (MinHash)"| KEEP["Curated dataset"]
FILT -->|"quality (LLM-judge / reward model)"| KEEP
FILT -->|"decontaminate vs eval benchmarks"| KEEP
KEEP --> TRAIN["SFT / DPO / distillation"]
TRAIN -->|"deploy → collect → regenerate"| SEED
The pipeline is a straight line with one loop: a generator turns seeds into candidates, a filter stage (dedup, quality, decontamination) keeps only the good ones, and training consumes the curated set. The back edge from deploy to seed is the data flywheel that keeps the corpus fresh.
How to use it
Most synthetic-data generation is offline batch inference: run many prompts through a teacher and keep the outputs. vLLM's offline engine is the workhorse (inference serving):
# gen_distill_data.py: batch-generate teacher completions as SFT targets (reference template).
from vllm import LLM, SamplingParams

# Placeholders for illustration only: supply your own prompt template and seed instructions.
tmpl = "Question: {q}\nAnswer:"
seed_questions = ["What is a MinHash signature?", "Explain top-p sampling in one paragraph."]

teacher = LLM(model="Qwen/Qwen3-32B", tensor_parallel_size=4)  # strong teacher
sampling = SamplingParams(temperature=0.7, top_p=0.95, max_tokens=2048)
prompts = [tmpl.format(q=q) for q in seed_questions]  # your seed instructions
outputs = teacher.generate(prompts, sampling)
sft_rows = [{"prompt": p, "completion": o.outputs[0].text}  # teacher output = SFT target
            for p, o in zip(prompts, outputs)]
For a fixed prompt, the teacher's output diversity comes from that temperature + top_p pair: temperature reshapes the next-token distribution, and top-p (nucleus) truncates it to the smallest set of tokens whose cumulative mass reaches p. Here is that core sampling math in numpy, checked against a slow reference and at its boundaries:
# nucleus.py: the core math behind teacher sampling -- temperature + top-p (nucleus).
import numpy as np

def softmax_T(logits, T):
    """Temperature-scaled softmax. Small T sharpens toward argmax, large T flattens."""
    z = np.asarray(logits, float) / T
    z -= z.max()
    e = np.exp(z)
    return e / e.sum()

def top_p_filter(probs, p):
    """Nucleus: keep the smallest set of top-prob tokens with cumulative mass >= p, renormalize."""
    order = np.argsort(probs)[::-1]
    cum = np.cumsum(probs[order])
    k = min(int(np.searchsorted(cum, p) + 1), len(probs))
    out = np.zeros_like(probs)
    out[order[:k]] = probs[order[:k]]
    return out / out.sum()

def entropy(p):
    q = p[p > 0]
    return float(-(q * np.log(q)).sum())

logits = np.array([2.0, 1.0, 0.1, -1.0, -2.0])

# 1. Both stages return valid distributions (sum to 1, non-negative).
p = softmax_T(logits, 1.0)
assert np.isclose(p.sum(), 1.0) and (p >= 0).all()
assert np.isclose(top_p_filter(p, 0.9).sum(), 1.0)

# 2. Temperature monotonicity: hotter sampling has higher entropy; T->0 collapses to argmax.
assert entropy(softmax_T(logits, 5.0)) > entropy(softmax_T(logits, 0.5))
cold = softmax_T(logits, 0.01)
assert cold.argmax() == logits.argmax() and cold.max() > 0.99

# 3. top-p boundaries: p=1.0 keeps the full support, tiny p collapses to the single argmax.
assert np.count_nonzero(top_p_filter(p, 1.0)) == len(p)
tiny = top_p_filter(p, 1e-9)
assert np.count_nonzero(tiny) == 1 and tiny.argmax() == p.argmax()

# 4. Equivalence to a slow reference across random distributions and thresholds.
def top_p_slow(probs, p):
    idx = sorted(range(len(probs)), key=lambda i: -probs[i])
    kept, s = [], 0.0
    for i in idx:
        kept.append(i)
        s += probs[i]
        if s >= p:
            break
    out = np.zeros_like(probs)
    for i in kept:
        out[i] = probs[i]
    return out / out.sum()

rng = np.random.default_rng(0)
for _ in range(1000):
    r = rng.random(rng.integers(2, 12))
    r /= r.sum()
    pv = float(rng.uniform(0.05, 1.0))
    assert np.allclose(top_p_filter(r, pv), top_p_slow(r, pv))

print("nucleus sampling: all checks passed")
The Magpie trick needs no seed prompts at all: feed the model only the chat-template tokens up to where the user turn begins, and the aligned model generates a plausible user instruction, then its own answer. The exact special tokens are model-specific. Read them off the tokenizer's chat template rather than hard-coding:
# Magpie-style self-synthesis: let the aligned model write the instruction itself (reference template).
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")
# Render the template with a placeholder user turn, then cut just after the user-turn header
# so generation continues *as the user*. Inspect tok.chat_template for the exact tokens.
rendered = tok.apply_chat_template(
    [{"role": "user", "content": "PLACEHOLDER"}], tokenize=False, add_generation_prompt=False)
user_hdr = "<|start_header_id|>user<|end_header_id|>\n\n"  # Llama-3 header; model-specific
prefix = rendered[: rendered.index(user_hdr) + len(user_hdr)]
# feed `prefix` (ending at the user header) to the engine; sample the instruction, then the response.
The subtle part is the cut: the prefix must end exactly at the user header and leak no assistant turn, or the model answers instead of inventing an instruction. That truncation logic is pure string handling, so it is testable on its own:
# magpie.py: the core logic behind Magpie -- cut the chat template at the user-turn header
# so the aligned model writes the *user* instruction, then answers it itself.
def magpie_prefix(template):
    """Return the template up to and including the first user header. Fail fast if absent."""
    user_hdr = "<|start_header_id|>user<|end_header_id|>\n\n"
    i = template.index(user_hdr)  # raises ValueError if no user turn
    return template[: i + len(user_hdr)]

FULL = (
    "<|begin_of_text|>"
    "<|start_header_id|>system<|end_header_id|>\n\nYou are helpful.<|eot_id|>"
    "<|start_header_id|>user<|end_header_id|>\n\nWhat is 2+2?<|eot_id|>"
    "<|start_header_id|>assistant<|end_header_id|>\n\n4<|eot_id|>"
)
pre = magpie_prefix(FULL)

# 1. The prefix ends exactly at the user header, primed for the model to invent an instruction.
assert pre.endswith("<|start_header_id|>user<|end_header_id|>\n\n")

# 2. Adversarial leakage: it must NOT carry user content or the assistant turn,
#    otherwise the model would just answer instead of generating a fresh instruction.
assert "What is 2+2?" not in pre
assert "assistant" not in pre
assert "\n\n4" not in pre

# 3. The system persona is preserved, so it still conditions the synthesized instruction.
assert "You are helpful." in pre

# 4. A template with no user turn is a hard error, never a silent bad prefix.
try:
    magpie_prefix("<|begin_of_text|>no user turn here")
    raise AssertionError("expected ValueError on a template without a user header")
except ValueError:
    pass

print("magpie prefix: all checks passed")
How to integrate with it
The generator is the easy half; filtering is where quality is won or lost. A production pipeline wires the teacher into a curation chain, and only its output reaches training:
- Diversity. Vary seeds, personas, and temperature; Evol-Instruct explicitly evolves for breadth so the set does not collapse onto a few templates.[2]
- Dedup. Near-duplicate removal (MinHash/LSH) stops the model over-weighting repeated patterns (data curation).
- Quality filtering. Score candidates with a reward model or an LLM judge and keep the top slice; drop malformed, truncated, or refused generations. In practice: sample N candidates per prompt, score each against a versioned judge rubric, keep the best.
- Decontamination. Remove any synthetic example that overlaps your evaluation benchmarks (n-gram/substring match); synthetic pipelines are a classic contamination source that silently inflates scores (evaluation integrity).
- AI feedback. For preference data, reuse that same judge to rank the sampled responses into (chosen, rejected) pairs for DPO; keep the judge's rubric versioned.
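The quality-filtering and AI-feedback bullets share one loop: score N sampled responses per prompt, keep the best as an SFT target, and optionally turn the best and worst into a preference pair. The sketch below shows that loop under loud assumptions: `toy_judge` is a hypothetical stand-in for an LLM judge or reward model (its length heuristic is not a real quality signal), and `min_margin` is an illustrative threshold for skipping pairs the judge cannot clearly separate.

```python
# best_of_n.py: sketch of best-of-N selection plus preference-pair construction.

def toy_judge(prompt, response):
    """Hypothetical judge stub returning a score in [0, 1]. Replace with a real judge or reward model."""
    if not response.strip() or response.lower().startswith("i can't"):
        return 0.0  # empty or refused generations score zero
    return min(len(response.split()) / 20, 1.0)  # placeholder heuristic only

def best_and_pair(prompt, responses, judge, min_margin=0.1):
    """Return (best response, preference pair or None) for one prompt."""
    scored = sorted(((judge(prompt, r), r) for r in responses),
                    key=lambda x: x[0], reverse=True)
    (s_hi, best), (s_lo, worst) = scored[0], scored[-1]
    pair = None
    if s_hi - s_lo >= min_margin:  # only emit pairs the judge clearly separates
        pair = {"prompt": prompt, "chosen": best, "rejected": worst}
    return best, pair

prompt = "What is the capital of France?"
responses = [
    "I can't help with that.",
    "Paris.",
    "The capital of France is Paris, on the Seine.",
]
best, pair = best_and_pair(prompt, responses, toy_judge)
print(best)
print(pair)
```

Whether refusals should appear as `rejected` examples or be dropped before pairing is a design choice this sketch does not settle; it only shows where the judge score enters.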
An end-to-end distillation SFT set is just those stages in order: batch-generate teacher completions (above), filter to well-formed and non-refused answers, dedup, decontaminate, then hand the result to SFT.
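For the dedup stage, a minimal MinHash sketch may help make the idea concrete. It is a teaching version under stated assumptions: word 3-gram shingles, 64 seeded BLAKE2b hashes standing in for random permutations, a 0.8 similarity threshold, and a pairwise comparison against every kept row. A production pipeline would replace the pairwise loop with LSH banding so it does not scale quadratically. The names `shingles`, `minhash`, `est_jaccard`, and `dedup` are hypothetical helpers, not a library API.

```python
# minhash_dedup.py: sketch of near-duplicate removal with MinHash signatures (no LSH).
import hashlib

def shingles(text, n=3):
    """Set of lowercased word n-grams; short texts yield a single shingle."""
    toks = text.lower().split()
    return {" ".join(toks[i:i + n]) for i in range(max(len(toks) - n + 1, 1))}

def _h(seed, s):
    """Deterministic 64-bit hash of a shingle under a given seed."""
    return int.from_bytes(hashlib.blake2b(f"{seed}:{s}".encode(), digest_size=8).digest(), "big")

def minhash(text, num_perm=64, n=3):
    """Signature: for each seeded hash function, the minimum hash over all shingles."""
    sh = shingles(text, n)
    return [min(_h(seed, s) for s in sh) for seed in range(num_perm)]

def est_jaccard(sig_a, sig_b):
    """Fraction of matching signature slots, an estimate of shingle-set Jaccard similarity."""
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)

def dedup(texts, threshold=0.8):
    """Keep a text only if its estimated similarity to every kept text is below threshold."""
    kept, sigs = [], []
    for t in texts:
        sig = minhash(t)
        if all(est_jaccard(sig, s) < threshold for s in sigs):
            kept.append(t)
            sigs.append(sig)
    return kept

rows = [
    "Explain how binary search works.",
    "explain  how BINARY search works.",   # same tokens after normalization
    "Write a haiku about autumn rain.",
]
print(dedup(rows))
print(est_jaccard(minhash("the quick brown fox jumps over the lazy dog"),
                  minhash("the quick brown fox leaps over the lazy dog")))
```

The last line prints an estimate for a one-word edit; the value depends on the hash family and the number of permutations, which is why the threshold should be tuned on real samples instead of being copied from a sketch.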
Decontamination is the stage that most often ships broken, so it is worth a runnable, self-checked reference. This drops any row that shares a 13-gram (the lm-eval convention) with a held-out eval example:
# decontaminate.py: drop synthetic rows that share a 13-gram with any eval example.
import random

def ngrams(text, n=13):
    """Set of n-gram windows (space-joined). Empty when the text has fewer than n tokens."""
    toks = text.split()
    return {" ".join(toks[i:i + n]) for i in range(len(toks) - n + 1)}

def decontaminate(rows, eval_texts, n=13):
    """Keep only rows whose prompt+completion shares no n-gram with any eval text."""
    banned = set().union(*(ngrams(t, n) for t in eval_texts)) if eval_texts else set()
    return [r for r in rows if not (ngrams(r["prompt"] + " " + r["completion"], n) & banned)]

words = "alpha beta gamma delta epsilon zeta eta theta iota kappa lambda mu nu".split()
assert len(words) == 13  # exactly n tokens
ev = [" ".join(words)]

# 1. Happy path: a row echoing a 13-gram from the eval set is dropped.
dirty = {"prompt": "continue:", "completion": " ".join(words)}
clean = {"prompt": "unrelated", "completion": "a totally different sentence with zero overlap here"}
assert decontaminate([dirty, clean], ev) == [clean]

# 2. Boundary: 13 shared tokens is caught, 12 is not (no 13-gram can form).
row12 = {"prompt": "", "completion": " ".join(words[:12])}
assert decontaminate([{"prompt": "", "completion": " ".join(words)}], ev) == []
assert decontaminate([row12], ev) == [row12]

# 3. Whitespace corruption must not evade the filter (.split collapses runs).
noisy = {"prompt": "", "completion": "  ".join(words) + "\n\t"}
assert decontaminate([noisy], ev) == []

# 4. Empty eval set bans nothing, so every row survives.
assert decontaminate([dirty, clean], []) == [dirty, clean]

# 5. Eval text shorter than n cannot ban anything (no false positives).
assert decontaminate([dirty], ["too short to ban"]) == [dirty]

# 6. Equivalence to a slow brute-force reference, with a planted contamination hit.
def slow(rows, eval_texts, n=13):
    wins = set()
    for t in eval_texts:
        tk = t.split()
        wins |= {tuple(tk[i:i + n]) for i in range(len(tk) - n + 1)}
    out = []
    for r in rows:
        tk = (r["prompt"] + " " + r["completion"]).split()
        rw = {tuple(tk[i:i + n]) for i in range(len(tk) - n + 1)}
        if not (rw & wins):
            out.append(r)
    return out

random.seed(0)
vocab = list("abcdefgh")
rows = [{"prompt": "", "completion": " ".join(random.choice(vocab) for _ in range(20))} for _ in range(40)]
rand_eval = [" ".join(random.choice(vocab) for _ in range(20)) for _ in range(4)]
rand_eval.append(rows[7]["completion"])  # plant a guaranteed hit
fast_out = decontaminate(rows, rand_eval)
assert fast_out == slow(rows, rand_eval)  # fast == slow reference
assert rows[7] not in fast_out  # planted contamination removed

print("decontaminate: all checks passed")
How to run it in production
Production generation is a batch job on the cluster, not an interactive script, and a few disciplines make its output trustworthy:
- Reproducibility. Pin and record the generator model, its sampling params, the seed set, the judge rubric, the filter thresholds, and the exact list of benchmarks decontaminated against. A synthetic dataset is only as auditable as the config that produced it.
- Idempotent, resumable runs. Checkpoint outputs so a preempted job resumes instead of regenerating from scratch, and keep a manifest of every candidate with why it was kept or dropped.
- Gate before training. Decontamination and quality filtering are release gates, not optional passes: data that fails decontamination never reaches SFT or DPO.
- Collapse guard. Mix real data in and never train recursively on unfiltered self-generated data.[4]
- The data flywheel. Deploy the model, collect real prompts, regenerate and re-filter against them, retrain. This is a standing pipeline (SRE/MLOps practices), not a one-shot dataset.
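The reproducibility and resumability bullets can be sketched together: hash the full run config into a short fingerprint that is stamped on the output file, and make the generation loop skip prompts already written. Everything below is a hypothetical layout, not a prescribed one: the config keys, the `candidates-<fingerprint>.jsonl` naming, and `fake_generate` (a stand-in for the teacher call) are illustrative assumptions, and the reader assumes each JSON line was written whole.

```python
# resumable_run.py: sketch of a config fingerprint plus a resumable JSONL writer.
import hashlib
import json
import os
import tempfile

def fingerprint(config):
    """Short, key-order-independent hash of the run config."""
    blob = json.dumps(config, sort_keys=True).encode()
    return hashlib.sha256(blob).hexdigest()[:16]

def run_resumable(prompts, generate, path):
    """Append one JSON row per new prompt; return how many generations actually ran."""
    done = set()
    if os.path.exists(path):
        with open(path) as f:
            done = {json.loads(line)["prompt"] for line in f if line.strip()}
    calls = 0
    with open(path, "a") as f:
        for p in prompts:
            if p in done:
                continue  # already generated in an earlier (possibly preempted) run
            f.write(json.dumps({"prompt": p, "completion": generate(p)}) + "\n")
            f.flush()
            done.add(p)
            calls += 1
    return calls

# Illustrative values only; record whatever your pipeline actually uses.
config = {
    "teacher": "Qwen/Qwen3-32B",
    "sampling": {"temperature": 0.7, "top_p": 0.95, "max_tokens": 2048},
    "judge_rubric": "v1",
    "decontaminated_against": ["benchmark_a", "benchmark_b"],
}
run_id = fingerprint(config)

prompts = ["p1", "p2", "p3"]
fake_generate = lambda p: p.upper()  # stand-in for the teacher call

with tempfile.TemporaryDirectory() as d:
    path = os.path.join(d, f"candidates-{run_id}.jsonl")
    first = run_resumable(prompts[:2], fake_generate, path)   # job "preempted" after two prompts
    second = run_resumable(prompts, fake_generate, path)      # resume: only p3 is generated
    third = run_resumable(prompts, fake_generate, path)       # rerun: nothing left to do
print(run_id, first, second, third)
```

Tying the file name to the config fingerprint means a changed sampling parameter or rubric version starts a fresh file instead of silently appending to an old one.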
How to maintain it
A synthetic corpus is a living artifact, not a frozen file:
- Re-decontaminate on every benchmark change. A set that was clean last quarter is contaminated the moment you adopt a new eval whose text it happens to contain; re-run the n-gram check against the current benchmark list (evaluation integrity).
- Version the judge rubric. When you change how quality is scored, re-score or fork the dataset so old and new labels do not silently mix.
- Track the teacher version. Regenerating with a new teacher shifts the distribution; treat it as a new dataset, not a patch.
- Audit by hand. Sample generations periodically for refusals, truncations, style tics, and the teacher blind spots the student would otherwise inherit.
- Watch diversity run over run. Track the dedup ratio and n-gram entropy; a falling trend is an early collapse warning, visible before quality drops.
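The diversity signals named in the last bullet are cheap to compute. The sketch below shows one possible pair of metrics, with the simplifications labelled: `exact_dup_ratio` counts only exact duplicates after whitespace and case normalization (a near-duplicate ratio would come from the MinHash stage), and `ngram_entropy` is the Shannon entropy, in bits, of the pooled bigram distribution. Both function names are hypothetical.

```python
# diversity_metrics.py: sketch of two run-over-run diversity signals.
import math
from collections import Counter

def ngram_entropy(texts, n=2):
    """Shannon entropy (bits) of the pooled word n-gram distribution; higher means more varied."""
    counts = Counter()
    for t in texts:
        toks = t.lower().split()
        counts.update(tuple(toks[i:i + n]) for i in range(len(toks) - n + 1))
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -sum(c / total * math.log2(c / total) for c in counts.values())

def exact_dup_ratio(texts):
    """Fraction of rows that are exact repeats after normalizing case and whitespace."""
    norm = [" ".join(t.lower().split()) for t in texts]
    return 1 - len(set(norm)) / len(norm) if norm else 0.0

diverse = ["a b c d", "e f g h"]        # six distinct bigrams
repetitive = ["a b a b", "a b a b"]     # two bigram types, one row repeated

for name, run in [("diverse", diverse), ("repetitive", repetitive)]:
    print(name, round(ngram_entropy(run), 3), exact_dup_ratio(run))
```

Absolute values depend on corpus size and tokenization, so the useful comparison is the same metric on the same pipeline across successive runs.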
How to scale it
Generation is a batch-inference workload with the same economics as RL rollouts: throughput-bound on the teacher. Scale with vLLM offline batching across a pool of GPUs, size the batch to saturate the engine, and treat the run like any large inference job (inference serving, continuous batching). Filtering scales the same way: dedup (MinHash/LSH) and n-gram decontamination are embarrassingly parallel over shards, so they keep pace with a growing generator pool instead of becoming the bottleneck.
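A per-row filter against a fixed banned set parallelizes trivially because no row depends on any other. The sketch below illustrates the shard-filter-concatenate pattern with a thread pool purely for demonstration; on a real cluster each shard would more likely be a separate process or job. `chunks` and `parallel_filter` are hypothetical helper names, and `keep` stands for any row-level predicate such as the n-gram decontamination check.

```python
# sharded_filter.py: sketch of shard-parallel, order-preserving row filtering.
from concurrent.futures import ThreadPoolExecutor

def chunks(rows, num_shards):
    """Split rows into contiguous shards so concatenating results preserves order."""
    size = max(1, -(-len(rows) // num_shards))  # ceiling division
    return [rows[i:i + size] for i in range(0, len(rows), size)]

def parallel_filter(rows, keep, num_shards=4):
    """Apply a row-level predicate shard by shard and stitch the survivors back together."""
    with ThreadPoolExecutor(max_workers=num_shards) as pool:
        parts = list(pool.map(lambda shard: [r for r in shard if keep(r)],
                              chunks(rows, num_shards)))
    return [r for part in parts for r in part]

banned = {"b", "d"}
rows = list("abcdefg")
kept = parallel_filter(rows, lambda r: r not in banned, num_shards=3)
print(kept)
```

The sharded result equals the single-pass result by construction, which is the property that lets filtering keep pace as the generator pool grows.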
Failure modes
- Model collapse. Recursively training on unfiltered self-generated data erases distribution tails and degrades the model irreversibly; always mix in real data and filter hard.[4]
- Benchmark contamination. Synthetic data that overlaps eval sets inflates scores and hides regressions; decontaminate every synthetic corpus (evaluation integrity).
- Amplified generator bias. The student inherits the teacher's errors, style tics, and blind spots; audit samples by hand.
- Licensing/ToS violation. Distilling proprietary model outputs to train a competitor can breach terms; use open-weight teachers or confirm the license.
- Low diversity. A generator run at low temperature with few seeds produces a narrow, repetitive set; evolve for breadth and dedup.
- Skipping filtering. Raw generations contain refusals, truncations, and wrong answers; unfiltered synthetic data is garbage-in.
References
- Self-Instruct (bootstrap instruction data): https://arxiv.org/abs/2212.10560
- WizardLM / Evol-Instruct (evolve instructions): https://arxiv.org/abs/2304.12244
- Magpie (alignment data from the chat prefix): https://arxiv.org/abs/2406.08464
- The Curse of Recursion / model collapse: https://arxiv.org/abs/2305.17493
- distilabel (synthetic-data + AI-feedback pipelines): https://github.com/argilla-io/distilabel
- vLLM (offline batch generation): https://docs.vllm.ai/en/latest/
Related: SFT/LoRA · On-policy distillation · DPO · Reward model training · Fine-tuning and post-training · Training-data curation · Evaluation integrity · Inference serving · SRE/MLOps practices · Glossary
[1] Wang et al., Self-Instruct: bootstrap a large instruction-tuning set from a model's own generations with light filtering; near annotation-free, 33% absolute gain over the original GPT-3 on Super-NaturalInstructions. https://arxiv.org/abs/2212.10560
[2] Xu et al., WizardLM / Evol-Instruct: an LLM iteratively rewrites instructions into more complex and diverse ones (in-depth and in-breadth evolution); evolved instructions beat human-written ones. https://arxiv.org/abs/2304.12244
[3] Xu et al., Magpie: prompting an aligned model with only the chat-template prefix makes it emit an instruction and then a response; models fine-tuned on 300K pairs curated from 4M performed comparably to the official Llama-3-8B-Instruct on some tasks. https://arxiv.org/abs/2406.08464
[4] Shumailov et al., The Curse of Recursion: training on model-generated content causes irreversible model collapse as the tails of the distribution disappear. https://arxiv.org/abs/2305.17493