On-policy distillation¶

Scope: the post-training method where a student generates its own on-policy rollouts and a teacher grades them per token (typically with reverse KL), combining the dense per-token signal of distillation with the on-policy relevance of RL. It is the cheaper, denser-reward sibling of RLVR (reinforcement learning with verifiable rewards) whenever a stronger teacher exists: a distillation stage in fine-tuning and post-training, built on the on-policy idea from imitation learning.

The framework code below is written against real APIs (TRL). Treat every trl.experimental.gkd snippet as a reference template: pin the version, verify the import path on your install, and validate before production use. The numpy blocks (labelled "core math, runnable") are self-contained and executable, and pin down the divergence identities the trainer applies.

What it is¶

Standard knowledge distillation is off-policy: the student is trained to match a teacher on the teacher's own outputs (or a fixed dataset). That creates a train/inference distribution mismatch: at inference the student conditions on its own prefixes, drifts into states the teacher never demonstrated, and errors compound over the sequence (exposure bias).

On-policy distillation fixes this by sampling sequences from the student and having the teacher score those student-generated tokens, so supervision lands exactly on the states the student actually visits. The canonical method is GKD (Generalized Knowledge Distillation, Agarwal et al., ICLR 2024), which "trains the student on its self-generated output sequences by leveraging feedback from the teacher".¹ A lambda parameter interpolates from off-policy (lambda=0, teacher data) to fully on-policy (lambda=1, student rollouts), and the divergence is a generalized Jensen-Shannon that spans forward KL through reverse KL.¹ MiniLLM independently showed reverse KL (mode-seeking) beats forward KL for generative LMs and derived the on-policy objective.² Thinking Machines Lab's 2025 write-up popularized the practical framing: "sample trajectories from the student model and use a high-performing teacher to grade each token" via the teacher's log-probabilities, a reverse-KL per-token reward.³ The on-policy principle traces to DAgger in imitation learning: to avoid compounding error you must train under "the distribution of observations [the policy] induces".⁴
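
The compounding-error argument can be made concrete with a deliberately crude toy model. Everything in it is an assumption made for illustration, not something taken from the GKD or DAgger papers: the student errs with probability eps per step while it is on states seen in training; an off-policy student that has erred once is off the teacher's distribution and, in this toy, errs at every later step; an on-policy student has been supervised on its own states and keeps the per-step rate eps. The function names are hypothetical.

```python
# compounding_error_toy.py: illustrative toy model (stdlib only), not a result from the cited papers.

def expected_mistakes_off_policy(eps, horizon):
    # Toy assumption: after the first mistake the student never recovers.
    # P(student is erring at step t) = P(at least one mistake in steps 1..t) = 1 - (1 - eps)^t.
    return sum(1.0 - (1.0 - eps) ** t for t in range(1, horizon + 1))

def expected_mistakes_on_policy(eps, horizon):
    # Toy assumption: supervision on the student's own states keeps the error rate at eps.
    return eps * horizon

eps = 0.01
for horizon in (10, 100, 1000):
    off = expected_mistakes_off_policy(eps, horizon)
    on = expected_mistakes_on_policy(eps, horizon)
    print(f"T={horizon:5d}  off-policy={off:8.2f}  on-policy={on:6.2f}")
```

Under these assumptions the off-policy cost grows faster than linearly in the sequence length while the on-policy cost stays linear, which is the qualitative gap that training on the student's own rollouts is meant to close.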

Terminology (read this). The umbrella term is on-policy distillation (OPD). "On-policy self-distillation (OPSD)" is a narrower, recent (2026) label for the self-teacher case, where the model teaches itself (a larger sibling in the same family, an earlier/EMA checkpoint, or a privileged-context copy) rather than a separate teacher.⁷ This page covers the general method and the self-distillation variant; OPSD is not an industry-standard name for the general technique.

Why use it¶

Dense reward. Distillation gives per-token feedback (O(N) bits of signal per N-token episode) versus RL's single scalar per sequence.³ Every token is supervised, so learning is far more sample-efficient than learning from a sparse RLVR reward.
On-policy, so no exposure bias. Unlike off-policy KD, it supervises the student's own trajectories, correcting the mistakes the student actually makes and removing the compounding train/inference gap.¹²
Cheap versus RL. Thinking Machines report reaching teacher performance in roughly 7 to 10x fewer gradient steps than RL, on the order of 50 to 100x less compute, and at a 9 to 30x lower cost than off-policy distillation.³ Qwen3's strong-to-weak distillation "significantly outperforms reinforcement learning in performance and training efficiency", requiring only about 1/10 of the GPU hours of its multi-stage RL pipeline.⁵
Strong results. On AIME'24, Thinking Machines report on-policy distillation at 74.4% versus 67.6% for RL and 55.0% for off-policy distillation from the same start.³

When to use it (and when not)¶

Use on-policy distillation when a stronger teacher exists and you want to transfer its capability cheaply: compressing a large model into a small one (strong-to-weak), or recovering behaviour after domain fine-tuning (continual learning, personalization).³
Prefer RLVR when no teacher is better than the student. You cannot distill capability you do not have, and RL can push past any existing model against a ground-truth reward. Distillation is bounded by the teacher (below).
Prefer SFT when you only need to cold-start the output format cheaply from demonstrations and do not need the teacher's full distribution.
Combine them. A common 2026 recipe is SFT → on-policy distillation (cheap capability transfer) → RLVR (push past the teacher on verifiable tasks).
Avoid when teacher and student use incompatible tokenizers (per-token logit alignment breaks) or when the teacher cannot meaningfully score the student's out-of-distribution tokens.

Architecture¶

The loop has four moving parts: the student (being trained, holds the generation path), the teacher (frozen, resident or served, scores tokens), the per-token divergence (reverse KL by default), and the on-policy fraction lmbda that decides how much of each batch is student-generated. The teacher runs inference in the loop but takes no gradients.

flowchart LR
  P["Prompt"] --> GEN["Student samples rollout (on-policy)"]
  GEN --> TOK["Student tokens y_t"]
  TOK --> TEACH["Teacher scores each token<br/>(teacher logprobs, frozen)"]
  TEACH --> LOSS["Per-token reverse-KL loss<br/>KL(student || teacher)"]
  LOSS -->|"gradient (student only)"| GEN
  OFF["Off-policy KD: train on the TEACHER's outputs"] -.->|"distribution mismatch / exposure bias"| GEN
  subgraph SERVE["Systems shape"]
    TEACHSRV["Teacher (sharded / served pool)"] --- TEACH
    STUD["Student on FSDP"] --- GEN
  end
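
The loop in the diagram can be sketched end to end on a toy problem. This is a minimal sketch under stated assumptions, not TRL's implementation: the "models" are bigram tables (the state is just the previous token), the teacher is a fixed random table, the student is updated by plain gradient descent on its own logits, and all names (distill_on_policy, mean_reverse_kl) are hypothetical. It uses the closed-form gradient of KL(student || teacher) with respect to the student logits, s_j * ((log s_j - log t_j) - KL).

```python
# on_policy_loop_toy.py: illustrative sketch (numpy only), not TRL's implementation.
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def mean_reverse_kl(student_logits, teacher_probs):
    # Average KL(student || teacher) over all states (rows).
    s = softmax(student_logits)
    return float(np.mean(np.sum(s * (np.log(s) - np.log(teacher_probs)), axis=-1)))

def distill_on_policy(steps=300, horizon=16, vocab=4, lr=0.5, seed=0):
    rng = np.random.default_rng(seed)
    teacher_probs = softmax(rng.normal(size=(vocab, vocab)))  # frozen teacher, never updated
    student_logits = rng.normal(size=(vocab, vocab))          # trainable student
    kl_before = mean_reverse_kl(student_logits, teacher_probs)
    for _ in range(steps):
        state = 0                                   # the "prompt": a fixed start token
        for _ in range(horizon):
            s = softmax(student_logits[state])
            log_ratio = np.log(s) - np.log(teacher_probs[state])  # teacher scores this state
            kl = np.sum(s * log_ratio)
            grad = s * (log_ratio - kl)             # d KL(s || t) / d student logits
            next_state = rng.choice(vocab, p=s)     # the student samples its own next token
            student_logits[state] -= lr * grad      # gradient reaches the student only
            state = next_state                      # supervision follows the student's states
    return kl_before, mean_reverse_kl(student_logits, teacher_probs)

kl_before, kl_after = distill_on_policy()
print(f"mean reverse KL before={kl_before:.4f} after={kl_after:.6f}")
```

The point of the sketch is the data flow: which rows of the student get updated is decided by where the student's own samples take it, while the teacher only ever supplies probabilities.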

Core math (runnable): the per-token reverse-KL reward¶

The signal is a per-token reverse KL, KL(student_t || teacher_t), summed over the vocabulary at each student token. This numpy-only block computes it, and asserts the properties the loss depends on: non-negativity (Gibbs), zero exactly when the student matches the teacher (the objective's optimum), asymmetry versus forward KL (adversarial: if they were equal the whole forward/reverse debate would be vacuous), a blow-up when the teacher assigns near-zero mass where the student has mass (the tokenizer/out-of-distribution failure mode), and the dense O(N) signal count. Run: python3 reverse_kl_reward.py.

# reverse_kl_reward.py — core math, runnable (numpy only).
# The teacher grades each student token t by KL(student_t || teacher_t).
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def reverse_kl(student, teacher):        # KL(student || teacher), per token (per row)
    return np.sum(student * (np.log(student) - np.log(teacher)), axis=-1)

def forward_kl(student, teacher):        # KL(teacher || student), the off-policy direction
    return np.sum(teacher * (np.log(teacher) - np.log(student)), axis=-1)

rng = np.random.default_rng(0)
T, V = 6, 50                              # 6 student tokens, vocab 50
s = softmax(rng.normal(size=(T, V)))
t = softmax(rng.normal(size=(T, V)))
per_token = reverse_kl(s, t)             # the dense per-token teacher grade

# 1) Non-negativity (Gibbs' inequality): every per-token KL >= 0.
assert np.all(per_token >= -1e-12), per_token.min()
# 2) Edge / equivalence: zero iff distributions match. The optimum of the objective.
assert np.allclose(reverse_kl(s, s), 0.0, atol=1e-12), "KL(p||p) must be 0"
# 3) Adversarial asymmetry: reverse KL != forward KL in general (mode-seeking vs covering).
assert not np.allclose(reverse_kl(s, t), forward_kl(s, t)), "KL must be asymmetric"
# 4) Edge boundary: a teacher with mass -> 0 where the student has mass blows the reward
#    up. This is the tokenizer-mismatch / OOD "teacher can't score" failure mode.
t_bad = t.copy(); t_bad[0] = 1e-9; t_bad[0, 0] = 1.0 - 49e-9
assert reverse_kl(s[0:1], t_bad[0:1])[0] > reverse_kl(s[0:1], t[0:1])[0], "OOD -> larger penalty"
# 5) Dense-reward accounting: N tokens -> N signals; RL gives 1 scalar per sequence.
assert per_token.shape[0] == T and per_token.size == T
print("reverse-KL per-token reward: PASS", np.round(per_token[:3], 4).tolist())

How to use it¶

TRL implements on-policy distillation as GKD. In TRL ≥ 1.7 it lives under trl.experimental.gkd (it was moved out of the top-level namespace into the experimental one; verify the import path on your installed version). GKDTrainer wraps SFTTrainer and takes a teacher_model; GKDConfig exposes the two defining knobs: lmbda (the on-policy fraction) and beta (the forward↔reverse KL interpolation).

# on_policy_distill.py — REFERENCE TEMPLATE (needs TRL + transformers, not run here).
# TRL >=1.7: import from trl.experimental.gkd. Verify the import path on your version.
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl.experimental.gkd import GKDConfig, GKDTrainer

student = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3-1.7B")   # the model being trained
teacher = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3-8B")     # stronger; frozen, scores tokens
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-1.7B")        # student + teacher must share a tokenizer

trainer = GKDTrainer(
    model=student,
    teacher_model=teacher,
    args=GKDConfig(
        lmbda=1.0,        # 1.0 = fully on-policy (student generates); 0.0 = off-policy (teacher data)
        beta=1.0,         # 0.0 = forward KL, 1.0 = reverse KL (mode-seeking, MiniLLM/TML choice); 0.5 = JSD (TRL default)
        temperature=0.9,  # sampling temperature for the student rollouts
        max_new_tokens=512,
        bf16=True,
    ),
    processing_class=tokenizer,
    train_dataset=ds,     # placeholder: supply your own dataset; rows are chat "messages"; the student generates completions on-policy
)
trainer.train()

lmbda is what makes it on-policy: at 1.0 every batch is student-generated and scored by the teacher; at 0.0 it degenerates to supervised (off-policy) distillation on teacher token-probabilities. beta picks the divergence: the reverse-KL end (beta=1.0) is the mode-seeking objective MiniLLM and Thinking Machines favour, though the GKD authors note the optimal beta is task-dependent (TRL defaults to 0.5, the JSD midpoint).¹²

What beta actually interpolates (core math, runnable). GKD's beta selects a generalized Jensen-Shannon divergence whose normalized limits recover the two KLs: as beta -> 0, JSD(beta) / beta tends to forward KL, KL(teacher || student); as beta -> 1, JSD(beta) / (1 - beta) tends to reverse KL, KL(student || teacher); and beta = 0.5 is the standard symmetric JSD. The raw divergence itself goes to zero at both ends, which is why the block divides by beta * (1 - beta) before comparing. This numpy block builds that family with the teacher weighted by beta in the mixture (the convention in the GKD paper and TRL) and asserts that the normalized endpoints match the KLs computed independently (equivalence to a slow reference), the midpoint is bounded by ln 2, identical distributions give zero at every beta, and the two ends genuinely differ (adversarial). Run: python3 gkd_beta_jsd.py.

# gkd_beta_jsd.py — core math, runnable (numpy only).
# beta interpolates forward KL (beta->0) through reverse KL (beta->1), matching GKDConfig.beta.
import numpy as np

def kl(a, b):
    return float(np.sum(a * (np.log(a) - np.log(b))))

def gjsd(student, teacher, beta):        # mixture weights the TEACHER by beta
    m = beta * teacher + (1.0 - beta) * student
    term_t = beta * kl(teacher, m) if beta > 0.0 else 0.0
    term_s = (1.0 - beta) * kl(student, m) if beta < 1.0 else 0.0
    return term_t + term_s

student = np.array([0.60, 0.25, 0.15])
teacher = np.array([0.30, 0.45, 0.25])
fwd = kl(teacher, student)               # forward KL -> beta->0 end
rev = kl(student, teacher)               # reverse KL -> beta->1 end

# 1) Equivalence to slow reference: normalized generalized JSD recovers the KL endpoints.
d0 = gjsd(student, teacher, 1e-4) / (1e-4 * (1 - 1e-4))
d1 = gjsd(student, teacher, 1 - 1e-4) / ((1 - 1e-4) * 1e-4)
assert abs(d0 - fwd) < 1e-2, (d0, fwd)   # beta->0 == forward KL
assert abs(d1 - rev) < 1e-2, (d1, rev)   # beta->1 == reverse KL
# 2) Midpoint is the symmetric JSD and is bounded by ln 2.
mid = gjsd(student, teacher, 0.5)
assert 0.0 < mid < np.log(2) + 1e-12, mid
# 3) Edge: identical distributions -> zero divergence at every beta.
for b in (0.0, 0.25, 0.5, 0.75, 1.0):
    assert abs(gjsd(student, student, b)) < 1e-12, b
# 4) Adversarial: forward != reverse, so the two endpoints must differ.
assert abs(fwd - rev) > 1e-6 and not np.isclose(d0, d1), (fwd, rev)
# 5) Boundary: gjsd is finite and non-negative across the whole sweep.
assert all(g >= -1e-12 and np.isfinite(g) for g in (gjsd(student, teacher, b) for b in np.linspace(0, 1, 11)))
print("generalized JSD (beta): PASS  fwd=%.5f rev=%.5f jsd_mid=%.5f" % (fwd, rev, mid))
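
For readers who prefer the identity to the numerical check, the family the block above builds can be written out as follows. This is a sketch of the standard definition with P as the teacher and Q as the student, matching the gjsd function; the O(.) remarks are the usual second-order expansion of a KL around equal arguments.

```latex
% P = teacher, Q = student, mixture M_\beta = \beta P + (1-\beta) Q
\mathrm{JSD}_\beta(P \,\|\, Q)
  = \beta \,\mathrm{KL}\!\left(P \,\|\, M_\beta\right)
  + (1-\beta)\,\mathrm{KL}\!\left(Q \,\|\, M_\beta\right)

% beta -> 0: M_\beta -> Q, and KL(Q || M_\beta) = O(\beta^2), so only the first term survives
\lim_{\beta \to 0} \frac{\mathrm{JSD}_\beta(P \,\|\, Q)}{\beta}
  = \mathrm{KL}(P \,\|\, Q) \quad \text{(forward KL)}

% beta -> 1: M_\beta -> P, and KL(P || M_\beta) = O((1-\beta)^2), so only the second term survives
\lim_{\beta \to 1} \frac{\mathrm{JSD}_\beta(P \,\|\, Q)}{1-\beta}
  = \mathrm{KL}(Q \,\|\, P) \quad \text{(reverse KL)}
```

Without the division the value is exactly zero at beta = 0 and beta = 1, so an implementation that wants a usable loss at the endpoints has to treat them as the plain KLs rather than evaluate the mixture formula there.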

How to develop with it¶

Iterate on the teacher, the divergence, and the on-policy fraction:

Teacher choice dominates: the student cannot exceed it (below). A modestly stronger teacher on-policy often beats a much stronger teacher off-policy, because the signal lands on the student's own states.
beta (divergence). Reverse KL (beta→1) concentrates the student on the teacher's dominant modes (crisper, less diverse); forward KL (beta→0) spreads mass to cover the teacher (more diverse, can over-generalize). Sweep it; the sweet spot is task-dependent.¹
lmbda (on-policy fraction). The GKD authors find that on-policy data (high lmbda) performs better;¹ drop below 1.0 only if student generation is the bottleneck.
temperature controls exploration in the student rollouts; too low starves the teacher of informative mistakes to correct.
seq_kd=True switches to sequence-level KD (supervised fine-tuning on teacher-generated sequences), an off-policy baseline to compare against.
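
The mode-seeking versus mode-covering contrast in the beta item above can be seen on a small discrete example. This is an illustrative sketch with assumed numbers, not part of GKD or TRL: the teacher is a two-peaked distribution over 11 tokens, the student is restricted to a single fixed-width bump, and a grid search picks the bump position that minimizes each divergence. The helper names (bump, kl) and all constants are hypothetical.

```python
# mode_seeking_toy.py: illustrative sketch (numpy only).
import numpy as np

x = np.arange(11)                                   # token positions 0..10

def bump(center, width):
    # Discretized, normalized Gaussian-shaped bump over the 11 tokens.
    p = np.exp(-((x - center) ** 2) / (2.0 * width ** 2))
    return p / p.sum()

def kl(a, b):
    return float(np.sum(a * (np.log(a) - np.log(b))))

teacher = 0.5 * bump(2, 0.7) + 0.5 * bump(8, 0.7)   # two modes, at tokens 2 and 8
candidates = {mu: bump(mu, 1.0) for mu in range(11)}  # single-mode students

# Forward KL(teacher || student): the student must put mass wherever the teacher does.
mu_forward = min(candidates, key=lambda mu: kl(teacher, candidates[mu]))
# Reverse KL(student || teacher): the student is penalized for mass where the teacher has none.
mu_reverse = min(candidates, key=lambda mu: kl(candidates[mu], teacher))

print("forward-KL student peaks at token", mu_forward,
      "where the teacher assigns", round(float(teacher[mu_forward]), 5))
print("reverse-KL student peaks at token", mu_reverse,
      "where the teacher assigns", round(float(teacher[mu_reverse]), 5))
```

With a student too narrow to cover both peaks, the forward-KL fit lands between the modes, on tokens the teacher almost never produces, while the reverse-KL fit commits to one mode. That is the trade-off the beta sweep is exploring.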

What lmbda mixes, and why the reward is dense (core math, runnable). lmbda is the fraction of each batch that is student-generated (on-policy); the expected per-step loss is a convex combination of the on-policy and off-policy losses. Separately, the dense reward is a large part of why on-policy distillation is reported as roughly 50 to 100x cheaper in compute than RL:³ an N-token episode yields N per-token grades under distillation but a single scalar under RL. This block asserts the endpoints (lmbda 0 and 1), convexity, agreement with a Monte-Carlo Bernoulli(lmbda) reference, rejection of out-of-range lmbda (adversarial guard), and the O(N)-vs-O(1) signal count including the N=1 boundary. Run: python3 gkd_lambda_dense.py.

# gkd_lambda_dense.py — core math, runnable (numpy only).
import numpy as np

def gkd_expected_loss(loss_on_policy, loss_off_policy, lmbda):
    if not 0.0 <= lmbda <= 1.0:           # lmbda is a fraction; reject anything else
        raise ValueError(f"lmbda must be in [0, 1], got {lmbda!r}")  # explicit raise survives python -O
    return lmbda * loss_on_policy + (1.0 - lmbda) * loss_off_policy

L_on, L_off = 0.30, 0.50
# 1) Edge endpoints: lmbda=1 -> pure on-policy, lmbda=0 -> pure off-policy (SFT-like).
assert gkd_expected_loss(L_on, L_off, 1.0) == L_on
assert gkd_expected_loss(L_on, L_off, 0.0) == L_off
# 2) Convexity: any interior lmbda lies strictly between the endpoints.
assert min(L_on, L_off) < gkd_expected_loss(L_on, L_off, 0.5) < max(L_on, L_off)
# 3) Equivalence to a slow reference: Monte-Carlo the Bernoulli(lmbda) mixture.
rng = np.random.default_rng(1)
draws = np.where(rng.random(200_000) < 0.7, L_on, L_off)
assert abs(draws.mean() - gkd_expected_loss(L_on, L_off, 0.7)) < 5e-3, draws.mean()
# 4) Adversarial guard: lmbda outside [0,1] must raise, not be used silently.
for bad in (-0.1, 1.1):
    try:
        gkd_expected_loss(L_on, L_off, bad)
    except ValueError:
        continue
    raise SystemExit("accepted bad lmbda")
# 5) Dense vs sparse: N-token episode -> N signals (distill) vs 1 (RL). Includes N=1.
signal_count = lambda n, mode: n if mode == "distill" else 1
for N in (1, 8, 512):
    assert signal_count(N, "distill") == N and signal_count(N, "rl") == 1
    assert signal_count(N, "distill") >= signal_count(N, "rl")
print("lmbda mixing + dense reward: PASS  E[loss]@0.5=%.3f" % gkd_expected_loss(L_on, L_off, 0.5))

How to integrate with it¶

On-policy distillation slots between cheap format alignment and open-ended RL. The canonical 2026 pipeline is SFT (cold-start format) then on-policy distillation (transfer the teacher's capability densely) then RLVR (push past the teacher on verifiable tasks); each stage feeds the next its checkpoint. Integration checklist:

Tokenizer contract. Student and teacher must share a tokenizer, because the loss aligns per-token logits (below). Confirm identical vocab and special tokens before wiring anything; a mismatch corrupts the KL silently. Within a model family (Qwen3, Llama) this is usually satisfied by construction.
Data contract. train_dataset rows are chat messages with prompts only for the on-policy path: the student generates the completion, the teacher scores it. Do not pre-fill completions unless you are deliberately running the lmbda<1 off-policy mix.
Where it sits in the stack. GKDTrainer wraps SFTTrainer, so it inherits the same SFT/LoRA plumbing (PEFT adapters, packing, bf16); a LoRA student keeps the trainable footprint small while the frozen teacher scores. Feed its output straight into a GRPO/RLVR stage.
Serving the teacher. For large teachers, serve logprobs from a dedicated inference pool (vLLM-style) rather than co-resident weights; the trainer consumes teacher log-probabilities per token, so the integration point is a logprob endpoint, not a full generation API.
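
The tokenizer contract in the checklist above can be checked mechanically before any training run. The sketch below is a hypothetical helper (vocab_mismatches is not a TRL or transformers function) that works on plain token-to-id dictionaries; with Hugging Face tokenizers such dictionaries can be obtained from tokenizer.get_vocab(), which is left out here so the block stays dependency-free.

```python
# tokenizer_contract_check.py: illustrative sketch (stdlib only); the helper name is hypothetical.

def vocab_mismatches(student_vocab, teacher_vocab, limit=5):
    """Return human-readable problems; an empty list means the two vocabularies align."""
    problems = []
    if len(student_vocab) != len(teacher_vocab):
        problems.append(
            f"vocab size differs: {len(student_vocab)} vs {len(teacher_vocab)}"
        )
    for token in sorted(set(student_vocab) | set(teacher_vocab)):
        s_id = student_vocab.get(token)   # None if the student lacks this token
        t_id = teacher_vocab.get(token)   # None if the teacher lacks this token
        if s_id != t_id:
            problems.append(f"token {token!r}: student id {s_id}, teacher id {t_id}")
        if len(problems) >= limit:        # keep the report short
            break
    return problems

# Toy vocabularies standing in for tokenizer.get_vocab() output.
student_vocab = {"<s>": 0, "hello": 1, "world": 2}
teacher_same = dict(student_vocab)
teacher_permuted = {"<s>": 0, "hello": 2, "world": 1}   # same size, different ids

print(vocab_mismatches(student_vocab, teacher_same))
print(vocab_mismatches(student_vocab, teacher_permuted))
```

The permuted case is the dangerous one: the sizes match, so nothing fails on shape, yet every per-token comparison lines up the wrong entries. It is also worth comparing the two models' output dimensions, since an embedding matrix can be padded to a size that differs from the tokenizer's vocabulary.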

How to run it in production¶

Two models resident (or one served). Budget memory for student training state (optimizer + activations under FSDP) and a teacher doing forward-only inference. If the teacher does not fit alongside, serve it on a separate pool and stream logprobs; that decouples teacher throughput from trainer step time.
Pin versions. GKD moved to trl.experimental.gkd in TRL >= 1.7; experimental namespaces move. Pin trl, transformers, and the model revisions, and re-verify the import path in CI, since the reference templates above assume a specific layout.
Throughput is teacher-bound. The teacher scores every student token in the loop, so its inference rate caps the trainer. Size the teacher pool to the student's rollout rate; an under-provisioned teacher starves the trainer (a failure mode below).
Fabric. Two co-resident (or separately served) models reuse the same NVLink/IB-with-GDR concerns as any multi-model job; keep teacher-logprob transfer off the critical path where possible (performance tuning).
Determinism and cost. Fix the student sampling temperature and seeds for reproducible rollouts. Because the reward is dense, far fewer gradient steps reach a target than RL, which is where the ~50 to 100x compute reduction and ~7 to 10x fewer steps come from.³
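
Sizing the teacher pool to the student's rollout rate is simple arithmetic once both rates have been measured. The sketch below is a back-of-the-envelope helper under stated assumptions: the function name, the headroom factor, and the example throughputs are all hypothetical, and real numbers depend on sequence lengths, batching, and hardware.

```python
# teacher_pool_sizing.py: illustrative arithmetic (stdlib only); all figures are made up.
import math

def teacher_replicas_needed(student_tokens_per_s, teacher_tokens_per_s_per_replica, headroom=1.0):
    """Smallest number of teacher replicas whose scoring rate covers the student's rollout rate."""
    if student_tokens_per_s < 0 or teacher_tokens_per_s_per_replica <= 0 or headroom <= 0:
        raise ValueError("rates and headroom must be positive")
    required = student_tokens_per_s * headroom          # tokens/s the teacher pool must score
    return math.ceil(required / teacher_tokens_per_s_per_replica)

# Example with assumed measurements: the student emits 10,000 tokens/s across the job,
# and one teacher replica scores 4,000 tokens/s.
print(teacher_replicas_needed(10_000, 4_000))
```

If the result is larger than the pool you can afford, the trainer will idle waiting on teacher scores, which is the starvation failure mode the list above warns about.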

How to maintain it¶

Watch the teacher ceiling. The student is bounded by the teacher (a failure mode below), so track student-minus-teacher eval gap; once the student approaches the teacher, distillation has given what it can and the next gain must come from a stronger teacher or RLVR.
Monitor divergence health. Log per-token reverse KL and output diversity. A collapsing diversity signals reverse-KL mode collapse (beta too aggressive); a persistent large KL on in-distribution tokens signals a broken tokenizer alignment or an over-strong/OOD teacher.
Re-validate on upgrades. Any bump of trl/transformers/teacher revision can change the GKD API or teacher behaviour; re-run the numpy core-math blocks (they are version-independent) plus a smoke train, and re-check the import path.
Keep the off-policy baseline. Retain a lmbda=0 (or seq_kd=True) run as a regression anchor so you can always prove the on-policy contribution is still positive after a change (the ablation below).
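
One inexpensive way to track the output diversity mentioned above is a distinct-n ratio over sampled generations: the number of unique n-grams divided by the total number of n-grams. This is only one possible proxy, chosen here because it needs nothing beyond token ids; the helper names and the 30% alert threshold are assumptions for illustration.

```python
# diversity_monitor.py: illustrative sketch (stdlib only); names and threshold are hypothetical.

def distinct_n(sequences, n=2):
    """Unique n-grams / total n-grams across token-id sequences (0.0 if there are none)."""
    ngrams = [tuple(seq[i:i + n]) for seq in sequences for i in range(len(seq) - n + 1)]
    return len(set(ngrams)) / len(ngrams) if ngrams else 0.0

def diversity_dropped(baseline, current, max_relative_drop=0.3):
    """True when diversity fell by more than max_relative_drop relative to the baseline."""
    if baseline <= 0.0:
        return False
    return (baseline - current) / baseline > max_relative_drop

# Toy generations: a varied batch and a batch that has collapsed onto one pattern.
healthy = [[1, 2, 3, 4], [4, 3, 2, 1], [1, 3, 2, 4]]
collapsed = [[1, 2, 1, 2], [1, 2, 1, 2], [1, 2, 1, 2]]

base = distinct_n(healthy)
now = distinct_n(collapsed)
print(f"distinct-2 baseline={base:.3f} current={now:.3f} alert={diversity_dropped(base, now)}")
```

Logged next to the per-token reverse KL on a fixed prompt set, a falling distinct-n with a falling KL is the pattern to look for when suspecting reverse-KL mode collapse.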

How to scale it¶

The systems shape is lighter than RLVR, since there is no reward model and no sparse-reward rollout search, but the teacher must run inference in the loop to score every student token, so two models are resident (or the teacher is served separately) and teacher memory/throughput is the main cost. The student still needs a generation path for its on-policy rollouts. Because the reward is dense, far fewer gradient steps are needed than RL for the same gain, which is where the 50 to 100x compute reduction comes from.³ For a large teacher, shard it (or serve it on a dedicated pool) and keep the student on FSDP; reuse the same NVLink/IB-with-GDR fabric concerns as any multi-model training job (performance tuning).

At the extreme, the teacher is served entirely off-box: run it behind a logprob endpoint on an inference pool, batch the student's rollouts against it, and the trainer scales like an SFT job plus a network round-trip per batch. This is the shape that lets a small student distill from a very large teacher without ever co-locating both sets of weights.
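
A served teacher often returns only the log-probability of the tokens it was sent, not the full vocabulary distribution. That is still enough for a reverse-KL signal, because KL(student || teacher) is the expectation, under the student's own samples, of log s(y) - log t(y). The numpy sketch below checks that identity on random toy distributions; it is an illustration of the estimator, not a description of how any particular trainer or endpoint behaves.

```python
# sampled_reverse_kl.py: illustrative sketch (numpy only).
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
V = 50
student = softmax(rng.normal(size=V))
teacher = softmax(rng.normal(size=V))

# Exact value: needs both full distributions.
exact = float(np.sum(student * (np.log(student) - np.log(teacher))))

# Sampled value: needs only the two log-probs of each token the student actually emitted.
samples = rng.choice(V, size=200_000, p=student)
per_sample = np.log(student[samples]) - np.log(teacher[samples])
estimate = float(per_sample.mean())

print(f"exact KL={exact:.4f}  sampled estimate={estimate:.4f}  "
      f"min single-sample value={per_sample.min():.4f}")
```

The estimate is unbiased but noisy: an individual per-token value can be negative even though the KL itself never is, so the full-vocabulary sum is the lower-variance choice whenever the teacher's full distribution is available.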

On-policy self-distillation (the OPSD variant)¶

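Self-distillation sets the teacher to a version of the same model. The explicit form, and the one least likely to break across TRL versions, is to pass the same checkpoint as both model and teacher_model; TRL's GKD docs also describe a teacher_model_name_or_path=None shortcut ("the teacher model will be the same as the model being trained"; verify on your installed version).⁶ Note that a frozen copy of the identical checkpoint, scoring the identical inputs, gives zero divergence at step 0 (KL(p || p) = 0, as the reverse-KL block above asserts), so there is nothing to learn until teacher and student differ. Three practical shapes supply that difference: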

Strong-to-weak within a family. A large sibling teaches a small one: Qwen3 distills from Qwen3-32B/235B into the smaller models.⁵ This is the most common "self" case.
Privileged-context self-teacher. The teacher is the same model conditioned on privileged information (a verified reasoning trace, the answer) that the student does not see; the student learns to reproduce that behaviour from the question alone. This is the setting that coined "on-policy self-distillation (OPSD)".⁷
Earlier/EMA checkpoint. The model distills from a stronger past or averaged copy of itself to stabilize or recover behaviour.

The bound is strict here: a self-teacher cannot transfer capability it lacks, so OPSD sharpens and stabilizes rather than adds new knowledge; for that, use a stronger external teacher or RLVR.
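
Why the privileged-context shape carries signal while a plain copy does not can be shown with a toy "model". The sketch assumes a model whose next-token logits are a question term plus an optional hint term; the names (model_probs, hint_shift) and the size of the shift are hypothetical and only there to make the contrast visible.

```python
# privileged_self_teacher_toy.py: illustrative sketch (numpy only).
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def reverse_kl(student, teacher):
    return float(np.sum(student * (np.log(student) - np.log(teacher))))

rng = np.random.default_rng(0)
V = 8
question_logits = rng.normal(size=V)   # what the model predicts from the question alone
hint_shift = np.zeros(V)
hint_shift[3] = 4.0                    # privileged context favours the verified answer token

def model_probs(with_hint):
    # Same weights in both cases; only the conditioning context differs.
    return softmax(question_logits + (hint_shift if with_hint else 0.0))

student = model_probs(with_hint=False)
kl_same_copy = reverse_kl(student, model_probs(with_hint=False))   # identical copy, identical input
kl_privileged = reverse_kl(student, model_probs(with_hint=True))   # same model, extra context

print(f"KL vs identical copy={kl_same_copy:.6f}  KL vs privileged copy={kl_privileged:.4f}")
```

The identical copy gives a divergence of zero and therefore no gradient, while the privileged copy gives a positive one that pulls the question-only student toward the answer the hint made likely.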

Cookbook (common use cases)¶

1. Recover behaviour after domain fine-tuning (continual learning)

# REFERENCE TEMPLATE (needs TRL, not run here).
# Fine-tuning on internal docs degraded instruction-following. Distill it back on-policy,
# using the ORIGINAL instruct checkpoint as the teacher, on the student's own outputs.
from trl.experimental.gkd import GKDConfig, GKDTrainer
trainer = GKDTrainer(
    model=domain_tuned_model,                      # lost some IF ability
    teacher_model="Qwen/Qwen3-8B",                 # original instruct behaviour = the teacher
    args=GKDConfig(lmbda=1.0, beta=1.0, bf16=True),
    processing_class=tokenizer, train_dataset=if_prompts)   # prompts only; student generates
trainer.train()

2. Self-distillation (model is its own teacher)

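# REFERENCE TEMPLATE (needs TRL, not run here).
# Self-distillation: the teacher is a frozen copy of the SAME checkpoint (same id for model + teacher).
# Caveat: with identical weights and identical inputs the initial KL is 0, so this only yields a
# training signal once the teacher differs (privileged context, EMA/earlier weights, a drifted student).
# TRL also documents GKDConfig(teacher_model_name_or_path=None) as a shortcut; verify it on your version.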
from trl.experimental.gkd import GKDConfig, GKDTrainer
BASE = "Qwen/Qwen3-8B"
trainer = GKDTrainer(
    model=BASE,
    teacher_model=BASE,                     # same checkpoint -> the model is its own (frozen) teacher
    args=GKDConfig(lmbda=1.0, beta=1.0, bf16=True),
    processing_class=tokenizer, train_dataset=ds)
trainer.train()

3. Off-policy baseline to prove the on-policy win

# REFERENCE TEMPLATE (needs TRL, not run here).
# Ablation: lmbda=0.0 is supervised (off-policy) distillation. Compare eval vs lmbda=1.0
# on the SAME teacher to isolate the on-policy contribution.
GKDConfig(lmbda=0.0, beta=1.0)   # off-policy; expect exposure-bias gap vs lmbda=1.0

Failure modes¶

Teacher ceiling. The student is bounded by the teacher; distillation cannot exceed it. To push past the best available model, use RLVR. Self-distillation (OPSD) only sharpens.
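Tokenizer mismatch. Per-token logit alignment assumes a shared vocabulary. A teacher with a different vocabulary size typically fails loudly on a shape mismatch; one with the same size but different token ids fails silently, producing a meaningless KL.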
Off-policy regression (lmbda too low). Dropping toward lmbda=0 reintroduces the exposure bias on-policy distillation exists to remove; keep it high unless generation cost forces otherwise.¹
Reverse-KL mode collapse. Aggressive reverse KL (beta→1) can over-concentrate the student on a few teacher modes, cutting output diversity; back off beta if generations degrade.
Teacher memory/throughput. Holding a large teacher resident (or serving it) alongside the student is the dominant cost; under-provisioning the teacher starves the trainer.
Mistaking it for RL. On-policy distillation has an on-policy rollout like RL but the loss is a teacher-matching KL, not a reward-maximizing policy gradient; it inherits distillation's ceiling, not RL's open-ended improvement.

References¶

On-Policy Distillation of Language Models: Learning from Self-Generated Mistakes (GKD, ICLR 2024): https://arxiv.org/abs/2306.13649
MiniLLM: On-Policy Distillation of Large Language Models (reverse KL, ICLR 2024): https://arxiv.org/abs/2306.08543
On-Policy Distillation (Thinking Machines Lab, 2025): https://thinkingmachines.ai/blog/on-policy-distillation/
A Reduction of Imitation Learning to No-Regret Online Learning (DAgger, AISTATS 2011): https://arxiv.org/abs/1011.0686
Qwen3 Technical Report (strong-to-weak distillation): https://arxiv.org/abs/2505.09388
TRL Generalized Knowledge Distillation (GKD) Trainer: https://huggingface.co/docs/trl/gkd_trainer
Self-Distilled Reasoner: On-Policy Self-Distillation for LLMs (coins "OPSD"): https://arxiv.org/abs/2601.18734
A Survey of On-Policy Distillation for Large Language Models: https://arxiv.org/abs/2604.00626

¹ Agarwal et al., GKD: trains the student on its self-generated sequences with teacher feedback; lambda interpolates off-policy (0) to on-policy (1); generalized JSD spans forward KL to reverse KL; on-policy (high lambda) performs better and optimal beta is task-dependent. https://arxiv.org/abs/2306.13649
² Gu et al., MiniLLM: replaces forward KLD with reverse KLD (mode-seeking, better for generative LMs) and derives an on-policy optimization approach, reducing exposure bias. https://arxiv.org/abs/2306.08543
³ Lu et al. (Thinking Machines Lab), On-Policy Distillation: sample trajectories from the student and grade each token with a teacher (reverse KL); dense O(N)-bits-per-episode reward; roughly 7 to 10x fewer gradient steps and 50 to 100x less compute than RL, 9 to 30x cheaper than off-policy distillation; AIME'24 74.4% (on-policy) vs 67.6% (RL) vs 55.0% (off-policy). https://thinkingmachines.ai/blog/on-policy-distillation/
⁴ Ross, Gordon, Bagnell, DAgger: sequential prediction must train under the distribution of states the policy itself induces to avoid compounding error; the imitation-learning root of "on-policy". https://arxiv.org/abs/1011.0686
⁵ Qwen3 Technical Report: strong-to-weak distillation for smaller models "significantly outperforms reinforcement learning in performance and training efficiency", using only ~1/10 of the GPU hours of the multi-stage RL pipeline. https://arxiv.org/abs/2505.09388
⁶ TRL GKD Trainer: GKDTrainer wraps SFTTrainer with a teacher_model (a model object or a checkpoint id); GKDConfig.lmbda sets the on-policy student-data fraction, beta interpolates forward KL (0.0) to reverse KL (1.0). For self-distillation, pass the same checkpoint as model and teacher_model; the docs also describe a teacher_model_name_or_path=None shortcut (verify on your installed version). https://huggingface.co/docs/trl/gkd_trainer
⁷ "On-policy self-distillation (OPSD)" is a narrow 2026 label for the self-teacher case (e.g. a privileged-context copy of the same model), not a standard name for on-policy distillation in general; the umbrella term remains "on-policy distillation (OPD)". https://arxiv.org/abs/2601.18734