RLSD (reinforcement learning with self-distillation)
Scope: a post-training framework that fuses RLVR with privileged-context self-distillation. The verifiable reward decides the direction of each update while a token-level self-distillation signal decides its magnitude, giving dense per-token credit on top of RLVR's sparse per-sequence reward. Proposed in "Self-Distilled RLVR" (Yang et al., 2026), it sits where "learn from a verifier" (RLVR) and "learn from your own privileged outputs" (OPSD) meet.
The torch block below is a reference template: pin versions, verify it against the paper, and validate before production. The numpy block labelled core math is self-contained and runnable, and pins down the token-weight identity the update relies on. RLSD is a recent (2026) method demonstrated in one paper, so validate it in your own setting first.
What it is
RLSD keeps the GRPO/RLVR loop (sample G responses per prompt, score each with a verifier, compute a group-relative sequence advantage A) but reweights every token's update by a self-distillation signal. The reward still supplies a "reliable update direction from environmental feedback"; the self-distillation supplies "token-level policy differences for determining fine-grained update magnitudes".1
The self-distillation teacher is the same model conditioned on privileged information r (a reference answer or a verified reasoning trace), while the student sees only the question. For each generated token the privileged-information gain is the stop-gradient log-prob difference:
Δ_t = sg( log P_T(y_t | x, r, y_<t) − log P_S(y_t | x, y_<t) )
w_t = exp( sign(A) · Δ_t ) = ( P_T(y_t) / P_S(y_t) ) ^ sign(A)
The verifier reward gives sign(A) (reinforce a correct trajectory, penalize a wrong one); the teacher's evidence ratio P_T/P_S gives the per-token magnitude, so tokens the reference answer makes far more predictable receive more credit. That privileged-context self-teacher is exactly the OPSD idea: RLSD wires it into the RLVR update rather than running it as a standalone distillation loss.1
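The group-relative sequence advantage A that supplies sign(A) can be sketched as follows. This is a minimal illustration, assuming the usual GRPO-style normalization (mean-centred, divided by the group standard deviation); the function name `group_advantages`, the `eps` guard, and the reward values are hypothetical, not taken from the paper.

```python
import numpy as np

def group_advantages(rewards, eps=1e-6):
    """Group-relative advantages for ONE prompt's G rollouts.

    Assumption: GRPO-style normalization (centre on the group mean, divide
    by the group std). A zero-variance group returns all zeros.
    """
    rewards = np.asarray(rewards, dtype=float)
    std = rewards.std()
    if std < eps:                          # every rollout scored the same
        return np.zeros_like(rewards)
    return (rewards - rewards.mean()) / (std + eps)

# Hypothetical verifier scores for G = 8 rollouts of one prompt (1 = correct).
rewards = [1.0, 0.0, 0.0, 1.0, 1.0, 0.0, 0.0, 0.0]
adv = group_advantages(rewards)
direction = np.sign(adv)                   # this is the sign(A) used in w_t
print("advantages:", np.round(adv, 3).tolist())
print("direction :", direction.tolist())
```

Only the sign of each entry enters w_t; the magnitude of A still scales the surrogate as in plain GRPO.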
Why use it
- Dense credit on a sparse reward. RLVR's reward is a single scalar per sequence, giving no signal about which tokens mattered; RLSD adds a token-level assessment for "fine-grained credit discrimination", directly attacking RLVR's credit-assignment weakness.1
- Both strengths, neither failure. It "simultaneously harness[es] the strengths of both RLVR and OPSD, achieving a higher convergence ceiling and superior training stability"; the reward-gated direction prevents the privileged-information leakage/collapse that sinks OPSD used alone.1
- Cheap add-on. The teacher signal is one extra forward pass of the same model (conditioned on privileged info) per response, so there is no separate teacher to train or host, unlike external-teacher distillation.
- Measured gains. On multimodal reasoning it beat GRPO by +2.32 average points and the base model by +4.69, while OPSD-alone barely moved (below).1
When to use it (and when not)
- Use RLSD when you already run RLVR/GRPO on verifiable tasks and have privileged information at training time (reference answers, verified traces) to condition the teacher on.
- Prefer plain RLVR when you have no privileged signal to condition on, or the sequence reward already gives enough gradient.
- Prefer standalone OPSD when you have a genuinely stronger teacher and no verifiable reward, but note that privileged-context distillation without the reward-gated direction risks answer leakage and collapse, which is the failure RLSD is built to avoid.
- Evidence caveat. RLSD was demonstrated on multimodal (vision-language) reasoning (Qwen3-VL-8B-Instruct); the recipe is modality-agnostic in principle, but the published evidence is that setting. Validate on your modality. It is a single, recent (April 2026) paper; treat configs as version-specific.
Architecture
Two forward passes on the same weights drive each update: the student pass P_S(y_t | x, y_<t) that also produced the rollout, and a teacher pass P_T(y_t | x, r, y_<t) that additionally sees the privileged reference r. The verifier scores the rollout into a sequence advantage A; its sign sets the update direction, and the teacher-over-student ratio sets each token's magnitude w_t. There is no second model to host: the teacher is the same weights with r in context.
flowchart LR
P["Prompt x (+ privileged info r)"] --> ROLL["G rollouts (student policy)"]
ROLL --> VER["Verifier reward"]
VER --> A["Sequence advantage A → sign(A) = direction"]
ROLL --> STU["Student logprob P_S(y_t | x, y_<t)"]
ROLL --> TCH["Teacher logprob P_T(y_t | x, r, y_<t) (same model + r)"]
STU --> DW["Δ_t, token weight w_t = (P_T/P_S)^sign(A) = magnitude"]
TCH --> DW
A --> UPD["Clipped GRPO update with per-token w_t"]
DW --> UPD
UPD -->|"weight sync"| ROLL
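The two scoring passes in the diagram differ only in context: both score the same response tokens y, once after x and once after x followed by r. The sketch below shows the index bookkeeping this requires. It assumes a causal model whose logits row i predicts token i+1; `toy_forward` is a hypothetical stand-in for a real forward pass, and the token ids are made up.

```python
import numpy as np

def log_softmax(logits):
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))

def toy_forward(tokens, vocab=32):
    """Hypothetical causal 'model': row i depends only on tokens[:i+1].
    The same table (the same 'weights') serves student and teacher passes."""
    table = np.random.default_rng(0).normal(size=(vocab, vocab))
    prefix_sum = np.cumsum(table[np.asarray(tokens)], axis=0)
    return prefix_sum / np.arange(1, len(tokens) + 1)[:, None]

def response_logprobs(logits, context_len, response):
    """Per-token log-probs of `response`, which starts at index context_len.
    Response token k is predicted by logits row context_len + k - 1."""
    response = np.asarray(response)
    rows = np.arange(context_len - 1, context_len - 1 + len(response))
    return log_softmax(logits[rows])[np.arange(len(response)), response]

x = [3, 7, 1]          # question tokens (hypothetical ids)
r = [9, 9, 4]          # privileged reference tokens, teacher context only
y = [5, 2, 8, 6]       # sampled response tokens (student rollout)

logp_student = response_logprobs(toy_forward(x + y), len(x), y)
logp_teacher = response_logprobs(toy_forward(x + r + y), len(x) + len(r), y)
delta = logp_teacher - logp_student        # the information gain, per token
print(np.round(delta, 3).tolist())
```

The off-by-one in `rows` is the part most worth testing in a real loop: a misaligned slice silently scores the wrong tokens and still returns plausible-looking log-probs.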
Core math (runnable): the privileged-information token weight
The one modification RLSD makes to GRPO is the per-token weight w_t and its clip. This numpy-only block builds both from log-probs and asserts the properties the update relies on: the two algebraic forms of w_t agree (equivalence to a slow reference), sign(A) gates the direction so the same evidence ratio flips to its reciprocal on a wrong trajectory, ε_w caps a single high-gain token (adversarial), no privileged gain degenerates to plain RLVR (edge), and a zero-variance group gives no update (boundary). Run: python3 rlsd_weight.py.
# rlsd_weight.py -- core math, runnable (numpy only).
# RLSD reweights each token of a GRPO update by w_t = (P_T / P_S) ** sign(A),
# then applies a PPO-style clip on w_t. P_T is the SAME model conditioned on the
# privileged reference r; P_S is the student. The verifier supplies the sign of A.
import numpy as np


def token_weight(logp_teacher, logp_student, adv):
    delta = logp_teacher - logp_student      # sg() information gain; sign(A) sets direction
    return np.exp(np.sign(adv) * delta)      # w_t = (P_T / P_S) ** sign(A)


def rlsd_surrogate(adv, logp_teacher, logp_student, eps_w=0.2):
    w = token_weight(logp_teacher, logp_student, adv)
    return np.minimum(w * adv, np.clip(w, 1 - eps_w, 1 + eps_w) * adv)


def surrogate_reference(adv, lpt, lps, eps_w=0.2):  # slow, explicit, per token, ratio form
    out = []
    for a, t, s in zip(np.ravel(np.broadcast_to(adv, np.shape(lpt))), np.ravel(lpt), np.ravel(lps)):
        w = (np.exp(t) / np.exp(s)) ** np.sign(a)   # (P_T / P_S) ** sign(A), not the exp(sign*delta) form
        out.append(min(w * a, max(1 - eps_w, min(w, 1 + eps_w)) * a))
    return np.array(out).reshape(np.shape(lpt))


rng = np.random.default_rng(0)
T = 8
lps = rng.normal(-2.0, 0.5, size=T)          # student log-probs
lpt = lps + rng.normal(0.3, 0.4, size=T)     # teacher (privileged) log-probs
A = 1.0                                      # a correct trajectory: sign(A) = +1

# 1) Equivalence to a slow reference: exp(sign(A)*delta) form == (P_T/P_S)**sign(A) form.
assert np.allclose(rlsd_surrogate(A, lpt, lps), surrogate_reference(A, lpt, lps)), "forms disagree"

# 2) Reward-gated DIRECTION (load-bearing): same evidence ratio, flipped reward -> reciprocal weight.
rho = np.exp(lpt - lps)                      # P_T / P_S
w_pos = token_weight(lpt, lps, +1.0)
w_neg = token_weight(lpt, lps, -1.0)
assert np.allclose(w_pos, rho) and np.allclose(w_neg, 1.0 / rho), "sign(A) must gate direction"
more = rho > 1.0                             # tokens the reference makes more predictable
assert np.all(w_pos[more] > 1.0) and np.all(w_neg[more] < 1.0), "amplify if correct, damp if wrong"

# 3) Adversarial: a single high-gain token must NOT dominate. eps_w clips the upside for A>0.
lpt_spike = lps.copy()
lpt_spike[0] = lps[0] + 12.0                 # huge privileged-information gain on token 0
sur_clipped = rlsd_surrogate(A, lpt_spike, lps, eps_w=0.2)
sur_unclipped = token_weight(lpt_spike, lps, A) * A   # what you'd get with no clip
assert sur_unclipped[0] > 100.0, sur_unclipped[0]     # unclipped blows up
assert np.isclose(sur_clipped[0], (1 + 0.2) * A), sur_clipped[0]   # clipped caps at (1+eps_w)*A
assert sur_clipped[0] < sur_unclipped[0], "clip bounded the token's contribution"

# 4) Edge / degeneracy: no privileged gain (P_T == P_S) -> w=1 -> RLSD == plain RLVR advantage.
assert np.allclose(rlsd_surrogate(A, lps, lps), A), "no gain must degenerate to plain RLVR"

# 5) Boundary: zero-variance group (A=0) -> zero surrogate regardless of the teacher weight.
assert np.allclose(rlsd_surrogate(0.0, lpt, lps), 0.0), "A=0 must give zero update"

# 6) Equivalence holds for a WRONG trajectory too (A<0), where sign(A) inverts every weight.
assert np.allclose(rlsd_surrogate(-1.0, lpt, lps), surrogate_reference(-1.0, lpt, lps)), "A<0 disagrees"

print("RLSD token weight + clipped surrogate: PASS  w_pos[:3]=", np.round(w_pos[:3], 4).tolist(),
      " clipped_spike=", round(float(sur_clipped[0]), 4))
How it works
The paper's key modification to the GRPO objective is to replace the flat sequence advantage A with a token-weighted, clipped surrogate min(w_t·A, clip(w_t, 1−ε_w, 1+ε_w)·A), where w_t is the privileged-information weight above (validated in the core-math block). A faithful reference implementation of that weighting (not a library API; RLSD ships as a research recipe) on top of any GRPO loop:
# rlsd.py -- RLSD per-token surrogate, faithful to the paper's objective.
# REFERENCE TEMPLATE (needs torch, not run here). The core math is validated in numpy above.
# Note: the argument order here (student before teacher) differs from the numpy block.
import torch


def rlsd_surrogate(adv, logp_student, logp_teacher, eps_w=0.2):
    # adv: sequence-level RLVR/GRPO advantage A, broadcast per token (its SIGN is the direction)
    # logp_*: per-token log-probs; teacher = SAME model conditioned on privileged info r (x, r, y<t)
    delta = (logp_teacher - logp_student).detach()   # sg(): stop-gradient information gain Δ_t
    w = torch.exp(torch.sign(adv) * delta)           # w_t = (P_T / P_S) ** sign(A)
    surrogate = torch.min(w * adv,                   # PPO-style clip, but on the token weight w_t
                          w.clamp(1 - eps_w, 1 + eps_w) * adv)
    return surrogate                                 # maximize mean over tokens & group (GRPO)
Per the paper's Algorithm 1: sample G responses; compute the sequence advantage A from the verifier; run one teacher forward pass per response with the privileged info to get P_T; form w_t; and update with the clipped surrogate, annealing the self-distillation influence over training. See the paper for the complete objective.1
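The last step, turning the per-token surrogate into a scalar to optimize, needs a padding mask because responses in a group have different lengths. The sketch below is one way to do that, assuming a token-level mean over all valid tokens in the group; the paper's exact normalization may differ, and `rlsd_loss` and the numbers are hypothetical.

```python
import numpy as np

def rlsd_loss(surrogate, mask):
    """Negative masked mean of a [G, T] per-token surrogate.

    Assumption: every valid token in the group counts equally (token-level
    mean). Padded positions (mask == 0) contribute nothing.
    """
    surrogate = np.asarray(surrogate, dtype=float)
    mask = np.asarray(mask, dtype=float)
    n_valid = mask.sum()
    if n_valid == 0:
        return 0.0
    return float(-(surrogate * mask).sum() / n_valid)

# Two responses padded to T = 3; the second has only 2 real tokens.
surrogate = np.array([[1.2, 1.0, 0.9],
                      [-0.8, -1.0, 99.0]])   # 99.0 sits on padding and must be ignored
mask = np.array([[1, 1, 1],
                 [1, 1, 0]])
loss = rlsd_loss(surrogate, mask)
print(round(loss, 4))
```

Minimizing this loss maximizes the mean surrogate. A per-sequence mean followed by a group mean is an equally plausible choice and weights short responses more heavily.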
How to use it
The reported configuration (Qwen3-VL-8B-Instruct):
- λ (mixing coefficient), annealed 0.5 → 0 over the first 50 steps: blends the self-distillation weighting with the plain RLVR advantage, so training leans on token-level shaping early and on pure RLVR later.
- ε_w = 0.2 is the clip bound on the token weight w_t, the stability knob analogous to PPO's ratio clip; too loose and a single high-gain token dominates the update.
- G = 8, batch 256, learning rate 1e-6: otherwise a standard GRPO setup.
- Choice of privileged info r. A reference answer is the cheapest; a verified reasoning trace is richer. r is only used to condition the teacher forward pass; it never enters the student's context, which is what keeps it from leaking into the policy.
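The λ anneal can be sketched as a step-indexed schedule. Two things here are assumptions rather than details from the source: the decay is taken to be linear, and the blend is taken to be a simple interpolation between the plain RLVR weight (1) and the RLSD weight w_t. Check both against the paper's objective before relying on them; `lam_at` and `blended_weight` are hypothetical names.

```python
def lam_at(step, lam0=0.5, anneal_steps=50):
    """Mixing coefficient at a training step.

    Assumption: linear decay from lam0 at step 0 to 0 at anneal_steps,
    then 0 for the rest of training.
    """
    if anneal_steps <= 0:
        return 0.0
    return lam0 * max(0.0, 1.0 - step / anneal_steps)

def blended_weight(w, lam):
    """Assumed blend: lam = 0 gives plain RLVR (weight 1), lam = 1 gives w."""
    return (1.0 - lam) * 1.0 + lam * w

for step in (0, 25, 50, 200):
    lam = lam_at(step)
    print(step, lam, blended_weight(2.0, lam))
```

Under these assumptions a token with w_t = 2.0 is weighted 1.5 at step 0 and exactly 1.0 from step 50 onward, which matches the described behaviour of converging to pure RLVR.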
How to integrate with it
RLSD is a modification to an existing GRPO/RLVR loop, not a new trainer. You keep the rollout sampler, the verifier, and the group-relative advantage; you add one teacher forward per response and weight each token's advantage by w_t inside the clipped surrogate. Any loop that already exposes per-token log-probs can carry it.
- Data contract. Every training prompt needs a paired reference r (a reference answer, or a verified reasoning trace). r conditions only the teacher forward; it never enters the student context or the rollout, which is what stops it leaking into the policy. Prompts with no r fall back to plain RLVR for that example (w_t = 1, as the core-math degeneracy check shows).
- Where it sits in the pipeline. RLSD occupies the same slot as RLVR in an SFT → distillation → RLVR pipeline: it replaces the RLVR stage when you also hold privileged references. Feed it an SFT/OPD-warmed checkpoint; its output is a normal policy checkpoint.
- Verifier reuse. The reward source is unchanged from RLVR, so existing verifiers (exact-match, unit-test, LLM judge) and everything in reward design apply as-is. RLSD changes credit assignment, not the reward.
- Where w_t plugs in. It is a per-token multiplier on the advantage inside the clipped surrogate; nothing downstream of the loss (optimizer, FSDP sharding, weight sync) changes, because the teacher is the same weights.
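The data contract's fallback rule (no reference means w_t = 1) can be sketched as below. The `Example` record, the `teacher_logp_or_fallback` helper, and the `score_with_reference` callable are all hypothetical; a real loop would replace the callable with its own teacher forward pass.

```python
from dataclasses import dataclass
from typing import Optional

import numpy as np

@dataclass
class Example:                      # hypothetical training record
    prompt: str
    reference: Optional[str] = None  # privileged info r; None if unavailable

def teacher_logp_or_fallback(example, logp_student, score_with_reference):
    """Teacher log-probs for one response.

    With no reference, return a copy of the student log-probs so that
    delta = 0 and w_t = 1, i.e. plain RLVR for this example.
    """
    if example.reference is None:
        return np.array(logp_student, dtype=float, copy=True)
    return np.asarray(score_with_reference(example), dtype=float)

logp_s = np.array([-1.0, -2.0, -0.5])
fake_teacher = lambda ex: logp_s + 0.3          # stand-in for a real teacher pass

w_no_ref = np.exp(+1.0 * (teacher_logp_or_fallback(Example("2+2=?"), logp_s, fake_teacher) - logp_s))
w_with_ref = np.exp(+1.0 * (teacher_logp_or_fallback(Example("2+2=?", "4"), logp_s, fake_teacher) - logp_s))
print(w_no_ref.tolist(), np.round(w_with_ref, 4).tolist())
```

Handling the missing-reference case at the log-prob level keeps the surrogate code path identical for both kinds of example, so mixed batches need no special casing in the loss.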
How to run it in production
- The anneal is a schedule, not a constant. λ runs 0.5 → 0 over the first 50 steps, so token shaping is strongest at cold-start and training converges to pure RLVR. Scale the anneal length to your total step budget rather than copying 50 verbatim, and validate it on your run.
- Monitor the clip fraction and the w_t distribution the way you would a PPO ratio. A rising fraction of tokens pinned to the 1±ε_w bound, or a heavy right tail in w_t, means a few high-gain tokens are trying to dominate: tighten ε_w or shorten the anneal so the self-distillation weighting fades out sooner. This is the observable behind the high-gain-token failure mode below.
- Batch the teacher forward. It is one extra forward on the same weights per response, so co-schedule it with rollout log-prob scoring instead of running it as a separate model pass. There is no second set of weights to load, sync, or shard.
- Keep the RLVR guardrails on. RLSD inherits verifier gaming, entropy collapse, and zero-variance groups from RLVR, so keep the same reward-hacking and entropy monitoring an RLVR run already needs.
- Pin and seed. This is a single 2026 recipe, so pin the exact config and fix rollout seeds and sampling temperature for reproducible advantage estimates before you compare anneal or ε_w settings.
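The monitoring bullet above can be turned into a small per-step statistic. This sketch reports the fraction of valid tokens whose w_t falls outside the [1−ε_w, 1+ε_w] band, which is a proxy for the clip fraction (whether the clip actually binds also depends on the sign of A), plus a few tail quantiles. The function name and returned keys are hypothetical.

```python
import numpy as np

def weight_stats(w, mask, eps_w=0.2):
    """Summary of the token-weight distribution over valid tokens."""
    w = np.asarray(w, dtype=float)[np.asarray(mask, dtype=bool)]
    if w.size == 0:
        return {"out_of_band": 0.0, "frac_high": 0.0, "frac_low": 0.0,
                "p50": float("nan"), "p99": float("nan"), "max": float("nan")}
    high = w > 1 + eps_w
    low = w < 1 - eps_w
    return {
        "out_of_band": float((high | low).mean()),  # proxy for the clip fraction
        "frac_high": float(high.mean()),            # right tail: high-gain tokens
        "frac_low": float(low.mean()),
        "p50": float(np.quantile(w, 0.50)),
        "p99": float(np.quantile(w, 0.99)),
        "max": float(w.max()),
    }

w = np.array([1.0, 1.1, 1.5, 0.7, 3.0, 0.95])
mask = np.array([1, 1, 1, 1, 1, 0])                 # last position is padding
stats = weight_stats(w, mask)
print(stats)
```

Logging `frac_high` and `p99` separately from the combined fraction makes a heavy right tail visible even when the overall out-of-band rate looks stable.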
How to maintain it
- Treat the constants as version-specific. The λ anneal, ε_w = 0.2, and G = 8 come from one paper on one 8B VLM. Re-tune them per modality and scale; do not assume they transfer.
- Keep an RLVR baseline as a regression anchor. Run plain RLVR/GRPO on the same data and verifier alongside RLSD, so you can always show the self-distillation term still adds over the reward alone (the +2.32 over GRPO is the thing to reproduce on your task). If RLSD ever trails its RLVR baseline, suspect a misconfigured token weighting first.
- Guard the reward gate. The sign(A) gate is load-bearing. Any refactor that drops it, or that conditions the student on r, reintroduces OPSD-style answer leakage and collapse. Cover the gate with the numpy core-math assertions and re-run them on every change; they are library-independent, so they survive torch and TRL upgrades.
- Re-validate on upgrades. Re-run the core-math block after any change to the surrogate, and re-check eval on your modality after a base-model or data change, since the published evidence is multimodal-only.
How to scale it
RLSD has the systems shape of RLVR/GRPO: rollout-dominated, split into a trainer and a rollout generator (async RL systems), plus one extra forward pass per response for the teacher. Because the teacher is the same weights conditioned on a different context, there is no second model to host or sync, so the memory and weight-transfer story is unchanged from GRPO; the only added cost is that teacher forward, cheap next to the rollouts. Everything in reward design and the RLVR verifier engineering applies unchanged.
Results
Reported on Qwen3-VL-8B-Instruct across five multimodal reasoning benchmarks (higher is better); RLSD has the highest score in every column of the table below, and avoids OPSD's stagnation:1
| Method | MMMU | MathVista | MathVision | ZeroBench | WeMath | Avg |
|---|---|---|---|---|---|---|
| Base | 62.44 | 73.80 | 47.37 | 19.76 | 54.10 | 51.49 |
| GRPO (RLVR) | 65.11 | 76.20 | 48.82 | 22.60 | 56.57 | 53.86 |
| OPSD | 63.82 | 75.10 | 47.53 | 21.06 | 54.95 | 52.49 |
| RLSD | 67.22 | 78.10 | 52.73 | 24.85 | 58.00 | 56.18 |
RLSD is +2.32 over GRPO and +4.69 over base; OPSD-alone adds only +1.0, consistent with the paper's claim that privileged-context distillation needs the reward-gated direction to avoid collapse.1
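The averages and deltas quoted above follow directly from the per-benchmark scores in the table. The short check below recomputes them; it uses only the table's numbers, and the variable names are illustrative.

```python
# Per-benchmark scores copied from the table:
# MMMU, MathVista, MathVision, ZeroBench, WeMath.
scores = {
    "Base":        [62.44, 73.80, 47.37, 19.76, 54.10],
    "GRPO (RLVR)": [65.11, 76.20, 48.82, 22.60, 56.57],
    "OPSD":        [63.82, 75.10, 47.53, 21.06, 54.95],
    "RLSD":        [67.22, 78.10, 52.73, 24.85, 58.00],
}

avg = {name: sum(vals) / len(vals) for name, vals in scores.items()}
rlsd_vs_grpo = avg["RLSD"] - avg["GRPO (RLVR)"]
rlsd_vs_base = avg["RLSD"] - avg["Base"]
opsd_vs_base = avg["OPSD"] - avg["Base"]

# True when RLSD holds the column maximum on every benchmark.
rlsd_tops_all = all(
    scores["RLSD"][i] == max(vals[i] for vals in scores.values())
    for i in range(5)
)

for name, value in avg.items():
    print(f"{name:12s} avg = {value:.2f}")
print(f"RLSD - GRPO = {rlsd_vs_grpo:+.2f}")
print(f"RLSD - Base = {rlsd_vs_base:+.2f}")
print(f"OPSD - Base = {opsd_vs_base:+.2f}")
print("RLSD tops every column:", rlsd_tops_all)
```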
Failure modes
- No privileged information. Without a reference answer or trace to condition the teacher, there is no w_t signal and RLSD degenerates to plain RLVR; it is not a drop-in for reward-only settings.
- OPSD-style leakage if you drop the reward gate. Using the privileged-context teacher without sign(A) gating can leak the answer into the policy and collapse: the reward direction is load-bearing, not optional.1
- Weight clip / anneal mis-set. Too large ε_w or too slow a λ anneal lets high-gain tokens dominate and destabilizes training; the paper anneals self-distillation out over 50 steps.
- Over-generalizing the evidence. Results are multimodal-reasoning-only on one 8B VLM; do not assume the exact gains transfer to text-only or other scales without checking.
- Inherited RLVR failure modes. Verifier gaming, entropy collapse, and zero-variance groups all still apply; RLSD changes credit assignment, not the reward source (RLVR).
References
- Self-Distilled RLVR (introduces RLSD): https://arxiv.org/abs/2604.03128
- RLVR (verifiable-reward RL, the direction signal): https://arxiv.org/abs/2411.15124
- On-policy self-distillation / OPSD (the privileged-context self-teacher): https://arxiv.org/abs/2601.18734
- GRPO (the underlying critic-free optimizer): https://arxiv.org/abs/2402.03300
- SDPG: Self-Distilled Policy Gradient (concurrent sibling, arXiv 2606.04036): https://arxiv.org/abs/2606.04036
A concurrent sibling in the same family is SDPG (Self-Distilled Policy Gradient), which likewise treats on-policy self-distillation as dense supervision for sparse-reward RL, combining group-relative verifier advantages (normalized by group std, as in GRPO) with exact full-vocabulary on-policy self-distillation and a reference-policy KL term.2 Where RLSD folds the self-teacher into a per-token weight on the RLVR advantage, SDPG composes the verifier advantage, the full-vocabulary distillation signal, and the KL as an integrated objective; both are instances of using the model's own privileged or full-distribution signal to densify a verifier's sparse reward.
Related: RLVR · On-policy distillation (OPSD) · GRPO · Reward design · Agentic and tool-use RL · Async RL systems · Fine-tuning and post-training · Glossary
1. Yang et al., Self-Distilled RLVR (RLSD): uses self-distillation for "token-level policy differences for determining fine-grained update magnitudes" while RLVR gives "reliable update directions from environmental feedback"; the teacher is the same model conditioned on privileged information r (reference answer / verified trace), giving token weight w_t = (P_T/P_S)^sign(A); "simultaneously harness[es] the strengths of both RLVR and OPSD, achieving a higher convergence ceiling and superior training stability". Demonstrated on Qwen3-VL-8B-Instruct across MMMU/MathVista/MathVision/ZeroBench/WeMath (avg 56.18 vs GRPO 53.86, OPSD 52.49, base 51.49). https://arxiv.org/abs/2604.03128
2. Liu et al., Self-Distilled Policy Gradient (SDPG, arXiv 2606.04036): on-policy self-distillation (a model conditioning on privileged context to supervise its own generations) as dense supervision for sparse-reward RL; merges group-relative verifier advantages with normalized standard deviation, exact full-vocabulary on-policy self-distillation, and reference-policy KL regularization, reporting improved stability and performance. https://arxiv.org/abs/2606.04036