RLSD (reinforcement learning with self-distillation)

Scope: a post-training framework that fuses RLVR (reinforcement learning with verifiable rewards) with privileged-context self-distillation. The verifiable reward decides the direction of each update, while a token-level self-distillation signal decides its magnitude, giving dense per-token credit on top of RLVR's sparse per-sequence reward. Proposed in "Self-Distilled RLVR" (Yang et al., 2026), it sits where "learn from a verifier" (RLVR) and "learn from your own privileged outputs" (on-policy self-distillation, OPSD) meet.

The torch block below is a reference template (pin versions, verify against the paper, validate before production). The numpy block labelled core math is self-contained and runnable, and pins down the token-weight identity the update relies on. This is a recent (2026) method demonstrated in one paper, so validate against your own setting first.

What it is

RLSD keeps the GRPO/RLVR loop (sample G responses per prompt, score each with a verifier, compute a group-relative sequence advantage A) but reweights every token's update by a self-distillation signal. The reward still supplies "reliable update directions from environmental feedback"; the self-distillation supplies "token-level policy differences for determining fine-grained update magnitudes".¹

The self-distillation teacher is the same model conditioned on privileged information r (a reference answer or a verified reasoning trace), while the student sees only the question. For each generated token the privileged-information gain is the stop-gradient log-prob difference:

Δ_t = sg( log P_T(y_t | x, r, y_<t) − log P_S(y_t | x, y_<t) )
w_t = exp( sign(A) · Δ_t ) = ( P_T(y_t) / P_S(y_t) ) ^ sign(A)

The verifier reward gives sign(A) (reinforce a correct trajectory, penalize a wrong one); the teacher's evidence ratio P_T/P_S gives the per-token magnitude. On a correct trajectory (A > 0), tokens the reference answer makes far more predictable get w_t > 1 and receive more credit; on a wrong trajectory (A < 0) the same tokens get w_t < 1 and are penalized less. That privileged-context self-teacher is exactly the OPSD idea: RLSD wires it into the RLVR update rather than running it as a standalone distillation loss.¹
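As a quick sanity check on the two formulas, here is a minimal sketch with made-up probabilities: 0.10 for the student and 0.40 for the teacher. These numbers are illustrative and not from the paper. The helper name token_weight mirrors the core-math block later on this page, but this version is scalar and uses only the standard library.

```python
import math

# Illustrative numbers only (not from the paper): the student assigns the
# sampled token probability 0.10; the teacher, which also sees r, assigns 0.40.
p_student, p_teacher = 0.10, 0.40
delta = math.log(p_teacher) - math.log(p_student)   # Delta_t = log P_T - log P_S

def token_weight(delta, adv):
    """w_t = exp(sign(A) * Delta_t); sign(0) is taken as 0, so w_t = 1 when A = 0."""
    sign = (adv > 0) - (adv < 0)
    return math.exp(sign * delta)

w_correct = token_weight(delta, adv=+1.0)   # evidence ratio P_T / P_S
w_wrong = token_weight(delta, adv=-1.0)     # reciprocal P_S / P_T
print(round(w_correct, 4), round(w_wrong, 4))   # prints: 4.0 0.25
```

The same token is weighted up fourfold when the verifier accepts the trajectory and weighted down fourfold when it rejects it, before any clipping is applied.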

Why use it

Dense credit on a sparse reward. RLVR's reward is a single scalar per sequence, giving no signal about which tokens mattered; RLSD adds a token-level assessment for "fine-grained credit discrimination", directly attacking RLVR's credit-assignment weakness.¹
Both strengths, neither failure. It "simultaneously harness[es] the strengths of both RLVR and OPSD, achieving a higher convergence ceiling and superior training stability"; the reward-gated direction prevents the privileged-information leakage/collapse that sinks OPSD used alone.¹
Cheap add-on. The teacher signal is one extra forward pass of the same model (conditioned on privileged info) per response, so there is no separate teacher to train or host, unlike external-teacher distillation.
Measured gains. On multimodal reasoning it beat GRPO by +2.32 average points and the base model by +4.69, while OPSD-alone barely moved (below).¹

When to use it (and when not)

Use RLSD when you already run RLVR/GRPO on verifiable tasks and have privileged information at training time (reference answers, verified traces) to condition the teacher on.
Prefer plain RLVR when you have no privileged signal to condition on, or the sequence reward already gives enough gradient.
Prefer standalone OPSD when the privileged-context teacher is genuinely stronger than the student and you have no verifiable reward. Note that privileged-context distillation without the reward-gated direction risks answer leakage and collapse, which is the failure RLSD is built to avoid.
Evidence caveat. RLSD was demonstrated on multimodal (vision-language) reasoning with Qwen3-VL-8B-Instruct. The recipe is modality-agnostic in principle, but the published evidence covers only that setting, so validate on your modality. It comes from a single, recent (April 2026) paper; treat configs as version-specific.

Architecture

Two forward passes on the same weights drive each update: the student pass P_S(y_t | x, y_<t) that also produced the rollout, and a teacher pass P_T(y_t | x, r, y_<t) that additionally sees the privileged reference r. The verifier scores the rollout into a sequence advantage A; its sign sets the update direction, and the teacher-over-student ratio sets each token's magnitude w_t. There is no second model to host: the teacher is the same weights with r in context.

flowchart LR
  P["Prompt x (+ privileged info r)"] --> ROLL["G rollouts (student policy)"]
  ROLL --> VER["Verifier reward"]
  VER --> A["Sequence advantage A → sign(A) = direction"]
  ROLL --> STU["Student logprob P_S(y_t | x, y_<t)"]
  ROLL --> TCH["Teacher logprob P_T(y_t | x, r, y_<t)  (same model + r)"]
  STU --> DW["Δ_t, token weight w_t = (P_T/P_S)^sign(A) = magnitude"]
  TCH --> DW
  A --> UPD["Clipped GRPO update with per-token w_t"]
  DW --> UPD
  UPD -->|"weight sync"| ROLL

Core math (runnable): the privileged-information token weight

The one modification RLSD makes to GRPO is the per-token weight w_t and its clip. This numpy-only block builds both from log-probs and asserts the properties the update relies on: (1) the two algebraic forms of w_t agree, checked against a slow per-token reference; (2) sign(A) gates the direction, so the same evidence ratio flips to its reciprocal on a wrong trajectory; (3) ε_w caps a single high-gain token (the adversarial case); (4) zero privileged gain degenerates to plain RLVR (the edge case); (5) a zero-variance group gives no update (the boundary case); and (6) the equivalence in (1) also holds when A < 0. Run: python3 rlsd_weight.py.

# rlsd_weight.py -- core math, runnable (numpy only).
# RLSD reweights each token of a GRPO update by w_t = (P_T / P_S) ** sign(A),
# then applies a PPO-style clip on w_t. P_T is the SAME model conditioned on the
# privileged reference r; P_S is the student. The verifier supplies the sign of A.
import numpy as np

def token_weight(logp_teacher, logp_student, adv):
    delta = logp_teacher - logp_student            # sg() information gain; sign(A) sets direction
    return np.exp(np.sign(adv) * delta)            # w_t = (P_T / P_S) ** sign(A)

def rlsd_surrogate(adv, logp_teacher, logp_student, eps_w=0.2):
    w = token_weight(logp_teacher, logp_student, adv)
    return np.minimum(w * adv, np.clip(w, 1 - eps_w, 1 + eps_w) * adv)

def surrogate_reference(adv, lpt, lps, eps_w=0.2):  # slow, explicit, per token, ratio form
    out = []
    for a, t, s in zip(np.ravel(np.broadcast_to(adv, np.shape(lpt))), np.ravel(lpt), np.ravel(lps)):
        w = (np.exp(t) / np.exp(s)) ** np.sign(a)   # (P_T / P_S) ** sign(A), not the exp(sign*delta) form
        out.append(min(w * a, max(1 - eps_w, min(w, 1 + eps_w)) * a))
    return np.array(out).reshape(np.shape(lpt))

rng = np.random.default_rng(0)
T = 8
lps = rng.normal(-2.0, 0.5, size=T)                 # student log-probs
lpt = lps + rng.normal(0.3, 0.4, size=T)            # teacher (privileged) log-probs
A = 1.0                                             # a correct trajectory: sign(A) = +1

# 1) Equivalence to a slow reference: exp(sign(A)*delta) form == (P_T/P_S)**sign(A) form.
assert np.allclose(rlsd_surrogate(A, lpt, lps), surrogate_reference(A, lpt, lps)), "forms disagree"

# 2) Reward-gated DIRECTION (load-bearing): same evidence ratio, flipped reward -> reciprocal weight.
rho = np.exp(lpt - lps)                              # P_T / P_S
w_pos = token_weight(lpt, lps, +1.0)
w_neg = token_weight(lpt, lps, -1.0)
assert np.allclose(w_pos, rho) and np.allclose(w_neg, 1.0 / rho), "sign(A) must gate direction"
more = rho > 1.0                                    # tokens the reference makes more predictable
assert np.all(w_pos[more] > 1.0) and np.all(w_neg[more] < 1.0), "amplify if correct, damp if wrong"

# 3) Adversarial: a single high-gain token must NOT dominate. eps_w clips the upside for A>0.
lpt_spike = lps.copy(); lpt_spike[0] = lps[0] + 12.0   # huge privileged-information gain on token 0
sur_clipped = rlsd_surrogate(A, lpt_spike, lps, eps_w=0.2)
sur_unclipped = token_weight(lpt_spike, lps, A) * A    # what you'd get with no clip
assert sur_unclipped[0] > 100.0, sur_unclipped[0]      # unclipped blows up
assert np.isclose(sur_clipped[0], (1 + 0.2) * A), sur_clipped[0]  # clipped caps at (1+eps_w)*A
assert sur_clipped[0] < sur_unclipped[0], "clip bounded the token's contribution"

# 4) Edge / degeneracy: no privileged gain (P_T == P_S) -> w=1 -> RLSD == plain RLVR advantage.
assert np.allclose(rlsd_surrogate(A, lps, lps), A), "no gain must degenerate to plain RLVR"

# 5) Boundary: zero-variance group (A=0) -> zero surrogate regardless of the teacher weight.
assert np.allclose(rlsd_surrogate(0.0, lpt, lps), 0.0), "A=0 must give zero update"

# 6) Equivalence holds for a WRONG trajectory too (A<0), where sign(A) inverts every weight.
assert np.allclose(rlsd_surrogate(-1.0, lpt, lps), surrogate_reference(-1.0, lpt, lps)), "A<0 disagrees"

print("RLSD token weight + clipped surrogate: PASS  w_pos[:3]=", np.round(w_pos[:3], 4).tolist(),
      " clipped_spike=", round(float(sur_clipped[0]), 4))

How it works

The paper's key modification to the GRPO objective is to replace the flat sequence advantage A with a token-weighted, clipped surrogate min(w_t·A, clip(w_t, 1−ε_w, 1+ε_w)·A), where w_t is the privileged-information weight above (validated in the core-math block). Below is a reference implementation of that weighting that sits on top of any GRPO loop. It is not a library API, since RLSD ships as a research recipe, and it omits the λ anneal described under "How to use it":

# rlsd.py -- RLSD per-token surrogate, faithful to the paper's objective.
# REFERENCE TEMPLATE (needs torch, not run here). The core math is validated in numpy above.
import torch

def rlsd_surrogate(adv, logp_student, logp_teacher, eps_w=0.2):
    # adv:        sequence-level RLVR/GRPO advantage A as a tensor, broadcast per token (its SIGN is the direction)
    # logp_*:     per-token log-probs; teacher = SAME model conditioned on privileged info r (x, r, y<t)
    # NOTE:       student comes BEFORE teacher here, the reverse of the numpy block; pass log-probs by keyword.
    delta = (logp_teacher - logp_student).detach()          # sg(): stop-gradient information gain Δ_t
    w = torch.exp(torch.sign(adv) * delta)                  # w_t = (P_T / P_S) ** sign(A)
    surrogate = torch.min(w * adv,                          # PPO-style clip, but on the token weight w_t
                          w.clamp(1 - eps_w, 1 + eps_w) * adv)
    return surrogate                                        # maximize mean over tokens & group (GRPO)

Per the paper's Algorithm 1: sample G responses; compute the sequence advantage A from the verifier; run one teacher forward pass per response with the privileged info to get P_T; form w_t; and update with the clipped surrogate, annealing the self-distillation influence over training. See the paper for the complete objective.¹
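The steps summarized above can be sketched end to end in numpy. This is a toy under stated assumptions: the advantage uses the usual GRPO normalization (r − mean)/(std + eps), all responses share one length (no padding mask), the log-probs are random stand-ins for real model passes, and the λ anneal is left out. The function names are illustrative and do not come from the paper.

```python
import numpy as np

def group_advantage(rewards, eps=1e-6):
    """Group-relative advantage in the usual GRPO form: (r - mean) / (std + eps)."""
    rewards = np.asarray(rewards, dtype=float)
    return (rewards - rewards.mean()) / (rewards.std() + eps)

def rlsd_step_objective(rewards, logp_student, logp_teacher, eps_w=0.2):
    """Mean clipped RLSD surrogate over a group of G responses of T tokens each.

    rewards:       shape (G,), verifier scores
    logp_student:  shape (G, T), per-token log-probs without r in context
    logp_teacher:  shape (G, T), per-token log-probs with r in context
    """
    adv = group_advantage(rewards)[:, None]          # (G, 1), broadcast per token
    delta = logp_teacher - logp_student              # treated as a constant (stop-gradient)
    w = np.exp(np.sign(adv) * delta)                 # w_t = (P_T / P_S) ** sign(A)
    clipped = np.clip(w, 1 - eps_w, 1 + eps_w)
    return np.minimum(w * adv, clipped * adv).mean()

rng = np.random.default_rng(0)
G, T = 8, 16
rewards = np.array([1, 0, 0, 1, 1, 0, 0, 0], dtype=float)   # toy verifier outputs
logp_s = rng.normal(-2.0, 0.5, size=(G, T))                 # stand-in student pass
logp_t = logp_s + rng.normal(0.3, 0.4, size=(G, T))         # stand-in teacher pass
objective = rlsd_step_objective(rewards, logp_s, logp_t)
print("toy objective:", float(objective))
```

In a real loop the two log-prob arrays come from two forward passes of the same weights, and the objective is maximized with respect to the student log-probs through the usual GRPO policy-ratio term, which this toy leaves out.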

How to use it

The reported configuration (Qwen3-VL-8B-Instruct):

λ (mixing coefficient), annealed 0.5 → 0 over the first 50 steps: blends the self-distillation weighting with the plain RLVR advantage, so training leans on token-level shaping early and on pure RLVR later.
ε_w = 0.2 is the clip bound on the token weight w_t, the stability knob analogous to PPO's ratio clip; too loose and a single high-gain token dominates the update.
G = 8, batch 256, learning rate 1e-6: otherwise a standard GRPO setup.
Choice of privileged info r. A reference answer is the cheapest; a verified reasoning trace is richer. r is only used to condition the teacher forward pass; it never enters the student's context, which is what keeps it from leaking into the policy.
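The exact blending formula is not reproduced on this page, so the sketch below rests on two labeled assumptions: the anneal is linear, and λ mixes the clipped RLSD surrogate with the plain advantage as (1 − λ)·A + λ·surrogate. Both lambda_schedule and blended_surrogate are hypothetical names. Check the form against the paper before relying on it.

```python
import numpy as np

def lambda_schedule(step, lam_start=0.5, anneal_steps=50):
    """ASSUMPTION: linear decay from lam_start to 0 over anneal_steps, then 0."""
    frac = min(max(step / anneal_steps, 0.0), 1.0)
    return lam_start * (1.0 - frac)

def blended_surrogate(adv, logp_teacher, logp_student, lam, eps_w=0.2):
    """ASSUMPTION: (1 - lam) * A + lam * min(w*A, clip(w)*A); lam = 0 is plain RLVR."""
    w = np.exp(np.sign(adv) * (logp_teacher - logp_student))
    rlsd = np.minimum(w * adv, np.clip(w, 1 - eps_w, 1 + eps_w) * adv)
    return (1.0 - lam) * adv + lam * rlsd

lps = np.array([-2.0, -1.5, -3.0])   # toy student log-probs
lpt = np.array([-1.0, -1.5, -3.5])   # toy teacher log-probs
for step in (0, 25, 50, 200):
    lam = lambda_schedule(step)
    print(step, lam, blended_surrogate(1.0, lpt, lps, lam))
```

Exposing anneal_steps as a parameter makes it straightforward to scale the anneal to a different step budget instead of hard-coding 50.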

How to integrate with it

RLSD is a modification to an existing GRPO/RLVR loop, not a new trainer. You keep the rollout sampler, the verifier, and the group-relative advantage; you add one teacher forward per response and weight each token's advantage by w_t inside the clipped surrogate. Any loop that already exposes per-token log-probs can carry it.

Data contract. Every training prompt needs a paired reference r (a reference answer, or a verified reasoning trace). r conditions only the teacher forward; it never enters the student context or the rollout, which is what stops it leaking into the policy. Prompts with no r fall back to plain RLVR for that example (w_t = 1, as the core-math degeneracy check shows).
Where it sits in the pipeline. RLSD occupies the same slot as RLVR in an SFT → distillation → RLVR pipeline: it replaces the RLVR stage when you also hold privileged references. Feed it a checkpoint warmed by SFT or by on-policy distillation (OPD); its output is a normal policy checkpoint.
Verifier reuse. The reward source is unchanged from RLVR, so existing verifiers (exact-match, unit-test, LLM judge) and everything in reward design apply as-is. RLSD changes credit assignment, not the reward.
Where w_t plugs in. It is a per-token multiplier on the advantage inside the clipped surrogate; nothing downstream of the loss (optimizer, FSDP sharding, weight sync) changes, because the teacher is the same weights.
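The data contract above can be sketched as two small helpers. The prompt template, the Example class, and both function names are hypothetical; the points being illustrated are that r appears only in the teacher context and that a missing r falls back to w_t = 1 by reusing the student log-probs.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Example:
    question: str
    reference: Optional[str] = None   # privileged info r; None means no r is available

def build_contexts(ex: Example):
    """Return (student_context, teacher_context); teacher is None when r is missing."""
    student = f"Question: {ex.question}\nAnswer:"
    if ex.reference is None:
        return student, None
    # Hypothetical template: r is placed only in the teacher's context.
    teacher = f"Question: {ex.question}\nReference: {ex.reference}\nAnswer:"
    return student, teacher

def teacher_logps_or_fallback(student_logps, teacher_logps):
    """With no teacher pass, reuse the student log-probs so Delta_t = 0 and w_t = 1."""
    return list(student_logps) if teacher_logps is None else list(teacher_logps)

student_ctx, teacher_ctx = build_contexts(Example("What is 2 + 2?", "4"))
no_ref_student, no_ref_teacher = build_contexts(Example("What is 2 + 2?"))
```

Rollouts are always sampled from the student context, so the reference never reaches the policy's own input.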

How to run it in production

The anneal is a schedule, not a constant. λ runs 0.5 → 0 over the first 50 steps, so token shaping is strongest at cold-start and training converges to pure RLVR. Scale the anneal length to your total step budget rather than copying 50 verbatim, and validate it on your run.
Monitor the clip fraction and the w_t distribution the way you would a PPO ratio. A rising fraction of tokens pinned to the 1±ε_w bound, or a heavy right tail in w_t, means a few high-gain tokens are trying to dominate: tighten ε_w or slow the anneal. This is the observable behind the high-gain-token failure mode below.
Batch the teacher forward. It is one extra forward on the same weights per response, so co-schedule it with rollout log-prob scoring instead of running it as a separate model pass. There is no second set of weights to load, sync, or shard.
Keep the RLVR guardrails on. RLSD inherits verifier gaming, entropy collapse, and zero-variance groups from RLVR, so keep the same reward-hacking and entropy monitoring an RLVR run already needs.
Pin and seed. This is a single 2026 recipe, so pin the exact config and fix rollout seeds and sampling temperature for reproducible advantage estimates before you compare anneal or ε_w settings.
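The monitoring advice above can be turned into a small per-batch diagnostic. This is a sketch: the function and metric names are illustrative, and the fractions count tokens whose raw weight falls outside the 1 ± ε_w band, which is a proxy for how often the clip is active.

```python
import numpy as np

def weight_diagnostics(logp_teacher, logp_student, adv, eps_w=0.2):
    """Summarize the token-weight distribution for one batch."""
    adv = np.broadcast_to(np.asarray(adv, dtype=float), np.shape(logp_student))
    w = np.exp(np.sign(adv) * (np.asarray(logp_teacher) - np.asarray(logp_student)))
    w = w[adv != 0]                        # tokens with A = 0 contribute no update
    if w.size == 0:
        return {"out_of_band_hi": 0.0, "out_of_band_lo": 0.0,
                "w_p50": 1.0, "w_p99": 1.0, "w_max": 1.0}
    return {
        "out_of_band_hi": float(np.mean(w > 1 + eps_w)),   # above the upper clip bound
        "out_of_band_lo": float(np.mean(w < 1 - eps_w)),   # below the lower clip bound
        "w_p50": float(np.quantile(w, 0.5)),
        "w_p99": float(np.quantile(w, 0.99)),              # a heavy right tail shows up here
        "w_max": float(w.max()),
    }

stats = weight_diagnostics(
    logp_teacher=np.array([0.0, 0.5, -0.5, 3.0]),
    logp_student=np.zeros(4),
    adv=1.0,
)
print(stats)
```

Logging these per step alongside the usual PPO-style ratio statistics gives an early signal before a few high-gain tokens start to dominate.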

How to maintain it

Treat the constants as version-specific. The λ anneal, ε_w = 0.2, and G = 8 come from one paper on one 8B VLM. Re-tune them per modality and scale; do not assume they transfer.
Keep an RLVR baseline as a regression anchor. Run plain RLVR/GRPO on the same data and verifier alongside RLSD, so you can always show the self-distillation term still adds over the reward alone (the +2.32 over GRPO is the thing to reproduce on your task). If RLSD ever trails its RLVR baseline, suspect the token weighting first: check ε_w, the λ anneal, and the teacher context before looking elsewhere.
Guard the reward gate. The sign(A) gate is load-bearing. Any refactor that drops it, or that conditions the student on r, reintroduces OPSD-style answer leakage and collapse. Cover the gate with the numpy core-math assertions and re-run them on every change; they are library-independent, so they survive torch and TRL upgrades.
Re-validate on upgrades. Re-run the core-math block after any change to the surrogate, and re-check eval on your modality after a base-model or data change, since the published evidence is multimodal-only.
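One way to guard the reward gate is a library-independent check that any candidate weight function must pass. The sketch below is an illustration with a hypothetical helper name, check_reward_gate; it accepts the sign-gated form used on this page and rejects a deliberately broken variant that drops sign(A).

```python
import numpy as np

def check_reward_gate(weight_fn, atol=1e-8):
    """Raise AssertionError unless weight_fn(logp_teacher, logp_student, adv) is sign-gated."""
    lps = np.array([-2.0, -1.0, -3.0])
    lpt = np.array([-1.0, -1.5, -3.0])
    rho = np.exp(lpt - lps)                                   # P_T / P_S
    assert np.allclose(weight_fn(lpt, lps, +1.0), rho, atol=atol), "A>0 must give P_T/P_S"
    assert np.allclose(weight_fn(lpt, lps, -1.0), 1.0 / rho, atol=atol), "A<0 must give P_S/P_T"
    assert np.allclose(weight_fn(lpt, lps, 0.0), 1.0, atol=atol), "A=0 must give w=1"

def gated(lpt, lps, adv):
    """The form used on this page: w_t = exp(sign(A) * Delta_t)."""
    return np.exp(np.sign(adv) * (lpt - lps))

def ungated(lpt, lps, adv):
    """A deliberately broken refactor that drops sign(A)."""
    return np.exp(lpt - lps)

check_reward_gate(gated)            # passes silently
try:
    check_reward_gate(ungated)
    caught = False
except AssertionError:
    caught = True                   # the broken variant is rejected
print("ungated variant rejected:", caught)
```

Because it depends only on numpy, a check like this can run in CI independently of the training framework's version.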

How to scale it

RLSD has the systems shape of RLVR/GRPO: rollout-dominated, split into a trainer and a rollout generator (async RL systems), plus one extra forward pass per response for the teacher. Because the teacher is the same weights conditioned on a different context, there is no second model to host or sync, so the memory and weight-transfer story is unchanged from GRPO; the only added cost is that teacher forward, cheap next to the rollouts. Everything in reward design and the RLVR verifier engineering applies unchanged.

Results

Reported on Qwen3-VL-8B-Instruct across five multimodal reasoning benchmarks (higher is better); RLSD has the highest score in every column of the table below, while OPSD stays within about one point of the base model on average:¹

Method	MMMU	MathVista	MathVision	ZeroBench	WeMath	Avg
Base	62.44	73.80	47.37	19.76	54.10	51.49
GRPO (RLVR)	65.11	76.20	48.82	22.60	56.57	53.86
OPSD	63.82	75.10	47.53	21.06	54.95	52.49
RLSD	67.22	78.10	52.73	24.85	58.00	56.18

RLSD is +2.32 average points over GRPO and +4.69 over base; OPSD alone adds only +1.00 over base, consistent with the paper's claim that privileged-context distillation needs the reward-gated direction to avoid collapse.¹
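The averages and deltas quoted above can be reproduced from the per-benchmark scores in the table. The snippet below is a plain arithmetic check on the table's numbers and adds no new results.

```python
# Per-benchmark scores copied from the table: MMMU, MathVista, MathVision, ZeroBench, WeMath.
scores = {
    "Base": [62.44, 73.80, 47.37, 19.76, 54.10],
    "GRPO": [65.11, 76.20, 48.82, 22.60, 56.57],
    "OPSD": [63.82, 75.10, 47.53, 21.06, 54.95],
    "RLSD": [67.22, 78.10, 52.73, 24.85, 58.00],
}

avg = {name: round(sum(vals) / len(vals), 2) for name, vals in scores.items()}
gain_over_grpo = round(avg["RLSD"] - avg["GRPO"], 2)
gain_over_base = round(avg["RLSD"] - avg["Base"], 2)
opsd_over_base = round(avg["OPSD"] - avg["Base"], 2)

# True when RLSD has the top score on each of the five benchmarks.
rlsd_best_everywhere = all(
    scores["RLSD"][i] == max(vals[i] for vals in scores.values()) for i in range(5)
)
print(avg, gain_over_grpo, gain_over_base, opsd_over_base, rlsd_best_everywhere)
```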

Failure modes

No privileged information. Without a reference answer or trace to condition the teacher, there is no w_t signal and RLSD degenerates to plain RLVR; it is not a drop-in for reward-only settings.
OPSD-style leakage if you drop the reward gate. Using the privileged-context teacher without sign(A) gating can leak the answer into the policy and collapse: the reward direction is load-bearing, not optional.¹
Weight clip / anneal mis-set. Too large ε_w or too slow a λ anneal lets high-gain tokens dominate and destabilizes training; the paper anneals self-distillation out over 50 steps.
Over-generalizing the evidence. Results are multimodal-reasoning-only on one 8B VLM; do not assume the exact gains transfer to text-only or other scales without checking.
Inherited RLVR failure modes. Verifier gaming, entropy collapse, and zero-variance groups all still apply; RLSD changes credit assignment, not the reward source (RLVR).

References

Self-Distilled RLVR (introduces RLSD): https://arxiv.org/abs/2604.03128
RLVR (verifiable-reward RL, the direction signal): https://arxiv.org/abs/2411.15124
On-policy self-distillation / OPSD (the privileged-context self-teacher): https://arxiv.org/abs/2601.18734
GRPO (the underlying critic-free optimizer): https://arxiv.org/abs/2402.03300
SDPG: Self-Distilled Policy Gradient (concurrent sibling, arXiv 2606.04036): https://arxiv.org/abs/2606.04036

A concurrent sibling in the same family is SDPG (Self-Distilled Policy Gradient), which likewise treats on-policy self-distillation as dense supervision for sparse-reward RL, combining group-relative verifier advantages (normalized by group std, as in GRPO) with exact full-vocabulary on-policy self-distillation and a reference-policy KL term.² Where RLSD folds the self-teacher into a per-token weight on the RLVR advantage, SDPG composes the verifier advantage, the full-vocabulary distillation signal, and the KL as an integrated objective; both are instances of using the model's own privileged or full-distribution signal to densify a verifier's sparse reward.

¹ Yang et al., Self-Distilled RLVR (RLSD): uses self-distillation for "token-level policy differences for determining fine-grained update magnitudes" while RLVR gives "reliable update directions from environmental feedback"; the teacher is the same model conditioned on privileged information r (reference answer / verified trace), giving token weight w_t = (P_T/P_S)^sign(A); "simultaneously harness[es] the strengths of both RLVR and OPSD, achieving a higher convergence ceiling and superior training stability". Demonstrated on Qwen3-VL-8B-Instruct across MMMU/MathVista/MathVision/ZeroBench/WeMath (avg 56.18 vs GRPO 53.86, OPSD 52.49, base 51.49). https://arxiv.org/abs/2604.03128
² Liu et al., Self-Distilled Policy Gradient (SDPG, arXiv 2606.04036): on-policy self-distillation (a model conditioning on privileged context to supervise its own generations) as dense supervision for sparse-reward RL; combines group-relative verifier advantages normalized by the group standard deviation, exact full-vocabulary on-policy self-distillation, and reference-policy KL regularization, reporting improved stability and performance. https://arxiv.org/abs/2606.04036