
GRPO variants and training tricks

Scope: the practical modifications that turn vanilla GRPO into a stable, scalable reasoning-RL recipe. This page covers what vanilla GRPO gets wrong, why and when each fix helps, how to apply them (with runnable code for the core fixes and the TRL config that selects them), and the training-health metrics that tell you a run is working. It is the practitioner companion to the GRPO algorithm page, the reward design principles, and the compute planning in RL scaling laws.

Trick names and defaults track fast-moving papers and library versions; treat every hyperparameter here as a starting point to validate. The Python example is executed and asserted (numpy); the TRL config is a reference template.

flowchart TB
  V["Vanilla GRPO"] --> P1["Entropy collapse"]
  V --> P2["Length and difficulty bias"]
  V --> P3["Sampler / trainer probability gap"]
  P1 --> F1["DAPO clip-higher and dynamic sampling; CISPO keeps rare tokens"]
  P2 --> F2["DAPO token-level loss; Dr. GRPO fixed denominator, no std"]
  P3 --> F3["TIS truncated importance sampling; GSPO sequence-level ratio"]

What it is

Vanilla GRPO scales poorly for long reasoning because of three failure patterns, each with a named fix:

Entropy collapse. Output distributions sharpen, exploration dies, and reward plateaus early. Fixed by DAPO's clip-higher and dynamic sampling, and by CISPO keeping rare exploration tokens.
Length and difficulty bias. Sample-level loss normalization under-weights tokens in long responses, and the standard-deviation term in the advantage over-weights very easy or very hard prompts. Fixed by DAPO's token-level loss and Dr. GRPO's fixed denominator and dropped std.
Off-policy drift from the engine gap. The inference engine that samples rollouts (vLLM, SGLang) and the trainer (FSDP, DeepSpeed) compute token probabilities slightly differently, quietly turning on-policy RL into off-policy RL. Fixed by truncated importance sampling and, structurally, by GSPO's sequence-level ratio.

Why use it

Stability and higher ceilings. DAPO packages four changes and reports lifting Qwen2.5-32B from about 30% (naive GRPO) to 50% on AIME 2024, past the 47% reported for DeepSeek-R1-Zero-Qwen-32B, in roughly half the training steps.
Better token efficiency. Dr. GRPO's bias fixes reach a reported 43.3% AIME with a 7B model on a light setup (about 27 hours on 8 A100s), needing fewer tokens for the same accuracy.
MoE-safe updates. GSPO's sequence-level ratio removes the expert-routing volatility that per-token ratios cause in mixture-of-experts models, without routing-replay workarounds.
Recover from bad rollouts. Truncated importance sampling recovers full-precision training quality even from quantized (int8) rollouts, and improves stable BF16 runs too.

When to use it (and when not)

Turn on clip-higher and token-level loss by default for any long-reasoning GRPO run; they fix the two most common instabilities at almost no cost.
Use Dr. GRPO's bias fixes when response lengths vary widely or groups are often near-uniform in reward.
Prefer GSPO for mixture-of-experts policies, where per-token ratios interact badly with routing.
Keep TIS on whenever a separate inference engine samples the rollouts (almost always in disaggregated RL).
Do not stack every trick blindly. Some overlap (GSPO and GMPO both re-aggregate; CISPO and clip-higher both protect exploration); add one, watch the health metrics, and keep what moves them.
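The TIS bullet above can be sketched from per-token log-probabilities. This is a minimal illustration, assuming both the sampler and the trainer can report the log-probability of each sampled token; the names `logp_trainer`, `logp_sampler`, and the cap of 2.0 are illustrative choices, not values taken from a specific library.

```python
import numpy as np

def tis_weights(logp_trainer, logp_sampler, cap=2.0):
    """Per-token truncated importance weights: min(pi_trainer / pi_sampler, cap)."""
    ratio = np.exp(np.asarray(logp_trainer) - np.asarray(logp_sampler))
    return np.minimum(ratio, cap)

# Hypothetical log-probs for four sampled tokens, as scored by each engine.
logp_sampler = np.log(np.array([0.50, 0.20, 0.05, 0.010]))
logp_trainer = np.log(np.array([0.50, 0.22, 0.04, 0.200]))  # last token: large mismatch

w = tis_weights(logp_trainer, logp_sampler, cap=2.0)

# The weights rescale the per-token loss; in an autodiff framework they would
# be treated as constants (no gradient flows through them).
per_token_loss = np.full(4, 0.3)
corrected = float((w * per_token_loss).mean())
```

Tokens on which the two engines agree keep a weight near 1, while the badly mismatched token is capped instead of dominating the update.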

Architecture

The fixes attach at three points in the GRPO loop: the advantage computation (Dr. GRPO drops the std term), the loss aggregation (token-level versus sample-level; GSPO/GMPO change the level), and the importance ratio (clip-higher decouples the bounds, CISPO keeps low-probability tokens, TIS corrects the sampler-trainer gap). None changes the reward source or the rollout mechanism, so they compose with reward design and async RL systems unchanged.
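One way to see the three attachment points is to write the clipped objective with each slot labelled. The notation below is a sketch in the style of the DAPO paper and is an assumption of this example: a prompt $q$, a group of $G$ responses $o_i$ with scalar rewards $R_i$, per-response aggregation weights $w_i$, and a fixed length constant $L$.

```latex
\begin{aligned}
\mathcal{J}(\theta) &= \mathbb{E}\left[ \sum_{i=1}^{G} w_i \sum_{t=1}^{|o_i|}
  \min\Big( \rho_{i,t}\,\hat{A}_i,\;
  \operatorname{clip}\big(\rho_{i,t},\, 1-\varepsilon_{\text{low}},\, 1+\varepsilon_{\text{high}}\big)\,\hat{A}_i \Big) \right] \\[4pt]
\text{ratio:}\quad \rho_{i,t} &= \frac{\pi_\theta(o_{i,t} \mid q, o_{i,<t})}{\pi_{\theta_{\text{old}}}(o_{i,t} \mid q, o_{i,<t})} \\[4pt]
\text{advantage:}\quad \hat{A}_i &=
  \begin{cases}
    \dfrac{R_i - \operatorname{mean}(R_1,\dots,R_G)}{\operatorname{std}(R_1,\dots,R_G)} & \text{vanilla GRPO} \\[8pt]
    R_i - \operatorname{mean}(R_1,\dots,R_G) & \text{Dr. GRPO}
  \end{cases} \\[4pt]
\text{aggregation:}\quad w_i &=
  \begin{cases}
    \dfrac{1}{G\,|o_i|} & \text{sample-level (vanilla)} \\[8pt]
    \dfrac{1}{\sum_{j=1}^{G} |o_j|} & \text{token-level (DAPO)} \\[8pt]
    \dfrac{1}{G\,L} & \text{fixed constant (Dr. GRPO)}
  \end{cases}
\end{aligned}
```

Setting $\varepsilon_{\text{low}} = \varepsilon_{\text{high}}$ recovers the symmetric clip. TIS is not shown: it would multiply each token term by an extra truncated sampler-to-trainer weight.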

How to use it

The core fixes are a few lines each. This runnable example implements the four load-bearing ones and asserts the property each is meant to guarantee (executed with numpy):

# grpo_fixes.py — validated core GRPO fixes; asserts the property each one provides. numpy only.
import numpy as np

def advantage(r, drgrpo):                              # group-relative advantage
    a = r - r.mean()
    return a if drgrpo else a / (r.std() + 1e-8)       # Dr. GRPO drops the std term

def aggregate(losses, token_level):                    # losses: list of per-token arrays, one per response
    if token_level:
        return np.concatenate(losses).mean()           # every token weighted equally
    return np.mean([x.mean() for x in losses])         # sample-level: long responses under-weighted

def clip_higher(ratio, adv, lo=0.2, hi=0.28):          # decoupled PPO clip: wider upper bound
    return np.minimum(ratio * adv, np.clip(ratio, 1 - lo, 1 + hi) * adv)

def tis(ratio, cap):                                   # truncated importance sampling: bounded weight
    return np.minimum(ratio, cap)

# (1) Near-uniform group (std -> 0): Dr. GRPO keeps the advantage at the scale of the reward gap (~1e-3); vanilla rescales the same gap to order 1.
r = np.array([1.0, 1.0, 1.0, 0.999])
assert np.abs(advantage(r, True)).max() < np.abs(advantage(r, False)).max()
# (2) token-level and sample-level differ when response lengths are uneven.
losses = [np.full(10, 0.5), np.full(100, 0.1)]
assert not np.isclose(aggregate(losses, True), aggregate(losses, False))   # 0.136 vs 0.300
# (3) clip-higher permits a larger positive-advantage update than the symmetric bound.
assert clip_higher(1.5, 1.0, hi=0.28) > clip_higher(1.5, 1.0, hi=0.20)
# (4) TIS caps the importance weight.
assert tis(np.array([0.3, 1.0, 3.0, 50.0]), 2.0).max() <= 2.0
print("Dr.GRPO |adv|max:", round(float(np.abs(advantage(r, True)).max()), 3))   # -> 0.001

In a real run you select these through the library rather than editing the loss. TRL's GRPOConfig exposes the loss variant and the knobs directly:

# TRL >= 1.6; pin the version and verify field names on the installed release.
from trl import GRPOConfig
cfg = GRPOConfig(
    loss_type="dr_grpo",        # Dr. GRPO loss: fixed-constant denominator instead of per-sequence length
    scale_rewards=False,        # drop the std scaling of the advantage (R1-Zero / Dr. GRPO finding)
    epsilon=0.2, epsilon_high=0.28,   # clip-higher: decoupled lower/upper PPO bounds
    num_generations=8,          # group size G
    beta=0.0,                   # no KL term by default for verifiable-reward RL
)

How to develop with it

DAPO (four fixes). Clip-higher decouples the bounds into [1 - eps_low, 1 + eps_high] (for example 0.2 / 0.28) so exploration tokens are not held back as hard as exploitation tokens; dynamic sampling over-samples prompts and drops zero-variance groups; token-level loss averages over all tokens; overlong reward shaping replaces the harsh truncation penalty with a soft, length-aware one.
Dr. GRPO. Divide the summed loss by a fixed constant (a max token count) rather than each sequence's own length, and use A_i = r_i - mean(r) without the std term.
GSPO and GMPO. GSPO computes the importance ratio and clipping at the sequence level (stable for MoE); GMPO replaces the arithmetic mean over token ratios with a geometric mean (less sensitive to outlier tokens, uses a wider clip).
CISPO. Clip the importance-sampling weight while keeping every token's gradient contribution, so rare reflection tokens ("wait", "aha") keep influencing the policy across updates; it matters most in high-update-count setups.
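Three of the pieces described above are small enough to sketch directly: a GSPO-style sequence-level ratio, DAPO-style soft overlong shaping, and the dynamic-sampling filter. This is a minimal numpy sketch, not any library's implementation; the function names and the toy lengths (`max_len=128`, `cache=32`) are hypothetical.

```python
import numpy as np

def sequence_ratio(logp_new, logp_old):
    """GSPO-style sequence-level ratio: the length-normalized likelihood ratio,
    which equals the geometric mean of the per-token ratios."""
    diff = np.asarray(logp_new) - np.asarray(logp_old)
    return float(np.exp(diff.mean()))

def soft_overlong_penalty(length, max_len, cache):
    """DAPO-style soft length penalty: 0 inside the budget, a linear ramp down
    to -1 over the last `cache` tokens, and -1 beyond max_len."""
    if length <= max_len - cache:
        return 0.0
    if length <= max_len:
        return (max_len - cache - length) / cache
    return -1.0

def keep_group(rewards):
    """Dynamic-sampling filter: keep a prompt only if its group's rewards differ."""
    rewards = np.asarray(rewards, dtype=float)
    return bool(rewards.max() > rewards.min())

# One outlier token (ratio 10) among three unchanged tokens.
logp_old = np.log(np.array([0.5, 0.5, 0.5, 0.01]))
logp_new = np.log(np.array([0.5, 0.5, 0.5, 0.10]))
token_ratios = np.exp(logp_new - logp_old)      # approximately [1, 1, 1, 10]
seq_ratio = sequence_ratio(logp_new, logp_old)  # 10 ** 0.25, about 1.78

penalties = [soft_overlong_penalty(n, max_len=128, cache=32) for n in (96, 100, 128, 200)]
kept = [keep_group(g) for g in ([1, 1, 1, 1], [0, 0, 0, 0], [1, 0, 1, 1])]
```

The single outlier token pulls the arithmetic mean of the token ratios to 3.25 but moves the sequence-level ratio only to about 1.78, which is the damping effect the sequence-level variants rely on.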

How to maintain it

Judge a run by trajectories, not a single reward number. Response length should grow as the model learns to reason, but without exploding into gibberish. Training reward should rise steadily. Entropy should stay in a healthy band: a fast drop signals collapse, so reach for clip-higher or a KL term. Held-out validation must move with training reward; a widening gap between them indicates reward hacking (see reward design). Re-tune the clip bound or any self-distillation anneal if entropy collapses, and re-audit the reward whenever training reward and held-out quality diverge.
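The entropy and train-versus-validation checks can be automated from logged trajectories. The sketch below is one possible shape for such a check; the function names and both thresholds (`entropy_drop`, `gap_tol`) are hypothetical and would need tuning per run.

```python
import numpy as np

def mean_token_entropy(probs):
    """Mean Shannon entropy in nats; probs has shape (tokens, vocab)."""
    p = np.clip(np.asarray(probs, dtype=float), 1e-12, 1.0)
    return float(-(p * np.log(p)).sum(axis=-1).mean())

def health_flags(entropy_hist, train_reward, val_reward,
                 entropy_drop=0.5, gap_tol=0.15):
    """Return warning flags from logged trajectories (illustrative thresholds)."""
    flags = []
    # Entropy fell below a fraction of its starting value: likely collapse.
    if entropy_hist[-1] < entropy_drop * entropy_hist[0]:
        flags.append("entropy_collapse")
    # Training reward improved much more than held-out reward: possible hacking.
    train_gain = train_reward[-1] - train_reward[0]
    val_gain = val_reward[-1] - val_reward[0]
    if train_gain - val_gain > gap_tol:
        flags.append("train_val_divergence")
    return flags

uniform = np.full((1, 4), 0.25)                # maximally uncertain over 4 tokens
peaked = np.array([[0.97, 0.01, 0.01, 0.01]])  # nearly deterministic
h_start, h_end = mean_token_entropy(uniform), mean_token_entropy(peaked)

bad = health_flags([h_start, h_end], [0.30, 0.70], [0.28, 0.33])
good = health_flags([1.2, 1.0], [0.30, 0.50], [0.28, 0.45])
```

Comparing the first and last logged values is the crudest possible trend test; a windowed slope would be more robust on noisy curves.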

How to run it in production

Default to clip-higher plus token-level loss plus truncated importance sampling for any long-reasoning run; they fix the two most common instabilities and the sampler-trainer gap at almost no cost. For mixture-of-experts policies, move to sequence-level aggregation (GSPO) before scaling up, since per-token ratios thrash with expert routing. Use a reasonably large effective batch (for example 64 prompts times 8 rollouts), keep a spread of difficulties, drop trivially guessable items, and match the prompt template to the base model, since template choice measurably changes base-model behavior. The compute-vs-reward planning for all of this is in RL scaling laws.
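The advice to keep a spread of difficulties and drop trivially guessable items can be applied as an offline filter before training. This sketch assumes a per-prompt pass rate has already been estimated, for example from a few rollouts of the starting policy; the function name `curate` and the 0.05 and 0.95 bounds are hypothetical.

```python
import numpy as np

def curate(pass_rates, low=0.05, high=0.95):
    """Return indices of prompts whose estimated pass rate is strictly inside (low, high)."""
    rates = np.asarray(pass_rates, dtype=float)
    return np.flatnonzero((rates > low) & (rates < high))

# Hypothetical pass rates estimated from 8 rollouts per prompt.
pass_rates = [0.0, 0.125, 0.5, 0.875, 1.0]
kept = curate(pass_rates)  # drops the always-wrong and always-right prompts

prompts_per_step, rollouts_per_prompt = 64, 8
effective_batch = prompts_per_step * rollouts_per_prompt  # rollouts per optimizer step
```

Prompts at either extreme tend to produce zero-variance groups, so removing them up front reduces how much online filtering has to discard.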

Failure modes

Entropy collapse. Symmetric clipping starves exploration; decouple the bounds (clip-higher) or add KL.
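Zero-variance batches. Groups whose responses are all right or all wrong have zero advantage and contribute no gradient; dynamic-sample them out and vary difficulty.
Length gaming. Sample-level normalization divides each response's loss by its own length, so every token of a long wrong answer is penalized less and wrong answers drift longer; switch to token-level loss or a fixed denominator.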
Silent off-policy drift. Skipping the importance-sampling correction with a fast or quantized engine destabilizes training; keep TIS on.
MoE routing volatility. Token-level ratios thrash with expert routing; move to sequence-level (GSPO).
Reward hacking. Training reward climbs while held-out quality stalls; audit the reward and the eval set.

References

Cameron R. Wolfe, "GRPO Tricks: Making RL Actually Work": https://cameronrwolfe.substack.com/p/grpo-tricks
Yu et al., DAPO: An Open-Source LLM Reinforcement Learning System at Scale: https://arxiv.org/abs/2503.14476
Liu et al., Understanding R1-Zero-Like Training: A Critical Perspective (Dr. GRPO): https://arxiv.org/abs/2503.20783
Zheng et al., Group Sequence Policy Optimization (GSPO): https://arxiv.org/abs/2507.18071
Shao et al., DeepSeekMath (original GRPO): https://arxiv.org/abs/2402.03300
TRL GRPO Trainer docs: https://huggingface.co/docs/trl/grpo_trainer