
GRPO (group relative policy optimization)

Scope: critic-free reinforcement learning for LLMs: sample a group of completions per prompt, score each against an (often verifiable) reward, and update on group-relative advantages. GRPO is the default RLVR method behind reasoning models, the algorithm underneath the RL libraries, and one stage of the post-training pipeline in fine-tuning and post-training.

The library snippets below (TRL, verl) are reference templates against real APIs: pin versions and validate before production use. The numpy blocks are self-contained and are executed and asserted on this page.

What it is

GRPO is a variant of PPO that removes the value/critic network. For each prompt it samples a group of G completions, scores each with a reward function or reward model, and computes each completion's advantage relative to the group: A_i = (r_i - mean(r)) / std(r), where the division by std is optional. Because the group mean replaces a learned baseline, there is no separate critic to train or hold in memory, which saves roughly half of PPO's training memory. The policy is then updated with a clipped surrogate objective (as in PPO) plus an optional KL penalty toward a frozen reference. GRPO was introduced in DeepSeekMath and scaled in DeepSeek-R1 for verifiable-reward reasoning (References).
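
Written out, the objective described above takes roughly the following form. This is a sketch in the style of the DeepSeekMath paper; the symbols $q$ (prompt), $o_i$ (the i-th completion), $\rho_{i,t}$ (token-level probability ratio), $\varepsilon$ (clip range) and $\beta$ (KL coefficient) are notation introduced here, not names used elsewhere on this page.

```latex
A_i = \frac{r_i - \operatorname{mean}(r_1,\dots,r_G)}{\operatorname{std}(r_1,\dots,r_G)},
\qquad
\rho_{i,t}(\theta) =
  \frac{\pi_\theta(o_{i,t} \mid q, o_{i,<t})}{\pi_{\theta_{\mathrm{old}}}(o_{i,t} \mid q, o_{i,<t})}

\mathcal{J}(\theta) = \mathbb{E}\left[
  \frac{1}{G} \sum_{i=1}^{G} \frac{1}{|o_i|} \sum_{t=1}^{|o_i|}
  \Bigl(
    \min\bigl(\rho_{i,t} A_i,\;
              \operatorname{clip}(\rho_{i,t},\, 1-\varepsilon,\, 1+\varepsilon)\, A_i\bigr)
    - \beta\, D_{\mathrm{KL}}\bigl[\pi_\theta \,\|\, \pi_{\mathrm{ref}}\bigr]
  \Bigr)
\right]
```

Setting $\beta = 0$ drops the reference term, and omitting the division by std leaves the mean-centred advantage only.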

Why use it

Critic-free: no value head to train or store; ~half PPO's optimizer/activation memory, simpler to stabilise.
Verifiable rewards: a programmatic checker (answer match, unit tests, format/regex, judge) gives a clean signal for maths, code, and tool-use, where preferences (DPO) are too coarse.
Online improvement: the model learns from its own sampled rollouts, lifting capability beyond what SFT demonstrations contain.
Proven at scale: the method behind R1-style reasoning gains; broad library support (TRL, verl, slime).

When to use it (and when not)

Use GRPO when a reward can be computed (correctness, tests pass, schema valid) and you want to push reasoning/agentic skill past SFT, the third stage in fine-tuning and post-training.
Prefer DPO when you only have offline (chosen, rejected) preference pairs and no rollout budget; it is far cheaper and more stable.
Prefer SFT/LoRA first: cold-start format and behaviour before any RL; many tasks never need RL.
Avoid if the reward is noisy or hackable. A bad reward trains a bad model fast.

Architecture

flowchart LR
  P["Prompt batch"] --> ROLL["G rollouts (vLLM / SGLang)"]
  ROLL --> RWD["Reward fn / model"]
  RWD --> ADV["Group-relative advantage"]
  ADV --> UPD["Policy update (clipped + KL)"]
  UPD -->|"weight sync"| ROLL
  REF["Frozen reference"] -.->|"KL penalty"| UPD

Two things dominate: the reward defines the objective, and the weight sync back to the rollout engine defines the cost. Everything else (group sampling, advantage normalization, the clipped update) is the small, deterministic core validated below.
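
To make the loop in the diagram concrete, here is a toy sketch that runs it end to end on a four-action categorical "policy" instead of an LLM. Everything in it (`toy_grpo_step`, the learning rate, the reward that favours action 2) is a hypothetical stand-in chosen for illustration. With one update per rollout batch the probability ratio is 1, so the clip is inactive and the step reduces to an advantage-weighted policy gradient.

```python
import numpy as np


def softmax(logits):
    z = np.exp(logits - logits.max())
    return z / z.sum()


def toy_grpo_step(logits, reward_fn, rng, group_size=8, lr=0.5, eps=1e-8):
    """One on-policy GRPO-style step for a categorical policy over a few actions."""
    probs = softmax(logits)
    actions = rng.choice(len(logits), size=group_size, p=probs)   # G rollouts
    rewards = np.array([reward_fn(a) for a in actions], dtype=np.float64)
    adv = (rewards - rewards.mean()) / (rewards.std() + eps)      # group-relative
    grad = np.zeros_like(logits)
    for a, adv_i in zip(actions, adv):
        one_hot = np.zeros_like(logits)
        one_hot[a] = 1.0
        grad += adv_i * (one_hot - probs)   # d log pi(a) / d logits for a softmax
    return logits + lr * grad / group_size  # gradient ascent on the objective


rng = np.random.default_rng(0)
logits = np.zeros(4)                        # uniform starting policy
for _ in range(300):
    # hypothetical verifiable reward: 1.0 only when action 2 is sampled
    logits = toy_grpo_step(logits, lambda a: float(a == 2), rng)
final_probs = softmax(logits)
assert final_probs.argmax() == 2
```

Groups in which every sample earns the same reward leave the logits unchanged, which is the zero-gradient case discussed throughout this page.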

How it works (validated core math)

The two computations that make GRPO distinct from PPO are the group-relative advantage (which replaces the critic) and the clipped surrogate update (shared with PPO). Both are a few lines of numpy. The blocks below are the executable companions to the TRL/verl templates in the later sections: they reproduce, on plain arrays, exactly what those libraries compute internally, so the maths can be checked without a GPU. Either block runs with only Python and numpy installed.

Group-relative advantage

Centring each reward on its group mean is the baseline; dividing by the group std (the default, disabled by scale_rewards=False in the Dr.GRPO / R1-Zero recipe) rescales it. The edge case that matters in practice is a group where every sample scores the same: the advantage must collapse to zero (no gradient), never a NaN. That is the frac_reward_zero_std signal referenced throughout this page.

import numpy as np


def group_relative_advantage(rewards, scale_by_std=True, eps=1e-8):
    """GRPO advantage: center each reward on its group mean, optionally scale by the
    group std. rewards has shape (num_prompts, group_size); returns the same shape."""
    r = np.asarray(rewards, dtype=np.float64)
    assert r.ndim == 2, "rewards must be (num_prompts, group_size)"
    mean = r.mean(axis=1, keepdims=True)
    centered = r - mean
    if not scale_by_std:
        return centered
    std = r.std(axis=1, keepdims=True)          # population std (ddof=0)
    return centered / (std + eps)


# happy path: a known group maps to known advantages
adv = group_relative_advantage(np.array([[0.0, 1.0]]))   # mean 0.5, std 0.5
assert np.allclose(adv, [[-1.0, 1.0]], atol=1e-6), adv

# property: advantages are zero-mean within every group (the baseline is subtracted)
rng = np.random.default_rng(0)
big = rng.normal(size=(16, 8))
adv_big = group_relative_advantage(big)
assert np.allclose(adv_big.mean(axis=1), 0.0, atol=1e-8)
# with std scaling each group has unit std
assert np.allclose(adv_big.std(axis=1), 1.0, atol=1e-3)

# edge / adversarial: a group where every sample scores identically. This is the
# frac_reward_zero_std failure case; the advantage must be 0 and never NaN/Inf.
flat = np.array([[1.0, 1.0, 1.0, 1.0]])
adv_flat = group_relative_advantage(flat)
assert np.all(np.isfinite(adv_flat)), "zero-std group produced NaN/Inf"
assert np.allclose(adv_flat, 0.0), adv_flat

# scale_rewards=False (Dr.GRPO / R1-Zero finding): mean-center only, no std division
adv_ns = group_relative_advantage(np.array([[0.0, 1.0]]), scale_by_std=False)
assert np.allclose(adv_ns, [[-0.5, 0.5]]), adv_ns      # centered, not unit-scaled

# equivalence: the vectorized form matches an explicit slow reference loop
def slow_advantage(rewards, eps=1e-8):
    out = []
    for group in rewards:
        m = sum(group) / len(group)
        var = sum((x - m) ** 2 for x in group) / len(group)
        s = var ** 0.5
        out.append([(x - m) / (s + eps) for x in group])
    return np.array(out)

assert np.allclose(group_relative_advantage(big), slow_advantage(big), atol=1e-6)
print("group-relative advantage: all asserts passed")

The clipped surrogate update

The advantage feeds PPO's clipped surrogate: the ratio between the new and old policy is clipped to [1-epsilon, 1+epsilon] and the objective takes the pessimistic (min) branch. This is what epsilon (and the low/high bounds under num_iterations > 1) control in GRPOConfig, and where a zero advantage produces a zero gradient.

import numpy as np


def clipped_surrogate(ratio, advantage, eps=0.2):
    """PPO/GRPO clipped surrogate objective (to be maximized; the loss is its negative
    mean). ratio = exp(logp_new - logp_old); advantage is the group-relative advantage."""
    ratio = np.asarray(ratio, dtype=np.float64)
    advantage = np.asarray(advantage, dtype=np.float64)
    unclipped = ratio * advantage
    clipped = np.clip(ratio, 1 - eps, 1 + eps) * advantage
    return np.minimum(unclipped, clipped)


eps = 0.2
# known values across the four quadrants of (sign of advantage, ratio vs band)
assert np.isclose(clipped_surrogate(1.5, 2.0, eps), 1.2 * 2.0)    # +adv, ratio high -> capped
assert np.isclose(clipped_surrogate(0.5, 2.0, eps), 0.5 * 2.0)    # +adv, ratio low  -> flows
assert np.isclose(clipped_surrogate(1.5, -2.0, eps), 1.5 * -2.0)  # -adv, ratio high -> flows
assert np.isclose(clipped_surrogate(0.5, -2.0, eps), 0.8 * -2.0)  # -adv, ratio low  -> capped

# property: the objective never exceeds the unclipped term (pessimistic lower bound)
rng = np.random.default_rng(1)
ratio = np.exp(rng.normal(scale=0.3, size=10000))     # positive ratios around 1
adv = rng.normal(size=10000)
obj = clipped_surrogate(ratio, adv, eps)
assert np.all(obj <= ratio * adv + 1e-12), "surrogate exceeded the unclipped term"

# edge: ratios inside the clip band -> clip is identity -> objective == unclipped
inband = np.array([0.9, 1.0, 1.1])
a = np.array([1.0, -1.0, 2.0])
assert np.allclose(clipped_surrogate(inband, a, eps), inband * a)

# edge / adversarial: zero advantage (frac_reward_zero_std groups) -> zero objective,
# hence zero gradient, so these prompts teach the policy nothing.
assert np.allclose(clipped_surrogate(ratio, np.zeros_like(adv), eps), 0.0)

# equivalence: vectorized matches an explicit slow reference
def slow(ratio, adv, eps):
    out = []
    for rr, aa in zip(ratio, adv):
        u = rr * aa
        c = min(max(rr, 1 - eps), 1 + eps) * aa
        out.append(min(u, c))
    return np.array(out)

assert np.allclose(obj, slow(ratio, adv, eps), atol=1e-12)
print("clipped surrogate: all asserts passed")
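
The optional KL penalty toward the frozen reference can be sketched in the same style. The block below assumes the low-variance per-token estimator used in the DeepSeekMath formulation, exp(d) - d - 1 with d = logp_ref - logp_policy; the function name `kl_penalty_k3` is hypothetical, and a given library may expose other estimators, so treat this as an illustration, not as any library's implementation.

```python
import numpy as np


def kl_penalty_k3(logp_policy, logp_ref):
    """Per-token estimate of KL(policy || ref) for tokens sampled from the policy:
    exp(d) - d - 1 with d = logp_ref - logp_policy. Non-negative by construction."""
    d = np.asarray(logp_ref, dtype=np.float64) - np.asarray(logp_policy, dtype=np.float64)
    return np.exp(d) - d - 1.0


p = np.array([0.5, 0.3, 0.2])      # toy policy distribution over three tokens
q = np.array([0.2, 0.5, 0.3])      # toy reference distribution

# identical policies: the penalty vanishes
assert np.allclose(kl_penalty_k3(np.log(p), np.log(p)), 0.0)
# the estimator is non-negative for every token
assert np.all(kl_penalty_k3(np.log(p), np.log(q)) >= 0.0)
# its expectation under the policy equals the exact KL divergence
expected = np.sum(p * kl_penalty_k3(np.log(p), np.log(q)))
assert np.isclose(expected, np.sum(p * np.log(p / q)))
print("kl penalty: all asserts passed")
```

The per-token objective is then the clipped surrogate minus beta times this estimate; with beta=0.0 the term disappears and no reference forward pass is needed.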

How to use it

TRL's GRPOTrainer takes a model, one or more reward_funcs, and a prompt dataset. A reward function receives prompts and completions (plus any extra dataset columns) and returns a list of floats. Generation is the bottleneck, so back it with vLLM (inference serving).

Reference template (TRL >= 1.6, not executed here); the advantage and clip maths it drives are the validated blocks above.

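# train_grpo.py: TRL >= 1.6; verify GRPOConfig fields on the installed version
import re

from datasets import load_dataset
from trl import GRPOConfig, GRPOTrainer

def extract_final(completion):
    # Hypothetical helper (not part of TRL): contents of the last \boxed{...}, else None.
    # Accepts a plain string or a conversational completion (list of message dicts).
    text = completion if isinstance(completion, str) else completion[-1]["content"]
    found = re.findall(r"\\boxed\{([^}]*)\}", text)
    return found[-1].strip() if found else None

def reward_correct(prompts, completions, answer, **kw):     # verifiable reward
    # `answer` is filled from the dataset column of the same name; rename it to match your data
    return [1.0 if extract_final(c) == a else 0.0 for c, a in zip(completions, answer)]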

trainer = GRPOTrainer(
    model="Qwen/Qwen3-8B",
    reward_funcs=[reward_correct],
    args=GRPOConfig(
        num_generations=8,            # group size G
        beta=0.0,                     # KL coeff; 0.0 is the TRL default (KL term off)
        use_vllm=True,                # vllm_mode="colocate" by default
        max_completion_length=2048,
        bf16=True,
    ),
    train_dataset=load_dataset("trl-lib/DeepMath-103K", split="train"),
)
trainer.train()

accelerate launch train_grpo.py
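
Before a long run it can help to check a reward function against the contract described above (prompts, completions and extra columns in, one float per completion out). The checker below is a minimal sketch in plain Python: `check_reward_fn` and `reward_length_ok` are hypothetical names, and the assumption that the trainer passes everything as keyword arguments should be verified against the installed TRL version.

```python
def check_reward_fn(reward_fn, prompts, completions, **columns):
    """Call reward_fn with keyword arguments and validate the shape of its result."""
    out = reward_fn(prompts=prompts, completions=completions, **columns)
    assert isinstance(out, list), "reward function must return a list"
    assert len(out) == len(completions), "expected one reward per completion"
    assert all(isinstance(x, float) for x in out), "rewards must be floats"
    return out


def reward_length_ok(prompts, completions, max_chars, **kw):
    # Hypothetical example reward: 1.0 if the completion fits its per-row budget.
    # **kw absorbs any additional keyword arguments the trainer may pass.
    return [1.0 if len(c) <= m else 0.0 for c, m in zip(completions, max_chars)]


rewards = check_reward_fn(
    reward_length_ok,
    prompts=["p", "p"],
    completions=["short", "a much longer completion"],
    max_chars=[10, 10],            # stands in for an extra dataset column
)
assert rewards == [1.0, 0.0]
```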

How to develop with it

Iterate on three things, in order: the reward, the group size, and the regularisation.

Reward design: pass several functions (reward_funcs=[correct, format]); TRL sums them (weighted by reward_weights). Keep each cheap and deterministic; validate on held-out cases before a full run. The cross-cutting principles (shaping, normalization, hack-resistance) are in reward design for RL post-training.
Group size num_generations (G): larger G gives lower-variance advantages but more rollout cost; 8 to 16 is common. Watch frac_reward_zero_std: prompts where every sample is right or wrong contribute no gradient (the zero-std edge case validated above).
beta / KL: TRL defaults beta=0.0 (no KL term), following R1-Zero-style practice; raise it if the policy drifts off the reference. num_iterations (μ) > 1 reuses each rollout batch for multiple clipped updates (epsilon low/high bound the ratio).
Loss variants: loss_type="dapo" or "dr_grpo" remove the response-length bias of the original sample-level loss (References).

Reference template (TRL, not executed here); scale_rewards=False is the mean-centre-only path validated in the advantage block above.

GRPOConfig(num_generations=16, beta=0.01, num_iterations=2,
           loss_type="dr_grpo", scale_rewards=False)   # disable std scaling (R1-Zero finding)
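
The length bias mentioned under loss variants is easier to see on a tiny example. The sketch below is one reading of how the three normalisers differ (per-sequence mean for "grpo", a global token mean for "dapo", a constant batch-size-times-max-length divisor for "dr_grpo"); `aggregate_loss` is a hypothetical helper, and the exact reductions should be checked against the source of the installed library.

```python
import numpy as np


def aggregate_loss(per_token_loss, mask, loss_type, max_completion_length=None):
    """Reduce a (num_sequences, max_len) per-token loss to a scalar.
    mask is 1.0 on real completion tokens and 0.0 on padding."""
    loss = np.asarray(per_token_loss, dtype=np.float64)
    mask = np.asarray(mask, dtype=np.float64)
    if loss_type == "grpo":       # mean within each sequence, then across sequences
        return float(np.mean((loss * mask).sum(axis=1) / np.maximum(mask.sum(axis=1), 1.0)))
    if loss_type == "dapo":       # every valid token in the batch weighs the same
        return float((loss * mask).sum() / np.maximum(mask.sum(), 1.0))
    if loss_type == "dr_grpo":    # constant divisor, independent of response length
        return float((loss * mask).sum() / (loss.shape[0] * max_completion_length))
    raise ValueError(f"unknown loss_type: {loss_type}")


# two completions padded to 8 tokens: one 2 tokens long, one 8 tokens long
mask = np.array([[1, 1, 0, 0, 0, 0, 0, 0], [1, 1, 1, 1, 1, 1, 1, 1]], dtype=np.float64)
loss_on_short = mask * np.array([[1.0], [0.0]])   # unit loss on the short one only
loss_on_long = mask * np.array([[0.0], [1.0]])    # unit loss on the long one only

# "grpo": both contribute 0.5 in total, so each token of the long response counts 4x less
assert np.isclose(aggregate_loss(loss_on_short, mask, "grpo"), 0.5)
assert np.isclose(aggregate_loss(loss_on_long, mask, "grpo"), 0.5)
# "dapo" and "dr_grpo": the contribution grows with the number of tokens
assert np.isclose(aggregate_loss(loss_on_short, mask, "dapo"), 0.2)
assert np.isclose(aggregate_loss(loss_on_long, mask, "dapo"), 0.8)
assert np.isclose(aggregate_loss(loss_on_short, mask, "dr_grpo", 8), 0.125)
assert np.isclose(aggregate_loss(loss_on_long, mask, "dr_grpo", 8), 0.5)
```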

How to integrate it

GRPO plugs into two systems it does not own: the inference engine that produces rollouts, and the post-training stack whose SFT'd checkpoint it starts from.

With the rollout inference engine

GRPO does not serve a model, but it depends on an inference engine for the rollout phase, which typically dominates wall-clock. Rollouts run on vLLM or SGLang (inference serving). The trade-off is colocate (rollout shares trainer GPUs, default in TRL) vs server/disaggregated (dedicated rollout GPUs; see disaggregated inference). Note the training-inference mismatch: vLLM and the trainer can produce slightly different distributions, so TRL applies truncated importance sampling by default. Keep it on.
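
The idea behind truncated importance sampling can be sketched on plain arrays: each token's loss is reweighted by the ratio between the trainer's and the rollout engine's probability for that token, capped from above so that rare large ratios cannot dominate the update. The function name and the cap value of 2.0 are assumptions made for this illustration, not TRL's parameter names or defaults.

```python
import numpy as np


def truncated_is_weights(logp_trainer, logp_rollout, cap=2.0):
    """Per-token weight min(exp(logp_trainer - logp_rollout), cap)."""
    diff = np.asarray(logp_trainer, dtype=np.float64) - np.asarray(logp_rollout, dtype=np.float64)
    return np.minimum(np.exp(diff), cap)


logp_rollout = np.log(np.array([0.50, 0.10, 0.50]))   # what the inference engine sampled with
logp_trainer = np.log(np.array([0.50, 0.50, 0.10]))   # what the trainer recomputes
weights = truncated_is_weights(logp_trainer, logp_rollout)

# agreement -> weight 1; ratio 5 is truncated to the cap; ratio 0.2 passes through
assert np.allclose(weights, [1.0, 2.0, 0.2])
```

When the two engines agree exactly, every weight is 1 and the correction is a no-op, which is why leaving it enabled costs little.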

In the post-training stack

GRPO is RL fine-tuning: it adapts model weights and forms the RLVR stage of the post-training stack (SFT → DPO → GRPO). It assumes an SFT'd starting policy (SFT and LoRA) and a reference model; it builds on SFT rather than replacing it. The larger libraries support LoRA-on-GRPO to cut memory.

How to run it in production

The defining cost is the weight sync trainer → rollout every step (colocated) or every few steps (disaggregated). Get the topology and the sync fabric right, and pin every library version (the APIs above move fast).

Colocated (TRL default, verl): rollout and trainer share GPUs; sync is local but memory pressure is the constraint. Offload between phases, watch for OOM.
Disaggregated (slime, SkyRL): separate pools; the weight transfer wants NVLink intra-node or fast IB/RoCE with GDR inter-node, the same fabric concerns as performance tuning/disaggregated inference.
NCCL for the trainer's all-gather: NCCL_IB_HCA, NCCL_NET_GDR_LEVEL=SYS; confirm [GDRDMA] in NCCL_DEBUG=INFO. On Blackwell, FP8 rollouts cut generation cost; keep training in BF16 for stability (quantization for inference).
Provision the rollout tier to match: wall-clock time is rollout-dominated, so too few rollout GPUs leave the (expensive) training GPUs idle. Separating rollout and trainer GPUs (server mode, below) also avoids NCCL clashes between the two.
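
A back-of-envelope way to size the rollout tier in a disaggregated setup is to ask how many rollout GPUs are needed to generate one step's worth of tokens in the time the trainer spends on one step. The helper and all numbers below are hypothetical placeholders; measure throughput and step time on your own hardware before relying on the result.

```python
import math


def rollout_gpus_needed(prompts_per_step, group_size, mean_completion_tokens,
                        tokens_per_sec_per_gpu, train_step_seconds):
    """Smallest number of rollout GPUs that generates one step's tokens
    within one trainer step (ignores weight-sync time and batching effects)."""
    tokens_per_step = prompts_per_step * group_size * mean_completion_tokens
    tokens_per_gpu = tokens_per_sec_per_gpu * train_step_seconds
    return math.ceil(tokens_per_step / tokens_per_gpu)


# hypothetical numbers: 64 prompts x 8 samples x 1000 tokens = 512,000 tokens per step;
# one GPU at 2,500 tokens/s produces 150,000 tokens in a 60 s trainer step
needed = rollout_gpus_needed(64, 8, 1000, 2500.0, 60.0)
assert needed == 4
```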

How to maintain it

An RL run is a live system, not a fire-and-forget job. Watch it, checkpoint it, and version its config.

Watch the training signals: entropy (collapse means exploration died), the reward mean and frac_reward_zero_std (rising means groups stopped producing gradient), and the KL to the reference (drift means the policy is wandering). These are the leading indicators of the failure modes below.
Checkpoint and resume: persist the policy and keep the frozen reference so a run can resume after preemption or a fault; see checkpoint recovery.
Re-validate the reward whenever the data or model shifts: the reward functions are deterministic, so unit-test them on held-out cases (as in the cookbook) before every campaign; a silently broken reward trains a broken model fast.
Treat configs as version-specific: the field iterates quickly (DAPO is a direct successor addressing GRPO's length bias and clipping), so pin library and config versions and re-check field names on upgrade.
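
Two of the training signals listed above are simple enough to compute by hand, which is useful when cross-checking what a trainer logs. The sketch below assumes rewards arranged as (num_prompts, group_size) and logits as (num_tokens, vocab_size); the function names are hypothetical and the exact definitions a library logs may differ.

```python
import numpy as np


def frac_reward_zero_std(rewards, tol=1e-12):
    """Fraction of groups whose rewards are all identical (no learning signal)."""
    r = np.asarray(rewards, dtype=np.float64)
    return float(np.mean(r.std(axis=1) <= tol))


def mean_token_entropy(logits):
    """Mean Shannon entropy (in nats) of the next-token distributions."""
    z = np.asarray(logits, dtype=np.float64)
    z = z - z.max(axis=-1, keepdims=True)                    # stabilise the softmax
    logp = z - np.log(np.exp(z).sum(axis=-1, keepdims=True))
    return float(-(np.exp(logp) * logp).sum(axis=-1).mean())


rewards = np.array([[1, 1, 1, 1], [0, 1, 0, 1], [0, 0, 0, 0], [1, 0, 0, 0]])
assert frac_reward_zero_std(rewards) == 0.5                  # two of four groups are flat

uniform = np.zeros((3, 4))                                   # maximum entropy: log(4)
peaked = np.array([[100.0, 0.0, 0.0, 0.0]])                  # near-deterministic
assert np.isclose(mean_token_entropy(uniform), np.log(4))
assert mean_token_entropy(peaked) < 1e-6
```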

How to scale it

GRPO is two workloads (a trainer and a rollout generator), so multi-node scaling splits GPUs between them. TRL scales the trainer with DeepSpeed ZeRO-3 + Accelerate (DeepSpeed and ZeRO) and dedicates separate GPUs to a vLLM server. The TRL multi-node recipe runs the trainer on N nodes and a vLLM server on a separate node:

# server mode: trainer and rollout on different GPUs (avoids NCCL clashes)
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 trl vllm-serve --model Qwen/Qwen3-8B
# then in GRPOConfig: use_vllm=True, vllm_mode="server"

For frontier-scale RL, use a dedicated system, verl (colocated) or slime (disaggregated/async), both Ray-based (Ray) with FSDP/Megatron trainers and vLLM/SGLang rollouts. verl selects GRPO with algorithm.adv_estimator=grpo and actor_rollout_ref.actor.use_kl_loss=True (kl_loss_coef sets the coefficient).

Cookbook (common use cases)

1. A verifiable reward function (maths / exact-match)

Pure Python, so it is runnable and asserted here (no TRL needed). The adversarial cases guard the anchor: an empty box or a correct-but-unboxed answer scores 0, which is what makes the reward resist format-only hacking.

import re


def reward_boxed(prompts, completions, solution, **kw):
    def get(s):
        m = re.search(r"\\boxed\{([^}]*)\}", s)
        return m.group(1).strip() if m else None
    return [1.0 if get(c) == s else 0.0 for c, s in zip(completions, solution)]


prompts = ["p"] * 5
completions = [
    r"reasoning... \boxed{42}",     # correct
    r"answer \boxed{ 42 }",         # correct, surrounding whitespace tolerated
    r"answer \boxed{43}",           # wrong value
    "the final answer is 42",       # no \boxed anchor -> unrewarded
    r"empty \boxed{}",              # empty box
]
solution = ["42", "42", "42", "42", "42"]
rewards = reward_boxed(prompts, completions, solution)
assert rewards == [1.0, 1.0, 0.0, 0.0, 0.0], rewards

# adversarial: an empty box that merely echoes the required format must score 0
assert reward_boxed(["p"], [r"\boxed{}"], ["7"]) == [0.0]
# adversarial: the correct answer stated in prose but not boxed scores 0 (the checker
# requires the anchor, which is what makes the reward hard to hack by formatting alone)
assert reward_boxed(["p"], ["the answer is 7"], ["7"]) == [0.0]
print("reward_boxed: all asserts passed")
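
One limitation of the pattern `[^}]*` used above is that it stops at the first closing brace, so an answer such as `\boxed{\frac{1}{2}}` is cut short. A brace-balanced extractor is one possible remedy; `extract_boxed` below is a hypothetical helper sketched for illustration, not part of any library.

```python
def extract_boxed(text):
    """Contents of the last \\boxed{...} in text, honouring nested braces.
    Returns None if there is no \\boxed or its braces never balance."""
    marker = "\\boxed{"
    start = text.rfind(marker)
    if start == -1:
        return None
    depth, chars = 1, []
    for ch in text[start + len(marker):]:
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:                     # matching close of \boxed{
                return "".join(chars).strip()
        chars.append(ch)
    return None                                # unbalanced braces


assert extract_boxed(r"so \boxed{\frac{1}{2}}") == r"\frac{1}{2}"
assert extract_boxed(r"answer \boxed{ 42 }") == "42"
assert extract_boxed(r"first \boxed{1}, final \boxed{2}") == "2"
assert extract_boxed("the final answer is 42") is None
assert extract_boxed(r"broken \boxed{\frac{1}{2}") is None
```

An empty box yields the empty string, which still fails an exact match against any non-empty solution.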

2. Combine correctness + format rewards (TRL)

The reward_format function is pure Python and is asserted here; the GRPOTrainer wiring is a reference template (needs TRL, not executed here). TRL sums the two functions.

import re


def reward_format(prompts, completions, **kw):
    return [0.2 if re.search(r"<think>.*</think>", c, re.S) else 0.0 for c in completions]


comps = [
    "<think>reason</think> answer",         # present -> 0.2
    "<think>line1\nline2</think> done",     # multiline body, needs re.S -> 0.2
    "no tags here",                          # absent -> 0.0
    "<think>unterminated reasoning",         # opened but never closed -> 0.0
]
out = reward_format(["p"] * 4, comps)
assert out == [0.2, 0.2, 0.0, 0.0], out
print("reward_format: all asserts passed")

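# reference template (TRL, not executed here); assumes ds is a prompt dataset with a
# "solution" column and that reward_boxed from recipe 1 is in scope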
trainer = GRPOTrainer(model="Qwen/Qwen3-8B",
    reward_funcs=[reward_boxed, reward_format],
    args=GRPOConfig(num_generations=8, use_vllm=True, bf16=True), train_dataset=ds)
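
What summing the two functions means for the final reward can be checked on plain arrays. This is a minimal sketch of a weighted sum over per-function rewards; `combine_rewards` is a hypothetical name, and details such as how a library treats missing rewards are not modelled here.

```python
import numpy as np


def combine_rewards(per_func_rewards, reward_weights=None):
    """per_func_rewards has shape (num_completions, num_funcs);
    returns one weighted total per completion (weights default to 1.0)."""
    r = np.asarray(per_func_rewards, dtype=np.float64)
    if reward_weights is None:
        w = np.ones(r.shape[1])
    else:
        w = np.asarray(reward_weights, dtype=np.float64)
    assert w.shape == (r.shape[1],), "one weight per reward function"
    return r @ w


# columns: correctness (0 or 1), format (0 or 0.2)
per_func = np.array([
    [1.0, 0.2],    # correct and well formatted
    [0.0, 0.2],    # format only: earns far less than a correct answer
    [0.0, 0.0],    # neither
])
assert np.allclose(combine_rewards(per_func), [1.2, 0.2, 0.0])
assert np.allclose(combine_rewards(per_func, [1.0, 0.5]), [1.1, 0.1, 0.0])
```

Keeping the format bonus well below the correctness reward, as in this example, means a format-only completion can never outrank a correct one within a group.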

3. verl GRPO config (large-scale, Ray)

# verl: select GRPO + vLLM rollouts (pin the verl release; verify keys on the repo)
python -m verl.trainer.main_ppo \
  algorithm.adv_estimator=grpo \
  actor_rollout_ref.rollout.name=vllm \
  actor_rollout_ref.actor.use_kl_loss=True \
  actor_rollout_ref.actor.kl_loss_coef=0.001

Failure modes

Entropy collapse: outputs go deterministic, exploration dies, reward plateaus. Monitor entropy; add KL (beta>0) or use DAPO-style fixes.
Advantage/reward collapse: if every sample in a group scores identically (frac_reward_zero_std high), the advantage is zero and no learning happens (the zero-std case asserted in the core-math block); vary prompt difficulty or G.
KL drift: with beta=0.0 the policy can wander far from the reference; raise beta or clip more tightly if generations degrade.
Reward hacking: the model exploits a loophole in the reward (e.g. matching the format string without solving). Test the reward adversarially first, as in the cookbook.
Under-provisioned rollout: wall-clock time is rollout-dominated; too few rollout GPUs leave training GPUs idle.
Training-inference mismatch: disabling importance-sampling correction with a fast engine can destabilise training; leave TRL's correction on.
The field iterates fast: DAPO is a direct successor addressing GRPO's length bias and clipping; treat configs as version-specific.
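
One mitigation for the advantage-collapse case, in the spirit of the dynamic sampling described in the DAPO paper, is to drop groups whose rewards are all identical before the update and keep sampling until the batch is full. The filter below is a minimal numpy sketch of that idea under the (num_prompts, group_size) layout used on this page; `keep_informative_groups` is a hypothetical name, not a library function.

```python
import numpy as np


def keep_informative_groups(rewards, tol=1e-12):
    """Return (filtered_rewards, keep_mask), where keep_mask marks the groups
    whose rewards are not all identical and therefore carry gradient."""
    r = np.asarray(rewards, dtype=np.float64)
    keep = r.std(axis=1) > tol
    return r[keep], keep


rewards = np.array([
    [1.0, 1.0, 1.0, 1.0],   # every sample right: no signal
    [0.0, 1.0, 0.0, 1.0],   # mixed: informative
    [0.0, 0.0, 0.0, 0.0],   # every sample wrong: no signal
    [1.0, 0.0, 0.0, 0.0],   # mixed: informative
])
kept, keep = keep_informative_groups(rewards)
assert keep.tolist() == [False, True, False, True]
assert kept.shape == (2, 4)
```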

References

DeepSeekMath (GRPO): https://arxiv.org/abs/2402.03300
DeepSeek-R1 (GRPO at scale): https://arxiv.org/abs/2501.12948
TRL GRPO Trainer docs: https://huggingface.co/docs/trl/grpo_trainer
verl GRPO docs: https://verl.readthedocs.io/en/latest/algo/grpo.html
DAPO (successor): https://arxiv.org/abs/2503.14476