RLVR (reinforcement learning with verifiable rewards)

Scope: the post-training paradigm where an LLM's RL reward comes from a deterministic verifier of correctness (an answer checker, a unit-test run, a format/regex check, a proof checker) rather than a learned reward model. RLVR is the reward source under GRPO, PPO, and the RL libraries, the paradigm behind reasoning models, and the practical half of reward design whenever correctness can be computed.

Blocks that need TRL, verl, or vLLM are labelled reference templates (pin versions before production use); the numpy blocks are runnable and assert the core maths they teach.

What it is

RLVR is reinforcement learning where the reward is produced by a verification function, not a reward model. For each prompt the policy samples completions, a programmatic checker decides whether each one is verifiably correct, and the policy is rewarded only for those that are. The reward is typically binary (a fixed value for correct, 0 otherwise). RLVR replaces the learned reward model with a deterministic checker and keeps the rest of the RLHF objective.
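Written out, the objective can be sketched as below. The symbols are assumptions chosen for illustration, not notation fixed by this page: v(x, y) is the verifier reward, α the fixed reward for a correct answer, β the KL coefficient, and π_ref the frozen reference policy.

```latex
\max_{\pi_\theta}\;
\mathbb{E}_{x \sim \mathcal{D}}
\Big[
  \mathbb{E}_{y \sim \pi_\theta(\cdot \mid x)}\big[\, v(x, y) \,\big]
  \;-\; \beta\, \mathrm{KL}\big(\pi_\theta(\cdot \mid x)\,\|\,\pi_{\mathrm{ref}}(\cdot \mid x)\big)
\Big],
\qquad
v(x, y) =
\begin{cases}
\alpha & \text{if } y \text{ is verifiably correct for } x,\\
0      & \text{otherwise.}
\end{cases}
```

The only change from a reward-model setup is the definition of v: the KL-regularized expectation is the same, but the score comes from a checker instead of a learned network.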

The term was introduced and popularized by AI2's Tülu 3 (Lambert et al., 2024), which named it "a novel method we call Reinforcement Learning with Verifiable Rewards" while also framing it as "a simplified form of existing approaches".1 The technique predates the name: DeepSeekMath used rule-based math rewards, and DeepSeek-R1 applied RL with rule-based (verifiable) rewards directly to a base model to elicit reasoning, explicitly avoiding a neural reward model because it "may suffer from reward hacking".23 That last point is the core idea: a ground-truth checker is far harder to over-optimize than a learned proxy, although a checker with loopholes can still be gamed (see Failure modes).

RLVR names the reward source, not the optimizer. It is algorithm-agnostic: Tülu 3 ran it with PPO, R1 with GRPO, and the same verifiable reward drops into RLOO, REINFORCE++, DAPO, or Dr. GRPO unchanged.4 Where it applies, it is the cleanest signal in reward design.

Why use it

  • No reward model to train or game. A verifier is ground truth, so it sidesteps reward-model over-optimization: the failure where more RL against a learned reward eventually lowers true quality (reward model training). R1 chose rule-based rewards for exactly this reason.2
  • Cheap and deterministic. The reward is a function call (string match, a SymPy comparison, a test run), not a forward pass through a 7B+ reward model. It is reproducible and auditable.
  • Strong signal for reasoning. Correctness on maths, code, and structured tasks gives a crisp per-sample gradient that preferences (DPO) are too coarse to provide; it is the paradigm behind R1-style reasoning gains.
  • Extends to agents. Any task with a checkable outcome (tests pass, goal state reached) inherits the same clean signal, the basis of agentic and tool-use RL.

When to use it (and when not)

  • Use RLVR when correctness can be computed: maths (answer equivalence), code (unit tests), instruction-following with checkable constraints (IFEval-style), formal proofs (Lean), or tool-use with a verifiable end state. This is the RL stage in fine-tuning and post-training.
  • Prefer a reward model when quality is subjective (helpfulness, tone, safety) and no checker captures it; RLVR simply has nothing to score.
  • Prefer DPO when you only have offline preference pairs and no rollout budget, or SFT to cold-start format before any RL.
  • Consider on-policy distillation when a stronger teacher exists: it gives a dense per-token reward and is often far cheaper than RLVR's sparse per-sequence reward for the same capability transfer.
  • Avoid when the verifier is gameable or has poor coverage. A checker with false negatives (marks correct answers wrong) trains the model against correct behaviour; a checker with loopholes gets hacked (below).

Architecture

flowchart LR
  P["Prompt + ground truth"] --> ROLL["G rollouts (vLLM / SGLang)"]
  ROLL --> VER{"Verifier (deterministic)"}
  VER -->|"answer match / SymPy"| R["Reward: 1 if correct else 0"]
  VER -->|"unit tests in sandbox"| R
  VER -->|"format / regex / proof checker"| R
  R --> ADV["Advantage (group-relative or critic)"]
  ADV --> UPD["Policy update (PPO / GRPO / RLOO)"]
  UPD -->|"weight sync"| ROLL
  NORM["No learned reward model"] -.->|"replaced by the verifier"| VER

How to use it

The concrete work of RLVR is writing the verifier and handing it to an RL trainer as the reward. In GRPO/TRL a reward function receives prompts, completions, and any extra dataset columns and returns a list of floats. The sound way to score maths is symbolic equivalence, not string equality: HuggingFace's math-verify parses LaTeX/expressions (ANTLR4) into SymPy and compares them, so 1/2, 0.5, and \frac{1}{2} all match.

Reference template (needs pip install math-verify, SymPy-backed; not executed here):

# rlvr_reward.py: a verifiable math reward for TRL GRPOTrainer (reference template).
from math_verify import parse, verify

def reward_verifiable(prompts, completions, solution, **kwargs):
    scores = []
    for completion, gold in zip(completions, solution):
        try:
            ok = verify(parse(gold), parse(completion))   # symbolic equivalence, not string match
        except Exception:
            ok = False                                    # unparsable -> not verified
        scores.append(1.0 if ok else 0.0)
    return scores

The load-bearing maths (value equivalence beats string equality, and never crashes on garbage) is runnable and asserted here in numpy, no math-verify needed. The adversarial case is the one that matters for RLVR: naive string == marks a correct answer wrong (0.5 vs 1/2), which trains the model against correct behaviour.

import re
import numpy as np


def to_value(s):
    r"""Parse a maths answer string to a float value, or None if unparsable.
    Handles \boxed{}, $...$, LaTeX \frac{a}{b}, plain fractions a/b, decimals.
    (Raw docstring, so \b and \f are not read as backspace / form-feed escapes.)"""
    if s is None:
        return None
    s = s.strip()
    s = re.sub(r"\\boxed\{([^}]*)\}", r"\1", s)   # unwrap \boxed{...}
    s = s.replace("$", "").strip()
    m = re.fullmatch(r"\\frac\{(-?\d+)\}\{(-?\d+)\}", s)   # \frac{a}{b}
    if m:
        a, b = int(m.group(1)), int(m.group(2))
        return a / b if b else None
    m = re.fullmatch(r"(-?\d+)\s*/\s*(-?\d+)", s)          # a/b
    if m:
        a, b = int(m.group(1)), int(m.group(2))
        return a / b if b else None
    try:
        return float(s)                                    # decimal / int
    except ValueError:
        return None


def verify_value(gold, pred):
    """Numeric equivalence (a stand-in for symbolic checking): parse both, compare by value with tolerance."""
    g, p = to_value(gold), to_value(pred)
    if g is None or p is None:
        return False                                       # unparsable -> not verified
    return bool(np.isclose(g, p, rtol=0.0, atol=1e-9))


def reward_verifiable(completions, solution):
    """Binary verifiable reward, the RLVR checker (replaces a learned reward model)."""
    return [1.0 if verify_value(gold, c) else 0.0 for c, gold in zip(completions, solution)]


# --- happy path: exact string still matches -------------------------------
assert reward_verifiable(["0.5"], ["0.5"]) == [1.0]

# --- core RLVR claim: value equivalence across representations ------------
gold = ["1/2"] * 5
forms = ["1/2", "0.5", r"\frac{1}{2}", "0.50", r"$\boxed{0.5}$"]
assert reward_verifiable(forms, gold) == [1.0, 1.0, 1.0, 1.0, 1.0], reward_verifiable(forms, gold)

# --- adversarial: naive string equality FALSE-NEGATIVES a correct answer --
# The whole reason RLVR uses symbolic checking: string "==" punishes correct maths.
str_reward = [1.0 if c == g else 0.0 for c, g in zip(["0.5"], ["1/2"])]
val_reward = reward_verifiable(["0.5"], ["1/2"])
assert str_reward == [0.0] and val_reward == [1.0]        # symbolic fixes the false negative
assert str_reward != val_reward                            # they genuinely disagree

# --- boundary: a different value is wrong; a difference inside atol is right ---
assert reward_verifiable(["0.51"], ["0.5"]) == [0.0]       # wrong value -> 0
assert reward_verifiable(["0.5000000001"], ["0.5"]) == [1.0]  # within atol -> 1

# --- corruption / garbage input never crashes, scores 0 -------------------
assert reward_verifiable(["not a number", "", r"\frac{1}{0}"], ["0.5", "0.5", "0.5"]) == [0.0, 0.0, 0.0]

# --- invariant: reward is strictly binary ---------------------------------
scores = reward_verifiable(forms + ["7", "bad"], gold + ["8", "9"])
assert set(scores) <= {0.0, 1.0}, scores

print("block1 reward_verifiable: all asserts passed")
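One limitation of the regex-based unwrapping above is nested braces: a pattern like `[^}]*` stops at the first closing brace, so `\boxed{\frac{1}{2}}` would not unwrap cleanly. The sketch below shows one way to extract the last boxed answer by tracking brace depth. The helper name `extract_last_boxed` is hypothetical, and taking the last `\boxed{}` as the final answer is an assumption, not a rule stated by any library.

```python
def extract_last_boxed(text):
    r"""Return the contents of the last \boxed{...} in `text`, or None.
    Walks the string tracking brace depth so nested groups such as
    \frac{1}{2} are kept intact."""
    marker = "\\boxed{"
    start = text.rfind(marker)
    if start == -1:
        return None                      # no boxed answer at all
    depth = 1                            # we are inside the opening brace of \boxed{
    out = []
    for ch in text[start + len(marker):]:
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                return "".join(out)      # matching close of \boxed{ found
        out.append(ch)
    return None                          # unbalanced braces -> treat as no answer


# Usage: the extracted string would then go to a value/symbolic comparison.
print(extract_last_boxed(r"first \boxed{3}, on reflection \boxed{\frac{1}{2}}"))
```

Returning None for unbalanced input keeps the contract used throughout this page: anything unparsable is scored as not verified rather than raising.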

Hand the reward to a trainer. The wiring is a reference template (needs TRL, not executed here):

from trl import GRPOConfig, GRPOTrainer   # TRL >= 1.6; verify GRPOConfig fields on the installed version
trainer = GRPOTrainer(
    model="Qwen/Qwen3-8B",
    reward_funcs=[reward_verifiable],      # RLVR = this checker replaces a reward model
    args=GRPOConfig(num_generations=8, use_vllm=True, max_completion_length=2048, bf16=True),
    train_dataset=ds,                      # rows carry a `solution` (ground-truth) column
)
trainer.train()

The choice of optimizer (GRPO here, PPO in Tülu 3) is orthogonal to the reward; swap it without touching the verifier.

How to engineer the verifier

The verifier is the objective, so verifier quality dominates the run. Build it deliberately by domain:

  • Maths: \boxed{} extraction + symbolic equivalence (math-verify/SymPy), not string equality, which creates false negatives that punish correct answers. Normalize units and formatting before comparing.
  • Code: run the completion against a hidden test suite in a sandbox (container/microVM, never the host), time-limited; reward pass/fail. This is the agentic-RL reward applied to a single completion.
  • Instruction-following: regex/format checkers for verifiable constraints ("answer in JSON", "exactly 3 bullet points"); gate any format bonus on task correctness so it cannot be farmed (reward design).
  • Proofs: a proof assistant (Lean) is the verifier; the reward is whether the proof type-checks.
  • LLM-as-judge is not verifiable. A model grader is a soft, learnable proxy with the same over-optimization risk as a reward model; use it only where no deterministic check exists, and treat it as reward-model territory, not RLVR.

Test the verifier adversarially on held-out cases before a full run, checking both directions: it must not pass wrong answers (loopholes) and must not fail right ones (false negatives). Watch reward, entropy, and kl as first-class metrics (observability).
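For the code bullet above, the pass/fail contract can be sketched with the standard library. This is an illustration only: a plain subprocess is not a sandbox, and a real run would execute the same logic inside a container or microVM with the network dropped. The function name `code_reward` and the file name are hypothetical.

```python
import os
import subprocess
import sys
import tempfile


def code_reward(completion_code, test_code, timeout_s=5.0):
    """Binary reward: 1.0 if the tests exit with status 0 inside the time limit, else 0.0.
    NOTE: a subprocess is NOT isolation; it only illustrates the pass/fail contract."""
    with tempfile.TemporaryDirectory() as workdir:
        path = os.path.join(workdir, "candidate_test.py")
        with open(path, "w", encoding="utf-8") as f:
            f.write(completion_code + "\n\n" + test_code + "\n")
        try:
            proc = subprocess.run(
                [sys.executable, "-I", path],   # -I: isolated mode (ignores env vars, user site)
                cwd=workdir,
                capture_output=True,
                timeout=timeout_s,              # wall-clock cap
            )
        except subprocess.TimeoutExpired:
            return 0.0                          # a timeout counts as a failed check
        return 1.0 if proc.returncode == 0 else 0.0   # non-zero exit -> failed check


# Usage: hidden tests are appended after the model's code and run as one script.
tests = "assert add(2, 3) == 5"
print(code_reward("def add(a, b):\n    return a + b", tests))
print(code_reward("def add(a, b):\n    return a - b", tests))
```

Memory caps and network isolation are deliberately left out here because they depend on the sandbox technology in use.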

How to integrate with it

The verifier is a custom reward function the trainer calls on every rollout; it owns the reward, not the update rule. The wiring differs by library but the shape is identical, a function from (prompt, completion, ground_truth) to a float:

  • TRL: pass reward_funcs=[fn]; each fn(prompts, completions, **columns) returns a list of floats, and TRL sums several of them (weighted by reward_weights). Extra dataset columns (a solution column) arrive as keyword arguments.
  • verl: point custom_reward_function.path/name at compute_score(data_source, solution_str, ground_truth, extra_info=None) -> float.
  • OpenRLHF, NeMo-RL, open-instruct expose an equivalent custom-reward hook (RL libraries).
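As a concrete instance of the function shape described above, here is a minimal sketch of a scorer with the `compute_score` signature quoted in the verl bullet. The helper `_last_number` and the "last number in the text is the answer" rule are assumptions made for brevity; that rule is easy to game, and a real verifier would extract a delimited final answer and compare it symbolically.

```python
import re


def _last_number(text):
    """Hypothetical helper: pull the last integer or decimal out of a string."""
    found = re.findall(r"-?\d+(?:\.\d+)?", text)
    return float(found[-1]) if found else None


def compute_score(data_source, solution_str, ground_truth, extra_info=None):
    """Binary verifiable reward. `data_source` and `extra_info` are accepted
    to match the signature but are unused in this sketch."""
    pred = _last_number(solution_str)
    gold = _last_number(str(ground_truth))
    if pred is None or gold is None:
        return 0.0                                   # unparsable -> not verified
    return 1.0 if abs(pred - gold) <= 1e-9 else 0.0


# Usage
print(compute_score("gsm8k", "6 * 7 = 42, so the answer is 42.", "42"))
```

Whatever the library, keeping the scorer a plain function of strings makes it testable without starting a training job.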

Because the reward is decoupled from the optimizer, the optimizer is a swap: the same verifier drives PPO (Tülu 3), GRPO (R1), or RLOO unchanged. What the optimizer does with the scalar is turn it into a per-sample advantage, either against a learned critic (PPO) or group-relative across the G completions of one prompt (GRPO/RLOO). That group-relative step is the whole coupling between a binary verifier and the gradient, and its degenerate case is a real failure mode: when every completion in a group scores the same, the advantage is zero and no gradient flows. Runnable and asserted in numpy:

import numpy as np


def group_advantages(rewards, group_size, scale_rewards=True, eps=1e-6):
    """GRPO/RLOO-style group-relative advantage from verifiable rewards.
    Reshape into groups of `group_size`, centre by the group mean, and
    (optionally) divide by the group std. This is how a binary verifier
    reward becomes the per-sample gradient signal (no critic network)."""
    r = np.asarray(rewards, dtype=float).reshape(-1, group_size)
    adv = r - r.mean(axis=1, keepdims=True)                 # mean-centre within group
    if scale_rewards:
        adv = adv / (r.std(axis=1, keepdims=True) + eps)    # std-normalise (R1 default)
    return adv.reshape(-1)


def _reference(rewards, group_size, scale_rewards, eps=1e-6):
    """Slow, explicit reference implementation to check the vectorised one against."""
    out = []
    for start in range(0, len(rewards), group_size):
        g = rewards[start:start + group_size]
        mu = sum(g) / len(g)
        centred = [x - mu for x in g]
        if scale_rewards:
            var = sum((x - mu) ** 2 for x in g) / len(g)
            std = var ** 0.5
            centred = [c / (std + eps) for c in centred]
        out.extend(centred)
    return out


# --- equivalence to the slow reference (mixed rewards, two groups) --------
rewards = [1.0, 0.0, 1.0, 0.0, 1.0, 1.0, 0.0, 0.0]
for scale in (True, False):
    fast = group_advantages(rewards, group_size=4, scale_rewards=scale)
    slow = _reference(rewards, group_size=4, scale_rewards=scale)
    assert np.allclose(fast, slow, atol=1e-9), (scale, fast, slow)

# --- advantages are mean-zero within every group (no constant bias) -------
adv = group_advantages(rewards, group_size=4, scale_rewards=True).reshape(-1, 4)
assert np.allclose(adv.mean(axis=1), 0.0, atol=1e-9), adv.mean(axis=1)

# --- sign: a correct sample in a mixed group is reinforced, a wrong one is not
mixed = group_advantages([1.0, 0.0], group_size=2, scale_rewards=False)
assert mixed[0] > 0 > mixed[1], mixed                       # +0.5 for correct, -0.5 for wrong

# --- EDGE / FAILURE: zero-variance group => zero advantage => NO gradient --
# Every completion right, or every one wrong: the group-relative signal vanishes.
all_right = group_advantages([1.0, 1.0, 1.0, 1.0], group_size=4)
all_wrong = group_advantages([0.0, 0.0, 0.0, 0.0], group_size=4)
assert np.allclose(all_right, 0.0) and np.allclose(all_wrong, 0.0), (all_right, all_wrong)
# this is exactly the "zero-variance groups" failure mode: prompts too easy or too hard
frac_zero_std = np.mean([
    np.std(group_advantages(r, 4, scale_rewards=False)) == 0.0
    for r in ([1.0, 1.0, 1.0, 1.0], [0.0, 0.0, 0.0, 0.0], [1.0, 0.0, 1.0, 0.0])
])
assert abs(frac_zero_std - 2 / 3) < 1e-9, frac_zero_std      # 2 of 3 groups give no gradient

print("block2 group_advantages: all asserts passed")

Combine terms carefully: a format reward stacked on the correctness reward must be gated on correctness (pay the format bonus only when the answer is also right), or the policy farms the easy term without solving the task (reward design).
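The gating rule can be sketched in a few lines. The `<answer>` tag format, the bonus size, and the function names are hypothetical choices for illustration; the ungated variant is included only to show what goes wrong.

```python
import re


def format_ok(completion):
    """Hypothetical format check: the answer must be wrapped in <answer>...</answer>."""
    return re.search(r"<answer>.*?</answer>", completion, flags=re.DOTALL) is not None


def gated_reward(correct, completion, bonus=0.1):
    """Correctness reward plus a format bonus that is paid only when `correct` is True."""
    if not correct:
        return 0.0                                   # wrong answer earns nothing, tags or not
    return 1.0 + (bonus if format_ok(completion) else 0.0)


def ungated_reward(correct, completion, bonus=0.1):
    """Naive sum, for contrast: the bonus can be collected without solving the task."""
    return (1.0 if correct else 0.0) + (bonus if format_ok(completion) else 0.0)


# A wrong answer in perfect format: farmable when ungated, worthless when gated.
wrong_but_tidy = "<answer>17</answer>"
print(ungated_reward(False, wrong_but_tidy), gated_reward(False, wrong_but_tidy))
```

With the ungated sum, a policy that never solves anything still has a positive reward to climb; gating removes that gradient entirely.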

How to run it in production

RLVR in production is two reliability problems: an untrusted-code sandbox for code and agent verifiers, and a deterministic, monitored reward.

  • Isolate the verifier. A code checker runs model-generated code, so execute it in a container or microVM, never the host: drop the network, cap wall-clock and memory, and treat a non-zero exit or a timeout as a failed check. This is the agentic-RL sandbox applied per completion.
  • Keep it deterministic and cached. The reward must be a pure function of (completion, ground_truth); cache by content hash so re-scored rollouts are free, and pin the checker with its parser (a math-verify/SymPy upgrade can shift what parses, silently moving the objective).
  • Protect the ground truth. The gold answer is the objective; if it leaks into the prompt or into the reasoning trace the verifier is trivially gamed (see failure modes). Keep it in a column the model never sees.
  • Watch reward, entropy, and kl as first-class metrics (observability). A rising reward with a flat held-out eval is the signature of hacking, not learning.
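The "deterministic and cached" bullet above can be sketched as a thin wrapper around any pure verifier. The class name `CachedVerifier` and the `version` field are hypothetical; the idea of folding a checker version into the key follows from the point that a parser upgrade can move the objective.

```python
import hashlib


class CachedVerifier:
    """Wrap a pure verify_fn(completion, ground_truth) -> float with a content-hash cache."""

    def __init__(self, verify_fn, version="v1"):
        self.verify_fn = verify_fn
        self.version = version        # bump when the checker or its parser changes
        self.cache = {}
        self.calls = 0                # number of real (uncached) verifier calls

    def _key(self, completion, ground_truth):
        h = hashlib.sha256()
        for part in (self.version, completion, ground_truth):
            data = part.encode("utf-8")
            h.update(len(data).to_bytes(8, "big"))   # length prefix avoids ambiguous joins
            h.update(data)
        return h.hexdigest()

    def __call__(self, completion, ground_truth):
        key = self._key(completion, ground_truth)
        if key not in self.cache:
            self.calls += 1
            self.cache[key] = float(self.verify_fn(completion, ground_truth))
        return self.cache[key]


# Usage: the second identical call is served from the cache.
verifier = CachedVerifier(lambda c, g: 1.0 if c.strip() == g else 0.0)
verifier("42", "42")
verifier("42", "42")
print(verifier.calls)
```

Caching is only sound when the verifier really is a pure function; a checker that depends on wall-clock time or random seeds should not be wrapped this way.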

How to maintain it

The verifier is code that defines the objective, so maintain it like production code, not a static config.

  • Keep an adversarial regression suite. The held-out pass/fail cases from engineering the verifier (loopholes it must reject, correct answers it must accept) become a test that runs before every campaign; a silently broken reward trains a broken model fast.
  • Re-validate on data or model shift. As the policy explores it emits new answer formats; extend the normalizer/parser so they do not turn into false negatives, and re-run the suite whenever the dataset or base model changes.
  • Pin versions and treat configs as version-specific. The field iterates quickly (DAPO and Dr. GRPO succeed GRPO), so pin the trainer and checker libraries and re-verify field names on upgrade.
  • Checkpoint for rollback. Persist the policy and keep the frozen reference so a run resumes after preemption and can roll back if a reward change degrades it (checkpoint recovery).
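The adversarial regression suite from the first bullet can be as small as two labelled lists and one audit function. This is a sketch: `audit_verifier` and the case lists are hypothetical, and the argument order `(gold, pred)` is chosen to match `verify_value` earlier on this page. A deliberately brittle string checker is audited to show what the suite catches.

```python
def audit_verifier(verify_fn, must_accept, must_reject):
    """Run verify_fn(gold, pred) -> bool over labelled cases.
    Returns (false_negatives, false_positives) as lists of (gold, pred) pairs."""
    false_negatives = [(g, p) for g, p in must_accept if not verify_fn(g, p)]
    false_positives = [(g, p) for g, p in must_reject if verify_fn(g, p)]
    return false_negatives, false_positives


def string_equal(gold, pred):
    """Deliberately brittle checker, used here only to show what the audit catches."""
    return gold.strip() == pred.strip()


MUST_ACCEPT = [("1/2", "1/2"), ("1/2", "0.5"), ("0.5", "0.50")]       # correct answers
MUST_REJECT = [("1/2", "1/3"), ("0.5", ""), ("0.5", "not a number")]  # wrong or garbage

false_neg, false_pos = audit_verifier(string_equal, MUST_ACCEPT, MUST_REJECT)
print("false negatives:", false_neg)   # correct answers the checker wrongly fails
print("false positives:", false_pos)   # wrong answers the checker wrongly passes
```

Running this before every campaign and failing the launch when either list is non-empty turns verifier drift into a blocked job instead of a wasted run.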

How to scale it

RLVR has the systems shape of GRPO: it is rollout-dominated (generation is most of the wall-clock), so it splits into a trainer and a rollout generator and scales by provisioning both (async and disaggregated RL systems). Two RLVR-specific costs sit on top:

  • Verifier throughput. A code verifier that spins a sandbox per completion becomes its own scheduled workload; a slow verifier starves the trainer exactly like a slow rollout. Pool and parallelize sandboxes; cache deterministic checks.
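  • Weight sync. Pushing weights from trainer to rollout workers wants NVLink intra-node, or InfiniBand/RoCE with GPUDirect RDMA (GDR) inter-node (performance tuning); confirm that GDRDMA appears in the transport lines of the NCCL_DEBUG=INFO output.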
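For the verifier-throughput bullet above, the simplest form of parallelism is a thread pool over the batch. This is a sketch under the assumption that each check mostly waits on an external sandbox or subprocess (I/O-bound), which is where threads help; `score_batch` and `slow_check` are hypothetical names.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def score_batch(verify_fn, completions, ground_truths, max_workers=8):
    """Score completions concurrently; results come back in input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(verify_fn, completions, ground_truths))


def slow_check(completion, ground_truth):
    """Stand-in for a verifier that waits on a sandbox."""
    time.sleep(0.05)
    return 1.0 if completion == ground_truth else 0.0


# Usage: eight checks overlap instead of running back to back.
completions = ["1", "2", "3", "4", "5", "6", "7", "8"]
golds = ["1", "0", "3", "0", "5", "0", "7", "0"]
print(score_batch(slow_check, completions, golds))
```

`pool.map` preserves input order, which matters because rewards must line up with their completions when they are reshaped into groups.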

For frontier scale, dedicated systems supply the verifiable reward as a custom reward function: verl, OpenRLHF (recommends REINFORCE++-baseline "for reasoning tasks (RLVR)"), NeMo-RL, and AI2's open-instruct (the Tülu codebase).4

# verl: verifiable reward supplied as a custom function; pin the release and verify keys on the repo.
# The scorer signature is compute_score(data_source, solution_str, ground_truth, extra_info=None) -> float.
python -m verl.trainer.main_ppo \
  algorithm.adv_estimator=grpo \
  actor_rollout_ref.rollout.name=vllm \
  custom_reward_function.path=/path/to/rlvr_reward.py \
  custom_reward_function.name=compute_score

Does RLVR actually add capability?

The headline open question as of 2026: does RLVR create new reasoning ability, or just re-weight sampling toward answers the base model already had? Three results are in genuine tension, and citing them together is the only way to represent the state of the field honestly:

  • Sharpening (Yue et al., 2025). RLVR-trained models beat their base model at pass@1, but the base models achieve higher pass@k at large k: RLVR narrows the output distribution rather than adding new reasoning paths.5
  • Signal-is-weak (Spurious Rewards, 2025). Even random rewards lift MATH-500 for Qwen2.5-Math-7B by ~21 points, via a GRPO clipping bias that amplifies pretrained "code-reasoning" priors. This is largely Qwen-specific: the same spurious rewards "often fail to produce gains for other model families, such as Llama3 or OLMo2". Much RLVR literature is Qwen-only, so this is a warning about experimental design, not proof RLVR is empty.6
  • Expansion (ProRL, 2025). Prolonged RL with KL control, reference resets, and diverse tasks "can uncover novel reasoning strategies that are inaccessible to base models, even under extensive sampling", directly disputing the sharpening view.7

Net, defensible today: RLVR reliably improves single-shot (pass@1) accuracy (not contested); whether it raises the pass@k capability ceiling beyond the base model is unresolved. Practical takeaways: evaluate with pass@k, not just pass@1; do not extrapolate a Qwen-only result to other families; and expect gains to entangle with training duration and task diversity.

Because the debate lives entirely in the gap between pass@1 and pass@k, measure both with the unbiased estimator (Chen et al.): pass@k = 1 - C(n-c, k) / C(n, k) for c correct out of n samples. The block below reproduces the sharpening result numerically: a model that wins pass@1 can still lose pass@k at large k once it has narrowed its output distribution.

from math import comb
import numpy as np


def pass_at_k(n, c, k):
    """Unbiased pass@k (Chen et al., HumanEval): probability that at least one of
    k samples drawn without replacement from n samples (c correct) is correct.
    pass@k = 1 - C(n-c, k) / C(n, k)."""
    assert 0 <= c <= n and 1 <= k <= n
    if n - c < k:
        return 1.0                                  # too few wrong to fill a k-subset
    return 1.0 - comb(n - c, k) / comb(n, k)


def agg_pass_at_k(counts, n, k):
    """Mean pass@k over a set of prompts, each with (n samples, c correct)."""
    return float(np.mean([pass_at_k(n, c, k) for c in counts]))


# --- equivalence to a slow Monte-Carlo reference --------------------------
rng = np.random.default_rng(0)
n, c, k, trials = 8, 3, 4, 200_000
pool = np.array([1] * c + [0] * (n - c))
hits = sum(rng.permutation(pool)[:k].any() for _ in range(trials)) / trials
assert abs(hits - pass_at_k(n, c, k)) < 5e-3, (hits, pass_at_k(n, c, k))

# --- boundaries -----------------------------------------------------------
assert pass_at_k(8, 0, 4) == 0.0                    # no correct sample -> never passes
assert abs(pass_at_k(8, 3, 1) - 3 / 8) < 1e-12      # k=1 reduces to c/n (pass@1)
assert pass_at_k(8, 5, 4) == 1.0                    # n-c=3 < k=4 -> guaranteed a hit

# --- monotonic non-decreasing in k for fixed (n, c) -----------------------
seq = [pass_at_k(8, 2, k) for k in range(1, 9)]
assert all(b >= a - 1e-12 for a, b in zip(seq, seq[1:])), seq

# --- SHARPENING (Yue et al.): RLVR lifts pass@1 but can lower pass@k -------
# Per-prompt correct-counts out of n=8. The sharpened (RLVR) model concentrates
# probability: it solves easy prompts more often (higher pass@1) but loses the
# rare correct path on hard prompts (c: 1 -> 0), shrinking the pass@k ceiling.
n = 8
base = [1, 1, 4, 4]        # hard prompts sometimes solved (c=1), easy at c=4
sharp = [0, 0, 6, 6]       # hard never solved (c=0), easy sharpened to c=6

# pass@1: sharpened wins (the uncontested RLVR result)
assert agg_pass_at_k(sharp, n, 1) > agg_pass_at_k(base, n, 1), (
    agg_pass_at_k(sharp, n, 1), agg_pass_at_k(base, n, 1))

# pass@k at large k: the BASE model wins (its wider distribution still covers the tail)
assert agg_pass_at_k(base, n, 8) > agg_pass_at_k(sharp, n, 8), (
    agg_pass_at_k(base, n, 8), agg_pass_at_k(sharp, n, 8))

# there is a genuine crossover: sharpened ahead at k=1, base ahead by k=8
diffs = [agg_pass_at_k(sharp, n, k) - agg_pass_at_k(base, n, k) for k in range(1, 9)]
assert diffs[0] > 0 > diffs[-1], diffs

print("block3 pass_at_k: all asserts passed")
print(f"  pass@1  base={agg_pass_at_k(base, n, 1):.3f}  sharp={agg_pass_at_k(sharp, n, 1):.3f}")
print(f"  pass@8  base={agg_pass_at_k(base, n, 8):.3f}  sharp={agg_pass_at_k(sharp, n, 8):.3f}")

Failure modes

  • Verifier gaming (reward hacking). With purely extensional checking (does the output match the expected answer), models can enumerate instance-level labels instead of learning the rule, producing outputs that pass the verifier without generalizing.9 Other exploits: injecting the target answer into the reasoning trace, or hitting an extraction-regex loophole. Detect with perturbation tests and held-out evals.
  • False negatives. A brittle checker (string-equality maths) marks correct answers wrong and trains the model against correct behaviour. Use symbolic equivalence.
  • Coverage gaps. The verifier only scores what it checks; unscored quality drifts freely.
  • Entropy collapse. Policy entropy "drops sharply at the early training stage", saturating performance; the fitted R = -a·e^H + b implies a ceiling as entropy → 0. Mitigate with Clip-Cov/KL-Cov or DAPO's Clip-Higher, and monitor entropy.8
  • Zero-variance groups. Under GRPO, if every sampled completion for a prompt is right (or every one is wrong), the group-relative advantage is zero and no gradient flows; this is the degenerate case asserted in the integration block above. Mitigate by mixing prompt difficulty so that groups contain both outcomes.
  • Qwen-spurious illusion. A gain that reproduces under a random reward is not evidence the signal works; ablate against a spurious-reward baseline on the target model family.6
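Monitoring entropy, as the entropy-collapse bullet recommends, needs only the policy's logits on sampled tokens. The sketch below computes mean per-token Shannon entropy in numpy. The function name and the `(tokens, vocab)` input shape are assumptions for illustration; trainers usually log an equivalent quantity themselves.

```python
import numpy as np


def mean_token_entropy(logits, mask=None):
    """Mean Shannon entropy (in nats) of the next-token distributions.
    logits: array of shape (tokens, vocab); mask: optional 0/1 array of shape (tokens,)
    selecting which positions count (e.g. completion tokens only)."""
    z = np.asarray(logits, dtype=float)
    z = z - z.max(axis=-1, keepdims=True)                     # stabilise the softmax
    logp = z - np.log(np.exp(z).sum(axis=-1, keepdims=True))  # log-softmax
    ent = -(np.exp(logp) * logp).sum(axis=-1)                 # H = -sum p log p, per token
    if mask is None:
        return float(ent.mean())
    m = np.asarray(mask, dtype=float)
    return float((ent * m).sum() / max(m.sum(), 1.0))


# Usage: a uniform distribution has maximal entropy, a peaked one is near zero.
uniform = np.zeros((1, 4))
peaked = np.array([[100.0, 0.0, 0.0, 0.0]])
print(mean_token_entropy(uniform), mean_token_entropy(peaked))
```

Logged per step next to reward, a sharp early drop in this value is the collapse pattern described above.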

References

  • Tülu 3 (introduces/names RLVR): https://arxiv.org/abs/2411.15124 · AI2 blog: https://allenai.org/blog/tulu-3-technical
  • DeepSeek-R1 (rule-based verifiable rewards, avoids neural RM): https://arxiv.org/abs/2501.12948
  • DeepSeekMath (GRPO, rule-based math rewards): https://arxiv.org/abs/2402.03300
  • Does RL Really Incentivize Reasoning Beyond the Base Model? (Yue et al., sharpening): https://arxiv.org/abs/2504.13837
  • Spurious Rewards: Rethinking Training Signals in RLVR (Qwen-specific): https://arxiv.org/abs/2506.10947
  • ProRL (prolonged RL expands reasoning boundaries): https://arxiv.org/abs/2505.24864
  • pass@k unbiased estimator (Chen et al., Codex/HumanEval): https://arxiv.org/abs/2107.03374
  • The Entropy Mechanism of RL for Reasoning LLMs (Clip-Cov/KL-Cov): https://arxiv.org/abs/2505.22617
  • LLMs Gaming Verifiers: RLVR can Lead to Reward Hacking: https://arxiv.org/abs/2604.15149
  • DAPO: https://arxiv.org/abs/2503.14476 · Dr. GRPO (R1-Zero critique): https://arxiv.org/abs/2503.20783 · REINFORCE++: https://arxiv.org/abs/2501.03262 · RLOO: https://arxiv.org/abs/2402.14740
  • Math-Verify (verifier): https://github.com/huggingface/Math-Verify · verl: https://github.com/volcengine/verl · OpenRLHF: https://github.com/OpenRLHF/OpenRLHF · open-instruct (Tülu): https://github.com/allenai/open-instruct

Related: RLSD · GRPO · PPO · Reward design · Reward model training · On-policy distillation · Agentic and tool-use RL · Async RL systems · DPO · SFT/LoRA · Fine-tuning and post-training · RL libraries · verl · Glossary


  1. Lambert et al., Tülu 3 — introduces "Reinforcement Learning with Verifiable Rewards (RLVR)", replacing the reward model with a verification function that pays a reward only when the response is verifiably correct; AI2 frames it as a simplification of existing rule-reward practice. https://arxiv.org/abs/2411.15124 

  2. DeepSeek-R1 uses rule-based (verifiable) rewards and states it does not use a neural reward model because it "may suffer from reward hacking"; trained with GRPO. https://arxiv.org/abs/2501.12948 

  3. DeepSeekMath introduces GRPO and trains maths with rule-based rewards, predating the "RLVR" name. https://arxiv.org/abs/2402.03300 

  4. RLVR is the reward source, not the optimizer: Tülu 3 uses PPO, R1 uses GRPO, and verl/OpenRLHF/NeMo-RL pair verifiable rewards with GRPO/RLOO/REINFORCE++/DAPO/Dr. GRPO. OpenRLHF documents REINFORCE++-baseline "for reasoning tasks (RLVR)". https://github.com/volcengine/verl · https://github.com/OpenRLHF/OpenRLHF 

  5. Yue et al. — RLVR models win at pass@1 but base models reach higher pass@k at large k, i.e. RL sharpens the sampling distribution rather than expanding capability. Contested by ProRL. https://arxiv.org/abs/2504.13837 

  6. Shao et al., Spurious Rewards — random rewards raise MATH-500 for Qwen2.5-Math-7B by ~21 points via a GRPO clipping bias, but "often fail to produce gains for other model families, such as Llama3 or OLMo2"; the result is largely Qwen-specific. https://arxiv.org/abs/2506.10947 

  7. Liu et al., ProRL — prolonged RL with KL control, reference resets, and diverse tasks uncovers reasoning strategies "inaccessible to base models, even under extensive sampling". https://arxiv.org/abs/2505.24864 

  8. Cui et al. — policy entropy collapses early (R = -a·e^H + b), capping performance; Clip-Cov and KL-Cov restore exploration. https://arxiv.org/abs/2505.22617 

  9. Helff et al., LLMs Gaming Verifiers — under extensional verification, RLVR models enumerate instance-level labels that pass the checker instead of inducing the general rule; detected via Isomorphic Perturbation Testing. https://arxiv.org/abs/2604.15149