Reward design for RL post-training

Scope: designing the reward that drives RL post-training of LLMs. This page covers where the signal comes from (verifiable checkers, reward models, preferences), how to shape, compose, and normalize it, and how to catch reward hacking before it trains a bad model fast. Reward design is the cross-cutting concern under GRPO, DPO, and the RL libraries, and the single biggest determinant of whether an RL run helps or quietly regresses the model.

The code blocks are reference templates written against real APIs; pin versions and validate before production use.

What it is

Reward design is choosing the scalar reward that an RL post-training run optimizes and shaping it so the policy pursues the intended goal. It spans three decisions: the source of the signal (a programmatic checker, a learned reward model, or preference pairs), the composition (how correctness, format, and penalties combine into one number), and the normalization (bringing components onto a comparable scale and computing a per-prompt baseline). Those three decisions, not the choice of optimizer, dominate the outcome of a run.

Where the reward comes from

Three sources dominate LLM post-training, in rough order of signal cleanliness:

Verifiable / programmatic rewards (RLVR). A deterministic checker scores the completion: exact-answer match for maths, unit tests for code, schema/regex for format, or a tool-execution result. This is the cleanest signal and the default behind reasoning models. DeepSeek-R1 trained reasoning with GRPO on verifiable rewards.² Use it whenever correctness can be computed.
Reward models. A model trained on human (or AI) preference comparisons predicts a scalar reward, as in InstructGPT-style RLHF.³ Necessary for open-ended qualities (helpfulness, tone, safety) that no checker captures, but it is itself a learned proxy and can be gamed (see reward hacking). See reward model training.
Implicit / preference-direct. DPO skips an explicit reward entirely and optimizes the policy directly from (chosen, rejected) pairs; the reward is implicit in the log-probability margin against a reference.
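The implicit reward in the last item can be made concrete. This is a minimal sketch, assuming each completion's log-probability has already been summed over its tokens; the function names and the default beta=0.1 are illustrative and not any library's API. DPO treats beta * (log pi(y|x) - log pi_ref(y|x)) as the reward and minimizes -log sigmoid of the chosen-minus-rejected margin.

```python
import math

def implicit_reward(logp_policy: float, logp_ref: float, beta: float = 0.1) -> float:
    # DPO's implicit reward: scaled log-ratio of the policy to the frozen reference
    return beta * (logp_policy - logp_ref)

def dpo_loss(logp_c, ref_c, logp_r, ref_r, beta: float = 0.1) -> float:
    # -log sigmoid(margin), where margin = reward(chosen) - reward(rejected)
    margin = implicit_reward(logp_c, ref_c, beta) - implicit_reward(logp_r, ref_r, beta)
    return math.log1p(math.exp(-margin))   # equals -log(sigmoid(margin))

# policy identical to the reference -> zero margin -> loss = log 2
loss_start = dpo_loss(-12.0, -12.0, -15.0, -15.0)
# policy raised the chosen and lowered the rejected log-prob -> smaller loss
loss_better = dpo_loss(-10.0, -12.0, -17.0, -15.0)
```

No reward is ever logged as a separate number here, which is why DPO runs need a held-out eval rather than a reward curve to judge progress.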

Why use it

In RL post-training the reward is the objective: the policy will optimize exactly what you measure, including the parts you did not mean. A poorly designed reward fails faster and more confidently than a bad optimizer: the model learns the loophole, the logged reward climbs, and real quality falls. The reward-engineering discipline below is foundational across business RL and LLM RL alike; the LLM specifics (verifiable rewards, reward-model over-optimization) sit on top of the same principles.¹

When to use it (and when not)

Use verifiable / programmatic rewards whenever correctness can be computed: exact-answer match for maths, unit tests for code, schema/regex for format, or a tool-execution result. This is the cleanest signal and the default behind reasoning models (RLVR).
Use a reward model for open-ended qualities (helpfulness, tone, safety) that no checker captures, accepting that it is a learned, gameable proxy (see reward model training).
Use implicit / preference-direct optimization (DPO) when you have (chosen, rejected) pairs and want to skip an explicit reward and online rollouts entirely.
Do not use a penalty-only reward for a hard constraint. For must-emit JSON or must-call-a-tool, prefer constrained decoding / structured output (see constrained decoding), the LLM analog of action masking; a soft penalty cannot guarantee the constraint.
Do not reward a proxy without a validated correlation. Rewarding clicks, length, or format as a stand-in for quality, without first checking the correlation holds, invites reward hacking.
Do not densify a sparse reward speculatively. Add process / intermediate signals only when they provably correlate with the goal; otherwise the sparse verifiable outcome reward is safer.
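The proxy-correlation check in the list above can be sketched as a rank correlation between a candidate proxy and trusted quality labels on a held-out set. This is a sketch under stated assumptions: the helper names, the 0.5 threshold, and the six hand-made scores are all hypothetical, and the rank computation assumes no ties.

```python
import numpy as np

def spearman(x, y) -> float:
    # Spearman rank correlation as Pearson on ranks (assumes no ties, for brevity)
    rx = np.argsort(np.argsort(x)).astype(float)
    ry = np.argsort(np.argsort(y)).astype(float)
    return float(np.corrcoef(rx, ry)[0, 1])

def proxy_is_usable(proxy, gold, min_rho: float = 0.5) -> bool:
    # min_rho is a hypothetical threshold; choose it per task
    return spearman(proxy, gold) >= min_rho

# toy, hand-made scores for six held-out completions (illustrative only)
gold    = [0.1, 0.4, 0.5, 0.7, 0.8, 0.9]    # trusted quality labels
length  = [900, 300, 750, 200, 650, 100]    # completion length as a candidate proxy
checker = [0.0, 0.3, 0.6, 0.65, 0.9, 1.0]   # a candidate score that tracks quality

length_ok = proxy_is_usable(length, gold)    # negative correlation -> rejected
checker_ok = proxy_is_usable(checker, gold)  # same ordering as gold -> accepted
```

In this toy set length is anti-correlated with quality, so rewarding it would push the policy the wrong way; the check is cheap enough to run before every new reward term is added.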

Architecture

flowchart LR
  subgraph SRC["Reward source"]
    V["Verifiable checker<br/>(answer match, unit tests, regex)"]
    RM["Reward model<br/>(learned from preferences)"]
    PR["Preference pairs<br/>(implicit reward, DPO)"]
  end
  V --> SHAPE["Shape + compose<br/>(correctness + format + penalties)"]
  RM --> SHAPE
  SHAPE --> NORM["Normalize / balance scales<br/>(group-relative advantage)"]
  NORM --> UPD["Policy update<br/>(clipped surrogate + KL)"]
  UPD -->|"reward goes up but quality drops?"| HACK["Reward hacking check<br/>(held-out eval, adversarial probes)"]
  HACK -.->|"loophole found"| SHAPE
  PR -.->|"no explicit reward"| UPD

The reward flows through five stages. A reward source produces a raw score: a verifiable checker, a learned reward model, or (for DPO) implicit preference pairs that bypass an explicit reward and feed the update directly. Shape and compose combines correctness with format and penalty terms into one number. Normalize brings terms onto a comparable scale and computes the group-relative advantage that replaces a learned critic. The policy update applies a clipped surrogate objective with a KL penalty toward a frozen reference. Finally a reward-hacking check (held-out eval, adversarial probes) watches for the signature of a loophole: reward up, quality down. When one is found, the fix is upstream, in shaping and composition, not in the optimizer.
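The policy-update stage described above can be sketched in a few lines. This is a simplified illustration, not a trainer implementation: it works on per-token log-probabilities passed in as plain arrays, the function name and the defaults clip_eps=0.2 and beta=0.01 are assumptions, and the KL term uses the non-negative estimator exp(d) - d - 1 with d = logp_ref - logp_new, the form given in the DeepSeekMath GRPO paper.

```python
import numpy as np

def clipped_objective(logp_new, logp_old, logp_ref, advantage, clip_eps=0.2, beta=0.01):
    logp_new, logp_old, logp_ref = map(np.asarray, (logp_new, logp_old, logp_ref))
    ratio = np.exp(logp_new - logp_old)                      # importance ratio
    unclipped = ratio * advantage
    clipped = np.clip(ratio, 1 - clip_eps, 1 + clip_eps) * advantage
    surrogate = np.minimum(unclipped, clipped)               # pessimistic bound
    d = logp_ref - logp_new
    kl = np.exp(d) - d - 1.0                                 # >= 0, zero when new == ref
    return float(np.mean(surrogate - beta * kl))             # quantity to maximize

# no drift from the old policy or the reference: objective equals the advantage
on_policy = clipped_objective([-1.0, -2.0], [-1.0, -2.0], [-1.0, -2.0], 1.0)
# probability doubled on a positively rewarded token: gain is capped at 1 + clip_eps
capped = clipped_objective([-1.0 + np.log(2)], [-1.0], [-1.0], 1.0, beta=0.0)
```

The clip limits how much one batch can move the policy, and the KL term limits cumulative drift from the reference; both bound how hard a flawed reward can be exploited.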

How to use it: design and compose the reward

These principles are the RL reward-engineering canon, framed for LLM post-training.¹

Dense vs sparse (outcome vs process). A sparse outcome reward (final answer right/wrong) is unambiguous but gives little gradient on long rollouts; a denser process reward (per-step credit, intermediate format) coaches the model at more decision points but is easier to game. Prefer outcome rewards when verifiable; add process signals only when they provably correlate with the goal.
Compose correctness with shaping terms. The standard LLM pattern is a small set of reward functions summed (optionally weighted): a correctness reward plus a format reward plus light penalties. GRPO trainers accept reward_funcs=[correct, format] directly. Keep each term cheap, deterministic, and independently testable.
Normalize and balance scales. Reward components on mismatched scales make the policy chase the biggest number and ignore the rest; bring terms into a comparable range (for example [-1, 1]) before summing them.¹ Group methods then normalize the summed reward per prompt: the group-relative advantage A_i = (r_i - mean(r)) / std(r) is the per-prompt baseline that replaces a learned critic in GRPO. (The std term is debated: Dr. GRPO drops it to remove a question-difficulty bias, since dividing by a small std up-weights prompts whose samples nearly all pass or nearly all fail.)
Treat constraints as soft governors, not just hard penalties. The KL penalty toward a frozen reference is exactly this, a Lagrangian-style soft constraint that lets the policy explore while keeping it from collapsing onto a degenerate high-reward region.¹ For hard constraints (must-emit JSON, must call a tool), prefer constrained decoding / structured output over penalty-only learning, the LLM analog of action masking.
Potential-based shaping when you must add intermediate reward. Potential-based reward shaping adds guidance without changing the optimal policy, the safe way to densify a sparse reward.⁵
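The scale-balancing principle above can be illustrated with a small helper that maps each raw component onto [-1, 1] before the weighted sum. This is a minimal sketch: the function names, the component names, and the bounds and weights are hypothetical, and it assumes each component has known lower and upper bounds.

```python
def scale_to_unit(value: float, lo: float, hi: float) -> float:
    # map a raw score with known bounds [lo, hi] linearly onto [-1, 1], clamping outliers
    if hi <= lo:
        raise ValueError("hi must exceed lo")
    clamped = min(max(value, lo), hi)
    return 2.0 * (clamped - lo) / (hi - lo) - 1.0

def compose(components: dict, bounds: dict, weights: dict) -> float:
    # weighted sum of components after each is brought onto [-1, 1]
    return sum(weights[k] * scale_to_unit(v, *bounds[k]) for k, v in components.items())

# hypothetical raw scales: unit tests passed (0..20) vs a style score (0..1)
bounds  = {"tests_passed": (0, 20), "style": (0.0, 1.0)}
weights = {"tests_passed": 1.0, "style": 0.2}
raw     = {"tests_passed": 15, "style": 0.9}
total = compose(raw, bounds, weights)
```

Without the rescaling, the 0..20 test count would swamp the 0..1 style score regardless of the weights; with it, the weights alone express the intended priority.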

A composed, hack-resistant reward

The example below is a correctness-gated reward: it pays the format bonus only when the answer is right and adds a linear length penalty past a fixed budget, in the reward_funcs shape that TRL's GRPO trainer expects. The block is self-checking: run it to see the format bonus refused on a wrong answer and the length penalty stay zero at the budget boundary.

import re

def _final(s: str) -> str | None:
    m = re.search(r"\\boxed\{([^}]*)\}", s)
    return m.group(1).strip() if m else None

def reward_correct(prompts, completions, answer, **kw):
    # primary, verifiable signal: exact match of the boxed final answer
    return [1.0 if _final(c) == a else 0.0 for c, a in zip(completions, answer)]

def reward_format_gated(prompts, completions, answer, **kw):
    # bonus paid ONLY when the answer is also correct -> cannot be hacked alone
    out = []
    for c, a in zip(completions, answer):
        ok = _final(c) == a
        has_think = bool(re.search(r"<think>.*</think>", c, re.S))
        out.append(0.2 if (ok and has_think) else 0.0)
    return out

def reward_len_penalty(prompts, completions, **kw):
    # linear penalty past a 2048-character budget (len() counts characters, not tokens);
    # it is uncapped, so keep the rate small relative to the correctness term
    return [-0.001 * max(0, len(c) - 2048) for c in completions]

# --- validation: correctness, gating vs format-hacking, length boundary ---
prompts, answer = ["q"] * 4, ["42", "42", "42", "42"]
completions = [
    r"<think>reason</think> \boxed{42}",   # correct + think   -> full credit
    r"<think>reason</think> \boxed{41}",   # think but WRONG   -> hack attempt
    r"\boxed{42}",                          # correct, no think
    "no box here",                          # no parsable answer
]
rc = reward_correct(prompts, completions, answer)
rf = reward_format_gated(prompts, completions, answer)
assert rc == [1.0, 0.0, 1.0, 0.0], rc
# ADVERSARIAL: a <think> wrapper on a WRONG answer earns NO format bonus
assert rf == [0.2, 0.0, 0.0, 0.0], rf
assert rf[1] == 0.0, "format reward must be un-hackable without a correct answer"

# length penalty: zero at/under the 2048 budget, linear above it
rl = reward_len_penalty(prompts, ["x" * 2048, "x" * 3048, "", "x" * 2049])
assert rl[0] == 0.0 and rl[2] == 0.0, rl          # boundary + empty -> no penalty
assert abs(rl[1] - (-1.0)) < 1e-9, rl             # 1000 over budget -> -1.0
assert abs(rl[3] - (-0.001)) < 1e-12, rl          # 1 over budget    -> -0.001
print("reward functions OK:", rc, rf, [round(x, 4) for x in rl])
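One caveat on the linear penalty above: it is uncapped, so a completion 1000 characters over budget loses 1.0 and a correct answer ends up no better than a wrong one. A hedged variant caps the penalty below the correctness reward. The function name and the budget, rate, and cap defaults are hypothetical choices, not part of the recipe above.

```python
def reward_len_penalty_capped(prompts, completions, budget=2048, rate=0.001, cap=0.2, **kw):
    # same linear penalty past the budget, but never larger than `cap`, so a
    # correct over-long answer (1.0 - cap) still outranks a wrong one (0.0)
    return [-min(cap, rate * max(0, len(c) - budget)) for c in completions]

pen = reward_len_penalty_capped(["q"] * 3, ["x" * 2048, "x" * 2148, "x" * 5000])
```

With cap=0.2 the worst case for a correct answer is 0.8, which preserves the ordering correct > wrong while still discouraging runaway length.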

The last principle, potential-based shaping, is worth validating because it is the safe way to densify a sparse reward: adding F = gamma*Phi(s') - Phi(s) to each step's reward provably leaves the optimal policy unchanged.⁵ The intermediate potentials telescope, so with a terminal potential of zero every trajectory from a given start state shifts by the same constant, -Phi(s0), and the ranking (hence the optimal policy) is preserved:

# Potential-based reward shaping (Ng, Harada, Russell 1999): F = gamma*Phi(s') - Phi(s)
# leaves the optimal policy invariant. Validate: shaping never flips trajectory ranking.
gamma = 0.9
Phi = {"s0": 5.0, "s1": 3.0, "s2": 7.0, "T": 0.0}   # arbitrary potential, terminal = 0

def base_return(rewards):
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

def shaped_return(states, rewards):
    G = 0.0
    for t, (r, s, s_next) in enumerate(zip(rewards, states[:-1], states[1:])):
        f = gamma * Phi[s_next] - Phi[s]
        G += (gamma ** t) * (r + f)
    return G

# two trajectories from the SAME start state s0
traj_a = (["s0", "s1", "T"], [0.0, 1.0])   # base return 0.9
traj_b = (["s0", "s2", "T"], [1.0, 0.0])   # base return 1.0
ba, bb = base_return(traj_a[1]), base_return(traj_b[1])
sa, sb = shaped_return(*traj_a), shaped_return(*traj_b)

assert bb > ba, (ba, bb)                              # base prefers b
assert sb > sa, (sa, sb)                              # shaping must NOT flip it
# telescoping: shaped return shifts by exactly -Phi[s0] (terminal potential 0)
assert abs((sa - ba) - (-Phi["s0"])) < 1e-9, (sa, ba)
assert abs((sb - bb) - (-Phi["s0"])) < 1e-9, (sb, bb)
# ADVERSARIAL: changing an intermediate potential (even to a huge value) cancels in the
# telescoping sum; traj_a never visits s2, so only traj_b needs rechecking
Phi["s2"] = 1000.0
assert abs(shaped_return(*traj_b) - bb - (-Phi["s0"])) < 1e-9   # still -Phi[s0]
print("PBRS preserves ranking; constant shift = -Phi[s0] =", -5.0)

How to integrate it with a GRPO trainer

Wire the reward functions into a GRPO trainer as a list of reward_funcs; TRL sums them (optionally weighted) into one scalar per completion. Keep beta (the KL coefficient) above zero so the KL budget bounds how far the policy can exploit the reward. This block is a reference template (needs trl, torch, vllm): pin versions and verify GRPOConfig fields on the installed TRL.

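# reference template: verify GRPOConfig fields on the installed trl/torch/vllm
# assumes the three reward functions above are in scope and that `ds` is a dataset
# with a "prompt" column plus the "answer" column the reward functions read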
from trl import GRPOConfig, GRPOTrainer
trainer = GRPOTrainer(
    model="Qwen/Qwen3-8B",
    reward_funcs=[reward_correct, reward_format_gated, reward_len_penalty],
    args=GRPOConfig(num_generations=8, beta=0.01,   # KL on: bounds reward exploitation
                    scale_rewards=False, use_vllm=True, bf16=True),
    train_dataset=ds)
trainer.train()

The core aggregation the trainer performs, summing the reward functions into one scalar, is pure arithmetic you can validate without TRL:

import numpy as np
# TRL aggregates reward_funcs into one scalar per completion: a (weighted) sum of
# each function's output vector. Rows below are the exact outputs the functions
# above produce for c0 (correct+think), c1 (think but WRONG), c2 (correct, no think, over-long).
per_func = np.array([
    [1.0, 0.0, 1.0],     # reward_correct
    [0.2, 0.0, 0.0],     # reward_format_gated: bonus only when correct -> c1 gets 0
    [0.0, 0.0, -1.01],   # reward_len_penalty: c2 is 1010 chars over the 2048 budget
])
weights = np.array([1.0, 1.0, 1.0])
total = (weights[:, None] * per_func).sum(axis=0)
assert np.allclose(total, [1.2, 0.0, -0.01]), total
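# ADVERSARIAL: the format-hack column (c1) is 0 in every term, so it sums to 0 and
# stays strictly below the honest correct answer (c0). Summation cannot rescue a hack.
# Caveat: c2 is correct yet totals -0.01, below the wrong c1, because the length
# penalty is uncapped; cap it if a correct answer must always outrank a wrong one.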
assert total[1] == 0.0 and total[1] < total[0]
# equivalence to an explicit loop (slow reference) -> vectorized sum is correct
ref = [sum(w * per_func[i][j] for i, w in enumerate(weights)) for j in range(3)]
assert np.allclose(total, ref), (total, ref)
print("composite reward OK:", np.round(total, 4).tolist())

How to run it in production

Validate the reward on held-out cases before a full run, then watch reward, entropy, and kl as first-class metrics during training (see observability). Keep the KL penalty on (beta > 0): it bounds how far the policy can drift from the reference and thus how much it can exploit the reward. Gate promotion on a held-out eval that the reward never sees, because the training reward is exactly the number a hacked policy inflates.
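The promotion gate can be sketched as a comparison of the training-reward trend against the held-out eval trend. This is an illustrative sketch only: the function name, the first-versus-last comparison, and the min_eval_gain threshold are assumptions, and a real gate would likely smooth over several checkpoints.

```python
def promotion_gate(train_reward, heldout_eval, min_eval_gain=0.0):
    """Return (promote, reason) from first-vs-last checkpoint metrics.

    train_reward / heldout_eval: per-checkpoint lists, oldest first.
    min_eval_gain is a hypothetical threshold; tune it per benchmark.
    """
    reward_gain = train_reward[-1] - train_reward[0]
    eval_gain = heldout_eval[-1] - heldout_eval[0]
    if reward_gain > 0 and eval_gain < 0:
        return False, "reward up, eval down: suspected reward hacking"
    if eval_gain <= min_eval_gain:
        return False, "no held-out improvement"
    return True, "held-out eval improved"

hacked = promotion_gate([0.3, 0.6, 0.9], [0.50, 0.48, 0.41])
healthy = promotion_gate([0.3, 0.5], [0.50, 0.56])
```

The gate never looks at the training reward to decide a promotion; the reward only serves to label the failure when the two trends diverge.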

How to maintain it: guard against reward hacking

Reward hacking is when the policy maximizes the measured proxy while the true objective falls: the model finds a deceptive shortcut.¹ The classic business example is a recommender rewarded per click that learns clickbait: clicks (and reward) soar while sales and satisfaction collapse.¹ The LLM analogs are direct:

Format-matching without solving: emitting the \boxed{} or <think> wrapper to collect the format reward while the answer is wrong. Gate format reward on correctness, or keep it small.
Length exploitation: padding or truncating to exploit a length-correlated reward; DAPO and Dr. GRPO add length-bias fixes for exactly this.
Reward-model over-optimization: against a learned reward model, more optimization eventually decreases true quality as the policy exploits the model's errors; this trade-off follows a measurable scaling law and is bounded by the KL budget.⁴

Mitigations: keep the KL penalty on (bounds how far the policy can exploit the reward), test the reward adversarially before a full run, gate on a held-out eval that the reward never sees, and never reward a proxy unless you have checked it correlates with the goal.¹ ⁴
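Adversarial testing of a reward can be as simple as feeding it completions that do not solve the task and flagging any that get paid. The sketch below assumes the TRL-style reward signature (prompts, completions, **kwargs); the helper find_loopholes, the probe strings, and the deliberately flawed naive_format_reward are hypothetical and exist only to show a loophole being caught.

```python
import re

def find_loopholes(reward_fn, probes, answer, ceiling=0.0):
    # probes are completions that do NOT solve the task; any reward above
    # `ceiling` means the reward pays for something other than solving it
    scores = reward_fn(["q"] * len(probes), probes, answer=[answer] * len(probes))
    return [(p, s) for p, s in zip(probes, scores) if s > ceiling]

def naive_format_reward(prompts, completions, **kw):
    # deliberately ungated: pays for the <think> wrapper alone
    return [0.2 if re.search(r"<think>.*</think>", c, re.S) else 0.0 for c in completions]

probes = [
    r"<think></think> \boxed{0}",            # empty reasoning, wrong answer
    "<think>" + "wait " * 50 + "</think>",   # padding, no answer at all
    "",                                      # empty completion
]
leaks = find_loopholes(naive_format_reward, probes, answer="42")
```

Here the first two probes are paid 0.2 each without solving anything, which is the signal to gate the term on correctness before any training run starts.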

How to scale it

Group methods scale by replacing a learned value critic with a per-prompt baseline: the group-relative advantage A_i = (r_i - mean(r)) / std(r) standardizes each completion against the other samples for its prompt, so there is no critic to train or hold in memory (see GRPO). Standardizing this way puts the summed reward on a comparable scale across prompts; it does not rebalance the components inside the sum, which still need scaling before they are added. The std term is optional: Dr. GRPO drops it to avoid difficulty bias.

import numpy as np

def group_relative_advantage(rewards, scale_by_std=True, eps=1e-8):
    r = np.asarray(rewards, dtype=np.float64)
    adv = r - r.mean()                       # per-prompt group baseline (no critic)
    if scale_by_std:
        adv = adv / (r.std() + eps)
    return adv

r = [0.0, 1.0, 1.0, 0.0, 1.0, 0.0]
a = group_relative_advantage(r)
assert abs(a.mean()) < 1e-9, a.mean()                    # baseline subtracted -> mean 0
assert abs(a.std() - 1.0) < 1e-6, a.std()                # standardized -> unit scale
# ranking preserved: above-mean reward -> positive advantage, and vice versa
assert np.all((a > 0) == (np.asarray(r) > np.mean(r)))

# ADVERSARIAL: zero-variance group -> no gradient (advantage ~ 0), must stay finite
z = group_relative_advantage([0.7, 0.7, 0.7, 0.7])
assert np.allclose(z, 0.0), z
assert not np.any(np.isnan(z)), "eps must keep zero-variance groups finite (no NaN)"

# Dr. GRPO variant: drop the std term to remove difficulty bias
a_nostd = group_relative_advantage(r, scale_by_std=False)
assert np.allclose(a_nostd, np.asarray(r) - np.mean(r))
print("group-relative advantage OK:", np.round(a, 4).tolist(), "zero-var:", z.tolist())

Beyond the advantage math, keep each reward term cheap and deterministic (the recipe above sums three) and back rollouts with vLLM (use_vllm=True), since generation dominates wall-clock; the group size num_generations trades lower advantage variance against higher rollout cost.

Failure modes

Reward up, eval down: the signature of reward hacking; trust the held-out eval, not the training reward.
Mismatched scales: one term dominates; the model ignores the others. Normalize.
Ungated shaping term: a format/length bonus payable without solving becomes the thing the model optimizes.
KL off (beta=0) against a learned reward model: unbounded over-optimization; keep a KL budget.
Zero-variance groups: every sample in a group scores identically, advantage is zero, no gradient (see GRPO).
Proxy with no validated correlation: rewarding clicks/length/format as a stand-in for quality without checking the correlation holds.
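The zero-variance failure mode can be quantified for a binary (pass/fail) reward. Assuming the G samples for a prompt are independent and each passes with probability p, the chance that a group carries no gradient is p^G + (1 - p)^G. The function name is hypothetical and the independence assumption is a simplification.

```python
def p_zero_variance(pass_rate: float, group_size: int) -> float:
    # chance that all G independent samples agree (all pass or all fail)
    return pass_rate ** group_size + (1.0 - pass_rate) ** group_size

# wasted-group probability at group size 8 for a hard, easy, and near-solved prompt
wasted = {p: p_zero_variance(p, 8) for p in (0.5, 0.9, 0.99)}
```

At a pass rate of 0.9 and G = 8 roughly 43% of groups produce no learning signal, which is why prompts that are nearly solved (or nearly impossible) are often filtered out or resampled, and why a larger group size helps most on those prompts.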

References

Applied Reinforcement Learning (Manning, MEAP): reward engineering and constraint handling (stepwise rewards, constraint injection, penalty shaping, action masking, reward normalization, deceptive-shortcut reward hacking).
DeepSeekMath / GRPO: https://arxiv.org/abs/2402.03300
DeepSeek-R1 (RL with verifiable rewards): https://arxiv.org/abs/2501.12948
InstructGPT (reward model from human feedback): https://arxiv.org/abs/2203.02155
Scaling Laws for Reward Model Overoptimization: https://arxiv.org/abs/2210.10760
Policy invariance under reward transformations (potential-based shaping): https://people.eecs.berkeley.edu/~russell/papers/icml99-shaping.pdf
DPO: https://arxiv.org/abs/2305.18290
DAPO (length-bias fixes): https://arxiv.org/abs/2503.14476

¹ Applied Reinforcement Learning (Manning, MEAP), ch. 2.6 "Reward engineering and constraint handling strategies": design rewards stepwise, inject constraints into state, shape soft constraints with graded penalties, mask hard constraints, normalize mismatched scales to a comparable range, and avoid deceptive shortcuts where the agent games a proxy reward.
² DeepSeek-R1 trained reasoning with GRPO against verifiable rewards (RLVR). https://arxiv.org/abs/2501.12948
³ Ouyang et al., training language models to follow instructions with human feedback (reward model + RL). https://arxiv.org/abs/2203.02155
⁴ Gao, Schulman, Hilton, scaling laws for reward model over-optimization: true reward eventually falls as KL from the reference grows. https://arxiv.org/abs/2210.10760
⁵ Ng, Harada, Russell, potential-based reward shaping preserves the optimal policy. https://people.eecs.berkeley.edu/~russell/papers/icml99-shaping.pdf