
Reward model training

Scope: training the reward model used in RLHF. Covers the Bradley-Terry pairwise preference model, its scalar-head architecture, the log-sigmoid loss on the score margin, and the main variants (outcome RM, process RM, and generative LLM-as-a-judge). Distinct from reward design, which covers where the reward signal comes from and how to shape it.

Two kinds of code block appear below. Blocks labelled "reference template" call real torch, TRL, transformers, or DeepSpeed APIs: pin versions and validate before production use. Blocks labelled "runnable" are self-contained numpy that assert the core math the adjacent template illustrates, so you can execute and check them without a GPU.

What it is

A reward model (RM) takes a prompt x and a completion y and returns a single scalar r(x, y) measuring quality. It is the component of RLHF where complex, hard-to-specify human preferences are learned and compressed into a signal that downstream optimization can act on. The RM plays the role the environment's reward function plays in standard RL, except in RLHF you learn it from human preferences rather than having it fixed.

The canonical RM for RLHF is a Bradley-Terry model: it predicts the probability that one completion is preferred over another. You build it by taking a pretrained base LM, replacing its token-prediction (LM) head with a linear layer that outputs one scalar, and training on pairwise (chosen, rejected) preference data with a log-sigmoid contrastive loss. This page is about training that model; the upstream question of where the reward signal comes from and how to shape it is reward design.

Why use it

Captures hard-to-specify preferences. Open-ended qualities (helpfulness, tone, safety) that no deterministic checker can express are learnable from comparisons. This is the gap a verifiable reward cannot fill (reward design).
Reusable proxy objective. Once trained, one RM scores unlimited new completions (as the terminal reward for PPO / GRPO, or to rank candidates in best-of-N sampling) without collecting fresh human labels each step.
Foundation of RLHF. The InstructGPT recipe (preferences then RM then RL) is the basis of modern post-training (References).

When to use it (and when not)

Reach for a trained reward model when the target quality is subjective and no program can score it: helpfulness, tone, style, and safety are the classic cases, and you either have or can collect pairwise (chosen, rejected) comparisons. Pick the variant to match the signal:

Bradley-Terry RM when you need one reusable scalar to drive PPO / GRPO or to rerank best-of-N candidates.
ORM / PRM when the domain is verifiable (math, code, reasoning) and you have outcome labels (ORM) or want step-level scores to guide search (PRM).
Generative judge when you want cheap evaluation or data with no training run, and can accept lower benchmark accuracy plus per-call inference cost.

Do not train an RM when:

A cheap verifiable reward already exists (unit tests, exact-match, format/schema checks). Prefer that signal directly: the RM only adds proxy error and over-optimization risk on top of a reward you could have computed exactly (reward design).
You cannot collect consistent preference labels. Noisy or contradictory labelers teach the RM their noise; pairwise accuracy stalls near chance.
You cannot afford a held-out eval and a KL budget to bound over-optimization. The RM is a learned proxy and the policy will exploit its errors (see Failure modes).
You need calibrated absolute scores. Bradley-Terry fixes only score differences; the absolute scale is underdetermined, so raw RM outputs are not a calibrated quality metric without extra work.

The Bradley-Terry loss

The Bradley-Terry model of preference gives each item a latent strength p_i > 0 and sets P(i preferred over j) = p_i / (p_i + p_j). Reparametrizing with unbounded scores p_i = exp(r_i) collapses this to a sigmoid of the score difference:

P(i preferred over j) = sigmoid(r_i - r_j)
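
Written out in full, the reparametrization is a single line of algebra. This only restates the two definitions above: substitute p = exp(r), then divide numerator and denominator by exp(r_i).

```latex
\begin{aligned}
P(i \text{ preferred over } j)
  &= \frac{p_i}{p_i + p_j}
   = \frac{e^{r_i}}{e^{r_i} + e^{r_j}}
   = \frac{1}{1 + e^{-(r_i - r_j)}}
   = \sigma(r_i - r_j).
\end{aligned}
```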

For a prompt x with chosen completion y_c and rejected y_r, the RM r_theta gives P(y_c > y_r | x) = sigmoid(r_theta(y_c|x) - r_theta(y_r|x)). Taking the negative log-likelihood yields the standard training loss:

L = -log sigmoid( r_theta(y_c|x) - r_theta(y_r|x) )
equivalently (softplus form): L = log(1 + exp( r_theta(y_r|x) - r_theta(y_c|x) ))

Only differences in scores matter: adding the same constant to every score leaves all preferences unchanged, so the absolute reward scale is underdetermined (a fact the architecture and trainers must manage). In code the loss is a one-liner over the two scalar scores:

# Reference template (torch). Core math validated by the runnable block below.
import torch.nn.functional as F

# rewards_chosen, rewards_rejected: (batch,) scalar RM scores for the paired completions
loss = -F.logsigmoid(rewards_chosen - rewards_rejected).mean()

The block below reproduces that loss in numpy and checks the properties the page relies on: the softplus rewrite is exact, the loss is shift-invariant, it stays finite at extreme margins, the implied probability equals the Bradley-Terry ratio, and the K-wise Plackett-Luce objective (below) reduces to it at K = 2.

# Runnable (numpy): Bradley-Terry loss core math.
import numpy as np

def sigmoid(z):
    return np.where(z >= 0, 1.0 / (1.0 + np.exp(-z)), np.exp(z) / (1.0 + np.exp(z)))

def bt_loss(rc, rr):
    # L = -log sigmoid(rc - rr), computed stably as softplus(-(rc-rr))
    return np.logaddexp(0.0, -(rc - rr))

def bt_loss_naive(rc, rr):
    return -np.log(sigmoid(rc - rr))

rng = np.random.default_rng(0)
rc = rng.normal(size=1000); rr = rng.normal(size=1000)
L = bt_loss(rc, rr)

# 1) stable softplus form == naive -log sigmoid form
assert np.allclose(L, bt_loss_naive(rc, rr), atol=1e-9)
# 2) == page's alternate form log(1 + exp(rr - rc))
assert np.allclose(L, np.log1p(np.exp(rr - rc)), atol=1e-6)
# 3) shift invariance: add same constant to both -> loss unchanged
assert np.allclose(bt_loss(rc + 7.3, rr + 7.3), L, atol=1e-12)
# 4) tie -> prob 0.5, loss = log 2
assert np.isclose(bt_loss(np.array([2.0]), np.array([2.0]))[0], np.log(2.0))
# 5) numerical stability at extreme margins (naive -log sigmoid over/underflows)
z = np.array([-1000.0, 1000.0])  # this is (rc - rr)
Ls = np.logaddexp(0.0, -z)
assert np.isfinite(Ls).all()
assert np.isclose(Ls[0], 1000.0, atol=1e-6)   # chosen << rejected -> huge loss
assert Ls[1] < 1e-6                            # chosen >> rejected -> ~0 loss
# 6) sigmoid(ri-rj) == Bradley-Terry ratio p_i/(p_i+p_j), p = exp(r)
ri, rj = 0.7, -0.4; pi, pj = np.exp(ri), np.exp(rj)
assert np.isclose(sigmoid(np.array([ri - rj]))[0], pi / (pi + pj))
# 7) Plackett-Luce over K items reduces to Bradley-Terry at K=2
def pl_nll(scores_best_to_worst):
    s = np.asarray(scores_best_to_worst, float); nll = 0.0
    for k in range(len(s) - 1):
        nll += -(s[k] - np.logaddexp.reduce(s[k:]))
    return nll
assert np.isclose(pl_nll([1.2, -0.3]), bt_loss(np.array([1.2]), np.array([-0.3]))[0])
print("N1 OK: softplus-equiv, shift-invariant, stable, sigmoid=BT ratio, PL(K=2)=BT")

Two variants appear often but have not converged into a single best practice:

Margin loss (Llama 2): subtract a preference margin m(y_c, y_r) derived from rating magnitude (e.g. a 5-vs-2 Likert pair gives m = 3): L = -log sigmoid(r_theta(y_c|x) - r_theta(y_r|x) - m). Llama 3 dropped it after seeing diminishing returns.
K-wise loss (Starling): a Plackett-Luce ranking objective over K ranked completions that reduces to Bradley-Terry when K = 2 (asserted above). Relatedly, InstructGPT forms all (K choose 2) pairs per prompt and groups them into one batched, reweighted update so the highly correlated pairs do not overfit the RM.
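
As a rough numpy sketch of the two variants above: the margin term is subtracted inside the sigmoid, and the InstructGPT-style grouping averages one prompt's (K choose 2) pair losses so each prompt contributes a single term. The function names `bt_loss` and `prompt_loss` and the example scores and margins are illustrative assumptions, not taken from either paper's code.

```python
import numpy as np
from itertools import combinations

def bt_loss(rc, rr, margin=0.0):
    # -log sigmoid(rc - rr - margin), computed stably as softplus(-(rc - rr - margin)).
    return np.logaddexp(0.0, -(rc - rr - margin))

def prompt_loss(scores_best_to_worst):
    # Form every (i, j) pair with i ranked above j, then average the pair losses
    # so the prompt counts once regardless of K (weight 1 / (K choose 2) per pair).
    s = np.asarray(scores_best_to_worst, dtype=float)
    pairs = list(combinations(range(len(s)), 2))
    losses = [bt_loss(s[i], s[j]) for i, j in pairs]
    return float(np.mean(losses)), len(pairs)

rc, rr = np.array([1.0, 0.2]), np.array([0.0, 0.5])
plain = bt_loss(rc, rr)                                    # margin = 0 is plain Bradley-Terry
with_margin = bt_loss(rc, rr, margin=np.array([3.0, 1.0])) # illustrative margins
loss4, n_pairs4 = prompt_loss([2.0, 1.0, 0.5, -1.0])       # K = 4 -> 6 pairs, one averaged term
```

A positive margin always raises the loss for a given score gap, which is how it pushes strongly preferred pairs further apart.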

Architecture

The common implementation mirrors AutoModelForSequenceClassification: a pretrained LM with a small linear head producing a single logit for a prompt-completion pair. The scalar is read from a sequence-level representation, usually the hidden state of the last non-padding token (often the EOS token). At inference that logit is used directly as the reward: a higher value means the text is more likely to be the "chosen" one in a comparison, and because the output is already one scalar per sequence, no aggregation is needed.

flowchart TD
  PAIRS["Preference pairs (chosen, rejected)"] --> TOK["Tokenize + chat template"]
  TOK --> LM["Base LM, token-prediction head removed"]
  LM --> POOL["Last non-pad token hidden state (often EOS)"]
  POOL --> HEAD["Linear head -> 1 scalar r(x,y)"]
  HEAD --> LOSS["Bradley-Terry log-sigmoid loss"]
  LOSS -->|backprop, 1 epoch| LM
  HEAD -.inference.-> RM["Trained reward model"]
  RM --> TYPES["Reward model type"]
  TYPES --> BT["Bradley-Terry (sequence score)"]
  TYPES --> ORM["ORM (per-token correctness)"]
  TYPES --> PRM["PRM (per-step score)"]
  TYPES --> GEN["Generative judge (LLM-as-a-judge)"]
  RM -.score.-> BON["Best-of-N rerank"]
  RM -.score.-> RL["Terminal reward for PPO / GRPO"]

# Reference template (transformers). Read-out math validated by the runnable block below.
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Replace the LM (token-prediction) head with a linear layer -> one scalar.
# num_labels=1 makes the classification head output a single reward value.
model = AutoModelForSequenceClassification.from_pretrained("Qwen/Qwen3-0.6B", num_labels=1)
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-0.6B")
# r(x, y) is taken from the last non-pad token's hidden state (often EOS).

The head is a projection of exactly one token's hidden state, so the read-out must pick the last non-padding token, not the last column of a right-padded batch. The block below implements that pooling and shows why: it matches an explicit reference loop, and a naive "take the last column" reader returns different (wrong) scores once padding is present. That padding-invariance is also the correctness property production batched scoring depends on.

# Runnable (numpy): scalar-head read-out from the last non-pad token.
import numpy as np

def rm_scores(hidden, mask, w, b=0.0):
    # hidden (B,S,H), mask (B,S) 1=real 0=pad (right padding).
    # Read last non-pad token hidden state, project with linear head -> scalar.
    last_idx = mask.sum(axis=1).astype(int) - 1
    pooled = hidden[np.arange(hidden.shape[0]), last_idx, :]
    return pooled @ w + b

def rm_scores_ref(hidden, mask, w, b=0.0):
    out = []
    for h_seq, m in zip(hidden, mask):
        real = np.where(m == 1)[0]
        out.append(h_seq[real[-1]] @ w + b)
    return np.array(out)

rng = np.random.default_rng(1)
B, S, H = 4, 6, 8
hidden = rng.normal(size=(B, S, H))
lengths = np.array([6, 3, 1, 5])
mask = (np.arange(S)[None, :] < lengths[:, None]).astype(int)
w = rng.normal(size=H); b = 0.5
r = rm_scores(hidden, mask, w, b)

# 1) matches explicit reference loop
assert np.allclose(r, rm_scores_ref(hidden, mask, w, b))
# 2) selects index length-1 per row
assert (mask.sum(1) - 1 == np.array([5, 2, 0, 4])).all()
# 3) padding invariance: junk right-padding must NOT change the score
pad = 3
hidden_pad = np.concatenate([hidden, rng.normal(size=(B, pad, H))], axis=1)
mask_pad = np.concatenate([mask, np.zeros((B, pad), int)], axis=1)
assert np.allclose(r, rm_scores(hidden_pad, mask_pad, w, b)), "must be padding invariant"
# 4) adversarial: naive 'take last column' reads junk pad and is WRONG under padding
naive_last = hidden_pad[:, -1, :] @ w + b
assert not np.allclose(naive_last, r)
print("N2 OK: last-non-pad readout == ref, index=len-1, padding invariant, naive-last-col wrong")

The standard practice is to train for only one epoch to avoid overfitting the comparisons. Most of the real implementation effort is the data loader and distributed setup, which is library-specific. Code that works in TRL will not port directly to other frameworks.

Reward model types

The Bradley-Terry RM scores a whole answer; reasoning-heavy tasks motivate two finer-grained variants, and a generative judge needs no reward-model training at all. The three trained variants share the linear-head-on-an-LM shape but differ in what they predict and how they are supervised; the generative judge has no trained head.

Bradley-Terry RM. Output: one sequence-level quality score. Training: contrastive log-sigmoid loss over pairwise (or N-wise) comparisons. Use: rerank completions for best-of-N, or provide the terminal reward for RLHF.
Outcome Reward Model (ORM). Output: a per-token probability that the answer is correct. Training: per-token binary cross-entropy, where the sequence's outcome label (1 correct, 0 incorrect) is copied onto every completion token and prompt tokens are masked with -100. Use: verifiable/reasoning domains; aggregate the per-token probabilities by mean, min (tail risk), or product at inference. (Training a Bradley-Terry model on correct-vs-incorrect pairs is not an ORM; it is still a Bradley-Terry RM.)
Process Reward Model (PRM). Output: a score at each reasoning-step boundary (e.g. a newline or special separator token). Training: per-step cross-entropy over 3 classes: correct (+1), neutral (0), incorrect (-1), with all non-boundary tokens masked. Use: scoring chain-of-thought, or guiding search by pruning low-scoring branches (References: Let's Verify Step by Step; PRM discussion in DeepSeekMath).
Generative RM / LLM-as-a-judge. No trained scalar head: prompt an existing LLM with judging instructions plus the two completions and read its verdict. Cheap for both data collection and evaluation; use temperature 0 to cut rating variance. These tend to trail trained reward models on RM benchmarks, so reward modeling remains an important technique.
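
The generative-judge entry above can be sketched without committing to any provider. Everything here is an assumption for illustration: `call_llm` is a hypothetical callable (prompt string in, reply string out, ideally wrapping a client set to temperature 0), and the prompt wording is a placeholder. The sketch also asks in both orders and keeps only consistent verdicts, which is a common mitigation for position bias rather than something this page prescribes.

```python
def build_judge_prompt(question, answer_a, answer_b):
    # Placeholder judging instructions; real prompts are usually longer.
    return (
        "You are comparing two answers to the same question.\n"
        "Do not favor an answer for its position or its length.\n\n"
        f"Question:\n{question}\n\nAnswer A:\n{answer_a}\n\nAnswer B:\n{answer_b}\n\n"
        "Reply with exactly one letter: A or B."
    )

def parse_verdict(reply):
    # Accept only a bare "A" or "B"; anything else is treated as unparseable.
    reply = reply.strip().upper()
    return reply if reply in ("A", "B") else None

def judge_pair(question, first, second, call_llm):
    # call_llm: hypothetical callable str -> str supplied by the caller.
    forward = parse_verdict(call_llm(build_judge_prompt(question, first, second)))
    swapped = parse_verdict(call_llm(build_judge_prompt(question, second, first)))
    if forward == "A" and swapped == "B":
        return "first"
    if forward == "B" and swapped == "A":
        return "second"
    return None  # inconsistent across orders, or unparseable: no verdict
```

Returning no verdict for inconsistent pairs trades coverage for cleaner labels; whether that trade is worth it depends on how the verdicts are used.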

The ORM label construction and its inference-time aggregation are the parts most often implemented wrong. The block below builds the masked per-token label vector (-100 on the prompt, the outcome copied onto every completion token), checks the three aggregations obey product <= min <= mean <= max, and shows that forgetting to mask the prompt changes the aggregate.

# Runnable (numpy): ORM label masking and inference-time aggregation.
import numpy as np
IGNORE = -100   # prompt tokens excluded from ORM loss/aggregation

def orm_labels(prompt_len, completion_len, outcome):
    labels = np.full(prompt_len + completion_len, IGNORE, dtype=int)
    labels[prompt_len:] = outcome        # copy sequence outcome onto every completion token
    return labels

def aggregate(probs, how):
    if how == "mean":    return probs.mean()
    if how == "min":     return probs.min()      # tail risk
    if how == "product": return np.prod(probs)
    raise ValueError(how)

labels = orm_labels(prompt_len=3, completion_len=4, outcome=1)
assert (labels[:3] == IGNORE).all()              # prompt masked
assert (labels[3:] == 1).all()                   # outcome copied to completion tokens
assert (labels != IGNORE).sum() == 4             # only completion tokens supervised

probs = np.array([0.9, 0.8, 0.95, 0.6])
assert np.isclose(aggregate(probs, "mean"), 0.8125)
assert np.isclose(aggregate(probs, "min"), 0.6)
assert np.isclose(aggregate(probs, "product"), 0.9 * 0.8 * 0.95 * 0.6)
assert aggregate(probs, "min") <= aggregate(probs, "mean") <= probs.max()
assert aggregate(probs, "product") <= aggregate(probs, "min")   # prod of (0,1] values
# adversarial: leaving prompt tokens unmasked changes the aggregate -> masking matters
full = np.array([0.5, 0.5, 0.5, 0.9, 0.8, 0.95, 0.6])
assert not np.isclose(full.mean(), probs.mean())
print("N5 OK: label copy+mask (4 supervised), mean/min/product ordering, masking changes result")
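
PRM supervision, as described in the list above, can be sketched in the same style: only the token that closes each reasoning step carries a label, and everything else (including the prompt) is masked. The separator id, the class-id mapping, and the name `prm_labels` are hypothetical choices made for this example; a real trainer may encode the three classes differently.

```python
import numpy as np

IGNORE = -100
# Hypothetical class ids for the three step labels (incorrect, neutral, correct).
CLASS_ID = {-1: 0, 0: 1, 1: 2}

def prm_labels(token_ids, sep_id, step_labels, prompt_len):
    # Supervise only the separator token that ends each step of the completion.
    token_ids = np.asarray(token_ids)
    labels = np.full(len(token_ids), IGNORE, dtype=int)
    boundaries = [i for i in range(prompt_len, len(token_ids)) if token_ids[i] == sep_id]
    if len(boundaries) != len(step_labels):
        raise ValueError("one label per step boundary is required")
    for pos, lab in zip(boundaries, step_labels):
        labels[pos] = CLASS_ID[lab]
    return labels

SEP = 9
# 3 prompt tokens (one happens to equal SEP and must stay masked), then 3 steps.
tokens = [5, 9, 7, 11, 12, 9, 13, 9, 14, 15, 16, 9]
labels = prm_labels(tokens, SEP, step_labels=[1, 0, -1], prompt_len=3)
```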

How to train it

TRL's RewardTrainer trains a Bradley-Terry RM directly from a paired-preference dataset whose rows carry chosen and rejected fields (standard text or conversational messages). Passing a model id string loads it with AutoModelForSequenceClassification and sets the scalar head (num_labels=1) automatically; the tokenizer is loaded for you.

# Reference template (TRL >= 1.7): verify RewardConfig fields on the installed version.
# train_reward_model.py
from datasets import load_dataset
from trl import RewardConfig, RewardTrainer

# Paired-preference data: every row has `chosen` and `rejected`
# (standard text or conversational messages; TRL applies the chat template).
dataset = load_dataset("trl-lib/ultrafeedback_binarized", split="train")

trainer = RewardTrainer(
    model="Qwen/Qwen3-0.6B",
    args=RewardConfig(
        output_dir="Qwen3-0.6B-Reward",
        num_train_epochs=1,                # one epoch is standard to avoid overfitting
        per_device_train_batch_size=8,
        learning_rate=1e-4,                # RewardConfig default
        center_rewards_coefficient=1e-2,   # pin scores near zero (BT is shift-invariant)
        bf16=True,
    ),
    train_dataset=dataset,
)
trainer.train()

accelerate launch train_reward_model.py

TRL logs accuracy (the fraction of pairs where the chosen completion outscores the rejected) and margin (the mean chosen-minus-rejected score). Pairwise accuracy on a held-out preference split is the headline RM metric. Validate it before wiring the RM into an RL run.

Two things in that config carry the math worth checking: how accuracy and margin are computed (ties must not count as wins), and what center_rewards_coefficient does to the shift-invariant loss. The block below reproduces both. The centering penalty coef * mean((r_c + r_r)^2) adds a term that the shift-invariant Bradley-Terry loss lacks, so the total loss now has a unique minimizing global shift (the one that drives the mean reward toward zero), which is what keeps downstream consumers on a stable range.

# Runnable (numpy): TRL's accuracy/margin metrics and reward centering.
import numpy as np

def accuracy(rc, rr): return float(np.mean(rc > rr))   # ties are NOT correct
def margin(rc, rr):   return float(np.mean(rc - rr))
def bt(rc, rr):       return float(np.mean(np.logaddexp(0.0, -(rc - rr))))

rc = np.array([1.0, 0.5, -0.2, 2.0, 0.0])
rr = np.array([0.0, 0.7,  0.3, 1.0, 0.0])   # last pair is a tie
# by hand: 1>0 T, .5>.7 F, -.2>.3 F, 2>1 T, 0>0 F  => 2/5
assert accuracy(rc, rr) == 2 / 5
assert np.isclose(margin(rc, rr), np.mean([1.0, -0.2, -0.5, 1.0, 0.0]))

# TRL centering penalty: coef * mean((rc+rr)^2). Base BT loss is shift-invariant;
# the centered total loss is not, and has a unique minimizing global shift s.
def total(rc, rr, coef, s):
    a, b = rc + s, rr + s
    return bt(a, b) + coef * np.mean((a + b) ** 2)

rng = np.random.default_rng(2)
rc2 = rng.normal(size=256); rr2 = rng.normal(size=256); coef = 1e-2
assert np.isclose(bt(rc2 + 5, rr2 + 5), bt(rc2, rr2))          # BT invariant to shift
shifts = np.linspace(-5, 5, 4001)
tot = np.array([total(rc2, rr2, coef, s) for s in shifts])
best = shifts[np.argmin(tot)]
predicted = -0.5 * np.mean(rc2 + rr2)                          # argmin of mean((u+2s)^2)
assert abs(best - predicted) < 5e-3, (best, predicted)
assert not np.isclose(total(rc2, rr2, coef, 0.0), total(rc2, rr2, coef, 3.0))
print("N3 OK: accuracy(ties excluded)=%.1f margin ok, centering argmin=%.4f (pred %.4f)"
      % (accuracy(rc, rr), best, predicted))

How to integrate it (RL wiring and best-of-N)

A trained RM has two downstream jobs: rerank candidates in best-of-N sampling, and supply the terminal reward for PPO / GRPO. The reference template loads the saved RM and exposes a reward_fn that scores each completion from the scalar head, which a GRPO trainer calls per generation.

# Reference template (TRL): wire the trained RM as the terminal reward for GRPO.
# Verify the exact GRPOConfig / GRPOTrainer signature on your installed TRL version.
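import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer
from trl import GRPOConfig, GRPOTrainer

rm = AutoModelForSequenceClassification.from_pretrained("Qwen3-0.6B-Reward", num_labels=1)
rm.eval()                                  # scoring only: disable dropout
rm_tok = AutoTokenizer.from_pretrained("Qwen3-0.6B-Reward")

@torch.no_grad()                           # rewards are constants for the policy update
def reward_fn(prompts, completions, **kw):
    # Assumes plain-string prompts and completions; conversational data would need
    # the RM's chat template applied instead of string concatenation.
    texts = [p + c for p, c in zip(prompts, completions)]
    batch = rm_tok(texts, return_tensors="pt", padding=True)
    return rm(**batch).logits[:, 0].tolist()   # list of floats, one terminal reward each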

trainer = GRPOTrainer(
    model="Qwen/Qwen3-0.6B",
    reward_funcs=reward_fn,
    args=GRPOConfig(output_dir="Qwen3-0.6B-GRPO", num_generations=8),
    train_dataset=...,                     # prompt-only dataset
)
trainer.train()

The two consumer-side operations are selection (best-of-N picks the argmax score) and normalization: because only reward differences are meaningful, PPO implementations commonly whiten rewards or advantages to zero mean and unit variance, and GRPO standardizes rewards within each prompt's group of completions. The block below validates selection and batch-level whitening, including the monotonic gain of larger N and the divide-by-zero guard when every candidate scores the same.

# Runnable (numpy): best-of-N selection and reward whitening.
import numpy as np

def best_of_n(scores): return int(np.argmax(scores))       # rerank: pick top RM score
def whiten(rewards, eps=1e-8):                              # PPO/GRPO advantage norm
    return (rewards - rewards.mean()) / (rewards.std() + eps)

assert best_of_n(np.array([0.42])) == 0                          # best-of-1 == identity
assert best_of_n(np.array([0.1, -2.0, 3.5, 3.4999, 0.0])) == 2   # picks true max

rng = np.random.default_rng(3)
draws = rng.normal(size=(20000, 8))                        # 8 candidates / prompt
exp_best = [draws[:, :n].max(axis=1).mean() for n in range(1, 9)]
assert all(exp_best[i + 1] >= exp_best[i] - 1e-9 for i in range(7))  # monotone in N
assert exp_best[-1] > exp_best[0] + 0.5                    # more candidates clearly help

w = whiten(np.array([1.0, 2.0, 3.0, 4.0]))
assert abs(w.mean()) < 1e-9 and abs(w.std() - 1.0) < 1e-6
wc = whiten(np.array([5.0, 5.0, 5.0]))                     # adversarial: std 0
assert np.all(np.abs(wc) < 1e-6)                           # no divide-by-zero blowup
print("N4 OK: argmax rerank, monotone E[best] %.2f->%.2f, whiten mean0/std1, std0 guarded"
      % (exp_best[0], exp_best[-1]))
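
For GRPO specifically, the normalization is applied within the group of completions sampled for one prompt rather than across the whole batch. A minimal sketch, assuming rewards arrive as a flat array ordered prompt by prompt with a fixed group size (the name `group_advantages` and the example values are illustrative):

```python
import numpy as np

def group_advantages(rewards, group_size, eps=1e-8):
    # Reshape to (num_prompts, group_size) and standardize each row on its own.
    r = np.asarray(rewards, dtype=float).reshape(-1, group_size)
    mean = r.mean(axis=1, keepdims=True)
    std = r.std(axis=1, keepdims=True)
    return ((r - mean) / (std + eps)).reshape(-1)

# Two prompts, 4 completions each; the second prompt's RM scores sit 100 higher.
rewards = np.array([1.0, 2.0, 3.0, 4.0, 101.0, 102.0, 103.0, 104.0])
adv = group_advantages(rewards, group_size=4)
```

Both groups end up with the same advantages, so a per-prompt offset in the RM's scale does not leak into the policy update.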

How to run it in production (serving and scoring)

Serving an RM is a batched forward pass that returns the scalar head logit; there is no decoding loop. Set the model to eval mode, disable gradients, batch by similar length, and pad on one side consistently. The one correctness rule is the read-out: the score comes from the last non-padding token, so a padded batch must return the same score as scoring each example alone. That padding-invariance property is the one asserted (for right padding) in the read-out block under Architecture; keep it as a serving regression test.

# Reference template (transformers): batched scoring endpoint.
# Correctness (padding invariance) is validated by the read-out numpy block under Architecture.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

rm = AutoModelForSequenceClassification.from_pretrained("Qwen3-0.6B-Reward", num_labels=1)
rm.eval()
tok = AutoTokenizer.from_pretrained("Qwen3-0.6B-Reward")

@torch.no_grad()
def score(prompts, completions, batch_size=16):
    out = []
    for i in range(0, len(prompts), batch_size):
        texts = [p + c for p, c in zip(prompts[i:i + batch_size], completions[i:i + batch_size])]
        batch = tok(texts, return_tensors="pt", padding=True, truncation=True)
        out.extend(rm(**batch).logits[:, 0].tolist())
    return out
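
Batching by similar length can be layered on top of any scoring function. The sketch below is one way to do it, under stated assumptions: `score_batch` is a hypothetical callable (list of strings in, list of floats out), and character length stands in for token count as a cheap proxy.

```python
import numpy as np

def length_sorted_batches(lengths, batch_size):
    # Indices grouped so each batch holds inputs of similar length (less padding).
    order = np.argsort(lengths, kind="stable")
    return [order[i:i + batch_size] for i in range(0, len(order), batch_size)]

def score_in_buckets(texts, score_batch, batch_size=16):
    # score_batch: hypothetical callable list[str] -> list[float].
    lengths = [len(t) for t in texts]          # character count as a token-count proxy
    out = np.empty(len(texts), dtype=float)
    for idx in length_sorted_batches(lengths, batch_size):
        out[idx] = score_batch([texts[i] for i in idx])
    return out.tolist()                        # scores in the caller's original order
```

Writing results back through the index array keeps the output aligned with the input order, so callers never see the sort.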

Because Bradley-Terry fixes only score differences, do not expose raw logits as an absolute quality number: publish a rank or a normalized score against a fixed reference set, and store the centering statistics used at training time so scores stay comparable across model versions.
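
One possible shape for such a normalized score, assuming a fixed set of reference completions has already been scored by the same RM version (the function name and the choice of z-score plus percentile are illustrative, not a prescribed format):

```python
import numpy as np

def normalize_against_reference(scores, reference_scores, eps=1e-8):
    ref = np.asarray(reference_scores, dtype=float)
    scores = np.asarray(scores, dtype=float)
    # z-score relative to the reference set's mean and spread
    z = (scores - ref.mean()) / (ref.std() + eps)
    # percentile: fraction of reference scores less than or equal to each new score
    pct = np.searchsorted(np.sort(ref), scores, side="right") / len(ref)
    return z, pct

ref = [0.0, 1.0, 2.0, 3.0, 4.0]
z, pct = normalize_against_reference([2.0, 10.0, -1.0], ref)
```

Both outputs are unchanged if every raw score shifts by the same constant, which is the invariance Bradley-Terry training leaves open.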

How to maintain it (monitoring and retraining)

The RM decays as the policy it scores drifts away from the completions it was trained on: the policy learns to produce text the RM has never seen, so held-out pairwise accuracy on fresh, on-policy comparisons is the metric to watch, not training loss. Track two things on a canary preference set: pairwise accuracy (does chosen still outscore rejected) and calibration (does the model's predicted P(chosen > rejected) = sigmoid(r_c - r_r) match the empirical win-rate). Rising Expected Calibration Error is an early signal that the RM is miscalibrated (typically overconfident) and due for a refresh. When either metric regresses, collect new on-policy preferences and retrain from the base model (still one epoch), rather than continuing to train the old RM.

# Runnable (numpy): calibration monitor (Expected Calibration Error) for RM drift.
import numpy as np

def implied_pref_prob(rc, rr):
    d = rc - rr
    return np.where(d >= 0, 1 / (1 + np.exp(-d)), np.exp(d) / (1 + np.exp(d)))  # sigmoid(rc-rr)

def ece(pred_prob, outcome, n_bins=10):
    bins = np.linspace(0, 1, n_bins + 1); e = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        m = (pred_prob > lo) & (pred_prob <= hi)
        if m.sum() == 0:
            continue
        e += m.mean() * abs(pred_prob[m].mean() - outcome[m].mean())
    return e

rng = np.random.default_rng(4)
d = rng.normal(scale=2.0, size=300000)
rc, rr = d, np.zeros_like(d)
p = implied_pref_prob(rc, rr)
y_cal = (rng.random(p.shape) < p).astype(int)         # outcomes from model's own prob
assert ece(p, y_cal) < 0.02, ece(p, y_cal)            # calibrated -> ECE ~ 0
y_bad = (rng.random(p.shape) < 0.5).astype(int)       # RM drift: outcomes are coin flips
assert ece(p, y_bad) > 0.1, ece(p, y_bad)             # detector fires
print("N8 OK: ECE calibrated=%.4f (<0.02), drifted=%.4f (>0.1)"
      % (ece(p, y_cal), ece(p, y_bad)))
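
The "when either regresses" rule can be turned into a small check that runs after each canary evaluation. This is a sketch only: the function names are made up for the example and the two thresholds are illustrative placeholders, not recommended values.

```python
import numpy as np

def pairwise_accuracy(rc, rr):
    # Fraction of canary pairs where chosen strictly outscores rejected (ties miss).
    return float(np.mean(np.asarray(rc) > np.asarray(rr)))

def needs_refresh(acc_now, ece_now, acc_base, ece_base,
                  max_acc_drop=0.03, max_ece_rise=0.05):
    # Placeholder thresholds: flag a refresh if accuracy fell or ECE rose too far
    # relative to the values recorded when the RM was last validated.
    return (acc_base - acc_now) > max_acc_drop or (ece_now - ece_base) > max_ece_rise

acc = pairwise_accuracy([1.0, 0.5, -0.2, 2.0], [0.0, 0.7, 0.3, 1.0])
flag = needs_refresh(acc_now=acc, ece_now=0.04, acc_base=0.72, ece_base=0.03)
```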

How to scale it (distributed training)

RM training is a single forward-backward pass over a scalar loss, so it scales like any data-parallel run: shard the preference pairs across GPUs, compute the local mean loss, and reduce. accelerate handles the launch; DeepSpeed ZeRO partitions optimizer state, gradients, and (at stage 3) parameters across ranks when the model no longer fits on one device.

# Reference template: multi-GPU / multi-node data-parallel training.
accelerate launch --multi_gpu --num_processes 8 train_reward_model.py
# or with DeepSpeed ZeRO-3, which partitions optimizer state, gradients, and parameters:
accelerate launch --use_deepspeed --deepspeed_config_file zero3.json train_reward_model.py

The one arithmetic that has to be right is the reduction: the global-batch mean loss equals a size-weighted average of per-shard means, not the plain mean of means. With uneven shard sizes (the common case at the end of an epoch) the unweighted average is a real, silent bug. The block below asserts the weighted reduction matches the single-device loss and that the naive mean-of-means does not.

# Runnable (numpy): data-parallel loss reduction equivalence.
import numpy as np

def bt_losses(rc, rr): return np.logaddexp(0.0, -(rc - rr))   # per-pair BT loss

rng = np.random.default_rng(5)
N = 300; rc = rng.normal(size=N); rr = rng.normal(size=N)
global_mean = bt_losses(rc, rr).mean()

idx = np.arange(N); rng.shuffle(idx)
shards = np.array_split(idx, [140, 210])                      # sizes 140, 70, 90
per_shard = np.array([bt_losses(rc[s], rr[s]).mean() for s in shards])
sizes = np.array([len(s) for s in shards])

weighted = np.sum(sizes * per_shard) / sizes.sum()           # correct DP reduction
assert np.isclose(weighted, global_mean, atol=1e-12)
naive = per_shard.mean()                                      # adversarial: unweighted
assert not np.isclose(naive, global_mean)                    # classic uneven-batch bug

eq = np.array_split(np.arange(N), 3)                          # 100/100/100
eq_means = np.array([bt_losses(rc[s], rr[s]).mean() for s in eq])
assert np.isclose(eq_means.mean(), global_mean, atol=1e-12)  # equal shards: mean-of-means ok
print("N9 OK: sizes %s weighted=%.6f==global=%.6f; naive=%.6f wrong"
      % ([int(x) for x in sizes], weighted, global_mean, naive))

Failure modes

Overfitting: RMs are typically trained for a single epoch; more epochs memorize the comparison set. Watch held-out pairwise accuracy, not training loss.
Correlated pairs per prompt: sampling K completions per prompt yields (K choose 2) highly correlated pairs; shuffling them naively overfits, so keep one prompt's pairs in the same batch and average/reweight them (InstructGPT).
Underdetermined reward scale: only score differences matter, so absolute rewards can drift arbitrarily; center them (TRL's center_rewards_coefficient) so downstream consumers see a stable range.
Proxy over-optimization: the RM is a learned proxy, and PPO / GRPO will exploit its errors as you optimize against it; bound this with a KL budget and a held-out eval the RM never sees (reward design).
LLM-as-a-judge bias: generative judges carry position, length, and self-preference bias; instruct against them explicitly and use temperature 0, and remember they still trail trained RMs on RM benchmarks.
Padding read-out bug: scoring from the last column of a right-padded batch instead of the last non-pad token silently corrupts rewards for every shorter sequence; the padding-invariance assert under Architecture is the regression guard.
Wrong data-parallel reduction: averaging per-shard mean losses without size weighting biases the gradient on uneven shards; use the weighted reduction validated under scaling.
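
The KL budget mentioned under proxy over-optimization is usually applied by subtracting a penalty from the RM score before the policy update. A minimal sketch, assuming the common RLHF formulation (RM score minus beta times the summed policy-to-reference log-ratio over the completion); the name `kl_shaped_reward` and the value of beta are illustrative:

```python
import numpy as np

def kl_shaped_reward(rm_score, logp_policy, logp_ref, beta=0.05):
    # Sum of per-token log-ratios over the sampled completion: a single-sample
    # estimate of KL(policy || reference) for this sequence.
    kl = float(np.sum(np.asarray(logp_policy) - np.asarray(logp_ref)))
    return rm_score - beta * kl, kl

# Policy assigns higher log-probability to its own sample than the reference does,
# so the estimate is positive and the shaped reward drops below the raw RM score.
shaped, kl = kl_shaped_reward(2.0, [-0.5, -1.0], [-1.0, -2.0], beta=0.1)
```

A larger beta keeps the policy closer to the reference and limits how far it can exploit RM errors, at the cost of slower reward gains.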

References

RLHF book (Lambert), ch. 5 "Reward Modeling" (Bradley-Terry loss, ORM, PRM, generative RMs): https://rlhfbook.com
InstructGPT (reward model from human feedback, K-wise comparisons): https://arxiv.org/abs/2203.02155
TRL Reward Trainer (RewardConfig, RewardTrainer): https://huggingface.co/docs/trl/reward_trainer
Let's Verify Step by Step (process supervision / PRM): https://arxiv.org/abs/2305.20050
DeepSeekMath (PRM discussion, GRPO context): https://arxiv.org/abs/2402.03300

Related: reward design · GRPO · DPO · PPO · fine-tuning and post-training · Glossary