
title: PPO for LLM RLHF with a value model and clipped surrogate
description: Proximal Policy Optimization for LLM RLHF: the actor-critic setup, clipped surrogate, GAE value model, per-token KL, and the four model copies PPO holds.


PPO (proximal policy optimization)

Scope: actor-critic RL for LLM RLHF. A trainable policy is updated on a clipped surrogate objective, with per-token advantages from a separately trained value model (GAE) and a per-token KL penalty toward a frozen reference. PPO is the original RLHF optimizer behind InstructGPT and the four-model algorithm that GRPO simplifies by removing the value network. It is one stage of the post-training pipeline (see fine-tuning and post-training) and the algorithm underneath the RL libraries.

Reference templates on real APIs; pin versions and validate before production use. The numpy blocks below are runnable and self-checking; the torch, TRL, and verl blocks are labelled reference templates.

What it is

PPO is one of the foundational deep-RL algorithms (it trained OpenAI Five for DOTA 2) and the original optimizer behind RLHF in InstructGPT. For LLMs it is an actor-critic method that holds four model copies: a trainable policy (actor), a separately-trained value model (critic), a frozen reward model that scores completions, and a frozen reference that anchors a KL penalty. Each step it samples completions from the policy, scores them, estimates a per-token advantage with the value model, and updates the policy on a clipped surrogate objective:

J(θ) = E_t[ min( ρ_t · A_t ,  clip(ρ_t, 1−ε, 1+ε) · A_t ) ],   ρ_t = π_θ(a_t|s_t) / π_θ_old(a_t|s_t)

ρ_t is the importance-sampling ratio between the current policy and the policy that collected the data; it lets PPO take several gradient steps on one batch of rollouts. For language models the objective is computed per token. The sequence probability factorizes autoregressively, so ρ_t is recovered as exp(logπ_new − logπ_old) from a log-probability difference. Clipping the ratio to [1−ε, 1+ε] (ε≈0.2) bounds how far the policy can move in one update: a trust region, the stability idea PPO inherits from its predecessor TRPO.
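A minimal numpy sketch of the ratio computation described above, using made-up per-token probabilities for a single four-token completion (the numbers and the helper name `token_ratios` are illustrative assumptions). It checks that the product of per-token ratios equals the sequence-level ratio, which is what the autoregressive factorization implies, and flags the tokens that sit outside the clip band.

```python
import numpy as np


def token_ratios(new_logprobs, old_logprobs):
    # rho_t = pi_new(a_t|s_t) / pi_old(a_t|s_t), computed from a log-prob difference.
    return np.exp(np.asarray(new_logprobs) - np.asarray(old_logprobs))


# Hypothetical per-token probabilities of the sampled tokens (illustrative only).
old_lp = np.log(np.array([0.50, 0.20, 0.10, 0.40]))
new_lp = np.log(np.array([0.55, 0.20, 0.05, 0.60]))

rho = token_ratios(new_lp, old_lp)

# Autoregressive factorization: the sequence ratio is the product of token ratios.
seq_ratio = np.exp(new_lp.sum() - old_lp.sum())
assert np.isclose(np.prod(rho), seq_ratio)

# Tokens whose ratio lies outside the trust region [1 - eps, 1 + eps].
eps = 0.2
outside = (rho < 1.0 - eps) | (rho > 1.0 + eps)
print(rho.round(3), outside)
```

Working in log space avoids multiplying many small probabilities directly, which is why training code typically stores log-probabilities rather than probabilities.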

Why use it

PPO offers the finest-grained credit assignment of the common RLHF optimizers. A learned value function gives every token its own advantage (the token-level MDP view), which lowers variance relative to a single sequence-level baseline. This is the canonical InstructGPT and Tülu 3 setup and still the reference implementation for value-based RLHF. Reach for it when that precision is worth training a second network.

When to use it (and when not)

Reach for PPO when per-token credit assignment earns the extra critic; otherwise a lighter method usually wins.

  • Prefer GRPO when you want to avoid training a value function (research has not established best practices for learning a value head from an LM backbone) and to save memory. GRPO removes the value/critic entirely, replacing the learned baseline with a group-relative mean over multiple sampled completions (roughly half PPO's trainable-model memory). It is the default for verifiable-reward reasoning.
  • Prefer DPO when you only have offline (chosen, rejected) preference pairs and no rollout or reward budget. DPO is not a policy-gradient method: it bypasses the reward model and the value network entirely and optimizes the policy directly against the frozen reference, far cheaper and more stable, but offline.
  • SFT/LoRA first: cold-start format and behaviour before any RL; many tasks never need a critic at all.
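To make the contrast with GRPO concrete, here is a rough sketch of the group-relative baseline mentioned in the first bullet: the mean reward over a group of completions sampled for the same prompt stands in for the critic. The function name and the optional division by the group standard deviation are assumptions for illustration (std normalization is a common variant, not something this page prescribes).

```python
import numpy as np


def group_relative_advantages(rewards, normalize_std=True, tiny=1e-8):
    # rewards: shape (num_prompts, group_size), one scalar reward per completion.
    rewards = np.asarray(rewards, dtype=float)
    baseline = rewards.mean(axis=1, keepdims=True)  # group mean replaces the critic
    adv = rewards - baseline
    if normalize_std:  # assumption: common variant, optional
        adv = adv / (rewards.std(axis=1, keepdims=True) + tiny)
    return adv


# Two prompts, four sampled completions each, with 0/1 verifiable rewards.
rewards = np.array([[1.0, 0.0, 0.0, 1.0],
                    [0.0, 0.0, 0.0, 1.0]])
adv = group_relative_advantages(rewards)

# Advantages are centred within each group.
assert np.allclose(adv.sum(axis=1), 0.0)
print(adv.round(3))
```

Every token of a completion shares that completion's advantage here, which is the per-token precision PPO's critic is meant to improve on.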

Architecture

PPO runs as two coupled systems: a trainer that updates the policy and value networks, and a rollout generator that samples completions the reward model scores. The frozen reference supplies the KL anchor. The diagram traces one step, prompt to weight sync:

flowchart LR
  P["Prompt batch"] --> POL["Policy actor (trainable)"]
  POL --> ROLL["Rollouts (vLLM / SGLang)"]
  ROLL --> RM["Reward model (frozen)"]
  ROLL --> VAL["Value model critic (trained)"]
  RM --> GAE["GAE per-token advantages"]
  VAL --> GAE
  GAE --> UPD["Clipped surrogate update"]
  UPD -->|"weight sync"| POL
  REF["Reference (frozen)"] -.->|"KL penalty"| UPD

The clipped surrogate and KL

The core mechanism is the min/clip pair. In code, PPO builds two losses and keeps the more pessimistic one. Reference template (torch), the lines that define the update:

ratio = torch.exp(new_logprobs - old_logprobs)                # per-token policy ratio rho_t
pg1   = -advantages * ratio                                   # unclipped advantage-weighted loss
pg2   = -advantages * torch.clamp(ratio, 1.0 - eps, 1.0 + eps)  # ratio clipped to [1-eps, 1+eps]
pg_loss = torch.max(pg1, pg2)                                 # keep the more pessimistic update

The same objective with numpy only, runnable and self-checking (both clip sides, the boundary, the zero-gradient plateau, and equivalence to a slow per-element reference):

import numpy as np


def pg_loss(adv, ratio, eps=0.2):
    # PPO clipped surrogate as a per-token LOSS (the negated objective the
    # optimizer minimizes), so the pessimistic branch is the elementwise max.
    pg1 = -adv * ratio
    pg2 = -adv * np.clip(ratio, 1.0 - eps, 1.0 + eps)
    return np.maximum(pg1, pg2)


def pg_loss_reference(adv, ratio, eps=0.2):
    # Slow, explicit per-element reference implementation.
    out = np.empty(ratio.shape, dtype=float)
    for i in range(ratio.size):
        r = float(ratio.flat[i])
        a = float(adv.flat[i])
        r_clipped = min(max(r, 1.0 - eps), 1.0 + eps)
        out.flat[i] = max(-a * r, -a * r_clipped)
    return out


eps = 0.2
#                pos/above pos/below neg/below neg/above r==1(+) r==1(-)
adv = np.array([1.0, 1.0, -1.0, -1.0, 1.0, -1.0])
ratio = np.array([1.5, 0.5, 0.5, 2.0, 1.0, 1.0])
loss = pg_loss(adv, ratio, eps)

# 1) Equivalence to a slow, explicit reference (off-by-one / corruption catch).
assert np.allclose(loss, pg_loss_reference(adv, ratio, eps)), loss

# 2) ratio == 1 reproduces plain policy gradient: loss == -adv.
assert np.isclose(loss[4], -1.0) and np.isclose(loss[5], 1.0)

# 3) Positive advantage, ratio above 1+eps -> clipped to -(1+eps)*adv.
assert np.isclose(loss[0], -(1.0 + eps) * 1.0)  # -1.2

# 4) Positive advantage, ratio below 1-eps -> UNCLIPPED (clip bites one way only).
assert np.isclose(loss[1], -0.5)

# 5) Negative advantage, ratio below 1-eps -> clipped to -(1-eps)*adv.
assert np.isclose(loss[2], (1.0 - eps))  # 0.8

# 6) Negative advantage, ratio above 1+eps -> UNCLIPPED.
assert np.isclose(loss[3], 2.0)

# 7) Pessimism: the clipped loss is never below the unclipped loss.
assert np.all(loss >= -adv * ratio - 1e-12)

# 8) Boundary / zero-gradient plateau: past the clip, moving the ratio is inert.
plateau = pg_loss(np.array([1.0, 1.0]), np.array([1.5, 1.9]), eps)
assert np.isclose(plateau[0], plateau[1])

# 9) Adversarial: optimistic min() instead of max() must differ (test has teeth).
optimistic = np.minimum(-adv * ratio, -adv * np.clip(ratio, 1 - eps, 1 + eps))
assert not np.allclose(loss, optimistic)

print("clipped surrogate: all 9 asserts passed")

Because the loss is negated, torch.max selects the smaller policy update, bounding drift from the data-collection policy without an explicit trust-region computation:

  • Positive advantage (good action): clipping stops the policy from over-boosting an action whose probability has already risen past (1+ε)·π_old; beyond that the gradient is zero.
  • Negative advantage (bad action): clipping stops over-suppression once probability has fallen below (1−ε)·π_old.
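The zero-gradient behaviour in the two bullets above can be written out as an explicit derivative of the per-token loss with respect to the ratio. This is a sketch under the same loss convention as the numpy block earlier on the page; `pg_loss_grad` is a hypothetical helper, checked here against a central finite difference.

```python
import numpy as np


def pg_loss(adv, ratio, eps=0.2):
    # Same per-token loss as above: pessimistic max of the two negated branches.
    return np.maximum(-adv * ratio, -adv * np.clip(ratio, 1.0 - eps, 1.0 + eps))


def pg_loss_grad(adv, ratio, eps=0.2):
    # d(loss)/d(ratio): zero where the clipped branch is active, -adv elsewhere.
    clipped_active = ((adv > 0) & (ratio > 1.0 + eps)) | ((adv < 0) & (ratio < 1.0 - eps))
    return np.where(clipped_active, 0.0, -adv)


adv = np.array([1.0, 1.0, -1.0, -1.0])
ratio = np.array([1.5, 0.5, 0.5, 2.0])
grad = pg_loss_grad(adv, ratio)

# Central finite-difference check (all points are away from the clip kinks).
h = 1e-6
fd = (pg_loss(adv, ratio + h) - pg_loss(adv, ratio - h)) / (2 * h)
assert np.allclose(grad, fd, atol=1e-6)
print(grad)
```

The gradient vanishes only in the two cases the bullets describe; in the other two out-of-band cases (positive advantage with a low ratio, negative advantage with a high ratio) the update still pulls the policy back.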

A common practice is 1 to 4 gradient steps per batch before refreshing π_old. With a single step, ρ=1 for the whole batch, so the clipping branch can be dropped and PPO reduces to vanilla policy gradient.

Two reference policies: do not conflate them. π_old (the denominator of ρ_t) is the policy that generated the rollouts. The frozen reference is a separate, fixed model used only for the KL penalty. RLHF PPO traditionally folds that KL into the reward before computing advantages (reward = reward − β·KL per token) whereas GRPO adds it as a separate loss term. Either way the KL keeps the policy from drifting off the reference and degrading.
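A small sketch of the "fold the KL into the reward" convention described above. It assumes the simplest per-token KL estimate (the log-probability difference on the sampled tokens) and assumes the sequence-level reward-model score is added on the final token; the helper name and the numbers are illustrative, not taken from any library.

```python
import numpy as np


def kl_shaped_rewards(score, policy_logprobs, ref_logprobs, beta=0.05):
    # Per-token KL estimate on the sampled tokens: log pi(a_t|s_t) - log pi_ref(a_t|s_t).
    per_token_kl = np.asarray(policy_logprobs) - np.asarray(ref_logprobs)
    rewards = -beta * per_token_kl  # every token pays the KL penalty
    rewards[-1] += score            # the sequence-level score lands on the last token
    return rewards


# Hypothetical three-token completion scored 2.0 by the reward model.
policy_lp = np.log(np.array([0.6, 0.5, 0.9]))
ref_lp = np.log(np.array([0.5, 0.5, 0.8]))
rewards = kl_shaped_rewards(2.0, policy_lp, ref_lp, beta=0.05)

# The total shaped reward is the score minus beta times the summed KL estimate.
assert np.isclose(rewards.sum(), 2.0 - 0.05 * (policy_lp - ref_lp).sum())
print(rewards.round(4))
```

These shaped per-token rewards are what the value model and GAE then consume, so the KL penalty influences the advantages rather than appearing as a separate loss term.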

The value model and GAE

The value model is an additional copy of the model that predicts a per-token value: the expected future return from each token, here typically the return after the per-token KL deduction. It is a learned baseline (the evolution of REINFORCE's Monte-Carlo baseline) and it gives PPO two losses, an MSE value loss that trains the critic against return targets and the policy loss that consumes the critic's advantages. Reference template (torch), the critic's two-loss contribution:

rewards    = rewards - beta * per_token_kl          # fold the per-token KL into the reward (RLHF PPO)
values     = value_net(completions)                 # critic prediction V(s_t), the learned baseline
returns    = discounted_returns(rewards, gamma=1.0) # backward pass; gamma is 1.0 in RLHF
advantages = (returns - values).detach()            # advantage A_t; GAE generalizes this over many steps
vf_loss    = 0.5 * (returns - values) ** 2          # MSE value loss (PPO's second loss)
loss       = pg_loss + vf_coef * vf_loss            # combined policy + value objective

The advantage and return math with numpy only, runnable and self-checking (GAE equals a slow double-sum reference, collapses to one-step TD at λ=0, and to returns - values at λ=1, γ=1, which is exactly the special case coded above):

import numpy as np


def discounted_returns(rewards, gamma=1.0):
    # Backward recursion R_t = r_t + gamma * R_{t+1}; gamma=1.0 in RLHF.
    out = np.zeros(len(rewards), dtype=float)
    running = 0.0
    for t in range(len(rewards) - 1, -1, -1):
        running = rewards[t] + gamma * running
        out[t] = running
    return out


def gae(rewards, values, gamma=1.0, lam=0.95, last_value=0.0):
    # GAE recursive form: delta_t = r_t + gamma*V_{t+1} - V_t ;
    #                     A_t = delta_t + gamma*lam*A_{t+1}.
    T = len(rewards)
    adv = np.zeros(T, dtype=float)
    next_value, next_adv = last_value, 0.0
    for t in range(T - 1, -1, -1):
        delta = rewards[t] + gamma * next_value - values[t]
        next_adv = delta + gamma * lam * next_adv
        adv[t] = next_adv
        next_value = values[t]
    return adv


def gae_reference(rewards, values, gamma, lam, last_value=0.0):
    # Slow explicit double sum over future TD residuals.
    T = len(rewards)
    v_ext = np.append(values, last_value)
    delta = np.array([rewards[t] + gamma * v_ext[t + 1] - v_ext[t] for t in range(T)])
    adv = np.zeros(T)
    for t in range(T):
        adv[t] = sum((gamma * lam) ** (k - t) * delta[k] for k in range(t, T))
    return adv


rng = np.random.default_rng(0)
rewards = rng.normal(size=8)
values = rng.normal(size=8)

# 1) Recursive GAE == slow explicit reference (off-by-one / corruption catch).
a_rec = gae(rewards, values, gamma=0.99, lam=0.95)
a_ref = gae_reference(rewards, values, gamma=0.99, lam=0.95)
assert np.allclose(a_rec, a_ref)

# 2) lambda = 0 collapses GAE to the one-step TD residual.
a0 = gae(rewards, values, gamma=0.99, lam=0.0)
v_ext = np.append(values, 0.0)
td = np.array([rewards[t] + 0.99 * v_ext[t + 1] - values[t] for t in range(len(rewards))])
assert np.allclose(a0, td)

# 3) lambda = 1, gamma = 1 == Monte-Carlo advantage (the page's returns - values).
a1 = gae(rewards, values, gamma=1.0, lam=1.0)
returns = discounted_returns(rewards, gamma=1.0)
assert np.allclose(a1, returns - values)

# 4) KL folded into the reward shifts the return by exactly the discounted KL sum.
beta, kl = 0.1, rng.random(size=8)
shaped = discounted_returns(rewards - beta * kl, gamma=1.0)
assert np.allclose(shaped, returns - discounted_returns(beta * kl, gamma=1.0))

# 5) Adversarial: a critic off-by-one (V_t instead of V_{t+1}) breaks the identity.
def gae_buggy(rewards, values, gamma, lam):
    T = len(rewards)
    adv = np.zeros(T)
    next_adv = 0.0
    for t in range(T - 1, -1, -1):
        delta = rewards[t] + gamma * values[t] - values[t]  # BUG: V_t not V_{t+1}
        next_adv = delta + gamma * lam * next_adv
        adv[t] = next_adv
    return adv

assert not np.allclose(gae_buggy(rewards, values, 0.99, 0.95), a_ref)

print("value + GAE: all 5 asserts passed")

Advantages come from Generalized Advantage Estimation (GAE), the canonical target in modern systems, which computes the value-prediction error over multiple steps and trades bias against variance with λ (≈0.95). The discount γ is set to 1.0 for RLHF: the reward scores the whole response, so discounting earlier tokens would down-weight them with no principled justification. Value clipping (cliprange_value) mirrors the policy clip. Value initialization matters: InstructGPT (re-used in Tülu 3) initializes the value network from the reward model; others append a randomly-initialized value head to the SFT checkpoint or re-initialize fully.
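Value clipping can be sketched in the same numpy style as the policy clip. The form below is the commonly used one (keep the new prediction within `cliprange_value` of the rollout-time prediction, then take the larger of the two squared errors); treat it as an illustration of the idea under that assumption rather than any particular library's exact code.

```python
import numpy as np


def clipped_value_loss(values, old_values, returns, cliprange_value=0.2):
    # Keep the new prediction within cliprange_value of the rollout-time prediction.
    v_clipped = old_values + np.clip(values - old_values, -cliprange_value, cliprange_value)
    loss_unclipped = (values - returns) ** 2
    loss_clipped = (v_clipped - returns) ** 2
    # Pessimistic choice: the larger error, mirroring the max in the policy loss.
    return 0.5 * np.maximum(loss_unclipped, loss_clipped)


old_values = np.array([0.0, 0.0, 0.0])   # critic predictions at rollout time
returns = np.array([1.0, 1.0, 1.0])      # return targets
values = np.array([0.1, 0.5, 1.5])       # current critic predictions
loss = clipped_value_loss(values, old_values, returns)

# The clipped loss is never smaller than the plain squared-error loss.
assert np.all(loss >= 0.5 * (values - returns) ** 2 - 1e-12)
print(loss)
```

In the second and third entries the critic has moved more than `cliprange_value` from its old prediction, so the clipped branch sets the loss and further movement in that step earns no extra credit.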

The four-model memory cost

PPO holds four model copies in GPU memory during RLHF:

| Copy | Role | State |
| --- | --- | --- |
| Policy / actor | generates rollouts, gets updated | trainable (+ optimizer + activations) |
| Value / critic | per-token advantage baseline | trainable (+ optimizer + activations) |
| Reward model | scores completions | frozen (inference only) |
| Reference | KL anchor | frozen (inference only) |

Two of the four are trained, so they carry optimizer state (Adam moments ≈ 2× the parameters) and activations, the dominant cost. The two frozen models run inference-only and can be offloaded or pinned to spare GPUs. This is the price PPO pays for token-level credit assignment. GRPO removes the value/critic, eliminating the hardest-to-tune network and roughly halving trainable-model memory; with verifiable rewards it can also drop the reward model to a reward function, leaving just policy + reference. On a cluster, plan for four model shards plus two optimizer states, and place the frozen pair where their weights are cheapest to hold.
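A back-of-envelope estimate can make the trainable-versus-frozen split tangible. The byte counts below are assumptions chosen for illustration (bf16 weights and gradients, two fp32 Adam moments per parameter); activations, fp32 master weights, and rollout KV cache are left out, so real footprints will be larger. The function name is hypothetical.

```python
def rlhf_memory_gb(n_params, n_trainable=2, n_frozen=2,
                   weight_bytes=2, grad_bytes=2, adam_bytes=8):
    # Assumed per-parameter bytes: bf16 weights (2), bf16 grads (2),
    # two fp32 Adam moments (8). Activations and master weights are excluded.
    trainable = n_trainable * n_params * (weight_bytes + grad_bytes + adam_bytes)
    frozen = n_frozen * n_params * weight_bytes
    return {
        "trainable_gb": trainable / 1e9,
        "frozen_gb": frozen / 1e9,
        "total_gb": (trainable + frozen) / 1e9,
    }


# PPO: policy + critic trained, reward model + reference frozen (8B parameters each).
ppo = rlhf_memory_gb(8e9)
# GRPO with a reward function: policy trained, reference frozen.
grpo = rlhf_memory_gb(8e9, n_trainable=1, n_frozen=1)

print(ppo)
print(grpo)
```

Under these assumptions the trainable pair dominates, and dropping the critic halves the trainable-model figure, which matches the qualitative claim above.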

How to use it

TRL's PPOTrainer wires up exactly the four models. The value model is loaded from the reward-model checkpoint (the InstructGPT practice above); both are sequence-classification heads with num_labels=1. Reference template (TRL), pin your version:

# train_ppo.py. Recent TRL releases (e.g. v0.21) export PPO at the top level.
# The latest TRL/main moved it: `from trl.experimental.ppo import PPOConfig, PPOTrainer`. Pin your version.
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoModelForSequenceClassification, AutoTokenizer
from trl import PPOConfig, PPOTrainer

sft_path, rm_path = "EleutherAI/pythia-1b-deduped", "EleutherAI/pythia-1b-deduped"
tokenizer = AutoTokenizer.from_pretrained(sft_path, padding_side="left")
tokenizer.add_special_tokens({"pad_token": "[PAD]"})

# The four model copies PPO holds in memory:
policy       = AutoModelForCausalLM.from_pretrained(sft_path)                                  # actor (trained)
ref_policy   = AutoModelForCausalLM.from_pretrained(sft_path)                                  # reference (frozen)
reward_model = AutoModelForSequenceClassification.from_pretrained(rm_path, num_labels=1)       # frozen scorer
value_model  = AutoModelForSequenceClassification.from_pretrained(rm_path, num_labels=1)       # critic (trained)

ds = load_dataset("trl-internal-testing/descriptiveness-sentiment-trl-style", split="descriptiveness")
ds = ds.map(lambda e: {"input_ids": tokenizer(e["prompt"]).input_ids},  # PPOTrainer expects tokenized prompts
            remove_columns=ds.column_names)

config = PPOConfig(
    output_dir="ppo-out",
    per_device_train_batch_size=1, gradient_accumulation_steps=16, total_episodes=10_000,
    num_ppo_epochs=4,                    # 1-4 gradient steps per rollout batch
    kl_coef=0.05,                        # beta on the per-token KL toward the reference
    cliprange=0.2,                       # epsilon: clip the policy ratio to [1-eps, 1+eps]
    vf_coef=0.1, cliprange_value=0.2,    # value-loss weight + value clip
    gamma=1.0, lam=0.95,                 # no discounting in RLHF; GAE lambda
    whiten_rewards=False,                # optional reward/advantage whitening for stability
    missing_eos_penalty=1.0,             # penalize completions that never emit EOS
)

trainer = PPOTrainer(
    args=config, processing_class=tokenizer,
    model=policy, ref_model=ref_policy, reward_model=reward_model, value_model=value_model,
    train_dataset=ds,
)
trainer.train()

Launch command (shell), run separately from the Python script above:

accelerate launch --config_file examples/accelerate_configs/deepspeed_zero3.yaml train_ppo.py

The PPOTrainer constructor signature is version-sensitive. Confirm model / ref_model / reward_model / value_model / processing_class on the installed release.
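The effect of `missing_eos_penalty` in the config above can be illustrated with a few lines of numpy: completions that never emit EOS have the penalty subtracted from their reward-model score. This is a sketch of the idea with a hypothetical helper and made-up token ids, not TRL's internal code.

```python
import numpy as np


def apply_missing_eos_penalty(scores, responses, eos_token_id, penalty=1.0):
    # responses: (batch, seq_len) token ids; scores: (batch,) reward-model scores.
    has_eos = np.any(np.asarray(responses) == eos_token_id, axis=-1)
    out = np.asarray(scores, dtype=float).copy()
    out[~has_eos] -= penalty  # only truncated completions are penalized
    return out


eos_token_id = 2  # illustrative id
responses = np.array([[5, 6, 2, 0],    # ends with EOS (then padding)
                      [5, 6, 7, 8]])   # hit the length cap, no EOS
scores = np.array([0.7, 0.9])

penalized = apply_missing_eos_penalty(scores, responses, eos_token_id, penalty=1.0)
print(penalized)
```

The truncated completion drops below the finished one even though its raw score was higher, which is the pressure that teaches the policy to finish within the length cap.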

How to integrate with it

PPO is not a single trainer; it composes with the rest of the post-training stack. Rollouts come from an inference engine (vLLM or SGLang, see inference serving); the frozen scorer is a trained reward model; the frozen reference is usually the SFT checkpoint. In the pipeline it sits after SFT/LoRA cold-start and reward-model training, one stage of fine-tuning and post-training. For frontier-scale runs, verl drives the same actor-critic PPO with algorithm.adv_estimator=gae and a dedicated critic (the value model):

# verl PPO (GAE + critic). Pin the verl release; verify keys on the repo.
python3 -m verl.trainer.main_ppo \
  algorithm.adv_estimator=gae \
  actor_rollout_ref.model.path=Qwen/Qwen3-8B \
  actor_rollout_ref.rollout.name=vllm \
  critic.model.path=Qwen/Qwen3-8B \
  critic.optim.lr=1e-5

How to run it in production

Train in BF16 for stability. Fold the per-token KL into the reward and, optionally, whiten rewards or advantages (whiten_rewards) to steady the gradient scale. Score only on the EOS token and set missing_eos_penalty so truncated completions do not push the reward model out of distribution. Watch these signals every step:

  • KL to the reference: hold it in a target band with kl_coef; a runaway KL means the policy is drifting off the reference and will degrade.
  • Mean reward / reward-model score: it should climb; a sudden spike alongside degrading samples is reward hacking, not progress.
  • Value loss: the critic MSE should stay bounded; a diverging value loss corrupts every advantage it feeds.
  • Clip fraction and ratio: the share of tokens hitting the cliprange bound; a very high clip fraction means updates are too aggressive for the current ε.
  • Response length: watch for length inflation that games the reward rather than the task.
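Several of the signals listed above fall straight out of the per-token log-probabilities already held in memory. The sketch below assumes arrays of shape (batch, seq_len) and a 0/1 mask over response tokens; the function name and metric keys are illustrative, and the KL figure is the simple sampled log-ratio estimate taken at rollout time.

```python
import numpy as np


def ppo_step_metrics(new_lp, old_lp, ref_lp, mask, eps=0.2):
    new_lp, old_lp, ref_lp = (np.asarray(x, dtype=float) for x in (new_lp, old_lp, ref_lp))
    mask = np.asarray(mask, dtype=float)      # 1 for response tokens, 0 for padding
    n_tokens = mask.sum()
    ratio = np.exp(new_lp - old_lp)
    outside = (ratio < 1.0 - eps) | (ratio > 1.0 + eps)
    return {
        "clip_fraction": float((outside * mask).sum() / n_tokens),
        "mean_ratio": float((ratio * mask).sum() / n_tokens),
        # Sampled estimate of KL(rollout policy || reference) per token.
        "kl_to_ref": float(((old_lp - ref_lp) * mask).sum() / n_tokens),
        "response_length": float(mask.sum(axis=-1).mean()),
    }


# Hypothetical probabilities for two completions; the last token of row 2 is padding.
new_lp = np.log(np.array([[0.5, 0.5, 0.5], [0.9, 0.2, 0.5]]))
old_lp = np.log(np.array([[0.5, 0.4, 0.5], [0.6, 0.4, 0.5]]))
ref_lp = np.log(np.array([[0.5, 0.5, 0.5], [0.5, 0.5, 0.5]]))
mask = np.array([[1, 1, 1], [1, 1, 0]])

metrics = ppo_step_metrics(new_lp, old_lp, ref_lp, mask)
print(metrics)
```

Masking matters: averaging over padding positions would dilute the clip fraction and the KL estimate and make both look healthier than they are.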

How to maintain it

  • Pin the library. PPO's location and the PPOTrainer signature move across releases: recent TRL exports PPO at the top level, TRL/main relocated it to trl.experimental.ppo. Re-verify the four-model kwargs and processing_class on every upgrade, and pin verl the same way.
  • Initialize the critic well. Start the value network from the reward model (see the value model and GAE); a badly-initialized critic is the top failure mode, and that difficulty is the main reason to consider GRPO.
  • Checkpoint the trainable pair. Long RLHF runs must checkpoint policy and value weights plus optimizer state so a run can resume; see checkpoint recovery.
  • Retune the guardrails. As the policy moves, revisit kl_coef and cliprange; if the critic never earns its memory, migrate to GRPO.

How to scale it

PPO is two workloads plus a critic: a trainer that now holds two trainable networks (policy and value, so heavier sharding than GRPO) and a rollout generator backed by vLLM/SGLang (inference serving). Split GPUs between them. TRL scales the trainer with DeepSpeed ZeRO-3 + Accelerate (DeepSpeed and ZeRO) and dedicates separate GPUs to rollout; verl runs colocated on Ray with FSDP/Megatron actor + critic and vLLM rollouts. The frozen reward and reference models are inference-only, so offload them between phases or pin them to spare GPUs. The recurring cost is the trainer-to-rollout weight sync every step; keep the policy on a fast fabric (NVLink intra-node, IB/RoCE with GDR inter-node), the same concerns as performance tuning.

Failure modes

  • The value model is hard to learn: no established best practices exist for fitting a value head to an LM backbone; a poorly-trained or badly-initialized critic destabilizes the whole run. Initialize from the reward model. This difficulty is the main reason GRPO drops the critic.
  • Truncation drives the reward model out-of-distribution: a hard generation-length cap pushes truncated completions to unpredictable reward scores. Score only on the EOS token and penalize over-length generations (missing_eos_penalty).
  • KL too low or too high: too low and the policy drifts off the reference and degrades; too high and it stays glued to the reference and never learns. Tune kl_coef.
  • ε too large: a wide clip range loses the trust region and lets destructively large updates through when ρ diverges far from 1.
  • Train/inference mismatch: the vLLM sampler and the FSDP learner produce slightly different token distributions even at identical weights; uncorrected, this destabilizes training. Truncated importance sampling corrects it and matters more for long reasoning traces.
  • Wrong discount: leave γ=1.0 for RLHF; discounting earlier tokens has no principled justification when the reward scores the whole response.
  • Reward hacking: the policy exploits a loophole in the reward, and a bad reward trains a bad model fast. Test rewards adversarially first; see reward design.
  • OOM from four copies: the extra trainable critic (weights + optimizer state) is real memory pressure. Offload the frozen pair, or move to GRPO if the value network is not earning its keep.
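For the train/inference mismatch item in the list above, truncated importance sampling can be sketched as a per-token weight: the ratio of the learner's probability to the sampler's for each sampled token, capped from above. The cap of 2.0, the helper name, and the probabilities are illustrative assumptions.

```python
import numpy as np


def tis_weights(learner_logprobs, sampler_logprobs, cap=2.0):
    # Both log-probs are for the same sampled tokens at the same weights:
    # one from the training framework, one from the inference engine.
    ratio = np.exp(np.asarray(learner_logprobs) - np.asarray(sampler_logprobs))
    # Truncate from above so a few mismatched tokens cannot dominate the gradient.
    return np.minimum(ratio, cap)


learner_lp = np.log(np.array([0.50, 0.30, 0.90]))
sampler_lp = np.log(np.array([0.50, 0.10, 0.92]))
weights = tis_weights(learner_lp, sampler_lp, cap=2.0)

# The weights multiply the per-token policy loss before it is averaged.
per_token_loss = np.array([-1.0, -1.0, -1.0])   # placeholder values
weighted_loss = (weights * per_token_loss).mean()
print(weights.round(4), weighted_loss)
```

Truncating rather than clipping on both sides keeps the correction nearly unbiased for tokens where the two engines agree, while bounding the variance contributed by the rare large ratios.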

References

  • PPO paper (Schulman et al. 2017): https://arxiv.org/abs/1707.06347
  • GAE (Schulman et al. 2015): https://arxiv.org/abs/1506.02438
  • InstructGPT (Ouyang et al. 2022): https://arxiv.org/abs/2203.02155
  • RLHF Book (Lambert): https://rlhfbook.com
  • TRL PPO Trainer docs: https://huggingface.co/docs/trl/ppo_trainer

Related: GRPO · DPO · reward model training · reward design · async RL systems · fine-tuning and post-training · TRL · verl · Glossary