DPO (direct preference optimization)

Scope: offline preference alignment, that is, training a policy directly on (chosen, rejected) pairs against a frozen reference, with no reward model and no rollouts. DPO is the cheap, stable preference stage of the post-training pipeline (see fine-tuning and post-training); contrast with online RL (GRPO).

Reference templates use real APIs (TRL, PEFT, datasets) and are labelled as such; pin versions and validate before production use. Every core-math claim below is checked by an adjacent runnable numpy-only block.

What it is

DPO reframes RLHF as a single classification loss. Instead of fitting a reward model and then optimising against it with PPO, DPO uses a closed-form relationship between the optimal policy and the reward to train directly on preference pairs. Each prompt comes with a preferred completion y+ and a dispreferred completion y-; the loss widens the log-probability margin between them relative to a frozen reference model (the implicit reward is beta * log(pi_theta / pi_ref)). There is no reward model, no sampling, and no rollouts. Because the data is fixed, a DPO run costs roughly the same as SFT (see SFT and LoRA). The method was introduced in the DPO paper (References).
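Written out, the implicit reward and the loss described in the paragraph above take the following form. The notation is an assumption chosen to match the DPO paper: $x$ is the prompt, $\sigma$ is the logistic function, and the expectation runs over the preference dataset $\mathcal{D}$.

```latex
\hat{r}_\theta(x, y) = \beta \log \frac{\pi_\theta(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)}
\qquad
\mathcal{L}_{\mathrm{DPO}}(\pi_\theta; \pi_{\mathrm{ref}})
  = -\,\mathbb{E}_{(x,\, y^{+},\, y^{-}) \sim \mathcal{D}}
    \left[ \log \sigma\!\left( \hat{r}_\theta(x, y^{+}) - \hat{r}_\theta(x, y^{-}) \right) \right]
```

When the policy equals the reference, both implicit rewards are zero and every pair costs $\log 2$.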

Why use it

Simple and stable: one supervised-style loss; no reward-model training, no PPO loop, no rollout infrastructure.
Cheap: offline and generation-free, so close to SFT cost and time; no inference engine alongside the trainer.
Effective: in the DPO paper's experiments it matches or beats PPO-based RLHF on sentiment control, summarisation, and single-turn dialogue, with far less machinery.
Easy to operate: same accelerate/FSDP/DeepSpeed plumbing as SFT; fits existing training pipelines.

When to use it (and when not)

Use DPO when you have (or can collect) preference pairs and want to align tone, helpfulness, or style. It is the second stage in fine-tuning and post-training, after SFT and before any online RL.
Prefer GRPO when a reward can be computed (correctness, tests) and you want to push reasoning/agentic capability; preferences are too coarse for that.
Prefer SFT/LoRA when you only have demonstrations, not comparisons.
Caveat: DPO needs reasonably clean pairs; noisy or contradictory preferences degrade it.

Architecture

Policy and reference both start from the SFT checkpoint; the frozen reference anchors the policy while the preference pairs pull the log-probability margin apart. The gradient updates only the policy, and the output is a single aligned model (or a LoRA adapter) handed to serving.

flowchart LR
  SFT["SFT checkpoint"] --> POL["Policy pi_theta"]
  SFT --> REF["Reference pi_ref (frozen)"]
  D["Preference pairs (chosen, rejected)"] --> POL
  D --> REF
  POL --> LOSS["DPO loss: -log sigmoid(beta * (logratio_chosen - logratio_rejected))"]
  REF --> LOSS
  LOSS -->|"gradient"| POL
  POL --> OUT["Aligned policy / LoRA adapter -> serving"]

How to use it

TRL's DPOTrainer takes a model and a preference dataset with columns prompt, chosen, rejected (standard or conversational format). If no ref_model is passed, the trainer uses the initial policy as the reference automatically.

# reference template (requires TRL >= 1.6); DPOConfig defaults bf16=True and
# gradient_checkpointing=True. The loss it optimises is validated by the numpy block below.
from datasets import load_dataset
from trl import DPOConfig, DPOTrainer

trainer = DPOTrainer(
    model="Qwen/Qwen3-8B",
    args=DPOConfig(beta=0.1, learning_rate=1e-6),    # lr default is 1e-6 in DPOConfig
    train_dataset=load_dataset("trl-lib/ultrafeedback_binarized", split="train"),
)
trainer.train()

Each row is one prompt with a preferred and a dispreferred completion:

# preference row schema (pure python; no `datasets` needed). One prompt, a preferred
# completion, a dispreferred one, and they must differ.
row = {"prompt": "What color is the sky?", "chosen": "It is blue.", "rejected": "It is green."}
assert {"prompt", "chosen", "rejected"} <= set(row), "DPO row needs prompt/chosen/rejected"
assert row["chosen"] != row["rejected"], "a preference pair must actually differ"
print("E1 pass: preference row schema")

The loss DPOTrainer minimises has a closed form. This numpy-only reference implements it exactly and checks the properties the trainer relies on: the frozen-reference edge case, an adversarial mislabeled pair, the rewards/accuracies metric, the gradient sign (finite-difference), and beta scaling.

import numpy as np

def log_sigmoid(x):
    # numerically stable log(sigmoid(x)) = -softplus(-x)
    return -np.logaddexp(0.0, -x)

def dpo_loss(logp_pol_chosen, logp_ref_chosen, logp_pol_rejected, logp_ref_rejected, beta=0.1):
    """Exact DPO objective (Rafailov et al. 2023, eq. 7)."""
    logratio_chosen = logp_pol_chosen - logp_ref_chosen
    logratio_rejected = logp_pol_rejected - logp_ref_rejected
    logits = beta * (logratio_chosen - logratio_rejected)   # implicit-reward margin: reward_chosen - reward_rejected
    loss = -log_sigmoid(logits)                             # per-example, >= 0
    reward_chosen = beta * logratio_chosen
    reward_rejected = beta * logratio_rejected
    return loss, reward_chosen, reward_rejected

rng = np.random.default_rng(0)
n = 256
ref_c = rng.normal(-10, 2, n)   # frozen reference log-probs, held fixed
ref_r = rng.normal(-10, 2, n)

# 1) EQUIVALENCE: the stable log-sigmoid matches the naive formula on safe inputs.
x = np.linspace(-20, 20, 1001)
assert np.allclose(log_sigmoid(x), np.log(1.0 / (1.0 + np.exp(-x))), atol=1e-9)

# 2) EDGE CASE: policy == reference gives zero margin, so every example costs log(2).
loss0, rc0, rr0 = dpo_loss(ref_c, ref_c, ref_r, ref_r, beta=0.1)
assert np.allclose(loss0, np.log(2.0)), "at pi==pi_ref every example must cost log(2)"
assert np.allclose(rc0 - rr0, 0.0), "reward margin must be 0 when policy==reference"

# 3) ADVERSARIAL / MONOTONE: a correct preference costs < log(2); a WRONG one (policy
#    prefers the rejected answer, e.g. mislabeled data) costs > log(2).
good_c, good_r = ref_c + 1.0, ref_r - 1.0
bad_c,  bad_r  = ref_c - 1.0, ref_r + 1.0
loss_good, gc, gr = dpo_loss(good_c, ref_c, good_r, ref_r, beta=0.1)
loss_bad,  bc, br = dpo_loss(bad_c,  ref_c, bad_r,  ref_r, beta=0.1)
assert (loss_good < np.log(2.0)).all() and (loss_bad > np.log(2.0)).all()
assert (loss_good < loss_bad).all()

# 4) rewards/accuracies: fraction with reward_chosen > reward_rejected (TRL wants > 0.5).
assert np.mean(gc > gr) == 1.0 and np.mean(bc > br) == 0.0

# 5) GRADIENT vs finite difference: d loss / d(chosen logp) = -beta*(1 - sigmoid(logits)).
beta = 0.1
lc, lr = good_c - ref_c, good_r - ref_r
logits = beta * (lc - lr)
analytic = -beta * (1.0 - 1.0 / (1.0 + np.exp(-logits)))
eps = 1e-6
numeric = (-log_sigmoid(beta * ((good_c + eps - ref_c) - lr)) - (-log_sigmoid(logits))) / eps
assert np.allclose(analytic, numeric, atol=1e-4), "gradient disagrees with finite diff"

# 6) BETA SCALING: for a fixed positive margin, larger beta drives lower loss.
margin = (good_c - ref_c) - (good_r - ref_r)   # all == 2.0
losses = [float(np.mean(-log_sigmoid(b * margin))) for b in (0.05, 0.1, 0.3)]
assert losses[0] > losses[1] > losses[2]

print("A pass: equivalence, pi==ref edge, adversarial, accuracy, gradient, beta-scaling")
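The reference block above takes sequence log-probs as given. A minimal numpy sketch of where they come from, assuming the usual construction: a log-softmax over the vocabulary, a gather at the realised token ids, and a sum over completion tokens only. The function names, the toy shapes, and the assumption that logits are already aligned with the tokens they predict are illustrative, not TRL's internals.

```python
import numpy as np

def log_softmax(logits):
    # numerically stable log-softmax over the vocabulary axis
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))

def sequence_logp(logits, token_ids, completion_mask):
    """Sum of per-token log-probs over completion tokens only.

    logits:          (T, V) scores; row t is assumed to be the prediction for token_ids[t]
    token_ids:       (T,) realised token ids
    completion_mask: (T,) 1 for completion tokens, 0 for prompt or padding tokens
    """
    logp = log_softmax(logits)
    per_token = logp[np.arange(len(token_ids)), token_ids]
    return float((per_token * completion_mask).sum())

rng = np.random.default_rng(3)
T, V = 6, 11
logits = rng.normal(size=(T, V))
token_ids = rng.integers(0, V, size=T)
mask = np.array([0, 0, 1, 1, 1, 0])   # 2 prompt tokens, 3 completion tokens, 1 pad

full = sequence_logp(logits, token_ids, mask)

# Perturbing prompt or padding positions must not change the completion log-prob.
perturbed = logits.copy()
perturbed[[0, 1, 5]] += rng.normal(size=(3, V))
assert np.isclose(full, sequence_logp(perturbed, token_ids, mask))

# A log-prob is never positive, and each row of exp(log_softmax) sums to 1.
assert full <= 0.0
assert np.allclose(np.exp(log_softmax(logits)).sum(axis=-1), 1.0)
print("pass: masked sequence log-prob")
```

Masking matters because the prompt is shared by the chosen and rejected completions; only the completion tokens carry the preference signal.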

How to integrate it

Where it sits in the pipeline

DPO is a fine-tuning stage, the preference-optimisation step of the post-training stack (SFT → DPO → GRPO). It assumes an SFT'd starting model (SFT and LoRA) and a reference (typically that same SFT checkpoint). It is the standard cheap alignment step before, or instead of, online RL (GRPO). Operationally it reuses the same accelerate/FSDP/DeepSpeed plumbing as SFT, so it drops into an existing training pipeline.

Preference data format

The data is (prompt, chosen, rejected); conversational rows get the chat template applied automatically. Watch the logged rewards/margins and rewards/accuracies; accuracy should climb above 0.5. To build a set from an existing comparison dataset, map it into the three columns:

# reference template (requires `datasets`): build a preference set from a code dataset.
from datasets import load_dataset
ds = load_dataset("Vezora/Code-Preference-Pairs", split="train")   # one split: DPOTrainer takes a Dataset, not a DatasetDict
def to_pref(ex):
    return {"prompt": [{"role": "user", "content": ex["input"]}],
            "chosen": [{"role": "assistant", "content": ex["accepted"]}],
            "rejected": [{"role": "assistant", "content": ex["rejected"]}]}
ds = ds.map(to_pref, remove_columns=["instruction", "input", "accepted", "ID"])

The reshape is pure data plumbing; validate it without the library:

# The (chosen, rejected) reshape into conversational format, validated without `datasets`.
def to_pref(ex):
    return {"prompt":   [{"role": "user", "content": ex["input"]}],
            "chosen":   [{"role": "assistant", "content": ex["accepted"]}],
            "rejected": [{"role": "assistant", "content": ex["rejected"]}]}

out = to_pref({"input": "sort xs", "accepted": "sorted(xs)", "rejected": "xs.sort() # returns None"})
assert set(out) == {"prompt", "chosen", "rejected"}, "reshape must emit exactly 3 columns"
assert out["prompt"][0]["role"] == "user"
assert out["chosen"][0]["role"] == out["rejected"][0]["role"] == "assistant"
assert out["chosen"][0]["content"] != out["rejected"][0]["content"], "the pair must differ"
print("E2 pass: (chosen, rejected) reshape schema")

Loss variants

loss_type selects the objective: "sigmoid" (default, Bradley-Terry), "ipo" (identity transform: a squared loss that pulls the margin toward a finite target rather than pushing it without bound, which resists overfitting), or "sigmoid_norm", TRL's length-normalised SimPO variant (References). SimPO scores each sequence by its average token log-prob rather than the sum, which removes a length bias:

# reference template (requires TRL): SimPO via the length-normalised loss.
from trl import DPOConfig, DPOTrainer
trainer = DPOTrainer("Qwen/Qwen3-8B",
    args=DPOConfig(loss_type="sigmoid_norm", beta=2.0),   # SimPO: normalise by length
    train_dataset=ds)
trainer.train()

The core of sigmoid_norm is that length normalisation, validated here:

import numpy as np

# SimPO / TRL "sigmoid_norm" uses the *average* token log-prob (length-normalised) as the
# implicit reward instead of the sum. Core claim: normalising removes the length bias that
# otherwise lets a shorter answer win on raw summed log-prob alone.
def rewards(per_token_logp):
    summed = np.array([lp.sum()  for lp in per_token_logp])   # DPO-style total logp
    normed = np.array([lp.mean() for lp in per_token_logp])   # SimPO length-normalised
    return summed, normed

rng = np.random.default_rng(1)
chosen   = rng.normal(-0.20, 0.01, 40)   # longer, but HIGHER per-token prob (better)
rejected = rng.normal(-0.30, 0.01, 8)    # shorter, but LOWER per-token prob (worse)
summed, normed = rewards([chosen, rejected])
assert summed[0] < summed[1], "length bias: the better-but-longer answer loses on summed logp"
assert normed[0] > normed[1], "length normalisation restores the quality ranking"

# EDGE CASE: identical per-token quality, different lengths -> equal normalised reward,
# while summed logp still spuriously separates them.
a, b = np.full(50, -0.25), np.full(10, -0.25)
sa, na = rewards([a, b])
assert not np.isclose(sa[0], sa[1]) and np.isclose(na[0], na[1]), "norm cancels pure length"
print("C pass: length-normalised reward ranking + length invariance")
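The "ipo" variant listed above can be contrasted with the default loss in the same numpy-only style. This is a sketch under an assumption: it uses the squared-loss form from the IPO paper, with beta in the role of that paper's regularisation parameter, and does not check how any particular TRL version aggregates token log-probs before applying it.

```python
import numpy as np

def sigmoid_loss(h, beta):
    # default DPO loss: -log sigmoid(beta * h), h = logratio_chosen - logratio_rejected
    return np.logaddexp(0.0, -beta * h)

def ipo_loss(h, beta):
    # IPO: squared distance of the log-ratio margin from the target 1 / (2 * beta)
    return (h - 1.0 / (2.0 * beta)) ** 2

beta = 0.1
h = np.linspace(0.0, 50.0, 501)       # increasingly large margins
sig = sigmoid_loss(h, beta)
ipo = ipo_loss(h, beta)

# The sigmoid loss is strictly decreasing in the margin: more separation always pays.
assert np.all(np.diff(sig) < 0)

# The IPO loss is minimised at a finite margin and rises again beyond it.
target = 1.0 / (2.0 * beta)           # 5.0 for beta = 0.1
assert np.isclose(h[np.argmin(ipo)], target)
assert ipo_loss(target + 10.0, beta) > ipo_loss(target, beta)
print("pass: sigmoid loss is unbounded in the margin, IPO has a finite target")
```

The finite target is what keeps the margin from being pushed indefinitely on pairs the policy already ranks correctly.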

Tuning beta and PEFT

The main knobs are beta, the data, and the loss variant. beta controls deviation from the reference (it scales the implicit reward). Too high means little learning; too low means reward hacking and drift. Start at 0.1 and sweep:

# reference template (requires TRL): sweep beta and compare on a held-out eval.
# `ds` is the preference dataset built under "Preference data format".
from trl import DPOConfig, DPOTrainer

for beta in (0.05, 0.1, 0.3):
    DPOTrainer("Qwen/Qwen3-8B",
        args=DPOConfig(beta=beta, output_dir=f"dpo-b{beta}"),
        train_dataset=ds).train()        # compare rewards/accuracies + held-out eval
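Why beta behaves this way can be seen in the closed-form optimum of the KL-regularised objective that DPO is derived from, where the optimal policy is proportional to pi_ref * exp(reward / beta). The sketch below is a toy illustration of that: the four-completion distribution and the reward values are made up, and it says nothing about optimisation dynamics on real data.

```python
import numpy as np

def optimal_policy(pi_ref, reward, beta):
    # closed-form optimum: pi* is proportional to pi_ref * exp(reward / beta)
    logits = np.log(pi_ref) + reward / beta
    logits = logits - logits.max()          # stabilise the exponentials
    p = np.exp(logits)
    return p / p.sum()

def kl(p, q):
    return float(np.sum(p * (np.log(p) - np.log(q))))

pi_ref = np.array([0.4, 0.3, 0.2, 0.1])     # toy reference over four completions
reward = np.array([-0.1, 0.0, 0.1, 0.2])    # toy reward, highest where pi_ref is lowest

betas = (0.05, 0.1, 0.3)
kls = [kl(optimal_policy(pi_ref, reward, b), pi_ref) for b in betas]

# Lower beta: the optimum drifts further from the reference.
assert kls[0] > kls[1] > kls[2] > 0.0

# Very high beta: the optimum barely moves, so almost nothing is learned.
assert np.isclose(kl(optimal_policy(pi_ref, reward, 1e6), pi_ref), 0.0, atol=1e-9)

# The implicit reward beta * log(pi*/pi_ref) recovers the reward up to a constant.
pi_star = optimal_policy(pi_ref, reward, 0.1)
implicit = 0.1 * (np.log(pi_star) - np.log(pi_ref))
assert np.allclose(implicit - implicit[0], reward - reward[0])
print("pass: KL to reference shrinks as beta grows; implicit reward matches")
```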

For PEFT, pass peft_config=LoraConfig() to train LoRA adapters instead of the full model; use a higher LR (~1e-5) for adapters:

# reference template (requires TRL + PEFT): adapter-only DPO.
from peft import LoraConfig
from trl import DPOConfig, DPOTrainer

trainer = DPOTrainer("Qwen/Qwen3-8B",
    args=DPOConfig(beta=0.1, loss_type="ipo", learning_rate=1e-5),
    train_dataset=ds, peft_config=LoraConfig())     # adapter-only DPO

How to run it in production

DPO is a training method and serves nothing itself. The output is a single aligned policy (or a LoRA adapter), deployed like any model on the serving stack. Unlike GRPO, DPO does not even use an inference engine during training, since there are no rollouts.

Because a DPO run can "succeed" on its training loss while regressing real quality, gate every run on a held-out eval before promotion: rising rewards/accuracies (above 0.5) and separating rewards/margins are necessary but not sufficient. An ungated run that regressed real quality is the classic way a bad policy reaches production, so wire the eval gate and rollout into the same release path as any other model change (SRE and MLOps practices).
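A minimal sketch of such a gate, assuming metrics arrive as plain dicts. The function name, the held-out metric names, and the max_regression tolerance are hypothetical; only rewards/accuracies and rewards/margins are names taken from the trainer's logs.

```python
def should_promote(train, heldout_candidate, heldout_baseline, max_regression=0.0):
    """Hypothetical promotion gate: training signals are necessary, held-out eval decides.

    train:             logged training metrics, e.g. {"rewards/accuracies": 0.71, ...}
    heldout_candidate: higher-is-better eval scores for the new policy
    heldout_baseline:  the same scores for the policy currently in production
    """
    training_ok = train["rewards/accuracies"] > 0.5 and train["rewards/margins"] > 0.0
    heldout_ok = all(
        heldout_candidate.get(name, float("-inf")) >= score - max_regression
        for name, score in heldout_baseline.items()
    )
    return training_ok and heldout_ok

train = {"rewards/accuracies": 0.74, "rewards/margins": 0.9}
baseline = {"helpfulness": 0.62, "safety": 0.97}

# Training looks healthy, but the candidate regressed on a held-out metric: blocked.
assert not should_promote(train, {"helpfulness": 0.66, "safety": 0.91}, baseline)
# Healthy training and no held-out regression: promoted.
assert should_promote(train, {"helpfulness": 0.66, "safety": 0.97}, baseline)
# Held-out eval is fine but the margin never separated: blocked.
assert not should_promote({"rewards/accuracies": 0.5, "rewards/margins": 0.0},
                          {"helpfulness": 0.66, "safety": 0.97}, baseline)
print("pass: promotion gate")
```

A metric missing from the candidate's scores counts as a regression, so a partial eval cannot pass by omission.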

How to maintain it

Preferences shift as the product and the base model move, so DPO is re-run rather than run once:

Refresh the reference to the current SFT checkpoint whenever the base changes; the reference is typically that same SFT checkpoint, so a stale reference optimises against an old model.
Keep the data clean. Noisy or contradictory pairs degrade DPO (the margin never separates); clean and de-duplicate before each re-run.
Re-sweep beta when the data distribution shifts: the stable value is data-dependent, and a beta that was fine can start to reward-hack or stall on new pairs.
Watch for drift on a held-out eval across runs, not just the training rewards/*; a low beta lets the policy diverge from the reference and degrade on everything outside the pairs.
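The "keep the data clean" step can be sketched in pure Python. This assumes rows in the standard string format (conversational rows would need a hashable key first). Dropping both sides of a contradictory pair is one possible policy, not the only one, and clean_pairs is a hypothetical helper name.

```python
def clean_pairs(rows):
    """Drop degenerate, contradictory, and duplicate preference rows."""
    keys = [(r["prompt"], r["chosen"], r["rejected"]) for r in rows]
    present = set(keys)
    kept, emitted = [], set()
    for row, (prompt, chosen, rejected) in zip(rows, keys):
        if chosen == rejected:
            continue                                  # no preference expressed
        if (prompt, rejected, chosen) in present:
            continue                                  # labelled both ways: drop both rows
        if (prompt, chosen, rejected) in emitted:
            continue                                  # exact duplicate
        emitted.add((prompt, chosen, rejected))
        kept.append(row)
    return kept

rows = [
    {"prompt": "q1", "chosen": "a", "rejected": "b"},
    {"prompt": "q1", "chosen": "a", "rejected": "b"},   # duplicate
    {"prompt": "q2", "chosen": "c", "rejected": "d"},
    {"prompt": "q2", "chosen": "d", "rejected": "c"},   # contradicts the previous row
    {"prompt": "q3", "chosen": "e", "rejected": "e"},   # degenerate
    {"prompt": "q4", "chosen": "f", "rejected": "g"},
]
cleaned = clean_pairs(rows)
assert [r["prompt"] for r in cleaned] == ["q1", "q4"]
print("pass: de-duplicated, contradiction-free preference set")
```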

How to scale it

DPO scales exactly like SFT. It is offline and generation-free, so there is no rollout pool to provision. Shard the policy (and the reference) with FSDP (see FSDP) or DeepSpeed ZeRO (see DeepSpeed and ZeRO) across GPUs and nodes via accelerate/torchrun:

# multi-node DPO via Accelerate + FSDP (same plumbing as SFT)
accelerate launch --config_file fsdp.yaml --num_processes 16 dpo.py \
  --model Qwen/Qwen3-32B

For multi-instance/multi-node, the constraint is holding two model copies (policy + reference) in memory. precompute_ref_log_probs=True computes reference log-probs once up front so the reference need not stay resident; with PEFT the adapter is disabled to recover the reference, avoiding a second full copy entirely. Both tricks leave the loss unchanged because the reference is frozen:

import numpy as np

def log_sigmoid(x): return -np.logaddexp(0.0, -x)
def dpo_loss(pc, rc, pr, rr, beta):
    return -log_sigmoid(beta * ((pc - rc) - (pr - rr)))

rng = np.random.default_rng(2)
n = 128
pol_c = rng.normal(-9, 2, n);  pol_r = rng.normal(-9, 2, n)
ref_c = rng.normal(-10, 2, n); ref_r = rng.normal(-10, 2, n)
beta = 0.1

# (a) precompute_ref_log_probs: the reference is frozen, so its log-probs are constant wrt
#     the policy. Caching them once must give the IDENTICAL loss as recomputing every step.
loss_live   = dpo_loss(pol_c, ref_c,        pol_r, ref_r,        beta)
loss_cached = dpo_loss(pol_c, ref_c.copy(), pol_r, ref_r.copy(), beta)
assert np.array_equal(loss_live, loss_cached), "precompute must not change the loss"

# (b) PEFT reference == policy with the LoRA adapter DISABLED. Model the adapter as an
#     additive delta on log-probs; delta -> 0 must recover pi_ref exactly, so at init the
#     reference reward is 0 and the loss is log(2) per example (no second full copy needed).
delta = rng.normal(0, 0.5, n)
assert np.allclose((ref_c + delta) - delta, ref_c), "adapter-disable must recover pi_ref"
assert np.allclose(dpo_loss(ref_c, ref_c, ref_r, ref_r, beta), np.log(2.0))

# (c) ADVERSARIAL: a stale / corrupted reference cache silently changes the loss.
loss_stale = dpo_loss(pol_c, ref_c + 0.7, pol_r, ref_r, beta)
assert not np.allclose(loss_stale, loss_live), "corrupted ref cache must change the loss"
print("D pass: precompute==live, adapter-disable==pi_ref, stale-cache detected")
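Check (b) above models the adapter as an additive delta on log-probs. A slightly more concrete sketch, assuming the standard LoRA parameterisation (frozen weight plus a scaled low-rank product, with B initialised to zero), shows why disabling the adapter recovers the reference exactly. The LoRALinear class is a toy stand-in written for this illustration, not PEFT's implementation.

```python
import numpy as np

class LoRALinear:
    """Toy linear layer: frozen base weight plus a switchable low-rank update."""

    def __init__(self, weight, rank, alpha, rng):
        self.weight = weight                                     # frozen, shape (d_out, d_in)
        self.A = rng.normal(0.0, 0.01, (rank, weight.shape[1]))  # trainable
        self.B = np.zeros((weight.shape[0], rank))               # trainable, zero at init
        self.scale = alpha / rank
        self.enabled = True

    def __call__(self, x):
        out = x @ self.weight.T
        if self.enabled:
            out = out + self.scale * (x @ self.A.T) @ self.B.T
        return out

rng = np.random.default_rng(4)
W = rng.normal(size=(5, 7))
layer = LoRALinear(W, rank=2, alpha=4.0, rng=rng)
x = rng.normal(size=(3, 7))
base = x @ W.T

# At init B is zero, so policy == reference and the DPO loss starts at log(2).
assert np.allclose(layer(x), base)

# After training has moved B, the policy output differs from the base model...
layer.B += rng.normal(size=layer.B.shape)
assert not np.allclose(layer(x), base)

# ...but with the adapter disabled the same weights reproduce the reference.
layer.enabled = False
assert np.allclose(layer(x), base)
print("pass: adapter-disabled forward equals the frozen base")
```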

Hardware profile

DPO has the hardware profile of SFT, not RL. No rollout engine, no per-step weight sync:

FSDP all-gather bound: keep shards on NVLink intra-node and IB/RoCE with GDR inter-node; HSDP (shard intra-node, replicate inter-node) suits multi-node (FSDP).
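NCCL: set NCCL_IB_HCA to the InfiniBand/RoCE adapters to use and NCCL_NET_GDR_LEVEL=SYS to permit GPUDirect RDMA; verify that [GDRDMA] appears in the NCCL_DEBUG=INFO output; turn PCIe ACS off for P2P/GDR.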
Memory is dominated by the two model copies; gradient checkpointing (on by default in DPOConfig) and precompute_ref_log_probs cut it. BF16 by default; Blackwell FP8 applies to the forward/backward as with any training.

Failure modes

beta too low leads to reward hacking and drift: the policy diverges from the reference and degrades on everything not in the pairs. Raise beta; gate on a held-out eval.
beta too high means no learning; rewards/margins stay flat near zero.
Noisy/contradictory pairs mean the margin never separates; clean and de-duplicate the data first.
Both copies OOM: use precompute_ref_log_probs=True or PEFT (adapter-disable reference) so only one full model is resident.
Skipping SFT: DPO on a non-SFT'd base can move in odd directions; cold-start with SFT first (SFT and LoRA).
No eval gate: a "successful" DPO run that regressed real quality reaches production (SRE and MLOps practices).

References

DPO paper: https://arxiv.org/abs/2305.18290
SimPO (length-normalised, sigmoid_norm): https://arxiv.org/abs/2405.14734
TRL DPO Trainer docs: https://huggingface.co/docs/trl/dpo_trainer