Markdown

Fine-tuning and post-training (SFT · LoRA · DPO · GRPO)¶

Scope: adapting open-weight models through supervised fine-tuning, parameter-efficient LoRA/QLoRA, preference optimisation (DPO), and reinforcement learning with verifiable rewards (GRPO and the RL family), with the frameworks, configs, and hardware shape each needs. This is the hub over the per-method deep dives (SFT and LoRA, DPO, GRPO); it is the training-side counterpart to serving (serving open-weight models), built on the distributed mechanics in distributed-training recipes.

Recipes are reference templates on real APIs (TRL/verl/OpenRLHF). Validate against the installed library version; post-training methods and trainer APIs move quickly. Each templated block is paired below with a numpy-only block that validates the core math it teaches, runnable with no ML dependencies.

What it is¶

Post-training is everything you do to a base model after pretraining to make it useful: teach it a format, align it to preferences, and push its reasoning or agentic skill. The dominant 2026 pipeline is three stages, each with a different data and compute shape.

SFT (supervised fine-tuning): continue training on curated demonstrations with the standard next-token loss. It sets format and cold-starts behaviour. Cheapest stage: pure forward/backward, no rollouts, no reward model. LoRA trains small low-rank adapters and freezes the base (~100x fewer trainable params, fits big models on far less memory); QLoRA adds a 4-bit quantised base on top.
Preference optimisation (DPO/SimPO): align to human or AI preferences from (chosen, rejected) pairs. Offline, with no generation, so a run costs close to SFT. DPO's implicit reward is beta * log(pi_theta / pi_ref) against a frozen reference.
RL with verifiable rewards (GRPO/DAPO): improve reasoning and agentic skill against a programmatic reward (correctness, tests, format). Online: it generates rollouts every step, so it needs an inference engine alongside the trainer. GRPO drops the value/critic network and scores each completion relative to a group mean.

Pick the lightest stage that achieves the goal: many tasks need only SFT(+LoRA); reasoning and tool-use generally need RL.

Why use it¶

A base model is not a product. Pretraining gives raw capability, not format, safety, or task behaviour; post-training is where those are installed.
Cost scales with the stage, so you pay only for what you need. SFT and DPO are pure training jobs (no rollouts); RL adds a rollout engine and is materially more expensive. Choosing the lightest stage that works is a direct cost lever.
LoRA/QLoRA make adaptation cheap to run and cheap to store. Adapters are a few MB, many task adapters share one base, and they can be served together (multi-LoRA serving).
RL reaches skills SFT cannot demonstrate. With a verifiable reward the model learns from its own rollouts, lifting reasoning and tool-use past anything present in the demonstration set.

When to use it (and when not)¶

Use SFT(+LoRA) as the default first step for any task adaptation: instruction-following, domain style, format, tool syntax, and as the cold-start before DPO/GRPO. Many tasks never need more (SFT and LoRA).
Use QLoRA when the model plus optimiser will not fit in memory otherwise; accept a small speed and quality cost for the 4-bit base.
Add DPO when you have (or can collect) (chosen, rejected) pairs and want to align tone, helpfulness, or style. It is offline and stable, close to SFT cost (DPO).
Add GRPO/RL when a reward can be computed (correctness, tests pass, schema valid) and you need to push reasoning or agentic capability past SFT (GRPO).
When not: do not reach for RL if the reward is noisy or hackable (a bad reward trains a bad model fast), and do not run DPO on a non-SFT'd base (it can drift in odd directions). Prefer full-parameter SFT over LoRA when you have the GPUs and the distribution shift is large.

Architecture¶

The stages compose into a pipeline gated by evaluation: a failed gate loops back to more SFT or better data rather than promoting a regressed model.

flowchart LR
  BASE["Base model"] --> SFT["SFT (+LoRA/QLoRA)"]
  SFT --> DPO["DPO / SimPO (offline pairs)"]
  DPO --> RL["GRPO / DAPO (RLVR, online rollouts)"]
  RL --> EVAL{"Eval gate"}
  EVAL -->|"pass"| REG["Model registry"]
  EVAL -->|"fail"| SFT
  ROLL["Rollout engine (vLLM / SGLang)"] -.->|"generations"| RL
  RL -.->|"weight sync each step"| ROLL

The load-bearing structural fact: SFT and DPO are single-workload training jobs, whereas RL is two coupled workloads (a trainer and a rollout generator) with a per-step weight sync between them. That split drives the hardware and scaling story below.

SFT and LoRA/QLoRA (the default)¶

LoRA trains small low-rank adapters and freezes the base: ~100x fewer trainable params, fitting big models on far less memory. QLoRA adds 4-bit base quantisation. Config-first frameworks (Axolotl, LLaMA-Factory) make this a YAML.

# LLaMA-Factory / Axolotl-style SFT + QLoRA (pin the framework version)
base_model: Qwen/Qwen3-8B
adapter: qlora
load_in_4bit: true
lora_r: 16
lora_alpha: 32
lora_target: [q_proj, k_proj, v_proj, o_proj]
sequence_len: 4096
gradient_checkpointing: true
bf16: true
datasets: [{ path: ./sft_data.jsonl, type: chat }]

Full-parameter SFT uses FSDP/DeepSpeed across GPUs (distributed-training recipes):

accelerate launch --config_file fsdp.yaml sft.py \
  --model Qwen/Qwen3-32B --bf16 --gradient_checkpointing --fsdp "full_shard auto_wrap"

The core math LoRA teaches: the update is W' = W0 + (alpha/r) * B @ A, a rank-r delta on a frozen base, so trainable params drop from d_out*d_in to r*(d_in+d_out). This block validates the no-op initialisation, the merge equivalence, the rank bound, and the parameter saving with numpy only (no torch).

# LoRA core: W' = W0 + (alpha/r) * B @ A. numpy-only; validates no-op init, merge, param saving.
import numpy as np

def lora_forward(x, W0, A, B, alpha, r):
    return x @ W0.T + (alpha / r) * x @ (B @ A).T

d_out, d_in, r, alpha = 64, 48, 8, 16
rng = np.random.default_rng(0)
W0, A = rng.standard_normal((d_out, d_in)), rng.standard_normal((r, d_in))
B, x = np.zeros((d_out, r)), rng.standard_normal((5, d_in))
assert np.allclose(lora_forward(x, W0, A, B, alpha, r), x @ W0.T)   # B=0 -> starts at base model

B = rng.standard_normal((d_out, r))                                  # after training
W_merged = W0 + (alpha / r) * (B @ A)
assert np.allclose(lora_forward(x, W0, A, B, alpha, r), x @ W_merged.T, atol=1e-10)  # merge_and_unload
assert np.linalg.matrix_rank((alpha / r) * (B @ A)) <= r             # delta is low-rank

full, lora = d_out * d_in, r * (d_in + d_out)                        # trainable-param counts
saving = 1.0 - lora / full
assert lora < full and saving > 0.5                                 # ~100x fewer at production shapes
print(f"LoRA asserts passed: full={full} lora={lora} saving={saving:.1%}")

DPO (preference alignment, offline)¶

DPO reframes RLHF as a single classification loss on (chosen, rejected) pairs, with no reward model and no rollouts. The reference template (TRL):

# Reference template (needs TRL + torch). Data columns: {prompt, chosen, rejected}.
from trl import DPOConfig, DPOTrainer
trainer = DPOTrainer(
    model="Qwen/Qwen3-8B",
    args=DPOConfig(beta=0.1, learning_rate=5e-7, per_device_train_batch_size=4,
                   gradient_checkpointing=True, bf16=True),
    train_dataset=pref_ds, processing_class=tokenizer)
trainer.train()

beta controls deviation from the reference: too high means no learning, too low means reward hacking and drift. No rollouts, so a DPO run is roughly an SFT-cost job. See DPO for loss variants (IPO, SimPO) and the two-model-copy memory constraint.

The core math DPO teaches, validated numpy-only: the loss is -log sigmoid(margin) where margin = beta * ((logp_ch - ref_ch) - (logp_rj - ref_rj)). This block checks the loss is positive, that swapping chosen and rejected (an adversarial preference flip) raises the loss, that beta=0 collapses to ln 2, and that extreme margins do not overflow.

# DPO core: implicit reward beta*log(pi/pi_ref); loss = -log sigmoid(margin). numpy-only.
# -log sigmoid(x) == softplus(-x) == logaddexp(0, -x), which is overflow-safe.
import numpy as np

def dpo_loss(lp_ch, lp_rj, ref_ch, ref_rj, beta=0.1):
    margin = beta * ((lp_ch - ref_ch) - (lp_rj - ref_rj))   # beta * (log-ratio_chosen - log-ratio_rejected)
    return np.logaddexp(0.0, -margin), margin               # (loss, margin)

lp_ch, lp_rj = np.array([-2.0, -1.0]), np.array([-5.0, -4.0])
ref_ch, ref_rj = np.array([-3.0, -2.0]), np.array([-3.0, -3.0])
loss, margin = dpo_loss(lp_ch, lp_rj, ref_ch, ref_rj, 0.1)
assert np.all(loss > 0.0)                                   # -log of a value in (0,1)

# adversarial: swapping chosen<->rejected flips the margin and must raise the loss
loss_sw, margin_sw = dpo_loss(lp_rj, lp_ch, ref_rj, ref_ch, 0.1)
assert np.allclose(margin_sw, -margin)
assert np.all(loss_sw > loss)                               # correct pref is cheaper than its inversion

# boundary: beta=0 removes the signal -> loss == ln 2 for every pair
loss0, m0 = dpo_loss(lp_ch, lp_rj, ref_ch, ref_rj, 0.0)
assert np.allclose(m0, 0.0) and np.allclose(loss0, np.log(2.0))
# stability: extreme margins do not overflow (naive 1/(1+exp(-x)) would)
assert np.isfinite(dpo_loss(np.array([500.0]), np.array([-500.0]), np.array([0.0]), np.array([0.0]), 1.0)[0]).all()
print("DPO loss asserts passed:", np.round(loss, 4).tolist())

GRPO (RL with verifiable rewards)¶

GRPO (Group Relative Policy Optimization, from DeepSeekMath, scaled in DeepSeek-R1) drops the value/critic network: it samples a group of completions per prompt and computes each advantage relative to the group mean, roughly halving memory versus PPO. It is the standard for verifiable reasoning. The reference template (TRL):

# Reference template (needs TRL + torch + vLLM). Rollout-dominated: back it with an inference engine.
from trl import GRPOConfig, GRPOTrainer
def reward_correct(prompts, completions, answers, **kw):     # verifiable reward
    return [1.0 if extract(c) == a else 0.0 for c, a in zip(completions, answers)]
trainer = GRPOTrainer(
    model="Qwen/Qwen3-8B",
    reward_funcs=[reward_correct],
    args=GRPOConfig(num_generations=8,         # group size G
                    use_vllm=True,             # vLLM backend for fast rollouts
                    max_completion_length=2048, beta=0.0, bf16=True),
    train_dataset=ds)
trainer.train()

Operational reality of GRPO/RL (reliability and RAS):

It is rollout-dominated: generation (an inference engine, often vLLM) can be most of the wall-clock. Plan GPUs for both rollout and training; co-locate or disaggregate them (disaggregated inference).
Watch the failure modes: entropy collapse (model goes deterministic), advantage/reward collapse, and KL drift. Monitor reward, entropy, and KL as first-class metrics (observability).
DAPO and related variants address GRPO instabilities; the field iterates fast. See GRPO and GRPO variants and tricks for the successor knobs.

Two core algorithms GRPO teaches, both validated numpy-only. First, the group-relative advantage A_i = (r_i - mean(r)) / std(r): the block checks it is zero-mean per group (the critic-free baseline), that an all-equal group yields zero advantage (the frac_reward_zero_std degeneracy), and that std-scaling is invariant to affine reward rescaling (an adversarial reshaping that must not change the update).

# GRPO core: A_i = (r_i - mean(r)) / std(r) per group. numpy-only, no trainer needed.
import numpy as np

def group_advantages(rewards, scale_std=True, eps=1e-8):
    mean = rewards.mean(axis=1, keepdims=True)
    adv = rewards - mean
    return adv / (rewards.std(axis=1, keepdims=True) + eps) if scale_std else adv

r = np.array([[1.0, 0.0, 1.0, 0.0], [3.0, 1.0, 4.0, 1.0]])
adv = group_advantages(r, scale_std=False)
assert np.allclose(adv.mean(axis=1), 0.0)                 # critic-free baseline: zero-mean per group

# degenerate group (all equal) -> zero advantage, zero gradient (frac_reward_zero_std case)
assert np.allclose(group_advantages(np.array([[0.5, 0.5, 0.5]])), 0.0)

# std-scaling is invariant to affine reward rescaling r -> a*r + b, a>0 (adversarial reshaping)
n1 = group_advantages(r, scale_std=True)
n2 = group_advantages(7.0 * r - 2.0, scale_std=True)
assert np.allclose(n1, n2, atol=1e-6)
print("GRPO advantage asserts passed:", np.round(n1, 3).tolist())

Second, the verifiable reward itself plus the rollout-dominated wall-clock. The reward must reject the adversarial reward-hacking case (right format, wrong value scores 0), and the step-time model must show the rollout is the majority of the step and that under-provisioning it idles the trainer.

# Verifiable reward core + rollout-dominated wall-clock. numpy/stdlib only.
import re, numpy as np

def extract(text):
    m = re.search(r"\\boxed\{([^}]*)\}", text)
    return m.group(1).strip() if m else None

def reward_correct(completions, answers):
    return [1.0 if extract(c) == a else 0.0 for c, a in zip(completions, answers)]

assert reward_correct([r"\boxed{42}", r"\boxed{ 7 }", "no box"], ["42", "7", "5"]) == [1.0, 1.0, 0.0]
# adversarial reward-hacking: right format, wrong value must score 0
assert reward_correct([r"<think></think>\boxed{999}"], ["42"]) == [0.0]

def step_time(gen_tokens, tok_s, train_flops, flops_s):
    t_roll, t_train = gen_tokens / tok_s, train_flops / flops_s
    return t_roll, t_train, t_roll + t_train

t_roll, t_train, total = step_time(8 * 2048, 4000.0, 3.5e14, 2.0e14)
assert t_roll / total > 0.5                              # RL step is rollout-dominated
_, _, slow = step_time(8 * 2048, 2000.0, 3.5e14, 2.0e14)
assert slow > total                                      # under-provisioned rollout idles trainer
print(f"reward+rollout asserts passed: rollout={t_roll/total:.0%} of step")

How to use it¶

The minimal path is a single-node TRL trainer per stage: SFTTrainer with a PEFT LoraConfig for SFT/LoRA, DPOTrainer for preference pairs, GRPOTrainer (backed by vLLM) for RL. Each takes a model, a dataset, and a config; the per-method pages (SFT and LoRA, DPO, GRPO) carry the exact trainer code. Launch with accelerate launch train.py. The workflow rule that matters most: run the stages in order (SFT before DPO before RL) and start each from the previous checkpoint, because DPO and GRPO both assume an SFT'd starting policy.

How to integrate it¶

Data plane. SFT wants (input, output) demonstrations (training-data curation); DPO wants (prompt, chosen, rejected) pairs; GRPO wants prompts plus a reward function (or reward model). Synthetic data feeds all three (synthetic data generation).
Reward plane (RL only). The reward function is the integration point that decides model quality. Keep it cheap, deterministic, and hack-resistant; the cross-cutting principles are in reward design for RL post-training, and trained reward models in reward model training.
Serving plane. The output is a full checkpoint or a LoRA adapter. Merge the adapter (merge_and_unload()) and serve it as a normal model, or keep the base loaded once and hot-swap adapters (multi-LoRA serving), on the serving stack.
Lineage plane. Pin base model, dataset, and config, and record the run in an experiment tracker and model registry (experiment tracking and model registry) so a promoted model is reproducible.

How to run it in production¶

Gate every promotion on an eval. A run that reports "success" (loss down, reward up) can still regress real quality; hold out an evaluation and block promotion on it (LLM evaluation harness, SRE and MLOps practices).
Validate the reward before a full RL run. A bad reward trains a bad model fast, so test it on held-out cases (and adversarially) first (reward design).
Monitor RL health as first-class metrics. Track reward, entropy, and KL live; entropy collapse or reward-std collapse means the run has stopped learning (observability).
Checkpoint and expect faults. Multi-node runs fail; recover from checkpoints rather than restarting from zero (reliability and RAS, checkpoint recovery).

How to maintain it¶

Re-tune as data and base models move. New base releases and fresh preference or verifiable-reward data mean periodic re-runs; the pinned config makes each a controlled change.
Keep configs version-specific. Trainer APIs (TRL/verl/OpenRLHF) and post-training methods move quickly; treat every recipe as pinned to a library version and re-validate on upgrade.
Prefer adapters for iteration. LoRA adapters are cheap to retrain, store, and roll back, which keeps the maintenance loop fast; reach for full-parameter runs only when a large distribution shift needs them.

How to scale it¶

Pick the framework by scale, then provision hardware for the workload shape.

TRL (Hugging Face): SFTTrainer/DPOTrainer/GRPOTrainer; best for single-node and getting started.
verl (Volcano Engine RL): Ray-based, separates actor / rollout (vLLM/SGLang) / reference; scales RL to very large MoE models, including GRPO+LoRA. The choice for large-scale RL.
OpenRLHF: concise, with strong async RL and agentic RL.
NeMo-Aligner, Axolotl, LLaMA-Factory: NVIDIA-stack RLHF, and config-first SFT/DPO.

Hardware and networking follow the two workload shapes:

SFT and DPO scale like training. They are FSDP all-gather bandwidth bound: keep shards on NVLink/IB with GDR (performance tuning). HSDP shards intra-node over NVLink and replicates inter-node over IB (FSDP, DeepSpeed and ZeRO).
RL splits into train and rollout. The weight sync between them (every step) wants fast interconnect: co-locate on NVLink, or transfer over IB (networking fabric, disaggregated inference). Budget GPUs for both, and benchmark the rollout-versus-train split for the target model.
LoRA/QLoRA shrinks memory enough to fine-tune large models on a single node. Full-parameter post-training of a frontier model is a multi-node job (distributed training); reach for sharding only when base plus activations exceed one node.

Failure modes¶

GRPO entropy collapse. Outputs go deterministic, exploration dies, reward plateaus; needs KL/entropy regularisation or DAPO-style fixes. Monitor entropy.
GRPO advantage/reward collapse. Every sample in a group scores identically (frac_reward_zero_std high), so the advantage is zero and no learning happens; vary prompt difficulty or the group size G.
DPO beta too low. Reward hacking and drift from the reference; the policy degrades on everything not in the pairs. Raise beta and gate on a held-out eval.
RL rollout under-provisioned. Generation is rollout-dominated, so too few rollout GPUs leave the training GPUs idle waiting on generation.
LoRA targeting the wrong modules. The adapter learns little; include the attention projections at minimum, add the MLP for capacity.
No eval gate. A "successful" run that regressed real quality reaches production; always gate on a held-out eval before promotion.
Reward not validated. A noisy or hackable reward trains a bad model fast; test it (including adversarially) on held-out cases before a full RL run.
Config drift. Trainer APIs (TRL/verl) change between versions; a copied recipe can silently mismatch the installed library. Confirm the API surface for the pinned version.

References¶

TRL (SFT/DPO/GRPO): https://huggingface.co/docs/trl/index
verl (Volcano Engine RL): https://github.com/volcengine/verl
OpenRLHF: https://github.com/OpenRLHF/OpenRLHF
DeepSeekMath / GRPO paper: https://arxiv.org/abs/2402.03300
LLaMA-Factory: https://github.com/hiyouga/LLaMA-Factory · Axolotl: https://github.com/axolotl-ai-cloud/axolotl
DPO paper: https://arxiv.org/abs/2305.18290