
TRL (Hugging Face)¶

Scope: Hugging Face's Transformer Reinforcement Learning library, the HF-native post-training stack (SFT/DPO/GRPO and more) built on the Trainer + Accelerate, with optional vLLM rollouts and PEFT/LoRA.

The Python blocks that call trl, peft, or accelerate are reference templates against real APIs (GPU-only, not runnable in this doc's CI); pin versions and validate before production use. Each core method is paired with a dependency-free numpy block that validates the underlying math, and every one of those was executed with asserts passing.

What it is¶

TRL (huggingface/trl, v1.x mid-2026) is a full-stack post-training library integrated with HF Transformers. It exposes one Trainer subclass per method, all built on the HF Trainer and Accelerate for distribution, with vLLM rollouts for online methods and PEFT for LoRA/QLoRA. It is the simplest entry point in the RL-library landscape (RL libraries): no Ray, no separate rollout cluster, tightest coupling to the HF ecosystem (models, datasets, Trainer).

Trainers, by family:

- Offline: SFTTrainer, DPOTrainer (plus CPO/KTO/ORPO/BCO).
- Online (vLLM-accelerated): GRPOTrainer, RLOOTrainer, OnlineDPOTrainer, XPOTrainer, NashMDTrainer (plus PPOTrainer, without the vLLM path).
- Reward modelling: RewardTrainer, PRMTrainer.
- Distillation: GKDTrainer, MiniLLMTrainer.

Why use it¶

Simplest to start: a basic run is three lines (a model id, a dataset, trainer.train()), with no cluster DSL.
Tight HF ecosystem: consumes datasets, any transformers checkpoint, Trainer callbacks, and the Hub directly.
Best for single-node / getting started: Accelerate scales it from one GPU upward without rewriting the loop; ideal for SFT, preference tuning, and small-to-mid GRPO runs.

When to use it (and when not)¶

Use TRL for SFT/DPO/reward-model training, prototyping a GRPO recipe, LoRA/QLoRA on modest hardware, and single-node or small multi-node jobs where the HF Trainer is sufficient.
Prefer the large-scale Ray libraries when the job needs disaggregated async rollouts, Megatron-scale parallelism, or thousand-GPU orchestration: verl, slime, SkyRL, OpenRLHF, NeMo-RL. TRL has no Ray controller and colocates by design.
For the wider landscape and selection guidance, see RL libraries.

Architecture¶

flowchart TB
  subgraph Acc["Accelerate launch (DDP / FSDP / DeepSpeed)"]
    TR["HF Trainer: SFT / DPO / GRPO"]
    PEFT["PEFT adapters (LoRA / QLoRA)"]
  end
  DS["datasets (HF Hub)"] --> TR
  TR --- PEFT
  subgraph Roll["Online methods only"]
    VS["vLLM server (trl vllm-serve)"]
  end
  TR -.->|"prompts (use_vllm=True)"| VS
  VS -.->|"completions"| TR

How to use it¶

Install, then pick a trainer. SFT is the baseline.

pip install trl                    # add extras as needed: pip install "trl[vllm]" (quoted so the shell leaves the brackets alone)

Reference template (needs trl and a GPU):

from datasets import load_dataset
from trl import SFTTrainer

dataset = load_dataset("trl-lib/Capybara", split="train")
trainer = SFTTrainer(model="Qwen/Qwen2.5-0.5B", train_dataset=dataset)
trainer.train()

Under the hood SFT is next-token cross-entropy over the completion tokens. The numpy block below reproduces that objective and asserts its endpoints, including a confidently-wrong adversarial case that must exceed the uniform baseline:

import numpy as np


def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)


def sft_token_loss(logits, targets):
    # SFT objective: mean next-token cross-entropy over completion tokens.
    p = softmax(np.asarray(logits, dtype=float))
    picked = p[np.arange(len(targets)), targets]
    return float(-np.log(picked).mean())


V, T = 5, 3
targets = np.array([1, 2, 3])

# Happy path: confident-correct logits drive the loss to ~0.
logits = np.full((T, V), -10.0)
logits[np.arange(T), targets] = 10.0
assert sft_token_loss(logits, targets) < 1e-3

# Reference: uniform logits give exactly ln(V) (max-entropy baseline).
assert abs(sft_token_loss(np.zeros((T, V)), targets) - np.log(V)) < 1e-12

# Adversarial: confidently WRONG predictions exceed the uniform baseline.
wrong = np.full((T, V), -10.0)
wrong[np.arange(T), (targets + 1) % V] = 10.0
assert sft_token_loss(wrong, targets) > np.log(V)

print("SFT cross-entropy: OK")
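"Over the completion tokens" implies that prompt positions are excluded from the mean. A minimal sketch of that masking, assuming the PyTorch-style convention of marking ignored positions with the label -100. The helper name `masked_sft_loss` is hypothetical, and this is not TRL's implementation:

```python
import numpy as np

IGNORE_INDEX = -100  # assumption: PyTorch cross-entropy's default ignore_index


def masked_sft_loss(logits, labels):
    # Mean cross-entropy over positions whose label is not IGNORE_INDEX.
    logits = np.asarray(logits, dtype=float)
    labels = np.asarray(labels)
    keep = labels != IGNORE_INDEX
    if not keep.any():
        raise ValueError("no completion tokens to train on")
    z = logits - logits.max(axis=-1, keepdims=True)
    logp = z - np.log(np.exp(z).sum(axis=-1, keepdims=True))
    picked = logp[np.arange(len(labels))[keep], labels[keep]]
    return float(-picked.mean())


V = 5
# Two prompt positions (masked) followed by two completion positions.
labels = np.array([IGNORE_INDEX, IGNORE_INDEX, 2, 3])
logits = np.zeros((4, V))
logits[0, 0] = 50.0   # extreme logits on masked prompt rows...
logits[1, 4] = -50.0  # ...must not move the loss

# Only the two uniform completion rows count, so the loss is ln(V).
masked_loss = masked_sft_loss(logits, labels)
assert abs(masked_loss - np.log(V)) < 1e-12
```

An all-masked example raises instead of returning NaN, which is the cheaper failure to debug before a long run.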

DPO (offline preference, no rollouts) is the same shape with a binarized-preference dataset. Reference template (needs trl):

from datasets import load_dataset
from trl import DPOTrainer

ds = load_dataset("trl-lib/ultrafeedback_binarized", split="train")
DPOTrainer(model="Qwen/Qwen3-0.6B", train_dataset=ds).train()

DPO replaces the reward model with a log-ratio between the policy and a frozen reference. The core loss, validated in numpy (init equals ln 2, monotone in the chosen log-prob, and a wrong-direction adversarial case):

import numpy as np


def dpo_loss(lp_chosen, lp_rejected, ref_chosen, ref_rejected, beta=0.1):
    # DPO implicit reward = policy-vs-reference log-ratio (arxiv 2305.18290).
    margin = (lp_chosen - ref_chosen) - (lp_rejected - ref_rejected)
    return float(np.logaddexp(0.0, -beta * margin))  # -log(sigmoid(beta * margin)), overflow-safe


# At init (policy == reference) the loss is exactly -log(0.5) = ln 2.
assert abs(dpo_loss(-2.0, -3.0, -2.0, -3.0) - np.log(2)) < 1e-12

# Raising the chosen log-prob must reduce the loss monotonically.
losses = [dpo_loss(lp, -5.0, -4.0, -5.0) for lp in [-6.0, -4.0, -2.0, 0.0]]
assert all(a > b for a, b in zip(losses, losses[1:]))

# Adversarial: optimizing the WRONG way (rejected above chosen) exceeds the ln 2 baseline.
assert dpo_loss(-5.0, -1.0, -4.0, -4.0) > np.log(2)

print("DPO loss: OK")
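The same log-ratio doubles as a per-sequence implicit reward, which gives offline runs something to watch besides the loss. A minimal sketch under stated assumptions: the log-probs below are made up, `implicit_rewards` is a hypothetical helper, and the names of the metrics your TRL version logs should be checked against its docs:

```python
import numpy as np


def implicit_rewards(lp_policy, lp_ref, beta=0.1):
    # DPO implicit reward: beta * (log pi(y|x) - log pi_ref(y|x)).
    return beta * (np.asarray(lp_policy, dtype=float) - np.asarray(lp_ref, dtype=float))


# Hypothetical sequence log-probs for three preference pairs.
lp_chosen = np.array([-2.0, -3.0, -6.0])
lp_rejected = np.array([-4.0, -3.5, -5.0])
ref_chosen = np.array([-3.0, -3.0, -5.0])
ref_rejected = np.array([-3.0, -3.0, -5.0])

r_chosen = implicit_rewards(lp_chosen, ref_chosen)
r_rejected = implicit_rewards(lp_rejected, ref_rejected)
margins = r_chosen - r_rejected            # already scaled by beta
accuracy = float((margins > 0).mean())     # share of pairs ranked correctly

# Batch loss: mean of -log(sigmoid(margin)), written in an overflow-safe form.
loss = float(np.logaddexp(0.0, -margins).mean())
```

Here two of the three pairs have a positive margin; the third pair is one where the policy has moved the wrong way relative to the reference.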

How to integrate with it¶

TRL consumes the HF ecosystem directly: any datasets split, any transformers checkpoint, Trainer callbacks, and the Hub. Each trainer expects a specific dataset schema (prompt-completion for SFT, chosen/rejected for DPO, prompt-only for GRPO); the trainer's dataset-format guide is authoritative. Integration means composing three things into that Trainer: a reward function (online methods), PEFT adapters, and a vLLM rollout backend.
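Because a wrong schema trains without raising, a cheap pre-flight check on a small sample can catch it early. A minimal sketch, assuming the column names implied by the schema names above (`prompt`/`completion`, `chosen`/`rejected`, `prompt`). `check_schema` and `REQUIRED_COLUMNS` are hypothetical helpers, not TRL APIs, and SFT also accepts other formats that this check does not cover:

```python
# Assumed column names, taken from the schema names in the text above.
REQUIRED_COLUMNS = {
    "sft": {"prompt", "completion"},
    "dpo": {"chosen", "rejected"},
    "grpo": {"prompt"},
}


def check_schema(rows, method):
    # Return a list of human-readable problems; an empty list means the sample passed.
    required = REQUIRED_COLUMNS[method]
    problems = []
    for i, row in enumerate(rows):
        present = set(row)
        missing = sorted(required - present)
        if missing:
            problems.append(f"row {i}: missing {missing}")
        empty = sorted(k for k in required & present if not row[k])
        if empty:
            problems.append(f"row {i}: empty {empty}")
    return problems


good = [{"prompt": "2+2=", "completion": "4"}]
bad = [{"prompt": "2+2=", "chosen": "4"}, {"chosen": "a", "rejected": ""}]

sft_problems = check_schema(good, "sft")
dpo_problems = check_schema(bad, "dpo")
```

With a Hugging Face `datasets` split, one plausible call is `check_schema(list(dataset.select(range(100))), "dpo")`; the trainer's dataset-format guide remains the authority on what each trainer accepts.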

Custom reward functions (GRPO and the online family)¶

Online RL is GRPOTrainer plus a reward function (or a built-in). Config goes through the per-trainer *Config (a TrainingArguments subclass). Reference template (needs trl and vllm):

from datasets import load_dataset
from trl import GRPOTrainer, GRPOConfig

dataset = load_dataset("trl-lib/DeepMath-103K", split="train")

def reward_len(completions, **kwargs):                # custom reward
    # len(c) counts characters (for plain-string completions), not tokens.
    return [-abs(64 - len(c)) for c in completions]   # peak at 64 characters

trainer = GRPOTrainer(
    model="Qwen/Qwen2.5-0.5B-Instruct",
    reward_funcs=reward_len,                           # or trl.rewards.accuracy_reward
    args=GRPOConfig(use_vllm=True, vllm_mode="server", bf16=True),
    train_dataset=dataset,
)
trainer.train()

GRPO is critic-free: it normalizes each reward against its own group of completions to the same prompt, so the reward function is the only learning signal you supply. The numpy block validates both reward_len and the group-relative advantage, checks it against a slow explicit-loop reference, and covers the zero-variance divide-by-zero guard:

import numpy as np


def reward_len(completions, target=64):
    # The page's GRPO reward: peak at `target` chars, linear falloff.
    return np.array([-abs(target - len(c)) for c in completions], dtype=float)


def grpo_advantages(rewards, eps=1e-8):
    # Group-relative, critic-free advantage (GRPO, DeepSeekMath arxiv 2402.03300).
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)


# reward_len peaks exactly at the target length and is symmetric around it.
r = reward_len(["x" * n for n in [60, 64, 68]])
assert r.argmax() == 1 and r[0] == r[2] == -4.0

# Advantages are zero-mean; the best sample is positive.
adv = grpo_advantages([1.0, 2.0, 3.0, 10.0])
assert abs(adv.mean()) < 1e-9 and adv.argmax() == 3 and adv[3] > 0


# Reference equivalence: matches an explicit-loop implementation.
def ref_adv(xs):
    m = sum(xs) / len(xs)
    var = sum((x - m) ** 2 for x in xs) / len(xs)
    return [(x - m) / (var ** 0.5 + 1e-8) for x in xs]


assert np.allclose(adv, ref_adv([1.0, 2.0, 3.0, 10.0]))

# Adversarial: a zero-variance group must not divide by zero.
flat = grpo_advantages([5.0, 5.0, 5.0])
assert np.all(np.isfinite(flat)) and np.allclose(flat, 0.0)

print("GRPO advantages + reward_len: OK")
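In a real batch the rewards for several prompts arrive together, and "its own group" means the normalization runs per prompt, not over the whole batch. A minimal sketch, assuming completions for the same prompt are contiguous in the flat reward list; `grouped_advantages` is a hypothetical helper, not TRL's code:

```python
import numpy as np


def grouped_advantages(rewards, group_size, eps=1e-8):
    # Normalize each prompt's group independently: rows = prompts, cols = samples.
    r = np.asarray(rewards, dtype=float)
    if r.size % group_size != 0:
        raise ValueError("reward count must be a multiple of group_size")
    groups = r.reshape(-1, group_size)
    mean = groups.mean(axis=1, keepdims=True)
    std = groups.std(axis=1, keepdims=True)
    return ((groups - mean) / (std + eps)).reshape(r.shape)


# Two prompts, three completions each; the second prompt is on a 100x reward scale.
flat_rewards = [0.0, 1.0, 2.0, 0.0, 100.0, 200.0]
adv = grouped_advantages(flat_rewards, group_size=3)

# Per-group normalization removes the scale difference between prompts.
assert np.allclose(adv[:3], adv[3:])
```

Normalizing over the whole batch instead would let the high-scale prompt dominate the update, which is the failure the per-group form avoids.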

PEFT / LoRA adapters¶

Pass a LoraConfig via peft_config; QLoRA adds a 4-bit BitsAndBytesConfig on the model. Reference template (needs trl and peft):

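from datasets import load_dataset
from peft import LoraConfig
from trl import SFTTrainer

dataset = load_dataset("trl-lib/Capybara", split="train")  # same dataset as the SFT template
peft_config = LoraConfig(r=16, lora_alpha=32, target_modules="all-linear",
                         task_type="CAUSAL_LM")
SFTTrainer(model="Qwen/Qwen2.5-0.5B", train_dataset=dataset,
           peft_config=peft_config).train()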

LoRA trains a scaled low-rank update (alpha/r)*B@A added to the frozen weight, so only the small A and B factors move. The numpy block confirms the update is genuinely rank-bounded, that merging is exact, and that a zero-init leaves the base weights bit-identical:

import numpy as np


def lora_delta(A, B, alpha, r):
    # LoRA update: scaled low-rank product added to frozen W (arxiv 2106.09685).
    return (alpha / r) * (B @ A)


rng = np.random.default_rng(0)
d, k, r, alpha = 8, 6, 2, 4
A = rng.standard_normal((r, k))
B = rng.standard_normal((d, r))
W = rng.standard_normal((d, k))

delta = lora_delta(A, B, alpha, r)
# The update is genuinely low rank: rank(delta) <= r << full rank.
assert np.linalg.matrix_rank(delta) <= r < min(d, k)

# Merging is exact: (W + delta) equals the explicit scaled sum.
assert np.allclose(W + delta, W + (alpha / r) * B.dot(A))

# Adversarial: zero-initialized B (LoRA's init) leaves the base weights bit-identical.
B0 = np.zeros((d, r))
assert np.array_equal(W + lora_delta(A, B0, alpha, r), W)

print("LoRA low-rank update: OK")

Methods, rollouts, and the serving handoff¶

TRL implements the catalog in fine-tuning and post-training: supervised fine-tuning and LoRA (SFT and LoRA) via SFTTrainer; offline preference optimization (DPO, arxiv 2305.18290) via DPOTrainer; and the critic-free GRPO family (GRPO, DeepSeekMath) via GRPOTrainer, the method TRL highlights for reproducing DeepSeek-R1-style training. Reward-model training (RewardTrainer) feeds RLHF-style pipelines.

Online methods integrate vLLM purely to generate rollouts during training (server or colocate mode, see how to scale it). TRL is a training library: it does not serve models in production. To serve a trained or merged checkpoint, run it under vLLM/SGLang/Dynamo as described in the serving pages (inference serving, serving open-weight models, disaggregated inference). If you trained with PEFT, merge the LoRA adapters (model.merge_and_unload() / peft merge) before serving.
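Why merging matters can be checked on a single linear layer: the merged weight reproduces the base-plus-adapter forward pass, while the bare base weight does not. A minimal numpy sketch with random stand-in matrices (all shapes and values are illustrative assumptions, not a PEFT internals dump):

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, r, alpha = 8, 6, 2, 4
W = rng.standard_normal((d_out, d_in))   # frozen base weight
A = rng.standard_normal((r, d_in))       # trained LoRA factor
B = rng.standard_normal((d_out, r))      # trained LoRA factor (non-zero after training)
x = rng.standard_normal((3, d_in))       # a batch of three inputs

scale = alpha / r
y_adapter = x @ W.T + scale * (x @ A.T) @ B.T   # base path + adapter path
W_merged = W + scale * (B @ A)                  # fold the update into the weight
y_merged = x @ W_merged.T
y_base = x @ W.T                                # what a server with no adapter computes

# Merged weights reproduce the adapter forward pass; the base alone does not.
assert np.allclose(y_adapter, y_merged)
assert not np.allclose(y_base, y_merged)
```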

How to run it in production¶

Separate generation from training (online methods)¶

For online methods, separate the generation and training GPUs: run a vLLM server on one set, train on the other.

CUDA_VISIBLE_DEVICES=0,1,2,3 trl vllm-serve --model Qwen/Qwen2.5-0.5B-Instruct
CUDA_VISIBLE_DEVICES=4,5,6,7 accelerate launch train.py   # GRPOConfig(use_vllm=True, vllm_mode="server")

vllm_mode="colocate" instead runs vLLM in the training process group (no separate server), trading isolation for fewer moving parts.

Hardware and throughput¶

Accelerate + FSDP/DeepSpeed: shard parameters/optimizer state to fit larger models; HSDP (shard intra-node over NVLink, replicate inter-node over IB) for multi-node SFT/DPO (FSDP, DeepSpeed and ZeRO).
QLoRA: 4-bit base + LoRA adapters trains big models on modest GPUs; pair with bf16 and gradient checkpointing.
vLLM rollout on separate GPUs: keep generation off the training GPUs (CUDA_VISIBLE_DEVICES split) so PagedAttention throughput isn't fighting the optimizer for memory.
NCCL fabric: multi-node Accelerate collectives ride NCCL; on RDMA clusters set NCCL_IB_HCA, NCCL_NET_GDR_LEVEL=SYS, verify [GDRDMA] in NCCL_DEBUG=INFO, ACS off (performance tuning, networking fabric).
Liger Kernel: use_liger_kernel=True cuts memory and lifts throughput.
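The QLoRA item above rests on two pieces of arithmetic: how few parameters the adapter trains, and how much weight storage 4-bit quantization saves. A back-of-envelope sketch with hypothetical sizes (a 4096 x 4096 projection, rank 16, and a 7e9-parameter base); it counts weight storage only and ignores activations, optimizer state, and quantization metadata:

```python
def lora_trainable_params(d_out, d_in, r):
    # A is (r, d_in) and B is (d_out, r), so the adapter adds r * (d_in + d_out) parameters.
    return r * (d_in + d_out)


def weight_gib(n_params, bits_per_param):
    # Storage for the weights alone, in GiB.
    return n_params * bits_per_param / 8 / 2**30


# Hypothetical 4096 x 4096 projection with rank 16.
full = 4096 * 4096
adapter = lora_trainable_params(4096, 4096, 16)
fraction = adapter / full          # 1/128 of the full matrix

# Hypothetical 7e9-parameter base model: bf16 vs 4-bit weight storage.
bf16_gib = weight_gib(7e9, 16)     # roughly 13 GiB
nf4_gib = weight_gib(7e9, 4)       # a quarter of that

assert adapter == 131072 and fraction == 1 / 128
```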

Recipes¶

1. SFT a small instruct model (single GPU) (reference template, needs trl):

from datasets import load_dataset
from trl import SFTTrainer
SFTTrainer(model="Qwen/Qwen2.5-0.5B",
           train_dataset=load_dataset("trl-lib/Capybara", split="train")).train()

2. DPO preference tuning (offline, no rollouts) (reference template, needs trl):

from datasets import load_dataset
from trl import DPOTrainer
DPOTrainer(model="Qwen/Qwen3-0.6B",
           train_dataset=load_dataset("trl-lib/ultrafeedback_binarized", split="train")).train()

3. GRPO with a vLLM server (online, 4+4 GPU split):

CUDA_VISIBLE_DEVICES=0,1,2,3 trl vllm-serve --model Qwen/Qwen2.5-0.5B-Instruct
CUDA_VISIBLE_DEVICES=4,5,6,7 accelerate launch grpo_train.py   # GRPOConfig(use_vllm=True, vllm_mode="server")

How to maintain it¶

Pin the stack together: TRL v1.x moves fast and trainer signatures plus *Config fields shift across minor versions. Pin trl, transformers, accelerate, peft, vllm, and datasets as one set; the reference templates above target the v1.x API (mid-2026) and should be revalidated on upgrade.
Validate the dataset schema before long runs: a wrong format silently mis-trains (see Failure modes); check the trainer's dataset-format guide against a small sample first.
Merge LoRA before serving: a PEFT checkpoint served without its adapter merged (or explicitly loaded by the serving engine) behaves like the base model; merge with model.merge_and_unload() (see the serving handoff above).
Monitor the training signals: reward, KL, and entropy for online methods; loss and eval for offline. Reward/KL/entropy drift is the leading indicator of GRPO collapse (observability).
Checkpoints and the Hub: the HF Trainer writes checkpoints on the save_steps cadence and can push_to_hub; resume long jobs with trainer.train(resume_from_checkpoint=...).
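For the monitoring item above, it helps to know what the entropy and KL numbers measure. A minimal numpy sketch of both, assuming per-token logits and log-probs are available. The KL form used here, exp(d) - d - 1 with d = log pi_ref - log pi, is one common non-negative estimator and may not match exactly what your TRL version logs; both function names are hypothetical:

```python
import numpy as np


def token_entropy(logits):
    # Mean Shannon entropy (nats) of the per-token predictive distributions.
    z = np.asarray(logits, dtype=float)
    z = z - z.max(axis=-1, keepdims=True)
    logp = z - np.log(np.exp(z).sum(axis=-1, keepdims=True))
    return float(-(np.exp(logp) * logp).sum(axis=-1).mean())


def kl_estimate(lp_policy, lp_ref):
    # Per-token estimator exp(d) - d - 1 with d = log pi_ref - log pi; always >= 0.
    d = np.asarray(lp_ref, dtype=float) - np.asarray(lp_policy, dtype=float)
    return float((np.exp(d) - d - 1.0).mean())


V = 5
uniform_entropy = token_entropy(np.zeros((4, V)))   # ln(V): maximum entropy
peaked = np.full((4, V), -20.0)
peaked[:, 0] = 20.0
peaked_entropy = token_entropy(peaked)              # near 0: collapsed policy

same_kl = kl_estimate([-1.0, -2.0], [-1.0, -2.0])   # identical policies
drift_kl = kl_estimate([-1.0, -2.0], [-2.0, -1.0])  # policy has moved

assert abs(uniform_entropy - np.log(V)) < 1e-12 and peaked_entropy < 1e-6
assert same_kl == 0.0 and drift_kl > 0.0
```

Entropy sliding toward zero while KL climbs is the pattern the collapse warning above refers to.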

How to scale it¶

TRL scales via Accelerate, not Ray. One accelerate launch spans multi-GPU and multi-node with DDP, FSDP, or DeepSpeed ZeRO, selected by an Accelerate config or a DeepSpeed JSON.

accelerate config                                  # choose FSDP or DeepSpeed, num processes/nodes
accelerate launch --config_file fsdp.yaml train.py
# or DeepSpeed ZeRO-3:
accelerate launch --use_deepspeed --deepspeed_config_file ds_z3.json train.py

Multi-node training is standard Accelerate/torchrun rendezvous on top of this. For online methods, scale generation independently by pointing the trainer at a vLLM server on separate GPUs (see running in production); vllm_mode="colocate" keeps generation in the training process group when you want fewer moving parts. Past a few nodes the dedicated Ray stacks (verl, slime, SkyRL, OpenRLHF, NeMo-RL) scale online RL better, because TRL has no disaggregated async rollout controller.
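A rough way to choose between DDP and a sharded strategy is to estimate model-state memory per GPU. The sketch below assumes the usual mixed-precision Adam accounting of 16 bytes per parameter (bf16 weights and gradients plus fp32 master weights, momentum, and variance) and a hypothetical 7e9-parameter model; activations and buffers are excluded, so treat the result as a lower bound:

```python
def model_state_gib(n_params, n_gpus, shard=True, bytes_per_param=16):
    # Assumed 16 bytes/param: bf16 weights (2) + bf16 grads (2) + fp32 master
    # weights, Adam momentum, and Adam variance (4 + 4 + 4). Activations excluded.
    total = n_params * bytes_per_param
    per_gpu = total / n_gpus if shard else total
    return per_gpu / 2**30


# Hypothetical 7e9-parameter full fine-tune on 8 GPUs.
replicated = model_state_gib(7e9, 8, shard=False)  # DDP: every GPU holds everything
sharded = model_state_gib(7e9, 8, shard=True)      # full sharding (ZeRO-3 / FSDP style)

assert 104 < replicated < 105
assert abs(replicated / sharded - 8) < 1e-9
```

Under these assumptions the replicated figure exceeds a single 80 GB device while the sharded one fits with room for activations, which is the case where FSDP or ZeRO-3 earns its communication cost.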

Failure modes¶

GPU contention in online methods: running vLLM and training on the same GPUs without budgeting memory for both causes OOM or throughput collapse; use a server on separate devices, or choose vllm_mode="colocate" deliberately and size GPU memory for both workloads.
Dataset format mismatch: each trainer expects a specific schema (prompt-completion vs chosen/rejected vs prompt-only); a wrong format silently mis-trains. Check the trainer's dataset-format guide.
Forgetting to merge LoRA: serving a PEFT checkpoint without merging (or explicitly loading) the adapters yields the base model's behaviour.
Expecting Ray-scale throughput: TRL has no disaggregated async rollout controller; past a few nodes the dedicated RL stacks scale better.
GRPO instability: entropy/advantage collapse and KL drift still apply; monitor reward/KL/entropy (fine-tuning and post-training, observability, GRPO).

References¶

TRL repo: https://github.com/huggingface/trl
TRL docs: https://huggingface.co/docs/trl/index · speeding up (vLLM): https://huggingface.co/docs/trl/main/en/speeding_up_training
TRL v1 blog: https://huggingface.co/blog/trl-v1 · co-located vLLM in TRL: https://huggingface.co/blog/vllm-colocate
Anyscale: Open Source RL Libraries for LLMs: https://www.anyscale.com/blog/open-source-rl-libraries-for-llms
GRPO (DeepSeekMath): https://arxiv.org/abs/2402.03300 · DPO: https://arxiv.org/abs/2305.18290 · LoRA: https://arxiv.org/abs/2106.09685

Related: RL Libraries · Fine-tuning · GRPO · DPO · SFT/LoRA · FSDP · DeepSpeed ZeRO · NeMo-RL · Glossary