RL libraries for LLMs (verl · slime · SkyRL · …)

Scope: a comparison and selection overview, and an index, for the open-source RL post-training libraries: how the systems are structured, which inference and training backends they use, how they orchestrate rollouts versus training, and how to choose one. This page maps the field and routes you to the right library; the detailed WHAT/WHY/WHEN/HOW for each library lives on its own dedicated page (see Focused pages below). These libraries are the systems layer under the methods covered in fine-tuning and post-training, and are usually built on Ray (orchestration overview). The framing follows Anyscale's "Open Source RL Libraries for LLMs" survey (References).

The field moves monthly; treat the table as a map, not a spec. Verify a library's current backends and scale on its repo before committing.

Focused pages

Pick a library here, then open its page for the implementable detail (full WHAT/WHY/WHEN/HOW):

verl: use this when you need the high-performance, colocated default for large-scale RL throughput.
slime: use this when you want the Megatron + SGLang decoupled async stack (the one behind GLM), especially for MoE.
SkyRL: use this when you want to switch between colocated and disaggregated, or run flexible/agentic workloads.
OpenRLHF: use this when your focus is RLHF with reward models on DeepSpeed.
NeMo-RL: use this when you are on the NVIDIA stack or need async agentic training.
TRL: use this when you want single-node simplicity to get a small run going fast.
Tinker: use this when you want post-training without operating any GPUs. It is a managed training API (LoRA-only) with an open recipe library on top.

For the library-by-library tradeoffs that drive this choice, see The landscape and Selection guidance below.

Paradigm: Generator + Trainer, colocated ↔ disaggregated

Every RL-for-LLM system decomposes into two components plus a controller:

Generator (rollout) runs inference (vLLM/SGLang) to sample completions and interacts with the environment / reward. Single-turn or multi-turn/agentic.
Trainer runs the policy update (PPO/GRPO/DPO, fine-tuning and post-training) on FSDP / DeepSpeed / Megatron.
Controller coordinates the two, usually Ray (orchestration overview).
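
The three roles above can be sketched as a toy loop. This is a minimal illustration, not any library's API: the class names `Generator`, `Trainer`, and `Rollout`, the function `run_controller`, and the random reward are all hypothetical stand-ins, and in a real system the two components would typically be separate Ray process groups rather than plain Python objects.

```python
import random
from dataclasses import dataclass


@dataclass
class Rollout:
    prompt: str
    completion: str
    reward: float


class Generator:
    """Hypothetical stand-in for an inference engine plus environment/reward."""

    def __init__(self, seed: int = 0) -> None:
        self.rng = random.Random(seed)
        self.weights_version = 0

    def sample(self, prompts: list[str]) -> list[Rollout]:
        # A real generator would call vLLM/SGLang and score the result here.
        tag = f"completion-v{self.weights_version}"
        return [Rollout(p, tag, self.rng.random()) for p in prompts]

    def load_weights(self, version: int) -> None:
        self.weights_version = version


class Trainer:
    """Hypothetical stand-in for the policy update (PPO/GRPO step)."""

    def __init__(self) -> None:
        self.version = 0

    def update(self, rollouts: list[Rollout]) -> int:
        # A real trainer would compute a loss from the rollouts and step the optimizer.
        self.version += 1
        return self.version


def run_controller(steps: int, prompts: list[str]) -> list[int]:
    """Synchronous controller: sample, train, push weights, repeat."""
    gen, trainer = Generator(), Trainer()
    versions_used = []
    for _ in range(steps):
        rollouts = gen.sample(prompts)
        versions_used.append(gen.weights_version)
        gen.load_weights(trainer.update(rollouts))
    return versions_used
```

Calling `run_controller(3, ["a", "b"])` records which weights version produced each step's rollouts; in this fully synchronous sketch the generator always samples with the latest weights.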

The defining design axis is how generator and trainer share GPUs:

Colocated / tightly-coupled (e.g. verl): rollout and training share the same GPUs, swapping/offloading between phases. Maximum efficiency, less flexibility.
Disaggregated / decoupled (e.g. slime, SkyRL, AReaL): separate rollout and training pools, enabling async generation, straggler tolerance, and heterogeneous hardware. This is the same disaggregation idea used in inference serving (disaggregated inference).
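
One consequence of decoupling is that rollouts may be produced by weights that lag the trainer. The toy model below makes that concrete under a simplifying assumption: the generator receives new weights only every `sync_every` trainer steps, and one batch of rollouts is consumed per step. The function name and parameters are hypothetical, and real async systems have more complicated schedules.

```python
def rollout_staleness(num_steps: int, sync_every: int) -> list[int]:
    """Return, per trainer step, how many steps old the generator's weights are."""
    if sync_every < 1:
        raise ValueError("sync_every must be at least 1")
    staleness = []
    generator_version = 0
    for trainer_version in range(num_steps):
        if trainer_version % sync_every == 0:
            # Weight sync: the generator catches up to the trainer.
            generator_version = trainer_version
        staleness.append(trainer_version - generator_version)
    return staleness
```

With `sync_every=1` every rollout is on-policy (staleness 0), which matches the tightly-coupled case; larger values trade freshness for fewer synchronisation points.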

Architecture: generator/trainer loop

flowchart LR
  subgraph Gen["Generator (rollout)"]
    ENG["Inference engine: vLLM / SGLang"]
    ENV["Environment / reward"]
  end
  subgraph Train["Trainer"]
    OPT["PPO / GRPO on FSDP / Megatron"]
  end
  Gen -->|"rollouts + rewards"| Train
  Train -->|"updated weights"| Gen
  CTRL["Controller: Ray"] -.-> Gen
  CTRL -.-> Train

Architecture: colocated vs disaggregated

flowchart TB
  subgraph Colo["Colocated (verl): share GPUs"]
    G1["Rollout"] <-->|"offload / swap"| T1["Train"]
  end
  subgraph Disagg["Disaggregated (slime, SkyRL): separate pools"]
    G2["Rollout pool (SGLang)"] -->|"async rollouts"| T2["Train pool (Megatron)"]
    T2 -->|"weight sync"| G2
  end

The landscape (mid-2026)

| Library | Origin | Trainer | Rollout | Orchestration | Coupling | Best for |
| --- | --- | --- | --- | --- | --- | --- |
| verl | ByteDance | FSDP/FSDP2/Megatron | vLLM/SGLang | Ray | colocated default; async/agentic modes | large-scale, performance |
| slime | THUDM / Z.ai | Megatron only | SGLang | Ray | decoupled, async-first | high-perf MoE; powers GLM |
| SkyRL | UC Berkeley | FSDP2/Megatron/JAX | vLLM/OpenAI | Ray | colocated or disaggregated | flexible, agentic |
| OpenRLHF | community | DeepSpeed | vLLM | Ray | async + colocation | RLHF, reward models |
| NeMo-RL | NVIDIA | FSDP2/Megatron | vLLM/SGLang | Ray | async | agentic, NVIDIA stack |
| ROLL | Alibaba | FSDP2/Megatron | vLLM/SGLang | Ray | flexible | multi-purpose |
| AReaL | Ant Group | FSDP2/Megatron | vLLM/SGLang | optional Ray | async, interruptible | long rollouts, stragglers |
| TRL | Hugging Face | HF Trainer | vLLM/HF | none | colocated | single-node, simplicity |
| Tinker | Thinking Machines | managed service (LoRA) | service sampling API | none (client loop) | training-as-a-service | post-training without a cluster |
| Verifiers | Prime Intellect | own + prime-rl | vLLM/OpenAI | none | n/a | multi-turn env, research |
| RAGEN | community | on verl (FSDP/Megatron) | vLLM/SGLang | Ray | on verl | multi-turn agentic |
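
When shortlisting, it can help to treat the table as data and filter on the backends you already run. The sketch below transcribes the Trainer and Rollout columns for the rows with plain backend lists (Tinker, Verifiers, and RAGEN are omitted because their entries are not simple backend names). The dictionary `LANDSCAPE` and the function `candidates` are hypothetical helpers, and the data is only as current as the table itself.

```python
# Trainer/Rollout columns transcribed from the landscape table above.
LANDSCAPE = {
    "verl": {"trainer": {"FSDP", "FSDP2", "Megatron"}, "rollout": {"vLLM", "SGLang"}},
    "slime": {"trainer": {"Megatron"}, "rollout": {"SGLang"}},
    "SkyRL": {"trainer": {"FSDP2", "Megatron", "JAX"}, "rollout": {"vLLM", "OpenAI"}},
    "OpenRLHF": {"trainer": {"DeepSpeed"}, "rollout": {"vLLM"}},
    "NeMo-RL": {"trainer": {"FSDP2", "Megatron"}, "rollout": {"vLLM", "SGLang"}},
    "ROLL": {"trainer": {"FSDP2", "Megatron"}, "rollout": {"vLLM", "SGLang"}},
    "AReaL": {"trainer": {"FSDP2", "Megatron"}, "rollout": {"vLLM", "SGLang"}},
    "TRL": {"trainer": {"HF Trainer"}, "rollout": {"vLLM", "HF"}},
}


def candidates(trainer: str, rollout: str) -> list[str]:
    """Libraries whose table row lists both the given trainer and rollout backend."""
    return [
        name
        for name, row in LANDSCAPE.items()
        if trainer in row["trainer"] and rollout in row["rollout"]
    ]


print(candidates("Megatron", "SGLang"))
```

A filter like this only narrows the list; coupling mode and workload fit still decide between the survivors.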

How they relate to the rest of this KB

slime is the RL stack behind the GLM models (serving open-weight models): Megatron training + SGLang rollouts, decoupled and async; opinionated and minimal.
verl is the high-performance default referenced in fine-tuning and post-training; colocated by default for throughput, with async/agentic modes added.
All of these juggle a rollout engine and a trainer as separate process groups, which is why Ray (orchestration overview) is the common controller.

Selection guidance

Choose by the row that matches your constraint, then open that library's page:

Maximum performance / scale → verl or slime.
Flexibility / research / agentic → SkyRL or Verifiers.
Multi-turn / environment-heavy → RAGEN, NeMo-RL, or SkyRL.
Simplicity / single-node → TRL (fine-tuning and post-training).
RLHF with reward models → OpenRLHF.
No cluster at all / managed service → Tinker.

Hardware & networking notes

Weight sync from trainer to rollout happens every step (colocated) or every few steps (disaggregated). It wants NVLink in the colocated case and fast IB/RoCE with GPUDirect RDMA (GDR) in the disaggregated case; these are the same fabric concerns as in performance tuning and disaggregated inference.
Colocated setups hand the GPUs back and forth between phases via offload, so memory pressure is the constraint; disaggregated setups keep both sides hot but need the interconnect to move weights and rollouts.
Run any of these on Ray via KubeRay on the GPU platform (orchestration overview, the Kubernetes platform); expose RDMA into the Ray workers.
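
A back-of-envelope estimate shows why the fabric matters for weight sync. The sketch below assumes the full set of weights crosses a single link once per sync; it ignores sharding across parallel links, resharding between trainer and rollout layouts, and protocol overhead, so treat the result as a rough order of magnitude. The function name, the `efficiency` parameter, and the example numbers (7e9 parameters, 2 bytes each, a 400 Gbit/s link) are illustrative assumptions, not measurements.

```python
def weight_sync_seconds(
    num_params: float,
    bytes_per_param: int,
    link_gbit_per_s: float,
    efficiency: float = 1.0,
) -> float:
    """Time to push one full copy of the weights over a single link."""
    if not 0.0 < efficiency <= 1.0:
        raise ValueError("efficiency must be in (0, 1]")
    payload_bytes = num_params * bytes_per_param
    # Convert Gbit/s to bytes/s, then derate by the achievable fraction.
    link_bytes_per_s = link_gbit_per_s * 1e9 / 8 * efficiency
    return payload_bytes / link_bytes_per_s


# Illustrative only: 7e9 params at 2 bytes each over a 400 Gbit/s link.
print(f"{weight_sync_seconds(7e9, 2, 400):.2f} s")
```

Multiplying the result by the number of syncs per training run gives a first estimate of how much wall-clock time the chosen coupling spends moving weights.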

Don't-miss checklist

Map the choice to the coupling you need: colocated (verl) for throughput, disaggregated (slime/SkyRL) for async, heterogeneous, or straggler-tolerant workloads.
Match rollout backend to the serving stack already in use (SGLang vs vLLM, inference serving).
Budget GPUs for both rollout and training; rollouts often dominate wall-clock time.
Reuse the platform's Ray/KubeRay + gang scheduler rather than a bespoke stack.
Monitor reward / entropy / KL for collapse regardless of library (observability).
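
For the monitoring item, the quantities involved are simple to state. The sketch below computes entropy and KL divergence over full next-token distributions and flags values outside a band. It is a toy under stated assumptions: real trainers usually estimate these from sampled log-probabilities, and the function `health_flags` and its thresholds `min_entropy` and `max_kl` are hypothetical placeholders to be tuned per run.

```python
import math


def entropy(probs: list[float]) -> float:
    """Shannon entropy in nats; zero-probability entries contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def kl_divergence(p: list[float], q: list[float]) -> float:
    """KL(p || q) in nats; infinite if q assigns zero mass where p does not."""
    total = 0.0
    for pi, qi in zip(p, q):
        if pi == 0.0:
            continue
        if qi == 0.0:
            return math.inf
        total += pi * math.log(pi / qi)
    return total


def health_flags(
    entropy_value: float,
    kl_value: float,
    min_entropy: float = 0.1,  # hypothetical floor
    max_kl: float = 0.5,  # hypothetical ceiling
) -> list[str]:
    """Return human-readable warnings for a single training step."""
    flags = []
    if entropy_value < min_entropy:
        flags.append("entropy below floor")
    if kl_value > max_kl:
        flags.append("KL above ceiling")
    return flags
```

Logging these per step alongside mean reward gives an early signal when the policy starts collapsing onto a few outputs or drifting far from the reference.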

Failure modes

Picking a colocated library then hitting memory limits that a disaggregated one would have avoided.
Rollout backend mismatched to the model's best engine (e.g. forcing vLLM where SGLang prefix caching wins).
Under-provisioned rollout pool → training GPUs idle (fine-tuning and post-training).
Treating any of these as turnkey: RL stability (entropy/KL) still needs active management.

Open questions & validation

Confirm each candidate library's current trainer/rollout backends and Ray dependency on its repo.
Validate the trainer↔rollout weight-sync path and bandwidth for the chosen coupling.
Benchmark a small GRPO run end-to-end before scaling; measure rollout vs train time split.
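
The rollout versus train split in the last item can be measured with a small accumulator wrapped around each phase. This is a minimal sketch using only the standard library; the `PhaseTimer` class and the phase names are hypothetical, and on GPUs you would need to synchronise the device before stopping the clock for the numbers to be meaningful.

```python
import time
from collections import defaultdict
from contextlib import contextmanager


class PhaseTimer:
    """Accumulates wall-clock seconds per named phase."""

    def __init__(self) -> None:
        self.totals: dict[str, float] = defaultdict(float)

    def add(self, phase: str, seconds: float) -> None:
        self.totals[phase] += seconds

    @contextmanager
    def phase(self, name: str):
        start = time.perf_counter()
        try:
            yield
        finally:
            # Record the elapsed time even if the phase raised.
            self.add(name, time.perf_counter() - start)

    def fractions(self) -> dict[str, float]:
        """Share of total recorded time spent in each phase."""
        total = sum(self.totals.values())
        if total == 0.0:
            return {}
        return {name: secs / total for name, secs in self.totals.items()}


timer = PhaseTimer()
with timer.phase("rollout"):
    time.sleep(0.01)  # placeholder for generation
with timer.phase("train"):
    time.sleep(0.01)  # placeholder for the policy update
print(timer.fractions())
```

If the rollout fraction dominates, that points toward a larger rollout pool or a decoupled setup before adding training GPUs.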

References

Anyscale — Open Source RL Libraries for LLMs: https://www.anyscale.com/blog/open-source-rl-libraries-for-llms
verl: https://github.com/volcengine/verl
slime (THUDM): https://github.com/THUDM/slime · docs: https://thudm.github.io/slime/
SkyRL (UC Berkeley): https://github.com/NovaSky-AI/SkyRL
OpenRLHF: https://github.com/OpenRLHF/OpenRLHF
NeMo-RL: https://github.com/NVIDIA-NeMo/RL
Tinker cookbook: https://github.com/thinking-machines-lab/tinker-cookbook
Ray (controller): https://docs.ray.io/en/latest/

Related: Inference · Optimization · OSS Models · Disaggregated · Fine-tuning · Orchestration · Glossary