RL libraries for LLMs (verl · slime · SkyRL · …)
Scope: a comparison/selection overview and index for the open-source RL post-training libraries: how the systems are structured, which inference and training backends they use, how they orchestrate rollouts versus training, and how to choose one. This page maps the field and routes you to the right library; the detailed WHAT/WHY/WHEN/HOW for each library lives on its own dedicated page (see Focused pages below). These libraries form the systems layer under the methods in fine-tuning and post-training, and are usually built on Ray (orchestration overview). Framing follows Anyscale's "Open Source RL Libraries for LLMs" survey (References).
The field moves monthly; treat the table as a map, not a spec. Verify a library's current backends and scale on its repo before committing.
Focused pages
Pick a library here, then open its page for the implementable detail (full WHAT/WHY/WHEN/HOW):
- verl: use this when you need the high-performance, colocated default for large-scale RL throughput.
- slime: use this when you want the Megatron + SGLang decoupled async stack (the one behind GLM), especially for MoE.
- SkyRL: use this when you want to switch between colocated and disaggregated, or run flexible/agentic workloads.
- OpenRLHF: use this when your focus is RLHF with reward models on DeepSpeed.
- NeMo-RL: use this when you are on the NVIDIA stack or need async agentic training.
- TRL: use this when you want single-node simplicity to get a small run going fast.
- Tinker: use this when you want post-training without operating any GPUs. It is a managed training API (LoRA-only) with an open recipe library on top.
For the library-by-library tradeoffs that drive this choice, see The landscape and Selection guidance below.
Paradigm: Generator + Trainer, colocated ↔ disaggregated
Every RL-for-LLM system decomposes into two components plus a controller:
- Generator (rollout) runs inference (vLLM/SGLang) to sample completions and interacts with the environment / reward. Single-turn or multi-turn/agentic.
- Trainer runs the policy update (PPO/GRPO/DPO, fine-tuning and post-training) on FSDP / DeepSpeed / Megatron.
- Controller coordinates the two, usually Ray (orchestration overview).
The defining design axis is how generator and trainer share GPUs:
- Colocated / tightly-coupled (e.g. verl): rollout and training share the same GPUs and swap or offload state between phases. This maximizes GPU utilization but gives up flexibility: both phases are tied to the same hardware and alternate rather than overlap.
- Disaggregated / decoupled (e.g. slime, SkyRL, AReaL): rollout and training run on separate pools, which enables async generation, straggler tolerance, and heterogeneous hardware. This is the same disaggregation idea as in inference serving (disaggregated inference).
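The generator/trainer/controller split above can be illustrated with a toy loop. This is a minimal sketch, not any library's API: `Generator`, `Trainer`, `run_controller`, and the `sync_every` parameter are hypothetical stand-ins, and the "training" only bumps a version counter. It shows one consequence of the coupling choice: syncing weights less often than every step means some rollouts come from stale weights.

```python
import random
from dataclasses import dataclass


@dataclass
class Rollout:
    prompt: str
    completion: str
    reward: float
    policy_version: int  # version of the weights that produced this sample


class Generator:
    """Hypothetical stand-in for a vLLM/SGLang rollout worker."""

    def __init__(self, seed: int = 0) -> None:
        self.policy_version = 0
        self._rng = random.Random(seed)

    def sample(self, prompts: list[str]) -> list[Rollout]:
        # A real generator would run inference and score against an environment.
        return [
            Rollout(p, f"completion-for-{p}", self._rng.random(), self.policy_version)
            for p in prompts
        ]

    def load_weights(self, version: int) -> None:
        self.policy_version = version


class Trainer:
    """Hypothetical stand-in for an FSDP/DeepSpeed/Megatron trainer."""

    def __init__(self) -> None:
        self.version = 0

    def update(self, batch: list[Rollout]) -> int:
        # A real trainer would compute a PPO/GRPO loss and step the optimizer.
        self.version += 1
        return self.version


def run_controller(prompts: list[str], steps: int, sync_every: int = 1) -> list[int]:
    """Drive the loop and return the weight staleness seen at each training step.

    sync_every=1 mimics syncing weights after every step;
    sync_every>1 mimics a rollout pool that receives weights less often.
    """
    gen, trainer = Generator(), Trainer()
    staleness = []
    for step in range(1, steps + 1):
        batch = gen.sample(prompts)
        staleness.append(trainer.version - batch[0].policy_version)
        new_version = trainer.update(batch)
        if step % sync_every == 0:
            gen.load_weights(new_version)
    return staleness
```

With `sync_every=1` every batch is on-policy (staleness 0); with `sync_every=2` every second batch is one version behind, which is the off-policy gap an async stack has to tolerate or correct for.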
Architecture: generator/trainer loop
flowchart LR
subgraph Gen["Generator (rollout)"]
ENG["Inference engine: vLLM / SGLang"]
ENV["Environment / reward"]
end
subgraph Train["Trainer"]
OPT["PPO / GRPO on FSDP / Megatron"]
end
Gen -->|"rollouts + rewards"| Train
Train -->|"updated weights"| Gen
CTRL["Controller: Ray"] -.-> Gen
CTRL -.-> Train
Architecture: colocated vs disaggregated
flowchart TB
subgraph Colo["Colocated (verl): share GPUs"]
G1["Rollout"] <-->|"offload / swap"| T1["Train"]
end
subgraph Disagg["Disaggregated (slime, SkyRL): separate pools"]
G2["Rollout pool (SGLang)"] -->|"async rollouts"| T2["Train pool (Megatron)"]
T2 -->|"weight sync"| G2
end
The landscape (mid-2026)
| Library | Origin | Trainer | Rollout | Orchestration | Coupling | Best for |
|---|---|---|---|---|---|---|
| verl | ByteDance | FSDP/FSDP2/Megatron | vLLM/SGLang | Ray | colocated default; async/agentic modes | large-scale, performance |
| slime | THUDM / Z.ai | Megatron only | SGLang | Ray | decoupled, async-first | high-perf MoE; powers GLM |
| SkyRL | UC Berkeley | FSDP2/Megatron/JAX | vLLM/OpenAI | Ray | colocated or disaggregated | flexible, agentic |
| OpenRLHF | community | DeepSpeed | vLLM | Ray | async + colocation | RLHF, reward models |
| NeMo-RL | NVIDIA | FSDP2/Megatron | vLLM/SGLang | Ray | async | agentic, NVIDIA stack |
| ROLL | Alibaba | FSDP2/Megatron | vLLM/SGLang | Ray | flexible | multi-purpose |
| AReaL | Ant Group | FSDP2/Megatron | vLLM/SGLang | optional Ray | async, interruptible | long rollouts, stragglers |
| TRL | Hugging Face | HF Trainer | vLLM/HF | none | colocated | single-node, simplicity |
| Tinker | Thinking Machines | managed service (LoRA) | service sampling API | none (client loop) | training-as-a-service | post-training without a cluster |
| Verifiers | Prime Intellect | own + prime-rl | vLLM/OpenAI | none | — | multi-turn env, research |
| RAGEN | community | on verl (FSDP/Megatron) | vLLM/SGLang | Ray | on verl | multi-turn agentic |
How they relate to the rest of this KB
- slime is the RL stack behind the GLM models (serving open-weight models): Megatron training + SGLang rollouts, decoupled and async; opinionated and minimal.
- verl is the high-performance default referenced in fine-tuning and post-training; colocated by default for throughput, with async/agentic modes added.
- All of these juggle a rollout engine and a trainer as separate process groups, which is why Ray (orchestration overview) is the common controller.
Selection guidance
Choose by the row that matches your constraint, then open that library's page:
- Maximum performance / scale → verl or slime.
- Flexibility / research / agentic → SkyRL or Verifiers.
- Multi-turn / environment-heavy → RAGEN, NeMo-RL, or SkyRL.
- Simplicity / single-node → TRL (fine-tuning and post-training).
- RLHF with reward models → OpenRLHF.
- No cluster at all / managed service → Tinker.
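The guidance above can be encoded as a small lookup so that several constraints can be combined. This is only an illustrative sketch: the constraint keys and the `shortlist` helper are hypothetical names, and the mapping simply mirrors the bullets in this section.

```python
# Constraint -> candidate libraries, copied from the selection bullets.
SELECTION = {
    "max_performance": ["verl", "slime"],
    "flexible_research_agentic": ["SkyRL", "Verifiers"],
    "multi_turn_env": ["RAGEN", "NeMo-RL", "SkyRL"],
    "single_node_simple": ["TRL"],
    "rlhf_reward_models": ["OpenRLHF"],
    "managed_no_cluster": ["Tinker"],
}


def shortlist(constraints: list[str]) -> list[str]:
    """Rank libraries by how many of the given constraints they match."""
    counts: dict[str, int] = {}
    for constraint in constraints:
        if constraint not in SELECTION:
            raise KeyError(f"unknown constraint: {constraint}")
        for lib in SELECTION[constraint]:
            counts[lib] = counts.get(lib, 0) + 1
    # Most matches first; ties broken alphabetically for a stable order.
    return sorted(counts, key=lambda lib: (-counts[lib], lib))
```

For example, `shortlist(["flexible_research_agentic", "multi_turn_env"])` puts SkyRL first because it is the only library listed under both constraints.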
Hardware & networking notes
- Weight sync from trainer to rollout happens every step (colocated) or every few steps (disaggregated). It wants NVLink when colocated, or fast InfiniBand/RoCE with GPUDirect RDMA (GDR) when disaggregated: the same fabric concerns as in performance tuning and disaggregated inference.
- Colocated trades GPUs between phases via offload, so memory pressure is the constraint; disaggregated keeps both pools hot but needs the interconnect to move weights and rollouts.
- Run any of these on Ray via KubeRay on the GPU platform (orchestration overview, the Kubernetes platform); expose RDMA into the Ray workers.
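To get a feel for why the fabric matters for weight sync, a lower-bound estimate is payload size divided by effective bandwidth. The sketch below is a rough calculation under stated assumptions: the function name, the `efficiency` factor, and the example numbers (a 70B-parameter model in 2-byte precision over a single 50 GB/s link) are illustrative and not measurements of any library. It ignores sharding, parallel links, and protocol overhead.

```python
def weight_sync_seconds(
    num_params: float,
    bytes_per_param: int,
    link_gbytes_per_s: float,
    efficiency: float = 1.0,
) -> float:
    """Lower-bound time to ship one full copy of the weights over one link."""
    if link_gbytes_per_s <= 0 or not 0 < efficiency <= 1:
        raise ValueError("bandwidth must be positive and efficiency in (0, 1]")
    payload_gbytes = num_params * bytes_per_param / 1e9
    return payload_gbytes / (link_gbytes_per_s * efficiency)


# Hypothetical example: 70e9 params * 2 bytes = 140 GB over a 50 GB/s link.
ideal = weight_sync_seconds(70e9, 2, 50.0)
# Same transfer if only 70% of the link bandwidth is achieved.
derated = weight_sync_seconds(70e9, 2, 50.0, efficiency=0.7)
```

Under these assumptions the ideal transfer takes about 2.8 s and the derated one about 4 s per sync, which is the number to compare against step time when deciding how often to sync.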
Don't-miss checklist
- Map the choice to the coupling you need: colocated (verl) for throughput, disaggregated (slime/SkyRL) for async/heterogeneous/straggler-tolerant.
- Match rollout backend to the serving stack already in use (SGLang vs vLLM, inference serving).
- Budget GPUs for both rollout and training; rollouts often dominate wall-clock time.
- Reuse the platform's Ray/KubeRay + gang scheduler rather than a bespoke stack.
- Monitor reward / entropy / KL for collapse regardless of library (observability).
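The entropy/KL monitoring item can be made concrete with a few library-independent helpers. This is a minimal sketch: the function names and the alert thresholds are hypothetical, and the KL value is the simple sample-mean estimator (mean of policy log-prob minus reference log-prob over tokens sampled from the policy), not any particular library's implementation.

```python
import math


def token_entropy(probs: list[float]) -> float:
    """Shannon entropy in nats of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def mean_kl_estimate(policy_logprobs: list[float], ref_logprobs: list[float]) -> float:
    """Estimate KL(policy || reference) from log-probs of policy-sampled tokens."""
    if not policy_logprobs or len(policy_logprobs) != len(ref_logprobs):
        raise ValueError("need two non-empty log-prob lists of equal length")
    diffs = [lp - lr for lp, lr in zip(policy_logprobs, ref_logprobs)]
    return sum(diffs) / len(diffs)


def health_warnings(
    entropy: float,
    kl: float,
    min_entropy: float = 0.1,  # hypothetical floor; tune per model and task
    max_kl: float = 0.5,       # hypothetical ceiling; tune per model and task
) -> list[str]:
    """Return human-readable warnings when a metric crosses its threshold."""
    warnings = []
    if entropy < min_entropy:
        warnings.append("entropy below floor: possible collapse")
    if kl > max_kl:
        warnings.append("KL above ceiling: policy drifting from reference")
    return warnings
```

In practice these values would be averaged over a batch and logged every step alongside reward, so that a trend toward low entropy or high KL is visible before the run degrades.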
Failure modes
- Picking a colocated library then hitting memory limits that a disaggregated one would have avoided.
- Rollout backend mismatched to the model's best engine (e.g. forcing vLLM where SGLang prefix caching wins).
- Under-provisioned rollout pool → training GPUs idle (fine-tuning and post-training).
- Treating any of these as turnkey: RL stability (entropy/KL) still needs active management.
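The under-provisioned rollout pool case can be estimated before launch. This is a back-of-the-envelope sketch assuming a fully overlapped disaggregated pipeline and a constant per-GPU sampling rate; the function names and the example numbers are hypothetical.

```python
import math


def trainer_idle_fraction(
    samples_per_step: int,
    rollout_gpus: int,
    samples_per_gpu_per_s: float,
    train_seconds: float,
) -> float:
    """Fraction of each step the training pool waits for rollouts."""
    if min(samples_per_step, rollout_gpus, samples_per_gpu_per_s, train_seconds) <= 0:
        raise ValueError("all inputs must be positive")
    rollout_seconds = samples_per_step / (rollout_gpus * samples_per_gpu_per_s)
    # With full overlap, the slower phase sets the step time.
    step_seconds = max(rollout_seconds, train_seconds)
    return (step_seconds - train_seconds) / step_seconds


def min_rollout_gpus(
    samples_per_step: int,
    samples_per_gpu_per_s: float,
    train_seconds: float,
) -> int:
    """Smallest rollout pool whose generation time does not exceed train time."""
    return math.ceil(samples_per_step / (samples_per_gpu_per_s * train_seconds))
```

For instance, with 1024 samples per step, 4 samples/s per rollout GPU, and a 16 s training step, 8 rollout GPUs leave the trainers idle half the time, while 16 rollout GPUs balance the two phases under this model.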
Open questions & validation
- Confirm each candidate library's current trainer/rollout backends and Ray dependency on its repo.
- Validate the trainer↔rollout weight-sync path and bandwidth for the chosen coupling.
- Benchmark a small GRPO run end-to-end before scaling; measure rollout vs train time split.
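Measuring the rollout versus train time split needs nothing library-specific. A minimal sketch using only the standard library follows; `PhaseTimer` is a hypothetical helper, and the `time.sleep` calls stand in for the real rollout and training phases.

```python
import time
from collections import defaultdict
from contextlib import contextmanager


class PhaseTimer:
    """Accumulate wall-clock time per named phase across many steps."""

    def __init__(self) -> None:
        self.totals = defaultdict(float)

    @contextmanager
    def phase(self, name: str):
        start = time.perf_counter()
        try:
            yield
        finally:
            self.totals[name] += time.perf_counter() - start

    def split(self) -> dict[str, float]:
        """Return each phase's share of the total measured time."""
        total = sum(self.totals.values())
        if total == 0:
            return {}
        return {name: seconds / total for name, seconds in self.totals.items()}


timer = PhaseTimer()
for _ in range(2):
    with timer.phase("rollout"):
        time.sleep(0.05)  # placeholder for generation + reward scoring
    with timer.phase("train"):
        time.sleep(0.01)  # placeholder for the policy update
shares = timer.split()
```

Wrapping the real phases this way on a small run gives the ratio needed to size the rollout and training pools before scaling up.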
References
- Anyscale, "Open Source RL Libraries for LLMs": https://www.anyscale.com/blog/open-source-rl-libraries-for-llms
- verl: https://github.com/volcengine/verl
- slime (THUDM): https://github.com/THUDM/slime · docs: https://thudm.github.io/slime/
- SkyRL (UC Berkeley): https://github.com/NovaSky-AI/SkyRL
- OpenRLHF: https://github.com/OpenRLHF/OpenRLHF
- NeMo-RL: https://github.com/NVIDIA-NeMo/RL
- Tinker cookbook: https://github.com/thinking-machines-lab/tinker-cookbook
- Ray (controller): https://docs.ray.io/en/latest/
Related: Inference · Optimization · OSS Models · Disaggregated · Fine-tuning · Orchestration · Glossary