RL libraries for LLMs (verl · slime · SkyRL · …)

Scope: comparison/selection overview and index for the open-source RL post-training libraries: how the systems are structured, which inference and training backends they use, how they orchestrate rollouts vs training, and how to choose one. This page maps the field and routes you to the right library; the detailed WHAT/WHY/WHEN/HOW for each library lives on its own dedicated page (see Focused pages below). These libraries are the systems layer under the methods in fine-tuning and post-training, and they are usually built on Ray (orchestration overview). Framing follows Anyscale's "Open Source RL Libraries for LLMs" survey (References).

The field moves monthly; treat the table as a map, not a spec. Verify a library's current backends and scale on its repo before committing.

Focused pages

Pick a library here, then open its page for the implementable detail (full WHAT/WHY/WHEN/HOW):

  • verl: use this when you need the high-performance, colocated default for large-scale RL throughput.
  • slime: use this when you want the Megatron + SGLang decoupled async stack (the one behind GLM), especially for MoE.
  • SkyRL: use this when you want to switch between colocated and disaggregated, or run flexible/agentic workloads.
  • OpenRLHF: use this when your focus is RLHF with reward models on DeepSpeed.
  • NeMo-RL: use this when you are on the NVIDIA stack or need async agentic training.
  • TRL: use this when you want single-node simplicity to get a small run going fast.
  • Tinker: use this when you want post-training without operating any GPUs; a managed training API (LoRA-only) with an open recipe library on top.

For the library-by-library tradeoffs that drive this choice, see The landscape and Selection guidance below.

Paradigm: Generator + Trainer, colocated ↔ disaggregated

Every RL-for-LLM system decomposes into two components plus a controller:

  • Generator (rollout) runs inference (vLLM/SGLang) to sample completions and interacts with the environment / reward. Single-turn or multi-turn/agentic.
  • Trainer runs the policy update (PPO/GRPO/DPO, fine-tuning and post-training) on FSDP / DeepSpeed / Megatron.
  • Controller coordinates the two, usually Ray (orchestration overview).
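
The three roles above can be sketched as a toy loop. This is a minimal, framework-free illustration rather than any library's API: `Generator`, `Trainer`, `Rollout`, and `run_controller` are hypothetical names, the reward is random, and the "weights" are reduced to a version counter so the rollout → update → weight-sync cycle is visible.

```python
import random
from dataclasses import dataclass
from typing import List


@dataclass
class Rollout:
    prompt: str
    completion: str
    reward: float
    policy_version: int  # weights version that produced this sample


class Generator:
    """Stand-in for an inference engine (vLLM/SGLang) plus environment/reward."""

    def __init__(self, seed: int = 0) -> None:
        self.policy_version = 0
        self._rng = random.Random(seed)

    def load_weights(self, version: int) -> None:
        # A real system transfers tensors here; this sketch tracks only the version.
        self.policy_version = version

    def sample(self, prompts: List[str]) -> List[Rollout]:
        return [
            Rollout(p, f"completion-for-{p}", self._rng.random(), self.policy_version)
            for p in prompts
        ]


class Trainer:
    """Stand-in for the policy update (PPO/GRPO) on FSDP/Megatron."""

    def __init__(self) -> None:
        self.version = 0

    def update(self, batch: List[Rollout]) -> float:
        mean_reward = sum(r.reward for r in batch) / len(batch)
        self.version += 1  # one optimizer step yields a new weights version
        return mean_reward


def run_controller(prompts: List[str], steps: int) -> List[float]:
    """Controller: alternate rollout -> train -> weight sync, one pass per step."""
    generator, trainer = Generator(), Trainer()
    history = []
    for _ in range(steps):
        batch = generator.sample(prompts)        # generator phase
        history.append(trainer.update(batch))    # trainer phase
        generator.load_weights(trainer.version)  # push updated weights back
    return history


rewards = run_controller(["p1", "p2", "p3"], steps=4)
```

In the real libraries the controller role is usually played by Ray, with the generator and trainer running as separate process groups; the synchronous loop here corresponds to the simplest on-policy schedule.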

The defining design axis is how generator and trainer share GPUs:

  • Colocated / tightly-coupled (e.g. verl): rollout and training share the same GPUs, swapping/offloading between phases. Maximum efficiency, less flexibility.
  • Disaggregated / decoupled (e.g. slime, SkyRL, AReaL): separate rollout and training pools, enabling async generation, straggler tolerance, and heterogeneous hardware. This is the same disaggregation idea used in inference serving (disaggregated inference).
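
To make the tradeoff concrete, here is a back-of-the-envelope timing model. It is a sketch under strong assumptions that are not from any library: work scales linearly with GPU count, the disaggregated pools overlap perfectly (fully async), and the swap/sync costs are fixed. The function names and all numbers are illustrative.

```python
def colocated_step_s(rollout_gpu_s: float, train_gpu_s: float,
                     n_gpus: int, swap_s: float) -> float:
    """All GPUs run rollout, then all run training; swap_s covers the phase switches."""
    return (rollout_gpu_s + train_gpu_s) / n_gpus + swap_s


def disaggregated_step_s(rollout_gpu_s: float, train_gpu_s: float,
                         n_rollout: int, n_train: int, sync_s: float) -> float:
    """Pools run concurrently, so the slower pool sets the pace; sync_s is weight sync."""
    return max(rollout_gpu_s / n_rollout, train_gpu_s / n_train) + sync_s


# Illustrative workload: 600 GPU-seconds of rollout and 200 of training per step, 8 GPUs.
print(colocated_step_s(600, 200, n_gpus=8, swap_s=10))                     # 110.0
print(disaggregated_step_s(600, 200, n_rollout=6, n_train=2, sync_s=5))    # 105.0
print(disaggregated_step_s(600, 200, n_rollout=4, n_train=4, sync_s=5))    # 155.0
```

Under these assumptions a well-balanced split roughly ties with colocation, and a poorly balanced split loses badly. The practical case for disaggregation therefore rests on what the model leaves out: stragglers, long-tail rollouts, memory pressure, and heterogeneous hardware.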

Architecture: generator/trainer loop

flowchart LR
  subgraph Gen["Generator (rollout)"]
    ENG["Inference engine: vLLM / SGLang"]
    ENV["Environment / reward"]
  end
  subgraph Train["Trainer"]
    OPT["PPO / GRPO on FSDP / Megatron"]
  end
  Gen -->|"rollouts + rewards"| Train
  Train -->|"updated weights"| Gen
  CTRL["Controller: Ray"] -.-> Gen
  CTRL -.-> Train

Architecture: colocated vs disaggregated

flowchart TB
  subgraph Colo["Colocated (verl): share GPUs"]
    G1["Rollout"] <-->|"offload / swap"| T1["Train"]
  end
  subgraph Disagg["Disaggregated (slime, SkyRL): separate pools"]
    G2["Rollout pool (SGLang)"] -->|"async rollouts"| T2["Train pool (Megatron)"]
    T2 -->|"weight sync"| G2
  end

The landscape (mid-2026)

| Library | Origin | Trainer | Rollout | Orchestration | Coupling | Best for |
|---|---|---|---|---|---|---|
| verl | ByteDance | FSDP/FSDP2/Megatron | vLLM/SGLang | Ray | colocated default; async/agentic modes | large-scale, performance |
| slime | THUDM / Z.ai | Megatron only | SGLang | Ray | decoupled, async-first | high-perf MoE; powers GLM |
| SkyRL | UC Berkeley | FSDP2/Megatron/JAX | vLLM/OpenAI | Ray | colocated or disaggregated | flexible, agentic |
| OpenRLHF | community | DeepSpeed | vLLM | Ray | async + colocation | RLHF, reward models |
| NeMo-RL | NVIDIA | FSDP2/Megatron | vLLM/SGLang | Ray | async | agentic, NVIDIA stack |
| ROLL | Alibaba | FSDP2/Megatron | vLLM/SGLang | Ray | flexible | multi-purpose |
| AReaL | Ant Group | FSDP2/Megatron | vLLM/SGLang | optional Ray | async, interruptible | long rollouts, stragglers |
| TRL | Hugging Face | HF Trainer | vLLM/HF | none | colocated | single-node, simplicity |
| Tinker | Thinking Machines | managed service (LoRA) | service sampling API | none (client loop) | training-as-a-service | post-training without a cluster |
| Verifiers | Prime Intellect | own + prime-rl | vLLM/OpenAI | none | multi-turn env | research |
| RAGEN | community | on verl (FSDP/Megatron) | vLLM/SGLang | Ray | on verl | multi-turn agentic |
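
One way to use the table is as a filter over hard constraints. The sketch below encodes only the Trainer, Rollout, and Orchestration columns as plain data; `LIBRARIES` and `shortlist` are hypothetical helpers, Tinker is left out because it is a managed service rather than a backend choice, and the values are a snapshot of this page that should be re-checked against each repo.

```python
from typing import List, Optional

# Snapshot of the landscape table above (Trainer / Rollout / Orchestration columns only).
LIBRARIES = {
    "verl":      {"trainer": {"FSDP", "FSDP2", "Megatron"}, "rollout": {"vLLM", "SGLang"}, "orchestration": "Ray"},
    "slime":     {"trainer": {"Megatron"},                  "rollout": {"SGLang"},         "orchestration": "Ray"},
    "SkyRL":     {"trainer": {"FSDP2", "Megatron", "JAX"},  "rollout": {"vLLM", "OpenAI"}, "orchestration": "Ray"},
    "OpenRLHF":  {"trainer": {"DeepSpeed"},                 "rollout": {"vLLM"},           "orchestration": "Ray"},
    "NeMo-RL":   {"trainer": {"FSDP2", "Megatron"},         "rollout": {"vLLM", "SGLang"}, "orchestration": "Ray"},
    "ROLL":      {"trainer": {"FSDP2", "Megatron"},         "rollout": {"vLLM", "SGLang"}, "orchestration": "Ray"},
    "AReaL":     {"trainer": {"FSDP2", "Megatron"},         "rollout": {"vLLM", "SGLang"}, "orchestration": "optional Ray"},
    "TRL":       {"trainer": {"HF Trainer"},                "rollout": {"vLLM", "HF"},     "orchestration": "none"},
    "Verifiers": {"trainer": {"own", "prime-rl"},           "rollout": {"vLLM", "OpenAI"}, "orchestration": "none"},
    "RAGEN":     {"trainer": {"FSDP", "Megatron"},          "rollout": {"vLLM", "SGLang"}, "orchestration": "Ray"},
}


def shortlist(trainer: Optional[str] = None,
              rollout: Optional[str] = None,
              orchestration: Optional[str] = None) -> List[str]:
    """Return libraries matching every constraint given; None means 'no constraint'."""
    return [
        name for name, row in LIBRARIES.items()
        if (trainer is None or trainer in row["trainer"])
        and (rollout is None or rollout in row["rollout"])
        and (orchestration is None or row["orchestration"] == orchestration)
    ]


print(shortlist(trainer="Megatron", rollout="SGLang"))
# ['verl', 'slime', 'NeMo-RL', 'ROLL', 'AReaL', 'RAGEN']
print(shortlist(rollout="vLLM", orchestration="none"))
# ['TRL', 'Verifiers']
```

A filter like this only narrows the candidates; the Coupling and Best for columns, and each library's own page, still decide between the survivors.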

How they relate to the rest of this KB

  • slime is the RL stack behind the GLM models (serving open-weight models): Megatron training + SGLang rollouts, decoupled and async; opinionated and minimal.
  • verl is the high-performance default referenced in fine-tuning and post-training; colocated by default for throughput, with async/agentic modes added.
  • Most of these juggle a rollout engine and a trainer as separate process groups, which is why Ray (orchestration overview) is the common controller; TRL, Tinker, and Verifiers are the exceptions in the table above.

Selection guidance

Choose by the row that matches your constraint in The landscape table (the Best for column) or the matching entry under Focused pages, then open that library's page.

Hardware & networking notes

  • Weight sync from trainer to rollout happens every step (colocated) or every few steps (disaggregated). It wants NVLink (colocated) or fast IB/RoCE with GPUDirect RDMA (GDR) (disaggregated): the same fabric concerns as performance tuning/disaggregated inference.
  • Colocated trades GPUs between phases via offload, so memory pressure is the constraint; disaggregated keeps both pools hot but needs the interconnect to move weights and rollouts.
  • Run any of these on Ray via KubeRay on the GPU platform (orchestration overview, the Kubernetes platform); expose RDMA into the Ray workers.
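
A rough lower bound on weight-sync time helps decide whether the fabric is adequate. The sketch below is plain arithmetic under stated assumptions: the full parameter set is sent once over a single link, and `efficiency` is a guessed fraction of line rate. The function name and the example figures are illustrative, not measurements.

```python
def weight_sync_seconds(n_params: float, bytes_per_param: int,
                        link_gbit_s: float, efficiency: float = 0.8) -> float:
    """Time to move one full copy of the weights over one link.

    efficiency is an assumed achievable fraction of line rate, not a measured value.
    """
    payload_bytes = n_params * bytes_per_param
    link_bytes_per_s = link_gbit_s * 1e9 / 8 * efficiency
    return payload_bytes / link_bytes_per_s


# Illustrative: 70B parameters in bf16 (2 bytes each) over one 400 Gbit/s link.
sync_s = weight_sync_seconds(70e9, 2, 400)
print(f"{sync_s:.1f} s")  # roughly 3.5 s under these assumptions
```

Real deployments shard the transfer across many NICs and ranks, so the per-link payload is smaller. The number is still useful as a sanity check against the step time: a sync that costs a few seconds matters little on a multi-minute step and a lot on a ten-second one.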

Don't-miss checklist

  • Map the choice to the coupling you need: colocated (verl) for throughput, disaggregated (slime/SkyRL) for async/heterogeneous/straggler-tolerant.
  • Match rollout backend to the serving stack already in use (SGLang vs vLLM, inference serving).
  • Budget GPUs for both rollout and training; rollouts often dominate wall-clock time.
  • Reuse the platform's Ray/KubeRay + gang scheduler rather than a bespoke stack.
  • Monitor reward / entropy / KL for collapse regardless of library (observability).
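
The monitoring item can be sketched with two quantities computed per token distribution: policy entropy, and KL divergence from a reference policy. This is a minimal stdlib illustration; `check_collapse` and its thresholds are hypothetical and would need tuning per model and task, and production code would compute these over batched logits rather than Python lists.

```python
import math
from typing import List, Sequence


def entropy(probs: Sequence[float]) -> float:
    """Shannon entropy in nats; terms with zero probability contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def kl_divergence(p: Sequence[float], q: Sequence[float]) -> float:
    """KL(p || q) in nats; assumes q is nonzero wherever p is nonzero."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def check_collapse(policy: Sequence[float], reference: Sequence[float],
                   min_entropy: float = 0.5, max_kl: float = 1.0) -> List[str]:
    """Return warnings when the policy looks collapsed or has drifted far from the reference."""
    warnings = []
    if entropy(policy) < min_entropy:
        warnings.append("entropy below threshold: policy may be collapsing")
    if kl_divergence(policy, reference) > max_kl:
        warnings.append("KL above threshold: policy has drifted from the reference")
    return warnings


reference = [0.25, 0.25, 0.25, 0.25]
peaked = [0.97, 0.01, 0.01, 0.01]
print(check_collapse(reference, reference))  # []
print(len(check_collapse(peaked, reference)))  # 2
```

Tracking these alongside mean reward makes a rising reward with collapsing entropy visible early, which is the pattern the checklist item warns about.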

Failure modes

  • Picking a colocated library then hitting memory limits that a disaggregated one would have avoided.
  • Rollout backend mismatched to the model's best engine (e.g. forcing vLLM where SGLang prefix caching wins).
  • Under-provisioned rollout pool → training GPUs idle (fine-tuning and post-training).
  • Treating any of these as turnkey: RL stability (entropy/KL) still needs active management.
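
The under-provisioned rollout pool failure can be estimated before launch. The sketch below assumes a disaggregated, fully overlapped setup where the trainer consumes one batch per step and waits whenever the next batch is not ready. The function names, the per-GPU rollout rate, and the example numbers are hypothetical.

```python
import math


def rollout_gpus_needed(samples_per_step: int, samples_per_gpu_s: float,
                        train_step_s: float) -> int:
    """Smallest rollout pool that produces one batch within one training step."""
    return math.ceil(samples_per_step / (samples_per_gpu_s * train_step_s))


def trainer_idle_fraction(samples_per_step: int, samples_per_gpu_s: float,
                          n_rollout_gpus: int, train_step_s: float) -> float:
    """Fraction of each cycle the trainer spends waiting for rollouts."""
    rollout_s = samples_per_step / (samples_per_gpu_s * n_rollout_gpus)
    return max(0.0, 1.0 - train_step_s / rollout_s)


# Illustrative: 1024 samples per step, 0.5 samples/s per rollout GPU, 120 s train step.
print(rollout_gpus_needed(1024, 0.5, 120))        # 18
print(trainer_idle_fraction(1024, 0.5, 8, 120))   # 0.53125
print(trainer_idle_fraction(1024, 0.5, 18, 120))  # 0.0
```

The per-GPU rollout rate is the input to measure rather than guess, since it depends heavily on sequence length and on how long the slowest rollouts in a batch run.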

Open questions & validation

  • Confirm each candidate library's current trainer/rollout backends and Ray dependency on its repo.
  • Validate the trainer↔rollout weight-sync path and bandwidth for the chosen coupling.
  • Benchmark a small GRPO run end-to-end before scaling; measure rollout vs train time split.
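
For the rollout vs train time split, a small phase timer wrapped around the benchmark loop is usually enough. This is a stdlib-only sketch; `PhaseTimer` is a hypothetical helper, and the `time.sleep` calls stand in for the real rollout and training calls of whichever library is being evaluated.

```python
import time
from collections import defaultdict
from contextlib import contextmanager
from typing import Dict


class PhaseTimer:
    """Accumulates wall-clock seconds per named phase across many steps."""

    def __init__(self) -> None:
        self.totals: Dict[str, float] = defaultdict(float)

    @contextmanager
    def phase(self, name: str):
        start = time.perf_counter()
        try:
            yield
        finally:
            # Record the elapsed time even if the phase raises.
            self.totals[name] += time.perf_counter() - start

    def split(self) -> Dict[str, float]:
        """Return each phase's share of total measured time (empty if nothing was timed)."""
        total = sum(self.totals.values())
        return {k: v / total for k, v in self.totals.items()} if total else {}


timer = PhaseTimer()
for _ in range(3):
    with timer.phase("rollout"):
        time.sleep(0.02)   # placeholder for generation + reward
    with timer.phase("train"):
        time.sleep(0.005)  # placeholder for the policy update
    with timer.phase("weight_sync"):
        time.sleep(0.001)  # placeholder for trainer -> rollout sync
print(timer.split())
```

On GPU workloads the timed calls need to block until the work has finished, otherwise asynchronous kernel launches make a phase look cheaper than it is.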

References

  • Anyscale — Open Source RL Libraries for LLMs: https://www.anyscale.com/blog/open-source-rl-libraries-for-llms
  • verl: https://github.com/volcengine/verl
  • slime (THUDM): https://github.com/THUDM/slime · docs: https://thudm.github.io/slime/
  • SkyRL (UC Berkeley): https://github.com/NovaSky-AI/SkyRL
  • OpenRLHF: https://github.com/OpenRLHF/OpenRLHF
  • NeMo-RL: https://github.com/NVIDIA-NeMo/RL
  • Tinker cookbook: https://github.com/thinking-machines-lab/tinker-cookbook
  • Ray (controller): https://docs.ray.io/en/latest/

Related: Inference · Optimization · OSS Models · Disaggregated · Fine-tuning · Orchestration · Glossary