verl (Volcano Engine RL)¶
Scope: ByteDance's flexible, high-performance RL post-training library for LLMs, built on the HybridFlow programming model.
Reference templates on real APIs; pin versions and validate before production use.
What it is¶
verl is the open-source implementation of HybridFlow (EuroSys 2025), a hybrid-controller programming model for RL post-training. A single-controller process expresses the RL dataflow (rollout, reward, advantage, update) in a few lines, while multi-controller worker groups execute the heavy compute. It plugs existing training backends (PyTorch FSDP/FSDP2, Megatron-LM) and inference engines (vLLM, SGLang, HF Transformers) into one loop, orchestrated by Ray. It ships PPO, GRPO (GRPO), GSPO, DAPO, RLOO, REINFORCE++, ReMax, PRIME and more. See the landscape page RL libraries for how it compares.
Why use it¶
- Performance at scale. Colocated actor/rollout/reference share GPUs; the 3D-HybridEngine reshards the actor between train and generate layouts with minimal memory redundancy and no idle weights.
- Mature ecosystem. The most battle-tested OSS RL stack (mid-2026): broad model coverage, many recipes, active fork tree (ROCm, Ascend, agent extensions).
- Algorithm breadth with little code. Swap PPO↔GRPO↔DAPO via config; the HybridFlow controller keeps the dataflow declarative.
When to use it (and when not)¶
- Use it for large-scale, throughput-bound RL where colocation maximises GPU utilisation, and when you want a proven path with many reference configs.
- Prefer slime for opinionated, decoupled async MoE RL (the GLM stack), or SkyRL when you need first-class disaggregated/async or heavily agentic multi-turn loops. verl's colocation is its strength and its constraint: memory pressure during phase swaps is the usual limiter.
Architecture¶
flowchart TB
CTRL["Single controller (HybridFlow)"]
subgraph Colo["Colocated GPU pool"]
ACT["Actor (FSDP / Megatron)"]
ROLL["Rollout (vLLM / SGLang)"]
REF["Reference / reward"]
end
CTRL -->|"generate"| ROLL
ROLL -->|"trajectories"| CTRL
CTRL -->|"advantage + update"| ACT
ACT ==>|"weight resync (3D-HybridEngine)"| ROLL
REF -.->|"KL / reward"| CTRL
RAY["Ray orchestration"] -.-> Colo
How to use it¶
pip install verl # or build from source: pip install -e .
# (the NVIDIA/ROCm docker image is recommended for matched CUDA + vLLM/SGLang)
# Prepare a dataset (parquet), then run a PPO job on GSM8K:
PYTHONUNBUFFERED=1 python3 -m verl.trainer.main_ppo \
data.train_files=$HOME/data/gsm8k/train.parquet \
data.val_files=$HOME/data/gsm8k/test.parquet \
actor_rollout_ref.model.path=Qwen/Qwen2.5-0.5B-Instruct \
actor_rollout_ref.rollout.name=vllm \
trainer.n_gpus_per_node=1 trainer.nnodes=1 \
trainer.total_epochs=15
Config is Hydra: every key is overridable on the CLI. main_ppo is the shared entrypoint for PPO and its critic-free variants.
How to develop with it¶
GRPO is PPO without a critic, selected by the advantage estimator. Group sampling needs rollout.n > 1:
python3 -m verl.trainer.main_ppo \
algorithm.adv_estimator=grpo \
data.train_batch_size=1024 \
actor_rollout_ref.rollout.name=vllm \
actor_rollout_ref.rollout.n=8 \
actor_rollout_ref.actor.use_kl_loss=True \
algorithm.kl_ctrl.kl_coef=0.001
Custom reward: point custom_reward_function.path at a Python file exposing a scoring function (data_source, solution_str, ground_truth, extra_info) -> float. Datasets are parquet with a prompt column plus a reward_model field carrying the ground truth. See examples/grpo_trainer/ for ready scripts (e.g. run_qwen3_8b_fsdp.sh, run_qwen3_8b_megatron.sh).
How to scale it¶
Multi-node is Ray plus more workers. Start a head node, attach workers, then submit the same main_ppo command with trainer.nnodes raised:
ray start --head --port=6379 --num-gpus=8 # head
ray start --address=<HEAD_IP>:6379 --num-gpus=8 # each worker
python3 -m verl.trainer.main_ppo \
trainer.nnodes=4 trainer.n_gpus_per_node=8 \
actor_rollout_ref.actor.strategy=fsdp2 \
actor_rollout_ref.actor.fsdp_config.param_offload=True
For very large models switch the strategy to megatron and set tensor/pipeline parallel sizes (tensor parallelism/pipeline parallelism). Colocation offloads optimiser/params to host between phases to free HBM for rollout.
Inference¶
verl does not serve traffic; the rollout step uses an inference engine (vLLM or SGLang) internally to sample completions. Choose it with actor_rollout_ref.rollout.name=vllm|sglang and tune rollout.gpu_memory_utilization, rollout.tensor_model_parallel_size, and prefix caching. For production serving of the trained model, export the checkpoint and serve via the standalone inference stack.
Fine-tuning¶
verl is a post-training (RL) library: GRPO/PPO/DAPO on top of an SFT or instruct base. It is the high-performance default referenced from the post-training overview; for the GRPO method itself see GRPO. Pair an SFT/LoRA warm-start (SFT and LoRA) with verl RL for the full recipe.
Optimised hardware¶
- Weight resync (actor → rollout each step) is the hot path; colocation keeps it on-device or over NVLink/NVSwitch, avoiding fabric round-trips.
- Phase offload: param/optimiser offload to host (
fsdp_config.param_offload,optimizer_offload) trades PCIe/CPU bandwidth for HBM headroom during rollout. - Megatron multi-node uses NCCL over InfiniBand/RoCE with GDR: set
NCCL_IB_HCA,NCCL_NET_GDR_LEVEL=SYS, and confirm[GDRDMA]inNCCL_DEBUG=INFO(networking fabric/performance tuning). Blackwell FP8/NVFP4 rollout precision can cut generation cost; verify reward parity against BF16 (the Blackwell platform).
Cookbook (common use cases)¶
1) GRPO on math (single node, 8 GPU):
python3 -m verl.trainer.main_ppo algorithm.adv_estimator=grpo \
data.train_files=$HOME/data/gsm8k/train.parquet \
actor_rollout_ref.model.path=Qwen/Qwen3-8B \
actor_rollout_ref.rollout.name=sglang actor_rollout_ref.rollout.n=8 \
trainer.n_gpus_per_node=8 trainer.nnodes=1
2) PPO with a critic (reward-model RLHF shape):
python3 -m verl.trainer.main_ppo algorithm.adv_estimator=gae \
critic.model.path=$RM_PATH critic.ppo_micro_batch_size_per_gpu=4 \
algorithm.kl_ctrl.kl_coef=0.001 actor_rollout_ref.rollout.name=vllm
3) Multi-node GRPO (4×8 on Ray): start the Ray head/workers as above, then submit (1) with trainer.nnodes=4 trainer.n_gpus_per_node=8 and actor_rollout_ref.actor.strategy=megatron for large models.
Gotchas & failure modes¶
- OOM during rollout is the signature colocation failure. Enable param/optimiser offload or lower
rollout.gpu_memory_utilization. - GRPO needs
rollout.n>1; n=1 silently degenerates (no group baseline). Watch reward/entropy/KL for collapse (GRPO). - Backend/version drift: vLLM, SGLang and Megatron move fast; mismatched CUDA/engine versions break the rollout, so pin via the maintained docker image and verify on the repo.
- Reward function bugs dominate outcomes; unit-test the scorer on held-out strings before a full run.
References¶
- verl repo: https://github.com/volcengine/verl
- verl docs (quickstart, GRPO): https://verl.readthedocs.io/en/latest/start/quickstart.html · https://verl.readthedocs.io/en/latest/algo/grpo.html
- HybridFlow paper (EuroSys 2025): https://arxiv.org/abs/2409.19256
- GRPO examples: https://github.com/volcengine/verl/tree/main/examples/grpo_trainer
- Anyscale — Open Source RL Libraries for LLMs: https://www.anyscale.com/blog/open-source-rl-libraries-for-llms
Related: RL libraries · Post-training · GRPO · Ray · slime · SkyRL · Glossary