
verl (Volcano Engine RL)

Scope: ByteDance's flexible, high-performance RL post-training library for LLMs, built on the HybridFlow programming model.

The commands below are reference templates built on real APIs; pin versions and validate them before production use.

What it is

verl is the open-source implementation of HybridFlow (EuroSys 2025), a hybrid-controller programming model for RL post-training. A single-controller process expresses the RL dataflow (rollout, reward, advantage, update) in a few lines, while multi-controller worker groups execute the heavy compute. It plugs existing training backends (PyTorch FSDP/FSDP2, Megatron-LM) and inference engines (vLLM, SGLang, HF Transformers) into one loop orchestrated by Ray. It ships PPO, GRPO, GSPO, DAPO, RLOO, REINFORCE++, ReMax, PRIME, and more. See the RL libraries landscape page for how it compares.

Why use it

Performance at scale. Colocated actor/rollout/reference share GPUs; the 3D-HybridEngine reshards the actor between its training and generation layouts with minimal memory redundancy, so no duplicate copy of the weights sits idle.
Mature ecosystem. The most battle-tested OSS RL stack as of mid-2026: broad model coverage, many recipes, and an active fork tree (ROCm, Ascend, agent extensions).
Algorithm breadth with little code. Swap PPO↔GRPO↔DAPO via config; the HybridFlow controller keeps the dataflow declarative.

When to use it (and when not)

Use it for large-scale, throughput-bound RL where colocation maximises GPU utilisation, and when you want a proven path with many reference configs.
Prefer slime for opinionated, decoupled async MoE RL (the GLM stack), or SkyRL when you need first-class disaggregated/async or heavily agentic multi-turn loops. verl's colocation is its strength and its constraint: memory pressure during phase swaps is the usual limiter.

Architecture

flowchart TB
  CTRL["Single controller (HybridFlow)"]
  subgraph Colo["Colocated GPU pool"]
    ACT["Actor (FSDP / Megatron)"]
    ROLL["Rollout (vLLM / SGLang)"]
    REF["Reference / reward"]
  end
  CTRL -->|"generate"| ROLL
  ROLL -->|"trajectories"| CTRL
  CTRL -->|"advantage + update"| ACT
  ACT ==>|"weight resync (3D-HybridEngine)"| ROLL
  REF -.->|"KL / reward"| CTRL
  RAY["Ray orchestration"] -.-> Colo
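
The dataflow in the diagram can be sketched as a toy single-controller loop. This is an illustration only: `ToyRollout`, `ToyActor`, and `controller_step` are hypothetical names, not verl classes, and the mean-baseline advantage stands in for whichever estimator is configured.

```python
from typing import Callable, List, Tuple


class ToyRollout:
    """Hypothetical stand-in for an inference-engine worker group."""

    def __init__(self) -> None:
        self.weights_version = 0

    def generate(self, prompts: List[str], n: int) -> List[List[str]]:
        # Fake sampling: n canned completions per prompt.
        return [[f"{p}::sample{i}" for i in range(n)] for p in prompts]

    def load_weights(self, version: int) -> None:
        # Stands in for the actor -> rollout weight resync.
        self.weights_version = version


class ToyActor:
    """Hypothetical stand-in for a training worker group."""

    def __init__(self) -> None:
        self.weights_version = 0

    def update(self, advantages: List[List[float]]) -> None:
        # A real actor would run a policy-gradient step here.
        self.weights_version += 1


def controller_step(
    prompts: List[str],
    actor: ToyActor,
    rollout: ToyRollout,
    reward_fn: Callable[[str], float],
    n: int = 4,
) -> Tuple[List[List[float]], List[List[float]]]:
    groups = rollout.generate(prompts, n)                   # 1. generate
    rewards = [[reward_fn(c) for c in g] for g in groups]   # 2. reward
    advantages = [                                          # 3. advantage
        [r - sum(rs) / len(rs) for r in rs] for rs in rewards
    ]
    actor.update(advantages)                                # 4. update
    rollout.load_weights(actor.weights_version)             # 5. resync
    return rewards, advantages
```

The controller only passes small Python objects around; in the real system each call would fan out to a worker group that holds the model shards.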

How to use it

pip install verl            # or build from source: pip install -e .
# (the NVIDIA/ROCm docker image is recommended for matched CUDA + vLLM/SGLang)

# Prepare a dataset (parquet), then run a PPO job on GSM8K.
# PPO also trains a critic, so point critic.model.path at a model explicitly:
PYTHONUNBUFFERED=1 python3 -m verl.trainer.main_ppo \
  data.train_files=$HOME/data/gsm8k/train.parquet \
  data.val_files=$HOME/data/gsm8k/test.parquet \
  actor_rollout_ref.model.path=Qwen/Qwen2.5-0.5B-Instruct \
  critic.model.path=Qwen/Qwen2.5-0.5B-Instruct \
  actor_rollout_ref.rollout.name=vllm \
  trainer.n_gpus_per_node=1 trainer.nnodes=1 \
  trainer.total_epochs=15

Config is Hydra: every key is overridable on the CLI. main_ppo is the shared entrypoint for PPO and its critic-free variants.
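
Because every key is a dotted override, a launcher script can keep the configuration as a nested dict and flatten it into CLI tokens. The sketch below assumes nothing about Hydra internals; `to_overrides` is a hypothetical helper, and the keys are the ones used in the command above.

```python
from typing import Any, Dict, List


def to_overrides(cfg: Dict[str, Any], prefix: str = "") -> List[str]:
    """Flatten a nested dict into dotted key=value override tokens."""
    tokens: List[str] = []
    for key, value in cfg.items():
        dotted = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            # Recurse so {"a": {"b": 1}} becomes "a.b=1".
            tokens.extend(to_overrides(value, dotted))
        else:
            tokens.append(f"{dotted}={value}")
    return tokens


overrides = to_overrides({
    "actor_rollout_ref": {"rollout": {"name": "vllm"}},
    "trainer": {"n_gpus_per_node": 1, "nnodes": 1, "total_epochs": 15},
})
command = ["python3", "-m", "verl.trainer.main_ppo", *overrides]
print(" ".join(command))
```

Passing `command` to `subprocess.run` would launch the job; keeping the dict in version control makes the overrides reviewable.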

How to develop with it

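GRPO is PPO without a critic, selected by the advantage estimator. Group sampling needs rollout.n > 1. With actor.use_kl_loss=True the KL term sits in the actor loss and is weighted by actor.kl_loss_coef; algorithm.kl_ctrl.kl_coef applies to the in-reward KL penalty instead:

python3 -m verl.trainer.main_ppo \
  algorithm.adv_estimator=grpo \
  data.train_batch_size=1024 \
  actor_rollout_ref.rollout.name=vllm \
  actor_rollout_ref.rollout.n=8 \
  actor_rollout_ref.actor.use_kl_loss=True \
  actor_rollout_ref.actor.kl_loss_coef=0.001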
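
To see why group sampling needs more than one completion per prompt, here is a minimal sketch of a group-relative advantage: each reward is normalised against the mean and standard deviation of its own group. It is not verl's implementation. The population standard deviation, the `eps` value, and the single-sample fallback are assumptions made for illustration.

```python
from statistics import mean, pstdev
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Normalise the rewards of one prompt's sampled completions against each other."""
    if len(rewards) < 2:
        # A single sample has no group to compare against, so there is
        # no baseline; this sketch passes the raw reward through.
        return list(rewards)
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption of this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four completions for one prompt, two correct and two wrong.
print(group_relative_advantages([1.0, 0.0, 0.0, 1.0]))
# A group where every completion scores the same carries no learning signal.
print(group_relative_advantages([1.0, 1.0, 1.0]))
```

Groups with identical rewards produce all-zero advantages, which is one reason to track the fraction of such groups during training.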

Custom reward: point custom_reward_function.path at a Python file exposing a scoring function (data_source, solution_str, ground_truth, extra_info) -> float. Datasets are parquet with a prompt column plus a reward_model field carrying the ground truth. See examples/grpo_trainer/ for ready scripts (e.g. run_qwen3_8b_fsdp.sh, run_qwen3_8b_megatron.sh).
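
A minimal sketch of such a scoring file, assuming a GSM8K-style convention where the completion ends with `#### <number>`. The function name `compute_score`, the regex, and the exact-string comparison are illustrative choices, not a definitive implementation; check the verl docs for the name the trainer expects by default.

```python
import re
from typing import Any, Dict, Optional

# Matches "#### 72", "#### -5", "#### 1,234" or "#### 3.5".
_FINAL_ANSWER = re.compile(r"####\s*(-?[\d,]*\.?\d+)")


def compute_score(
    data_source: str,
    solution_str: str,
    ground_truth: str,
    extra_info: Optional[Dict[str, Any]] = None,
) -> float:
    """Return 1.0 if the last '#### <number>' in the completion equals the ground truth."""
    matches = _FINAL_ANSWER.findall(solution_str)
    if not matches:
        return 0.0  # no parseable final answer
    predicted = matches[-1].replace(",", "")
    expected = str(ground_truth).replace(",", "").strip()
    # Plain string comparison: "72" and "72.0" are treated as different.
    return 1.0 if predicted == expected else 0.0


print(compute_score("gsm8k-demo", "She sold 48 + 24 = 72 clips. #### 72", "72"))
```

The `data_source` argument lets one file dispatch to different scorers per dataset; this sketch ignores it.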

How to scale it

Multi-node is Ray plus more workers. Start a head node, attach workers, then submit the same main_ppo command with trainer.nnodes raised:

ray start --head --port=6379 --num-gpus=8        # head
ray start --address=<HEAD_IP>:6379 --num-gpus=8  # each worker
python3 -m verl.trainer.main_ppo \
  trainer.nnodes=4 trainer.n_gpus_per_node=8 \
  actor_rollout_ref.actor.strategy=fsdp2 \
  actor_rollout_ref.actor.fsdp_config.param_offload=True

For very large models, switch the strategy to megatron and set the tensor-parallel and pipeline-parallel sizes. With offload enabled, the colocated actor moves optimiser state and parameters to host memory between phases to free HBM for rollout.
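
Before submitting a Megatron job it helps to check that the parallel sizes divide the GPU count. The sketch below is plain arithmetic under the assumption that only tensor and pipeline parallelism are in play (no context or expert parallelism); `data_parallel_size` is a hypothetical helper, not a verl function.

```python
def data_parallel_size(nnodes: int, gpus_per_node: int, tp: int, pp: int) -> int:
    """Return the data-parallel degree implied by a TP x PP layout."""
    world_size = nnodes * gpus_per_node
    model_parallel = tp * pp
    if world_size % model_parallel != 0:
        raise ValueError(
            f"world size {world_size} is not divisible by tp*pp = {model_parallel}"
        )
    return world_size // model_parallel


# 4 nodes x 8 GPUs with TP=8 (one node per TP group) and PP=2.
print(data_parallel_size(nnodes=4, gpus_per_node=8, tp=8, pp=2))
```

Keeping the tensor-parallel size at or below the GPUs per node is a common choice so that tensor-parallel traffic stays on NVLink.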

Inference

verl does not serve traffic; the rollout step uses an inference engine (vLLM or SGLang) internally to sample completions. Choose the engine with actor_rollout_ref.rollout.name=vllm|sglang and tune rollout.gpu_memory_utilization, rollout.tensor_model_parallel_size, and prefix caching. To serve the trained model in production, export the checkpoint and serve it from a standalone inference stack.

Fine-tuning

verl is a post-training (RL) library: GRPO/PPO/DAPO on top of an SFT or instruct base. It is the high-performance default referenced from the post-training overview; for the GRPO method itself, see the GRPO page. Pair an SFT or LoRA warm-start with verl RL for the full recipe.

Optimised hardware

Weight resync (actor → rollout each step) is the hot path; colocation keeps it on-device or over NVLink/NVSwitch, avoiding fabric round-trips.
Phase offload: param/optimiser offload to host (fsdp_config.param_offload, optimizer_offload) trades PCIe/CPU bandwidth for HBM headroom during rollout.
Megatron multi-node uses NCCL over InfiniBand/RoCE with GDR: set NCCL_IB_HCA and NCCL_NET_GDR_LEVEL=SYS, then confirm [GDRDMA] appears in the NCCL_DEBUG=INFO output (see the networking fabric and performance tuning pages).
Blackwell FP8/NVFP4 rollout precision can cut generation cost; verify reward parity against BF16 (see the Blackwell platform page).
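
The GDRDMA check above can be scripted by scanning a captured NCCL_DEBUG=INFO log. This is a rough sketch: the marker strings `via NET/` and `GDRDMA` are assumptions about the log format, which varies across NCCL versions, so both are parameters. `gdrdma_summary` is a hypothetical helper.

```python
from typing import Dict, Iterable


def gdrdma_summary(
    log_lines: Iterable[str],
    net_marker: str = "via NET/",   # assumed marker for network transport lines
    gdr_marker: str = "GDRDMA",     # assumed marker for GPUDirect RDMA
) -> Dict[str, int]:
    """Count network-transport log lines and how many of them mention GDRDMA."""
    net_lines = 0
    gdrdma_lines = 0
    for line in log_lines:
        if net_marker in line:
            net_lines += 1
            if gdr_marker in line:
                gdrdma_lines += 1
    return {"net_lines": net_lines, "gdrdma_lines": gdrdma_lines}
```

Passing an open file handle works because the function only iterates over lines. If `gdrdma_lines` is lower than `net_lines`, some channels are likely staging through host memory.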

Cookbook (common use cases)

1) GRPO on math (single node, 8 GPU):

python3 -m verl.trainer.main_ppo algorithm.adv_estimator=grpo \
  data.train_files=$HOME/data/gsm8k/train.parquet \
  actor_rollout_ref.model.path=Qwen/Qwen3-8B \
  actor_rollout_ref.rollout.name=sglang actor_rollout_ref.rollout.n=8 \
  trainer.n_gpus_per_node=8 trainer.nnodes=1

2) PPO with a critic (reward-model RLHF shape):

python3 -m verl.trainer.main_ppo algorithm.adv_estimator=gae \
  critic.model.path=$RM_PATH critic.ppo_micro_batch_size_per_gpu=4 \
  algorithm.kl_ctrl.kl_coef=0.001 actor_rollout_ref.rollout.name=vllm

3) Multi-node GRPO (4×8 on Ray): start the Ray head/workers as above, then submit (1) with trainer.nnodes=4 trainer.n_gpus_per_node=8 and actor_rollout_ref.actor.strategy=megatron for large models.

Gotchas & failure modes

OOM during rollout is the signature colocation failure. Enable param/optimiser offload or lower rollout.gpu_memory_utilization.
GRPO needs rollout.n > 1; with n=1 it silently degenerates because there is no group baseline. Watch reward, entropy, and KL for collapse (see the GRPO page).
Backend/version drift: vLLM, SGLang, and Megatron move fast; mismatched CUDA/engine versions break the rollout, so pin via the maintained docker image and verify against the repo.
Reward function bugs dominate outcomes; unit-test the scorer on held-out strings before a full run.
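
One way to unit-test a scorer before a full run is a small table of held-out strings with expected scores. The harness below is a sketch: `check_scorer` and the deliberately naive `suffix_scorer` are hypothetical, and the cases are illustrative rather than taken from any dataset.

```python
from typing import Callable, List, Sequence, Tuple

# (solution_str, ground_truth, expected_score)
Case = Tuple[str, str, float]


def check_scorer(scorer: Callable[..., float], cases: Sequence[Case]) -> List[str]:
    """Run the scorer on each case and return a description of every failure."""
    failures: List[str] = []
    for solution_str, ground_truth, expected in cases:
        got = scorer("unit-test", solution_str, ground_truth, None)
        if isinstance(got, bool) or not isinstance(got, (int, float)):
            failures.append(f"{solution_str!r}: non-numeric score {got!r}")
        elif abs(got - expected) > 1e-9:
            failures.append(f"{solution_str!r}: expected {expected}, got {got}")
    return failures


def suffix_scorer(data_source, solution_str, ground_truth, extra_info=None):
    # Naive on purpose: a suffix match also accepts "172" for "72".
    return 1.0 if solution_str.strip().endswith(ground_truth) else 0.0


cases = [
    ("The answer is 72", "72", 1.0),
    ("The answer is 172", "72", 0.0),  # substring trap
    ("", "72", 0.0),                   # empty completion
]
for failure in check_scorer(suffix_scorer, cases):
    print(failure)
```

The substring case is reported as a failure, which is the kind of false positive that would otherwise be rewarded for thousands of steps.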

References

verl repo: https://github.com/volcengine/verl
verl docs (quickstart, GRPO): https://verl.readthedocs.io/en/latest/start/quickstart.html · https://verl.readthedocs.io/en/latest/algo/grpo.html
HybridFlow paper (EuroSys 2025): https://arxiv.org/abs/2409.19256
GRPO examples: https://github.com/volcengine/verl/tree/main/examples/grpo_trainer
Anyscale — Open Source RL Libraries for LLMs: https://www.anyscale.com/blog/open-source-rl-libraries-for-llms

Related: RL libraries · Post-training · GRPO · Ray · slime · SkyRL · Glossary