OpenRLHF

Scope: OpenRLHF, a community Ray + vLLM RLHF/agentic-RL framework on DeepSpeed, an early pioneer of asynchronous RL execution with strong reward-model support and a concise codebase.

The commands below are reference templates built on real APIs; pin versions and validate them before production use.

What it is

OpenRLHF (OpenRLHF/OpenRLHF, arXiv 2405.11143) was the first RLHF framework built on a Ray + vLLM distributed architecture. It separates Actor, Reward, Reference, and Critic models across GPUs, trains with DeepSpeed ZeRO-3 (AutoTP, RingAttention), and generates with vLLM. It implements the full post-training ladder in a deliberately small, readable codebase: SFT, reward-model (RM) training, PPO, GRPO, REINFORCE++ / REINFORCE++-baseline, RLOO, and DPO/IPO/cDPO. Among RL libraries it is the "RLHF / reward-model" option.

Why use it

  • RLHF and reward modelling are first-class: train an RM, then PPO against it, or supply a custom Python reward function for reinforced fine-tuning without a trained RM.
  • Async execution is a founding feature: OpenRLHF was an early implementation of asynchronous RL, overlapping generation and training.
  • Concise, production-leaning codebase that is easy to read and adapt; mature ecosystem with multimodal (OpenRLHF-M) and agentic variants.

When to use it (and when not)

  • Use for classic RLHF (RM + PPO), DPO preference tuning, or when you want async RL with a small surface area.
  • Use when DeepSpeed + vLLM + Ray matches your existing stack.
  • Prefer verl or slime for the highest-throughput large-scale runs (Megatron, tightly-coupled or async MoE); prefer SkyRL for flexible colocated↔disaggregated agentic research. OpenRLHF is vLLM-only on the rollout side.

Architecture

flowchart TB
  CTRL["Ray scheduler"]
  subgraph Roll["Rollout"]
    VLLM["vLLM generation (AutoTP)"]
  end
  subgraph Trainer["DeepSpeed ZeRO-3"]
    ACT["Actor / Critic"]
    REF["Reference"]
    RM["Reward model"]
  end
  CTRL -.-> Roll
  CTRL -.-> Trainer
  VLLM -->|"samples"| RM
  RM -->|"rewards"| ACT
  ACT -.->|"async weight sync"| VLLM

Setting --train.async_enable decouples generation from the policy update. An async queue, sized by --train.async_queue_size, buffers rollouts between the two, and --train.partial_rollout_enable uses vLLM pause/resume so that the weight refresh overlaps with generation.
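The link between queue depth and off-policy lag can be illustrated with a toy simulation. This is a sketch only and not OpenRLHF's implementation: `simulate_async` and its parameters are hypothetical names, and the model assumes the generator always refills the queue before each policy update.

```python
from collections import deque


def simulate_async(queue_size: int, num_updates: int) -> list[int]:
    """Return the staleness (in policy versions) of each batch the trainer consumes.

    Toy model: the generator fills a bounded queue with rollout batches tagged
    with the policy version that produced them; the trainer pops one batch per
    update and then bumps the policy version.
    """
    queue = deque()
    policy_version = 0
    staleness = []
    for _ in range(num_updates):
        # Generator runs ahead until the bounded queue is full.
        while len(queue) < queue_size:
            queue.append(policy_version)
        # Trainer consumes the oldest batch and performs one update.
        produced_at = queue.popleft()
        staleness.append(policy_version - produced_at)
        policy_version += 1
    return staleness


if __name__ == "__main__":
    for size in (1, 2, 4):
        print(size, simulate_async(size, 8))
```

In this toy model the steady-state staleness equals the queue size minus one, which is why a small queue is the conservative starting point. A real system adds further lag from generation that is still in flight.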

How to use it

# Install (vLLM extra recommended; pins a vLLM build)
pip install "openrlhf[vllm]"              # or "openrlhf[vllm,ring,liger]"; quotes stop shells such as zsh from globbing the brackets

# Start Ray, then submit a PPO job (hybrid engine)
ray start --head --port 6379
ray job submit --address="http://127.0.0.1:8265" \
  --runtime-env-json='{"working_dir": "."}' \
  -- python3 -m openrlhf.cli.train_ppo_ray \
     --actor.model_name_or_path OpenRLHF/Llama-3-8b-sft-mixture \
     --reward.model_name_or_path OpenRLHF/Llama-3-8b-rm-700k \
     --ckpt.output_dir ./checkpoint/llama3-8b-ppo

Switch the advantage estimator to run GRPO instead of PPO:

     --algo.advantage.estimator group_norm     # GRPO (group-relative, critic-free)
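For intuition, the group-relative advantage that the estimator name refers to can be sketched in a few lines. This illustrates the GRPO idea and is not OpenRLHF's code: the function name, the epsilon, and the choice of population standard deviation are assumptions.

```python
from statistics import mean, pstdev


def group_norm_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Normalise the rewards of several responses sampled for the SAME prompt.

    Each response is scored relative to its own group, so no critic (value
    model) is needed to provide a baseline.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std over the group (assumption)
    return [(r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    # Four sampled responses to one prompt, two of them rewarded.
    print(group_norm_advantages([1.0, 0.0, 0.0, 1.0]))
```

When every response in a group receives the same reward, all advantages are zero and that prompt contributes no gradient signal.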

How to develop with it

OpenRLHF exposes one CLI entrypoint per stage; a typical pipeline chains SFT → RM → PPO/GRPO (or jumps to DPO):

# SFT
deepspeed --module openrlhf.cli.train_sft \
  --actor.model_name_or_path meta-llama/Meta-Llama-3-8B \
  --data.dataset OpenRLHF/sft-mixture

# Reward model
deepspeed --module openrlhf.cli.train_rm \
  --actor.model_name_or_path OpenRLHF/Llama-3-8b-sft-mixture \
  --data.dataset OpenRLHF/preference_dataset_mixture2 \
  --ckpt.output_dir ./checkpoint/llama3-8b-rm

# DPO (offline preference, no rollouts)
deepspeed --module openrlhf.cli.train_dpo \
  --actor.model_name_or_path OpenRLHF/Llama-3-8b-sft-mixture \
  --data.dataset OpenRLHF/preference_dataset_mixture2

For agentic / custom reward, pass --train.agent_func_path /path/to/agent_func.py to compute rewards on-the-fly.
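As an illustration of the kind of scoring logic such a file might contain, here is a rule-based reward that checks a final numeric answer. The interface OpenRLHF expects from the file passed to --train.agent_func_path is not described here, so the function name and signature below are hypothetical; consult the repository for the required entry point.

```python
import re


def exact_match_reward(response: str, reference_answer: str) -> float:
    """Return 1.0 if the last number in the response equals the reference, else 0.0.

    Hypothetical helper: assumes the reference answer is a numeric string.
    """
    numbers = re.findall(r"-?\d+(?:\.\d+)?", response)
    if not numbers:
        return 0.0  # no number found, so nothing to compare
    return 1.0 if float(numbers[-1]) == float(reference_answer) else 0.0


if __name__ == "__main__":
    print(exact_match_reward("So the answer is 42.", "42"))
    print(exact_match_reward("I think 7, or maybe 8", "7"))
```

A verifiable reward of this kind replaces the trained reward model for tasks where correctness can be checked programmatically.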

How to scale it

# Multi-node async PPO/GRPO via Ray + DeepSpeed
ray start --head --port 6379                 # head
ray start --address='<head-ip>:6379'          # each worker

ray job submit --address="http://127.0.0.1:8265" \
  --runtime-env-json='{"working_dir": "."}' \
  -- python3 -m openrlhf.cli.train_ppo_ray \
     --actor.model_name_or_path <model> \
     --reward.model_name_or_path <rm> \
     --train.async_enable \
     --train.async_queue_size 2 \
     --train.partial_rollout_enable

DeepSpeed ZeRO-3 shards optimiser state, gradients, and parameters across all data-parallel GPUs; Ray places the Actor, Critic, Reference, and Reward models and the vLLM workers on distinct GPUs (see Ray). CPU/NVMe offload fits larger models at the cost of PCIe/NVMe bandwidth.
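A back-of-the-envelope estimate helps when sizing a cluster. The sketch below uses the standard mixed-precision Adam accounting of 16 bytes per parameter and assumes model state is sharded evenly over all GPUs; the function name is hypothetical, and activations, communication buffers, and fragmentation are deliberately left out.

```python
def zero3_model_state_bytes_per_gpu(num_params: float, num_gpus: int) -> float:
    """Model-state bytes per GPU under ZeRO-3 with mixed-precision Adam.

    Per parameter: 2 B fp16 weights + 2 B fp16 gradients + 12 B fp32 optimiser
    state (master weights, momentum, variance) = 16 B, sharded over all GPUs.
    Activations, buffers, and fragmentation are NOT included.
    """
    bytes_per_param = 2 + 2 + 12
    return num_params * bytes_per_param / num_gpus


if __name__ == "__main__":
    per_gpu = zero3_model_state_bytes_per_gpu(8e9, 16)
    print(f"{per_gpu / 1e9:.1f} GB of model state per GPU")
```

Under these assumptions an 8B-parameter actor on 16 GPUs carries about 8 GB of model state per GPU. The critic adds its own trainable state, and the reference and reward models add frozen weights on top.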

Inference

OpenRLHF uses vLLM purely to generate rollouts (AutoTP, optional PP), not to serve production traffic. Async mode overlaps that generation with the policy update. For production serving, see inference serving/serving open-weight models.

Fine-tuning

OpenRLHF is a fine-tuning/RL framework covering PPO, GRPO, REINFORCE++/RLOO, and DPO, plus SFT and reward-model training (see fine-tuning and post-training). GRPO is selected via --algo.advantage.estimator group_norm; DPO needs only a preference dataset and no rollout engine.
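The reason DPO needs no rollout engine is visible in its loss, which depends only on log-probabilities of existing chosen/rejected pairs under the policy and a frozen reference. The following is a minimal sketch of the standard DPO objective for one pair, not OpenRLHF's code; the function name and the default beta are assumptions.

```python
import math


def dpo_loss(policy_chosen_logp: float, policy_rejected_logp: float,
             ref_chosen_logp: float, ref_rejected_logp: float,
             beta: float = 0.1) -> float:
    """DPO loss for one preference pair, from summed sequence log-probs."""
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    logits = beta * (chosen_margin - rejected_margin)
    # -log(sigmoid(x)) written as a numerically stable softplus(-x).
    return max(-logits, 0.0) + math.log1p(math.exp(-abs(logits)))


if __name__ == "__main__":
    # Policy identical to the reference: loss is log(2).
    print(dpo_loss(-10.0, -12.0, -10.0, -12.0))
    # Policy has shifted probability towards the chosen response: lower loss.
    print(dpo_loss(-8.0, -14.0, -10.0, -12.0))
```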

Optimised hardware

  • DeepSpeed ZeRO-3 offload (CPU/NVMe) trades PCIe/NVMe bandwidth for capacity (see DeepSpeed and ZeRO); keep intra-node all-gather traffic on NVLink/NVSwitch (see the Blackwell platform).
  • Async weight sync from Actor to vLLM wants a low-latency interconnect: set NCCL_IB_HCA and NCCL_NET_GDR_LEVEL=SYS, confirm that [GDRDMA] appears in the NCCL_DEBUG=INFO logs, and disable PCIe ACS for P2P/GDR (see performance tuning).
  • vLLM on Blackwell can use FP8/NVFP4 for rollout throughput; training precision follows the recipe.
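One way to get the NCCL settings above onto every worker is to pass them through the Ray runtime environment, whose `env_vars` field is forwarded to the job's processes. The helper below is a hypothetical sketch that only builds the JSON string for --runtime-env-json; the device name passed in is a site-specific placeholder.

```python
import json


def nccl_runtime_env(ib_hca: str, working_dir: str = ".") -> str:
    """Build a JSON string suitable for `ray job submit --runtime-env-json`."""
    env_vars = {
        "NCCL_IB_HCA": ib_hca,        # site-specific InfiniBand device(s)
        "NCCL_NET_GDR_LEVEL": "SYS",  # value taken from the bullet above
        "NCCL_DEBUG": "INFO",         # logs transports so [GDRDMA] can be checked
    }
    return json.dumps({"working_dir": working_dir, "env_vars": env_vars})


if __name__ == "__main__":
    # "mlx5_0" is a placeholder; list the real devices on your nodes first.
    print(nccl_runtime_env("mlx5_0"))
```

Once GPUDirect RDMA is confirmed in the logs, drop NCCL_DEBUG back to its default to avoid log volume during long runs.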

Cookbook (common use cases)

# 1) PPO with a trained reward model (synchronous)
ray job submit --address="http://127.0.0.1:8265" \
  -- python3 -m openrlhf.cli.train_ppo_ray \
     --actor.model_name_or_path OpenRLHF/Llama-3-8b-sft-mixture \
     --reward.model_name_or_path OpenRLHF/Llama-3-8b-rm-700k

# 2) Reward-model training (Bradley-Terry on preference pairs)
deepspeed --module openrlhf.cli.train_rm \
  --actor.model_name_or_path OpenRLHF/Llama-3-8b-sft-mixture \
  --data.dataset OpenRLHF/preference_dataset_mixture2 \
  --ckpt.output_dir ./checkpoint/llama3-8b-rm

# 3) Async GRPO (overlap generation + update)
ray job submit --address="http://127.0.0.1:8265" \
  -- python3 -m openrlhf.cli.train_ppo_ray \
     --actor.model_name_or_path <model> \
     --algo.advantage.estimator group_norm \
     --train.async_enable --train.async_queue_size 2

(Flag names follow the current --train.* / --algo.* namespaces; the older --async_train form was renamed, so confirm on the repo README before a real run.)
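The Bradley-Terry objective named in recipe 2 can be written down for a single preference pair. This is a sketch of the standard pairwise loss, not OpenRLHF's implementation, and the function name is an assumption.

```python
import math


def bradley_terry_loss(chosen_reward: float, rejected_reward: float) -> float:
    """Pairwise loss -log(sigmoid(r_chosen - r_rejected)) for one preference pair."""
    margin = chosen_reward - rejected_reward
    # Numerically stable softplus(-margin).
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


if __name__ == "__main__":
    print(bradley_terry_loss(2.0, 0.0))  # chosen ranked higher: small loss
    print(bradley_terry_loss(0.0, 2.0))  # ranking inverted: large loss
```

Only the difference between the two scores matters, so the absolute scale of a reward model trained this way is arbitrary.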

Gotchas & failure modes

  • Async can hurt convergence if the queue is too deep. Enable it only when the algorithm tolerates off-policy lag, and start with a small --train.async_queue_size.
  • vLLM build must match the installed extra (openrlhf[vllm] pins a version); a mismatched vLLM breaks generation.
  • Forgetting to place Actor/Critic/Reference/Reward on distinct GPUs starves rollout or training (see Ray).
  • GRPO entropy/advantage collapse and KL drift still apply (see GRPO); monitor reward/entropy/KL (see observability).
  • Flag namespaces shifted over releases (--async_train → --train.async_enable); pin a commit.
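For the KL monitoring mentioned above, one common low-variance estimator (often called k3) can be computed from per-token log-probabilities of tokens sampled from the policy. This is a generic sketch rather than a description of what OpenRLHF logs; the function name is hypothetical.

```python
import math


def kl_k3(policy_logps: list[float], ref_logps: list[float]) -> float:
    """Mean k3 estimate of KL(policy || reference) over policy-sampled tokens."""
    total = 0.0
    for lp, lr in zip(policy_logps, ref_logps):
        log_ratio = lr - lp  # log(p_ref / p_policy) for this token
        # exp(x) - 1 - x is non-negative and zero only when x == 0.
        total += math.exp(log_ratio) - 1.0 - log_ratio
    return total / len(policy_logps)


if __name__ == "__main__":
    print(kl_k3([-1.0, -2.0], [-1.0, -2.0]))  # identical policies
    print(kl_k3([-1.0, -2.0], [-1.5, -2.5]))  # policy has drifted
```

A steadily rising value across training steps is an early sign of the KL drift listed above.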

References

  • OpenRLHF: https://github.com/OpenRLHF/OpenRLHF
  • OpenRLHF paper (arxiv 2405.11143): https://arxiv.org/abs/2405.11143
  • Anyscale — Open Source RL Libraries for LLMs: https://www.anyscale.com/blog/open-source-rl-libraries-for-llms

Related: RL libraries · Fine-tuning · GRPO · DPO · verl · slime · SkyRL · DeepSpeed ZeRO · Ray