Asynchronous and disaggregated RL systems

Scope: the systems design that scales RL post-training of LLMs by splitting the rollout (generation) workload from the policy-training workload and running the two asynchronously; the off-policy staleness this introduces; and the truncated importance sampling that keeps training stable. This page is the infrastructure companion to the algorithm in GRPO and to the serving split in disaggregated inference.

The commands on this page are reference templates built on real APIs; pin versions and validate before production use.

flowchart LR
  subgraph ACT["Actors (generation GPUs)"]
    G1["vLLM / SGLang rollout"]
  end
  subgraph LRN["Learners (training GPUs)"]
    T1["Policy-gradient update (FSDP / Megatron)"]
  end
  G1 -->|"completions + rewards"| Q["Rollout queue / buffer"]
  Q --> T1
  T1 -->|"weight sync (every N steps)"| G1
  T1 -.->|"rollouts from an older policy"| STALE["Off-policy staleness"]
  STALE -.->|"reweight, cap at C"| TIS["Truncated importance sampling"]
  TIS -.-> T1

What it is

A policy-gradient run is two very different workloads bolted together: generation (sampling completions from the current policy with an inference engine) and training (a gradient update on those completions). Policy-gradient theory assumes the data is on-policy: every completion is sampled from the current policy and trained on before that policy is updated again. In practice, exactly on-policy execution is impractically slow and cannot be synchronized perfectly, so modern systems run the two phases on separate GPU pools and let them overlap.¹ The convention, from the OLMo/open-RL tooling, is to call the GPUs dedicated to sampling actors and the GPUs taking the RL steps learners, with a distributed library such as Ray passing rollouts from actors to learners and weights back the other way.¹
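The actor/learner split can be sketched with nothing but the Python standard library. This is a toy illustration, not how Ray or any RL framework implements it: threads stand in for GPU pools, a bounded `queue.Queue` stands in for the rollout buffer, and the names `run_async_loop`, `published`, and `sync_every` are hypothetical. Each rollout carries the policy version that produced it, so the learner can see how stale its batch is.

```python
import queue
import threading


def run_async_loop(num_actors=2, total_updates=6, batch_size=4, sync_every=2):
    """Toy actor/learner loop; returns the max staleness of each training batch."""
    rollouts = queue.Queue(maxsize=8)   # rollout buffer between the two pools
    published = {"version": 0}          # policy version the actors currently sample from
    stop = threading.Event()

    def actor(actor_id):
        # Stand-in for an inference engine: tag each rollout with the policy
        # version that produced it, then hand it to the learner.
        while not stop.is_set():
            rollout = {"actor": actor_id, "policy_version": published["version"]}
            try:
                rollouts.put(rollout, timeout=0.05)
            except queue.Full:
                continue  # buffer is full; re-check the stop flag and retry

    threads = [threading.Thread(target=actor, args=(i,), daemon=True)
               for i in range(num_actors)]
    for t in threads:
        t.start()

    learner_version = 0
    staleness_per_batch = []
    for _ in range(total_updates):
        batch = [rollouts.get() for _ in range(batch_size)]
        staleness_per_batch.append(
            max(learner_version - r["policy_version"] for r in batch))
        learner_version += 1                        # stand-in for one gradient step
        if learner_version % sync_every == 0:
            published["version"] = learner_version  # weight sync to the actors

    stop.set()
    for t in threads:
        t.join(timeout=1.0)
    return staleness_per_batch


if __name__ == "__main__":
    print(run_async_loop())  # values vary run to run with thread timing
```

The first batch is always fully on-policy (staleness 0); later batches lag by however many updates happened since their rollouts were sampled, which is the quantity the rest of this page is about bounding.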

Why asynchrony

No idle compute. In a synchronous loop the trainer sits idle while the generator works and vice versa. Overlapping them keeps both pools busy, the same motivation as disaggregated inference, applied to training.
The straggler problem dominates reasoning RL. Reasoning models emit 10K-100K+ tokens per answer, so generation is the bottleneck. In a synchronous batch, one slow prompt (more tokens, more tool calls) leaves most of the allocation idle until it finishes.¹ Asynchrony (filling each training batch from the most-recently-completed rollouts across many generators) removes that barrier.
Multi-datacenter scale. Increasing the time between weight syncs makes the loop more off-policy but lets a run span datacenters, where tight synchronization is infeasible.¹
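The straggler effect is easy to put numbers on. The sketch below is a back-of-the-envelope model under stated assumptions: rollout durations are made-up values in arbitrary time units, each generator handles one prompt at a time, and only the generation pool is modeled. `sync_utilization` and `async_utilization` are hypothetical helper names.

```python
import heapq


def sync_utilization(batches):
    """Every batch waits for its longest rollout before the next batch starts."""
    width = len(batches[0])                 # one generator per prompt in a batch
    busy = sum(sum(b) for b in batches)     # total generator-time spent sampling
    wall = sum(max(b) for b in batches)     # each batch lasts as long as its straggler
    return busy / (wall * width)


def async_utilization(batches):
    """Each generator pulls the next prompt as soon as it finishes the previous one."""
    width = len(batches[0])
    free_at = [0.0] * width                 # time at which each generator becomes free
    heapq.heapify(free_at)
    for duration in (d for b in batches for d in b):
        start = heapq.heappop(free_at)      # earliest-free generator takes the prompt
        heapq.heappush(free_at, start + duration)
    busy = sum(sum(b) for b in batches)
    return busy / (max(free_at) * width)


# One 40-unit straggler among otherwise short rollouts (hypothetical durations).
workload = [[2, 3, 2, 40], [3, 2, 2, 3], [2, 2, 3, 2]]
print(round(sync_utilization(workload), 3))   # 0.359
print(round(async_utilization(workload), 3))  # 0.412
```

With this short workload the straggler still sets the total wall time in the asynchronous case; in a continuous stream of prompts the other generators keep working past it, so the gap widens.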

Off-policy staleness and the train-inference mismatch

Asynchrony buys throughput at the cost of being slightly off-policy: by the time a completion is trained on, the policy that generated it is a few updates stale. These systems are built on the premise that nearly on-policy data is good enough for stable learning.¹ Two distinct gaps appear:

Policy drift: the sampling policy pi_old lags the policy being updated pi_theta across the steps between weight syncs.
Engine mismatch: the inference engine (vLLM/SGLang) and the trainer compute token probabilities slightly differently, so even a "fresh" rollout is not exactly on-policy (GRPO notes this train-inference mismatch).

Both push the data away from the distribution the gradient assumes and, left unbounded, both destabilize training.
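Both gaps can be monitored with simple per-token statistics. The following is a minimal sketch, assuming the inference engine returns the log-probability of each sampled token and the trainer can recompute log-probabilities for the same tokens. The function names, the rollout dict layout, and the `max_staleness` threshold are illustrative assumptions, and dropping stale rollouts is only one possible guard, not a method the source prescribes.

```python
import math


def mismatch_stats(sampler_logprobs, trainer_logprobs):
    """Compare per-token log-probs of the same sampled tokens from two sources."""
    diffs = [s - t for s, t in zip(sampler_logprobs, trainer_logprobs)]
    n = len(diffs)
    return {
        # Average absolute disagreement in log-prob space.
        "mean_abs_diff": sum(abs(d) for d in diffs) / n,
        # Monte Carlo estimate of KL(sampler || trainer) over the sampled tokens.
        "kl_estimate": sum(diffs) / n,
        # Largest per-token ratio pi_trainer / pi_sampler in the batch.
        "max_ratio": math.exp(max(-d for d in diffs)),
    }


def drop_stale(rollouts, learner_version, max_staleness):
    """Keep rollouts sampled at most `max_staleness` policy versions ago."""
    return [r for r in rollouts
            if learner_version - r["policy_version"] <= max_staleness]


stats = mismatch_stats([-1.0, -2.0, -0.5], [-1.1, -1.9, -0.5])
fresh = drop_stale(
    [{"policy_version": 5}, {"policy_version": 3}, {"policy_version": 1}],
    learner_version=5,
    max_staleness=2,
)
```

Applied to a rollout fresh from the engine, `mismatch_stats` isolates engine mismatch; applied after several updates, it measures drift and mismatch together.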

Truncated importance sampling (TIS)

Importance sampling corrects for sampling from pi_old while optimizing pi_theta by reweighting each sample by the ratio rho = pi_theta / pi_old. Raw ratios have unbounded variance; a single large ratio can blow up the gradient. Truncated importance sampling caps the weight at min(rho, C) for a constant C, trading a small bias for bounded variance.¹² Unlike the bilateral clipping in PPO and CISPO (which constrains the ratio near 1 on both sides), TIS is a one-sided upper cap: the ratio may fall freely below 1 but is capped above at C to prevent extreme upweighting.¹ It is the correction that makes aggressively asynchronous and off-policy updates trainable.
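The cap itself is a one-liner. Here is a minimal sketch in plain Python, assuming per-token log-probabilities under both policies are available; the cap value `C = 2.0` is an arbitrary illustration, not a recommended setting, and `tis_weights` is a hypothetical helper name. In an autograd framework the weight would typically be treated as a constant multiplier on the policy-gradient term.

```python
import math


def tis_weights(logp_theta, logp_old, cap=2.0):
    """Per-token truncated importance weights min(rho, C), rho = pi_theta / pi_old."""
    weights = []
    for lt, lo in zip(logp_theta, logp_old):
        rho = math.exp(lt - lo)         # ratio computed in log space
        weights.append(min(rho, cap))   # one-sided cap: no lower bound is applied
    return weights


def truncated_fraction(logp_theta, logp_old, cap=2.0):
    """Share of tokens whose raw ratio exceeded the cap (useful to log)."""
    hits = sum(1 for lt, lo in zip(logp_theta, logp_old) if math.exp(lt - lo) > cap)
    return hits / len(logp_theta)


logp_theta = [-1.0, -1.0, -1.0]
logp_old = [-1.0, -3.0, -0.5]
print(tis_weights(logp_theta, logp_old))
# raw ratios are 1.0, e^2 (about 7.39) and e^-0.5 (about 0.61);
# only the middle one is capped, to 2.0
```

A rising truncated fraction is a cheap signal that the sampling policy has drifted far enough for the bias from truncation to matter.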

Colocated vs disaggregated

| | Colocated | Disaggregated |
|---|---|---|
| Layout | rollout and trainer share the same GPUs; phases alternate | separate actor and learner pools run concurrently |
| Sync | local, every step | over the network, every N steps |
| Best for | smaller models, simpler ops (TRL, verl colocate) | frontier scale, long rollouts, async (slime, verl) |
| Cost | memory pressure (both must fit on the same GPUs; offload between phases) | weight-transfer bandwidth and staleness management |

GRPO defaults to colocated in TRL; large-scale RL moves to a disaggregated, Ray-based actor/learner split.

# Disaggregated rollout: a dedicated vLLM server (actors) feeding the trainer (learners).
# TRL server mode keeps generation off the trainer GPUs.
CUDA_VISIBLE_DEVICES=0,1,2,3 trl vllm-serve --model Qwen/Qwen3-8B
# In GRPOConfig on the trainer GPUs: use_vllm=True, vllm_mode="server"

# verl (Ray): actor_rollout_ref groups the rollout engine, the actor (policy), and the
# reference; the trainer runs separately. Pin the verl release and verify keys on the repo.
python -m verl.trainer.main_ppo \
  actor_rollout_ref.rollout.name=vllm \
  actor_rollout_ref.rollout.mode=async \
  trainer.nnodes=4

Weight sync and the interconnect

The recurring cost is moving updated weights from learners to actors. Colocated sync is local; disaggregated sync is a network transfer that wants NVLink intra-node, or IB/RoCE with GPUDirect RDMA inter-node (networking fabric, performance tuning). Confirm that GDRDMA appears in the NCCL_DEBUG=INFO log, set NCCL_IB_HCA and NCCL_NET_GDR_LEVEL=SYS, and keep ACS off for P2P. The longer the sync interval, the more off-policy the data and the more TIS has to absorb, so sync cadence is the central knob trading throughput against stability. Cutting the bytes moved per sync, not just the cadence, is the subject of delta weight sync: only about 1-3% of weights change per step, so shipping the sparse delta cuts sync traffic by roughly 30-100x before index overhead, and the reconstructed weights are bit-identical.
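The delta idea can be sketched on plain Python lists. This is an illustration under assumptions, not the wire format of any real system: weights are a flat list of floats, changes are detected by value comparison (a real implementation would compare raw bits and work on tensors), and the byte sizes in `traffic_ratio` are hypothetical parameters.

```python
def make_delta(old, new):
    """Collect (index, new_value) pairs for every entry whose value changed."""
    return [(i, n) for i, (o, n) in enumerate(zip(old, new)) if o != n]


def apply_delta(weights, delta):
    """Rebuild the updated weights on the actor side from the sparse delta."""
    updated = list(weights)
    for index, value in delta:
        updated[index] = value
    return updated


def traffic_ratio(num_weights, changed_fraction, bytes_per_value=2, bytes_per_index=0):
    """Dense bytes divided by sparse-delta bytes for one sync."""
    dense = num_weights * bytes_per_value
    sparse = num_weights * changed_fraction * (bytes_per_value + bytes_per_index)
    return dense / sparse


old = [0.1, 0.2, 0.3, 0.4]
new = [0.1, 0.25, 0.3, 0.4]
delta = make_delta(old, new)             # [(1, 0.25)]
assert apply_delta(old, delta) == new    # the actor recovers the learner's weights exactly

print(traffic_ratio(1000, 0.01))                      # idealized: about 100x at 1% changed
print(traffic_ratio(1000, 0.03))                      # about 33x at 3% changed
print(traffic_ratio(1000, 0.01, bytes_per_index=4))   # about 33x once 4-byte indices are sent
```

The last line shows why the index encoding matters: naive per-value indices can cost more than the values themselves, so the achievable saving depends on how compactly the changed positions are represented.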

Failure modes

Too off-policy: long sync intervals push pi_old far from pi_theta; gradients destabilize. Sync more often or rely on TIS, and watch KL.
Unbounded importance weights: skipping TIS in an async loop lets one large ratio dominate the update. Keep the cap on.
Straggler-bound generation: a synchronous batch idles on the longest rollout; move to async or pack sequences.
Actor/learner imbalance: too few generation GPUs starve the trainer; too few learners leave generators waiting. Profile and rebalance the split.
Weight-sync stall: a slow learner->actor transfer becomes the per-step bottleneck; put it on the fast fabric, not the management network.
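For the actor/learner imbalance case, a first-pass split can be estimated from profiled throughput. The sketch below assumes throughput scales linearly with GPU count, which real systems only approximate; `balance_split` and the tokens-per-second figures are hypothetical and would come from profiling your own run.

```python
def balance_split(total_gpus, gen_tokens_per_gpu_s, train_tokens_per_gpu_s):
    """Return (actors, learners, throughput) maximizing the slower pool's rate."""
    best = None
    for actors in range(1, total_gpus):          # keep at least one GPU in each pool
        learners = total_gpus - actors
        produced = actors * gen_tokens_per_gpu_s       # tokens/s the actors can generate
        consumed = learners * train_tokens_per_gpu_s   # tokens/s the learners can train on
        throughput = min(produced, consumed)           # the slower pool sets the pace
        if best is None or throughput > best[2]:
            best = (actors, learners, throughput)
    return best


# Hypothetical profile: generation is 3x slower per GPU than training.
print(balance_split(8, gen_tokens_per_gpu_s=1000, train_tokens_per_gpu_s=3000))
# (6, 2, 6000)
```

When generation is the slower phase, as with long reasoning rollouts, most of the allocation ends up on the actor side.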

References

Reinforcement Learning from Human Feedback (Nathan Lambert, Manning MEAP), covering asynchronous RL systems (actors/learners, Ray + vLLM) and truncated importance sampling: https://rlhfbook.com
Asynchronous RLHF (off-policy generation/training): https://arxiv.org/abs/2410.18252
verl (Volcano Engine RL, Ray-based): https://github.com/volcengine/verl
slime (disaggregated, async RL): https://github.com/THUDM/slime
vLLM: https://docs.vllm.ai/en/latest/

1. Reinforcement Learning from Human Feedback (Manning MEAP), 6.2.3 Asynchronous RL Systems and 6.2.4 Truncated Importance Sampling: exact on-policy execution is impractically slow, so actors (generation GPUs) and learners (training GPUs) run concurrently via Ray + vLLM on the premise that nearly on-policy data suffices; long reasoning rollouts make generation the straggler-bound bottleneck.
2. Truncated importance sampling caps the per-sample ratio at min(rho, C): a one-sided upper cap (versus PPO/CISPO bilateral clipping near 1) that bounds policy-gradient variance at the cost of a small bias.