Runbook: rollout fleet sizing and rebalancing
Scope: sizing and rebalancing the rollout fleet against the trainer pool in disaggregated RL post-training, the rate-matching step that decides whether trainer GPUs wait on samples or rollout GPUs generate tokens nobody consumes. This is the RL-training sibling of prefill/decode rate matching: two pools with different workloads must sustain the same request rate, and the ratio is a live control variable, not an install-time constant.
Run this when trainer GPUs sit idle waiting for rollouts, the rollout output queue grows without bound, a new model or workload lands on the RL training API, the generation-length mix shifts, or cost per optimizer step drifts up. Severity: cost and goodput, page-worthy only if a delivery SLO on run completion is also at risk. Nothing is down; one pool is starving the other.
The commands and metric names below are reference templates against real APIs; pin versions and validate them before production use. The sizing calculator is executed and asserted; its numbers are a worked example, and every default in it is a starting rule of thumb, not a measured law.
In a disaggregated RL system the rollout fleet is an inference deployment (vLLM-class engines generating trajectories) and the trainer is a gang-scheduled distributed job consuming them; the split, and the off-policy staleness budget that lets the two run without lockstep, are covered in async RL systems. Weight synchronization couples the pools on a cadence of its own (delta weight sync). Before buying capacity, remember that group-sampling RL wastes most of its rollout tokens on repeated prompt prefixes, and that waste is recoverable in software (rollout redundancy).
Trigger
- Trainer idle, rollouts saturated. Optimizer steps stretch because the sample queue is empty; rollout engines run at full batch. The fleet is undersized (or slow).
- Rollout backlog growing. Generated-but-unconsumed samples accumulate; with no backpressure the backlog ages into off-policy staleness or gets discarded. The fleet is oversized, or the trainer regressed.
- Workload changed. New base model, new max generation length, new group size, or a different prompt mix landed on the API; the old ratio is stale by construction.
- Cost per optimizer step drifting up at constant model and batch size: one pool is paying for the other's bottleneck.
Pre-checks
- Confirm the trainer itself is healthy. A trainer stalled on a collective or a bad node presents as "rollouts can't keep up"; rule out the NCCL-hang path and MFU regression before resizing anything.
- Confirm weight sync is not the stall. If rollout engines pause long per update, the fix is sync cadence or delta weight sync, not more instances; measure the pause separately.
- Capture both rates under production settings. Rollout throughput moves hard with sampling parameters, group size, and max generation length; a rate measured at the wrong temperature or length cap sizes the wrong fleet.
- Record the current allocation and its cost (instances per pool, GPU types) so the rebalance has a baseline and a rollback target.
- Check the algorithm's staleness budget. How many steps of weight lag the recipe tolerates (importance-sampling corrections, bounded lag) sets how much queue smoothing is allowed between the pools (async RL systems).
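The weight-sync pre-check reduces to one line of arithmetic. The following is a minimal sketch, assuming the engine generates nothing while it pauses for an update. The function name, the 3 s pause, the 20 s interval, and the triage threshold are hypothetical illustration values; the 4,000 tok/s rate and the 0.8 headroom match the worked example in the Procedure section.

```python
def sync_pause_derate(nameplate_tok_s: float, pause_s: float,
                      sync_interval_s: float) -> tuple[float, float]:
    """Return (effective tok/s, fraction of wall-clock lost to sync pauses).

    Assumption: generation fully stops for pause_s once per sync_interval_s.
    """
    assert nameplate_tok_s > 0 and 0 <= pause_s < sync_interval_s
    lost = pause_s / sync_interval_s
    return nameplate_tok_s * (1.0 - lost), lost


# Hypothetical measurement: a 3 s pause on every 20 s sync interval.
effective, lost = sync_pause_derate(4_000.0, pause_s=3.0, sync_interval_s=20.0)

HEADROOM = 0.8  # sizing default; it must also absorb stragglers and drift
# Hypothetical triage rule: sync is "the stall" once it eats half the derate.
sync_dominates = lost > 0.5 * (1.0 - HEADROOM)

print(f"effective {effective:.0f} tok/s, {lost:.0%} lost to sync pauses")
print("fix sync cadence first" if sync_dominates else "sync pause is not the stall")
```

With these inputs the pause alone consumes 15 of the 20 points of headroom, which is the case where adding instances would pay for a sync problem.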
Flow
flowchart TB
A["Imbalance suspected"] --> B["Measure: rollout tok/s per instance,<br/>trainer consumption tok/s"]
B --> C{"Which side starves?"}
C -->|"trainer idle"| D["Recover free throughput first:<br/>prefix caching, dedup, batching"]
D --> E{"Still short?"}
E -->|"yes"| F["Add rollout instances to the matched count"]
E -->|"no"| G["Re-measure and verify"]
C -->|"rollout backlog grows"| H["Shrink rollout pool, or raise trainer<br/>throughput / batch; cap queue with backpressure"]
F --> G
H --> G
G --> I["Wire idle-fraction and queue-age alerts"]
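The branch decision in the flowchart can be sketched as a small function over the two measured signals. This is an illustration only: the function name and both thresholds are hypothetical and should come from the run's own SLO and from the noise floor of the queue-depth metric.

```python
def triage(trainer_idle_frac: float, backlog_slope_tok_s: float,
           idle_target: float = 0.05, slope_tol_tok_s: float = 0.0) -> str:
    """Map the two measured signals onto the flowchart branches.

    idle_target and slope_tol_tok_s are hypothetical thresholds.
    """
    if trainer_idle_frac > idle_target:
        # Trainer starves: software recovery first, capacity second.
        return "recover free throughput, then add rollout instances if still short"
    if backlog_slope_tok_s > slope_tol_tok_s:
        # Rollouts outrun the trainer: the backlog is growing.
        return "shrink rollout pool or raise trainer throughput; cap queue with backpressure"
    return "balanced: re-measure and verify"


print(triage(0.186, 0.0))      # trainer-idle branch
print(triage(0.0, 20_678.4))   # backlog branch
print(triage(0.01, 0.0))       # balanced
```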
Procedure
- Measure per-instance rollout throughput at production settings. Scrape the engines under the real recipe (sampling parameters, group size, length caps), not a synthetic load:
# Per-instance generation rate (tokens/s) over a 5-minute window:
curl -s http://$ROLLOUT_POD:8000/metrics | grep -E \
'vllm:generation_tokens|vllm:num_requests_running|vllm:num_requests_waiting'
# PromQL equivalent: rate(vllm:generation_tokens_total[5m]) per pod
# (Prometheus scrapes prometheus_client Counters with the _total suffix)
Record the mean and the spread across instances; a wide spread is a straggler or placement problem, not a sizing problem.
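Turning two scrapes into a rate and a spread can be sketched as follows. This assumes the engines expose the counter in Prometheus text exposition format under the name given in the PromQL comment above. The helper names, the sample scrape text, the pod names, and the 0.8 x median straggler cutoff are hypothetical, and counter resets from engine restarts are ignored.

```python
import statistics


def counter_total(metrics_text: str, name: str) -> float:
    """Sum all samples of one counter in Prometheus text exposition format."""
    total = 0.0
    for line in metrics_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and HELP/TYPE comments
        if "{" in line:
            metric = line[:line.index("{")]
            value = line[line.rindex("}") + 1:].split()[0]
        else:
            metric, value = line.split()[:2]
        if metric == name:
            total += float(value)
    return total


def tok_per_s(before: str, after: str, dt_s: float,
              name: str = "vllm:generation_tokens_total") -> float:
    """Counter delta over the window; does not handle counter resets."""
    return (counter_total(after, name) - counter_total(before, name)) / dt_s


def spread(rates: dict[str, float], straggler_ratio: float = 0.8):
    """Mean, coefficient of variation, and pods below straggler_ratio x median."""
    vals = list(rates.values())
    mean = statistics.fmean(vals)
    cv = statistics.pstdev(vals) / mean
    cutoff = straggler_ratio * statistics.median(vals)
    return mean, cv, sorted(p for p, r in rates.items() if r < cutoff)


# Two scrapes of one pod, 300 s apart (sample text, not real engine output).
t0 = 'vllm:generation_tokens_total{model_name="m"} 1000000.0'
t1 = 'vllm:generation_tokens_total{model_name="m"} 2200000.0'
rate = tok_per_s(t0, t1, dt_s=300.0)

mean, cv, stragglers = spread({"pod-a": rate, "pod-b": 4100.0, "pod-c": 2900.0})
print(f"pod-a {rate:.0f} tok/s; fleet mean {mean:.0f}, CV {cv:.2f}, stragglers {stragglers}")
```

Size the fleet on the rate of healthy instances and chase the flagged pods separately; folding a straggler into the mean hides a placement problem inside a capacity number.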
- Measure trainer consumption and compute the matched fleet. Consumption is tokens (or samples) per optimizer step times steps per second. The calculator below computes the matched instance count with a utilization headroom derate, then demonstrates both imbalance signatures (idle trainer, growing stale backlog):
# rollout_sizing.py - executed: the rate-matching arithmetic for sizing an RL
# rollout fleet against a trainer pool, and the two imbalance signatures.
# Deterministic first-order model (stdlib only); rules of thumb, not benchmarks.
import math


def rollout_instances(consume_tok_s: float, per_instance_tok_s: float,
                      headroom: float = 0.8) -> int:
    """Instances so derated rollout supply covers trainer consumption.
    headroom < 1 derates nameplate throughput for sampling-parameter drift,
    weight-sync pauses, and stragglers (0.8 is a starting default, not a law)."""
    assert consume_tok_s > 0 and per_instance_tok_s > 0 and 0 < headroom <= 1
    return math.ceil(consume_tok_s / (per_instance_tok_s * headroom))


def trainer_idle_fraction(consume_tok_s: float, batch_tokens: float,
                          n_inst: int, per_instance_tok_s: float) -> float:
    """Steady-state trainer idle share when rollouts are the bottleneck.
    One pipelined cycle = max(step_time, batch production time)."""
    step_s = batch_tokens / consume_tok_s
    produce_s = batch_tokens / (n_inst * per_instance_tok_s)
    cycle = max(step_s, produce_s)
    return max(0.0, 1.0 - step_s / cycle)


def queue_growth(n_inst: int, per_instance_tok_s: float,
                 consume_tok_s: float, seconds: int) -> list[float]:
    """Unconsumed-rollout backlog (tokens) over time when supply > demand
    and no backpressure caps generation."""
    q, hist = 0.0, []
    for _ in range(seconds):
        q += max(0.0, n_inst * per_instance_tok_s - consume_tok_s)
        hist.append(q)
    return hist


# Worked example: GRPO-style job. 512 samples/optimizer step, mean 1536
# generated tokens/sample = 786,432 tokens/step; 20 s/step of trainer compute.
BATCH_TOKENS = 512 * 1536.0
CONSUME = BATCH_TOKENS / 20.0   # 39,321.6 tok/s consumed
PER_INST = 4_000.0              # measured tok/s per rollout instance

n = rollout_instances(CONSUME, PER_INST)   # ceil(39321.6 / 3200) = 13
assert n == 13 and n == math.ceil(39321.6 / (4000 * 0.8))

# Signature 1: undersized rollouts leave the trainer idle a computable share.
idle_8 = trainer_idle_fraction(CONSUME, BATCH_TOKENS, 8, PER_INST)
assert abs(idle_8 - (1 - 20.0 / 24.576)) < 1e-12   # produce_s = 24.576 s
idle_13 = trainer_idle_fraction(CONSUME, BATCH_TOKENS, 13, PER_INST)
assert idle_13 == 0.0                               # matched: no idle

# Signature 2: oversized rollouts without backpressure grow a stale backlog
# monotonically; those tokens age (off-policyness) or are thrown away.
hist = queue_growth(15, PER_INST, CONSUME, seconds=600)
assert all(b > a for a, b in zip(hist, hist[1:]))        # strictly monotone
assert abs(hist[-1] - 600 * (60_000 - CONSUME)) < 1e-6   # +20,678.4 tok/s

print(f"matched fleet: {n} rollout instances "
      f"(consumption {CONSUME:.1f} tok/s, per-instance {PER_INST:.0f} tok/s, headroom 0.8)")
print(f"undersized (8 instances): trainer idle {idle_8:.1%} of every cycle")
print(f"oversized (15 instances, no backpressure): backlog +{60_000 - CONSUME:.1f} tok/s, "
      f"{hist[-1] / 1e6:.1f}M stale tokens after 10 min")
print("all sizing assertions passed")
Output of the run:
matched fleet: 13 rollout instances (consumption 39321.6 tok/s, per-instance 4000 tok/s, headroom 0.8)
undersized (8 instances): trainer idle 18.6% of every cycle
oversized (15 instances, no backpressure): backlog +20678.4 tok/s, 12.4M stale tokens after 10 min
all sizing assertions passed
An 18.6% idle trainer pool is pure burn at trainer-GPU prices; 12.4 million stale tokens are either wasted rollout spend or off-policy drift you did not budget for.
- Recover free throughput before adding GPUs. Group sampling repeats each prompt prefix per group member; prefix caching and shared-prefix batching recover most of that work, and prompt deduplication in the training pass recovers the rest (rollout redundancy). Confirm the engines actually hit the cache:
curl -s http://$ROLLOUT_POD:8000/metrics | grep -E \
'vllm:prefix_cache_hits|vllm:prefix_cache_queries'
A low hit ratio on a group-sampling workload points to configuration, not capacity (continuous batching internals, KV cache management). After fixing it, re-measure per-instance throughput (the first step of this procedure); the matched count often drops.
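What counts as "low" depends on the group size. A rough yardstick, sketched below, assumes every prompt is sampled `group_size` times, only the first member pays for prefill, and the hit and query counters count the same unit. It ignores cache-block granularity, eviction, and group members that start before the first prefill has been cached, so treat it as an upper bound. The function names, the counter deltas, and the 0.5 x ideal cutoff are hypothetical.

```python
def ideal_group_hit_ratio(group_size: int) -> float:
    """Upper bound on the prompt cache hit ratio under group sampling:
    group_size - 1 of every group_size prompt passes could be served from cache."""
    assert group_size >= 1
    return (group_size - 1) / group_size


def observed_hit_ratio(hits_delta: float, queries_delta: float) -> float:
    """Hit ratio over a window, from deltas of the two prefix-cache counters."""
    return hits_delta / queries_delta if queries_delta > 0 else 0.0


GROUP_SIZE = 8  # hypothetical recipe setting
ideal = ideal_group_hit_ratio(GROUP_SIZE)
observed = observed_hit_ratio(hits_delta=2_100_000, queries_delta=6_000_000)

# Hypothetical rule: well under half the ideal suggests caching is off or thrashing.
misconfigured = observed < 0.5 * ideal

print(f"ideal <= {ideal:.3f}, observed {observed:.3f}, "
      f"{'check cache configuration' if misconfigured else 'cache is earning its keep'}")
```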
- Use async slack deliberately. If the recipe tolerates bounded weight staleness, a small sample queue smooths transient imbalance and weight-sync pauses; size the queue in optimizer steps against the staleness budget from the pre-check, and treat unbounded queues as a bug, not a buffer (async RL systems). Align the weight-sync cadence so rollout engines do not pause more often than the budget assumes (delta weight sync).
- Apply the rebalance on the platform. Resize the rollout deployment to the matched count; the trainer gang stays fixed. Keep rollout workers on the preemptible or burstable tier and the trainer gang protected: rollout loss costs a partial batch, trainer preemption costs a restart from checkpoint (gang-scheduled training recipe, Kueue quota).
- Wire the imbalance signals into standing alerts: trainer idle fraction (step time attributable to waiting on samples), rollout queue depth and oldest-sample age (staleness in steps, not just count), and tokens generated per trained token. These join the platform's training SLIs (training-platform SLOs); the north-star roll-up is goodput, useful trained tokens against every GPU-second both pools spend.
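The bounded queue from the async-slack step can be sketched as a variant of the calculator's backlog model with a cap. This is a first-order illustration, assuming FIFO consumption and a cap expressed as whole optimizer steps of tokens. The function name and the two-step staleness budget are hypothetical; the batch, step time, and per-instance rate reuse the worked example. Samples still in flight inside the engines add staleness that this model does not count.

```python
def queue_with_backpressure(n_inst: int, per_instance_tok_s: float,
                            consume_tok_s: float, cap_tokens: float,
                            seconds: int) -> list[float]:
    """Backlog (tokens) per second when generation is throttled at a cap."""
    supply = n_inst * per_instance_tok_s
    q, hist = 0.0, []
    for _ in range(seconds):
        # Tokens that fit this second: free space plus what the trainer drains.
        room = max(0.0, cap_tokens - q + consume_tok_s)
        produced = min(supply, room)
        q = min(cap_tokens, max(0.0, q + produced - consume_tok_s))
        hist.append(q)
    return hist


BATCH_TOKENS = 512 * 1536.0       # tokens per optimizer step (worked example)
CONSUME = BATCH_TOKENS / 20.0     # trainer consumption, tok/s
STALENESS_BUDGET_STEPS = 2        # hypothetical recipe tolerance
cap = STALENESS_BUDGET_STEPS * BATCH_TOKENS

hist = queue_with_backpressure(15, 4_000.0, CONSUME, cap, seconds=600)
queued_steps = hist[-1] / BATCH_TOKENS
fleet_busy = CONSUME / (15 * 4_000.0)  # share of rollout capacity actually used

print(f"backlog settles at {queued_steps:.2f} steps of samples (cap {STALENESS_BUDGET_STEPS})")
print(f"oversized fleet runs {fleet_busy:.1%} busy once the cap binds")
```

The cap converts the oversizing from a staleness problem into a visible utilization number, which is the signal to shrink the pool rather than let samples age.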
Verification
- Trainer idle below target. Over a representative window (several hundred optimizer steps), the sample-wait share of step time sits under the agreed threshold; the calculator's matched count predicted this, now the fleet confirms it.
- Queue bounded and fresh. Rollout backlog oscillates under the cap and the oldest queued sample stays inside the staleness budget in optimizer steps.
- Cost per optimizer step recomputed and at or below the pre-change baseline; if throughput rose but cost per step rose faster, the rebalance bought the wrong thing.
- No new pressure signals. vllm:num_requests_waiting stays bounded on the rollout engines and the prefix-cache hit ratio held after the resize.
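The first three checks can be expressed as a small gate over before and after windows. This is a sketch under stated assumptions: the `Window` fields, the GPU counts, the hourly prices, and the thresholds are all hypothetical, and the step times reuse the worked example (24.576 s per cycle with 8 instances, 20 s once matched). The fourth check stays a dashboard review of the engine metrics.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Window:
    """Averages over a representative window of optimizer steps."""
    step_s: float                   # wall-clock seconds per optimizer step
    sample_wait_s: float            # of which: waiting on rollout samples
    oldest_sample_age_steps: float  # staleness of the oldest queued sample
    trainer_gpus: int
    rollout_gpus: int


def cost_per_step(w: Window, trainer_gpu_hr: float, rollout_gpu_hr: float) -> float:
    """Both pools are billed for the whole step, busy or idle."""
    hourly = w.trainer_gpus * trainer_gpu_hr + w.rollout_gpus * rollout_gpu_hr
    return w.step_s / 3600.0 * hourly


def verify(before: Window, after: Window, idle_target: float,
           staleness_budget_steps: float, trainer_gpu_hr: float,
           rollout_gpu_hr: float) -> dict[str, bool]:
    """Evaluate the idle, freshness, and cost checks for the new allocation."""
    cost_before = cost_per_step(before, trainer_gpu_hr, rollout_gpu_hr)
    cost_after = cost_per_step(after, trainer_gpu_hr, rollout_gpu_hr)
    return {
        "trainer_idle_below_target": after.sample_wait_s / after.step_s <= idle_target,
        "queue_fresh": after.oldest_sample_age_steps <= staleness_budget_steps,
        "cost_per_step_not_worse": cost_after <= cost_before,
    }


# Hypothetical fleet: 64 trainer GPUs, one GPU per rollout instance.
before = Window(24.576, 4.576, 0.0, trainer_gpus=64, rollout_gpus=8)
after = Window(20.0, 0.4, 1.0, trainer_gpus=64, rollout_gpus=13)
checks = verify(before, after, idle_target=0.05, staleness_budget_steps=2,
                trainer_gpu_hr=2.0, rollout_gpu_hr=1.0)
print(checks)
```

In this example five extra rollout GPUs lower cost per step, because the trainer pool stops billing for idle time; the same gate fails the cost check when added capacity buys no step-time reduction.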
Rollback
- Restore the recorded pool sizes. Resizing is cheap and stateless on the rollout side; the previous allocation was captured in the pre-checks, so the revert is one deployment change.
- Do not revert the redundancy fixes (prefix caching, dedup) with the sizing change; they are independent wins and keeping them makes the retry cheaper.
- If trainer idle persists at the matched count, the model is wrong somewhere: re-measure per-instance throughput at true production settings and re-check weight-sync pauses before adding more instances.
Related runbooks
- the MFU-regression runbook: the trainer-side throughput triage when the trainer, not the fleet, is slow.
- the NCCL-hang runbook: rule out a wedged trainer before resizing.
- the checkpoint-recovery runbook: trainer restarts that the protected-tier placement exists to avoid.
- operational runbooks: operational runbooks index.
References
- vLLM production metrics (vllm:generation_tokens, vllm:prefix_cache_hits, vllm:num_requests_waiting): https://docs.vllm.ai/en/latest/usage/metrics.html
- vLLM engine arguments (batching and cache knobs behind the throughput measurement): https://docs.vllm.ai/en/latest/configuration/engine_args/
- verl HybridFlow (trainer/rollout hybrid architecture and worker groups): https://verl.readthedocs.io/en/latest/hybrid_flow.html
- verl (RL post-training framework with disaggregated rollout workers): https://github.com/volcengine/verl
- OpenRLHF (Ray-based disaggregated RLHF with vLLM rollout engines): https://github.com/OpenRLHF/OpenRLHF
Related: Disaggregation rate matching · Async RL systems · Rollout redundancy · Delta weight sync · Continuous batching internals · Training-platform SLOs · Goodput · Build-vs-rent cost model · Operational runbooks · Glossary