
Runbook: rollout fleet sizing and rebalancing

Scope: sizing and rebalancing the rollout fleet against the trainer pool in disaggregated RL post-training. This is the rate-matching step that decides whether trainer GPUs wait on samples or rollout GPUs generate tokens nobody consumes. It is the RL-training sibling of prefill/decode rate matching: two pools with different workloads must sustain the same rate (here tokens or samples per second rather than requests), and the ratio is a live control variable, not an install-time constant.

Run this when trainer GPUs sit idle waiting for rollouts, the rollout output queue grows without bound, a new model or workload lands on the RL training API, the generation-length mix shifts, or cost per optimizer step drifts up. Severity: cost and goodput, page-worthy only if a delivery SLO on run completion is also at risk. Nothing is down; one pool is starving the other.

The commands and code in this runbook are reference templates against real APIs; pin versions and validate before production use. The sizing calculator below is executed and asserted, but its numbers are a worked example and every default in it is a starting rule of thumb, not a measured law.

In a disaggregated RL system the rollout fleet is an inference deployment (vLLM-class engines generating trajectories) and the trainer is a gang-scheduled distributed job consuming them; the split, and the off-policy staleness budget that lets the two run without lockstep, are covered in async RL systems. Weight synchronization couples the pools on a cadence of its own (delta weight sync). Before buying capacity, remember that group-sampling RL wastes most of its rollout tokens on repeated prompt prefixes, and that waste is recoverable in software (rollout redundancy).

Trigger

Trainer idle, rollouts saturated. Optimizer steps stretch because the sample queue is empty; rollout engines run at full batch. The fleet is undersized (or slow).
Rollout backlog growing. Generated-but-unconsumed samples accumulate; with no backpressure the backlog ages into off-policy staleness or gets discarded. The fleet is oversized, or the trainer regressed.
Workload changed. New base model, new max generation length, new group size, or a different prompt mix landed on the API; the old ratio is stale by construction.
Cost per optimizer step drifting up at constant model and batch size: one pool is paying for the other's bottleneck.

Pre-checks

Confirm the trainer itself is healthy. A trainer stalled on a collective or a bad node presents as "rollouts can't keep up"; rule out the NCCL-hang path and MFU regression before resizing anything.
Confirm weight sync is not the stall. If rollout engines pause long per update, the fix is sync cadence or delta weight sync, not more instances; measure the pause separately.
Capture both rates under production settings. Rollout throughput moves hard with sampling parameters, group size, and max generation length; a rate measured at the wrong temperature or length cap sizes the wrong fleet.
Record the current allocation and its cost (instances per pool, GPU types) so the rebalance has a baseline and a rollback target.
Check the algorithm's staleness budget. How many steps of weight lag the recipe tolerates (importance-sampling corrections, bounded lag) sets how much queue smoothing is allowed between the pools (async RL systems).
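
The recorded allocation becomes a usable baseline once it is reduced to a single number, cost per optimizer step. The following is a minimal sketch of that arithmetic; the pool shapes, hourly prices, and the `Pool` and `cost_per_step` names are hypothetical placeholders, not values from any real deployment.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pool:
    """One pool in the recorded allocation (all values are placeholders)."""
    name: str
    instances: int
    gpus_per_instance: int
    usd_per_gpu_hour: float  # hypothetical price; substitute your own

    def usd_per_second(self) -> float:
        return self.instances * self.gpus_per_instance * self.usd_per_gpu_hour / 3600.0


def cost_per_step(pools: list[Pool], wall_s_per_step: float) -> float:
    """Both pools bill for the full wall-clock step, idle or not."""
    assert wall_s_per_step > 0
    return sum(p.usd_per_second() for p in pools) * wall_s_per_step


baseline = [
    Pool("trainer", instances=4, gpus_per_instance=8, usd_per_gpu_hour=3.0),
    Pool("rollout", instances=8, gpus_per_instance=1, usd_per_gpu_hour=2.0),
]
# Wall-clock step time includes any wait on samples, so an idle trainer
# raises cost per step even though the allocation did not change.
for wall_s in (20.0, 24.576):
    print(f"{wall_s:.3f} s/step -> ${cost_per_step(baseline, wall_s):.4f} per step")
```

Keeping the baseline in this form gives the verification and rollback sections one number to compare against after the resize.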

Flow

flowchart TB
    A["Imbalance suspected"] --> B["Measure: rollout tok/s per instance,<br/>trainer consumption tok/s"]
    B --> C{"Which side starves?"}
    C -->|"trainer idle"| D["Recover free throughput first:<br/>prefix caching, dedup, batching"]
    D --> E{"Still short?"}
    E -->|"yes"| F["Add rollout instances to the matched count"]
    E -->|"no"| G["Re-measure and verify"]
    C -->|"rollout backlog grows"| H["Shrink rollout pool, or raise trainer<br/>throughput / batch; cap queue with backpressure"]
    F --> G
    H --> G
    G --> I["Wire idle-fraction and queue-age alerts"]
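
The first decision in the flow can be expressed as a small function over the two measured signals. This is only a sketch: `starving_side` and its `idle_threshold` default are hypothetical, and the returned strings simply restate the branches of the flowchart.

```python
def starving_side(trainer_idle_frac: float, backlog_growth_tok_s: float,
                  idle_threshold: float = 0.05) -> str:
    """Map the two measured signals onto the branches of the flow.
    idle_threshold is a hypothetical default, not a recommendation."""
    assert 0.0 <= trainer_idle_frac <= 1.0
    if trainer_idle_frac > idle_threshold:
        return "trainer idle: recover free throughput, then add rollout instances"
    if backlog_growth_tok_s > 0:
        return "rollout backlog grows: shrink rollout pool or raise trainer throughput"
    return "balanced: re-measure and verify"


print(starving_side(0.186, 0.0))
print(starving_side(0.0, 20_678.4))
print(starving_side(0.01, 0.0))
```

Checking the idle branch first reflects the runbook's cost framing: an idle trainer gang is usually the more expensive of the two imbalances.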

Procedure

Measure per-instance rollout throughput at production settings. Scrape the engines under the real recipe (sampling parameters, group size, length caps), not a synthetic load:

# Raw generation counter and queue gauges for one rollout instance:
curl -s http://$ROLLOUT_POD:8000/metrics | grep -E \
  'vllm:generation_tokens|vllm:num_requests_running|vllm:num_requests_waiting'
# Per-instance generation rate (tokens/s) over a 5-minute window, in PromQL:
#   rate(vllm:generation_tokens_total[5m])   # one series per pod
# (prometheus_client exposes Counters with the _total suffix)

Record the mean and the spread across instances; a wide spread is a straggler or placement problem, not a sizing problem.
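
When Prometheus is not at hand, the same rate and spread can be computed from two counter readings per pod. The sketch below assumes hypothetical pod names and counter values, and the `straggler_frac` cut-off is an illustrative default. It does not handle counter resets, which PromQL `rate()` does.

```python
import statistics


def tok_rate(count_start: float, count_end: float, window_s: float) -> float:
    """Tokens/s from two readings of a monotonically increasing counter."""
    assert window_s > 0 and count_end >= count_start
    return (count_end - count_start) / window_s


def summarize(rates: dict[str, float], straggler_frac: float = 0.8):
    """Mean, coefficient of variation, and pods below straggler_frac * median.
    straggler_frac is a hypothetical cut-off; tune it per fleet."""
    vals = list(rates.values())
    mean = statistics.fmean(vals)
    cv = statistics.pstdev(vals) / mean
    median = statistics.median(vals)
    slow = sorted(p for p, r in rates.items() if r < straggler_frac * median)
    return mean, cv, slow


# Hypothetical generation-token counter readings taken 300 s apart.
samples = {
    "rollout-0": (1_000_000, 2_200_000),
    "rollout-1": (5_000_000, 6_230_000),
    "rollout-2": (0, 1_170_000),
    "rollout-3": (2_000_000, 2_720_000),   # deliberately slow pod
}
rates = {pod: tok_rate(a, b, 300.0) for pod, (a, b) in samples.items()}
mean, cv, slow = summarize(rates)
print(f"mean {mean:.0f} tok/s, cv {cv:.2f}, stragglers: {slow}")
```

Comparing against the median rather than the mean keeps one slow pod from dragging down the reference it is judged against.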

Measure trainer consumption and compute the matched fleet. Consumption is tokens (or samples) per optimizer step times steps per second. The calculator below is executed and asserted; it computes the matched instance count with a utilization headroom derate, then demonstrates both imbalance signatures (idle trainer, growing stale backlog):

# rollout_sizing.py - executed: the rate-matching arithmetic for sizing an RL
# rollout fleet against a trainer pool, and the two imbalance signatures.
# Deterministic first-order model (stdlib only); rules of thumb, not benchmarks.
import math


def rollout_instances(consume_tok_s: float, per_instance_tok_s: float,
                      headroom: float = 0.8) -> int:
    """Instances so derated rollout supply covers trainer consumption.
    headroom < 1 derates nameplate throughput for sampling-parameter drift,
    weight-sync pauses, and stragglers (0.8 is a starting default, not a law)."""
    assert consume_tok_s > 0 and per_instance_tok_s > 0 and 0 < headroom <= 1
    return math.ceil(consume_tok_s / (per_instance_tok_s * headroom))


def trainer_idle_fraction(consume_tok_s: float, batch_tokens: float,
                          n_inst: int, per_instance_tok_s: float) -> float:
    """Steady-state trainer idle share when rollouts are the bottleneck.
    One pipelined cycle = max(step_time, batch production time)."""
    step_s = batch_tokens / consume_tok_s
    produce_s = batch_tokens / (n_inst * per_instance_tok_s)
    cycle = max(step_s, produce_s)
    return max(0.0, 1.0 - step_s / cycle)


def queue_growth(n_inst: int, per_instance_tok_s: float,
                 consume_tok_s: float, seconds: int) -> list[float]:
    """Unconsumed-rollout backlog (tokens) over time when supply > demand
    and no backpressure caps generation."""
    q, hist = 0.0, []
    for _ in range(seconds):
        q += max(0.0, n_inst * per_instance_tok_s - consume_tok_s)
        hist.append(q)
    return hist


# Worked example: GRPO-style job. 512 samples/optimizer step, mean 1536
# generated tokens/sample = 786,432 tokens/step; 20 s/step of trainer compute.
BATCH_TOKENS = 512 * 1536.0
CONSUME = BATCH_TOKENS / 20.0                    # 39,321.6 tok/s consumed
PER_INST = 4_000.0                               # measured tok/s per rollout instance

n = rollout_instances(CONSUME, PER_INST)         # ceil(39321.6 / 3200) = 13
assert n == 13 and n == math.ceil(39321.6 / (4000 * 0.8))

# Signature 1: undersized rollouts leave the trainer idle a computable share.
idle_8 = trainer_idle_fraction(CONSUME, BATCH_TOKENS, 8, PER_INST)
assert abs(idle_8 - (1 - 20.0 / 24.576)) < 1e-12        # produce_s = 24.576 s
idle_13 = trainer_idle_fraction(CONSUME, BATCH_TOKENS, 13, PER_INST)
assert idle_13 == 0.0                                    # matched: no idle

# Signature 2: oversized rollouts without backpressure grow a stale backlog
# monotonically; those tokens age (off-policyness) or are thrown away.
hist = queue_growth(15, PER_INST, CONSUME, seconds=600)
assert all(b > a for a, b in zip(hist, hist[1:]))        # strictly monotone
assert abs(hist[-1] - 600 * (60_000 - CONSUME)) < 1e-6   # +20,678.4 tok/s

print(f"matched fleet: {n} rollout instances "
      f"(consumption {CONSUME:.1f} tok/s, per-instance {PER_INST:.0f} tok/s, headroom 0.8)")
print(f"undersized (8 instances): trainer idle {idle_8:.1%} of every cycle")
print(f"oversized (15 instances, no backpressure): backlog +{60_000 - CONSUME:.1f} tok/s, "
      f"{hist[-1] / 1e6:.1f}M stale tokens after 10 min")
print("all sizing assertions passed")

Output of the run, abridged: matched fleet of 13 rollout instances; undersized (8 instances), trainer idle 18.6% of every cycle; oversized (15 instances, no backpressure), backlog grows at +20,678.4 tok/s and reaches 12.4M stale tokens after 10 min; all sizing assertions passed. An 18.6% idle trainer pool is pure burn at trainer-GPU prices; 12.4 million stale tokens are either wasted rollout spend or off-policy drift you did not budget for.
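
A backlog in tokens is easier to judge once it is converted into optimizer steps of lag, the unit the staleness budget uses. The sketch below reuses the worked-example numbers and assumes a FIFO queue with one policy update per optimizer step. The 2-step budget and both function names are hypothetical.

```python
BATCH_TOKENS = 512 * 1536.0          # tokens per optimizer step (worked example)
CONSUME = BATCH_TOKENS / 20.0        # tok/s the trainer consumes


def backlog_lag_steps(backlog_tokens: float, batch_tokens: float) -> float:
    """Optimizer steps the trainer needs to drain the backlog (FIFO queue)."""
    assert batch_tokens > 0 and backlog_tokens >= 0
    return backlog_tokens / batch_tokens


def seconds_to_exhaust_budget(surplus_tok_s: float, batch_tokens: float,
                              budget_steps: float) -> float:
    """Time until an uncapped backlog exceeds budget_steps of lag."""
    assert surplus_tok_s > 0 and batch_tokens > 0
    return budget_steps * batch_tokens / surplus_tok_s


surplus = 15 * 4_000.0 - CONSUME                 # tok/s of oversupply at 15 instances
backlog_10min = 600 * surplus                    # tokens queued after 10 minutes
lag = backlog_lag_steps(backlog_10min, BATCH_TOKENS)
t = seconds_to_exhaust_budget(surplus, BATCH_TOKENS, budget_steps=2)
print(f"lag after 10 min: {lag:.1f} steps; a 2-step budget is exhausted in {t:.0f} s")
```

Under these assumptions the oversized fleet blows through a small staleness budget in roughly a minute, which is why the queue needs a cap rather than monitoring alone.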

Recover free throughput before adding GPUs. Group sampling repeats each prompt prefix per group member; prefix caching and shared-prefix batching recover most of that work, and prompt deduplication in the training pass recovers the rest (rollout redundancy). Confirm the engines actually hit the cache:

curl -s http://$ROLLOUT_POD:8000/metrics | grep -E \
  'vllm:prefix_cache_hits|vllm:prefix_cache_queries'

A low hit ratio on a group-sampling workload points to a configuration problem, not a capacity shortfall (continuous batching internals, KV cache management). After fixing it, re-measure per-instance throughput (the first step of this procedure); the matched count often drops.
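
The hit ratio over a window is the change in hits divided by the change in queries between two scrapes. The sketch below parses the Prometheus text format by hand. The sample payloads, the `model_name` label, and the helper names are assumptions, and it expects sample lines without trailing timestamps.

```python
HITS = "vllm:prefix_cache_hits"
QUERIES = "vllm:prefix_cache_queries"


def counter_value(metrics_text: str, base_name: str) -> float:
    """Sum all label sets of one counter in Prometheus text exposition format.
    Accepts the name with or without the _total suffix and skips the
    companion _created series."""
    wanted = {base_name, base_name + "_total"}
    total = 0.0
    for line in metrics_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        name = line.split("{", 1)[0].split(" ", 1)[0]
        if name in wanted:
            total += float(line.rsplit(" ", 1)[1])
    return total


def hit_ratio(before: str, after: str) -> float:
    """Windowed hit ratio from two scrapes; NaN if nothing was queried."""
    dq = counter_value(after, QUERIES) - counter_value(before, QUERIES)
    dh = counter_value(after, HITS) - counter_value(before, HITS)
    return dh / dq if dq > 0 else float("nan")


# Hypothetical scrape payloads, a few minutes apart.
BEFORE = '''# TYPE vllm:prefix_cache_queries counter
vllm:prefix_cache_queries_total{model_name="m"} 1000.0
vllm:prefix_cache_hits_total{model_name="m"} 100.0
vllm:prefix_cache_hits_created{model_name="m"} 1.7e9
'''
AFTER = '''vllm:prefix_cache_queries_total{model_name="m"} 6000.0
vllm:prefix_cache_hits_total{model_name="m"} 4100.0
vllm:prefix_cache_hits_created{model_name="m"} 1.7e9
'''
ratio = hit_ratio(BEFORE, AFTER)
print(f"prefix-cache hit ratio over the window: {ratio:.0%}")
```

Using deltas rather than lifetime totals matters after a configuration change: the lifetime ratio stays diluted by everything the engine served before the fix.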

Use async slack deliberately. If the recipe tolerates bounded weight staleness, a small sample queue smooths transient imbalance and weight-sync pauses; size the queue in optimizer steps against the staleness budget from the pre-check, and treat unbounded queues as a bug, not a buffer (async RL systems). Align the weight-sync cadence so rollout engines do not pause more often than the budget assumes (delta weight sync).
Apply the rebalance on the platform. Resize the rollout deployment to the matched count; the trainer gang stays fixed. Keep rollout workers on the preemptible or burstable tier and the trainer gang protected: rollout loss costs a partial batch, trainer preemption costs a restart from checkpoint (gang-scheduled training recipe, Kueue quota).
Wire the imbalance signals into standing alerts. Alert on three signals: trainer idle fraction (the share of step time attributable to waiting on samples), rollout queue depth together with oldest-sample age (staleness in steps, not just count), and tokens generated per trained token. These join the platform's training SLIs (training-platform SLOs); the north-star roll-up is goodput, useful trained tokens against every GPU-second both pools spend.
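
The bounded queue from the async-slack step can be sketched by capping the backlog at the staleness budget expressed in tokens. This is a first-order model in the same spirit as the calculator; `capped_backlog` and the 2-step `BUDGET_STEPS` are hypothetical, and a real system would apply the backpressure in the rollout scheduler rather than in a loop like this.

```python
def capped_backlog(n_inst: int, per_instance_tok_s: float, consume_tok_s: float,
                   cap_tokens: float, seconds: int) -> list[float]:
    """Backlog (tokens) per second when generation is throttled at cap_tokens."""
    assert cap_tokens >= 0 and seconds >= 0
    q, hist = 0.0, []
    for _ in range(seconds):
        net = n_inst * per_instance_tok_s - consume_tok_s
        # Backpressure: engines stop admitting work once the queue is full,
        # so the backlog never exceeds the cap (and never goes negative).
        q = min(cap_tokens, max(0.0, q + net))
        hist.append(q)
    return hist


BATCH_TOKENS = 512 * 1536.0                       # tokens per optimizer step
CONSUME = BATCH_TOKENS / 20.0                     # tok/s the trainer consumes
BUDGET_STEPS = 2                                  # hypothetical staleness budget
cap = BUDGET_STEPS * BATCH_TOKENS                 # queue cap in tokens

hist = capped_backlog(15, 4_000.0, CONSUME, cap, seconds=600)
assert max(hist) <= cap
print(f"cap {cap / 1e6:.2f}M tokens, backlog after 10 min {hist[-1] / 1e6:.2f}M tokens")
```

With the cap in place the oversized fleet idles part of the time instead of producing stale samples, so the waste shows up as rollout underutilization, a signal that is cheap to alert on.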

Verification

Trainer idle below target. Over a representative window (several hundred optimizer steps), the sample-wait share of step time sits under the agreed threshold; the calculator's matched count predicted this, and the fleet now confirms it.
Queue bounded and fresh. Rollout backlog oscillates under the cap and the oldest queued sample stays inside the staleness budget in optimizer steps.
Cost per optimizer step recomputed and at or below the pre-change baseline; if throughput rose but cost per step rose faster, the rebalance bought the wrong thing.
No new pressure signals. vllm:num_requests_waiting stays bounded on the rollout engines and the prefix-cache hit ratio held after the resize.
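
The first three checks above lend themselves to a single pass/fail gate over the measured window. The sketch below is one possible shape; the `verify` function, its argument names, and the 5% idle target are hypothetical, and the engine-side signals (waiting requests, cache hit ratio) are left to the metrics dashboards.

```python
def verify(sample_wait_s: float, step_wall_s: float,
           oldest_age_steps: float, budget_steps: float,
           cost_per_step: float, baseline_cost_per_step: float,
           idle_target: float = 0.05) -> list[str]:
    """Return the failed checks; an empty list means the rebalance holds.
    Inputs are averages over the verification window; idle_target is a
    hypothetical threshold, agree on the real one per run."""
    assert step_wall_s > 0
    failures = []
    if sample_wait_s / step_wall_s > idle_target:
        failures.append("trainer idle above target")
    if oldest_age_steps > budget_steps:
        failures.append("oldest queued sample outside staleness budget")
    if cost_per_step > baseline_cost_per_step:
        failures.append("cost per optimizer step above baseline")
    return failures


# Hypothetical window averages after the resize.
print(verify(sample_wait_s=0.4, step_wall_s=20.4,
             oldest_age_steps=1.2, budget_steps=2,
             cost_per_step=0.61, baseline_cost_per_step=0.76))
# Same window shape before the resize, with the trainer waiting on samples.
print(verify(sample_wait_s=4.576, step_wall_s=24.576,
             oldest_age_steps=0.0, budget_steps=2,
             cost_per_step=0.76, baseline_cost_per_step=0.76))
```

Returning the list of failures rather than a boolean makes the gate usable directly as the body of an alert or a rollback decision.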

Rollback

Restore the recorded pool sizes. Resizing is cheap and stateless on the rollout side; the previous allocation was captured in the pre-checks, so the revert is one deployment change.
Do not revert the redundancy fixes (prefix caching, dedup) with the sizing change; they are independent wins and keeping them makes the retry cheaper.
If trainer idle persists at the matched count, the model is wrong somewhere: re-measure per-instance throughput at true production settings and re-check weight-sync pauses before adding more instances.

the MFU-regression runbook: the trainer-side throughput triage when the trainer, not the fleet, is slow.
the NCCL-hang runbook: rule out a wedged trainer before resizing.
the checkpoint-recovery runbook: trainer restarts that the protected-tier placement exists to avoid.
operational runbooks: operational runbooks index.

References

vLLM production metrics (vllm:generation_tokens, vllm:prefix_cache_hits, vllm:num_requests_waiting): https://docs.vllm.ai/en/latest/usage/metrics.html
vLLM engine arguments (batching and cache knobs behind the throughput measurement): https://docs.vllm.ai/en/latest/configuration/engine_args/
verl HybridFlow (trainer/rollout hybrid architecture and worker groups): https://verl.readthedocs.io/en/latest/hybrid_flow.html
verl (RL post-training framework with disaggregated rollout workers): https://github.com/volcengine/verl
OpenRLHF (Ray-based disaggregated RLHF with vLLM rollout engines): https://github.com/OpenRLHF/OpenRLHF