Markdown

Runbook: RL training-step data-path review¶

Scope: a design-review procedure that audits an RL post-training system as a distributed system: what moves over the wire in one training step, what the step serializes on, how much weight-lag the rollouts tolerate, and whether the placement (colocated or disaggregated) matches the traffic. Run it against an in-house stack or an adopted framework (verl, OpenRLHF, NeMo-RL, slime, TRL; the field map is RL libraries overview) before the stack backs a managed RL training API.

Run this when adopting or building an RL training stack, when onboarding a new framework version, or when an existing run shows an unexplained step-time or cost regression. Severity: design gate, not an incident; skipping it is how a service ships with a weight broadcast on the critical path of every customer step.

Reference templates on real APIs; pin versions and validate before production use.

An RL step is a pipeline of dissimilar workloads: rollout generation is inference, reward scoring is inference or plain code, the policy update is training, and a weight sync stitches them together. Each has its own natural placement and its own traffic, so the review's job is to make every byte and every serialization explicit before the scheduler, quota, and billing layers are built on top. The systems background lives in async RL systems (rollout/trainer split, staleness), delta weight sync (the sync bottleneck), rollout redundancy (prompt dedup and cascade attention), and comms-compute overlap (hiding collectives behind compute).

Trigger¶

A new RL training stack (in-house or framework) is proposed for the managed API, or a framework upgrade changes its rollout, sync, or parallelism machinery.
Step time or cost regressed on an existing stack with no model or data change: something new is on the critical path.
The colocate-vs-disaggregate decision is being made (or remade) for a new GPU pool shape or network tier.
A customer-visible knob (model size, LoRA vs full fine-tune, group size) is being priced, and the price needs a traffic model behind it.

Pre-checks¶

Fix the reference workload. Pick one model size, one rollout batch shape (prompts x samples per prompt x average tokens), and one sync cadence; the review quantifies this one configuration, then parameterizes. Record it in the decision record.
Identify the framework's execution mode. On-policy synchronous (every step waits for fresh rollouts), one-step-off pipelined, or fully async with a staleness bound; the framework docs and the async RL page define the vocabulary.
Confirm profiling access. The verification step needs a per-phase timeline; torch.profiler or nsys must be runnable on the stack (Nsight profiling workflow).
Note the network tiers available: NVLink domain, intra-cluster RDMA, or commodity/cross-site links. The same design reviews differently on different fabrics (networking fabric).

Flow¶

flowchart TB
 A["New / changed RL stack"] --> B["1. Enumerate actors and placement"]
 B --> C["2. Quantify bytes per step\n(weight sync, rollouts, gradients)"]
 C --> D["3. Staleness budget:\nhow stale may rollout weights be?"]
 D --> E["4. Overlap audit:\nwhat serializes the step?"]
 E --> F["5. Collectives and topology fit"]
 F --> G{"Traffic and lag acceptable\non the target fabric?"}
 G -->|"yes"| H["6. Decision record: placement,\nsync mode, rate-matching plan"]
 G -->|"no"| I["Change sync mode / placement\n(delta, LoRA, colocate) and re-run 2-5"]
 I --> C
 H --> J["Verify: measured step breakdown\nmatches the paper budget"]

Procedure¶

Step 1: enumerate the actors and draw the data path. For one step, list: rollout workers (which inference engine, how many replicas), trainer ranks (parallelism layout: DP/FSDP/TP), reward scoring (reward model, verifier code, or judge), any replay/buffer service, and the weight-sync channel. For each edge, write down direction, payload type, and trigger (per step, per batch, per sync interval). A colocated engine receives weights through a CUDA-IPC handle (about 64 bytes on the wire, per delta weight sync); a disaggregated one pays real bandwidth. If an edge cannot be named, the stack is not understood well enough to host customer jobs.

Step 2: quantify the bytes. The calculator below is executed and asserted; edit the assumption block for the stack under review. It prices the three weight-sync modes, the rollout payload, and the per-rank gradient traffic for a 14B-class dense policy (all inputs are labelled assumptions; the 1-3% delta fraction comes from delta weight sync):

# rl_step_bytes.py - executed: first-order bytes-per-step calculator for one RL
# training step's data path. All inputs are stated assumptions for a 14B-class
# dense policy; edit them for the stack under review. Deterministic, no RNG.
GB = 1e9  # decimal gigabyte

P = 14e9                    # policy parameters (14B-class, assumption)
BYTES_BF16 = 2
LORA_FRACTION = 0.005       # adapter params as share of base (assumption; <1% typical)
DELTA_FRACTION = 0.02       # weights changed per step (1-3% per delta-weight-sync page)
DP_RANKS = 8                # trainer data-parallel width (assumption)

PROMPTS, SAMPLES, AVG_TOKENS = 256, 8, 4096   # rollout batch shape (assumption)
SYNCS_PER_HOUR = 60                            # one sync per minute (assumption)


def ring_allreduce_bytes(msg_bytes: float, k: int) -> float:
    """Per-rank wire traffic of a ring all-reduce: 2*(k-1)/k * message size."""
    assert k >= 2
    return 2.0 * (k - 1) / k * msg_bytes


full_sync = P * BYTES_BF16                       # full bf16 weight broadcast
lora_sync = P * LORA_FRACTION * BYTES_BF16       # adapter-only sync
delta_sync = full_sync * DELTA_FRACTION          # sparse delta payload (indexes ignored)
rollout_tokens = PROMPTS * SAMPLES * AVG_TOKENS
rollout_payload = rollout_tokens * (4 + 4)       # int32 token ids + fp32 logprobs
grad_wire = ring_allreduce_bytes(P * BYTES_BF16, DP_RANKS)

assert full_sync == 28.0 * GB                    # 14e9 * 2 bytes
assert abs(delta_sync - 0.56 * GB) < 1e-6
assert abs(lora_sync - 0.14 * GB) < 1e-6
assert abs(grad_wire - 49.0 * GB) < 1e-6         # 2*(7/8)*28 GB at 8 ranks
assert lora_sync < delta_sync < full_sync
assert abs(full_sync / delta_sync - 50.0) < 1e-9 and abs(full_sync / lora_sync - 200.0) < 1e-9
assert rollout_payload < 0.005 * full_sync       # rollout wire is noise next to weights

print(f"weight sync per step : full bf16 {full_sync / GB:.2f} GB | "
      f"delta ({DELTA_FRACTION:.0%}) {delta_sync / GB:.2f} GB | "
      f"LoRA ({LORA_FRACTION:.1%}) {lora_sync / GB:.2f} GB")
print(f"rollout payload      : {rollout_tokens / 1e6:.1f}M tokens = "
      f"{rollout_payload / GB:.3f} GB (ids + logprobs)")
print(f"gradient all-reduce  : {grad_wire / GB:.2f} GB/rank/step at DP={DP_RANKS} (ring)")
print(f"per hour at {SYNCS_PER_HOUR} syncs: full {full_sync * SYNCS_PER_HOUR / 1e12:.2f} TB | "
      f"delta {delta_sync * SYNCS_PER_HOUR / GB:.1f} GB | LoRA {lora_sync * SYNCS_PER_HOUR / GB:.1f} GB")
print("all data-path assertions passed")

Output of the run: full bf16 sync 28.00 GB per step against 0.56 GB for a 2% delta (50x) and 0.14 GB for adapter-only sync (200x); the 8.4M-token rollout payload is 0.067 GB, noise next to the weights; the trainer's ring all-reduce moves 49.00 GB per rank per step at DP=8; and at one sync per minute the hourly wire cost is 1.68 TB full versus 33.6 GB delta versus 8.4 GB LoRA. The review conclusion is usually visible right here: on anything but an RDMA-adjacent pool, full-weight sync per step dominates every other flow, which is why the sync mode is a product decision, not an implementation detail.

Step 3: audit the staleness budget. Establish how many optimizer steps of weight lag the rollout fleet may run behind, and where the correction happens: on-policy modes force a barrier every step; async modes tolerate bounded lag and reweight with truncated importance sampling (async RL systems). Ask the stack three questions: what enforces the bound (a version check, a queue depth, nothing), what happens to in-flight generations when weights update (finish on old weights, abort, or mixed), and does the trainer's loss correct for the logprob gap between the rollout engine and the trainer. "Nothing enforces it" is a finding, not a footnote.

Step 4: audit the overlap. For each pair of phases (rollout generation, reward scoring, weight sync, optimizer step), record whether they overlap or serialize, and what enforces it. The two overlaps that pay most: weight sync overlapped with the tail of rollout generation, and reward scoring streamed while decode continues. Confirm on a timeline rather than in the architecture diagram: run one step under the profiler and look for sync collectives sitting in gaps where compute idles (comms-compute overlap, Nsight profiling workflow):

# Reference template: capture one training step's timeline.
nsys profile -o rl_step --trace=cuda,nvtx,osrt -- <one-step launch cmd>

Step 5: check the collective paths and topology fit. List the collectives one step issues (gradient all-reduce or FSDP reduce-scatter/all-gather on the trainer; the weight broadcast or point-to-point delta to rollout ranks) and map each onto the fabric: trainer collectives want the NVLink domain or RDMA rails, the rollout sync crosses pools and takes whatever link connects them. NCCL_DEBUG=INFO on a single step names the transports actually chosen (see the NCCL-hang runbook for reading it); nvidia-smi topo -m shows the intra-node paths. Note the gang-scheduling consequence: trainer ranks are all-or-nothing, rollout replicas are individually restartable, so the scheduler must treat them differently (gang-scheduled training recipe).

Step 6: write the decision record. One page: chosen placement (colocated or disaggregated, and why for this fabric), sync mode (full, delta, LoRA-only) with the step-2 numbers attached, the staleness bound and its enforcement point, the phases that must overlap, and the rate-matching plan for the two pools (the sizing logic is disaggregation rate matching; the rollout side is priced further by rollout redundancy if group sampling is used). This record is what the API's pricing, quotas, and SLOs get built against.

Verification¶

Measured step breakdown matches the budget. A profiled step decomposes into rollout / scoring / sync / optimizer within the tolerance stated in the decision record (start at 20% until the stack has history); a phase far over budget reopens the matching procedure step.
Designed overlaps actually overlap. The profiler timeline shows sync traffic concurrent with rollout tail or backprop, not serialized after it (comms-compute overlap).
The staleness bound holds under load. Under a saturating run, the observed weight-version lag at the rollout fleet never exceeds the recorded bound.
No unexplained traffic. Per-NIC counters during one step account for the step-2 predictions; a large unexplained flow means the actor enumeration missed an edge.

Rollback¶

A review has no production state to revert; its rollback is the decision:

Mark the stack not-approved in the decision record with the failing step (traffic, staleness, overlap, or topology) and the measured evidence.
Re-run steps 2-5 after each remediation (switching to delta or LoRA-only sync, colocating, changing the staleness mode); do not approve on the diagram, approve on the re-measured timeline.
If the stack is already serving jobs, treat a failed re-review as a capacity or correctness incident and route to the matching runbook (MFU regression for throughput, NCCL hang for stalls).

the NCCL-hang runbook: when a reviewed stack still wedges on a collective.
the MFU-regression runbook: when throughput drifts after approval.
the checkpoint-recovery runbook: the failure-path sibling for state, as this runbook is for traffic.
operational runbooks: operational runbooks index.

References¶

verl (rollout/trainer orchestration, colocated and disaggregated modes): https://github.com/volcengine/verl
OpenRLHF (Ray-based disaggregated RLHF): https://github.com/OpenRLHF/OpenRLHF
PyTorch distributed (collectives, process groups): https://docs.pytorch.org/docs/stable/distributed.html
PyTorch profiler (per-phase timeline capture): https://docs.pytorch.org/docs/stable/profiler.html
NCCL environment variables (NCCL_DEBUG and transport selection): https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/env.html