RL scaling laws for LLMs

Scope: making RL post-training predictable the way pretraining already is. This page covers what the RL scaling curves are (sigmoidal and power-law), why and when to fit them, how to fit one and extrapolate the asymptote from an early run (with runnable code), what raises the ceiling versus the speed, and how to allocate compute. It builds on GRPO and its variants, the systems in async and disaggregated RL, and the pretraining-era scaling to 100T parameters; the extrapolation mechanics overlap with learning-curve extrapolation.

Formulas summarize recent papers; fit your own curves on your own stack before betting compute on a recipe. The Python example is executed and asserted (numpy); it recovers a known asymptote from early points.

flowchart LR
  EARLY["Early RL run (through the inflection)"] --> FIT["Fit sigmoidal curve: A ceiling, B efficiency"]
  FIT --> PRED["Predict final reward at target compute"]
  PRED --> PICK["Choose recipe by asymptote A, not early score"]
  RECIPE["Recipe knobs"] --> AA["Asymptote A: batch size, loss type, FP32 head, data filtering"]
  RECIPE --> BB["Efficiency B: loss aggregation, async bound, curriculum"]

What it is

Pretraining scaling laws are power laws: loss falls as a power of compute, data, and parameters, and Chinchilla made the compute-optimal split of parameters against tokens predictable. RL post-training was, until recently, more art than science. The 2025 result is that RL reward also follows predictable curves, in two complementary forms:

Sigmoidal in compute. The largest systematic study (ScaleRL, over 400,000 GPU-hours) fits reward as a saturating S-curve in RL compute C: Reward(C) = A / (1 + (C_mid / C)^B), where A is the asymptotic ceiling, B the efficiency exponent (steepness), and C_mid the midpoint compute. The curve has three phases: a flat start, a rapid middle, a plateau.¹
Power-law across model sizes. A complementary study fits test loss as L(N, X) = K(N) * X^(-E(N)) across 60+ models from 0.5B to 72B, where K(N) is a learning-efficiency term that saturates with model size and E(N) the scaling exponent, supporting extrapolation both across sizes and within a run.
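A quick numerical check can make the three phases of the sigmoidal form concrete. The sketch below evaluates the formula far below, at, and far above the midpoint; the parameter values (A = 0.80, B = 1.5, C_mid = 5000) are illustrative assumptions, not numbers from any paper.

```python
def sigmoid_reward(C, A, B, C_mid):
    """Reward(C) = A / (1 + (C_mid / C)^B), the saturating form described above."""
    return A / (1.0 + (C_mid / C) ** B)

# Illustrative values only (assumed for this sketch).
A, B, C_mid = 0.80, 1.5, 5000.0

flat = sigmoid_reward(C_mid / 100, A, B, C_mid)      # far below the midpoint
mid = sigmoid_reward(C_mid, A, B, C_mid)             # exactly at the midpoint
plateau = sigmoid_reward(C_mid * 100, A, B, C_mid)   # far above the midpoint

print(f"flat={flat:.4f}  mid={mid:.4f}  plateau={plateau:.4f}")
```

At C = C_mid the reward is exactly A/2, which is why C_mid is read as the midpoint; the flat and plateau values sit within 1% of 0 and of A respectively.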

Why use it

Judge a recipe before its full run. Fit the curve on an early fraction of a run and predict the final reward, so you can rank recipes without paying for each at full scale. ScaleRL predicted a run scaled to 100,000 GPU-hours from far smaller experiments.¹
Separate the ceiling from the speed. A recipe that looks better early may share the same asymptote, and a higher-ceiling recipe may start slower; fitting A tells you which is which.
Plan compute. DeepSeek-R1-Zero used about 100,000 H800 GPU-hours for RL, roughly 3.75% of its pretraining compute; knowing the curve turns that spend into a planned allocation rather than a guess.
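The ceiling-versus-speed distinction can be sketched with two hypothetical recipes: one that climbs early but saturates lower, and one that starts slower with a higher asymptote. All parameter values here are invented for illustration.

```python
def sigmoid_reward(C, A, B, C_mid):
    """Reward(C) = A / (1 + (C_mid / C)^B)."""
    return A / (1.0 + (C_mid / C) ** B)

# Hypothetical fitted parameters for two recipes (illustrative numbers only).
fast_low = dict(A=0.60, B=2.0, C_mid=2000.0)    # climbs early, lower ceiling
slow_high = dict(A=0.80, B=1.5, C_mid=5000.0)   # starts slower, higher ceiling

for C in (1_000, 3_000, 100_000):
    r_fast = sigmoid_reward(C, **fast_low)
    r_slow = sigmoid_reward(C, **slow_high)
    leader = "fast_low" if r_fast > r_slow else "slow_high"
    print(f"C={C:>7}: fast_low={r_fast:.3f}  slow_high={r_slow:.3f}  leader={leader}")
```

With these numbers the early leader at 1,000 and 3,000 GPU-hours loses at 100,000, which is the case that ranking on fitted A is meant to catch.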

When to use it (and when not)

Use it once a run has covered enough compute for the curve to bend (through the inflection): extrapolate the plateau, and rank candidate recipes by fitted asymptote.
Do not extrapolate the asymptote from pre-inflection data. Fitting A from points that are all far below the midpoint is ill-posed: many (A, B, C_mid) triples fit early points equally, so the recovered ceiling is unreliable until the curve bends (a failure mode, below).
Do not read early speed as final quality. Rank by fitted A, not by an early checkpoint score.
Re-fit as compute grows. A curve fit on small runs can miss a later regime change.
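Why pre-inflection data cannot pin the asymptote can be shown numerically. For C far below C_mid the curve behaves like A * (C / C_mid)^B, so any two parameter sets with the same A / C_mid^B look alike early on. The sketch below builds such a pair; the specific values are assumptions chosen for illustration.

```python
import numpy as np

def sigmoid_reward(C, A, B, C_mid):
    """Reward(C) = A / (1 + (C_mid / C)^B)."""
    return A / (1.0 + (C_mid / C) ** B)

B = 1.5
A1, C_mid1 = 0.80, 5000.0
A2 = 0.50
# Choose C_mid2 so that A2 / C_mid2^B equals A1 / C_mid1^B.
C_mid2 = C_mid1 * (A2 / A1) ** (1.0 / B)

C_early = np.array([50.0, 100.0, 200.0, 300.0])   # all far below either midpoint
r1 = sigmoid_reward(C_early, A1, B, C_mid1)
r2 = sigmoid_reward(C_early, A2, B, C_mid2)
max_rel_gap = float(np.max(np.abs(r1 - r2) / r1))

print(f"max relative gap on early points: {max_rel_gap:.4f}")
print(f"ceilings: A1={A1:.2f} vs A2={A2:.2f}")
```

The two curves differ by less than 1% on the early points, below the 1% noise level used in the fitting example on this page, yet their ceilings differ by 0.30.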

Architecture

The workflow is fit, then extrapolate, then choose: log (compute, reward) pairs from an early run, fit the sigmoid, predict reward at the target compute, and pick the recipe with the best predicted asymptote. The recipe knobs split into two groups: those that move the ceiling A (total batch size, loss formulation, a full-precision head, data filtering) and those that only change efficiency B (loss aggregation, advantage normalization, bounded-staleness asynchrony, curriculum). Ranking on A is the whole point.

How to use it

Fitting the sigmoid needs no heavy solver. With the asymptote A fixed, the curve linearizes: log(A/R - 1) = B*log(C_mid) - B*log(C), which is linear in log C, so a grid over A plus a linear fit for B and C_mid recovers all three. This runnable example fits on an early run (compute through the inflection) and extrapolates to 100,000 GPU-hours, recovering a known asymptote (asserted A_fit within 0.05 of the true 0.80, prediction within 0.02):

# fit_rl_curve.py — validated: fit the sigmoid on early compute, extrapolate the asymptote. numpy only.
import numpy as np
rng = np.random.default_rng(0)
A_true, B_true, C_mid = 0.80, 1.5, 5000.0
reward = lambda C: A_true / (1.0 + (C_mid / C) ** B_true)
C = np.array([500, 1000, 2000, 4000, 7000, 12000, 20000, 30000.0])  # early run, through the inflection
R = reward(C) * (1 + 0.01 * rng.standard_normal(C.size))            # +1% measurement noise

def fit(C, R):
    best = None
    for A in np.linspace(R.max() * 1.001, R.max() * 1.6, 600):      # A must exceed the observed max
        slope, inter = np.polyfit(np.log(C), np.log(A / R - 1.0), 1)  # linear in log C
        B = -slope
        if B <= 0:
            continue
        pred = A / (1 + np.exp(slope * np.log(C) + inter))          # invert log(A/R - 1) = slope*log C + inter
        sse = float(((pred - R) ** 2).sum())
        if best is None or sse < best[0]:
            best = (sse, A, B, np.exp(inter / B))
    return best[1], best[2], best[3]                                # A, B, C_mid

A_fit, B_fit, Cmid_fit = fit(C, R)
pred_100k = A_fit / (1 + (Cmid_fit / 100000.0) ** B_fit)
assert abs(A_fit - A_true) < 0.05                                   # asymptote recovered
assert abs(pred_100k - reward(100000)) < 0.02                       # extrapolation holds
print(f"A_fit={A_fit:.3f}  reward@100k: pred={pred_100k:.3f} true={reward(100000):.3f}")  # true value is 0.791

On real runs, replace the synthetic reward(C) with logged (GPU-hours, validation reward) pairs, and treat the fit as a decision tool: keep the recipe with the best A_fit.
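For reference, the linearization that the grid search relies on can be written out step by step. This is a derivation from the sigmoid as stated on this page; the slope and intercept symbols s and i are introduced here only to name the two outputs of the linear fit.

```latex
\begin{aligned}
R &= \frac{A}{1 + (C_{\mathrm{mid}}/C)^{B}}
  &&\Longrightarrow\quad \frac{A}{R} - 1 = \left(\frac{C_{\mathrm{mid}}}{C}\right)^{B} \\
\log\!\left(\frac{A}{R} - 1\right) &= B \log C_{\mathrm{mid}} - B \log C
  &&\Longrightarrow\quad s = -B,\quad i = B \log C_{\mathrm{mid}} \\
B &= -s, \qquad C_{\mathrm{mid}} = \exp\!\left(\frac{i}{B}\right)
\end{aligned}
```

The left-hand side is defined only for A > R, which is why the grid over A starts just above the largest observed reward.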

How to develop with it

Raise the ceiling, not just the speed. What raises the asymptote A: total batch size, the loss formulation (CISPO and GSPO beat a plain baseline, see GRPO variants), a full-precision language-model head, and data filtering. What only improves efficiency B: loss aggregation, advantage normalization, bounded-staleness asynchrony (PipelineRL with a staleness bound), dynamic sampling, and curriculum.
Regularization is difficulty-dependent. Large-scale verifiable-reward RL often needs no KL penalty at all; where regularization helps, an entropy bonus and light KL prevent entropy collapse on easy problems, while hard problems want less regularization and more exploration. This is the opposite of one setting for all.
Batch and learning rate. Grow total batch size (prompts times rollouts) first, since it dominates the exact split, and scale learning rate with roughly the square root of the batch size to stay stable.
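The square-root rule for learning rate is simple enough to sketch directly. The baseline learning rate and batch size below are hypothetical placeholders, and `scaled_lr` is a helper name invented for this example.

```python
import math

def scaled_lr(base_lr, base_batch, new_batch):
    """Square-root rule: learning rate scales with sqrt(new_batch / base_batch)."""
    return base_lr * math.sqrt(new_batch / base_batch)

# Hypothetical baseline: lr 1e-6 at a total batch of 512 (prompts x rollouts).
base_lr, base_batch = 1e-6, 512
for prompts, rollouts in [(64, 8), (128, 16), (256, 32)]:
    total = prompts * rollouts
    lr = scaled_lr(base_lr, base_batch, total)
    print(f"{prompts} prompts x {rollouts} rollouts = {total}: lr={lr:.2e}")
```

Quadrupling the total batch doubles the learning rate under this rule; treat it as a starting point to validate on your own stack rather than a guarantee of stability.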

How to maintain it

Re-fit the curve as a run accumulates compute: a sigmoid fit on early points can miss a later regime change, so refit as coverage grows and re-rank recipes on the updated asymptotes. Keep the logged (GPU-hours, validation reward) pairs for every run so the fit is reproducible and comparable across recipes, and track a recipe's fitted A over time rather than its score at any single checkpoint.
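One way to track a recipe's fitted A over time is to flag refits where the asymptote moves sharply, which may signal a regime change worth a closer look. The helper below is a hypothetical sketch; the 5% tolerance and the history values are assumptions, not recommendations from the cited papers.

```python
def asymptote_drift(a_history, rel_tol=0.05):
    """Return indices of refits whose fitted A moved more than rel_tol vs the previous refit."""
    flagged = []
    for i in range(1, len(a_history)):
        prev, cur = a_history[i - 1], a_history[i]
        if abs(cur - prev) / abs(prev) > rel_tol:
            flagged.append(i)
    return flagged

# Hypothetical fitted A values from successive refits of one run.
history = [0.78, 0.79, 0.80, 0.88, 0.89]
print(asymptote_drift(history))  # [3]
```

Only the jump from 0.80 to 0.88 exceeds the tolerance, so the fourth refit (index 3) is the one to inspect before re-ranking recipes.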

How to run it in production

Fit the curve on a live run's early logs to gate the compute allocation: commit full compute only to the recipe with the best predicted asymptote, and stop the rest early (learning-curve extrapolation). Across model sizes, the power-law view sets the crossover: under a tight budget a smaller model trained longer can beat a larger one, while at higher budgets the larger model's saturating learning efficiency K(N) wins, so match the model size to the budget. Allocate rollouts by budget and difficulty rather than a fixed G (the optimal count rises, then saturates). Lean on the high RL-data reuse factor (performance tracks total tokens processed, not unique samples) to relax data pressure, and scale learning rate with roughly the square root of the batch size.
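The budget crossover between model sizes can be illustrated with the power-law form L = K * X^(-E). The sketch below uses hypothetical per-model constants standing in for K(N) and E(N), and assumes that the number of samples X a budget buys is the budget divided by a per-sample cost that grows with model size; none of these numbers come from the cited studies.

```python
def loss_at_budget(budget, K, E, cost_per_sample):
    """L = K * X^(-E), where X = budget / cost_per_sample is the affordable sample count."""
    X = budget / cost_per_sample
    return K * X ** (-E)

# Hypothetical constants: the larger model improves faster per sample (larger E),
# but each sample costs 10x more compute.
small = dict(K=1.0, E=0.10, cost_per_sample=1.0)
large = dict(K=1.0, E=0.20, cost_per_sample=10.0)

for budget in (10.0, 100.0, 1000.0):
    loss_small = loss_at_budget(budget, **small)
    loss_large = loss_at_budget(budget, **large)
    print(f"budget={budget:>6}: small={loss_small:.3f}  large={loss_large:.3f}")
```

With these constants the two models tie at a budget of 100: below it the smaller model has the lower loss, above it the larger one does. On a real stack the crossover has to come from fitted constants.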

Failure modes

Extrapolating the asymptote too early. A curve fit on pre-inflection points cannot pin A; wait for the bend, and re-fit as compute grows.
Reading early speed as final quality. A steeper early climb can hide a lower ceiling; extrapolate the asymptote and rank on it.
Over-splitting the batch. Chasing the perfect prompt-vs-rollout ratio when total batch size is what matters.
Fixed regularization. One entropy or KL setting across difficulties causes collapse on easy prompts or throttles exploration on hard ones.
Ignoring the train-inference gap. Efficiency knobs assume a stable off-policy correction; without it the curve is not reproducible (GRPO variants).
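A cheap guard against the first failure mode is to check whether the logged curve has visibly bent before fitting. Below the midpoint, log reward is roughly linear in log compute with slope B, and the slope drops as the curve saturates. The heuristic below compares the last log-log slope against the first; the function name and the 0.5 threshold are assumptions for illustration, and real noisy logs may need smoothing first.

```python
import numpy as np

def has_bent(C, R, ratio=0.5):
    """Heuristic: True if the last log-log slope is below `ratio` times the first."""
    logC = np.log(np.asarray(C, dtype=float))
    logR = np.log(np.asarray(R, dtype=float))
    slopes = np.diff(logR) / np.diff(logC)
    return bool(slopes[-1] < ratio * slopes[0])

# Synthetic curve with the same parameters as the fitting example in "How to use it".
reward = lambda C: 0.80 / (1.0 + (5000.0 / C) ** 1.5)

C_early = np.array([50.0, 100.0, 200.0, 300.0])                     # pre-inflection only
C_full = np.array([500, 1000, 2000, 4000, 7000, 12000, 20000, 30000.0])

print(has_bent(C_early, reward(C_early)))  # False
print(has_bent(C_full, reward(C_full)))    # True
```

When the check fails, keep collecting compute rather than trusting a fitted A.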

References

Cameron R. Wolfe, "RL Scaling Laws for LLMs": https://cameronrwolfe.substack.com/p/rl-scaling-laws
Khatri, Madaan et al., The Art of Scaling Reinforcement Learning Compute for LLMs (ScaleRL, sigmoidal fit, 400K+ GPU-hours): https://arxiv.org/abs/2510.13786
Kaplan et al., Scaling Laws for Neural Language Models: https://arxiv.org/abs/2001.08361
Hoffmann et al., Training Compute-Optimal Large Language Models (Chinchilla): https://arxiv.org/abs/2203.15556
DeepSeek-AI, DeepSeek-R1 (RL compute as a fraction of pretraining): https://arxiv.org/abs/2501.12948

¹ ScaleRL (arXiv 2510.13786) fits reward as a sigmoid in compute, Reward(C) = A / (1 + (C_mid/C)^B); loss aggregation, normalization, curriculum, and off-policy algorithm modulate efficiency without shifting the asymptote, while batch size, loss type, FP32 head, and data filtering raise it; a run was predicted out to 100,000 GPU-hours from smaller-scale fits.