SkyRL (UC Berkeley)

Scope: SkyRL, a modular full-stack RL library for LLMs from the Berkeley Sky Computing Lab (NovaSky), built for multi-turn agentic RL with flexible colocated-or-disaggregated placement.

The SkyRL CLI and skyrl-gym snippets below are reference templates against real APIs: pin versions and validate them before production use. The numpy blocks are self-contained; they are executed and asserted on this page.

What it is

SkyRL (NovaSky-AI/SkyRL) is an open-source RL post-training stack split into composable packages: skyrl-train (the FSDP2/Megatron trainer), skyrl-tx (a Tinker-API backend with a unified train/inference engine, including a JAX path), skyrl-agent (the agent layer for long-horizon, real-environment tasks), and skyrl-gym (a Gymnasium of tool-use environments: math, coding, search, SQL). Among RL libraries it is the flexible, agentic option. Training backends are FSDP2 and Megatron (plus the JAX path in skyrl-tx); rollout engines are vLLM or the OpenAI API (SGLang support was dropped once SkyRL unified on vLLM); orchestration is Ray. The project moves fast and this page reflects its state as of mid-2026, so check the current package layout against the repo.

Why use it

SkyRL builds on lessons from earlier libraries (verl, slime, OpenRLHF) and keeps its interfaces small. What sets it apart is how much is a config toggle rather than a fork: sync or async pipelining, colocated or disaggregated generation and training, an external or an integrated inference engine, and pluggable weight-sync. It targets multi-turn agentic settings first, and ships example scripts for SWE, search, and Text2SQL.

When to use it (and when not)

Use it when you need agentic or multi-turn RL, want to switch between colocated and disaggregated placement without rewriting, or need a choice of vLLM or OpenAI-API rollouts.
Use it for research where flexible placement and heterogeneous train/infer GPUs matter (long-horizon tasks with stragglers).
Prefer verl for maximum tightly-coupled throughput at scale, and slime for opinionated Megatron+SGLang MoE training (the GLM stack). SkyRL trades some ecosystem maturity for flexibility.

Architecture

flowchart TB
  CTRL["Controller (Ray)"]
  subgraph Gen["Generator (rollout)"]
    ENG["vLLM / OpenAI API"]
    ENV["skyrl-gym env / reward"]
  end
  subgraph Train["Trainer (skyrl-train)"]
    OPT["GRPO / PPO on FSDP2 / Megatron"]
  end
  CTRL -.-> Gen
  CTRL -.-> Train
  Gen -->|"trajectories + rewards"| Train
  Train -->|"weight sync (colocated or disaggregated)"| Gen

In colocated mode the generator and trainer share GPUs and offload between phases; in disaggregated mode they run as separate Ray pools with async weight sync.
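
The trade-off between the two placements can be sketched with a toy step-time model. This is an illustration under stated assumptions, not SkyRL's scheduler: the function names and timing inputs are hypothetical, colocated mode is modelled as strictly sequential phases with an offload cost on each switch, and disaggregated mode as fully overlapped pools plus one weight sync per step.

```python
def colocated_step_time(gen_s, train_s, offload_s):
    """Shared GPUs: generate, swap, train, swap back. Phases cannot overlap."""
    return gen_s + offload_s + train_s + offload_s


def disaggregated_step_time(gen_s, train_s, sync_s):
    """Separate pools that overlap: the slower pool sets the pace, and the
    weight sync is paid once per step (assumed not to overlap with compute)."""
    return max(gen_s, train_s) + sync_s


# Hypothetical timings in seconds, chosen only to illustrate the comparison.
balanced = dict(gen_s=60.0, train_s=60.0)
print(colocated_step_time(**balanced, offload_s=5.0))      # 130.0
print(disaggregated_step_time(**balanced, sync_s=10.0))    # 70.0
```

The model deliberately ignores that disaggregated placement occupies two pools of GPUs, so a fair comparison would normalise by GPU-hours as well as wall-clock time.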

How it works (validated core math)

SkyRL's distinctive parts are the multi-turn rollout, which produces a trajectory of per-turn rewards, and the GRPO trainer, which scores a group of those trajectories against each other. Both reduce to a few lines of numpy, so the maths can be checked without a GPU. The two blocks below are the executable companions to the skyrl-gym env and the GRPO/PPO trainer shown later; run either with a numpy-only Python.

Multi-turn trajectory return

A skyrl-gym episode emits one reward per env.step. Return-to-go is the standard way to turn that per-turn stream into per-step returns: walk the turns in reverse so credit flows backward. With gamma=1 the trajectory return G[0] is just the sum of turn rewards, which is what an outcome-reward GRPO run scores. The boundary cases worth testing are gamma=0 (no credit crosses turns, so each return equals its own reward) and gamma=1 (the undiscounted suffix sum); a single-turn episode must also return exactly its own reward for any gamma.

import numpy as np


def trajectory_return(step_rewards, gamma=1.0):
    """Return-to-go for one multi-turn rollout: G_t = sum_{k>=t} gamma**(k-t) * r_k.

    step_rewards is the per-turn reward vector of a single trajectory (what a
    skyrl-gym episode emits, one scalar per env.step). Returns a same-length vector
    of discounted returns; G[0] is the trajectory's total return that GRPO scores.
    """
    r = np.asarray(step_rewards, dtype=np.float64)
    assert r.ndim == 1, "step_rewards is one trajectory's per-turn rewards"
    out = np.zeros_like(r)
    running = 0.0
    for t in range(r.size - 1, -1, -1):        # walk the turns in reverse
        running = r[t] + gamma * running
        out[t] = running
    return out


# happy path: undiscounted (gamma=1) return-to-go is the suffix sum
g = trajectory_return([1.0, 0.0, 2.0], gamma=1.0)
assert np.allclose(g, [3.0, 2.0, 2.0]), g

# boundary gamma=0: no credit flows across turns, each return is its own reward
g0 = trajectory_return([1.0, 0.0, 2.0], gamma=0.0)
assert np.allclose(g0, [1.0, 0.0, 2.0]), g0

# boundary: a single-turn trajectory returns just its reward, for any gamma
assert np.allclose(trajectory_return([5.0], gamma=0.9), [5.0])

# discount applied per hop: G0 = 1 + 0.5*0 + 0.25*2 = 1.5
gd = trajectory_return([1.0, 0.0, 2.0], gamma=0.5)
assert np.allclose(gd, [1.5, 1.0, 2.0]), gd

# equivalence: the O(n) reverse recursion matches the explicit double-sum definition
def slow_return(rewards, gamma):
    n = len(rewards)
    return np.array([sum(gamma ** (k - t) * rewards[k] for k in range(t, n))
                     for t in range(n)])

rng = np.random.default_rng(0)
traj = rng.normal(size=17)
for gm in (0.0, 0.5, 0.99, 1.0):
    assert np.allclose(trajectory_return(traj, gm), slow_return(traj, gm), atol=1e-9)

print("trajectory_return: all asserts passed")

Group-relative advantage

The trainer groups the trajectory returns of each prompt (group_size rollouts per prompt) and centres each return on the group mean, dividing by the group std by default. Because the group mean replaces a learned critic, GRPO holds no value network, which removes the critic's weights, gradients, and optimizer state from the memory budget (a large saving when the critic would be as big as the policy). The case that must not break is a group where every rollout scores the same: the std is zero, so the advantage has to collapse to zero instead of dividing by zero into NaN, and the eps term in the denominator guarantees that. That is the zero-signal group SkyRL's reward monitors watch for.

import numpy as np


def group_relative_advantage(returns, scale_by_std=True, eps=1e-8):
    """GRPO advantage over a group of trajectory returns: center each on its group
    mean, optionally divide by the group std. returns has shape
    (num_prompts, group_size), each entry one rollout's total return; same shape out.
    """
    R = np.asarray(returns, dtype=np.float64)
    assert R.ndim == 2, "returns must be (num_prompts, group_size)"
    centered = R - R.mean(axis=1, keepdims=True)
    if not scale_by_std:
        return centered
    return centered / (R.std(axis=1, keepdims=True) + eps)     # population std (ddof=0)


# happy path: a two-sample group maps to known advantages (mean 0.5, std 0.5)
adv = group_relative_advantage(np.array([[0.0, 1.0]]))
assert np.allclose(adv, [[-1.0, 1.0]], atol=1e-6), adv

# property: every group is zero-mean after centering (the group replaces the critic)
rng = np.random.default_rng(1)
grp = rng.normal(size=(12, 8))
adv_grp = group_relative_advantage(grp)
assert np.allclose(adv_grp.mean(axis=1), 0.0, atol=1e-9)
assert np.allclose(adv_grp.std(axis=1), 1.0, atol=1e-3)        # unit std with scaling

# adversarial: a group where every rollout scores identically (all solved or all
# failed). std is 0, so the advantage must be exactly 0 and never NaN/Inf: no
# gradient, the zero-signal case SkyRL's reward monitors flag.
flat = group_relative_advantage(np.array([[0.7, 0.7, 0.7, 0.7]]))
assert np.all(np.isfinite(flat)), "zero-std group produced NaN/Inf"
assert np.allclose(flat, 0.0), flat

# scale_by_std=False keeps mean-centering only (the Dr.GRPO / R1-Zero setting)
adv_ns = group_relative_advantage(np.array([[0.0, 1.0]]), scale_by_std=False)
assert np.allclose(adv_ns, [[-0.5, 0.5]]), adv_ns

# equivalence: the vectorized form matches an explicit per-group reference loop
def slow_adv(returns, eps=1e-8):
    out = []
    for row in returns:
        m = sum(row) / len(row)
        s = (sum((x - m) ** 2 for x in row) / len(row)) ** 0.5
        out.append([(x - m) / (s + eps) for x in row])
    return np.array(out)

assert np.allclose(group_relative_advantage(grp), slow_adv(grp), atol=1e-6)
print("group_relative_advantage: all asserts passed")
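
The two blocks above connect as follows: each rollout's per-turn rewards collapse to one total return, and the returns of one prompt's group are then scored against each other. The sketch below repeats that pipeline in standard-library Python so it runs on its own; the helper names and the four example episodes are hypothetical, and it assumes the undiscounted (gamma=1) outcome-reward setting.

```python
from statistics import fmean, pstdev


def total_return(step_rewards):
    """Undiscounted (gamma=1) trajectory return: the sum of per-turn rewards."""
    return float(sum(step_rewards))


def group_advantages(returns, eps=1e-8):
    """Center one prompt's group of returns on its mean and divide by the
    population std plus eps, mirroring the numpy block above."""
    mean = fmean(returns)
    std = pstdev(returns)
    return [(r - mean) / (std + eps) for r in returns]


# Hypothetical group of 4 rollouts for one prompt; each inner list is the
# per-turn reward stream of one multi-turn episode (lengths may differ).
group = [
    [0.0, 0.0, 1.0],   # solved on turn 3
    [0.0, 1.0],        # solved on turn 2
    [0.0, 0.0, 0.0],   # never solved
    [0.0],             # stopped after one turn
]
returns = [total_return(t) for t in group]
advantages = group_advantages(returns)
print(returns)                               # [1.0, 1.0, 0.0, 0.0]
print([round(a, 3) for a in advantages])     # [1.0, 1.0, -1.0, -1.0]
```

Episodes of different lengths end up on the same scale because only the total return enters the group statistics; how that scalar is spread back over tokens is a separate trainer choice not shown here.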

How to use it

# Clone + uv env (Python 3.12); choose the FSDP extra
git clone https://github.com/NovaSky-AI/SkyRL.git
cd SkyRL
uv venv --python 3.12 .venv && source .venv/bin/activate
uv sync --active --extra fsdp           # or --extra megatron

# Ray must use the uv runtime-env hook
export RAY_RUNTIME_ENV_HOOK=ray._private.runtime_env.uv_runtime_env_hook.hook
ray start --head

# First GRPO run on GSM8K (see docs.skyrl.ai quickstart for the exact recipe)
bash examples/gsm8k/run_gsm8k.sh

Prebuilt images exist (e.g. novaskyai/skyrl-train-ray-<pinned>-cu12.8, ...-megatron). Pin the image tag; never :latest.
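
The pinning rule is easy to enforce in a launch script. The check below is a minimal sketch, not part of SkyRL: is_pinned is a hypothetical helper, the image names in the example are placeholders, and it treats a missing tag as an implicit :latest.

```python
def is_pinned(image_ref):
    """True if the image reference has a sha256 digest or an explicit tag other
    than 'latest'. A reference with no tag resolves to ':latest', so it fails."""
    if "@sha256:" in image_ref:
        return True
    last_segment = image_ref.rsplit("/", 1)[-1]   # skip a registry host:port
    if ":" not in last_segment:
        return False
    tag = last_segment.rsplit(":", 1)[1]
    return tag not in ("", "latest")


# Placeholder image names, used only to exercise the three cases.
for ref in ("registry.example/trainer",
            "registry.example/trainer:latest",
            "registry.example/trainer:v1.2.3"):
    print(ref, is_pinned(ref))   # False, False, True
```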

How to integrate with it

A SkyRL task is an environment (the Gymnasium API in skyrl-gym) plus a trainer/generator config. The environment returns observations and a scalar reward; the generator drives the multi-turn interaction over the OpenAI API while SkyRL manages the engine lifecycle. The per-turn rewards it emits are exactly the step_rewards fed to trajectory_return above.

Reference template (needs skyrl-gym, not executed here; confirm signatures on the repo):

# skyrl-gym style env (reference shape; confirm signatures on the repo)
import skyrl_gym

env = skyrl_gym.make("text2sql")          # or "gsm8k", "swe", "search"
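# Gymnasium-style reset/step shown; skyrl-gym's own return types may differ.
obs, info = env.reset()
action_text = "..."                       # placeholder for the policy's generated turn
obs, reward, terminated, truncated, info = env.step(action_text)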
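
The rollout loop around such an environment can be shown without installing skyrl-gym. Everything in this sketch is hypothetical: GuessNumberEnv is a toy environment written for this page (it is not a skyrl-gym environment), it assumes the Gymnasium-style reset/step shape used in the template, and a binary-search function stands in for the LLM policy.

```python
class GuessNumberEnv:
    """Toy text environment (not part of skyrl-gym): guess a hidden integer."""

    def __init__(self, target, max_turns=8):
        self.target = target
        self.max_turns = max_turns
        self.turn = 0

    def reset(self):
        self.turn = 0
        return "Guess an integer between 0 and 100.", {}

    def step(self, action_text):
        self.turn += 1
        guess = int(action_text)
        if guess == self.target:
            return "correct", 1.0, True, False, {}
        hint = "higher" if guess < self.target else "lower"
        truncated = self.turn >= self.max_turns   # out of turns, not solved
        return hint, 0.0, False, truncated, {}


def rollout(env, policy):
    """Run one episode and return its per-turn reward list (step_rewards)."""
    obs, _ = env.reset()
    step_rewards, done = [], False
    while not done:
        action_text = policy(obs)
        obs, reward, terminated, truncated, _ = env.step(action_text)
        step_rewards.append(reward)
        done = terminated or truncated
    return step_rewards


def make_bisect_policy(lo=0, hi=100):
    """Stand-in for the LLM: binary search driven by the 'higher'/'lower' hint."""
    state = {"lo": lo, "hi": hi, "last": None}

    def policy(obs):
        if obs == "higher":
            state["lo"] = state["last"] + 1
        elif obs == "lower":
            state["hi"] = state["last"] - 1
        state["last"] = (state["lo"] + state["hi"]) // 2
        return str(state["last"])

    return policy


rewards = rollout(GuessNumberEnv(target=37), make_bisect_policy())
print(rewards)   # [0.0, 0.0, 1.0]: guesses 50, 24, 37
```

The returned list is the per-turn reward vector that trajectory_return consumes; a truncated episode simply ends with all-zero rewards.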

Config selects backend and placement through keys such as trainer.backend=fsdp, generator.backend=vllm, and placement.colocate=true|false, plus the weight-sync method (key names here are illustrative). Override them on the command line when launching the entrypoint module, uv run -m skyrl.train.entrypoints.main_base.
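
Dotted key=value overrides of this kind map naturally onto a nested config. The parser below is a minimal sketch of that mapping, written for illustration: parse_overrides is a hypothetical helper, the keys are the illustrative ones from this page, and how SkyRL itself parses overrides is an assumption to confirm in its docs.

```python
def parse_overrides(args):
    """Turn ['a.b=1', 'a.c=x'] into {'a': {'b': '1', 'c': 'x'}}.
    Values are kept as strings; type conversion is left to the caller."""
    config = {}
    for arg in args:
        key, sep, value = arg.partition("=")
        if not sep or not key:
            raise ValueError(f"expected key=value, got {arg!r}")
        node = config
        *parents, leaf = key.split(".")
        for name in parents:
            node = node.setdefault(name, {})
            if not isinstance(node, dict):
                raise ValueError(f"{name!r} is already a leaf value in {arg!r}")
        node[leaf] = value
    return config


cfg = parse_overrides([
    "trainer.backend=fsdp",
    "generator.backend=vllm",
    "placement.colocate=false",
])
print(cfg["trainer"]["backend"])        # fsdp
print(cfg["placement"]["colocate"])     # false (still a string)
```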

With the rollout inference engine

SkyRL does not serve production traffic; its inference engines exist to generate rollouts. It standardises on the OpenAI API so agent scaffolds call the engine over HTTP while SkyRL manages the vLLM lifecycle and weight refresh. For production model serving, see the serving pages: inference serving, serving open-weight models, and disaggregated inference.

In the post-training stack

SkyRL is itself an RL fine-tuning system: it runs GRPO and PPO-family algorithms for post-training (see fine-tuning and post-training), computing the group-relative advantage validated above. LoRA training is supported on the FSDP backend with the vLLM engine for parameter-efficient runs (see SFT and LoRA). Reward comes from skyrl-gym environments or custom verifiable reward functions.
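
A custom verifiable reward can be as small as a pure function from completion text to a scalar. The example below is a hypothetical sketch, not a SkyRL reward: exact_answer_reward and its matching rule (compare the last number in the completion with the ground truth) are assumptions chosen for a GSM8K-style task.

```python
import re

_NUMBER = re.compile(r"-?\d+(?:\.\d+)?")


def exact_answer_reward(completion, ground_truth):
    """Return 1.0 if the last number in the completion equals ground_truth,
    else 0.0. Commas are stripped first so '1,234' reads as 1234."""
    numbers = _NUMBER.findall(completion.replace(",", ""))
    if not numbers:
        return 0.0
    return 1.0 if float(numbers[-1]) == float(ground_truth) else 0.0


print(exact_answer_reward("So the answer is 42.", "42"))       # 1.0
print(exact_answer_reward("First 3, then the total is 7", "3"))  # 0.0
print(exact_answer_reward("I am not sure.", "42"))             # 0.0
```

Because the function is deterministic and has no side effects, it can be unit-tested on held-out completions before a campaign starts.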

How to run it in production

SkyRL runs RL jobs, not a serving endpoint, so production here means a stable, reproducible training campaign. Two things dominate: pin every moving part, and get the weight-sync fabric right.

Pin the runtime. Use a pinned prebuilt image (see above), never :latest, and export RAY_RUNTIME_ENV_HOOK=ray._private.runtime_env.uv_runtime_env_hook.hook so Ray workers inherit the uv environment.
Weight sync is the hot path. Colocated mode refreshes weights in place over NVLink/NVSwitch (see the Blackwell platform); disaggregated mode moves weights between pools and benefits from GPUDirect RDMA (GDR, NCCL_NET_GDR_LEVEL=SYS), with NCCL_IB_HCA set to the right rails (performance tuning).
Verify the fabric. Place the rollout and train pools on the same fabric island, confirm GDR by running with NCCL_DEBUG=INFO and checking that the log shows GDRDMA, and turn PCIe ACS off so peer-to-peer traffic is not forced through the root complex.
Precision. Blackwell FP8/NVFP4 helps rollout throughput in vLLM (quantization for inference); keep training precision as the recipe specifies.
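
The checklist above can be turned into a preflight script that runs before a job is submitted. This is a sketch under assumptions: preflight and gdr_confirmed are hypothetical helpers, the variable names and values are the ones listed on this page, and NCCL_IB_HCA is only checked for presence because the right value is site-specific.

```python
import os

EXPECTED_HOOK = "ray._private.runtime_env.uv_runtime_env_hook.hook"


def preflight(env, disaggregated):
    """Return a list of human-readable problems found in an environment mapping."""
    problems = []
    if env.get("RAY_RUNTIME_ENV_HOOK") != EXPECTED_HOOK:
        problems.append("RAY_RUNTIME_ENV_HOOK is not set to the uv hook")
    if disaggregated:
        # GDR settings only matter when weights cross between pools.
        if env.get("NCCL_NET_GDR_LEVEL") != "SYS":
            problems.append("NCCL_NET_GDR_LEVEL is not SYS")
        if not env.get("NCCL_IB_HCA"):
            problems.append("NCCL_IB_HCA is unset (site-specific rail list)")
    return problems


def gdr_confirmed(nccl_log_text):
    """True if any line of an NCCL_DEBUG=INFO log mentions GDRDMA."""
    return any("GDRDMA" in line for line in nccl_log_text.splitlines())


for problem in preflight(dict(os.environ), disaggregated=True):
    print("preflight:", problem)
```

Passing the environment as a plain mapping keeps the check testable without touching the real process environment.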

How to maintain it

An RL campaign is a live system, so treat it like one.

Watch the training signals. Track the reward mean, entropy, and the KL to the reference; GRPO's instabilities (entropy or advantage collapse, KL drift) show up here first (observability). The zero-std groups from the advantage block are the leading indicator that a prompt set has stopped producing gradient.
Checkpoint and resume. Persist the policy so a run recovers after preemption or a fault (checkpoint recovery); this matters more for long-horizon agentic rollouts, where a single trajectory is expensive.
Track the fast-moving package layout. SkyRL's packages (skyrl / skyrl-train / skyrl-tx) are still merging and renaming; pin a commit, and re-read the README and docs.skyrl.ai before an upgrade instead of assuming flag names held.
Re-validate rewards. The reward functions are deterministic, which makes them easy to unit-test: run them on held-out cases before each campaign, because a silently broken reward trains a broken model fast.
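
The zero-std indicator mentioned above reduces to one number per training step: the share of prompt groups whose rollouts all scored the same. The monitor below is a minimal sketch; zero_signal_fraction is a hypothetical helper, the tolerance is an assumed default, and the example batch is made up.

```python
from statistics import pstdev


def zero_signal_fraction(grouped_returns, tol=1e-12):
    """Fraction of prompt groups whose returns have (near-)zero population std.
    Such groups yield zero advantage for every rollout, hence no gradient."""
    if not grouped_returns:
        raise ValueError("need at least one group")
    flat = sum(1 for group in grouped_returns if pstdev(group) <= tol)
    return flat / len(grouped_returns)


# Hypothetical batch: 4 prompts, 4 rollouts each.
batch = [
    [1.0, 1.0, 1.0, 1.0],   # all solved: no signal
    [0.0, 0.0, 0.0, 0.0],   # all failed: no signal
    [0.0, 1.0, 0.0, 1.0],
    [1.0, 0.0, 0.0, 0.0],
]
print(zero_signal_fraction(batch))   # 0.5
```

A rising value over a campaign suggests the prompt set has become too easy or too hard for the current policy, which is a cue to refresh or filter the data.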

How to scale it

# Multi-node: start a Ray head, join workers, then submit
ray start --head --port=6379
ray start --address='<head-ip>:6379'      # on each worker node

# Disaggregated placement: separate rollout and train GPU pools
uv run -m skyrl.train.entrypoints.main_base \
  trainer.backend=fsdp generator.backend=vllm \
  placement.colocate=false \
  trainer.num_gpus=8 generator.num_gpus=8

Colocated: one pool, generator and trainer swap GPUs; memory-bound, wants NVLink intra-node.
Disaggregated: independent rollout and train pools that scale separately and tolerate stragglers; wants fast IB/RoCE with GDR between pools (networking fabric, disaggregated inference).
FSDP sharding (HSDP: shard within a node over NVLink, replicate across nodes over IB) is covered under FSDP; Megatron TP/PP is covered under tensor parallelism and pipeline parallelism.
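
The HSDP layout named above is a two-dimensional rank mesh: shard groups stay inside a node, replicate groups span nodes. The sketch below only computes which global ranks land in which group; hsdp_groups is a hypothetical helper, and node-major rank numbering is an assumption to check against the launcher in use.

```python
def hsdp_groups(num_nodes, gpus_per_node):
    """Return (shard_groups, replicate_groups) of global ranks for a 2-D mesh.

    Ranks are numbered node-major: rank = node * gpus_per_node + local_gpu.
    """
    # Shard groups: all GPUs of one node (parameters sharded over NVLink).
    shard_groups = [
        [node * gpus_per_node + gpu for gpu in range(gpus_per_node)]
        for node in range(num_nodes)
    ]
    # Replicate groups: the same local GPU index on every node (gradients
    # all-reduced across nodes over the inter-node fabric).
    replicate_groups = [
        [node * gpus_per_node + gpu for node in range(num_nodes)]
        for gpu in range(gpus_per_node)
    ]
    return shard_groups, replicate_groups


shard, replicate = hsdp_groups(num_nodes=2, gpus_per_node=4)
print(shard)       # [[0, 1, 2, 3], [4, 5, 6, 7]]
print(replicate)   # [[0, 4], [1, 5], [2, 6], [3, 7]]
```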

Cookbook (common use cases)

# 1) GRPO on GSM8K (colocated, FSDP + vLLM)
uv run -m skyrl.train.entrypoints.main_base \
  trainer.backend=fsdp generator.backend=vllm \
  placement.colocate=true algo=grpo data=gsm8k

# 2) Multi-turn agentic task (skyrl-agent over the OpenAI API)
uv run -m skyrl.train.entrypoints.main_base \
  env=text2sql generator.backend=vllm algo=grpo \
  generator.max_turns=8

# 3) Disaggregated mode with async rollouts (separate pools)
uv run -m skyrl.train.entrypoints.main_base \
  placement.colocate=false generator.async_rollout=true \
  trainer.backend=megatron generator.backend=vllm

(Flag names are illustrative of the config surface; confirm exact keys in docs.skyrl.ai before a real run.)

Failure modes

Missing RAY_RUNTIME_ENV_HOOK. Ray workers miss the uv environment and imports fail.
Disaggregated without enough inter-pool bandwidth. Weight-sync stalls come to dominate the step.
Under-provisioned rollout pool. Training GPUs sit idle while generation lags behind.
GRPO instability (entropy or advantage collapse, KL drift). This comes from the algorithm, not the library (see GRPO); watch reward, entropy, and KL (observability). The zero-std advantage case is validated in the core-math block above.
Package churn. The layout (skyrl / skyrl-train / skyrl-tx) is still merging; pin a commit and re-read the README.
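
The under-provisioned rollout pool can be estimated before launch. The arithmetic below is a sketch under strong assumptions: both helpers are hypothetical, generation work is assumed to divide evenly across rollout GPUs (so stragglers are ignored), and generation is assumed to overlap fully with training as in disaggregated mode.

```python
import math


def trainer_idle_fraction(gen_gpu_seconds, rollout_gpus, train_s):
    """Share of each step the trainer spends waiting for rollouts."""
    gen_s = gen_gpu_seconds / rollout_gpus
    step_s = max(gen_s, train_s)
    return (step_s - train_s) / step_s


def min_rollout_gpus(gen_gpu_seconds, train_s):
    """Smallest rollout pool for which generation no longer outlasts training."""
    return math.ceil(gen_gpu_seconds / train_s)


# Hypothetical workload: 960 GPU-seconds of generation per step, 60 s of training.
print(trainer_idle_fraction(960.0, rollout_gpus=8, train_s=60.0))    # 0.5
print(min_rollout_gpus(960.0, train_s=60.0))                         # 16
print(trainer_idle_fraction(960.0, rollout_gpus=16, train_s=60.0))   # 0.0
```

Long-horizon agentic rollouts have heavy-tailed lengths, so a real pool usually needs headroom beyond this even-split estimate.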

References

SkyRL (NovaSky / UC Berkeley): https://github.com/NovaSky-AI/SkyRL
SkyRL docs: https://docs.skyrl.ai/docs/
Anyscale, Open Source RL Libraries for LLMs: https://www.anyscale.com/blog/open-source-rl-libraries-for-llms
Anyscale, Vision-Language RL in SkyRL: https://www.anyscale.com/blog/vision-language-model-reinforcement-learning-skyrl
SkyRL-Agent (multi-turn): https://arxiv.org/abs/2511.16108

Related: RL libraries · Fine-tuning · GRPO · verl · slime · OpenRLHF · Ray · Async RL · Disaggregated · Glossary