Markdown

Recipe: RL cluster bring-up (k3s + KubeRay + verl)¶

Scope: the end-to-end bring-up sequence for a self-hosted RL post-training cluster, assembled from parts this KB covers individually: GPU nodes joined to k3s, GPUs scheduled through the NVIDIA runtime and device plugin, the KubeRay operator running a RayCluster shaped for RL, a verl GRPO job submitted over the Ray job API, and the weight-sync path decision (verl's colocated resync vs disaggregated delta weight sync) with its staleness knobs modeled and asserted. Each stage ends with a verification gate; do not proceed past a red gate. The architecture instantiates the generator/trainer paradigm from the RL libraries overview and the actor/learner split in async RL systems.

Manifests and commands are reference templates on real APIs, pinned to versions verified against the upstream repos on 2026-07-03 (k3s v1.36.2+k3s1, KubeRay v1.6.2, Ray 2.56.0, verl v0.8.0, NVIDIA device plugin v0.19.3); re-verify tags before installing, they move monthly. The weight-sync knob model is executed and asserted; the manifests have not been run against live GPUs from this page.

What it is¶

A single ordered path from bare GPU nodes to a running GRPO job, with a verification gate at each layer boundary. Five layers, each depending only on the one below:

Nodes: NVIDIA driver, NVIDIA Container Toolkit, and (multi-node) the network fabric (networking fabric).
Orchestrator: k3s, a single-binary Kubernetes; its containerd auto-detects the NVIDIA runtime.
GPU scheduling: the NVIDIA device plugin advertises nvidia.com/gpu; pods request GPUs like any resource (Kubernetes for GPUs).
Ray substrate: the KubeRay operator reconciles a RayCluster (head + GPU worker groups) as native Kubernetes objects.
RL job: verl submitted via ray job submit; the HybridFlow controller drives rollout, reward, advantage, and update over the worker GPUs, colocating actor, rollout engine, and reference by default.

The final stage is a decision, not an install: where trainer-to-rollout weight synchronization runs. Colocated verl keeps it on-device and needs nothing extra; a disaggregated rollout fleet turns weight sync into a network problem that delta weight sync shrinks by roughly two orders of magnitude.

Why it matters¶

The layers are documented; the seams are not. Each component page covers its own failure modes, but most bring-up failures live at the boundaries: a GPU visible to nvidia-smi but not to Ray, a RayCluster whose rayVersion disagrees with the image, a worker group the autoscaler shrinks mid-run. Gating each stage isolates the failing seam instead of debugging the whole stack at once.
RL is the most demanding workload shape a cluster runs. It is simultaneously a gang-scheduled training job and an inference deployment, coupled by a per-step weight-sync loop. A cluster that runs SFT fine can still fail RL on shared-memory limits, staleness, or rollout OOM (async RL systems).
The weight-sync decision is architectural, not tunable-later. Colocated and disaggregated layouts have different hardware requirements, different failure modes, and different observability needs; choosing late means re-provisioning.

When to use it (and when not)¶

Use this recipe for a single-tenant RL lab or product team on one to a few dozen GPU nodes it controls: the k3s + KubeRay + verl stack is the least-moving-parts path to multi-node GRPO with standard Kubernetes semantics.
Use full Kubernetes (Kubernetes, Kubernetes for GPUs) at datacenter scale or multi-tenant: k3s's control plane is not built for large fleets, and quota, gang scheduling (Kueue, Volcano), and DRA at scale belong there. Every manifest below ports unchanged.
Use Ray on Slurm (Ray on Slurm) where Slurm already owns the hardware; do not run two schedulers.
Use a managed training API (Tinker) when the goal is post-training without operating GPUs at all; this recipe is for owning the cluster.

Architecture¶

flowchart TB
  subgraph L1["Nodes + k3s"]
    SRV["k3s server (API, scheduler, SQLite/etcd)"]
    AG["GPU agents: containerd nvidia runtime"]
    DP["device plugin: nvidia.com/gpu"]
  end
  subgraph L2["Ray substrate (KubeRay)"]
    OP["kuberay-operator"]
    HEAD["Ray head: GCS, dashboard :8265"]
    WG["GPU worker group (gang, no autoscale)"]
  end
  subgraph L3["verl job (HybridFlow)"]
    CTRL["single controller"]
    ACT["actor update (FSDP)"]
    ROLL["rollout (vLLM)"]
  end
  SRV --> AG
  DP --> WG
  OP --> HEAD
  OP --> WG
  CTRL -->|"generate"| ROLL
  ROLL -->|"trajectories + rewards"| CTRL
  CTRL -->|"advantage + update"| ACT
  ACT ==>|"weight resync"| ROLL

verl colocates ACT and ROLL on the same worker GPUs and reshards between the train and generate phases (the 3D-HybridEngine), so the ACT ==> ROLL edge stays on NVLink or on-device. The disaggregated variant in stage 6 moves ROLL to its own pool and replaces that edge with a sparse delta over the network.

How: the bring-up sequence¶

Stage 0: node prerequisites¶

On every GPU node, in order: the NVIDIA driver, then the NVIDIA Container Toolkit. k3s only auto-detects the runtime if the toolkit is present before k3s starts.

nvidia-smi                                   # driver loaded, GPUs enumerate
nvidia-ctk --version                         # container toolkit installed

Multi-node NCCL needs the fabric proven first: run nccl-tests between nodes before blaming any higher layer (fabric bring-up and benchmarking). Gate: nvidia-smi clean on every node; multi-node all_reduce_perf at expected bus bandwidth.

Stage 1: k3s server and GPU agents¶

# server (control plane; disable the bundled ingress if you run your own)
curl -sfL https://get.k3s.io | INSTALL_K3S_VERSION="v1.36.2+k3s1" sh -s - server
sudo cat /var/lib/rancher/k3s/server/node-token          # join token

# each GPU node joins as an agent with nvidia as default runtime
curl -sfL https://get.k3s.io | INSTALL_K3S_VERSION="v1.36.2+k3s1" \
  K3S_URL=https://<server>:6443 K3S_TOKEN=<node-token> \
  INSTALL_K3S_EXEC="--default-runtime nvidia" sh -

Pin INSTALL_K3S_VERSION; an unpinned install drifts across nodes. For a resilient control plane use three servers with --cluster-init (embedded etcd); the default SQLite datastore is a single point of failure (k3s). Gate: kubectl get nodes shows every node Ready, and grep nvidia /var/lib/rancher/k3s/agent/etc/containerd/config.toml hits on every agent. If the grep misses, the toolkit was installed after k3s: reinstall k3s or restart the agent.

Stage 2: GPU scheduling¶

The device plugin DaemonSet advertises GPUs to the scheduler. The full GPU Operator (driver + toolkit + DCGM + MIG lifecycle) is the heavier alternative; on k3s it must be pointed at the k3s containerd socket /run/k3s/containerd/containerd.sock (k3s, Kubernetes for GPUs). Note the operator's documented Rancher path targets RKE2; on k3s, host-installed toolkit plus device plugin is the path the k3s docs describe.

kubectl create -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.19.3/deployments/static/nvidia-device-plugin.yml
kubectl get nodes -o custom-columns='NAME:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu'

Gate: every GPU node allocates the expected nvidia.com/gpu count, and a smoke pod runs nvidia-smi:

kubectl run smi --rm -it --restart=Never --image=nvidia/cuda:13.0.0-base-ubuntu24.04 \
  --overrides='{"spec":{"containers":[{"name":"smi","image":"nvidia/cuda:13.0.0-base-ubuntu24.04","command":["nvidia-smi"],"resources":{"limits":{"nvidia.com/gpu":1}}}]}}'

Stage 3: KubeRay operator¶

helm repo add kuberay https://ray-project.github.io/kuberay-helm/ && helm repo update
helm install kuberay-operator kuberay/kuberay-operator \
  --version 1.6.2 --namespace kuberay-system --create-namespace

Gate: kubectl get crd | grep ray.io lists rayclusters, rayjobs, rayservices, and the operator deployment is Available.

Stage 4: a RayCluster shaped for RL¶

Three choices distinguish an RL cluster from a generic Ray one. First, head and workers run the verl image, not the plain Ray image, so the job's environment exists on every node; rayVersion must match the Ray inside that image. Second, the GPU worker group is gang-shaped: replicas == minReplicas == maxReplicas, no autoscaling, because a mid-run scale-down kills a collective. Third, workers mount /dev/shm sized for NCCL and vLLM and a shared data volume for datasets and checkpoints.

# raycluster-rl.yaml (reference template; pin and verify the image tag)
apiVersion: ray.io/v1
kind: RayCluster
metadata: { name: rl, namespace: ml }
spec:
  rayVersion: "2.56.0"              # MUST equal `python3 -c "import ray; print(ray.__version__)"` in the image
  headGroupSpec:
    rayStartParams: { dashboard-host: "0.0.0.0" }
    template:
      spec:
        containers:
          - name: ray-head
            image: verlai/verl:vllm023.dev1        # verl app image, vLLM variant
            resources:
              limits: { cpu: "8", memory: 32Gi }   # head schedules; it needs no GPU
  workerGroupSpecs:
    - groupName: gpu
      replicas: 2
      minReplicas: 2
      maxReplicas: 2                # gang: never let the autoscaler resize a training group
      rayStartParams: {}
      template:
        spec:
          containers:
            - name: ray-worker
              image: verlai/verl:vllm023.dev1
              resources:
                limits:   { nvidia.com/gpu: "8", cpu: "48", memory: 512Gi }
                requests: { nvidia.com/gpu: "8", cpu: "48", memory: 512Gi }
              volumeMounts:
                - { name: dshm, mountPath: /dev/shm }
                - { name: data, mountPath: /data }
          volumes:
            - name: dshm
              emptyDir: { medium: Memory, sizeLimit: 64Gi }
            - name: data
              persistentVolumeClaim: { claimName: rl-data }   # shared storage (NFS/CephFS); local-path only single-node

kubectl apply -f raycluster-rl.yaml
kubectl -n ml get raycluster rl                          # State: ready
kubectl -n ml exec -it svc/rl-head-svc -- python3 -c "import ray, verl; print(ray.__version__, verl.__version__)"

Gate: the cluster reports ready, the exec prints matching versions, and ray status inside the head shows all worker GPUs. For multi-tenant clusters, admit the worker group through Kueue or Volcano so a partially scheduled gang cannot deadlock; keep degraded GPUs out with health gating.

Stage 5: dataset and the verl GRPO job¶

verl consumes parquet with a prompt column and a ground-truth field; its preprocessors write both splits. Prepare once onto the shared volume, then submit through the dashboard port (verl's documented multi-node path).

# one-off, inside the head pod (or a Job): GSM8K to /data
python3 examples/data_preprocess/gsm8k.py --local_save_dir /data/gsm8k

# from an operator workstation
kubectl -n ml port-forward svc/rl-head-svc 8265:8265 &
ray job submit --address http://127.0.0.1:8265 \
  --runtime-env verl/trainer/runtime_env.yaml --no-wait -- \
  python3 -m verl.trainer.main_ppo \
    algorithm.adv_estimator=grpo \
    data.train_files=/data/gsm8k/train.parquet \
    data.val_files=/data/gsm8k/test.parquet \
    actor_rollout_ref.model.path=Qwen/Qwen2.5-7B-Instruct \
    actor_rollout_ref.rollout.name=vllm \
    actor_rollout_ref.rollout.n=8 \
    actor_rollout_ref.actor.use_kl_loss=True \
    actor_rollout_ref.actor.kl_loss_coef=0.001 \
    trainer.n_gpus_per_node=8 trainer.nnodes=2 \
    trainer.save_freq=20 trainer.test_freq=10 \
    trainer.default_local_dir=/data/ckpt/grpo-qwen7b

trainer.nnodes * trainer.n_gpus_per_node must equal the gang's GPU total. GRPO needs rollout.n > 1; n=1 silently degenerates to zero advantage (GRPO). Reward design and its failure modes are their own discipline (reward design). Gate: the job reaches RUNNING on the dashboard, reward trends up, entropy does not collapse toward zero, and KL stays bounded. Checkpoints land under trainer.default_local_dir on the shared volume, so a rerun after pod loss resumes instead of restarting.

Stage 6: choose the weight-sync path¶

flowchart LR
  Q{"Do trainer and rollout share GPUs?"}
  Q -->|"yes: verl colocated default"| CO["on-device / NVLink resync; no delta sync needed"]
  Q -->|"no: separate pools, regions, or engines"| DIS["sparse delta over disk / bucket / NCCL"]
  DIS --> KN["knobs: sync interval N, in-flight cap K, anchor every A steps"]
  KN --> ST["worst staleness = (K+1)*N - 1 optimizer steps"]

The colocated default is correct until generation dominates: long multi-turn or agentic rollouts, a rollout fleet that must scale independently of the trainer (fleet sizing runbook), heterogeneous or scattered capacity, or cross-datacenter runs. Disaggregating turns every optimizer step into a weight shipment, and at RL learning rates about 99% of BF16 bytes do not change per step, so shipping only the sparse delta cuts that traffic by roughly two orders of magnitude, losslessly; Fireworks supported Cursor's Composer 2 training run, distributed across three to four global clusters, this way. The mechanics, transports (slime's disk path, SGLang's in-engine NCCL receiver, TRL's Hub bucket), and the codec round-trip are on delta weight sync; prompt-level rollout economics are on rollout redundancy.

Whatever the framework calls them, a disaggregated deployment exposes four control knobs: the sync interval N (publish one cumulative delta per N optimizer steps), the in-flight cap K (published-but-unapplied deltas before the trainer blocks), a staleness bound the rollout engine enforces before serving, and an anchor interval A (periodic full snapshot for recovery). The model below is executed and asserted; it is what the knobs do to on-policyness, independent of implementation:

# sync_knobs.py -- validated model of the disaggregated weight-sync control knobs:
# sync_interval (publish one cumulative delta per N optimizer steps),
# max_inflight_deltas (K published-but-unapplied deltas before the trainer blocks),
# and anchor recovery. Executed and asserted, including adversarial cases.
import numpy as np

def sparse_delta(prev: np.ndarray, cur: np.ndarray):
    pos = np.flatnonzero(prev != cur)
    return pos, cur[pos]                      # overwrite encoding: exact values at changed positions

def apply_delta(w: np.ndarray, delta) -> np.ndarray:
    pos, vals = delta
    out = w.copy()
    out[pos] = vals
    return out

def compose_window(deltas):
    # cumulative window delta: the LAST write per index must win
    acc: dict = {}
    for pos, vals in deltas:
        acc.update(zip(pos.tolist(), vals.tolist()))
    pos = np.array(sorted(acc), dtype=np.int64)
    return pos, np.array([acc[p] for p in pos], dtype=np.float32)

def worst_staleness(steps: int, sync_interval: int, max_inflight: int) -> int:
    # laziest-legal engine: applies a published delta only when the trainer
    # would otherwise exceed max_inflight. Staleness at step t = optimizer
    # steps between the served weights and the policy being updated.
    served, inflight, worst = 0, [], 0
    for t in range(1, steps + 1):
        worst = max(worst, (t - 1) - served)
        if t % sync_interval == 0:
            inflight.append(t)
            while len(inflight) > max_inflight:
                served = inflight.pop(0)
    return worst

rng = np.random.default_rng(7)
N_W, T = 200_000, 6
w = [rng.standard_normal(N_W).astype(np.float32)]
for t in range(T):                             # T optimizer steps, ~2% of weights move per step
    nxt = w[-1].copy()
    chg = rng.choice(N_W, int(0.02 * N_W), replace=False)
    nxt[chg] = rng.standard_normal(chg.size).astype(np.float32)
    w.append(nxt)
deltas = [sparse_delta(w[t], w[t + 1]) for t in range(T)]

# 1) per-step apply chain reconstructs the final weights exactly
cur = w[0]
for d in deltas:
    cur = apply_delta(cur, d)
assert np.array_equal(cur, w[T])

# 2) sync_interval > 1: one composed window delta equals the per-step chain,
#    and equals the direct diff from window start to window end
win = compose_window(deltas)
assert np.array_equal(apply_delta(w[0], win), w[T])
dpos, dvals = sparse_delta(w[0], w[T])
wpos, wvals = win
overlap, seen = set(), set()
for pos, _ in deltas:
    s = set(pos.tolist())
    overlap |= (s & seen)
    seen |= s
assert overlap, "test not adversarial: no index changed twice in the window"
# direct diff can be SMALLER than the composed window (an index can change and
# then change back to its start value); applied, both reconstruct identically
assert dpos.size <= wpos.size
assert np.array_equal(apply_delta(w[0], (dpos, dvals)), w[T])

# 3) adversarial: first-write-wins composition is WRONG when an index changed twice
first: dict = {}
for pos, vals in deltas:
    for p, v in zip(pos.tolist(), vals.tolist()):
        first.setdefault(p, v)
fpos = np.array(sorted(first), dtype=np.int64)
fvals = np.array([first[p] for p in fpos], dtype=np.float32)
assert not np.array_equal(apply_delta(w[0], (fpos, fvals)), w[T]), "first-write-wins must corrupt"

# 4) adversarial: skipping one delta in the chain (out-of-order apply) corrupts silently
cur = w[0]
for i, d in enumerate(deltas):
    if i != 2:
        cur = apply_delta(cur, d)
assert not np.array_equal(cur, w[T]), "a skipped delta must not reconstruct"

# 5) anchor recovery: a full snapshot every A steps + the deltas since it
#    rebuilds the current weights after a failed/corrupt apply
A = 3
anchor = w[A].copy()                            # anchor kept at step A
rebuilt = anchor
for d in deltas[A:]:
    rebuilt = apply_delta(rebuilt, d)
assert np.array_equal(rebuilt, w[T])

# 6) staleness bound: worst-case staleness == (K+1)*N - 1 optimizer steps
#    (tight, achieved by the laziest-legal schedule); N=1, K=0 is synchronous
for N in (1, 2, 3, 4):
    for K in (0, 1, 2, 3):
        got = worst_staleness(steps=(K + 3) * N * 4, sync_interval=N, max_inflight=K)
        assert got == (K + 1) * N - 1, (N, K, got)
assert worst_staleness(100, 1, 0) == 0          # fully synchronous: on-policy every step

print("window composition, corruption cases, anchor recovery, staleness bound: all asserted")
print(f"worst staleness (N=2,K=1) = {worst_staleness(100, 2, 1)} steps; (N=1,K=0) = 0 (synchronous)")

Read the asserted facts as operating rules. Deltas are order-sensitive and non-skippable, so sequence numbers and checksums are mandatory, and any failed apply must trigger an anchor resync, never a continue (cases 3 to 5). Staleness is a budget you set, not an accident: (K+1)*N - 1 optimizer steps in the worst case, so N=2 with K=1 means training on data up to 3 steps old, which the update rule must absorb via truncated importance sampling (async RL systems). Raising N cuts bucket round-trips linearly but widens the window delta and the staleness bound with it. Gate (disaggregated only): bit-identical reconstruction proven on your engine pair before the first real run, and staleness exported as a metric from day one.

How to develop on it¶

Iterate on the recipe, not the infrastructure. The reward function is the highest-leverage file: point custom_reward_function.path at it and unit-test the scorer on held-out strings before every campaign, exactly as GRPO prescribes; hyperparameter sweeps change only the ray job submit overrides. Keep a single-GPU dev loop: the same verl command with trainer.nnodes=1 trainer.n_gpus_per_node=1 and a 0.5B model runs on one node (memory-efficient GRPO recipe when HBM is the constraint), and k3s makes that dev cluster a one-command install. Version manifests and job commands in git; the cluster should be reproducible from the repo alone.

How to run it in production¶

Observability first. The Ray head exports ray_* metrics on :8080; scrape it and DCGM, and alert on pending tasks and GPU utilization (telemetry and monitoring, training-API observability):

sum(ray_tasks{State="PENDING_NODE_ASSIGNMENT"})          # gang stuck: quota or capacity
sum(ray_resources{Name="GPU", State="USED"}) / sum(ray_resources{Name="GPU"})

Training health lives in the job logs: reward, entropy, KL, and the per-group reward std for GRPO group collapse (verl logs group reward statistics; TRL's equivalent is frac_reward_zero_std). - Long-lived RayCluster vs RayJob. A shared RayCluster keeps models and caches warm between runs; a RayJob per campaign (shutdownAfterJobFinishes: true) guarantees a clean slate and returns the GPUs. Default to RayJob for scheduled campaigns, RayCluster for interactive research. - Checkpoint discipline. trainer.save_freq against shared storage, and validate resume before a long run, not during the incident (RL resume validation). - Rate-match the pools if disaggregated: trainer starvation and idle rollout GPUs are both sizing bugs; treat the ratio as a live control variable (fleet sizing runbook).

How to maintain it¶

Pin the whole stack as one versioned set: k3s, device plugin, KubeRay chart, rayVersion, the verl image tag, and the model revision. Upgrade one layer at a time, bottom-up, draining jobs first; after any inference-engine or verl bump, re-verify the weight-sync path (colocated: one sanity run comparing reward traces; disaggregated: bit-identical reconstruction again, since the apply path lives inside the engine). CRD schemas change across KubeRay minors, so helm upgrade the operator only after reading its release notes. Keep the SQLite-to-etcd migration (or the move to full Kubernetes) ahead of node count, not behind it.

Failure modes¶

GPUs visible to the node, invisible to Ray. Device plugin healthy but ray status shows 0 GPUs: the worker pod lacks nvidia.com/gpu limits, or the image has no CUDA userspace. Fix the manifest, not the node.
rayVersion / image skew. Head and workers must run the same Ray as the CRD declares; mismatch surfaces as GCS handshake failures at cluster create.
NCCL bus error or hang at first collective. /dev/shm too small (raise the dshm emptyDir) or the fabric was never proven (back to stage 0; fabric bring-up).
Rollout OOM mid-run. The signature colocation failure: the vLLM phase inherits too little HBM. Lower rollout.gpu_memory_utilization or enable verl's param/optimizer offload (verl).
Autoscaler kills a training worker. A resized gang aborts the collective; keep min == max == replicas for training groups and let only rollout fleets autoscale.
Reward flat while loss moves. Usually the reward function or dataset schema, not the cluster; unit-test the scorer, check the per-group reward std for collapsed groups (GRPO, reward design).
Silent staleness in a disaggregated loop. Weights drift past the assumed bound and training destabilizes off-policy; export staleness, enforce the bound at generate time, resync from anchor on any failed apply (delta weight sync, async RL systems).
k3s control plane on SQLite under churn. Many nodes or high write rates on the default datastore: move to embedded etcd or full Kubernetes (k3s).

References¶

k3s docs (quick start, advanced/NVIDIA runtime): https://docs.k3s.io/quick-start · https://docs.k3s.io/advanced
NVIDIA device plugin releases: https://github.com/NVIDIA/k8s-device-plugin/releases
NVIDIA GPU Operator docs: https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/index.html
KubeRay docs and releases: https://docs.ray.io/en/latest/cluster/kubernetes/index.html · https://github.com/ray-project/kuberay/releases
verl repo and releases: https://github.com/verl-project/verl · https://github.com/verl-project/verl/releases
verl multi-node guide (ray job submit path): https://verl.readthedocs.io/en/latest/start/multinode.html
verl quickstart and GRPO docs: https://verl.readthedocs.io/en/latest/start/quickstart.html · https://verl.readthedocs.io/en/latest/algo/grpo.html
verl GSM8K preprocessor: https://github.com/verl-project/verl/blob/main/examples/data_preprocess/gsm8k.py
verl images on Docker Hub: https://hub.docker.com/r/verlai/verl/tags
Miahi & Belilovsky, PULSE (weight-update sparsity for distributed RL): https://arxiv.org/abs/2602.03839
Fireworks, "Frontier RL Is Cheaper Than You Think": https://fireworks.ai/blog/frontier-rl-is-cheaper-than-you-think