slime (THUDM)

Scope: Tsinghua / Z.ai's decoupled, async-first RL post-training framework, the RL stack behind the GLM model family.

The commands below are reference templates built on real APIs; pin versions and validate them before production use.

What it is

slime is an LLM post-training framework for RL scaling that tightly connects Megatron-LM training with high-throughput SGLang rollouts, orchestrated by Ray. It is opinionated and minimal: three modules, namely Training (Megatron), Rollout (SGLang + Router), and a central Data Buffer that bridges them, all wired for asynchronous operation. It is the framework behind the GLM line (as of mid-2026 the repo lists GLM-4.5, 4.6, 4.7, 5, 5.1, 5.2; verify on the repo), and also supports Qwen and DeepSeek-V3-class models. For the broader landscape see RL libraries.

Why use it

High-performance MoE RL. Battle-tested on SOTA MoE models: it is the stack used to train the open-weight GLM models. Megatron handles large expert-parallel training; SGLang handles fast rollouts.
Decoupled & async. Training and rollout run in separate pools joined by the Data Buffer, so generation overlaps training with configurable staleness. The design tolerates stragglers and favours throughput.
Simple to extend. Megatron and SGLang args pass straight through; customisation is a single rollout/data-generation function, not a framework rewrite.

When to use it (and when not)

Use it for high-throughput RL on large dense or MoE models where you want decoupled async generation and a minimal, opinionated surface, especially GLM-class training.
Prefer verl for the broadest algorithm/recipe ecosystem and colocated throughput, or SkyRL for flexible colocated-or-disaggregated agentic loops. slime is Megatron+SGLang only by design; if your stack is FSDP+vLLM, that friction matters.

Architecture

flowchart LR
  subgraph Train["Train pool (Megatron-LM)"]
    T["Actor update"]
  end
  subgraph Roll["Rollout pool (SGLang + Router)"]
    G["Generate + reward / verifier"]
  end
  BUF["Data Buffer"]
  G -->|"trajectories + rewards"| BUF
  BUF -->|"training batches"| T
  T ==>|"async weight sync"| G
  RAY["Ray orchestration"] -.-> Train
  RAY -.-> Roll

How to use it

slime is distributed as a Docker image (matched Megatron + SGLang). Pin a tagged release rather than latest:

docker pull slimerl/slime:<pinned>     # e.g. a dated release tag; check the repo
docker run --rm --gpus all --ipc=host --shm-size=16g \
  --ulimit memlock=-1 --ulimit stack=67108864 -it slimerl/slime:<pinned> /bin/bash

# inside the container:
cd /root/slime && git pull && pip install -e . --no-deps

# start Ray, then submit a run:
ray start --head --node-ip-address ${MASTER_ADDR} --num-gpus 8 --disable-usage-stats
ray job submit --address="http://127.0.0.1:8265" -- \
  python3 train.py --actor-num-nodes 1 --actor-num-gpus-per-node 4 --rollout-num-gpus 4

The fastest start is an example script, e.g. bash scripts/run-glm4-9B.sh.

How to develop with it

Runs are shell scripts that group arguments into arrays: MODEL_ARGS (Megatron model, sourced from scripts/models/<model>.sh), CKPT_ARGS, ROLLOUT_ARGS, PERF_ARGS, GRPO_ARGS (algorithm), and SGLANG_ARGS. Megatron flags pass through verbatim; SGLang flags take a --sglang- prefix (so SGLang's --mem-fraction-static becomes --sglang-mem-fraction-static).
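
As a minimal sketch of that layout, the script below groups a few flags into arrays and prints the resulting command instead of launching it. The array names follow the text above; the flag values, and the choice of `--advantage-estimator grpo` as the algorithm flag, are illustrative assumptions rather than a tested configuration. MODEL_ARGS is left empty here because it would normally come from scripts/models/<model>.sh.

```bash
#!/usr/bin/env bash
# Dry run: builds the train.py command from argument arrays and prints it.

# Normally populated by: source scripts/models/<model>.sh
# Left empty so this sketch runs without the slime repo.
MODEL_ARGS=()

# Megatron flags keep their own names (illustrative values).
PERF_ARGS=(
  --tensor-model-parallel-size 2
  --pipeline-model-parallel-size 1
)

# Algorithm settings (assumed flag and value; check the slime docs).
GRPO_ARGS=(
  --advantage-estimator grpo
)

# SGLang flags carry the --sglang- prefix.
SGLANG_ARGS=(
  --sglang-mem-fraction-static 0.8
)

CMD=(
  python3 train.py
  --actor-num-nodes 1
  --actor-num-gpus-per-node 4
  --rollout-num-gpus 4
  "${MODEL_ARGS[@]}"
  "${PERF_ARGS[@]}"
  "${GRPO_ARGS[@]}"
  "${SGLANG_ARGS[@]}"
)

# Print the assembled command instead of executing it.
echo "${CMD[*]}"
```

Keeping each group in its own array makes it easy to diff two runs or swap one group (for example SGLANG_ARGS) without touching the rest.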

Custom generation / reward plug in by path, not by editing core:

python3 train.py \
  --rollout-function-path my_pkg.rollout.generate \
  --custom-generate-function-path my_pkg.gen.multi_turn \
  --rollout-num-gpus 8

The custom function receives prompts from the Data Buffer and returns completions plus rewards/verifier outputs.

How to scale it

Decoupled async means rollout and train pools scale independently across nodes. Add worker nodes to Ray, then size each pool:

# head:
ray start --head --node-ip-address ${MASTER_ADDR} --num-gpus 8 --disable-usage-stats
# each worker:
ray start --address=${MASTER_ADDR}:6379 --num-gpus 8
# submit with separate train/rollout sizing across the cluster:
ray job submit --address="http://127.0.0.1:8265" -- \
  python3 train.py --actor-num-nodes 4 --actor-num-gpus-per-node 8 --rollout-num-gpus 32

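MoE training uses Megatron expert, tensor, and pipeline parallelism (see tensor parallelism and pipeline parallelism). The parallel sizes are Megatron flags (--expert-model-parallel-size, --tensor-model-parallel-size, --pipeline-model-parallel-size) and belong in PERF_ARGS; MODEL_ARGS describes the model architecture. Rollout scales by adding SGLang replicas behind the Router. Weight sync streams updated actor weights to the rollout pool and overlaps with the next training step.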
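
The sizing arithmetic behind the submit command above can be sketched as follows, assuming the train and rollout pools occupy disjoint GPUs as the decoupled layout implies. The 8-GPU node shape matches the `ray start` lines; the number of GPUs per SGLang replica is a hypothetical value chosen only for illustration.

```bash
#!/usr/bin/env bash
# Values taken from the multi-node submit command above.
ACTOR_NUM_NODES=4
ACTOR_NUM_GPUS_PER_NODE=8
ROLLOUT_NUM_GPUS=32

# Assumptions for this sketch.
GPUS_PER_NODE=8              # matches --num-gpus 8 in the ray start lines
ROLLOUT_GPUS_PER_REPLICA=8   # hypothetical size of one SGLang replica

TRAIN_GPUS=$(( ACTOR_NUM_NODES * ACTOR_NUM_GPUS_PER_NODE ))
TOTAL_GPUS=$(( TRAIN_GPUS + ROLLOUT_NUM_GPUS ))
# Round up to whole nodes.
NODES_NEEDED=$(( (TOTAL_GPUS + GPUS_PER_NODE - 1) / GPUS_PER_NODE ))
ROLLOUT_REPLICAS=$(( ROLLOUT_NUM_GPUS / ROLLOUT_GPUS_PER_REPLICA ))

echo "train GPUs:       ${TRAIN_GPUS}"
echo "rollout GPUs:     ${ROLLOUT_NUM_GPUS}"
echo "total GPUs:       ${TOTAL_GPUS}"
echo "nodes needed:     ${NODES_NEEDED}"
echo "rollout replicas: ${ROLLOUT_REPLICAS}"
```

Running a check like this before `ray job submit` catches the case where the requested pools exceed what the Ray cluster actually has.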

Inference

slime does not serve external traffic; rollout uses SGLang internally to generate trajectories (multi-turn capable via the custom generate function). BF16 training with FP8 rollout is supported to cut generation cost. For standalone serving of a trained GLM checkpoint, use the serving stack; for disaggregated serving concepts that mirror slime's train/rollout split, see disaggregated inference.

Fine-tuning

slime is an RL post-training framework (GRPO/PPO-style updates on Megatron), used to RL-tune GLM-class models on top of an SFT/instruct base. For the RL methods see GRPO and the post-training overview; warm-start from an SFT or LoRA checkpoint (see SFT and LoRA) before RL.

Optimised hardware

Async weight sync (train → rollout) streams over the interconnect and overlaps the next training step; when the pools sit on different nodes it benefits from fast InfiniBand/RoCE with GPUDirect RDMA (GDR). See networking fabric and performance tuning.
Megatron MoE training relies on NVLink/NVSwitch for intra-node expert-parallel all-to-all and tensor-parallel all-reduce. Set the NCCL environment (NCCL_IB_HCA, NCCL_NET_GDR_LEVEL=SYS, NCCL_NVLS_ENABLE) and disable PCIe ACS before running.
FP8 rollout / Blackwell: FP8 generation on the Blackwell platform lowers rollout cost; keep training in BF16 and verify reward parity against a BF16 rollout baseline.
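
One way to carry the NCCL settings above to every Ray worker is a runtime environment passed at submit time, since variables exported only on the head node do not automatically reach the workers. The sketch below builds that JSON and prints it. The values are assumptions: `mlx5` is a hypothetical HCA name prefix, and `NCCL_NVLS_ENABLE=1` is one possible setting; substitute what your fabric actually needs.

```bash
#!/usr/bin/env bash
# Site-specific values (assumptions for this sketch).
NCCL_IB_HCA="mlx5"          # hypothetical prefix; list real devices with ibv_devices
NCCL_NET_GDR_LEVEL="SYS"    # value named in the text above
NCCL_NVLS_ENABLE="1"        # assumed setting

# Build the JSON that ray job submit accepts via --runtime-env-json.
RUNTIME_ENV_JSON=$(printf \
  '{"env_vars": {"NCCL_IB_HCA": "%s", "NCCL_NET_GDR_LEVEL": "%s", "NCCL_NVLS_ENABLE": "%s"}}' \
  "$NCCL_IB_HCA" "$NCCL_NET_GDR_LEVEL" "$NCCL_NVLS_ENABLE")

echo "$RUNTIME_ENV_JSON"

# It would then be passed along these lines:
#   ray job submit --address="http://127.0.0.1:8265" \
#     --runtime-env-json="${RUNTIME_ENV_JSON}" -- python3 train.py ...
```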

Cookbook (common use cases)

1) A GLM-class RL run (single node):

cd /root/slime && bash scripts/run-glm4-9B.sh     # edit ROLLOUT_ARGS / GRPO_ARGS inline

2) Custom rollout generation (multi-turn / agentic):

python3 train.py \
  --rollout-function-path my_pkg.rollout.generate \
  --custom-generate-function-path my_pkg.agent.loop \
  --sglang-mem-fraction-static 0.8 --rollout-num-gpus 8

3) MoE training (multi-node, separate pools): source the MoE model's scripts/models/<model>.sh for MODEL_ARGS, set the expert/tensor parallel sizes in PERF_ARGS, start the Ray head and workers, then ray job submit ... -- python3 train.py --actor-num-nodes 4 --actor-num-gpus-per-node 8 --rollout-num-gpus 32.

Gotchas & failure modes

SGLang flags need the --sglang- prefix: an un-prefixed SGLang flag is either rejected or never reaches SGLang. After the prefix, keep SGLang's own flag name unchanged.
Pool sizing: an under-provisioned rollout pool starves the train pool (and vice versa); tune --rollout-num-gpus against --actor-num-* and watch GPU idle.
Async staleness: too-stale rollouts hurt convergence; keep weight-sync cadence tight and monitor reward, entropy, and KL for collapse (see GRPO).
Megatron checkpoint format: model args must match the converted Megatron checkpoint exactly, or load fails. Source the matching scripts/models/<model>.sh.
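
For the prefix gotcha above, a small helper can translate SGLang flag names when porting an existing SGLang launch line. This is a sketch only: `to_slime_flag` is a hypothetical helper name, not part of slime, and it only rewrites the flag name without checking that SGLang recognises it.

```bash
#!/usr/bin/env bash
# Map an SGLang long flag to the prefixed form described above.
to_slime_flag() {
  local flag="$1"
  case "$flag" in
    --sglang-*) printf '%s\n' "$flag" ;;                 # already prefixed, keep as-is
    --*)        printf '%s\n' "--sglang-${flag#--}" ;;   # strip "--", add "--sglang-"
    *)          echo "not a long flag: $flag" >&2; return 1 ;;
  esac
}

to_slime_flag --mem-fraction-static
```

The helper is idempotent, so it is safe to run over a list that already mixes prefixed and un-prefixed names.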

References

slime repo: https://github.com/THUDM/slime
slime docs: https://thudm.github.io/slime/
GLM-4.5 (ARC foundation models): https://arxiv.org/abs/2508.06471
Anyscale — Open Source RL Libraries for LLMs: https://www.anyscale.com/blog/open-source-rl-libraries-for-llms

Related: RL libraries · OSS models · Post-training · GRPO · verl · SkyRL · Glossary