slime (THUDM)¶
Scope: Tsinghua / Z.ai's decoupled, async-first RL post-training framework, the RL stack behind the GLM model family.
The commands below are reference templates built on real APIs; pin versions and validate them before production use.
What it is¶
slime is an LLM post-training framework for RL scaling that tightly connects Megatron-LM training with high-throughput SGLang rollouts, orchestrated by Ray. It is opinionated and minimal: three modules, namely Training (Megatron), Rollout (SGLang + Router), and a central Data Buffer that bridges them, all wired for asynchronous operation. It is the framework behind the GLM line (as of mid-2026 the repo lists GLM-4.5, 4.6, 4.7, 5, 5.1, 5.2; verify on the repo), and also supports Qwen and DeepSeek-V3-class models. For the broader landscape see RL libraries.
Why use it¶
- High-performance MoE RL. Proven on large MoE models (it is the RL stack behind the open-weight GLM family); Megatron handles large expert-parallel training and SGLang handles fast rollouts.
- Decoupled & async. Training and rollout run in separate GPU pools joined by the Data Buffer, so generation overlaps training with configurable staleness. The design tolerates stragglers and favours throughput.
- Simple to extend. Megatron and SGLang args pass straight through; customisation is a single rollout/data-generation function, not a framework rewrite.
When to use it (and when not)¶
- Use it for high-throughput RL on large dense or MoE models where you want decoupled async generation and a minimal, opinionated surface, especially GLM-class training.
- Prefer verl for the broadest algorithm/recipe ecosystem and colocated throughput, or SkyRL for flexible colocated-or-disaggregated agentic loops. slime is Megatron+SGLang only by design; if your stack is FSDP+vLLM, that friction matters.
Architecture¶
flowchart LR
subgraph Train["Train pool (Megatron-LM)"]
T["Actor update"]
end
subgraph Roll["Rollout pool (SGLang + Router)"]
G["Generate + reward / verifier"]
end
BUF["Data Buffer"]
G -->|"trajectories + rewards"| BUF
BUF -->|"training batches"| T
T ==>|"async weight sync"| G
RAY["Ray orchestration"] -.-> Train
RAY -.-> Roll
How to use it¶
slime is distributed as a Docker image (matched Megatron + SGLang). Pin a tagged release rather than latest:
docker pull slimerl/slime:<pinned> # e.g. a dated release tag; check the repo
docker run --rm --gpus all --ipc=host --shm-size=16g \
--ulimit memlock=-1 --ulimit stack=67108864 -it slimerl/slime:<pinned> /bin/bash
# inside the container:
cd /root/slime && git pull && pip install -e . --no-deps
# start Ray, then submit a run:
ray start --head --node-ip-address ${MASTER_ADDR} --num-gpus 8 --disable-usage-stats
ray job submit --address="http://127.0.0.1:8265" -- \
python3 train.py --actor-num-nodes 1 --actor-num-gpus-per-node 4 --rollout-num-gpus 4
The fastest start is an example script, e.g. bash scripts/run-glm4-9B.sh.
How to develop with it¶
Runs are shell scripts that group arguments into arrays: MODEL_ARGS (Megatron model, sourced from scripts/models/<model>.sh), CKPT_ARGS, ROLLOUT_ARGS, PERF_ARGS, GRPO_ARGS (algorithm), and SGLANG_ARGS. Megatron flags pass through verbatim; SGLang flags take a --sglang- prefix (so SGLang's --mem-fraction-static becomes --sglang-mem-fraction-static).
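The prefix rule is mechanical enough to sketch in shell. The snippet below is an illustration, not slime's own tooling: `sglang_flag` is a hypothetical helper, the array contents are placeholders, and the assembled command is only printed (a dry run), never executed.

```shell
#!/usr/bin/env bash
# Sketch: group arguments into arrays and apply the --sglang- prefix rule.
# sglang_flag is a hypothetical helper, not part of slime.

# Turn an SGLang flag such as --mem-fraction-static into its slime form.
sglang_flag() {
  printf -- '--sglang-%s' "${1#--}"
}

# Placeholder groups; a real script sources MODEL_ARGS from scripts/models/<model>.sh.
MODEL_ARGS=()
ROLLOUT_ARGS=(--rollout-num-gpus 4)
SGLANG_ARGS=("$(sglang_flag --mem-fraction-static)" 0.8)

# Assemble the full command line without running it (dry run).
CMD=(python3 train.py
  --actor-num-nodes 1 --actor-num-gpus-per-node 4
  "${MODEL_ARGS[@]}" "${ROLLOUT_ARGS[@]}" "${SGLANG_ARGS[@]}")

printf '%s\n' "${CMD[*]}"
```

Keeping each group in its own array makes it easy to swap the model definition or the rollout settings without touching the rest of the command.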
Custom generation / reward plug in by path, not by editing core:
python3 train.py \
--rollout-function-path my_pkg.rollout.generate \
--custom-generate-function-path my_pkg.gen.multi_turn \
--rollout-num-gpus 8
The custom function receives prompts from the Data Buffer and returns completions plus rewards/verifier outputs.
How to scale it¶
Decoupled async means rollout and train pools scale independently across nodes. Add worker nodes to Ray, then size each pool:
# head:
ray start --head --node-ip-address ${MASTER_ADDR} --num-gpus 8 --disable-usage-stats
# each worker:
ray start --address=${MASTER_ADDR}:6379 --num-gpus 8
# submit with separate train/rollout sizing across the cluster:
ray job submit --address="http://127.0.0.1:8265" -- \
python3 train.py --actor-num-nodes 4 --actor-num-gpus-per-node 8 --rollout-num-gpus 32
MoE training uses Megatron expert, tensor, and pipeline parallelism, set through pass-through Megatron flags. The model definition comes from MODEL_ARGS; the example scripts group the parallel sizes under PERF_ARGS, so check the script you start from. Rollout scales by adding SGLang replicas behind the Router. Weight sync streams updated actor weights to the rollout pool and overlaps with the next step.
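Before submitting a multi-node job, it helps to check that the two pools fit the cluster. The arithmetic below is a minimal sketch that assumes the train and rollout pools occupy disjoint GPUs and that every node exposes the same GPU count; `total_gpus` and `nodes_needed` are hypothetical helpers, not slime commands.

```shell
#!/usr/bin/env bash
# Sketch: check that the requested train and rollout pools fit the Ray cluster.
# total_gpus and nodes_needed are hypothetical helpers, not part of slime.

# GPUs requested = actor nodes * GPUs per actor node + rollout GPUs.
total_gpus() {
  echo $(( $1 * $2 + $3 ))
}

# Nodes needed for a GPU count, rounding up (args: gpus, gpus_per_node).
nodes_needed() {
  echo $(( ($1 + $2 - 1) / $2 ))
}

ACTOR_NODES=4            # --actor-num-nodes
ACTOR_GPUS_PER_NODE=8    # --actor-num-gpus-per-node
ROLLOUT_GPUS=32          # --rollout-num-gpus
GPUS_PER_NODE=8          # matches --num-gpus 8 passed to `ray start`

TOTAL=$(total_gpus "$ACTOR_NODES" "$ACTOR_GPUS_PER_NODE" "$ROLLOUT_GPUS")
NODES=$(nodes_needed "$TOTAL" "$GPUS_PER_NODE")
echo "requested ${TOTAL} GPUs, needs ${NODES} nodes of ${GPUS_PER_NODE} GPUs"
```

With the values from the multi-node command above, the request comes to 64 GPUs, or eight 8-GPU nodes (one head plus seven workers).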
Inference¶
slime does not serve external traffic; rollout uses SGLang internally to generate trajectories (multi-turn capable via the custom generate function). BF16 training with FP8 rollout is supported to cut generation cost. For standalone serving of a trained GLM checkpoint, use the serving stack; for disaggregated serving concepts that mirror slime's train/rollout split, see disaggregated inference.
Fine-tuning¶
slime is an RL post-training framework (GRPO/PPO-style updates on Megatron), used to RL-tune GLM-class models on top of an SFT/instruct base. For the RL methods see GRPO and the post-training overview; warm-start from an SFT or LoRA checkpoint (see SFT and LoRA) before RL.
Optimised hardware¶
- Async weight sync (train → rollout) streams over the interconnect and overlaps the next training step; across pools it wants fast InfiniBand/RoCE with GDR (networking fabric/performance tuning).
- Megatron MoE training relies on NVLink/NVSwitch for intra-node expert/tensor parallel all-to-all; NCCL env (NCCL_IB_HCA, NCCL_NET_GDR_LEVEL=SYS, NCCL_NVLS_ENABLE) and ACS-off are prerequisites.
- FP8 rollout / Blackwell: FP8 generation on Blackwell (the Blackwell platform) lowers rollout cost; keep training in BF16 and verify reward parity.
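The NCCL variables named above are ordinary environment variables, so one way to apply them is to export them in the shell before `ray start`, on the assumption that Ray worker processes inherit that environment. The values below are placeholders to adapt per cluster, not settings recommended by the slime authors; in particular the `mlx5` HCA prefix is an assumption about the NIC naming.

```shell
#!/usr/bin/env bash
# Sketch: NCCL environment for the training job. All values are placeholders
# to adapt per cluster; none of them come from the slime repository.

# Assumed HCA name prefix; list the devices on your nodes with `ibv_devices`.
export NCCL_IB_HCA="${NCCL_IB_HCA:-mlx5}"
# Allow GPUDirect RDMA regardless of topology distance between GPU and NIC.
export NCCL_NET_GDR_LEVEL="${NCCL_NET_GDR_LEVEL:-SYS}"
# Request NVLink SHARP (NVLS); only useful where the hardware supports it.
export NCCL_NVLS_ENABLE="${NCCL_NVLS_ENABLE:-1}"

# Print the effective settings so they land in the job log.
env | grep '^NCCL_' | sort
```

Each export keeps any value already present in the environment, so a cluster-wide default is not overwritten by this script.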
Cookbook (common use cases)¶
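1) A GLM-class RL run (single node): start from an example script such as bash scripts/run-glm4-9B.sh, or use the single-node ray job submit command under How to use it.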
2) Custom rollout generation (multi-turn / agentic):
python3 train.py \
--rollout-function-path my_pkg.rollout.generate \
--custom-generate-function-path my_pkg.agent.loop \
--sglang-mem-fraction-static 0.8 --rollout-num-gpus 8
3) MoE training (multi-node, separate pools): source the MoE model script for MODEL_ARGS, set the expert/tensor parallel sizes, start the Ray head and workers, then run ray job submit ... -- python3 train.py --actor-num-nodes 4 --actor-num-gpus-per-node 8 --rollout-num-gpus 32.
Gotchas & failure modes¶
- SGLang flags need the --sglang- prefix: an un-prefixed SGLang flag is silently ignored or rejected; keep SGLang's own flag name after the prefix.
- Pool sizing: an under-provisioned rollout pool starves the train pool (and vice versa); tune --rollout-num-gpus against --actor-num-* and watch GPU idle time.
- Async staleness: too-stale rollouts hurt convergence; keep weight-sync cadence tight and monitor reward/entropy/KL for collapse (GRPO).
- Megatron checkpoint format: model args must match the converted Megatron checkpoint exactly, or load fails. Source the matching scripts/models/<model>.sh.
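The prefix gotcha can be caught before a job is submitted. The check below is a minimal sketch: `check_sglang_prefix` and `KNOWN_SGLANG_FLAGS` are hypothetical names, the list holds only the one SGLang flag mentioned on this page, and arguments written as `--flag=value` are not handled.

```shell
#!/usr/bin/env bash
# Sketch: fail fast when an SGLang flag is passed without the --sglang- prefix.
# check_sglang_prefix and KNOWN_SGLANG_FLAGS are hypothetical, not part of slime.

# SGLang flag names you use, space-separated; extend this list yourself.
KNOWN_SGLANG_FLAGS="--mem-fraction-static"

# Returns non-zero if any argument exactly matches a bare SGLang flag name.
check_sglang_prefix() {
  status=0
  for arg in "$@"; do
    for flag in $KNOWN_SGLANG_FLAGS; do
      if [ "$arg" = "$flag" ]; then
        echo "un-prefixed SGLang flag: $arg (use --sglang-${flag#--})" >&2
        status=1
      fi
    done
  done
  return "$status"
}

# Correctly prefixed arguments pass the check.
check_sglang_prefix --rollout-num-gpus 8 --sglang-mem-fraction-static 0.8 \
  && echo "flags look OK"
```

Running the check on the final argument list at the top of a launch script turns a silent misconfiguration into an immediate, readable error.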
References¶
- slime repo: https://github.com/THUDM/slime
- slime docs: https://thudm.github.io/slime/
- GLM-4.5 (ARC foundation models): https://arxiv.org/abs/2508.06471
- Anyscale — Open Source RL Libraries for LLMs: https://www.anyscale.com/blog/open-source-rl-libraries-for-llms
Related: RL libraries · OSS models · Post-training · GRPO · verl · SkyRL · Glossary