NeMo-RL (NVIDIA)¶
Scope: NVIDIA's open-source RL post-training library in the NeMo stack: Ray-orchestrated, with FSDP2/Megatron training and vLLM rollouts, built for scalable, async, multi-turn RL with clearly-defined component interfaces.
Reference templates on real APIs; pin versions and validate before production use.
What it is¶
NeMo-RL (NVIDIA-NeMo/RL, ~v0.6.0 mid-2026) is a scalable toolkit for model reinforcement and post-training, from a single GPU to thousands. It decomposes the RL loop into composable, well-typed components: a Policy (the trainer), a Generation backend (rollouts), an Environment (reward/interaction), and a Cluster of Ray worker groups that hosts them. It is the systems sibling of the broader library survey in RL libraries, specialised to the NVIDIA training/inference stack.
- Training backends: PyTorch-native DTensor/FSDP2 (TP, SP, PP, CP) for smaller models, and Megatron Core ("6D" TP/PP/CP/SP/EP/FSDP) for large models and MoE.
- Generation: vLLM rollouts (SGLang and Megatron inference also available in newer releases).
- Orchestration: Ray, where every component is a Ray actor or worker group and the controller shuttles weights and rollouts between them.
- Algorithms: GRPO/GSPO/DAPO, SFT (with LoRA), DPO, reward modelling, and on-policy distillation; multi-turn RL for tool use and games.
Why use it¶
- NVIDIA-stack integration: Megatron Core, Transformer Engine FP8, and vLLM are first-class; end-to-end FP8 training (Megatron FP8 + FP8 vLLM generation) is supported.
- Scale: the same code path runs single-GPU prototypes and thousand-GPU jobs; Megatron's parallelisms handle the largest dense and MoE models (tensor parallelism, pipeline parallelism).
- Agentic / multi-turn: explicit Environment interfaces (math, code, reward-model, tool/game) make multi-turn rollouts and custom reward composition a supported extension point, not a fork.
- Async: supports asynchronous rollouts and replay buffers, including a fully asynchronous GRPO, decoupling generation from the optimizer step.
When to use it (and when not)¶
- Use NeMo-RL when the target hardware is the NVIDIA stack (Hopper/Blackwell), you need Megatron-scale parallelism for large/MoE models, or you want FP8 end-to-end and clean environment interfaces for agentic RL.
- Prefer verl for a tightly-coupled, throughput-first colocated design; slime for a minimal Megatron+SGLang async stack; SkyRL for flexible colocated-or-disaggregated research; TRL for single-node simplicity.
- Compare the full landscape in RL libraries.
Architecture¶
Every box below is a Ray actor or worker group. The controller owns the loop: it spawns the Policy and Generation groups, ships prompts into the rollout engine, collects rollouts plus rewards, and syncs updated weights back (synchronously, or asynchronously so generation never stalls the optimizer).
flowchart LR
subgraph Ctrl["Ray controller (cluster)"]
DRV["Driver / worker groups"]
end
subgraph Gen["Generation (rollout)"]
VLLM["vLLM engine"]
ENV["Environment (math / code / tool)"]
end
subgraph Train["Policy (trainer)"]
BK["DTensor FSDP2 or Megatron Core"]
end
DRV -.->|"spawn actors"| Gen
DRV -.->|"spawn actors"| Train
ENV -->|"prompts"| VLLM
VLLM -->|"rollouts + rewards"| Train
Train -.->|"weight sync (sync / async)"| VLLM
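The synchronous variant of the loop in the diagram can be sketched in plain Python. This is a toy under stated assumptions: ToyPolicy, ToyGeneration, and toy_reward are hypothetical stand-ins invented for illustration, not NeMo-RL classes, and the "weights" are reduced to a version counter so the sync step is visible.

```python
# Toy stand-ins for the boxes in the diagram; none of these are NeMo-RL classes.
from dataclasses import dataclass


@dataclass
class ToyPolicy:
    """Trainer stand-in: bumps a weight version on every update."""
    version: int = 0

    def train_step(self, rollouts, rewards):
        assert len(rollouts) == len(rewards)
        self.version += 1  # pretend an optimizer step happened
        return self.version


@dataclass
class ToyGeneration:
    """Rollout-engine stand-in: remembers which weight version it serves."""
    weights_version: int = 0

    def generate(self, prompts):
        # Tag every rollout with the weight version that produced it.
        return [(p, f"answer-to-{p}", self.weights_version) for p in prompts]

    def load_weights(self, version):
        self.weights_version = version


def toy_reward(rollout):
    """Environment stand-in: constant reward, only to keep the loop runnable."""
    return 1.0


def run_sync_loop(prompts, steps):
    policy, gen = ToyPolicy(), ToyGeneration()
    seen_versions = []
    for _ in range(steps):
        rollouts = gen.generate(prompts)              # prompts -> rollouts
        rewards = [toy_reward(r) for r in rollouts]   # environment scores them
        seen_versions.append(rollouts[0][2])
        new_version = policy.train_step(rollouts, rewards)
        gen.load_weights(new_version)                 # weight sync back to rollouts
    return seen_versions, policy.version, gen.weights_version


versions, trainer_v, rollout_v = run_sync_loop(["p0", "p1"], steps=3)
print(versions, trainer_v, rollout_v)
```

Because the sync happens every step, each batch is generated by exactly the weights the trainer last produced; an asynchronous variant would let generation run ahead and tolerate a bounded version lag instead.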
How to use it¶
NeMo-RL uses uv for environment management; examples are driven by a Python entrypoint plus a YAML config.
git clone https://github.com/NVIDIA-NeMo/RL.git nemo-rl && cd nemo-rl
uv venv && uv sync # pin to a release tag, e.g. git checkout v0.6.0
# GRPO on a math environment, DTensor/FSDP2 backend (default config)
uv run python examples/run_grpo.py # uses examples/configs/grpo_math_1B.yaml
# Megatron-backed GRPO (large models / MoE)
uv run python examples/run_grpo.py \
--config examples/configs/grpo_math_1B_megatron.yaml
Confirm exact example paths and config names against the repo tag you check out; the examples/ tree evolves between releases.
NeMo-RL is itself a fine-tuning / RL post-training tool. It implements the methods catalogued in fine-tuning and post-training: SFT (including LoRA), DPO, and the critic-free GRPO family used to reproduce DeepScaleR-style recipes. Algorithm choice is a config switch with matching examples/run_dpo.py / examples/run_sft.py entrypoints.
GRPO, the default recipe, is critic-free: it drops the value network and normalizes each sample's reward within its sampled group of G completions. That group-relative advantage is the signal the Policy optimizes, and the estimator is small enough to validate directly (numpy-only, no torch or NeMo-RL needed):
# GRPO group-relative advantages: the critic-free estimator NeMo-RL optimizes.
# Each prompt is sampled G times; a sample's advantage is its reward normalized
# within its own group. numpy-only (no torch / NeMo-RL needed).
import numpy as np


def grpo_advantages(rewards: np.ndarray, group_size: int, eps: float = 1e-8) -> np.ndarray:
    """Group-relative advantage: (r - group_mean) / (group_std + eps)."""
    assert rewards.ndim == 1 and rewards.size % group_size == 0
    groups = rewards.reshape(-1, group_size)
    adv = (groups - groups.mean(axis=1, keepdims=True)) / (groups.std(axis=1, keepdims=True) + eps)
    return adv.reshape(-1)


def grpo_reference(rewards, group_size, eps=1e-8):  # slow, explicit loops
    out = []
    for g in range(len(rewards) // group_size):
        chunk = rewards[g * group_size:(g + 1) * group_size]
        m = sum(chunk) / len(chunk)
        s = (sum((x - m) ** 2 for x in chunk) / len(chunk)) ** 0.5
        out.extend((x - m) / (s + eps) for x in chunk)
    return np.array(out)


# Group 0 has one correct sample; group 1 is degenerate (all identical).
r = np.array([1.0, 0.0, 0.0, 0.0, 2.0, 2.0, 2.0, 2.0])
adv = grpo_advantages(r, group_size=4)

# 1) Mean-centering: each group's advantages sum to ~0.
assert np.allclose(adv.reshape(-1, 4).sum(axis=1), 0.0, atol=1e-6)
# 2) Equivalence: vectorized == slow reference.
assert np.allclose(adv, grpo_reference(r, 4), atol=1e-9)
# 3) Adversarial edge: zero-variance group yields finite zeros, not NaN/inf.
assert np.all(np.isfinite(adv)) and np.allclose(adv[4:], 0.0, atol=1e-6)
# 4) Sign: the only correct sample gets positive advantage, the wrong ones negative.
assert adv[0] > 0 and np.all(adv[1:4] < 0)
print("GRPO advantages OK:", np.round(adv, 3).tolist())
How to integrate with it¶
The structured environment plus reward interface is what makes NeMo-RL suited to multi-turn/agentic RL beyond single-response preference tuning. The extension surface is the typed component interface: a custom Environment implements the environment contract (nemo_rl.environments.interfaces) to return rewards and next-turn observations, while the Policy and Generation backends are selected by config, not code.
Reference template (needs NeMo-RL installed; not runnable here). The validated numpy block that follows exercises the same reward math:
# Reference template: sketch of a custom environment (verify the exact ABC on your tag).
from nemo_rl.environments.interfaces import EnvironmentInterface


class MyToolEnv(EnvironmentInterface):
    def step(self, batch):
        # parse model output, run the tool, score the turn
        # score() is a placeholder for your own per-response scoring function
        rewards = [score(s) for s in batch.responses]
        return rewards, batch.next_observations  # multi-turn continues
The core of step() is turning a batch of responses into a reward vector; here is that math as a runnable, asserted numpy stand-in:
# Reward scoring for a verifiable environment: what EnvironmentInterface.step
# computes each turn. numpy-only stand-in for the reference template above.
import numpy as np


def score_batch(responses, targets):
    """Exact-match verifiable reward in {0.0, 1.0}, one scalar per response."""
    assert len(responses) == len(targets)
    return np.array([1.0 if r.strip() == t.strip() else 0.0
                     for r, t in zip(responses, targets)], dtype=float)


rewards = score_batch(["42", " 42 ", "41", ""], ["42", "42", "42", "42"])
# Happy path: exact and whitespace-normalized matches score 1, mismatches 0.
assert rewards.tolist() == [1.0, 1.0, 0.0, 0.0]
# Adversarial: an all-wrong batch is all-zero and finite (GRPO then sees no signal).
zero = score_batch(["x", "y"], ["a", "b"])
assert np.all(zero == 0.0) and np.all(np.isfinite(zero))
# Edge: empty batch returns an empty vector instead of crashing.
assert score_batch([], []).shape == (0,)
print("reward scoring OK:", rewards.tolist())
Then point the config at the registered environment and run it:
# my_config.yaml: point the GRPO run at a custom environment
# (illustrative key names; validate them against the configs on your tag)
env:
  type: my_tool_env  # resolves to the registered MyToolEnv above
loss:
  algorithm: grpo
generation:
  backend: vllm
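The multi-turn contract described in this section (each step returns rewards plus next-turn observations) can be illustrated with a toy rollout loop. Everything here is hypothetical and invented for the sketch: ToyToolEnv, toy_policy, the CALL/ANSWER text protocol, and the extra done flag are assumptions, not the NeMo-RL interface.

```python
def toy_policy(observation: str) -> str:
    """Stand-in for the generation backend: calls the tool first, then answers."""
    return "ANSWER 4" if "tool result" in observation else "CALL add 2 2"


class ToyToolEnv:
    """Hypothetical environment: one tool (add), reward 1.0 for the right final answer."""

    def step(self, responses):
        rewards, next_obs, done = [], [], []
        for text in responses:
            parts = text.split()
            if len(parts) == 4 and parts[0] == "CALL" and parts[1] == "add":
                result = int(parts[2]) + int(parts[3])
                rewards.append(0.0)                  # no reward for the tool call itself
                next_obs.append(f"tool result: {result}")
                done.append(False)                   # episode continues
            else:
                rewards.append(1.0 if text == "ANSWER 4" else 0.0)
                next_obs.append("")
                done.append(True)                    # any non-tool response ends the episode
        return rewards, next_obs, done


def rollout(env, policy, first_observation, max_turns=4):
    """Alternate policy and environment until the episode ends or max_turns is hit."""
    obs, total, turns = first_observation, 0.0, 0
    for _ in range(max_turns):
        response = policy(obs)
        rewards, next_obs, done = env.step([response])
        total += rewards[0]
        turns += 1
        if done[0]:
            break
        obs = next_obs[0]
    return total, turns


total_reward, num_turns = rollout(ToyToolEnv(), toy_policy, "What is 2 + 2?")
print(total_reward, num_turns)
```

The reward arrives only on the final turn here, which is the common shape for verifiable tool-use tasks; a per-turn shaping reward would simply return non-zero values from the tool branch.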
How to run it in production¶
Production concerns are the generation backend running inside the loop, numerical precision, and the fabric that carries weight sync. NeMo-RL is a training/post-training system, not a standalone serving stack, so plan the hand-off to a serving runtime separately.
Inference in the loop, and serving the checkpoint¶
Inference inside the loop is vLLM rollouts: generation is served by vLLM actors that the Policy syncs weights into each step (or every few steps, async). Newer NeMo-RL adds SGLang and Megatron inference, plus speculative decoding for rollout throughput. To serve the resulting checkpoint, export and run it under vLLM/SGLang/Dynamo per inference serving/serving open-weight models/disaggregated inference.
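When weights are synced only every few steps, or asynchronously, rollouts can be produced by weights that lag the trainer. One common way to bound that lag is to tag each rollout with the weight version that generated it and drop anything too stale. The sketch below assumes a hypothetical StalenessBuffer with a max_lag parameter; it illustrates the idea and is not NeMo-RL's replay buffer.

```python
from collections import deque


class StalenessBuffer:
    """Hypothetical buffer that drops rollouts older than max_lag weight versions."""

    def __init__(self, max_lag: int):
        self.max_lag = max_lag
        self._items = deque()

    def put(self, rollout, weights_version: int):
        self._items.append((weights_version, rollout))

    def get_batch(self, trainer_version: int, batch_size: int):
        """Return up to batch_size rollouts within max_lag; discard staler ones."""
        batch = []
        while self._items and len(batch) < batch_size:
            version, rollout = self._items.popleft()
            if trainer_version - version <= self.max_lag:
                batch.append(rollout)
            # else: too stale for the current weights, dropped
        return batch


buf = StalenessBuffer(max_lag=1)
buf.put("r-old", weights_version=0)
buf.put("r-mid", weights_version=2)
buf.put("r-new", weights_version=3)
fresh = buf.get_batch(trainer_version=3, batch_size=8)
print(fresh)
```

A larger max_lag keeps the trainer fed when generation is slow, at the cost of training on more off-policy data.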
FP8 and optimised hardware¶
- FP8 end-to-end: Megatron Core FP8 training plus FP8 vLLM generation on Hopper/Blackwell Transformer Engine; validate convergence vs bf16 before committing (the Blackwell platform, performance tuning).
- NVLink/NVSwitch: keep a TP group inside one node's NVLink domain (intra-node weight-sync and tensor-parallel collectives ride NVLink); use Ray placement-group packing to enforce locality (Ray).
- InfiniBand/RoCE + GDR: inter-node Megatron PP/CP/EP and the trainer-to-rollout weight broadcast want GPUDirect RDMA. Set NCCL_IB_HCA and NCCL_NET_GDR_LEVEL=SYS, optionally NCCL_NVLS_ENABLE; confirm [GDRDMA] appears in the NCCL_DEBUG=INFO output; keep ACS off (networking fabric, performance tuning).
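A small preflight check can catch a missing fabric setting before a multi-node job starts. The sketch below only inspects the environment variables named above and searches captured NCCL_DEBUG=INFO output for the GDRDMA marker. The helper names, the choice of which variables count as required, the mlx5_0 device name, and the sample log lines are all assumptions for illustration.

```python
import os

# Variables named in the list above; treating these two as required is this sketch's assumption.
REQUIRED = ("NCCL_IB_HCA", "NCCL_NET_GDR_LEVEL")


def missing_nccl_vars(env=None):
    """Return the required NCCL variables that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED if not env.get(name)]


def log_shows_gdrdma(log_text: str) -> bool:
    """True if any line of captured NCCL_DEBUG=INFO output mentions GDRDMA."""
    return any("GDRDMA" in line for line in log_text.splitlines())


# Illustrative inputs only; real NCCL log lines differ by version and topology.
sample_env = {"NCCL_IB_HCA": "mlx5_0"}
sample_log = "NCCL INFO Channel 00 : 0 -> 8 via NET/IB/0/GDRDMA\nNCCL INFO Connected all rings"

print(missing_nccl_vars(sample_env))
print(log_shows_gdrdma(sample_log))
```

Run the log check against output captured from inside a Ray worker, since that is where a missing RDMA device mapping would show up.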
Monitoring¶
Log per-step reward, entropy, KL, and generation throughput; wire the metrics to observability. A healthy GRPO run keeps entropy from collapsing and KL bounded, so a sudden entropy drop or KL spike is the earliest signal that a run has gone off the rails.
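The two health signals named above can be computed from quantities a training step already has. The following numpy sketch assumes you can obtain per-token probabilities and the log-probabilities of sampled tokens under the policy and a reference model. The function names, the moving-window rule, and its thresholds are illustrative choices, not NeMo-RL metrics.

```python
import numpy as np


def mean_token_entropy(probs: np.ndarray) -> float:
    """Mean Shannon entropy (nats) over rows of a [tokens, vocab] probability matrix."""
    p = np.clip(probs, 1e-12, 1.0)  # avoid log(0)
    return float(-(p * np.log(p)).sum(axis=1).mean())


def sampled_kl(logp_policy: np.ndarray, logp_ref: np.ndarray) -> float:
    """Monte Carlo estimate of KL(policy || ref) from log-probs of policy-sampled tokens."""
    return float((logp_policy - logp_ref).mean())


def entropy_collapsed(history, window=3, drop=0.5):
    """Flag when the latest entropy falls below drop x the mean of the preceding window."""
    if len(history) <= window:
        return False
    baseline = float(np.mean(history[-window - 1:-1]))
    return history[-1] < drop * baseline


uniform = np.full((3, 4), 0.25)   # maximally uncertain over a 4-token vocabulary
one_hot = np.eye(4)[:3]           # fully collapsed distribution
print(round(mean_token_entropy(uniform), 4), round(mean_token_entropy(one_hot), 4))
print(entropy_collapsed([1.40, 1.39, 1.41, 0.30]))
```

The simple mean-difference KL estimator is unbiased but can be noisy or negative on small batches, so it is worth smoothing over several steps before alerting on it.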
How to maintain it¶
- Pin and reproduce: check out a release tag (e.g. v0.6.0) and let uv venv && uv sync lock the dependency set; the examples/ entrypoints and config keys drift across tags, so a pinned tag plus its README is what keeps a recipe reproducible.
- Upgrade deliberately: on a version bump, re-read the release notes, re-run a small GRPO smoke config, and diff config keys before scaling back up.
- Checkpoint hygiene: keep resumable checkpoints so a preempted multi-node job restarts from the last step rather than from zero, and export the final checkpoint for the serving runtime.
- Re-baseline precision: whenever you change FP8/bf16 settings or the Transformer Engine version, re-baseline convergence, since FP8 numerics can shift reward curves (performance tuning).
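Diffing config keys across tags, as suggested in the upgrade step above, is easy to automate once both configs are loaded into nested dictionaries. A minimal sketch follows; the two example configs and the renamed key are hypothetical and do not describe any actual NeMo-RL release.

```python
def flatten_keys(cfg, prefix=""):
    """Collect dotted paths of all leaf keys in a nested dict."""
    keys = set()
    for name, value in cfg.items():
        path = f"{prefix}.{name}" if prefix else name
        if isinstance(value, dict) and value:
            keys |= flatten_keys(value, path)
        else:
            keys.add(path)
    return keys


def diff_config_keys(old_cfg, new_cfg):
    """Return (removed, added) dotted key paths between two config versions."""
    old_keys, new_keys = flatten_keys(old_cfg), flatten_keys(new_cfg)
    return sorted(old_keys - new_keys), sorted(new_keys - old_keys)


# Hypothetical configs: a key group is renamed between two tags.
old = {"policy": {"megatron": {"tensor_model_parallel_size": 8}}, "cluster": {"num_nodes": 4}}
new = {"policy": {"megatron_cfg": {"tensor_model_parallel_size": 8}}, "cluster": {"num_nodes": 4}}

removed, added = diff_config_keys(old, new)
print("removed:", removed)
print("added:", added)
```

A non-empty removed list is the signal to stop and read the release notes before launching at scale, since a renamed key may otherwise be ignored or rejected at startup.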
How to scale it¶
Scaling is multi-node Ray plus the chosen training parallelism. Run on a Ray cluster: locally, via KubeRay on Kubernetes, or on Slurm.
- Multi-node: start a Ray head node and join workers; NeMo-RL places Policy/Generation worker groups across them. Config sets the cluster shape (GPUs per node, node count).
- Megatron parallelism: set TP/PP/CP/EP in the config to fit large/MoE models across nodes; this is the path beyond what FSDP2 alone covers (tensor parallelism, pipeline parallelism, DeepSpeed and ZeRO).
- Disaggregated/async generation: non-colocated vLLM lets the rollout pool and trainer pool scale independently; async GRPO overlaps them.
# config sketch: Megatron parallelism + cluster shape (validate keys on your tag)
policy:
  megatron:
    tensor_model_parallel_size: 8
    pipeline_model_parallel_size: 2
    context_parallel_size: 1
cluster:
  num_nodes: 4
  gpus_per_node: 8
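The arithmetic behind the config sketch above is worth checking before launch: the product of TP, PP, and CP must divide the number of training GPUs, and whatever is left over becomes the data-parallel size. The helper below is a hypothetical sanity check that assumes every GPU in the cluster is used for training and ignores expert parallelism; it is not part of NeMo-RL.

```python
def parallel_layout(num_nodes, gpus_per_node, tp, pp, cp=1):
    """Derive world size and data-parallel size from a cluster shape and TP/PP/CP."""
    world = num_nodes * gpus_per_node
    model_parallel = tp * pp * cp
    if world % model_parallel != 0:
        raise ValueError(
            f"TP*PP*CP={model_parallel} does not divide world size {world}"
        )
    return {
        "world_size": world,
        "model_parallel_size": model_parallel,
        "data_parallel_size": world // model_parallel,
        # TP groups that fit evenly in a node can stay inside its NVLink domain.
        "tp_fits_in_node": tp <= gpus_per_node and gpus_per_node % tp == 0,
    }


layout = parallel_layout(num_nodes=4, gpus_per_node=8, tp=8, pp=2, cp=1)
print(layout)
```

For the values in the sketch, 32 GPUs with a model-parallel size of 16 leave two data-parallel replicas, and TP=8 fills exactly one 8-GPU node, so tensor-parallel collectives stay intra-node.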
Cookbook (common use cases)¶
1. GRPO on a math environment (FSDP2, single node):
uv run python examples/run_grpo.py # default config: examples/configs/grpo_math_1B.yaml
2. Megatron-backed GRPO for a large/MoE policy (multi-node):
uv run python examples/run_grpo.py \
--config examples/configs/grpo_math_1B_megatron.yaml \
cluster.num_nodes=4 cluster.gpus_per_node=8
3. Multi-turn / agentic env via a custom EnvironmentInterface:
# my_config.yaml: point the GRPO run at a custom environment
env:
  type: my_tool_env  # resolves to the registered MyToolEnv above
loss:
  algorithm: grpo
generation:
  backend: vllm
Failure modes¶
- Backend mismatch to model size: FSDP2/DTensor for small/medium; reach for Megatron Core for large dense and MoE, or hit OOM / poor MFU (performance tuning).
- Zero-variance GRPO groups: if every sampled completion for a prompt is equally right (or equally wrong), the group's reward variance is zero, so group-relative advantages collapse to zero and that prompt contributes no gradient. Mix difficulty and check that reward is neither saturated nor floored (GRPO).
- FP8 surprises: FP8 rollouts/training can shift reward curves; baseline against bf16 and watch entropy/KL for collapse (fine-tuning and post-training, observability).
- Under-provisioned rollout pool: async GRPO still starves the trainer if generation can't keep up; size vLLM actors against step time (RL libraries).
- Weight-sync over TCP: if RDMA isn't exposed into Ray workers, the trainer-to-rollout broadcast falls back to TCP and stalls; validate [GDRDMA] in the NCCL_DEBUG=INFO output (Ray).
- Config/path drift: examples/ entrypoints and config keys change across tags; pin a release and read its README/docs.
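The zero-variance failure mode above is cheap to track per step: count the share of prompt groups whose sampled rewards are all identical. The sketch assumes rewards arrive as a flat vector ordered group by group. The function name and tolerance are illustrative, not a NeMo-RL metric.

```python
import numpy as np


def zero_variance_fraction(rewards, group_size: int, tol: float = 1e-8) -> float:
    """Fraction of groups whose reward std is ~0, i.e. groups that give no GRPO gradient."""
    flat = np.asarray(rewards, dtype=float)
    assert flat.ndim == 1 and flat.size % group_size == 0
    groups = flat.reshape(-1, group_size)
    return float((groups.std(axis=1) <= tol).mean())


# Two groups of four: one informative group, one saturated group.
rewards = [1.0, 0.0, 0.0, 0.0, 2.0, 2.0, 2.0, 2.0]
frac = zero_variance_fraction(rewards, group_size=4)
print(f"{frac:.2f} of groups carry no learning signal")
```

A fraction that climbs toward 1.0 over training usually means the prompt set has become too easy or too hard for the current policy, which is the cue to re-mix difficulty.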
References¶
- NeMo-RL repo: https://github.com/NVIDIA-NeMo/RL
- NeMo-RL docs: https://docs.nvidia.com/nemo/rl/latest/index.html
- NVIDIA blog, NeMo-RL reproducing DeepScaleR with GRPO: https://developer.nvidia.com/blog/reinforcement-learning-with-nvidia-nemo-rl-reproducing-a-deepscaler-recipe-using-grpo/
- NVIDIA blog, NeMo-RL Megatron-Core support for training throughput: https://developer.nvidia.com/blog/reinforcement-learning-with-nvidia-nemo-rl-megatron-core-support-for-optimized-training-throughput/
- Anyscale, Open Source RL Libraries for LLMs: https://www.anyscale.com/blog/open-source-rl-libraries-for-llms
- GRPO (DeepSeekMath): https://arxiv.org/abs/2402.03300 · DeepSeek-R1: https://arxiv.org/abs/2501.12948
Related: RL Libraries · Ray · Fine-tuning · GRPO · Distributed Training · Tensor Parallel · Pipeline Parallel · TRL · Glossary