
NeMo AutoModel: accelerated MoE fine-tuning on Transformers v5

Scope: fine-tuning Mixture-of-Experts models with NVIDIA NeMo AutoModel, the NeMo library whose NeMoAutoModelForCausalLM entry point subclasses HuggingFace Transformers v5's AutoModelForCausalLM and accelerates it with Expert Parallelism, DeepEP fused dispatch, and TransformerEngine kernels behind the unchanged from_pretrained() API. This page covers what the library adds over Transformers v5, the measured speedups and memory savings, the expert-backend and weight-loading machinery it builds on, and how to run it from a single node to a 128-GPU full fine-tune. It sits beside the general fine-tuning and post-training overview and SFT and LoRA, shares the expert-sharding ideas of expert parallelism for inference, and complements NeMo-RL, the RL member of the same NeMo family.

All benchmark numbers on this page are from NVIDIA's June 2026 blog post and were not reproduced here; re-verify on current releases before capacity planning. The torch snippets are unexecuted reference templates (as of 2026-07, verify against the repo). The numpy example is executed and asserted, including zero-token experts and unbalanced routing.

flowchart TB
  FP["NeMoAutoModelForCausalLM.from_pretrained()"] --> SUB{"Hand-tuned implementation<br/>for this architecture?"}
  SUB -->|"yes: Qwen3, Nemotron, GPT-OSS, DeepSeek V3"| FAST["TransformerEngine attention + linear<br/>+ custom expert kernels"]
  SUB -->|"no"| VAN["Vanilla HF modeling code<br/>+ Liger kernel patching"]
  FAST --> EP["Expert Parallelism on a dedicated moe_mesh<br/>DTensor Shard(0) over the expert dim"]
  EP --> DEEP["DeepEP fused all-to-all dispatch/combine<br/>overlapped with grouped-GEMM expert compute"]
  VAN --> MESH["device_mesh: FSDP2 data parallel"]
  DEEP --> MESH
  MESH --> CKPT["save_pretrained(): standard HF safetensors"]
  CKPT --> SERVE["vLLM / SGLang load the checkpoint unchanged"]

What it is

NeMo AutoModel is an open library in the NVIDIA NeMo framework for training generative models at scale. Its entry point, NeMoAutoModelForCausalLM, subclasses Transformers v5's AutoModelForCausalLM, so existing HF training code keeps working after a one-line import change. For popular MoE architectures (Qwen3, NVIDIA Nemotron, GPT-OSS, DeepSeek V3) it ships hand-tuned implementations with TransformerEngine attention, fused linear layers, and custom expert kernels; for everything else it falls back to the vanilla HF modeling code while still applying optimizations such as Liger kernel patching.¹

The library is a direct consumer of the MoE foundations Transformers v5 introduced: expert backends (eager, batched_mm, grouped_mm), dynamic weight loading through WeightConverter and WeightRenaming (MoE checkpoints stored as fused 3D tensors, converted on the fly during from_pretrained()), and first-class distributed execution via PyTorch DeviceMesh.² On top of v5 it adds the three things v5 does not have: Expert Parallelism as its own mesh dimension, DeepEP fused all-to-all dispatch, and TransformerEngine kernels. The headline result is 3.4 to 3.7x higher fine-tuning throughput and 29 to 32% lower peak GPU memory than the best native Transformers v5 configuration on single-node 30B MoE models, with the same from_pretrained() API.¹

Why use it

Throughput. On 8x H100 80GB (sequence length 4,096, local batch 1), Qwen3-30B-A3B goes from 3,075 TPS/GPU on v5 (FA2 + grouped_mm) to 11,340 TPS/GPU with AutoModel at EP=8, a 3.69x speedup; Nemotron 3 Nano 30B A3B goes from 4,583 to 15,421 TPS/GPU, 3.36x.¹
Memory. The same runs drop peak memory from 68.2 to 48.1 GiB (Qwen3, -29%) and from 62.1 to 42.5 GiB (Nemotron Nano, -32%), because EP=8 shards expert weights so each GPU holds one eighth of them: roughly 55 GiB of Nemotron Nano expert weights become about 6.8 GiB per GPU.¹
Scale that v5 cannot reach. A full fine-tune of Nemotron 3 Ultra 550B A55B (every parameter updated, Adam state materialized) runs on 16 H100 nodes (128 GPUs) at EP=64: 815 TPS/GPU, about 293 TFLOP/s/GPU, 58.2 GiB peak. Transformers v5 runs out of memory at this scale, so there is no v5 column to compare against.¹
No ecosystem exit tax. The weight conversions are fully reversible: save_pretrained() emits standard HF-format safetensors that vLLM, SGLang, and any other HF-compatible tool load unchanged (see weight loading in inference engines).¹
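The headline ratios can be re-derived from the raw figures quoted above. The sketch below only redoes that arithmetic; the numbers are the blog's and were not re-measured, and the helper names `speedup` and `memory_saving_pct` are illustrative rather than part of any library.

```python
# Figures copied from the list above (NVIDIA blog; not reproduced here).
runs = {
    "Qwen3-30B-A3B": {"tps": (3075, 11340), "gib": (68.2, 48.1)},
    "Nemotron 3 Nano 30B A3B": {"tps": (4583, 15421), "gib": (62.1, 42.5)},
}


def speedup(before: float, after: float) -> float:
    """Throughput ratio: AutoModel TPS/GPU over the v5 baseline."""
    return after / before


def memory_saving_pct(before: float, after: float) -> float:
    """Peak-memory reduction as a percentage of the v5 baseline."""
    return 100.0 * (before - after) / before


for name, r in runs.items():
    s = speedup(*r["tps"])
    m = memory_saving_pct(*r["gib"])
    print(f"{name}: {s:.2f}x TPS/GPU, {m:.1f}% lower peak memory")

# Expert footprint per GPU when ~55 GiB of expert weights is sharded at ep=8.
per_gpu_expert_gib = 55 / 8
print(f"expert weights per GPU at ep=8: {per_gpu_expert_gib:.3f} GiB")
```

The unrounded savings are 29.5% and 31.6%, which the text reports as 29% and 32%; 55 / 8 is 6.875 GiB, which the blog quotes as about 6.8 GiB, presumably because the 55 GiB input is itself rounded.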

When to use it (and when not)

Use it when fine-tuning MoE models across multiple GPUs, from a single node upward, and the architecture is in the hand-tuned coverage list; that is where the 3.4 to 3.7x speedup and the EP memory savings apply.
Use it when the checkpoint must stay consumable by the HF ecosystem afterwards; the reversible conversion is the point of building on v5 rather than beside it.
Fallback coverage is thinner. Unsupported architectures take the vanilla-HF-plus-Liger path: still compatible, but without TE kernels, DeepEP, or the headline speedups.
Mind the routing caveat. The published single-node numbers use a balanced routing gate that distributes dummy tokens uniformly across experts, emulating the near-uniform steady state a trained load-balancing loss converges to; v4/v5 columns run their native routers on the same dummy tokens. A poorly balanced real workload will see stragglers the benchmark deliberately excludes (see MoE routing and load balancing).¹
Dense small models on a single GPU gain little: EP and DeepEP address expert sharding and expert communication, which dense models do not have.

Architecture

Transformers v5 stores experts as fused 3D parameter tensors rather than v4's ModuleList of per-expert MLPs, and offers three expert backends: eager (a for-loop over selected experts, for debugging and compatibility), batched_mm (duplicates expert parameters into one torch.bmm, fast for small inputs with torch.compile), and grouped_mm (orders tokens by their assigned expert and executes a single grouped GEMM via torch.nn.functional.grouped_mm, the training backend: memory-efficient, no parameter duplication).² v5 also ships its own Expert Parallelism path (GroupedGemmParallel loads only local experts; RouterParallel routes tokens and combines with an all_reduce), built on the tensor-parallel machinery, where EP shares the device budget with data parallelism (ep x dp = world_size). For the single-node 30B benchmarks NVIDIA found plain data parallelism (dp=8, ep=1) to be the fastest v5 configuration, so that is the v5 baseline reported.¹
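To make the trade-off between backends concrete, here is a minimal NumPy model of the batched_mm idea next to an eager per-expert loop. It is a sketch of the technique as described above, not the Transformers implementation: the function names, shapes, and the use of `np.einsum` in place of `torch.bmm` are all assumptions made for illustration.

```python
import numpy as np


def eager_experts(x, w, expert_ids, gates):
    """Eager-style reference: loop over experts that received tokens."""
    out = np.zeros((x.shape[0], w.shape[2]), dtype=x.dtype)
    for e in np.unique(expert_ids):
        tok, slot = np.nonzero(expert_ids == e)  # token rows and top-k slots routed to e
        np.add.at(out, tok, gates[tok, slot][:, None] * (x[tok] @ w[e]))
    return out


def batched_experts(x, w, expert_ids, gates):
    """batched_mm-style: gather one weight matrix per (token, slot) pair,
    then run a single batched matmul. The gather duplicates parameters."""
    w_sel = w[expert_ids]                     # (n_tokens, top_k, d_in, d_out)
    y = np.einsum("td,tkdo->tko", x, w_sel)   # one batched product
    return (gates[..., None] * y).sum(axis=1), w_sel.nbytes


rng = np.random.default_rng(0)
n_experts, n_tokens, d_in, d_out, top_k = 8, 32, 16, 24, 2
x = rng.standard_normal((n_tokens, d_in))
w = rng.standard_normal((n_experts, d_in, d_out))
scores = rng.standard_normal((n_tokens, n_experts))
expert_ids = np.argsort(-scores, axis=1)[:, :top_k]
gates = rng.random((n_tokens, top_k))
gates /= gates.sum(axis=1, keepdims=True)

ref = eager_experts(x, w, expert_ids, gates)
out, gathered_bytes = batched_experts(x, w, expert_ids, gates)
assert np.allclose(ref, out, atol=1e-12)

duplication = gathered_bytes / w.nbytes
print("gathered weights / original weights:", duplication)  # 8.0
```

In this toy setup the gathered tensor holds n_tokens x top_k = 64 weight matrices against 8 originals, so the duplication grows with the token count. That is consistent with batched_mm being described as suited to small inputs and grouped_mm as the training backend.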

NeMo AutoModel makes EP its own parallelism dimension: a dedicated moe_mesh alongside, rather than carved out of, the data-parallel mesh, using PyTorch DTensor with Shard(0) over the expert dimension. Because the expert mesh is orthogonal to data parallelism, the two compose on the same devices: on 8 GPUs it runs ep=8 and dp=8 together, every GPU training on its own data shard while holding one eighth of the experts. On top of EP, DeepEP replaces separate all-gather/reduce-scatter collectives with fused token dispatch and combine kernels that overlap communication with expert computation (the training-side counterpart of the overlap techniques in comms-compute overlap); in NVIDIA's large-scale MoE benchmarks, DeepEP plus grouped GEMM cut cost per iteration by 47% on the full DeepSeek V3 671B model versus all-gather with looped experts.³ TransformerEngine contributes fused attention, linear, and RMSNorm kernels across all layer types. The progression is: v4 eager for-loop, then v5 grouped_mm, then AutoModel with DeepEP + grouped GEMM + TE.¹
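The difference between the two composition rules is easiest to see by enumerating layouts. The sketch below contrasts the ep x dp = world_size constraint with an orthogonal expert mesh. The function names are hypothetical, and the rank-to-EP-group mapping (`rank % ep_size`, contiguous expert slices) is an assumption chosen for illustration; the library's actual mesh ordering should be checked in the repo.

```python
def v5_native_layouts(world_size: int) -> list[tuple[int, int]]:
    """(ep, dp) pairs when EP is carved out of the device budget:
    ep * dp must equal world_size."""
    return [(ep, world_size // ep)
            for ep in range(1, world_size + 1) if world_size % ep == 0]


def orthogonal_layout(world_size: int, ep_size: int, n_experts: int) -> dict:
    """EP on its own mesh: every rank is a data-parallel rank AND owns a
    Shard(0)-style contiguous slice of the expert dimension."""
    assert world_size % ep_size == 0, "ep_size must divide world_size"
    assert n_experts % ep_size == 0, "ep_size must divide n_experts"
    per_rank = n_experts // ep_size
    layout = {}
    for rank in range(world_size):
        ep_rank = rank % ep_size  # assumed mapping, for illustration only
        layout[rank] = {
            "dp_rank": rank,
            "ep_rank": ep_rank,
            "experts": range(ep_rank * per_rank, (ep_rank + 1) * per_rank),
        }
    return layout


# Native v5 on 8 GPUs: EP and DP trade off against each other.
print(v5_native_layouts(8))  # [(1, 8), (2, 4), (4, 2), (8, 1)]

# Orthogonal mesh on 8 GPUs with 128 experts: dp=8 and ep=8 at once.
layout = orthogonal_layout(world_size=8, ep_size=8, n_experts=128)
for rank in (0, 7):
    e = layout[rank]["experts"]
    print(f"rank {rank}: dp_rank={layout[rank]['dp_rank']}, "
          f"experts {e.start}..{e.stop - 1}")
```

With 128 experts at ep=8, each rank holds 16 experts while still consuming its own data shard. That is the "ep=8 and dp=8 together" configuration described above.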

How to use it

Loading a model changes one import; scaling it to EP adds a distributed setup (reference template, unexecuted; as of 2026-07, verify on the repo):

# Reference template (needs nemo_automodel + torch with NCCL; run under torchrun).
import os
import torch
import torch.distributed as dist
from nemo_automodel import NeMoAutoModelForCausalLM
from nemo_automodel.recipes._dist_utils import create_distributed_setup_from_config

# Bind this process to its GPU before creating the NCCL process group.
torch.cuda.set_device(int(os.environ.get("LOCAL_RANK", 0)))
dist.init_process_group(backend="nccl")
torch.manual_seed(0)

dist_setup = create_distributed_setup_from_config({"strategy": "fsdp2", "ep_size": 8})
model = NeMoAutoModelForCausalLM.from_pretrained(
    "nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16",
    dtype=torch.bfloat16,
    distributed_setup=dist_setup,
)

That single call composes FSDP2 data parallelism with Expert Parallelism, TransformerEngine kernels, and DeepEP dispatch. Kernel choices are explicit through BackendConfig (reference template):

# Reference template: per-component backend selection.
from nemo_automodel.components.models.common.utils import BackendConfig

backend = BackendConfig(
    attn="te",            # TransformerEngine attention
    linear="te",          # TransformerEngine linear layers
    experts="torch_mm",   # grouped expert matmul
    dispatcher="deepep",  # DeepEP fused all-to-all
)

How to develop with it

The core mechanism worth understanding before touching backend knobs is grouped expert dispatch: sort token-expert pairs by expert id, run one GEMM per contiguous group, unsort, and gate-weight the combine. The following model of it is executed and asserted: grouped dispatch matches the eager per-expert loop to machine precision, an expert-parallel split over 4 simulated ranks recombines exactly, and the adversarial routing includes an expert that receives zero tokens (no GEMM issued) plus a heavily overloaded one:

# moe_dispatch.py, validated: grouped dispatch == eager per-expert loop, and an
# expert-parallel split of the same computation combines back to the reference.
import numpy as np


def eager_moe(x: np.ndarray, w: np.ndarray, expert_ids: np.ndarray,
              gates: np.ndarray) -> np.ndarray:
    """v4-style reference: loop tokens, loop selected experts."""
    n_tokens, top_k = expert_ids.shape
    out = np.zeros((n_tokens, w.shape[2]), dtype=x.dtype)
    for t in range(n_tokens):
        for k in range(top_k):
            out[t] += gates[t, k] * (x[t] @ w[expert_ids[t, k]])
    return out


def grouped_moe(x: np.ndarray, w: np.ndarray, expert_ids: np.ndarray,
                gates: np.ndarray) -> np.ndarray:
    """grouped_mm-style: sort token-expert pairs by expert, one matmul per
    contiguous group, unsort, weighted combine."""
    n_tokens, top_k = expert_ids.shape
    n_experts = w.shape[0]
    flat_e = expert_ids.reshape(-1)
    flat_t = np.repeat(np.arange(n_tokens), top_k)
    order = np.argsort(flat_e, kind="stable")
    xs = x[flat_t[order]]
    out_sorted = np.zeros((n_tokens * top_k, w.shape[2]), dtype=x.dtype)
    bounds = np.searchsorted(flat_e[order], np.arange(n_experts + 1))
    for e in range(n_experts):
        lo, hi = bounds[e], bounds[e + 1]
        if lo == hi:
            continue  # expert received zero tokens: no GEMM issued
        out_sorted[lo:hi] = xs[lo:hi] @ w[e]
    unsort = np.empty_like(order)
    unsort[order] = np.arange(order.size)
    contrib = out_sorted[unsort] * gates.reshape(-1)[:, None]
    out = np.zeros((n_tokens, w.shape[2]), dtype=x.dtype)
    np.add.at(out, flat_t, contrib)
    return out


def expert_parallel_moe(x: np.ndarray, w: np.ndarray, expert_ids: np.ndarray,
                        gates: np.ndarray, ep_size: int) -> np.ndarray:
    """EP simulation: shard experts Shard(0)-style across ep_size ranks; each
    rank computes only its local experts' contributions; combine by summation."""
    n_experts = w.shape[0]
    assert n_experts % ep_size == 0
    per_rank = n_experts // ep_size
    out = np.zeros((x.shape[0], w.shape[2]), dtype=x.dtype)
    for rank in range(ep_size):
        local = range(rank * per_rank, (rank + 1) * per_rank)
        mask = np.isin(expert_ids, list(local))
        masked_gates = np.where(mask, gates, 0.0)
        out += grouped_moe(x, w, expert_ids, masked_gates)
    return out


rng = np.random.default_rng(0)
n_experts, n_tokens, d_in, d_out, top_k = 8, 64, 16, 32, 2
x = rng.standard_normal((n_tokens, d_in))
w = rng.standard_normal((n_experts, d_in, d_out))

# Adversarial routing: expert 3 receives zero tokens, expert 0 is overloaded.
logits = rng.standard_normal((n_tokens, n_experts))
logits[:, 3] = -1e9
logits[:, 0] += 3.0
expert_ids = np.argsort(-logits, axis=1)[:, :top_k]
gates = rng.random((n_tokens, top_k))
gates /= gates.sum(axis=1, keepdims=True)

counts = np.bincount(expert_ids.reshape(-1), minlength=n_experts)
assert counts[3] == 0, "adversarial case not hit: expert 3 should be empty"
assert counts[0] == counts.max() and counts[0] > 2 * np.median(counts)

ref = eager_moe(x, w, expert_ids, gates)
out_grouped = grouped_moe(x, w, expert_ids, gates)
assert np.allclose(ref, out_grouped, atol=1e-12), np.abs(ref - out_grouped).max()

out_ep = expert_parallel_moe(x, w, expert_ids, gates, ep_size=4)
assert np.allclose(ref, out_ep, atol=1e-12), np.abs(ref - out_ep).max()

# EP footprint arithmetic from the blog: ~55 GiB of expert weights at ep=8.
assert abs(55 / 8 - 6.875) < 1e-9  # ~6.8 GiB per GPU

print("grouped == eager: max |diff| =", np.abs(ref - out_grouped).max())
print("EP(4 ranks) == eager: max |diff| =", np.abs(ref - out_ep).max())
print("expert token counts:", counts.tolist())
print("all assertions passed")

Output of the run: grouped == eager: max |diff| = 3.55e-15, EP(4 ranks) == eager: max |diff| = 3.55e-15, expert token counts: [61, 6, 11, 0, 15, 12, 11, 12], all assertions passed. The zero-token expert is exactly the case that deadlocked Transformers v4 under FSDP (see Failure modes); in grouped dispatch it is just an empty group that issues no GEMM and no collective.

When developing against the library, treat model coverage as data, not assumption: over 20 model types flow through v5's MODELS_REQUIRING_TENSOR_MERGING fused-tensor mechanism (Mixtral, Qwen2 MoE, Qwen3 MoE, DeepSeek V2/V3, OLMoE, and others), and AutoModel's hand-tuned list is separate and smaller.¹

How to maintain it

Pin and re-verify. The library, Transformers v5, TransformerEngine, and DeepEP move together; pin all four and re-run a short throughput benchmark on upgrade rather than trusting the June 2026 numbers to persist across releases.
Round-trip the checkpoint. After any version bump, save_pretrained() then reload both in AutoModel and in the serving engine (vLLM or SGLang); the reversibility of WeightConverter transforms is the contract that keeps the fleet HF-compatible.
Watch the fallback boundary. An architecture that silently drops from the hand-tuned path to vanilla-HF-plus-Liger keeps training but loses most of the speedup; log which path from_pretrained() selected and alert when a fleet model changes path.
Track expert balance in real fine-tunes. The benchmark's balanced gate is the best case; in production runs, per-expert token counts (as in the numpy example) are the early signal for EP stragglers.
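For the last point, a small load report is usually enough to spot stragglers early. The following is a sketch under stated assumptions: `expert_load_report` is a hypothetical helper, the max/mean imbalance ratio is one common choice of metric rather than something the library prescribes, and experts are assumed to be sharded in contiguous slices across EP ranks.

```python
import numpy as np


def expert_load_report(expert_ids: np.ndarray, n_experts: int, ep_size: int) -> dict:
    """Summarize routing balance per expert and per EP rank.

    expert_ids: integer array of routed expert indices (any shape).
    Assumes contiguous expert slices per EP rank.
    """
    assert n_experts % ep_size == 0
    counts = np.bincount(expert_ids.reshape(-1), minlength=n_experts)
    per_rank = counts.reshape(ep_size, n_experts // ep_size).sum(axis=1)
    return {
        "counts": counts,
        "per_rank": per_rank,
        "expert_imbalance": counts.max() / counts.mean(),   # 1.0 is perfectly balanced
        "rank_imbalance": per_rank.max() / per_rank.mean(),
        "idle_experts": int((counts == 0).sum()),
    }


# Rebuild the routing histogram printed by the numpy example on this page.
observed = np.array([61, 6, 11, 0, 15, 12, 11, 12])
expert_ids = np.repeat(np.arange(8), observed)

report = expert_load_report(expert_ids, n_experts=8, ep_size=4)
print("tokens per EP rank:", report["per_rank"].tolist())   # [67, 11, 27, 23]
print("expert imbalance:", report["expert_imbalance"])      # 3.8125
print("rank imbalance:", report["rank_imbalance"])          # 2.09375
print("idle experts:", report["idle_experts"])              # 1

# Uniform routing, the regime the published benchmark emulates.
balanced = expert_load_report(np.tile(np.arange(8), 16), n_experts=8, ep_size=4)
assert balanced["rank_imbalance"] == 1.0
```

Under this model, a rank imbalance of about 2.1 means the busiest EP rank processes roughly twice the mean token load, so the other ranks wait on it at every combine. Any alert threshold on this ratio is a local policy choice, not something the source specifies.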

Running it in production

The 550B recipe is the reference point for multi-node scale: 16x H100 nodes (128 GPUs), EP=64, local batch 2, sequence length 4,096, with Multi-Token Prediction, activation checkpointing, and fused linear cross-entropy enabled, on DeepEP dispatch + torch_mm experts + TransformerEngine kernels, landing at 815 TPS/GPU, roughly 293 TFLOP/s/GPU, and 58.2 GiB peak memory per GPU.¹ Points that matter operationally:

EP is what fits the model. At 550B full fine-tune the optimizer state alone rules out FSDP-only sharding on this hardware; the EP=64 expert sharding is the reason there is a run at all, not a tuning nicety.
Memory headroom is a feature. The 29 to 32% single-node savings convert directly into larger local batches or longer sequences at fixed hardware.
Deployment is the standard path. Checkpoints are HF safetensors; the vLLM deployment recipe applies unchanged, with no conversion step to own.
Throughput SLOs. The TPS/GPU and TFLOP/s/GPU figures quoted above are the calibration anchors for training-platform SLOs on comparable MoE fine-tunes; a large gap usually means the fallback path or a routing imbalance, not a hardware fault.

Failure modes

Transformers v4 MoE + FSDP deadlock. v4 stores Qwen3 experts as a ModuleList of 128 individually FSDP-wrapped MLPs and iterates only experts that received tokens; with different data per rank, ranks skip different experts, collectives mismatch, and training hangs indefinitely. v5's fused 3D expert tensors remove the per-expert collectives; this is the reason the v4 Qwen3 column in the benchmark reads "deadlock".¹
OOM without EP at scale. FSDP-only approaches run out of memory where EP fits: v5 has no 550B number because it OOMs, and per-GPU expert footprint drops 8x at ep=8.
Benchmark-to-production gap on routing. The balanced gate emulates ideal load; a fine-tune whose router is far from uniform will see expert stragglers and lower effective TPS than the published figures suggest.
Silent fallback path. Unsupported architectures still train, minus TE/DeepEP/EP gains; discovering this from a bill rather than a log is a process failure, not a library one.
Composition confusion. In v5's native EP, ep x dp = world_size (they share devices); in AutoModel, EP is orthogonal and composes with dp on the same GPUs. Sizing a job with the wrong mental model produces either idle GPUs or an over-subscribed mesh.
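The v4 deadlock in the first item can be modelled without any GPUs. The toy below treats each rank's forward pass as an ordered list of collective calls and flags a hang whenever ranks disagree. It is a deliberately simplified model with hypothetical function names, not a reproduction of FSDP internals.

```python
def v4_style_collectives(tokens_per_expert: list[int]) -> list[str]:
    """Per-expert wrapping: a rank issues a collective only for experts
    that received tokens on that rank."""
    return [f"all_gather(expert_{e})"
            for e, n in enumerate(tokens_per_expert) if n > 0]


def fused_style_collectives(tokens_per_expert: list[int]) -> list[str]:
    """Fused 3D expert tensor: one collective, independent of routing."""
    return ["all_gather(experts_fused)"]


def would_hang(per_rank_sequences: list[list[str]]) -> bool:
    """Collectives must be issued in the same order on every rank; any
    disagreement leaves at least one rank waiting forever."""
    return len({tuple(seq) for seq in per_rank_sequences}) > 1


# Two ranks with different data: rank 0 skips expert 1, rank 1 skips expert 2.
routing = {0: [3, 0, 2, 1], 1: [1, 2, 0, 1]}

v4_seqs = [v4_style_collectives(c) for c in routing.values()]
fused_seqs = [fused_style_collectives(c) for c in routing.values()]

print("v4 per-expert wrapping hangs:", would_hang(v4_seqs))   # True
print("fused expert tensor hangs:", would_hang(fused_seqs))   # False
```

The point of the model is that the hang depends on the data: the same v4 code runs cleanly whenever every rank happens to route to the same set of experts, which is why the failure can appear only after many steps.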

References

NVIDIA, Accelerating Transformers Fine-Tuning with NVIDIA NeMo AutoModel (June 2026): https://huggingface.co/blog/nvidia/accelerating-fine-tuning-nvidia-nemo-automodel
NeMo AutoModel repository: https://github.com/NVIDIA-NeMo/Automodel
HuggingFace, Transformers v5 release notes: https://huggingface.co/blog/transformers-v5
HuggingFace Transformers v5 documentation: https://huggingface.co/docs/transformers/v5.0.0/en/index
DeepSeek, DeepEP (fused expert-parallel all-to-all communication): https://github.com/deepseek-ai/DeepEP
NVIDIA TransformerEngine: https://github.com/NVIDIA/TransformerEngine

¹ NVIDIA NeMo AutoModel blog (June 2026): one-import upgrade over Transformers v5; 3.4-3.7x TPS/GPU and 29-32% peak-memory reduction on 8x H100 (seq 4096, local batch 1): Qwen3-30B-A3B 3,075 to 11,340 TPS/GPU (3.69x), 68.2 to 48.1 GiB; Nemotron 3 Nano 30B A3B 4,583 to 15,421 TPS/GPU (3.36x), 62.1 to 42.5 GiB; Nemotron 3 Ultra 550B A55B full fine-tune on 128 H100s at EP=64: 815 TPS/GPU, ~293 TFLOP/s/GPU, 58.2 GiB peak, v5 OOMs; balanced routing gate in NeMo AutoModel columns; v4 Qwen3 FSDP deadlock from per-expert collectives; ~55 GiB expert weights to ~6.8 GiB per GPU at ep=8; reversible WeightConverter transforms and standard HF checkpoints.
² Transformers v5: fused 3D expert tensors, expert backends (eager, batched_mm, grouped_mm via torch.nn.functional.grouped_mm), dynamic weight loading (WeightConverter/WeightRenaming), DeviceMesh in from_pretrained(), and a native EP path (GroupedGemmParallel + RouterParallel, ep x dp = world_size).
³ DeepEP fuses MoE token dispatch and combine into GPU kernels overlapping communication with expert compute; NVIDIA reports DeepEP + grouped GEMM cutting cost per iteration by 47% on DeepSeek V3 671B versus all-gather + looped experts.