Muon optimizer and distributed Muon (DMuon)

Scope: the Muon optimizer (matrix-orthogonalization via Newton-Schulz) and the systems problem of running it in sharded distributed training. This page covers what Muon computes and why it beats element-wise AdamW on compute efficiency, why a naive distributed Muon is slow (the granularity mismatch with ZeRO/FSDP/TP), and how DMuon closes the gap to near-AdamW overhead with owner-centric execution, Gram-space kernels, and measured load balancing. It sits beside the distributed-training mechanics in FSDP, DeepSpeed/ZeRO, and tensor parallelism, and shares the trainer-side communication concerns of delta weight sync.

Muon and DMuon are current as of 2026; pin versions and validate convergence against a reference run before production use. The muon_ns.py example on this page was executed with its assertions passing (numpy only); the drop-in API snippet is a reference template.

flowchart TB
  G["Sharded gradient of matrix W (one shard per rank)"] --> MM{"Newton-Schulz needs the FULL matrix"}
  MM -->|"vanilla: gather-then-compute"| REDUN["Every rank materializes W and runs NS: redundant, >2x fwd+bwd"]
  MM -->|"DMuon: owner-centric"| OWNER["One owner rank per matrix reduces the gradient and runs NS once"]
  OWNER --> GRAM["Gram-space NS + symmetric kernels + autotune"]
  OWNER --> BAL["MILP load balance across owners"]
  GRAM --> PUB["Async publish updated W to consumers, overlapped with compute"]
  BAL --> PUB
  PUB --> NEAR["Near-AdamW per-step overhead (+2% avg)"]

What it is

Muon is a matrix-aware optimizer for the two-dimensional parameters of a model (the weight matrices of linear and attention projections). Where AdamW updates each element independently, Muon updates a whole matrix at once: for weight W with momentum-smoothed gradient M, the update is W <- W - eta * NS_k(M), where NS_k is k steps of a Newton-Schulz iteration that approximates the orthogonal polar factor U V^T of the SVD M = U S V^T (the rectangular analogue of the matrix sign function). Each step is X <- a*X + b*(X X^T) X + c*(X X^T)^2 X, applied to the Frobenius-normalized M, with coefficients (a, b, c) chosen to drive the singular values toward one. k = 5 in low precision is typical, because the update only needs singular values close to one with the singular vectors U and V preserved; it does not need an exact polar factor.
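
The update rule above can be sketched in a few lines of numpy. This is a simplified illustration, not a reference implementation: `muon_step`, the learning rate 0.02, and the momentum 0.95 are illustrative choices, and details that real Muon implementations add (Nesterov momentum, weight decay, shape-dependent scaling) are left out.

```python
import numpy as np

A, B, C = 3.4445, -4.7750, 2.0315  # quintic Newton-Schulz coefficients quoted on this page

def newton_schulz(M, steps=5):
    """Approximate the orthogonal factor U V^T of M with `steps` Newton-Schulz iterations."""
    X = M / (np.linalg.norm(M) + 1e-7)  # Frobenius norm >= spectral norm, so sigma_max(X) <= 1
    for _ in range(steps):
        G = X @ X.T
        X = A * X + B * (G @ X) + C * (G @ G @ X)
    return X

def muon_step(W, grad, momentum_buf, lr=0.02, beta=0.95):
    """One simplified Muon update (hypothetical helper): momentum, orthogonalize, step."""
    momentum_buf = beta * momentum_buf + grad   # momentum-smoothed gradient M
    update = newton_schulz(momentum_buf)        # NS_k(M), roughly U V^T
    return W - lr * update, momentum_buf

rng = np.random.default_rng(0)
grad = rng.standard_normal((6, 8))              # stand-in gradient for one weight matrix
W = rng.standard_normal((6, 8))
buf = np.zeros_like(W)

W_new, buf = muon_step(W, grad, buf)

# Every singular direction of the applied step has roughly the same magnitude (about lr).
step_singular_values = np.linalg.svd(W - W_new, compute_uv=False) / 0.02
print(np.round(step_singular_values, 2))
```

The point of the sketch is the last line: whatever the spectrum of the gradient, the applied step has singular values clustered near one times the learning rate, which is what distinguishes Muon from an element-wise rule.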

The distributed problem is a granularity mismatch. Sharded training (ZeRO, FSDP, tensor parallelism) partitions each matrix across ranks and runs the optimizer per shard, assuming an element-wise update rule. Newton-Schulz breaks that contract: X X^T couples all rows of the matrix, so the full matrix must be reconstructed on one device before the iteration can run. A naive "gather-then-compute" distributed Muon therefore pays two costs at every step: materialization (collective communication to rebuild each full matrix) and replicated orthogonalization (every rank runs the same Newton-Schulz). In practice that makes vanilla distributed Muon cost more than 2x the forward and backward passes combined.
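
A small numpy experiment makes the mismatch concrete. It is a sketch under simplifying assumptions: two simulated ranks each hold half the rows of one matrix, and `np.sign` stands in for a generic element-wise update rule. An element-wise rule gives the same answer whether it runs per shard or on the full matrix; Newton-Schulz does not.

```python
import numpy as np

A, B, C = 3.4445, -4.7750, 2.0315  # quintic Newton-Schulz coefficients quoted on this page

def newton_schulz(M, steps=5):
    X = M / (np.linalg.norm(M) + 1e-7)
    for _ in range(steps):
        G = X @ X.T
        X = A * X + B * (G @ X) + C * (G @ G @ X)
    return X

rng = np.random.default_rng(0)
M = rng.standard_normal((8, 16))

# Two simulated ranks, each holding four rows of the matrix (a row-sharded layout).
shards = np.split(M, 2, axis=0)

# Element-wise rule (np.sign as a stand-in): sharding does not change the result.
elementwise_full = np.sign(M)
elementwise_sharded = np.vstack([np.sign(s) for s in shards])

# Newton-Schulz: each shard only sees its own rows, so X X^T misses the cross-shard coupling.
ns_full = newton_schulz(M)
ns_sharded = np.vstack([newton_schulz(s) for s in shards])

mismatch = float(np.abs(ns_full - ns_sharded).max())
print("element-wise identical:", np.array_equal(elementwise_full, elementwise_sharded))
print("Newton-Schulz max abs mismatch:", round(mismatch, 3))
```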

DMuon removes those costs while preserving Muon's exact update. It assigns each matrix to a single owner rank that reduces the gradient, runs Newton-Schulz once, and publishes the result; it runs the iteration in Gram space with symmetric, autotuned kernels; and it balances owners by measured cost. The result is an average per-step overhead within +2% of AdamW.

Why use it

Compute efficiency. The Moonlight study reports roughly 2x the compute efficiency of AdamW under compute-optimal pretraining, and Muon has been adopted at frontier scale (Kimi-K2, DeepSeek-V4).²³
The systems tax is what stopped it. Muon's FLOP advantage is real, but naive distributed Muon spends it back on communication and redundant Newton-Schulz, sometimes exceeding the cost of the forward and backward passes; DMuon recovers the advantage in wall-clock.¹
Drop-in, exact. DMuon is a mathematically equivalent reformulation, not a new optimizer, so it inherits Muon's convergence exactly and installs as a module over stock FSDP2/HSDP/TP with no framework changes.¹
Biggest win where the optimizer dominates. For short-context workloads such as vision-language-action (embodied) training, the forward/backward phase is small, so optimizer overhead is hardest to amortize and DMuon's gains are largest (up to 3.01x end-to-end over vanilla Muon).¹

When to use it (and when not)

Use Muon for the 2D weight matrices of large transformers when compute efficiency matters; keep 1D parameters (norms, biases) on AdamW, and also the embedding and output-head matrices, which are 2D but are conventionally excluded from Muon. This is what Muon implementations do.
Use DMuon when Muon runs in a sharded distributed setup (FSDP/ZeRO/TP) and the optimizer step is a real fraction of wall-clock, especially short-context (VLA) training.
Skip DMuon's distributed machinery on a single GPU: there is no cross-rank redundancy to remove, so the benefit shrinks to the kernel-level speedup only.
Do not expect an algorithmic change. DMuon does not alter Muon's update or convergence; if Muon itself is wrong for the workload, DMuon will not fix that.
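
The routing rule in the first item can be sketched as a small helper. The function name `route_parameters`, the parameter names, and the substring test for embeddings and the output head are all assumptions made for illustration; real models name their parameters differently, so treat the matching rule as something to adapt rather than copy.

```python
def route_parameters(shapes):
    """Split parameter names into a Muon group and an AdamW group (hypothetical helper)."""
    muon, adamw = [], []
    for name, shape in shapes.items():
        is_matrix = len(shape) == 2
        # Embedding tables and the output head are 2D, so rank alone would misroute them.
        is_embedding_or_head = "embed" in name or "lm_head" in name
        if is_matrix and not is_embedding_or_head:
            muon.append(name)
        else:
            adamw.append(name)
    return muon, adamw

# Made-up shapes for a single transformer layer plus embedding and head.
shapes = {
    "embed_tokens.weight": (32000, 1024),
    "layers.0.attn.q_proj.weight": (1024, 1024),
    "layers.0.attn.q_proj.bias": (1024,),
    "layers.0.mlp.up_proj.weight": (4096, 1024),
    "layers.0.norm.weight": (1024,),
    "lm_head.weight": (32000, 1024),
}

muon_names, adamw_names = route_parameters(shapes)
print("Muon :", muon_names)
print("AdamW:", adamw_names)
```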

Architecture

The vanilla path is gather-then-compute: every rank all-gathers the full gradient of every matrix and runs the same Newton-Schulz, so redundant optimizer compute scales with the data-parallel width. DMuon turns this into an owner-side execution problem and runs one logical step in four phases: (1) forward materialization: owners publish their matrices to consumers just before each layer runs; (2) backward gradient routing: gradients are reduced to each matrix's owner, producing the same averaged full-matrix gradient a synchronous reference would use; (3) owner-side Muon update: each owner runs Newton-Schulz for its matrices while non-matrix parameters follow the host stack's AdamW; (4) asynchronous publication: updated weights are pushed back to consumers, overlapped with the next step's compute. A fine-grained owner layout (an XOR rule over the node-by-GPU mesh) spreads collectives across communication groups so broadcasts and reductions from different layers overlap without contention.
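
The difference between the two paths can be simulated in a single process. This is a toy model, not DMuon's implementation: ranks are list entries, collectives are plain Python sums, each simulated rank holds a full-matrix gradient from its own micro-batch, and the round-robin `owner` map is a placeholder for the measured-cost assignment. It shows only two things: the owner path runs Newton-Schulz D times less often, and it produces the same result.

```python
import numpy as np

A, B, C = 3.4445, -4.7750, 2.0315  # quintic Newton-Schulz coefficients quoted on this page

def newton_schulz(M, steps=5):
    X = M / (np.linalg.norm(M) + 1e-7)
    for _ in range(steps):
        G = X @ X.T
        X = A * X + B * (G @ X) + C * (G @ G @ X)
    return X

D = 4  # simulated data-parallel ranks
rng = np.random.default_rng(0)
shapes = {"q_proj": (6, 8), "k_proj": (6, 8), "up_proj": (8, 16)}  # made-up matrices
local_grads = [{n: rng.standard_normal(s) for n, s in shapes.items()} for _ in range(D)]

# Vanilla gather-then-compute: every rank rebuilds the averaged gradient and runs NS itself.
vanilla_ns_calls = 0
vanilla = []
for rank in range(D):
    out = {}
    for name in shapes:
        avg = sum(g[name] for g in local_grads) / D
        out[name] = newton_schulz(avg)
        vanilla_ns_calls += 1
    vanilla.append(out)

# Owner-centric: gradients are reduced to one owner, which runs NS once and publishes.
owner = {name: i % D for i, name in enumerate(shapes)}  # placeholder round-robin assignment
owner_ns_calls = 0
published = {}
for name, owner_rank in owner.items():
    avg = sum(g[name] for g in local_grads) / D         # stands in for reduce-to-owner
    published[name] = newton_schulz(avg)                # would run on owner_rank only
    owner_ns_calls += 1

print("NS calls, vanilla:", vanilla_ns_calls, "owner-centric:", owner_ns_calls)
```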

How to use it

In the paper's implementation, DMuon is three lines over a stock FSDP2 program (reference template; pin the version):

# Reference template (needs the dmuon package + PyTorch FSDP2). Preserves the optimizer protocol.
import dmuon
dmuon.dedicate_params(model, mesh)      # assign each matrix parameter to an owner rank
opt = dmuon.Muon(model)                 # matrices use Muon; 1D params fall back to AdamW

What that computes is the Newton-Schulz orthogonalization. The runnable model below was executed with its assertions passing. It checks three things: the raw normalized gradient is far from orthogonal (a baseline, so the later checks are not vacuous); NS_5 drives the gradient's singular values toward one (the orthogonalization Muon relies on); and the Gram-space recurrence DMuon uses is algebraically identical to the standard iteration, matching it to floating-point tolerance rather than approximating it:

# muon_ns.py — validated: Newton-Schulz orthogonalizes the gradient; the Gram form is exact. numpy only.
import numpy as np
a, b, c = 3.4445, -4.7750, 2.0315                    # standard Muon quintic coefficients

def ns(M, steps=5):                                  # X <- aX + b(XX^T)X + c(XX^T)^2 X
    X = M / (np.linalg.norm(M) + 1e-7)               # Frobenius-normalize so the spectral norm <= 1
    for _ in range(steps):
        G = X @ X.T
        X = a * X + b * (G @ X) + c * (G @ G @ X)
    return X

def ns_gram(M, steps=5):                             # DMuon: carry the recurrence in Gram space G = XX^T
    X = M / (np.linalg.norm(M) + 1e-7); G = X @ X.T
    for _ in range(steps):
        P = a * np.eye(G.shape[0]) + b * G + c * (G @ G)
        X, G = P @ X, P @ G @ P                       # G_{i+1} = P G P; O(m^3), no XX^T recompute
    return X

rng = np.random.default_rng(0)
M = rng.standard_normal((6, 8))                       # a weight-matrix gradient (m=6 < n=8)
spread = lambda X: float(np.abs(np.linalg.svd(X, compute_uv=False) - 1).max())   # distance of sing. values from 1
assert spread(M / np.linalg.norm(M)) > 0.9           # baseline: the raw gradient is far from orthogonal
assert spread(ns(M)) < 0.35                          # NS_5 drives singular values toward 1 (orthogonalizes)
assert np.allclose(ns(M), ns_gram(M), atol=1e-8)     # DMuon's Gram reformulation is EXACT, not approximate
print(f"singular-value spread: raw={spread(M/np.linalg.norm(M)):.2f} -> NS_5={spread(ns(M)):.2f}; gram==standard")
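
The `ns_gram` function above still multiplies the m x n matrix X once per step. One way to keep the loop entirely in m x m space, sketched here as an illustration rather than as DMuon's actual kernel, is to accumulate the per-step polynomials into a single m x m matrix Q and apply it to X once at the end. The function names and the dense FLOP model are assumptions; the model counts 2*p*q*r per matrix product and ignores symmetry.

```python
import numpy as np

A, B, C = 3.4445, -4.7750, 2.0315  # quintic Newton-Schulz coefficients quoted on this page

def ns_standard(M, steps=5):
    X = M / (np.linalg.norm(M) + 1e-7)
    for _ in range(steps):
        G = X @ X.T
        X = A * X + B * (G @ X) + C * (G @ G @ X)
    return X

def ns_gram_deferred(M, steps=5):
    """Iterate on m x m matrices only; touch the m x n matrix before and after the loop."""
    X0 = M / (np.linalg.norm(M) + 1e-7)
    G = X0 @ X0.T                       # the only m x m x n product before the loop
    eye = np.eye(G.shape[0])
    Q = eye.copy()                      # invariant: X_i = Q @ X0
    for _ in range(steps):
        P = A * eye + B * G + C * (G @ G)
        Q = P @ Q                       # m x m
        G = P @ G @ P                   # m x m, valid because P is a polynomial in G
    return Q @ X0                       # the only m x m x n product after the loop

def flops_standard(m, n, steps=5):
    return steps * (6 * m * m * n + 2 * m**3)

def flops_gram_deferred(m, n, steps=5):
    return 4 * m * m * n + steps * 8 * m**3

rng = np.random.default_rng(0)
M = rng.standard_normal((6, 8))
max_diff = float(np.abs(ns_standard(M) - ns_gram_deferred(M)).max())
ratio = flops_standard(1024, 4096) / flops_gram_deferred(1024, 4096)
print(f"max abs difference: {max_diff:.2e}; dense FLOP ratio at 1024x4096: {ratio:.2f}")
```

In this dense count the deferred form only wins when n is noticeably larger than m (at m = n it is slightly more expensive), which is one reason symmetric kernels that skip half of each m x m product matter.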

How to develop with it

Three design levers, in order of contribution to DMuon's optimizer-step speedup:¹

Gram-space, symmetric kernels (48%). Carrying the iteration as G_{i+1} = P G P with P = aI + bG + cG^2 keeps the work in the m x m Gram space, reducing the dominant cost from O(m^2 n) to O(m^3) when m < n; a SYRK-style kernel computes only the symmetric lower triangle, nearly halving the arithmetic. The Gram formulation and the Polar-Express coefficient set are adopted as defaults.⁴⁵
Owner scheduling and load balancing (32%). One owner per matrix removes the D-times redundant Newton-Schulz; owner assignment is a measured-cost MILP that minimizes the makespan across ranks (profiled once at init, since parameter shapes are fixed, with a greedy fallback above a search-space threshold).
Autotuning and batching (16%). Small matrices underfill the GPU, so DMuon groups them by shape and advances them through one batched Gram-NS iteration, and autotunes tile/pipeline schedules per shape into a persistent cache. Precision detail: the iteration runs in fp16 (three more mantissa bits than bf16 at equal tensor-core cost), with the update cast to fp32 master weights.
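
For the load-balancing lever, the sketch below shows one plausible greedy heuristic: longest-processing-time-first, which hands each matrix, heaviest first, to the currently least-loaded rank. The source only says that a greedy fallback exists, so the choice of this particular heuristic, the function name, and the per-matrix costs (made-up milliseconds) are assumptions; the MILP itself is not reproduced here.

```python
import heapq

def greedy_owner_assignment(costs, num_ranks):
    """Assign each matrix to a rank, heaviest first, always picking the least-loaded rank."""
    heap = [(0.0, rank) for rank in range(num_ranks)]   # (current load, rank id)
    heapq.heapify(heap)
    owner = {}
    for name, cost in sorted(costs.items(), key=lambda kv: -kv[1]):
        load, rank = heapq.heappop(heap)
        owner[name] = rank
        heapq.heappush(heap, (load + cost, rank))
    loads = [0.0] * num_ranks
    for name, rank in owner.items():
        loads[rank] += costs[name]
    return owner, loads

# Hypothetical measured Newton-Schulz cost per matrix, in milliseconds.
costs = {
    "up_proj": 9.0, "down_proj": 9.0,
    "q_proj": 4.0, "k_proj": 4.0, "v_proj": 4.0, "o_proj": 4.0,
}

owner, loads = greedy_owner_assignment(costs, num_ranks=2)
makespan = max(loads)

# A cost-blind split (first three matrices to rank 0, the rest to rank 1) for comparison.
names = list(costs)
naive_loads = [sum(costs[n] for n in names[:3]), sum(costs[n] for n in names[3:])]
print("greedy loads:", loads, "naive loads:", naive_loads)
```

The step time is set by the slowest owner, so the quantity to compare is the maximum load per rank, not the total.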

How to maintain it

DMuon changes systems overhead, not the update rule, so treat convergence as a fixed reference: the owner receives the same averaged full-matrix gradient a synchronous Muon would, so a DMuon run should match a reference Muon run step for step (verify on a small model before scaling). The owner assignment is computed once from the fixed parameter shapes, so it needs re-solving only when the model architecture changes. Because DMuon reads DTensor placement metadata at hook time rather than caching it, it tolerates PyTorch DTensor internals shifting across versions; still, re-run the equivalence check after an engine or DMuon upgrade. Keep non-matrix parameters (norms, biases, 1D tensors) on the host stack's AdamW path, and keep the fp16-NS / fp32-master precision split, since the Newton-Schulz dynamic range was validated to fit fp16.
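
The precision split can be illustrated on CPU with numpy, with the caveat that numpy's float16 arithmetic is only a rough proxy for tensor-core behavior. In this sketch the gradient is normalized in float64 before the cast, the iteration runs in float16, and the result is applied to float32 master weights; the `dtype` argument, the learning rate, and the drift measurement are illustrative choices, not part of DMuon's API.

```python
import numpy as np

A, B, C = 3.4445, -4.7750, 2.0315  # quintic Newton-Schulz coefficients quoted on this page

def newton_schulz(M, steps=5, dtype=np.float64):
    # Normalize in full precision first, then cast, so the cast input has spectral norm <= 1.
    X = (M / (np.linalg.norm(M) + 1e-7)).astype(dtype)
    for _ in range(steps):
        G = X @ X.T
        X = A * X + B * (G @ X) + C * (G @ G @ X)   # Python scalars keep the array dtype
    return X

rng = np.random.default_rng(0)
grad = rng.standard_normal((6, 8))
master_W = rng.standard_normal((6, 8)).astype(np.float32)   # fp32 master weights

update_fp16 = newton_schulz(grad, dtype=np.float16)          # low-precision iteration
update_ref = newton_schulz(grad, dtype=np.float64)           # high-precision reference

# Cast the update up before applying it, so the master weights never leave fp32.
master_W_new = master_W - np.float32(0.02) * update_fp16.astype(np.float32)

drift = float(np.abs(update_fp16.astype(np.float64) - update_ref).max())
print(update_fp16.dtype, master_W_new.dtype, f"max drift vs float64: {drift:.4f}")
```

A check of this shape, comparing a low-precision run against a high-precision reference on a small matrix, is also a cheap first step for the equivalence check recommended after an upgrade.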

How to run it in production

DMuon composes with FSDP2/HSDP and tensor parallelism in the same order as AdamW, with no source changes to the host framework: TP-sharded matrices get a second level of ownership (a TP owner assembles the full gradient, runs Gram-NS, and scatters slices back), and non-owner ranks hold zero-size placeholders so parameter-walking code (gradient clipping, PEFT) still works. Measured on A800-80GB (NVLink intra-node, 200 Gb/s InfiniBand inter-node, bf16, up to 256 GPUs), DMuon stays within +2% of AdamW end-to-end on average and delivers 1.48x-3.01x end-to-end and 6.85x-163x optimizer-step speedups over vanilla gather-then-compute Muon across robotics (Wall-OSS-0.5, Pi0, Wall-WM) and LLM (Qwen2.5-7B) workloads.¹ The irreducible remaining overhead is the Newton-Schulz time of the single largest owner-side matrix, which must run at least once per step; size expectations accordingly and rely on the async publish and forward/backward overlap to hide the rest behind compute.
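
The tensor-parallel ownership and placeholder behavior described above can be mimicked in numpy. Everything here is a single-process stand-in: column slices model TP shards, concatenation and splitting model the gather and scatter, and a zero-row array models the placeholder a non-owner rank holds. The variable names are hypothetical.

```python
import numpy as np

A, B, C = 3.4445, -4.7750, 2.0315  # quintic Newton-Schulz coefficients quoted on this page

def newton_schulz(M, steps=5):
    X = M / (np.linalg.norm(M) + 1e-7)
    for _ in range(steps):
        G = X @ X.T
        X = A * X + B * (G @ X) + C * (G @ G @ X)
    return X

tp_degree = 2
rng = np.random.default_rng(0)
full_grad = rng.standard_normal((6, 8))

# Each TP rank holds a column slice of the gradient.
tp_slices = np.split(full_grad, tp_degree, axis=1)

# The TP owner assembles the full gradient, orthogonalizes once, and scatters slices back.
assembled = np.concatenate(tp_slices, axis=1)
update = newton_schulz(assembled)
scattered = np.split(update, tp_degree, axis=1)

# Non-owner data-parallel ranks hold zero-size placeholders, so code that walks every
# parameter (here: a squared-norm reduction, as in gradient clipping) still runs unchanged.
params_seen_by_walker = {"owner_copy": update, "non_owner_placeholder": np.empty((0, 8))}
total_sq_norm = sum(float((p ** 2).sum()) for p in params_seen_by_walker.values())
print([s.shape for s in scattered], round(total_sq_norm, 4))
```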

Failure modes

Applying Muon to non-matrix or embedding parameters. Orthogonalization is ill-defined for 1D tensors (norms, biases), and embedding and output-head matrices, although 2D, are conventionally kept off Muon; route all of these to AdamW.
Naive gather-then-compute at scale. Materializing every matrix on every rank and running redundant Newton-Schulz can cost more than the forward and backward passes; use owner-side execution.
Owner load imbalance. Assigning matrices without a cost model lets the largest matrices pile on a few ranks and stall the step on stragglers; balance by measured cost.
Precision too low for NS. Running Newton-Schulz at lower precision than the validated fp16 setting degrades the orthogonalization; keep fp16 for the iteration and fp32 for the master weights.
Assuming DMuon changes convergence. It does not; if a Muon run diverges, the fix is in the optimizer or learning rate, not in DMuon's systems layer.

References

X Square Robot Team, DMuon: Efficient Distributed Muon Training with Near-Adam Overhead (arXiv 2606.27153): https://arxiv.org/abs/2606.27153
Jordan et al., Muon: An optimizer for hidden layers in neural networks: https://kellerjordan.github.io/posts/muon/
Liu et al., Muon is Scalable for LLM Training (Moonlight): https://arxiv.org/abs/2502.16982
Moonshot AI, Kimi K2 (production Muon at trillion scale): https://arxiv.org/abs/2507.20534
Zhang et al., Gram Newton-Schulz (hardware-aware NS): https://github.com/Dao-AILab/gram-newton-schulz
Amsel et al., The Polar Express: Optimal matrix sign methods for Muon (arXiv 2505.16932): https://arxiv.org/abs/2505.16932
Shi et al., Distributed Shampoo (owner-compute/all-gather matrix optimizer, arXiv 2309.06497): https://arxiv.org/abs/2309.06497
Wang et al., Canzona: asynchronous load-balanced distributed matrix optimizers (arXiv 2602.06079): https://arxiv.org/abs/2602.06079
Zhao et al., PyTorch FSDP (arXiv 2304.11277): https://arxiv.org/abs/2304.11277

1. DMuon (arXiv 2606.27153), X Square Robot Team: vanilla distributed Muon (gather-then-compute) costs >2x forward+backward; DMuon assigns one owner per matrix, runs Newton-Schulz once in Gram space with symmetric autotuned kernels, and balances owners with a measured-cost MILP; drop-in over FSDP2/HSDP/TP with no framework changes; +2% avg vs AdamW end-to-end, 1.48-3.01x end-to-end and 6.85-163x optimizer-step over vanilla on Wall-OSS-0.5/Pi0/Wall-WM/Qwen2.5-7B (A800-80GB, bf16); speedup breakdown: symmetric Gram kernel 48%, owner scheduling + load balancing 32%, autotuning + NS batching 16%; exact Muon semantics preserved.
2. Jordan et al., Muon applies a Newton-Schulz polar factor to the momentum-aggregated gradient of each weight matrix, matching Shampoo/SOAP quality with simpler state. https://kellerjordan.github.io/posts/muon/
3. Liu et al., Moonlight (Muon is Scalable for LLM Training): a 16B MoE trained on 5.7T tokens, reporting roughly 2x AdamW's compute efficiency. https://arxiv.org/abs/2502.16982
4. Zhang et al., Gram Newton-Schulz recasts each NS step as a recurrence in the Gram matrix X X^T: one symmetric product plus a polynomial in the smaller Gram matrix. https://github.com/Dao-AILab/gram-newton-schulz
5. Amsel et al., The Polar Express optimizes the per-step (a, b, c) coefficients for a fixed Newton-Schulz iteration count. https://arxiv.org/abs/2505.16932