
Mixture-of-experts: sparse scaling

Scope: why sparse MoE decouples total parameters from per-token FLOPs, covering the routing math, shared vs routed experts, and the memory-to-fit vs compute-to-run tradeoff that governs how you size and serve these models.

What it is

A Mixture-of-Experts (MoE) layer replaces the single feed-forward (MLP) block of a transformer with many parallel expert MLPs plus a small gating network. For each token, the gate scores the experts and routes the token to only the top-k of them. The token's activation is computed by those k experts (their outputs combined by the gate weights); the rest stay idle for that token. The attention sublayers remain dense and shared.
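
In symbols, one common formulation of the routing step looks like the following. This is a sketch under stated assumptions: the names $W_g$ (gate weights), $E_i$ (expert $i$), $N$ (number of experts), and the choice to renormalize with a softmax over only the selected experts are illustrative, and real models differ in how they score and normalize.

```latex
\begin{aligned}
s &= W_g\, x \in \mathbb{R}^{N}, \qquad \mathcal{T} = \operatorname{TopK}(s,\, k), \\
g_i &= \frac{\exp(s_i)}{\sum_{j \in \mathcal{T}} \exp(s_j)} \quad \text{for } i \in \mathcal{T}, \\
y &= \sum_{i \in \mathcal{T}} g_i \, E_i(x).
\end{aligned}
```

Only the $k$ experts in $\mathcal{T}$ are evaluated, so the cost of the layer tracks $k$ rather than $N$.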

The consequence is a deliberate split between two parameter counts:

Total parameters: every expert's weights, summed. Sets the memory you must hold resident.
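Active (activated) parameters per token: the weights a single token actually touches, meaning the dense parts (attention, any always-on shared experts) plus the k routed experts the gate selects. Sets the FLOPs (and therefore latency and cost) of a forward pass.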

Because the gate picks a fixed k regardless of how many experts exist, you can grow total parameters (and model capacity) while active parameters, and per-token FLOPs, stay roughly constant. Sparse MoE models contrast with dense LLMs (the classic GPT-series), where every parameter is active for every token.¹
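
The split between the two counts can be made concrete with a little arithmetic. The sketch below is illustrative only: the function name `moe_param_counts`, the toy dimensions, and the simplification of counting just attention and expert MLP weights (no embeddings, norms, or gate) are all assumptions, not a description of any real model.

```python
def moe_param_counts(n_layers, d_model, d_expert, n_routed, n_shared, top_k):
    """Return (total, active) parameter counts for a toy MoE transformer.

    Counts only attention and expert MLP weights; embeddings, norms, and the
    gate itself are ignored for brevity.
    """
    attn = 4 * d_model * d_model        # Q, K, V, O projections (dense, always active)
    expert = 3 * d_model * d_expert     # gated MLP: up, gate, and down matrices
    total_per_layer = attn + (n_shared + n_routed) * expert
    active_per_layer = attn + (n_shared + top_k) * expert
    return n_layers * total_per_layer, n_layers * active_per_layer


if __name__ == "__main__":
    # Grow the routed pool while keeping top_k fixed (toy dimensions).
    for n_routed in (8, 64, 256):
        total, active = moe_param_counts(
            n_layers=32, d_model=4096, d_expert=2048,
            n_routed=n_routed, n_shared=1, top_k=8,
        )
        print(f"{n_routed:>3} routed experts: "
              f"total={total / 1e9:6.1f}B  active={active / 1e9:5.1f}B")
```

Under these assumptions the active count is identical for every pool size, while the total grows linearly with the number of routed experts.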

DeepSeek-V3 is the canonical reference. It has 671B total parameters with 37B activated per token: each MoE layer holds 1 shared expert and 256 routed experts, and the gate activates 8 routed experts per token, so 9 experts run per token (the shared one plus the 8 routed).³ The book's Chapter 1 describes the same architecture and rounds total parameters to "~680 billion."² Where the two differ, the technical report's 671B is the precise figure; the book's "~680B" is a rounding, not a contradiction.

Shared vs routed experts

Routed experts are the sparse pool. The gate selects a subset per token. This is where capacity scales: 256 routed experts in DeepSeek-V3, only 8 fired per token.
Shared expert(s) run for every token, unconditionally. They capture common knowledge that would otherwise be redundantly learned across many routed experts, letting the routed pool specialize. DeepSeek-V3 uses 1 shared expert.³

flowchart LR
  T["Token activation"] --> G["Gate / router"]
  T --> S["Shared expert<br/>(always active)"]
  G -->|"top-8 of 256"| R1["Routed expert i"]
  G -->|"top-8 of 256"| R2["Routed expert j"]
  R1 --> C["Weighted combine"]
  R2 --> C
  S --> C
  C --> O["Output to next layer"]
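
The data flow in the diagram can be sketched for a single token in plain Python. This is a minimal illustration, not DeepSeek-V3's implementation: the softmax over the selected logits, the choice to add the shared expert's output unweighted, and the toy "experts" built by the hypothetical helper `make_scaling_expert` are all assumptions made for the example.

```python
import math


def top_k_indices(scores, k):
    """Indices of the k largest scores, best first."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]


def moe_forward(x, gate_weights, routed_experts, shared_experts, k):
    """One token through a toy MoE layer: shared experts plus top-k routed experts."""
    # Gate logits: one dot product per routed expert.
    logits = [sum(w * xi for w, xi in zip(row, x)) for row in gate_weights]
    chosen = top_k_indices(logits, k)

    # Softmax over the selected logits only, so the k gate weights sum to 1.
    peak = max(logits[i] for i in chosen)
    exps = {i: math.exp(logits[i] - peak) for i in chosen}
    norm = sum(exps.values())
    gates = {i: e / norm for i, e in exps.items()}

    out = [0.0] * len(x)
    for expert in shared_experts:        # always run (assumed unweighted here)
        out = [o + y for o, y in zip(out, expert(x))]
    for i in chosen:                     # only k of the routed experts run
        y = routed_experts[i](x)
        out = [o + gates[i] * yi for o, yi in zip(out, y)]
    return out, chosen, gates


def make_scaling_expert(scale):
    """Hypothetical stand-in for an expert MLP: multiplies the input by a constant."""
    return lambda x: [scale * xi for xi in x]


if __name__ == "__main__":
    x = [1.0, 2.0]
    gate_weights = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.5, 0.5]]  # 4 routed experts
    routed = [make_scaling_expert(s) for s in (1.0, 2.0, 3.0, 4.0)]
    shared = [make_scaling_expert(10.0)]
    out, chosen, gates = moe_forward(x, gate_weights, routed, shared, k=2)
    print("chosen experts:", chosen)
    print("gate weights:", gates)
    print("output:", out)
```

The two routed experts that lose the top-k selection are never called, which is the whole source of the compute saving.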

Why it matters

Two separate scaling walls, decoupled:

Compute to run. Dense scaling ties FLOPs/token to total parameters, so doubling capacity doubles inference cost. Sparse MoE breaks that link: a 100-expert model that fires 2 experts per token costs roughly what a small dense model costs per token, while holding the capacity of a much larger one.⁶ Fewer active parameters per token also means lower request latency than a dense model of equal total size.¹ Google's Switch Transformer (1.6T-parameter MoE, 2021) reached the accuracy of a dense baseline at a fraction of the compute and trained ~7x faster.⁴
Memory to fit. Sparsity does not save memory. Every expert must be resident even though most are idle for any given token, because at scale (across many concurrent tokens and users) effectively all experts are active simultaneously and contend for GPU resources.⁶ DeepSeek-V3's full 671B weights have to live in HBM regardless of the 37B active figure. This is the central tradeoff: sparsity buys you cheap compute, not cheap memory. You still need enough aggregate GPU memory to hold the entire parameter set, which is what forces expert parallelism across many GPUs (see Expert Parallelism for Inference).

This is why MoE is the practical path toward multi-trillion-parameter models. The book frames a dense 100T-parameter model as needing on the order of 1.2 x 10^29 FLOPs to train and ~182 TB just to hold the weights in 16-bit (100T parameters x 2 bytes is 200 TB in decimal units, or about 182 TiB); sparsity cuts the compute per token, but the memory-to-fit wall remains and must be solved with parallelism.⁵
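
A back-of-envelope calculation shows the two walls side by side for the DeepSeek-V3 figures quoted above. Treat it as a sketch: the 80 GB of HBM per GPU is an assumed value for illustration, the "2 FLOPs per active parameter per token" rule is a common approximation that ignores attention-score cost, and the GPU counts cover weights only (no KV cache, activations, or runtime overhead).

```python
def weight_bytes(n_params, bytes_per_param):
    """Bytes needed to hold the weights alone."""
    return n_params * bytes_per_param


def min_gpus_to_fit(n_params, bytes_per_param, hbm_bytes_per_gpu):
    """Smallest GPU count whose combined HBM holds the weights."""
    need = weight_bytes(n_params, bytes_per_param)
    return -(-need // hbm_bytes_per_gpu)   # ceiling division on integers


def forward_flops_per_token(active_params):
    """Rule of thumb: about 2 FLOPs per active parameter per token."""
    return 2 * active_params


if __name__ == "__main__":
    TOTAL = 671 * 10**9          # total parameters
    ACTIVE = 37 * 10**9          # activated parameters per token
    HBM_PER_GPU = 80 * 10**9     # assumed 80 GB per GPU, illustration only

    # Memory to fit: driven by TOTAL.
    for label, width in (("16-bit", 2), ("8-bit", 1)):
        size_gb = weight_bytes(TOTAL, width) / 1e9
        gpus = min_gpus_to_fit(TOTAL, width, HBM_PER_GPU)
        print(f"{label}: {size_gb:.0f} GB of weights, at least {gpus} GPUs just to fit")

    # Compute to run: driven by ACTIVE.
    sparse = forward_flops_per_token(ACTIVE)
    dense = forward_flops_per_token(TOTAL)
    print(f"forward FLOPs/token: {sparse:.2e} sparse vs {dense:.2e} "
          f"for a dense model of equal size ({dense / sparse:.1f}x)")

    # The 100T-parameter dense case at 16-bit.
    dense_100t = weight_bytes(100 * 10**12, 2)
    print(f"100T params at 16-bit: {dense_100t / 1e12:.0f} TB = {dense_100t / 2**40:.0f} TiB")
```

The memory line depends only on the total count and the precision; the compute line depends only on the active count. Neither calculation says anything about measured throughput.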

When it is needed (and when not)

Reach for sparse MoE when:

You need capacity beyond what a dense model can afford to run: high parameter count for quality, but a per-token compute and latency budget that a same-size dense model would blow.
You have the aggregate GPU memory and a high-bandwidth fabric (NVLink/NVSwitch intra-node, InfiniBand/RoCE inter-node) to host all experts and absorb the all-to-all routing traffic.⁶

Avoid or reconsider when:

The model already fits comfortably on your GPUs and per-token cost is acceptable; a dense model is simpler (no routing, no all-to-all, no load-balancing machinery).
Your fabric is weak. The all-to-all dispatch/combine at every MoE layer is communication-heavy and becomes the bottleneck if interconnect bandwidth is low.⁶
Memory is your binding constraint and you cannot scale out across enough GPUs; sparsity will not rescue you, because it saves compute, not footprint.

How: implement, integrate, maintain

You rarely implement MoE routing yourself; you configure a serving engine that shards experts across GPUs via expert parallelism (EP). The sizing rule is: total parameters dictate how many GPUs you need to fit the model; active parameters and routing dictate how you tune for latency and balance.

vLLM

Expert parallelism is enabled with --enable-expert-parallel. The EP group size is computed automatically as EP_SIZE = TP_SIZE x DP_SIZE; without the flag, MoE layers fall back to tensor parallelism over the same TP x DP group, like a dense model. The flag has no effect unless TP_SIZE x DP_SIZE > 1.¹⁰

# Single-node MoE serving with expert parallelism over 8 GPUs.
vllm serve deepseek-ai/DeepSeek-V3 \
  --tensor-parallel-size 1 \
  --data-parallel-size 8 \
  --enable-expert-parallel
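
For planning purposes, the EP group size and the resulting experts-per-GPU count follow directly from the formula above. The helper below is a hypothetical planning aid, not part of vLLM: the name `ep_layout` is invented here, and the requirement that experts divide evenly across ranks is a simplifying assumption of the sketch (how an engine actually places, pads, or replicates experts is not modeled).

```python
def ep_layout(tp_size, dp_size, n_routed_experts):
    """Return (ep_size, experts_per_rank) under EP_SIZE = TP_SIZE x DP_SIZE."""
    ep_size = tp_size * dp_size
    if n_routed_experts % ep_size != 0:
        # Simplifying assumption of this sketch: even sharding only.
        raise ValueError(
            f"{n_routed_experts} experts do not divide evenly across {ep_size} ranks"
        )
    return ep_size, n_routed_experts // ep_size


if __name__ == "__main__":
    # The single-node example: TP=1, DP=8, 256 routed experts per MoE layer.
    ep_size, per_rank = ep_layout(tp_size=1, dp_size=8, n_routed_experts=256)
    print(f"EP_SIZE={ep_size}, routed experts per rank={per_rank}")
```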

To even out hot experts, vLLM exposes the Expert Parallel Load Balancer:

# Add EPLB to redistribute expert-to-rank mappings under skewed routing.
vllm serve deepseek-ai/DeepSeek-V3 \
  --data-parallel-size 8 \
  --enable-expert-parallel \
  --enable-eplb

For large-scale EP, select an all-to-all backend with --all2all-backend (e.g. deepep_low_latency, deepep_high_throughput).¹⁰

SGLang

SGLang sets EP size with --ep (commonly --tp and --ep equal, with dp=1, to use all GPUs as one EP group), enables the load balancer with --enable-eplb, and selects MoE communication/compute backends with --moe-a2a-backend and --moe-runner-backend:¹¹

# SGLang DeepSeek serving: 8-way TP, 8-way EP, EPLB on.
python -m sglang.launch_server \
  --model-path deepseek-ai/DeepSeek-V3 \
  --tp 8 --ep 8 \
  --enable-eplb \
  --moe-a2a-backend deepep

Routing, balance, and maintenance

Gating width. Top-2 gating is the common production default: better quality and more even load than top-1, at the cost of running two experts per token and roughly doubling the dispatch/combine traffic. Top-1 is faster but risks lower quality and uneven load.⁷
Capacity factor. Serving systems cap tokens per expert per batch, typically at 1.2–1.5x average load; overflow tokens spill to a second-choice expert or a second pass. This bounds the straggler effect, where one overloaded ("hot") expert stalls the layer because all experts must finish before the layer proceeds.⁸
Hot-expert replication. Full-featured servers can replicate a hot expert onto multiple GPUs to split its load, at the cost of extra GPU memory.⁸ This is the inference-side complement to DeepSeek's training-time load balancing.
Training-time balance. DeepSeek-V3 uses an auxiliary-loss-free load-balancing strategy (a per-expert bias added to gate affinity scores, nudged down for overloaded and up for underloaded experts) to keep routing even without the quality hit of a heavy auxiliary loss.³ Google's GLaM established the earlier approach of load-balancing losses plus gating noise.⁷
Validate the win. Profile interconnect (NVLink/NIC) traffic and Tensor Core utilization before and after every change; an overloaded all-to-all or a silent imbalance erases the sparsity benefit.⁹ These are reference configurations grounded in vendor docs and the book; none of the throughput figures here were measured on local hardware.
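
The capacity-factor cap described above can be sketched as follows. This is a toy model under stated assumptions: the function names are invented for illustration, tokens are admitted in arrival order, and overflow is simply returned to the caller rather than re-routed to a second-choice expert or a second pass.

```python
import math


def expert_capacity(n_tokens, n_experts, top_k, capacity_factor):
    """Max token slots per expert: capacity_factor times the average load."""
    avg_load = n_tokens * top_k / n_experts
    return math.ceil(capacity_factor * avg_load)


def dispatch_with_capacity(assignments, n_experts, capacity):
    """Fill experts in arrival order; return (accepted per expert, overflow)."""
    accepted = [[] for _ in range(n_experts)]
    overflow = []
    for token_id, expert_id in assignments:
        if len(accepted[expert_id]) < capacity:
            accepted[expert_id].append(token_id)
        else:
            overflow.append((token_id, expert_id))   # caller must re-route these
    return accepted, overflow


if __name__ == "__main__":
    # 8 tokens, 4 experts, top-1 routing, skewed toward expert 0.
    assignments = [(0, 0), (1, 0), (2, 0), (3, 0), (4, 0), (5, 1), (6, 2), (7, 3)]
    cap = expert_capacity(n_tokens=8, n_experts=4, top_k=1, capacity_factor=1.25)
    accepted, overflow = dispatch_with_capacity(assignments, n_experts=4, capacity=cap)
    print("capacity per expert:", cap)
    print("accepted:", accepted)
    print("overflow:", overflow)
```

Capping the hot expert bounds how long the layer waits on it; the price is that overflow tokens need somewhere else to go.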

For how experts are physically sharded and the all-to-all communication cost, see Expert Parallelism for Inference and MoE Routing & Load Balancing.
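
The bias-based balancing mentioned under training-time balance can also be sketched in a few lines. The source only states that a per-expert bias is added to the gate affinity scores and nudged down for overloaded experts and up for underloaded ones; everything else here is an assumption for illustration, including the fixed step size, the use of mean load as the threshold, the once-per-batch update, and the function names.

```python
def select_experts(affinity, bias, k):
    """Pick top-k experts by (affinity + bias); the bias affects selection only."""
    ranked = sorted(
        range(len(affinity)), key=lambda i: affinity[i] + bias[i], reverse=True
    )
    return ranked[:k]


def update_bias(bias, load, step_size):
    """Nudge bias down for overloaded experts and up for underloaded ones."""
    mean_load = sum(load) / len(load)
    new_bias = []
    for b, n in zip(bias, load):
        if n > mean_load:
            new_bias.append(b - step_size)
        elif n < mean_load:
            new_bias.append(b + step_size)
        else:
            new_bias.append(b)
    return new_bias


if __name__ == "__main__":
    # Four tokens that all prefer expert 0 on raw affinity (toy numbers).
    tokens = [[0.9, 0.6, 0.5], [0.8, 0.7, 0.4], [0.9, 0.5, 0.6], [0.7, 0.6, 0.65]]
    bias = [0.0, 0.0, 0.0]
    for step in range(5):
        load = [0, 0, 0]
        for affinity in tokens:
            for i in select_experts(affinity, bias, k=1):
                load[i] += 1
        print(f"step {step}: load={load} bias={[round(b, 2) for b in bias]}")
        bias = update_bias(bias, load, step_size=0.1)
```

Because the bias only shifts which experts are selected, the gate weights used to combine outputs can still come from the unbiased affinities, which is what lets this approach avoid an auxiliary loss term.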

References

Chris Fregly, AI Systems Performance Engineering (O'Reilly). Ch. 1 (sparse MoE decouples FLOPs/token from total parameters; DeepSeek-V3 ~680B/37B active, 1 shared + 8 of 256 routed; Switch Transformer 1.6T, ~7x faster; 100T-parameter memory/compute wall). Ch. 15 (expert parallelism, all-to-all, capacity factor 1.2–1.5x, top-1 vs top-2 gating, hot-expert replication, hybrid parallelism, GLaM load-balancing).
DeepSeek-V3 Technical Report, arXiv:2412.19437 (671B total / 37B active, 1 shared + 256 routed experts, top-8 routing, auxiliary-loss-free load balancing) — https://arxiv.org/abs/2412.19437
deepseek-ai/DeepSeek-V3, Hugging Face model card — https://huggingface.co/deepseek-ai/DeepSeek-V3
vLLM, Expert Parallel Deployment (--enable-expert-parallel, EP_SIZE = TP_SIZE x DP_SIZE, --enable-eplb, --all2all-backend) — https://docs.vllm.ai/en/latest/serving/expert_parallel_deployment/
SGLang, Expert Parallelism (--ep, --enable-eplb, --moe-a2a-backend, --moe-runner-backend) — https://github.com/sgl-project/sglang/blob/main/docs/advanced_features/expert_parallelism.md

¹ Fregly, Ch. 1: "Sparse models like MoEs activate only parts of the model for each input token... the FLOPS per token stays roughly constant even as total parameters grow," and "since these models use fewer active parameters during inference, request-response latencies are typically much lower."
² Fregly, Ch. 1: "~680 billion total parameters with about 37 billion active per input token... an MoE with 1 shared expert and 8 selected experts out of 256 per token, which yields about 9 active experts."
³ DeepSeek-V3 Technical Report, arXiv:2412.19437: "671B total parameters with 37B activated for each token"; "1 shared expert and 256 routed experts"; "8 experts will be activated for each token"; "auxiliary-loss-free load balancing strategy" introducing a per-expert bias term on affinity scores.
⁴ Fregly, Ch. 1: Google's Switch Transformer "1.6-trillion-parameter MoE model achieved the same accuracy as a dense model with only a fraction of the computation... trained 7x faster."
⁵ Fregly, Ch. 1: a dense 100T-parameter model "would require on the order of 1.2 x 10^29 floating point operations" and "approximately 182 TB of GPU memory... if each parameter is stored in 16-bit."
⁶ Fregly, Ch. 15: "expert parallelism allows the total model capacity to scale almost linearly with the number of GPUs... each GPU handles only a fraction of the tokens"; "at scale... all of the experts will likely be active simultaneously... contending for all GPU resources concurrently."
⁷ Fregly, Ch. 15: "Modern MoE inference often uses top-2 gating"; top-1 "is faster, it can lead to a lower model quality and uneven load"; "Google's GLaM introduced load-balancing losses (and gating noise)."
⁸ Fregly, Ch. 15: "capacity factor parameter, typically set at 1.2–1.5x the average token load"; the straggler effect where "an overloaded expert... will stall the inference pipeline"; "replicate hot experts onto multiple GPUs to split the load... at a cost of additional GPU memory."
⁹ Fregly, Ch. 15: "always verify interconnect traffic, Tensor Core utilization, and other performance-enhancing mechanisms before and after making each change."
¹⁰ vLLM Expert Parallel Deployment docs: "Enable EP by setting the --enable-expert-parallel flag"; "EP_SIZE = TP_SIZE x DP_SIZE"; --enable-eplb; --all2all-backend options including deepep_low_latency, deepep_high_throughput.
¹¹ SGLang Expert Parallelism docs: example launch uses --tp 8 --ep 8; --enable-eplb activates the EPLB; --moe-a2a-backend and --moe-runner-backend select communication and compute backends.