MoE routing and expert load balancing¶
Scope: the gating/router (top-k selection), why uneven expert load wastes GPUs, and the techniques that smooth it: the auxiliary load-balancing loss, capacity factor and token dropping, and inference-time expert rebalancing via redundant-expert placement (EPLB). For how experts are physically sharded and the all-to-all cost, see expert parallelism for MoE inference; for the sparse-scaling rationale see Mixture-of-Experts: sparse scaling.
The three core algorithms this page teaches (top-k gating, capacity-factor token dropping, and the straggler / EPLB replica-placement math) are each expressed as a runnable numpy block that is executed and asserted with the system
python3(numpy 2.4.6), including edge and adversarial cases. Thevllm serveinvocation and the DeepSeekrebalance_expertscall require vLLM / torch (not installed here); they are labelled reference templates, and the core math each one relies on is validated in the numpy blocks that accompany them.
What it is¶
A Mixture-of-Experts (MoE) layer replaces a single feed-forward block with many parallel expert sub-networks plus a lightweight router (gating network). For each token, the router scores every expert, then selects the top-k (the active experts). Only those k experts run; the rest are skipped. The router's gating weights mix the selected experts' outputs back into the residual stream.2
Top-k is the lever: top-1 minimizes communication and compute but degrades quality and worsens load skew; top-2 is the common production compromise for quality, speed, and balanced load. Modern models push k higher with a shared expert always on: DeepSeek-R1 has 256 routed experts and selects the top 9 per token including 1 shared expert.2 DeepSeek-V3/R1 is roughly 671-685B parameters but activates only about 37B per token.28
The router is the source of all downside risk: because expert selection is data-dependent and learned, the realized token distribution across experts is almost never uniform.1
The gating computation itself is small and exact, so it is worth pinning down. The router takes per-expert logits, applies a softmax (DeepSeek-V3 uses a sigmoid affinity, but the top-k-then-renormalize shape is the same), keeps the k highest gates, and renormalizes those k gates to sum to 1 before mixing (vLLM's renormalize=True).4 This numpy block reproduces that select-experts path and is executed and asserted, with the top-1 boundary (weight collapses to 1.0), the dense limit (k equals the expert count recovers the full softmax), an exact-tie case, and equivalence to a slow scalar reference over 200 random shapes:
# Top-k gating (the router), validated (system python3, numpy). Run: python3 router.py
# Reproduces vLLM's select_experts: score every expert, take top-k, renormalize the
# selected gates to sum to 1, mix expert outputs into the residual by those weights.
import numpy as np
def softmax(x, axis=-1):
x = x - np.max(x, axis=axis, keepdims=True) # stabilize
e = np.exp(x)
return e / np.sum(e, axis=axis, keepdims=True)
def route_topk(logits, k, renormalize=True):
"""logits[token, expert] -> (topk_ids[token,k], topk_w[token,k]).
Gate is softmax over ALL experts; then keep the top-k and (optionally)
renormalize the kept gates to sum to 1 (vLLM `renormalize=True`)."""
n, e = logits.shape
assert 1 <= k <= e, (k, e)
gates = softmax(logits, axis=1) # full-softmax affinity
ids = np.argsort(-gates, axis=1, kind="stable")[:, :k] # top-k per token
w = np.take_along_axis(gates, ids, axis=1)
if renormalize:
w = w / np.sum(w, axis=1, keepdims=True)
return ids, w
def moe_forward(logits, expert_out, k, renormalize=True):
"""expert_out[expert, token, d] -> combined[token, d] = sum_k w * expert_out."""
ids, w = route_topk(logits, k, renormalize)
n, d = expert_out.shape[1], expert_out.shape[2]
out = np.zeros((n, d), np.float64)
for t in range(n):
for j in range(k):
out[t] += w[t, j] * expert_out[ids[t, j], t]
return out, ids, w
def route_topk_ref(logits, k, renormalize=True):
"""Slow scalar reference (different control flow) for equivalence checks."""
n, e = logits.shape
ids = np.zeros((n, k), np.int64)
w = np.zeros((n, k), np.float64)
for t in range(n):
g = softmax(logits[t:t + 1], axis=1)[0]
order = sorted(range(e), key=lambda i: (-g[i], i))[:k] # stable by index
s = sum(g[i] for i in order)
for j, i in enumerate(order):
ids[t, j] = i
w[t, j] = g[i] / s if renormalize else g[i]
return ids, w
rng = np.random.default_rng(0)
# happy path: top-2 of 8 experts, renormalized gates sum to 1 per token
logits = rng.normal(size=(32, 8))
ids, w = route_topk(logits, k=2)
assert ids.shape == (32, 2) and w.shape == (32, 2)
assert np.all(ids >= 0) and np.all(ids < 8) # valid expert ids
assert np.allclose(w.sum(axis=1), 1.0) # renormalized
assert np.all(np.diff(w, axis=1) <= 1e-12) # sorted high->low
for t in range(32): # top-k are the k largest gates
g = softmax(logits[t:t + 1], axis=1)[0]
assert set(ids[t]) == set(np.argsort(-g, kind="stable")[:2])
# invariant: NOT renormalizing -> kept gates sum to < 1 (mass leaked to dropped experts)
_, wr = route_topk(logits, k=2, renormalize=False)
assert np.all(wr.sum(axis=1) < 1.0 - 1e-9)
# edge: top-1 selects exactly argmax and its renormalized weight is 1.0
ids1, w1 = route_topk(logits, k=1)
assert np.array_equal(ids1[:, 0], np.argmax(softmax(logits, axis=1), axis=1))
assert np.allclose(w1, 1.0)
# edge: k == num_experts renormalizes back to the full softmax (dense limit)
idsE, wE = route_topk(logits, k=8)
full = softmax(logits, axis=1)
assert np.allclose(np.sort(wE, axis=1), np.sort(full, axis=1))
assert np.allclose(wE.sum(axis=1), 1.0)
# adversarial: fast path == slow scalar reference (ids and weights), incl. exact ties.
# All-equal logits -> every gate equal -> tie-break must pick the k lowest indices.
tie = np.zeros((3, 6))
it_f, wt_f = route_topk(tie, k=3)
it_r, wt_r = route_topk_ref(tie, k=3)
assert np.array_equal(it_f, it_r) and np.allclose(wt_f, wt_r)
assert np.array_equal(it_f[0], [0, 1, 2]) # deterministic tie order
assert np.allclose(wt_f, 1.0 / 3)
for _ in range(200):
lg = rng.normal(size=(rng.integers(1, 10), rng.integers(2, 12)))
kk = int(rng.integers(1, lg.shape[1] + 1))
a1, b1 = route_topk(lg, kk)
a2, b2 = route_topk_ref(lg, kk)
assert np.array_equal(a1, a2), (a1, a2)
assert np.allclose(b1, b2), (b1, b2)
# forward mixing: with renormalize, combined output is a convex combination of experts,
# so it lies within their per-coordinate min/max envelope (no extrapolation).
E, N, D = 8, 5, 4
xo = rng.normal(size=(E, N, D))
lg = rng.normal(size=(N, E))
out, oids, ow = moe_forward(lg, xo, k=3)
for t in range(N):
lo = xo[oids[t], t].min(axis=0)
hi = xo[oids[t], t].max(axis=0)
assert np.all(out[t] >= lo - 1e-9) and np.all(out[t] <= hi + 1e-9) # convex hull
print("router OK: top2 sums=1, top1 weight=1.0, dense-limit=full softmax, "
"fast==slow over 200 cases incl ties, mix is convex")
Running it prints router OK: top2 sums=1, top1 weight=1.0, dense-limit=full softmax, fast==slow over 200 cases incl ties, mix is convex and all asserts pass. The convex-hull check is the load-bearing property: with renormalized top-k gates the combine is a convex combination of the selected experts, which is why dropping a token (below) can only lose signal, never inject an out-of-range value.
Why use it¶
Under expert parallelism (EP) the experts of one MoE layer are sharded across GPUs, so token routing becomes an all-to-all shuffle: each token's activation is dispatched to the GPU(s) owning its chosen experts, computed, then combined back.1 The all-to-all is a synchronization barrier: the layer cannot proceed until the slowest expert finishes.
That makes load skew a throughput killer. When the router disproportionately favors a few experts they become hot, and the straggler effect sets in: every expert computation must complete before the layer advances, so an overloaded expert stalls the whole pipeline while other experts and their GPUs sit idle.1 On a 100-expert / 100-GPU deployment, one hot expert can pin the layer latency to that single GPU regardless of how lightly loaded the other 99 are.
Aggregate versus per-token is the trap. A single token activates only k experts, but across all concurrent users every expert is active and contending for GPU resources simultaneously.1 Balancing is therefore not optional at scale: it is what lets EP convert added GPUs into added capacity instead of added idle time. The ## How to scale it section makes the straggler cost concrete with a runnable model: one hot expert holding 50 units of load pins the layer latency at 50 even when 99 peer GPUs each hold 1.
When to use it (and when not)¶
Needed when:
- Serving a sparse MoE model under expert parallelism across multiple GPUs/nodes, where all-to-all and the straggler effect dominate. This is the recommended parallelism for sparse expert selection.2
- Production traffic is skewed (domain-specific prompts, repeated tool-call patterns) that biases the router toward a subset of experts.
- Decode-heavy serving at large EP size, where DeepSeek's EPLB global policy is intended.7
Not needed or lower priority when:
- The model is dense (no router), or the MoE fits on a single device with no expert parallelism, so there is no cross-GPU all-to-all to imbalance.
- The model was trained auxiliary-loss-free (DeepSeek-V3 bias-based balancing) and the workload is close to uniform; static placement may already be balanced and dynamic rebalancing adds churn.9
- The runtime/model pair lacks EPLB support. As of this writing vLLM's EPLB is auxiliary-loss-free-model oriented (DeepSeek-V3/R1, Qwen3 MoE); broader model coverage was still in progress.56 Verify support before enabling.
Architecture¶
The data path through one MoE layer is: tokens enter the router, the router scores every expert and keeps the top-k, an all-to-all dispatch sends each token to the GPU(s) owning its chosen experts under expert parallelism, the selected expert FFNs run (the slowest one is the straggler that gates the barrier), and a gating-weighted combine writes the result back into the residual. The three balancing mechanisms attach at three different points: the auxiliary loss (or bias) shapes the router at training time, the capacity factor caps and spills tokens at the all-to-all, and EPLB replicates hot experts at the expert stage.1
flowchart LR
T["Tokens"] --> R["Router / gating network<br/>(scores every expert)"]
R --> K["Top-k selection<br/>(active experts)"]
K --> A["All-to-all dispatch<br/>(expert parallelism across GPUs)"]
A --> E["Selected expert FFNs<br/>(slowest = straggler)"]
E --> C["Combine into residual<br/>(gating-weighted)"]
BL["Auxiliary loss / bias balancing<br/>(even tokens-per-expert)"] -.-> R
CAP["Capacity factor<br/>(cap, spill, or drop tokens)"] -.-> A
EPLB["EPLB: redundant hot-expert replicas"] -.-> E
The all-to-all is the load-bearing edge: it is a barrier, so the layer's latency is the max over per-GPU expert load, not the average. Every balancing mechanism on this diagram exists to pull that max down toward the average. The ## How to scale it model encodes exactly this max-equals-straggler relationship and the hard lower bound (total load divided by GPU count) that no placement can beat.
How to use it (top-k gating and the balancing levers)¶
The router is used by choosing k and, at training time, by choosing how to keep its output distribution even. The gating math is the ## What it is block above (top-k, renormalize, mix). Three complementary mechanisms then smooth the token-expert distribution; the book recommends combining inference-time spillover, training-time penalties, and hot-expert replicas.1 They are covered in turn below: the training-time loss/bias here, the capacity factor under ## How to run it in production, and redundant-expert placement under ## How to integrate it.
Training-time: auxiliary load-balancing loss (and gating noise)¶
The classic remedy adds an auxiliary loss to the LM objective that penalizes uneven expert assignment, pushing the router toward roughly equal tokens-per-expert. Google's GLaM introduced load-balancing losses plus gating noise for this purpose;1 Switch Transformer formalized it alongside a capacity factor.10 The cost: the auxiliary loss perturbs the primary LM gradient.
Newer models go auxiliary-loss-free. DeepSeek-V3 keeps a per-expert learnable bias added to routing affinity scores for top-k selection only (not to the gating mixing weight); a controller decreases an overloaded expert's bias and increases an underloaded one's, balancing load without distorting the LM gradient. A tiny complementary sequence-wise balance loss guards against single-sequence collapse.89 This is an inference-time-relevant property: EPLB in vLLM is described as essential precisely for these auxiliary-loss-free models, whose realized load can still be skewed at serving time.5
How to run it in production (capacity factor and token dropping)¶
Serving systems expose a capacity factor that caps tokens-per-expert-per-batch, typically 1.2 to 1.5 times the average load.1 Tokens beyond the cap on a hot expert are spilled: routed to a second-choice overflow expert or queued for a second pass.1 In the GShard/Switch lineage, over-capacity tokens are dropped: the expert is skipped and the token passes through unchanged via the residual connection.10
Capacity factor is a quality/throughput dial: too low drops or spills too many tokens (quality loss); too high reverts toward dense compute and erodes the sparsity win. Tune against held-out quality on representative traffic.
The capacity rule and the drop semantics are exact arithmetic. This numpy block encodes the Switch/GShard capacity formula (capacity_factor * num_tokens * k / num_experts) and the drop-to-residual behavior, and is executed and asserted, with conservation (kept plus dropped equals total), the infinite-capacity edge (nothing dropped), the zero-capacity edge (every token passes through as pure residual identity), an adversarial skewed load where only the hot expert drops, and monotonicity (raising the factor never drops more tokens):
# Capacity factor + token dropping/spill, validated (system python3, numpy).
# Run: python3 capacity.py
# Switch/GShard rule: expert_capacity = capacity_factor * (num_tokens * k / num_experts).
# Tokens beyond an expert's capacity within a batch overflow: DROP (skip the expert; the
# token passes through unchanged via the residual) or SPILL (route to a 2nd-choice expert).
import numpy as np
def expert_capacity(num_tokens, k, num_experts, capacity_factor):
"""Per-expert token cap for one batch. avg load per expert is tokens*k/experts;
the factor (typically 1.2-1.5) gives headroom above that average."""
assert capacity_factor >= 0 and num_experts > 0
avg = num_tokens * k / num_experts
return int(np.floor(capacity_factor * avg))
def apply_capacity_drop(assign, num_experts, cap):
"""assign[token] = chosen expert (top-1 for simplicity). Keep the first `cap`
tokens that arrive at each expert (stable order); DROP the rest.
Returns (kept_mask, per_expert_kept). Dropped tokens are skipped entirely."""
kept = np.zeros(assign.shape[0], dtype=bool)
seen = np.zeros(num_experts, dtype=np.int64)
for t, e in enumerate(assign):
if seen[e] < cap:
kept[t] = True
seen[e] += 1
return kept, seen
def moe_with_drop(tokens, assign, num_experts, cap, expert_fns):
"""Dropped tokens pass through UNCHANGED (residual); kept tokens get expert(token).
tokens[token, d]; expert_fns[e] maps a vector to a vector."""
kept, _ = apply_capacity_drop(assign, num_experts, cap)
out = tokens.copy() # residual passthrough is the default
for t in np.nonzero(kept)[0]:
out[t] = expert_fns[assign[t]](tokens[t])
return out, kept
rng = np.random.default_rng(1)
# happy path: capacity formula. 1024 tokens, top-2, 8 experts, cf=1.25
# avg = 1024*2/8 = 256; cap = floor(1.25*256) = 320
assert expert_capacity(1024, 2, 8, 1.25) == 320
assert expert_capacity(1024, 2, 8, 1.0) == 256 # cf=1 -> exactly the average
assert expert_capacity(1024, 1, 4, 1.5) == 384 # 1.5 * (1024/4)
# conservation: kept + dropped == total, for ANY assignment and cap
assign = rng.integers(0, 8, size=1000)
cap = expert_capacity(1000, 1, 8, 1.2) # floor(1.2*125)=150
kept, seen = apply_capacity_drop(assign, 8, cap)
assert kept.sum() + (~kept).sum() == 1000
assert seen.sum() == kept.sum() # per-expert kept totals match
assert np.all(seen <= cap) # nobody exceeds capacity
# edge: infinite capacity (cf huge) drops NOTHING
kept_all, _ = apply_capacity_drop(assign, 8, cap=10**9)
assert kept_all.all()
# edge: cf=0 -> cap 0 -> EVERY token dropped, all pass through residual unchanged
xo = rng.normal(size=(1000, 4))
double = [lambda v: v * 2.0] * 8
out0, k0 = moe_with_drop(xo, assign, 8, cap=0, expert_fns=double)
assert not k0.any() and np.array_equal(out0, xo) # identity (pure residual)
# adversarial: a SKEWED load drops only on the hot expert; cold experts lose nothing.
# Route 900 tokens to expert 0, 100 spread over 1..7. cap below the hot load forces drops.
skew = np.concatenate([np.zeros(900, int), rng.integers(1, 8, size=100)])
rng.shuffle(skew)
cap_s = 300
kept_s, seen_s = apply_capacity_drop(skew, 8, cap_s)
hot_dropped = 900 - min(900, cap_s)
assert (~kept_s).sum() == hot_dropped == 600 # exactly the hot overflow
assert seen_s[0] == cap_s # hot expert saturates at cap
for e in range(1, 8): # cold experts keep all of theirs
assert seen_s[e] == int(np.sum(skew == e))
# adversarial: dropped tokens are byte-for-byte the residual; kept tokens are transformed.
out_s, kept_flag = moe_with_drop(xo[:len(skew)], skew, 8, cap_s, double)
dropped_idx = np.nonzero(~kept_flag)[0]
kept_idx = np.nonzero(kept_flag)[0]
assert np.array_equal(out_s[dropped_idx], xo[dropped_idx]) # untouched
assert np.allclose(out_s[kept_idx], xo[kept_idx] * 2.0) # expert applied
# monotonicity: raising the capacity factor never drops MORE tokens
prev = 10**9
for cf in [0.0, 0.5, 1.0, 1.25, 1.5, 2.0, 4.0]:
c = expert_capacity(len(skew), 1, 8, cf)
kd = (~apply_capacity_drop(skew, 8, c)[0]).sum()
assert kd <= prev, (cf, kd, prev)
prev = kd
print(f"capacity OK: cap(1024,k2,E8,cf1.25)=320; skew drops={hot_dropped} on hot only; "
f"cf=0 -> pure residual identity; drops monotone non-increasing in cf")
Running it prints capacity OK: cap(1024,k2,E8,cf1.25)=320; skew drops=600 on hot only; cf=0 -> pure residual identity; drops monotone non-increasing in cf and all asserts pass. The monotonicity check is the operational guarantee behind the dial: raising the capacity factor trades HBM and dense-compute drift for strictly fewer dropped tokens, so it is safe to raise until quality plateaus and unsafe to lower past the point where drops start hurting held-out quality.
How to integrate it (redundant experts, DeepSeek EPLB, and vLLM)¶
Even with training-time balancing, serving load is skewed, so the runtime replicates hot experts onto multiple GPUs to split their load, at the cost of extra GPU memory.1 DeepSeek's EPLB (Expert Parallelism Load Balancer) systematizes this: it duplicates heavy-loaded experts and heuristically packs the replicas onto GPUs to even per-GPU load. It offers two policies: hierarchical (when node count divides expert-group count: distribute groups across nodes, replicate within node, then pack; suited to prefill) and global (replicate regardless of groups; suited to decode at large EP size).7
EPLB's core API computes a placement plan from a measured load matrix. This is a reference template (needs torch and the eplb package):
# Reference template (requires torch + deepseek-ai/EPLB; not run here). The core
# replicate-hot-then-pack math it embodies is validated in the numpy block below.
# deepseek-ai/EPLB: compute physical expert placement from observed load
import torch
from eplb import rebalance_experts
# weight[layer, expert] = observed token load per logical expert
weight = torch.tensor([[ 90, 132, 40, 61, 104, 165, 39, 4,
73, 56, 183, 86, 100, 110, 33, 8]])
num_replicas = 16 # total physical expert slots (>= logical experts)
num_groups = 4 # expert groups for group-limited routing
num_nodes = 2 # server nodes
num_gpus = 8 # total GPUs
phy2log, log2phy, logcnt = rebalance_experts(
weight, num_replicas, num_groups, num_nodes, num_gpus
)
# phy2log: physical-slot -> logical-expert id
# log2phy: logical-expert -> list of physical slots (replicas)
# logcnt: replica count per logical expert (hot experts get >1)
rebalance_experts(weight, num_replicas, num_groups, num_nodes, num_gpus) returns (phy2log, log2phy, logcnt); hot logical experts receive multiple physical replicas while the all-to-all dispatch spreads their tokens.7
The core math that call embodies (replicate the bottleneck expert, then balance-pack the physical slots across GPUs) is pure arithmetic. This numpy block validates it against the straggler model and is executed and asserted, with the fundamental EPLB win (replicating a hot expert once cuts the straggler 90 to 45), the hard lower bound (no placement beats total-load-over-GPUs), an adversarial case where naive over-replication of a single expert actually raises the optimal straggler (which is exactly why EPLB balances the packing, not just the replica counts), and a longest-processing-time packer cross-checked against a brute-force optimum:
# Straggler cost + EPLB redundant-expert placement, validated (system python3, numpy).
# Run: python3 eplb.py
# All-to-all is a barrier: the MoE layer's latency is set by the busiest GPU (straggler).
# EPLB duplicates hot experts and packs replicas onto GPUs to minimize the max per-GPU load.
# This numpy block validates the CORE math that deepseek-ai/EPLB's rebalance_experts embodies:
# replicate-the-hottest + balanced-pack, and the balancedness metric vLLM's log_balancedness reports.
import numpy as np
from itertools import product
def balancedness(per_gpu_load):
"""vLLM's metric: mean / max over per-GPU (or per-expert) load. 1.0 == perfectly
balanced; -> 0 as one unit dominates. The straggler wastes (max-mean)*num_gpus work."""
a = np.asarray(per_gpu_load, np.float64)
m = a.max()
return 1.0 if m == 0 else float(a.mean() / m)
def straggler_latency(per_gpu_load):
"""The barrier finishes when the busiest GPU finishes -> latency ~ max load."""
return float(np.max(per_gpu_load))
def split_replicas(logical_load, replicas):
"""Given logical expert loads and a replica count per expert (>=1), produce the
per-physical-slot load: a hot expert's tokens split evenly across its replicas."""
slots = []
for load, r in zip(logical_load, replicas):
assert r >= 1
slots.extend([load / r] * r)
return np.array(slots, np.float64)
def lpt_pack(slot_loads, num_gpus):
"""Longest-processing-time bin packing: sort slots desc, assign each to the currently
least-loaded GPU. A standard <=4/3-optimal makespan heuristic; here it evens per-GPU load."""
assert len(slot_loads) >= num_gpus
gpu = np.zeros(num_gpus, np.float64)
for s in sorted(slot_loads, reverse=True):
gpu[np.argmin(gpu)] += s
return gpu
def allocate_replicas(logical_load, num_replicas):
"""Greedy EPLB core: start 1 replica/expert, then repeatedly give the next spare
replica to the expert with the highest PER-REPLICA load (the current bottleneck).
This is the 'duplicate the heaviest expert' rule DeepSeek EPLB applies."""
n = len(logical_load)
assert num_replicas >= n, "need at least one physical slot per logical expert"
rep = np.ones(n, np.int64)
for _ in range(num_replicas - n):
per_rep = np.asarray(logical_load, np.float64) / rep
rep[int(np.argmax(per_rep))] += 1
return rep
def brute_min_makespan(slot_loads, num_gpus):
"""Exact minimum achievable straggler over all assignments (small instances only)."""
best = float("inf")
for combo in product(range(num_gpus), repeat=len(slot_loads)):
g = np.zeros(num_gpus)
for s, b in zip(slot_loads, combo):
g[b] += s
best = min(best, g.max())
return best
rng = np.random.default_rng(2)
# happy path: balancedness of a uniform load is exactly 1.0; a skewed one is < 1.
assert balancedness([10, 10, 10, 10]) == 1.0
assert balancedness([100, 1, 1, 1]) < 0.3
assert 0.0 < balancedness([5, 3, 4, 8]) < 1.0
# straggler: one hot expert pins latency regardless of the idle others (the book's claim:
# 1 hot expert on 1 GPU pins layer latency no matter how light the other 99 are).
load = np.array([1.0] * 99 + [50.0])
assert straggler_latency(load) == 50.0 # 99 near-idle GPUs do not help
assert balancedness(load) < 0.03 # ~ (99+50)/100 / 50
# EPLB core 1: replicating the hottest expert STRICTLY lowers the straggler.
logical = np.array([90.0, 10.0, 10.0, 10.0]) # expert 0 is hot
one = split_replicas(logical, [1, 1, 1, 1]) # 4 slots, 4 GPUs
base = lpt_pack(one, num_gpus=4)
rep = allocate_replicas(logical, num_replicas=5) # one spare replica
assert rep[0] == 2 and rep.sum() == 5 # spare went to the hot expert
after = lpt_pack(split_replicas(logical, rep), num_gpus=4)
assert straggler_latency(after) < straggler_latency(base) # 45 < 90
assert balancedness(after) > balancedness(base) # more even
assert np.isclose(straggler_latency(after), 45.0) # 90 split across 2 replicas
# EPLB core 2: conservation - total load is preserved by replication (only its split changes)
assert np.isclose(split_replicas(logical, rep).sum(), logical.sum())
assert np.isclose(lpt_pack(split_replicas(logical, rep), 4).sum(), logical.sum())
# hard lower bound: NO placement beats perfect balance, so the straggler can never drop
# below total_load / num_gpus, and never below the heaviest single physical slot.
for nr in range(4, 10):
r = allocate_replicas(logical, nr)
slots = split_replicas(logical, r)
lat = straggler_latency(lpt_pack(slots, num_gpus=4))
assert lat >= logical.sum() / 4 - 1e-9 # >= 30 (120/4), always
assert lat >= slots.max() - 1e-9 # a slot cannot be split further
# adversarial: naive "always replicate the single hottest expert" is NOT makespan-monotone.
# Going 6 -> 7 slots forces a 4th replica of expert 0 (90/4=22.5) and STRANDS the three cold
# experts (10 each) without a dedicated GPU, so even the OPTIMAL makespan RISES 30 -> 32.5.
# This is exactly why EPLB balances the PACKING across GPUs, not just the replica counts.
r6 = allocate_replicas(logical, 6); r7 = allocate_replicas(logical, 7)
opt6 = brute_min_makespan(split_replicas(logical, r6), 4)
opt7 = brute_min_makespan(split_replicas(logical, r7), 4)
assert np.isclose(opt6, 30.0) and np.isclose(opt7, 32.5)
assert opt7 > opt6 # more slots, WORSE straggler
# yet a balanced assignment of the SAME 6 slots holds the optimum at the 30 lower bound:
assert np.isclose(brute_min_makespan(split_replicas(logical, r6), 4), logical.sum() / 4)
# adversarial: LPT greedy packer matches the EXACT optimum makespan on small instances.
for _ in range(300):
m = int(rng.integers(4, 8))
g = int(rng.integers(2, 4))
slots = rng.integers(1, 12, size=m).astype(float)
greedy = lpt_pack(slots, g).max()
exact = brute_min_makespan(slots, g)
# LPT is a heuristic; assert it is optimal-or-better-bounded (<= 4/3 * OPT), and
# that it never beats the true optimum (a correctness floor).
assert exact <= greedy <= (4.0 / 3.0) * exact + 1e-9, (slots.tolist(), g, greedy, exact)
# adversarial: balancedness is scale-invariant and bounded in (0, 1].
for _ in range(500):
v = rng.uniform(0.1, 9.0, size=int(rng.integers(1, 20)))
b = balancedness(v)
assert 0.0 < b <= 1.0 + 1e-12
assert np.isclose(b, balancedness(v * 7.3)) # scale-free
assert (b == 1.0) == bool(np.allclose(v, v[0])) # ==1 iff uniform
print(f"eplb OK: 1 hot expert pins straggler=50 (bal<0.03); replicate hot 90->45 (bal up); "
f"lower bound total/GPUs=30 holds; naive over-replication 6->7 slots raises opt 30->32.5; "
f"LPT within 4/3 of brute optimum over 300 cases")
Running it prints eplb OK: 1 hot expert pins straggler=50 (bal<0.03); replicate hot 90->45 (bal up); lower bound total/GPUs=30 holds; naive over-replication 6->7 slots raises opt 30->32.5; LPT within 4/3 of brute optimum over 300 cases and all asserts pass. The 6-to-7-slot counterexample is the reason EPLB packs replicas across GPUs rather than merely counting them: pouring every spare slot into the single hottest expert can strand the cold experts and raise the straggler, so the placement objective is the per-GPU makespan, not the per-expert replica count.
vLLM EPLB¶
vLLM ships an EPLB that collects load statistics on every forward pass and periodically rebalances expert mappings across EP ranks.3 Enable and tune it via a reference template (needs vLLM + torch, and GPUs to run):
# Reference template (requires vLLM + torch + GPUs; not run here). The eplb-config fields
# and the balancedness metric it logs are the ones validated by the numpy eplb block above.
vllm serve deepseek-ai/DeepSeek-R1 \
--enable-expert-parallel \
--enable-eplb \
--eplb-config '{"num_redundant_experts": 32,
"window_size": 1000,
"step_interval": 3000,
"log_balancedness": true}'
Verified config fields and defaults:3
| Field | Default | Meaning |
|---|---|---|
num_redundant_experts |
0 | Additional global expert replicas per EP rank beyond equal distribution; hosts hot experts. vLLM recommends about 32 at large scale so the most popular experts are always available. |
window_size |
1000 | Engine steps of load history tracked for rebalancing decisions. |
step_interval |
3000 | Rebalance every N engine steps. |
log_balancedness |
false | Log avg-tokens-per-expert / max-tokens-per-expert (1.0 = perfectly balanced), the metric the numpy balancedness() above computes. |
use_async |
true | Non-blocking rebalance to reduce latency overhead. |
policy |
"default" | Expert-parallel load-balancing policy. |
communicator |
null (auto) | Backend for expert-weight transfers: torch_nccl, torch_gloo, pynccl, nixl, or null. |
The redundant experts cost HBM: the overhead is NUM_MOE_LAYERS * BYTES_PER_EXPERT * (NUM_TOTAL_EXPERTS + NUM_REDUNDANT_EXPERTS) / NUM_EP_RANKS, which for DeepSeek-V3 is about 2.4 GB for one redundant expert per EP rank.3 This is the concrete price of the straggler win the numpy block quantifies, so size num_redundant_experts against the heavy tail you actually observe, not the worst case.
Note: older docs and PR threads use flat flags (--num-redundant-experts, --eplb-window-size, --eplb-step-interval); current vLLM consolidates these under --eplb-config (JSON or --eplb-config.window_size 1000 dot-notation). Confirm the form against the version you deploy.35 The official vLLM doc is authoritative on flag spelling where it disagrees with secondary sources.
How to maintain it¶
- Track
log_balancedness(or the equivalent telemetry) per layer. A persistent drop below about 1.0 signals a hot expert and a stale placement plan; the numpybalancedness()function above is the same mean-over-max ratio. - Right-size
num_redundant_experts: replicas cost HBM (about 2.4 GB per redundant expert per rank for DeepSeek-V3), so add only enough to cover the heavy tail observed inwindow_sizehistory.3 - Match policy to phase: hierarchical for prefill pools, global for large-EP decode pools,7 consistent with disaggregated prefill/decode serving. See disaggregated inference.
- Re-measure after traffic shifts: rebalancing is driven by a measured load matrix, so a workload change (new domain, new tool) invalidates the old plan; the periodic
step_intervalrebuild handles steady drift, but validate end-to-end throughput after major shifts. - This guidance is grounded in the cited book chapters and official vLLM/DeepSeek docs; it is not hardware-tested here. Validate balancedness, TPOT, and quality on your own cluster and traffic before relying on any setting.
How to scale it¶
The scaling question is how much the straggler actually costs and how far replication can push it down. The all-to-all barrier makes the layer latency the max per-GPU load, so the win from balancing is the gap between that max and the average, multiplied by the GPU count in wasted work. The straggler numpy block above makes the limits concrete:
- The hard floor is
total_load / num_gpus: no placement, however clever, drives the straggler below perfect balance, and the block asserts this bound holds across replica counts. Adding GPUs (raising EP size) lowers that floor, which is why decode-heavy serving at large EP size is where EPLB's global policy pays off.7 - A single logical expert's per-replica load
hot / replicasis itself a floor: you cannot split one expert's tokens finer than its replica count, so the only way past a dominant hot expert is more replicas (more HBM) or a router that stops concentrating load. - Replicas are not free and not monotone if placed naively: the 6-to-7-slot counterexample shows that pouring spare slots into one expert can strand the others and raise the straggler. Scaling therefore means growing replicas and re-balancing the packing together, not just bumping
num_redundant_experts.
Because the plan is driven by a measured load matrix, the optimal replica count and placement are workload- and topology-specific and must be re-measured as EP size or traffic changes. See inference parallelism: TP, PP, EP, DP for serving and serving open-weight models.
Failure modes¶
- One hot expert pins the layer. Under the all-to-all barrier the straggler sets latency, so a single overloaded expert stalls the whole layer while peer GPUs idle; the numpy model shows 99 near-idle GPUs cannot rescue a layer whose hot expert holds 50 units.1 Track balancedness and replicate the hot expert.
- Over-dropping at low capacity factor. A capacity factor set too low drops or spills too many tokens on hot experts, and dropped tokens skip their expert entirely (pure residual), so quality falls; the capacity block shows drops rise as the factor falls.10 Tune against held-out quality.
- Dense-compute drift at high capacity factor. Setting the factor too high reverts toward dense compute and erodes the sparsity win, spending HBM and FLOPs for no throughput gain.1
- Stale placement after a traffic shift. EPLB's plan is built from a measured load matrix; a new domain or tool changes the distribution and invalidates the old plan until the next
step_intervalrebuild. Re-measure and validate throughput after major shifts.3 - Naive over-replication raises the straggler. Pouring every spare replica into the single hottest expert can strand the cold experts without a dedicated GPU and raise the optimal makespan (the asserted 6-to-7-slot case). Balance the packing across GPUs, do not just increase replica counts.7
- Redundant experts blow the HBM budget. Replicas cost about 2.4 GB per redundant expert per rank for DeepSeek-V3, competing with KV-cache space; over-provisioning
num_redundant_expertscan OOM or shrink the usable context.3 - Unsupported runtime/model pair. vLLM EPLB was auxiliary-loss-free-model oriented (DeepSeek-V3/R1, Qwen3 MoE) with broader coverage in progress; enabling it on an unsupported model may be a no-op or error. Verify support before relying on it.56
References¶
Related: Expert Parallelism for MoE Inference · Mixture-of-Experts: Sparse Scaling · Inference Parallelism: TP, PP, EP, DP for Serving · LLM Request Routing (Mixture-of-Models) · Serving Open-Weight Models · Disaggregated Inference · Glossary
-
Chris Fregly, AI Systems Performance Engineering (O'Reilly), Chapter 15, "Multinode Inference, Parallelism, Decoding, and Routing Optimizations": Expert Parallelism and the all-to-all barrier, top-k gating, the straggler effect, aggregate-versus-per-token expert activation, capacity factor (1.2 to 1.5 times average) with spill/overflow, hot-expert replication, and the GLaM load-balancing loss plus gating noise. ↩↩↩↩↩↩↩↩↩↩↩↩
-
Chris Fregly, AI Systems Performance Engineering (O'Reilly), Chapter 19, "Dynamic and Adaptive Inference Engine Optimizations": router / top-k experts, DeepSeek-R1 256 routed experts, top-9 (1 shared), about 37B active parameters, and expert parallelism as the recommended layout for sparse expert selection. ↩↩↩↩
-
vLLM, "Expert Parallel Deployment": EPLB collects load statistics every forward pass and periodically rebalances across EP ranks; the
--eplb-configfields (window_size,step_interval,log_balancedness,num_redundant_experts,use_async,policy,communicator) and defaults; balancedness defined as avg-tokens-per-expert over max-tokens-per-expert; the memory overhead formula and the roughly 2.4 GB per redundant expert per rank figure for DeepSeek-V3; and the recommendation to set about 32 redundant experts at large scale. https://docs.vllm.ai/en/latest/serving/expert_parallel_deployment/ ↩↩↩↩↩↩↩ -
vLLM,
FusedMoErouter / routing simulator (select_experts,renormalize=True, top-k over router logits). Source of the score-then-top-k-then-renormalize gating shape reproduced in the router numpy block. https://github.com/vllm-project/vllm/blob/main/tests/kernels/moe/test_routing_simulator.py ↩ -
vLLM, "[Feature] Expert Parallelism Load Balancer (EPLB)," PR #18343 (EPLB described as essential for auxiliary-loss-free models; flat-flag versus
--eplb-confighistory). https://github.com/vllm-project/vllm/pull/18343 ↩↩↩↩ -
vLLM, "[Feature]: Support EPLB for More MoE Models," Issue #20468 (model-coverage status). https://github.com/vllm-project/vllm/issues/20468 ↩↩
-
DeepSeek-AI, "EPLB: Expert Parallelism Load Balancer": duplicate heavy-loaded experts and heuristically pack replicas onto GPUs;
rebalance_experts(weight, num_replicas, num_groups, num_nodes, num_gpus) -> (phy2log, log2phy, logcnt); hierarchical policy (prefill) versus global policy (decode at large EP size). https://github.com/deepseek-ai/EPLB ↩↩↩↩↩↩ -
DeepSeek-AI, "DeepSeek-V3 Technical Report": auxiliary-loss-free bias balancing (per-expert learnable bias on the top-k affinity, not the mixing weight), sigmoid gating, complementary sequence-wise balance loss, and the roughly 671-685B-total / about-37B-active sparsity. https://arxiv.org/abs/2412.19437 ↩↩
-
L. Wang et al., "Auxiliary-Loss-Free Load Balancing Strategy for Mixture-of-Experts" (bias-controller balancing without an auxiliary-loss gradient perturbation). https://arxiv.org/abs/2408.15664 ↩↩
-
W. Fedus et al., "Switch Transformers" (capacity factor and over-capacity token dropping via the residual) and N. Shazeer et al. / GShard (expert capacity and the gating load-balancing loss). https://arxiv.org/abs/2101.03961 ↩↩↩