LLM request routing (Mixture-of-Models)

Scope: the decision layer that sits in front of a pool of heterogeneous model endpoints and picks which model serves each request, or whether the serving model should reason at all. Covers the taxonomy (predictive vs cascading; decision signals; target axes; placement), the cost-quality argument, evaluation metrics, and how to operate a router. This is inter-model routing (request → endpoint); it is not the intra-model token → expert gating of MoE routing and load balancing. The concrete open-source implementation is documented in vLLM Semantic Router.

flowchart LR
  REQ["Incoming request<br/>(prompt + context)"] --> R["Router / dispatcher<br/>(classify, do not generate)"]
  SIG["Signals:<br/>semantic class, difficulty,<br/>cost, privacy, safety"] -.-> R
  R -->|"easy / cheap"| S["Small or non-reasoning model"]
  R -->|"hard / high-value"| L["Large or reasoning model"]
  R -->|"private / regulated"| P["Local / on-prem model"]
  R -->|"blocked"| G["Safety gate<br/>(jailbreak, PII)"]
  S --> OUT["Response"]
  L --> OUT
  P --> OUT

Disambiguation. Request routing (this page) chooses which model serves a request across a pool of separate endpoints, a Mixture-of-Models (MoM). MoE routing chooses which expert FFNs fire inside one model, a Mixture-of-Experts. Load balancing (QoS and admission control) chooses which replica of the same model serves a request. All three are "routing"; only this one changes the model identity.

What it is

A request router is a lightweight decision component that inspects each inference request and dispatches it to one of several backend models that differ in capability, cost, latency, modality, or privacy boundary. The router does not generate tokens (it classifies the request and forwards it), so its own latency must be a small fraction of the generation it gates.

The pattern is often branded Mixture-of-Models (MoM): instead of sending every request to one large model, a fleet exposes a mix (for example an 8B model, a 70B model, a reasoning model, and a local privacy-tier model), and the router allocates traffic across them. The economic premise is that request difficulty is highly skewed (most production traffic is easy), so paying frontier-model cost for every request is waste. As the vLLM Semantic Router project frames it: "What's the weather?" shouldn't cost as much as "Analyze this legal contract."

Routers exploit complementary strengths: two models with similar average benchmark scores frequently succeed on different subsets of queries, so a router with per-query model selection can beat any single model in the pool while spending less. The gap between a router's realized quality and a hypothetical perfect router (the oracle) is the headroom the literature optimizes.

Why it matters

Cost. Routing a large fraction of traffic to a cheaper model cuts token spend directly. RouteLLM (Ong et al., Berkeley/LMSYS) reports more than 2× cost reduction on benchmarks such as MT-Bench with no measurable quality drop, by learning from Chatbot Arena preference data when the weak model is good enough.
Latency and tokens. Restricting reasoning (chain-of-thought) to the queries that need it is a high-leverage special case. When to Reason: Semantic Router for vLLM (Wang et al., NeurIPS 2025 ML for Systems) reports +10.2 points accuracy, −47.1% latency, and −48.5% token consumption on MMLU-Pro versus always-on reasoning, by classifying which prompts benefit from extended reasoning.
Fleet economics. A router lets a cluster run a few expensive large-model GPUs behind many cheap small-model GPUs, sizing each tier to its share of traffic. This changes capacity planning (GPU capacity planning, consumption models) from "provision the peak of the biggest model" to "provision each tier to its routed load."
Governance. The same choke point enforces privacy tiers (keep regulated prompts on local models), safety (block jailbreaks and PII before they reach a model), and modality (send images to a vision model). One decision layer, several policies.

Classifications

By decision timing

| Class | Mechanism | Trade-off |
| --- | --- | --- |
| Predictive (pre-generation) | Classify the prompt before any generation, pick one model, run once. | Lowest overhead; wrong calls are not recoverable without a re-route. |
| Cascading (generate-then-escalate) | Run the cheap model first; a verifier/confidence check decides whether to escalate to a stronger model. | Recovers from bad cheap answers; pays the cheap model's latency on escalated requests, and needs a reliable confidence signal. |
| Speculative / hybrid | Cheap model drafts, strong model verifies or corrects. | Overlaps with token-level speculative decoding; best when draft and target share a tokenizer. |

Predictive routing minimises latency; cascading maximises the fraction served cheaply at the cost of a second hop on hard queries. Many production systems combine them: predict first, cascade on low confidence.
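The combined pattern (predict first, cascade on low confidence) can be sketched as a three-zone decision. This is a minimal illustration, not a reference implementation: `p_hard`, `verify`, the `low`/`high` thresholds, and the toy stand-ins below are all hypothetical names and values chosen for the example.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Routed:
    answer: str
    model: str   # which tier produced the final answer
    hops: int    # 1 = single generation, 2 = cheap draft plus escalation


def predict_then_cascade(
    prompt: str,
    p_hard: Callable[[str], float],      # hypothetical predictor: P(prompt needs the strong model)
    weak: Callable[[str], str],
    strong: Callable[[str], str],
    verify: Callable[[str, str], bool],  # hypothetical verifier for the cheap draft
    low: float = 0.2,                    # assumed thresholds, tune on your own traffic
    high: float = 0.8,
) -> Routed:
    p = p_hard(prompt)
    if p >= high:                        # confidently hard: predictive route, one hop
        return Routed(strong(prompt), "strong", hops=1)
    if p <= low:                         # confidently easy: predictive route, one hop
        return Routed(weak(prompt), "weak", hops=1)
    draft = weak(prompt)                 # uncertain zone: cascade
    if verify(prompt, draft):
        return Routed(draft, "weak", hops=1)
    return Routed(strong(prompt), "strong", hops=2)


# Toy stand-ins so the sketch runs; real systems call a classifier and model endpoints.
def toy_p_hard(prompt: str) -> float:
    return min(1.0, len(prompt.split()) / 20)

def toy_weak(prompt: str) -> str:
    return f"weak:{prompt}"

def toy_strong(prompt: str) -> str:
    return f"strong:{prompt}"

def toy_verify(prompt: str, draft: str) -> bool:
    return "prove" not in prompt.lower()


if __name__ == "__main__":
    for q in ["What's the weather?", "Prove that the sum of two odd integers is even"]:
        print(predict_then_cascade(q, toy_p_hard, toy_weak, toy_strong, toy_verify))
```

Only requests in the uncertain band pay for a second hop, which is the trade the paragraph above describes: widening the band recovers more bad cheap answers at the cost of more escalations.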

By decision signal

Semantic classification. Embed the prompt with a small encoder (BERT-family) and classify it into a domain/category, then map category → model. Fast, interpretable, retrainable. Basis of the vLLM Semantic Router.
Preference-learned. Train a router on human preference labels to predict whether the strong model will beat the weak one. RouteLLM trains four router families on this signal: similarity-weighted ranking, matrix factorisation, a BERT classifier, and a causal-LLM classifier.
Reasoning-gating (difficulty). A binary or graded classifier decides, for a single model with a reasoning toggle, whether to reason or to answer directly, rather than choosing between two models (When to Reason).
Preference-aligned / policy. Map queries to human-defined domains or action types so operators steer routing by intent, not benchmark scores. Arch-Router (Tran et al.) does this with a 1.5B model and adds new backends without retraining.
Reinforcement-learned / sequential / agentic. Treat routing as a sequential decision process (route, observe, aggregate, re-route) rather than one-shot classification. Agent-as-a-Router (ACRouter) runs a Context-Action-Feedback loop with an orchestrator, a verifier, and a memory, accumulating execution-grounded experience during deployment; on its CodeRouterBench (about 10K coding tasks across 8 frontier LLMs) the agentic router beats static classifiers, evaluated by regret against the oracle. Heavier per request, and aimed at coding and agentic multi-hop use.
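As an illustration of the first signal in the list (semantic classification: embed → classify → map category to model), here is a deliberately tiny sketch. Everything in it is an assumption made for the example: the word-count "embedding" stands in for a BERT-family encoder, and the category prototypes, endpoint names, and `floor` value are hypothetical.

```python
import math
from collections import Counter

# Hypothetical category prototypes. A production router would learn these with a
# BERT-family classifier; here each category is just a bag of indicative words.
PROTOTYPES = {
    "math": Counter("prove integral equation solve theorem derivative".split()),
    "code": Counter("python function bug compile error stack".split()),
    "chitchat": Counter("hello weather thanks joke today".split()),
}

# Hypothetical category -> endpoint mapping.
CATEGORY_TO_MODEL = {
    "math": "reasoning-model",
    "code": "large-model",
    "chitchat": "small-model",
}


def embed(text: str) -> Counter:
    """Toy 'embedding': lowercase word counts (stand-in for an encoder forward pass)."""
    return Counter(w.strip("?.,!").lower() for w in text.split())


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def classify(text: str) -> tuple[str, float]:
    """Return the best-matching category and its similarity score."""
    vec = embed(text)
    scores = {cat: cosine(vec, proto) for cat, proto in PROTOTYPES.items()}
    best = max(scores, key=scores.get)
    return best, scores[best]


def route(text: str, default: str = "large-model", floor: float = 0.1) -> str:
    """Map category -> model, falling back to a default below the similarity floor."""
    category, score = classify(text)
    return CATEGORY_TO_MODEL[category] if score >= floor else default


if __name__ == "__main__":
    for q in ["What's the weather today?", "Solve the integral equation", "Translate this sentence into French"]:
        print(q, "->", route(q))
```

The structure, not the toy encoder, is the point: the classifier emits a category plus a score, and the category-to-model table stays a separate, inspectable mapping.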

By target axis (what the router optimises)

Capability/quality · cost/token-spend · latency/TTFT · privacy (local vs frontier) · modality (text/vision/audio) · safety (jailbreak/PII gating). Real routers combine several axes into one policy.

By placement in the stack

Gateway / proxy (e.g. Envoy external processing): language-agnostic, sees every request, the common production choice.
Client/SDK-side: the application picks the model; lowest infra, but every client must embed the policy.
In-engine: the serving engine chooses internally; tightest integration, least portable.

When to use it

Use a request router when:

The fleet exposes several models of different cost/capability and traffic difficulty is skewed (most requests are easy).
Reasoning models are in the mix and always-on reasoning is burning tokens and latency on trivial prompts.
Privacy or compliance requires some prompts to stay on local/on-prem models.
Safety gating (jailbreak/PII) must apply uniformly in front of every model.
You are cost-constrained and can tolerate a small, measured quality trade at a chosen operating point.

Skip or defer it when:

A single model already meets quality and cost; a router adds a failure domain and latency for no gain.
Traffic is uniform and ultra-low-latency, where even a few milliseconds of router overhead matters and there is no cheaper model to route to.
The fleet is tiny (one or two GPUs); tier the models only once traffic justifies separate pools.

How to implement, integrate, and operate

Build the decision function

Choose the signal (above). Semantic classification is the fastest to stand up: embed → classify → map to endpoint.
Keep the router cheap. The encoder must run in single-digit milliseconds so routing overhead does not move TTFT. Production routers run BERT-class encoders in Rust (Candle) or ONNX/OpenVINO rather than a Python model server. A router that adds 100 ms to save 200 ms of generation is a poor trade.
Set confidence thresholds and a fallback. Below a confidence floor, fall back to the strong model (fail safe on quality) or cascade. For a privacy-tier request, never fail open to a model outside the privacy boundary; the fallback must stay on a local/on-prem endpoint.
Map categories to endpoints in config, not code, so the policy is auditable and changeable without a redeploy.
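The last two steps (a confidence floor with a fallback, and a category → endpoint mapping kept in config) might look like the sketch below. The JSON schema, tier names, endpoint names, and the 0.6 floor are hypothetical; the one design point it tries to show is that the fallback is defined per privacy tier and the policy is checked at load time, so a low-confidence request can never leave its tier.

```python
import json

# Hypothetical policy document; in practice this would live in a versioned config file.
POLICY_JSON = """
{
  "confidence_floor": 0.6,
  "fallback": {"public": "large-model", "regulated": "local-model"},
  "routes": {
    "public":    {"chitchat": "small-model", "math": "reasoning-model"},
    "regulated": {"chitchat": "local-model", "math": "local-model"}
  },
  "allowed": {
    "public":    ["small-model", "large-model", "reasoning-model", "local-model"],
    "regulated": ["local-model"]
  }
}
"""


def load_policy(text: str) -> dict:
    """Parse the policy and reject it if any tier can reach a disallowed endpoint."""
    policy = json.loads(text)
    for tier, allowed in policy["allowed"].items():
        reachable = set(policy["routes"][tier].values()) | {policy["fallback"][tier]}
        leaked = reachable - set(allowed)
        if leaked:
            raise ValueError(f"tier {tier!r} can reach disallowed endpoints: {sorted(leaked)}")
    return policy


def select_endpoint(policy: dict, tier: str, category: str, confidence: float) -> str:
    """Use the mapped endpoint when confident; otherwise the tier's own fallback."""
    routes = policy["routes"][tier]
    if confidence < policy["confidence_floor"] or category not in routes:
        return policy["fallback"][tier]
    return routes[category]


POLICY = load_policy(POLICY_JSON)

if __name__ == "__main__":
    print(select_endpoint(POLICY, "public", "chitchat", 0.9))     # confident, mapped route
    print(select_endpoint(POLICY, "public", "chitchat", 0.3))     # low confidence, public fallback
    print(select_endpoint(POLICY, "regulated", "math", 0.1))      # low confidence, stays local
```

Because every reachable endpoint is enumerated in the config, the closure check is a set difference rather than a code review.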

Place it in the serving stack

The dominant pattern is a gateway filter: the router intercepts OpenAI-compatible requests at an Envoy/API-gateway layer, classifies, then selects the backend cluster. This composes with existing serving concerns (continuous batching, KV-cache management, disaggregated inference) because the router chooses the endpoint, and each endpoint keeps its own batching and cache. See vLLM Semantic Router for the concrete Envoy ExtProc implementation and serving open-weight models for sizing the backend pool.

Evaluate the router

Routing quality is a curve, not a point: you trade quality for cost by moving the threshold. Standard metrics (from RouteLLM):

Cost-quality frontier: plot realized quality against fraction routed to the strong model; a good router dominates the random-routing baseline.
Call-Performance Threshold (CPT(x)): the minimum fraction of strong-model calls needed to recover x% of the weak→strong performance gap (a performance gap recovered, PGR, of x%); lower is better.
Average Performance Gap Recovered (APGR): the share of the weak→strong quality gap the router recovers, averaged across operating points (strong-model call fractions); higher is better, and random routing scores about 0.5 by construction.
Oracle gap: distance from a perfect per-query router, the remaining headroom.
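To make the frontier metrics concrete, the sketch below computes them on a ten-query toy set. It assumes the RouteLLM-style definitions as understood here: PGR at a given strong-call fraction is (routed quality − weak quality) / (strong quality − weak quality), APGR is the mean PGR over a uniform grid of call fractions, and CPT(x) is the smallest call fraction whose PGR reaches x. The data, the router scores, and the helper names are invented for illustration.

```python
def mean(xs):
    return sum(xs) / len(xs)


def routed_quality(k, scores, weak_q, strong_q):
    """Mean quality when the k highest-scored queries go to the strong model."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    to_strong = set(order[:k])
    return mean([strong_q[i] if i in to_strong else weak_q[i] for i in range(len(scores))])


def pgr_curve(scores, weak_q, strong_q):
    """PGR at every strong-call count k = 0..n, i.e. call fraction k/n."""
    weak, strong = mean(weak_q), mean(strong_q)
    n = len(scores)
    return [(routed_quality(k, scores, weak_q, strong_q) - weak) / (strong - weak)
            for k in range(n + 1)]


def apgr(scores, weak_q, strong_q):
    """Average PGR over the uniform grid of call fractions."""
    return mean(pgr_curve(scores, weak_q, strong_q))


def cpt(x, scores, weak_q, strong_q):
    """Smallest strong-call fraction whose PGR reaches x (x in [0, 1])."""
    n = len(scores)
    for k, value in enumerate(pgr_curve(scores, weak_q, strong_q)):
        if value >= x:
            return k / n
    return 1.0


# Toy data: the weak model solves the 5 easy queries, the strong model solves all 10.
weak_q = [1.0] * 5 + [0.0] * 5
strong_q = [1.0] * 10
good_scores = [0.1, 0.2, 0.1, 0.3, 0.2, 0.9, 0.8, 0.7, 0.9, 0.6]  # high = "send to strong"
bad_scores = [1 - s for s in good_scores]                          # an inverted router

if __name__ == "__main__":
    for name, s in [("good", good_scores), ("inverted", bad_scores)]:
        print(name, f"APGR={apgr(s, weak_q, strong_q):.3f}", f"CPT(50%)={cpt(0.5, s, weak_q, strong_q):.2f}")
```

A router that ranks the hard queries first reaches a given PGR with fewer strong calls (lower CPT) and a higher APGR; the inverted router falls below the 0.5 that random routing would score.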

The headroom itself comes from complementary strengths: routing only pays when the models succeed on different queries and a signal can tell them apart. This runnable block makes that precise: it computes the oracle bound and the fraction of the single-best → oracle headroom that a signal-driven router recovers (a single-operating-point ratio in the spirit of APGR, not RouteLLM's weak→strong APGR averaged over thresholds), and asserts the adversarial case where two equally strong but non-complementary models leave a router no gain:

# routing_headroom.py: a signal-driven router beats any single model only when the
# models are complementary; the oracle bounds the gain, and `recovered` measures how much
# of the single-best -> oracle headroom the router captures. numpy only.
import numpy as np

rng = np.random.default_rng(0)
N = 4000
domain = rng.integers(0, 2, N)                        # a routable signal (e.g. semantic class)
yA = rng.random(N) < np.where(domain == 0, 0.9, 0.5)  # model A strong on domain 0
yB = rng.random(N) < np.where(domain == 1, 0.9, 0.5)  # model B strong on domain 1 (complementary)

single_best = max(yA.mean(), yB.mean())               # best single model overall
oracle      = (yA | yB).mean()                         # perfect per-query router = upper bound
routed      = np.where(domain == 0, yA, yB).mean()     # route each domain to its stronger model
recovered   = (routed - single_best) / (oracle - single_best)   # share of oracle headroom recovered
assert routed > single_best + 0.1 and 0.7 < recovered <= 1.0    # complementary strengths -> real, recovered gain

# adversarial: two equally-strong, NON-complementary models -> the signal buys no routing gain
yA2 = rng.random(N) < 0.7
yB2 = rng.random(N) < 0.7
gain = np.where(domain == 0, yA2, yB2).mean() - max(yA2.mean(), yB2.mean())
assert abs(gain) < 0.03                                # no complementarity -> routing does not help
print(f"single={single_best:.3f} routed={routed:.3f} oracle={oracle:.3f} "
      f"recovered={recovered:.2f}; non-complementary gain={gain:+.3f}")

Pick an operating point against an SLO (inference-serving SLOs), not a single benchmark number, and re-measure when models in the pool change.

Operate it

Guard against drift. Traffic distribution and model quality both move; a router trained on last quarter's mix decays. Log routed decisions and periodically re-evaluate the frontier.
Close the loop. User feedback (thumbs, edits, escalations) is a cheap relabeling signal; feed it back into router training.
Gate safety at the router. Jailbreak and PII classifiers belong in front of every backend, not per-model (prompt-injection defense, security and multi-tenancy).
Cache semantically. An embedding-similarity response cache in front of the router returns a cached answer for near-duplicate prompts without any generation, a large win on skewed production traffic; validate the similarity threshold so that it does not serve stale answers or answers to a materially different prompt.
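A semantic cache of the kind described in the last item can be sketched in a few lines. This is an assumption-laden toy: `toy_embed` (letter counts) stands in for a real sentence encoder, the linear scan stands in for a vector index, and the 0.95 and 0.5 thresholds are illustrative values, not recommendations.

```python
import math


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def toy_embed(text):
    """Toy embedding: counts of the letters a-z (stand-in for a sentence encoder)."""
    lowered = text.lower()
    return [lowered.count(chr(ord("a") + i)) for i in range(26)]


class SemanticCache:
    def __init__(self, embed, threshold=0.95):
        self.embed = embed
        self.threshold = threshold
        self.entries = []  # list of (vector, response)

    def put(self, prompt, response):
        self.entries.append((self.embed(prompt), response))

    def get(self, prompt):
        """Return the most similar cached response if it clears the threshold, else None."""
        vec = self.embed(prompt)
        best_score, best_response = 0.0, None
        for cached_vec, response in self.entries:
            score = cosine(vec, cached_vec)
            if score > best_score:
                best_score, best_response = score, response
        return best_response if best_score >= self.threshold else None


if __name__ == "__main__":
    strict = SemanticCache(toy_embed, threshold=0.95)
    strict.put("What is the weather today?", "Sunny")
    print(strict.get("What's the weather today"))       # near-duplicate wording
    print(strict.get("Explain quantum entanglement"))   # unrelated prompt

    loose = SemanticCache(toy_embed, threshold=0.5)      # deliberately too loose
    loose.put("What is the weather today?", "Sunny")
    print(loose.get("Explain quantum entanglement"))
```

The loose cache is included on purpose: with this toy embedding it answers an unrelated prompt from the weather entry, which is the wrong-hit failure the threshold validation is meant to catch.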

Don't-miss checklist

Router latency budgeted and measured: encoder runs in Rust/Candle or ONNX, single-digit ms, off the Python hot path.
Category → model mapping lives in config, versioned and auditable, not hard-coded.
A confidence floor with a fail-safe fallback (to the strong model or a cascade), never fail-open on privacy tiers.
Router evaluated on a cost-quality frontier (APGR/CPT), tied to an SLO, with a chosen operating point, not a single accuracy number.
Safety gating (jailbreak/PII) applied at the router, in front of every backend.
Re-evaluation scheduled for when the model pool or traffic mix changes.

Failure modes

Router overhead eats the savings. A Python-hosted encoder adds tens of ms and negates the latency win. Measure end-to-end TTFT with and without the router.
Mis-route to a weak model on a hard query. Silent quality regression with no error. Mitigate with confidence thresholds, cascading, and frontier evaluation, not average accuracy.
Fail-open on privacy tier. A low-confidence fallback that sends a regulated prompt to a frontier API is a compliance breach; fallback must respect the privacy axis.
Drift. Static thresholds decay as traffic and models change; a router that looked optimal at launch quietly degrades.
Cache poisoning / stale hits. A too-loose semantic cache returns a near-duplicate's answer for a materially different prompt.
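One inexpensive way to notice the drift failure mode is to compare the distribution of logged routing decisions between a baseline window and the current window. The sketch below uses total variation distance; the window contents, the category labels, and the 0.1 tolerance are hypothetical, and a shift in this statistic only signals that the frontier should be re-evaluated, not that quality has dropped.

```python
from collections import Counter


def category_distribution(decisions):
    """Turn a list of logged category labels into relative frequencies."""
    if not decisions:
        raise ValueError("need at least one logged decision")
    counts = Counter(decisions)
    total = sum(counts.values())
    return {cat: n / total for cat, n in counts.items()}


def total_variation(p, q):
    """Total variation distance between two discrete distributions (0 = identical, 1 = disjoint)."""
    keys = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)


def drifted(baseline, current, tolerance=0.1):
    """Flag drift when the routed-category mix moved by more than `tolerance`."""
    distance = total_variation(category_distribution(baseline), category_distribution(current))
    return distance > tolerance


if __name__ == "__main__":
    last_quarter = ["chitchat"] * 80 + ["math"] * 20
    this_week = ["chitchat"] * 50 + ["math"] * 30 + ["code"] * 20
    print(drifted(last_quarter, last_quarter), drifted(last_quarter, this_week))
```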

Open questions & validation

Benchmark the actual router encoder latency on your gateway hardware; confirm it is a small fraction of TTFT before shipping.
Re-derive the cost-quality frontier on your traffic, not the paper's benchmark; CPT/APGR are workload-specific.
Validate that privacy-tier routing is closed under low-confidence fallback (no path leaks a regulated prompt to a frontier endpoint).
Confirm the semantic-cache similarity threshold against a labelled set of near-duplicate vs distinct prompts.
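For the first validation item, a small harness is enough to measure router overhead against a TTFT budget. This is a sketch under stated assumptions: `fn` is whatever callable wraps your router's classify step, the percentile is a simple nearest-rank estimate, and the 5% budget in `within_budget` is a placeholder to replace with your own SLO.

```python
import statistics
import time


def measure_latency_ms(fn, prompts, warmup=10, repeats=5):
    """Time fn(prompt) over the prompt set and return p50/p99/mean in milliseconds."""
    for prompt in prompts[:warmup]:          # warm caches before timing
        fn(prompt)
    samples = []
    for _ in range(repeats):
        for prompt in prompts:
            start = time.perf_counter()
            fn(prompt)
            samples.append((time.perf_counter() - start) * 1000.0)
    samples.sort()

    def percentile(q):
        return samples[min(len(samples) - 1, int(q * len(samples)))]

    return {"p50": percentile(0.50), "p99": percentile(0.99), "mean": statistics.fmean(samples)}


def within_budget(stats, ttft_ms, max_fraction=0.05):
    """True when tail router latency stays under the chosen fraction of TTFT (assumed budget)."""
    return stats["p99"] <= max_fraction * ttft_ms


if __name__ == "__main__":
    # Stand-in for a real classify call; replace with the router's encoder path.
    stats = measure_latency_ms(lambda p: sum(ord(c) for c in p), ["hello world"] * 100)
    print(stats, within_budget(stats, ttft_ms=200.0))
```

Run it on the gateway hardware itself, and compare end-to-end TTFT with and without the router in the path, since the harness only sees the classify step.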

References

RouteLLM: Learning to Route LLMs with Preference Data (Ong et al.): https://arxiv.org/abs/2406.18665
When to Reason: Semantic Router for vLLM (Wang et al., NeurIPS 2025 ML for Systems): https://arxiv.org/abs/2510.08731
Arch-Router: Aligning LLM Routing with Human Preferences (Tran et al.): https://arxiv.org/abs/2506.16655
vLLM Semantic Router (project): https://github.com/vllm-project/semantic-router
llm-semantic-router models (Hugging Face): https://huggingface.co/llm-semantic-router
Intelligent inference request routing (Red Hat Emerging Technologies): https://next.redhat.com/2025/11/11/intelligent-inference-request-routing-for-large-language-models/
Agent-as-a-Router: Agentic Model Routing for Coding Tasks (ACRouter, Zhou et al.): https://arxiv.org/abs/2606.22902