vLLM semantic router¶
Scope: the open-source vLLM Semantic Router (github.com/vllm-project/semantic-router), an Envoy External Processor that classifies each OpenAI-compatible request with small BERT-family encoders and dispatches it across a Mixture-of-Models fleet, adding reasoning gating, jailbreak/PII gating, and semantic caching. This is the concrete implementation of the pattern in LLM request routing; read that first for the taxonomy and evaluation. This page covers how to deploy, configure, and operate it.
```mermaid
flowchart LR
    C["Client<br/>(OpenAI API)"] --> E["Envoy proxy"]
    E <-->|"gRPC ExtProc"| X["Semantic Router<br/>(Go ExtProc server)"]
    X --> CL["Classifiers (Rust / Candle):<br/>category, intent,<br/>jailbreak, PII, reason-gate"]
    X --> SC["Semantic cache<br/>(embedding similarity)"]
    X -->|"selected backend + policy"| E
    E -->|"math / code"| M1["Reasoning model (vLLM)"]
    E -->|"general / easy"| M2["Small model (vLLM)"]
    E -->|"private"| M3["Local model (vLLM)"]
```
What it is¶
The vLLM Semantic Router is a request dispatcher, not an inference engine: it decides which backend generates tokens and whether that backend should reason, then hands the request back to Envoy for proxying. It runs as an Envoy External Processor (ExtProc), a gRPC service that Envoy consults for each request. The ExtProc and orchestration path is implemented in Go and the model path in Rust, which uses the Hugging Face Candle library to run BERT-family classifiers with low overhead (ONNX and OpenVINO backends are also present). It is a subproject of the vLLM project; as of v0.3.0 (June 2026) it publishes its trained classifiers on the Hugging Face llm-semantic-router org and receives AMD-sponsored ROCm support.
The router bundles several small components, mostly fine-tuned encoder classifiers:
- Category classifier: maps a prompt to a domain (MMLU-Pro-style categories: mathematics, physics, computer science, biology, chemistry, and more) and then to the best-scoring model for that domain.
- Reasoning gate ("when to reason"): decides whether a prompt benefits from extended chain-of-thought, toggling reasoning on the chosen model instead of always paying for it.
- Intent classifier: categorises requests for tool/action selection (retrieval, transformation, calculation, scheduling).
- Jailbreak detector: a binary classifier that blocks attempts to circumvent safety measures.
- PII detector: token classification that flags names, emails, phone numbers, SSNs, and similar entities for privacy-policy enforcement.
- Semantic cache: an embedding-similarity cache that returns a stored answer for near-duplicate prompts, skipping generation entirely.
On the llm-semantic-router org these ship as compact (~0.3B) mmBERT-based checkpoints (e.g. intent, jailbreak, fact-check, and modality-router classifiers) plus matching embedding models, including domain-specialised finance and medical variants.
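The semantic cache item in the list above can be illustrated with a minimal sketch. It is not the project's implementation: the `SemanticCache` class, its method names, and the hand-written two-dimensional vectors are all hypothetical, and a real deployment would obtain embeddings from an embedding model and use an indexed store rather than a linear scan.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class SemanticCache:
    """Toy embedding-similarity cache (hypothetical, for illustration only)."""

    def __init__(self, threshold):
        self.threshold = threshold
        self.entries = []  # list of (embedding, answer) pairs

    def put(self, embedding, answer):
        self.entries.append((embedding, answer))

    def get(self, embedding):
        """Return the closest stored answer, or None below the threshold."""
        best_score, best_answer = 0.0, None
        for stored, answer in self.entries:
            score = cosine(embedding, stored)
            if score > best_score:
                best_score, best_answer = score, answer
        return best_answer if best_score >= self.threshold else None


cache = SemanticCache(threshold=0.92)
cache.put([1.0, 0.0], "stored answer")
near_duplicate = cache.get([0.99, 0.05])  # nearly the same direction: hit
different = cache.get([0.6, 0.8])         # cosine 0.6: miss, falls through to a model
```

A miss simply falls through to normal routing, so the threshold trades hit rate against the risk of serving one prompt's answer for another.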
Why use it¶
- Token economics. Routing easy traffic to small models and gating reasoning cut token spend and latency where they would otherwise be wasted. The reasoning gate is backed by When to Reason: Semantic Router for vLLM (Wang et al., NeurIPS 2025 ML for Systems), which reports +10.2 points accuracy, −47.1% latency, and −48.5% token usage on MMLU-Pro relative to direct inference with vLLM.
- Safety at the choke point. Jailbreak and PII classifiers run in front of every backend, so the policy is uniform and auditable rather than reimplemented per model (prompt-injection defense, security and multi-tenancy).
- Full mesh across boundaries. One layer coordinates local, private, and frontier models across cost, privacy, and capability boundaries: the Mixture-of-Models fleet pattern.
- Low overhead. Running the encoders in Rust/Candle keeps classification fast enough that routing does not meaningfully move TTFT, unlike a Python-hosted classifier on the hot path.
When to use it¶
Reach for it when you already run vLLM (or an OpenAI-compatible pool) with several models and want a production, config-driven router with built-in safety and caching, integrated at the Envoy/gateway layer. It fits naturally with the vLLM production stack and serving open-weight models.
Defer it when a single model meets your quality/cost target, when you have no Envoy/gateway layer and do not want one, or when traffic is uniform and latency-critical with no cheaper backend to route to. See the "when not to" guidance in LLM request routing.
How to deploy and integrate¶
Request path¶
Client → Envoy proxy → (gRPC) Semantic Router ExtProc → classify + policy → back to Envoy → selected backend model. The router intercepts each request, runs the classification pipeline, optionally serves from the semantic cache, applies safety policy (block/redact), selects a backend cluster, and may inject a domain-specific system prompt. Envoy then proxies to the chosen vLLM endpoint. Because the router only selects the endpoint, each backend keeps its own continuous batching and KV-cache management.
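The request path above can be sketched as a single decision function. This is a conceptual sketch, not the router's Go/Rust code: `Decision`, `route`, the callable parameters, and the step ordering (safety check, then cache, then classification) are assumptions made for illustration, and the lambdas in the usage are keyword stand-ins for the real encoder classifiers.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Decision:
    action: str                          # "block", "cache_hit", or "route"
    backend: Optional[str] = None        # selected backend cluster, if routed
    use_reasoning: bool = False          # whether to enable reasoning mode
    system_prompt: Optional[str] = None  # optional domain-specific prompt
    cached_answer: Optional[str] = None  # answer served from the cache


def route(prompt, *, is_jailbreak, cache_lookup, classify, backends, system_prompts):
    """Decide what the proxy should do with one request (hypothetical sketch)."""
    if is_jailbreak(prompt):             # safety policy runs before any backend
        return Decision(action="block")
    cached = cache_lookup(prompt)        # near-duplicate prompt: skip generation
    if cached is not None:
        return Decision(action="cache_hit", cached_answer=cached)
    category = classify(prompt)          # domain of the prompt
    backend, use_reasoning = backends.get(category, backends["general"])
    return Decision("route", backend, use_reasoning, system_prompts.get(category))


# Toy stand-ins; backend names are illustrative.
backends = {"math": ("reasoning-model", True), "general": ("general-8b", False)}
decision = route(
    "integrate x^2 from 0 to 1",
    is_jailbreak=lambda p: "ignore previous instructions" in p.lower(),
    cache_lookup=lambda p: None,
    classify=lambda p: "math" if "integrate" in p else "general",
    backends=backends,
    system_prompts={"math": "You are a careful math tutor."},
)
```

The function returns only a decision; proxying stays with Envoy, which is why each backend keeps its own batching and KV-cache management.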
Kubernetes deployment¶
The project ships a Helm chart and integrates with Envoy Gateway. A typical install layers:
- A backend pool, e.g. the vLLM production stack Helm chart serving one or more models with replicas.
- The Semantic Router chart: oci://ghcr.io/vllm-project/charts/semantic-router (pin the chart version; do not track latest).
- Envoy Gateway plus the Envoy AI Gateway CRDs/controller for route management, wired to the ExtProc.
- Backend and route configuration applied as manifests.
The production-stack integration is a preview; confirm the current chart version, CRDs, and steps against the official integration guide before deploying.
Configuration (reference template, illustrative, unexecuted)¶
Routing policy is data, not code: categories map to candidate models with per-category scores, plus a reasoning toggle, safety toggles, and a semantic-cache section. The exact schema and key names live in the official config reference and move between releases. Treat the sketch below as conceptual shape only, not a validated file, and copy the real keys from the docs for your pinned version.
```yaml
# Conceptual shape of the router config; verify exact keys against the
# official docs for your pinned chart version. Not an executable file.
default_model: general-8b          # fallback backend
categories:
  - name: math
    use_reasoning: true            # gate chain-of-thought on this class
    model_scores:                  # candidate backends, best first
      - { model: reasoning-model, score: 0.95 }
      - { model: general-8b, score: 0.60 }
  - name: general
    use_reasoning: false
    model_scores:
      - { model: general-8b, score: 0.90 }
prompt_guard: { enabled: true }    # jailbreak detection
pii: { enabled: true }             # PII detection / redaction policy
semantic_cache: { enabled: true, similarity_threshold: 0.92 }
```
Map every model value to a real Envoy backend cluster or vLLM endpoint. Set similarity_threshold conservatively and validate it: a loose cache returns one prompt's stored answer for a different prompt.
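To make the "policy is data" idea concrete, here is a minimal sketch of how a category could be resolved to a backend from per-category scores. The `CONFIG` dictionary mirrors the conceptual shape above and inherits its caveat (the key names are not a validated schema); `select_model` is a hypothetical helper, not a function from the project.

```python
CONFIG = {
    "default_model": "general-8b",
    "categories": [
        {
            "name": "math",
            "use_reasoning": True,
            "model_scores": [
                {"model": "reasoning-model", "score": 0.95},
                {"model": "general-8b", "score": 0.60},
            ],
        },
        {
            "name": "general",
            "use_reasoning": False,
            "model_scores": [{"model": "general-8b", "score": 0.90}],
        },
    ],
}


def select_model(config, category):
    """Return (model, use_reasoning) for a classified category.

    Picks the highest-scoring candidate; an unknown category falls back
    to the default model with reasoning off.
    """
    for cat in config["categories"]:
        if cat["name"] == category:
            best = max(cat["model_scores"], key=lambda m: m["score"])
            return best["model"], cat["use_reasoning"]
    return config["default_model"], False


math_choice = select_model(CONFIG, "math")       # ("reasoning-model", True)
unknown_choice = select_model(CONFIG, "poetry")  # ("general-8b", False)
```

Changing which backend serves a domain then means editing a score, not redeploying code.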
How to scale and operate¶
- Overhead budget. Confirm the classifier latency on your gateway hardware and keep it a small fraction of TTFT; the Rust/Candle path is chosen precisely to stay there. Measure end-to-end with and without the router.
- Replicas and cache. Run the ExtProc with enough replicas to cover request QPS; the semantic cache is a large win on skewed traffic but must be sized and its threshold validated.
- LoRA classifiers. The project offers LoRA variants of the classifiers (reported ~99.8% fewer trainable parameters, ~98% smaller), which reduce retraining cost and footprint. They are useful when you fine-tune categories on your own traffic.
- Retrain on drift. Re-fit the category/reasoning classifiers when your traffic mix or backend pool changes; evaluate on the cost-quality frontier from LLM request routing, not a single accuracy number.
- Observe the decisions. Log routed category, chosen model, cache hits, and safety blocks so the policy is auditable and the frontier re-derivable (observability and monitoring).
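As one way to make routing decisions auditable, the sketch below emits a single JSON line per request. The field names and the `decision_record` helper are assumptions for illustration; they are not the router's actual log or metrics schema, which should be taken from the official docs.

```python
import json
import time


def decision_record(request_id, category, confidence, model,
                    cache_hit=False, blocked_by=None):
    """Serialise one routing decision as a JSON line (hypothetical schema)."""
    return json.dumps(
        {
            "ts": round(time.time(), 3),  # wall-clock timestamp in seconds
            "request_id": request_id,
            "category": category,         # routed category
            "confidence": confidence,     # classifier confidence for that category
            "model": model,               # chosen backend
            "cache_hit": cache_hit,       # served from the semantic cache?
            "blocked_by": blocked_by,     # e.g. "jailbreak" or "pii", else None
        },
        sort_keys=True,
    )


print(decision_record("req-1", "math", 0.97, "reasoning-model"))
```

Keeping the classifier confidence next to the chosen model is what lets you later re-derive the cost-quality frontier from logs instead of re-running traffic.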
Safety and privacy gating¶
The jailbreak and PII classifiers make the router a policy enforcement point, not just a load optimiser: block or redact before a prompt reaches any backend, and pin regulated prompts to local models. This is the uniform-gating argument from prompt-injection defense: one gate in front of the mesh beats N per-model gates. Fail safe: a low-confidence route must not fall open to a frontier endpoint on a privacy-tier request.
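The fail-safe rule can be sketched as a small fallback function. Everything here is hypothetical: the backend names, the `LOCAL_BACKENDS` set, the `min_confidence` default, and `choose_backend` itself are illustrative and do not reflect the project's configuration keys.

```python
LOCAL_BACKENDS = {"local-model"}  # backends allowed to see regulated prompts


def choose_backend(candidate, confidence, privacy_tier, *,
                   min_confidence=0.7,
                   local_fallback="local-model",
                   default="general-8b"):
    """Pick a backend so that privacy-tier requests never fall open.

    candidate    -- backend proposed by the category classifier
    confidence   -- classifier confidence in that proposal
    privacy_tier -- True when the request carries regulated data
    """
    if privacy_tier:
        # Privacy overrides routing quality: only local backends are eligible,
        # regardless of how confident the classifier is.
        return candidate if candidate in LOCAL_BACKENDS else local_fallback
    if confidence >= min_confidence:
        return candidate
    return default  # low-confidence, non-private traffic uses the default


private_route = choose_backend("frontier-api", 0.95, privacy_tier=True)
uncertain_route = choose_backend("frontier-api", 0.40, privacy_tier=False)
```

The privacy check comes first on purpose: a confident classifier must not be able to route a regulated prompt off the local tier.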
Failure modes¶
- Preview churn. Chart versions, CRDs, and config keys move between releases; an install pinned to a stale guide breaks. Track the official docs and pin versions.
- Router overhead on the hot path. If a classifier is mis-deployed to a slow (e.g. Python) backend, it adds latency instead of saving it. Verify the Rust/Candle path and measure TTFT.
- Mis-route to a weak model. A low-quality category classifier silently sends hard prompts to a small model. Gate with confidence thresholds and evaluate the frontier (LLM request routing).
- Loose semantic cache. Too low a similarity threshold serves a near-duplicate's answer for a materially different prompt.
- Fail-open privacy. A fallback that ignores the privacy axis leaks regulated prompts to frontier APIs.
References¶
- vLLM Semantic Router (GitHub): https://github.com/vllm-project/semantic-router
- Project site and docs: https://vllm-semantic-router.com/
- llm-semantic-router models and datasets (Hugging Face): https://huggingface.co/llm-semantic-router
- Training overview (official docs): https://vllm-semantic-router.com/docs/training/training-overview/
- Production-stack integration (vLLM docs): https://docs.vllm.ai/projects/production-stack/en/latest/use_cases/semantic-router-integration.html
- When to Reason: Semantic Router for vLLM (Wang et al.): https://arxiv.org/abs/2510.08731
- Intelligent inference request routing (Red Hat Emerging Technologies): https://next.redhat.com/2025/11/11/intelligent-inference-request-routing-for-large-language-models/
Related: LLM request routing · MoE routing and load balancing · Serving open-weight models · Inference serving · QoS and admission control · Prompt-injection defense · Security and multi-tenancy · Observability and monitoring · Glossary