
Commit 44b3c2e ("update")
Signed-off-by: bitliu <[email protected]>
Parent: d46785e

1 file changed: 64 additions, 60 deletions

_posts/2025-09-01-semantic-router.md
## Industry Status: Inference ≠ More Is Better

Over the past year, hybrid reasoning and automatic routing have increasingly defined progress in large-model infrastructure, shifting the debate from raw scale to per-token efficiency, latency control, and targeted compute use.

Take GPT-5, for example: its standout innovation lies not in sheer parameter count but in routing policies and quota-based reasoning:

- Light queries → lightweight paths: trivial prompts like "Why is the sky blue?" don't trigger expensive reasoning.
- Complex or high-value queries → reasoning-enabled models: multi-step tasks such as legal analysis or financial planning are routed to Chain-of-Thought inference.

This represents a broader principle of task-aware compute allocation, where every inference token must contribute meaningful value rather than simply being consumed.

Similar ideas are appearing in other systems:

- Anthropic Claude 3.7/4: differentiates "fast thinking" and "slow thinking" pathways.
- Google Gemini 2.5: offers explicit *thinking budgets*, allowing enterprises to cap reasoning depth.
- Alibaba Qwen3: supports instruction-driven switching between reasoning and non-reasoning modes (a minimal example follows this list).
- DeepSeek v3.1: merges conversational and reasoning flows within a dual-mode single model.
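
To make the mode switching concrete, here is a hedged sketch of the Qwen3-style toggle through the Hugging Face `transformers` chat template. The checkpoint name and prompt are illustrative; the `enable_thinking` flag follows Qwen3's published usage.

```python
# Sketch: toggling a Qwen3 checkpoint between reasoning and non-reasoning
# modes via the chat template. Model name and prompt are illustrative.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-8B")
messages = [{"role": "user", "content": "Why is the sky blue?"}]

# Non-reasoning mode: the template suppresses the <think> block.
fast_prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True,
    enable_thinking=False,
)

# Reasoning mode: the model may emit a <think>...</think> section first.
slow_prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True,
    enable_thinking=True,
)
```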

The trend is clear: future inference systems will be defined by selectivity and intelligence, not just model size.

## Recent Research: vLLM Semantic Router

Responding to this shift, the vLLM Semantic Router offers an open-source, intent-aware routing layer for the vLLM inference engine.

vLLM enables scalable LLM serving, but it lacks semantic decision-making around reasoning. Developers face a trade-off:

- Always enable reasoning → accuracy increases, but so does cost.
- Disable reasoning → cost drops, but accuracy suffers on complex tasks (the back-of-envelope sketch below quantifies the trade-off).
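
The shape of that trade-off is easy to sketch. All figures below are illustrative assumptions, not measurements from the project:

```python
# Back-of-envelope cost model: always-on reasoning vs. semantic routing.
# All figures are illustrative assumptions, not project benchmarks.
p_complex = 0.2          # assumed share of queries that need deep reasoning
tokens_light = 150       # assumed output tokens on the fast path
tokens_reasoning = 1200  # assumed output tokens with Chain-of-Thought on

always_on = tokens_reasoning  # every query pays the reasoning price
routed = p_complex * tokens_reasoning + (1 - p_complex) * tokens_light

print(f"avg tokens, always-on: {always_on:.0f}")            # 1200
print(f"avg tokens, routed:    {routed:.0f}")               # 360
print(f"token savings:         {1 - routed / always_on:.0%}")  # 70%
```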

The Semantic Router fills this gap by classifying queries semantically and routing them accordingly: accurate results where reasoning is needed, efficiency where it is not.

![vLLM Semantic Router architecture](/assets/figures/semantic-router/architecture.png)

### Architecture Design

The system comprises four pillars:

1. Semantic Classification: uses ModernBERT, currently a lightweight standalone classifier integrated into the router, to determine routing paths.
2. Smart Routing (sketched in Python after this list):
   - Simple queries → fast-path inference.
   - Complex queries → Chain-of-Thought reasoning mode.
3. High-Performance Engine: written in Rust on Hugging Face Candle, delivering high concurrency and zero-copy inference.
4. Cloud-Native Integration: works out of the box with Kubernetes and Envoy via the `ext_proc` plugin.
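
A minimal sketch of the classify-then-dispatch idea. The checkpoint name, label set, and backend addresses are hypothetical; the production router implements this flow in Rust inside Envoy's `ext_proc` request path.

```python
# Sketch of the classify-then-dispatch flow. The checkpoint name, label
# set, and backend addresses are hypothetical; the production router
# implements this logic in Rust inside Envoy's ext_proc request path.
from transformers import pipeline

# Placeholder for the fine-tuned ModernBERT intent classifier.
classifier = pipeline(
    "text-classification",
    model="your-org/modernbert-intent-classifier",  # hypothetical checkpoint
)

REASONING_CATEGORIES = {"math", "law", "economics"}  # assumed label set


def route(query: str) -> dict:
    """Pick a backend and reasoning mode from the predicted category."""
    category = classifier(query)[0]["label"]
    if category in REASONING_CATEGORIES:
        return {"backend": "vllm-reasoning:8000", "reasoning": True}
    return {"backend": "vllm-fast:8000", "reasoning": False}


print(route("Why is the sky blue?"))  # expected: fast path
print(route("Model the tax impact of a cross-border acquisition."))  # reasoning
```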

In trials, this design yielded:

- ~10% higher accuracy
- ~50% lower latency
- ~50% fewer tokens consumed

In knowledge-intensive domains such as business and economics, accuracy gains exceeded 20%.

## Challenges in Execution: Budgets and Tool Calling

Two technical constraints are important to address:

- Reasoning budget costs: unlimited reasoning inflates cold-start latency and resource usage. Without dynamic control, simple queries may over-consume tokens while critical queries may not get deep reasoning when needed. SLOs such as TTFT and p95 latency are necessary, ideally with the ability to adapt mid-inference.
- Tool calling constraints: adding more tools ("tool catalog bloat") or longer tool outputs can drastically reduce accuracy. The router must pre-filter tools and keep catalogs tight (a pre-filtering sketch follows this list).
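
One plausible way to keep catalogs tight is to rank tool descriptions against the query by embedding similarity and expose only the best matches to the model. The embedding model and tool catalog below are placeholders:

```python
# Sketch: shrink the tool catalog per request by ranking tool descriptions
# against the query. Embedding model and catalog are placeholders.
from sentence_transformers import SentenceTransformer, util

embedder = SentenceTransformer("all-MiniLM-L6-v2")

TOOLS = {  # hypothetical catalog
    "flight_search": "Find and book commercial flights between two cities.",
    "unit_convert": "Convert between units of measurement.",
    "legal_lookup": "Retrieve statutes and case-law summaries.",
}


def prefilter_tools(query: str, top_k: int = 2) -> list[str]:
    """Return only the top_k tools most relevant to this query."""
    names = list(TOOLS)
    scores = util.cos_sim(
        embedder.encode(query, convert_to_tensor=True),
        embedder.encode([TOOLS[n] for n in names], convert_to_tensor=True),
    )[0]
    ranked = sorted(zip(names, scores.tolist()), key=lambda x: -x[1])
    return [name for name, _ in ranked[:top_k]]


print(prefilter_tools("Book me a flight from Paris to Tokyo"))
```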

## Project Background

The Semantic Router evolved from contributions across the open-source community:

- Proposed in early 2025 by [Dr. Chen Huamin](https://www.linkedin.com/in/huaminchen) (Red Hat)
- Further developed by [Xunzhuo Liu](https://www.linkedin.com/in/bitliu) (Tencent)
- To be presented by [Dr. Wang Chen](https://www.linkedin.com/in/chenw615) (IBM Research) and Dr. Chen Huamin at [KubeCon North America 2025](https://kccncna2025.sched.com/event/27FaI/intelligent-llm-routing-a-new-paradigm-for-multi-model-ai-orchestration-in-kubernetes-chen-wang-ibm-research-huamin-chen-red-hat?iframe=no&w=100%&sidebar=yes&bg=no)

Our goal: provide inference acceleration for open-source LLMs through:

- Semantic-aware routing
- Efficient model switching
- Enterprise-friendly deployment (Kubernetes & Envoy)

Find the project on [GitHub](https://github.com/vllm-project/semantic-router). Current focus areas are the [Work Group](https://vllm-semantic-router.com/community/work-groups) and the planned [v0.1 Roadmap](https://vllm-semantic-router.com/roadmap/v0.1).

## Integration & Future Work: Embeddings and Pluggability

Currently, ModernBERT runs inside the router for classification; it is not yet served by vLLM. Future work aims to make the classifier, and potentially other embedding models, pluggable, allowing integration with vLLM-hosted models or external embedding services.

This capability will also enhance the semantic cache and enable smoother inference customization. A sketch of what such a pluggable interface could look like follows.
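
The sketch below is hypothetical, not the project's actual API: any backend satisfying a small interface could be swapped in, whether in-process, vLLM-hosted, or external.

```python
# Hypothetical pluggability sketch; not the project's actual API.
# Any backend satisfying the Protocol can be swapped in.
from typing import Protocol


class IntentClassifier(Protocol):
    def classify(self, query: str) -> str:
        """Return a category label for the query."""
        ...


class InProcessModernBERT:
    """Stands in for today's classifier embedded in the router."""

    def classify(self, query: str) -> str:
        # Toy stand-in rule; the real model is a fine-tuned ModernBERT.
        return "math" if any(ch.isdigit() for ch in query) else "chat"


class RemoteEmbeddingClassifier:
    """Stands in for a future vLLM-hosted or external embedding service."""

    def __init__(self, endpoint: str) -> None:
        self.endpoint = endpoint  # e.g. an OpenAI-compatible /v1/embeddings URL

    def classify(self, query: str) -> str:
        raise NotImplementedError("call self.endpoint, map embedding to a label")


def route_with(classifier: IntentClassifier, query: str) -> str:
    return classifier.classify(query)


print(route_with(InProcessModernBERT(), "What is 17 * 24?"))  # "math"
```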

## Roadmap: v0.1 Milestone Highlights

The [v0.1 milestone](https://github.com/vllm-project/semantic-router/milestone/1) will expand the project's technical capabilities:

- Core: ExtProc-based modularity, semantic caching across backends (sketched below), multi-factor routing logic
- Benchmarking: CLI tools, a performance testing suite, reasoning-mode evaluation
- Networking: deeper integration with Envoy, GIE, and llm-d gateways
- Observability & UX: admin dashboards, routing policy visualization, developer quickstarts, and a policy cookbook
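
Semantic caching generally means reusing a response when a new query lands close enough in embedding space to an earlier one. A minimal sketch; the embedding model and similarity threshold are assumptions:

```python
# Minimal semantic-cache sketch: reuse an answer when a new query is
# close enough to a cached one. Model and threshold are assumptions.
from sentence_transformers import SentenceTransformer, util

embedder = SentenceTransformer("all-MiniLM-L6-v2")


class SemanticCache:
    def __init__(self, threshold: float = 0.8) -> None:
        self.threshold = threshold
        self.entries = []  # list of (embedding, response) pairs

    def get(self, query: str):
        q = embedder.encode(query, convert_to_tensor=True)
        for emb, response in self.entries:
            if util.cos_sim(q, emb).item() >= self.threshold:
                return response  # hit: skip inference entirely
        return None

    def put(self, query: str, response: str) -> None:
        self.entries.append(
            (embedder.encode(query, convert_to_tensor=True), response)
        )


cache = SemanticCache()
cache.put("Why is the sky blue?", "Rayleigh scattering of sunlight.")
print(cache.get("What makes the sky look blue?"))  # likely a hit
```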

## Future Trends: Just-in-Time Inference

The field is maturing from *"Can we run inference?"* to *"How can inference be smarter?"*

- GPT-5 uses commercial value to guide reasoning depth.
- vLLM Semantic Router brings that capability to open source.

Looking ahead, systems that adapt their inference strategy on the fly, without manual toggles, will lead in efficiency, latency, and sustainability.

## One-Sentence Summary

- GPT-5: business-driven routing → broad intelligence
- vLLM Semantic Router: efficiency-driven routing → sustainable, open-source AI
- The future edge: the right inference at the right time, with minimal compute
