Commit fe34eec

docs: sync blog from official (#142)
Signed-off-by: bitliu <[email protected]>
1 parent 02382e1 commit fe34eec

website/blog/2025-09-06-welcome.md

Lines changed: 66 additions & 60 deletions
@@ -7,107 +7,113 @@ tags: [welcome, announcement, vllm, semantic-router]

![code](/img/code.png)

Synced from the official vLLM Blog: [vLLM Semantic Router: Next Phase in LLM inference](https://blog.vllm.ai/2025/09/11/semantic-router.html)

<!-- truncate -->
## Industry Status: Inference ≠ More Is Better
Over the past year, hybrid reasoning and automatic routing have increasingly defined progress in large-model infrastructure, shifting the debate from raw scale to per-token efficiency, latency control, and targeted compute use.

Take GPT-5 as an example. Its standout innovation lies not in sheer parameter count but in routing policies and quota-based reasoning (a minimal sketch in code follows the list):

- Light queries → lightweight paths: trivial prompts like “Why is the sky blue?” don’t trigger expensive reasoning.
- Complex, high-value queries → reasoning-enabled models: multi-step tasks such as legal analysis or financial planning are routed to Chain-of-Thought-enabled inference.
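
To make the two-tier idea concrete, here is a minimal, hypothetical routing sketch in Python. The complexity heuristic, threshold, and model names are invented for illustration; GPT-5's actual routing policy is not public.

```python
REASONING_MODEL = "reasoning-xl"  # assumed name for a CoT-enabled model
LIGHT_MODEL = "light-7b"          # assumed name for a cheap fast-path model

def estimate_complexity(prompt: str) -> float:
    """Toy stand-in for a learned intent classifier; returns a 0..1 score."""
    signals = ("prove", "analyze", "step by step", "legal", "portfolio")
    hits = sum(s in prompt.lower() for s in signals)
    return min(1.0, 0.3 * hits + 0.002 * len(prompt.split()))

def route(prompt: str, threshold: float = 0.5) -> str:
    """Send complex prompts to the reasoning model, the rest to the fast path."""
    return REASONING_MODEL if estimate_complexity(prompt) >= threshold else LIGHT_MODEL

print(route("Why is the sky blue?"))                                       # -> light-7b
print(route("Analyze this merger agreement step by step for legal risk"))  # -> reasoning-xl
```

A production router would replace the keyword heuristic with a trained classifier, but the dispatch shape stays the same.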

This represents a broader principle of task-aware compute allocation, where every inference token must contribute meaningful value, not just be consumed.
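
As a back-of-the-envelope illustration of those unit economics, the sketch below compares forcing every query through a reasoning model against routing. Every number (prices, token counts, traffic mix) is an invented assumption, not a measured figure.

```python
# Toy unit-economics comparison; all numbers are assumptions.
queries = 1_000_000
share_complex = 0.2                                 # assumed fraction needing deep reasoning
tokens = {"light": 150, "reasoning": 1200}          # assumed avg output tokens per query
cost_per_tok = {"light": 2e-7, "reasoning": 2e-6}   # assumed $/token

def total_cost(route_all_to=None) -> float:
    """Cost of serving all queries, either forced to one model or routed by complexity."""
    if route_all_to:
        return queries * tokens[route_all_to] * cost_per_tok[route_all_to]
    complex_q = queries * share_complex
    simple_q = queries - complex_q
    return (simple_q * tokens["light"] * cost_per_tok["light"]
            + complex_q * tokens["reasoning"] * cost_per_tok["reasoning"])

print(f"all reasoning: ${total_cost('reasoning'):,.0f}")  # 1M * 1200 * 2e-6 = $2,400
print(f"routed:        ${total_cost():,.0f}")             # 24 + 480 = $504
```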

Similar ideas are appearing in other systems:

- Anthropic Claude 3.7/4: differentiates “fast thinking” and “slow thinking” pathways.
- Google Gemini 2.5: offers explicit *thinking budgets*, allowing enterprises to cap reasoning depth.
- Alibaba Qwen3: supports instruction-driven switching between reasoning and non-reasoning modes.
- DeepSeek v3.1: merges conversational and reasoning flows within a single dual-mode model.

The trend is clear: future inference systems will be defined by selectivity and intelligence, not just model size.

## Recent Research: vLLM Semantic Router

Responding to this shift, the vLLM Semantic Router offers an open-source, intent-aware routing layer for the highly efficient vLLM inference engine.

vLLM enables scalable LLM serving, but it lacks semantic decision-making around reasoning. Developers face a trade-off:

- Enable reasoning always → accuracy increases, but so does cost.
- Disable reasoning → cost drops, but accuracy suffers on complex tasks.

The Semantic Router fills this gap by classifying queries semantically and routing them appropriately, delivering accurate results where reasoning is needed and efficiency where it is not.
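
As a concrete illustration of selective reasoning, here is a minimal dispatch sketch assuming an OpenAI-compatible endpoint. The model names and the `enable_reasoning` field are hypothetical; the actual per-model switch varies (Qwen3, for instance, toggles reasoning through a chat-template flag).

```python
# Sketch: dispatching a classified query with a per-request reasoning toggle.
# Endpoint, model names, and the "enable_reasoning" field are assumptions.
import requests

ROUTES = {
    "simple":  {"model": "fast-7b",      "enable_reasoning": False},
    "complex": {"model": "reasoner-32b", "enable_reasoning": True},
}

def dispatch(prompt: str, category: str, base_url: str = "http://localhost:8000"):
    """Forward the prompt to the route the semantic classifier chose."""
    route = ROUTES[category]
    resp = requests.post(
        f"{base_url}/v1/chat/completions",
        json={
            "model": route["model"],
            "messages": [{"role": "user", "content": prompt}],
            # Hypothetical per-request switch for the reasoning mode.
            "enable_reasoning": route["enable_reasoning"],
        },
        timeout=60,
    )
    return resp.json()
```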

![architecture](/img/architecture.png)

### Architecture Design

The system comprises four pillars (an illustrative classification sketch follows the list):

1. Semantic Classification: uses ModernBERT, currently a lightweight, standalone classifier integrated into the router, to determine routing paths.
2. Smart Routing:
   - Simple queries → "fast path" inference.
   - Complex queries → Chain-of-Thought reasoning mode.
3. High-Performance Engine: written in Rust on Hugging Face Candle, delivering high concurrency and zero-copy inference.
4. Cloud-Native Integration: works out of the box with Kubernetes and Envoy via the `ext_proc` plugin.
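
The production classifier runs ModernBERT in Rust via Candle; as a rough Python stand-in, the sketch below wires up the same step with Hugging Face `transformers`. The checkpoint shown is the public base model (a fine-tuned intent head would be used in practice), and the label mapping is an assumption.

```python
# Illustrative Python stand-in for the classification pillar.
# The real engine runs ModernBERT in Rust via Hugging Face Candle.
from transformers import pipeline

# Base checkpoint only: the sequence-classification head is freshly
# initialized here, so outputs are meaningless until fine-tuned.
classifier = pipeline("text-classification", model="answerdotai/ModernBERT-base")

def needs_reasoning(prompt: str) -> bool:
    """Map the classifier label to a routing decision (label scheme assumed)."""
    result = classifier(prompt)[0]        # e.g. {"label": "LABEL_1", "score": 0.93}
    return result["label"] == "LABEL_1"   # assume LABEL_1 == "requires reasoning"
```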

In trials, this design yielded:

- ~10% higher accuracy
- ~50% lower latency
- ~50% fewer tokens

In business and economics domains, accuracy gains exceeded 20%.

## Challenges in Execution: Budgets and Tool Calling

Two technical constraints are important to address:

- Reasoning Budget Costs: unlimited reasoning inflates cold-start latency and resource usage. Without dynamic control, simple queries may over-consume tokens while critical queries may not get deep reasoning when needed. SLOs such as TTFT and p95 latency are necessary, ideally with the ability to adapt mid-inference.
- Tool Calling Constraints: adding more tools ("tool catalog bloat") or longer tool outputs can drastically reduce accuracy. The router must pre-filter tools and keep catalogs tight; a sketch of such a pre-filter follows this list.
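
To show what pre-filtering might look like, here is a dependency-free sketch. The tool catalog is invented, and the lexical-overlap score is a stand-in for the embedding similarity a production router would use.

```python
# Sketch of a tool pre-filter to fight catalog bloat (catalog is invented).

TOOLS = {
    "search_flights": "find and book a flight between two cities",
    "get_weather":    "current weather forecast for a city",
    "sql_query":      "run read-only sql against the analytics warehouse",
    "send_email":     "compose and send an email to a contact",
}

def relevance(query: str, description: str) -> float:
    """Crude lexical overlap; swap in embedding cosine similarity in practice."""
    q, d = set(query.lower().split()), set(description.lower().split())
    return len(q & d) / max(1, len(q))

def prefilter_tools(query: str, k: int = 2) -> list[str]:
    """Expose only the top-k most relevant tools to the model."""
    ranked = sorted(TOOLS, key=lambda name: relevance(query, TOOLS[name]), reverse=True)
    return ranked[:k]

print(prefilter_tools("book a flight from SFO to JFK next monday"))
# -> ['search_flights', 'send_email']; only the shortlist enters the prompt
```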

## Project Background

The Semantic Router evolved from contributions across the open-source community:

- Proposed in early 2025 by [Dr. Chen Huamin](https://www.linkedin.com/in/huaminchen) (Red Hat)
- Further developed by [Xunzhuo Liu](https://www.linkedin.com/in/bitliu) (Tencent)
- To be presented by [Dr. Wang Chen](https://www.linkedin.com/in/chenw615) (IBM Research) and Dr. Chen Huamin at [KubeCon North America 2025](https://kccncna2025.sched.com/event/27FaI/intelligent-llm-routing-a-new-paradigm-for-multi-model-ai-orchestration-in-kubernetes-chen-wang-ibm-research-huamin-chen-red-hat?iframe=no&w=100%&sidebar=yes&bg=no)

Our goal: provide inference acceleration for open-source LLMs through:

- Semantic-aware routing
- Efficient model switching
- Enterprise-friendly deployment (Kubernetes & Envoy)

Find the project on [GitHub](https://github.com/vllm-project/semantic-router). The current focus is the [Work Group](https://vllm-semantic-router.com/community/work-groups) and the planned [v0.1 Roadmap](https://vllm-semantic-router.com/roadmap/v0.1).

## Integration & Future Work: Embeddings and Pluggability

Currently, ModernBERT runs inside the router for classification; it is not yet served by vLLM. Future work aims to make the classifier, and potentially other embedding models, pluggable, allowing integration with vLLM-hosted models or external embedding services.

This capability will enhance the semantic cache and enable smoother inference customization.
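
A pluggable design could look roughly like the sketch below. The `Embedder` protocol, class names, and model name are assumptions about one possible shape, not the project's actual API; the second implementation targets vLLM's OpenAI-compatible `/v1/embeddings` route.

```python
# Sketch of a pluggable classifier/embedding interface (design assumed).
from typing import Protocol

class Embedder(Protocol):
    def embed(self, text: str) -> list[float]: ...

class InProcessModernBERT:
    """Today's setup: the classifier/embedder lives inside the router."""
    def embed(self, text: str) -> list[float]:
        raise NotImplementedError("backed by the in-process model")

class VLLMEmbedder:
    """Possible future setup: embeddings served by a vLLM endpoint."""
    def __init__(self, base_url: str = "http://vllm:8000"):
        self.base_url = base_url

    def embed(self, text: str) -> list[float]:
        import requests
        resp = requests.post(
            f"{self.base_url}/v1/embeddings",
            json={"model": "embedding-model", "input": text},  # model name assumed
        )
        return resp.json()["data"][0]["embedding"]

def semantic_cache_key(embedder: Embedder, query: str) -> tuple[float, ...]:
    """Any Embedder implementation can back the semantic cache."""
    return tuple(embedder.embed(query))
```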

## Roadmap: v0.1 Milestone Highlights

The [v0.1 milestone](https://github.com/vllm-project/semantic-router/milestone/1) will expand the project’s technical capabilities:

- Core: ExtProc-based modularity, semantic caching across backends, multi-factor routing logic
- Benchmarking: CLI tools, performance testing suite, reasoning-mode evaluation
- Networking: deeper integration with Envoy, GIE, and llm-d gateways
- Observability & UX: admin dashboards, routing policy visualization, developer quickstarts, and a policy cookbook

## Future Trends: Just-in-Time Inference

The field is maturing from *“Can we run inference?”* to *“How can inference be smarter?”*

- GPT-5 uses commercial value to guide reasoning depth.
- vLLM Semantic Router delivers that capability to open source.

Looking ahead, systems that adapt their inference strategy on the fly, without manual toggles, will lead in efficiency, latency, and sustainability.

## One-Sentence Summary

- GPT-5: enterprise routing for smarter inference
- vLLM Semantic Router: technical-first routing for open-source LLMs
- The future edge: context-aware, minimal-compute inference that works seamlessly
