Commit 947693a

project: add blog section (#70)
1 parent 3497d07 commit 947693a

File tree: 4 files changed, +154 −2 lines

README.md

Lines changed: 1 addition & 1 deletion
@@ -7,7 +7,7 @@
 [![License](https://img.shields.io/badge/license-Apache%202.0-blue.svg)](LICENSE)
 [![Crates.io](https://img.shields.io/crates/v/candle-semantic-router.svg)](https://crates.io/crates/candle-semantic-router)

-**📚 [Complete Documentation](https://vllm-semantic-router.com) | 🚀 [Quick Start](https://vllm-semantic-router.com/docs/getting-started/installation) | 🏗️ [Architecture](https://vllm-semantic-router.com/docs/architecture/system-architecture/) | 📖 [API Reference](https://vllm-semantic-router.com/docs/api/router/)**
+**📚 [Complete Documentation](https://vllm-semantic-router.com) | 🚀 [Quick Start](https://vllm-semantic-router.com/docs/getting-started/installation) | 📣 [Blog](https://vllm-semantic-router.com/blog/) | 📖 [API Reference](https://vllm-semantic-router.com/docs/api/router/)**

 ![](./website/static/img/code.png)

website/blog/2025-09-06-welcome.md

Lines changed: 113 additions & 0 deletions
@@ -0,0 +1,113 @@
---
slug: welcome
title: "vLLM Semantic Router: Next Phase in LLM Inference"
authors: [rootfs, wangchen615, yuezhu1, Xunzhuo]
tags: [welcome, announcement, vllm, semantic-router]
---

![](/img/code.png)

<!-- truncate -->

## Industry Status: More Inference ≠ Better Inference

Over the past year, **hybrid reasoning and automatic routing** have become central to discussions in the large-model ecosystem. The focus is shifting from raw parameter counts to **efficiency, selectivity**, and **per-token value**.

Take **GPT-5** as an example. Its most notable breakthrough isn't sheer size but the introduction of **automatic routing and thinking quotas**:

* **Light queries → lightweight models**: Simple prompts like "Why is the sky blue?" don't need costly reasoning-heavy inference.
* **Complex, high-value queries → advanced models**: Legal analysis, financial simulations, and multi-step reasoning tasks are routed to models with Chain-of-Thought capabilities.

This shift reflects a new principle of **per-token unit economics**: every token generated must deliver value rather than being treated as sunk computational cost.

Use cases:

* Free-tier users are served by lightweight models, keeping costs sustainable.
* Queries with clear commercial intent (e.g., booking flights or finding legal services) are escalated to high-compute models or directly integrated agent services.

In these cases, providers such as OpenAI can monetize not only inference but also downstream transactions, taking a commission on completed bookings and turning free usage from a cost center into a revenue engine.

Other industry leaders are adopting similar strategies:

* **Anthropic Claude 3.7/4**: blends "fast thinking" and "slow thinking," with user-controlled toggles.
* **Google Gemini 2.5**: introduces a *thinking budget*, giving enterprises fine-grained control over inference costs.
* **Alibaba Qwen3**: explores instruction-based switching between reasoning and non-reasoning modes.
* **DeepSeek v3.1**: pioneers a dual-mode single-model design, merging conversational and reasoning flows.

In short, the industry is entering an era where **no token is wasted** and **routing intelligence** defines the frontier of model innovation.

## Recent Research: vLLM Semantic Router

As the industry moves toward hybrid reasoning and intelligent routing, this project zeroes in on the **open-source inference engine vLLM**.

vLLM has quickly become the **de facto standard** for serving large models at scale. Yet it still lacks **semantic-level control**: the ability to decide when and how to apply reasoning based on the actual meaning of a query, not just its type. Without this capability, developers face an all-or-nothing trade-off:

* Enable reasoning everywhere → higher accuracy, but wasted computation and inflated costs.
* Disable reasoning entirely → lower cost, but sharply reduced accuracy on reasoning-heavy tasks.

To close this gap, we introduce the **vLLM Semantic Router**, an intent-aware, fine-grained routing layer that brings **GPT-5-style "smart routing"** to the open-source ecosystem.

By classifying queries at the semantic level and selectively enabling reasoning, the vLLM Semantic Router delivers **higher accuracy where it matters** and **significant cost savings where it doesn't**, a step toward the principle that no token should be wasted.
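
To make selective reasoning concrete, here is a minimal sketch of per-request routing against an OpenAI-compatible vLLM endpoint. It is illustrative only: `needs_reasoning` is a hypothetical stand-in for the semantic classifier, and the `enable_thinking` switch assumes a Qwen3-style chat template, which vLLM can receive through the `chat_template_kwargs` request field.

```rust
// Sketch only. Assumes the `reqwest` crate (with "blocking" and "json"
// features) and `serde_json`, plus a local vLLM server on port 8000.
use serde_json::json;

// Hypothetical placeholder for the semantic classifier; the router
// described in this post uses a fine-tuned intent model, not a heuristic.
fn needs_reasoning(query: &str) -> bool {
    query.split_whitespace().count() > 20
}

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let query = "Why is the sky blue?";
    let body = json!({
        "model": "Qwen/Qwen3-8B",
        "messages": [{"role": "user", "content": query}],
        // Qwen3-style templates expose an `enable_thinking` switch that
        // can be forwarded per request via `chat_template_kwargs`.
        "chat_template_kwargs": {"enable_thinking": needs_reasoning(query)}
    });
    let resp = reqwest::blocking::Client::new()
        .post("http://localhost:8000/v1/chat/completions")
        .json(&body)
        .send()?;
    println!("{}", resp.text()?);
    Ok(())
}
```

Routed this way, a simple prompt skips the reasoning trace entirely, paying neither its latency nor its token cost.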

![](/img/architecture.png)

### Architecture Design

The **vLLM Semantic Router** is built to combine fine-grained semantic awareness with production-grade performance. Its design rests on four key components (a routing sketch follows the list):

1. **Semantic Classification**: A fine-tuned **ModernBERT** intent classifier determines whether each query requires advanced reasoning or can be handled by lightweight inference.
2. **Smart Routing**:
   * Simple queries → fast inference mode, with minimal latency and cost for straightforward requests.
   * Complex queries → Chain-of-Thought inference, ensuring accuracy on tasks that demand multi-step reasoning.
3. **High-Performance Engine**: Implemented in **Rust** on the **Hugging Face Candle** framework, the engine achieves high concurrency and **zero-copy efficiency**, making it well suited for large-scale serving.
4. **Cloud-Native Integration**: Integrates seamlessly with **Kubernetes** and **API gateways** through the **Envoy** `ext_proc` plugin.
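
To ground the list above, here is a minimal, self-contained sketch of the classify-then-route flow. Every name in it is hypothetical: `reasoning_score` is a keyword placeholder standing in for the fine-tuned ModernBERT classifier, which in the real engine runs as a Candle forward pass.

```rust
// Illustrative sketch only; types, names, and thresholds are hypothetical.
enum Route {
    Fast,           // lightweight inference: minimal latency and cost
    ChainOfThought, // reasoning-enabled inference for complex queries
}

struct SemanticRouter {
    threshold: f32,
}

impl SemanticRouter {
    // Placeholder for the intent classifier: returns the probability that
    // a query needs multi-step reasoning. The real implementation would
    // run a fine-tuned ModernBERT forward pass via Candle.
    fn reasoning_score(&self, query: &str) -> f32 {
        let cues = ["prove", "analyze", "step by step", "simulate"];
        let q = query.to_lowercase();
        if cues.iter().any(|c| q.contains(c)) { 0.9 } else { 0.1 }
    }

    fn route(&self, query: &str) -> Route {
        if self.reasoning_score(query) >= self.threshold {
            Route::ChainOfThought
        } else {
            Route::Fast
        }
    }
}

fn main() {
    let router = SemanticRouter { threshold: 0.5 };
    for q in [
        "Why is the sky blue?",
        "Analyze the tax implications of a cross-border merger.",
    ] {
        match router.route(q) {
            Route::Fast => println!("fast path: {q}"),
            Route::ChainOfThought => println!("reasoning path: {q}"),
        }
    }
}
```

In the deployed system this decision is made in the request path through the Envoy `ext_proc` integration, so the verdict is applied before a request ever reaches a model backend.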

Experimental results show:

* **Accuracy**: +10.2%
* **Latency**: −47.1%
* **Token Consumption**: −48.5%

In knowledge-intensive areas such as business and economics, accuracy improvements can exceed **20%**.

## Project Background

The **vLLM Semantic Router** is not the isolated result of a single paper but the collaborative outcome of sustained community contributions:

* Originally proposed by **[Dr. Huamin Chen](https://www.linkedin.com/in/huaminchen)**, Distinguished Engineer at **Red Hat**, in early **2025** across multiple open-source communities.
* Iterated on and further developed by **[Xunzhuo Liu](https://www.linkedin.com/in/bitliu)** at **Tencent**, and later contributed to the vLLM community.
* **[Dr. Chen Wang](https://www.linkedin.com/in/chenw615)** from **IBM Research** and **Dr. Huamin Chen** will present the project at **[KubeCon North America 2025](https://kccncna2025.sched.com/event/27FaI/intelligent-llm-routing-a-new-paradigm-for-multi-model-ai-orchestration-in-kubernetes-chen-wang-ibm-research-huamin-chen-red-hat?iframe=no&w=100%&sidebar=yes&bg=no)**.

The mission is clear: to serve as an **inference accelerator** for open-source large models that can:

* Preserve accuracy while minimizing unnecessary token usage.
* Switch seamlessly between "fast" and "slow" thinking modes without fully enabling or disabling reasoning.
* Deliver production-ready enterprise integration through native Kubernetes and Envoy support.

The vLLM Semantic Router is therefore not just a research milestone but an **essential bridge for open-source AI infrastructure**, translating **academic innovation into industrial application**.

You can start exploring the project on [GitHub](https://github.com/vllm-project/semantic-router). We're currently working on the [v0.1 Roadmap](https://github.com/vllm-project/semantic-router/issues/14) and have established a [Work Group](https://github.com/vllm-project/semantic-router/issues/15). We welcome your thoughts and invite you to join us!

## Future Trends: Cost-Effective, Just-in-Time Inference

The central industry question has shifted from *"Can we perform inference?"* to *"When and how should inference be performed?"*

* **GPT-5**: exemplifies this shift, using **automatic routing** and **thinking quotas** to align computation with commercial value and enable monetization.
* **vLLM Semantic Router**: extends this paradigm to the open-source **vLLM engine**, enabling **semantics-aware**, **low-latency**, and **energy-efficient** inference routing.

Looking ahead, competitive differentiation will hinge less on sheer **model scale** and more on:

* **Performing inference at the right moment with the lowest cost.**
* **Switching seamlessly between fast and slow reasoning modes.**
* **Preserving user experience without wasting compute.**

The next frontier is **intelligent, self-adjusting inference systems**: engines that autonomously decide when to think deeply and when to respond directly, without user toggles or rigid rules. This shift marks a new era in which inference becomes not just powerful but **context-aware, adaptive, and economically sustainable**.

## One-Sentence Summary

* **GPT-5**: business-driven routing → broad intelligence.
* **vLLM Semantic Router**: efficiency-driven routing → sustainable AI.
* **Future edge**: performing the right inference at the right time, with minimal computation.

website/blog/authors.yml

Lines changed: 23 additions & 0 deletions
@@ -0,0 +1,23 @@
rootfs:
  name: Huamin Chen
  title: Distinguished Engineer @ Red Hat
  url: https://github.com/rootfs
  image_url: /img/team/huamin.png

wangchen615:
  name: Chen Wang
  title: Senior Staff Research Scientist @ IBM
  url: https://github.com/wangchen615
  image_url: /img/team/chen.png

yuezhu1:
  name: Yue Zhu
  title: Staff Research Scientist @ IBM
  url: https://github.com/yuezhu1
  image_url: /img/team/yue.png

Xunzhuo:
  name: Xunzhuo Liu
  title: Software Engineer @ Tencent
  url: https://github.com/Xunzhuo
  image_url: /img/team/xunzhuo.png

website/docusaurus.config.js

Lines changed: 17 additions & 1 deletion
@@ -50,7 +50,18 @@ const config = {
         editUrl:
           'https://github.com/vllm-project/semantic-router/tree/main/docs/',
       },
-      blog: false, // Disable blog
+      blog: {
+        showReadingTime: true,
+        postsPerPage: 10,
+        blogTitle: 'vLLM Semantic Router Blog',
+        blogDescription: 'Latest updates, insights, and technical articles about vLLM Semantic Router',
+        blogSidebarTitle: 'Recent Posts',
+        blogSidebarCount: 10,
+        // Please change this to your repo.
+        // Remove this to remove the "edit this page" links.
+        editUrl:
+          'https://github.com/vllm-project/semantic-router/tree/main/website/blog/',
+      },
       theme: {
         customCss: require.resolve('./src/css/custom.css'),
       },

@@ -77,6 +88,11 @@ const config = {
             position: 'left',
             label: 'Documentation',
           },
+          {
+            to: '/blog',
+            label: 'Blog',
+            position: 'left',
+          },
           {
             type: 'dropdown',
             label: 'Community',
