Commit 1e18c1f

committed
docs: update
Signed-off-by: bitliu <[email protected]>
1 parent e2c27df commit 1e18c1f

File tree

1 file changed (+4 −4 lines)


_posts/2025-09-01-semantic-router.md

Lines changed: 4 additions & 4 deletions
@@ -9,7 +9,7 @@ image: /assets/logos/vllm-logo-text-light.png
 
 ## **Industry Status: Inference ≠ The More, The Better**
 
-Over the past year, **hybrid inference / automatic routing** has become one of the hottest topics in the large model industry.
+Over the past year, **Hybrid inference / automatic routing** has become one of the hottest topics in the large model industry.
 
 Take **GPT-5** as an example. Its real breakthrough isn't in the number of parameters, but in the **"automatic routing + thinking quota"**:
 
@@ -39,7 +39,7 @@ In summary: The industry is entering a new era where **"not a single token shoul
 
 ## **Recent Research: vLLM Semantic Router**
 
-Amid the industry's push for "hybrid inference," we focus on the **open-source inference engine vLLM**.
+Amid the industry's push for "Hybrid inference," we focus on the **open-source inference engine vLLM**.
 
 vLLM has become the de facto standard for deploying large models in the industry. However, it lacks "semantic-level fine control." Developers either enable full inference (wasting computation) or disable inference entirely (losing accuracy).
 

@@ -54,7 +54,7 @@ Thus, we propose the **vLLM Semantic Router**, bringing GPT-5's "smart routing"
 2. **Smart Routing**:
 
 * Simple queries → Directly call the non-inference mode for fast responses.
-
+
 * Complex inference queries → Enable Chain-of-Thought to ensure accuracy.
 
 3. **Rust High-Performance Engine**: Using the HuggingFace Candle framework to achieve high concurrency and zero-copy efficient inference.
@@ -105,7 +105,7 @@ The future competitive focus will no longer be about "whose model is the largest
 
 Thus, the next frontier will be: **Intelligent self-adjusting inference mechanisms**. No need for explicit user switches or hardcoding; instead, the model/system can autonomously decide when to "think deeply" or provide a quick answer.
 
-# **Summary in One Sentence**
+## **Summary in One Sentence**
 
 * **GPT-5**: Uses routing for business, driving widespread intelligence.
 * **vLLM Semantic Router**: Uses semantic routing for efficiency, driving green AI.
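
The "smart routing" idea the diffed post describes (simple queries → fast non-reasoning mode, complex queries → Chain-of-Thought) can be sketched as follows. This is a minimal illustration, not the project's actual API: the keyword heuristic stands in for the real semantic classifier (which the post says is implemented in Rust with HuggingFace Candle), and all names and request fields here are hypothetical.

```python
# Hypothetical sketch of semantic routing: classify a query, then enable or
# disable chain-of-thought "thinking" accordingly. The keyword heuristic is a
# stand-in for a real semantic classifier; names and fields are illustrative.

REASONING_HINTS = ("prove", "derive", "step by step", "why", "compare")

def needs_reasoning(query: str) -> bool:
    """Crude stand-in for a semantic complexity classifier."""
    q = query.lower()
    return any(hint in q for hint in REASONING_HINTS)

def route(query: str) -> dict:
    """Build hypothetical request parameters for the chosen inference mode."""
    if needs_reasoning(query):
        # Complex query: enable chain-of-thought to preserve accuracy.
        return {"query": query, "enable_thinking": True}
    # Simple query: fast non-reasoning mode, no thinking tokens spent.
    return {"query": query, "enable_thinking": False}

print(route("What is the capital of France?"))
print(route("Prove that the sum of two even numbers is even."))
```

In a production router the boolean decision would come from an embedding- or classifier-based model rather than keywords, but the control flow — one cheap classification step in front of two inference modes — is the same.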
