You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
Copy file name to clipboardExpand all lines: README.md
+13-18Lines changed: 13 additions & 18 deletions
Display the source diff
Display the rich diff
Original file line number
Diff line number
Diff line change
@@ -16,6 +16,7 @@
16
16
17
17
*Latest News* 🔥
18
18
19
+
-[2025/10/26] We reached 2000 stars on GitHub! 🔥
19
20
-[2025/10/21] We announced the [2025 Q4 Roadmap: Journey to Iris](https://vllm-semantic-router.com/blog/q4-roadmap-iris) 📅.
20
21
-[2025/10/16] We established the [vLLM Semantic Router Youtube Channel](https://www.youtube.com/@vLLMSemanticRouter) ✨.
21
22
-[2025/10/15] We announced the [vLLM Semantic Router Dashboard](https://www.youtube.com/watch?v=E2IirN8PsFw) 🚀.
@@ -25,13 +26,6 @@
25
26
-[2025/09/15] We reached 1000 stars on GitHub! 🔥
26
27
-[2025/09/01] We released the project officially: [vLLM Semantic Router: Next Phase in LLM inference](https://blog.vllm.ai/2025/09/11/semantic-router.html) 🚀.
27
28
28
-
<!-- <details>
29
-
<summary>Previous News 🔥</summary>
30
-
31
-
-
32
-
33
-
</details> -->
34
-
35
29
---
36
30
37
31
## Innovations ✨
@@ -44,30 +38,36 @@
44
38
45
39
An **Mixture-of-Models** (MoM) router that intelligently directs OpenAI API requests to the most suitable models from a defined pool based on **Semantic Understanding** of the request's intent (Complexity, Task, Tools).
46
40
47
-
This is achieved using BERT classification. Conceptually similar to Mixture-of-Experts (MoE) which lives *within* a model, this system selects the best *entire model* for the nature of the task.
41
+

42
+
43
+
Conceptually similar to Mixture-of-Experts (MoE) which lives *within* a model, this system selects the best *entire model* for the nature of the task.
48
44
49
45
As such, the overall inference accuracy is improved by using a pool of models that are better suited for different types of tasks:
- Golang (with Rust FFI based on the [candle](https://github.com/huggingface/candle) rust ML framework)
60
52
- Python
61
53
Benchmarking will be conducted to determine the best implementation.
62
54
55
+
#### Request Flow
56
+
57
+

58
+
63
59
#### Auto-Selection of Tools
64
60
65
61
Select the tools to use based on the prompt, avoiding the use of tools that are not relevant to the prompt so as to reduce the number of prompt tokens and improve tool selection accuracy by the LLM.
66
62
67
-
#### Category-Specific System Prompts
63
+
#### Domain Aware System Prompts
68
64
69
65
Automatically inject specialized system prompts based on query classification, ensuring optimal model behavior for different domains (math, coding, business, etc.) without manual prompt engineering.
70
66
67
+
#### Domain Aware Similarity Caching ⚡️
68
+
69
+
Cache the semantic representation of the prompt so as to reduce the number of prompt tokens and improve the overall inference latency.
70
+
71
71
### Enterprise Security 🔒
72
72
73
73
#### PII detection
@@ -78,10 +78,6 @@ Detect PII in the prompt, avoiding sending PII to the LLM so as to protect the p
78
78
79
79
Detect if the prompt is a jailbreak prompt, avoiding sending jailbreak prompts to the LLM so as to prevent the LLM from misbehaving. Can be configured globally or at the category level for fine-grained security control.
80
80
81
-
### Similarity Caching ⚡️
82
-
83
-
Cache the semantic representation of the prompt so as to reduce the number of prompt tokens and improve the overall inference latency.
84
-
85
81
### Distributed Tracing 🔍
86
82
87
83
Comprehensive observability with OpenTelemetry distributed tracing provides fine-grained visibility into the request processing pipeline.
@@ -128,7 +124,6 @@ The documentation includes:
128
124
-**[Model Training](https://vllm-semantic-router.com/docs/training/training-overview/)** - How classification models work
129
125
-**[API Reference](https://vllm-semantic-router.com/docs/api/router/)** - Complete API documentation
0 commit comments