
Commit aab22ff
docs: update architecture and add req flow (#562)
Signed-off-by: bitliu <[email protected]>
1 parent 2176373

File tree: 4 files changed, +13 −18 lines

README.md

Lines changed: 13 additions & 18 deletions
```diff
@@ -16,6 +16,7 @@
 
 *Latest News* 🔥
 
+- [2025/10/26] We reached 2000 stars on GitHub! 🔥
 - [2025/10/21] We announced the [2025 Q4 Roadmap: Journey to Iris](https://vllm-semantic-router.com/blog/q4-roadmap-iris) 📅.
 - [2025/10/16] We established the [vLLM Semantic Router Youtube Channel](https://www.youtube.com/@vLLMSemanticRouter) ✨.
 - [2025/10/15] We announced the [vLLM Semantic Router Dashboard](https://www.youtube.com/watch?v=E2IirN8PsFw) 🚀.
```
```diff
@@ -25,13 +26,6 @@
 - [2025/09/15] We reached 1000 stars on GitHub! 🔥
 - [2025/09/01] We released the project officially: [vLLM Semantic Router: Next Phase in LLM inference](https://blog.vllm.ai/2025/09/11/semantic-router.html) 🚀.
 
-
-<!-- <details>
-<summary>Previous News 🔥</summary>
-
-
-</details> -->
-
 ---
 
 ## Innovations ✨
```
```diff
@@ -44,30 +38,36 @@
 
 A **Mixture-of-Models** (MoM) router that intelligently directs OpenAI API requests to the most suitable models from a defined pool based on **Semantic Understanding** of the request's intent (Complexity, Task, Tools).
 
-This is achieved using BERT classification. Conceptually similar to Mixture-of-Experts (MoE), which lives *within* a model, this system selects the best *entire model* for the nature of the task.
+![](./website/static/img/mom-overview.png)
+
+Conceptually similar to Mixture-of-Experts (MoE), which lives *within* a model, this system selects the best *entire model* for the nature of the task.
 
 As such, the overall inference accuracy is improved by using a pool of models that are better suited for different types of tasks:
 
 ![Model Accuracy](./website/static/img/category_accuracies.png)
 
-The screenshot below shows the LLM Router dashboard in Grafana.
-
-![LLM Router Dashboard](./website/static/img/grafana_screenshot.png)
-
 The router is implemented in two ways:
 
 - Golang (with Rust FFI based on the [candle](https://github.com/huggingface/candle) Rust ML framework)
 - Python
 Benchmarking will be conducted to determine the best implementation.
 
+#### Request Flow
+
+![architecture](./website/static/img/flow.png)
+
 #### Auto-Selection of Tools
 
 Select the tools to use based on the prompt, avoiding tools that are not relevant to the prompt, so as to reduce the number of prompt tokens and improve tool-selection accuracy by the LLM.
 
-#### Category-Specific System Prompts
+#### Domain Aware System Prompts
 
 Automatically inject specialized system prompts based on query classification, ensuring optimal model behavior for different domains (math, coding, business, etc.) without manual prompt engineering.
 
+#### Domain Aware Similarity Caching ⚡️
+
+Cache the semantic representation of the prompt so as to reduce the number of prompt tokens and improve the overall inference latency.
+
 ### Enterprise Security 🔒
 
 #### PII detection
```
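The routing idea described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the project's implementation: a keyword heuristic stands in for the BERT classifier, and the category names and model pool entries are hypothetical.

```python
# Sketch of Mixture-of-Models routing: classify the request's intent,
# then pick the best *entire model* from a pool. All names are hypothetical.

MODEL_POOL = {
    "math": "qwen2.5-math-7b",
    "coding": "deepseek-coder-6.7b",
    "general": "llama-3.1-8b-instruct",
}

def classify(prompt: str) -> str:
    """Stand-in for semantic classification; the router uses a BERT model."""
    lowered = prompt.lower()
    if any(k in lowered for k in ("integral", "solve", "equation")):
        return "math"
    if any(k in lowered for k in ("python", "function", "bug")):
        return "coding"
    return "general"

def route(prompt: str) -> str:
    """MoE-style selection, but across whole models rather than experts."""
    return MODEL_POOL[classify(prompt)]

print(route("Solve the equation x^2 - 4 = 0"))   # qwen2.5-math-7b
print(route("Why is my Python function slow?"))  # deepseek-coder-6.7b
```

A production router would replace `classify` with the trained classifier and forward the original OpenAI API request to the selected backend unchanged.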
```diff
@@ -78,10 +78,6 @@ Detect PII in the prompt, avoiding sending PII to the LLM so as to protect the p
 
 Detect if the prompt is a jailbreak prompt, avoiding sending jailbreak prompts to the LLM so as to prevent the LLM from misbehaving. Can be configured globally or at the category level for fine-grained security control.
 
-### Similarity Caching ⚡️
-
-Cache the semantic representation of the prompt so as to reduce the number of prompt tokens and improve the overall inference latency.
-
 ### Distributed Tracing 🔍
 
 Comprehensive observability with OpenTelemetry distributed tracing provides fine-grained visibility into the request processing pipeline.
```
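The similarity-caching feature mentioned in this diff can be sketched as follows. This is a hedged illustration, not the router's code: a toy bag-of-words vector stands in for the real semantic embedding, and the 0.9 threshold is an assumed parameter.

```python
import math
from collections import Counter

def embed(text):
    """Toy bag-of-words 'embedding'; the real router uses a semantic model."""
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

class SimilarityCache:
    """Serve a cached answer when a new prompt is semantically close enough."""

    def __init__(self, threshold=0.9):  # threshold value is an assumption
        self.threshold = threshold
        self.entries = []  # (embedding, cached answer) pairs

    def get(self, prompt):
        vec = embed(prompt)
        for cached_vec, answer in self.entries:
            if cosine(vec, cached_vec) >= self.threshold:
                return answer  # cache hit: the LLM call is skipped entirely
        return None

    def put(self, prompt, answer):
        self.entries.append((embed(prompt), answer))

cache = SimilarityCache()
cache.put("what is the capital of france", "Paris")
print(cache.get("what is the capital of france ?"))  # near-duplicate: Paris
print(cache.get("how do I bake sourdough bread"))    # unrelated: None
```

Because lookups match on semantic closeness rather than exact strings, paraphrased repeats of a prompt avoid a second inference call, which is where the token and latency savings come from.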
```diff
@@ -128,7 +124,6 @@ The documentation includes:
 - **[Model Training](https://vllm-semantic-router.com/docs/training/training-overview/)** - How classification models work
 - **[API Reference](https://vllm-semantic-router.com/docs/api/router/)** - Complete API documentation
 - **[Dashboard](https://vllm-semantic-router.com/docs/overview/dashboard)** - vLLM Semantic Router Dashboard
-- **[Distributed Tracing](https://vllm-semantic-router.com/docs/tutorials/observability/distributed-tracing/)** - Observability and debugging guide
 
 ## Community 👋
 
```
website/static/img/flow.png
