Commit a1b5826

author
yinmin
committed
fix document
1 parent e77c27d commit a1b5826

3 files changed

+10
-8
lines changed

website/blog/2025-10-30-milvus.md

Lines changed: 3 additions & 8 deletions
Original file line number | Diff line number | Diff line change
@@ -1,15 +1,14 @@
11
---
22
slug: semantic-router-milvus
33
title: "vLLM Semantic Router + Milvus: How Semantic Routing and Caching Build Scalable AI Systems the Smart Way"
4-
authors: [min yin]
4+
authors: [minyin]
55
tags: [semantic-router, milvus, caching, scalability, ai-systems]
66
---
77

88
# vLLM Semantic Router + Milvus: How Semantic Routing and Caching Build Scalable AI Systems the Smart Way
99

1010
Most AI apps rely on a single model for every request. But that approach quickly runs into limits. Large models are powerful yet expensive, even when they're used for simple queries. Smaller models are cheaper and faster but can't handle complex reasoning. When traffic surges—say your AI app suddenly goes viral with ten million users overnight—the inefficiency of this one-model-for-all setup becomes painfully apparent. Latency spikes, GPU bills explode, and the model that ran fine yesterday starts gasping for air.
1111

12-
1312
<!-- truncate -->
1413

1514
And my friend, you, the engineer behind this app, have to fix it—fast.
@@ -69,10 +68,8 @@ In developer tools or IDE assistants, many queries overlap—syntax help, API lo
6968

7069
Enterprise queries tend to repeat over time—policy lookups, compliance references, product FAQs. With Milvus as the semantic cache layer, frequently asked questions and their answers can be stored and retrieved efficiently. This minimizes redundant computation while keeping responses consistent across departments and regions.
7170

72-
7371
Under the hood, the Semantic Router + Milvus pipeline is implemented in Go and Rust for high performance and low latency. Integrated at the gateway layer, it continuously monitors key metrics—like hit rates, routing latency, and model performance—to fine-tune routing strategies in real time.
7472

75-
7673
## How to Quickly Test the Semantic Caching in the Semantic Router
7774

7875
Before deploying semantic caching at scale, it's useful to validate how it behaves in a controlled setup. In this section, we'll walk through a quick local test that shows how the Semantic Router uses Milvus as its semantic cache. You'll see how similar queries hit the cache instantly while new or distinct ones trigger model generation—proving the caching logic in action.
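
The hit-or-miss behavior described in this paragraph can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the Semantic Router's implementation: `SemanticCache`, `answer`, the 0.9 similarity threshold, and the in-memory list standing in for Milvus are all hypothetical, and the toy two-dimensional vectors stand in for real query embeddings.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class SemanticCache:
    """In-memory stand-in for a vector-store-backed cache (hypothetical)."""

    def __init__(self, threshold=0.9):
        self.threshold = threshold  # assumed cutoff for "similar enough"
        self.entries = []           # list of (embedding, response) pairs

    def lookup(self, embedding):
        """Return the response of the most similar entry, or None on a miss."""
        best_score, best_response = 0.0, None
        for vec, response in self.entries:
            score = cosine(embedding, vec)
            if score > best_score:
                best_score, best_response = score, response
        return best_response if best_score >= self.threshold else None

    def store(self, embedding, response):
        self.entries.append((embedding, response))


def answer(cache, embedding, generate):
    """Serve from the cache when possible; otherwise generate and store."""
    cached = cache.lookup(embedding)
    if cached is not None:
        return cached, True       # cache hit: no model call
    response = generate()         # cache miss: fall back to the model
    cache.store(embedding, response)
    return response, False
```

A first query misses and is stored; a near-duplicate embedding then returns the stored response without calling `generate`, while an unrelated embedding misses again. In a real deployment the linear scan would be replaced by a vector search against the cache store.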
@@ -99,6 +96,7 @@ Start the Milvus service.
9996
docker-compose up -d
10097
10198
```
99+
102100
![docker-compose](/img/docker-compose.png)
103101

104102
```
@@ -107,9 +105,6 @@ docker-compose ps -a
107105
108106
```
109107

110-
111-
112-
113108
### 2. Clone the project
114109

115110
```bash
@@ -275,4 +270,4 @@ In short, you get smarter scaling—less brute force, more brains.
275270

276271
---
277272

278-
If you'd like to explore this further, join the conversation in our Milvus Discord or open an issue on GitHub. You can also book a 20-minute Milvus Office Hours session for one-on-one guidance, insights, and technical deep dives from the team behind Milvus.
273+
If you'd like to explore this further, join the conversation in our Milvus Discord or open an issue on GitHub. You can also book a 20-minute Milvus Office Hours session for one-on-one guidance, insights, and technical deep dives from the team behind Milvus.

website/blog/authors.yml

Lines changed: 7 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -21,3 +21,10 @@ Xunzhuo:
2121
title: Software Engineer @ Tencent
2222
url: https://github.com/Xunzhuo
2323
image_url: /img/team/xunzhuo.png
24+
25+
minyin:
26+
name: Min Yin
27+
title: Milvus Ambassador
28+
url: https://github.com/yinmin2020
29+
image_url: /img/team/yinmin.jpg
30+

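
Docusaurus resolves each entry in a post's `authors` front matter against the top-level keys of `authors.yml`, so the two files have to agree. A minimal sketch of a consistency check follows, assuming the inline `authors: [a, b]` list form used in this post. The helper names are hypothetical, and the regex-based key scan is a simplification rather than a real YAML parser.

```python
import re


def front_matter_authors(markdown_text):
    """Return author keys from an inline `authors: [a, b]` front-matter line."""
    match = re.search(r"^authors:\s*\[(.*?)\]\s*$", markdown_text, re.MULTILINE)
    if not match:
        return []
    return [key.strip() for key in match.group(1).split(",") if key.strip()]


def top_level_keys(yaml_text):
    """Return unindented `key:` entries (simplified scan, not a YAML parser)."""
    return re.findall(r"^([A-Za-z0-9_-]+):\s*$", yaml_text, re.MULTILINE)


def missing_authors(markdown_text, yaml_text):
    """List author keys used in the post that authors.yml does not define."""
    known = set(top_level_keys(yaml_text))
    return [key for key in front_matter_authors(markdown_text) if key not in known]
```

Calling `missing_authors(post_text, authors_text)` returns an empty list when every referenced key is defined. A key such as `min yin` would be reported unless `authors.yml` has a matching top-level entry.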
website/static/img/team/yinmin.jpg

32.9 KB
