# vLLM Semantic Router + Milvus: How Semantic Routing and Caching Build Scalable AI Systems the Smart Way
Most AI apps rely on a single model for every request. But that approach quickly runs into limits. Large models are powerful yet expensive, even when they're used for simple queries. Smaller models are cheaper and faster but can't handle complex reasoning. When traffic surges—say your AI app suddenly goes viral with ten million users overnight—the inefficiency of this one-model-for-all setup becomes painfully apparent. Latency spikes, GPU bills explode, and the model that ran fine yesterday starts gasping for air.
<!-- truncate -->
And my friend, you, the engineer behind this app, have to fix it—fast.
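The first lever is routing: classify each request and send it to the cheapest model that can handle it. As a toy illustration only — the model names and the keyword heuristic below are made up, and the actual Semantic Router classifies intent with embeddings rather than keywords:

```python
def pick_model(query: str) -> str:
    # Toy complexity heuristic. The real router embeds the query and
    # classifies its intent; keywords stand in here for illustration only.
    hard_signals = ("prove", "derive", "analyze", "step by step", "why")
    if any(s in query.lower() for s in hard_signals) or len(query.split()) > 40:
        return "large-model"   # hypothetical model name
    return "small-model"       # hypothetical model name

print(pick_model("What time is it in Tokyo?"))
print(pick_model("Analyze the tradeoffs between sharding and replication"))
```

Even this crude split captures the economics: the simple query never touches the expensive model.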
Enterprise queries tend to repeat over time—policy lookups, compliance references, product FAQs. With Milvus as the semantic cache layer, frequently asked questions and their answers can be stored and retrieved efficiently. This minimizes redundant computation while keeping responses consistent across departments and regions.
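The pattern can be sketched in a few lines. Here a toy bag-of-words vector and an in-memory list stand in for a real embedding model and a Milvus collection, and the similarity threshold is illustrative:

```python
import math
from collections import Counter
from typing import Optional

def embed(text: str) -> Counter:
    # Toy stand-in for a real embedding model: a bag-of-words vector.
    # A production deployment would embed with a model and store the
    # vectors in a Milvus collection instead.
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

class SemanticCache:
    """In-memory stand-in for the Milvus-backed semantic cache layer."""

    def __init__(self, threshold: float = 0.8):  # threshold is illustrative
        self.threshold = threshold
        self.entries = []  # list of (query vector, cached answer)

    def lookup(self, query: str) -> Optional[str]:
        qv = embed(query)
        best = max(self.entries, key=lambda e: cosine(qv, e[0]), default=None)
        if best and cosine(qv, best[0]) >= self.threshold:
            return best[1]  # semantic hit: skip model generation entirely
        return None

    def store(self, query: str, answer: str) -> None:
        self.entries.append((embed(query), answer))

cache = SemanticCache()
cache.store("what is the refund policy", "Refunds are issued within 30 days.")
print(cache.lookup("what is the refund policy please"))  # near-duplicate: hit
print(cache.lookup("how do I reset my password"))        # distinct: miss
```

The key design point is the threshold: too low and users get stale or mismatched answers, too high and the cache never fires. Milvus makes the nearest-neighbor lookup fast at scale, which is what this toy list cannot do.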
Under the hood, the Semantic Router + Milvus pipeline is implemented in Go and Rust for high performance and low latency. Integrated at the gateway layer, it continuously monitors key metrics—like hit rates, routing latency, and model performance—to fine-tune routing strategies in real time.
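The hit-rate and latency signals that drive such tuning can be tracked with a simple rolling window. This is a hypothetical sketch of the idea, not the router's actual Go/Rust implementation:

```python
from collections import deque

class CacheMetrics:
    """Rolling window of cache outcomes — the kind of signal a gateway
    layer could use to tune routing and thresholds in near real time."""

    def __init__(self, window: int = 1000):
        self.outcomes = deque(maxlen=window)      # True = hit, False = miss
        self.latencies_ms = deque(maxlen=window)

    def record(self, hit: bool, latency_ms: float) -> None:
        self.outcomes.append(hit)
        self.latencies_ms.append(latency_ms)

    def hit_rate(self) -> float:
        return sum(self.outcomes) / len(self.outcomes) if self.outcomes else 0.0

    def avg_latency_ms(self) -> float:
        return sum(self.latencies_ms) / len(self.latencies_ms) if self.latencies_ms else 0.0

m = CacheMetrics()
m.record(True, 3.2)     # cache hit: answered in a few milliseconds
m.record(False, 850.0)  # miss: full model generation
print(f"hit rate: {m.hit_rate():.0%}, avg latency: {m.avg_latency_ms():.1f} ms")
```

A falling hit rate or rising latency is exactly the kind of signal that would prompt loosening the similarity threshold or re-weighting routes.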
## How to Quickly Test the Semantic Caching in the Semantic Router
Before deploying semantic caching at scale, it's useful to validate how it behaves in a controlled setup. In this section, we'll walk through a quick local test that shows how the Semantic Router uses Milvus as its semantic cache. You'll see how similar queries hit the cache instantly while new or distinct ones trigger model generation—proving the caching logic in action.
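The behavior you should expect can be previewed in-process before touching a live endpoint. In this sketch, `normalize` crudely stands in for embedding similarity and `time.sleep` simulates model generation; every name here is hypothetical, not the router's API:

```python
import time

CACHE = {}  # exact-match stand-in for the Milvus semantic index

def normalize(q: str) -> str:
    # Crude stand-in for embedding similarity: lowercase and strip
    # punctuation, so trivially rephrased queries collapse to one key.
    return "".join(c for c in q.lower() if c.isalnum() or c.isspace()).strip()

def answer(query: str):
    key = normalize(query)
    if key in CACHE:
        return CACHE[key], True   # cache hit: no model call
    time.sleep(0.2)               # simulate model generation latency
    result = f"generated answer for: {key}"
    CACHE[key] = result
    return result, False

for q in ["What is Milvus?", "what is Milvus"]:
    start = time.perf_counter()
    _, hit = answer(q)
    print(f"{q!r}: hit={hit}, took {time.perf_counter() - start:.3f}s")
```

The first query pays the generation cost; the rephrased second one returns almost instantly — the same hit/miss split you should see in the real test below, with embedding search in place of string normalization.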
Start the Milvus service.

```bash
docker-compose up -d
```

Then check that the containers are running.

```bash
docker-compose ps -a
```
### 2. Clone the project
```bash
git clone https://github.com/vllm-project/semantic-router.git
```
In short, you get smarter scaling—less brute force, more brains.
---
If you'd like to explore this further, join the conversation in our Milvus Discord or open an issue on GitHub. You can also book a 20-minute Milvus Office Hours session for one-on-one guidance, insights, and technical deep dives from the team behind Milvus.