Commit 21c945a

docs: add tutorials for semantic cache (#230)
Signed-off-by: bitliu <[email protected]>
1 parent 5e3716e commit 21c945a

2 files changed: +207 -0 lines changed

Lines changed: 206 additions & 0 deletions
@@ -0,0 +1,206 @@
# Semantic Cache

Semantic Router provides intelligent caching that understands request similarity using semantic embeddings. Instead of exact string matching, it identifies semantically similar queries to serve cached responses, reducing latency and LLM inference costs.

## Architecture

```mermaid
graph TB
    A[Client Request] --> B[Semantic Router]
    B --> C{Cache Enabled?}
    C -->|No| G[Route to LLM]
    C -->|Yes| D[Generate Embedding]
    D --> E{Similar Query in Cache?}
    E -->|Hit| F[Return Cached Response]
    E -->|Miss| G[Route to LLM]
    G --> H[LLM Response]
    H --> I[Store in Cache]
    H --> J[Return Response]
    I --> K[Update Metrics]
    F --> K

    style F fill:#90EE90
    style I fill:#FFB6C1
```
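
The flow in the diagram can be sketched in a few lines of Python. This is an illustration only, not the router's implementation: `SemanticCache`, `handle_request`, and `embed` are hypothetical names, and the bag-of-words `embed` function is a toy stand-in for a real sentence-embedding model.

```python
import math
from collections import Counter


def embed(text):
    """Toy bag-of-words embedding (assumption: a real model would be used instead)."""
    return Counter(text.lower().replace("?", "").split())


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


class SemanticCache:
    """Hypothetical in-memory cache keyed by embedding similarity."""

    def __init__(self, similarity_threshold=0.8, enabled=True):
        self.similarity_threshold = similarity_threshold
        self.enabled = enabled
        self.entries = []  # list of (embedding, response) pairs
        self.hits = 0
        self.misses = 0

    def lookup(self, query):
        """Return the best cached response at or above the threshold, else None."""
        query_vec = embed(query)
        best_score, best_response = 0.0, None
        for vec, response in self.entries:
            score = cosine(query_vec, vec)
            if score > best_score:
                best_score, best_response = score, response
        if best_response is not None and best_score >= self.similarity_threshold:
            self.hits += 1
            return best_response
        self.misses += 1
        return None

    def store(self, query, response):
        self.entries.append((embed(query), response))


def handle_request(cache, query, call_llm):
    """Mirror the diagram: check the cache, fall back to the LLM, then store."""
    if cache.enabled:
        cached = cache.lookup(query)
        if cached is not None:
            return cached
    response = call_llm(query)
    if cache.enabled:
        cache.store(query, response)
    return response
```

With this toy embedding only near-identical wording clears a 0.8 threshold; a real embedding model is what lets paraphrases such as "Explain machine learning" match "What is machine learning?".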

## Backend Options

### Memory Backend (Development)

- **Use case**: Development, testing, single-instance deployments
- **Pros**: Fast startup, no external dependencies
- **Cons**: Data lost on restart, limited to single instance

### Milvus Backend (Production/Persistent)

- **Use case**: Production, distributed deployments
- **Pros**: Persistent storage, horizontally scalable, high availability
- **Cons**: Requires Milvus cluster setup

## Configuration

### Memory Backend

```yaml
semantic_cache:
  enabled: true
  backend_type: "memory"
  similarity_threshold: 0.8
  max_entries: 1000
  ttl_seconds: 3600
```
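
To make the effect of `max_entries` and `ttl_seconds` concrete, here is a minimal sketch of a bounded store with expiry. The class name `BoundedTTLStore` is hypothetical, and the oldest-first eviction policy is an assumption; the source does not say which policy the memory backend uses.

```python
import time
from collections import OrderedDict


class BoundedTTLStore:
    """Hypothetical store: entries expire after ttl_seconds, oldest evicted when full."""

    def __init__(self, max_entries=1000, ttl_seconds=3600, clock=time.monotonic):
        self.max_entries = max_entries
        self.ttl_seconds = ttl_seconds
        self.clock = clock  # injectable so expiry can be tested without sleeping
        self._entries = OrderedDict()  # key -> (stored_at, value), oldest first

    def put(self, key, value):
        # Re-inserting a key moves it to the newest position.
        self._entries.pop(key, None)
        self._entries[key] = (self.clock(), value)
        # Assumed policy: drop the oldest entries once max_entries is exceeded.
        while len(self._entries) > self.max_entries:
            self._entries.popitem(last=False)

    def get(self, key):
        item = self._entries.get(key)
        if item is None:
            return None
        stored_at, value = item
        if self.clock() - stored_at >= self.ttl_seconds:
            del self._entries[key]  # expired entries are removed lazily on read
            return None
        return value

    def __len__(self):
        return len(self._entries)
```

The sketch keys entries by an exact string for brevity; in a semantic cache the same bounds would apply to stored embeddings.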

### Milvus Backend

```yaml
semantic_cache:
  enabled: true
  backend_type: "milvus"
  backend_config_path: "config/cache/milvus.yaml"
  similarity_threshold: 0.8
  ttl_seconds: 3600
```

## Testing Cache Functionality

### Test Memory Backend

Start the router with memory cache:

```bash
# Run the router
make run-router
```

Test cache behavior:

```bash
# Send a request; the first call is a cache miss, and repeating it should produce a cache hit
curl -X POST http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "auto",
    "messages": [{"role": "user", "content": "What is machine learning?"}]
  }'

# Send a similar request (should hit the cache due to semantic similarity)
curl -X POST http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "auto",
    "messages": [{"role": "user", "content": "Explain machine learning"}]
  }'
```
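
The same check can be scripted to compare response times, since a cache hit should return noticeably faster than a request routed to the LLM. The sketch below uses only the Python standard library and the endpoint and payload from the curl commands; the helper names `build_request`, `timed_completion`, and `compare_latency` are made up for this example.

```python
import json
import time
import urllib.request

# Endpoint taken from the curl example above.
ROUTER_URL = "http://localhost:8080/v1/chat/completions"


def build_request(prompt, url=ROUTER_URL):
    """Build the same POST request the curl commands send."""
    body = json.dumps(
        {"model": "auto", "messages": [{"role": "user", "content": prompt}]}
    ).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def timed_completion(prompt):
    """Send one request and return (elapsed_seconds, parsed_json_response)."""
    start = time.perf_counter()
    with urllib.request.urlopen(build_request(prompt), timeout=60) as resp:
        payload = json.load(resp)
    return time.perf_counter() - start, payload


def compare_latency(prompts=("What is machine learning?", "Explain machine learning")):
    """Print the latency of each prompt; later, similar prompts should be faster on a hit."""
    for prompt in prompts:
        elapsed, _ = timed_completion(prompt)
        print(f"{elapsed:.3f}s  {prompt}")
```

Call `compare_latency()` while the router is running. Timing alone is only a hint; the hit and miss counters described under monitoring are the authoritative signal.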

### Test Milvus Backend

Start Milvus container:

```bash
make start-milvus
```

Update configuration to use Milvus:

```bash
# Edit config/config.yaml in place
# (GNU sed syntax; on macOS/BSD sed use: sed -i '' '...')
sed -i 's/backend_type: "memory"/backend_type: "milvus"/' config/config.yaml
sed -i 's/# backend_config_path:/backend_config_path:/' config/config.yaml
```

Run with Milvus support:

```bash
# Run the router
make run-router
```

Stop Milvus when done:

```bash
make stop-milvus
```

## Monitoring Cache Performance

### Available Metrics

The router exposes Prometheus metrics for cache monitoring:

| Metric | Type | Description |
|--------|------|-------------|
| `llm_cache_hits_total` | Counter | Total cache hits |
| `llm_cache_misses_total` | Counter | Total cache misses |
| `llm_cache_operations_total` | Counter | Cache operations by backend, operation, and status |
| `llm_cache_operation_duration_seconds` | Histogram | Duration of cache operations |
| `llm_cache_entries_total` | Gauge | Current number of cache entries |

### Cache Metrics Dashboard

Access metrics via:

- **Metrics endpoint**: `http://localhost:9190/metrics`
- **Built-in stats**: Available via cache backend `GetStats()` method

Example Prometheus queries:

```promql
# Cache hit rate
rate(llm_cache_hits_total[5m]) / (rate(llm_cache_hits_total[5m]) + rate(llm_cache_misses_total[5m]))

# Average cache operation duration
rate(llm_cache_operation_duration_seconds_sum[5m]) / rate(llm_cache_operation_duration_seconds_count[5m])

# Cache operations by backend
sum by (backend) (rate(llm_cache_operations_total[5m]))
```
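
Without a Prometheus server, a rough lifetime hit ratio can be computed directly from the text served by the metrics endpoint. The following sketch parses the Prometheus text exposition format for the two counters in the table; `parse_counters` and `hit_ratio` are hypothetical helpers, the sample values are invented, and the parser assumes sample lines carry no trailing timestamp.

```python
SAMPLE_METRICS = """\
# HELP llm_cache_hits_total Total cache hits
# TYPE llm_cache_hits_total counter
llm_cache_hits_total 30
# HELP llm_cache_misses_total Total cache misses
# TYPE llm_cache_misses_total counter
llm_cache_misses_total 10
"""  # made-up values for illustration


def parse_counters(metrics_text, names):
    """Sum the samples of each named metric across all label sets."""
    totals = {name: 0.0 for name in names}
    for line in metrics_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and HELP/TYPE comments
        metric, _, value = line.rpartition(" ")
        name = metric.split("{", 1)[0]  # drop any {label="..."} suffix
        if name in totals:
            totals[name] += float(value)
    return totals


def hit_ratio(hits, misses):
    """Fraction of lookups served from cache; 0.0 when there were no lookups."""
    total = hits + misses
    return hits / total if total else 0.0


counters = parse_counters(
    SAMPLE_METRICS, ["llm_cache_hits_total", "llm_cache_misses_total"]
)
ratio = hit_ratio(counters["llm_cache_hits_total"], counters["llm_cache_misses_total"])
print(f"lifetime hit ratio: {ratio:.2f}")
```

Unlike the `rate(...[5m])` query, this ratio covers everything since the process started, so it reacts slowly to recent changes in traffic.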

### Cache Performance Analysis

Monitor these key indicators:

1. **Hit Ratio**: Higher ratios indicate better cache effectiveness
2. **Operation Latency**: Cache lookups should be significantly faster than LLM calls
3. **Entry Count**: Monitor cache size for memory management
4. **Backend Performance**: Compare memory vs Milvus operation times
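
Hit ratio and operation latency combine into one back-of-the-envelope figure: the expected latency per request. The sketch below assumes every request pays the cache lookup (including embedding generation) and only misses pay the LLM call; `expected_latency` is a hypothetical helper and the numbers in the usage line are illustrative, not measurements.

```python
def expected_latency(hit_ratio, cache_lookup_s, llm_call_s):
    """Average seconds per request: lookup on every request, LLM call on misses only."""
    if not 0.0 <= hit_ratio <= 1.0:
        raise ValueError("hit_ratio must be between 0 and 1")
    return cache_lookup_s + (1.0 - hit_ratio) * llm_call_s


# Illustrative inputs: 50% hit ratio, 10 ms lookup, 2 s LLM call.
print(f"{expected_latency(0.5, 0.01, 2.0):.2f}s per request on average")
```

Under this model the cache pays off whenever the lookup cost is smaller than the hit ratio times the LLM call time.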

## Configuration Best Practices

### Development Environment

```yaml
semantic_cache:
  enabled: true
  backend_type: "memory"
  similarity_threshold: 0.85  # Higher threshold for more precise matching
  max_entries: 500            # Smaller cache for testing
```

### Production Environment

```yaml
semantic_cache:
  enabled: true
  backend_type: "milvus"
  backend_config_path: "config/cache/milvus.yaml"
  similarity_threshold: 0.8  # Balanced threshold
```

### Milvus Production Configuration

```yaml
# config/cache/milvus.yaml
connection:
  host: "milvus-cluster.prod.example.com"  # Replace with your Milvus cluster endpoint
  port: 443
  auth:
    enabled: true
    username: "semantic-router"  # Replace with your Milvus username
    password: "${MILVUS_PASSWORD}"  # Replace with your Milvus password
  tls:
    enabled: true

development:
  drop_collection_on_startup: false  # Preserve data
  auto_create_collection: false      # Pre-create collections
```

website/sidebars.js

Lines changed: 1 addition & 0 deletions
@@ -49,6 +49,7 @@ const sidebars = {
       items: [
         'getting-started/installation',
         'getting-started/docker-quickstart',
+        'getting-started/semantic-cache',
         'getting-started/reasoning',
         'getting-started/configuration',
         'getting-started/observability',
