Commit f193a19
[Doc] Reorganize intelligent routing tutorials into focused guides (vllm-project#636)
* [Doc] Reorganize intelligent routing tutorials into focused guides

  - Remove overview.md and reasoning.md
  - Create 4 new focused routing guides:
    - domain-routing.md: fine-tuned classification (efficient, specialized)
    - embedding-routing.md: semantic similarity routing (scalable, fast)
    - keyword-routing.md: keyword-based routing (transparent, compliant)
    - mcp-routing.md: external service routing (extensible, private)
  - Update lora-routing.md to clarify it combines other routing methods
  - Update sidebar navigation to reflect new structure
  - Add comprehensive use cases and problem-solution context to each guide
  - Align all guides with consistent structure and friendly tone

* [Doc] Fix MDX compilation and markdown lint errors

  - Escape < characters in numeric comparisons (&lt;1%, &lt;1ms, etc.)
  - Add blank lines around fenced code blocks
  - Remove multiple consecutive blank lines at end of files
  - Fix list formatting around code blocks

Signed-off-by: bitliu <[email protected]>
1 parent 2f70a77 commit f193a19

8 files changed: +957 −313 lines

New file: 228 additions, 0 deletions
# Domain-Based Routing

This guide shows you how to use fine-tuned classification models for intelligent routing based on academic and professional domains. Domain routing uses specialized models (ModernBERT, Qwen3-Embedding, EmbeddingGemma) with LoRA adapters to classify queries into categories like math, physics, law, business, and more.

## Key Advantages

- **Efficient**: Fine-tuned models with LoRA adapters provide fast inference (5-20ms) with high accuracy
- **Specialized**: Multiple model options (ModernBERT for English, Qwen3 for multilingual/long-context, Gemma for a small footprint)
- **Multi-task**: LoRA enables running multiple classification tasks (domain + PII + jailbreak) with a shared base model
- **Cost-effective**: Lower latency than LLM-based classification, no API costs

## What Problem Does It Solve?

Generic classification approaches struggle with domain-specific terminology and the nuanced differences between academic and professional fields. Domain routing provides:

- **Accurate domain detection**: Fine-tuned models distinguish between math, physics, chemistry, law, business, etc.
- **Multi-task efficiency**: LoRA adapters enable simultaneous domain classification, PII detection, and jailbreak detection with one base-model pass
- **Long-context support**: Qwen3-Embedding handles up to 32K tokens (vs ModernBERT's 8K limit)
- **Multilingual routing**: Qwen3 is trained on 100+ languages; ModernBERT is optimized for English
- **Resource optimization**: Expensive reasoning is enabled only for domains that benefit from it (math, physics, chemistry)
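The multi-task point above can be sketched in plain Python: one shared base encoding feeds several lightweight task heads, which is the shape of the LoRA setup (one base-model pass, several adapter classifiers). Everything here is illustrative — the stand-in encoder, the toy weights, and the head names are assumptions, not the project's actual internals:

```python
# Illustrative sketch: one shared "base model" pass feeds three small
# task heads, mirroring how LoRA adapters attach per-task classifiers
# to a single base encoding. A real deployment uses a transformer
# encoder; here the encoder is a toy stand-in function.

def encode(text: str) -> list[float]:
    """Stand-in for the shared base model's forward pass (run once)."""
    n = max(len(text), 1)
    return [
        sum(c.isdigit() for c in text) / n,  # digit density
        sum(c.isalpha() for c in text) / n,  # letter density
        text.count("=") / n,                 # equation-ish signal
        text.count("?") / n,                 # question-ish signal
    ]

def head(weights: list[float], bias: float):
    """A tiny task-specific head, analogous to a LoRA adapter's classifier."""
    def score(emb: list[float]) -> float:
        return sum(w * e for w, e in zip(weights, emb)) + bias
    return score

# Three tasks share the one encoding (all weights are toy values).
domain_head = head([2.0, 0.5, 3.0, 0.0], 0.0)     # "math-ness"
pii_head = head([1.0, 0.0, 0.0, 0.0], 0.0)        # digit-heavy → PII-ish
jailbreak_head = head([0.0, 0.0, 0.0, 1.0], 0.0)  # toy signal

emb = encode("Solve: x^2 + 5x + 6 = 0")  # ONE base pass...
scores = {
    "domain_math": domain_head(emb),     # ...THREE task outputs
    "pii": pii_head(emb),
    "jailbreak": jailbreak_head(emb),
}
print(scores)
```

The efficiency claim falls out of the structure: the expensive `encode` step runs once, and each extra task adds only a small head.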
## When to Use

- **Educational platforms** with diverse subject areas (STEM, humanities, social sciences)
- **Professional services** requiring domain expertise (legal, medical, financial)
- **Enterprise knowledge bases** spanning multiple departments
- **Research assistance** tools needing academic domain awareness
- **Multi-domain products** where classification accuracy is critical

## Configuration

Configure the domain classifier in your `config.yaml`:

```yaml
classifier:
  category_model:
    model_id: "models/category_classifier_modernbert-base_model"
    use_modernbert: true
    threshold: 0.6
    use_cpu: true
    category_mapping_path: "models/category_classifier_modernbert-base_model/category_mapping.json"

  pii_model:
    model_id: "models/pii_classifier_modernbert-base_presidio_token_model"
    use_modernbert: true
    threshold: 0.7
    use_cpu: true
    pii_mapping_path: "models/pii_classifier_modernbert-base_presidio_token_model/pii_type_mapping.json"

categories:
  - name: math
    system_prompt: "You are a mathematics expert. Provide step-by-step solutions."
    model_scores:
      - model: qwen3
        score: 1.0
        use_reasoning: true

  - name: physics
    system_prompt: "You are a physics expert with deep understanding of physical laws."
    model_scores:
      - model: qwen3
        score: 0.7
        use_reasoning: true

  - name: computer science
    system_prompt: "You are a computer science expert with knowledge of algorithms and data structures."
    model_scores:
      - model: qwen3
        score: 0.6
        use_reasoning: false

  - name: business
    system_prompt: "You are a senior business consultant and strategic advisor."
    model_scores:
      - model: qwen3
        score: 0.7
        use_reasoning: false

  - name: health
    system_prompt: "You are a health and medical information expert."
    semantic_cache_enabled: true
    semantic_cache_similarity_threshold: 0.95
    model_scores:
      - model: qwen3
        score: 0.5
        use_reasoning: false

  - name: law
    system_prompt: "You are a knowledgeable legal expert."
    model_scores:
      - model: qwen3
        score: 0.4
        use_reasoning: false

default_model: qwen3
```
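A hedged sketch of how a config shaped like the one above could drive the routing decision: classify the query, apply the confidence `threshold`, then pick the highest-scoring model for the detected category, falling back to `default_model` otherwise. The function and dict layout are illustrative, not the router's actual code path:

```python
# Illustrative routing decision over a config shaped like the YAML above.
# The classifier itself is stubbed out; field names mirror the config.
config = {
    "threshold": 0.6,
    "default_model": "qwen3",
    "categories": {
        "math": {"model_scores": [
            {"model": "qwen3", "score": 1.0, "use_reasoning": True}]},
        "business": {"model_scores": [
            {"model": "qwen3", "score": 0.7, "use_reasoning": False}]},
    },
}

def route(category, confidence):
    """Pick a model given a (stubbed) classifier category + confidence."""
    if confidence < config["threshold"] or category not in config["categories"]:
        # Low confidence or unknown category: fall back to the default model.
        return {"model": config["default_model"], "use_reasoning": False}
    # Otherwise take the best-scoring model entry for the category.
    best = max(config["categories"][category]["model_scores"],
               key=lambda ms: ms["score"])
    return {"model": best["model"], "use_reasoning": best["use_reasoning"]}

print(route("math", 0.92))      # confident math → reasoning path
print(route("business", 0.75))  # confident business → fast path
print(route("math", 0.41))      # below threshold → default model
```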
## Supported Domains

- **Academic**: math, physics, chemistry, biology, computer science, engineering
- **Professional**: business, law, economics, health, psychology
- **General**: philosophy, history, other

## Features

- **PII Detection**: Automatically detects and handles sensitive information
- **Semantic Caching**: Cache similar queries for faster responses
- **Reasoning Control**: Enable/disable reasoning per domain
- **Custom Thresholds**: Adjust cache sensitivity per category
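The semantic-caching feature amounts to a similarity lookup against previously answered queries, gated by a per-category threshold like `semantic_cache_similarity_threshold`. A minimal sketch, assuming a toy letter-frequency embedding in place of a real embedding model (the cache layout and function names are illustrative):

```python
import math

def embed(text):
    """Toy embedding: normalized letter frequencies. A real system
    would use an embedding model (e.g. the configured classifier's)."""
    vec = [0.0] * 26
    for ch in text.lower():
        if ch.isalpha():
            vec[ord(ch) - ord("a")] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

def cosine(a, b):
    return sum(x * y for x, y in zip(a, b))

cache = []  # list of (embedding, cached answer) pairs

def lookup(query, threshold):
    """Return a cached answer if any entry is similar enough, else None."""
    q = embed(query)
    best = max(cache, key=lambda entry: cosine(q, entry[0]), default=None)
    if best and cosine(q, best[0]) >= threshold:
        return best[1]
    return None

cache.append((embed("What are symptoms of diabetes?"), "cached answer"))
print(lookup("What are symptoms of diabetes?", 0.95))  # identical text → hit
print(lookup("xyzzy", 0.95))                           # dissimilar → None
```

A strict threshold like the health domain's 0.95 trades hit rate for safety: only near-identical queries reuse an answer.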
## Example Requests

```bash
# Math query (reasoning enabled)
curl -X POST http://localhost:8801/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "MoM",
    "messages": [{"role": "user", "content": "Solve: x^2 + 5x + 6 = 0"}]
  }'

# Business query (reasoning disabled)
curl -X POST http://localhost:8801/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "MoM",
    "messages": [{"role": "user", "content": "What is a SWOT analysis?"}]
  }'

# Health query (high cache threshold)
curl -X POST http://localhost:8801/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "MoM",
    "messages": [{"role": "user", "content": "What are symptoms of diabetes?"}]
  }'
```
## Real-World Use Cases

### 1. Multi-Task Classification with LoRA (Efficient)

- **Problem**: Need domain classification + PII detection + jailbreak detection on every request
- **Solution**: LoRA adapters run all 3 tasks with one base-model pass instead of 3 separate models
- **Impact**: 3x faster than running 3 full models, &lt;1% parameter overhead per task

### 2. Long Document Analysis (Specialized - Qwen3)

- **Problem**: Research papers and legal documents exceed ModernBERT's 8K token limit
- **Solution**: Qwen3-Embedding supports up to 32K tokens without truncation
- **Impact**: Accurate classification on full documents, no information loss from truncation

### 3. Multilingual Education Platform (Specialized - Qwen3)

- **Problem**: Students ask questions in 100+ languages; ModernBERT is limited to English
- **Solution**: Qwen3-Embedding, trained on 100+ languages, handles multilingual routing
- **Impact**: A single model serves global users with consistent quality across languages

### 4. Edge Deployment (Specialized - Gemma)

- **Problem**: Mobile/IoT devices can't run large classification models
- **Solution**: EmbeddingGemma-300M with Matryoshka embeddings (128-768 dims)
- **Impact**: 5x smaller model, runs on edge devices with &lt;100MB memory

### 5. STEM Tutoring Platform (Efficient Reasoning Control)

- **Problem**: Math/physics queries need reasoning, but history/literature queries don't
- **Solution**: The domain classifier routes STEM → reasoning models, humanities → fast models
- **Impact**: 2x better STEM accuracy, 60% cost savings on non-STEM queries
## Domain-Specific Optimizations

### STEM Domains (Reasoning Enabled)

```yaml
- name: math
  use_reasoning: true  # Step-by-step solutions
  score: 1.0           # Highest priority
- name: physics
  use_reasoning: true  # Derivations and proofs
  score: 0.7
- name: chemistry
  use_reasoning: true  # Reaction mechanisms
  score: 0.6
```

### Professional Domains (PII + Caching)

```yaml
- name: health
  semantic_cache_enabled: true
  semantic_cache_similarity_threshold: 0.95  # Very strict
  pii_detection_enabled: true
- name: law
  score: 0.4  # Conservative routing
  pii_detection_enabled: true
```

### General Domains (Fast + Cached)

```yaml
- name: business
  use_reasoning: false  # Fast responses
  score: 0.7
- name: other
  semantic_cache_similarity_threshold: 0.75  # Relaxed
  score: 0.7
```
## Performance Characteristics

| Domain | Reasoning | Cache Threshold | Avg Latency | Use Case |
|--------|-----------|-----------------|-------------|----------|
| Math | ✅ | 0.85 | 2-5s | Step-by-step solutions |
| Physics | ✅ | 0.85 | 2-5s | Derivations |
| Chemistry | ✅ | 0.85 | 2-5s | Mechanisms |
| Health | ❌ | 0.95 | 500ms | Safety-critical |
| Law | ❌ | 0.85 | 500ms | Compliance |
| Business | ❌ | 0.80 | 300ms | Fast insights |
| Other | ❌ | 0.75 | 200ms | General queries |
## Cost Optimization Strategy

1. **Reasoning Budget**: Enable reasoning only for STEM (30% of queries) → 60% cost reduction
2. **Caching Strategy**: High similarity thresholds for sensitive domains → 70% hit rate
3. **Model Selection**: Lower scores for low-value domains → cheaper models
4. **PII Detection**: Enable only for health/law → reduced processing overhead
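As a back-of-envelope check on the reasoning-budget point: if a reasoning call costs about 7x a fast call (an assumed ratio, not a measured one) and 30% of traffic is STEM, selective reasoning lands at roughly the 60% figure:

```python
# Back-of-envelope cost model for selective reasoning. All numbers
# are illustrative assumptions, not measurements from the router.
reasoning_cost = 7.0   # assumed relative cost of a reasoning-enabled call
fast_cost = 1.0        # baseline fast call
stem_share = 0.30      # fraction of queries routed to reasoning models

# Strategy A: reasoning for every query.
all_reasoning = reasoning_cost

# Strategy B: reasoning only for STEM, fast path for the rest.
selective = stem_share * reasoning_cost + (1 - stem_share) * fast_cost

savings = 1 - selective / all_reasoning
print(f"average cost per query: {selective:.1f} vs {all_reasoning:.1f}")
print(f"savings: {savings:.0%}")
```

Under these assumptions the average cost drops from 7.0 to 2.8 per query, i.e. 60% savings; the real ratio depends on the models and serving setup.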
## Reference

See [bert_classification.yaml](https://github.com/vllm-project/semantic-router/blob/main/config/intelligent-routing/in-tree/bert_classification.yaml) for the complete configuration.
