# LoRA Routing

This guide shows how to enable intent-aware LoRA (Low-Rank Adaptation) routing in the Semantic Router:

- Minimal configuration for LoRA routing
- vLLM server setup with LoRA adapters
- Example request/response showing automatic LoRA selection
- Verification steps

## Prerequisites

- A running vLLM server with LoRA support enabled
- LoRA adapter files (fine-tuned for specific domains)
- Envoy and the router (see [Start the router](../../getting-started/quickstart.md))

## 1. Start vLLM with LoRA Adapters

First, start your vLLM server with LoRA support enabled:

```bash
vllm serve meta-llama/Llama-2-7b-hf \
  --enable-lora \
  --lora-modules \
    technical-lora=/path/to/technical-adapter \
    medical-lora=/path/to/medical-adapter \
    legal-lora=/path/to/legal-adapter \
  --host 0.0.0.0 \
  --port 8000
```

**Key flags**:

- `--enable-lora`: Enables LoRA adapter support
- `--lora-modules`: Registers each LoRA adapter by name and path, in the format `adapter-name=/path/to/adapter`

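Before wiring up the router, you can confirm the adapters registered correctly: vLLM's OpenAI-compatible server lists LoRA adapters alongside the base model on its models endpoint. A quick check (the `jq` filter is optional):

```bash
# Each adapter from --lora-modules should appear as its own model entry
curl -s http://localhost:8000/v1/models | jq -r '.data[].id'
# Expected output:
#   meta-llama/Llama-2-7b-hf
#   technical-lora
#   medical-lora
#   legal-lora
```
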
## 2. Minimal Configuration

Put this in `config/config.yaml` (or merge into your existing config):

```yaml
# Category classifier (required for intent detection)
classifier:
  category_model:
    model_id: "models/category_classifier_modernbert-base_model"
    use_modernbert: true
    threshold: 0.6
    use_cpu: true
    category_mapping_path: "models/category_classifier_modernbert-base_model/category_mapping.json"

# vLLM endpoint hosting your base model + LoRA adapters
vllm_endpoints:
  - name: "vllm-primary"
    address: "127.0.0.1"
    port: 8000
    weight: 1

# Define base model and available LoRA adapters
model_config:
  "llama2-7b":
    reasoning_family: "llama2"
    preferred_endpoints: ["vllm-primary"]
    # IMPORTANT: Define all available LoRA adapters here
    loras:
      - name: "technical-lora"
        description: "Optimized for programming and technical questions"
      - name: "medical-lora"
        description: "Specialized for medical and healthcare domain"
      - name: "legal-lora"
        description: "Fine-tuned for legal questions"

# Default model for fallback
default_model: "llama2-7b"

# Categories with LoRA routing
categories:
  - name: "technical"
    description: "Programming, software engineering, and technical questions"
    system_prompt: "You are an expert software engineer."
    model_scores:
      - model: "llama2-7b"          # Base model name
        lora_name: "technical-lora" # LoRA adapter to use
        score: 1.0
        use_reasoning: true
        reasoning_effort: "medium"

  - name: "medical"
    description: "Medical and healthcare questions"
    system_prompt: "You are a medical expert."
    model_scores:
      - model: "llama2-7b"
        lora_name: "medical-lora"   # Different LoRA for medical
        score: 1.0
        use_reasoning: true
        reasoning_effort: "high"

  - name: "legal"
    description: "Legal questions and law-related topics"
    system_prompt: "You are a legal expert."
    model_scores:
      - model: "llama2-7b"
        lora_name: "legal-lora"     # Different LoRA for legal
        score: 1.0
        use_reasoning: true
        reasoning_effort: "high"

  - name: "general"
    description: "General questions"
    system_prompt: "You are a helpful assistant."
    model_scores:
      - model: "llama2-7b"          # No lora_name = uses base model
        score: 0.8
        use_reasoning: false
```

## 3. How It Works

```mermaid
graph TB
    A[User Query] --> B[Semantic Router]
    B --> C[Category Classifier]

    C --> D{Classified Category}
    D -->|Technical| E[technical-lora]
    D -->|Medical| F[medical-lora]
    D -->|Legal| G[legal-lora]
    D -->|General| H[llama2-7b base]

    E --> I[vLLM Server]
    F --> I
    G --> I
    H --> I

    I --> J[Response]
```

**Flow**:

1. User sends a query to the router
2. Category classifier detects the intent (e.g., "technical")
3. Router looks up the best `ModelScore` for that category
4. If `lora_name` is specified, it becomes the final model name
5. Request is sent to vLLM with `model="technical-lora"`
6. vLLM routes to the appropriate LoRA adapter
7. Response is returned to the user

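The net effect is that the forwarded request is roughly what you would get by calling vLLM directly with the adapter name as the model (a sketch; the router also injects the category's `system_prompt`):

```bash
# Approximately what the router forwards to vLLM for a "technical" query
curl -X POST http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "technical-lora",
    "messages": [
      {"role": "system", "content": "You are an expert software engineer."},
      {"role": "user", "content": "Explain async/await in JavaScript"}
    ]
  }'
```
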
### Test Domain-Aware LoRA Routing

Send test queries and verify they're classified correctly:

```bash
# Technical query
curl -X POST http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "MoM", "messages": [{"role": "user", "content": "Explain async/await in JavaScript"}]}'

# Medical query
curl -X POST http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "MoM", "messages": [{"role": "user", "content": "What causes high blood pressure?"}]}'
```

Check the router logs to confirm the correct LoRA adapter is selected for each query.

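The response body should also name the model that served the request, so you can check the selected adapter without tailing logs (a sketch assuming `jq` is installed):

```bash
# The "model" field in the response should echo the chosen adapter
curl -s -X POST http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "MoM", "messages": [{"role": "user", "content": "Explain async/await in JavaScript"}]}' \
  | jq -r '.model'
# Expected (if the query was classified as technical): technical-lora
```
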
## Benefits

- **Domain Expertise**: Each LoRA adapter is fine-tuned for a specific domain
- **Cost Efficiency**: Adapters share the base model's weights, lowering memory usage
- **Easy A/B Testing**: Compare adapter versions by adjusting scores (see the sketch after this list)
- **Flexible Deployment**: Add or remove adapters without restarting the router
- **Automatic Selection**: Users don't need to know which adapter to use

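For instance, comparing two versions of the technical adapter only requires listing both under the category and adjusting their scores. A hypothetical sketch (`technical-lora-v2` is an assumed second adapter, registered with vLLM like the others):

```yaml
categories:
  - name: "technical"
    description: "Programming, software engineering, and technical questions"
    system_prompt: "You are an expert software engineer."
    model_scores:
      - model: "llama2-7b"
        lora_name: "technical-lora-v2" # hypothetical candidate version
        score: 1.0                     # preferred while under evaluation
        use_reasoning: true
      - model: "llama2-7b"
        lora_name: "technical-lora"    # current version
        score: 0.9
        use_reasoning: true
```
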
## Next Steps

- See the [complete LoRA routing example](https://github.com/vllm-project/semantic-router/blob/main/config/intelligent-routing/in-tree/lora_routing_example.yaml)
- Learn about [category configuration](../../overview/categories/configuration.md#lora_name-optional)
- Explore [reasoning routing](./reasoning.md) to combine with LoRA adapters