# LLM Router Extension - Design Document

## Overview

The LLM Router is an extension to RedisVL that provides intelligent, cost-optimized LLM model selection using semantic routing. Instead of routing queries to topics (like SemanticRouter), it routes queries to **model tiers** - selecting the cheapest LLM capable of handling each task.

## Problem Statement

### The LLM Cost Problem
Modern applications often default to using the most capable (and expensive) LLM for all queries, even when simpler models would suffice:
- "Hello, how are you?" -> Claude Opus 4.5 ($5/M tokens) - the default choice
- "Hello, how are you?" -> GPT-4.1 Nano ($0.10/M tokens) - 50x cheaper and sufficient

### Existing Solutions and Their Limitations

**RouteLLM** (LMSys):
- Binary classification only (strong vs. weak model)
- No support for more than two tiers
- Requires training data or preference matrices

**NVIDIA LLM Router Blueprint**:
- Complexity classification approach (simple/moderate/complex)
- Provides the taxonomy basis but no open-source Redis-native implementation

**RouterArena / Bloom's Taxonomy Approach**:
- Maps query complexity to Bloom's cognitive levels
- Informs our tier design but lacks production routing infrastructure

**OpenRouter Auto-Router**:
- Black-box routing decisions
- Data flows through third-party servers
- No transparency into why a model was selected
- Cannot be self-hosted or customized

**NotDiamond**:
- Proprietary ML model for routing
- Requires API calls for every routing decision
- No local/offline capability

**FrugalGPT**:
- Sequential cascade approach (try cheap first, escalate)
- Higher latency due to serial model calls

## Solution: Semantic Model Tier Routing

Repurpose RedisVL's battle-tested SemanticRouter for model selection:

```
SemanticRouter      ->  LLMRouter
-----------------------------------------
Route               ->  ModelTier
route.name          ->  tier.name (simple/standard/expert)
route.references    ->  tier.references (task complexity examples)
route.metadata      ->  tier.metadata (cost, capabilities)
RouteMatch          ->  LLMRouteMatch (includes model string)
```
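
The mapping above implies two small schemas. Here is a dependency-free sketch of their shape (the actual implementation uses Pydantic models; any field not named in the mapping is an assumption):

```python
# Sketch of the tier/match shapes implied by the mapping above.
# Fields beyond name/model/references/metadata are assumptions.
from dataclasses import dataclass, field
from typing import Dict, List, Optional

@dataclass
class ModelTier:
    name: str                         # "simple" / "standard" / "expert"
    model: str                        # LiteLLM-style "provider/model" string
    references: List[str]             # example phrases at this complexity level
    metadata: Dict[str, float] = field(default_factory=dict)
    distance_threshold: float = 0.5   # per-tier match threshold

@dataclass
class LLMRouteMatch:
    tier: Optional[str] = None        # matched tier name, or None if no match
    model: Optional[str] = None       # model string to hand to the LLM client
    distance: Optional[float] = None  # semantic distance to nearest reference

tier = ModelTier(
    name="simple",
    model="openai/gpt-4.1-nano",
    references=["hello", "thanks"],
    metadata={"cost_per_1k_input": 0.0001},
)
```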

### Architecture

```
+---------------------------------------------------------------+
|                           LLMRouter                           |
+---------------------------------------------------------------+
|  +-------------+    +-------------+    +-------------+        |
|  |   Simple    |    |  Standard   |    |   Expert    |        |
|  |    Tier     |    |    Tier     |    |    Tier     |        |
|  +-------------+    +-------------+    +-------------+        |
|  | gpt-4.1-nano|    | sonnet 4.5  |    | opus 4.5    |        |
|  | $0.10/M     |    | $3/M        |    | $5/M        |        |
|  | threshold:  |    | threshold:  |    | threshold:  |        |
|  |   0.5       |    |   0.6       |    |   0.7       |        |
|  +-------------+    +-------------+    +-------------+        |
|         |                  |                  |               |
|         +------------------+------------------+               |
|                            v                                  |
|                +------------------------+                     |
|                |   Redis Vector Index   |                     |
|                |   (reference phrases)  |                     |
|                +------------------------+                     |
+---------------------------------------------------------------+
                             |
                             v
                      +-------------+
                      |    Query    |
                      |  "analyze   |
                      |  this..."   |
                      +-------------+
                             |
                             v
                      +-------------+
                      |   LiteLLM   |
                      |  (optional) |
                      +-------------+
```

## Key Design Decisions

### 1. Model Tiers, Not Individual Models

Routes map to **tiers** (simple, standard, expert) rather than specific models. This provides:
- Abstraction from model churn (swap haiku -> gemini-flash without changing routes)
- A clear mental model for users
- Easy cost optimization within tiers

### 2. Bloom's Taxonomy-Grounded Tiers

The default pretrained config maps tiers to Bloom's Taxonomy cognitive levels:
- **Simple** (Remember/Understand): Factual recall, greetings, format conversion
- **Standard** (Apply/Analyze): Code explanation, summarization, moderate analysis
- **Expert** (Evaluate/Create): Research, architecture, formal reasoning

This is informed by RouterArena's finding that cognitive complexity correlates with model capability requirements.
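
To make the mapping concrete, here are illustrative reference phrases for each level (hypothetical examples, not the actual pretrained phrases):

```python
# Hypothetical reference phrases per Bloom's level; the shipped pretrained
# config uses its own (larger) phrase sets.
bloom_references = {
    "simple": [    # Remember / Understand
        "what is the capital of France?",
        "convert this JSON to YAML",
        "hi, how are you today?",
    ],
    "standard": [  # Apply / Analyze
        "explain what this function does",
        "summarize this article in three bullet points",
        "why is this database query slow?",
    ],
    "expert": [    # Evaluate / Create
        "critique this system design and propose an alternative",
        "prove this algorithm's worst-case bound is tight",
        "draft a research proposal on retrieval-augmented generation",
    ],
}
```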

### 3. LiteLLM-Compatible Model Strings

Tier model identifiers use the LiteLLM format (`provider/model`):
```python
ModelTier(
    name="standard",
    model="anthropic/claude-sonnet-4-5",  # Works directly with LiteLLM
    ...
)
```

### 4. Per-Tier Distance Thresholds

Each tier has its own `distance_threshold`, allowing fine-grained control:
```python
simple_tier = ModelTier(..., distance_threshold=0.5)  # Strict match
expert_tier = ModelTier(..., distance_threshold=0.7)  # Looser match
```
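
The effect is that a tier is only eligible when the query's distance to that tier's nearest reference falls under the tier's own cutoff. A minimal sketch of that selection logic (assumed; the real router evaluates this via a Redis vector query, not a Python loop):

```python
# Minimal sketch of per-tier threshold matching (assumed logic; the real
# router runs this as a Redis vector search, not a Python loop).
def eligible_tiers(distances, thresholds):
    """Keep tiers whose best reference distance is within their threshold."""
    return {
        tier: d for tier, d in distances.items()
        if d <= thresholds[tier]
    }

distances = {"simple": 0.62, "standard": 0.41, "expert": 0.55}
thresholds = {"simple": 0.5, "standard": 0.6, "expert": 0.7}

matches = eligible_tiers(distances, thresholds)
# "simple" is dropped (0.62 > 0.5); "standard" and "expert" remain.
best = min(matches, key=matches.get)  # -> "standard"
```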

### 5. Cost-Aware Routing

When `cost_optimization=True`, the router adds a cost penalty to distances:
```python
adjusted_distance = distance + (cost_per_1k * cost_weight)
```
This prefers cheaper tiers when semantic distances are close.
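
A small worked example of the adjustment (`cost_weight` and the exact scaling are assumptions; the document only specifies the formula above):

```python
# Worked sketch of the cost penalty. cost_weight is an assumed tunable;
# higher values push routing toward cheaper tiers.
cost_weight = 10.0

candidates = {
    # tier: (semantic_distance, cost_per_1k_input)
    "standard": (0.40, 0.003),
    "expert":   (0.39, 0.005),
}

adjusted = {
    tier: dist + cost * cost_weight
    for tier, (dist, cost) in candidates.items()
}
# standard: 0.40 + 0.03 = 0.43 vs. expert: 0.39 + 0.05 = 0.44,
# so the cheaper "standard" tier wins the near-tie.
winner = min(adjusted, key=adjusted.get)
```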

### 6. Pretrained Configs with Embedded Vectors

The built-in `default.json` provides a ready-to-use 3-tier configuration:
```python
# Instant setup - no embedding model needed at load time
router = LLMRouter.from_pretrained("default", redis_client=client)
```

The pretrained config includes pre-computed embeddings from
`sentence-transformers/all-mpnet-base-v2`, with 18 reference phrases per tier
covering the Bloom's Taxonomy spectrum.

Custom configs can also be exported and shared:
```python
# Export (one-time, with embedding model)
router.export_with_embeddings("my_router.json")

# Import (no embedding needed)
router = LLMRouter.from_pretrained("my_router.json", redis_client=client)
```
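
What makes the no-embedding import possible is that each reference phrase is exported together with its precomputed vector, so loading only writes vectors into the Redis index. A hypothetical sketch of that file shape (the real `Pretrained*` schemas may differ; all field names here are assumptions):

```python
import json

# Hypothetical export shape: every reference carries its precomputed vector,
# so importing never needs to call an embedding model.
pretrained = {
    "name": "default",
    "vectorizer": "sentence-transformers/all-mpnet-base-v2",
    "dims": 768,  # all-mpnet-base-v2 output dimension
    "tiers": [
        {
            "name": "simple",
            "model": "openai/gpt-4.1-nano",
            "references": [
                # vectors truncated for illustration; real ones have 768 floats
                {"text": "hello", "vector": [0.012, -0.034]},
            ],
        },
    ],
}

# Round-trip through JSON, as an export/import would.
loaded = json.loads(json.dumps(pretrained))
```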

### 7. Async Support

`AsyncLLMRouter` provides the same functionality using async I/O. Since
`__init__` cannot be async, it uses a `create()` classmethod factory:

```python
router = await AsyncLLMRouter.create(
    name="my-router",
    tiers=tiers,
    redis_client=async_client,
)
match = await router.route("hello")
```

Key async method mapping:

| Sync (`LLMRouter`)  | Async (`AsyncLLMRouter`)  |
|---------------------|---------------------------|
| `__init__()`        | `await create()`          |
| `from_existing()`   | `await from_existing()`   |
| `route()`           | `await route()`           |
| `route_many()`      | `await route_many()`      |
| `add_tier()`        | `await add_tier()`        |
| `remove_tier()`     | `await remove_tier()`     |
| `from_dict()`       | `await from_dict()`       |
| `from_pretrained()` | `await from_pretrained()` |
| `delete()`          | `await delete()`          |

## Module Structure

```
redisvl/extensions/llm_router/
+-- __init__.py          # Public exports (LLMRouter, AsyncLLMRouter, schemas)
+-- DESIGN.md            # This document
+-- schema.py            # Pydantic models
|   +-- ModelTier        # Tier definition
|   +-- LLMRouteMatch    # Routing result
|   +-- RoutingConfig    # Router configuration
|   +-- Pretrained*      # Export/import schemas
+-- router.py            # LLMRouter + AsyncLLMRouter implementations
+-- pretrained/
    +-- __init__.py      # Pretrained loader (get_pretrained_path)
    +-- default.json     # Standard 3-tier config (simple/standard/expert)
```

## API Examples

### Basic Usage

```python
from redisvl.extensions.llm_router import LLMRouter, ModelTier

tiers = [
    ModelTier(
        name="simple",
        model="openai/gpt-4.1-nano",
        references=[
            "hello", "hi there", "thanks", "goodbye",
            "what time is it?", "how are you?",
        ],
        metadata={"cost_per_1k_input": 0.0001},
        distance_threshold=0.5,
    ),
    ModelTier(
        name="standard",
        model="anthropic/claude-sonnet-4-5",
        references=[
            "analyze this code for bugs",
            "explain how neural networks learn",
            "compare and contrast these approaches",
        ],
        metadata={"cost_per_1k_input": 0.003},
        distance_threshold=0.6,
    ),
    ModelTier(
        name="expert",
        model="anthropic/claude-opus-4-5",
        references=[
            "prove this mathematical theorem",
            "architect a distributed system",
            "write a research paper analyzing",
        ],
        metadata={"cost_per_1k_input": 0.005},
        distance_threshold=0.7,
    ),
]

router = LLMRouter(
    name="my-llm-router",
    tiers=tiers,
    redis_url="redis://localhost:6379",
)

# Route a query
query = "hello, how's it going?"
match = router.route(query)
print(match.tier)   # "simple"
print(match.model)  # "openai/gpt-4.1-nano"

# Use with LiteLLM (optional integration)
from litellm import completion
response = completion(model=match.model, messages=[{"role": "user", "content": query}])
```

### Cost-Optimized Routing

```python
router = LLMRouter(
    name="cost-aware-router",
    tiers=tiers,
    cost_optimization=True,  # Prefer cheaper tiers when distances are close
    redis_url="redis://localhost:6379",
)
```

### Pretrained Router

```python
# Load without needing an embedding model for the references
router = LLMRouter.from_pretrained(
    "default",  # Built-in config, or path to JSON
    redis_client=client,
)
```

### Async Usage

```python
from redisvl.extensions.llm_router import AsyncLLMRouter

router = await AsyncLLMRouter.create(
    name="my-async-router",
    tiers=tiers,
    redis_url="redis://localhost:6379",
)

match = await router.route("explain how garbage collection works")
print(match.model)  # "anthropic/claude-sonnet-4-5"

# Or load from pretrained
router = await AsyncLLMRouter.from_pretrained("default", redis_client=client)

await router.delete()
```

## Comparison with SemanticRouter

| Feature              | SemanticRouter       | LLMRouter                |
|----------------------|----------------------|--------------------------|
| Purpose              | Topic classification | Model selection          |
| Output               | Route name           | Model string + metadata  |
| Cost awareness       | No                   | Yes                      |
| Pretrained configs   | No                   | Yes                      |
| Per-route thresholds | Yes                  | Yes                      |
| LiteLLM integration  | No                   | Yes (model strings)      |
| Async support        | No                   | Yes (`AsyncLLMRouter`)   |

## Testing

```bash
uv run pytest tests/unit/test_llm_router_schema.py -v
uv run pytest tests/integration/test_llm_router.py -v
uv run pytest tests/integration/test_async_llm_router.py -v
```

## Future Enhancements

### 1. `complete()` Method
Direct LiteLLM integration for one-liner usage:
```python
response = router.complete("analyze this code", messages=[...])
```

### 2. Capability Filtering
Filter tiers by capability before routing:
```python
match = router.route("generate an image", capabilities=["vision"])
```

### 3. Budget Constraints
Enforce cost limits:
```python
router = LLMRouter(..., max_cost_per_1k=0.01)  # Never select opus
```

### 4. Fallback Chains
Define a fallback order for when the primary tier is unavailable:
```python
tier = ModelTier(..., fallback=["standard", "simple"])
```

## References

- [RedisVL SemanticRouter](https://docs.redisvl.com/en/latest/user_guide/semantic_router.html)
- [LiteLLM Model List](https://docs.litellm.ai/docs/providers)
- [RouteLLM](https://github.com/lm-sys/RouteLLM) - LMSys binary router framework
- [NVIDIA LLM Router Blueprint](https://build.nvidia.com/blueprints/llm-router) - Complexity-based routing
- [RouterArena / Bloom's Taxonomy](https://arxiv.org/abs/2412.06644) - Cognitive complexity for routing
- [FrugalGPT](https://arxiv.org/abs/2305.05176) - Cost-efficient LLM strategies
- [OpenRouter](https://openrouter.ai/) - Auto-routing concept
- [NotDiamond](https://notdiamond.ai/) - ML-based model routing
- [Unify.ai](https://unify.ai/) - Quality-cost tradeoff routing