docs: add blog on VLLM semantic router + Milvus integration #561
---
slug: semantic-router-milvus
title: "vLLM Semantic Router + Milvus: How Semantic Routing and Caching Build Scalable AI Systems the Smart Way"
authors: [minyin]
tags: [semantic-router, milvus, caching, scalability, ai-systems]
---

# vLLM Semantic Router + Milvus: How Semantic Routing and Caching Build Scalable AI Systems the Smart Way

Most AI apps rely on a single model for every request. But that approach quickly runs into limits. Large models are powerful yet expensive, even when they're used for simple queries. Smaller models are cheaper and faster but can't handle complex reasoning. When traffic surges—say your AI app suddenly goes viral with ten million users overnight—the inefficiency of this one-model-for-all setup becomes painfully apparent. Latency spikes, GPU bills explode, and the model that ran fine yesterday starts gasping for air.

<!-- truncate -->

And my friend, you, the engineer behind this app, have to fix it—fast.

Imagine deploying multiple models of varying sizes and having your system automatically select the best one for each request. Simple prompts go to smaller models; complex queries route to larger ones. That's the idea behind the vLLM Semantic Router—a routing mechanism that directs requests based on meaning, not endpoints. It analyzes the semantic content, complexity, and intent of each input to select the most suitable language model, ensuring every query is handled by the model best equipped for it.

To make this even more efficient, the Semantic Router pairs with Milvus, an open-source vector database that serves as a semantic cache layer. Before recomputing a response, it checks whether a semantically similar query has already been processed and instantly retrieves the cached result if found. The result: faster responses, lower costs, and a retrieval system that scales intelligently rather than wastefully.

In this post, we'll dive deeper into how the vLLM Semantic Router works, how Milvus powers its caching layer, and how this architecture can be applied in real-world AI applications.

## What is a Semantic Router?

At its core, a Semantic Router is a system that decides which model should handle a given request based on its meaning, complexity, and intent. Instead of routing everything to one model, it distributes requests intelligently across multiple models to balance accuracy, latency, and cost.

Architecturally, it's built on three key layers: Semantic Routing, Mixture of Models (MoM), and a Cache Layer.

### Semantic Routing Layer

The semantic routing layer is the brain of the system. It analyzes each input—what it's asking, how complex it is, and what kind of reasoning it requires—to select the model best suited for the job. For example, a simple fact lookup might go to a lightweight model, while a multi-step reasoning query is routed to a larger one. This dynamic routing keeps the system responsive even as traffic and query diversity increase.

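To make the idea concrete, here is a minimal sketch of embedding-based routing, assuming similarity between the query and natural-language route descriptions. The encoder, route descriptions, and model names are illustrative assumptions, not the router's actual implementation (which is written in Go and Rust):

```python
# Minimal routing sketch: embed the query, compare it to embeddings of
# route descriptions, and pick the model mapped to the closest route.
# Encoder, routes, and model names are hypothetical examples.
import numpy as np
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("all-MiniLM-L6-v2")

ROUTES = {
    "simple-fact-lookup": "short factual questions with a single clear answer",
    "multi-step-reasoning": "complex questions requiring planning, math, or multi-step logic",
}
MODEL_FOR_ROUTE = {
    "simple-fact-lookup": "small-fast-model",       # hypothetical model name
    "multi-step-reasoning": "large-capable-model",  # hypothetical model name
}

route_vecs = {name: encoder.encode(desc) for name, desc in ROUTES.items()}

def pick_model(query: str) -> str:
    q = encoder.encode(query)
    # Cosine similarity between the query and each route description
    scores = {
        name: float(np.dot(q, v) / (np.linalg.norm(q) * np.linalg.norm(v)))
        for name, v in route_vecs.items()
    }
    return MODEL_FOR_ROUTE[max(scores, key=scores.get)]

print(pick_model("What is the capital of France?"))       # likely the small model
print(pick_model("Prove the sum formula by induction."))  # likely the large model
```
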
### The Mixture of Models (MoM) Layer

The second layer, the Mixture of Models (MoM), integrates multiple models of different sizes and capabilities into one unified system. It's inspired by the Mixture of Experts (MoE) architecture, but instead of picking "experts" inside a single large model, it operates across multiple independent models. This design reduces latency, lowers costs, and avoids being locked into any single model provider.

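As a rough sketch of what such a pool might look like at the gateway, here is a hypothetical registry of independent models with cost and capability metadata that a routing layer could dispatch across (all names, endpoints, and numbers are made up for illustration):

```python
# Hypothetical Mixture-of-Models pool: independent models, possibly from
# different providers, each with rough cost/capability metadata.
from dataclasses import dataclass

@dataclass
class ModelEntry:
    name: str           # model identifier (hypothetical)
    endpoint: str       # OpenAI-compatible endpoint serving the model
    cost_per_1k: float  # rough cost per 1K tokens, arbitrary units
    tier: str           # "light" for simple queries, "heavy" for reasoning

POOL = [
    ModelEntry("small-fast-model", "http://llm-small:8000/v1", 0.1, "light"),
    ModelEntry("large-capable-model", "http://llm-large:8000/v1", 1.0, "heavy"),
]

def candidates(tier: str) -> list[ModelEntry]:
    """Models of a given tier, cheapest first, so the router can fall back."""
    return sorted((m for m in POOL if m.tier == tier),
                  key=lambda m: m.cost_per_1k)
```
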
### The Cache Layer: Where Milvus Makes the Difference

Finally, the cache layer—powered by the Milvus vector database—acts as the system's memory. Before running a new query, it checks whether a semantically similar request has been processed before. If so, it retrieves the cached result instantly, saving compute time and improving throughput.

Traditional caching systems rely on in-memory key-value stores, matching requests by exact strings or templates. That works fine when queries are repetitive and predictable. But real users rarely type the same thing twice. Once the phrasing changes—even slightly—the cache fails to recognize it as the same intent. Over time, the cache hit rate drops, and performance gains vanish as language naturally drifts.

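The failure mode is easy to demonstrate: two paraphrases of the same intent produce different cache keys but very similar embeddings. A minimal sketch, with the encoder choice as an illustrative assumption:

```python
# An exact-match cache misses on a paraphrase, while embedding similarity
# recognizes the two queries as near-duplicates.
import hashlib
import numpy as np
from sentence_transformers import SentenceTransformer

q1 = "What are the main renewable energy sources?"
q2 = "Which sources of renewable energy are most important?"

# Exact-match cache: different strings -> different keys -> cache miss
print(hashlib.sha256(q1.encode()).hexdigest()
      == hashlib.sha256(q2.encode()).hexdigest())  # False

# Semantic comparison: near-identical meaning -> high cosine similarity
enc = SentenceTransformer("all-MiniLM-L6-v2")
a, b = enc.encode(q1), enc.encode(q2)
print(float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))))
# Typically well above a 0.8 threshold for close paraphrases like these
```
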
To fix this, we need caching that understands meaning, not just matching words. That's where semantic retrieval comes in. Instead of comparing strings, it compares embeddings—high-dimensional vector representations that capture semantic similarity. The challenge, though, is scale. Running a brute-force search across millions or billions of vectors on a single machine (with time complexity O(N·d)) is computationally prohibitive: at 100 million 768-dimensional vectors, a single query already costs on the order of 10^11 multiply-adds. Memory costs explode, horizontal scalability collapses, and the system struggles to handle sudden traffic spikes or long-tail queries.

As a distributed vector database purpose-built for large-scale semantic search, Milvus brings the horizontal scalability and fault tolerance this cache layer needs. It stores embeddings efficiently across nodes and performs Approximate Nearest Neighbor (ANN) searches with minimal latency, even at massive scale. With the right similarity thresholds and fallback strategies, Milvus ensures stable, predictable performance—turning the cache layer into a resilient semantic memory for your entire routing system.

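Here is a minimal sketch of what a threshold-plus-fallback cache lookup could look like against the cache collection configured later in this post. The encoder and the `generate()` stub are hypothetical stand-ins, and the collection schema (an `embedding` vector field plus a `response` field) is an assumption for illustration:

```python
# Sketch of a semantic cache lookup with a similarity threshold and a
# fallback to model generation on a miss. Not the router's actual code.
from pymilvus import MilvusClient
from sentence_transformers import SentenceTransformer

client = MilvusClient(uri="http://localhost:19530")  # your Milvus endpoint
encoder = SentenceTransformer("all-MiniLM-L6-v2")    # illustrative encoder
THRESHOLD = 0.8  # mirrors similarity_threshold in the config shown later

def embed(text: str) -> list[float]:
    # Normalized vectors so inner product equals cosine similarity
    return encoder.encode(text, normalize_embeddings=True).tolist()

def generate(query: str) -> str:
    return f"(answer generated by the selected model for: {query})"  # stub

def answer(query: str) -> str:
    q_vec = embed(query)
    hits = client.search(
        collection_name="semantic_cache",
        data=[q_vec],
        limit=1,
        output_fields=["response"],
        search_params={"metric_type": "IP"},  # inner product
    )[0]
    if hits and hits[0]["distance"] >= THRESHOLD:
        return hits[0]["entity"]["response"]  # cache hit: reuse the answer
    result = generate(query)                  # cache miss: call the model
    client.insert(collection_name="semantic_cache",
                  data=[{"embedding": q_vec, "response": result}])
    return result
```
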
## How Developers Are Using Semantic Router + Milvus in Production

The combination of vLLM Semantic Router and Milvus shines in real-world production environments where speed, cost, and reusability all matter.

Three common scenarios stand out:

### 1. Customer Service Q&A

Customer-facing bots handle massive volumes of repetitive queries every day—password resets, account updates, and delivery statuses. This domain is both cost- and latency-sensitive, making it ideal for semantic routing. The router sends routine questions to smaller, faster models and escalates complex or ambiguous ones to larger models for deeper reasoning. Meanwhile, Milvus caches previous Q&A pairs, so when similar queries appear, the system can instantly reuse past answers instead of regenerating them.

### 2. Code Assistance

In developer tools or IDE assistants, many queries overlap—syntax help, API lookups, small debugging hints. By analyzing the semantic structure of each prompt, the router dynamically selects an appropriate model size: lightweight for simple tasks, more capable for multi-step reasoning. Milvus boosts responsiveness further by caching similar coding problems and their solutions, turning prior user interactions into a reusable knowledge base.

### 3. Enterprise Knowledge Base

Enterprise queries tend to repeat over time—policy lookups, compliance references, product FAQs. With Milvus as the semantic cache layer, frequently asked questions and their answers can be stored and retrieved efficiently. This minimizes redundant computation while keeping responses consistent across departments and regions.

Under the hood, the Semantic Router + Milvus pipeline is implemented in Go and Rust for high performance and low latency. Integrated at the gateway layer, it continuously monitors key metrics—like hit rates, routing latency, and model performance—to fine-tune routing strategies in real time.

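Although the pipeline itself is written in Go and Rust, the kind of metrics involved is easy to sketch. Here is a hypothetical Python example using `prometheus_client`; the metric names and labels are illustrative, not the project's actual instrumentation:

```python
# Hypothetical gateway-side metrics: cache hit rate and routing latency.
import time
from prometheus_client import Counter, Histogram

CACHE_LOOKUPS = Counter("semantic_cache_lookups_total",
                        "Semantic cache lookups by outcome", ["result"])
ROUTING_LATENCY = Histogram("routing_latency_seconds",
                            "Time spent choosing a model for a request")

def record_lookup(hit: bool) -> None:
    CACHE_LOOKUPS.labels(result="hit" if hit else "miss").inc()

def timed_route(pick_model, query: str) -> str:
    start = time.perf_counter()
    model = pick_model(query)
    ROUTING_LATENCY.observe(time.perf_counter() - start)
    return model
```
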
## How to Quickly Test Semantic Caching in the Semantic Router

Before deploying semantic caching at scale, it's useful to validate how it behaves in a controlled setup. In this section, we'll walk through a quick local test that shows how the Semantic Router uses Milvus as its semantic cache. You'll see how similar queries hit the cache instantly while new or distinct ones trigger model generation—proving the caching logic in action.

### Prerequisites

- **Container environment**: Docker + Docker Compose
- **Vector database**: a running Milvus service (deployed in step 1)
- **LLM + embedding models**: downloaded locally via the project's Makefile (step 3)

### 1. Deploy the Milvus Vector Database

Download the deployment file:

```bash
wget https://github.com/milvus-io/milvus/releases/download/v2.5.12/milvus-standalone-docker-compose.yml -O docker-compose.yml
```

Start the Milvus service:

```bash
docker-compose up -d
```

Check that the containers are running:

```bash
docker-compose ps -a
```

### 2. Clone the project

```bash
git clone https://github.com/vllm-project/semantic-router.git
```

### 3. Download local models

```bash
cd semantic-router
make download-models
```

### 4. Modify the configuration

Note: set the `semantic_cache` backend type to `"milvus"`.

```bash
vim config.yaml
```

```yaml
semantic_cache:
  enabled: true
  backend_type: "milvus"  # Options: "memory" or "milvus"
  backend_config_path: "config/cache/milvus.yaml"
  similarity_threshold: 0.8
  max_entries: 1000  # Only applies to the memory backend
  ttl_seconds: 3600
  eviction_policy: "fifo"
```

Next, modify the Milvus backend configuration, pointing it at the Milvus service you just deployed:

```bash
vim config/cache/milvus.yaml
```

```yaml
# Milvus connection settings
connection:
  # Milvus server host (change for production deployment)
  host: "192.168.7.xxx"  # For production: use your Milvus cluster endpoint
  # Milvus server port
  port: 19530  # Standard Milvus port
  # Database name (optional, defaults to "default")
  database: "default"
  # Connection timeout in seconds
  timeout: 30
  # Authentication (enable for production)
  auth:
    enabled: false  # Set to true for production
    username: ""    # Your Milvus username
    password: ""    # Your Milvus password
  # TLS/SSL configuration (recommended for production)
  tls:
    enabled: false  # Set to true for secure connections
    cert_file: ""   # Path to client certificate
    key_file: ""    # Path to client private key
    ca_file: ""     # Path to CA certificate

# Collection settings
collection:
  # Name of the collection to store cache entries
  name: "semantic_cache"
  # Description of the collection
  description: "Semantic cache for LLM request-response pairs"
  # Vector field configuration
  vector_field:
    # Name of the vector field
    name: "embedding"
    # Dimension of the embeddings
    dimension: 384  # This value is ignored; the dimension is auto-detected from the embedding model
    # Metric type for similarity calculation
    metric_type: "IP"  # Inner Product (cosine similarity for normalized vectors)
  # Index configuration for the vector field
  index:
    # Index type (HNSW is recommended for most use cases)
    type: "HNSW"
    # Index parameters
    params:
      M: 16               # Number of bi-directional links for each node
      efConstruction: 64  # Search scope during index construction
```

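Optionally, you can sanity-check connectivity to the Milvus service before starting the router. A short snippet (assuming `pip install pymilvus`; substitute your own host):

```python
# Quick connectivity check against the Milvus service configured above.
from pymilvus import MilvusClient

client = MilvusClient(uri="http://192.168.7.xxx:19530")  # your host:port
# The semantic_cache collection appears once the router has created it.
print(client.list_collections())
```
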
### 5. Start the project

Note: if package downloads are slow in your region, consider pointing some of the Dockerfile's dependency sources at local mirrors.

```bash
docker compose --profile testing up --build
```

### 6. Test Requests

Note: we send the same request twice; the first misses the cache, the second hits it.

**First request:**

```bash
echo "=== First request (no cache) ===" && \
time curl -X POST http://localhost:8801/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer test-token" \
  -d '{
    "model": "auto",
    "messages": [
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user", "content": "What are the main renewable energy sources?"}
    ],
    "temperature": 0.7
  }' | jq .
```

Output:

```
real 0m16.546s
user 0m0.116s
sys 0m0.033s
```

**Second request:**

```bash
echo "=== Second request (cache hit) ===" && \
time curl -X POST http://localhost:8801/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer test-token" \
  -d '{
    "model": "auto",
    "messages": [
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user", "content": "What are the main renewable energy sources?"}
    ],
    "temperature": 0.7
  }' | jq .
```

Output:

```
real 0m2.393s
user 0m0.116s
sys 0m0.021s
```

This test demonstrates the Semantic Router's semantic caching in action. By leveraging Milvus as the vector database, it efficiently matches semantically similar queries, improving response times when users ask the same or similar questions.

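To confirm that matching is semantic rather than exact, you can also try a paraphrase of the same question. With the 0.8 similarity threshold configured above, a request like the following would be expected to hit the cache as well, though the exact behavior depends on the embedding model (a hypothetical check using Python's `requests`):

```python
# Paraphrased query: should still hit the semantic cache if its embedding
# is close enough (>= 0.8 here) to the original question's embedding.
import requests

resp = requests.post(
    "http://localhost:8801/v1/chat/completions",
    headers={"Content-Type": "application/json",
             "Authorization": "Bearer test-token"},
    json={
        "model": "auto",
        "messages": [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user",
             "content": "Which sources of renewable energy are most important?"},
        ],
        "temperature": 0.7,
    },
    timeout=60,
)
print(resp.json())
```
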
## Conclusion

As AI workloads grow and cost optimization becomes essential, the combination of vLLM Semantic Router and Milvus provides a practical way to scale intelligently. By routing each query to the right model and caching semantically similar results with a distributed vector database, this setup cuts compute overhead while keeping responses fast and consistent across use cases.

In short, you get smarter scaling—less brute force, more brains.

---

If you'd like to explore this further, join the conversation in our Milvus Discord or open an issue on GitHub. You can also book a 20-minute Milvus Office Hours session for one-on-one guidance, insights, and technical deep dives from the team behind Milvus.