docs: add NVIDIA Dynamo integration proposal

Xunzhuo · Xunzhuo · commit 48a5a02106b7 · 2025-10-08T23:54:07.000+08:00
Signed-off-by: bitliu <bitliu@tencent.com>
diff --git a/website/docs/proposals/nvidia-dynamo-integration.md b/website/docs/proposals/nvidia-dynamo-integration.md
@@ -517,82 +517,96 @@ prompt_guard:
 
 ```mermaid
 graph TB
-    subgraph "Client Layer"
-        Client[LLM Application<br/>OpenAI SDK]
-    end
-    
-    subgraph "Semantic Intelligence Layer"
-        Gateway[Envoy Gateway<br/>:8080]
-        ExtProc[Semantic Router ExtProc<br/>:50051]
-        
-        subgraph "Semantic Components"
-            Classifier[BERT Classifier<br/>ModernBERT]
+    Client[LLM Application<br/>OpenAI SDK]
+
+    subgraph SIL["Semantic Intelligence Layer"]
+        direction TB
+        Gateway[Envoy Gateway :8080]
+        ExtProc[Semantic Router ExtProc :50051]
+
+        subgraph SC["Semantic Components"]
+            direction LR
+            Classifier[BERT Classifier]
             PIIDetector[PII Detector]
             JailbreakGuard[Jailbreak Guard]
-            SemanticCache[Semantic Cache<br/>Redis/In-Memory]
-            ToolSelector[Tool Selector]
         end
+
+        SemanticCache[Semantic Cache]
+        ToolSelector[Tool Selector]
     end
-    
-    subgraph "NVIDIA Dynamo Layer"
-        DynamoFrontend[Dynamo Frontend<br/>:8000]
-        DynamoRouter[KV Router<br/>KV-Aware Routing]
-        Planner[Planner<br/>Dynamic Scaling]
-        KVBM[KV Block Manager<br/>Global Registry]
+
+    subgraph DL["NVIDIA Dynamo Layer"]
+        direction TB
+        DynamoFrontend[Dynamo Frontend :8000]
+
+        subgraph DR["Routing & Management"]
+            direction LR
+            DynamoRouter[KV Router]
+            KVBM[KV Block Manager]
+        end
+
+        Planner[Planner - Dynamic Scaling]
     end
-    
-    subgraph "Execution Layer"
-        subgraph "Model Pool: deepseek-v31"
-            Worker1[Prefill Worker 1<br/>vLLM]
-            Worker2[Decode Worker 1<br/>vLLM]
+
+    subgraph EL["Execution Layer - Worker Pools"]
+        direction TB
+
+        subgraph MP1["deepseek-v31"]
+            direction LR
+            W1[Prefill Worker]
+            W2[Decode Worker]
         end
-        
-        subgraph "Model Pool: phi4"
-            Worker3[Prefill Worker 2<br/>vLLM]
-            Worker4[Decode Worker 2<br/>vLLM]
+
+        subgraph MP2["phi4"]
+            direction LR
+            W3[Prefill Worker]
+            W4[Decode Worker]
         end
-        
-        subgraph "Model Pool: qwen3"
-            Worker5[Worker 3<br/>SGLang]
+
+        subgraph MP3["qwen3"]
+            W5[Worker - SGLang]
         end
     end
-    
-    subgraph "Storage Layer"
-        Redis[(Redis<br/>Semantic Cache)]
-        SystemMem[(System Memory<br/>KV Cache Offload)]
-        NVMe[(NVMe<br/>Cold KV Cache)]
+
+    subgraph SL["Storage Layer"]
+        direction LR
+        Milvus[(Milvus<br/>Semantic Cache)]
+        SystemMem[(System Memory<br/>KV Offload)]
+        NVMe[(NVMe<br/>Cold Cache)]
     end
-    
-    Client -->|1. OpenAI API Request| Gateway
-    Gateway -->|2. ExtProc gRPC| ExtProc
+
+    Client -->|1. Request| Gateway
+    Gateway <-->|2. ExtProc| ExtProc
     ExtProc --> Classifier
     ExtProc --> PIIDetector
     ExtProc --> JailbreakGuard
     ExtProc --> SemanticCache
     ExtProc --> ToolSelector
-    
-    ExtProc -->|3. Enriched Request<br/>X-VSR-Model: deepseek-v31<br/>X-VSR-Category: math<br/>X-VSR-Reasoning: true| Gateway
-    Gateway -->|4. Forward to Dynamo| DynamoFrontend
-    
-    DynamoFrontend -->|5. Model-Filtered Routing| DynamoRouter
-    DynamoRouter <-->|KV Cache State| KVBM
-    DynamoRouter -->|6. Worker Selection| Worker1
-    DynamoRouter -->|6. Worker Selection| Worker2
-    
-    Planner -->|Dynamic Scaling| Worker1
-    Planner -->|Dynamic Scaling| Worker2
-    Planner -->|Dynamic Scaling| Worker3
-    Planner -->|Dynamic Scaling| Worker4
-    Planner -->|Dynamic Scaling| Worker5
-    
-    SemanticCache <--> Redis
+
+    Gateway -->|3. Enriched Request| DynamoFrontend
+    DynamoFrontend --> DynamoRouter
+    DynamoRouter <--> KVBM
+
+    DynamoRouter --> W1
+    DynamoRouter --> W2
+    DynamoRouter -.-> W3
+    DynamoRouter -.-> W4
+    DynamoRouter -.-> W5
+
+    Planner -.-> W1
+    Planner -.-> W2
+    Planner -.-> W3
+    Planner -.-> W4
+    Planner -.-> W5
+
+    SemanticCache <--> Milvus
     KVBM <--> SystemMem
     KVBM <--> NVMe
-    
-    Worker1 -->|7. Response| DynamoFrontend
-    DynamoFrontend -->|8. Response| Gateway
-    Gateway -->|9. Response| Client
-    
+
+    W1 -->|4. Response| DynamoFrontend
+    DynamoFrontend --> Gateway
+    Gateway --> Client
+
     style ExtProc fill:#e1f5ff
     style DynamoRouter fill:#c8e6c9
     style SemanticCache fill:#fff9c4
@@ -604,7 +618,7 @@ graph TB
 1. **Semantic Intelligence Layer (Semantic Router)**
    - Envoy Gateway with ExtProc for request interception
    - BERT-based classification and security filtering
-   - Semantic caching with Redis backend
+   - Semantic caching with Milvus backend
    - Request enrichment with routing metadata
 
 2. **Infrastructure Optimization Layer (Dynamo)**
@@ -619,7 +633,7 @@ graph TB
    - Backend-agnostic execution
 
 4. **Storage Layer**
-   - Redis for semantic cache
+   - Milvus for semantic cache
    - System memory for KV cache offload
    - NVMe for cold KV cache storage
 
@@ -644,7 +658,7 @@ graph TB
 │                                                                              │
 │ Step 3: Semantic Cache Lookup                                               │
 │   - Generate BERT embedding for query                                       │
-│   - Search Redis for similar queries (threshold: 0.85)                      │
+│   - Search Milvus for similar queries (threshold: 0.85)                     │
 │   - Action: Return cached response if HIT                                   │
 │   - Latency: Very low (cache hit), Low (cache miss)                         │
 │                                                                              │
@@ -708,7 +722,7 @@ graph TB
 │                                                                              │
 │ Step 11: Semantic Cache Update                                              │
 │   - Semantic Router receives response via ExtProc                           │
-│   - Store query embedding + response in Redis                               │
+│   - Store query embedding + response in Milvus                              │
 │   - TTL: 7200 seconds (configurable)                                        │
 │                                                                              │
 │ Step 12: Response to Client                                                 │
@@ -728,7 +742,7 @@ The integration leverages **two complementary caching layers**:
 - **Granularity:** Entire request-response pairs
 - **Matching:** Embedding similarity (cosine distance)
 - **Threshold:** 0.85 (configurable)
-- **Backend:** Redis or in-memory
+- **Backend:** Milvus (vector database)
 - **Benefit:** Avoids inference entirely for similar queries
 - **Example:** "What is 2+2?" ≈ "Calculate 2 plus 2" (similarity: 0.91)
 
@@ -1157,7 +1171,7 @@ Dynamo Frontend discovers workers through Kubernetes Headless Services, which pr
 
 2. **Performance Optimization:**
    - Parallel cache lookup and classification
-   - Redis connection pooling
+   - Milvus connection pooling
    - Cache warming strategies
 
 3. **Testing:**
@@ -1332,7 +1346,7 @@ prompt_guard:
 
 **Best Practices:**
 
-1. **Cache Encryption:** Encrypt Redis cache at rest and in transit
+1. **Cache Encryption:** Encrypt Milvus cache at rest and in transit
 2. **TTL Policies:** Automatic expiration of cached data (default: 2 hours)
 3. **Data Locality:** Deploy in compliance-approved regions
 4. **Audit Logging:** Comprehensive logs for compliance audits
@@ -1376,15 +1390,15 @@ prompt_guard:
   - Dynamo performs default routing
 - **Mitigation:** Deploy 3+ replicas with anti-affinity
 
-**Failure Scenario 2: Redis Cache Unavailable**
+**Failure Scenario 2: Milvus Cache Unavailable**
 
 - **Detection:** Connection errors, timeout
 - **Impact:** No semantic caching (cache misses)
 - **Recovery:**
   - Semantic Router continues with in-memory cache
   - All requests forwarded to Dynamo
   - Performance degradation but no outage
-- **Mitigation:** Redis Sentinel or Redis Cluster for HA
+- **Mitigation:** Milvus cluster deployment for HA
 
 **Failure Scenario 3: Dynamo Frontend Unavailable**