Commit 48a5a02

docs: add NVIDIA Dynamo integration proposal
Signed-off-by: bitliu <[email protected]>
1 parent 36fce5d commit 48a5a02

1 file changed, +82 -68 lines changed

website/docs/proposals/nvidia-dynamo-integration.md

Lines changed: 82 additions & 68 deletions
@@ -517,82 +517,96 @@ prompt_guard:

  ```mermaid
  graph TB
-     subgraph "Client Layer"
-         Client[LLM Application<br/>OpenAI SDK]
-     end
-
-     subgraph "Semantic Intelligence Layer"
-         Gateway[Envoy Gateway<br/>:8080]
-         ExtProc[Semantic Router ExtProc<br/>:50051]
-
-         subgraph "Semantic Components"
-             Classifier[BERT Classifier<br/>ModernBERT]
+     Client[LLM Application<br/>OpenAI SDK]
+
+     subgraph SIL["Semantic Intelligence Layer"]
+         direction TB
+         Gateway[Envoy Gateway :8080]
+         ExtProc[Semantic Router ExtProc :50051]
+
+         subgraph SC["Semantic Components"]
+             direction LR
+             Classifier[BERT Classifier]
              PIIDetector[PII Detector]
              JailbreakGuard[Jailbreak Guard]
-             SemanticCache[Semantic Cache<br/>Redis/In-Memory]
-             ToolSelector[Tool Selector]
          end
+
+         SemanticCache[Semantic Cache]
+         ToolSelector[Tool Selector]
      end
-
-     subgraph "NVIDIA Dynamo Layer"
-         DynamoFrontend[Dynamo Frontend<br/>:8000]
-         DynamoRouter[KV Router<br/>KV-Aware Routing]
-         Planner[Planner<br/>Dynamic Scaling]
-         KVBM[KV Block Manager<br/>Global Registry]
+
+     subgraph DL["NVIDIA Dynamo Layer"]
+         direction TB
+         DynamoFrontend[Dynamo Frontend :8000]
+
+         subgraph DR["Routing & Management"]
+             direction LR
+             DynamoRouter[KV Router]
+             KVBM[KV Block Manager]
+         end
+
+         Planner[Planner - Dynamic Scaling]
      end
-
-     subgraph "Execution Layer"
-         subgraph "Model Pool: deepseek-v31"
-             Worker1[Prefill Worker 1<br/>vLLM]
-             Worker2[Decode Worker 1<br/>vLLM]
+
+     subgraph EL["Execution Layer - Worker Pools"]
+         direction TB
+
+         subgraph MP1["deepseek-v31"]
+             direction LR
+             W1[Prefill Worker]
+             W2[Decode Worker]
          end
-
-         subgraph "Model Pool: phi4"
-             Worker3[Prefill Worker 2<br/>vLLM]
-             Worker4[Decode Worker 2<br/>vLLM]
+
+         subgraph MP2["phi4"]
+             direction LR
+             W3[Prefill Worker]
+             W4[Decode Worker]
          end
-
-         subgraph "Model Pool: qwen3"
-             Worker5[Worker 3<br/>SGLang]
+
+         subgraph MP3["qwen3"]
+             W5[Worker - SGLang]
          end
      end
-
-     subgraph "Storage Layer"
-         Redis[(Redis<br/>Semantic Cache)]
-         SystemMem[(System Memory<br/>KV Cache Offload)]
-         NVMe[(NVMe<br/>Cold KV Cache)]
+
+     subgraph SL["Storage Layer"]
+         direction LR
+         Milvus[(Milvus<br/>Semantic Cache)]
+         SystemMem[(System Memory<br/>KV Offload)]
+         NVMe[(NVMe<br/>Cold Cache)]
      end
-
-     Client -->|1. OpenAI API Request| Gateway
-     Gateway -->|2. ExtProc gRPC| ExtProc
+
+     Client -->|1. Request| Gateway
+     Gateway <-->|2. ExtProc| ExtProc
      ExtProc --> Classifier
      ExtProc --> PIIDetector
      ExtProc --> JailbreakGuard
      ExtProc --> SemanticCache
      ExtProc --> ToolSelector
-
-     ExtProc -->|3. Enriched Request<br/>X-VSR-Model: deepseek-v31<br/>X-VSR-Category: math<br/>X-VSR-Reasoning: true| Gateway
-     Gateway -->|4. Forward to Dynamo| DynamoFrontend
-
-     DynamoFrontend -->|5. Model-Filtered Routing| DynamoRouter
-     DynamoRouter <-->|KV Cache State| KVBM
-     DynamoRouter -->|6. Worker Selection| Worker1
-     DynamoRouter -->|6. Worker Selection| Worker2
-
-     Planner -->|Dynamic Scaling| Worker1
-     Planner -->|Dynamic Scaling| Worker2
-     Planner -->|Dynamic Scaling| Worker3
-     Planner -->|Dynamic Scaling| Worker4
-     Planner -->|Dynamic Scaling| Worker5
-
-     SemanticCache <--> Redis
+
+     Gateway -->|3. Enriched Request| DynamoFrontend
+     DynamoFrontend --> DynamoRouter
+     DynamoRouter <--> KVBM
+
+     DynamoRouter --> W1
+     DynamoRouter --> W2
+     DynamoRouter -.-> W3
+     DynamoRouter -.-> W4
+     DynamoRouter -.-> W5
+
+     Planner -.-> W1
+     Planner -.-> W2
+     Planner -.-> W3
+     Planner -.-> W4
+     Planner -.-> W5
+
+     SemanticCache <--> Milvus
      KVBM <--> SystemMem
      KVBM <--> NVMe
-
-     Worker1 -->|7. Response| DynamoFrontend
-     DynamoFrontend -->|8. Response| Gateway
-     Gateway -->|9. Response| Client
-
+
+     W1 -->|4. Response| DynamoFrontend
+     DynamoFrontend --> Gateway
+     Gateway --> Client
+
      style ExtProc fill:#e1f5ff
      style DynamoRouter fill:#c8e6c9
      style SemanticCache fill:#fff9c4
@@ -604,7 +618,7 @@ graph TB
  1. **Semantic Intelligence Layer (Semantic Router)**
     - Envoy Gateway with ExtProc for request interception
     - BERT-based classification and security filtering
-    - Semantic caching with Redis backend
+    - Semantic caching with Milvus backend
     - Request enrichment with routing metadata

  2. **Infrastructure Optimization Layer (Dynamo)**
@@ -619,7 +633,7 @@ graph TB
     - Backend-agnostic execution

  4. **Storage Layer**
-    - Redis for semantic cache
+    - Milvus for semantic cache
     - System memory for KV cache offload
     - NVMe for cold KV cache storage

@@ -644,7 +658,7 @@ graph TB
  │ │
  │ Step 3: Semantic Cache Lookup │
  │ - Generate BERT embedding for query │
- │ - Search Redis for similar queries (threshold: 0.85)
+ │ - Search Milvus for similar queries (threshold: 0.85) │
  │ - Action: Return cached response if HIT │
  │ - Latency: Very low (cache hit), Low (cache miss) │
  │ │
@@ -708,7 +722,7 @@ graph TB
  │ │
  │ Step 11: Semantic Cache Update │
  │ - Semantic Router receives response via ExtProc │
- │ - Store query embedding + response in Redis
+ │ - Store query embedding + response in Milvus
  │ - TTL: 7200 seconds (configurable) │
  │ │
  │ Step 12: Response to Client │
@@ -728,7 +742,7 @@ The integration leverages **two complementary caching layers**:
  - **Granularity:** Entire request-response pairs
  - **Matching:** Embedding similarity (cosine distance)
  - **Threshold:** 0.85 (configurable)
- - **Backend:** Redis or in-memory
+ - **Backend:** Milvus (vector database)
  - **Benefit:** Avoids inference entirely for similar queries
  - **Example:** "What is 2+2?" ≈ "Calculate 2 plus 2" (similarity: 0.91)

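
The semantic-cache properties listed in this hunk (embedding similarity, a 0.85 threshold, and the 7200-second TTL from Step 11) can be sketched in a few lines. This is only an illustrative in-memory stand-in, not the Semantic Router's implementation and not the Milvus client API: `SemanticCache`, `store`, `lookup`, and `cosine_similarity` are hypothetical names, and embeddings are assumed to arrive as plain lists of floats.

```python
import math
import time
from dataclasses import dataclass, field


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


@dataclass
class SemanticCache:
    """Hypothetical in-memory stand-in for a vector-database-backed cache."""
    threshold: float = 0.85      # minimum similarity that counts as a HIT
    ttl_seconds: float = 7200.0  # entries expire after this many seconds
    entries: list = field(default_factory=list)  # (embedding, response, expiry)

    def store(self, embedding, response, now=None):
        """Store a query embedding with its response and an expiry timestamp."""
        now = time.time() if now is None else now
        self.entries.append((list(embedding), response, now + self.ttl_seconds))

    def lookup(self, embedding, now=None):
        """Return the best-matching cached response, or None on a miss."""
        now = time.time() if now is None else now
        # Drop expired entries before searching.
        self.entries = [e for e in self.entries if e[2] > now]
        best_score, best_response = 0.0, None
        for vec, response, _expiry in self.entries:
            score = cosine_similarity(embedding, vec)
            if score > best_score:
                best_score, best_response = score, response
        if best_response is not None and best_score >= self.threshold:
            return best_response
        return None
```

A linear scan is used here purely for clarity; a vector database would replace it with an indexed nearest-neighbor search, while the threshold and TTL checks stay conceptually the same.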
@@ -1157,7 +1171,7 @@ Dynamo Frontend discovers workers through Kubernetes Headless Services, which pr

  2. **Performance Optimization:**
     - Parallel cache lookup and classification
-    - Redis connection pooling
+    - Milvus connection pooling
     - Cache warming strategies

  3. **Testing:**
@@ -1332,7 +1346,7 @@ prompt_guard:

  **Best Practices:**

- 1. **Cache Encryption:** Encrypt Redis cache at rest and in transit
+ 1. **Cache Encryption:** Encrypt Milvus cache at rest and in transit
  2. **TTL Policies:** Automatic expiration of cached data (default: 2 hours)
  3. **Data Locality:** Deploy in compliance-approved regions
  4. **Audit Logging:** Comprehensive logs for compliance audits
@@ -1376,15 +1390,15 @@ prompt_guard:
  - Dynamo performs default routing
  - **Mitigation:** Deploy 3+ replicas with anti-affinity

- **Failure Scenario 2: Redis Cache Unavailable**
+ **Failure Scenario 2: Milvus Cache Unavailable**

  - **Detection:** Connection errors, timeout
  - **Impact:** No semantic caching (cache misses)
  - **Recovery:**
    - Semantic Router continues with in-memory cache
    - All requests forwarded to Dynamo
    - Performance degradation but no outage
- - **Mitigation:** Redis Sentinel or Redis Cluster for HA
+ - **Mitigation:** Milvus cluster deployment for HA
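
The recovery path in Failure Scenario 2 (keep serving from an in-memory cache when the primary backend is unreachable) might look roughly like the sketch below. Everything here is an assumption made for illustration: `FallbackCache`, `VectorStoreUnavailable`, and `UnreachableBackend` are hypothetical names, the primary backend is assumed to expose `get`/`put`, and exact-key lookup stands in for similarity search to keep the example short.

```python
class VectorStoreUnavailable(Exception):
    """Hypothetical error raised when the primary cache backend cannot be reached."""


class UnreachableBackend:
    """Stub primary backend that always fails, used to simulate an outage."""

    def get(self, key):
        raise VectorStoreUnavailable("connection refused")

    def put(self, key, value):
        raise VectorStoreUnavailable("connection refused")


class FallbackCache:
    """Tries the primary backend first and degrades to a local dict on failure."""

    def __init__(self, primary, local=None):
        self.primary = primary
        self.local = {} if local is None else local
        self.degraded = False  # True while the primary backend is failing

    def get(self, key):
        try:
            value = self.primary.get(key)
            self.degraded = False
            return value
        except VectorStoreUnavailable:
            # Outage: answer from the in-memory cache (None means a miss,
            # so the request is simply forwarded to the inference layer).
            self.degraded = True
            return self.local.get(key)

    def put(self, key, value):
        # Always write locally so the fallback has data during an outage.
        self.local[key] = value
        try:
            self.primary.put(key, value)
            self.degraded = False
        except VectorStoreUnavailable:
            self.degraded = True


cache = FallbackCache(UnreachableBackend())
cache.put("What is 2+2?", "4")
print(cache.get("What is 2+2?"), cache.degraded)
```

The `degraded` flag is one possible hook for the "Detection" bullet: it could feed a health metric or alert, though how the real system surfaces this state is not specified in the proposal.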

  **Failure Scenario 3: Dynamo Frontend Unavailable**
