@@ -517,82 +517,96 @@ prompt_guard:

  ```mermaid
  graph TB
- subgraph "Client Layer"
- Client[LLM Application<br/>OpenAI SDK]
- end
-
- subgraph "Semantic Intelligence Layer"
- Gateway[Envoy Gateway<br/>:8080]
- ExtProc[Semantic Router ExtProc<br/>:50051]
-
- subgraph "Semantic Components"
- Classifier[BERT Classifier<br/>ModernBERT]
+ Client[LLM Application<br/>OpenAI SDK]
+
+ subgraph SIL["Semantic Intelligence Layer"]
+ direction TB
+ Gateway[Envoy Gateway :8080]
+ ExtProc[Semantic Router ExtProc :50051]
+
+ subgraph SC["Semantic Components"]
+ direction LR
+ Classifier[BERT Classifier]
  PIIDetector[PII Detector]
  JailbreakGuard[Jailbreak Guard]
- SemanticCache[Semantic Cache<br/>Redis/In-Memory]
- ToolSelector[Tool Selector]
  end
+
+ SemanticCache[Semantic Cache]
+ ToolSelector[Tool Selector]
  end
-
- subgraph "NVIDIA Dynamo Layer"
- DynamoFrontend[Dynamo Frontend<br/>:8000]
- DynamoRouter[KV Router<br/>KV-Aware Routing]
- Planner[Planner<br/>Dynamic Scaling]
- KVBM[KV Block Manager<br/>Global Registry]
+
+ subgraph DL["NVIDIA Dynamo Layer"]
+ direction TB
+ DynamoFrontend[Dynamo Frontend :8000]
+
+ subgraph DR["Routing & Management"]
+ direction LR
+ DynamoRouter[KV Router]
+ KVBM[KV Block Manager]
+ end
+
+ Planner[Planner - Dynamic Scaling]
  end
-
- subgraph "Execution Layer"
- subgraph "Model Pool: deepseek-v31"
- Worker1[Prefill Worker 1<br/>vLLM]
- Worker2[Decode Worker 1<br/>vLLM]
+
+ subgraph EL["Execution Layer - Worker Pools"]
+ direction TB
+
+ subgraph MP1["deepseek-v31"]
+ direction LR
+ W1[Prefill Worker]
+ W2[Decode Worker]
  end
-
- subgraph "Model Pool: phi4"
- Worker3[Prefill Worker 2<br/>vLLM]
- Worker4[Decode Worker 2<br/>vLLM]
+
+ subgraph MP2["phi4"]
+ direction LR
+ W3[Prefill Worker]
+ W4[Decode Worker]
  end
-
- subgraph "Model Pool: qwen3"
- Worker5[Worker 3<br/>SGLang]
+
+ subgraph MP3["qwen3"]
+ W5[Worker - SGLang]
  end
  end
-
- subgraph "Storage Layer"
- Redis[(Redis<br/>Semantic Cache)]
- SystemMem[(System Memory<br/>KV Cache Offload)]
- NVMe[(NVMe<br/>Cold KV Cache)]
+
+ subgraph SL["Storage Layer"]
+ direction LR
+ Milvus[(Milvus<br/>Semantic Cache)]
+ SystemMem[(System Memory<br/>KV Offload)]
+ NVMe[(NVMe<br/>Cold Cache)]
  end
-
- Client -->|1. OpenAI API Request| Gateway
- Gateway -->|2. ExtProc gRPC| ExtProc
+
+ Client -->|1. Request| Gateway
+ Gateway <-->|2. ExtProc| ExtProc
  ExtProc --> Classifier
  ExtProc --> PIIDetector
  ExtProc --> JailbreakGuard
  ExtProc --> SemanticCache
  ExtProc --> ToolSelector
-
- ExtProc -->|3. Enriched Request<br/>X-VSR-Model: deepseek-v31<br/>X-VSR-Category: math<br/>X-VSR-Reasoning: true| Gateway
- Gateway -->|4. Forward to Dynamo| DynamoFrontend
-
- DynamoFrontend -->|5. Model-Filtered Routing| DynamoRouter
- DynamoRouter <-->|KV Cache State| KVBM
- DynamoRouter -->|6. Worker Selection| Worker1
- DynamoRouter -->|6. Worker Selection| Worker2
-
- Planner -->|Dynamic Scaling| Worker1
- Planner -->|Dynamic Scaling| Worker2
- Planner -->|Dynamic Scaling| Worker3
- Planner -->|Dynamic Scaling| Worker4
- Planner -->|Dynamic Scaling| Worker5
-
- SemanticCache <--> Redis
+
+ Gateway -->|3. Enriched Request| DynamoFrontend
+ DynamoFrontend --> DynamoRouter
+ DynamoRouter <--> KVBM
+
+ DynamoRouter --> W1
+ DynamoRouter --> W2
+ DynamoRouter -.-> W3
+ DynamoRouter -.-> W4
+ DynamoRouter -.-> W5
+
+ Planner -.-> W1
+ Planner -.-> W2
+ Planner -.-> W3
+ Planner -.-> W4
+ Planner -.-> W5
+
+ SemanticCache <--> Milvus
  KVBM <--> SystemMem
  KVBM <--> NVMe
-
- Worker1 -->|7. Response| DynamoFrontend
- DynamoFrontend -->|8. Response| Gateway
- Gateway -->|9. Response| Client
-
+
+ W1 -->|4. Response| DynamoFrontend
+ DynamoFrontend --> Gateway
+ Gateway --> Client
+
  style ExtProc fill:#e1f5ff
  style DynamoRouter fill:#c8e6c9
  style SemanticCache fill:#fff9c4
@@ -604,7 +618,7 @@ graph TB
  1. **Semantic Intelligence Layer (Semantic Router)**
  - Envoy Gateway with ExtProc for request interception
  - BERT-based classification and security filtering
- - Semantic caching with Redis backend
+ - Semantic caching with Milvus backend
  - Request enrichment with routing metadata

  2. **Infrastructure Optimization Layer (Dynamo)**
@@ -619,7 +633,7 @@ graph TB
  - Backend-agnostic execution

  4. **Storage Layer**
- - Redis for semantic cache
+ - Milvus for semantic cache
  - System memory for KV cache offload
  - NVMe for cold KV cache storage

@@ -644,7 +658,7 @@ graph TB
  │ │
  │ Step 3: Semantic Cache Lookup │
  │ - Generate BERT embedding for query │
- │ - Search Redis for similar queries (threshold: 0.85) │
+ │ - Search Milvus for similar queries (threshold: 0.85) │
  │ - Action: Return cached response if HIT │
  │ - Latency: Very low (cache hit), Low (cache miss) │
  │ │
@@ -708,7 +722,7 @@ graph TB
  │ │
  │ Step 11: Semantic Cache Update │
  │ - Semantic Router receives response via ExtProc │
- │ - Store query embedding + response in Redis │
+ │ - Store query embedding + response in Milvus │
  │ - TTL: 7200 seconds (configurable) │
  │ │
  │ Step 12: Response to Client │
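
The TTL behaviour described in Step 11 can be sketched as follows. This is a minimal illustration, not the Semantic Router's implementation: the `TTLCache` class, its methods, and the injectable `clock` parameter are hypothetical names, and entries are keyed by a plain string rather than by a query embedding to keep the example short. Only the 7200-second default comes from the text.

```python
import time
from typing import Callable, Dict, Optional, Tuple

DEFAULT_TTL_SECONDS = 7200  # TTL quoted in Step 11; described there as configurable


class TTLCache:
    """Hypothetical in-process cache whose entries expire after a fixed TTL."""

    def __init__(
        self,
        ttl_seconds: float = DEFAULT_TTL_SECONDS,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._ttl = ttl_seconds
        self._clock = clock  # injectable so expiry can be tested without sleeping
        self._entries: Dict[str, Tuple[float, str]] = {}

    def put(self, key: str, response: str) -> None:
        """Store a response together with its absolute expiry time."""
        self._entries[key] = (self._clock() + self._ttl, response)

    def get(self, key: str) -> Optional[str]:
        """Return the cached response, or None if it is absent or expired."""
        item = self._entries.get(key)
        if item is None:
            return None
        expires_at, response = item
        if self._clock() >= expires_at:
            del self._entries[key]  # drop the expired entry lazily on read
            return None
        return response
```

Expiry is checked lazily on read here; a vector database backend would more likely enforce TTL on the server side, which this sketch does not model.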
@@ -728,7 +742,7 @@ The integration leverages **two complementary caching layers**:
  - **Granularity:** Entire request-response pairs
  - **Matching:** Embedding similarity (cosine distance)
  - **Threshold:** 0.85 (configurable)
- - **Backend:** Redis or in-memory
+ - **Backend:** Milvus (vector database)
  - **Benefit:** Avoids inference entirely for similar queries
  - **Example:** "What is 2+2?" ≈ "Calculate 2 plus 2" (similarity: 0.91)

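
To make the matching rule above concrete, here is a minimal sketch of a threshold-based lookup over cached embeddings. It assumes a brute-force scan over an in-memory list; a vector database such as Milvus would use an index instead. The function names and the `(embedding, response)` entry layout are hypothetical, and only the 0.85 threshold is taken from the text.

```python
import math
from typing import Iterable, Optional, Sequence, Tuple

SIMILARITY_THRESHOLD = 0.85  # value quoted in the text; described as configurable


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def lookup(
    query_vec: Sequence[float],
    entries: Iterable[Tuple[Sequence[float], str]],
    threshold: float = SIMILARITY_THRESHOLD,
) -> Optional[str]:
    """Return the response of the most similar cached entry, or None on a miss."""
    best_score, best_response = -1.0, None
    for vec, response in entries:
        score = cosine_similarity(query_vec, vec)
        if score > best_score:
            best_score, best_response = score, response
    # A hit requires the best match to reach the similarity threshold.
    return best_response if best_score >= threshold else None
```

A query that is close to a cached embedding returns the cached response and skips inference; anything below the threshold is treated as a miss and forwarded downstream.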
@@ -1157,7 +1171,7 @@ Dynamo Frontend discovers workers through Kubernetes Headless Services, which pr

  2. **Performance Optimization:**
  - Parallel cache lookup and classification
- - Redis connection pooling
+ - Milvus connection pooling
  - Cache warming strategies

  3. **Testing:**
@@ -1332,7 +1346,7 @@ prompt_guard:

  **Best Practices:**

- 1. **Cache Encryption:** Encrypt Redis cache at rest and in transit
+ 1. **Cache Encryption:** Encrypt Milvus cache at rest and in transit
  2. **TTL Policies:** Automatic expiration of cached data (default: 2 hours)
  3. **Data Locality:** Deploy in compliance-approved regions
  4. **Audit Logging:** Comprehensive logs for compliance audits
@@ -1376,15 +1390,15 @@ prompt_guard:
  - Dynamo performs default routing
  - **Mitigation:** Deploy 3+ replicas with anti-affinity

- **Failure Scenario 2: Redis Cache Unavailable**
+ **Failure Scenario 2: Milvus Cache Unavailable**

  - **Detection:** Connection errors, timeout
  - **Impact:** No semantic caching (cache misses)
  - **Recovery:**
    - Semantic Router continues with in-memory cache
    - All requests forwarded to Dynamo
    - Performance degradation but no outage
- - **Mitigation:** Redis Sentinel or Redis Cluster for HA
+ - **Mitigation:** Milvus cluster deployment for HA
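
The recovery path in Failure Scenario 2 (continue with an in-memory cache when the backend is unreachable) could look roughly like the sketch below. It is an assumption-laden illustration: `FallbackCache`, `remote_get`, and `remote_put` are hypothetical names standing in for whatever client the Semantic Router actually uses, and no real Milvus API is shown.

```python
from typing import Callable, Dict, Optional

# Errors treated as "cache backend unavailable" (connection errors, timeouts).
BACKEND_ERRORS = (ConnectionError, TimeoutError)


class FallbackCache:
    """Hypothetical wrapper: use the remote cache, degrade to a local dict on errors."""

    def __init__(
        self,
        remote_get: Callable[[str], Optional[str]],
        remote_put: Callable[[str, str], None],
    ) -> None:
        self._remote_get = remote_get
        self._remote_put = remote_put
        self._local: Dict[str, str] = {}
        self.degraded = False  # True while the last remote call failed

    def get(self, key: str) -> Optional[str]:
        try:
            value = self._remote_get(key)
            self.degraded = False
            return value
        except BACKEND_ERRORS:
            # Backend unreachable: serve from the in-memory cache instead of failing.
            self.degraded = True
            return self._local.get(key)

    def put(self, key: str, value: str) -> None:
        try:
            self._remote_put(key, value)
            self.degraded = False
        except BACKEND_ERRORS:
            # Keep the entry locally so later lookups can still hit.
            self.degraded = True
            self._local[key] = value
```

With this shape, a backend outage turns into extra cache misses rather than request failures, which matches the "performance degradation but no outage" outcome listed above. The `degraded` flag is one possible hook for the detection step.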

  **Failure Scenario 3: Dynamo Frontend Unavailable**
