
Commit 36fce5d

committed
docs: generalize performance metrics to avoid specific numbers
- Replace specific latency values with descriptive terms (Low, Moderate, High)
- Change cost reduction percentages to qualitative descriptions (Significant, Maximum)
- Update cache hit rates to general terms (High, Low)
- Modify GPU configurations to generic descriptions (Multi-GPU, Single/Multi-GPU)
- Replace specific thresholds with relative terms in monitoring alerts
- Maintain focus on intelligent routing and semantic cache benefits

Rationale: Avoid committing to specific performance numbers while emphasizing the value of intelligent routing and semantic caching.

Signed-off-by: bitliu <[email protected]>
1 parent 0f65556 commit 36fce5d

File tree

1 file changed (+47 −47 lines changed)

website/docs/proposals/nvidia-dynamo-integration.md

Lines changed: 47 additions & 47 deletions
@@ -11,12 +11,12 @@ The result is a production-grade LLM serving platform that optimizes for both **
 
 **Key Benefits:**
 
-- **3-5x cost reduction** through intelligent model selection combined with infrastructure optimization
-- **40-60% TTFT improvement** via semantic caching + KV cache management
+- **Significant cost reduction** through intelligent model selection combined with infrastructure optimization
+- **Substantial latency improvement** via semantic caching + KV cache management
 - **Enhanced LLM quality** with domain-aware system prompts that improve Chain-of-Thought reasoning, token efficiency, and MoE expert matching
-- **Adaptive routing latency** with fusion routing: 1-2ms (keyword) to 20-30ms (BERT) based on query complexity
+- **Adaptive routing latency** with fusion routing: fast path (keyword) to deep analysis (BERT) based on query complexity
 - **Multi-signal intelligence** combining BERT classification, keyword matching, and similarity search for robust routing decisions
-- **Enhanced security** with PII detection and jailbreak prevention before inference
+- **Enhanced content safety** with PII detection and jailbreak prevention before inference
 - **Unified observability** across semantic and infrastructure layers
 
 ---
@@ -117,7 +117,7 @@ Semantic Router implements a **multi-signal fusion routing** approach that combi
 **1. Keyword-Based Routing (Fast Path)**
 
 - Deterministic routing for technology-specific terms (e.g., "kubernetes", "SQL", "React")
-- **Latency**: ~1-2ms (10-15x faster than BERT)
+- **Latency**: Minimal (significantly faster than BERT classification)
 - Boolean logic support (AND/OR operators)
 - Easy to update without model retraining
 - **Use case**: Exact term matching for known patterns
@@ -127,14 +127,14 @@ Semantic Router implements a **multi-signal fusion routing** approach that combi
 - Embedding similarity for semantic concept detection
 - Robust to paraphrasing ("step-by-step" ≈ "explain thoroughly")
 - Configurable similarity thresholds (default: 0.75)
-- **Latency**: ~5-10ms
+- **Latency**: Low (faster than full BERT classification)
 - **Use case**: Semantic concept matching beyond exact terms
 
 **3. BERT Classification (Deep Understanding Path)**
 
 - 14-category classification with ModernBERT
 - Highest accuracy for complex queries
-- **Latency**: ~20-30ms
+- **Latency**: Moderate (comprehensive analysis)
 - **Use case**: Comprehensive intent understanding
 
 **Signal Fusion Layer:**
@@ -150,7 +150,7 @@ Semantic Router implements a **multi-signal fusion routing** approach that combi
 
 **Benefits of Fusion Routing:**
 
-- **Latency optimization**: Fast path for common patterns (1-2ms vs 20-30ms)
+- **Latency optimization**: Fast path for common patterns vs. deep analysis for complex queries
 - **Accuracy**: Deep understanding for complex queries
 - **Flexibility**: Easy to add new routing rules without retraining
 - **Robustness**: Multiple signals reduce misclassification risk
@@ -192,7 +192,7 @@ Enriched Request → [Worker Selection] → KV Cache Optimization → GPU Schedu
 | **Caching** | ✅ Semantic similarity (Milvus) | ✅ KV cache reuse | ✅✅ **Dual-layer caching** |
 | **Security** | ✅ PII + jailbreak | ❌ No security layer | ✅ Pre-inference filtering |
 | **Cost Optimization** | ✅ Model-level | ✅ Infrastructure-level | ✅✅ **End-to-end optimization** |
-| **Latency** | ~1-30ms (fusion routing) | ~5-10ms routing | **Parallel execution** |
+| **Latency** | Adaptive (fusion routing) | Low routing overhead | **Parallel execution** |
 
 **Concrete Example:**
 
@@ -229,14 +229,14 @@ Query: "Explain the proof of Fermat's Last Theorem step-by-step"
 │ - worker-1: 85 prefill + 25 active = 110 (BEST) │
 │ - worker-2: 97 prefill + 20 active = 117 │
 │ - worker-3: 100 prefill + 18 active = 118 │
-│ 4. Selection: worker-1 (40% prefill cost reduction) │
+│ 4. Selection: worker-1 (significant prefill cost reduction) │
 └─────────────────────────────────────────────────────────────────┘
 
 Result:
 - Right model (deepseek-v31 for math reasoning)
 - Right worker (worker-1 with relevant KV cache)
 - Right mode (reasoning enabled)
-- 40% faster TTFT vs. random worker selection
+- Significantly faster TTFT vs. random worker selection
 ```
 
 ### 2.4 Why Integration Matters
@@ -499,9 +499,9 @@ prompt_guard:
 
 - **Parallel Execution:** PII and Jailbreak detection run in parallel
 - **Early Exit:** Cache hits bypass all model inference
-- **Keyword Routing:** Fast path (~1-2ms) for deterministic patterns
+- **Keyword Routing:** Fast path for deterministic patterns
 - **CPU Optimization:** All models optimized for CPU inference to reduce cost
-- **LoRA Adapters:** Jailbreak model uses lightweight adapters (~10-20 MB) for faster loading
+- **LoRA Adapters:** Jailbreak model uses lightweight adapters for faster loading
 
 ---
 
@@ -640,19 +640,19 @@ graph TB
 │ - PII Detection: Scan for PERSON, EMAIL, SSN, etc. │
 │ - Jailbreak Detection: Binary classification for prompt injection │
 │ - Action: BLOCK if security violation detected │
-│ - Latency: ~5-10ms │
+│ - Latency: Low │
 │ │
 │ Step 3: Semantic Cache Lookup │
 │ - Generate BERT embedding for query │
 │ - Search Redis for similar queries (threshold: 0.85) │
 │ - Action: Return cached response if HIT │
-│ - Latency: ~2-5ms (cache hit), ~10ms (cache miss) │
+│ - Latency: Very low (cache hit), Low (cache miss) │
 │ │
 │ Step 4: Intent Classification │
 │ - ModernBERT classification (10 categories) │
 │ - Entropy-based reasoning decision │
 │ - Category: math, code, reasoning, creative, etc. │
-│ - Latency: ~20-30ms │
+│ - Latency: Moderate │
 │ │
 │ Step 5: Model Selection │
 │ - Lookup category → model scores mapping │
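Step 2 above notes that the PII and jailbreak checks run in parallel and block the request on any violation. A minimal sketch of that pattern, with stand-in detectors (the real system runs classification models, not the string checks used here for illustration):

```python
# Illustrative sketch of the parallel security step (Step 2), not the actual
# implementation. detect_pii / detect_jailbreak are toy placeholders.
from concurrent.futures import ThreadPoolExecutor

def detect_pii(text):
    # Placeholder: a real system runs token classification (PERSON, EMAIL, SSN, ...).
    return "ssn" in text.lower()

def detect_jailbreak(text):
    # Placeholder: a real system runs a binary prompt-injection classifier.
    return "ignore previous instructions" in text.lower()

def security_check(text):
    """Run both detectors concurrently; return True if the request should be BLOCKED."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        pii = pool.submit(detect_pii, text)
        jailbreak = pool.submit(detect_jailbreak, text)
        return pii.result() or jailbreak.result()
```

Running the two detectors concurrently means the step's latency is bounded by the slower detector rather than their sum, which is why the doc lists the combined check as a single low-latency step.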
@@ -671,7 +671,7 @@ graph TB
 │ * Inject reasoning parameters if applicable │
 │ * Add selected tools if tool selection enabled │
 │ │
-│ Total Latency: ~30-50ms (parallel execution) │
+│ Total Latency: Low to Moderate (parallel execution) │
 └─────────────────────────────────────────────────────────────────────────────┘
 
 ┌─────────────────────────────────────────────────────────────────────────────┐
@@ -689,14 +689,14 @@ graph TB
 │ * potential_active_blocks = current_active + new_request_blocks │
 │ * logit = kv_overlap_weight × prefill + active │
 │ - Select worker with lowest cost │
-│ - Latency: ~5-10ms │
+│ - Latency: Low │
 │ │
 │ Step 9: Request Forwarding │
 │ - Forward to selected worker (prefill or decode) │
 │ - Worker processes request with vLLM/SGLang/TRT-LLM │
 │ - KVBM tracks new KV cache blocks │
 │ │
-│ Total Latency: ~10-20ms (routing overhead) │
+│ Total Latency: Low (routing overhead) │
 └─────────────────────────────────────────────────────────────────────────────┘
 
 ┌─────────────────────────────────────────────────────────────────────────────┐
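The worker-selection step above scores each worker as `logit = kv_overlap_weight × prefill + active` and picks the lowest cost. A minimal sketch of that formula, with illustrative field names (the `Worker` dataclass and `select_worker` are not the real API); the usage below reuses the worker-1/2/3 cost numbers from the worked example earlier in the diff:

```python
# Sketch of KV-cache-aware worker selection per the Step 8 formula:
#   cost = kv_overlap_weight * prefill_blocks + active_blocks
# Field and function names are illustrative placeholders.
from dataclasses import dataclass

@dataclass
class Worker:
    name: str
    prefill_blocks: int  # blocks still needing prefill (lower = more KV overlap)
    active_blocks: int   # blocks currently active on the worker

def select_worker(workers, kv_overlap_weight=1.0):
    """Return the worker with the lowest combined prefill + active cost."""
    return min(
        workers,
        key=lambda w: kv_overlap_weight * w.prefill_blocks + w.active_blocks,
    )

# With the example costs: worker-1 scores 85 + 25 = 110, worker-2 scores
# 97 + 20 = 117, worker-3 scores 100 + 18 = 118, so worker-1 is selected.
```

Raising `kv_overlap_weight` biases selection toward workers with more reusable KV cache (fewer prefill blocks); lowering it biases toward less-loaded workers.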
@@ -745,24 +745,24 @@ The integration leverages **two complementary caching layers**:
 ```
 Scenario 1: Exact Semantic Match
 Query: "What is the capital of France?"
-Semantic Cache: HIT (similarity: 0.98 with "What's France's capital?")
+Semantic Cache: HIT (high similarity with "What's France's capital?")
 KV Cache: N/A (inference skipped)
-Latency: ~5ms (cache lookup only)
-Cost Reduction: 100% (no inference)
+Latency: Very low (cache lookup only)
+Cost Reduction: Maximum (no inference)
 
 Scenario 2: Partial Semantic Match + KV Reuse
 Query: "Explain the proof of Fermat's Last Theorem in detail"
 Semantic Cache: MISS (novel query)
-KV Cache: HIT (40% overlap with "Explain Fermat's Last Theorem")
-Latency: ~200ms (vs. ~350ms without KV reuse)
-Cost Reduction: 40% (prefill cost)
+KV Cache: HIT (significant overlap with "Explain Fermat's Last Theorem")
+Latency: Reduced (vs. without KV reuse)
+Cost Reduction: Significant (prefill cost saved)
 
 Scenario 3: Novel Query
 Query: "Design a distributed consensus algorithm for blockchain"
 Semantic Cache: MISS
 KV Cache: MISS
-Latency: ~500ms (full inference)
-Cost Reduction: 0% (but routed to best model)
+Latency: Standard (full inference)
+Cost Reduction: None (but routed to best model)
 ```
 
 ### 4.5 Integration in Kubernetes
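The three scenarios above share one control flow: the semantic cache is consulted first, and only on a miss does the request reach inference (where KV-cache reuse, if any, happens inside the backend). A hedged sketch of that flow, using a toy exact-match cache in place of the real embedding-based lookup (`ExactCache` and `handle_query` are invented names):

```python
# Minimal sketch of the dual-layer cache decision flow. The real semantic
# cache matches by embedding similarity (threshold 0.85 per the doc); this
# toy stand-in matches exact strings only, for illustration.

class ExactCache:
    """Toy stand-in for a semantic cache: exact-match lookup only."""
    def __init__(self):
        self._store = {}

    def lookup(self, query, threshold):
        # A real cache would return the nearest entry above `threshold`.
        return self._store.get(query)

    def store(self, query, response):
        self._store[query] = response

def handle_query(query, semantic_cache, run_inference, sim_threshold=0.85):
    """Semantic cache first (Scenario 1); on miss fall through to
    inference (Scenarios 2 and 3) and populate the cache."""
    hit = semantic_cache.lookup(query, sim_threshold)
    if hit is not None:
        return hit, "semantic-hit"   # no inference at all
    response = run_inference(query)  # KV reuse, if any, happens here
    semantic_cache.store(query, response)
    return response, "miss"
```

This ordering is what makes Scenario 1 the cheapest case: a semantic hit returns before any model, worker, or KV-cache machinery is touched.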
@@ -833,20 +833,20 @@ The integration follows a **layered service architecture** in Kubernetes, with c
 │ ├────────────────────────────────────────────────────────────┤ │
 │ │ │ │
 │ │ [Model Pool: deepseek-v31] │ │
-│ │ - StatefulSet: 4 replicas │ │
+│ │ - StatefulSet: Multiple replicas │ │
 │ │ - Service: vllm-deepseek-v31-svc │ │
-│ │ - GPU: 2x H100 per pod │ │
+│ │ - GPU: Multi-GPU per pod │ │
 │ │ - Features: prefix caching, fp8 KV cache │ │
 │ │ │ │
 │ │ [Model Pool: qwen3] │ │
-│ │ - StatefulSet: 3 replicas │ │
+│ │ - StatefulSet: Multiple replicas │ │
 │ │ - Service: vllm-qwen3-svc │ │
-│ │ - GPU: 2x H100 per pod │ │
+│ │ - GPU: Multi-GPU per pod │ │
 │ │ │ │
 │ │ [Model Pool: phi4] │ │
-│ │ - StatefulSet: 2 replicas │ │
+│ │ - StatefulSet: Multiple replicas │ │
 │ │ - Service: vllm-phi4-svc │ │
-│ │ - GPU: 1x H100 per pod │ │
+│ │ - GPU: Single/Multi-GPU per pod │ │
 │ │ │ │
 │ └────────────────────────────────────────────────────────────┘ │
 │ │
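Each model pool in the diagram above fronts its StatefulSet with a Kubernetes Service. As an illustration, a router could resolve a selected model to its pool via a simple registry; the Service names below come from the diagram, while the namespace, port, and function names are placeholder assumptions (cluster-internal Service DNS follows the standard `<svc>.<namespace>.svc.cluster.local` pattern):

```python
# Hypothetical sketch: map each model pool to its Kubernetes Service name
# (names from the diagram) and build a cluster-internal URL for it.

MODEL_POOLS = {
    "deepseek-v31": "vllm-deepseek-v31-svc",
    "qwen3": "vllm-qwen3-svc",
    "phi4": "vllm-phi4-svc",
}

def service_url(model, namespace="default", port=8000):
    """Build the in-cluster URL for a model pool's Service.
    The namespace and port here are placeholder assumptions."""
    svc = MODEL_POOLS[model]
    return f"http://{svc}.{namespace}.svc.cluster.local:{port}"
```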
@@ -1167,9 +1167,9 @@ Dynamo Frontend discovers workers through Kubernetes Headless Services, which pr
 
 **Success Criteria:**
 
-- ✅ Semantic cache hit rate > 30% (production workloads)
-- ✅ Cache hit latency < 10ms
-- ✅ Combined cache hit rate (semantic + KV) > 60%
+- ✅ High semantic cache hit rate (production workloads)
+- ✅ Low cache hit latency
+- ✅ High combined cache hit rate (semantic + KV)
 
 #### Phase 3: Observability & Monitoring
 
@@ -1220,7 +1220,7 @@ Dynamo Frontend discovers workers through Kubernetes Headless Services, which pr
 **Success Criteria:**
 
 - ✅ Single distributed trace spans all layers (VSR → Dynamo → Worker)
-- ✅ < 1% trace sampling overhead
+- ✅ Minimal trace sampling overhead
 - ✅ Real-time dashboards operational
 - ✅ Trace context properly propagated across service boundaries
 
@@ -1251,8 +1251,8 @@ Dynamo Frontend discovers workers through Kubernetes Headless Services, which pr
 
 **Success Criteria:**
 
-- ✅ 99.9% availability
-- ✅ < 50ms P99 latency (routing overhead)
+- ✅ High availability
+- ✅ Low P99 latency (routing overhead)
 - ✅ 10K+ RPS sustained throughput
 
 ---
@@ -1348,14 +1348,14 @@ prompt_guard:
 
 | Metric | Threshold | Alert Severity |
 |--------|-----------|----------------|
-| Semantic Router Latency (P99) | > 100ms | Warning |
-| Dynamo Router Latency (P99) | > 50ms | Warning |
-| Combined Latency (P99) | > 150ms | Critical |
-| Semantic Cache Hit Rate | < 20% | Warning |
-| KV Cache Hit Rate | < 30% | Warning |
-| Security Block Rate | > 5% | Warning |
-| Error Rate | > 1% | Critical |
-| GPU Utilization | < 50% or > 95% | Warning |
+| Semantic Router Latency (P99) | High | Warning |
+| Dynamo Router Latency (P99) | High | Warning |
+| Combined Latency (P99) | Very High | Critical |
+| Semantic Cache Hit Rate | Low | Warning |
+| KV Cache Hit Rate | Low | Warning |
+| Security Block Rate | High | Warning |
+| Error Rate | High | Critical |
+| GPU Utilization | Too Low or Too High | Warning |
 
 **Dashboards:**
 
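Since the thresholds above are now qualitative, concrete values become operator configuration. A hedged sketch of how such a table might be evaluated; the rule names, structure, and the placeholder threshold values in the usage are assumptions chosen for illustration, not numbers from this proposal:

```python
# Hypothetical alert evaluation over the qualitative table above.
# "High" metrics fire when value > threshold; "Low" metrics fire when
# value < threshold. Concrete thresholds are supplied by the operator.

ALERT_RULES = [
    # (metric name, fires-when predicate, severity)
    ("semantic_router_p99_ms",  lambda v, t: v > t, "warning"),
    ("combined_p99_ms",         lambda v, t: v > t, "critical"),
    ("semantic_cache_hit_rate", lambda v, t: v < t, "warning"),
    ("error_rate",              lambda v, t: v > t, "critical"),
]

def evaluate_alerts(metrics, thresholds):
    """Return (metric, severity) pairs for every rule whose predicate fires."""
    fired = []
    for name, pred, severity in ALERT_RULES:
        if name in metrics and name in thresholds and pred(metrics[name], thresholds[name]):
            fired.append((name, severity))
    return fired
```

Keeping the comparators in the rules and the numbers in operator config is one way to honor this commit's intent: the docs state direction ("High"/"Low"), deployments state magnitude.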
