@@ -11,12 +11,12 @@ The result is a production-grade LLM serving platform that optimizes for both **

**Key Benefits:**

- - **3-5x cost reduction** through intelligent model selection combined with infrastructure optimization
- - **40-60% TTFT improvement** via semantic caching + KV cache management
+ - **Significant cost reduction** through intelligent model selection combined with infrastructure optimization
+ - **Substantial latency improvement** via semantic caching + KV cache management
- **Enhanced LLM quality** with domain-aware system prompts that improve Chain-of-Thought reasoning, token efficiency, and MoE expert matching
- - **Adaptive routing latency** with fusion routing: 1-2ms (keyword) to 20-30ms (BERT) based on query complexity
+ - **Adaptive routing latency** with fusion routing: fast path (keyword) to deep analysis (BERT) based on query complexity
- **Multi-signal intelligence** combining BERT classification, keyword matching, and similarity search for robust routing decisions
- - **Enhanced security** with PII detection and jailbreak prevention before inference
+ - **Enhanced content safety** with PII detection and jailbreak prevention before inference
- **Unified observability** across semantic and infrastructure layers

---
@@ -117,7 +117,7 @@ Semantic Router implements a **multi-signal fusion routing** approach that combi
**1. Keyword-Based Routing (Fast Path)**

- Deterministic routing for technology-specific terms (e.g., "kubernetes", "SQL", "React")
- - **Latency**: ~1-2ms (10-15x faster than BERT)
+ - **Latency**: Minimal (significantly faster than BERT classification)
- Boolean logic support (AND/OR operators)
- Easy to update without model retraining
- **Use case**: Exact term matching for known patterns
@@ -127,14 +127,14 @@ Semantic Router implements a **multi-signal fusion routing** approach that combi
- Embedding similarity for semantic concept detection
- Robust to paraphrasing ("step-by-step" ≈ "explain thoroughly")
- Configurable similarity thresholds (default: 0.75)
- - **Latency**: ~5-10ms
+ - **Latency**: Low (faster than full BERT classification)
- **Use case**: Semantic concept matching beyond exact terms

**3. BERT Classification (Deep Understanding Path)**

- 14-category classification with ModernBERT
- Highest accuracy for complex queries
- - **Latency**: ~20-30ms
+ - **Latency**: Moderate (comprehensive analysis)
- **Use case**: Comprehensive intent understanding

**Signal Fusion Layer:**
@@ -150,7 +150,7 @@ Semantic Router implements a **multi-signal fusion routing** approach that combi

**Benefits of Fusion Routing:**

- - **Latency optimization**: Fast path for common patterns (1-2ms vs 20-30ms)
+ - **Latency optimization**: Fast path for common patterns vs. deep analysis for complex queries
- **Accuracy**: Deep understanding for complex queries
- **Flexibility**: Easy to add new routing rules without retraining
- **Robustness**: Multiple signals reduce misclassification risk
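To make the fusion concrete, here is a minimal illustrative sketch (Python, not the project's actual code) of one plausible fusion policy, a cheapest-signal-first cascade; `embed`, `similarity_index`, and `bert_classifier` are assumed interfaces:

```python
from dataclasses import dataclass

@dataclass
class RouteDecision:
    category: str
    signal: str        # which signal produced the decision
    confidence: float

# Fast path: deterministic term -> category rules, editable without retraining.
KEYWORD_RULES = {"kubernetes": "code", "sql": "code", "react": "code"}
SIMILARITY_THRESHOLD = 0.75  # default threshold cited above

def route(query, embed, similarity_index, bert_classifier) -> RouteDecision:
    # 1. Keyword signal: no model inference at all.
    lowered = query.lower()
    for term, category in KEYWORD_RULES.items():
        if term in lowered:
            return RouteDecision(category, "keyword", 1.0)

    # 2. Similarity signal: embedding lookup, robust to paraphrasing.
    concept, score = similarity_index.nearest(embed(query))
    if score >= SIMILARITY_THRESHOLD:
        return RouteDecision(concept, "similarity", score)

    # 3. BERT signal: full classification for everything else.
    category, confidence = bert_classifier(query)
    return RouteDecision(category, "bert", confidence)
```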
@@ -192,7 +192,7 @@ Enriched Request → [Worker Selection] → KV Cache Optimization → GPU Schedu
| **Caching** | ✅ Semantic similarity (Milvus) | ✅ KV cache reuse | ✅✅ **Dual-layer caching** |
| **Security** | ✅ PII + jailbreak | ❌ No security layer | ✅ Pre-inference filtering |
| **Cost Optimization** | ✅ Model-level | ✅ Infrastructure-level | ✅✅ **End-to-end optimization** |
- | **Latency** | ~1-30ms (fusion routing) | ~5-10ms routing | **Parallel execution** |
+ | **Latency** | Adaptive (fusion routing) | Low routing overhead | **Parallel execution** |

**Concrete Example:**

@@ -229,14 +229,14 @@ Query: "Explain the proof of Fermat's Last Theorem step-by-step"
│ - worker-1: 85 prefill + 25 active = 110 (BEST) │
│ - worker-2: 97 prefill + 20 active = 117 │
│ - worker-3: 100 prefill + 18 active = 118 │
- │ 4. Selection: worker-1 (40% prefill cost reduction) │
+ │ 4. Selection: worker-1 (significant prefill cost reduction) │
└─────────────────────────────────────────────────────────────────┘

Result:
- Right model (deepseek-v31 for math reasoning)
- Right worker (worker-1 with relevant KV cache)
- Right mode (reasoning enabled)
- - 40% faster TTFT vs. random worker selection
+ - Significantly faster TTFT vs. random worker selection

```

### 2.4 Why Integration Matters
@@ -499,9 +499,9 @@ prompt_guard:

- **Parallel Execution:** PII and Jailbreak detection run in parallel
- **Early Exit:** Cache hits bypass all model inference
- - **Keyword Routing:** Fast path (~1-2ms) for deterministic patterns
+ - **Keyword Routing:** Fast path for deterministic patterns
- **CPU Optimization:** All models optimized for CPU inference to reduce cost
- - **LoRA Adapters:** Jailbreak model uses lightweight adapters (~10-20 MB) for faster loading
+ - **LoRA Adapters:** Jailbreak model uses lightweight adapters for faster loading
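As a rough illustration of the parallel-execution and early-exit points above (a sketch only; `cache`, `pii_model`, and `jailbreak_model` are assumed async interfaces, not the project's API):

```python
import asyncio

async def guard_request(query: str, cache, pii_model, jailbreak_model):
    # Early exit: a semantic cache hit bypasses all model inference.
    cached = await cache.lookup(query)
    if cached is not None:
        return cached

    # PII and jailbreak detection run concurrently rather than back-to-back,
    # so the guard stage costs roughly max(pii, jailbreak) instead of the sum.
    pii_hits, is_jailbreak = await asyncio.gather(
        pii_model.scan(query),           # assumed: returns detected entities
        jailbreak_model.classify(query), # assumed: returns True on injection
    )
    if pii_hits or is_jailbreak:
        raise PermissionError("request blocked by pre-inference security filters")
    return None  # safe: continue to routing
```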

---

@@ -640,19 +640,19 @@ graph TB
│ - PII Detection: Scan for PERSON, EMAIL, SSN, etc. │
│ - Jailbreak Detection: Binary classification for prompt injection │
│ - Action: BLOCK if security violation detected │
- │ - Latency: ~5-10ms │
+ │ - Latency: Low │
│ │
│ Step 3: Semantic Cache Lookup │
│ - Generate BERT embedding for query │
│ - Search Redis for similar queries (threshold: 0.85) │
│ - Action: Return cached response if HIT │
- │ - Latency: ~2-5ms (cache hit), ~10ms (cache miss) │
+ │ - Latency: Very low (cache hit), Low (cache miss) │
│ │
│ Step 4: Intent Classification │
│ - ModernBERT classification (10 categories) │
│ - Entropy-based reasoning decision │
│ - Category: math, code, reasoning, creative, etc. │
- │ - Latency: ~20-30ms │
+ │ - Latency: Moderate │
│ │
│ Step 5: Model Selection │
│ - Lookup category → model scores mapping │
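A toy sketch of the Step 3 lookup described above, using an in-memory list as a stand-in for the Redis backing store the flow describes (interface names are illustrative):

```python
import numpy as np

CACHE_SIMILARITY_THRESHOLD = 0.85  # threshold quoted in Step 3

class SemanticCache:
    """Toy in-memory stand-in for the Redis-backed semantic cache."""

    def __init__(self):
        self.entries = []  # list of (embedding, cached_response) pairs

    def lookup(self, query_vec):
        # Cosine similarity between the query embedding and each cached entry.
        q = query_vec / np.linalg.norm(query_vec)
        for vec, response in self.entries:
            sim = float(np.dot(q, vec / np.linalg.norm(vec)))
            if sim >= CACHE_SIMILARITY_THRESHOLD:
                return response  # HIT: skip classification and inference
        return None              # MISS: continue to intent classification

    def store(self, query_vec, response):
        self.entries.append((query_vec, response))
```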
@@ -671,7 +671,7 @@ graph TB
│ * Inject reasoning parameters if applicable │
│ * Add selected tools if tool selection enabled │
│ │
- │ Total Latency: ~30-50ms (parallel execution) │
+ │ Total Latency: Low to Moderate (parallel execution) │
└─────────────────────────────────────────────────────────────────────────────┘
↓
┌─────────────────────────────────────────────────────────────────────────────┐
@@ -689,14 +689,14 @@ graph TB
│ * potential_active_blocks = current_active + new_request_blocks │
│ * logit = kv_overlap_weight × prefill + active │
│ - Select worker with lowest cost │
- │ - Latency: ~5-10ms │
+ │ - Latency: Low │
│ │
│ Step 9: Request Forwarding │
│ - Forward to selected worker (prefill or decode) │
│ - Worker processes request with vLLM/SGLang/TRT-LLM │
│ - KVBM tracks new KV cache blocks │
│ │
- │ Total Latency: ~10-20ms (routing overhead) │
+ │ Total Latency: Low (routing overhead) │
└─────────────────────────────────────────────────────────────────────────────┘
↓
┌─────────────────────────────────────────────────────────────────────────────┐
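The Step 8 cost model quoted above can be sketched directly. Only the two formulas come from the flow; the field names, the default weight, and the example harness below are assumptions:

```python
from dataclasses import dataclass

@dataclass
class Worker:
    name: str
    prefill_blocks: int   # blocks still to be computed (lower = more KV overlap)
    current_active: int   # blocks currently active on the worker

def select_worker(workers, new_request_blocks: int, kv_overlap_weight: float = 1.0):
    def cost(w: Worker) -> float:
        # potential_active_blocks = current_active + new_request_blocks
        potential_active = w.current_active + new_request_blocks
        # logit = kv_overlap_weight × prefill + active
        return kv_overlap_weight * w.prefill_blocks + potential_active
    # Select the worker with the lowest cost.
    return min(workers, key=cost)

# With the figures from the Section 2 example (weight 1.0, new blocks already
# folded into the active counts), worker-1 wins: 85 + 25 = 110.
workers = [Worker("worker-1", 85, 25), Worker("worker-2", 97, 20), Worker("worker-3", 100, 18)]
print(select_worker(workers, new_request_blocks=0).name)  # -> worker-1
```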
@@ -745,24 +745,24 @@ The integration leverages **two complementary caching layers**:
```
Scenario 1: Exact Semantic Match
Query: "What is the capital of France?"
- Semantic Cache: HIT (similarity: 0.98 with "What's France's capital?")
+ Semantic Cache: HIT (high similarity with "What's France's capital?")
KV Cache: N/A (inference skipped)
- Latency: ~5ms (cache lookup only)
- Cost Reduction: 100% (no inference)
+ Latency: Very low (cache lookup only)
+ Cost Reduction: Maximum (no inference)

Scenario 2: Partial Semantic Match + KV Reuse
Query: "Explain the proof of Fermat's Last Theorem in detail"
Semantic Cache: MISS (novel query)
- KV Cache: HIT (40% overlap with "Explain Fermat's Last Theorem")
- Latency: ~200ms (vs. ~350ms without KV reuse)
- Cost Reduction: 40% (prefill cost)
+ KV Cache: HIT (significant overlap with "Explain Fermat's Last Theorem")
+ Latency: Reduced (vs. without KV reuse)
+ Cost Reduction: Significant (prefill cost saved)

Scenario 3: Novel Query
Query: "Design a distributed consensus algorithm for blockchain"
Semantic Cache: MISS
KV Cache: MISS
- Latency: ~500ms (full inference)
- Cost Reduction: 0% (but routed to best model)
+ Latency: Standard (full inference)
+ Cost Reduction: None (but routed to best model)
```
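Tying the three scenarios together, a hedged sketch of the dual-layer decision flow; every component interface here is an assumption, not the project's API:

```python
def handle_query(query, semantic_cache, embed, router, workers):
    vec = embed(query)

    # Scenario 1: exact-enough semantic match -> no inference at all.
    cached = semantic_cache.lookup(vec)
    if cached is not None:
        return cached

    # Scenarios 2 and 3: route to the best model/worker. Any prefix overlap
    # in that worker's KV cache reduces prefill work; otherwise full
    # inference runs, but at least on the best-suited model.
    worker = router.select_worker(workers, query)
    response = worker.generate(query)
    semantic_cache.store(vec, response)  # future similar queries now hit
    return response
```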

### 4.5 Integration in Kubernetes
@@ -833,20 +833,20 @@ The integration follows a **layered service architecture** in Kubernetes, with c
│ ├────────────────────────────────────────────────────────────┤ │
│ │ │ │
│ │ [Model Pool: deepseek-v31] │ │
- │ │ - StatefulSet: 4 replicas │ │
+ │ │ - StatefulSet: Multiple replicas │ │
│ │ - Service: vllm-deepseek-v31-svc │ │
- │ │ - GPU: 2x H100 per pod │ │
+ │ │ - GPU: Multi-GPU per pod │ │
│ │ - Features: prefix caching, fp8 KV cache │ │
│ │ │ │
│ │ [Model Pool: qwen3] │ │
- │ │ - StatefulSet: 3 replicas │ │
+ │ │ - StatefulSet: Multiple replicas │ │
│ │ - Service: vllm-qwen3-svc │ │
- │ │ - GPU: 2x H100 per pod │ │
+ │ │ - GPU: Multi-GPU per pod │ │
│ │ │ │
│ │ [Model Pool: phi4] │ │
- │ │ - StatefulSet: 2 replicas │ │
+ │ │ - StatefulSet: Multiple replicas │ │
│ │ - Service: vllm-phi4-svc │ │
- │ │ - GPU: 1x H100 per pod │ │
+ │ │ - GPU: Single/Multi-GPU per pod │ │
│ │ │ │
│ └────────────────────────────────────────────────────────────┘ │
│ │
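An illustrative manifest sketch for one model pool; the Service name and the prefix-caching/fp8 features come from the diagram above, while the replica count, image, and GPU count are placeholders:

```yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: vllm-deepseek-v31
spec:
  serviceName: vllm-deepseek-v31-svc     # Service name from the diagram
  replicas: 3                            # placeholder: "multiple replicas"
  selector:
    matchLabels:
      app: vllm-deepseek-v31
  template:
    metadata:
      labels:
        app: vllm-deepseek-v31
    spec:
      containers:
        - name: vllm
          image: vllm/vllm-openai:latest # placeholder image; model args omitted
          args: ["--enable-prefix-caching", "--kv-cache-dtype", "fp8"]
          resources:
            limits:
              nvidia.com/gpu: 2          # placeholder: multi-GPU per pod
```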
@@ -1167,9 +1167,9 @@ Dynamo Frontend discovers workers through Kubernetes Headless Services, which pr

**Success Criteria:**

- - ✅ Semantic cache hit rate > 30% (production workloads)
- - ✅ Cache hit latency < 10ms
- - ✅ Combined cache hit rate (semantic + KV) > 60%
+ - ✅ High semantic cache hit rate (production workloads)
+ - ✅ Low cache hit latency
+ - ✅ High combined cache hit rate (semantic + KV)

#### Phase 3: Observability & Monitoring

@@ -1220,7 +1220,7 @@ Dynamo Frontend discovers workers through Kubernetes Headless Services, which pr
**Success Criteria:**

- ✅ Single distributed trace spans all layers (VSR → Dynamo → Worker)
- - ✅ < 1% trace sampling overhead
+ - ✅ Minimal trace sampling overhead
- ✅ Real-time dashboards operational
- ✅ Trace context properly propagated across service boundaries

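A sketch of how trace context could be propagated across the VSR → Dynamo boundary, assuming OpenTelemetry; the span name, attribute, and HTTP client are illustrative choices, not the project's instrumentation:

```python
import httpx
from opentelemetry import trace
from opentelemetry.propagate import inject

tracer = trace.get_tracer("semantic-router")

def forward_to_dynamo(payload: dict, dynamo_url: str) -> httpx.Response:
    # Everything inside this span becomes one segment of the end-to-end trace.
    with tracer.start_as_current_span("vsr.route") as span:
        span.set_attribute("vsr.selected_model", payload.get("model", ""))
        headers: dict = {}
        inject(headers)  # writes W3C traceparent so Dynamo joins the same trace
        return httpx.post(dynamo_url, json=payload, headers=headers)
```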
@@ -1251,8 +1251,8 @@ Dynamo Frontend discovers workers through Kubernetes Headless Services, which pr

**Success Criteria:**

- - ✅ 99.9% availability
- - ✅ < 50ms P99 latency (routing overhead)
+ - ✅ High availability
+ - ✅ Low P99 latency (routing overhead)
- ✅ 10K+ RPS sustained throughput

---
@@ -1348,14 +1348,14 @@ prompt_guard:

| Metric | Threshold | Alert Severity |
|--------|-----------|----------------|
- | Semantic Router Latency (P99) | > 100ms | Warning |
- | Dynamo Router Latency (P99) | > 50ms | Warning |
- | Combined Latency (P99) | > 150ms | Critical |
- | Semantic Cache Hit Rate | < 20% | Warning |
- | KV Cache Hit Rate | < 30% | Warning |
- | Security Block Rate | > 5% | Warning |
- | Error Rate | > 1% | Critical |
- | GPU Utilization | < 50% or > 95% | Warning |
+ | Semantic Router Latency (P99) | High | Warning |
+ | Dynamo Router Latency (P99) | High | Warning |
+ | Combined Latency (P99) | Very High | Critical |
+ | Semantic Cache Hit Rate | Low | Warning |
+ | KV Cache Hit Rate | Low | Warning |
+ | Security Block Rate | High | Warning |
+ | Error Rate | High | Critical |
+ | GPU Utilization | Too Low or Too High | Warning |
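For illustration, one row of the table expressed as a hypothetical Prometheus alerting rule; the metric name and numeric threshold are placeholders, not values from this document:

```yaml
groups:
  - name: semantic-router-latency
    rules:
      - alert: SemanticRouterP99LatencyHigh
        # histogram_quantile over a rate of buckets yields the rolling P99;
        # vsr_request_duration_seconds and 0.25s are placeholders.
        expr: histogram_quantile(0.99, sum(rate(vsr_request_duration_seconds_bucket[5m])) by (le)) > 0.25
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Semantic Router P99 latency is high"
```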

**Dashboards:**