diff --git a/config/RECIPES.md b/config/RECIPES.md
new file mode 100644
index 00000000..ea10137f
--- /dev/null
+++ b/config/RECIPES.md
@@ -0,0 +1,509 @@
# Configuration Recipes

This directory contains versioned, curated configuration presets ("recipes") optimized for different objectives. Each recipe tunes classification thresholds, reasoning modes, caching strategies, security policies, and observability settings to achieve specific performance goals.

## Available Recipes

### 1. Accuracy-Optimized (`config.recipe-accuracy.yaml`)

**Objective:** Maximum accuracy and response quality

**Use Cases:**

- Research and academic applications
- Critical decision-making systems
- High-stakes business applications
- Medical or legal information systems
- Applications where correctness is paramount

**Key Characteristics:**

- ✅ Reasoning enabled for most complex categories
- ✅ High reasoning effort level (`high`)
- ✅ Strict classification thresholds (0.7)
- ✅ Semantic cache disabled for fresh responses
- ✅ Comprehensive tool selection (top_k: 5)
- ✅ Strict PII detection (threshold: 0.6)
- ✅ Jailbreak protection enabled
- ✅ Full tracing enabled (100% sampling)

**Trade-offs:**

- ⚠️ Higher token usage (~2-3x vs baseline)
- ⚠️ Increased latency (~1.5-2x vs baseline)
- ⚠️ Higher computational costs
- ⚠️ No caching means repeated queries aren't optimized

**Performance Metrics:**

```
Expected latency: 2-5 seconds per request
Token usage: High (reasoning overhead)
Throughput: ~10-20 requests/second
Cost: High (maximum quality)
```

---

### 2. Token Efficiency-Optimized (`config.recipe-token-efficiency.yaml`)

**Objective:** Minimize token usage and reduce operational costs

**Use Cases:**

- High-volume production deployments
- Cost-sensitive applications
- Budget-constrained projects
- Applications with tight token budgets
- Bulk processing workloads

**Key Characteristics:**

- ✅ Reasoning disabled for most categories
- ✅ Low reasoning effort when needed (`low`)
- ✅ Aggressive semantic caching (0.75 threshold, 2hr TTL)
- ✅ Lower classification thresholds (0.5)
- ✅ Minimal tool selection (top_k: 1)
- ✅ Relaxed PII policies
- ✅ Large batch sizes (100)
- ✅ Reduced observability (10% sampling)

**Trade-offs:**

- ⚠️ May sacrifice some accuracy (~5-10%)
- ⚠️ Cache hits depend on query patterns
- ⚠️ Less comprehensive tool coverage
- ⚠️ Relaxed security policies

**Performance Metrics:**

```
Expected latency: 0.5-2 seconds per request
Token usage: Low (~50-60% of baseline)
Throughput: ~50-100 requests/second
Cost: Low (optimized for budget)
Cache hit rate: 40-60% (typical)
```

**Cost Savings:**

- ~40-50% token reduction vs baseline
- ~50-70% cost reduction with effective caching
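The two effects multiply: cache hits skip generation entirely, and the queries that miss use fewer tokens each. The sketch below is a back-of-the-envelope model of that combination (it assumes cache hits cost effectively zero generation tokens and ignores the small embedding cost of cache lookups, so treat its output as an optimistic bound):

```python
# Rough cost model: per-query token reduction and caching multiply.
# Assumption: cache hits generate ~zero tokens; embedding/lookup
# overhead for the cache itself is ignored.

def cost_ratio(per_query_token_ratio: float, cache_hit_rate: float) -> float:
    """Fraction of baseline token spend that remains."""
    return (1.0 - cache_hit_rate) * per_query_token_ratio

# Per-query usage at 50-60% of baseline, hit rates in the typical band
for tokens, hits in [(0.6, 0.4), (0.55, 0.5), (0.5, 0.6)]:
    remaining = cost_ratio(tokens, hits)
    print(f"per-query tokens {tokens:.0%} of baseline, hit rate {hits:.0%} "
          f"-> up to {1 - remaining:.0%} cost reduction")
```

Real deployments land somewhat below these upper bounds, which is where the quoted ~50-70% range comes from.

---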
### 3. Latency-Optimized (`config.recipe-latency.yaml`)

**Objective:** Minimize response time and maximize throughput

**Use Cases:**

- Real-time APIs
- Interactive chatbots
- Live customer support systems
- Gaming or entertainment applications
- Applications requiring sub-second responses

**Key Characteristics:**

- ✅ Reasoning disabled for all categories
- ✅ Aggressive semantic caching (0.7 threshold, 3hr TTL)
- ✅ Very low classification thresholds (0.4)
- ✅ Tools disabled for minimal overhead
- ✅ Security checks relaxed/disabled
- ✅ Maximum concurrency (32)
- ✅ Minimal observability overhead (5% sampling)
- ✅ Tracing disabled by default

**Trade-offs:**

- ⚠️ Reduced accuracy (~10-15% vs baseline)
- ⚠️ No reasoning means simpler responses
- ⚠️ Security features minimal/disabled
- ⚠️ Less comprehensive responses

**Performance Metrics:**

```
Expected latency: 0.1-0.8 seconds per request
Token usage: Low (~50-60% of baseline)
Throughput: ~100-200 requests/second
Cost: Low (fast and efficient)
Cache hit rate: 50-70% (typical)
```

**Speed Improvements:**

- ~3-5x faster than accuracy-optimized
- ~2-3x faster than baseline

---

## Quick Start

### Using a Recipe

**Option 1: Direct Usage**

```bash
# Use a recipe directly
cp config/config.recipe-accuracy.yaml config/config.yaml
make run-router
```

**Option 2: Kubernetes/Helm**

```yaml
# In your Helm values.yaml
configMap:
  data:
    config.yaml: |-
      {{- .Files.Get "config.recipe-latency.yaml" | nindent 6 }}
```

**Option 3: Docker Compose**

```yaml
services:
  semantic-router:
    image: vllm/semantic-router:latest
    volumes:
      - ./config/config.recipe-token-efficiency.yaml:/app/config/config.yaml:ro
```

**Option 4: ArgoCD**

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: router-config
data:
  config.yaml: |
    # Content from config.recipe-accuracy.yaml
```

### Customizing a Recipe

1. Copy the recipe that best matches your needs:

   ```bash
   cp config/config.recipe-accuracy.yaml config/config.custom.yaml
   ```

2. Modify specific settings in `config.custom.yaml`:

   ```yaml
   # Example: Enable caching in accuracy recipe
   semantic_cache:
     enabled: true # Was: false
     similarity_threshold: 0.90 # High threshold
   ```

3. Test your custom configuration:

   ```bash
   # Validate YAML syntax
   python -c "import yaml; yaml.safe_load(open('config/config.custom.yaml'))"

   # Test with your custom config
   export CONFIG_FILE=config/config.custom.yaml
   make run-router
   ```

---

## Configuration Comparison

| Feature | Accuracy | Token Efficiency | Latency |
|---------|----------|-----------------|---------|
| **Reasoning (complex tasks)** | ✅ Enabled (high) | ⚠️ Minimal | ❌ Disabled |
| **Semantic Cache** | ❌ Disabled | ✅ Aggressive | ✅ Very Aggressive |
| **Classification Threshold** | 0.7 (strict) | 0.5 (moderate) | 0.4 (relaxed) |
| **Tool Selection** | 5 tools | 1 tool | Disabled |
| **PII Detection** | 0.6 (strict) | 0.8 (relaxed) | 0.9 (minimal) |
| **Jailbreak Protection** | ✅ Enabled | ✅ Enabled | ❌ Disabled |
| **Batch Size** | 50 | 100 | 200 |
| **Max Concurrency** | 4 | 16 | 32 |
| **Tracing Sampling** | 100% | 10% | ❌ Disabled |
| **Expected Latency** | 2-5s | 0.5-2s | 0.1-0.8s |
| **Token Usage** | High | Low (50-60%) | Low (50-60%) |
| **Relative Cost** | High | Low | Low |
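To see exactly which settings separate two presets before committing to one, you can also diff them programmatically. Here is a minimal sketch using PyYAML (the same library as the validation one-liner above); it compares top-level sections wholesale, so a nested change prints its whole section:

```python
import yaml

def diff_recipes(path_a: str, path_b: str) -> None:
    """Print top-level config sections whose values differ between two recipes."""
    with open(path_a) as fa, open(path_b) as fb:
        a, b = yaml.safe_load(fa), yaml.safe_load(fb)
    for key in sorted(set(a) | set(b)):
        if a.get(key) != b.get(key):
            print(f"{key}:")
            print(f"  {path_a}: {a.get(key)}")
            print(f"  {path_b}: {b.get(key)}")

diff_recipes("config/config.recipe-accuracy.yaml",
             "config/config.recipe-latency.yaml")
```

---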
## Choosing the Right Recipe

### Decision Tree

```
Start Here
│
├─ Need maximum accuracy?
│   └─ → Use: config.recipe-accuracy.yaml
│
├─ Need to minimize costs?
│   └─ → Use: config.recipe-token-efficiency.yaml
│
├─ Need fast responses?
│   └─ → Use: config.recipe-latency.yaml
│
└─ Balanced requirements?
    └─ → Start with: config.yaml (baseline)
         Then customize based on metrics
```

### Use Case Mapping

| Use Case | Recommended Recipe | Reason |
|----------|-------------------|--------|
| Medical diagnosis support | Accuracy | Correctness is critical |
| Legal research assistant | Accuracy | High-stakes decisions |
| Customer chatbot | Latency | Real-time interaction |
| Bulk document processing | Token Efficiency | High volume, cost-sensitive |
| Educational tutor | Accuracy | Quality explanations needed |
| API rate limiting concerns | Token Efficiency | Budget constraints |
| Gaming NPC dialogue | Latency | Sub-second responses |
| Research paper analysis | Accuracy | Comprehensive analysis |

---

## Tuning and Optimization

### Monitoring Your Recipe

After deploying a recipe, monitor these key metrics:

**1. Accuracy Recipe Metrics:**

```bash
# Check reasoning usage
curl localhost:9190/metrics | grep reasoning

# Monitor response quality (manual review)
# Check for comprehensive, detailed answers
```

**2. Token Efficiency Recipe Metrics:**

```bash
# Check cache hit rate
curl localhost:9190/metrics | grep cache_hit

# Monitor token usage
curl localhost:9190/metrics | grep token_count

# Expected cache hit rate: 40-60%
# Expected token reduction: 40-50%
```

**3. Latency Recipe Metrics:**

```bash
# Check p50, p95, p99 latencies
curl localhost:9190/metrics | grep duration_seconds

# Expected p95: < 1 second
# Expected p99: < 2 seconds
```

### Fine-Tuning Parameters

#### To Improve Cache Hit Rate:

```yaml
semantic_cache:
  similarity_threshold: 0.70 # Lower = more hits (was 0.75)
  ttl_seconds: 14400 # Longer TTL (was 7200)
  max_entries: 20000 # Larger cache (was 10000)
```

#### To Reduce Latency Further:

```yaml
classifier:
  category_model:
    threshold: 0.3 # Even lower threshold (was 0.4)

api:
  batch_classification:
    max_concurrency: 64 # More parallel processing (was 32)
```

#### To Balance Accuracy and Cost:

```yaml
# Enable reasoning for select categories only
categories:
  - name: math
    model_scores:
      - model: openai/gpt-oss-20b
        use_reasoning: true # Enable for critical tasks
  - name: other
    model_scores:
      - model: openai/gpt-oss-20b
        use_reasoning: false # Disable for simple tasks
```

---

## Best Practices

### 1. Start with a Recipe

Don't start from scratch. Choose the recipe closest to your needs and customize from there.

### 2. A/B Testing

Run two configurations side-by-side and compare metrics:

```bash
# Terminal 1: Accuracy recipe on port 8801
export CONFIG_FILE=config/config.recipe-accuracy.yaml
make run-router

# Terminal 2: Latency recipe on port 8802
export CONFIG_FILE=config/config.recipe-latency.yaml
export PORT=8802
make run-router

# Compare metrics
watch -n 5 'curl -s localhost:9190/metrics | grep duration_seconds_sum'
```

### 3. Monitor and Iterate

- Track metrics for at least 24-48 hours before making changes
- Adjust one parameter at a time
- Document changes and their impact
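If you'd rather watch these numbers over time than grep by hand, a small poller over the same `/metrics` endpoint works. This is a rough sketch: it assumes hit and miss counters whose names contain `cache_hit` and `cache_miss`, so check your `/metrics` output and adjust the substrings to match your build:

```python
import time
import urllib.request

METRICS_URL = "http://localhost:9190/metrics"

def scrape(url: str = METRICS_URL) -> dict[str, float]:
    """Sum counter samples by metric name from the Prometheus text format."""
    totals: dict[str, float] = {}
    with urllib.request.urlopen(url) as resp:
        for raw in resp:
            line = raw.decode().strip()
            if not line or line.startswith("#"):
                continue
            name, _, value = line.rpartition(" ")  # "name{labels} value"
            try:
                totals[name] = totals.get(name, 0.0) + float(value)
            except ValueError:
                continue
    return totals

while True:
    totals = scrape()
    # Assumed counter names -- adjust to what `grep cache` shows for your build.
    hits = sum(v for k, v in totals.items() if "cache_hit" in k)
    misses = sum(v for k, v in totals.items() if "cache_miss" in k)
    if hits + misses:
        print(f"cache hit rate: {hits / (hits + misses):.1%}")
    time.sleep(30)
```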
### 4. Environment-Specific Configs

Use different recipes for different environments:

```bash
# Development: Use latency recipe for fast iteration
config/config.recipe-latency.yaml → config/config.dev.yaml

# Staging: Use accuracy recipe for testing
config/config.recipe-accuracy.yaml → config/config.staging.yaml

# Production: Use token efficiency for cost control
config/config.recipe-token-efficiency.yaml → config/config.prod.yaml
```

### 5. Version Control Your Configs

```bash
# Track your custom configurations
git add config/config.custom-*.yaml
git commit -m "feat: add custom config for production deployment"
```

---

## Advanced: Hybrid Configurations

You can mix and match settings from different recipes:

### Example: High-Accuracy, Low-Cost Hybrid

Start from the token efficiency recipe, then pull selective reasoning and stricter PII detection from the accuracy recipe for a balanced approach:

```bash
# Start with token efficiency
cp config/config.recipe-token-efficiency.yaml config/config.hybrid.yaml
```

Then customize `config/config.hybrid.yaml`:

```yaml
categories:
  - name: math
    model_scores:
      - model: openai/gpt-oss-20b
        use_reasoning: true # From accuracy recipe
  - name: law
    model_scores:
      - model: openai/gpt-oss-20b
        use_reasoning: true # From accuracy recipe
  # ... other categories keep reasoning: false

classifier:
  pii_model:
    threshold: 0.6 # Stricter (from accuracy recipe)
```

### Example: Fast + Accurate Critical Path

```yaml
# Base: Latency recipe for speed
# + Enable reasoning for specific high-value queries
# = Fast for most, accurate for critical

# Use category-specific reasoning
categories:
  - name: health
    model_scores:
      - model: openai/gpt-oss-20b
        use_reasoning: true # Accuracy for critical domain
  - name: other
    model_scores:
      - model: openai/gpt-oss-20b
        use_reasoning: false # Speed for general queries
```

---

## Troubleshooting

### Recipe Not Performing as Expected

**Problem: Cache hit rate is low (<20%)**

```yaml
# Solution: Lower similarity threshold
semantic_cache:
  similarity_threshold: 0.65 # Lower = more hits
```

**Problem: Too many classification errors**

```yaml
# Solution: Increase classification threshold
classifier:
  category_model:
    threshold: 0.6 # Higher = more confident classifications
```

**Problem: High latency despite using latency recipe**

```yaml
# Solution: Profile and optimize
# 1. Check if reasoning is accidentally enabled
# 2. Verify cache is working (check metrics)
# 3. 
Increase concurrency +api: + batch_classification: + max_concurrency: 64 # Increase parallelism +``` + +**Problem: Token usage still high with efficiency recipe** + +```yaml +# Solution: Verify reasoning is disabled +# Check all categories have use_reasoning: false +# Increase cache hit rate +semantic_cache: + similarity_threshold: 0.65 # More aggressive caching + max_entries: 30000 # Larger cache +``` + +--- + +## Related Documentation + +- [Configuration Guide](../website/docs/installation/configuration.md) +- [Performance Tuning](../website/docs/tutorials/performance-tuning.md) +- [Observability](../website/docs/tutorials/observability/distributed-tracing.md) +- [Cost Optimization](../website/docs/tutorials/cost-optimization.md) diff --git a/config/config.recipe-accuracy.yaml b/config/config.recipe-accuracy.yaml new file mode 100644 index 00000000..82769836 --- /dev/null +++ b/config/config.recipe-accuracy.yaml @@ -0,0 +1,208 @@ +# Recipe: Accuracy-Optimized Configuration +# Objective: Maximum accuracy and response quality +# Trade-offs: Higher token usage, increased latency, more computational cost +# Use case: Research, critical decision-making, high-stakes applications +# +# Key optimizations: +# - Reasoning enabled for most complex categories +# - High reasoning effort level (high) +# - Strict classification thresholds (higher confidence required) +# - Semantic cache disabled to ensure fresh responses +# - Tool selection enabled with broad matching +# - PII detection strict for safety +# - Jailbreak protection enabled + +bert_model: + model_id: sentence-transformers/all-MiniLM-L12-v2 + threshold: 0.7 # Higher threshold for better precision + use_cpu: true + +semantic_cache: + enabled: false # Disable caching to ensure fresh, accurate responses + backend_type: "memory" + similarity_threshold: 0.95 # Very high threshold if cache is enabled + max_entries: 500 + ttl_seconds: 1800 # Shorter TTL for fresher results + eviction_policy: "lru" + +tools: + enabled: true # Enable tools for comprehensive responses + top_k: 5 # Select more tools for better coverage + similarity_threshold: 0.15 # Lower threshold to include more relevant tools + tools_db_path: "config/tools_db.json" + fallback_to_empty: true + +prompt_guard: + enabled: true # Enable for safety + use_modernbert: true + model_id: "models/jailbreak_classifier_modernbert-base_model" + threshold: 0.65 # Lower threshold (more sensitive detection) + use_cpu: true + jailbreak_mapping_path: "models/jailbreak_classifier_modernbert-base_model/jailbreak_type_mapping.json" + +vllm_endpoints: + - name: "endpoint1" + address: "127.0.0.1" + port: 8000 + models: + - "openai/gpt-oss-20b" + weight: 1 + +model_config: + "openai/gpt-oss-20b": + reasoning_family: "gpt-oss" + preferred_endpoints: ["endpoint1"] + pii_policy: + allow_by_default: false # Strict PII policy for safety + pii_types_allowed: [] # No PII allowed by default + pricing: + currency: USD + prompt_per_1m: 0.10 + completion_per_1m: 0.30 + +classifier: + category_model: + model_id: "models/category_classifier_modernbert-base_model" + use_modernbert: true + threshold: 0.7 # Higher threshold for confident classification + use_cpu: true + category_mapping_path: "models/category_classifier_modernbert-base_model/category_mapping.json" + pii_model: + model_id: "models/pii_classifier_modernbert-base_presidio_token_model" + use_modernbert: true + threshold: 0.6 # Lower threshold for sensitive PII detection + use_cpu: true + pii_mapping_path: 
"models/pii_classifier_modernbert-base_presidio_token_model/pii_type_mapping.json" + +categories: + - name: business + system_prompt: "You are a senior business consultant and strategic advisor with expertise in corporate strategy, operations management, financial analysis, marketing, and organizational development. Provide practical, actionable business advice backed by proven methodologies and industry best practices. Consider market dynamics, competitive landscape, and stakeholder interests in your recommendations." + model_scores: + - model: openai/gpt-oss-20b + score: 1.0 + use_reasoning: true # Enable reasoning for better business analysis + - name: law + system_prompt: "You are a knowledgeable legal expert with comprehensive understanding of legal principles, case law, statutory interpretation, and legal procedures across multiple jurisdictions. Provide accurate legal information and analysis while clearly stating that your responses are for informational purposes only and do not constitute legal advice. Always recommend consulting with qualified legal professionals for specific legal matters." + model_scores: + - model: openai/gpt-oss-20b + score: 1.0 + use_reasoning: true # Enable reasoning for legal analysis + - name: psychology + system_prompt: "You are a psychology expert with deep knowledge of cognitive processes, behavioral patterns, mental health, developmental psychology, social psychology, and therapeutic approaches. Provide evidence-based insights grounded in psychological research and theory. When discussing mental health topics, emphasize the importance of professional consultation and avoid providing diagnostic or therapeutic advice." + model_scores: + - model: openai/gpt-oss-20b + score: 1.0 + use_reasoning: true # Enable reasoning for psychological analysis + - name: biology + system_prompt: "You are a biology expert with comprehensive knowledge spanning molecular biology, genetics, cell biology, ecology, evolution, anatomy, physiology, and biotechnology. Explain biological concepts with scientific accuracy, use appropriate terminology, and provide examples from current research. Connect biological principles to real-world applications and emphasize the interconnectedness of biological systems." + model_scores: + - model: openai/gpt-oss-20b + score: 1.0 + use_reasoning: true # Enable reasoning for scientific rigor + - name: chemistry + system_prompt: "You are a chemistry expert specializing in chemical reactions, molecular structures, and laboratory techniques. Provide detailed, step-by-step explanations." + model_scores: + - model: openai/gpt-oss-20b + score: 1.0 + use_reasoning: true # Enable reasoning for complex chemistry + - name: history + system_prompt: "You are a historian with expertise across different time periods and cultures. Provide accurate historical context and analysis." + model_scores: + - model: openai/gpt-oss-20b + score: 1.0 + use_reasoning: true # Enable reasoning for historical analysis + - name: other + system_prompt: "You are a helpful and knowledgeable assistant. Provide accurate, helpful responses across a wide range of topics." + model_scores: + - model: openai/gpt-oss-20b + score: 0.9 + use_reasoning: false # Default queries don't need reasoning + - name: health + system_prompt: "You are a health and medical information expert with knowledge of anatomy, physiology, diseases, treatments, preventive care, nutrition, and wellness. 
Provide accurate, evidence-based health information while emphasizing that your responses are for educational purposes only and should never replace professional medical advice, diagnosis, or treatment. Always encourage users to consult healthcare professionals for medical concerns and emergencies." + model_scores: + - model: openai/gpt-oss-20b + score: 1.0 + use_reasoning: true # Enable reasoning for medical accuracy + - name: economics + system_prompt: "You are an economics expert with deep understanding of microeconomics, macroeconomics, econometrics, financial markets, monetary policy, fiscal policy, international trade, and economic theory. Analyze economic phenomena using established economic principles, provide data-driven insights, and explain complex economic concepts in accessible terms. Consider both theoretical frameworks and real-world applications in your responses." + model_scores: + - model: openai/gpt-oss-20b + score: 1.0 + use_reasoning: true # Enable reasoning for economic analysis + - name: math + system_prompt: "You are a mathematics expert. Provide step-by-step solutions, show your work clearly, and explain mathematical concepts in an understandable way." + model_scores: + - model: openai/gpt-oss-20b + score: 1.0 + use_reasoning: true # Enable reasoning for complex math + - name: physics + system_prompt: "You are a physics expert with deep understanding of physical laws and phenomena. Provide clear explanations with mathematical derivations when appropriate." + model_scores: + - model: openai/gpt-oss-20b + score: 1.0 + use_reasoning: true # Enable reasoning for physics + - name: computer science + system_prompt: "You are a computer science expert with knowledge of algorithms, data structures, programming languages, and software engineering. Provide clear, practical solutions with code examples when helpful." + model_scores: + - model: openai/gpt-oss-20b + score: 1.0 + use_reasoning: true # Enable reasoning for complex CS problems + - name: philosophy + system_prompt: "You are a philosophy expert with comprehensive knowledge of philosophical traditions, ethical theories, logic, metaphysics, epistemology, political philosophy, and the history of philosophical thought. Engage with complex philosophical questions by presenting multiple perspectives, analyzing arguments rigorously, and encouraging critical thinking. Draw connections between philosophical concepts and contemporary issues while maintaining intellectual honesty about the complexity and ongoing nature of philosophical debates." + model_scores: + - model: openai/gpt-oss-20b + score: 1.0 + use_reasoning: true # Enable reasoning for philosophical inquiry + - name: engineering + system_prompt: "You are an engineering expert with knowledge across multiple engineering disciplines including mechanical, electrical, civil, chemical, software, and systems engineering. Apply engineering principles, design methodologies, and problem-solving approaches to provide practical solutions. Consider safety, efficiency, sustainability, and cost-effectiveness in your recommendations. Use technical precision while explaining concepts clearly, and emphasize the importance of proper engineering practices and standards." 
+ model_scores: + - model: openai/gpt-oss-20b + score: 1.0 + use_reasoning: true # Enable reasoning for engineering analysis + +default_model: openai/gpt-oss-20b + +reasoning_families: + deepseek: + type: "chat_template_kwargs" + parameter: "thinking" + qwen3: + type: "chat_template_kwargs" + parameter: "enable_thinking" + gpt-oss: + type: "reasoning_effort" + parameter: "reasoning_effort" + gpt: + type: "reasoning_effort" + parameter: "reasoning_effort" + +default_reasoning_effort: high # Maximum reasoning effort + +api: + batch_classification: + max_batch_size: 50 # Smaller batches for more accurate processing + concurrency_threshold: 3 + max_concurrency: 4 + metrics: + enabled: true + detailed_goroutine_tracking: true + high_resolution_timing: true + sample_rate: 1.0 + duration_buckets: [0.001, 0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10, 30, 60] + size_buckets: [1, 2, 5, 10, 20, 50, 100] + +observability: + tracing: + enabled: true # Enable for monitoring accuracy + provider: "opentelemetry" + exporter: + type: "otlp" + endpoint: "localhost:4317" + insecure: true + sampling: + type: "always_on" + rate: 1.0 + resource: + service_name: "vllm-semantic-router-accuracy" + service_version: "v0.1.0" + deployment_environment: "production" diff --git a/config/config.recipe-latency.yaml b/config/config.recipe-latency.yaml new file mode 100644 index 00000000..15008b04 --- /dev/null +++ b/config/config.recipe-latency.yaml @@ -0,0 +1,202 @@ +# Recipe: Latency-Optimized Configuration +# Objective: Minimize response time and maximize throughput +# Trade-offs: May sacrifice accuracy, uses aggressive caching, minimal reasoning +# Use case: Real-time APIs, chatbots, interactive applications +# +# Key optimizations: +# - Reasoning disabled for all categories (fastest responses) +# - Aggressive semantic caching for instant cache hits +# - Very low classification thresholds for fast routing +# - Minimal tool selection +# - Relaxed security checks for speed +# - High concurrency and large batch sizes +# - Minimal observability overhead + +bert_model: + model_id: sentence-transformers/all-MiniLM-L12-v2 + threshold: 0.4 # Very low threshold for fast matching + use_cpu: true + +semantic_cache: + enabled: true # Enable aggressive caching for instant responses + backend_type: "memory" + similarity_threshold: 0.7 # Low threshold for maximum cache hits + max_entries: 20000 # Very large cache + ttl_seconds: 10800 # Long TTL (3 hours) + eviction_policy: "lru" # Keep frequently accessed items + +tools: + enabled: false # Disable tools to minimize latency + top_k: 1 + similarity_threshold: 0.5 + tools_db_path: "config/tools_db.json" + fallback_to_empty: true + +prompt_guard: + enabled: false # Disable for maximum speed + +vllm_endpoints: + - name: "endpoint1" + address: "127.0.0.1" + port: 8000 + models: + - "openai/gpt-oss-20b" + weight: 1 + +model_config: + "openai/gpt-oss-20b": + reasoning_family: "gpt-oss" + preferred_endpoints: ["endpoint1"] + pii_policy: + allow_by_default: true # Allow all for speed; when true, all PII types are allowed + pricing: + currency: USD + prompt_per_1m: 0.10 + completion_per_1m: 0.30 + +classifier: + category_model: + model_id: "models/category_classifier_modernbert-base_model" + use_modernbert: true + threshold: 0.4 # Very low threshold for fast classification + use_cpu: true + category_mapping_path: "models/category_classifier_modernbert-base_model/category_mapping.json" + pii_model: + model_id: "models/pii_classifier_modernbert-base_presidio_token_model" + 
use_modernbert: true + threshold: 0.9 # Very high threshold (minimal PII detection for speed) + use_cpu: true + pii_mapping_path: "models/pii_classifier_modernbert-base_presidio_token_model/pii_type_mapping.json" + +categories: + - name: business + system_prompt: "Provide concise business advice." + model_scores: + - model: openai/gpt-oss-20b + score: 0.7 + use_reasoning: false # No reasoning for speed + - name: law + system_prompt: "Provide legal information." + model_scores: + - model: openai/gpt-oss-20b + score: 0.5 + use_reasoning: false + - name: psychology + system_prompt: "Provide psychology insights." + model_scores: + - model: openai/gpt-oss-20b + score: 0.6 + use_reasoning: false + - name: biology + system_prompt: "Explain biology concepts." + model_scores: + - model: openai/gpt-oss-20b + score: 0.8 + use_reasoning: false + - name: chemistry + system_prompt: "Explain chemistry concepts." + model_scores: + - model: openai/gpt-oss-20b + score: 0.6 + use_reasoning: false + - name: history + system_prompt: "Provide historical context." + model_scores: + - model: openai/gpt-oss-20b + score: 0.7 + use_reasoning: false + - name: other + system_prompt: "Provide helpful responses." + model_scores: + - model: openai/gpt-oss-20b + score: 0.7 + use_reasoning: false + - name: health + system_prompt: "Provide health information." + model_scores: + - model: openai/gpt-oss-20b + score: 0.5 + use_reasoning: false + - name: economics + system_prompt: "Provide economic insights." + model_scores: + - model: openai/gpt-oss-20b + score: 0.9 + use_reasoning: false + - name: math + system_prompt: "Provide math solutions." + model_scores: + - model: openai/gpt-oss-20b + score: 1.0 + use_reasoning: false # Even math: no reasoning for speed + - name: physics + system_prompt: "Explain physics concepts." + model_scores: + - model: openai/gpt-oss-20b + score: 0.7 + use_reasoning: false + - name: computer science + system_prompt: "Provide code solutions." + model_scores: + - model: openai/gpt-oss-20b + score: 0.6 + use_reasoning: false + - name: philosophy + system_prompt: "Provide philosophical perspectives." + model_scores: + - model: openai/gpt-oss-20b + score: 0.5 + use_reasoning: false + - name: engineering + system_prompt: "Provide engineering solutions." 
+ model_scores: + - model: openai/gpt-oss-20b + score: 0.7 + use_reasoning: false + +default_model: openai/gpt-oss-20b + +reasoning_families: + deepseek: + type: "chat_template_kwargs" + parameter: "thinking" + qwen3: + type: "chat_template_kwargs" + parameter: "enable_thinking" + gpt-oss: + type: "reasoning_effort" + parameter: "reasoning_effort" + gpt: + type: "reasoning_effort" + parameter: "reasoning_effort" + +default_reasoning_effort: low # Minimal effort if reasoning is ever used + +api: + batch_classification: + max_batch_size: 200 # Very large batches for throughput + concurrency_threshold: 5 + max_concurrency: 32 # Maximum concurrency for speed + metrics: + enabled: true + detailed_goroutine_tracking: false # Disable for performance + high_resolution_timing: false + sample_rate: 0.05 # Sample only 5% to minimize overhead + duration_buckets: [0.001, 0.005, 0.01, 0.05, 0.1, 0.5, 1] + size_buckets: [1, 10, 50, 100, 200] + +observability: + tracing: + enabled: false # Disable tracing for maximum performance + provider: "opentelemetry" + exporter: + type: "stdout" + endpoint: "" + insecure: true + sampling: + type: "probabilistic" + rate: 0.01 # Sample only 1% if enabled + resource: + service_name: "vllm-semantic-router-latency" + service_version: "v0.1.0" + deployment_environment: "production" diff --git a/config/config.recipe-token-efficiency.yaml b/config/config.recipe-token-efficiency.yaml new file mode 100644 index 00000000..be3d8abc --- /dev/null +++ b/config/config.recipe-token-efficiency.yaml @@ -0,0 +1,207 @@ +# Recipe: Token Efficiency-Optimized Configuration +# Objective: Minimize token usage and reduce costs +# Trade-offs: May sacrifice some accuracy, uses aggressive caching +# Use case: High-volume production deployments, cost-sensitive applications +# +# Key optimizations: +# - Reasoning disabled for most categories (reduces token usage) +# - Low reasoning effort when reasoning is needed +# - Aggressive semantic caching (high similarity threshold, long TTL) +# - Lower classification thresholds for faster routing +# - Reduced tool selection (fewer tool tokens) +# - Relaxed PII policies (less token overhead) +# - Larger batch sizes for efficient processing + +bert_model: + model_id: sentence-transformers/all-MiniLM-L12-v2 + threshold: 0.5 # Lower threshold for faster matching + use_cpu: true + +semantic_cache: + enabled: true # Enable aggressive caching + backend_type: "memory" + similarity_threshold: 0.75 # Lower threshold for more cache hits + max_entries: 10000 # Large cache for better hit rate + ttl_seconds: 7200 # Long TTL (2 hours) + eviction_policy: "lru" # Keep most used entries + +tools: + enabled: true + top_k: 1 # Select fewer tools to reduce tokens + similarity_threshold: 0.3 # Higher threshold for stricter tool selection + tools_db_path: "config/tools_db.json" + fallback_to_empty: true + +prompt_guard: + enabled: true + use_modernbert: true + model_id: "models/jailbreak_classifier_modernbert-base_model" + threshold: 0.75 # Higher threshold (less sensitive, fewer rejections) + use_cpu: true + jailbreak_mapping_path: "models/jailbreak_classifier_modernbert-base_model/jailbreak_type_mapping.json" + +vllm_endpoints: + - name: "endpoint1" + address: "127.0.0.1" + port: 8000 + models: + - "openai/gpt-oss-20b" + weight: 1 + +model_config: + "openai/gpt-oss-20b": + reasoning_family: "gpt-oss" + preferred_endpoints: ["endpoint1"] + pii_policy: + allow_by_default: true # Relaxed PII policy for efficiency; when true, all PII types are allowed + pricing: + currency: USD + 
prompt_per_1m: 0.10 + completion_per_1m: 0.30 + +classifier: + category_model: + model_id: "models/category_classifier_modernbert-base_model" + use_modernbert: true + threshold: 0.5 # Lower threshold for faster classification + use_cpu: true + category_mapping_path: "models/category_classifier_modernbert-base_model/category_mapping.json" + pii_model: + model_id: "models/pii_classifier_modernbert-base_presidio_token_model" + use_modernbert: true + threshold: 0.8 # Higher threshold (less sensitive, allows more content) + use_cpu: true + pii_mapping_path: "models/pii_classifier_modernbert-base_presidio_token_model/pii_type_mapping.json" + +categories: + - name: business + system_prompt: "You are a business consultant. Provide concise, practical advice." + model_scores: + - model: openai/gpt-oss-20b + score: 0.7 + use_reasoning: false # Disable reasoning to save tokens + - name: law + system_prompt: "You are a legal information expert. Provide concise legal information." + model_scores: + - model: openai/gpt-oss-20b + score: 0.5 + use_reasoning: false + - name: psychology + system_prompt: "You are a psychology expert. Provide evidence-based insights." + model_scores: + - model: openai/gpt-oss-20b + score: 0.6 + use_reasoning: false + - name: biology + system_prompt: "You are a biology expert. Explain biological concepts clearly." + model_scores: + - model: openai/gpt-oss-20b + score: 0.8 + use_reasoning: false + - name: chemistry + system_prompt: "You are a chemistry expert. Provide clear explanations." + model_scores: + - model: openai/gpt-oss-20b + score: 0.7 + use_reasoning: false # Disable reasoning for token efficiency + - name: history + system_prompt: "You are a historian. Provide accurate historical context." + model_scores: + - model: openai/gpt-oss-20b + score: 0.7 + use_reasoning: false + - name: other + system_prompt: "You are a helpful assistant. Provide concise, accurate responses." + model_scores: + - model: openai/gpt-oss-20b + score: 0.7 + use_reasoning: false + - name: health + system_prompt: "You are a health information expert. Provide evidence-based health information." + model_scores: + - model: openai/gpt-oss-20b + score: 0.6 + use_reasoning: false + - name: economics + system_prompt: "You are an economics expert. Provide data-driven economic insights." + model_scores: + - model: openai/gpt-oss-20b + score: 0.9 + use_reasoning: false + - name: math + system_prompt: "You are a mathematics expert. Provide clear, step-by-step solutions." + model_scores: + - model: openai/gpt-oss-20b + score: 1.0 + use_reasoning: true # Only enable for math where reasoning is critical + - name: physics + system_prompt: "You are a physics expert. Explain physical concepts clearly." + model_scores: + - model: openai/gpt-oss-20b + score: 0.8 + use_reasoning: false # Disable to save tokens + - name: computer science + system_prompt: "You are a computer science expert. Provide practical code solutions." + model_scores: + - model: openai/gpt-oss-20b + score: 0.7 + use_reasoning: false + - name: philosophy + system_prompt: "You are a philosophy expert. Present clear philosophical perspectives." + model_scores: + - model: openai/gpt-oss-20b + score: 0.6 + use_reasoning: false + - name: engineering + system_prompt: "You are an engineering expert. Provide practical engineering solutions." 
    model_scores:
      - model: openai/gpt-oss-20b
        score: 0.8
        use_reasoning: false

default_model: openai/gpt-oss-20b

reasoning_families:
  deepseek:
    type: "chat_template_kwargs"
    parameter: "thinking"
  qwen3:
    type: "chat_template_kwargs"
    parameter: "enable_thinking"
  gpt-oss:
    type: "reasoning_effort"
    parameter: "reasoning_effort"
  gpt:
    type: "reasoning_effort"
    parameter: "reasoning_effort"

default_reasoning_effort: low # Minimal reasoning effort to save tokens

api:
  batch_classification:
    max_batch_size: 100 # Larger batches for efficiency
    concurrency_threshold: 10
    max_concurrency: 16 # Higher concurrency for throughput
  metrics:
    enabled: true
    detailed_goroutine_tracking: false # Disable for efficiency
    high_resolution_timing: false
    sample_rate: 0.1 # Sample 10% to reduce overhead
    duration_buckets: [0.01, 0.05, 0.1, 0.5, 1, 5, 10]
    size_buckets: [1, 10, 50, 100, 200]

observability:
  tracing:
    enabled: true
    provider: "opentelemetry"
    exporter:
      type: "otlp"
      endpoint: "localhost:4317"
      insecure: true
    sampling:
      type: "probabilistic"
      rate: 0.1 # Sample 10% of traces to reduce overhead
    resource:
      service_name: "vllm-semantic-router-token-efficient"
      service_version: "v0.1.0"
      deployment_environment: "production"
diff --git a/website/docs/installation/configuration.md b/website/docs/installation/configuration.md
index 424f21de..8c2c26f5 100644
--- a/website/docs/installation/configuration.md
+++ b/website/docs/installation/configuration.md
@@ -133,6 +133,22 @@ model_config:
    preferred_endpoints: ["endpoint1"]
```

## Configuration Recipes (presets)

We provide curated, versioned presets you can use directly or as a starting point:

- Accuracy-optimized: [config.recipe-accuracy.yaml](https://github.com/vllm-project/semantic-router/blob/main/config/config.recipe-accuracy.yaml)
- Token-efficiency-optimized: [config.recipe-token-efficiency.yaml](https://github.com/vllm-project/semantic-router/blob/main/config/config.recipe-token-efficiency.yaml)
- Latency-optimized: [config.recipe-latency.yaml](https://github.com/vllm-project/semantic-router/blob/main/config/config.recipe-latency.yaml)
- Guide and usage: [config/RECIPES.md](https://github.com/vllm-project/semantic-router/blob/main/config/RECIPES.md)

Quick usage:

- Local: copy a recipe over `config.yaml`, then run
  - `cp config/config.recipe-accuracy.yaml config/config.yaml`
  - `make run-router`
- Helm/Argo: reference the recipe file contents in your ConfigMap (examples are in the guide above).

## Key Configuration Sections

### Backend Endpoints