Feature Request: Inter-Cluster and Hybrid Cloud Routing Support
Is your feature request related to a problem? Please describe.
The current semantic router is designed for intra-cluster routing within a single deployment, but production environments increasingly require routing across multiple clusters and cloud providers. The existing architecture lacks support for:
- Multi-cloud routing: Cannot route between on-premises deployments and external model providers (OpenAI, Claude, Grok, etc.)
- Cross-cluster routing: No ability to route among clusters with different model availability due to licensing, traffic limits, or GPU constraints
- Performance-based cluster selection: Cannot route to clusters based on latency, context window capacity, or other performance characteristics
- Hybrid cloud scenarios: No support for routing between private and public cloud deployments
- Fault tolerance: No failover mechanisms when primary clusters become unavailable
- Cost optimization: Cannot route based on cost differences between clusters or providers
This limitation prevents the semantic router from being used in enterprise environments where models are distributed across multiple clusters, regions, or cloud providers for reasons of compliance, performance, cost, or availability.
Describe the solution you'd like
I want a comprehensive inter-cluster routing system that extends the current semantic router to support routing across multiple clusters, cloud providers, and deployment environments:
Core Requirements
- Multi-cluster discovery: Automatic discovery and health monitoring of available clusters
- Provider abstraction: Unified interface for routing to different model providers (vLLM, OpenAI, Claude, etc.)
- Cross-cluster routing: Ability to route requests to the most appropriate cluster based on multiple criteria
- Fault tolerance: Automatic failover and circuit breaker patterns for cluster failures
- Performance optimization: Route based on latency, throughput, context window, and other performance metrics
- Cost-aware routing: Route based on cost differences between clusters and providers
- Compliance routing: Route based on data residency, security requirements, and regulatory constraints
Cluster Management
- Cluster registry: Central registry of available clusters with metadata (location, capabilities, costs, etc.)
- Health monitoring: Continuous health checks and performance monitoring for all clusters
- Dynamic discovery: Support for service discovery mechanisms (Consul, etcd, Kubernetes services)
- Cluster metadata: Rich metadata about each cluster (models available, performance characteristics, costs, compliance zones)
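The cluster registry and its metadata could be sketched as follows. This is a minimal in-memory illustration; all names (`ClusterInfo`, `ClusterRegistry`, the field set) are assumptions for the sketch, and a production registry would back onto Consul, etcd, or the Kubernetes API as listed above.

```python
from dataclasses import dataclass, field

@dataclass
class ClusterInfo:
    """One registry entry: the kind of metadata described above (illustrative)."""
    name: str
    provider_type: str                       # e.g. "vllm", "openai"
    location: str                            # region or datacenter
    models: list = field(default_factory=list)
    compliance_zones: list = field(default_factory=list)
    cost_per_1k_tokens: float = 0.0
    avg_latency_ms: float = 0.0
    healthy: bool = True                     # updated by health checks

class ClusterRegistry:
    """Minimal in-memory registry with register/deregister and model lookup."""
    def __init__(self):
        self._clusters = {}

    def register(self, cluster):
        self._clusters[cluster.name] = cluster

    def deregister(self, name):
        self._clusters.pop(name, None)

    def healthy_clusters_with_model(self, model):
        # Candidate set for routing: healthy clusters that serve the model
        return [c for c in self._clusters.values()
                if c.healthy and model in c.models]
```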
Routing Strategies
- Latency-based routing: Route to the cluster with lowest latency for the requesting region
- Load-based routing: Distribute load across clusters based on current capacity
- Cost-optimized routing: Route to the most cost-effective cluster for the request type
- Compliance-based routing: Route based on data residency and regulatory requirements
- Model-specific routing: Route to clusters that have specific models available
- Context-aware routing: Route to clusters with appropriate context window capacity
- Hybrid routing: Combine multiple strategies with configurable weights
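A hybrid strategy with configurable weights might look like the sketch below: each candidate cluster gets a weighted score over its latency and cost metrics, and the lowest score wins. The field names and the normalization constants are assumptions, not the project's actual scoring function.

```python
def hybrid_score(cluster, weights):
    """Weighted blend of latency and cost; lower is better.
    Normalization (ms/1000, $ per 1k tokens * 100) is arbitrary for the sketch."""
    return (weights.get("latency", 0.0) * cluster["avg_latency_ms"] / 1000.0
            + weights.get("cost", 0.0) * cluster["cost_per_1k_tokens"] * 100.0)

def pick_cluster(clusters, weights, required_model=None):
    """Filter by model availability, then take the best-scoring cluster."""
    candidates = [c for c in clusters
                  if required_model is None or required_model in c["models"]]
    return min(candidates, key=lambda c: hybrid_score(c, weights), default=None)
```

Tuning the weights shifts the trade-off: a latency-only weight vector reduces this to latency-based routing, a cost-only vector to cost-optimized routing.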
Provider Support
- vLLM clusters: Native support for vLLM deployments across multiple clusters
- OpenAI API: Support for OpenAI's API with rate limiting and cost tracking
- Anthropic Claude: Integration with Claude API for specific use cases
- Grok API: Support for xAI's Grok API
- Custom providers: Plugin architecture for adding new model providers
- Provider-specific features: Leverage unique capabilities of each provider
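The plugin architecture for custom providers could follow a register-by-name pattern like this sketch. The interface methods and decorator are hypothetical; the point is that each provider type implements one small contract and the router dispatches through the registry.

```python
import abc

class ModelProvider(abc.ABC):
    """Contract a provider plugin would implement (method names are assumptions)."""
    @abc.abstractmethod
    def list_models(self):
        ...

    @abc.abstractmethod
    def complete(self, model, prompt):
        ...

PROVIDER_TYPES = {}  # provider type name -> implementing class

def register_provider(name):
    """Class decorator that registers a provider implementation under a name."""
    def wrap(cls):
        PROVIDER_TYPES[name] = cls
        return cls
    return wrap

@register_provider("echo")
class EchoProvider(ModelProvider):
    """Trivial provider used only to demonstrate the plugin mechanism."""
    def list_models(self):
        return ["echo-1"]

    def complete(self, model, prompt):
        return prompt
```

Real `vllm`, `openai`, or `claude` providers would register the same way, each wrapping its own authentication, rate limiting, and cost tracking behind the common interface.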
Describe alternatives you've considered
1. API Gateway approach
- Pros: Leverages existing API gateway solutions
- Cons: Limited semantic routing capabilities, no model-specific optimizations
2. Service mesh integration
- Pros: Built-in load balancing and failover
- Cons: No semantic awareness, limited routing intelligence
3. Custom load balancer
- Pros: Full control over routing logic
- Cons: Significant development effort, reinventing existing solutions
4. Provider-specific solutions
- Pros: Optimized for each provider
- Cons: Fragmented approach, no unified interface
Chosen approach: Extend the existing semantic router with inter-cluster capabilities, providing a unified interface while leveraging the existing semantic classification and routing intelligence.
Additional context
Current Architecture Limitations
The existing semantic router operates within a single cluster and cannot:
- Route across multiple deployments
- Handle provider-specific authentication and rate limiting
- Optimize for cross-cluster performance characteristics
- Provide fault tolerance across cluster boundaries
- Support compliance requirements that span multiple regions
Inter-Cluster Routing Benefits
By adding inter-cluster support, the semantic router can:
- Scale horizontally: Distribute load across multiple clusters
- Improve availability: Provide fault tolerance through cluster redundancy
- Optimize costs: Route to the most cost-effective cluster for each request
- Meet compliance: Route based on data residency and regulatory requirements
- Leverage provider strengths: Use different providers for different use cases
Use Case Examples
1. On-Premises + Cloud Hybrid
```yaml
clusters:
  - name: "on-prem-gpu-cluster"
    type: "vllm"
    location: "us-west-2"
    models: ["llama-2-70b", "codellama-34b"]
    compliance: ["hipaa", "sox"]
    cost_per_token: 0.001
  - name: "openai-cloud"
    type: "openai"
    location: "us-west-2"
    models: ["gpt-4", "gpt-3.5-turbo"]
    compliance: ["soc2"]
    cost_per_token: 0.002
```
2. Multi-Region Deployment
```yaml
clusters:
  - name: "us-east-cluster"
    type: "vllm"
    location: "us-east-1"
    latency_zones: ["us-east", "us-central"]
    models: ["llama-2-70b", "mistral-7b"]
  - name: "eu-west-cluster"
    type: "vllm"
    location: "eu-west-1"
    latency_zones: ["eu-west", "eu-central"]
    models: ["llama-2-70b", "mistral-7b"]
    compliance: ["gdpr"]
```
3. Model-Specific Routing
```yaml
routing_rules:
  - name: "code-generation-routing"
    conditions:
      - category: "code_generation"
        confidence: 0.8
    actions:
      - route_to_cluster: "code-specialized-cluster"
        models: ["codellama-34b", "gpt-4-code"]
  - name: "general-purpose-routing"
    conditions:
      - category: "general"
    actions:
      - route_to_cluster: "general-purpose-cluster"
        models: ["llama-2-70b", "gpt-3.5-turbo"]
```
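Evaluating such rules amounts to first-match selection over the classified request. A sketch of the matching logic (the semantics — category equality plus a confidence threshold — are an assumption that mirrors the rule fields):

```python
def condition_matches(cond, classification):
    """A condition matches when its category (if given) equals the classified
    category and the classifier confidence meets the threshold (if given)."""
    if "category" in cond and classification.get("category") != cond["category"]:
        return False
    if "confidence" in cond and classification.get("confidence", 0.0) < cond["confidence"]:
        return False
    return True

def select_actions(rules, classification):
    """Return the actions of the first rule whose conditions all match."""
    for rule in rules:
        if all(condition_matches(c, classification) for c in rule["conditions"]):
            return rule["actions"]
    return None
```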
Technical Requirements
- Performance: Inter-cluster routing decisions must complete within 100ms
- Reliability: 99.9% uptime with automatic failover
- Scalability: Support for 100+ clusters and 10,000+ requests/second
- Security: Secure authentication and encryption for cross-cluster communication
- Monitoring: Comprehensive metrics and alerting for cluster health and routing decisions
Implementation Priority
This feature should be prioritized as P1 because it's essential for enterprise adoption where models are distributed across multiple clusters, regions, and providers.
Example Inter-Cluster Configuration
```yaml
apiVersion: vllm.ai/v1alpha1
kind: InterClusterRoute
metadata:
  name: multi-cluster-routing
spec:
  # Cluster discovery and management
  cluster_discovery:
    enabled: true
    method: "kubernetes"  # Options: "kubernetes", "consul", "static"
    refresh_interval: "30s"

  # Provider configurations
  providers:
    - name: "vllm-cluster-1"
      type: "vllm"
      endpoint: "https://cluster1.internal:8000"
      authentication:
        type: "bearer"
        token: "secret-token"
      models: ["llama-2-70b", "mistral-7b"]
      capabilities:
        max_context_length: 4096
        max_tokens_per_second: 100
      performance:
        avg_latency_ms: 150
        cost_per_1k_tokens: 0.001
    - name: "openai-prod"
      type: "openai"
      endpoint: "https://api.openai.com/v1"
      authentication:
        type: "api_key"
        key: "sk-..."
      models: ["gpt-4", "gpt-3.5-turbo"]
      capabilities:
        max_context_length: 8192
        max_tokens_per_second: 200
      performance:
        avg_latency_ms: 200
        cost_per_1k_tokens: 0.002

  # Routing strategies
  routing_strategies:
    - name: "latency-optimized"
      priority: 100
      conditions:
        - type: "latency_requirement"
          max_latency_ms: 200
      actions:
        - route_to_cluster: "lowest_latency"
    - name: "cost-optimized"
      priority: 90
      conditions:
        - type: "cost_sensitivity"
          max_cost_per_1k_tokens: 0.0015
      actions:
        - route_to_cluster: "lowest_cost"
    - name: "compliance-routing"
      priority: 200
      conditions:
        - type: "data_residency"
          required_region: "eu-west"
      actions:
        - route_to_cluster: "eu-west-cluster"

  # Fault tolerance
  fault_tolerance:
    circuit_breaker:
      failure_threshold: 5
      timeout: "30s"
      max_requests: 10
    retry_policy:
      max_retries: 3
      backoff_multiplier: 2
      max_backoff: "10s"
    fallback_strategy: "next_best_cluster"
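The `circuit_breaker` settings in the fault-tolerance section map to standard circuit-breaker behavior: open the breaker after `failure_threshold` consecutive failures, then allow a half-open probe once `timeout` has elapsed. A minimal sketch of that state machine (the class and its method names are illustrative, not an existing API):

```python
import time

class CircuitBreaker:
    """Open after `failure_threshold` consecutive failures; permit a retry
    probe once `timeout` seconds have passed since the breaker opened."""
    def __init__(self, failure_threshold=5, timeout=30.0):
        self.failure_threshold = failure_threshold
        self.timeout = timeout
        self.failures = 0
        self.opened_at = None  # None means the breaker is closed

    def allow_request(self, now=None):
        now = time.monotonic() if now is None else now
        if self.opened_at is None:
            return True
        # Open: only let a request through once the timeout has elapsed (half-open)
        return (now - self.opened_at) >= self.timeout

    def record_success(self):
        self.failures = 0
        self.opened_at = None

    def record_failure(self, now=None):
        now = time.monotonic() if now is None else now
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self.opened_at = now
```

When the breaker is open, the `fallback_strategy` ("next_best_cluster") would redirect the request to the runner-up cluster from the routing decision.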
API Endpoints
```
// Cluster management
GET    /api/v1/clusters                  // List all clusters
GET    /api/v1/clusters/{id}             // Get cluster details
POST   /api/v1/clusters                  // Add new cluster
PUT    /api/v1/clusters/{id}             // Update cluster
DELETE /api/v1/clusters/{id}             // Remove cluster

// Cluster health and monitoring
GET    /api/v1/clusters/{id}/health      // Get cluster health
GET    /api/v1/clusters/{id}/metrics     // Get cluster metrics
POST   /api/v1/clusters/{id}/test        // Test cluster connectivity

// Inter-cluster routing
POST   /api/v1/route/inter-cluster       // Route request across clusters
GET    /api/v1/route/explain             // Explain routing decision
GET    /api/v1/route/options             // Get available routing options

// Provider management
GET    /api/v1/providers                 // List supported providers
POST   /api/v1/providers/register        // Register new provider type
```
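As a usage sketch, a client could hit these endpoints with plain HTTP. The base URL and request payload shape below are assumptions; only the paths come from the endpoint list.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8080"  # assumed router address

def build_request(method, path, body=None):
    """Build (but do not send) an HTTP request against one of the endpoints."""
    data = json.dumps(body).encode("utf-8") if body is not None else None
    return urllib.request.Request(
        BASE_URL + path, data=data, method=method,
        headers={"Content-Type": "application/json"})

# Route a request across clusters; the payload fields are hypothetical.
req = build_request("POST", "/api/v1/route/inter-cluster",
                    {"prompt": "Summarize this contract", "max_latency_ms": 200})
```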
Integration with Existing Features
This inter-cluster routing system should integrate seamlessly with the existing configurable routing rules system, allowing for complex routing logic that combines:
- Semantic classification (existing)
- Custom routing rules (existing)
- Inter-cluster routing (new)
- Performance optimization (new)
- Cost optimization (new)
- Compliance routing (new)
Labels: enhancement, routing, multi-cluster, hybrid-cloud, scalability, fault-tolerance
Priority: P1
Milestone: v0.3
Dependencies: Configurable Routing Rules System (v0.2)