Feature Request: Multi-Cloud and Hybrid Cloud Routing Support #196

@wangchen615

Description

Feature Request: Inter-Cluster and Hybrid Cloud Routing Support

Is your feature request related to a problem? Please describe.

The current semantic router is designed for intra-cluster routing within a single deployment, but production environments increasingly require routing across multiple clusters and cloud providers. The existing architecture lacks support for:

  • Multi-cloud routing: Cannot route between on-premises deployments and external model providers (OpenAI, Anthropic Claude, xAI Grok, etc.)
  • Cross-cluster routing: No ability to route among clusters with different model availability due to licensing, traffic limits, or GPU constraints
  • Performance-based cluster selection: Cannot route to clusters based on latency, context window capacity, or other performance characteristics
  • Hybrid cloud scenarios: No support for routing between private and public cloud deployments
  • Fault tolerance: No failover mechanisms when primary clusters become unavailable
  • Cost optimization: Cannot route based on cost differences between clusters or providers

This limitation prevents the semantic router from being used in enterprise environments where models are distributed across multiple clusters, regions, or cloud providers for reasons of compliance, performance, cost, or availability.

Describe the solution you'd like

I want a comprehensive inter-cluster routing system that extends the current semantic router to support routing across multiple clusters, cloud providers, and deployment environments:

Core Requirements

  • Multi-cluster discovery: Automatic discovery and health monitoring of available clusters
  • Provider abstraction: Unified interface for routing to different model providers (vLLM, OpenAI, Claude, etc.)
  • Cross-cluster routing: Ability to route requests to the most appropriate cluster based on multiple criteria
  • Fault tolerance: Automatic failover and circuit breaker patterns for cluster failures
  • Performance optimization: Route based on latency, throughput, context window, and other performance metrics
  • Cost-aware routing: Route based on cost differences between clusters and providers
  • Compliance routing: Route based on data residency, security requirements, and regulatory constraints

Cluster Management

  • Cluster registry: Central registry of available clusters with metadata (location, capabilities, costs, etc.)
  • Health monitoring: Continuous health checks and performance monitoring for all clusters
  • Dynamic discovery: Support for service discovery mechanisms (Consul, etcd, Kubernetes services)
  • Cluster metadata: Rich metadata about each cluster (models available, performance characteristics, costs, compliance zones)
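The cluster registry described above could be sketched as a small in-memory index over cluster metadata. Everything here (the `ClusterInfo` fields, `clusters_with_model`, etc.) is a hypothetical illustration of the shape such a registry might take, not an existing router API:

```python
# Hypothetical cluster-registry sketch; field and method names are
# illustrative, not part of the current semantic router.
from dataclasses import dataclass, field

@dataclass
class ClusterInfo:
    name: str
    location: str
    models: list
    cost_per_1k_tokens: float = 0.0
    compliance: list = field(default_factory=list)
    healthy: bool = True  # updated by the health-monitoring loop

class ClusterRegistry:
    def __init__(self):
        self._clusters = {}

    def register(self, cluster: ClusterInfo):
        self._clusters[cluster.name] = cluster

    def deregister(self, name: str):
        self._clusters.pop(name, None)

    def clusters_with_model(self, model: str):
        # Only healthy clusters that actually serve the model qualify.
        return [c for c in self._clusters.values()
                if c.healthy and model in c.models]
```

A dynamic-discovery backend (Kubernetes, Consul, etcd) would simply call `register`/`deregister` as endpoints appear and disappear.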

Routing Strategies

  • Latency-based routing: Route to the cluster with lowest latency for the requesting region
  • Load-based routing: Distribute load across clusters based on current capacity
  • Cost-optimized routing: Route to the most cost-effective cluster for the request type
  • Compliance-based routing: Route based on data residency and regulatory requirements
  • Model-specific routing: Route to clusters that have specific models available
  • Context-aware routing: Route to clusters with appropriate context window capacity
  • Hybrid routing: Combine multiple strategies with configurable weights
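The hybrid strategy above can be sketched as a weighted score over normalized per-cluster metrics; lower-is-better metrics (latency, cost) are inverted so every term rewards higher values. The field and weight names are assumptions for illustration:

```python
# Sketch of hybrid routing: combine normalized metrics with configurable
# weights. Metric names (avg_latency_ms, load, ...) are assumed, not fixed.
def hybrid_score(cluster, weights):
    latency_score = 1.0 / (1.0 + cluster["avg_latency_ms"])
    cost_score = 1.0 / (1.0 + cluster["cost_per_1k_tokens"] * 1000)
    capacity_score = 1.0 - cluster["load"]  # current load in [0, 1]
    return (weights["latency"] * latency_score
            + weights["cost"] * cost_score
            + weights["load"] * capacity_score)

def pick_cluster(clusters, weights):
    # Highest combined score wins; ties resolve to the first candidate.
    return max(clusters, key=lambda c: hybrid_score(c, weights))
```

Setting a single weight to 1 and the rest to 0 recovers the pure latency-, cost-, or load-based strategies as special cases.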

Provider Support

  • vLLM clusters: Native support for vLLM deployments across multiple clusters
  • OpenAI API: Support for OpenAI's API with rate limiting and cost tracking
  • Anthropic Claude: Integration with Claude API for specific use cases
  • Grok API: Support for xAI's Grok API
  • Custom providers: Plugin architecture for adding new model providers
  • Provider-specific features: Leverage unique capabilities of each provider
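The plugin architecture might center on a small provider interface that each backend implements, with a type-name registry for configuration-driven lookup. This is a sketch under assumed names (`ModelProvider`, `completions_url`, etc.), not the project's actual interface:

```python
# Hypothetical provider-abstraction sketch for the plugin architecture.
from abc import ABC, abstractmethod

class ModelProvider(ABC):
    @abstractmethod
    def list_models(self) -> list: ...

    @abstractmethod
    def completions_url(self) -> str: ...

class VLLMProvider(ModelProvider):
    def __init__(self, endpoint, models):
        self.endpoint, self.models = endpoint, models

    def list_models(self):
        return self.models

    def completions_url(self):
        # vLLM exposes an OpenAI-compatible path on the cluster endpoint.
        return f"{self.endpoint}/v1/completions"

class OpenAIProvider(ModelProvider):
    def __init__(self, models):
        self.models = models

    def list_models(self):
        return self.models

    def completions_url(self):
        return "https://api.openai.com/v1/chat/completions"

# New provider types register here, keyed by the config's "type" field.
PROVIDER_TYPES = {"vllm": VLLMProvider, "openai": OpenAIProvider}
```

Provider-specific concerns (authentication, rate limiting, cost tracking) would hang off the same interface as additional methods.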

Describe alternatives you've considered

  1. API Gateway approach:
    • Pros: Leverages existing API gateway solutions
    • Cons: Limited semantic routing capabilities, no model-specific optimizations
  2. Service mesh integration:
    • Pros: Built-in load balancing and failover
    • Cons: No semantic awareness, limited routing intelligence
  3. Custom load balancer:
    • Pros: Full control over routing logic
    • Cons: Significant development effort, reinventing existing solutions
  4. Provider-specific solutions:
    • Pros: Optimized for each provider
    • Cons: Fragmented approach, no unified interface

Chosen approach: Extend the existing semantic router with inter-cluster capabilities, providing a unified interface while leveraging the existing semantic classification and routing intelligence.

Additional context

Current Architecture Limitations

The existing semantic router operates within a single cluster and cannot:

  • Route across multiple deployments
  • Handle provider-specific authentication and rate limiting
  • Optimize for cross-cluster performance characteristics
  • Provide fault tolerance across cluster boundaries
  • Support compliance requirements that span multiple regions

Inter-Cluster Routing Benefits

By adding inter-cluster support, the semantic router can:

  • Scale horizontally: Distribute load across multiple clusters
  • Improve availability: Provide fault tolerance through cluster redundancy
  • Optimize costs: Route to the most cost-effective cluster for each request
  • Meet compliance: Route based on data residency and regulatory requirements
  • Leverage provider strengths: Use different providers for different use cases

Use Case Examples

1. On-Premises + Cloud Hybrid

```yaml
clusters:
  - name: "on-prem-gpu-cluster"
    type: "vllm"
    location: "us-west-2"
    models: ["llama-2-70b", "codellama-34b"]
    compliance: ["hipaa", "sox"]
    cost_per_1k_tokens: 0.001

  - name: "openai-cloud"
    type: "openai"
    location: "us-west-2"
    models: ["gpt-4", "gpt-3.5-turbo"]
    compliance: ["soc2"]
    cost_per_1k_tokens: 0.002
```
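For this hybrid scenario, the routing decision could work as "filter by compliance first, then pick the cheapest eligible cluster". A minimal sketch of that logic, using the two clusters above (function and field names are illustrative):

```python
# Compliance-first, cost-second selection over the hybrid clusters above.
CLUSTERS = [
    {"name": "on-prem-gpu-cluster", "compliance": ["hipaa", "sox"],
     "cost_per_1k_tokens": 0.001},
    {"name": "openai-cloud", "compliance": ["soc2"],
     "cost_per_1k_tokens": 0.002},
]

def route(clusters, required_compliance=None):
    # Compliance is a hard constraint; cost only breaks ties among
    # eligible clusters.
    eligible = [c for c in clusters
                if required_compliance is None
                or required_compliance in c["compliance"]]
    if not eligible:
        return None
    return min(eligible, key=lambda c: c["cost_per_1k_tokens"])["name"]
```

A HIPAA-constrained request can only land on-premises, while an unconstrained request also goes on-premises simply because it is cheaper.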

2. Multi-Region Deployment

```yaml
clusters:
  - name: "us-east-cluster"
    type: "vllm"
    location: "us-east-1"
    latency_zones: ["us-east", "us-central"]
    models: ["llama-2-70b", "mistral-7b"]

  - name: "eu-west-cluster"
    type: "vllm"
    location: "eu-west-1"
    latency_zones: ["eu-west", "eu-central"]
    models: ["llama-2-70b", "mistral-7b"]
    compliance: ["gdpr"]
```

3. Model-Specific Routing

```yaml
routing_rules:
  - name: "code-generation-routing"
    conditions:
      - category: "code_generation"
        confidence: 0.8
    actions:
      - route_to_cluster: "code-specialized-cluster"
        models: ["codellama-34b", "gpt-4-code"]

  - name: "general-purpose-routing"
    conditions:
      - category: "general"
    actions:
      - route_to_cluster: "general-purpose-cluster"
        models: ["llama-2-70b", "gpt-3.5-turbo"]
```
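Evaluating these rules amounts to first-match-wins over the semantic classifier's (category, confidence) output. A minimal evaluator for the schema sketched above (the schema itself is still a proposal):

```python
# First-match-wins evaluator for the routing_rules sketch above.
def match_rule(rules, category, confidence):
    for rule in rules:
        for cond in rule["conditions"]:
            # A missing confidence threshold means "any confidence".
            if (cond["category"] == category
                    and confidence >= cond.get("confidence", 0.0)):
                return rule["actions"][0]["route_to_cluster"]
    return None  # no rule matched; caller falls back to default routing

RULES = [
    {"name": "code-generation-routing",
     "conditions": [{"category": "code_generation", "confidence": 0.8}],
     "actions": [{"route_to_cluster": "code-specialized-cluster"}]},
    {"name": "general-purpose-routing",
     "conditions": [{"category": "general"}],
     "actions": [{"route_to_cluster": "general-purpose-cluster"}]},
]
```

A low-confidence `code_generation` classification falls through the first rule without matching the second, so the caller must still handle the no-match case.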

Technical Requirements

  • Performance: Inter-cluster routing decisions must complete within 100ms
  • Reliability: 99.9% uptime with automatic failover
  • Scalability: Support for 100+ clusters and 10,000+ requests/second
  • Security: Secure authentication and encryption for cross-cluster communication
  • Monitoring: Comprehensive metrics and alerting for cluster health and routing decisions

Implementation Priority

This feature should be prioritized as P1 because it's essential for enterprise adoption where models are distributed across multiple clusters, regions, and providers.

Example Inter-Cluster Configuration

```yaml
apiVersion: vllm.ai/v1alpha1
kind: InterClusterRoute
metadata:
  name: multi-cluster-routing
spec:
  # Cluster discovery and management
  cluster_discovery:
    enabled: true
    method: "kubernetes"  # Options: "kubernetes", "consul", "static"
    refresh_interval: "30s"

  # Provider configurations
  providers:
    - name: "vllm-cluster-1"
      type: "vllm"
      endpoint: "https://cluster1.internal:8000"
      authentication:
        type: "bearer"
        token: "secret-token"
      models: ["llama-2-70b", "mistral-7b"]
      capabilities:
        max_context_length: 4096
        max_tokens_per_second: 100
      performance:
        avg_latency_ms: 150
        cost_per_1k_tokens: 0.001

    - name: "openai-prod"
      type: "openai"
      endpoint: "https://api.openai.com/v1"
      authentication:
        type: "api_key"
        key: "sk-..."
      models: ["gpt-4", "gpt-3.5-turbo"]
      capabilities:
        max_context_length: 8192
        max_tokens_per_second: 200
      performance:
        avg_latency_ms: 200
        cost_per_1k_tokens: 0.002

  # Routing strategies
  routing_strategies:
    - name: "latency-optimized"
      priority: 100
      conditions:
        - type: "latency_requirement"
          max_latency_ms: 200
      actions:
        - route_to_cluster: "lowest_latency"

    - name: "cost-optimized"
      priority: 90
      conditions:
        - type: "cost_sensitivity"
          max_cost_per_1k_tokens: 0.0015
      actions:
        - route_to_cluster: "lowest_cost"

    - name: "compliance-routing"
      priority: 200
      conditions:
        - type: "data_residency"
          required_region: "eu-west"
      actions:
        - route_to_cluster: "eu-west-cluster"

  # Fault tolerance
  fault_tolerance:
    circuit_breaker:
      failure_threshold: 5
      timeout: "30s"
      max_requests: 10
    retry_policy:
      max_retries: 3
      backoff_multiplier: 2
      max_backoff: "10s"
    fallback_strategy: "next_best_cluster"
```
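The `circuit_breaker` settings above (`failure_threshold: 5`, `timeout: "30s"`) could drive a per-cluster breaker like the following sketch; a production implementation would also need a proper half-open state that limits probe traffic (the configured `max_requests`), which is omitted here:

```python
# Minimal per-cluster circuit breaker matching the config sketch above.
import time

class CircuitBreaker:
    def __init__(self, failure_threshold=5, timeout_s=30.0,
                 clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.timeout_s = timeout_s
        self.clock = clock  # injectable for testing
        self.failures = 0
        self.opened_at = None  # None means the breaker is closed

    def record_failure(self):
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self.opened_at = self.clock()  # trip the breaker

    def record_success(self):
        self.failures = 0
        self.opened_at = None

    def allow_request(self):
        if self.opened_at is None:
            return True
        # After the timeout, close again and let traffic probe recovery.
        if self.clock() - self.opened_at >= self.timeout_s:
            self.opened_at = None
            self.failures = 0
            return True
        return False
```

While a cluster's breaker is open, the router would apply the configured `fallback_strategy` and send the request to the next-best cluster instead.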

API Endpoints

```text
// Cluster management
GET    /api/v1/clusters                    // List all clusters
GET    /api/v1/clusters/{id}               // Get cluster details
POST   /api/v1/clusters                    // Add new cluster
PUT    /api/v1/clusters/{id}               // Update cluster
DELETE /api/v1/clusters/{id}               // Remove cluster

// Cluster health and monitoring
GET    /api/v1/clusters/{id}/health        // Get cluster health
GET    /api/v1/clusters/{id}/metrics       // Get cluster metrics
POST   /api/v1/clusters/{id}/test          // Test cluster connectivity

// Inter-cluster routing
POST   /api/v1/route/inter-cluster         // Route request across clusters
GET    /api/v1/route/explain               // Explain routing decision
GET    /api/v1/route/options               // Get available routing options

// Provider management
GET    /api/v1/providers                   // List supported providers
POST   /api/v1/providers/register          // Register new provider type
```

Integration with Existing Features

This inter-cluster routing system should integrate seamlessly with the existing configurable routing rules system, allowing for complex routing logic that combines:

  • Semantic classification (existing)
  • Custom routing rules (existing)
  • Inter-cluster routing (new)
  • Performance optimization (new)
  • Cost optimization (new)
  • Compliance routing (new)

Labels: enhancement, routing, multi-cluster, hybrid-cloud, scalability, fault-tolerance
Priority: P1
Milestone: v0.3
Dependencies: Configurable Routing Rules System (v0.2)
