[gateway] RFC: Dedicated embeddings gateway plugin architecture for /v1/embeddings #6

@dittops

Summary

This RFC proposes implementing a dedicated gateway plugin for the /v1/embeddings endpoint (Option 4) as a long-term architectural solution, building upon the analysis in #4.

Context

As discussed in #4, there are multiple approaches to implementing embeddings support:

  1. Option 1: Extend current chat completions gateway (not suitable - different API patterns)
  2. Option 2: Route through metadata service (MVP approach - limited functionality)
  3. Option 3: Direct routing with extra headers (poor UX)
  4. Option 4: Dedicated embeddings gateway plugin (this proposal)

While Option 2 provides a quick MVP, embeddings workloads have fundamentally different characteristics from chat completions that justify a dedicated gateway plugin.

Motivation for Dedicated Gateway Plugin

Key Differences from Chat Completions

| Aspect | Chat Completions | Embeddings |
|--------|-----------------|------------|
| Input Pattern | Sequential conversation | Batch text processing |
| Output | Streaming text generation | Fixed-size vectors |
| Latency Profile | Variable (token generation) | Predictable |
| Memory Usage | Dynamic | Deterministic |
| Caching Strategy | Prefix-based | Input hash-based |
| Routing Needs | SLO, load, prefix cache | Batch-aware, memory-based |

Benefits of Dedicated Plugin

  1. Optimized Routing Algorithms

    • Batch-size aware routing
    • Memory-based pod selection
    • Simplified algorithms (no streaming/SLO complexity)
  2. Clean Architecture

    • Separate concerns between generation and embedding
    • Independent scaling policies
    • Cleaner codebase without conditional logic
  3. Performance Optimization

    • Embedding-specific request batching
    • Optimized memory allocation
    • Dedicated connection pooling
  4. Future Extensibility

    • Multi-modal embeddings support
    • Custom embedding strategies
    • Advanced caching mechanisms

Proposed Architecture

Component Structure

pkg/plugins/gateway/embeddings/
├── gateway.go          # Main server implementation
├── req_body.go         # Request handling
├── types.go            # Embeddings-specific types
├── router.go           # Routing algorithms
├── metrics.go          # Prometheus metrics
└── cache.go            # Embedding cache interface
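
To make the "input hash-based" caching idea concrete, here is a minimal sketch of what cache.go could expose. The `EmbeddingCache` interface, the `CacheKey` helper, and the in-memory implementation are all hypothetical names for illustration; the RFC does not define them, and a real implementation would likely plug into the existing cache instead.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"sync"
)

// EmbeddingCache is an assumed shape for the interface in cache.go.
type EmbeddingCache interface {
	Get(key string) ([]float32, bool)
	Put(key string, vec []float32)
}

// CacheKey derives a deterministic key from the model name and input text.
// A zero byte separates the two fields so that ("ab", "c") and ("a", "bc")
// do not hash to the same key.
func CacheKey(model, input string) string {
	h := sha256.New()
	h.Write([]byte(model))
	h.Write([]byte{0})
	h.Write([]byte(input))
	return hex.EncodeToString(h.Sum(nil))
}

// memoryCache is a minimal in-process implementation, for illustration only.
type memoryCache struct {
	mu   sync.RWMutex
	data map[string][]float32
}

func newMemoryCache() *memoryCache {
	return &memoryCache{data: make(map[string][]float32)}
}

func (c *memoryCache) Get(key string) ([]float32, bool) {
	c.mu.RLock()
	defer c.mu.RUnlock()
	v, ok := c.data[key]
	return v, ok
}

func (c *memoryCache) Put(key string, vec []float32) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.data[key] = vec
}

func main() {
	var cache EmbeddingCache = newMemoryCache()
	key := CacheKey("example-model", "hello world")
	cache.Put(key, []float32{0.1, 0.2, 0.3})

	_, hit := cache.Get(CacheKey("example-model", "hello world"))
	_, other := cache.Get(CacheKey("example-model", "hello there"))
	fmt.Println("same input cached:", hit, "different input cached:", other)
}
```

Unlike prefix caching for chat completions, the key here depends on the full input, so only byte-identical inputs for the same model share an entry.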

Routing Algorithms

  • Least Memory: Route to the pod with the least memory in use (most available memory)
  • Batch Affinity: Route similar batch sizes to same pods
  • Round Robin: Simple distribution for uniform workloads
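
The three algorithms above could share one small interface in router.go. The sketch below is only an illustration under assumptions: the `Pod` struct, its `FreeMemoryBytes` field, the `Router` interface, and the power-of-two bucketing used for batch affinity are not specified by this RFC.

```go
package main

import (
	"errors"
	"fmt"
	"sync/atomic"
)

// Pod is a hypothetical view of a backend; field names are assumptions.
type Pod struct {
	Name            string
	FreeMemoryBytes uint64
}

// Router picks a pod for a request carrying batchSize inputs.
type Router interface {
	Route(pods []Pod, batchSize int) (Pod, error)
}

var ErrNoPods = errors.New("no pods available")

// LeastMemory selects the pod with the most free memory.
type LeastMemory struct{}

func (LeastMemory) Route(pods []Pod, _ int) (Pod, error) {
	if len(pods) == 0 {
		return Pod{}, ErrNoPods
	}
	best := pods[0]
	for _, p := range pods[1:] {
		if p.FreeMemoryBytes > best.FreeMemoryBytes {
			best = p
		}
	}
	return best, nil
}

// RoundRobin cycles through the pod list in order.
type RoundRobin struct {
	counter uint64
}

func (r *RoundRobin) Route(pods []Pod, _ int) (Pod, error) {
	if len(pods) == 0 {
		return Pod{}, ErrNoPods
	}
	n := atomic.AddUint64(&r.counter, 1) - 1
	return pods[n%uint64(len(pods))], nil
}

// BatchAffinity groups batch sizes into power-of-two buckets (1, 2-3, 4-7, ...)
// and maps each bucket to a pod. It assumes the pod list order is stable.
type BatchAffinity struct{}

func (BatchAffinity) Route(pods []Pod, batchSize int) (Pod, error) {
	if len(pods) == 0 {
		return Pod{}, ErrNoPods
	}
	bucket := 0
	for n := batchSize; n > 1; n >>= 1 {
		bucket++
	}
	return pods[bucket%len(pods)], nil
}

func main() {
	pods := []Pod{
		{Name: "pod-a", FreeMemoryBytes: 4 << 30},
		{Name: "pod-b", FreeMemoryBytes: 12 << 30},
		{Name: "pod-c", FreeMemoryBytes: 8 << 30},
	}
	for _, r := range []Router{LeastMemory{}, &RoundRobin{}, BatchAffinity{}} {
		p, err := r.Route(pods, 16)
		fmt.Printf("%T -> %s (err=%v)\n", r, p.Name, err)
	}
}
```

Keeping all strategies behind one interface would also make per-model configuration (question 4 below) a matter of selecting a `Router` value.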

Deployment Architecture

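# Separate deployment for embeddings gateway
apiVersion: apps/v1
kind: Deployment
metadata:
  name: aibrix-embeddings-gateway
spec:
  replicas: 3  # Independent scaling
  selector:
    matchLabels:
      app: aibrix-embeddings-gateway  # Assumed label; must match the pod template
  template:
    metadata:
      labels:
        app: aibrix-embeddings-gateway  # Assumed label
    spec:
      containers:
      - name: embeddings-gateway
        # image: not specified in this RFC
        ports:
        - containerPort: 50053  # Different port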

Implementation Plan

Phase 1: Core Gateway Structure

  • Create embeddings gateway base structure
  • Implement request/response handling
  • Add batch validation logic
  • Basic routing (round-robin)
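
As a starting point for the batch validation step, something along these lines might be enough for Phase 1. The `EmbeddingRequest` and `Limits` types and the limit values are placeholders (the actual limits are an open question below), and the sketch assumes the request input has already been normalized to a list of strings.

```go
package main

import (
	"errors"
	"fmt"
)

// Limits holds hypothetical validation limits; real values are undecided.
type Limits struct {
	MaxBatchSize  int
	MaxInputBytes int
}

// EmbeddingRequest is a simplified request with input normalized to []string.
type EmbeddingRequest struct {
	Model string
	Input []string
}

// ValidateBatch rejects requests that are empty, too large, or that contain
// empty or oversized items. It returns the first problem it finds.
func ValidateBatch(req EmbeddingRequest, lim Limits) error {
	if req.Model == "" {
		return errors.New("model must not be empty")
	}
	if len(req.Input) == 0 {
		return errors.New("input must contain at least one item")
	}
	if len(req.Input) > lim.MaxBatchSize {
		return fmt.Errorf("batch size %d exceeds limit %d", len(req.Input), lim.MaxBatchSize)
	}
	for i, s := range req.Input {
		if s == "" {
			return fmt.Errorf("input[%d] is empty", i)
		}
		if len(s) > lim.MaxInputBytes {
			return fmt.Errorf("input[%d] is %d bytes, limit is %d", i, len(s), lim.MaxInputBytes)
		}
	}
	return nil
}

func main() {
	lim := Limits{MaxBatchSize: 2, MaxInputBytes: 16} // placeholder values
	ok := EmbeddingRequest{Model: "example-model", Input: []string{"hello", "world"}}
	tooBig := EmbeddingRequest{Model: "example-model", Input: []string{"a", "b", "c"}}
	fmt.Println(ValidateBatch(ok, lim))
	fmt.Println(ValidateBatch(tooBig, lim))
}
```

Validating in the gateway, before routing, keeps malformed batches from consuming pod memory.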

Phase 2: Advanced Routing

  • Implement memory-based routing
  • Add batch affinity routing
  • Create routing benchmarks

Phase 3: Integration

  • Update Envoy configuration
  • Add Kubernetes manifests
  • Integrate with existing cache

Phase 4: Production Features

  • Prometheus metrics
  • Request batching optimization
  • Circuit breaker patterns
  • Comprehensive testing
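
For the circuit breaker item, one simple option is a consecutive-failure breaker around calls to a backend pod. The sketch below is an assumption about what that could look like, not a decided design: the `Breaker` type, its threshold and cooldown parameters, and the error values are all hypothetical.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
	"time"
)

var (
	ErrOpen    = errors.New("circuit open")
	errBackend = errors.New("backend unavailable") // stand-in backend failure
)

// Breaker opens after `threshold` consecutive failures and rejects calls
// until `cooldown` has elapsed, after which calls are allowed through again.
type Breaker struct {
	mu        sync.Mutex
	threshold int
	cooldown  time.Duration
	failures  int
	openedAt  time.Time
}

func NewBreaker(threshold int, cooldown time.Duration) *Breaker {
	return &Breaker{threshold: threshold, cooldown: cooldown}
}

// Do runs fn unless the breaker is open. A success resets the failure count;
// a failure at or past the threshold (re)starts the cooldown window.
// Simplification: after the cooldown, all concurrent callers are let through.
func (b *Breaker) Do(fn func() error) error {
	b.mu.Lock()
	open := b.failures >= b.threshold && time.Since(b.openedAt) < b.cooldown
	b.mu.Unlock()
	if open {
		return ErrOpen
	}

	err := fn()

	b.mu.Lock()
	defer b.mu.Unlock()
	if err != nil {
		b.failures++
		if b.failures >= b.threshold {
			b.openedAt = time.Now()
		}
		return err
	}
	b.failures = 0
	return nil
}

func main() {
	b := NewBreaker(2, 30*time.Second)
	failing := func() error { return errBackend }
	for i := 0; i < 3; i++ {
		fmt.Println(b.Do(failing))
	}
}
```

In the gateway, one breaker per pod would let the router skip pods that are currently failing rather than retrying them on every request.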

Testing Requirements

  • Unit tests for all routing algorithms
  • Integration tests with multiple embedding models
  • Chaos testing for failure scenarios

Success Metrics

  • Independent scaling without affecting chat completions
  • Clean separation of concerns in codebase

Questions for Discussion

  1. Should we support embedding-specific caching in Phase 1?
  2. What batch size limits should we enforce?
  3. How should we handle multi-modal embeddings in the future?
  4. Should routing algorithms be configurable per model?

cc: @maintainers
