Summary
This RFC proposes implementing a dedicated gateway plugin for the /v1/embeddings endpoint (Option 4) as a long-term architectural solution, building upon the analysis in #4.
Context
As discussed in #4, there are multiple approaches to implementing embeddings support:
- Option 1: Extend current chat completions gateway (not suitable - different API patterns)
- Option 2: Route through metadata service (MVP approach - limited functionality)
- Option 3: Direct routing with extra headers (poor UX)
- Option 4: Dedicated embeddings gateway plugin (this proposal)
While Option 2 provides a quick MVP, embeddings workloads have fundamentally different characteristics from chat completions that justify a dedicated gateway plugin.
Motivation for Dedicated Gateway Plugin
Key Differences from Chat Completions
| Aspect | Chat Completions | Embeddings |
|--------|-----------------|------------|
| Input Pattern | Sequential conversation | Batch text processing |
| Output | Streaming text generation | Fixed-size vectors |
| Latency Profile | Variable (token generation) | Predictable |
| Memory Usage | Dynamic | Deterministic |
| Caching Strategy | Prefix-based | Input hash-based |
| Routing Needs | SLO, load, prefix cache | Batch-aware, memory-based |
Benefits of Dedicated Plugin
- Optimized Routing Algorithms
  - Batch-size aware routing
  - Memory-based pod selection
  - Simplified algorithms (no streaming/SLO complexity)
- Clean Architecture
  - Separate concerns between generation and embedding
  - Independent scaling policies
  - Cleaner codebase without conditional logic
- Performance Optimization
  - Embedding-specific request batching
  - Optimized memory allocation
  - Dedicated connection pooling
- Future Extensibility
  - Multi-modal embeddings support
  - Custom embedding strategies
  - Advanced caching mechanisms
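To make the "input hash-based" caching strategy from the comparison table concrete, here is a minimal sketch of what a `cache.go` interface could look like. All names (`Key`, `EmbeddingCache`) and the in-memory map backing are illustrative assumptions, not the actual implementation:

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"sync"
)

// Key derives a deterministic cache key from the model name and input text.
// Identical (model, input) pairs always hash to the same key, which is what
// makes exact-match caching viable for embeddings, unlike chat completions
// where only prefix reuse is possible.
func Key(model, input string) string {
	h := sha256.New()
	h.Write([]byte(model))
	h.Write([]byte{0}) // separator so (model, input) boundaries are unambiguous
	h.Write([]byte(input))
	return hex.EncodeToString(h.Sum(nil))
}

// EmbeddingCache is a minimal in-memory cache keyed by input hash.
// A production version would add TTLs, size bounds, and eviction.
type EmbeddingCache struct {
	mu      sync.RWMutex
	entries map[string][]float32
}

func NewEmbeddingCache() *EmbeddingCache {
	return &EmbeddingCache{entries: make(map[string][]float32)}
}

func (c *EmbeddingCache) Get(key string) ([]float32, bool) {
	c.mu.RLock()
	defer c.mu.RUnlock()
	v, ok := c.entries[key]
	return v, ok
}

func (c *EmbeddingCache) Put(key string, vec []float32) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.entries[key] = vec
}

func main() {
	c := NewEmbeddingCache()
	k := Key("text-embedding-3-small", "hello world")
	c.Put(k, []float32{0.1, 0.2, 0.3})
	vec, ok := c.Get(k)
	fmt.Println(ok, len(vec)) // true 3
}
```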
Proposed Architecture
Component Structure
pkg/plugins/gateway/embeddings/
├── gateway.go # Main server implementation
├── req_body.go # Request handling
├── types.go # Embeddings-specific types
├── router.go # Routing algorithms
├── metrics.go # Prometheus metrics
└── cache.go # Embedding cache interface
Routing Algorithms
- Least Memory: Route to pod with most available memory
- Batch Affinity: Route similar batch sizes to same pods
- Round Robin: Simple distribution for uniform workloads
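The Least Memory strategy above can be sketched in a few lines. The `Pod` struct and the memory field are illustrative; a real `router.go` would pull available memory from pod metrics rather than a static field:

```go
package main

import "fmt"

// Pod holds the per-pod state the embeddings router needs.
type Pod struct {
	Name            string
	AvailableMemory int64 // bytes; in practice scraped from pod metrics
}

// LeastMemory routes to the pod with the most available memory,
// implementing the "Least Memory" strategy. Embeddings memory usage is
// deterministic per batch, so free memory is a reliable routing signal.
func LeastMemory(pods []Pod) (Pod, error) {
	if len(pods) == 0 {
		return Pod{}, fmt.Errorf("no pods available")
	}
	best := pods[0]
	for _, p := range pods[1:] {
		if p.AvailableMemory > best.AvailableMemory {
			best = p
		}
	}
	return best, nil
}

func main() {
	pods := []Pod{
		{Name: "pod-a", AvailableMemory: 2 << 30},
		{Name: "pod-b", AvailableMemory: 8 << 30},
		{Name: "pod-c", AvailableMemory: 4 << 30},
	}
	p, _ := LeastMemory(pods)
	fmt.Println(p.Name) // pod-b
}
```

Batch Affinity and Round Robin would slot in behind the same selection signature, which also keeps the door open for the per-model configurability raised in the discussion questions.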
Deployment Architecture
# Separate deployment for embeddings gateway
apiVersion: apps/v1
kind: Deployment
metadata:
  name: aibrix-embeddings-gateway
spec:
  replicas: 3  # Independent scaling
  selector:
    matchLabels:
      app: embeddings-gateway
  template:
    metadata:
      labels:
        app: embeddings-gateway
    spec:
      containers:
        - name: embeddings-gateway
          ports:
            - containerPort: 50053  # Different port
Implementation Plan
Phase 1: Core Gateway Structure
- Create embeddings gateway base structure
- Implement request/response handling
- Add batch validation logic
- Basic routing (round-robin)
Phase 2: Advanced Routing
- Implement memory-based routing
- Add batch affinity routing
- Create routing benchmarks
Phase 3: Integration
- Update Envoy configuration
- Add Kubernetes manifests
- Integrate with existing cache
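For the Envoy update in Phase 3, the split could look roughly like the fragment below. This is a hypothetical sketch: the cluster names are invented here, and the actual integration may wire the gateway in as an ext_proc filter rather than a plain upstream cluster.

```yaml
# Hypothetical: send /v1/embeddings to the dedicated gateway while all
# other traffic keeps flowing to the existing chat completions gateway.
route_config:
  virtual_hosts:
    - name: aibrix
      domains: ["*"]
      routes:
        - match:
            path: "/v1/embeddings"
          route:
            cluster: embeddings-gateway        # new cluster, port 50053
        - match:
            prefix: "/"
          route:
            cluster: chat-completions-gateway  # existing cluster
```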
Phase 4: Production Features
- Prometheus metrics
- Request batching optimization
- Circuit breaker patterns
- Comprehensive testing
Testing Requirements
- Unit tests for all routing algorithms
- Integration tests with multiple embedding models
- Chaos testing for failure scenarios
Success Metrics
- Independent scaling without affecting chat completions
- Clean separation of concerns in codebase
References
- Original embeddings issue: [API] Implement /v1/embeddings endpoint for OpenAI-compatible embeddings support #4
- OpenAI Embeddings API: https://platform.openai.com/docs/api-reference/embeddings
- Current gateway architecture: docs/source/gateways.md
Questions for Discussion
- Should we support embedding-specific caching in Phase 1?
- What batch size limits should we enforce?
- How should we handle multi-modal embeddings in the future?
- Should routing algorithms be configurable per model?
cc: @maintainers