Summary
This RFC proposes implementing a dedicated gateway plugin for the /v1/embeddings endpoint (Option 4) as a long-term architectural solution, building upon the analysis in #4.
Context
As discussed in #4, there are multiple approaches to implementing embeddings support:
- Option 1: Extend current chat completions gateway (not suitable - different API patterns)
- Option 2: Route through metadata service (MVP approach - limited functionality)
- Option 3: Direct routing with extra headers (poor UX)
- Option 4: Dedicated embeddings gateway plugin (this proposal)
While Option 2 provides a quick MVP, embeddings workloads have fundamentally different characteristics from chat completions that justify a dedicated gateway plugin.
Motivation for Dedicated Gateway Plugin
Key Differences from Chat Completions
| Aspect | Chat Completions | Embeddings |
|--------|-----------------|------------|
| Input Pattern | Sequential conversation | Batch text processing |
| Output | Streaming text generation | Fixed-size vectors |
| Latency Profile | Variable (token generation) | Predictable |
| Memory Usage | Dynamic | Deterministic |
| Caching Strategy | Prefix-based | Input hash-based |
| Routing Needs | SLO, load, prefix cache | Batch-aware, memory-based |
Benefits of Dedicated Plugin
- Optimized Routing Algorithms
  - Batch-size aware routing
  - Memory-based pod selection
  - Simplified algorithms (no streaming/SLO complexity)
- Clean Architecture
  - Separate concerns between generation and embedding
  - Independent scaling policies
  - Cleaner codebase without conditional logic
- Performance Optimization
  - Embedding-specific request batching
  - Optimized memory allocation
  - Dedicated connection pooling
- Future Extensibility
  - Multi-modal embeddings support
  - Custom embedding strategies
  - Advanced caching mechanisms
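To make the "input hash-based" caching strategy from the comparison table concrete, here is a minimal sketch of what a `cache.go` interface could look like. All names (`Key`, `EmbeddingCache`) and the in-memory map backing are illustrative assumptions, not the actual implementation:

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"sync"
)

// Key derives a deterministic cache key from the model name and input text.
// Identical (model, input) pairs always hash to the same key, which is what
// makes exact-match caching viable for embeddings, unlike chat completions
// where only prefix reuse is possible.
func Key(model, input string) string {
	h := sha256.New()
	h.Write([]byte(model))
	h.Write([]byte{0}) // separator so (model, input) boundaries are unambiguous
	h.Write([]byte(input))
	return hex.EncodeToString(h.Sum(nil))
}

// EmbeddingCache is a minimal in-memory cache keyed by input hash.
// A production version would add TTLs, size bounds, and eviction.
type EmbeddingCache struct {
	mu      sync.RWMutex
	entries map[string][]float32
}

func NewEmbeddingCache() *EmbeddingCache {
	return &EmbeddingCache{entries: make(map[string][]float32)}
}

func (c *EmbeddingCache) Get(key string) ([]float32, bool) {
	c.mu.RLock()
	defer c.mu.RUnlock()
	v, ok := c.entries[key]
	return v, ok
}

func (c *EmbeddingCache) Put(key string, vec []float32) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.entries[key] = vec
}

func main() {
	c := NewEmbeddingCache()
	k := Key("text-embedding-3-small", "hello world")
	c.Put(k, []float32{0.1, 0.2, 0.3})
	vec, ok := c.Get(k)
	fmt.Println(ok, len(vec)) // true 3
}
```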
Proposed Architecture
Component Structure
pkg/plugins/gateway/embeddings/
├── gateway.go # Main server implementation
├── req_body.go # Request handling
├── types.go # Embeddings-specific types
├── router.go # Routing algorithms
├── metrics.go # Prometheus metrics
└── cache.go # Embedding cache interface
Routing Algorithms
- Least Memory: Route to pod with most available memory
- Batch Affinity: Route similar batch sizes to same pods
- Round Robin: Simple distribution for uniform workloads
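The Least Memory strategy above can be sketched in a few lines. The `Pod` struct and the memory field are illustrative; a real `router.go` would pull available memory from pod metrics rather than a static field:

```go
package main

import "fmt"

// Pod holds the per-pod state the embeddings router needs.
type Pod struct {
	Name            string
	AvailableMemory int64 // bytes; in practice scraped from pod metrics
}

// LeastMemory routes to the pod with the most available memory,
// implementing the "Least Memory" strategy. Embeddings memory usage is
// deterministic per batch, so free memory is a reliable routing signal.
func LeastMemory(pods []Pod) (Pod, error) {
	if len(pods) == 0 {
		return Pod{}, fmt.Errorf("no pods available")
	}
	best := pods[0]
	for _, p := range pods[1:] {
		if p.AvailableMemory > best.AvailableMemory {
			best = p
		}
	}
	return best, nil
}

func main() {
	pods := []Pod{
		{Name: "pod-a", AvailableMemory: 2 << 30},
		{Name: "pod-b", AvailableMemory: 8 << 30},
		{Name: "pod-c", AvailableMemory: 4 << 30},
	}
	p, _ := LeastMemory(pods)
	fmt.Println(p.Name) // pod-b
}
```

Batch Affinity and Round Robin would slot in behind the same selection signature, which also keeps the door open for the per-model configurability raised in the discussion questions.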
Deployment Architecture
# Separate deployment for embeddings gateway
apiVersion: apps/v1
kind: Deployment
metadata:
  name: aibrix-embeddings-gateway
spec:
  replicas: 3  # Independent scaling
  selector:
    matchLabels:
      app: embeddings-gateway
  template:
    metadata:
      labels:
        app: embeddings-gateway
    spec:
      containers:
        - name: embeddings-gateway
          ports:
            - containerPort: 50053  # Different port
Implementation Plan
Phase 1: Core Gateway Structure
- Create embeddings gateway base structure
- Implement request/response handling
- Add batch validation logic
- Basic routing (round-robin)
Phase 2: Advanced Routing
- Implement memory-based routing
- Add batch affinity routing
- Create routing benchmarks
Phase 3: Integration
- Update Envoy configuration
- Add Kubernetes manifests
- Integrate with existing cache
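For the Envoy update in Phase 3, the split could look roughly like the fragment below. This is a hypothetical sketch: the cluster names are invented here, and the actual integration may wire the gateway in as an ext_proc filter rather than a plain upstream cluster.

```yaml
# Hypothetical: send /v1/embeddings to the dedicated gateway while all
# other traffic keeps flowing to the existing chat completions gateway.
route_config:
  virtual_hosts:
    - name: aibrix
      domains: ["*"]
      routes:
        - match:
            path: "/v1/embeddings"
          route:
            cluster: embeddings-gateway        # new cluster, port 50053
        - match:
            prefix: "/"
          route:
            cluster: chat-completions-gateway  # existing cluster
```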
Phase 4: Production Features
- Prometheus metrics
- Request batching optimization
- Circuit breaker patterns
- Comprehensive testing
Testing Requirements
- Unit tests for all routing algorithms
- Integration tests with multiple embedding models
- Chaos testing for failure scenarios
Success Metrics
- Independent scaling without affecting chat completions
- Clean separation of concerns in codebase
References
- Original embeddings issue: [API] Implement /v1/embeddings endpoint for OpenAI-compatible embeddings support #4
- OpenAI Embeddings API: https://platform.openai.com/docs/api-reference/embeddings
- Current gateway architecture: docs/source/gateways.md
Questions for Discussion
- Should we support embedding-specific caching in Phase 1?
- What batch size limits should we enforce?
- How should we handle multi-modal embeddings in the future?
- Should routing algorithms be configurable per model?
cc: @maintainers