# Ensemble Service Architecture

## Overview

The ensemble orchestration feature is implemented as an independent OpenAI-compatible API server that runs alongside the semantic router. This design provides a clean separation of concerns and allows the ensemble service to scale independently.

## Architecture Diagram

```
┌─────────────┐
│   Client    │
└──────┬──────┘
       │ HTTP Request
       │
       ▼
┌─────────────────────────────┐
│ Semantic Router (Port 8080) │
│  ┌─────────────────────┐    │
│  │ ExtProc (Port 50051)│    │
│  └─────────────────────┘    │
│  ┌─────────────────────┐    │
│  │ API Server          │    │
│  └─────────────────────┘    │
└─────────────┬───────────────┘
              │
              │ (Optional: Route to Ensemble)
              │
              ▼
┌──────────────────────────────┐
│ Ensemble Service (Port 8081) │
│  ┌──────────────────────┐    │
│  │ /v1/chat/completions │    │
│  │ /health              │    │
│  └──────────────────────┘    │
└──────────┬───────────────────┘
           │
           │ Parallel Queries
           │
    ┌──────┴──────┬──────────┬──────────┐
    ▼             ▼          ▼          ▼
┌────────┐  ┌────────┐  ┌────────┐  ┌────────┐
│Model A │  │Model B │  │Model C │  │Model N │
│:8001   │  │:8002   │  │:8003   │  │:800N   │
└────────┘  └────────┘  └────────┘  └────────┘
     │
     │ Responses
     │
     ▼
 Aggregation Engine
 (Voting, Weighted, etc.)
     │
     ▼
 Aggregated Response
```

## Components

### 1. Semantic Router (Existing)

- **ExtProc Server** (Port 50051): Envoy external processor for request/response filtering
- **API Server** (Port 8080): Classification and system prompt APIs
- **Metrics Server** (Port 9190): Prometheus metrics

### 2. Ensemble Service (New)

- **Independent HTTP Server** (Port 8081, configurable)
- **OpenAI-Compatible API**: `/v1/chat/completions` endpoint
- **Health Check**: `/health` endpoint
- **Started Automatically**: When `ensemble.enabled: true` is set in the config

### 3. Model Endpoints

- **Multiple Backends**: Each exposes an OpenAI-compatible API
- **Configured in YAML**: Via `endpoint_mappings`
- **Parallel Queries**: Executed concurrently with semaphore control

## Request Flow

### 1. Direct Ensemble Request

The client queries the ensemble service directly:

```bash
curl -X POST http://localhost:8081/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "x-ensemble-enable: true" \
  -H "x-ensemble-models: model-a,model-b,model-c" \
  -H "x-ensemble-strategy: voting" \
  -d '{"messages":[...]}'
```

**Flow:**

1. Client → Ensemble Service (Port 8081)
2. Ensemble Service → Model Endpoints (Parallel)
3. Model Endpoints → Ensemble Service (Responses)
4. Ensemble Service → Aggregation → Client (Final Response)

### 2. Via Semantic Router (Future Enhancement)

The semantic router could route to the ensemble service based on headers or config:

```bash
curl -X POST http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "x-ensemble-enable: true" \
  -H "x-ensemble-models: model-a,model-b,model-c" \
  -d '{"messages":[...]}'
```

**Flow:**

1. Client → Semantic Router (Port 8080)
2. Router detects the ensemble header → Routes to Ensemble Service
3. Ensemble Service → Model Endpoints (Parallel)
4. Ensemble Service → Router → Client

## Key Design Decisions

### Why an Independent Service?

1. **Clean Separation**: ExtProc is designed for a single downstream endpoint
2. **Scalability**: The ensemble service can be scaled independently
3. **Flexibility**: Can be used standalone or with the semantic router
4. **Simplicity**: Each component has a single, clear responsibility
5. **Maintainability**: Clear boundaries between components

### Port Allocation

| Service | Default Port | Flag | Purpose |
|---------|--------------|------|---------|
| ExtProc | 50051 | `-port` | gRPC ExtProc server |
| API Server | 8080 | `-api-port` | Classification APIs |
| Ensemble | 8081 | `-ensemble-port` | Ensemble orchestration |
| Metrics | 9190 | `-metrics-port` | Prometheus metrics |

### Configuration

The ensemble service reads its configuration from the same `config.yaml`:

```yaml
ensemble:
  enabled: true  # Start the ensemble service
  default_strategy: "voting"
  default_min_responses: 2
  timeout_seconds: 30
  max_concurrent_requests: 10
  endpoint_mappings:
    model-a: "http://localhost:8001/v1/chat/completions"
    model-b: "http://localhost:8002/v1/chat/completions"
    model-c: "http://localhost:8003/v1/chat/completions"
```

## Deployment Scenarios

### Scenario 1: Standalone Ensemble

Deploy only the ensemble service:

```bash
./bin/router -config=config/ensemble-only.yaml
```

Use a config with all other features disabled and only ensemble enabled.

### Scenario 2: Integrated with Semantic Router

Deploy all services together (the default):

```bash
./bin/router -config=config/config.yaml
```

Each service starts based on its enabled flag.

### Scenario 3: Scaled Ensemble

Run multiple ensemble service instances:

```bash
# Instance 1
./bin/router -config=config1.yaml -ensemble-port=8081

# Instance 2
./bin/router -config=config2.yaml -ensemble-port=8082
```

A load balancer distributes requests across the instances.

## API Specification

### POST /v1/chat/completions

An OpenAI-compatible endpoint with ensemble extensions.

#### Request Headers

| Header | Required | Description |
|--------|----------|-------------|
| `x-ensemble-enable` | Yes | Must be `"true"` |
| `x-ensemble-models` | Yes | Comma-separated model names |
| `x-ensemble-strategy` | No | Aggregation strategy (default from config) |
| `x-ensemble-min-responses` | No | Minimum responses required (default from config) |
| `Authorization` | No | Forwarded to model endpoints |

#### Request Body

A standard OpenAI chat completion request:

```json
{
  "model": "ensemble",
  "messages": [
    {"role": "user", "content": "Your question"}
  ]
}
```

#### Response Headers

| Header | Description |
|--------|-------------|
| `x-vsr-ensemble-used` | `"true"` if the ensemble was used |
| `x-vsr-ensemble-models-queried` | Number of models queried |
| `x-vsr-ensemble-responses-received` | Number of successful responses |

#### Response Body

A standard OpenAI chat completion response with aggregated content.

### GET /health

Health check endpoint.

#### Response

```json
{
  "status": "healthy",
  "service": "ensemble"
}
```

## Aggregation Strategies

### Voting

Parses the responses and selects the most common answer:

```yaml
x-ensemble-strategy: voting
```

Best for: classification and multiple-choice questions

### Weighted

Selects the response with the highest confidence:

```yaml
x-ensemble-strategy: weighted
```

Best for: models with different reliability profiles

### First Success

Returns the first valid response:

```yaml
x-ensemble-strategy: first_success
```

Best for: latency-sensitive applications

### Score Averaging

Balances confidence and latency:

```yaml
x-ensemble-strategy: score_averaging
```

Best for: balanced quality and speed

## Error Handling

### Insufficient Responses

If fewer than `min_responses` requests succeed:

```json
{
  "error": "Ensemble orchestration failed: insufficient responses: got 1, required 2"
}
```

### Invalid Configuration

If a model is not listed in `endpoint_mappings`:

```json
{
  "error": "endpoint not found for model: model-x"
}
```

### Timeout

If requests exceed the configured timeout:

```json
{
  "error": "HTTP request failed: context deadline exceeded"
}
```

## Monitoring

### Logs

The ensemble service logs:

- Request details (models, strategy, min responses)
- Execution results (models queried, responses received, strategy used)
- Errors and failures

### Metrics

Future enhancement: Prometheus metrics for:

- Request count per strategy
- Response latency per model
- Success/failure rates
- Aggregation time

## Security Considerations

1. **Authentication**: Authorization headers are forwarded to model endpoints
2. **Network**: Use HTTPS in production
3. **Rate Limiting**: Apply at the load balancer level
4. **Endpoint Validation**: Only configured endpoints are queried
5. **Timeout Protection**: Prevents resource exhaustion

## Future Enhancements

1. **Semantic Router Integration**: Automatic routing to the ensemble service
2. **Streaming Support**: SSE for streaming responses
3. **Advanced Reranking**: A separate model for ranking responses
4. **Caching**: Cache ensemble results
5. **Metrics**: Comprehensive Prometheus metrics
6. **Circuit Breaker**: Automatic endpoint failure detection
7. **Load Balancing**: Intelligent distribution across model endpoints