Commit 5c16944: Add ensemble service architecture documentation
Co-authored-by: rootfs <[email protected]>
1 parent 9acd73b

config/ensemble/ARCHITECTURE.md (+347, -0 lines)

# Ensemble Service Architecture

## Overview

The ensemble orchestration feature is implemented as an independent OpenAI-compatible API server that runs alongside the semantic router. This design provides clean separation of concerns and allows the ensemble service to scale independently.

## Architecture Diagram

```
┌─────────────┐
│   Client    │
└──────┬──────┘
       │ HTTP Request
       ▼
┌─────────────────────────────┐
│ Semantic Router (Port 8080) │
│  ┌─────────────────────┐    │
│  │ ExtProc (Port 50051)│    │
│  └─────────────────────┘    │
│  ┌─────────────────────┐    │
│  │ API Server          │    │
│  └─────────────────────┘    │
└─────────────┬───────────────┘
              │ (Optional: Route to Ensemble)
              ▼
┌──────────────────────────────┐
│ Ensemble Service (Port 8081) │
│  ┌──────────────────────┐    │
│  │ /v1/chat/completions │    │
│  │ /health              │    │
│  └──────────────────────┘    │
└──────────┬───────────────────┘
           │ Parallel Queries
    ┌──────┴──────┬──────────┬──────────┐
    ▼             ▼          ▼          ▼
┌────────┐  ┌────────┐  ┌────────┐  ┌────────┐
│Model A │  │Model B │  │Model C │  │Model N │
│ :8001  │  │ :8002  │  │ :8003  │  │ :800N  │
└────────┘  └────────┘  └────────┘  └────────┘
           │ Responses
           ▼
    Aggregation Engine
  (Voting, Weighted, etc.)
           │
           ▼
    Aggregated Response
```

## Components

### 1. Semantic Router (Existing)

- **ExtProc Server** (Port 50051): Envoy external processor for request/response filtering
- **API Server** (Port 8080): Classification and system prompt APIs
- **Metrics Server** (Port 9190): Prometheus metrics

### 2. Ensemble Service (New)

- **Independent HTTP Server** (Port 8081, configurable)
- **OpenAI-Compatible API**: `/v1/chat/completions` endpoint
- **Health Check**: `/health` endpoint
- **Started Automatically**: When `ensemble.enabled: true` is set in the config

### 3. Model Endpoints

- **Multiple Backends**: Each with an OpenAI-compatible API
- **Configured in YAML**: Via `endpoint_mappings`
- **Parallel Queries**: Executed concurrently with semaphore control (see the sketch after this list)
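
The semaphore-bounded fan-out can be pictured with the minimal Go sketch below. It is illustrative only: `queryEndpoint`, `fanOut`, and the dropped-failure policy are assumptions, not the service's actual implementation.

```go
package ensemble

import (
	"context"
	"sync"
)

// queryEndpoint is a hypothetical stand-in for the HTTP call made to one
// model backend; the real service would POST the chat request here.
func queryEndpoint(ctx context.Context, url string) (string, error) {
	return "response from " + url, nil
}

// fanOut queries every endpoint concurrently, but never more than
// maxConcurrent at once (the "semaphore control" described above).
// Failed requests are dropped; the survivors are returned for aggregation.
func fanOut(ctx context.Context, endpoints []string, maxConcurrent int) []string {
	sem := make(chan struct{}, maxConcurrent) // counting semaphore
	var (
		mu        sync.Mutex
		wg        sync.WaitGroup
		responses []string
	)
	for _, url := range endpoints {
		wg.Add(1)
		go func(url string) {
			defer wg.Done()
			sem <- struct{}{}        // acquire a slot
			defer func() { <-sem }() // release it
			if resp, err := queryEndpoint(ctx, url); err == nil {
				mu.Lock()
				responses = append(responses, resp)
				mu.Unlock()
			}
		}(url)
	}
	wg.Wait()
	return responses
}
```
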
## Request Flow

### 1. Direct Ensemble Request

The client queries the ensemble service directly:

```bash
curl -X POST http://localhost:8081/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "x-ensemble-enable: true" \
  -H "x-ensemble-models: model-a,model-b,model-c" \
  -H "x-ensemble-strategy: voting" \
  -d '{"messages":[...]}'
```

**Flow:**

1. Client → Ensemble Service (Port 8081)
2. Ensemble Service → Model Endpoints (Parallel)
3. Model Endpoints → Ensemble Service (Responses)
4. Ensemble Service → Aggregation → Client (Final Response)

### 2. Via Semantic Router (Future Enhancement)

The semantic router could route requests to the ensemble service based on headers or configuration:

```bash
curl -X POST http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "x-ensemble-enable: true" \
  -H "x-ensemble-models: model-a,model-b,model-c" \
  -d '{"messages":[...]}'
```

**Flow:**

1. Client → Semantic Router (Port 8080)
2. Router detects the ensemble header → routes to the Ensemble Service (see the sketch below)
3. Ensemble Service → Model Endpoints (Parallel)
4. Ensemble Service → Router → Client

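Since this integration does not exist yet, the following is a purely speculative sketch of the detection step in the flow above; the handler name, the hard-coded ensemble URL, and the fallback handler are all hypothetical.

```go
package router

import (
	"net/http"
	"net/http/httputil"
	"net/url"
)

// ensembleAwareHandler illustrates the possible future integration: if the
// ensemble header is present, proxy the request to the ensemble service;
// otherwise fall through to the normal routing path.
func ensembleAwareHandler(normal http.Handler) http.Handler {
	ensembleURL, _ := url.Parse("http://localhost:8081") // assumed ensemble port
	ensembleProxy := httputil.NewSingleHostReverseProxy(ensembleURL)

	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if r.Header.Get("x-ensemble-enable") == "true" {
			ensembleProxy.ServeHTTP(w, r) // step 2 of the flow above
			return
		}
		normal.ServeHTTP(w, r)
	})
}
```
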
## Key Design Decisions

### Why Independent Service?

1. **Clean Separation**: ExtProc is designed for a single downstream endpoint
2. **Scalability**: The ensemble service can be scaled independently
3. **Flexibility**: Can be used standalone or with the semantic router
4. **Simplicity**: Each component has a single, clear responsibility
5. **Maintainability**: Clear boundaries between components

### Port Allocation

| Service | Default Port | Configurable | Purpose |
|---------|--------------|--------------|---------|
| ExtProc | 50051 | `-port` | gRPC ExtProc server |
| API Server | 8080 | `-api-port` | Classification APIs |
| Ensemble | 8081 | `-ensemble-port` | Ensemble orchestration |
| Metrics | 9190 | `-metrics-port` | Prometheus metrics |

### Configuration

The ensemble service reads its configuration from the same `config.yaml`:

```yaml
ensemble:
  enabled: true                # Start ensemble service
  default_strategy: "voting"
  default_min_responses: 2
  timeout_seconds: 30
  max_concurrent_requests: 10
  endpoint_mappings:
    model-a: "http://localhost:8001/v1/chat/completions"
    model-b: "http://localhost:8002/v1/chat/completions"
    model-c: "http://localhost:8003/v1/chat/completions"
```

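For orientation, here is a minimal sketch of Go structs that could decode this block (for example with a YAML library such as `gopkg.in/yaml.v3`). The type and field names are illustrative, not the project's actual configuration types; only the YAML keys come from the config above.

```go
package config

// EnsembleConfig mirrors the `ensemble:` block shown above.
// Decode it with yaml.Unmarshal(data, &cfg) or an equivalent loader.
type EnsembleConfig struct {
	Enabled               bool              `yaml:"enabled"`
	DefaultStrategy       string            `yaml:"default_strategy"`
	DefaultMinResponses   int               `yaml:"default_min_responses"`
	TimeoutSeconds        int               `yaml:"timeout_seconds"`
	MaxConcurrentRequests int               `yaml:"max_concurrent_requests"`
	EndpointMappings      map[string]string `yaml:"endpoint_mappings"`
}
```
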
## Deployment Scenarios

### Scenario 1: Standalone Ensemble

Deploy only the ensemble service:

```bash
./bin/router -config=config/ensemble-only.yaml
```

Use a config with all other features disabled and only the ensemble enabled.

### Scenario 2: Integrated with Semantic Router

Deploy all services together (default):

```bash
./bin/router -config=config/config.yaml
```

All services start based on their enabled flags.

### Scenario 3: Scaled Ensemble

Run multiple ensemble service instances:

```bash
# Instance 1
./bin/router -config=config1.yaml -ensemble-port=8081

# Instance 2
./bin/router -config=config2.yaml -ensemble-port=8082
```

A load balancer distributes requests across the instances.

## API Specification

### POST /v1/chat/completions

OpenAI-compatible endpoint with ensemble extensions.

#### Request Headers

| Header | Required | Description |
|--------|----------|-------------|
| `x-ensemble-enable` | Yes | Must be "true" |
| `x-ensemble-models` | Yes | Comma-separated model names |
| `x-ensemble-strategy` | No | Aggregation strategy (default from config) |
| `x-ensemble-min-responses` | No | Minimum responses required (default from config) |
| `Authorization` | No | Forwarded to model endpoints |

#### Request Body

Standard OpenAI chat completion request:

```json
{
  "model": "ensemble",
  "messages": [
    {"role": "user", "content": "Your question"}
  ]
}
```

#### Response Headers

| Header | Description |
|--------|-------------|
| `x-vsr-ensemble-used` | "true" if ensemble was used |
| `x-vsr-ensemble-models-queried` | Number of models queried |
| `x-vsr-ensemble-responses-received` | Number of successful responses |

#### Response Body

Standard OpenAI chat completion response with aggregated content.

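As a usage illustration, the Go client sketch below sets the ensemble request headers and reads the `x-vsr-*` response headers documented above. The request construction is an assumption about a typical caller, not a prescribed client.

```go
package main

import (
	"bytes"
	"fmt"
	"io"
	"net/http"
)

func main() {
	body := []byte(`{"model":"ensemble","messages":[{"role":"user","content":"Your question"}]}`)

	req, err := http.NewRequest(http.MethodPost,
		"http://localhost:8081/v1/chat/completions", bytes.NewReader(body))
	if err != nil {
		panic(err)
	}
	req.Header.Set("Content-Type", "application/json")
	req.Header.Set("x-ensemble-enable", "true")
	req.Header.Set("x-ensemble-models", "model-a,model-b,model-c")
	req.Header.Set("x-ensemble-strategy", "voting")

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	// The ensemble-specific response headers described above.
	fmt.Println("used:    ", resp.Header.Get("x-vsr-ensemble-used"))
	fmt.Println("queried: ", resp.Header.Get("x-vsr-ensemble-models-queried"))
	fmt.Println("received:", resp.Header.Get("x-vsr-ensemble-responses-received"))

	out, _ := io.ReadAll(resp.Body)
	fmt.Println(string(out)) // standard chat completion JSON with aggregated content
}
```
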
### GET /health

Health check endpoint.

#### Response

```json
{
  "status": "healthy",
  "service": "ensemble"
}
```

## Aggregation Strategies

### Voting

Parses responses and selects the most common answer:

```yaml
x-ensemble-strategy: voting
```

Best for: Classification, multiple-choice questions

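As an illustration of the idea (not the service's actual parser), a minimal majority vote over answer strings that have already been extracted and normalized from each model's reply might look like this:

```go
package ensemble

// voteOnAnswers picks the most common answer among the model responses.
// Ties are broken by whichever answer was seen first.
func voteOnAnswers(answers []string) string {
	counts := make(map[string]int)
	order := make([]string, 0, len(answers)) // preserves first-seen order
	for _, a := range answers {
		if counts[a] == 0 {
			order = append(order, a)
		}
		counts[a]++
	}
	best := ""
	for _, a := range order {
		if best == "" || counts[a] > counts[best] {
			best = a
		}
	}
	return best
}
```
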
### Weighted

Selects the response with the highest confidence:

```yaml
x-ensemble-strategy: weighted
```

Best for: Models with different reliability profiles

### First Success

Returns the first valid response:

```yaml
x-ensemble-strategy: first_success
```

Best for: Latency-sensitive applications

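A hedged sketch of the first-success pattern: fire all queries, return as soon as one succeeds, and cancel the rest. It reuses the hypothetical `queryEndpoint` helper from the fan-out sketch earlier; none of these names come from the project.

```go
package ensemble

import (
	"context"
	"errors"
)

// firstSuccess returns the first non-error response and cancels the
// remaining in-flight queries.
func firstSuccess(ctx context.Context, endpoints []string) (string, error) {
	ctx, cancel := context.WithCancel(ctx)
	defer cancel() // cancelling tells the losers to stop

	results := make(chan string, len(endpoints)) // buffered so losers never block
	for _, url := range endpoints {
		go func(url string) {
			if resp, err := queryEndpoint(ctx, url); err == nil {
				results <- resp
			} else {
				results <- "" // an empty string signals failure in this sketch
			}
		}(url)
	}

	for range endpoints {
		if resp := <-results; resp != "" {
			return resp, nil // first valid response wins
		}
	}
	return "", errors.New("no endpoint returned a valid response")
}
```
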
### Score Averaging

Balances confidence and latency:

```yaml
x-ensemble-strategy: score_averaging
```

Best for: Balanced quality and speed

## Error Handling

### Insufficient Responses

If fewer than `min_responses` model calls succeed:

```json
{
  "error": "Ensemble orchestration failed: insufficient responses: got 1, required 2"
}
```

### Invalid Configuration

If a requested model is not in `endpoint_mappings`:

```json
{
  "error": "endpoint not found for model: model-x"
}
```

### Timeout

If requests exceed the configured timeout:

```json
{
  "error": "HTTP request failed: context deadline exceeded"
}
```

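A hedged sketch of how a caller might surface these errors, assuming the service returns a JSON body with an `error` field as shown above; the exact HTTP status codes are not specified in this document.

```go
package client

import (
	"encoding/json"
	"fmt"
	"io"
	"net/http"
)

// ensembleError mirrors the error payloads shown above.
type ensembleError struct {
	Error string `json:"error"`
}

// checkResponse turns a non-2xx ensemble response into a Go error.
func checkResponse(resp *http.Response) error {
	if resp.StatusCode >= 200 && resp.StatusCode < 300 {
		return nil
	}
	body, _ := io.ReadAll(resp.Body)
	var e ensembleError
	if err := json.Unmarshal(body, &e); err == nil && e.Error != "" {
		return fmt.Errorf("ensemble service: %s", e.Error)
	}
	return fmt.Errorf("ensemble service: HTTP %d", resp.StatusCode)
}
```
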
## Monitoring

### Logs

The ensemble service logs:

- Request details (models, strategy, min responses)
- Execution results (queried, received, strategy used)
- Errors and failures

### Metrics

Future enhancement: Prometheus metrics (see the sketch after this list) for:

- Request count per strategy
- Response latency per model
- Success/failure rates
- Aggregation time

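Since these metrics are only planned, the sketch below is speculative: it shows how such counters and histograms might be registered with the Prometheus Go client. The metric names are invented for illustration.

```go
package ensemble

import (
	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
)

// Hypothetical metric names; none of these exist in the project yet.
var (
	requestsPerStrategy = promauto.NewCounterVec(prometheus.CounterOpts{
		Name: "ensemble_requests_total",
		Help: "Ensemble requests, labeled by aggregation strategy.",
	}, []string{"strategy"})

	modelLatency = promauto.NewHistogramVec(prometheus.HistogramOpts{
		Name: "ensemble_model_latency_seconds",
		Help: "Per-model response latency.",
	}, []string{"model"})

	responseOutcomes = promauto.NewCounterVec(prometheus.CounterOpts{
		Name: "ensemble_model_responses_total",
		Help: "Model responses, labeled by success or failure.",
	}, []string{"model", "outcome"})

	aggregationTime = promauto.NewHistogram(prometheus.HistogramOpts{
		Name: "ensemble_aggregation_seconds",
		Help: "Time spent aggregating responses.",
	})
)
```
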
## Security Considerations

1. **Authentication**: `Authorization` headers are forwarded to model endpoints
2. **Network**: Use HTTPS in production
3. **Rate Limiting**: Apply at the load balancer level
4. **Endpoint Validation**: Only configured endpoints are queried
5. **Timeout Protection**: Prevents resource exhaustion

## Future Enhancements

1. **Semantic Router Integration**: Automatic routing to the ensemble service
2. **Streaming Support**: SSE for streaming responses
3. **Advanced Reranking**: Separate model for ranking responses
4. **Caching**: Cache ensemble results
5. **Metrics**: Comprehensive Prometheus metrics
6. **Circuit Breaker**: Automatic endpoint failure detection
7. **Load Balancing**: Intelligent distribution across model endpoints
