Skip to content

Commit 6c64997

Browse files
Copilotrootfs
andcommitted
Improve voting and score averaging implementations
Co-authored-by: rootfs <[email protected]>
1 parent 736bd4d commit 6c64997

File tree

4 files changed

+546
-24
lines changed

4 files changed

+546
-24
lines changed

config/ensemble/README.md

Lines changed: 190 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,190 @@
1+
# Ensemble Orchestration Configuration
2+
3+
This directory contains configuration examples for the ensemble orchestration feature, which enables parallel model inference with configurable aggregation strategies.
4+
5+
## Overview
6+
7+
The ensemble orchestration feature allows you to:
8+
- Query multiple LLM models in parallel
9+
- Combine their outputs using various aggregation strategies
10+
- Improve reliability, accuracy, and cost-performance trade-offs
11+
12+
## Configuration
13+
14+
### Basic Setup
15+
16+
Enable ensemble mode in your `config.yaml`:
17+
18+
```yaml
19+
ensemble:
20+
enabled: true
21+
default_strategy: "voting"
22+
default_min_responses: 2
23+
timeout_seconds: 30
24+
max_concurrent_requests: 10
25+
endpoint_mappings:
26+
model-a: "http://localhost:8001/v1/chat/completions"
27+
model-b: "http://localhost:8002/v1/chat/completions"
28+
model-c: "http://localhost:8003/v1/chat/completions"
29+
```
30+
31+
### Configuration Options
32+
33+
| Option | Type | Default | Description |
34+
|--------|------|---------|-------------|
35+
| `enabled` | boolean | `false` | Enable/disable ensemble orchestration |
36+
| `default_strategy` | string | `"voting"` | Default aggregation strategy |
37+
| `default_min_responses` | integer | `2` | Minimum successful responses required |
38+
| `timeout_seconds` | integer | `30` | Maximum time to wait for responses |
39+
| `max_concurrent_requests` | integer | `10` | Limit on parallel model queries |
40+
| `endpoint_mappings` | map | `{}` | Model name to OpenAI-compatible API endpoint mapping |
41+
42+
## Usage
43+
44+
### Request Headers
45+
46+
Control ensemble behavior using HTTP headers:
47+
48+
| Header | Description | Example |
49+
|--------|-------------|---------|
50+
| `x-ensemble-enable` | Enable ensemble mode | `true` |
51+
| `x-ensemble-models` | Comma-separated list of models | `model-a,model-b,model-c` |
52+
| `x-ensemble-strategy` | Aggregation strategy | `voting` |
53+
| `x-ensemble-min-responses` | Minimum responses required | `2` |
54+
55+
### Example Request
56+
57+
```bash
58+
curl -X POST http://localhost:8080/v1/chat/completions \
59+
-H "Content-Type: application/json" \
60+
-H "x-ensemble-enable: true" \
61+
-H "x-ensemble-models: model-a,model-b,model-c" \
62+
-H "x-ensemble-strategy: voting" \
63+
-H "x-ensemble-min-responses: 2" \
64+
-d '{
65+
"model": "ensemble",
66+
"messages": [
67+
{"role": "user", "content": "What is the capital of France?"}
68+
]
69+
}'
70+
```
71+
72+
### Response Headers
73+
74+
The response includes metadata about the ensemble process:
75+
76+
| Header | Description | Example |
77+
|--------|-------------|---------|
78+
| `x-vsr-ensemble-used` | Indicates ensemble was used | `true` |
79+
| `x-vsr-ensemble-models-queried` | Number of models queried | `3` |
80+
| `x-vsr-ensemble-responses-received` | Number of successful responses | `3` |
81+
82+
## Aggregation Strategies
83+
84+
### 1. Voting (Majority Consensus)
85+
**Best for:** Classification, multiple choice, yes/no questions
86+
87+
Selects the most common response among all models.
88+
89+
```bash
90+
-H "x-ensemble-strategy: voting"
91+
```
92+
93+
### 2. Weighted Consensus
94+
**Best for:** Combining models with different reliability profiles
95+
96+
Weights responses by confidence scores from each model.
97+
98+
```bash
99+
-H "x-ensemble-strategy: weighted"
100+
```
101+
102+
### 3. First Success
103+
**Best for:** Latency-sensitive applications
104+
105+
Returns the first valid response received, optimizing for speed.
106+
107+
```bash
108+
-H "x-ensemble-strategy: first_success"
109+
```
110+
111+
### 4. Score Averaging
112+
**Best for:** Numerical outputs, probability distributions
113+
114+
Averages numerical scores across all models.
115+
116+
```bash
117+
-H "x-ensemble-strategy: score_averaging"
118+
```
119+
120+
### 5. Reranking
121+
**Best for:** Generation tasks, open-ended responses
122+
123+
Collects multiple candidate responses and selects the best one (requires additional ranking logic).
124+
125+
```bash
126+
-H "x-ensemble-strategy: reranking"
127+
```
128+
129+
## Use Cases
130+
131+
### Critical Applications
132+
- Medical diagnosis assistance (consensus increases confidence)
133+
- Legal document analysis (high accuracy verification)
134+
- Financial advisory systems (reliability impacts business outcomes)
135+
136+
### Cost Optimization
137+
- Query multiple smaller models instead of one large expensive model
138+
- Start with fast/cheap models, escalate for uncertain cases
139+
- Adaptive routing based on query complexity
140+
141+
### Reliability & Accuracy
142+
- Voting mechanisms to reduce hallucinations
143+
- Consensus-based outputs for higher confidence
144+
- Graceful degradation with fallback chains
145+
146+
### Model Diversity
147+
- Combine different model architectures (GPT-style + Llama-style)
148+
- Ensemble different model sizes for balanced performance
149+
- Cross-validate responses from models with different training data
150+
151+
## Examples
152+
153+
See `ensemble-example.yaml` for a complete configuration example.
154+
155+
## Security Considerations
156+
157+
- Ensure all endpoint URLs are from trusted sources
158+
- Use TLS/HTTPS for production deployments
159+
- Set appropriate timeout values to prevent resource exhaustion
160+
- Monitor and log ensemble operations for debugging
161+
162+
## Performance Tips
163+
164+
1. **Optimize Concurrency**: Set `max_concurrent_requests` based on your infrastructure capacity
165+
2. **Tune Timeouts**: Balance between latency and completeness with `timeout_seconds`
166+
3. **Select Appropriate Strategy**: Choose the strategy that best matches your use case
167+
4. **Monitor Metrics**: Track response times and success rates per model
168+
169+
## Troubleshooting
170+
171+
### No responses received
172+
- Verify endpoint URLs are correct and reachable
173+
- Check network connectivity to model endpoints
174+
- Ensure models are running and accepting requests
175+
176+
### Insufficient responses error
177+
- Reduce `x-ensemble-min-responses` header value
178+
- Add more model endpoints to `endpoint_mappings`
179+
- Check model health and availability
180+
181+
### Slow responses
182+
- Reduce `timeout_seconds` for faster failures
183+
- Increase `max_concurrent_requests` for better parallelism
184+
- Use `first_success` strategy for latency optimization
185+
186+
## Related Documentation
187+
188+
- [Main Configuration Guide](../README.md)
189+
- [API Documentation](../../docs/api.md)
190+
- [Deployment Guide](../../docs/deployment.md)
Lines changed: 40 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,40 @@
1+
# Example Ensemble Configuration
2+
# This configuration demonstrates how to enable and use ensemble orchestration
3+
4+
# Enable ensemble mode
5+
ensemble:
6+
enabled: true # Set to true to enable ensemble orchestration
7+
8+
# Default aggregation strategy when not specified in request headers
9+
# Options: voting, weighted, first_success, score_averaging, reranking
10+
default_strategy: "voting"
11+
12+
# Minimum number of successful model responses required
13+
default_min_responses: 2
14+
15+
# Maximum time to wait for model responses (seconds)
16+
timeout_seconds: 30
17+
18+
# Maximum number of parallel model queries
19+
max_concurrent_requests: 10
20+
21+
# Map model names to their OpenAI-compatible API endpoints
22+
# Each endpoint should be the full URL to the chat completions endpoint
23+
endpoint_mappings:
24+
model-a: "http://localhost:8001/v1/chat/completions"
25+
model-b: "http://localhost:8002/v1/chat/completions"
26+
model-c: "http://localhost:8003/v1/chat/completions"
27+
28+
# Example Usage:
29+
#
30+
# To use ensemble mode, include the following headers in your request:
31+
#
32+
# x-ensemble-enable: true
33+
# x-ensemble-models: model-a,model-b,model-c
34+
# x-ensemble-strategy: voting
35+
# x-ensemble-min-responses: 2
36+
#
37+
# The response will include metadata headers:
38+
# x-vsr-ensemble-used: true
39+
# x-vsr-ensemble-models-queried: 3
40+
# x-vsr-ensemble-responses-received: 3

src/semantic-router/pkg/ensemble/factory.go

Lines changed: 114 additions & 24 deletions
Original file line numberDiff line numberDiff line change
@@ -292,37 +292,73 @@ func (f *Factory) aggregateResponses(responses []ModelResponse, strategy Strateg
292292
}
293293
}
294294

295-
// aggregateByVoting implements majority voting
295+
// aggregateByVoting implements majority voting by comparing message content
296296
func (f *Factory) aggregateByVoting(responses []ModelResponse, metadata *Metadata) ([]byte, Metadata, error) {
297-
// Count occurrences of each response
298-
// This is a simplified implementation - in production, you'd parse the actual content
299-
responseCounts := make(map[string]int)
300-
responseMap := make(map[string][]byte)
297+
// Parse responses and extract message content for voting
298+
type parsedResponse struct {
299+
content string
300+
rawBytes []byte
301+
}
302+
303+
contentCounts := make(map[string]int)
304+
contentToResponse := make(map[string]parsedResponse)
301305

302306
for _, resp := range responses {
303-
key := string(resp.Response)
304-
responseCounts[key]++
305-
responseMap[key] = resp.Response
307+
// Try to parse OpenAI-style response
308+
var openAIResp map[string]interface{}
309+
if err := json.Unmarshal(resp.Response, &openAIResp); err != nil {
310+
// If parsing fails, use first response as fallback
311+
logging.Warnf("Failed to parse response for voting: %v", err)
312+
continue
313+
}
314+
315+
// Extract content from choices array
316+
content := extractContentFromResponse(openAIResp)
317+
if content != "" {
318+
contentCounts[content]++
319+
contentToResponse[content] = parsedResponse{
320+
content: content,
321+
rawBytes: resp.Response,
322+
}
323+
}
306324
}
307325

308-
// Find the most common response
326+
// Find the most common content
309327
var maxCount int
310-
var selectedResponse []byte
311-
for key, count := range responseCounts {
328+
var selectedContent string
329+
for content, count := range contentCounts {
312330
if count > maxCount {
313331
maxCount = count
314-
selectedResponse = responseMap[key]
332+
selectedContent = content
315333
}
316334
}
317335

318-
metadata.AggregationDetails["votes"] = responseCounts
336+
metadata.AggregationDetails["vote_counts"] = contentCounts
319337
metadata.AggregationDetails["max_votes"] = maxCount
320338

321-
if selectedResponse == nil {
322-
return responses[0].Response, *metadata, nil
339+
// Return the response with the most votes, or first response if no clear winner
340+
if selectedContent != "" {
341+
if selected, ok := contentToResponse[selectedContent]; ok {
342+
return selected.rawBytes, *metadata, nil
343+
}
323344
}
324345

325-
return selectedResponse, *metadata, nil
346+
return responses[0].Response, *metadata, nil
347+
}
348+
349+
// extractContentFromResponse extracts the message content from an OpenAI-style response
350+
func extractContentFromResponse(resp map[string]interface{}) string {
351+
// Navigate: response["choices"][0]["message"]["content"]
352+
if choices, ok := resp["choices"].([]interface{}); ok && len(choices) > 0 {
353+
if choice, ok := choices[0].(map[string]interface{}); ok {
354+
if message, ok := choice["message"].(map[string]interface{}); ok {
355+
if content, ok := message["content"].(string); ok {
356+
return content
357+
}
358+
}
359+
}
360+
}
361+
return ""
326362
}
327363

328364
// aggregateByWeighted implements confidence-weighted selection
@@ -352,13 +388,67 @@ func (f *Factory) aggregateByWeighted(responses []ModelResponse, metadata *Metad
352388
return selectedResponse, *metadata, nil
353389
}
354390

355-
// aggregateByScoreAveraging implements score averaging (simplified)
391+
// aggregateByScoreAveraging averages logprobs or confidence scores from multiple models
356392
func (f *Factory) aggregateByScoreAveraging(responses []ModelResponse, metadata *Metadata) ([]byte, Metadata, error) {
357-
// This is a simplified implementation
358-
// In production, you'd parse the responses and average numerical scores
359-
// For now, return the first response as a placeholder
360-
metadata.SelectedModel = responses[0].ModelName
361-
metadata.AggregationDetails["note"] = "score averaging not fully implemented"
362-
363-
return responses[0].Response, *metadata, nil
393+
// For score averaging, we select the response with the median confidence/latency balance
394+
// This is more practical than trying to merge responses
395+
396+
type scoredResponse struct {
397+
response ModelResponse
398+
score float64
399+
}
400+
401+
scored := make([]scoredResponse, 0, len(responses))
402+
403+
for _, resp := range responses {
404+
// Compute a composite score based on confidence and latency
405+
// Higher confidence is better, lower latency is better
406+
score := resp.Confidence
407+
if resp.Latency.Seconds() > 0 {
408+
// Normalize latency (penalize slow responses)
409+
latencyPenalty := 1.0 / (1.0 + resp.Latency.Seconds())
410+
score = score * latencyPenalty
411+
}
412+
413+
scored = append(scored, scoredResponse{
414+
response: resp,
415+
score: score,
416+
})
417+
}
418+
419+
// If no confidence scores available, fall back to selecting by fastest response
420+
allZeroConfidence := true
421+
for _, s := range scored {
422+
if s.score > 0 {
423+
allZeroConfidence = false
424+
break
425+
}
426+
}
427+
428+
if allZeroConfidence {
429+
// Select fastest response
430+
fastest := scored[0]
431+
for _, s := range scored[1:] {
432+
if s.response.Latency < fastest.response.Latency {
433+
fastest = s
434+
}
435+
}
436+
metadata.SelectedModel = fastest.response.ModelName
437+
metadata.AggregationDetails["selection_method"] = "fastest_response"
438+
return fastest.response.Response, *metadata, nil
439+
}
440+
441+
// Find highest scoring response
442+
best := scored[0]
443+
for _, s := range scored[1:] {
444+
if s.score > best.score {
445+
best = s
446+
}
447+
}
448+
449+
metadata.SelectedModel = best.response.ModelName
450+
metadata.AggregationDetails["best_score"] = best.score
451+
metadata.AggregationDetails["selection_method"] = "score_based"
452+
453+
return best.response.Response, *metadata, nil
364454
}

0 commit comments

Comments
 (0)