Labels: difficulty:beginner (straightforward task), priority:medium (important but not blocking), testing (testing and quality assurance)
Description
Testing Enhancement
Run benchmarks with each quality prior context to compare performance across different model quality assumptions.
Quality Contexts (from conduit.yaml)
| Context | Top Model | Prior Score |
|---|---|---|
| code | claude-sonnet-4.5 | 0.92 |
| creative | claude-opus-4.5 | 0.94 |
| analysis | gemini-3-pro-preview | 0.93 |
| simple_qa | gpt-5-nano | 0.90 |
| general | gemini-3-pro-preview | 0.90 |
Test Matrix
Run small-scale benchmarks with each context:
```bash
for context in code creative analysis simple_qa general; do
  CONDUIT_QUALITY_CONTEXT=$context uv run conduit-bench run \
    --dataset mmlu --max-queries 100 \
    --algorithms always_best,always_cheapest,thompson,ucb1 \
    --output results/context_${context}.json
done
```
Expected Outcomes
- `AlwaysBest` should select different models based on context priors
- `AlwaysCheapest` should be unaffected (cost-only decision)
- Thompson/UCB1 should start exploring from different priors
- Quality context affects baseline performance differently for different datasets
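The Thompson expectation above can be illustrated with a minimal sketch. The Beta-prior seeding and the pseudo-count of 10 are assumptions for illustration, not Conduit's actual parameterization:

```python
import random

# Hypothetical: turn a quality score in [0, 1] into a Beta prior.
# A prior of 0.92 with pseudo-count 10 becomes Beta(9.2, 0.8).
def seed_beta(prior_score, pseudo_count=10):
    alpha = prior_score * pseudo_count
    beta = (1 - prior_score) * pseudo_count
    return alpha, beta

def thompson_pick(arms):
    # arms: {model_name: (alpha, beta)} -- sample each arm's Beta
    # distribution and pick the model with the highest draw.
    samples = {m: random.betavariate(a, b) for m, (a, b) in arms.items()}
    return max(samples, key=samples.get)

arms = {
    "claude-sonnet-4.5": seed_beta(0.92),
    "gpt-5-nano": seed_beta(0.90),
}
print(thompson_pick(arms))
```

Different contexts seed different priors, so early exploration diverges across contexts even before any reward feedback arrives.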
Analysis Script
```python
import json
from pathlib import Path

contexts = ['code', 'creative', 'analysis', 'simple_qa', 'general']
for ctx in contexts:
    with Path(f'results/context_{ctx}.json').open() as f:
        data = json.load(f)
    # Extract model selection distribution for AlwaysBest
    # Compare quality scores across contexts
```
Value
- Validates quality prior system works correctly
- Demonstrates context-specific routing behavior
- Provides baseline comparison data for documentation
- Identifies optimal context for each dataset type
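One way to fill in the extraction step of the analysis script above, assuming (hypothetically) that each result file records a per-query list of selected models under an `algorithms` key; adjust to whatever schema conduit-bench actually emits:

```python
from collections import Counter

# Hypothetical schema: data["algorithms"][name]["selections"] is a
# list of model names, one per query. Count how often each model
# was chosen to get the selection distribution.
def selection_distribution(data, algorithm="always_best"):
    selections = data["algorithms"][algorithm]["selections"]
    return Counter(selections)

sample = {
    "algorithms": {
        "always_best": {
            "selections": ["gpt-5-nano", "gpt-5-nano", "claude-sonnet-4.5"]
        }
    }
}
print(selection_distribution(sample))  # gpt-5-nano: 2, claude-sonnet-4.5: 1
```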
Acceptance Criteria
- All 5 contexts tested with MMLU dataset
- Model selection distributions documented
- Quality/cost comparisons across contexts
- Recommendations for which context to use with which dataset
Estimated Effort
~2-3 hours (mostly runtime)
Dependencies
- Requires #51 (Add context selection for quality priors) for better UX; the `CONDUIT_QUALITY_CONTEXT` env var can be used as a workaround