Labels: difficulty:beginner (straightforward task), priority:medium (important but not blocking), testing (testing and quality assurance)
Description
Testing Enhancement
Run benchmarks with each quality prior context to compare performance across different model quality assumptions.
Quality Contexts (from conduit.yaml)
| Context | Top Model | Prior Score |
|---|---|---|
| code | claude-sonnet-4.5 | 0.92 |
| creative | claude-opus-4.5 | 0.94 |
| analysis | gemini-3-pro-preview | 0.93 |
| simple_qa | gpt-5-nano | 0.90 |
| general | gemini-3-pro-preview | 0.90 |
Test Matrix
Run small-scale benchmarks with each context:
```bash
for context in code creative analysis simple_qa general; do
  CONDUIT_QUALITY_CONTEXT=$context uv run conduit-bench run \
    --dataset mmlu --max-queries 100 \
    --algorithms always_best,always_cheapest,thompson,ucb1 \
    --output results/context_${context}.json
done
```
Expected Outcomes
- `AlwaysBest` should select different models based on context priors
- `AlwaysCheapest` should be unaffected (cost-only decision)
- Thompson/UCB1 should start exploring from different priors
- Quality context affects baseline performance differently for different datasets
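The Thompson expectation above can be illustrated with a minimal sketch. The Beta-prior seeding and the pseudo-count of 10 are assumptions for illustration, not Conduit's actual parameterization:

```python
import random

# Hypothetical: turn a quality score in [0, 1] into a Beta prior.
# A prior of 0.92 with pseudo-count 10 becomes Beta(9.2, 0.8).
def seed_beta(prior_score, pseudo_count=10):
    alpha = prior_score * pseudo_count
    beta = (1 - prior_score) * pseudo_count
    return alpha, beta

def thompson_pick(arms):
    # arms: {model_name: (alpha, beta)} -- sample each arm's Beta
    # distribution and pick the model with the highest draw.
    samples = {m: random.betavariate(a, b) for m, (a, b) in arms.items()}
    return max(samples, key=samples.get)

arms = {
    "claude-sonnet-4.5": seed_beta(0.92),
    "gpt-5-nano": seed_beta(0.90),
}
print(thompson_pick(arms))
```

Different contexts seed different priors, so early exploration diverges across contexts even before any reward feedback arrives.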
Analysis Script
```python
import json
from pathlib import Path

contexts = ['code', 'creative', 'analysis', 'simple_qa', 'general']
for ctx in contexts:
    with Path(f'results/context_{ctx}.json').open() as f:
        data = json.load(f)
    # Extract model selection distribution for AlwaysBest
    # Compare quality scores across contexts
```
Value
- Validates quality prior system works correctly
- Demonstrates context-specific routing behavior
- Provides baseline comparison data for documentation
- Identifies optimal context for each dataset type
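One way to fill in the extraction step of the analysis script above, assuming (hypothetically) that each result file records a per-query list of selected models under an `algorithms` key; adjust to whatever schema conduit-bench actually emits:

```python
from collections import Counter

# Hypothetical schema: data["algorithms"][name]["selections"] is a
# list of model names, one per query. Count how often each model
# was chosen to get the selection distribution.
def selection_distribution(data, algorithm="always_best"):
    selections = data["algorithms"][algorithm]["selections"]
    return Counter(selections)

sample = {
    "algorithms": {
        "always_best": {
            "selections": ["gpt-5-nano", "gpt-5-nano", "claude-sonnet-4.5"]
        }
    }
}
print(selection_distribution(sample))  # gpt-5-nano: 2, claude-sonnet-4.5: 1
```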
Acceptance Criteria
- All 5 contexts tested with MMLU dataset
- Model selection distributions documented
- Quality/cost comparisons across contexts
- Recommendations for which context to use with which dataset
Estimated Effort
~2-3 hours (mostly runtime)
Dependencies
- Requires #51 (Add context selection for quality priors) for better UX; the `CONDUIT_QUALITY_CONTEXT` env var can be used as a workaround