Test benchmarks with all 5 quality prior contexts #52

@evanvolgas

Description

Testing Enhancement

Run benchmarks with each quality prior context to compare performance across different model quality assumptions.

Quality Contexts (from conduit.yaml)

Context    Top Model             Prior Score
code       claude-sonnet-4.5     0.92
creative   claude-opus-4.5       0.94
analysis   gemini-3-pro-preview  0.93
simple_qa  gpt-5-nano            0.90
general    gemini-3-pro-preview  0.90

Test Matrix

Run small-scale benchmarks with each context:

mkdir -p results  # assumption: the CLI may not create the output directory itself
for context in code creative analysis simple_qa general; do
  CONDUIT_QUALITY_CONTEXT="$context" uv run conduit-bench run \
    --dataset mmlu --max-queries 100 \
    --algorithms always_best,always_cheapest,thompson,ucb1 \
    --output "results/context_${context}.json"
done

Expected Outcomes

  • AlwaysBest should select different models based on context priors
  • AlwaysCheapest should be unaffected (cost-only decision)
  • Thompson/UCB1 should start exploring from different priors
  • Quality context affects baseline performance differently for different datasets
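
As a rough illustration of the third expectation, the sketch below shows one common way a prior quality score can seed a Beta distribution for Thompson sampling. The function name `beta_from_prior` and the `strength` pseudo-count are hypothetical; nothing here is taken from Conduit's implementation, and only the model names and scores come from the table above.

```python
# Hypothetical sketch: convert a prior quality score into Beta pseudo-counts.
# This is NOT Conduit's code; `strength` is an assumed tuning knob.

def beta_from_prior(mean: float, strength: float = 10.0) -> tuple[float, float]:
    """Return (alpha, beta) for a Beta prior with the given mean."""
    if not 0.0 < mean < 1.0:
        raise ValueError("mean must lie strictly between 0 and 1")
    return mean * strength, (1.0 - mean) * strength

# Top model and prior score per context, copied from the issue's table.
TOP_PRIORS = {
    "code": ("claude-sonnet-4.5", 0.92),
    "creative": ("claude-opus-4.5", 0.94),
    "analysis": ("gemini-3-pro-preview", 0.93),
    "simple_qa": ("gpt-5-nano", 0.90),
    "general": ("gemini-3-pro-preview", 0.90),
}

for context, (model, score) in TOP_PRIORS.items():
    alpha, beta = beta_from_prior(score)
    print(f"{context}: {model} starts at Beta({alpha:.2f}, {beta:.2f})")
```

Under this assumption, a higher `strength` makes the bandit trust the prior longer before observed rewards dominate, so differences between contexts would persist for more queries.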

Analysis Script

import json
from pathlib import Path

contexts = ['code', 'creative', 'analysis', 'simple_qa', 'general']
results = {}
for ctx in contexts:
    path = Path('results') / f'context_{ctx}.json'
    if not path.exists():
        print(f'missing results for context: {ctx}')
        continue
    with path.open() as f:
        results[ctx] = json.load(f)
    # TODO: extract the model selection distribution for AlwaysBest
    # TODO: compare quality scores across contexts
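
The extraction step could look something like the sketch below. The result file schema is not specified in this issue, so the record fields `algorithm` and `selected_model` are assumptions made purely for illustration; they would need to be adapted to whatever conduit-bench actually writes.

```python
from collections import Counter

def selection_distribution(records, algorithm):
    """Fraction of queries routed to each model by one algorithm.

    Assumes each record is a dict with hypothetical keys
    'algorithm' and 'selected_model'.
    """
    picks = [r["selected_model"] for r in records if r["algorithm"] == algorithm]
    total = len(picks)
    if total == 0:
        return {}
    return {model: count / total for model, count in Counter(picks).items()}

# Made-up sample records, only to show the call shape.
sample = [
    {"algorithm": "always_best", "selected_model": "claude-sonnet-4.5"},
    {"algorithm": "always_best", "selected_model": "claude-sonnet-4.5"},
    {"algorithm": "thompson", "selected_model": "gpt-5-nano"},
]
print(selection_distribution(sample, "always_best"))
```

Running this once per context file and comparing the resulting dictionaries would show whether AlwaysBest shifts its choice as the context changes.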

Value

  • Validates that the quality prior system works correctly
  • Demonstrates context-specific routing behavior
  • Provides baseline comparison data for documentation
  • Identifies optimal context for each dataset type

Acceptance Criteria

  • All 5 contexts tested with MMLU dataset
  • Model selection distributions documented
  • Quality/cost comparisons across contexts reported
  • Recommendations for which context to use with which dataset
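
The first criterion can be checked mechanically. A minimal sketch, assuming the output paths follow the `results/context_<name>.json` pattern used in the test matrix (the helper name `missing_contexts` is hypothetical):

```python
from pathlib import Path

CONTEXTS = ["code", "creative", "analysis", "simple_qa", "general"]

def missing_contexts(results_dir):
    """Return the contexts that have no result file in results_dir."""
    base = Path(results_dir)
    return [c for c in CONTEXTS if not (base / f"context_{c}.json").is_file()]

missing = missing_contexts("results")
if missing:
    print(f"missing: {', '.join(missing)}")
else:
    print("all contexts present")
```

This only confirms that each run produced a file; it does not validate the contents.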

Estimated Effort

~2-3 hours (mostly runtime)

Dependencies

Metadata

Labels: difficulty:beginner, priority:medium, testing
