Labels: `difficulty:intermediate`, `documentation`, `priority:high`
## Summary
Execute the complete benchmark suite and prepare results for Hacker News launch.
## Benchmark Suite
| Dataset | Size | Evaluation | Est. Cost | Headline |
|---|---|---|---|---|
| GSM8K | 1,319 | Exact match | $100-150 | Math reasoning |
| MMLU | 1,000 | Exact match | $80-100 | Knowledge |
| HumanEval | 164 | Code execution | $20-30 | Coding |
| **Total** | **2,483** | | **$200-300** | |
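The totals row can be sanity-checked with a few lines of Python. Sizes and per-benchmark cost ranges are copied from the table above, not measured; note the upper bounds sum to $280, which the table rounds up to $300.

```python
# Sanity-check the benchmark suite totals from the table above.
# Sizes and cost ranges are taken from the table, not measured.
suite = {
    "GSM8K": {"size": 1319, "cost": (100, 150)},
    "MMLU": {"size": 1000, "cost": (80, 100)},
    "HumanEval": {"size": 164, "cost": (20, 30)},
}

total_items = sum(d["size"] for d in suite.values())
total_cost_low = sum(d["cost"][0] for d in suite.values())
total_cost_high = sum(d["cost"][1] for d in suite.values())

print(total_items)                       # 2483
print(total_cost_low, total_cost_high)   # 200 280 (table rounds high end to $300)
```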
## Commands

```bash
# GSM8K benchmark
uv run conduit-bench run \
  --dataset gsm8k \
  --algorithms hybrid,linucb,ucb1,thompson,epsilon,random \
  --evaluator exact_match \
  --output results/gsm8k.json

# MMLU benchmark
uv run conduit-bench run \
  --dataset mmlu \
  --mmlu-limit 1000 \
  --algorithms hybrid,linucb,ucb1,thompson,epsilon,random \
  --evaluator exact_match \
  --output results/mmlu.json

# HumanEval benchmark
uv run conduit-bench run \
  --dataset humaneval \
  --algorithms hybrid,linucb,ucb1,thompson,epsilon,random \
  --evaluator code_execution \
  --output results/humaneval.json
```
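Once the three runs finish, it may be worth spot-checking the result files before generating the combined report. A minimal sketch follows; the JSON schema assumed here (a top-level `dataset` key and per-algorithm `accuracy`/`total_cost` stats under `algorithms`) is a guess, not conduit-bench's documented output format, so adjust the keys to whatever the tool actually writes.

```python
import json
import os
import tempfile

def summarize(paths):
    """Flatten result files into (dataset, algorithm, accuracy, cost) rows.

    Assumes each file looks like:
      {"dataset": "...", "algorithms": {"<name>": {"accuracy": ..., "total_cost": ...}}}
    This schema is an assumption -- adapt to conduit-bench's real output.
    """
    rows = []
    for path in paths:
        with open(path) as f:
            data = json.load(f)
        for algo, stats in data["algorithms"].items():
            rows.append((data["dataset"], algo, stats["accuracy"], stats["total_cost"]))
    return rows

# Demo with a synthetic stand-in for results/gsm8k.json:
with tempfile.TemporaryDirectory() as d:
    p = os.path.join(d, "gsm8k.json")
    with open(p, "w") as f:
        json.dump({"dataset": "gsm8k",
                   "algorithms": {"hybrid": {"accuracy": 0.9, "total_cost": 12.5}}}, f)
    rows = summarize([p])

print(rows)  # [('gsm8k', 'hybrid', 0.9, 12.5)]
```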
```bash
# Generate combined report
uv run conduit-bench report \
  --results results/gsm8k.json results/mmlu.json results/humaneval.json \
  --output analysis/hn_launch_report.md
```

## Target Headlines
- GSM8K: "HybridRouter achieves X% accuracy at Y% the cost of GPT-4"
- MMLU: "Matches top-tier accuracy while cutting costs in half"
- HumanEval: "X% pass rate with intelligent model selection"
## HN Post Structure
- Lead with cost savings + quality retention
- Show Pareto frontier chart (accuracy vs cost)
- Link to reproducible benchmark suite
- Acknowledge limitations upfront
- Invite community to run benchmarks
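For the Pareto frontier chart mentioned above, the core computation is picking out the (cost, accuracy) points not dominated by any other point. A small self-contained sketch, with purely illustrative numbers (real values come from the benchmark runs):

```python
def pareto_frontier(points):
    """Return the points not dominated by any other point.

    A point is dominated if some other point has lower-or-equal cost AND
    higher-or-equal accuracy, and is strictly better on at least one axis.
    """
    frontier = []
    for cost, acc, name in points:
        dominated = any(
            c2 <= cost and a2 >= acc and (c2 < cost or a2 > acc)
            for c2, a2, _ in points
        )
        if not dominated:
            frontier.append((cost, acc, name))
    return sorted(frontier)

# Illustrative numbers only -- not real benchmark results.
points = [
    (10.0, 0.92, "hybrid"),
    (25.0, 0.93, "linucb"),
    (30.0, 0.90, "ucb1"),    # dominated: linucb costs less and scores higher
    (5.0, 0.70, "random"),
]
print(pareto_frontier(points))
# [(5.0, 0.7, 'random'), (10.0, 0.92, 'hybrid'), (25.0, 0.93, 'linucb')]
```

Feeding these points into a scatter plot with the frontier connected gives the accuracy-vs-cost chart for the post.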
## Dependencies
- #41 Implement pluggable evaluation framework
- #42 Add GSM8K benchmark with exact-match evaluation
- #43 Add MMLU benchmark with exact-match evaluation
- #44 Add HumanEval benchmark with code execution
## Acceptance Criteria
- All three benchmarks complete successfully
- Results reproducible with documented commands
- Visualizations generated (Pareto frontier, learning curves)
- README updated with benchmark results
- HN post draft prepared