Labels: `difficulty:intermediate`, `documentation`, `priority:high`
## Summary
Execute the complete benchmark suite and prepare results for Hacker News launch.
## Benchmark Suite
| Dataset | Size | Evaluation | Est. Cost | Headline |
|---|---|---|---|---|
| GSM8K | 1,319 | Exact match | $100-150 | Math reasoning |
| MMLU | 1,000 | Exact match | $80-100 | Knowledge |
| HumanEval | 164 | Code execution | $20-30 | Coding |
| **Total** | **2,483** | | **$200-300** | |
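The totals row can be sanity-checked with a few lines of Python. Sizes and per-benchmark cost ranges are copied from the table above, not measured; note the upper bounds sum to $280, which the table rounds up to $300.

```python
# Sanity-check the benchmark suite totals from the table above.
# Sizes and cost ranges are taken from the table, not measured.
suite = {
    "GSM8K": {"size": 1319, "cost": (100, 150)},
    "MMLU": {"size": 1000, "cost": (80, 100)},
    "HumanEval": {"size": 164, "cost": (20, 30)},
}

total_items = sum(d["size"] for d in suite.values())
total_cost_low = sum(d["cost"][0] for d in suite.values())
total_cost_high = sum(d["cost"][1] for d in suite.values())

print(total_items)                       # 2483
print(total_cost_low, total_cost_high)   # 200 280 (table rounds high end to $300)
```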
## Commands

```bash
# GSM8K benchmark
uv run conduit-bench run \
  --dataset gsm8k \
  --algorithms hybrid,linucb,ucb1,thompson,epsilon,random \
  --evaluator exact_match \
  --output results/gsm8k.json

# MMLU benchmark
uv run conduit-bench run \
  --dataset mmlu \
  --mmlu-limit 1000 \
  --algorithms hybrid,linucb,ucb1,thompson,epsilon,random \
  --evaluator exact_match \
  --output results/mmlu.json

# HumanEval benchmark
uv run conduit-bench run \
  --dataset humaneval \
  --algorithms hybrid,linucb,ucb1,thompson,epsilon,random \
  --evaluator code_execution \
  --output results/humaneval.json
```
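Once the three runs finish, it may be worth spot-checking the result files before generating the combined report. A minimal sketch follows; the JSON schema assumed here (a top-level `dataset` key and per-algorithm `accuracy`/`total_cost` stats under `algorithms`) is a guess, not conduit-bench's documented output format, so adjust the keys to whatever the tool actually writes.

```python
import json
import os
import tempfile

def summarize(paths):
    """Flatten result files into (dataset, algorithm, accuracy, cost) rows.

    Assumes each file looks like:
      {"dataset": "...", "algorithms": {"<name>": {"accuracy": ..., "total_cost": ...}}}
    This schema is an assumption -- adapt to conduit-bench's real output.
    """
    rows = []
    for path in paths:
        with open(path) as f:
            data = json.load(f)
        for algo, stats in data["algorithms"].items():
            rows.append((data["dataset"], algo, stats["accuracy"], stats["total_cost"]))
    return rows

# Demo with a synthetic stand-in for results/gsm8k.json:
with tempfile.TemporaryDirectory() as d:
    p = os.path.join(d, "gsm8k.json")
    with open(p, "w") as f:
        json.dump({"dataset": "gsm8k",
                   "algorithms": {"hybrid": {"accuracy": 0.9, "total_cost": 12.5}}}, f)
    rows = summarize([p])

print(rows)  # [('gsm8k', 'hybrid', 0.9, 12.5)]
```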
```bash
# Generate combined report
uv run conduit-bench report \
  --results results/gsm8k.json results/mmlu.json results/humaneval.json \
  --output analysis/hn_launch_report.md
```

## Target Headlines
- GSM8K: "HybridRouter achieves X% accuracy at Y% the cost of GPT-4"
- MMLU: "Matches top-tier accuracy while cutting costs in half"
- HumanEval: "X% pass rate with intelligent model selection"
## HN Post Structure
- Lead with cost savings + quality retention
- Show Pareto frontier chart (accuracy vs cost)
- Link to reproducible benchmark suite
- Acknowledge limitations upfront
- Invite community to run benchmarks
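For the Pareto frontier chart mentioned above, the core computation is picking out the (cost, accuracy) points not dominated by any other point. A small self-contained sketch, with purely illustrative numbers (real values come from the benchmark runs):

```python
def pareto_frontier(points):
    """Return the points not dominated by any other point.

    A point is dominated if some other point has lower-or-equal cost AND
    higher-or-equal accuracy, and is strictly better on at least one axis.
    """
    frontier = []
    for cost, acc, name in points:
        dominated = any(
            c2 <= cost and a2 >= acc and (c2 < cost or a2 > acc)
            for c2, a2, _ in points
        )
        if not dominated:
            frontier.append((cost, acc, name))
    return sorted(frontier)

# Illustrative numbers only -- not real benchmark results.
points = [
    (10.0, 0.92, "hybrid"),
    (25.0, 0.93, "linucb"),
    (30.0, 0.90, "ucb1"),    # dominated: linucb costs less and scores higher
    (5.0, 0.70, "random"),
]
print(pareto_frontier(points))
# [(5.0, 0.7, 'random'), (10.0, 0.92, 'hybrid'), (25.0, 0.93, 'linucb')]
```

Feeding these points into a scatter plot with the frontier connected gives the accuracy-vs-cost chart for the post.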
## Dependencies
- #41 Implement pluggable evaluation framework
- #42 Add GSM8K benchmark with exact-match evaluation
- #43 Add MMLU benchmark with exact-match evaluation
- #44 Add HumanEval benchmark with code execution
## Acceptance Criteria
- All three benchmarks complete successfully
- Results reproducible with documented commands
- Visualizations generated (Pareto frontier, learning curves)
- README updated with benchmark results
- HN post draft prepared