HN Launch: Run multi-benchmark suite and publish results #45

@evanvolgas

Summary

Execute the complete benchmark suite and prepare results for Hacker News launch.

Benchmark Suite

| Dataset | Size | Evaluation | Est. Cost | Headline |
| --- | --- | --- | --- | --- |
| GSM8K | 1,319 | Exact match | $100-150 | Math reasoning |
| MMLU | 1,000 | Exact match | $80-100 | Knowledge |
| HumanEval | 164 | Code execution | $20-30 | Coding |
| Total | 2,483 | | $200-300 | |
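
As a quick sanity check on the table, the totals can be recomputed from the per-benchmark rows. The sketch below uses only the figures listed above; the variable names are illustrative. Summing the rows gives 2,483 examples and a cost range of $200-280, so the stated $200-300 total appears to leave about $20 of headroom at the upper end.

```python
# Per-benchmark figures copied from the table above: (size, low $, high $).
benchmarks = {
    "GSM8K": (1319, 100, 150),
    "MMLU": (1000, 80, 100),
    "HumanEval": (164, 20, 30),
}

# Sum each column across the three benchmarks.
total_size = sum(size for size, _, _ in benchmarks.values())
total_low = sum(low for _, low, _ in benchmarks.values())
total_high = sum(high for _, _, high in benchmarks.values())

print(f"Total examples: {total_size:,}")
print(f"Estimated cost: ${total_low}-{total_high}")
```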

Commands

# GSM8K Benchmark
uv run conduit-bench run \
  --dataset gsm8k \
  --algorithms hybrid,linucb,ucb1,thompson,epsilon,random \
  --evaluator exact_match \
  --output results/gsm8k.json

# MMLU Benchmark  
uv run conduit-bench run \
  --dataset mmlu \
  --mmlu-limit 1000 \
  --algorithms hybrid,linucb,ucb1,thompson,epsilon,random \
  --evaluator exact_match \
  --output results/mmlu.json

# HumanEval Benchmark
uv run conduit-bench run \
  --dataset humaneval \
  --algorithms hybrid,linucb,ucb1,thompson,epsilon,random \
  --evaluator code_execution \
  --output results/humaneval.json

# Generate combined report
uv run conduit-bench report \
  --results results/gsm8k.json results/mmlu.json results/humaneval.json \
  --output analysis/hn_launch_report.md
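
To keep the four invocations consistent, the commands above could be generated from one small table. This is a hypothetical wrapper, not part of conduit-bench: the flags and paths are copied verbatim from the commands above, while the function names and the dry-run behavior are assumptions made for illustration.

```python
import shlex

ALGORITHMS = "hybrid,linucb,ucb1,thompson,epsilon,random"

# (dataset, evaluator, extra flags), copied from the commands above.
RUNS = [
    ("gsm8k", "exact_match", []),
    ("mmlu", "exact_match", ["--mmlu-limit", "1000"]),
    ("humaneval", "code_execution", []),
]


def build_run_command(dataset, evaluator, extra):
    """Return the argv list for one benchmark run."""
    return [
        "uv", "run", "conduit-bench", "run",
        "--dataset", dataset,
        *extra,
        "--algorithms", ALGORITHMS,
        "--evaluator", evaluator,
        "--output", f"results/{dataset}.json",
    ]


def build_report_command():
    """Return the argv list for the combined report."""
    results = [f"results/{dataset}.json" for dataset, _, _ in RUNS]
    return [
        "uv", "run", "conduit-bench", "report",
        "--results", *results,
        "--output", "analysis/hn_launch_report.md",
    ]


commands = [build_run_command(*run) for run in RUNS] + [build_report_command()]

# Dry run: only print the commands. To execute them, each list could be
# passed to subprocess.run(cmd, check=True) instead.
for cmd in commands:
    print(shlex.join(cmd))
```

Printing first makes it easy to compare the generated commands against the documented ones before spending any of the estimated budget.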

Target Headlines

  1. GSM8K: "HybridRouter achieves X% accuracy at Y% of the cost of GPT-4"
  2. MMLU: "Matches top-tier accuracy while cutting costs in half"
  3. HumanEval: "X% pass rate with intelligent model selection"
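
Once results exist, the X and Y placeholders in the first headline could be filled in mechanically. The following is a minimal sketch: the `headline` function and its parameters are hypothetical, and the inputs in the example call are placeholders, not benchmark results.

```python
def headline(router_accuracy, router_cost, baseline_cost):
    """Format the GSM8K-style headline from measured values.

    router_accuracy is a fraction in [0, 1]; both costs must use the same unit.
    """
    if baseline_cost <= 0:
        raise ValueError("baseline_cost must be positive")
    # Router cost expressed as a percentage of the baseline cost.
    cost_pct = 100 * router_cost / baseline_cost
    return (
        f"HybridRouter achieves {router_accuracy:.1%} accuracy "
        f"at {cost_pct:.0f}% of the cost of GPT-4"
    )


# Placeholder inputs for illustration only; these are NOT benchmark results.
print(headline(router_accuracy=0.9, router_cost=50.0, baseline_cost=100.0))
```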

HN Post Structure

  • Lead with cost savings + quality retention
  • Show Pareto frontier chart (accuracy vs cost)
  • Link to reproducible benchmark suite
  • Acknowledge limitations upfront
  • Invite community to run benchmarks
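
The Pareto frontier mentioned above is the set of configurations that no other configuration beats on both cost and accuracy. A minimal sketch of that selection follows; the tuple layout and the sample points are assumptions for illustration and do not reflect the conduit-bench output format or any real results.

```python
def pareto_frontier(points):
    """Return the points not dominated by any other point.

    Each point is (name, cost, accuracy). A point is dominated when another
    point costs no more and is at least as accurate, with one strict
    inequality. Exact duplicates are kept once.
    """
    frontier = []
    best_accuracy = float("-inf")
    # Sort by cost ascending; break cost ties by accuracy descending.
    for name, cost, accuracy in sorted(points, key=lambda p: (p[1], -p[2])):
        # Keep a point only if it beats every cheaper-or-equal point on accuracy.
        if accuracy > best_accuracy:
            frontier.append((name, cost, accuracy))
            best_accuracy = accuracy
    return frontier


# Placeholder points for illustration only; these are NOT benchmark results.
sample = [("A", 1.0, 0.6), ("B", 2.0, 0.8), ("C", 3.0, 0.7), ("D", 4.0, 0.9)]
for name, cost, accuracy in pareto_frontier(sample):
    print(name, cost, accuracy)
```

In the sample, C is dropped because B is both cheaper and more accurate; the remaining points are what an accuracy-versus-cost chart would connect.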

Dependencies

Acceptance Criteria

  • All three benchmarks complete successfully
  • Results reproducible with documented commands
  • Visualizations generated (Pareto frontier, learning curves)
  • README updated with benchmark results
  • HN post draft prepared
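
Part of the first criterion can be checked automatically by confirming that every documented output file exists and that the result files parse as JSON. This is a hedged sketch: the paths come from the commands in this issue, but the `check_outputs` helper is hypothetical and it makes no assumption about the schema inside the result files.

```python
import json
from pathlib import Path

# Output paths taken from the documented commands.
EXPECTED_OUTPUTS = [
    "results/gsm8k.json",
    "results/mmlu.json",
    "results/humaneval.json",
    "analysis/hn_launch_report.md",
]


def check_outputs(root="."):
    """Return a list of problems; an empty list means every artifact is present."""
    problems = []
    for rel in EXPECTED_OUTPUTS:
        path = Path(root) / rel
        if not path.is_file():
            problems.append(f"missing: {rel}")
        elif path.suffix == ".json":
            # Only verify that the file is well-formed JSON, not its contents.
            try:
                json.loads(path.read_text(encoding="utf-8"))
            except json.JSONDecodeError as exc:
                problems.append(f"invalid JSON in {rel}: {exc}")
    return problems


if __name__ == "__main__":
    for problem in check_outputs():
        print(problem)
```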

Labels

  • difficulty:intermediate (Intermediate difficulty - requires domain knowledge)
  • documentation (Improvements or additions to documentation)
  • priority:high (High priority - blocking or critical)
