
# Magus Bench

Benchmarks and an autonomous experiment platform for the Magus plugin ecosystem.

## Experiment platform

The experiment loop proposes improvements, tests them in isolated git worktrees, and merges what works. Each experiment is a TypeScript plugin — add a new one by writing a single file.

```sh
bun loop/loop.ts --runs 5            # run 5 iterations
bun loop/loop.ts --runs 1 --dry-run  # preview without API calls
touch loop/STOP                      # stop after current phase
```

See `loop/README.md` for the full guide, including how to write new experiment plugins.
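The actual plugin interface is defined in the loop source; as a rough, hypothetical sketch of what a single-file experiment plugin could look like (the interface and method names below are assumptions for illustration, not the repo's real API):

```typescript
// Hypothetical sketch of an experiment plugin's shape. The real contract
// lives in loop/ — these names are illustrative assumptions.
interface ExperimentResult {
  improved: boolean; // did the candidate beat the baseline?
  summary: string;   // one-line note for the loop's report
}

interface Experiment {
  name: string;
  // propose a change, then evaluate it inside an isolated git worktree
  propose(): Promise<string>;
  evaluate(worktreePath: string): Promise<ExperimentResult>;
}

const demo: Experiment = {
  name: "demo",
  propose: async () => "tighten the rubric wording",
  evaluate: async (worktreePath) => ({
    improved: true,
    summary: `candidate in ${worktreePath} beat baseline`,
  }),
};

const result = await demo.evaluate("/tmp/example-worktree");
console.log(result.summary);
```

The loop can then discover such files, call `propose`, run `evaluate` in a fresh worktree, and merge only when `improved` is true.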

## Experiments

| Experiment | Plugin | Description |
| --- | --- | --- |
| tech-writer-quality | `experiment.ts` | Improve documentation quality via prompt/rubric iteration. 4-way blind comparison, 7-model judge panel, Borda + Friedman statistics. |
| agent-routing | `experiment.ts` | Improve Claude Code skill/agent routing correctness. Promptfoo benchmark, 22 test cases across 11 routing categories. |
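To make the Borda scoring concrete (the actual statistics live in the experiment's source; the judge rankings below are made up for illustration), a minimal sketch:

```typescript
// Minimal Borda-count sketch: each judge submits a ranking (best first),
// and a candidate earns (n - 1 - position) points per judge.
function bordaScores(rankings: string[][]): Map<string, number> {
  const scores = new Map<string, number>();
  for (const ranking of rankings) {
    const n = ranking.length;
    ranking.forEach((candidate, i) => {
      scores.set(candidate, (scores.get(candidate) ?? 0) + (n - 1 - i));
    });
  }
  return scores;
}

// Three hypothetical judges ranking the four doc variants
const judges = [
  ["human", "anti-slop", "bare", "flash"],
  ["anti-slop", "human", "flash", "bare"],
  ["human", "anti-slop", "flash", "bare"],
];
console.log(bordaScores(judges).get("human")); // → 8 (3 + 2 + 3)
```

Summing rank points this way gives a single ordering across judges; the Friedman test then checks whether the judges' rank differences are statistically significant rather than noise.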

## Eval harnesses

Each eval harness runs independently and is also callable by the experiment loop.

| Eval | Entry point | Description |
| --- | --- | --- |
| tech-writer-eval | `run.sh` | 4-way blind doc-quality comparison: human reference vs. bare Claude vs. Claude + anti-slop vs. Gemini Flash, judged by a 7-model panel. |
| skill-routing-eval | `promptfooconfig.yaml` | Promptfoo benchmark testing Skill-tool vs. Task-tool disambiguation, routing-table honoring, and spelling correctness. |
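For readers unfamiliar with promptfoo, a test case in a `promptfooconfig.yaml` generally pairs prompt variables with assertions. The fragment below is a hypothetical sketch in that style; the prompt, provider, and values are assumptions, not copied from the repo's config:

```yaml
# Hypothetical promptfoo test-case sketch (not the repo's actual config)
prompts:
  - "User request: {{request}}. Which tool should handle this?"
providers:
  - anthropic:messages:claude-3-5-sonnet-20241022
tests:
  - vars:
      request: "refactor the auth module"
    assert:
      - type: icontains
        value: "Task"
```

Each test case becomes one row in promptfoo's results matrix, which is what makes the 22-case routing benchmark easy to eyeball.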

## Prerequisites

- Claude Code CLI (`claude`)
- claudish CLI (`npm install -g claudish`)
- Bun runtime (for TypeScript)
- OpenRouter API key (for external model judges)

## Running an eval directly

```sh
cd tech-writer-eval
./run.sh              # full run: generate → judge → analyze
./run.sh --dry-run    # preview without API calls
```

Results land in `results/run-YYYYMMDD-HHMMSS/` with a markdown report.
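Since each run gets a fresh timestamped directory, finding the latest report is a one-liner (the `run-*` pattern comes from the naming above; whether a results directory exists yet depends on your runs):

```sh
# Print the newest results directory, or "none yet" before the first run
latest=$(ls -dt results/run-* 2>/dev/null | head -1)
echo "latest run: ${latest:-none yet}"
```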

## License

MIT
