Magus Bench

Benchmarks and an autonomous experiment platform for the Magus plugin ecosystem.

Experiment platform

The experiment loop proposes improvements, tests them in isolated git worktrees, and merges what works. Each experiment is a TypeScript plugin; to add a new one, write a single file.

```
bun loop/loop.ts --runs 5            # run 5 iterations
bun loop/loop.ts --runs 1 --dry-run  # preview without API calls
touch loop/STOP                      # stop after current phase
```

See loop/README.md for the full guide, including how to write new experiment plugins.
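The propose, test, and merge cycle described above can be sketched roughly as follows. This is only an illustration: `ExperimentPlugin`, `runLoop`, `propose`, `evaluate`, and `shouldStop` are hypothetical names, not the interface defined in loop/, and the sketch leaves out git worktrees entirely. It assumes a higher score is better and that a change is merged only when it beats the best score seen so far.

```typescript
// Hypothetical plugin shape for illustration; the real interface is
// documented in loop/README.md and may differ.
interface ExperimentPlugin {
  name: string;
  propose(): Promise<string>;                 // describe a candidate change
  evaluate(change: string): Promise<number>;  // score it (assumed: higher is better)
}

interface IterationResult {
  change: string;
  score: number;
  merged: boolean;
}

// Runs up to `runs` iterations. `shouldStop` is checked before each
// iteration, standing in for a stop-file check between phases.
async function runLoop(
  plugin: ExperimentPlugin,
  runs: number,
  baseline: number,
  shouldStop: () => boolean = () => false,
): Promise<IterationResult[]> {
  const results: IterationResult[] = [];
  let best = baseline;
  for (let i = 0; i < runs; i++) {
    if (shouldStop()) break;
    const change = await plugin.propose();
    const score = await plugin.evaluate(change);
    const merged = score > best; // keep only changes that beat the current best
    if (merged) best = score;
    results.push({ change, score, merged });
  }
  return results;
}
```

In a sketch like this, the `touch loop/STOP` behaviour could be approximated by passing `() => existsSync("loop/STOP")` (from `node:fs`) as `shouldStop`.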

Experiments

| Experiment | Plugin | Description |
| --- | --- | --- |
| tech-writer-quality | `experiment.ts` | Improve documentation quality via prompt/rubric iteration. 4-way blind comparison, 7-model judge panel, Borda + Friedman statistics. |
| agent-routing | `experiment.ts` | Improve Claude Code skill/agent routing correctness. Promptfoo benchmark, 22 test cases across 11 routing categories. |
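To make the "Borda + Friedman" step concrete, here is a minimal sketch of both statistics over a set of judge rankings. It uses the textbook formulas and assumes every judge ranks the same candidates with no ties; the actual analysis in the repository may handle ties or weighting differently. The names `Ranking`, `bordaScores`, and `friedmanStatistic` are hypothetical, and the example rankings are made-up placeholders, not results.

```typescript
// One judge's ranking: candidate ids ordered best first (assumed: no ties).
type Ranking = string[];

// Borda count: with k candidates, first place earns k-1 points, last earns 0.
function bordaScores(rankings: Ranking[]): Map<string, number> {
  const scores = new Map<string, number>();
  for (const ranking of rankings) {
    const k = ranking.length;
    ranking.forEach((candidate, position) => {
      scores.set(candidate, (scores.get(candidate) ?? 0) + (k - 1 - position));
    });
  }
  return scores;
}

// Friedman chi-square statistic for n judges ranking k candidates:
//   12 / (n k (k+1)) * sum(R_j^2) - 3 n (k+1), where R_j is candidate j's rank sum.
function friedmanStatistic(rankings: Ranking[]): number {
  const n = rankings.length;
  const k = rankings[0]?.length ?? 0;
  if (n === 0 || k < 2) throw new Error("need at least 1 judge and 2 candidates");
  const rankSums = new Map<string, number>();
  for (const ranking of rankings) {
    ranking.forEach((candidate, position) => {
      rankSums.set(candidate, (rankSums.get(candidate) ?? 0) + position + 1);
    });
  }
  let sumOfSquares = 0;
  for (const r of rankSums.values()) sumOfSquares += r * r;
  return (12 * sumOfSquares) / (n * k * (k + 1)) - 3 * n * (k + 1);
}

// Placeholder data: three judges ranking four anonymised candidates.
const exampleRankings: Ranking[] = [
  ["A", "B", "C", "D"],
  ["B", "A", "C", "D"],
  ["A", "C", "B", "D"],
];
console.log(Object.fromEntries(bordaScores(exampleRankings))); // { A: 8, B: 6, C: 4, D: 0 }
console.log(friedmanStatistic(exampleRankings)); // 7
```

The Borda totals give an overall ordering, while the Friedman statistic indicates whether the judges' rankings differ more than chance would suggest; it is compared against a chi-square distribution with k - 1 degrees of freedom.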

Eval harnesses

Each eval harness runs independently and is also callable by the experiment loop.

| Eval | Entry point | Description |
| --- | --- | --- |
| tech-writer-eval | `run.sh` | 4-way blind doc quality comparison: human reference vs bare Claude vs Claude+anti-slop vs Gemini Flash, judged by 7-model panel |
| skill-routing-eval | `promptfooconfig.yaml` | Promptfoo benchmark testing Skill-tool vs Task-tool disambiguation, routing-table honoring, and spelling correctness |

Prerequisites

Claude Code CLI (claude)
claudish CLI (npm install -g claudish)
Bun runtime (for TypeScript)
OpenRouter API key (for external model judges)

Running an eval directly

```
cd tech-writer-eval
./run.sh              # full run: generate → judge → analyze
./run.sh --dry-run    # preview without API calls
```

Results land in results/run-YYYYMMDD-HHMMSS/ with a markdown report.
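Because the run directory name embeds a fixed-width timestamp, names sort chronologically as plain strings. The sketch below shows how such a name could be built and how the latest run could be picked from a list. `runDirName` and `latestRun` are hypothetical helpers, and using UTC is an assumption made here for reproducibility; `run.sh` may well use local time.

```typescript
// Builds a path in the results/run-YYYYMMDD-HHMMSS form described above.
// Assumption: UTC timestamps; the actual script may use local time.
function runDirName(date: Date): string {
  const pad = (n: number): string => String(n).padStart(2, "0");
  const ymd =
    `${date.getUTCFullYear()}${pad(date.getUTCMonth() + 1)}${pad(date.getUTCDate())}`;
  const hms =
    `${pad(date.getUTCHours())}${pad(date.getUTCMinutes())}${pad(date.getUTCSeconds())}`;
  return `results/run-${ymd}-${hms}`;
}

// Fixed-width timestamps sort lexicographically in chronological order,
// so the latest run is simply the last name after a plain string sort.
function latestRun(dirNames: string[]): string | undefined {
  return [...dirNames].sort().at(-1);
}

console.log(runDirName(new Date(Date.UTC(2025, 0, 5, 9, 3, 7))));
// results/run-20250105-090307
```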

License

MIT
