CLI benchmark tool for AI agents. Uses Vercel AI SDK 5 for model calls.
Install globally from this repo:

```sh
npm install -g .
```

Run immediately:

```sh
agent-bench run
agent-bench ui
```

No manual SQLite work is needed. The CLI auto-creates and auto-migrates the local DB.
If you want live LLM judging through Vercel AI Gateway, create a `.env` from the sample and set your key:

```sh
cp .env.sample .env
```

```powershell
# PowerShell
$env:AI_GATEWAY_API_KEY="your_key"
```

Without a key, `agent-bench run` still works using local fallback judging, so you can start immediately.
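On macOS/Linux shells, the same key can be set with a plain `export` (the variable name `AI_GATEWAY_API_KEY` comes from this README; the value is a placeholder):

```sh
# bash/zsh equivalent of the PowerShell line above
export AI_GATEWAY_API_KEY="your_key"
```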
```sh
agent-bench run
agent-bench ui
```

Useful options:

```sh
agent-bench run --agent ./agents/coder-v1.md --benchmark core-engineering --task fix-react-bug --model openai/gpt-4.1-mini
agent-bench ui --port 4173
```

Deterministic behavior:
- `LLM_JUDGE_RESPONSE_CACHE=true` (default) keeps repeated identical runs stable.
- With an AI Gateway key set, the first run calls the LLM and then reuses cached judge output for identical inputs.
- Without an AI Gateway key, the local deterministic fallback judge is used.
- Optional custom judge system prompt via `LLM_JUDGE_SYSTEM_PROMPT`.
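Putting the variables above together, a run with caching and a custom judge prompt might look like this (a sketch: the prompt text is a hypothetical example, and the final command is commented out since it requires the CLI to be installed):

```sh
# Variable names come from this README; values are illustrative.
export LLM_JUDGE_RESPONSE_CACHE=true   # default: reuse cached judge output for identical inputs
export LLM_JUDGE_SYSTEM_PROMPT="Score strictly against the Expected Outcome."  # optional override
# agent-bench run
```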
Defaults:

- DB path: `$HOME/.agent-bench/data.db`
- Artifact path: `$HOME/.agent-bench/artifacts/`
- Judge model: `openai/gpt-4.1-mini`
- Benchmarks folder: `./benchmarks`
Concepts:
- Benchmark: a suite that groups related tasks.
- Task: one specific natural-language challenge, with an expected outcome, inside a benchmark.
Benchmark suite structure:
```
benchmarks/
  <benchmark-key>/
    benchmark.md
    tasks/
      <task-key>.md
```

benchmark.md:
```md
# <Benchmark Title>
Key: <benchmark-key>

## Description
<what this benchmark suite covers>
```

Task file format (`tasks/<task-key>.md`):
```md
# <Task Title>
Key: <task-key>

## Task
<natural-language task description for this task>

## Expected Outcome
<clear expected result>
```
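As a concrete illustration of the layout above, here is a shell sketch that scaffolds a minimal suite (`my-suite` and `hello-task` are hypothetical names, not benchmarks shipped with this repo):

```sh
# Create the folder layout described above
mkdir -p benchmarks/my-suite/tasks

# benchmark.md: suite title, key, and description
cat > benchmarks/my-suite/benchmark.md <<'EOF'
# My Suite
Key: my-suite

## Description
A minimal example suite.
EOF

# One task file under tasks/, following the task format
cat > benchmarks/my-suite/tasks/hello-task.md <<'EOF'
# Hello Task
Key: hello-task

## Task
Write a function that returns the string "hello".

## Expected Outcome
A function whose return value is exactly "hello".
EOF
```

The suite could then be targeted with `agent-bench run --benchmark my-suite --task hello-task`, using the flags shown earlier.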