Benchmarking harness for evaluating AI agents. Extracted from ironclaw.
| Suite | Description |
|---|---|
| trajectory | Multi-turn trajectory scenarios with per-turn assertions (supersedes spot) |
| spot | End-to-end spot checks: conversation, tool use, chaining, robustness |
| custom | Custom JSONL tasks with flexible scoring (exact, contains, regex, LLM) |
| gaia | GAIA benchmark (knowledge and reasoning) |
| tau_bench | Tau-bench (multi-turn tool-calling dialog) |
| swe_bench | SWE-bench Pro (real-world software engineering) |
```sh
# 1. Configure your LLM provider (pick one)
cp .env.example .env
# Edit .env with your API key (OPENAI_API_KEY, ANTHROPIC_API_KEY, or LLM_* vars)

# 2. List available suites
nearai-bench list

# 3. Run trajectory scenarios
nearai-bench run --suite trajectory --config suites/trajectory.toml

# Run with a specific model
nearai-bench run --suite trajectory --config suites/trajectory.toml --model gpt-4o

# View latest results
nearai-bench results latest

# Compare two runs
nearai-bench compare <baseline-uuid> <comparison-uuid>
```

Copy `.env.example` to `.env` and set your provider credentials. The harness supports any OpenAI-compatible API endpoint.
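Since any OpenAI-compatible endpoint works, a provider is fully described by a base URL, an API key, and a model name. As a rough illustration (not the harness's actual code — the helper name and defaults here are assumptions), a chat request could be assembled from the `LLM_*` variables like this:

```python
import json
import os

def build_chat_request(prompt: str) -> tuple[str, dict, bytes]:
    """Assemble an OpenAI-compatible chat request from LLM_* env vars.

    Illustrative sketch only; the harness's real request logic may differ.
    """
    base = os.environ.get("LLM_BASE_URL", "https://api.openai.com/v1").rstrip("/")
    url = f"{base}/chat/completions"
    headers = {
        "Authorization": f"Bearer {os.environ.get('LLM_API_KEY', '')}",
        "Content-Type": "application/json",
    }
    body = json.dumps({
        "model": os.environ.get("LLM_MODEL", "gpt-4o"),
        "messages": [{"role": "user", "content": prompt}],
    }).encode()
    return url, headers, body

# With the OpenRouter example values from below:
os.environ.setdefault("LLM_BASE_URL", "https://openrouter.ai/api/v1")
url, headers, body = build_chat_request("ping")
print(url)
```

The point is that swapping providers only changes environment variables, never harness code.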
OpenAI (simplest):

```
OPENAI_API_KEY=sk-...
```

Anthropic:

```
ANTHROPIC_API_KEY=sk-ant-...
```

Any OpenAI-compatible provider (OpenRouter, Together, vLLM, Ollama, etc.):

```
LLM_BACKEND=openai_compatible
LLM_BASE_URL=https://openrouter.ai/api/v1
LLM_API_KEY=sk-or-...
LLM_MODEL=anthropic/claude-sonnet-4
```

NEAR AI (requires ironclaw onboarding):

```
LLM_BACKEND=nearai
```

```
benchmarks/
  datasets/              Versioned benchmark datasets
    spot/v1/             21 spot-check tasks
    swe-bench-lite/v1/   SWE-bench Lite dataset (astropy subset)
  suites/                Suite configuration files (TOML)
  baselines/             Curated reference results by suite
  results/               Run output, namespaced by harness
    ironclaw/            Results from the ironclaw harness
  src/                   Harness source code
    adapters/            Suite adapter implementations
```
Datasets live under `datasets/{suite-name}/v{N}/tasks.jsonl`. The versioning scheme lets
datasets evolve without invalidating older results that reference a prior version.
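One way the `v{N}` layout can be resolved programmatically, sketched with the stdlib (the `latest_dataset` helper is hypothetical, not part of the harness):

```python
import re
import tempfile
from pathlib import Path

def latest_dataset(root: Path, name: str) -> Path:
    """Return tasks.jsonl from the highest v{N} directory (numeric, so v10 > v2)."""
    versions = [
        (int(m.group(1)), d)
        for d in (root / name).iterdir()
        if d.is_dir() and (m := re.fullmatch(r"v(\d+)", d.name))
    ]
    return max(versions)[1] / "tasks.jsonl"

# Demo against a throwaway layout mirroring datasets/{name}/v{N}/tasks.jsonl.
root = Path(tempfile.mkdtemp())
for v in ("v1", "v2", "v10"):
    (root / "spot" / v).mkdir(parents=True)
    (root / "spot" / v / "tasks.jsonl").touch()
print(latest_dataset(root, "spot").parent.name)  # v10
```

Comparing the numeric component (not the directory name) is what makes `v10` sort after `v2`.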
- Create `datasets/{name}/v1/tasks.jsonl` in the appropriate JSONL format.
- Create `suites/{name}.toml` pointing `suite_config.dataset_path` at the new file.
- If the suite type doesn't exist, implement a new adapter in `src/adapters/`.
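For a custom suite, each line of `tasks.jsonl` is one self-contained task. The field names below (`id`, `prompt`, `expected`, `scoring`) are hypothetical, chosen only to illustrate the JSONL shape — check the custom adapter in `src/adapters/` for the real schema:

```python
import json
import tempfile
from pathlib import Path

# Hypothetical task records: field names are illustrative, not the real schema.
tasks = [
    {"id": "greet-1", "prompt": "Say hello.", "expected": "hello", "scoring": "contains"},
    {"id": "date-1", "prompt": "What year is it?", "expected": r"\d{4}", "scoring": "regex"},
]

path = Path(tempfile.mkdtemp()) / "tasks.jsonl"
path.write_text("".join(json.dumps(t) + "\n" for t in tasks))

# Each line parses independently -- the property JSONL datasets rely on.
loaded = [json.loads(line) for line in path.read_text().splitlines()]
print(len(loaded), loaded[0]["scoring"])  # 2 contains
```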
Results are written to `results/{harness}/{run-uuid}/` containing:

- `run.json`: aggregate metrics (pass rate, cost, timing, model, harness)
- `tasks.jsonl`: per-task results with scores, traces, and responses
The `harness` field in `run.json` identifies which agent implementation produced the results,
allowing multiple harnesses to share the same results directory structure.
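Reading a run directory back is straightforward; the sketch below fabricates one in a temp directory. The per-task `score` field and the pass threshold (score ≥ 1.0) are assumptions for illustration, not the harness's documented convention:

```python
import json
import tempfile
from pathlib import Path

# Fabricated run directory mirroring results/{harness}/{run-uuid}/.
run_dir = Path(tempfile.mkdtemp()) / "ironclaw" / "0000-demo"
run_dir.mkdir(parents=True)
(run_dir / "run.json").write_text(json.dumps({"harness": "ironclaw", "model": "gpt-4o"}))
(run_dir / "tasks.jsonl").write_text(
    "\n".join(json.dumps({"id": i, "score": s}) for i, s in enumerate([1.0, 0.0, 1.0]))
)

meta = json.loads((run_dir / "run.json").read_text())
per_task = [json.loads(l) for l in (run_dir / "tasks.jsonl").read_text().splitlines()]
# Assumed convention: a task passes when its score reaches 1.0.
pass_rate = sum(t["score"] >= 1.0 for t in per_task) / len(per_task)
print(meta["harness"], round(pass_rate, 2))  # ironclaw 0.67
```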
Suite configs are TOML files with this structure:

```toml
task_timeout = "120s"
parallelism = 1

[[matrix]]
label = "default"
# model = "openai/gpt-4o"  # optional model override

[suite_config]
dataset_path = "datasets/spot/v1/tasks.jsonl"
```

MIT OR Apache-2.0