Benchmarks

This document defines the reproducible evaluation entry points for Open Agent SDK.

Principles

Reproducible first: all benchmark runs should be executable from scripts in this repository.
Artifacts first: each run should emit machine-readable outputs (reports, logs, trajectories).
Comparable metrics: report at least success/pass, latency, and cost when available.

SWE-bench Lite

Entry docs:

benchmark/swebench/README.md

Primary scripts:

benchmark/swebench/scripts/run_oas_smoke_one.sh
benchmark/swebench/scripts/run_oas_smoke_batch.sh
benchmark/swebench/scripts/run_oas_overnight.sh
benchmark/swebench/scripts/summarize_reports.py

Quick run:

cd benchmark/swebench
SWEBENCH_SMOKE_COUNT=5 OAS_MAX_TURNS=12 ./scripts/run_oas_smoke_batch.sh
python ./scripts/summarize_reports.py --reports-dir ./outputs/reports --limit 20

Artifacts:

benchmark/swebench/outputs/predictions/
benchmark/swebench/outputs/reports/
benchmark/swebench/outputs/trajectories/
benchmark/swebench/outputs/logs/<instance_id>/open-agent-transcript/

Terminal-bench

Entry docs:

benchmark/terminalbench/README.md
benchmark/terminalbench/open_agent_sdk_harbor/README.md

Primary scripts:

benchmark/terminalbench/scripts/run-terminalbench-overnight.sh
benchmark/terminalbench/scripts/cleanup-terminalbench-images.sh

Quick run:

env -u https_proxy -u http_proxy -u all_proxy \
    -u HTTPS_PROXY -u HTTP_PROXY -u ALL_PROXY \
harbor run -d terminal-bench@2.0 \
  --env docker \
  --agent-import-path "harbor.agents.installed.open_agent_sdk:OpenAgentSDKAgent" \
  --model MiniMax-M2.5 \
  --n-concurrent 1

Artifacts:

jobs/<run>/result.json
jobs/<run>/<trial>/result.json
jobs/<run>/<trial>/agent/open-agent-transcript/

Reporting Template

When publishing benchmark updates, include:

Run date (YYYY-MM-DD)
Model and provider
Dataset split / task list
Pass rate or success rate
Median latency (if available)
Cost estimate (if available)
Link to raw artifacts in this repo

Capability Snapshot (Open Agent SDK)

Capability	Status	Evidence
Tool permission gating	Available	`canUseTool` + mode system in SDK/examples
Hooks lifecycle	Available	`docs/api-reference.md` hooks section
Subagent/task orchestration	Available	`docs/api-reference.md` task tools
SWE-bench harness	Available	`benchmark/swebench/`
Terminal-bench harness	Available	`benchmark/terminalbench/`
Session persistence/resume	Available	`examples/code-agent/`

Note: external framework comparisons should only be added with direct links to upstream docs and a dated snapshot.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Benchmarks

Principles

SWE-bench Lite

Terminal-bench

Reporting Template

Capability Snapshot (Open Agent SDK)

FilesExpand file tree

BENCHMARKS.md

Latest commit

History

BENCHMARKS.md

File metadata and controls

Benchmarks

Principles

SWE-bench Lite

Terminal-bench

Reporting Template

Capability Snapshot (Open Agent SDK)