This document defines the reproducible evaluation entry points for Open Agent SDK.
- Reproducible first: all benchmark runs should be executable from scripts in this repository.
- Artifacts first: each run should emit machine-readable outputs (reports, logs, trajectories).
- Comparable metrics: report at least success/pass, latency, and cost when available.
Entry docs:
- benchmark/swebench/README.md
Primary scripts:
- benchmark/swebench/scripts/run_oas_smoke_one.sh
- benchmark/swebench/scripts/run_oas_smoke_batch.sh
- benchmark/swebench/scripts/run_oas_overnight.sh
- benchmark/swebench/scripts/summarize_reports.py
Quick run:
cd benchmark/swebench
SWEBENCH_SMOKE_COUNT=5 OAS_MAX_TURNS=12 ./scripts/run_oas_smoke_batch.sh
python ./scripts/summarize_reports.py --reports-dir ./outputs/reports --limit 20
Artifacts:
- benchmark/swebench/outputs/predictions/
- benchmark/swebench/outputs/reports/
- benchmark/swebench/outputs/trajectories/
- benchmark/swebench/outputs/logs/<instance_id>/open-agent-transcript/
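
After a run, it can help to confirm that each artifact directory was actually populated. The following is a minimal sketch, not part of the repository: `inventory_artifacts` is a hypothetical helper, and it assumes only the directory paths listed above, counting files without inspecting their contents.

```python
from pathlib import Path

# Artifact directories listed above, relative to the repository root.
ARTIFACT_DIRS = (
    "benchmark/swebench/outputs/predictions",
    "benchmark/swebench/outputs/reports",
    "benchmark/swebench/outputs/trajectories",
    "benchmark/swebench/outputs/logs",
)


def inventory_artifacts(repo_root: str = ".") -> dict:
    """Count regular files under each artifact directory (0 if it is missing)."""
    root = Path(repo_root)
    counts = {}
    for rel in ARTIFACT_DIRS:
        directory = root / rel
        if directory.is_dir():
            # rglob walks subdirectories too, e.g. logs/<instance_id>/...
            counts[rel] = sum(1 for p in directory.rglob("*") if p.is_file())
        else:
            counts[rel] = 0
    return counts


if __name__ == "__main__":
    for rel, n in inventory_artifacts().items():
        print(f"{n:6d}  {rel}")
```

A count of zero for a directory after a supposedly successful run is a quick signal that the run did not emit the expected artifacts.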
Entry docs:
- benchmark/terminalbench/README.md
- benchmark/terminalbench/open_agent_sdk_harbor/README.md
Primary scripts:
- benchmark/terminalbench/scripts/run-terminalbench-overnight.sh
- benchmark/terminalbench/scripts/cleanup-terminalbench-images.sh
Quick run:
env -u https_proxy -u http_proxy -u all_proxy \
-u HTTPS_PROXY -u HTTP_PROXY -u ALL_PROXY \
harbor run -d terminal-bench@2.0 \
--env docker \
--agent-import-path "harbor.agents.installed.open_agent_sdk:OpenAgentSDKAgent" \
--model MiniMax-M2.5 \
--n-concurrent 1
Artifacts:
- jobs/<run>/result.json
- jobs/<run>/<trial>/result.json
- jobs/<run>/<trial>/agent/open-agent-transcript/
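
The layout above puts one `result.json` per trial underneath each run directory. As a rough sketch of how those files might be gathered for later analysis, the hypothetical helper below loads every trial-level `result.json` under a run. It assumes only the path layout shown here and that each file is valid JSON; the fields inside `result.json` are not documented in this file, so the sketch does not interpret them.

```python
import json
from pathlib import Path


def load_trial_results(run_dir: str) -> dict:
    """Map each trial directory name to its parsed result.json.

    Only jobs/<run>/<trial>/result.json files are read; the run-level
    jobs/<run>/result.json is deliberately left out by the glob pattern.
    """
    results = {}
    for path in sorted(Path(run_dir).glob("*/result.json")):
        with path.open(encoding="utf-8") as fh:
            results[path.parent.name] = json.load(fh)
    return results


if __name__ == "__main__":
    import sys

    # Usage: python load_trials.py jobs/<run>
    for trial, result in load_trial_results(sys.argv[1]).items():
        print(trial, type(result).__name__)
```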
When publishing benchmark updates, include:
- Run date (YYYY-MM-DD)
- Model and provider
- Dataset split / task list
- Pass rate or success rate
- Median latency (if available)
- Cost estimate (if available)
- Link to raw artifacts in this repo
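
The checklist above can be assembled into a single machine-readable record. The following is a minimal sketch under stated assumptions: `RunOutcome`, `summarize`, and every field name are hypothetical and do not reflect the output format of `summarize_reports.py` or any other script in this repository. Latency and cost are optional per the checklist, so they are reported as `None` when no outcome carries them.

```python
from dataclasses import dataclass
from statistics import median
from typing import List, Optional


@dataclass
class RunOutcome:
    """One task attempt; latency and cost may be unavailable."""
    passed: bool
    latency_s: Optional[float] = None
    cost_usd: Optional[float] = None


def summarize(outcomes: List[RunOutcome], run_date: str, model: str,
              provider: str, dataset: str, artifacts: str) -> dict:
    """Build a publishable summary covering each item in the checklist."""
    if not outcomes:
        raise ValueError("cannot summarize an empty run")
    latencies = [o.latency_s for o in outcomes if o.latency_s is not None]
    costs = [o.cost_usd for o in outcomes if o.cost_usd is not None]
    return {
        "run_date": run_date,            # YYYY-MM-DD
        "model": model,
        "provider": provider,
        "dataset": dataset,              # dataset split or task list
        "pass_rate": sum(o.passed for o in outcomes) / len(outcomes),
        "median_latency_s": median(latencies) if latencies else None,
        "cost_estimate_usd": sum(costs) if costs else None,
        "artifacts": artifacts,          # link to raw artifacts in this repo
    }
```

Keeping unavailable metrics as explicit `None` values, rather than omitting the keys, makes records from different runs easier to compare side by side.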
| Capability | Status | Evidence |
|---|---|---|
| Tool permission gating | Available | canUseTool + mode system in SDK/examples |
| Hooks lifecycle | Available | docs/api-reference.md hooks section |
| Subagent/task orchestration | Available | docs/api-reference.md task tools |
| SWE-bench harness | Available | benchmark/swebench/ |
| Terminal-bench harness | Available | benchmark/terminalbench/ |
| Session persistence/resume | Available | examples/code-agent/ |
Note: external framework comparisons should only be added with direct links to upstream docs and a dated snapshot.