This directory contains local Terminal-Bench experiments and Harbor adapter code for Open Agent SDK.
Run terminal-bench@2.0 reliably with Harbor, then inspect reproducible artifacts under jobs/.
- benchmark/terminalbench/open_agent_sdk_harbor/: Harbor agent adapter and install scripts.
- benchmark/terminalbench/test-tasks/: local hello-world style task.
- benchmark/terminalbench/jobs/: historical local benchmark outputs.
- docs/workflows/terminal-bench-harbor-runbook.md: extended troubleshooting notes.
From repo root:
pip install harbor
ln -sf "$(pwd)/benchmark/terminalbench/open_agent_sdk_harbor/agent.py" \
"$(python -c 'import harbor; print(harbor.__path__[0])')/agents/installed/open_agent_sdk.py"
set -a
source .env
set +a

Required env for MiniMax Anthropic-compatible mode:
- ANTHROPIC_API_KEY
- ANTHROPIC_BASE_URL
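
It can help to fail fast when either variable is still missing after sourcing .env, rather than discovering it inside the container. The following is a minimal sketch, not part of this repository; the function name `check_required_env` is hypothetical.

```shell
# Hypothetical helper (not part of this repo): report every required variable
# that is unset or empty, and return non-zero if at least one is missing.
check_required_env() {
  missing=0
  for name in ANTHROPIC_API_KEY ANTHROPIC_BASE_URL; do
    # Read the variable whose name is stored in $name. The names are fixed
    # literals, so eval is never fed untrusted input here.
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      echo "missing required env: $name" >&2
      missing=1
    fi
  done
  return "$missing"
}

# Usage (after sourcing .env):
#   check_required_env || exit 1
```
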
env -u https_proxy -u http_proxy -u all_proxy \
-u HTTPS_PROXY -u HTTP_PROXY -u ALL_PROXY \
harbor run -d terminal-bench@2.0 \
--env docker \
--agent-import-path "harbor.agents.installed.open_agent_sdk:OpenAgentSDKAgent" \
--model MiniMax-M2.5 \
--ae ANTHROPIC_API_KEY="$ANTHROPIC_API_KEY" \
--ae ANTHROPIC_BASE_URL="$ANTHROPIC_BASE_URL" \
--ae OAS_HARBOR_SAVE_TRAJECTORY=1 \
--task-name "fix-git" \
--n-concurrent 1 \
--timeout-multiplier 3.0 \
--override-memory-mb 4096

Notes:
- --override-memory-mb 4096 is for debugging stability, not leaderboard submission.
- Keep proxy vars unset for the Harbor run if the host proxy is 127.0.0.1.
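
A loopback proxy address is generally not reachable from inside a container, which is presumably why the commands here strip the proxy variables with `env -u`. The sketch below shows one way to detect that situation up front. The helper name `proxy_points_to_loopback` is hypothetical, and it matches only the literal 127.0.0.1 address mentioned in the note.

```shell
# Hypothetical helper: succeed (exit 0) if any proxy variable mentions 127.0.0.1.
proxy_points_to_loopback() {
  for value in "${https_proxy:-}" "${http_proxy:-}" "${all_proxy:-}" \
               "${HTTPS_PROXY:-}" "${HTTP_PROXY:-}" "${ALL_PROXY:-}"; do
    case "$value" in
      *127.0.0.1*) return 0 ;;
    esac
  done
  return 1
}

# Usage:
#   if proxy_points_to_loopback; then
#     echo "host proxy is on loopback; run harbor with the env -u prefix" >&2
#   fi
```
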
env -u https_proxy -u http_proxy -u all_proxy \
-u HTTPS_PROXY -u HTTP_PROXY -u ALL_PROXY \
harbor run -d terminal-bench@2.0 \
--env docker \
--agent-import-path "harbor.agents.installed.open_agent_sdk:OpenAgentSDKAgent" \
--model MiniMax-M2.5 \
--ae ANTHROPIC_API_KEY="$ANTHROPIC_API_KEY" \
--ae ANTHROPIC_BASE_URL="$ANTHROPIC_BASE_URL" \
--n-concurrent 4

Use the provided runner to execute tasks sequentially while automatically cleaning old terminal-bench images between batches.
Scripts:
- benchmark/terminalbench/scripts/run-terminalbench-overnight.sh
- benchmark/terminalbench/scripts/cleanup-terminalbench-images.sh
The overnight runner always sources the main workspace's .env (located via the git common dir), so it works correctly even when executed from a git worktree.
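
The idea of locating the main workspace through the git common dir can be sketched as follows. This is an illustration under stated assumptions, not the runner's actual code: the function name `main_workspace_root` is hypothetical, and it assumes a non-bare repository in which the main workspace is the parent of the shared .git directory.

```shell
# Hypothetical helper: print the main workspace root, even when called from a
# linked worktree or from a subdirectory.
main_workspace_root() {
  top="$(git rev-parse --show-toplevel)" || return 1
  (
    cd "$top" || exit 1
    # --git-common-dir prints the shared .git directory. In a linked worktree
    # this is the main checkout's .git, not the worktree's private git dir.
    common_dir="$(git rev-parse --git-common-dir)" || exit 1
    # Assumption: non-bare repo, so the workspace is the parent of .git.
    cd "$common_dir/.." && pwd -P
  )
}

# Usage:
#   set -a; . "$(main_workspace_root)/.env"; set +a
```
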
Example (sleep-safe smoke run):
chmod +x benchmark/terminalbench/scripts/*.sh
./benchmark/terminalbench/scripts/run-terminalbench-overnight.sh \
--tasks-file benchmark/terminalbench/task-lists/smoke-5.txt \
--batch-size 2 \
--keep-images 1 \
--task-repeats 1 \
--agent-timeout-multiplier 0.6

Image cleanup only (manual):
# Preview
./benchmark/terminalbench/scripts/cleanup-terminalbench-images.sh --dry-run --keep 2
# Apply
./benchmark/terminalbench/scripts/cleanup-terminalbench-images.sh --keep 2

latest="$(ls -1dt jobs/* | head -n 1)"
find "$latest" -maxdepth 5 -type f | \
grep -E 'result.json|return-code.txt|stdout.txt|stderr.txt|trial.log|open-agent-transcript'

Key files:
- jobs/<run>/result.json
- jobs/<run>/<trial>/result.json
- jobs/<run>/<trial>/agent/setup/stdout.txt
- jobs/<run>/<trial>/agent/command-0/return-code.txt
- jobs/<run>/<trial>/agent/command-0/stderr.txt
- jobs/<run>/<trial>/agent/open-agent-transcript/sessions-index.json
- jobs/<run>/<trial>/agent/open-agent-transcript/*.jsonl
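
Exit code 137 conventionally means the process was killed by SIGKILL (128 + 9), which is typical of a container OOM kill. A quick way to spot such trials in a run is to scan the return-code.txt files. The sketch below assumes the layout listed above and assumes that return-code.txt holds a bare integer; the helper name `list_oom_trials` is hypothetical.

```shell
# Hypothetical helper: print every trial directory in a run whose agent
# command exited with code 137.
list_oom_trials() {
  run_dir="$1"
  for rc_file in "$run_dir"/*/agent/command-0/return-code.txt; do
    # Skip the unexpanded glob pattern when no trial matches.
    [ -f "$rc_file" ] || continue
    # Strip whitespace so "137" and "137\n" compare equal.
    code="$(tr -d '[:space:]' < "$rc_file")"
    if [ "$code" = "137" ]; then
      echo "${rc_file%/agent/command-0/return-code.txt}"
    fi
  done
}

# Usage:
#   list_oom_trials "$(ls -1dt jobs/* | head -n 1)"
```
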
- Return code 137: container OOM kill. Increase memory only for debugging.
- Setup fails while installing the CLI: the adapter install script now falls back between npm registries (npmjs, then npmmirror).
- MiniMax region mismatch:
  - api.minimaxi.com and api.minimax.io are different endpoint domains.
  - A key valid on one region endpoint may fail on the other.
- Daytona + MiniMax: ECONNRESET is currently observed on sandbox egress to MiniMax endpoints in this environment.
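
For the region-mismatch item above, it can be useful to print which endpoint domain the configured base URL actually points at before blaming the key. The following is a minimal sketch; the helper name `base_url_host` is hypothetical, and it does plain string parsing without validating the URL.

```shell
# Hypothetical helper: print the host part of a URL such as
# https://api.minimaxi.com/some/path -> api.minimaxi.com
base_url_host() {
  rest="${1#*://}"     # drop the scheme, if present
  rest="${rest%%/*}"   # drop the path
  rest="${rest##*@}"   # drop userinfo, if present
  echo "${rest%%:*}"   # drop the port, if present
}

# Usage:
#   echo "MiniMax endpoint domain: $(base_url_host "$ANTHROPIC_BASE_URL")"
```
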
The Daytona environment can start tasks, but MiniMax calls from the sandbox showed repeated ECONNRESET errors during this debugging cycle. Use Docker for stable MiniMax runs until the Daytona network path is fixed.
- benchmark/terminalbench/open_agent_sdk_harbor/README.md
- docs/workflows/terminal-bench-harbor-runbook.md
- docs/research/harbor-137-debugging-handoff.md