This document summarizes the reproducible way to run terminal-bench with Harbor in this repo, including known pitfalls seen during debugging on March 2, 2026.
- Docker runtime available (for example: Colima running)
- Harbor installed (python >= 3.12)
- Open Agent SDK Harbor agent symlinked into Harbor's installed agents
Example:
pip install harbor
ln -sf "$(pwd)/benchmark/terminalbench/open_agent_sdk_harbor/agent.py" \
  "$(python -c 'import harbor; print(harbor.__path__[0])')/agents/installed/open_agent_sdk.py"

Use the repository .env as the single source of truth:
set -a
source .env
set +a

Required for the MiniMax Anthropic-compatible endpoint:
- ANTHROPIC_API_KEY
- ANTHROPIC_BASE_URL (for example https://api.minimaxi.com/anthropic/v1)

If these are empty, command-0 fails quickly with invalid URL/provider errors.
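A stricter variant of this check can stop the run before Harbor starts. The sketch below is a minimal illustration; `require_env` is a hypothetical helper name, not part of Harbor or this repo.

```shell
# Hypothetical helper (not part of Harbor): fail fast when a required
# variable is unset or empty. Variable names are passed as arguments.
require_env() {
  missing=0
  for name in "$@"; do
    # Indirect lookup via eval keeps this POSIX sh compatible;
    # only pass trusted variable names.
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      echo "missing or empty: $name" >&2
      missing=1
    fi
  done
  return "$missing"
}

# Usage, after sourcing .env:
#   require_env ANTHROPIC_API_KEY ANTHROPIC_BASE_URL || exit 1
```

The function reports every missing variable in one pass rather than stopping at the first, so a single run shows everything that needs fixing in .env.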
Quick sanity check:
echo "ANTHROPIC_API_KEY length=${#ANTHROPIC_API_KEY}"
echo "ANTHROPIC_BASE_URL=$ANTHROPIC_BASE_URL"

Use proxy settings for host tooling only if needed, but run the Harbor process with the proxy variables removed: a proxy listening on the host's 127.0.0.1 is not reachable at that address from inside a container, which causes container networking failures.
Recommended pattern:
env -u https_proxy -u http_proxy -u all_proxy \
-u HTTPS_PROXY -u HTTP_PROXY -u ALL_PROXY \
harbor run ...

Use fix-git as a quick real-task smoke test:
env -u https_proxy -u http_proxy -u all_proxy \
-u HTTPS_PROXY -u HTTP_PROXY -u ALL_PROXY \
harbor run -d terminal-bench@2.0 \
--agent-import-path "harbor.agents.installed.open_agent_sdk:OpenAgentSDKAgent" \
--model MiniMax-M2.5 \
--ae ANTHROPIC_API_KEY="$ANTHROPIC_API_KEY" \
--ae ANTHROPIC_BASE_URL="$ANTHROPIC_BASE_URL" \
--timeout-multiplier 3.0 \
--task-name "fix-git" \
--override-memory-mb 4096 \
--no-delete

Notes:
- --override-memory-mb 4096 is for debugging stability and may not be leaderboard-valid.
- --no-delete keeps containers for postmortem.

Failure signatures:
- return code 137: container killed by OOM.
- TypeError: fetch() URL is invalid in agent/command-0/stdout.txt: missing or empty ANTHROPIC_BASE_URL.
- reward=0 with no exception: run completed but task solution incorrect (not an infra failure).

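These signatures can be checked mechanically for one trial directory. The sketch below is a rough illustration: `classify_trial` is a hypothetical helper, and it assumes return-code.txt contains only the integer exit code.

```shell
# Hypothetical helper: map one trial directory to a coarse failure class,
# using only the signatures listed above.
# Assumption: return-code.txt holds just the integer exit code.
classify_trial() {
  trial_dir=$1
  rc_file="$trial_dir/agent/command-0/return-code.txt"
  out_file="$trial_dir/agent/command-0/stdout.txt"

  rc=""
  if [ -f "$rc_file" ]; then
    rc=$(tr -d '[:space:]' < "$rc_file")
  fi

  if [ "$rc" = "137" ]; then
    echo "infra: OOM kill (return code 137)"
  elif [ -f "$out_file" ] && grep -qF 'fetch() URL is invalid' "$out_file"; then
    echo "infra: missing or empty ANTHROPIC_BASE_URL"
  else
    echo "no known infra signature: check reward in result.json"
  fi
}

# Usage: classify_trial "jobs/<run>/<trial>"
```

The helper does not read result.json, so a reward=0 outcome still has to be confirmed by hand.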
After each run, inspect:
latest="$(ls -1dt jobs/* | head -n 1)"
find "$latest" -maxdepth 5 -type f | grep -E 'result.json|return-code.txt|stdout.txt|trial.log|open-agent-transcript'

Key files:
- jobs/<run>/<trial>/result.json
- jobs/<run>/<trial>/agent/command-0/return-code.txt
- jobs/<run>/<trial>/agent/command-0/stdout.txt
- jobs/<run>/<trial>/verifier/test-stdout.txt
- jobs/<run>/<trial>/agent/open-agent-transcript/trajectory.json (when OAS_HARBOR_SAVE_TRAJECTORY=1)
- jobs/<run>/<trial>/agent/open-agent-transcript/*.jsonl (Open Agent session transcript)
- jobs/<run>/<trial>/agent/open-agent-transcript/sessions-index.json

With --no-delete, inspect cgroup metrics in the container:
docker exec <container> sh -lc 'cat /sys/fs/cgroup/memory.max; cat /sys/fs/cgroup/memory.peak; cat /sys/fs/cgroup/memory.events'

OOM evidence:
- memory.peak close to memory.max
- memory.events has oom_kill > 0

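The two OOM criteria above can be evaluated from the three cgroup v2 values once they have been read out of the container. The sketch below is a minimal illustration: `oom_evidence` is a hypothetical helper, and the 95% threshold for "close to" is an assumption, not a value from Harbor.

```shell
# Hypothetical helper: decide whether cgroup v2 memory stats point to an OOM kill.
# Arguments: contents of memory.max, memory.peak, and memory.events.
oom_evidence() {
  mem_max=$1
  mem_peak=$2
  events=$3

  # memory.events is a list of "key value" lines; pick the oom_kill counter.
  oom_kills=$(printf '%s\n' "$events" | awk '$1 == "oom_kill" { print $2 }')
  if [ "${oom_kills:-0}" -gt 0 ]; then
    echo "oom_kill=$oom_kills: container was OOM-killed"
    return 0
  fi

  # memory.max is the literal string "max" when no limit is set.
  # Assumption: "close to" means peak >= 95% of the limit.
  if [ "$mem_max" != "max" ] && [ "$mem_peak" -ge $(( $mem_max * 95 / 100 )) ]; then
    echo "peak within 5% of limit, no oom_kill recorded"
    return 0
  fi

  echo "no OOM evidence"
  return 1
}

# Usage, with a container kept by --no-delete:
#   oom_evidence \
#     "$(docker exec <container> cat /sys/fs/cgroup/memory.max)" \
#     "$(docker exec <container> cat /sys/fs/cgroup/memory.peak)" \
#     "$(docker exec <container> cat /sys/fs/cgroup/memory.events)"
```

The oom_kill counter is checked first because it is direct evidence; a high peak on its own only suggests memory pressure.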
- For local debugging: use the 4 GB memory override and the proxy-unsetting pattern above.
- Ensure .env is sourced before running Harbor.
- Distinguish infra failures (137, invalid URL) from task-quality failures (reward=0).