Terminal-Bench Run Guide

This directory contains local Terminal-Bench experiments and Harbor adapter code for Open Agent SDK.

Goal

Run terminal-bench@2.0 reliably with Harbor, then inspect reproducible artifacts under jobs/.

Directory Layout

  • benchmark/terminalbench/open_agent_sdk_harbor/: Harbor agent adapter and install scripts.
  • benchmark/terminalbench/test-tasks/: a local hello-world-style task.
  • benchmark/terminalbench/jobs/: historical local benchmark outputs.
  • docs/workflows/terminal-bench-harbor-runbook.md: extended troubleshooting notes.

Setup

From repo root:

pip install harbor

ln -sf "$(pwd)/benchmark/terminalbench/open_agent_sdk_harbor/agent.py" \
  "$(python -c 'import harbor; print(harbor.__path__[0])')/agents/installed/open_agent_sdk.py"

set -a
source .env
set +a

Required environment variables for MiniMax's Anthropic-compatible mode:

  • ANTHROPIC_API_KEY
  • ANTHROPIC_BASE_URL
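A pre-flight check can catch a missing variable before Harbor starts a container. The sketch below is illustrative only: the function name `check_required_env` is hypothetical and not part of this repository.

```shell
# Hypothetical helper (not part of the repo): report any required variable
# that is unset or empty in the current shell.
check_required_env() {
  missing=0
  for name in ANTHROPIC_API_KEY ANTHROPIC_BASE_URL; do
    # Indirect lookup via eval keeps this POSIX sh compatible.
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      echo "missing: $name" >&2
      missing=1
    fi
  done
  return "$missing"
}
```

After sourcing .env, running `check_required_env || exit 1` stops early instead of failing inside the agent setup step.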

Recommended Smoke Test (Docker)

env -u https_proxy -u http_proxy -u all_proxy \
    -u HTTPS_PROXY -u HTTP_PROXY -u ALL_PROXY \
harbor run -d terminal-bench@2.0 \
  --env docker \
  --agent-import-path "harbor.agents.installed.open_agent_sdk:OpenAgentSDKAgent" \
  --model MiniMax-M2.5 \
  --ae ANTHROPIC_API_KEY="$ANTHROPIC_API_KEY" \
  --ae ANTHROPIC_BASE_URL="$ANTHROPIC_BASE_URL" \
  --ae OAS_HARBOR_SAVE_TRAJECTORY=1 \
  --task-name "fix-git" \
  --n-concurrent 1 \
  --timeout-multiplier 3.0 \
  --override-memory-mb 4096

Notes:

  • --override-memory-mb 4096 is for debugging stability, not leaderboard submission.
  • Keep the proxy variables unset for the Harbor run if the host proxy listens on 127.0.0.1.
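A proxy bound to 127.0.0.1 is only reachable from the host itself; inside a container that address refers to the container. To see whether the current shell is affected, a small check like the following may help. Both function names are hypothetical, and the pattern match is a rough sketch that only recognizes `127.*` and `localhost`.

```shell
# Hypothetical helper: succeed if the given proxy URL points at the loopback interface.
is_loopback_proxy() {
  case "$1" in
    *://127.*|*://localhost*|127.*|localhost*) return 0 ;;
    *) return 1 ;;
  esac
}

# Print every proxy variable that is set to a loopback address.
list_loopback_proxies() {
  for name in https_proxy http_proxy all_proxy HTTPS_PROXY HTTP_PROXY ALL_PROXY; do
    eval "value=\${$name:-}"
    if [ -n "$value" ] && is_loopback_proxy "$value"; then
      echo "$name=$value"
    fi
  done
}
```

If `list_loopback_proxies` prints anything, keep the `env -u ...` prefix shown in the command above.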

Batch Run (Docker)

env -u https_proxy -u http_proxy -u all_proxy \
    -u HTTPS_PROXY -u HTTP_PROXY -u ALL_PROXY \
harbor run -d terminal-bench@2.0 \
  --env docker \
  --agent-import-path "harbor.agents.installed.open_agent_sdk:OpenAgentSDKAgent" \
  --model MiniMax-M2.5 \
  --ae ANTHROPIC_API_KEY="$ANTHROPIC_API_KEY" \
  --ae ANTHROPIC_BASE_URL="$ANTHROPIC_BASE_URL" \
  --n-concurrent 4

Overnight Automation (Low Disk)

Use the provided runner to execute tasks sequentially while automatically cleaning old terminal-bench images between batches.

Scripts:

  • benchmark/terminalbench/scripts/run-terminalbench-overnight.sh
  • benchmark/terminalbench/scripts/cleanup-terminalbench-images.sh

The overnight runner always sources the main workspace's .env (located via the git common dir), so it works correctly even when executed from a git worktree.
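The lookup can be sketched roughly as follows. This is not the runner's actual code: `main_env_path` is a hypothetical helper, and it assumes a non-bare repository whose common dir is `<main workspace>/.git`.

```shell
# Hypothetical helper: map a git common dir (e.g. /repo/.git) to the
# .env path of the main workspace that owns it.
main_env_path() {
  common_dir="$1"
  printf '%s/.env\n' "$(dirname "$common_dir")"
}

# Usage from inside a worktree. `git rev-parse --git-common-dir` prints the
# shared .git directory of the main checkout, possibly as a relative path,
# so it is resolved to an absolute path first:
#   main_env_path "$(cd "$(git rev-parse --git-common-dir)" && pwd)"
```

In a linked worktree the common dir still points at the main checkout, which is why the same .env is found regardless of where the script is launched.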

Example (sleep-safe smoke run):

chmod +x benchmark/terminalbench/scripts/*.sh

./benchmark/terminalbench/scripts/run-terminalbench-overnight.sh \
  --tasks-file benchmark/terminalbench/task-lists/smoke-5.txt \
  --batch-size 2 \
  --keep-images 1 \
  --task-repeats 1 \
  --agent-timeout-multiplier 0.6

Image cleanup only (manual):

# Preview
./benchmark/terminalbench/scripts/cleanup-terminalbench-images.sh --dry-run --keep 2

# Apply
./benchmark/terminalbench/scripts/cleanup-terminalbench-images.sh --keep 2

Where to Check Results

latest="$(ls -1dt jobs/* | head -n 1)"
find "$latest" -maxdepth 5 -type f | \
  grep -E 'result\.json|return-code\.txt|stdout\.txt|stderr\.txt|trial\.log|open-agent-transcript'

Key files:

  • jobs/<run>/result.json
  • jobs/<run>/<trial>/result.json
  • jobs/<run>/<trial>/agent/setup/stdout.txt
  • jobs/<run>/<trial>/agent/command-0/return-code.txt
  • jobs/<run>/<trial>/agent/command-0/stderr.txt
  • jobs/<run>/<trial>/agent/open-agent-transcript/sessions-index.json
  • jobs/<run>/<trial>/agent/open-agent-transcript/*.jsonl
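To scan a whole run at once, the per-trial return codes can be collected with a short loop. The sketch below assumes the layout listed above and that each return-code.txt holds only the numeric exit code; `summarize_return_codes` is a hypothetical name.

```shell
# Hypothetical helper: print "<trial> <return code>" for each trial in a run directory.
summarize_return_codes() {
  run_dir="$1"
  for rc_file in "$run_dir"/*/agent/command-0/return-code.txt; do
    [ -f "$rc_file" ] || continue    # the glob matched nothing
    trial="${rc_file#"$run_dir"/}"   # strip the run directory prefix
    trial="${trial%%/*}"             # keep only the trial directory name
    code="$(tr -d '[:space:]' < "$rc_file")"
    if [ "$code" = "137" ]; then
      echo "$trial $code (SIGKILL, likely OOM)"
    else
      echo "$trial $code"
    fi
  done
}
```

For example, `summarize_return_codes "$latest"` reuses the variable from the command above.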

Known Pitfalls

  • Return code 137: the container was killed with SIGKILL (128 + 9), which here indicates an OOM kill. Increase memory only for debugging.
  • Setup fails while installing the CLI: the adapter install script now falls back across npm registries (npmjs, then npmmirror).
  • MiniMax region mismatch:
    • api.minimaxi.com and api.minimax.io are different endpoint domains.
    • A key valid on one region endpoint may fail on the other.
  • Daytona + MiniMax: ECONNRESET errors are currently observed on sandbox egress to MiniMax endpoints in this environment.
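For the region mismatch above, it can help to print which endpoint domain the configured base URL actually targets before starting a run. This is a minimal sketch; `base_url_host` is a hypothetical helper, and the URL path in the usage comment is a placeholder.

```shell
# Hypothetical helper: extract the host part of a URL such as $ANTHROPIC_BASE_URL.
base_url_host() {
  url="$1"
  host="${url#*://}"   # drop the scheme
  host="${host%%/*}"   # drop the path
  host="${host%%:*}"   # drop the port
  echo "$host"
}

# Usage (placeholder path):
#   base_url_host "https://api.minimax.io/some/path"
#   base_url_host "$ANTHROPIC_BASE_URL"
```

Comparing the printed host with the region where the key was issued is a quick way to rule out this pitfall.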

Daytona (Current Status)

The Daytona environment can start tasks, but MiniMax calls from the sandbox showed repeated ECONNRESET errors during this debugging cycle. Use Docker for stable MiniMax runs until the Daytona network path is fixed.

Related Docs

  • benchmark/terminalbench/open_agent_sdk_harbor/README.md
  • docs/workflows/terminal-bench-harbor-runbook.md
  • docs/research/harbor-137-debugging-handoff.md