feat(evals): add harbor infra for terminal-bench evals #215

Open
maahir30 wants to merge 5 commits into main from harbor-evals

@maahir30 commented Feb 10, 2026

Summary

  • Adds the ability to run deepagents-js against terminal-bench benchmarks via Harbor.

  • Harbor is a Python-only benchmark framework: it can only load Python agent classes. Because our agent runs in Node.js, this PR adds a thin Python wrapper that bridges the two runtimes.

  • Also adds support for running benchmarks on LangSmith hosted sandboxes as a custom Harbor environment, alongside Docker and Daytona.

Architecture

Harbor requires agents to be Python classes extending BaseAgent. Our wrapper (DeepAgentsJSWrapper) satisfies that contract but delegates all actual agent work to Node.js:

  1. Harbor calls run(instruction, environment) on our Python wrapper
  2. Python spawns a Node.js subprocess running runner.ts, which creates a deepagents agent
  3. The two processes communicate via JSON-RPC over stdin/stdout:
    • When the JS agent needs to execute a shell command (e.g., ls -la), the Node process sends an exec_request to Python
    • Python calls Harbor's environment.exec() (which runs the command in the sandboxed container) and sends the result back as an exec_response
    • All file operations (read, write, edit, grep, glob) are handled inside Node by the existing BaseSandbox class, which builds shell commands and routes them through execute()
  4. When the agent finishes, Node sends a done message with the full message history, and Python saves the trajectory in Harbor's ATIF format
Harbor (Python)  →  wrapper.py (Python)  →  runner.ts (Node.js)  →  createDeepAgent (JS)
                         ↑                        ↓
                         ↑    exec_request/exec_response
                         ↑    (JSON-RPC over stdin/stdout)
                         ↑                        ↓
                    environment.exec()    ←   RpcSandbox.execute()
                    (runs in sandbox)
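
The request/response loop in step 3 can be sketched from the Python side roughly as follows. This is a minimal illustration, not the code in wrapper.py: the message field names (type, id, command, output, exit_code, messages), the one-JSON-object-per-line framing, and the exec_fn callback standing in for Harbor's environment.exec() are all assumptions.

```python
import asyncio
import json
from typing import Awaitable, Callable, Iterable

# Hypothetical stand-in for Harbor's environment.exec(): takes a shell
# command and returns (output, exit_code). The real signature may differ.
ExecFn = Callable[[str], Awaitable[tuple[str, int]]]


async def bridge_loop(
    lines: Iterable[str],
    send: Callable[[str], None],
    exec_fn: ExecFn,
) -> list:
    """Serve exec requests from the Node process until it reports 'done'.

    `lines` yields JSON messages read from the Node process's stdout and
    `send` writes one JSON line to its stdin. All field names are assumed.
    """
    for line in lines:
        msg = json.loads(line)
        if msg["type"] == "exec_request":
            # Run the command in the sandbox and echo the request id so the
            # Node side can match the response to its pending call.
            output, exit_code = await exec_fn(msg["command"])
            send(json.dumps({
                "type": "exec_response",
                "id": msg["id"],
                "output": output,
                "exit_code": exit_code,
            }))
        elif msg["type"] == "done":
            # Final message: the full history, to be saved as a trajectory.
            return msg["messages"]
    raise RuntimeError("Node process exited without sending 'done'")
```

In a real wrapper the lines would come from an asyncio subprocess stream rather than an in-memory iterable; a plain iterable keeps the sketch easy to exercise in isolation.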

LangSmith Sandbox Support

Adds LangSmithEnvironment, a custom Harbor BaseEnvironment implementation backed by the langsmith.sandbox SDK. Harbor supports custom environments via --environment-import-path (or environment.import_path in YAML config), which dynamically imports any class that extends BaseEnvironment.
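
Dynamic loading by import path generally works along these lines. This is a minimal sketch using only the standard library; the module:ClassName path format and the load_class helper are assumptions for illustration, not Harbor's actual loader.

```python
import importlib


def load_class(import_path: str, base: type) -> type:
    """Import 'module:ClassName' and check that it subclasses `base`."""
    module_name, _, class_name = import_path.partition(":")
    if not module_name or not class_name:
        raise ValueError(f"expected 'module:ClassName', got {import_path!r}")
    module = importlib.import_module(module_name)
    cls = getattr(module, class_name)
    # Reject anything that is not a class derived from the required base.
    if not (isinstance(cls, type) and issubclass(cls, base)):
        raise TypeError(f"{import_path} does not extend {base.__name__}")
    return cls
```

For example, load_class("collections:OrderedDict", dict) returns the OrderedDict class, while passing list as the base raises TypeError.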

The environment automatically resolves the correct container image from the task's docker_image field in task.toml (the same pre-built images used by Daytona/E2B), so tasks run with the full environment they expect (gcc, weights, test harness, etc.).

Key files:

  • langsmith_environment.py -- implements all BaseEnvironment abstract methods (exec, start, stop, upload_file, download_file, etc.) using AsyncSandboxClient/AsyncSandbox
  • langsmith-env-config.yaml -- minimal YAML config for use with harbor run -c

How to run

# One-time setup (run from the repository root; subshells keep the working directory unchanged)
(cd libs/harbor/python && uv sync)
(cd libs/harbor && pnpm build)

# Run a single benchmark task locally (Docker)
cd libs/harbor && make bench-docker

# Run a specific task
make bench-docker TASK=gpt2-codegolf

# Run at scale on Daytona
make bench-daytona

# Run on LangSmith sandbox (requires LANGSMITH_API_KEY)
make bench-langsmith
make bench-langsmith TASK=gpt2-codegolf

@maahir30 requested review from hntrl and vtrivedy on February 10, 2026, 20:49
@maahir30 changed the title from "feat (evals): add harbor infra for terminal-bench evals" to "feat(evals): add harbor infra for terminal-bench evals" on Feb 10, 2026

changeset-bot bot commented Feb 11, 2026

⚠️ No Changeset found

Latest commit: 957c47e

Merging this PR will not cause a version bump for any packages. If these changes should not result in a new version, you're good to go. If these changes should result in a version bump, you need to add a changeset.
