feat(evals): add harbor infra for terminal-bench evals by maahir30 · Pull Request #215 · langchain-ai/deepagentsjs

maahir30 · 2026-02-10T20:49:54Z

Summary

Adds the ability to run deepagents-js against terminal-bench benchmarks via Harbor.
Harbor is a Python-only benchmark framework: it can only load agent classes written in Python. Since our agent runs in Node.js, we need a thin Python wrapper that acts as a bridge between the two runtimes. This PR adds that bridge.
Also adds support for running benchmarks on LangSmith hosted sandboxes as a custom Harbor environment, alongside Docker and Daytona.

Architecture

Harbor requires agents to be Python classes extending BaseAgent. Our wrapper (DeepAgentsJSWrapper) satisfies that contract but delegates all actual agent work to Node.js:

1. Harbor calls run(instruction, environment) on our Python wrapper.
2. Python spawns a Node.js subprocess running runner.ts, which creates a deepagents agent.
3. The two processes communicate via JSON-RPC over stdin/stdout:
   - When the JS agent needs to execute a shell command (e.g., ls -la), the Node process sends an exec_request to Python.
   - Python calls Harbor's environment.exec() (which runs the command in the sandboxed container) and sends the result back as an exec_response.
   - All file operations (read, write, edit, grep, glob) are handled inside Node by the existing BaseSandbox class, which builds shell commands and routes them through execute().
4. When the agent finishes, Node sends a done message with the full message history, and Python saves the trajectory in Harbor's ATIF format.
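The Python side of the exchange described above can be sketched roughly as follows. This is a minimal illustration, not the wrapper's actual code: the message names exec_request, exec_response and done come from this PR, but the field names (type, id, command, output, exit_code, messages), the newline-delimited framing, and the bridge function itself are assumptions. A real wrapper would call Harbor's environment.exec() where the sketch calls a plain exec_fn, so that the example stays runnable on its own.

```python
import json
from typing import Callable, Iterable, Iterator

# Hypothetical result shape for a command: (combined output, exit code).
ExecFn = Callable[[str], tuple[str, int]]


def bridge(lines: Iterable[str], exec_fn: ExecFn, history: list) -> Iterator[str]:
    """Consume JSON messages from the Node process and yield replies.

    `lines` stands in for the subprocess's stdout; each yielded string
    would be written to the subprocess's stdin followed by a newline.
    """
    for raw in lines:
        raw = raw.strip()
        if not raw:
            continue  # tolerate blank lines between messages
        msg = json.loads(raw)
        if msg["type"] == "exec_request":
            # Run the command in the sandbox and echo the request id back
            # so the Node side can match the response to its request.
            output, exit_code = exec_fn(msg["command"])
            yield json.dumps({
                "type": "exec_response",
                "id": msg["id"],
                "output": output,
                "exit_code": exit_code,
            })
        elif msg["type"] == "done":
            # The agent finished: keep the message history for the trajectory.
            history.extend(msg.get("messages", []))
            return
```

Passing the executor in as a callable also makes the loop easy to exercise without Node or a sandbox, for example with a lambda that returns canned output.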

Harbor (Python)  →  wrapper.py (Python)  →  runner.ts (Node.js)  →  createDeepAgent (JS)
                         ↑                        ↓
                         ↑    exec_request/exec_response
                         ↑    (JSON-RPC over stdin/stdout)
                         ↑                        ↓
                    environment.exec()    ←   RpcSandbox.execute()
                    (runs in sandbox)
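Because every file operation is reduced to a shell command, RpcSandbox only has to implement execute(). The idea can be illustrated in Python as below. This is a hypothetical sketch of the pattern, not a port of the JS BaseSandbox class: the ShellFiles class, the helper names, and the specific commands (cat, printf piped through base64 -d) are all assumptions chosen for illustration.

```python
import base64
import shlex
from typing import Callable


def read_command(path: str) -> str:
    """Build a shell command that prints a file, quoting the path safely."""
    return f"cat {shlex.quote(path)}"


def write_command(path: str, content: str) -> str:
    """Build a shell command that writes `content` to `path`.

    The content is base64-encoded so quotes, newlines and other special
    characters survive the trip through the shell unchanged.
    """
    encoded = base64.b64encode(content.encode("utf-8")).decode("ascii")
    return f"printf %s {shlex.quote(encoded)} | base64 -d > {shlex.quote(path)}"


class ShellFiles:
    """Hypothetical file API that needs nothing but an execute() callable."""

    def __init__(self, execute: Callable[[str], str]) -> None:
        self._execute = execute

    def read(self, path: str) -> str:
        return self._execute(read_command(path))

    def write(self, path: str, content: str) -> str:
        return self._execute(write_command(path, content))
```

With this shape, swapping the transport (local shell, RPC to Python, a remote sandbox) only means passing a different execute callable.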

LangSmith Sandbox Support

Adds LangSmithEnvironment, a custom Harbor BaseEnvironment implementation backed by the langsmith.sandbox SDK. Harbor supports custom environments via --environment-import-path (or environment.import_path in YAML config), which dynamically imports any class that extends BaseEnvironment.
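For readers unfamiliar with import-path loading, the general mechanism looks something like the sketch below. It is not Harbor's implementation: the load_class helper and the "module:ClassName" path format are assumptions made for illustration, and only standard-library importlib calls are used.

```python
import importlib


def load_class(import_path: str, base: type) -> type:
    """Import `module:ClassName` and check that it subclasses `base`."""
    module_name, _, class_name = import_path.partition(":")
    if not module_name or not class_name:
        raise ValueError(f"expected 'module:ClassName', got {import_path!r}")
    module = importlib.import_module(module_name)
    cls = getattr(module, class_name)
    # Reject anything that is not a class derived from the expected base,
    # so a misconfigured path fails early with a clear error.
    if not (isinstance(cls, type) and issubclass(cls, base)):
        raise TypeError(f"{import_path!r} does not extend {base.__name__}")
    return cls
```

In the Harbor case, base would be BaseEnvironment and the path would point at LangSmithEnvironment in langsmith_environment.py.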

The environment automatically resolves the correct container image from the task's docker_image field in task.toml (the same pre-built images used by Daytona/E2B), so tasks run with the full environment they expect (gcc, weights, test harness, etc.).

Key files:

langsmith_environment.py -- implements all BaseEnvironment abstract methods (exec, start, stop, upload_file, download_file, etc.) using AsyncSandboxClient/AsyncSandbox
langsmith-env-config.yaml -- minimal YAML config for use with harbor run -c

How to run

# One-time setup
cd libs/harbor/python && uv sync
cd libs/harbor && pnpm build

# Run a single benchmark task locally (Docker)
cd libs/harbor && make bench-docker

# Run a specific task
make bench-docker TASK=gpt2-codegolf

# Run at scale on Daytona
make bench-daytona

# Run on LangSmith sandbox (requires LANGSMITH_API_KEY)
make bench-langsmith
make bench-langsmith TASK=gpt2-codegolf

changeset-bot · 2026-02-11T18:26:59Z

⚠️ No Changeset found

Latest commit: 957c47e

Merging this PR will not cause a version bump for any packages. If these changes should not result in a new version, you're good to go. If these changes should result in a version bump, you need to add a changeset.

This PR includes no changesets

When changesets are added to this PR, you'll see the packages that this PR includes changesets for and the associated semver types


rpc harbor infra

5262de7

maahir30 requested review from hntrl and vtrivedy February 10, 2026 20:49

move python infra into to repo

891c239

maahir30 changed the title ~~feat (evals): add harbor infra for terminal-bench evals~~ feat(evals): add harbor infra for terminal-bench evals Feb 10, 2026

add LC sandboxes

f0bc121

maahir30 added 2 commits February 11, 2026 17:03

langsmith experiment integration

e285807

fix

957c47e

maahir30 wants to merge 5 commits into main from harbor-evals
