Merge pull request #16 from Octane0411/feat/harbor-local-test

Octane0411 · web-flow · commit 321ff3f2503d · 2026-02-27T16:53:25.000+08:00
feat(benchmark): add Harbor local development testing support
diff --git a/.gitignore b/.gitignore
@@ -51,3 +51,6 @@ RELEASE_NOTES.md
 
 # Worktrees
 .worktrees/
+
+# Harbor benchmark results
+jobs/
diff --git a/benchmark/terminalbench/harbor/README.md b/benchmark/terminalbench/harbor/README.md
@@ -140,6 +140,129 @@ Agent code (create_run_agent_commands)
 - alex@laude.org
 - mikeam@cs.stanford.edu
 
+## Local Development Testing
+
+Before publishing to npm, you can test the Harbor integration against source code built from the GitHub repository instead of the published npm package.
+
+### Prerequisites
+
+1. **Colima** (Docker-compatible runtime)
+   ```bash
+   colima start
+   ```
+
+2. **Harbor** (Python >= 3.12)
+   ```bash
+   pip install harbor
+   ```
+
+3. **MiniMax API credentials**
+   ```bash
+   export ANTHROPIC_AUTH_TOKEN="sk-api-..."
+   export ANTHROPIC_BASE_URL="https://api.minimaxi.com/anthropic/v1"
+   ```
+
+### Quick Test
+
+Use the automated test script:
+
+```bash
+# Set MiniMax API credentials
+export ANTHROPIC_AUTH_TOKEN="sk-api-..."
+export ANTHROPIC_BASE_URL="https://api.minimaxi.com/anthropic/v1"
+
+# Run test
+./scripts/test-harbor-local.sh
+```
+
+This script will:
+1. Check all prerequisites (Colima, Harbor, API keys)
+2. Register the local development agent
+3. Run the hello-world test task
+4. Display results (reward: 1 = success, 0 = failure)
+
+### Manual Testing
+
+For more control, you can run tests manually:
+
+```bash
+# 1. Register local development agent
+ln -sf $(pwd)/benchmark/terminalbench/harbor/agent_local.py \
+  $(python -c "import harbor; print(harbor.__path__[0])")/agents/installed/open_agent_sdk_local.py
+
+# 2. Set MiniMax credentials
+export ANTHROPIC_AUTH_TOKEN="sk-api-..."
+export ANTHROPIC_BASE_URL="https://api.minimaxi.com/anthropic/v1"
+
+# 3. Run single task
+harbor jobs start \
+  --path benchmark/terminalbench/test-tasks/hello-world \
+  --agent-import-path "harbor.agents.installed.open_agent_sdk_local:OpenAgentSDKAgentLocal" \
+  --model MiniMax-M2.5
+
+# 4. Check results in logs
+# Look for "reward: 1" in verifier output
+```
+
+### How It Works
+
+**Local Development Flow:**
+```
+Host Machine
+  ↓ Set ANTHROPIC_AUTH_TOKEN + ANTHROPIC_BASE_URL
+Harbor Agent (agent_local.py)
+  ↓ Use install-open-agent-sdk-local.sh.j2
+Sandbox Container
+  ↓ git clone --depth 1 --branch feat/harbor-local-test https://github.com/Octane0411/open-agent-sdk.git
+  ↓ bun install && cd packages/core && bun run build
+  ↓ cd packages/cli && bun link
+  ↓ Run: oas --provider anthropic --base-url https://api.minimaxi.com/anthropic/v1 -p "..." --model MiniMax-M2.5
+  ↓ Call MiniMax API via Anthropic-compatible endpoint
+  ↓ Generate output and save to files
+Verifier
+  ↓ Check output files and write reward (1 or 0) to /logs/verifier/reward.txt
+```
+
+**Key differences from production:**
+- Installs from GitHub repository (not npm)
+- Builds packages locally in container
+- Uses `bun link` for global CLI access
+- Requires code to be pushed to the branch that the install script clones (`feat/harbor-local-test`) before testing
+
+### Troubleshooting
+
+**Issue: "Colima is not running"**
+```bash
+colima start
+```
+
+**Issue: "Harbor is not installed"**
+```bash
+pip install harbor
+```
+
+**Issue: "ANTHROPIC_AUTH_TOKEN is not set"**
+```bash
+export ANTHROPIC_AUTH_TOKEN="sk-api-..."
+export ANTHROPIC_BASE_URL="https://api.minimaxi.com/anthropic/v1"
+```
+
+**Issue: "git clone failed in container"**
+- Ensure your changes are pushed to GitHub
+- Check network connectivity in container
+- Try: `docker pull ubuntu:24.04` to test Docker networking
+
+**Issue: "bun install failed"**
+- Check container has internet access
+- Verify package.json is valid
+- Check Bun installation logs
+
+**Issue: "reward: 0" (test failed)**
+- Check container logs for errors
+- Verify API credentials are correct
+- Check if `greeting.txt` was created
+- Verify the greeting has at least 10 words and contains one of `welcome`, `greeting`, `hello`, or `harbor` (case-insensitive)
+
 ## References
 
 - [Terminal-bench](https://www.tbench.ai/leaderboard/terminal-bench/2.0)
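
Note on the registration step in the README's Manual Testing section: the `ln -sf` command places a symlink to `agent_local.py` inside the installed `harbor` package so that the `--agent-import-path` value can resolve it. A minimal Python sketch of the same step follows. `register_agent` and its parameters are hypothetical names used only for illustration, and the package directory is passed in explicitly so the snippet runs without Harbor installed.

```python
import os
import tempfile
from pathlib import Path


def register_agent(source: Path, package_dir: Path,
                   module_name: str = "open_agent_sdk_local.py") -> Path:
    """Symlink `source` into <package_dir>/agents/installed/, similar to `ln -sf`."""
    target = package_dir / "agents" / "installed" / module_name
    # `ln -sf` replaces an existing entry, so remove any previous link or file first.
    if target.is_symlink() or target.exists():
        target.unlink()
    os.symlink(source.resolve(), target)
    return target


if __name__ == "__main__":
    # Demonstrate against a throwaway directory layout (an assumption for the demo),
    # not against a real Harbor installation.
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        source = root / "agent_local.py"
        source.write_text("# placeholder agent module\n")
        package_dir = root / "harbor"
        (package_dir / "agents" / "installed").mkdir(parents=True)
        link = register_agent(source, package_dir)
        print(link.name, "->", os.readlink(link))
```

With a real installation, `package_dir` would be `Path(harbor.__path__[0])`, the same value the shell command obtains through `python -c`.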
diff --git a/benchmark/terminalbench/harbor/agent_local.py b/benchmark/terminalbench/harbor/agent_local.py
@@ -0,0 +1,144 @@
+"""
+open-agent-sdk Harbor Agent Adapter (Local Development)
+
+This is a standalone variant that clones the GitHub repository and builds it inside the container instead of installing npm packages.
+Use this for testing before publishing to npm.
+
+Usage:
+    # Register agent
+    ln -sf $(pwd)/benchmark/terminalbench/harbor/agent_local.py \
+      $(python -c "import harbor; print(harbor.__path__[0])")/agents/installed/open_agent_sdk_local.py
+
+    # Run with MiniMax
+    export ANTHROPIC_AUTH_TOKEN=your_token
+    export ANTHROPIC_BASE_URL=https://api.minimaxi.com/anthropic/v1
+    harbor jobs start \
+      --path benchmark/terminalbench/test-tasks/hello-world \
+      --agent-import-path "harbor.agents.installed.open_agent_sdk_local:OpenAgentSDKAgentLocal" \
+      --model MiniMax-M2.5
+"""
+
+import os
+from pathlib import Path
+
+from harbor.agents.installed.base import BaseInstalledAgent, ExecInput
+from harbor.models.agent.context import AgentContext
+
+
+# CLI command (installed globally by install script)
+CLI_COMMAND = "oas"
+
+
+def is_minimax_model(model_name: str) -> bool:
+    """Check if the model is a MiniMax model."""
+    return model_name.lower().startswith("minimax")
+
+
+def get_required_env_vars(model_name: str) -> dict[str, str]:
+    """
+    Determine required environment variables based on model name.
+    Returns a dict of {env_var_name: env_var_value}.
+    """
+    env_vars = {}
+    model_lower = model_name.lower()
+
+    # MiniMax uses an Anthropic-compatible endpoint with custom auth
+    if is_minimax_model(model_name):
+        auth_token = os.environ.get("ANTHROPIC_AUTH_TOKEN")
+        base_url = os.environ.get("ANTHROPIC_BASE_URL")
+
+        if not auth_token:
+            raise ValueError(
+                "MiniMax model requires ANTHROPIC_AUTH_TOKEN environment variable. "
+                "This is used for Bearer token authentication."
+            )
+        if not base_url:
+            raise ValueError(
+                "MiniMax model requires ANTHROPIC_BASE_URL environment variable. "
+                "Example: https://api.minimaxi.com/anthropic/v1"
+            )
+
+        env_vars["ANTHROPIC_AUTH_TOKEN"] = auth_token
+        env_vars["ANTHROPIC_BASE_URL"] = base_url
+        return env_vars
+
+    # Standard providers
+    if model_lower.startswith("gemini") or model_lower.startswith("google"):
+        api_key = os.environ.get("GEMINI_API_KEY")
+        if not api_key:
+            raise ValueError("Gemini model requires GEMINI_API_KEY environment variable.")
+        env_vars["GEMINI_API_KEY"] = api_key
+    elif model_lower.startswith("claude"):
+        api_key = os.environ.get("ANTHROPIC_API_KEY")
+        if not api_key:
+            raise ValueError("Claude model requires ANTHROPIC_API_KEY environment variable.")
+        env_vars["ANTHROPIC_API_KEY"] = api_key
+    elif model_lower.startswith("gpt") or model_lower.startswith("openai"):
+        api_key = os.environ.get("OPENAI_API_KEY")
+        if not api_key:
+            raise ValueError("OpenAI model requires OPENAI_API_KEY environment variable.")
+        env_vars["OPENAI_API_KEY"] = api_key
+    else:
+        # Default to Gemini for unknown models
+        api_key = os.environ.get("GEMINI_API_KEY")
+        if not api_key:
+            raise ValueError(f"Unknown model '{model_name}'. Please set GEMINI_API_KEY for default Gemini provider.")
+        env_vars["GEMINI_API_KEY"] = api_key
+
+    return env_vars
+
+
+class OpenAgentSDKAgentLocal(BaseInstalledAgent):
+    """
+    Local development variant of OpenAgentSDKAgent.
+    Installs from GitHub repository instead of npm.
+    """
+
+    @staticmethod
+    def name() -> str:
+        return "open-agent-sdk-local"
+
+    def version(self) -> str | None:
+        return "local-dev"
+
+    @property
+    def _install_agent_template_path(self) -> Path:
+        """Override to use local installation script."""
+        return Path(__file__).parent / "install-open-agent-sdk-local.sh.j2"
+
+    def create_run_agent_commands(self, instruction: str) -> list[ExecInput]:
+        model = self.model_name or "gemini-2.0-flash"
+
+        # Get required environment variables
+        env_vars = get_required_env_vars(model)
+
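+        # Escape instruction for the double-quoted shell string (backslashes first, then ", $ and `)
+        escaped = instruction.replace('\\', '\\\\').replace('"', '\\"').replace('$', '\\$').replace('`', '\\`')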
+
+        # Build CLI command with env vars inline (for Daytona compatibility)
+        # Daytona uses shlex.quote on env values which breaks shell variable assignment
+        env_exports = " && ".join([f'export {k}="{v}"' for k, v in env_vars.items()])
+
+        # Build base command (always use /workspace as cwd for Harbor compatibility)
+        cmd_parts = [
+            'export PATH="$HOME/.bun/bin:$PATH"',
+            env_exports,
+            f'{CLI_COMMAND} -p "{escaped}" --model {model} --cwd /workspace --output-format json'
+        ]
+
+        # For MiniMax, add --provider and --base-url flags
+        if is_minimax_model(model) and "ANTHROPIC_BASE_URL" in env_vars:
+            base_url = env_vars["ANTHROPIC_BASE_URL"]
+            # Use anthropic provider with custom base URL
+            cmd_parts[2] = f'{CLI_COMMAND} --provider anthropic --base-url {base_url} -p "{escaped}" --model {model} --cwd /workspace --output-format json'
+
+        return [
+            ExecInput(
+                command=" && ".join(cmd_parts),
+                timeout_sec=600,
+            )
+        ]
+
+    def populate_context_post_run(self, context: AgentContext) -> None:
+        # Harbor reads stdout from create_run_agent_commands automatically
+        pass
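
Note on `create_run_agent_commands`: the instruction is escaped by hand before being placed inside a double-quoted shell string. As a hedged alternative, the standard-library `shlex.quote` turns each value into a single POSIX shell word, which also covers backticks, backslashes, and embedded quotes. The sketch below assumes the command is executed by a POSIX shell. `build_command` is a hypothetical helper that is not part of this PR; the flags are the ones that appear in the diff.

```python
import shlex
from typing import Optional

CLI_COMMAND = "oas"


def build_command(instruction: str, model: str, base_url: Optional[str] = None) -> str:
    """Assemble the CLI invocation with every dynamic value shell-quoted."""
    parts = [CLI_COMMAND]
    if base_url is not None:
        # Mirrors the MiniMax branch: Anthropic provider with a custom base URL.
        parts += ["--provider", "anthropic", "--base-url", base_url]
    parts += ["-p", instruction, "--model", model,
              "--cwd", "/workspace", "--output-format", "json"]
    # shlex.quote leaves plain words untouched and single-quotes everything else.
    return " ".join(shlex.quote(p) for p in parts)


if __name__ == "__main__":
    tricky = 'echo `whoami` and "$HOME"'
    cmd = build_command(tricky, "MiniMax-M2.5")
    print(cmd)
    # Round-trip check: a POSIX-style split recovers the instruction unchanged.
    assert shlex.split(cmd)[2] == tricky
```

The environment exports would still need to be prepended as in the original method; only the CLI portion is shown here.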
diff --git a/benchmark/terminalbench/harbor/install-open-agent-sdk-local.sh.j2 b/benchmark/terminalbench/harbor/install-open-agent-sdk-local.sh.j2
@@ -0,0 +1,41 @@
+#!/bin/bash
+set -euo pipefail
+
+# Install dependencies if not available
+if command -v apk &> /dev/null; then
+    apk add --no-cache curl bash git unzip
+elif command -v apt-get &> /dev/null; then
+    apt-get update
+    apt-get install -y curl git unzip
+fi
+
+# Install Bun if not present
+if ! command -v bun &> /dev/null; then
+    curl -fsSL https://bun.sh/install | bash
+fi
+export PATH="$HOME/.bun/bin:$PATH"
+
+# Clone repository (shallow clone for faster installation)
+echo "Cloning open-agent-sdk from GitHub..."
+git clone --depth 1 --branch feat/harbor-local-test https://github.com/Octane0411/open-agent-sdk.git /tmp/open-agent-sdk
+
+# Build packages locally
+echo "Building packages..."
+cd /tmp/open-agent-sdk
+bun install
+
+# Build core package
+cd packages/core
+bun run build
+cd ../..
+
+# Link CLI globally
+cd packages/cli
+bun link
+cd ../..
+
+# Ensure bun bin is in PATH
+export PATH="$HOME/.bun/bin:$PATH"
+
+echo "Bun ready: $(bun --version)"
+echo "CLI ready: $(which oas)"
diff --git a/benchmark/terminalbench/test-tasks/hello-world/environment/Dockerfile b/benchmark/terminalbench/test-tasks/hello-world/environment/Dockerfile
@@ -0,0 +1,12 @@
+FROM ubuntu:24.04
+
+# Install basic utilities
+RUN apt-get update && apt-get install -y \
+    curl \
+    bash \
+    git \
+    unzip \
+    && rm -rf /var/lib/apt/lists/*
+
+# Set working directory
+WORKDIR /workspace
diff --git a/benchmark/terminalbench/test-tasks/hello-world/instruction.md b/benchmark/terminalbench/test-tasks/hello-world/instruction.md
@@ -0,0 +1 @@
+Create a file named greeting.txt that contains a friendly greeting message welcoming someone to Harbor framework testing. The greeting must be at least 10 words long.
diff --git a/benchmark/terminalbench/test-tasks/hello-world/solution/solve.sh b/benchmark/terminalbench/test-tasks/hello-world/solution/solve.sh
@@ -0,0 +1,7 @@
+#!/bin/bash
+set -e
+
+# Generate greeting using oas CLI
+oas chat "Generate a friendly greeting message (at least 10 words) welcoming someone to Harbor framework testing. Output only the greeting text, no markdown formatting." > greeting.txt
+
+echo "Greeting generated and saved to greeting.txt"
diff --git a/benchmark/terminalbench/test-tasks/hello-world/task.toml b/benchmark/terminalbench/test-tasks/hello-world/task.toml
@@ -0,0 +1,8 @@
+[metadata]
+name = "hello-world"
+description = "Simple test task to verify Harbor + open-agent-sdk integration"
+version = "1.0.0"
+
+[timeouts]
+agent = 180  # Allow time for LLM API call
+verifier = 180
diff --git a/benchmark/terminalbench/test-tasks/hello-world/tests/test.sh b/benchmark/terminalbench/test-tasks/hello-world/tests/test.sh
@@ -0,0 +1,36 @@
+#!/bin/bash
+
+# Verifier script for hello-world task
+# Checks if greeting.txt exists and contains valid content
+
+REWARD_FILE="/logs/verifier/reward.txt"
+mkdir -p "$(dirname "$REWARD_FILE")"
+
+# Check if greeting.txt exists
+if [ ! -f "greeting.txt" ]; then
+    echo "0" > "$REWARD_FILE"
+    echo "FAIL: greeting.txt not found"
+    exit 0
+fi
+
+# Read content
+CONTENT=$(cat greeting.txt)
+
+# Check if content has at least 10 words
+WORD_COUNT=$(echo "$CONTENT" | wc -w | tr -d ' ')
+if [ "$WORD_COUNT" -lt 10 ]; then
+    echo "0" > "$REWARD_FILE"
+    echo "FAIL: greeting.txt has only $WORD_COUNT words (need at least 10)"
+    exit 0
+fi
+
+# Check if content contains greeting-related keywords (case-insensitive)
+if echo "$CONTENT" | grep -iE "(welcome|greeting|hello|harbor)" > /dev/null; then
+    echo "1" > "$REWARD_FILE"
+    echo "PASS: greeting.txt contains valid greeting ($WORD_COUNT words)"
+    exit 0
+else
+    echo "0" > "$REWARD_FILE"
+    echo "FAIL: greeting.txt doesn't contain greeting-related keywords"
+    exit 0
+fi
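
Note on the verifier: `test.sh` awards a reward of 1 only when `greeting.txt` exists, has at least 10 whitespace-separated words, and matches one of the keywords case-insensitively. The same three checks can be sketched in Python for trying candidate outputs outside the container. `score_greeting` is a hypothetical helper that returns the reward instead of writing it to `/logs/verifier/reward.txt`.

```python
import re
import tempfile
from pathlib import Path

# Same keyword alternation as the grep pattern in test.sh, matched case-insensitively.
KEYWORDS = re.compile(r"welcome|greeting|hello|harbor", re.IGNORECASE)
MIN_WORDS = 10


def score_greeting(path: Path) -> tuple[int, str]:
    """Return (reward, message) using the same three checks as test.sh."""
    if not path.is_file():
        return 0, f"FAIL: {path.name} not found"
    content = path.read_text()
    # str.split() with no arguments counts whitespace-separated words, like `wc -w`.
    word_count = len(content.split())
    if word_count < MIN_WORDS:
        return 0, f"FAIL: {path.name} has only {word_count} words (need at least {MIN_WORDS})"
    if KEYWORDS.search(content):
        return 1, f"PASS: {path.name} contains valid greeting ({word_count} words)"
    return 0, f"FAIL: {path.name} doesn't contain greeting-related keywords"


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        greeting = Path(tmp) / "greeting.txt"
        greeting.write_text(
            "Hello and welcome to Harbor framework testing, we are glad you are here\n"
        )
        print(score_greeting(greeting))
```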
diff --git a/scripts/test-harbor-local.sh b/scripts/test-harbor-local.sh
