# Terminal-Bench 2.0 Integration with PraisonAI

This directory contains examples for integrating PraisonAI Agents with **Terminal-Bench 2.0** via the **Harbor framework** for AI agent benchmarking.

## Overview

[Terminal-Bench 2.0](https://tbench.ai) is a Stanford/Laude Institute benchmark that has become the gold standard for evaluating AI coding agents in real terminal environments. The [Harbor framework](https://harborframework.com) is the official evaluation harness that abstracts container lifecycle, parallelization, and cloud providers.

## Integration Types

### 1. External Agent (Proof-of-Concept)
- **File**: `praisonai_external_agent.py`
- **Purpose**: External agent that drives the Harbor container environment from outside
- **Usage**: Run with the `--agent-import-path` flag

### 2. Installed Agent (Production)
- **File**: `praisonai_installed_agent.py`
- **Purpose**: Production-ready agent installed inside Harbor containers
- **Usage**: Integrates with Harbor's leaderboard system
## Quick Start

### Prerequisites

```bash
# Install Harbor framework
pip install harbor

# Install PraisonAI with shell tools (quotes keep the extras
# specifier intact in shells like zsh)
pip install "praisonaiagents[tools]"
```
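
A quick sanity check that both packages are importable before running the benchmark (the package names match the pip installs above):

```python
import importlib.util


def check_install(packages=("harbor", "praisonaiagents")):
    """Return a mapping of package name -> whether it is importable."""
    return {pkg: importlib.util.find_spec(pkg) is not None for pkg in packages}


if __name__ == "__main__":
    for pkg, ok in check_install().items():
        print(f"{pkg}: {'installed' if ok else 'missing'}")
```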
| 32 | + |
| 33 | +### Running External Agent |
| 34 | + |
| 35 | +```bash |
| 36 | +# Test with oracle agent first |
| 37 | +harbor run -d terminal-bench/terminal-bench-2 -a oracle |
| 38 | + |
| 39 | +# Run PraisonAI external agent |
| 40 | +harbor run -d terminal-bench/terminal-bench-2 \ |
| 41 | + --agent-import-path examples.terminal_bench.praisonai_external_agent:PraisonAIExternalAgent \ |
| 42 | + --model openai/gpt-4o \ |
| 43 | + --ae OPENAI_API_KEY=$OPENAI_API_KEY \ |
| 44 | + -n 4 |
| 45 | +``` |
| 46 | + |
| 47 | +### Running on Cloud (Daytona/E2B/Modal) |
| 48 | + |
| 49 | +```bash |
| 50 | +harbor run -d terminal-bench/terminal-bench-2 \ |
| 51 | + --agent-import-path examples.terminal_bench.praisonai_external_agent:PraisonAIExternalAgent \ |
| 52 | + --model openai/gpt-4o \ |
| 53 | + --env daytona -n 32 \ |
| 54 | + --ae OPENAI_API_KEY=$OPENAI_API_KEY |
| 55 | +``` |
| 56 | + |
| 57 | +## Key Features |
| 58 | + |
| 59 | +- **Shell Execution Bridge**: Wraps Harbor's `BaseEnvironment.exec()` as PraisonAI tool |
| 60 | +- **Auto-Approval**: Bypasses `@require_approval` for container-isolated execution |
| 61 | +- **Token Tracking**: Populates Harbor's `AgentContext` with usage metrics |
| 62 | +- **API Key Forwarding**: Supports environment variable injection |
| 63 | +- **Multi-Agent Support**: Can run `AgentTeam` and `AgentFlow` workflows |
| 64 | + |

## Architecture

```text
┌─────────────────────────────────────────┐
│ Harbor Framework │
│ (Terminal-Bench 2.0 Evaluation) │
├─────────────────────────────────────────┤
│ BaseEnvironment.exec() ←→ bash_tool │
├─────────────────────────────────────────┤
│ PraisonAI Agent │
│ • Uses execute_command tool │
│ • Auto-approval for container safety │
│ • Token tracking and metrics │
└─────────────────────────────────────────┘
```
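
The `exec() ←→ bash_tool` bridge in the diagram can be sketched as follows. This is a minimal illustration: the exact `BaseEnvironment.exec()` signature and return type are assumptions here, and should be adapted to the real Harbor API.

```python
def make_bash_tool(environment):
    """Wrap a Harbor-style environment's exec() as a plain function tool.

    `environment` is assumed to expose exec(command: str) -> result with a
    `stdout` attribute; PraisonAI can register plain functions like the
    returned one as agent tools.
    """

    def execute_command(command: str) -> str:
        """Run a shell command inside the benchmark container."""
        result = environment.exec(command)
        # Fall back to str() if the result object has no stdout attribute.
        return getattr(result, "stdout", str(result))

    return execute_command
```

Because the tool closes over the environment, the agent never handles container plumbing directly; it just calls `execute_command("...")`.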
| 80 | + |
| 81 | +## Files |
| 82 | + |
| 83 | +- `README.md` - This documentation |
| 84 | +- `praisonai_external_agent.py` - External agent implementation |
| 85 | +- `praisonai_installed_agent.py` - Installed agent implementation |
| 86 | +- `job.yaml` - Example Harbor job configuration |
| 87 | +- `test_integration.py` - Integration tests |
| 88 | + |

## Terminal-Bench 2.0 Tasks

The benchmark includes 89 carefully curated tasks covering:
- Compiling code
- Training models
- Setting up servers
- System administration
- File manipulation
- Package management

Each task provides:
- English instruction
- Docker container environment
- Test script (writes reward to `/logs/verifier/reward.txt`)
- Reference oracle solution

## Contributing

1. Test changes with the oracle agent first: `harbor run -d terminal-bench/terminal-bench-2 -a oracle`
2. Run real agentic tests to ensure end-to-end functionality
3. Follow PraisonAI's AGENTS.md architecture guidelines
4. Add both unit tests and integration tests

## Resources

- [Terminal-Bench 2.0 Announcement](https://www.tbench.ai/news/announcement-2-0)
- [Harbor Framework Docs](https://www.harborframework.com/docs)
- [Terminal-Bench Leaderboard](https://tbench.ai/leaderboard)
- [Harbor GitHub Repo](https://github.com/laude-institute/harbor)