
Commit 0e9a67e

Merge pull request #1352 from MervinPraison/claude/issue-1351-20260410-0632
feat: integrate PraisonAI with Terminal-Bench 2.0 via Harbor framework
2 parents 365f750 + e944403 commit 0e9a67e

File tree

8 files changed (+1570 −0 lines)

examples/terminal_bench/README.md

Lines changed: 117 additions & 0 deletions
# Terminal-Bench 2.0 Integration with PraisonAI

This directory contains examples for integrating PraisonAI Agents with **Terminal-Bench 2.0** via the **Harbor framework** for AI agent benchmarking.

## Overview

[Terminal-Bench 2.0](https://tbench.ai) is a Stanford/Laude Institute benchmark that has become the gold standard for evaluating AI coding agents in real terminal environments. The [Harbor framework](https://harborframework.com) is the official evaluation harness; it abstracts away container lifecycle, parallelization, and cloud providers.
## Integration Types

### 1. External Agent (Proof-of-Concept)

- **File**: `praisonai_external_agent.py`
- **Purpose**: External agent that drives the Harbor container environment from outside
- **Usage**: Run with the `--agent-import-path` flag

### 2. Installed Agent (Production)

- **File**: `praisonai_installed_agent.py`
- **Purpose**: Production-ready agent installed inside Harbor containers
- **Usage**: Integrates with Harbor's leaderboard system
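Conceptually, an external agent is a class that Harbor imports by path and drives with a task instruction plus a handle to the container environment. A toy sketch of that shape (the class, the `run` method signature, and `EchoEnv` are illustrative assumptions for this README, not Harbor's actual interface):

```python
# Toy illustration of the external-agent pattern. All names here are
# hypothetical stand-ins; Harbor's real agent interface may differ.
class PraisonAIExternalAgentSketch:
    def __init__(self, model_name: str = "openai/gpt-4o"):
        self.model_name = model_name

    def run(self, instruction: str, env) -> str:
        # A real agent would loop: ask the LLM for a command, execute it
        # via env.exec(), feed the output back, and repeat until done.
        return env.exec(f"echo 'solving: {instruction}'")


class EchoEnv:
    """Stand-in environment that just echoes the command it receives."""

    def exec(self, command: str) -> str:
        return command


agent = PraisonAIExternalAgentSketch()
print(agent.run("compile the program", EchoEnv()))
```

The key design point is that the agent owns the reasoning loop while the environment handle is the only channel into the container.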
## Quick Start

### Prerequisites

```bash
# Install Harbor framework
pip install harbor

# Install PraisonAI with shell tools (quoted so the extras
# specifier survives shells like zsh that glob square brackets)
pip install "praisonaiagents[tools]"
```
### Running External Agent

```bash
# Test with oracle agent first
harbor run -d terminal-bench/terminal-bench-2 -a oracle

# Run PraisonAI external agent
harbor run -d terminal-bench/terminal-bench-2 \
  --agent-import-path examples.terminal_bench.praisonai_external_agent:PraisonAIExternalAgent \
  --model openai/gpt-4o \
  --ae OPENAI_API_KEY=$OPENAI_API_KEY \
  -n 4
```
### Running on Cloud (Daytona/E2B/Modal)

```bash
harbor run -d terminal-bench/terminal-bench-2 \
  --agent-import-path examples.terminal_bench.praisonai_external_agent:PraisonAIExternalAgent \
  --model openai/gpt-4o \
  --env daytona -n 32 \
  --ae OPENAI_API_KEY=$OPENAI_API_KEY
```
## Key Features

- **Shell Execution Bridge**: Wraps Harbor's `BaseEnvironment.exec()` as a PraisonAI tool
- **Auto-Approval**: Bypasses `@require_approval` for container-isolated execution
- **Token Tracking**: Populates Harbor's `AgentContext` with usage metrics
- **API Key Forwarding**: Supports environment variable injection
- **Multi-Agent Support**: Can run `AgentTeam` and `AgentFlow` workflows
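The shell-execution bridge can be sketched as a closure over an environment's `exec()` method, so the agent sees an ordinary tool function. `FakeEnvironment` and `make_bash_tool` below are illustrative stand-ins for this README, not actual Harbor or PraisonAI APIs:

```python
# Minimal sketch of the exec() <-> tool bridge. Names are hypothetical;
# FakeEnvironment stands in for Harbor's BaseEnvironment.
from dataclasses import dataclass, field


@dataclass
class FakeEnvironment:
    """Stand-in for a container environment with an exec() method."""
    history: list = field(default_factory=list)

    def exec(self, command: str) -> str:
        self.history.append(command)
        return f"ran: {command}"


def make_bash_tool(env):
    """Return a tool function bound to one environment instance.

    Auto-approval is implicit here: commands run without prompting
    because they execute inside an isolated benchmark container.
    """
    def execute_command(command: str) -> str:
        """Run a shell command in the benchmark container and return its output."""
        return env.exec(command)
    return execute_command


env = FakeEnvironment()
bash_tool = make_bash_tool(env)
print(bash_tool("ls /app"))
```

Binding the tool to a single environment instance keeps each trial's commands confined to its own container, which is what makes bypassing approval prompts safe.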
## Architecture

```text
┌─────────────────────────────────────────┐
│ Harbor Framework                        │
│ (Terminal-Bench 2.0 Evaluation)         │
├─────────────────────────────────────────┤
│ BaseEnvironment.exec() ←→ bash_tool     │
├─────────────────────────────────────────┤
│ PraisonAI Agent                         │
│ • Uses execute_command tool             │
│ • Auto-approval for container safety    │
│ • Token tracking and metrics            │
└─────────────────────────────────────────┘
```
## Files

- `README.md` - This documentation
- `praisonai_external_agent.py` - External agent implementation
- `praisonai_installed_agent.py` - Installed agent implementation
- `job.yaml` - Example Harbor job configuration
- `test_integration.py` - Integration tests
## Terminal-Bench 2.0 Tasks

The benchmark includes 89 carefully curated tasks covering:

- Compiling code
- Training models
- Setting up servers
- System administration
- File manipulation
- Package management

Each task provides:

- English instruction
- Docker container environment
- Test script (writes reward to `/logs/verifier/reward.txt`)
- Reference oracle solution
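The verifier contract above is simple: after the agent finishes, a test script runs and writes a numeric reward to the expected path. A hypothetical sketch of such a script (the check and file names are illustrative, not taken from a real task; a relative log path is used so the sketch runs anywhere, whereas a real task writes to `/logs/verifier/reward.txt`):

```shell
#!/usr/bin/env bash
# Hypothetical verifier sketch -- checks one condition and writes a
# 0/1 reward where the harness expects it.
set -u

# Real tasks use /logs/verifier; relative here so the sketch is runnable.
LOG_DIR="${LOG_DIR:-logs/verifier}"
mkdir -p "$LOG_DIR"

# Example check: did the agent produce the expected file content?
if grep -q "hello" output.txt 2>/dev/null; then
    echo 1 > "$LOG_DIR/reward.txt"   # task solved
else
    echo 0 > "$LOG_DIR/reward.txt"   # task failed
fi
cat "$LOG_DIR/reward.txt"
```

The harness only reads the final reward file, so a verifier is free to run arbitrary checks (compilation, test suites, file diffs) as long as it ends by writing that single number.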
## Contributing

1. Test changes with the oracle agent first: `harbor run -d terminal-bench/terminal-bench-2 -a oracle`
2. Run real agentic tests to ensure end-to-end functionality
3. Follow PraisonAI's AGENTS.md architecture guidelines
4. Add both unit tests and integration tests

## Resources

- [Terminal-Bench 2.0 Announcement](https://www.tbench.ai/news/announcement-2-0)
- [Harbor Framework Docs](https://www.harborframework.com/docs)
- [Terminal-Bench Leaderboard](https://tbench.ai/leaderboard)
- [Harbor GitHub Repo](https://github.com/laude-institute/harbor)

examples/terminal_bench/job.yaml

Lines changed: 65 additions & 0 deletions
# Harbor Job Configuration for PraisonAI on Terminal-Bench 2.0
#
# Usage:
#   harbor run -c examples/terminal_bench/job.yaml
#
# This configuration runs the PraisonAI external agent on Terminal-Bench 2.0
# tasks with 8 concurrent trials using GPT-4o.

# Dataset: Terminal-Bench 2.0 (89 carefully curated terminal tasks)
dataset: terminal-bench/terminal-bench-2

# Agent configuration
agent:
  # Use an import path for the external agent
  import_path: examples.terminal_bench.praisonai_external_agent:PraisonAIExternalAgent

  # Model configuration
  model_name: openai/gpt-4o

# Environment variables (API keys)
env:
  OPENAI_API_KEY: "${OPENAI_API_KEY}"
  # Optional: add other API keys if using different models
  # ANTHROPIC_API_KEY: "${ANTHROPIC_API_KEY}"
  # GEMINI_API_KEY: "${GEMINI_API_KEY}"

# Execution configuration
n_concurrent: 8  # Run 8 trials in parallel
n_attempts: 1    # Single attempt per task (benchmark standard)

# Optional: environment configuration for cloud execution
# environment:
#   provider: daytona   # Options: local, daytona, e2b, modal, runloop, gke
#   n_concurrent: 32    # Higher concurrency on cloud

# Optional: filter to specific tasks for testing
# task_filter:
#   task_names: ["compile_simple_c", "install_python_package", "create_directory"]

# Optional: advanced configuration
# timeout_sec: 600      # 10-minute timeout per task
# save_logs: true       # Save execution logs
# save_artifacts: true  # Save container artifacts

---

# Alternative configuration using the installed agent (requires Harbor
# integration). Uncomment this section once PraisonAI is integrated into
# Harbor's codebase.

# dataset: terminal-bench/terminal-bench-2
#
# agent:
#   name: praisonai
#   model_name: openai/gpt-4o
#   flags:
#     max_turns: 30
#     verbose: false
#     memory: false
#     auto_approval: true
#
# n_concurrent: 8
# n_attempts: 1
#
# env:
#   OPENAI_API_KEY: "${OPENAI_API_KEY}"
