|
1 | | -# A2A Agent Template |
| 1 | +# Tau2 Green Agent |
2 | 2 |
|
3 | | -A minimal template for building [A2A (Agent-to-Agent)](https://a2a-protocol.org/latest/) green agents compatible with the [AgentBeats](https://agentbeats.dev) platform. |
| 3 | +A green agent for the [Tau2 benchmark](https://github.com/sierra-research/tau2-bench) on the [AgentBeats](https://agentbeats.dev) platform. It evaluates purple agents on customer service tasks across multiple domains (airline, retail, telecom) using simulated users and real tool environments. |
| 4 | + |
| 5 | +## How it works |
| 6 | + |
| 7 | +The green agent runs tau2 evaluations via the [A2A protocol](https://a2a-protocol.org/latest/): |
| 8 | + |
| 9 | +1. Receives an evaluation request with a purple agent URL and config (domain, number of tasks, etc.) |
| 10 | +2. For each task, creates a simulated user and orchestrates a multi-turn conversation between the user, the purple agent, and the domain environment (tools, databases, policies) |
| 11 | +3. Evaluates whether the purple agent completed the task successfully |
| 12 | +4. Returns pass rate, per-task rewards, and timing metrics |
4 | 13 |
|
5 | 14 | ## Project Structure |
6 | 15 |
|
7 | 16 | ``` |
8 | 17 | src/ |
9 | | -├─ server.py # Server setup and agent card configuration |
| 18 | +├─ server.py # A2A server setup and agent card |
10 | 19 | ├─ executor.py # A2A request handling |
11 | | -├─ agent.py # Your agent implementation goes here |
| 20 | +├─ agent.py # Tau2 evaluation logic and RemoteA2AAgent wrapper |
12 | 21 | └─ messenger.py # A2A messaging utilities |
| 22 | +amber/ |
| 23 | +├─ amber-scenario.json5 # Amber scenario (green + purple + gateway) |
| 24 | +├─ amber-manifest-green.json5 # Green agent manifest |
| 25 | +├─ amber-manifest-purple.json5 # Purple agent manifest |
| 26 | +├─ sample.env # Environment variable template |
| 27 | +└─ README.md # Amber compile and run instructions |
13 | 28 | tests/ |
14 | | -└─ test_agent.py # Agent tests |
15 | | -Dockerfile # Docker configuration |
16 | | -pyproject.toml # Python dependencies |
17 | | -amber-manifest.json5 # Amber manifest |
18 | | -.github/ |
19 | | -└─ workflows/ |
20 | | - └─ test-and-publish.yml # CI workflow |
| 29 | +└─ test_agent.py # A2A conformance tests |
| 30 | +setup.sh # Downloads tau2-bench data for local development |
| 31 | +test_run.py # Example evaluation request script |
| 32 | +Dockerfile # Docker image (includes tau2-bench data) |
21 | 33 | ``` |
22 | 34 |
|
23 | | -## Getting Started |
24 | | - |
25 | | -1. **Create your repository** - Click "Use this template" to create your own repository from this template |
26 | | - |
27 | | -2. **Implement your agent** - Add your agent logic to [`src/agent.py`](src/agent.py) |
28 | | - |
29 | | -3. **Configure your agent card** - Fill in your agent's metadata (name, skills, description) in [`src/server.py`](src/server.py) |
30 | | - |
31 | | -4. **Fill out your [Amber](https://github.com/RDI-Foundation/amber) manifest** - Update [`amber-manifest.json5`](amber-manifest.json5) to use your agent in Amber scenarios |
32 | | - |
33 | | -5. **Write your tests** - Add custom tests for your agent in [`tests/test_agent.py`](tests/test_agent.py) |
34 | | - |
35 | | -For a concrete example of implementing a green agent using this template, see this [draft PR](https://github.com/RDI-Foundation/green-agent-template/pull/3). |
36 | | - |
37 | 35 | ## Running Locally |
38 | 36 |
|
39 | 37 | ```bash |
| 38 | +# Clone tau2-bench data |
| 39 | +bash setup.sh |
| 40 | +export TAU2_DATA_DIR=$PWD/tau2-bench/data |
| 41 | + |
40 | 42 | # Install dependencies |
41 | 43 | uv sync |
42 | 44 |
|
43 | | -# Run the server |
| 45 | +# Set API key for the UserSimulator LLM |
| 46 | +export OPENAI_API_KEY=sk-... |
| 47 | +# Or for Gemini: |
| 48 | +# export GEMINI_API_KEY=... |
| 49 | + |
| 50 | +# Start the green agent |
44 | 51 | uv run src/server.py |
45 | 52 | ``` |
46 | 53 |
|
| 54 | +The server starts on port 9009. You'll also need a purple agent running separately (e.g. from [agent-template](https://github.com/RDI-Foundation/agent-template)) for the green agent to evaluate; its URL goes in the evaluation request. |
| 55 | + |
47 | 56 | ## Running with Docker |
48 | 57 |
|
49 | | -```bash |
50 | | -# Build the image |
51 | | -docker build -t my-agent . |
| 58 | +The Docker image bundles tau2-bench data, so no setup script is needed. |
52 | 59 |
|
53 | | -# Run the container |
54 | | -docker run -p 9009:9009 my-agent |
| 60 | +```bash |
| 61 | +docker build -t tau2-green . |
| 62 | +docker run -p 8081:8081 -e OPENAI_API_KEY=sk-... tau2-green |
55 | 63 | ``` |
56 | 64 |
|
57 | | -## Testing |
| 65 | +## Running with Amber |
58 | 66 |
|
59 | | -Run A2A conformance tests against your agent. |
| 67 | +See [amber/README.md](amber/README.md) for instructions on compiling and running the full scenario (green agent + purple agent + gateway) using the Amber CLI. |
60 | 68 |
|
61 | | -```bash |
62 | | -# Install test dependencies |
63 | | -uv sync --extra test |
| 69 | +## Configuration |
| 70 | + |
| 71 | +The following config parameters can be passed in the evaluation request (or via Amber's `assessment_config`): |
64 | 72 |
|
65 | | -# Start your agent (uv or docker; see above) |
| 73 | +| Parameter | Required | Default | Description | |
| 74 | +|-----------|----------|---------|-------------| |
| 75 | +| `domain` | yes | `airline` | `airline`, `retail`, `telecom`, or `mock` | |
| 76 | +| `num_tasks` | no | all | Limit number of tasks to run | |
| 77 | +| `task_ids` | no | all | Specific task IDs to run | |
| 78 | +| `max_steps` | no | `200` | Max orchestrator steps per task | |
| 79 | +| `user_llm` | no | `openai/gpt-4o-mini` | LLM for the UserSimulator (litellm format) | |
| 80 | +| `user_llm_args` | no | `{"temperature": 0.0}` | LLM arguments for the UserSimulator | |
66 | 81 |
|
67 | | -# Run tests against your running agent URL |
| 82 | +To run the full benchmark, submit one evaluation per domain. |
| 83 | + |
| 84 | +## Testing |
| 85 | + |
| 86 | +```bash |
| 87 | +uv sync --extra test |
68 | 88 | uv run pytest --agent-url http://localhost:9009 |
69 | 89 | ``` |
70 | 90 |
|
71 | 91 | ## Publishing |
72 | 92 |
|
73 | | -The repository includes a GitHub Actions workflow that automatically builds, tests, and publishes a Docker image of your agent to GitHub Container Registry. |
74 | | - |
75 | | -If your agent needs API keys or other secrets, add them in Settings → Secrets and variables → Actions → Repository secrets. They'll be available as environment variables during CI tests. |
| 93 | +The CI workflow builds, tests, and publishes a Docker image to GitHub Container Registry on every push to `main` and on version tags: |
76 | 94 |
|
77 | | -- **Push to `main`** → publishes `latest` tag: |
78 | 95 | ``` |
79 | | -ghcr.io/<your-username>/<your-repo-name>:latest |
80 | | -``` |
81 | | - |
82 | | -- **Create a git tag** (e.g. `git tag v1.0.0 && git push origin v1.0.0`) → publishes version tags: |
| 96 | +ghcr.io/rdi-foundation/tau2-agentbeats:latest |
| 97 | +ghcr.io/rdi-foundation/tau2-agentbeats:1.0.0 |
83 | 98 | ``` |
84 | | -ghcr.io/<your-username>/<your-repo-name>:1.0.0 |
85 | | -ghcr.io/<your-username>/<your-repo-name>:1 |
86 | | -``` |
87 | | - |
88 | | -Once the workflow completes, find your Docker image in the Packages section (right sidebar of your repository). Configure the package visibility in package settings. |
89 | | - |
90 | | -> **Note:** Organization repositories may need package write permissions enabled manually (Settings → Actions → General). Version tags must follow [semantic versioning](https://semver.org/) (e.g., `v1.0.0`). |
|
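The Configuration table added in this diff can be read as a set of defaults that each evaluation request overrides. The following is a minimal sketch of that reading, not code from the repository: `DEFAULTS`, `VALID_DOMAINS`, and `build_config` are hypothetical names, and the sketch assumes that omitting `num_tasks` and `task_ids` means "run all tasks". How the resulting config is wrapped into an A2A evaluation request is not shown in the README and is left out here.

```python
from typing import Any

# Defaults copied from the README's Configuration table.
# num_tasks and task_ids have no entry: omitting them is assumed to mean "all".
DEFAULTS: dict[str, Any] = {
    "domain": "airline",
    "max_steps": 200,
    "user_llm": "openai/gpt-4o-mini",
    "user_llm_args": {"temperature": 0.0},
}
OPTIONAL_KEYS = {"num_tasks", "task_ids"}
VALID_DOMAINS = ("airline", "retail", "telecom", "mock")


def build_config(**overrides: Any) -> dict[str, Any]:
    """Merge caller overrides onto the documented defaults (hypothetical helper)."""
    unknown = set(overrides) - set(DEFAULTS) - OPTIONAL_KEYS
    if unknown:
        raise ValueError(f"unknown config keys: {sorted(unknown)}")

    config = {**DEFAULTS, **overrides}
    # Copy the nested dict so callers cannot mutate the shared defaults.
    config["user_llm_args"] = dict(config["user_llm_args"])

    if config["domain"] not in VALID_DOMAINS:
        raise ValueError(f"domain must be one of {VALID_DOMAINS}")
    return config


# "To run the full benchmark, submit one evaluation per domain":
# one config per benchmark domain, here limited to 5 tasks each.
full_benchmark = [
    build_config(domain=domain, num_tasks=5)
    for domain in ("airline", "retail", "telecom")
]
```

Rejecting unknown keys up front is a design choice of this sketch; the green agent itself may accept or ignore extra parameters.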