This scenario evaluates agents on the [tau2-bench](https://github.com/sierra-research/tau2-bench) benchmark, which tests customer service agents on realistic tasks.
## Setup
1. **Download the tau2-bench data** (required once):

   ```bash
   ./scenarios/tau2/setup.sh
   ```
2. **Set your API key** in `.env`:

   ```
   OPENAI_API_KEY=your-key-here
   ```
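If you prefer not to keep the key in a file, exporting it in your shell before running should also work; this assumes the scenario reads `OPENAI_API_KEY` from the environment, which is the standard behaviour for OpenAI clients.

```bash
# Alternative to the .env file: export the key for the current shell session.
export OPENAI_API_KEY=your-key-here
```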
## Running the Benchmark
```bash
TAU2_DATA_DIR=./scenarios/tau2/tau2-bench/data uv run agentbeats-run scenarios/tau2/scenario.toml
```
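The `TAU2_DATA_DIR` value above should match the location where `setup.sh` placed the benchmark data; an optional sanity check before running:

```bash
# Optional: confirm the benchmark data directory exists before running.
ls ./scenarios/tau2/tau2-bench/data
```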
## Configuration
Edit `scenario.toml` to configure the benchmark:
```toml
[config]
domain = "airline"           # airline, retail, telecom, or mock
num_tasks = 5                # number of tasks to run
user_llm = "openai/gpt-4.1"  # LLM for user simulator (optional, defaults to gpt-4.1)
```
The agent LLM defaults to `openai/gpt-4.1` and can be configured via the `--agent-llm` CLI argument in `tau2_agent.py`.
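For example, a direct launch of the purple agent with a different model might look like the sketch below; the launch command and path are assumptions about a typical layout, `--agent-llm` is the flag named above, and the model string is only an illustrative value.

```bash
# Hypothetical example: start the purple agent with a non-default model.
# Adjust the command to match how your setup actually launches tau2_agent.py.
uv run python scenarios/tau2/tau2_agent.py --agent-llm openai/gpt-4o
```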
## Architecture
- **tau2_evaluator.py** (Green Agent): Runs the tau2 Orchestrator, which coordinates the user simulator, environment, and agent
- **tau2_agent.py** (Purple Agent): The agent being tested; it receives task descriptions and responds with tool calls or user responses