
Commit c25a253

New Evals CLI (#1061)
# why

The Evals CLI has a lot of potential.

# what changed

Added a simplified, cleaner version of the CLI that's installable with `pnpm build:cli`. Then, the new CLI can be run with just:

```bash
evals
```

Includes options such as:

```
evals <command> <target> [options]

Commands
  run      Execute evals or benchmarks
  list     List available evals/benchmarks
  config   Get/set default configuration
  help     Show this help message

Examples
  # Run all custom evals
  evals run all

  # Run specific category
  evals run act -e browserbase -t 5

  # Run specific eval
  evals run login

  # Run benchmark
  evals run benchmark:onlineMind2Web -l 10 -f difficulty=easy

  # Configure defaults
  evals config set env browserbase
  evals config set trials 5

Options
  -e, --env          Environment: local|browserbase
  -t, --trials       Number of trials per eval
  -c, --concurrency  Max parallel sessions
  -m, --model        Model override
  -p, --provider     Provider override
  --api              Use Stagehand API

  Benchmark-specific:
  -l, --limit        Max tasks to run
  -s, --sample       Random sample before limit
  -f, --filter       Benchmark filters (key=value)
```

Added a README.md within the evals directory for detailed descriptions.

# test plan

- [x] tested locally
- [x] tested on CI

*The new CLI is backwards compatible and doesn't (yet) require CI updates.*
1 parent 6966201 commit c25a253

File tree

4 files changed (+1214 −139 lines)


evals/README.md

Lines changed: 212 additions & 0 deletions
@@ -0,0 +1,212 @@
# Stagehand Evals CLI

A powerful command-line interface for running Stagehand evaluation suites and benchmarks.

## Installation

```bash
# From the stagehand root directory
pnpm install
pnpm run build:cli
```

## Usage

The evals CLI provides a clean, intuitive interface for running evaluations:

```bash
pnpm evals <command> <target> [options]
```

## Commands

### `run` - Execute evaluations

Run custom evals or external benchmarks.

```bash
# Run all custom evals
pnpm evals run all

# Run specific category
pnpm evals run act
pnpm evals run extract
pnpm evals run observe

# Run specific eval by name
pnpm evals run extract/extract_text

# Run external benchmarks
pnpm evals run benchmark:gaia
```

### `list` - View available evals

List all available evaluations and benchmarks.

```bash
# List all categories and benchmarks
pnpm evals list

# Show detailed task list
pnpm evals list --detailed
```

### `config` - Manage defaults

Configure default settings for all eval runs.

```bash
# View current configuration
pnpm evals config

# Set default values
pnpm evals config set env browserbase
pnpm evals config set trials 5
pnpm evals config set concurrency 10

# Reset to defaults
pnpm evals config reset
pnpm evals config reset trials  # Reset specific key
```

### `help` - Show help

```bash
pnpm evals help
```

## Options

### Core Options

- `-e, --env` - Environment: `local` or `browserbase` (default: local)
- `-t, --trials` - Number of trials per eval (default: 3)
- `-c, --concurrency` - Max parallel sessions (default: 3)
- `-m, --model` - Model override (e.g., gpt-4o, claude-3.5)
- `-p, --provider` - Provider override (openai, anthropic, etc.)
- `--api` - Use Stagehand API instead of SDK

### Benchmark-Specific Options

- `-l, --limit` - Max tasks to run (default: 25)
- `-s, --sample` - Random sample before limit
- `-f, --filter` - Benchmark-specific filters (key=value)

## Examples

### Running Custom Evals

```bash
# Run with custom settings
pnpm evals run act -e browserbase -t 5 -c 10

# Run with specific model
pnpm evals run observe -m gpt-4o -p openai

# Run using API
pnpm evals run extract --api
```

### Running Benchmarks

```bash
# WebBench with filters
pnpm evals run b:webbench -l 10 -f difficulty=easy -f category=READ

# GAIA with sampling
pnpm evals run b:gaia -s 100 -l 25 -f level=1

# WebVoyager with limit
pnpm evals run b:webvoyager -l 50
```

## Available Benchmarks

### OnlineMind2Web (`b:onlineMind2Web`)
Real-world web interaction tasks for evaluating web agents.

### GAIA (`b:gaia`)
General AI Assistant benchmark for complex reasoning.

**Filters:**
- `level`: 1, 2, 3 (difficulty levels)

### WebVoyager (`b:webvoyager`)
Web navigation and task completion benchmark.

### WebBench (`b:webbench`)
Real-world web automation tasks across live websites.

**Filters:**
- `difficulty`: easy, hard
- `category`: READ, CREATE, UPDATE, DELETE, FILE_MANIPULATION
- `use_hitl`: true/false

### OSWorld (`b:osworld`)
Chrome browser automation tasks from the OSWorld benchmark.

**Filters:**
- `source`: Mind2Web, test_task_1, etc.
- `evaluation_type`: url_match, string_match, dom_state, custom

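The flags shown under Examples apply to every benchmark. As an illustrative sketch for the two benchmarks not covered above, combining the documented `-l` and `-f` options with the filter keys listed here (the specific filter values are placeholders, not recommendations):

```bash
# OnlineMind2Web, capped at 10 tasks
pnpm evals run b:onlineMind2Web -l 10

# OSWorld, filtered by task source and evaluation type (example values)
pnpm evals run b:osworld -l 10 -f source=Mind2Web -f evaluation_type=url_match
```
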
## Configuration

The CLI uses a configuration file at `evals/evals.config.json` which contains:

- **defaults**: Default values for CLI options
- **benchmarks**: Metadata for external benchmarks
- **tasks**: Registry of all evaluation tasks

You can modify defaults either through the `config` command or by editing the file directly.

## Environment Variables

While the CLI reduces the need for environment variables, some are still supported for CI/CD:

- `EVAL_ENV` - Override environment setting
- `EVAL_TRIAL_COUNT` - Override trial count
- `EVAL_MAX_CONCURRENCY` - Override concurrency
- `EVAL_PROVIDER` - Override LLM provider
- `USE_API` - Use Stagehand API

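These variables are mainly useful in CI, where passing flags on every invocation is less convenient. A minimal sketch (variable names come from the list above; the values are just examples):

```bash
# Example CI invocation (values are illustrative)
EVAL_ENV=browserbase \
EVAL_TRIAL_COUNT=5 \
EVAL_MAX_CONCURRENCY=10 \
pnpm evals run all
```
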
## Development

### Adding New Evals

1. Create your eval file in `evals/tasks/<category>/`
2. Add it to `evals.config.json` under the `tasks` array
3. Run with: `pnpm evals run <category>/<eval_name>`

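Putting those steps together, a hypothetical sketch (the eval name `my_new_eval` and the `.ts` extension are assumptions for illustration, not part of this README):

```bash
# 1. Create the task file (hypothetical name; assuming a TypeScript task like the rest of the repo)
touch evals/tasks/extract/my_new_eval.ts

# 2. Register it in evals.config.json under the `tasks` array
#    (edit the file by hand; follow the shape of the existing entries)

# 3. Run it
pnpm evals run extract/my_new_eval
```
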
## Troubleshooting

### Command not found

If the `evals` command is not found, make sure you have:

1. Run `pnpm install` from the project root
2. Run `pnpm run build:cli` to compile the CLI

### Build errors

If you encounter build errors:

```bash
# Clean and rebuild
rm -rf dist/evals
pnpm run build:cli
```

### Permission errors

If you get permission errors:

```bash
chmod +x dist/evals/cli.js
```

## Contributing

When adding new features to the CLI:

1. Update the command in `evals/cli.ts`
2. Add new options to the help text
3. Update this README with examples
4. Test with various flag combinations
