# Stagehand Evals CLI

A powerful command-line interface for running Stagehand evaluation suites and benchmarks.

## Installation

```bash
# From the stagehand root directory
pnpm install
pnpm run build:cli
```
## Usage

The evals CLI provides a clean, intuitive interface for running evaluations:

```bash
pnpm evals <command> <target> [options]
```

## Commands

### `run` - Execute evaluations

Run custom evals or external benchmarks.

```bash
# Run all custom evals
pnpm evals run all

# Run a specific category
pnpm evals run act
pnpm evals run extract
pnpm evals run observe

# Run a specific eval by name
pnpm evals run extract/extract_text

# Run external benchmarks
pnpm evals run benchmark:gaia
```

### `list` - View available evals

List all available evaluations and benchmarks.

```bash
# List all categories and benchmarks
pnpm evals list

# Show detailed task list
pnpm evals list --detailed
```
### `config` - Manage defaults

Configure default settings for all eval runs.

```bash
# View current configuration
pnpm evals config

# Set default values
pnpm evals config set env browserbase
pnpm evals config set trials 5
pnpm evals config set concurrency 10

# Reset to defaults
pnpm evals config reset
pnpm evals config reset trials  # Reset a specific key
```

### `help` - Show help

```bash
pnpm evals help
```
## Options

### Core Options

- `-e, --env` - Environment: `local` or `browserbase` (default: `local`)
- `-t, --trials` - Number of trials per eval (default: 3)
- `-c, --concurrency` - Max parallel sessions (default: 3)
- `-m, --model` - Model override (e.g., gpt-4o, claude-3.5)
- `-p, --provider` - Provider override (openai, anthropic, etc.)
- `--api` - Use the Stagehand API instead of the SDK

### Benchmark-Specific Options

- `-l, --limit` - Maximum number of tasks to run (default: 25)
- `-s, --sample` - Randomly sample this many tasks before applying the limit
- `-f, --filter` - Benchmark-specific filters (`key=value`; repeatable)
## Examples

### Running Custom Evals

```bash
# Run with custom settings
pnpm evals run act -e browserbase -t 5 -c 10

# Run with a specific model
pnpm evals run observe -m gpt-4o -p openai

# Run using the API
pnpm evals run extract --api
```

### Running Benchmarks

```bash
# WebBench with filters
pnpm evals run b:webbench -l 10 -f difficulty=easy -f category=READ

# GAIA with sampling
pnpm evals run b:gaia -s 100 -l 25 -f level=1

# WebVoyager with a limit
pnpm evals run b:webvoyager -l 50
```
## Available Benchmarks

### OnlineMind2Web (`b:onlineMind2Web`)
Real-world web interaction tasks for evaluating web agents.

### GAIA (`b:gaia`)
General AI Assistant benchmark for complex reasoning.

**Filters:**
- `level`: 1, 2, 3 (difficulty levels)

### WebVoyager (`b:webvoyager`)
Web navigation and task completion benchmark.

### WebBench (`b:webbench`)
Real-world web automation tasks across live websites.

**Filters:**
- `difficulty`: easy, hard
- `category`: READ, CREATE, UPDATE, DELETE, FILE_MANIPULATION
- `use_hitl`: true/false

### OSWorld (`b:osworld`)
Chrome browser automation tasks from the OSWorld benchmark.

**Filters:**
- `source`: Mind2Web, test_task_1, etc.
- `evaluation_type`: url_match, string_match, dom_state, custom

## Configuration

The CLI uses a configuration file at `evals/evals.config.json` which contains:

- **defaults**: Default values for CLI options
- **benchmarks**: Metadata for external benchmarks
- **tasks**: Registry of all evaluation tasks

You can modify defaults either through the `config` command or by editing the file directly.
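As a rough illustration, a file with those three sections might look like the sketch below. The field names inside each section are hypothetical; check the actual `evals.config.json` in your checkout for the real schema.

```json
{
  "defaults": {
    "env": "local",
    "trials": 3,
    "concurrency": 3
  },
  "benchmarks": {
    "gaia": { "filters": ["level"] }
  },
  "tasks": [
    { "name": "extract/extract_text" }
  ]
}
```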

## Environment Variables

While the CLI reduces the need for environment variables, some are still supported for CI/CD:

- `EVAL_ENV` - Override environment setting
- `EVAL_TRIAL_COUNT` - Override trial count
- `EVAL_MAX_CONCURRENCY` - Override concurrency
- `EVAL_PROVIDER` - Override LLM provider
- `USE_API` - Use Stagehand API
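For example, a CI job could export these variables before invoking the CLI so that no flags need to be hardcoded in the pipeline script (the values below are illustrative):

```shell
# Illustrative CI setup: override eval defaults via the environment
export EVAL_ENV=browserbase
export EVAL_TRIAL_COUNT=5
export EVAL_MAX_CONCURRENCY=10
export EVAL_PROVIDER=openai
```

With these in place, a plain `pnpm evals run all` in a later CI step picks up the overrides.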

## Development

### Adding New Evals

1. Create your eval file in `evals/tasks/<category>/`
2. Add it to `evals.config.json` under the `tasks` array
3. Run with: `pnpm evals run <category>/<eval_name>`
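For step 2, a new entry in the `tasks` array might look like the fragment below. The field names and task name here are made up for illustration; mirror an existing entry in your `evals.config.json` rather than copying this verbatim.

```json
{
  "name": "extract/extract_prices",
  "category": "extract"
}
```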

## Troubleshooting

### Command not found

If the `evals` command is not found, make sure you have:
1. Run `pnpm install` from the project root
2. Run `pnpm run build:cli` to compile the CLI

### Build errors

If you encounter build errors:
```bash
# Clean and rebuild
rm -rf dist/evals
pnpm run build:cli
```

### Permission errors

If you get permission errors:
```bash
chmod +x dist/evals/cli.js
```

## Contributing

When adding new features to the CLI:

1. Update the command in `evals/cli.ts`
2. Add new options to the help text
3. Update this README with examples
4. Test with various flag combinations