---
name: manage-evals
description: This skill should be used when the user asks to "trigger an eval", "run evaluation", "run swebench", "run gaia", "run benchmark", "compare eval runs", "compare evaluation results", "check eval regression", "compare benchmark results", "what changed in the eval", "diff eval runs", or mentions triggering, comparing, or reporting on SWE-bench, GAIA, or other benchmark evaluation results. Provides workflow for triggering evaluations on different benchmarks, finding and comparing runs, and reporting performance differences.
---

# Managing Evaluations

## Overview

OpenHands evaluations produce results stored on a CDN at `https://results.eval.all-hands.dev/`. Each run is identified by a path: `{benchmark}/{model_slug}/{github_run_id}/`. This skill enables triggering evaluation runs, comparing results between runs, and posting performance reports as GitHub PR comments.
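
For example, the aggregated report for a completed run can be fetched straight from the CDN (the path below reuses the concrete run shown in Step 5):

```bash
# Fetch the aggregated report for one run:
#   {benchmark}/{model_slug}/{github_run_id}/
curl -s "https://results.eval.all-hands.dev/swebench/litellm_proxy-claude-sonnet-4-5-20250929/23775164157/output.report.json"
```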

## Quick Start

### Trigger an Evaluation

```bash
python .agents/skills/manage-evals/scripts/manage_evals.py trigger \
  --sdk-ref <BRANCH_OR_TAG> --benchmark swebench --eval-limit 50
```

### Compare Runs

```bash
python .agents/skills/manage-evals/scripts/manage_evals.py compare \
  "<benchmark>/<model_slug>/<run_id>/" \
  --auto-baseline
```

### Compare and Post to PR

```bash
python .agents/skills/manage-evals/scripts/manage_evals.py compare \
  "<benchmark>/<model_slug>/<run_id>/" \
  --auto-baseline \
  --post-comment --pr <PR_NUMBER> --repo OpenHands/software-agent-sdk
```

## Triggering Evaluations

### Using the Script

```bash
# SWE-bench (default) on a PR branch
python .agents/skills/manage-evals/scripts/manage_evals.py trigger \
  --sdk-ref my-feature-branch --eval-limit 50

# GAIA benchmark
python .agents/skills/manage-evals/scripts/manage_evals.py trigger \
  --sdk-ref main --benchmark gaia --eval-limit 50

# With a specific model
python .agents/skills/manage-evals/scripts/manage_evals.py trigger \
  --sdk-ref v1.16.0 --benchmark swebench --model-ids gemini-3-flash --eval-limit 50

# Multiple benchmarks (run the command multiple times)
for bench in swebench gaia; do
  python .agents/skills/manage-evals/scripts/manage_evals.py trigger \
    --sdk-ref main --benchmark "$bench" --eval-limit 50 --reason "Multi-benchmark eval"
done
```

### Available Benchmarks

| Benchmark | Description |
|-----------|-------------|
| `swebench` | SWE-bench (default) — software engineering tasks |
| `gaia` | GAIA — general AI assistant tasks |
| `swtbench` | SWT-bench — software testing tasks |
| `commit0` | Commit0 — commit generation tasks |
| `swebenchmultimodal` | SWE-bench Multimodal — tasks with images |
| `terminalbench` | TerminalBench — terminal interaction tasks |

### Trigger Options

| Option | Default | Description |
|--------|---------|-------------|
| `--sdk-ref` | *(required)* | Branch, tag, or commit SHA to evaluate |
| `--benchmark` | `swebench` | Benchmark to run |
| `--eval-limit` | `50` | Number of instances to evaluate |
| `--model-ids` | *(first in config)* | Comma-separated model IDs from `resolve_model_config.py` |
| `--tool-preset` | `default` | Tool preset: `default`, `gemini`, `gpt5`, `planning` |
| `--agent-type` | `default` | Agent type: `default`, `acp-claude`, `acp-codex` |
| `--instance-ids` | | Specific instance IDs to evaluate (overrides `--eval-limit`; see the example below) |
| `--reason` | | Human-readable reason (shown in notifications) |
| `--benchmarks-branch` | `main` | Branch of the benchmarks repo |
| `--eval-branch` | `main` | Branch of the evaluation repo |

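For example, re-running specific instances (a sketch: the instance IDs are illustrative SWE-bench IDs, and a comma-separated list is assumed by analogy with `--model-ids`):

```bash
# Re-run two specific instances instead of sampling via --eval-limit
python .agents/skills/manage-evals/scripts/manage_evals.py trigger \
  --sdk-ref main --benchmark swebench \
  --instance-ids "django__django-11099,astropy__astropy-12907" \
  --reason "Re-run two specific instances"
```
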
### Via PR Labels (Alternative)

Adding a label to a PR also triggers evaluations (a CLI sketch follows the list):
- `run-eval-1` — 1 instance (quick sanity check)
- `run-eval-50` — 50 instances (standard comparison)
- `run-eval-200` — 200 instances
- `run-eval-500` — 500 instances (full benchmark)
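
The same labels can be applied from the command line with the standard GitHub CLI:

```bash
# Apply the standard 50-instance eval label to a PR (replace <PR_NUMBER>)
gh pr edit <PR_NUMBER> --repo OpenHands/software-agent-sdk --add-label run-eval-50
```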

## Comparing Evaluation Runs

### Step 1: Find the Current PR's Eval Run

Eval runs are triggered by adding labels like `run-eval-50` to a PR. The `all-hands-bot` posts a comment with results when complete.

**Option A — From bot comments on the PR:**

```bash
gh api repos/OpenHands/software-agent-sdk/issues/<PR_NUMBER>/comments \
  --jq '.[] | select(.user.login == "all-hands-bot") | .body' \
  | grep -o 'Evaluation:.*' | head -1
```

The evaluation name follows the format `{github_run_id}-{model_slug_short}` (e.g., `23775164157-claude-son`). Extract the `github_run_id` from this name.
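
Because the run ID is everything before the first hyphen, plain shell parameter expansion recovers it (a sketch using the example name above):

```bash
# Strip everything from the first "-" onward to get the GitHub run ID
eval_name="23775164157-claude-son"
run_id="${eval_name%%-*}"
echo "$run_id"  # 23775164157
```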

**Option B — From the "Evaluation Triggered" bot comment:**

```bash
gh api repos/OpenHands/software-agent-sdk/issues/<PR_NUMBER>/comments \
  --jq '.[] | select(.body | test("Evaluation Triggered")) | .body'
```

This comment contains the SDK commit SHA. Cross-reference it with the daily metadata to find the run ID.

**Option C — From daily metadata:**

```bash
curl -s "https://results.eval.all-hands.dev/metadata/$(date -u +%Y-%m-%d).txt"
```

Each line is a run path. Match by benchmark and model to find the run.

### Step 2: Identify the Run Path Components

A run path has three components (a slug sketch follows the list):
- **benchmark**: `swebench`, `gaia`, `swtbench`, `commit0`, `swebenchmultimodal`, `terminalbench`
- **model_slug**: Derived from the model name by replacing each of the characters `/`, `:`, `@`, and `.` with `-` (e.g., `litellm_proxy-claude-sonnet-4-5-20250929`)
- **run_id**: The GitHub Actions workflow run ID from the `OpenHands/evaluation` repo
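
The slug rule can be applied mechanically (a sketch: assumes GNU `tr`, which pads the replacement set, and an underlying model name of `litellm_proxy/claude-sonnet-4-5-20250929`):

```bash
# Replace each of / : @ . with "-" to form the model slug
echo "litellm_proxy/claude-sonnet-4-5-20250929" | tr '/:@.' '-'
# -> litellm_proxy-claude-sonnet-4-5-20250929
```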

### Step 3: Verify Results Exist

```bash
curl -sI "https://results.eval.all-hands.dev/<benchmark>/<model_slug>/<run_id>/output.report.json" | head -1
```

A `200` status confirms the run completed and results are available.
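
If the run may still be in progress, a small polling loop can wait for the report to appear (a sketch; the five-minute interval is arbitrary):

```bash
# Poll until the report exists; -f makes curl treat non-2xx responses as failures
url="https://results.eval.all-hands.dev/<benchmark>/<model_slug>/<run_id>/output.report.json"
until curl -sfI "$url" >/dev/null; do
  echo "Results not ready yet; retrying in 5 minutes..."
  sleep 300
done
echo "Results available: $url"
```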

### Step 4: Find a Baseline for Comparison

**Automatic**: The comparison script's `--auto-baseline` flag scans metadata files backward up to 14 days to find the most recent completed run with the same benchmark and model.

**Manual**: Inspect metadata files or other PR bot comments to identify a specific run:

```bash
# Check today's runs
curl -s "https://results.eval.all-hands.dev/metadata/$(date -u +%Y-%m-%d).txt" | grep "swebench/litellm_proxy-claude"

# Check yesterday's runs
curl -s "https://results.eval.all-hands.dev/metadata/$(date -u -d yesterday +%Y-%m-%d).txt" | grep "swebench/litellm_proxy-claude"
```
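
To search further back by hand, a loop over the last 14 days mirrors what `--auto-baseline` does (a sketch; assumes GNU `date`):

```bash
# Scan metadata files backward up to 14 days for a matching run
for i in $(seq 0 13); do
  day=$(date -u -d "-$i day" +%Y-%m-%d)
  match=$(curl -s "https://results.eval.all-hands.dev/metadata/$day.txt" \
    | grep "swebench/litellm_proxy-claude" | head -1)
  if [ -n "$match" ]; then
    echo "$day: $match"
    break
  fi
done
```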

### Step 5: Run the Comparison

```bash
python .agents/skills/manage-evals/scripts/manage_evals.py compare \
  "swebench/litellm_proxy-claude-sonnet-4-5-20250929/23775164157/" \
  --baseline "swebench/litellm_proxy-claude-sonnet-4-5-20250929/23773892085/"
```

Or with auto-baseline and PR comment posting:

```bash
python .agents/skills/manage-evals/scripts/manage_evals.py compare \
  "swebench/litellm_proxy-claude-sonnet-4-5-20250929/23775164157/" \
  --auto-baseline \
  --post-comment --pr 2334 --repo OpenHands/software-agent-sdk
```

## Available Data Per Run

Each run stores files at `https://results.eval.all-hands.dev/{run_path}/` (see the inspection sketch after the table):

| File | Description |
|------|-------------|
| `metadata/params.json` | Run parameters: SDK commit, PR number, model, eval_limit, triggered_by |
| `output.report.json` | Aggregated results: resolved/submitted/total counts and instance IDs |
| `cost_report.jsonl` | Per-instance cost data |
| `results.tar.gz` | Full archive with all outputs |
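
These files can be inspected directly with `curl` and `jq`. The exact JSON schema is not documented here, so the sketch below only lists top-level keys rather than assuming field names:

```bash
run="swebench/litellm_proxy-claude-sonnet-4-5-20250929/23775164157"
base="https://results.eval.all-hands.dev/$run"

# Which parameters was the run triggered with?
curl -s "$base/metadata/params.json" | jq 'keys'

# Peek at per-instance costs (JSON Lines, one record per line)
curl -s "$base/cost_report.jsonl" | head -3
```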

## Dashboard

The eval monitor dashboard provides a visual view of runs:

```
https://openhands-eval-monitor.vercel.app/?run={benchmark}/{model_slug}/{run_id}/
```
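
For example, the run used throughout this document maps to:

```
https://openhands-eval-monitor.vercel.app/?run=swebench/litellm_proxy-claude-sonnet-4-5-20250929/23775164157/
```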

## Interpreting Results

- **Success rate** = resolved / min(eval_limit, total_instances) (worked example below)
- A 50-instance sample has natural variance of ±2-4 resolved instances between runs
- Focus on **instance-level changes** (gained/lost) to distinguish regressions from noise
- If the same set of instances is resolved, the difference is likely noise
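
As a worked example (numbers are illustrative): 42 resolved out of an eval_limit of 50 is 42 / 50 = 84%. The ±2-4 instance variance above translates to ±4-8 percentage points on a 50-instance run, so a baseline at 80% and a new run at 84% with largely overlapping resolved sets is most plausibly noise rather than a real improvement.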

## Additional Resources

### Reference Files
- **`references/eval-infrastructure.md`** — Detailed documentation on the evaluation infrastructure, GCS paths, metadata format, and workflow triggers

### Scripts
- **`scripts/manage_evals.py`** — Standalone comparison script with auto-baseline detection and GitHub comment posting