Commit f9c568d

Merge branch 'main' into feature/multi-model-debate-pr-review
2 parents e1a3a4a + 32573fb

172 files changed

Lines changed: 11673 additions & 1163 deletions


.agents/skills/custom-codereview-guide.md

Lines changed: 26 additions & 1 deletion

@@ -45,7 +45,7 @@ Examples of straightforward and low-risk PRs you should approve (non-exhaustive)
- **Documentation-only changes**: Docstring updates, clarifying notes, API documentation improvements
- **Simple additions**: Adding entries to lists/dictionaries following existing patterns
- **Test-only changes**: Adding or updating tests without changing production code
-- **Dependency updates**: Version bumps with passing CI
+- **Dependency updates**: Version bumps with passing CI, unless the updated package is newer than the repo's 7-day freshness guardrail described in the Security section below

### When NOT to APPROVE - Blocking Issues

@@ -54,6 +54,7 @@ Examples of straightforward and low-risk PRs you should approve (non-exhaustive)
- **Package version bumps in non-release PRs**: If any `pyproject.toml` file has changes to the `version` field (e.g., `version = "1.12.0"` → `version = "1.13.0"`), and the PR is NOT explicitly a release PR (title/description doesn't indicate it's a release), **DO NOT APPROVE**. Version numbers should only be changed in dedicated release PRs managed by maintainers.
  - Check: Look for changes to `version = "..."` in any `*/pyproject.toml` files (see the sketch after the examples below)
  - Exception: PRs with titles like "release: v1.x.x" or "chore: bump version to 1.x.x" from maintainers
+- **Too-new dependency uploads**: If a dependency bump pulls in a package uploaded within the repo's 7-day freshness window, **DO NOT APPROVE**. See the Security section below for the exact review instructions and the Dependabot / `tool.uv.exclude-newer` caveat.

Examples:
- A PR adding a new model to `resolve_model_config.py` or `verified_models.py` with corresponding test updates
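
A minimal sketch of that `version`-field check, assuming the GitHub CLI (`gh`) is available; `<PR_NUMBER>` is a placeholder:

```bash
# Sketch only: flags any added or removed "version = ..." line in the PR diff.
# Note: this does not by itself restrict matches to pyproject.toml files.
gh pr diff <PR_NUMBER> --repo OpenHands/software-agent-sdk \
  | grep -E '^[+-][[:space:]]*version[[:space:]]*=' \
  && echo "version field touched: confirm this is a release PR"
```
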
@@ -71,6 +72,30 @@ Use COMMENT when you have feedback or concerns:

If there are significant issues, leave detailed comments explaining the concerns—but let a human maintainer decide whether to block the PR.

## Security

### Dependency freshness / supply-chain guardrail

This repository intentionally uses a workspace-wide `uv` resolver guardrail:

- Root `pyproject.toml`: `[tool.uv] exclude-newer = "7 days"`

**Important:** Dependabot does **not** currently honor that `uv` guardrail when it opens `uv.lock` update PRs for this repo's workspace setup. A Dependabot PR can therefore bump to a version that was uploaded **less than 7 days ago**, even though a local `uv lock` would normally exclude it.

When reviewing dependency update PRs (`uv.lock`, `pyproject.toml`, `requirements*.txt`, etc.), explicitly check for **too-new package uploads** (a sketch follows the list):

1. Check the package upload timestamp on the package index.
2. For `uv.lock`, use the per-file `upload-time` metadata in the changed package entry.
3. Treat `upload-time` as the upload time of that specific distribution file to the package index (for example, the wheel uploaded to PyPI) — not the Git tag time or GitHub release time.
4. Compare that timestamp against the current date and the repo's 7-day freshness window.
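
A minimal sketch of that timestamp check, assuming `curl`, `jq`, and GNU `date`; the package name and version are placeholders for whatever the PR bumps:

```bash
# Sketch only: PKG and VERSION stand in for the bumped dependency.
PKG=example-package
VERSION=1.2.3

# Earliest distribution upload time for this release on PyPI:
curl -s "https://pypi.org/pypi/${PKG}/${VERSION}/json" \
  | jq -r '.urls[].upload_time_iso_8601' | sort | head -n 1

# Cutoff for the repo's 7-day guardrail (GNU date):
date -u -d '7 days ago' '+%Y-%m-%dT%H:%M:%SZ'

# Cross-check the per-file upload-time entries recorded in uv.lock
# (wheel filenames normalize '-' in package names to '_'):
grep 'upload-time' uv.lock | grep "${PKG//-/_}"
```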

If the updated package was uploaded **within the last 7 days**, treat it as a real security / supply-chain concern:

- Do **NOT** approve the PR.
- Leave a **COMMENT** review that clearly calls out the package name, version, upload time, and that it is newer than the repo's 7-day guardrail.
- Explain that this can happen because Dependabot currently ignores `tool.uv.exclude-newer` for this repo's workspace updates.
- Ask a human maintainer to decide whether to wait until the package ages past the guardrail or to merge intentionally despite the freshness risk.

## Core Principles

1. **Simplicity First**: Question complexity. If something feels overcomplicated, ask "what's the use case?" and seek simpler alternatives. Features should solve real problems, not imaginary ones.

Lines changed: 207 additions & 0 deletions
@@ -0,0 +1,207 @@
---
name: manage-evals
description: This skill should be used when the user asks to "trigger an eval", "run evaluation", "run swebench", "run gaia", "run benchmark", "compare eval runs", "compare evaluation results", "check eval regression", "compare benchmark results", "what changed in the eval", "diff eval runs", or mentions triggering, comparing, or reporting on SWE-bench, GAIA, or other benchmark evaluation results. Provides workflow for triggering evaluations on different benchmarks, finding and comparing runs, and reporting performance differences.
---

# Managing Evaluations

## Overview

OpenHands evaluations produce results stored on a CDN at `https://results.eval.all-hands.dev/`. Each run is identified by a path: `{benchmark}/{model_slug}/{github_run_id}/`. This skill enables triggering evaluation runs, comparing results between runs, and posting performance reports as GitHub PR comments.

## Quick Start

### Trigger an Evaluation

```bash
python .agents/skills/manage-evals/scripts/manage_evals.py trigger \
  --sdk-ref <BRANCH_OR_TAG> --benchmark swebench --eval-limit 50
```

### Compare Runs

```bash
python .agents/skills/manage-evals/scripts/manage_evals.py compare \
  "<benchmark>/<model_slug>/<run_id>/" \
  --auto-baseline
```

### Compare and Post to PR

```bash
python .agents/skills/manage-evals/scripts/manage_evals.py compare \
  "<benchmark>/<model_slug>/<run_id>/" \
  --auto-baseline \
  --post-comment --pr <PR_NUMBER> --repo OpenHands/software-agent-sdk
```

## Triggering Evaluations

### Using the Script

```bash
# SWE-bench (default) on a PR branch
python .agents/skills/manage-evals/scripts/manage_evals.py trigger \
  --sdk-ref my-feature-branch --eval-limit 50

# GAIA benchmark
python .agents/skills/manage-evals/scripts/manage_evals.py trigger \
  --sdk-ref main --benchmark gaia --eval-limit 50

# With a specific model
python .agents/skills/manage-evals/scripts/manage_evals.py trigger \
  --sdk-ref v1.16.0 --benchmark swebench --model-ids gemini-3-flash --eval-limit 50

# Multiple benchmarks (run the command multiple times)
for bench in swebench gaia; do
  python .agents/skills/manage-evals/scripts/manage_evals.py trigger \
    --sdk-ref main --benchmark "$bench" --eval-limit 50 --reason "Multi-benchmark eval"
done
```

### Available Benchmarks

| Benchmark | Description |
|-----------|-------------|
| `swebench` | SWE-bench (default) — software engineering tasks |
| `gaia` | GAIA — general AI assistant tasks |
| `swtbench` | SWT-bench — software testing tasks |
| `commit0` | Commit0 — commit generation tasks |
| `swebenchmultimodal` | SWE-bench Multimodal — tasks with images |
| `terminalbench` | TerminalBench — terminal interaction tasks |

### Trigger Options

| Option | Default | Description |
|--------|---------|-------------|
| `--sdk-ref` | *(required)* | Branch, tag, or commit SHA to evaluate |
| `--benchmark` | `swebench` | Benchmark to run |
| `--eval-limit` | `50` | Number of instances to evaluate |
| `--model-ids` | *(first in config)* | Comma-separated model IDs from `resolve_model_config.py` |
| `--tool-preset` | `default` | Tool preset: `default`, `gemini`, `gpt5`, `planning` |
| `--agent-type` | `default` | Agent type: `default`, `acp-claude`, `acp-codex` |
| `--instance-ids` | | Specific instance IDs to evaluate (overrides eval-limit) |
| `--reason` | | Human-readable reason (shown in notifications) |
| `--benchmarks-branch` | `main` | Branch of the benchmarks repo |
| `--eval-branch` | `main` | Branch of the evaluation repo |

### Via PR Labels (Alternative)

Adding a label to a PR also triggers evaluations (a CLI sketch follows the list):

- `run-eval-1` — 1 instance (quick sanity check)
- `run-eval-50` — 50 instances (standard comparison)
- `run-eval-200` — 200 instances
- `run-eval-500` — 500 instances (full benchmark)
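
For example, the standard label can be applied with the GitHub CLI; `<PR_NUMBER>` is a placeholder:

```bash
# Trigger a 50-instance eval run by labeling the PR.
gh pr edit <PR_NUMBER> --repo OpenHands/software-agent-sdk --add-label run-eval-50
```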

## Comparing Evaluation Runs

### Step 1: Find the Current PR's Eval Run

Eval runs are triggered by adding labels like `run-eval-50` to a PR. The `all-hands-bot` posts a comment with results when complete.

**Option A — From bot comments on the PR:**

```bash
gh api repos/OpenHands/software-agent-sdk/issues/<PR_NUMBER>/comments \
  --jq '.[] | select(.user.login == "all-hands-bot") | .body' \
  | grep -o 'Evaluation:.*' | head -1
```

The evaluation name follows the format `{github_run_id}-{model_slug_short}` (e.g., `23775164157-claude-son`). Extract the `github_run_id` from this.
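
For instance, the leading run ID can be split off that evaluation name:

```bash
# Everything before the first hyphen is the github_run_id.
echo "23775164157-claude-son" | cut -d- -f1   # -> 23775164157
```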

**Option B — From the "Evaluation Triggered" bot comment:**

```bash
gh api repos/OpenHands/software-agent-sdk/issues/<PR_NUMBER>/comments \
  --jq '.[] | select(.body | test("Evaluation Triggered")) | .body'
```

This contains the SDK commit SHA. Cross-reference with daily metadata to find the run ID.
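
One possible cross-reference, sketched on the assumption that a run's `metadata/params.json` (see "Available Data Per Run" below) contains the SDK commit SHA verbatim and that metadata lines end with `/`; `<COMMIT_SHA>` is a placeholder:

```bash
# Sketch only: scan today's runs for one whose params mention the SDK commit.
SDK_SHA=<COMMIT_SHA>
for run in $(curl -s "https://results.eval.all-hands.dev/metadata/$(date -u +%Y-%m-%d).txt"); do
  curl -s "https://results.eval.all-hands.dev/${run}metadata/params.json" \
    | grep -q "$SDK_SHA" && echo "match: $run"
done
```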

**Option C — From daily metadata:**

```bash
curl -s "https://results.eval.all-hands.dev/metadata/$(date -u +%Y-%m-%d).txt"
```

Each line is a run path. Match by benchmark and model to find the run.

### Step 2: Identify the Run Path Components

A run path has three components:

- **benchmark**: `swebench`, `gaia`, `swtbench`, `commit0`, `swebenchmultimodal`, `terminalbench`
- **model_slug**: Derived from the model name with `/:@.` replaced by `-` (e.g., `litellm_proxy-claude-sonnet-4-5-20250929`; see the sketch after this list)
- **run_id**: The GitHub Actions workflow run ID from the `OpenHands/evaluation` repo
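
The slug substitution can be reproduced with `tr`; the model name below is illustrative rather than taken from the repo's config:

```bash
# Map each of / : @ . to '-' (explicit second set for portability).
echo "litellm_proxy/claude-sonnet-4-5-20250929" | tr '/:@.' '----'
# -> litellm_proxy-claude-sonnet-4-5-20250929
```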

### Step 3: Verify Results Exist

```bash
curl -sI "https://results.eval.all-hands.dev/<benchmark>/<model_slug>/<run_id>/output.report.json" | head -1
```

A `200` status confirms the run completed and results are available.

### Step 4: Find a Baseline for Comparison

**Automatic**: The comparison script's `--auto-baseline` flag scans metadata files backward up to 14 days to find the most recent completed run with the same benchmark and model.

**Manual**: Inspect metadata files or other PR bot comments to identify a specific run:

```bash
# Check today's runs
curl -s "https://results.eval.all-hands.dev/metadata/$(date -u +%Y-%m-%d).txt" | grep "swebench/litellm_proxy-claude"

# Check yesterday's runs
curl -s "https://results.eval.all-hands.dev/metadata/$(date -u -d yesterday +%Y-%m-%d).txt" | grep "swebench/litellm_proxy-claude"
```

### Step 5: Run the Comparison

```bash
python .agents/skills/manage-evals/scripts/manage_evals.py compare \
  "swebench/litellm_proxy-claude-sonnet-4-5-20250929/23775164157/" \
  --baseline "swebench/litellm_proxy-claude-sonnet-4-5-20250929/23773892085/"
```

Or with auto-baseline and PR comment posting:

```bash
python .agents/skills/manage-evals/scripts/manage_evals.py compare \
  "swebench/litellm_proxy-claude-sonnet-4-5-20250929/23775164157/" \
  --auto-baseline \
  --post-comment --pr 2334 --repo OpenHands/software-agent-sdk
```

## Available Data Per Run

Each run stores files at `https://results.eval.all-hands.dev/{run_path}/`:

| File | Description |
|------|-------------|
| `metadata/params.json` | Run parameters: SDK commit, PR number, model, eval_limit, triggered_by |
| `output.report.json` | Aggregated results: resolved/submitted/total counts and instance IDs |
| `cost_report.jsonl` | Per-instance cost data |
| `results.tar.gz` | Full archive with all outputs |
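
A sketch of pulling the aggregate report with the tools used elsewhere in this guide; the JSON keys `resolved` and `total` are assumptions about the report's shape, not confirmed field names:

```bash
# Sketch only: rough success rate from the aggregate report.
RUN="swebench/litellm_proxy-claude-sonnet-4-5-20250929/23775164157/"
curl -s "https://results.eval.all-hands.dev/${RUN}output.report.json" \
  | jq '{resolved, total, rate: (.resolved / .total)}'
# "Interpreting Results" below defines the rate against
# min(eval_limit, total_instances) rather than the raw total.
```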

## Dashboard

The eval monitor dashboard provides a visual view of runs:

```
https://openhands-eval-monitor.vercel.app/?run={benchmark}/{model_slug}/{run_id}/
```
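
For example, the Step 5 run above maps to `https://openhands-eval-monitor.vercel.app/?run=swebench/litellm_proxy-claude-sonnet-4-5-20250929/23775164157/`.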

## Interpreting Results

- **Success rate** = resolved / min(eval_limit, total_instances)
- A 50-instance sample has natural variance of ±2-4 resolved instances between runs
- Focus on **instance-level changes** (gained/lost) to understand regressions vs. noise (a sketch follows this list)
- If the same set of instances is resolved, the difference is likely noise
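
A sketch of that instance-level diff, assuming the resolved-instance IDs are exposed under a key like `resolved_ids` in `output.report.json` (an assumed name; the `compare` command performs this comparison properly):

```bash
# Sketch only: "resolved_ids" is an assumed JSON key.
BASE="swebench/<model_slug>/<baseline_run_id>/"
NEW="swebench/<model_slug>/<run_id>/"
diff \
  <(curl -s "https://results.eval.all-hands.dev/${BASE}output.report.json" | jq -r '.resolved_ids[]' | sort) \
  <(curl -s "https://results.eval.all-hands.dev/${NEW}output.report.json" | jq -r '.resolved_ids[]' | sort)
# "<" lines were resolved only in the baseline (lost); ">" only in the new run (gained).
```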

## Additional Resources

### Reference Files

- **`references/eval-infrastructure.md`** — Detailed documentation on the evaluation infrastructure, GCS paths, metadata format, and workflow triggers

### Scripts

- **`scripts/manage_evals.py`** — Standalone comparison script with auto-baseline detection and GitHub comment posting
