2 changes: 1 addition & 1 deletion .agents/skills/run-eval.md
@@ -32,7 +32,7 @@ curl -X POST \

 **Key parameters:**
 - `benchmark`: `swebench`, `swebenchmultimodal`, `gaia`, `swtbench`, `commit0`, `multiswebench`
-- `eval_limit`: `1`, `50`, `100`, `200`, `500`
+- `eval_limit`: Any positive integer (e.g., `1`, `10`, `50`, `200`)
 - `model_ids`: See `.github/run-eval/resolve_model_config.py` for available models
 - `benchmarks_branch`: Use a feature branch from the benchmarks repo to test benchmark changes before merging

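The hunk above is truncated at `curl -X POST \`, so the full request is not shown. A minimal sketch of such a dispatch call, assuming the standard GitHub `workflow_dispatch` REST API (`OWNER/REPO`, the branch, and the input values are illustrative placeholders, and `GITHUB_TOKEN` must hold a token with `workflow` scope):

```shell
#!/bin/sh
# Build the workflow_dispatch payload. Input names match the workflow's
# declared inputs; the chosen values here are examples only.
payload() {
  cat <<'EOF'
{
  "ref": "main",
  "inputs": {
    "benchmark": "swebench",
    "eval_limit": "10"
  }
}
EOF
}

# Send the dispatch request. OWNER/REPO is a placeholder; replace with the
# actual repository before running.
run_eval() {
  curl -X POST \
    -H "Authorization: Bearer ${GITHUB_TOKEN}" \
    -H "Accept: application/vnd.github+json" \
    "https://api.github.com/repos/OWNER/REPO/actions/workflows/run-eval.yml/dispatches" \
    -d "$(payload)"
}
```

A successful dispatch returns HTTP 204 with an empty body; the run then appears under the repository's Actions tab.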
10 changes: 2 additions & 8 deletions .github/workflows/run-eval.yml
@@ -32,16 +32,10 @@ on:
         default: false
         type: boolean
       eval_limit:
-        description: Number of instances to run
+        description: Number of instances to run (any positive integer)
         required: false
         default: '1'
-        type: choice
-        options:
-          - '1'
-          - '100'
-          - '50'
-          - '200'
-          - '500'
+        type: string
       model_ids:
         description: Comma-separated model IDs to evaluate. Must be keys of MODELS in resolve_model_config.py. Defaults to first model in that dict.