Commit c2f3263

Change eval_limit from choice to free-form string input (OpenHands#2261)
Cherry-pick from upstream e96bce3
1 parent c2034c3 commit c2f3263

2 files changed, 11 additions and 9 deletions

.agents/skills/run-eval.md

Lines changed: 1 addition & 1 deletion
@@ -32,7 +32,7 @@ curl -X POST \
 
 **Key parameters:**
 - `benchmark`: `swebench`, `swebenchmultimodal`, `gaia`, `swtbench`, `commit0`, `multiswebench`
-- `eval_limit`: `1`, `50`, `100`, `200`, `500`
+- `eval_limit`: Any positive integer (e.g., `1`, `10`, `50`, `200`)
 - `model_ids`: See `.github/run-eval/resolve_model_config.py` for available models
 - `benchmarks_branch`: Use feature branch from the benchmarks repo to test benchmark changes before merging
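
With this change `eval_limit` is sent as a free-form string rather than picked from a fixed list. A minimal sketch of a request body for the dispatch call, assuming GitHub's workflow dispatch endpoint (which takes `ref` and `inputs`); the helper name `build_dispatch_payload`, the branch `main`, and the sample values are hypothetical, and the sketch assumes the values need no JSON escaping:

```shell
#!/usr/bin/env bash
# Hypothetical helper: prints a workflow_dispatch request body.
# "ref" and "inputs" are fields of GitHub's workflow dispatch endpoint;
# the branch name and input values used below are placeholders.
build_dispatch_payload() {
  local ref="$1" benchmark="$2" eval_limit="$3"
  # eval_limit is quoted in the JSON because the input is declared as a string.
  # No JSON escaping is done here, so values must be plain tokens.
  printf '{"ref":"%s","inputs":{"benchmark":"%s","eval_limit":"%s"}}\n' \
    "$ref" "$benchmark" "$eval_limit"
}

build_dispatch_payload main swebench 10
```

The printed body could then be passed to the existing `curl -X POST` call with `-d`.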

.github/workflows/run-eval.yml

Lines changed: 10 additions & 8 deletions
@@ -31,16 +31,10 @@ on:
         default: false
         type: boolean
       eval_limit:
-        description: Number of instances to run
+        description: Number of instances to run (any positive integer)
         required: false
         default: '1'
-        type: choice
-        options:
-          - '1'
-          - '100'
-          - '50'
-          - '200'
-          - '500'
+        type: string
       model_ids:
         description: Comma-separated model IDs to evaluate. Must be keys of MODELS in resolve_model_config.py. Defaults to first model in that
           dict.
@@ -123,6 +117,14 @@ jobs:
         with:
           python-version: '3.13'
 
+      - name: Validate eval_limit
+        if: github.event_name == 'workflow_dispatch'
+        run: |
+          if ! [[ "${{ github.event.inputs.eval_limit }}" =~ ^[1-9][0-9]*$ ]]; then
+            echo "Error: eval_limit must be a positive integer, got: ${{ github.event.inputs.eval_limit }}"
+            exit 1
+          fi
+
       - name: Validate SDK reference (semantic version check)
         if: github.event_name == 'workflow_dispatch'
         env:
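
The added step interpolates the input directly into the script text. Since `eval_limit` is now free-form, one possible hardening is to pass it through an environment variable and validate that instead. A minimal sketch of the same check, assuming the step sets a variable named `EVAL_LIMIT` through an `env:` entry; the variable and function names are hypothetical and not part of this commit:

```shell
#!/usr/bin/env bash
# Sketch only: validates eval_limit read from the environment.
# Assumption: the workflow step exports EVAL_LIMIT via `env:` rather than
# expanding the input expression inside the script body.

is_positive_int() {
  # Same pattern as the workflow step: digits only, no sign, no leading zero.
  [[ "$1" =~ ^[1-9][0-9]*$ ]]
}

validate_eval_limit() {
  local value="$1"
  if ! is_positive_int "$value"; then
    echo "Error: eval_limit must be a positive integer, got: $value" >&2
    return 1
  fi
}

# Fall back to '1', the default declared for the input in the workflow.
validate_eval_limit "${EVAL_LIMIT:-1}" || exit 1
```

Under this pattern, values such as `0`, `050`, `-5`, or an empty string are rejected, while `1`, `10`, and `200` pass. Because the value is read as data from the environment, shell metacharacters in the input are never parsed as part of the script.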
