Commit 8dc35fd

Add explicit rules to prevent modifying existing models in AGENTS.md (#2288)

Co-authored-by: openhands <openhands@all-hands.dev>

Parent: cc34237

1 file changed (+83, −13 lines)


.github/run-eval/AGENTS.md

Lines changed: 83 additions & 13 deletions
````diff
@@ -4,6 +4,39 @@
 
 This file (`resolve_model_config.py`) defines models available for evaluation. Models must be added here before they can be used in integration tests or evaluations.
 
+## Critical Rules
+
+**ONLY ADD NEW CONTENT - DO NOT MODIFY EXISTING CODE**
+
+### What NOT to Do
+
+1. **Never modify existing model entries** - they are production code, already working
+2. **Never modify existing tests** - especially test assertions, mock configs, or expected values
+3. **Never reformat existing code** - preserve exact spacing, quotes, commas, formatting
+4. **Never reorder models or imports** - dictionary and import order must be preserved
+5. **Never "fix" existing code** - if it's in the file and tests pass, it works
+6. **Never change test assertions** - even if they "look wrong" to you
+7. **Never replace real model tests with mocked tests** - weakens validation
+8. **Never fix import names** - if `test_model` exists, don't change it to `check_model`
+
+### What These Rules Prevent
+
+**Example violations** (all found in real PRs):
+- Changing `assert result[0]["id"] == "claude-sonnet-4-5-20250929"` to `"gpt-4"`
+- Replacing real model config tests with mocked/custom model tests ❌
+- "Fixing" `from resolve_model_config import test_model` to `check_model`
+- Adding "Fixed incorrect assertions" without explaining what was incorrect ❌
+- Claiming to "fix test issues" when tests were already passing ❌
+
+### What TO Do
+
+**When adding a model**:
+- Add ONE new entry to the MODELS dictionary
+- Add ONE new test function (follow existing pattern exactly)
+- Add to feature lists in model_features.py ONLY if needed for your model
+- Do not touch any other files, tests, imports, or configurations
+- If you think something is broken, it's probably not - leave it alone
+
 ## Files to Modify
 
 1. **Always required**:
````
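To make the "add ONE new entry" rule concrete, here is a hedged sketch of what a new MODELS entry might look like. The field names (`id`, `display_name`, `llm_config`) are taken from the test template elsewhere in this diff; the dictionary shape and the placeholder values are assumptions for illustration, not the real contents of `resolve_model_config.py`.

```python
# Hypothetical sketch -- the real MODELS dictionary lives in
# resolve_model_config.py and must not be restructured; only a new
# entry like the one below gets appended after the existing ones.
MODELS = {
    # ... existing entries: never modified, never reordered ...
    "your-model-id": {
        "id": "your-model-id",
        "display_name": "Your Model Display Name",
        "llm_config": {
            "model": "litellm_proxy/provider/model-name",
        },
    },
}

entry = MODELS["your-model-id"]  # dictionary key keeps its hyphens
print(entry["display_name"])
```

Note that everything outside the one new entry stays byte-for-byte identical, per the rules above.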
````diff
@@ -137,19 +170,34 @@ See `model_features.py` for complete lists and additional documentation.
 
 **File**: `tests/github_workflows/test_resolve_model_config.py`
 
-**Important**: Python function names cannot contain hyphens. Convert model ID hyphens to underscores.
+**Important**:
+- Python function names cannot contain hyphens. Convert model ID hyphens to underscores.
+- **Do not modify any existing test functions** - only add your new one at the end of the file
+- **Do not change existing imports** - use what's already there
+- **Do not fix "incorrect" assertions** in other tests - they are correct
+
+**Test template** (copy and modify for your model):
 
 ```python
-def test_claude_sonnet_46_config():  # Note: hyphens -> underscores
-    """Test that claude-sonnet-4-6 has correct configuration."""
-    model = MODELS["claude-sonnet-4-6"]  # Dictionary key keeps hyphens
+def test_your_model_id_config():  # Replace hyphens with underscores in function name
+    """Test that your-model-id has correct configuration."""
+    model = MODELS["your-model-id"]  # Dictionary key keeps hyphens
 
-    assert model["id"] == "claude-sonnet-4-6"
-    assert model["display_name"] == "Claude Sonnet 4.6"
-    assert model["llm_config"]["model"] == "litellm_proxy/anthropic/claude-sonnet-4-6"
-    assert model["llm_config"]["temperature"] == 0.0
+    assert model["id"] == "your-model-id"
+    assert model["display_name"] == "Your Model Display Name"
+    assert model["llm_config"]["model"] == "litellm_proxy/provider/model-name"
+    # Only add assertions for parameters YOU added in resolve_model_config.py
+    # assert model["llm_config"]["temperature"] == 0.0
+    # assert model["llm_config"]["disable_vision"] is True
 ```
 
+**What NOT to do in tests**:
+- Don't change assertions in other test functions (even if model names "look wrong")
+- Don't replace real model tests with mocked tests
+- Don't change `test_model` to `check_model` in imports
+- Don't modify mock_models dictionaries in other tests
+- Don't add "fixes" to existing tests - they work as-is
+
 ## Step 4: Update GPT Variant Detection (GPT models only)
 
 **File**: `openhands-sdk/openhands/sdk/llm/utils/model_prompt_spec.py`
````
````diff
@@ -211,20 +259,42 @@ MODEL_IDS="your-model-id" GITHUB_OUTPUT=/tmp/output.txt python resolve_model_con
 ### Required in PR Description
 
 ```markdown
-## Integration Test Results
-✅ Integration tests passed: [PASTE GITHUB ACTIONS RUN URL]
+## Summary
+Adds the `model-id` model to resolve_model_config.py.
 
-[Summary table showing test results]
+## Changes
+- Added model-id to MODELS dictionary
+- Added test_model_id_config() test function
+- [Only if applicable] Added to [feature category] in model_features.py
 
 ## Configuration
 - Model ID: model-id
-- Provider: Provider Name
+- Provider: Provider Name
 - Temperature: [value] - [reasoning for choice]
-- Feature categories: [list categories added to model_features.py]
+- [List any special parameters and why needed]
+
+## Integration Test Results
+✅ Integration tests passed: [PASTE GITHUB ACTIONS RUN URL]
+
+[Summary table showing test results]
 
 Fixes #[issue-number]
 ```
 
+### What NOT to Include in PR Description
+
+**Do not claim to have "fixed" things unless they were actually broken**:
+- ❌ "Fixed test_model import issue" (if tests were passing, there was no issue)
+- ❌ "Fixed incorrect assertions in existing tests" (they were correct)
+- ❌ "Improved test coverage" (unless you actually added new test cases)
+- ❌ "Cleaned up code" (you shouldn't be cleaning up anything)
+- ❌ "Updated test approach" (you shouldn't be changing testing approach)
+
+**Only describe what you actually added**:
+- ✅ "Added gpt-5.3-codex model configuration"
+- ✅ "Added test for gpt-5.3-codex"
+- ✅ "Added gpt-5.3-codex to REASONING_EFFORT_MODELS"
+
 ## Common Issues
 
 ### Integration Tests Hang (6-8+ hours)
````
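The verification command in the hunk header above writes GitHub-Actions-style `key=value` lines to the file named by `GITHUB_OUTPUT`. A minimal parser for that single-line format can help when inspecting `/tmp/output.txt` after a local run. This helper is an assumption about the output shape (multi-line values in GITHUB_OUTPUT use a separate heredoc-style delimiter syntax that this sketch does not handle), not part of the repository.

```python
def parse_github_output(text: str) -> dict[str, str]:
    """Parse simple key=value lines as written to a GITHUB_OUTPUT file.

    Hypothetical helper: ignores lines without '=', splits each
    remaining line on the first '=', and strips surrounding whitespace.
    """
    result: dict[str, str] = {}
    for line in text.splitlines():
        if "=" in line:
            key, _, value = line.partition("=")
            result[key.strip()] = value.strip()
    return result

print(parse_github_output("models=your-model-id\ncount=1\n"))
```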
