`.github/run-eval/AGENTS.md`

This file (`resolve_model_config.py`) defines models available for evaluation. Models must be added here before they can be used in integration tests or evaluations.
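As a rough illustration of what a `MODELS` entry looks like (the field names below are inferred from the test template later in this guide, not copied from the real file — treat the exact schema in `resolve_model_config.py` as authoritative):

```python
# Hypothetical sketch of a MODELS entry; field names are inferred from the
# test template in this guide, not from the real resolve_model_config.py.
MODELS = {
    "your-model-id": {
        "id": "your-model-id",
        "display_name": "Your Model Display Name",
        "llm_config": {
            "model": "litellm_proxy/provider/model-name",
            # Optional parameters, only if your model needs them:
            # "temperature": 0.0,
            # "disable_vision": True,
        },
    },
}
```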

## Critical Rules

**ONLY ADD NEW CONTENT - DO NOT MODIFY EXISTING CODE**

### What NOT to Do

1. **Never modify existing model entries** - they are production code, already working
2. **Never modify existing tests** - especially test assertions, mock configs, or expected values
3. **Never reformat existing code** - preserve exact spacing, quotes, commas, formatting
4. **Never reorder models or imports** - dictionary and import order must be preserved
5. **Never "fix" existing code** - if it's in the file and tests pass, it works
6. **Never change test assertions** - even if they "look wrong" to you
7. **Never replace real model tests with mocked tests** - weakens validation
8. **Never fix import names** - if `test_model` exists, don't change it to `check_model`

### What These Rules Prevent

**Example violations** (all found in real PRs):
- Changing `assert result[0]["id"] == "claude-sonnet-4-5-20250929"` to `"gpt-4"` ❌
- Replacing real model config tests with mocked/custom model tests ❌
- "Fixing" `from resolve_model_config import test_model` to `check_model` ❌
- Adding "Fixed incorrect assertions" without explaining what was incorrect ❌
- Claiming to "fix test issues" when tests were already passing ❌

### What TO Do

**When adding a model**:
- Add ONE new entry to the MODELS dictionary
- Add ONE new test function (follow existing pattern exactly)
- Add to feature lists in model_features.py ONLY if needed for your model
- Do not touch any other files, tests, imports, or configurations
- If you think something is broken, it's probably not - leave it alone

## Files to Modify

1. **Always required**:
See `model_features.py` for complete lists and additional documentation.
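Adding a model to a feature list is typically a one-line append. As an illustrative sketch (`REASONING_EFFORT_MODELS` is a list name mentioned elsewhere in this guide; the existing entries shown here are placeholders):

```python
# Illustrative excerpt of a feature list in model_features.py; the existing
# entries are placeholders — never edit or reorder the real ones.
REASONING_EFFORT_MODELS = [
    "existing-model-a",
    "existing-model-b",
    "your-model-id",  # appended at the end, ONLY if the model supports the feature
]
```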

**File**: `tests/github_workflows/test_resolve_model_config.py`

**Important**:
- Python function names cannot contain hyphens. Convert model ID hyphens to underscores.
- **Do not modify any existing test functions** - only add your new one at the end of the file
- **Do not change existing imports** - use what's already there
- **Do not fix "incorrect" assertions** in other tests - they are correct

**Test template** (copy and modify for your model):

```python
def test_your_model_id_config():  # Replace hyphens with underscores in function name
    """Test that your-model-id has correct configuration."""
    model = MODELS["your-model-id"]  # Dictionary key keeps hyphens

    assert model["id"] == "your-model-id"
    assert model["display_name"] == "Your Model Display Name"
    assert model["llm_config"]["model"] == "litellm_proxy/provider/model-name"
    # Only add assertions for parameters YOU added in resolve_model_config.py
    # assert model["llm_config"]["temperature"] == 0.0
    # assert model["llm_config"]["disable_vision"] is True
```

**What NOT to do in tests**:
- Don't change assertions in other test functions (even if model names "look wrong")
- Don't replace real model tests with mocked tests
- Don't change `test_model` to `check_model` in imports
- Don't modify mock_models dictionaries in other tests
- Don't add "fixes" to existing tests - they work as-is

## Step 4: Update GPT Variant Detection (GPT models only)

**File**: `openhands-sdk/openhands/sdk/llm/utils/model_prompt_spec.py`
Expand Down Expand Up @@ -211,20 +259,42 @@ MODEL_IDS="your-model-id" GITHUB_OUTPUT=/tmp/output.txt python resolve_model_con
### Required in PR Description

```markdown
## Summary
Adds the `model-id` model to resolve_model_config.py.

## Changes
- Added model-id to MODELS dictionary
- Added test_model_id_config() test function
- [Only if applicable] Added to [feature category] in model_features.py

## Configuration
- Model ID: model-id
- Provider: Provider Name
- Temperature: [value] - [reasoning for choice]
- Feature categories: [list categories added to model_features.py]
- [List any special parameters and why needed]

## Integration Test Results
✅ Integration tests passed: [PASTE GITHUB ACTIONS RUN URL]

[Summary table showing test results]

Fixes #[issue-number]
```

### What NOT to Include in PR Description

**Do not claim to have "fixed" things unless they were actually broken**:
- ❌ "Fixed test_model import issue" (if tests were passing, there was no issue)
- ❌ "Fixed incorrect assertions in existing tests" (they were correct)
- ❌ "Improved test coverage" (unless you actually added new test cases)
- ❌ "Cleaned up code" (you shouldn't be cleaning up anything)
- ❌ "Updated test approach" (you shouldn't be changing testing approach)

**Only describe what you actually added**:
- ✅ "Added gpt-5.3-codex model configuration"
- ✅ "Added test for gpt-5.3-codex"
- ✅ "Added gpt-5.3-codex to REASONING_EFFORT_MODELS"

## Common Issues

### Integration Tests Hang (6-8+ hours)