`.github/run-eval/AGENTS.md`

This file (`resolve_model_config.py`) defines models available for evaluation. Models must be added here before they can be used in integration tests or evaluations.
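As a rough illustration of what a `MODELS` entry looks like (the field names below are inferred from the test template later in this guide, not copied from the real file — treat the exact schema in `resolve_model_config.py` as authoritative):

```python
# Hypothetical sketch of a MODELS entry; field names are inferred from the
# test template in this guide, not from the real resolve_model_config.py.
MODELS = {
    "your-model-id": {
        "id": "your-model-id",
        "display_name": "Your Model Display Name",
        "llm_config": {
            "model": "litellm_proxy/provider/model-name",
            # Optional parameters, only if your model needs them:
            # "temperature": 0.0,
            # "disable_vision": True,
        },
    },
}
```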

## Critical Rules

**ONLY ADD NEW CONTENT - DO NOT MODIFY EXISTING CODE**

### What NOT to Do

1. **Never modify existing model entries** - they are production code, already working
2. **Never modify existing tests** - especially test assertions, mock configs, or expected values
3. **Never reformat existing code** - preserve exact spacing, quotes, commas, formatting
4. **Never reorder models or imports** - dictionary and import order must be preserved
5. **Never "fix" existing code** - if it's in the file and tests pass, it works
6. **Never change test assertions** - even if they "look wrong" to you
7. **Never replace real model tests with mocked tests** - weakens validation
8. **Never fix import names** - if `test_model` exists, don't change it to `check_model`

### What These Rules Prevent

**Example violations** (all found in real PRs):
- Changing `assert result[0]["id"] == "claude-sonnet-4-5-20250929"` to `"gpt-4"` ❌
- Replacing real model config tests with mocked/custom model tests ❌
- "Fixing" `from resolve_model_config import test_model` to `check_model` ❌
- Adding "Fixed incorrect assertions" without explaining what was incorrect ❌
- Claiming to "fix test issues" when tests were already passing ❌

### What TO Do

**When adding a model**:
- Add ONE new entry to the MODELS dictionary
- Add ONE new test function (follow existing pattern exactly)
- Add to feature lists in model_features.py ONLY if needed for your model
- Do not touch any other files, tests, imports, or configurations
- If you think something is broken, it's probably not - leave it alone

## Files to Modify

1. **Always required**:
See `model_features.py` for complete lists and additional documentation.
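Adding a model to a feature list is typically a one-line append. As an illustrative sketch (`REASONING_EFFORT_MODELS` is a list name mentioned elsewhere in this guide; the existing entries shown here are placeholders):

```python
# Illustrative excerpt of a feature list in model_features.py; the existing
# entries are placeholders — never edit or reorder the real ones.
REASONING_EFFORT_MODELS = [
    "existing-model-a",
    "existing-model-b",
    "your-model-id",  # appended at the end, ONLY if the model supports the feature
]
```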

**File**: `tests/github_workflows/test_resolve_model_config.py`

**Important**:
- Python function names cannot contain hyphens. Convert model ID hyphens to underscores.
- **Do not modify any existing test functions** - only add your new one at the end of the file
- **Do not change existing imports** - use what's already there
- **Do not fix "incorrect" assertions** in other tests - they are correct

**Test template** (copy and modify for your model):

```python
def test_your_model_id_config():  # Replace hyphens with underscores in function name
    """Test that your-model-id has correct configuration."""
    model = MODELS["your-model-id"]  # Dictionary key keeps hyphens

    assert model["id"] == "your-model-id"
    assert model["display_name"] == "Your Model Display Name"
    assert model["llm_config"]["model"] == "litellm_proxy/provider/model-name"
    # Only add assertions for parameters YOU added in resolve_model_config.py
    # assert model["llm_config"]["temperature"] == 0.0
    # assert model["llm_config"]["disable_vision"] is True
```

**What NOT to do in tests**:
- Don't change assertions in other test functions (even if model names "look wrong")
- Don't replace real model tests with mocked tests
- Don't change `test_model` to `check_model` in imports
- Don't modify mock_models dictionaries in other tests
- Don't add "fixes" to existing tests - they work as-is

## Step 4: Update GPT Variant Detection (GPT models only)

**File**: `openhands-sdk/openhands/sdk/llm/utils/model_prompt_spec.py`
Expand Down Expand Up @@ -211,20 +259,42 @@ MODEL_IDS="your-model-id" GITHUB_OUTPUT=/tmp/output.txt python resolve_model_con
### Required in PR Description

```markdown
## Summary
Adds the `model-id` model to resolve_model_config.py.

## Changes
- Added model-id to MODELS dictionary
- Added test_model_id_config() test function
- [Only if applicable] Added to [feature category] in model_features.py

## Configuration
- Model ID: model-id
- Provider: Provider Name
- Temperature: [value] - [reasoning for choice]
- Feature categories: [list categories added to model_features.py]
- [List any special parameters and why needed]

## Integration Test Results
✅ Integration tests passed: [PASTE GITHUB ACTIONS RUN URL]

[Summary table showing test results]

Fixes #[issue-number]
```

### What NOT to Include in PR Description

**Do not claim to have "fixed" things unless they were actually broken**:
- ❌ "Fixed test_model import issue" (if tests were passing, there was no issue)
- ❌ "Fixed incorrect assertions in existing tests" (they were correct)
- ❌ "Improved test coverage" (unless you actually added new test cases)
- ❌ "Cleaned up code" (you shouldn't be cleaning up anything)
- ❌ "Updated test approach" (you shouldn't be changing testing approach)

**Only describe what you actually added**:
- ✅ "Added gpt-5.3-codex model configuration"
- ✅ "Added test for gpt-5.3-codex"
- ✅ "Added gpt-5.3-codex to REASONING_EFFORT_MODELS"

## Common Issues

### Integration Tests Hang (6-8+ hours)