# Adding Models to resolve_model_config.py

## Overview

This file (`resolve_model_config.py`) defines the models available for evaluation. Models must be added here before they can be used in integration tests or evaluations.

## Critical Rules

**ONLY ADD NEW CONTENT - DO NOT MODIFY EXISTING CODE**

### What NOT to Do

1. **Never modify existing model entries** - they are production code and already working
2. **Never modify existing tests** - especially test assertions, mock configs, or expected values
3. **Never reformat existing code** - preserve exact spacing, quotes, commas, and formatting
4. **Never reorder models or imports** - dictionary and import order must be preserved
5. **Never "fix" existing code** - if it's in the file and tests pass, it works
6. **Never change test assertions** - even if they "look wrong" to you
7. **Never replace real model tests with mocked tests** - this weakens validation
8. **Never fix import names** - if `test_model` exists, don't change it to `check_model`

### What These Rules Prevent

**Example violations** (all found in real PRs):
- Changing `assert result[0]["id"] == "claude-sonnet-4-5-20250929"` to `"gpt-4"` ❌
- Replacing real model config tests with mocked/custom model tests ❌
- "Fixing" `from resolve_model_config import test_model` to `check_model` ❌
- Adding "Fixed incorrect assertions" without explaining what was incorrect ❌
- Claiming to "fix test issues" when tests were already passing ❌

### What TO Do

**When adding a model**:
- Add ONE new entry to the MODELS dictionary
- Add ONE new test function (follow the existing pattern exactly)
- Add to feature lists in model_features.py ONLY if needed for your model
- Do not touch any other files, tests, imports, or configurations
- If you think something is broken, it's probably not - leave it alone

## Files to Modify

1. **Always required**:
   - `.github/run-eval/resolve_model_config.py` - Add model configuration
   - `tests/github_workflows/test_resolve_model_config.py` - Add test

2. **Often required** (if the model has special characteristics):
   - `openhands-sdk/openhands/sdk/llm/utils/model_features.py` - Add to feature categories

3. **Sometimes required**:
   - `openhands-sdk/openhands/sdk/llm/utils/model_prompt_spec.py` - GPT models only (variant detection)
   - `openhands-sdk/openhands/sdk/llm/utils/verified_models.py` - Production-ready models

## Step 1: Add to resolve_model_config.py

Add an entry to the `MODELS` dictionary:

```python
"model-id": {
    "id": "model-id",  # Must match dictionary key
    "display_name": "Human Readable Name",
    "llm_config": {
        "model": "litellm_proxy/provider/model-name",
        "temperature": 0.0,  # See temperature guide below
    },
},
```

### Temperature Configuration

| Value | When to Use | Provider Requirements |
|-------|-------------|----------------------|
| `0.0` | Standard deterministic models | Most providers |
| `1.0` | Reasoning models | Kimi K2, MiniMax M2.5 |
| `None` | Use provider default | When unsure |

### Special Parameters

Add only if needed:

- **`disable_vision: True`** - Model doesn't support vision despite LiteLLM reporting it does (GLM-4.7, GLM-5)
- **`reasoning_effort: "high"`** - For OpenAI reasoning models that support this parameter
- **`max_tokens: <value>`** - To prevent hangs or control output length
- **`top_p: <value>`** - Nucleus sampling (cannot be used with `temperature` for Claude models)
- **`litellm_extra_body: {...}`** - Provider-specific parameters (e.g., `{"enable_thinking": True}`)
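
Taken together, a hypothetical entry combining several special parameters might look like the sketch below. The model name, provider path, and values here are illustrative, not real entries from the file:

```python
# Hypothetical MODELS entry combining special parameters.
# All names and values are illustrative, not real configuration.
MODELS = {
    "example-model": {
        "id": "example-model",  # Must match the dictionary key
        "display_name": "Example Model",
        "llm_config": {
            "model": "litellm_proxy/example/example-model",
            "temperature": 1.0,  # reasoning-style model
            "max_tokens": 8192,  # guard against hangs
            "disable_vision": True,  # vision over-reported by LiteLLM
            "litellm_extra_body": {"enable_thinking": True},
        },
    },
}
```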

### Critical Rules

1. The model ID must match the dictionary key
2. The model path must start with `litellm_proxy/`
3. **Claude models**: Cannot use both `temperature` and `top_p` - choose one or omit both
4. Parameters like `disable_vision` must be listed in the `SDK_ONLY_PARAMS` constant (they are filtered out before the config is sent to LiteLLM)
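
Rule 4 can be sketched as follows; the helper name and set contents here are assumptions - check `resolve_model_config.py` for the actual filtering code:

```python
# Illustrative sketch of SDK-only parameter filtering. The real constant and
# filtering logic live in resolve_model_config.py and may differ.
SDK_ONLY_PARAMS = {"disable_vision"}  # assumed contents

def filter_for_litellm(llm_config: dict) -> dict:
    """Drop SDK-only keys so they are never forwarded to LiteLLM."""
    return {k: v for k, v in llm_config.items() if k not in SDK_ONLY_PARAMS}

filtered = filter_for_litellm(
    {"model": "litellm_proxy/provider/model-name", "disable_vision": True}
)
# filtered keeps only the "model" key; "disable_vision" stays SDK-side
```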

## Step 2: Update model_features.py (if applicable)

Check the provider documentation to determine which feature categories apply:

### REASONING_EFFORT_MODELS
Models that support the `reasoning_effort` parameter:
- OpenAI: o1, o3, o4, GPT-5 series
- Anthropic: Claude Opus 4.5+, Claude Sonnet 4.6
- Google: Gemini 2.5+, Gemini 3.x series
- AWS: Nova 2 Lite

```python
REASONING_EFFORT_MODELS: list[str] = [
    "your-model-identifier",  # Add here
]
```

**Effect**: Automatically strips `temperature` and `top_p` parameters to avoid API conflicts.
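
The stripping effect can be sketched as below; the list contents and helper name are assumptions, not the real `model_features.py` code:

```python
# Sketch of the documented stripping behavior; illustrative only.
REASONING_EFFORT_MODELS = ["o1", "o3", "gpt-5"]  # assumed contents

def strip_sampling_params(model: str, llm_config: dict) -> dict:
    """Drop temperature/top_p for models that take reasoning_effort instead."""
    if any(fragment in model.lower() for fragment in REASONING_EFFORT_MODELS):
        return {
            k: v for k, v in llm_config.items()
            if k not in ("temperature", "top_p")
        }
    return dict(llm_config)

strip_sampling_params("gpt-5", {"temperature": 0.0, "max_tokens": 100})
# -> {"max_tokens": 100}
```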

### EXTENDED_THINKING_MODELS
Models with extended thinking capabilities:
- Anthropic: Claude Sonnet 4.5+, Claude Haiku 4.5

```python
EXTENDED_THINKING_MODELS: list[str] = [
    "your-model-identifier",  # Add here
]
```

**Effect**: Automatically strips `temperature` and `top_p` parameters.

### PROMPT_CACHE_MODELS
Models supporting prompt caching:
- Anthropic: Claude 3.5+, Claude 4+ series

```python
PROMPT_CACHE_MODELS: list[str] = [
    "your-model-identifier",  # Add here
]
```

### SUPPORTS_STOP_WORDS_FALSE_MODELS
Models that **do not** support stop words:
- OpenAI: o1, o3 series
- xAI: Grok-4, Grok-code-fast-1
- DeepSeek: R1 family

```python
SUPPORTS_STOP_WORDS_FALSE_MODELS: list[str] = [
    "your-model-identifier",  # Add here
]
```

### FORCE_STRING_SERIALIZER_MODELS
Models requiring string format for tool messages (not structured content):
- DeepSeek models
- GLM models
- Groq: Kimi K2-Instruct
- OpenRouter: MiniMax

Use pattern matching:
```python
FORCE_STRING_SERIALIZER_MODELS: list[str] = [
    "deepseek",  # Matches any model with "deepseek" in name
    "groq/kimi-k2-instruct",  # Provider-prefixed
]
```
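
A minimal sketch of the substring matching described above (the helper name is an assumption, not the real API):

```python
# Sketch of substring-based feature matching; illustrative only.
FORCE_STRING_SERIALIZER_MODELS = ["deepseek", "groq/kimi-k2-instruct"]

def needs_string_serializer(model: str) -> bool:
    """True if any pattern fragment appears in the model name."""
    return any(p in model.lower() for p in FORCE_STRING_SERIALIZER_MODELS)

needs_string_serializer("litellm_proxy/deepseek/deepseek-chat")  # True
needs_string_serializer("litellm_proxy/openai/gpt-5")            # False
```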

### Other Categories

- **PROMPT_CACHE_RETENTION_MODELS**: GPT-5 family, GPT-4.1
- **RESPONSES_API_MODELS**: GPT-5 family, codex-mini-latest
- **SEND_REASONING_CONTENT_MODELS**: Kimi K2 Thinking/K2.5, MiniMax-M2, DeepSeek Reasoner

See `model_features.py` for complete lists and additional documentation.

## Step 3: Add Test

**File**: `tests/github_workflows/test_resolve_model_config.py`

**Important**:
- Python function names cannot contain hyphens. Convert model ID hyphens to underscores.
- **Do not modify any existing test functions** - only add your new one at the end of the file
- **Do not change existing imports** - use what's already there
- **Do not fix "incorrect" assertions** in other tests - they are correct

**Test template** (copy and modify for your model):

```python
def test_your_model_id_config():  # Replace hyphens with underscores in function name
    """Test that your-model-id has correct configuration."""
    model = MODELS["your-model-id"]  # Dictionary key keeps hyphens

    assert model["id"] == "your-model-id"
    assert model["display_name"] == "Your Model Display Name"
    assert model["llm_config"]["model"] == "litellm_proxy/provider/model-name"
    # Only add assertions for parameters YOU added in resolve_model_config.py
    # assert model["llm_config"]["temperature"] == 0.0
    # assert model["llm_config"]["disable_vision"] is True
```

**What NOT to do in tests**:
- Don't change assertions in other test functions (even if model names "look wrong")
- Don't replace real model tests with mocked tests
- Don't change `test_model` to `check_model` in imports
- Don't modify mock_models dictionaries in other tests
- Don't add "fixes" to existing tests - they work as-is

## Step 4: Update GPT Variant Detection (GPT models only)

**File**: `openhands-sdk/openhands/sdk/llm/utils/model_prompt_spec.py`

Required only if this is a GPT model that needs a specific prompt template.

**Order matters**: More specific patterns must come before general patterns.

```python
_MODEL_VARIANT_PATTERNS: dict[str, tuple[tuple[str, tuple[str, ...]], ...]] = {
    "openai_gpt": (
        (
            "gpt-5-codex",  # Specific variant first
            ("gpt-5-codex", "gpt-5.1-codex", "gpt-5.2-codex", "gpt-5.3-codex"),
        ),
        ("gpt-5", ("gpt-5", "gpt-5.1", "gpt-5.2")),  # General variant last
    ),
}
```

## Step 5: Run Tests Locally

```bash
# Pre-commit checks
pre-commit run --all-files

# Unit tests
pytest tests/github_workflows/test_resolve_model_config.py::test_your_model_config -v

# Manual verification
cd .github/run-eval
MODEL_IDS="your-model-id" GITHUB_OUTPUT=/tmp/output.txt python resolve_model_config.py
```

## Step 6: Run Integration Tests (Required Before PR)

**Mandatory**: Integration tests must pass before creating the PR.

### Via GitHub Actions

1. Push branch: `git push origin your-branch-name`
2. Navigate to: https://github.com/OpenHands/software-agent-sdk/actions/workflows/integration-runner.yml
3. Click "Run workflow"
4. Configure:
   - **Branch**: Select your branch
   - **model_ids**: `your-model-id`
   - **Reason**: "Testing model-id"
5. Wait for completion
6. **Save the run URL** - it is required for the PR description

### Expected Results

- Success rate: 100% (or 87.5% if the vision test is skipped)
- Duration: 5-10 minutes per model
- Tests: 8 total (basic commands, file ops, code editing, reasoning, errors, tools, context, vision)

## Step 7: Create PR

### Required in PR Description

```markdown
## Summary
Adds the `model-id` model to resolve_model_config.py.

## Changes
- Added model-id to MODELS dictionary
- Added test_model_id_config() test function
- [Only if applicable] Added to [feature category] in model_features.py

## Configuration
- Model ID: model-id
- Provider: Provider Name
- Temperature: [value] - [reasoning for choice]
- [List any special parameters and why needed]

## Integration Test Results
✅ Integration tests passed: [PASTE GITHUB ACTIONS RUN URL]

[Summary table showing test results]

Fixes #[issue-number]
```

### What NOT to Include in PR Description

**Do not claim to have "fixed" things unless they were actually broken**:
- ❌ "Fixed test_model import issue" (if tests were passing, there was no issue)
- ❌ "Fixed incorrect assertions in existing tests" (they were correct)
- ❌ "Improved test coverage" (unless you actually added new test cases)
- ❌ "Cleaned up code" (you shouldn't be cleaning up anything)
- ❌ "Updated test approach" (you shouldn't be changing the testing approach)

**Only describe what you actually added**:
- ✅ "Added gpt-5.3-codex model configuration"
- ✅ "Added test for gpt-5.3-codex"
- ✅ "Added gpt-5.3-codex to REASONING_EFFORT_MODELS"

## Common Issues

### Integration Tests Hang (6-8+ hours)
**Causes**:
- Missing `max_tokens` parameter
- Claude models with both `temperature` and `top_p` set
- Model not in REASONING_EFFORT_MODELS or EXTENDED_THINKING_MODELS

**Solutions**: Add `max_tokens`, remove parameter conflicts, or add the model to the appropriate feature category.

**Reference**: #2147

### Preflight Check: "Cannot specify both temperature and top_p"
**Cause**: Claude models receiving both parameters

**Solutions**:
- Remove `top_p` from llm_config if `temperature` is set
- Add the model to REASONING_EFFORT_MODELS or EXTENDED_THINKING_MODELS (auto-strips both)

**Reference**: #2137, #2193

### Vision Tests Fail
**Cause**: LiteLLM reports vision support but the model doesn't actually support it

**Solution**: Add `"disable_vision": True` to llm_config

**Reference**: #2110 (GLM-5), #1898 (GLM-4.7)

### Wrong Prompt Template (GPT models)
**Cause**: Model variant not detected correctly, falls through to the wrong template

**Solution**: Add explicit entries to `model_prompt_spec.py` with the correct pattern order

**Reference**: #2233 (GPT-5.2-codex, GPT-5.3-codex)

### SDK-Only Parameters Sent to LiteLLM
**Cause**: A parameter like `disable_vision` is not in the `SDK_ONLY_PARAMS` set

**Solution**: Add it to `SDK_ONLY_PARAMS` in `resolve_model_config.py`

**Reference**: #2194

## Model Feature Detection Criteria

### How to Determine if a Model Needs a Feature Category

**Reasoning Model**:
- Check provider documentation for "reasoning", "thinking", or "o1-style" mentions
- Model exposes internal reasoning traces
- Examples: o1, o3, GPT-5, Claude Opus 4.5+, Gemini 3+

**Extended Thinking**:
- Check if the model is Claude Sonnet 4.5+ or Claude Haiku 4.5
- Provider documents extended thinking capabilities

**Prompt Caching**:
- Check provider documentation for prompt caching support
- Anthropic Claude 3.5+ and 4+ series support this

**Vision Support**:
- Check provider documentation (don't rely solely on LiteLLM)
- If LiteLLM reports vision but provider docs say text-only, add `disable_vision: True`

**Stop Words**:
- Most models support stop words
- o1/o3 series, some Grok models, and DeepSeek R1 do not

**String Serialization**:
- If tool message errors mention "Input should be a valid string"
- DeepSeek, GLM, and some provider-specific models need this

## Reference

- Recent model additions: #2102, #2153, #2207, #2233, #2269
- Common issues: #2147 (hangs), #2137 (parameters), #2110 (vision), #2233 (variants), #2193 (preflight)
- Integration test workflow: `.github/workflows/integration-runner.yml`