# Adding Models to resolve_model_config.py

## Overview

This file (`resolve_model_config.py`) defines models available for evaluation. Models must be added here before they can be used in integration tests or evaluations.

## Files to Modify

1. **Always required**:
   - `.github/run-eval/resolve_model_config.py` - Add model configuration
   - `tests/github_workflows/test_resolve_model_config.py` - Add test

2. **Usually required** (if the model has special characteristics):
   - `openhands-sdk/openhands/sdk/llm/utils/model_features.py` - Add to feature categories

3. **Sometimes required**:
   - `openhands-sdk/openhands/sdk/llm/utils/model_prompt_spec.py` - GPT models only (variant detection)
   - `openhands-sdk/openhands/sdk/llm/utils/verified_models.py` - Production-ready models

## Step 1: Add to resolve_model_config.py

Add an entry to the `MODELS` dictionary:

```python
"model-id": {
    "id": "model-id",  # Must match dictionary key
    "display_name": "Human Readable Name",
    "llm_config": {
        "model": "litellm_proxy/provider/model-name",
        "temperature": 0.0,  # See temperature guide below
    },
},
```
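As a concrete example, an entry consistent with the values asserted by the `claude-sonnet-4-6` test in Step 3 would look like this (`MODELS` here stands in for the dictionary in `resolve_model_config.py`):

```python
# Entry matching the values checked by the Step 3 test for claude-sonnet-4-6.
MODELS = {
    "claude-sonnet-4-6": {
        "id": "claude-sonnet-4-6",  # Matches the dictionary key
        "display_name": "Claude Sonnet 4.6",
        "llm_config": {
            "model": "litellm_proxy/anthropic/claude-sonnet-4-6",
            "temperature": 0.0,
        },
    },
}
```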

### Temperature Configuration

| Value | When to Use | Provider Requirements |
|-------|-------------|----------------------|
| `0.0` | Standard deterministic models | Most providers |
| `1.0` | Reasoning models | Kimi K2, MiniMax M2.5 |
| `None` | Use provider default | When unsure |

### Special Parameters

Add only if needed:

- **`disable_vision: True`** - Model doesn't support vision despite LiteLLM reporting that it does (GLM-4.7, GLM-5)
- **`reasoning_effort: "high"`** - For OpenAI reasoning models that support this parameter
- **`max_tokens: <value>`** - To prevent hangs or control output length
- **`top_p: <value>`** - Nucleus sampling (cannot be used with `temperature` for Claude models)
- **`litellm_extra_body: {...}`** - Provider-specific parameters (e.g., `{"enable_thinking": True}`)

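Putting several of these together, a hypothetical entry might look like the sketch below. The model name and values are illustrative only, not a real configuration from this repo:

```python
# Hypothetical entry combining special parameters; name and values are
# illustrative, not taken from resolve_model_config.py.
"example-reasoning-model": {
    "id": "example-reasoning-model",
    "display_name": "Example Reasoning Model",
    "llm_config": {
        "model": "litellm_proxy/provider/example-reasoning-model",
        "temperature": 1.0,  # Reasoning model
        "max_tokens": 8192,  # Prevent hangs
        "disable_vision": True,  # SDK-only parameter, filtered before LiteLLM
        "litellm_extra_body": {"enable_thinking": True},
    },
},
```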
### Critical Rules

1. Model ID must match the dictionary key
2. Model path must start with `litellm_proxy/`
3. **Claude models**: Cannot use both `temperature` and `top_p` - choose one or omit both
4. Parameters like `disable_vision` must be in the `SDK_ONLY_PARAMS` constant (they're filtered out before sending to LiteLLM)
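Rule 4 can be sketched as follows. The contents of `SDK_ONLY_PARAMS` and the helper name are assumptions about how the filtering works, not the repo's actual implementation:

```python
# Sketch of SDK-only parameter filtering (assumed behavior).
SDK_ONLY_PARAMS = {"disable_vision"}

def litellm_payload(llm_config: dict) -> dict:
    """Return llm_config with SDK-only keys removed before the LiteLLM call."""
    return {k: v for k, v in llm_config.items() if k not in SDK_ONLY_PARAMS}

config = {"model": "litellm_proxy/provider/m", "temperature": 0.0, "disable_vision": True}
print(litellm_payload(config))  # "disable_vision" never reaches LiteLLM
```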

## Step 2: Update model_features.py (if applicable)

Check provider documentation to determine which feature categories apply:

### REASONING_EFFORT_MODELS
Models that support the `reasoning_effort` parameter:
- OpenAI: o1, o3, o4, GPT-5 series
- Anthropic: Claude Opus 4.5+, Claude Sonnet 4.6
- Google: Gemini 2.5+, Gemini 3.x series
- AWS: Nova 2 Lite

```python
REASONING_EFFORT_MODELS: list[str] = [
    "your-model-identifier",  # Add here
]
```

**Effect**: Automatically strips `temperature` and `top_p` parameters to avoid API conflicts.
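That stripping behavior can be sketched as below. The substring matching rule and the function name are assumptions for illustration, not the SDK's actual code:

```python
# Sketch of the stripping effect for reasoning-effort models (assumed logic).
REASONING_EFFORT_MODELS = ["o1", "o3", "gpt-5"]

def strip_sampling_params(model: str, config: dict) -> dict:
    """If the model is a reasoning-effort model, drop temperature/top_p."""
    if any(name in model for name in REASONING_EFFORT_MODELS):
        return {k: v for k, v in config.items() if k not in ("temperature", "top_p")}
    return config

print(strip_sampling_params("litellm_proxy/openai/gpt-5",
                            {"temperature": 0.0, "max_tokens": 4096}))
# temperature is removed; max_tokens passes through
```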

### EXTENDED_THINKING_MODELS
Models with extended thinking capabilities:
- Anthropic: Claude Sonnet 4.5+, Claude Haiku 4.5

```python
EXTENDED_THINKING_MODELS: list[str] = [
    "your-model-identifier",  # Add here
]
```

**Effect**: Automatically strips `temperature` and `top_p` parameters.

### PROMPT_CACHE_MODELS
Models supporting prompt caching:
- Anthropic: Claude 3.5+, Claude 4+ series

```python
PROMPT_CACHE_MODELS: list[str] = [
    "your-model-identifier",  # Add here
]
```

### SUPPORTS_STOP_WORDS_FALSE_MODELS
Models that **do not** support stop words:
- OpenAI: o1, o3 series
- xAI: Grok-4, Grok-code-fast-1
- DeepSeek: R1 family

```python
SUPPORTS_STOP_WORDS_FALSE_MODELS: list[str] = [
    "your-model-identifier",  # Add here
]
```

### FORCE_STRING_SERIALIZER_MODELS
Models requiring string format for tool messages (not structured content):
- DeepSeek models
- GLM models
- Groq: Kimi K2-Instruct
- OpenRouter: MiniMax

Use pattern matching:
```python
FORCE_STRING_SERIALIZER_MODELS: list[str] = [
    "deepseek",  # Matches any model with "deepseek" in its name
    "groq/kimi-k2-instruct",  # Provider-prefixed
]
```
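The pattern matching described above (an entry matches any model whose name contains it) can be sketched like this; the helper name is an assumption:

```python
# Sketch of substring pattern matching against a feature list (assumed helper).
FORCE_STRING_SERIALIZER_MODELS = ["deepseek", "groq/kimi-k2-instruct"]

def needs_string_serializer(model: str) -> bool:
    """True if any pattern appears anywhere in the model name."""
    return any(pattern in model.lower() for pattern in FORCE_STRING_SERIALIZER_MODELS)

print(needs_string_serializer("litellm_proxy/deepseek/deepseek-chat"))      # True
print(needs_string_serializer("litellm_proxy/anthropic/claude-sonnet-4-6")) # False
```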

### Other Categories

- **PROMPT_CACHE_RETENTION_MODELS**: GPT-5 family, GPT-4.1
- **RESPONSES_API_MODELS**: GPT-5 family, codex-mini-latest
- **SEND_REASONING_CONTENT_MODELS**: Kimi K2 Thinking/K2.5, MiniMax-M2, DeepSeek Reasoner

See `model_features.py` for complete lists and additional documentation.

## Step 3: Add Test

**File**: `tests/github_workflows/test_resolve_model_config.py`

**Important**: Python function names cannot contain hyphens. Convert model ID hyphens to underscores.

```python
def test_claude_sonnet_46_config():  # Note: hyphens -> underscores
    """Test that claude-sonnet-4-6 has correct configuration."""
    model = MODELS["claude-sonnet-4-6"]  # Dictionary key keeps hyphens

    assert model["id"] == "claude-sonnet-4-6"
    assert model["display_name"] == "Claude Sonnet 4.6"
    assert model["llm_config"]["model"] == "litellm_proxy/anthropic/claude-sonnet-4-6"
    assert model["llm_config"]["temperature"] == 0.0
```
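Beyond per-model assertions, the first two critical rules from Step 1 (key matches `id`, `litellm_proxy/` prefix) can be checked across every entry at once. A sketch, assuming the same `MODELS` dictionary is importable; the helper name is hypothetical:

```python
# Hypothetical invariant check over all entries (rules 1 and 2 from Step 1).
def check_model_invariants(models: dict) -> list[str]:
    """Return a list of critical-rule violations, empty if everything passes."""
    problems = []
    for key, cfg in models.items():
        if cfg.get("id") != key:
            problems.append(f"{key}: id does not match dictionary key")
        if not cfg["llm_config"]["model"].startswith("litellm_proxy/"):
            problems.append(f"{key}: model path missing litellm_proxy/ prefix")
    return problems

print(check_model_invariants(
    {"m": {"id": "m", "llm_config": {"model": "litellm_proxy/p/m"}}}
))  # []
```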

## Step 4: Update GPT Variant Detection (GPT models only)

**File**: `openhands-sdk/openhands/sdk/llm/utils/model_prompt_spec.py`

Required only if the model is a GPT model that needs a specific prompt template.

**Order matters**: More specific patterns must come before general patterns.

```python
_MODEL_VARIANT_PATTERNS: dict[str, tuple[tuple[str, tuple[str, ...]], ...]] = {
    "openai_gpt": (
        (
            "gpt-5-codex",  # Specific variant first
            ("gpt-5-codex", "gpt-5.1-codex", "gpt-5.2-codex", "gpt-5.3-codex"),
        ),
        ("gpt-5", ("gpt-5", "gpt-5.1", "gpt-5.2")),  # General variant last
    ),
}
```

## Step 5: Run Tests Locally

```bash
# Pre-commit checks
pre-commit run --all-files

# Unit tests
pytest tests/github_workflows/test_resolve_model_config.py::test_your_model_config -v

# Manual verification
cd .github/run-eval
MODEL_IDS="your-model-id" GITHUB_OUTPUT=/tmp/output.txt python resolve_model_config.py
```

## Step 6: Run Integration Tests (Required Before PR)

**Mandatory**: Integration tests must pass before creating a PR.

### Via GitHub Actions

1. Push your branch: `git push origin your-branch-name`
2. Navigate to: https://github.com/OpenHands/software-agent-sdk/actions/workflows/integration-runner.yml
3. Click "Run workflow"
4. Configure:
   - **Branch**: Select your branch
   - **model_ids**: `your-model-id`
   - **Reason**: "Testing model-id"
5. Wait for completion
6. **Save the run URL** - it is required for the PR description

### Expected Results

- Success rate: 100% (or 87.5% if the vision test is skipped)
- Duration: 5-10 minutes per model
- Tests: 8 total (basic commands, file ops, code editing, reasoning, errors, tools, context, vision)

## Step 7: Create PR

### Required in PR Description

```markdown
## Integration Test Results
✅ Integration tests passed: [PASTE GITHUB ACTIONS RUN URL]

[Summary table showing test results]

## Configuration
- Model ID: model-id
- Provider: Provider Name
- Temperature: [value] - [reasoning for choice]
- Feature categories: [list categories added to model_features.py]

Fixes #[issue-number]
```

## Common Issues

### Integration Tests Hang (6-8+ hours)
**Causes**:
- Missing `max_tokens` parameter
- Claude models with both `temperature` and `top_p` set
- Model not in REASONING_EFFORT_MODELS or EXTENDED_THINKING_MODELS

**Solutions**: Add `max_tokens`, remove parameter conflicts, add to the appropriate feature category.

**Reference**: #2147

### Preflight Check: "Cannot specify both temperature and top_p"
**Cause**: Claude models receiving both parameters

**Solutions**:
- Remove `top_p` from llm_config if `temperature` is set
- Add the model to REASONING_EFFORT_MODELS or EXTENDED_THINKING_MODELS (auto-strips both)

**Reference**: #2137, #2193

### Vision Tests Fail
**Cause**: LiteLLM reports vision support but the model doesn't actually support it

**Solution**: Add `"disable_vision": True` to llm_config

**Reference**: #2110 (GLM-5), #1898 (GLM-4.7)

### Wrong Prompt Template (GPT models)
**Cause**: Model variant not detected correctly, so it falls through to the wrong template

**Solution**: Add explicit entries to `model_prompt_spec.py` with the correct pattern order

**Reference**: #2233 (GPT-5.2-codex, GPT-5.3-codex)

### SDK-Only Parameters Sent to LiteLLM
**Cause**: A parameter like `disable_vision` is not in the `SDK_ONLY_PARAMS` set

**Solution**: Add it to `SDK_ONLY_PARAMS` in `resolve_model_config.py`

**Reference**: #2194

## Model Feature Detection Criteria

### How to Determine if a Model Needs a Feature Category

**Reasoning Model**:
- Check provider documentation for "reasoning", "thinking", or "o1-style" mentions
- Model exposes internal reasoning traces
- Examples: o1, o3, GPT-5, Claude Opus 4.5+, Gemini 3+

**Extended Thinking**:
- Check if the model is Claude Sonnet 4.5+ or Claude Haiku 4.5
- Provider documents extended thinking capabilities

**Prompt Caching**:
- Check provider documentation for prompt caching support
- Anthropic Claude 3.5+ and 4+ series support this

**Vision Support**:
- Check provider documentation (don't rely solely on LiteLLM)
- If LiteLLM reports vision but provider docs say text-only, add `disable_vision: True`

**Stop Words**:
- Most models support stop words
- The o1/o3 series, some Grok models, and DeepSeek R1 do not

**String Serialization**:
- Needed if tool message errors mention "Input should be a valid string"
- DeepSeek, GLM, and some provider-specific models need this
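The string-serialization fix amounts to flattening structured tool content into plain text. A sketch; the content shape (a list of text parts) and the helper name are assumptions for illustration:

```python
# Hypothetical flattening of structured tool-message content into the
# plain string some models require; the part shape is an assumption.
def serialize_tool_content(content) -> str:
    """Return content unchanged if already a string, else join its text parts."""
    if isinstance(content, str):
        return content
    return "".join(part.get("text", "") for part in content)

print(serialize_tool_content([{"type": "text", "text": "ls -la"},
                              {"type": "text", "text": " done"}]))
```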

## Reference

- Recent model additions: #2102, #2153, #2207, #2233, #2269
- Common issues: #2147 (hangs), #2137 (parameters), #2110 (vision), #2233 (variants), #2193 (preflight)
- Integration test workflow: `.github/workflows/integration-runner.yml`