This file (resolve_model_config.py) defines models available for evaluation. Models must be added here before they can be used in integration tests or evaluations.
ONLY ADD NEW CONTENT - DO NOT MODIFY EXISTING CODE
- Never modify existing model entries - they are production code, already working
- Never modify existing tests - especially test assertions, mock configs, or expected values
- Never reformat existing code - preserve exact spacing, quotes, commas, formatting
- Never reorder models or imports - dictionary and import order must be preserved
- Never "fix" existing code - if it's in the file and tests pass, it works
- Never change test assertions - even if they "look wrong" to you
- Never replace real model tests with mocked tests - weakens validation
- Never fix import names - if `test_model` exists, don't change it to `check_model`
Example violations (all found in real PRs):
- Changing `assert result[0]["id"] == "claude-sonnet-4-5-20250929"` to `"gpt-4"` ❌
- Replacing real model config tests with mocked/custom model tests ❌
- "Fixing" `from resolve_model_config import test_model` to `check_model` ❌
- Adding "Fixed incorrect assertions" without explaining what was incorrect ❌
- Claiming to "fix test issues" when tests were already passing ❌
When adding a model:
- Add ONE new entry to the MODELS dictionary
- Add ONE new test function (follow existing pattern exactly)
- Add to feature lists in model_features.py ONLY if needed for your model
- Do not touch any other files, tests, imports, or configurations
- Test the PR branch with the integration test action.
- Add a link to the integrations test to the PR.
- If you think something is broken, it's probably not - add a comment to the PR.
Always required:
- `.github/run-eval/resolve_model_config.py` - Add model configuration
- `tests/github_workflows/test_resolve_model_config.py` - Add test
Usually required (if model has special characteristics):
- `openhands-sdk/openhands/sdk/llm/utils/model_features.py` - Add to feature categories
Sometimes required:
- `openhands-sdk/openhands/sdk/llm/utils/model_prompt_spec.py` - GPT models only (variant detection)
- `openhands-sdk/openhands/sdk/llm/utils/verified_models.py` - Production-ready models
⚠️ When editing `verified_models.py`: If you add a model to `VERIFIED_OPENHANDS_MODELS`, you must also add it to its provider-specific list (e.g. `VERIFIED_ANTHROPIC_MODELS`, `VERIFIED_GEMINI_MODELS`, `VERIFIED_MOONSHOT_MODELS`, etc.). If no list exists for the provider yet, create one and add it to the `VERIFIED_MODELS` dict. This ensures the model appears under its actual provider in the UI, not just under "openhands".
Add entry to MODELS dictionary:

```python
"model-id": {
    "id": "model-id",  # Must match dictionary key
    "display_name": "Human Readable Name",
    "llm_config": {
        "model": "litellm_proxy/provider/model-name",
        "temperature": 0.0,  # See temperature guide below
    },
},
```

| Value | When to Use | Provider Requirements |
|---|---|---|
| `0.0` | Standard deterministic models | Most providers |
| `1.0` | Reasoning models | Kimi K2, MiniMax M2.5 |
| `None` | Use provider default | When unsure |
Add only if needed:
- `disable_vision: True` - Model doesn't support vision despite LiteLLM reporting it does (GLM-4.7, GLM-5)
- `reasoning_effort: "high"` - For OpenAI reasoning models that support this parameter
- `max_tokens: <value>` - To prevent hangs or control output length
- `top_p: <value>` - Nucleus sampling (cannot be used with `temperature` for Claude models)
- `litellm_extra_body: {...}` - Provider-specific parameters (e.g., `{"enable_thinking": True}`)
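For illustration, a hypothetical entry combining several of these optional parameters might look like the following. The model name `example-reasoner-v1` and all of its values are placeholders, not a real entry:

```python
# Hypothetical MODELS entry showing optional parameters in context.
# "example-reasoner-v1" and every value here are illustrative placeholders.
MODELS = {
    "example-reasoner-v1": {
        "id": "example-reasoner-v1",  # Must match dictionary key
        "display_name": "Example Reasoner v1",
        "llm_config": {
            "model": "litellm_proxy/example/reasoner-v1",
            "temperature": 1.0,  # Reasoning-style model
            "max_tokens": 8192,  # Guard against hangs
            "disable_vision": True,  # Provider docs say text-only
            "litellm_extra_body": {"enable_thinking": True},
        },
    },
}
```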
- Model ID must match dictionary key
- Model path must start with `litellm_proxy/`
- Claude models: Cannot use both `temperature` and `top_p` - choose one or omit both
- Parameters like `disable_vision` must be in the `SDK_ONLY_PARAMS` constant (they're filtered before sending to LiteLLM)
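These rules can be checked mechanically. A minimal sketch of such a check - `validate_entry` is illustrative, not a function that exists in resolve_model_config.py:

```python
def validate_entry(key: str, entry: dict) -> list[str]:
    """Illustrative consistency check for a single MODELS entry."""
    errors = []
    if entry.get("id") != key:
        errors.append(f"id {entry.get('id')!r} does not match key {key!r}")
    cfg = entry.get("llm_config", {})
    if not cfg.get("model", "").startswith("litellm_proxy/"):
        errors.append("model path must start with 'litellm_proxy/'")
    if "claude" in cfg.get("model", "") and "temperature" in cfg and "top_p" in cfg:
        errors.append("Claude models cannot set both temperature and top_p")
    return errors

# A well-formed entry produces no errors:
ok = validate_entry(
    "my-model",
    {"id": "my-model", "llm_config": {"model": "litellm_proxy/provider/my-model"}},
)
```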
Check provider documentation to determine which feature categories apply:
Models that support reasoning_effort parameter:
- OpenAI: o1, o3, o4, GPT-5 series
- Anthropic: Claude Opus 4.5+, Claude Sonnet 4.6
- Google: Gemini 2.5+, Gemini 3.x series
- AWS: Nova 2 Lite
```python
REASONING_EFFORT_MODELS: list[str] = [
    "your-model-identifier",  # Add here
]
```

Effect: Automatically strips `temperature` and `top_p` parameters to avoid API conflicts.
Models with extended thinking capabilities:
- Anthropic: Claude Sonnet 4.5+, Claude Haiku 4.5
```python
EXTENDED_THINKING_MODELS: list[str] = [
    "your-model-identifier",  # Add here
]
```

Effect: Automatically strips `temperature` and `top_p` parameters.
Models supporting prompt caching:
- Anthropic: Claude 3.5+, Claude 4+ series
```python
PROMPT_CACHE_MODELS: list[str] = [
    "your-model-identifier",  # Add here
]
```

Models that do not support stop words:
- OpenAI: o1, o3 series
- xAI: Grok-4, Grok-code-fast-1
- DeepSeek: R1 family
```python
SUPPORTS_STOP_WORDS_FALSE_MODELS: list[str] = [
    "your-model-identifier",  # Add here
]
```

Models requiring string format for tool messages (not structured content):
- DeepSeek models
- GLM models
- Groq: Kimi K2-Instruct
- OpenRouter: MiniMax
Use pattern matching:
```python
FORCE_STRING_SERIALIZER_MODELS: list[str] = [
    "deepseek",  # Matches any model with "deepseek" in name
    "groq/kimi-k2-instruct",  # Provider-prefixed
]
```

- PROMPT_CACHE_RETENTION_MODELS: GPT-5 family, GPT-4.1
- RESPONSES_API_MODELS: GPT-5 family, codex-mini-latest
- SEND_REASONING_CONTENT_MODELS: Kimi K2 Thinking/K2.5, MiniMax-M2, DeepSeek Reasoner
See model_features.py for complete lists and additional documentation.
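The pattern-matching hint above (a bare "deepseek" entry catching any DeepSeek model) suggests features are matched by substring against the model identifier. A rough sketch of that idea - the helper `matches_feature` is hypothetical; see model_features.py for the real matching logic:

```python
def matches_feature(model: str, patterns: list[str]) -> bool:
    """Hypothetical matcher: a pattern hits if it appears anywhere
    in the lowercased model identifier."""
    name = model.lower()
    return any(p.lower() in name for p in patterns)

# Example feature list (values taken from the documentation above)
FORCE_STRING_SERIALIZER_MODELS = ["deepseek", "groq/kimi-k2-instruct"]
```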
File: tests/github_workflows/test_resolve_model_config.py
Important:
- Python function names cannot contain hyphens. Convert model ID hyphens to underscores.
- Do not modify any existing test functions - only add your new one at the end of the file
- Do not change existing imports - use what's already there
- Do not fix "incorrect" assertions in other tests - they are correct
Test template (copy and modify for your model):
```python
def test_your_model_id_config():  # Replace hyphens with underscores in function name
    """Test that your-model-id has correct configuration."""
    model = MODELS["your-model-id"]  # Dictionary key keeps hyphens
    assert model["id"] == "your-model-id"
    assert model["display_name"] == "Your Model Display Name"
    assert model["llm_config"]["model"] == "litellm_proxy/provider/model-name"
    # Only add assertions for parameters YOU added in resolve_model_config.py
    # assert model["llm_config"]["temperature"] == 0.0
    # assert model["llm_config"]["disable_vision"] is True
```

What NOT to do in tests:
- Don't change assertions in other test functions (even if model names "look wrong")
- Don't replace real model tests with mocked tests
- Don't change `test_model` to `check_model` in imports
- Don't modify mock_models dictionaries in other tests
- Don't add "fixes" to existing tests - they work as-is
File: openhands-sdk/openhands/sdk/llm/utils/model_prompt_spec.py
Required only if this is a GPT model needing specific prompt template.
Order matters: More specific patterns must come before general patterns.
```python
_MODEL_VARIANT_PATTERNS: dict[str, tuple[tuple[str, tuple[str, ...]], ...]] = {
    "openai_gpt": (
        (
            "gpt-5-codex",  # Specific variant first
            ("gpt-5-codex", "gpt-5.1-codex", "gpt-5.2-codex", "gpt-5.3-codex"),
        ),
        ("gpt-5", ("gpt-5", "gpt-5.1", "gpt-5.2")),  # General variant last
    ),
}
```

```shell
# Pre-commit checks
pre-commit run --all-files

# Unit tests
pytest tests/github_workflows/test_resolve_model_config.py::test_your_model_config -v

# Manual verification
cd .github/run-eval
MODEL_IDS="your-model-id" GITHUB_OUTPUT=/tmp/output.txt python resolve_model_config.py
```

Push your branch and create a draft PR. Note the PR number returned - you'll need it for the integration tests.
Trigger integration tests on your PR branch:

```shell
gh workflow run integration-runner.yml \
  -f model_ids=your-model-id \
  -f reason="Testing new model from PR #<pr-number>" \
  -f issue_number=<pr-number> \
  --ref your-branch-name
```

Results will be posted back to the PR as a comment.
- Success rate: 100% (or 87.5% if vision test skipped)
- Duration: 5-10 minutes per model
- Tests: 8 total (basic commands, file ops, code editing, reasoning, errors, tools, context, vision)
If tests fail, see Common Issues below. After fixing:
- Push the fix: `git add . && git commit && git push`
- Rerun integration tests with the same command from Step 7 (using the same PR number)
When tests pass, mark the PR as ready for review:
```shell
gh pr ready <pr-number>
```

## Summary
Adds the `model-id` model to resolve_model_config.py.
## Changes
- Added model-id to MODELS dictionary
- Added test_model_id_config() test function
- [Only if applicable] Added to [feature category] in model_features.py
## Configuration
- Model ID: model-id
- Provider: Provider Name
- Temperature: [value] - [reasoning for choice]
- [List any special parameters and why needed]
## Integration Test Results
✅ Integration tests passed: [PASTE GITHUB ACTIONS RUN URL]
[Summary table showing test results]
Fixes #[issue-number]

Do not claim to have "fixed" things unless they were actually broken:
- ❌ "Fixed test_model import issue" (if tests were passing, there was no issue)
- ❌ "Fixed incorrect assertions in existing tests" (they were correct)
- ❌ "Improved test coverage" (unless you actually added new test cases)
- ❌ "Cleaned up code" (you shouldn't be cleaning up anything)
- ❌ "Updated test approach" (you shouldn't be changing testing approach)
Only describe what you actually added:
- ✅ "Added gpt-5.3-codex model configuration"
- ✅ "Added test for gpt-5.3-codex"
- ✅ "Added gpt-5.3-codex to REASONING_EFFORT_MODELS"
Causes:
- Missing `max_tokens` parameter
- Claude models with both `temperature` and `top_p` set
- Model not in REASONING_EFFORT_MODELS or EXTENDED_THINKING_MODELS
Solutions: Add max_tokens, remove parameter conflicts, add to appropriate feature category.
Reference: #2147
Cause: Claude models receiving both parameters
Solutions:
- Remove `top_p` from llm_config if `temperature` is set
- Add model to REASONING_EFFORT_MODELS or EXTENDED_THINKING_MODELS (auto-strips both)
Reference: #2137, #2193
Cause: LiteLLM reports vision support but model doesn't actually support it
Solution: Add `"disable_vision": True` to llm_config
Reference: #2110 (GLM-5), #1898 (GLM-4.7)
Cause: Model variant not detected correctly, falls through to wrong template
Solution: Add explicit entries to model_prompt_spec.py with correct pattern order
Reference: #2233 (GPT-5.2-codex, GPT-5.3-codex)
Cause: Parameter like disable_vision not in SDK_ONLY_PARAMS set
Solution: Add to SDK_ONLY_PARAMS in resolve_model_config.py
Reference: #2194
Reasoning Model:
- Check provider documentation for "reasoning", "thinking", or "o1-style" mentions
- Model exposes internal reasoning traces
- Examples: o1, o3, GPT-5, Claude Opus 4.5+, Gemini 3+
Extended Thinking:
- Check if model is Claude Sonnet 4.5+ or Claude Haiku 4.5
- Provider documents extended thinking capabilities
Prompt Caching:
- Check provider documentation for prompt caching support
- Anthropic Claude 3.5+ and 4+ series support this
Vision Support:
- Check provider documentation (don't rely solely on LiteLLM)
- If LiteLLM reports vision but provider docs say text-only, add `disable_vision: True`
Stop Words:
- Most models support stop words
- o1/o3 series, some Grok models, DeepSeek R1 do not
String Serialization:
- If tool message errors mention "Input should be a valid string"
- DeepSeek, GLM, some provider-specific models need this
- Recent model additions: #2102, #2153, #2207, #2233, #2269
- Common issues: #2147 (hangs), #2137 (parameters), #2110 (vision), #2233 (variants), #2193 (preflight)
- Integration test workflow: `.github/workflows/integration-runner.yml`
- Integration tests can be triggered via: `gh workflow run integration-runner.yml --ref <branch>`