Adding Models to resolve_model_config.py

Overview

This file (resolve_model_config.py) defines models available for evaluation. Models must be added here before they can be used in integration tests or evaluations.

Critical Rules

ONLY ADD NEW CONTENT - DO NOT MODIFY EXISTING CODE

What NOT to Do

Never modify existing model entries - they are production code, already working
Never modify existing tests - especially test assertions, mock configs, or expected values
Never reformat existing code - preserve exact spacing, quotes, commas, formatting
Never reorder models or imports - dictionary and import order must be preserved
Never "fix" existing code - if it's in the file and tests pass, it works
Never change test assertions - even if they "look wrong" to you
Never replace real model tests with mocked tests - weakens validation
Never fix import names - if test_model exists, don't change it to check_model

What These Rules Prevent

Example violations (all found in real PRs):

Changing assert result[0]["id"] == "claude-sonnet-4-5-20250929" to "gpt-4" ❌
Replacing real model config tests with mocked/custom model tests ❌
"Fixing" from resolve_model_config import test_model to check_model ❌
Adding "Fixed incorrect assertions" without explaining what was incorrect ❌
Claiming to "fix test issues" when tests were already passing ❌

What TO Do

When adding a model:

Add ONE new entry to the MODELS dictionary
Add ONE new test function (follow existing pattern exactly)
Add to feature lists in model_features.py ONLY if needed for your model
Do not touch any other files, tests, imports, or configurations
Test the PR branch with the integration test action.
Add a link to the integration test run to the PR.
If you think something is broken, it probably is not. Leave a comment on the PR instead of changing it.

Files to Modify

Always required:
- .github/run-eval/resolve_model_config.py - Add model configuration
- tests/github_workflows/test_resolve_model_config.py - Add test

Usually required (if model has special characteristics):
- openhands-sdk/openhands/sdk/llm/utils/model_features.py - Add to feature categories

Sometimes required:
- openhands-sdk/openhands/sdk/llm/utils/model_prompt_spec.py - GPT models only (variant detection)
- openhands-sdk/openhands/sdk/llm/utils/verified_models.py - Production-ready models

⚠️ When editing verified_models.py: If you add a model to VERIFIED_OPENHANDS_MODELS, you must also add it to its provider-specific list (e.g. VERIFIED_ANTHROPIC_MODELS, VERIFIED_GEMINI_MODELS, VERIFIED_MOONSHOT_MODELS, etc.). If no list exists for the provider yet, create one and add it to the VERIFIED_MODELS dict. This ensures the model appears under its actual provider in the UI, not just under "openhands".
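
The verified_models.py rule above can be checked mechanically. A minimal sketch, assuming the verified lists are plain lists of model-name strings and that VERIFIED_MODELS maps a provider name to its list. The model names and the helper `missing_from_provider_lists` are hypothetical and not taken from verified_models.py:

```python
# Hypothetical stand-ins for the lists in verified_models.py (contents are illustrative).
VERIFIED_OPENHANDS_MODELS = ["example-claude-model", "example-gemini-model"]
VERIFIED_ANTHROPIC_MODELS = ["example-claude-model"]
VERIFIED_GEMINI_MODELS = []  # "example-gemini-model" was forgotten here

# Assumed shape: provider name -> list of verified model names.
VERIFIED_MODELS = {
    "openhands": VERIFIED_OPENHANDS_MODELS,
    "anthropic": VERIFIED_ANTHROPIC_MODELS,
    "gemini": VERIFIED_GEMINI_MODELS,
}


def missing_from_provider_lists(verified_models):
    """Return openhands models that appear in no provider-specific list."""
    provider_models = {
        name
        for provider, names in verified_models.items()
        if provider != "openhands"
        for name in names
    }
    return [m for m in verified_models["openhands"] if m not in provider_models]


missing = missing_from_provider_lists(VERIFIED_MODELS)
print(missing)  # models that would only show up under "openhands" in the UI
```

An empty result means every openhands model also sits under its real provider.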

Step 1: Add to resolve_model_config.py

Add entry to MODELS dictionary:

"model-id": {
    "id": "model-id",  # Must match dictionary key
    "display_name": "Human Readable Name",
    "llm_config": {
        "model": "litellm_proxy/provider/model-name",
        "temperature": 0.0,  # See temperature guide below
    },
},

Temperature Configuration

| Value | When to Use | Provider Requirements |
|-------|-------------|-----------------------|
| `0.0` | Standard deterministic models | Most providers |
| `1.0` | Reasoning models | Kimi K2, MiniMax M2.5 |
| `None` | Use provider default | When unsure |

Special Parameters

Add only if needed:

disable_vision: True - Model doesn't support vision despite LiteLLM reporting it does (GLM-4.7, GLM-5)
reasoning_effort: "high" - For OpenAI reasoning models that support this parameter
max_tokens: <value> - To prevent hangs or control output length
top_p: <value> - Nucleus sampling (cannot be used with temperature for Claude models)
litellm_extra_body: {...} - Provider-specific parameters (e.g., {"enable_thinking": True})
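
For orientation, here is a sketch of an entry that uses several of these parameters at once. It assumes all of them live inside llm_config next to temperature. The model ID, provider path, and the max_tokens value are placeholders, not a real configuration:

```python
# Hypothetical entry; "example-model" and its provider path do not exist.
MODELS = {
    "example-model": {
        "id": "example-model",  # Must match dictionary key
        "display_name": "Example Model",
        "llm_config": {
            "model": "litellm_proxy/example-provider/example-model",
            "temperature": 1.0,  # Reasoning-style model, see the temperature guide
            "max_tokens": 8192,  # Illustrative value, chosen only for this sketch
            "disable_vision": True,  # Provider docs say text-only
            "litellm_extra_body": {"enable_thinking": True},
        },
    },
}

print(sorted(MODELS["example-model"]["llm_config"]))
```

Most models need none of these. Start from the minimal entry and add a parameter only when a test or the provider documentation calls for it.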

Critical Rules

Model ID must match dictionary key
Model path must start with litellm_proxy/
Claude models: Cannot use both temperature and top_p - choose one or omit both
Parameters like disable_vision must be listed in the SDK_ONLY_PARAMS constant so they are filtered out before the config is sent to LiteLLM
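
The first three rules are easy to check before opening a PR. A minimal sketch, assuming a Claude model can be recognised by "claude" appearing in its model path. The helper `check_entry` is hypothetical and is not part of resolve_model_config.py:

```python
def check_entry(key, entry):
    """Return a list of rule violations for one MODELS entry (empty if none)."""
    problems = []
    if entry.get("id") != key:
        problems.append("id does not match dictionary key")

    llm_config = entry.get("llm_config", {})
    model_path = llm_config.get("model", "")
    if not model_path.startswith("litellm_proxy/"):
        problems.append("model path does not start with litellm_proxy/")

    # Assumption: Claude models are identified by substring.
    is_claude = "claude" in model_path.lower()
    if is_claude and "temperature" in llm_config and "top_p" in llm_config:
        problems.append("Claude model sets both temperature and top_p")
    return problems


bad_entry = {
    "id": "other-id",
    "display_name": "Bad Example",
    "llm_config": {
        "model": "anthropic/claude-example",
        "temperature": 0.0,
        "top_p": 0.9,
    },
}
for problem in check_entry("bad-model", bad_entry):
    print(problem)
```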

Step 2: Update model_features.py (if applicable)

Check provider documentation to determine which feature categories apply:

REASONING_EFFORT_MODELS

Models that support reasoning_effort parameter:

OpenAI: o1, o3, o4, GPT-5 series
Anthropic: Claude Opus 4.5+, Claude Sonnet 4.6
Google: Gemini 2.5+, Gemini 3.x series
AWS: Nova 2 Lite

REASONING_EFFORT_MODELS: list[str] = [
    "your-model-identifier",  # Add here
]

Effect: Automatically strips temperature and top_p parameters to avoid API conflicts.
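
To make the effect concrete, a rough sketch of what stripping could look like. It assumes membership is decided by a case-insensitive substring match against the list. The function `strip_sampling_params` and the model name are hypothetical, and the SDK's real logic may differ:

```python
# Hypothetical entry; the real list lives in model_features.py.
REASONING_EFFORT_MODELS = ["example-reasoning-model"]


def strip_sampling_params(model, llm_config):
    """Drop temperature and top_p when the model is in the reasoning category."""
    name = model.lower()
    if any(pattern in name for pattern in REASONING_EFFORT_MODELS):
        return {
            k: v for k, v in llm_config.items() if k not in {"temperature", "top_p"}
        }
    return dict(llm_config)


config = {
    "model": "litellm_proxy/example/example-reasoning-model",
    "temperature": 0.0,
    "top_p": 0.9,
}
stripped = strip_sampling_params(config["model"], config)
print(stripped)
```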

EXTENDED_THINKING_MODELS

Models with extended thinking capabilities:

Anthropic: Claude Sonnet 4.5+, Claude Haiku 4.5

EXTENDED_THINKING_MODELS: list[str] = [
    "your-model-identifier",  # Add here
]

Effect: Automatically strips temperature and top_p parameters.

PROMPT_CACHE_MODELS

Models supporting prompt caching:

Anthropic: Claude 3.5+, Claude 4+ series

PROMPT_CACHE_MODELS: list[str] = [
    "your-model-identifier",  # Add here
]

SUPPORTS_STOP_WORDS_FALSE_MODELS

Models that do not support stop words:

OpenAI: o1, o3 series
xAI: Grok-4, Grok-code-fast-1
DeepSeek: R1 family

SUPPORTS_STOP_WORDS_FALSE_MODELS: list[str] = [
    "your-model-identifier",  # Add here
]

FORCE_STRING_SERIALIZER_MODELS

Models requiring string format for tool messages (not structured content):

DeepSeek models
GLM models
Groq: Kimi K2-Instruct
OpenRouter: MiniMax

Use pattern matching:

FORCE_STRING_SERIALIZER_MODELS: list[str] = [
    "deepseek",  # Matches any model with "deepseek" in name
    "groq/kimi-k2-instruct",  # Provider-prefixed
]
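
A sketch of how these patterns might behave, assuming plain case-insensitive substring matching against the full model path. The SDK's actual matcher may normalise names differently, and the model paths below are made up for illustration:

```python
FORCE_STRING_SERIALIZER_MODELS = [
    "deepseek",  # Matches any model with "deepseek" in name
    "groq/kimi-k2-instruct",  # Provider-prefixed
]


def needs_string_serializer(model):
    """True if any pattern occurs in the (lowercased) model path."""
    name = model.lower()
    return any(pattern in name for pattern in FORCE_STRING_SERIALIZER_MODELS)


for path in (
    "litellm_proxy/deepseek/deepseek-chat",
    "litellm_proxy/groq/kimi-k2-instruct",
    "litellm_proxy/moonshot/kimi-k2-instruct",
):
    print(path, needs_string_serializer(path))
```

Under this assumption the provider-prefixed pattern matches only the Groq-hosted model, while the bare "deepseek" pattern matches on any provider.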

Other Categories

PROMPT_CACHE_RETENTION_MODELS: GPT-5 family, GPT-4.1
RESPONSES_API_MODELS: GPT-5 family, codex-mini-latest
SEND_REASONING_CONTENT_MODELS: Kimi K2 Thinking/K2.5, MiniMax-M2, DeepSeek Reasoner

See model_features.py for complete lists and additional documentation.

Step 3: Add Test

File: tests/github_workflows/test_resolve_model_config.py

Important:

Python function names cannot contain hyphens. Convert model ID hyphens to underscores.
Do not modify any existing test functions - only add your new one at the end of the file
Do not change existing imports - use what's already there
Do not fix "incorrect" assertions in other tests - they are correct

Test template (copy and modify for your model):

def test_your_model_id_config():  # Replace hyphens with underscores in function name
    """Test that your-model-id has correct configuration."""
    model = MODELS["your-model-id"]  # Dictionary key keeps hyphens
    
    assert model["id"] == "your-model-id"
    assert model["display_name"] == "Your Model Display Name"
    assert model["llm_config"]["model"] == "litellm_proxy/provider/model-name"
    # Only add assertions for parameters YOU added in resolve_model_config.py
    # assert model["llm_config"]["temperature"] == 0.0
    # assert model["llm_config"]["disable_vision"] is True
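
If the model ID is awkward, the function name can be derived rather than typed by hand. A small sketch; `derive_test_name` is a hypothetical helper, and replacing dots as well as hyphens is an assumption that goes beyond the hyphen rule stated above:

```python
import re


def derive_test_name(model_id):
    """Build a valid test function name from a model ID."""
    # \W matches anything that is not a letter, digit, or underscore,
    # so hyphens and dots both become underscores.
    return "test_" + re.sub(r"\W", "_", model_id) + "_config"


name = derive_test_name("your-model-id")
print(name)
print(derive_test_name("gpt-5.3-codex").isidentifier())
```

Only the function name changes. The dictionary key and the asserted "id" keep the original hyphens and dots.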

What NOT to do in tests:

Don't change assertions in other test functions (even if model names "look wrong")
Don't replace real model tests with mocked tests
Don't change test_model to check_model in imports
Don't modify mock_models dictionaries in other tests
Don't add "fixes" to existing tests - they work as-is

Step 4: Update GPT Variant Detection (GPT models only)

File: openhands-sdk/openhands/sdk/llm/utils/model_prompt_spec.py

Required only if this is a GPT model that needs a specific prompt template.

Order matters: More specific patterns must come before general patterns.

_MODEL_VARIANT_PATTERNS: dict[str, tuple[tuple[str, tuple[str, ...]], ...]] = {
    "openai_gpt": (
        (
            "gpt-5-codex",  # Specific variant first
            ("gpt-5-codex", "gpt-5.1-codex", "gpt-5.2-codex", "gpt-5.3-codex"),
        ),
        ("gpt-5", ("gpt-5", "gpt-5.1", "gpt-5.2")),  # General variant last
    ),
}
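
Why order matters can be seen with a first-match-wins lookup. This is a sketch under the assumption that detection walks the patterns in order and uses substring matching. `detect_variant` is a hypothetical function, not the one in model_prompt_spec.py:

```python
from typing import Optional

PATTERNS = (
    (
        "gpt-5-codex",  # Specific variant first
        ("gpt-5-codex", "gpt-5.1-codex", "gpt-5.2-codex", "gpt-5.3-codex"),
    ),
    ("gpt-5", ("gpt-5", "gpt-5.1", "gpt-5.2")),  # General variant last
)


def detect_variant(model, patterns) -> Optional[str]:
    """Return the first variant whose substrings match the model name."""
    name = model.lower()
    for variant, substrings in patterns:
        if any(s in name for s in substrings):
            return variant
    return None


model = "litellm_proxy/openai/gpt-5.2-codex"
correct = detect_variant(model, PATTERNS)
wrong = detect_variant(model, tuple(reversed(PATTERNS)))
print(correct, wrong)
```

With the general pattern first, "gpt-5" matches inside "gpt-5.2-codex" and the codex model falls through to the wrong template.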

Step 5: Run Tests Locally

# Pre-commit checks
pre-commit run --all-files

# Unit tests
pytest tests/github_workflows/test_resolve_model_config.py::test_your_model_id_config -v

# Manual verification
cd .github/run-eval
MODEL_IDS="your-model-id" GITHUB_OUTPUT=/tmp/output.txt python resolve_model_config.py

Step 6: Create Draft PR

Push your branch and create a draft PR. Note the PR number returned - you'll need it for the integration tests.

Step 7: Run Integration Tests

Trigger integration tests on your PR branch:

gh workflow run integration-runner.yml \
  -f model_ids=your-model-id \
  -f reason="Testing new model from PR #<pr-number>" \
  -f issue_number=<pr-number> \
  --ref your-branch-name

Results will be posted back to the PR as a comment.
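
When rerunning often, it can help to build the command once in a script. A sketch that only assembles the argument list shown above and does not run it. The helper name and the PR number 1234 are placeholders:

```python
import shlex


def integration_test_command(model_id, pr_number, branch):
    """Assemble the gh invocation for one model on a PR branch."""
    return [
        "gh", "workflow", "run", "integration-runner.yml",
        "-f", f"model_ids={model_id}",
        "-f", f"reason=Testing new model from PR #{pr_number}",
        "-f", f"issue_number={pr_number}",
        "--ref", branch,
    ]


cmd = integration_test_command("your-model-id", 1234, "your-branch-name")
print(shlex.join(cmd))  # Shell-quoted form, ready to paste into a terminal
```

Passing `cmd` to `subprocess.run(cmd, check=True)` would trigger the workflow, provided gh is installed and authenticated.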

Expected Results

Success rate: 100% (or 87.5%, i.e. 7 of 8, if the vision test is skipped)
Duration: 5-10 minutes per model
Tests: 8 total (basic commands, file ops, code editing, reasoning, errors, tools, context, vision)
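
The two expected percentages follow directly from the test count. A quick check, assuming a skipped vision test still counts toward the total of 8:

```python
TOTAL_TESTS = 8  # basic commands, file ops, code editing, reasoning, errors, tools, context, vision


def success_rate(passed, total=TOTAL_TESTS):
    """Success rate as a percentage of all tests."""
    return 100.0 * passed / total


print(success_rate(8))  # all tests pass
print(success_rate(7))  # vision test skipped
```

Any other figure means at least one non-vision test failed.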

Step 8: Fix Issues and Rerun (if needed)

If tests fail, see Common Issues below. After fixing:

Push the fix: git add . && git commit && git push
Rerun integration tests with the same command from Step 7 (using the same PR number)

Step 9: Mark PR Ready

When tests pass, mark the PR as ready for review:

gh pr ready <pr-number>

Required in PR Description

## Summary
Adds the `model-id` model to resolve_model_config.py.

## Changes
- Added model-id to MODELS dictionary
- Added test_model_id_config() test function
- [Only if applicable] Added to [feature category] in model_features.py

## Configuration
- Model ID: model-id
- Provider: Provider Name  
- Temperature: [value] - [reasoning for choice]
- [List any special parameters and why needed]

## Integration Test Results
✅ Integration tests passed: [PASTE GITHUB ACTIONS RUN URL]

[Summary table showing test results]

Fixes #[issue-number]

What NOT to Include in PR Description

Do not claim to have "fixed" things unless they were actually broken:

❌ "Fixed test_model import issue" (if tests were passing, there was no issue)
❌ "Fixed incorrect assertions in existing tests" (they were correct)
❌ "Improved test coverage" (unless you actually added new test cases)
❌ "Cleaned up code" (you shouldn't be cleaning up anything)
❌ "Updated test approach" (you shouldn't be changing testing approach)

Only describe what you actually added:

✅ "Added gpt-5.3-codex model configuration"
✅ "Added test for gpt-5.3-codex"
✅ "Added gpt-5.3-codex to REASONING_EFFORT_MODELS"

Common Issues

Integration Tests Hang (6-8+ hours)

Causes:

Missing max_tokens parameter
Claude models with both temperature and top_p set
Model not in REASONING_EFFORT_MODELS or EXTENDED_THINKING_MODELS

Solutions: Add max_tokens, remove parameter conflicts, add to appropriate feature category.

Reference: #2147

Preflight Check: "Cannot specify both temperature and top_p"

Cause: Claude models receiving both parameters

Solutions:

Remove top_p from llm_config if temperature is set
Add model to REASONING_EFFORT_MODELS or EXTENDED_THINKING_MODELS (auto-strips both)

Reference: #2137, #2193

Vision Tests Fail

Cause: LiteLLM reports vision support but model doesn't actually support it

Solution: Add "disable_vision": True to llm_config

Reference: #2110 (GLM-5), #1898 (GLM-4.7)

Wrong Prompt Template (GPT models)

Cause: Model variant not detected correctly, falls through to wrong template

Solution: Add explicit entries to model_prompt_spec.py with correct pattern order

Reference: #2233 (GPT-5.2-codex, GPT-5.3-codex)

SDK-Only Parameters Sent to LiteLLM

Cause: Parameter like disable_vision not in SDK_ONLY_PARAMS set

Solution: Add to SDK_ONLY_PARAMS in resolve_model_config.py

Reference: #2194
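
The filtering idea, sketched under the assumption that SDK_ONLY_PARAMS is a set of key names removed from llm_config before the request is built. `split_llm_config` is a hypothetical helper, and the set contents here are illustrative:

```python
# Assumed contents; check the real constant in resolve_model_config.py.
SDK_ONLY_PARAMS = {"disable_vision"}


def split_llm_config(llm_config):
    """Separate keys meant for LiteLLM from keys consumed only by the SDK."""
    for_litellm = {k: v for k, v in llm_config.items() if k not in SDK_ONLY_PARAMS}
    sdk_only = {k: v for k, v in llm_config.items() if k in SDK_ONLY_PARAMS}
    return for_litellm, sdk_only


for_litellm, sdk_only = split_llm_config(
    {
        "model": "litellm_proxy/example/example-model",
        "temperature": 0.0,
        "disable_vision": True,
    }
)
print(for_litellm)
print(sdk_only)
```

A new SDK-only key that is missing from the set would stay in the first dictionary and reach LiteLLM, which is the failure described above.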

Model Feature Detection Criteria

How to Determine if Model Needs Feature Category

Reasoning Model:

Check provider documentation for "reasoning", "thinking", or "o1-style" mentions
Model exposes internal reasoning traces
Examples: o1, o3, GPT-5, Claude Opus 4.5+, Gemini 3+

Extended Thinking:

Check if model is Claude Sonnet 4.5+ or Claude Haiku 4.5
Provider documents extended thinking capabilities

Prompt Caching:

Check provider documentation for prompt caching support
Anthropic Claude 3.5+ and 4+ series support this

Vision Support:

Check provider documentation (don't rely solely on LiteLLM)
If LiteLLM reports vision but provider docs say text-only, add disable_vision: True

Stop Words:

Most models support stop words
o1/o3 series, some Grok models, DeepSeek R1 do not

String Serialization:

Needed when tool message errors mention "Input should be a valid string"
DeepSeek, GLM, and some provider-specific models need this

Reference

Recent model additions: #2102, #2153, #2207, #2233, #2269
Common issues: #2147 (hangs), #2137 (parameters), #2110 (vision), #2233 (variants), #2193 (preflight)
Integration test workflow: .github/workflows/integration-runner.yml
Integration tests can be triggered via: gh workflow run integration-runner.yml --ref <branch>
