Add AGENTS.md for model addition guidance in .github/run-eval #2284 (Merged)

# Adding Models to resolve_model_config.py

## Overview

This file (`resolve_model_config.py`) defines the models available for evaluation. A model must be added here before it can be used in integration tests or evaluations.

## Files to Modify

1. **Always required**:
   - `.github/run-eval/resolve_model_config.py` - Add the model configuration
   - `tests/github_workflows/test_resolve_model_config.py` - Add a test

2. **Usually required** (if the model has special characteristics):
   - `openhands-sdk/openhands/sdk/llm/utils/model_features.py` - Add to feature categories

3. **Sometimes required**:
   - `openhands-sdk/openhands/sdk/llm/utils/model_prompt_spec.py` - GPT models only (variant detection)
   - `openhands-sdk/openhands/sdk/llm/utils/verified_models.py` - Production-ready models

## Step 1: Add to resolve_model_config.py

Add an entry to the `MODELS` dictionary:

```python
"model-id": {
    "id": "model-id",  # Must match the dictionary key
    "display_name": "Human Readable Name",
    "llm_config": {
        "model": "litellm_proxy/provider/model-name",
        "temperature": 0.0,  # See the temperature guide below
    },
},
```

### Temperature Configuration

| Value | When to Use | Provider Requirements |
|-------|-------------|-----------------------|
| `0.0` | Standard deterministic models | Most providers |
| `1.0` | Reasoning models | Kimi K2, MiniMax M2.5 |
| `None` | Use provider default | When unsure |

### Special Parameters

Add these only if needed:

- **`disable_vision: True`** - The model doesn't support vision even though LiteLLM reports that it does (GLM-4.7, GLM-5)
- **`reasoning_effort: "high"`** - For OpenAI reasoning models that support this parameter
- **`max_tokens: <value>`** - To prevent hangs or control output length
- **`top_p: <value>`** - Nucleus sampling (cannot be combined with `temperature` for Claude models)
- **`litellm_extra_body: {...}`** - Provider-specific parameters (e.g., `{"enable_thinking": True}`)

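A hypothetical entry combining several of these parameters could look like the following. The model ID, provider path, and values below are illustrative only, not a real configuration:

```python
# Illustrative MODELS entry -- the model ID and all values are made up.
MODELS = {
    "example-reasoning-model": {
        "id": "example-reasoning-model",  # must match the dictionary key
        "display_name": "Example Reasoning Model",
        "llm_config": {
            "model": "litellm_proxy/provider/example-reasoning-model",
            "temperature": 1.0,  # reasoning models often require 1.0
            "max_tokens": 8192,  # guards against hangs
            "disable_vision": True,  # SDK-only flag, filtered before LiteLLM
            "litellm_extra_body": {"enable_thinking": True},  # provider-specific
        },
    },
}
```
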
### Critical Rules

1. The model ID must match the dictionary key
2. The model path must start with `litellm_proxy/`
3. **Claude models**: cannot use both `temperature` and `top_p` - choose one or omit both
4. Parameters like `disable_vision` must be listed in the `SDK_ONLY_PARAMS` constant (they are filtered out before the request is sent to LiteLLM)

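The filtering in rule 4 can be sketched as follows. The constant's contents and the helper name here are illustrative, not the actual implementation in `resolve_model_config.py`:

```python
# Hypothetical sketch of SDK-only parameter filtering.
SDK_ONLY_PARAMS = {"disable_vision"}  # illustrative contents

def filter_for_litellm(llm_config: dict) -> dict:
    """Drop SDK-only keys so they are never sent to LiteLLM."""
    return {k: v for k, v in llm_config.items() if k not in SDK_ONLY_PARAMS}

config = {
    "model": "litellm_proxy/anthropic/claude-sonnet-4-6",
    "temperature": 0.0,
    "disable_vision": True,  # consumed by the SDK, never sent upstream
}
print(filter_for_litellm(config))
```
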
## Step 2: Update model_features.py (if applicable)

Check the provider documentation to determine which feature categories apply:

### REASONING_EFFORT_MODELS
Models that support the `reasoning_effort` parameter:
- OpenAI: o1, o3, o4, GPT-5 series
- Anthropic: Claude Opus 4.5+, Claude Sonnet 4.6
- Google: Gemini 2.5+, Gemini 3.x series
- AWS: Nova 2 Lite

```python
REASONING_EFFORT_MODELS: list[str] = [
    "your-model-identifier",  # Add here
]
```

**Effect**: automatically strips the `temperature` and `top_p` parameters to avoid API conflicts.

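The stripping effect described above can be sketched like this. The list contents, matching rule, and helper name are illustrative; see `model_features.py` for the real logic:

```python
# Hypothetical sketch of the parameter-stripping effect for
# models listed in REASONING_EFFORT_MODELS.
REASONING_EFFORT_MODELS = ["gpt-5", "o3"]  # illustrative contents

def strip_sampling_params(model_name: str, llm_config: dict) -> dict:
    """Remove temperature/top_p for models that take reasoning_effort."""
    if any(m in model_name for m in REASONING_EFFORT_MODELS):
        return {k: v for k, v in llm_config.items()
                if k not in ("temperature", "top_p")}
    return llm_config

cfg = strip_sampling_params(
    "litellm_proxy/openai/gpt-5",
    {"temperature": 0.0, "top_p": 0.9, "max_tokens": 4096},
)
print(cfg)  # {'max_tokens': 4096}
```
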
### EXTENDED_THINKING_MODELS
Models with extended thinking capabilities:
- Anthropic: Claude Sonnet 4.5+, Claude Haiku 4.5

```python
EXTENDED_THINKING_MODELS: list[str] = [
    "your-model-identifier",  # Add here
]
```

**Effect**: automatically strips the `temperature` and `top_p` parameters.

### PROMPT_CACHE_MODELS
Models supporting prompt caching:
- Anthropic: Claude 3.5+, Claude 4+ series

```python
PROMPT_CACHE_MODELS: list[str] = [
    "your-model-identifier",  # Add here
]
```

### SUPPORTS_STOP_WORDS_FALSE_MODELS
Models that **do not** support stop words:
- OpenAI: o1, o3 series
- xAI: Grok-4, Grok-code-fast-1
- DeepSeek: R1 family

```python
SUPPORTS_STOP_WORDS_FALSE_MODELS: list[str] = [
    "your-model-identifier",  # Add here
]
```

### FORCE_STRING_SERIALIZER_MODELS
Models requiring string format for tool messages (rather than structured content):
- DeepSeek models
- GLM models
- Groq: Kimi K2-Instruct
- OpenRouter: MiniMax

Use pattern matching:
```python
FORCE_STRING_SERIALIZER_MODELS: list[str] = [
    "deepseek",  # Matches any model with "deepseek" in the name
    "groq/kimi-k2-instruct",  # Provider-prefixed
]
```

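The substring matching described above can be sketched as follows. The helper name is illustrative, and the real matching rules in `model_features.py` may differ in detail:

```python
# Hypothetical sketch of substring-based pattern matching.
FORCE_STRING_SERIALIZER_MODELS = ["deepseek", "groq/kimi-k2-instruct"]

def needs_string_serializer(model_name: str) -> bool:
    # A model matches if any configured pattern appears as a
    # substring of its (lowercased) name.
    name = model_name.lower()
    return any(pattern in name for pattern in FORCE_STRING_SERIALIZER_MODELS)

print(needs_string_serializer("litellm_proxy/deepseek/deepseek-chat"))  # True
```
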
### Other Categories

- **PROMPT_CACHE_RETENTION_MODELS**: GPT-5 family, GPT-4.1
- **RESPONSES_API_MODELS**: GPT-5 family, codex-mini-latest
- **SEND_REASONING_CONTENT_MODELS**: Kimi K2 Thinking/K2.5, MiniMax-M2, DeepSeek Reasoner

See `model_features.py` for the complete lists and additional documentation.

## Step 3: Add Test

**File**: `tests/github_workflows/test_resolve_model_config.py`

**Important**: Python function names cannot contain hyphens. Convert the model ID's hyphens to underscores in the test name.

```python
def test_claude_sonnet_46_config():  # Note: hyphens -> underscores
    """Test that claude-sonnet-4-6 has the correct configuration."""
    model = MODELS["claude-sonnet-4-6"]  # The dictionary key keeps its hyphens

    assert model["id"] == "claude-sonnet-4-6"
    assert model["display_name"] == "Claude Sonnet 4.6"
    assert model["llm_config"]["model"] == "litellm_proxy/anthropic/claude-sonnet-4-6"
    assert model["llm_config"]["temperature"] == 0.0
```

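Beyond a per-model test, the invariants from the Critical Rules section can be checked across every entry at once. A hedged sketch, using a stand-in `MODELS` dictionary (in a real test it would be imported from `resolve_model_config`):

```python
# Hypothetical bulk check of the "Critical Rules" invariants.
MODELS = {  # illustrative stand-in for resolve_model_config.MODELS
    "claude-sonnet-4-6": {
        "id": "claude-sonnet-4-6",
        "display_name": "Claude Sonnet 4.6",
        "llm_config": {"model": "litellm_proxy/anthropic/claude-sonnet-4-6"},
    },
}

def test_model_invariants():
    for key, model in MODELS.items():
        # Rule 1: the model ID must match the dictionary key.
        assert model["id"] == key, f"{key}: id must match the dictionary key"
        # Rule 2: the model path must start with litellm_proxy/.
        assert model["llm_config"]["model"].startswith("litellm_proxy/")

test_model_invariants()
```
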
## Step 4: Update GPT Variant Detection (GPT models only)

**File**: `openhands-sdk/openhands/sdk/llm/utils/model_prompt_spec.py`

Required only if this is a GPT model that needs a specific prompt template.

**Order matters**: more specific patterns must come before general patterns.

```python
_MODEL_VARIANT_PATTERNS: dict[str, tuple[tuple[str, tuple[str, ...]], ...]] = {
    "openai_gpt": (
        (
            "gpt-5-codex",  # Specific variant first
            ("gpt-5-codex", "gpt-5.1-codex", "gpt-5.2-codex", "gpt-5.3-codex"),
        ),
        ("gpt-5", ("gpt-5", "gpt-5.1", "gpt-5.2")),  # General variant last
    ),
}
```

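Why the order matters can be illustrated with a first-match lookup sketch (the function and pattern table here are hypothetical, not the SDK's actual detection code):

```python
# Hypothetical first-match variant lookup.
_PATTERNS = (
    ("gpt-5-codex", ("gpt-5-codex", "gpt-5.1-codex")),  # specific first
    ("gpt-5", ("gpt-5", "gpt-5.1")),                    # general last
)

def detect_variant(model_name: str):
    for variant, aliases in _PATTERNS:
        if any(alias in model_name for alias in aliases):
            return variant
    return None

# "gpt-5.1-codex" matches the codex aliases before the general "gpt-5"
# entry; if the tuple order were reversed, "gpt-5" would wrongly win.
print(detect_variant("openai/gpt-5.1-codex"))  # gpt-5-codex
```
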
## Step 5: Run Tests Locally

```bash
# Pre-commit checks
pre-commit run --all-files

# Unit tests
pytest tests/github_workflows/test_resolve_model_config.py::test_your_model_config -v

# Manual verification
cd .github/run-eval
MODEL_IDS="your-model-id" GITHUB_OUTPUT=/tmp/output.txt python resolve_model_config.py
```

## Step 6: Run Integration Tests (Required Before PR)

**Mandatory**: integration tests must pass before creating the PR.

### Via GitHub Actions

1. Push the branch: `git push origin your-branch-name`
2. Navigate to: https://github.com/OpenHands/software-agent-sdk/actions/workflows/integration-runner.yml
3. Click "Run workflow"
4. Configure:
   - **Branch**: select your branch
   - **model_ids**: `your-model-id`
   - **Reason**: "Testing model-id"
5. Wait for completion
6. **Save the run URL** - it is required for the PR description

### Expected Results

- Success rate: 100% (or 87.5% if the vision test is skipped)
- Duration: 5-10 minutes per model
- Tests: 8 total (basic commands, file ops, code editing, reasoning, errors, tools, context, vision)

## Step 7: Create PR

### Required in PR Description

```markdown
## Integration Test Results
✅ Integration tests passed: [PASTE GITHUB ACTIONS RUN URL]

[Summary table showing test results]

## Configuration
- Model ID: model-id
- Provider: Provider Name
- Temperature: [value] - [reasoning for choice]
- Feature categories: [list categories added to model_features.py]

Fixes #[issue-number]
```

## Common Issues

### Integration Tests Hang (6-8+ hours)
**Causes**:
- Missing `max_tokens` parameter
- Claude models with both `temperature` and `top_p` set
- Model not listed in REASONING_EFFORT_MODELS or EXTENDED_THINKING_MODELS

**Solutions**: add `max_tokens`, remove the conflicting parameter, or add the model to the appropriate feature category.

**Reference**: #2147

### Preflight Check: "Cannot specify both temperature and top_p"
**Cause**: Claude models receiving both parameters

**Solutions**:
- Remove `top_p` from `llm_config` if `temperature` is set
- Add the model to REASONING_EFFORT_MODELS or EXTENDED_THINKING_MODELS (both parameters are then stripped automatically)

**Reference**: #2137, #2193

### Vision Tests Fail
**Cause**: LiteLLM reports vision support, but the model doesn't actually support it

**Solution**: add `"disable_vision": True` to `llm_config`

**Reference**: #2110 (GLM-5), #1898 (GLM-4.7)

### Wrong Prompt Template (GPT models)
**Cause**: the model variant is not detected correctly and falls through to the wrong template

**Solution**: add explicit entries to `model_prompt_spec.py` in the correct pattern order

**Reference**: #2233 (GPT-5.2-codex, GPT-5.3-codex)

### SDK-Only Parameters Sent to LiteLLM
**Cause**: a parameter like `disable_vision` is not in the `SDK_ONLY_PARAMS` set

**Solution**: add it to `SDK_ONLY_PARAMS` in `resolve_model_config.py`

**Reference**: #2194

## Model Feature Detection Criteria

### How to Determine if a Model Needs a Feature Category

**Reasoning Model**:
- Check the provider documentation for "reasoning", "thinking", or "o1-style" mentions
- The model exposes internal reasoning traces
- Examples: o1, o3, GPT-5, Claude Opus 4.5+, Gemini 3+

**Extended Thinking**:
- Check whether the model is Claude Sonnet 4.5+ or Claude Haiku 4.5
- The provider documents extended thinking capabilities

**Prompt Caching**:
- Check the provider documentation for prompt caching support
- The Anthropic Claude 3.5+ and 4+ series support this

**Vision Support**:
- Check the provider documentation (don't rely solely on LiteLLM)
- If LiteLLM reports vision but the provider docs say text-only, add `disable_vision: True`

**Stop Words**:
- Most models support stop words
- The o1/o3 series, some Grok models, and DeepSeek R1 do not

**String Serialization**:
- Needed if tool message errors mention "Input should be a valid string"
- DeepSeek, GLM, and some provider-specific models need this

## Reference

- Recent model additions: #2102, #2153, #2207, #2233, #2269
- Common issues: #2147 (hangs), #2137 (parameters), #2110 (vision), #2233 (variants), #2193 (preflight)
- Integration test workflow: `.github/workflows/integration-runner.yml`