Commit cd43bc3

Add AGENTS.md for model addition guidance in .github/run-eval (OpenHands#2284)
Cherry-pick from upstream cc34237
1 parent f335406 commit cd43bc3
1 file changed: `.github/run-eval/AGENTS.md` (+303, -0)
# Adding Models to resolve_model_config.py

## Overview

This file (`resolve_model_config.py`) defines the models available for evaluation. Models must be added here before they can be used in integration tests or evaluations.

## Files to Modify

1. **Always required**:
   - `.github/run-eval/resolve_model_config.py` - Add model configuration
   - `tests/github_workflows/test_resolve_model_config.py` - Add test

2. **Usually required** (if the model has special characteristics):
   - `openhands-sdk/openhands/sdk/llm/utils/model_features.py` - Add to feature categories

3. **Sometimes required**:
   - `openhands-sdk/openhands/sdk/llm/utils/model_prompt_spec.py` - GPT models only (variant detection)
   - `openhands-sdk/openhands/sdk/llm/utils/verified_models.py` - Production-ready models
## Step 1: Add to resolve_model_config.py

Add an entry to the `MODELS` dictionary:

```python
"model-id": {
    "id": "model-id",  # Must match the dictionary key
    "display_name": "Human Readable Name",
    "llm_config": {
        "model": "litellm_proxy/provider/model-name",
        "temperature": 0.0,  # See temperature guide below
    },
},
```
### Temperature Configuration

| Value | When to Use | Provider Requirements |
|-------|-------------|-----------------------|
| `0.0` | Standard deterministic models | Most providers |
| `1.0` | Reasoning models | Kimi K2, MiniMax M2.5 |
| `None` | Use provider default | When unsure |
### Special Parameters

Add these only if needed:

- **`disable_vision: True`** - The model doesn't support vision despite LiteLLM reporting that it does (GLM-4.7, GLM-5)
- **`reasoning_effort: "high"`** - For OpenAI reasoning models that support this parameter
- **`max_tokens: <value>`** - To prevent hangs or control output length
- **`top_p: <value>`** - Nucleus sampling (cannot be combined with `temperature` for Claude models)
- **`litellm_extra_body: {...}`** - Provider-specific parameters (e.g., `{"enable_thinking": True}`)
### Critical Rules

1. The model ID must match its dictionary key.
2. The model path must start with `litellm_proxy/`.
3. **Claude models**: Cannot use both `temperature` and `top_p` - choose one or omit both.
4. SDK-only parameters like `disable_vision` must be listed in the `SDK_ONLY_PARAMS` constant (they are filtered out before the config is sent to LiteLLM).
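Rule 4 can be sketched as a simple split of the config dictionary. The constant name `SDK_ONLY_PARAMS` comes from this guide, but the helper `split_llm_config` below is a hypothetical illustration, not the repository's actual implementation:

```python
# Illustrative sketch: separating SDK-only parameters from the LiteLLM-bound
# config. split_llm_config is a hypothetical helper, not the real code.
SDK_ONLY_PARAMS = {"disable_vision"}  # parameters LiteLLM must never see

def split_llm_config(llm_config: dict) -> tuple[dict, dict]:
    """Return (litellm_params, sdk_params) from a model's llm_config."""
    litellm_params = {k: v for k, v in llm_config.items() if k not in SDK_ONLY_PARAMS}
    sdk_params = {k: v for k, v in llm_config.items() if k in SDK_ONLY_PARAMS}
    return litellm_params, sdk_params

litellm_params, sdk_params = split_llm_config(
    {"model": "litellm_proxy/provider/model-name", "temperature": 0.0, "disable_vision": True}
)
```

If a parameter is missing from `SDK_ONLY_PARAMS`, it leaks through to LiteLLM (see "SDK-Only Parameters Sent to LiteLLM" under Common Issues).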
## Step 2: Update model_features.py (if applicable)

Check the provider documentation to determine which feature categories apply:

### REASONING_EFFORT_MODELS

Models that support the `reasoning_effort` parameter:

- OpenAI: o1, o3, o4, GPT-5 series
- Anthropic: Claude Opus 4.5+, Claude Sonnet 4.6
- Google: Gemini 2.5+, Gemini 3.x series
- AWS: Nova 2 Lite

```python
REASONING_EFFORT_MODELS: list[str] = [
    "your-model-identifier",  # Add here
]
```

**Effect**: Automatically strips the `temperature` and `top_p` parameters to avoid API conflicts.
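That stripping effect amounts to dropping two keys when the feature applies. A minimal sketch, where `apply_reasoning_effort_rules` is a hypothetical name rather than the SDK's actual function:

```python
# Sketch of the documented effect: when a model supports reasoning_effort,
# temperature and top_p are dropped before the request is sent.
# apply_reasoning_effort_rules is a hypothetical helper, not SDK code.
def apply_reasoning_effort_rules(params: dict, supports_reasoning_effort: bool) -> dict:
    if not supports_reasoning_effort:
        return dict(params)
    return {k: v for k, v in params.items() if k not in ("temperature", "top_p")}

cleaned = apply_reasoning_effort_rules(
    {"temperature": 0.0, "top_p": 0.9, "reasoning_effort": "high"},
    supports_reasoning_effort=True,
)
# cleaned -> {"reasoning_effort": "high"}
```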
### EXTENDED_THINKING_MODELS

Models with extended thinking capabilities:

- Anthropic: Claude Sonnet 4.5+, Claude Haiku 4.5

```python
EXTENDED_THINKING_MODELS: list[str] = [
    "your-model-identifier",  # Add here
]
```

**Effect**: Automatically strips the `temperature` and `top_p` parameters.

### PROMPT_CACHE_MODELS

Models supporting prompt caching:

- Anthropic: Claude 3.5+, Claude 4+ series

```python
PROMPT_CACHE_MODELS: list[str] = [
    "your-model-identifier",  # Add here
]
```
### SUPPORTS_STOP_WORDS_FALSE_MODELS

Models that **do not** support stop words:

- OpenAI: o1, o3 series
- xAI: Grok-4, Grok-code-fast-1
- DeepSeek: R1 family

```python
SUPPORTS_STOP_WORDS_FALSE_MODELS: list[str] = [
    "your-model-identifier",  # Add here
]
```

### FORCE_STRING_SERIALIZER_MODELS

Models requiring string format for tool messages (rather than structured content):

- DeepSeek models
- GLM models
- Groq: Kimi K2-Instruct
- OpenRouter: MiniMax

Entries use pattern matching:

```python
FORCE_STRING_SERIALIZER_MODELS: list[str] = [
    "deepseek",  # Matches any model with "deepseek" in its name
    "groq/kimi-k2-instruct",  # Provider-prefixed
]
```
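The pattern matching above is substring-based: an entry matches any model whose name contains it. The helper `model_matches` below is a hypothetical sketch of that behavior; the real logic lives in `model_features.py`:

```python
# Illustrative sketch of substring-based feature matching.
# model_matches is a hypothetical helper, not the SDK's actual function.
FORCE_STRING_SERIALIZER_MODELS = ["deepseek", "groq/kimi-k2-instruct"]

def model_matches(model: str, patterns: list[str]) -> bool:
    """True if any pattern occurs as a substring of the (lowercased) model name."""
    name = model.lower()
    return any(pattern in name for pattern in patterns)

hit = model_matches("litellm_proxy/deepseek/deepseek-chat", FORCE_STRING_SERIALIZER_MODELS)
# hit -> True, because "deepseek" appears in the model path
```

Broad patterns like `"deepseek"` therefore catch every variant from that family, while provider-prefixed entries like `"groq/kimi-k2-instruct"` stay narrow.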
### Other Categories

- **PROMPT_CACHE_RETENTION_MODELS**: GPT-5 family, GPT-4.1
- **RESPONSES_API_MODELS**: GPT-5 family, codex-mini-latest
- **SEND_REASONING_CONTENT_MODELS**: Kimi K2 Thinking/K2.5, MiniMax-M2, DeepSeek Reasoner

See `model_features.py` for complete lists and additional documentation.
## Step 3: Add Test

**File**: `tests/github_workflows/test_resolve_model_config.py`

**Important**: Python function names cannot contain hyphens. Convert the hyphens in the model ID to underscores.

```python
def test_claude_sonnet_46_config():  # Note: hyphens -> underscores
    """Test that claude-sonnet-4-6 has the correct configuration."""
    model = MODELS["claude-sonnet-4-6"]  # Dictionary key keeps hyphens

    assert model["id"] == "claude-sonnet-4-6"
    assert model["display_name"] == "Claude Sonnet 4.6"
    assert model["llm_config"]["model"] == "litellm_proxy/anthropic/claude-sonnet-4-6"
    assert model["llm_config"]["temperature"] == 0.0
```
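Beyond per-model tests, the Critical Rules from Step 1 can be enforced across every entry at once. This is a hedged sketch: the `MODELS` stub below stands in for the real import (which depends on the repository's test setup), and `test_model_invariants` is an illustrative addition, not an existing test:

```python
# Sketch: checking Step 1's Critical Rules for every model entry.
# MODELS is stubbed here for illustration; the real test would import it
# from resolve_model_config.py.
MODELS = {
    "claude-sonnet-4-6": {
        "id": "claude-sonnet-4-6",
        "display_name": "Claude Sonnet 4.6",
        "llm_config": {
            "model": "litellm_proxy/anthropic/claude-sonnet-4-6",
            "temperature": 0.0,
        },
    },
}

def test_model_invariants():
    for key, model in MODELS.items():
        cfg = model["llm_config"]
        # Rule 1: the model ID must match the dictionary key
        assert model["id"] == key
        # Rule 2: the model path must start with litellm_proxy/
        assert cfg["model"].startswith("litellm_proxy/")
        # Rule 3: Claude models may not set both temperature and top_p
        if "claude" in cfg["model"]:
            assert not ("temperature" in cfg and "top_p" in cfg)
```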
## Step 4: Update GPT Variant Detection (GPT models only)

**File**: `openhands-sdk/openhands/sdk/llm/utils/model_prompt_spec.py`

Required only if this is a GPT model that needs a specific prompt template.

**Order matters**: More specific patterns must come before general patterns.

```python
_MODEL_VARIANT_PATTERNS: dict[str, tuple[tuple[str, tuple[str, ...]], ...]] = {
    "openai_gpt": (
        (
            "gpt-5-codex",  # Specific variant first
            ("gpt-5-codex", "gpt-5.1-codex", "gpt-5.2-codex", "gpt-5.3-codex"),
        ),
        ("gpt-5", ("gpt-5", "gpt-5.1", "gpt-5.2")),  # General variant last
    ),
}
```
## Step 5: Run Tests Locally

```bash
# Pre-commit checks
pre-commit run --all-files

# Unit tests
pytest tests/github_workflows/test_resolve_model_config.py::test_your_model_config -v

# Manual verification
cd .github/run-eval
MODEL_IDS="your-model-id" GITHUB_OUTPUT=/tmp/output.txt python resolve_model_config.py
```
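The manual verification step works because GitHub Actions collects step outputs from `name=value` lines appended to the file named by `GITHUB_OUTPUT`. A minimal sketch of what writing such an output looks like; the key name `model_config` is an assumption for illustration, not necessarily what `resolve_model_config.py` actually emits:

```python
import json
import os
import tempfile

# Sketch: appending a step output the way GitHub Actions expects — one
# "name=value" line appended to the file named by GITHUB_OUTPUT.
# The output key "model_config" is illustrative, not the real script's key.
output_path = os.environ.get("GITHUB_OUTPUT") or os.path.join(
    tempfile.gettempdir(), "github_output.txt"
)
config = {"id": "your-model-id", "llm_config": {"model": "litellm_proxy/provider/model-name"}}
with open(output_path, "a") as f:
    f.write(f"model_config={json.dumps(config)}\n")
```

Inspecting `/tmp/output.txt` after the manual run shows which outputs the workflow would receive.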
## Step 6: Run Integration Tests (Required Before PR)

**Mandatory**: Integration tests must pass before creating a PR.

### Via GitHub Actions

1. Push your branch: `git push origin your-branch-name`
2. Navigate to: https://github.com/OpenHands/software-agent-sdk/actions/workflows/integration-runner.yml
3. Click "Run workflow"
4. Configure:
   - **Branch**: Select your branch
   - **model_ids**: `your-model-id`
   - **Reason**: "Testing model-id"
5. Wait for completion
6. **Save the run URL** - it is required for the PR description

### Expected Results

- Success rate: 100% (or 87.5% if the vision test is skipped)
- Duration: 5-10 minutes per model
- Tests: 8 total (basic commands, file ops, code editing, reasoning, errors, tools, context, vision)
## Step 7: Create PR

### Required in PR Description

```markdown
## Integration Test Results
✅ Integration tests passed: [PASTE GITHUB ACTIONS RUN URL]

[Summary table showing test results]

## Configuration
- Model ID: model-id
- Provider: Provider Name
- Temperature: [value] - [reasoning for choice]
- Feature categories: [list categories added to model_features.py]

Fixes #[issue-number]
```
## Common Issues

### Integration Tests Hang (6-8+ hours)

**Causes**:
- Missing `max_tokens` parameter
- Claude models with both `temperature` and `top_p` set
- Model not in REASONING_EFFORT_MODELS or EXTENDED_THINKING_MODELS

**Solutions**: Add `max_tokens`, remove the conflicting parameter, or add the model to the appropriate feature category.

**Reference**: #2147

### Preflight Check: "Cannot specify both temperature and top_p"

**Cause**: Claude models receiving both parameters

**Solutions**:
- Remove `top_p` from `llm_config` if `temperature` is set
- Add the model to REASONING_EFFORT_MODELS or EXTENDED_THINKING_MODELS (auto-strips both)

**Reference**: #2137, #2193

### Vision Tests Fail

**Cause**: LiteLLM reports vision support, but the model doesn't actually support it

**Solution**: Add `"disable_vision": True` to `llm_config`

**Reference**: #2110 (GLM-5), #1898 (GLM-4.7)

### Wrong Prompt Template (GPT models)

**Cause**: The model variant is not detected correctly and falls through to the wrong template

**Solution**: Add explicit entries to `model_prompt_spec.py` with the correct pattern order

**Reference**: #2233 (GPT-5.2-codex, GPT-5.3-codex)

### SDK-Only Parameters Sent to LiteLLM

**Cause**: A parameter like `disable_vision` is not in the `SDK_ONLY_PARAMS` set

**Solution**: Add it to `SDK_ONLY_PARAMS` in `resolve_model_config.py`

**Reference**: #2194
## Model Feature Detection Criteria

### How to Determine if a Model Needs a Feature Category

**Reasoning Model**:
- Check provider documentation for "reasoning", "thinking", or "o1-style" mentions
- The model exposes internal reasoning traces
- Examples: o1, o3, GPT-5, Claude Opus 4.5+, Gemini 3+

**Extended Thinking**:
- Check whether the model is Claude Sonnet 4.5+ or Claude Haiku 4.5
- The provider documents extended thinking capabilities

**Prompt Caching**:
- Check provider documentation for prompt caching support
- The Anthropic Claude 3.5+ and 4+ series support this

**Vision Support**:
- Check provider documentation (don't rely solely on LiteLLM)
- If LiteLLM reports vision but the provider docs say text-only, add `disable_vision: True`

**Stop Words**:
- Most models support stop words
- The o1/o3 series, some Grok models, and DeepSeek R1 do not

**String Serialization**:
- Needed if tool message errors mention "Input should be a valid string"
- DeepSeek, GLM, and some provider-specific models need this
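The string-serialization criterion boils down to how tool results are sent: most models accept a list of structured content parts, while models in FORCE_STRING_SERIALIZER_MODELS need one plain string. A minimal sketch of the difference; `serialize_tool_content` is a hypothetical helper, not the SDK's actual serializer:

```python
# Sketch of string serialization for tool messages. serialize_tool_content
# is a hypothetical helper, not SDK code.
def serialize_tool_content(parts: list[dict]) -> str:
    """Flatten structured content parts into the single string some models require."""
    return "\n".join(p.get("text", "") for p in parts if p.get("type") == "text")

structured = [
    {"type": "text", "text": "exit code 0"},
    {"type": "text", "text": "ls: README.md"},
]
# Most models accept `structured` as-is; DeepSeek/GLM-style models need:
flat = serialize_tool_content(structured)
# flat -> "exit code 0\nls: README.md"
```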
## Reference

- Recent model additions: #2102, #2153, #2207, #2233, #2269
- Common issues: #2147 (hangs), #2137 (parameters), #2110 (vision), #2233 (variants), #2193 (preflight)
- Integration test workflow: `.github/workflows/integration-runner.yml`
