
Commit 563e208

Split AGENTS.md: Move model addition guide to ADDINGMODEL.md
Problem:
- AGENTS.md contained 373 lines focused entirely on adding models
- All agents working in .github/run-eval/ got this context, even for other tasks
- Not scalable as we add more task-specific guides
- Too much noise for non-model-addition work

Solution:
1. Created ADDINGMODEL.md with all model addition content (373 lines)
   - All critical rules, steps, examples, troubleshooting
   - Loaded only when specifically needed for model additions
2. Replaced AGENTS.md with a brief overview (54 lines)
   - Describes directory purpose and key files
   - Lists common tasks with links to specific guides
   - Points to ADDINGMODEL.md for model additions
   - Provides quick debugging guidance

Benefits:
- Reduced noise: Agents only get relevant context for their task
- Better organization: Overview → specific guides pattern
- Scalable: Can add DEBUGGING.md, TESTING.md, etc. in future
- Focused: Each guide serves one clear purpose
- Discoverable: AGENTS.md acts as index/directory guide

File changes:
- .github/run-eval/ADDINGMODEL.md: +373 lines (new file)
- .github/run-eval/AGENTS.md: -357 lines, +54 lines (replaced)

Fixes #2290

Co-authored-by: openhands <openhands@all-hands.dev>
1 parent 8dc35fd commit 563e208

File tree

2 files changed: +411, -357 lines changed


.github/run-eval/ADDINGMODEL.md

Lines changed: 373 additions & 0 deletions
# Adding Models to resolve_model_config.py

## Overview

This file (`resolve_model_config.py`) defines models available for evaluation. Models must be added here before they can be used in integration tests or evaluations.

## Critical Rules

**ONLY ADD NEW CONTENT - DO NOT MODIFY EXISTING CODE**

### What NOT to Do

1. **Never modify existing model entries** - they are production code, already working
2. **Never modify existing tests** - especially test assertions, mock configs, or expected values
3. **Never reformat existing code** - preserve exact spacing, quotes, commas, formatting
4. **Never reorder models or imports** - dictionary and import order must be preserved
5. **Never "fix" existing code** - if it's in the file and tests pass, it works
6. **Never change test assertions** - even if they "look wrong" to you
7. **Never replace real model tests with mocked tests** - weakens validation
8. **Never fix import names** - if `test_model` exists, don't change it to `check_model`

### What These Rules Prevent

**Example violations** (all found in real PRs):
- Changing `assert result[0]["id"] == "claude-sonnet-4-5-20250929"` to `"gpt-4"` ❌
- Replacing real model config tests with mocked/custom model tests ❌
- "Fixing" `from resolve_model_config import test_model` to `check_model` ❌
- Adding "Fixed incorrect assertions" without explaining what was incorrect ❌
- Claiming to "fix test issues" when tests were already passing ❌

### What TO Do

**When adding a model**:
- Add ONE new entry to the MODELS dictionary
- Add ONE new test function (follow existing pattern exactly)
- Add to feature lists in model_features.py ONLY if needed for your model
- Do not touch any other files, tests, imports, or configurations
- If you think something is broken, it's probably not - leave it alone

## Files to Modify

1. **Always required**:
   - `.github/run-eval/resolve_model_config.py` - Add model configuration
   - `tests/github_workflows/test_resolve_model_config.py` - Add test

2. **Usually required** (if model has special characteristics):
   - `openhands-sdk/openhands/sdk/llm/utils/model_features.py` - Add to feature categories

3. **Sometimes required**:
   - `openhands-sdk/openhands/sdk/llm/utils/model_prompt_spec.py` - GPT models only (variant detection)
   - `openhands-sdk/openhands/sdk/llm/utils/verified_models.py` - Production-ready models

## Step 1: Add to resolve_model_config.py

Add entry to the `MODELS` dictionary:

```python
"model-id": {
    "id": "model-id",  # Must match dictionary key
    "display_name": "Human Readable Name",
    "llm_config": {
        "model": "litellm_proxy/provider/model-name",
        "temperature": 0.0,  # See temperature guide below
    },
},
```

### Temperature Configuration

| Value | When to Use | Provider Requirements |
|-------|-------------|-----------------------|
| `0.0` | Standard deterministic models | Most providers |
| `1.0` | Reasoning models | Kimi K2, MiniMax M2.5 |
| `None` | Use provider default | When unsure |

### Special Parameters

Add only if needed:

- **`disable_vision: True`** - Model doesn't support vision despite LiteLLM reporting it does (GLM-4.7, GLM-5)
- **`reasoning_effort: "high"`** - For OpenAI reasoning models that support this parameter
- **`max_tokens: <value>`** - To prevent hangs or control output length
- **`top_p: <value>`** - Nucleus sampling (cannot be used with `temperature` for Claude models)
- **`litellm_extra_body: {...}`** - Provider-specific parameters (e.g., `{"enable_thinking": True}`)
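For illustration, a hypothetical entry combining several of these parameters might look like this (the model name and every value below are placeholders, not a real configuration):

```python
# Hypothetical MODELS entry showing the special parameters in context.
# The model name and all values are illustrative placeholders.
EXAMPLE_ENTRY = {
    "example-reasoning-model": {
        "id": "example-reasoning-model",  # must match the dictionary key
        "display_name": "Example Reasoning Model",
        "llm_config": {
            "model": "litellm_proxy/provider/example-reasoning-model",
            "temperature": 1.0,  # reasoning models often require 1.0
            "max_tokens": 8192,  # guards against hangs
            "disable_vision": True,  # SDK-only flag, filtered before LiteLLM
            "litellm_extra_body": {"enable_thinking": True},  # provider-specific
        },
    },
}

# Sanity checks mirroring the rules in this guide
for key, cfg in EXAMPLE_ENTRY.items():
    assert cfg["id"] == key
    assert cfg["llm_config"]["model"].startswith("litellm_proxy/")
```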

### Configuration Rules

1. Model ID must match the dictionary key
2. Model path must start with `litellm_proxy/`
3. **Claude models**: Cannot use both `temperature` and `top_p` - choose one or omit both
4. Parameters like `disable_vision` must be in the `SDK_ONLY_PARAMS` constant (they're filtered out before sending to LiteLLM)
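The filtering mentioned in rule 4 can be sketched roughly as follows (`SDK_ONLY_PARAMS` is the real constant name from the source; the function here is a simplified illustration, not the actual implementation):

```python
# Simplified sketch of SDK-only parameter filtering (illustrative,
# not the actual resolve_model_config.py code).
SDK_ONLY_PARAMS = {"disable_vision"}  # consumed by the SDK, never sent upstream

def split_llm_config(llm_config: dict) -> tuple[dict, dict]:
    """Separate LiteLLM-bound parameters from SDK-only ones."""
    litellm_params = {k: v for k, v in llm_config.items() if k not in SDK_ONLY_PARAMS}
    sdk_params = {k: v for k, v in llm_config.items() if k in SDK_ONLY_PARAMS}
    return litellm_params, sdk_params

litellm_params, sdk_params = split_llm_config(
    {"model": "litellm_proxy/provider/x", "temperature": 0.0, "disable_vision": True}
)
print(sorted(litellm_params))  # ['model', 'temperature']
print(sorted(sdk_params))      # ['disable_vision']
```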

## Step 2: Update model_features.py (if applicable)

Check provider documentation to determine which feature categories apply:

### REASONING_EFFORT_MODELS

Models that support the `reasoning_effort` parameter:
- OpenAI: o1, o3, o4, GPT-5 series
- Anthropic: Claude Opus 4.5+, Claude Sonnet 4.6
- Google: Gemini 2.5+, Gemini 3.x series
- AWS: Nova 2 Lite

```python
REASONING_EFFORT_MODELS: list[str] = [
    "your-model-identifier",  # Add here
]
```

**Effect**: Automatically strips `temperature` and `top_p` parameters to avoid API conflicts.
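A rough sketch of that stripping behavior (illustrative only; the real logic lives in the SDK and may match names differently):

```python
# Illustrative sketch: drop temperature/top_p for reasoning-effort models.
REASONING_EFFORT_MODELS = ["o1", "o3", "gpt-5"]  # abbreviated example list

def strip_sampling_params(model_name: str, llm_config: dict) -> dict:
    """Remove temperature/top_p when the model takes reasoning_effort instead."""
    if any(m in model_name.lower() for m in REASONING_EFFORT_MODELS):
        return {k: v for k, v in llm_config.items()
                if k not in ("temperature", "top_p")}
    return dict(llm_config)

print(strip_sampling_params("litellm_proxy/openai/o3",
                            {"temperature": 0.0, "max_tokens": 100}))
# {'max_tokens': 100}
```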

### EXTENDED_THINKING_MODELS

Models with extended thinking capabilities:
- Anthropic: Claude Sonnet 4.5+, Claude Haiku 4.5

```python
EXTENDED_THINKING_MODELS: list[str] = [
    "your-model-identifier",  # Add here
]
```

**Effect**: Automatically strips `temperature` and `top_p` parameters.

### PROMPT_CACHE_MODELS

Models supporting prompt caching:
- Anthropic: Claude 3.5+, Claude 4+ series

```python
PROMPT_CACHE_MODELS: list[str] = [
    "your-model-identifier",  # Add here
]
```

### SUPPORTS_STOP_WORDS_FALSE_MODELS

Models that **do not** support stop words:
- OpenAI: o1, o3 series
- xAI: Grok-4, Grok-code-fast-1
- DeepSeek: R1 family

```python
SUPPORTS_STOP_WORDS_FALSE_MODELS: list[str] = [
    "your-model-identifier",  # Add here
]
```

### FORCE_STRING_SERIALIZER_MODELS

Models requiring string format for tool messages (not structured content):
- DeepSeek models
- GLM models
- Groq: Kimi K2-Instruct
- OpenRouter: MiniMax

Use pattern matching:

```python
FORCE_STRING_SERIALIZER_MODELS: list[str] = [
    "deepseek",  # Matches any model with "deepseek" in name
    "groq/kimi-k2-instruct",  # Provider-prefixed
]
```
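The matching here is presumably a case-insensitive substring check against the model name; a minimal sketch of the idea (not the SDK's actual matching code):

```python
# Minimal sketch of substring-based pattern matching (illustrative).
FORCE_STRING_SERIALIZER_MODELS = ["deepseek", "groq/kimi-k2-instruct"]

def needs_string_serializer(model_name: str) -> bool:
    """True if any listed pattern appears anywhere in the model name."""
    name = model_name.lower()
    return any(pattern in name for pattern in FORCE_STRING_SERIALIZER_MODELS)

print(needs_string_serializer("litellm_proxy/deepseek/deepseek-chat"))  # True
print(needs_string_serializer("litellm_proxy/openai/gpt-4o"))  # False
```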

### Other Categories

- **PROMPT_CACHE_RETENTION_MODELS**: GPT-5 family, GPT-4.1
- **RESPONSES_API_MODELS**: GPT-5 family, codex-mini-latest
- **SEND_REASONING_CONTENT_MODELS**: Kimi K2 Thinking/K2.5, MiniMax-M2, DeepSeek Reasoner

See `model_features.py` for complete lists and additional documentation.

## Step 3: Add Test

**File**: `tests/github_workflows/test_resolve_model_config.py`

**Important**:
- Python function names cannot contain hyphens. Convert model ID hyphens to underscores.
- **Do not modify any existing test functions** - only add your new one at the end of the file
- **Do not change existing imports** - use what's already there
- **Do not fix "incorrect" assertions** in other tests - they are correct

**Test template** (copy and modify for your model):

```python
def test_your_model_id_config():  # Replace hyphens with underscores in function name
    """Test that your-model-id has correct configuration."""
    model = MODELS["your-model-id"]  # Dictionary key keeps hyphens

    assert model["id"] == "your-model-id"
    assert model["display_name"] == "Your Model Display Name"
    assert model["llm_config"]["model"] == "litellm_proxy/provider/model-name"
    # Only add assertions for parameters YOU added in resolve_model_config.py
    # assert model["llm_config"]["temperature"] == 0.0
    # assert model["llm_config"]["disable_vision"] is True
```
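As a purely local sanity check (not something to commit - remember, only add your one test), you could also verify the Step 1 invariants across the whole dictionary; a sketch using a stand-in `MODELS`:

```python
# Local sanity check (do not commit): verify Step 1 invariants over a MODELS
# dictionary. A stand-in dictionary is used here for illustration.
MODELS = {
    "your-model-id": {
        "id": "your-model-id",
        "display_name": "Your Model Display Name",
        "llm_config": {"model": "litellm_proxy/provider/model-name",
                       "temperature": 0.0},
    },
}

for key, cfg in MODELS.items():
    llm = cfg["llm_config"]
    assert cfg["id"] == key, f"id mismatch for {key}"
    assert llm["model"].startswith("litellm_proxy/"), f"bad model path for {key}"
    # Claude rule: never set both temperature and top_p
    if "claude" in llm["model"]:
        assert not ("temperature" in llm and "top_p" in llm), key
print("all invariants hold")
```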

**What NOT to do in tests**:
- Don't change assertions in other test functions (even if model names "look wrong")
- Don't replace real model tests with mocked tests
- Don't change `test_model` to `check_model` in imports
- Don't modify mock_models dictionaries in other tests
- Don't add "fixes" to existing tests - they work as-is

## Step 4: Update GPT Variant Detection (GPT models only)

**File**: `openhands-sdk/openhands/sdk/llm/utils/model_prompt_spec.py`

Required only if this is a GPT model that needs a specific prompt template.

**Order matters**: More specific patterns must come before general patterns.

```python
_MODEL_VARIANT_PATTERNS: dict[str, tuple[tuple[str, tuple[str, ...]], ...]] = {
    "openai_gpt": (
        (
            "gpt-5-codex",  # Specific variant first
            ("gpt-5-codex", "gpt-5.1-codex", "gpt-5.2-codex", "gpt-5.3-codex"),
        ),
        ("gpt-5", ("gpt-5", "gpt-5.1", "gpt-5.2")),  # General variant last
    ),
}
```

## Step 5: Run Tests Locally

```bash
# Pre-commit checks
pre-commit run --all-files

# Unit tests
pytest tests/github_workflows/test_resolve_model_config.py::test_your_model_config -v

# Manual verification
cd .github/run-eval
MODEL_IDS="your-model-id" GITHUB_OUTPUT=/tmp/output.txt python resolve_model_config.py
```
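For the manual verification step, the file named by `GITHUB_OUTPUT` can be read back as simple `key=value` lines, per the standard GitHub Actions output format (a sketch; the actual keys written by resolve_model_config.py may differ, and multi-line values use a heredoc format not handled here):

```python
import os
import tempfile

# Sketch: parse a GitHub Actions output file of simple key=value lines.
def parse_github_output(path: str) -> dict[str, str]:
    outputs: dict[str, str] = {}
    with open(path) as fh:
        for line in fh:
            line = line.rstrip("\n")
            if line and "=" in line:
                key, _, value = line.partition("=")
                outputs[key] = value
    return outputs

# Demonstrate with a made-up sample (the key name is illustrative):
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as fh:
    fh.write('models=[{"id": "your-model-id"}]\n')
    sample = fh.name
print(parse_github_output(sample))
os.unlink(sample)
```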

## Step 6: Run Integration Tests (Required Before PR)

**Mandatory**: Integration tests must pass before creating the PR.

### Via GitHub Actions

1. Push your branch: `git push origin your-branch-name`
2. Navigate to: https://github.com/OpenHands/software-agent-sdk/actions/workflows/integration-runner.yml
3. Click "Run workflow"
4. Configure:
   - **Branch**: Select your branch
   - **model_ids**: `your-model-id`
   - **Reason**: "Testing model-id"
5. Wait for completion
6. **Save the run URL** - it is required for the PR description

### Expected Results
252+
253+
- Success rate: 100% (or 87.5% if vision test skipped)
254+
- Duration: 5-10 minutes per model
255+
- Tests: 8 total (basic commands, file ops, code editing, reasoning, errors, tools, context, vision)
256+
257+
## Step 7: Create PR
258+
259+
### Required in PR Description
260+
261+
```markdown
262+
## Summary
263+
Adds the `model-id` model to resolve_model_config.py.
264+
265+
## Changes
266+
- Added model-id to MODELS dictionary
267+
- Added test_model_id_config() test function
268+
- [Only if applicable] Added to [feature category] in model_features.py
269+
270+
## Configuration
271+
- Model ID: model-id
272+
- Provider: Provider Name
273+
- Temperature: [value] - [reasoning for choice]
274+
- [List any special parameters and why needed]
275+
276+
## Integration Test Results
277+
✅ Integration tests passed: [PASTE GITHUB ACTIONS RUN URL]
278+
279+
[Summary table showing test results]
280+
281+
Fixes #[issue-number]
282+
```
283+
284+
### What NOT to Include in PR Description

**Do not claim to have "fixed" things unless they were actually broken**:
- ❌ "Fixed test_model import issue" (if tests were passing, there was no issue)
- ❌ "Fixed incorrect assertions in existing tests" (they were correct)
- ❌ "Improved test coverage" (unless you actually added new test cases)
- ❌ "Cleaned up code" (you shouldn't be cleaning up anything)
- ❌ "Updated test approach" (you shouldn't be changing the testing approach)

**Only describe what you actually added**:
- ✅ "Added gpt-5.3-codex model configuration"
- ✅ "Added test for gpt-5.3-codex"
- ✅ "Added gpt-5.3-codex to REASONING_EFFORT_MODELS"

## Common Issues

### Integration Tests Hang (6-8+ hours)

**Causes**:
- Missing `max_tokens` parameter
- Claude models with both `temperature` and `top_p` set
- Model not in REASONING_EFFORT_MODELS or EXTENDED_THINKING_MODELS

**Solutions**: Add `max_tokens`, remove parameter conflicts, add to the appropriate feature category.

**Reference**: #2147

### Preflight Check: "Cannot specify both temperature and top_p"

**Cause**: Claude models receiving both parameters

**Solutions**:
- Remove `top_p` from llm_config if `temperature` is set
- Add the model to REASONING_EFFORT_MODELS or EXTENDED_THINKING_MODELS (auto-strips both)

**Reference**: #2137, #2193

### Vision Tests Fail

**Cause**: LiteLLM reports vision support, but the model doesn't actually support it

**Solution**: Add `"disable_vision": True` to llm_config

**Reference**: #2110 (GLM-5), #1898 (GLM-4.7)

### Wrong Prompt Template (GPT models)

**Cause**: Model variant not detected correctly, so it falls through to the wrong template

**Solution**: Add explicit entries to `model_prompt_spec.py` with the correct pattern order

**Reference**: #2233 (GPT-5.2-codex, GPT-5.3-codex)

### SDK-Only Parameters Sent to LiteLLM

**Cause**: A parameter like `disable_vision` is not in the `SDK_ONLY_PARAMS` set

**Solution**: Add it to `SDK_ONLY_PARAMS` in `resolve_model_config.py`

**Reference**: #2194

## Model Feature Detection Criteria

### How to Determine if Model Needs Feature Category

**Reasoning Model**:
- Check provider documentation for "reasoning", "thinking", or "o1-style" mentions
- Model exposes internal reasoning traces
- Examples: o1, o3, GPT-5, Claude Opus 4.5+, Gemini 3+

**Extended Thinking**:
- Check if model is Claude Sonnet 4.5+ or Claude Haiku 4.5
- Provider documents extended thinking capabilities

**Prompt Caching**:
- Check provider documentation for prompt caching support
- Anthropic Claude 3.5+ and 4+ series support this

**Vision Support**:
- Check provider documentation (don't rely solely on LiteLLM)
- If LiteLLM reports vision but provider docs say text-only, add `disable_vision: True`

**Stop Words**:
- Most models support stop words
- o1/o3 series, some Grok models, DeepSeek R1 do not

**String Serialization**:
- If tool message errors mention "Input should be a valid string"
- DeepSeek, GLM, some provider-specific models need this

## Reference

- Recent model additions: #2102, #2153, #2207, #2233, #2269
- Common issues: #2147 (hangs), #2137 (parameters), #2110 (vision), #2233 (variants), #2193 (preflight)
- Integration test workflow: `.github/workflows/integration-runner.yml`
