Commit cd43bc3

Add AGENTS.md for model addition guidance in .github/run-eval (OpenHands#2284)
Cherry-pick from upstream cc34237
1 parent f335406 commit cd43bc3
1 file changed: `.github/run-eval/AGENTS.md` (+303, -0)
# Adding Models to resolve_model_config.py

## Overview

This file (`resolve_model_config.py`) defines the models available for evaluation. Models must be added here before they can be used in integration tests or evaluations.

## Files to Modify

1. **Always required**:
   - `.github/run-eval/resolve_model_config.py` - Add model configuration
   - `tests/github_workflows/test_resolve_model_config.py` - Add test

2. **Usually required** (if the model has special characteristics):
   - `openhands-sdk/openhands/sdk/llm/utils/model_features.py` - Add to feature categories

3. **Sometimes required**:
   - `openhands-sdk/openhands/sdk/llm/utils/model_prompt_spec.py` - GPT models only (variant detection)
   - `openhands-sdk/openhands/sdk/llm/utils/verified_models.py` - Production-ready models
## Step 1: Add to resolve_model_config.py

Add an entry to the `MODELS` dictionary:

```python
"model-id": {
    "id": "model-id",  # Must match the dictionary key
    "display_name": "Human Readable Name",
    "llm_config": {
        "model": "litellm_proxy/provider/model-name",
        "temperature": 0.0,  # See temperature guide below
    },
},
```
### Temperature Configuration

| Value | When to Use | Provider Requirements |
|-------|-------------|-----------------------|
| `0.0` | Standard deterministic models | Most providers |
| `1.0` | Reasoning models | Kimi K2, MiniMax M2.5 |
| `None` | Use provider default | When unsure |
### Special Parameters

Add these only if needed:

- **`disable_vision: True`** - The model doesn't support vision despite LiteLLM reporting that it does (GLM-4.7, GLM-5)
- **`reasoning_effort: "high"`** - For OpenAI reasoning models that support this parameter
- **`max_tokens: <value>`** - To prevent hangs or control output length
- **`top_p: <value>`** - Nucleus sampling (cannot be combined with `temperature` for Claude models)
- **`litellm_extra_body: {...}`** - Provider-specific parameters (e.g., `{"enable_thinking": True}`)
### Critical Rules

1. The model ID must match its dictionary key.
2. The model path must start with `litellm_proxy/`.
3. **Claude models**: Cannot use both `temperature` and `top_p` - choose one or omit both.
4. SDK-only parameters like `disable_vision` must be listed in the `SDK_ONLY_PARAMS` constant (they are filtered out before the config is sent to LiteLLM).
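Rule 4 can be sketched as a simple split of the config dictionary. The constant name `SDK_ONLY_PARAMS` comes from this guide, but the helper `split_llm_config` below is a hypothetical illustration, not the repository's actual implementation:

```python
# Illustrative sketch: separating SDK-only parameters from the LiteLLM-bound
# config. split_llm_config is a hypothetical helper, not the real code.
SDK_ONLY_PARAMS = {"disable_vision"}  # parameters LiteLLM must never see

def split_llm_config(llm_config: dict) -> tuple[dict, dict]:
    """Return (litellm_params, sdk_params) from a model's llm_config."""
    litellm_params = {k: v for k, v in llm_config.items() if k not in SDK_ONLY_PARAMS}
    sdk_params = {k: v for k, v in llm_config.items() if k in SDK_ONLY_PARAMS}
    return litellm_params, sdk_params

litellm_params, sdk_params = split_llm_config(
    {"model": "litellm_proxy/provider/model-name", "temperature": 0.0, "disable_vision": True}
)
```

If a parameter is missing from `SDK_ONLY_PARAMS`, it leaks through to LiteLLM (see "SDK-Only Parameters Sent to LiteLLM" under Common Issues).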
## Step 2: Update model_features.py (if applicable)

Check the provider documentation to determine which feature categories apply:

### REASONING_EFFORT_MODELS

Models that support the `reasoning_effort` parameter:

- OpenAI: o1, o3, o4, GPT-5 series
- Anthropic: Claude Opus 4.5+, Claude Sonnet 4.6
- Google: Gemini 2.5+, Gemini 3.x series
- AWS: Nova 2 Lite

```python
REASONING_EFFORT_MODELS: list[str] = [
    "your-model-identifier",  # Add here
]
```

**Effect**: Automatically strips the `temperature` and `top_p` parameters to avoid API conflicts.
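That stripping effect amounts to dropping two keys when the feature applies. A minimal sketch, where `apply_reasoning_effort_rules` is a hypothetical name rather than the SDK's actual function:

```python
# Sketch of the documented effect: when a model supports reasoning_effort,
# temperature and top_p are dropped before the request is sent.
# apply_reasoning_effort_rules is a hypothetical helper, not SDK code.
def apply_reasoning_effort_rules(params: dict, supports_reasoning_effort: bool) -> dict:
    if not supports_reasoning_effort:
        return dict(params)
    return {k: v for k, v in params.items() if k not in ("temperature", "top_p")}

cleaned = apply_reasoning_effort_rules(
    {"temperature": 0.0, "top_p": 0.9, "reasoning_effort": "high"},
    supports_reasoning_effort=True,
)
# cleaned -> {"reasoning_effort": "high"}
```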
### EXTENDED_THINKING_MODELS

Models with extended thinking capabilities:

- Anthropic: Claude Sonnet 4.5+, Claude Haiku 4.5

```python
EXTENDED_THINKING_MODELS: list[str] = [
    "your-model-identifier",  # Add here
]
```

**Effect**: Automatically strips the `temperature` and `top_p` parameters.

### PROMPT_CACHE_MODELS

Models supporting prompt caching:

- Anthropic: Claude 3.5+, Claude 4+ series

```python
PROMPT_CACHE_MODELS: list[str] = [
    "your-model-identifier",  # Add here
]
```
### SUPPORTS_STOP_WORDS_FALSE_MODELS

Models that **do not** support stop words:

- OpenAI: o1, o3 series
- xAI: Grok-4, Grok-code-fast-1
- DeepSeek: R1 family

```python
SUPPORTS_STOP_WORDS_FALSE_MODELS: list[str] = [
    "your-model-identifier",  # Add here
]
```

### FORCE_STRING_SERIALIZER_MODELS

Models requiring string format for tool messages (rather than structured content):

- DeepSeek models
- GLM models
- Groq: Kimi K2-Instruct
- OpenRouter: MiniMax

Entries use pattern matching:

```python
FORCE_STRING_SERIALIZER_MODELS: list[str] = [
    "deepseek",  # Matches any model with "deepseek" in its name
    "groq/kimi-k2-instruct",  # Provider-prefixed
]
```
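The pattern matching above is substring-based: an entry matches any model whose name contains it. The helper `model_matches` below is a hypothetical sketch of that behavior; the real logic lives in `model_features.py`:

```python
# Illustrative sketch of substring-based feature matching.
# model_matches is a hypothetical helper, not the SDK's actual function.
FORCE_STRING_SERIALIZER_MODELS = ["deepseek", "groq/kimi-k2-instruct"]

def model_matches(model: str, patterns: list[str]) -> bool:
    """True if any pattern occurs as a substring of the (lowercased) model name."""
    name = model.lower()
    return any(pattern in name for pattern in patterns)

hit = model_matches("litellm_proxy/deepseek/deepseek-chat", FORCE_STRING_SERIALIZER_MODELS)
# hit -> True, because "deepseek" appears in the model path
```

Broad patterns like `"deepseek"` therefore catch every variant from that family, while provider-prefixed entries like `"groq/kimi-k2-instruct"` stay narrow.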
### Other Categories

- **PROMPT_CACHE_RETENTION_MODELS**: GPT-5 family, GPT-4.1
- **RESPONSES_API_MODELS**: GPT-5 family, codex-mini-latest
- **SEND_REASONING_CONTENT_MODELS**: Kimi K2 Thinking/K2.5, MiniMax-M2, DeepSeek Reasoner

See `model_features.py` for complete lists and additional documentation.
## Step 3: Add Test

**File**: `tests/github_workflows/test_resolve_model_config.py`

**Important**: Python function names cannot contain hyphens. Convert the hyphens in the model ID to underscores.

```python
def test_claude_sonnet_46_config():  # Note: hyphens -> underscores
    """Test that claude-sonnet-4-6 has the correct configuration."""
    model = MODELS["claude-sonnet-4-6"]  # Dictionary key keeps hyphens

    assert model["id"] == "claude-sonnet-4-6"
    assert model["display_name"] == "Claude Sonnet 4.6"
    assert model["llm_config"]["model"] == "litellm_proxy/anthropic/claude-sonnet-4-6"
    assert model["llm_config"]["temperature"] == 0.0
```
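Beyond per-model tests, the Critical Rules from Step 1 can be enforced across every entry at once. This is a hedged sketch: the `MODELS` stub below stands in for the real import (which depends on the repository's test setup), and `test_model_invariants` is an illustrative addition, not an existing test:

```python
# Sketch: checking Step 1's Critical Rules for every model entry.
# MODELS is stubbed here for illustration; the real test would import it
# from resolve_model_config.py.
MODELS = {
    "claude-sonnet-4-6": {
        "id": "claude-sonnet-4-6",
        "display_name": "Claude Sonnet 4.6",
        "llm_config": {
            "model": "litellm_proxy/anthropic/claude-sonnet-4-6",
            "temperature": 0.0,
        },
    },
}

def test_model_invariants():
    for key, model in MODELS.items():
        cfg = model["llm_config"]
        # Rule 1: the model ID must match the dictionary key
        assert model["id"] == key
        # Rule 2: the model path must start with litellm_proxy/
        assert cfg["model"].startswith("litellm_proxy/")
        # Rule 3: Claude models may not set both temperature and top_p
        if "claude" in cfg["model"]:
            assert not ("temperature" in cfg and "top_p" in cfg)
```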
## Step 4: Update GPT Variant Detection (GPT models only)

**File**: `openhands-sdk/openhands/sdk/llm/utils/model_prompt_spec.py`

Required only if this is a GPT model that needs a specific prompt template.

**Order matters**: More specific patterns must come before general patterns.

```python
_MODEL_VARIANT_PATTERNS: dict[str, tuple[tuple[str, tuple[str, ...]], ...]] = {
    "openai_gpt": (
        (
            "gpt-5-codex",  # Specific variant first
            ("gpt-5-codex", "gpt-5.1-codex", "gpt-5.2-codex", "gpt-5.3-codex"),
        ),
        ("gpt-5", ("gpt-5", "gpt-5.1", "gpt-5.2")),  # General variant last
    ),
}
```
## Step 5: Run Tests Locally

```bash
# Pre-commit checks
pre-commit run --all-files

# Unit tests
pytest tests/github_workflows/test_resolve_model_config.py::test_your_model_config -v

# Manual verification
cd .github/run-eval
MODEL_IDS="your-model-id" GITHUB_OUTPUT=/tmp/output.txt python resolve_model_config.py
```
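The manual verification step works because GitHub Actions collects step outputs from `name=value` lines appended to the file named by `GITHUB_OUTPUT`. A minimal sketch of what writing such an output looks like; the key name `model_config` is an assumption for illustration, not necessarily what `resolve_model_config.py` actually emits:

```python
import json
import os
import tempfile

# Sketch: appending a step output the way GitHub Actions expects — one
# "name=value" line appended to the file named by GITHUB_OUTPUT.
# The output key "model_config" is illustrative, not the real script's key.
output_path = os.environ.get("GITHUB_OUTPUT") or os.path.join(
    tempfile.gettempdir(), "github_output.txt"
)
config = {"id": "your-model-id", "llm_config": {"model": "litellm_proxy/provider/model-name"}}
with open(output_path, "a") as f:
    f.write(f"model_config={json.dumps(config)}\n")
```

Inspecting `/tmp/output.txt` after the manual run shows which outputs the workflow would receive.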
## Step 6: Run Integration Tests (Required Before PR)

**Mandatory**: Integration tests must pass before creating a PR.

### Via GitHub Actions

1. Push your branch: `git push origin your-branch-name`
2. Navigate to: https://github.com/OpenHands/software-agent-sdk/actions/workflows/integration-runner.yml
3. Click "Run workflow"
4. Configure:
   - **Branch**: Select your branch
   - **model_ids**: `your-model-id`
   - **Reason**: "Testing model-id"
5. Wait for completion
6. **Save the run URL** - it is required for the PR description

### Expected Results

- Success rate: 100% (or 87.5% if the vision test is skipped)
- Duration: 5-10 minutes per model
- Tests: 8 total (basic commands, file ops, code editing, reasoning, errors, tools, context, vision)
## Step 7: Create PR

### Required in PR Description

```markdown
## Integration Test Results
✅ Integration tests passed: [PASTE GITHUB ACTIONS RUN URL]

[Summary table showing test results]

## Configuration
- Model ID: model-id
- Provider: Provider Name
- Temperature: [value] - [reasoning for choice]
- Feature categories: [list categories added to model_features.py]

Fixes #[issue-number]
```
## Common Issues

### Integration Tests Hang (6-8+ hours)

**Causes**:
- Missing `max_tokens` parameter
- Claude models with both `temperature` and `top_p` set
- Model not in REASONING_EFFORT_MODELS or EXTENDED_THINKING_MODELS

**Solutions**: Add `max_tokens`, remove the conflicting parameter, or add the model to the appropriate feature category.

**Reference**: #2147

### Preflight Check: "Cannot specify both temperature and top_p"

**Cause**: Claude models receiving both parameters

**Solutions**:
- Remove `top_p` from `llm_config` if `temperature` is set
- Add the model to REASONING_EFFORT_MODELS or EXTENDED_THINKING_MODELS (auto-strips both)

**Reference**: #2137, #2193

### Vision Tests Fail

**Cause**: LiteLLM reports vision support, but the model doesn't actually support it

**Solution**: Add `"disable_vision": True` to `llm_config`

**Reference**: #2110 (GLM-5), #1898 (GLM-4.7)

### Wrong Prompt Template (GPT models)

**Cause**: The model variant is not detected correctly and falls through to the wrong template

**Solution**: Add explicit entries to `model_prompt_spec.py` with the correct pattern order

**Reference**: #2233 (GPT-5.2-codex, GPT-5.3-codex)

### SDK-Only Parameters Sent to LiteLLM

**Cause**: A parameter like `disable_vision` is not in the `SDK_ONLY_PARAMS` set

**Solution**: Add it to `SDK_ONLY_PARAMS` in `resolve_model_config.py`

**Reference**: #2194
## Model Feature Detection Criteria

### How to Determine if a Model Needs a Feature Category

**Reasoning Model**:
- Check provider documentation for "reasoning", "thinking", or "o1-style" mentions
- The model exposes internal reasoning traces
- Examples: o1, o3, GPT-5, Claude Opus 4.5+, Gemini 3+

**Extended Thinking**:
- Check whether the model is Claude Sonnet 4.5+ or Claude Haiku 4.5
- The provider documents extended thinking capabilities

**Prompt Caching**:
- Check provider documentation for prompt caching support
- The Anthropic Claude 3.5+ and 4+ series support this

**Vision Support**:
- Check provider documentation (don't rely solely on LiteLLM)
- If LiteLLM reports vision but the provider docs say text-only, add `disable_vision: True`

**Stop Words**:
- Most models support stop words
- The o1/o3 series, some Grok models, and DeepSeek R1 do not

**String Serialization**:
- Needed if tool message errors mention "Input should be a valid string"
- DeepSeek, GLM, and some provider-specific models need this
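The string-serialization criterion boils down to how tool results are sent: most models accept a list of structured content parts, while models in FORCE_STRING_SERIALIZER_MODELS need one plain string. A minimal sketch of the difference; `serialize_tool_content` is a hypothetical helper, not the SDK's actual serializer:

```python
# Sketch of string serialization for tool messages. serialize_tool_content
# is a hypothetical helper, not SDK code.
def serialize_tool_content(parts: list[dict]) -> str:
    """Flatten structured content parts into the single string some models require."""
    return "\n".join(p.get("text", "") for p in parts if p.get("type") == "text")

structured = [
    {"type": "text", "text": "exit code 0"},
    {"type": "text", "text": "ls: README.md"},
]
# Most models accept `structured` as-is; DeepSeek/GLM-style models need:
flat = serialize_tool_content(structured)
# flat -> "exit code 0\nls: README.md"
```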
## Reference

- Recent model additions: #2102, #2153, #2207, #2233, #2269
- Common issues: #2147 (hangs), #2137 (parameters), #2110 (vision), #2233 (variants), #2193 (preflight)
- Integration test workflow: `.github/workflows/integration-runner.yml`
