Add AGENTS.md for model addition guidance in .github/run-eval#2284
juanmichelini merged 2 commits into main
This adds contextual documentation for agents working on model additions to the OpenHands SDK evaluation configuration.
Key features:
- Step-by-step guide for adding models to resolve_model_config.py
- Comprehensive coverage of model feature categories
- Integration testing procedures and requirements
- Common issues and troubleshooting solutions
- Quick reference decision trees
This documentation is based on analysis of 10+ model additions over the past 2 months, including common issues discovered in:
- Integration test hangs (#2147)
- Parameter conflicts (#2137)
- Vision misreporting (#2110)
- Wrong prompt templates (#2233)
- Preflight failures (#2193)
The AGENTS.md file will be automatically loaded as context when agents work on files in the .github/run-eval/ directory, providing guidance when creating issues or PRs for model additions.
Fixes #2283
Co-authored-by: openhands <openhands@all-hands.dev>
API breakage checks (Griffe): Failed
Agent server REST API breakage checks (OpenAPI): Passed
all-hands-bot
left a comment
Taste Rating: 🟡 Acceptable
Comprehensive documentation addressing a real problem (30% of issues caught only in integration testing). The AI agent auto-loading context justifies the length, but the execution has issues: invalid code examples, unsubstantiated statistics, and significant redundancy between sections.
Key Insight: Documentation tries to solve a process problem. The real fix is CI enforcement (require integration test runs before PR merge), but for AI agents this contextual approach is pragmatic.
See inline comments for specific issues.
Changes based on review comments from all-hands-bot:
1. Fixed invalid Python function name examples (hyphens -> underscores)
   - test_your-model-id_config() -> test_claude_sonnet_46_config()
2. Removed unsubstantiated statistics
   - Removed percentage claims without clear sourcing
   - Kept only factual references to documented issues
3. Eliminated redundancy
   - Consolidated duplicated content between sections
   - Removed repetitive explanations
   - Streamlined structure: 191 additions, 270 deletions (net -79 lines)
4. Added clear decision criteria
   - Defined what makes a model a 'reasoning model'
   - Specified how to determine feature category membership
   - Based on provider documentation, not just examples
5. Improved parameter documentation
   - Made Claude temperature+top_p constraint more prominent
   - Added clear table for temperature configuration
   - Clarified when to use disable_vision
6. More structured and factual tone
   - Removed conversational/motivational language
   - Focused on essential information only
   - Made content more neutral and direct
The document is now:
- More concise (reduced from 382 to 303 lines)
- Better structured (clear sections, tables, criteria)
- More factual (removed unsubstantiated claims)
- Easier to follow (less redundancy)
Co-authored-by: openhands <openhands@all-hands.dev>
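The naming fix in item 1 can be illustrated with a small sketch. This is not code from the PR: `derive_test_name` and the `MODELS` entry are hypothetical names, used only to show that a hyphenated model ID can stay as a dictionary key while the test function name derived from it must be a valid Python identifier.

```python
import re


def derive_test_name(model_id: str) -> str:
    """Build a valid Python function name from a model ID (hypothetical helper)."""
    # Replace every character that is not a word character with "_".
    slug = re.sub(r"\W", "_", model_id)
    name = f"test_{slug}_config"
    # Guard against characters the substitution lets through but Python rejects.
    if not name.isidentifier():
        raise ValueError(f"cannot derive a test name from {model_id!r}")
    return name


# The dictionary key keeps its hyphens; only the function name changes.
MODELS = {"your-model-id": {"display_name": "Your Model"}}  # illustrative entry

for model_id in MODELS:
    print(model_id, "->", derive_test_name(model_id))
```

The helper only normalizes characters; it does not reproduce any particular naming scheme used in the repository's tests.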
Addressed Review Feedback
Thank you for the thorough review! I've restructured the AGENTS.md to address all feedback:
1. ✅ Fixed Invalid Python Function Names
Added a note explaining that dictionary keys keep hyphens, but Python function names must use underscores.
2. ✅ Removed Unsubstantiated Statistics
3. ✅ Eliminated Redundancy
4. ✅ Added Clear Decision Criteria
Before: "Is this a reasoning model? (o1, o3, GPT-5...)"
Added a "Model Feature Detection Criteria" section explaining how to determine category membership.
5. ✅ Improved Parameter Documentation
The Claude constraint is now prominent: "temperature": 0.0, # See temperature guide below
Added a clear table for temperature configuration.
Clarified disable_vision usage with specific criteria.
6. ✅ More Structured and Factual
Summary
All review feedback has been addressed. The document maintains comprehensive coverage while removing redundancy and unverified claims.
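The two parameter rules mentioned in this thread (the Claude temperature/top_p constraint and filtering disable_vision before LiteLLM) could be checked along the following lines. This is a rough sketch under stated assumptions, not the contents of resolve_model_config.py: the function names, the `SDK_ONLY_KEYS` set, and the reading of the constraint as "do not set both parameters together" are all assumptions made for illustration.

```python
from typing import Any

# Assumption: keys consumed by the SDK itself and never forwarded to LiteLLM.
SDK_ONLY_KEYS = frozenset({"disable_vision"})


def check_sampling_params(model_id: str, llm_config: dict[str, Any]) -> None:
    """Reject Claude configs that set both temperature and top_p (assumed rule)."""
    is_claude = "claude" in model_id.lower()
    if is_claude and "temperature" in llm_config and "top_p" in llm_config:
        raise ValueError(f"{model_id}: set temperature or top_p, not both")


def split_llm_config(
    llm_config: dict[str, Any],
) -> tuple[dict[str, Any], dict[str, Any]]:
    """Return (litellm_kwargs, sdk_options) without mutating the input."""
    litellm_kwargs = {k: v for k, v in llm_config.items() if k not in SDK_ONLY_KEYS}
    sdk_options = {k: v for k, v in llm_config.items() if k in SDK_ONLY_KEYS}
    return litellm_kwargs, sdk_options


# Hypothetical config entry, for illustration only.
config = {"temperature": 0.0, "disable_vision": True}
check_sampling_params("claude-example", config)
litellm_kwargs, sdk_options = split_llm_config(config)
print(litellm_kwargs, sdk_options)
```

Keeping the SDK-only keys in a single set means a new option of that kind needs one change rather than edits at every call site.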
all-hands-bot
left a comment
Taste Rating: 🟡 Acceptable
Comprehensive documentation addressing a real problem based on analysis of 10+ model additions and 15+ issues. The guidance is practical, references actual issues, and will help agents avoid common mistakes when adding models. Previous review concerns (Python function naming, etc.) have been addressed.
Key Insight: This is pragmatic documentation that solves a process problem - it codifies learnings from recent model additions into reusable guidance for AI agents. The length is justified given it will be auto-loaded as context.
Verdict: ✅ Worth merging - solid reference documentation that will reduce model addition issues.
Summary
Adds contextual documentation for agents working on model additions to the OpenHands SDK evaluation configuration.
What This Adds
A comprehensive AGENTS.md file in .github/run-eval/ that provides:
- ... resolve_model_config.py
When This Will Be Triggered
This AGENTS.md will be automatically loaded as context when:
- ... resolve_model_config.py or related files
- ... .github/run-eval/ directory
Research Foundation
This documentation is based on analysis of:
Key Insights Included
- ... temperature and top_p
- disable_vision must be filtered before LiteLLM
Expected Impact
File Structure
The AGENTS.md includes:
Testing
Checklist
Fixes #2283
Note to reviewers: This AGENTS.md is designed to be loaded automatically when agents work in the .github/run-eval/ directory, providing them with the context needed to successfully add models without the common pitfalls we've seen in recent additions.
Agent Server images for this PR
• GHCR package: https://github.com/OpenHands/agent-sdk/pkgs/container/agent-server
Variants & Base Images
- eclipse-temurin:17-jdk
- nikolaik/python-nodejs:python3.12-nodejs22
- golang:1.21-bookworm
Pull (multi-arch manifest)
    # Each variant is a multi-arch manifest supporting both amd64 and arm64
    docker pull ghcr.io/openhands/agent-server:09c9978-python
Run
All tags pushed for this build
About Multi-Architecture Support
- ... (09c9978-python) is a multi-arch manifest supporting both amd64 and arm64
- ... (09c9978-python-amd64) are also available if needed