From ba822a3fd446419ce7c9eae6406a0ef46b82d4f2 Mon Sep 17 00:00:00 2001 From: openhands Date: Tue, 3 Mar 2026 15:01:28 +0000 Subject: [PATCH 1/5] Add learnings from code review analysis Distilled patterns from recent PR reviews into: - New skill: .openhands/skills/eval-risk-assessment/SKILL.md - AGENTS.md guideline updates (Code Review Learnings section) Analysis summary: - PRs analyzed: 15 merged PRs from the past 30 days - Comments processed: 25+ review comments - Patterns identified: - Eval Risk Assessment (3+ occurrences) - Testing new behavior - Input validation - API documentation Co-authored-by: openhands --- .../skills/eval-risk-assessment/SKILL.md | 80 +++++++++++++++++++ AGENTS.md | 28 +++++++ 2 files changed, 108 insertions(+) create mode 100644 .openhands/skills/eval-risk-assessment/SKILL.md diff --git a/.openhands/skills/eval-risk-assessment/SKILL.md b/.openhands/skills/eval-risk-assessment/SKILL.md new file mode 100644 index 0000000000..1dd1de8339 --- /dev/null +++ b/.openhands/skills/eval-risk-assessment/SKILL.md @@ -0,0 +1,80 @@ +--- +name: eval-risk-assessment +description: Guidelines for identifying and flagging changes that could affect benchmark/evaluation performance. Distilled from recurring code review feedback. +--- + +# Eval Risk Assessment + +When reviewing or submitting PRs, flag changes that could plausibly affect benchmark/evaluation performance for human review and lightweight evals. + +## Overview + +This skill was distilled from recurring code review feedback in the repository. Multiple reviews flagged the need to assess "eval risk" before merging changes that affect agent behavior. + +## Changes That Require Eval Verification + +Flag the following types of changes: + +1. **Prompt Template Modifications** + - System prompts + - Tool descriptions + - Agent instructions + +2. **Tool Behavior Changes** + - Tool calling/execution logic + - Tool output formatting + - New or modified tool parameters + +3. 
**Agent Decision Logic** + - Loop/iteration control + - Task delegation logic + - State management changes + +4. **LLM Integration Changes** + - Model parameter handling (temperature, max_tokens, etc.) + - Response parsing + - Context management + +## How to Flag Eval Risk + +When submitting a PR with eval-risk changes: + +```markdown +## Eval Risk Flag 🚩 + +This PR modifies [prompt templates / tool behavior / agent logic]. +Recommend running lightweight evals before merge to verify no +unexpected impact on benchmark performance. +``` + +When reviewing a PR with eval-risk changes: + +- Do NOT approve without human maintainer review +- Request lightweight eval verification +- Use `COMMENT` status instead of `APPROVE` if uncertain + +## Examples from Code Reviews + +> "According to the eval-risk policy... changes to prompt templates (which includes tool descriptions) should be flagged for human review with lightweight evals." + +> "This PR touches git tool endpoints that agents use during coding tasks. Per review policy, flagging for lightweight eval verification before merge." + +> "Eval Risk Flag: This changes max_output_tokens behavior... Should run lightweight evals before merge to confirm no unexpected impact on agent performance." + +## Related Patterns + +### Test New Behavior, Not Just Removed Behavior + +When refactoring, tests should verify: +- The new behavior works as expected +- Not just that old behavior was removed + +> "Test verifies what was removed but doesn't verify the NEW behavior." + +### Document API Behavior Changes + +When modifying public APIs: +- Update docstrings to reflect new behavior +- Note any breaking changes + +> "The docstring doesn't clarify that it only returns user-registered agents, not built-in agents." 
diff --git a/AGENTS.md b/AGENTS.md index 6e61b41b19..3a6ab90016 100644 --- a/AGENTS.md +++ b/AGENTS.md @@ -423,4 +423,32 @@ For examples that use the critic model (e.g., `34_critic_example.py`), the criti - Ruff ignores `ARG` (unused arguments) under `tests/**/*.py` to allow pytest fixtures. - Repository guidance lives in `AGENTS.md` (loaded as a third-party skill file). + + +# Code Review Learnings + +Patterns distilled from recurring code review feedback (auto-generated). + +## Testing + +- **Test new behavior, not just removed behavior**: When refactoring, verify the new implementation works correctly, not just that the old code was removed. +- **Add test cases for new model/feature support**: When adding new model identifiers or feature flags, include corresponding test cases to prevent regressions. + +## Input Validation + +- **Validate user inputs early**: For workflow inputs (especially string-based), add validation to fail fast with clear error messages rather than causing confusing downstream failures. + +## API Changes + +- **Document breaking changes**: When modifying public API output (e.g., `get_factory_info()`), update docstrings to reflect the new behavior. +- **Mark deprecated endpoints**: Legacy endpoints should be marked with deprecation annotations. + +## Eval Risk Assessment + +Changes that could affect benchmark/evaluation performance require special handling: +- Flag changes to prompt templates, tool descriptions, or agent decision logic +- Request lightweight evals before merge +- Use COMMENT instead of APPROVE for uncertain cases +- See `.openhands/skills/eval-risk-assessment/SKILL.md` for detailed guidelines + From 38d130688c638724e28d1c11bcab321d33859dda Mon Sep 17 00:00:00 2001 From: openhands Date: Tue, 3 Mar 2026 16:18:46 +0000 Subject: [PATCH 2/5] Remove code review references from generated content Focus on actionable guidelines rather than implementation details of how the skills were created. 
Co-authored-by: openhands --- .../skills/eval-risk-assessment/SKILL.md | 32 +------------------ AGENTS.md | 8 ++--- 2 files changed, 3 insertions(+), 37 deletions(-) diff --git a/.openhands/skills/eval-risk-assessment/SKILL.md b/.openhands/skills/eval-risk-assessment/SKILL.md index 1dd1de8339..8ee4f34c27 100644 --- a/.openhands/skills/eval-risk-assessment/SKILL.md +++ b/.openhands/skills/eval-risk-assessment/SKILL.md @@ -1,16 +1,12 @@ --- name: eval-risk-assessment -description: Guidelines for identifying and flagging changes that could affect benchmark/evaluation performance. Distilled from recurring code review feedback. +description: Guidelines for identifying and flagging changes that could affect benchmark/evaluation performance. --- # Eval Risk Assessment When reviewing or submitting PRs, flag changes that could plausibly affect benchmark/evaluation performance for human review and lightweight evals. -## Overview - -This skill was distilled from recurring code review feedback in the repository. Multiple reviews flagged the need to assess "eval risk" before merging changes that affect agent behavior. - ## Changes That Require Eval Verification Flag the following types of changes: @@ -52,29 +48,3 @@ When reviewing a PR with eval-risk changes: - Do NOT approve without human maintainer review - Request lightweight eval verification - Use `COMMENT` status instead of `APPROVE` if uncertain - -## Examples from Code Reviews - -> "According to the eval-risk policy... changes to prompt templates (which includes tool descriptions) should be flagged for human review with lightweight evals." - -> "This PR touches git tool endpoints that agents use during coding tasks. Per review policy, flagging for lightweight eval verification before merge." - -> "Eval Risk Flag: This changes max_output_tokens behavior... Should run lightweight evals before merge to confirm no unexpected impact on agent performance." 
- -## Related Patterns - -### Test New Behavior, Not Just Removed Behavior - -When refactoring, tests should verify: -- The new behavior works as expected -- Not just that old behavior was removed - -> "Test verifies what was removed but doesn't verify the NEW behavior." - -### Document API Behavior Changes - -When modifying public APIs: -- Update docstrings to reflect new behavior -- Note any breaking changes - -> "The docstring doesn't clarify that it only returns user-registered agents, not built-in agents." diff --git a/AGENTS.md b/AGENTS.md index 3a6ab90016..4d46cfa825 100644 --- a/AGENTS.md +++ b/AGENTS.md @@ -424,11 +424,7 @@ For examples that use the critic model (e.g., `34_critic_example.py`), the criti - Repository guidance lives in `AGENTS.md` (loaded as a third-party skill file). - -# Code Review Learnings - -Patterns distilled from recurring code review feedback (auto-generated). - + ## Testing - **Test new behavior, not just removed behavior**: When refactoring, verify the new implementation works correctly, not just that the old code was removed. @@ -450,5 +446,5 @@ Changes that could affect benchmark/evaluation performance require special handl - Request lightweight evals before merge - Use COMMENT instead of APPROVE for uncertain cases - See `.openhands/skills/eval-risk-assessment/SKILL.md` for detailed guidelines - + From c24bbfb552f5c1d34c9a13deba95077ad7bccf25 Mon Sep 17 00:00:00 2001 From: openhands Date: Tue, 3 Mar 2026 16:19:49 +0000 Subject: [PATCH 3/5] Remove redundant CODING_GUIDELINES section The AGENTS.md file itself is coding guidelines - no need for a separate wrapper section. 
Co-authored-by: openhands --- AGENTS.md | 23 ----------------------- 1 file changed, 23 deletions(-) diff --git a/AGENTS.md b/AGENTS.md index 4d46cfa825..1719729210 100644 --- a/AGENTS.md +++ b/AGENTS.md @@ -424,27 +424,4 @@ For examples that use the critic model (e.g., `34_critic_example.py`), the criti - Repository guidance lives in `AGENTS.md` (loaded as a third-party skill file). - -## Testing - -- **Test new behavior, not just removed behavior**: When refactoring, verify the new implementation works correctly, not just that the old code was removed. -- **Add test cases for new model/feature support**: When adding new model identifiers or feature flags, include corresponding test cases to prevent regressions. - -## Input Validation - -- **Validate user inputs early**: For workflow inputs (especially string-based), add validation to fail fast with clear error messages rather than causing confusing downstream failures. - -## API Changes - -- **Document breaking changes**: When modifying public API output (e.g., `get_factory_info()`), update docstrings to reflect the new behavior. -- **Mark deprecated endpoints**: Legacy endpoints should be marked with deprecation annotations. - -## Eval Risk Assessment - -Changes that could affect benchmark/evaluation performance require special handling: -- Flag changes to prompt templates, tool descriptions, or agent decision logic -- Request lightweight evals before merge -- Use COMMENT instead of APPROVE for uncertain cases -- See `.openhands/skills/eval-risk-assessment/SKILL.md` for detailed guidelines - From dbabdd814e32eb4997febec0ff35b8a2e5bf8137 Mon Sep 17 00:00:00 2001 From: openhands Date: Thu, 5 Mar 2026 19:58:19 +0000 Subject: [PATCH 4/5] Simplify eval-risk to single AGENTS.md line Remove the separate eval-risk-assessment skill and consolidate guidance into a single line in the TESTING section of AGENTS.md. 
Co-authored-by: openhands --- .../skills/eval-risk-assessment/SKILL.md | 50 ------------------- AGENTS.md | 1 + 2 files changed, 1 insertion(+), 50 deletions(-) delete mode 100644 .openhands/skills/eval-risk-assessment/SKILL.md diff --git a/.openhands/skills/eval-risk-assessment/SKILL.md b/.openhands/skills/eval-risk-assessment/SKILL.md deleted file mode 100644 index 8ee4f34c27..0000000000 --- a/.openhands/skills/eval-risk-assessment/SKILL.md +++ /dev/null @@ -1,50 +0,0 @@ ---- -name: eval-risk-assessment -description: Guidelines for identifying and flagging changes that could affect benchmark/evaluation performance. ---- - -# Eval Risk Assessment - -When reviewing or submitting PRs, flag changes that could plausibly affect benchmark/evaluation performance for human review and lightweight evals. - -## Changes That Require Eval Verification - -Flag the following types of changes: - -1. **Prompt Template Modifications** - - System prompts - - Tool descriptions - - Agent instructions - -2. **Tool Behavior Changes** - - Tool calling/execution logic - - Tool output formatting - - New or modified tool parameters - -3. **Agent Decision Logic** - - Loop/iteration control - - Task delegation logic - - State management changes - -4. **LLM Integration Changes** - - Model parameter handling (temperature, max_tokens, etc.) - - Response parsing - - Context management - -## How to Flag Eval Risk - -When submitting a PR with eval-risk changes: - -```markdown -## Eval Risk Flag 🚩 - -This PR modifies [prompt templates / tool behavior / agent logic]. -Recommend running lightweight evals before merge to verify no -unexpected impact on benchmark performance. 
-``` - -When reviewing a PR with eval-risk changes: - -- Do NOT approve without human maintainer review -- Request lightweight eval verification -- Use `COMMENT` status instead of `APPROVE` if uncertain diff --git a/AGENTS.md b/AGENTS.md index 1719729210..662500d85f 100644 --- a/AGENTS.md +++ b/AGENTS.md @@ -337,6 +337,7 @@ These are enforced by `check_sdk_api_breakage.py` (runs on release PRs). Depreca - DON'T write TEST CLASSES unless absolutely necessary! - If you find yourself duplicating logics in preparing mocks, loading data etc, these logic should be fixtures in conftest.py! - Please test only the logic implemented in the current codebase. Do not test functionality (e.g., BaseModel.model_dumps()) that is not implemented in this repository. +- For changes to prompt templates, tool descriptions, or agent decision logic, flag the PR for human review and request lightweight integration tests to verify no unexpected impact on benchmark performance. # Behavior Tests From ec1b77d67af32c8a0313b712517d3b6e4fa7091e Mon Sep 17 00:00:00 2001 From: openhands Date: Thu, 5 Mar 2026 20:03:51 +0000 Subject: [PATCH 5/5] Update eval-risk guidance to mention integration-test label The integration tests can be triggered by adding the 'integration-test' label to a PR, which the agent can do. Co-authored-by: openhands --- AGENTS.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/AGENTS.md b/AGENTS.md index 662500d85f..d5f32230f0 100644 --- a/AGENTS.md +++ b/AGENTS.md @@ -337,7 +337,7 @@ These are enforced by `check_sdk_api_breakage.py` (runs on release PRs). Depreca - DON'T write TEST CLASSES unless absolutely necessary! - If you find yourself duplicating logics in preparing mocks, loading data etc, these logic should be fixtures in conftest.py! - Please test only the logic implemented in the current codebase. Do not test functionality (e.g., BaseModel.model_dumps()) that is not implemented in this repository. 
-- For changes to prompt templates, tool descriptions, or agent decision logic, flag the PR for human review and request lightweight integration tests to verify no unexpected impact on benchmark performance. +- For changes to prompt templates, tool descriptions, or agent decision logic, add the `integration-test` label to trigger integration tests and verify no unexpected impact on benchmark performance. # Behavior Tests
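
The final patch tells the agent to add an `integration-test` label whenever a change touches prompt templates, tool descriptions, or agent decision logic. As a sketch of how such changes might be detected automatically from a PR's changed-file list — all path patterns below are hypothetical illustrations, not the repository's real layout:

```python
# Hypothetical helper: decide whether a PR needs the `integration-test` label.
# The glob patterns are illustrative only; a real policy would use the
# repository's actual paths for prompts, tools, and agent logic.
import fnmatch

EVAL_RISK_PATTERNS = [
    "*/prompts/*",    # prompt templates / system prompts
    "*/tools/*.py",   # tool descriptions and tool behavior
    "*agent*loop*",   # agent decision / iteration logic
]

def needs_integration_tests(changed_paths):
    """Return True if any changed file matches an eval-risk pattern."""
    return any(
        fnmatch.fnmatch(path, pattern)
        for path in changed_paths
        for pattern in EVAL_RISK_PATTERNS
    )

print(needs_integration_tests(["openhands/prompts/system.j2", "README.md"]))  # True
print(needs_integration_tests(["docs/index.md"]))  # False
```

In practice the label itself could then be applied with the GitHub CLI, e.g. `gh pr edit <pr-number> --add-label integration-test`, assuming `gh` is available and the label exists in the repository.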