From ba822a3fd446419ce7c9eae6406a0ef46b82d4f2 Mon Sep 17 00:00:00 2001 From: openhands Date: Tue, 3 Mar 2026 15:01:28 +0000 Subject: [PATCH 1/5] Add learnings from code review analysis Distilled patterns from recent PR reviews into: - New skill: .openhands/skills/eval-risk-assessment/SKILL.md - AGENTS.md guideline updates (Code Review Learnings section) Analysis summary: - PRs analyzed: 15 merged PRs from the past 30 days - Comments processed: 25+ review comments - Patterns identified: - Eval Risk Assessment (3+ occurrences) - Testing new behavior - Input validation - API documentation Co-authored-by: openhands --- .../skills/eval-risk-assessment/SKILL.md | 80 +++++++++++++++++++ AGENTS.md | 28 +++++++ 2 files changed, 108 insertions(+) create mode 100644 .openhands/skills/eval-risk-assessment/SKILL.md diff --git a/.openhands/skills/eval-risk-assessment/SKILL.md b/.openhands/skills/eval-risk-assessment/SKILL.md new file mode 100644 index 0000000000..1dd1de8339 --- /dev/null +++ b/.openhands/skills/eval-risk-assessment/SKILL.md @@ -0,0 +1,80 @@ +--- +name: eval-risk-assessment +description: Guidelines for identifying and flagging changes that could affect benchmark/evaluation performance. Distilled from recurring code review feedback. +--- + +# Eval Risk Assessment + +When reviewing or submitting PRs, flag changes that could plausibly affect benchmark/evaluation performance for human review and lightweight evals. + +## Overview + +This skill was distilled from recurring code review feedback in the repository. Multiple reviews flagged the need to assess "eval risk" before merging changes that affect agent behavior. + +## Changes That Require Eval Verification + +Flag the following types of changes: + +1. **Prompt Template Modifications** + - System prompts + - Tool descriptions + - Agent instructions + +2. **Tool Behavior Changes** + - Tool calling/execution logic + - Tool output formatting + - New or modified tool parameters + +3. 
**Agent Decision Logic** + - Loop/iteration control + - Task delegation logic + - State management changes + +4. **LLM Integration Changes** + - Model parameter handling (temperature, max_tokens, etc.) + - Response parsing + - Context management + +## How to Flag Eval Risk + +When submitting a PR with eval-risk changes: + +```markdown +## Eval Risk Flag 🚩 + +This PR modifies [prompt templates / tool behavior / agent logic]. +Recommend running lightweight evals before merge to verify no +unexpected impact on benchmark performance. +``` + +When reviewing a PR with eval-risk changes: + +- Do NOT approve without human maintainer review +- Request lightweight eval verification +- Use `COMMENT` status instead of `APPROVE` if uncertain + +## Examples from Code Reviews + +> "According to the eval-risk policy... changes to prompt templates (which includes tool descriptions) should be flagged for human review with lightweight evals." + +> "This PR touches git tool endpoints that agents use during coding tasks. Per review policy, flagging for lightweight eval verification before merge." + +> "Eval Risk Flag: This changes max_output_tokens behavior... Should run lightweight evals before merge to confirm no unexpected impact on agent performance." + +## Related Patterns + +### Test New Behavior, Not Just Removed Behavior + +When refactoring, tests should verify: +- The new behavior works as expected +- Not just that old behavior was removed + +> "Test verifies what was removed but doesn't verify the NEW behavior." + +### Document API Behavior Changes + +When modifying public APIs: +- Update docstrings to reflect new behavior +- Note any breaking changes + +> "The docstring doesn't clarify that it only returns user-registered agents, not built-in agents." 
diff --git a/AGENTS.md b/AGENTS.md index 6e61b41b19..3a6ab90016 100644 --- a/AGENTS.md +++ b/AGENTS.md @@ -423,4 +423,32 @@ For examples that use the critic model (e.g., `34_critic_example.py`), the criti - Ruff ignores `ARG` (unused arguments) under `tests/**/*.py` to allow pytest fixtures. - Repository guidance lives in `AGENTS.md` (loaded as a third-party skill file). + + +# Code Review Learnings + +Patterns distilled from recurring code review feedback (auto-generated). + +## Testing + +- **Test new behavior, not just removed behavior**: When refactoring, verify the new implementation works correctly, not just that the old code was removed. +- **Add test cases for new model/feature support**: When adding new model identifiers or feature flags, include corresponding test cases to prevent regressions. + +## Input Validation + +- **Validate user inputs early**: For workflow inputs (especially string-based), add validation to fail fast with clear error messages rather than causing confusing downstream failures. + +## API Changes + +- **Document breaking changes**: When modifying public API output (e.g., `get_factory_info()`), update docstrings to reflect the new behavior. +- **Mark deprecated endpoints**: Legacy endpoints should be marked with deprecation annotations. + +## Eval Risk Assessment + +Changes that could affect benchmark/evaluation performance require special handling: +- Flag changes to prompt templates, tool descriptions, or agent decision logic +- Request lightweight evals before merge +- Use COMMENT instead of APPROVE for uncertain cases +- See `.openhands/skills/eval-risk-assessment/SKILL.md` for detailed guidelines + From 38d130688c638724e28d1c11bcab321d33859dda Mon Sep 17 00:00:00 2001 From: openhands Date: Tue, 3 Mar 2026 16:18:46 +0000 Subject: [PATCH 2/5] Remove code review references from generated content Focus on actionable guidelines rather than implementation details of how the skills were created. 
Co-authored-by: openhands --- .../skills/eval-risk-assessment/SKILL.md | 32 +------------------ AGENTS.md | 8 ++--- 2 files changed, 3 insertions(+), 37 deletions(-) diff --git a/.openhands/skills/eval-risk-assessment/SKILL.md b/.openhands/skills/eval-risk-assessment/SKILL.md index 1dd1de8339..8ee4f34c27 100644 --- a/.openhands/skills/eval-risk-assessment/SKILL.md +++ b/.openhands/skills/eval-risk-assessment/SKILL.md @@ -1,16 +1,12 @@ --- name: eval-risk-assessment -description: Guidelines for identifying and flagging changes that could affect benchmark/evaluation performance. Distilled from recurring code review feedback. +description: Guidelines for identifying and flagging changes that could affect benchmark/evaluation performance. --- # Eval Risk Assessment When reviewing or submitting PRs, flag changes that could plausibly affect benchmark/evaluation performance for human review and lightweight evals. -## Overview - -This skill was distilled from recurring code review feedback in the repository. Multiple reviews flagged the need to assess "eval risk" before merging changes that affect agent behavior. - ## Changes That Require Eval Verification Flag the following types of changes: @@ -52,29 +48,3 @@ When reviewing a PR with eval-risk changes: - Do NOT approve without human maintainer review - Request lightweight eval verification - Use `COMMENT` status instead of `APPROVE` if uncertain - -## Examples from Code Reviews - -> "According to the eval-risk policy... changes to prompt templates (which includes tool descriptions) should be flagged for human review with lightweight evals." - -> "This PR touches git tool endpoints that agents use during coding tasks. Per review policy, flagging for lightweight eval verification before merge." - -> "Eval Risk Flag: This changes max_output_tokens behavior... Should run lightweight evals before merge to confirm no unexpected impact on agent performance." 
- -## Related Patterns - -### Test New Behavior, Not Just Removed Behavior - -When refactoring, tests should verify: -- The new behavior works as expected -- Not just that old behavior was removed - -> "Test verifies what was removed but doesn't verify the NEW behavior." - -### Document API Behavior Changes - -When modifying public APIs: -- Update docstrings to reflect new behavior -- Note any breaking changes - -> "The docstring doesn't clarify that it only returns user-registered agents, not built-in agents." diff --git a/AGENTS.md b/AGENTS.md index 3a6ab90016..4d46cfa825 100644 --- a/AGENTS.md +++ b/AGENTS.md @@ -424,11 +424,7 @@ For examples that use the critic model (e.g., `34_critic_example.py`), the criti - Repository guidance lives in `AGENTS.md` (loaded as a third-party skill file). - -# Code Review Learnings - -Patterns distilled from recurring code review feedback (auto-generated). - + ## Testing - **Test new behavior, not just removed behavior**: When refactoring, verify the new implementation works correctly, not just that the old code was removed. @@ -450,5 +446,5 @@ Changes that could affect benchmark/evaluation performance require special handl - Request lightweight evals before merge - Use COMMENT instead of APPROVE for uncertain cases - See `.openhands/skills/eval-risk-assessment/SKILL.md` for detailed guidelines - + From c24bbfb552f5c1d34c9a13deba95077ad7bccf25 Mon Sep 17 00:00:00 2001 From: openhands Date: Tue, 3 Mar 2026 16:19:49 +0000 Subject: [PATCH 3/5] Remove redundant CODING_GUIDELINES section The AGENTS.md file itself is coding guidelines - no need for a separate wrapper section. 
Co-authored-by: openhands --- AGENTS.md | 23 ----------------------- 1 file changed, 23 deletions(-) diff --git a/AGENTS.md b/AGENTS.md index 4d46cfa825..1719729210 100644 --- a/AGENTS.md +++ b/AGENTS.md @@ -424,27 +424,4 @@ For examples that use the critic model (e.g., `34_critic_example.py`), the criti - Repository guidance lives in `AGENTS.md` (loaded as a third-party skill file). - -## Testing - -- **Test new behavior, not just removed behavior**: When refactoring, verify the new implementation works correctly, not just that the old code was removed. -- **Add test cases for new model/feature support**: When adding new model identifiers or feature flags, include corresponding test cases to prevent regressions. - -## Input Validation - -- **Validate user inputs early**: For workflow inputs (especially string-based), add validation to fail fast with clear error messages rather than causing confusing downstream failures. - -## API Changes - -- **Document breaking changes**: When modifying public API output (e.g., `get_factory_info()`), update docstrings to reflect the new behavior. -- **Mark deprecated endpoints**: Legacy endpoints should be marked with deprecation annotations. - -## Eval Risk Assessment - -Changes that could affect benchmark/evaluation performance require special handling: -- Flag changes to prompt templates, tool descriptions, or agent decision logic -- Request lightweight evals before merge -- Use COMMENT instead of APPROVE for uncertain cases -- See `.openhands/skills/eval-risk-assessment/SKILL.md` for detailed guidelines - From dbabdd814e32eb4997febec0ff35b8a2e5bf8137 Mon Sep 17 00:00:00 2001 From: openhands Date: Thu, 5 Mar 2026 19:58:19 +0000 Subject: [PATCH 4/5] Simplify eval-risk to single AGENTS.md line Remove the separate eval-risk-assessment skill and consolidate guidance into a single line in the TESTING section of AGENTS.md. 
Co-authored-by: openhands --- .../skills/eval-risk-assessment/SKILL.md | 50 ------------------- AGENTS.md | 1 + 2 files changed, 1 insertion(+), 50 deletions(-) delete mode 100644 .openhands/skills/eval-risk-assessment/SKILL.md diff --git a/.openhands/skills/eval-risk-assessment/SKILL.md b/.openhands/skills/eval-risk-assessment/SKILL.md deleted file mode 100644 index 8ee4f34c27..0000000000 --- a/.openhands/skills/eval-risk-assessment/SKILL.md +++ /dev/null @@ -1,50 +0,0 @@ ---- -name: eval-risk-assessment -description: Guidelines for identifying and flagging changes that could affect benchmark/evaluation performance. ---- - -# Eval Risk Assessment - -When reviewing or submitting PRs, flag changes that could plausibly affect benchmark/evaluation performance for human review and lightweight evals. - -## Changes That Require Eval Verification - -Flag the following types of changes: - -1. **Prompt Template Modifications** - - System prompts - - Tool descriptions - - Agent instructions - -2. **Tool Behavior Changes** - - Tool calling/execution logic - - Tool output formatting - - New or modified tool parameters - -3. **Agent Decision Logic** - - Loop/iteration control - - Task delegation logic - - State management changes - -4. **LLM Integration Changes** - - Model parameter handling (temperature, max_tokens, etc.) - - Response parsing - - Context management - -## How to Flag Eval Risk - -When submitting a PR with eval-risk changes: - -```markdown -## Eval Risk Flag 🚩 - -This PR modifies [prompt templates / tool behavior / agent logic]. -Recommend running lightweight evals before merge to verify no -unexpected impact on benchmark performance. 
-``` - -When reviewing a PR with eval-risk changes: - -- Do NOT approve without human maintainer review -- Request lightweight eval verification -- Use `COMMENT` status instead of `APPROVE` if uncertain diff --git a/AGENTS.md b/AGENTS.md index 1719729210..662500d85f 100644 --- a/AGENTS.md +++ b/AGENTS.md @@ -337,6 +337,7 @@ These are enforced by `check_sdk_api_breakage.py` (runs on release PRs). Depreca - DON'T write TEST CLASSES unless absolutely necessary! - If you find yourself duplicating logics in preparing mocks, loading data etc, these logic should be fixtures in conftest.py! - Please test only the logic implemented in the current codebase. Do not test functionality (e.g., BaseModel.model_dumps()) that is not implemented in this repository. +- For changes to prompt templates, tool descriptions, or agent decision logic, flag the PR for human review and request lightweight integration tests to verify no unexpected impact on benchmark performance. # Behavior Tests From ec1b77d67af32c8a0313b712517d3b6e4fa7091e Mon Sep 17 00:00:00 2001 From: openhands Date: Thu, 5 Mar 2026 20:03:51 +0000 Subject: [PATCH 5/5] Update eval-risk guidance to mention integration-test label The integration tests can be triggered by adding the 'integration-test' label to a PR, which the agent can do. Co-authored-by: openhands --- AGENTS.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/AGENTS.md b/AGENTS.md index 662500d85f..d5f32230f0 100644 --- a/AGENTS.md +++ b/AGENTS.md @@ -337,7 +337,7 @@ These are enforced by `check_sdk_api_breakage.py` (runs on release PRs). Depreca - DON'T write TEST CLASSES unless absolutely necessary! - If you find yourself duplicating logics in preparing mocks, loading data etc, these logic should be fixtures in conftest.py! - Please test only the logic implemented in the current codebase. Do not test functionality (e.g., BaseModel.model_dumps()) that is not implemented in this repository. 
-- For changes to prompt templates, tool descriptions, or agent decision logic, flag the PR for human review and request lightweight integration tests to verify no unexpected impact on benchmark performance. +- For changes to prompt templates, tool descriptions, or agent decision logic, add the `integration-test` label to trigger integration tests and verify no unexpected impact on benchmark performance. # Behavior Tests
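
The final patch tells the agent to add an `integration-test` label whenever a change touches prompt templates, tool descriptions, or agent decision logic. As a sketch of how such changes might be detected automatically from a PR's changed-file list — all path patterns below are hypothetical illustrations, not the repository's real layout:

```python
# Hypothetical helper: decide whether a PR needs the `integration-test` label.
# The glob patterns are illustrative only; a real policy would use the
# repository's actual paths for prompts, tools, and agent logic.
import fnmatch

EVAL_RISK_PATTERNS = [
    "*/prompts/*",    # prompt templates / system prompts
    "*/tools/*.py",   # tool descriptions and tool behavior
    "*agent*loop*",   # agent decision / iteration logic
]

def needs_integration_tests(changed_paths):
    """Return True if any changed file matches an eval-risk pattern."""
    return any(
        fnmatch.fnmatch(path, pattern)
        for path in changed_paths
        for pattern in EVAL_RISK_PATTERNS
    )

print(needs_integration_tests(["openhands/prompts/system.j2", "README.md"]))  # True
print(needs_integration_tests(["docs/index.md"]))  # False
```

In practice the label itself could then be applied with the GitHub CLI, e.g. `gh pr edit <pr-number> --add-label integration-test`, assuming `gh` is available and the label exists in the repository.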