PR review agent: avoid approving eval-risk behavior changes by enyst · Pull Request #2246 · OpenHands/software-agent-sdk

enyst · 2026-02-28T19:37:28Z

This updates the PR review agent prompt to avoid submitting APPROVE reviews for PRs that could plausibly impact benchmark/evaluation performance (tool execution, loop logic, I/O/terminal handling, etc.).

For those PRs, the reviewer should leave a COMMENT (or REQUEST_CHANGES when appropriate) and flag for a human maintainer to decide after lightweight evals.

Change is confined to: examples/03_github_workflows/02_pr_review/prompt.py
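
As a rough illustration of the policy described above, the decision could be sketched in code as follows. This is a hypothetical helper, not code from this PR: the actual change is prompt text, and the function name, the keyword list, and the path-matching heuristic are all assumptions chosen to mirror the categories named in the description.

```python
# Hypothetical sketch: the real policy lives in prompt text, not in code.
# The keywords are illustrative assumptions based on the PR description.
EVAL_RISK_KEYWORDS = (
    "tool", "loop", "terminal", "stdin", "stdout", "condenser", "eval",
)


def review_event(changed_paths: list[str], has_blocking_issues: bool = False) -> str:
    """Return the GitHub review event the reviewer should submit."""
    if has_blocking_issues:
        return "REQUEST_CHANGES"
    eval_risk = any(
        keyword in path.lower()
        for path in changed_paths
        for keyword in EVAL_RISK_KEYWORDS
    )
    # Eval-risk changes are never auto-approved; a human decides after evals.
    return "COMMENT" if eval_risk else "APPROVE"
```

Under these assumptions, a change touching a path such as `agent/loop.py` would yield a COMMENT review, while a docs-only change could still be approved.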

Agent Server images for this PR

• GHCR package: https://github.com/OpenHands/agent-sdk/pkgs/container/agent-server

Variants & Base Images

Variant	Architectures	Base Image	Docs / Tags
java	amd64, arm64	`eclipse-temurin:17-jdk`	Link
python	amd64, arm64	`nikolaik/python-nodejs:python3.12-nodejs22`	Link
golang	amd64, arm64	`golang:1.21-bookworm`	Link

Pull (multi-arch manifest)

# Each variant is a multi-arch manifest supporting both amd64 and arm64
docker pull ghcr.io/openhands/agent-server:59cf06a-python

Run

docker run -it --rm \
  -p 8000:8000 \
  --name agent-server-59cf06a-python \
  ghcr.io/openhands/agent-server:59cf06a-python

All tags pushed for this build

ghcr.io/openhands/agent-server:59cf06a-golang-amd64
ghcr.io/openhands/agent-server:59cf06a-golang_tag_1.21-bookworm-amd64
ghcr.io/openhands/agent-server:59cf06a-golang-arm64
ghcr.io/openhands/agent-server:59cf06a-golang_tag_1.21-bookworm-arm64
ghcr.io/openhands/agent-server:59cf06a-java-amd64
ghcr.io/openhands/agent-server:59cf06a-eclipse-temurin_tag_17-jdk-amd64
ghcr.io/openhands/agent-server:59cf06a-java-arm64
ghcr.io/openhands/agent-server:59cf06a-eclipse-temurin_tag_17-jdk-arm64
ghcr.io/openhands/agent-server:59cf06a-python-amd64
ghcr.io/openhands/agent-server:59cf06a-nikolaik_s_python-nodejs_tag_python3.12-nodejs22-amd64
ghcr.io/openhands/agent-server:59cf06a-python-arm64
ghcr.io/openhands/agent-server:59cf06a-nikolaik_s_python-nodejs_tag_python3.12-nodejs22-arm64
ghcr.io/openhands/agent-server:59cf06a-golang
ghcr.io/openhands/agent-server:59cf06a-java
ghcr.io/openhands/agent-server:59cf06a-python

About Multi-Architecture Support

• Each variant tag (e.g., 59cf06a-python) is a multi-arch manifest supporting both amd64 and arm64.
• Docker automatically pulls the correct architecture for your platform.
• Individual architecture tags (e.g., 59cf06a-python-amd64) are also available if needed.

Co-authored-by: openhands <openhands@all-hands.dev>

github-actions · 2026-02-28T19:37:53Z

API breakage checks (Griffe)

Result: Failed

Log excerpt (first 1000 characters)


============================================================
Checking openhands-sdk (openhands.sdk)
============================================================
Comparing openhands-sdk 1.11.5 against 1.11.4
::notice title=openhands-sdk API::Ignoring Field metadata-only change (non-breaking): load_public_skills
No breaking changes detected

============================================================
Checking openhands-workspace (openhands.workspace)
============================================================
Comparing openhands-workspace 1.11.5 against 1.11.4
::warning file=openhands-workspace/openhands/workspace/docker/dev_workspace.py,line=33,title=DockerDevWorkspace.server_image::Attribute value was changed: `Field(default='ghcr.io/openhands/agent-server:latest-python', description='Pre-built agent server image to use.')` -> `Field(default=None, description='Pre-built agent server image. Mutually exclusive with base_image.')`
::error title=SemVer::Breaking changes detected (1); re

Action log
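
The warning in the log excerpt above reports that `server_image` now defaults to `None` and is described as mutually exclusive with `base_image`. A minimal sketch of what such a constraint might look like is shown here, using a plain dataclass as a stand-in. The class name and the validation logic are assumptions for illustration only; the real `DockerDevWorkspace` uses `Field(...)` declarations and may enforce this differently.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class WorkspaceImageConfig:
    """Hypothetical stand-in, not the real DockerDevWorkspace."""

    server_image: Optional[str] = None  # new default per the warning above
    base_image: Optional[str] = None

    def __post_init__(self) -> None:
        # "Mutually exclusive" is read here as: setting both is rejected.
        if self.server_image is not None and self.base_image is not None:
            raise ValueError("server_image and base_image are mutually exclusive")
```

This also suggests why the checker flags the change as breaking: a caller that relied on the old default image string would now receive `None` unless it sets one of the two fields explicitly.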

github-actions · 2026-02-28T19:38:04Z

Agent server REST API breakage checks (OpenAPI)

Result: Passed

Action log

all-hands-bot

Taste Rating: 🟢 Good taste

This is a straightforward, pragmatic safety improvement that solves a real problem: preventing the PR review agent from auto-approving changes that could affect eval/benchmark performance.

Why this is sound:

• Clear real-world need: Prevents premature approval of risky changes (prompt templates, tool execution, loop logic, etc.)
• Simple approach: Just adds policy text with concrete examples, no complex logic
• Well-scoped: Lists specific categories that need human + eval verification
• No breaking changes: Only adds constraints; existing behavior unchanged
• Meta-level change: This is about PR review workflow, not task-solving agent behavior

Note on scope: This PR modifies a prompt template, but it's for the PR review agent (CI workflow), not the eval agents (SWE-bench, GAIA, etc.). It won't affect benchmark performance - it only makes the PR review process more conservative about approving changes that could affect benchmarks.

Verdict: ✅ Worth merging - improves PR review safety without unnecessary complexity.

Key insight: This is the kind of simple, defensive guardrail that prevents production issues. Good engineering taste.

xingyaoww · 2026-03-01T12:44:42Z

examples/03_github_workflows/02_pr_review/prompt.py

+memory/condenser behavior, terminal/stdin/stdout handling, or evaluation harness code.
+
+If a PR is in this category (or you are uncertain), leave a COMMENTED review and
+explicitly flag it for a human maintainer to decide after running lightweight evals.


@enyst I think we should not put the prompt in this file, since it will affect all the repos using the codereview action, right? I think we should put it into a repository-specific .agents/code-review.md

enyst · 2026-03-01

@OpenHands Let's fix this

openhands-ai bot · 2026-03-01

I'm on it! enyst can track my progress at all-hands.dev

enyst · 2026-03-01

Of course, oops

openhands-ai bot · 2026-03-01

Summary of work completed

Identified that the “eval / benchmark risk” review decision policy was added to examples/03_github_workflows/02_pr_review/prompt.py, which is used by the PR-review GitHub Action and therefore would affect all downstream repos that use the action.

Implemented the requested fix by moving that policy out of the shared example prompt and into this repository’s repo-specific review guidance at .agents/skills/custom-codereview-guide.md.

Changes made (concise)

examples/03_github_workflows/02_pr_review/prompt.py

Removed the eval/benchmark-risk “do not APPROVE…” section so the action’s base prompt remains generic for all repos.

.agents/skills/custom-codereview-guide.md

Added a new “Review decision policy (eval / benchmark risk)” section that instructs reviewers to COMMENT instead of APPROVE when changes could plausibly affect benchmark/eval performance.
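
The split described above (a generic base prompt plus repo-specific guidance) could be sketched as follows. This is only an illustration under stated assumptions: `BASE_PROMPT`, `build_review_prompt`, and the loading logic are hypothetical, and the way the PR-review action actually picks up the guide file is not shown in this thread.

```python
from pathlib import Path

# Hypothetical names: the base prompt text and the loader are illustrative only.
BASE_PROMPT = "You are a code reviewer. Review the PR diff and submit a review."
REPO_GUIDE = Path(".agents/skills/custom-codereview-guide.md")


def build_review_prompt(repo_root: Path) -> str:
    """Append repo-specific guidance to the generic prompt when the file exists."""
    guide = repo_root / REPO_GUIDE
    if not guide.is_file():
        # Downstream repos without a guide keep the generic behavior.
        return BASE_PROMPT
    guidance = guide.read_text(encoding="utf-8").strip()
    return f"{BASE_PROMPT}\n\n{guidance}"
```

With this arrangement, only repositories that ship their own guide file receive the eval-risk policy; every other user of the action sees the unchanged base prompt.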

Verification / compliance checklist

Addressed the reviewer concern about global impact by relocating policy to repo-specific rules.

Kept changes minimal (only 2 files, no unrelated edits).

Ran uv run pre-commit run --files ... on the modified files (passed).

Commit message included Co-authored-by: openhands <openhands@all-hands.dev>.

Delivery status

PR #2246 ("PR review agent: avoid approving eval-risk behavior changes") was already merged (and its branch no longer existed remotely), so I pushed the fix to a new branch and opened a follow-up PR:

PR review agent: make eval-risk approval policy repo-specific #2254

chore(pr-review): avoid approving eval-risk changes

2da95a9

Co-authored-by: openhands <openhands@all-hands.dev>

all-hands-bot approved these changes Feb 28, 2026


Update prompt.py

a1a0683

enyst enabled auto-merge (squash) February 28, 2026 20:06

enyst merged commit bcdbd5c into main Feb 28, 2026
22 checks passed

enyst deleted the chore/pr-review-eval-risk-policy branch February 28, 2026 20:08

xingyaoww reviewed Mar 1, 2026


neubig mentioned this pull request Mar 3, 2026

Add learnings from code review analysis #2280

Merged

4 tasks

enyst merged 2 commits into main from chore/pr-review-eval-risk-policy
