Commit 82447d2

Merge branch 'main' into fix/2467-image-downscale
2 parents 487da93 + 7044805 commit 82447d2

114 files changed, +4062 -909 lines changed


.github/PULL_REQUEST_TEMPLATE.md

Lines changed: 45 additions & 7 deletions
@@ -1,11 +1,49 @@
+<!-- Keep this PR as draft until it is ready for review. -->
+
+<!-- AI/LLM agents:
+
+Provide evidence that the code runs properly end-to-end. Just running unit tests is NOT sufficient. Explain exactly the command that you ran, and provide evidence that the code works as expected, either in the form of log outputs or screenshots. In addition, if it is a bug fix, also run the same code before the bug fix and demonstrate that the code did NOT work before the fix to demonstrate that you were able to reproduce the problem.
+-->
+
+- [ ] A human has tested these changes.
+
+---
+
+## Why
+
+<!-- Describe problem, motivation, etc.-->
+
 ## Summary
 
-[fill in a summary of this PR]
+<!-- 1-3 bullets describing what changed. -->
+-
+
+## Issue Number
+<!-- Required if there is a relevant issue to this PR. -->
+
+## How to Test
+
+<!--
+Required. Share the steps for the reviewer to be able to test your PR. e.g. You can test by running `npm install` then `npm build dev`.
+
+If you could not test this, say why.
+-->
+
+## Video/Screenshots
+
+<!--
+Provide a video or screenshots of testing your PR. e.g. you added a new feature to the gui, show us the video of you testing it successfully.
+
+-->
+
+## Type
+
+- [ ] Bug fix
+- [ ] Feature
+- [ ] Refactor
+- [ ] Breaking change
+- [ ] Docs / chore
 
-## Checklist
+## Notes
 
-- [ ] If the PR is changing/adding functionality, are there tests to reflect this?
-- [ ] If there is an example, have you run the example to make sure that it works?
-- [ ] If there are instructions on how to run the code, have you followed the instructions and made sure that it works?
-- [ ] If the feature is significant enough to require documentation, is there a PR open on the OpenHands/docs repository with the same branch name?
-- [ ] Is the github CI passing?
+<!-- Optional: config changes, rollout concerns, follow-ups, or anything reviewers should know. -->

.github/run-eval/ADDINGMODEL.md

Lines changed: 29 additions & 13 deletions
@@ -240,29 +240,44 @@ cd .github/run-eval
 MODEL_IDS="your-model-id" GITHUB_OUTPUT=/tmp/output.txt python resolve_model_config.py
 ```
 
-## Step 6: Run Integration Tests (Required Before PR)
+## Step 6: Create Draft PR
 
-**Mandatory**: Integration tests must pass before creating PR.
+Push your branch and create a draft PR. Note the PR number returned - you'll need it for the integration tests.
 
-### Via GitHub Actions
+## Step 7: Run Integration Tests
 
-1. Push branch: `git push origin your-branch-name`
-2. Navigate to: https://github.com/OpenHands/software-agent-sdk/actions/workflows/integration-runner.yml
-3. Click "Run workflow"
-4. Configure:
-   - **Branch**: Select your branch
-   - **model_ids**: `your-model-id`
-   - **Reason**: "Testing model-id"
-5. Wait for completion
-6. **Save run URL** - required for PR description
+Trigger integration tests on your PR branch:
+
+```bash
+gh workflow run integration-runner.yml \
+  -f model_ids=your-model-id \
+  -f reason="Testing new model from PR #<pr-number>" \
+  -f issue_number=<pr-number> \
+  --ref your-branch-name
+```
+
+Results will be posted back to the PR as a comment.
 
 ### Expected Results
 
 - Success rate: 100% (or 87.5% if vision test skipped)
 - Duration: 5-10 minutes per model
 - Tests: 8 total (basic commands, file ops, code editing, reasoning, errors, tools, context, vision)
 
-## Step 7: Create PR
+## Step 8: Fix Issues and Rerun (if needed)
+
+If tests fail, see [Common Issues](#common-issues) below. After fixing:
+
+1. Push the fix: `git add . && git commit && git push`
+2. Rerun integration tests with the same command from Step 7 (using the same PR number)
+
+## Step 9: Mark PR Ready
+
+When tests pass, mark the PR as ready for review:
+
+```bash
+gh pr ready <pr-number>
+```
 
 ### Required in PR Description
 

@@ -379,3 +394,4 @@ Fixes #[issue-number]
 - Recent model additions: #2102, #2153, #2207, #2233, #2269
 - Common issues: #2147 (hangs), #2137 (parameters), #2110 (vision), #2233 (variants), #2193 (preflight)
 - Integration test workflow: `.github/workflows/integration-runner.yml`
+- Integration tests can be triggered via: `gh workflow run integration-runner.yml --ref <branch>`

.github/run-eval/resolve_model_config.py

Lines changed: 19 additions & 0 deletions
@@ -242,6 +242,16 @@ def _sigterm_handler(signum: int, _frame: object) -> None:
             "disable_vision": True,
         },
     },
+    "glm-5.1": {
+        "id": "glm-5.1",
+        "display_name": "GLM-5.1",
+        "llm_config": {
+            "model": "litellm_proxy/openrouter/z-ai/glm-5.1",
+            "temperature": 0.0,
+            # OpenRouter glm-5.1 is text-only despite LiteLLM reporting vision support
+            "disable_vision": True,
+        },
+    },
     "qwen3-coder-next": {
         "id": "qwen3-coder-next",
         "display_name": "Qwen3 Coder Next",

@@ -282,6 +292,15 @@ def _sigterm_handler(signum: int, _frame: object) -> None:
             "temperature": 0.0,
         },
     },
+    "trinity-large-thinking": {
+        "id": "trinity-large-thinking",
+        "display_name": "Trinity Large Thinking",
+        "llm_config": {
+            "model": "litellm_proxy/trinity-large-thinking",
+            "temperature": 1.0,
+            "top_p": 0.95,
+        },
+    },
 }
 
 
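The entries above follow a simple id-to-config registry pattern. As a rough sketch of how such a registry can be resolved from a `MODEL_IDS` environment variable and emitted as a GitHub Actions output (the `MODEL_CONFIGS` dict and `resolve` helper here are illustrative stand-ins, not the real script's API):

```python
import json
import os

# Hypothetical mini-registry mirroring the shape of the entries added in this
# commit; the real resolve_model_config.py holds many more models.
MODEL_CONFIGS = {
    "glm-5.1": {
        "id": "glm-5.1",
        "display_name": "GLM-5.1",
        "llm_config": {
            "model": "litellm_proxy/openrouter/z-ai/glm-5.1",
            "temperature": 0.0,
            "disable_vision": True,
        },
    },
    "trinity-large-thinking": {
        "id": "trinity-large-thinking",
        "display_name": "Trinity Large Thinking",
        "llm_config": {
            "model": "litellm_proxy/trinity-large-thinking",
            "temperature": 1.0,
            "top_p": 0.95,
        },
    },
}


def resolve(model_ids: str) -> list[dict]:
    """Resolve a comma-separated MODEL_IDS string into config dicts."""
    configs = []
    for model_id in model_ids.split(","):
        model_id = model_id.strip()
        if model_id not in MODEL_CONFIGS:
            raise KeyError(f"Unknown model id: {model_id}")
        configs.append(MODEL_CONFIGS[model_id])
    return configs


if __name__ == "__main__":
    resolved = resolve(os.environ.get("MODEL_IDS", "glm-5.1"))
    # Append to $GITHUB_OUTPUT as key=value, the step-output convention
    # the real script also follows.
    with open(os.environ.get("GITHUB_OUTPUT", "/dev/stdout"), "a") as f:
        f.write(f"models_json={json.dumps(resolved)}\n")
```

Failing fast on an unknown id matches the usage shown in Step 5 of ADDINGMODEL.md, where a typo in `MODEL_IDS` should surface before any workflow is dispatched.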
.github/workflows/pypi-release.yml

Lines changed: 6 additions & 0 deletions
@@ -10,6 +10,12 @@ on:
 
 jobs:
   publish:
+    # Skip PyPI publishing for pre-releases (e.g., release candidates).
+    # Pre-releases can still be created on GitHub for testing without
+    # pushing packages to PyPI. Manual workflow_dispatch always runs.
+    if: >
+      github.event_name == 'workflow_dispatch' ||
+      !github.event.release.prerelease
     runs-on: ubuntu-24.04
     outputs:
       version: ${{ steps.extract_version.outputs.version }}
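The new `if:` gate is plain boolean logic; a Python analogue (a hypothetical helper, for illustration only) makes its truth table easy to check:

```python
def should_publish(event_name: str, prerelease: bool) -> bool:
    # Mirrors the workflow's `if:` expression: publish on manual
    # workflow_dispatch, or on any release that is not a pre-release.
    return event_name == "workflow_dispatch" or not prerelease


# Truth table:
#   workflow_dispatch, prerelease=True  -> publish (manual always runs)
#   release,           prerelease=True  -> skip
#   release,           prerelease=False -> publish
```

Note that for `workflow_dispatch` events `github.event.release` is absent, which is exactly why the dispatch check must come first in the workflow expression.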
Lines changed: 54 additions & 0 deletions
@@ -0,0 +1,54 @@
+---
+# EXPERIMENTAL: Automated QA validation of PR changes using OpenHands.
+#
+# Unlike pr-review (which reads diffs and posts code-review comments),
+# this workflow actually runs the code — setting up the environment,
+# executing tests, exercising changed behavior, and posting a structured
+# QA report as a PR comment.
+#
+# This is an early experiment; expect rough edges. The plugin source is
+# pinned to the extensions feature branch while we iterate.
+name: QA Changes by OpenHands [experimental]
+
+on:
+  pull_request:
+    types: [opened, ready_for_review, labeled, review_requested]
+
+permissions:
+  contents: read
+  pull-requests: write
+  issues: write
+
+jobs:
+  qa-changes:
+    # Only run for same-repo PRs (secrets aren't available for forks).
+    # Trigger conditions mirror pr-review, but use the 'qa-this' label
+    # and openhands-agent reviewer request.
+    if: |
+      github.event.pull_request.head.repo.full_name == github.repository && (
+        (github.event.action == 'opened' && github.event.pull_request.draft == false && github.event.pull_request.author_association != 'FIRST_TIME_CONTRIBUTOR' && github.event.pull_request.author_association != 'NONE') ||
+        (github.event.action == 'ready_for_review' && github.event.pull_request.author_association != 'FIRST_TIME_CONTRIBUTOR' && github.event.pull_request.author_association != 'NONE') ||
+        github.event.label.name == 'qa-this' ||
+        github.event.requested_reviewer.login == 'openhands-agent' ||
+        github.event.requested_reviewer.login == 'all-hands-bot'
+      )
+    concurrency:
+      group: qa-changes-${{ github.event.pull_request.number }}
+      cancel-in-progress: true
+    runs-on: ubuntu-24.04
+    timeout-minutes: 30
+    steps:
+      - name: Run QA Changes
+        # EXPERIMENTAL: pointing at feature branch while iterating
+        uses: OpenHands/extensions/plugins/qa-changes@feat/qa-changes-plugin
+        with:
+          llm-model: litellm_proxy/claude-sonnet-4-5-20250929
+          llm-base-url: https://llm-proxy.app.all-hands.dev
+          max-budget: '10.0'
+          timeout-minutes: '30'
+          max-iterations: '500'
+          # EXPERIMENTAL: use the feature branch of extensions
+          extensions-version: feat/qa-changes-plugin
+          llm-api-key: ${{ secrets.LLM_API_KEY }}
+          github-token: ${{ secrets.PAT_TOKEN }}
+          lmnr-api-key: ${{ secrets.LMNR_SKILLS_API_KEY }}
Lines changed: 87 additions & 0 deletions
@@ -0,0 +1,87 @@
+---
+name: QA Changes Evaluation [experimental]
+
+# This workflow evaluates how well QA validation performed.
+# It runs when a PR is closed to assess QA effectiveness.
+#
+# Security note: pull_request_target is safe here because this workflow
+# never checks out or executes PR code. It only:
+# 1. Downloads artifacts produced by a trusted workflow run
+# 2. Runs evaluation scripts from the extensions repo (main/pinned branch)
+
+on:
+  pull_request_target:
+    types: [closed]
+
+permissions:
+  contents: read
+  pull-requests: read
+
+jobs:
+  evaluate:
+    runs-on: ubuntu-24.04
+    env:
+      PR_NUMBER: ${{ github.event.pull_request.number }}
+      REPO_NAME: ${{ github.repository }}
+      PR_MERGED: ${{ github.event.pull_request.merged }}
+
+    steps:
+      - name: Download QA trace artifact
+        id: download-trace
+        uses: dawidd6/action-download-artifact@v19
+        continue-on-error: true
+        with:
+          workflow: qa-changes-by-openhands.yml
+          name: qa-changes-trace-${{ github.event.pull_request.number }}
+          path: trace-info
+          search_artifacts: true
+          if_no_artifact_found: warn
+
+      - name: Check if trace file exists
+        id: check-trace
+        run: |
+          if [ -f "trace-info/laminar_trace_info.json" ]; then
+            echo "trace_exists=true" >> $GITHUB_OUTPUT
+            echo "Found trace file for PR #$PR_NUMBER"
+          else
+            echo "trace_exists=false" >> $GITHUB_OUTPUT
+            echo "No trace file found for PR #$PR_NUMBER - skipping evaluation"
+          fi
+
+      # EXPERIMENTAL: pinned to feature branch while qa-changes plugin is in development.
+      # Switch to @main (and remove ref:) once the plugin is merged.
+      - name: Checkout extensions repository
+        if: steps.check-trace.outputs.trace_exists == 'true'
+        uses: actions/checkout@v6
+        with:
+          repository: OpenHands/extensions
+          ref: feat/qa-changes-plugin
+          path: extensions
+
+      - name: Set up Python
+        if: steps.check-trace.outputs.trace_exists == 'true'
+        uses: actions/setup-python@v6
+        with:
+          python-version: '3.12'
+
+      - name: Install dependencies
+        if: steps.check-trace.outputs.trace_exists == 'true'
+        run: pip install lmnr
+
+      - name: Run evaluation
+        if: steps.check-trace.outputs.trace_exists == 'true'
+        env:
+          # Script expects LMNR_PROJECT_API_KEY; org secret is named LMNR_SKILLS_API_KEY
+          LMNR_PROJECT_API_KEY: ${{ secrets.LMNR_SKILLS_API_KEY }}
+          GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
+        run: |
+          python extensions/plugins/qa-changes/scripts/evaluate_qa_changes.py \
+            --trace-file trace-info/laminar_trace_info.json
+
+      - name: Upload evaluation logs
+        uses: actions/upload-artifact@v7
+        if: always() && steps.check-trace.outputs.trace_exists == 'true'
+        with:
+          name: qa-changes-evaluation-${{ github.event.pull_request.number }}
+          path: '*.log'
+          retention-days: 30
.github/workflows/run-eval.yml

Lines changed: 10 additions & 2 deletions
@@ -63,6 +63,11 @@ on:
       required: false
       default: main
       type: string
+    extensions_branch:
+      description: Extensions repo branch to use (for testing feature branches with skills/plugins)
+      required: false
+      default: main
+      type: string
     instance_ids:
       description: >-
         Comma-separated instance IDs to evaluate.

@@ -157,6 +162,7 @@ jobs:
           echo "reason: ${{ github.event.inputs.reason || 'N/A' }}"
           echo "eval_branch: ${{ github.event.inputs.eval_branch || 'main' }}"
           echo "benchmarks_branch: ${{ github.event.inputs.benchmarks_branch || 'main' }}"
+          echo "extensions_branch: ${{ github.event.inputs.extensions_branch || 'main' }}"
           echo "instance_ids: ${{ github.event.inputs.instance_ids || 'N/A' }}"
           echo "num_infer_workers: ${{ github.event.inputs.num_infer_workers || '(default)' }}"
           echo "num_eval_workers: ${{ github.event.inputs.num_eval_workers || '(default)' }}"

@@ -341,6 +347,7 @@ jobs:
           EVAL_WORKFLOW: ${{ env.EVAL_WORKFLOW }}
           EVAL_BRANCH: ${{ github.event.inputs.eval_branch || 'main' }}
           BENCHMARKS_BRANCH: ${{ github.event.inputs.benchmarks_branch || 'main' }}
+          EXTENSIONS_BRANCH: ${{ github.event.inputs.extensions_branch || 'main' }}
           BENCHMARK: ${{ github.event.inputs.benchmark || 'swebench' }}
           TRIGGER_REASON: ${{ github.event.inputs.reason }}
           PR_NUMBER: ${{ steps.params.outputs.pr_number }}

@@ -357,7 +364,7 @@ jobs:
           # Normalize instance_ids: strip all spaces
           INSTANCE_IDS=$(printf '%s' "$INSTANCE_IDS" | tr -d ' ')
 
-          echo "Dispatching evaluation workflow with SDK commit: $SDK_SHA (benchmark: $BENCHMARK, eval branch: $EVAL_BRANCH, benchmarks branch: $BENCHMARKS_BRANCH, tool preset: $TOOL_PRESET)"
+          echo "Dispatching evaluation workflow with SDK commit: $SDK_SHA (benchmark: $BENCHMARK, eval branch: $EVAL_BRANCH, benchmarks branch: $BENCHMARKS_BRANCH, extensions branch: $EXTENSIONS_BRANCH, tool preset: $TOOL_PRESET)"
           PAYLOAD=$(jq -n \
             --arg sdk "$SDK_SHA" \
             --arg sdk_run_id "${{ github.run_id }}" \

@@ -367,6 +374,7 @@ jobs:
             --arg reason "$TRIGGER_REASON" \
             --arg pr "$PR_NUMBER" \
             --arg benchmarks "$BENCHMARKS_BRANCH" \
+            --arg extensions "$EXTENSIONS_BRANCH" \
             --arg benchmark "$BENCHMARK" \
             --arg instance_ids "$INSTANCE_IDS" \
             --arg num_infer_workers "$NUM_INFER_WORKERS" \

@@ -377,7 +385,7 @@ jobs:
             --arg agent_type "$AGENT_TYPE" \
             --arg partial_archive_url "$PARTIAL_ARCHIVE_URL" \
             --arg triggered_by "$TRIGGERED_BY" \
-            '{ref: $ref, inputs: {sdk_commit: $sdk, sdk_workflow_run_id: $sdk_run_id, eval_limit: $eval_limit, models_json: ($models | tostring), trigger_reason: $reason, pr_number: $pr, benchmarks_branch: $benchmarks, benchmark: $benchmark, instance_ids: $instance_ids, num_infer_workers: $num_infer_workers, num_eval_workers: $num_eval_workers, enable_conversation_event_logging: $enable_conversation_event_logging, max_retries: $max_retries, tool_preset: $tool_preset, agent_type: $agent_type, partial_archive_url: $partial_archive_url, triggered_by: $triggered_by}}')
+            '{ref: $ref, inputs: {sdk_commit: $sdk, sdk_workflow_run_id: $sdk_run_id, eval_limit: $eval_limit, models_json: ($models | tostring), trigger_reason: $reason, pr_number: $pr, benchmarks_branch: $benchmarks, extensions_branch: $extensions, benchmark: $benchmark, instance_ids: $instance_ids, num_infer_workers: $num_infer_workers, num_eval_workers: $num_eval_workers, enable_conversation_event_logging: $enable_conversation_event_logging, max_retries: $max_retries, tool_preset: $tool_preset, agent_type: $agent_type, partial_archive_url: $partial_archive_url, triggered_by: $triggered_by}}')
           RESPONSE=$(curl -sS -o /tmp/dispatch.out -w "%{http_code}" -X POST \
             -H "Authorization: token $PAT_TOKEN" \
             -H "Accept: application/vnd.github+json" \
examples/01_standalone_sdk/43_mixed_marketplace_skills/main.py

Lines changed: 1 addition & 1 deletion
@@ -37,7 +37,7 @@
 import sys
 from pathlib import Path
 
-from openhands.sdk.plugin import Marketplace
+from openhands.sdk.marketplace import Marketplace
 from openhands.sdk.skills import (
     install_skills_from_marketplace,
     list_installed_skills,

examples/05_skills_and_plugins/01_loading_agentskills/main.py

Lines changed: 1 addition & 1 deletion
@@ -27,7 +27,7 @@
 from pydantic import SecretStr
 
 from openhands.sdk import LLM, Agent, AgentContext, Conversation
-from openhands.sdk.context.skills import (
+from openhands.sdk.skills import (
     discover_skill_resources,
     load_skills_from_dir,
 )

openhands-agent-server/AGENTS.md

Lines changed: 10 additions & 0 deletions
@@ -14,6 +14,16 @@ This package lives in the monorepo root. Typical commands (run from repo root):
 When adding non-Python files (JS, templates, etc.) loaded at runtime, add them to `openhands-agent-server/openhands/agent_server/agent-server.spec` using `collect_data_files`.
 
 
+## Live server integration tests
+
+Small endpoint additions or changes to server behaviour should be covered by a
+test in `tests/cross/test_remote_conversation_live_server.py`. These tests spin
+up a real FastAPI server with a patched LLM and exercise the full HTTP / WebSocket
+stack end-to-end. Add or extend a test there whenever the change is localised
+enough that a single new test function (or a few assertions added to an existing
+test) captures the expected behaviour.
+
+
 ## Concurrency / async safety
 
 - `ConversationState` uses a synchronous `FIFOLock`. In async agent-server code, never do `with conversation._state` directly on the event loop when the conversation may be running.
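The live-server test pattern the new AGENTS.md section describes (start a real server, exercise it over HTTP, assert on responses) can be sketched with the stdlib alone. This toy handler is not the agent-server API; the real tests use FastAPI with a patched LLM, but the spin-up/exercise/shutdown shape is the same:

```python
import http.server
import json
import threading
import urllib.request


class Handler(http.server.BaseHTTPRequestHandler):
    """Toy stand-in for a server endpoint under test."""

    def do_GET(self):
        body = json.dumps({"status": "ok"}).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):  # keep test output quiet
        pass


def test_health_endpoint():
    # Bind to port 0 so the OS picks a free port, then exercise the
    # endpoint over real HTTP and assert on the response.
    server = http.server.ThreadingHTTPServer(("127.0.0.1", 0), Handler)
    thread = threading.Thread(target=server.serve_forever, daemon=True)
    thread.start()
    try:
        port = server.server_address[1]
        with urllib.request.urlopen(f"http://127.0.0.1:{port}/health") as resp:
            assert resp.status == 200
            assert json.load(resp)["status"] == "ok"
    finally:
        server.shutdown()


test_health_endpoint()
```

Binding to an ephemeral port and shutting the server down in `finally` keeps such tests parallel-safe, which matters when a suite spins up one live server per test.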
