Commit 82447d2

Merge branch 'main' into fix/2467-image-downscale
2 parents 487da93 + 7044805 commit 82447d2

114 files changed, +4062 -909 lines changed


.github/PULL_REQUEST_TEMPLATE.md

Lines changed: 45 additions & 7 deletions
@@ -1,11 +1,49 @@
+<!-- Keep this PR as draft until it is ready for review. -->
+
+<!-- AI/LLM agents:
+
+Provide evidence that the code runs properly end-to-end. Just running unit tests is NOT sufficient. Explain exactly the command that you ran, and provide evidence that the code works as expected, either in the form of log outputs or screenshots. In addition, if it is a bug fix, also run the same code before the bug fix and demonstrate that the code did NOT work before the fix to demonstrate that you were able to reproduce the problem.
+-->
+
+- [ ] A human has tested these changes.
+
+---
+
+## Why
+
+<!-- Describe problem, motivation, etc.-->
+
 ## Summary
 
-[fill in a summary of this PR]
+<!-- 1-3 bullets describing what changed. -->
+-
+
+## Issue Number
+<!-- Required if there is a relevant issue to this PR. -->
+
+## How to Test
+
+<!--
+Required. Share the steps for the reviewer to be able to test your PR. e.g. You can test by running `npm install` then `npm build dev`.
+
+If you could not test this, say why.
+-->
+
+## Video/Screenshots
+
+<!--
+Provide a video or screenshots of testing your PR. e.g. you added a new feature to the gui, show us the video of you testing it successfully.
+
+-->
+
+## Type
+
+- [ ] Bug fix
+- [ ] Feature
+- [ ] Refactor
+- [ ] Breaking change
+- [ ] Docs / chore
 
-## Checklist
+## Notes
 
-- [ ] If the PR is changing/adding functionality, are there tests to reflect this?
-- [ ] If there is an example, have you run the example to make sure that it works?
-- [ ] If there are instructions on how to run the code, have you followed the instructions and made sure that it works?
-- [ ] If the feature is significant enough to require documentation, is there a PR open on the OpenHands/docs repository with the same branch name?
-- [ ] Is the github CI passing?
+<!-- Optional: config changes, rollout concerns, follow-ups, or anything reviewers should know. -->

.github/run-eval/ADDINGMODEL.md

Lines changed: 29 additions & 13 deletions
@@ -240,29 +240,44 @@ cd .github/run-eval
 MODEL_IDS="your-model-id" GITHUB_OUTPUT=/tmp/output.txt python resolve_model_config.py
 ```
 
-## Step 6: Run Integration Tests (Required Before PR)
+## Step 6: Create Draft PR
 
-**Mandatory**: Integration tests must pass before creating PR.
+Push your branch and create a draft PR. Note the PR number returned - you'll need it for the integration tests.
 
-### Via GitHub Actions
+## Step 7: Run Integration Tests
 
-1. Push branch: `git push origin your-branch-name`
-2. Navigate to: https://github.com/OpenHands/software-agent-sdk/actions/workflows/integration-runner.yml
-3. Click "Run workflow"
-4. Configure:
-   - **Branch**: Select your branch
-   - **model_ids**: `your-model-id`
-   - **Reason**: "Testing model-id"
-5. Wait for completion
-6. **Save run URL** - required for PR description
+Trigger integration tests on your PR branch:
+
+```bash
+gh workflow run integration-runner.yml \
+  -f model_ids=your-model-id \
+  -f reason="Testing new model from PR #<pr-number>" \
+  -f issue_number=<pr-number> \
+  --ref your-branch-name
+```
+
+Results will be posted back to the PR as a comment.
 
 ### Expected Results
 
 - Success rate: 100% (or 87.5% if vision test skipped)
 - Duration: 5-10 minutes per model
 - Tests: 8 total (basic commands, file ops, code editing, reasoning, errors, tools, context, vision)
 
-## Step 7: Create PR
+## Step 8: Fix Issues and Rerun (if needed)
+
+If tests fail, see [Common Issues](#common-issues) below. After fixing:
+
+1. Push the fix: `git add . && git commit && git push`
+2. Rerun integration tests with the same command from Step 7 (using the same PR number)
+
+## Step 9: Mark PR Ready
+
+When tests pass, mark the PR as ready for review:
+
+```bash
+gh pr ready <pr-number>
+```
 
 ### Required in PR Description
 

@@ -379,3 +394,4 @@ Fixes #[issue-number]
 - Recent model additions: #2102, #2153, #2207, #2233, #2269
 - Common issues: #2147 (hangs), #2137 (parameters), #2110 (vision), #2233 (variants), #2193 (preflight)
 - Integration test workflow: `.github/workflows/integration-runner.yml`
+- Integration tests can be triggered via: `gh workflow run integration-runner.yml --ref <branch>`

.github/run-eval/resolve_model_config.py

Lines changed: 19 additions & 0 deletions
@@ -242,6 +242,16 @@ def _sigterm_handler(signum: int, _frame: object) -> None:
             "disable_vision": True,
         },
     },
+    "glm-5.1": {
+        "id": "glm-5.1",
+        "display_name": "GLM-5.1",
+        "llm_config": {
+            "model": "litellm_proxy/openrouter/z-ai/glm-5.1",
+            "temperature": 0.0,
+            # OpenRouter glm-5.1 is text-only despite LiteLLM reporting vision support
+            "disable_vision": True,
+        },
+    },
     "qwen3-coder-next": {
         "id": "qwen3-coder-next",
         "display_name": "Qwen3 Coder Next",

@@ -282,6 +292,15 @@ def _sigterm_handler(signum: int, _frame: object) -> None:
             "temperature": 0.0,
         },
     },
+    "trinity-large-thinking": {
+        "id": "trinity-large-thinking",
+        "display_name": "Trinity Large Thinking",
+        "llm_config": {
+            "model": "litellm_proxy/trinity-large-thinking",
+            "temperature": 1.0,
+            "top_p": 0.95,
+        },
+    },
 }
 
 
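The entries above follow a simple id-to-config registry pattern. As a rough sketch of how such a registry can be resolved from a `MODEL_IDS` environment variable and emitted as a GitHub Actions output (the `MODEL_CONFIGS` dict and `resolve` helper here are illustrative stand-ins, not the real script's API):

```python
import json
import os

# Hypothetical mini-registry mirroring the shape of the entries added in this
# commit; the real resolve_model_config.py holds many more models.
MODEL_CONFIGS = {
    "glm-5.1": {
        "id": "glm-5.1",
        "display_name": "GLM-5.1",
        "llm_config": {
            "model": "litellm_proxy/openrouter/z-ai/glm-5.1",
            "temperature": 0.0,
            "disable_vision": True,
        },
    },
    "trinity-large-thinking": {
        "id": "trinity-large-thinking",
        "display_name": "Trinity Large Thinking",
        "llm_config": {
            "model": "litellm_proxy/trinity-large-thinking",
            "temperature": 1.0,
            "top_p": 0.95,
        },
    },
}


def resolve(model_ids: str) -> list[dict]:
    """Resolve a comma-separated MODEL_IDS string into config dicts."""
    configs = []
    for model_id in model_ids.split(","):
        model_id = model_id.strip()
        if model_id not in MODEL_CONFIGS:
            raise KeyError(f"Unknown model id: {model_id}")
        configs.append(MODEL_CONFIGS[model_id])
    return configs


if __name__ == "__main__":
    resolved = resolve(os.environ.get("MODEL_IDS", "glm-5.1"))
    # Append to $GITHUB_OUTPUT as key=value, the step-output convention
    # the real script also follows.
    with open(os.environ.get("GITHUB_OUTPUT", "/dev/stdout"), "a") as f:
        f.write(f"models_json={json.dumps(resolved)}\n")
```

Failing fast on an unknown id matches the usage shown in Step 5 of ADDINGMODEL.md, where a typo in `MODEL_IDS` should surface before any workflow is dispatched.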
.github/workflows/pypi-release.yml

Lines changed: 6 additions & 0 deletions
@@ -10,6 +10,12 @@ on:
 
 jobs:
   publish:
+    # Skip PyPI publishing for pre-releases (e.g., release candidates).
+    # Pre-releases can still be created on GitHub for testing without
+    # pushing packages to PyPI. Manual workflow_dispatch always runs.
+    if: >
+      github.event_name == 'workflow_dispatch' ||
+      !github.event.release.prerelease
     runs-on: ubuntu-24.04
     outputs:
       version: ${{ steps.extract_version.outputs.version }}
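The new `if:` gate is plain boolean logic; a Python analogue (a hypothetical helper, for illustration only) makes its truth table easy to check:

```python
def should_publish(event_name: str, prerelease: bool) -> bool:
    # Mirrors the workflow's `if:` expression: publish on manual
    # workflow_dispatch, or on any release that is not a pre-release.
    return event_name == "workflow_dispatch" or not prerelease


# Truth table:
#   workflow_dispatch, prerelease=True  -> publish (manual always runs)
#   release,           prerelease=True  -> skip
#   release,           prerelease=False -> publish
```

Note that for `workflow_dispatch` events `github.event.release` is absent, which is exactly why the dispatch check must come first in the workflow expression.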
Lines changed: 54 additions & 0 deletions
@@ -0,0 +1,54 @@
+---
+# EXPERIMENTAL: Automated QA validation of PR changes using OpenHands.
+#
+# Unlike pr-review (which reads diffs and posts code-review comments),
+# this workflow actually runs the code — setting up the environment,
+# executing tests, exercising changed behavior, and posting a structured
+# QA report as a PR comment.
+#
+# This is an early experiment; expect rough edges. The plugin source is
+# pinned to the extensions feature branch while we iterate.
+name: QA Changes by OpenHands [experimental]
+
+on:
+  pull_request:
+    types: [opened, ready_for_review, labeled, review_requested]
+
+permissions:
+  contents: read
+  pull-requests: write
+  issues: write
+
+jobs:
+  qa-changes:
+    # Only run for same-repo PRs (secrets aren't available for forks).
+    # Trigger conditions mirror pr-review, but use the 'qa-this' label
+    # and openhands-agent reviewer request.
+    if: |
+      github.event.pull_request.head.repo.full_name == github.repository && (
+        (github.event.action == 'opened' && github.event.pull_request.draft == false && github.event.pull_request.author_association != 'FIRST_TIME_CONTRIBUTOR' && github.event.pull_request.author_association != 'NONE') ||
+        (github.event.action == 'ready_for_review' && github.event.pull_request.author_association != 'FIRST_TIME_CONTRIBUTOR' && github.event.pull_request.author_association != 'NONE') ||
+        github.event.label.name == 'qa-this' ||
+        github.event.requested_reviewer.login == 'openhands-agent' ||
+        github.event.requested_reviewer.login == 'all-hands-bot'
+      )
+    concurrency:
+      group: qa-changes-${{ github.event.pull_request.number }}
+      cancel-in-progress: true
+    runs-on: ubuntu-24.04
+    timeout-minutes: 30
+    steps:
+      - name: Run QA Changes
+        # EXPERIMENTAL: pointing at feature branch while iterating
+        uses: OpenHands/extensions/plugins/qa-changes@feat/qa-changes-plugin
+        with:
+          llm-model: litellm_proxy/claude-sonnet-4-5-20250929
+          llm-base-url: https://llm-proxy.app.all-hands.dev
+          max-budget: '10.0'
+          timeout-minutes: '30'
+          max-iterations: '500'
+          # EXPERIMENTAL: use the feature branch of extensions
+          extensions-version: feat/qa-changes-plugin
+          llm-api-key: ${{ secrets.LLM_API_KEY }}
+          github-token: ${{ secrets.PAT_TOKEN }}
+          lmnr-api-key: ${{ secrets.LMNR_SKILLS_API_KEY }}
Lines changed: 87 additions & 0 deletions
@@ -0,0 +1,87 @@
+---
+name: QA Changes Evaluation [experimental]
+
+# This workflow evaluates how well QA validation performed.
+# It runs when a PR is closed to assess QA effectiveness.
+#
+# Security note: pull_request_target is safe here because this workflow
+# never checks out or executes PR code. It only:
+# 1. Downloads artifacts produced by a trusted workflow run
+# 2. Runs evaluation scripts from the extensions repo (main/pinned branch)
+
+on:
+  pull_request_target:
+    types: [closed]
+
+permissions:
+  contents: read
+  pull-requests: read
+
+jobs:
+  evaluate:
+    runs-on: ubuntu-24.04
+    env:
+      PR_NUMBER: ${{ github.event.pull_request.number }}
+      REPO_NAME: ${{ github.repository }}
+      PR_MERGED: ${{ github.event.pull_request.merged }}
+
+    steps:
+      - name: Download QA trace artifact
+        id: download-trace
+        uses: dawidd6/action-download-artifact@v19
+        continue-on-error: true
+        with:
+          workflow: qa-changes-by-openhands.yml
+          name: qa-changes-trace-${{ github.event.pull_request.number }}
+          path: trace-info
+          search_artifacts: true
+          if_no_artifact_found: warn
+
+      - name: Check if trace file exists
+        id: check-trace
+        run: |
+          if [ -f "trace-info/laminar_trace_info.json" ]; then
+            echo "trace_exists=true" >> $GITHUB_OUTPUT
+            echo "Found trace file for PR #$PR_NUMBER"
+          else
+            echo "trace_exists=false" >> $GITHUB_OUTPUT
+            echo "No trace file found for PR #$PR_NUMBER - skipping evaluation"
+          fi
+
+      # EXPERIMENTAL: pinned to feature branch while qa-changes plugin is in development.
+      # Switch to @main (and remove ref:) once the plugin is merged.
+      - name: Checkout extensions repository
+        if: steps.check-trace.outputs.trace_exists == 'true'
+        uses: actions/checkout@v6
+        with:
+          repository: OpenHands/extensions
+          ref: feat/qa-changes-plugin
+          path: extensions
+
+      - name: Set up Python
+        if: steps.check-trace.outputs.trace_exists == 'true'
+        uses: actions/setup-python@v6
+        with:
+          python-version: '3.12'
+
+      - name: Install dependencies
+        if: steps.check-trace.outputs.trace_exists == 'true'
+        run: pip install lmnr
+
+      - name: Run evaluation
+        if: steps.check-trace.outputs.trace_exists == 'true'
+        env:
+          # Script expects LMNR_PROJECT_API_KEY; org secret is named LMNR_SKILLS_API_KEY
+          LMNR_PROJECT_API_KEY: ${{ secrets.LMNR_SKILLS_API_KEY }}
+          GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
+        run: |
+          python extensions/plugins/qa-changes/scripts/evaluate_qa_changes.py \
+            --trace-file trace-info/laminar_trace_info.json
+
+      - name: Upload evaluation logs
+        uses: actions/upload-artifact@v7
+        if: always() && steps.check-trace.outputs.trace_exists == 'true'
+        with:
+          name: qa-changes-evaluation-${{ github.event.pull_request.number }}
+          path: '*.log'
+          retention-days: 30
.github/workflows/run-eval.yml

Lines changed: 10 additions & 2 deletions
@@ -63,6 +63,11 @@ on:
       required: false
       default: main
       type: string
+    extensions_branch:
+      description: Extensions repo branch to use (for testing feature branches with skills/plugins)
+      required: false
+      default: main
+      type: string
     instance_ids:
       description: >-
         Comma-separated instance IDs to evaluate.

@@ -157,6 +162,7 @@ jobs:
           echo "reason: ${{ github.event.inputs.reason || 'N/A' }}"
           echo "eval_branch: ${{ github.event.inputs.eval_branch || 'main' }}"
           echo "benchmarks_branch: ${{ github.event.inputs.benchmarks_branch || 'main' }}"
+          echo "extensions_branch: ${{ github.event.inputs.extensions_branch || 'main' }}"
           echo "instance_ids: ${{ github.event.inputs.instance_ids || 'N/A' }}"
           echo "num_infer_workers: ${{ github.event.inputs.num_infer_workers || '(default)' }}"
           echo "num_eval_workers: ${{ github.event.inputs.num_eval_workers || '(default)' }}"

@@ -341,6 +347,7 @@ jobs:
           EVAL_WORKFLOW: ${{ env.EVAL_WORKFLOW }}
           EVAL_BRANCH: ${{ github.event.inputs.eval_branch || 'main' }}
           BENCHMARKS_BRANCH: ${{ github.event.inputs.benchmarks_branch || 'main' }}
+          EXTENSIONS_BRANCH: ${{ github.event.inputs.extensions_branch || 'main' }}
           BENCHMARK: ${{ github.event.inputs.benchmark || 'swebench' }}
           TRIGGER_REASON: ${{ github.event.inputs.reason }}
           PR_NUMBER: ${{ steps.params.outputs.pr_number }}

@@ -357,7 +364,7 @@ jobs:
           # Normalize instance_ids: strip all spaces
           INSTANCE_IDS=$(printf '%s' "$INSTANCE_IDS" | tr -d ' ')
 
-          echo "Dispatching evaluation workflow with SDK commit: $SDK_SHA (benchmark: $BENCHMARK, eval branch: $EVAL_BRANCH, benchmarks branch: $BENCHMARKS_BRANCH, tool preset: $TOOL_PRESET)"
+          echo "Dispatching evaluation workflow with SDK commit: $SDK_SHA (benchmark: $BENCHMARK, eval branch: $EVAL_BRANCH, benchmarks branch: $BENCHMARKS_BRANCH, extensions branch: $EXTENSIONS_BRANCH, tool preset: $TOOL_PRESET)"
           PAYLOAD=$(jq -n \
             --arg sdk "$SDK_SHA" \
             --arg sdk_run_id "${{ github.run_id }}" \

@@ -367,6 +374,7 @@ jobs:
             --arg reason "$TRIGGER_REASON" \
             --arg pr "$PR_NUMBER" \
             --arg benchmarks "$BENCHMARKS_BRANCH" \
+            --arg extensions "$EXTENSIONS_BRANCH" \
             --arg benchmark "$BENCHMARK" \
             --arg instance_ids "$INSTANCE_IDS" \
             --arg num_infer_workers "$NUM_INFER_WORKERS" \

@@ -377,7 +385,7 @@ jobs:
             --arg agent_type "$AGENT_TYPE" \
             --arg partial_archive_url "$PARTIAL_ARCHIVE_URL" \
             --arg triggered_by "$TRIGGERED_BY" \
-            '{ref: $ref, inputs: {sdk_commit: $sdk, sdk_workflow_run_id: $sdk_run_id, eval_limit: $eval_limit, models_json: ($models | tostring), trigger_reason: $reason, pr_number: $pr, benchmarks_branch: $benchmarks, benchmark: $benchmark, instance_ids: $instance_ids, num_infer_workers: $num_infer_workers, num_eval_workers: $num_eval_workers, enable_conversation_event_logging: $enable_conversation_event_logging, max_retries: $max_retries, tool_preset: $tool_preset, agent_type: $agent_type, partial_archive_url: $partial_archive_url, triggered_by: $triggered_by}}')
+            '{ref: $ref, inputs: {sdk_commit: $sdk, sdk_workflow_run_id: $sdk_run_id, eval_limit: $eval_limit, models_json: ($models | tostring), trigger_reason: $reason, pr_number: $pr, benchmarks_branch: $benchmarks, extensions_branch: $extensions, benchmark: $benchmark, instance_ids: $instance_ids, num_infer_workers: $num_infer_workers, num_eval_workers: $num_eval_workers, enable_conversation_event_logging: $enable_conversation_event_logging, max_retries: $max_retries, tool_preset: $tool_preset, agent_type: $agent_type, partial_archive_url: $partial_archive_url, triggered_by: $triggered_by}}')
           RESPONSE=$(curl -sS -o /tmp/dispatch.out -w "%{http_code}" -X POST \
             -H "Authorization: token $PAT_TOKEN" \
             -H "Accept: application/vnd.github+json" \
examples/01_standalone_sdk/43_mixed_marketplace_skills/main.py

Lines changed: 1 addition & 1 deletion
@@ -37,7 +37,7 @@
 import sys
 from pathlib import Path
 
-from openhands.sdk.plugin import Marketplace
+from openhands.sdk.marketplace import Marketplace
 from openhands.sdk.skills import (
     install_skills_from_marketplace,
     list_installed_skills,

examples/05_skills_and_plugins/01_loading_agentskills/main.py

Lines changed: 1 addition & 1 deletion
@@ -27,7 +27,7 @@
 from pydantic import SecretStr
 
 from openhands.sdk import LLM, Agent, AgentContext, Conversation
-from openhands.sdk.context.skills import (
+from openhands.sdk.skills import (
     discover_skill_resources,
     load_skills_from_dir,
 )

openhands-agent-server/AGENTS.md

Lines changed: 10 additions & 0 deletions
@@ -14,6 +14,16 @@ This package lives in the monorepo root. Typical commands (run from repo root):
 When adding non-Python files (JS, templates, etc.) loaded at runtime, add them to `openhands-agent-server/openhands/agent_server/agent-server.spec` using `collect_data_files`.
 
 
+## Live server integration tests
+
+Small endpoint additions or changes to server behaviour should be covered by a
+test in `tests/cross/test_remote_conversation_live_server.py`. These tests spin
+up a real FastAPI server with a patched LLM and exercise the full HTTP / WebSocket
+stack end-to-end. Add or extend a test there whenever the change is localised
+enough that a single new test function (or a few assertions added to an existing
+test) captures the expected behaviour.
+
+
 ## Concurrency / async safety
 
 - `ConversationState` uses a synchronous `FIFOLock`. In async agent-server code, never do `with conversation._state` directly on the event loop when the conversation may be running.
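The live-server test pattern the new AGENTS.md section describes (start a real server, exercise it over HTTP, assert on responses) can be sketched with the stdlib alone. This toy handler is not the agent-server API; the real tests use FastAPI with a patched LLM, but the spin-up/exercise/shutdown shape is the same:

```python
import http.server
import json
import threading
import urllib.request


class Handler(http.server.BaseHTTPRequestHandler):
    """Toy stand-in for a server endpoint under test."""

    def do_GET(self):
        body = json.dumps({"status": "ok"}).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):  # keep test output quiet
        pass


def test_health_endpoint():
    # Bind to port 0 so the OS picks a free port, then exercise the
    # endpoint over real HTTP and assert on the response.
    server = http.server.ThreadingHTTPServer(("127.0.0.1", 0), Handler)
    thread = threading.Thread(target=server.serve_forever, daemon=True)
    thread.start()
    try:
        port = server.server_address[1]
        with urllib.request.urlopen(f"http://127.0.0.1:{port}/health") as resp:
            assert resp.status == 200
            assert json.load(resp)["status"] == "ok"
    finally:
        server.shutdown()


test_health_endpoint()
```

Binding to an ephemeral port and shutting the server down in `finally` keeps such tests parallel-safe, which matters when a suite spins up one live server per test.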
