
Commit 909d6b9

Merge branch 'main' into fix/tool-call-compat-shim
2 parents 89171a2 + e220bda commit 909d6b9

File tree

8 files changed: +202, -17 lines

.github/run-eval/ADDINGMODEL.md

Lines changed: 29 additions & 13 deletions
````diff
@@ -240,29 +240,44 @@ cd .github/run-eval
 MODEL_IDS="your-model-id" GITHUB_OUTPUT=/tmp/output.txt python resolve_model_config.py
 ```
 
-## Step 6: Run Integration Tests (Required Before PR)
+## Step 6: Create Draft PR
 
-**Mandatory**: Integration tests must pass before creating PR.
+Push your branch and create a draft PR. Note the PR number returned - you'll need it for the integration tests.
 
-### Via GitHub Actions
+## Step 7: Run Integration Tests
 
-1. Push branch: `git push origin your-branch-name`
-2. Navigate to: https://github.com/OpenHands/software-agent-sdk/actions/workflows/integration-runner.yml
-3. Click "Run workflow"
-4. Configure:
-   - **Branch**: Select your branch
-   - **model_ids**: `your-model-id`
-   - **Reason**: "Testing model-id"
-5. Wait for completion
-6. **Save run URL** - required for PR description
+Trigger integration tests on your PR branch:
+
+```bash
+gh workflow run integration-runner.yml \
+  -f model_ids=your-model-id \
+  -f reason="Testing new model from PR #<pr-number>" \
+  -f issue_number=<pr-number> \
+  --ref your-branch-name
+```
+
+Results will be posted back to the PR as a comment.
 
 ### Expected Results
 
 - Success rate: 100% (or 87.5% if vision test skipped)
 - Duration: 5-10 minutes per model
 - Tests: 8 total (basic commands, file ops, code editing, reasoning, errors, tools, context, vision)
 
-## Step 7: Create PR
+## Step 8: Fix Issues and Rerun (if needed)
+
+If tests fail, see [Common Issues](#common-issues) below. After fixing:
+
+1. Push the fix: `git add . && git commit && git push`
+2. Rerun integration tests with the same command from Step 7 (using the same PR number)
+
+## Step 9: Mark PR Ready
+
+When tests pass, mark the PR as ready for review:
+
+```bash
+gh pr ready <pr-number>
+```
 
 ### Required in PR Description
 
````
```diff
@@ -379,3 +394,4 @@ Fixes #[issue-number]
 - Recent model additions: #2102, #2153, #2207, #2233, #2269
 - Common issues: #2147 (hangs), #2137 (parameters), #2110 (vision), #2233 (variants), #2193 (preflight)
 - Integration test workflow: `.github/workflows/integration-runner.yml`
+- Integration tests can be triggered via: `gh workflow run integration-runner.yml --ref <branch>`
```

.github/run-eval/resolve_model_config.py

Lines changed: 10 additions & 0 deletions
```diff
@@ -242,6 +242,16 @@ def _sigterm_handler(signum: int, _frame: object) -> None:
             "disable_vision": True,
         },
     },
+    "glm-5.1": {
+        "id": "glm-5.1",
+        "display_name": "GLM-5.1",
+        "llm_config": {
+            "model": "litellm_proxy/openrouter/z-ai/glm-5.1",
+            "temperature": 0.0,
+            # OpenRouter glm-5.1 is text-only despite LiteLLM reporting vision support
+            "disable_vision": True,
+        },
+    },
     "qwen3-coder-next": {
         "id": "qwen3-coder-next",
         "display_name": "Qwen3 Coder Next",
```
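For context, `resolve_model_config.py` keeps model configs in a plain dict keyed by model id, as the hunk above shows. A minimal sketch of that registry pattern follows; `get_model` is an illustrative helper, not the real script's API (the actual script reads `MODEL_IDS` and writes to `GITHUB_OUTPUT`):

```python
# Illustrative registry mirroring the glm-5.1 entry added above.
# get_model is a hypothetical lookup helper, not part of the real module.
MODELS = {
    "glm-5.1": {
        "id": "glm-5.1",
        "display_name": "GLM-5.1",
        "llm_config": {
            "model": "litellm_proxy/openrouter/z-ai/glm-5.1",
            "temperature": 0.0,
            # OpenRouter glm-5.1 is text-only despite LiteLLM reporting vision support
            "disable_vision": True,
        },
    },
}


def get_model(model_id: str) -> dict:
    """Look up a model config, failing loudly on an unknown id."""
    if model_id not in MODELS:
        raise KeyError(f"unknown model id: {model_id!r}")
    return MODELS[model_id]


assert get_model("glm-5.1")["llm_config"]["disable_vision"] is True
```

Keeping each entry fully declarative is what makes the `test_glm_5_1_config` test further down a straight field-by-field comparison.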

.github/workflows/run-eval.yml

Lines changed: 10 additions & 2 deletions
```diff
@@ -63,6 +63,11 @@ on:
         required: false
         default: main
         type: string
+      extensions_branch:
+        description: Extensions repo branch to use (for testing feature branches with skills/plugins)
+        required: false
+        default: main
+        type: string
       instance_ids:
         description: >-
           Comma-separated instance IDs to evaluate.
@@ -157,6 +162,7 @@ jobs:
           echo "reason: ${{ github.event.inputs.reason || 'N/A' }}"
           echo "eval_branch: ${{ github.event.inputs.eval_branch || 'main' }}"
           echo "benchmarks_branch: ${{ github.event.inputs.benchmarks_branch || 'main' }}"
+          echo "extensions_branch: ${{ github.event.inputs.extensions_branch || 'main' }}"
           echo "instance_ids: ${{ github.event.inputs.instance_ids || 'N/A' }}"
           echo "num_infer_workers: ${{ github.event.inputs.num_infer_workers || '(default)' }}"
           echo "num_eval_workers: ${{ github.event.inputs.num_eval_workers || '(default)' }}"
@@ -341,6 +347,7 @@ jobs:
          EVAL_WORKFLOW: ${{ env.EVAL_WORKFLOW }}
          EVAL_BRANCH: ${{ github.event.inputs.eval_branch || 'main' }}
          BENCHMARKS_BRANCH: ${{ github.event.inputs.benchmarks_branch || 'main' }}
+         EXTENSIONS_BRANCH: ${{ github.event.inputs.extensions_branch || 'main' }}
          BENCHMARK: ${{ github.event.inputs.benchmark || 'swebench' }}
          TRIGGER_REASON: ${{ github.event.inputs.reason }}
          PR_NUMBER: ${{ steps.params.outputs.pr_number }}
@@ -357,7 +364,7 @@ jobs:
           # Normalize instance_ids: strip all spaces
           INSTANCE_IDS=$(printf '%s' "$INSTANCE_IDS" | tr -d ' ')
 
-          echo "Dispatching evaluation workflow with SDK commit: $SDK_SHA (benchmark: $BENCHMARK, eval branch: $EVAL_BRANCH, benchmarks branch: $BENCHMARKS_BRANCH, tool preset: $TOOL_PRESET)"
+          echo "Dispatching evaluation workflow with SDK commit: $SDK_SHA (benchmark: $BENCHMARK, eval branch: $EVAL_BRANCH, benchmarks branch: $BENCHMARKS_BRANCH, extensions branch: $EXTENSIONS_BRANCH, tool preset: $TOOL_PRESET)"
           PAYLOAD=$(jq -n \
             --arg sdk "$SDK_SHA" \
             --arg sdk_run_id "${{ github.run_id }}" \
@@ -367,6 +374,7 @@ jobs:
             --arg reason "$TRIGGER_REASON" \
             --arg pr "$PR_NUMBER" \
             --arg benchmarks "$BENCHMARKS_BRANCH" \
+            --arg extensions "$EXTENSIONS_BRANCH" \
             --arg benchmark "$BENCHMARK" \
             --arg instance_ids "$INSTANCE_IDS" \
             --arg num_infer_workers "$NUM_INFER_WORKERS" \
@@ -377,7 +385,7 @@ jobs:
             --arg agent_type "$AGENT_TYPE" \
             --arg partial_archive_url "$PARTIAL_ARCHIVE_URL" \
             --arg triggered_by "$TRIGGERED_BY" \
-            '{ref: $ref, inputs: {sdk_commit: $sdk, sdk_workflow_run_id: $sdk_run_id, eval_limit: $eval_limit, models_json: ($models | tostring), trigger_reason: $reason, pr_number: $pr, benchmarks_branch: $benchmarks, benchmark: $benchmark, instance_ids: $instance_ids, num_infer_workers: $num_infer_workers, num_eval_workers: $num_eval_workers, enable_conversation_event_logging: $enable_conversation_event_logging, max_retries: $max_retries, tool_preset: $tool_preset, agent_type: $agent_type, partial_archive_url: $partial_archive_url, triggered_by: $triggered_by}}')
+            '{ref: $ref, inputs: {sdk_commit: $sdk, sdk_workflow_run_id: $sdk_run_id, eval_limit: $eval_limit, models_json: ($models | tostring), trigger_reason: $reason, pr_number: $pr, benchmarks_branch: $benchmarks, extensions_branch: $extensions, benchmark: $benchmark, instance_ids: $instance_ids, num_infer_workers: $num_infer_workers, num_eval_workers: $num_eval_workers, enable_conversation_event_logging: $enable_conversation_event_logging, max_retries: $max_retries, tool_preset: $tool_preset, agent_type: $agent_type, partial_archive_url: $partial_archive_url, triggered_by: $triggered_by}}')
           RESPONSE=$(curl -sS -o /tmp/dispatch.out -w "%{http_code}" -X POST \
             -H "Authorization: token $PAT_TOKEN" \
             -H "Accept: application/vnd.github+json" \
```

openhands-agent-server/openhands/agent_server/event_service.py

Lines changed: 2 additions & 1 deletion
```diff
@@ -656,7 +656,8 @@ async def close(self):
         await self._pub_sub.close()
         if self._conversation:
             loop = asyncio.get_running_loop()
-            loop.run_in_executor(None, self._conversation.close)
+            await loop.run_in_executor(None, self._conversation.close)
+            self._conversation = None
 
     async def generate_title(
         self, llm: "LLM | None" = None, max_length: int = 50
```
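The bug fixed above is that `run_in_executor` returns a Future: without `await`, `close()` schedules the blocking `Conversation.close()` on a worker thread and returns immediately, so teardown may still be running. A minimal standalone sketch of the two behaviors (names here are illustrative, not the real `EventService` API):

```python
# Sketch of fire-and-forget vs awaited run_in_executor; slow_teardown
# stands in for the blocking Conversation.close() call.
import asyncio
import time


def slow_teardown(done: list) -> None:
    time.sleep(0.05)  # simulate non-trivial blocking cleanup
    done.append(True)


async def close_fire_and_forget() -> bool:
    # Old behavior: the returned Future is discarded, so the coroutine
    # returns before the executor thread finishes.
    done: list = []
    loop = asyncio.get_running_loop()
    loop.run_in_executor(None, slow_teardown, done)
    return bool(done)  # teardown almost certainly not finished yet


async def close_awaited() -> bool:
    # Fixed behavior: close() only returns once teardown completes.
    done: list = []
    loop = asyncio.get_running_loop()
    await loop.run_in_executor(None, slow_teardown, done)
    return bool(done)
```

Clearing `self._conversation = None` afterwards is what makes the second `close()` call a no-op, which the new `test_close_is_idempotent` test below relies on.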

openhands-sdk/openhands/sdk/context/skills/skill.py

Lines changed: 4 additions & 1 deletion
```diff
@@ -1,5 +1,6 @@
 import io
 import json
+import os
 import re
 from pathlib import Path
 from typing import Annotated, ClassVar, Literal, Union
@@ -891,7 +892,9 @@ def load_project_skills(work_dir: str | Path) -> list[Skill]:
 
 # Public skills repository configuration
 PUBLIC_SKILLS_REPO = "https://github.com/OpenHands/extensions"
-PUBLIC_SKILLS_BRANCH = "main"
+# Allow overriding the branch via EXTENSIONS_REF environment variable
+# (used by evaluation/benchmarks workflows to test feature branches)
+PUBLIC_SKILLS_BRANCH = os.environ.get("EXTENSIONS_REF", "main")
 DEFAULT_MARKETPLACE_PATH = "marketplaces/default.json"
 
```
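The override has one subtlety that the new test file at the bottom of this commit pins down: `os.environ.get` only falls back to the default when the variable is absent, not when it is set to an empty string. A minimal sketch (the helper names are illustrative; dicts stand in for `os.environ`):

```python
# Behavior of the patched line vs a hypothetical empty-tolerant variant.
def resolve_branch(environ: dict) -> str:
    # Mirrors: PUBLIC_SKILLS_BRANCH = os.environ.get("EXTENSIONS_REF", "main")
    return environ.get("EXTENSIONS_REF", "main")


def resolve_branch_nonempty(environ: dict) -> str:
    # Hypothetical alternative (NOT what the patch does): treat "" as unset.
    return environ.get("EXTENSIONS_REF") or "main"


assert resolve_branch({}) == "main"
assert resolve_branch({"EXTENSIONS_REF": "feature-branch"}) == "feature-branch"
assert resolve_branch({"EXTENSIONS_REF": ""}) == ""  # default does NOT apply
assert resolve_branch_nonempty({"EXTENSIONS_REF": ""}) == "main"
```

So workflows that export `EXTENSIONS_REF=""` get an empty branch name rather than `main`; callers should leave the variable unset to get the default.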

tests/agent_server/test_event_service.py

Lines changed: 46 additions & 0 deletions
```diff
@@ -1725,3 +1725,49 @@ def hold_lock_like_run_loop():
         f"for {hold_seconds}s. The read path is blocked by the write lock "
         f"(see HANG_REPRO.md)."
     )
+
+
+class TestEventServiceClose:
+    """Tests for EventService.close() awaiting conversation teardown."""
+
+    @pytest.mark.asyncio
+    async def test_close_awaits_conversation_close(self, event_service):
+        """close() must await conversation.close(), not fire-and-forget."""
+        conversation = MagicMock(spec=Conversation)
+        event_service._conversation = conversation
+
+        closed = asyncio.Event()
+
+        def slow_close():
+            # Simulate non-trivial teardown work
+            time.sleep(0.05)
+            closed.set()
+
+        conversation.close = slow_close
+
+        await event_service.close()
+
+        assert closed.is_set(), (
+            "EventService.close() returned before conversation.close() finished"
+        )
+
+    @pytest.mark.asyncio
+    async def test_close_clears_conversation_reference(self, event_service):
+        """close() must set _conversation to None after closing."""
+        conversation = MagicMock()
+        event_service._conversation = conversation
+
+        await event_service.close()
+
+        assert event_service._conversation is None
+
+    @pytest.mark.asyncio
+    async def test_close_is_idempotent(self, event_service):
+        """Calling close() twice must not raise."""
+        conversation = MagicMock()
+        event_service._conversation = conversation
+
+        await event_service.close()
+        await event_service.close()  # second call — _conversation is already None
+
+        conversation.close.assert_called_once()
```

tests/github_workflows/test_resolve_model_config.py

Lines changed: 10 additions & 0 deletions
```diff
@@ -268,6 +268,16 @@ def test_glm_5_config():
     assert model["llm_config"]["disable_vision"] is True
 
 
+def test_glm_5_1_config():
+    """Test that glm-5.1 has correct configuration."""
+    model = MODELS["glm-5.1"]
+
+    assert model["id"] == "glm-5.1"
+    assert model["display_name"] == "GLM-5.1"
+    assert model["llm_config"]["model"] == "litellm_proxy/openrouter/z-ai/glm-5.1"
+    assert model["llm_config"]["disable_vision"] is True
+
+
 # Tests for preflight check functionality
 
```

Lines changed: 91 additions & 0 deletions
```diff
@@ -0,0 +1,91 @@
+"""Tests for EXTENSIONS_REF environment variable support.
+
+These tests use subprocess to run each test in an isolated Python process,
+avoiding module state pollution that would affect other tests.
+"""
+
+import subprocess
+import sys
+
+
+def _run_in_subprocess(test_code: str, env_extra: dict | None = None) -> None:
+    """Run test code in a subprocess with the given environment variables."""
+    import os
+
+    env = os.environ.copy()
+    if env_extra:
+        env.update(env_extra)
+
+    result = subprocess.run(
+        [sys.executable, "-c", test_code],
+        env=env,
+        capture_output=True,
+        text=True,
+    )
+    if result.returncode != 0:
+        raise AssertionError(
+            f"Subprocess test failed:\nstdout: {result.stdout}\nstderr: {result.stderr}"
+        )
+
+
+def test_extensions_ref_default():
+    """PUBLIC_SKILLS_BRANCH should default to 'main' when EXTENSIONS_REF is not set."""
+    code = """
+import os
+if "EXTENSIONS_REF" in os.environ:
+    del os.environ["EXTENSIONS_REF"]
+from openhands.sdk.context.skills.skill import PUBLIC_SKILLS_BRANCH
+assert PUBLIC_SKILLS_BRANCH == "main", (
+    f"Expected 'main' but got '{PUBLIC_SKILLS_BRANCH}'"
+)
+"""
+    _run_in_subprocess(code)
+
+
+def test_extensions_ref_custom_branch():
+    """PUBLIC_SKILLS_BRANCH should use EXTENSIONS_REF when set."""
+    code = """
+from openhands.sdk.context.skills.skill import PUBLIC_SKILLS_BRANCH
+assert PUBLIC_SKILLS_BRANCH == "feature-branch", (
+    f"Expected 'feature-branch' but got '{PUBLIC_SKILLS_BRANCH}'"
+)
+"""
+    _run_in_subprocess(code, {"EXTENSIONS_REF": "feature-branch"})
+
+
+def test_extensions_ref_with_load_public_skills():
+    """load_public_skills should respect EXTENSIONS_REF environment variable."""
+    code = """
+from unittest import mock
+from openhands.sdk.context.skills.skill import (
+    PUBLIC_SKILLS_BRANCH,
+    load_public_skills,
+)
+assert PUBLIC_SKILLS_BRANCH == "test-branch", (
+    f"Expected 'test-branch' but got '{PUBLIC_SKILLS_BRANCH}'"
+)
+with mock.patch(
+    "openhands.sdk.context.skills.skill.update_skills_repository"
+) as mock_update:
+    mock_update.return_value = None
+    load_public_skills()
+mock_update.assert_called_once()
+call_args = mock_update.call_args
+# branch is 2nd positional arg: (repo_url, branch, cache_dir)
+assert call_args[0][1] == "test-branch", (
+    f"Expected branch='test-branch' but got {call_args[0][1]}"
+)
+"""
+    _run_in_subprocess(code, {"EXTENSIONS_REF": "test-branch"})
+
+
+def test_extensions_ref_empty_string():
+    """Empty EXTENSIONS_REF yields '' (os.environ.get does not fall back)."""
+    code = """
+from openhands.sdk.context.skills.skill import PUBLIC_SKILLS_BRANCH
+# Empty string returns empty string per os.environ.get behavior
+assert PUBLIC_SKILLS_BRANCH == "", (
+    f"Expected '' but got '{PUBLIC_SKILLS_BRANCH}'"
+)
+"""
+    _run_in_subprocess(code, {"EXTENSIONS_REF": ""})
```
