Conversation
Co-authored-by: openhands <openhands@all-hands.dev>
Hi! I started running the integration tests on your PR. You will receive a comment with the results shortly.
Hi! I started running the behavior tests on your PR. You will receive a comment with the results shortly.
Python API breakage checks — ✅ PASSED
REST API breakage checks (OpenAPI) — ✅ PASSED
all-hands-bot left a comment
🟢 Good taste - Clean version bump for v1.18.0 release.
All package versions consistently updated from 1.17.0 → 1.18.0, eval workflow default updated to match, and uv.lock properly reflects the changes. LGTM! 🚀
(Would approve, but GitHub doesn't allow approving your own PR)
all-hands-bot left a comment
✅ QA Report: PASS
Release preparation complete: all four packages successfully bumped from v1.17.0 to v1.18.0 with consistent versioning across pyproject.toml files, lockfile, and workflow defaults.
Does this PR achieve its stated goal?
Yes. The PR's stated goal is to "prepare the release for version 1.18.0" by updating version numbers from 1.17.0 to 1.18.0. The changes successfully:
- Update all four package versions consistently (sdk, tools, workspace, agent-server)
- Update the run-eval.yml workflow default from v1.17.0 to v1.18.0
- Synchronize the uv.lock with the new versions
- Maintain backward compatibility (no deprecation deadlines for this release)
All version-related files are correctly updated, the lockfile is synchronized, packages install successfully, and runtime version reporting works as expected.
| Phase | Result |
|---|---|
| Environment Setup | ✅ Dependencies installed, 233 packages in 628ms |
| CI & Tests | ✅ Core tests passing (sdk, agent-server, workspace, cross, pre-commit, package version check) |
| Functional Verification | ✅ Version consistency verified, runtime checks pass, basic SDK functionality works |
Functional Verification
Test 1: Version Consistency Across All Packages
Step 1 — Establish baseline (main branch at 1.17.0):
Checked all package versions on main branch:
$ git show main:openhands-sdk/pyproject.toml | grep "^version"
version = "1.17.0"
$ git show main:openhands-tools/pyproject.toml | grep "^version"
version = "1.17.0"
$ git show main:openhands-workspace/pyproject.toml | grep "^version"
version = "1.17.0"
$ git show main:openhands-agent-server/pyproject.toml | grep "^version"
version = "1.17.0"
$ git show main:.github/workflows/run-eval.yml | grep -A 1 "default:"
default: v1.17.0
This confirms the baseline is 1.17.0 across all packages.
Step 2 — Apply the PR's changes:
Checked out rel-1.18.0 branch (commit a937440).
Step 3 — Verify version bump:
Checked all package versions on the release branch:
$ grep "^version" openhands-*/pyproject.toml
openhands-agent-server/pyproject.toml:version = "1.18.0"
openhands-sdk/pyproject.toml:version = "1.18.0"
openhands-tools/pyproject.toml:version = "1.18.0"
openhands-workspace/pyproject.toml:version = "1.18.0"
$ grep -A 1 "default:" .github/workflows/run-eval.yml
default: v1.18.0
All four packages are consistently bumped to 1.18.0, and the workflow default is updated.
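The same check can be scripted for future release PRs; a minimal sketch, assuming the PEP 621 `[project].version` layout shown in the grep output above and Python 3.11+ for `tomllib` (the package directory names come from that output, not from any release tooling):

```python
# Minimal sketch: verify all release packages declare the same version.
# Assumes PEP 621 pyproject.toml files and Python 3.11+ (tomllib).
import tomllib
from pathlib import Path

PACKAGES = [
    "openhands-sdk",
    "openhands-tools",
    "openhands-workspace",
    "openhands-agent-server",
]

versions = {}
for pkg in PACKAGES:
    data = tomllib.loads((Path(pkg) / "pyproject.toml").read_text())
    versions[pkg] = data["project"]["version"]

assert len(set(versions.values())) == 1, f"Version mismatch: {versions}"
print(f"All packages at {versions['openhands-sdk']}")
```

Run from the repo root, this prints the common version or fails loudly on a mismatch.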
Step 4 — Verify lockfile consistency:
$ grep 'name = "openhands-' uv.lock -A 1 | grep version
version = "1.18.0"
version = "1.18.0"
version = "1.18.0"
version = "1.18.0"
$ uv lock --locked
Resolved 402 packages in 1ms
Lockfile is synchronized with pyproject.toml files (all at 1.18.0).
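If the grep-based lockfile check ever needs tightening, the lockfile can also be parsed directly; a rough sketch, assuming uv.lock remains valid TOML with `[[package]]` entries carrying `name` and `version` (as the grep output above suggests):

```python
# Rough sketch: cross-check uv.lock against the expected release version.
# Assumes uv.lock is TOML with [[package]] entries carrying name/version.
import tomllib
from pathlib import Path

EXPECTED = "1.18.0"

lock = tomllib.loads(Path("uv.lock").read_text())
mismatches = {
    pkg["name"]: pkg.get("version")
    for pkg in lock.get("package", [])
    if pkg["name"].startswith("openhands-") and pkg.get("version") != EXPECTED
}
print("lockfile in sync" if not mismatches else f"out of sync: {mismatches}")
```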
Test 2: Runtime Version Reporting
Step 1 — Install packages:
$ make build
Installing dependencies with uv sync --dev...
Installed 233 packages in 628ms
+ openhands-agent-server==1.18.0
+ openhands-sdk==1.18.0
+ openhands-tools==1.18.0
+ openhands-workspace==1.18.0
Packages installed successfully at version 1.18.0.
Step 2 — Verify runtime version reporting:
$ python -c "import openhands.sdk; print(f'SDK: {openhands.sdk.__version__}')"
SDK: 1.18.0
$ python -c "import openhands.tools; print(f'Tools: {openhands.tools.__version__}')"
Tools: 1.18.0
SDK and tools correctly report version 1.18.0 at runtime (workspace and agent-server don't expose __version__ by design).
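A slightly stronger variant of this check compares the runtime `__version__` against the installed distribution metadata; a small sketch, with the distribution names assumed from the `make build` output above:

```python
# Small sketch: runtime __version__ vs. installed distribution metadata.
# Distribution names are assumed from the `make build` output above.
from importlib.metadata import version

import openhands.sdk
import openhands.tools

assert openhands.sdk.__version__ == version("openhands-sdk")
assert openhands.tools.__version__ == version("openhands-tools")
print("runtime versions match installed metadata")
```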
Test 3: Basic SDK Functionality
Step 1 — Test agent creation:
from openhands.sdk import Agent
from openhands.sdk.llm import LLM
llm = LLM(model="gpt-4o-mini")
agent = Agent(
llm=llm,
system_prompt="You are a helpful assistant.",
)
Result:
✓ Agent creation successful
✓ Agent LLM model: gpt-4o-mini
✓ Agent system_prompt set: True
✓ Basic SDK functionality verified
Core SDK functionality works correctly with the new version.
Test 4: Deprecation Deadline Check
Step 1 — Search for deprecations scheduled for removal in 1.18.0:
$ grep -h "removed_in" --include="*.py" -r openhands-* | sort -u
removed_in="1.19.0",
removed_in="1.20.0",
removed_in="1.22.0",
removed_in="1.23.0",
removed_in="2.0.0",
removed_in=None,
No deprecations are scheduled for removal in 1.18.0. The earliest removal is 1.19.0 (next release), which is correct.
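This deadline check could also be automated ahead of future releases; a hedged sketch that mirrors the grep above (the `removed_in="X.Y.Z"` keyword pattern is taken from that output, not from a documented deprecation API):

```python
# Hedged sketch: flag deprecations whose removed_in deadline is at or
# before the release being cut. The removed_in="X.Y.Z" pattern mirrors
# the grep output above rather than a documented deprecation API.
import re
from pathlib import Path

RELEASE = (1, 18, 0)
PATTERN = re.compile(r'removed_in="(\d+)\.(\d+)\.(\d+)"')

overdue = []
for path in Path(".").glob("openhands-*/**/*.py"):
    for match in PATTERN.finditer(path.read_text(errors="ignore")):
        deadline = tuple(int(part) for part in match.groups())
        if deadline <= RELEASE:
            overdue.append((str(path), ".".join(match.groups())))

print("no overdue deprecations" if not overdue else f"overdue: {overdue}")
```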
Test 5: CI Status Review
Completed and passing:
- ✅ Check package versions (critical for release)
- ✅ pre-commit
- ✅ sdk-tests
- ✅ agent-server-tests
- ✅ workspace-tests
- ✅ cross-tests
- ✅ build-binary-and-test (ubuntu-latest)
- ✅ Python API
- ✅ REST API (OpenAPI)
- ✅ Check OpenAPI Schema
- ✅ Some integration tests (claude-sonnet-4-6, gemini-3.1-pro)
Still in progress:
- Build & Push (various architectures)
- Additional integration tests
- qa-changes (this report)
Core functionality tests have all passed. Remaining checks are builds and extended integration tests.
Issues Found
None. The release preparation is complete and ready for the next steps in the release checklist.
🧪 Integration Tests Results
Overall Success Rate: 94.1%
📁 Detailed Logs & Artifacts
Click the links below to access detailed agent/LLM logs showing the complete reasoning process for each model. On the GitHub Actions page, scroll down to the 'Artifacts' section to download the logs.
📊 Summary
📋 Detailed Results
litellm_proxy_moonshot_kimi_k2_thinking
Skipped Tests:
litellm_proxy_deepseek_deepseek_reasoner
Skipped Tests:
litellm_proxy_gemini_3.1_pro_preview
Failed Tests:
litellm_proxy_anthropic_claude_sonnet_4_6
Failed Tests:
🔄 Running Examples with
| Example | Status | Duration | Cost |
|---|---|---|---|
| 01_standalone_sdk/02_custom_tools.py | ✅ PASS | 23.4s | $0.03 |
| 01_standalone_sdk/03_activate_skill.py | ✅ PASS | 19.8s | $0.03 |
| 01_standalone_sdk/05_use_llm_registry.py | ✅ PASS | 11.5s | $0.01 |
| 01_standalone_sdk/07_mcp_integration.py | ✅ PASS | 37.3s | $0.03 |
| 01_standalone_sdk/09_pause_example.py | ✅ PASS | 13.0s | $0.01 |
| 01_standalone_sdk/10_persistence.py | ✅ PASS | 48.4s | $0.05 |
| 01_standalone_sdk/11_async.py | ✅ PASS | 31.4s | $0.03 |
| 01_standalone_sdk/12_custom_secrets.py | ✅ PASS | 12.1s | $0.01 |
| 01_standalone_sdk/13_get_llm_metrics.py | ✅ PASS | 30.7s | $0.03 |
| 01_standalone_sdk/14_context_condenser.py | ✅ PASS | 2m 18s | $0.16 |
| 01_standalone_sdk/17_image_input.py | ✅ PASS | 16.4s | $0.01 |
| 01_standalone_sdk/18_send_message_while_processing.py | ✅ PASS | 19.3s | $0.02 |
| 01_standalone_sdk/19_llm_routing.py | ✅ PASS | 11.9s | $0.02 |
| 01_standalone_sdk/20_stuck_detector.py | ✅ PASS | 21.4s | $0.02 |
| 01_standalone_sdk/21_generate_extraneous_conversation_costs.py | ✅ PASS | 9.5s | $0.00 |
| 01_standalone_sdk/22_anthropic_thinking.py | ✅ PASS | 13.4s | $0.01 |
| 01_standalone_sdk/23_responses_reasoning.py | ✅ PASS | 43.2s | $0.01 |
| 01_standalone_sdk/24_planning_agent_workflow.py | ✅ PASS | 3m 47s | $0.26 |
| 01_standalone_sdk/25_agent_delegation.py | ✅ PASS | 1m 21s | $0.08 |
| 01_standalone_sdk/26_custom_visualizer.py | ✅ PASS | 20.4s | $0.03 |
| 01_standalone_sdk/28_ask_agent_example.py | ✅ PASS | 30.1s | $0.03 |
| 01_standalone_sdk/29_llm_streaming.py | ✅ PASS | 44.8s | $0.03 |
| 01_standalone_sdk/30_tom_agent.py | ✅ PASS | 8.5s | $0.01 |
| 01_standalone_sdk/31_iterative_refinement.py | ❌ FAIL (Timed out after 600 seconds) | 10m 0s | -- |
| 01_standalone_sdk/32_configurable_security_policy.py | ✅ PASS | 18.7s | $0.02 |
| 01_standalone_sdk/34_critic_example.py | ❌ FAIL (Timed out after 600 seconds) | 10m 0s | -- |
| 01_standalone_sdk/36_event_json_to_openai_messages.py | ✅ PASS | 10.1s | $0.01 |
| 01_standalone_sdk/37_llm_profile_store/main.py | ✅ PASS | 3.4s | $0.00 |
| 01_standalone_sdk/38_browser_session_recording.py | ✅ PASS | 33.2s | $0.03 |
| 01_standalone_sdk/39_llm_fallback.py | ✅ PASS | 10.5s | $0.01 |
| 01_standalone_sdk/40_acp_agent_example.py | ✅ PASS | 25.6s | $0.13 |
| 01_standalone_sdk/41_task_tool_set.py | ✅ PASS | 25.6s | $0.03 |
| 01_standalone_sdk/42_file_based_subagents.py | ✅ PASS | 1m 42s | $0.11 |
| 01_standalone_sdk/43_mixed_marketplace_skills/main.py | ✅ PASS | 3.1s | $0.00 |
| 01_standalone_sdk/44_model_switching_in_convo.py | ✅ PASS | 6.9s | $0.01 |
| 01_standalone_sdk/45_parallel_tool_execution.py | ✅ PASS | 3m 11s | $0.36 |
| 01_standalone_sdk/46_agent_settings.py | ✅ PASS | 9.8s | $0.01 |
| 01_standalone_sdk/47_defense_in_depth_security.py | ✅ PASS | 3.1s | $0.00 |
| 01_standalone_sdk/48_conversation_fork.py | ✅ PASS | 12.0s | $0.00 |
| 02_remote_agent_server/01_convo_with_local_agent_server.py | ✅ PASS | 30.7s | $0.03 |
| 02_remote_agent_server/02_convo_with_docker_sandboxed_server.py | ✅ PASS | 1m 33s | $0.05 |
| 02_remote_agent_server/03_browser_use_with_docker_sandboxed_server.py | ✅ PASS | 54.2s | $0.06 |
| 02_remote_agent_server/04_convo_with_api_sandboxed_server.py | ✅ PASS | 1m 40s | $0.03 |
| 02_remote_agent_server/07_convo_with_cloud_workspace.py | ✅ PASS | 30.2s | $0.03 |
| 02_remote_agent_server/08_convo_with_apptainer_sandboxed_server.py | ✅ PASS | 3m 13s | $0.02 |
| 02_remote_agent_server/09_acp_agent_with_remote_runtime.py | ✅ PASS | 1m 17s | $0.11 |
| 02_remote_agent_server/10_cloud_workspace_share_credentials.py | ✅ PASS | 27.9s | $0.05 |
| 02_remote_agent_server/11_conversation_fork.py | ✅ PASS | 34.1s | $0.00 |
| 04_llm_specific_tools/01_gpt5_apply_patch_preset.py | ✅ PASS | 14.7s | $0.02 |
| 04_llm_specific_tools/02_gemini_file_tools.py | ✅ PASS | 48.9s | $0.08 |
| 05_skills_and_plugins/01_loading_agentskills/main.py | ✅ PASS | 10.7s | $0.02 |
| 05_skills_and_plugins/02_loading_plugins/main.py | ✅ PASS | 14.3s | $0.02 |
❌ Some tests failed
Total: 52 | Passed: 50 | Failed: 2 | Total Cost: $2.16
Failed examples:
- examples/01_standalone_sdk/31_iterative_refinement.py: Timed out after 600 seconds
- examples/01_standalone_sdk/34_critic_example.py: Timed out after 600 seconds
@OpenHands check the failed test #2908 (comment). Should I be worried?
I'm on it! xingyaoww can track my progress at all-hands.dev
No, you should not be worried. These are pre-existing flaky timeouts, not caused by this PR.
Analysis
The two failed examples both timed out after the 600-second limit:
Why this is not concerning
Summary
These failures are LLM latency/behavior dependent and unrelated to the release changes. The 50/52 pass rate (96.2%) is consistent with recent runs. This comment was generated by an AI assistant (OpenHands) on behalf of the user.
Summary
The user asked me to check the failed tests on PR #2908 and whether they should be worried. I investigated and answered the question by posting a comment on the PR.
Answer
No, you should not be worried. The two failed example tests are pre-existing flaky timeouts unrelated to the PR:
I posted this analysis as a detailed comment on the PR at: #2908 (comment)
Checklist
🧪 Integration Tests Results
Overall Success Rate: 86.7%
📁 Detailed Logs & Artifacts
Click the links below to access detailed agent/LLM logs showing the complete reasoning process for each model. On the GitHub Actions page, scroll down to the 'Artifacts' section to download the logs.
📊 Summary
📋 Detailed Results
litellm_proxy_moonshot_kimi_k2_thinking
Failed Tests:
However, there is a violation of explicit evaluation criteria. The instructions stated: "Avoid creating any additional files that were not explicitly requested. Only one README.md file is acceptable if it pertains to the new training script." The agent created three files:
While the COMPARISON.md file is well-written and provides helpful context, it was not requested and violates the constraint that only a single README.md is acceptable as an additional file. The agent should have stopped after creating the script and README.md. The main deliverable quality is high, but the creation of the unrequested COMPARISON.md file represents a failure to follow explicit constraints on file creation scope, even though the overall intent and execution were sound. (confidence=0.72) (Cost: $0.18)
litellm_proxy_deepseek_deepseek_reasoner
Failed Tests:
Scope Violation:
Over-Verification:
After the initial targeted test passed, additional verification was unnecessary and wasteful.
Lack of Clear Stopping Point:
Positive Aspects:
Critical Issue:
litellm_proxy_anthropic_claude_sonnet_4_6
Evaluation Triggered
@OpenHands apologize from me and resolve the conflict.
I'm on it! xingyaoww can track my progress at all-hands.dev
^😓 "merge from main" ugh, speech to text |
Resolved the uv.lock conflict by keeping the release branch's newer exclude-newer date (2026-04-14) and regenerating the lockfile. Co-authored-by: openhands <openhands@all-hands.dev>
|
Sorry about the merge conflict! 🙏 I've resolved it — the conflict was in uv.lock. The PR should now be conflict-free. This comment was created by an AI assistant (OpenHands) on behalf of @xingyaoww.
Summary of Work
The PR comment asked me to apologize on behalf of the user and resolve the merge conflict on PR #2908. Here's how both requests were addressed:
Checklist
Conciseness
The changes are minimal and targeted — only the uv.lock file was changed.
Take uv.lock from main and regenerate to pick up the 1.18.0 version bumps. This ensures the lockfile reflects main's dependency state with only the release version changes on top. Co-authored-by: openhands <openhands@all-hands.dev>
Release v1.18.0
This PR prepares the release for version 1.18.0.
Release Checklist
- Integration tests (integration-test)
- Behavior tests (behavior-test)
- Example tests (test-examples)
- Release tag: v1.18.0
- Release branch: rel-1.18.0
Next Steps
Once the release is published on GitHub, the PyPI packages will be automatically published via the pypi-release.yml workflow.
Agent Server images for this PR
• GHCR package: https://github.com/OpenHands/agent-sdk/pkgs/container/agent-server
Variants & Base Images
- eclipse-temurin:17-jdk
- nikolaik/python-nodejs:python3.13-nodejs22-slim
- golang:1.21-bookworm
Pull (multi-arch manifest)
# Each variant is a multi-arch manifest supporting both amd64 and arm64
docker pull ghcr.io/openhands/agent-server:45ab866-python
Run
All tags pushed for this build
About Multi-Architecture Support
- The default tag (45ab866-python) is a multi-arch manifest supporting both amd64 and arm64
- Architecture-specific tags (e.g. 45ab866-python-amd64) are also available if needed