feat(sdk/agent): Parallel Tool Call Execution #2390
Conversation
… tool execution

Add infrastructure for executing multiple tool calls concurrently with a configurable global concurrency limit.

Classes:
- ToolExecutorSemaphore: Process-global singleton that limits concurrent tool executions across all agents and sub-agents. Configured via the OPENHANDS_TOOL_CONCURRENCY_LIMIT environment variable (default: 8).
- ParallelToolExecutor: Executes batches of tool calls concurrently using ThreadPoolExecutor, with concurrency controlled by the semaphore.

Key design decisions:
- Single layer of concurrency control via environment variable
- Singleton pattern using __new__ for ToolExecutorSemaphore
- ThreadPoolExecutor for I/O-bound tool execution
- Results returned in original order regardless of completion order

Related to #2350

Co-authored-by: openhands <openhands@all-hands.dev>
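The two classes described in the commit message could be sketched roughly as follows. This is a hypothetical sketch, not the SDK's actual implementation: class names follow the commit message, but the method signatures and the `semaphore` attribute are illustrative.

```python
import os
import threading
from concurrent.futures import ThreadPoolExecutor


class ToolExecutorSemaphore:
    """Process-global singleton bounding concurrent tool executions."""

    _instance = None
    _lock = threading.Lock()

    def __new__(cls):
        # Singleton via __new__, as described in the commit message.
        with cls._lock:
            if cls._instance is None:
                limit = int(os.environ.get("OPENHANDS_TOOL_CONCURRENCY_LIMIT", "8"))
                inst = super().__new__(cls)
                inst.semaphore = threading.Semaphore(limit)
                cls._instance = inst
        return cls._instance


class ParallelToolExecutor:
    """Run a batch of callables concurrently, returning results in input order."""

    def __init__(self):
        self._gate = ToolExecutorSemaphore()

    def _run_one(self, fn):
        # The global semaphore caps concurrency across all agents/sub-agents.
        with self._gate.semaphore:
            return fn()

    def execute(self, calls):
        # executor.map preserves input order regardless of completion order.
        with ThreadPoolExecutor(max_workers=max(1, len(calls))) as pool:
            return list(pool.map(self._run_one, calls))
```

Because the semaphore is process-global, the cap applies even when several agents each create their own `ParallelToolExecutor`.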
Python API breakage checks — ✅ PASSED
REST API breakage checks (OpenAPI) — ✅ PASSED
Coverage Report
all-hands-bot
left a comment
🟡 Taste Rating: Acceptable - Requires Eval Verification
Core architecture is excellent. Making _execute_action_event side-effect-free (returns events instead of emitting directly) is exactly the right design — this eliminates the need for locks and makes the special case (parallel execution) become a normal case. Per-agent thread pools elegantly prevent deadlocks without complex detection logic.
The code is clean, tests are comprehensive and test real behavior (not mocks), and default concurrency=1 preserves backward compatibility.
However, this PR changes core agent execution flow (tool calling, event emission, state management). Even with the backward-compatible default, the execution path has been refactored significantly. Per repository policy, PRs that change agent behavior require lightweight eval verification before merge.
KEY INSIGHT
The refactoring turns concurrency from a special case requiring complex coordination into a normal case with side-effect-free functions. This is "good taste" — the right abstraction eliminates the complexity rather than managing it with locks and conditionals.
VERDICT
✅ Code quality is solid — approve from a technical perspective
Move _emit_batch and _handle_finish logic from Agent into _ActionBatch as emit() and finalize() methods. Agent-specific logic (iterative refinement check, mark-finished callback) is injected via callables, keeping _ActionBatch decoupled from the Agent class. This simplifies Agent._execute_actions to a clean prepare → emit → finalize pipeline and gives _ActionBatch full ownership of the batch lifecycle. Co-authored-by: openhands <openhands@all-hands.dev>
@OpenHands Do a /codereview-roasted on this PR.
I'm on it! enyst can track my progress at all-hands.dev |
enyst
left a comment
🔴 Needs improvement
[CRITICAL ISSUES]
- [openhands-sdk/openhands/sdk/agent/parallel_executor.py, Lines 97-103] Breaking change disguised as the fallback path: `TOOL_CONCURRENCY_LIMIT=1` still routes any multi-tool batch through `ThreadPoolExecutor(max_workers=1)`. That is not the old behavior. It changes thread affinity, and because results are buffered until the batch finishes, it also changes when observations hit the conversation. I reproduced this locally with a tiny tool: both calls ran on `ThreadPoolExecutor-*`, not `MainThread`, and the second call saw zero prior `ObservationEvent`s. So the PR description's "fully backward-compatible" claim is false. Fix: keep the old `for action in action_events: execute + emit` path when the limit is 1, and only use the batch executor when the limit is actually > 1.
- [openhands-sdk/openhands/sdk/agent/agent.py, Lines 389-396] Sequential semantics were silently changed: `_ActionBatch.prepare()` executes the whole batch before `batch.emit()`, so later tools in the same batch no longer see earlier observations in `conversation.state.events`. Even with concurrency effectively "off", you've changed execution from `run tool -> emit observation -> run next tool` into `run everything -> emit later`. That's a real semantic regression for tools/hooks that inspect conversation state mid-batch. Fix: preserve incremental emission in the sequential path; don't reuse the buffered parallel path as the fallback.
- [openhands-sdk/openhands/sdk/agent/agent.py, Lines 389-393] Unsafe by construction for the stock tool set: once `TOOL_CONCURRENCY_LIMIT > 1`, this code blindly parallelizes whatever tool calls the model produced. But the default tool set includes shared mutable executors like terminal sessions, browser sessions, and file-editor history. A warning in a docstring is not a concurrency model; it just means the env var can turn into garbled terminal I/O or corrupted edits. Fix: parallelize only tools that explicitly declare themselves safe, or serialize known shared-resource tools.
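The fix asked for in the first point amounts to keeping two distinct code paths rather than funneling everything through the batch executor. A minimal sketch, with `execute_one` and `emit` as stand-ins for the SDK's real methods:

```python
from concurrent.futures import ThreadPoolExecutor


def run_batch(action_events, limit, execute_one, emit):
    """Dispatch a batch of tool calls, preserving old semantics at limit 1."""
    if limit <= 1:
        # Old sequential semantics: each observation is emitted before the
        # next tool runs, so later tools see earlier results in state,
        # and everything stays on the calling thread.
        for action in action_events:
            emit(execute_one(action))
        return
    # Parallel path: results are buffered by pool.map (in input order)
    # and emitted after each completes, on the calling thread.
    with ThreadPoolExecutor(max_workers=limit) as pool:
        for obs in pool.map(execute_one, action_events):
            emit(obs)
```

The point of the branch is that `limit == 1` is not "parallel with one worker"; it is a different contract about thread affinity and emission timing.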
[TESTING GAPS]
- [tests/sdk/agent/test_parallel_execution_integration.py, Lines 188-213] The default-limit regression isn't actually tested: this test only checks ordering. It doesn't assert same-thread execution or incremental state visibility, so it passes even though the fallback already changed behavior.
- [tests/sdk/agent/test_parallel_execution_integration.py] Only synthetic tools are exercised: the tests never hit the real risky executors (`terminal`, `file_editor`, `browser`), so they won't catch the races this PR introduces.
VERDICT:
❌ Needs rework: the concurrency primitive is reasonable, but the fallback path is not backward-compatible, and the parallel path has no safety gate for the tools people actually run.
KEY INSIGHT:
The real bug here is that the code collapsed “sequential semantics” and “parallel scheduling” into one path. Those are not the same thing, and when limit=1 they must stay identical to the old behavior.
Because this PR changes agent tool-execution behavior, I'd still want a human maintainer to make the final merge call after lightweight evals even after the above is addressed.
This comment was marked as duplicate.
OpenHands-GPT-5.4 here. I re-read my earlier review with fresh eyes, section by section, and I think I overstated parts of it. Here's my updated take.
OpenHands-GPT-5.4 follow-up: you're right, and I need to correct my re-review. I made a reasoning mistake there. This PR exists precisely to handle the multi-tool-call case better. If multi-tool responses are common in real usage, then the changed fallback semantics are not some corner case: they hit the exact workload this feature is supposed to serve. So I want to sharpen the conclusion.
I would restate the concern more cleanly as:
I still think my earlier re-review was right to soften one thing: synthetic tools are fine for testing the concurrency primitive itself. But I over-corrected on severity. If multi-tool responses are common in production, then the fallback-semantics issue is materially more important than I gave it credit for. I also think this strengthens, not weakens, the concern about parallelizing shared-state tools. If the intended real-world usage is "LLM emits several tool calls and we want them parallel," then correctness for the default tool ecosystem is part of the core feature story, not just a docs footnote.
|
HUMAN: Just to note, I do think multiple tool calls are a normal occurrence. I don't have numbers, but they happen often enough in practice. We have the eval trajectories; we could compute the number of batches with the same llm_response_id.
@enyst yes this could be interesting.
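The trajectory analysis suggested above could be sketched as follows. This is hypothetical: it assumes events are available as dicts with `kind` and `llm_response_id` fields, as in the trace report later in this thread.

```python
from collections import Counter


def multi_call_batches(events):
    """Return {llm_response_id: n} for LLM turns that emitted >1 tool call.

    Groups ActionEvents by the response that produced them; a count above 1
    means the model asked for multiple tool calls in a single turn.
    """
    per_turn = Counter(
        e["llm_response_id"] for e in events if e.get("kind") == "ActionEvent"
    )
    return {rid: n for rid, n in per_turn.items() if n > 1}
```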
@enyst I answered your points.
The previous implementation wasn't 100% backward compatible, but it was a simple fix. I have updated the code and added a test case to ensure full compatibility.
I’m not sure I follow this point. If multiple tool calls are being executed in parallel, they shouldn't need to depend on each other's outputs.
This is a valid point. However, I avoided adding that specific logic here to prevent the PR from becoming even larger than it already is. My approach was to first implement parallel tool calling and then address the prevention of dependent tool calls in a subsequent update. This aligns with what we originally discussed in the issue, and is why the current default remains the standard behavior.
I have added integration tests to cover scenarios where num_workers = 1 and the request contains multiple tool calls. This should show that we have backward compatibility.
Makes sense. I changed it as suggested by OH.
I created an example where we spawn subagents that perform code searches. This happens in parallel, and at the end there is also a parallel report to confirm the calls were actually executed in parallel. If it is too verbose for an example, I can also remove it. I just wanted to make sure we are correctly parallelizing the calls. The report looks like:
Ha! There’s no |
If you want I can force the example to have |
😇 Just out of curiosity |
🔄 Running Examples with
| Example | Status | Duration | Cost |
|---|---|---|---|
| 01_standalone_sdk/02_custom_tools.py | ✅ PASS | 34.9s | $0.02 |
| 01_standalone_sdk/03_activate_skill.py | ✅ PASS | 17.9s | $0.02 |
| 01_standalone_sdk/05_use_llm_registry.py | ✅ PASS | 12.7s | $0.01 |
| 01_standalone_sdk/07_mcp_integration.py | ✅ PASS | 31.1s | $0.02 |
| 01_standalone_sdk/09_pause_example.py | ✅ PASS | 14.9s | $0.01 |
| 01_standalone_sdk/10_persistence.py | ✅ PASS | 27.2s | $0.03 |
| 01_standalone_sdk/11_async.py | ✅ PASS | 29.3s | $0.04 |
| 01_standalone_sdk/12_custom_secrets.py | ✅ PASS | 13.0s | $0.01 |
| 01_standalone_sdk/13_get_llm_metrics.py | ✅ PASS | 21.5s | $0.02 |
| 01_standalone_sdk/14_context_condenser.py | ✅ PASS | 2m 47s | $0.19 |
| 01_standalone_sdk/17_image_input.py | ✅ PASS | 16.9s | $0.01 |
| 01_standalone_sdk/18_send_message_while_processing.py | ✅ PASS | 24.7s | $0.01 |
| 01_standalone_sdk/19_llm_routing.py | ✅ PASS | 15.0s | $0.02 |
| 01_standalone_sdk/20_stuck_detector.py | ✅ PASS | 15.6s | $0.02 |
| 01_standalone_sdk/21_generate_extraneous_conversation_costs.py | ✅ PASS | 11.8s | $0.00 |
| 01_standalone_sdk/22_anthropic_thinking.py | ✅ PASS | 17.8s | $0.01 |
| 01_standalone_sdk/23_responses_reasoning.py | ✅ PASS | 54.5s | $0.01 |
| 01_standalone_sdk/24_planning_agent_workflow.py | ✅ PASS | 52.1s | $0.05 |
| 01_standalone_sdk/25_agent_delegation.py | ✅ PASS | 51.5s | $0.06 |
| 01_standalone_sdk/26_custom_visualizer.py | ✅ PASS | 24.0s | $0.03 |
| 01_standalone_sdk/28_ask_agent_example.py | ✅ PASS | 31.2s | $0.03 |
| 01_standalone_sdk/29_llm_streaming.py | ✅ PASS | 33.7s | $0.02 |
| 01_standalone_sdk/30_tom_agent.py | ✅ PASS | 20.9s | $0.01 |
| 01_standalone_sdk/31_iterative_refinement.py | ❌ FAIL Timed out after 600 seconds | 10m 0s | -- |
| 01_standalone_sdk/32_configurable_security_policy.py | ✅ PASS | 16.4s | $0.02 |
| 01_standalone_sdk/34_critic_example.py | ✅ PASS | 9m 30s | $0.67 |
| 01_standalone_sdk/36_event_json_to_openai_messages.py | ✅ PASS | 13.3s | $0.01 |
| 01_standalone_sdk/37_llm_profile_store.py | ✅ PASS | 4.3s | $0.00 |
| 01_standalone_sdk/38_browser_session_recording.py | ✅ PASS | 27.3s | $0.03 |
| 01_standalone_sdk/39_llm_fallback.py | ✅ PASS | 10.4s | $0.01 |
| 01_standalone_sdk/40_acp_agent_example.py | ✅ PASS | 27.6s | $0.10 |
| 01_standalone_sdk/41_task_tool_set.py | ✅ PASS | 26.8s | $0.03 |
| 01_standalone_sdk/42_file_based_subagents.py | ✅ PASS | 1m 12s | $0.06 |
| 01_standalone_sdk/43_mixed_marketplace_skills/main.py | ✅ PASS | 7.8s | $0.00 |
| 01_standalone_sdk/44_model_switching_in_convo.py | ✅ PASS | 8.1s | $0.01 |
| 01_standalone_sdk/45_parallel_tool_execution.py | ✅ PASS | 3m 31s | $0.24 |
| 02_remote_agent_server/01_convo_with_local_agent_server.py | ✅ PASS | 35.3s | $0.03 |
| 02_remote_agent_server/02_convo_with_docker_sandboxed_server.py | ✅ PASS | 1m 38s | $0.05 |
| 02_remote_agent_server/03_browser_use_with_docker_sandboxed_server.py | ✅ PASS | 50.6s | $0.00 |
| 02_remote_agent_server/04_convo_with_api_sandboxed_server.py | ✅ PASS | 1m 33s | $0.05 |
| 02_remote_agent_server/07_convo_with_cloud_workspace.py | ✅ PASS | 27.7s | $0.02 |
| 02_remote_agent_server/08_convo_with_apptainer_sandboxed_server.py | ✅ PASS | 3m 37s | $0.02 |
| 02_remote_agent_server/09_acp_agent_with_remote_runtime.py | ✅ PASS | 58.9s | $0.12 |
| 04_llm_specific_tools/01_gpt5_apply_patch_preset.py | ✅ PASS | 20.9s | $0.03 |
| 04_llm_specific_tools/02_gemini_file_tools.py | ✅ PASS | 44.0s | $0.06 |
| 05_skills_and_plugins/01_loading_agentskills/main.py | ✅ PASS | 17.3s | $0.01 |
| 05_skills_and_plugins/02_loading_plugins/main.py | ✅ PASS | 17.9s | $0.02 |
❌ Some tests failed
Total: 47 | Passed: 46 | Failed: 1 | Total Cost: $2.25
Failed examples:
- examples/01_standalone_sdk/31_iterative_refinement.py: Timed out after 600 seconds
Reminder: The main focus of this PR is simply to add the infrastructure for parallel tool calls. There will be additional PRs to ensure everything is thread-safe. For this reason, we have set the default to sequential tool calls.

@enyst I investigated your concerns and here is what is going on.

Current state: TerminalTool is not thread-safe. All commands go through a single PTY. Two concurrent commands on the same PTY interleave their bytes, corrupting output.

When it matters: Only when tool_concurrency_limit > 1 and the LLM emits multiple terminal calls in the same response. Subagents are fine: each gets its own session.

What's safe today: Parallel batches with different tool types (e.g., 3 delegate calls, or terminal + file editor). This is the common case.

Possible fixes (simplest to most capable):
What I propose:
For comparison, here are the parallel tool calls from my CC sessions:
Note that Read is parallelizable (because of FileEditor view).
Aha! Thank you. 😅 Your proposals sound good to me, thanks for satisfying my dumb little curiosity. |
xingyaoww
left a comment
With this context, #2390 (comment)
This PR LGTM!
Immediately after this PR: I will submit a PR with a terminal lock. This is a minimal change with no behavioral impact. We will lose some parallelism, but we'll be safer, and we still gain performance (and save tokens) for mixed tool batches.
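A terminal lock of the kind described could look roughly like this. This is an illustrative sketch, not the actual follow-up PR: the tool names and the `call_tool` wrapper are assumptions.

```python
import threading

# Tools that share a single PTY (or similar mutable session) must not
# interleave; everything else may run concurrently.
_SERIALIZED_TOOLS = {"terminal"}
_tool_locks = {name: threading.Lock() for name in _SERIALIZED_TOOLS}


def call_tool(name, fn, *args, **kwargs):
    """Run a tool call, serializing only the known shared-resource tools."""
    lock = _tool_locks.get(name)
    if lock is None:
        return fn(*args, **kwargs)  # safe to run concurrently
    with lock:  # one terminal command at a time
        return fn(*args, **kwargs)
```

The appeal of this approach is that it loses parallelism only within one tool type, so mixed batches (terminal + file editor + delegate) still overlap.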
Shall we create an issue here in case we forget about it?
I have a Linear ticket to remind myself to create all the related issues to continue the parallel tool calls dev. Moreover, I also have to create documentation, as I introduced a new example.
Rename FailingAction/FailingObservation to ParallelFailingAction/ParallelFailingObservation to avoid name collisions with the existing test classes in tests/sdk/conversation/local/test_rerun_actions.py. When pytest-xdist runs tests in parallel, both files get loaded in the same process, causing the Action/Observation class registry to detect duplicate class definitions and raise ValidationErrors. Co-authored-by: openhands <openhands@all-hands.dev>
Summary
(ref #2350)
Add ParallelToolExecutor to enable concurrent tool execution within agent steps, controlled by the TOOL_CONCURRENCY_LIMIT environment variable (default: 1, fully backward-compatible).
Motivation
When an LLM returns multiple tool calls in a single response (e.g., "read these 3 files" or "run these 4 independent searches"), the current agent executes them sequentially. For I/O-bound tools — file reads, HTTP requests, MCP server calls, database queries — this leaves significant performance on the table. Parallel execution turns N × latency into ~1 × latency for independent operations.
Concrete scenarios where this helps:
What this does NOT help with: CPU-bound tools limited by the GIL, or tools with shared mutable state that aren't thread-safe.
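The "N × latency into ~1 × latency" claim can be illustrated with a toy I/O-bound workload. This is illustrative only: `fetch` stands in for a tool call that spends its time waiting on the network or disk.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fetch(i, latency=0.3):
    """Simulated I/O-bound tool call with a fixed latency."""
    time.sleep(latency)  # stand-in for network / disk wait
    return f"result-{i}"


def run_parallel(n, workers):
    """Run n independent calls on a thread pool; return (results, elapsed)."""
    start = time.monotonic()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(fetch, range(n)))
    return results, time.monotonic() - start
```

With 4 independent calls and 4 workers, wall time stays close to one call's latency rather than four, because the threads overlap their waiting.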
Design
- Side effects (state updates, event emission) happen on the main thread after parallel work completes.
- Unexpected exceptions (RuntimeError, AssertionError, etc.) are logged at ERROR with full traceback to aid debugging.
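The exception-handling behavior could be sketched like this. It is a sketch only: `safe_execute` and the tuple result shape are illustrative, not the SDK's real event types.

```python
import logging

logger = logging.getLogger(__name__)


def safe_execute(action_id, fn):
    """Run one tool call; convert unexpected exceptions into error results.

    The exception is logged at ERROR with the full traceback (exc_info=True)
    so one failing call doesn't kill the rest of the batch.
    """
    try:
        return ("ok", fn())
    except Exception as exc:  # RuntimeError, AssertionError, ...
        logger.error("tool call %s failed", action_id, exc_info=True)
        return ("error", repr(exc))
```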
Thread safety warning
When TOOL_CONCURRENCY_LIMIT > 1, tools run in parallel threads sharing the same conversation object. Tools are not thread-safe by default. Callers opting into parallelism must ensure their tools are safe for concurrent execution (no shared mutable filesystem state, no concurrent conversation mutations).
Evaluation
I ran an evaluation with SWE-bench to ensure that the default behavior is the one we already have in the repo [ref]
Report from trace investigation of OpenHands CLI:
No parallel tool calls detected -- the feature is cleanly disabled. Here's the full breakdown:

Trace Format
- Events alternate between ActionEvent (tool call) and ObservationEvent (tool result)
- Tools used: terminal (1150), file_editor (588), think (58), finish (25)
- 1,821 action events matched exactly 1,821 observation events across all 25 traces

Parallel Tool Call Check: CLEAN
- Zero shared llm_response_id across events (each LLM turn produced exactly 1 tool call)
- Perfect action-observation interleaving -- no consecutive actions or observations
- No tool_calls arrays, no parallel batching of any kind
- All 25 conversations completed normally with a finish action

Agent Server images for this PR
• GHCR package: https://github.com/OpenHands/agent-sdk/pkgs/container/agent-server
Variants & Base Images
- eclipse-temurin:17-jdk
- nikolaik/python-nodejs:python3.13-nodejs22
- golang:1.21-bookworm

Pull (multi-arch manifest)
```
# Each variant is a multi-arch manifest supporting both amd64 and arm64
docker pull ghcr.io/openhands/agent-server:bda3d3c-python
```
Run
All tags pushed for this build
About Multi-Architecture Support
• The default tag (bda3d3c-python) is a multi-arch manifest supporting both amd64 and arm64
• Platform-specific tags (e.g., bda3d3c-python-amd64) are also available if needed