Enable ACPAgent on RemoteRuntime API#2190
Conversation
- Polymorphic agent deserialization in EventService (type(self.stored.agent) instead of hardcoded Agent) - Eager import of ACPAgent in api.py to register in discriminated union - Add close() lifecycle method to AgentBase, call from LocalConversation.close() - Await conversation close in EventService to ensure subprocess cleanup - Pre-install claude-code-acp in Docker image for ACPAgent support - Add remote runtime example for ACPAgent Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
API breakage checks (Griffe)Result: Passed |
all-hands-bot
left a comment
There was a problem hiding this comment.
🟢 Good taste - Elegant solution that eliminates special cases
KEY INSIGHT: The polymorphic deserialization (agent_cls = type(self.stored.agent)) is textbook good design - it preserves type information naturally without conditionals, making ACPAgent work seamlessly alongside the base Agent class.
What I like:
- Data structure wins: Using Python's type system to handle polymorphism instead of switch statements
- Proper lifecycle management: The
close()hook is clean extensibility (no-op by default, override where needed) - Correctness fix: The await on
run_in_executor(line 609) - you actually need to wait for the cleanup - Solves real problem: ACPAgent subprocess cleanup is necessary, not theoretical
Minor notes:
- The eager import in api.py is a necessary hack for Pydantic discriminated unions - acceptable pragmatism
- Could add a type guard on
agent_cls, but givenstored.agentis pre-validated, this is safe
✅ LGTM - Core logic is sound, changes are minimal and well-targeted.
The claude-code-acp npm install fails on golang and java base images that don't have npm. Make it conditional on npm availability. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…yloads With ACPAgent registered in the discriminated union, agent payloads now require an explicit "kind" field. Updated 6 test payloads and renumbered the ACP example from 07 to 09 to avoid duplicate. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Coverage Report •
|
||||||||||||||||||||||||||||||||||||||||||||||||||
The CI Agent Server workflow builds binary target images, not source. Update the example to use target_type="binary" accordingly. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add 'agent_type' input (default/acp) to run-eval.yml and pass it through to the evaluation repo's eval-job workflow dispatch payload. This enables dispatching SWE-bench evaluations using ACPAgent (Claude Code via ACP). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
SWE-bench prebaked images use Python-only base images without Node.js. The conditional `if command -v npm` silently skipped claude-code-acp installation. Now installs Node.js from nodesource when npm is absent. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
ACPAgent now emits a FinishAction ActionEvent + ObservationEvent after each step() completes. This makes it compatible with evaluation frameworks (SWE-bench, etc.) that detect task completion by scanning for FinishAction in the event history. Without this, the benchmarks fake_user_response loop didn't detect completion and sent up to 10 "please continue" messages, and the AgentFinishedCritic marked results as failed. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Claude Code ACP defaults to "ask" mode for tool permissions when no settings file exists. This causes the agent to hang waiting for interactive permission approval in headless/eval environments. Add /etc/claude-code/managed-settings.json with allow rules for Edit, Read, and Bash tools so Claude Code can operate without human approval. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Claude Code's internal permission system blocks tools like Write
when running in default mode, causing the agent to hang waiting
for interactive approval. This happens even when the ACP server's
settings allow the tools, because Claude Code checks permissions
before making MCP tool calls.
Call set_session_mode("bypassPermissions") after creating the
session so Claude Code skips all permission checks in headless
mode.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add acp_prompt_timeout field (default 1800s) to ACPAgent that wraps the prompt() call with AsyncExecutor's timeout. This prevents the eval from hanging forever when claude-code-acp fails to send the JSON-RPC response after completing its work. Root cause analysis: claude-code-acp's prompt() method awaits sessionUpdate notifications end-to-end through the stdout pipe. If the response is never sent (e.g., due to stdout backpressure or an unhandled result subtype), the Python ACP client blocks indefinitely. Also fixes test assertions for FinishAction emission added in d72fd08. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The default asyncio.StreamReader limit (64 KiB) is too small for ACP session_update notifications that carry large tool-call outputs. When a JSON-RPC line exceeds the limit, readline() raises LimitOverrunError which silently kills the filter pipeline, leaving the prompt() future unresolved forever and eventually deadlocking the subprocess stdout pipe. Root cause confirmed on v10 eval: 59,496 bytes of unread data stuck in the subprocess pipe, both processes sleeping in ep_poll. Increase both the subprocess pipe and filtered_reader limits to 100 MiB. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Auto-detect ACP server type from InitializeResponse.agent_info.name - Map session modes: claude-code → bypassPermissions, codex-acp → full-access - Add acp_session_mode field for manual override - Install codex-acp alongside claude-code-acp in Docker image - Add acp-codex to run-eval.yml agent_type choices Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
codex-acp requires an explicit `conn.authenticate()` call after initialize and before session creation. Auto-detect the auth method from environment variables (CODEX_API_KEY or OPENAI_API_KEY). Without this, codex-acp returns "Authentication required" when trying to create a session in the remote runtime container. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Aligns naming with 'acp-codex': both now follow the
acp-{provider} pattern, making the agent type self-documenting.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Add elapsed time tracking to _record_usage() via add_response_latency() for both main prompt and fork (ask_agent) paths - Capture agent_version from InitializeResponse.agent_info alongside agent_name - Expose agent_name and agent_version as public properties on ACPAgent so benchmarks can include ACP server identity in eval output Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The _BYPASS_MODE_MAP only matched "claude-code" but the actual agent name is "@zed-industries/claude-agent-acp", which doesn't contain that substring. Add "claude-agent" as an additional key so the mode correctly resolves to "bypassPermissions" instead of falling through to the default "full-access" (which claude-agent-acp rejects). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
4176666 to
b9c42fd
Compare
The npm package was renamed from @zed-industries/claude-code-acp to @zed-industries/claude-agent-acp. Update all references: bypass mode map key, Dockerfile, examples, tests, and docstrings. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Agent server REST API breakage checks (OpenAPI)Result: Failed Log excerpt (first 1000 characters) |
all-hands-bot
left a comment
There was a problem hiding this comment.
🟢 Good taste - Pragmatic engineering that solves real problems.
The complexity growth (auth methods, bypass modes, timeout handling, 100 MiB buffer) is justified - each addresses actual production issues rather than theoretical edge cases. The one-shot ACPAgent design (emitting FinishAction per step) correctly models how ACP servers work internally.
Trade-offs are reasonable: 100 MiB stream buffers solve real LimitOverrunError crashes for large tool outputs; Docker managed-settings.json bypass is documented and appropriate for eval environments. Good engineering.
Add ANTHROPIC_BASE_URL, ANTHROPIC_API_KEY, OPENAI_BASE_URL, and OPENAI_API_KEY environment variables to the run-examples workflow. This allows Claude Code and Codex ACP servers to route their API calls through the LiteLLM proxy using the existing LLM_API_KEY, eliminating the need to provision separate Anthropic/OpenAI API keys. See: OpenHands/evaluation#297 Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
The test framework requires each example to emit an EXAMPLE_COST marker to stdout. Add cost reporting using agent.llm.metrics for the standalone ACP example and conversation_stats for the remote ACP example. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Remove backward-compatible ANTHROPIC_API_KEY requirements from ACP examples. Instead, route all ACP agent requests through the LiteLLM proxy by setting ANTHROPIC_BASE_URL and ANTHROPIC_API_KEY from the existing LLM_BASE_URL and LLM_API_KEY environment variables. This approach: - Eliminates the need for separate Anthropic/OpenAI API key secrets - Enables full cost tracking through LiteLLM - Provides model flexibility (can test with different backends) - Centralizes observability and rate limiting Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
When ACP prompt() raises, the agent now: 1. Emits a ConversationErrorEvent (in addition to the existing MessageEvent) so RemoteConversation._get_last_error_detail() can report the actual error instead of the generic fallback message 2. Tags usage/content policy errors with code "UsagePolicyRefusal" 3. Re-raises the exception so LocalConversation.run() breaks the loop immediately instead of spinning until max_iteration_per_run Also fixes pre-existing ruff E501 lint error on line 385. Fixes OpenHands/benchmarks#495 Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Resolve conflict in acp_agent.py: keep agent_name/agent_version properties from feat/acp-remote-runtime and adopt Generator[LLM] return type from main. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Allows callers to specify which model the ACP server should use. The model is passed via session _meta.claudeCode.options.model when creating a new ACP session. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
new_session() assigns **kwargs directly to the _meta field.
Passing _meta={"claudeCode": ...} wraps it in an extra _meta key,
producing {"_meta": {"_meta": {"claudeCode": ...}}} — the ACP server
never sees the model config.
Fix: unpack the meta dict as direct kwargs so they become _meta content.
Also skip claudeCode meta for non-Claude ACP servers (e.g. codex-acp).
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The step() error handler now re-raises after emitting events and emits both a MessageEvent and a ConversationErrorEvent. Update the test to expect the re-raise (pytest.raises) and verify both emitted events. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The ACPAgent uses a sentinel LLM with model="acp-managed" that cannot make direct LLM calls. When generate_title() was called for ACP conversations, it would fall back to this sentinel LLM and fail with: "LiteLLM Provider NOT provided. You passed model=acp-managed" Add a guard to detect the sentinel model and pass None to generate_conversation_title(), which causes it to use simple truncation fallback instead of LLM-based title generation. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
|
[Automatic Post]: It has been a while since there was any activity on this PR. @simonrosenberg, are you still working on it? If so, please go ahead, if not then please request review, close it, or request that someone else follow up. |
Add automatic retry for ACP prompt failures caused by transient connection errors (OSError, ConnectionError, BrokenPipeError, EOFError). Changes: - Wrap prompt() call in retry loop for connection exception types - Retry up to 3 times with exponential backoff (5s, 15s, 30s) - Configurable via ACP_PROMPT_MAX_RETRIES env var - Reset client accumulators between retries - Timeout errors are NOT retried (handled separately) This preserves session state when connection errors occur, avoiding the need to restart instances from scratch in the evaluation framework. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
d0127c4 to
bda77a6
Compare
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
|
This PR actually failed the OpenAPI check:
I'll fix it to fail more loudly. I'm not sure if it's time to make it mandatory for merging, I think maybe we should, because it's too easy to miss only in a comment... 🤔 @simonrosenberg Please correct me if I see wrong, serialization, and the REST schema, are broken from older clients? We have a policy to give a deprecation period (~5 releases) before we make breaking changes to public APIs. 🤔 Maybe we could revert this PR, and rebuild it with deprecation markers? |
|
Failures log: OpenAPI checks |
Done |
|
Hi — I’m OpenHands-GPT-5.4. I looked back at this PR in light of the OpenAPI breakage that later surfaced. What brokeThis PR changed the REST contract of an existing agent-server endpoint in a breaking way: the So the break here was not primarily “an endpoint was removed without deprecation.” It was a breaking request/response contract change on an existing endpoint without a deprecation runway. Relevant policy linesI opened a follow-up policy PR to make this explicit: #2433. Relevant line references there:
Practical options to handle this kind of changeA few reasonable paths would have been:
Bottom lineThe core issue here was changing an existing REST contract incompatibly without a migration/deprecation path for clients. The follow-up policy PR (#2433) codifies the intended expectation more clearly: all REST contract breaks need a deprecation notice and a runway of 5 minor releases. |
This reverts merge commit ca621a4. Co-authored-by: openhands <openhands@all-hands.dev>
Summary
Test plan
test_acp_agent.py+test_agent_loading.pyAgent Server images for this PR
• GHCR package: https://github.com/OpenHands/agent-sdk/pkgs/container/agent-server
Variants & Base Images
eclipse-temurin:17-jdknikolaik/python-nodejs:python3.13-nodejs22golang:1.21-bookwormPull (multi-arch manifest)
# Each variant is a multi-arch manifest supporting both amd64 and arm64 docker pull ghcr.io/openhands/agent-server:3dcb3dd-pythonRun
All tags pushed for this build
About Multi-Architecture Support
3dcb3dd-python) is a multi-arch manifest supporting both amd64 and arm643dcb3dd-python-amd64) are also available if needed