Commit 52b3aeb

majdyz and Swiftyos authored
feat(backend/sdk): Claude Agent SDK integration for CoPilot (#12103)
## Summary

Full integration of the **Claude Agent SDK** to replace the existing one-turn OpenAI-compatible CoPilot implementation with a multi-turn, tool-using AI agent.

### What changed

**Core SDK Integration** (`chat/sdk/`, new module)

- **`service.py`**: Main orchestrator. Spawns Claude Code CLI as a subprocess per user message and streams responses back via SSE. Handles conversation history compression, session lifecycle, and error recovery.
- **`response_adapter.py`**: Translates Claude Agent SDK events (text deltas, tool use, errors, result messages) into the existing CoPilot `StreamEvent` protocol so the frontend works unchanged.
- **`tool_adapter.py`**: Bridges CoPilot's MCP tools (find_block, run_block, create_agent, etc.) into the SDK's tool format. Handles schema conversion and result serialization.
- **`security_hooks.py`**: Pre/Post tool-use hooks that enforce a strict allowlist of tools, block path traversal, sandbox file operations to per-session workspace directories, cap sub-agent spawning, and prevent the model from accessing unauthorized system resources.
- **`transcript.py`**: JSONL transcript I/O utilities for the stateless `--resume` feature (see below).

**Stateless Multi-Turn Resume** (new)

- Instead of compressing conversation history via LLM on every turn (lossy and expensive), we capture Claude Code's native JSONL session transcript via a **Stop hook** callback, persist it in the DB (`ChatSession.sdkTranscript`), and restore it on the next turn via `--resume <file>`.
- This preserves full tool call/result context across turns with zero token overhead for history.
- Feature-flagged via `CLAUDE_AGENT_USE_RESUME` (default: off).
- DB migration: `ALTER TABLE "ChatSession" ADD COLUMN "sdkTranscript" TEXT`.

**Sandboxed Tool Execution** (`chat/tools/`)

- **`bash_exec.py`**: Sandboxed bash execution using bubblewrap (`bwrap`) with read-only root filesystem, per-session writable workspace, resource limits (CPU, memory, file size), and network isolation.
- **`sandbox.py`**: Shared bubblewrap sandbox infrastructure. Generates `bwrap` command lines with configurable mounts, environment, and resource constraints.
- **`web_fetch.py`**: URL fetching tool with domain allowlist, size limits, and content-type filtering.
- **`check_operation_status.py`**: Polling tool for long-running operations (agent creation, block execution) so the SDK doesn't block waiting.
- **`find_block.py`** / **`run_block.py`**: Enhanced with category filtering, optimized response size (removed raw JSON schemas), and better error handling.

**Security**

- Path traversal prevention: session IDs sanitized, all file ops confined to workspace dirs, symlink resolution.
- Tool allowlist enforcement via SDK hooks: the model cannot call arbitrary tools.
- Built-in `Bash` tool blocked via `disallowed_tools` to prevent bypassing sandboxed `bash_exec`.
- Sub-agent (`Task`) spawning capped at configurable limit (default: 10).
- CodeQL-clean path sanitization patterns.

**Streaming & Reconnection**

- SSE stream registry backed by Redis Streams for crash-resilient reconnection.
- Long-running operation tracking with TTL-based cleanup.
- Atomic message append to prevent race conditions on concurrent writes.

**Configuration** (`config.py`)

- `use_claude_agent_sdk`: master toggle (default: on)
- `claude_agent_model`: model override for SDK path
- `claude_agent_max_buffer_size`: JSON parsing buffer (10MB)
- `claude_agent_max_subtasks`: sub-agent cap (10)
- `claude_agent_use_resume`: transcript-based resume (default: off)
- `thinking_enabled`: extended thinking for Claude models

**Tests**

- `sdk/response_adapter_test.py`: 366 lines covering all event translation paths
- `sdk/security_hooks_test.py`: 165 lines covering tool blocking, path traversal, subtask limits
- `chat/model_test.py`: 214 lines covering session model serialization
- `chat/service_test.py`: Integration tests including multi-turn resume keyword recall
- `tools/find_block_test.py` / `run_block_test.py`: Extended with new tool behavior tests

## Test plan

- [x] Unit tests pass (`sdk/response_adapter_test.py`, `security_hooks_test.py`, `model_test.py`)
- [x] Integration test: multi-turn keyword recall via `--resume` (`service_test.py::test_sdk_resume_multi_turn`)
- [x] Manual E2E: CoPilot chat sessions with tool calls, bash execution, and multi-turn context
- [x] Pre-commit hooks pass (ruff, isort, black, pyright, flake8)
- [ ] Staging deployment with `claude_agent_use_resume=false` initially
- [ ] Enable resume in staging, verify transcript capture and recall

<!-- greptile_comment -->

<h2>Greptile Overview</h2>

<details><summary><h3>Greptile Summary</h3></summary>

This PR replaces the existing OpenAI-compatible CoPilot with a full Claude Agent SDK integration, introducing multi-turn conversations, stateless resume via JSONL transcripts, and sandboxed tool execution.

**Key changes:**

- **SDK integration** (`chat/sdk/`): spawns Claude Code CLI subprocess per message, translates events to frontend protocol, bridges MCP tools
- **Stateless resume**: captures JSONL transcripts via Stop hook, persists in `ChatSession.sdkTranscript`, restores with `--resume` (feature-flagged, default off)
- **Sandboxed execution**: bubblewrap sandbox for bash commands with filesystem whitelist, network isolation, resource limits
- **Security hooks**: tool allowlist enforcement, path traversal prevention, workspace-scoped file operations, sub-agent spawn limits
- **Long-running operations**: delegates `create_agent`/`edit_agent` to existing stream_registry infrastructure for SSE reconnection
- **Feature flag**: `CHAT_USE_CLAUDE_AGENT_SDK` with LaunchDarkly support, defaults to enabled

**Security issues found:**

- Path traversal validation has logic errors in `security_hooks.py:82` (tilde expansion order) and `service.py:266` (redundant `..` check)
- Config validator always prefers env var over explicit `False` value (`config.py:162`)
- Race condition in `routes.py:323`: message persisted before task registration, could duplicate on retry
- Resource limits in sandbox may fail silently (`sandbox.py:109`)

**Test coverage is strong** with 366 lines for response adapter, 165 for security hooks, and integration tests for multi-turn resume.

</details>

<details><summary><h3>Confidence Score: 3/5</h3></summary>

- This PR is generally safe but has critical security issues in path validation that must be fixed before merge
- Score reflects strong architecture and test coverage offset by real security vulnerabilities: the tilde expansion bug in `security_hooks.py` could allow sandbox escape, the race condition could cause message duplication, and the silent ulimit failures could bypass resource limits. The bubblewrap sandbox and allowlist enforcement are well-designed, but the path validation bugs need fixing. The transcript resume feature is properly feature-flagged. Overall the implementation is solid but the security issues prevent a higher score.
- Pay close attention to `backend/api/features/chat/sdk/security_hooks.py` (path traversal vulnerability), `backend/api/features/chat/routes.py` (race condition), `backend/api/features/chat/tools/sandbox.py` (silent resource limit failures), and `backend/api/features/chat/sdk/service.py` (redundant security check)

</details>

<details><summary><h3>Sequence Diagram</h3></summary>

```mermaid
sequenceDiagram
    participant Frontend
    participant Routes as routes.py
    participant SDKService as sdk/service.py
    participant ClaudeSDK as Claude Agent SDK CLI
    participant SecurityHooks as security_hooks.py
    participant ToolAdapter as tool_adapter.py
    participant CoPilotTools as tools/*
    participant Sandbox as sandbox.py (bwrap)
    participant DB as Database
    participant Redis as stream_registry

    Frontend->>Routes: POST /chat (user message)
    Routes->>SDKService: stream_chat_completion_sdk()
    SDKService->>DB: get_chat_session()
    DB-->>SDKService: session + messages

    alt Resume enabled AND transcript exists
        SDKService->>SDKService: validate_transcript()
        SDKService->>SDKService: write_transcript_to_tempfile()
        Note over SDKService: Pass --resume to SDK
    else No resume
        SDKService->>SDKService: _compress_conversation_history()
        Note over SDKService: Inject history into user message
    end

    SDKService->>SecurityHooks: create_security_hooks()
    SDKService->>ToolAdapter: create_copilot_mcp_server()
    SDKService->>ClaudeSDK: spawn subprocess with MCP server

    loop Streaming Conversation
        ClaudeSDK->>SDKService: AssistantMessage (text/tool_use)
        SDKService->>Frontend: StreamTextDelta / StreamToolInputAvailable

        alt Tool Call
            ClaudeSDK->>SecurityHooks: PreToolUse hook
            SecurityHooks->>SecurityHooks: validate path, check allowlist
            alt Tool blocked
                SecurityHooks-->>ClaudeSDK: deny
            else Tool allowed
                SecurityHooks-->>ClaudeSDK: allow
                ClaudeSDK->>ToolAdapter: call MCP tool
                alt Long-running tool (create_agent, edit_agent)
                    ToolAdapter->>Redis: register task
                    ToolAdapter->>DB: save OperationPendingResponse
                    ToolAdapter->>ToolAdapter: spawn background task
                    ToolAdapter-->>ClaudeSDK: OperationStartedResponse
                else Regular tool (find_block, bash_exec)
                    ToolAdapter->>CoPilotTools: execute()
                    alt bash_exec
                        CoPilotTools->>Sandbox: run_sandboxed()
                        Sandbox->>Sandbox: build bwrap command
                        Note over Sandbox: Network isolation,<br/>filesystem whitelist,<br/>resource limits
                        Sandbox-->>CoPilotTools: stdout, stderr, exit_code
                    end
                    CoPilotTools-->>ToolAdapter: result
                    ToolAdapter->>ToolAdapter: stash full output
                    ToolAdapter-->>ClaudeSDK: MCP response
                end
                SecurityHooks->>SecurityHooks: PostToolUse hook (log)
            end
        end

        ClaudeSDK->>SDKService: UserMessage (ToolResultBlock)
        SDKService->>ToolAdapter: pop_pending_tool_output()
        SDKService->>Frontend: StreamToolOutputAvailable
    end

    ClaudeSDK->>SecurityHooks: Stop hook
    SecurityHooks->>SDKService: transcript_path callback
    SDKService->>SDKService: read_transcript_file()
    SDKService->>DB: save transcript to session.sdkTranscript
    ClaudeSDK->>SDKService: ResultMessage (success)
    SDKService->>Frontend: StreamFinish
    SDKService->>DB: upsert_chat_session()
```

</details>

<sub>Last reviewed commit: 28c1121</sub>

<!-- greptile_other_comments_section -->

<!-- /greptile_comment -->

---------

Co-authored-by: Swifty <craigswift13@gmail.com>
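The path traversal prevention listed under **Security** (sanitized session IDs, file operations confined to workspace directories, symlink resolution) can be illustrated with a small sketch. This is not the code in `security_hooks.py`; `WORKSPACE_ROOT`, `sanitize_session_id`, and `resolve_in_workspace` are hypothetical names chosen for illustration. It expands `~` before resolving symlinks, which is one plausible way to avoid the ordering problem the review mentions.

```python
import os
import re

# Hypothetical workspace location; the real path is not shown in this commit page.
WORKSPACE_ROOT = "/tmp/copilot-workspaces"


def sanitize_session_id(session_id: str) -> str:
    """Drop every character that could form a path separator or '..'."""
    cleaned = re.sub(r"[^A-Za-z0-9_-]", "", session_id)
    if not cleaned:
        raise ValueError("session ID is empty after sanitization")
    return cleaned


def resolve_in_workspace(session_id: str, user_path: str) -> str:
    """Return the resolved path, or raise if it escapes the session workspace."""
    workspace = os.path.realpath(
        os.path.join(WORKSPACE_ROOT, sanitize_session_id(session_id))
    )
    # Expand "~" first, then resolve symlinks and ".." segments.
    expanded = os.path.expanduser(user_path)
    candidate = os.path.realpath(os.path.join(workspace, expanded))
    # commonpath compares whole path components, unlike a string-prefix check.
    if os.path.commonpath([workspace, candidate]) != workspace:
        raise PermissionError(f"path escapes workspace: {user_path!r}")
    return candidate
```

A component-wise comparison is used here because a plain `startswith` check would treat `/ws/abc-evil` as being inside `/ws/abc`.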
1 parent 965b7d3 commit 52b3aeb

32 files changed, 4187 additions and 55 deletions

autogpt_platform/backend/Dockerfile

Lines changed: 7 additions & 1 deletion
@@ -66,13 +66,19 @@ ENV POETRY_HOME=/opt/poetry \
     DEBIAN_FRONTEND=noninteractive
 ENV PATH=/opt/poetry/bin:$PATH

-# Install Python, FFmpeg, and ImageMagick (required for video processing blocks)
+# Install Python, FFmpeg, ImageMagick, and CLI tools for agent use.
+# bubblewrap provides OS-level sandbox (whitelist-only FS + no network)
+# for the bash_exec MCP tool.
 # Using --no-install-recommends saves ~650MB by skipping unnecessary deps like llvm, mesa, etc.
 RUN apt-get update && apt-get install -y --no-install-recommends \
     python3.13 \
     python3-pip \
     ffmpeg \
     imagemagick \
+    jq \
+    ripgrep \
+    tree \
+    bubblewrap \
     && rm -rf /var/lib/apt/lists/*

 COPY --from=builder /usr/local/lib/python3* /usr/local/lib/python3*
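
The Dockerfile comment describes bubblewrap as a whitelist-only filesystem sandbox with no network. As a rough sketch of what that could look like, the function below assembles a `bwrap` argument list. It is not the contents of `sandbox.py`: `READ_ONLY_DIRS` and `build_bwrap_argv` are hypothetical, and the choice of mounted directories is an assumption. The flags themselves (`--unshare-all`, `--ro-bind`, `--bind`, `--proc`, `--dev`, `--tmpfs`, `--chdir`, `--die-with-parent`, `--new-session`) are standard bubblewrap options.

```python
from typing import Sequence

# Hypothetical allowlist of host directories exposed read-only inside the sandbox.
READ_ONLY_DIRS = ("/usr", "/bin", "/lib")


def build_bwrap_argv(workspace: str, command: Sequence[str]) -> list[str]:
    """Build a bwrap command line: read-only system dirs, one writable workspace."""
    argv = [
        "bwrap",
        "--unshare-all",      # new namespaces, including network (no connectivity)
        "--die-with-parent",  # kill the sandbox if the parent process exits
        "--new-session",      # detach from the controlling terminal
    ]
    for path in READ_ONLY_DIRS:
        argv += ["--ro-bind", path, path]
    argv += [
        "--proc", "/proc",
        "--dev", "/dev",
        "--tmpfs", "/tmp",
        "--bind", workspace, workspace,  # the only writable host path
        "--chdir", workspace,
    ]
    argv += list(command)
    return argv
```

Anything not explicitly bound is simply absent inside the sandbox, which is what makes the filesystem policy an allowlist rather than a blocklist. The resulting list could be passed to `subprocess.run`, with resource limits applied separately.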

autogpt_platform/backend/backend/api/features/chat/config.py

Lines changed: 40 additions & 5 deletions
@@ -27,12 +27,11 @@ class ChatConfig(BaseSettings):
     session_ttl: int = Field(default=43200, description="Session TTL in seconds")

     # Streaming Configuration
-    max_context_messages: int = Field(
-        default=50, ge=1, le=200, description="Maximum context messages"
-    )
-
     stream_timeout: int = Field(default=300, description="Stream timeout in seconds")
-    max_retries: int = Field(default=3, description="Maximum number of retries")
+    max_retries: int = Field(
+        default=3,
+        description="Max retries for fallback path (SDK handles retries internally)",
+    )
     max_agent_runs: int = Field(default=30, description="Maximum number of agent runs")
     max_agent_schedules: int = Field(
         default=30, description="Maximum number of agent schedules"
@@ -93,6 +92,31 @@ class ChatConfig(BaseSettings):
         description="Name of the prompt in Langfuse to fetch",
     )

+    # Claude Agent SDK Configuration
+    use_claude_agent_sdk: bool = Field(
+        default=True,
+        description="Use Claude Agent SDK for chat completions",
+    )
+    claude_agent_model: str | None = Field(
+        default=None,
+        description="Model for the Claude Agent SDK path. If None, derives from "
+        "the `model` field by stripping the OpenRouter provider prefix.",
+    )
+    claude_agent_max_buffer_size: int = Field(
+        default=10 * 1024 * 1024,  # 10MB (default SDK is 1MB)
+        description="Max buffer size in bytes for Claude Agent SDK JSON message parsing. "
+        "Increase if tool outputs exceed the limit.",
+    )
+    claude_agent_max_subtasks: int = Field(
+        default=10,
+        description="Max number of sub-agent Tasks the SDK can spawn per session.",
+    )
+    claude_agent_use_resume: bool = Field(
+        default=True,
+        description="Use --resume for multi-turn conversations instead of "
+        "history compression. Falls back to compression when unavailable.",
+    )
+
     # Extended thinking configuration for Claude models
     thinking_enabled: bool = Field(
         default=True,
@@ -138,6 +162,17 @@ def get_internal_api_key(cls, v):
             v = os.getenv("CHAT_INTERNAL_API_KEY")
         return v

+    @field_validator("use_claude_agent_sdk", mode="before")
+    @classmethod
+    def get_use_claude_agent_sdk(cls, v):
+        """Get use_claude_agent_sdk from environment if not provided."""
+        # Check environment variable - default to True if not set
+        env_val = os.getenv("CHAT_USE_CLAUDE_AGENT_SDK", "").lower()
+        if env_val:
+            return env_val in ("true", "1", "yes", "on")
+        # Default to True (SDK enabled by default)
+        return True if v is None else v
+
     # Prompt paths for different contexts
     PROMPT_PATHS: dict[str, str] = {
         "default": "prompts/chat_system.md",
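
The precedence in `get_use_claude_agent_sdk` is easier to see when the branch logic is lifted into a plain function. The sketch below mirrors the validator's two branches without Pydantic, with the environment passed in as a mapping so it can be exercised directly. `resolve_use_sdk` and `resolve_use_sdk_explicit_first` are hypothetical names, and the second function is only one possible alternative ordering, not something this commit contains.

```python
import os
from typing import Mapping, Optional

TRUTHY = ("true", "1", "yes", "on")
ENV_NAME = "CHAT_USE_CLAUDE_AGENT_SDK"


def resolve_use_sdk(v: Optional[bool], environ: Mapping[str, str] = os.environ) -> bool:
    """Same branch order as the validator: a non-empty env var always wins."""
    env_val = environ.get(ENV_NAME, "").lower()
    if env_val:
        return env_val in TRUTHY
    return True if v is None else v


def resolve_use_sdk_explicit_first(
    v: Optional[bool], environ: Mapping[str, str] = os.environ
) -> bool:
    """Alternative ordering (an assumption): an explicit value beats the env var."""
    if v is not None:
        return v
    env_val = environ.get(ENV_NAME, "").lower()
    return env_val in TRUTHY if env_val else True
```

With the committed ordering, `resolve_use_sdk(False, {ENV_NAME: "true"})` evaluates to `True`, which matches the review note that an explicit `False` is overridden whenever the environment variable is set.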

autogpt_platform/backend/backend/api/features/chat/model.py

Lines changed: 54 additions & 19 deletions
@@ -334,9 +334,8 @@ async def _get_session_from_cache(session_id: str) -> ChatSession | None:
     try:
         session = ChatSession.model_validate_json(raw_session)
         logger.info(
-            f"Loading session {session_id} from cache: "
-            f"message_count={len(session.messages)}, "
-            f"roles={[m.role for m in session.messages]}"
+            f"[CACHE] Loaded session {session_id}: {len(session.messages)} messages, "
+            f"last_roles={[m.role for m in session.messages[-3:]]}"  # Last 3 roles
         )
         return session
     except Exception as e:
@@ -378,11 +377,9 @@ async def _get_session_from_db(session_id: str) -> ChatSession | None:
         return None

     messages = prisma_session.Messages
-    logger.info(
-        f"Loading session {session_id} from DB: "
-        f"has_messages={messages is not None}, "
-        f"message_count={len(messages) if messages else 0}, "
-        f"roles={[m.role for m in messages] if messages else []}"
+    logger.debug(
+        f"[DB] Loaded session {session_id}: {len(messages) if messages else 0} messages, "
+        f"roles={[m.role for m in messages[-3:]] if messages else []}"  # Last 3 roles
     )

     return ChatSession.from_db(prisma_session, messages)
@@ -433,10 +430,9 @@ async def _save_session_to_db(
                     "function_call": msg.function_call,
                 }
             )
-        logger.info(
-            f"Saving {len(new_messages)} new messages to DB for session {session.session_id}: "
-            f"roles={[m['role'] for m in messages_data]}, "
-            f"start_sequence={existing_message_count}"
+        logger.debug(
+            f"[DB] Saving {len(new_messages)} messages to session {session.session_id}, "
+            f"roles={[m['role'] for m in messages_data]}"
         )
         await chat_db.add_chat_messages_batch(
             session_id=session.session_id,
@@ -476,7 +472,7 @@ async def get_chat_session(
         logger.warning(f"Unexpected cache error for session {session_id}: {e}")

     # Fall back to database
-    logger.info(f"Session {session_id} not in cache, checking database")
+    logger.debug(f"Session {session_id} not in cache, checking database")
     session = await _get_session_from_db(session_id)

     if session is None:
@@ -493,7 +489,6 @@ async def get_chat_session(
     # Cache the session from DB
     try:
         await _cache_session(session)
-        logger.info(f"Cached session {session_id} from database")
     except Exception as e:
         logger.warning(f"Failed to cache session {session_id}: {e}")

@@ -558,6 +553,40 @@ async def upsert_chat_session(
     return session


+async def append_and_save_message(session_id: str, message: ChatMessage) -> ChatSession:
+    """Atomically append a message to a session and persist it.
+
+    Acquires the session lock, re-fetches the latest session state,
+    appends the message, and saves — preventing message loss when
+    concurrent requests modify the same session.
+    """
+    lock = await _get_session_lock(session_id)
+
+    async with lock:
+        session = await get_chat_session(session_id)
+        if session is None:
+            raise ValueError(f"Session {session_id} not found")
+
+        session.messages.append(message)
+        existing_message_count = await chat_db.get_chat_session_message_count(
+            session_id
+        )
+
+        try:
+            await _save_session_to_db(session, existing_message_count)
+        except Exception as e:
+            raise DatabaseError(
+                f"Failed to persist message to session {session_id}"
+            ) from e
+
+        try:
+            await _cache_session(session)
+        except Exception as e:
+            logger.warning(f"Cache write failed for session {session_id}: {e}")
+
+        return session
+
+
 async def create_chat_session(user_id: str) -> ChatSession:
     """Create a new chat session and persist it.

@@ -664,13 +693,19 @@ async def update_session_title(session_id: str, title: str) -> bool:
             logger.warning(f"Session {session_id} not found for title update")
             return False

-        # Invalidate cache so next fetch gets updated title
+        # Update title in cache if it exists (instead of invalidating).
+        # This prevents race conditions where cache invalidation causes
+        # the frontend to see stale DB data while streaming is still in progress.
         try:
-            redis_key = _get_session_cache_key(session_id)
-            async_redis = await get_redis_async()
-            await async_redis.delete(redis_key)
+            cached = await _get_session_from_cache(session_id)
+            if cached:
+                cached.title = title
+                await _cache_session(cached)
         except Exception as e:
-            logger.warning(f"Failed to invalidate cache for session {session_id}: {e}")
+            # Not critical - title will be correct on next full cache refresh
+            logger.warning(
+                f"Failed to update title in cache for session {session_id}: {e}"
+            )

         return True
     except Exception as e:
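
The lock-then-refetch pattern in `append_and_save_message` can be demonstrated in isolation. The sketch below replaces the database and cache with an in-memory dict and uses a per-session `asyncio.Lock`; `_store`, `_locks`, `load`, `save`, and `demo` are all hypothetical stand-ins, and the real `_get_session_lock` may well be a distributed lock rather than an in-process one.

```python
import asyncio
from collections import defaultdict

# In-memory stand-ins for the DB and the per-session lock registry (assumptions).
_store: dict[str, list[str]] = {}
_locks: defaultdict[str, asyncio.Lock] = defaultdict(asyncio.Lock)


async def load(session_id: str) -> list[str]:
    await asyncio.sleep(0)  # yield to the event loop, as real I/O would
    return list(_store.get(session_id, []))


async def save(session_id: str, messages: list[str]) -> None:
    await asyncio.sleep(0)
    _store[session_id] = messages


async def append_and_save(session_id: str, message: str) -> list[str]:
    """Read-modify-write under a lock so concurrent appends are not lost."""
    async with _locks[session_id]:
        messages = await load(session_id)  # re-fetch the latest state inside the lock
        messages.append(message)
        await save(session_id, messages)
        return messages


async def demo(n: int) -> list[str]:
    """Run n concurrent appends against a fresh session and return what was stored."""
    _store["s1"] = []
    await asyncio.gather(*(append_and_save("s1", f"m{i}") for i in range(n)))
    return _store["s1"]
```

Without the lock, several tasks could load the same snapshot before any of them saves, and later saves would overwrite earlier ones. Calling `asyncio.run(demo(5))` should leave all five messages in the store.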