Commit d8b2d2d

Anthropic CUA Template update (Playwright -> Computer Controls) (#72)
## Anthropic Computer Use Template Overhaul

This PR overhauls both the TypeScript and Python Anthropic Computer Use templates to use **Kernel's Computer Controls API** instead of Playwright for all browser interactions.

### Why This Change

The previous implementation used Playwright directly, which required maintaining browser connections and handling lower-level browser automation. By migrating to Kernel's Computer Controls API, users get:

- **Native integration** with Kernel's browser infrastructure
- **Built-in replay recording** for debugging and auditing
- **Consistent API** across all Kernel computer use templates
- **Simplified session management** with automatic cleanup

### Architecture Overview

```
┌─────────────────────────────────────────────────────────────┐
│  Entry Point (index.ts / main.py)                           │
│  - Defines the Kernel app and action                        │
│  - Creates browser session with KernelBrowserSession        │
│  - Invokes the sampling loop                                │
└─────────────────────────────────────────────────────────────┘
                              │
                              ▼
┌─────────────────────────────────────────────────────────────┐
│  Session Manager (session.ts / .py)                         │
│  - Manages browser lifecycle (create/delete)                │
│  - Handles replay recording (start/stop/poll for URL)       │
│  - Configures viewport (1024x768 @ 60Hz)                    │
└─────────────────────────────────────────────────────────────┘
                              │
                              ▼
┌─────────────────────────────────────────────────────────────┐
│  Sampling Loop (loop.ts / .py)                              │
│  - Implements the Anthropic prompt loop                     │
│  - Manages conversation history                             │
│  - Routes tool calls to ToolCollection                      │
└─────────────────────────────────────────────────────────────┘
                              │
                              ▼
┌─────────────────────────────────────────────────────────────┐
│  Tool Collection (tools/)                                   │
│  - ComputerTool: Mouse, keyboard, screenshots via Kernel    │
│  - Maps Anthropic actions to Kernel Computer Controls API   │
│  - Tracks last known mouse position for drag operations     │
└─────────────────────────────────────────────────────────────┘
```

### File Structure (TypeScript)

```
anthropic-computer-use/
├── index.ts                  # Entry point - defines Kernel app and cua-task action
├── session.ts                # KernelBrowserSession - manages browser lifecycle + replays
├── loop.ts                   # Anthropic sampling loop with tool routing
├── tools/
│   ├── collection.ts         # ToolCollection - routes tool calls, manages versions
│   ├── computer.ts           # ComputerTool - implements all mouse/keyboard actions
│   ├── types/
│   │   └── computer.ts       # TypeScript types for actions and results
│   └── utils/
│       ├── keyboard.ts       # Key mapping utilities
│       └── validator.ts      # Coordinate validation
├── types/
│   └── beta.ts               # Anthropic beta API types
├── utils/
│   ├── message-processing.ts # Prompt caching, image filtering
│   └── tool-results.ts       # Format tool results for API
└── README.md
```

The Python template follows the same structure with equivalent modules.

### Key Components

#### 1. KernelBrowserSession (`session.ts` / `session.py`)

Manages the browser lifecycle as a context manager:

```typescript
const session = new KernelBrowserSession(kernel, {
  stealth: true,
  recordReplay: true, // Optional: capture video replay
});

await session.start();
// ... use session.sessionId for computer controls
const info = await session.stop();
// info.replayViewUrl contains the video URL if recording was enabled
```

Features:

- Automatic browser creation with configurable viewport (1024x768 @ 60Hz)
- Optional replay recording with grace period before stopping
- Polls for replay URL after stopping
- Automatic cleanup on exit

#### 2. ComputerTool (`tools/computer.ts` / `tools/computer.py`)

Maps Anthropic's computer use actions to Kernel's Computer Controls API:

| Anthropic Action | Kernel API |
|-----------------|------------|
| `left_click`, `right_click`, `double_click` | `computer.clickMouse()` |
| `mouse_move` | `computer.moveMouse()` |
| `left_click_drag` | `computer.dragMouse()` |
| `type` | `computer.typeText()` |
| `key` | `computer.pressKey()` |
| `scroll` | `computer.scroll()` |
| `screenshot` | `computer.captureScreenshot()` |

Key implementation details:

- Maintains `lastMousePosition` to support drag operations from current position
- Maps Anthropic key names to Kernel/xdotool format
- Returns base64-encoded screenshots after each action
- Supports both `computer_use_20241022` and `computer_use_20250124` API versions

#### 3. Sampling Loop (`loop.ts` / `loop.py`)

Implements the Anthropic computer use prompt loop:

- Sends messages to Claude with computer use tools
- Processes tool calls and executes them via ToolCollection
- Supports thinking mode with configurable budget
- Handles prompt caching for efficiency

### New Features

#### Replay Recording

Users can enable video replay recording by passing `record_replay: true` in the payload:

```bash
kernel invoke ts-anthropic-cua cua-task --payload '{"query": "...", "record_replay": true}'
```

The response includes a `replay_url` field with a link to view the recorded session.

### Known Limitations

**Cursor Position**: The `cursor_position` action is not supported with Kernel's Computer Controls API. If the model attempts to use this action, an error is returned. This is a known limitation that does not significantly impact most workflows, as the model tracks cursor position through screenshots.

### Testing

Both templates have been tested with the magnitasks.com Kanban board task, which exercises:

- Navigation and clicking
- Drag-and-drop (`left_click_drag`)
- Multiple sequential actions

### Updated Documentation

- Template READMEs updated with setup, usage, replay recording, and limitations
- QA command updated with new test task
- CLI post-install instructions updated with new example

<!-- CURSOR_SUMMARY -->
---

> [!NOTE]
> Overhauls the Anthropic Computer Use templates to use Kernel’s Computer Controls API instead of Playwright, with built-in browser session management and optional replay recording.
>
> - Replace Playwright automation with Kernel controls in both TS (`tools/computer.ts`, `loop.ts`, `session.ts`, `index.ts`) and Python (`tools/computer.py`, `loop.py`, `session.py`, `main.py`) templates
> - Add `KernelBrowserSession` to manage browser lifecycle, live view, and replays (configurable viewport `1024x768@60Hz`; stop/poll for `replay_url`)
> - Update sampling loops to construct `ToolCollection` with Kernel client and `sessionId`; handle thinking blocks and tool_use routing; enable prompt caching
> - Implement comprehensive key/mouse/scroll mappings to Kernel APIs; drop unsupported `cursor_position`; track last mouse position for drags; standardize typing delay and screenshot flow
> - Bump SDKs (`@onkernel/sdk` / `kernel` to `0.24.0`); remove Playwright deps and code paths
> - Refresh READMEs with setup/usage and replay instructions; adjust QA/template invoke commands to magnitasks task and optional `record_replay` flag
> - Update template defaults in `pkg/create/templates.go` to new invoke payloads
>
> <sup>Written by [Cursor Bugbot](https://cursor.com/dashboard?tab=bugbot) for commit ac3baaa. This will update automatically on new commits. Configure [here](https://cursor.com/dashboard?tab=bugbot).</sup>
<!-- /CURSOR_SUMMARY -->

---------

Co-authored-by: Tanmay Sardesai <tanmay@onkernel.com>
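The action table and the `lastMousePosition` note in the ComputerTool section can be illustrated with a small Python sketch. This is not the template's code: `RecordingComputer` is a hypothetical stand-in that only records calls, and the snake_case method names and keyword arguments (`click_mouse`, `drag_mouse`, `num_clicks`, `path`) are assumed renderings of the camelCase names in the table, not confirmed Kernel SDK signatures.

```python
from typing import Any, Optional, Tuple

Point = Tuple[int, int]

# Button and click count per Anthropic click action (assumed parameters).
CLICK_ACTIONS = {
    "left_click": ("left", 1),
    "right_click": ("right", 1),
    "double_click": ("left", 2),
}


class RecordingComputer:
    """Hypothetical stand-in for a computer-controls client: it only records calls."""

    def __init__(self) -> None:
        self.calls = []

    def call(self, method: str, **kwargs: Any) -> None:
        self.calls.append((method, kwargs))


class ComputerToolSketch:
    """Routes Anthropic-style actions and remembers the last mouse position."""

    def __init__(self, computer: RecordingComputer) -> None:
        self.computer = computer
        self.last_mouse_position: Point = (0, 0)

    def run(
        self,
        action: str,
        coordinate: Optional[Point] = None,
        start_coordinate: Optional[Point] = None,
    ) -> None:
        if action == "cursor_position":
            # Mirrors the documented limitation: the action is rejected.
            raise ValueError("cursor_position is not supported")
        if coordinate is None:
            raise ValueError(f"{action} requires a coordinate")

        if action in CLICK_ACTIONS:
            button, clicks = CLICK_ACTIONS[action]
            self.computer.call(
                "click_mouse",
                x=coordinate[0], y=coordinate[1], button=button, num_clicks=clicks,
            )
        elif action == "mouse_move":
            self.computer.call("move_mouse", x=coordinate[0], y=coordinate[1])
        elif action == "left_click_drag":
            # Without an explicit start, drag from wherever the mouse was last placed.
            start = start_coordinate if start_coordinate is not None else self.last_mouse_position
            self.computer.call("drag_mouse", path=[start, coordinate])
        else:
            raise ValueError(f"unsupported action: {action}")

        # Every successful pointer action updates the tracked position.
        self.last_mouse_position = coordinate


tool = ComputerToolSketch(RecordingComputer())
tool.run("mouse_move", coordinate=(10, 20))
tool.run("left_click_drag", coordinate=(30, 40))
print(tool.computer.calls[-1])
```

Tracking the position on the client side is what lets a drag request that carries only a destination still produce a two-point path.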
1 parent 60432fe commit d8b2d2d

File tree

15 files changed: +990 -325 lines changed

.cursor/commands/qa.md

Lines changed: 2 additions & 2 deletions
@@ -230,7 +230,7 @@ Once all deployments are complete, present the human with these invoke commands
 kernel invoke ts-basic get-page-title --payload '{"url": "https://www.google.com"}'
 kernel invoke ts-captcha-solver test-captcha-solver
 kernel invoke ts-stagehand teamsize-task --payload '{"company": "Kernel"}'
-kernel invoke ts-anthropic-cua cua-task --payload '{"query": "Return the first url of a search result for NYC restaurant reviews Pete Wells"}'
+kernel invoke ts-anthropic-cua cua-task --payload '{"query": "Go to http://magnitasks.com, Click the Tasks option in the left-side bar, and move the 5 items in the To Do and In Progress items to the Done section of the Kanban board. You are done successfully when the items are moved.", "record_replay": true}'
 kernel invoke ts-magnitude mag-url-extract --payload '{"url": "https://en.wikipedia.org/wiki/Special:Random"}'
 kernel invoke ts-openai-cua cua-task --payload '{"task": "Go to https://news.ycombinator.com and get the top 5 articles"}'
 kernel invoke ts-gemini-cua gemini-cua-task --payload '{"startingUrl": "https://www.magnitasks.com/", "instruction": "Click the Tasks option in the left-side bar, and move the 5 items in the To Do and In Progress items to the Done section of the Kanban board? You are done successfully when the items are moved."}'
@@ -240,7 +240,7 @@ kernel invoke ts-claude-agent-sdk agent-task --payload '{"task": "Go to https://
 kernel invoke python-basic get-page-title --payload '{"url": "https://www.google.com"}'
 kernel invoke python-captcha-solver test-captcha-solver
 kernel invoke python-bu bu-task --payload '{"task": "Compare the price of gpt-4o and DeepSeek-V3"}'
-kernel invoke python-anthropic-cua cua-task --payload '{"query": "Return the first url of a search result for NYC restaurant reviews Pete Wells"}'
+kernel invoke python-anthropic-cua cua-task --payload '{"query": "Go to http://magnitasks.com, Click the Tasks option in the left-side bar, and move the 5 items in the To Do and In Progress items to the Done section of the Kanban board. You are done successfully when the items are moved.", "record_replay": true}'
 kernel invoke python-openai-cua cua-task --payload '{"task": "Go to https://news.ycombinator.com and get the top 5 articles"}'
 kernel invoke python-openagi-cua openagi-default-task -p '{"instruction": "Navigate to https://agiopen.org and click the What is Computer Use? button"}'
 kernel invoke py-claude-agent-sdk agent-task --payload '{"task": "Go to https://news.ycombinator.com and get the top 3 stories"}'

pkg/create/templates.go

Lines changed: 2 additions & 2 deletions
@@ -178,7 +178,7 @@ var Commands = map[string]map[string]DeployConfig{
 		TemplateAnthropicComputerUse: {
 			EntryPoint:    "index.ts",
 			NeedsEnvFile:  true,
-			InvokeCommand: `kernel invoke ts-anthropic-cua cua-task --payload '{"query": "Return the first url of a search result for NYC restaurant reviews Pete Wells"}'`,
+			InvokeCommand: `kernel invoke ts-anthropic-cua cua-task --payload '{"query": "Navigate to http://magnitasks.com and click on Tasks in the sidebar"}'`,
 		},
 		TemplateMagnitude: {
 			EntryPoint:    "index.ts",
@@ -220,7 +220,7 @@ var Commands = map[string]map[string]DeployConfig{
 		TemplateAnthropicComputerUse: {
 			EntryPoint:    "main.py",
 			NeedsEnvFile:  true,
-			InvokeCommand: `kernel invoke python-anthropic-cua cua-task --payload '{"query": "Return the first url of a search result for NYC restaurant reviews Pete Wells"}'`,
+			InvokeCommand: `kernel invoke python-anthropic-cua cua-task --payload '{"query": "Navigate to http://magnitasks.com and click on Tasks in the sidebar"}'`,
 		},
 		TemplateOpenAIComputerUse: {
 			EntryPoint:    "main.py",
Lines changed: 43 additions & 3 deletions
@@ -1,7 +1,47 @@
 # Kernel Python Sample App - Anthropic Computer Use
 
-This is a simple Kernel application that implements a prompt loop using Anthropic Computer Use.
+This is a Kernel application that implements a prompt loop using Anthropic Computer Use with Kernel's Computer Controls API.
 
-It generally follows the [Anthropic Reference Implementation](https://github.com/anthropics/anthropic-quickstarts/tree/main/computer-use-demo) but replaces `xodotool` and `gnome-screenshot` with Playwright.
+It generally follows the [Anthropic Reference Implementation](https://github.com/anthropics/anthropic-quickstarts/tree/main/computer-use-demo) but uses Kernel's Computer Controls API instead of `xdotool` and `gnome-screenshot`.
 
-See the [docs](https://www.kernel.sh/docs/quickstart) for information.
+## Setup
+
+1. Get your API keys:
+   - **Kernel**: [dashboard.onkernel.com](https://dashboard.onkernel.com)
+   - **Anthropic**: [console.anthropic.com](https://console.anthropic.com)
+
+2. Deploy the app:
+   ```bash
+   kernel login
+   cp .env.example .env  # Add your ANTHROPIC_API_KEY
+   kernel deploy main.py --env-file .env
+   ```
+
+## Usage
+
+```bash
+kernel invoke python-anthropic-cua cua-task --payload '{"query": "Navigate to https://example.com and describe the page"}'
+```
+
+## Recording Replays
+
+> **Note:** Replay recording is only available to Kernel users on paid plans.
+
+Add `"record_replay": true` to your payload to capture a video of the browser session:
+
+```bash
+kernel invoke python-anthropic-cua cua-task --payload '{"query": "Navigate to https://example.com", "record_replay": true}'
+```
+
+When enabled, the response will include a `replay_url` field with a link to view the recorded session.
+
+## Known Limitations
+
+### Cursor Position
+
+The `cursor_position` action is not supported with Kernel's Computer Controls API. If the model attempts to use this action, an error will be returned. This is a known limitation that does not significantly impact most computer use workflows, as the model typically tracks cursor position through screenshots.
+
+## Resources
+
+- [Anthropic Computer Use Documentation](https://docs.anthropic.com/en/docs/build-with-claude/computer-use)
+- [Kernel Documentation](https://www.kernel.sh/docs/quickstart)
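
The README's invoke commands embed a JSON payload inside a single-quoted shell argument, which is easy to get wrong once the query itself contains quotes. A minimal sketch of building that command line programmatically is shown below; `build_invoke_command` is a hypothetical helper written for this illustration, not part of the Kernel CLI or SDK. Only the `query` and `record_replay` payload keys come from the README.

```python
import json
import shlex
from typing import Optional


def build_invoke_command(
    app: str, action: str, query: str, record_replay: Optional[bool] = None
) -> str:
    """Return a shell-safe `kernel invoke` command line (hypothetical helper)."""
    payload: dict = {"query": query}
    # Only include the flag when the caller set it, matching the optional field.
    if record_replay is not None:
        payload["record_replay"] = record_replay
    # json.dumps produces the JSON body; shlex.quote makes it a single shell word.
    return " ".join(
        ["kernel", "invoke", app, action, "--payload", shlex.quote(json.dumps(payload))]
    )


print(
    build_invoke_command(
        "python-anthropic-cua",
        "cua-task",
        "Navigate to https://example.com",
        record_replay=True,
    )
)
# kernel invoke python-anthropic-cua cua-task --payload '{"query": "Navigate to https://example.com", "record_replay": true}'
```

Quoting through `shlex.quote` keeps a query such as `it's done` from terminating the single-quoted payload early.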

pkg/templates/python/anthropic-computer-use/loop.py

Lines changed: 33 additions & 30 deletions
@@ -1,25 +1,16 @@
 """
 Agentic sampling loop that calls the Anthropic API and local implementation of anthropic-defined computer use tools.
 From https://github.com/anthropics/anthropic-quickstarts/blob/main/computer-use-demo/computer_use_demo/loop.py
+Modified to use Kernel Computer Controls API instead of Playwright.
 """
 
 import os
-import platform
-from collections.abc import Callable
 from datetime import datetime
 from enum import StrEnum
 from typing import Any, cast
-from playwright.async_api import Page
-
-import httpx
-from anthropic import (
-    Anthropic,
-    AnthropicBedrock,
-    AnthropicVertex,
-    APIError,
-    APIResponseValidationError,
-    APIStatusError,
-)
+
+from kernel import Kernel
+from anthropic import Anthropic
 from anthropic.types.beta import (
     BetaCacheControlEphemeralParam,
     BetaContentBlockParam,
@@ -78,14 +69,15 @@ async def sampling_loop(
     model: str,
     messages: list[BetaMessageParam],
     api_key: str,
+    kernel: Kernel,
+    session_id: str,
     provider: APIProvider = APIProvider.ANTHROPIC,
     system_prompt_suffix: str = "",
     only_n_most_recent_images: int | None = None,
     max_tokens: int = 4096,
     tool_version: ToolVersion = "computer_use_20250124",
     thinking_budget: int | None = None,
     token_efficient_tools_beta: bool = False,
-    playwright_page: Page,
 ):
     """
     Agentic sampling loop for the assistant/tool interaction of computer use.
@@ -94,19 +86,20 @@ async def sampling_loop(
         model: The model to use for the API call
         messages: The conversation history
         api_key: The API key for authentication
+        kernel: The Kernel client instance
+        session_id: The Kernel browser session ID
         provider: The API provider (defaults to ANTHROPIC)
         system_prompt_suffix: Additional system prompt text (defaults to empty string)
         only_n_most_recent_images: Optional limit on number of recent images to keep
         max_tokens: Maximum tokens for the response (defaults to 4096)
         tool_version: Version of tools to use (defaults to V20250124)
         thinking_budget: Optional token budget for thinking
         token_efficient_tools_beta: Whether to use token efficient tools beta
-        playwright_page: The Playwright page instance for browser automation
     """
     tool_group = TOOL_GROUPS_BY_VERSION[tool_version]
     tool_collection = ToolCollection(
         *(
-            ToolCls(page=playwright_page if ToolCls.__name__.startswith("ComputerTool") else None)
+            ToolCls(kernel=kernel, session_id=session_id) if ToolCls.__name__.startswith("ComputerTool") else ToolCls()
             for ToolCls in tool_group.tools
         )
     )
@@ -252,21 +245,31 @@ def _response_to_params(
 ) -> list[BetaContentBlockParam]:
     res: list[BetaContentBlockParam] = []
     for block in response.content:
-        if isinstance(block, BetaTextBlock):
-            if block.text:
+        block_type = getattr(block, "type", None)
+
+        if block_type == "thinking":
+            thinking_block = {
+                "type": "thinking",
+                "thinking": getattr(block, "thinking", None),
+            }
+            if hasattr(block, "signature"):
+                thinking_block["signature"] = getattr(block, "signature", None)
+            res.append(cast(BetaContentBlockParam, thinking_block))
+        elif block_type == "text" or isinstance(block, BetaTextBlock):
+            if getattr(block, "text", None):
                 res.append(BetaTextBlockParam(type="text", text=block.text))
-        elif getattr(block, "type", None) == "thinking":
-            # Handle thinking blocks - include signature field
-            thinking_block = {
-                "type": "thinking",
-                "thinking": getattr(block, "thinking", None),
-            }
-            if hasattr(block, "signature"):
-                thinking_block["signature"] = getattr(block, "signature", None)
-            res.append(cast(BetaContentBlockParam, thinking_block))
+        elif block_type == "tool_use":
+            tool_use_block: BetaToolUseBlockParam = {
+                "type": "tool_use",
+                "id": block.id,
+                "name": block.name,
+                "input": block.input,
+            }
+            res.append(tool_use_block)
         else:
-            # Handle tool use blocks normally
-            res.append(cast(BetaToolUseBlockParam, block.model_dump()))
+            # Preserve unexpected block types to avoid silently dropping content
+            if hasattr(block, "model_dump"):
+                res.append(cast(BetaContentBlockParam, block.model_dump()))
     return res
 
 
@@ -334,4 +337,4 @@ def _make_api_tool_result(
 def _maybe_prepend_system_tool_result(result: ToolResult, result_text: str):
     if result.system:
         result_text = f"<system>{result.system}</system>\n{result_text}"
-    return result_text
+    return result_text
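
The reworked `_response_to_params` branches on each block's `type` and builds explicit parameter dicts. The sketch below reproduces that dispatch with plain dictionaries so it can run without the Anthropic SDK. `response_blocks_to_params` and the `SimpleNamespace` blocks are stand-ins for illustration only. The real function works on the SDK's typed blocks and additionally accepts `BetaTextBlock` instances in the text branch.

```python
from types import SimpleNamespace
from typing import Any, Dict, List


def response_blocks_to_params(blocks: List[Any]) -> List[Dict[str, Any]]:
    """Convert response content blocks into request-side parameter dicts."""
    params: List[Dict[str, Any]] = []
    for block in blocks:
        block_type = getattr(block, "type", None)

        if block_type == "thinking":
            item: Dict[str, Any] = {
                "type": "thinking",
                "thinking": getattr(block, "thinking", None),
            }
            # The signature is copied only when the block carries one.
            if hasattr(block, "signature"):
                item["signature"] = block.signature
            params.append(item)
        elif block_type == "text":
            # Empty text blocks are skipped rather than echoed back.
            if getattr(block, "text", None):
                params.append({"type": "text", "text": block.text})
        elif block_type == "tool_use":
            params.append(
                {"type": "tool_use", "id": block.id, "name": block.name, "input": block.input}
            )
        elif hasattr(block, "model_dump"):
            # Unknown block types are preserved instead of silently dropped.
            params.append(block.model_dump())
    return params


blocks = [
    SimpleNamespace(type="thinking", thinking="plan the click", signature="sig"),
    SimpleNamespace(type="text", text=""),
    SimpleNamespace(type="tool_use", id="t1", name="computer", input={"action": "screenshot"}),
]
for param in response_blocks_to_params(blocks):
    print(param["type"])
```

Building the `tool_use` dict field by field, rather than dumping the whole model, keeps any extra response-only attributes out of the next request.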
Lines changed: 61 additions & 66 deletions
@@ -1,97 +1,92 @@
 import os
-from typing import Dict, TypedDict
+from typing import Dict, Optional, TypedDict
 
 import kernel
-from kernel import Kernel
 from loop import sampling_loop
-from playwright.async_api import async_playwright
+from session import KernelBrowserSession
 
 
 class QueryInput(TypedDict):
     query: str
+    record_replay: Optional[bool]
 
 
 class QueryOutput(TypedDict):
     result: str
+    replay_url: Optional[str]
 
 
 api_key = os.getenv("ANTHROPIC_API_KEY")
 if not api_key:
     raise ValueError("ANTHROPIC_API_KEY is not set")
 
-client = Kernel()
 app = kernel.App("python-anthropic-cua")
 
+
 @app.action("cua-task")
 async def cua_task(
     ctx: kernel.KernelContext,
     payload: QueryInput,
 ) -> QueryOutput:
-    # A function that processes a user query using a browser-based sampling loop
-
-    # Args:
-    # ctx: Kernel context containing invocation information
-    # payload: An object containing a query string to process
-
-    # Returns:
-    # A dictionary containing the result of the sampling loop as a string
+    """
+    Process a user query using Anthropic Computer Use with Kernel's browser automation.
+
+    Args:
+        ctx: Kernel context containing invocation information
+        payload: An object containing:
+            - query: The task/query string to process
+            - record_replay: Optional boolean to enable video replay recording
+
+    Returns:
+        A dictionary containing:
+            - result: The result of the sampling loop as a string
+            - replay_url: URL to view the replay (if recording was enabled)
+    """
     if not payload or not payload.get("query"):
         raise ValueError("Query is required")
 
-    kernel_browser = client.browsers.create(
-        invocation_id=ctx.invocation_id, stealth=True
-    )
-    print("Kernel browser live view url: ", kernel_browser.browser_live_view_url)
-
-    try:
-        async with async_playwright() as playwright:
-            browser = await playwright.chromium.connect_over_cdp(
-                kernel_browser.cdp_ws_url
-            )
-            context = (
-                browser.contexts[0] if browser.contexts else await browser.new_context()
+    record_replay = payload.get("record_replay", False)
+
+    async with KernelBrowserSession(
+        stealth=True,
+        record_replay=record_replay,
+    ) as session:
+        print("Kernel browser live view url:", session.live_view_url)
+
+        final_messages = await sampling_loop(
+            model="claude-sonnet-4-5-20250929",
+            messages=[
+                {
+                    "role": "user",
+                    "content": payload["query"],
+                }
+            ],
+            api_key=str(api_key),
+            thinking_budget=1024,
+            kernel=session.kernel,
+            session_id=session.session_id,
+        )
+
+        if not final_messages:
+            raise ValueError("No messages were generated during the sampling loop")
+
+        last_message = final_messages[-1]
+        if not last_message:
+            raise ValueError(
+                "Failed to get the last message from the sampling loop"
             )
-            page = context.pages[0] if context.pages else await context.new_page()
-
-            # Run the sampling loop
-            final_messages = await sampling_loop(
-                model="claude-sonnet-4-20250514",
-                messages=[
-                    {
-                        "role": "user",
-                        "content": payload["query"],
-                    }
-                ],
-                api_key=str(api_key),
-                thinking_budget=1024,
-                playwright_page=page,
+
+        result = ""
+        if isinstance(last_message.get("content"), str):
+            result = last_message["content"]  # type: ignore[assignment]
+        else:
+            result = "".join(
+                block["text"]
+                for block in last_message["content"]  # type: ignore[index]
+                if isinstance(block, Dict) and block.get("type") == "text"
             )
 
-            # Extract the final result
-            if not final_messages:
-                raise ValueError("No messages were generated during the sampling loop")
-
-            last_message = final_messages[-1]
-            if not last_message:
-                raise ValueError(
-                    "Failed to get the last message from the sampling loop"
-                )
-
-            result = ""
-            if isinstance(last_message.get("content"), str):
-                result = last_message["content"]  # type: ignore[assignment]
-            else:
-                result = "".join(
-                    block["text"]
-                    for block in last_message["content"]  # type: ignore[index]
-                    if isinstance(block, Dict) and block.get("type") == "text"
-                )
-
-            return {"result": result}
-    except Exception as exc:
-        print(f"Error in sampling loop: {exc}")
-        raise
-    finally:
-        if browser is not None:
-            await browser.close()
-        client.browsers.delete_by_id(kernel_browser.session_id)
+        return {
+            "result": result,
+            "replay_url": session.replay_view_url,
+        }
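
The new entry point replaces the explicit `try`/`finally` cleanup with `async with KernelBrowserSession(...)`. The contents of `session.py` are not shown in this diff, so the following is only a sketch of how an async context manager can provide create, replay, and delete semantics. `FakeBackend` and `BrowserSessionSketch` are hypothetical names; a real session would call the Kernel SDK where the fake backend records events.

```python
import asyncio
from typing import List, Optional


class FakeBackend:
    """Hypothetical in-memory backend that records lifecycle events."""

    def __init__(self) -> None:
        self.events: List[str] = []

    def create_browser(self) -> str:
        self.events.append("create")
        return "session-123"

    def start_replay(self, session_id: str) -> str:
        self.events.append("start_replay")
        return "replay-1"

    def stop_replay(self, session_id: str, replay_id: str) -> str:
        self.events.append("stop_replay")
        return "https://replay.invalid/" + replay_id

    def delete_browser(self, session_id: str) -> None:
        self.events.append("delete")


class BrowserSessionSketch:
    """Async context manager: create on enter, stop replay and delete on exit."""

    def __init__(self, backend: FakeBackend, record_replay: bool = False) -> None:
        self.backend = backend
        self.record_replay = record_replay
        self.session_id: Optional[str] = None
        self.replay_view_url: Optional[str] = None
        self._replay_id: Optional[str] = None

    async def __aenter__(self) -> "BrowserSessionSketch":
        self.session_id = self.backend.create_browser()
        if self.record_replay:
            self._replay_id = self.backend.start_replay(self.session_id)
        return self

    async def __aexit__(self, exc_type, exc, tb) -> bool:
        try:
            if self._replay_id is not None:
                self.replay_view_url = self.backend.stop_replay(
                    self.session_id, self._replay_id
                )
        finally:
            # Deletion runs even if stopping the replay or the body raised.
            self.backend.delete_browser(self.session_id)
        return False  # do not swallow exceptions from the body


async def demo() -> BrowserSessionSketch:
    session = BrowserSessionSketch(FakeBackend(), record_replay=True)
    async with session:
        pass  # the sampling loop would run here
    return session


finished = asyncio.run(demo())
print(finished.backend.events, finished.replay_view_url)
```

In this sketch the replay URL is assigned only during `__aexit__`, so it is read after the `async with` block has exited. Whether the template's session makes `replay_view_url` available earlier, as the `return` inside the block in `main.py` assumes, cannot be determined from this diff.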

pkg/templates/python/anthropic-computer-use/pyproject.toml

Lines changed: 1 addition & 3 deletions
@@ -5,11 +5,9 @@ description = "Kernel reference app for Anthropic Computer Use"
 requires-python = ">=3.9"
 dependencies = [
     "anthropic>=0.75.0",
-    "playwright>=1.56.0",
     "python-dateutil>=2.9.0",
     "pydantic>=2.12.5",
     "typing-extensions>=4.15.0",
-    "kernel>=0.23.0",
+    "kernel>=0.24.0",
     "python-dotenv>=1.2.1",
-    "httpx>=0.28.1",
 ]
