
feat: browser-use tool migration [0.9.0]#352

Open
shanyu-strix wants to merge 40 commits into usestrix:main from shanyu-strix:feat/browseruse

Conversation


@shanyu-strix shanyu-strix commented Mar 11, 2026

feat: browser-use tool migration

TL;DR

Ripped out the old Playwright browser tool and replaced it with browser-use. Agents can now browse the web with natural language ("go to example.com and find the login form") or precise commands (click element #5, set a cookie, take a screenshot). Works both inside the sandbox container and against your local Chrome.

[screen recording attached: 2026-03-11 11 21 13]

What's new

Two ways to use the browser:

  • NEW 🌟 action="run" + a natural language task → browser-use Agent handles it end-to-end
  • action="click", "screenshot", "cookies", etc. → direct CDP commands, no LLM needed

Two modes:

  • Sandboxed (default) — Chromium runs inside the Docker container, traffic goes through Caido proxy
  • Local (use_local=True) — attaches to your system Chrome, keeps your login sessions and cookies [⚠️EXPERIMENTAL]

Self-healing sessions:

  • If Chromium crashes (it will), a watchdog restarts it and the Python layer reconnects automatically. A crash mid-task means ~15-30s of recovery, not a dead session.
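
On the Python side, that reconnect logic is essentially a backoff poll against the CDP endpoint. A minimal sketch, with the caveat that `wait_for_cdp` and its parameters are hypothetical and not the PR's actual code:

```python
import time


def wait_for_cdp(probe, attempts: int = 10, delay: float = 1.0, backoff: float = 1.5) -> int:
    """Poll until a CDP endpoint answers again, with exponential backoff.

    `probe` is any zero-arg callable that returns True once Chromium's
    debug port is reachable (e.g. an HTTP GET of /json/version).
    Returns the number of attempts used, or raises TimeoutError.
    """
    for attempt in range(1, attempts + 1):
        if probe():
            return attempt
        time.sleep(delay)
        delay *= backoff  # back off so a restarting Chromium isn't hammered
    raise TimeoutError("CDP endpoint did not come back")
```

With one-second initial delay and 1.5x backoff, ten attempts span roughly the 15-30s recovery window described above.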

TODO

  • Add ability for the agent to select extra outputs from the agentic run
  • Add authentication to CDP
  • Harden authentication for tool server
  • Add timeouts

Notable decisions

socat for CDP — Chromium only binds its debug port to 127.0.0.1, even if you tell it not to. Docker port mapping needs 0.0.0.0. socat forwards between the two. Standard workaround.
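
What the forwarder does can be sketched in pure asyncio for illustration; the PR itself uses socat (and later a dedicated auth proxy), so none of this code is from the PR:

```python
import asyncio


async def forward(listen_host: str, listen_port: int,
                  target_host: str, target_port: int) -> asyncio.AbstractServer:
    """Accept connections on an externally visible interface and pipe
    bytes to the loopback-only CDP port, like the socat hop described
    above. Illustrative only."""

    async def pipe(reader: asyncio.StreamReader, writer: asyncio.StreamWriter) -> None:
        try:
            while data := await reader.read(65536):
                writer.write(data)
                await writer.drain()
        finally:
            writer.close()

    async def handle(client_reader, client_writer):
        upstream_reader, upstream_writer = await asyncio.open_connection(
            target_host, target_port
        )
        # Shuttle bytes in both directions until either side closes.
        await asyncio.gather(
            pipe(client_reader, upstream_writer),
            pipe(upstream_reader, client_writer),
            return_exceptions=True,
        )

    return await asyncio.start_server(handle, listen_host, listen_port)
```

socat collapses all of this into a one-liner, which is why it is the standard workaround.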

WebSocket URL rewriting — Chromium reports its WebSocket URL with the container-internal address. We rewrite it to the Docker-mapped host:port so connections from outside the container actually work.
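
The rewrite amounts to swapping the netloc in the webSocketDebuggerUrl field that CDP's /json/version endpoint returns. A minimal sketch; the function name and signature are illustrative, not the PR's actual helper:

```python
import json
from urllib.parse import urlsplit, urlunsplit


def rewrite_ws_url(version_json: str, public_host: str, public_port: int) -> str:
    """Rewrite the webSocketDebuggerUrl reported by CDP so it points at
    the Docker-mapped host:port instead of the container-internal
    address. Illustrative sketch, not the PR's code."""
    info = json.loads(version_json)
    parts = urlsplit(info["webSocketDebuggerUrl"])
    # Replace only the host:port; scheme and /devtools/... path survive.
    rewritten = parts._replace(netloc=f"{public_host}:{public_port}")
    return urlunsplit(rewritten)
```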

LiteLLM adapter — browser-use expects a specific chat model interface. Rather than hardcoding a provider, ChatLiteLLM routes through litellm so any model string (anthropic/claude-sonnet-4-20250514, openrouter/google/gemini-2.0-flash-001, etc.) just works.
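
litellm model strings carry the provider as a prefix, which is why any string routes without adapter changes. A tiny illustration — the helper name is ours, and litellm performs this parsing internally; the adapter simply passes the full string through:

```python
def split_model_string(model: str) -> tuple[str, str]:
    """Split a litellm model string into (provider, model-id).

    Some providers nest a second level, e.g.
    'openrouter/google/gemini-2.0-flash-001' -> provider 'openrouter'.
    Shown only to illustrate the routing; ChatLiteLLM forwards the
    unmodified string to litellm.
    """
    provider, _, rest = model.partition("/")
    return provider, rest
```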

Contributor

greptile-apps bot commented Mar 11, 2026

Greptile Summary

This PR replaces the Playwright-based browser tool with browser-use, adding a natural-language agent mode (action="run") alongside all the existing direct CDP commands. It introduces a clean ChatLiteLLM adapter so any litellm-compatible model string works out of the box, wires up an authenticated CDP proxy (proxy.py) instead of the previous raw socat forward, adds a self-healing Chromium watchdog in the entrypoint, and plumbs browser_cdp_port all the way from DockerRuntime through SandboxInfo to the tool. The overall architecture is solid and the credential-hygiene improvement (token read from env instead of CLI arg) is a nice touch.

Key findings:

  • Resource leak in _launch_local_browser race: The race-condition fix applied to _launch_browser (closing the discarded browser via a background task) was not applied to _launch_local_browser. When two coroutines race, the losing browser from Browser.from_system_chrome() is silently dropped without being closed.
  • Agent cleanup double-close: The finally block in run_agent iterates over ("close", "stop") with no break/return after a successful call, so both methods are always invoked on the agent. The coro := fn() walrus assignment also calls the function as a side effect before the coroutine check. The equivalent helper _close_browser avoids this with an explicit return after success.
  • Non-constant-time token comparison in proxy.py: Bearer token validation uses Python ==, which is not timing-safe. Using hmac.compare_digest is the standard fix.

Confidence Score: 3/5

  • Safe to merge after fixing the local-browser resource leak; the other two issues are low-severity cleanup concerns.
  • The PR is a substantial, well-structured migration. The resource leak in _launch_local_browser is a real bug (it's the same class of issue already fixed for _launch_browser), but it only triggers under concurrent-launch conditions for the experimental local-Chrome mode. The double-close issue is low-risk because all errors are caught. The timing attack is minor given the Docker-internal context. No data-loss, auth-bypass, or crash-path issues were found in the primary sandboxed flow.
  • strix/tools/browser/browser_manager.py: _launch_local_browser needs the same race-condition fix already applied to _launch_browser.

Important Files Changed

  • strix/tools/browser/browser_manager.py: Core session lifecycle management for browser-use; the race-condition fix applied to _launch_browser was not applied symmetrically to _launch_local_browser, leaving a resource leak when two coroutines race to create a local session.
  • strix/tools/browser/browser_actions.py: Implements all browser actions (agent run + direct CDP commands); agent cleanup loop has no early-exit after first successful close()/stop() call, causing both methods to always be invoked.
  • containers/proxy.py: New authenticated CDP proxy (replaces socat); correctly strips the auth token before forwarding and handles both HTTP and WebSocket upgrades; token comparison uses == rather than a constant-time digest.
  • containers/docker-entrypoint.sh: Adds full Chromium lifecycle management (launch, CDP readiness wait, auth-proxy startup, 10-second watchdog loop with configurable restart cap); --token flag removed from tool-server invocation in favour of env-var injection, improving credential hygiene.
  • containers/healthcheck.sh: New healthcheck that probes tool server, Caido, and Chromium CDP on the internal port (no auth needed); correctly avoids the unauthenticated-proxy problem by using the internal port directly.
  • strix/tools/browser/litellm/chat.py: Clean LiteLLM adapter implementing BaseChatModel; covers structured output via response_format, maps all major litellm error types to browser-use exceptions, and handles extended thinking via reasoning_content.
  • strix/tools/browser/litellm/serializer.py: Serializes browser-use message types to litellm-compatible OpenAI format; handles UserMessage, SystemMessage, and AssistantMessage (including tool_calls); unknown message types are silently skipped with no warning (previously flagged).
  • strix/runtime/tool_server.py: Added /cdp/version authenticated endpoint and _check_cdp_health() helper; removed --token CLI arg in favour of env-var; _run_tool still uses asyncio.to_thread which would mishandle async tools if ever called via the HTTP endpoint, but browser_actions (async) has sandbox_execution=False so it is never forwarded to the tool server in normal flow.
  • strix/runtime/docker_runtime.py: Added _browser_cdp_port allocation and recovery; browser_cdp_port is now properly included in SandboxInfo and propagated through to agents; no issues found.
  • containers/Dockerfile: Adds socat to apt package list (kept for compatibility despite socat being replaced by the Python auth proxy), copies healthcheck.sh and proxy.py, and wires up HEALTHCHECK; straightforward and correct.
  • strix/interface/tool_components/browser_renderer.py: New Textual renderer for browser_actions; covers all actions with distinct visual styles and concise summaries; no logic issues found.

Comments Outside Diff (2)

  1. strix/tools/browser/browser_actions.py, lines 243-249

    Agent cleanup always calls both close() and stop()

    The for name in ("close", "stop") loop has no break or return after a successful cleanup, so both methods are always attempted. More importantly, coro := fn() calls fn() as a side effect in the condition expression — if close() is synchronous, it executes immediately but asyncio.iscoroutine returns False, the loop falls through, and stop() is called as well.

    Compare with _close_browser in browser_manager.py (line 198–212), which has an explicit return after a successful close to stop trying further methods.

    The fix is to break (or return) after successfully closing:

```python
for name in ("close", "stop"):
    if fn := getattr(agent, name, None):
        try:
            if asyncio.iscoroutine(coro := fn()):
                await asyncio.wait_for(coro, timeout=5)
        except Exception:  # noqa: BLE001,S110
            pass
        else:
            break  # stop after the first successful cleanup call
```
  2. containers/proxy.py, lines 16-19

    Non-constant-time token comparison

    Using Python's == operator for bearer-token comparison is susceptible to timing attacks — an attacker can distinguish valid-prefix guesses from fully-wrong guesses by measuring response latency. The standard fix is hmac.compare_digest, which is constant-time.

    While the proxy listens only inside the Docker network, defence-in-depth is worthwhile for an auth gate.

Prompt To Fix All With AI
This is a comment left during a code review.
Path: strix/tools/browser/browser_manager.py
Line: 329-330

Comment:
**Local browser leaked on concurrent-launch race**

`_launch_local_browser` has the same race condition that was fixed for `_launch_browser` (lines 307-310), but the fix was not applied here.

Between the first guard check (line 319) and this second guard check (line 329), another coroutine may have already created and stored a session for the same `agent_id`. When that happens, the newly-created `browser = Browser.from_system_chrome(...)` is silently dropped without being closed, leaking its connection to the system Chrome instance.

`_launch_browser` closes the discarded browser via a background task:
```python
# _launch_browser (lines 307–310) — existing fix
if session := _manager.get(agent_id):
    task = asyncio.ensure_future(_close_browser(browser, label="race-discarded"))
    _manager.background_tasks.add(task)
    task.add_done_callback(_manager.background_tasks.discard)
    return session
```

`_launch_local_browser` should do the same:
```python
if session := _manager.get(agent_id):
    task = asyncio.ensure_future(_close_browser(browser, label="race-discarded-local"))
    _manager.background_tasks.add(task)
    task.add_done_callback(_manager.background_tasks.discard)
    return session
```

How can I resolve this? If you propose a fix, please make it concise.

---

This is a comment left during a code review.
Path: strix/tools/browser/browser_actions.py
Line: 243-249

Comment:
**Agent cleanup always calls both `close()` and `stop()`**

The `for name in ("close", "stop")` loop has no `break` or `return` after a successful cleanup, so both methods are always attempted. More importantly, `coro := fn()` calls `fn()` as a side effect in the condition expression — if `close()` is synchronous, it executes immediately but `asyncio.iscoroutine` returns `False`, the loop falls through, and `stop()` is called as well.

Compare with `_close_browser` in `browser_manager.py` (line 198–212), which has an explicit `return` after a successful close to stop trying further methods.

The fix is to `break` (or `return`) after successfully closing:
```python
for name in ("close", "stop"):
    if fn := getattr(agent, name, None):
        try:
            if asyncio.iscoroutine(coro := fn()):
                await asyncio.wait_for(coro, timeout=5)
        except Exception:  # noqa: BLE001,S110
            pass
        else:
            break  # stop after the first successful cleanup call
```

How can I resolve this? If you propose a fix, please make it concise.

---

This is a comment left during a code review.
Path: containers/proxy.py
Line: 16-19

Comment:
**Non-constant-time token comparison**

Using Python's `==` operator for bearer-token comparison is susceptible to timing attacks — an attacker can distinguish valid-prefix guesses from fully-wrong guesses by measuring response latency. The standard fix is `hmac.compare_digest`, which is constant-time:

```suggestion
import hmac

def _is_authorized(req: web.Request) -> bool:
    auth_header: str = req.headers.get("Authorization", "")
    token_param: str | None = req.query.get("token")
    expected = TOKEN.encode()
    header_ok = auth_header.startswith("Bearer ") and hmac.compare_digest(
        auth_header[len("Bearer "):].encode(), expected
    )
    param_ok = token_param is not None and hmac.compare_digest(token_param.encode(), expected)
    return header_ok or param_ok
```

While the proxy listens only inside the Docker network, defence-in-depth is worthwhile for an auth gate.

How can I resolve this? If you propose a fix, please make it concise.

Last reviewed commit: e827141

@shanyu-strix shanyu-strix marked this pull request as draft March 11, 2026 19:09
@shanyu-strix
Author

@greptileai please review

Enhances browser command robustness by gracefully handling cases where
the DOM state is not yet available, returning an informative error
instead of asserting.

Updates the container healthcheck to directly probe Chromium's internal
CDP port, simplifying the check by removing previous indirection.
@shanyu-strix
Author

@greptileai fixed, please check

@shanyu-strix
Author

@greptileai review

@shanyu-strix shanyu-strix marked this pull request as ready for review March 12, 2026 17:49
@shanyu-strix
Author

@greptileai check please

@shanyu-strix
Author

@greptileai please check

pyproject.toml Outdated
Comment on lines +271 to +285
"strix/tools/browser/**/*.py" = [
"ARG002", # Unused method argument (interface methods may not use all args)
"PLR0911", # Too many return statements (dispatchers and complex browser logic)
"PLR0912", # Too many branches (dispatch functions)
"PLR0915", # Too many statements (complex browser task handling)
]
"strix/interface/tool_components/browser_renderer.py" = [
"PLR0911", # Too many return statements (action dispatcher)
"PLR0912", # Too many branches
"PLR0915", # Too many statements
]
"strix/telemetry/tracer.py" = [
"PLR0912", # Too many branches (save_run_data is legitimately complex)
"PLR0915", # Too many statements
]
Member

remove those

pyproject.toml Outdated
Comment on lines +158 to +163
"browser_use",
"browser_use.*",
"cdp_use",
"cdp_use.*",
"aiohttp",
"aiohttp.*",
Member

why duplicated? keep ones with * only

```python
from .python import *  # noqa: F403
from .terminal import *  # noqa: F403

if not DISABLE_BROWSER:
```
Member

revert this edit

Comment on lines +108 to +113
```python
# Propagate agent_id so tools can scope per-agent resources (e.g. browser instances).
# NOTE: This is needed for browser_use tools to work correctly.
agent_id = getattr(agent_state, "agent_id", None) if agent_state else None
if agent_id:
    set_current_agent_id(agent_id)
```

Member

isn't this already done? I think we already provide "state" obj to all tools, which has the agent id

Author

Let me check
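
For background on the mechanism under discussion: per-agent scoping is commonly implemented with a ContextVar, roughly as below. The helper names mirror the diff above, but the implementation shown is an illustrative sketch, not the PR's code:

```python
import contextvars
from typing import Optional

# One ContextVar per process; each asyncio task sees its own value, so
# concurrent agents don't clobber each other's id.
_current_agent_id: contextvars.ContextVar[Optional[str]] = contextvars.ContextVar(
    "current_agent_id", default=None
)


def set_current_agent_id(agent_id: str) -> None:
    """Record which agent owns the current call chain."""
    _current_agent_id.set(agent_id)


def get_current_agent_id() -> Optional[str]:
    """Tools call this to scope per-agent resources (e.g. one browser per agent)."""
    return _current_agent_id.get()
```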

Comment on lines +397 to +406
```python
def llm_supports_vision() -> bool:
    try:
        import litellm

        from strix.config.config import resolve_llm_config

        model, _, _ = resolve_llm_config()
        return bool(model and litellm.supports_vision(model))
    except Exception:  # noqa: BLE001
        return False
```
Member

Not a good idea to make a function for this...

Comment on lines +88 to +114
```python
def _parse_usage(response: Any) -> ChatInvokeUsage | None:
    """Extract token usage from a litellm response."""
    usage = getattr(response, "usage", None)
    if usage is None:
        return None

    prompt_tokens = getattr(usage, "prompt_tokens", 0) or 0
    completion_tokens = getattr(usage, "completion_tokens", 0) or 0

    prompt_cached = getattr(usage, "cache_read_input_tokens", None)
    cache_creation = getattr(usage, "cache_creation_input_tokens", None)

    if prompt_cached is None:
        details = getattr(usage, "prompt_tokens_details", None)
        if details:
            prompt_cached = getattr(details, "cached_tokens", None)

    return ChatInvokeUsage(
        prompt_tokens=prompt_tokens,
        prompt_cached_tokens=int(prompt_cached) if prompt_cached is not None else None,
        prompt_cache_creation_tokens=int(cache_creation)
        if cache_creation is not None
        else None,
        prompt_image_tokens=None,
        completion_tokens=completion_tokens,
        total_tokens=prompt_tokens + completion_tokens,
    )
```
Member

Do we need to get usage data in case of having browser use internally?
I think we can just get rid of it

Author

Don't we want to be tracking the token spend from the browser agent?

Comment on lines +25 to +26
```python
def _strip_token(url: str) -> str:
    return re.sub(r"[?&]token=[^&]*", "", url).replace("?&", "?").rstrip("?")
```
Member

why are we parsing it manually? why not just do it like the tool server, and use uvicorn/fastapi for the whole file

f"{CONTAINER_CAIDO_PORT}/tcp": self._caido_port,
},
cap_add=["NET_ADMIN", "NET_RAW"],
shm_size="256m",
Member

I think this should be removed
