Commit 2ee5eb5

Add documentation for browser_action tool and its usage
1 parent 57d9d98 commit 2ee5eb5

1 file changed: +152 -0 lines changed

@@ -0,0 +1,152 @@
# browser_action

The `browser_action` tool enables web automation and interaction via a Puppeteer-controlled browser. It allows Roo to launch browsers, navigate to websites, click elements, type text, and scroll pages with visual feedback through screenshots.

## Parameters

The tool accepts these parameters:

- `action` (required): The action to perform:
    * `launch`: Start a new browser session at a URL
    * `click`: Click at specific x,y coordinates
    * `type`: Type text via the keyboard
    * `scroll_down`: Scroll down one page height
    * `scroll_up`: Scroll up one page height
    * `close`: End the browser session
- `url` (optional): The URL to navigate to when using the `launch` action
- `coordinate` (optional): The x,y coordinates for the `click` action (e.g., "450,300")
- `text` (optional): The text to type when using the `type` action
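
The parameter list above can be read as a small schema. The following is a minimal sketch of a validator, assuming each action needs the parameter the list associates with it (`url` for `launch`, `coordinate` for `click`, `text` for `type`). The type and function names are hypothetical and are not taken from Roo's source.

```typescript
// Hypothetical types: these names are illustrative, not Roo's internals.
type BrowserActionName =
  | "launch" | "click" | "type" | "scroll_down" | "scroll_up" | "close";

interface BrowserActionParams {
  action: BrowserActionName;
  url?: string;        // used by `launch`
  coordinate?: string; // used by `click`, formatted as "x,y"
  text?: string;       // used by `type`
}

// Returns a list of problems; an empty list means the request looks valid.
function validateParams(p: BrowserActionParams): string[] {
  const errors: string[] = [];
  if (p.action === "launch" && !p.url) {
    errors.push("launch requires url");
  }
  if (p.action === "click") {
    // Accept two non-negative integers separated by a comma, e.g. "450,300".
    if (!p.coordinate || !/^\d+,\d+$/.test(p.coordinate.trim())) {
      errors.push('click requires coordinate as "x,y"');
    }
  }
  if (p.action === "type" && p.text === undefined) {
    errors.push("type requires text");
  }
  return errors;
}
```

Returning a list of messages rather than throwing lets a caller report every missing parameter at once.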

## What It Does

This tool creates an automated browser session that Roo controls to navigate websites, interact with page elements, and perform tasks that require browser automation. Each action returns a screenshot of the current state, so every step can be verified visually.

## When Is It Used?

- When Roo needs to interact with web applications or websites
- When testing user interfaces or web functionality
- When capturing screenshots of web pages
- When demonstrating web workflows visually

## Key Features

- Provides visual feedback with screenshots after each action and captures console logs
- Supports complete workflows from launching to page interaction to closing
- Enables precise interactions via coordinates, keyboard input, and scrolling
- Maintains consistent browser sessions with intelligent page loading detection
- Operates in two modes: local (isolated Puppeteer instance) or remote (connects to existing Chrome)
- Handles errors gracefully with automatic session cleanup and detailed messages
- Optimizes visual output with support for various formats and quality settings
- Tracks interaction state with position indicators and action history

## Browser Modes

The tool operates in two distinct modes:

### Local Browser Mode (Default)
- Downloads and manages a local Chromium instance through Puppeteer
- Creates a fresh browser environment with each launch
- Has no access to existing user profiles, cookies, or extensions
- Behaves consistently and predictably in a sandboxed environment
- Completely closes the browser when the session ends

### Remote Browser Mode
- Connects to an existing Chrome/Chromium instance running with remote debugging enabled
- Can access existing browser state, cookies, and potentially extensions
- Starts faster because it reuses an existing browser process
- Supports connecting to browsers in Docker containers or on remote machines
- Disconnects from the browser without closing it when the session ends
- Requires Chrome to be running with the remote debugging port open (typically port 9222)
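
For remote mode, Chrome is typically started with the `--remote-debugging-port=9222` flag, after which it serves a small HTTP API on that port. The sketch below builds the two URLs a client usually needs: a base URL in the form that Puppeteer's `connect()` accepts as its `browserURL` option, and the `/json/version` endpoint, which reports the browser's `webSocketDebuggerUrl`. The helper and interface names are hypothetical, and how Roo itself locates the browser is not stated in this document.

```typescript
// Hypothetical helper: builds the HTTP endpoints of a Chrome instance that was
// started with remote debugging enabled.
interface RemoteBrowserEndpoints {
  browserURL: string; // base URL, usable as Puppeteer's `browserURL` connect option
  versionURL: string; // DevTools HTTP endpoint that reports `webSocketDebuggerUrl`
}

function remoteEndpoints(host: string = "localhost", port: number = 9222): RemoteBrowserEndpoints {
  if (!Number.isInteger(port) || port < 1 || port > 65535) {
    throw new RangeError(`invalid port: ${port}`);
  }
  const browserURL = `http://${host}:${port}`;
  return { browserURL, versionURL: `${browserURL}/json/version` };
}
```

For a browser in a container or on another machine, only the `host` argument changes, provided the debugging port is reachable from where Roo runs.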

## Limitations

- While the browser is active, only the `browser_action` tool can be used
- Browser coordinates are viewport-relative, not page-relative
- Click actions must target visible elements within the viewport
- Browser sessions must be explicitly closed before using other tools
- The browser window has configurable dimensions (default 900x600)
- Cannot directly interact with browser DevTools
- Browser sessions are temporary and do not persist across Roo restarts
- Works only with Chrome/Chromium browsers, not Firefox or Safari
- Local mode has no access to existing cookies; remote mode requires Chrome with debugging enabled
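
Because coordinates are viewport-relative, an element's position on the page has to be adjusted by the current scroll offset before it can be clicked. Here is a minimal sketch of that conversion, assuming vertical scrolling only and the default 900x600 viewport. All names are hypothetical.

```typescript
interface Viewport { width: number; height: number; }
interface Point { x: number; y: number; }

// Default window size taken from the limitations list above.
const DEFAULT_VIEWPORT: Viewport = { width: 900, height: 600 };

// Converts a page-relative point to viewport-relative coordinates, or returns
// null when the point is not visible at the current scroll offset.
function toViewportPoint(
  page: Point,
  scrollY: number,
  viewport: Viewport = DEFAULT_VIEWPORT,
): Point | null {
  const x = page.x;
  const y = page.y - scrollY;
  const visible = x >= 0 && x < viewport.width && y >= 0 && y < viewport.height;
  return visible ? { x, y } : null;
}
```

A `null` result corresponds to the case where a `scroll_down` or `scroll_up` action is needed before the click can succeed.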

## How It Works

When the `browser_action` tool is invoked, it follows this process:

1. **Action Validation and Browser Management**:
    - Validates the required parameters for the requested action
    - For `launch`: Initializes a browser session (either a local Puppeteer instance or remote Chrome)
    - For interaction actions: Uses the existing browser session
    - For `close`: Terminates or disconnects from the browser, depending on the mode

2. **Page Interaction and Stability**:
    - Ensures pages are fully loaded using DOM stability detection via the `waitTillHTMLStable` algorithm
    - Executes requested actions (navigation, clicking, typing, scrolling) with proper timing
    - Monitors network activity after clicks and waits for navigation when necessary

3. **Visual Feedback**:
    - Captures optimized screenshots using WebP format (with PNG fallback)
    - Records browser console logs for debugging purposes
    - Tracks mouse position and maintains a paginated history of actions

4. **Session Management**:
    - Maintains browser state across multiple actions
    - Handles errors and automatically cleans up resources
    - Enforces proper workflow sequence (launch → interactions → close)
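
The document names `waitTillHTMLStable` but does not show it. One common way to detect a stable DOM is to sample the size of the serialized HTML at intervals and treat the page as loaded once the size stops changing for several consecutive checks. The sketch below shows only that decision logic over pre-recorded samples. The function name and the default threshold of three are assumptions, not the actual implementation.

```typescript
// Returns the index of the sample at which the page is considered stable:
// the first sample that completes `requiredStable` consecutive unchanged
// comparisons. Returns -1 if the samples never settle.
function firstStableIndex(htmlSizes: number[], requiredStable: number = 3): number {
  let unchanged = 0;
  for (let i = 1; i < htmlSizes.length; i++) {
    // Reset the streak whenever the HTML size differs from the previous sample.
    unchanged = htmlSizes[i] === htmlSizes[i - 1] ? unchanged + 1 : 0;
    if (unchanged >= requiredStable) return i;
  }
  return -1;
}
```

In a live browser the samples would come from polling the page on a timer with an overall timeout, so that a page that keeps mutating does not block the action indefinitely.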

## Workflow Sequence

Browser interactions must follow this specific sequence:

1. **Session Initialization**: All browser workflows must start with a `launch` action
2. **Interaction Phase**: Multiple `click`, `type`, and scroll actions can be performed
3. **Session Termination**: All browser workflows must end with a `close` action
4. **Tool Switching**: After closing the browser, other tools can be used
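
The sequence above amounts to a two-state machine: a session is either open or closed. Below is a minimal sketch of a guard that enforces it. The class and method names are hypothetical, and treating a repeated `launch` as starting over is an assumption, since the document does not say how that case is handled.

```typescript
type Action = "launch" | "click" | "type" | "scroll_down" | "scroll_up" | "close";

// Hypothetical guard that tracks whether a browser session is open.
class BrowserWorkflow {
  private open = false;

  // Applies an action, throwing if it violates the
  // launch → interactions → close order.
  apply(action: Action): void {
    if (action === "launch") {
      // Assumption: a repeated launch simply starts a new session.
      this.open = true;
      return;
    }
    if (!this.open) {
      throw new Error(`"${action}" requires a launched browser`);
    }
    if (action === "close") {
      this.open = false;
    }
  }

  // Other tools may run only while no browser session is open.
  canUseOtherTools(): boolean {
    return !this.open;
  }
}
```

A caller would check `canUseOtherTools()` before dispatching any tool other than `browser_action`.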

## Examples When Used

- When creating a web form submission process, Roo launches a browser, navigates to the form, fills out fields with the `type` action, and clicks submit.
- When testing a responsive website, Roo navigates to the site and uses scroll actions to examine different sections.
- When capturing screenshots of a web application, Roo navigates through different pages and takes screenshots at each step.
- When demonstrating an e-commerce checkout flow, Roo simulates the entire process from product selection to payment confirmation.

## Usage Examples

Launching a browser and navigating to a website:
```
<browser_action>
<action>launch</action>
<url>https://example.com</url>
</browser_action>
```

Clicking at specific coordinates (e.g., a button):
```
<browser_action>
<action>click</action>
<coordinate>450,300</coordinate>
</browser_action>
```

Typing text into a focused input field:
```
<browser_action>
<action>type</action>
<text>Hello, World!</text>
</browser_action>
```

Scrolling down to see more content:
```
<browser_action>
<action>scroll_down</action>
</browser_action>
```

Closing the browser session:
```
<browser_action>
<action>close</action>
</browser_action>
```
