| 1 | +# browser_action |
| 2 | + |
| 3 | +The `browser_action` tool enables web automation and interaction via a Puppeteer-controlled browser. It allows Roo to launch browsers, navigate to websites, click elements, type text, and scroll pages with visual feedback through screenshots. |
| 4 | + |
| 5 | +## Parameters |
| 6 | + |
| 7 | +The tool accepts these parameters: |
| 8 | + |
| 9 | +- `action` (required): The action to perform: |
| 10 | + * `launch`: Start a new browser session at a URL |
| 11 | + * `click`: Click at specific x,y coordinates |
| 12 | + * `type`: Type text via the keyboard |
| 13 | + * `scroll_down`: Scroll down one page height |
| 14 | + * `scroll_up`: Scroll up one page height |
| 15 | + * `close`: End the browser session |
| 16 | +- `url` (optional): The URL to navigate to when using the `launch` action |
| 17 | +- `coordinate` (optional): The x,y coordinates for the `click` action (e.g., "450,300") |
| 18 | +- `text` (optional): The text to type when using the `type` action |
| 19 | + |
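The parameter rules above can be sketched as a small validator. This is a minimal illustration, not Roo's implementation: the type names, `parseCoordinate`, and `validateRequest` are hypothetical, and it assumes each action-specific parameter is needed only by the action the list pairs it with.

```typescript
// Hypothetical types for illustration; not taken from Roo's source.
type BrowserActionName =
  | "launch" | "click" | "type" | "scroll_down" | "scroll_up" | "close";

interface BrowserActionRequest {
  action: BrowserActionName;
  url?: string;
  coordinate?: string; // "x,y", e.g. "450,300"
  text?: string;
}

// Parse an "x,y" string into numbers; returns null when the string is malformed.
function parseCoordinate(value: string): { x: number; y: number } | null {
  const match = /^\s*(\d+)\s*,\s*(\d+)\s*$/.exec(value);
  if (!match) return null;
  return { x: Number(match[1]), y: Number(match[2]) };
}

// Return a list of problems; an empty list means the request looks usable.
function validateRequest(req: BrowserActionRequest): string[] {
  const problems: string[] = [];
  if (req.action === "launch" && !req.url) {
    problems.push("launch needs a url");
  }
  if (req.action === "click") {
    if (!req.coordinate) {
      problems.push("click needs a coordinate");
    } else if (parseCoordinate(req.coordinate) === null) {
      problems.push("coordinate must look like 450,300");
    }
  }
  if (req.action === "type" && req.text === undefined) {
    problems.push("type needs text");
  }
  return problems;
}
```

Returning a list of problems rather than throwing lets a caller report every missing parameter at once.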
| 20 | +## What It Does |
| 21 | + |
| 22 | +This tool creates an automated browser session that Roo can control to navigate websites, interact with elements, and perform tasks that require browser automation. Each action provides a screenshot of the current state, enabling visual verification of the process. |
| 23 | + |
| 24 | +## When Is It Used? |
| 25 | + |
| 26 | +- When Roo needs to interact with web applications or websites |
| 27 | +- When testing user interfaces or web functionality |
| 28 | +- When capturing screenshots of web pages |
| 29 | +- When demonstrating web workflows visually |
| 30 | + |
| 31 | +## Key Features |
| 32 | + |
| 33 | +- Provides visual feedback with screenshots after each action and captures console logs |
| 34 | +- Supports complete workflows from launching to page interaction to closing |
| 35 | +- Enables precise interactions via coordinates, keyboard input, and scrolling |
| 36 | +- Maintains consistent browser sessions with intelligent page loading detection |
| 37 | +- Operates in two modes: local (isolated Puppeteer instance) or remote (connects to existing Chrome) |
| 38 | +- Handles errors gracefully with automatic session cleanup and detailed messages |
| 39 | +- Captures screenshots in WebP format (with PNG fallback) at configurable quality settings |
| 40 | +- Tracks interaction state with position indicators and action history |
| 41 | + |
| 42 | +## Browser Modes |
| 43 | + |
| 44 | +The tool operates in two distinct modes: |
| 45 | + |
| 46 | +### Local Browser Mode (Default) |
| 47 | +- Downloads and manages a local Chromium instance through Puppeteer |
| 48 | +- Creates a fresh browser environment with each launch |
| 49 | +- No access to existing user profiles, cookies, or extensions |
| 50 | +- Consistent, predictable behavior in a sandboxed environment |
| 51 | +- Completely closes the browser when the session ends |
| 52 | + |
| 53 | +### Remote Browser Mode |
| 54 | +- Connects to an existing Chrome/Chromium instance running with remote debugging enabled |
| 55 | +- Can access existing browser state, cookies, and potentially extensions |
| 56 | +- Faster startup as it reuses an existing browser process |
| 57 | +- Supports connecting to browsers in Docker containers or on remote machines |
| 58 | +- Only disconnects from the browser (without closing it) when the session ends |
| 59 | +- Requires Chrome to be running with remote debugging port open (typically port 9222) |
| 60 | + |
| 61 | +## Limitations |
| 62 | + |
| 63 | +- While the browser is active, only the `browser_action` tool can be used |
| 64 | +- Browser coordinates are viewport-relative, not page-relative |
| 65 | +- Click actions must target visible elements within the viewport |
| 66 | +- Browser sessions must be explicitly closed before using other tools |
| 67 | +- Browser window has configurable dimensions (default 900x600) |
| 68 | +- Cannot directly interact with browser DevTools |
| 69 | +- Browser sessions are temporary and not persistent across Roo restarts |
| 70 | +- Works only with Chrome/Chromium browsers, not Firefox or Safari |
| 71 | +- Local mode has no access to existing cookies; remote mode requires Chrome with debugging enabled |
| 72 | + |
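Because coordinates are viewport-relative, a point known in page coordinates has to be shifted by the current scroll offset before it can be clicked. The sketch below illustrates that arithmetic with the default 900x600 window; the helper names are hypothetical, and `scrollsNeeded` assumes each `scroll_down` moves exactly one viewport height on a sufficiently long page.

```typescript
interface ViewportPoint { x: number; y: number }
interface ViewportSize { width: number; height: number }

// Default window size mentioned in the limitations above.
const DEFAULT_VIEWPORT: ViewportSize = { width: 900, height: 600 };

// Convert a page-relative point to a viewport-relative one by subtracting the scroll offset.
function toViewportPoint(pagePoint: ViewportPoint, scroll: ViewportPoint): ViewportPoint {
  return { x: pagePoint.x - scroll.x, y: pagePoint.y - scroll.y };
}

// A click can only land on a point that lies inside the visible viewport.
function isInsideViewport(p: ViewportPoint, viewport: ViewportSize = DEFAULT_VIEWPORT): boolean {
  return p.x >= 0 && p.y >= 0 && p.x < viewport.width && p.y < viewport.height;
}

// Number of scroll_down actions before a page Y position becomes visible,
// assuming each scroll moves exactly one viewport height (an assumption).
function scrollsNeeded(pageY: number, viewport: ViewportSize = DEFAULT_VIEWPORT): number {
  return Math.floor(pageY / viewport.height);
}
```

For example, an element at page position (450, 1500) is off-screen at first; under these assumptions, after two `scroll_down` actions it sits at viewport position (450, 300).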
| 73 | +## How It Works |
| 74 | + |
| 75 | +When the `browser_action` tool is invoked, it follows this process: |
| 76 | + |
| 77 | +1. **Action Validation and Browser Management**: |
| 78 | + - Validates the required parameters for the requested action |
| 79 | + - For `launch`: Initializes a browser session (either local Puppeteer instance or remote Chrome) |
| 80 | + - For interaction actions: Uses the existing browser session |
| 81 | + - For `close`: Terminates or disconnects from the browser appropriately |
| 82 | + |
| 83 | +2. **Page Interaction and Stability**: |
| 84 | +   - Ensures pages are fully loaded using DOM stability detection via the `waitTillHTMLStable` algorithm |
| 85 | + - Executes requested actions (navigation, clicking, typing, scrolling) with proper timing |
| 86 | + - Monitors network activity after clicks and waits for navigation when necessary |
| 87 | + |
| 88 | +3. **Visual Feedback**: |
| 89 | + - Captures optimized screenshots using WebP format (with PNG fallback) |
| 90 | + - Records browser console logs for debugging purposes |
| 91 | +   - Tracks mouse position and maintains a paginated history of actions |
| 92 | + |
| 93 | +4. **Session Management**: |
| 94 | + - Maintains browser state across multiple actions |
| 95 | + - Handles errors and automatically cleans up resources |
| 96 | + - Enforces proper workflow sequence (launch → interactions → close) |
| 97 | + |
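The DOM stability detection in step 2 can be pictured as repeated sampling of the page's HTML size until it stops changing. The sketch below is a simplified, synchronous model of that idea, not the actual `waitTillHTMLStable` code: the function name `checksUntilStable`, the use of HTML length as the signal, and the default thresholds are all assumptions chosen for illustration.

```typescript
// Given successive HTML-size samples, return the 1-based number of the check
// at which the page counts as stable, or null if the check budget runs out.
function checksUntilStable(
  htmlSizes: number[],
  requiredStableChecks: number = 3, // placeholder threshold (assumption)
  maxChecks: number = 5             // placeholder budget standing in for a timeout
): number | null {
  let lastSize: number | null = null;
  let stableCount = 0;
  let checks = 0;

  for (const size of htmlSizes.slice(0, maxChecks)) {
    checks++;
    if (lastSize !== null && size === lastSize) {
      stableCount++;   // same size as the previous sample
    } else {
      stableCount = 0; // size changed (or first sample): start over
    }
    if (stableCount >= requiredStableChecks) {
      return checks;
    }
    lastSize = size;
  }
  return null; // never stabilized within the budget
}
```

A real implementation would sample the live page on a timer and proceed after a timeout even if the page never settles; this model only captures the counting logic.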
| 98 | +## Workflow Sequence |
| 99 | + |
| 100 | +Browser interactions must follow this specific sequence: |
| 101 | + |
| 102 | +1. **Session Initialization**: All browser workflows must start with a `launch` action |
| 103 | +2. **Interaction Phase**: Multiple `click`, `type`, and scroll actions can be performed |
| 104 | +3. **Session Termination**: All browser workflows must end with a `close` action |
| 105 | +4. **Tool Switching**: After closing the browser, other tools can be used |
| 106 | + |
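The sequence above behaves like a two-state machine: no session, or an open session. The following sketch models it under one extra assumption not stated in the text, namely that a second `launch` without an intervening `close` is rejected; the class and method names are hypothetical.

```typescript
type WorkflowAction =
  | "launch" | "click" | "type" | "scroll_down" | "scroll_up" | "close";

class BrowserWorkflow {
  private sessionOpen = false;

  // Apply an action; throws if it violates the launch → interactions → close order.
  apply(action: WorkflowAction): void {
    if (action === "launch") {
      if (this.sessionOpen) {
        // Assumption for this sketch: relaunching requires closing first.
        throw new Error("browser already launched; close it first");
      }
      this.sessionOpen = true;
    } else if (!this.sessionOpen) {
      throw new Error(action + " requires a launched browser");
    } else if (action === "close") {
      this.sessionOpen = false;
    }
    // click, type, and scroll actions leave the session open.
  }

  // Other tools are available only while no browser session is open.
  canUseOtherTools(): boolean {
    return !this.sessionOpen;
  }
}
```

A caller would run every requested action through `apply` and consult `canUseOtherTools` before dispatching any other tool.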
| 107 | +## Example Scenarios |
| 108 | + |
| 109 | +- When creating a web form submission process, Roo launches a browser, navigates to the form, fills out fields with the `type` action, and clicks submit. |
| 110 | +- When testing a responsive website, Roo navigates to the site and uses scroll actions to examine different sections. |
| 111 | +- When capturing screenshots of a web application, Roo navigates through different pages and takes screenshots at each step. |
| 112 | +- When demonstrating an e-commerce checkout flow, Roo simulates the entire process from product selection to payment confirmation. |
| 113 | + |
| 114 | +## Usage Examples |
| 115 | + |
| 116 | +Launching a browser and navigating to a website: |
| 117 | +``` |
| 118 | +<browser_action> |
| 119 | +<action>launch</action> |
| 120 | +<url>https://example.com</url> |
| 121 | +</browser_action> |
| 122 | +``` |
| 123 | + |
| 124 | +Clicking at specific coordinates (e.g., a button): |
| 125 | +``` |
| 126 | +<browser_action> |
| 127 | +<action>click</action> |
| 128 | +<coordinate>450,300</coordinate> |
| 129 | +</browser_action> |
| 130 | +``` |
| 131 | + |
| 132 | +Typing text into a focused input field: |
| 133 | +``` |
| 134 | +<browser_action> |
| 135 | +<action>type</action> |
| 136 | +<text>Hello, World!</text> |
| 137 | +</browser_action> |
| 138 | +``` |
| 139 | + |
| 140 | +Scrolling down to see more content: |
| 141 | +``` |
| 142 | +<browser_action> |
| 143 | +<action>scroll_down</action> |
| 144 | +</browser_action> |
| 145 | +``` |
| 146 | + |
| 147 | +Closing the browser session: |
| 148 | +``` |
| 149 | +<browser_action> |
| 150 | +<action>close</action> |
| 151 | +</browser_action> |
| 152 | +``` |