Add Browser Env Integration #732
Open: filip-michalsky wants to merge 15 commits into PrimeIntellect-ai:main from filip-michalsky:fm/add-browser-env
Commits
- cedc25b ruff precommit (filip-michalsky)
- 7280e65 smoke test env (filip-michalsky)
- 9cfd1c6 simplify smoke test (filip-michalsky)
- c93bca2 delete datasets from wrong place (filip-michalsky)
- 8a978c5 restructure (filip-michalsky)
- 3d6d559 Remove gaia, mind2web, and webvoyager environment folders (filip-michalsky)
- 5db1d16 increment examples (filip-michalsky)
- e3024b8 update readme and auto start cua server (filip-michalsky)
- 01eac47 add env check (filip-michalsky)
- 79e04d0 update readme (filip-michalsky)
- 0affb42 update agents md (filip-michalsky)
- 49f5b95 fix tests (filip-michalsky)
- 813eca4 update tests (filip-michalsky)
- 906a836 Remove gaia, webvoyager, mind2web from tracking (filip-michalsky)
- e688da6 make bugbot happier (filip-michalsky)
@@ -0,0 +1,106 @@
````markdown
# Browser CUA Mode Example

A simple example environment demonstrating **CUA (Computer Use Agent) mode** browser automation using [Browserbase](https://browserbase.com).

CUA mode uses vision-based primitives to control the browser through screenshots, similar to how a human would interact with a screen.

## How CUA Mode Works

CUA mode provides low-level vision-based operations:
- **click(x, y, button)**: Click at screen coordinates
- **type_text(text)**: Type text into the focused element
- **scroll(x, y, scroll_x, scroll_y)**: Scroll at a position
- **screenshot()**: Capture the current page state
- **goto(url)**: Navigate to a URL

The agent sees screenshots and decides which actions to take based on visual understanding.

## Prerequisites: CUA Server

CUA mode requires a running CUA server that handles browser automation. The server is located at:
```
verifiers/envs/integrations/browser_env/cua-server/
```

### Starting the CUA Server

```bash
cd verifiers/envs/integrations/browser_env/cua-server

# Start the server (installs dependencies automatically if needed)
./start.sh
```

The server runs on `http://localhost:3000` by default. Use `./start.sh --port 8080` for a custom port.

## Installation

```bash
# Install browser extras
uv pip install -e ".[browser]"

# Install this example environment
uv pip install -e ./environments/browser_cua_example
```

## Configuration

### Required Environment Variables

```bash
# Browserbase credentials
export BROWSERBASE_API_KEY="your-api-key"
export BROWSERBASE_PROJECT_ID="your-project-id"

# API key for agent model
export OPENAI_API_KEY="your-openai-key"
```

Note: CUA mode does NOT require `MODEL_API_KEY` since it doesn't use Stagehand.

## Usage

1. **Start the CUA server** (in a separate terminal):
   ```bash
   cd verifiers/envs/integrations/browser_env/cua-server && ./start.sh
   ```

2. **Run the evaluation**:
   ```bash
   prime eval run browser-cua-example -m gpt-4.1-mini -b https://api.openai.com/v1 -k OPENAI_API_KEY
   ```

### Custom Server URL

If running the CUA server on a different port:
```bash
prime eval run browser-cua-example -m gpt-4.1-mini -b https://api.openai.com/v1 -k OPENAI_API_KEY -a '{"server_url": "http://localhost:8080"}'
```

## Environment Arguments

| Argument | Default | Description |
|----------|---------|-------------|
| `max_turns` | `15` | Maximum conversation turns |
| `judge_model` | `"gpt-4o-mini"` | Model for task completion judging |
| `server_url` | `"http://localhost:3000"` | CUA server URL |
| `viewport_width` | `1024` | Browser viewport width |
| `viewport_height` | `768` | Browser viewport height |
| `save_screenshots` | `False` | Save screenshots during execution |

## DOM vs CUA Mode Comparison

| Aspect | DOM Mode | CUA Mode |
|--------|----------|----------|
| **Control** | Natural language via Stagehand | Vision-based coordinates |
| **Server** | None required | CUA server required |
| **MODEL_API_KEY** | Required (for Stagehand) | Not required |
| **Best for** | Structured web interactions | Visual/complex UIs |
| **Speed** | Faster (direct DOM) | Slower (screenshots) |

## Requirements

- Python >= 3.10
- Node.js (for CUA server)
- Browserbase account with API credentials
- OpenAI API key
````
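
The README above names three environment variables and a CUA server URL that need to be in place before `prime eval run` is invoked. A minimal preflight sketch of that check follows. `REQUIRED_VARS`, `missing_env_vars`, and `server_is_listening` are hypothetical names chosen for illustration and are not taken from this PR's code; the TCP probe only shows that something accepts connections on the port, not that it is the CUA server.

```python
import os
import socket
from urllib.parse import urlparse

# Variable names are the ones listed in the README; the tuple itself is illustrative.
REQUIRED_VARS = ("BROWSERBASE_API_KEY", "BROWSERBASE_PROJECT_ID", "OPENAI_API_KEY")


def missing_env_vars(environ=None, required=REQUIRED_VARS):
    """Return the names of required variables that are unset or empty."""
    environ = os.environ if environ is None else environ
    return [name for name in required if not environ.get(name)]


def server_is_listening(server_url, timeout=2.0):
    """Return True if a TCP connection to the URL's host and port succeeds."""
    parsed = urlparse(server_url)
    # Fall back to the scheme's default port when the URL does not name one.
    port = parsed.port or (443 if parsed.scheme == "https" else 80)
    try:
        with socket.create_connection((parsed.hostname, port), timeout=timeout):
            return True
    except OSError:
        return False
```

Calling `missing_env_vars()` with no arguments inspects the current process environment, and `server_is_listening("http://localhost:3000")` probes the default address given in the README.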
170 changes: 170 additions & 0 deletions
environments/browser_cua_example/browser_cua_example.py
@@ -0,0 +1,170 @@
````python
"""
Browser CUA Mode Example Environment.

This example demonstrates using BrowserEnv with CUA (Computer Use Agent) mode
for vision-based browser control.

CUA mode uses screenshots and vision models to interact with the browser,
providing low-level primitives like click, scroll, and type_text.

Usage:
    # Start CUA server first, then:
    prime eval run browser-cua-example -m gpt-4.1-mini -b https://api.openai.com/v1 -k OPENAI_API_KEY
"""

from typing import Literal

import verifiers as vf
from verifiers.envs.integrations.browser_env import BrowserEnv
from datasets import Dataset


def create_example_dataset() -> Dataset:
    """
    Create a simple inline dataset for the CUA mode hello world example.

    This dataset tests basic browser navigation and content extraction
    using vision-based interactions.
    """
    return Dataset.from_dict(
        {
            "question": [
                "What does the headline say on the primeintellect.ai homepage?"
            ],
            "answer": ["The Open Superintelligence Stack"],
            "start_url": ["https://primeintellect.ai"],
            "task_id": ["cua-example-0"],
        }
    )


# Judge prompt for evaluating answer correctness
JUDGE_PROMPT = """You are evaluating a browser automation agent's answer to a question.

Question:
```
{question}
```

Expected Answer:
```
{answer}
```

Agent's Response:
```
{response}
```

Does the agent's response contain the correct answer? The answer may be embedded in a longer response or phrased differently, but should convey the same information as the expected answer.

Respond "yes" if the agent's response contains the correct answer, "no" if it does not."""


async def judge_answer(
    judge,
    prompt: str | list,
    completion: str | list,
    answer: str,
    state: vf.State,
) -> float:
    """
    LLM judge reward that compares the agent's final answer to the reference answer.

    Args:
        judge: Callable injected by JudgeRubric that calls the judge LLM
        prompt: The original prompt/question given to the agent
        completion: The agent's full response/trajectory
        answer: The expected/reference answer from the dataset
        state: The current environment state

    Returns:
        float: 1.0 if the judge determines the answer is correct, 0.0 otherwise
    """
    judge_response = await judge(prompt, completion, answer, state)
    is_correct = "yes" in judge_response.lower()
    return 1.0 if is_correct else 0.0


def load_environment(
    max_turns: int = 15,
    judge_model: str = "gpt-4o-mini",
    server_url: str = "http://localhost:3000",
    browserbase_api_key: str | None = None,
    browserbase_project_id: str | None = None,
    env: Literal["LOCAL", "BROWSERBASE"] = "LOCAL",
    viewport_width: int = 1024,
    viewport_height: int = 768,
    save_screenshots: bool = False,
    keep_recent_screenshots: int | None = 2,
    proxies: bool = False,
    **kwargs,
) -> vf.Environment:
    """
    Load a CUA mode browser example environment.

    This is a self-contained "hello world" example demonstrating how to use
    BrowserEnv with CUA mode for vision-based browser control.

    CUA mode uses vision-based primitives for browser control.
    Requires a running CUA server (see cua-server/).

    Available tools in CUA mode:
    - click(x, y, button): Click at coordinates
    - double_click(x, y): Double-click at coordinates
    - type_text(text): Type text into focused element
    - keypress(keys): Press keyboard keys
    - scroll(x, y, scroll_x, scroll_y): Scroll at position
    - goto(url): Navigate to URL
    - back(): Go back in history
    - forward(): Go forward in history
    - wait(time_ms): Wait for specified milliseconds
    - screenshot(): Capture current page state

    Args:
        max_turns: Maximum conversation turns (default: 15)
        judge_model: Model for judging task completion
        server_url: CUA server URL (default: http://localhost:3000)
        browserbase_api_key: Browserbase API key (or set BROWSERBASE_API_KEY env var)
        browserbase_project_id: Browserbase project ID (or set BROWSERBASE_PROJECT_ID env var)
        env: Environment type - "LOCAL" or "BROWSERBASE"
        viewport_width: Browser viewport width (default: 1024)
        viewport_height: Browser viewport height (default: 768)
        save_screenshots: Save screenshots during execution (default: False)
        keep_recent_screenshots: Number of recent screenshots to keep in context
        proxies: Enable Browserbase proxies
        **kwargs: Additional arguments passed to BrowserEnv

    Returns:
        Configured BrowserEnv instance in CUA mode

    Example:
        >>> env = load_environment(server_url="http://localhost:3000")
    """
    # Create inline dataset
    dataset = create_example_dataset()

    # Create judge rubric for evaluation
    rubric = vf.JudgeRubric(
        judge_model=judge_model,
        judge_prompt=JUDGE_PROMPT,
    )
    rubric.add_reward_func(judge_answer, weight=1.0)

    # Create BrowserEnv with CUA mode (uses default system prompt)
    return BrowserEnv(
        mode="cua",
        dataset=dataset,
        rubric=rubric,
        max_turns=max_turns,
        server_url=server_url,
        browserbase_api_key=browserbase_api_key,
        browserbase_project_id=browserbase_project_id,
        env=env,
        viewport_width=viewport_width,
        viewport_height=viewport_height,
        save_screenshots=save_screenshots,
        keep_recent_screenshots=keep_recent_screenshots,
        proxies=proxies,
        **kwargs,
    )
````
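
`judge_answer` above scores a rollout as correct whenever the judge's reply contains the substring "yes". That also matches replies such as "No, although the page says yes elsewhere", or any word containing those letters. One possible stricter rule, sketched here as a review suggestion, accepts only replies whose first word is "yes". `parse_verdict` is a hypothetical helper, not part of this PR, and the first-word rule is an assumption about how the judge phrases its answer.

```python
import re


def parse_verdict(judge_response: str) -> float:
    """Map a judge reply to a reward: 1.0 only when its first word is "yes"."""
    # Skip leading quotes or punctuation, then capture the first word.
    match = re.match(r"\W*(\w+)", judge_response.lower())
    first_word = match.group(1) if match else ""
    return 1.0 if first_word == "yes" else 0.0
```

The trade-off is that a reply like "The answer is yes" would now score 0.0, so this rule only fits if the judge reliably leads with its verdict, as the prompt requests.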
@@ -0,0 +1,20 @@
```toml
[project]
name = "browser-cua-example"
version = "0.1.1"
description = "Example environment demonstrating CUA (Computer Use Agent) mode browser automation"
tags = ["browser", "browserbase", "cua", "vision", "example"]
requires-python = ">=3.10"
dependencies = [
    "verifiers[browser]>=0.1.8",
]

[build-system]
requires = ["hatchling"]
build-backend = "hatchling.build"

[tool.hatch.build]
include = ["browser_cua_example.py", "pyproject.toml"]

[tool.verifiers.eval]
num_examples = 1
rollouts_per_example = 1
```
@@ -0,0 +1,71 @@
````markdown
# Browser DOM Mode Example

A simple example environment demonstrating **DOM mode** browser automation using [Browserbase](https://browserbase.com) and [Stagehand](https://github.com/browserbase/stagehand).

DOM mode uses the Stagehand SDK to translate natural language commands into browser actions.

## How DOM Mode Works

DOM mode provides these natural language operations:
- **act**: Perform actions like clicking buttons, filling forms
- **observe**: Get information about visible elements
- **extract**: Extract structured data from the page
- **navigate**: Go to URLs

Stagehand uses an LLM (configured via `stagehand_model`) to understand the page DOM and execute the appropriate browser actions.

## Installation

```bash
# Install browser extras
uv pip install -e ".[browser]"

# Install this example environment
uv pip install -e ./environments/browser_dom_example
```

## Configuration

### Required Environment Variables

```bash
# Browserbase credentials
export BROWSERBASE_API_KEY="your-api-key"
export BROWSERBASE_PROJECT_ID="your-project-id"

# API keys for models
export OPENAI_API_KEY="your-openai-key"  # For agent model
export MODEL_API_KEY="your-openai-key"   # For Stagehand (can be same as OPENAI_API_KEY)
```

### Why MODEL_API_KEY?

Stagehand needs its own LLM to understand the DOM and translate natural language to actions. The `MODEL_API_KEY` environment variable provides the API key for this internal Stagehand model.

## Usage

```bash
prime eval run browser-dom-example -m gpt-4.1-mini -b https://api.openai.com/v1 -k OPENAI_API_KEY
```

## Environment Arguments

| Argument | Default | Description |
|----------|---------|-------------|
| `max_turns` | `10` | Maximum conversation turns |
| `judge_model` | `"gpt-4o-mini"` | Model for task completion judging |
| `stagehand_model` | `"openai/gpt-4o-mini"` | Model for Stagehand DOM operations |

## Example Task

The smoke test navigates to the Prime Intellect homepage and asks the agent to read the headline. The agent uses DOM mode operations to:
1. Navigate to the page
2. Observe visible text
3. Extract the headline content
4. Report the answer

## Requirements

- Python >= 3.10
- Browserbase account with API credentials
- OpenAI API key (for agent and Stagehand)
````
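
The DOM README lists environment arguments but shows only the bare `prime eval run` command. The CUA README in this PR passes such arguments as a JSON object through `-a`. The sketch below builds that command line with the JSON safely shell-quoted. `build_eval_command` is a hypothetical helper, and it assumes that `-a` accepts the DOM environment's arguments the same way as the CUA example; the other flags are copied from the README commands.

```python
import json
import shlex


def build_eval_command(env_id, model, env_args=None):
    """Return a shell-safe `prime eval run` command line as a single string."""
    cmd = [
        "prime", "eval", "run", env_id,
        "-m", model,
        "-b", "https://api.openai.com/v1",
        "-k", "OPENAI_API_KEY",
    ]
    if env_args:
        # Serialize the arguments to JSON; shlex.join adds the shell quoting.
        cmd += ["-a", json.dumps(env_args)]
    return shlex.join(cmd)


print(build_eval_command(
    "browser-dom-example",
    "gpt-4.1-mini",
    {"stagehand_model": "openai/gpt-4o-mini", "max_turns": 10},
))
```

Building the argument list first and quoting it in one step avoids hand-escaping the braces and double quotes inside the JSON.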