Merged
34 commits
- `cedc25b` ruff precommit (filip-michalsky, Jan 15, 2026)
- `7280e65` smoke test env (filip-michalsky, Jan 15, 2026)
- `9cfd1c6` simplify smoke test (filip-michalsky, Jan 15, 2026)
- `c93bca2` delete datasets from wrong place (filip-michalsky, Jan 16, 2026)
- `8a978c5` restructure (filip-michalsky, Jan 16, 2026)
- `3d6d559` Remove gaia, mind2web, and webvoyager environment folders (filip-michalsky, Jan 16, 2026)
- `5db1d16` increment examples (filip-michalsky, Jan 16, 2026)
- `e3024b8` update readme and auto start cua server (filip-michalsky, Jan 16, 2026)
- `01eac47` add env check (filip-michalsky, Jan 16, 2026)
- `79e04d0` update readme (filip-michalsky, Jan 16, 2026)
- `0affb42` update agents md (filip-michalsky, Jan 16, 2026)
- `49f5b95` fix tests (filip-michalsky, Jan 16, 2026)
- `813eca4` update tests (filip-michalsky, Jan 16, 2026)
- `906a836` Remove gaia, webvoyager, mind2web from tracking (filip-michalsky, Jan 16, 2026)
- `e688da6` make bugbot happier (filip-michalsky, Jan 16, 2026)
- `64096e6` Fm/browser sandbox env (#3) (filip-michalsky, Jan 24, 2026)
- `278ae78` move cua server to assets (filip-michalsky, Jan 24, 2026)
- `3ef51bb` update readmes (filip-michalsky, Jan 24, 2026)
- `be0762f` Merge branch 'main' into fm/add-browser-env (filip-michalsky, Jan 24, 2026)
- `40f5b44` DRY modes (filip-michalsky, Jan 24, 2026)
- `38d8ae1` Merge branch 'fm/add-browser-env' of https://github.com/filip-michals… (filip-michalsky, Jan 24, 2026)
- `5dd0649` fix act in dom mode-small dict schema issue (filip-michalsky, Jan 26, 2026)
- `71899f5` update README to recommend max turns as 50 in examples (filip-michalsky, Jan 27, 2026)
- `310d8c9` update assets (filip-michalsky, Jan 27, 2026)
- `3a6a365` proxy bug (filip-michalsky, Jan 28, 2026)
- `ddf8fb4` remove duplicate code (filip-michalsky, Jan 28, 2026)
- `311164c` add advanced stealth flag (filip-michalsky, Jan 28, 2026)
- `9f4bb13` update logging (filip-michalsky, Jan 28, 2026)
- `732cb07` make bugbot happy (filip-michalsky, Jan 28, 2026)
- `9288b11` remove system prompts from browser_env (cdreetz, Jan 29, 2026)
- `d253e1d` remove references to sys prompts and tests for sys prpompts (cdreetz, Jan 29, 2026)
- `d96787d` add prompt to examples (filip-michalsky, Jan 29, 2026)
- `0373e01` local browser config only local (filip-michalsky, Jan 29, 2026)
- `907fddd` streamlining for call_tool fix + ty (willccbb, Jan 29, 2026)
3 changes: 2 additions & 1 deletion docs/environments.md
@@ -720,8 +720,9 @@ Supported third-party environment integrations include:

- **`TextArenaEnv`** — wraps [TextArena](https://github.com/LeonGuertler/TextArena) text-based game environments
- **`ReasoningGymEnv`** — wraps [reasoning-gym](https://github.com/open-thought/reasoning-gym) procedural datasets
+ - **`BrowserEnv`** — unified browser automation via [Browserbase](https://browserbase.com) with DOM and CUA modes

- These require additional dependencies installed via extras (e.g., `uv add 'verifiers[ta]'` for TextArena).
+ These require additional dependencies installed via extras (e.g., `uv add 'verifiers[ta]'` for TextArena, `uv add 'verifiers[browser]'` for BrowserEnv).

Newer and more experimental environment classes include:

5 changes: 3 additions & 2 deletions environments/AGENTS.md
@@ -724,12 +724,13 @@ Supported third-party environment integrations include:

- **`TextArenaEnv`** — wraps [TextArena](https://github.com/LeonGuertler/TextArena) text-based game environments
- **`ReasoningGymEnv`** — wraps [reasoning-gym](https://github.com/open-thought/reasoning-gym) procedural datasets
+ - **`BrowserEnv`** — unified browser automation via [Browserbase](https://browserbase.com) with DOM and CUA modes

- These require additional dependencies installed via extras (e.g., `uv add 'verifiers[ta]'` for TextArena).
+ These require additional dependencies installed via extras (e.g., `uv add 'verifiers[ta]'` for TextArena, `uv add 'verifiers[browser]'` for BrowserEnv).

Newer and more experimental environment classes include:

- **`GymEnv`** — universal runner for Gym-compatible environments (OpenAI Gym / Gymnasium API)
- **`CliAgentEnv`** — runs custom agent code inside sandboxes, intercepting API requests
- **`HarborEnv`** — loads Harbor-format agent benchmark tasks
- - **`RLMEnv`** — implements Recursive Language Models for unbounded context processing
+ - **`RLMEnv`** — implements Recursive Language Models for unbounded context processing. Supports `execution_backend="sandbox"` (default) or `"local"` for host execution. User code runs in a Python REPL with a best-effort guardrail that blocks common filesystem modules and `open` by default; customize via `disallowed_modules`/`disallowed_builtins`.
106 changes: 106 additions & 0 deletions environments/browser_cua_example/README.md
@@ -0,0 +1,106 @@
# Browser CUA Mode Example

A simple example environment demonstrating **CUA (Computer Use Agent) mode** browser automation using [Browserbase](https://browserbase.com).

CUA mode uses vision-based primitives to control the browser through screenshots, similar to how a human would interact with a screen.

## How CUA Mode Works

CUA mode provides low-level vision-based operations:
- **click(x, y)**: Click at screen coordinates
- **type_text(text)**: Type text into the focused element
- **scroll(x, y, scroll_x, scroll_y)**: Scroll the page at a given position
- **screenshot()**: Capture the current screen state
- **goto(url)**: Navigate to a URL

The full tool list (including `double_click`, `keypress`, `back`, `forward`, and `wait`) is documented in `browser_cua_example.py` below.

The agent sees screenshots and decides which actions to take based on visual understanding.
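
For orientation, a rollout alternates between looking at a screenshot and issuing one low-level action. The sketch below is illustrative only; the payload shapes are assumptions, not the CUA server's actual wire format:

```python
# Illustrative CUA action sequence (payload shapes are assumptions, not the
# CUA server's actual protocol): the agent inspects a screenshot, issues one
# primitive, then re-checks the new screen state.
turn_log = [
    {"tool": "goto", "args": {"url": "https://primeintellect.ai"}},
    {"tool": "screenshot", "args": {}},  # agent inspects the rendered page
    {"tool": "scroll", "args": {"x": 512, "y": 384, "scroll_x": 0, "scroll_y": 300}},
    {"tool": "screenshot", "args": {}},  # re-check state after scrolling
    {"tool": "click", "args": {"x": 512, "y": 120}},
]

for step in turn_log:
    print(f"{step['tool']}({step['args']})")
```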

## Prerequisites: CUA Server

CUA mode requires a running CUA server that handles browser automation. The server is located at:
```
verifiers/envs/integrations/browser_env/cua-server/
```

### Starting the CUA Server

```bash
cd verifiers/envs/integrations/browser_env/cua-server

# Start the server (installs dependencies automatically if needed)
./start.sh
```

The server runs on `http://localhost:3000` by default. Use `./start.sh --port 8080` for a custom port.

## Installation

```bash
# Install browser extras
uv pip install -e ".[browser]"

# Install this example environment
uv pip install -e ./environments/browser_cua_example
```

## Configuration

### Required Environment Variables

```bash
# Browserbase credentials
export BROWSERBASE_API_KEY="your-api-key"
export BROWSERBASE_PROJECT_ID="your-project-id"

# API key for agent model
export OPENAI_API_KEY="your-openai-key"
```

Note: CUA mode does NOT require `MODEL_API_KEY` since it doesn't use Stagehand.
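
A quick preflight check before launching a run can save a failed rollout; this is a convenience sketch, not part of the environment:

```python
# Preflight: verify the credentials listed above are exported.
import os

required = ["BROWSERBASE_API_KEY", "BROWSERBASE_PROJECT_ID", "OPENAI_API_KEY"]
missing = [name for name in required if not os.environ.get(name)]
if missing:
    raise SystemExit(f"Missing environment variables: {', '.join(missing)}")
print("CUA-mode credentials look good.")
```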

## Usage

1. **Start the CUA server** (in a separate terminal):
```bash
cd verifiers/envs/integrations/browser_env/cua-server && ./start.sh
```

2. **Run the evaluation**:
```bash
prime eval run browser-cua-example -m gpt-4.1-mini -b https://api.openai.com/v1 -k OPENAI_API_KEY
```

### Custom Server URL

If running the CUA server on a different port:
```bash
prime eval run browser-cua-example -m gpt-4.1-mini -b https://api.openai.com/v1 -k OPENAI_API_KEY -a '{"server_url": "http://localhost:8080"}'
```

## Environment Arguments

| Argument | Default | Description |
|----------|---------|-------------|
| `max_turns` | `15` | Maximum conversation turns |
| `judge_model` | `"gpt-4o-mini"` | Model for task completion judging |
| `server_url` | `"http://localhost:3000"` | CUA server URL |
| `viewport_width` | `1024` | Browser viewport width |
| `viewport_height` | `768` | Browser viewport height |
| `save_screenshots` | `False` | Save screenshots during execution |
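
These arguments map directly onto `load_environment` in `browser_cua_example.py` (included later in this diff), so the environment can also be constructed programmatically. A minimal sketch:

```python
from browser_cua_example import load_environment

# Keyword arguments mirror the table above; any omitted argument falls back
# to the defaults defined in load_environment.
env = load_environment(
    max_turns=15,
    judge_model="gpt-4o-mini",
    server_url="http://localhost:3000",
    viewport_width=1024,
    viewport_height=768,
    save_screenshots=False,
)
```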

## DOM vs CUA Mode Comparison

| Aspect | DOM Mode | CUA Mode |
|--------|----------|----------|
| **Control** | Natural language via Stagehand | Vision-based coordinates |
| **Server** | None required | CUA server required |
| **MODEL_API_KEY** | Required (for Stagehand) | Not required |
| **Best for** | Structured web interactions | Visual/complex UIs |
| **Speed** | Faster (direct DOM) | Slower (screenshots) |

## Requirements

- Python >= 3.10
- Node.js (for CUA server)
- Browserbase account with API credentials
- OpenAI API key
170 changes: 170 additions & 0 deletions environments/browser_cua_example/browser_cua_example.py
@@ -0,0 +1,170 @@
"""
Browser CUA Mode Example Environment.

This example demonstrates using BrowserEnv with CUA (Computer Use Agent) mode
for vision-based browser control.

CUA mode uses screenshots and vision models to interact with the browser,
providing low-level primitives like click, scroll, and type_text.

Usage:
# Start CUA server first, then:
prime eval run browser-cua-example -m gpt-4.1-mini -b https://api.openai.com/v1 -k OPENAI_API_KEY
"""

from typing import Literal

import verifiers as vf
from verifiers.envs.integrations.browser_env import BrowserEnv
from datasets import Dataset


def create_example_dataset() -> Dataset:
"""
Create a simple inline dataset for the CUA mode hello world example.

This dataset tests basic browser navigation and content extraction
using vision-based interactions.
"""
return Dataset.from_dict(
{
"question": [
"What does the headline say on the primeintellect.ai homepage?"
],
"answer": ["The Open Superintelligence Stack"],
"start_url": ["https://primeintellect.ai"],
"task_id": ["cua-example-0"],
}
)


# Judge prompt for evaluating answer correctness
JUDGE_PROMPT = """You are evaluating a browser automation agent's answer to a question.

Question:
```
{question}
```

Expected Answer:
```
{answer}
```

Agent's Response:
```
{response}
```

Does the agent's response contain the correct answer? The answer may be embedded in a longer response or phrased differently, but should convey the same information as the expected answer.

Respond "yes" if the agent's response contains the correct answer, "no" if it does not."""


async def judge_answer(
judge,
prompt: str | list,
completion: str | list,
answer: str,
state: vf.State,
) -> float:
"""
LLM judge reward that compares the agent's final answer to the reference answer.

Args:
judge: Callable injected by JudgeRubric that calls the judge LLM
prompt: The original prompt/question given to the agent
completion: The agent's full response/trajectory
answer: The expected/reference answer from the dataset
state: The current environment state

Returns:
float: 1.0 if the judge determines the answer is correct, 0.0 otherwise
"""
judge_response = await judge(prompt, completion, answer, state)
is_correct = "yes" in judge_response.lower()
return 1.0 if is_correct else 0.0


def load_environment(
max_turns: int = 15,
judge_model: str = "gpt-4o-mini",
server_url: str = "http://localhost:3000",
browserbase_api_key: str | None = None,
browserbase_project_id: str | None = None,
env: Literal["LOCAL", "BROWSERBASE"] = "LOCAL",
viewport_width: int = 1024,
viewport_height: int = 768,
save_screenshots: bool = False,
keep_recent_screenshots: int | None = 2,
proxies: bool = False,
**kwargs,
) -> vf.Environment:
"""
Load a CUA mode browser example environment.

This is a self-contained "hello world" example demonstrating how to use
BrowserEnv with CUA mode for vision-based browser control.

CUA mode uses vision-based primitives for browser control.
Requires a running CUA server (see cua-server/).

Available tools in CUA mode:
- click(x, y, button): Click at coordinates
- double_click(x, y): Double-click at coordinates
- type_text(text): Type text into focused element
- keypress(keys): Press keyboard keys
- scroll(x, y, scroll_x, scroll_y): Scroll at position
- goto(url): Navigate to URL
- back(): Go back in history
- forward(): Go forward in history
- wait(time_ms): Wait for specified milliseconds
- screenshot(): Capture current page state

Args:
max_turns: Maximum conversation turns (default: 15)
judge_model: Model for judging task completion
server_url: CUA server URL (default: http://localhost:3000)
browserbase_api_key: Browserbase API key (or set BROWSERBASE_API_KEY env var)
browserbase_project_id: Browserbase project ID (or set BROWSERBASE_PROJECT_ID env var)
env: Environment type - "LOCAL" or "BROWSERBASE"
viewport_width: Browser viewport width (default: 1024)
viewport_height: Browser viewport height (default: 768)
save_screenshots: Save screenshots during execution (default: False)
keep_recent_screenshots: Number of recent screenshots to keep in context
proxies: Enable Browserbase proxies
**kwargs: Additional arguments passed to BrowserEnv

Returns:
Configured BrowserEnv instance in CUA mode

Example:
>>> env = load_environment(server_url="http://localhost:3000")
"""
# Create inline dataset
dataset = create_example_dataset()

# Create judge rubric for evaluation
rubric = vf.JudgeRubric(
judge_model=judge_model,
judge_prompt=JUDGE_PROMPT,
)
rubric.add_reward_func(judge_answer, weight=1.0)

# Create BrowserEnv with CUA mode (uses default system prompt)
return BrowserEnv(
mode="cua",
dataset=dataset,
rubric=rubric,
max_turns=max_turns,
server_url=server_url,
browserbase_api_key=browserbase_api_key,
browserbase_project_id=browserbase_project_id,
env=env,
viewport_width=viewport_width,
viewport_height=viewport_height,
save_screenshots=save_screenshots,
keep_recent_screenshots=keep_recent_screenshots,
proxies=proxies,
**kwargs,
)
20 changes: 20 additions & 0 deletions environments/browser_cua_example/pyproject.toml
@@ -0,0 +1,20 @@
[project]
name = "browser-cua-example"
version = "0.1.1"
description = "Example environment demonstrating CUA (Computer Use Agent) mode browser automation"
tags = ["browser", "browserbase", "cua", "vision", "example"]
requires-python = ">=3.10"
dependencies = [
"verifiers[browser]>=0.1.8",
]

[build-system]
requires = ["hatchling"]
build-backend = "hatchling.build"

[tool.hatch.build]
include = ["browser_cua_example.py", "pyproject.toml"]

[tool.verifiers.eval]
num_examples = 1
rollouts_per_example = 1
71 changes: 71 additions & 0 deletions environments/browser_dom_example/README.md
@@ -0,0 +1,71 @@
# Browser DOM Mode Example

A simple example environment demonstrating **DOM mode** browser automation using [Browserbase](https://browserbase.com) and [Stagehand](https://github.com/browserbase/stagehand).

DOM mode uses the Stagehand SDK to translate natural language commands into browser actions.

## How DOM Mode Works

DOM mode provides these natural language operations:
- **act**: Perform actions like clicking buttons, filling forms
- **observe**: Get information about visible elements
- **extract**: Extract structured data from the page
- **navigate**: Go to URLs

Stagehand uses an LLM (configured via `stagehand_model`) to understand the page DOM and execute the appropriate browser actions.
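
Under the hood, this is the same `BrowserEnv` class the CUA example in this PR constructs with `mode="cua"`. A minimal sketch, assuming DOM mode is selected symmetrically via `mode="dom"` and that `stagehand_model` is accepted as a keyword (per the arguments table below):

```python
# A minimal sketch, assuming BrowserEnv accepts mode="dom" symmetrically with
# the mode="cua" construction in browser_cua_example.py; passing
# stagehand_model as a keyword is an assumption based on the table below.
import verifiers as vf
from datasets import Dataset
from verifiers.envs.integrations.browser_env import BrowserEnv

dataset = Dataset.from_dict(
    {
        "question": ["What does the headline say on the primeintellect.ai homepage?"],
        "answer": ["The Open Superintelligence Stack"],
        "start_url": ["https://primeintellect.ai"],
        "task_id": ["dom-example-0"],
    }
)

rubric = vf.JudgeRubric(judge_model="gpt-4o-mini")  # default judge prompt

env = BrowserEnv(
    mode="dom",
    dataset=dataset,
    rubric=rubric,
    max_turns=10,
    stagehand_model="openai/gpt-4o-mini",  # Stagehand's internal LLM (uses MODEL_API_KEY)
)
```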

## Installation

```bash
# Install browser extras
uv pip install -e ".[browser]"

# Install this example environment
uv pip install -e ./environments/browser_dom_example
```

## Configuration

### Required Environment Variables

```bash
# Browserbase credentials
export BROWSERBASE_API_KEY="your-api-key"
export BROWSERBASE_PROJECT_ID="your-project-id"

# API keys for models
export OPENAI_API_KEY="your-openai-key" # For agent model
export MODEL_API_KEY="your-openai-key" # For Stagehand (can be same as OPENAI_API_KEY)
```

### Why MODEL_API_KEY?

Stagehand needs its own LLM to understand the DOM and translate natural language to actions. The `MODEL_API_KEY` environment variable provides the API key for this internal Stagehand model.
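
Since a missing `MODEL_API_KEY` only surfaces once Stagehand is invoked mid-rollout, a quick preflight check helps; a convenience sketch, not part of the environment:

```python
# Preflight: DOM mode needs the Stagehand key on top of the usual credentials.
import os

required = [
    "BROWSERBASE_API_KEY",
    "BROWSERBASE_PROJECT_ID",
    "OPENAI_API_KEY",  # agent model
    "MODEL_API_KEY",   # Stagehand's internal model
]
missing = [name for name in required if not os.environ.get(name)]
if missing:
    raise SystemExit(f"Missing environment variables: {', '.join(missing)}")
```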

## Usage

```bash
prime eval run browser-dom-example -m gpt-4.1-mini -b https://api.openai.com/v1 -k OPENAI_API_KEY
```

## Environment Arguments

| Argument | Default | Description |
|----------|---------|-------------|
| `max_turns` | `10` | Maximum conversation turns |
| `judge_model` | `"gpt-4o-mini"` | Model for task completion judging |
| `stagehand_model` | `"openai/gpt-4o-mini"` | Model for Stagehand DOM operations |
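
Assuming `browser_dom_example.py` mirrors the CUA example's entry point (its source is not part of this diff), the same arguments can also be passed programmatically; a hypothetical sketch:

```python
# Hypothetical: browser_dom_example's load_environment signature is assumed
# to mirror the one in browser_cua_example.py, which this diff does include.
from browser_dom_example import load_environment

env = load_environment(
    max_turns=10,
    judge_model="gpt-4o-mini",
    stagehand_model="openai/gpt-4o-mini",
)
```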

## Example Task

The smoke test navigates to the Prime Intellect homepage and asks the agent to read the headline. The agent uses DOM mode operations to:
1. Navigate to the page
2. Observe visible text
3. Extract the headline content
4. Report the answer

## Requirements

- Python >= 3.10
- Browserbase account with API credentials
- OpenAI API key (for agent and Stagehand)