3 changes: 2 additions & 1 deletion docs/environments.md
@@ -720,8 +720,9 @@ Supported third-party environment integrations include:

- **`TextArenaEnv`** — wraps [TextArena](https://github.com/LeonGuertler/TextArena) text-based game environments
- **`ReasoningGymEnv`** — wraps [reasoning-gym](https://github.com/open-thought/reasoning-gym) procedural datasets
- **`BrowserEnv`** — unified browser automation via [Browserbase](https://browserbase.com) with DOM and CUA modes

These require additional dependencies installed via extras (e.g., `uv add 'verifiers[ta]'` for TextArena).
These require additional dependencies installed via extras (e.g., `uv add 'verifiers[ta]'` for TextArena, `uv add 'verifiers[browser]'` for BrowserEnv).

Newer and more experimental environment classes include:

5 changes: 3 additions & 2 deletions environments/AGENTS.md
@@ -724,12 +724,13 @@ Supported third-party environment integrations include:

- **`TextArenaEnv`** — wraps [TextArena](https://github.com/LeonGuertler/TextArena) text-based game environments
- **`ReasoningGymEnv`** — wraps [reasoning-gym](https://github.com/open-thought/reasoning-gym) procedural datasets
- **`BrowserEnv`** — unified browser automation via [Browserbase](https://browserbase.com) with DOM and CUA modes

These require additional dependencies installed via extras (e.g., `uv add 'verifiers[ta]'` for TextArena).
These require additional dependencies installed via extras (e.g., `uv add 'verifiers[ta]'` for TextArena, `uv add 'verifiers[browser]'` for BrowserEnv).

Newer and more experimental environment classes include:

- **`GymEnv`** — universal runner for Gym-compatible environments (OpenAI Gym / Gymnasium API)
- **`CliAgentEnv`** — runs custom agent code inside sandboxes, intercepting API requests
- **`HarborEnv`** — loads Harbor-format agent benchmark tasks
- **`RLMEnv`** — implements Recursive Language Models for unbounded context processing
- **`RLMEnv`** — implements Recursive Language Models for unbounded context processing. Supports `execution_backend="sandbox"` (default) or `"local"` for host execution. User code runs in a Python REPL with a best-effort guardrail that blocks common filesystem modules and `open` by default; customize via `disallowed_modules`/`disallowed_builtins`.
106 changes: 106 additions & 0 deletions environments/browser_cua_example/README.md
@@ -0,0 +1,106 @@
# Browser CUA Mode Example

A simple example environment demonstrating **CUA (Computer Use Agent) mode** browser automation using [Browserbase](https://browserbase.com).

CUA mode uses vision-based primitives to control the browser through screenshots, similar to how a human would interact with a screen.

## How CUA Mode Works

CUA mode provides low-level vision-based operations:
- **click(x, y)**: Click at screen coordinates
- **type_text(text)**: Type text into focused element
- **scroll(direction)**: Scroll the page
- **screenshot()**: Capture current screen state
- **navigate(url)**: Go to a URL

The agent sees screenshots and decides which actions to take based on visual understanding.
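The decision step above can be sketched as a small dispatcher that routes a model-chosen action to the matching primitive. This is a hypothetical illustration only: the primitive names follow the list above, but the `send` callable standing in for the CUA server transport is an assumption, not the actual client API.

```python
# Hypothetical sketch: route a model-chosen action to the matching
# CUA primitive. `send` stands in for the CUA server transport and
# is an assumption, not the real client API.

def dispatch(action: dict, send):
    name = action["type"]
    if name == "click":
        return send("click", x=action["x"], y=action["y"])
    if name == "type_text":
        return send("type_text", text=action["text"])
    if name == "scroll":
        return send("scroll", direction=action["direction"])
    if name == "navigate":
        return send("navigate", url=action["url"])
    # Anything unrecognized falls back to refreshing the visual state
    return send("screenshot")
```

In the real loop, the result of each action is a fresh screenshot that the agent inspects before choosing its next action.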

## Prerequisites: CUA Server

CUA mode requires a running CUA server that handles browser automation. The server is located at:
```
verifiers/envs/integrations/browser_env/cua-server/
```

### Starting the CUA Server

```bash
cd verifiers/envs/integrations/browser_env/cua-server

# Start the server (installs dependencies automatically if needed)
./start.sh
```

The server runs on `http://localhost:3000` by default. Use `./start.sh --port 8080` for a custom port.

## Installation

```bash
# Install browser extras
uv pip install -e ".[browser]"

# Install this example environment
uv pip install -e ./environments/browser_cua_example
```

## Configuration

### Required Environment Variables

```bash
# Browserbase credentials
export BROWSERBASE_API_KEY="your-api-key"
export BROWSERBASE_PROJECT_ID="your-project-id"

# API key for agent model
export OPENAI_API_KEY="your-openai-key"
```

Note: CUA mode does NOT require `MODEL_API_KEY` since it doesn't use Stagehand.

## Usage

1. **Start the CUA server** (in a separate terminal):
```bash
cd verifiers/envs/integrations/browser_env/cua-server && ./start.sh
```

2. **Run the evaluation**:
```bash
prime eval run browser-cua-example -m gpt-4.1-mini -b https://api.openai.com/v1 -k OPENAI_API_KEY
```

### Custom Server URL

If running the CUA server on a different port:
```bash
prime eval run browser-cua-example -m gpt-4.1-mini -b https://api.openai.com/v1 -k OPENAI_API_KEY -a '{"server_url": "http://localhost:8080"}'
```

## Environment Arguments

| Argument | Default | Description |
|----------|---------|-------------|
| `max_turns` | `10` | Maximum conversation turns |
| `judge_model` | `"gpt-4o-mini"` | Model for task completion judging |
| `server_url` | `"http://localhost:3000"` | CUA server URL |
| `viewport_width` | `1024` | Browser viewport width |
| `viewport_height` | `768` | Browser viewport height |
| `save_screenshots` | `False` | Save screenshots during execution |

## DOM vs CUA Mode Comparison

| Aspect | DOM Mode | CUA Mode |
|--------|----------|----------|
| **Control** | Natural language via Stagehand | Vision-based coordinates |
| **Server** | None required | CUA server required |
| **MODEL_API_KEY** | Required (for Stagehand) | Not required |
| **Best for** | Structured web interactions | Visual/complex UIs |
| **Speed** | Faster (direct DOM) | Slower (screenshots) |

## Requirements

- Python >= 3.10
- Node.js (for CUA server)
- Browserbase account with API credentials
- OpenAI API key
170 changes: 170 additions & 0 deletions environments/browser_cua_example/browser_cua_example.py
@@ -0,0 +1,170 @@
"""
Browser CUA Mode Example Environment.

This example demonstrates using BrowserEnv with CUA (Computer Use Agent) mode
for vision-based browser control.

CUA mode uses screenshots and vision models to interact with the browser,
providing low-level primitives like click, scroll, and type_text.

Usage:
# Start CUA server first, then:
prime eval run browser-cua-example -m gpt-4.1-mini -b https://api.openai.com/v1 -k OPENAI_API_KEY
"""

from typing import Literal

import verifiers as vf
from verifiers.envs.integrations.browser_env import BrowserEnv
from datasets import Dataset


def create_example_dataset() -> Dataset:
    """
    Create a simple inline dataset for the CUA mode hello world example.

    This dataset tests basic browser navigation and content extraction
    using vision-based interactions.
    """
    return Dataset.from_dict(
        {
            "question": [
                "What does the headline say on the primeintellect.ai homepage?"
            ],
            "answer": ["The Open Superintelligence Stack"],
            "start_url": ["https://primeintellect.ai"],
            "task_id": ["cua-example-0"],
        }
    )


# Judge prompt for evaluating answer correctness
JUDGE_PROMPT = """You are evaluating a browser automation agent's answer to a question.

Question:
```
{question}
```

Expected Answer:
```
{answer}
```

Agent's Response:
```
{response}
```

Does the agent's response contain the correct answer? The answer may be embedded in a longer response or phrased differently, but should convey the same information as the expected answer.

Respond "yes" if the agent's response contains the correct answer, "no" if it does not."""


async def judge_answer(
    judge,
    prompt: str | list,
    completion: str | list,
    answer: str,
    state: vf.State,
) -> float:
    """
    LLM judge reward that compares the agent's final answer to the reference answer.

    Args:
        judge: Callable injected by JudgeRubric that calls the judge LLM
        prompt: The original prompt/question given to the agent
        completion: The agent's full response/trajectory
        answer: The expected/reference answer from the dataset
        state: The current environment state

    Returns:
        float: 1.0 if the judge determines the answer is correct, 0.0 otherwise
    """
    judge_response = await judge(prompt, completion, answer, state)
    is_correct = "yes" in judge_response.lower()
    return 1.0 if is_correct else 0.0


def load_environment(
    max_turns: int = 15,
    judge_model: str = "gpt-4o-mini",
    server_url: str = "http://localhost:3000",
    browserbase_api_key: str | None = None,
    browserbase_project_id: str | None = None,
    env: Literal["LOCAL", "BROWSERBASE"] = "LOCAL",
    viewport_width: int = 1024,
    viewport_height: int = 768,
    save_screenshots: bool = False,
    keep_recent_screenshots: int | None = 2,
    proxies: bool = False,
    **kwargs,
) -> vf.Environment:
    """
    Load a CUA mode browser example environment.

    This is a self-contained "hello world" example demonstrating how to use
    BrowserEnv with CUA mode for vision-based browser control.

    CUA mode uses vision-based primitives for browser control.
    Requires a running CUA server (see cua-server/).

    Available tools in CUA mode:
    - click(x, y, button): Click at coordinates
    - double_click(x, y): Double-click at coordinates
    - type_text(text): Type text into focused element
    - keypress(keys): Press keyboard keys
    - scroll(x, y, scroll_x, scroll_y): Scroll at position
    - goto(url): Navigate to URL
    - back(): Go back in history
    - forward(): Go forward in history
    - wait(time_ms): Wait for specified milliseconds
    - screenshot(): Capture current page state

    Args:
        max_turns: Maximum conversation turns (default: 15)
        judge_model: Model for judging task completion
        server_url: CUA server URL (default: http://localhost:3000)
        browserbase_api_key: Browserbase API key (or set BROWSERBASE_API_KEY env var)
        browserbase_project_id: Browserbase project ID (or set BROWSERBASE_PROJECT_ID env var)
        env: Environment type - "LOCAL" or "BROWSERBASE"
        viewport_width: Browser viewport width (default: 1024)
        viewport_height: Browser viewport height (default: 768)
        save_screenshots: Save screenshots during execution (default: False)
        keep_recent_screenshots: Number of recent screenshots to keep in context
        proxies: Enable Browserbase proxies
        **kwargs: Additional arguments passed to BrowserEnv

    Returns:
        Configured BrowserEnv instance in CUA mode

    Example:
        >>> env = load_environment(server_url="http://localhost:3000")
    """
    # Create inline dataset
    dataset = create_example_dataset()

    # Create judge rubric for evaluation
    rubric = vf.JudgeRubric(
        judge_model=judge_model,
        judge_prompt=JUDGE_PROMPT,
    )
    rubric.add_reward_func(judge_answer, weight=1.0)

    # Create BrowserEnv with CUA mode (uses default system prompt)
    return BrowserEnv(
        mode="cua",
        dataset=dataset,
        rubric=rubric,
        max_turns=max_turns,
        server_url=server_url,
        browserbase_api_key=browserbase_api_key,
        browserbase_project_id=browserbase_project_id,
        env=env,
        viewport_width=viewport_width,
        viewport_height=viewport_height,
        save_screenshots=save_screenshots,
        keep_recent_screenshots=keep_recent_screenshots,
        proxies=proxies,
        **kwargs,
    )
20 changes: 20 additions & 0 deletions environments/browser_cua_example/pyproject.toml
@@ -0,0 +1,20 @@
[project]
name = "browser-cua-example"
version = "0.1.1"
description = "Example environment demonstrating CUA (Computer Use Agent) mode browser automation"
tags = ["browser", "browserbase", "cua", "vision", "example"]
requires-python = ">=3.10"
dependencies = [
"verifiers[browser]>=0.1.8",
]

[build-system]
requires = ["hatchling"]
build-backend = "hatchling.build"

[tool.hatch.build]
include = ["browser_cua_example.py", "pyproject.toml"]

[tool.verifiers.eval]
num_examples = 1
rollouts_per_example = 1
71 changes: 71 additions & 0 deletions environments/browser_dom_example/README.md
@@ -0,0 +1,71 @@
# Browser DOM Mode Example

A simple example environment demonstrating **DOM mode** browser automation using [Browserbase](https://browserbase.com) and [Stagehand](https://github.com/browserbase/stagehand).

DOM mode uses the Stagehand SDK to translate natural language commands into browser actions.

## How DOM Mode Works

DOM mode provides these natural language operations:
- **act**: Perform actions like clicking buttons, filling forms
- **observe**: Get information about visible elements
- **extract**: Extract structured data from the page
- **navigate**: Go to URLs

Stagehand uses an LLM (configured via `stagehand_model`) to understand the page DOM and execute the appropriate browser actions.

## Installation

```bash
# Install browser extras
uv pip install -e ".[browser]"

# Install this example environment
uv pip install -e ./environments/browser_dom_example
```

## Configuration

### Required Environment Variables

```bash
# Browserbase credentials
export BROWSERBASE_API_KEY="your-api-key"
export BROWSERBASE_PROJECT_ID="your-project-id"

# API keys for models
export OPENAI_API_KEY="your-openai-key" # For agent model
export MODEL_API_KEY="your-openai-key" # For Stagehand (can be same as OPENAI_API_KEY)
```

### Why MODEL_API_KEY?

Stagehand needs its own LLM to understand the DOM and translate natural language to actions. The `MODEL_API_KEY` environment variable provides the API key for this internal Stagehand model.

## Usage

```bash
prime eval run browser-dom-example -m gpt-4.1-mini -b https://api.openai.com/v1 -k OPENAI_API_KEY
```

## Environment Arguments

| Argument | Default | Description |
|----------|---------|-------------|
| `max_turns` | `10` | Maximum conversation turns |
| `judge_model` | `"gpt-4o-mini"` | Model for task completion judging |
| `stagehand_model` | `"openai/gpt-4o-mini"` | Model for Stagehand DOM operations |

## Example Task

The smoke test asks the agent to read the headline on the Prime Intellect homepage. The agent uses DOM mode operations to:
1. Navigate to the page
2. Observe visible text
3. Extract the headline content
4. Report the answer
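The steps above can be sketched end to end. This is an illustrative sketch only: the `page` object and its method names mirror the DOM-mode operations described earlier, not the actual Stagehand SDK surface, and `StubPage` fakes the browser so the flow is visible without credentials.

```python
import asyncio

# Illustrative stub: fakes the DOM-mode operations so the smoke-test
# flow (navigate -> observe -> extract -> report) runs end to end.
class StubPage:
    async def navigate(self, url: str) -> None:
        self.url = url

    async def observe(self, instruction: str) -> list[str]:
        return ["h1: The Open Superintelligence Stack"]

    async def extract(self, instruction: str) -> str:
        return "The Open Superintelligence Stack"

async def read_headline(page) -> str:
    await page.navigate("https://primeintellect.ai")
    await page.observe("find the main headline")    # locate candidate elements
    return await page.extract("the headline text")  # pull the structured answer

print(asyncio.run(read_headline(StubPage())))  # → The Open Superintelligence Stack
```

In the real environment, Stagehand's internal model (paid for via `MODEL_API_KEY`) performs the observe/extract steps against the live DOM.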

## Requirements

- Python >= 3.10
- Browserbase account with API credentials
- OpenAI API key (for agent and Stagehand)