3 changes: 2 additions & 1 deletion docs/environments.md
@@ -720,8 +720,9 @@ Supported third-party environment integrations include:

- **`TextArenaEnv`** — wraps [TextArena](https://github.com/LeonGuertler/TextArena) text-based game environments
- **`ReasoningGymEnv`** — wraps [reasoning-gym](https://github.com/open-thought/reasoning-gym) procedural datasets
- **`BrowserEnv`** — unified browser automation via [Browserbase](https://browserbase.com) with DOM and CUA modes

These require additional dependencies installed via extras (e.g., `uv add 'verifiers[ta]'` for TextArena).
These require additional dependencies installed via extras (e.g., `uv add 'verifiers[ta]'` for TextArena, `uv add 'verifiers[browser]'` for BrowserEnv).

Newer and more experimental environment classes include:

5 changes: 3 additions & 2 deletions environments/AGENTS.md
@@ -724,12 +724,13 @@ Supported third-party environment integrations include:

- **`TextArenaEnv`** — wraps [TextArena](https://github.com/LeonGuertler/TextArena) text-based game environments
- **`ReasoningGymEnv`** — wraps [reasoning-gym](https://github.com/open-thought/reasoning-gym) procedural datasets
- **`BrowserEnv`** — unified browser automation via [Browserbase](https://browserbase.com) with DOM and CUA modes

These require additional dependencies installed via extras (e.g., `uv add 'verifiers[ta]'` for TextArena).
These require additional dependencies installed via extras (e.g., `uv add 'verifiers[ta]'` for TextArena, `uv add 'verifiers[browser]'` for BrowserEnv).

Newer and more experimental environment classes include:

- **`GymEnv`** — universal runner for Gym-compatible environments (OpenAI Gym / Gymnasium API)
- **`CliAgentEnv`** — runs custom agent code inside sandboxes, intercepting API requests
- **`HarborEnv`** — loads Harbor-format agent benchmark tasks
- **`RLMEnv`** — implements Recursive Language Models for unbounded context processing
- **`RLMEnv`** — implements Recursive Language Models for unbounded context processing. Supports `execution_backend="sandbox"` (default) or `"local"` for host execution. User code runs in a Python REPL with a best-effort guardrail that blocks common filesystem modules and `open` by default; customize via `disallowed_modules`/`disallowed_builtins`.
106 changes: 106 additions & 0 deletions environments/browser_cua_example/README.md
@@ -0,0 +1,106 @@
# Browser CUA Mode Example

A simple example environment demonstrating **CUA (Computer Use Agent) mode** browser automation using [Browserbase](https://browserbase.com).

CUA mode uses vision-based primitives to control the browser through screenshots, similar to how a human would interact with a screen.

## How CUA Mode Works

CUA mode provides low-level vision-based operations:
- **click(x, y)**: Click at screen coordinates
- **type_text(text)**: Type text into focused element
- **scroll(direction)**: Scroll the page
- **screenshot()**: Capture current screen state
- **navigate(url)**: Go to a URL

The agent sees screenshots and decides which actions to take based on visual understanding.
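The decision step above can be sketched as a small dispatcher that routes a model-chosen action to the matching primitive. This is a hypothetical illustration only: the primitive names follow the list above, but the `send` callable standing in for the CUA server transport is an assumption, not the actual client API.

```python
# Hypothetical sketch: route a model-chosen action to the matching
# CUA primitive. `send` stands in for the CUA server transport and
# is an assumption, not the real client API.

def dispatch(action: dict, send):
    name = action["type"]
    if name == "click":
        return send("click", x=action["x"], y=action["y"])
    if name == "type_text":
        return send("type_text", text=action["text"])
    if name == "scroll":
        return send("scroll", direction=action["direction"])
    if name == "navigate":
        return send("navigate", url=action["url"])
    # Anything unrecognized falls back to refreshing the visual state
    return send("screenshot")
```

In the real loop, the result of each action is a fresh screenshot that the agent inspects before choosing its next action.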

## Prerequisites: CUA Server

CUA mode requires a running CUA server that handles browser automation. The server is located at:
```
verifiers/envs/integrations/browser_env/cua-server/
```

### Starting the CUA Server

```bash
cd verifiers/envs/integrations/browser_env/cua-server

# Start the server (installs dependencies automatically if needed)
./start.sh
```

The server runs on `http://localhost:3000` by default. Use `./start.sh --port 8080` for a custom port.

## Installation

```bash
# Install browser extras
uv pip install -e ".[browser]"

# Install this example environment
uv pip install -e ./environments/browser_cua_example
```

## Configuration

### Required Environment Variables

```bash
# Browserbase credentials
export BROWSERBASE_API_KEY="your-api-key"
export BROWSERBASE_PROJECT_ID="your-project-id"

# API key for agent model
export OPENAI_API_KEY="your-openai-key"
```

Note: CUA mode does NOT require `MODEL_API_KEY` since it doesn't use Stagehand.

## Usage

1. **Start the CUA server** (in a separate terminal):
```bash
cd verifiers/envs/integrations/browser_env/cua-server && ./start.sh
```

2. **Run the evaluation**:
```bash
prime eval run browser-cua-example -m gpt-4.1-mini -b https://api.openai.com/v1 -k OPENAI_API_KEY
```

### Custom Server URL

If running the CUA server on a different port:
```bash
prime eval run browser-cua-example -m gpt-4.1-mini -b https://api.openai.com/v1 -k OPENAI_API_KEY -a '{"server_url": "http://localhost:8080"}'
```

## Environment Arguments

| Argument | Default | Description |
|----------|---------|-------------|
| `max_turns` | `10` | Maximum conversation turns |
| `judge_model` | `"gpt-4o-mini"` | Model for task completion judging |
| `server_url` | `"http://localhost:3000"` | CUA server URL |
| `viewport_width` | `1024` | Browser viewport width |
| `viewport_height` | `768` | Browser viewport height |
| `save_screenshots` | `False` | Save screenshots during execution |

## DOM vs CUA Mode Comparison

| Aspect | DOM Mode | CUA Mode |
|--------|----------|----------|
| **Control** | Natural language via Stagehand | Vision-based coordinates |
| **Server** | None required | CUA server required |
| **MODEL_API_KEY** | Required (for Stagehand) | Not required |
| **Best for** | Structured web interactions | Visual/complex UIs |
| **Speed** | Faster (direct DOM) | Slower (screenshots) |

## Requirements

- Python >= 3.10
- Node.js (for CUA server)
- Browserbase account with API credentials
- OpenAI API key
170 changes: 170 additions & 0 deletions environments/browser_cua_example/browser_cua_example.py
@@ -0,0 +1,170 @@
"""
Browser CUA Mode Example Environment.

This example demonstrates using BrowserEnv with CUA (Computer Use Agent) mode
for vision-based browser control.

CUA mode uses screenshots and vision models to interact with the browser,
providing low-level primitives like click, scroll, and type_text.

Usage:
# Start CUA server first, then:
prime eval run browser-cua-example -m gpt-4.1-mini -b https://api.openai.com/v1 -k OPENAI_API_KEY
"""

from typing import Literal

import verifiers as vf
from verifiers.envs.integrations.browser_env import BrowserEnv
from datasets import Dataset


def create_example_dataset() -> Dataset:
    """
    Create a simple inline dataset for the CUA mode hello world example.

    This dataset tests basic browser navigation and content extraction
    using vision-based interactions.
    """
    return Dataset.from_dict(
        {
            "question": [
                "What does the headline say on the primeintellect.ai homepage?"
            ],
            "answer": ["The Open Superintelligence Stack"],
            "start_url": ["https://primeintellect.ai"],
            "task_id": ["cua-example-0"],
        }
    )


# Judge prompt for evaluating answer correctness
JUDGE_PROMPT = """You are evaluating a browser automation agent's answer to a question.

Question:
```
{question}
```

Expected Answer:
```
{answer}
```

Agent's Response:
```
{response}
```

Does the agent's response contain the correct answer? The answer may be embedded in a longer response or phrased differently, but should convey the same information as the expected answer.

Respond "yes" if the agent's response contains the correct answer, "no" if it does not."""


async def judge_answer(
    judge,
    prompt: str | list,
    completion: str | list,
    answer: str,
    state: vf.State,
) -> float:
    """
    LLM judge reward that compares the agent's final answer to the reference answer.

    Args:
        judge: Callable injected by JudgeRubric that calls the judge LLM
        prompt: The original prompt/question given to the agent
        completion: The agent's full response/trajectory
        answer: The expected/reference answer from the dataset
        state: The current environment state

    Returns:
        float: 1.0 if the judge determines the answer is correct, 0.0 otherwise
    """
    judge_response = await judge(prompt, completion, answer, state)
    is_correct = "yes" in judge_response.lower()
    return 1.0 if is_correct else 0.0


def load_environment(
    max_turns: int = 15,
    judge_model: str = "gpt-4o-mini",
    server_url: str = "http://localhost:3000",
    browserbase_api_key: str | None = None,
    browserbase_project_id: str | None = None,
    env: Literal["LOCAL", "BROWSERBASE"] = "LOCAL",
    viewport_width: int = 1024,
    viewport_height: int = 768,
    save_screenshots: bool = False,
    keep_recent_screenshots: int | None = 2,
    proxies: bool = False,
    **kwargs,
) -> vf.Environment:
    """
    Load a CUA mode browser example environment.

    This is a self-contained "hello world" example demonstrating how to use
    BrowserEnv with CUA mode for vision-based browser control.

    CUA mode uses vision-based primitives for browser control.
    Requires a running CUA server (see cua-server/).

    Available tools in CUA mode:
    - click(x, y, button): Click at coordinates
    - double_click(x, y): Double-click at coordinates
    - type_text(text): Type text into focused element
    - keypress(keys): Press keyboard keys
    - scroll(x, y, scroll_x, scroll_y): Scroll at position
    - goto(url): Navigate to URL
    - back(): Go back in history
    - forward(): Go forward in history
    - wait(time_ms): Wait for specified milliseconds
    - screenshot(): Capture current page state

    Args:
        max_turns: Maximum conversation turns (default: 15)
        judge_model: Model for judging task completion
        server_url: CUA server URL (default: http://localhost:3000)
        browserbase_api_key: Browserbase API key (or set BROWSERBASE_API_KEY env var)
        browserbase_project_id: Browserbase project ID (or set BROWSERBASE_PROJECT_ID env var)
        env: Environment type - "LOCAL" or "BROWSERBASE"
        viewport_width: Browser viewport width (default: 1024)
        viewport_height: Browser viewport height (default: 768)
        save_screenshots: Save screenshots during execution (default: False)
        keep_recent_screenshots: Number of recent screenshots to keep in context
        proxies: Enable Browserbase proxies
        **kwargs: Additional arguments passed to BrowserEnv

    Returns:
        Configured BrowserEnv instance in CUA mode

    Example:
        >>> env = load_environment(server_url="http://localhost:3000")
    """
    # Create inline dataset
    dataset = create_example_dataset()

    # Create judge rubric for evaluation
    rubric = vf.JudgeRubric(
        judge_model=judge_model,
        judge_prompt=JUDGE_PROMPT,
    )
    rubric.add_reward_func(judge_answer, weight=1.0)

    # Create BrowserEnv with CUA mode (uses default system prompt)
    return BrowserEnv(
        mode="cua",
        dataset=dataset,
        rubric=rubric,
        max_turns=max_turns,
        server_url=server_url,
        browserbase_api_key=browserbase_api_key,
        browserbase_project_id=browserbase_project_id,
        env=env,
        viewport_width=viewport_width,
        viewport_height=viewport_height,
        save_screenshots=save_screenshots,
        keep_recent_screenshots=keep_recent_screenshots,
        proxies=proxies,
        **kwargs,
    )
20 changes: 20 additions & 0 deletions environments/browser_cua_example/pyproject.toml
@@ -0,0 +1,20 @@
[project]
name = "browser-cua-example"
version = "0.1.1"
description = "Example environment demonstrating CUA (Computer Use Agent) mode browser automation"
tags = ["browser", "browserbase", "cua", "vision", "example"]
requires-python = ">=3.10"
dependencies = [
"verifiers[browser]>=0.1.8",
]

[build-system]
requires = ["hatchling"]
build-backend = "hatchling.build"

[tool.hatch.build]
include = ["browser_cua_example.py", "pyproject.toml"]

[tool.verifiers.eval]
num_examples = 1
rollouts_per_example = 1
71 changes: 71 additions & 0 deletions environments/browser_dom_example/README.md
@@ -0,0 +1,71 @@
# Browser DOM Mode Example

A simple example environment demonstrating **DOM mode** browser automation using [Browserbase](https://browserbase.com) and [Stagehand](https://github.com/browserbase/stagehand).

DOM mode uses the Stagehand SDK to translate natural language commands into browser actions.

## How DOM Mode Works

DOM mode provides these natural language operations:
- **act**: Perform actions like clicking buttons, filling forms
- **observe**: Get information about visible elements
- **extract**: Extract structured data from the page
- **navigate**: Go to URLs

Stagehand uses an LLM (configured via `stagehand_model`) to understand the page DOM and execute the appropriate browser actions.

## Installation

```bash
# Install browser extras
uv pip install -e ".[browser]"

# Install this example environment
uv pip install -e ./environments/browser_dom_example
```

## Configuration

### Required Environment Variables

```bash
# Browserbase credentials
export BROWSERBASE_API_KEY="your-api-key"
export BROWSERBASE_PROJECT_ID="your-project-id"

# API keys for models
export OPENAI_API_KEY="your-openai-key" # For agent model
export MODEL_API_KEY="your-openai-key" # For Stagehand (can be same as OPENAI_API_KEY)
```

### Why MODEL_API_KEY?

Stagehand needs its own LLM to understand the DOM and translate natural language to actions. The `MODEL_API_KEY` environment variable provides the API key for this internal Stagehand model.

## Usage

```bash
prime eval run browser-dom-example -m gpt-4.1-mini -b https://api.openai.com/v1 -k OPENAI_API_KEY
```

## Environment Arguments

| Argument | Default | Description |
|----------|---------|-------------|
| `max_turns` | `10` | Maximum conversation turns |
| `judge_model` | `"gpt-4o-mini"` | Model for task completion judging |
| `stagehand_model` | `"openai/gpt-4o-mini"` | Model for Stagehand DOM operations |

## Example Task

The smoke test asks the agent to read the headline on the Prime Intellect homepage. The agent uses DOM mode operations to:
1. Navigate to the page
2. Observe visible text
3. Extract the headline content
4. Report the answer
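The steps above can be sketched end to end. This is an illustrative sketch only: the `page` object and its method names mirror the DOM-mode operations described earlier, not the actual Stagehand SDK surface, and `StubPage` fakes the browser so the flow is visible without credentials.

```python
import asyncio

# Illustrative stub: fakes the DOM-mode operations so the smoke-test
# flow (navigate -> observe -> extract -> report) runs end to end.
class StubPage:
    async def navigate(self, url: str) -> None:
        self.url = url

    async def observe(self, instruction: str) -> list[str]:
        return ["h1: The Open Superintelligence Stack"]

    async def extract(self, instruction: str) -> str:
        return "The Open Superintelligence Stack"

async def read_headline(page) -> str:
    await page.navigate("https://primeintellect.ai")
    await page.observe("find the main headline")    # locate candidate elements
    return await page.extract("the headline text")  # pull the structured answer

print(asyncio.run(read_headline(StubPage())))  # → The Open Superintelligence Stack
```

In the real environment, Stagehand's internal model (paid for via `MODEL_API_KEY`) performs the observe/extract steps against the live DOM.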

## Requirements

- Python >= 3.10
- Browserbase account with API credentials
- OpenAI API key (for agent and Stagehand)