Add browser environment infrastructure with example tasks by dedeswim · Pull Request #112 · facebookresearch/prompt-siren

dedeswim · 2026-02-02T17:41:41Z

Summary

This PR adds the browser environment infrastructure for web agent tasks, split from the original add-browser-dataset branch for easier review.

This is PR1 of 2:

PR1 (this): Browser environment infrastructure + one example task per site
PR2: All remaining tasks (benign + malicious) and task couples

What's Included

Core Environment:

BrowserEnvironment class with Playwright-based browser automation
Fresh Docker containers per task (following SWE-bench pattern)
Request capture for attack evaluation

Site Infrastructure:

Apache Answer (Q&A site)
Gitea (Git forge)
Wiki.js (Wiki)
Each site has: Dockerfile, seeding scripts, seed data

Tools:

Screenshot-based tools for browser interaction
Common browser actions (click, type, navigate, etc.)

Evaluators:

Text/selector-based evaluators
Exfiltration detection
Navigation tracking
POST request pattern matching

Example Tasks (1 per site to demonstrate the pattern works):

answer_find_question: Find and read a Q&A question
gitea_find_issue: Find and summarize a Git issue

What's Deferred to PR2

All remaining benign tasks (7 for Answer, 7 for Gitea)
All malicious tasks (5 for Answer, 5 for Gitea, 5 cross-site)
All task couples (10 Gitea, 10 Answer, 6 cross-site)

Test plan

uv run ruff check --fix && uv run ruff format - Linting passes
uv run ty check - Type checking passes
uv run pytest -vx -m "not docker_integration" - All unit tests pass (837)
Browser dataset loads correctly: uv run python -c "from prompt_siren.datasets.browser_dataset import BrowserDataset; print('OK')"

🤖 Generated with Claude Code

Add a new browser-based dataset for evaluating web agent tasks using Playwright + Docker containers. Includes three observation modes (screenshot, a11y tree, HTML) and support for multiple site backends (Gitea, Apache Answer, Wiki.js). Key changes: - Add browser dataset with screenshot, a11y, and HTML observation modes - Add BrowserEnvironment (non-snapshottable, fresh containers per task) - Add sandbox manager lifecycle methods (create_sandbox/destroy_sandbox) - Add port binding support in ContainerSpec and SandboxState - Rename TaskSetup to SandboxTaskSetup for clarity - Remove old playwright.py environment (replaced by browser_env.py) - Add site seeding system with Docker-based seed data generation - Add browser-specific tools, evaluators, and injection support Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

- contexts.py: Clarify that cloned containers don't have their own port bindings configured (we copy source's bindings, which works because cloning is only used for snapshottable environments) - tools/__init__.py: Remove incorrect categorization comments (linter requires sorted __all__, module docstring already explains the split) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

This commit prepares the browser dataset for a two-PR split: - PR1 (this): Browser environment infrastructure + one example task per site - PR2: All remaining tasks (benign + malicious) and task couples Changes: - Keep only one example benign task per site (answer_find_question, gitea_find_issue) - Empty malicious tasks lists (PR2 will add them) - Empty task couples list (PR2 will add them) - Update tests to validate PR1 state: - test_browser_dataset.py: Test example tasks are present, couples are empty - test_dataset_properties.py: Add separate fixture for datasets with malicious tasks, exclude browser dataset from malicious/couples tests until PR2 The infrastructure (browser environment, sandbox manager, sites, tools, evaluators, injection vectors) is all included and functional. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

Reduce seed data to the minimum required for example benign tasks: - Answer: Single Python question with one answer - Gitea: Single demo-project repo with one issue - WikiJS: Single welcome page Simplify example benign tasks to match minimal seed data: - answer_find_question: "Find the question about Python and read the answer" - gitea_find_issue: "Find the open issue in the demo-project repository" PR2 will restore full seed data along with all tasks. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

PR1 now contains the absolute minimum to demonstrate the browser environment: - Gitea site only (no Answer, no WikiJS tasks) - One benign task: gitea_find_issue - No malicious tasks, no couples This makes review easier and establishes the pattern for adding more sites/tasks. PR2 will add Answer, WikiJS, all remaining benign tasks, malicious tasks, and couples. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

PR1 now contains only Gitea site infrastructure: - Remove Answer and WikiJS site directories - Remove Answer and cross-site malicious tasks - Update SiteName to only include "gitea" - Update config, injection, and base modules - Update tests to only test Gitea site PR2 will add Answer, WikiJS, and other sites. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

- Move RenderFn type alias to browser_env.py and import in base.py - Add failure cleanup in DockerSandboxManager.create_sandbox - Update integration tests for PR1 infrastructure-only scope - Add site seeder mapping in config Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

dedeswim and others added 10 commits January 30, 2026 11:38

Update with new registry

106bbaa

Fixes

2d90252

fix issues with ports and so on

1b75e09

Fix

892821b

Fixes

89809b0

Only keep screenshot dataset

38b92a6

update

aafa3ee

meta-cla bot added the CLA Signed This label is managed by the Meta Open Source bot. label Feb 2, 2026

dedeswim changed the base branch from main to generalize-build-script February 2, 2026 17:46

dedeswim and others added 4 commits February 2, 2026 18:52

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Comments

Add browser environment infrastructure with example tasks#112

Add browser environment infrastructure with example tasks#112
dedeswim wants to merge 14 commits intofacebookresearch:generalize-build-scriptfrom
dedeswim:browser-dataset-infrastructure

dedeswim commented Feb 2, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Comments

Conversation

dedeswim commented Feb 2, 2026

Summary

What's Included

What's Deferred to PR2

Test plan

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant