Skip to content

Comments

Add browser environment infrastructure with example tasks#112

Open
dedeswim wants to merge 14 commits intofacebookresearch:generalize-build-scriptfrom
dedeswim:browser-dataset-infrastructure
Open

Add browser environment infrastructure with example tasks#112
dedeswim wants to merge 14 commits intofacebookresearch:generalize-build-scriptfrom
dedeswim:browser-dataset-infrastructure

Conversation

@dedeswim
Copy link
Collaborator

@dedeswim dedeswim commented Feb 2, 2026

Summary

This PR adds the browser environment infrastructure for web agent tasks, split from the original add-browser-dataset branch for easier review.

This is PR1 of 2:

  • PR1 (this): Browser environment infrastructure + one example task per site
  • PR2: All remaining tasks (benign + malicious) and task couples

What's Included

Core Environment:

  • BrowserEnvironment class with Playwright-based browser automation
  • Fresh Docker containers per task (following SWE-bench pattern)
  • Request capture for attack evaluation

Site Infrastructure:

  • Apache Answer (Q&A site)
  • Gitea (Git forge)
  • Wiki.js (Wiki)
  • Each site has: Dockerfile, seeding scripts, seed data

Tools:

  • Screenshot-based tools for browser interaction
  • Common browser actions (click, type, navigate, etc.)

Evaluators:

  • Text/selector-based evaluators
  • Exfiltration detection
  • Navigation tracking
  • POST request pattern matching

Example Tasks (1 per site to demonstrate the pattern works):

  • answer_find_question: Find and read a Q&A question
  • gitea_find_issue: Find and summarize a Git issue

What's Deferred to PR2

  • All remaining benign tasks (7 for Answer, 7 for Gitea)
  • All malicious tasks (5 for Answer, 5 for Gitea, 5 cross-site)
  • All task couples (10 Gitea, 10 Answer, 6 cross-site)

Test plan

  • uv run ruff check --fix && uv run ruff format - Linting passes
  • uv run ty check - Type checking passes
  • uv run pytest -vx -m "not docker_integration" - All unit tests pass (837)
  • Browser dataset loads correctly: uv run python -c "from prompt_siren.datasets.browser_dataset import BrowserDataset; print('OK')"

🤖 Generated with Claude Code

dedeswim and others added 10 commits January 30, 2026 11:38
Add a new browser-based dataset for evaluating web agent tasks using
Playwright + Docker containers. Includes three observation modes
(screenshot, a11y tree, HTML) and support for multiple site backends
(Gitea, Apache Answer, Wiki.js).

Key changes:
- Add browser dataset with screenshot, a11y, and HTML observation modes
- Add BrowserEnvironment (non-snapshottable, fresh containers per task)
- Add sandbox manager lifecycle methods (create_sandbox/destroy_sandbox)
- Add port binding support in ContainerSpec and SandboxState
- Rename TaskSetup to SandboxTaskSetup for clarity
- Remove old playwright.py environment (replaced by browser_env.py)
- Add site seeding system with Docker-based seed data generation
- Add browser-specific tools, evaluators, and injection support

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- contexts.py: Clarify that cloned containers don't have their own port
  bindings configured (we copy source's bindings, which works because
  cloning is only used for snapshottable environments)
- tools/__init__.py: Remove incorrect categorization comments (linter
  requires sorted __all__, module docstring already explains the split)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
This commit prepares the browser dataset for a two-PR split:
- PR1 (this): Browser environment infrastructure + one example task per site
- PR2: All remaining tasks (benign + malicious) and task couples

Changes:
- Keep only one example benign task per site (answer_find_question, gitea_find_issue)
- Empty malicious tasks lists (PR2 will add them)
- Empty task couples list (PR2 will add them)
- Update tests to validate PR1 state:
  - test_browser_dataset.py: Test example tasks are present, couples are empty
  - test_dataset_properties.py: Add separate fixture for datasets with malicious tasks,
    exclude browser dataset from malicious/couples tests until PR2

The infrastructure (browser environment, sandbox manager, sites, tools, evaluators,
injection vectors) is all included and functional.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
@meta-cla meta-cla bot added the CLA Signed This label is managed by the Meta Open Source bot. label Feb 2, 2026
@dedeswim dedeswim changed the base branch from main to generalize-build-script February 2, 2026 17:46
dedeswim and others added 4 commits February 2, 2026 18:52
Reduce seed data to the minimum required for example benign tasks:
- Answer: Single Python question with one answer
- Gitea: Single demo-project repo with one issue
- WikiJS: Single welcome page

Simplify example benign tasks to match minimal seed data:
- answer_find_question: "Find the question about Python and read the answer"
- gitea_find_issue: "Find the open issue in the demo-project repository"

PR2 will restore full seed data along with all tasks.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
PR1 now contains the absolute minimum to demonstrate the browser environment:
- Gitea site only (no Answer, no WikiJS tasks)
- One benign task: gitea_find_issue
- No malicious tasks, no couples

This makes review easier and establishes the pattern for adding more sites/tasks.
PR2 will add Answer, WikiJS, all remaining benign tasks, malicious tasks, and couples.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
PR1 now contains only Gitea site infrastructure:
- Remove Answer and WikiJS site directories
- Remove Answer and cross-site malicious tasks
- Update SiteName to only include "gitea"
- Update config, injection, and base modules
- Update tests to only test Gitea site

PR2 will add Answer, WikiJS, and other sites.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Move RenderFn type alias to browser_env.py and import in base.py
- Add failure cleanup in DockerSandboxManager.create_sandbox
- Update integration tests for PR1 infrastructure-only scope
- Add site seeder mapping in config

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

CLA Signed This label is managed by the Meta Open Source bot.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant