Add browser environment infrastructure with example tasks#112
Open
dedeswim wants to merge 14 commits intofacebookresearch:generalize-build-scriptfrom
Open
Add browser environment infrastructure with example tasks#112dedeswim wants to merge 14 commits intofacebookresearch:generalize-build-scriptfrom
dedeswim wants to merge 14 commits intofacebookresearch:generalize-build-scriptfrom
Conversation
Add a new browser-based dataset for evaluating web agent tasks using Playwright + Docker containers. Includes three observation modes (screenshot, a11y tree, HTML) and support for multiple site backends (Gitea, Apache Answer, Wiki.js). Key changes: - Add browser dataset with screenshot, a11y, and HTML observation modes - Add BrowserEnvironment (non-snapshottable, fresh containers per task) - Add sandbox manager lifecycle methods (create_sandbox/destroy_sandbox) - Add port binding support in ContainerSpec and SandboxState - Rename TaskSetup to SandboxTaskSetup for clarity - Remove old playwright.py environment (replaced by browser_env.py) - Add site seeding system with Docker-based seed data generation - Add browser-specific tools, evaluators, and injection support Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- contexts.py: Clarify that cloned containers don't have their own port bindings configured (we copy source's bindings, which works because cloning is only used for snapshottable environments) - tools/__init__.py: Remove incorrect categorization comments (linter requires sorted __all__, module docstring already explains the split) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
This commit prepares the browser dataset for a two-PR split:
- PR1 (this): Browser environment infrastructure + one example task per site
- PR2: All remaining tasks (benign + malicious) and task couples
Changes:
- Keep only one example benign task per site (answer_find_question, gitea_find_issue)
- Empty malicious tasks lists (PR2 will add them)
- Empty task couples list (PR2 will add them)
- Update tests to validate PR1 state:
- test_browser_dataset.py: Test example tasks are present, couples are empty
- test_dataset_properties.py: Add separate fixture for datasets with malicious tasks,
exclude browser dataset from malicious/couples tests until PR2
The infrastructure (browser environment, sandbox manager, sites, tools, evaluators,
injection vectors) is all included and functional.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Reduce seed data to the minimum required for example benign tasks: - Answer: Single Python question with one answer - Gitea: Single demo-project repo with one issue - WikiJS: Single welcome page Simplify example benign tasks to match minimal seed data: - answer_find_question: "Find the question about Python and read the answer" - gitea_find_issue: "Find the open issue in the demo-project repository" PR2 will restore full seed data along with all tasks. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
PR1 now contains the absolute minimum to demonstrate the browser environment: - Gitea site only (no Answer, no WikiJS tasks) - One benign task: gitea_find_issue - No malicious tasks, no couples This makes review easier and establishes the pattern for adding more sites/tasks. PR2 will add Answer, WikiJS, all remaining benign tasks, malicious tasks, and couples. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
PR1 now contains only Gitea site infrastructure: - Remove Answer and WikiJS site directories - Remove Answer and cross-site malicious tasks - Update SiteName to only include "gitea" - Update config, injection, and base modules - Update tests to only test Gitea site PR2 will add Answer, WikiJS, and other sites. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Move RenderFn type alias to browser_env.py and import in base.py - Add failure cleanup in DockerSandboxManager.create_sandbox - Update integration tests for PR1 infrastructure-only scope - Add site seeder mapping in config Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
This PR adds the browser environment infrastructure for web agent tasks, split from the original
add-browser-datasetbranch for easier review.This is PR1 of 2:
What's Included
Core Environment:
BrowserEnvironmentclass with Playwright-based browser automationSite Infrastructure:
Tools:
Evaluators:
Example Tasks (1 per site to demonstrate the pattern works):
answer_find_question: Find and read a Q&A questiongitea_find_issue: Find and summarize a Git issueWhat's Deferred to PR2
Test plan
uv run ruff check --fix && uv run ruff format- Linting passesuv run ty check- Type checking passesuv run pytest -vx -m "not docker_integration"- All unit tests pass (837)uv run python -c "from prompt_siren.datasets.browser_dataset import BrowserDataset; print('OK')"🤖 Generated with Claude Code