WebArena-Verified


WebArena-Verified is the verified release of the WebArena benchmark. It distributes a curated, version-controlled dataset of web tasks together with deterministic evaluators that operate on agent responses and captured network traces. The project is designed for reproducible benchmarking of web agents and provides tooling for both single-task debugging and batch evaluation.

📖 Documentation

📒 Announcements

  • February 2, 2026: Optimized Docker images for all WebArena environments are now available on Docker Hub! Images are up to 92% smaller than the originals, include auto-login headers, and ship Map as a single container (beta; previously 5 separate containers). See the Environments documentation.
  • February 2, 2026: WebArena-Verified is now available via Docker and uvx! Run uvx webarena-verified --help or docker run am1n3e/webarena-verified:latest --help to get started.
  • January 7, 2026: WebArena-Verified is now available on PyPI! Install it easily with pip install webarena-verified.
  • December 2, 2025: We are presenting WebArena-Verified at the Scaling Environments for Agents (SEA) Workshop at NeurIPS 2025 on December 7th in San Diego. Come see us!
  • November 12, 2025: Started an initial release with collaborators to gather early feedback, catch issues, and clarify the documentation. Public release scheduled for December 4th, 2025.

🎯 Highlights

  • Fully audited benchmark: Every task, reference answer, and evaluator has been manually reviewed and corrected
  • Offline evaluation: Evaluate agent runs without requiring live web environments using network trace replay
  • Deterministic scoring: Removed LLM-as-a-judge evaluation and substring matching in favor of type-aware normalization and structural comparison (an illustrative sketch follows this list)
  • WebArena-Verified Hard subset: A difficulty-prioritized 258-task subset for cost-effective evaluation
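
In practice, type-aware normalization means both the reference answer and the agent's answer are parsed into a canonical form before comparison. The sketch below is illustrative only, not the project's evaluator code: the function name and rules are hypothetical, but they show why a value like "1,234" can match 1234.0 and why list ordering does not matter.

# Illustrative sketch only -- not WebArena-Verified's actual evaluator.
def normalize(value):
    if isinstance(value, str):
        text = value.strip().lower().replace(",", "")
        try:
            return float(text)  # numeric strings compare as numbers
        except ValueError:
            return text
    if isinstance(value, list):
        # Structural comparison: order-insensitive via a canonical sort
        return sorted((normalize(v) for v in value), key=repr)
    return value

assert normalize("1,234") == normalize(1234.0)
assert normalize(["B", "a"]) == normalize(["a", "b"])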

🚀 Quick Start

Using uvx (Recommended)

The fastest way to try WebArena-Verified without installing anything:

uvx webarena-verified --help

Run evaluation directly:

uvx webarena-verified eval-tasks \
  --task-ids 108 \
  --output-dir examples/agent_logs/demo

Using Docker

Run evaluation using the Docker image by mounting your output directory:

docker run --rm \
  -v /path/to/output:/data \
  am1n3e/webarena-verified:latest \
  eval-tasks --output-dir /data

Your output directory should contain task subdirectories with agent_response.json and network.har files:

output/
├── 1/
│   ├── agent_response.json
│   └── network.har
├── 2/
│   └── ...
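
For reference, one hedged way to produce that layout from Python (the response fields mirror the inline-response example in the Library section below; the HAR file must come from your own browser capture and is not generated here):

import json
from pathlib import Path

# Hypothetical helper: write one task's response into the layout above.
task_dir = Path("output/1")
task_dir.mkdir(parents=True, exist_ok=True)
response = {"task_type": "NAVIGATE", "status": "SUCCESS", "retrieved_data": None}
(task_dir / "agent_response.json").write_text(json.dumps(response))
# output/1/network.har should be the HAR trace captured from the browser
# session your agent drove, e.g. a HAR export from your automation tool.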

Using pip

Install from PyPI:

pip install webarena-verified

Verify the CLI is working:

webarena-verified --help

For development, clone and install from source:

git clone https://github.com/ServiceNow/webarena-verified.git
cd webarena-verified
uv sync

🌐 Run WebArena Environments

Using the CLI (Recommended)

Start and manage WebArena environments using the built-in CLI:

# Start a site (waits for services to be ready)
webarena-verified env start --site shopping
webarena-verified env start --site shopping_admin
webarena-verified env start --site reddit
webarena-verified env start --site gitlab

# Check status
webarena-verified env status --site shopping

# Stop a site
webarena-verified env stop --site shopping

# Stop all running sites
webarena-verified env stop-all

For sites requiring data setup (Wikipedia, Map):

# Wikipedia - download data first (~100GB)
webarena-verified env setup init --site wikipedia --data-dir ./downloads
webarena-verified env start --site wikipedia --data-dir ./downloads

# Map - download data first (~60GB)
webarena-verified env setup init --site map --data-dir ./downloads
webarena-verified env start --site map

Using Docker Directly

You can also run environments directly with Docker:

# Shopping (Magento)
docker run -d --name webarena-verified-shopping -p 7770:80 -p 7771:8877 am1n3e/webarena-verified-shopping

# Shopping Admin
docker run -d --name webarena-verified-shopping_admin -p 7780:80 -p 7781:8877 am1n3e/webarena-verified-shopping_admin

# Reddit (Postmill)
docker run -d --name webarena-verified-reddit -p 9999:80 -p 9998:8877 am1n3e/webarena-verified-reddit

# GitLab
docker run -d --name webarena-verified-gitlab -p 8023:8023 -p 8024:8877 am1n3e/webarena-verified-gitlab
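
To confirm a container is actually serving traffic, you can probe its mapped port; a minimal check for the shopping site (port 7770 per the command above; substitute the port for other sites):

import urllib.request

# Probe the shopping container's mapped port (7770 in the docker run
# above); getting an HTTP response back means the site is serving.
with urllib.request.urlopen("http://localhost:7770", timeout=30) as resp:
    print(resp.status)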

See the Environments documentation for detailed setup instructions, credentials, and configuration options.

🧪 Evaluate a Task

Evaluate a task using the CLI or programmatically:

CLI:

webarena-verified eval-tasks \
  --task-ids 108 \
  --output-dir examples/agent_logs/demo \
  --config examples/configs/config.example.json

Library:

Start by creating a WebArenaVerified instance with your environment configuration:

from pathlib import Path
from webarena_verified.api import WebArenaVerified
from webarena_verified.types.config import WebArenaVerifiedConfig

# Initialize with configuration
config = WebArenaVerifiedConfig(
    environments={
        "__GITLAB__": {
            "urls": ["http://localhost:8012"],
            "credentials": {"username": "root", "password": "demopass"}
        }
    }
)
wa = WebArenaVerified(config=config)

# Get a single task
task = wa.get_task(44)
print(f"Task intent: {task.intent}")

Once you have your agent's output, evaluate it against the task definition:

With Files:

# Evaluate a task with file paths
result = wa.evaluate_task(
    task_id=44,
    agent_response=Path("output/44/agent_response_44.json"),
    network_trace=Path("output/44/network_44.har")
)

print(f"Score: {result.score}, Status: {result.status}")

With Inline Response:

# Evaluate a task with inline response
result = wa.evaluate_task(
    task_id=44,
    agent_response={
        "task_type": "NAVIGATE",
        "status": "SUCCESS",
        "retrieved_data": None
    },
    network_trace=Path("output/44/network_44.har")
)

print(f"Score: {result.score}, Status: {result.status}")

See the Quick Start Guide for a complete walkthrough using example task logs.

📊 Dataset

  • The WebArena-Verified dataset is in assets/dataset/webarena-verified.json
  • The original WebArena dataset is in assets/dataset/test.raw.json (kept for reference)
  • The WebArena-Verified Hard subset task IDs are in assets/dataset/subsets/webarena-verified-hard.json (a loading sketch follows this list)
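
All three are plain JSON files, so they can also be loaded directly. A hedged sketch, assuming the dataset is a JSON array of task records carrying a task_id field and the subset file is a flat list of IDs (check the files for the authoritative schema):

import json
from pathlib import Path

# Assumed shapes (verify against the files): dataset = array of task
# records with a "task_id" key; subset file = flat list of task IDs.
tasks = json.loads(Path("assets/dataset/webarena-verified.json").read_text())
hard_ids = set(json.loads(
    Path("assets/dataset/subsets/webarena-verified-hard.json").read_text()
))
hard_tasks = [t for t in tasks if t["task_id"] in hard_ids]
print(f"{len(hard_tasks)} hard tasks out of {len(tasks)} total")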

To export the hard subset's task data:

webarena-verified subset-export --name webarena-verified-hard --output webarena-verified-hard.json

See the documentation for more info.

🤝 Contributing

We welcome improvements to both the dataset and the evaluation tooling. See the Contributing Guide for guidelines, local development tips, and dataset update workflows.

📄 Citation

If you use WebArena-Verified in your research, please cite our paper:

@inproceedings{hattami2025webarena,
  title={WebArena Verified: Reliable Evaluation for Web Agents},
  author={Amine El Hattami and Megh Thakkar and Nicolas Chapados and Christopher Pal},
  booktitle={Workshop on Scaling Environments for Agents},
  year={2025},
  url={https://openreview.net/forum?id=94tlGxmqkN}
}

🙏 Acknowledgements

We thank Prof. Shuyan Zhou and Prof. Graham Neubig for their valuable guidance and feedback.