
Commit f0e4275

WebArena Verified (#377)
* init commit for webarena verified
* upd Makefile
* adding the basic files
* update dependencies
* start adding integration with wa_verified
* upd readme
* use custom backend for webarena_verified
* pass the wa instance to the evaluator
* pass the wa instance to the evaluator
* cleanup evaluator
* remove custom webarena verified instance
* update requirements to latest wav code
* use simpler and cleaner wav eval
* enable tracing
* fix wav
* update to new webarena verified version
* update task name template to webarena_verified.templateID.taskID
* fix config
* fix csv file
* add webarena_verified backend
* fix wav tasks
* do not check reachable if url is todo
* fix tmp trace creation, update goal to prompt model to satisfy wav return format/
* create webarena_verified action space with special submit function to match the benchmark expected agent response format
* look for extra header file path in environment variable
* undo special action set for webarena_verified
* remove wav actions
* load extra context headers for webarena(+lite)
* update README
* update requirements
* update makefile and readme
* update readme
* update requirements
* update readme
* update test
* black formater
* upd makefile
* update to new webarena_verified dataset version
* small debug
* add massage of shopping_admin tasks
* assume all endpoints are running
* update to latest version before the public release
* update instructions to fetch latest version before the public release
* exponential backoff
* update README
* compare json with the one in the library
* update install instructions
* update makefile
* update pypi deployment with webarena-verified
* fix assets directory
* fix task id template
* remove task json file, use the one from the webarena-verified library. Update task template to include revision number
* remove metadata and create it dynamically
* do not hardcode revision number
* fix
* run black formater
* fix format?
* always create the metadata file
* version-bump-dev
* Remove git dependency and add ins to install from source
* version-bump-dev 0.14.3.dev3
* add webarena-verified package as a dependency
* version-bump-dev 0.14.3.dev4
* add webarena-verified in the dev requirements.txt
* update gitignore

---------

Co-authored-by: Nicolas Gontier <nicolas.gontier@servicenow.com>
Co-authored-by: Aman Jaiswal <amanjaiswal73892@gmail.com>
1 parent 2fe88fd commit f0e4275

32 files changed: +1512 -29 lines changed


.github/workflows/pypi.yml

Lines changed: 6 additions & 1 deletion
@@ -28,8 +28,13 @@ jobs:

       - name: Build a binary wheel and a source tarball (browsergym-webarena)
         run: python3 -m build browsergym/webarena/ --outdir dist/
+
       - name: Build a binary wheel and a source tarball (browsergym-webarenalite)
-        run: python3 -m build browsergym/webarenalite/ --outdir dist/
+        run: python3 -m build browsergym/webarenalite/ --outdir dist/
+
+      - name: Build a binary wheel and a source tarball (browsergym-webarena-verified)
+        run: python3 -m build browsergym/webarena_verified/ --outdir dist
+
       - name: Build a binary wheel and a source tarball (browsergym-visualwebarena)
         run: python3 -m build browsergym/visualwebarena/ --outdir dist/

.gitignore

Lines changed: 3 additions & 0 deletions
@@ -155,3 +155,6 @@ bg_wl_data/
 miniwob-plusplus/

 uv.lock
+
+# webarena verified metadata (constructed automatically)
+browsergym/experiments/src/browsergym/experiments/benchmark/metadata/webarena_verified.csv

Makefile

Lines changed: 7 additions & 7 deletions
@@ -38,12 +38,12 @@ clean-miniwob:

 help:
 	@echo "Available targets:"
-	@echo " install - Install project dependencies"
-	@echo " setup-miniwob - Setup MiniWoB++ dependencies"
-	@echo " install-demo - Install demo dependencies"
-	@echo " demo - Run demo agent"
-	@echo " test-core - Run core tests"
-	@echo " clean-miniwob - Remove MiniWoB++ directory"
-	@echo " help - Show this help message"
+	@echo " install - Install project dependencies"
+	@echo " setup-miniwob - Setup MiniWoB++ dependencies"
+	@echo " install-demo - Install demo dependencies"
+	@echo " demo - Run demo agent"
+	@echo " test-core - Run core tests"
+	@echo " clean-miniwob - Remove MiniWoB++ directory"
+	@echo " help - Show this help message"

 .PHONY: install setup-miniwob install-demo demo test-core clean-miniwob help

README.md

Lines changed: 3 additions & 0 deletions
@@ -39,6 +39,7 @@ _Example of a GPT4-V agent executing openended tasks (top row, chat interactive)
 BrowserGym includes the following benchmarks by default:
 - [MiniWoB](https://miniwob.farama.org/)
 - [WebArena](https://webarena.dev/)
+- [WebArenaVerified](https://github.com/ServiceNow/platform-labs-webarena-verified)
 - [VisualWebArena](https://jykoh.com/vwa)
 - [WorkArena](https://github.com/ServiceNow/WorkArena)
 - [AssistantBench](https://github.com/oriyor/assistantbench)
@@ -55,6 +56,7 @@ pip install browsergym-experiments # experiment utilities (agent, loop, benchma
 pip install browsergym-core # core functionalities only (no benchmark, just the openended task)
 pip install browsergym-miniwob # core + miniwob
 pip install browsergym-webarena # core + webarena
+pip install browsergym-webarena-verified # core + webarena_verified
 pip install browsergym-visualwebarena # core + visualwebarena
 pip install browsergym-workarena # core + workarena
 pip install browsergym-assistantbench # core + assistantbench
@@ -69,6 +71,7 @@ playwright install chromium
 Finally, each benchmark comes with its own specific setup that requires to follow additional steps.
 - for MiniWoB++, see [miniwob/README.md](browsergym/miniwob/README.md)
 - for WebArena, see [webarena/README.md](browsergym/webarena/README.md)
+- for WebArenaVerified, see [webarena_verified/README.md](browsergym/webarena_verified/README.md)
 - for VisualWebArena, see [visualwebarena/README.md](browsergym/visualwebarena/README.md)
 - for WorkArena, see [WorkArena](https://github.com/ServiceNow/WorkArena)
 - for AssistantBench, see [assistantbench/README.md](browsergym/assistantbench/README.md)

Lines changed: 1 addition & 1 deletion
@@ -1,4 +1,4 @@
-browsergym-core==0.14.3.dev1
+browsergym-core==0.14.3.dev4
 datasets
 scipy
 numpy

browsergym/assistantbench/src/browsergym/assistantbench/evaluation/evaluate_utils/evaluate_strings.py

Lines changed: 1 addition & 1 deletion
@@ -69,7 +69,7 @@ def _normalize_number(text: str) -> str:


 def _answer_to_bags(
-    answer: Union[str, List[str], Tuple[str, ...]]
+    answer: Union[str, List[str], Tuple[str, ...]],
 ) -> Tuple[List[str], List[Set[str]]]:
     if isinstance(answer, (list, tuple)):
         raw_spans = answer

browsergym/core/src/browsergym/core/__init__.py

Lines changed: 1 addition & 1 deletion
@@ -1,4 +1,4 @@
-__version__ = "0.14.3.dev1"
+__version__ = "0.14.3.dev4"

 import playwright.sync_api

Lines changed: 1 addition & 1 deletion
@@ -1,3 +1,3 @@
-browsergym-core==0.14.3.dev1
+browsergym-core==0.14.3.dev4
 tiktoken>=0.4
 dataclasses-json

browsergym/experiments/src/browsergym/experiments/benchmark/base.py

Lines changed: 7 additions & 1 deletion
@@ -53,7 +53,13 @@ def make_action_set(self):


 BenchmarkBackend = Literal[
-    "miniwob", "webarena", "visualwebarena", "workarena", "assistantbench", "weblinx"
+    "miniwob",
+    "webarena",
+    "webarena_verified",
+    "visualwebarena",
+    "workarena",
+    "assistantbench",
+    "weblinx",
 ]
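
A `Literal` alias such as `BenchmarkBackend` only constrains static type checkers; nothing rejects an unknown backend string at runtime. As a minimal sketch, the allowed values can be recovered with `typing.get_args` and checked explicitly. The `check_backend` helper below is hypothetical and not part of BrowserGym.

```python
from typing import Literal, get_args

# Mirrors the updated alias from the diff above.
BenchmarkBackend = Literal[
    "miniwob",
    "webarena",
    "webarena_verified",
    "visualwebarena",
    "workarena",
    "assistantbench",
    "weblinx",
]


def check_backend(name: str) -> str:
    """Hypothetical helper: return `name` if it is a known backend, else raise."""
    allowed = get_args(BenchmarkBackend)  # tuple of the literal strings
    if name not in allowed:
        raise ValueError(f"unknown backend {name!r}; expected one of {allowed}")
    return name
```

Note that the backend name uses an underscore (`webarena_verified`), while the pip package uses a hyphen (`browsergym-webarena-verified`), so a check like this would catch the two being mixed up.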

browsergym/experiments/src/browsergym/experiments/benchmark/configs.py

Lines changed: 18 additions & 1 deletion
@@ -1,4 +1,5 @@
 import numpy as np
+
 from browsergym.experiments.benchmark.metadata.utils import (
     task_list_from_metadata,
     task_metadata,
@@ -132,6 +133,21 @@
         ),
         task_metadata=task_metadata("webarena"),
     ),
+    "webarena_verified": lambda n_repeats=1: Benchmark(
+        name="webarena_verified",
+        high_level_action_set_args=DEFAULT_HIGHLEVEL_ACTION_SET_ARGS["webarena"],
+        is_multi_tab=True,
+        supports_parallel_seeds=False,
+        backends=["webarena_verified"],
+        env_args_list=make_env_args_list_from_repeat_tasks(
+            task_list=task_list_from_metadata(metadata=task_metadata("webarena_verified")),
+            max_steps=30,
+            n_repeats=n_repeats,
+            seeds_rng=np.random.RandomState(42),
+        ),
+        task_metadata=task_metadata("webarena_verified"),
+    ),  # TODO: Add webarena-verified hard subsets by filtering tasks in
+    # https://github.com/ServiceNow/webarena-verified/blob/main/assets/dataset/subsets/webarena-verified-hard.json
     "webarena_lite": lambda n_repeats=1: Benchmark(
         name="webarena_lite",
         high_level_action_set_args=DEFAULT_HIGHLEVEL_ACTION_SET_ARGS["webarena"],
@@ -252,7 +268,8 @@
         backends=["assistantbench"],
         env_args_list=make_env_args_list_from_repeat_tasks(
             task_list=task_list_from_metadata(
-                metadata=task_metadata("assistantbench"),  filter={"browsergym_split": "valid|test"}
+                metadata=task_metadata("assistantbench"),
+                filter={"browsergym_split": "valid|test"},
             ),
             max_steps=30,
             n_repeats=n_repeats,
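
The entries in this file follow one pattern: a dictionary of lambda factories, each of which filters a metadata table into a task list and pairs every task with reproducible seeds. The sketch below illustrates that pattern with the standard library only. Everything in it (`ToyBenchmark`, `filter_tasks`, `make_repeats`, `TOY_BENCHMARKS`, the toy metadata rows) is hypothetical. It assumes the filter is a per-column regex with AND semantics, and it uses `random.Random` as a stand-in for `np.random.RandomState`; the real `task_list_from_metadata` and `make_env_args_list_from_repeat_tasks` may behave differently.

```python
import random
import re
from dataclasses import dataclass
from typing import Callable

# Toy metadata rows (assumption: the real benchmark loads these from a CSV file).
METADATA = [
    {"task_name": "toy.1", "browsergym_split": "train"},
    {"task_name": "toy.2", "browsergym_split": "valid"},
    {"task_name": "toy.3", "browsergym_split": "test"},
]


def filter_tasks(metadata: list[dict], filters: dict[str, str]) -> list[str]:
    """Keep rows whose listed columns all match their regex, then return task names."""
    rows = metadata
    for column, pattern in filters.items():
        rows = [row for row in rows if re.search(pattern, str(row[column]))]
    return [row["task_name"] for row in rows]


def make_repeats(task_list: list[str], n_repeats: int, rng: random.Random) -> list[tuple[str, int]]:
    """Draw one seed per (task, repeat) pair from the given generator."""
    return [(task, rng.randint(0, 2**31 - 1)) for task in task_list for _ in range(n_repeats)]


@dataclass
class ToyBenchmark:
    name: str
    env_args_list: list[tuple[str, int]]  # (task_name, seed) pairs


# Each entry is a factory, so nothing is built until a benchmark is requested.
TOY_BENCHMARKS: dict[str, Callable[..., ToyBenchmark]] = {
    "toy": lambda n_repeats=1: ToyBenchmark(
        name="toy",
        env_args_list=make_repeats(
            task_list=filter_tasks(METADATA, {"browsergym_split": "valid|test"}),
            n_repeats=n_repeats,
            rng=random.Random(42),  # fresh fixed-seed generator on every call
        ),
    ),
}

benchmark = TOY_BENCHMARKS["toy"](n_repeats=2)
```

Because the generator is created inside the factory with a fixed seed, calling the factory twice with the same `n_repeats` yields the same seed sequence, which is presumably the purpose of `seeds_rng=np.random.RandomState(42)` in the entries above.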
