Skip to content

Commit d2c0536

Browse files
gassejardinetsouffletonaldro61
authored
Release 0.3.0 (ICML version, only L1) (#17)
* Internal repo sync * version bump 0.3.0 * Update README.md --------- Co-authored-by: Leo Boisvert <[email protected]> Co-authored-by: Alexandre Drouin <[email protected]>
1 parent e672831 commit d2c0536

File tree

96 files changed

+27161
-2079
lines changed

Some content is hidden

Large Commits have some content hidden by default. Use the searchbox below for content that may be hidden.

96 files changed

+27161
-2079
lines changed

README.md

Lines changed: 17 additions & 16 deletions
Original file line numberDiff line numberDiff line change
@@ -10,12 +10,9 @@ WorkArena is included in [BrowserGym](https://github.com/ServiceNow/BrowserGym),
1010

1111
https://github.com/ServiceNow/WorkArena/assets/2374980/68640f09-7d6f-4eb1-b556-c294a6afef70
1212

13-
## ⚠️ Pre-Release warning ⚠️
14-
Please note that the WorkArena benchmark is still undergoing minor bug fixes and updates, which may cause discrepancies with results reported in our latest arXiv preprint. We plan to release soon a stable version of WorkArena with enhanced stability, and a final version v1.0.0 with a new suite of tasks.
15-
1613
## Benchmark Contents
1714

18-
At the moment, WorkArena includes `19,951` task instances drawn from `33` tasks that cover the main components of the ServiceNow user interface. The following videos show an agent built on `GPT-4-vision` interacting with every such component. As emphasized by our results, this benchmark is not solved and thus, the performance of the agent is not always on point.
15+
At the moment, WorkArena includes `19,912` unique instances drawn from `33` tasks that cover the main components of the ServiceNow user interface. The following videos show an agent built on `GPT-4-vision` interacting with every such component. As emphasized by our results, this benchmark is not solved and thus, the performance of the agent is not always on point.
1916

2017
### Knowledge Bases
2118

@@ -53,8 +50,11 @@ https://github.com/ServiceNow/WorkArena/assets/1726818/ca26dfaf-2358-4418-855f-8
5350

5451
### Dashboards
5552

56-
**Goal:** The agent must extract information from a dashboard.
53+
**Goal:** The agent must answer a question that requires reading charts and (optionally) performing simple reasoning over them.
54+
55+
*Note: For demonstration purposes, a human is controlling the cursor since this is a pure retrieval task*
5756

57+
https://github.com/ServiceNow/WorkArena/assets/1726818/0023232c-081f-4be4-99bd-f60c766e6c3f
5858

5959

6060
## Getting Started
@@ -98,6 +98,8 @@ Your installation is now complete! 🎉
9898

9999
Run this code to see WorkArena in action.
100100

101+
Note: the following example executes WorkArena's oracle (cheat) function to solve each task. To evaluate an agent, calls to `env.step()` must be used instead.
102+
101103
```python
102104
import random
103105

@@ -112,28 +114,27 @@ for task in ALL_WORKARENA_TASKS:
112114

113115
# Instantiate a new environment
114116
env = BrowserEnv(task_entrypoint=task,
115-
headless=False,
116-
slow_mo=1000)
117+
headless=False)
117118
env.reset()
118119

119120
# Cheat functions use Playwright to automatically solve the task
120121
env.chat.add_message(role="assistant", msg="On it. Please wait...")
121-
env.task.cheat(env.page, env.chat.messages)
122+
cheat_messages = []
123+
env.task.cheat(env.page, cheat_messages)
124+
125+
# Send cheat messages to chat
126+
for cheat_msg in cheat_messages:
127+
env.chat.add_message(role=cheat_msg["role"], msg=cheat_msg["message"])
122128

123129
# Post solution to chat
124-
if "KnowledgeBaseSearchTask" in str(task):
125-
answer = env.chat.messages[-1]["message"]
126-
env.chat.add_message(role="assistant", msg=f"The answer is:")
127-
env.chat.add_message(role="assistant", msg=answer)
128-
else:
129-
env.chat.add_message(role="assistant", msg="I'm done!")
130+
env.chat.add_message(role="assistant", msg="I'm done!")
130131

131132
# Validate the solution
132-
reward, stop, info, message = env.task.validate(env.page, env.chat.messages)
133+
reward, stop, message, info = env.task.validate(env.page, cheat_messages)
133134
if reward == 1:
134135
env.chat.add_message(role="user", msg="Yes, that works. Thanks!")
135136
else:
136-
env.chat.add_message(role="user", msg=f"No, that doesn't work. {message.get('message', '')}")
137+
env.chat.add_message(role="user", msg=f"No, that doesn't work. {info.get('message', '')}")
137138

138139
sleep(3)
139140
env.close()

pyproject.toml

Lines changed: 1 addition & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -31,6 +31,7 @@ homepage = "https://github.com/ServiceNow/WorkArena"
3131

3232
[project.scripts]
3333
workarena-install = "browsergym.workarena.install:main"
34+
workarena-human-eval = "browsergym.workarena.human_eval.tool:main"
3435

3536
[tool.hatch.version]
3637
path = "src/browsergym/workarena/__init__.py"

requirements.txt

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -1,6 +1,6 @@
11
browsergym-core>=0.2
22
english-words>=2.0.1
3-
faker>=24.11.0
3+
Faker>=24.8.0
44
numpy>=1.14
55
requests>=2.31
66
tenacity>=8.2.3 # only used in cheat() -> move to tests?
Lines changed: 131 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,131 @@
1+
"""
2+
A demonstration of how observation/action traces can be extracted
3+
for WorkArena tasks without modifying the task code.
4+
5+
Author: Alexandre Drouin ([email protected])
6+
7+
Notes:
8+
- This approach relies on monkey patching the playwright actions to log the actions and observations.
9+
It has not been tested for parallel execution. It might work with multiprocessing, but it will for
10+
sure not work with multithreading.
11+
12+
"""
13+
14+
import importlib
15+
import logging
16+
import os
17+
import pickle
18+
import playwright.sync_api as playwright_sync
19+
20+
from browsergym.core.env import BrowserEnv
21+
from browsergym.workarena import ALL_WORKARENA_TASKS
22+
from collections import defaultdict
23+
from tenacity import retry, stop_after_attempt, wait_fixed
24+
from time import time
25+
26+
27+
N_PER_TASK = 10
28+
29+
30+
def monkey_patch_playwright(observation_callback, trace_storage):
31+
"""
32+
A function that overrides the default playwright actions to log the actions and observations.
33+
34+
Parameters:
35+
------------
36+
observation_callback: callable
37+
A function that returns the observation of the environment.
38+
trace_storage: list
39+
A list to store the trace of the actions and observations.
40+
These will be appended in-place.
41+
42+
"""
43+
44+
def wrapper(func, interface):
45+
def wrapped(*args, **kwargs):
46+
# Get the observation
47+
obs = observation_callback()
48+
49+
# Get the BID of the element on which we are acting.
50+
if interface.__name__ == "Locator":
51+
# Get the locator
52+
locator = args[0]
53+
# Get the BID
54+
bid = locator.element_handle().evaluate('(el) => el.getAttribute("bid")')
55+
elif interface.__name__ == "Keyboard":
56+
# Get the BID of the element
57+
bid = "keyboard"
58+
else:
59+
# Get the BID of the element
60+
bid = args[0].evaluate('(el) => el.getAttribute("bid")')
61+
62+
logging.info(f"Action: {func.__name__} BID: {bid} -- Args: {args[1:]} {kwargs}")
63+
trace_storage.append(
64+
{
65+
"obs": obs,
66+
"action": func.__name__,
67+
"args": args[1:],
68+
"kwargs": kwargs,
69+
"bid": bid,
70+
"time": time(),
71+
}
72+
)
73+
74+
# Resume action
75+
return func(*args, **kwargs)
76+
77+
return wrapped
78+
79+
# Interfaces and actions we want to monkey patch
80+
importlib.reload(playwright_sync)
81+
from playwright.sync_api import Page, Frame, Locator, Keyboard, ElementHandle
82+
83+
# TODO: Make sure the list of interfaces and actions is exhaustive
84+
# It covers all that is used in WorkArena cheats as of April 11, 2024
85+
interfaces = [Page, Frame, Locator, Keyboard, ElementHandle]
86+
actions = ["click", "select_option", "set_checked", "fill", "press", "type", "down", "up"]
87+
88+
for interface in interfaces:
89+
for action in actions:
90+
if hasattr(interface, action):
91+
setattr(interface, action, wrapper(getattr(interface, action), interface))
92+
print(f"Monkey patched {interface.__name__}.{action}")
93+
94+
95+
@retry(stop=stop_after_attempt(3), wait=wait_fixed(2))
96+
def extract_trace(task_cls, headless=True):
97+
"""
98+
Extracts the trace of actions and observations for a given task.
99+
100+
Parameters:
101+
------------
102+
task_cls: class
103+
The class of the task to extract the trace from.
104+
105+
"""
106+
# Instantiate a new environment
107+
env = BrowserEnv(task_entrypoint=task_cls, headless=headless, slow_mo=1000)
108+
109+
# Setup customized tracing
110+
trace = []
111+
monkey_patch_playwright(observation_callback=env._get_obs, trace_storage=trace)
112+
113+
env.reset()
114+
env.task.cheat(env.page, env.chat.messages)
115+
env.close()
116+
117+
return trace
118+
119+
120+
if __name__ == "__main__":
121+
os.makedirs("trace_profiling", exist_ok=True)
122+
123+
task_traces = defaultdict(list)
124+
for task in ALL_WORKARENA_TASKS:
125+
print("Task:", task)
126+
for i in range(N_PER_TASK):
127+
print(f"Extracting trace {i+1}/{N_PER_TASK}")
128+
trace = extract_trace(task, headless=True)
129+
task_traces[task].append(trace)
130+
131+
pickle.dump(task_traces, open("trace_profiling/task_traces.pkl", "wb"))

0 commit comments

Comments
 (0)