
Commit 2921be2

gasse and recursix authored
Refine ux (rebased) (#147)
* black
* little bug
* more flexible requirement
* improve readme
* Enhance agent configuration and logging in study setup
  - Updated `get_vision_agent` to append "_vision" to agent names.
  - Improved `make_study` function to accept single agent args and benchmark types.
  - Added detailed docstrings for better clarity on parameters and functionality.
  - Introduced `avg_step_timeout` and `demo_mode` attributes in the Study class.
* get_text was added by mistake
* Update README and Jupyter notebook with improved documentation and result analysis instructions
* Update README.md
* Update requirements to include Jupyter support for black

---------

Co-authored-by: recursix <[email protected]>
1 parent 0834520 commit 2921be2

6 files changed: +319 −153 lines changed

README.md

Lines changed: 176 additions & 104 deletions
@@ -1,177 +1,249 @@


-<a href="https://github.com/user-attachments/assets/fa71f769-6d7b-427a-978b-82aa13a6265f">
-<img src="https://github.com/user-attachments/assets/fa71f769-6d7b-427a-978b-82aa13a6265f" width="1000" />
-</a>
+<a href="https://github.com/user-attachments/assets/c2bc0b80-89da-4afb-9120-2feb018df19d"> <img
+src="https://github.com/user-attachments/assets/c2bc0b80-89da-4afb-9120-2feb018df19d" width="800"
+/> </a>

+[🎯 Benchmarks](#🎯-supported-benchmarks) &nbsp;&nbsp;|&nbsp;&nbsp;
+[🛠️ Setup](#🛠️-setup-agentlab) &nbsp;&nbsp;|&nbsp;&nbsp;
+[🤖 Assistant](#ui-assistant) &nbsp;&nbsp;|&nbsp;&nbsp;
+[🚀 Launch Experiments](#🚀-launch-experiments) &nbsp;&nbsp;|&nbsp;&nbsp;
+[🔍 Analyse Results](#🔍-analyse-results) &nbsp;&nbsp;|&nbsp;&nbsp;
+[🤖 Make Your Own Agent](#implement-a-new-agent) &nbsp;&nbsp;|&nbsp;&nbsp;
+[↻ Reproducibility](#↻-reproducibility)

+[![PyPI - License](https://img.shields.io/pypi/l/agentlab?style=flat-square)](http://www.apache.org/licenses/LICENSE-2.0)
+[![PyPI - Downloads](https://img.shields.io/pypi/dm/agentlab?style=flat-square)](https://pypistats.org/packages/agentlab)
+[![GitHub star chart](https://img.shields.io/github/stars/ServiceNow/AgentLab?style=flat-square)](https://star-history.com/#ServiceNow/AgentLab)

-AgentLab is a framework for developing and evaluating agents on a variety of
-benchmarks supported by [BrowserGym](https://github.com/ServiceNow/BrowserGym).
-This includes:
-* WebArena
-* WorkArena.L1, L2, L3
-* VisualWebArena (coming soon...)
-* MiniWoB
-
-The framework enables the desing of rich hyperparameter spaces and the launch of
-parallel experiments using ablation studies or random searches. It also provides
-agent_xray, a visualization tool to inspect the results of the experiments using
-a custom gradio interface

-<a href="https://github.com/user-attachments/assets/20a91e7b-94ef-423d-9091-743eebb4733d">
-<img src="https://github.com/user-attachments/assets/20a91e7b-94ef-423d-9091-743eebb4733d" width="250" />
-</a>
+<video controls style="max-width: 700px;">
+<source src="https://github.com/ServiceNow/BrowserGym/assets/26232819/e0bfc788-cc8e-44f1-b8c3-0d1114108b85" type="video/mp4">
+Your browser does not support the video tag.
+</video>

-## Install agentlab

-This repo is intended for testing and developing new agents, hence we clone and install using the `-e` flag.
+AgentLab is a framework for developing and evaluating agents on a variety of
+[benchmarks](#🎯-supported-benchmarks) supported by
+[BrowserGym](https://github.com/ServiceNow/BrowserGym).
+
+AgentLab Features:
+* Easy large scale parallel [agent experiments](#🚀-launch-experiments) using [ray](https://www.ray.io/)
+* Building blocks for making agents over BrowserGym
+* Unified LLM API for OpenRouter, OpenAI, Azure, or self hosted using TGI.
+* Preferred way for running benchmarks like WebArena
+* Various [reproducibility features](#reproducibility-features)
+* Unified LeaderBoard (soon)
+
+## 🎯 Supported Benchmarks
+| Benchmark | Setup <br> Link | # Task <br> Templates | Seed <br> Diversity | Max <br> Steps | Multi-tab | Hosting Method | BrowserGym <br> Leaderboard |
+|-----------|------------|---------|----------------|-----------|-----------|---------------|----------------------|
+| [WebArena](https://webarena.dev/) | [setup](https://github.com/ServiceNow/BrowserGym/blob/main/browsergym/webarena/README.md) | 812 | None | 30 | yes | self hosted (docker) | soon |
+| [WorkArena](https://github.com/ServiceNow/WorkArena) L1 | [setup](https://github.com/ServiceNow/WorkArena?tab=readme-ov-file#getting-started) | 33 | High | 30 | no | demo instance | soon |
+| [WorkArena](https://github.com/ServiceNow/WorkArena) L2 | [setup](https://github.com/ServiceNow/WorkArena?tab=readme-ov-file#getting-started) | 341 | High | 50 | no | demo instance | soon |
+| [WorkArena](https://github.com/ServiceNow/WorkArena) L3 | [setup](https://github.com/ServiceNow/WorkArena?tab=readme-ov-file#getting-started) | 341 | High | 50 | no | demo instance | soon |
+| [WebLinx](https://mcgill-nlp.github.io/weblinx/) | - | 31586 | None | 1 | no | self hosted (dataset) | soon |
+| [VisualWebArena](https://github.com/web-arena-x/visualwebarena) | [setup](https://github.com/ServiceNow/BrowserGym/blob/main/browsergym/visualwebarena/README.md) | 910 | None | 30 | yes | self hosted (docker) | soon |
+| [Assistant Bench](https://assistantbench.github.io/) | [setup](https://github.com/ServiceNow/BrowserGym/blob/main/browsergym/assistantbench/README.md) | 214 | None | 30 | yes | live web | soon |
+| [GAIA](https://huggingface.co/spaces/gaia-benchmark/leaderboard) (soon) | - | - | None | - | - | live web | soon |
+| [Mind2Web-live](https://huggingface.co/datasets/iMeanAI/Mind2Web-Live) (soon) | - | - | None | - | - | live web | soon |
+| [MiniWoB](https://miniwob.farama.org/index.html) | [setup](https://github.com/ServiceNow/BrowserGym/blob/main/browsergym/miniwob/README.md) | 125 | Medium | 10 | no | self hosted (static files) | soon |
+
+## 🛠️ Setup AgentLab

```bash
-git clone git@github.com:ServiceNow/AgentLab.git
-pip install -e .
+pip install agentlab
```

-## Set Environment Variables
+Make sure to prepare the required benchmark according to instructions provided in the [setup
+column](#🎯-supported-benchmarks).

```bash
export AGENTLAB_EXP_ROOT=<root directory of experiment results> # defaults to $HOME/agentlab_results
export OPENAI_API_KEY=<your openai api key> # if openai models are used
-export HUGGINGFACEHUB_API_TOKEN=<your huggingfacehub api token> # if huggingface models are used
```

-## Use an assistant to work for you (at your own cost and risk)
+<details>
+<summary>Setup OpenRouter API</summary>
+
```bash
-agentlab-assistant --start_url https://www.google.com
+export OPENROUTER_API_KEY=<your openrouter api key> # if openrouter models are used
```
-
-## Prepare Benchmarks
-Depending on which benchmark you use, there are some prerequisites
+</details>

<details>
-<summary>MiniWoB</summary>
+<summary>Setup Azure API</summary>

```bash
-export MINIWOB_URL="file://$HOME/dev/miniwob-plusplus/miniwob/html/miniwob/"
+export AZURE_OPENAI_API_KEY=<your azure api key> # if using azure models
+export AZURE_OPENAI_ENDPOINT=<your endpoint> # if using azure models
```
</details>

-<details>
-
-<summary>WorkArena</summary>
-
-See [detailed instructions on workarena github](https://github.com/ServiceNow/WorkArena?tab=readme-ov-file#getting-started)
-
-At a glance:
-1) [Sign in](https://developer.servicenow.com/) and reqeuest a `washington` instance.
-2) Once the instance is ready, you should see `<your instance URL>` and `<your-instance-password>`
-3) Add these to your `.bashrc` (or `.zshrc`) and `source` it (note: make sure that
-all variables are in single quotes unless you happen to have a password with a
-single quote in it)
-```bash
-export SNOW_INSTANCE_URL='https://<your-instance-number>.service-now.com/'
-export SNOW_INSTANCE_UNAME='admin'
-export SNOW_INSTANCE_PWD='<your-instance-password>'
-```
-4) finally run these commands:
-
-```bash
-pip install browsergym-workarena
-playwright install
-workarena-install
-```
+## UI-Assistant
+Use an assistant to work for you (at your own cost and risk).

+```bash
+agentlab-assistant --start_url https://www.google.com
+```

-</details>
+Try your own agent:

-<details>
-<summary>WebArena on AWS</summary>
-TODO
-</details>
+```bash
+agentlab-assistant --agent_config="module.path.to.your.AgentArgs"
+```

-<details>
-<summary>WebArena on Azure</summary>
-TODO
-</details>
+## 🚀 Launch experiments

+```python
+# Import your agent configuration extending bgym.AgentArgs class
+# Make sure this object is imported from a module accessible in PYTHONPATH to properly unpickle
+from agentlab.agents.generic_agent import AGENT_4o_MINI

+from agentlab.experiments.study import make_study

+study = make_study(
+    benchmark="miniwob",  # or "webarena", "workarena_l1" ...
+    agent_args=[AGENT_4o_MINI],
+    comment="My first study",
+)

+study.run(n_jobs=5)
+```

-## Launch experiments
+Relaunching incomplete or errored tasks:

-Create your agent or import an existing one:
```python
-from agentlab.agents.generic_agent.agent_configs import AGENT_4o
+from agentlab.experiments.study import Study
+study = Study.load("/path/to/your/study/dir")
+study.find_incomplete(include_errors=True)
+study.run()
```

-Run the agent on a benchmark:
-```python
-study_name, exp_args_list = run_agents_on_benchmark(AGENT_4o, benchmark)
-study_dir = make_study_dir(RESULTS_DIR, study_name)
-run_experiments(n_jobs, exp_args_list, study_dir)
-```
+See [main.py](main.py) to launch experiments with a variety of options. This is like a lazy CLI that
+is actually more convenient. Just comment and uncomment the lines you need or modify at will (but
+don't push to the repo).

-use [main.py](main.py) to launch experiments with a variety
-of options. This is like a lazy CLI that is actually more convenient than a CLI.
-Just comment and uncomment the lines you need or modify at will (but don't push
-to the repo).

-<details>
+### Job Timeouts

-<summary>Debugging</summary>
+The complexity of the wild web, Playwright, and asyncio can sometimes cause jobs to hang, and a hung
+job ties up its worker until the study is terminated and relaunched. If you are running jobs
+sequentially or with a small number of workers, this could halt your entire study until you manually
+kill and relaunch it. In the Ray parallel backend, we've implemented a system to automatically
+terminate jobs exceeding a specified timeout. This feature is particularly useful when task hanging
+limits your experiments.
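
A minimal sketch of how you might use this, assuming the `avg_step_timeout` attribute that this commit introduces on the `Study` class (the exact units, default, and kill semantics are assumptions; check `study.py` for the real behavior):

```python
from agentlab.agents.generic_agent import AGENT_4o_MINI
from agentlab.experiments.study import make_study

study = make_study(benchmark="miniwob", agent_args=[AGENT_4o_MINI])
# Assumed semantics: the Ray backend terminates a job whose runtime exceeds
# a budget derived from avg_step_timeout (seconds) and the task's max steps.
study.avg_step_timeout = 60
study.run(n_jobs=10)
```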

-For debugging, run experiments using `n_jobs=1` and use VSCode debug mode. This
-will allow you to stop on breakpoints. To prevent the debugger from stopping
-on errors when running multiple experiments directly in VSCode, set
-`enable_debug = False` in `ExpArgs`
-</details>
+### Debugging

+For debugging, run experiments with `n_jobs=1` and use VSCode's debug mode. This allows you to pause
+execution at breakpoints.
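
As a minimal sketch (using only the API shown in the launch section above):

```python
from agentlab.agents.generic_agent import AGENT_4o_MINI
from agentlab.experiments.study import make_study

study = make_study(benchmark="miniwob", agent_args=[AGENT_4o_MINI])
# n_jobs=1 runs everything in the current process, so breakpoints set
# in your agent code will be hit by the VSCode debugger.
study.run(n_jobs=1)
```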

+### About Parallel Jobs

+Running one agent on one task corresponds to a single job. Conducting ablation studies or random
+searches across hundreds of tasks with multiple seeds can generate more than 10,000 jobs. Efficient
+parallel execution is therefore critical. Agents typically wait for responses from the LLM server or
+updates from the web server. As a result, you can run 10–50 jobs in parallel on a single computer,
+depending on available RAM.

+⚠️ **Note for (Visual)WebArena**: These benchmarks have task dependencies designed to minimize
+"corrupting" the instance between tasks. For example, an agent on task 323 could alter the instance
+state, making task 201 impossible. To address this, the Ray backend accounts for task dependencies,
+enabling some degree of parallelism. On WebArena, you can disable dependencies to increase
+parallelism, but this might reduce performance by 1–2%.

-<details>
+⚠️ **Instance Reset for (Visual)WebArena**: Before evaluating an agent, the instance is
+automatically reset, a process that takes about 5 minutes. When evaluating multiple agents, the
+`make_study` function returns a `SequentialStudies` object to ensure proper sequential evaluation of
+each agent. AgentLab currently does not support evaluations across multiple instances, but you could
+either create a quick script to handle this or submit a PR to AgentLab. For a smoother parallel
+experience, consider using benchmarks like WorkArena instead.
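
For instance, a sketch of a multi-agent study (`AGENT_4o` is the other generic-agent config referenced in the old README; its import path here is an assumption):

```python
from agentlab.agents.generic_agent import AGENT_4o, AGENT_4o_MINI  # path assumed
from agentlab.experiments.study import make_study

# With several agents on (Visual)WebArena, make_study returns a
# SequentialStudies object so each agent starts from a reset instance.
study = make_study(
    benchmark="webarena",
    agent_args=[AGENT_4o_MINI, AGENT_4o],
    comment="two agents, evaluated sequentially",
)
study.run(n_jobs=5)
```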

-<summary>Parallel jobs</summary>
+## 🔍 Analyse Results

-Running one agent on one task correspond to one job. When conducting ablation
-studies or random searches on hundreds of tasks with multiple seeds, this can
-lead to more than 10000 jobs. It is thus crucial to execute them in parallel.
-The agent usually wait on the LLM server to return the results or the web server
-to update the page. Hence, you can run 10-50 jobs in parallel on a single
-computer depending on how much RAM is available.
+### Loading Results
+
+The class [`ExpResult`](https://github.com/ServiceNow/BrowserGym/blob/da26a5849d99d9a3169d7b1fde79f909c55c9ba7/browsergym/experiments/src/browsergym/experiments/loop.py#L595) provides a lazy loader for all the information of a specific experiment. You can use [`yield_all_exp_results`](https://github.com/ServiceNow/BrowserGym/blob/da26a5849d99d9a3169d7b1fde79f909c55c9ba7/browsergym/experiments/src/browsergym/experiments/loop.py#L872) to recursively find all results in a directory. Finally, [`load_result_df`](https://github.com/ServiceNow/AgentLab/blob/be1998c5fad5bda47ba50497ec3899aae03e85ec/src/agentlab/analyze/inspect_results.py#L119C5-L119C19) gathers all the summary information in a single dataframe. See [`inspect_results.ipynb`](src/agentlab/analyze/inspect_results.ipynb) for example usage.
+
+```python
+from agentlab.analyze import inspect_results
+result_df = inspect_results.load_result_df("path/to/your/study")
+```
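
From there, ordinary pandas operations apply. A sketch, where the column names are assumptions (inspect `result_df.columns` for the real schema):

```python
from agentlab.analyze import inspect_results

result_df = inspect_results.load_result_df("path/to/your/study")
# Average reward per task; "env.task_name" and "cum_reward" are assumed names.
print(result_df.groupby("env.task_name")["cum_reward"].mean())
```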

-</details>

-## AgentXray
-While your experiments are running, you can inspect the results using:
+### AgentXray
+Inspect the behaviour of your agent using xray. You can load previous or ongoing experiments. The refresh mechanism is currently a bit clunky, but you can refresh the page, refresh the experiment directory list, and select your experiment again to see an updated version of your currently running experiments.
+

```bash
agentlab-xray
```

-<a href="https://github.com/user-attachments/assets/20a91e7b-94ef-423d-9091-743eebb4733d">
-<img src="https://github.com/user-attachments/assets/20a91e7b-94ef-423d-9091-743eebb4733d" width="250" />
-</a>
+**⚠️ Note**: Gradio is still in development, and unexpected behavior has frequently been noticed. Version 5.5 seems to work properly so far. If you're not sure that the proper information is displaying, refresh the page and select your experiment again.
+
+
+<video controls style="max-width: 800px;">
+<source src="https://github.com/user-attachments/assets/06c4dac0-b78f-45b7-9405-003da4af6b37" type="video/mp4">
+Your browser does not support the video tag.
+</video>
+

-You will be able to select the recent experiments in the directory
-`AGENTLAB_EXP_ROOT` and visualize the results in a gradio interface.
+You will be able to select the recent experiments in the directory `AGENTLAB_EXP_ROOT` and visualize
+the results in a gradio interface.

In the following order, select:
* The experiment you want to visualize
* The agent if there is more than one
* The task
* And the seed

-Once this is selected, you can see the trace of your agent on the given task.
-Click on the profiling image to select a step and observe the action taken by the agent.
+Once this is selected, you can see the trace of your agent on the given task. Click on the profiling
+image to select a step and observe the action taken by the agent.

## Implement a new Agent

-Get inspiration from the `MostBasicAgent` in [agentlab/agents/most_basic_agent/most_basic_agent.py](src/agentlab/agents/most_basic_agent/most_basic_agent.py)
+Get inspiration from the `MostBasicAgent` in
+[agentlab/agents/most_basic_agent/most_basic_agent.py](src/agentlab/agents/most_basic_agent/most_basic_agent.py).
+For a better integration with the tools, make sure to implement most functions in the
+[AgentArgs](src/agentlab/agents/agent_args.py#L5) API and the extended `bgym.AbstractAgentArgs`.
+
+If you think your agent should be included directly in AgentLab, let us know and it can be added in
+agentlab/agents/ with the name of your agent.
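
A rough, untested skeleton of the two pieces involved; the method names are assumptions based on the `bgym` interface described above, so mirror `most_basic_agent.py` for the exact signatures:

```python
from dataclasses import dataclass

import bgym

from agentlab.agents.agent_args import AgentArgs


class EchoAgent(bgym.Agent):
    """Toy agent that ignores the observation and always issues a noop."""

    def get_action(self, obs):
        # A real agent would build a prompt from obs and query an LLM here.
        # The (action, info) return shape is assumed from the bgym API.
        return "noop()", {}


@dataclass
class EchoAgentArgs(AgentArgs):
    agent_name: str = "EchoAgent"

    def make_agent(self):
        return EchoAgent()
```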
+
+## ↻ Reproducibility
+Several factors can influence reproducibility of results in the context of evaluating agents on
+dynamic benchmarks.
+
+### Factors affecting reproducibility
+* **Software version**: Different versions of Playwright or any package in the software stack could
+  influence the behavior of the benchmark or the agent.
+* **API based LLMs silently changing**: Even for a fixed version, an LLM may be updated, e.g. to
+  incorporate the latest web knowledge.
+* **Live websites**:
+  * WorkArena: The demo instance is mostly fixed in time to a specific version, but ServiceNow
+    sometimes pushes minor modifications.
+  * AssistantBench and GAIA: These rely on the agent navigating the open web. The experience may
+    change depending on your country or region, and some websites might be in different languages by
+    default.
+* **Stochastic Agents**: Setting the temperature of the LLM to 0 can reduce most stochasticity.
+* **Non deterministic tasks**: For a fixed seed, the changes should be minimal.
+
+### Reproducibility Features
+* `Study` contains a dict of information about reproducibility, including benchmark version, package
+  version and commit hash.
+* The `Study` class allows automatic upload of your results to
+  [`reproducibility_journal.csv`](reproducibility_journal.csv). This makes it easier to populate a
+  large amount of reference points.
+* **Reproduced results in the leaderboard**: For agents that are reproducible, we encourage users to
+  try to reproduce the results and upload them to the leaderboard. There is a special column
+  containing information about all reproduced results of an agent on a benchmark.
+* **ReproducibilityAgent**: You can run this agent on an existing study and it will try to re-run
+  the same actions on the same task seeds. A visual diff of the two prompts will be displayed in the
+  AgentInfo HTML tab of AgentXray, so you can inspect, on some tasks, what changed between the two
+  executions. **Note**: this is a beta feature and will need some adaptation for your own agent.
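
As a sketch, the reproducibility dict can be inspected after loading a study; the attribute name here is an assumption, so check the `Study` class for the actual field:

```python
from agentlab.experiments.study import Study

study = Study.load("/path/to/your/study/dir")
# Assumed attribute: a dict with the benchmark version, package versions,
# and the commit hash recorded when the study was created.
print(study.reproducibility_info)
```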

-Create a new directory in agentlab/agents/ with the name of your agent.

## Misc

requirements.txt

Lines changed: 1 addition & 1 deletion

@@ -1,4 +1,4 @@
-black[jupyter]==24.2.0
+black[jupyter]>=24.2.0
blacken-docs
pre-commit
pytest==7.3.2

src/agentlab/agents/generic_agent/tmlr_config.py

Lines changed: 3 additions & 1 deletion

@@ -56,10 +56,12 @@ def get_base_agent(llm_config: str):
def get_vision_agent(llm_config: str):
    flags = deepcopy(BASE_FLAGS)
    flags.obs.use_screenshot = True
-    return GenericAgentArgs(
+    agent_args = GenericAgentArgs(
        chat_model_args=CHAT_MODEL_ARGS_DICT[llm_config],
        flags=flags,
    )
+    agent_args.agent_name = f"{agent_args.agent_name}_vision"
+    return agent_args


def get_som_agent(llm_config: str):
