
Commit a0cea3d

recursix authored and gasse committed

Enhance agent configuration and logging in study setup

- Updated `get_vision_agent` to append "_vision" to agent names.
- Improved the `make_study` function to accept single agent args and benchmark types.
- Added detailed docstrings for better clarity on parameters and functionality.
- Introduced `avg_step_timeout` and `demo_mode` attributes in the `Study` class.

1 parent fa0bfb3 commit a0cea3d

File tree

4 files changed
+267 −142 lines changed

README.md

Lines changed: 155 additions & 113 deletions

@@ -1,189 +1,231 @@
 
 
-<a href="https://github.com/user-attachments/assets/fa71f769-6d7b-427a-978b-82aa13a6265f">
-<img src="https://github.com/user-attachments/assets/fa71f769-6d7b-427a-978b-82aa13a6265f" width="1000" />
-</a>
+<a href="https://github.com/user-attachments/assets/c2bc0b80-89da-4afb-9120-2feb018df19d"> <img
+src="https://github.com/user-attachments/assets/c2bc0b80-89da-4afb-9120-2feb018df19d" width="800"
+/> </a>
 
+&nbsp;&nbsp;|&nbsp;&nbsp;
+[🎯 Benchmarks](#🎯-supported-benchmarks) &nbsp;&nbsp;|&nbsp;&nbsp;
+[🛠️ Setup](#🛠️-setup-agentlab) &nbsp;&nbsp;|&nbsp;&nbsp;
+[🤖 Assistant](#ui-assistant) &nbsp;&nbsp;|&nbsp;&nbsp;
+[🚀 Launch Experiments](#🚀-launch-experiments) &nbsp;&nbsp;|&nbsp;&nbsp;
+[🔍 AgentXray](#🔍-agentxray) &nbsp;&nbsp;|&nbsp;&nbsp;
+[🤖 Make Your Own Agent](#implement-a-new-agent) &nbsp;&nbsp;|&nbsp;&nbsp;
+[↻ Reproducibility](#↻-reproducibility) &nbsp;&nbsp;|&nbsp;&nbsp;
+
+<video controls style="max-width: 800px;">
+<source src="https://github.com/ServiceNow/BrowserGym/assets/26232819/e0bfc788-cc8e-44f1-b8c3-0d1114108b85" type="video/mp4">
+Your browser does not support the video tag.
+</video>
 
 
 AgentLab is a framework for developing and evaluating agents on a variety of
-benchmarks supported by [BrowserGym](https://github.com/ServiceNow/BrowserGym).
-This includes:
-* [WebArena](https://webarena.dev/)
-* [WorkArena](https://github.com/ServiceNow/WorkArena) L1, L2, L3
-* [WebLinx](https://mcgill-nlp.github.io/weblinx/)
-* [VisualWebArena](https://github.com/web-arena-x/visualwebarena)
-* Assistant Bench
-* GAIA
-* Mind2Web-live (coming soon ...)
-* [MiniWoB](https://miniwob.farama.org/index.html)
+[benchmarks](#🎯-supported-benchmarks) supported by
+[BrowserGym](https://github.com/ServiceNow/BrowserGym).
 
 AgentLab features:
 * Easy large-scale parallel agent experiments using [ray](https://www.ray.io/)
 * Building blocks for making agents
-* Unified LLM api for OpenRouter, OpenAI, Azure, Self hosted using TGI.
+* Unified LLM API for OpenRouter, OpenAI, Azure, or self-hosted using TGI.
 * Preferred way of running benchmarks like WebArena
 * Various reproducibility features
-* Unified LeaderBoard
-
-The framework enables the design of rich hyperparameter spaces and the launch of
-parallel experiments using ablation studies or random searches. It also provides
-agent_xray, a visualization tool to inspect the results of the experiments using
-a custom gradio interface.
-
-<a href="https://github.com/user-attachments/assets/20a91e7b-94ef-423d-9091-743eebb4733d">
-<img src="https://github.com/user-attachments/assets/20a91e7b-94ef-423d-9091-743eebb4733d" width="250" />
-</a>
-
-## Install agentlab
-
-This repo is intended for testing and developing new agents, hence we clone and install using the `-e` flag.
+* Unified leaderboard (soon)
+
+## 🎯 Supported Benchmarks
+
+| Benchmark | Setup <br> Link | # Task <br> Templates | Seed <br> Diversity | Max <br> Steps | Multi-tab | Hosting Method | BrowserGym <br> Leaderboard |
+|-----------|-----------------|-----------------------|---------------------|----------------|-----------|----------------|-----------------------------|
+| [WebArena](https://webarena.dev/) | [setup](https://github.com/ServiceNow/BrowserGym/blob/main/browsergym/webarena/README.md) | 812 | None | 30 | yes | self-hosted (docker) | soon |
+| [WorkArena](https://github.com/ServiceNow/WorkArena) L1 | [setup](https://github.com/ServiceNow/WorkArena?tab=readme-ov-file#getting-started) | 33 | High | 30 | no | demo instance | soon |
+| [WorkArena](https://github.com/ServiceNow/WorkArena) L2 | [setup](https://github.com/ServiceNow/WorkArena?tab=readme-ov-file#getting-started) | 341 | High | 50 | no | demo instance | soon |
+| [WorkArena](https://github.com/ServiceNow/WorkArena) L3 | [setup](https://github.com/ServiceNow/WorkArena?tab=readme-ov-file#getting-started) | 341 | High | 50 | no | demo instance | soon |
+| [WebLinx](https://mcgill-nlp.github.io/weblinx/) | - | 31586 | None | 1 | no | self-hosted (dataset) | soon |
+| [VisualWebArena](https://github.com/web-arena-x/visualwebarena) | [setup](https://github.com/ServiceNow/BrowserGym/blob/main/browsergym/visualwebarena/README.md) | 910 | None | 30 | yes | self-hosted (docker) | soon |
+| [Assistant Bench](https://assistantbench.github.io/) | [setup](https://github.com/ServiceNow/BrowserGym/blob/main/browsergym/assistantbench/README.md) | 214 | None | 30 | yes | live web | soon |
+| [GAIA](https://huggingface.co/spaces/gaia-benchmark/leaderboard) (soon) | - | - | None | - | - | live web | soon |
+| [Mind2Web-live](https://huggingface.co/datasets/iMeanAI/Mind2Web-Live) (soon) | - | - | None | - | - | live web | soon |
+| [MiniWoB](https://miniwob.farama.org/index.html) | [setup](https://github.com/ServiceNow/BrowserGym/blob/main/browsergym/miniwob/README.md) | 125 | Medium | 10 | no | self-hosted (static files) | soon |
+
+## 🛠️ Setup AgentLab
 
 ```bash
-git clone git@github.com:ServiceNow/AgentLab.git
-pip install -e .
+pip install agentlab
 ```
 
-## Set Environment Variables
+Make sure to prepare the required benchmark according to the instructions provided in the [setup
+column](#🎯-supported-benchmarks).
 
 ```bash
 export AGENTLAB_EXP_ROOT=<root directory of experiment results> # defaults to $HOME/agentlab_results
 export OPENAI_API_KEY=<your openai api key> # if OpenAI models are used
-export HUGGINGFACEHUB_API_TOKEN=<your huggingfacehub api token> # if huggingface models are used
-```
-
-## Use an assistant to work for you (at your own cost and risk)
-```bash
-agentlab-assistant --start_url https://www.google.com
 ```
 
-## Prepare Benchmarks
-Depending on which benchmark you use, there are some prerequisites.
-
 <details>
-<summary>MiniWoB</summary>
+<summary>Setup OpenRouter API</summary>
 
 ```bash
-export MINIWOB_URL="file://$HOME/dev/miniwob-plusplus/miniwob/html/miniwob/"
+export OPENROUTER_API_KEY=<your openrouter api key> # if OpenRouter models are used
 ```
 </details>
 
 <details>
+<summary>Setup Azure API</summary>
 
-<summary>WorkArena</summary>
-
-See [detailed instructions on the WorkArena GitHub](https://github.com/ServiceNow/WorkArena?tab=readme-ov-file#getting-started)
-
-At a glance:
-1) [Sign in](https://developer.servicenow.com/) and request a `washington` instance.
-2) Once the instance is ready, you should see `<your instance URL>` and `<your-instance-password>`
-3) Add these to your `.bashrc` (or `.zshrc`) and `source` it (note: make sure that
-all variables are in single quotes unless you happen to have a password with a
-single quote in it)
-```bash
-export SNOW_INSTANCE_URL='https://<your-instance-number>.service-now.com/'
-export SNOW_INSTANCE_UNAME='admin'
-export SNOW_INSTANCE_PWD='<your-instance-password>'
-```
-4) Finally, run these commands:
-
-```bash
-pip install browsergym-workarena
-playwright install
-workarena-install
-```
-
+```bash
+export AZURE_OPENAI_API_KEY=<your azure api key> # if using Azure models
+export AZURE_OPENAI_ENDPOINT=<your endpoint> # if using Azure models
+```
 </details>
 
-<details>
-<summary>WebArena on AWS</summary>
-TODO
-</details>
+## UI-Assistant
+Use an assistant to work for you (at your own cost and risk).
 
-<details>
-<summary>WebArena on Azure</summary>
-TODO
-</details>
+```bash
+agentlab-assistant --start_url https://www.google.com
+```
 
+Try your own agent:
 
+```bash
+agentlab-assistant --agent_config="module.path.to.your.AgentArgs"
+```
+
+## 🚀 Launch experiments
 
+```python
+# Import your agent configuration extending the bgym.AgentArgs class.
+# Make sure this object is imported from a module accessible in PYTHONPATH to properly unpickle.
+from agentlab.agents.generic_agent import AGENT_4o_MINI
 
+from agentlab.experiments.study import make_study
 
-## Launch experiments
+study = make_study(
+    benchmark="miniwob",  # or "webarena", "workarena_l1" ...
+    agent_args=[AGENT_4o_MINI],
+    comment="My first study",
+)
 
-Create your agent or import an existing one:
-```python
-from agentlab.agents.generic_agent.agent_configs import AGENT_4o
+study.run(n_jobs=5)
 ```
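Per the commit summary, `make_study` now also accepts a single agent configuration rather than a list; a minimal sketch under that assumption:

```python
# Sketch only: the commit message says make_study accepts single agent args
# and benchmark types; the exact accepted forms are assumed here.
from agentlab.agents.generic_agent import AGENT_4o_MINI
from agentlab.experiments.study import make_study

study = make_study(
    benchmark="miniwob",       # benchmark passed as a plain string
    agent_args=AGENT_4o_MINI,  # single agent, no list wrapper
    comment="single-agent study",
)
study.run(n_jobs=5)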
 
-Run the agent on a benchmark:
+Relaunching incomplete or errored tasks:
+
 ```python
-study_name, exp_args_list = run_agents_on_benchmark(AGENT_4o, benchmark)
-study_dir = make_study_dir(RESULTS_DIR, study_name)
-run_experiments(n_jobs, exp_args_list, study_dir)
+from agentlab.experiments.study import Study
+study = Study.load("/path/to/your/study/dir")
+study.find_incomplete(include_errors=True)
+study.run()
 ```
 
-use [main.py](main.py) to launch experiments with a variety
-of options. This is like a lazy CLI that is actually more convenient than a CLI.
-Just comment and uncomment the lines you need or modify at will (but don't push
-to the repo).
-
-<details>
+See [main.py](main.py) to launch experiments with a variety of options. This is like a lazy CLI that
+is actually more convenient. Just comment and uncomment the lines you need or modify at will (but
+don't push to the repo).
 
-<summary>Debugging</summary>
 
-For debugging, run experiments using `n_jobs=1` and use VSCode debug mode. This
-will allow you to stop on breakpoints. To prevent the debugger from stopping
-on errors when running multiple experiments directly in VSCode, set
-`enable_debug = False` in `ExpArgs`
-</details>
+### Job Timeouts
 
+The complexity of the wild web, Playwright, and asyncio can sometimes cause jobs to hang, which
+disables workers until the study is terminated and relaunched. If you are running jobs sequentially
+or with a small number of workers, this could halt your entire study until you manually kill and
+relaunch it. In the Ray parallel backend, we've implemented a system to automatically terminate jobs
+exceeding a specified timeout. This feature is particularly useful when task hanging limits your
+experiments.
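The commit also introduces an `avg_step_timeout` attribute on `Study`, which presumably feeds this timeout mechanism; a hedged sketch (attribute name from the commit message, units and semantics assumed):

```python
# Hedged sketch: avg_step_timeout comes from this commit's summary; it is
# assumed here to be a per-step budget in seconds used by the Ray backend
# to derive a job-level timeout.
from agentlab.agents.generic_agent import AGENT_4o_MINI
from agentlab.experiments.study import make_study

study = make_study(benchmark="miniwob", agent_args=[AGENT_4o_MINI])
study.avg_step_timeout = 60  # assumption: seconds allotted per agent step
study.run(n_jobs=5)
```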

+### Debugging
 
+For debugging, run experiments with `n_jobs=1` and use VSCode's debug mode. This allows you to pause
+execution at breakpoints. To prevent the debugger from stopping on errors while running multiple
+experiments in VSCode, set `enable_debug = False` in `ExpArgs`.
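A minimal sketch of this workflow, reusing the `make_study` API from the launch section:

```python
# Run a single job at a time so VSCode breakpoints inside the agent are hit.
from agentlab.agents.generic_agent import AGENT_4o_MINI
from agentlab.experiments.study import make_study

study = make_study(benchmark="miniwob", agent_args=[AGENT_4o_MINI])
study.run(n_jobs=1)  # sequential execution; debugger-friendly
```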

+### About Parallel Jobs
 
-<details>
+Running one agent on one task corresponds to a single job. Conducting ablation studies or random
+searches across hundreds of tasks with multiple seeds can generate more than 10,000 jobs. Efficient
+parallel execution is therefore critical. Agents typically wait for responses from the LLM server or
+updates from the web server. As a result, you can run 10–50 jobs in parallel on a single computer,
+depending on available RAM.
 
-<summary>Parallel jobs</summary>
+⚠️ **Note for (Visual)WebArena**: These benchmarks have task dependencies designed to minimize
+"corrupting" the instance between tasks. For example, an agent on task 323 could alter the instance
+state, making task 201 impossible. To address this, the Ray backend accounts for task dependencies,
+enabling some degree of parallelism. On WebArena, you can disable dependencies to increase
+parallelism, but this might reduce performance by 1–2%.
 
-Running one agent on one task correspond to one job. When conducting ablation
-studies or random searches on hundreds of tasks with multiple seeds, this can
-lead to more than 10000 jobs. It is thus crucial to execute them in parallel.
-The agent usually wait on the LLM server to return the results or the web server
-to update the page. Hence, you can run 10-50 jobs in parallel on a single
-computer depending on how much RAM is available.
+⚠️ **Instance Reset for (Visual)WebArena**: Before evaluating an agent, the instance is
+automatically reset, a process that takes about 5 minutes. When evaluating multiple agents, the
+`make_study` function returns a `SequentialStudies` object to ensure proper sequential evaluation of
+each agent. AgentLab currently does not support evaluations across multiple instances, but you could
+either create a quick script to handle this or submit a PR to AgentLab. For a smoother parallel
+experience, consider using benchmarks like WorkArena instead.
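To illustrate the multi-agent case described above, a sketch (the import paths are assumed; `AGENT_4o` appears in the old README, `AGENT_4o_MINI` in the new one):

```python
# Sketch: with several agents on (Visual)WebArena, make_study is described
# above as returning a SequentialStudies object, so each agent is evaluated
# against a freshly reset instance.
from agentlab.agents.generic_agent.agent_configs import AGENT_4o, AGENT_4o_MINI
from agentlab.experiments.study import make_study

study = make_study(
    benchmark="webarena",
    agent_args=[AGENT_4o_MINI, AGENT_4o],  # multiple agents -> sequential studies
)
study.run(n_jobs=5)  # parallel within each agent's study, sequential across agents
```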

-</details>
 
-## AgentXray
+## 🔍 AgentXray
 While your experiments are running, you can inspect the results using:
 
 ```bash
 agentlab-xray
 ```
 
-<a href="https://github.com/user-attachments/assets/20a91e7b-94ef-423d-9091-743eebb4733d">
-<img src="https://github.com/user-attachments/assets/20a91e7b-94ef-423d-9091-743eebb4733d" width="250" />
-</a>
 
-You will be able to select the recent experiments in the directory
-`AGENTLAB_EXP_ROOT` and visualize the results in a gradio interface.
+<video controls style="max-width: 800px;">
+<source src="https://github.com/user-attachments/assets/06c4dac0-b78f-45b7-9405-003da4af6b37" type="video/mp4">
+Your browser does not support the video tag.
+</video>
+
+
+You will be able to select the recent experiments in the directory `AGENTLAB_EXP_ROOT` and visualize
+the results in a gradio interface.
 
 In the following order, select:
 * The experiment you want to visualize
 * The agent if there is more than one
 * The task
 * And the seed
 
-Once this is selected, you can see the trace of your agent on the given task.
-Click on the profiling image to select a step and observe the action taken by the agent.
+Once this is selected, you can see the trace of your agent on the given task. Click on the profiling
+image to select a step and observe the action taken by the agent.

 ## Implement a new Agent
 
-Get inspiration from the `MostBasicAgent` in [agentlab/agents/most_basic_agent/most_basic_agent.py](src/agentlab/agents/most_basic_agent/most_basic_agent.py)
+Get inspiration from the `MostBasicAgent` in
+[agentlab/agents/most_basic_agent/most_basic_agent.py](src/agentlab/agents/most_basic_agent/most_basic_agent.py).
+For better integration with the tools, make sure to implement most functions in the
+[AgentArgs](src/agentlab/agents/agent_args.py#L5) API and the extended `bgym.AbstractAgentArgs`.
+
+If you think your agent should be included directly in AgentLab, let us know and it can be added in
+agentlab/agents/ with the name of your agent.
+
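A skeletal sketch of what such an agent might look like; everything beyond the `AgentArgs` import path linked above is illustrative, not the actual API:

```python
from dataclasses import dataclass

import bgym
from agentlab.agents.agent_args import AgentArgs  # path from the link above


class MyAgent(bgym.Agent):  # assumption: bgym exposes an Agent base class
    def __init__(self, model_name: str):
        self.model_name = model_name

    def get_action(self, obs):
        # Inspect the observation and return (action, agent_info);
        # the body and signature here are illustrative.
        return "noop()", {}


@dataclass
class MyAgentArgs(AgentArgs):
    agent_name: str = "MyAgent"
    model_name: str = "gpt-4o-mini"

    def make_agent(self) -> MyAgent:
        # assumption: a factory method like this is what the AgentArgs API expects
        return MyAgent(self.model_name)
```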
+## ↻ Reproducibility
+Several factors can influence the reproducibility of results when evaluating agents on
+dynamic benchmarks.
+
+### Factors affecting reproducibility
+* **Software versions**: A different version of Playwright, or of any package in the software stack,
+  could influence the behavior of the benchmark or the agent.
+* **API-based LLMs silently changing**: Even for a fixed version, an LLM may be updated, e.g. to
+  incorporate the latest web knowledge.
+* **Live websites**:
+  * WorkArena: The demo instance is mostly pinned to a specific version, but ServiceNow
+    sometimes pushes minor modifications.
+  * AssistantBench and GAIA: These rely on the agent navigating the open web. The experience may
+    change depending on the country or region, and some websites might be in different languages by
+    default.
+* **Stochastic agents**: Setting the temperature of the LLM to 0 can remove most of the stochasticity.
+* **Non-deterministic tasks**: For a fixed seed, the changes should be minimal.
+
+### Reproducibility Features
+* `Study` contains a dict of reproducibility information, including the benchmark version, package
+  versions, and commit hash.
+* The `Study` class allows automatic upload of your results to
+  [`reproducibility_journal.csv`](reproducibility_journal.csv). This makes it easier to populate a
+  large number of reference points.
+* **Reproduced results in the leaderboard**: For agents that are reproducible, we encourage users
+  to try to reproduce the results and upload them to the leaderboard. There is a special column
+  containing information about all reproduced results of an agent on a benchmark.
+* **ReproducibilityAgent**: You can run this agent on an existing study and it will try to re-run
+  the same actions on the same task seeds. A visual diff of the two prompts will be displayed in the
+  AgentInfo HTML tab of AgentXray, so you can inspect, on some tasks, what changed
+  between the two executions. **Note**: this is a beta feature and will need some adaptation for your
+  own agent.
 
-Create a new directory in agentlab/agents/ with the name of your agent.
 
 ## Misc
 
src/agentlab/agents/generic_agent/tmlr_config.py

Lines changed: 3 additions & 1 deletion

@@ -56,10 +56,12 @@ def get_base_agent(llm_config: str):
 def get_vision_agent(llm_config: str):
     flags = deepcopy(BASE_FLAGS)
     flags.obs.use_screenshot = True
-    return GenericAgentArgs(
+    agent_args = GenericAgentArgs(
         chat_model_args=CHAT_MODEL_ARGS_DICT[llm_config],
         flags=flags,
     )
+    agent_args.agent_name = f"{agent_args.agent_name}_vision"
+    return agent_args
 
 
 def get_som_agent(llm_config: str):
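A brief usage sketch of the changed function (the `llm_config` key below is illustrative):

```python
# Sketch: after this commit, the vision variant carries a "_vision" suffix in
# its agent_name, keeping its results distinguishable from the base agent's.
agent_args = get_vision_agent("openai/gpt-4o-mini")  # key into CHAT_MODEL_ARGS_DICT; illustrative
print(agent_args.agent_name)  # ends with "_vision"
```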
