|
1 | 1 |
|
2 | 2 |
|
3 | | -<a href="https://github.com/user-attachments/assets/fa71f769-6d7b-427a-978b-82aa13a6265f"> |
4 | | - <img src="https://github.com/user-attachments/assets/fa71f769-6d7b-427a-978b-82aa13a6265f" width="1000" /> |
5 | | -</a> |
| 3 | +<a href="https://github.com/user-attachments/assets/c2bc0b80-89da-4afb-9120-2feb018df19d"> <img |
| 4 | + src="https://github.com/user-attachments/assets/c2bc0b80-89da-4afb-9120-2feb018df19d" width="800" |
| 5 | +/> </a> |
6 | 6 |
|
| 7 | + | |
| 8 | +[🎯 Benchmarks](#🎯-supported-benchmarks) | |
| 9 | +[🛠️ Setup](#🛠️-setup-agentlab) | |
| 10 | +[🤖 Assistant](#ui-assistant) | |
| 11 | +[🚀 Launch Experiments](#🚀-launch-experiments) | |
| 12 | +[🔍 AgentXray](#🔍-agentxray) | |
| 13 | +[🤖 Make Your Own Agent](#implement-a-new-agent) | |
| 14 | +[↻ Reproducibility](#↻-reproducibility) | |
| 15 | + |
| 16 | +<video controls style="max-width: 800px;"> |
| 17 | + <source src="https://github.com/ServiceNow/BrowserGym/assets/26232819/e0bfc788-cc8e-44f1-b8c3-0d1114108b85" type="video/mp4"> |
| 18 | + Your browser does not support the video tag. |
| 19 | +</video> |
7 | 20 |
|
8 | 21 |
|
9 | 22 | AgentLab is a framework for developing and evaluating agents on a variety of |
10 | | -benchmarks supported by [BrowserGym](https://github.com/ServiceNow/BrowserGym). |
11 | | -This includes: |
12 | | -* [WebArena](https://webarena.dev/) |
13 | | -* [WorkArena](https://github.com/ServiceNow/WorkArena) L1, L2, L3 |
14 | | -* [WebLinx](https://mcgill-nlp.github.io/weblinx/) |
15 | | -* [VisualWebArena](https://github.com/web-arena-x/visualwebarena) |
16 | | -* Assistant Bench |
17 | | -* GAIA |
18 | | -* Mind2Web-live (coming soon ...) |
19 | | -* [MiniWoB](https://miniwob.farama.org/index.html) |
| 23 | +[benchmarks](#🎯-supported-benchmarks) supported by |
| 24 | +[BrowserGym](https://github.com/ServiceNow/BrowserGym). |
20 | 25 |
|
21 | 26 | AgentLab Features: |
22 | 27 | * Easy large-scale parallel agent experiments using [Ray](https://www.ray.io/)
23 | 28 | * Building blocks for making agents |
24 | | -* Unified LLM api for OpenRouter, OpenAI, Azure, Self hosted using TGI. |
| 29 | +* Unified LLM API for OpenRouter, OpenAI, Azure, or self-hosted models using TGI.
25 | 30 | * Preferred way to run benchmarks like WebArena
26 | 31 | * Various reproducibility features
27 | | -* Unified LeaderBoard |
28 | | - |
29 | | -The framework enables the desing of rich hyperparameter spaces and the launch of |
30 | | -parallel experiments using ablation studies or random searches. It also provides |
31 | | -agent_xray, a visualization tool to inspect the results of the experiments using |
32 | | -a custom gradio interface |
33 | | - |
34 | | -<a href="https://github.com/user-attachments/assets/20a91e7b-94ef-423d-9091-743eebb4733d"> |
35 | | - <img src="https://github.com/user-attachments/assets/20a91e7b-94ef-423d-9091-743eebb4733d" width="250" /> |
36 | | -</a> |
37 | | - |
38 | | -## Install agentlab |
39 | | - |
40 | | -This repo is intended for testing and developing new agents, hence we clone and install using the `-e` flag. |
| 32 | +* Unified leaderboard (soon)
| 33 | + |
| 34 | +## 🎯 Supported Benchmarks |
| 35 | +| Benchmark | Setup <br> Link | # Task <br> Templates | Seed <br> Diversity | Max <br> Steps | Multi-tab | Hosting <br> Method | BrowserGym <br> Leaderboard |
| 36 | +|-----------|------------|---------|----------------|-----------|-----------|---------------|----------------------| |
| 37 | +| [WebArena](https://webarena.dev/) | [setup](https://github.com/ServiceNow/BrowserGym/blob/main/browsergym/webarena/README.md) | 812 | None | 30 | yes | self-hosted (Docker) | soon |
| 38 | +| [WorkArena](https://github.com/ServiceNow/WorkArena) L1 | [setup](https://github.com/ServiceNow/WorkArena?tab=readme-ov-file#getting-started) | 33 | High | 30 | no | demo instance | soon | |
| 39 | +| [WorkArena](https://github.com/ServiceNow/WorkArena) L2 | [setup](https://github.com/ServiceNow/WorkArena?tab=readme-ov-file#getting-started) | 341 | High | 50 | no | demo instance | soon | |
| 40 | +| [WorkArena](https://github.com/ServiceNow/WorkArena) L3 | [setup](https://github.com/ServiceNow/WorkArena?tab=readme-ov-file#getting-started) | 341 | High | 50 | no | demo instance | soon | |
| 41 | +| [WebLinx](https://mcgill-nlp.github.io/weblinx/) | - | 31586 | None | 1 | no | self-hosted (dataset) | soon |
| 42 | +| [VisualWebArena](https://github.com/web-arena-x/visualwebarena) | [setup](https://github.com/ServiceNow/BrowserGym/blob/main/browsergym/visualwebarena/README.md) | 910 | None | 30 | yes | self-hosted (Docker) | soon |
| 43 | +| [Assistant Bench](https://assistantbench.github.io/) | [setup](https://github.com/ServiceNow/BrowserGym/blob/main/browsergym/assistantbench/README.md) | 214 | None | 30 | yes | live web | soon | |
| 44 | +| [GAIA](https://huggingface.co/spaces/gaia-benchmark/leaderboard) (soon) | - | - | None | - | - | live web | soon | |
| 45 | +| [Mind2Web-live](https://huggingface.co/datasets/iMeanAI/Mind2Web-Live) (soon) | - | - | None | - | - | live web | soon | |
| 46 | +| [MiniWoB](https://miniwob.farama.org/index.html) | [setup](https://github.com/ServiceNow/BrowserGym/blob/main/browsergym/miniwob/README.md) | 125 | Medium | 10 | no | self-hosted (static files) | soon |
| | +
| 47 | +## 🛠️ Setup AgentLab
41 | 48 |
|
42 | 49 | ```bash |
43 | | -git clone [email protected]:ServiceNow/AgentLab.git |
44 | | -pip install -e . |
| 50 | +pip install agentlab |
45 | 51 | ``` |
46 | 52 |
|
47 | | -## Set Environment Variables |
| 53 | +Make sure to prepare the required benchmark according to the instructions provided in the [setup
| 54 | +column](#🎯-supported-benchmarks).
48 | 55 |
|
49 | 56 | ```bash |
50 | 57 | export AGENTLAB_EXP_ROOT=<root directory of experiment results> # defaults to $HOME/agentlab_results |
51 | 58 | export OPENAI_API_KEY=<your openai api key> # if openai models are used |
52 | | -export HUGGINGFACEHUB_API_TOKEN=<your huggingfacehub api token> # if huggingface models are used |
53 | | -``` |
54 | | - |
55 | | -## Use an assistant to work for you (at your own cost and risk) |
56 | | -```bash |
57 | | -agentlab-assistant --start_url https://www.google.com |
58 | 59 | ``` |
59 | 60 |
|
60 | | -## Prepare Benchmarks |
61 | | -Depending on which benchmark you use, there are some prerequisites |
62 | | - |
63 | 61 | <details> |
64 | | -<summary>MiniWoB</summary> |
| 62 | +<summary>Setup OpenRouter API</summary> |
65 | 63 |
|
66 | 64 | ```bash |
67 | | -export MINIWOB_URL="file://$HOME/dev/miniwob-plusplus/miniwob/html/miniwob/" |
| 65 | +export OPENROUTER_API_KEY=<your openrouter api key> # if openrouter models are used |
68 | 66 | ``` |
69 | 67 | </details> |
70 | 68 |
|
71 | 69 | <details> |
| 70 | +<summary>Setup Azure API</summary> |
72 | 71 |
|
73 | | -<summary>WorkArena</summary> |
74 | | - |
75 | | -See [detailed instructions on workarena github](https://github.com/ServiceNow/WorkArena?tab=readme-ov-file#getting-started) |
76 | | - |
77 | | -At a glance: |
78 | | -1) [Sign in](https://developer.servicenow.com/) and reqeuest a `washington` instance. |
79 | | -2) Once the instance is ready, you should see `<your instance URL>` and `<your-instance-password>` |
80 | | -3) Add these to your `.bashrc` (or `.zshrc`) and `source` it (note: make sure that |
81 | | - all variables are in single quotes unless you happen to have a password with a |
82 | | - single quote in it) |
83 | | - ```bash |
84 | | - export SNOW_INSTANCE_URL='https://<your-instance-number>.service-now.com/' |
85 | | - export SNOW_INSTANCE_UNAME='admin' |
86 | | - export SNOW_INSTANCE_PWD='<your-instance-password>' |
87 | | - ``` |
88 | | -4) finally run these commands: |
89 | | - |
90 | | - ```bash |
91 | | - pip install browsergym-workarena |
92 | | - playwright install |
93 | | - workarena-install |
94 | | - ``` |
95 | | - |
96 | | - |
| 72 | +```bash |
| 73 | +export AZURE_OPENAI_API_KEY=<your azure api key> # if using azure models |
| 74 | +export AZURE_OPENAI_ENDPOINT=<your endpoint> # if using azure models |
| 75 | +``` |
97 | 76 | </details> |
98 | 77 |
|
99 | | -<details> |
100 | | -<summary>WebArena on AWS</summary> |
101 | | -TODO |
102 | | -</details> |
| 78 | +## UI-Assistant |
| 79 | +Use an assistant to work for you (at your own cost and risk). |
103 | 80 |
|
104 | | -<details> |
105 | | -<summary>WebArena on Azure</summary> |
106 | | -TODO |
107 | | -</details> |
| 81 | +```bash |
| 82 | +agentlab-assistant --start_url https://www.google.com |
| 83 | +``` |
108 | 84 |
|
| 85 | +Try your own agent: |
109 | 86 |
|
| 87 | +```bash |
| 88 | +agentlab-assistant --agent_config="module.path.to.your.AgentArgs" |
| 89 | +``` |
| 90 | + |
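| | +The `--agent_config` flag takes the import path of an `AgentArgs` instance. A minimal sketch, in
| | +which the module path and agent name are hypothetical placeholders:
| | +
| | +```python
| | +# my_agents/config.py -- any module importable from your PYTHONPATH
| | +from agentlab.agents.generic_agent import AGENT_4o_MINI
| | +
| | +# Reuse a stock configuration under your own name; swap in your own AgentArgs object here
| | +MY_AGENT = AGENT_4o_MINI
| | +```
| | +
| | +Then launch it with `agentlab-assistant --agent_config="my_agents.config.MY_AGENT"`.
| | +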
| 91 | +## 🚀 Launch experiments |
110 | 92 |
|
| 93 | +```python |
| 94 | +# Import your agent configuration, which extends the bgym.AgentArgs class.
| 95 | +# Make sure this object is importable from a module on PYTHONPATH so it can be unpickled properly.
| 96 | +from agentlab.agents.generic_agent import AGENT_4o_MINI |
111 | 97 |
|
| 98 | +from agentlab.experiments.study import make_study |
112 | 99 |
|
113 | | -## Launch experiments |
| 100 | +study = make_study(
| 101 | +    benchmark="miniwob",  # or "webarena", "workarena_l1" ...
| 102 | +    agent_args=[AGENT_4o_MINI],
| 103 | +    comment="My first study",
| 104 | +)
114 | 105 |
|
115 | | -Create your agent or import an existing one: |
116 | | -```python |
117 | | -from agentlab.agents.generic_agent.agent_configs import AGENT_4o |
| 106 | +study.run(n_jobs=5) |
118 | 107 | ``` |
119 | 108 |
|
120 | | -Run the agent on a benchmark: |
| 109 | +Relaunch incomplete or errored tasks:
| 110 | + |
121 | 111 | ```python |
122 | | -study_name, exp_args_list = run_agents_on_benchmark(AGENT_4o, benchmark) |
123 | | -study_dir = make_study_dir(RESULTS_DIR, study_name) |
124 | | -run_experiments(n_jobs, exp_args_list, study_dir) |
| 112 | +from agentlab.experiments.study import Study |
| 113 | +study = Study.load("/path/to/your/study/dir") |
| 114 | +study.find_incomplete(include_errors=True) |
| 115 | +study.run() |
125 | 116 | ``` |
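| | +
| | +Once jobs have finished, you can load the results into a pandas DataFrame. A sketch, assuming the
| | +`inspect_results` helpers and the `cum_reward` column of recent AgentLab versions:
| | +
| | +```python
| | +from agentlab.analyze import inspect_results
| | +
| | +# one row per experiment (agent x task x seed)
| | +result_df = inspect_results.load_result_df("/path/to/your/study/dir")
| | +print(result_df["cum_reward"].mean())  # average reward over the study
| | +```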
126 | 117 |
|
127 | | -use [main.py](main.py) to launch experiments with a variety |
128 | | -of options. This is like a lazy CLI that is actually more convenient than a CLI. |
129 | | -Just comment and uncomment the lines you need or modify at will (but don't push |
130 | | -to the repo). |
131 | | -
|
132 | | -<details> |
| 118 | +See [main.py](main.py) to launch experiments with a variety of options. This is like a lazy CLI
| 119 | +that is often more convenient than a real one. Just comment and uncomment the lines you need, or
| 120 | +modify at will (but don't push to the repo).
133 | 121 |
|
134 | | -<summary>Debugging</summary> |
135 | 122 |
|
136 | | -For debugging, run experiments using `n_jobs=1` and use VSCode debug mode. This |
137 | | -will allow you to stop on breakpoints. To prevent the debugger from stopping |
138 | | -on errors when running multiple experiments directly in VSCode, set |
139 | | -`enable_debug = False` in `ExpArgs` |
140 | | -</details> |
| 123 | +### Job Timeouts |
141 | 124 |
|
| 125 | +The complexity of the wild web, Playwright, and asyncio can sometimes cause jobs to hang. A hung
| 126 | +job ties up its worker until the study is terminated and relaunched. If you are running jobs
| 127 | +sequentially or with a small number of workers, this could halt your entire study until you
| 128 | +manually kill and relaunch it. In the Ray parallel backend, we've implemented a system that
| 129 | +automatically terminates jobs exceeding a specified timeout. This feature is particularly useful
| 130 | +when hanging tasks would otherwise limit your experiments.
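| | +
| | +The underlying pattern looks roughly like the following. This is an illustrative sketch built on
| | +plain Ray primitives, not AgentLab's exact implementation:
| | +
| | +```python
| | +import time
| | +
| | +import ray
| | +
| | +ray.init()
| | +
| | +@ray.remote
| | +def run_job(seconds: float) -> str:
| | +    time.sleep(seconds)  # stand-in for one agent running one task
| | +    return "done"
| | +
| | +ref = run_job.remote(3600)  # a job that would otherwise hang for an hour
| | +try:
| | +    result = ray.get(ref, timeout=60)  # wait at most one minute
| | +except ray.exceptions.GetTimeoutError:
| | +    ray.cancel(ref, force=True)  # kill the hung job so its worker is freed
| | +    result = None
| | +```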
142 | 131 |
|
| 132 | +### Debugging |
143 | 133 |
|
| 134 | +For debugging, run experiments with `n_jobs=1` and use VSCode's debug mode. This allows you to pause |
| 135 | +execution at breakpoints. To prevent the debugger from stopping on errors while running multiple |
| 136 | +experiments in VSCode, set `enable_debug = False` in `ExpArgs`. |
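| | +
| | +A sketch of where `enable_debug` lives, assuming `Study` exposes its list of `ExpArgs` as
| | +`exp_args_list`:
| | +
| | +```python
| | +from agentlab.agents.generic_agent import AGENT_4o_MINI
| | +from agentlab.experiments.study import make_study
| | +
| | +study = make_study(benchmark="miniwob", agent_args=[AGENT_4o_MINI], comment="debug")
| | +
| | +# when running many experiments under the VSCode debugger,
| | +# stop it from breaking on every task error
| | +for exp_args in study.exp_args_list:
| | +    exp_args.enable_debug = False
| | +
| | +study.run(n_jobs=1)  # single process, so breakpoints behave
| | +```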
144 | 137 |
|
| 138 | +### About Parallel Jobs |
145 | 139 |
|
146 | | -<details> |
| 140 | +Running one agent on one task corresponds to a single job. Conducting ablation studies or random |
| 141 | +searches across hundreds of tasks with multiple seeds can generate more than 10,000 jobs. Efficient |
| 142 | +parallel execution is therefore critical. Agents typically wait for responses from the LLM server or |
| 143 | +updates from the web server. As a result, you can run 10–50 jobs in parallel on a single computer, |
| 144 | +depending on available RAM. |
147 | 145 |
|
148 | | -<summary>Parallel jobs</summary> |
| 146 | +⚠️ **Note for (Visual)WebArena**: These benchmarks have task dependencies designed to minimize |
| 147 | +"corrupting" the instance between tasks. For example, an agent on task 323 could alter the instance |
| 148 | +state, making task 201 impossible. To address this, the Ray backend accounts for task dependencies, |
| 149 | +enabling some degree of parallelism. On WebArena, you can disable dependencies to increase |
| 150 | +parallelism, but this might reduce performance by 1–2%. |
149 | 151 |
|
150 | | -Running one agent on one task correspond to one job. When conducting ablation |
151 | | -studies or random searches on hundreds of tasks with multiple seeds, this can |
152 | | -lead to more than 10000 jobs. It is thus crucial to execute them in parallel. |
153 | | -The agent usually wait on the LLM server to return the results or the web server |
154 | | -to update the page. Hence, you can run 10-50 jobs in parallel on a single |
155 | | -computer depending on how much RAM is available. |
| 152 | +⚠️ **Instance Reset for (Visual)WebArena**: Before evaluating an agent, the instance is |
| 153 | +automatically reset, a process that takes about 5 minutes. When evaluating multiple agents, the |
| 154 | +`make_study` function returns a `SequentialStudies` object to ensure proper sequential evaluation of |
| 155 | +each agent. AgentLab currently does not support evaluations across multiple instances, but you could |
| 156 | +either create a quick script to handle this or submit a PR to AgentLab. For a smoother parallel |
| 157 | +experience, consider using benchmarks like WorkArena instead. |
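| | +
| | +For example, evaluating two agents on WebArena (a sketch; substitute your own agent configs):
| | +
| | +```python
| | +from agentlab.agents.generic_agent import AGENT_4o, AGENT_4o_MINI
| | +from agentlab.experiments.study import make_study
| | +
| | +# with several agents on (Visual)WebArena, make_study returns a
| | +# SequentialStudies object: each agent runs after a full instance reset
| | +study = make_study(
| | +    benchmark="webarena",
| | +    agent_args=[AGENT_4o, AGENT_4o_MINI],
| | +    comment="two agents, evaluated sequentially",
| | +)
| | +study.run(n_jobs=4)
| | +```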
156 | 158 |
|
157 | | -</details> |
158 | 159 |
|
159 | | -## AgentXray |
| 160 | +## 🔍 AgentXray |
160 | 161 | While your experiments are running, you can inspect the results using: |
161 | 162 |
|
162 | 163 | ```bash |
163 | 164 | agentlab-xray |
164 | 165 | ``` |
165 | 166 |
|
166 | | -<a href="https://github.com/user-attachments/assets/20a91e7b-94ef-423d-9091-743eebb4733d"> |
167 | | - <img src="https://github.com/user-attachments/assets/20a91e7b-94ef-423d-9091-743eebb4733d" width="250" /> |
168 | | -</a> |
169 | 167 |
|
170 | | -You will be able to select the recent experiments in the directory |
171 | | -`AGENTLAB_EXP_ROOT` and visualize the results in a gradio interface. |
| 168 | +<video controls style="max-width: 800px;"> |
| 169 | + <source src="https://github.com/user-attachments/assets/06c4dac0-b78f-45b7-9405-003da4af6b37" type="video/mp4"> |
| 170 | + Your browser does not support the video tag. |
| 171 | +</video> |
| 172 | + |
| 173 | + |
| 174 | +You will be able to select recent experiments in the directory `AGENTLAB_EXP_ROOT` and visualize
| 175 | +the results in a Gradio interface.
172 | 176 |
|
173 | 177 | In the following order, select: |
174 | 178 | * The experiment you want to visualize |
175 | 179 | * The agent if there is more than one |
176 | 180 | * The task |
177 | 181 | * And the seed |
178 | 182 |
|
179 | | -Once this is selected, you can see the trace of your agent on the given task. |
180 | | -Click on the profiling image to select a step and observe the action taken by the agent. |
| 183 | +Once this is selected, you can see the trace of your agent on the given task. Click on the profiling |
| 184 | +image to select a step and observe the action taken by the agent. |
181 | 185 |
|
182 | 186 | ## Implement a new Agent |
183 | 187 |
|
184 | | -Get inspiration from the `MostBasicAgent` in [agentlab/agents/most_basic_agent/most_basic_agent.py](src/agentlab/agents/most_basic_agent/most_basic_agent.py) |
| 188 | +Get inspiration from the `MostBasicAgent` in |
| 189 | +[agentlab/agents/most_basic_agent/most_basic_agent.py](src/agentlab/agents/most_basic_agent/most_basic_agent.py). |
| 190 | +For better integration with the tools, make sure to implement most functions of the
| 191 | +[AgentArgs](src/agentlab/agents/agent_args.py#L5) API, which extends `bgym.AbstractAgentArgs`.
| 192 | + |
| 193 | +If you think your agent should be included directly in AgentLab, let us know; it can be added in
| 194 | +agentlab/agents/ under the name of your agent.
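| | +
| | +Below is a rough skeleton modeled on `MostBasicAgent`. Treat it as a sketch: the exact `bgym`
| | +class names, `AgentInfo` fields, and action-set options may differ across versions:
| | +
| | +```python
| | +from dataclasses import dataclass
| | +
| | +import bgym
| | +
| | +from agentlab.agents.agent_args import AgentArgs
| | +
| | +
| | +class EchoAgent(bgym.Agent):
| | +    """Toy agent that always sends the same message to the user."""
| | +
| | +    def __init__(self, model_name: str):
| | +        self.model_name = model_name
| | +        # restrict the action space to chat actions for this toy example
| | +        self.action_set = bgym.HighLevelActionSet(subsets=["chat"], strict=False)
| | +
| | +    def get_action(self, obs: dict):
| | +        # a real agent would build a prompt from `obs` and query an LLM here
| | +        action = 'send_msg_to_user("hello from EchoAgent")'
| | +        return action, bgym.AgentInfo()  # AgentInfo can carry stats, traces, etc.
| | +
| | +
| | +@dataclass
| | +class EchoAgentArgs(AgentArgs):
| | +    agent_name: str = "EchoAgent"
| | +    model_name: str = "gpt-4o-mini"
| | +
| | +    def make_agent(self) -> bgym.Agent:
| | +        return EchoAgent(self.model_name)
| | +```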
| 195 | + |
| 196 | +## ↻ Reproducibility |
| 197 | +Several factors can influence the reproducibility of results when evaluating agents on dynamic
| 198 | +benchmarks.
| 199 | + |
| 200 | +### Factors affecting reproducibility
| 201 | +* **Software version**: A different version of Playwright, or of any package in the software
| 202 | +  stack, could influence the behavior of the benchmark or the agent.
| 203 | +* **API-based LLMs silently changing**: Even for a fixed version, an LLM may be updated, e.g., to
| 204 | +  incorporate the latest web knowledge.
| 205 | +* **Live websites**:
| 206 | +  * WorkArena: The demo instance is mostly pinned to a specific version, but ServiceNow sometimes
| 207 | +    pushes minor modifications.
| 208 | +  * AssistantBench and GAIA: These rely on the agent navigating the open web. The experience may
| 209 | +    change depending on your country or region, and some websites may default to different
| 210 | +    languages.
| 211 | +* **Stochastic agents**: Setting the LLM's temperature to 0 removes most of the stochasticity.
| 212 | +* **Non-deterministic tasks**: For a fixed seed, variations should be minimal.
| 213 | + |
| 214 | +### Reproducibility Features |
| 215 | +* `Study` contains a dict of reproducibility information, including the benchmark version,
| 216 | +  package versions, and commit hash.
| 217 | +* The `Study` class allows automatic upload of your results to
| 218 | +  [`reproducibility_journal.csv`](reproducibility_journal.csv). This makes it easier to populate a
| 219 | +  large number of reference points (see the sketch after this list).
| 220 | +* **Reproduced results in the leaderboard**: For agents that are reproducible, we encourage users
| 221 | +  to try to reproduce the results and upload them to the leaderboard. There is a special column
| 222 | +  containing information about all reproduced results of an agent on a benchmark.
| 223 | +* **ReproducibilityAgent**: You can run this agent on an existing study, and it will try to re-run
| 224 | +  the same actions on the same task seeds. A visual diff of the two prompts will be displayed in
| 225 | +  the AgentInfo HTML tab of AgentXray, so you can inspect, on some tasks, what changed between the
| 226 | +  two executions. **Note**: this is a beta feature and will need some adaptation for your own
| 227 | +  agent.
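| | +
| | +For instance, to append your results to the journal after a study completes (a sketch, assuming
| | +the `append_to_journal` method of the `Study` class; check the source for the exact name):
| | +
| | +```python
| | +from agentlab.experiments.study import Study
| | +
| | +study = Study.load("/path/to/your/study/dir")
| | +study.append_to_journal()  # adds a summary row to reproducibility_journal.csv
| | +```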
185 | 228 |
|
186 | | -Create a new directory in agentlab/agents/ with the name of your agent. |
187 | 229 |
|
188 | 230 | ## Misc |
189 | 231 |
|
|