<a href="https://github.com/user-attachments/assets/c2bc0b80-89da-4afb-9120-2feb018df19d"> <img
  src="https://github.com/user-attachments/assets/c2bc0b80-89da-4afb-9120-2feb018df19d" width="800"
/> </a>

|
[🎯 Benchmarks](#🎯-supported-benchmarks) |
[🛠️ Setup](#🛠️-setup-agentlab) |
[🤖 Assistant](#ui-assistant) |
[🚀 Launch Experiments](#🚀-launch-experiments) |
[🔍 AgentXray](#🔍-agentxray) |
[🤖 Make Your Own Agent](#implement-a-new-agent) |
[↻ Reproducibility](#↻-reproducibility) |

<video controls style="max-width: 800px;">
  <source src="https://github.com/ServiceNow/BrowserGym/assets/26232819/e0bfc788-cc8e-44f1-b8c3-0d1114108b85" type="video/mp4">
  Your browser does not support the video tag.
</video>

AgentLab is a framework for developing and evaluating agents on a variety of
[benchmarks](#🎯-supported-benchmarks) supported by
[BrowserGym](https://github.com/ServiceNow/BrowserGym).

AgentLab Features:
* Easy large-scale parallel agent experiments using [ray](https://www.ray.io/)
* Building blocks for making agents
* Unified LLM API for OpenRouter, OpenAI, Azure, or self-hosted models using TGI
* Preferred way to run benchmarks like WebArena
* Various reproducibility features
* Unified leaderboard (soon)

## 🎯 Supported Benchmarks

| Benchmark | Setup <br> Link | # Task <br> Templates | Seed <br> Diversity | Max <br> Steps | Multi-tab | Hosting Method | BrowserGym <br> Leaderboard |
|-----------|------------|---------|----------------|-----------|-----------|---------------|----------------------|
| [WebArena](https://webarena.dev/) | [setup](https://github.com/ServiceNow/BrowserGym/blob/main/browsergym/webarena/README.md) | 812 | None | 30 | yes | self-hosted (docker) | soon |
| [WorkArena](https://github.com/ServiceNow/WorkArena) L1 | [setup](https://github.com/ServiceNow/WorkArena?tab=readme-ov-file#getting-started) | 33 | High | 30 | no | demo instance | soon |
| [WorkArena](https://github.com/ServiceNow/WorkArena) L2 | [setup](https://github.com/ServiceNow/WorkArena?tab=readme-ov-file#getting-started) | 341 | High | 50 | no | demo instance | soon |
| [WorkArena](https://github.com/ServiceNow/WorkArena) L3 | [setup](https://github.com/ServiceNow/WorkArena?tab=readme-ov-file#getting-started) | 341 | High | 50 | no | demo instance | soon |
| [WebLinx](https://mcgill-nlp.github.io/weblinx/) | - | 31586 | None | 1 | no | self-hosted (dataset) | soon |
| [VisualWebArena](https://github.com/web-arena-x/visualwebarena) | [setup](https://github.com/ServiceNow/BrowserGym/blob/main/browsergym/visualwebarena/README.md) | 910 | None | 30 | yes | self-hosted (docker) | soon |
| [AssistantBench](https://assistantbench.github.io/) | [setup](https://github.com/ServiceNow/BrowserGym/blob/main/browsergym/assistantbench/README.md) | 214 | None | 30 | yes | live web | soon |
| [GAIA](https://huggingface.co/spaces/gaia-benchmark/leaderboard) (soon) | - | - | None | - | - | live web | soon |
| [Mind2Web-live](https://huggingface.co/datasets/iMeanAI/Mind2Web-Live) (soon) | - | - | None | - | - | live web | soon |
| [MiniWoB](https://miniwob.farama.org/index.html) | [setup](https://github.com/ServiceNow/BrowserGym/blob/main/browsergym/miniwob/README.md) | 125 | Medium | 10 | no | self-hosted (static files) | soon |

## 🛠️ Setup AgentLab

```bash
pip install agentlab
```

Make sure to prepare the required benchmark according to the instructions provided in the
[setup column](#🎯-supported-benchmarks).

Set the following environment variables:

```bash
export AGENTLAB_EXP_ROOT=<root directory of experiment results> # defaults to $HOME/agentlab_results
export OPENAI_API_KEY=<your openai api key> # if OpenAI models are used
```

<details>
<summary>Setup OpenRouter API</summary>

```bash
export OPENROUTER_API_KEY=<your openrouter api key> # if OpenRouter models are used
```
</details>
|
70 | 68 | <details> |
| 69 | +<summary>Setup Azure API</summary> |
71 | 70 |
|
72 | | -<summary>WorkArena</summary> |
73 | | - |
74 | | -See [detailed instructions on workarena github](https://github.com/ServiceNow/WorkArena?tab=readme-ov-file#getting-started) |
75 | | - |
76 | | -At a glance: |
77 | | -1) [Sign in](https://developer.servicenow.com/) and reqeuest a `washington` instance. |
78 | | -2) Once the instance is ready, you should see `<your instance URL>` and `<your-instance-password>` |
79 | | -3) Add these to your `.bashrc` (or `.zshrc`) and `source` it (note: make sure that |
80 | | - all variables are in single quotes unless you happen to have a password with a |
81 | | - single quote in it) |
82 | | - ```bash |
83 | | - export SNOW_INSTANCE_URL='https://<your-instance-number>.service-now.com/' |
84 | | - export SNOW_INSTANCE_UNAME='admin' |
85 | | - export SNOW_INSTANCE_PWD='<your-instance-password>' |
86 | | - ``` |
87 | | -4) finally run these commands: |
88 | | - |
89 | | - ```bash |
90 | | - pip install browsergym-workarena |
91 | | - playwright install |
92 | | - workarena-install |
93 | | - ``` |
94 | | - |
95 | | - |
| 71 | +```bash |
| 72 | +export AZURE_OPENAI_API_KEY=<your azure api key> # if using azure models |
| 73 | +export AZURE_OPENAI_ENDPOINT=<your endpoint> # if using azure models |
| 74 | +``` |
96 | 75 | </details> |
97 | 76 |

## UI-Assistant
Use an assistant to work for you (at your own cost and risk).

```bash
agentlab-assistant --start_url https://www.google.com
```

Try your own agent:

```bash
agentlab-assistant --agent_config="module.path.to.your.AgentArgs"
```

## 🚀 Launch experiments

```python
# Import your agent configuration, which extends the bgym.AgentArgs class.
# Make sure this object is imported from a module accessible in PYTHONPATH so it unpickles properly.
from agentlab.agents.generic_agent import AGENT_4o_MINI

from agentlab.experiments.study import make_study

study = make_study(
    benchmark="miniwob",  # or "webarena", "workarena_l1" ...
    agent_args=[AGENT_4o_MINI],
    comment="My first study",
)

study.run(n_jobs=5)
```

Relaunch incomplete or errored tasks:

```python
from agentlab.experiments.study import Study

study = Study.load("/path/to/your/study/dir")
study.find_incomplete(include_errors=True)
study.run()
```

See [main.py](main.py) to launch experiments with a variety of options. It works like a lazy CLI
that is often more convenient: just comment or uncomment the lines you need, or modify them at will
(but don't push your changes to the repo).

### Job Timeouts

The complexity of the wild web, Playwright, and asyncio can sometimes cause jobs to hang. A hung job
ties up its worker until the study is terminated and relaunched. If you are running jobs
sequentially or with a small number of workers, this could halt your entire study until you manually
kill and relaunch it. For the Ray parallel backend, we've implemented a mechanism that automatically
terminates jobs exceeding a specified timeout. This is particularly useful when hanging tasks limit
your experiments.
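
The sketch below illustrates the underlying idea with plain Ray; it is not AgentLab's actual
implementation, and the timeout value and task body are only illustrative:

```python
import time

import ray

ray.init()

@ray.remote
def run_job(job_id: int) -> int:
    # Stand-in for one agent-on-task job; job 1 simulates a hang.
    time.sleep(3600 if job_id == 1 else 1)
    return job_id

refs = [run_job.remote(i) for i in range(3)]
# Wait up to 60 seconds, then force-kill whatever is still running.
done, hung = ray.wait(refs, num_returns=len(refs), timeout=60)
for ref in hung:
    ray.cancel(ref, force=True)
print("completed:", ray.get(done))
```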

### Debugging

For debugging, run experiments with `n_jobs=1` and use VSCode's debug mode. This allows you to pause
execution at breakpoints. To prevent the debugger from stopping on errors while running multiple
experiments in VSCode, set `enable_debug = False` in `ExpArgs`.
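
For example, a debug session could be set up as below (a sketch; it assumes the study exposes its
`ExpArgs` objects as `exp_args_list`, so adjust to your AgentLab version):

```python
from agentlab.agents.generic_agent import AGENT_4o_MINI
from agentlab.experiments.study import make_study

study = make_study(benchmark="miniwob", agent_args=[AGENT_4o_MINI], comment="debug run")

# When running many experiments inside VSCode, flip enable_debug off so the
# debugger doesn't stop on every task error (assumed attribute; see ExpArgs).
# for exp_args in study.exp_args_list:
#     exp_args.enable_debug = False

study.run(n_jobs=1)  # a single sequential job, so breakpoints behave as usual
```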

### About Parallel Jobs

Running one agent on one task corresponds to a single job. Conducting ablation studies or random
searches across hundreds of tasks with multiple seeds can generate more than 10,000 jobs. Efficient
parallel execution is therefore critical. Agents typically wait for responses from the LLM server or
updates from the web server. As a result, you can run 10–50 jobs in parallel on a single computer,
depending on available RAM.

⚠️ **Note for (Visual)WebArena**: These benchmarks have task dependencies designed to minimize
"corrupting" the instance between tasks. For example, an agent on task 323 could alter the instance
state, making task 201 impossible. To address this, the Ray backend accounts for task dependencies,
enabling some degree of parallelism. On WebArena, you can disable dependencies to increase
parallelism, but this might reduce performance by 1–2%.

⚠️ **Instance Reset for (Visual)WebArena**: Before evaluating an agent, the instance is
automatically reset, a process that takes about 5 minutes. When evaluating multiple agents, the
`make_study` function returns a `SequentialStudies` object to ensure proper sequential evaluation of
each agent. AgentLab currently does not support evaluations across multiple instances, but you could
either create a quick script to handle this or submit a PR to AgentLab. For a smoother parallel
experience, consider using benchmarks like WorkArena instead.
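
For instance, comparing two agents on WebArena might look like the sketch below (the specific agent
configs are placeholders; per the note above, `make_study` returns a `SequentialStudies` object when
several agents are evaluated):

```python
from agentlab.agents.generic_agent import AGENT_4o, AGENT_4o_MINI
from agentlab.experiments.study import make_study

# Each agent is evaluated in sequence, with an instance reset in between.
study = make_study(
    benchmark="webarena",
    agent_args=[AGENT_4o, AGENT_4o_MINI],
    comment="compare two agents",
)
study.run(n_jobs=4)
```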

## 🔍 AgentXray
While your experiments are running, you can inspect the results using:

```bash
agentlab-xray
```

<video controls style="max-width: 800px;">
  <source src="https://github.com/user-attachments/assets/06c4dac0-b78f-45b7-9405-003da4af6b37" type="video/mp4">
  Your browser does not support the video tag.
</video>

You will be able to select the recent experiments in the directory `AGENTLAB_EXP_ROOT` and visualize
the results in a Gradio interface.

In the following order, select:
* The experiment you want to visualize
* The agent, if there is more than one
* The task
* And the seed

Once selected, you can see the trace of your agent on the given task. Click on the profiling image
to select a step and observe the action taken by the agent.

## Implement a new Agent

Get inspiration from the `MostBasicAgent` in
[agentlab/agents/most_basic_agent/most_basic_agent.py](src/agentlab/agents/most_basic_agent/most_basic_agent.py).
For better integration with the tools, make sure to implement most functions of the
[AgentArgs](src/agentlab/agents/agent_args.py#L5) API and the extended `bgym.AbstractAgentArgs`.
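
A rough skeleton might look like this (a sketch only; the class and method signatures are
assumptions based on `MostBasicAgent`, so check that file for the exact API):

```python
from dataclasses import dataclass

import bgym
from agentlab.agents.agent_args import AgentArgs


class MyAgent(bgym.Agent):
    def __init__(self, model_name: str):
        self.model_name = model_name

    def get_action(self, obs: dict):
        # Assumed contract: map an observation to an action string plus agent info.
        # A real agent would build a prompt from obs and query an LLM here.
        return "noop()", bgym.AgentInfo()


@dataclass
class MyAgentArgs(AgentArgs):
    agent_name: str = "MyAgent"
    model_name: str = "gpt-4o-mini"

    def make_agent(self) -> MyAgent:
        return MyAgent(model_name=self.model_name)
```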

If you think your agent should be included directly in AgentLab, let us know and it can be added
under agentlab/agents/ with the name of your agent.

## ↻ Reproducibility
Several factors can influence the reproducibility of results when evaluating agents on dynamic
benchmarks.

### Factors affecting reproducibility
* **Software versions**: Different versions of Playwright, or of any package in the software stack,
  could influence the behavior of the benchmark or the agent.
* **API-based LLMs silently changing**: Even for a fixed version, an LLM may be updated, e.g. to
  incorporate the latest web knowledge.
* **Live websites**:
  * WorkArena: The demo instance is mostly fixed in time to a specific version, but ServiceNow
    sometimes pushes minor modifications.
  * AssistantBench and GAIA: These rely on the agent navigating the open web. The experience may
    change depending on your country or region, and some websites might default to different
    languages.
* **Stochastic agents**: Setting the temperature of the LLM to 0 can remove most stochasticity; see
  the sketch after this list.
* **Non-deterministic tasks**: For a fixed seed, the changes should be minimal.
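
For example, pinning an agent's temperature to 0 could look like this (a sketch; the
`chat_model_args.temperature` path is an assumption about the generic agent's config and may differ
in your version):

```python
from copy import deepcopy

from agentlab.agents.generic_agent import AGENT_4o_MINI

agent = deepcopy(AGENT_4o_MINI)  # avoid mutating the shared config
# Assumed attribute path; inspect your agent config for the exact field.
agent.chat_model_args.temperature = 0.0
```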

### Reproducibility Features
* `Study` contains a dict of reproducibility information, including the benchmark version, package
  versions, and commit hash.
* The `Study` class allows automatic upload of your results to
  [`reproducibility_journal.csv`](reproducibility_journal.csv). This makes it easier to populate a
  large number of reference points (see the sketch after this list).
* **Reproduced results in the leaderboard**: For agents that are reproducible, we encourage users to
  try to reproduce the results and upload them to the leaderboard. A special column contains
  information about all reproduced results of an agent on a benchmark.
* **ReproducibilityAgent**: You can run this agent on an existing study, and it will try to re-run
  the same actions on the same task seeds. A visual diff of the two prompts will be displayed in the
  AgentInfo HTML tab of AgentXray, letting you inspect, for some tasks, what changed between the two
  executions. **Note**: this is a beta feature and will need some adaptation for your own agent.
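
Uploading results to the journal might look like the following (a sketch; `append_to_journal` is an
assumed helper name on `Study`, so verify it in your version):

```python
from agentlab.experiments.study import Study

study = Study.load("/path/to/your/study/dir")
# Assumed helper; records benchmark/package versions alongside the results.
study.append_to_journal()
```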

## Misc