|
<a href="https://github.com/user-attachments/assets/c2bc0b80-89da-4afb-9120-2feb018df19d">
  <img src="https://github.com/user-attachments/assets/c2bc0b80-89da-4afb-9120-2feb018df19d" width="800" />
</a>

[🎯 Benchmarks](#🎯-supported-benchmarks) |
[🛠️ Setup](#🛠️-setup-agentlab) |
[🤖 Assistant](#ui-assistant) |
[🚀 Launch Experiments](#🚀-launch-experiments) |
[🔍 Analyse Results](#🔍-analyse-results) |
[🤖 Make Your Own Agent](#implement-a-new-agent) |
[↻ Reproducibility](#↻-reproducibility)
|
[![License: Apache 2.0](https://img.shields.io/badge/License-Apache%202.0-green.svg)](http://www.apache.org/licenses/LICENSE-2.0)
[![PyPI - Downloads](https://img.shields.io/pypi/dm/agentlab)](https://pypistats.org/packages/agentlab)
[![GitHub star chart](https://img.shields.io/github/stars/ServiceNow/AgentLab?style=social)](https://star-history.com/#ServiceNow/AgentLab)
|
|
<video controls style="max-width: 700px;">
  <source src="https://github.com/ServiceNow/BrowserGym/assets/26232819/e0bfc788-cc8e-44f1-b8c3-0d1114108b85" type="video/mp4">
  Your browser does not support the video tag.
</video>
|
AgentLab is a framework for developing and evaluating agents on a variety of
[benchmarks](#🎯-supported-benchmarks) supported by
[BrowserGym](https://github.com/ServiceNow/BrowserGym).

AgentLab features:
* Easy large-scale parallel [agent experiments](#🚀-launch-experiments) using [Ray](https://www.ray.io/)
* Building blocks for making agents on top of BrowserGym
* A unified LLM API for OpenRouter, OpenAI, Azure, or self-hosted models using TGI
* The preferred way to run benchmarks like WebArena
* Various [reproducibility features](#reproducibility-features)
* A unified leaderboard (coming soon)

## 🎯 Supported Benchmarks

| Benchmark | Setup <br> Link | # Task <br> Templates | Seed <br> Diversity | Max <br> Steps | Multi-tab | Hosting Method | BrowserGym <br> Leaderboard |
|-----------|-----------------|-----------------------|---------------------|----------------|-----------|----------------|-----------------------------|
| [WebArena](https://webarena.dev/) | [setup](https://github.com/ServiceNow/BrowserGym/blob/main/browsergym/webarena/README.md) | 812 | None | 30 | yes | self-hosted (docker) | soon |
| [WorkArena](https://github.com/ServiceNow/WorkArena) L1 | [setup](https://github.com/ServiceNow/WorkArena?tab=readme-ov-file#getting-started) | 33 | High | 30 | no | demo instance | soon |
| [WorkArena](https://github.com/ServiceNow/WorkArena) L2 | [setup](https://github.com/ServiceNow/WorkArena?tab=readme-ov-file#getting-started) | 341 | High | 50 | no | demo instance | soon |
| [WorkArena](https://github.com/ServiceNow/WorkArena) L3 | [setup](https://github.com/ServiceNow/WorkArena?tab=readme-ov-file#getting-started) | 341 | High | 50 | no | demo instance | soon |
| [WebLinx](https://mcgill-nlp.github.io/weblinx/) | - | 31586 | None | 1 | no | self-hosted (dataset) | soon |
| [VisualWebArena](https://github.com/web-arena-x/visualwebarena) | [setup](https://github.com/ServiceNow/BrowserGym/blob/main/browsergym/visualwebarena/README.md) | 910 | None | 30 | yes | self-hosted (docker) | soon |
| [AssistantBench](https://assistantbench.github.io/) | [setup](https://github.com/ServiceNow/BrowserGym/blob/main/browsergym/assistantbench/README.md) | 214 | None | 30 | yes | live web | soon |
| [GAIA](https://huggingface.co/spaces/gaia-benchmark/leaderboard) (soon) | - | - | None | - | - | live web | soon |
| [Mind2Web-live](https://huggingface.co/datasets/iMeanAI/Mind2Web-Live) (soon) | - | - | None | - | - | live web | soon |
| [MiniWoB](https://miniwob.farama.org/index.html) | [setup](https://github.com/ServiceNow/BrowserGym/blob/main/browsergym/miniwob/README.md) | 125 | Medium | 10 | no | self-hosted (static files) | soon |

## 🛠️ Setup AgentLab
|
```bash
pip install agentlab
```
|
Make sure to prepare the required benchmark according to the instructions provided in the
[setup column](#🎯-supported-benchmarks) of the table above.
|
```bash
export AGENTLAB_EXP_ROOT=<root directory of experiment results>  # defaults to $HOME/agentlab_results
export OPENAI_API_KEY=<your openai api key>  # if OpenAI models are used
```
|
<details>
<summary>Setup OpenRouter API</summary>

```bash
export OPENROUTER_API_KEY=<your openrouter api key>  # if OpenRouter models are used
```
</details>
|
<details>
<summary>Setup Azure API</summary>

```bash
export AZURE_OPENAI_API_KEY=<your azure api key>  # if Azure models are used
export AZURE_OPENAI_ENDPOINT=<your endpoint>  # if Azure models are used
```
</details>
|
## UI-Assistant

Use an assistant to work for you (at your own cost and risk).
|
```bash
agentlab-assistant --start_url https://www.google.com
```
|
Try your own agent:
|
```bash
agentlab-assistant --agent_config="module.path.to.your.AgentArgs"
```
|
## 🚀 Launch Experiments
|
```python
# Import your agent configuration, which extends the bgym.AgentArgs class.
# Make sure this object is imported from a module accessible in PYTHONPATH so it can be unpickled.
from agentlab.agents.generic_agent import AGENT_4o_MINI

from agentlab.experiments.study import make_study

study = make_study(
    benchmark="miniwob",  # or "webarena", "workarena_l1", ...
    agent_args=[AGENT_4o_MINI],
    comment="My first study",
)

study.run(n_jobs=5)
```
|
Relaunching incomplete or errored tasks:
|
```python
from agentlab.experiments.study import Study

study = Study.load("/path/to/your/study/dir")
study.find_incomplete(include_errors=True)
study.run()
```
|
See [main.py](main.py) to launch experiments with a variety of options. It works like a lazy CLI
that is actually more convenient: just comment and uncomment the lines you need, or modify them at
will (but don't push to the repo).
|
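The pattern looks roughly like this (a hypothetical sketch, not the actual contents of
[main.py](main.py); the commented-out alternatives are illustrative):

```python
# Toggle the comments below to pick agents and a benchmark.
from agentlab.agents.generic_agent import AGENT_4o_MINI
from agentlab.experiments.study import make_study

agent_args = [AGENT_4o_MINI]
# agent_args = [AGENT_4o_MINI, OTHER_AGENT]  # uncomment to compare several agents

benchmark = "miniwob"
# benchmark = "workarena_l1"  # uncomment to switch benchmarks

study = make_study(benchmark=benchmark, agent_args=agent_args, comment="lazy-CLI style")
study.run(n_jobs=5)
```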
|
### Job Timeouts
|
The complexity of the wild web, Playwright, and asyncio can sometimes cause jobs to hang. A hung
job ties up its worker until the study is terminated and relaunched. If you are running jobs
sequentially or with a small number of workers, this could halt your entire study until you
manually kill and relaunch it. With the Ray parallel backend, we've implemented a system that
automatically terminates jobs exceeding a specified timeout. This feature is particularly useful
when hanging tasks would otherwise limit your experiments.
|
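The timeout handling is built into the Ray backend, so you don't need to implement it yourself.
Purely to illustrate the underlying idea, a minimal watchdog can be assembled from `ray.wait` and
`ray.cancel` (the `run_job` body below is a stand-in, not AgentLab code):

```python
import time

import ray

ray.init(num_cpus=2)

@ray.remote
def run_job(job_id: int, duration_s: float) -> int:
    time.sleep(duration_s)  # stand-in for running one agent on one task
    return job_id

JOB_TIMEOUT_S = 2.0
pending = [run_job.remote(0, 0.5), run_job.remote(1, 60.0)]  # job 1 "hangs"
deadline = time.monotonic() + JOB_TIMEOUT_S

while pending and time.monotonic() < deadline:
    done, pending = ray.wait(pending, timeout=0.1)
    for ref in done:
        print(f"job {ray.get(ref)} finished")

for ref in pending:  # anything still running past the deadline gets killed
    ray.cancel(ref, force=True)
    print("cancelled a job that exceeded the timeout")
```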
### Debugging
|
For debugging, run experiments with `n_jobs=1` and use VSCode's debug mode. This allows you to
pause execution at breakpoints.
|
### About Parallel Jobs
|
Running one agent on one task corresponds to a single job. Conducting ablation studies or random
searches across hundreds of tasks with multiple seeds can generate more than 10,000 jobs. Efficient
parallel execution is therefore critical. Agents typically wait for responses from the LLM server
or updates from the web server. As a result, you can run 10–50 jobs in parallel on a single
computer, depending on available RAM.
|
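For a sense of scale, a modest search already multiplies out quickly (the numbers below are
illustrative):

```python
# Back-of-the-envelope job count for a hypothetical random search.
n_agent_configs = 8  # e.g. sampled prompt/flag combinations
n_tasks = 341        # e.g. WorkArena L2 task templates
n_seeds = 4

n_jobs = n_agent_configs * n_tasks * n_seeds
print(n_jobs)  # 10912 -> far too many to run sequentially
```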
⚠️ **Note for (Visual)WebArena**: These benchmarks have task dependencies designed to minimize
"corrupting" the instance between tasks. For example, an agent on task 323 could alter the instance
state, making task 201 impossible. To address this, the Ray backend accounts for task dependencies,
enabling some degree of parallelism. On WebArena, you can disable dependencies to increase
parallelism, but this might reduce performance by 1–2%.
|
⚠️ **Instance Reset for (Visual)WebArena**: Before evaluating an agent, the instance is
automatically reset, a process that takes about 5 minutes. When evaluating multiple agents, the
`make_study` function returns a `SequentialStudies` object to ensure proper sequential evaluation of
each agent. AgentLab currently does not support evaluations across multiple instances, but you could
either create a quick script to handle this or submit a PR to AgentLab. For a smoother parallel
experience, consider using benchmarks like WorkArena instead.
|
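For example, evaluating two agent configurations on WebArena could look like the following
(assuming `AGENT_4o` is exported from `generic_agent` alongside `AGENT_4o_MINI`):

```python
from agentlab.agents.generic_agent import AGENT_4o, AGENT_4o_MINI
from agentlab.experiments.study import make_study

# With more than one agent on (Visual)WebArena, make_study should return a
# SequentialStudies object, so the instance is reset before each agent runs.
study = make_study(
    benchmark="webarena",
    agent_args=[AGENT_4o, AGENT_4o_MINI],
    comment="compare two agents with instance resets in between",
)
study.run(n_jobs=5)
```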
## 🔍 Analyse Results
|
### Loading Results

The class [`ExpResult`](https://github.com/ServiceNow/BrowserGym/blob/da26a5849d99d9a3169d7b1fde79f909c55c9ba7/browsergym/experiments/src/browsergym/experiments/loop.py#L595) provides a lazy loader for all the information of a specific experiment. You can use [`yield_all_exp_results`](https://github.com/ServiceNow/BrowserGym/blob/da26a5849d99d9a3169d7b1fde79f909c55c9ba7/browsergym/experiments/src/browsergym/experiments/loop.py#L872) to recursively find all results in a directory. Finally, [`load_result_df`](https://github.com/ServiceNow/AgentLab/blob/be1998c5fad5bda47ba50497ec3899aae03e85ec/src/agentlab/analyze/inspect_results.py#L119C5-L119C19) gathers all the summary information in a single dataframe. See [`inspect_results.ipynb`](src/agentlab/analyze/inspect_results.ipynb) for example usage.

```python
from agentlab.analyze import inspect_results

result_df = inspect_results.load_result_df("path/to/your/study")
```
|
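To drill into individual episodes rather than the summary dataframe, you can iterate over
`ExpResult` objects; something like the sketch below should work (the import path follows the
BrowserGym file linked above, and field names such as `cum_reward` are assumptions to adapt):

```python
from browsergym.experiments.loop import yield_all_exp_results

for exp_result in yield_all_exp_results("path/to/your/study"):
    # summary_info is loaded lazily from the experiment directory
    summary = exp_result.summary_info
    print(exp_result.exp_dir, summary.get("cum_reward"))
```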
|
### AgentXray

Inspect the behaviour of your agent using AgentXray. You can load previous or ongoing experiments.
The refresh mechanism is currently a bit clunky: to see an updated view of a running experiment,
refresh the page, refresh the experiment directory list, and select your experiment again.
|
```bash
agentlab-xray
```
|
**⚠️ Note**: Gradio is still in development, and unexpected behavior has frequently been noticed.
Version 5.5 seems to work properly so far. If you're not sure that the proper information is
displayed, refresh the page and select your experiment again.

<video controls style="max-width: 800px;">
  <source src="https://github.com/user-attachments/assets/06c4dac0-b78f-45b7-9405-003da4af6b37" type="video/mp4">
  Your browser does not support the video tag.
</video>
|
You will be able to select the recent experiments in the directory `AGENTLAB_EXP_ROOT` and
visualize the results in a gradio interface.
|
In the following order, select:
* The experiment you want to visualize
* The agent, if there is more than one
* The task
* And the seed
|
Once this is selected, you can see the trace of your agent on the given task. Click on the
profiling image to select a step and observe the action taken by the agent.
|
## Implement a new Agent
|
Get inspiration from the `MostBasicAgent` in
[agentlab/agents/most_basic_agent/most_basic_agent.py](src/agentlab/agents/most_basic_agent/most_basic_agent.py).
For better integration with the tools, make sure to implement most functions of the
[AgentArgs](src/agentlab/agents/agent_args.py#L5) API, which extends `bgym.AbstractAgentArgs`.

If you think your agent should be included directly in AgentLab, let us know and it can be added
in agentlab/agents/ with the name of your agent.
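
A minimal sketch of the shape involved, assuming `bgym` exposes the `Agent` and `AgentInfo` base
classes from BrowserGym (treat names and signatures as indicative, not exact):

```python
from dataclasses import dataclass

import bgym
from agentlab.agents.agent_args import AgentArgs


class EchoAgent(bgym.Agent):
    """Toy agent: always plays a no-op; a real agent would query an LLM here."""

    def get_action(self, obs: dict):
        action = "noop()"  # valid in BrowserGym's high-level action set
        return action, bgym.AgentInfo(think="nothing to do")


@dataclass
class EchoAgentArgs(AgentArgs):
    agent_name: str = "EchoAgent"

    def make_agent(self) -> bgym.Agent:
        return EchoAgent()
```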
## ↻ Reproducibility

Several factors can influence the reproducibility of results when evaluating agents on dynamic
benchmarks.

### Factors affecting reproducibility
* **Software versions**: Different versions of Playwright, or of any other package in the software
  stack, could influence the behavior of the benchmark or the agent.
* **API-based LLMs silently changing**: Even for a fixed version, an LLM may be updated, e.g. to
  incorporate the latest web knowledge.
* **Live websites**:
  * WorkArena: The demo instance is mostly fixed in time to a specific version, but ServiceNow
    sometimes pushes minor modifications.
  * AssistantBench and GAIA: These rely on the agent navigating the open web. The experience may
    change depending on your country or region, and some websites might be in different languages
    by default.
* **Stochastic agents**: Setting the temperature of the LLM to 0 can remove most of the
  stochasticity.
* **Non-deterministic tasks**: For a fixed seed, the changes should be minimal.

### Reproducibility Features
* `Study` contains a dict of information about reproducibility, including the benchmark version,
  the package version, and the commit hash.
* The `Study` class allows automatic upload of your results to
  [`reproducibility_journal.csv`](reproducibility_journal.csv). This makes it easier to populate a
  large number of reference points.
* **Reproduced results in the leaderboard**: For agents that are reproducible, we encourage users
  to try to reproduce the results and upload them to the leaderboard. A special column contains
  information about all reproduced results of an agent on a benchmark.
* **ReproducibilityAgent**: You can run this agent on an existing study, and it will try to re-run
  the same actions on the same task seeds. A visual diff of the two prompts will be displayed in
  the AgentInfo HTML tab of AgentXray, so you can inspect, on some tasks, what changed between the
  two executions. **Note**: this is a beta feature and will need some adaptation for your own
  agent.
|
## Misc
|
|