-Make sure to prepare the required benchmark according to instructions provided in the [setup
+Make sure to prepare the required benchmark according to the instructions provided in the [setup
 column](#-supported-benchmarks).

 ```bash
@@ -174,11 +184,18 @@ experience, consider using benchmarks like WorkArena instead.

 ### Loading Results

-The class [`ExpResult`](https://github.com/ServiceNow/BrowserGym/blob/da26a5849d99d9a3169d7b1fde79f909c55c9ba7/browsergym/experiments/src/browsergym/experiments/loop.py#L595) provides a lazy loader for all the information of a specific experiment. You can use [`yield_all_exp_results`](https://github.com/ServiceNow/BrowserGym/blob/da26a5849d99d9a3169d7b1fde79f909c55c9ba7/browsergym/experiments/src/browsergym/experiments/loop.py#L872) to recursivley find all results in a directory. Finally [`load_result_df`](https://github.com/ServiceNow/AgentLab/blob/be1998c5fad5bda47ba50497ec3899aae03e85ec/src/agentlab/analyze/inspect_results.py#L119C5-L119C19) gathers all the summary information in a single dataframe. See [`inspect_results.ipynb`](src/agentlab/analyze/inspect_results.ipynb) for example usage.
+The class [`ExpResult`](https://github.com/ServiceNow/BrowserGym/blob/da26a5849d99d9a3169d7b1fde79f909c55c9ba7/browsergym/experiments/src/browsergym/experiments/loop.py#L595) provides a lazy loader for all the information of a specific experiment. You can use [`yield_all_exp_results`](https://github.com/ServiceNow/BrowserGym/blob/da26a5849d99d9a3169d7b1fde79f909c55c9ba7/browsergym/experiments/src/browsergym/experiments/loop.py#L872) to recursively find all results in a directory. Finally [`load_result_df`](https://github.com/ServiceNow/AgentLab/blob/be1998c5fad5bda47ba50497ec3899aae03e85ec/src/agentlab/analyze/inspect_results.py#L119C5-L119C19) gathers all the summary information in a single dataframe. See [`inspect_results.ipynb`](src/agentlab/analyze/inspect_results.ipynb) for example usage.

 ```python
 from agentlab.analyze import inspect_results
+
+# load the summary of all experiments of the study in a dataframe
@@ -204,8 +221,14 @@ Once this is selected, you can see the trace of your agent on the given task. Cl
 image to select a step and observe the action taken by the agent.


-**⚠️ Note**: Gradio is still in developement and unexpected behavior have been frequently noticed. Version 5.5 seems to work properly so far. If you're not sure that the proper information is displaying, refresh the page and select your experiment again.
+**⚠️ Note**: Gradio is still developing, and unexpected behavior has been frequently noticed. Version 5.5 seems to work properly so far. If you're not sure that the proper information is displaying, refresh the page and select your experiment again.
+
+
+## 🏆 Leaderboard
+
+Official unified [leaderboard](https://huggingface.co/spaces/ServiceNow/browsergym-leaderboard) across all benchmarks.

+Experiments are on their way for more reference points using GenericAgent. We are also working on code to automatically push a study to the leaderboard.

 ## 🤖 Implement a new Agent

@@ -222,32 +245,32 @@ Several factors can influence reproducibility of results in the context of evalu
 dynamic benchmarks.

 ### Factors affecting reproducibility
-* **Software version**: Different version of Playwright or any package in the software stack could
+* **Software version**: Different versions of Playwright or any package in the software stack could
   influence the behavior of the benchmark or the agent.
-* **APIbased LLMs silently changing**: Even for a fixed version, an LLM may be updated e.g. to
-  incorporate latest web knowledge.
+* **API-based LLMs silently changing**: Even for a fixed version, an LLM may be updated e.g. to
+  incorporate the latest web knowledge.
 * **Live websites**:
   * WorkArena: The demo instance is mostly fixed in time to a specific version but ServiceNow
-    sometime push minor modifications.
+    sometimes pushes minor modifications.
   * AssistantBench and GAIA: These rely on the agent navigating the open web. The experience may
     change depending on which country or region, some websites might be in different languages by
     default.
-* **Stochastic Agents**: Setting temperature of the LLM to 0 can reduce most stochasticity.
-* **Nondeterministic tasks**: For a fixed seed, the changes should be minimal
+* **Stochastic Agents**: Setting the temperature of the LLM to 0 can reduce most stochasticity.
+* **Non-deterministic tasks**: For a fixed seed, the changes should be minimal

 ### Reproducibility Features
 * `Study` contains a dict of information about reproducibility, including benchmark version, package
   version and commit hash
 * The `Study` class allows automatic upload of your results to
   [`reproducibility_journal.csv`](reproducibility_journal.csv). This makes it easier to populate a
-  large amount of reference points.
-* **Reproduced results in the leaderboard**. For agents that are repdocudibile, we encourage users
+  large amount of reference points. For this feature, you need to `git clone` the repository and install via `pip install -e .`.
+* **Reproduced results in the leaderboard**. For agents that are reproducible, we encourage users
   to try to reproduce the results and upload them to the leaderboard. There is a special column
   containing information about all reproduced results of an agent on a benchmark.
 * **ReproducibilityAgent**: [You can run this agent](src/agentlab/agents/generic_agent/reproducibility_agent.py) on an existing study and it will try to re-run
-  the same actions on the same task seeds. A vsiual diff of the two prompts will be displayed in the
+  the same actions on the same task seeds. A visual diff of the two prompts will be displayed in the
   AgentInfo HTML tab of AgentXray. You will be able to inspect on some tasks what kind of changes
-  between to two executions. **Note**: this is a beta feature and will need some adaptation for your
+  between the two executions. **Note**: this is a beta feature and will need some adaptation for your
 ThibaultLSDC,GenericAgent-meta-llama_llama-3.1-70b-instruct,weblinx_test,0.0.1.dev13,2024-11-07_21-42-30,b9451759-4f0e-492c-a3c8-fa5109d2d9b1,0.089,0.005,0,2650/2650,None,Linux (#66-Ubuntu SMP Fri Aug 30 13:56:20 UTC 2024),3.12.7,1.39.0,0.2.3,7a5b91e62056fa8fb26efdd2f64f5b25a92b817c,,0.12.0,8633c30c31e6a5a1d5122835c035aa56d18f3f0a,
 ThibaultLSDC,GenericAgent-openai_o1-mini-2024-09-12,weblinx_test,0.0.1.dev13,2024-11-07_21-42-30,b9451759-4f0e-492c-a3c8-fa5109d2d9b1,0.125,0.006,0,2650/2650,None,Linux (#66-Ubuntu SMP Fri Aug 30 13:56:20 UTC 2024),3.12.7,1.39.0,0.2.3,7a5b91e62056fa8fb26efdd2f64f5b25a92b817c,,0.12.0,8633c30c31e6a5a1d5122835c035aa56d18f3f0a,
 ThibaultLSDC,GenericAgent-meta-llama_llama-3.1-405b-instruct,weblinx_test,0.0.1.dev13,2024-11-07_21-42-30,b9451759-4f0e-492c-a3c8-fa5109d2d9b1,0.079,0.005,0,2650/2650,None,Linux (#66-Ubuntu SMP Fri Aug 30 13:56:20 UTC 2024),3.12.7,1.39.0,0.2.3,7a5b91e62056fa8fb26efdd2f64f5b25a92b817c,,0.12.0,8633c30c31e6a5a1d5122835c035aa56d18f3f0a,
+ThibaultLSDC,GenericAgent-meta-llama_llama-3.1-405b-instruct,workarena_l2_agent_curriculum_eval,0.4.1,2024-11-29_14-28-47,528da1f2-1949-41dc-b988-85f19f435af2,0.072,0.017,2,235/235,None,Linux (#68-Ubuntu SMP Mon Oct 7 14:34:20 UTC 2024),3.12.7,1.39.0,0.3.1,b115b2716d8a6328824684a692ed642297f0b1dc,,0.13.3,70dac253628c476aff1af6a975f27f8563453ad2,
+ThibaultLSDC,GenericAgent-meta-llama_llama-3.1-405b-instruct,miniwob,0.13.3,2024-11-29_16-14-00,4d748972-6d35-4489-a197-138b656a7db3,0.646,0.019,0,625/625,None,Linux (#68-Ubuntu SMP Mon Oct 7 14:34:20 UTC 2024),3.12.7,1.39.0,0.3.1,becb4856fb1612f44010fe74ef8155d367ca17fc,,0.13.3,70dac253628c476aff1af6a975f27f8563453ad2,
+ThibaultLSDC,GenericAgent-gpt-4o,assistantbench,0.13.1,2024-11-28_19-34-58,d93a2398-2b70-41ce-b989-364fed988d73,0.005,0.003,2,213/214,None,Linux (#68-Ubuntu SMP Mon Oct 7 14:34:20 UTC 2024),3.12.7,1.39.0,0.3.0,32865050045c8c71df35c34ff30a6b420a4e258c, M: src/agentlab/experiments/study.py,0.13.1,None,
+ThibaultLSDC,GenericAgent-gpt-4o-mini,assistantbench,0.13.1,2024-11-28_19-34-58,d93a2398-2b70-41ce-b989-364fed988d73,0.002,0.002,1,214/214,None,Linux (#68-Ubuntu SMP Mon Oct 7 14:34:20 UTC 2024),3.12.7,1.39.0,0.3.0,32865050045c8c71df35c34ff30a6b420a4e258c, M: src/agentlab/experiments/study.py,0.13.1,None,
+ThibaultLSDC,GenericAgent-meta-llama_llama-3.1-405b-instruct,assistantbench,0.13.1,2024-11-28_19-34-58,d93a2398-2b70-41ce-b989-364fed988d73,0.008,0.003,1,212/214,None,Linux (#68-Ubuntu SMP Mon Oct 7 14:34:20 UTC 2024),3.12.7,1.39.0,0.3.0,32865050045c8c71df35c34ff30a6b420a4e258c, M: src/agentlab/experiments/study.py,0.13.1,None,
+ThibaultLSDC,GenericAgent-meta-llama_llama-3.1-70b-instruct,assistantbench,0.13.1,2024-11-28_19-34-58,d93a2398-2b70-41ce-b989-364fed988d73,0.007,0.005,8,206/214,None,Linux (#68-Ubuntu SMP Mon Oct 7 14:34:20 UTC 2024),3.12.7,1.39.0,0.3.0,32865050045c8c71df35c34ff30a6b420a4e258c, M: src/agentlab/experiments/study.py,0.13.1,None,
+ThibaultLSDC,GenericAgent-meta-llama_llama-3.1-8b-instruct,assistantbench,0.13.1,2024-11-28_19-34-58,d93a2398-2b70-41ce-b989-364fed988d73,0.001,0.001,15,214/214,None,Linux (#68-Ubuntu SMP Mon Oct 7 14:34:20 UTC 2024),3.12.7,1.39.0,0.3.0,32865050045c8c71df35c34ff30a6b420a4e258c, M: src/agentlab/experiments/study.py,0.13.1,None,
+ThibaultLSDC,GenericAgent-anthropic_claude-3.5-sonnet:beta,assistantbench,0.13.1,2024-11-28_19-34-58,d93a2398-2b70-41ce-b989-364fed988d73,0.007,0.003,1,212/214,None,Linux (#68-Ubuntu SMP Mon Oct 7 14:34:20 UTC 2024),3.12.7,1.39.0,0.3.0,32865050045c8c71df35c34ff30a6b420a4e258c, M: src/agentlab/experiments/study.py,0.13.1,None,
+ThibaultLSDC,GenericAgent-openai_o1-mini-2024-09-12,assistantbench,0.13.1,2024-11-28_19-34-58,d93a2398-2b70-41ce-b989-364fed988d73,0.009,0.005,1,214/214,None,Linux (#68-Ubuntu SMP Mon Oct 7 14:34:20 UTC 2024),3.12.7,1.39.0,0.3.0,32865050045c8c71df35c34ff30a6b420a4e258c, M: src/agentlab/experiments/study.py,0.13.1,None,
+ThibaultLSDC,GenericAgent-gpt-4o-mini,webarena,0.13.3,2024-11-29_19-25-49,c6bdeb87-9879-4c06-aa70-00d895001156,0.174,0.013,1,812/812,None,Linux (#68-Ubuntu SMP Mon Oct 7 14:34:20 UTC 2024),3.12.7,1.39.0,0.3.1,b115b2716d8a6328824684a692ed642297f0b1dc,,0.13.3,None,
+ThibaultLSDC,GenericAgent-gpt-4o,webarena,0.13.3,2024-11-29_22-28-32,d2eed215-91bb-4603-b69c-8ef8f9d57f34,0.314,0.016,3,812/812,None,Linux (#68-Ubuntu SMP Mon Oct 7 14:34:20 UTC 2024),3.12.7,1.39.0,0.3.1,430fe9456ba766398380454a6335f094004607af,,0.13.3,None,
+ThibaultLSDC,GenericAgent-anthropic_claude-3.5-sonnet:beta,webarena,0.13.3,2024-11-29_22-37-46,b5fc5be7-54cc-4fc1-a9ee-73447b9c3eae,0.362,0.017,0,812/812,None,Linux (#68-Ubuntu SMP Mon Oct 7 14:34:20 UTC 2024),3.12.7,1.39.0,0.3.1,7b224971fb7a90fb76924ca9386a1e8bf609dd2a,,0.13.3,None,
+ThibaultLSDC,GenericAgent-openai_o1-mini-2024-09-12,webarena,0.13.3,2024-11-30_00-22-44,1827983d-5e84-4b63-ad49-bf45ec2a6348,0.286,0.016,0,812/812,None,Linux (#68-Ubuntu SMP Mon Oct 7 14:34:20 UTC 2024),3.12.7,1.39.0,0.3.1,3f54ef13b778e69a1706c732f776147e9523ad3d,,0.13.3,None,
+ThibaultLSDC,GenericAgent-meta-llama_llama-3.1-405b-instruct,webarena,0.13.3,2024-12-01_00-04-43,aaeca13d-0cf5-444f-8445-590350b54746,0.24,0.015,9,812/812,None,Linux (#68-Ubuntu SMP Mon Oct 7 14:34:20 UTC 2024),3.12.7,1.39.0,0.3.1,5a5b94d544424517cdd11602b27100b82e35eac0,,0.13.3,None,
+ThibaultLSDC,GenericAgent-gpt-4o-mini_vision,visualwebarena,0.13.3,2024-12-02_02-54-33,8d8642d3-757a-4346-ba45-01398f85b1f4,0.169,0.012,37,909/910,None,Linux (#68-Ubuntu SMP Mon Oct 7 14:34:20 UTC 2024),3.12.7,1.39.0,0.3.1,df7bc706f3793f47a456d1bda0485b306b8cf612,,0.13.3,None,
+ThibaultLSDC,GenericAgent-gpt-4o_vision,visualwebarena,0.13.3,2024-12-02_07-17-28,7fb7eac8-4bbd-4ebe-be32-15901a7678f2,0.267,0.015,65,910/910,None,Linux (#68-Ubuntu SMP Mon Oct 7 14:34:20 UTC 2024),3.12.7,1.39.0,0.3.1,df7bc706f3793f47a456d1bda0485b306b8cf612,,0.13.3,None,
+ThibaultLSDC,GenericAgent-anthropic_claude-3.5-sonnet:beta_vision,visualwebarena,0.13.3,2024-12-02_09-11-35,22f0611d-aeea-4ee9-a533-b45442b5e080,0.21,0.013,178,910/910,None,Linux (#68-Ubuntu SMP Mon Oct 7 14:34:20 UTC 2024),3.12.7,1.39.0,0.3.1,df7bc706f3793f47a456d1bda0485b306b8cf612,,0.13.3,None,
+ThibaultLSDC,GenericAgent-meta-llama_llama-3.1-70b-instruct,webarena,0.13.3,2024-12-02_23-18-38,fc5747bc-d998-4942-a0eb-e55a3ccc1cb3,0.184,0.014,213,811/812,None,Linux (#68-Ubuntu SMP Mon Oct 7 14:34:20 UTC 2024),3.12.7,1.39.0,0.3.1,df7bc706f3793f47a456d1bda0485b306b8cf612,,0.13.3,None,
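
The journal rows above can be read with Python's standard `csv` module. The sketch below is only an illustration: the CSV header is not part of this diff, so the column positions (agent name, benchmark, score, standard error, error count, completed/total) are assumptions inferred from the values, and the helper names `parse_rows` and `best_per_benchmark` are hypothetical rather than part of AgentLab.

```python
import csv
import io

# Two rows copied verbatim from the journal excerpt above.
SAMPLE_ROWS = """\
ThibaultLSDC,GenericAgent-gpt-4o-mini,webarena,0.13.3,2024-11-29_19-25-49,c6bdeb87-9879-4c06-aa70-00d895001156,0.174,0.013,1,812/812,None,Linux (#68-Ubuntu SMP Mon Oct 7 14:34:20 UTC 2024),3.12.7,1.39.0,0.3.1,b115b2716d8a6328824684a692ed642297f0b1dc,,0.13.3,None,
ThibaultLSDC,GenericAgent-gpt-4o,webarena,0.13.3,2024-11-29_22-28-32,d2eed215-91bb-4603-b69c-8ef8f9d57f34,0.314,0.016,3,812/812,None,Linux (#68-Ubuntu SMP Mon Oct 7 14:34:20 UTC 2024),3.12.7,1.39.0,0.3.1,430fe9456ba766398380454a6335f094004607af,,0.13.3,None,
"""

# ASSUMPTION: column positions are guessed from the values, not from a header.
AGENT, BENCHMARK, SCORE, STD_ERR, N_ERR, COMPLETED = 1, 2, 6, 7, 8, 9


def parse_rows(text):
    """Parse journal rows into dicts holding the assumed fields."""
    rows = []
    for fields in csv.reader(io.StringIO(text)):
        if not fields:
            continue  # skip empty lines
        done, total = (int(part) for part in fields[COMPLETED].split("/"))
        rows.append(
            {
                "agent": fields[AGENT],
                "benchmark": fields[BENCHMARK],
                "score": float(fields[SCORE]),
                "std_err": float(fields[STD_ERR]),
                "n_err": int(fields[N_ERR]),
                "completed": (done, total),
            }
        )
    return rows


def best_per_benchmark(rows):
    """Return, for each benchmark, the row with the highest assumed score."""
    best = {}
    for row in rows:
        current = best.get(row["benchmark"])
        if current is None or row["score"] > current["score"]:
            best[row["benchmark"]] = row
    return best


if __name__ == "__main__":
    for benchmark, row in best_per_benchmark(parse_rows(SAMPLE_ROWS)).items():
        print(benchmark, row["agent"], row["score"])
```

Rows are indexed by position here only because the header is missing from the excerpt; with the full file, `csv.DictReader` and the real column names would be the safer choice.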