
Commit b3f8f80

Update README.md (#158)
1 parent: 67b205c

1 file changed: README.md (+31 additions, -20 deletions)
````diff
@@ -1,8 +1,6 @@
 
 <div align="center">
 
-![AgentLab Banner](https://github.com/user-attachments/assets/a23b3cd8-b5c4-4918-817b-654ae6468cb4)
-
 
 
 [![pypi](https://badge.fury.io/py/agentlab.svg)](https://pypi.org/project/agentlab/)
````
````diff
@@ -17,10 +15,17 @@
 [🛠️ Setup](#%EF%B8%8F-setup-agentlab) &nbsp;|&nbsp;
 [🤖 Assistant](#-ui-assistant) &nbsp;|&nbsp;
 [🚀 Launch Experiments](#-launch-experiments) &nbsp;|&nbsp;
-[🔍 Analyse Results](#-analyse-results) &nbsp;|&nbsp;
+[🔍 Analyse Results](#-analyse-results) &nbsp;|&nbsp;
+<br>
+[🏆 Leaderboard](#-leaderboard) &nbsp;|&nbsp;
 [🤖 Build Your Agent](#-implement-a-new-agent) &nbsp;|&nbsp;
 [↻ Reproducibility](#-reproducibility)
 
+
+<img src="https://github.com/user-attachments/assets/47a7c425-9763-46e5-be54-adac363be850" alt="agentlab-diagram" width="700"/>
+
+
+Demo solving tasks:
 https://github.com/ServiceNow/BrowserGym/assets/26232819/e0bfc788-cc8e-44f1-b8c3-0d1114108b85
 
 </div>
````
````diff
@@ -32,10 +37,10 @@ AgentLab is a framework for developing and evaluating agents on a variety of
 AgentLab Features:
 * Easy large scale parallel [agent experiments](#-launch-experiments) using [ray](https://www.ray.io/)
 * Building blocks for making agents over BrowserGym
-* Unified LLM API for OpenRouter, OpenAI, Azure, or self hosted using TGI.
-* Prefered way for running benchmarks like WebArena
+* Unified LLM API for OpenRouter, OpenAI, Azure, or self-hosted using TGI.
+* Preferred way for running benchmarks like WebArena
 * Various [reproducibility features](#reproducibility-features)
-* Unified LeaderBoard (soon)
+* Unified [LeaderBoard](https://huggingface.co/spaces/ServiceNow/browsergym-leaderboard)
 
 ## 🎯 Supported Benchmarks
 
````
````diff
@@ -59,12 +64,12 @@ AgentLab Features:
 pip install agentlab
 ```
 
-If not done already, install playwright:
+If not done already, install Playwright:
 ```bash
 playwright install
 ```
 
-Make sure to prepare the required benchmark according to instructions provided in the [setup
+Make sure to prepare the required benchmark according to the instructions provided in the [setup
 column](#-supported-benchmarks).
 
 ```bash
````
````diff
@@ -174,7 +179,7 @@ experience, consider using benchmarks like WorkArena instead.
 
 ### Loading Results
 
-The class [`ExpResult`](https://github.com/ServiceNow/BrowserGym/blob/da26a5849d99d9a3169d7b1fde79f909c55c9ba7/browsergym/experiments/src/browsergym/experiments/loop.py#L595) provides a lazy loader for all the information of a specific experiment. You can use [`yield_all_exp_results`](https://github.com/ServiceNow/BrowserGym/blob/da26a5849d99d9a3169d7b1fde79f909c55c9ba7/browsergym/experiments/src/browsergym/experiments/loop.py#L872) to recursivley find all results in a directory. Finally [`load_result_df`](https://github.com/ServiceNow/AgentLab/blob/be1998c5fad5bda47ba50497ec3899aae03e85ec/src/agentlab/analyze/inspect_results.py#L119C5-L119C19) gathers all the summary information in a single dataframe. See [`inspect_results.ipynb`](src/agentlab/analyze/inspect_results.ipynb) for example usage.
+The class [`ExpResult`](https://github.com/ServiceNow/BrowserGym/blob/da26a5849d99d9a3169d7b1fde79f909c55c9ba7/browsergym/experiments/src/browsergym/experiments/loop.py#L595) provides a lazy loader for all the information of a specific experiment. You can use [`yield_all_exp_results`](https://github.com/ServiceNow/BrowserGym/blob/da26a5849d99d9a3169d7b1fde79f909c55c9ba7/browsergym/experiments/src/browsergym/experiments/loop.py#L872) to recursively find all results in a directory. Finally, [`load_result_df`](https://github.com/ServiceNow/AgentLab/blob/be1998c5fad5bda47ba50497ec3899aae03e85ec/src/agentlab/analyze/inspect_results.py#L119C5-L119C19) gathers all the summary information in a single dataframe. See [`inspect_results.ipynb`](src/agentlab/analyze/inspect_results.ipynb) for example usage.
 
 ```python
 from agentlab.analyze import inspect_results
````
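To make the result-loading flow in this hunk concrete, here is a minimal sketch built on the linked `load_result_df`; the study path is a placeholder, and the exact DataFrame columns depend on the agents and benchmarks in the study:

```python
from pathlib import Path

from agentlab.analyze import inspect_results

# Placeholder path: point this at the directory of one of your studies.
study_dir = Path("~/agentlab_results/2024-06-01_my_study").expanduser()

# Gather the summary information of every experiment found under
# study_dir into a single pandas DataFrame (one row per experiment).
result_df = inspect_results.load_result_df(study_dir)

# From here it is ordinary pandas, e.g. peek at the summary rows.
print(result_df.head())
```

Since `ExpResult` is described as a lazy loader, heavier per-step artifacts should only be read when accessed, so summary-level analysis like this stays cheap.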
````diff
@@ -204,8 +209,14 @@ Once this is selected, you can see the trace of your agent on the given task. Cl
 image to select a step and observe the action taken by the agent.
 
 
-**⚠️ Note**: Gradio is still in developement and unexpected behavior have been frequently noticed. Version 5.5 seems to work properly so far. If you're not sure that the proper information is displaying, refresh the page and select your experiment again.
+**⚠️ Note**: Gradio is still under active development, and unexpected behavior has frequently been observed. Version 5.5 seems to work properly so far. If you're not sure that the proper information is displayed, refresh the page and select your experiment again.
+
+
+## 🏆 Leaderboard
+
+Official unified [leaderboard](https://huggingface.co/spaces/ServiceNow/browsergym-leaderboard) across all benchmarks.
 
+Experiments are underway to add more reference points using GenericAgent. We are also working on code to automatically push a study to the leaderboard.
 
 ## 🤖 Implement a new Agent
 
````
````diff
@@ -222,32 +233,32 @@ Several factors can influence reproducibility of results in the context of evalu
 dynamic benchmarks.
 
 ### Factors affecting reproducibility
-* **Software version**: Different version of Playwright or any package in the software stack could
+* **Software version**: Different versions of Playwright or any package in the software stack could
 influence the behavior of the benchmark or the agent.
-* **API based LLMs silently changing**: Even for a fixed version, an LLM may be updated e.g. to
-incorporate latest web knowledge.
+* **API-based LLMs silently changing**: Even for a fixed version, an LLM may be updated, e.g. to
+incorporate the latest web knowledge.
 * **Live websites**:
   * WorkArena: The demo instance is mostly fixed in time to a specific version but ServiceNow
-sometime push minor modifications.
+sometimes pushes minor modifications.
   * AssistantBench and GAIA: These rely on the agent navigating the open web. The experience may
 change depending on which country or region you are in; some websites might be in different languages by
 default.
-* **Stochastic Agents**: Setting temperature of the LLM to 0 can reduce most stochasticity.
-* **Non deterministic tasks**: For a fixed seed, the changes should be minimal
+* **Stochastic Agents**: Setting the temperature of the LLM to 0 can reduce most stochasticity.
+* **Non-deterministic tasks**: For a fixed seed, the changes should be minimal.
 
 ### Reproducibility Features
 * `Study` contains a dict of information about reproducibility, including benchmark version, package
 version and commit hash
 * The `Study` class allows automatic upload of your results to
 [`reproducibility_journal.csv`](reproducibility_journal.csv). This makes it easier to populate a
-large amount of reference points.
-* **Reproduced results in the leaderboard**. For agents that are repdocudibile, we encourage users
+large number of reference points. For this feature, you need to `git clone` the repository and install via `pip install -e .`.
+* **Reproduced results in the leaderboard**. For agents that are reproducible, we encourage users
 to try to reproduce the results and upload them to the leaderboard. There is a special column
 containing information about all reproduced results of an agent on a benchmark.
 * **ReproducibilityAgent**: [You can run this agent](src/agentlab/agents/generic_agent/reproducibility_agent.py) on an existing study and it will try to re-run
-the same actions on the same task seeds. A vsiual diff of the two prompts will be displayed in the
+the same actions on the same task seeds. A visual diff of the two prompts will be displayed in the
 AgentInfo HTML tab of AgentXray. You will be able to inspect on some tasks what kind of changes
-between to two executions. **Note**: this is a beta feature and will need some adaptation for your
+occurred between the two executions. **Note**: this is a beta feature and will need some adaptation for your
 own agent.
 
 
````
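As a concrete illustration of the "Stochastic Agents" bullet in this hunk, here is a minimal sketch of pinning decoding randomness at the LLM call site; it uses the plain `openai` client rather than AgentLab's unified LLM API, and the model name is a placeholder:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# temperature=0 makes decoding (near-)greedy, which removes most but not
# all run-to-run variation; provider-side model updates can still change outputs.
response = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder model name
    temperature=0,
    seed=0,  # best-effort determinism, honored by some providers only
    messages=[{"role": "user", "content": "Click the 'Submit' button."}],
)
print(response.choices[0].message.content)
```

Even with greedy decoding, the "API-based LLMs silently changing" factor above still applies, which is why the `Study` reproducibility dict records benchmark and package versions and the commit hash alongside results.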