Commit 8677f48

Update README and Jupyter notebook with improved documentation and result analysis instructions
1 parent f4f9e25 commit 8677f48

2 files changed

+53
-24
lines changed

README.md

Lines changed: 28 additions & 10 deletions
Original file line number | Diff line number | Diff line change
@@ -9,11 +9,16 @@
99
[🛠️ Setup](#🛠️-setup-agentlab)   |  
1010
[🤖 Assistant](#ui-assistant)   |  
1111
[🚀 Launch Experiments](#🚀-launch-experiments)   |  
12-
[🔍 AgentXray](#🔍-agentxray)   |  
12+
[🔍 Analyse Results](#🔍-analyse-results)   |  
1313
[🤖 Make Your Own Agent](#implement-a-new-agent)   |  
1414
[↻ Reproducibility](#↻-reproducibility)   |  
1515

16-
<video controls style="max-width: 800px;">
16+
[![PyPI - License](https://img.shields.io/pypi/l/agentlab?style=flat-square)](http://www.apache.org/licenses/LICENSE-2.0)
17+
[![PyPI - Downloads](https://img.shields.io/pypi/dm/agentlab?style=flat-square)](https://pypistats.org/packages/agentlab)
18+
[![GitHub star chart](https://img.shields.io/github/stars/ServiceNow/AgentLab?style=flat-square)](https://star-history.com/#ServiceNow/AgentLab)
19+
20+
21+
<video controls style="max-width: 700px;">
1722
<source src="https://github.com/ServiceNow/BrowserGym/assets/26232819/e0bfc788-cc8e-44f1-b8c3-0d1114108b85" type="video/mp4">
1823
Your browser does not support the video tag.
1924
</video>
@@ -23,11 +28,11 @@ AgentLab is a framework for developing and evaluating agents on a variety of
2328
[BrowserGym](https://github.com/ServiceNow/BrowserGym).
2429

2530
AgentLab Features:
26-
* Easy large scale parallel agent experiments using [ray](https://www.ray.io/)
27-
* Building blocks for making agents
28-
* Unified LLM api for OpenRouter, OpenAI, Azure, or self hosted using TGI.
31+
* Easy large scale parallel [agent experiments](#🚀-launch-experiments) using [ray](https://www.ray.io/)
32+
* Building blocks for making agents over BrowserGym
33+
* Unified LLM API for OpenRouter, OpenAI, Azure, or self hosted using TGI.
2934
* Preferred way for running benchmarks like WebArena
30-
* Various Reproducibility features
35+
* Various [reproducibility features](#reproducibility-features)
3136
* Unified LeaderBoard (soon)
3237

3338
## 🎯 Supported Benchmarks
@@ -131,8 +136,7 @@ experiments.
131136
### Debugging
132137

133138
For debugging, run experiments with `n_jobs=1` and use VSCode's debug mode. This allows you to pause
134-
execution at breakpoints. To prevent the debugger from stopping on errors while running multiple
135-
experiments in VSCode, set `enable_debug = False` in `ExpArgs`.
139+
execution at breakpoints.
136140

137141
### About Parallel Jobs
138142

@@ -155,14 +159,28 @@ each agent. AgentLab currently does not support evaluations across multiple inst
155159
either create a quick script to handle this or submit a PR to AgentLab. For a smoother parallel
156160
experience, consider using benchmarks like WorkArena instead.
157161

162+
## 🔍 Analyse Results
163+
164+
### Loading Results
165+
166+
The class [`ExpResult`](https://github.com/ServiceNow/BrowserGym/blob/da26a5849d99d9a3169d7b1fde79f909c55c9ba7/browsergym/experiments/src/browsergym/experiments/loop.py#L595) provides a lazy loader for all the information of a specific experiment. You can use [`yield_all_exp_results`](https://github.com/ServiceNow/BrowserGym/blob/da26a5849d99d9a3169d7b1fde79f909c55c9ba7/browsergym/experiments/src/browsergym/experiments/loop.py#L872) to recursively find all results in a directory. Finally, [`load_result_df`](https://github.com/ServiceNow/AgentLab/blob/be1998c5fad5bda47ba50497ec3899aae03e85ec/src/agentlab/analyze/inspect_results.py#L119C5-L119C19) gathers all the summary information in a single dataframe. See [`inspect_results.ipynb`](src/agentlab/analyze/inspect_results.ipynb) for example usage.
167+
168+
```python
169+
from agentlab.analyze import inspect_results
170+
result_df = inspect_results.load_result_df("path/to/your/study")
171+
```
172+
173+
174+
### AgentXray
175+
Inspect the behaviour of your agent using xray. You can load previous or ongoing experiments. The refresh mechanism is currently a bit clunky: to see an updated version of your currently running experiments, refresh the page, refresh the experiment directory list, and select your experiment again.
158176

159-
## 🔍 AgentXray
160-
While your experiments are running, you can inspect the results using:
161177

162178
```bash
163179
agentlab-xray
164180
```
165181

182+
**⚠️ Note**: Gradio is still in development and unexpected behavior has frequently been observed. Version 5.5 seems to work properly so far. If you're not sure that the proper information is displayed, refresh the page and select your experiment again.
183+
166184

167185
<video controls style="max-width: 800px;">
168186
<source src="https://github.com/user-attachments/assets/06c4dac0-b78f-45b7-9405-003da4af6b37" type="video/mp4">

src/agentlab/analyze/inspect_results.ipynb

Lines changed: 25 additions & 14 deletions
Original file line numberDiff line numberDiff line change
@@ -21,7 +21,8 @@
2121
"cell_type": "markdown",
2222
"metadata": {},
2323
"source": [
24-
"### load all summaries"
24+
"### load all summaries\n",
25+
"this will iterate over your RESULTS_DIR directory and create a summary of all the results."
2526
]
2627
},
2728
{
@@ -30,15 +31,18 @@
3031
"metadata": {},
3132
"outputs": [],
3233
"source": [
33-
"all_summaries = inspect_results.get_all_summaries(RESULTS_DIR.resolve().parent / \"ICML-Neurips-final-run\", ignore_cache=False, ignore_stale=True)\n",
34+
"all_summaries = inspect_results.get_all_summaries(\n",
35+
" RESULTS_DIR.resolve().parent / \"ICML-Neurips-final-run\", ignore_cache=False, ignore_stale=True\n",
36+
")\n",
3437
"all_summaries"
3538
]
3639
},
3740
{
3841
"cell_type": "markdown",
3942
"metadata": {},
4043
"source": [
41-
"### Load results"
44+
"### Load results\n",
45+
"find the most recent study and load all summary information in a result dataframe"
4246
]
4347
},
4448
{
@@ -47,13 +51,7 @@
4751
"metadata": {},
4852
"outputs": [],
4953
"source": [
50-
"# # minwob GPT-4o single agent reproduced\n",
51-
"# result_dir = RESULTS_DIR / \"2024-05-28_01-16-12_generic_agent_eval_llm\" #\n",
52-
"\n",
53-
"# # workarena GPT-4o single agent mostly reproduced\n",
54-
"# result_dir = RESULTS_DIR / \"2024-05-28_01-13-04_generic_agent_eval_llm\" \n",
55-
"# result_dir = RESULTS_DIR / \"2024-05-28_01-44-29_generic_agent_eval_llm\"\n",
56-
"\n",
54+
"# replace this by your desired directory if needed.\n",
5755
"result_dir = get_most_recent_study(RESULTS_DIR, contains=None)\n",
5856
"\n",
5957
"print(result_dir)\n",
@@ -108,14 +106,27 @@
108106
"cell_type": "markdown",
109107
"metadata": {},
110108
"source": [
111-
"## Ablation study"
109+
"## Ablation study\n",
110+
"(TODO this might need some dedusting)"
112111
]
113112
},
114113
{
115114
"cell_type": "code",
116-
"execution_count": null,
117-
"metadata": {},
118-
"outputs": [],
115+
"execution_count": 4,
116+
"metadata": {},
117+
"outputs": [
118+
{
119+
"ename": "NameError",
120+
"evalue": "name 'result_df' is not defined",
121+
"output_type": "error",
122+
"traceback": [
123+
"\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
124+
"\u001b[0;31mNameError\u001b[0m Traceback (most recent call last)",
125+
"Cell \u001b[0;32mIn[4], line 1\u001b[0m\n\u001b[0;32m----> 1\u001b[0m ablation_report \u001b[38;5;241m=\u001b[39m inspect_results\u001b[38;5;241m.\u001b[39mablation_report(\u001b[43mresult_df\u001b[49m)\n\u001b[1;32m 2\u001b[0m inspect_results\u001b[38;5;241m.\u001b[39mdisplay_report(ablation_report)\n",
126+
"\u001b[0;31mNameError\u001b[0m: name 'result_df' is not defined"
127+
]
128+
}
129+
],
119130
"source": [
120131
"ablation_report = inspect_results.ablation_report(result_df)\n",
121132
"inspect_results.display_report(ablation_report)"
