
Commit 8ddc050

Merge branch 'main' into tlsdc/docs
2 parents b27ac9e + c52b7cd commit 8ddc050

14 files changed: 224 additions, 76 deletions

.github/workflows/pypi.yml

Lines changed: 10 additions & 0 deletions
@@ -66,6 +66,16 @@ jobs:
         name: python-package-distributions
         path: dist/
 
+    - name: Set up Python for Sigstore
+      uses: actions/setup-python@v5
+      with:
+        python-version: "3.x"
+
+    - name: Install Sigstore and cryptography dependencies
+      run: |
+        python3 -m pip install --upgrade pip
+        python3 -m pip install cryptography==43.0.3
+
     - name: Sign the dists with Sigstore
       uses: sigstore/[email protected]
       with:

README.md

Lines changed: 45 additions & 22 deletions
@@ -1,8 +1,6 @@
 
 <div align="center">
 
-![AgentLab Banner](https://github.com/user-attachments/assets/a23b3cd8-b5c4-4918-817b-654ae6468cb4)
-
 
 
 [![pypi](https://badge.fury.io/py/agentlab.svg)](https://pypi.org/project/agentlab/)
@@ -17,25 +15,37 @@
 [🛠️ Setup](#%EF%B8%8F-setup-agentlab) &nbsp;|&nbsp;
 [🤖 Assistant](#-ui-assistant) &nbsp;|&nbsp;
 [🚀 Launch Experiments](#-launch-experiments) &nbsp;|&nbsp;
-[🔍 Analyse Results](#-analyse-results) &nbsp;|&nbsp;
+[🔍 Analyse Results](#-analyse-results) &nbsp;|&nbsp;
+<br>
+[🏆 Leaderboard](#-leaderboard) &nbsp;|&nbsp;
 [🤖 Build Your Agent](#-implement-a-new-agent) &nbsp;|&nbsp;
-[↻ Reproducibility](#-reproducibility)
+[↻ Reproducibility](#-reproducibility) &nbsp;|&nbsp;
+[💪 BrowserGym](https://github.com/ServiceNow/BrowserGym)
+
+
+<img src="https://github.com/user-attachments/assets/47a7c425-9763-46e5-be54-adac363be850" alt="agentlab-diagram" width="700"/>
+
+
+[Demo solving tasks:](https://github.com/ServiceNow/BrowserGym/assets/26232819/e0bfc788-cc8e-44f1-b8c3-0d1114108b85)
 
-https://github.com/ServiceNow/BrowserGym/assets/26232819/e0bfc788-cc8e-44f1-b8c3-0d1114108b85
 
 </div>
 
+> [!WARNING]
+> AgentLab is meant to provide an open, easy-to-use and extensible framework to accelerate the field of web agent research.
+> It is not meant to be a consumer product. Use with caution!
+
 AgentLab is a framework for developing and evaluating agents on a variety of
 [benchmarks](#-supported-benchmarks) supported by
 [BrowserGym](https://github.com/ServiceNow/BrowserGym).
 
 AgentLab Features:
 * Easy large scale parallel [agent experiments](#-launch-experiments) using [ray](https://www.ray.io/)
 * Building blocks for making agents over BrowserGym
-* Unified LLM API for OpenRouter, OpenAI, Azure, or self hosted using TGI.
-* Prefered way for running benchmarks like WebArena
+* Unified LLM API for OpenRouter, OpenAI, Azure, or self-hosted using TGI.
+* Preferred way for running benchmarks like WebArena
 * Various [reproducibility features](#reproducibility-features)
-* Unified LeaderBoard (soon)
+* Unified [LeaderBoard](https://huggingface.co/spaces/ServiceNow/browsergym-leaderboard)
 
 ## 🎯 Supported Benchmarks
 
@@ -59,12 +69,12 @@ AgentLab Features:
 pip install agentlab
 ```
 
-If not done already, install playwright:
+If not done already, install Playwright:
 ```bash
 playwright install
 ```
 
-Make sure to prepare the required benchmark according to instructions provided in the [setup
+Make sure to prepare the required benchmark according to the instructions provided in the [setup
 column](#-supported-benchmarks).
 
 ```bash
@@ -174,11 +184,18 @@ experience, consider using benchmarks like WorkArena instead.
 
 ### Loading Results
 
-The class [`ExpResult`](https://github.com/ServiceNow/BrowserGym/blob/da26a5849d99d9a3169d7b1fde79f909c55c9ba7/browsergym/experiments/src/browsergym/experiments/loop.py#L595) provides a lazy loader for all the information of a specific experiment. You can use [`yield_all_exp_results`](https://github.com/ServiceNow/BrowserGym/blob/da26a5849d99d9a3169d7b1fde79f909c55c9ba7/browsergym/experiments/src/browsergym/experiments/loop.py#L872) to recursivley find all results in a directory. Finally [`load_result_df`](https://github.com/ServiceNow/AgentLab/blob/be1998c5fad5bda47ba50497ec3899aae03e85ec/src/agentlab/analyze/inspect_results.py#L119C5-L119C19) gathers all the summary information in a single dataframe. See [`inspect_results.ipynb`](src/agentlab/analyze/inspect_results.ipynb) for example usage.
+The class [`ExpResult`](https://github.com/ServiceNow/BrowserGym/blob/da26a5849d99d9a3169d7b1fde79f909c55c9ba7/browsergym/experiments/src/browsergym/experiments/loop.py#L595) provides a lazy loader for all the information of a specific experiment. You can use [`yield_all_exp_results`](https://github.com/ServiceNow/BrowserGym/blob/da26a5849d99d9a3169d7b1fde79f909c55c9ba7/browsergym/experiments/src/browsergym/experiments/loop.py#L872) to recursively find all results in a directory. Finally [`load_result_df`](https://github.com/ServiceNow/AgentLab/blob/be1998c5fad5bda47ba50497ec3899aae03e85ec/src/agentlab/analyze/inspect_results.py#L119C5-L119C19) gathers all the summary information in a single dataframe. See [`inspect_results.ipynb`](src/agentlab/analyze/inspect_results.ipynb) for example usage.
 
 ```python
 from agentlab.analyze import inspect_results
+
+# load the summary of all experiments of the study in a dataframe
 result_df = inspect_results.load_result_df("path/to/your/study")
+
+# load the detailed results of the 1st experiment
+exp_result = bgym.ExpResult(result_df["exp_dir"][0])
+step_0_screenshot = exp_result.screenshots[0]
+step_0_action = exp_result.steps_info[0].action
 ```
 
 
@@ -204,8 +221,14 @@ Once this is selected, you can see the trace of your agent on the given task. Cl
 image to select a step and observe the action taken by the agent.
 
 
-**⚠️ Note**: Gradio is still in developement and unexpected behavior have been frequently noticed. Version 5.5 seems to work properly so far. If you're not sure that the proper information is displaying, refresh the page and select your experiment again.
+**⚠️ Note**: Gradio is still developing, and unexpected behavior has been frequently noticed. Version 5.5 seems to work properly so far. If you're not sure that the proper information is displaying, refresh the page and select your experiment again.
+
+
+## 🏆 Leaderboard
+
+Official unified [leaderboard](https://huggingface.co/spaces/ServiceNow/browsergym-leaderboard) across all benchmarks.
 
+Experiments are on their way for more reference points using GenericAgent. We are also working on code to automatically push a study to the leaderboard.
 
 ## 🤖 Implement a new Agent
 
@@ -222,32 +245,32 @@ Several factors can influence reproducibility of results in the context of evalu
 dynamic benchmarks.
 
 ### Factors affecting reproducibility
-* **Software version**: Different version of Playwright or any package in the software stack could
+* **Software version**: Different versions of Playwright or any package in the software stack could
   influence the behavior of the benchmark or the agent.
-* **API based LLMs silently changing**: Even for a fixed version, an LLM may be updated e.g. to
-  incorporate latest web knowledge.
+* **API-based LLMs silently changing**: Even for a fixed version, an LLM may be updated e.g. to
+  incorporate the latest web knowledge.
 * **Live websites**:
   * WorkArena: The demo instance is mostly fixed in time to a specific version but ServiceNow
-    sometime push minor modifications.
+    sometimes pushes minor modifications.
   * AssistantBench and GAIA: These rely on the agent navigating the open web. The experience may
     change depending on which country or region, some websites might be in different languages by
     default.
-* **Stochastic Agents**: Setting temperature of the LLM to 0 can reduce most stochasticity.
-* **Non deterministic tasks**: For a fixed seed, the changes should be minimal
+* **Stochastic Agents**: Setting the temperature of the LLM to 0 can reduce most stochasticity.
+* **Non-deterministic tasks**: For a fixed seed, the changes should be minimal
 
 ### Reproducibility Features
 * `Study` contains a dict of information about reproducibility, including benchmark version, package
   version and commit hash
 * The `Study` class allows automatic upload of your results to
   [`reproducibility_journal.csv`](reproducibility_journal.csv). This makes it easier to populate a
-  large amount of reference points.
-* **Reproduced results in the leaderboard**. For agents that are repdocudibile, we encourage users
+  large amount of reference points. For this feature, you need to `git clone` the repository and install via `pip install -e .`.
+* **Reproduced results in the leaderboard**. For agents that are reprocudibile, we encourage users
   to try to reproduce the results and upload them to the leaderboard. There is a special column
   containing information about all reproduced results of an agent on a benchmark.
 * **ReproducibilityAgent**: [You can run this agent](src/agentlab/agents/generic_agent/reproducibility_agent.py) on an existing study and it will try to re-run
-  the same actions on the same task seeds. A vsiual diff of the two prompts will be displayed in the
+  the same actions on the same task seeds. A visual diff of the two prompts will be displayed in the
   AgentInfo HTML tab of AgentXray. You will be able to inspect on some tasks what kind of changes
-  between to two executions. **Note**: this is a beta feature and will need some adaptation for your
+  between the two executions. **Note**: this is a beta feature and will need some adaptation for your
   own agent.
 
 

reproducibility_journal.csv

Lines changed: 19 additions & 0 deletions
@@ -46,3 +46,22 @@ ThibaultLSDC,GenericAgent-anthropic_claude-3.5-sonnet:beta,weblinx_test,0.0.1.de
 ThibaultLSDC,GenericAgent-meta-llama_llama-3.1-70b-instruct,weblinx_test,0.0.1.dev13,2024-11-07_21-42-30,b9451759-4f0e-492c-a3c8-fa5109d2d9b1,0.089,0.005,0,2650/2650,None,Linux (#66-Ubuntu SMP Fri Aug 30 13:56:20 UTC 2024),3.12.7,1.39.0,0.2.3,7a5b91e62056fa8fb26efdd2f64f5b25a92b817c,,0.12.0,8633c30c31e6a5a1d5122835c035aa56d18f3f0a,
 ThibaultLSDC,GenericAgent-openai_o1-mini-2024-09-12,weblinx_test,0.0.1.dev13,2024-11-07_21-42-30,b9451759-4f0e-492c-a3c8-fa5109d2d9b1,0.125,0.006,0,2650/2650,None,Linux (#66-Ubuntu SMP Fri Aug 30 13:56:20 UTC 2024),3.12.7,1.39.0,0.2.3,7a5b91e62056fa8fb26efdd2f64f5b25a92b817c,,0.12.0,8633c30c31e6a5a1d5122835c035aa56d18f3f0a,
 ThibaultLSDC,GenericAgent-meta-llama_llama-3.1-405b-instruct,weblinx_test,0.0.1.dev13,2024-11-07_21-42-30,b9451759-4f0e-492c-a3c8-fa5109d2d9b1,0.079,0.005,0,2650/2650,None,Linux (#66-Ubuntu SMP Fri Aug 30 13:56:20 UTC 2024),3.12.7,1.39.0,0.2.3,7a5b91e62056fa8fb26efdd2f64f5b25a92b817c,,0.12.0,8633c30c31e6a5a1d5122835c035aa56d18f3f0a,
+ThibaultLSDC,GenericAgent-meta-llama_llama-3.1-405b-instruct,workarena_l2_agent_curriculum_eval,0.4.1,2024-11-29_14-28-47,528da1f2-1949-41dc-b988-85f19f435af2,0.072,0.017,2,235/235,None,Linux (#68-Ubuntu SMP Mon Oct 7 14:34:20 UTC 2024),3.12.7,1.39.0,0.3.1,b115b2716d8a6328824684a692ed642297f0b1dc,,0.13.3,70dac253628c476aff1af6a975f27f8563453ad2,
+ThibaultLSDC,GenericAgent-meta-llama_llama-3.1-405b-instruct,miniwob,0.13.3,2024-11-29_16-14-00,4d748972-6d35-4489-a197-138b656a7db3,0.646,0.019,0,625/625,None,Linux (#68-Ubuntu SMP Mon Oct 7 14:34:20 UTC 2024),3.12.7,1.39.0,0.3.1,becb4856fb1612f44010fe74ef8155d367ca17fc,,0.13.3,70dac253628c476aff1af6a975f27f8563453ad2,
+ThibaultLSDC,GenericAgent-gpt-4o,assistantbench,0.13.1,2024-11-28_19-34-58,d93a2398-2b70-41ce-b989-364fed988d73,0.005,0.003,2,213/214,None,Linux (#68-Ubuntu SMP Mon Oct 7 14:34:20 UTC 2024),3.12.7,1.39.0,0.3.0,32865050045c8c71df35c34ff30a6b420a4e258c, M: src/agentlab/experiments/study.py,0.13.1,None,
+ThibaultLSDC,GenericAgent-gpt-4o-mini,assistantbench,0.13.1,2024-11-28_19-34-58,d93a2398-2b70-41ce-b989-364fed988d73,0.002,0.002,1,214/214,None,Linux (#68-Ubuntu SMP Mon Oct 7 14:34:20 UTC 2024),3.12.7,1.39.0,0.3.0,32865050045c8c71df35c34ff30a6b420a4e258c, M: src/agentlab/experiments/study.py,0.13.1,None,
+ThibaultLSDC,GenericAgent-meta-llama_llama-3.1-405b-instruct,assistantbench,0.13.1,2024-11-28_19-34-58,d93a2398-2b70-41ce-b989-364fed988d73,0.008,0.003,1,212/214,None,Linux (#68-Ubuntu SMP Mon Oct 7 14:34:20 UTC 2024),3.12.7,1.39.0,0.3.0,32865050045c8c71df35c34ff30a6b420a4e258c, M: src/agentlab/experiments/study.py,0.13.1,None,
+ThibaultLSDC,GenericAgent-meta-llama_llama-3.1-70b-instruct,assistantbench,0.13.1,2024-11-28_19-34-58,d93a2398-2b70-41ce-b989-364fed988d73,0.007,0.005,8,206/214,None,Linux (#68-Ubuntu SMP Mon Oct 7 14:34:20 UTC 2024),3.12.7,1.39.0,0.3.0,32865050045c8c71df35c34ff30a6b420a4e258c, M: src/agentlab/experiments/study.py,0.13.1,None,
+ThibaultLSDC,GenericAgent-meta-llama_llama-3.1-8b-instruct,assistantbench,0.13.1,2024-11-28_19-34-58,d93a2398-2b70-41ce-b989-364fed988d73,0.001,0.001,15,214/214,None,Linux (#68-Ubuntu SMP Mon Oct 7 14:34:20 UTC 2024),3.12.7,1.39.0,0.3.0,32865050045c8c71df35c34ff30a6b420a4e258c, M: src/agentlab/experiments/study.py,0.13.1,None,
+ThibaultLSDC,GenericAgent-anthropic_claude-3.5-sonnet:beta,assistantbench,0.13.1,2024-11-28_19-34-58,d93a2398-2b70-41ce-b989-364fed988d73,0.007,0.003,1,212/214,None,Linux (#68-Ubuntu SMP Mon Oct 7 14:34:20 UTC 2024),3.12.7,1.39.0,0.3.0,32865050045c8c71df35c34ff30a6b420a4e258c, M: src/agentlab/experiments/study.py,0.13.1,None,
+ThibaultLSDC,GenericAgent-openai_o1-mini-2024-09-12,assistantbench,0.13.1,2024-11-28_19-34-58,d93a2398-2b70-41ce-b989-364fed988d73,0.009,0.005,1,214/214,None,Linux (#68-Ubuntu SMP Mon Oct 7 14:34:20 UTC 2024),3.12.7,1.39.0,0.3.0,32865050045c8c71df35c34ff30a6b420a4e258c, M: src/agentlab/experiments/study.py,0.13.1,None,
+ThibaultLSDC,GenericAgent-gpt-4o-mini,webarena,0.13.3,2024-11-29_19-25-49,c6bdeb87-9879-4c06-aa70-00d895001156,0.174,0.013,1,812/812,None,Linux (#68-Ubuntu SMP Mon Oct 7 14:34:20 UTC 2024),3.12.7,1.39.0,0.3.1,b115b2716d8a6328824684a692ed642297f0b1dc,,0.13.3,None,
+ThibaultLSDC,GenericAgent-gpt-4o,webarena,0.13.3,2024-11-29_22-28-32,d2eed215-91bb-4603-b69c-8ef8f9d57f34,0.314,0.016,3,812/812,None,Linux (#68-Ubuntu SMP Mon Oct 7 14:34:20 UTC 2024),3.12.7,1.39.0,0.3.1,430fe9456ba766398380454a6335f094004607af,,0.13.3,None,
+ThibaultLSDC,GenericAgent-anthropic_claude-3.5-sonnet:beta,webarena,0.13.3,2024-11-29_22-37-46,b5fc5be7-54cc-4fc1-a9ee-73447b9c3eae,0.362,0.017,0,812/812,None,Linux (#68-Ubuntu SMP Mon Oct 7 14:34:20 UTC 2024),3.12.7,1.39.0,0.3.1,7b224971fb7a90fb76924ca9386a1e8bf609dd2a,,0.13.3,None,
+ThibaultLSDC,GenericAgent-openai_o1-mini-2024-09-12,webarena,0.13.3,2024-11-30_00-22-44,1827983d-5e84-4b63-ad49-bf45ec2a6348,0.286,0.016,0,812/812,None,Linux (#68-Ubuntu SMP Mon Oct 7 14:34:20 UTC 2024),3.12.7,1.39.0,0.3.1,3f54ef13b778e69a1706c732f776147e9523ad3d,,0.13.3,None,
+ThibaultLSDC,GenericAgent-meta-llama_llama-3.1-405b-instruct,webarena,0.13.3,2024-12-01_00-04-43,aaeca13d-0cf5-444f-8445-590350b54746,0.24,0.015,9,812/812,None,Linux (#68-Ubuntu SMP Mon Oct 7 14:34:20 UTC 2024),3.12.7,1.39.0,0.3.1,5a5b94d544424517cdd11602b27100b82e35eac0,,0.13.3,None,
+ThibaultLSDC,GenericAgent-gpt-4o-mini_vision,visualwebarena,0.13.3,2024-12-02_02-54-33,8d8642d3-757a-4346-ba45-01398f85b1f4,0.169,0.012,37,909/910,None,Linux (#68-Ubuntu SMP Mon Oct 7 14:34:20 UTC 2024),3.12.7,1.39.0,0.3.1,df7bc706f3793f47a456d1bda0485b306b8cf612,,0.13.3,None,
+ThibaultLSDC,GenericAgent-gpt-4o_vision,visualwebarena,0.13.3,2024-12-02_07-17-28,7fb7eac8-4bbd-4ebe-be32-15901a7678f2,0.267,0.015,65,910/910,None,Linux (#68-Ubuntu SMP Mon Oct 7 14:34:20 UTC 2024),3.12.7,1.39.0,0.3.1,df7bc706f3793f47a456d1bda0485b306b8cf612,,0.13.3,None,
+ThibaultLSDC,GenericAgent-anthropic_claude-3.5-sonnet:beta_vision,visualwebarena,0.13.3,2024-12-02_09-11-35,22f0611d-aeea-4ee9-a533-b45442b5e080,0.21,0.013,178,910/910,None,Linux (#68-Ubuntu SMP Mon Oct 7 14:34:20 UTC 2024),3.12.7,1.39.0,0.3.1,df7bc706f3793f47a456d1bda0485b306b8cf612,,0.13.3,None,
+ThibaultLSDC,GenericAgent-meta-llama_llama-3.1-70b-instruct,webarena,0.13.3,2024-12-02_23-18-38,fc5747bc-d998-4942-a0eb-e55a3ccc1cb3,0.184,0.014,213,811/812,None,Linux (#68-Ubuntu SMP Mon Oct 7 14:34:20 UTC 2024),3.12.7,1.39.0,0.3.1,df7bc706f3793f47a456d1bda0485b306b8cf612,,0.13.3,None,
+
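
Reviewer note: the journal rows above are plain comma-separated records, so they can be inspected with the standard library alone. The sketch below parses one row copied from this diff. The header row is not visible in this hunk, so the column names are assumptions inferred from the values, and `parse_journal_row` is a hypothetical helper rather than part of AgentLab.

```python
import csv
import io

# One row copied verbatim from the rows added in this commit.
ROW = (
    "ThibaultLSDC,GenericAgent-gpt-4o,webarena,0.13.3,2024-11-29_22-28-32,"
    "d2eed215-91bb-4603-b69c-8ef8f9d57f34,0.314,0.016,3,812/812,None,"
    "Linux (#68-Ubuntu SMP Mon Oct 7 14:34:20 UTC 2024),3.12.7,1.39.0,0.3.1,"
    "430fe9456ba766398380454a6335f094004607af,,0.13.3,None,"
)

# ASSUMPTION: the CSV header is not shown in the diff, so these names for the
# first ten columns are guesses based on the values they hold.
ASSUMED_COLUMNS = [
    "git_user",
    "agent_name",
    "benchmark",
    "benchmark_version",
    "date",
    "study_id",
    "avg_reward",
    "std_err",
    "n_err",
    "n_completed",
]


def parse_journal_row(line: str) -> dict:
    """Parse one journal row into a dict keyed by the assumed column names."""
    fields = next(csv.reader(io.StringIO(line)))
    # zip() stops at the shorter sequence, so only the first ten fields are kept.
    record = dict(zip(ASSUMED_COLUMNS, fields))
    completed, total = (int(part) for part in record["n_completed"].split("/"))
    record["avg_reward"] = float(record["avg_reward"])
    record["std_err"] = float(record["std_err"])
    record["n_err"] = int(record["n_err"])
    record["completion_rate"] = completed / total
    return record


record = parse_journal_row(ROW)
print(record["agent_name"], record["benchmark"], record["avg_reward"])
```

Splitting the `n_completed` field on `/` makes partially completed studies (for example `213/214`) easy to spot before comparing scores.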

src/agentlab/__init__.py

Lines changed: 1 addition & 3 deletions
@@ -1,3 +1 @@
-"""DOCSTRING"""
-
-__version__ = "0.3.1"
+__version__ = "0.3.2.dev1"

src/agentlab/agents/generic_agent/generic_agent.py

Lines changed: 1 addition & 1 deletion
@@ -53,7 +53,7 @@ def set_benchmark(self, benchmark: bgym.Benchmark, demo_mode):
 
         # verify if we can remove this
         if demo_mode:
-            self.action_set.demo_mode = "all_blue"
+            self.flags.action.action_set.demo_mode = "all_blue"
 
     def set_reproducibility_mode(self):
         self.chat_model_args.temperature = 0

src/agentlab/analyze/inspect_results.py

Lines changed: 2 additions & 4 deletions
@@ -14,8 +14,6 @@
 from IPython.display import display
 from tqdm import tqdm
 
-from agentlab.experiments.exp_utils import RESULTS_DIR
-
 # TODO find a more portable way to code set_task_category_as_index at least
 # handle dynamic imports. We don't want to always import workarena
 # from browsergym.workarena import TASK_CATEGORY_MAP
@@ -496,8 +494,8 @@ def display_report(
     if rename_bool_flags:
         report = _rename_bool_flags(report)
 
-    if copy_to_clipboard:
-        to_clipboard(report)
+    # if copy_to_clipboard:
+    # to_clipboard(report)
 
     columns = list(report.columns)
 

src/agentlab/experiments/exp_utils.py

Lines changed: 1 addition & 1 deletion
@@ -6,7 +6,7 @@
 from pathlib import Path
 from time import sleep, time
 
-from browsergym.experiments.loop import ExpArgs, _move_old_exp, yield_all_exp_results
+from browsergym.experiments.loop import ExpArgs, yield_all_exp_results
 from tqdm import tqdm
 
 logger = logging.getLogger(__name__)  # Get logger based on module name

src/agentlab/experiments/reproducibility_util.py

Lines changed: 9 additions & 3 deletions
@@ -19,7 +19,9 @@ def _get_repo(module):
     return Repo(Path(module.__file__).resolve().parent, search_parent_directories=True)
 
 
-def _get_benchmark_version(benchmark: bgym.Benchmark) -> str:
+def _get_benchmark_version(
+    benchmark: bgym.Benchmark, allow_bypass_benchmark_version: bool = False
+) -> str:
     benchmark_name = benchmark.name
 
     if hasattr(benchmark, "get_version"):
@@ -42,7 +44,10 @@ def _get_benchmark_version(benchmark: bgym.Benchmark) -> str:
     elif benchmark_name.startswith("assistantbench"):
         return metadata.distribution("browsergym.assistantbench").version
     else:
-        raise ValueError(f"Unknown benchmark {benchmark_name}")
+        if allow_bypass_benchmark_version:
+            return "bypassed"
+        else:
+            raise ValueError(f"Unknown benchmark {benchmark_name}")
 
 
 def _get_git_username(repo: Repo) -> str:
@@ -183,6 +188,7 @@ def get_reproducibility_info(
         "*inspect_results.ipynb",
     ),
     ignore_changes=False,
+    allow_bypass_benchmark_version=False,
 ):
     """
     Retrieve a dict of information that could influence the reproducibility of an experiment.
@@ -205,7 +211,7 @@ def get_reproducibility_info(
         "benchmark": benchmark.name,
         "study_id": study_id,
         "comment": comment,
-        "benchmark_version": _get_benchmark_version(benchmark),
+        "benchmark_version": _get_benchmark_version(benchmark, allow_bypass_benchmark_version),
         "date": datetime.now().strftime("%Y-%m-%d_%H-%M-%S"),
         "os": f"{platform.system()} ({platform.version()})",
         "python_version": platform.python_version(),

src/agentlab/experiments/study.py

Lines changed: 1 addition & 1 deletion
@@ -268,6 +268,7 @@ def set_reproducibility_info(self, strict_reproducibility=False, comment=None):
             self.uuid,
             ignore_changes=not strict_reproducibility,
             comment=comment,
+            allow_bypass_benchmark_version=not strict_reproducibility,
         )
         if self.reproducibility_info is not None:
             repro.assert_compatible(
@@ -405,7 +406,6 @@ def load_most_recent(root_dir: Path = None, contains=None) -> "Study":
 
 def _make_study_name(agent_names, benchmark_names, suffix=None):
     """Make a study name from the agent and benchmark names."""
-
     # extract unique agent and benchmark names
     agent_names = list(set(agent_names))
     benchmark_names = list(set(benchmark_names))
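
Reviewer note: taken together, the changes to `reproducibility_util.py` and `study.py` mean that an unknown benchmark no longer raises unless the study is run with `strict_reproducibility=True`. The standalone sketch below illustrates that behavior under stated assumptions: `KNOWN_DISTRIBUTIONS`, `resolve_benchmark_version`, and `version_for_study` are hypothetical names, and the lookup table is reduced to the single benchmark visible in the hunk. It is not the repository's implementation.

```python
from importlib import metadata

# ASSUMPTION: a reduced, hypothetical prefix-to-distribution table. The real
# function handles more benchmarks than the one visible in the diff.
KNOWN_DISTRIBUTIONS = {"assistantbench": "browsergym.assistantbench"}


def resolve_benchmark_version(benchmark_name: str, allow_bypass: bool = False) -> str:
    """Return the installed version for a known benchmark, or fall back."""
    for prefix, distribution_name in KNOWN_DISTRIBUTIONS.items():
        if benchmark_name.startswith(prefix):
            return metadata.distribution(distribution_name).version
    if allow_bypass:
        # Same placeholder string as the one returned in the diff.
        return "bypassed"
    raise ValueError(f"Unknown benchmark {benchmark_name}")


def version_for_study(benchmark_name: str, strict_reproducibility: bool = False) -> str:
    """Mirror the study-level wiring: the bypass is allowed only when not strict."""
    return resolve_benchmark_version(
        benchmark_name, allow_bypass=not strict_reproducibility
    )


print(version_for_study("my_custom_benchmark"))  # prints "bypassed"
```

One consequence worth keeping in mind: journal rows produced in non-strict mode for custom benchmarks would carry `bypassed` in the benchmark version field, so they cannot be compared by version.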
