ServiceNow
diff --git a/.github/workflows/pypi.yml b/.github/workflows/pypi.yml
Lines changed: 10 additions & 0 deletions
diff --git a/README.md b/README.md
Lines changed: 45 additions & 22 deletions
diff --git a/reproducibility_journal.csv b/reproducibility_journal.csv
Lines changed: 19 additions & 0 deletions
diff --git a/src/agentlab/__init__.py b/src/agentlab/__init__.py
Lines changed: 1 addition & 3 deletions
diff --git a/src/agentlab/agents/generic_agent/generic_agent.py b/src/agentlab/agents/generic_agent/generic_agent.py
Lines changed: 1 addition & 1 deletion
diff --git a/src/agentlab/analyze/inspect_results.py b/src/agentlab/analyze/inspect_results.py
Lines changed: 2 additions & 4 deletions
diff --git a/src/agentlab/experiments/exp_utils.py b/src/agentlab/experiments/exp_utils.py
Lines changed: 1 addition & 1 deletion
diff --git a/src/agentlab/experiments/reproducibility_util.py b/src/agentlab/experiments/reproducibility_util.py
Lines changed: 9 additions & 3 deletions
diff --git a/src/agentlab/experiments/study.py b/src/agentlab/experiments/study.py
Lines changed: 1 addition & 1 deletion
@@ -66,6 +66,16 @@ jobs:
           name: python-package-distributions
           path: dist/
 
+      - name: Set up Python for Sigstore
+        uses: actions/setup-python@v5
+        with:
+          python-version: "3.x"
+
+      - name: Install Sigstore and cryptography dependencies
+        run: |
+          python3 -m pip install --upgrade pip
+          python3 -m pip install cryptography==43.0.3
+
       - name: Sign the dists with Sigstore
         uses: sigstore/[email protected]
         with:
 
@@ -1,8 +1,6 @@
 
 <div align="center">
 
-![AgentLab Banner](https://github.com/user-attachments/assets/a23b3cd8-b5c4-4918-817b-654ae6468cb4)
-
 
 
 [![pypi](https://badge.fury.io/py/agentlab.svg)](https://pypi.org/project/agentlab/)
@@ -17,25 +15,37 @@
 [🛠️ Setup](#%EF%B8%8F-setup-agentlab) &nbsp;|&nbsp; 
 [🤖 Assistant](#-ui-assistant) &nbsp;|&nbsp; 
 [🚀 Launch Experiments](#-launch-experiments) &nbsp;|&nbsp;
-[🔍 Analyse Results](#-analyse-results) &nbsp;|&nbsp; 
+[🔍 Analyse Results](#-analyse-results) &nbsp;|&nbsp;
+<br>
+[🏆 Leaderboard](#-leaderboard) &nbsp;|&nbsp; 
 [🤖 Build Your Agent](#-implement-a-new-agent) &nbsp;|&nbsp;
-[↻ Reproducibility](#-reproducibility) 
+[↻ Reproducibility](#-reproducibility) &nbsp;|&nbsp;
+[💪 BrowserGym](https://github.com/ServiceNow/BrowserGym)
+
+
+<img src="https://github.com/user-attachments/assets/47a7c425-9763-46e5-be54-adac363be850" alt="agentlab-diagram" width="700"/>
+
+
+[Demo solving tasks:](https://github.com/ServiceNow/BrowserGym/assets/26232819/e0bfc788-cc8e-44f1-b8c3-0d1114108b85)
 
-https://github.com/ServiceNow/BrowserGym/assets/26232819/e0bfc788-cc8e-44f1-b8c3-0d1114108b85
 
 </div>
 
+> [!WARNING]
+> AgentLab is meant to provide an open, easy-to-use and extensible framework to accelerate the field of web agent research.
+> It is not meant to be a consumer product. Use with caution!
+
 AgentLab is a framework for developing and evaluating agents on a variety of
 [benchmarks](#-supported-benchmarks) supported by
 [BrowserGym](https://github.com/ServiceNow/BrowserGym).
 
 AgentLab Features:
 * Easy large scale parallel [agent experiments](#-launch-experiments) using [ray](https://www.ray.io/)
 * Building blocks for making agents over BrowserGym
-* Unified LLM API for OpenRouter, OpenAI, Azure, or self hosted using TGI.
-* Prefered way for running benchmarks like WebArena
+* Unified LLM API for OpenRouter, OpenAI, Azure, or self-hosted using TGI.
+* Preferred way for running benchmarks like WebArena
 * Various [reproducibility features](#reproducibility-features)
-* Unified LeaderBoard (soon)
+* Unified [LeaderBoard](https://huggingface.co/spaces/ServiceNow/browsergym-leaderboard)
 
 ## 🎯 Supported Benchmarks
 
@@ -59,12 +69,12 @@ AgentLab Features:
 pip install agentlab
 ```
 
-If not done already, install playwright:
+If not done already, install Playwright:
 ```bash
 playwright install
 ```
 
-Make sure to prepare the required benchmark according to instructions provided in the [setup
+Make sure to prepare the required benchmark according to the instructions provided in the [setup
 column](#-supported-benchmarks).
 
 ```bash
@@ -174,11 +184,18 @@ experience, consider using benchmarks like WorkArena instead.
 
 ### Loading Results
 
-The class [`ExpResult`](https://github.com/ServiceNow/BrowserGym/blob/da26a5849d99d9a3169d7b1fde79f909c55c9ba7/browsergym/experiments/src/browsergym/experiments/loop.py#L595) provides a lazy loader for all the information of a specific experiment. You can use [`yield_all_exp_results`](https://github.com/ServiceNow/BrowserGym/blob/da26a5849d99d9a3169d7b1fde79f909c55c9ba7/browsergym/experiments/src/browsergym/experiments/loop.py#L872) to recursivley find all results in a directory. Finally [`load_result_df`](https://github.com/ServiceNow/AgentLab/blob/be1998c5fad5bda47ba50497ec3899aae03e85ec/src/agentlab/analyze/inspect_results.py#L119C5-L119C19) gathers all the summary information in a single dataframe. See [`inspect_results.ipynb`](src/agentlab/analyze/inspect_results.ipynb) for example usage.
+The class [`ExpResult`](https://github.com/ServiceNow/BrowserGym/blob/da26a5849d99d9a3169d7b1fde79f909c55c9ba7/browsergym/experiments/src/browsergym/experiments/loop.py#L595) provides a lazy loader for all the information of a specific experiment. You can use [`yield_all_exp_results`](https://github.com/ServiceNow/BrowserGym/blob/da26a5849d99d9a3169d7b1fde79f909c55c9ba7/browsergym/experiments/src/browsergym/experiments/loop.py#L872) to recursively find all results in a directory. Finally [`load_result_df`](https://github.com/ServiceNow/AgentLab/blob/be1998c5fad5bda47ba50497ec3899aae03e85ec/src/agentlab/analyze/inspect_results.py#L119C5-L119C19) gathers all the summary information in a single dataframe. See [`inspect_results.ipynb`](src/agentlab/analyze/inspect_results.ipynb) for example usage.
 
 ```python
 from agentlab.analyze import inspect_results
+
+# load the summary of all experiments of the study in a dataframe
 result_df = inspect_results.load_result_df("path/to/your/study")
+
+# load the detailed results of the 1st experiment
+exp_result = bgym.ExpResult(result_df["exp_dir"][0])
+step_0_screenshot = exp_result.screenshots[0]
+step_0_action = exp_result.steps_info[0].action
 ```
 
 
@@ -204,8 +221,14 @@ Once this is selected, you can see the trace of your agent on the given task. Cl
 image to select a step and observe the action taken by the agent.
 
 
-**⚠️ Note**: Gradio is still in developement and unexpected behavior have been frequently noticed. Version 5.5 seems to work properly so far. If you're not sure that the proper information is displaying, refresh the page and select your experiment again.
+**⚠️ Note**: Gradio is still developing, and unexpected behavior has been frequently noticed. Version 5.5 seems to work properly so far. If you're not sure that the proper information is displaying, refresh the page and select your experiment again.
+
+
+## 🏆 Leaderboard
+
+Official unified [leaderboard](https://huggingface.co/spaces/ServiceNow/browsergym-leaderboard) across all benchmarks. 
 
+Experiments are on their way for more reference points using GenericAgent. We are also working on code to automatically push a study to the leaderboard.
 
 ## 🤖 Implement a new Agent
 
@@ -222,32 +245,32 @@ Several factors can influence reproducibility of results in the context of evalu
 dynamic benchmarks.
 
 ### Factors affecting reproducibility
-* **Software version**: Different version of Playwright or any package in the software stack could
+* **Software version**: Different versions of Playwright or any package in the software stack could
   influence the behavior of the benchmark or the agent.
-* **API based LLMs silently changing**: Even for a fixed version, an LLM may be updated e.g. to
-  incorporate latest web knowledge.
+* **API-based LLMs silently changing**: Even for a fixed version, an LLM may be updated e.g. to
+  incorporate the latest web knowledge.
 * **Live websites**:
   * WorkArena: The demo instance is mostly fixed in time to a specific version but ServiceNow
-    sometime push minor modifications.
+    sometimes pushes minor modifications.
   * AssistantBench and GAIA: These rely on the agent navigating the open web. The experience may
     change depending on which country or region, some websites might be in different languages by
     default.
-* **Stochastic Agents**: Setting temperature of the LLM to 0 can reduce most stochasticity.
-* **Non deterministic tasks**: For a fixed seed, the changes should be minimal
+* **Stochastic Agents**: Setting the temperature of the LLM to 0 can reduce most stochasticity.
+* **Non-deterministic tasks**: For a fixed seed, the changes should be minimal
 
 ### Reproducibility Features
 * `Study` contains a dict of information about reproducibility, including benchmark version, package
   version and commit hash
 * The `Study` class allows automatic upload of your results to
   [`reproducibility_journal.csv`](reproducibility_journal.csv). This makes it easier to populate a
-  large amount of reference points. 
-* **Reproduced results in the leaderboard**. For agents that are repdocudibile, we encourage users
+  large amount of reference points. For this feature, you need to `git clone` the repository and install via `pip install -e .`.
+* **Reproduced results in the leaderboard**. For agents that are reproducible, we encourage users
   to try to reproduce the results and upload them to the leaderboard. There is a special column
   containing information about all reproduced results of an agent on a benchmark.
 * **ReproducibilityAgent**: [You can run this agent](src/agentlab/agents/generic_agent/reproducibility_agent.py) on an existing study and it will try to re-run
-  the same actions on the same task seeds. A vsiual diff of the two prompts will be displayed in the
+  the same actions on the same task seeds. A visual diff of the two prompts will be displayed in the
   AgentInfo HTML tab of AgentXray. You will be able to inspect on some tasks what kind of changes
-  between to two executions. **Note**: this is a beta feature and will need some adaptation for your
+  between the two executions. **Note**: this is a beta feature and will need some adaptation for your
   own agent.
 
 
 
@@ -46,3 +46,22 @@ ThibaultLSDC,GenericAgent-anthropic_claude-3.5-sonnet:beta,weblinx_test,0.0.1.de
 ThibaultLSDC,GenericAgent-meta-llama_llama-3.1-70b-instruct,weblinx_test,0.0.1.dev13,2024-11-07_21-42-30,b9451759-4f0e-492c-a3c8-fa5109d2d9b1,0.089,0.005,0,2650/2650,None,Linux (#66-Ubuntu SMP Fri Aug 30 13:56:20 UTC 2024),3.12.7,1.39.0,0.2.3,7a5b91e62056fa8fb26efdd2f64f5b25a92b817c,,0.12.0,8633c30c31e6a5a1d5122835c035aa56d18f3f0a,
 ThibaultLSDC,GenericAgent-openai_o1-mini-2024-09-12,weblinx_test,0.0.1.dev13,2024-11-07_21-42-30,b9451759-4f0e-492c-a3c8-fa5109d2d9b1,0.125,0.006,0,2650/2650,None,Linux (#66-Ubuntu SMP Fri Aug 30 13:56:20 UTC 2024),3.12.7,1.39.0,0.2.3,7a5b91e62056fa8fb26efdd2f64f5b25a92b817c,,0.12.0,8633c30c31e6a5a1d5122835c035aa56d18f3f0a,
 ThibaultLSDC,GenericAgent-meta-llama_llama-3.1-405b-instruct,weblinx_test,0.0.1.dev13,2024-11-07_21-42-30,b9451759-4f0e-492c-a3c8-fa5109d2d9b1,0.079,0.005,0,2650/2650,None,Linux (#66-Ubuntu SMP Fri Aug 30 13:56:20 UTC 2024),3.12.7,1.39.0,0.2.3,7a5b91e62056fa8fb26efdd2f64f5b25a92b817c,,0.12.0,8633c30c31e6a5a1d5122835c035aa56d18f3f0a,
+ThibaultLSDC,GenericAgent-meta-llama_llama-3.1-405b-instruct,workarena_l2_agent_curriculum_eval,0.4.1,2024-11-29_14-28-47,528da1f2-1949-41dc-b988-85f19f435af2,0.072,0.017,2,235/235,None,Linux (#68-Ubuntu SMP Mon Oct 7 14:34:20 UTC 2024),3.12.7,1.39.0,0.3.1,b115b2716d8a6328824684a692ed642297f0b1dc,,0.13.3,70dac253628c476aff1af6a975f27f8563453ad2,
+ThibaultLSDC,GenericAgent-meta-llama_llama-3.1-405b-instruct,miniwob,0.13.3,2024-11-29_16-14-00,4d748972-6d35-4489-a197-138b656a7db3,0.646,0.019,0,625/625,None,Linux (#68-Ubuntu SMP Mon Oct 7 14:34:20 UTC 2024),3.12.7,1.39.0,0.3.1,becb4856fb1612f44010fe74ef8155d367ca17fc,,0.13.3,70dac253628c476aff1af6a975f27f8563453ad2,
+ThibaultLSDC,GenericAgent-gpt-4o,assistantbench,0.13.1,2024-11-28_19-34-58,d93a2398-2b70-41ce-b989-364fed988d73,0.005,0.003,2,213/214,None,Linux (#68-Ubuntu SMP Mon Oct 7 14:34:20 UTC 2024),3.12.7,1.39.0,0.3.0,32865050045c8c71df35c34ff30a6b420a4e258c,  M: src/agentlab/experiments/study.py,0.13.1,None,
+ThibaultLSDC,GenericAgent-gpt-4o-mini,assistantbench,0.13.1,2024-11-28_19-34-58,d93a2398-2b70-41ce-b989-364fed988d73,0.002,0.002,1,214/214,None,Linux (#68-Ubuntu SMP Mon Oct 7 14:34:20 UTC 2024),3.12.7,1.39.0,0.3.0,32865050045c8c71df35c34ff30a6b420a4e258c,  M: src/agentlab/experiments/study.py,0.13.1,None,
+ThibaultLSDC,GenericAgent-meta-llama_llama-3.1-405b-instruct,assistantbench,0.13.1,2024-11-28_19-34-58,d93a2398-2b70-41ce-b989-364fed988d73,0.008,0.003,1,212/214,None,Linux (#68-Ubuntu SMP Mon Oct 7 14:34:20 UTC 2024),3.12.7,1.39.0,0.3.0,32865050045c8c71df35c34ff30a6b420a4e258c,  M: src/agentlab/experiments/study.py,0.13.1,None,
+ThibaultLSDC,GenericAgent-meta-llama_llama-3.1-70b-instruct,assistantbench,0.13.1,2024-11-28_19-34-58,d93a2398-2b70-41ce-b989-364fed988d73,0.007,0.005,8,206/214,None,Linux (#68-Ubuntu SMP Mon Oct 7 14:34:20 UTC 2024),3.12.7,1.39.0,0.3.0,32865050045c8c71df35c34ff30a6b420a4e258c,  M: src/agentlab/experiments/study.py,0.13.1,None,
+ThibaultLSDC,GenericAgent-meta-llama_llama-3.1-8b-instruct,assistantbench,0.13.1,2024-11-28_19-34-58,d93a2398-2b70-41ce-b989-364fed988d73,0.001,0.001,15,214/214,None,Linux (#68-Ubuntu SMP Mon Oct 7 14:34:20 UTC 2024),3.12.7,1.39.0,0.3.0,32865050045c8c71df35c34ff30a6b420a4e258c,  M: src/agentlab/experiments/study.py,0.13.1,None,
+ThibaultLSDC,GenericAgent-anthropic_claude-3.5-sonnet:beta,assistantbench,0.13.1,2024-11-28_19-34-58,d93a2398-2b70-41ce-b989-364fed988d73,0.007,0.003,1,212/214,None,Linux (#68-Ubuntu SMP Mon Oct 7 14:34:20 UTC 2024),3.12.7,1.39.0,0.3.0,32865050045c8c71df35c34ff30a6b420a4e258c,  M: src/agentlab/experiments/study.py,0.13.1,None,
+ThibaultLSDC,GenericAgent-openai_o1-mini-2024-09-12,assistantbench,0.13.1,2024-11-28_19-34-58,d93a2398-2b70-41ce-b989-364fed988d73,0.009,0.005,1,214/214,None,Linux (#68-Ubuntu SMP Mon Oct 7 14:34:20 UTC 2024),3.12.7,1.39.0,0.3.0,32865050045c8c71df35c34ff30a6b420a4e258c,  M: src/agentlab/experiments/study.py,0.13.1,None,
+ThibaultLSDC,GenericAgent-gpt-4o-mini,webarena,0.13.3,2024-11-29_19-25-49,c6bdeb87-9879-4c06-aa70-00d895001156,0.174,0.013,1,812/812,None,Linux (#68-Ubuntu SMP Mon Oct 7 14:34:20 UTC 2024),3.12.7,1.39.0,0.3.1,b115b2716d8a6328824684a692ed642297f0b1dc,,0.13.3,None,
+ThibaultLSDC,GenericAgent-gpt-4o,webarena,0.13.3,2024-11-29_22-28-32,d2eed215-91bb-4603-b69c-8ef8f9d57f34,0.314,0.016,3,812/812,None,Linux (#68-Ubuntu SMP Mon Oct 7 14:34:20 UTC 2024),3.12.7,1.39.0,0.3.1,430fe9456ba766398380454a6335f094004607af,,0.13.3,None,
+ThibaultLSDC,GenericAgent-anthropic_claude-3.5-sonnet:beta,webarena,0.13.3,2024-11-29_22-37-46,b5fc5be7-54cc-4fc1-a9ee-73447b9c3eae,0.362,0.017,0,812/812,None,Linux (#68-Ubuntu SMP Mon Oct 7 14:34:20 UTC 2024),3.12.7,1.39.0,0.3.1,7b224971fb7a90fb76924ca9386a1e8bf609dd2a,,0.13.3,None,
+ThibaultLSDC,GenericAgent-openai_o1-mini-2024-09-12,webarena,0.13.3,2024-11-30_00-22-44,1827983d-5e84-4b63-ad49-bf45ec2a6348,0.286,0.016,0,812/812,None,Linux (#68-Ubuntu SMP Mon Oct 7 14:34:20 UTC 2024),3.12.7,1.39.0,0.3.1,3f54ef13b778e69a1706c732f776147e9523ad3d,,0.13.3,None,
+ThibaultLSDC,GenericAgent-meta-llama_llama-3.1-405b-instruct,webarena,0.13.3,2024-12-01_00-04-43,aaeca13d-0cf5-444f-8445-590350b54746,0.24,0.015,9,812/812,None,Linux (#68-Ubuntu SMP Mon Oct 7 14:34:20 UTC 2024),3.12.7,1.39.0,0.3.1,5a5b94d544424517cdd11602b27100b82e35eac0,,0.13.3,None,
+ThibaultLSDC,GenericAgent-gpt-4o-mini_vision,visualwebarena,0.13.3,2024-12-02_02-54-33,8d8642d3-757a-4346-ba45-01398f85b1f4,0.169,0.012,37,909/910,None,Linux (#68-Ubuntu SMP Mon Oct 7 14:34:20 UTC 2024),3.12.7,1.39.0,0.3.1,df7bc706f3793f47a456d1bda0485b306b8cf612,,0.13.3,None,
+ThibaultLSDC,GenericAgent-gpt-4o_vision,visualwebarena,0.13.3,2024-12-02_07-17-28,7fb7eac8-4bbd-4ebe-be32-15901a7678f2,0.267,0.015,65,910/910,None,Linux (#68-Ubuntu SMP Mon Oct 7 14:34:20 UTC 2024),3.12.7,1.39.0,0.3.1,df7bc706f3793f47a456d1bda0485b306b8cf612,,0.13.3,None,
+ThibaultLSDC,GenericAgent-anthropic_claude-3.5-sonnet:beta_vision,visualwebarena,0.13.3,2024-12-02_09-11-35,22f0611d-aeea-4ee9-a533-b45442b5e080,0.21,0.013,178,910/910,None,Linux (#68-Ubuntu SMP Mon Oct 7 14:34:20 UTC 2024),3.12.7,1.39.0,0.3.1,df7bc706f3793f47a456d1bda0485b306b8cf612,,0.13.3,None,
+ThibaultLSDC,GenericAgent-meta-llama_llama-3.1-70b-instruct,webarena,0.13.3,2024-12-02_23-18-38,fc5747bc-d998-4942-a0eb-e55a3ccc1cb3,0.184,0.014,213,811/812,None,Linux (#68-Ubuntu SMP Mon Oct 7 14:34:20 UTC 2024),3.12.7,1.39.0,0.3.1,df7bc706f3793f47a456d1bda0485b306b8cf612,,0.13.3,None,
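
For readers who want to inspect the new journal entries, here is a minimal sketch that parses two of the rows added above with Python's standard `csv` module. The header row is not part of this hunk, so the column positions and their meanings (agent name, benchmark, score, standard error, error count, completed/total episodes) are assumptions inferred from the values, and `summarize` is a hypothetical helper, not part of AgentLab.

```python
import csv
import io

# Two rows copied verbatim from the journal entries added in this diff.
ROWS = """\
ThibaultLSDC,GenericAgent-gpt-4o-mini,webarena,0.13.3,2024-11-29_19-25-49,c6bdeb87-9879-4c06-aa70-00d895001156,0.174,0.013,1,812/812,None,Linux (#68-Ubuntu SMP Mon Oct 7 14:34:20 UTC 2024),3.12.7,1.39.0,0.3.1,b115b2716d8a6328824684a692ed642297f0b1dc,,0.13.3,None,
ThibaultLSDC,GenericAgent-gpt-4o,webarena,0.13.3,2024-11-29_22-28-32,d2eed215-91bb-4603-b69c-8ef8f9d57f34,0.314,0.016,3,812/812,None,Linux (#68-Ubuntu SMP Mon Oct 7 14:34:20 UTC 2024),3.12.7,1.39.0,0.3.1,430fe9456ba766398380454a6335f094004607af,,0.13.3,None,
"""

# ASSUMED column positions (0-based); the real header is not shown in this hunk.
AGENT, BENCHMARK, SCORE, STD_ERR, N_ERR, COMPLETED = 1, 2, 6, 7, 8, 9


def summarize(text):
    """Parse journal rows and return them sorted by score, highest first."""
    summary = []
    for row in csv.reader(io.StringIO(text)):
        if not row:
            continue  # skip blank lines
        # The completion field looks like "812/812": completed/total.
        completed, total = (int(part) for part in row[COMPLETED].split("/"))
        summary.append(
            {
                "agent": row[AGENT],
                "benchmark": row[BENCHMARK],
                "score": float(row[SCORE]),
                "std_err": float(row[STD_ERR]),
                "n_err": int(row[N_ERR]),
                "completed": completed,
                "total": total,
            }
        )
    return sorted(summary, key=lambda entry: entry["score"], reverse=True)


for entry in summarize(ROWS):
    print(entry["agent"], entry["benchmark"], entry["score"], entry["std_err"])
```

Indexing by position keeps the sketch independent of the exact header names; if the header is available, `csv.DictReader` would likely be the more readable choice.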
+
@@ -1,3 +1 @@
-"""DOCSTRING"""
-
-__version__ = "0.3.1"
+__version__ = "0.3.2.dev1"
@@ -53,7 +53,7 @@ def set_benchmark(self, benchmark: bgym.Benchmark, demo_mode):
 
         # verify if we can remove this
         if demo_mode:
-            self.action_set.demo_mode = "all_blue"
+            self.flags.action.action_set.demo_mode = "all_blue"
 
     def set_reproducibility_mode(self):
         self.chat_model_args.temperature = 0
 
@@ -14,8 +14,6 @@
 from IPython.display import display
 from tqdm import tqdm
 
-from agentlab.experiments.exp_utils import RESULTS_DIR
-
 # TODO find a more portable way to code set_task_category_as_index at least
 # handle dynamic imports. We don't want to always import workarena
 # from browsergym.workarena import TASK_CATEGORY_MAP
@@ -496,8 +494,8 @@ def display_report(
     if rename_bool_flags:
         report = _rename_bool_flags(report)
 
-    if copy_to_clipboard:
-        to_clipboard(report)
+    # if copy_to_clipboard:
+    #     to_clipboard(report)
 
     columns = list(report.columns)
 
 
@@ -6,7 +6,7 @@
 from pathlib import Path
 from time import sleep, time
 
-from browsergym.experiments.loop import ExpArgs, _move_old_exp, yield_all_exp_results
+from browsergym.experiments.loop import ExpArgs, yield_all_exp_results
 from tqdm import tqdm
 
 logger = logging.getLogger(__name__)  # Get logger based on module name
 
@@ -19,7 +19,9 @@ def _get_repo(module):
     return Repo(Path(module.__file__).resolve().parent, search_parent_directories=True)
 
 
-def _get_benchmark_version(benchmark: bgym.Benchmark) -> str:
+def _get_benchmark_version(
+    benchmark: bgym.Benchmark, allow_bypass_benchmark_version: bool = False
+) -> str:
     benchmark_name = benchmark.name
 
     if hasattr(benchmark, "get_version"):
@@ -42,7 +44,10 @@ def _get_benchmark_version(benchmark: bgym.Benchmark) -> str:
     elif benchmark_name.startswith("assistantbench"):
         return metadata.distribution("browsergym.assistantbench").version
     else:
-        raise ValueError(f"Unknown benchmark {benchmark_name}")
+        if allow_bypass_benchmark_version:
+            return "bypassed"
+        else:
+            raise ValueError(f"Unknown benchmark {benchmark_name}")
 
 
 def _get_git_username(repo: Repo) -> str:
@@ -183,6 +188,7 @@ def get_reproducibility_info(
         "*inspect_results.ipynb",
     ),
     ignore_changes=False,
+    allow_bypass_benchmark_version=False,
 ):
     """
     Retrieve a dict of information that could influence the reproducibility of an experiment.
@@ -205,7 +211,7 @@ def get_reproducibility_info(
         "benchmark": benchmark.name,
         "study_id": study_id,
         "comment": comment,
-        "benchmark_version": _get_benchmark_version(benchmark),
+        "benchmark_version": _get_benchmark_version(benchmark, allow_bypass_benchmark_version),
         "date": datetime.now().strftime("%Y-%m-%d_%H-%M-%S"),
         "os": f"{platform.system()} ({platform.version()})",
         "python_version": platform.python_version(),
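
The fallback introduced in `_get_benchmark_version` can be illustrated in isolation. The sketch below is a simplified stand-in, not the actual implementation: `FakeBenchmark`, `KNOWN_VERSIONS`, and the version strings in it are hypothetical placeholders, whereas the real function reads versions from the benchmark object or from installed package metadata.

```python
from dataclasses import dataclass


@dataclass
class FakeBenchmark:
    """Hypothetical stand-in for bgym.Benchmark; only `name` is used here."""

    name: str


# Hypothetical lookup table replacing the per-benchmark branches of the
# real function. The version strings are placeholders.
KNOWN_VERSIONS = {"miniwob": "0.13.3", "assistantbench": "0.13.1"}


def get_benchmark_version(benchmark, allow_bypass_benchmark_version=False):
    """Return a version string, or "bypassed" for unknown benchmarks if allowed."""
    for prefix, version in KNOWN_VERSIONS.items():
        if benchmark.name.startswith(prefix):
            return version
    if allow_bypass_benchmark_version:
        # Unknown benchmark: record a sentinel instead of failing.
        return "bypassed"
    raise ValueError(f"Unknown benchmark {benchmark.name}")


custom = FakeBenchmark(name="my_custom_benchmark")

# Mirrors the call site in study.py: the bypass is allowed only when
# strict reproducibility is off.
strict_reproducibility = False
print(get_benchmark_version(custom, not strict_reproducibility))
```

With `strict_reproducibility=True`, the same call would raise `ValueError`, which matches the behavior before this change.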
 
@@ -268,6 +268,7 @@ def set_reproducibility_info(self, strict_reproducibility=False, comment=None):
             self.uuid,
             ignore_changes=not strict_reproducibility,
             comment=comment,
+            allow_bypass_benchmark_version=not strict_reproducibility,
         )
         if self.reproducibility_info is not None:
             repro.assert_compatible(
@@ -405,7 +406,6 @@ def load_most_recent(root_dir: Path = None, contains=None) -> "Study":
 
 def _make_study_name(agent_names, benchmark_names, suffix=None):
     """Make a study name from the agent and benchmark names."""
-
     # extract unique agent and benchmark names
     agent_names = list(set(agent_names))
     benchmark_names = list(set(benchmark_names))