Update README and experiment script for clarity and consistency

recursix · recursix · commit 3e4b72eecd7f · 2025-08-13T10:47:46.000-04:00
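
For readers unfamiliar with the filter touched in the README hunk, here is a minimal sketch of how a `*enter*` glob narrows a task list and how the stated run count follows from it. It uses Python's standard `fnmatch` module rather than AgentLab's `subset_from_glob`, whose implementation is not shown in this diff. The task names and the helpers `subset_by_glob` and `count_runs` are hypothetical, chosen only for illustration.

```python
from fnmatch import fnmatch

# Hypothetical task names for illustration only; the real benchmark has 125 tasks.
TASK_NAMES = [
    "miniwob.click-button",
    "miniwob.enter-text",
    "miniwob.enter-date",
    "miniwob.use-slider",
]


def subset_by_glob(names, pattern):
    """Return the names that match a shell-style glob pattern (hypothetical helper)."""
    return [name for name in names if fnmatch(name, pattern)]


def count_runs(n_tasks, n_seeds):
    """Each task is run once per seed (hypothetical helper)."""
    return n_tasks * n_seeds


selected = subset_by_glob(TASK_NAMES, "*enter*")
print(selected)          # the two hypothetical "enter" tasks
print(count_runs(7, 5))  # 7 tasks x 5 seeds, the figures stated in the README
```

The glob is applied to the whole task name, so `*enter*` keeps any name containing `enter` regardless of its prefix.
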
diff --git a/tutorials/2_eval_on_miniwob/experiment.py b/tutorials/2_eval_on_miniwob/experiment.py
@@ -11,8 +11,8 @@
 from agentlab.benchmarks.setup_benchmark import ensure_benchmark
 from agentlab.experiments.study import Study
 
-project_dir = Path(__file__).parents[2]
 # This ensures MiniWob assets are downloaded and sets the MINIWOB_URL .env in the project dir.
+project_dir = Path(__file__).parents[2]
 ensure_benchmark("miniwob", project_root=project_dir)
 load_dotenv(project_dir.joinpath(".env"), override=False)  # load .env variables
 
diff --git a/tutorials/2_eval_on_miniwob/readme.md b/tutorials/2_eval_on_miniwob/readme.md
@@ -1,6 +1,6 @@
 ### Launch tutorial experiment
 * Activate the python env `source .venv/bin/activate`
-* Run the following command from **AgentLab** directory `python tutorials/2_eval_on_miniwob/experiment.py`.
+* Run the following command from the **AgentLab** directory: [`python tutorials/2_eval_on_miniwob/experiment.py`](./experiment.py).
 * This should launch your agent on 4 miniwob tasks in parallel and save results to `$HOME/agentlab_results`.
 
 
@@ -10,17 +10,20 @@
 * Navigate around XRay to visualize the agent's behavior
 
 ### (optional) Load your results in a notebook
-Optionally load the results in `inspect_results.ipynb`.
+Optionally load the results in [inspect_results.ipynb](./inspect_results.ipynb).
 This will help understand the content of results
 
 ### Change the experiment
-go back to `experiment.py` and uncomment the line 
+Go back to [experiment.py](./experiment.py) and uncomment the lines:
 ```python
 benchmark = DEFAULT_BENCHMARKS["miniwob"]()  # 125 tasks
-benchmark = benchmark.subset_from_glob(column="task_name", glob="*enter*")
+benchmark = benchmark.subset_from_glob(column="task_name", glob="*enter*")  # keep only the 7 matching tasks
 ```
 This will change the benchmark from `"miniwob_tiny_test"` to all the `"*enter*"` tasks in the full miniwob.
 
-Run the experiment again. This will launch on 7 tasks with 5 seeds each.
+Run the experiment again. This will launch on 7 tasks with 5 seeds each (35 experiments).
 
-Finally refresh XRay to be able to load the new experiment.
+Finally refresh XRay to be able to load the new experiment.
+
+### Exercise 1
+Which tasks did the agent fail, and why?