* core functions
* switch to dask
* removing joblib dependency and adding dask
* fixing imports
* handles multiple backends
* ensure asyncio loop creation
* more tests
* setting dashboard address to None
* minor
* Finally found a way to make it work
* initial reproducibility files
* Seems to be superfluous
* adding a reproducibility journal
* minor update
* more robust
* adding reproducibility tools
* fix whitelisting
* minor
* minor
* minor
* minor
* minor fix
* more tests
* more results yay
* disabling this test
* update
* update
* black
* maybe fixing GitHub workflow?
* make get_git_username great again
* trigger change
* new browsergym
* GPT-4o result (and new comment column)
* Seems like there was a change to 4o flags, trying these
* minor comment
* better xray
* minor fix
* adding a comment field
* new agent
* another test with GPT-4o
* adding llama3 from openrouter
* fix naming
* unused import
* new summary tools and remove "_args" from columns in results
* add Llama
* initial code for reproducibility agent
* adjust inspect results
* infer from benchmark
* fix reproducibility agent
* prevent repro_dir from being an index variable
* updating repro agent stats
* Reproducibility agent
* instructions to setup workarena
* fixing tests
* better handling of a few edge cases
* default progress function to None
* minor formatting
* minor
* initial commit
* refactoring with Study class
* refactor to adapt for study class
* minor
* fix pricy test
* fixing tests
* tmp
* print report
* minor fix
* refine little details about reproducibility
* minor
* no need for set_temp anymore
* sanity check before running main
* minor update
* minor
* new results with 4o on workarena.l1
* sharing is caring
* add llama to main.py
* new journal entry
* format
---------
Co-authored-by: Thibault Le Sellier de Chezelles <[email protected]>
recursix,GenericAgent-gpt-4o-mini-2024-07-18,workarena.l1,0.3.2,2024-10-05_13-21-27,0.23,0.023,0,330/330,None,Linux (#66-Ubuntu SMP Fri Aug 30 13:56:20 UTC 2024),3.12.6,1.39.0,0.2.1,aadf86b397cd36c581e1a61e491aec649ac5a140, M: main.py,0.7.0,2a0ab7e8e8795f8ca35fe4d4d67c6892d635dc12,
+recursix,GenericAgent-gpt-4o-2024-05-13,workarena.l1,0.3.2,2024-10-05_15-45-42,0.382,0.027,0,330/330,None,Linux (#66-Ubuntu SMP Fri Aug 30 13:56:20 UTC 2024),3.12.6,1.39.0,0.2.1,ab447e997af589bbd022de7a5189a7685ddfa6ef,,0.7.0,2a0ab7e8e8795f8ca35fe4d4d67c6892d635dc12,
+recursix,GenericAgent-meta-llama_llama-3.1-70b-instruct,miniwob_tiny_test,0.7.0,2024-10-05_17-49-15,1.0,0.0,0,4/4,None,Linux (#66-Ubuntu SMP Fri Aug 30 13:56:20 UTC 2024),3.12.6,1.39.0,0.2.1,a98fa24426a6ddde8443e8be44ed94cd9522e5ca,,0.7.0,2a0ab7e8e8795f8ca35fe4d4d67c6892d635dc12,