Commit 7a5b91e
Fixing openrouter pricing rate limit (#112)
* Update unit_tests.yml (#101)
* request is done once and then reused
* Patching minor stuff (#69)
* fixing sample_std for single experience
* making gradio shared server non default
* missing requirement for xray
* Improve agent xray app (#70)
* 0.2.2 Release (#67)
* downgrading ubuntu version for github tests (#62)
* Llm api update (#59)
* getting rid of .invoke()
* adding an AbstractChatModel
* changing chat_api structure
* Reproducibility again (#61)
* core functions
* switch to dask
* removing joblib dependency and adding dask
* fixing imports
* handles multiple backends
* ensure asyncio loop creation
* more tests
* setting dashboard address to None
* minor
* Finally found a way to make it work
* initial reproducibility files
* Seems to be superfluous
* adding a reproducibility journal
* minor update
* more robust
* adding reproducibility tools
* fix white listing
* minor
* minor
* minor
* minor
* minor fix
* more tests
* more results yay
* disabling this test
* update
* update
* black
* maybe fixing github workflow ?
* make get_git_username great again
* trigger change
* new browsergym
* GPT-4o result (and new comment column)
* Seems like there was a change to 4o flags, trying these
* minor comment
* better xray
* minor fix
* adding a comment field
* new agent
* another test with GPT-4o
* adding llama3 from openrouter
* fix naming
* unused import
* new summary tools and remove "_args" from columns in results
* add Llama
* initial code for reproducibility agent
* adjust inspect results
* infer from benchmark
* fix reproducibility agent
* prevent the repro_dir to be an index variable
* updating repro agent stats
* Reproducibility agent
* instructions to setup workarena
* fixing tests
* handles better a few edge cases
* default progress function to None
* minor formatting
* minor
* initial commit
* refactoring with Study class
* refactor to adapt for study class
* minor
* fix pricy test
* fixing tests
* tmp
* print report
* minor fix
* refine little details about reproducibility
* minor
* no need for set_temp anymore
* sanity check before running main
* minor update
* minor
* new results with 4o on workarena.l1
* sharing is caring
* add llama to main.py
* new journal entry
* llama 3 70B
* minor
* typo
* black fix (wasn't configured)
---------
Co-authored-by: Thibault Le Sellier de Chezelles <[email protected]>
* version bump
---------
Co-authored-by: Alexandre Lacoste <[email protected]>
* Make share=True into an environment variable, disabled by default for security
* fix floating point issue with std_reward in agent xray
* Update src/agentlab/analyze/inspect_results.py
* Update src/agentlab/analyze/agent_xray.py
---------
Co-authored-by: Thibault LSDC <[email protected]>
Co-authored-by: Alexandre Lacoste <[email protected]>
* added tmlr definitive config (#71)
* downgrading gradio version (#77)
* Study refactor (#73)
* adapting to new Benchmark class
* fixing tests
* fix tests
* typo
* not ready for gradio 5
* study id and a few fixes
* fixing pricy tests
---------
Co-authored-by: ThibaultLSDC <[email protected]>
* adding message class and updating generic agent accordingly (#68)
* adding message class and updating generic agent accordingly
* updating tests
* Reproducibility test before message class
* Adding inspect_result.ipynb to reprod white list
* Reproducibility test after message class
* L1 before message class
* L1 after message class
* added append as method to the Discussion class, to make it totally similar to a list
* changed to_markdown behavior
* updated most_basic_agent
* updated ReproAgent
* Update src/agentlab/analyze/agent_xray.py
* format
* new journal entry
* immutable as default kwarg
* removing __add__ and __radd__
* added deprecation warning
* updating tests
* version bump
* Updating generic_agent to fit use BGym's goal_object (#83)
* updating generic agent to goal_object
* fixing image markdown display
* updating tests
* fixing instruction BaseMessage
* added merge text in discussion
* added merge to discussion class
* added tests
* Minor revert (#86)
* minor revert
* revert tests too
* Add tabs (#84)
* add tabs
* make sure it's not computed if not visible
* Fix reproduce study (#87)
* add tabs
* this workaround is worse
* bug fix
* fix reproduce study
* make sure it's not computed if not visible
* upgrading gradio dependency (#88)
* bgym update (#90)
* Workarena TMLR experiments (#89)
* new entry
* adding llm configs
* new journal entries
* handling sequential in VWA (#91)
* handling sequential in VWA
* enable comments
* format
---------
Co-authored-by: ThibaultLSDC <[email protected]>
* Tmlr workarena (#92)
* adding llm configs
* new L1 entries
* tmp
* reformat
* adding assistantbench to reproducibility_util.py
* gitignore (#97)
* Vision fix (#105)
* changing content name
* Update src/agentlab/llm/llm_utils.py
---------
Co-authored-by: Maxime Gasse <[email protected]>
* L2 tmlr (#93)
* adding llm configs
* L2 entries
* claude L3
* claude vision support
* miniwob results
* 405b L1 entry
* Replacing Dask with Ray (#100)
* dask-dependencies
* minor
* replace with ray
* adjust tests and move a few things
* markdown report
* automatic relaunch
* add dependencies
* reformat
* fix unit-test
* catch timeout
* fixing bugs and making things work
* address comments and black format
* new dependencies viewer
* Update benchmark to use visualwebarena instead of webarena
* Fix import and uncomment code in get_ray_url.py
* Add ignore_dependencies option to Study and _agents_on_benchmark functions
* Update load_most_recent method to include contains parameter
* Update load_most_recent method to accept contains parameter and add warning for ignored dependencies in _agents_on_benchmark
* Refactor backend preparation in Study class and improve logging for ignored dependencies
* finally some results with claude on webarena
* Add warnings for Windows timeouts and clarify parallel backend options; update get_results method to conditionally save outputs
* black
* ensure timeout is int (For the 3rd time?)
* Refactor timeout handling in context manager; update test to reduce avg_step_timeout and rename test function
* black
* Change parallel backend from "joblib" to "ray" in run_experiments function
* Update src/agentlab/experiments/study.py
Co-authored-by: Maxime Gasse <[email protected]>
* Update src/agentlab/analyze/inspect_results.py
Co-authored-by: Maxime Gasse <[email protected]>
* Refactor logging initialization and update layout configurations in dependency graph plotting; adjust node size and font size for better visualization
---------
Co-authored-by: Maxime Gasse <[email protected]>
* switching to 2 for loops in _agents_on_benchmark (#107)
* yet another way to kill timedout jobs (#108)
* request is done once and then reused
* switched to caching the original function because it doesn't break the tests
* added a catch for some openrouter under-the-hood error
---------
Co-authored-by: Maxime Gasse <[email protected]>
Co-authored-by: Xing Han Lu <[email protected]>
Co-authored-by: Alexandre Lacoste <[email protected]>
1 parent feda734 commit 7a5b91e
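The entries "request is done once and then reused" and "switched to caching the original function" describe memoizing the pricing lookup so the remote endpoint is not hit repeatedly. The following is a minimal sketch of that idea only, not the code from this commit: `fetch_pricing_from_api`, `get_pricing`, `cost`, the model name, and the prices are all hypothetical, and the network call is replaced by a local stub so the example runs offline.

```python
from functools import lru_cache

# Counter used only to show how many times the (stubbed) request is made.
CALLS = {"count": 0}


def fetch_pricing_from_api():
    """Hypothetical stand-in for the remote pricing request (no network here)."""
    CALLS["count"] += 1
    # Made-up per-token prices for an illustrative model name.
    return {"example/model": {"prompt": 1e-6, "completion": 2e-6}}


@lru_cache(maxsize=1)
def get_pricing():
    """Perform the request once; later calls reuse the cached table."""
    return fetch_pricing_from_api()


def cost(model, prompt_tokens, completion_tokens):
    """Compute the cost of one call from the cached pricing table."""
    price = get_pricing()[model]
    return prompt_tokens * price["prompt"] + completion_tokens * price["completion"]
```

Calling `cost` any number of times triggers a single fetch, which is what keeps a rate-limited endpoint from being queried on every cost computation. `get_pricing.cache_clear()` forces a refresh if prices need to be reloaded.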
File tree
3 files changed, +16 -1 lines changed
- .github/workflows
- src/agentlab/llm