bump package versions to support gemini 3.1 and claude 4.6 (also rerun scoring and eval) #142
gituser768 merged 18 commits into main
brb, working on fixing CI.
pyproject.toml
Outdated
    "inspect_ai==0.3.143",
    "anthropic==0.85.0",
    "google-genai==1.67.0",
This is a lot of overrides, and I'd kind of prefer to live in a world with as few overrides as possible; can you add a note about what's actually conflicting? If we can resolve it easily that seems best, but otherwise the note will at least let us re-evaluate later.
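One way to produce the requested note is to list which installed distributions declare a requirement on each overridden package. The following is a minimal sketch using only the standard library; the helper names (`requirement_name`, `who_requires`, `installed_requirements`) are hypothetical and not part of this repo, and the name extraction is a simplified regex, not a full PEP 508 parser.

```python
import re
from importlib.metadata import distributions


def _normalize(name):
    # PEP 503 style normalization: runs of -, _ and . compare equal.
    return re.sub(r"[-_.]+", "-", name).lower()


def requirement_name(req):
    """Extract the normalized project name from a requirement string."""
    match = re.match(r"\s*([A-Za-z0-9][A-Za-z0-9._-]*)", req)
    return _normalize(match.group(1)) if match else None


def who_requires(target, requirement_table):
    """Map each distribution to its requirement strings that name `target`."""
    target = _normalize(target)
    hits = {}
    for dist_name, reqs in requirement_table.items():
        matching = [r for r in reqs if requirement_name(r) == target]
        if matching:
            hits[dist_name] = matching
    return hits


def installed_requirements():
    """Collect {distribution name: declared requirements} for the current environment."""
    table = {}
    for dist in distributions():
        name = dist.metadata["Name"]
        if name:
            table[name] = dist.requires or []
    return table


if __name__ == "__main__":
    # Print the declared constraints on one overridden package, e.g. openai.
    for dist_name, reqs in who_requires("openai", installed_requirements()).items():
        print(dist_name, reqs)
```

Running this in the resolved environment for each overridden package shows the version specifiers that other packages declare, which can then be pasted next to the override as a comment.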
pyproject.toml
Outdated
    "inspect_ai==0.3.114",
    "inspect_ai==0.3.143",
    "agent-eval==0.1.44",
    "openai>=1.78.0", # required by inspect
openai 2.28.0, if that's the requirement (per below)
Finally a comparison table:
For the non-sqa/react systems, are those score changes entirely from the scorer model change? E.g. futurehouse crow went
@mdarcy220 I'm trying to figure out where those differences come from. I suspect that the answers I ran this eval on don't always match what is on HF, so I've pulled the HF eval files and am rescoring those directly at the moment.
Here is an updated table where I made sure to evaluate on exactly the same solver outputs that are already in HF:
Only the fhouse solvers do worse. After manually examining some examples, I think this is mostly because the old scorer was overly generous on the citation metrics for fhouse, for some reason.
…to rerun-as-requested
# Conflicts:
#	pyproject.toml
alright, reduced the number of overrides. this should be ready now |
Not really sure what I should do to check that this change didn't also break other things. I could instead just change the package versions for the sqa runs (inside the dvc call to uv) if that's what we decide.
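As a cheap first sanity check (not a substitute for running the test suite), something like the sketch below could confirm that the environment actually resolved to the intended pins. The `PINS` values are copied from the diff snippets in this thread, which were later marked outdated, so treat them as placeholders; `check_pins` is a hypothetical helper, not existing code in the repo.

```python
from importlib.metadata import PackageNotFoundError, version

# Placeholder pins taken from the diff in this thread; update them to match
# whatever pyproject.toml ends up containing.
PINS = {
    "inspect_ai": "0.3.143",
    "anthropic": "0.85.0",
    "google-genai": "1.67.0",
}


def check_pins(pins, get_version=version):
    """Return {package: (expected, installed or None)} for each unsatisfied pin."""
    mismatches = {}
    for name, expected in pins.items():
        try:
            installed = get_version(name)
        except PackageNotFoundError:
            installed = None  # package is not installed at all
        if installed != expected:
            mismatches[name] = (expected, installed)
    return mismatches


if __name__ == "__main__":
    for name, (expected, installed) in check_pins(PINS).items():
        print(f"{name}: expected {expected}, found {installed}")
```

An empty result only shows that the versions match; whether other solvers still behave the same would still need an actual eval or test run.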
agent-baselines side of things: allenai/agent-baselines#21