bump package versions to support gemini 3.1 and claude 4.6 (also rerun scoring and eval) #142

Merged: gituser768 merged 18 commits into main from rerun-as-requested on Apr 2, 2026

Conversation

gituser768 (Contributor) commented Mar 19, 2026:

Not really sure what I should do to check that this change didn't also break other things. I could instead just change the package versions for the SQA runs (inside the DVC call to uv) if that's what we decide.
agent-baselines side of things: allenai/agent-baselines#21

gituser768 (Contributor, Author) commented Mar 19, 2026:

brb, working on fixing CI.
EDIT: looks like it's fixed!

pyproject.toml (outdated), comment on lines +68 to +70:

```toml
"inspect_ai==0.3.143",
"anthropic==0.85.0",
"google-genai==1.67.0",
```
Contributor:
This is a lot of overrides, and I'd kind of prefer to live in a world with as few overrides as possible; can you add a note about what's actually conflicting? If we can resolve it easily that seems best, but otherwise the note will at least let us re-evaluate later.
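One way to carry that note in the file itself is to annotate each override with the conflict it works around. A minimal sketch, assuming the project uses uv's `[tool.uv]` override table; the specific conflicts named in the comments are illustrative placeholders, not taken from an actual resolver error:

```toml
[tool.uv]
override-dependencies = [
    # Needed for claude-4.6 / gemini-3.1 support; note here which
    # transitive dependency still pins the older inspect_ai.
    "inspect_ai==0.3.143",
    # Required by the bumped inspect_ai for the new Anthropic models.
    "anthropic==0.85.0",
]
```

A comment per override makes it cheap to re-evaluate later whether each one is still necessary.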

pyproject.toml (outdated):

```diff
-"inspect_ai==0.3.114",
+"inspect_ai==0.3.143",
 "agent-eval==0.1.44",
 "openai>=1.78.0", # required by inspect
```
mdarcy220 (Contributor) commented Mar 20, 2026:
openai 2.28.0, if that's the requirement (per below)

pyproject.toml (outdated), Contributor:
1.67.0

pyproject.toml (outdated), Contributor:
0.85.0

gituser768 (Contributor, Author) commented Mar 24, 2026:

| Model | global_avg | answer_precision | citation_precision | citation_recall | ingredient_recall |
|---|---|---|---|---|---|
| openai/o4-mini | 0.654291 | 0.865940 | 0.494598 | 0.463127 | 0.793500 |
| openai/o3 | 0.661024 | 0.940450 | 0.438086 | 0.379169 | 0.886390 |
| anthropic/claude-sonnet-4-6-thinking | 0.473003 | 0.728975 | 0.287516 | 0.181738 | 0.693783 |
| google/gemini-3.1-pro-preview | 0.652146 | 0.952273 | 0.464964 | 0.452873 | 0.738474 |
| anthropic/claude-sonnet-4-6 | 0.348680 | 0.542101 | 0.201180 | 0.121430 | 0.530011 |
| sqa_claude-4.6 | 0.901987 | 0.922788 | 0.926290 | 0.830583 | 0.928285 |
| sqa_gemini-3.1-pro-preview | 0.878336 | 0.914451 | 0.914740 | 0.831195 | 0.852959 |
| sqa_o3_high | 0.911982 | 0.927918 | 0.945655 | 0.812281 | 0.962072 |
| elicit | 0.862784 | 0.936412 | 0.913435 | 0.778799 | 0.822489 |
| storm | 0.825915 | 0.969670 | 0.795299 | 0.703196 | 0.835497 |
| scispace | 0.866961 | 0.911583 | 0.901646 | 0.745571 | 0.909046 |
| fhouse_crow | 0.672511 | 0.922433 | 0.490574 | 0.463001 | 0.814035 |
| fhouse_falcon | 0.699629 | 0.955473 | 0.483884 | 0.441739 | 0.917420 |
| openai_deep_research | 0.807240 | 0.969544 | 0.806372 | 0.529920 | 0.923125 |
| you | 0.617172 | 0.853728 | 0.613199 | 0.346591 | 0.655170 |
| perplexity_dr | 0.614562 | 0.932745 | 0.465892 | 0.235995 | 0.823616 |
| openscholar | 0.628729 | 0.841769 | 0.600112 | 0.516345 | 0.556690 |

gituser768 (Contributor, Author) commented:

Finally, a comparison table:

| Current model | Historical comparator | global_avg | answer_precision | citation_precision | citation_recall | ingredient_recall |
|---|---|---|---|---|---|---|
| openai/o3 | main DVC openai/o3 | 0.6667 -> 0.6610 | 0.9456 -> 0.9405 | 0.4496 -> 0.4381 | 0.4137 -> 0.3792 | 0.8578 -> 0.8864 |
| anthropic/claude-sonnet-4-6 | main DVC claude-sonnet-4-20250514 | 0.6244 -> 0.3487 | 0.9494 -> 0.5421 | 0.4038 -> 0.2012 | 0.2747 -> 0.1214 | 0.8698 -> 0.5300 |
| anthropic/claude-sonnet-4-6-thinking | main DVC claude-sonnet-4-20250514-thinking | 0.5958 -> 0.4730 | 0.9326 -> 0.7290 | 0.3603 -> 0.2875 | 0.3055 -> 0.1817 | 0.7849 -> 0.6938 |
| google/gemini-3.1-pro-preview | main DVC gemini-2.5-pro-preview-03-25 | 0.5760 -> 0.6521 | 0.9369 -> 0.9523 | 0.4089 -> 0.4650 | 0.3109 -> 0.4529 | 0.6472 -> 0.7385 |
| sqa_claude-4.6 | main DVC sqa_claude-4.0 | 0.8791 -> 0.9020 | 0.9072 -> 0.9228 | 0.9225 -> 0.9263 | 0.8111 -> 0.8306 | 0.8758 -> 0.9283 |
| sqa_gemini-3.1-pro-preview | main DVC sqa_gemini-2.5-pro | 0.8786 -> 0.8783 | 0.9212 -> 0.9145 | 0.9493 -> 0.9147 | 0.7443 -> 0.8312 | 0.8996 -> 0.8530 |
| sqa_o3_high | HF Asta Scholar QA openai/o3-2025-04-16 | 0.8874 -> 0.9120 | 0.9152 -> 0.9279 | 0.9183 -> 0.9457 | 0.7831 -> 0.8123 | 0.9330 -> 0.9621 |
| elicit | HF public Elicit submission | 0.8553 -> 0.8628 | 0.9338 -> 0.9364 | 0.8984 -> 0.9134 | 0.8017 -> 0.7788 | 0.7874 -> 0.8225 |
| storm | HF public STORM submission | 0.7830 -> 0.8259 | 0.9578 -> 0.9697 | 0.7502 -> 0.7953 | 0.6477 -> 0.7032 | 0.7764 -> 0.8355 |
| scispace | HF public Scispace submission | 0.8464 -> 0.8670 | 0.8836 -> 0.9116 | 0.8726 -> 0.9016 | 0.7187 -> 0.7456 | 0.9105 -> 0.9090 |
| fhouse_crow | HF public FutureHouse CROW submission | 0.8106 -> 0.6725 | 0.9917 -> 0.9224 | 0.8216 -> 0.4906 | 0.6338 -> 0.4630 | 0.7952 -> 0.8140 |
| fhouse_falcon | HF public FutureHouse FALCON submission | 0.7759 -> 0.6996 | 0.9807 -> 0.9555 | 0.7272 -> 0.4839 | 0.4728 -> 0.4417 | 0.9230 -> 0.9174 |
| openai_deep_research | latest HF public OpenAI Deep Research submission | 0.7940 -> 0.8072 | 0.9751 -> 0.9695 | 0.7548 -> 0.8064 | 0.5089 -> 0.5299 | 0.9370 -> 0.9231 |
| you | HF public You.com Research API submission | 0.5500 -> 0.6172 | 0.8849 -> 0.8537 | 0.4510 -> 0.6132 | 0.2721 -> 0.3466 | 0.5921 -> 0.6552 |
| perplexity_dr | HF public Perplexity Deep Research submission | 0.6728 -> 0.6146 | 0.9268 -> 0.9327 | 0.4726 -> 0.4659 | 0.3756 -> 0.2360 | 0.9160 -> 0.8236 |
| openscholar | HF public OpenSciLM/OpenScholar submission | 0.5796 -> 0.6287 | 0.7629 -> 0.8418 | 0.5016 -> 0.6001 | 0.4407 -> 0.5163 | 0.6131 -> 0.5567 |

mdarcy220 (Contributor) commented:

For the non-SQA/ReAct systems, are those score changes entirely from the scorer model change? E.g., FutureHouse CROW went 0.8106 -> 0.6725 even though we didn't regenerate the answers. It's odd that most systems score within about ±2% of before while a couple have those big diffs. It could make sense if those are fresh answers from their APIs, though.

gituser768 (Contributor, Author) commented:

@mdarcy220 I'm trying to figure out where those differences come from. I suspect that the answers I ran this eval on don't always match what is on HF, so I've pulled the HF eval files and am rescoring those directly at the moment.
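A consistency check like that can be sketched as a small diff over answer files. This is an illustration only: the JSONL layout and the `question_id`/`answer` field names are assumptions, not the repo's actual eval schema.

```python
import json
from pathlib import Path


def mismatched_ids(local_path, hf_path, key="question_id", field="answer"):
    """Return IDs whose answers differ between a local eval file and the
    copy pulled from HF. File layout and field names are hypothetical."""
    def load(path):
        # One JSON object per line; skip blank lines.
        rows = [json.loads(line)
                for line in Path(path).read_text().splitlines()
                if line.strip()]
        return {row[key]: row.get(field) for row in rows}

    local, hf = load(local_path), load(hf_path)
    # Only compare IDs present in both files; report any answer drift.
    return sorted(qid for qid in local.keys() & hf.keys()
                  if local[qid] != hf[qid])
```

Any IDs this returns were scored on answers that no longer match HF, which would explain score drift independently of the scorer change.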

gituser768 (Contributor, Author) commented:
Here is an updated table where I made sure to evaluate on exactly the same solver outputs that are already in HF:

| Current model | Historical comparator | global_avg | answer_precision | citation_precision | citation_recall | ingredient_recall |
|---|---|---|---|---|---|---|
| sqa_claude-4.6 | HF Asta Scholar QA claude_4 | 0.8791 -> 0.9020 | 0.9072 -> 0.9228 | 0.9225 -> 0.9263 | 0.8111 -> 0.8306 | 0.8758 -> 0.9283 |
| sqa_gemini-3.1-pro-preview | main DVC sqa_gemini-2.5-pro | 0.8786 -> 0.8783 | 0.9212 -> 0.9145 | 0.9493 -> 0.9147 | 0.7443 -> 0.8312 | 0.8996 -> 0.8530 |
| sqa_o3_high | HF Asta Scholar QA openai/o3-2025-04-16 | 0.8874 -> 0.9133 | 0.9152 -> 0.9347 | 0.9183 -> 0.9457 | 0.7831 -> 0.8097 | 0.9330 -> 0.9630 |
| elicit | HF Elicit submission | 0.8553 -> 0.8617 | 0.9338 -> 0.9350 | 0.8984 -> 0.9167 | 0.8017 -> 0.7815 | 0.7874 -> 0.8134 |
| storm | HF STORM submission | 0.7830 -> 0.8168 | 0.9578 -> 0.9607 | 0.7502 -> 0.8004 | 0.6477 -> 0.7053 | 0.7764 -> 0.8009 |
| scispace | HF Scispace submission | 0.8464 -> 0.8731 | 0.8836 -> 0.9151 | 0.8726 -> 0.9001 | 0.7187 -> 0.7455 | 0.9105 -> 0.9316 |
| fhouse_crow | HF FutureHouse CROW submission | 0.8106 -> 0.6513 | 0.9917 -> 0.9638 | 0.8216 -> 0.4511 | 0.6338 -> 0.3738 | 0.7952 -> 0.8164 |
| fhouse_falcon | HF FutureHouse FALCON submission | 0.7759 -> 0.6474 | 0.9807 -> 0.9532 | 0.7272 -> 0.4089 | 0.4728 -> 0.2860 | 0.9230 -> 0.9417 |
| openai_deep_research | latest HF public OpenAI Deep Research submission | 0.7940 -> 0.7957 | 0.9751 -> 0.9152 | 0.7548 -> 0.7892 | 0.5089 -> 0.5309 | 0.9370 -> 0.9476 |
| you | HF You.com Research API submission | 0.5500 -> 0.6108 | 0.8849 -> 0.8623 | 0.4510 -> 0.6217 | 0.2721 -> 0.3551 | 0.5921 -> 0.6042 |
| perplexity_dr | HF Perplexity Deep Research submission | 0.6728 -> 0.6968 | 0.9268 -> 0.9118 | 0.4726 -> 0.4926 | 0.3756 -> 0.4362 | 0.9160 -> 0.9465 |
| openscholar | HF OpenSciLM/OpenScholar submission | 0.5796 -> 0.5873 | 0.7629 -> 0.7649 | 0.5016 -> 0.5069 | 0.4407 -> 0.4481 | 0.6131 -> 0.6292 |

Only the fhouse solvers start doing worse. After manually examining some examples, I think this is mostly due to the old scorer being overly generous on the citation metrics for fhouse for some reason.

gituser768 (Contributor, Author) commented:

Alright, reduced the number of overrides. This should be ready now.

mdarcy220 (Contributor) left a review:

ok great, lgtm

@gituser768 gituser768 merged commit 57dcfe1 into main Apr 2, 2026
4 checks passed
@gituser768 gituser768 deleted the rerun-as-requested branch April 2, 2026 17:15
4 participants