bump package versions to support gemini 3.1 and claude 4.6 (also rerun scoring and eval) #142

Merged: gituser768 merged 18 commits into main from rerun-as-requested on Apr 2, 2026

Conversation

gituser768 (Contributor) commented Mar 19, 2026:

Not really sure what I should do to check that this change didn't also break other things. I could instead just change the package versions for the SQA runs (inside the DVC call to uv) if that's what we decide.
agent-baselines side of things: allenai/agent-baselines#21

gituser768 (Contributor, Author) commented Mar 19, 2026:

brb, working on fixing CI.
EDIT: looks like it's fixed!

pyproject.toml (outdated), comment on lines +68 to +70:

```toml
"inspect_ai==0.3.143",
"anthropic==0.85.0",
"google-genai==1.67.0",
```
Contributor:
This is a lot of overrides, and I'd kind of prefer to live in a world with as few overrides as possible; can you add a note about what's actually conflicting? If we can resolve it easily that seems best, but otherwise the note will at least let us re-evaluate later.
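One way to carry that note in the file itself is to annotate each override with the conflict it works around. A minimal sketch, assuming the project uses uv's `[tool.uv]` override table; the specific conflicts named in the comments are illustrative placeholders, not taken from an actual resolver error:

```toml
[tool.uv]
override-dependencies = [
    # Needed for claude-4.6 / gemini-3.1 support; note here which
    # transitive dependency still pins the older inspect_ai.
    "inspect_ai==0.3.143",
    # Required by the bumped inspect_ai for the new Anthropic models.
    "anthropic==0.85.0",
]
```

A comment per override makes it cheap to re-evaluate later whether each one is still necessary.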

pyproject.toml (outdated):

```diff
-"inspect_ai==0.3.114",
+"inspect_ai==0.3.143",
 "agent-eval==0.1.44",
 "openai>=1.78.0", # required by inspect
```
mdarcy220 (Contributor) commented Mar 20, 2026:
openai 2.28.0, if that's the requirement (per below)

pyproject.toml (outdated), Contributor:
1.67.0

pyproject.toml (outdated), Contributor:
0.85.0

gituser768 (Contributor, Author) commented Mar 24, 2026:

| Model | global_avg | answer_precision | citation_precision | citation_recall | ingredient_recall |
|---|---|---|---|---|---|
| openai/o4-mini | 0.654291 | 0.865940 | 0.494598 | 0.463127 | 0.793500 |
| openai/o3 | 0.661024 | 0.940450 | 0.438086 | 0.379169 | 0.886390 |
| anthropic/claude-sonnet-4-6-thinking | 0.473003 | 0.728975 | 0.287516 | 0.181738 | 0.693783 |
| google/gemini-3.1-pro-preview | 0.652146 | 0.952273 | 0.464964 | 0.452873 | 0.738474 |
| anthropic/claude-sonnet-4-6 | 0.348680 | 0.542101 | 0.201180 | 0.121430 | 0.530011 |
| sqa_claude-4.6 | 0.901987 | 0.922788 | 0.926290 | 0.830583 | 0.928285 |
| sqa_gemini-3.1-pro-preview | 0.878336 | 0.914451 | 0.914740 | 0.831195 | 0.852959 |
| sqa_o3_high | 0.911982 | 0.927918 | 0.945655 | 0.812281 | 0.962072 |
| elicit | 0.862784 | 0.936412 | 0.913435 | 0.778799 | 0.822489 |
| storm | 0.825915 | 0.969670 | 0.795299 | 0.703196 | 0.835497 |
| scispace | 0.866961 | 0.911583 | 0.901646 | 0.745571 | 0.909046 |
| fhouse_crow | 0.672511 | 0.922433 | 0.490574 | 0.463001 | 0.814035 |
| fhouse_falcon | 0.699629 | 0.955473 | 0.483884 | 0.441739 | 0.917420 |
| openai_deep_research | 0.807240 | 0.969544 | 0.806372 | 0.529920 | 0.923125 |
| you | 0.617172 | 0.853728 | 0.613199 | 0.346591 | 0.655170 |
| perplexity_dr | 0.614562 | 0.932745 | 0.465892 | 0.235995 | 0.823616 |
| openscholar | 0.628729 | 0.841769 | 0.600112 | 0.516345 | 0.556690 |

gituser768 (Contributor, Author) commented:

Finally, a comparison table:

| Current model | Historical comparator | global_avg | answer_precision | citation_precision | citation_recall | ingredient_recall |
|---|---|---|---|---|---|---|
| openai/o3 | main DVC openai/o3 | 0.6667 -> 0.6610 | 0.9456 -> 0.9405 | 0.4496 -> 0.4381 | 0.4137 -> 0.3792 | 0.8578 -> 0.8864 |
| anthropic/claude-sonnet-4-6 | main DVC claude-sonnet-4-20250514 | 0.6244 -> 0.3487 | 0.9494 -> 0.5421 | 0.4038 -> 0.2012 | 0.2747 -> 0.1214 | 0.8698 -> 0.5300 |
| anthropic/claude-sonnet-4-6-thinking | main DVC claude-sonnet-4-20250514-thinking | 0.5958 -> 0.4730 | 0.9326 -> 0.7290 | 0.3603 -> 0.2875 | 0.3055 -> 0.1817 | 0.7849 -> 0.6938 |
| google/gemini-3.1-pro-preview | main DVC gemini-2.5-pro-preview-03-25 | 0.5760 -> 0.6521 | 0.9369 -> 0.9523 | 0.4089 -> 0.4650 | 0.3109 -> 0.4529 | 0.6472 -> 0.7385 |
| sqa_claude-4.6 | main DVC sqa_claude-4.0 | 0.8791 -> 0.9020 | 0.9072 -> 0.9228 | 0.9225 -> 0.9263 | 0.8111 -> 0.8306 | 0.8758 -> 0.9283 |
| sqa_gemini-3.1-pro-preview | main DVC sqa_gemini-2.5-pro | 0.8786 -> 0.8783 | 0.9212 -> 0.9145 | 0.9493 -> 0.9147 | 0.7443 -> 0.8312 | 0.8996 -> 0.8530 |
| sqa_o3_high | HF Asta Scholar QA openai/o3-2025-04-16 | 0.8874 -> 0.9120 | 0.9152 -> 0.9279 | 0.9183 -> 0.9457 | 0.7831 -> 0.8123 | 0.9330 -> 0.9621 |
| elicit | HF public Elicit submission | 0.8553 -> 0.8628 | 0.9338 -> 0.9364 | 0.8984 -> 0.9134 | 0.8017 -> 0.7788 | 0.7874 -> 0.8225 |
| storm | HF public STORM submission | 0.7830 -> 0.8259 | 0.9578 -> 0.9697 | 0.7502 -> 0.7953 | 0.6477 -> 0.7032 | 0.7764 -> 0.8355 |
| scispace | HF public Scispace submission | 0.8464 -> 0.8670 | 0.8836 -> 0.9116 | 0.8726 -> 0.9016 | 0.7187 -> 0.7456 | 0.9105 -> 0.9090 |
| fhouse_crow | HF public FutureHouse CROW submission | 0.8106 -> 0.6725 | 0.9917 -> 0.9224 | 0.8216 -> 0.4906 | 0.6338 -> 0.4630 | 0.7952 -> 0.8140 |
| fhouse_falcon | HF public FutureHouse FALCON submission | 0.7759 -> 0.6996 | 0.9807 -> 0.9555 | 0.7272 -> 0.4839 | 0.4728 -> 0.4417 | 0.9230 -> 0.9174 |
| openai_deep_research | latest HF public OpenAI Deep Research submission | 0.7940 -> 0.8072 | 0.9751 -> 0.9695 | 0.7548 -> 0.8064 | 0.5089 -> 0.5299 | 0.9370 -> 0.9231 |
| you | HF public You.com Research API submission | 0.5500 -> 0.6172 | 0.8849 -> 0.8537 | 0.4510 -> 0.6132 | 0.2721 -> 0.3466 | 0.5921 -> 0.6552 |
| perplexity_dr | HF public Perplexity Deep Research submission | 0.6728 -> 0.6146 | 0.9268 -> 0.9327 | 0.4726 -> 0.4659 | 0.3756 -> 0.2360 | 0.9160 -> 0.8236 |
| openscholar | HF public OpenSciLM/OpenScholar submission | 0.5796 -> 0.6287 | 0.7629 -> 0.8418 | 0.5016 -> 0.6001 | 0.4407 -> 0.5163 | 0.6131 -> 0.5567 |

mdarcy220 (Contributor) commented:

For the non-SQA/ReAct systems, are those score changes entirely from the scorer model change? E.g., FutureHouse CROW went 0.8106 -> 0.6725 even though we didn't regenerate the answers. It's odd that most systems score within about ±2% of before while a couple have those big diffs. It could make sense if those are fresh answers from their APIs, though.

gituser768 (Contributor, Author) commented:

@mdarcy220 I'm trying to figure out where those differences come from. I suspect that the answers I ran this eval on don't always match what is on HF, so I've pulled the HF eval files and am rescoring those directly at the moment.
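A consistency check like that can be sketched as a small diff over answer files. This is an illustration only: the JSONL layout and the `question_id`/`answer` field names are assumptions, not the repo's actual eval schema.

```python
import json
from pathlib import Path


def mismatched_ids(local_path, hf_path, key="question_id", field="answer"):
    """Return IDs whose answers differ between a local eval file and the
    copy pulled from HF. File layout and field names are hypothetical."""
    def load(path):
        # One JSON object per line; skip blank lines.
        rows = [json.loads(line)
                for line in Path(path).read_text().splitlines()
                if line.strip()]
        return {row[key]: row.get(field) for row in rows}

    local, hf = load(local_path), load(hf_path)
    # Only compare IDs present in both files; report any answer drift.
    return sorted(qid for qid in local.keys() & hf.keys()
                  if local[qid] != hf[qid])
```

Any IDs this returns were scored on answers that no longer match HF, which would explain score drift independently of the scorer change.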

gituser768 (Contributor, Author) commented:
Here is an updated table where I made sure to evaluate on exactly the same solver outputs that are already in HF:

| Current model | Historical comparator | global_avg | answer_precision | citation_precision | citation_recall | ingredient_recall |
|---|---|---|---|---|---|---|
| sqa_claude-4.6 | HF Asta Scholar QA claude_4 | 0.8791 -> 0.9020 | 0.9072 -> 0.9228 | 0.9225 -> 0.9263 | 0.8111 -> 0.8306 | 0.8758 -> 0.9283 |
| sqa_gemini-3.1-pro-preview | main DVC sqa_gemini-2.5-pro | 0.8786 -> 0.8783 | 0.9212 -> 0.9145 | 0.9493 -> 0.9147 | 0.7443 -> 0.8312 | 0.8996 -> 0.8530 |
| sqa_o3_high | HF Asta Scholar QA openai/o3-2025-04-16 | 0.8874 -> 0.9133 | 0.9152 -> 0.9347 | 0.9183 -> 0.9457 | 0.7831 -> 0.8097 | 0.9330 -> 0.9630 |
| elicit | HF Elicit submission | 0.8553 -> 0.8617 | 0.9338 -> 0.9350 | 0.8984 -> 0.9167 | 0.8017 -> 0.7815 | 0.7874 -> 0.8134 |
| storm | HF STORM submission | 0.7830 -> 0.8168 | 0.9578 -> 0.9607 | 0.7502 -> 0.8004 | 0.6477 -> 0.7053 | 0.7764 -> 0.8009 |
| scispace | HF Scispace submission | 0.8464 -> 0.8731 | 0.8836 -> 0.9151 | 0.8726 -> 0.9001 | 0.7187 -> 0.7455 | 0.9105 -> 0.9316 |
| fhouse_crow | HF FutureHouse CROW submission | 0.8106 -> 0.6513 | 0.9917 -> 0.9638 | 0.8216 -> 0.4511 | 0.6338 -> 0.3738 | 0.7952 -> 0.8164 |
| fhouse_falcon | HF FutureHouse FALCON submission | 0.7759 -> 0.6474 | 0.9807 -> 0.9532 | 0.7272 -> 0.4089 | 0.4728 -> 0.2860 | 0.9230 -> 0.9417 |
| openai_deep_research | latest HF public OpenAI Deep Research submission | 0.7940 -> 0.7957 | 0.9751 -> 0.9152 | 0.7548 -> 0.7892 | 0.5089 -> 0.5309 | 0.9370 -> 0.9476 |
| you | HF You.com Research API submission | 0.5500 -> 0.6108 | 0.8849 -> 0.8623 | 0.4510 -> 0.6217 | 0.2721 -> 0.3551 | 0.5921 -> 0.6042 |
| perplexity_dr | HF Perplexity Deep Research submission | 0.6728 -> 0.6968 | 0.9268 -> 0.9118 | 0.4726 -> 0.4926 | 0.3756 -> 0.4362 | 0.9160 -> 0.9465 |
| openscholar | HF OpenSciLM/OpenScholar submission | 0.5796 -> 0.5873 | 0.7629 -> 0.7649 | 0.5016 -> 0.5069 | 0.4407 -> 0.4481 | 0.6131 -> 0.6292 |

Only the fhouse solvers start doing worse. After manually examining some examples, I think this is mostly due to the old scorer being overly generous on the citation metrics for fhouse for some reason.

gituser768 (Contributor, Author) commented:

Alright, reduced the number of overrides. This should be ready now.

mdarcy220 (Contributor) left a review:

ok great, lgtm

@gituser768 gituser768 merged commit 57dcfe1 into main Apr 2, 2026
4 checks passed
@gituser768 gituser768 deleted the rerun-as-requested branch April 2, 2026 17:15
4 participants