adapt scoring for user-submitted models by regan-huff · Pull Request #76 · allenai/agent-eval

regan-huff · 2026-03-25T23:03:38Z

I am trying to score some recently arrived external AstaBench submissions whose model usage our current code cannot price.

moonshot
https://huggingface.co/datasets/allenai/asta-bench-submissions/tree/main/1.0.0/test/EvoScientist_EvoScientist_Coder_2026-03-19_16-22-34

The solver args show the provider as openrouter/moonshotai/kimi-k2.5, which comes through in the inspect model usage objects like this:

{
  model: "moonshotai/kimi-k2.5-0127",
  usage: {
    input_tokens: 7467,
    output_tokens: 829,
    total_tokens: 8296,
    input_tokens_cache_write: null,
    input_tokens_cache_read: 6144,
    reasoning_tokens: 284
  }
}

Moonshot is supported as an inference provider in litellm and some cost objects have been added to https://github.com/BerriAI/litellm/blob/main/model_prices_and_context_window.json
but are not yet in a released version. This PR adds pricing in local_cost to handle this provider/model.
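
To illustrate what a local cost entry has to supply, here is a minimal sketch that prices the usage object above. The per-token rates are placeholders, not Moonshot's actual prices, and `LOCAL_COST` and `usage_cost` are hypothetical names rather than the agent-eval implementation. The rate key names mirror those used in the litellm cost map. The sketch also assumes cache-read tokens are counted inside `input_tokens` and reasoning tokens inside `output_tokens`, which should be confirmed against how the provider reports usage.

```python
from typing import Optional

# Placeholder rates in USD per token. These are NOT real Moonshot prices.
LOCAL_COST = {
    "moonshotai/kimi-k2.5-0127": {
        "input_cost_per_token": 1e-06,
        "output_cost_per_token": 2e-06,
        "cache_read_input_token_cost": 5e-07,
    }
}


def usage_cost(model: str, usage: dict) -> Optional[float]:
    """Return the USD cost of one usage record, or None if the model is unpriced."""
    rates = LOCAL_COST.get(model)
    if rates is None:
        return None

    # A null cache field (None) is treated as zero cached tokens.
    cached = usage.get("input_tokens_cache_read") or 0
    # Assumption: cached tokens are a subset of input_tokens.
    uncached = max(usage["input_tokens"] - cached, 0)
    # Fall back to the normal input rate when no cache-read rate is given.
    cached_rate = rates.get(
        "cache_read_input_token_cost", rates["input_cost_per_token"]
    )
    # Assumption: reasoning tokens are already included in output_tokens.
    return (
        uncached * rates["input_cost_per_token"]
        + cached * cached_rate
        + usage["output_tokens"] * rates["output_cost_per_token"]
    )


usage = {
    "input_tokens": 7467,
    "output_tokens": 829,
    "total_tokens": 8296,
    "input_tokens_cache_write": None,
    "input_tokens_cache_read": 6144,
    "reasoning_tokens": 284,
}
cost = usage_cost("moonshotai/kimi-k2.5-0127", usage)
```

Returning `None` for an unknown model, rather than zero, keeps an unpriced submission distinguishable from a free one.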

claude-opus-4-6
1.0.0/test/Distyl_AI_Button_2026-03-23_18-54-16
This cost information can be added by bumping the litellm version to 1.82.3 and updating the desired_model_costs_url to match the sha for this release.

According to litellm, the compromised PyPI packages were litellm==1.82.7 and litellm==1.82.8, so 1.82.3 is not among them.
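
A version guard along these lines could make that check explicit. This is only a sketch: the two version strings come from the note above, while `COMPROMISED_LITELLM_VERSIONS`, `is_compromised`, and `installed_litellm_version` are hypothetical helpers that are not part of agent-eval or litellm.

```python
from importlib import metadata
from typing import Optional

# Versions named above as compromised on PyPI.
COMPROMISED_LITELLM_VERSIONS = frozenset({"1.82.7", "1.82.8"})


def is_compromised(version: str) -> bool:
    """True if the given litellm version string is one of the known-bad releases."""
    return version.strip() in COMPROMISED_LITELLM_VERSIONS


def installed_litellm_version() -> Optional[str]:
    """Return the installed litellm version, or None if litellm is not installed."""
    try:
        return metadata.version("litellm")
    except metadata.PackageNotFoundError:
        return None


pinned_ok = not is_compromised("1.82.3")
```

Calling `is_compromised(installed_litellm_version() or "")` before scoring would catch an environment that resolved to one of the bad releases.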

Verified that scoring these two submissions is possible with these changes.

dirkraft · 2026-03-25T23:24:15Z

src/agenteval/cli.py

    # https://github.com/BerriAI/litellm/blob/b9621c760d3355e06dd17ec89b9eb6776755392e/litellm/litellm_core_utils/get_model_cost_map.py#L16
    # See the Development.md before changing.
-    desired_model_costs_url = "https://raw.githubusercontent.com/BerriAI/litellm/eb66daeef740947c0326826817cf68fb56a8b931/litellm/model_prices_and_context_window_backup.json"
+    desired_model_costs_url = "https://raw.githubusercontent.com/BerriAI/litellm/9a5c778f1824641fe9f6c8dcc1d096fd9d8ef9f0/litellm/model_prices_and_context_window_backup.json"


how'd you choose this one? i ended up in the same place for running some other cost calcs. think we should take whatever the latest is

This is the latest release marked "stable"

dirkraft · 2026-03-25T23:28:12Z

I have a further request if it makes sense. Put the costs used in the data. https://github.com/allenai/agent-eval/compare/update-cost-map-gpt54
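
One way to read this request is to record the commit of the pinned cost map next to the scores. The sketch below pulls the SHA out of the pinned URL and attaches it to a result record. The URL is the one from the diff above. `cost_map_commit` and the `record` field names are hypothetical and not taken from the linked branch.

```python
import re

# Matches a raw.githubusercontent.com URL pinned to a full 40-character commit SHA.
PINNED_URL_RE = re.compile(
    r"^https://raw\.githubusercontent\.com/BerriAI/litellm/(?P<sha>[0-9a-f]{40})/"
)


def cost_map_commit(url: str) -> str:
    """Extract the pinned commit SHA, rejecting URLs that track a branch or tag."""
    match = PINNED_URL_RE.match(url)
    if match is None:
        raise ValueError(f"cost map URL is not pinned to a commit SHA: {url}")
    return match.group("sha")


desired_model_costs_url = (
    "https://raw.githubusercontent.com/BerriAI/litellm/"
    "9a5c778f1824641fe9f6c8dcc1d096fd9d8ef9f0/"
    "litellm/model_prices_and_context_window_backup.json"
)

# Hypothetical provenance fields stored alongside the scored results.
record = {
    "cost_map_url": desired_model_costs_url,
    "cost_map_commit": cost_map_commit(desired_model_costs_url),
}
```

Rejecting unpinned URLs means a recorded score can always be traced back to one exact version of the price table.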

regan-huff · 2026-03-25T23:31:31Z

> I have a further request if it makes sense. Put the costs used in the data. https://github.com/allenai/agent-eval/compare/update-cost-map-gpt54

That makes sense to me...do you want me to bring those changes into this PR?

dirkraft · 2026-03-26T00:27:27Z

> > I have a further request if it makes sense. Put the costs used in the data. https://github.com/allenai/agent-eval/compare/update-cost-map-gpt54
>
> That makes sense to me...do you want me to bring those changes into this PR?

yes please :)

I'm trying to figure out the specific model names for all the newer runs we're trying to get and see if they're actually in that version of the costs file. If that file is really only a few days old, then it's probably(?) fine
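
The check described here can be sketched as a lookup of each model name against the loaded cost map. `missing_models` is a hypothetical helper, and the `cost_map` below is a toy stand-in rather than the contents of the real litellm file, which would be loaded from the pinned JSON.

```python
from typing import Iterable, List, Mapping


def missing_models(cost_map: Mapping[str, dict], model_names: Iterable[str]) -> List[str]:
    """Return the sorted, de-duplicated model names that have no cost map entry."""
    return sorted({name for name in model_names if name not in cost_map})


# Toy stand-in for the parsed cost map JSON; the entries are illustrative only.
cost_map = {
    "claude-opus-4-6": {},
}

# Model names as they appear in the usage objects of the submissions.
wanted = ["claude-opus-4-6", "moonshotai/kimi-k2.5-0127", "moonshotai/kimi-k2.5-0127"]

unpriced = missing_models(cost_map, wanted)
```

The lookup uses the name from the usage object (for example `moonshotai/kimi-k2.5-0127`), not the name in the solver args, since the two can differ.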

regan-huff added 3 commits March 25, 2026 15:25

bump litellm version to get cost for newer claude models

d335d05

add custom pricing for moonshot

0bd7551

update with basemodel

c22179d

regan-huff requested a review from dirkraft March 25, 2026 23:03

dirkraft reviewed Mar 25, 2026

regan-huff added 2 commits March 26, 2026 09:57

capture commit of model cost json used to score

af034d2

foooormat

d413ef1

dirkraft approved these changes Mar 26, 2026

regan-huff merged commit eba8758 into main Mar 26, 2026
4 checks passed

regan-huff deleted the reganh/bump-litellm branch March 26, 2026 20:12
