
adapt scoring for user-submitted models#76

Merged
regan-huff merged 5 commits into main from reganh/bump-litellm on Mar 26, 2026

@regan-huff
Contributor

I am trying to score some recently arrived external submissions for AstaBench whose model usage our current code can't compute costs for.

  1. moonshot
    https://huggingface.co/datasets/allenai/asta-bench-submissions/tree/main/1.0.0/test/EvoScientist_EvoScientist_Coder_2026-03-19_16-22-34

The solver args show the model `openrouter/moonshotai/kimi-k2.5`, which comes through in the Inspect model usage objects like this:

{
  "model": "moonshotai/kimi-k2.5-0127",
  "usage": {
    "input_tokens": 7467,
    "output_tokens": 829,
    "total_tokens": 8296,
    "input_tokens_cache_write": null,
    "input_tokens_cache_read": 6144,
    "reasoning_tokens": 284
  }
}

litellm supports Moonshot as an inference provider, and some cost entries for it have been added to https://github.com/BerriAI/litellm/blob/main/model_prices_and_context_window.json
but they are not yet in a released version. This PR adds pricing for this provider/model in `local_cost`.
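To illustrate what a local pricing entry makes possible, here is a minimal sketch that prices the usage object above. It is not the actual `local_cost` code. The `LOCAL_COSTS` table and `usage_cost` function are hypothetical; the field names follow litellm's cost-map JSON, but the prices are placeholders, not real Moonshot or OpenRouter rates. It also assumes that `input_tokens` already includes the cache-read tokens and that `reasoning_tokens` are a subset of `output_tokens`.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical local pricing table keyed by the model name Inspect reports.
# Field names follow litellm's cost map; the numbers are PLACEHOLDERS.
LOCAL_COSTS = {
    "moonshotai/kimi-k2.5-0127": {
        "input_cost_per_token": 1e-6,
        "output_cost_per_token": 2e-6,
        "cache_read_input_token_cost": 1e-7,
    },
}


@dataclass
class Usage:
    """Subset of the Inspect model usage fields shown above."""
    input_tokens: int
    output_tokens: int
    input_tokens_cache_read: Optional[int] = None
    reasoning_tokens: Optional[int] = None


def usage_cost(model: str, usage: Usage) -> float:
    prices = LOCAL_COSTS[model]  # raises KeyError if there is no local price
    cache_read = usage.input_tokens_cache_read or 0
    # Assumption: input_tokens includes cache-read tokens, so only the
    # remainder is billed at the full input rate.
    uncached = usage.input_tokens - cache_read
    cache_rate = prices.get("cache_read_input_token_cost",
                            prices["input_cost_per_token"])
    # Assumption: reasoning_tokens are already counted in output_tokens,
    # so they are not billed a second time.
    return (
        uncached * prices["input_cost_per_token"]
        + cache_read * cache_rate
        + usage.output_tokens * prices["output_cost_per_token"]
    )


usage = Usage(input_tokens=7467, output_tokens=829,
              input_tokens_cache_read=6144, reasoning_tokens=284)
print(f"{usage_cost('moonshotai/kimi-k2.5-0127', usage):.7f}")  # prints 0.0035954
```

Whether cached tokens are included in `input_tokens` depends on how Inspect maps each provider's usage, so that assumption is worth checking before trusting the numbers.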

  2. claude-opus-4-6
    1.0.0/test/Distyl_AI_Button_2026-03-23_18-54-16
    Cost information for this model becomes available by bumping litellm to 1.82.3 and updating `desired_model_costs_url` to point at the commit SHA for that release.
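Pinning the cost map to a commit rather than `main` keeps scoring reproducible. The sketch below shows how the URL is derived from a SHA; `cost_map_url` is a hypothetical helper, not part of the codebase, and the URL pattern is copied from the `desired_model_costs_url` values in this PR. Which SHA corresponds to a given release still has to be looked up on the litellm side.

```python
import re

# URL pattern used by desired_model_costs_url in this PR.
COST_MAP_URL_TEMPLATE = (
    "https://raw.githubusercontent.com/BerriAI/litellm/{sha}/"
    "litellm/model_prices_and_context_window_backup.json"
)


def cost_map_url(sha: str) -> str:
    """Build a raw.githubusercontent.com URL pinned to a single litellm commit."""
    # Require a full 40-character hex SHA so a branch name like "main"
    # cannot slip in and make the cost map drift over time.
    if not re.fullmatch(r"[0-9a-f]{40}", sha):
        raise ValueError(f"expected a full 40-character commit SHA, got {sha!r}")
    return COST_MAP_URL_TEMPLATE.format(sha=sha)


print(cost_map_url("9a5c778f1824641fe9f6c8dcc1d096fd9d8ef9f0"))
```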

According to the litellm maintainers, the compromised PyPI releases were `litellm==1.82.7` and `litellm==1.82.8`; 1.82.3 is not one of them.
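Pinning the exact version (e.g. `litellm==1.82.3`) in the project's dependencies is the primary guard. As a belt-and-braces check, a runtime guard could refuse to run with either compromised release. This is a hypothetical sketch, not part of this PR; it uses only the standard-library `importlib.metadata.version` lookup.

```python
from importlib.metadata import PackageNotFoundError, version

# Releases reported as compromised on PyPI.
COMPROMISED_LITELLM_VERSIONS = frozenset({"1.82.7", "1.82.8"})


def is_compromised(litellm_version: str) -> bool:
    return litellm_version in COMPROMISED_LITELLM_VERSIONS


def check_installed_litellm() -> None:
    """Raise if the installed litellm is one of the compromised releases."""
    try:
        installed = version("litellm")
    except PackageNotFoundError:
        return  # litellm is not installed; nothing to check
    if is_compromised(installed):
        raise RuntimeError(f"litellm=={installed} is a compromised release")
```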

I verified that both submissions can be scored with these changes.

@regan-huff regan-huff requested a review from dirkraft March 25, 2026 23:03
  # https://github.com/BerriAI/litellm/blob/b9621c760d3355e06dd17ec89b9eb6776755392e/litellm/litellm_core_utils/get_model_cost_map.py#L16
  # See the Development.md before changing.
- desired_model_costs_url = "https://raw.githubusercontent.com/BerriAI/litellm/eb66daeef740947c0326826817cf68fb56a8b931/litellm/model_prices_and_context_window_backup.json"
+ desired_model_costs_url = "https://raw.githubusercontent.com/BerriAI/litellm/9a5c778f1824641fe9f6c8dcc1d096fd9d8ef9f0/litellm/model_prices_and_context_window_backup.json"

How'd you choose this one? I ended up in the same place when running some other cost calcs. I think we should take whatever the latest is.

Contributor Author

This is the latest release marked "stable".
[Screenshot, 2026-03-25 4:28:44 PM]

@dirkraft

I have a further request, if it makes sense: put the costs used into the data. https://github.com/allenai/agent-eval/compare/update-cost-map-gpt54
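One reading of this request is to store, next to the scores, the cost-map URL and the per-model prices that were actually applied. The sketch below assumes that interpretation; `cost_provenance`, `PRICE_FIELDS`, and the record layout are hypothetical and not taken from the linked branch, and the toy map uses placeholder prices.

```python
import json
from typing import Dict, Iterable, Optional

# litellm-style price fields worth recording; extend as needed.
PRICE_FIELDS = (
    "input_cost_per_token",
    "output_cost_per_token",
    "cache_read_input_token_cost",
    "cache_creation_input_token_cost",
)


def cost_provenance(
    cost_map_url: str,
    cost_map: Dict[str, dict],
    models_used: Iterable[str],
) -> dict:
    """Build a record of the prices applied to each model, to store with the scores."""
    prices: Dict[str, Optional[dict]] = {}
    for model in sorted(set(models_used)):
        entry = cost_map.get(model)
        # None marks a model that had no price in the cost map.
        prices[model] = (
            None if entry is None
            else {field: entry[field] for field in PRICE_FIELDS if field in entry}
        )
    return {"cost_map_url": cost_map_url, "prices": prices}


# Toy cost map with placeholder prices, for illustration only.
toy_map = {"example/model-a": {"input_cost_per_token": 1e-6,
                               "output_cost_per_token": 2e-6, "mode": "chat"}}
record = cost_provenance("https://example.invalid/cost-map.json", toy_map,
                         ["example/model-a", "example/model-a", "example/model-b"])
print(json.dumps(record, indent=2))
```

Recording only the fields that were used, rather than the whole cost map, keeps the stored data small while still making each cost figure auditable.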

@regan-huff
Contributor Author

> I have a further request if it makes sense. Put the costs used in the data. https://github.com/allenai/agent-eval/compare/update-cost-map-gpt54

That makes sense to me...do you want me to bring those changes into this PR?

@dirkraft

> > I have a further request if it makes sense. Put the costs used in the data. https://github.com/allenai/agent-eval/compare/update-cost-map-gpt54
>
> That makes sense to me...do you want me to bring those changes into this PR?

yes please :)

I'm trying to work out the specific model names for all the newer runs we're trying to get and check whether they're actually in that version of the costs file. If that file is really only a few days old, it's probably(?) fine.
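This check can be scripted: load a downloaded copy of the pinned cost map with `json.load` and list the model names that have no entry. The sketch below is hypothetical and assumes a simple lookup rule (exact name first, then the name with its leading provider segment removed), which may not match how the scoring code actually resolves names.

```python
import json
from typing import Dict, Iterable, List


def load_cost_map(path: str) -> Dict[str, dict]:
    """Load a locally downloaded copy of the pinned cost-map JSON."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def candidate_keys(model: str) -> List[str]:
    """Names to try: the full name, then the name without its first path segment."""
    keys = [model]
    if "/" in model:
        keys.append(model.split("/", 1)[1])
    return keys


def missing_models(cost_map: Dict[str, dict], models: Iterable[str]) -> List[str]:
    """Return the models that have no entry under any candidate key."""
    return sorted(
        m for m in set(models)
        if not any(key in cost_map for key in candidate_keys(m))
    )


# Toy map for illustration; in practice use load_cost_map(...) on the pinned file.
toy_map = {"moonshotai/kimi-k2.5": {}}
print(missing_models(toy_map, ["openrouter/moonshotai/kimi-k2.5",
                               "moonshotai/kimi-k2.5-0127"]))
# prints ['moonshotai/kimi-k2.5-0127']
```

Note that the usage object earlier reports a date-suffixed name (`moonshotai/kimi-k2.5-0127`) rather than the solver-arg name, so the names to check should come from the logged usage, not only from the solver args.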

@regan-huff regan-huff merged commit eba8758 into main Mar 26, 2026
4 checks passed
@regan-huff regan-huff deleted the reganh/bump-litellm branch March 26, 2026 20:12

2 participants