LLM names take into account reasoning effort in model args #69
Conversation
It's fine to handle this one specific model arg given our current results, but in general there's a lack of standardization of arg names (including for options beyond reasoning effort), and it may make sense simply to pass along the dictionary and let the frontend deal with it, e.g. showing the full args dict on hover in the webapp case.
return f"{model_name} (reasoning_effort={effort})"
def get_model_name_aliases(raw_name: str) -> set[str]:
A submission can have multiple results. Each result can have one eval spec with model args (and a model name), as well as a list of model usages (which also have model names). The names we show through lb view and in the leaderboard are based on the names in the model usages.
For a given result within a submission, assuming the model args apply to the model indicated by the model name in the eval spec, we need to figure out which names from model usages correspond to the same model (we can't assume they're the same, e.g. for the case that brought this on, the model name in the eval spec is openai/gpt-5, and in model usages we have gpt-5 and openai/gpt-4o).
My approach here is to map the model name from the eval spec to a set of aliases, and each model name from the model usages to a set of aliases; if there's any overlap, treat that as the two referencing the same model.
Here's how aliases are determined (a rough code sketch follows the example below):
- the raw name provided (lower case)
- if the name is a key in our current LB_MODEL_NAME_MAPPING mapping, the corresponding value (lower case)
- if the name is a key in our current LB_MODEL_NAME_MAPPING mapping and the corresponding value indicates it's unpinned, the corresponding value without the date part (lower case)
Here's what that looks like for our example:
- aliases for the model name in the eval spec: openai/gpt-5 -> {'openai/gpt-5', 'gpt-5', 'gpt-5 (unpinned)'}
- aliases for the model names in the results:
- openai/gpt-4o -> {'openai/gpt-4o', 'gpt-4o', 'gpt-4o (unpinned)'}
- gpt-5 -> {'gpt-5'}
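In code, a minimal sketch of that logic (the mapping entries below are illustrative, not the real `LB_MODEL_NAME_MAPPING`; this assumes map values always end in a parenthesized date or 'unpinned', per the sidenote below):

```python
# Illustrative entries only; the real mapping lives in the leaderboard code.
LB_MODEL_NAME_MAPPING: dict[str, str] = {
    "openai/gpt-5": "gpt-5 (unpinned)",
    "openai/gpt-4o": "gpt-4o (unpinned)",
}


def get_model_name_aliases(raw_name: str) -> set[str]:
    """Lower-cased aliases used to decide if two names reference the same model."""
    aliases = {raw_name.lower()}
    mapped = LB_MODEL_NAME_MAPPING.get(raw_name)
    if mapped is not None:
        mapped = mapped.lower()
        aliases.add(mapped)
        # Map values end in "(YYYY-MM-DD)" or "(unpinned)"; for unpinned
        # models, also alias the bare name without the parenthesized part.
        if mapped.endswith("(unpinned)"):
            aliases.add(mapped[: mapped.rfind("(")].strip())
    return aliases


# get_model_name_aliases("openai/gpt-5")  # {'openai/gpt-5', 'gpt-5 (unpinned)', 'gpt-5'}
# get_model_name_aliases("gpt-5")         # {'gpt-5'}
```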
Does this logic make sense?
(Sidenote: might be helpful to refactor the model name stuff a little at some point, to put some assumptions in code... E.g. the map values all have dates or 'unpinned' between parens at the end, but nothing in the code makes sure that's always true AFAIK.)
Is there a need to map to aliases before joining between the evalspec and model usages? I worry that this may cause some incorrect joins.
Since both are logged through Inspect, I would hope that they would use the same identifier.
There's the additional issue that some people manually logged usages outside of Inspect, so those may not match.
I think we do need it at least for this particular result because they don't match as is (openai/gpt-5 vs gpt-5). Maybe because of the issue you mention ("some people manually logged usages outside of Inspect, so those may not match").
After adding the second new entry to the mapping:
mdarcy220 left a comment
My overall take is that this seems good as an immediate solution for getting reasoning_effort to be represented in lb rows for our runs (which otherwise look confusing with two gpt-5s getting different results), but in the longer term there are going to be many gotchas and we probably need to restructure the way we extract this info.
if looks_like_same_model:
    reasoning_effort = eval_spec.model_args["reasoning_effort"]
    other_name_option = adjust_model_name_for_reasoning_effort(
        model_name=safe_name_option,
        effort=reasoning_effort,
    )
AFAIK this probably works for the specific experiments/agents we used, but in general for agents with multiple models, it's likely to run into mistakes (in particular, with an agent that tries cheap models first and escalates to higher reasoning if it detects a hard problem; this is a known strategy that we want to test at some point).
There are currently limitations for what we can do if usages are coming from outside Inspect, but for usages within Inspect the "model" events each have a property like:
"config": {
"max_retries": 8,
"max_connections": 8,
"reasoning_effort": "medium"
},
which specifies the config for that particular model usage (as opposed to the global default config). The example above is from the ReAct-o3 run on the leaderboard. (There's also a "call" field with the exact request; we could inspect that directly, but that maybe loses the benefit of any normalization Inspect tries to do.)
I would like it if we can first check the direct model-usage config before falling back to the global default (and logging a corresponding warning). Idk how hard that is to set up here but maybe it requires a lot of restructuring, in which case I think it's okay to merge this but we need to make a ticket to revisit when we start getting external submissions or testing mixed-reasoning agents.
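A minimal sketch of that fallback order (hypothetical helper name; assumes model events are available as dicts shaped like the config snippet above):

```python
import logging

logger = logging.getLogger(__name__)


def effort_for_model_event(event: dict, default_model_args: dict) -> str | None:
    """Prefer the per-usage config on an Inspect "model" event; fall back to
    the eval spec's global model_args, logging a warning when we do."""
    config = event.get("config") or {}
    if "reasoning_effort" in config:
        return config["reasoning_effort"]
    if "reasoning_effort" in default_model_args:
        logger.warning(
            "No per-usage reasoning_effort for %s; falling back to global model_args",
            event.get("model"),
        )
        return default_model_args["reasoning_effort"]
    return None
```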
We should also consider reasoning_tokens (Anthropic's finer-grained equivalent of OpenAI's reasoning_effort) at some point, but maybe that's off-topic for this ticket (though seems like it might be trivial to slot into the logic alongside reasoning_effort?).
thinking more...
So if we have gpt-5 and gpt-5 (unpinned), we will end up attaching reasoning effort to both of them? But if you have those two different names, it basically means we really do have a mixture of different model configs (which could also have different reasoning settings). I think if there's more than one variant of the same model used with reasoning_effort, this should be an error until we add per-usage effort extraction. Presumably that wouldn't break anything with our current runs because we shouldn't have mixed variants of the reasoning models (not within the same run, anyway)?
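A hypothetical version of that guard (names are illustrative; it reuses the alias idea from above, and could be softened to warn-and-drop per the suggestion further down):

```python
def check_effort_unambiguous(
    usage_names: set[str], spec_aliases: set[str], effort: str | None
) -> None:
    """Refuse to attach a single reasoning_effort when more than one variant
    of the eval spec's model shows up across the model usages."""
    matching = {
        name for name in usage_names if get_model_name_aliases(name) & spec_aliases
    }
    if effort is not None and len(matching) > 1:
        raise ValueError(
            f"Multiple variants {sorted(matching)} match the eval spec model; "
            "can't attribute a single reasoning_effort without per-usage extraction"
        )
```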
oh one more thing: in theory this will be fine for my runs since I changed args per-model, but I believe it is technically possible to specify --reasoning-effort for models that don't actually do reasoning, e.g. gpt-4o, where it would be silently ignored in the actual requests. So we'd get a bogus name in that case.
src/agenteval/leaderboard/view.py (Outdated)
task_result.model_usages = None
task_result.model_costs = None

models_in_this_task = set([])
Suggested change:
- models_in_this_task = set([])
+ models_in_this_task = set()
oh and just to note: while I think this change will work for our current lb runs, we should make sure to double-check carefully after applying it, since as I mentioned there could be some gotchas
This seems much more reliable! Unfortunately I think that to make that work we would either need to also look at submission logs in the view code, or add this info to what's represented in an lb submission (what goes in the results repo). Both of which seem like non-trivial restructuring. Will open a ticket though (being able to just look at the config seems much better than trying to line up the names, at least when we have that info).
Like surface the total of reasoning_tokens (from ModelUsages) used across the usages for a model as part of the name?
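i.e., roughly this (hypothetical sketch of that interpretation; assumes each usage exposes a reasoning_tokens count, per the "reasoning_tokens (from ModelUsages)" mention above):

```python
def name_with_reasoning_tokens(model_name: str, usages: list) -> str:
    # Hypothetical: total reasoning_tokens across this model's usages,
    # surfaced in the display name analogously to reasoning_effort.
    total = sum(getattr(u, "reasoning_tokens", 0) or 0 for u in usages)
    return f"{model_name} (reasoning_tokens={total})" if total else model_name
```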
I think it's true it wouldn't break anything for our current runs because literally no results have model_args set, and for the specific result where I'm artificially setting it, it's true that there is only one distinct gpt-5-looking name across the model usages. To make sure I follow the specific suggestion though, it's that:
Btw: would suggest maybe we warn and drop the result instead of erroring, to avoid the leaderboard totally crashing if any of these come up.
does this mean I should expect to see model args for existing results? If so, there's something wrong because I don't... And regarding the problem you're pointing out: I see. Would that be fixed if we used the config like you suggest above? If so, I'd suggest we fix it when we do that.
As in double-check that the only displayed model name that has changed is the one we wanted to change?
Ticket for the better version: https://github.com/allenai/astabench-issues/issues/455
After changes based on PR feedback:
FYI here's what the relevant result looks like after adding the model args:
no I meant the
uhh hang on, I definitely set
Update: I have learned that the
yeah it should, since each individual call would store the reasoning level independently
In summary: this is far from an ideal solution, but it gives us a pathway to report effort, so we'll use it for now and intend to revisit it soon with https://github.com/allenai/astabench-issues/issues/455.
Published new library version.

Related to https://github.com/allenai/astabench-issues/issues/199, and more generally supports reflecting non-default reasoning effort settings for models as long as those are reflected in EvalSpec's model_args.
The idea is that when this code is processing model usages, including to construct more formatted names for the models, it also looks at any relevant model args. For a given result, if its model args include reasoning effort, it'll attempt to figure out which model names from the model usages it corresponds to, and add the effort to the formatted model name (rough sketch below).
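A rough sketch of that flow (attribute and helper names here are illustrative, not the actual implementation; `adjust_model_name_for_reasoning_effort` is from the diff above):

```python
def formatted_usage_names(result) -> set[str]:
    """Map each usage's model name to a display name, appending the effort
    when the usage name appears to reference the eval spec's model."""
    effort = result.eval_spec.model_args.get("reasoning_effort")
    spec_aliases = get_model_name_aliases(result.eval_spec.model_name)
    names = set()
    for usage in result.model_usages:
        name = usage.model_name
        if effort is not None and spec_aliases & get_model_name_aliases(name):
            name = adjust_model_name_for_reasoning_effort(model_name=name, effort=effort)
        names.add(name)
    return names
```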
Note: at first I had a more complicated version of this, which would only include the effort if it seemed like the same model didn't end up getting mapped to a name without the effort once all the results in the submission had been processed... But then it occurred to me that since model_args is on the result level, if in theory an effort setting were to exist in the model args for one result but not in the model args for another result in the same submission, it would make sense to end up with both versions of the model name. But let me know if you think that version would be better...
Testing done:

This branch

For the results of interest:
`pclark425_Asta_DataVoyager_2025-08-14T21-31-12`, with reasoning effort in the model args:
Same result, without reasoning effort in the model args:

For a couple of other results:
`aps6992_FutureHouse_FALCON_2025-07-26T17-31-12`
`aps6992_Asta_Scholar_QA__No_Tables__2025-07-23T22-38-03`

vs main

For the results of interest:
`pclark425_Asta_DataVoyager_2025-08-14T21-31-12`, with reasoning effort in the model args:
Same result, without reasoning effort in the model args:

For `aps6992_FutureHouse_FALCON_2025-07-26T17-31-12`:
For `aps6992_Asta_Scholar_QA__No_Tables__2025-07-23T22-38-03`: