LLM names take into account reasoning effort in model args (#69)

Merged
ca16 merged 24 commits into main from chloea-special-llm-names
Aug 22, 2025
Conversation

@ca16 (Collaborator) commented Aug 22, 2025

Related to https://github.com/allenai/astabench-issues/issues/199; more generally, this supports surfacing non-default reasoning effort settings for models, as long as those settings appear in the EvalSpec's model_args.

The idea is that when this code processes model usages (including when constructing formatted names for the models), it also looks at any relevant model args. For a given result, if its model args include a reasoning effort setting, it attempts to figure out which model name(s) from the model usages that setting corresponds to, and appends the effort to the formatted model name.
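As a rough sketch, the adjustment described here boils down to something like the following. `adjust_model_name_for_reasoning_effort` is the helper from this PR's diff; the wrapper around it is illustrative, not the actual leaderboard code:

```python
# Sketch of the name-adjustment step: append the reasoning effort setting
# (from the result's model_args) to an already-formatted model name.
def adjust_model_name_for_reasoning_effort(model_name: str, effort: str) -> str:
    """Append a reasoning-effort annotation to a formatted model name."""
    return f"{model_name} (reasoning_effort={effort})"


def format_with_model_args(model_name: str, model_args: dict) -> str:
    """Annotate the name only when reasoning_effort is present in model_args.

    Illustrative wrapper; the real code does this while processing model usages.
    """
    effort = model_args.get("reasoning_effort")
    if effort is None:
        return model_name
    return adjust_model_name_for_reasoning_effort(model_name=model_name, effort=effort)
```

So a name like `GPT-5 (unpinned)` with `model_args={"reasoning_effort": "minimal"}` becomes `GPT-5 (unpinned) (reasoning_effort=minimal)`, and names without an effort setting pass through unchanged.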

Note: at first I had a more complicated version of this, which would only include the effort if it seemed like the same model didn't end up getting mapped to a name without the effort once all the results in the submission had been processed... But then it occurred to me that since model_args is on the result level, if in theory an effort setting were to exist in the model args for one result but not in the model args for another result in the same submission, it would make sense to end up with both versions of the model name. But let me know if you think that version would be better...

Testing done:

This branch

For the results of interest pclark425_Asta_DataVoyager_2025-08-14T21-31-12, with reasoning effort in the model args:

['gpt-5', 'openai/gpt-4o'] -> ['gpt-5 (reasoning_effort=minimal)', 'GPT-4o (unpinned)']

Same result, without reasoning effort in the model args:

['gpt-5', 'openai/gpt-4o'] -> ['gpt-5', 'GPT-4o (unpinned)']

For a couple of other results:
aps6992_FutureHouse_FALCON_2025-07-26T17-31-12

['openai/gpt-4.1-mini', 'models/gemini-2.5-flash-preview-05-20', 'openai/o3-mini'] -> ['GPT-4.1 Mini (unpinned)', 'Gemini 2.5 Flash (2024-05)', 'o3 Mini (unpinned)']

aps6992_Asta_Scholar_QA__No_Tables__2025-07-23T22-38-03

['anthropic/claude-sonnet-4-20250514'] -> ['Claude Sonnet 4 (2025-05)']

vs main

For the results of interest pclark425_Asta_DataVoyager_2025-08-14T21-31-12, with reasoning effort in the model args:

['gpt-5', 'openai/gpt-4o'] -> ['gpt-5', 'GPT-4o (unpinned)']

Same result, without reasoning effort in the model args:

['gpt-5', 'openai/gpt-4o'] -> ['gpt-5', 'GPT-4o (unpinned)']

For aps6992_FutureHouse_FALCON_2025-07-26T17-31-12:

['openai/gpt-4.1-mini', 'models/gemini-2.5-flash-preview-05-20', 'openai/o3-mini'] -> ['GPT-4.1 Mini (unpinned)', 'Gemini 2.5 Flash (2024-05)', 'o3 Mini (unpinned)']

For aps6992_Asta_Scholar_QA__No_Tables__2025-07-23T22-38-03:

['anthropic/claude-sonnet-4-20250514'] -> ['Claude Sonnet 4 (2025-05)']

@jbragg (Collaborator) commented Aug 22, 2025

It's fine to handle this one specific model arg given our current results, but in general there's a lack of standardization of arg names (including for options beyond reasoning effort), and it may make sense to simply pass along the dictionary and let the frontend deal with it, e.g. show the full args dict on hover in the webapp case.

    return f"{model_name} (reasoning_effort={effort})"


def get_model_name_aliases(raw_name: str) -> set[str]:
@ca16 (Collaborator, Author) commented Aug 22, 2025

A submission can have multiple results. Each result can have one eval spec with model args (and a model name), as well as a list of model usages (which also have model names). The names we show through lb view and in the leaderboard are based on the names in the model usages.

For a given result within a submission, assuming the model args apply to the model indicated by the model name in the eval spec, we need to figure out which names from model usages correspond to the same model (we can't assume they're the same, e.g. for the case that brought this on, the model name in the eval spec is openai/gpt-5, and in model usages we have gpt-5 and openai/gpt-4o).

My approach here is to map the model name from the eval spec to a list of aliases, and each model name from the model usages to a list of aliases, and if there's any overlap, treat that as the two referencing the same model.

Here's how aliases are determined:

  • the raw name provided (lower case)
  • if the name is a key in our current LB_MODEL_NAME_MAPPING mapping, the corresponding value (lower case)
  • if the name is a key in our current LB_MODEL_NAME_MAPPING mapping and the corresponding value indicates it's unpinned, the corresponding value without the trailing '(unpinned)' part (lower case)

Here's what that looks like for our example:

  • aliases for the model name in the eval spec: openai/gpt-5 -> {'openai/gpt-5', 'gpt-5', 'gpt-5 (unpinned)'}
  • aliases for the model names in the results:
    • openai/gpt-4o -> {'openai/gpt-4o', 'gpt-4o', 'gpt-4o (unpinned)'}
    • gpt-5 -> {'gpt-5'}
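The alias rules above can be sketched roughly as follows. The `LB_MODEL_NAME_MAPPING` entries here are a tiny stand-in for the real mapping, just enough to reproduce the example:

```python
# Tiny stand-in for the real leaderboard mapping; only the rule structure matters.
LB_MODEL_NAME_MAPPING = {
    "openai/gpt-5": "GPT-5 (unpinned)",
    "openai/gpt-4o": "GPT-4o (unpinned)",
}


def get_model_name_aliases(raw_name: str) -> set[str]:
    # Rule 1: the raw name provided (lower case).
    aliases = {raw_name.lower()}
    mapped = LB_MODEL_NAME_MAPPING.get(raw_name)
    if mapped is not None:
        # Rule 2: the mapped value (lower case).
        mapped_lower = mapped.lower()
        aliases.add(mapped_lower)
        # Rule 3: for unpinned names, also the value without the trailing
        # "(unpinned)" parenthetical.
        if mapped_lower.endswith("(unpinned)"):
            aliases.add(mapped_lower.removesuffix("(unpinned)").strip())
    return aliases


def same_model(eval_spec_name: str, usage_name: str) -> bool:
    # Any alias overlap is treated as the two names referencing the same model.
    return bool(get_model_name_aliases(eval_spec_name) & get_model_name_aliases(usage_name))
```

With this sketch, `openai/gpt-5` yields `{'openai/gpt-5', 'gpt-5 (unpinned)', 'gpt-5'}` and the bare usage name `gpt-5` yields `{'gpt-5'}`, so the overlap on `gpt-5` joins them, while `openai/gpt-4o` stays separate.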

Does this logic make sense?

(Sidenote: might be helpful to refactor the model name stuff a little at some point, to put some assumptions in code... E.g. the map values all have dates or 'unpinned' between parens at the end, but nothing in the code makes sure that's always true AFAIK.)

Collaborator:

Is there a need to map to aliases before joining between the evalspec and model usages? I worry that this may cause some incorrect joins.
Since both are logged through Inspect, I would hope that they would use the same identifier.
There's the additional issue that some people manually logged usages outside of Inspect, so they may not match.

@ca16 (Collaborator, Author) commented Aug 22, 2025

I think we do need it at least for this particular result because they don't match as is (openai/gpt-5 vs gpt-5). Maybe because of the issue you mention ("There's the additional issue that some people manually logged usages outside of Inspect they may not match.")

@ca16 ca16 requested a review from jbragg August 22, 2025 16:49
@jbragg (Collaborator) left a comment

['gpt-5', 'openai/gpt-4o'] -> ['gpt-5 (reasoning_effort=minimal)', 'GPT-4o (unpinned)']

@ca16 in this example, gpt-5 looks unpinned so I would expect the name to retain unpinned in addition to the reasoning_effort detail

@ca16 (Collaborator, Author) commented Aug 22, 2025

@ca16 in this example, gpt-5 looks unpinned so I would expect the name to retain unpinned in addition to the reasoning_effort detail

It does look unpinned. I think it doesn't have 'unpinned' in the name because we don't have an entry for just gpt-5 in the current model name mapping, so we're falling back to the raw name and then appending the effort to the end of that. I can add an entry for it though.

@ca16 (Collaborator, Author) commented Aug 22, 2025

After adding the second new entry to the mapping:
For pclark425_Asta_DataVoyager_2025-08-14T21-31-12:

['gpt-5', 'openai/gpt-4o'] -> ['GPT-5 (unpinned) (reasoning_effort=minimal)', 'GPT-4o (unpinned)']

@ca16 ca16 requested a review from mdarcy220 August 22, 2025 18:16
@mdarcy220 (Contributor) left a comment

My overall take is that this seems good as an immediate solution for getting reasoning_effort to be represented in lb rows for our runs (which otherwise look confusing with two gpt-5s getting different results), but in the longer term there are going to be many gotchas and we probably need to restructure the way we extract this info.

Comment on lines +421 to +426
if looks_like_same_model:
    reasoning_effort = eval_spec.model_args["reasoning_effort"]
    other_name_option = adjust_model_name_for_reasoning_effort(
        model_name=safe_name_option,
        effort=reasoning_effort,
    )
Contributor:

AFAIK this probably works for the specific experiments/agents we used, but in general for agents with multiple models, it's likely to run into mistakes (in particular, with an agent that tries cheap models first and escalates to higher reasoning if it detects a hard problem; this is a known strategy that we want to test at some point).

There are currently limitations for what we can do if usages are coming from outside Inspect, but for usages within Inspect the "model" events each have a property like:

"config": {
    "max_retries": 8,
    "max_connections": 8,
    "reasoning_effort": "medium"
},

which specifies the config for that particular model usage (as opposed to the global default config). The example above is from the ReAct-o3 run on the leaderboard. (There's also a "call" field with the exact request; we could inspect that directly, but it maybe loses the benefit of any normalization Inspect tries to do.)

I would like it if we can first check the direct model-usage config before falling back to the global default (and logging a corresponding warning). Idk how hard that is to set up here but maybe it requires a lot of restructuring, in which case I think it's okay to merge this but we need to make a ticket to revisit when we start getting external submissions or testing mixed-reasoning agents.

We should also consider reasoning_tokens (Anthropic's finer-grained equivalent of OpenAI's reasoning_effort) at some point, but maybe that's off-topic for this ticket (though seems like it might be trivial to slot into the logic alongside reasoning_effort?).

Contributor:

thinking more...

So if we have gpt-5 and gpt-5 (unpinned), we will end up attaching reasoning effort to both of them? But if you have those two different names, it basically means we really do have a mixture of different model configs (which could also have different reasoning settings). I think if there's more than one variant of the same model used with reasoning_effort, this should be an error until we add per-usage effort extraction. Presumably that wouldn't break anything with our current runs because we shouldn't have mixed variants of the reasoning models (not within the same run, anyway)?

Contributor:

oh, one more thing: in theory this will be fine for my runs since I changed args per-model, but I believe it is technically possible to specify --reasoning-effort for models that don't actually do reasoning, e.g. gpt-4o, where it would be silently ignored in the actual requests. So we'd get a bogus name in that case.

task_result.model_usages = None
task_result.model_costs = None

models_in_this_task = set([])
Contributor:

Suggested change:
- models_in_this_task = set([])
+ models_in_this_task = set()

@mdarcy220 (Contributor) commented:

oh and just to note; while I think this change will work for our current lb runs, we should make sure to double-check carefully after applying it, since as I mentioned there could be some gotchas

@ca16 (Collaborator, Author) commented Aug 22, 2025

I would like it if we can first check the direct model-usage config before falling back to the global default (and logging a corresponding warning). Idk how hard that is to set up here but maybe it requires a lot of restructuring, in which case I think it's okay to merge this but we need to make a ticket to revisit when we start getting external submissions or testing mixed-reasoning agents.

This seems much more reliable! Unfortunately, I think that to make that work we would either need to also look at submission logs in the view code, or add this info to what's represented in an lb submission (what goes in the results repo). Both of which seem like non-trivial restructuring. Will open a ticket though (being able to just look at the config seems much better than trying to line up the names, at least when we have that info).

We should also consider reasoning_tokens (Anthropic's finer-grained equivalent of OpenAI's reasoning_effort) at some point, but maybe that's off-topic for this ticket (though seems like it might be trivial to slot into the logic alongside reasoning_effort?).

Like surface the total of reasoning_tokens (from ModelUsages) used across the usages for a model as part of the name?

So if we have gpt-5 and gpt-5 (unpinned), we will end up attaching reasoning effort to both of them? But if you have those two different names, it basically means we really do have a mixture of different model configs (which could also have different reasoning settings). I think if there's more than one variant of the same model used with reasoning_effort, this should be an error until we add per-usage effort extraction. Presumably that wouldn't break anything with our current runs because we shouldn't have mixed variants of the reasoning models (not within the same run, anyway)?

I think it's true it wouldn't break anything for our current runs because literally no results have model_args set, and for the specific result where I'm artificially setting it, it's true that there is only one distinct gpt-5 looking name across the model usages. To make sure I follow the specific suggestion though, it's that:

  • per result within a specific submission: if you could potentially map the eval spec's model name to multiple different names from the model usages in that one result, error. To illustrate what I'm asking about...
    • Scenario A: We have one submission with two results, A and B. The evalspec in result A has model args with reasoning effort that seems like it's connected to gpt-5, and model usages (perhaps multiple of these) connected to gpt-5 (but all of these model usages have the same name as each other, e.g. openai/gpt-5). The evalspec in result B has no model args, and model usages connected to gpt-5. In this case we'd want to show both gpt-5 with the provided reasoning effort, and gpt-5 with no effort info. We wouldn't error in this case.
    • Scenario B: We have one submission with one result, C. The evalspec in result C has model args with reasoning effort that seem like they're connected to gpt-5, and model usages with at least two different names that both seem connected to gpt-5 but aren't exactly the same name. In this case we would error.

Btw: I'd suggest we warn and drop the result instead of erroring, to avoid the leaderboard crashing entirely if any of these come up.
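The warn-and-drop behavior being discussed could be sketched like this. `matches_eval_spec_model` is a hypothetical predicate standing in for the alias-overlap check in this PR:

```python
# Sketch: if more than one distinct usage name maps to the eval spec's model
# (i.e. mixed variants like gpt-5 and gpt-5 (unpinned)), warn and skip the
# effort annotation rather than error, so the leaderboard doesn't crash.
import logging
from typing import Callable

logger = logging.getLogger(__name__)


def names_to_annotate(
    eval_spec_name: str,
    usage_names: list[str],
    matches_eval_spec_model: Callable[[str, str], bool],
) -> list[str]:
    """Return the usage names safe to annotate with the global reasoning effort."""
    matching = [n for n in usage_names if matches_eval_spec_model(eval_spec_name, n)]
    if len(set(matching)) > 1:
        logger.warning(
            "Multiple variants of %s in model usages (%s); "
            "skipping reasoning_effort annotation until per-usage extraction exists",
            eval_spec_name,
            sorted(set(matching)),
        )
        return []
    return matching
```

Scenario B above (one result, two gpt-5-ish usage names) would hit the warning branch; Scenario A (one variant per result) annotates normally.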

oh, one more thing: in theory this will be fine for my runs since I changed args per-model, but I believe it is technically possible to specify --reasoning-effort for models that don't actually do reasoning, e.g. gpt-4o, where it would be silently ignored in the actual requests. So we'd get a bogus name in that case.

Does this mean I should expect to see model args for existing results? If so, there's something wrong, because I don't...

And I see the problem you're pointing out. Would that be fixed if we used the config like you suggest above, though? If so, I'd suggest we fix it when we do that.

oh and just to note; while I think this change will work for our current lb runs, we should make sure to double-check carefully after applying it, since as I mentioned there could be some gotchas

As in double check the only model name displayed that has changed is the one we wanted to?

@ca16 (Collaborator, Author) commented Aug 22, 2025

Ticket for the better version: https://github.com/allenai/astabench-issues/issues/455

@ca16 (Collaborator, Author) commented Aug 22, 2025

After changes around PR feedback:
pclark425_Asta_DataVoyager_2025-08-14T21-31-12 that has model_args added:

['gpt-5', 'openai/gpt-4o'] -> ['GPT-5 (unpinned) (reasoning_effort=minimal)', 'GPT-4o (unpinned)']

pclark425_Asta_DataVoyager_2025-08-14T21-31-12 that doesn't have model_args added:

['gpt-5', 'openai/gpt-4o'] -> ['GPT-5 (unpinned)', 'GPT-4o (unpinned)']

aps6992_FutureHouse_FALCON_2025-07-26T17-31-12:

['openai/gpt-4.1-mini', 'models/gemini-2.5-flash-preview-05-20', 'openai/o3-mini']  -> ['GPT-4.1 Mini (unpinned)', 'Gemini 2.5 Flash (2024-05)', 'o3 Mini (unpinned)']

aps6992_Asta_Scholar_QA__No_Tables__2025-07-23T22-38-03:

['anthropic/claude-sonnet-4-20250514'] -> ['Claude Sonnet 4 (2025-05)']

FYI here's what the relevant result looks like after adding the model args:
https://huggingface.co/datasets/allenai/asta-bench-internal-results/raw/main/1.0.0/test/pclark425_Asta_DataVoyager_2025-08-14T21-31-12.json
(waiting to propagate it to the new dataset until this merges)

@mdarcy220 (Contributor) commented:

Like surface the total of reasoning_tokens (from ModelUsages) used across the usages

No, I meant the --reasoning-tokens param for Inspect, which controls the maximum reasoning tokens per call (in hindsight I should have recognized the conflict with the number of reasoning tokens actually used; kind of an unfortunate param name).

literally no results have model_args

uhh hang on, I definitely set --reasoning-effort on all the o3 and gpt-5 runs, and --reasoning-tokens on the anthropic runs. Did it not show up?

Update: I have learned that the reasoning_effort etc. args are stored in model_generate_config, not model_args

And around the problem you're pointing out, I see. Would that be fixed if we used the config though

yeah it should, since each individual call would store the reasoning level independently

@ca16 (Collaborator, Author) commented Aug 22, 2025

In summary: this is far from an ideal solution, but it gives us a pathway to report effort, so we'll use it for now and revisit it soon with https://github.com/allenai/astabench-issues/issues/455.

@ca16 ca16 merged commit f12a714 into main Aug 22, 2025
4 checks passed
@ca16 ca16 deleted the chloea-special-llm-names branch August 22, 2025 23:46
@ca16 (Collaborator, Author) commented Aug 22, 2025

Published new library version.

@ca16 (Collaborator, Author) commented Aug 23, 2025

Updated in the leaderboard:

[image: screenshot of the updated leaderboard]
