LLM names take into account reasoning effort in model args #69
Conversation
It's fine to handle this one specific model arg given our current results, but in general there's a lack of standardization of arg names (including for options beyond reasoning effort), and it may make sense simply to pass along the dictionary and let the frontend deal with it, e.g. showing the full args dict on hover in the webapp case.
return f"{model_name} (reasoning_effort={effort})"
def get_model_name_aliases(raw_name: str) -> set[str]:
A submission can have multiple results. Each result can have one eval spec with model args (and a model name), as well as a list of model usages (which also have model names). The names we show through lb view and in the leaderboard are based on the names in the model usages.
For a given result within a submission, assuming the model args apply to the model indicated by the model name in the eval spec, we need to figure out which names from model usages correspond to the same model (we can't assume they're the same, e.g. for the case that brought this on, the model name in the eval spec is openai/gpt-5, and in model usages we have gpt-5 and openai/gpt-4o).
My approach here is to map the model name from the eval spec to a set of aliases, and each model name from the model usages to a set of aliases; if there's any overlap, treat that as the two referencing the same model.
Here's how aliases are determined (a rough code sketch follows the example below):
- the raw name provided (lower case)
- if the name is a key in our current LB_MODEL_NAME_MAPPING mapping, the corresponding value (lower case)
- if the name is a key in our current LB_MODEL_NAME_MAPPING mapping and the corresponding value indicates it's unpinned, the corresponding value without the date part (lower case)
Here's what that looks like for our example:
- aliases for the model name in the eval spec: openai/gpt-5 -> {'openai/gpt-5', 'gpt-5', 'gpt-5 (unpinned)'}
- aliases for the model names in the results:
- openai/gpt-4o -> {'openai/gpt-4o', 'gpt-4o', 'gpt-4o (unpinned)'}
- gpt-5 -> {'gpt-5'}
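In code, a minimal sketch of that logic (the mapping entries below are illustrative, not the real `LB_MODEL_NAME_MAPPING`; this assumes map values always end in a parenthesized date or 'unpinned', per the sidenote below):

```python
# Illustrative entries only; the real mapping lives in the leaderboard code.
LB_MODEL_NAME_MAPPING: dict[str, str] = {
    "openai/gpt-5": "gpt-5 (unpinned)",
    "openai/gpt-4o": "gpt-4o (unpinned)",
}


def get_model_name_aliases(raw_name: str) -> set[str]:
    """Lower-cased aliases used to decide if two names reference the same model."""
    aliases = {raw_name.lower()}
    mapped = LB_MODEL_NAME_MAPPING.get(raw_name)
    if mapped is not None:
        mapped = mapped.lower()
        aliases.add(mapped)
        # Map values end in "(YYYY-MM-DD)" or "(unpinned)"; for unpinned
        # models, also alias the bare name without the parenthesized part.
        if mapped.endswith("(unpinned)"):
            aliases.add(mapped[: mapped.rfind("(")].strip())
    return aliases


# get_model_name_aliases("openai/gpt-5")  # {'openai/gpt-5', 'gpt-5 (unpinned)', 'gpt-5'}
# get_model_name_aliases("gpt-5")         # {'gpt-5'}
```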
Does this logic make sense?
(Sidenote: might be helpful to refactor the model name stuff a little at some point, to put some assumptions in code... E.g. the map values all have dates or 'unpinned' between parens at the end, but nothing in the code makes sure that's always true AFAIK.)
Is there a need to map to aliases before joining between the evalspec and model usages? I worry that this may cause some incorrect joins.
Since both are logged through Inspect, I would hope that they would use the same identifier.
There's the additional issue that some people manually logged usages outside of Inspect, so those may not match.
I think we do need it at least for this particular result because they don't match as is (openai/gpt-5 vs gpt-5). Maybe because of the issue you mention ("some people manually logged usages outside of Inspect, so those may not match").
After adding the second new entry to the mapping:
mdarcy220 left a comment
My overall take is that this seems good as an immediate solution for getting reasoning_effort to be represented in lb rows for our runs (which otherwise look confusing with two gpt-5s getting different results), but in the longer term there are going to be many gotchas and we probably need to restructure the way we extract this info.
if looks_like_same_model:
    reasoning_effort = eval_spec.model_args["reasoning_effort"]
    other_name_option = adjust_model_name_for_reasoning_effort(
        model_name=safe_name_option,
        effort=reasoning_effort,
    )
AFAIK this probably works for the specific experiments/agents we used, but in general for agents with multiple models, it's likely to run into mistakes (in particular, with an agent that tries cheap models first and escalates to higher reasoning if it detects a hard problem; this is a known strategy that we want to test at some point).
There are currently limitations for what we can do if usages are coming from outside Inspect, but for usages within Inspect the "model" events each have a property like:
"config": {
"max_retries": 8,
"max_connections": 8,
"reasoning_effort": "medium"
},
which specifies the config for that particular model usage (as opposed to the global default config). The example above is from the ReAct-o3 run on the leaderboard. (There's also a "call" field with the exact request; we could inspect that directly, but that maybe loses the benefit of any normalization Inspect tries to do.)
I would like it if we can first check the direct model-usage config before falling back to the global default (and logging a corresponding warning). Idk how hard that is to set up here but maybe it requires a lot of restructuring, in which case I think it's okay to merge this but we need to make a ticket to revisit when we start getting external submissions or testing mixed-reasoning agents.
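A minimal sketch of that fallback order (hypothetical helper name; assumes model events are available as dicts shaped like the config snippet above):

```python
import logging

logger = logging.getLogger(__name__)


def effort_for_model_event(event: dict, default_model_args: dict) -> str | None:
    """Prefer the per-usage config on an Inspect "model" event; fall back to
    the eval spec's global model_args, logging a warning when we do."""
    config = event.get("config") or {}
    if "reasoning_effort" in config:
        return config["reasoning_effort"]
    if "reasoning_effort" in default_model_args:
        logger.warning(
            "No per-usage reasoning_effort for %s; falling back to global model_args",
            event.get("model"),
        )
        return default_model_args["reasoning_effort"]
    return None
```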
We should also consider reasoning_tokens (Anthropic's finer-grained equivalent of OpenAI's reasoning_effort) at some point, but maybe that's off-topic for this ticket (though seems like it might be trivial to slot into the logic alongside reasoning_effort?).
thinking more...
So if we have gpt-5 and gpt-5 (unpinned), we will end up attaching reasoning effort to both of them? But if you have those two different names, it basically means we really do have a mixture of different model configs (which could also have different reasoning settings). I think if there's more than one variant of the same model used with reasoning_effort, this should be an error until we add per-usage effort extraction. Presumably that wouldn't break anything with our current runs because we shouldn't have mixed variants of the reasoning models (not within the same run, anyway)?
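A hypothetical version of that guard (names are illustrative; it reuses the alias idea from above, and could be softened to warn-and-drop per the suggestion further down):

```python
def check_effort_unambiguous(
    usage_names: set[str], spec_aliases: set[str], effort: str | None
) -> None:
    """Refuse to attach a single reasoning_effort when more than one variant
    of the eval spec's model shows up across the model usages."""
    matching = {
        name for name in usage_names if get_model_name_aliases(name) & spec_aliases
    }
    if effort is not None and len(matching) > 1:
        raise ValueError(
            f"Multiple variants {sorted(matching)} match the eval spec model; "
            "can't attribute a single reasoning_effort without per-usage extraction"
        )
```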
oh one more thing: in theory this will be fine for my runs since I changed args per-model, but I believe it is technically possible to specify --reasoning-effort for models that don't actually do reasoning, e.g. gpt-4o, where it would be silently ignored in the actual requests. So we'd get a bogus name in that case.
src/agenteval/leaderboard/view.py (Outdated)
task_result.model_usages = None
task_result.model_costs = None

models_in_this_task = set([])
Suggested change:
- models_in_this_task = set([])
+ models_in_this_task = set()
oh and just to note: while I think this change will work for our current lb runs, we should make sure to double-check carefully after applying it, since as I mentioned there could be some gotchas
This seems much more reliable! Unfortunately I think that to make that work we would either need to also look at submission logs in the view code, or add this info to what's represented in an lb submission (what goes in the results repo). Both of which seem like non-trivial restructuring. Will open a ticket though (being able to just look at the config seems much better than trying to line up the names, at least when we have that info).
Like surface the total of reasoning_tokens (from ModelUsages) used across the usages for a model as part of the name?
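i.e., roughly this (hypothetical sketch of that interpretation; assumes each usage exposes a reasoning_tokens count, per the "reasoning_tokens (from ModelUsages)" mention above):

```python
def name_with_reasoning_tokens(model_name: str, usages: list) -> str:
    # Hypothetical: total reasoning_tokens across this model's usages,
    # surfaced in the display name analogously to reasoning_effort.
    total = sum(getattr(u, "reasoning_tokens", 0) or 0 for u in usages)
    return f"{model_name} (reasoning_tokens={total})" if total else model_name
```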
I think it's true it wouldn't break anything for our current runs because literally no results have model_args set, and for the specific result where I'm artificially setting it, it's true that there is only one distinct gpt-5-looking name across the model usages. To make sure I follow the specific suggestion though, it's that:
Btw: would suggest maybe we warn and drop the result instead of erroring, to avoid the leaderboard totally crashing if any of these come up.
does this mean I should expect to see model args for existing results? If so, there's something wrong because I don't... And regarding the problem you're pointing out: I see. Would that be fixed if we used the config like you suggest above? If so, I'd suggest we fix it when we do that.
As in double-check that the only displayed model name that has changed is the one we wanted to change?
Ticket for the better version: https://github.com/allenai/astabench-issues/issues/455
After changes based on PR feedback:
FYI here's what the relevant result looks like after adding the model args:
no I meant the
uhh hang on, I definitely set
Update: I have learned that the
yeah it should, since each individual call would store the reasoning level independently
In summary: this is far from an ideal solution, but it gives us a pathway to report effort, so we'll use it for now and intend to revisit it soon with https://github.com/allenai/astabench-issues/issues/455.
Published new library version.

Related to https://github.com/allenai/astabench-issues/issues/199, and more generally supports reflecting non-default reasoning effort settings for models as long as those are reflected in EvalSpec's model_args.
The idea is that when this code is processing model usages, including to construct more formatted names for the models, it also looks at any relevant model args. For a given result, if its model args include reasoning effort, it'll attempt to figure out which model names from the model usages it corresponds to, and add the effort to the formatted model name (rough sketch below).
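A rough sketch of that flow (attribute and helper names here are illustrative, not the actual implementation; `adjust_model_name_for_reasoning_effort` is from the diff above):

```python
def formatted_usage_names(result) -> set[str]:
    """Map each usage's model name to a display name, appending the effort
    when the usage name appears to reference the eval spec's model."""
    effort = result.eval_spec.model_args.get("reasoning_effort")
    spec_aliases = get_model_name_aliases(result.eval_spec.model_name)
    names = set()
    for usage in result.model_usages:
        name = usage.model_name
        if effort is not None and spec_aliases & get_model_name_aliases(name):
            name = adjust_model_name_for_reasoning_effort(model_name=name, effort=effort)
        names.add(name)
    return names
```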
Note: at first I had a more complicated version of this, which would only include the effort if it seemed like the same model didn't end up getting mapped to a name without the effort once all the results in the submission had been processed... But then it occurred to me that since model_args is on the result level, if in theory an effort setting were to exist in the model args for one result but not in the model args for another result in the same submission, it would make sense to end up with both versions of the model name. But let me know if you think that version would be better...
Testing done:

This branch

For the results of interest:
`pclark425_Asta_DataVoyager_2025-08-14T21-31-12`, with reasoning effort in the model args:
Same result, without reasoning effort in the model args:

For a couple of other results:
`aps6992_FutureHouse_FALCON_2025-07-26T17-31-12`
`aps6992_Asta_Scholar_QA__No_Tables__2025-07-23T22-38-03`

vs main

For the results of interest:
`pclark425_Asta_DataVoyager_2025-08-14T21-31-12`, with reasoning effort in the model args:
Same result, without reasoning effort in the model args:

For `aps6992_FutureHouse_FALCON_2025-07-26T17-31-12`:
For `aps6992_Asta_Scholar_QA__No_Tables__2025-07-23T22-38-03`: