LLM names take into account reasoning effort in model args #69
Changes from 19 commits
```diff
@@ -16,6 +16,7 @@
 from .. import compute_summary_statistics
 from ..config import SuiteConfig
 from ..score import EvalSpec
+from .model_name_mapping import LB_MODEL_NAME_MAPPING
 from .models import LeaderboardSubmission
```
@@ -366,6 +367,79 @@ def construct_reproducibility_url(task_revisions: list[EvalRevision]) -> str | None

```python
    return source_url


def adjust_model_name_for_reasoning_effort(model_name: str, effort: str) -> str:
    return f"{model_name} (reasoning_effort={effort})"


def get_model_name_aliases(raw_name: str) -> set[str]:
    aliases = {raw_name}
    if raw_name in LB_MODEL_NAME_MAPPING:
        # "pretty" just means a value in our LB_MODEL_NAME_MAPPING map
        pretty_name = LB_MODEL_NAME_MAPPING[raw_name]
        aliases.add(pretty_name)

        # if the pretty name suggests it's unpinned,
        # include the pretty version without the date part
        open_paren_index = pretty_name.rindex("(")
        name_date = pretty_name[open_paren_index:].strip()
        if name_date == "(unpinned)":
            dateless_pretty_name = pretty_name[:open_paren_index].strip()
            aliases.add(dateless_pretty_name)
    return {a.lower() for a in aliases}
```
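As a rough, standalone illustration of the alias expansion (the mapping entries below are made up; the real values live in `LB_MODEL_NAME_MAPPING`):

```python
# Hypothetical mapping entries -- illustrative only, not the real
# LB_MODEL_NAME_MAPPING contents.
LB_MODEL_NAME_MAPPING = {
    "openai/gpt-5": "GPT-5 (unpinned)",
    "anthropic/claude-3-5-sonnet-20241022": "Claude 3.5 Sonnet (2024-10-22)",
}


def get_model_name_aliases(raw_name: str) -> set[str]:
    aliases = {raw_name}
    if raw_name in LB_MODEL_NAME_MAPPING:
        pretty_name = LB_MODEL_NAME_MAPPING[raw_name]
        aliases.add(pretty_name)
        open_paren_index = pretty_name.rindex("(")
        if pretty_name[open_paren_index:].strip() == "(unpinned)":
            aliases.add(pretty_name[:open_paren_index].strip())
    return {a.lower() for a in aliases}


# Unpinned pretty name: raw, pretty, and dateless pretty all become aliases.
assert get_model_name_aliases("openai/gpt-5") == {
    "openai/gpt-5",
    "gpt-5 (unpinned)",
    "gpt-5",
}
# Pinned pretty name: no dateless alias is added.
assert get_model_name_aliases("anthropic/claude-3-5-sonnet-20241022") == {
    "anthropic/claude-3-5-sonnet-20241022",
    "claude 3.5 sonnet (2024-10-22)",
}
```

Note that a raw name absent from the mapping just maps to itself, lowercased.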
```python
def format_model_names_for_one_result(
    raw_names: set[str], eval_spec: EvalSpec | None
) -> dict[str, str]:
    to_return: dict[str, str] = {}

    if (
        (eval_spec is not None)
        and (eval_spec.model_args is not None)
        and (isinstance(eval_spec.model_args, dict))
        and ("reasoning_effort" in eval_spec.model_args)
    ):
        consider_eval_spec = True
        spec_model_name_aliases = get_model_name_aliases(eval_spec.model)
    else:
        consider_eval_spec = False
        spec_model_name_aliases = None

    for raw_name in raw_names:
        safe_name_option = LB_MODEL_NAME_MAPPING.get(raw_name, raw_name)
        other_name_option = None

        if consider_eval_spec:
            # make mypy happy
            assert eval_spec is not None
            assert spec_model_name_aliases is not None
            assert isinstance(eval_spec.model_args, dict)
            raw_name_aliases = get_model_name_aliases(raw_name)
            looks_like_same_model = (
                len(raw_name_aliases.intersection(spec_model_name_aliases)) > 0
            )
            if looks_like_same_model:
                reasoning_effort = eval_spec.model_args["reasoning_effort"]
                other_name_option = adjust_model_name_for_reasoning_effort(
                    model_name=safe_name_option,
                    effort=reasoning_effort,
                )
```
Comment on lines +421 to +432

Contributor: AFAIK this probably works for the specific experiments/agents we used, but in general, for agents with multiple models, it's likely to run into mistakes (in particular, with an agent that tries cheap models first and escalates to higher reasoning if it detects a hard problem; this is a known strategy that we want to test at some point). There are currently limitations for what we can do if usages are coming from outside Inspect, but usages within Inspect specify the config for that particular model usage (as opposed to the global default config). That example above is from the ReAct-o3 run on the leaderboard. I would like it if we can first check the direct model-usage config before falling back to the global default (and log a corresponding warning). I don't know how hard that is to set up here; maybe it requires a lot of restructuring, in which case I think it's okay to merge this, but we need to make a ticket to revisit when we start getting external submissions or testing mixed-reasoning agents. We should also consider

Contributor: thinking more... So if we have

Contributor: oh one more thing: in theory this will be fine for my runs since I changed args per-model, but I believe it is technically possible to specify
```python
        to_use = safe_name_option if other_name_option is None else other_name_option
        to_return[raw_name] = to_use

    return to_return
```
```python
def merge_in_formatted_names_from_one_result(
    so_far: dict[str, set[str]], from_one_result: dict[str, str]
):
    for k, v in from_one_result.items():
        if k not in so_far:
            so_far[k] = set()
        so_far[k].add(v)


def _get_dataframe(
    eval_results: datasets.DatasetDict,
    split: str,
```
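The values accumulate as sets because the same raw name can be formatted differently across results within one submission. A small standalone demonstration (the names are illustrative):

```python
# Repeated from the diff above so this snippet runs on its own.
def merge_in_formatted_names_from_one_result(
    so_far: dict[str, set[str]], from_one_result: dict[str, str]
):
    for k, v in from_one_result.items():
        if k not in so_far:
            so_far[k] = set()
        so_far[k].add(v)


accumulated: dict[str, set[str]] = {}
# Two results format the same raw name differently (e.g. only one of
# them ran with a reasoning_effort model arg):
merge_in_formatted_names_from_one_result(
    accumulated, {"gpt-5": "gpt-5 (reasoning_effort=high)"}
)
merge_in_formatted_names_from_one_result(accumulated, {"gpt-5": "gpt-5"})
# accumulated["gpt-5"] now holds both formatted variants.
```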
@@ -397,6 +471,9 @@ def _get_dataframe(

```python
    )

    model_token_counts: dict[str, int] = {}
    # formatted model names
    raw_names_to_formatted_names: dict[str, set[str]] = {}

    if ev.results:
        for task_result in ev.results:
```
@@ -407,6 +484,7 @@ def _get_dataframe(

```python
            task_result.model_usages = None
            task_result.model_costs = None

            models_in_this_task = set([])
```

Suggested change:

```diff
-            models_in_this_task = set([])
+            models_in_this_task = set()
```
A submission can have multiple results. Each result can have one eval spec with model args (and a model name), as well as a list of model usages (which also have model names). The names we show through lb view and in the leaderboard are based on the names in the model usages.

For a given result within a submission, assuming the model args apply to the model indicated by the model name in the eval spec, we need to figure out which names from the model usages correspond to the same model (we can't assume they're the same; e.g. for the case that brought this on, the model name in the eval spec is openai/gpt-5, and in model usages we have gpt-5 and openai/gpt-4o). My approach here is to map the model name from the eval spec to a set of aliases, and each model name from the model usages to a set of aliases, and if there's any overlap, treat that as the two referencing the same model.

Here's how aliases are determined: the raw name is always an alias; if the raw name has a pretty name in LB_MODEL_NAME_MAPPING, the pretty name is an alias too; if the pretty name ends in "(unpinned)", the pretty name without that suffix is also an alias; and everything is lowercased before comparing.

Does this logic make sense?

(Sidenote: might be helpful to refactor the model name stuff a little at some point, to put some assumptions in code... E.g. the map values all have dates or "unpinned" between parens at the end, but nothing in the code makes sure that's always true AFAIK.)
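For the openai/gpt-5 vs gpt-5 case, the overlap check might look like this sketch (the mapping entry is assumed for illustration, not the real LB_MODEL_NAME_MAPPING value):

```python
# Assumed mapping entry -- illustrative only.
mapping = {"openai/gpt-5": "GPT-5 (unpinned)"}


def alias_set(raw: str) -> set[str]:
    out = {raw}
    if raw in mapping:
        pretty = mapping[raw]
        out.add(pretty)
        if pretty.endswith("(unpinned)"):
            out.add(pretty[: pretty.rindex("(")].strip())
    return {a.lower() for a in out}


spec_aliases = alias_set("openai/gpt-5")  # includes 'gpt-5' via the dateless pretty name
assert spec_aliases & alias_set("gpt-5")  # overlap on 'gpt-5' -> same model
assert not spec_aliases & alias_set("openai/gpt-4o")  # no overlap -> different model
```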
Is there a need to map to aliases before joining between the eval spec and model usages? I worry that this may cause some incorrect joins. Since both are logged through Inspect, I would hope that they would use the same identifier. There's the additional issue that, for usages some people manually logged outside of Inspect, the names may not match.
I think we do need it, at least for this particular result, because they don't match as is (openai/gpt-5 vs gpt-5). Maybe because of the issue you mention ("There's the additional issue that some people manually logged usages outside of Inspect they may not match.").