support structured outputs in hle judge for optional AA compatibility #1186
In `nemo_skills/inference/generate.py`:

```diff
@@ -46,6 +46,7 @@
     server_params,
 )
 from nemo_skills.inference.model.base import EndpointType
+from nemo_skills.inference.structured_outputs import STRUCTURED_OUTPUTS
 from nemo_skills.prompt.utils import get_prompt, get_token_count
 from nemo_skills.utils import (
     chunk_data,
```
```diff
@@ -218,6 +219,8 @@ class GenerationTaskConfig:
     eval_type: str | None = None  # "lean4-proof", "math", etc.
     eval_config: dict = field(default_factory=dict)  # Config for the evaluator

+    structured_output: str | None = None
+
     def __post_init__(self):
         self._post_init_validate_data()
         self._post_init_validate_server()
```
```diff
@@ -630,6 +633,14 @@ async def postprocess_single_output(self, output, original_data_point):
         # all of the original data to the output file alongside the new generations
         output[self.cfg.generation_key] = output.pop("generation")

+        if self.cfg.structured_output == "HLE_JUDGE_AA":
+            try:
+                output[self.cfg.generation_key] = "Judgement: {}".format(
+                    json.loads(output[self.cfg.generation_key])["correct"]
+                )
+            except (json.JSONDecodeError, KeyError):
+                output[self.cfg.generation_key] = "Judgement: FAILED_TO_PARSE"
+
         if not self.cfg.add_generation_stats:
             output.pop("generation_start_time", None)
             output.pop("generation_end_time", None)
```
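The postprocessing branch above can be exercised in isolation. The sketch below re-implements it as a standalone helper (the name `postprocess_judgement` is hypothetical, not part of the PR) to show how both the happy path and malformed judge output behave:

```python
import json


def postprocess_judgement(generation: str) -> str:
    """Hypothetical standalone version of the try/except above: collapse a
    structured judge response into the plain 'Judgement: ...' string that
    downstream aggregation code already expects."""
    try:
        return "Judgement: {}".format(json.loads(generation)["correct"])
    except (json.JSONDecodeError, KeyError):
        return "Judgement: FAILED_TO_PARSE"


# A well-formed structured output collapses to the legacy string:
print(postprocess_judgement('{"correct": "yes", "confidence": 90}'))  # Judgement: yes
# Malformed output degrades gracefully instead of raising:
print(postprocess_judgement("free-form text"))  # Judgement: FAILED_TO_PARSE
```

Catching both `json.JSONDecodeError` and `KeyError` means valid JSON that lacks a `correct` key falls through to `FAILED_TO_PARSE` as well, rather than crashing the generation pipeline.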
```diff
@@ -681,6 +692,9 @@ async def process_single_datapoint(self, data_point, all_data):
             "stop_phrases": [self.cfg.stop_phrase] if self.cfg.stop_phrase else None,
         }

+        if self.cfg.structured_output in STRUCTURED_OUTPUTS:
+            generation_params["response_format"] = STRUCTURED_OUTPUTS[self.cfg.structured_output]
+
         if self.cfg.code_execution:
             if self.cfg.override_max_code_executions and self.cfg.total_code_executions_in_prompt is not None:
                 generation_params["max_code_executions"] = data_point["total_code_executions"]
```

Contributor comment on lines +694 to +696: Unhandled invalid key. When …
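For servers that accept a JSON-schema `response_format` rather than a pydantic class directly, the model can be lowered to a schema dict. This is a sketch under that assumption (the exact wire format varies by server), using pydantic v2's `model_json_schema`:

```python
from typing import Literal

from pydantic import BaseModel


class HLEJudgeAAResponseFormat(BaseModel):
    extracted_final_answer: str
    reasoning: str
    correct: Literal["yes", "no"]
    confidence: int


# OpenAI-style json_schema response_format built from the pydantic model
# (sketch only; servers differ in the exact shape they accept):
response_format = {
    "type": "json_schema",
    "json_schema": {
        "name": "HLEJudgeAAResponseFormat",
        "schema": HLEJudgeAAResponseFormat.model_json_schema(),
    },
}
print(sorted(response_format["json_schema"]["schema"]["properties"]))
```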
New file `nemo_skills/inference/structured_outputs.py`:

```python
# Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

from typing import Literal

from pydantic import BaseModel


class HLEJudgeAAResponseFormat(BaseModel):
    extracted_final_answer: str
    reasoning: str
    correct: Literal["yes", "no"]
    confidence: int


STRUCTURED_OUTPUTS = {
    "HLE_JUDGE_AA": HLEJudgeAAResponseFormat,
}
```
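Since the registry values are pydantic models, a judge response can also be validated directly against the schema. A small sketch, assuming pydantic v2:

```python
from typing import Literal

from pydantic import BaseModel, ValidationError


class HLEJudgeAAResponseFormat(BaseModel):
    extracted_final_answer: str
    reasoning: str
    correct: Literal["yes", "no"]
    confidence: int


raw = (
    '{"extracted_final_answer": "42", "reasoning": "matches the reference", '
    '"correct": "yes", "confidence": 95}'
)
judgement = HLEJudgeAAResponseFormat.model_validate_json(raw)
print(judgement.correct)  # yes

# Literal["yes", "no"] rejects any other value for `correct`:
try:
    HLEJudgeAAResponseFormat.model_validate_json(raw.replace('"yes"', '"maybe"'))
except ValidationError:
    print("rejected")
```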
Contributor: It is not a good idea to hard-code `HLE_JUDGE_AA` in `generate.py`. Can we build a function to handle that, like the existing pattern at `nemo_skills/inference/generate.py` line 658 (commit 54c8bc0)?
Reviewer: @anowaczynski-nvidia can we move this logic into metrics? Why does it need to be in the generation?
Author (@anowaczynski-nvidia): Reasons I added the `if` with postprocessing here:

- `++structured_output=HLE_JUDGE_AA` needs to be added in only one place (the judge generations pipeline command)
- the `summarize_results` command and the pipeline logic for aggregating HLE judge outputs into metrics don't require any modifications (the same command and code handle both the default and AA-compatible judges)

I am aware this code is fundamental to the entire package; all generations pass through it.

Regarding moving this to metrics: I see the possibility to create `hleaa_metrics.py` in `evaluation/metrics`, inherit from `MathMetrics`, and override only `_get_score_dict`, so that postprocessing of the judgement (parsing into JSON etc.) is applied before `is_correct_judgement`. Do you approve this plan?
Reviewer: Yes, either that, or we can just have this as an option for the main math metrics, so that any dataset, not just HLE, can be evaluated in this setup. The one problem is I am not fully sure if metrics are currently customizable, but I guess if not, then we should enable customization in a similar way to how it's done for eval/generation parameters. Let me know if you need help with the design on that; happy to discuss in more detail.
Author: @Kipok I tried the hard way first, but nothing I created was correct and convincing, so I pushed one commit with the `class HLEAAMetrics(MathMetrics)` solution, as it was conceptually much simpler. The main downside is that I had to add `metric_type` to the eval command. It doesn't look right here, and it doesn't compose with the eval-on-multiple-benchmarks idea. Can you take a look? If we're doing the Metrics Config idea, I need a sync on how to approach it.
Reviewer: I think this is the right approach. When doing eval on multiple benchmarks you can't really customize anything except maybe inference parameters; e.g. a prompt change or changed eval arguments will also break things, so I think adding `metric_type` is a good change. An alternative would be to add this as an argument to `MathMetrics` and then reuse the existing `metric_kwargs` parameter to customize it. But adding `metric_type` is a good change anyway, given that we already support `metric_kwargs`.

If the current implementation fully works for you, I think it LGTM as well and we can merge it. But do let me know if you have any concerns or think we should do things differently.
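The subclass shape discussed above can be sketched as follows. `MathMetrics` is stubbed here for illustration, and the real `_get_score_dict` signature in nemo_skills may differ, so treat this as a conceptual sketch rather than the PR's actual implementation:

```python
import json


class MathMetrics:
    """Minimal stand-in for nemo_skills' MathMetrics (stub, not the real class)."""

    def _get_score_dict(self, prediction: dict) -> dict:
        # The real class inspects a "Judgement: ..." string; stubbed here.
        return {"judgement": prediction["judgement"]}


class HLEAAMetrics(MathMetrics):
    """Sketch of the proposed subclass: parse the structured judge output
    back into the legacy 'Judgement: ...' string before scoring."""

    def _get_score_dict(self, prediction: dict) -> dict:
        try:
            correct = json.loads(prediction["judgement"])["correct"]
            judgement = f"Judgement: {correct}"
        except (json.JSONDecodeError, KeyError):
            judgement = "Judgement: FAILED_TO_PARSE"
        return super()._get_score_dict({**prediction, "judgement": judgement})


print(HLEAAMetrics()._get_score_dict({"judgement": '{"correct": "no"}'}))
```

This keeps the normalization in one override, so everything downstream of `_get_score_dict` (including `is_correct_judgement`) sees the same string format for both default and AA-compatible judges.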
Reviewer: It's probably a good idea to add a new test for this in `test_generation.py`, but only if models on build.nvidia.com support this `response_format` argument.
Author: Added `test_judge_generations_with_structured_output`, but it takes 10 minutes to complete even with `max_samples=2`. Obviously this can't be merged; where do we go from here?
Reviewer: Thanks @anowaczynski-nvidia, pushed a change to limit max tokens (since we aren't checking generation correctness anyway); it seems to finish very fast now!