support structured outputs in hle judge for optional AA compatibility #1186
base: main
Conversation
Signed-off-by: Arkadiusz Nowaczynski <[email protected]>
Signed-off-by: Arkadiusz Nowaczynski <[email protected]>
Greptile Overview
Greptile Summary
This PR adds structured output support to enable optional AA compatibility in the HLE benchmark implementation. The changes introduce a new `nemo_skills/inference/structured_outputs.py` module with predefined response formats, selected via a new `structured_output` config option.
Confidence Score: 4/5
Important Files Changed
Sequence Diagram
sequenceDiagram
participant User
participant GenerationTask
participant STRUCTURED_OUTPUTS
participant Model
participant LiteLLM
participant LLM API
User->>GenerationTask: configure structured_output="HLE_JUDGE_AA"
GenerationTask->>GenerationTask: process_single_datapoint()
GenerationTask->>STRUCTURED_OUTPUTS: lookup HLE_JUDGE_AA
STRUCTURED_OUTPUTS-->>GenerationTask: return HLEJudgeAAResponseFormat (Pydantic class)
GenerationTask->>Model: generate_async(response_format=HLEJudgeAAResponseFormat)
Model->>Model: _build_chat_request_params(response_format=...)
Model->>LiteLLM: acompletion(response_format=HLEJudgeAAResponseFormat)
LiteLLM->>LLM API: POST with structured output schema
LLM API-->>LiteLLM: JSON response matching schema
LiteLLM-->>Model: response
Model-->>GenerationTask: generation result
GenerationTask->>GenerationTask: postprocess_single_output()
GenerationTask->>GenerationTask: parse JSON and extract "correct" field
alt JSON parsing succeeds
GenerationTask->>GenerationTask: format as "Judgement: {correct}"
else JSON parsing fails
GenerationTask->>GenerationTask: fallback to "Judgement: FAILED_TO_PARSE"
end
GenerationTask-->>User: final output
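For readers following the diagram, here is a hedged sketch of what the `STRUCTURED_OUTPUTS` registry and the `HLEJudgeAAResponseFormat` Pydantic class could look like; the field names come from the review comments further down, but the field types and overall layout are assumptions, not the PR's exact code.

```python
# Hedged sketch of nemo_skills/inference/structured_outputs.py; field names come
# from the review comments, field types are assumptions.
from typing import Literal

from pydantic import BaseModel


class HLEJudgeAAResponseFormat(BaseModel):
    """Schema the judge model must follow when structured_output=HLE_JUDGE_AA."""

    extracted_final_answer: str
    reasoning: str
    correct: Literal["yes", "no"]
    confidence: int


# Registry looked up by GenerationTask; the matching value is passed to
# generate_async(..., response_format=...).
STRUCTURED_OUTPUTS = {"HLE_JUDGE_AA": HLEJudgeAAResponseFormat}
```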
📝 Walkthrough
Adds structured output support (HLE_JUDGE_AA) throughout the inference pipeline by introducing a `structured_output` config option and a registry of predefined response formats.
Sequence Diagram
sequenceDiagram
participant Config as GenerationTaskConfig
participant Generate as generate.py
participant BaseModel as BaseModel
participant ModelImpl as Model Implementation
participant LiteLLM as LiteLLM/API
Config->>Generate: structured_output="HLE_JUDGE_AA"
Generate->>Generate: process_single_datapoint()
Generate->>BaseModel: generate_async(..., response_format=HLEJudgeAAResponseFormat)
BaseModel->>ModelImpl: _build_chat_request_params(..., response_format)
alt Supports structured (OpenAI/SGLang/VLLM chat)
ModelImpl->>ModelImpl: Include response_format in request dict
else Rejects structured (Gemini/Megatron)
ModelImpl->>ModelImpl: assert response_format is None
end
ModelImpl->>LiteLLM: Send request with response_format
LiteLLM-->>ModelImpl: JSON response {correct: "yes", ...}
ModelImpl-->>BaseModel: Response
BaseModel-->>Generate: Raw generation
Generate->>Generate: postprocess_single_output()
Generate->>Generate: Parse JSON, extract "correct" field
Generate-->>Config: Judgement: yes/no
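As a self-contained illustration of the last hop in this diagram, the snippet below passes a Pydantic class as `response_format` through LiteLLM; this is not the PR's code, and the model name is a placeholder.

```python
# Standalone illustration (not the PR's code) of forwarding a Pydantic class as
# response_format through LiteLLM; the model name is a placeholder.
import asyncio
from typing import Literal

import litellm
from pydantic import BaseModel


class HLEJudgeAAResponseFormat(BaseModel):
    extracted_final_answer: str
    reasoning: str
    correct: Literal["yes", "no"]
    confidence: int


async def main():
    response = await litellm.acompletion(
        model="openai/gpt-4o-mini",  # placeholder
        messages=[{"role": "user", "content": "Judge whether the answer matches the reference ..."}],
        # For providers with structured output support, LiteLLM converts the
        # Pydantic class into a JSON-schema response_format for the API call.
        response_format=HLEJudgeAAResponseFormat,
    )
    print(response.choices[0].message.content)  # JSON string matching the schema


if __name__ == "__main__":
    asyncio.run(main())
```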
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes
🚥 Pre-merge checks: ✅ 2 passed | ❌ 1 failed
❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
1 file reviewed, 1 comment
Actionable comments posted: 3
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
nemo_skills/inference/model/megatron.py (2)
39-53: Replace `assert` statements with explicit exceptions for parameter validation.

Both `_build_chat_request_params` and `_build_completion_request_params` use `assert` to validate the `tools` and `response_format` parameters. Since `assert` statements can be disabled with Python's `-O` flag, unsupported parameters would silently pass through, violating the guideline to "fail loudly" on invalid inputs. Other parameters in the same validation block correctly use explicit `if` + `raise NotImplementedError`, so match that pattern consistently.

Proposed fix

- assert kwargs.get("tools") is None, "Megatron server does not support tools parameter."
- assert response_format is None, "Megatron server does not support response_format parameter."
+ if kwargs.get("tools") is not None:
+     raise NotImplementedError("Megatron server does not support tools parameter.")
+ if response_format is not None:
+     raise NotImplementedError("Megatron server does not support response_format parameter.")

Apply this fix to both methods.
86-100: Replace assert statements with explicit exceptions for parameter validation.

Both the `tools` and `response_format` parameters use assert statements, which can be disabled with Python's `-O` flag, causing silent failures. This violates the coding guideline to "Let the code fail with clear errors instead of silently misbehaving" and "Avoid silently ignoring unused user-passed parameters." This same pattern appears in two methods (around lines 51-52 and 98-99). Replace all four assert statements with explicit `if` + `raise NotImplementedError()` to match the approach used for other unsupported parameters (`stream`, `min_p`, `repetition_penalty`, `top_k`).

Example fix for lines 98-99

- assert kwargs.get("tools") is None, "Megatron server does not support tools parameter."
- assert response_format is None, "Megatron server does not support response_format parameter."
+ if kwargs.get("tools") is not None:
+     raise NotImplementedError("Megatron server does not support tools parameter.")
+ if response_format is not None:
+     raise NotImplementedError("Megatron server does not support response_format parameter.")
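To make the `-O` concern concrete, here is a tiny standalone example (not project code): the first variant loses its check when Python runs with `-O`, while the second fails loudly either way.

```python
# Standalone illustration of the review's point about `python -O`.
def build_params_with_assert(response_format=None):
    # Stripped entirely under `python -O`, so an unsupported response_format
    # would be silently ignored.
    assert response_format is None, "response_format is not supported here."
    return {}


def build_params_with_raise(response_format=None):
    # Survives -O and fails loudly, matching the pattern used for other
    # unsupported parameters.
    if response_format is not None:
        raise NotImplementedError("response_format is not supported here.")
    return {}
```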
🤖 Fix all issues with AI agents
In `@nemo_skills/inference/generate.py`:
- Around line 695-696: The code silently ignores unknown structured_output
values; add a validation in GenerationTaskConfig.__post_init__ (or call a helper
_post_init_validate_params from __post_init__) that checks if
self.structured_output is not None and not in STRUCTURED_OUTPUTS and raise a
ValueError listing the invalid value and valid keys (referencing
STRUCTURED_OUTPUTS and the attribute structured_output); this ensures
process_single_datapoint/generation_params population logic (where
generation_params["response_format"] is set) never silently drops an unsupported
structured_output.
In `@nemo_skills/inference/structured_outputs.py`:
- Around line 1-2: Add the standard NVIDIA copyright header at the very top of
the module (above the imports) in nemo_skills/inference/structured_outputs.py so
the file begins with the required multi-line copyright notice; do not alter the
existing imports (from typing import Literal, from pydantic import
BaseModel)—just prepend the header block exactly as the project's canonical
NVIDIA header.
- Around line 5-10: The HLEJudgeAAResponseFormat model wrongly includes a
non-response field strict: Literal[True]; remove the strict attribute from the
class so the model only defines extracted_final_answer, reasoning, correct, and
confidence, and then remove any now-unused imports (e.g., Literal[True] or
Literal if no longer needed); ensure any strict:true configuration is applied at
the OpenAI request/schema configuration level rather than as a field on
HLEJudgeAAResponseFormat.
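For reference, a hedged sketch of where `strict` belongs when the request is built by hand with an OpenAI-style JSON-schema `response_format`; the schema contents below are illustrative, not the PR's exact schema.

```python
# Illustrative only: `strict` lives in the request's json_schema wrapper,
# not as a field on the Pydantic response model.
response_format = {
    "type": "json_schema",
    "json_schema": {
        "name": "hle_judge_aa",
        "strict": True,
        "schema": {
            "type": "object",
            "properties": {
                "extracted_final_answer": {"type": "string"},
                "reasoning": {"type": "string"},
                "correct": {"type": "string", "enum": ["yes", "no"]},
                "confidence": {"type": "integer"},
            },
            "required": ["extracted_final_answer", "reasoning", "correct", "confidence"],
            "additionalProperties": False,
        },
    },
}
```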
🧹 Nitpick comments (2)
nemo_skills/inference/model/base.py (1)
239-239: Consider adding a type annotation for consistency.

The `response_format` parameter lacks a type annotation while other parameters in this method have them. Consider adding a type hint for consistency.

Proposed fix

- response_format = None,
+ response_format: dict | None = None,

nemo_skills/inference/generate.py (1)
636-642: Remove unused exception variable and consider logging the failure.

The exception variable `e` is assigned but never used (also flagged by static analysis). Additionally, silently setting `FAILED_TO_PARSE` without logging could make debugging difficult when generation fails to parse.

Proposed fix

  if self.cfg.structured_output == "HLE_JUDGE_AA":
      try:
          output[self.cfg.generation_key] = "Judgement: {}".format(
              json.loads(output[self.cfg.generation_key])["correct"]
          )
-     except json.JSONDecodeError as e:
+     except json.JSONDecodeError:
+         LOG.warning(
+             "Failed to parse structured output as JSON: %s",
+             output[self.cfg.generation_key][:200] if output[self.cfg.generation_key] else "<empty>",
+         )
          output[self.cfg.generation_key] = "Judgement: FAILED_TO_PARSE"
if self.cfg.structured_output in STRUCTURED_OUTPUTS:
    generation_params["response_format"] = STRUCTURED_OUTPUTS[self.cfg.structured_output]
Consider validating structured_output against registry early.
If a user specifies a structured_output value that's not in STRUCTURED_OUTPUTS, the code silently ignores it without injecting response_format. This could lead to unexpected behavior. Per coding guidelines, the code should fail if a user specifies an unsupported argument.
Proposed fix in `__post_init__` or `process_single_datapoint`
Add validation in GenerationTaskConfig.__post_init__:
def _post_init_validate_params(self):
# ... existing validations ...
if self.structured_output is not None and self.structured_output not in STRUCTURED_OUTPUTS:
raise ValueError(
f"Unknown structured_output '{self.structured_output}'. "
f"Valid options: {list(STRUCTURED_OUTPUTS.keys())}"
    )
🤖 Prompt for AI Agents
In `@nemo_skills/inference/generate.py` around lines 695 - 696, The code silently
ignores unknown structured_output values; add a validation in
GenerationTaskConfig.__post_init__ (or call a helper _post_init_validate_params
from __post_init__) that checks if self.structured_output is not None and not in
STRUCTURED_OUTPUTS and raise a ValueError listing the invalid value and valid
keys (referencing STRUCTURED_OUTPUTS and the attribute structured_output); this
ensures process_single_datapoint/generation_params population logic (where
generation_params["response_format"] is set) never silently drops an unsupported
structured_output.
Signed-off-by: Arkadiusz Nowaczynski <[email protected]>
Signed-off-by: Arkadiusz Nowaczynski <[email protected]>
1 file reviewed, 1 comment
Signed-off-by: Arkadiusz Nowaczynski <[email protected]>
No files reviewed, no comments
Signed-off-by: Arkadiusz Nowaczynski <[email protected]>
1 file reviewed, 1 comment
nemo_skills/inference/generate.py
Outdated
if self.cfg.structured_output == "HLE_JUDGE_AA":
    try:
        output[self.cfg.generation_key] = "Judgement: {}".format(
            json.loads(output[self.cfg.generation_key])["correct"]
        )
    except (json.JSONDecodeError, KeyError):
        output[self.cfg.generation_key] = "Judgement: FAILED_TO_PARSE"
Hardcoded check for "HLE_JUDGE_AA" creates inconsistency with line 695, which uses `in STRUCTURED_OUTPUTS`. If new structured output formats are added to `STRUCTURED_OUTPUTS`, they'll set `response_format` but won't have corresponding postprocessing logic. Consider using `self.cfg.structured_output in STRUCTURED_OUTPUTS` here or creating a registry of postprocessing handlers.
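One possible shape for the suggested registry of postprocessing handlers, as an illustration of this suggestion rather than code from the PR; the names below are hypothetical.

```python
# Illustrative only: a registry of postprocessing handlers keyed the same way
# as STRUCTURED_OUTPUTS, so every registered format gets a matching
# postprocessing step instead of a hardcoded check in generate.py.
import json


def _postprocess_hle_judge_aa(generation: str) -> str:
    try:
        return "Judgement: {}".format(json.loads(generation)["correct"])
    except (json.JSONDecodeError, KeyError):
        return "Judgement: FAILED_TO_PARSE"


STRUCTURED_OUTPUT_POSTPROCESSORS = {"HLE_JUDGE_AA": _postprocess_hle_judge_aa}
```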
jiacheng-xu
left a comment
I would request @gwarmstrong to review and leave some comments here since it's changing a core function / logic in generation flow.
I could be wrong, but some thoughts I have after reviewing the changes:
- The naming of "response_format" is vague. It could be image or text, or it could be text or JSON. It may need renaming and more documentation.
- The use of response_format might change the behavior of EndpointType, and vice versa. More test cases are needed.
- Test cases are needed for at least one example; MathReasoning from https://platform.openai.com/docs/guides/structured-outputs?example=structured-data is a good one (see the sketch after this list).
- It is a broad feature, not only for HLE_JUDGE_AA.
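For context, here is a sketch of the MathReasoning example from the linked OpenAI guide, lightly adapted; it assumes the official `openai` Python SDK and an `OPENAI_API_KEY` set in the environment.

```python
# Sketch of the MathReasoning structured-output example from the OpenAI guide;
# assumes the official openai SDK and OPENAI_API_KEY in the environment.
from openai import OpenAI
from pydantic import BaseModel


class Step(BaseModel):
    explanation: str
    output: str


class MathReasoning(BaseModel):
    steps: list[Step]
    final_answer: str


client = OpenAI()
completion = client.beta.chat.completions.parse(
    model="gpt-4o-2024-08-06",
    messages=[
        {"role": "system", "content": "You are a helpful math tutor. Guide the user through the solution step by step."},
        {"role": "user", "content": "how can I solve 8x + 7 = -23"},
    ],
    response_format=MathReasoning,
)
math_reasoning = completion.choices[0].message.parsed  # parsed MathReasoning instance
print(math_reasoning.final_answer)
```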
nemo_skills/inference/generate.py
Outdated
# all of the original data to the output file alongside the new generations
output[self.cfg.generation_key] = output.pop("generation")

if self.cfg.structured_output == "HLE_JUDGE_AA":
It is not a good idea to hard-code HLE_JUDGE_AA in generate.py.
Can we build a function to handle that, like:
Skills/nemo_skills/inference/generate.py, line 658 in 54c8bc0:
if self.cfg.parse_reasoning:
@anowaczynski-nvidia can we move this logic into metrics? Why does it need to be in the generation?
Reasons I added the `if` with postprocessing here:
- to enable the AA-compatible HLE judge, `++structured_output=HLE_JUDGE_AA` needs to be added in only one place (the judge generations pipeline command)
- with the current version, the `summarize_results` command and the pipeline logic for aggregating HLE judge outputs into metrics don't require any modifications (the same command + code handles both the default and AA-compatible judges)
I am aware this code is fundamental to the entire package; all generations pass through it.
Regarding moving this to metrics: I see the possibility to create `hleaa_metrics.py` in evaluation/metrics, inherit from `MathMetrics`, and override only `_get_score_dict`, such that postprocessing of the judgement (parsing into JSON etc.) is applied before `is_correct_judgement`. Do you approve this plan?
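A rough sketch of what that plan could look like; the import path, the `_get_score_dict` signature, and the `judgement` key are all assumptions about nemo_skills internals, not taken from this PR.

```python
# Rough sketch only: the import path, _get_score_dict signature, and the
# "judgement" key are assumed nemo_skills internals, not confirmed by this PR.
import json

from nemo_skills.evaluation.metrics.math_metrics import MathMetrics  # assumed path


class HLEAAMetrics(MathMetrics):
    def _get_score_dict(self, prediction: dict) -> dict:
        # Reduce the AA-style JSON judgement to the "Judgement: ..." string the
        # inherited MathMetrics logic (is_correct_judgement) already understands.
        try:
            judgement = "Judgement: {}".format(json.loads(prediction["judgement"])["correct"])
        except (json.JSONDecodeError, KeyError):
            judgement = "Judgement: FAILED_TO_PARSE"
        return super()._get_score_dict({**prediction, "judgement": judgement})
```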
Yes, either that, or we can just have this as an option for the main math metrics, so that any dataset, not just HLE, can be evaluated in this setup. The one problem is I am not fully sure if metrics are currently customizable, but I guess if not, then we should enable customization in a similar way to how it's done for eval / generation parameters. Let me know if you need help with the design on that; happy to discuss in more detail.
@Kipok I tried the hard way first, but nothing I created was correct and convincing, so I pushed one commit with a `class HLEAAMetrics(MathMetrics)` solution, as it was conceptually much simpler. The main downside is that I had to add `metric_type` to the eval command. It doesn't look right here, and it doesn't compose with the eval-on-multiple-benchmarks idea. Can you take a look? If we're doing the Metrics Config idea, I need a sync on how to approach it.
I think this is the right approach. When doing eval on multiple benchmarks you can't really customize anything except maybe inference parameters; e.g., changing the prompt or eval arguments will also break things, so I think adding `metric_type` is a good change. An alternative would be to add this as an argument to `MathMetrics` and then reuse the existing `metric_kwargs` parameter to customize it. But adding `metric_type` is a good change anyway, given that we already support `metric_kwargs`.
If the current implementation fully works for you, I think it LGTM as well and we can merge it. But do let me know if you have any concerns or think we should do things differently.
It's probably a good idea to add a new test for this in test_generation.py, but only if models on build.nvidia.com support this `response_format` argument.
Signed-off-by: Arkadiusz Nowaczynski <[email protected]>
No files reviewed, no comments
Signed-off-by: Arkadiusz Nowaczynski <[email protected]>
No files reviewed, no comments
Signed-off-by: Arkadiusz Nowaczynski <[email protected]>
No files reviewed, no comments
Why?
Enable optional AA-compatibility in HLE benchmark implementation.
What?
Support structured output in generations. See https://platform.openai.com/docs/guides/structured-outputs
How?
- `nemo_skills/inference/structured_outputs.py` with predefined response formats for structured outputs
- new config option `structured_output`: str, default None
- `process_single_datapoint`: before generation, add `response_format` to `generation_params` based on `self.cfg.structured_output` if not None
- `postprocess_single_output`: parse the generation and extract the `correct` field into the expected judgement format