
Conversation

Collaborator

@anowaczynski-nvidia anowaczynski-nvidia commented Jan 26, 2026

Why?
Enable optional AA-compatibility in the HLE benchmark implementation.

What?
Support structured output in generations. See https://platform.openai.com/docs/guides/structured-outputs

How?

  • new file nemo_skills/inference/structured_outputs.py with predefined response formats for structured outputs (sketched after this list)
  • new parameter in GenerationTaskConfig: structured_output str, default None
  • in process_single_datapoint before generation: add response_format to generation_params based on self.cfg.structured_output if not None
  • in postprocess_single_output: parse the generation and extract the correct field in the expected judgement format
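
For reference, a minimal sketch of what the new structured_outputs.py module might contain, assuming the field names quoted in the bot reviews below (extracted_final_answer, reasoning, correct, confidence); this is an illustration, not the exact file contents:

from typing import Literal

from pydantic import BaseModel


class HLEJudgeAAResponseFormat(BaseModel):
    # Field names and types are approximated from the review summaries in this thread.
    extracted_final_answer: str
    reasoning: str
    correct: Literal["yes", "no"]
    confidence: int


# Registry looked up via GenerationTaskConfig.structured_output.
STRUCTURED_OUTPUTS = {"HLE_JUDGE_AA": HLEJudgeAAResponseFormat}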

Summary by CodeRabbit

  • New Features
    • Added structured output support to inference generation, enabling formatted JSON responses from compatible models
    • Automatic response parsing and field extraction for improved result handling
    • Response format configuration with consistent API across supported model providers


@anowaczynski-nvidia anowaczynski-nvidia self-assigned this Jan 26, 2026
Signed-off-by: Arkadiusz Nowaczynski <[email protected]>
Contributor

greptile-apps bot commented Jan 26, 2026

Greptile Overview

Greptile Summary

This PR adds structured output support to enable optional AA-compatibility in the HLE benchmark implementation. The changes introduce a new structured_output configuration parameter that allows specifying predefined response formats (currently supporting HLE_JUDGE_AA).

Key Changes:

  • New file structured_outputs.py defines Pydantic models for structured response formats
  • GenerationTaskConfig adds structured_output parameter (str, default None)
  • Before generation: response_format is added to generation params based on the configured structured output
  • After generation: JSON response is parsed to extract the correct field and format as "Judgement: {correct}"
  • All model implementations updated to support or reject response_format parameter appropriately

Architecture:
The implementation follows a clean separation of concerns: structured output schemas are defined centrally in structured_outputs.py, the generation task coordinates the flow, and individual model adapters handle provider-specific support. Error handling catches both JSONDecodeError and KeyError, with a fallback to "Judgement: FAILED_TO_PARSE".

Provider Support:

  • OpenAI, VLLM, SGLang: Full support added
  • Gemini: Raises NotImplementedError
  • Megatron: Assertions reject the parameter
  • All providers correctly exclude response_format from text completion endpoints (a minimal sketch of both patterns follows this list)
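
A minimal sketch of the two adapter patterns described above, i.e. chat paths pass response_format through while unsupported paths reject it; the function names are illustrative, not the actual adapter code, and the explicit raise follows the reviewers' later suggestion to prefer it over assert:

def build_chat_request_params(response_format=None, **kwargs):
    # Chat endpoints: pass the schema through to litellm / the provider API.
    params = dict(kwargs)
    if response_format is not None:
        params["response_format"] = response_format
    return params


def build_completion_request_params(response_format=None, **kwargs):
    # Text completion endpoints: structured outputs are not supported.
    if response_format is not None:
        raise NotImplementedError("response_format is only supported for chat requests.")
    return dict(kwargs)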

Confidence Score: 4/5

  • This PR is relatively safe to merge with minor API compatibility concerns already flagged
  • The implementation is well-structured with comprehensive error handling and proper provider-specific adaptations. Error handling for JSON parsing is thorough (catches both JSONDecodeError and KeyError). However, there's an unverified assumption about how LiteLLM handles raw Pydantic classes in response_format - the OpenAI API typically expects a specific dict format with json_schema, not raw classes. LiteLLM may handle this conversion automatically, but this needs verification through testing. The changes are isolated to the structured outputs feature and won't affect existing functionality.
  • nemo_skills/inference/structured_outputs.py - verify that LiteLLM correctly converts the raw Pydantic class to the expected API format

Important Files Changed

Filename | Overview
nemo_skills/inference/structured_outputs.py | Added Pydantic model for HLE judge format, but raw class may not work with all providers' APIs
nemo_skills/inference/generate.py | Added structured_output config and parsing logic with proper error handling (KeyError and JSONDecodeError)
nemo_skills/inference/model/openai.py | Added response_format support for chat requests, correctly excluded from completion requests
nemo_skills/inference/model/vllm.py | Added response_format parameter to chat requests, correctly excluded from completion requests

Sequence Diagram

sequenceDiagram
    participant User
    participant GenerationTask
    participant STRUCTURED_OUTPUTS
    participant Model
    participant LiteLLM
    participant LLM API

    User->>GenerationTask: configure structured_output="HLE_JUDGE_AA"
    GenerationTask->>GenerationTask: process_single_datapoint()
    GenerationTask->>STRUCTURED_OUTPUTS: lookup HLE_JUDGE_AA
    STRUCTURED_OUTPUTS-->>GenerationTask: return HLEJudgeAAResponseFormat (Pydantic class)
    GenerationTask->>Model: generate_async(response_format=HLEJudgeAAResponseFormat)
    Model->>Model: _build_chat_request_params(response_format=...)
    Model->>LiteLLM: acompletion(response_format=HLEJudgeAAResponseFormat)
    LiteLLM->>LLM API: POST with structured output schema
    LLM API-->>LiteLLM: JSON response matching schema
    LiteLLM-->>Model: response
    Model-->>GenerationTask: generation result
    GenerationTask->>GenerationTask: postprocess_single_output()
    GenerationTask->>GenerationTask: parse JSON and extract "correct" field
    alt JSON parsing succeeds
        GenerationTask->>GenerationTask: format as "Judgement: {correct}"
    else JSON parsing fails
        GenerationTask->>GenerationTask: fallback to "Judgement: FAILED_TO_PARSE"
    end
    GenerationTask-->>User: final output

Contributor

coderabbitai bot commented Jan 26, 2026

📝 Walkthrough

Walkthrough

Adds structured output support (HLE_JUDGE_AA) throughout the inference pipeline by introducing a response_format parameter, enabling structured JSON response parsing and post-processing of judge correctness assessments across multiple model implementations.

Changes

Cohort / File(s) Summary
Structured Output Definition
nemo_skills/inference/structured_outputs.py
New module defining the HLEJudgeAAResponseFormat Pydantic model with fields for the answer, reasoning, correctness (correct as "yes"/"no"), and confidence; introduces the STRUCTURED_OUTPUTS registry mapping the "HLE_JUDGE_AA" key to it.
Generation Orchestration
nemo_skills/inference/generate.py
Adds a structured_output: str | None config field to GenerationTaskConfig; injects response_format into generation params when the structured output is registered; post-processing parses the JSON response, extracts the "correct" field, and wraps it as "Judgement: {correct}", or "Judgement: FAILED_TO_PARSE" on error.
Base Model Abstraction
nemo_skills/inference/model/base.py
Adds optional response_format parameter to generate_async method; propagates parameter into per-call kwargs forwarded to underlying request builders and litellm calls.
Model-Specific Implementations (Restrictive)
nemo_skills/inference/model/gemini.py, nemo_skills/inference/model/megatron.py
Both add response_format parameter with assertions enforcing it must remain None, explicitly disallowing structured outputs due to API limitations.
Model-Specific Implementations (Chat-Only)
nemo_skills/inference/model/openai.py, nemo_skills/inference/model/vllm.py
Add response_format parameter; chat paths pass through to request payload; completion paths reject with assertions. OpenAI preserves reasoning model defaults when response_format used.
Model-Specific Implementations (Pass-Through)
nemo_skills/inference/model/sglang.py
Adds response_format parameter and threads it to parent class call; parameter included in final request dictionary when provided.

Sequence Diagram

sequenceDiagram
    participant Config as GenerationTaskConfig
    participant Generate as generate.py
    participant BaseModel as BaseModel
    participant ModelImpl as Model Implementation
    participant LiteLLM as LiteLLM/API

    Config->>Generate: structured_output="HLE_JUDGE_AA"
    Generate->>Generate: process_single_datapoint()
    Generate->>BaseModel: generate_async(..., response_format=HLEJudgeAAResponseFormat)
    BaseModel->>ModelImpl: _build_chat_request_params(..., response_format)
    alt Supports structured (OpenAI/SGLang/VLLM chat)
        ModelImpl->>ModelImpl: Include response_format in request dict
    else Rejects structured (Gemini/Megatron)
        ModelImpl->>ModelImpl: assert response_format is None
    end
    ModelImpl->>LiteLLM: Send request with response_format
    LiteLLM-->>ModelImpl: JSON response {correct: "yes", ...}
    ModelImpl-->>BaseModel: Response
    BaseModel-->>Generate: Raw generation
    Generate->>Generate: postprocess_single_output()
    Generate->>Generate: Parse JSON, extract "correct" field
    Generate-->>Config: Judgement: yes/no

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

🚥 Pre-merge checks | ✅ 2 | ❌ 1
❌ Failed checks (1 warning)
Check name | Status | Explanation | Resolution
Docstring Coverage | ⚠️ Warning | Docstring coverage is 8.33% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (2 passed)
Check name | Status | Explanation
Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled.
Title check | ✅ Passed | The title 'support structured outputs in hle judge for optional AA compatibility' clearly and specifically summarizes the main change: adding structured output support to the HLE judge for AA compatibility.



Contributor

@greptile-apps greptile-apps bot left a comment


1 file reviewed, 1 comment


Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 3

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
nemo_skills/inference/model/megatron.py (2)

39-53: Replace assert statements with explicit exceptions for parameter validation.

Both _build_chat_request_params and _build_completion_request_params use assert to validate tools and response_format parameters. Since assert statements can be disabled with Python's -O flag, unsupported parameters would silently pass through, violating the guideline to "fail loudly" on invalid inputs. Other parameters in the same validation block correctly use explicit if + raise NotImplementedError, so match that pattern consistently.

Proposed fix
-        assert kwargs.get("tools") is None, "Megatron server does not support tools parameter."
-        assert response_format is None, "Megatron server does not support response_format parameter."
+        if kwargs.get("tools") is not None:
+            raise NotImplementedError("Megatron server does not support tools parameter.")
+        if response_format is not None:
+            raise NotImplementedError("Megatron server does not support response_format parameter.")

Apply this fix to both methods.


86-100: Replace assert statements with explicit exceptions for parameter validation.

Both tools and response_format parameters use assert statements, which can be disabled with Python's -O flag, causing silent failures. This violates the coding guideline to "Let the code fail with clear errors instead of silently misbehaving" and "Avoid silently ignoring unused user-passed parameters."

This same pattern appears in two methods (around lines 51-52 and 98-99). Replace all four assert statements with explicit if + raise NotImplementedError() to match the approach used for other unsupported parameters (stream, min_p, repetition_penalty, top_k).

Example fix for lines 98-99
-        assert kwargs.get("tools") is None, "Megatron server does not support tools parameter."
-        assert response_format is None, "Megatron server does not support response_format parameter."
+        if kwargs.get("tools") is not None:
+            raise NotImplementedError("Megatron server does not support tools parameter.")
+        if response_format is not None:
+            raise NotImplementedError("Megatron server does not support response_format parameter.")
🤖 Fix all issues with AI agents
In `@nemo_skills/inference/generate.py`:
- Around line 695-696: The code silently ignores unknown structured_output
values; add a validation in GenerationTaskConfig.__post_init__ (or call a helper
_post_init_validate_params from __post_init__) that checks if
self.structured_output is not None and not in STRUCTURED_OUTPUTS and raise a
ValueError listing the invalid value and valid keys (referencing
STRUCTURED_OUTPUTS and the attribute structured_output); this ensures
process_single_datapoint/generation_params population logic (where
generation_params["response_format"] is set) never silently drops an unsupported
structured_output.

In `@nemo_skills/inference/structured_outputs.py`:
- Around line 1-2: Add the standard NVIDIA copyright header at the very top of
the module (above the imports) in nemo_skills/inference/structured_outputs.py so
the file begins with the required multi-line copyright notice; do not alter the
existing imports (from typing import Literal, from pydantic import
BaseModel)—just prepend the header block exactly as the project's canonical
NVIDIA header.
- Around line 5-10: The HLEJudgeAAResponseFormat model wrongly includes a
non-response field strict: Literal[True]; remove the strict attribute from the
class so the model only defines extracted_final_answer, reasoning, correct, and
confidence, and then remove any now-unused imports (e.g., Literal[True] or
Literal if no longer needed); ensure any strict:true configuration is applied at
the OpenAI request/schema configuration level rather than as a field on
HLEJudgeAAResponseFormat.
🧹 Nitpick comments (2)
nemo_skills/inference/model/base.py (1)

239-239: Consider adding type annotation for consistency.

The response_format parameter lacks a type annotation while other parameters in this method have them. Consider adding a type hint for consistency.

Proposed fix
-        response_format = None,
+        response_format: dict | None = None,
nemo_skills/inference/generate.py (1)

636-642: Remove unused exception variable and consider logging the failure.

The exception variable e is assigned but never used (also flagged by static analysis). Additionally, silently setting FAILED_TO_PARSE without logging could make debugging difficult when generation fails to parse.

Proposed fix
         if self.cfg.structured_output == "HLE_JUDGE_AA":
             try:
                 output[self.cfg.generation_key] = "Judgement: {}".format(
                     json.loads(output[self.cfg.generation_key])["correct"]
                 )
-            except json.JSONDecodeError as e:
+            except json.JSONDecodeError:
+                LOG.warning(
+                    "Failed to parse structured output as JSON: %s",
+                    output[self.cfg.generation_key][:200] if output[self.cfg.generation_key] else "<empty>"
+                )
                 output[self.cfg.generation_key] = "Judgement: FAILED_TO_PARSE"

Comment on lines +695 to +696
if self.cfg.structured_output in STRUCTURED_OUTPUTS:
    generation_params["response_format"] = STRUCTURED_OUTPUTS[self.cfg.structured_output]
Contributor


⚠️ Potential issue | 🟡 Minor

Consider validating structured_output against registry early.

If a user specifies a structured_output value that's not in STRUCTURED_OUTPUTS, the code silently ignores it without injecting response_format. This could lead to unexpected behavior. Per coding guidelines, the code should fail if a user specifies an unsupported argument.

Proposed fix in `__post_init__` or `process_single_datapoint`

Add validation in GenerationTaskConfig.__post_init__:

def _post_init_validate_params(self):
    # ... existing validations ...
    if self.structured_output is not None and self.structured_output not in STRUCTURED_OUTPUTS:
        raise ValueError(
            f"Unknown structured_output '{self.structured_output}'. "
            f"Valid options: {list(STRUCTURED_OUTPUTS.keys())}"
        )
🤖 Prompt for AI Agents
In `@nemo_skills/inference/generate.py` around lines 695 - 696, The code silently
ignores unknown structured_output values; add a validation in
GenerationTaskConfig.__post_init__ (or call a helper _post_init_validate_params
from __post_init__) that checks if self.structured_output is not None and not in
STRUCTURED_OUTPUTS and raise a ValueError listing the invalid value and valid
keys (referencing STRUCTURED_OUTPUTS and the attribute structured_output); this
ensures process_single_datapoint/generation_params population logic (where
generation_params["response_format"] is set) never silently drops an unsupported
structured_output.

Signed-off-by: Arkadiusz Nowaczynski <[email protected]>
Signed-off-by: Arkadiusz Nowaczynski <[email protected]>
Contributor

@greptile-apps greptile-apps bot left a comment


1 file reviewed, 1 comment


@anowaczynski-nvidia anowaczynski-nvidia removed their assignment Jan 27, 2026
Signed-off-by: Arkadiusz Nowaczynski <[email protected]>
Contributor

@greptile-apps greptile-apps bot left a comment


No files reviewed, no comments


Signed-off-by: Arkadiusz Nowaczynski <[email protected]>
Contributor

@greptile-apps greptile-apps bot left a comment


1 file reviewed, 1 comment


Comment on lines 636 to 642
if self.cfg.structured_output == "HLE_JUDGE_AA":
    try:
        output[self.cfg.generation_key] = "Judgement: {}".format(
            json.loads(output[self.cfg.generation_key])["correct"]
        )
    except (json.JSONDecodeError, KeyError):
        output[self.cfg.generation_key] = "Judgement: FAILED_TO_PARSE"
Contributor


Hardcoded check for "HLE_JUDGE_AA" creates inconsistency with line 695 which uses in STRUCTURED_OUTPUTS. If new structured output formats are added to STRUCTURED_OUTPUTS, they'll set response_format but won't have corresponding postprocessing logic. Consider using self.cfg.structured_output in STRUCTURED_OUTPUTS here or creating a registry of postprocessing handlers.
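
One way the suggested handler registry could look (an illustrative sketch only; the registry name and handler signature are assumptions, not code from the PR):

import json


def _postprocess_hle_judge_aa(generation: str) -> str:
    try:
        return "Judgement: {}".format(json.loads(generation)["correct"])
    except (json.JSONDecodeError, KeyError):
        return "Judgement: FAILED_TO_PARSE"


# Keyed by the same names as STRUCTURED_OUTPUTS so that every registered
# response format has a matching postprocessing step.
STRUCTURED_OUTPUT_POSTPROCESSORS = {"HLE_JUDGE_AA": _postprocess_hle_judge_aa}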

@ekmb ekmb requested a review from jiacheng-xu January 27, 2026 01:57
Collaborator

@jiacheng-xu jiacheng-xu left a comment


I would request @gwarmstrong to review and leave some comments here since it's changing a core function / logic in generation flow.
I could be wrong, but some thoughts I have after reviewing the changes:

  1. The naming of "response_format" is vague: it could mean image vs. text, or text vs. JSON. It may need renaming and more docs.
  2. The use of response_format might change the behavior of EndpointType, and vice versa. More test cases are needed.
  3. Test cases are needed for at least one example; MathReasoning from https://platform.openai.com/docs/guides/structured-outputs?example=structured-data is a good one.
  4. It is a broad feature and not only for HLE_JUDGE_AA.

# all of the original data to the output file alongside the new generations
output[self.cfg.generation_key] = output.pop("generation")

if self.cfg.structured_output == "HLE_JUDGE_AA":
Collaborator


It is not a good idea to hard-code HLE_JUDGE_AA in generate.py.
Can we build a function to handle that like

if self.cfg.parse_reasoning:
?

Collaborator


@anowaczynski-nvidia can we move this logic into metrics? Why does it need to be in the generation?

Collaborator Author

@anowaczynski-nvidia anowaczynski-nvidia Jan 28, 2026


Reasons I added the if with postprocessing here:

  • to enable the AA-compatible HLE judge, ++structured_output=HLE_JUDGE_AA needs to be added in only one place (the judge generations pipeline command)
  • with the current version, the summarize_results command and the pipeline logic for aggregating HLE judge outputs into metrics don't require any modifications (the same command + code handles both the default and AA-compatible judges)

I am aware this code is fundamental to the entire package, all generations pass through it.

Regarding moving this to metrics: I see the possibility of creating hleaa_metrics.py in evaluation/metrics, inheriting from MathMetrics, and overriding only _get_score_dict, so that postprocessing of the judgement (parsing into JSON, etc.) is applied before is_correct_judgement. Do you approve this plan?
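
A rough sketch of that plan, purely to illustrate the shape of the override; MathMetrics, _get_score_dict, and is_correct_judgement are referenced from this thread, while the import path, method signature, and the "judgement" key are assumptions:

import json

from nemo_skills.evaluation.metrics.math_metrics import MathMetrics  # import path assumed


class HLEAAMetrics(MathMetrics):
    def _get_score_dict(self, prediction):
        # Convert the structured JSON judge output into the plain
        # "Judgement: yes/no" form that is_correct_judgement expects,
        # then defer to the standard math metrics logic.
        try:
            judgement = "Judgement: {}".format(json.loads(prediction["judgement"])["correct"])
        except (json.JSONDecodeError, KeyError):
            judgement = "Judgement: FAILED_TO_PARSE"
        return super()._get_score_dict({**prediction, "judgement": judgement})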

Collaborator


Yes, either that, or we can just have this as an option for the main math metrics, so that any dataset, not just HLE, can be evaluated in this setup. The one problem is I am not fully sure if metrics are currently customizable, but if not, we should enable customization in a similar way to how it's done for eval / generation parameters. Let me know if you need help with the design on that, happy to discuss in more detail.

Collaborator Author


@Kipok I tried the hard way first, but nothing I created was correct and convincing, so I pushed one commit with the class HLEAAMetrics(MathMetrics) solution, as it was conceptually much simpler. The main downside is that I had to add metric_type to the eval command. It doesn't look right here, and it doesn't compose with the eval-on-multiple-benchmarks idea. Can you take a look? If we're doing the Metrics Config idea, I need a sync on how to approach it.

Collaborator


I think this is the right approach. When doing eval on multiple benchmarks you can't really customize anything except maybe inference parameters; e.g., changing the prompt or eval arguments will also break things, so I think adding metric_type is a good change. An alternative would be to add this as an argument to MathMetrics and then reuse the existing metric_kwargs parameter to customize it. But adding metric_type is a good change anyway, given that we already support metric_kwargs.

If the current implementation fully works for you, I think it LGTM as well and we can merge it. But do let me know if you have any concerns or think we should do things differently

Collaborator


It's probably a good idea to add a new test for this in test_generation.py, but only if models on build.nvidia.com support this response_format argument.

Contributor

@greptile-apps greptile-apps bot left a comment


No files reviewed, no comments


Contributor

@greptile-apps greptile-apps bot left a comment


No files reviewed, no comments


Signed-off-by: Arkadiusz Nowaczynski <[email protected]>
Contributor

@greptile-apps greptile-apps bot left a comment


No files reviewed, no comments


