Skip to content

Reasoning handler in the SUT class#1492

Draft
bkorycki wants to merge 9 commits intomainfrom
reasoning-checker-2
Draft

Reasoning handler in the SUT class#1492
bkorycki wants to merge 9 commits intomainfrom
reasoning-checker-2

Conversation

@bkorycki
Copy link
Copy Markdown
Contributor

@bkorycki bkorycki commented Mar 3, 2026

Just look at the changes from the last commit!

In the meeting, it was suggested that reasoning SUTs set some sort of "flag" (reasoning_handler here) that indicates that the SUT should use a different set of methods that handle reasoning. This was suggested as an alternative to dynamically mixing the reasoning class into the base class, as done in my original PR. Composition, rather than inheritance.

Here is an initial implementation. Please confirm if this is what folks had in mind. I find it a little clunky tbh and think the original code was easier to follow. But if people prefer this, I'm happy to roll with it.
To be specific, I don't love the sut.evaluate()->sut.reasoning_handler.evaluate() -> sut._evaluate() call chain. I think it's hard to follow, so please let me know if you have better suggestions.

@bkorycki bkorycki requested a review from a team as a code owner March 3, 2026 01:16
@github-actions
Copy link
Copy Markdown

github-actions bot commented Mar 3, 2026

MLCommons CLA bot All contributors have signed the MLCommons CLA ✍️ ✅

@bkorycki bkorycki marked this pull request as draft March 3, 2026 01:16
@bkorycki bkorycki had a problem deploying to Scheduled Testing March 3, 2026 01:16 — with GitHub Actions Failure

def translate_text_prompt(self, prompt: TextPrompt, options: ModelOptions) -> ReasoningRequest:
@classmethod
def response_contains_reasoning(cls, response: SUTResponse) -> bool:
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It feels like SUTResponse's responsibility to report whether it contains reasoning.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see your thinking there. But answering that question depends on the type of reasoning pattern that the model is following. And I think matching against all know reasoning patterns is a lot to ask of a SUTResponse object that is mainly intended to function as a bag of data.

Also, I think it's cleaner if all reasoning functionality (matching, parsing, request formatting) is handled by the same object.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was thinking the same thing as Roger. But I think the matching can happen in the SUT. How about we have a subclass of SUTResponse that is a ReasoningSUTResponse which has a couple of new properties?

Off the top of my head, I'd say SUTResponse's text should be the raw response, as that's what the SUT actually output. Then we add a couple of properties that could be reasoning specific, like reasoning_text and response_text. Or you could add a property to SUTResponse that is something like clean_text or filtered_text or user_facing_text, which in the superclass is the same thing as text and in the ReasoningSUTReponse returns the stripped output.

That preserves the generality of PromptResponseSUT for a variety of use cases, but allows benchmarks to ask for the cleaned text they need for evaluation.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can do that, but it feels overly complicated. I think people (Kurt, workstream folks) would struggle keeping track of when to look at text and when to look at response_text. I think it would also make maintenance more difficult (every time you look at the SUTResponse obejct, you have to recall that text only matters in one case and reasoning_text matters in another). I think this has a lot of downstream repercussions for understandability.

I don't see the advantages here.


def __init__(self, uid: str):
super().__init__(uid)
self.reasoning_handler: Optional[Type[ReasoningHandler]] = ReasoningHandler.sut_matches(self)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have two qualms with this:

  1. It will be null most of the time and only used in specialized versions of a SUT class, so it should probably be a subclass like a ReasoningPromptResponseSUT.
  2. It calls a third-party service on object creation, which has a side effect the caller may not be aware of, and which is only necessary if we don't know whether a SUT is reasoning. I think we know this for sure in a lot of cases, without having to test the SUT. But that could be wrong.

So I'd suggest doing the sut_matches check in the factory that instantiates the SUT, and either passing in the handler to this constructor or a flag for the constructor to create the handler, without the constructor doing the probing.

Alternatively, make the reasoning handler something more abstract that can be common to all SUTs. In reasoning SUTs, it would do the response parsing; in other SUTs, it would be a no-op. Then every SUT would behave the same. That may overcomplicate things, and so a ReasoningPromptResponseSUT subclass with a handler may be cleaner.

Alternatively, for SUTs we know are reasoning SUTs, there could be an option to bypass the check, like

def __init__(self, uid: str, is_reasoning: bool | None = None):
    if is_reasoning is None:
        # do the check
    elif is_reasoning is False:
        self.reasoning_handler = None
    else:
        self.reasoning_handler = ReasoningHandler()

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So I'd suggest doing the sut_matches check in the factory that instantiates the SUT, and either passing in the handler to this constructor or a flag for the constructor to create the handler, without the constructor doing the probing.

I believe this is more so similar to my original PR (which I actually prefer). I could implement this approach, but I know @wpietri had qualms about the factory instantiating a SUT, doing a reasoning check, and then possibly modifying the SUT.

Alternatively, make the reasoning handler something more abstract that can be common to all SUTs. In reasoning SUTs, it would do the response parsing; in other SUTs, it would be a no-op.

I think this would be difficult if there are multiple reasoning handlers.

Alternatively, for SUTs we know are reasoning SUTs, there could be an option to bypass the check, like

The scope of this first task I'm tackling only includes SUTs that we do not have prior information about. But that's a good idea for later!

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My qualms depend a lot on how the modification happens. If it's just setting a flag or something, I think that's ok. But I have a bias toward immutability, so I'd rather we instantiate a whole and complete thing. E.g., the thing that creates the SUT could create a naive version that assumes no reasoning, check to see if there's reasoning in the output, and then create a new SUT instance that handles reasoning.

One thing that strikes me is that SUTs are not going to change their reasoning-ness on the fly, so SUT instantiation isn't the only time to collect the information of what's a reasoning SUT and what isn't. If we make a big lookup table of what has reasoning and what doesn't, which we have to do anyway to properly categorize our benchmark results, we would only have to do run-time checks for novel-to-us SUTs. If the look-it-up approach is our default case and the run-time check becomes the unusual one, could that orientation make the code clearer?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the thing that creates the SUT could create a naive version that assumes no reasoning, check to see if there's reasoning in the output, and then create a new SUT instance that handles reasoning.

Isn't that what I did in the first PR? Does that mean your only issue was with the dynamic class mixing?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But if I do the check reasoning + instantiate a new SUT with a reasoning handler in the factory... is there anyway to do it that avoids both dynamic class mixing from the first PR and the circular dependencies in this PR? @rogthefrog @wpietri

Copy link
Copy Markdown
Contributor

@rogthefrog rogthefrog Mar 4, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As discussed in DM, I think the circular dependency thing is for a broader discussion and not in scope.

And fwiw I don't think the initial implementation is too problematic.

def invoke(command, args=None, **kwargs):
args = list(args or [])
full_args = ["--run-path", run_dir] + args
print(command, full_args, kwargs)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did you mean to leave this in?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants