Reasoning handler in the SUT class by bkorycki · Pull Request #1492 · mlcommons/modelbench

bkorycki · 2026-03-03T01:16:02Z

Just look at the changes from the last commit!

In the meeting, it was suggested that reasoning SUTs set some sort of "flag" (reasoning_handler here) that indicates that the SUT should use a different set of methods that handle reasoning. This was suggested as an alternative to dynamically mixing the reasoning class into the base class, as done in my original PR. Composition, rather than inheritance.

Here is an initial implementation. Please confirm if this is what folks had in mind. I find it a little clunky tbh and think the original code was easier to follow. But if people prefer this, I'm happy to roll with it.
To be specific, I don't love the sut.evaluate()->sut.reasoning_handler.evaluate() -> sut._evaluate() call chain. I think it's hard to follow, so please let me know if you have better suggestions.

github-actions · 2026-03-03T01:16:17Z

MLCommons CLA bot All contributors have signed the MLCommons CLA ✍️ ✅

rogthefrog · 2026-03-03T22:52:39Z

src/modelgauge/reasoning_handlers.py


-    def translate_text_prompt(self, prompt: TextPrompt, options: ModelOptions) -> ReasoningRequest:
+    @classmethod
+    def response_contains_reasoning(cls, response: SUTResponse) -> bool:


It feels like SUTResponse's responsibility to report whether it contains reasoning.

I see your thinking there. But answering that question depends on the type of reasoning pattern that the model is following. And I think matching against all know reasoning patterns is a lot to ask of a SUTResponse object that is mainly intended to function as a bag of data.

Also, I think it's cleaner if all reasoning functionality (matching, parsing, request formatting) is handled by the same object.

I was thinking the same thing as Roger. But I think the matching can happen in the SUT. How about we have a subclass of SUTResponse that is a ReasoningSUTResponse which has a couple of new properties?

Off the top of my head, I'd say SUTResponse's text should be the raw response, as that's what the SUT actually output. Then we add a couple of properties that could be reasoning specific, like reasoning_text and response_text. Or you could add a property to SUTResponse that is something like clean_text or filtered_text or user_facing_text, which in the superclass is the same thing as text and in the ReasoningSUTReponse returns the stripped output.

That preserves the generality of PromptResponseSUT for a variety of use cases, but allows benchmarks to ask for the cleaned text they need for evaluation.

I can do that, but it feels overly complicated. I think people (Kurt, workstream folks) would struggle keeping track of when to look at text and when to look at response_text. I think it would also make maintenance more difficult (every time you look at the SUTResponse obejct, you have to recall that text only matters in one case and reasoning_text matters in another). I think this has a lot of downstream repercussions for understandability.

I don't see the advantages here.

src/modelgauge/reasoning_handlers.py

rogthefrog · 2026-03-03T23:20:06Z

src/modelgauge/sut.py


+    def __init__(self, uid: str):
+        super().__init__(uid)
+        self.reasoning_handler: Optional[Type[ReasoningHandler]] = ReasoningHandler.sut_matches(self)


I have two qualms with this:

It will be null most of the time and only used in specialized versions of a SUT class, so it should probably be a subclass like a ReasoningPromptResponseSUT.

It calls a third-party service on object creation, which has a side effect the caller may not be aware of, and which is only necessary if we don't know whether a SUT is reasoning. I think we know this for sure in a lot of cases, without having to test the SUT. But that could be wrong.

So I'd suggest doing the sut_matches check in the factory that instantiates the SUT, and either passing in the handler to this constructor or a flag for the constructor to create the handler, without the constructor doing the probing.

Alternatively, make the reasoning handler something more abstract that can be common to all SUTs. In reasoning SUTs, it would do the response parsing; in other SUTs, it would be a no-op. Then every SUT would behave the same. That may overcomplicate things, and so a ReasoningPromptResponseSUT subclass with a handler may be cleaner.

Alternatively, for SUTs we know are reasoning SUTs, there could be an option to bypass the check, like

def __init__(self, uid: str, is_reasoning: bool | None = None): if is_reasoning is None: # do the check elif is_reasoning is False: self.reasoning_handler = None else: self.reasoning_handler = ReasoningHandler()

So I'd suggest doing the sut_matches check in the factory that instantiates the SUT, and either passing in the handler to this constructor or a flag for the constructor to create the handler, without the constructor doing the probing.

I believe this is more so similar to my original PR (which I actually prefer). I could implement this approach, but I know @wpietri had qualms about the factory instantiating a SUT, doing a reasoning check, and then possibly modifying the SUT.

Alternatively, make the reasoning handler something more abstract that can be common to all SUTs. In reasoning SUTs, it would do the response parsing; in other SUTs, it would be a no-op.

I think this would be difficult if there are multiple reasoning handlers.

Alternatively, for SUTs we know are reasoning SUTs, there could be an option to bypass the check, like

The scope of this first task I'm tackling only includes SUTs that we do not have prior information about. But that's a good idea for later!

My qualms depend a lot on how the modification happens. If it's just setting a flag or something, I think that's ok. But I have a bias toward immutability, so I'd rather we instantiate a whole and complete thing. E.g., the thing that creates the SUT could create a naive version that assumes no reasoning, check to see if there's reasoning in the output, and then create a new SUT instance that handles reasoning.

One thing that strikes me is that SUTs are not going to change their reasoning-ness on the fly, so SUT instantiation isn't the only time to collect the information of what's a reasoning SUT and what isn't. If we make a big lookup table of what has reasoning and what doesn't, which we have to do anyway to properly categorize our benchmark results, we would only have to do run-time checks for novel-to-us SUTs. If the look-it-up approach is our default case and the run-time check becomes the unusual one, could that orientation make the code clearer?

the thing that creates the SUT could create a naive version that assumes no reasoning, check to see if there's reasoning in the output, and then create a new SUT instance that handles reasoning.

Isn't that what I did in the first PR? Does that mean your only issue was with the dynamic class mixing?

But if I do the check reasoning + instantiate a new SUT with a reasoning handler in the factory... is there anyway to do it that avoids both dynamic class mixing from the first PR and the circular dependencies in this PR? @rogthefrog @wpietri

As discussed in DM, I think the circular dependency thing is for a broader discussion and not in scope.

And fwiw I don't think the initial implementation is too problematic.

src/modelgauge/sut.py

rogthefrog · 2026-03-04T19:08:54Z

tests/modelbench_tests/test_run.py

        def invoke(command, args=None, **kwargs):
            args = list(args or [])
            full_args = ["--run-path", run_dir] + args
+            print(command, full_args, kwargs)


Did you mean to leave this in?

bkorycki added 9 commits February 25, 2026 11:59

reasoningSUT type + matching

fa3c1ea

patch for better tests

6b7100c

Store reasoning in response

3691448

remove pre-registered reasoning suts

c15c811

fix test

ccf9741

don't run unit tests on real suts

7e066fa

handle multiple think blocks

b50286d

document sut call

416b9b2

sut has a reasoning handler

dbf09cb

bkorycki requested review from rogthefrog, superdosh and wpietri March 3, 2026 01:16

bkorycki requested a review from a team as a code owner March 3, 2026 01:16

bkorycki marked this pull request as draft March 3, 2026 01:16

bkorycki had a problem deploying to Scheduled Testing March 3, 2026 01:16 — with GitHub Actions Failure

rogthefrog reviewed Mar 3, 2026

View reviewed changes

rogthefrog reviewed Mar 4, 2026

View reviewed changes

Conversation

bkorycki commented Mar 3, 2026

Uh oh!

github-actions bot commented Mar 3, 2026

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

rogthefrog Mar 4, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

rogthefrog Mar 4, 2026 •

edited

Loading