Conversation
|
MLCommons CLA bot All contributors have signed the MLCommons CLA ✍️ ✅ |
|
|
||
| def translate_text_prompt(self, prompt: TextPrompt, options: ModelOptions) -> ReasoningRequest: | ||
| @classmethod | ||
| def response_contains_reasoning(cls, response: SUTResponse) -> bool: |
There was a problem hiding this comment.
It feels like SUTResponse's responsibility to report whether it contains reasoning.
There was a problem hiding this comment.
I see your thinking there. But answering that question depends on the type of reasoning pattern that the model is following. And I think matching against all know reasoning patterns is a lot to ask of a SUTResponse object that is mainly intended to function as a bag of data.
Also, I think it's cleaner if all reasoning functionality (matching, parsing, request formatting) is handled by the same object.
There was a problem hiding this comment.
I was thinking the same thing as Roger. But I think the matching can happen in the SUT. How about we have a subclass of SUTResponse that is a ReasoningSUTResponse which has a couple of new properties?
Off the top of my head, I'd say SUTResponse's text should be the raw response, as that's what the SUT actually output. Then we add a couple of properties that could be reasoning specific, like reasoning_text and response_text. Or you could add a property to SUTResponse that is something like clean_text or filtered_text or user_facing_text, which in the superclass is the same thing as text and in the ReasoningSUTReponse returns the stripped output.
That preserves the generality of PromptResponseSUT for a variety of use cases, but allows benchmarks to ask for the cleaned text they need for evaluation.
There was a problem hiding this comment.
I can do that, but it feels overly complicated. I think people (Kurt, workstream folks) would struggle keeping track of when to look at text and when to look at response_text. I think it would also make maintenance more difficult (every time you look at the SUTResponse obejct, you have to recall that text only matters in one case and reasoning_text matters in another). I think this has a lot of downstream repercussions for understandability.
I don't see the advantages here.
|
|
||
| def __init__(self, uid: str): | ||
| super().__init__(uid) | ||
| self.reasoning_handler: Optional[Type[ReasoningHandler]] = ReasoningHandler.sut_matches(self) |
There was a problem hiding this comment.
I have two qualms with this:
- It will be null most of the time and only used in specialized versions of a SUT class, so it should probably be a subclass like a
ReasoningPromptResponseSUT. - It calls a third-party service on object creation, which has a side effect the caller may not be aware of, and which is only necessary if we don't know whether a SUT is reasoning. I think we know this for sure in a lot of cases, without having to test the SUT. But that could be wrong.
So I'd suggest doing the sut_matches check in the factory that instantiates the SUT, and either passing in the handler to this constructor or a flag for the constructor to create the handler, without the constructor doing the probing.
Alternatively, make the reasoning handler something more abstract that can be common to all SUTs. In reasoning SUTs, it would do the response parsing; in other SUTs, it would be a no-op. Then every SUT would behave the same. That may overcomplicate things, and so a ReasoningPromptResponseSUT subclass with a handler may be cleaner.
Alternatively, for SUTs we know are reasoning SUTs, there could be an option to bypass the check, like
def __init__(self, uid: str, is_reasoning: bool | None = None):
if is_reasoning is None:
# do the check
elif is_reasoning is False:
self.reasoning_handler = None
else:
self.reasoning_handler = ReasoningHandler()There was a problem hiding this comment.
So I'd suggest doing the sut_matches check in the factory that instantiates the SUT, and either passing in the handler to this constructor or a flag for the constructor to create the handler, without the constructor doing the probing.
I believe this is more so similar to my original PR (which I actually prefer). I could implement this approach, but I know @wpietri had qualms about the factory instantiating a SUT, doing a reasoning check, and then possibly modifying the SUT.
Alternatively, make the reasoning handler something more abstract that can be common to all SUTs. In reasoning SUTs, it would do the response parsing; in other SUTs, it would be a no-op.
I think this would be difficult if there are multiple reasoning handlers.
Alternatively, for SUTs we know are reasoning SUTs, there could be an option to bypass the check, like
The scope of this first task I'm tackling only includes SUTs that we do not have prior information about. But that's a good idea for later!
There was a problem hiding this comment.
My qualms depend a lot on how the modification happens. If it's just setting a flag or something, I think that's ok. But I have a bias toward immutability, so I'd rather we instantiate a whole and complete thing. E.g., the thing that creates the SUT could create a naive version that assumes no reasoning, check to see if there's reasoning in the output, and then create a new SUT instance that handles reasoning.
One thing that strikes me is that SUTs are not going to change their reasoning-ness on the fly, so SUT instantiation isn't the only time to collect the information of what's a reasoning SUT and what isn't. If we make a big lookup table of what has reasoning and what doesn't, which we have to do anyway to properly categorize our benchmark results, we would only have to do run-time checks for novel-to-us SUTs. If the look-it-up approach is our default case and the run-time check becomes the unusual one, could that orientation make the code clearer?
There was a problem hiding this comment.
the thing that creates the SUT could create a naive version that assumes no reasoning, check to see if there's reasoning in the output, and then create a new SUT instance that handles reasoning.
Isn't that what I did in the first PR? Does that mean your only issue was with the dynamic class mixing?
There was a problem hiding this comment.
But if I do the check reasoning + instantiate a new SUT with a reasoning handler in the factory... is there anyway to do it that avoids both dynamic class mixing from the first PR and the circular dependencies in this PR? @rogthefrog @wpietri
There was a problem hiding this comment.
As discussed in DM, I think the circular dependency thing is for a broader discussion and not in scope.
And fwiw I don't think the initial implementation is too problematic.
| def invoke(command, args=None, **kwargs): | ||
| args = list(args or []) | ||
| full_args = ["--run-path", run_dir] + args | ||
| print(command, full_args, kwargs) |
There was a problem hiding this comment.
Did you mean to leave this in?
Just look at the changes from the last commit!
In the meeting, it was suggested that reasoning SUTs set some sort of "flag" (
reasoning_handlerhere) that indicates that the SUT should use a different set of methods that handle reasoning. This was suggested as an alternative to dynamically mixing the reasoning class into the base class, as done in my original PR. Composition, rather than inheritance.Here is an initial implementation. Please confirm if this is what folks had in mind. I find it a little clunky tbh and think the original code was easier to follow. But if people prefer this, I'm happy to roll with it.
To be specific, I don't love the
sut.evaluate()->sut.reasoning_handler.evaluate()->sut._evaluate()call chain. I think it's hard to follow, so please let me know if you have better suggestions.