Reasoning handler in the SUT class #1492
```diff
@@ -18,6 +18,7 @@ class SUTResponse(BaseModel):
     """The data that came out of the SUT."""

     text: str
+    reasoning: Optional[str] = None
     top_logprobs: Optional[Sequence[TopTokens]] = None
     """For each position, list the probabilities for each of the most likely tokens.

@@ -58,14 +59,23 @@ class PromptResponseSUT(SUT, Readyable):
     Abstract base class that provides an interface to any SUT that is designed for handling a single-turn.
     """

     def __init__(self, uid: str):
         super().__init__(uid)
+        self.reasoning_handler: Optional[Type[ReasoningHandler]] = ReasoningHandler.sut_matches(self)
```
Contributor

I have two qualms with this:

So I'd suggest doing the

Alternatively, make the reasoning handler something more abstract that can be common to all SUTs. In reasoning SUTs, it would do the response parsing; in other SUTs, it would be a no-op. Then every SUT would behave the same. That may overcomplicate things, and so a `ReasoningPromptResponseSUT` subclass with a handler may be cleaner.

Alternatively, for SUTs we know are reasoning SUTs, there could be an option to bypass the check, like:

```python
def __init__(self, uid: str, is_reasoning: bool | None = None):
    if is_reasoning is None:
        ...  # do the check
    elif is_reasoning is False:
        self.reasoning_handler = None
    else:
        self.reasoning_handler = ReasoningHandler()
```
Contributor (Author)

I believe this is more similar to my original PR (which I actually prefer). I could implement this approach, but I know @wpietri had qualms about the factory instantiating a SUT, doing a reasoning check, and then possibly modifying the SUT.

I think this would be difficult if there are multiple reasoning handlers.

The scope of this first task I'm tackling only includes SUTs that we do not have prior information about. But that's a good idea for later!
Contributor

My qualms depend a lot on how the modification happens. If it's just setting a flag or something, I think that's ok. But I have a bias toward immutability, so I'd rather we instantiate a whole and complete thing. E.g., the thing that creates the SUT could create a naive version that assumes no reasoning, check to see if there's reasoning in the output, and then create a new SUT instance that handles reasoning.

One thing that strikes me is that SUTs are not going to change their reasoning-ness on the fly, so SUT instantiation isn't the only time to collect the information of what's a reasoning SUT and what isn't. If we make a big lookup table of what has reasoning and what doesn't, which we have to do anyway to properly categorize our benchmark results, we would only have to do run-time checks for novel-to-us SUTs. If the look-it-up approach is our default case and the run-time check becomes the unusual one, could that orientation make the code clearer?
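The look-it-up-by-default idea could be sketched roughly like this. All names here (`KNOWN_REASONING_SUTS`, `probe_for_reasoning`, `is_reasoning_sut`) are hypothetical illustrations, not actual codebase names:

```python
# Hypothetical sketch: a lookup table is the default case, and the
# run-time check only fires for SUTs that are novel to us.

KNOWN_REASONING_SUTS = {
    "some-reasoning-sut": True,
    "some-plain-sut": False,
}

def probe_for_reasoning(sut_uid: str) -> bool:
    """Stand-in for a run-time check against the live SUT's output."""
    return False

def is_reasoning_sut(sut_uid: str) -> bool:
    # Known SUTs are answered from the table; only unknown SUTs get probed.
    if sut_uid in KNOWN_REASONING_SUTS:
        return KNOWN_REASONING_SUTS[sut_uid]
    return probe_for_reasoning(sut_uid)
```

The factory could then instantiate a complete, immutable SUT up front, since the reasoning-ness is known before construction in the common case.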
Contributor (Author)

Isn't that what I did in the first PR? Does that mean your only issue was with the dynamic class mixing?
Contributor (Author)

But if I do the reasoning check + instantiate a new SUT with a reasoning handler in the factory... is there any way to do it that avoids both the dynamic class mixing from the first PR and the circular dependencies in this PR? @rogthefrog @wpietri
Contributor

As discussed in DM, I think the circular dependency thing is for a broader discussion and not in scope. And fwiw I don't think the initial implementation is too problematic.
```diff
     def run_readiness_check(self) -> ReadyResponse:
         raw_request = self.translate_text_prompt(_READINESS_CHECK_TEXT_PROMPT, options=_READINESS_CHECK_SUT_OPTIONS)
         raw_response = self.evaluate(raw_request)
         response = self.translate_response(raw_request, raw_response)
         return ReadyResponse(is_ready=response.text is not None, response=response)

     @not_implemented
     def translate_text_prompt(self, prompt: TextPrompt, options: ModelOptions):
+        if self.reasoning_handler is not None:
+            return self.reasoning_handler.translate_text_prompt(self, prompt, options)
+        return self._translate_text_prompt(prompt, options)
+
+    @not_implemented
+    def _translate_text_prompt(self, prompt: TextPrompt, options: ModelOptions):
         """Convert the prompt + SUT options into the SUT's native representation.

         This method must be implemented if the SUT accepts text prompts.

@@ -80,12 +90,24 @@ def translate_chat_prompt(self, prompt: ChatPrompt, options: ModelOptions):
         """
         raise NotImplementedError(f"SUT {self.__class__.__name__} does not implement translate_chat_prompt.")

-    @abstractmethod
     def evaluate(self, request):
         """Evaluate this SUT on the native request."""
-        pass
+        if self.reasoning_handler is not None:
+            return self.reasoning_handler.evaluate(self, request)
+        return self._evaluate(request)
+
+    @abstractmethod
+    def _evaluate(self, request):
+        """Evaluate this SUT on the native request."""
+        pass

-    @abstractmethod
     def translate_response(self, request, response) -> SUTResponse:
         """Convert the native response into a form all Tests can process."""
-        pass
+        if self.reasoning_handler is not None:
+            return self.reasoning_handler._translate_response(self, request, response)
+        return self._translate_response(request, response)
+
+    @abstractmethod
+    def _translate_response(self, request, response) -> SUTResponse:
+        """Convert the native response into a form all Tests can process."""
+        pass
```
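The public/private split in the diff above is a template-method shape: the public method dispatches to the reasoning handler when one is attached, and otherwise falls through to the underscore-prefixed hook that subclasses implement. A stripped-down, self-contained sketch of that pattern (the handler and SUT classes here are illustrative stand-ins, not the real `ReasoningHandler` or `PromptResponseSUT`):

```python
import re
from abc import ABC, abstractmethod
from typing import Optional

class FakeReasoningHandler:
    """Stand-in handler that strips <think>...</think> reasoning blocks."""

    def translate_response(self, sut, request, response):
        # Reuse the SUT's own parsing hook, then remove the reasoning block.
        parsed = sut._translate_response(request, response)
        return re.sub(r"<think>.*?</think>\s*", "", parsed, flags=re.DOTALL)

class Sut(ABC):
    def __init__(self, handler: Optional[FakeReasoningHandler] = None):
        self.reasoning_handler = handler

    def translate_response(self, request, response) -> str:
        # Public entry point: dispatch to the handler when one is attached,
        # otherwise fall through to the subclass hook.
        if self.reasoning_handler is not None:
            return self.reasoning_handler.translate_response(self, request, response)
        return self._translate_response(request, response)

    @abstractmethod
    def _translate_response(self, request, response) -> str:
        """Subclass hook: convert the native response."""

class EchoSut(Sut):
    def _translate_response(self, request, response) -> str:
        return response

plain = EchoSut()
reasoning = EchoSut(handler=FakeReasoningHandler())
```

Subclasses only ever implement the private hook, so reasoning-aware and plain SUTs share one public interface.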
```diff
@@ -301,6 +301,7 @@ def runner(self, run_dir):
     def invoke(command, args=None, **kwargs):
         args = list(args or [])
         full_args = ["--run-path", run_dir] + args
+        print(command, full_args, kwargs)
```
Contributor

Did you mean to leave this in?
```diff
         return runner.invoke(command, full_args, **kwargs)

     return invoke

@@ -316,7 +317,7 @@ def invoke(command, args=None, **kwargs):
     ],
     # TODO add more locales as we add support for them
 )
-@pytest.mark.parametrize("sut_uid", ["fake-sut", "google/gemma-3-27b-it:scaleway:hfrelay"])
+@pytest.mark.parametrize("sut_uid", ["fake-sut"])
 def test_benchmark_basic_run_produces_json(
     self,
     monkeypatch,

@@ -396,7 +397,7 @@ def test_benchmark_basic_run_produces_json(
     ],
     # TODO add more locales as we add support for them
 )
-@pytest.mark.parametrize("sut_uid", ["fake-sut", "google/gemma-3-27b-it:scaleway:hfrelay;mt=500;t=0.3"])
+@pytest.mark.parametrize("sut_uid", ["fake-sut"])
 def test_benchmark_multiple_suts_produces_json(
     self, mock_run_benchmarks, runner, version, locale, prompt_set, sut_uid, run_dir, monkeypatch
 ):

@@ -546,7 +547,7 @@ def test_calls_score_benchmark_with_correct_v1_locale(self, runner, mock_run_ben
     #
     # benchmark_arg = mock_score_benchmarks.call_args.args[0][0]
     # assert isinstance(benchmark_arg, GeneralPurposeAiChatBenchmark)
-    @pytest.mark.parametrize("sut_uid", ["fake-sut", "google/gemma-3-27b-it:scaleway:hfrelay"])
+    @pytest.mark.parametrize("sut_uid", ["fake-sut"])
     def test_v1_en_us_demo_is_default(self, runner, mock_run_benchmarks, sut_uid):
         _ = runner(cli, ["benchmark", "general", "--sut", sut_uid])

@@ -555,14 +556,14 @@ def test_v1_en_us_demo_is_default(self, runner, mock_run_benchmarks, sut_uid):
         assert benchmark_arg.locale == EN_US
         assert benchmark_arg.prompt_set == "demo"

-    @pytest.mark.parametrize("sut_uid", ["fake-sut", "google/gemma-3-27b-it:scaleway:hfrelay"])
+    @pytest.mark.parametrize("sut_uid", ["fake-sut"])
     def test_nonexistent_benchmark_prompt_sets_can_not_be_called(self, runner, sut_uid):
         result = runner(cli, ["benchmark", "general", "--prompt-set", "fake", "--sut", sut_uid])
         assert result.exit_code == 2
         assert "Invalid value for '--prompt-set'" in result.output

     @pytest.mark.parametrize("prompt_set", GENERAL_PROMPT_SETS.keys())
-    @pytest.mark.parametrize("sut_uid", ["fake-sut", "google/gemma-3-27b-it:scaleway:hfrelay"])
+    @pytest.mark.parametrize("sut_uid", ["fake-sut"])
     def test_calls_score_benchmark_with_correct_prompt_set(self, runner, mock_run_benchmarks, prompt_set, sut_uid):
         _ = runner(cli, ["benchmark", "general", "--prompt-set", prompt_set, "--sut", sut_uid])
```
It feels like it's `SUTResponse`'s responsibility to report whether it contains reasoning.
I see your thinking there. But answering that question depends on the type of reasoning pattern that the model is following. And I think matching against all known reasoning patterns is a lot to ask of a `SUTResponse` object that is mainly intended to function as a bag of data.

Also, I think it's cleaner if all reasoning functionality (matching, parsing, request formatting) is handled by the same object.
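One way to read "all reasoning functionality in the same object": each handler bundles the pattern match with the corresponding parse, so supporting a new reasoning style means adding one new handler class. This is an illustrative sketch only; `ThinkTagHandler` and its method names are invented here, not the PR's actual `ReasoningHandler` API:

```python
import re
from typing import Optional, Tuple

class ThinkTagHandler:
    """Illustrative handler for one reasoning pattern: <think>...</think>.

    Matching and parsing live on the same object, so the SUT never needs
    to know the details of any particular reasoning format.
    """

    PATTERN = re.compile(r"<think>(?P<reasoning>.*?)</think>\s*", re.DOTALL)

    @classmethod
    def matches(cls, raw_text: str) -> bool:
        # Used at SUT-creation time to detect whether a response has reasoning.
        return cls.PATTERN.search(raw_text) is not None

    @classmethod
    def parse(cls, raw_text: str) -> Tuple[Optional[str], str]:
        # Split a raw response into (reasoning, remaining text).
        m = cls.PATTERN.search(raw_text)
        if m is None:
            return None, raw_text
        return m.group("reasoning"), cls.PATTERN.sub("", raw_text, count=1)
```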
I was thinking the same thing as Roger. But I think the matching can happen in the SUT. How about we have a subclass of `SUTResponse` that is a `ReasoningSUTResponse` which has a couple of new properties?

Off the top of my head, I'd say `SUTResponse`'s `text` should be the raw response, as that's what the SUT actually output. Then we add a couple of properties that could be reasoning specific, like `reasoning_text` and `response_text`. Or you could add a property to `SUTResponse` that is something like `clean_text` or `filtered_text` or `user_facing_text`, which in the superclass is the same thing as `text` and in the `ReasoningSUTResponse` returns the stripped output.

That preserves the generality of `PromptResponseSUT` for a variety of use cases, but allows benchmarks to ask for the cleaned text they need for evaluation.
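A rough sketch of that subclass idea, using plain dataclasses rather than the project's pydantic `BaseModel` purely to keep it self-contained; the property and field names follow the suggestion above and are not settled API:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class SUTResponse:
    """Bag of data for what came out of the SUT."""

    text: str  # the raw response, exactly as the SUT produced it

    @property
    def clean_text(self) -> str:
        # In the base class there is nothing to strip.
        return self.text

@dataclass
class ReasoningSUTResponse(SUTResponse):
    reasoning_text: Optional[str] = None
    response_text: str = ""  # the response with reasoning stripped out

    @property
    def clean_text(self) -> str:
        # Benchmarks ask for clean_text and never care which subclass they got.
        return self.response_text
```

Callers that only ever read `clean_text` behave identically for both classes, which is the generality-preserving point of the proposal.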
I can do that, but it feels overly complicated. I think people (Kurt, workstream folks) would struggle keeping track of when to look at `text` and when to look at `response_text`. I think it would also make maintenance more difficult (every time you look at the `SUTResponse` object, you have to recall that `text` only matters in one case and `reasoning_text` matters in another). I think this has a lot of downstream repercussions for understandability.

I don't see the advantages here.