@fxmarty-amd
Contributor

As per the title, this is along the same lines as #2924. We prompt the model with few-shot examples, but never actually instruct it to answer the last question or to follow the format of the example answers, which is not great.

I took inspiration from this existing task description:

description: "Here are some example questions from experts. Answer the final question yourself, following the format of the previous questions exactly.\n"
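
To make the idea concrete, here is a minimal sketch of how such a description ends up in front of the few-shot examples. The helper `build_prompt`, the `Question:`/`Answer:` layout, and the toy example are hypothetical and only for illustration; they are not the harness's actual prompt-building code. Only the description string is taken from the line quoted above.

```python
# Description string quoted from the existing task config referenced above.
DESCRIPTION = (
    "Here are some example questions from experts. "
    "Answer the final question yourself, following the format "
    "of the previous questions exactly.\n"
)


def build_prompt(fewshot, question, description=""):
    """Hypothetical helper: join an optional description, the solved
    few-shot examples, and the final unanswered question."""
    shots = [f"Question: {q}\nAnswer: {a}" for q, a in fewshot]
    # The last question is left open for the model to complete.
    shots.append(f"Question: {question}\nAnswer:")
    return description + "\n\n".join(shots)


# Toy few-shot example (made up, not taken from the dataset).
fewshot = [("What is 2 + 3?", "2 + 3 = 5\n#### 5")]

without_instruction = build_prompt(fewshot, "What is 4 + 4?")
with_instruction = build_prompt(fewshot, "What is 4 + 4?", DESCRIPTION)
print(with_instruction)
```

Without the description, the prompt is just a list of solved examples followed by an open question, and nothing tells the model that the answer format matters.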

This resolves #2707.

cc @baberabb

cc @Qubitium FYI, probably interesting to you as well.

For reference, on main:

hf ({'pretrained': '/models/openai_gpt-oss-20b', 'dtype': 'auto', 'chat_template_args': {'reasoning_effort': 'low'}, 'enable_thinking': True, 'think_end_token': 200008}), gen_kwargs: (max_gen_toks=4048), limit: None, num_fewshot: None, batch_size: 16
|    Tasks     |Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|--------------|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k_platinum|      3|flexible-extract|     5|exact_match|↑  |0.8983|±  |0.0087|
|              |       |strict-match    |     5|exact_match|↑  |0.0596|±  |0.0068|

With this PR:

hf ({'pretrained': '/models/openai_gpt-oss-20b', 'dtype': 'auto', 'chat_template_args': {'reasoning_effort': 'low'}, 'enable_thinking': True, 'think_end_token': 200008}), gen_kwargs: (max_gen_toks=4048), limit: None, num_fewshot: None, batch_size: 16
|    Tasks     |Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|--------------|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k_platinum|      3|flexible-extract|     5|exact_match|↑  |0.9404|±  |0.0068|
|              |       |strict-match    |     5|exact_match|↑  |0.7072|±  |0.0131|
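
The large gap between the two filters on main suggests the model often reaches the right number but does not write it in the expected answer format. A rough sketch of that difference follows; the two regular expressions are simplified assumptions for illustration and are not necessarily the patterns the task config uses.

```python
import re

# Assumed, simplified stand-ins for the two filters (illustration only).
STRICT = re.compile(r"#### (-?[0-9.,]+)")   # requires the "#### <number>" marker
FLEXIBLE = re.compile(r"-?[0-9][0-9.,]*")   # accepts any number in the text


def strict_match(text):
    """Return the number after '#### ', or None if the marker is missing."""
    m = STRICT.search(text)
    return m.group(1) if m else None


def flexible_extract(text):
    """Return the last number found anywhere in the text, or None."""
    nums = FLEXIBLE.findall(text)
    return nums[-1].rstrip(".,") if nums else None


# Two made-up model outputs with the same final answer.
free_form = "She has 4 + 4 = 8 apples, so the answer is 8."
formatted = "4 + 4 = 8\n#### 8"

for output in (free_form, formatted):
    print(strict_match(output), flexible_extract(output))
```

Under these assumed patterns, the free-form output only passes the flexible filter, while the formatted output passes both, which is the kind of behavior an explicit format instruction is meant to encourage.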

@Qubitium
Contributor

Qubitium commented Nov 20, 2025

@fxmarty-amd @baberabb We have tested this and validated that it does fix accuracy issues (elevated scores).

@fxmarty-amd
Contributor Author

@Qubitium That's great! Which model did you test on?

Development

Successfully merging this pull request may close these issues.

[Accuracy gap with official model card due to wrong parsing]