Fix `gsm8k_platinum` description #3411

fxmarty-amd · 2025-11-17T11:48:46Z

As per the title; this is along the same lines as #2924. We prompt the model with few-shot examples, but never actually instruct it to answer the last question or to follow the format of the example answers, which is not great.

I take inspiration from the GPQA n-shot task config:

lm-evaluation-harness/lm_eval/tasks/gpqa/n_shot/_gpqa_n_shot_yaml

Line 9 in fcddf19

    description: "Here are some example questions from experts. Answer the final question yourself, following the format of the previous questions exactly.\n"
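To make the effect of a `description` concrete, here is a minimal sketch of how a few-shot prompt might be assembled with and without it. The `build_prompt` helper, the `Question:`/`Answer:` layout, and the blank-line delimiter are assumptions for illustration, not the harness's actual code, and the description text is the GPQA one quoted above rather than the exact wording added in this PR.

```python
# Hypothetical sketch: not the lm-evaluation-harness implementation.
# The description below is the GPQA one quoted above, used as a stand-in.
DESCRIPTION = (
    "Here are some example questions from experts. Answer the final question "
    "yourself, following the format of the previous questions exactly.\n"
)


def build_prompt(fewshot, question, description=""):
    """Prepend an optional description to few-shot examples and the final question.

    fewshot: list of (question, answer) pairs used as in-context examples.
    question: the final question the model is expected to answer.
    """
    shots = [f"Question: {q}\nAnswer: {a}" for q, a in fewshot]
    # The final question is left open so the model completes the answer.
    shots.append(f"Question: {question}\nAnswer:")
    return description + "\n\n".join(shots)


if __name__ == "__main__":
    examples = [("What is 1 + 1?", "1 + 1 = 2\n#### 2")]
    # Without a description the model only sees examples and an open question.
    print(build_prompt(examples, "What is 2 + 2?"))
    print("-----")
    # With a description the model is told explicitly to answer in the same format.
    print(build_prompt(examples, "What is 2 + 2?", DESCRIPTION))
```

Without the description, a chat or reasoning model has no explicit cue that the examples define the expected answer format, which is the gap this PR targets.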

This resolves #2707.

cc @baberabb

cc @Qubitium FYI, probably interesting to you as well.

For reference, on main:

hf ({'pretrained': '/models/openai_gpt-oss-20b', 'dtype': 'auto', 'chat_template_args': {'reasoning_effort': 'low'}, 'enable_thinking': True, 'think_end_token': 200008}), gen_kwargs: (max_gen_toks=4048), limit: None, num_fewshot: None, batch_size: 16
|    Tasks     |Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|--------------|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k_platinum|      3|flexible-extract|     5|exact_match|↑  |0.8983|±  |0.0087|
|              |       |strict-match    |     5|exact_match|↑  |0.0596|±  |0.0068|

With this PR:

hf ({'pretrained': '/models/openai_gpt-oss-20b', 'dtype': 'auto', 'chat_template_args': {'reasoning_effort': 'low'}, 'enable_thinking': True, 'think_end_token': 200008}), gen_kwargs: (max_gen_toks=4048), limit: None, num_fewshot: None, batch_size: 16
|    Tasks     |Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|--------------|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k_platinum|      3|flexible-extract|     5|exact_match|↑  |0.9404|±  |0.0068|
|              |       |strict-match    |     5|exact_match|↑  |0.7072|±  |0.0131|
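The gap between the two filters is easier to read with a sketch of what each one accepts. The regular expressions below are assumptions modelled on the GSM8K task configs (a `#### <number>` pattern for strict matching, and "last number in the output" for flexible extraction); the function names and the `[invalid]` fallback are illustrative and may differ from the actual `gsm8k_platinum` config.

```python
import re

# Assumed patterns, modelled on the GSM8K task filters; verify against the task YAML.
STRICT = re.compile(r"#### (\-?[0-9\.\,]+)")
FLEXIBLE = re.compile(r"(-?[$0-9.,]{2,})|(-?[0-9]+)")


def strict_match(text):
    """Return the number following '#### ', or '[invalid]' if the format is absent."""
    m = STRICT.search(text)
    return m.group(1) if m else "[invalid]"


def flexible_extract(text):
    """Return the last number-like token found anywhere in the text."""
    matches = FLEXIBLE.findall(text)
    if not matches:
        return "[invalid]"
    # findall yields one tuple per match; exactly one group in it is non-empty.
    return next((g for g in matches[-1] if g), "[invalid]")


if __name__ == "__main__":
    formatted = "16 - 3 - 4 = 9 eggs, 9 * 2 = 18\n#### 18"
    unformatted = "So she makes 18 dollars"
    for out in (formatted, unformatted):
        print(repr(strict_match(out)), repr(flexible_extract(out)))
```

Under these assumptions, a correct answer that does not end in the `#### <number>` format still scores under `flexible-extract` but counts as a miss under `strict-match`, which is consistent with the large `strict-match` jump once the prompt asks the model to follow the example format.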

Qubitium · 2025-11-20T09:16:20Z

@fxmarty-amd @baberabb We have tested this and validated that it does fix accuracy issues (elevated scores).

fxmarty-amd · 2025-11-20T09:28:53Z

@Qubitium That's great! Which model did you test on?

add description for gsm8k_platinum

47a2b23

fxmarty-amd requested a review from baberabb as a code owner November 17, 2025 11:48

baberabb mentioned this pull request Nov 19, 2025

add instruct_format (or chat_overload or similar) field to task configs #3417

Open

fxmarty-amd mentioned this pull request Dec 1, 2025

Question about evals - gsm8k Template & Strict extraction #3438

Closed

fxmarty-amd mentioned this pull request Dec 9, 2025

Qwen3-32B results with GSM8K and GSM8K_cot worse than paper #3129

Open
