Description
I tested the accuracy of gsm8k-cot on Qwen2-7B-Instruct, whose model card reports 0.82. However, when I ran it through lm-eval-harness, both gsm8k and gsm8k-cot showed a significant accuracy gap.
[gsm8k]
VLLM_USE_V1=1 CUDA_VISIBLE_DEVICES=1 lm_eval --model vllm --model_args pretrained=Qwen/Qwen2-7B-Instruct,dtype=auto --tasks gsm8k --device cuda:1 --apply_chat_template --fewshot_as_multiturn --num_fewshot 8 --gen_kwargs temperature=0 --batch_size auto --seed 123 --output_path /data/jinwei/bench_res/ --log_samples
[gsm8k-cot]
CUDA_VISIBLE_DEVICES=0 lm_eval --model sglang --model_args pretrained=Qwen/Qwen2-7B-Instruct,dtype=auto --tasks gsm8k_cot --device "cuda" --apply_chat_template --fewshot_as_multiturn --num_fewshot 8 --gen_kwargs temperature=0 --batch_size auto --seed 123 --output_path /data/jinwei/bench_res/ --log_samples
I analyzed the output logs and figured out the reason: the parser cannot detect many correct answers under "exact match". It can only extract answers that follow the exact format "The answer is x."
Some error patterns, all reported with "filtered_resps": ["[invalid]"], "filter": "strict-match", "metrics": ["exact_match"]:
...The answer is \\(366\\).
...Therefore, the answer is 23 jewels.
...Therefore, Brandon's iPhone is 8 years old.
... The answer is: $40.
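All four failures contain the correct number; they just don't end with the literal phrase the strict filter expects. A more lenient fallback, similar in spirit to the harness's "flexible-extract" filter (this is a hypothetical sketch, not the harness's actual code), would be to take the last number appearing in the response:

```python
import re

def extract_last_number(text: str) -> str:
    """Hypothetical lenient extractor: return the last number in the
    response, stripping a leading '$' and thousands separators."""
    matches = re.findall(r"-?\$?\d[\d,]*(?:\.\d+)?", text)
    if not matches:
        return "[invalid]"
    return matches[-1].lstrip("$").replace(",", "")

# The failure patterns above are all recovered:
samples = [
    "...The answer is \\(366\\).",
    "...Therefore, the answer is 23 jewels.",
    "...Therefore, Brandon's iPhone is 8 years old.",
    "... The answer is: $40.",
]
for s in samples:
    print(extract_last_number(s))  # 366, 23, 8, 40
```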
Therefore, I modified the prompt here by simply adding: (Please summarize the result at the end with the format "The answer is xxx", where xxx is the result.).
Simply telling the model the expected output format works quite well!
The strict accuracy rose from 0.57 to 0.80, much closer to the model card :)
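This matches what the strict filter can and cannot see. A minimal sketch with an approximation of the strict-match pattern (the harness's actual regex for gsm8k_cot may differ slightly) shows why the instructed format is captured while free-form endings are dropped as "[invalid]":

```python
import re

# Approximation of the strict-match filter (the real gsm8k_cot YAML
# regex may differ slightly).
STRICT = re.compile(r"The answer is (\-?[0-9\.\,]+)\.")

# Free-form ending: no match, so the sample is scored "[invalid]".
print(STRICT.search("Therefore, Brandon's iPhone is 8 years old."))  # None

# Ending produced after the prompt tweak: the answer is extracted.
m = STRICT.search("The answer is 42.")
print(m.group(1))  # 42
```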

Would you consider adding this formatting instruction to more tasks where possible? I believe it could bridge the gap with the HF model cards :)

