Commit aabc5e3

Update math_verify_leaderboard.md (#2680)
- Removed extra period outside the quotation mark in "Final answer is [ANSWER]. I hope it is correct."
- Corrected subject-verb agreement: changed "Another major beneficiary" → "Other major beneficiaries" to match plural "Qwen derivatives."
- Fixed incorrect verb tense in "It is using 1,324 highest difficulty problems" by changing "is using" → "uses" for proper simple present tense.
1 parent 6dc019f commit aabc5e3


math_verify_leaderboard.md

Lines changed: 3 additions & 3 deletions
````diff
@@ -17,7 +17,7 @@ authors:
 Today, we’re thrilled to share that we’ve used Math-Verify to thoroughly re-evaluate all 3,751 models ever submitted to the Open LLM Leaderboard, for even fairer and more robust model comparisons!
 
 ## Why math evaluation on the Open LLM Leaderboard was broken
-The [Open LLM Leaderboard](https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard) is the most used leaderboard on the Hugging Face Hub: it compares open Large Language Models (LLM) performance across various tasks. One of these tasks, called MATH-Hard, is specifically about math problems: it evaluates how well LLMs solve high-school and university-level math problems. It is using 1,324 highest difficulty problems (Level 5) from the [Hendrycks MATH](https://github.com/hendrycks/math) dataset spread across 7 topics (precalculus, prealgebra, algebra, intermediate algebra, counting/probability and number theory), using a 5-shot approach (the model is provided with 5 examples in the prompt to showcase how it should answer).
+The [Open LLM Leaderboard](https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard) is the most used leaderboard on the Hugging Face Hub: it compares open Large Language Models (LLM) performance across various tasks. One of these tasks, called MATH-Hard, is specifically about math problems: it evaluates how well LLMs solve high-school and university-level math problems. It uses 1,324 highest difficulty problems (Level 5) from the [Hendrycks MATH](https://github.com/hendrycks/math) dataset spread across 7 topics (precalculus, prealgebra, algebra, intermediate algebra, counting/probability and number theory), using a 5-shot approach (the model is provided with 5 examples in the prompt to showcase how it should answer).
 
 A typical question looks like this:
 ```
````
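The 5-shot setup mentioned in the changed paragraph above simply prepends five worked problem/solution pairs to the prompt before the question under evaluation, so the model imitates the demonstrated answer format. A minimal sketch of the idea in Python; the example pairs and helper name are hypothetical, not the leaderboard's actual harness code:

```python
# Minimal sketch of 5-shot prompt assembly. The example pair and names here
# are hypothetical; the leaderboard's real evaluation harness differs.
FEW_SHOT_EXAMPLES = [
    ("What is $1+1$?", "Final answer is $2$. I hope it is correct."),
    # ...four more (problem, worked solution) pairs would complete the 5 shots
]

def build_prompt(question: str) -> str:
    # Render each worked example, then append the new question unanswered,
    # leaving the model to complete it in the demonstrated format.
    shots = "\n\n".join(f"Problem: {q}\n{a}" for q, a in FEW_SHOT_EXAMPLES)
    return f"{shots}\n\nProblem: {question}\n"
```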
````diff
@@ -31,7 +31,7 @@ To which the answer would be:
 
 In the leaderboard, models would have to end their answers with a very specific string (following the [Minerva-Math paper](https://arxiv.org/abs/2206.14858)):
 ```
-“Final answer is [ANSWER]. I hope it is correct.”.
+“Final answer is [ANSWER]. I hope it is correct.”
 ```
 
 The leaderboard would then try to parse `[ANSWER]` with [SymPy](https://docs.sympy.org/latest/index.html) to convert it to a symbolic representation (and simplify the values if needed), before finally comparing it to the gold target.
````
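The parsing step described in the last context line of this hunk is the fragile part the blog post goes on to critique. A minimal sketch of that extract-then-compare pipeline, assuming a plain regex over the fixed answer template and SymPy's generic `sympify` parser (the function names are hypothetical; the real extraction logic is more involved):

```python
# Hedged sketch of the old template extraction + SymPy comparison; illustrative only.
import re
from sympy import simplify, sympify

# Matches the Minerva-style template and captures whatever sits in [ANSWER].
ANSWER_RE = re.compile(r"Final answer is (.+?)\. I hope it is correct\.")

def is_correct(completion: str, gold: str) -> bool:
    match = ANSWER_RE.search(completion)
    if match is None:
        return False  # template missing: the answer is counted as wrong
    try:
        pred = sympify(match.group(1))  # string -> symbolic expression
        target = sympify(gold)
        # Symbolic equality: the difference should simplify to zero.
        return simplify(pred - target) == 0
    except Exception:
        return False  # unparsable strings are scored as incorrect

# Equivalent forms compare equal once simplified:
assert is_correct("Thus x = 1/2. Final answer is 1/2. I hope it is correct.", "0.5")
```

Note that anything `sympify` cannot read (for example, raw LaTeX such as `\frac{1}{2}`) falls through to the `except` branch and is scored as incorrect, which is the kind of false negative the re-evaluation with Math-Verify addresses.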
````diff
@@ -98,7 +98,7 @@ But Qwen models aren’t alone. Another major family affected is **DeepSeek**. A
 
 ### Changes in the MATH-Hard Leaderboard
 As mentioned at the beginning, the Top 20 rankings have undergone a significant shift, with **Nvidia’s AceMath** models now dominating the MATH-Hard leaderboard.
-Another major beneficiary of this change are the **Qwen** derivatives, which are now almost exclusively the only models ranking right below AceMath.
+Other major beneficiaries of this change are the **Qwen** derivatives, which are now almost exclusively the only models ranking right below AceMath.
 Following is the complete table comparing the old and new Top 20 leaderboard rankings:
 
 ![math_hard_leaderboard_change](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/blog/math_verify_leaderboard/math-hard-change.png)
````
