- Removed extra period outside the quotation mark in "Final answer is [ANSWER]. I hope it is correct."
- Corrected subject-verb agreement: changed "Another major beneficiary" → "Other major beneficiaries" to match plural "Qwen derivatives."
- Fixed incorrect verb tense in "It is using 1,324 highest difficulty problems" by changing "is using" → "uses" for proper simple present tense.
math_verify_leaderboard.md (3 additions & 3 deletions)
@@ -17,7 +17,7 @@ authors:
Today, we’re thrilled to share that we’ve used Math-Verify to thoroughly re-evaluate all 3,751 models ever submitted to the Open LLM Leaderboard, for even fairer and more robust model comparisons!
## Why math evaluation on the Open LLM Leaderboard was broken
- The [Open LLM Leaderboard](https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard) is the most used leaderboard on the Hugging Face Hub: it compares open Large Language Models (LLM) performance across various tasks. One of these tasks, called MATH-Hard, is specifically about math problems: it evaluates how well LLMs solve high-school and university-level math problems. It is using 1,324 highest difficulty problems (Level 5) from the [Hendrycks MATH](https://github.com/hendrycks/math) dataset spread across 7 topics (precalculus, prealgebra, algebra, intermediate algebra, counting/probability and number theory), using a 5-shot approach (the model is provided with 5 examples in the prompt to showcase how it should answer).
+ The [Open LLM Leaderboard](https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard) is the most used leaderboard on the Hugging Face Hub: it compares open Large Language Models (LLM) performance across various tasks. One of these tasks, called MATH-Hard, is specifically about math problems: it evaluates how well LLMs solve high-school and university-level math problems. It uses 1,324 highest difficulty problems (Level 5) from the [Hendrycks MATH](https://github.com/hendrycks/math) dataset spread across 7 topics (precalculus, prealgebra, algebra, intermediate algebra, counting/probability and number theory), using a 5-shot approach (the model is provided with 5 examples in the prompt to showcase how it should answer).
A typical question looks like this:
```
@@ -31,7 +31,7 @@ To which the answer would be:
In the leaderboard, models would have to end their answers with a very specific string (following the [Minerva-Math paper](https://arxiv.org/abs/2206.14858)):
```
- “Final answer is [ANSWER]. I hope it is correct.”.
+ “Final answer is [ANSWER]. I hope it is correct.”
```
The leaderboard would then try to parse `[ANSWER]` with [SymPy](https://docs.sympy.org/latest/index.html) to convert it to a symbolic representation (and simplify the values if needed), before finally comparing it to the gold target.
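
The parse-and-compare step itself isn't shown in the diff, so here is a minimal, hypothetical sketch of the idea described above: pull the Minerva-style `[ANSWER]` span out of a generation with a regex, parse both it and the gold target with SymPy's LaTeX parser, and accept the answer if the difference simplifies to zero. The regex and function names are illustrative assumptions, not the leaderboard's actual implementation.

```python
import re

from sympy import simplify
from sympy.parsing.latex import parse_latex  # requires antlr4-python3-runtime

# Minerva-style closing sentence, e.g. "Final answer is \frac{1}{2}. I hope it is correct."
FINAL_ANSWER_RE = re.compile(r"Final answer is (.+?)\. I hope it is correct\.")

def extract_answer(generation: str):
    """Return the [ANSWER] span from a model generation, or None if the format is missing."""
    match = FINAL_ANSWER_RE.search(generation)
    return match.group(1).strip() if match else None

def answers_match(predicted: str, gold: str) -> bool:
    """Treat two answers as equal if their difference simplifies to zero."""
    try:
        return simplify(parse_latex(predicted) - parse_latex(gold)) == 0
    except Exception:
        # Anything that fails to parse is scored as incorrect.
        return False

generation = r"... Final answer is \frac{1}{2}. I hope it is correct."
answer = extract_answer(generation)  # "\frac{1}{2}"
print(answers_match(answer, "0.5"))  # True: 1/2 and 0.5 simplify to the same value
```

Even in a sketch like this, the fragility is easy to see: a missing closing sentence or an expression the LaTeX parser chokes on is enough to turn a correct answer into a zero.
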
@@ -98,7 +98,7 @@ But Qwen models aren’t alone. Another major family affected is **DeepSeek**. A
### Changes in the MATH-Hard Leaderboard
As mentioned at the beginning, the Top 20 rankings have undergone a significant shift, with **Nvidia’s AceMath** models now dominating the MATH-Hard leaderboard.
- Another major beneficiary of this change are the **Qwen** derivatives, which are now almost exclusively the only models ranking right below AceMath.
+ Other major beneficiaries of this change are the **Qwen** derivatives, which are now almost exclusively the only models ranking right below AceMath.
Following is the complete table comparing the old and new Top 20 leaderboard rankings: