-# GSM8k
+# AIME
 
 ## Paper
-Training Verifiers to Solve Math Word Problems
-https://arxiv.org/abs/2110.14168
-
-State-of-the-art language models can match human performance on many tasks, but
-they still struggle to robustly perform multi-step mathematical reasoning. To
-diagnose the failures of current models and support research, we introduce GSM8K,
-a dataset of 8.5K high quality linguistically diverse grade school math word problems.
-We find that even the largest transformer models fail to achieve high test performance,
-despite the conceptual simplicity of this problem distribution.
-
-NOTE: See the official implementation of the task:
-  https://github.com/openai/grade-school-math/blob/master/grade_school_math/calculator.py
-for how to make use of the dataset's calculator annotations in your language
-model's sample/generation function.
-
-Homepage: https://github.com/openai/grade-school-math
-
-
-## Citation
-```
-@misc{cobbe2021training,
-    title={Training Verifiers to Solve Math Word Problems},
-    author={Karl Cobbe and Vineet Kosaraju and Mohammad Bavarian and Jacob Hilton and Reiichiro Nakano and Christopher Hesse and John Schulman},
-    year={2021},
-    eprint={2110.14168},
-    archivePrefix={arXiv},
-    primaryClass={cs.LG}
-}
-```
-
-### Groups and Tasks
+The American Invitational Mathematics Examination (AIME) is a selective and prestigious 15-question, 3-hour test given to high school students who qualify based on their AMC 10 or AMC 12 scores. All answers are integers between 0 and 999 inclusive, and the questions increase in difficulty as the exam progresses.
+
+The AIME dataset evaluates a model's problem-solving ability on competition-level mathematics problems.
+
+Homepage: https://huggingface.co/datasets/simplescaling/aime_nofigures
+
+## Dataset
+
+This implementation includes two variants:
+- `aime_nofigures`: AIME problems without figures/diagrams
+- `aime_figures`: AIME problems with figures/diagrams
+
+The dataset draws its problems from AIME competitions, formatted for language model evaluation; the snippet below shows one way to load it.
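+
+For illustration, the underlying data can be loaded directly from the Hugging Face Hub with the `datasets` library. This is a minimal sketch; inspect the printed dataset object for the actual splits and fields rather than assuming them:
+
+```python
+from datasets import load_dataset
+
+# Load the AIME problems without figures from the Hugging Face Hub.
+ds = load_dataset("simplescaling/aime_nofigures")
+print(ds)  # shows the available splits and fields
+```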
+
+## Groups and Tasks
 
 #### Groups
 
 - `math_word_problems`
-- `chain_of_thought`
-- `self_consistency`
 
 #### Tasks
 
-- `gsm8k_yaml`
-- `gsm8k_cot`: GSM8K with Chain-of-Thought
-- `gsm8k_cot_self_consistency`: GSM8K with Chain-of-Thought and Self-Consistency
-- `gsm8k_cot_llama`: GSM8K with prompt formatting modified to conform to the evaluation settings described by Meta here: https://huggingface.co/datasets/meta-llama/Meta-Llama-3.1-8B-Instruct-evals/viewer/Meta-Llama-3.1-8B-Instruct-evals__gsm8k__details?row=0
-  - Use this task with --fewshot_as_multiturn and --apply_chat_template to replicate Meta's reported performance.
+- `aime_nofigures`: AIME problems without figures
+- `aime_figures`: AIME problems with figures
+- `aime24_nofigures`: AIME 2024 problems without figures
+- `aime24_figures`: AIME 2024 problems with figures
+- `aime25_nofigures`: AIME 2025 problems without figures
+- Aggregated variants (`agg8`, `agg64`) that sample each problem multiple times (see the usage sketch after this list)
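+
+As a usage sketch, the tasks can be run through the harness's Python API. This assumes the task names above are registered in your install of the harness; the model below is only a placeholder:
+
+```python
+import lm_eval
+
+# Evaluate a Hugging Face model on one AIME task (placeholder model).
+results = lm_eval.simple_evaluate(
+    model="hf",
+    model_args="pretrained=EleutherAI/pythia-160m",
+    tasks=["aime24_nofigures"],
+)
+print(results["results"])
+```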
 
+### Evaluation
+
+The evaluation checks whether the model's output matches the correct integer answer (0-999). The implementation includes:
+- Answer extraction from model outputs (sketched after this list)
+- Support for boxed answers (e.g., `\boxed{123}`)
+- Optional GPT-4o-mini based answer extraction for complex output formats
+- Coverage and majority-voting metrics for the aggregated tasks
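+
+A minimal sketch of this logic (illustrative only; the helper names are hypothetical and the actual implementation may differ):
+
+```python
+import re
+from collections import Counter
+
+def extract_answer(output: str) -> int | None:
+    """Pull an AIME-style integer answer (0-999), preferring \\boxed{...}."""
+    matches = re.findall(r"\\boxed\{(\d{1,3})\}", output)
+    if not matches:
+        # Fall back to the last standalone 1-3 digit integer in the output.
+        matches = re.findall(r"\b\d{1,3}\b", output)
+    if not matches:
+        return None
+    answer = int(matches[-1])
+    return answer if 0 <= answer <= 999 else None
+
+def majority_vote(outputs: list[str]) -> int | None:
+    """Majority voting across sampled outputs, as in the aggregated tasks."""
+    answers = [a for a in map(extract_answer, outputs) if a is not None]
+    return Counter(answers).most_common(1)[0][0] if answers else None
+```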
+
+### Environment Variables
+
+- `PROCESSOR=gpt-4o-mini`: use GPT-4o-mini for answer extraction (see the sketch after this list)
+- `PROMPTSTEP`: add a thinking-steps prompt
+- `PROMPTTOKEN`: add a thinking-tokens prompt
+- `PROMPTLONG`: add a long-thinking prompt
+- `PROMPTSHORT`: add a short-thinking prompt
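+
+How these flags are consumed is implementation-specific; the sketch below only illustrates the pattern, and the prompt strings are invented for illustration:
+
+```python
+import os
+
+# Hypothetical: choose an extra instruction based on the flags above.
+def extra_instruction() -> str:
+    if os.environ.get("PROMPTLONG"):
+        return "Think carefully and at length before answering."
+    if os.environ.get("PROMPTSHORT"):
+        return "Think briefly before answering."
+    return ""
+
+# PROCESSOR selects model-based answer extraction.
+use_model_extraction = os.environ.get("PROCESSOR") == "gpt-4o-mini"
+```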
 
 ### Checklist
 
-- [x] Is in Eval-harness v1.0 ?
+- [ ] Is in Eval-harness v1.0?
 - [ ] Has been checked for regression from v1.0?
 - [ ] Has been checked for equivalence with original paper methodology?
-- [ ] "Main" checked variant clearly denoted?
-
-### Variant Wishlist
-
-- [ ] Variant with Calculator (see https://github.com/openai/grade-school-math/blob/master/grade_school_math/calculator.py for example implementation)
-- [ ] Using Verifiers
-- [ ] Majority voting "without CoT"
+- [ ] "Main" checked variant clearly denoted?