
Commit c480217

docs: Update task.py example and AIME task README

Parent: c0c23bc

2 files changed: +40, -48 lines

lmms_eval/api/task.py

Lines changed: 1 addition & 1 deletion

@@ -182,7 +182,7 @@ def to_dict(self):

 class Task(abc.ABC):
     """A task represents an entire benchmark including its dataset, problems,
-    answers, and evaluation methods. See BoolQ for a simple example implementation
+    answers, and evaluation methods. See MME for a simple example implementation

     A `doc` can be any python object which represents one instance of evaluation.
     This is usually a dictionary e.g.

lmms_eval/tasks/aime/README.md

Lines changed: 39 additions & 47 deletions

@@ -1,62 +1,54 @@
-# GSM8k
+# AIME

 ## Paper
-Training Verifiers to Solve Math Word Problems
-https://arxiv.org/abs/2110.14168
-
-State-of-the-art language models can match human performance on many tasks, but
-they still struggle to robustly perform multi-step mathematical reasoning. To
-diagnose the failures of current models and support research, we introduce GSM8K,
-a dataset of 8.5K high quality linguistically diverse grade school math word problems.
-We find that even the largest transformer models fail to achieve high test performance,
-despite the conceptual simplicity of this problem distribution.
-
-NOTE: See the official implementation of the task:
-https://github.com/openai/grade-school-math/blob/master/grade_school_math/calculator.py
-for how to make use of the dataset's calculator annotations in your language
-model's sample/generation function.
-
-Homepage: https://github.com/openai/grade-school-math
-
-
-## Citation
-```
-@misc{cobbe2021training,
-    title={Training Verifiers to Solve Math Word Problems},
-    author={Karl Cobbe and Vineet Kosaraju and Mohammad Bavarian and Jacob Hilton and Reiichiro Nakano and Christopher Hesse and John Schulman},
-    year={2021},
-    eprint={2110.14168},
-    archivePrefix={arXiv},
-    primaryClass={cs.LG}
-}
-```
-
-### Groups and Tasks
+The American Invitational Mathematics Examination (AIME) is a selective and prestigious 15-question 3-hour test given to high school students who qualify based on their AMC 10 or AMC 12 scores. All problems have integer answers between 0 and 999 inclusive. Questions increase in difficulty as the exam progresses.
+
+The AIME dataset evaluates mathematical problem-solving capabilities on competition-level mathematics problems.
+
+Homepage: https://huggingface.co/datasets/simplescaling/aime_nofigures
+
+## Dataset
+
+This implementation includes both:
+- `aime_nofigures`: AIME problems without figures/diagrams
+- `aime_figures`: AIME problems with figures/diagrams
+
+The dataset uses problems from AIME competitions, formatted for language model evaluation.
+
+## Groups and Tasks

 #### Groups

 - `math_word_problems`
-- `chain_of_thought`
-- `self_consistency`

 #### Tasks

-- `gsm8k_yaml`
-- `gsm8k_cot`: GSM8K with Chain-of-Thought
-- `gsm8k_cot_self_consistency`: GSM8K with Chain-of-Thought and Self-Consistency
-- `gsm8k_cot_llama`: GSM8K with prompt formatting modified to conform to the evaluation settings described by Meta here: https://huggingface.co/datasets/meta-llama/Meta-Llama-3.1-8B-Instruct-evals/viewer/Meta-Llama-3.1-8B-Instruct-evals__gsm8k__details?row=0
-  - Use this task with --fewshot_as_multiturn and --apply_chat_template to replicate Meta's reported performance.
+- `aime_nofigures`: AIME problems without figures
+- `aime_figures`: AIME problems with figures
+- `aime24_nofigures`: AIME 2024 problems without figures
+- `aime24_figures`: AIME 2024 problems with figures
+- `aime25_nofigures`: AIME 2025 problems without figures
+- Various aggregated versions (agg8, agg64) for multiple sampling

+### Evaluation
+
+The evaluation checks if the model's output matches the correct integer answer (0-999). The implementation includes:
+- Answer extraction from model outputs
+- Support for boxed answers (e.g., `\boxed{123}`)
+- Optional GPT-4o-mini based answer extraction for complex formats
+- Coverage and majority voting metrics for aggregated tasks
+
+### Environment Variables
+
+- `PROCESSOR=gpt-4o-mini`: Use GPT-4o-mini for answer extraction
+- `PROMPTSTEP`: Add thinking steps prompt
+- `PROMPTTOKEN`: Add thinking tokens prompt
+- `PROMPTLONG`: Add long thinking prompt
+- `PROMPTSHORT`: Add short thinking prompt

 ### Checklist

-- [x] Is in Eval-harness v1.0 ?
+- [ ] Is in Eval-harness v1.0?
 - [ ] Has been checked for regression from v1.0?
 - [ ] Has been checked for equivalence with original paper methodology?
-- [ ] "Main" checked variant clearly denoted?
-
-### Variant Wishlist
-
-- [ ] Variant with Calculator (see https://github.com/openai/grade-school-math/blob/master/grade_school_math/calculator.py for example implementation)
-- [ ] Using Verifiers
-- [ ] Majority voting "without CoT"
+- [ ] "Main" checked variant clearly denoted?
