For reproducibility, we run all evaluations with [Qwen's math evaluation suite](https://github.com/QwenLM/Qwen2.5-Math/blob/main/evaluation/sh/eval.sh). Because AIME24 and AMC23 contain only 30 and 40 questions respectively, we sample 8 responses per question with temperature 0.6 and top-p 0.95, then compute [pass@1](https://arxiv.org/pdf/2107.03374) (the calculation script is provided [here](https://github.com/NovaSky-AI/SkyThought/tree/main/scripts/qwen_eval_bon.py)). For MATH500 and OlympiadBench, we use greedy decoding.
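As a minimal illustration of the pass@1 averaging described above (not the actual linked script): with n samples per question, pass@1 is the fraction of correct samples for each question, averaged over all questions. The `pass_at_1` helper and the example data below are hypothetical.

```python
def pass_at_1(correctness: list[list[bool]]) -> float:
    """correctness[i][j] = whether sample j for question i was judged correct."""
    per_question = [sum(samples) / len(samples) for samples in correctness]
    return sum(per_question) / len(per_question)

# Example: 3 questions, 4 samples each (in our setup it would be 8 samples).
results = [
    [True, True, False, True],    # 3/4 correct
    [False, False, False, False], # 0/4 correct
    [True, True, True, True],     # 4/4 correct
]
print(pass_at_1(results))  # (0.75 + 0.0 + 1.0) / 3 ≈ 0.583
```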