This evaluation investigates the effectiveness of prompt repetition in improving the benchmark performance of non-extended-reasoning models.
Using Inspect, three non-extended-reasoning models are compared on the GSM8K, ARC Challenge, and MATH 500 benchmarks. The models are evaluated in three conditions:
- (Baseline) Single Prompting: The model is prompted once with the question.
- Double Prompting: The model is prompted twice with the same question.
- Triple Prompting: The model is prompted three times with the same question.
The prompts are composed of a baseline system prompt, followed by the question repeated once, twice, or three times according to the condition.
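The prompt construction could be sketched as follows. This is a minimal illustration, not the eval's actual implementation: the system prompt text is a placeholder, and the choice to place all repetitions in a single user turn is an assumption.

```python
# Placeholder system prompt -- an assumption, not the prompt used in the eval.
SYSTEM_PROMPT = "You are a helpful assistant. Answer the question."


def build_messages(question: str, repetitions: int) -> list[dict[str, str]]:
    """Build a chat message list with the question repeated `repetitions` times.

    repetitions=1, 2, 3 correspond to the single, double, and triple
    prompting conditions respectively.
    """
    if repetitions < 1:
        raise ValueError("repetitions must be >= 1")
    # Assumption: repetitions are joined into one user turn, separated by blank lines.
    user_content = "\n\n".join([question] * repetitions)
    return [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": user_content},
    ]


# Example: triple prompting repeats the question three times in one user turn.
messages = build_messages("What is 2 + 2?", repetitions=3)
```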
This evaluation is based on the following paper:
Leviathan, Y., Kalman, M., & Matias, Y. (2025). Prompt Repetition Improves Non-Reasoning LLMs. arXiv:2512.14982. https://arxiv.org/abs/2512.14982
