This evaluation investigates the effectiveness of prompt repetition in improving the benchmark performance of non-extended-reasoning models.
Using Inspect, three non-extended-reasoning models are compared on the GSM8K, ARC Challenge, and MATH 500 benchmarks. The models are evaluated in three conditions:
- (Baseline) Single Prompting: The model is prompted once with the question.
- Double Prompting: The model is prompted twice with the same question.
- Triple Prompting: The model is prompted three times with the same question.
The prompts are composed of a baseline system prompt, followed by the question repeated once, twice, or three times according to the condition.
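The prompt construction could be sketched as follows. This is a minimal illustration, not the eval's actual implementation: the system prompt text is a placeholder, and the choice to place all repetitions in a single user turn is an assumption.

```python
# Placeholder system prompt -- an assumption, not the prompt used in the eval.
SYSTEM_PROMPT = "You are a helpful assistant. Answer the question."


def build_messages(question: str, repetitions: int) -> list[dict[str, str]]:
    """Build a chat message list with the question repeated `repetitions` times.

    repetitions=1, 2, 3 correspond to the single, double, and triple
    prompting conditions respectively.
    """
    if repetitions < 1:
        raise ValueError("repetitions must be >= 1")
    # Assumption: repetitions are joined into one user turn, separated by blank lines.
    user_content = "\n\n".join([question] * repetitions)
    return [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": user_content},
    ]


# Example: triple prompting repeats the question three times in one user turn.
messages = build_messages("What is 2 + 2?", repetitions=3)
```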
This evaluation is based on the following paper:
Leviathan, Y., Kalman, M., & Matias, Y. (2025). Prompt Repetition Improves Non-Reasoning LLMs. arXiv:2512.14982. https://arxiv.org/abs/2512.14982
