Problem Description
While reading the paper, I noticed that the average inference cost for 32 rollouts of LLaMA2-7B on the GSM8K dataset is reported as 166. In our own tests, however, the average inference cost already reaches 600 with only 8 rollouts — a significant discrepancy from the reported results. Could you kindly advise whether there are any specific settings, configurations, or other considerations we might have overlooked?