
Commit c76da3f ("typo"), 1 parent: ae0fe3c

File tree

1 file changed (+1 −1)


src/content/posts/sky-t1-7b.md

Lines changed: 1 addition & 1 deletion
@@ -56,7 +56,7 @@ In this stage, to speed up the RL training, we adopt the simple [RLOO](https://a
For reproducibility, we perform all evaluations using [Qwen’s math evaluation suite](https://github.com/QwenLM/Qwen2.5-Math/blob/main/evaluation/sh/eval.sh). For AIME24 and AMC23, which have only 30 and 40 questions respectively, we evaluate performance by sampling 8 completions per question with a temperature of 0.6 and a top-p of 0.95, and then computing [pass@1](https://arxiv.org/pdf/2107.03374) (the calculation script is also provided [here](https://github.com/NovaSky-AI/SkyThought/tree/main/scripts/qwen_eval_bon.py)). For MATH500 and OlympiadBench, we use greedy decoding.
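With 8 samples per question, pass@1 reduces to the average fraction of correct samples, but the same unbiased estimator from the linked paper works for any k ≤ n. A minimal sketch (the function name is our own; this is not the linked evaluation script):

```python
import math

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: 1 - C(n-c, k) / C(n, k),
    where n = samples drawn, c = correct samples.
    Computed as a numerically stable product."""
    if n - c < k:
        # Fewer incorrect samples than k: every size-k subset
        # contains at least one correct sample.
        return 1.0
    return 1.0 - math.prod(1.0 - k / i for i in range(n - c + 1, n + 1))
```

For k = 1 this simplifies to c / n, i.e. the mean per-sample accuracy, which is what averaging over the 8 samples per question computes.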

### Results
- We report the benchmark results for models after each stage as well as the intermediate distilled model in Table 1. We also plot the models’ pass@k curves to better understand how each SFT and RL stage impacts the model’s internal capability. For comparison, we conduct another ablation experiment which runs the RLOO directly on the Qwen2.5-math-7B base model using the [STILL3](https://huggingface.co/datasets/RUC-AIBOX/STILL-3-Preview-RL-Data) dataset, with 4 rollouts for each prompt. We train for 104 steps and get the final model as Sky-T1-7B-Zero.
+ We report the benchmark results for models after each stage as well as the intermediate distilled model in Table 1. We also plot the models’ pass@k curves to better understand how each SFT and RL stage impacts the model’s internal capability. For comparison, we conduct another ablation experiment which runs the RLOO directly on the Qwen2.5-Math-7B base model using the [STILL3](https://huggingface.co/datasets/RUC-AIBOX/STILL-3-Preview-RL-Data) dataset, with 4 rollouts for each prompt. We train for 104 steps and get the final model as Sky-T1-7B-Zero.

As shown in Figure 2, Long CoT SFT significantly improves the model’s overall pass@k performance on both AIME24 and AMC23. On AMC23, the two-stage RL primarily boosts pass@1 accuracy while reducing solution diversity for k = 4 to 32. On AIME24, the step4 RL further improves overall pass@k over the step1 SFT and step2 RL, though its impact is less pronounced than that of Sky-T1-7B-Zero.
