Commit b4848ec ("update")

1 parent 6137d26

File tree

2 files changed (+2, -2 lines)


assets/images/sky-t1-7b/7b.jpg

48.5 KB

src/content/posts/sky-t1-7b.md

Lines changed: 2 additions & 2 deletions
@@ -44,10 +44,10 @@ Finally, we use the 5K responses to perform SFT on the Qwen2.5-Math-7B using the

 ### Step 2: RL

 Next, we apply the [PRIME](https://github.com/PRIME-RL/PRIME) algorithm to it. We use [Eurus-2-RL-Data](https://huggingface.co/datasets/PRIME-RL/Eurus-2-RL-Data) for the RL training and run it for 127 steps with a batch size of 256 (~30K data). For each prompt, we generate 4 rollouts and adopt the prompt-filtering optimization proposed in PRIME, which filters out problems for which all 4 rollouts are correct or all are wrong. After this stage, we get the [Sky-T1-7B-Step2](https://huggingface.co/NovaSky-AI/Sky-T1-7B-step2) model. This stage runs on 8xH100 for around 44 hours.
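The prompt filtering described above can be sketched in a few lines (a minimal illustration, not the actual PRIME implementation; the function and variable names are hypothetical): prompts whose rollouts are all correct or all wrong provide no contrast between good and bad samples, so they are dropped from the batch.

```python
def filter_prompts(rollout_scores: dict[str, list[int]]) -> list[str]:
    """Keep prompts with a mix of correct (1) and incorrect (0) rollouts.

    rollout_scores maps each prompt to the per-rollout correctness labels.
    """
    kept = []
    for prompt, scores in rollout_scores.items():
        # All-correct (sum == len) or all-wrong (sum == 0) prompts carry
        # no learning signal under an outcome reward, so skip them.
        if 0 < sum(scores) < len(scores):
            kept.append(prompt)
    return kept

batch = {
    "p1": [1, 1, 1, 1],  # all 4 rollouts correct -> filtered out
    "p2": [0, 0, 0, 0],  # all 4 rollouts wrong   -> filtered out
    "p3": [1, 0, 1, 0],  # mixed                  -> kept
}
print(filter_prompts(batch))  # -> ['p3']
```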

-As suggested in Section 5.1 of the [DeepSeek-V3 technical report](https://arxiv.org/pdf/2412.19437v1), a model trained through SFT and RL can serve as a high-quality data generator. We therefore perform another round of distillation and rejection sampling on traces generated by Sky-T1-7B-Step2 and curated [4k SFT samples](https://huggingface.co/datasets/NovaSky-AI/Sky-T1-7B-step2-distill-4k) using the same data mixture as in Step 1. We fine-tune Qwen2.5-Math-7B with these 4k samples and obtain the Sky-T1-7B-Step2-4k-distill model, which surprisingly maintains similar or even better performance than Sky-T1-7B-Step2 across the 4 benchmarks, demonstrating extremely high data efficiency compared to the model fine-tuned with 5k QwQ traces.
+As suggested in Section 5.1 of the [DeepSeek-V3 technical report](https://arxiv.org/pdf/2412.19437v1), a model trained through SFT and RL can serve as a high-quality data generator. We therefore perform another round of distillation and rejection sampling on traces generated by Sky-T1-7B-Step2 and curated [5k SFT samples](https://huggingface.co/datasets/NovaSky-AI/Sky-T1-7B-step2-distill-5k) using the same data mixture as in Step 1. We fine-tune Qwen2.5-Math-7B with these 5k samples and obtain the Sky-T1-7B-Step2-5k-distill model, which surprisingly maintains similar or even better performance than Sky-T1-7B-Step2 across the 4 benchmarks, demonstrating extremely high data efficiency compared to the model fine-tuned with 5k QwQ traces.
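The rejection-sampling step above can be sketched as follows (a toy illustration under stated assumptions: `generate_traces` and `check_answer` are hypothetical stand-ins for the actual generation and answer-grading code, not the NovaSky pipeline): sample traces per problem from the Step 2 model and keep only those whose final answer verifies against the ground truth.

```python
def rejection_sample(problems, generate_traces, check_answer):
    """Collect (prompt, response) pairs whose answers verify as correct."""
    sft_data = []
    for prob in problems:
        for trace in generate_traces(prob["question"]):
            if check_answer(trace, prob["answer"]):
                sft_data.append({"prompt": prob["question"], "response": trace})
                break  # keep at most one verified trace per problem
    return sft_data

# Toy stand-ins for model generation and grading:
problems = [{"question": "2+2?", "answer": "4"},
            {"question": "3*3?", "answer": "9"}]
fake_traces = {"2+2?": ["... so the answer is 5", "... so the answer is 4"],
               "3*3?": ["... so the answer is 8"]}
data = rejection_sample(problems,
                        lambda q: fake_traces[q],
                        lambda trace, ans: trace.endswith(ans))
print(len(data))  # only the verified "2+2?" trace survives -> 1
```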
 ### Step 3: SFT Again

-Together with the 4K data distilled from Sky-T1-7B-Step2 in Step 2 and the 5K data distilled from QwQ in Step 1, we perform another round of SFT on the Qwen2.5-Math-7B base model. Similarly, we train the model for 3 epochs with a learning rate of 1e-5 and a batch size of 96. We then obtain the [Sky-T1-7B-step3](https://huggingface.co/NovaSky-AI/Sky-T1-7B-step3) model.
+Together with the 5K data distilled from Sky-T1-7B-Step2 in Step 2 and the 5K data distilled from QwQ in Step 1, we perform another round of SFT on the Qwen2.5-Math-7B base model. Similarly, we train the model for 3 epochs with a learning rate of 1e-5 and a batch size of 96. We then obtain the [Sky-T1-7B-step3](https://huggingface.co/NovaSky-AI/Sky-T1-7B-step3) model.
 ### Step 4: RL Again

 In this stage, to speed up the RL training, we adopt the simple [RLOO](https://arxiv.org/abs/2402.14740) algorithm without prompt filtering or a process reward model. We use the numina_amc_aime and numina_olympiads subsets of [Eurus-2-RL-Data](https://huggingface.co/datasets/PRIME-RL/Eurus-2-RL-Data). We run the training for 59 steps with a batch size of 256 (~15K data). For each prompt, we generate 8 rollouts. We get [Sky-T1-7B](https://huggingface.co/NovaSky-AI/Sky-T1-7B) as the final model.
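The core of RLOO is its leave-one-out baseline: each rollout's reward is compared against the mean reward of the other k-1 rollouts for the same prompt, so no learned value or process reward model is needed. A minimal sketch (illustrative only, not the Sky-T1-7B training code):

```python
def rloo_advantages(rewards: list[float]) -> list[float]:
    """REINFORCE Leave-One-Out advantage for k rollouts of one prompt.

    advantage_i = r_i - mean(r_j for j != i)
    """
    k = len(rewards)
    total = sum(rewards)
    return [r - (total - r) / (k - 1) for r in rewards]

# With 8 rollouts and binary outcome rewards, e.g. 3 correct out of 8,
# correct rollouts get a positive advantage and wrong ones a negative one;
# the advantages sum to zero by construction.
print(rloo_advantages([1, 1, 1, 0, 0, 0, 0, 0]))
```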
