Commit b4848ec ("update")

1 parent 6137d26

File tree

2 files changed (+2, -2 lines)


assets/images/sky-t1-7b/7b.jpg

48.5 KB

src/content/posts/sky-t1-7b.md

Lines changed: 2 additions & 2 deletions
@@ -44,10 +44,10 @@ Finally, we use the 5K responses to perform SFT on the Qwen2.5-Math-7B using the

 ### Step 2: RL

 Next, we apply the [PRIME](https://github.com/PRIME-RL/PRIME) algorithm to it. We use [Eurus-2-RL-Data](https://huggingface.co/datasets/PRIME-RL/Eurus-2-RL-Data) for the RL training and run it for 127 steps with a batch size of 256 (~30K data). For each prompt, we generate 4 rollouts and adopt the prompt-filtering optimization proposed in PRIME, which filters out problems for which all 4 rollouts are correct or all are wrong. After this stage, we get the [Sky-T1-7B-Step2](https://huggingface.co/NovaSky-AI/Sky-T1-7B-step2) model. This stage runs on 8xH100 for around 44 hours.
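The prompt filtering described above can be sketched in a few lines (a minimal illustration, not the actual PRIME implementation; the function and variable names are hypothetical): prompts whose rollouts are all correct or all wrong provide no contrast between good and bad samples, so they are dropped from the batch.

```python
def filter_prompts(rollout_scores: dict[str, list[int]]) -> list[str]:
    """Keep prompts with a mix of correct (1) and incorrect (0) rollouts.

    rollout_scores maps each prompt to the per-rollout correctness labels.
    """
    kept = []
    for prompt, scores in rollout_scores.items():
        # All-correct (sum == len) or all-wrong (sum == 0) prompts carry
        # no learning signal under an outcome reward, so skip them.
        if 0 < sum(scores) < len(scores):
            kept.append(prompt)
    return kept

batch = {
    "p1": [1, 1, 1, 1],  # all 4 rollouts correct -> filtered out
    "p2": [0, 0, 0, 0],  # all 4 rollouts wrong   -> filtered out
    "p3": [1, 0, 1, 0],  # mixed                  -> kept
}
print(filter_prompts(batch))  # -> ['p3']
```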

-As suggested in Section 5.1 of the [DeepSeek-V3 technical report](https://arxiv.org/pdf/2412.19437v1), a model trained through SFT and RL can serve as a high-quality data generator. We therefore perform another round of distillation and rejection sampling on traces generated by Sky-T1-7B-Step2 and curated [4k SFT samples](https://huggingface.co/datasets/NovaSky-AI/Sky-T1-7B-step2-distill-4k) using the same data mixture as in Step 1. We fine-tune Qwen2.5-Math-7B with these 4k samples and obtain the Sky-T1-7B-Step2-4k-distill model, which surprisingly maintains similar or even better performance than Sky-T1-7B-Step2 across the 4 benchmarks, demonstrating extremely high data efficiency compared to the model fine-tuned with 5k QwQ traces.
+As suggested in Section 5.1 of the [DeepSeek-V3 technical report](https://arxiv.org/pdf/2412.19437v1), a model trained through SFT and RL can serve as a high-quality data generator. We therefore perform another round of distillation and rejection sampling on traces generated by Sky-T1-7B-Step2 and curated [5k SFT samples](https://huggingface.co/datasets/NovaSky-AI/Sky-T1-7B-step2-distill-5k) using the same data mixture as in Step 1. We fine-tune Qwen2.5-Math-7B with these 5k samples and obtain the Sky-T1-7B-Step2-5k-distill model, which surprisingly maintains similar or even better performance than Sky-T1-7B-Step2 across the 4 benchmarks, demonstrating extremely high data efficiency compared to the model fine-tuned with 5k QwQ traces.
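The rejection-sampling step above can be sketched as follows (a toy illustration under stated assumptions: `generate_traces` and `check_answer` are hypothetical stand-ins for the actual generation and answer-grading code, not the NovaSky pipeline): sample traces per problem from the Step 2 model and keep only those whose final answer verifies against the ground truth.

```python
def rejection_sample(problems, generate_traces, check_answer):
    """Collect (prompt, response) pairs whose answers verify as correct."""
    sft_data = []
    for prob in problems:
        for trace in generate_traces(prob["question"]):
            if check_answer(trace, prob["answer"]):
                sft_data.append({"prompt": prob["question"], "response": trace})
                break  # keep at most one verified trace per problem
    return sft_data

# Toy stand-ins for model generation and grading:
problems = [{"question": "2+2?", "answer": "4"},
            {"question": "3*3?", "answer": "9"}]
fake_traces = {"2+2?": ["... so the answer is 5", "... so the answer is 4"],
               "3*3?": ["... so the answer is 8"]}
data = rejection_sample(problems,
                        lambda q: fake_traces[q],
                        lambda trace, ans: trace.endswith(ans))
print(len(data))  # only the verified "2+2?" trace survives -> 1
```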
 ### Step 3: SFT Again

-Together with the 4K data distilled from Sky-T1-7B-Step2 in Step 2 and the 5K data distilled from QwQ in Step 1, we perform another round of SFT on the Qwen2.5-Math-7B base model. Similarly, we train the model for 3 epochs with a learning rate of 1e-5 and a batch size of 96. We then obtain the [Sky-T1-7B-step3](https://huggingface.co/NovaSky-AI/Sky-T1-7B-step3) model.
+Together with the 5K data distilled from Sky-T1-7B-Step2 in Step 2 and the 5K data distilled from QwQ in Step 1, we perform another round of SFT on the Qwen2.5-Math-7B base model. Similarly, we train the model for 3 epochs with a learning rate of 1e-5 and a batch size of 96. We then obtain the [Sky-T1-7B-step3](https://huggingface.co/NovaSky-AI/Sky-T1-7B-step3) model.
 ### Step 4: RL Again

 In this stage, to speed up the RL training, we adopt the simple [RLOO](https://arxiv.org/abs/2402.14740) algorithm without prompt filtering or a process reward model. We use the numina_amc_aime and numina_olympiads subsets of [Eurus-2-RL-Data](https://huggingface.co/datasets/PRIME-RL/Eurus-2-RL-Data). We run the training for 59 steps with a batch size of 256 (~15K data). For each prompt, we generate 8 rollouts. We get [Sky-T1-7B](https://huggingface.co/NovaSky-AI/Sky-T1-7B) as the final model.
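The core of RLOO is its leave-one-out baseline: each rollout's reward is compared against the mean reward of the other k-1 rollouts for the same prompt, so no learned value or process reward model is needed. A minimal sketch (illustrative only, not the Sky-T1-7B training code):

```python
def rloo_advantages(rewards: list[float]) -> list[float]:
    """REINFORCE Leave-One-Out advantage for k rollouts of one prompt.

    advantage_i = r_i - mean(r_j for j != i)
    """
    k = len(rewards)
    total = sum(rewards)
    return [r - (total - r) / (k - 1) for r in rewards]

# With 8 rollouts and binary outcome rewards, e.g. 3 correct out of 8,
# correct rollouts get a positive advantage and wrong ones a negative one;
# the advantages sum to zero by construction.
print(rloo_advantages([1, 1, 1, 0, 0, 0, 0, 0]))
```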
