Couldn't reproduce distillation results #644

akriegman · 2025-05-15T21:38:15Z

You were able to reproduce DeepSeek's distillation results by training Qwen2.5-Math-7B-Instruct for 3 epochs on the 220k dataset. I tried to reproduce a similar result with Qwen2.5-1.5B-Instruct instead. I ran the command from the Training section of the README essentially unchanged, except that I trained for 3 epochs, but I did not get close to DeepSeek's Qwen-1.5B distillation results. See the graphs below for the performance of my checkpoints over the three epochs.

What am I doing wrong? I wanted to ask before spending time fixing the wrong thing, in case someone already knows what I need to change.

Should I be filtering out the incorrect responses from the dataset? Or does the config that I copied from the README already do that?
Does this only work with the 7B model?
Does this only work with the Math-Instruct models and not the Instruct models?
Do I just need to keep going? Do smaller models need to train longer? That curve doesn't look like it's going to come up to meet DeepSeek, but it could...
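
To make the first question concrete, this is roughly the filter I have in mind. It is only a minimal sketch on plain Python dicts, assuming each row carries a list of per-generation correctness flags. The column name `correctness_math_verify` is my assumption about the dataset schema, and `keep_verified` is a hypothetical helper, not something from the repo or the README config.

```python
def keep_verified(rows, flag_key="correctness_math_verify"):
    """Keep rows where at least one generation is flagged correct.

    `flag_key` is an assumed column name; rows missing it are dropped.
    """
    return [row for row in rows if any(row.get(flag_key) or [])]


# Toy rows standing in for dataset examples (made up for illustration).
rows = [
    {"problem": "1+1", "correctness_math_verify": [True, False]},
    {"problem": "2+2", "correctness_math_verify": [False]},
    {"problem": "3+3", "correctness_math_verify": []},
]

filtered = keep_verified(rows)
print(len(filtered), "of", len(rows), "rows kept")
```

If the data is loaded with the `datasets` library, I assume the same predicate could be passed to `Dataset.filter` instead of using a list comprehension.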

Thanks for all your hard work!

Replies: 0 comments
