@@ -149,52 +149,27 @@ You can also look at the NeMo tutorial notebooks [here](https://github.com/NVIDI
## Knowledge Distillation (KD) for HuggingFace Models

-In this e2e example we finetune Llama-2 models on the [OpenOrca](https://huggingface.co/datasets/Open-Orca/OpenOrca)
-question-answer dataset as a minimal example to demonstrate a simple way of integrating Model Optimizer's KD feature.
+In this e2e example we finetune Llama-3.2 models on the [smol-smoltalk-Interaction-SFT](https://huggingface.co/datasets/ReactiveAI/smol-smoltalk-Interaction-SFT)
+dataset as a minimal example to demonstrate a simple way of integrating Model Optimizer's KD feature.

-First we do supervised finetuning (SFT) of a Llama-2-7b on OpenOrca dataset as the teacher, then distill it into
-a 1B-parameter model.
-
-Keep in mind the training loss of the distillation run is not directly comparable to the training loss of the teacher run.
+We replace normal supervised finetuning (SFT) of a Llama-3.2-1B base model by distilling knowledge from Llama-3.2-3B-Instruct, which has already been instruction-finetuned.

> [!NOTE]
> We can fit the following in memory using [FSDP](https://huggingface.co/docs/accelerate/en/usage_guides/fsdp) enabled on 8x RTX 6000 (total ~400GB VRAM)
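
For readers skimming the diff, here is a rough sketch of what integrating Model Optimizer's KD feature looks like with the models named above. It assumes ModelOpt's documented `modelopt.torch.distill` API (`mtd.convert`, `LogitsDistillationLoss`, `compute_kd_loss`); the hyperparameters and toy batch are invented for illustration, and the example's actual training script will differ:

```python
import torch
import modelopt.torch.distill as mtd
from transformers import AutoModelForCausalLM, AutoTokenizer

TEACHER = "meta-llama/Llama-3.2-3B-Instruct"  # already instruction-finetuned
STUDENT = "meta-llama/Llama-3.2-1B"           # base model we actually train

tokenizer = AutoTokenizer.from_pretrained(STUDENT)
teacher = AutoModelForCausalLM.from_pretrained(TEACHER)
student = AutoModelForCausalLM.from_pretrained(STUDENT)

# Wrap the student so each forward pass also runs the (frozen) teacher
# and caches both sets of logits for the distillation criterion.
kd_config = {
    "teacher_model": teacher,
    "criterion": mtd.LogitsDistillationLoss(),  # soft-target loss on output logits
}
model = mtd.convert(student, mode=[("kd_loss", kd_config)])

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)  # illustrative LR

# One illustrative step on a toy batch; a real run iterates over the
# tokenized smol-smoltalk-Interaction-SFT dataset instead.
batch = tokenizer("Why is the sky blue?", return_tensors="pt")
model(**batch)                  # populates student and teacher outputs
loss = model.compute_kd_loss()  # KD loss stands in for the usual SFT loss
loss.backward()
optimizer.step()
```

Because the criterion compares output logits directly, teacher and student need a shared tokenizer and vocabulary, which the two Llama-3.2 checkpoints provide.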
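
The note refers to FSDP through Hugging Face Accelerate. As a hedged sketch only (the default plugin settings and launch mechanics are assumptions; runs like this are typically configured via an FSDP config file passed to `accelerate launch`), enabling it in code looks roughly like:

```python
import torch
from accelerate import Accelerator, FullyShardedDataParallelPlugin
from transformers import AutoModelForCausalLM

# Shard parameters, gradients, and optimizer state across the launched
# processes so the teacher + student pair fits in the combined VRAM.
accelerator = Accelerator(fsdp_plugin=FullyShardedDataParallelPlugin())

model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.2-1B")
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)  # illustrative LR
model, optimizer = accelerator.prepare(model, optimizer)
```

This would be launched with something like `accelerate launch --num_processes 8 train.py` to spread the job across the 8 GPUs mentioned in the note.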