`examples/nemo_run/prune_distill/README.md`

After structured pruning, the compressed model may show some accuracy degradation, which the distillation stage then recovers.

## Flow Stages

The Simplified Flow runs the following steps:

1. 01_import — Import HuggingFace model to NeMo format
1. 02a_eval_teacher — Evaluate teacher model on 5% of MMLU benchmark
1. 02b_prune — Apply structured pruning to create a compressed student model
1. 03_distill — Knowledge distillation from teacher to pruned student model
1. 04a_eval_student — Evaluate student model on 5% of MMLU benchmark
1. 04b_export — Export final compressed model to HuggingFace format

```mermaid
graph TD;
    01_import-->02a_eval_teacher;
    01_import-->02b_prune;
    02b_prune-->03_distill;
    03_distill-->04a_eval_student;
    03_distill-->04b_export;
```
## Results
Pruning + Knowledge Distillation of Qwen3-8B achieves significant model compression while recovering most of the accuracy through distillation. We depth-prune the model from 32 to 24 layers (reducing it from 8B to 6B parameters) and distill for ~28,000 steps (determined by sequence length, default 4096) with a learning rate of 1e-4 and a global batch size of 768, using a 25% subset of the [ClimbMix dataset](https://huggingface.co/datasets/OptimalScale/ClimbMix). This amounts to roughly 90 billion tokens and a total of ~6k H100 GPU hours.

| Model | Tokens/sec\* | MMLU |
| --- | --- | --- |
| Qwen3-6B Pruned+Distilled from 8B | 6950 | 72.5 |
| Qwen3-4B Original (comparison) | 5210 | 70.0 |

The resulting compressed student maintains competitive accuracy while being significantly faster, with fewer parameters than the teacher. It also outperforms the existing Qwen3-4B model in both accuracy and throughput!

\*_Measured on H100 using TRT-LLM, FP8 precision_
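
As a quick sanity check of the token budget (back-of-the-envelope arithmetic, not a figure from the source): ~28,000 steps × 768 sequences per batch × 4096 tokens per sequence ≈ 88 billion tokens, consistent with the ~90 billion quoted above.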
## Usage

This will download and process the ClimbMix dataset, creating the necessary data files.

After launching the NeMo container with the specified mounts, change the contents of the `SLURM_CONFIG` in `nemo_prune_kd_flow.py` to reflect your environment, and then perform the following:

Launch the example with the `nemo_prune_kd_flow.py` script. To use a model other than the default (Qwen3-8B), add the `--model-name <hf-model-name> --base-recipe <recipe-name>` flags, using the model's HuggingFace name and the NeMo recipe names listed [here](https://github.com/NVIDIA/NeMo/tree/main/nemo/collections/llm/recipes). Provide the processed dataset path using the `--data-dir` flag.

> **_NOTE:_** You can omit the `--use-slurm` flag to run locally for testing, and optionally add `--mock-run` to use a mock dataset.
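
For illustration, invocations might look like the following sketch (the `python` entry point and the dataset path are assumptions; the flags are those documented above):

```bash
# Hypothetical Slurm launch; the dataset path is a placeholder.
python nemo_prune_kd_flow.py \
    --data-dir /path/to/climbmix_processed \
    --use-slurm

# Quick local smoke test with a mock dataset (omits --use-slurm):
python nemo_prune_kd_flow.py --mock-run
```
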
## Supported models
Locally, this script currently supports models that can be trained on a single node with 8 x 80GB GPUs. On Slurm, you can configure the number of nodes/GPUs for training and pruning with the `--nodes` and `--train-gpus` flags.
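
For example, scaling out on Slurm might look like this sketch (node and GPU counts are illustrative assumptions, not recommendations from this README):

```bash
# Hypothetical multi-node Slurm run; adjust --nodes/--train-gpus to your cluster.
python nemo_prune_kd_flow.py \
    --data-dir /path/to/climbmix_processed \
    --use-slurm \
    --nodes 4 \
    --train-gpus 8
```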