<div align="center">

# NeMo Pruning + Knowledge Distillation Simplified Flow Example

[Slurm Examples](ADVANCED.md) |
[Advanced Topics](ADVANCED.md) |
[NeMo Integration](https://github.com/NVIDIA-NeMo/NeMo/tree/main/nemo/collections/llm/modelopt)

</div>

## Overview

This directory contains an end-to-end Pruning + Knowledge Distillation Simplified Flow example using NeMo for model compression. It supports structured pruning followed by knowledge distillation to recover performance after compression.

After structured pruning, the compressed model may show some accuracy degradation; the knowledge distillation stage aims to recover that loss by transferring knowledge from the original (unpruned) teacher model to the pruned student model.

## Flow Stages

The Simplified Flow runs the following steps in order:

1. 01_import — Import HuggingFace model to NeMo format
1. 02_prune — Apply structured pruning to create a compressed student model
1. 03_distill — Knowledge distillation from teacher to pruned student model
1. 04_export — Export final compressed model to HuggingFace format
1. eval_teacher — Evaluate teacher model on 5% of MMLU benchmark
1. eval_student — Evaluate student model on 5% of MMLU benchmark

```mermaid
graph TD;
01_import-->02_prune;
01_import-->eval_teacher;
02_prune-->03_distill;
03_distill-->eval_student;
03_distill-->04_export;
```

## Results

Pruning + Knowledge Distillation of Qwen3-8B achieves significant model compression while recovering most of the accuracy through distillation. We depth-prune the model from 32 to 24 layers (reducing it from 8B to 6B parameters) and distill for ~14,000 steps with a learning rate of 1e-4 and a global batch size of 768, using a 25% subset of the [ClimbMix dataset](https://huggingface.co/datasets/OptimalScale/ClimbMix). This amounts to roughly 90 billion tokens and takes a total of ~6k H100 GPU hours.

| Model                     | Tokens per Second | MMLU |
|---------------------------|-------------------|------|
| Qwen3-8B Original         | 4420              | 74.9 |
| Qwen3-6B Pruned+Distilled | 6950              | 72.5 |

The resulting compressed model maintains competitive performance while being significantly faster and having a smaller memory footprint.

## Usage

### Prerequisites

You can run the example either locally or on a [Slurm cluster](ADVANCED.md).

To run the example locally, launch a [NeMo container](https://catalog.ngc.nvidia.com/orgs/nvidia/containers/nemo) with version 25.09 or higher. Clone the `TensorRT-Model-Optimizer` and `NeMo` repositories (checking out a specific commit for NeMo), then mount them into your Docker container.

- `git clone https://github.com/NVIDIA/TensorRT-Model-Optimizer.git`

Example docker command:

```bash
docker run -v /home/user/:/home/user/ -v /home/user/NeMo:/opt/NeMo -v /home/user/TensorRT-Model-Optimizer/modelopt/:/usr/local/lib/python3.12/dist-packages/modelopt --gpus all -it --shm-size 20g --rm nvcr.io/nvidia/nemo:25.09 bash
```

You will also need to set your HuggingFace token with `export HF_TOKEN=<your-token>`. You may also need to give the Docker container write access to the `examples/nemo_run` folder (e.g. with `chmod 777 nemo_run`) so that logs can be written.
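
For convenience, the two setup steps above can be run together inside the container; this assumes your current directory is `examples` so that `nemo_run` is directly below it (adjust the path to your layout):

```bash
# HuggingFace token used to download gated models such as Qwen3-8B
export HF_TOKEN=<your-token>

# Allow the container user to write logs into the nemo_run folder
chmod 777 nemo_run
```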

### Dataset Preparation

Unlike the QAT flow, this workflow does not automatically download the dataset due to its large size and long tokenization time.
You must first prepare the dataset by running:

```bash
python ../common/process_climbmix.py --output-dir /path/to/save
```

This will download and process the ClimbMix dataset, creating the necessary data files for training.

### Running the Flow via Slurm

After launching the NeMo container with the specified mounts, change the contents of `SLURM_CONFIG` in `nemo_prune_kd_flow.py` to reflect your environment, and then perform the following:

From the `nemo_run` folder, launch the example with the `nemo_prune_kd_flow.py` script. To use a different model than the default (Qwen3-8B), add the `--model-name <hf-model-name> --base-recipe <recipe-name>` flags, using the model's HuggingFace name and one of the NeMo recipe names listed [here](https://github.com/NVIDIA/NeMo/tree/main/nemo/collections/llm/recipes). Provide the processed dataset path using the `--data-dir` flag.

To perform Pruning + Knowledge Distillation, run:

```bash
python prune_distill/nemo_prune_kd_flow.py --log-dir /my/log/dir --data-dir /path/to/climbmix_proc --use-slurm
```
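
As a sketch of how the model-selection flags described above fit into the same command, the run below swaps in a different model; the model and recipe names are illustrative (taken from the NeMo recipe list), and results for models other than Qwen3-8B are not tuned:

```bash
# Illustrative only: run the flow with a different HuggingFace model and matching NeMo recipe
python prune_distill/nemo_prune_kd_flow.py \
    --model-name meta-llama/Llama-3.1-8B \
    --base-recipe llama31_8b \
    --log-dir /my/log/dir \
    --data-dir /path/to/climbmix_proc \
    --use-slurm
```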

## Supported models

Locally, this script currently supports models that can be trained on 1 node with 8 x 80GB GPUs. On Slurm, you can configure the number of nodes/GPUs for training and pruning with the `--nodes` and `--train-gpus` flags (see the example after the list below).

The default configuration works on 1 node with 8 H100 GPUs:

- **Model**: Qwen/Qwen3-8B
- **Recipe**: qwen3_8b
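
The sketch below shows how the Slurm scaling flags mentioned above might be combined with the run command; the node and GPU counts are illustrative assumptions, not tuned values:

```bash
# Illustrative only: scale training/pruning across 2 nodes with 8 GPUs each on Slurm
python prune_distill/nemo_prune_kd_flow.py \
    --log-dir /my/log/dir \
    --data-dir /path/to/climbmix_proc \
    --use-slurm \
    --nodes 2 \
    --train-gpus 8
```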

### Dataset limitations

The current pruning + knowledge distillation recipe has been tuned for the Qwen3-8B model to achieve a significant speedup while maintaining performance. Pruning and distillation results are highly dependent on the specific model, dataset, and hyperparameters, and there is no guarantee that a given dataset will recover the accuracy of the pruned model. Feel free to try your own model and dataset combinations to see which works best.