Commit d621b27

Updated Prune-KD NeMo flow (#382)
Signed-off-by: Asha Anoosheh <[email protected]>
1 parent b39c73d commit d621b27

File tree

4 files changed, +364 -216 lines changed


examples/llm_distill/README.md

Lines changed: 1 addition & 0 deletions
```diff
@@ -16,6 +16,7 @@ This section focuses on demonstrating how to apply Model Optimizer to perform kn
 | Distillation with NeMo | Learn how to distill your models with NeMo Framework | \[[Link](#knowledge-distillation-kd-for-nvidia-nemo-models)\] | \[[docs](https://nvidia.github.io/TensorRT-Model-Optimizer/guides/4_distillation.html)\] |
 | Distillation with Huggingface | Learn how to distill your models with Hugging Face | \[[Link](#knowledge-distillation-kd-for-huggingface-models)\] | \[[docs](https://nvidia.github.io/TensorRT-Model-Optimizer/guides/4_distillation.html)\] |
 | Resources | Extra links to relevant resources | \[[Link](#resources)\] | |
+| NeMo Prune + Distill Simplified Flow | Example script demonstrating end-to-end pruning plus distillation in NeMo | \[[Link](../nemo_run/prune_distill/README.md)\] | |

 </div>
```

Lines changed: 77 additions & 0 deletions
@@ -0,0 +1,77 @@

```python
# SPDX-FileCopyrightText: Copyright (c) 2023-2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

import argparse
from pathlib import Path

from huggingface_hub import snapshot_download

from modelopt.torch.utils.plugins import megatron_preprocess_data

SUBSET_IDX = [
    *[0, 1, 6, 10, 11],
    *[12, 13, 14, 21, 24],
    *[33, 35, 38, 40, 48],
    *[49, 52, 66, 70, 76],
    *[83, 88, 91, 94, 99],
]  # 25% of total dataset


def get_args():
    parser = argparse.ArgumentParser(description="Process ClimbMix dataset")
    parser.add_argument(
        "--output-dir",
        default=".",
        help="Path to the directory to store the processed dataset",
    )
    parser.add_argument(
        "--tokenizer",
        default="Qwen/Qwen3-8B",
        help="Tokenizer to use for preprocessing",
    )
    return parser.parse_args()


if __name__ == "__main__":
    args = get_args()
    Path(args.output_dir).mkdir(parents=True, exist_ok=True)

    # create raw and processed directories
    raw_dir = Path(args.output_dir) / "climbmix_raw"
    proc_dir = Path(args.output_dir) / "climbmix_proc"

    # only download the subset of the data
    subset_filenames = [f"part_{i}.jsonl" for i in SUBSET_IDX]

    # download raw data
    snapshot_download(
        repo_id="OptimalScale/ClimbMix",
        repo_type="dataset",
        local_dir=raw_dir,
        allow_patterns=subset_filenames,
    )

    # preprocess (tokenize)
    print("Tokenizing ClimbMix dataset...")
    input_paths = [raw_dir / name for name in subset_filenames]
    megatron_preprocess_data(
        input_paths,
        output_dir=proc_dir,
        tokenizer_name_or_path=args.tokenizer,
        append_eod=True,
        max_sequence_length=32000,
        workers=8,
        log_interval=10000,
    )
```
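For reference, the updated README below invokes this preprocessing script as shown here; the output path is illustrative, and the `--output-dir` and `--tokenizer` flags correspond to the arguments defined in the script above:

```bash
# Download the ClimbMix subset and tokenize it into climbmix_raw/ and climbmix_proc/
# under the chosen output directory (defaults to the Qwen/Qwen3-8B tokenizer)
python ../common/process_climbmix.py --output-dir /path/to/save --tokenizer Qwen/Qwen3-8B
```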
Lines changed: 66 additions & 59 deletions
````diff
@@ -1,95 +1,102 @@
-# Pruning and Knowledge Distillation Nemo Run example
+<div align="center">
+
+# NeMo Pruning + Knowledge Distillation Simplified Flow Example
+
+</div>

 ## Overview

-This directory contains the NeMo 2.0 Pruning + Knowledge Distillation flow implementation. The main script `nemo_prune_kd_flow.py` enables model compression through structured pruning followed by knowledge distillation to recover performance.
+This directory contains an end-to-end Pruning + Knowledge Distillation Simplified Flow example using NeMo for model compression. It supports structured pruning followed by knowledge distillation to recover performance after compression.

-## Usage
+After structured pruning, the compressed model may show some accuracy degradation; the knowledge distillation stage aims to recover that loss by transferring knowledge from the original (unpruned) teacher model to the pruned student model.

-### Prerequisites
+## Flow Stages

-#### Install NeMo 2.0 and related dependencies
+The Simplified Flow runs the following steps:

-To run the example, launch a [NeMo container](https://catalog.ngc.nvidia.com/orgs/nvidia/containers/nemo) with version 25.04.01 or higher using Docker/Slurm. Mount your cloned `modelopt` repository to the container by adding this mount flag to your Docker/Slurm command: `-v <modelopt-path>:/workspace/modelopt -v <modelopt-path>/modelopt:/usr/local/lib/python3.12/dist-packages/modelopt`.
+1. 01_import — Import HuggingFace model to NeMo format
+1. 02a_eval_teacher — Evaluate teacher model on 5% of MMLU benchmark
+1. 02b_prune — Apply structured pruning to create a compressed student model
+1. 03_distill — Knowledge distillation from teacher to pruned student model
+1. 04a_eval_student — Evaluate student model on 5% of MMLU benchmark
+1. 04b_export — Export final compressed model to HuggingFace format

-To run SFT properly you may also need to clone NeMo and Megatron-LM at the respective commits, and mount to `/opt/NeMo` and `/opt/megatron-lm`:
+```mermaid
+graph TD;
+01_import-->02a_eval_teacher;
+01_import-->02b_prune;
+02b_prune-->03_distill;
+03_distill-->04a_eval_student;
+03_distill-->04b_export;
+```

-- `git clone https://github.com/NVIDIA-NeMo/NeMo && cd NeMo && git checkout d7b87b1`
-- `git clone https://github.com/NVIDIA/Megatron-LM.git && cd Megatron-LM && git checkout 8c15450`
+## Results

-### Data Preparation
+Pruning + Knowledge Distillation of Qwen3-8B achieves significant model compression while recovering most of the accuracy through distillation. We depth-prune the model from 32 to 24 layers (reducing from 8B to 6B parameters) and distill for ~28,000 steps (determined by sequence length, default 4096) with a learning rate of 1e-4 and global batch size of 768, using a 25% subset of the [ClimbMix dataset](https://huggingface.co/datasets/OptimalScale/ClimbMix) (about 90 billion tokens), which takes a total of ~6k H100 GPU hours.

-The script supports chat datasets in ShareGPT or HuggingFace/OpenAI chat format. You can prepare your dataset in JSONL format with the required chat structure. To provide your own custom dataset, use the `--data-path` flag, otherwise the default [LIMA](https://huggingface.co/datasets/GAIR/lima) dataset will be used.
+|                                   | Tokens per Second * | MMLU |
+|-----------------------------------|---------------------|------|
+| Qwen3-8B Original                 | 4420                | 74.9 |
+| Qwen3-6B Pruned+Distilled from 8B | 6950                | 72.5 |
+| Qwen3-4B Original (comparison)    | 5210                | 70.0 |

-### Running the Flow
+The resulting compressed student maintains competitive performance while being significantly faster, with fewer parameters than the teacher. It also has both better accuracy and higher throughput than the existing Qwen3-4B model!

-#### Standard Usage
+\* _Measured on H100 using TRT-LLM, FP8 precision_

-From the `nemo_run` folder, run:
+## Usage

-```bash
-python prune_distill/nemo_prune_kd_flow.py --data_path your_dataset.jsonl
-```
+### Prerequisites
+
+You can run the example either locally or on a [Slurm cluster](ADVANCED.md).

-#### Mock Run (for testing)
+To run the example locally, launch a [NeMo container](https://catalog.ngc.nvidia.com/orgs/nvidia/containers/nemo) with version 25.09 or higher. Clone the `TensorRT-Model-Optimizer` and `NeMo` repositories (check out a specific commit for NeMo), then mount them into your docker container.

-To test the flow without actual data, run the following command from the `nemo_run` folder:
+- `git clone https://github.com/NVIDIA/TensorRT-Model-Optimizer.git`
+
+Example docker command:

 ```bash
-python prune_distill/nemo_prune_kd_flow.py --mock_run
+docker run -v /home/user/:/home/user/ -v /home/user/NeMo:/opt/NeMo -v /home/user/TensorRT-Model-Optimizer:/opt/TensorRT-Model-Optimizer --gpus all -it --shm-size 20g --rm nvcr.io/nvidia/nemo:25.09 bash
 ```

-### Flow Stages
+You will also need to set your Hugging Face token with `export HF_TOKEN=<your-token>`. You may also need to give the docker container write access to the `examples/nemo_run` folder by doing `chmod 777 nemo_run` so that logs can be written.
+
+### Dataset Preparation

-The script executes the following stages in sequence:
+Unlike the QAT flow, this workflow does not automatically download the dataset due to its large size and long tokenization time.
+You must first prepare the dataset by running:

-1. Process LIMA data (if `--data-path` is not specified)
-1. **Import Model**: Imports the HuggingFace model to NeMo format
-1. **Fine-tuning**: Fine-tunes the model on the provided dataset
-1. **Pruning**: Prunes the fine-tuned model to create a smaller student model
-1. **Knowledge Distillation**: Distills knowledge from the teacher to the pruned student model
-1. **Export**: Exports the final compressed model
+```bash
+python ../common/process_climbmix.py --output-dir /path/to/save
+```

-### Configuration Parameters
+This will download and process the ClimbMix dataset, creating the necessary data files for training.

-The script includes several configurable parameters:
+### Running the Flow via Slurm

-- **GPUS**: Number of GPUs (default: 8)
-- **SEQUENCE_LENGTH**: Maximum sequence length (default: 8192)
-- **MBS**: Micro batch size (default: 2)
-- **GBS**: Global batch size (default: 2048 for real runs, 8 for mock runs)
-- **FINETUNE_STEPS**: Number of fine-tuning steps (default: 2500 for real runs, 20 for mock runs)
-- **DISTILL_STEPS**: Number of distillation steps (default: 7500 for real runs, 20 for mock runs)
-- **VAL_INTERVAL**: Validation interval (default: 500 for real runs, 10 for mock runs)
-- **PRUNE_SAMPLES**: Number of samples for pruning calibration (default: 1024 for real runs, 3 for mock runs)
+After launching the NeMo container with the specified mounts, change the contents of the `SLURM_CONFIG` in `nemo_prune_kd_flow.py`
+to reflect your environment, and then perform the following:

-### Pruning Configuration
+Launch the example with the `nemo_prune_kd_flow.py` script. To use a different model than the default (Qwen3-8B), you can add the `--model-name <hf-model-name> --base-recipe <recipe-name>` flags, using the model's HuggingFace name and the NeMo recipe names listed [here](https://github.com/NVIDIA/NeMo/tree/main/nemo/collections/llm/recipes). Provide the processed dataset path using the `--data-dir` flag.

-- **Target Hidden Size**: Default is 3072 (configurable via `--prune_target_hidden_size`)
-- **Target FFN Hidden Size**: Automatically set to 3 × target_hidden_size
-- **Pruning Method**: Structured pruning to reduce model dimensions
+To perform Pruning + Knowledge Distillation, run:

-### Output
+```bash
+python prune_distill/nemo_prune_kd_flow.py --log-dir /my/log/dir --data-dir /path/to/climbmix_proc --use-slurm
+```

-The script generates the following outputs in the specified log directory:
+> **_NOTE:_** You can omit the `--use-slurm` flag to run locally for testing, and optionally add `--mock-run` to use a mock dataset.

-- `{model_name}_initial/`: Initial NeMo checkpoint
-- `finetune_log_dir/`: Fine-tuning logs and checkpoints (teacher model)
-- `{model_name}_pruned/`: Pruned student model
-- `distill_log_dir/`: Knowledge distillation logs and checkpoints
-- `{model_name}_final/`: Final compressed model after distillation
+## Supported models

-### Supported Models
+Locally, this script currently supports models that can be trained on 1 node with 8 x 80GB GPUs. On Slurm, you can configure the number of nodes/GPUs for training and pruning with the `--nodes` and `--train-gpus` flags.

-Currently supports models that can be trained on 1 node with 8 x 80GB GPUs. The default configuration uses:
+The default configuration works on 1 node with 8 H100 GPUs:

-- **Model**: Meta-Llama-3.1-8B
-- **Recipe**: llama31_8b
-- **Pruning Strategy**: Structured pruning with knowledge distillation recovery
+- **Model**: Qwen/Qwen3-8B
+- **Recipe**: qwen3_8b

-### Troubleshooting
+### Dataset limitations

-1. **GPU Memory Issues**: Reduce batch sizes (MBS, GBS) if encountering OOM errors
-1. **Data Format**: Ensure your dataset follows the expected chat format
-1. **NeMo Installation**: If encountering NeMo-related errors, use the recommended docker container
-1. **Model Size**: Ensure your model fits within the 8-GPU configuration
+The current pruning + knowledge distillation recipe has been tuned for the Qwen3-8B model to achieve significant speedup while maintaining performance. Pruning and distillation results are highly dependent on the specific model, dataset, and hyperparameters. There is no guarantee that a given dataset will recover the accuracy of the pruned model. Feel free to try your own model and dataset combinations and test which combination works best.
````
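As the note in the updated README indicates, the entry script can also be exercised without Slurm, and alternative models can be selected via the documented flags. A hypothetical sketch of such invocations (flag values are illustrative placeholders, not tested commands):

```bash
# Local test run without Slurm; --mock-run swaps in a mock dataset
# (whether --data-dir is still required alongside --mock-run is not stated in the README)
python prune_distill/nemo_prune_kd_flow.py --log-dir /my/log/dir --mock-run

# Hypothetical Slurm run with a non-default model/recipe, per the flags documented above
python prune_distill/nemo_prune_kd_flow.py \
  --log-dir /my/log/dir --data-dir /path/to/climbmix_proc --use-slurm \
  --model-name <hf-model-name> --base-recipe <recipe-name> --nodes 2
```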
