Commit eb4e041

Squashed commit
Signed-off-by: Asha Anoosheh <[email protected]>
1 parent 26c203a commit eb4e041

File tree

4 files changed (+368, -216 lines)


examples/llm_distill/README.md

Lines changed: 1 addition & 0 deletions
@@ -16,6 +16,7 @@ This section focuses on demonstrating how to apply Model Optimizer to perform kn
 | Distillation with NeMo | Learn how to distill your models with NeMo Framework | \[[Link](#knowledge-distillation-kd-for-nvidia-nemo-models)\] | \[[docs](https://nvidia.github.io/TensorRT-Model-Optimizer/guides/4_distillation.html)\] |
 | Distillation with Huggingface | Learn how to distill your models with Hugging Face | \[[Link](#knowledge-distillation-kd-for-huggingface-models)\] | \[[docs](https://nvidia.github.io/TensorRT-Model-Optimizer/guides/4_distillation.html)\] |
 | Resources | Extra links to relevant resources | \[[Link](#resources)\] | |
+| NeMo Prune + Distill Simplified Flow | Example script demonstrating end-to-end pruning plus distillation in NeMo | \[[Link](../nemo_run/prune_kd/README.md)\] | |

 </div>

Lines changed: 85 additions & 0 deletions
@@ -0,0 +1,85 @@

```python
# SPDX-FileCopyrightText: Copyright (c) 2023-2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

import argparse
from pathlib import Path

from huggingface_hub import snapshot_download

from modelopt.torch.utils.plugins import megatron_preprocess_data

SUBSET_IDX = [
    *[0, 1, 6, 10, 11],
    *[12, 13, 14, 21, 24],
    *[33, 35, 38, 40, 48],
    *[49, 52, 66, 70, 76],
    *[83, 88, 91, 94, 99],
]  # 25% of total dataset


def get_args():
    parser = argparse.ArgumentParser(description="Process ClimbMix dataset")
    parser.add_argument(
        "--output-dir",
        default=".",
        help="Path to the directory to store the processed dataset",
    )
    parser.add_argument(
        "--tokenizer",
        default="Qwen/Qwen3-8B",
        help="Tokenizer to use for preprocessing",
    )
    parser.add_argument(
        "--subset-indices",
        help="Comma-separated subset indices to download",
    )
    return parser.parse_args()


if __name__ == "__main__":
    args = get_args()
    Path(args.output_dir).mkdir(exist_ok=True)

    # create raw and processed directories
    raw_dir = Path(args.output_dir) / "climbmix_raw"
    proc_dir = Path(args.output_dir) / "climbmix_proc"

    # only download the subset of the data
    if args.subset_indices:
        subset_idx = [int(i) for i in args.subset_indices.split(",")]
    else:
        subset_idx = SUBSET_IDX
    subset_filenames = [f"part_{i}.jsonl" for i in subset_idx]

    # download raw data
    snapshot_download(
        repo_id="OptimalScale/ClimbMix",
        repo_type="dataset",
        local_dir=raw_dir,
        allow_patterns=subset_filenames,
    )

    # preprocess (tokenize)
    print("Processing ClimbMix dataset...")
    input_paths = [raw_dir / name for name in subset_filenames]
    megatron_preprocess_data(
        input_paths,
        output_dir=proc_dir,
        tokenizer_name_or_path=args.tokenizer,
        append_eod=True,
        max_sequence_length=32000,
        workers=8,
        log_interval=10000,
    )
```
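
For reference, a minimal usage sketch of the preprocessing script above. The README changed later in this commit invokes it as `../common/process_climbmix.py`; that path and the output directories below are illustrative, while the `--output-dir`, `--tokenizer`, and `--subset-indices` flags come from the argument parser shown above.

```bash
# Tokenize the default 25% ClimbMix subset (SUBSET_IDX above) with the
# default Qwen/Qwen3-8B tokenizer; /data/climbmix is an illustrative path.
# The script creates climbmix_raw/ and climbmix_proc/ under --output-dir.
python ../common/process_climbmix.py --output-dir /data/climbmix

# Restrict the download to specific shards via --subset-indices
python ../common/process_climbmix.py \
    --output-dir /data/climbmix \
    --tokenizer Qwen/Qwen3-8B \
    --subset-indices 0,1,6
```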
Lines changed: 65 additions & 59 deletions
@@ -1,95 +1,101 @@
-# Pruning and Knowledge Distillation Nemo Run example
+<div align="center">
+
+# NeMo Pruning + Knowledge Distillation Simplified Flow Example
+
+[Slurm Examples](ADVANCED.md) |
+[Advanced Topics](ADVANCED.md) |
+[NeMo Integration](https://github.com/NVIDIA-NeMo/NeMo/tree/main/nemo/collections/llm/modelopt)
+
+</div>

 ## Overview

-This directory contains the NeMo 2.0 Pruning + Knowledge Distillation flow implementation. The main script `nemo_prune_kd_flow.py` enables model compression through structured pruning followed by knowledge distillation to recover performance.
+This directory contains an end-to-end Pruning + Knowledge Distillation Simplified Flow example using NeMo for model compression. It supports structured pruning followed by knowledge distillation to recover performance after compression.

-## Usage
+After structured pruning, the compressed model may show some accuracy degradation; the knowledge distillation stage aims to recover that loss by transferring knowledge from the full-precision teacher model to the pruned student model.

-### Prerequisites
+## Flow Stages

-#### Install NeMo 2.0 and related dependencies
+The Simplified Flow runs the following steps in order:

-To run the example, launch a [NeMo container](https://catalog.ngc.nvidia.com/orgs/nvidia/containers/nemo) with version 25.04.01 or higher using Docker/Slurm. Mount your cloned `modelopt` repository to the container by adding this mount flag to your Docker/Slurm command: `-v <modelopt-path>:/workspace/modelopt -v <modelopt-path>/modelopt:/usr/local/lib/python3.12/dist-packages/modelopt`.
+1. 01_import — Import HuggingFace model to NeMo format
+1. 02_prune — Apply structured pruning to create a compressed student model
+1. 03_distill — Knowledge distillation from teacher to pruned student model
+1. 04_export — Export final compressed model to HuggingFace format
+1. eval_teacher — Evaluate teacher model on 5% of MMLU benchmark
+1. eval_student — Evaluate student model on 5% of MMLU benchmark

-To run SFT properly you may also need to clone NeMo and Megatron-LM at the respective commits, and mount to `/opt/NeMo` and `/opt/megatron-lm`:
+```mermaid
+graph TD;
+01_import-->02_prune;
+01_import-->eval_teacher;
+02_prune-->03_distill;
+03_distill-->eval_student;
+03_distill-->04_export;
+```

-- `git clone https://github.com/NVIDIA-NeMo/NeMo && cd NeMo && git checkout d7b87b1`
-- `git clone https://github.com/NVIDIA/Megatron-LM.git && cd Megatron-LM && git checkout 8c15450`
+## Results

-### Data Preparation
+Pruning + Knowledge Distillation of Qwen3-8B achieves significant model compression while recovering most of the accuracy through distillation. We depth-prune the model from 32 to 24 layers (reducing it from 8B to 6B parameters) and distill for ~14,000 steps with a learning rate of 1e-4 and a global batch size of 768, using a 25% subset of the [ClimbMix dataset](https://huggingface.co/datasets/OptimalScale/ClimbMix) (about 90 billion tokens, roughly 6k H100 GPU hours in total).

-The script supports chat datasets in ShareGPT or HuggingFace/OpenAI chat format. You can prepare your dataset in JSONL format with the required chat structure. To provide your own custom dataset, use the `--data-path` flag, otherwise the default [LIMA](https://huggingface.co/datasets/GAIR/lima) dataset will be used.
+| | Tokens per Second | MMLU |
+|---------------------------|-------------------|------|
+| Qwen3-8B Original | 4420 | 74.9 |
+| Qwen3-6B Pruned+Distilled | 6950 | 72.5 |

-### Running the Flow
+The resulting compressed model maintains competitive performance while being significantly faster and having a smaller memory footprint.

-#### Standard Usage
+## Usage

-From the `nemo_run` folder, run:
+### Prerequisites

-```bash
-python prune_distill/nemo_prune_kd_flow.py --data_path your_dataset.jsonl
-```
+You can run the example either locally or on a [Slurm cluster](ADVANCED.md).

-#### Mock Run (for testing)
+To run the example locally, launch a [NeMo container](https://catalog.ngc.nvidia.com/orgs/nvidia/containers/nemo) with version 25.09 or higher. Clone the `TensorRT-Model-Optimizer` and `NeMo` repositories (checking out a specific commit for NeMo), then mount them onto your Docker container.

-To test the flow without actual data, run the following command from the `nemo_run` folder:
+- `git clone https://github.com/NVIDIA/TensorRT-Model-Optimizer.git`
+
+Example docker command:

 ```bash
-python prune_distill/nemo_prune_kd_flow.py --mock_run
+docker run -v /home/user/:/home/user/ -v /home/user/NeMo:/opt/NeMo -v /home/user/TensorRT-Model-Optimizer/modelopt/:/usr/local/lib/python3.12/dist-packages/modelopt --gpus all -it --shm-size 20g --rm nvcr.io/nvidia/nemo:25.09 bash
 ```

-### Flow Stages
+You will also need to set your Hugging Face token with `export HF_TOKEN=<your-token>`. You may also need to give the Docker container write access to the `examples/nemo_run` folder (e.g. `chmod 777 nemo_run`) so that logs can be written.

-The script executes the following stages in sequence:
+### Dataset Preparation

-1. Process LIMA data (if `--data-path` is not specified)
-1. **Import Model**: Imports the HuggingFace model to NeMo format
-1. **Fine-tuning**: Fine-tunes the model on the provided dataset
-1. **Pruning**: Prunes the fine-tuned model to create a smaller student model
-1. **Knowledge Distillation**: Distills knowledge from the teacher to the pruned student model
-1. **Export**: Exports the final compressed model
+Unlike the QAT flow, this workflow does not automatically download the dataset due to its large size and long tokenization time.
+You must first prepare the dataset by running:

-### Configuration Parameters
+```bash
+python ../common/process_climbmix.py --output-dir /path/to/save
+```

-The script includes several configurable parameters:
+This will download and process the ClimbMix dataset, creating the necessary data files for training.

-- **GPUS**: Number of GPUs (default: 8)
-- **SEQUENCE_LENGTH**: Maximum sequence length (default: 8192)
-- **MBS**: Micro batch size (default: 2)
-- **GBS**: Global batch size (default: 2048 for real runs, 8 for mock runs)
-- **FINETUNE_STEPS**: Number of fine-tuning steps (default: 2500 for real runs, 20 for mock runs)
-- **DISTILL_STEPS**: Number of distillation steps (default: 7500 for real runs, 20 for mock runs)
-- **VAL_INTERVAL**: Validation interval (default: 500 for real runs, 10 for mock runs)
-- **PRUNE_SAMPLES**: Number of samples for pruning calibration (default: 1024 for real runs, 3 for mock runs)
+### Running the Flow via Slurm

-### Pruning Configuration
+After launching the NeMo container with the specified mounts, change the contents of the `SLURM_CONFIG` in `nemo_prune_kd_flow.py`
+to reflect your environment, and then perform the following:

-- **Target Hidden Size**: Default is 3072 (configurable via `--prune_target_hidden_size`)
-- **Target FFN Hidden Size**: Automatically set to 3 × target_hidden_size
-- **Pruning Method**: Structured pruning to reduce model dimensions
+From the `nemo_run` folder, launch the example with the `nemo_prune_kd_flow.py` script. To use a different model than the default (Qwen3-8B), add the `--model-name <hf-model-name> --base-recipe <recipe-name>` flags, using the model's HuggingFace name and the NeMo recipe names listed [here](https://github.com/NVIDIA/NeMo/tree/main/nemo/collections/llm/recipes). Provide the processed dataset path using the `--data-dir` flag.

-### Output
+To perform Pruning + Knowledge Distillation, run:

-The script generates the following outputs in the specified log directory:
+```bash
+python prune_distill/nemo_prune_kd_flow.py --log-dir /my/log/dir --data-dir /path/to/climbmix_proc --use-slurm
+```

-- `{model_name}_initial/`: Initial NeMo checkpoint
-- `finetune_log_dir/`: Fine-tuning logs and checkpoints (teacher model)
-- `{model_name}_pruned/`: Pruned student model
-- `distill_log_dir/`: Knowledge distillation logs and checkpoints
-- `{model_name}_final/`: Final compressed model after distillation
+## Supported models

-### Supported Models
+Locally, this script currently supports models that can be trained on 1 node with 8 x 80GB GPUs. On Slurm you can configure the number of nodes/GPUs for training and pruning with the `--nodes` and `--train-gpus` flags.

-Currently supports models that can be trained on 1 node with 8 x 80GB GPUs. The default configuration uses:
+The default configuration works on 1 node with 8 H100 GPUs:

-- **Model**: Meta-Llama-3.1-8B
-- **Recipe**: llama31_8b
-- **Pruning Strategy**: Structured pruning with knowledge distillation recovery
+- **Model**: Qwen/Qwen3-8B
+- **Recipe**: qwen3_8b

-### Troubleshooting
+### Dataset limitations

-1. **GPU Memory Issues**: Reduce batch sizes (MBS, GBS) if encountering OOM errors
-1. **Data Format**: Ensure your dataset follows the expected chat format
-1. **NeMo Installation**: If encountering NeMo-related errors, use the recommended docker container
-1. **Model Size**: Ensure your model fits within the 8-GPU configuration
+The current pruning + knowledge distillation recipe has been tuned for the Qwen3-8B model to achieve significant speedup while maintaining performance. Pruning and distillation results are highly dependent on the specific model, dataset, and hyperparameters. There is no guarantee that a given dataset will recover the accuracy of the pruned model. Feel free to try your own model and dataset combinations and test which combination works best.
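
The README above gives the Slurm launch command; since it also states the flow can run locally, a local launch would presumably drop the `--use-slurm` flag. This is an assumption inferred from the flags shown in the diff, not a command taken from the repository.

```bash
# Assumed local single-node run: same flags as the Slurm example in the
# README, minus --use-slurm (log and data paths are placeholders)
python prune_distill/nemo_prune_kd_flow.py \
    --log-dir /my/log/dir \
    --data-dir /path/to/climbmix_proc
```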
