1 change: 1 addition & 0 deletions examples/llm_distill/README.md
@@ -16,6 +16,7 @@ This section focuses on demonstrating how to apply Model Optimizer to perform kn
| Distillation with NeMo | Learn how to distill your models with NeMo Framework | \[[Link](#knowledge-distillation-kd-for-nvidia-nemo-models)\] | \[[docs](https://nvidia.github.io/TensorRT-Model-Optimizer/guides/4_distillation.html)\] |
| Distillation with Huggingface | Learn how to distill your models with Hugging Face | \[[Link](#knowledge-distillation-kd-for-huggingface-models)\] | \[[docs](https://nvidia.github.io/TensorRT-Model-Optimizer/guides/4_distillation.html)\] |
| Resources | Extra links to relevant resources | \[[Link](#resources)\] | |
| NeMo Prune + Distill Simplified Flow | Example script demonstrating end-to-end pruning plus distillation in NeMo | \[[Link](../nemo_run/prune_distill/README.md)\] | |

</div>

85 changes: 85 additions & 0 deletions examples/nemo_run/common/process_climbmix.py
@@ -0,0 +1,85 @@
# SPDX-FileCopyrightText: Copyright (c) 2023-2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

import argparse
from pathlib import Path

from huggingface_hub import snapshot_download

from modelopt.torch.utils.plugins import megatron_preprocess_data

SUBSET_IDX = [
    *[0, 1, 6, 10, 11],
    *[12, 13, 14, 21, 24],
    *[33, 35, 38, 40, 48],
    *[49, 52, 66, 70, 76],
    *[83, 88, 91, 94, 99],
Comment on lines +24 to +28
Collaborator

Do you know why we selected these numbers in the first place? Were they randomly generated, or did they have some significance?

Contributor Author

I started to download/process them all before realizing I didn't need them all, so it had already downloaded 0, 1, 10, 11, 12, 13, 14; the rest are random.

Contributor

Is it possible to randomly sample 25 numbers out of 0-99 instead of hardcoding this? If necessary to hardcode, can this just be a normal list instead of having strange syntax (multiple sub-lists with asterisks to unpack them)?
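A minimal sketch of that suggestion (seeded so the selection stays reproducible; purely illustrative, not the script's current behavior):

```python
import random

# Randomly pick 25 of the 100 ClimbMix shards instead of hardcoding them;
# a fixed seed keeps the selection reproducible across runs.
rng = random.Random(0)
subset_idx = sorted(rng.sample(range(100), k=25))
```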

Contributor Author

We have better reproducibility and control if they are hardcoded. For example, I can now import this list of indices and create the data_paths argument to the PretrainDataModule in the Nemo-Run script.

The sub-list formatting is apparently Ruff's ideal way to make it fit within the line limit.
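Roughly what that reuse could look like (the module path and output naming here are assumptions for illustration, not the actual flow script):

```python
# Hypothetical sketch: build data paths for the distillation flow from the
# hardcoded shard indices; exact filenames depend on the preprocessing output.
from process_climbmix import SUBSET_IDX

proc_dir = "/data/climbmix_proc"
data_paths = [f"{proc_dir}/part_{i}" for i in SUBSET_IDX]
```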

Collaborator

You can add `# fmt: off` and `# fmt: on` lines before and after the code block to avoid auto-formatting.
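For example (same indices, just a plain list wrapped in formatter directives):

```python
# fmt: off
SUBSET_IDX = [
    0, 1, 6, 10, 11, 12, 13, 14, 21, 24,
    33, 35, 38, 40, 48, 49, 52, 66, 70, 76,
    83, 88, 91, 94, 99,
]  # 25% of total dataset
# fmt: on
```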

] # 25% of total dataset


def get_args():
    parser = argparse.ArgumentParser(description="Process ClimbMix dataset")
    parser.add_argument(
        "--output-dir",
        default=".",
        help="Path to the directory to store the processed dataset",
    )
    parser.add_argument(
        "--tokenizer",
        default="Qwen/Qwen3-8B",
        help="Tokenizer to use for preprocessing",
    )
    parser.add_argument(
        "--subset-indices",
        help="Comma-separated subset indices to download",
    )
    return parser.parse_args()


if __name__ == "__main__":
    args = get_args()
    Path(args.output_dir).mkdir(exist_ok=True)

    # create raw and processed directories
    raw_dir = Path(args.output_dir) / "climbmix_raw"
    proc_dir = Path(args.output_dir) / "climbmix_proc"

    # only download the subset of the data
    if args.subset_indices:
        subset_idx = [int(i) for i in args.subset_indices.split(",")]
    else:
        subset_idx = SUBSET_IDX
    subset_filenames = [f"part_{i}.jsonl" for i in subset_idx]

    # download raw data
    snapshot_download(
        repo_id="OptimalScale/ClimbMix",
        repo_type="dataset",
        local_dir=raw_dir,
        allow_patterns=subset_filenames,
    )

    # preprocess (tokenize)
    print("Processing ClimbMix dataset...")
    input_paths = [raw_dir / name for name in subset_filenames]
    megatron_preprocess_data(
        input_paths,
        output_dir=proc_dir,
        tokenizer_name_or_path=args.tokenizer,
        append_eod=True,
        max_sequence_length=32000,
        workers=8,
        log_interval=10000,
    )
124 changes: 65 additions & 59 deletions examples/nemo_run/prune_distill/README.md
@@ -1,95 +1,101 @@
# Pruning and Knowledge Distillation Nemo Run example
<div align="center">

# NeMo Pruning + Knowledge Distillation Simplified Flow Example

[Slurm Examples](ADVANCED.md) |
Contributor

You don't have an ADVANCED.md file. Can you move the Slurm info into it?

Contributor Author

I was thinking to just leave the main README as the Slurm-by-default tutorial, since KD on such a dataset would be useless on a local node anyway.

[Advanced Topics](ADVANCED.md) |
[NeMo Integration](https://github.com/NVIDIA-NeMo/NeMo/tree/main/nemo/collections/llm/modelopt)

</div>

## Overview

This directory contains the NeMo 2.0 Pruning + Knowledge Distillation flow implementation. The main script `nemo_prune_kd_flow.py` enables model compression through structured pruning followed by knowledge distillation to recover performance.
This directory contains an end-to-end Pruning + Knowledge Distillation Simplified Flow example using NeMo for model compression. It supports structured pruning followed by knowledge distillation to recover performance after compression.

## Usage
After structured pruning, the compressed model may show some accuracy degradation; the knowledge distillation stage aims to recover that loss by transferring knowledge from the full-precision teacher model to the pruned student model.

### Prerequisites
## Flow Stages

#### Install NeMo 2.0 and related dependencies
The Simplified Flow runs the following steps in order:

To run the example, launch a [NeMo container](https://catalog.ngc.nvidia.com/orgs/nvidia/containers/nemo) with version 25.04.01 or higher using Docker/Slurm. Mount your cloned `modelopt` repository to the container by adding this mount flag to your Docker/Slurm command: `-v <modelopt-path>:/workspace/modelopt -v <modelopt-path>/modelopt:/usr/local/lib/python3.12/dist-packages/modelopt`.
1. 01_import — Import HuggingFace model to NeMo format
1. 02_prune — Apply structured pruning to create a compressed student model
1. 03_distill — Knowledge distillation from teacher to pruned student model
1. 04_export — Export final compressed model to HuggingFace format
1. eval_teacher — Evaluate teacher model on 5% of MMLU benchmark
1. eval_student — Evaluate student model on 5% of MMLU benchmark
Collaborator

Why are these not numbered? Like 05a, 05b?

Contributor Author

They're not actually sequential, but I fixed it by making them 2b and 4b


To run SFT properly you may also need to clone NeMo and Megatron-LM at the respective commits, and mount to `/opt/NeMo` and `/opt/megatron-lm`:
```mermaid
graph TD;
01_import-->02_prune;
01_import-->eval_teacher;
02_prune-->03_distill;
03_distill-->eval_student;
03_distill-->04_export;
```

- `git clone https://github.com/NVIDIA-NeMo/NeMo && cd NeMo && git checkout d7b87b1`
- `git clone https://github.com/NVIDIA/Megatron-LM.git && cd Megatron-LM && git checkout 8c15450`
## Results

### Data Preparation
Pruning + Knowledge Distillation of Qwen3-8B achieves significant model compression while recovering most of the accuracy through distillation. We depth-prune the model from 32 to 24 layers (reducing it from 8B to 6B parameters) and distill for ~14,000 steps with a learning rate of 1e-4 and a global batch size of 768, using a 25% subset of the [ClimbMix dataset](https://huggingface.co/datasets/OptimalScale/ClimbMix). This is about 90 billion tokens and takes a total of ~6k H100 GPU hours.
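As a rough sanity check on the token count (the 8192 sequence length below is an assumption, not stated in this README):

```python
# ~14k distillation steps at global batch size 768, sequence length 8192 (assumed)
steps, global_batch_size, seq_len = 14_000, 768, 8192
print(steps * global_batch_size * seq_len / 1e9)  # ≈ 88 (billion tokens)
```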

The script supports chat datasets in ShareGPT or HuggingFace/OpenAI chat format. You can prepare your dataset in JSONL format with the required chat structure. To provide your own custom dataset, use the `--data-path` flag, otherwise the default [LIMA](https://huggingface.co/datasets/GAIR/lima) dataset will be used.
| | Tokens per Second | MMLU |
|---------------------------|-------------------|------|
| Qwen3-8B Original | 4420 | 74.9 |
| Qwen3-6B Pruned+Distilled | 6950 | 72.5 |

### Running the Flow
The resulting compressed model maintains competitive performance while being significantly faster with a smaller memory footprint.

#### Standard Usage
## Usage

From the `nemo_run` folder, run:
### Prerequisites

```bash
python prune_distill/nemo_prune_kd_flow.py --data_path your_dataset.jsonl
```
You can run the example either locally or on a [Slurm cluster](ADVANCED.md).

#### Mock Run (for testing)
To run the example locally, launch a [NeMo container](https://catalog.ngc.nvidia.com/orgs/nvidia/containers/nemo) with version 25.09 or higher. Clone the `TensorRT-Model-Optimizer` and `NeMo` repositories (check out a specific commit for NeMo), then mount them into your Docker container.

To test the flow without actual data, run the following command from the `nemo_run` folder:
- `git clone https://github.com/NVIDIA/TensorRT-Model-Optimizer.git`

Example docker command:

```bash
python prune_distill/nemo_prune_kd_flow.py --mock_run
docker run -v /home/user/:/home/user/ -v /home/user/NeMo:/opt/NeMo -v /home/user/TensorRT-Model-Optimizer/modelopt/:/usr/local/lib/python3.12/dist-packages/modelopt --gpus all -it --shm-size 20g --rm nvcr.io/nvidia/nemo:25.09 bash
```

### Flow Stages
You will also need to set your Hugging Face token with `export HF_TOKEN=<your-token>`. You may also need to give the Docker container write access to the `examples/nemo_run` folder (e.g. `chmod 777 nemo_run`) so that logs can be written.

The script executes the following stages in sequence:
### Dataset Preparation

1. Process LIMA data (if `--data-path` is not specified)
1. **Import Model**: Imports the HuggingFace model to NeMo format
1. **Fine-tuning**: Fine-tunes the model on the provided dataset
1. **Pruning**: Prunes the fine-tuned model to create a smaller student model
1. **Knowledge Distillation**: Distills knowledge from the teacher to the pruned student model
1. **Export**: Exports the final compressed model
Unlike the QAT flow, this workflow does not automatically download the dataset due to its large size and long tokenization time.
You must first prepare the dataset by running:

### Configuration Parameters
```bash
python ../common/process_climbmix.py --output-dir /path/to/save
```

The script includes several configurable parameters:
This will download and process the ClimbMix dataset, creating the necessary data files for training.

- **GPUS**: Number of GPUs (default: 8)
- **SEQUENCE_LENGTH**: Maximum sequence length (default: 8192)
- **MBS**: Micro batch size (default: 2)
- **GBS**: Global batch size (default: 2048 for real runs, 8 for mock runs)
- **FINETUNE_STEPS**: Number of fine-tuning steps (default: 2500 for real runs, 20 for mock runs)
- **DISTILL_STEPS**: Number of distillation steps (default: 7500 for real runs, 20 for mock runs)
- **VAL_INTERVAL**: Validation interval (default: 500 for real runs, 10 for mock runs)
- **PRUNE_SAMPLES**: Number of samples for pruning calibration (default: 1024 for real runs, 3 for mock runs)
### Running the Flow via Slurm
Contributor

Similar to the QAT flow README, can the Slurm info be moved into ADVANCED.md?


### Pruning Configuration
After launching the NeMo container with the specified mounts, change the contents of the `SLURM_CONFIG` in `nemo_prune_kd_flow.py`
to reflect your environment, and then perform the following:

- **Target Hidden Size**: Default is 3072 (configurable via `--prune_target_hidden_size`)
- **Target FFN Hidden Size**: Automatically set to 3 × target_hidden_size
- **Pruning Method**: Structured pruning to reduce model dimensions
From the `nemo_run` folder, launch the example with the `nemo_prune_kd_flow.py` script. To use a model other than the default (Qwen3-8B), add the `--model-name <hf-model-name> --base-recipe <recipe-name>` flags, using the model's Hugging Face name and one of the NeMo recipe names listed [here](https://github.com/NVIDIA/NeMo/tree/main/nemo/collections/llm/recipes). Provide the processed dataset path with the `--data-dir` flag.

### Output
To perform Pruning + Knowledge Distillation, run:

The script generates the following outputs in the specified log directory:
```bash
python prune_distill/nemo_prune_kd_flow.py --log-dir /my/log/dir --data-dir /path/to/climbmix_proc --use-slurm
```
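For reference, the `SLURM_CONFIG` mentioned above typically needs values along these lines (a hypothetical sketch; the real keys are defined in `nemo_prune_kd_flow.py` and may differ):

```python
# Illustrative only: adjust account, partition, time, and mounts for your cluster.
SLURM_CONFIG = {
    "account": "my-account",
    "partition": "batch",
    "time": "04:00:00",
    "container_image": "nvcr.io/nvidia/nemo:25.09",
    "container_mounts": ["/home/user:/home/user"],
}
```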

- `{model_name}_initial/`: Initial NeMo checkpoint
- `finetune_log_dir/`: Fine-tuning logs and checkpoints (teacher model)
- `{model_name}_pruned/`: Pruned student model
- `distill_log_dir/`: Knowledge distillation logs and checkpoints
- `{model_name}_final/`: Final compressed model after distillation
## Supported models

### Supported Models
Locally, this script currently supports models that can be trained on 1 node with 8 x 80GB GPUs. On Slurm, you can configure the number of nodes/GPUs for training and pruning with the `--nodes` and `--train-gpus` flags.

Currently supports models that can be trained on 1 node with 8 x 80GB GPUs. The default configuration uses:
The default configuration works on 1 node with 8 H100 GPUs:

- **Model**: Meta-Llama-3.1-8B
- **Recipe**: llama31_8b
- **Pruning Strategy**: Structured pruning with knowledge distillation recovery
- **Model**: Qwen/Qwen3-8B
- **Recipe**: qwen3_8b

### Troubleshooting
### Dataset limitations

1. **GPU Memory Issues**: Reduce batch sizes (MBS, GBS) if encountering OOM errors
1. **Data Format**: Ensure your dataset follows the expected chat format
1. **NeMo Installation**: If encountering NeMo-related errors, use the recommended docker container
1. **Model Size**: Ensure your model fits within the 8-GPU configuration
The current pruning + knowledge distillation recipe has been tuned for the Qwen3-8B model to achieve significant speedup while maintaining performance. Pruning and distillation results are highly dependent on the specific model, dataset, and hyperparameters. There is no guarantee that a given dataset will recover the pruned model's accuracy to the original level. Feel free to try your own model and dataset combinations to see which works best.