<div align="center">

# NeMo Pruning + Knowledge Distillation Simplified Flow Example

[Slurm Examples](ADVANCED.md) |
[Advanced Topics](ADVANCED.md) |
[NeMo Integration](https://github.com/NVIDIA-NeMo/NeMo/tree/main/nemo/collections/llm/modelopt)

</div>

## Overview

This directory contains an end-to-end Pruning + Knowledge Distillation Simplified Flow example using NeMo for model compression. It supports structured pruning followed by knowledge distillation to recover performance after compression.

After structured pruning, the compressed model may show some accuracy degradation; the knowledge distillation stage aims to recover that loss by transferring knowledge from the original (unpruned) teacher model to the pruned student model.

## Flow Stages

The Simplified Flow runs the following steps in order:

1. 01_import — Import HuggingFace model to NeMo format
1. 02_prune — Apply structured pruning to create a compressed student model
1. 03_distill — Knowledge distillation from teacher to pruned student model
1. 04_export — Export final compressed model to HuggingFace format
1. eval_teacher — Evaluate teacher model on 5% of MMLU benchmark
1. eval_student — Evaluate student model on 5% of MMLU benchmark

```mermaid
graph TD;
01_import-->02_prune;
01_import-->eval_teacher;
02_prune-->03_distill;
03_distill-->eval_student;
03_distill-->04_export;
```

## Results

Pruning + Knowledge Distillation of Qwen3-8B achieves significant model compression while recovering most of the accuracy through distillation. We depth-prune the model from 32 to 24 layers (reducing it from 8B to 6B parameters) and distill for ~14,000 steps with a learning rate of 1e-4 and a global batch size of 768, using a 25% subset of the [ClimbMix dataset](https://huggingface.co/datasets/OptimalScale/ClimbMix). This amounts to roughly 90 billion tokens and takes a total of ~6k H100 GPU hours.

| Model                     | Tokens per Second | MMLU |
|---------------------------|-------------------|------|
| Qwen3-8B Original         | 4420              | 74.9 |
| Qwen3-6B Pruned+Distilled | 6950              | 72.5 |

The resulting compressed model maintains competitive performance while being significantly faster and having a smaller memory footprint.

## Usage

### Prerequisites

You can run the example either locally or on a [Slurm cluster](ADVANCED.md).

To run the example locally, launch a [NeMo container](https://catalog.ngc.nvidia.com/orgs/nvidia/containers/nemo) with version 25.09 or higher. Clone the `TensorRT-Model-Optimizer` and `NeMo` repositories (checking out a specific commit for NeMo), then mount them into your Docker container.

- `git clone https://github.com/NVIDIA/TensorRT-Model-Optimizer.git`

Example docker command:

```bash
docker run -v /home/user/:/home/user/ -v /home/user/NeMo:/opt/NeMo -v /home/user/TensorRT-Model-Optimizer/modelopt/:/usr/local/lib/python3.12/dist-packages/modelopt --gpus all -it --shm-size 20g --rm nvcr.io/nvidia/nemo:25.09 bash
```

You will also need to set your HuggingFace token with `export HF_TOKEN=<your-token>`. You may also need to give the Docker container write access to the `examples/nemo_run` folder (e.g. with `chmod 777 nemo_run`) so that logs can be written.
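
For convenience, the two setup steps above can be run together inside the container; this assumes your current directory is `examples` so that `nemo_run` is directly below it (adjust the path to your layout):

```bash
# HuggingFace token used to download gated models such as Qwen3-8B
export HF_TOKEN=<your-token>

# Allow the container user to write logs into the nemo_run folder
chmod 777 nemo_run
```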

### Dataset Preparation

Unlike the QAT flow, this workflow does not automatically download the dataset due to its large size and long tokenization time.
You must first prepare the dataset by running:

```bash
python ../common/process_climbmix.py --output-dir /path/to/save
```

This will download and process the ClimbMix dataset, creating the necessary data files for training.

### Running the Flow via Slurm

After launching the NeMo container with the specified mounts, change the contents of `SLURM_CONFIG` in `nemo_prune_kd_flow.py` to reflect your environment, and then perform the following:

From the `nemo_run` folder, launch the example with the `nemo_prune_kd_flow.py` script. To use a different model than the default (Qwen3-8B), add the `--model-name <hf-model-name> --base-recipe <recipe-name>` flags, using the model's HuggingFace name and one of the NeMo recipe names listed [here](https://github.com/NVIDIA/NeMo/tree/main/nemo/collections/llm/recipes). Provide the processed dataset path using the `--data-dir` flag.

To perform Pruning + Knowledge Distillation, run:

```bash
python prune_distill/nemo_prune_kd_flow.py --log-dir /my/log/dir --data-dir /path/to/climbmix_proc --use-slurm
```
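
As a sketch of how the model-selection flags described above fit into the same command, the run below swaps in a different model; the model and recipe names are illustrative (taken from the NeMo recipe list), and results for models other than Qwen3-8B are not tuned:

```bash
# Illustrative only: run the flow with a different HuggingFace model and matching NeMo recipe
python prune_distill/nemo_prune_kd_flow.py \
    --model-name meta-llama/Llama-3.1-8B \
    --base-recipe llama31_8b \
    --log-dir /my/log/dir \
    --data-dir /path/to/climbmix_proc \
    --use-slurm
```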

## Supported models

Locally, this script currently supports models that can be trained on 1 node with 8 x 80GB GPUs. On Slurm, you can configure the number of nodes/GPUs for training and pruning with the `--nodes` and `--train-gpus` flags (see the example after the list below).

The default configuration works on 1 node with 8 H100 GPUs:

- **Model**: Qwen/Qwen3-8B
- **Recipe**: qwen3_8b
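
The sketch below shows how the Slurm scaling flags mentioned above might be combined with the run command; the node and GPU counts are illustrative assumptions, not tuned values:

```bash
# Illustrative only: scale training/pruning across 2 nodes with 8 GPUs each on Slurm
python prune_distill/nemo_prune_kd_flow.py \
    --log-dir /my/log/dir \
    --data-dir /path/to/climbmix_proc \
    --use-slurm \
    --nodes 2 \
    --train-gpus 8
```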

### Dataset limitations

The current pruning + knowledge distillation recipe has been tuned for the Qwen3-8B model to achieve a significant speedup while maintaining performance. Pruning and distillation results are highly dependent on the specific model, dataset, and hyperparameters, and there is no guarantee that a given dataset will recover the accuracy of the pruned model. Feel free to try your own model and dataset combinations to see which works best.