CHANGELOG.rst: 6 additions & 2 deletions
@@ -1,7 +1,7 @@
 Model Optimizer Changelog (Linux)
 =================================

-0.39 (2025-10-xx)
+0.39 (2025-11-xx)
 ^^^^^^^^^^^^^^^^^

 **Deprecations**
@@ -12,7 +12,11 @@ Model Optimizer Changelog (Linux)
 - Add LoRA mode support for MCore in a new peft submodule: ``modelopt.torch.peft.update_model(model, LORA_CFG)``.
 - Support PTQ and fakequant in vLLM for fast evaluation of arbitrary quantization formats. See ``examples/vllm_serve`` for more details.

-0.37 (2025-09-xx)
+**Documentation**
+
+- Add general guidelines for Minitron pruning and distillation. See `examples/pruning/README.md <https://github.com/NVIDIA/TensorRT-Model-Optimizer/tree/main/examples/pruning#pruning-guidelines>`_ for more details.
examples/pruning/README.md: 83 additions & 1 deletion
@@ -17,6 +17,8 @@ This section focuses on applying Model Optimizer's state-of-the-art complementar
| Pre-Requisites | Required & optional packages to use this technique |\[[Link](#pre-requisites)\]||
| Getting Started | Learn how to use the pruning API |\[[Link](#getting-started)\]|\[[docs](https://nvidia.github.io/TensorRT-Model-Optimizer/guides/3_pruning.html)\]|
| Support Matrix | View the support matrix to see available pruning algorithms and their compatibility with different models and frameworks |\[[Link](#support-matrix)\]||
| Pruning Guidelines | Guidelines for choosing how and how much to prune for best results |\[[Link](#pruning-guidelines)\]||
| Examples | Examples of different pruning methods |\[[Link](#examples)\]||
| Resources | Extra links to relevant resources |\[[Link](#resources)\]||

</div>
@@ -93,6 +95,84 @@ If your model parameters are already sorted, you can skip the sorting step by se
> *<sup>1.</sup> Only Pipeline Parallel models are supported. Hugging Face models can be converted to NeMo format and used subsequently.*

## Pruning Guidelines

### Minitron

This section provides recommendations for choosing pruning strategies and distillation hyperparameters for Minitron pruning to help achieve the best latency-accuracy trade-offs.

#### Depth Pruning

Depth pruning reduces the number of layers (`num_layers`) in the model.

**Advantages:**

- Simpler to configure - only 1 parameter to tune
- Faster inference than width-pruned models at a fixed number of parameters

**Recommendations:**

- Up to **1/3rd parameter reduction** can generally result in a model above the Pareto frontier with a good latency-accuracy trade-off (when using a good quality dataset for distillation with ~80-100B tokens)
- For pruning **>50%**, use iterative pruning: compress by 30%, perform distillation, then compress again (see the sketch after this list)
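
As a rough illustration of the ~1/3rd guideline, the sketch below depth-prunes a hypothetical 48-layer MCore model to 32 layers. The `mcore_minitron` mode name, the `export_config` constraint, and the `forward_loop` calibration hook are assumptions based on the pruning API referenced in the Getting Started section above, not verified signatures; check the linked docs before use.

```python
# Illustrative sketch only - mode and argument names are assumptions, not a verified API.
import modelopt.torch.prune as mtp

# Assumed: `model` is a 48-layer MCore GPT model, and `forward_loop(model)` runs a few
# hundred calibration batches through it for importance estimation.
export_config = {"num_layers": 32}  # 48 -> 32 layers, ~33% depth reduction

model, _ = mtp.prune(
    model,
    mode="mcore_minitron",
    constraints={"export_config": export_config},
    dummy_input=None,  # assumed unused when a forward_loop is supplied
    config={"forward_loop": forward_loop},
)
```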
#### Width Pruning

Width pruning reduces model dimensions per layer such as `hidden_size`, `ffn_hidden_size`, `num_attention_heads`, `num_query_groups`, `mamba_num_heads`, and `mamba_head_dim`.

**Advantages:**

- Better accuracy than depth-pruned models at a fixed number of parameters

**Recommendations:**

- Start with pruning `hidden_size` and `ffn_hidden_size` as the simplest configuration
- Up to **1/3rd parameter reduction** can generally result in a model above the Pareto frontier with a good latency-accuracy trade-off (when using a good quality dataset for distillation with ~80-100B tokens)
- **Axis sensitivity:** MLP dimensions (`ffn_hidden_size`) can typically be pruned more aggressively than embedding dimensions (`hidden_size`) and attention/Mamba dimensions (`num_attention_heads`, `num_query_groups`, `mamba_num_heads`, `mamba_head_dim`)
- For pruning **>50%**, use iterative pruning: compress by 30%, perform distillation, then compress again (see the sketch after this list)
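
To make the axis-sensitivity note above concrete, here is a hypothetical width-pruning configuration that cuts the MLP dimension harder than the embedding and attention dimensions; the specific values are illustrative assumptions, not recommendations from this guide.

```python
# Illustrative sketch only - the dimensions below are made-up example values.
export_config = {
    "hidden_size": 3072,        # e.g. 4096 -> 3072: milder cut on the embedding dim
    "ffn_hidden_size": 9216,    # e.g. 14336 -> 9216: more aggressive MLP cut
    "num_attention_heads": 32,  # attention axes kept at or near their original sizes
    "num_query_groups": 8,
}

# Passed to the same (assumed) entry point as in the depth-pruning sketch:
# mtp.prune(model, mode="mcore_minitron",
#           constraints={"export_config": export_config},
#           dummy_input=None, config={"forward_loop": forward_loop})
```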
- **Pruning ratio:** Anything **>50% pruning is hard to recover**. For such aggressive pruning, iterative pruning (compress → distill → compress again) is recommended.
- **Latency-accuracy trade-off:** The more pruning you do, the faster your model will be at the cost of lower accuracy. Choose based on your requirements.
- **Dataset quality:** Use a high-quality dataset for distillation. If you don't have a specific dataset, [Nemotron-Pretraining-SFT-v1](https://huggingface.co/datasets/nvidia/Nemotron-Pretraining-SFT-v1) is recommended.
- **Post-training:** Further post-training (e.g., instruction tuning, preference alignment) is needed after pruning and distillation on pre-training datasets to improve reasoning capabilities. A good dataset for post-training is [Nemotron-Post-Training-Dataset-v2](https://huggingface.co/datasets/nvidia/Nemotron-Post-Training-Dataset-v2).

#### Distillation Hyperparameters

After pruning, distillation is required to recover model accuracy. Below are recommended starting hyperparameters for distillation:

| **Hyperparameter** | **Recommendation** |
| :--- | :--- |
| **Sequence Length** | 8192 (or 4096 if dataset has smaller sequences) |
| **Global Batch Size (GBS)** | 768 |
| **Micro Batch Size (MBS)** | As large as your GPU memory can accommodate |
| **Learning Rate (LR)** | 1e-4 → 1e-5 (linear decay) for 30-50% pruning<br>• More compression → higher LR<br>• Less compression → lower LR<br>• As model gets larger → reduce LR to avoid divergence |
| **Warmup Steps** | 100 |
| **Training Max Steps** | Num training tokens / (Seq len × GBS)<br>• Recommended: 80-100B tokens |
| **Data Composition** | • Standard models: 100% pre-training data<br>• Reasoning models: 70% reasoning data + 30% pre-training data |

> [!TIP]
> If you know the maximum learning rate used during the original training, a good rule of thumb for knowledge distillation is to use **1/5th of that maximum LR** when compressing by ~50%.
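
To make the **Training Max Steps** row and the tip above concrete, here is a small worked example for the recommended defaults (80-100B tokens, sequence length 8192, GBS 768); the helper function is illustration only and is not part of Model Optimizer.

```python
# Worked example for the "Training Max Steps" row above; not part of Model Optimizer.
def distillation_max_steps(num_tokens: float, seq_len: int, global_batch_size: int) -> int:
    """Optimizer steps needed to consume `num_tokens` tokens of distillation data."""
    tokens_per_step = seq_len * global_batch_size  # 8192 * 768 = 6,291,456 tokens/step
    return round(num_tokens / tokens_per_step)

print(distillation_max_steps(80e9, 8192, 768))   # ~12,716 steps for 80B tokens
print(distillation_max_steps(100e9, 8192, 768))  # ~15,895 steps for 100B tokens

# Rule of thumb from the tip: for ~50% compression, start distillation at ~1/5th of
# the original training's maximum LR (the value below is an assumed example).
original_max_lr = 3e-4
kd_peak_lr = original_max_lr / 5  # -> 6e-5
```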
@@ -108,7 +188,9 @@ Some of the models pruned using Minitron method followed by distillation and pos
### FastNAS Pruning for PyTorch Computer Vision Models

Check out the FastNAS pruning example usage in the [documentation](https://nvidia.github.io/TensorRT-Model-Optimizer/guides/3_pruning.html#pruning-and-subnet-search).
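
For a quick sense of the workflow before opening the docs or the notebook, here is a rough sketch of a FastNAS pruning call for a CIFAR-10-sized ResNet; treat the argument names and values as assumptions and rely on the linked documentation for authoritative usage.

```python
# Rough sketch of a FastNAS pruning call; argument names and values are assumptions.
import torch
import modelopt.torch.prune as mtp

model = ...         # assumed: a trained ResNet-20 for CIFAR-10 (torch.nn.Module)
train_loader = ...  # assumed: a small DataLoader used during the subnet search

def score_func(candidate):
    """Assumed: return validation accuracy of a candidate subnet."""
    ...

dummy_input = torch.randn(1, 3, 32, 32)  # CIFAR-10 sized input for tracing/profiling

pruned_model, _ = mtp.prune(
    model,
    mode="fastnas",
    constraints={"flops": "70%"},  # search for a subnet within ~70% of original FLOPs
    dummy_input=dummy_input,
    config={"data_loader": train_loader, "score_func": score_func},
)
```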
You can also take a look at the FastNAS pruning interactive notebook [cifar_resnet](./cifar_resnet.ipynb) in this directory,
which showcases the usage of FastNAS for pruning a ResNet 20 model for the CIFAR-10 dataset. The notebook
also shows how to profile the model to understand the search space of possible pruning options and demonstrates