8 changes: 6 additions & 2 deletions CHANGELOG.rst
@@ -1,7 +1,7 @@
Model Optimizer Changelog (Linux)
=================================

0.39 (2025-10-xx)
0.39 (2025-11-xx)
^^^^^^^^^^^^^^^^^

**Deprecations**
@@ -12,7 +12,11 @@ Model Optimizer Changelog (Linux)
- Add LoRA mode support for MCore in a new peft submodule: ``modelopt.torch.peft.update_model(model, LORA_CFG)``.
- Support PTQ and fakequant in vLLM for fast evaluation of arbitrary quantization formats. See ``examples/vllm_serve`` for more details.

0.37 (2025-09-xx)
**Documentation**

- Add general guidelines for Minitron pruning and distillation. See `examples/pruning/README.md <https://github.com/NVIDIA/TensorRT-Model-Optimizer/tree/main/examples/pruning#pruning-guidelines>`_ for more details.

0.37 (2025-10-08)
^^^^^^^^^^^^^^^^^

**Deprecations**
88 changes: 85 additions & 3 deletions examples/pruning/README.md
@@ -17,6 +17,8 @@ This section focuses on applying Model Optimizer's state-of-the-art complementar
| Pre-Requisites | Required & optional packages to use this technique | \[[Link](#pre-requisites)\] | |
| Getting Started | Learn how to use the pruning API | \[[Link](#getting-started)\] | \[[docs](https://nvidia.github.io/TensorRT-Model-Optimizer/guides/3_pruning.html)\] |
| Support Matrix | View the support matrix to see available pruning algorithms and their compatibility with different models and frameworks | \[[Link](#support-matrix)\] | |
| Pruning Guidelines | Guidelines for choosing how and how much to prune for best results | \[[Link](#pruning-guidelines)\] | |
| Examples | Examples of different pruning methods | \[[Link](#examples)\] | |
| Resources | Extra links to relevant resources | \[[Link](#resources)\] | |

</div>
@@ -93,6 +95,84 @@ If your model parameters are already sorted, you can skip the sorting step by se

> *<sup>1.</sup>Only Pipeline Parallel models are supported. Hugging Face models can be converted to NeMo format and used subsequently.*

## Pruning Guidelines

### Minitron

This section provides recommendations for choosing pruning strategies and distillation hyperparameters for Minitron pruning to help achieve the best latency-accuracy trade-offs.

#### Depth Pruning

Depth pruning reduces the number of layers (`num_layers`) in the model.

**Advantages:**

- Simpler to configure: only one parameter (`num_layers`) to tune
- Faster inference than width-pruned models at a fixed number of parameters

**Recommendations:**

- Up to **1/3rd parameter reduction** can generally result in a model above the Pareto frontier with good latency-accuracy trade-off (when using a good quality dataset for distillation with ~80-100B tokens)
- For pruning **>50%**, use iterative pruning: compress by 30%, perform distillation, then compress again

**Examples:**

- [Qwen3-8B](https://huggingface.co/Qwen/Qwen3-8B) (`num_layers=36`) → 6B (`num_layers=24`)
- [Llama-3.1-8B](https://huggingface.co/meta-llama/Llama-3.1-8B) (`num_layers=32`) → 4.5B (`num_layers=16`)
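
For reference, a depth-only prune along these lines can be expressed with ModelOpt's pruning API. The sketch below is illustrative, not authoritative: the `mcore_minitron` mode name, the `export_config` constraint key, and the `forward_loop` config entry are assumptions based on the docs linked above, and `model`/`forward_loop` are placeholders you would supply.

```python
import modelopt.torch.prune as mtp

# Sketch: depth-prune Qwen3-8B (num_layers=36) down to 24 layers.
# Assumes `model` is a loaded Megatron-Core model and `forward_loop`
# runs calibration batches through it for importance estimation.
export_config = {"num_layers": 24}  # the single knob for depth pruning

pruned_model, _ = mtp.prune(
    model,
    mode="mcore_minitron",                        # assumed mode name
    constraints={"export_config": export_config},
    dummy_input=None,                             # unused when forward_loop is given (assumption)
    config={"forward_loop": forward_loop},
)
```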

#### Width Pruning

Width pruning reduces model dimensions per layer such as `hidden_size`, `ffn_hidden_size`, `num_attention_heads`, `num_query_groups`, `mamba_num_heads`, and `mamba_head_dim`.

**Advantages:**

- Better accuracy than depth-pruned models at a fixed number of parameters

**Recommendations:**

- Start with pruning `hidden_size` and `ffn_hidden_size` as the simplest configuration
- Up to **1/3rd parameter reduction** can generally result in a model above the Pareto frontier with good latency-accuracy trade-off (when using a good quality dataset for distillation with ~80-100B tokens)
- **Axis sensitivity:** MLP dimensions (`ffn_hidden_size`) can typically be pruned more aggressively than embedding dimensions (`hidden_size`) and attention/Mamba dimensions (`num_attention_heads`, `num_query_groups`, `mamba_num_heads`, `mamba_head_dim`)
- For pruning **>50%**, use iterative pruning: compress by 30%, perform distillation, then compress again

**Examples:**

- [Qwen3-8B](https://huggingface.co/Qwen/Qwen3-8B) (`ffn_hidden_size=12288`, `hidden_size=4096`) → 6B (`ffn_hidden_size=9216`, `hidden_size=3584`)
- [Llama-3.1-8B](https://huggingface.co/meta-llama/Llama-3.1-8B) (`ffn_hidden_size=14336`, `hidden_size=4096`) → 4.5B (`ffn_hidden_size=9216`, `hidden_size=3072`)
- [Nemotron-H-8B-Base-8K](https://huggingface.co/nvidia/Nemotron-H-8B-Base-8K) (`ffn_hidden_size=21504`, `hidden_size=4096`, `mamba_num_heads=128`) → [Nemotron-H-4B-Base-8K](https://huggingface.co/nvidia/Nemotron-H-4B-Base-8K) (`ffn_hidden_size=12288`, `hidden_size=3072`, `mamba_num_heads=112`) - See [paper](https://arxiv.org/pdf/2504.11409)
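
Under the same assumptions as the depth sketch above, the `mtp.prune` call is unchanged; only the export config differs. A hedged width-only config mirroring the Llama-3.1-8B → 4.5B numbers:

```python
# Width-only export config (illustrative; same assumed mtp.prune call as above):
export_config = {
    "ffn_hidden_size": 9216,  # pruned from 14336
    "hidden_size": 3072,      # pruned from 4096
}
```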

#### Depth and Width Pruning

For optimal results, combine depth and width pruning. This will require more tuning to find the best architecture.

**Examples:**

- [NVIDIA-Nemotron-Nano-12B-v2](https://huggingface.co/nvidia/NVIDIA-Nemotron-Nano-12B-v2) (`ffn_hidden_size=20480`, `hidden_size=5120`, `num_layers=62`) → [NVIDIA-Nemotron-Nano-9B-v2](https://huggingface.co/nvidia/NVIDIA-Nemotron-Nano-9B-v2) (`ffn_hidden_size=15680`, `hidden_size=4480`, `num_layers=56`) - See [paper](https://arxiv.org/pdf/2508.14444)
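
Combining the two is, under the same assumptions as the sketches above, just a matter of listing depth and width targets together, e.g. with the Nemotron-Nano numbers from this example:

```python
# Combined depth + width export config (illustrative):
export_config = {
    "num_layers": 56,          # from 62
    "hidden_size": 4480,       # from 5120
    "ffn_hidden_size": 15680,  # from 20480
}
```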

#### General Pruning Guidelines

- **Pruning ratio:** Anything **>50% pruning is hard to recover**. For such aggressive pruning, iterative pruning (compress → distill → compress again) is recommended.
- **Latency-accuracy trade-off:** The more pruning you do, the faster your model will be at the cost of lower accuracy. Choose based on your requirements.
- **Dataset quality:** Use a high-quality dataset for distillation. If you don't have a specific dataset, [Nemotron-Pretraining-SFT-v1](https://huggingface.co/datasets/nvidia/Nemotron-Pretraining-SFT-v1) is recommended.
- **Post-training:** After pruning and distillation on pre-training datasets, further post-training (e.g., instruction tuning, preference alignment) is needed to improve reasoning capabilities. A good dataset for post-training is [Nemotron-Post-Training-Dataset-v2](https://huggingface.co/datasets/nvidia/Nemotron-Post-Training-Dataset-v2).

#### Distillation Hyperparameters

After pruning, distillation is required to recover model accuracy. Below are recommended starting hyperparameters for distillation:

| **Hyperparameter** | **Recommendation** |
| :---: | :---: |
| **Sequence Length** | 8192 (or 4096 if dataset has smaller sequences) |
| **Global Batch Size (GBS)** | 768 |
| **Micro Batch Size (MBS)** | As large as your GPU memory can accommodate |
| **Learning Rate (LR)** | 1e-4 → 1e-5 (linear decay) for 30-50% pruning<br>• More compression → higher LR<br>• Less compression → lower LR<br>• As model gets larger → reduce LR to avoid divergence |
| **Warmup Steps** | 100 |
| **Training Max Steps** | Num training tokens / (Seq len × GBS)<br>• Recommended: 80-100B tokens |
| **Data Composition** | • Standard models: 100% pre-training data<br>• Reasoning models: 70% reasoning data + 30% pre-training data |

> [!TIP]
> If you know the maximum learning rate used during the original training, a good rule of thumb for knowledge distillation is to use **1/5th of that maximum LR** when compressing by ~50%.
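
To make the table concrete, here is the arithmetic behind the step count and the LR rule of thumb (the original max LR below is a made-up illustrative value):

```python
# Worked example for the recommendations above.
tokens = 100e9   # distillation budget: ~80-100B tokens
seq_len = 8192
gbs = 768

max_steps = tokens / (seq_len * gbs)
print(f"max steps ~= {max_steps:,.0f}")  # ~= 15,895

# Tip above: ~1/5th of the original max LR at ~50% compression.
original_max_lr = 3e-4  # assumption -- substitute your model's actual max LR
distill_lr = original_max_lr / 5
print(f"distillation peak LR ~= {distill_lr:.0e}")  # 6e-05
```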

Comment on lines +98 to +175
⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

Sanity‑check external links and local notebook path.

Run this to quickly validate URLs (HTTP 200/301) and ensure the notebook exists.


🏁 Script executed:

```bash
#!/usr/bin/env bash
set -euo pipefail

FILE="examples/pruning/README.md"

echo "Checking external links..."
# Extract and de-duplicate http(s) links
mapfile -t LINKS < <(grep -oE '\(https?://[^) ]+\)' -n "$FILE" | sed -E 's/.*\((https?:\/\/[^) ]+)\).*/\1/' | sort -u)

fail=0
for u in "${LINKS[@]}"; do
  code=$(curl -m 10 -sS -o /dev/null -w '%{http_code}' -L "$u" || echo "000")
  printf '%-3s %s\n' "$code" "$u"
  [[ "$code" =~ ^2|3 ]] || fail=1
done

echo
echo "Checking local notebook path..."
if [[ -f "examples/pruning/cifar_resnet.ipynb" ]]; then
  echo "OK  examples/pruning/cifar_resnet.ipynb exists"
else
  echo "ERR examples/pruning/cifar_resnet.ipynb missing"
  fail=1
fi

exit $fail
```

Length of output: 83


🏁 Script executed:

```bash
#!/usr/bin/env bash
set -euo pipefail

FILE="examples/pruning/README.md"
echo "Checking external links..."
fail=0

grep -oE '\(https?://[^) ]+\)' "$FILE" \
  | sed -E 's/.*\((https?:\/\/[^) ]+)\).*/\1/' \
  | sort -u \
  | while read -r url; do
      code=$(curl -m 10 -sS -o /dev/null -w '%{http_code}' -L "$url" || echo "000")
      printf '%-3s %s\n' "$code" "$url"
      if ! [[ "$code" =~ ^[23] ]]; then
        fail=1
      fi
    done

echo
echo "Checking local notebook path..."
if [[ -f "examples/pruning/cifar_resnet.ipynb" ]]; then
  echo "OK  examples/pruning/cifar_resnet.ipynb exists"
else
  echo "ERR examples/pruning/cifar_resnet.ipynb missing"
  fail=1
fi

exit $fail
```

Length of output: 1529


Broken external link: the URL for Nemotron-H-4B-Base-8K (https://huggingface.co/nvidia/Nemotron-H-4B-Base-8K) returns HTTP 401; update it to a publicly accessible link or ensure the model page is published.

🤖 Prompt for AI Agents
In examples/pruning/README.md around lines 98 to 175 the reference link for
Nemotron-H-4B-Base-8K (https://huggingface.co/nvidia/Nemotron-H-4B-Base-8K)
returns HTTP 401; update the link to a publicly accessible URL or replace it
with the correct published Hugging Face model/dataset page or an alternative
public resource, and verify the new URL returns 200; also ensure any adjacent
links referencing the same private artifact are updated consistently and the
link text remains accurate.

## Examples

### Minitron Pruning for Megatron-LM / NeMo Framework LLMs (e.g. Qwen 3, Nemotron Nano)
@@ -108,10 +188,12 @@ Some of the models pruned using Minitron method followed by distillation and pos

### FastNAS Pruning for PyTorch Computer Vision Models

Checkout the FastNAS pruning interactive notebook [cifar_resnet](./cifar_resnet.ipynb) in this directory
Check out the FastNAS pruning example usage in the [documentation](https://nvidia.github.io/TensorRT-Model-Optimizer/guides/3_pruning.html#pruning-and-subnet-search).

You can also take a look at FastNAS pruning interactive notebook [cifar_resnet](./cifar_resnet.ipynb) in this directory
which showcases the usage of FastNAS for pruning a ResNet 20 model for the CIFAR-10 dataset. The notebook
also how to profiling the model to understand the search space of possible pruning options and demonstrates
the usage saving and restoring pruned models.
also shows how to profile the model to understand the search space of possible pruning options and demonstrates
how to save and restore pruned models.
Comment on lines +191 to +196
⚠️ Potential issue | 🟡 Minor

Polish FastNAS paragraph (articles, hyphenation, flow).

Minor grammar/style fixes.

```diff
-Check out the FastNAS pruning example usage in the [documentation](https://nvidia.github.io/TensorRT-Model-Optimizer/guides/3_pruning.html#pruning-and-subnet-search).
+Check out the FastNAS pruning example usage in the [documentation](https://nvidia.github.io/TensorRT-Model-Optimizer/guides/3_pruning.html#pruning-and-subnet-search).

-You can also take a look at FastNAS pruning interactive notebook [cifar_resnet](./cifar_resnet.ipynb) in this directory
-which showcases the usage of FastNAS for pruning a ResNet 20 model for the CIFAR-10 dataset. The notebook
-also shows how to profile the model to understand the search space of possible pruning options and demonstrates
-how to save and restore pruned models.
+You can also take a look at the FastNAS pruning interactive notebook [cifar_resnet](./cifar_resnet.ipynb) in this directory,
+which shows how to use FastNAS to prune a ResNet‑20 model on the CIFAR‑10 dataset. The notebook
+also shows how to profile the model to understand the search space of possible pruning options and
+how to save and restore pruned models.
```
🤖 Prompt for AI Agents
In examples/pruning/README.md around lines 191 to 196, the FastNAS paragraph
needs minor grammar and style polishing: add definite/indefinite articles where
appropriate, hyphenate compound adjectives (e.g., "interactive notebook" is fine
but "FastNAS pruning" could be "the FastNAS pruning"), improve flow by combining
sentences and clarifying references, and ensure consistent punctuation. Edit the
text to read smoothly (e.g., reference the documentation link, refer to "the
FastNAS pruning interactive notebook cifar_resnet.ipynb in this directory,"
mention "ResNet-20" with hyphenation, and use "CIFAR-10 dataset"), and ensure
the final sentences clearly state that the notebook profiles the model to
explore pruning options and demonstrates saving and restoring pruned models.
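
For orientation, here is a rough FastNAS sketch matching the paragraph above. The `fastnas` mode name, the `flops` constraint key, the `data_loader`/`score_func` config entries, and the `mto.save`/`mto.restore` helpers are assumptions based on the linked docs, and `model`, `train_loader`, and `score_func` are placeholders you would supply.

```python
import torch
import modelopt.torch.opt as mto
import modelopt.torch.prune as mtp

# Sketch: FastNAS-prune a CIFAR-10 ResNet to ~60% of its FLOPs.
dummy_input = torch.randn(1, 3, 32, 32)  # CIFAR-10-shaped input

pruned_model, _ = mtp.prune(
    model,
    mode="fastnas",                   # assumed mode name
    constraints={"flops": "60%"},     # relative FLOPs budget (assumed key)
    dummy_input=dummy_input,
    config={
        "data_loader": train_loader,  # used to profile candidate subnets
        "score_func": score_func,     # e.g. validation accuracy; higher is better
    },
)

# Save and later restore the pruned architecture + weights (assumed helpers):
mto.save(pruned_model, "pruned_resnet.pth")
restored_model = mto.restore(model, "pruned_resnet.pth")
```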


### GradNAS Pruning for HuggingFace Language Models (e.g. BERT)
