4 changes: 4 additions & 0 deletions CHANGELOG.rst
@@ -8,6 +8,10 @@ Model Optimizer Changelog (Linux)

- Fix a bug in FastNAS pruning (computer vision models) where the model parameters were sorted twice, corrupting the ordering.

**New Features**

- Add MoE pruning support (e.g. Qwen3-30B-A3B) for the ``num_moe_experts``, ``moe_ffn_hidden_size``, and ``moe_shared_expert_intermediate_size`` parameters in Minitron pruning (``mcore_minitron``).

0.39 (2025-11-14)
^^^^^^^^^^^^^^^^^

11 changes: 7 additions & 4 deletions docs/source/guides/7_nas.rst
@@ -361,9 +361,12 @@ can be converted into searchable units:
# search over the number of layers (depth) in the sequential layer.
nn.Sequential

# We convert Megatron-core / NeMo GPT or Mamba style models (e.g. Llama3.1, NeMo Mistral, NeMotron-H, etc.)
# to automatically search over the MLP hidden size, number of attention heads, number of GQA groups,
# number of mamba heads, mamba head dimension, and depth of the model.
# We convert Megatron-core / NeMo GPT, MoE, or Mamba hybrid style models (e.g. Llama3, Nemotron-H, Qwen3-30B-A3B)
# to automatically search over the MLP hidden size, number of attention heads, number of GQA groups,
# number of Mamba heads, Mamba head dimension, number of MoE experts, MoE FFN hidden size,
# MoE shared expert intermediate size, and depth of the model (see the sketch after this listing).
megatron.core.models.gpt.GPTModel
megatron.core.models.mamba.MambaModel
nemo.collections.llm.gpt.model.base.GPTModel
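As a hedged illustration of the flow above (not part of this diff): after conversion, the listed modules become searchable units, and a search can then be run under a constraint. The mode name, constraint keys, and config entries below follow modelopt's public NAS examples and should be treated as assumptions rather than this repository's exact API.

```python
# Minimal sketch, assuming mtn.convert()/mtn.search() behave as in the public examples;
# `val_loader` and `validate` in the commented-out search call are user-supplied.
import torch.nn as nn

import modelopt.torch.nas as mtn

# Toy model: after conversion, the nn.Sequential body becomes searchable over depth
# and the Conv2d layers over their channel dimensions.
model = nn.Sequential(*[nn.Conv2d(32, 32, 3, padding=1) for _ in range(8)])
model = mtn.convert(model, mode="fastnas")  # or mode=[("fastnas", config)] with an explicit config

# Search for a subnet under a FLOPs budget (requires a real score function and
# data loader, hence left commented here):
# searched_model, state = mtn.search(
#     model,
#     constraints={"flops": "60%"},
#     dummy_input=torch.randn(1, 32, 16, 16),
#     config={"data_loader": val_loader, "score_func": validate},
# )
```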
@@ -640,7 +643,7 @@ The difference between NAS and pruning is summarized below.
[Advanced] Adding a new NAS/Prune Algorithm
===========================================

* Please refer to this `template <https://github.com/NVIDIA/TensorRT-Model-Optimizer/compare/template/new-nas-mode>`_
for adding a new NAS algorithm.
* Please refer to `mcore_minitron.py <https://github.com/NVIDIA/TensorRT-Model-Optimizer/blob/main/modelopt/torch/prune/plugins/mcore_minitron.py>`_
  for a worked example of adding the Minitron pruning algorithm.
11 changes: 9 additions & 2 deletions examples/megatron-lm/README.md
@@ -20,7 +20,7 @@
| Model | Quantization | EAGLE3 | Q-LoRA | Pruning (PP only) | Distillation |
| :---: | :---: | :---: | :---: | :---: | :---: |
| `moonshotai/Kimi-K2-Instruct` | ✅ | **Online** | | | |
| `Qwen/Qwen3-{30B-A3B, 235B-A22B}` | **WAR** | **Online** | | | |
| `Qwen/Qwen3-{30B-A3B, 235B-A22B}` | **WAR** | **Online** | | ✅ | ✅ |
| `Qwen/Qwen3-{0.6B, 8B}` | ✅ | **Online** | | ✅ | ✅ |
| `deepseek-ai/DeepSeek-R1` | ✅ | **Online** | | | |
| `meta-llama/Llama-{3.1-8B, 3.1-405B, 3.2-1B}-Instruct` | ✅ | **Online** | | ✅ | ✅ |
@@ -112,14 +112,17 @@ Coming soon ...

Check out the pruning [getting started section](../pruning/README.md#getting-started) and [guidelines](../pruning/README.md#pruning-guidelines) in the pruning README for configuring pruning parameters.

Pruning is supported for GPT and Mamba models in Pipeline Parallel mode. Available pruning options are:
Pruning is supported for GPT and Mamba models in Pipeline Parallel mode. Available pruning dimensions are listed below; a hedged sketch of the corresponding export config follows the list:

- `TARGET_FFN_HIDDEN_SIZE`
- `TARGET_HIDDEN_SIZE`
- `TARGET_NUM_ATTENTION_HEADS`
- `TARGET_NUM_QUERY_GROUPS`
- `TARGET_MAMBA_NUM_HEADS`
- `TARGET_MAMBA_HEAD_DIM`
- `TARGET_NUM_MOE_EXPERTS`
- `TARGET_MOE_FFN_HIDDEN_SIZE`
- `TARGET_MOE_SHARED_EXPERT_INTERMEDIATE_SIZE`
- `TARGET_NUM_LAYERS`
- `LAYERS_TO_DROP` (comma separated, 1-indexed list of layer numbers to directly drop)
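The `TARGET_*` dimensions above correspond to the export-config keys consumed by Minitron (`mcore_minitron`) pruning. Below is a hedged Python sketch of what that configuration amounts to when driven through `modelopt` directly; the call shape follows the pruning README, the numeric targets are placeholders, and how `prune.sh` wires these variables internally is an assumption.

```python
# Minimal sketch, assuming mtp.prune(mode="mcore_minitron") accepts an export_config
# constraint as in the pruning README; `model` and `forward_loop` must be supplied
# by the caller (a loaded Megatron-core model and a calibration loop, respectively).
import modelopt.torch.prune as mtp


def minitron_prune(model, forward_loop):
    """Prune a Megatron-core GPT/MoE/Mamba model to the placeholder targets below."""
    # Include only the dimensions you actually want to prune; values are placeholders.
    export_config = {
        "ffn_hidden_size": 9216,                      # TARGET_FFN_HIDDEN_SIZE
        "hidden_size": 3072,                          # TARGET_HIDDEN_SIZE
        "num_attention_heads": 24,                    # TARGET_NUM_ATTENTION_HEADS
        "num_query_groups": 4,                        # TARGET_NUM_QUERY_GROUPS
        "num_moe_experts": 64,                        # TARGET_NUM_MOE_EXPERTS
        "moe_ffn_hidden_size": 512,                   # TARGET_MOE_FFN_HIDDEN_SIZE
        "moe_shared_expert_intermediate_size": 2048,  # TARGET_MOE_SHARED_EXPERT_INTERMEDIATE_SIZE
        "num_layers": 36,                             # TARGET_NUM_LAYERS
    }
    pruned_model, _ = mtp.prune(
        model,
        mode="mcore_minitron",
        constraints={"export_config": export_config},
        dummy_input=None,  # not used by Minitron pruning
        config={"forward_loop": forward_loop},  # runs calibration batches for activation scores
    )
    return pruned_model
```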

@@ -137,6 +140,10 @@ bash Megatron-LM/examples/post_training/modelopt/prune.sh qwen/Qwen3-8B
> If the number of layers in the model is not divisible by the pipeline parallel size (PP), you can configure uneven
> PP by setting `MLM_EXTRA_ARGS="--decoder-first-pipeline-num-layers <X> --decoder-last-pipeline-num-layers <Y>"`.

> [!TIP]
> You can reuse pruning scores when pruning the same model again to a different architecture by setting
> `PRUNE_ARGS="--pruning-scores-path <path_to_save_scores>"`.

## Learn More About Configuration

For simplicity, we use `shell` scripts and variables as arguments. Each script has at least 1 positional
8 changes: 4 additions & 4 deletions examples/pruning/README.md
@@ -6,7 +6,7 @@ Pruning can involve removal (prune) of Linear and Conv layers, and Transformer a

This section focuses on applying Model Optimizer's state-of-the-art complementary pruning modes to enable you to search for the best subnet architecture from your provided base model:

1. [Minitron](https://arxiv.org/pdf/2408.11796): A pruning method developed by NVIDIA Research for pruning GPT, Mamba and Hybrid Transformer Mamba models in NVIDIA NeMo or Megatron-LM framework. It uses the activation magnitudes to prune the embedding hidden size, mlp ffn hidden size, transformer attention heads, GQA query groups, mamba heads and head dimension, and number of layers of the model.
1. [Minitron](https://arxiv.org/pdf/2408.11796): A pruning method developed by NVIDIA Research for pruning GPT, Mamba, and hybrid Transformer-Mamba models in the NVIDIA NeMo or Megatron-LM framework. It uses activation magnitudes to prune the embedding hidden size; mlp ffn hidden size; transformer attention heads and GQA query groups; mamba heads and head dimension; MoE number of experts, ffn hidden size, and shared expert intermediate size; and number of layers of the model.
1. FastNAS: A pruning method recommended for Computer Vision models. Given a pretrained model, FastNAS finds the subnet which maximizes the score function while meeting the given constraints.
1. GradNAS: A light-weight pruning method recommended for language models like Hugging Face BERT and GPT-J. It uses the gradient information to prune the model's linear layers and attention heads to meet the given constraints.

@@ -89,11 +89,11 @@ If your model parameters are already sorted, you can skip the sorting step by se

| **Algorithm** | **Model** | **Pruning Constraints** |
| :---: | :---: | :---: |
| Minitron | Megatron-core / NeMo based GPT / Mamba / Hybrid Models<sup>1</sup> | Export config with width (`hidden_size`, `ffn_hidden_size`, `num_attention_heads`, `num_query_groups`, `mamba_num_heads`, `mamba_head_dim`) and/or depth (`num_layers`) values |
| Minitron | Megatron-core / NeMo based GPT / Mamba / MoE / Hybrid Models<sup>1</sup> | Export config with width (`hidden_size`, `ffn_hidden_size`, `num_attention_heads`, `num_query_groups`, `mamba_num_heads`, `mamba_head_dim`, `num_moe_experts`, `moe_ffn_hidden_size`, `moe_shared_expert_intermediate_size`) and/or depth (`num_layers`) values |
| FastNAS | Computer Vision models | flops, parameters |
| GradNAS | HuggingFace BERT, GPT-J | flops, parameters |

> *<sup>1.</sup>Only Pipeline Parallel models are supported. Hugging Face models can be converted to NeMo format and used subsequently.*
> *<sup>1.</sup>Only Pipeline Parallel models are supported. Hugging Face models can be converted to Megatron-LM/NeMo format and used subsequently.*
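As a complement to the table, here is a hedged sketch of a flops-constrained FastNAS prune for a computer-vision model; the constraint and config keys follow the public examples, and `val_loader` / `score_func` are user-supplied.

```python
# Minimal sketch, assuming mtp.prune(mode="fastnas") takes flops/params constraints
# and a data_loader + score_func config as in the public examples.
import torch
import modelopt.torch.prune as mtp


def fastnas_prune(model: torch.nn.Module, val_loader, score_func):
    """Prune `model` to roughly 60% of its original FLOPs with FastNAS."""
    pruned_model, _ = mtp.prune(
        model,
        mode="fastnas",
        constraints={"flops": "60%"},  # or {"params": "60%"}
        dummy_input=torch.randn(1, 3, 224, 224),  # example input shape for a CV model
        config={"data_loader": val_loader, "score_func": score_func},
    )
    return pruned_model
```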

## Pruning Guidelines

@@ -122,7 +122,7 @@ Depth pruning reduces the number of layers (`num_layers`) in the model.

#### Width Pruning

Width pruning reduces model dimensions per layer such as `hidden_size`, `ffn_hidden_size`, `num_attention_heads`, `num_query_groups`, `mamba_num_heads`, and `mamba_head_dim`.
Width pruning reduces model dimensions per layer such as `hidden_size`, `ffn_hidden_size`, `num_attention_heads`, `num_query_groups`, `mamba_num_heads`, `mamba_head_dim`, `num_moe_experts`, `moe_ffn_hidden_size`, and `moe_shared_expert_intermediate_size`.

**Advantages:**

34 changes: 33 additions & 1 deletion modelopt/torch/nas/modules/container.py
@@ -26,7 +26,7 @@
from ..registry import DMRegistry
from ..traced_hp import TracedHp

__all__ = ["_DynamicSequential"]
__all__ = ["DynamicModuleList", "_DynamicSequential"]


def _activate_depth(func: Callable) -> Callable:
@@ -97,3 +97,35 @@ def modify(self, *, min_depth: int = 0):
"""
hp = self.get_hparam("depth")
hp.choices = [d for d in hp.choices if d >= min_depth]


# NOTE: We provide this as a parent class since it is not registered in DMRegistry;
# a module is explicitly converted to it where needed.
class DynamicModuleList(DynamicModule, nn.ModuleList):
    """An ``nn.ModuleList`` container with dynamic hyperparams and variable ``depth``.

    Unlike ``_DynamicSequential``, this module supports sorting/reordering of modules based on
    importance, in addition to variable depth.
    """

    def _setup(self):
        # register hyperparameters
        self._register_hparam("depth", TracedHp(list(range(1, len(self) + 1))))

        # register _modules as a dynamic attribute
        self._register_dynamic_attribute("_modules", self._get_modules)

    @staticmethod
    def _get_modules(mod: "DynamicModuleList", modules: dict) -> dict:
        """Get modules with dynamic depth and ordering applied based on active_slice."""
        hp = mod.get_hparam("depth")
        active_slice = hp.active_slice

        items = list(modules.items())

        if isinstance(active_slice, slice):
            active_items = items[active_slice]
        else:
            active_items = [items[idx] for idx in active_slice.tolist()]

        # Re-create dict with keys str(0) .. str(len(active_items) - 1)
        return {str(i): module for i, (_, module) in enumerate(active_items)}
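Since the new container is deliberately not registered in `DMRegistry`, explicit conversion is the intended entry point. A hedged usage sketch, assuming `DynamicModule.convert` and the `Hparam.active` setter behave as for the other dynamic modules in this package:

```python
# Hypothetical usage of the DynamicModuleList added above (not part of this diff).
import torch.nn as nn

from modelopt.torch.nas.modules.container import DynamicModuleList

experts = nn.ModuleList(nn.Linear(16, 16) for _ in range(8))
dyn_experts = DynamicModuleList.convert(experts)  # explicit conversion; no DMRegistry entry

# Shrink the active depth: only the first four modules remain visible (or the four
# selected by the hparam's ordering once an importance-based order has been applied).
dyn_experts.get_hparam("depth").active = 4
assert len(dyn_experts) == 4
```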