1 change: 1 addition & 0 deletions CHANGELOG.rst
@@ -16,6 +16,7 @@ Model Optimizer Changelog (Linux)
- ``high_precision_dtype`` default to fp16 in ONNX quantization, i.e. quantized output model weights are now FP16 by default.
- Upgrade TensorRT-LLM dependency to 1.1.0rc2.
- Support Phi-4-multimodal and Qwen2.5-VL quantized HF checkpoint export in ``examples/vlm_ptq``.
- Add Minitron pruning example for Megatron-LM framework. See ``examples/megatron-lm`` for more details.

0.35 (2025-09-04)
^^^^^^^^^^^^^^^^^
51 changes: 36 additions & 15 deletions examples/megatron-lm/README.md
@@ -17,13 +17,13 @@

## Support Matrix: {Model}x{Features}

| Model | Quantization | EAGLE3 | Q-LoRA | Distillation |
| ------------------------------------------------------ | -----------| ------ | ----- | ---- |
| `moonshotai/Kimi-K2-Instruct` | ✅ | **Online** | | |
| `Qwen/Qwen3-{30B-A3B, 235B-A22B}` | **WAR** | **Online** | | |
| `Qwen/Qwen3-{0.6B, 8B}` | ✅ | **Online** | | |
| `deepseek-ai/DeepSeek-R1` | ✅ | **Online** | | |
| `meta-llama/Llama-{3.1-8B, 3.1-405B, 3.2-1B}-Instruct` | ✅ | **Online** | | |
| Model | Quantization | EAGLE3 | Q-LoRA | Pruning (PP only) | Distillation |
| :---: | :---: | :---: | :---: | :---: | :---: |
| `moonshotai/Kimi-K2-Instruct` | ✅ | **Online** | | | |
| `Qwen/Qwen3-{30B-A3B, 235B-A22B}` | **WAR** | **Online** | | | |
| `Qwen/Qwen3-{0.6B, 8B}` | ✅ | **Online** | | ✅ | ✅ |
| `deepseek-ai/DeepSeek-R1` | ✅ | **Online** | | | |
| `meta-llama/Llama-{3.1-8B, 3.1-405B, 3.2-1B}-Instruct` | ✅ | **Online** | | ✅ | ✅ |

## Getting Started in a Local Environment

@@ -50,20 +50,20 @@ USER_FSW=<path_to_scratch_space> bash interactive.sh

<br>

## ⭐ FP8 Post-Training Quantization (PTQ)
### ⭐ FP8 Post-Training Quantization (PTQ)

Provide the pretrained model name or checkpoint path through the variable `${HF_MODEL_CKPT}`:

```sh
\
TP=1 \
HF_MODEL_CKPT=<pretrained_checkpoint_path> \
HF_MODEL_CKPT=<pretrained_model_name_or_path> \
MLM_MODEL_SAVE=/tmp/Llama-3.2-1B-Instruct-FP8 \
bash megatron-lm/examples/post_training/modelopt/quantize.sh meta-llama/Llama-3.2-1B-Instruct fp8

\
PP=1 \
HF_MODEL_CKPT=<pretrained_checkpoint_path> \
HF_MODEL_CKPT=<pretrained_model_name_or_path> \
MLM_MODEL_LOAD=/tmp/Llama-3.2-1B-Instruct-FP8 \
EXPORT_DIR=/tmp/Llama-3.2-1B-Instruct-Export \
bash megatron-lm/examples/post_training/modelopt/export.sh meta-llama/Llama-3.2-1B-Instruct
@@ -76,21 +76,21 @@ deployment (`/tmp/Llama-3.2-1B-Instruct-Export`).

<br>

## ⭐ Online BF16 EAGLE3 Training
### ⭐ Online BF16 EAGLE3 Training

Online EAGLE3 training keeps both the target (frozen) and draft models in memory, where the `hidden_states`
required for training are generated on the fly.

```sh
\
TP=1 \
HF_MODEL_CKPT=<pretrained_checkpoint_path> \
HF_MODEL_CKPT=<pretrained_model_name_or_path> \
MLM_MODEL_SAVE=/tmp/Llama-3.2-1B-Eagle3 \
bash megatron-lm/examples/post_training/modelopt/eagle3.sh meta-llama/Llama-3.2-1B-Instruct

\
PP=1 \
HF_MODEL_CKPT=<pretrained_checkpoint_path> \
HF_MODEL_CKPT=<pretrained_model_name_or_path> \
MLM_MODEL_LOAD=/tmp/Llama-3.2-1B-Eagle3 \
EXPORT_DIR=/tmp/Llama-3.2-1B-Eagle3-Export \
bash megatron-lm/examples/post_training/modelopt/export.sh meta-llama/Llama-3.2-1B-Instruct
@@ -104,10 +104,31 @@ See [ADVANCED.md](ADVANCED.md) for a multi-gpu multi-node training example for `

<br>

## ⭐ Offline BF16 EAGLE3 Training
### ⭐ Offline BF16 EAGLE3 Training

Coming soon ...

### ⭐ Pruning

Pruning is supported for GPT and Mamba models in pipeline-parallel (PP) mode only. The available pruning options are:

- `TARGET_FFN_HIDDEN_SIZE`
- `TARGET_HIDDEN_SIZE`
- `TARGET_NUM_ATTENTION_HEADS`
- `TARGET_NUM_QUERY_GROUPS`
- `TARGET_MAMBA_NUM_HEADS`
- `TARGET_MAMBA_HEAD_DIM`
- `TARGET_NUM_LAYERS`
- `LAYERS_TO_DROP` (comma separated, 1-indexed list of layer numbers to directly drop)

```sh
PP=1 \
TARGET_NUM_LAYERS=24 \
HF_MODEL_CKPT=<pretrained_model_name_or_path> \
MLM_MODEL_SAVE=/tmp/Qwen3-8B-DPruned \
bash megatron-lm/examples/post_training/modelopt/prune.sh qwen/Qwen3-8B
```
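
The example above performs depth pruning by dropping layers. Width pruning follows the same pattern using the `TARGET_*` variables listed above; the sketch below is illustrative only, and the target sizes are hypothetical values you would replace with your own desired architecture.

```sh
# Width-pruning sketch: the target sizes below are illustrative, not a tuned configuration.
PP=1 \
TARGET_FFN_HIDDEN_SIZE=9216 \
TARGET_HIDDEN_SIZE=3072 \
HF_MODEL_CKPT=<pretrained_model_name_or_path> \
MLM_MODEL_SAVE=/tmp/Qwen3-8B-WPruned \
bash megatron-lm/examples/post_training/modelopt/prune.sh qwen/Qwen3-8B
```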

## Learn More About Configuration

For simplicity, we use `shell` scripts and variables as arguments. Each script has at least 1 positional
@@ -116,7 +137,7 @@ quantization.

```sh
\
HF_MODEL_CKPT=[pretrained_checkpoint] \
HF_MODEL_CKPT=<pretrained_model_name_or_path> \
bash megatron-lm/examples/post_training/modelopt/quantize.sh [pretrained_model_card] [qformat]
```
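
As a concrete illustration (the parallelism value and save path below are placeholders, not recommendations), the generic command above could be filled in as follows:

```sh
# Illustrative invocation: adjust TP and the save path for your setup.
\
TP=8 \
HF_MODEL_CKPT=<pretrained_model_name_or_path> \
MLM_MODEL_SAVE=/tmp/Llama-3.1-8B-Instruct-FP8 \
bash megatron-lm/examples/post_training/modelopt/quantize.sh meta-llama/Llama-3.1-8B-Instruct fp8
```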

18 changes: 9 additions & 9 deletions examples/pruning/README.md
@@ -91,30 +91,30 @@ mtp.prune(

## Examples

### Minitron Pruning for NVIDIA NeMo / Megatron-LM LLMs (e.g. Llama 3)
### Minitron Pruning for Megatron-LM / NeMo Framework LLMs (e.g. Llama 3.1, Nemotron Nano)

Checkout the Minitron pruning example in the [NVIDIA NeMo repository](https://docs.nvidia.com/nemo-framework/user-guide/latest/model-optimization/pruning/pruning.html) which showcases the usage of the powerful Minitron pruning algorithm developed by NVIDIA Research for pruning LLMs like Llama 3.1 8B, Qwen 3 8B, Mistral NeMo 12B, etc.
Check out the Minitron pruning examples for the [Megatron-LM Framework](../megatron-lm/README.md#-pruning) and the [NeMo Framework](https://docs.nvidia.com/nemo-framework/user-guide/latest/model-optimization/pruning/pruning.html), which showcase the powerful Minitron pruning algorithm developed by NVIDIA Research for pruning LLMs like Llama 3.1 8B, Qwen 3 8B, and Nemotron Nano 12B v2.

You can also look at the tutorial notebooks [here](https://github.com/NVIDIA-NeMo/NeMo/tree/main/tutorials/llm/llama/pruning-distillation) which showcase the usage of Minitron pruning followed by distillation for Llama 3.1 8B step-by-step in NeMo framework. Hugging Face models can also be converted to NeMo format and used subsequently as shown in the tutorial.
You can also look at the NeMo tutorial notebooks [here](https://github.com/NVIDIA-NeMo/NeMo/tree/main/tutorials/llm/llama/pruning-distillation), which walk through Minitron pruning followed by distillation for Llama 3.1 8B step by step in the NeMo framework. Hugging Face models can also be converted to NeMo format and used subsequently, as shown in the tutorial.

Some of the models pruned using the Minitron method followed by distillation and post-training are:

- [Minitron Collection on Hugging Face](https://huggingface.co/collections/nvidia/minitron-669ac727dc9c86e6ab7f0f3e)
- [NVIDIA-Nemotron-Nano-9B-v2](https://huggingface.co/nvidia/NVIDIA-Nemotron-Nano-9B-v2)

### GradNAS Pruning for HuggingFace Language Models (e.g. BERT)

Check out the BERT pruning example in the [chained_optimizations](../chained_optimizations/README.md) directory, which showcases the usage of GradNAS for pruning a BERT model for Question Answering, followed by fine-tuning with distillation and quantization. The example also demonstrates how to save and restore pruned models.

### FastNAS Pruning for PyTorch Computer Vision Models

Check out the FastNAS pruning interactive notebook [cifar_resnet](./cifar_resnet.ipynb) in this directory, which showcases the usage of FastNAS for pruning a ResNet 20 model on the CIFAR-10 dataset. The notebook also shows how to profile the model to understand the search space of possible pruning options, and demonstrates how to save and restore pruned models.

### GradNAS Pruning for HuggingFace Language Models (e.g. BERT)

Check out the BERT pruning example in the [chained_optimizations](../chained_optimizations/README.md) directory, which showcases the usage of GradNAS for pruning a BERT model for Question Answering, followed by fine-tuning with distillation and quantization. The example also demonstrates how to save and restore pruned models.

## Resources

- 📅 [Roadmap](https://github.com/NVIDIA/TensorRT-Model-Optimizer/issues/146)
21 changes: 0 additions & 21 deletions modelopt/torch/prune/plugins/mcore_minitron.py
@@ -24,8 +24,6 @@
Actual dynamic module implementations are at :mod:`modelopt.torch.nas.plugins.megatron`.
"""

from warnings import warn

import torch
from pydantic import create_model

@@ -209,22 +207,3 @@ def config_class(self) -> type[ModeloptBaseConfig]:
def search_algorithm(self) -> type[BaseSearcher]:
"""Specifies the search algorithm to use for this mode (if any)."""
return MCoreMinitronSearcher


@NASModeRegistry.register_mode
@PruneModeRegistry.register_mode
class MCoreGPTMinitronModeDescriptor(MCoreMinitronModeDescriptor):
"""[Deprecated] Class to describe the ``"mcore_gpt_minitron"`` mode.

The properties of this mode can be inspected via the source code.
"""

@property
def name(self) -> str:
"""Returns the value (str representation) of the mode."""
warn(
"`mcore_gpt_minitron` mode is deprecated will be removed in a later release. "
"Please use `mcore_minitron` instead.",
DeprecationWarning,
)
return "mcore_gpt_minitron"