Commit c60baae

Add Megatron-LM pruning example link (#344)
Signed-off-by: Keval Morabia <[email protected]>
1 parent 5a3fd29 commit c60baae

File tree

4 files changed: +46 -45 lines changed

- CHANGELOG.rst
- examples/megatron-lm/README.md
- examples/pruning/README.md
- modelopt/torch/prune/plugins/mcore_minitron.py

CHANGELOG.rst

Lines changed: 1 addition & 0 deletions
@@ -16,6 +16,7 @@ Model Optimizer Changelog (Linux)
 - ``high_precision_dtype`` default to fp16 in ONNX quantization, i.e. quantized output model weights are now FP16 by default.
 - Upgrade TensorRT-LLM dependency to 1.1.0rc2.
 - Support Phi-4-multimodal and Qwen2.5-VL quantized HF checkpoint export in ``examples/vlm_ptq``.
+- Add Minitron pruning example for Megatron-LM framework. See ``examples/megatron-lm`` for more details.

 0.35 (2025-09-04)
 ^^^^^^^^^^^^^^^^^

examples/megatron-lm/README.md

Lines changed: 36 additions & 15 deletions
@@ -17,13 +17,13 @@

 ## Support Matrix: {Model}x{Features}

-| Model | Quantization | EAGLE3 | Q-LoRA | Distillation |
-| ------------------------------------------------------ | -----------| ------ | ----- | ---- |
-| `moonshotai/Kimi-K2-Instruct` || **Online** | | |
-| `Qwen/Qwen3-{30B-A3B, 235B-A22B}` | **WAR** | **Online** | | |
-| `Qwen/Qwen3-{0.6B, 8B}` || **Online** | | |
-| `deepseek-ai/DeepSeek-R1` || **Online** | | |
-| `meta-llama/Llama-{3.1-8B, 3.1-405B, 3.2-1B}-Instruct` || **Online** | | |
+| Model | Quantization | EAGLE3 | Q-LoRA | Pruning (PP only) | Distillation |
+| :---: | :---: | :---: | :---: | :---: | :---: |
+| `moonshotai/Kimi-K2-Instruct` || **Online** | | | |
+| `Qwen/Qwen3-{30B-A3B, 235B-A22B}` | **WAR** | **Online** | | | |
+| `Qwen/Qwen3-{0.6B, 8B}` || **Online** | | ||
+| `deepseek-ai/DeepSeek-R1` || **Online** | | | |
+| `meta-llama/Llama-{3.1-8B, 3.1-405B, 3.2-1B}-Instruct` || **Online** | | ||

 ## Getting Started in a Local Environment

@@ -50,20 +50,20 @@ USER_FSW=<path_to_scratch_space> bash interactive.sh

 <br>

-## ⭐ FP8 Post-Training Quantization (PTQ)
+### ⭐ FP8 Post-Training Quantization (PTQ)

 Provide the pretrained checkpoint path through variable `${HF_MODEL_CKPT}`:

 ```sh
 \
 TP=1 \
-HF_MODEL_CKPT=<pretrained_checkpoint_path> \
+HF_MODEL_CKPT=<pretrained_model_name_or_path> \
 MLM_MODEL_SAVE=/tmp/Llama-3.2-1B-Instruct-FP8 \
 bash megatron-lm/examples/post_training/modelopt/quantize.sh meta-llama/Llama-3.2-1B-Instruct fp8

 \
 PP=1 \
-HF_MODEL_CKPT=<pretrained_checkpoint_path> \
+HF_MODEL_CKPT=<pretrained_model_name_or_path> \
 MLM_MODEL_LOAD=/tmp/Llama-3.2-1B-Instruct-FP8 \
 EXPORT_DIR=/tmp/Llama-3.2-1B-Instruct-Export \
 bash megatron-lm/examples/post_training/modelopt/export.sh meta-llama/Llama-3.2-1B-Instruct
@@ -76,21 +76,21 @@ deployment (`/tmp/Llama-3.2-1B-Instruct-Export`).

 <br>

-## ⭐ Online BF16 EAGLE3 Training
+### ⭐ Online BF16 EAGLE3 Training

 Online EAGLE3 training keeps both the target (frozen) and draft models in memory, where the `hidden_states`
 required for training are generated on the fly.

 ```sh
 \
 TP=1 \
-HF_MODEL_CKPT=<pretrained_checkpoint_path> \
+HF_MODEL_CKPT=<pretrained_model_name_or_path> \
 MLM_MODEL_SAVE=/tmp/Llama-3.2-1B-Eagle3 \
 bash megatron-lm/examples/post_training/modelopt/eagle3.sh meta-llama/Llama-3.2-1B-Instruct

 \
 PP=1 \
-HF_MODEL_CKPT=<pretrained_checkpoint_path> \
+HF_MODEL_CKPT=<pretrained_model_name_or_path> \
 MLM_MODEL_LOAD=/tmp/Llama-3.2-1B-Eagle3 \
 EXPORT_DIR=/tmp/Llama-3.2-1B-Eagle3-Export \
 bash megatron-lm/examples/post_training/modelopt/export.sh meta-llama/Llama-3.2-1B-Instruct
@@ -104,10 +104,31 @@ See [ADVANCED.md](ADVANCED.md) for a multi-gpu multi-node training example for `

 <br>

-## ⭐ Offline BF16 EAGLE3 Training
+### ⭐ Offline BF16 EAGLE3 Training

 Coming soon ...

+### ⭐ Pruning
+
+Pruning is supported for GPT and Mamba models in Pipeline Parallel mode. Available pruning options are:
+
+- `TARGET_FFN_HIDDEN_SIZE`
+- `TARGET_HIDDEN_SIZE`
+- `TARGET_NUM_ATTENTION_HEADS`
+- `TARGET_NUM_QUERY_GROUPS`
+- `TARGET_MAMBA_NUM_HEADS`
+- `TARGET_MAMBA_HEAD_DIM`
+- `TARGET_NUM_LAYERS`
+- `LAYERS_TO_DROP` (comma-separated, 1-indexed list of layer numbers to directly drop)
+
+```sh
+PP=1 \
+TARGET_NUM_LAYERS=24 \
+HF_MODEL_CKPT=<pretrained_model_name_or_path> \
+MLM_MODEL_SAVE=/tmp/Qwen3-8B-DPruned \
+bash megatron-lm/examples/post_training/modelopt/prune.sh qwen/Qwen3-8B
+```
+
 ## Learn More About Configuration

 For simplicity, we use `shell` scripts and variables as arguments. Each script has at least 1 positional
@@ -116,7 +137,7 @@ quantization.

 ```sh
 \
-HF_MODEL_CKPT=[pretrained_checkpoint] \
+HF_MODEL_CKPT=<pretrained_model_name_or_path> \
 bash megatron-lm/examples/post_training/modelopt/quantize.sh [pretrained_model_card] [qformat]
 ```

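The variables listed in the new Pruning section can also be combined for width pruning through the same `prune.sh` entry point. A minimal sketch, assuming the same environment as the depth-pruning example above; the target sizes and save path below are illustrative placeholders rather than values taken from this commit:

```sh
# Width-pruning sketch (illustrative values): shrink the FFN and hidden dimensions
# of Qwen3-8B instead of dropping layers. Adjust the TARGET_* values for your model.
PP=1 \
TARGET_FFN_HIDDEN_SIZE=9216 \
TARGET_HIDDEN_SIZE=3584 \
HF_MODEL_CKPT=<pretrained_model_name_or_path> \
MLM_MODEL_SAVE=/tmp/Qwen3-8B-WPruned \
bash megatron-lm/examples/post_training/modelopt/prune.sh qwen/Qwen3-8B
```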

examples/pruning/README.md

Lines changed: 9 additions & 9 deletions
@@ -91,30 +91,30 @@ mtp.prune(

 ## Examples

-### Minitron Pruning for NVIDIA NeMo / Megatron-LM LLMs (e.g. Llama 3)
+### Minitron Pruning for Megatron-LM / NeMo Framework LLMs (e.g. Llama 3.1, Nemotron Nano)

-Checkout the Minitron pruning example in the [NVIDIA NeMo repository](https://docs.nvidia.com/nemo-framework/user-guide/latest/model-optimization/pruning/pruning.html) which showcases the usage of the powerful Minitron pruning algorithm developed by NVIDIA Research for pruning LLMs like Llama 3.1 8B, Qwen 3 8B, Mistral NeMo 12B, etc.
+Checkout the Minitron pruning example for the [Megatron-LM Framework](../megatron-lm/README.md#-pruning) and the [NeMo Framework](https://docs.nvidia.com/nemo-framework/user-guide/latest/model-optimization/pruning/pruning.html), which showcases the usage of the powerful Minitron pruning algorithm developed by NVIDIA Research for pruning LLMs like Llama 3.1 8B, Qwen 3 8B, Nemotron Nano 12B v2, etc.

-You can also look at the tutorial notebooks [here](https://github.com/NVIDIA-NeMo/NeMo/tree/main/tutorials/llm/llama/pruning-distillation) which showcase the usage of Minitron pruning followed by distillation for Llama 3.1 8B step-by-step in NeMo framework. Hugging Face models can also be converted to NeMo format and used subsequently as shown in the tutorial.
+You can also look at the NeMo tutorial notebooks [here](https://github.com/NVIDIA-NeMo/NeMo/tree/main/tutorials/llm/llama/pruning-distillation) which showcase the usage of Minitron pruning followed by distillation for Llama 3.1 8B step-by-step in the NeMo framework. Hugging Face models can also be converted to NeMo format and used subsequently as shown in the tutorial.

 Some of the models pruned using the Minitron method followed by distillation and post-training are:

 - [Minitron Collection on Hugging Face](https://huggingface.co/collections/nvidia/minitron-669ac727dc9c86e6ab7f0f3e)
 - [NVIDIA-Nemotron-Nano-9B-v2](https://huggingface.co/nvidia/NVIDIA-Nemotron-Nano-9B-v2)

-### GradNAS Pruning for HuggingFace Language Models (e.g. BERT)
-
-Checkout the BERT pruning example in the [chained_optimizations](../chained_optimizations/README.md) directory,
-which showcases the usage of GradNAS for pruning a BERT model for Question Answering followed by fine-tuning
-with distillation and quantization. The example also demonstrates how to save and restore pruned models.
-
 ### FastNAS Pruning for PyTorch Computer Vision Models

 Checkout the FastNAS pruning interactive notebook [cifar_resnet](./cifar_resnet.ipynb) in this directory,
 which showcases the usage of FastNAS for pruning a ResNet 20 model for the CIFAR-10 dataset. The notebook
 also shows how to profile the model to understand the search space of possible pruning options and demonstrates
 saving and restoring pruned models.

+### GradNAS Pruning for HuggingFace Language Models (e.g. BERT)
+
+Checkout the BERT pruning example in the [chained_optimizations](../chained_optimizations/README.md) directory,
+which showcases the usage of GradNAS for pruning a BERT model for Question Answering followed by fine-tuning
+with distillation and quantization. The example also demonstrates how to save and restore pruned models.
+
 ## Resources

 - 📅 [Roadmap](https://github.com/NVIDIA/TensorRT-Model-Optimizer/issues/146)

modelopt/torch/prune/plugins/mcore_minitron.py

Lines changed: 0 additions & 21 deletions
@@ -24,8 +24,6 @@
 Actual dynamic module implementations are at :mod:`modelopt.torch.nas.plugins.megatron`.
 """

-from warnings import warn
-
 import torch
 from pydantic import create_model

@@ -209,22 +207,3 @@ def config_class(self) -> type[ModeloptBaseConfig]:
     def search_algorithm(self) -> type[BaseSearcher]:
         """Specifies the search algorithm to use for this mode (if any)."""
         return MCoreMinitronSearcher
-
-
-@NASModeRegistry.register_mode
-@PruneModeRegistry.register_mode
-class MCoreGPTMinitronModeDescriptor(MCoreMinitronModeDescriptor):
-    """[Deprecated] Class to describe the ``"mcore_gpt_minitron"`` mode.
-
-    The properties of this mode can be inspected via the source code.
-    """
-
-    @property
-    def name(self) -> str:
-        """Returns the value (str representation) of the mode."""
-        warn(
-            "`mcore_gpt_minitron` mode is deprecated will be removed in a later release. "
-            "Please use `mcore_minitron` instead.",
-            DeprecationWarning,
-        )
-        return "mcore_gpt_minitron"
