diff --git a/CHANGELOG.rst b/CHANGELOG.rst
index 8dc315c46..1ba449864 100755
--- a/CHANGELOG.rst
+++ b/CHANGELOG.rst
@@ -16,6 +16,7 @@ Model Optimizer Changelog (Linux)
 - ``high_precision_dtype`` default to fp16 in ONNX quantization, i.e. quantized output model weights are now FP16 by default.
 - Upgrade TensorRT-LLM dependency to 1.1.0rc2.
 - Support Phi-4-multimodal and Qwen2.5-VL quantized HF checkpoint export in ``examples/vlm_ptq``.
+- Add a Minitron pruning example for the Megatron-LM framework. See ``examples/megatron-lm`` for more details.
 
 0.35 (2025-09-04)
 ^^^^^^^^^^^^^^^^^
diff --git a/examples/megatron-lm/README.md b/examples/megatron-lm/README.md
index 19d291e31..d706508fa 100644
--- a/examples/megatron-lm/README.md
+++ b/examples/megatron-lm/README.md
@@ -17,13 +17,13 @@
 
 ## Support Matrix: {Model}x{Features}
 
-| Model | Quantization | EAGLE3 | Q-LoRA | Distillation |
-| ------------------------------------------------------ | -----------| ------ | ----- | ---- |
-| `moonshotai/Kimi-K2-Instruct` | ✅ | **Online** | | |
-| `Qwen/Qwen3-{30B-A3B, 235B-A22B}` | **WAR** | **Online** | | |
-| `Qwen/Qwen3-{0.6B, 8B}` | ✅ | **Online** | | |
-| `deepseek-ai/DeepSeek-R1` | ✅ | **Online** | | |
-| `meta-llama/Llama-{3.1-8B, 3.1-405B, 3.2-1B}-Instruct` | ✅ | **Online** | | |
+| Model | Quantization | EAGLE3 | Q-LoRA | Pruning (PP only) | Distillation |
+| :---: | :---: | :---: | :---: | :---: | :---: |
+| `moonshotai/Kimi-K2-Instruct` | ✅ | **Online** | | | |
+| `Qwen/Qwen3-{30B-A3B, 235B-A22B}` | **WAR** | **Online** | | | |
+| `Qwen/Qwen3-{0.6B, 8B}` | ✅ | **Online** | | ✅ | ✅ |
+| `deepseek-ai/DeepSeek-R1` | ✅ | **Online** | | | |
+| `meta-llama/Llama-{3.1-8B, 3.1-405B, 3.2-1B}-Instruct` | ✅ | **Online** | | ✅ | ✅ |
 
 ## Getting Started in a Local Environment
 
@@ -50,20 +50,20 @@ USER_FSW= bash interactive.sh
-## ⭐ FP8 Post-Training Quantization (PTQ)
+### ⭐ FP8 Post-Training Quantization (PTQ)
 
 Provide the pretrained checkpoint path through variable `${HF_MODEL_CKPT}`:
 
 ```sh
 \
     TP=1 \
-    HF_MODEL_CKPT= \
+    HF_MODEL_CKPT= \
     MLM_MODEL_SAVE=/tmp/Llama-3.2-1B-Instruct-FP8 \
     bash megatron-lm/examples/post_training/modelopt/quantize.sh meta-llama/Llama-3.2-1B-Instruct fp8
 
 \
     PP=1 \
-    HF_MODEL_CKPT= \
+    HF_MODEL_CKPT= \
     MLM_MODEL_LOAD=/tmp/Llama-3.2-1B-Instruct-FP8 \
     EXPORT_DIR=/tmp/Llama-3.2-1B-Instruct-Export \
     bash megatron-lm/examples/post_training/modelopt/export.sh meta-llama/Llama-3.2-1B-Instruct
 ```
 
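For reference, the `quantize.sh` step above maps to ModelOpt's Python quantization API. Below is a minimal sketch of the same FP8 PTQ flow on a plain Hugging Face model; this is illustrative only (the script itself builds the model in Megatron-LM with the requested TP/PP layout, and `calib_texts` is a stand-in for a real calibration set):

```python
# Minimal FP8 PTQ sketch with ModelOpt's Python API (illustrative; quantize.sh
# performs the equivalent steps on a Megatron-LM model with TP/PP parallelism).
import torch
import modelopt.torch.quantization as mtq
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-3.2-1B-Instruct"
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16).cuda()
tokenizer = AutoTokenizer.from_pretrained(model_id)

calib_texts = ["ModelOpt FP8 calibration sample."]  # placeholder calibration data

def forward_loop(m):
    # Run a few forward passes so ModelOpt can collect activation statistics.
    with torch.no_grad():
        for text in calib_texts:
            inputs = tokenizer(text, return_tensors="pt").to(m.device)
            m(**inputs)

# FP8_DEFAULT_CFG enables FP8 quantization of weights and activations.
model = mtq.quantize(model, mtq.FP8_DEFAULT_CFG, forward_loop)
```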
@@ -76,7 +76,7 @@
 deployment (`/tmp/Llama-3.2-1B-Instruct-Export`).
 
-## ⭐ Online BF16 EAGLE3 Training
+### ⭐ Online BF16 EAGLE3 Training
 
 Online EAGLE3 training has both the target (frozen) and draft models in the memory where the `hidden_states`
 required for training is generated on the fly.
@@ -84,13 +84,13 @@ required for training is generated on the fly.
 ```sh
 \
     TP=1 \
-    HF_MODEL_CKPT= \
+    HF_MODEL_CKPT= \
     MLM_MODEL_SAVE=/tmp/Llama-3.2-1B-Eagle3 \
     bash megatron-lm/examples/post_training/modelopt/eagle3.sh meta-llama/Llama-3.2-1B-Instruct
 
 \
     PP=1 \
-    HF_MODEL_CKPT= \
+    HF_MODEL_CKPT= \
     MLM_MODEL_LOAD=/tmp/Llama-3.2-1B-Eagle3 \
     EXPORT_DIR=/tmp/Llama-3.2-1B-Eagle3-Export \
     bash megatron-lm/examples/post_training/modelopt/export.sh meta-llama/Llama-3.2-1B-Instruct
@@ -104,10 +104,31 @@ See [ADVANCED.md](ADVANCED.md) for a multi-gpu multi-node training example for `
-## ⭐ Offline BF16 EAGLE3 Training
+### ⭐ Offline BF16 EAGLE3 Training
 
 Coming soon ...
 
+### ⭐ Pruning
+
+Pruning is supported for GPT and Mamba models in Pipeline Parallel (PP) mode. Available pruning options are:
+
+- `TARGET_FFN_HIDDEN_SIZE`
+- `TARGET_HIDDEN_SIZE`
+- `TARGET_NUM_ATTENTION_HEADS`
+- `TARGET_NUM_QUERY_GROUPS`
+- `TARGET_MAMBA_NUM_HEADS`
+- `TARGET_MAMBA_HEAD_DIM`
+- `TARGET_NUM_LAYERS`
+- `LAYERS_TO_DROP` (comma-separated, 1-indexed list of layer numbers to drop directly)
+
+```sh
+PP=1 \
+TARGET_NUM_LAYERS=24 \
+HF_MODEL_CKPT= \
+MLM_MODEL_SAVE=/tmp/Qwen3-8B-DPruned \
+bash megatron-lm/examples/post_training/modelopt/prune.sh Qwen/Qwen3-8B
+```
+
 ## Learn More About Configuration
 
 For simplicity, we use `shell` scripts and variables as arguments. Each script has at least 1 positional
@@ -116,7 +137,7 @@ quantization.
 
 ```sh
 \
-    HF_MODEL_CKPT=[pretrained_checkpoint] \
+    HF_MODEL_CKPT= \
     bash megatron-lm/examples/post_training/modelopt/quantize.sh [pretrained_model_card] [qformat]
 ```
diff --git a/examples/pruning/README.md b/examples/pruning/README.md
index 6d0123e3d..9feb53a83 100644
--- a/examples/pruning/README.md
+++ b/examples/pruning/README.md
@@ -91,23 +91,17 @@ mtp.prune(
 
 ## Examples
 
-### Minitron Pruning for NVIDIA NeMo / Megatron-LM LLMs (e.g. Llama 3)
+### Minitron Pruning for Megatron-LM / NeMo Framework LLMs (e.g. Llama 3.1, Nemotron Nano)
 
-Checkout the Minitron pruning example in the [NVIDIA NeMo repository](https://docs.nvidia.com/nemo-framework/user-guide/latest/model-optimization/pruning/pruning.html) which showcases the usage of the powerful Minitron pruning algorithm developed by NVIDIA Research for pruning LLMs like Llama 3.1 8B, Qwen 3 8B, Mistral NeMo 12B, etc.
+Check out the Minitron pruning examples for the [Megatron-LM Framework](../megatron-lm/README.md#-pruning) and the [NeMo Framework](https://docs.nvidia.com/nemo-framework/user-guide/latest/model-optimization/pruning/pruning.html), which showcase the usage of the powerful Minitron pruning algorithm developed by NVIDIA Research for pruning LLMs like Llama 3.1 8B, Qwen 3 8B, Nemotron Nano 12B v2, etc.
 
-You can also look at the tutorial notebooks [here](https://github.com/NVIDIA-NeMo/NeMo/tree/main/tutorials/llm/llama/pruning-distillation) which showcase the usage of Minitron pruning followed by distillation for Llama 3.1 8B step-by-step in NeMo framework. Hugging Face models can also be converted to NeMo format and used subsequently as shown in the tutorial.
+You can also look at the NeMo tutorial notebooks [here](https://github.com/NVIDIA-NeMo/NeMo/tree/main/tutorials/llm/llama/pruning-distillation), which showcase the usage of Minitron pruning followed by distillation for Llama 3.1 8B step by step in the NeMo framework. Hugging Face models can also be converted to NeMo format and used subsequently as shown in the tutorial.
 
 Some of the models pruned using Minitron method followed by distillation and post-training are:
 
 - [Minitron Collection on Hugging Face](https://huggingface.co/collections/nvidia/minitron-669ac727dc9c86e6ab7f0f3e)
 - [NVIDIA-Nemotron-Nano-9B-v2](https://huggingface.co/nvidia/NVIDIA-Nemotron-Nano-9B-v2)
 
-### GradNAS Pruning for HuggingFace Language Models (e.g. BERT)
-
-Checkout the BERT pruning example in [chained_optimizations](../chained_optimizations/README.md) directory
-which showcases the usage of GradNAS for pruning BERT model for Question Answering followed by fine-tuning
-with distillation and quantization. The example also demonstrates how to save and restore pruned models.
-
 ### FastNAS Pruning for PyTorch Computer Vision Models
 
 Checkout the FastNAS pruning interactive notebook [cifar_resnet](./cifar_resnet.ipynb) in this directory
 which showcases the usage of FastNAS for pruning a ResNet 20 model for the CIFAR
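Relating the new `TARGET_*` shell variables to the library: `prune.sh` ultimately drives ModelOpt's Minitron pruning mode. The sketch below is illustrative only; the Megatron-Core `model` and the calibration `forward_loop` are assumed to already exist (the script handles both), and the `export_config` key names are inferred from the shell variables listed above:

```python
# Illustrative Minitron pruning sketch with ModelOpt's Python API. Building the
# Megatron-Core model and the calibration data loop is handled by prune.sh in practice.
import modelopt.torch.prune as mtp

def forward_loop(model):
    # Run a few calibration batches here so layer/channel importance can be estimated.
    ...

export_config = {
    "num_layers": 24,  # mirrors TARGET_NUM_LAYERS=24 from the shell example
    # Width targets ("ffn_hidden_size", "hidden_size", "num_attention_heads",
    # "num_query_groups", ...) can be constrained the same way.
}

model = mtp.prune(
    model,  # assumed: a Megatron-Core GPT or Mamba model
    mode="mcore_minitron",
    constraints={"export_config": export_config},
    dummy_input=None,  # not used by this mode
    config={"forward_loop": forward_loop},
)
```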
@@ -115,6 +109,12 @@ also how to profiling the model to understand the search space of possible pruning options and demonstrates
 the usage saving and restoring pruned models.
 
+### GradNAS Pruning for HuggingFace Language Models (e.g. BERT)
+
+Check out the BERT pruning example in the [chained_optimizations](../chained_optimizations/README.md) directory
+which showcases the usage of GradNAS for pruning a BERT model for Question Answering followed by fine-tuning
+with distillation and quantization. The example also demonstrates how to save and restore pruned models.
+
 ## Resources
 
 - 📅 [Roadmap](https://github.com/NVIDIA/TensorRT-Model-Optimizer/issues/146)
diff --git a/modelopt/torch/prune/plugins/mcore_minitron.py b/modelopt/torch/prune/plugins/mcore_minitron.py
index 2fd4b439a..f59754350 100644
--- a/modelopt/torch/prune/plugins/mcore_minitron.py
+++ b/modelopt/torch/prune/plugins/mcore_minitron.py
@@ -24,8 +24,6 @@
 Actual dynamic module implementations are at :mod:`modelopt.torch.nas.plugins.megatron`.
 """
 
-from warnings import warn
-
 import torch
 from pydantic import create_model
 
@@ -209,22 +207,3 @@ def config_class(self) -> type[ModeloptBaseConfig]:
     def search_algorithm(self) -> type[BaseSearcher]:
         """Specifies the search algorithm to use for this mode (if any)."""
         return MCoreMinitronSearcher
-
-
-@NASModeRegistry.register_mode
-@PruneModeRegistry.register_mode
-class MCoreGPTMinitronModeDescriptor(MCoreMinitronModeDescriptor):
-    """[Deprecated] Class to describe the ``"mcore_gpt_minitron"`` mode.
-
-    The properties of this mode can be inspected via the source code.
-    """
-
-    @property
-    def name(self) -> str:
-        """Returns the value (str representation) of the mode."""
-        warn(
-            "`mcore_gpt_minitron` mode is deprecated will be removed in a later release. "
-            "Please use `mcore_minitron` instead.",
-            DeprecationWarning,
-        )
-        return "mcore_gpt_minitron"
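Note on the last hunk: with the deprecated `mcore_gpt_minitron` alias removed, only `mcore_minitron` remains registered, so any code still passing the old mode name must switch to the new one. A minimal before/after sketch (the surrounding arguments are abbreviated and assumed to be defined elsewhere):

```python
import modelopt.torch.prune as mtp

# Before: the deprecated alias, which this change removes from the NAS/prune mode registries.
# model = mtp.prune(model, mode="mcore_gpt_minitron", constraints=constraints, config=config)

# After: the same call, using the remaining registered mode name.
model = mtp.prune(model, mode="mcore_minitron", constraints=constraints, config=config)
```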