Commit 8e4d09d

Add ModelOpt pruning docs
Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>
1 parent 6938b77 commit 8e4d09d

File tree

2 files changed: 51 additions & 0 deletions


docs/index.md

Lines changed: 1 addition & 0 deletions
@@ -51,6 +51,7 @@ training/peft.md
 training/packed-sequences.md
 training/multi-token-prediction.md
 training/distillation.md
+training/pruning.md
 training/callbacks.md
 ```

docs/training/pruning.md

Lines changed: 50 additions & 0 deletions
@@ -0,0 +1,50 @@
# Pruning

Pruning reduces model size by removing redundant parameters (e.g., shrinking hidden dimensions or the number of layers) while preserving accuracy. In Megatron Bridge, pruning is provided by NVIDIA Model Optimizer (ModelOpt), which implements the Minitron algorithm for GPT and Mamba-based models loaded from HuggingFace.

## Prerequisites

Running the pruning example requires the Megatron-Bridge and Model-Optimizer dependencies. We recommend using the NeMo container (e.g., `nvcr.io/nvidia/nemo:26.02`). To use the latest ModelOpt scripts, either mount your Model-Optimizer repo at `/opt/Megatron-Bridge/3rdparty/Model-Optimizer` or pull the latest changes inside the container (`cd /opt/Megatron-Bridge/3rdparty/Model-Optimizer && git checkout main && git pull`).
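
The mount-based option might look like the sketch below; the local clone path and GPU flags are illustrative assumptions, not part of this guide, and only the mount target and container tag come from the recommendation above:

```bash
# Sketch only: launch the NeMo container with a local Model-Optimizer clone mounted
# at the path the pruning example expects. Adjust the local path and GPU flags as needed.
docker run --gpus all -it --rm \
    -v /path/to/Model-Optimizer:/opt/Megatron-Bridge/3rdparty/Model-Optimizer \
    nvcr.io/nvidia/nemo:26.02 bash
```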

## Usage

### Prune to a target parameter count (NAS)

The example below prunes Qwen3-8B to roughly 6B parameters on 2 GPUs (pipeline parallelism = 2), skipping pruning of `num_attention_heads`. By default, 1024 samples from [nemotron-post-training-dataset-v2](https://huggingface.co/datasets/nvidia/Nemotron-Post-Training-Dataset-v2) are used for calibration; depth (`num_layers`) is pruned by at most 20% and each prunable width hyperparameter (`hidden_size`, `ffn_hidden_size`, ...) by at most 40%; and the top-10 candidates are evaluated on MMLU (with 5% of the data sampled) to select the best model.

```bash
torchrun --nproc_per_node 2 /opt/Megatron-Bridge/3rdparty/Model-Optimizer/examples/megatron_bridge/prune_minitron.py \
    --hf_model_name_or_path Qwen/Qwen3-8B \
    --prune_target_params 6e9 \
    --hparams_to_skip num_attention_heads \
    --output_hf_path /tmp/Qwen3-8B-Pruned-6B
```

### Prune to a specific architecture (manual config)

The example below prunes Qwen3-8B to a fixed architecture. By default, 1024 samples from [nemotron-post-training-dataset-v2](https://huggingface.co/datasets/nvidia/Nemotron-Post-Training-Dataset-v2) are used for calibration.

```bash
torchrun --nproc_per_node 2 /opt/Megatron-Bridge/3rdparty/Model-Optimizer/examples/megatron_bridge/prune_minitron.py \
    --hf_model_name_or_path Qwen/Qwen3-8B \
    --prune_export_config '{"hidden_size": 3584, "ffn_hidden_size": 9216}' \
    --output_hf_path /tmp/Qwen3-8B-Pruned-6B-manual
```

To see the full list of options for advanced configurations, run:

```bash
torchrun --nproc_per_node 1 /opt/Megatron-Bridge/3rdparty/Model-Optimizer/examples/megatron_bridge/prune_minitron.py --help
```

### Uneven pipeline parallelism

If the number of layers is not divisible by the pipeline parallel size (number of GPUs), set `--num_layers_in_first_pipeline_stage` and `--num_layers_in_last_pipeline_stage`. For example, Qwen3-8B has 36 layers; to run on 8 GPUs, set both flags to 3 so the layers are split 3-5-5-5-5-5-5-3 across the GPUs, as in the sketch below.
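
A command along these lines combines the NAS example above with the two pipeline-stage flags; the parameter target and output path are carried over from that example purely for illustration:

```bash
# Sketch: Qwen3-8B (36 layers) pruned on 8 GPUs with uneven pipeline stages (3-5-5-5-5-5-5-3).
# All flags other than the two pipeline-stage options mirror the NAS example above.
torchrun --nproc_per_node 8 /opt/Megatron-Bridge/3rdparty/Model-Optimizer/examples/megatron_bridge/prune_minitron.py \
    --hf_model_name_or_path Qwen/Qwen3-8B \
    --prune_target_params 6e9 \
    --num_layers_in_first_pipeline_stage 3 \
    --num_layers_in_last_pipeline_stage 3 \
    --output_hf_path /tmp/Qwen3-8B-Pruned-6B
```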

## More information

For more details, see the [ModelOpt pruning README](https://github.com/NVIDIA/Model-Optimizer/tree/main/examples/megatron_bridge#readme).

## Next steps: Knowledge Distillation

Knowledge Distillation is required to recover the performance of the pruned model. See the [Knowledge Distillation](distillation.md) guide for more details.
