
Commit 103b1bb

Push latest changes
Signed-off-by: Keval Morabia <[email protected]>
1 parent dba0b37 commit 103b1bb

46 files changed: +555, -1359 lines changed


CHANGELOG.rst

Lines changed: 11 additions & 1 deletion
@@ -1,12 +1,22 @@
 Model Optimizer Changelog (Linux)
 =================================

-0.35 (2025-08-xx)
+0.37 (2025-09-xx)
+^^^^^^^^^^^^^^^^^
+
+**Deprecations**
+
+**Bug Fixes**
+
+**New Features**
+
+0.35 (2025-09-04)
 ^^^^^^^^^^^^^^^^^

 **Deprecations**

 - Deprecate ``torch<2.6`` support.
+- Deprecate NeMo 1.0 model support.

 **Bug Fixes**
README.md

Lines changed: 1 addition & 1 deletion
@@ -20,7 +20,7 @@ The **NVIDIA TensorRT Model Optimizer** (referred to as **Model Optimizer**, or
 **[Input]** Model Optimizer currently supports inputs of a [Hugging Face](https://huggingface.co/), [PyTorch](https://github.com/pytorch/pytorch) or [ONNX](https://github.com/onnx/onnx) model.

 **[Optimize]** Model Optimizer provides Python APIs for users to easily compose the above model optimization techniques and export an optimized quantized checkpoint.
-Model Optimizer is also integrated with [NVIDIA NeMo](https://github.com/NVIDIA/NeMo), [Megatron-LM](https://github.com/NVIDIA/Megatron-LM) and [Hugging Face Accelerate](https://github.com/huggingface/accelerate) for training required inference optimization techniques.
+Model Optimizer is also integrated with [NVIDIA NeMo](https://github.com/NVIDIA-NeMo/NeMo), [Megatron-LM](https://github.com/NVIDIA/Megatron-LM) and [Hugging Face Accelerate](https://github.com/huggingface/accelerate) for training required inference optimization techniques.

 **[Export for deployment]** Seamlessly integrated within the NVIDIA AI software ecosystem, the quantized checkpoint generated from Model Optimizer is ready for deployment in downstream inference frameworks like [SGLang](https://github.com/sgl-project/sglang), [TensorRT-LLM](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/quantization), [TensorRT](https://github.com/NVIDIA/TensorRT), or [vLLM](https://github.com/vllm-project/vllm).
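The `[Optimize]` paragraph in the hunk above describes ModelOpt's Python PTQ API only at a high level. As a hedged sketch (not part of this commit), the typical quantize-then-calibrate flow looks roughly like the snippet below; `FP8_DEFAULT_CFG` and `calib_loader` are assumptions based on the project's published examples and should be checked against the installed version.

```python
# Hedged sketch of the ModelOpt PTQ flow described in the README.
# Assumptions: FP8_DEFAULT_CFG exists in your installed modelopt version and
# `calib_loader` yields batches the model accepts.
import modelopt.torch.quantization as mtq

def forward_loop(model):
    # Calibration: run a few representative batches through the model.
    for batch in calib_loader:
        model(**batch)

# Quantize in place using the chosen format and the calibration loop.
model = mtq.quantize(model, mtq.FP8_DEFAULT_CFG, forward_loop)
```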

docs/source/getting_started/1_overview.rst

Lines changed: 1 addition & 1 deletion
@@ -9,7 +9,7 @@ Minimizing inference costs presents a significant challenge as generative AI mod
 The `NVIDIA TensorRT Model Optimizer <https://github.com/NVIDIA/TensorRT-Model-Optimizer>`_ (referred to as Model Optimizer, or ModelOpt)
 is a library comprising state-of-the-art model optimization techniques including quantization and sparsity to compress model.
 It accepts a torch or ONNX model as input and provides Python APIs for users to easily stack different model optimization
-techniques to produce optimized & quantized checkpoints. Seamlessly integrated within the NVIDIA AI software ecosystem, the quantized checkpoint generated from Model Optimizer is ready for deployment in downstream inference frameworks like `TensorRT-LLM <https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/quantization>`_ or `TensorRT <https://github.com/NVIDIA/TensorRT>`_ (Linux). ModelOpt is integrated with `NVIDIA NeMo <https://github.com/NVIDIA/NeMo>`_ and `Megatron-LM <https://github.com/NVIDIA/Megatron-LM>`_ for training-in-the-loop optimization techniques. For enterprise users, the 8-bit quantization with Stable Diffusion is also available on `NVIDIA NIM <https://developer.nvidia.com/blog/nvidia-nim-offers-optimized-inference-microservices-for-deploying-ai-models-at-scale/>`_.
+techniques to produce optimized & quantized checkpoints. Seamlessly integrated within the NVIDIA AI software ecosystem, the quantized checkpoint generated from Model Optimizer is ready for deployment in downstream inference frameworks like `TensorRT-LLM <https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/quantization>`_ or `TensorRT <https://github.com/NVIDIA/TensorRT>`_ (Linux). ModelOpt is integrated with `NVIDIA NeMo <https://github.com/NVIDIA-NeMo/NeMo>`_ and `Megatron-LM <https://github.com/NVIDIA/Megatron-LM>`_ for training-in-the-loop optimization techniques. For enterprise users, the 8-bit quantization with Stable Diffusion is also available on `NVIDIA NIM <https://developer.nvidia.com/blog/nvidia-nim-offers-optimized-inference-microservices-for-deploying-ai-models-at-scale/>`_.

 For Windows users, the `TensorRT Model Optimizer for Windows <https://github.com/NVIDIA/TensorRT-Model-Optimizer/tree/main/examples/windows/README.md>`_ (ModelOpt-Windows) delivers model compression techniques, including quantization, on Windows RTX PC systems. ModelOpt-Windows is optimized for efficient quantization, featuring local GPU calibration, reduced system and video memory consumption, and swift processing times. It integrates seamlessly with the Windows ecosystem, with optimized ONNX models as output for `Microsoft DirectML <https://github.com/microsoft/DirectML>`_ backends. Furthermore, ModelOpt-Windows supports SDKs such as `Microsoft Olive <https://github.com/microsoft/Olive>`_ and `ONNX Runtime <https://github.com/microsoft/onnxruntime>`_, enabling the deployment of quantized models across various independent hardware vendors through the DirectML path.

docs/source/guides/3_pruning.rst

Lines changed: 1 addition & 1 deletion
@@ -4,7 +4,7 @@ Pruning

 .. tip::

-    Checkout `Llama 3.1 NeMo Minitron Pruning <https://github.com/NVIDIA/NeMo/tree/main/tutorials/llm/llama/pruning-distillation>`_ and
+    Checkout `Llama 3.1 NeMo Minitron Pruning <https://github.com/NVIDIA-NeMo/NeMo/tree/main/tutorials/llm/llama/pruning-distillation>`_ and
     `ResNet20 on CIFAR-10 Notebook <https://github.com/NVIDIA/TensorRT-Model-Optimizer/blob/main/examples/pruning/cifar_resnet.ipynb>`_
     for an end-to-end example of pruning.

docs/source/guides/7_nas.rst

Lines changed: 4 additions & 4 deletions
@@ -361,11 +361,11 @@ can be converted into searchable units:
     # search over the number of layers (depth) in the sequential layer.
     nn.Sequential

-    # We convert Megatron-core / NeMo GPT-style models (e.g. Llama3.1, NeMo Mistral, etc.)
+    # We convert Megatron-core / NeMo GPT or Mamba style models (e.g. Llama3.1, NeMo Mistral, NeMotron-H, etc.)
     # to automatically search over the MLP hidden size, number of attention heads, number of GQA groups,
-    # and depth of the model.
-    megatron.core.transformer.module.MegatronModule
-    nemo.collections.nlp.models.language_modeling.megatron_gpt_model.MegatronGPTModel
+    # number of mamba heads, mamba head dimension, and depth of the model.
+    megatron.core.models.gpt.GPTModel
+    megatron.core.models.mamba.MambaModel
     nemo.collections.llm.gpt.model.base.GPTModel

     # We convert Hugging Face Attention layers to automatically search over the number of heads

examples/deepseek/ptq.py

Lines changed: 5 additions & 5 deletions
@@ -276,7 +276,7 @@ def calibrate_loop(model):
     mtq_cfg["quant_cfg"]["*attn*weight_quantizer"] = {"num_bits": (4, 3), "axis": None}
     mtq_cfg["quant_cfg"]["*attn*input_quantizer"] = {"num_bits": (4, 3), "axis": None}

-    if args.enable_wo_quant and "FP4" in quant_cfg:
+    if not args.disable_wo_quant and "FP4" in quant_cfg:
         mtq_cfg["quant_cfg"]["*wo*weight_quantizer"] = mtq_cfg["quant_cfg"]["*input_quantizer"]
         mtq_cfg["quant_cfg"]["*wo*input_quantizer"] = mtq_cfg["quant_cfg"]["*weight_quantizer"]
     ## ptq
@@ -287,7 +287,7 @@ def calibrate_loop(model):
     return model


-def save_amax_and_quant_config(model, output_path: str, enable_fp8_kvcache: bool):
+def save_amax_and_quant_config(model, output_path: str, enable_fp8_kvcache: bool = True):
     """Saves the amax values of the model to the output path."""
     world_size = int(os.getenv("WORLD_SIZE", "1"))
     rank = int(os.getenv("RANK", "0"))
@@ -353,8 +353,8 @@ def state_dict_filter(state_dict):
     )
     parser.add_argument("--batch_size", type=int, default=8, help="batch size for quantization.")
     parser.add_argument("--calib_size", type=int, default=512, help="samples for calibration.")
-    parser.add_argument("--enable_fp8_kvcache", type=bool, default=True, help="enable fp8 kvcache.")
-    parser.add_argument("--enable_wo_quant", action="store_true", help="enable MLA wo quant.")
+    parser.add_argument("--disable_fp8_kvcache", action="store_true", help="disable fp8 kvcache.")
+    parser.add_argument("--disable_wo_quant", action="store_true", help="disable MLA wo quant.")
     parser.add_argument("--trust_remote_code", action="store_true", help="trust remote code.")

     args = parser.parse_args()
@@ -363,4 +363,4 @@ def state_dict_filter(state_dict):
         args.model_path, trust_remote_code=args.trust_remote_code
     )
     model = ptq(model, tokenizer, args.quant_cfg, args.batch_size, args.calib_size)
-    save_amax_and_quant_config(model, args.output_path, args.enable_fp8_kvcache)
+    save_amax_and_quant_config(model, args.output_path, not args.disable_fp8_kvcache)
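The flag changes above swap `type=bool` arguments for `store_true` flags. The short, self-contained sketch below (not from this repository) illustrates why: argparse passes the raw command-line string through `bool()`, and any non-empty string, including `"False"`, evaluates to `True`.

```python
# Minimal sketch of why the commit replaces `type=bool` flags with `store_true`:
# argparse applies bool() to the raw string, and bool() of any non-empty string
# -- including "False" -- is True, so the old flag could never be turned off.
import argparse

parser = argparse.ArgumentParser()
parser.add_argument("--enable_fp8_kvcache", type=bool, default=True)   # old style: misleading
parser.add_argument("--disable_fp8_kvcache", action="store_true")      # new style: unambiguous

args = parser.parse_args(["--enable_fp8_kvcache", "False"])
print(args.enable_fp8_kvcache)        # True -- "False" is a non-empty string
print(not args.disable_fp8_kvcache)   # True unless --disable_fp8_kvcache is passed
```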

examples/llm_distill/README.md

Lines changed: 1 addition & 1 deletion
@@ -140,7 +140,7 @@ Loss balancers:

 Checkout the stand-alone distillation script in the [NVIDIA NeMo repository](https://docs.nvidia.com/nemo-framework/user-guide/latest/model-optimization/distillation/distillation.html).

-You can also look at the tutorial notebooks [here](https://github.com/NVIDIA/NeMo/tree/main/tutorials/llm/llama/pruning-distillation) which showcase the usage of Minitron pruning followed by distillation for Llama 3.1 8B step-by-step in NeMo framework.
+You can also look at the tutorial notebooks [here](https://github.com/NVIDIA-NeMo/NeMo/tree/main/tutorials/llm/llama/pruning-distillation) which showcase the usage of Minitron pruning followed by distillation for Llama 3.1 8B step-by-step in NeMo framework.

 ## Knowledge Distillation (KD) for HuggingFace Models

examples/llm_eval/gen_model_answer.py

Lines changed: 1 addition & 16 deletions
@@ -119,7 +119,6 @@ def run_eval(
     dtype,
     revision,
     engine_dir,
-    vocab_file,
     nim_model,
     args,
 ):
@@ -152,7 +151,6 @@ def run_eval(
             top_p=top_p,
             temperature=temperature,
             engine_dir=engine_dir,
-            vocab_file=vocab_file,
             nim_model=nim_model,
         )
         for i in range(0, len(questions), chunk_size)
@@ -177,18 +175,11 @@ def get_model_answers(
     top_p=None,
     temperature=None,
     engine_dir=None,
-    vocab_file=None,
     nim_model=None,
 ):
     # Model Optimizer modification
     if engine_dir:
-        if vocab_file:
-            from modelopt.deploy.llm.nemo_utils import get_nemo_tokenizer
-
-            tokenizer = get_nemo_tokenizer(vocab_file)
-        else:
-            model_ckpt_path = model_path
-            tokenizer = get_tokenizer(model_ckpt_path, trust_remote_code=args.trust_remote_code)
+        tokenizer = get_tokenizer(model_path, trust_remote_code=args.trust_remote_code)
     if engine_dir:
         # get model type
         last_part = os.path.basename(engine_dir)
@@ -440,11 +431,6 @@ def reorg_answer_file(answer_file):
         type=str,
         help="The path to the TensorRT LLM engine directory.",
     )
-    parser.add_argument(
-        "--vocab-file",
-        type=str,
-        help="The path to the vocabulary file.",
-    )
     parser.add_argument(
         "--nim-model",
         type=str,
@@ -517,7 +503,6 @@ def reorg_answer_file(answer_file):
         dtype=str_to_torch_dtype(args.dtype),
         revision=args.revision,
         engine_dir=args.engine_dir,
-        vocab_file=args.vocab_file,
         nim_model=args.nim_model,
         args=args,
     )

examples/llm_eval/mmlu.py

Lines changed: 2 additions & 9 deletions
@@ -250,15 +250,8 @@ def main(
     # Model Optimizer modification
     # Enable automatic save/load of modelopt state huggingface checkpointing
     mto.enable_huggingface_checkpointing()
-    if vocab_file := kwargs.get("vocab_file"):
-        from modelopt.deploy.llm.nemo_utils import get_nemo_tokenizer
-
-        tokenizer = get_nemo_tokenizer(vocab_file)
-    else:
-        model_ckpt_path = kwargs["model_path"]
-        tokenizer = get_tokenizer(
-            model_ckpt_path, trust_remote_code=kwargs.get("trust_remote_code", False)
-        )
+    model_path = kwargs["model_path"]
+    tokenizer = get_tokenizer(model_path, trust_remote_code=kwargs.get("trust_remote_code", False))
     if kwargs.get("engine_dir"):
         # get model type
         last_part = os.path.basename(kwargs["engine_dir"])
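Both evaluation scripts now resolve the tokenizer straight from the Hugging Face model path. As a hedged illustration only, a `get_tokenizer`-style helper usually reduces to the standard `AutoTokenizer` call below; the pad-token fallback is an assumption for the sketch, not necessarily what `examples/llm_eval` actually does.

```python
# Hedged sketch: what a get_tokenizer-style helper typically wraps.
# Assumption: the helper resolves to a standard AutoTokenizer call; check the
# actual implementation in examples/llm_eval before relying on details.
from transformers import AutoTokenizer

def load_tokenizer(model_path: str, trust_remote_code: bool = False):
    tokenizer = AutoTokenizer.from_pretrained(
        model_path, trust_remote_code=trust_remote_code
    )
    # Many causal-LM tokenizers ship without a pad token; fall back to EOS.
    if tokenizer.pad_token is None:
        tokenizer.pad_token = tokenizer.eos_token
    return tokenizer
```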

examples/llm_ptq/README.md

Lines changed: 14 additions & 51 deletions
@@ -105,45 +105,26 @@ Please reference our [framework scripts](#framework-scripts) and our [docs](http

 ## Support Matrix

-### Supported Models
+### Hugging Face Supported Models

 | Model | fp8 | int8_sq | int4_awq | w4a8_awq<sup>1</sup> | nvfp4<sup>5</sup> |
 | :---: | :---: | :---: | :---: | :---: | :---: |
-| GPTJ ||||| - |
-| LLAMA 2 ||||| - |
-| LLAMA 3, 3.1, 3.3 |||| ✅<sup>3</sup> ||
+| LLAMA 3.x |||| ✅<sup>3</sup> ||
 | LLAMA 4 <sup>6</sup> ||||||
-| LLAMA 2 (Nemo) ||||| - |
-| CodeLlama ||||| - |
-| Mistral ||||||
-| Mixtral 8x7B, 8x22B ||| ✅<sup>2</sup> |||
-| Snowflake Arctic<sup>2</sup> ||||| - |
-| Falcon 40B, 180B ||||| - |
-| Falcon 7B ||||| - |
-| MPT 7B, 30B ||||| - |
-| Baichuan 1, 2 ||||| - |
-| ChatGLM2, 3 6B ||||| - |
-| Bloom ||||| - |
-| Phi-1,2,3,4 |||| ✅<sup>3</sup> | - |
+| Mixtral ||| ✅<sup>2</sup> |||
+| Phi-3,4 |||| ✅<sup>3</sup> | - |
 | Phi-3.5 MOE ||||| - |
 | Llama-Nemotron Super ||||||
 | Llama-Nemotron Ultra ||||||
-| Nemotron 8B ||||| - |
-| Gemma 2B, 7B ||||| - |
-| Gemma 3 1B | ✅<sup>2</sup> |||| - |
-| RecurrentGemma 2B ||||| - |
-| StarCoder 2 ||||| - |
+| Gemma 3 | ✅<sup>2</sup> | - || - | - |
 | QWen 2, 2.5 <sup>4</sup> ||||||
-| QWen MOE || - | - | - ||
 | QWen3 MOE <sup>6</sup> || - | - | - ||
 | QwQ || - | - | - ||
-| DBRX ||||| - |
-| InternLM2 |||| ✅<sup>3</sup> | - |
-| Exaone ||||| - |
-| Minitron |||| ✅<sup>2</sup> ||
 | T5 ||||| - |
 | Whisper ||||| - |

+> *This is a subset of the models supported. For the full list please check the [TensorRT-LLM support matrix](https://nvidia.github.io/TensorRT-LLM/reference/precision.html#support-matrix)*
+
 > *<sup>1.</sup>The w4a8_awq is an experimental quantization scheme that may result in a higher accuracy penalty.* \
 > *<sup>2.</sup>For some models, there is only support for exporting quantized checkpoints.* \
 > *<sup>3.</sup>W4A8_AWQ is only available on some models but not all* \
@@ -155,6 +136,10 @@ Please reference our [framework scripts](#framework-scripts) and our [docs](http

 > You can also create your own custom config using [this](https://nvidia.github.io/TensorRT-Model-Optimizer/guides/_pytorch_quantization.html#custom-calibration-algorithm) guide.

+### NeMo Supported Models
+
+Please refer to the [NeMo 2.0 PTQ documentation](https://docs.nvidia.com/nemo-framework/user-guide/latest/model-optimization/quantization/quantization.html#support-matrix) for supported models.
+
 ## AutoQuantize

 [AutoQuantize (`mtq.auto_quantize`)](https://nvidia.github.io/TensorRT-Model-Optimizer/reference/generated/modelopt.torch.quantization.model_quant.html#modelopt.torch.quantization.model_quant.auto_quantize) is a PTQ algorithm which quantizes a model by searching for the best quantization format per-layer while meeting performance constraints specified by the user. `AutoQuantize` streamlines the trade-off of model accuracy and performance.
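As a rough, hedged sketch of the API referenced in the hunk above (argument names follow the linked reference docs but may differ between ModelOpt releases, and `calib_loader` is a placeholder), an `auto_quantize` call looks approximately like this:

```python
# Hedged sketch of mtq.auto_quantize; verify argument names against the
# modelopt version you have installed before use.
import modelopt.torch.quantization as mtq

model, search_state = mtq.auto_quantize(
    model,
    constraints={"effective_bits": 4.8},            # target average bits per weight
    quantization_formats=["FP8_DEFAULT_CFG", "NVFP4_DEFAULT_CFG"],
    data_loader=calib_loader,                        # placeholder calibration loader
    forward_step=lambda model, batch: model(**batch),
    loss_func=lambda output, batch: output.loss,     # assumes batches include labels
)
```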
@@ -224,18 +209,6 @@ The example scripts above also have an additional flag `--tasks`, where the actu

 > *NOTE: AutoQuantize requires backpropagation of the model. Models without backpropagation support (e.g., Llama-4) will not work with AutoQuantize.*

-### AutoQuantize for NeMo models
-
-The usage is similar for NeMo models to perform `AutoQuantize`. Please refer to the [NeMo Example Script](#nemo-example-script) section for the full setup instructions.
-
-[Script](./scripts/nemo_example.sh)
-
-```bash
-# --auto_quantize_bits specifies the constraint for `AutoQuantize`
-# --quant specifies the formats to be searched for `AutoQuantize`. Multiple formats can be searched over by passing them as comma separated values
-scripts/nemo_example.sh --type gpt --model $GPT_MODEL_FILE --quant fp8,int4_awq --auto_quantize_bits 6.4 --tp [1|2|4|8]
-```
-
 ## Real Quant

 When working with large language models, memory constraints can be a significant challenge. ModelOpt provides a workflow for initializing HF models with compressed weights across multiple GPUs to dramatically reduce memory usage. Check `--low_memory_mode` option in hf_ptq.py for more details.
@@ -280,27 +253,17 @@ scripts/huggingface_example.sh --model $HF_PATH --quant [fp8|nvfp4|int8_sq|int4_

 > *You can now add `--low_memory_mode` to the command when setting `--export_fmt=hf` to lower the memory requirements of the PTQ process. With this mode, the script will compress model weights to low precision before calibration. This mode is only supported for FP8 and NVFP4 with max calibration.*

-#### Llama 4
-
-We support FP8 and NVFP4 quantized Llama 4 model Hugging Face checkpoint export using the following command:
-
-```bash
-python hf_ptq.py --pyt_ckpt_path=<llama4 model path> --export_path=<quantized hf checkpoint> --qformat=[fp8|nvfp4] --export_fmt=hf
-```
-
-The quantized checkpoint can be deployed following the TensorRT-LLM instructions. Note since we only quantize the language model in Llama 4, the exported config has `Llama4ForCausalLM`, but TensorRT-LLM expects `Llama4ForConditionalGeneration` which is from the original Llama 4. Therefore our script will copy over the original config files to the exported checkpoint folder.
-
 #### Deepseek R1

 [PTQ for DeepSeek](../deepseek/README.md) shows how to quantize the DeepSeek model with FP4 and export to TensorRT-LLM.

-### NeMo Example [Script](./scripts/nemo_example.sh)
+### NeMo Example Script

-Please refer to the [NeMo PTQ documentation](https://docs.nvidia.com/nemo-framework/user-guide/latest/model-optimization/quantization/quantization.html) for more details.
+NeMo 2.0 framework PTQ and TensorRT-LLM deployment examples are maintained in the NeMo GitHub repo. Please refer to the [NeMo PTQ documentation](https://docs.nvidia.com/nemo-framework/user-guide/latest/model-optimization/quantization/quantization.html) for more details.

 ### Megatron-LM Example Script

-Megatron-LM framework PTQ and TensorRT-LLM deployment examples are maintained in the Megatron-LM GitHub repo. Please refer to the examples [here](https://github.com/NVIDIA/Megatron-LM/tree/main/examples/export).
+Megatron-LM framework PTQ and TensorRT-LLM deployment examples are maintained in the Megatron-LM GitHub repo. Please refer to the examples [here](https://github.com/NVIDIA/Megatron-LM/tree/main/examples/post_training/modelopt).

 ## Evaluate Accuracy
