
Commit 4c611e4

Update files on GitHub
Signed-off-by: Keval Morabia <[email protected]>
1 parent b40f478 commit 4c611e4

139 files changed (+8482 −3164 lines)


.pre-commit-config.yaml

Lines changed: 1 addition & 0 deletions
@@ -96,6 +96,7 @@ repos:
 modelopt/torch/quantization/plugins/attention.py|
 modelopt/torch/speculative/eagle/utils.py|
 modelopt/torch/speculative/plugins/transformers.py|
+modelopt/torch/utils/plugins/megatron_mmlu.py|
 examples/chained_optimizations/bert_prune_distill_quantize.py|
 examples/deepseek/quantize_to_nvfp4.py|
 examples/deepseek/ptq.py|

CHANGELOG.rst

Lines changed: 14 additions & 1 deletion
@@ -1,6 +1,19 @@
 Model Optimizer Changelog (Linux)
 =================================

+0.35 (2025-08-xx)
+^^^^^^^^^^^^^^^^^
+
+**Backward Breaking Changes**
+
+**Deprecations**
+
+**New Features**
+
+- (Experimental) Add quantization support for custom TensorRT op in ONNX models.
+- Add support for Minifinetuning (MFT; https://arxiv.org/abs/2506.15702) self-corrective distillation, which enables training on small datasets with severely mitigated catastrophic forgetting.
+- Add tree decoding support for Megatron Eagle models.
+
 0.33 (2025-07-14)
 ^^^^^^^^^^^^^^^^^

@@ -20,7 +33,7 @@ Model Optimizer Changelog (Linux)
 - Add per node calibration support in ONNX quantization.
 - ModelOpt now supports quantization of tensor-parallel sharded Huggingface transformer models. This requires ``transformers>=4.52.0``.
 - Support quantization of FSDP2 wrapped models and add FSDP2 support in the ``llm_qat`` example.
-- Add NeMo 2 Simplified Flow examples for quantization aware training/distillation (QAT/QAD), speculative decoding, pruning & distilllation.
+- Add NeMo 2 Simplified Flow examples for quantization aware training/distillation (QAT/QAD), speculative decoding, pruning & distillation.

 0.31 (2025-06-04)
 ^^^^^^^^^^^^^^^^^

docs/source/getting_started/windows/_installation_with_olive.rst

Lines changed: 9 additions & 3 deletions
@@ -24,8 +24,9 @@ Setup Steps for Olive with ModelOpt-Windows
     $ pip install onnxruntime-genai-directml>=0.4.0
     $ pip install onnxruntime-directml==1.20.0

+- The above onnxruntime and onnxruntime-genai packages enable the Olive workflow with the DirectML Execution Provider (EP). To use other EPs, install the corresponding packages.

-Additionally, ensure that dependencies for TensorRT Model Optimizer - Windows are met as mentioned in the :ref:`Install-Page-Standalone-Windows`.
+- Additionally, ensure that dependencies for TensorRT Model Optimizer - Windows are met as mentioned in the :ref:`Install-Page-Standalone-Windows`.

 **2. Configure Olive for TensorRT Model Optimizer – Windows**

@@ -36,7 +37,11 @@ Setup Steps for Olive with ModelOpt-Windows

 - **Add Other Passes:** Add additional passes to the Olive configuration file as needed for the desired Olive workflow of your input model. [Refer `phi3 <https://github.com/microsoft/Olive/tree/main/examples/phi3#quantize-models-with-nvidia-tensorrt-model-optimizer>`_ Olive example]

-**4. Run the Optimization**
+**4. Install other dependencies**
+
+- Install other requirements as needed by the Olive scripts and config.
+
+**5. Run the Optimization**

 - **Execute Optimization:** To start the optimization process, run the following commands:

@@ -56,4 +61,5 @@ Setup Steps for Olive with ModelOpt-Windows

 **Note**:

-#. Currently, the TensorRT-Model Optimizer - Windows only supports Onnx Runtime GenAI based models in the Olive workflow.
+#. Currently, the TensorRT-Model Optimizer - Windows only supports Onnx Runtime GenAI based LLM models in the Olive workflow.
+#. To try out different LLMs and EPs in the Olive workflow of ModelOpt-Windows, refer to the details provided in the `phi3 <https://github.com/microsoft/Olive/tree/main/examples/phi3#quantize-models-with-nvidia-tensorrt-model-optimizer>`_ Olive example.
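
For readers following the updated steps, the minimal sketch below shows one way to launch the configured workflow from Python. It is not part of this commit: the config file name is a placeholder, and it assumes the Olive config already lists the ModelOpt quantization pass and the desired EP as described in steps 2-4. The CLI entry point documented by Olive can be used instead.

    # Minimal sketch, not part of this commit: run an Olive workflow from Python.
    # Assumes "config.json" (placeholder name) already contains the NVIDIA TensorRT
    # Model Optimizer pass and the chosen execution provider.
    from olive.workflows import run as olive_run

    if __name__ == "__main__":
        olive_run("config.json")  # executes the passes defined in the config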

docs/source/guides/4_distillation.rst

Lines changed: 19 additions & 0 deletions
@@ -62,6 +62,10 @@ Example usage:
 meta model. Thus, the same callable must be available in the namespace when restoring via
 the :meth:`mto.restore <modelopt.torch.opt.conversion.restore>` utility.

+.. tip::
+    When training the student on a small corpus of ground truth data, consider using :class:`MFTLoss <modelopt.torch.distill.MFTLoss>` to perform Minifinetuning in lieu of the standard
+    :class:`LogitsDistillationLoss <modelopt.torch.distill.losses.LogitsDistillationLoss>`. This will allow the student to learn from the teacher's distribution while adapting to the new data, improving specialization on the new data without overwriting the teacher's general knowledge.
+
 .. note::
     As the model is not of the same class anymore, calling ``type()`` on the model after conversion
     will not work as expected.

@@ -124,6 +128,9 @@ maps or logits) which the teacher has already mastered. This can serve multiple
 **C.** Module replacement: One can replace a single module within a model with a more efficient one
 and use distillation on its original outputs to effectively re-integrate it into the whole model.

+**D.** Minimal modification without catastrophic forgetting: A variant of distillation, called Minifinetuning,
+allows for training a model on even small datasets without losing the original model's knowledge.
+
 Student
 ^^^^^^^

@@ -192,3 +199,15 @@ ground truth labels may be.


 .. _1: https://arxiv.org/abs/1803.03635
+
+Minifinetuning
+^^^^^^^^^^^^^^
+
+Minifinetuning is a technique that allows for training a model on even small datasets without losing the original
+model's knowledge. This is achieved by algorithmic modification of the teacher's distribution depending on its
+performance on the new dataset. The goal is to ensure that the separation between the correct and incorrect argmax
+tokens is large enough, which can be controlled by a threshold parameter. ModelOpt provides a pre-defined loss function
+for this purpose, called :class:`MFTDistillationLoss <modelopt.torch.distill.losses.MFTDistillationLoss>`, which can
+be used in place of the standard :class:`LogitsDistillationLoss <modelopt.torch.distill.losses.LogitsDistillationLoss>`.
+More information about the technique can be found in the original paper:
+`Minifinetuning: Low-Data Generation Domain Adaptation through Corrective Self-Distillation <https://arxiv.org/abs/2506.15702>`_.
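
To make the new Minifinetuning hooks concrete, the sketch below shows one way the MFT loss could be wired into ModelOpt's distillation mode. It is an assumption-laden sketch, not code from this commit: the docs above mention both MFTLoss and MFTDistillationLoss, and the threshold argument is inferred from the description, so check modelopt.torch.distill for the actual class name and signature.

    # Hedged sketch only: plugging a Minifinetuning-style loss into ModelOpt distillation.
    # The MFT class name and its `threshold` argument are assumptions inferred from the
    # docs in this commit; verify against modelopt.torch.distill before use.
    import modelopt.torch.distill as mtd

    def convert_for_mft(student_model, teacher_factory, threshold: float = 0.1):
        distillation_config = {
            "teacher_model": teacher_factory,               # callable/class so the conversion is restorable
            "criterion": mtd.MFTLoss(threshold=threshold),  # assumed name; used in lieu of LogitsDistillationLoss
            "loss_balancer": None,                          # single criterion, no balancing needed
        }
        # Returns a distillation wrapper holding the student and the frozen teacher.
        return mtd.convert(student_model, mode=[("kd_loss", distillation_config)])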

examples/deepseek/ptq.py

Lines changed: 27 additions & 1 deletion
@@ -56,7 +56,13 @@
 from modelopt.torch.export.model_config import KV_CACHE_FP8
 from modelopt.torch.export.quant_utils import get_quant_config
 from modelopt.torch.quantization.nn import TensorQuantizer
+from modelopt.torch.quantization.utils import (
+    is_quantized_column_parallel_linear,
+    is_quantized_parallel_linear,
+    is_quantized_row_parallel_linear,
+)
 from modelopt.torch.utils.dataset_utils import get_dataset_dataloader
+from modelopt.torch.utils.distributed import ParallelState

 sys.path.append(str(Path(__file__).resolve().parent / "DeepSeek-V3/inference"))
 import model as deekseep_model

@@ -105,6 +111,11 @@ def __init__(self, *args, **kwargs):
     def _setup(self):
         self.input_quantizer = TensorQuantizer()
         self.weight_quantizer = TensorQuantizer()
+        # Use TP parallel state
+        self._parallel_state = ParallelState(data_parallel_group=-1, tensor_parallel_group=None)
+        self._is_column_parallel = True
+
+        assert is_quantized_column_parallel_linear(self)

     def forward(self, x: torch.Tensor) -> torch.Tensor:
         y = linear(

@@ -124,6 +135,11 @@ def __init__(self, *args, **kwargs):
     def _setup(self):
         self.input_quantizer = TensorQuantizer()
         self.weight_quantizer = TensorQuantizer()
+        # Use TP parallel state
+        self._parallel_state = ParallelState(data_parallel_group=-1, tensor_parallel_group=None)
+        self._is_row_parallel = True
+
+        assert is_quantized_row_parallel_linear(self)

     def forward(self, x: torch.Tensor) -> torch.Tensor:
         y = linear(

@@ -146,6 +162,10 @@ def __init__(self, *args, **kwargs):
     def _setup(self):
         self.input_quantizer = TensorQuantizer()
         self.weight_quantizer = TensorQuantizer()
+        # No parallel state.
+        self._parallel_state = ParallelState(data_parallel_group=-1, tensor_parallel_group=-1)
+
+        assert not is_quantized_parallel_linear(self)

     def forward(self, x: torch.Tensor) -> torch.Tensor:
         y = linear(

@@ -238,6 +258,9 @@ def calibrate_loop(model):
     ## handle DeepSeek model structures
     transformer = model.model if hasattr(model, "model") else model

+    # make sure all processes are ready before starting the calibration
+    dist.barrier()
+
     ## quant config
     mtq_cfg = getattr(mtq, quant_cfg)

@@ -332,9 +355,12 @@ def state_dict_filter(state_dict):
     parser.add_argument("--calib_size", type=int, default=512, help="samples for calibration.")
     parser.add_argument("--enable_fp8_kvcache", type=bool, default=True, help="enable fp8 kvcache.")
     parser.add_argument("--enable_wo_quant", action="store_true", help="enable MLA wo quant.")
+    parser.add_argument("--trust_remote_code", action="store_true", help="trust remote code.")

     args = parser.parse_args()
     model = load_deepseek_model(args.config, args.model_path, args.batch_size)
-    tokenizer = AutoTokenizer.from_pretrained(args.model_path)
+    tokenizer = AutoTokenizer.from_pretrained(
+        args.model_path, trust_remote_code=args.trust_remote_code
+    )
     model = ptq(model, tokenizer, args.quant_cfg, args.batch_size, args.calib_size)
     save_amax_and_quant_config(model, args.output_path, args.enable_fp8_kvcache)
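
The dist.barrier() added before calibration keeps multi-process runs in lockstep. A minimal sketch of that pattern in isolation is shown below; it assumes torch.distributed was already initialized by the launcher (for example torchrun) and is not the full ptq.py flow.

    # Minimal sketch of the synchronization pattern added above, not the full ptq.py flow.
    import torch.distributed as dist

    def run_calibration(model, calibrate_loop):
        if dist.is_available() and dist.is_initialized():
            # Wait until every rank has finished building/loading its model shard
            # so no process starts feeding calibration data early.
            dist.barrier()
        calibrate_loop(model)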

examples/deepseek/quantize_fp8_to_nvfp4.sh

Lines changed: 6 additions & 0 deletions
@@ -70,6 +70,12 @@ if [[ -z "$FP8_HF_PATH" ]]; then
     usage
 fi

+# for KIMI-K2, copy the tiktoken.model tokenizer file to the quantized checkpoint
+if [[ -f "$FP8_HF_PATH/tiktoken.model" ]]; then
+    echo "tiktoken.model found in $FP8_HF_PATH"
+    cp $FP8_HF_PATH/tiktoken.model $FP4_PATH/
+fi
+
 # Copy miscellaneous files to the quantized checkpoint
 mkdir -p $FP4_PATH
 cp $FP8_HF_PATH/*.json $FP8_HF_PATH/*.py $FP4_PATH/

examples/diffusers/quantization/config.py

Lines changed: 1 addition & 1 deletion
@@ -18,7 +18,7 @@
 from calib.plugin_calib import PercentileCalibrator
 from utils import filter_func

-from modelopt.core.torch.quantization.config import NVFP4_FP8_MHA_CONFIG  # noqa: F401
+from modelopt.torch.quantization.config import NVFP4_FP8_MHA_CONFIG  # noqa: F401

 FP8_DEFAULT_CONFIG = {
     "quant_cfg": {

examples/llm_ptq/example_utils.py

Lines changed: 8 additions & 2 deletions
@@ -20,7 +20,13 @@
 import torch
 from accelerate import infer_auto_device_map, init_empty_weights
 from accelerate.utils import get_max_memory
-from transformers import AutoConfig, AutoModelForCausalLM, AutoProcessor, AutoTokenizer
+from transformers import (
+    AutoConfig,
+    AutoModelForCausalLM,
+    AutoProcessor,
+    AutoTokenizer,
+    Llama4ForConditionalGeneration,
+)

 from modelopt.torch.utils.image_processor import MllamaImageProcessor

@@ -225,7 +231,7 @@ def get_model(
             **model_kwargs,
         )
     elif hf_config.model_type == "llama4":
-        model = AutoModelForCausalLM.from_pretrained(
+        model = Llama4ForConditionalGeneration.from_pretrained(
             ckpt_path,
             device_map=device_map,
             **model_kwargs,
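
The switch to Llama4ForConditionalGeneration assumes a transformers release that ships the Llama 4 classes. A hedged sketch of the loading pattern is below; the checkpoint path and dtype choice are placeholders, not part of this commit.

    # Hedged sketch of the llama4 branch above. Requires a transformers version that
    # provides Llama4ForConditionalGeneration; the checkpoint path is a placeholder.
    import torch
    from transformers import AutoConfig, Llama4ForConditionalGeneration

    def load_llama4(ckpt_path: str):
        hf_config = AutoConfig.from_pretrained(ckpt_path)
        assert hf_config.model_type == "llama4"
        return Llama4ForConditionalGeneration.from_pretrained(
            ckpt_path,
            device_map="auto",
            torch_dtype=torch.bfloat16,  # assumption: bf16 weights for a large VLM
        )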

examples/llm_ptq/hf_ptq.py

Lines changed: 20 additions & 30 deletions
@@ -85,7 +85,8 @@ def auto_quantize(
     # Check if all provided quantization formats are supported
     if args.export_fmt == "hf":
         assert all(
-            qformat in ["fp8", "int4_awq", "nvfp4", "nvfp4_awq", "w4a8_awq", "fp8_pb_wo"]
+            qformat
+            in ["fp8", "int4_awq", "nvfp4", "nvfp4_awq", "w4a8_awq", "fp8_pb_wo", "w4a8_mxfp4_fp8"]
             for qformat in qformat_list
         ), (
             "One or more quantization formats provided are not supported for unified checkpoint export"

@@ -110,9 +111,7 @@ def loss_func(output, data):
         # TRTLLM only support one quantization format or None (do not quantize, internally supported)
         quantization_formats=[QUANT_CFG_CHOICES[format] for format in qformat_list],
         num_calib_steps=len(calib_dataloader),
-        num_score_steps=min(
-            len(calib_dataloader), 128 // batch_size
-        ),  # Limit the number of score steps to avoid long calibration time
+        num_score_steps=len(calib_dataloader),
         verbose=True,
         disabled_layers=["*lm_head*"],
     )

@@ -218,6 +217,7 @@ def main(args):
             "nvfp4_awq",
             "w4a8_awq",
             "fp8_pb_wo",
+            "w4a8_mxfp4_fp8",
         ]
         or args.kv_cache_qformat in KV_QUANT_CFG_CHOICES
     ), f"Quantization format {args.qformat} not supported for HF export path"

@@ -263,6 +263,9 @@ def main(args):
         device = model.model.device
         processor = None
         tokenizer = None
+
+    full_model = model
+
     if model_type == "mllama":
         if args.dataset is None:
             args.dataset = "scienceqa"

@@ -300,6 +303,13 @@ def main(args):
         # Left padding usually provides better calibration result.
         tokenizer.padding_side = "left"

+    # We only quantize the language model for VLMs other than the type supported above.
+    if hasattr(model, "language_model"):
+        assert model_type == "llama4", (
+            "Only llama4 should reach here. Please uncomment this check if you are modelopt developers."
+        )
+        model = model.language_model
+
     if args.sparsity_fmt != "dense":
         if args.batch_size == 0:
             # Sparse algorithm takes more GPU memory so we reduce the batch_size by 4.

@@ -335,10 +345,6 @@ def main(args):
         )

     if args.batch_size == 0:
-        # TODO: Enable auto-batch size calculation for auto_quantize
-        assert args.auto_quantize_bits is None, (
-            "auto_quantize requires batch_size to be specified, please specify batch_size."
-        )
         # Calibration/sparsification will actually take much more memory than regular inference
         # due to intermediate tensors for fake quantization. Setting sample_memory_usage_ratio
         # to 2 to avoid OOM for AWQ/SmoothQuant fake quantization as it will take more memory than inference.

@@ -358,10 +364,14 @@ def main(args):
            )
        else:
            sample_input_single_batch = None
+
+        run_auto_quant = args.auto_quantize_bits is not None
+
        args.batch_size = get_max_batch_size(
            model,
-            sample_memory_usage_ratio=sample_memory_usage_ratio,
+            sample_memory_usage_ratio=sample_memory_usage_ratio if not run_auto_quant else 1.0,
            sample_input_single_batch=sample_input_single_batch,
+            enable_grad=run_auto_quant,
        )
        args.batch_size = min(args.batch_size, args.calib_size)

@@ -550,23 +560,9 @@ def output_decode(generated_ids, input_shape):
         )
     elif args.export_fmt == "hf":
         export_hf_checkpoint(
-            model,
+            full_model,
             export_dir=export_path,
         )
-        if model_type == "llama4":
-            # TRT-LLM expects the original model config instead of the config from text model,
-            # so we need to copy the original model config to the export path.
-            # Also we copy the preprocessor config to the export path.
-            from transformers import AutoConfig, AutoProcessor
-
-            # Use HuggingFace API to handle both model IDs and local paths
-            AutoConfig.from_pretrained(
-                args.pyt_ckpt_path, trust_remote_code=args.trust_remote_code
-            ).save_pretrained(export_path)
-
-            AutoProcessor.from_pretrained(
-                args.pyt_ckpt_path, trust_remote_code=args.trust_remote_code
-            ).save_pretrained(export_path)
     else:
         raise NotImplementedError(f"{args.export_fmt} not supported")

@@ -639,12 +635,6 @@ def output_decode(generated_ids, input_shape):
         choices=KV_QUANT_CFG_CHOICES.keys(),
         help="Specify KV cache quantization format, default to fp8 if not provided",
     )
-    parser.add_argument(
-        "--vlm",
-        help="Specify whether this is a visual-language model",
-        default=False,
-        action="store_true",
-    )
     parser.add_argument(
         "--export_fmt",
         required=False,