
Commit 4c611e4

Update files on GitHub
Signed-off-by: Keval Morabia <[email protected]>
1 parent b40f478 commit 4c611e4

139 files changed (+8482 −3164 lines)


.pre-commit-config.yaml

Lines changed: 1 addition & 0 deletions
@@ -96,6 +96,7 @@ repos:
 modelopt/torch/quantization/plugins/attention.py|
 modelopt/torch/speculative/eagle/utils.py|
 modelopt/torch/speculative/plugins/transformers.py|
+modelopt/torch/utils/plugins/megatron_mmlu.py|
 examples/chained_optimizations/bert_prune_distill_quantize.py|
 examples/deepseek/quantize_to_nvfp4.py|
 examples/deepseek/ptq.py|

CHANGELOG.rst

Lines changed: 14 additions & 1 deletion
@@ -1,6 +1,19 @@
 Model Optimizer Changelog (Linux)
 =================================

+0.35 (2025-08-xx)
+^^^^^^^^^^^^^^^^^
+
+**Backward Breaking Changes**
+
+**Deprecations**
+
+**New Features**
+
+- (Experimental) Add quantization support for custom TensorRT op in ONNX models.
+- Add support for Minifinetuning (MFT; https://arxiv.org/abs/2506.15702) self-corrective distillation, which enables training on small datasets with severely mitigated catastrophic forgetting.
+- Add tree decoding support for Megatron Eagle models.
+
 0.33 (2025-07-14)
 ^^^^^^^^^^^^^^^^^

@@ -20,7 +33,7 @@ Model Optimizer Changelog (Linux)
 - Add per node calibration support in ONNX quantization.
 - ModelOpt now supports quantization of tensor-parallel sharded Huggingface transformer models. This requires ``transformers>=4.52.0``.
 - Support quantization of FSDP2 wrapped models and add FSDP2 support in the ``llm_qat`` example.
-- Add NeMo 2 Simplified Flow examples for quantization aware training/distillation (QAT/QAD), speculative decoding, pruning & distilllation.
+- Add NeMo 2 Simplified Flow examples for quantization aware training/distillation (QAT/QAD), speculative decoding, pruning & distillation.

 0.31 (2025-06-04)
 ^^^^^^^^^^^^^^^^^

docs/source/getting_started/windows/_installation_with_olive.rst

Lines changed: 9 additions & 3 deletions
@@ -24,8 +24,9 @@ Setup Steps for Olive with ModelOpt-Windows
     $ pip install onnxruntime-genai-directml>=0.4.0
     $ pip install onnxruntime-directml==1.20.0

+- The above onnxruntime and onnxruntime-genai packages enable the Olive workflow with the DirectML Execution Provider (EP). To use other EPs, install the corresponding packages.

-Additionally, ensure that dependencies for TensorRT Model Optimizer - Windows are met as mentioned in the :ref:`Install-Page-Standalone-Windows`.
+- Additionally, ensure that dependencies for TensorRT Model Optimizer - Windows are met as mentioned in the :ref:`Install-Page-Standalone-Windows`.

 **2. Configure Olive for TensorRT Model Optimizer – Windows**

@@ -36,7 +37,11 @@ Setup Steps for Olive with ModelOpt-Windows

 - **Add Other Passes:** Add additional passes to the Olive configuration file as needed for the desired Olive workflow of your input model. [Refer `phi3 <https://github.com/microsoft/Olive/tree/main/examples/phi3#quantize-models-with-nvidia-tensorrt-model-optimizer>`_ Olive example]

-**4. Run the Optimization**
+**4. Install other dependencies**
+
+- Install other requirements as needed by the Olive scripts and config.
+
+**5. Run the Optimization**

 - **Execute Optimization:** To start the optimization process, run the following commands:

@@ -56,4 +61,5 @@ Setup Steps for Olive with ModelOpt-Windows

 **Note**:

-#. Currently, the TensorRT-Model Optimizer - Windows only supports Onnx Runtime GenAI based models in the Olive workflow.
+#. Currently, the TensorRT-Model Optimizer - Windows only supports Onnx Runtime GenAI based LLM models in the Olive workflow.
+#. To try out different LLMs and EPs in the Olive workflow of ModelOpt-Windows, refer to the details provided in the `phi3 <https://github.com/microsoft/Olive/tree/main/examples/phi3#quantize-models-with-nvidia-tensorrt-model-optimizer>`_ Olive example.
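
For readers following the updated steps, the minimal sketch below shows one way to launch the configured workflow from Python. It is not part of this commit: the config file name is a placeholder, and it assumes the Olive config already lists the ModelOpt quantization pass and the desired EP as described in steps 2-4. The CLI entry point documented by Olive can be used instead.

    # Minimal sketch, not part of this commit: run an Olive workflow from Python.
    # Assumes "config.json" (placeholder name) already contains the NVIDIA TensorRT
    # Model Optimizer pass and the chosen execution provider.
    from olive.workflows import run as olive_run

    if __name__ == "__main__":
        olive_run("config.json")  # executes the passes defined in the config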

docs/source/guides/4_distillation.rst

Lines changed: 19 additions & 0 deletions
@@ -62,6 +62,10 @@ Example usage:
 meta model. Thus, the same callable must be available in the namespace when restoring via
 the :meth:`mto.restore <modelopt.torch.opt.conversion.restore>` utility.

+.. tip::
+    When training the student on a small corpus of ground truth data, consider using :class:`MFTLoss <modelopt.torch.distill.MFTLoss>` to perform Minifinetuning in lieu of the standard
+    :class:`LogitsDistillationLoss <modelopt.torch.distill.losses.LogitsDistillationLoss>`. This will allow the student to learn from the teacher's distribution while adapting to the new data, improving specialization on the new data without overwriting the teacher's general knowledge.
+
 .. note::
     As the model is not of the same class anymore, calling ``type()`` on the model after conversion
     will not work as expected.

@@ -124,6 +128,9 @@ maps or logits) which the teacher has already mastered. This can serve multiple
 **C.** Module replacement: One can replace a single module within a model with a more efficient one
 and use distillation on its original outputs to effectively re-integrate it into the whole model.

+**D.** Minimal modification without catastrophic forgetting: A variant of distillation, called Minifinetuning,
+allows for training a model on even small datasets without losing the original model's knowledge.
+
 Student
 ^^^^^^^

@@ -192,3 +199,15 @@ ground truth labels may be.


 .. _1: https://arxiv.org/abs/1803.03635
+
+Minifinetuning
+^^^^^^^^^^^^^^
+
+Minifinetuning is a technique that allows for training a model on even small datasets without losing the original
+model's knowledge. This is achieved by algorithmic modification of the teacher's distribution depending on its
+performance on the new dataset. The goal is to ensure that the separation between the correct and incorrect argmax
+tokens is large enough, which can be controlled by a threshold parameter. ModelOpt provides a pre-defined loss function
+for this purpose, called :class:`MFTDistillationLoss <modelopt.torch.distill.losses.MFTDistillationLoss>`, which can
+be used in place of the standard :class:`LogitsDistillationLoss <modelopt.torch.distill.losses.LogitsDistillationLoss>`.
+More information about the technique can be found in the original paper:
+`Minifinetuning: Low-Data Generation Domain Adaptation through Corrective Self-Distillation <https://arxiv.org/abs/2506.15702>`_.
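
To make the new Minifinetuning hooks concrete, the sketch below shows one way the MFT loss could be wired into ModelOpt's distillation mode. It is an assumption-laden sketch, not code from this commit: the docs above mention both MFTLoss and MFTDistillationLoss, and the threshold argument is inferred from the description, so check modelopt.torch.distill for the actual class name and signature.

    # Hedged sketch only: plugging a Minifinetuning-style loss into ModelOpt distillation.
    # The MFT class name and its `threshold` argument are assumptions inferred from the
    # docs in this commit; verify against modelopt.torch.distill before use.
    import modelopt.torch.distill as mtd

    def convert_for_mft(student_model, teacher_factory, threshold: float = 0.1):
        distillation_config = {
            "teacher_model": teacher_factory,               # callable/class so the conversion is restorable
            "criterion": mtd.MFTLoss(threshold=threshold),  # assumed name; used in lieu of LogitsDistillationLoss
            "loss_balancer": None,                          # single criterion, no balancing needed
        }
        # Returns a distillation wrapper holding the student and the frozen teacher.
        return mtd.convert(student_model, mode=[("kd_loss", distillation_config)])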

examples/deepseek/ptq.py

Lines changed: 27 additions & 1 deletion
@@ -56,7 +56,13 @@
 from modelopt.torch.export.model_config import KV_CACHE_FP8
 from modelopt.torch.export.quant_utils import get_quant_config
 from modelopt.torch.quantization.nn import TensorQuantizer
+from modelopt.torch.quantization.utils import (
+    is_quantized_column_parallel_linear,
+    is_quantized_parallel_linear,
+    is_quantized_row_parallel_linear,
+)
 from modelopt.torch.utils.dataset_utils import get_dataset_dataloader
+from modelopt.torch.utils.distributed import ParallelState

 sys.path.append(str(Path(__file__).resolve().parent / "DeepSeek-V3/inference"))
 import model as deekseep_model

@@ -105,6 +111,11 @@ def __init__(self, *args, **kwargs):
     def _setup(self):
         self.input_quantizer = TensorQuantizer()
         self.weight_quantizer = TensorQuantizer()
+        # Use TP parallel state
+        self._parallel_state = ParallelState(data_parallel_group=-1, tensor_parallel_group=None)
+        self._is_column_parallel = True
+
+        assert is_quantized_column_parallel_linear(self)

     def forward(self, x: torch.Tensor) -> torch.Tensor:
         y = linear(

@@ -124,6 +135,11 @@ def __init__(self, *args, **kwargs):
     def _setup(self):
         self.input_quantizer = TensorQuantizer()
         self.weight_quantizer = TensorQuantizer()
+        # Use TP parallel state
+        self._parallel_state = ParallelState(data_parallel_group=-1, tensor_parallel_group=None)
+        self._is_row_parallel = True
+
+        assert is_quantized_row_parallel_linear(self)

     def forward(self, x: torch.Tensor) -> torch.Tensor:
         y = linear(

@@ -146,6 +162,10 @@ def __init__(self, *args, **kwargs):
     def _setup(self):
         self.input_quantizer = TensorQuantizer()
         self.weight_quantizer = TensorQuantizer()
+        # No parallel state.
+        self._parallel_state = ParallelState(data_parallel_group=-1, tensor_parallel_group=-1)
+
+        assert not is_quantized_parallel_linear(self)

     def forward(self, x: torch.Tensor) -> torch.Tensor:
         y = linear(

@@ -238,6 +258,9 @@ def calibrate_loop(model):
     ## handle DeepSeek model structures
     transformer = model.model if hasattr(model, "model") else model

+    # make sure all processes are ready before starting the calibration
+    dist.barrier()
+
     ## quant config
     mtq_cfg = getattr(mtq, quant_cfg)

@@ -332,9 +355,12 @@ def state_dict_filter(state_dict):
     parser.add_argument("--calib_size", type=int, default=512, help="samples for calibration.")
     parser.add_argument("--enable_fp8_kvcache", type=bool, default=True, help="enable fp8 kvcache.")
     parser.add_argument("--enable_wo_quant", action="store_true", help="enable MLA wo quant.")
+    parser.add_argument("--trust_remote_code", action="store_true", help="trust remote code.")

     args = parser.parse_args()
     model = load_deepseek_model(args.config, args.model_path, args.batch_size)
-    tokenizer = AutoTokenizer.from_pretrained(args.model_path)
+    tokenizer = AutoTokenizer.from_pretrained(
+        args.model_path, trust_remote_code=args.trust_remote_code
+    )
     model = ptq(model, tokenizer, args.quant_cfg, args.batch_size, args.calib_size)
     save_amax_and_quant_config(model, args.output_path, args.enable_fp8_kvcache)
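
The dist.barrier() added before calibration keeps multi-process runs in lockstep. A minimal sketch of that pattern in isolation is shown below; it assumes torch.distributed was already initialized by the launcher (for example torchrun) and is not the full ptq.py flow.

    # Minimal sketch of the synchronization pattern added above, not the full ptq.py flow.
    import torch.distributed as dist

    def run_calibration(model, calibrate_loop):
        if dist.is_available() and dist.is_initialized():
            # Wait until every rank has finished building/loading its model shard
            # so no process starts feeding calibration data early.
            dist.barrier()
        calibrate_loop(model)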

examples/deepseek/quantize_fp8_to_nvfp4.sh

Lines changed: 6 additions & 0 deletions
@@ -70,6 +70,12 @@ if [[ -z "$FP8_HF_PATH" ]]; then
     usage
 fi

+# for KIMI-K2, copy the tiktoken.model tokenizer file to the quantized checkpoint
+if [[ -f "$FP8_HF_PATH/tiktoken.model" ]]; then
+    echo "tiktoken.model found in $FP8_HF_PATH"
+    cp $FP8_HF_PATH/tiktoken.model $FP4_PATH/
+fi
+
 # Copy miscellaneous files to the quantized checkpoint
 mkdir -p $FP4_PATH
 cp $FP8_HF_PATH/*.json $FP8_HF_PATH/*.py $FP4_PATH/

examples/diffusers/quantization/config.py

Lines changed: 1 addition & 1 deletion
@@ -18,7 +18,7 @@
 from calib.plugin_calib import PercentileCalibrator
 from utils import filter_func

-from modelopt.core.torch.quantization.config import NVFP4_FP8_MHA_CONFIG  # noqa: F401
+from modelopt.torch.quantization.config import NVFP4_FP8_MHA_CONFIG  # noqa: F401

 FP8_DEFAULT_CONFIG = {
     "quant_cfg": {

examples/llm_ptq/example_utils.py

Lines changed: 8 additions & 2 deletions
@@ -20,7 +20,13 @@
 import torch
 from accelerate import infer_auto_device_map, init_empty_weights
 from accelerate.utils import get_max_memory
-from transformers import AutoConfig, AutoModelForCausalLM, AutoProcessor, AutoTokenizer
+from transformers import (
+    AutoConfig,
+    AutoModelForCausalLM,
+    AutoProcessor,
+    AutoTokenizer,
+    Llama4ForConditionalGeneration,
+)

 from modelopt.torch.utils.image_processor import MllamaImageProcessor

@@ -225,7 +231,7 @@ def get_model(
             **model_kwargs,
         )
     elif hf_config.model_type == "llama4":
-        model = AutoModelForCausalLM.from_pretrained(
+        model = Llama4ForConditionalGeneration.from_pretrained(
             ckpt_path,
             device_map=device_map,
             **model_kwargs,
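
The switch to Llama4ForConditionalGeneration assumes a transformers release that ships the Llama 4 classes. A hedged sketch of the loading pattern is below; the checkpoint path and dtype choice are placeholders, not part of this commit.

    # Hedged sketch of the llama4 branch above. Requires a transformers version that
    # provides Llama4ForConditionalGeneration; the checkpoint path is a placeholder.
    import torch
    from transformers import AutoConfig, Llama4ForConditionalGeneration

    def load_llama4(ckpt_path: str):
        hf_config = AutoConfig.from_pretrained(ckpt_path)
        assert hf_config.model_type == "llama4"
        return Llama4ForConditionalGeneration.from_pretrained(
            ckpt_path,
            device_map="auto",
            torch_dtype=torch.bfloat16,  # assumption: bf16 weights for a large VLM
        )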

examples/llm_ptq/hf_ptq.py

Lines changed: 20 additions & 30 deletions
@@ -85,7 +85,8 @@ def auto_quantize(
     # Check if all provided quantization formats are supported
     if args.export_fmt == "hf":
         assert all(
-            qformat in ["fp8", "int4_awq", "nvfp4", "nvfp4_awq", "w4a8_awq", "fp8_pb_wo"]
+            qformat
+            in ["fp8", "int4_awq", "nvfp4", "nvfp4_awq", "w4a8_awq", "fp8_pb_wo", "w4a8_mxfp4_fp8"]
             for qformat in qformat_list
         ), (
             "One or more quantization formats provided are not supported for unified checkpoint export"

@@ -110,9 +111,7 @@ def loss_func(output, data):
         # TRTLLM only support one quantization format or None (do not quantize, internally supported)
         quantization_formats=[QUANT_CFG_CHOICES[format] for format in qformat_list],
         num_calib_steps=len(calib_dataloader),
-        num_score_steps=min(
-            len(calib_dataloader), 128 // batch_size
-        ),  # Limit the number of score steps to avoid long calibration time
+        num_score_steps=len(calib_dataloader),
         verbose=True,
         disabled_layers=["*lm_head*"],
     )

@@ -218,6 +217,7 @@ def main(args):
             "nvfp4_awq",
             "w4a8_awq",
             "fp8_pb_wo",
+            "w4a8_mxfp4_fp8",
         ]
         or args.kv_cache_qformat in KV_QUANT_CFG_CHOICES
     ), f"Quantization format {args.qformat} not supported for HF export path"

@@ -263,6 +263,9 @@ def main(args):
         device = model.model.device
         processor = None
         tokenizer = None
+
+    full_model = model
+
     if model_type == "mllama":
         if args.dataset is None:
             args.dataset = "scienceqa"

@@ -300,6 +303,13 @@ def main(args):
         # Left padding usually provides better calibration result.
         tokenizer.padding_side = "left"

+    # We only quantize the language model for VLMs other than the type supported above.
+    if hasattr(model, "language_model"):
+        assert model_type == "llama4", (
+            "Only llama4 should reach here. Please uncomment this check if you are modelopt developers."
+        )
+        model = model.language_model
+
     if args.sparsity_fmt != "dense":
         if args.batch_size == 0:
             # Sparse algorithm takes more GPU memory so we reduce the batch_size by 4.

@@ -335,10 +345,6 @@ def main(args):
         )

     if args.batch_size == 0:
-        # TODO: Enable auto-batch size calculation for auto_quantize
-        assert args.auto_quantize_bits is None, (
-            "auto_quantize requires batch_size to be specified, please specify batch_size."
-        )
         # Calibration/sparsification will actually take much more memory than regular inference
         # due to intermediate tensors for fake quantization. Setting sample_memory_usage_ratio
         # to 2 to avoid OOM for AWQ/SmoothQuant fake quantization as it will take more memory than inference.

@@ -358,10 +364,14 @@ def main(args):
            )
        else:
            sample_input_single_batch = None
+
+        run_auto_quant = args.auto_quantize_bits is not None
+
        args.batch_size = get_max_batch_size(
            model,
-            sample_memory_usage_ratio=sample_memory_usage_ratio,
+            sample_memory_usage_ratio=sample_memory_usage_ratio if not run_auto_quant else 1.0,
            sample_input_single_batch=sample_input_single_batch,
+            enable_grad=run_auto_quant,
        )
        args.batch_size = min(args.batch_size, args.calib_size)

@@ -550,23 +560,9 @@ def output_decode(generated_ids, input_shape):
         )
     elif args.export_fmt == "hf":
         export_hf_checkpoint(
-            model,
+            full_model,
             export_dir=export_path,
         )
-        if model_type == "llama4":
-            # TRT-LLM expects the original model config instead of the config from text model,
-            # so we need to copy the original model config to the export path.
-            # Also we copy the preprocessor config to the export path.
-            from transformers import AutoConfig, AutoProcessor
-
-            # Use HuggingFace API to handle both model IDs and local paths
-            AutoConfig.from_pretrained(
-                args.pyt_ckpt_path, trust_remote_code=args.trust_remote_code
-            ).save_pretrained(export_path)
-
-            AutoProcessor.from_pretrained(
-                args.pyt_ckpt_path, trust_remote_code=args.trust_remote_code
-            ).save_pretrained(export_path)
     else:
         raise NotImplementedError(f"{args.export_fmt} not supported")

@@ -639,12 +635,6 @@ def output_decode(generated_ids, input_shape):
         choices=KV_QUANT_CFG_CHOICES.keys(),
         help="Specify KV cache quantization format, default to fp8 if not provided",
     )
-    parser.add_argument(
-        "--vlm",
-        help="Specify whether this is a visual-language model",
-        default=False,
-        action="store_true",
-    )
     parser.add_argument(
         "--export_fmt",
         required=False,