
Commit d76ef5b

Merge branch 'main' into jingyux/megatron-lora

2 parents 2b09c92 + b895dc5

30 files changed: +863 additions, -312 deletions

CHANGELOG.rst

Lines changed: 1 addition & 1 deletion
@@ -9,13 +9,13 @@ Model Optimizer Changelog (Linux)
 - Deprecated ``quantize_mode`` argument in ``examples/onnx_ptq/evaluate.py`` to support strongly typing. Use ``engine_precision`` instead.
 - Deprecated TRT-LLM's TRT backend in ``examples/llm_ptq`` and ``examples/vlm_ptq``. Tasks ``build`` and ``benchmark`` support are removed and replaced with ``quant``. For performance evaluation, please use ``trtllm-bench`` directly.
 - ``--export_fmt`` flag in ``examples/llm_ptq`` is removed. By default we export to the unified Hugging Face checkpoint format.
-- ``int8_sq`` quantization format is deprecated from the ``examples/vlm_ptq`` with respect to the TensorRT-LLM's torch backend switch. Please refer to the previous releases if this quantization format is needed.
 - Deprecated ``examples/vlm_eval`` as it depends on the deprecated TRT-LLM's TRT backend.

 **New Features**

 - ``high_precision_dtype`` default to fp16 in ONNX quantization, i.e. quantized output model weights are now FP16 by default.
 - Upgrade TensorRT-LLM dependency to 1.1.0rc2.
+- Support Phi-4-multimodal and Qwen2.5-VL quantized HF checkpoint export in ``examples/vlm_ptq``.

 0.35 (2025-09-04)
 ^^^^^^^^^^^^^^^^^

README.md

Lines changed: 2 additions & 0 deletions
@@ -26,6 +26,8 @@ Model Optimizer is also integrated with [NVIDIA NeMo](https://github.com/NVIDIA-

 ## Latest News

+- [2025/09/17] [An Introduction to Speculative Decoding for Reducing Latency in AI Inference](https://developer.nvidia.com/blog/an-introduction-to-speculative-decoding-for-reducing-latency-in-ai-inference/)
+- [2025/09/11] [How Quantization Aware Training Enables Low-Precision Accuracy Recovery](https://developer.nvidia.com/blog/how-quantization-aware-training-enables-low-precision-accuracy-recovery/)
 - [2025/08/29] [Fine-Tuning gpt-oss for Accuracy and Performance with Quantization Aware Training](https://developer.nvidia.com/blog/fine-tuning-gpt-oss-for-accuracy-and-performance-with-quantization-aware-training/)
 - [2025/08/01] [Optimizing LLMs for Performance and Accuracy with Post-Training Quantization](https://developer.nvidia.com/blog/optimizing-llms-for-performance-and-accuracy-with-post-training-quantization/)
 - [2025/06/24] [Introducing NVFP4 for Efficient and Accurate Low-Precision Inference](https://developer.nvidia.com/blog/introducing-nvfp4-for-efficient-and-accurate-low-precision-inference/)

docs/source/guides/4_distillation.rst

Lines changed: 9 additions & 13 deletions
@@ -16,9 +16,9 @@ a more powerful teacher model using :mod:`modelopt.torch.distill <modelopt.torch
    interaction between the two.
 #. **Distillation training**: Seamlessly use the meta-model in place of the original model and run
    the original script with only one additional line of code for loss calculation.
-#. **Checkpoint and re-load**: Save the model via :meth:`mto.save <modelopt.torch.opt.conversion.save>` and
-   restore via :meth:`mto.restore <modelopt.torch.opt.conversion.restore>`. See :ref:`saving and restoring <save-restore>`
-   to learn more.
+#. **Checkpoint and re-load**: Save the model via :meth:`mto.save <modelopt.torch.opt.conversion.save>`
+   Note that restoring the model (via :meth:`mto.restore <modelopt.torch.opt.conversion.restore>`)
+   will not reinstantiate the distillation meta-model, in order to avoid unpickling issues.

 *To find out more about Distillation and related concepts, please refer to the below section*
 :ref:`Distillation Concepts <distillation-concepts>`.

@@ -44,7 +44,7 @@ Example usage:

     # Configure and convert for distillation
     distillation_config = {
-        # `teacher_model` is a model class or callable, or a tuple.
+        # `teacher_model` is a model, model class, callable, or a tuple.
         # If a tuple, it must be of the form (model_cls_or_callable,) or
         # (model_cls_or_callable, args) or (model_cls_or_callable, args, kwargs).
         "teacher_model": teacher_model,

@@ -53,15 +53,9 @@ Example usage:
     }
     distillation_model = mtd.convert(model, mode=[("kd_loss", distillation_config)])

-    # Export model in original class form
+    # Export model in original class, with only previously-present attributes
     model_exported = mtd.export(distillation_model)

-.. note::
-    The config requires a (non-lambda) Callable to return a teacher model in place of the model
-    itself. This is to avoid re-saving the teacher state dict upon saving the Distillation
-    meta model. Thus, the same callable must be available in the namespace when restoring via
-    the :meth:`mto.restore <modelopt.torch.opt.conversion.restore>` utility.
-
 .. tip::
     When training the student on a small corpus of ground truth data, consider using :class:`MFTLoss <modelopt.torch.distill.MFTLoss>` for to perform Minifinetuning in lieu of the standard
     :class:`LogitsDistillationLoss <modelopt.torch.distill.losses.LogitsDistillationLoss>`. This will allow the student to learn from the teacher's distribution while adapting to the new data, improving the specialization of the new data without overwriting teacher's general knowledge.

@@ -170,10 +164,12 @@ outputs in the same order as well:
 The intermediate outputs for the losses are captured by the
 :class:`DistillationModel <modelopt.torch.distill.distillation_model.DistillationModel>` and then the loss(es) are
 invoked using :meth:`DistillationModel.compute_kd_loss() <modelopt.torch.distill.distillation_model.DistillationModel.compute_kd_loss>`.
-If present, the original student's non-distillation loss is passed in as an argument.
+If present, the original student's non-distillation loss can be passed in as an argument.

 Writing a custom loss function is often necessary, especially to handle outputs that need to be processed
-to obtain the logits and activations.
+to obtain the logits and activations. Additional arguments to the loss function can be passed in to
+:meth:`DistillationModel.compute_kd_loss() <modelopt.torch.distill.distillation_model.DistillationModel.compute_kd_loss>`
+as ``kwargs``.

 Loss Balancer
 ^^^^^^^^^^^^^
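
To make the ``kwargs`` forwarding described in the hunk above concrete, here is a minimal sketch. The custom criterion and the extra `mask` argument are illustrative and not part of the library; `model` and `teacher_model` are as in the guide's example usage, and `batch`/`attention_mask` stand in for whatever the training loop provides. The only ModelOpt APIs assumed are `mtd.convert` and `DistillationModel.compute_kd_loss`, both shown in the diff.

```python
import torch.nn as nn
import torch.nn.functional as F

import modelopt.torch.distill as mtd


class MaskedLogitsLoss(nn.Module):
    """Illustrative criterion: per-token KL on logits, ignoring masked-out positions."""

    def forward(self, student_logits, teacher_logits, mask=None):
        # Both tensors are (batch, seq, vocab); `mask` arrives via compute_kd_loss(..., mask=...).
        kl = F.kl_div(
            F.log_softmax(student_logits, dim=-1),
            F.softmax(teacher_logits, dim=-1),
            reduction="none",
        ).sum(dim=-1)
        if mask is not None:
            kl = kl * mask
        return kl.mean()


distillation_config = {"teacher_model": teacher_model, "criterion": MaskedLogitsLoss()}
distillation_model = mtd.convert(model, mode=[("kd_loss", distillation_config)])

# Training step: the forward pass runs student and teacher and captures their outputs;
# compute_kd_loss() then invokes the criterion. Extra keyword arguments (here `mask`)
# are forwarded to the criterion, and the student's own non-distillation loss, if any,
# can also be passed in as an argument.
outputs = distillation_model(batch)
kd_loss = distillation_model.compute_kd_loss(mask=attention_mask)
```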

examples/llm_distill/README.md

Lines changed: 16 additions & 22 deletions
@@ -39,13 +39,9 @@ First obtain both a pretrained model to act as the teacher and a (usually smalle
 ```python
 from transformers import AutoModelForCausalLM

-# Define student
+# Define student & teacher
 student_model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")
-
-# Define callable which returns teacher
-def teacher_factory():
-    teacher_model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-70B-Instruct")
-    return teacher_model
+teacher_model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-70B-Instruct")
 ```

 ### Set up the meta model

@@ -58,15 +54,15 @@ Please see an example Distillation setup below. This example assumes the outputs
 import modelopt.torch.distill as mtd

 distillation_config = {
-    "teacher_model": teacher_factory,  # model initializer
+    "teacher_model": teacher_model,
     "criterion": mtd.LogitsDistillationLoss(),  # callable receiving student and teacher outputs, in order
     "loss_balancer": mtd.StaticLossBalancer(),  # combines multiple losses; omit if only one distillation loss used
 }

 distillation_model = mtd.convert(student_model, mode=[("kd_loss", distillation_config)])
 ```

-The `teacher_model` can be either a callable which returns an `nn.Module` or a tuple of `(model_cls, args, kwargs)`. The `criterion` is the distillation loss used between student and teacher tensors. The `loss_balancer` determines how the original and distillation losses are combined (if needed).
+The `teacher_model` can be either a `nn.Module`, a callable which returns an `nn.Module`, or a tuple of `(model_cls, args, kwargs)`. The `criterion` is the distillation loss used between student and teacher tensors. The `loss_balancer` determines how the original and distillation losses are combined (if needed).

 See [Distillation](https://nvidia.github.io/TensorRT-Model-Optimizer/guides/4_distillation.html) for more info.
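
To make the alternative `teacher_model` forms mentioned in the line above concrete, here is a sketch using the same checkpoints as the README example. Only the `teacher_model` entry changes; the rest of the config stays as shown, and the helper names below are illustrative.

```python
from transformers import AutoModelForCausalLM

import modelopt.torch.distill as mtd


# Variant 1: a callable that returns the teacher when invoked with no arguments.
def teacher_factory():
    return AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-70B-Instruct")


# Variant 2: a (model_cls_or_callable, args, kwargs) tuple; the callable is invoked
# with the given args/kwargs to construct the teacher.
def teacher_from_path(model_name_or_path):
    return AutoModelForCausalLM.from_pretrained(model_name_or_path)


distillation_config = {
    # Either form below can be used in place of an already-instantiated teacher:
    # "teacher_model": teacher_factory,
    "teacher_model": (teacher_from_path, ("meta-llama/Llama-3.1-70B-Instruct",), {}),
    "criterion": mtd.LogitsDistillationLoss(),
}
distillation_model = mtd.convert(student_model, mode=[("kd_loss", distillation_config)])
```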

@@ -158,35 +154,33 @@ Keep in mind the training loss of the distillation run is not directly comparabl
 ### Train teacher

 ```bash
-accelerate launch --multi_gpu --mixed_precision bf16 main.py \
+accelerate launch --config-file ./accelerate_config/fsdp2.yaml \
+    main.py \
     --single_model \
     --teacher_name_or_path 'meta-llama/Llama-2-7b-hf' \
     --output_dir ./llama2-7b-sft \
-    --logging_steps 5 \
-    --max_steps 400 \
-    --max_seq_length 2048 \
+    --max_length 2048 \
     --per_device_train_batch_size 1 \
     --per_device_eval_batch_size 4 \
-    --gradient_checkpointing True \
-    --fsdp 'full_shard auto_wrap' \
-    --fsdp_transformer_layer_cls_to_wrap LlamaDecoderLayer
+    --max_steps 400 \
+    --logging_steps 5
 ```

 ### Distill teacher into student

 ```bash
-accelerate launch --multi_gpu --mixed_precision bf16 main.py \
+accelerate launch --config-file ./accelerate_config/fsdp2.yaml \
+    --fsdp_cpu_ram_efficient_loading False \
+    --fsdp_activation_checkpointing False \
+    main.py \
     --teacher_name_or_path ./llama2-7b-sft \
     --student_name_or_path 'TinyLlama/TinyLlama-1.1B-intermediate-step-1431k-3T' \
     --output_dir ./llama2-distill \
-    --logging_steps 5 \
-    --max_steps 200 \
-    --max_seq_length 2048 \
+    --max_length 2048 \
     --per_device_train_batch_size 1 \
     --per_device_eval_batch_size 4 \
-    --gradient_checkpointing False \
-    --fsdp 'full_shard auto_wrap' \
-    --fsdp_transformer_layer_cls_to_wrap LlamaDecoderLayer
+    --max_steps 200 \
+    --logging_steps 5
 ```

 > [!NOTE]
examples/llm_distill/accelerate_config/fsdp2.yaml

Lines changed: 25 additions & 0 deletions

@@ -0,0 +1,25 @@
+compute_environment: LOCAL_MACHINE
+debug: false
+distributed_type: FSDP
+downcast_bf16: 'no'
+enable_cpu_affinity: false
+fsdp_config:
+  fsdp_activation_checkpointing: true
+  fsdp_auto_wrap_policy: TRANSFORMER_BASED_WRAP
+  fsdp_cpu_ram_efficient_loading: true
+  fsdp_offload_params: false
+  fsdp_reshard_after_forward: true
+  fsdp_state_dict_type: SHARDED_STATE_DICT
+  fsdp_transformer_layer_cls_to_wrap: LlamaDecoderLayer
+  fsdp_version: 2
+machine_rank: 0
+main_training_function: main
+mixed_precision: bf16
+num_machines: 1
+num_processes: gpu
+rdzv_backend: static
+same_network: true
+tpu_env: []
+tpu_use_cluster: false
+tpu_use_sudo: false
+use_cpu: false
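
The README commands above consume this file through `accelerate launch --config-file`, and the distillation command additionally overrides two of its `fsdp_config` values from the command line. A sketch of that pattern, run from `examples/llm_distill/` (the flag values here are illustrative; only the flags already used in the README are assumed):

```bash
# Reuse the shared FSDP2 config, but disable activation checkpointing for this run.
accelerate launch --config-file ./accelerate_config/fsdp2.yaml \
    --fsdp_activation_checkpointing False \
    main.py \
    --single_model \
    --teacher_name_or_path 'meta-llama/Llama-2-7b-hf' \
    --output_dir ./llama2-7b-sft \
    --max_length 2048 \
    --per_device_train_batch_size 1 \
    --max_steps 400
```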

examples/llm_distill/main.py

Lines changed: 14 additions & 28 deletions
@@ -21,7 +21,6 @@
 import torch
 import torch.distributed
 import transformers
-from accelerate import PartialState
 from accelerate.logging import get_logger
 from transformers import AutoTokenizer
 from trl import SFTTrainer

@@ -48,38 +47,28 @@ class TrainingArguments(transformers.TrainingArguments):
     do_train: bool = True
     do_eval: bool = True
     save_strategy: str = "no"
-    max_seq_length: int = 1024
+    max_length: int = 1024
     optim: str = "adamw_torch"
     learning_rate: float = 1e-5
     lr_scheduler_type: str = "cosine"
     dataloader_drop_last: bool = True
     dataset_num_proc: int = 8
-    dataset_batch_size: int = 500
     bf16: bool = True
     tf32: bool = True


 def llama_text_format_func(sample):
-    texts = []
-    for p, q, r in zip(sample["system_prompt"], sample["question"], sample["response"]):
-        if not p:
-            texts.append(f"<s>[INST] {q}[/INST]\n{r}</s>")
-        else:
-            texts.append(f"<s>[INST] <<SYS>>{p}<</SYS>>\n{q}[/INST]\n{r}</s>")
-    return texts
+    p, q, r = sample["system_prompt"], sample["question"], sample["response"]
+    if not p:
+        return f"<s>[INST] {q}[/INST]\n{r}</s>"
+    else:
+        return f"<s>[INST] <<SYS>>{p}<</SYS>>\n{q}[/INST]\n{r}</s>"


 class KDSFTTrainer(SFTTrainer, KDTrainer):
     pass


-def _teacher_factory(model_name_or_path):
-    return transformers.AutoModelForCausalLM.from_pretrained(
-        model_name_or_path,
-        device_map=PartialState().process_index,
-    )
-
-
 def train():
     parser = transformers.HfArgumentParser((ModelArguments, TrainingArguments))
     model_args, training_args = parser.parse_args_into_dataclasses()

@@ -117,34 +106,31 @@ def train():

     if model_args.single_model:
         logger.info("Loading single model only...")
-        model = _teacher_factory(model_path)
+        model = transformers.AutoModelForCausalLM.from_pretrained(
+            model_path, dtype=torch.bfloat16 if training_args.bf16 else None
+        )
         logger.info("Model loaded.")
     else:
         logger.info("Loading student model...")
         model = transformers.AutoModelForCausalLM.from_pretrained(
-            model_args.student_name_or_path,
-            device_map=PartialState().process_index,
+            model_args.student_name_or_path, dtype=torch.bfloat16 if training_args.bf16 else None
         )
         logger.info("Student loaded.")
         # Load checkpoint
         logger.info("Loading teacher model and converting to Distillation model...")
+        teacher_model = transformers.AutoModelForCausalLM.from_pretrained(
+            model_args.teacher_name_or_path, dtype=torch.bfloat16 if training_args.bf16 else None
+        )
         kd_config = {
-            "teacher_model": (
-                _teacher_factory,
-                (model_args.teacher_name_or_path,),
-                {},
-            ),
+            "teacher_model": teacher_model,
             "criterion": LMLogitsLoss(),
-            "expose_minimal_state_dict": False,  # FSDP forces us to disable this
         }
         model = mtd.convert(model, mode=[("kd_loss", kd_config)])
         logger.info("Models converted.")

     # Fix problematic settings that logger.info excessive warnings
     model.generation_config.temperature = None
     model.generation_config.top_p = None
-    if training_args.gradient_checkpointing:
-        training_args.gradient_checkpointing_kwargs = {"use_reentrant": False}

     # Trainer
     trainer_cls = SFTTrainer if model_args.single_model else KDSFTTrainer

examples/llm_distill/requirements.txt

Lines changed: 1 addition & 1 deletion

@@ -1,2 +1,2 @@
 pyarrow
-trl==0.13.0
+trl>=0.23.0

examples/llm_qat/README.md

Lines changed: 1 addition & 0 deletions
@@ -11,6 +11,7 @@ Quantization Aware Training (QAT) helps to improve the model accuracy beyond pos
 | Support Matrix | View the support matrix to see quantization compatibility and feature availability across different models | \[[Link](#support-matrix)\] | |
 | End to End QAT | Example scripts demonstrating quantization techniques for optimizing Hugging Face models | \[[Link](#end-to-end-qat-example)\] | \[[docs](https://nvidia.github.io/TensorRT-Model-Optimizer/guides/1_quantization.html)\] |
 | End to End QAD | Example scripts demonstrating quantization aware distillation techniques for optimizing Hugging Face models | \[[Link](#end-to-end-qad-example)\] | \[[docs](https://nvidia.github.io/TensorRT-Model-Optimizer/guides/1_quantization.html)\] |
+| NeMo QAT/QAD Simplified Flow | Example script demonstrating end-to-end QAT/QAD in NeMo | \[[Link](../nemo_run/qat/README.md)\] | |
 | Evaluate Accuracy | Evaluating model accuracy after QAT/QAD (with fake quantization) | \[[Link](#testing-qat-model-with-llm-benchmarks-for-accuracy-evaluation)\] | |
 | Deployment | Deploying the model after QAT/QAD | \[[Link](#deployment)\] | |
 | QLoRA | Model training with reduced GPU memory | \[[Link](#end-to-end-qlora-with-real-quantization)\] | |

Lines changed: 60 additions & 0 deletions
@@ -0,0 +1,60 @@
+# SPDX-FileCopyrightText: Copyright (c) 2023-2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+import argparse
+
+from nemo.collections.llm.modelopt import setup_trainer_and_restore_model_with_modelopt_spec
+
+from modelopt.torch.export.plugins.nemo_run import _get_most_recent_ckpt
+from modelopt.torch.utils.plugins.megatron_mmlu import megatron_mmlu
+
+
+def parse_args():
+    parser = argparse.ArgumentParser(
+        description=(
+            "Run MMLU evaluation with ModelOpt Megatron model. Provide either --nemo_ckpt"
+            "or --finetuned_ckpt_dir"
+        )
+    )
+    group = parser.add_mutually_exclusive_group(required=True)
+    group.add_argument("--nemo_ckpt", type=str, required=False, help="Path to NeMo checkpoint.")
+    group.add_argument(
+        "--finetuned_ckpt_dir",
+        required=False,
+        type=str,
+        help="Checkpoint directory of 1 or more finetuned models",
+    )
+    parser.add_argument(
+        "--tensor_parallelism", type=int, default=1, help="Tensor parallelism size."
+    )
+    parser.add_argument(
+        "--pipeline_parallelism", type=int, default=1, help="Pipeline parallelism size."
+    )
+    return parser.parse_args()
+
+
+if __name__ == "__main__":
+    args = parse_args()
+    ckpt_path = args.nemo_ckpt
+    if args.finetuned_ckpt_dir:
+        ckpt_path = _get_most_recent_ckpt(args.finetuned_ckpt_dir)
+    model, trainer = setup_trainer_and_restore_model_with_modelopt_spec(
+        ckpt_path,
+        tensor_model_parallel_size=args.tensor_parallelism,
+        pipeline_model_parallel_size=args.pipeline_parallelism,
+        devices=args.tensor_parallelism * args.pipeline_parallelism,
+    )
+    tokenizer = model.tokenizer.tokenizer
+    megatron_mmlu(model.module, tokenizer)
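
A usage sketch for this new evaluation script. Its path is not shown in this view, so `megatron_mmlu_eval.py` below is a placeholder name; the flags are the ones defined in `parse_args` above, and the checkpoint paths are illustrative.

```bash
# Placeholder script name; substitute the actual file path added in this commit.
# Evaluate a specific NeMo checkpoint:
python megatron_mmlu_eval.py --nemo_ckpt /path/to/nemo_checkpoint

# Or evaluate the most recent checkpoint found in a finetuning output directory:
python megatron_mmlu_eval.py --finetuned_ckpt_dir /path/to/finetuned_ckpts
```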
File renamed without changes.
