
Commit 9f4fc6c

Merge branch 'main' into 448/6
2 parents 906ce5c + 4df4091 commit 9f4fc6c


81 files changed: +3877 -464 lines changed


.github/CODEOWNERS

Lines changed: 1 addition & 0 deletions
@@ -22,6 +22,7 @@ modelopt/torch/distill @NVIDIA/modelopt-torch-distill-codeowners
 modelopt/torch/export @NVIDIA/modelopt-torch-export-codeowners
 modelopt/torch/nas @NVIDIA/modelopt-torch-nas-prune-codeowners
 modelopt/torch/opt @NVIDIA/modelopt-torch-opt-codeowners
+modelopt/torch/peft @NVIDIA/modelopt-torch-peft-codeowners
 modelopt/torch/prune @NVIDIA/modelopt-torch-nas-prune-codeowners
 modelopt/torch/quantization @NVIDIA/modelopt-torch-quantization-codeowners
 modelopt/torch/sparsity @NVIDIA/modelopt-torch-sparsity-codeowners

.gitlab/tests.yml

Lines changed: 1 addition & 9 deletions
@@ -54,20 +54,12 @@ example-torch:
   timeout: 30m
   parallel:
     matrix:
-      - EXAMPLE: [llm_distill, llm_sparsity, speculative_decoding]
+      - EXAMPLE: [llm_distill, llm_qat, llm_sparsity, speculative_decoding]
   script:
     - pip install ".[hf,dev-test]"
     - find examples/$EXAMPLE -name "requirements.txt" | while read req_file; do pip install -r "$req_file" || exit 1; done
     - pytest -s tests/examples/$EXAMPLE
 
-# TODO: Fix llm_qat test hang in GitLab CI
-example-failing:
-  extends: example-torch
-  allow_failure: true
-  parallel:
-    matrix:
-      - EXAMPLE: [llm_qat]
-
 example-trtllm:
   extends: example-torch
   timeout: 60m

CHANGELOG.rst

Lines changed: 9 additions & 2 deletions
@@ -1,17 +1,24 @@
 Model Optimizer Changelog (Linux)
 =================================
 
-0.39 (2025-10-xx)
+0.39 (2025-11-xx)
 ^^^^^^^^^^^^^^^^^
 
 **Deprecations**
 
 **New Features**
 
 - Add flag ``op_types_to_exclude_fp16`` in ONNX quantization to exclude ops from being converted to FP16/BF16. Alternatively, for custom TensorRT ops, this can also be done by indicating ``'fp32'`` precision in ``trt_plugins_precision``.
+- Add LoRA mode support for MCore in a new peft submodule: ``modelopt.torch.peft.update_model(model, LORA_CFG)``.
 - Support PTQ and fakequant in vLLM for fast evaluation of arbitrary quantization formats. See ``examples/vllm_serve`` for more details.
+- Add support for ``nemotron-post-training-dataset-v2`` and ``nemotron-post-training-dataset-v1`` in ``examples/llm_ptq``. Default to a mix of ``cnn_dailymail`` and ``nemotron-post-training-dataset-v2`` if no dataset is specified.
+- Allow specifying ``calib_seq`` in ``examples/llm_ptq`` to set the maximum sequence length for calibration.
 
-0.37 (2025-09-xx)
+**Documentation**
+
+- Add general guidelines for Minitron pruning and distillation. See `examples/pruning/README.md <https://github.com/NVIDIA/TensorRT-Model-Optimizer/tree/main/examples/pruning#pruning-guidelines>`_ for more details.
+
+0.37 (2025-10-08)
 ^^^^^^^^^^^^^^^^^
 
 **Deprecations**
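For readers curious what the new peft entry point might look like in practice, here is a minimal, hypothetical sketch. Only the ``modelopt.torch.peft.update_model(model, LORA_CFG)`` call itself comes from the changelog; the config key names below are illustrative assumptions, not the ``LORA_CFG`` shipped with the release.

```python
import modelopt.torch.peft as mtpf

# Illustrative LoRA config; the key names and values here are assumptions.
LORA_CFG = {
    "adapter_type": "lora",
    "adapter_cfg": {"rank": 32},
}


def add_lora_adapters(mcore_model):
    """Attach LoRA adapters to an existing Megatron-Core model (sketch)."""
    # Entry point documented in the changelog entry above.
    return mtpf.update_model(mcore_model, LORA_CFG)
```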

README.md

Lines changed: 1 addition & 0 deletions
@@ -26,6 +26,7 @@ Model Optimizer is also integrated with [NVIDIA NeMo](https://github.com/NVIDIA-
 
 ## Latest News
 
+- [2025/10/07] [Pruning and Distilling LLMs Using NVIDIA TensorRT Model Optimizer](https://developer.nvidia.com/blog/pruning-and-distilling-llms-using-nvidia-tensorrt-model-optimizer/)
 - [2025/09/17] [An Introduction to Speculative Decoding for Reducing Latency in AI Inference](https://developer.nvidia.com/blog/an-introduction-to-speculative-decoding-for-reducing-latency-in-ai-inference/)
 - [2025/09/11] [How Quantization Aware Training Enables Low-Precision Accuracy Recovery](https://developer.nvidia.com/blog/how-quantization-aware-training-enables-low-precision-accuracy-recovery/)
 - [2025/08/29] [Fine-Tuning gpt-oss for Accuracy and Performance with Quantization Aware Training](https://developer.nvidia.com/blog/fine-tuning-gpt-oss-for-accuracy-and-performance-with-quantization-aware-training/)

docs/source/deployment/1_tensorrt_llm.rst

Lines changed: 5 additions & 2 deletions
@@ -2,12 +2,15 @@
 TensorRT-LLM
 ==========================
 
+**Deprecation Notice**: The export_tensorrt_llm_checkpoint API will be deprecated in future releases. Users are encouraged to transition to the :doc:`unified HF export API <3_unified_hf>`, which provides enhanced functionality and flexibility for exporting models to multiple inference frameworks including TensorRT-LLM, vLLM, and SGLang.
+
 .. note::
 
-    Please read the `TensorRT-LLM checkpoint workflow <https://github.com/NVIDIA/TensorRT-LLM/blob/main/docs/source/architecture/checkpoint.md>`_
+    Please read the `TensorRT-LLM checkpoint workflow <https://github.com/NVIDIA/TensorRT-LLM/blob/main/docs/source/legacy/architecture/checkpoint.md>`_
     first before going through this section.
 
 
+
 ModelOpt toolkit supports automatic conversion of ModelOpt exported LLM to the TensorRT-LLM checkpoint and the engines for accelerated inferencing.
 
 This conversion is achieved by:
@@ -144,4 +147,4 @@ If the :meth:`export_tensorrt_llm_checkpoint <modelopt.torch.export.model_config
 Convert to TensorRT-LLM
 =======================
 
-Once the TensorRT-LLM checkpoint is available, please follow the `TensorRT-LLM build API <https://github.com/NVIDIA/TensorRT-LLM/blob/main/docs/source/architecture/workflow.md#build-apis>`_ to build and deploy the quantized LLM.
+Once the TensorRT-LLM checkpoint is available, please follow the `TensorRT-LLM build API <https://github.com/NVIDIA/TensorRT-LLM/blob/main/docs/source/legacy/architecture/workflow.md#build-apis>`_ to build and deploy the quantized LLM.
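Since the new deprecation notice points users at the unified HF export path, a hedged sketch of that flow may be useful. It assumes the ``modelopt.torch.quantization`` and ``modelopt.torch.export.export_hf_checkpoint`` APIs behave as named below; the model id is a placeholder and the calibration loop is a toy stand-in.

```python
# Sketch of the unified HF export flow recommended by the deprecation notice above.
# Assumptions: mtq.quantize / mtq.FP8_DEFAULT_CFG / export_hf_checkpoint exist as named,
# the model id is a placeholder, and the calibration loop is a toy stand-in.
import modelopt.torch.quantization as mtq
from modelopt.torch.export import export_hf_checkpoint
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-3.2-1B"  # placeholder
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="cuda")
tokenizer = AutoTokenizer.from_pretrained(model_id)


def calibrate(m):
    # Real calibration should iterate over a representative dataset.
    batch = tokenizer("Calibration sample text.", return_tensors="pt").to(m.device)
    m(**batch)


model = mtq.quantize(model, mtq.FP8_DEFAULT_CFG, forward_loop=calibrate)
export_hf_checkpoint(model, export_dir="exported_hf_ckpt")  # consumable by TensorRT-LLM, vLLM, SGLang
```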

docs/source/guides/7_nas.rst

Lines changed: 9 additions & 0 deletions
@@ -635,3 +635,12 @@ The difference between NAS and pruning is summarized below.
     increased training time.
   - May provide similar performance to NAS in particular applications, however, usually exhibits
     worse performance due to the limited search space and training time.
+
+
+[Advanced] Adding a new NAS/Prune Algorithm
+===========================================
+
+* Please refer to this `template <https://github.com/NVIDIA/TensorRT-Model-Optimizer/compare/template/new-nas-mode>`_
+  for adding a new NAS algorithm.
+* Please refer to `mcore_minitron.py <https://github.com/NVIDIA/TensorRT-Model-Optimizer/blob/main/modelopt/torch/prune/plugins/mcore_minitron.py>`_
+  for an actual example of adding Minitron Pruning algorithm.

examples/diffusers/cache_diffusion/requirements.txt

Lines changed: 1 addition & 0 deletions
@@ -3,3 +3,4 @@ opencv-python>=4.8.1.78,<4.12.0.88
 peft>=0.10.0
 polygraphy==0.49.9
 sentencepiece
+transformers<4.57

examples/llm_distill/README.md

Lines changed: 10 additions & 35 deletions
@@ -49,8 +49,8 @@ First obtain both a pretrained model to act as the teacher and a (usually smalle
 from transformers import AutoModelForCausalLM
 
 # Define student & teacher
-student_model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")
-teacher_model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-70B-Instruct")
+student_model = AutoModelForCausalLM.from_pretrained("student-model-id-or-path")
+teacher_model = AutoModelForCausalLM.from_pretrained("teacher-model-id-or-path")
 ```
 
 ### Set up the meta model
@@ -149,52 +149,27 @@ You can also look at the NeMo tutorial notebooks [here](https://github.com/NVIDI
 
 ## Knowledge Distillation (KD) for HuggingFace Models
 
-In this e2e example we finetune Llama-2 models on the [OpenOrca](https://huggingface.co/datasets/Open-Orca/OpenOrca)
-question-answer dataset as a minimal example to demonstrate a simple way of integrating Model Optimizer's KD feature.
+In this e2e example we finetune Llama-3.2 models on the [smol-smoltalk-Interaction-SFT](https://huggingface.co/datasets/ReactiveAI/smol-smoltalk-Interaction-SFT)
+dataset as a minimal example to demonstrate a simple way of integrating Model Optimizer's KD feature.
 
-First we do supervised finetuning (SFT) of a Llama-2-7b on OpenOrca dataset as the teacher, then distill it into
-a 1B-parameter model.
-
-Keep in mind the training loss of the distillation run is not directly comparable to the training loss of the teacher run.
+We replace normal supervised finetuning (SFT) of a Llama-3.2-1B base model by distilling information from Llama-3.2-3B-Instruct which has already been instruction-finetuned.
 
 > [!NOTE]
 > We can fit the following in memory using [FSDP](https://huggingface.co/docs/accelerate/en/usage_guides/fsdp) enabled on 8x RTX 6000 (total ~400GB VRAM)
 
-### Train teacher
-
-```bash
-accelerate launch --config-file ./accelerate_config/fsdp2.yaml \
-    main.py \
-    --single_model \
-    --teacher_name_or_path 'meta-llama/Llama-2-7b-hf' \
-    --output_dir ./llama2-7b-sft \
-    --max_length 2048 \
-    --per_device_train_batch_size 1 \
-    --per_device_eval_batch_size 4 \
-    --max_steps 400 \
-    --logging_steps 5
-```
-
-### Distill teacher into student
-
 ```bash
 accelerate launch --config-file ./accelerate_config/fsdp2.yaml \
-    --fsdp_cpu_ram_efficient_loading False \
-    --fsdp_activation_checkpointing False \
     main.py \
-    --teacher_name_or_path ./llama2-7b-sft \
-    --student_name_or_path 'TinyLlama/TinyLlama-1.1B-intermediate-step-1431k-3T' \
-    --output_dir ./llama2-distill \
+    --teacher_name_or_path 'meta-llama/Llama-3.2-3B-Instruct' \
+    --student_name_or_path 'meta-llama/Llama-3.2-1B' \
+    --output_dir ./llama3.2-distill \
     --max_length 2048 \
-    --per_device_train_batch_size 1 \
-    --per_device_eval_batch_size 4 \
+    --per_device_train_batch_size 4 \
+    --per_device_eval_batch_size 8 \
     --max_steps 200 \
    --logging_steps 5
 ```
 
-> [!NOTE]
-> If you receive a `RuntimeError: unable to open file <...> in read-only mode: No such file or directory` simply re-run the command a second time.
-
 ## Resources
 
 - 📅 [Roadmap](https://github.com/NVIDIA/TensorRT-Model-Optimizer/issues/146)
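Under the hood, the `main.py` updated in this commit wires the two models together with ModelOpt's distillation mode. Stripped to its core, the conversion looks roughly like the sketch below; the model ids mirror the README command above, the `mtd` alias is assumed to be `modelopt.torch.distill` (as used in `main.py`), and the SFT/accelerate plumbing is omitted.

```python
# Core of the KD setup used by examples/llm_distill/main.py, reduced to a sketch;
# model ids are placeholders from the README command, training plumbing omitted.
import torch
from transformers import AutoModelForCausalLM

import modelopt.torch.distill as mtd  # assumed alias for the `mtd` used in main.py
from modelopt.torch.distill.plugins.huggingface import LMLogitsLoss

student = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.2-1B", dtype=torch.bfloat16)
teacher = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.2-3B-Instruct", dtype=torch.bfloat16)

# Wrap student + teacher into a single distillation "meta model" with a logits-matching loss.
kd_config = {"teacher_model": teacher, "criterion": LMLogitsLoss()}
model = mtd.convert(student, mode=[("kd_loss", kd_config)])

# `model` can now be trained with KDSFTTrainer (SFTTrainer + KDTrainer) as in main.py,
# and the student is saved at the end via trainer.save_model(..., export_student=True).
```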

examples/llm_distill/accelerate_config/fsdp2.yaml

Lines changed: 2 additions & 2 deletions
@@ -4,9 +4,9 @@ distributed_type: FSDP
 downcast_bf16: 'no'
 enable_cpu_affinity: false
 fsdp_config:
-  fsdp_activation_checkpointing: true
+  fsdp_activation_checkpointing: false
   fsdp_auto_wrap_policy: TRANSFORMER_BASED_WRAP
-  fsdp_cpu_ram_efficient_loading: true
+  fsdp_cpu_ram_efficient_loading: false
   fsdp_offload_params: false
   fsdp_reshard_after_forward: true
   fsdp_state_dict_type: SHARDED_STATE_DICT

examples/llm_distill/main.py

Lines changed: 35 additions & 42 deletions
@@ -13,10 +13,11 @@
 # See the License for the specific language governing permissions and
 # limitations under the License.
 
-import logging
 import os
 from dataclasses import dataclass
 
+os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"
+
 import datasets
 import torch
 import torch.distributed
@@ -29,17 +30,13 @@
 import modelopt.torch.opt as mto
 from modelopt.torch.distill.plugins.huggingface import KDTrainer, LMLogitsLoss
 
-os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:512"
-
-logger = get_logger(__name__)
-logging.basicConfig(level=logging.INFO)
+logger = get_logger(__name__, log_level="INFO")
 
 
 @dataclass
 class ModelArguments:
     teacher_name_or_path: str | None = None
     student_name_or_path: str | None = None
-    single_model: bool = False
 
 
 @dataclass
@@ -57,12 +54,14 @@ class TrainingArguments(transformers.TrainingArguments):
     tf32: bool = True
 
 
-def llama_text_format_func(sample):
-    p, q, r = sample["system_prompt"], sample["question"], sample["response"]
-    if not p:
-        return f"<s>[INST] {q}[/INST]\n{r}</s>"
-    else:
-        return f"<s>[INST] <<SYS>>{p}<</SYS>>\n{q}[/INST]\n{r}</s>"
+def _format_smoltalk_chat_template(sample, tokenizer):
+    # smol-smoltalk-Interaction-SFT dataset has "query" and "answer" fields
+    # Convert them to messages format and use tokenizer's apply_chat_template
+    messages = [
+        {"role": "user", "content": sample["query"]},
+        {"role": "assistant", "content": sample["answer"]},
+    ]
+    return tokenizer.apply_chat_template(messages, tokenize=False)
 
 
 class KDSFTTrainer(SFTTrainer, KDTrainer):
@@ -91,55 +90,50 @@ def train():
         f"Using {int(num_accum_steps)} grad accumulation steps for effective batchsize of {total_batch_size}."
     )
 
+    # Dataset
     logger.info("Loading dataset...")
-    dset = datasets.load_dataset("Open-Orca/OpenOrca", split="train")
-    dset_splits = dset.train_test_split(train_size=25600, test_size=1700, seed=420)
+    dset = datasets.load_dataset("ReactiveAI/smol-smoltalk-Interaction-SFT", split="train")
+    dset_splits = dset.train_test_split(train_size=12800, test_size=1280, seed=420)
     dset_train, dset_eval = dset_splits["train"], dset_splits["test"]
     logger.info("Dataset loaded.")
 
+    # Tokenizer
     logger.info("Loading tokenizer...")
     model_path = model_args.teacher_name_or_path or model_args.student_name_or_path
     tokenizer = AutoTokenizer.from_pretrained(model_path, use_fast=True)
     tokenizer.pad_token = tokenizer.eos_token
     tokenizer.padding_side = "right"
     logger.info("Tokenizer loaded.")
 
-    if model_args.single_model:
-        logger.info("Loading single model only...")
-        model = transformers.AutoModelForCausalLM.from_pretrained(
-            model_path, dtype=torch.bfloat16 if training_args.bf16 else None
-        )
-        logger.info("Model loaded.")
-    else:
-        logger.info("Loading student model...")
-        model = transformers.AutoModelForCausalLM.from_pretrained(
-            model_args.student_name_or_path, dtype=torch.bfloat16 if training_args.bf16 else None
-        )
-        logger.info("Student loaded.")
-        # Load checkpoint
-        logger.info("Loading teacher model and converting to Distillation model...")
-        teacher_model = transformers.AutoModelForCausalLM.from_pretrained(
-            model_args.teacher_name_or_path, dtype=torch.bfloat16 if training_args.bf16 else None
-        )
-        kd_config = {
-            "teacher_model": teacher_model,
-            "criterion": LMLogitsLoss(),
-        }
-        model = mtd.convert(model, mode=[("kd_loss", kd_config)])
-        logger.info("Models converted.")
+    # Model
+    logger.info("Loading student model...")
+    model = transformers.AutoModelForCausalLM.from_pretrained(
+        model_args.student_name_or_path, dtype=torch.bfloat16 if training_args.bf16 else None
+    )
+    logger.info("Student loaded.")
+    # Load checkpoint
+    logger.info("Loading teacher model and converting to Distillation model...")
+    teacher_model = transformers.AutoModelForCausalLM.from_pretrained(
+        model_args.teacher_name_or_path, dtype=torch.bfloat16 if training_args.bf16 else None
+    )
+    kd_config = {
+        "teacher_model": teacher_model,
+        "criterion": LMLogitsLoss(),
+    }
+    model = mtd.convert(model, mode=[("kd_loss", kd_config)])
+    logger.info("Models converted.")
 
     # Fix problematic settings that logger.info excessive warnings
     model.generation_config.temperature = None
     model.generation_config.top_p = None
 
     # Trainer
-    trainer_cls = SFTTrainer if model_args.single_model else KDSFTTrainer
-    trainer = trainer_cls(
+    trainer = KDSFTTrainer(
        model,
         training_args,
         train_dataset=dset_train,
         eval_dataset=dset_eval,
-        formatting_func=llama_text_format_func,
+        formatting_func=lambda sample: _format_smoltalk_chat_template(sample, tokenizer),
         processing_class=tokenizer,
     )
 
@@ -159,8 +153,7 @@ def train():
     # Save checkpoint
     logger.info("Saving checkpoint...")
     trainer.save_state()
-    kwargs = {"export_student": True} if not model_args.single_model else {}
-    trainer.save_model(trainer.args.output_dir, **kwargs)
+    trainer.save_model(trainer.args.output_dir, export_student=True)
     logger.info("Checkpoint saved.")