CHANGELOG.rst (1 addition, 0 deletions)

@@ -9,6 +9,7 @@ Model Optimizer Changelog (Linux)
 **New Features**

 - Add flag ``op_types_to_exclude_fp16`` in ONNX quantization to exclude ops from being converted to FP16/BF16. Alternatively, for custom TensorRT ops, this can also be done by indicating ``'fp32'`` precision in ``trt_plugins_precision``.
+- Add LoRA mode support for MCore in a new peft submodule: ``modelopt.torch.peft.update_model(model, LORA_CFG)``.
 - Support PTQ and fakequant in vLLM for fast evaluation of arbitrary quantization formats. See ``examples/vllm_serve`` for more details.
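The changelog entry only names the new entry point. A minimal sketch of how it might be called is shown below; only `modelopt.torch.peft.update_model` is taken from the changelog, while the `LORA_CFG` layout and the surrounding code are illustrative assumptions, not the library's documented defaults.

```python
# Hypothetical usage sketch for the new LoRA mode; only update_model() comes from
# the changelog, everything else (config layout, rank/scale values) is assumed.
import modelopt.torch.peft as mtp

# Assumed config shape: which modules to adapt and with what LoRA rank/scale.
LORA_CFG = {
    "adapter_type": "lora",
    "adapter_cfg": {"*": {"rank": 32, "scale": 1.0, "enable": True}},
}

def apply_lora(model):
    """Inject LoRA adapters into an already-built Megatron-Core model in place."""
    return mtp.update_model(model, LORA_CFG)  # call named in the changelog entry
```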
README.md (1 addition, 0 deletions)

@@ -26,6 +26,7 @@ Model Optimizer is also integrated with [NVIDIA NeMo](https://github.com/NVIDIA-

 ## Latest News

+- [2025/10/07] [Pruning and Distilling LLMs Using NVIDIA TensorRT Model Optimizer](https://developer.nvidia.com/blog/pruning-and-distilling-llms-using-nvidia-tensorrt-model-optimizer/)
 - [2025/09/17] [An Introduction to Speculative Decoding for Reducing Latency in AI Inference](https://developer.nvidia.com/blog/an-introduction-to-speculative-decoding-for-reducing-latency-in-ai-inference/)
 - [2025/09/11] [How Quantization Aware Training Enables Low-Precision Accuracy Recovery](https://developer.nvidia.com/blog/how-quantization-aware-training-enables-low-precision-accuracy-recovery/)
 - [2025/08/29] [Fine-Tuning gpt-oss for Accuracy and Performance with Quantization Aware Training](https://developer.nvidia.com/blog/fine-tuning-gpt-oss-for-accuracy-and-performance-with-quantization-aware-training/)
docs/source/deployment/1_tensorrt_llm.rst (5 additions, 2 deletions)

@@ -2,12 +2,15 @@
 TensorRT-LLM
 ==========================

+**Deprecation Notice**: The export_tensorrt_llm_checkpoint API will be deprecated in future releases. Users are encouraged to transition to the :doc:`unified HF export API <3_unified_hf>`, which provides enhanced functionality and flexibility for exporting models to multiple inference frameworks including TensorRT-LLM, vLLM, and SGLang.
+
 .. note::

-    Please read the `TensorRT-LLM checkpoint workflow <https://github.com/NVIDIA/TensorRT-LLM/blob/main/docs/source/architecture/checkpoint.md>`_
+    Please read the `TensorRT-LLM checkpoint workflow <https://github.com/NVIDIA/TensorRT-LLM/blob/main/docs/source/legacy/architecture/checkpoint.md>`_
     first before going through this section.

+
 ModelOpt toolkit supports automatic conversion of ModelOpt exported LLM to the TensorRT-LLM checkpoint and the engines for accelerated inferencing.

 This conversion is achieved by:
@@ -144,4 +147,4 @@ If the :meth:`export_tensorrt_llm_checkpoint <modelopt.torch.export.model_config
 Convert to TensorRT-LLM
 =======================

-Once the TensorRT-LLM checkpoint is available, please follow the `TensorRT-LLM build API <https://github.com/NVIDIA/TensorRT-LLM/blob/main/docs/source/architecture/workflow.md#build-apis>`_ to build and deploy the quantized LLM.
+Once the TensorRT-LLM checkpoint is available, please follow the `TensorRT-LLM build API <https://github.com/NVIDIA/TensorRT-LLM/blob/main/docs/source/legacy/architecture/workflow.md#build-apis>`_ to build and deploy the quantized LLM.
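Since the deprecation notice in the hunks above points users at the unified HF export path, a brief sketch of that flow may help. It assumes `export_hf_checkpoint` from `modelopt.torch.export` and an already-quantized model; the keyword name and the example model ID are assumptions, so check the unified HF export docs for the exact signature.

```python
# Sketch: export a ModelOpt-quantized HF model with the unified export API
# instead of the deprecated export_tensorrt_llm_checkpoint. The export_dir
# keyword and model ID are assumptions for illustration.
from transformers import AutoModelForCausalLM
from modelopt.torch.export import export_hf_checkpoint

model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")
# ... apply ModelOpt quantization to `model` here ...
export_hf_checkpoint(model, export_dir="exported_ckpt")  # consumable by TensorRT-LLM, vLLM, SGLang
```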
@@ -82,20 +84,35 @@ The saved modelopt checkpoint is similar in architecture to HF models. It can be

 ## Training Draft Model with Offline Base Model

-For large models, you can export intermediate hidden states to disk and train only the draft model. This significantly reduces GPU memory requirements, but requires several to tens of terabytes of storage depending on dataset size.
+For large models, you can export intermediate hidden states to disk and train only the draft model. This significantly reduces GPU memory requirements, but requires several to tens of terabytes of disk storage depending on dataset size.
+
+### Dumping Hidden States to Disk
+
+We support two backends for generating base model hidden states. For better efficiency, it is recommended to use TRT-LLM:

[...]

-See [`run_hf_compute_hiddens_dp.sh`](./collect_hidden_states/run_hf_compute_hiddens_dp.sh) for a simple example using data parallelism (DP) to accelerate hidden state generation.
+**NOTE**: See [`run_hf_compute_hiddens_dp.sh`](./collect_hidden_states/run_hf_compute_hiddens_dp.sh) and [`run_trtllm_compute_hiddens_dp.sh`](./collect_hidden_states/run_trtllm_compute_hiddens_dp.sh) for a simple example using data parallelism (DP) to accelerate hidden state generation.
+
+### Train Draft Model with Dumped Hidden States

-Then, train draft model with `--offline-data` argument:
+Once we finish dumping hidden states, launch offline training with an extra `--offline-data` argument:

 ```bash
 ./launch_train.sh --model $BASE_MODEL \
@@ -109,13 +126,13 @@ Then, train draft model with `--offline-data` argument:

 ## Model Validation

-After training draft model, we can evaluate the saved modelopt checkpoint on MT-bench by:
+For online training checkpoints, we can run in-framework evaluation on MT-bench:

 ```bash
-python ar_validate.py --model_path $OUTPUT_DIR
+python ar_validate.py --model_path $ONLINE_CKPT
 ```

-Alternatively, we can export the checkpoint and run evaluation on serving frameworks. See sections below.
+**Note**: In-framework evaluation is supported only for online training. For offline training checkpoints, please export the model and evaluate it using serving frameworks.

 ## Export
@@ -168,6 +185,28 @@ See more details on deployment of quantized model to TRTLLM [here](../llm_ptq/RE

 ## Advanced Usage

+### Other Datasets
+
+In addition to `daring-anteater`, we provide scripts for adding several other commonly used datasets in `prepare_input_conversations`:
+
+```text
+prepare_input_conversations/
+├── add_daring_anteater.py
+├── add_mtbench.py
+├── add_sharegpt.py
+├── add_ultrachat.py
+└── example_make_prompt_dataset.sh
+```
+
+To use your own datasets, please preprocess your data into a `.jsonl` file with each line in the format:
+
+```json
+{
+    "conversation_id": <unique id>,
+    "conversations": [{"role": <user or assistant>, "content": <content>}]
+}
+```
+
 ### Data Synthesis

 To achieve higher acceptance rates during speculative decoding, it is beneficial to use conversations generated by the base model as training data. This ensures that the draft model's output distribution closely aligns with that of the base model.
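To make the `.jsonl` schema from the Other Datasets hunk above concrete, here is a small sketch that writes conversations in that format. The input list, IDs, and output file name are made up for illustration; only the record layout comes from the diff.

```python
# Sketch: convert an in-memory list of chats into the .jsonl layout expected by
# the training scripts (one {"conversation_id", "conversations"} object per line).
import json

chats = [  # illustrative data, not from the repository
    [{"role": "user", "content": "What is speculative decoding?"},
     {"role": "assistant", "content": "A technique that drafts tokens cheaply and verifies them with the base model."}],
]

with open("my_dataset.jsonl", "w") as f:
    for idx, turns in enumerate(chats):
        record = {"conversation_id": idx, "conversations": turns}
        f.write(json.dumps(record) + "\n")
```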
@@ -184,7 +223,7 @@ Note: Add `--quantization=modelopt` flag for quantized models.
 Then, we generate conversations with the base model using prompts from Daring-Anteater:
 To add a system prompt, use the `--system_prompt <system_prompt_text>` argument.
@@ -196,7 +235,7 @@ For large scale data generation, please see [SLURM prepare data](SLURM_prepare_d
 We can optionally use a smaller vocab size for the draft model for faster training and inference. E.g., Llama3.2-1B has a vocab size of 128256. In this example, we construct a draft vocab mapping of size 32k by finding the most commonly appearing vocabs in our training set:

[...]

 This will produce a `d2t.pt` file in `save_dir`, which is the mapping from draft token to target token. During inference, draft tokens can be mapped back to target tokens by `target_token = draft_token + d2t[draft_token]`.
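The draft-to-target mapping described above is just an index offset table. A short sketch of applying it at inference time follows; the `d2t.pt` file name and the offset formula come from the text, while the example token IDs and the assumption that the file holds a single integer tensor are illustrative.

```python
# Sketch: map draft-vocab token ids back to the base model's vocab using d2t.pt,
# following target_token = draft_token + d2t[draft_token] from the README.
import torch

d2t = torch.load("d2t.pt")                    # assumed: integer tensor of shape [draft_vocab_size]
draft_tokens = torch.tensor([17, 893, 4051])  # illustrative draft-model outputs
target_tokens = draft_tokens + d2t[draft_tokens]
print(target_tokens)                          # ids in the base model's full vocab
```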