
Commit 9f4fc6c

Merge branch 'main' into 448/6
2 parents 906ce5c + 4df4091 commit 9f4fc6c


81 files changed: +3877 -464 lines changed


.github/CODEOWNERS

Lines changed: 1 addition & 0 deletions
@@ -22,6 +22,7 @@ modelopt/torch/distill @NVIDIA/modelopt-torch-distill-codeowners
 modelopt/torch/export @NVIDIA/modelopt-torch-export-codeowners
 modelopt/torch/nas @NVIDIA/modelopt-torch-nas-prune-codeowners
 modelopt/torch/opt @NVIDIA/modelopt-torch-opt-codeowners
+modelopt/torch/peft @NVIDIA/modelopt-torch-peft-codeowners
 modelopt/torch/prune @NVIDIA/modelopt-torch-nas-prune-codeowners
 modelopt/torch/quantization @NVIDIA/modelopt-torch-quantization-codeowners
 modelopt/torch/sparsity @NVIDIA/modelopt-torch-sparsity-codeowners

.gitlab/tests.yml

Lines changed: 1 addition & 9 deletions
@@ -54,20 +54,12 @@ example-torch:
   timeout: 30m
   parallel:
     matrix:
-      - EXAMPLE: [llm_distill, llm_sparsity, speculative_decoding]
+      - EXAMPLE: [llm_distill, llm_qat, llm_sparsity, speculative_decoding]
   script:
     - pip install ".[hf,dev-test]"
     - find examples/$EXAMPLE -name "requirements.txt" | while read req_file; do pip install -r "$req_file" || exit 1; done
     - pytest -s tests/examples/$EXAMPLE
 
-# TODO: Fix llm_qat test hang in GitLab CI
-example-failing:
-  extends: example-torch
-  allow_failure: true
-  parallel:
-    matrix:
-      - EXAMPLE: [llm_qat]
-
 example-trtllm:
   extends: example-torch
   timeout: 60m

CHANGELOG.rst

Lines changed: 9 additions & 2 deletions
@@ -1,17 +1,24 @@
 Model Optimizer Changelog (Linux)
 =================================
 
-0.39 (2025-10-xx)
+0.39 (2025-11-xx)
 ^^^^^^^^^^^^^^^^^
 
 **Deprecations**
 
 **New Features**
 
 - Add flag ``op_types_to_exclude_fp16`` in ONNX quantization to exclude ops from being converted to FP16/BF16. Alternatively, for custom TensorRT ops, this can also be done by indicating ``'fp32'`` precision in ``trt_plugins_precision``.
+- Add LoRA mode support for MCore in a new peft submodule: ``modelopt.torch.peft.update_model(model, LORA_CFG)``.
 - Support PTQ and fakequant in vLLM for fast evaluation of arbitrary quantization formats. See ``examples/vllm_serve`` for more details.
+- Add support for ``nemotron-post-training-dataset-v2`` and ``nemotron-post-training-dataset-v1`` in ``examples/llm_ptq``. Default to a mix of ``cnn_dailymail`` and ``nemotron-post-training-dataset-v2`` if no dataset is specified.
+- Allow specifying ``calib_seq`` in ``examples/llm_ptq`` to set the maximum sequence length for calibration.
 
-0.37 (2025-09-xx)
+**Documentation**
+
+- Add general guidelines for Minitron pruning and distillation. See `examples/pruning/README.md <https://github.com/NVIDIA/TensorRT-Model-Optimizer/tree/main/examples/pruning#pruning-guidelines>`_ for more details.
+
+0.37 (2025-10-08)
 ^^^^^^^^^^^^^^^^^
 
 **Deprecations**
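For readers curious what the new peft entry point might look like in practice, here is a minimal, hypothetical sketch. Only the ``modelopt.torch.peft.update_model(model, LORA_CFG)`` call itself comes from the changelog; the config key names below are illustrative assumptions, not the ``LORA_CFG`` shipped with the release.

```python
import modelopt.torch.peft as mtpf

# Illustrative LoRA config; the key names and values here are assumptions.
LORA_CFG = {
    "adapter_type": "lora",
    "adapter_cfg": {"rank": 32},
}


def add_lora_adapters(mcore_model):
    """Attach LoRA adapters to an existing Megatron-Core model (sketch)."""
    # Entry point documented in the changelog entry above.
    return mtpf.update_model(mcore_model, LORA_CFG)
```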

README.md

Lines changed: 1 addition & 0 deletions
@@ -26,6 +26,7 @@ Model Optimizer is also integrated with [NVIDIA NeMo](https://github.com/NVIDIA-
 
 ## Latest News
 
+- [2025/10/07] [Pruning and Distilling LLMs Using NVIDIA TensorRT Model Optimizer](https://developer.nvidia.com/blog/pruning-and-distilling-llms-using-nvidia-tensorrt-model-optimizer/)
 - [2025/09/17] [An Introduction to Speculative Decoding for Reducing Latency in AI Inference](https://developer.nvidia.com/blog/an-introduction-to-speculative-decoding-for-reducing-latency-in-ai-inference/)
 - [2025/09/11] [How Quantization Aware Training Enables Low-Precision Accuracy Recovery](https://developer.nvidia.com/blog/how-quantization-aware-training-enables-low-precision-accuracy-recovery/)
 - [2025/08/29] [Fine-Tuning gpt-oss for Accuracy and Performance with Quantization Aware Training](https://developer.nvidia.com/blog/fine-tuning-gpt-oss-for-accuracy-and-performance-with-quantization-aware-training/)

docs/source/deployment/1_tensorrt_llm.rst

Lines changed: 5 additions & 2 deletions
@@ -2,12 +2,15 @@
 TensorRT-LLM
 ==========================
 
+**Deprecation Notice**: The export_tensorrt_llm_checkpoint API will be deprecated in future releases. Users are encouraged to transition to the :doc:`unified HF export API <3_unified_hf>`, which provides enhanced functionality and flexibility for exporting models to multiple inference frameworks including TensorRT-LLM, vLLM, and SGLang.
+
 .. note::
 
-    Please read the `TensorRT-LLM checkpoint workflow <https://github.com/NVIDIA/TensorRT-LLM/blob/main/docs/source/architecture/checkpoint.md>`_
+    Please read the `TensorRT-LLM checkpoint workflow <https://github.com/NVIDIA/TensorRT-LLM/blob/main/docs/source/legacy/architecture/checkpoint.md>`_
     first before going through this section.
 
 
+
 ModelOpt toolkit supports automatic conversion of ModelOpt exported LLM to the TensorRT-LLM checkpoint and the engines for accelerated inferencing.
 
 This conversion is achieved by:
@@ -144,4 +147,4 @@ If the :meth:`export_tensorrt_llm_checkpoint <modelopt.torch.export.model_config
 Convert to TensorRT-LLM
 =======================
 
-Once the TensorRT-LLM checkpoint is available, please follow the `TensorRT-LLM build API <https://github.com/NVIDIA/TensorRT-LLM/blob/main/docs/source/architecture/workflow.md#build-apis>`_ to build and deploy the quantized LLM.
+Once the TensorRT-LLM checkpoint is available, please follow the `TensorRT-LLM build API <https://github.com/NVIDIA/TensorRT-LLM/blob/main/docs/source/legacy/architecture/workflow.md#build-apis>`_ to build and deploy the quantized LLM.
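Since the new deprecation notice points users at the unified HF export path, a hedged sketch of that flow may be useful. It assumes the ``modelopt.torch.quantization`` and ``modelopt.torch.export.export_hf_checkpoint`` APIs behave as named below; the model id is a placeholder and the calibration loop is a toy stand-in.

```python
# Sketch of the unified HF export flow recommended by the deprecation notice above.
# Assumptions: mtq.quantize / mtq.FP8_DEFAULT_CFG / export_hf_checkpoint exist as named,
# the model id is a placeholder, and the calibration loop is a toy stand-in.
import modelopt.torch.quantization as mtq
from modelopt.torch.export import export_hf_checkpoint
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-3.2-1B"  # placeholder
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="cuda")
tokenizer = AutoTokenizer.from_pretrained(model_id)


def calibrate(m):
    # Real calibration should iterate over a representative dataset.
    batch = tokenizer("Calibration sample text.", return_tensors="pt").to(m.device)
    m(**batch)


model = mtq.quantize(model, mtq.FP8_DEFAULT_CFG, forward_loop=calibrate)
export_hf_checkpoint(model, export_dir="exported_hf_ckpt")  # consumable by TensorRT-LLM, vLLM, SGLang
```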

docs/source/guides/7_nas.rst

Lines changed: 9 additions & 0 deletions
@@ -635,3 +635,12 @@ The difference between NAS and pruning is summarized below.
     increased training time.
   - May provide similar performance to NAS in particular applications, however, usually exhibits
     worse performance due to the limited search space and training time.
+
+
+[Advanced] Adding a new NAS/Prune Algorithm
+===========================================
+
+* Please refer to this `template <https://github.com/NVIDIA/TensorRT-Model-Optimizer/compare/template/new-nas-mode>`_
+  for adding a new NAS algorithm.
+* Please refer to `mcore_minitron.py <https://github.com/NVIDIA/TensorRT-Model-Optimizer/blob/main/modelopt/torch/prune/plugins/mcore_minitron.py>`_
+  for an actual example of adding Minitron Pruning algorithm.

examples/diffusers/cache_diffusion/requirements.txt

Lines changed: 1 addition & 0 deletions
@@ -3,3 +3,4 @@ opencv-python>=4.8.1.78,<4.12.0.88
 peft>=0.10.0
 polygraphy==0.49.9
 sentencepiece
+transformers<4.57

examples/llm_distill/README.md

Lines changed: 10 additions & 35 deletions
@@ -49,8 +49,8 @@ First obtain both a pretrained model to act as the teacher and a (usually smalle
 from transformers import AutoModelForCausalLM
 
 # Define student & teacher
-student_model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")
-teacher_model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-70B-Instruct")
+student_model = AutoModelForCausalLM.from_pretrained("student-model-id-or-path")
+teacher_model = AutoModelForCausalLM.from_pretrained("teacher-model-id-or-path")
 ```
 
 ### Set up the meta model
@@ -149,52 +149,27 @@ You can also look at the NeMo tutorial notebooks [here](https://github.com/NVIDI
 
 ## Knowledge Distillation (KD) for HuggingFace Models
 
-In this e2e example we finetune Llama-2 models on the [OpenOrca](https://huggingface.co/datasets/Open-Orca/OpenOrca)
-question-answer dataset as a minimal example to demonstrate a simple way of integrating Model Optimizer's KD feature.
+In this e2e example we finetune Llama-3.2 models on the [smol-smoltalk-Interaction-SFT](https://huggingface.co/datasets/ReactiveAI/smol-smoltalk-Interaction-SFT)
+dataset as a minimal example to demonstrate a simple way of integrating Model Optimizer's KD feature.
 
-First we do supervised finetuning (SFT) of a Llama-2-7b on OpenOrca dataset as the teacher, then distill it into
-a 1B-parameter model.
-
-Keep in mind the training loss of the distillation run is not directly comparable to the training loss of the teacher run.
+We replace normal supervised finetuning (SFT) of a Llama-3.2-1B base model by distilling information from Llama-3.2-3B-Instruct which has already been instruction-finetuned.
 
 > [!NOTE]
 > We can fit the following in memory using [FSDP](https://huggingface.co/docs/accelerate/en/usage_guides/fsdp) enabled on 8x RTX 6000 (total ~400GB VRAM)
 
-### Train teacher
-
-```bash
-accelerate launch --config-file ./accelerate_config/fsdp2.yaml \
-    main.py \
-    --single_model \
-    --teacher_name_or_path 'meta-llama/Llama-2-7b-hf' \
-    --output_dir ./llama2-7b-sft \
-    --max_length 2048 \
-    --per_device_train_batch_size 1 \
-    --per_device_eval_batch_size 4 \
-    --max_steps 400 \
-    --logging_steps 5
-```
-
-### Distill teacher into student
-
 ```bash
 accelerate launch --config-file ./accelerate_config/fsdp2.yaml \
-    --fsdp_cpu_ram_efficient_loading False \
-    --fsdp_activation_checkpointing False \
     main.py \
-    --teacher_name_or_path ./llama2-7b-sft \
-    --student_name_or_path 'TinyLlama/TinyLlama-1.1B-intermediate-step-1431k-3T' \
-    --output_dir ./llama2-distill \
+    --teacher_name_or_path 'meta-llama/Llama-3.2-3B-Instruct' \
+    --student_name_or_path 'meta-llama/Llama-3.2-1B' \
+    --output_dir ./llama3.2-distill \
     --max_length 2048 \
-    --per_device_train_batch_size 1 \
-    --per_device_eval_batch_size 4 \
+    --per_device_train_batch_size 4 \
+    --per_device_eval_batch_size 8 \
     --max_steps 200 \
    --logging_steps 5
 ```
 
-> [!NOTE]
-> If you receive a `RuntimeError: unable to open file <...> in read-only mode: No such file or directory` simply re-run the command a second time.
-
 ## Resources
 
 - 📅 [Roadmap](https://github.com/NVIDIA/TensorRT-Model-Optimizer/issues/146)
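Under the hood, the `main.py` updated in this commit wires the two models together with ModelOpt's distillation mode. Stripped to its core, the conversion looks roughly like the sketch below; the model ids mirror the README command above, the `mtd` alias is assumed to be `modelopt.torch.distill` (as used in `main.py`), and the SFT/accelerate plumbing is omitted.

```python
# Core of the KD setup used by examples/llm_distill/main.py, reduced to a sketch;
# model ids are placeholders from the README command, training plumbing omitted.
import torch
from transformers import AutoModelForCausalLM

import modelopt.torch.distill as mtd  # assumed alias for the `mtd` used in main.py
from modelopt.torch.distill.plugins.huggingface import LMLogitsLoss

student = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.2-1B", dtype=torch.bfloat16)
teacher = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.2-3B-Instruct", dtype=torch.bfloat16)

# Wrap student + teacher into a single distillation "meta model" with a logits-matching loss.
kd_config = {"teacher_model": teacher, "criterion": LMLogitsLoss()}
model = mtd.convert(student, mode=[("kd_loss", kd_config)])

# `model` can now be trained with KDSFTTrainer (SFTTrainer + KDTrainer) as in main.py,
# and the student is saved at the end via trainer.save_model(..., export_student=True).
```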

examples/llm_distill/accelerate_config/fsdp2.yaml

Lines changed: 2 additions & 2 deletions
@@ -4,9 +4,9 @@ distributed_type: FSDP
 downcast_bf16: 'no'
 enable_cpu_affinity: false
 fsdp_config:
-  fsdp_activation_checkpointing: true
+  fsdp_activation_checkpointing: false
   fsdp_auto_wrap_policy: TRANSFORMER_BASED_WRAP
-  fsdp_cpu_ram_efficient_loading: true
+  fsdp_cpu_ram_efficient_loading: false
   fsdp_offload_params: false
   fsdp_reshard_after_forward: true
   fsdp_state_dict_type: SHARDED_STATE_DICT

examples/llm_distill/main.py

Lines changed: 35 additions & 42 deletions
@@ -13,10 +13,11 @@
 # See the License for the specific language governing permissions and
 # limitations under the License.
 
-import logging
 import os
 from dataclasses import dataclass
 
+os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"
+
 import datasets
 import torch
 import torch.distributed
@@ -29,17 +30,13 @@
 import modelopt.torch.opt as mto
 from modelopt.torch.distill.plugins.huggingface import KDTrainer, LMLogitsLoss
 
-os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:512"
-
-logger = get_logger(__name__)
-logging.basicConfig(level=logging.INFO)
+logger = get_logger(__name__, log_level="INFO")
 
 
 @dataclass
 class ModelArguments:
     teacher_name_or_path: str | None = None
     student_name_or_path: str | None = None
-    single_model: bool = False
 
 
 @dataclass
@@ -57,12 +54,14 @@ class TrainingArguments(transformers.TrainingArguments):
     tf32: bool = True
 
 
-def llama_text_format_func(sample):
-    p, q, r = sample["system_prompt"], sample["question"], sample["response"]
-    if not p:
-        return f"<s>[INST] {q}[/INST]\n{r}</s>"
-    else:
-        return f"<s>[INST] <<SYS>>{p}<</SYS>>\n{q}[/INST]\n{r}</s>"
+def _format_smoltalk_chat_template(sample, tokenizer):
+    # smol-smoltalk-Interaction-SFT dataset has "query" and "answer" fields
+    # Convert them to messages format and use tokenizer's apply_chat_template
+    messages = [
+        {"role": "user", "content": sample["query"]},
+        {"role": "assistant", "content": sample["answer"]},
+    ]
+    return tokenizer.apply_chat_template(messages, tokenize=False)
 
 
 class KDSFTTrainer(SFTTrainer, KDTrainer):
@@ -91,55 +90,50 @@ def train():
         f"Using {int(num_accum_steps)} grad accumulation steps for effective batchsize of {total_batch_size}."
     )
 
+    # Dataset
     logger.info("Loading dataset...")
-    dset = datasets.load_dataset("Open-Orca/OpenOrca", split="train")
-    dset_splits = dset.train_test_split(train_size=25600, test_size=1700, seed=420)
+    dset = datasets.load_dataset("ReactiveAI/smol-smoltalk-Interaction-SFT", split="train")
+    dset_splits = dset.train_test_split(train_size=12800, test_size=1280, seed=420)
     dset_train, dset_eval = dset_splits["train"], dset_splits["test"]
     logger.info("Dataset loaded.")
 
+    # Tokenizer
     logger.info("Loading tokenizer...")
     model_path = model_args.teacher_name_or_path or model_args.student_name_or_path
     tokenizer = AutoTokenizer.from_pretrained(model_path, use_fast=True)
     tokenizer.pad_token = tokenizer.eos_token
     tokenizer.padding_side = "right"
     logger.info("Tokenizer loaded.")
 
-    if model_args.single_model:
-        logger.info("Loading single model only...")
-        model = transformers.AutoModelForCausalLM.from_pretrained(
-            model_path, dtype=torch.bfloat16 if training_args.bf16 else None
-        )
-        logger.info("Model loaded.")
-    else:
-        logger.info("Loading student model...")
-        model = transformers.AutoModelForCausalLM.from_pretrained(
-            model_args.student_name_or_path, dtype=torch.bfloat16 if training_args.bf16 else None
-        )
-        logger.info("Student loaded.")
-        # Load checkpoint
-        logger.info("Loading teacher model and converting to Distillation model...")
-        teacher_model = transformers.AutoModelForCausalLM.from_pretrained(
-            model_args.teacher_name_or_path, dtype=torch.bfloat16 if training_args.bf16 else None
-        )
-        kd_config = {
-            "teacher_model": teacher_model,
-            "criterion": LMLogitsLoss(),
-        }
-        model = mtd.convert(model, mode=[("kd_loss", kd_config)])
-        logger.info("Models converted.")
+    # Model
+    logger.info("Loading student model...")
+    model = transformers.AutoModelForCausalLM.from_pretrained(
+        model_args.student_name_or_path, dtype=torch.bfloat16 if training_args.bf16 else None
+    )
+    logger.info("Student loaded.")
+    # Load checkpoint
+    logger.info("Loading teacher model and converting to Distillation model...")
+    teacher_model = transformers.AutoModelForCausalLM.from_pretrained(
+        model_args.teacher_name_or_path, dtype=torch.bfloat16 if training_args.bf16 else None
+    )
+    kd_config = {
+        "teacher_model": teacher_model,
+        "criterion": LMLogitsLoss(),
+    }
+    model = mtd.convert(model, mode=[("kd_loss", kd_config)])
+    logger.info("Models converted.")
 
     # Fix problematic settings that logger.info excessive warnings
     model.generation_config.temperature = None
     model.generation_config.top_p = None
 
     # Trainer
-    trainer_cls = SFTTrainer if model_args.single_model else KDSFTTrainer
-    trainer = trainer_cls(
+    trainer = KDSFTTrainer(
        model,
         training_args,
         train_dataset=dset_train,
         eval_dataset=dset_eval,
-        formatting_func=llama_text_format_func,
+        formatting_func=lambda sample: _format_smoltalk_chat_template(sample, tokenizer),
         processing_class=tokenizer,
     )
 
@@ -159,8 +153,7 @@ def train():
     # Save checkpoint
     logger.info("Saving checkpoint...")
     trainer.save_state()
-    kwargs = {"export_student": True} if not model_args.single_model else {}
-    trainer.save_model(trainer.args.output_dir, **kwargs)
+    trainer.save_model(trainer.args.output_dir, export_student=True)
     logger.info("Checkpoint saved.")