CHANGELOG.rst: 1 addition & 1 deletion
@@ -9,13 +9,13 @@ Model Optimizer Changelog (Linux)
  - Deprecated ``quantize_mode`` argument in ``examples/onnx_ptq/evaluate.py`` to support strong typing. Use ``engine_precision`` instead.
  - Deprecated TRT-LLM's TRT backend in ``examples/llm_ptq`` and ``examples/vlm_ptq``. Support for the ``build`` and ``benchmark`` tasks is removed and replaced with ``quant``. For performance evaluation, please use ``trtllm-bench`` directly.
  - The ``--export_fmt`` flag in ``examples/llm_ptq`` is removed. By default we export to the unified Hugging Face checkpoint format.
- - The ``int8_sq`` quantization format is deprecated in ``examples/vlm_ptq`` following TensorRT-LLM's switch to the torch backend. Please refer to previous releases if this quantization format is needed.
  - Deprecated ``examples/vlm_eval`` as it depends on TRT-LLM's deprecated TRT backend.

  **New Features**

  - ``high_precision_dtype`` now defaults to fp16 in ONNX quantization, i.e. quantized output model weights are FP16 by default.
  - Upgrade TensorRT-LLM dependency to 1.1.0rc2.
+ - Support Phi-4-multimodal and Qwen2.5-VL quantized HF checkpoint export in ``examples/vlm_ptq``.
README.md: 2 additions & 0 deletions
@@ -26,6 +26,8 @@ Model Optimizer is also integrated with [NVIDIA NeMo](https://github.com/NVIDIA-
 
  ## Latest News
 
+ - [2025/09/17] [An Introduction to Speculative Decoding for Reducing Latency in AI Inference](https://developer.nvidia.com/blog/an-introduction-to-speculative-decoding-for-reducing-latency-in-ai-inference/)
+ - [2025/09/11] [How Quantization Aware Training Enables Low-Precision Accuracy Recovery](https://developer.nvidia.com/blog/how-quantization-aware-training-enables-low-precision-accuracy-recovery/)
  - [2025/08/29] [Fine-Tuning gpt-oss for Accuracy and Performance with Quantization Aware Training](https://developer.nvidia.com/blog/fine-tuning-gpt-oss-for-accuracy-and-performance-with-quantization-aware-training/)
  - [2025/08/01] [Optimizing LLMs for Performance and Accuracy with Post-Training Quantization](https://developer.nvidia.com/blog/optimizing-llms-for-performance-and-accuracy-with-post-training-quantization/)
  - [2025/06/24] [Introducing NVFP4 for Efficient and Accurate Low-Precision Inference](https://developer.nvidia.com/blog/introducing-nvfp4-for-efficient-and-accurate-low-precision-inference/)
      # Export model in original class, with only previously-present attributes
      model_exported = mtd.export(distillation_model)

-  .. note::
-      The config requires a (non-lambda) Callable to return a teacher model in place of the model
-      itself. This is to avoid re-saving the teacher state dict upon saving the Distillation
-      meta model. Thus, the same callable must be available in the namespace when restoring via
-      the :meth:`mto.restore <modelopt.torch.opt.conversion.restore>` utility.
-
 
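For context, here is a minimal sketch of the callable-based flow that the export snippet and the (removed) note above describe. It assumes the ``mtd.convert(..., mode=[("kd_loss", config)])`` entry point and the ``teacher_model``/``criterion``/``loss_balancer`` config keys mentioned later in this diff; the toy models, ``make_teacher`` helper, checkpoint path, and ``mto.save``/``mto.restore`` usage are illustrative assumptions rather than text from the docs:

```python
import torch.nn as nn
import modelopt.torch.distill as mtd
import modelopt.torch.opt as mto


def make_teacher() -> nn.Module:
    # A plain, module-level callable (not a lambda) so it can be re-imported when the
    # checkpoint is restored; only the callable, not the teacher's state dict, travels
    # with the distillation metadata.
    return nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 10))


student = nn.Sequential(nn.Linear(32, 32), nn.ReLU(), nn.Linear(32, 10))

kd_config = {
    "teacher_model": make_teacher,              # callable returning an nn.Module
    "criterion": mtd.LogitsDistillationLoss(),  # distillation loss between student/teacher outputs
    "loss_balancer": None,                      # or a balancer that combines KD and task losses
}
distillation_model = mtd.convert(student, mode=[("kd_loss", kd_config)])

# ... training loop ...

mto.save(distillation_model, "distill_ckpt.pth")

# Export the student back to its original class, with only previously-present attributes:
model_exported = mtd.export(distillation_model)

# Restoring later only works if `make_teacher` is importable in the namespace:
# restored = mto.restore(
#     nn.Sequential(nn.Linear(32, 32), nn.ReLU(), nn.Linear(32, 10)), "distill_ckpt.pth"
# )
```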
  .. tip::
      When training the student on a small corpus of ground-truth data, consider using :class:`MFTLoss <modelopt.torch.distill.MFTLoss>` to perform Minifinetuning in lieu of the standard
      :class:`LogitsDistillationLoss <modelopt.torch.distill.losses.LogitsDistillationLoss>`. This allows the student to learn from the teacher's distribution while adapting to the new data, improving specialization on the new data without overwriting the teacher's general knowledge.
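If that tip applies, the criterion in the sketch above can simply be swapped. The no-argument constructor call below is an assumption, since MFTLoss's parameters are not shown in this diff:

```python
from modelopt.torch.distill import MFTLoss

# Continuing the sketch above: use Minifinetuning instead of plain logits distillation.
kd_config["criterion"] = MFTLoss()  # constructor arguments assumed; check the class docs
distillation_model = mtd.convert(student, mode=[("kd_loss", kd_config)])
```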
@@ -170,10 +164,12 @@ outputs in the same order as well:
  The intermediate outputs for the losses are captured by the
  :class:`DistillationModel <modelopt.torch.distill.distillation_model.DistillationModel>` and then the loss(es) are
  invoked using :meth:`DistillationModel.compute_kd_loss() <modelopt.torch.distill.distillation_model.DistillationModel.compute_kd_loss>`.
- If present, the original student's non-distillation loss is passed in as an argument.
+ If present, the original student's non-distillation loss can be passed in as an argument.
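A rough sketch of the step described here, continuing from the config built earlier; the ``student_loss=`` keyword passed to ``compute_kd_loss()`` is an assumed name for the original (non-distillation) loss argument:

```python
import torch
import torch.nn.functional as F

optimizer = torch.optim.AdamW(distillation_model.parameters(), lr=1e-4)

inputs = torch.randn(8, 32)
labels = torch.randint(0, 10, (8,))

logits = distillation_model(inputs)          # student forward; teacher outputs are captured internally
task_loss = F.cross_entropy(logits, labels)  # the student's own (non-distillation) loss, optional

loss = distillation_model.compute_kd_loss(student_loss=task_loss)  # keyword name assumed
loss.backward()
optimizer.step()
optimizer.zero_grad()
```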
 
  Writing a custom loss function is often necessary, especially to handle outputs that need to be processed
- to obtain the logits and activations.
+ to obtain the logits and activations. Additional arguments to the loss function can be passed in to
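As a rough sketch of such a custom criterion, an ordinary module can reduce each output to logits before comparing them. The class name, the argument order, and the ``.logits`` attribute below are illustrative assumptions, not a loss shipped with the library:

```python
import torch.nn as nn
import torch.nn.functional as F


class OutputToLogitsKDLoss(nn.Module):
    """Illustrative custom criterion: unwraps outputs (e.g. objects with a `.logits`
    attribute) before applying a temperature-scaled KL divergence."""

    def __init__(self, temperature: float = 2.0):
        super().__init__()
        self.temperature = temperature

    def forward(self, student_out, teacher_out):  # argument order assumed
        s = getattr(student_out, "logits", student_out)
        t = getattr(teacher_out, "logits", teacher_out)
        kd = F.kl_div(
            F.log_softmax(s / self.temperature, dim=-1),
            F.softmax(t / self.temperature, dim=-1),
            reduction="batchmean",
        )
        return kd * self.temperature**2
```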
- The `teacher_model` can be either a callable which returns an `nn.Module` or a tuple of `(model_cls, args, kwargs)`. The `criterion` is the distillation loss used between student and teacher tensors. The `loss_balancer` determines how the original and distillation losses are combined (if needed).
+ The `teacher_model` can be either an `nn.Module`, a callable which returns an `nn.Module`, or a tuple of `(model_cls, args, kwargs)`. The `criterion` is the distillation loss used between student and teacher tensors. The `loss_balancer` determines how the original and distillation losses are combined (if needed).
 
  See [Distillation](https://nvidia.github.io/TensorRT-Model-Optimizer/guides/4_distillation.html) for more info.
 
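To make the three accepted `teacher_model` forms concrete, a short sketch follows; `TinyTeacher` and its constructor argument are made up for illustration:

```python
import torch.nn as nn


class TinyTeacher(nn.Module):
    def __init__(self, width: int = 64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(32, width), nn.ReLU(), nn.Linear(width, 10))

    def forward(self, x):
        return self.net(x)


teacher_as_module = TinyTeacher()                      # an nn.Module instance
teacher_as_callable = TinyTeacher                      # a callable returning an nn.Module
teacher_as_tuple = (TinyTeacher, (), {"width": 128})   # (model_cls, args, kwargs)
# Any of these can be used as the `teacher_model` value in the distillation config.
```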
@@ -158,35 +154,33 @@ Keep in mind the training loss of the distillation run is not directly comparabl
examples/llm_qat/README.md: 1 addition & 0 deletions
@@ -11,6 +11,7 @@ Quantization Aware Training (QAT) helps to improve the model accuracy beyond pos
  | Support Matrix | View the support matrix to see quantization compatibility and feature availability across different models |\[[Link](#support-matrix)\]||
  | End to End QAT | Example scripts demonstrating quantization techniques for optimizing Hugging Face models |\[[Link](#end-to-end-qat-example)\]|\[[docs](https://nvidia.github.io/TensorRT-Model-Optimizer/guides/1_quantization.html)\]|
  | End to End QAD | Example scripts demonstrating quantization aware distillation techniques for optimizing Hugging Face models |\[[Link](#end-to-end-qad-example)\]|\[[docs](https://nvidia.github.io/TensorRT-Model-Optimizer/guides/1_quantization.html)\]|
+ | NeMo QAT/QAD Simplified Flow | Example script demonstrating end-to-end QAT/QAD in NeMo |\[[Link](../nemo_run/qat/README.md)\]||
  | Evaluate Accuracy | Evaluating model accuracy after QAT/QAD (with fake quantization) |\[[Link](#testing-qat-model-with-llm-benchmarks-for-accuracy-evaluation)\]||
  | Deployment | Deploying the model after QAT/QAD |\[[Link](#deployment)\]||
  | QLoRA | Model training with reduced GPU memory |\[[Link](#end-to-end-qlora-with-real-quantization)\]||