
Commit 92f430f

Add files for 0.27.0 release
Parent: 3c48b94

293 files changed, +10169 −5045 lines


.dockerignore

Lines changed: 1 addition & 1 deletion
@@ -1,6 +1,6 @@
 docker
 examples/**/.git
-examples/llm_ptq/saved_models*
+examples/**/saved_models*
 **/experimental
 
 ##### Copied from .gitignore #####

.github/ISSUE_TEMPLATE/1_bug_report.md

Lines changed: 18 additions & 17 deletions
@@ -20,6 +20,23 @@ assignees: ''
 
 ## System information
 
+- Container used (if applicable): ?
+- OS (e.g., Ubuntu 22.04, CentOS 7, Windows 10): ? <!-- If Windows, please add the `windows` label to the issue. -->
+- CPU architecture (x86_64, aarch64): ?
+- GPU name (e.g. H100, A100, L40S): ?
+- GPU memory size: ?
+- Number of GPUs: ?
+- Library versions (if applicable):
+  - Python: ?
+  - ModelOpt version or commit hash: ?
+  - CUDA: ?
+  - PyTorch: ?
+  - Transformers: ?
+  - TensorRT-LLM: ?
+  - ONNXRuntime: ?
+  - TensorRT: ?
+- Any other details that may help: ?
+
 
 <details>
 <summary><b>Click to expand: Python script to automatically collect system information</b></summary>
@@ -103,24 +120,8 @@ print(" - Transformers: " + get_package_version("transformers"))
 print(" - TensorRT-LLM: " + get_package_version("tensorrt_llm"))
 print(" - ONNXRuntime: " + get_package_version("onnxruntime"))
 print(" - TensorRT: " + get_package_version("tensorrt"))
+print("- Any other details that may help: " + "?")
 print("=" * 70)
 ```
 
 </details>
-
-- Container used (if applicable): ?
-- OS (e.g., Ubuntu 22.04, CentOS 7, Windows 10): ? <!-- If Windows, please add the `windows` label to the issue. -->
-- CPU architecture (x86_64, aarch64): ?
-- GPU name (e.g. H100, A100, L40S): ?
-- GPU memory size: ?
-- Number of GPUs: ?
-- Library versions (if applicable):
-  - Python: ?
-  - ModelOpt version or commit hash: ?
-  - CUDA: ?
-  - PyTorch: ?
-  - Transformers: ?
-  - TensorRT-LLM: ?
-  - ONNXRuntime: ?
-  - TensorRT: ?
-- Any other details that may help: ?
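For context, the collection script referenced in this template relies on a small `get_package_version` helper. The sketch below is a hypothetical reconstruction of such a helper (not copied from the template); the use of `importlib.metadata` is an assumption.

```python
# Hypothetical sketch of a get_package_version helper like the one the
# template's collection script calls; the actual implementation may differ.
from importlib.metadata import PackageNotFoundError, version


def get_package_version(package_name: str) -> str:
    """Return the installed version of a package, or '?' if it is not installed."""
    try:
        return version(package_name)
    except PackageNotFoundError:
        return "?"


print(" - Transformers: " + get_package_version("transformers"))
print(" - TensorRT-LLM: " + get_package_version("tensorrt_llm"))
```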

.github/ISSUE_TEMPLATE/2_feature_request.md

Lines changed: 1 addition & 1 deletion
@@ -17,5 +17,5 @@ assignees: ''
 ### Describe alternatives you've considered
 
 
-### Target Hardware/Use Case
+### Target hardware/use case
 <!-- Target hardware/use case this feature will be used for -->

.pre-commit-config.yaml

Lines changed: 2 additions & 3 deletions
@@ -96,6 +96,8 @@ repos:
 modelopt/torch/speculative/eagle/utils.py|
 modelopt/torch/speculative/plugins/transformers.py|
 examples/chained_optimizations/bert_prune_distill_quantize.py|
+examples/deepseek/quantize_to_nvfp4.py|
+examples/deepseek/ptq.py|
 examples/diffusers/cache_diffusion/pipeline/models/sdxl.py|
 examples/diffusers/quantization/onnx_utils/export.py|
 examples/llm_eval/gen_model_answer.py|
@@ -108,8 +110,6 @@ repos:
 examples/speculative_decoding/main.py|
 examples/speculative_decoding/medusa_utils.py|
 examples/speculative_decoding/vllm_generate.py|
-examples/deepseek/quantize_to_nvfp4.py|
-examples/deepseek/ptq.py|
 )$
 
 # Default hook for Apache 2.0 in core c/c++/cuda files
@@ -154,4 +154,3 @@ repos:
 - id: lychee
   args: ["--no-progress", "--exclude-loopback"]
   stages: [manual] # Only run with `pre-commit run --all-files --hook-stage manual lychee`
-  exclude: internal/

CHANGELOG.rst

Lines changed: 24 additions & 0 deletions
@@ -1,6 +1,30 @@
 Model Optimizer Changelog (Linux)
 =================================
 
+0.27 (2025-04-03)
+^^^^^^^^^^^^^^^^^
+
+**Backward Breaking Changes**
+
+- Deprecate real quantization configs; please use the :meth:`mtq.compress <modelopt.torch.quantization.compress>` API for model compression after quantization.
+
+**New Features**
+
+- Blockwise FP8 quantization support in unified model export.
+- Add quantization support to the Transformer Engine Linear module.
+- Add support for SVDQuant. Currently, only simulation is available; real deployment (for example, TensorRT deployment) support is coming soon.
+- To support expert-parallel (EP) resume from distributed checkpoints, ``modelopt_state`` in Megatron Core distributed checkpoints (used in NeMo and Megatron-LM) is now stored differently. The legacy ``modelopt_state`` in distributed checkpoints generated by previous ModelOpt versions can still be loaded in 0.27 and 0.29, but will need to be re-saved in the new format.
+- Add a Triton-based NVFP4 quantization kernel that delivers an approximately 40% performance improvement over the previous implementation.
+- Add a new API, :meth:`mtq.compress <modelopt.torch.quantization.compress>`, for compressing model weights after quantization.
+- Add an option to simplify ONNX models before quantization is performed.
+- (Experimental) Improve support for ONNX models with custom TensorRT ops:
+  - Add support for the ``--calibration_shapes`` flag.
+  - Add automatic type and shape tensor propagation for full ORT support with the TensorRT EP.
+
+**Known Issues**
+
+- Quantization of T5 models is broken. Please use ``nvidia-modelopt==0.25.0`` with ``transformers<4.50`` in the meantime.
+
 0.25 (2025-03-03)
 ^^^^^^^^^^^^^^^^^
 
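To illustrate the quantize-then-compress flow introduced in this release, here is a minimal sketch of calling :meth:`mtq.quantize` followed by the new :meth:`mtq.compress`. The model name, calibration prompts, and the choice of the NVFP4 config are placeholders, not part of this commit.

```python
# Minimal sketch of PTQ followed by the new mtq.compress API (0.27).
# Model name, calibration prompts, and config choice are illustrative only.
import modelopt.torch.quantization as mtq
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "<your-hf-model>"  # placeholder
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).cuda()

calib_prompts = ["Hello, world!", "Quantization reduces memory footprint."]


def forward_loop(model):
    # Run a few representative samples through the model to collect calibration statistics.
    for prompt in calib_prompts:
        inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
        model(**inputs)


# Post-training quantization with a built-in config.
model = mtq.quantize(model, mtq.NVFP4_DEFAULT_CFG, forward_loop)

# New in 0.27: compress the quantized weights, replacing the deprecated
# real-quantization configs.
mtq.compress(model)
```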

README.md

Lines changed: 8 additions & 7 deletions
@@ -18,6 +18,7 @@
 
 ## Latest News
 
+- [2025/03/18] [World's Fastest DeepSeek-R1 Inference with Blackwell FP4 & Increasing Image Generation Efficiency on Blackwell](https://developer.nvidia.com/blog/nvidia-blackwell-delivers-world-record-deepseek-r1-inference-performance/)
 - [2025/02/25] Model Optimizer quantized NVFP4 models available on Hugging Face for download: [DeepSeek-R1-FP4](https://huggingface.co/nvidia/DeepSeek-R1-FP4), [Llama-3.3-70B-Instruct-FP4](https://huggingface.co/nvidia/Llama-3.3-70B-Instruct-FP4), [Llama-3.1-405B-Instruct-FP4](https://huggingface.co/nvidia/Llama-3.1-405B-Instruct-FP4)
 - [2025/01/28] Model Optimizer has added support for NVFP4. Check out an example of NVFP4 PTQ [here](./examples/llm_ptq/README.md#model-quantization-and-trt-llm-conversion).
 - [2025/01/28] Model Optimizer is now open source!
@@ -44,12 +45,12 @@
 - [Model Optimizer Overview](#model-optimizer-overview)
 - [Installation](#installation--docker)
 - [Techniques](#techniques)
-  - [Quantization](#quantization)
+  - [Quantization](#quantization-examples-docs)
   - [Quantized Checkpoints](#quantized-checkpoints)
-  - [Pruning](#pruning)
-  - [Distillation](#distillation)
-  - [Speculative Decoding](#speculative-decoding)
-  - [Sparsity](#sparsity)
+  - [Pruning](#pruning-examples-docs)
+  - [Distillation](#distillation-examples-docs)
+  - [Speculative Decoding](#speculative-decoding-examples-docs)
+  - [Sparsity](#sparsity-examples-docs)
 - [Examples](#examples)
 - [Support Matrix](#model-support-matrix)
 - [Benchmark](#benchmark)
@@ -114,7 +115,7 @@ Below is a short description of the techniques supported by Model Optimizer.
 
 ### Quantization \[[examples](./examples/README.md#quantization)\] \[[docs](https://nvidia.github.io/TensorRT-Model-Optimizer/guides/1_quantization.html)\]
 
-Quantization is an effective model optimization technique for large models. Quantization with Model Optimizer can compress model size by 2x-4x, speeding up inference while preserving model quality. Model Optimizer enables highly performant quantization formats including NVFP4, FP8, INT8, INT4, etc and supports advanced algorithms such as SmoothQuant, AWQ, and Double Quantization with easy-to-use Python APIs. Both Post-training quantization (PTQ) and Quantization-aware training (QAT) are supported.
+Quantization is an effective model optimization technique for large models. Quantization with Model Optimizer can compress model size by 2x-4x, speeding up inference while preserving model quality. Model Optimizer enables highly performant quantization formats including NVFP4, FP8, INT8, INT4, etc and supports advanced algorithms such as SmoothQuant, AWQ, SVDQuant, and Double Quantization with easy-to-use Python APIs. Both Post-training quantization (PTQ) and Quantization-aware training (QAT) are supported.
 
 #### Quantized Checkpoints
 
@@ -158,7 +159,7 @@ Please find the [detailed performance benchmarks](./examples/benchmark.md).
 
 ## Roadmap
 
-Please see our [product roadmap](https://github.com/NVIDIA/TensorRT-Model-Optimizer/issues/108).
+Please see our [product roadmap](https://github.com/NVIDIA/TensorRT-Model-Optimizer/issues/146).
 
 ## Release Notes
 

docker/Dockerfile

Lines changed: 7 additions & 4 deletions
@@ -2,15 +2,17 @@ FROM nvidia/cuda:12.8.0-devel-ubuntu22.04
 
 WORKDIR /workspace
 
-RUN apt-get update && apt-get -y install python3.10 python3-pip python-is-python3 openmpi-bin libopenmpi-dev wget git git-lfs unzip jq cmake
+RUN apt-get update && \
+    apt-get -y install python3.10 python3-pip python-is-python3 openmpi-bin libopenmpi-dev libgl1 libglib2.0-0 wget git git-lfs unzip jq cmake vim && \
+    rm -rf /var/lib/apt/lists/*
 
 ARG PIP_EXTRA_INDEX_URL="https://pypi.nvidia.com"
 ENV PIP_EXTRA_INDEX_URL=$PIP_EXTRA_INDEX_URL
 ENV PIP_NO_CACHE_DIR=off
 
 # Install the latest setuptools using pip
-RUN rm -rf /usr/lib/python3/dist-packages/setuptools*
-RUN pip install --upgrade pip setuptools
+RUN rm -rf /usr/lib/python3/dist-packages/setuptools* && \
+    pip install --upgrade pip setuptools
 
 # Install TensorRT-LLM
 ARG TRT_LLM_VERSION=0.17.0
@@ -44,9 +46,10 @@ RUN python -c "import modelopt.torch.quantization.extensions as ext; ext.precomp
 
 # Find and install requirements.txt files for all examples excluding windows
 COPY . TensorRT-Model-Optimizer
+RUN rm -rf TensorRT-Model-Optimizer/.git
 RUN find TensorRT-Model-Optimizer/examples -name "requirements.txt" | grep -v "windows" | while read req_file; do \
     echo "Installing from $req_file"; \
-    pip install -r "$req_file"; \
+    pip install -r "$req_file" || exit 1; \
     done
 
 # Allow users to run without root

docs/source/deployment/3_unified_hf.rst

Lines changed: 20 additions & 0 deletions
@@ -32,6 +32,26 @@ The export API (:meth:`export_hf_checkpoint <modelopt.torch.export.unified_expor
     export_dir, # The directory where the exported files will be stored.
 )
 
+Deployment Support Matrix
+==============================================
+
+Currently, we support the following quantization formats with the unified HF export API:
+#. FP8
+#. FP8_PB
+#. NVFP4
+#. NVFP4_AWQ
+#. INT4_AWQ
+#. W4A8_AWQ
+
+For deployment with TensorRT-LLM, we support Llama 3.1, Llama 3.3, and Mixtral 8x7B with FP8 and NVFP4 checkpoints; Medusa and Eagle FP8 checkpoints are also tested.
+
+For deployment with vLLM, we support Llama 3.1, Llama 3.3, and Mixtral 8x7B with FP8 checkpoints.
+
+For deployment with SGLang, we support Llama 3.1 and Llama 3.3 with FP8 checkpoints.
+
+Other models and quantization formats may work, but they are not thoroughly tested.
+
+
 Deployment with Selected Inference Frameworks
 ==============================================
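As a rough usage sketch of the export API referenced in this page (a minimal example under assumptions: the model name, export directory, and FP8 config choice are placeholders, and the `export_hf_checkpoint` import path is taken from the docs above):

```python
# Minimal sketch: quantize a Hugging Face model and export a unified HF checkpoint.
# Model name, export directory, and the FP8 config choice are placeholders.
import modelopt.torch.quantization as mtq
from modelopt.torch.export import export_hf_checkpoint
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("<your-hf-model>").cuda()

# Calibration (forward_loop) is omitted here for brevity; a real run should
# calibrate on representative data before exporting.
model = mtq.quantize(model, mtq.FP8_DEFAULT_CFG, forward_loop=None)

export_hf_checkpoint(
    model,
    export_dir="<export-dir>",  # The directory where the exported files will be stored.
)
```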

docs/source/getting_started/1_overview.rst

Lines changed: 1 addition & 1 deletion
@@ -23,7 +23,7 @@ Quantization
 ^^^^^^^^^^^^
 Quantization is an effective model optimization technique for large models. Quantization with Model Optimizer can compress
 model size by 2x-4x, speeding up inference while preserving model quality. Model Optimizer enables highly performant
-quantization formats including NVFP4, FP8, INT8, INT4, etc and supports advanced algorithms such as SmoothQuant, AWQ, and
+quantization formats including NVFP4, FP8, INT8, INT4, etc and supports advanced algorithms such as SmoothQuant, AWQ, SVDQuant, and
 Double Quantization with easy-to-use Python APIs. Both Post-training quantization (PTQ) and Quantization-aware training (QAT)
 are supported. Visit :meth:`Quantization Format page <modelopt.torch.quantization.config>`
 for list of formats supported.

docs/source/getting_started/3_quantization.rst

Lines changed: 1 addition & 1 deletion
@@ -9,7 +9,7 @@ Quantization is an effective technique to reduce the memory footprint of deep le
 accelerate the inference speed.
 
 ModelOpt's :meth:`mtq.quantize() <modelopt.torch.quantization.model_quant.quantize>` API enables
-users to quantize a model with advanced algorithms like SmoothQuant, AWQ, and more. ModelOpt
+users to quantize a model with advanced algorithms like SmoothQuant, AWQ, SVDQuant, and more. ModelOpt
 supports both Post Training Quantization (PTQ) and Quantization Aware Training (QAT).
 
 .. tip::
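To make the mtq.quantize() flow described in this page concrete, here is a minimal PTQ sketch on a toy PyTorch module. The model, random calibration data, and the INT8 default config are placeholder choices, not taken from the docs themselves.

```python
# Minimal PTQ sketch for a plain PyTorch module; toy model and data are placeholders.
import torch
import torch.nn as nn

import modelopt.torch.quantization as mtq

model = nn.Sequential(nn.Linear(64, 128), nn.ReLU(), nn.Linear(128, 10))


def forward_loop(model):
    # Feed a handful of representative batches so the calibrators can observe activations.
    for _ in range(8):
        model(torch.randn(4, 64))


# Insert quantizers, calibrate on the forward loop, and return the quantized model.
model = mtq.quantize(model, mtq.INT8_DEFAULT_CFG, forward_loop)
```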
