Commit 7a04743

Update for 0.29.0 release
1 parent e048fb2 commit 7a04743

243 files changed (+9507 lines added, -3553 lines removed)

.dockerignore

Lines changed: 3 additions & 7 deletions
@@ -41,8 +41,9 @@ docs/source/reference/generated
 env/
 venv/

-# mypy
+# Linters
 **/.mypy_cache
+**/.ruff_cache

 # Vscode
 .vscode/*
@@ -62,11 +63,6 @@ venv/
 **.safetensors
 **.bin
 **.pkl
+**.pickle
 **.tar.gz
 **.nemo
-
-# Ignore temporary files created by tox
-pyproject.toml.bak
-
-# Ignore git clones for tests
-medusa-vicuna-7b-v1.3/

.gitignore

Lines changed: 1 addition & 6 deletions
@@ -57,11 +57,6 @@ venv/
 **.safetensors
 **.bin
 **.pkl
+**.pickle
 **.tar.gz
 **.nemo
-
-# Ignore temporary files created by tox
-pyproject.toml.bak
-
-# Ignore git clones for tests
-medusa-vicuna-7b-v1.3/

.pre-commit-config.yaml

Lines changed: 6 additions & 5 deletions
@@ -30,24 +30,25 @@ repos:
       - id: trailing-whitespace

   - repo: https://github.com/executablebooks/mdformat
-    rev: 0.7.21
+    rev: 0.7.22
     hooks:
       - id: mdformat
+        exclude: ^.github/

   - repo: https://github.com/astral-sh/ruff-pre-commit
-    rev: v0.9.4
+    rev: v0.11.6
     hooks:
       - id: ruff
         args: [--fix, --exit-non-zero-on-fix]
       - id: ruff-format

   - repo: https://github.com/pre-commit/mirrors-mypy
-    rev: v1.14.1
+    rev: v1.15.0
     hooks:
       - id: mypy

   - repo: https://github.com/pre-commit/mirrors-clang-format
-    rev: v16.0.4
+    rev: v20.1.0
     hooks:
       - id: clang-format
         types_or: [c++, c, c#, cuda, java, javascript, objective-c, proto] # no json!
@@ -135,7 +136,7 @@ repos:
         types_or: [shell]

   - repo: https://github.com/keith/pre-commit-buildifier
-    rev: 8.0.1
+    rev: 8.0.3
     hooks:
       - id: buildifier
       - id: buildifier-lint

CHANGELOG.rst

Lines changed: 35 additions & 5 deletions
@@ -1,20 +1,50 @@
 Model Optimizer Changelog (Linux)
 =================================

-0.27 (2025-04-03)
+0.29 (2025-05-08)
 ^^^^^^^^^^^^^^^^^

 **Backward Breaking Changes**

+- Refactor ``SequentialQuantizer`` to improve its implementation and maintainability while preserving its functionality.
+
+**Deprecations**
+
+- Deprecate ``torch<2.4`` support.
+
+**New Features**
+
+- Upgrade LLM examples to use TensorRT-LLM 0.18.
+- Add new model support in the ``llm_ptq`` example: Gemma-3, Llama-Nemotron.
+- Add INT8 real quantization support.
+- Add an FP8 GEMM per-tensor quantization kernel for real quantization. After PTQ, you can leverage the :meth:`mtq.compress <modelopt.torch.quantization.compress>` API to accelerate evaluation of quantized models.
+- Use the shapes of the PyTorch parameters and buffers of :class:`TensorQuantizer <modelopt.torch.quantization.nn.modules.TensorQuantizer>` to initialize them during restore. This makes restoring quantized models more robust.
+- Support adding new custom quantization calibration algorithms. Please refer to :func:`mtq.calibrate <modelopt.torch.quantization.model_quant.calibrate>` or :ref:`custom calibration algorithm <custom_calibration_algorithm>` for more details.
+- Add EAGLE3 (``LlamaForCausalLMEagle3``) training and unified ModelOpt checkpoint export support for Megatron-LM.
+- Add support for the ``--override_shapes`` flag in ONNX quantization.
+  - ``--calibration_shapes`` is reserved for the input shapes used for the calibration process.
+  - ``--override_shapes`` is used to override the input shapes of the model with static shapes.
+- Add support for UNet ONNX quantization.
+- Enable the ``concat_elimination`` pass by default to improve the performance of quantized ONNX models.
+- Enable the Redundant Cast elimination pass by default in :meth:`moq.quantize <modelopt.onnx.quantization.quantize>`.
+- Add a new attribute ``parallel_state`` to :class:`DynamicModule <modelopt.torch.opt.dynamic.DynamicModule>` to support distributed parallelism such as data parallelism and tensor parallelism.
+- Add MXFP8 and NVFP4 quantized ONNX export support.
+- Add a new example for torch quantization to ONNX for MXFP8 and NVFP4 precision.
+
+0.27 (2025-04-03)
+^^^^^^^^^^^^^^^^^
+
+**Deprecations**
+
 - Deprecate real quantization configs, please use :meth:`mtq.compress <modelopt.torch.quantization.compress>` API for model compression after quantization.

 **New Features**

-- New model support in the ``llm_ptq`` example: OpenAI Whisper. Experimental support: Llama4, QwQ, Qwen MOE.
-- Blockwise FP8 quantization support in unified model export.
+- Add new model support in the ``llm_ptq`` example: OpenAI Whisper. Experimental support: Llama4, QwQ, Qwen MOE.
+- Add blockwise FP8 quantization support in unified model export.
 - Add quantization support to the Transformer Engine Linear module.
 - Add support for SVDQuant. Currently, only simulation is available; real deployment (for example, TensorRT deployment) support is coming soon.
-- To support distributed checkpoint resume expert-parallel (EP), ``modelopt_state`` in Megatron Core distributed checkpoint (used in NeMo and Megatron-LM) is stored differently. The legacy ``modelopt_state`` in the distributed checkpoint generated by previous modelopt version can still be loaded in 0.27 and 0.29 but will need to be stored in the new format.
+- Store ``modelopt_state`` in the Megatron Core distributed checkpoint (used in NeMo and Megatron-LM) differently to support distributed checkpoint resume with expert parallelism (EP). The legacy ``modelopt_state`` in distributed checkpoints generated by previous ModelOpt versions can still be loaded in 0.27 and 0.29 but will need to be stored in the new format.
 - Add triton-based NVFP4 quantization kernel that delivers approximately 40% performance improvement over the previous implementation.
 - Add a new API :meth:`mtq.compress <modelopt.torch.quantization.compress>` for model compression for weights after quantization.
 - Add option to simplify ONNX model before quantization is performed.
@@ -31,7 +61,7 @@ Model Optimizer Changelog (Linux)
 0.25 (2025-03-03)
 ^^^^^^^^^^^^^^^^^

-**Backward Breaking Changes**
+**Deprecations**

 - Deprecate Torch 2.1 support.
 - Deprecate ``humaneval`` benchmark in ``llm_eval`` examples. Please use the newly added ``simple_eval`` instead.
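
For context on the `mtq.compress` entries above, here is a minimal PTQ-then-compress sketch against the public `modelopt.torch.quantization` API. It is illustrative only: the model name, the toy calibration prompts, and the choice of `FP8_DEFAULT_CFG` are assumptions, not part of this commit.

```python
# Minimal sketch: post-training quantization followed by weight compression.
# Assumes a CUDA-capable machine and a small, openly available HF model.
import modelopt.torch.quantization as mtq
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"  # placeholder model
model = AutoModelForCausalLM.from_pretrained(model_id).cuda()
tokenizer = AutoTokenizer.from_pretrained(model_id)


def forward_loop(m):
    # Run a few representative prompts so the calibrators can observe
    # activation ranges; real calibration should use a proper dataset.
    for prompt in ["Hello, world!", "Explain quantization in one sentence."]:
        inputs = tokenizer(prompt, return_tensors="pt").to(m.device)
        m(**inputs)


# Simulated (fake) quantization with the default FP8 config.
model = mtq.quantize(model, mtq.FP8_DEFAULT_CFG, forward_loop)

# Compress the quantized weights to real low precision so evaluation can use
# the FP8 per-tensor GEMM kernel mentioned in the 0.29 notes above.
mtq.compress(model)
```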

README.md

Lines changed: 3 additions & 2 deletions
@@ -18,6 +18,7 @@

 ## Latest News

+- [2025/04/21] [Adobe optimized deployment using TensorRT-Model-Optimizer + TensorRT leading to a 60% reduction in diffusion latency, a 40% reduction in total cost of ownership](https://developer.nvidia.com/blog/optimizing-transformer-based-diffusion-models-for-video-generation-with-nvidia-tensorrt/)
 - [2025/04/05] [NVIDIA Accelerates Inference on Meta Llama 4 Scout and Maverick](https://developer.nvidia.com/blog/nvidia-accelerates-inference-on-meta-llama-4-scout-and-maverick/). Check out how to quantize Llama4 for deployment acceleration [here](./examples/llm_ptq/README.md#llama-4)
 - [2025/03/18] [World's Fastest DeepSeek-R1 Inference with Blackwell FP4 & Increasing Image Generation Efficiency on Blackwell](https://developer.nvidia.com/blog/nvidia-blackwell-delivers-world-record-deepseek-r1-inference-performance/)
 - [2025/02/25] Model Optimizer quantized NVFP4 models available on Hugging Face for download: [DeepSeek-R1-FP4](https://huggingface.co/nvidia/DeepSeek-R1-FP4), [Llama-3.3-70B-Instruct-FP4](https://huggingface.co/nvidia/Llama-3.3-70B-Instruct-FP4), [Llama-3.1-405B-Instruct-FP4](https://huggingface.co/nvidia/Llama-3.1-405B-Instruct-FP4)
@@ -95,13 +96,13 @@ python -c "import modelopt; print(modelopt.__version__)"
 Alternatively, you can install it from [NVIDIA PyPI](https://pypi.org/project/nvidia-modelopt/) without TRT-LLM etc.

 ```bash
-pip install "nvidia-modelopt[all]" -U --extra-index-url https://pypi.nvidia.com
+pip install -U "nvidia-modelopt[all]"
 ```

 To install from source for local development, you can install it as follows:

 ```bash
-pip install -e ".[all]" --extra-index-url https://pypi.nvidia.com
+pip install -e ".[all]"
 ```

 When installing from source, please make sure to re-run the install command every time you pull new changes in the repository so dependencies are also updated.

docker/Dockerfile

Lines changed: 5 additions & 5 deletions
@@ -1,4 +1,4 @@
-FROM nvidia/cuda:12.8.0-devel-ubuntu22.04
+FROM nvidia/cuda:12.8.1-devel-ubuntu22.04

 WORKDIR /workspace

@@ -15,7 +15,7 @@ RUN rm -rf /usr/lib/python3/dist-packages/setuptools* && \
     pip install --upgrade pip setuptools

 # Install TensorRT-LLM
-ARG TRT_LLM_VERSION=0.17.0
+ARG TRT_LLM_VERSION=0.18.1
 RUN pip install "tensorrt-llm~=$TRT_LLM_VERSION" -U
 RUN git clone --depth 1 --branch "v$TRT_LLM_VERSION" https://github.com/NVIDIA/TensorRT-LLM.git && \
     mkdir tensorrt-llm && \
@@ -28,7 +28,7 @@ ENV LD_LIBRARY_PATH=/usr/local/lib/python3.10/dist-packages/tensorrt_llm/libs:$L
 ENV LD_LIBRARY_PATH=/usr/local/lib/python3.10/dist-packages/nvidia/cudnn/lib:$LD_LIBRARY_PATH

 # Install TensorRT dev environment
-ARG TENSORRT_URL=https://developer.nvidia.com/downloads/compute/machine-learning/tensorrt/10.8.0/tars/TensorRT-10.8.0.43.Linux.x86_64-gnu.cuda-12.8.tar.gz
+ARG TENSORRT_URL=https://developer.nvidia.com/downloads/compute/machine-learning/tensorrt/10.9.0/tars/TensorRT-10.9.0.34.Linux.x86_64-gnu.cuda-12.8.tar.gz
 RUN wget -q -O tensorrt.tar.gz $TENSORRT_URL && \
     tar -xf tensorrt.tar.gz && \
     cp TensorRT-*/bin/trtexec /usr/local/bin && \
@@ -40,8 +40,8 @@ ENV TRT_LIB_PATH=/usr/local/lib/python3.10/dist-packages/tensorrt_libs
 ENV LD_LIBRARY_PATH=$TRT_LIB_PATH:$LD_LIBRARY_PATH

 # Install modelopt with all optional dependencies and pre-compile CUDA extensions otherwise they take several minutes on every docker run
-RUN pip install "nvidia-modelopt[all]" -U
-ENV TORCH_CUDA_ARCH_LIST="8.0 8.6 8.7 8.9 9.0+PTX"
+RUN pip install -U "nvidia-modelopt[all,dev-test]"
+ENV TORCH_CUDA_ARCH_LIST="8.0 8.6 8.7 8.9 9.0 10.0+PTX"
 RUN python -c "import modelopt.torch.quantization.extensions as ext; ext.precompile()"

 # Find and install requirements.txt files for all examples excluding windows
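
A quick way to confirm that a container built from this Dockerfile picked up the bumped pins is to import the packages and print their versions. This is a sketch, not part of the commit, and it assumes you run it inside the built image.

```python
# Sanity check inside the container: versions should mirror the Dockerfile pins.
import modelopt
import tensorrt_llm

print("modelopt:", modelopt.__version__)          # expected: 0.29.x for this release
print("tensorrt-llm:", tensorrt_llm.__version__)  # expected: 0.18.x per TRT_LLM_VERSION
```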

docs/source/getting_started/_installation_for_Linux.rst

Lines changed: 3 additions & 3 deletions
@@ -16,9 +16,9 @@ Latest Model Optimizer (``nvidia-modelopt``) currently has the following system
 +-------------------------+-----------------------------+
 | CUDA                    | >=12.0                      |
 +-------------------------+-----------------------------+
-| PyTorch (Optional)      | >=2.2                       |
+| PyTorch (Optional)      | >=2.4                       |
 +-------------------------+-----------------------------+
-| TensorRT-LLM (Optional) | 0.17                        |
+| TensorRT-LLM (Optional) | 0.18                        |
 +-------------------------+-----------------------------+
 | ONNX Runtime (Optional) | 1.20 (Python>=3.10)         |
 +-------------------------+-----------------------------+
@@ -100,7 +100,7 @@ Make sure to upgrade to the latest version of ModelOpt (with appropriate optiona

 .. code-block:: bash

-    pip install "nvidia-modelopt[all]" -U --extra-index-url https://pypi.nvidia.com
+    pip install -U "nvidia-modelopt[all]"

 If you want to install only partial dependencies, please replace ``[all]`` with the desired
 optional dependencies as described below.
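
Since this release drops `torch<2.4` support (see the requirements table change above), a small pre-upgrade check can catch an incompatible environment early. The snippet below is a sketch and not part of the docs change; the `2.4` floor is taken from the updated table.

```python
# Verify the local PyTorch meets the new minimum before upgrading ModelOpt.
from packaging.version import Version

import torch

minimum = Version("2.4")
installed = Version(torch.__version__.split("+")[0])  # strip local suffix like "+cu121"
assert installed >= minimum, f"ModelOpt 0.29 expects torch>={minimum}, found {torch.__version__}"
```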

docs/source/getting_started/windows/_installation_standalone.rst

Lines changed: 1 addition & 1 deletion
@@ -34,7 +34,7 @@ To install the ModelOpt-Windows wheel, run the following command:

 .. code-block:: bash

-    pip install "nvidia-modelopt[onnx]" --extra-index-url https://pypi.nvidia.com
+    pip install "nvidia-modelopt[onnx]"

 This command installs ModelOpt-Windows and its ONNX module, along with the *onnxruntime-directml* (v1.20.0) package. If ModelOpt-Windows is installed without the additional parameter, only the bare minimum dependencies will be installed, without the relevant module and dependencies.

docs/source/guides/7_speculative_decoding.rst

Lines changed: 1 addition & 1 deletion
@@ -71,7 +71,7 @@ After converting to a speculative decoding model, you need to fine-tune the deco

     mto.enable_huggingface_checkpointing()

-    trainer = Trainer(model=model, tokenizer=tokenizer, args=training_args, **data_module)
+    trainer = Trainer(model=model, processing_class=tokenizer, args=training_args, **data_module)
     trainer._move_model_to_device(model, trainer.args.device)

     trainer.train(resume_from_checkpoint=checkpoint)
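
The swap from `tokenizer=` to `processing_class=` tracks the Hugging Face `Trainer` API, which deprecated the `tokenizer` argument in favor of `processing_class` in recent `transformers` releases. Below is a version-tolerant sketch; the `4.46.0` cutoff is an assumption on my part, and `model`, `tokenizer`, `training_args`, and `data_module` are the objects set up earlier in the guide.

```python
# Sketch: build the Trainer with whichever keyword the installed transformers expects.
from packaging.version import Version

import transformers
from transformers import Trainer


def build_trainer(model, tokenizer, training_args, data_module):
    if Version(transformers.__version__) >= Version("4.46.0"):
        # Newer releases: the tokenizer is passed as the processing class.
        return Trainer(model=model, processing_class=tokenizer, args=training_args, **data_module)
    # Older releases still accept the original keyword.
    return Trainer(model=model, tokenizer=tokenizer, args=training_args, **data_module)
```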
