CHANGELOG.rst

Model Optimizer Changelog (Linux)
=================================

0.29 (2025-05-08)
^^^^^^^^^^^^^^^^^

**Backward Breaking Changes**

- Refactor ``SequentialQuantizer`` to improve its implementation and maintainability while preserving its functionality.

**Deprecations**

- Deprecate ``torch<2.4`` support.

**New Features**

- Upgrade LLM examples to use TensorRT-LLM 0.18.
- Add new model support in the ``llm_ptq`` example: Gemma-3, Llama-Nemotron.
- Add INT8 real quantization support.
- Add an FP8 GEMM per-tensor quantization kernel for real quantization. After PTQ, you can leverage the :meth:`mtq.compress <modelopt.torch.quantization.compress>` API to accelerate evaluation of quantized models (see the sketch after this list).
- Use the shape of PyTorch parameters and buffers of :class:`TensorQuantizer <modelopt.torch.quantization.nn.modules.TensorQuantizer>` to initialize them during restore. This makes restoring quantized models more robust.
- Support adding new custom quantization calibration algorithms. Please refer to :func:`mtq.calibrate <modelopt.torch.quantization.model_quant.calibrate>` or :ref:`custom calibration algorithm <custom_calibration_algorithm>` for more details (see the calibration sketch after this list).
- Add EAGLE3 (``LlamaForCausalLMEagle3``) training and unified ModelOpt checkpoint export support for Megatron-LM.
- Add support for the ``--override_shapes`` flag in ONNX quantization (see the shape-override sketch after this list).

  - ``--calibration_shapes`` is reserved for the input shapes used during the calibration process.
  - ``--override_shapes`` is used to override the input shapes of the model with static shapes.

- Add support for UNet ONNX quantization.
- Enable the ``concat_elimination`` pass by default to improve the performance of quantized ONNX models.
- Enable the redundant Cast elimination pass by default in :meth:`moq.quantize <modelopt.onnx.quantization.quantize>`.
- Add a new attribute, ``parallel_state``, to :class:`DynamicModule <modelopt.torch.opt.dynamic.DynamicModule>` to support distributed parallelism such as data parallel and tensor parallel.
- Add MXFP8 and NVFP4 quantized ONNX export support.
- Add a new example of torch quantization to ONNX for MXFP8 and NVFP4 precision.
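
A hedged sketch of the real-quantization flow referenced above: quantize with PTQ, then call :meth:`mtq.compress <modelopt.torch.quantization.compress>` so that evaluation can benefit from the new FP8 per-tensor GEMM kernel. The model, the calibration loader, and the choice of ``mtq.FP8_DEFAULT_CFG`` are illustrative placeholders, not part of this release note.

.. code-block:: python

    import modelopt.torch.quantization as mtq

    model = get_model()               # hypothetical helper returning a torch.nn.Module
    calib_loader = get_calib_data()   # hypothetical helper returning calibration batches

    def forward_loop(m):
        # Run a few calibration batches so the inserted quantizers can collect statistics.
        for batch in calib_loader:
            m(batch)

    # Post-training quantization with a built-in FP8 config (assumed here for illustration).
    model = mtq.quantize(model, mtq.FP8_DEFAULT_CFG, forward_loop)

    # Compress the quantized weights; evaluation of the compressed model can then use
    # the FP8 GEMM per-tensor kernel added in this release.
    mtq.compress(model)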
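
A minimal calibration sketch, assuming a model whose quantizers have already been inserted. It uses the built-in ``max`` algorithm; a custom algorithm registered through the custom-calibration mechanism linked above would be passed through the same entry point. The helpers below are hypothetical.

.. code-block:: python

    from modelopt.torch.quantization.model_quant import calibrate

    model = get_model_with_quantizers()   # hypothetical: quantizers already inserted
    calib_loader = get_calib_data()       # hypothetical calibration batches

    def forward_loop(m):
        for batch in calib_loader:
            m(batch)

    # Calibrate the quantizers; replace "max" with a registered custom algorithm
    # to exercise the new custom-calibration support.
    model = calibrate(model, algorithm="max", forward_loop=forward_loop)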
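
To make the two shape flags concrete, here is a sketch that drives ONNX quantization from Python. The ``python -m modelopt.onnx.quantization`` entry point, the file path, and the ``name:NxCxHxW`` shape syntax are assumptions for illustration; only the two flags themselves come from this release.

.. code-block:: python

    import subprocess

    cmd = [
        "python", "-m", "modelopt.onnx.quantization",   # assumed CLI entry point
        "--onnx_path", "model.onnx",                    # placeholder input model
        # Shapes used only while feeding calibration data through the model.
        "--calibration_shapes", "input:1x3x224x224",
        # Static shapes written into the quantized output model.
        "--override_shapes", "input:1x3x224x224",
    ]
    subprocess.run(cmd, check=True)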

0.27 (2025-04-03)
^^^^^^^^^^^^^^^^^

**Deprecations**

- Deprecate real quantization configs; please use the :meth:`mtq.compress <modelopt.torch.quantization.compress>` API for model compression after quantization.

**New Features**

- Add new model support in the ``llm_ptq`` example: OpenAI Whisper. Experimental support: Llama4, QwQ, Qwen MOE.
- Add blockwise FP8 quantization support in unified model export.
- Add quantization support to the Transformer Engine Linear module.
- Add support for SVDQuant. Currently, only simulation is available; real deployment (for example, TensorRT deployment) support is coming soon.
- Store ``modelopt_state`` in the Megatron Core distributed checkpoint (used in NeMo and Megatron-LM) differently to support distributed checkpoint resume with expert parallelism (EP). The legacy ``modelopt_state`` in distributed checkpoints generated by previous ModelOpt versions can still be loaded in 0.27 and 0.29 but will need to be stored in the new format.
- Add a Triton-based NVFP4 quantization kernel that delivers approximately 40% performance improvement over the previous implementation.
- Add a new API, :meth:`mtq.compress <modelopt.torch.quantization.compress>`, for compressing model weights after quantization.
- Add an option to simplify the ONNX model before quantization is performed.

0.25 (2025-03-03)
^^^^^^^^^^^^^^^^^

**Deprecations**

- Deprecate Torch 2.1 support.
- Deprecate the ``humaneval`` benchmark in ``llm_eval`` examples. Please use the newly added ``simple_eval`` instead.
README.md

## Latest News

- [2025/04/21] [Adobe optimized deployment using TensorRT-Model-Optimizer + TensorRT leading to a 60% reduction in diffusion latency, a 40% reduction in total cost of ownership](https://developer.nvidia.com/blog/optimizing-transformer-based-diffusion-models-for-video-generation-with-nvidia-tensorrt/)
- [2025/04/05] [NVIDIA Accelerates Inference on Meta Llama 4 Scout and Maverick](https://developer.nvidia.com/blog/nvidia-accelerates-inference-on-meta-llama-4-scout-and-maverick/). Check out how to quantize Llama4 for deployment acceleration [here](./examples/llm_ptq/README.md#llama-4)
- [2025/03/18] [World's Fastest DeepSeek-R1 Inference with Blackwell FP4 & Increasing Image Generation Efficiency on Blackwell](https://developer.nvidia.com/blog/nvidia-blackwell-delivers-world-record-deepseek-r1-inference-performance/)
- [2025/02/25] Model Optimizer quantized NVFP4 models available on Hugging Face for download: [DeepSeek-R1-FP4](https://huggingface.co/nvidia/DeepSeek-R1-FP4), [Llama-3.3-70B-Instruct-FP4](https://huggingface.co/nvidia/Llama-3.3-70B-Instruct-FP4), [Llama-3.1-405B-Instruct-FP4](https://huggingface.co/nvidia/Llama-3.1-405B-Instruct-FP4)
When installing from source, please make sure to re-run the install command every time you pull new changes in the repository so that dependencies are also updated.

This command installs ModelOpt-Windows and its ONNX module, along with the *onnxruntime-directml* (v1.20.0) package. If ModelOpt-Windows is installed without the additional parameter, only the bare-minimum dependencies are installed, without the ONNX module and its dependencies.