
Commit 92f430f

Add files for 0.27.0 release
Parent: 3c48b94

293 files changed, +10169 −5045 lines


.dockerignore

Lines changed: 1 addition & 1 deletion
@@ -1,6 +1,6 @@
 docker
 examples/**/.git
-examples/llm_ptq/saved_models*
+examples/**/saved_models*
 **/experimental
 
 ##### Copied from .gitignore #####

.github/ISSUE_TEMPLATE/1_bug_report.md

Lines changed: 18 additions & 17 deletions
@@ -20,6 +20,23 @@ assignees: ''
 
 ## System information
 
+- Container used (if applicable): ?
+- OS (e.g., Ubuntu 22.04, CentOS 7, Windows 10): ? <!-- If Windows, please add the `windows` label to the issue. -->
+- CPU architecture (x86_64, aarch64): ?
+- GPU name (e.g. H100, A100, L40S): ?
+- GPU memory size: ?
+- Number of GPUs: ?
+- Library versions (if applicable):
+  - Python: ?
+  - ModelOpt version or commit hash: ?
+  - CUDA: ?
+  - PyTorch: ?
+  - Transformers: ?
+  - TensorRT-LLM: ?
+  - ONNXRuntime: ?
+  - TensorRT: ?
+- Any other details that may help: ?
+
 
 <details>
 <summary><b>Click to expand: Python script to automatically collect system information</b></summary>
@@ -103,24 +120,8 @@ print(" - Transformers: " + get_package_version("transformers"))
 print(" - TensorRT-LLM: " + get_package_version("tensorrt_llm"))
 print(" - ONNXRuntime: " + get_package_version("onnxruntime"))
 print(" - TensorRT: " + get_package_version("tensorrt"))
+print("- Any other details that may help: " + "?")
 print("=" * 70)
 ```
 
 </details>
-
-- Container used (if applicable): ?
-- OS (e.g., Ubuntu 22.04, CentOS 7, Windows 10): ? <!-- If Windows, please add the `windows` label to the issue. -->
-- CPU architecture (x86_64, aarch64): ?
-- GPU name (e.g. H100, A100, L40S): ?
-- GPU memory size: ?
-- Number of GPUs: ?
-- Library versions (if applicable):
-  - Python: ?
-  - ModelOpt version or commit hash: ?
-  - CUDA: ?
-  - PyTorch: ?
-  - Transformers: ?
-  - TensorRT-LLM: ?
-  - ONNXRuntime: ?
-  - TensorRT: ?
-- Any other details that may help: ?
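For context, the collection script referenced in this template relies on a small `get_package_version` helper. The sketch below is a hypothetical reconstruction of such a helper (not copied from the template); the use of `importlib.metadata` is an assumption.

```python
# Hypothetical sketch of a get_package_version helper like the one the
# template's collection script calls; the actual implementation may differ.
from importlib.metadata import PackageNotFoundError, version


def get_package_version(package_name: str) -> str:
    """Return the installed version of a package, or '?' if it is not installed."""
    try:
        return version(package_name)
    except PackageNotFoundError:
        return "?"


print(" - Transformers: " + get_package_version("transformers"))
print(" - TensorRT-LLM: " + get_package_version("tensorrt_llm"))
```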

.github/ISSUE_TEMPLATE/2_feature_request.md

Lines changed: 1 addition & 1 deletion
@@ -17,5 +17,5 @@ assignees: ''
 ### Describe alternatives you've considered
 
 
-### Target Hardware/Use Case
+### Target hardware/use case
 <!-- Target hardware/use case this feature will be used for -->

.pre-commit-config.yaml

Lines changed: 2 additions & 3 deletions
@@ -96,6 +96,8 @@ repos:
 modelopt/torch/speculative/eagle/utils.py|
 modelopt/torch/speculative/plugins/transformers.py|
 examples/chained_optimizations/bert_prune_distill_quantize.py|
+examples/deepseek/quantize_to_nvfp4.py|
+examples/deepseek/ptq.py|
 examples/diffusers/cache_diffusion/pipeline/models/sdxl.py|
 examples/diffusers/quantization/onnx_utils/export.py|
 examples/llm_eval/gen_model_answer.py|
@@ -108,8 +110,6 @@ repos:
 examples/speculative_decoding/main.py|
 examples/speculative_decoding/medusa_utils.py|
 examples/speculative_decoding/vllm_generate.py|
-examples/deepseek/quantize_to_nvfp4.py|
-examples/deepseek/ptq.py|
 )$
 
 # Default hook for Apache 2.0 in core c/c++/cuda files
@@ -154,4 +154,3 @@ repos:
 - id: lychee
   args: ["--no-progress", "--exclude-loopback"]
   stages: [manual] # Only run with `pre-commit run --all-files --hook-stage manual lychee`
-  exclude: internal/

CHANGELOG.rst

Lines changed: 24 additions & 0 deletions
@@ -1,6 +1,30 @@
 Model Optimizer Changelog (Linux)
 =================================
 
+0.27 (2025-04-03)
+^^^^^^^^^^^^^^^^^
+
+**Backward Breaking Changes**
+
+- Deprecate real quantization configs; please use the :meth:`mtq.compress <modelopt.torch.quantization.compress>` API for model compression after quantization.
+
+**New Features**
+
+- Blockwise FP8 quantization support in unified model export.
+- Add quantization support to the Transformer Engine Linear module.
+- Add support for SVDQuant. Currently, only simulation is available; real deployment (for example, TensorRT deployment) support is coming soon.
+- To support expert-parallel (EP) resume from distributed checkpoints, ``modelopt_state`` in Megatron Core distributed checkpoints (used in NeMo and Megatron-LM) is now stored differently. The legacy ``modelopt_state`` in distributed checkpoints generated by previous ModelOpt versions can still be loaded in 0.27 and 0.29, but will need to be re-saved in the new format.
+- Add a Triton-based NVFP4 quantization kernel that delivers an approximately 40% performance improvement over the previous implementation.
+- Add a new API, :meth:`mtq.compress <modelopt.torch.quantization.compress>`, for compressing model weights after quantization.
+- Add an option to simplify ONNX models before quantization is performed.
+- (Experimental) Improve support for ONNX models with custom TensorRT ops:
+  - Add support for the ``--calibration_shapes`` flag.
+  - Add automatic type and shape tensor propagation for full ORT support with the TensorRT EP.
+
+**Known Issues**
+
+- Quantization of T5 models is broken. Please use ``nvidia-modelopt==0.25.0`` with ``transformers<4.50`` in the meantime.
+
 0.25 (2025-03-03)
 ^^^^^^^^^^^^^^^^^
 
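To illustrate the quantize-then-compress flow introduced in this release, here is a minimal sketch of calling :meth:`mtq.quantize` followed by the new :meth:`mtq.compress`. The model name, calibration prompts, and the choice of the NVFP4 config are placeholders, not part of this commit.

```python
# Minimal sketch of PTQ followed by the new mtq.compress API (0.27).
# Model name, calibration prompts, and config choice are illustrative only.
import modelopt.torch.quantization as mtq
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "<your-hf-model>"  # placeholder
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).cuda()

calib_prompts = ["Hello, world!", "Quantization reduces memory footprint."]


def forward_loop(model):
    # Run a few representative samples through the model to collect calibration statistics.
    for prompt in calib_prompts:
        inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
        model(**inputs)


# Post-training quantization with a built-in config.
model = mtq.quantize(model, mtq.NVFP4_DEFAULT_CFG, forward_loop)

# New in 0.27: compress the quantized weights, replacing the deprecated
# real-quantization configs.
mtq.compress(model)
```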

README.md

Lines changed: 8 additions & 7 deletions
@@ -18,6 +18,7 @@
 
 ## Latest News
 
+- [2025/03/18] [World's Fastest DeepSeek-R1 Inference with Blackwell FP4 & Increasing Image Generation Efficiency on Blackwell](https://developer.nvidia.com/blog/nvidia-blackwell-delivers-world-record-deepseek-r1-inference-performance/)
 - [2025/02/25] Model Optimizer quantized NVFP4 models available on Hugging Face for download: [DeepSeek-R1-FP4](https://huggingface.co/nvidia/DeepSeek-R1-FP4), [Llama-3.3-70B-Instruct-FP4](https://huggingface.co/nvidia/Llama-3.3-70B-Instruct-FP4), [Llama-3.1-405B-Instruct-FP4](https://huggingface.co/nvidia/Llama-3.1-405B-Instruct-FP4)
 - [2025/01/28] Model Optimizer has added support for NVFP4. Check out an example of NVFP4 PTQ [here](./examples/llm_ptq/README.md#model-quantization-and-trt-llm-conversion).
 - [2025/01/28] Model Optimizer is now open source!
@@ -44,12 +45,12 @@
 - [Model Optimizer Overview](#model-optimizer-overview)
 - [Installation](#installation--docker)
 - [Techniques](#techniques)
-  - [Quantization](#quantization)
+  - [Quantization](#quantization-examples-docs)
   - [Quantized Checkpoints](#quantized-checkpoints)
-  - [Pruning](#pruning)
-  - [Distillation](#distillation)
-  - [Speculative Decoding](#speculative-decoding)
-  - [Sparsity](#sparsity)
+  - [Pruning](#pruning-examples-docs)
+  - [Distillation](#distillation-examples-docs)
+  - [Speculative Decoding](#speculative-decoding-examples-docs)
+  - [Sparsity](#sparsity-examples-docs)
 - [Examples](#examples)
 - [Support Matrix](#model-support-matrix)
 - [Benchmark](#benchmark)
@@ -114,7 +115,7 @@ Below is a short description of the techniques supported by Model Optimizer.
 
 ### Quantization \[[examples](./examples/README.md#quantization)\] \[[docs](https://nvidia.github.io/TensorRT-Model-Optimizer/guides/1_quantization.html)\]
 
-Quantization is an effective model optimization technique for large models. Quantization with Model Optimizer can compress model size by 2x-4x, speeding up inference while preserving model quality. Model Optimizer enables highly performant quantization formats including NVFP4, FP8, INT8, INT4, etc and supports advanced algorithms such as SmoothQuant, AWQ, and Double Quantization with easy-to-use Python APIs. Both Post-training quantization (PTQ) and Quantization-aware training (QAT) are supported.
+Quantization is an effective model optimization technique for large models. Quantization with Model Optimizer can compress model size by 2x-4x, speeding up inference while preserving model quality. Model Optimizer enables highly performant quantization formats including NVFP4, FP8, INT8, INT4, etc and supports advanced algorithms such as SmoothQuant, AWQ, SVDQuant, and Double Quantization with easy-to-use Python APIs. Both Post-training quantization (PTQ) and Quantization-aware training (QAT) are supported.
 
 #### Quantized Checkpoints
 
@@ -158,7 +159,7 @@ Please find the [detailed performance benchmarks](./examples/benchmark.md).
 
 ## Roadmap
 
-Please see our [product roadmap](https://github.com/NVIDIA/TensorRT-Model-Optimizer/issues/108).
+Please see our [product roadmap](https://github.com/NVIDIA/TensorRT-Model-Optimizer/issues/146).
 
 ## Release Notes
 

docker/Dockerfile

Lines changed: 7 additions & 4 deletions
@@ -2,15 +2,17 @@ FROM nvidia/cuda:12.8.0-devel-ubuntu22.04
 
 WORKDIR /workspace
 
-RUN apt-get update && apt-get -y install python3.10 python3-pip python-is-python3 openmpi-bin libopenmpi-dev wget git git-lfs unzip jq cmake
+RUN apt-get update && \
+    apt-get -y install python3.10 python3-pip python-is-python3 openmpi-bin libopenmpi-dev libgl1 libglib2.0-0 wget git git-lfs unzip jq cmake vim && \
+    rm -rf /var/lib/apt/lists/*
 
 ARG PIP_EXTRA_INDEX_URL="https://pypi.nvidia.com"
 ENV PIP_EXTRA_INDEX_URL=$PIP_EXTRA_INDEX_URL
 ENV PIP_NO_CACHE_DIR=off
 
 # Install the latest setuptools using pip
-RUN rm -rf /usr/lib/python3/dist-packages/setuptools*
-RUN pip install --upgrade pip setuptools
+RUN rm -rf /usr/lib/python3/dist-packages/setuptools* && \
+    pip install --upgrade pip setuptools
 
 # Install TensorRT-LLM
 ARG TRT_LLM_VERSION=0.17.0
@@ -44,9 +46,10 @@ RUN python -c "import modelopt.torch.quantization.extensions as ext; ext.precomp
 
 # Find and install requirements.txt files for all examples excluding windows
 COPY . TensorRT-Model-Optimizer
+RUN rm -rf TensorRT-Model-Optimizer/.git
 RUN find TensorRT-Model-Optimizer/examples -name "requirements.txt" | grep -v "windows" | while read req_file; do \
     echo "Installing from $req_file"; \
-    pip install -r "$req_file"; \
+    pip install -r "$req_file" || exit 1; \
     done
 
 # Allow users to run without root

docs/source/deployment/3_unified_hf.rst

Lines changed: 20 additions & 0 deletions
@@ -32,6 +32,26 @@ The export API (:meth:`export_hf_checkpoint <modelopt.torch.export.unified_expor
     export_dir, # The directory where the exported files will be stored.
 )
 
+Deployment Support Matrix
+==============================================
+
+Currently, we support the following quantization formats with the unified HF export API:
+#. FP8
+#. FP8_PB
+#. NVFP4
+#. NVFP4_AWQ
+#. INT4_AWQ
+#. W4A8_AWQ
+
+For deployment with TensorRT-LLM, we support Llama 3.1, Llama 3.3, and Mixtral 8x7B with FP8 and NVFP4 checkpoints; Medusa and Eagle FP8 checkpoints are also tested.
+
+For deployment with vLLM, we support Llama 3.1, Llama 3.3, and Mixtral 8x7B with FP8 checkpoints.
+
+For deployment with SGLang, we support Llama 3.1 and Llama 3.3 with FP8 checkpoints.
+
+Other models and quantization formats may work, but they are not thoroughly tested.
+
+
 Deployment with Selected Inference Frameworks
 ==============================================
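As a rough usage sketch of the export API referenced in this page (a minimal example under assumptions: the model name, export directory, and FP8 config choice are placeholders, and the `export_hf_checkpoint` import path is taken from the docs above):

```python
# Minimal sketch: quantize a Hugging Face model and export a unified HF checkpoint.
# Model name, export directory, and the FP8 config choice are placeholders.
import modelopt.torch.quantization as mtq
from modelopt.torch.export import export_hf_checkpoint
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("<your-hf-model>").cuda()

# Calibration (forward_loop) is omitted here for brevity; a real run should
# calibrate on representative data before exporting.
model = mtq.quantize(model, mtq.FP8_DEFAULT_CFG, forward_loop=None)

export_hf_checkpoint(
    model,
    export_dir="<export-dir>",  # The directory where the exported files will be stored.
)
```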

docs/source/getting_started/1_overview.rst

Lines changed: 1 addition & 1 deletion
@@ -23,7 +23,7 @@ Quantization
 ^^^^^^^^^^^^
 Quantization is an effective model optimization technique for large models. Quantization with Model Optimizer can compress
 model size by 2x-4x, speeding up inference while preserving model quality. Model Optimizer enables highly performant
-quantization formats including NVFP4, FP8, INT8, INT4, etc and supports advanced algorithms such as SmoothQuant, AWQ, and
+quantization formats including NVFP4, FP8, INT8, INT4, etc and supports advanced algorithms such as SmoothQuant, AWQ, SVDQuant, and
 Double Quantization with easy-to-use Python APIs. Both Post-training quantization (PTQ) and Quantization-aware training (QAT)
 are supported. Visit :meth:`Quantization Format page <modelopt.torch.quantization.config>`
 for list of formats supported.

docs/source/getting_started/3_quantization.rst

Lines changed: 1 addition & 1 deletion
@@ -9,7 +9,7 @@ Quantization is an effective technique to reduce the memory footprint of deep le
 accelerate the inference speed.
 
 ModelOpt's :meth:`mtq.quantize() <modelopt.torch.quantization.model_quant.quantize>` API enables
-users to quantize a model with advanced algorithms like SmoothQuant, AWQ, and more. ModelOpt
+users to quantize a model with advanced algorithms like SmoothQuant, AWQ, SVDQuant, and more. ModelOpt
 supports both Post Training Quantization (PTQ) and Quantization Aware Training (QAT).
 
 .. tip::
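To make the mtq.quantize() flow described in this page concrete, here is a minimal PTQ sketch on a toy PyTorch module. The model, random calibration data, and the INT8 default config are placeholder choices, not taken from the docs themselves.

```python
# Minimal PTQ sketch for a plain PyTorch module; toy model and data are placeholders.
import torch
import torch.nn as nn

import modelopt.torch.quantization as mtq

model = nn.Sequential(nn.Linear(64, 128), nn.ReLU(), nn.Linear(128, 10))


def forward_loop(model):
    # Feed a handful of representative batches so the calibrators can observe activations.
    for _ in range(8):
        model(torch.randn(4, 64))


# Insert quantizers, calibrate on the forward loop, and return the quantized model.
model = mtq.quantize(model, mtq.INT8_DEFAULT_CFG, forward_loop)
```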
