
Commit dcdc2df

Push latest changes
Signed-off-by: Keval Morabia <[email protected]>
1 parent 4b28472 commit dcdc2df

118 files changed: +5404, -3064 lines changed (only a subset of the file diffs is shown below)


.github/workflows/unit_tests.yml

Lines changed: 18 additions & 3 deletions

@@ -64,7 +64,7 @@ jobs:
         with:
           python-version: "3.12"
       - name: Run unit tests (without coverage)
-        run: pip install tox && tox -e py312-torch28-unit
+        run: pip install tox && tox -e py312-torch28-tf_latest-unit
   multi-py:
     if: github.event_name == 'pull_request'
     needs: [linux]
@@ -79,7 +79,7 @@ jobs:
         with:
           python-version: "3.${{ matrix.py }}"
       - name: Run unit tests
-        run: pip install tox && tox -e py3${{ matrix.py }}-torch28-unit
+        run: pip install tox && tox -e py3${{ matrix.py }}-torch28-tf_latest-unit
   multi-torch:
     if: github.event_name == 'pull_request'
     needs: [linux]
@@ -94,7 +94,22 @@ jobs:
         with:
           python-version: "3.12"
       - name: Run unit tests
-        run: pip install tox && tox -e py312-torch${{ matrix.torch }}-unit
+        run: pip install tox && tox -e py312-torch${{ matrix.torch }}-tf_latest-unit
+  multi-transformers:
+    if: github.event_name == 'pull_request'
+    needs: [linux]
+    runs-on: ubuntu-latest
+    timeout-minutes: 30
+    strategy:
+      matrix:
+        tf: [min]
+    steps:
+      - uses: actions/checkout@v4
+      - uses: actions/setup-python@v5
+        with:
+          python-version: "3.12"
+      - name: Run unit tests
+        run: pip install tox && tox -e py312-torch28-tf_${{ matrix.tf }}-unit
   partial-install:
     if: github.event_name == 'pull_request'
     needs: [linux]

.pre-commit-config.yaml

Lines changed: 0 additions & 5 deletions

@@ -1,9 +1,4 @@
 # NOTE: Make sure to update version in dev requirements (setup.py) as well!
-exclude: >
-  (?x)^(
-      experimental/.*|
-  )$
-
 repos:
   - repo: https://github.com/pre-commit/pre-commit-hooks
     rev: v5.0.0

CHANGELOG.rst

Lines changed: 12 additions & 5 deletions

@@ -4,19 +4,28 @@ Model Optimizer Changelog (Linux)
 0.35 (2025-08-xx)
 ^^^^^^^^^^^^^^^^^

-**Backward Breaking Changes**
-
 **Deprecations**

 - Deprecate ``torch<2.6`` support.

+**Bug Fixes**
+
+- Fix attention head ranking logic for pruning Megatron Core GPT models.
+
 **New Features**

+- ModelOpt now supports PTQ and QAT for GPT-OSS models. See ``examples/gpt_oss`` for an end-to-end PTQ/QAT example.
+- Add support for QAT with Hugging Face + DeepSpeed. See ``examples/gpt_oss`` for an example.
+- Add support for QAT with LoRA. The LoRA adapters can be folded into the base model after QAT and deployed just like a regular PTQ model. See ``examples/gpt_oss`` for an example.
+- ModelOpt provides convenient trainers such as :class:`QATTrainer`, :class:`QADTrainer`, :class:`KDTrainer`, and :class:`QATSFTTrainer`, which inherit from the corresponding Hugging Face trainers.
+  ModelOpt trainers can be used as drop-in replacements for those Hugging Face trainers. See usage examples in ``examples/gpt_oss``, ``examples/llm_qat``, or ``examples/llm_distill``.
 - (Experimental) Add quantization support for custom TensorRT op in ONNX models.
 - Add support for Minifinetuning (MFT; https://arxiv.org/abs/2506.15702) self-corrective distillation, which enables training on small datasets with severely mitigated catastrophic forgetting.
 - Add tree decoding support for Megatron Eagle models.
 - For most VLMs, we now explicitly disable quantization of the vision part and add those modules to ``excluded_modules`` during HF export.
-- Add support for ``hidden_size`` and ``num_layers`` pruning for Megatron Core Mamba models in ``mcore_gpt_minitron`` mode.
+- Add support for ``mamba_num_heads``, ``mamba_head_dim``, ``hidden_size`` and ``num_layers`` pruning for Megatron Core Mamba or Hybrid Transformer Mamba models in ``mcore_minitron`` (previously ``mcore_gpt_minitron``) mode.
+- Add an example for QAT/QAD training with `LLaMA Factory <https://github.com/hiyouga/LLaMA-Factory/tree/main>`_. See ``examples/llm_qat/llama_factory`` for more details.
+- Upgrade TensorRT-LLM dependency to 1.0.0rc6.

 0.33 (2025-07-14)
 ^^^^^^^^^^^^^^^^^
@@ -25,8 +34,6 @@ Model Optimizer Changelog (Linux)

 - PyTorch dependencies for ``modelopt.torch`` features are no longer optional, and ``pip install nvidia-modelopt`` is now the same as ``pip install nvidia-modelopt[torch]``.

-**Deprecations**
-
 **New Features**

 - Upgrade TensorRT-LLM dependency to 0.20.
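
The drop-in trainer entry above can be illustrated with a short sketch. This is a hedged example, not the verified ModelOpt API: the import path for `QATTrainer` and the `quant_cfg` argument are assumptions, and the toy dataset exists only to keep the snippet self-contained. The authoritative usage is in `examples/llm_qat` and `examples/gpt_oss`.

```python
# Hedged sketch: swap transformers.Trainer for a ModelOpt QAT trainer.
# The QATTrainer import path and the quant_cfg argument are assumptions.
from datasets import Dataset
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments

from my_modelopt_trainers import QATTrainer  # hypothetical import path; see examples/llm_qat

model_name = "facebook/opt-125m"  # small placeholder model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Tiny toy dataset so the example is runnable end to end
texts = ["hello world", "quantization aware training"] * 4
enc = tokenizer(texts, truncation=True, padding="max_length", max_length=32)
enc["labels"] = [ids.copy() for ids in enc["input_ids"]]
train_dataset = Dataset.from_dict(dict(enc))

trainer = QATTrainer(  # used exactly like transformers.Trainer
    model=model,
    args=TrainingArguments(output_dir="qat_out", per_device_train_batch_size=2,
                           num_train_epochs=1, report_to="none"),
    train_dataset=train_dataset,
    quant_cfg="NVFP4_DEFAULT_CFG",  # assumption: selects the quantization recipe
)
trainer.train()  # runs QAT instead of plain fine-tuning
```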

README.md

Lines changed: 5 additions & 4 deletions

@@ -74,7 +74,8 @@ For enterprise users, the 8-bit quantization with Stable Diffusion is also avail

 ## Installation / Docker

-To use Model Optimizer with full dependencies (e.g. TensorRT-LLM deployment), we recommend using the provided docker image.
+To use Model Optimizer with full dependencies (e.g. TensorRT/TensorRT-LLM deployment), we recommend using our provided docker image
+which is based on the [TensorRT-LLM](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/tensorrt-llm/containers/release/tags) docker image with additional example-specific dependencies installed.

 After installing the [NVIDIA Container Toolkit](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html),
 please run the following commands to build the Model Optimizer docker container which has all the necessary
@@ -96,7 +97,7 @@ docker run --gpus all -it --shm-size 20g --rm docker.io/library/modelopt_example
 python -c "import modelopt; print(modelopt.__version__)"
 ```

-Alternatively, you can install it from [PyPI](https://pypi.org/project/nvidia-modelopt/) without TRT-LLM etc.
+Alternatively, you can install it from [NVIDIA PyPI](https://pypi.org/project/nvidia-modelopt/) without TRT-LLM etc.

 ```bash
 pip install -U "nvidia-modelopt[all]"
@@ -105,7 +106,7 @@ pip install -U "nvidia-modelopt[all]"
 To install from source for local development, you can install it as follows:

 ```bash
-pip install -e ".[dev]"
+pip install -e ".[all]"
 ```

 When installing from source, please make sure to re-run the install command every time you pull new changes in the repository so dependencies are also updated.
@@ -128,7 +129,7 @@ Quantization is an effective model optimization technique for large models. Quan

 ### Pruning \[[examples](./examples/README.md#pruning)\] \[[docs](https://nvidia.github.io/TensorRT-Model-Optimizer/guides/2_pruning.html)\]

-Pruning is a technique to reduce the model size and accelerate the inference by removing unnecessary weights. Model Optimizer provides Python APIs to prune Linear and Conv layers, and Transformer attention heads, MLP, embedding hidden size and number of layers (depth).
+Pruning is a technique to reduce the model size and accelerate inference by removing unnecessary weights. Model Optimizer provides Python APIs to prune primitives such as Linear and Conv layers, and complex modules such as Transformer or Mamba heads, MLP, embedding hidden size and number of layers (depth).

 ### Distillation \[[examples](./examples/README.md#distillation)\] \[[docs](https://nvidia.github.io/TensorRT-Model-Optimizer/guides/4_distillation.html)\]


docker/Dockerfile

Lines changed: 7 additions & 38 deletions

@@ -1,54 +1,23 @@
-FROM nvcr.io/nvidia/pytorch:25.04-py3
+FROM nvcr.io/nvidia/tensorrt-llm/release:1.0.0rc6

 ARG PIP_EXTRA_INDEX_URL="https://pypi.nvidia.com"
-ARG TRT_LLM_COMMIT=v0.20.0
-ARG REMOVE_TRT_LLM_SRC=1
-ARG CUDA_ARCH="89-real;90-real;100-real"
-
 ENV PIP_EXTRA_INDEX_URL=$PIP_EXTRA_INDEX_URL \
     PIP_NO_CACHE_DIR=off \
     PIP_CONSTRAINT= \
     TORCH_CUDA_ARCH_LIST="8.0 8.6 8.7 8.9 9.0 10.0+PTX"

-WORKDIR /workspace
-
-# Install TensorRT-LLM from source
-RUN --mount=type=ssh,id=nvidia git clone https://github.com/NVIDIA/TensorRT-LLM.git tensorrt-llm \
-    && cd tensorrt-llm \
-    && git checkout ${TRT_LLM_COMMIT} \
-    && git submodule update --init --recursive
-
-# Install required dependencies
-RUN bash tensorrt-llm/docker/common/install_base.sh $(python --version 2>&1 | awk '{print $2}')
-RUN bash tensorrt-llm/docker/common/install_cmake.sh
-RUN bash tensorrt-llm/docker/common/install_mpi4py.sh
-RUN bash tensorrt-llm/docker/common/install_tensorrt.sh
-RUN bash tensorrt-llm/docker/common/install_cuda_toolkit.sh
+RUN apt-get update && \
+    apt-get install -y libgl1 && \
+    rm -rf /var/lib/apt/lists/*

-RUN cd tensorrt-llm && git lfs install && git lfs pull
-
-RUN cd tensorrt-llm \
-    && ./scripts/build_wheel.py --job_count $(nproc) --clean --python_bindings --benchmarks --install --cuda_architecture=${CUDA_ARCH} \
-    && git rev-parse --short HEAD > /workspace/tensorrt-llm.commit \
-    && chmod -R 777 .
-RUN pip install tensorrt-llm/build/tensorrt_llm*.whl
+WORKDIR /workspace

-# Remove TensorRT-LLM source code to reduce image size except for benchmarks and examples folders
-RUN if [ "$REMOVE_TRT_LLM_SRC" = "1" ]; then \
-        mkdir -p tensorrt-llm_keep; \
-        mv tensorrt-llm/benchmarks tensorrt-llm_keep/benchmarks; \
-        mv tensorrt-llm/examples tensorrt-llm_keep/examples; \
-        rm -rf tensorrt-llm; \
-        mv tensorrt-llm_keep tensorrt-llm; \
-    fi
+RUN ln -s /app/tensorrt_llm /workspace/tensorrt_llm

 # Update PATH and LD_LIBRARY_PATH variables for the TensorRT binaries
-ENV LD_LIBRARY_PATH="/usr/local/tensorrt/targets/x86_64-linux-gnu/lib:${LD_LIBRARY_PATH}" \
+ENV LD_LIBRARY_PATH="/usr/lib/x86_64-linux-gnu:/usr/local/tensorrt/targets/x86_64-linux-gnu/lib:${LD_LIBRARY_PATH}" \
     PATH="/usr/local/tensorrt/targets/x86_64-linux-gnu/bin:${PATH}"

-# Export the path to 'libcudnn.so.X' needed by 'libonnxruntime_providers_tensorrt.so'
-ENV LD_LIBRARY_PATH=/usr/lib/x86_64-linux-gnu:$LD_LIBRARY_PATH
-
 # Install modelopt with all optional dependencies and pre-compile CUDA extensions otherwise they take several minutes on every docker run
 RUN pip install -U "nvidia-modelopt[all,dev-test]"
 RUN python -c "import modelopt.torch.quantization.extensions as ext; ext.precompile()"

docker/build.sh

Lines changed: 1 addition & 1 deletion

@@ -16,4 +16,4 @@

 set -e

-docker build --progress=plain . -f docker/Dockerfile -t modelopt_examples:latest "$@"
+docker build --network=host --progress=plain . -f docker/Dockerfile -t modelopt_examples:latest "$@"

docs/source/deployment/3_unified_hf.rst

Lines changed: 14 additions & 10 deletions

@@ -54,9 +54,10 @@ TensorRT-LLM
 ~~~~~~~~~~~~

 Models:
-* Llama 4, 3.1, 3.3 (FP8, NVFP4)
-* Qwen 3 (FP8, NVFP4)
-* Deepseek R1 (NVFP4)
+* Llama 4, 3.x (FP8, NVFP4)
+* Qwen 3, 2.5 (FP8, NVFP4)
+* Qwen 3 MoE (FP8, NVFP4)
+* Deepseek R1/V3 (NVFP4)
 * Mixtral 8x7B (FP8, NVFP4)
 * Medusa (FP8)
 * Eagle (FP8)
@@ -67,21 +68,24 @@ vLLM
 ~~~~

 Models:
-* Llama 3.1, 3.3 (FP8, NVFP4)
+* Llama 4, 3.x (FP8, NVFP4)
+* Qwen 3, 2.5 (FP8, NVFP4)
+* Qwen 3 MoE (FP8, NVFP4)
 * Mixtral 8x7B (FP8)
-* Deepseek R1 (NVFP4)
+* Deepseek R1/V3 (NVFP4)

-Requirements: vLLM v0.9.1 or later
+Requirements: vLLM v0.10.1 or later

 SGLang
 ~~~~~~

 Models:
-* Llama 3.1, 3.3 (FP8, NVFP4)
-* Deepseek R1 (NVFP4)
-* Llama 4 (FP8)
+* Llama 4, 3.x (FP8, NVFP4)
+* Qwen 3, 2.5 (FP8, NVFP4)
+* Qwen 3 MoE (FP8, NVFP4)
+* Deepseek R1/V3 (NVFP4)

-Requirements: SGLang v0.4.7 or later
+Requirements: SGLang v0.4.10 or later

 Note: While other models and quantization formats may work, they have not been thoroughly tested and validated.
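
As a sketch of how one of these deployment paths is exercised: a ModelOpt unified HF checkpoint can usually be loaded directly by vLLM (v0.10.1 or later, per the requirement above). This is a hedged example; the checkpoint path is a placeholder, and the explicit `quantization="modelopt"` argument is an assumption, since vLLM often detects the format from the checkpoint's quantization config.

```python
# Hedged sketch: serving a ModelOpt FP8 unified HF checkpoint with vLLM.
from vllm import LLM, SamplingParams

llm = LLM(
    model="path/to/Llama-3.1-8B-Instruct-FP8",  # placeholder exported checkpoint
    quantization="modelopt",                    # assumption: often auto-detected instead
)
outputs = llm.generate(
    ["Explain NVFP4 quantization in one sentence."],
    SamplingParams(temperature=0.0, max_tokens=64),
)
print(outputs[0].outputs[0].text)
```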

docs/source/getting_started/_installation_for_Linux.rst

Lines changed: 4 additions & 2 deletions

@@ -18,7 +18,7 @@ Latest Model Optimizer (``nvidia-modelopt``) currently has the following system
 +-------------------------+-----------------------------+
 | PyTorch                 | >=2.6                       |
 +-------------------------+-----------------------------+
-| TensorRT-LLM (Optional) | 0.20                        |
+| TensorRT-LLM (Optional) | 1.0.0rc6                    |
 +-------------------------+-----------------------------+
 | ONNX Runtime (Optional) | 1.22                        |
 +-------------------------+-----------------------------+
@@ -32,7 +32,9 @@ Environment setup

 **Using ModelOpt's docker image**

-Easiest way to get started with using Model Optimizer and additional dependencies (e.g. TensorRT-LLM deployment) is to start from our docker image.
+To use Model Optimizer with full dependencies (e.g. TensorRT/TensorRT-LLM deployment), we recommend using our provided docker image
+which is based on the `TensorRT-LLM <https://catalog.ngc.nvidia.com/orgs/nvidia/teams/tensorrt-llm/containers/release/tags>`_
+docker image with additional example-specific dependencies installed.

 After installing the `NVIDIA Container Toolkit <https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html>`_,
 please run the following commands to build the Model Optimizer docker container which has all the necessary

docs/source/guides/2_pruning.rst

Lines changed: 4 additions & 3 deletions

@@ -17,9 +17,10 @@ attention heads of the model. More details on these pruning modes is as follows:

 #. ``fastnas``: A pruning method recommended for Computer Vision models. Given a pretrained model,
    FastNAS finds the subnet which maximizes the score function while meeting the given constraints.
-#. ``mcore_gpt_minitron``: A pruning method developed by NVIDIA Research for pruning GPT-style models (e.g. Llama 3)
-   in NVIDIA NeMo or Megatron-LM framework that are using Pipeline Parallelism. It uses the activation
-   magnitudes to prune the mlp, attention heads, GQA query groups, embedding hidden size and number of layers of the model.
+#. ``mcore_minitron``: A pruning method developed by NVIDIA Research for pruning GPT, Mamba and Hybrid
+   Transformer Mamba models in the NVIDIA NeMo or Megatron-LM framework. It uses the activation magnitudes to prune
+   the mlp, transformer attention heads, GQA query groups, mamba heads and head dimension, embedding hidden size
+   and number of layers of the model.
    Check out more details of the algorithm in the `paper <https://arxiv.org/abs/2408.11796>`_.
 #. ``gradnas``: A light-weight pruning method recommended for language models like Hugging Face BERT and GPT-J.
    It uses the gradient information to prune the model's linear layers and attention heads to meet the given constraints.
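
For orientation, pruning in any of these modes is driven through the ``modelopt.torch.prune`` API. The sketch below uses the ``fastnas`` mode on a torchvision model because it is the most self-contained case; it is a hedged example in which the ``config`` keys and the dummy score function are assumptions, so consult the pruning documentation and examples for the exact interface. The ``mcore_minitron`` mode uses the same ``mtp.prune`` entry point but operates on Megatron Core GPT/Mamba models inside NeMo or Megatron-LM.

```python
# Hedged sketch of FastNAS pruning via modelopt.torch.prune.
# The config keys ("score_func", "checkpoint") and the toy score function are
# assumptions for illustration; see the pruning guide for the supported arguments.
import torch
import torchvision
import modelopt.torch.prune as mtp

model = torchvision.models.resnet50()
dummy_input = torch.randn(1, 3, 224, 224)

def score_func(model):
    # Placeholder: in practice, return validation accuracy measured on a real
    # data loader so the search can rank candidate subnets.
    return 0.0

pruned_model, prune_state = mtp.prune(
    model=model,
    mode="fastnas",
    constraints={"flops": "60%"},  # target ~60% of the original FLOPs
    dummy_input=dummy_input,       # used to profile the model
    config={"score_func": score_func, "checkpoint": "fastnas_search.pth"},  # assumption
)
```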

examples/diffusers/quantization/README.md

Lines changed: 5 additions & 9 deletions

@@ -28,34 +28,31 @@ python quantize.py \
     --model {flux-dev|sdxl-1.0|sdxl-turbo|sd3-medium} \
     --format int8 --batch-size 2 \
     --calib-size 32 --collect-method min-mean \
-    --percentile 1.0 --alpha 0.8 \
-    --quant-level 3.0 --n-steps 20 \
+    --percentile 1.0 --alpha 0.8 --n-steps 20 \
     --model-dtype {Half/BFloat16} --trt-high-precision-dtype {Half|BFloat16} \
     --quantized-torch-ckpt-save-path ./{MODEL_NAME}.pt --onnx-dir {ONNX_DIR}
 ```

-### FLUX-Dev|SDXL|SDXL-Turbo FP8/FP4
+### FLUX-Dev|SDXL|SDXL-Turbo|LTX-Video FP8/FP4

 *In our example code, FP4 is only supported for Flux. However, you can modify our script to enable FP4 format support for your own model.*

 ```sh
 python quantize.py \
-    --model {flux-dev|sdxl-1.0|sdxl-turbo} --model-dtype {Half|BFloat16} --trt-high-precision-dtype {Half|BFloat16} \
-    --format {fp8|fp4} --batch-size 2 --calib-size {128|256} --quant-level {3.0|4.0} \
+    --model {flux-dev|sdxl-1.0|sdxl-turbo|ltx-video-dev} --model-dtype {Half|BFloat16} --trt-high-precision-dtype {Half|BFloat16} \
+    --format {fp8|fp4} --batch-size 2 --calib-size {128|256} --quantize-mha \
     --n-steps 20 --quantized-torch-ckpt-save-path ./{MODEL_NAME}.pt --collect-method default \
     --onnx-dir {ONNX_DIR}
 ```

-We recommend using a device with a minimum of 48GB of combined CPU and GPU memory for exporting ONNX models. Quant-level 4.0 requires additional memory.
+We recommend using a device with a minimum of 48GB of combined CPU and GPU memory for exporting ONNX models. If that is not available, please use the CPU for ONNX export.

 #### Important Parameters

 - `percentile`: Controls the collection range for the quantization scaling factors (amax): amax values are collected over `(n_steps * percentile)` steps. Recommendation: 1.0

 - `alpha`: A parameter in SmoothQuant, used for linear layers only. Recommendation: 0.8 for SDXL

-- `quant-level`: Which layers to be quantized, 1: `CNNs`, 2: `CNN + FFN`, 2.5: `CNN + FFN + QKV`, 3: `CNN + Almost all Linear (Including FFN, QKV, Proj and others)`, 4: `CNN + Almost all Linear + fMHA`. Recommendation: 2, 2.5 and 3, 4 is only for FP8, depending on the requirements for image quality & speedup. **You might notice a slight difference between FP8 quant level 3.0 and 4.0, as we are currently working to enhance the performance of FP8 fMHA.**
-
 - `calib-size`: For SDXL INT8, we recommend 32 or 64, for SDXL FP8, 128 is recommended.

 - `n_steps`: Recommendation: SD/SDXL 20 or 30, SDXL-Turbo 4.
@@ -138,7 +135,6 @@ python quantize.py \
     --format fp8 \
     --batch-size {1|2} \
     --calib-size 128 \
-    --quant-level 3.0 \
     --n-steps 20 \
     --quantized-torch-ckpt-save-path ./{MODEL}_fp8.pt \
     --collect-method default \
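
For context on what `quantize.py` drives internally: the pipeline's backbone is calibrated and quantized with `modelopt.torch.quantization`, and flags such as `--calib-size` and `--n-steps` control how much denoising work the calibration loop performs. The following is a hedged sketch, not the script itself; the model name, prompt set, and choice of the FP8 config are placeholders.

```python
# Hedged sketch of backbone PTQ roughly corresponding to what quantize.py drives.
import torch
import modelopt.torch.quantization as mtq
from diffusers import StableDiffusionXLPipeline

pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16
).to("cuda")

calib_prompts = ["a photo of a cat", "a watercolor landscape"]  # tiny placeholder set

def forward_loop(unet):
    # Run a few denoising steps so the inserted quantizers can collect amax
    # statistics; quantize.py exposes this via --calib-size and --n-steps.
    for prompt in calib_prompts:
        pipe(prompt, num_inference_steps=4)

# FP8 config shown as an example; quantize.py picks the config based on --format.
mtq.quantize(pipe.unet, mtq.FP8_DEFAULT_CFG, forward_loop)
```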
