
Commit dcdc2df

Push latest changes
Signed-off-by: Keval Morabia <[email protected]>
1 parent 4b28472 commit dcdc2df

118 files changed: +5404, -3064 lines changed (only a subset of the file diffs is shown below)


.github/workflows/unit_tests.yml

Lines changed: 18 additions & 3 deletions

@@ -64,7 +64,7 @@ jobs:
         with:
           python-version: "3.12"
       - name: Run unit tests (without coverage)
-        run: pip install tox && tox -e py312-torch28-unit
+        run: pip install tox && tox -e py312-torch28-tf_latest-unit
   multi-py:
     if: github.event_name == 'pull_request'
     needs: [linux]
@@ -79,7 +79,7 @@ jobs:
         with:
           python-version: "3.${{ matrix.py }}"
       - name: Run unit tests
-        run: pip install tox && tox -e py3${{ matrix.py }}-torch28-unit
+        run: pip install tox && tox -e py3${{ matrix.py }}-torch28-tf_latest-unit
   multi-torch:
     if: github.event_name == 'pull_request'
     needs: [linux]
@@ -94,7 +94,22 @@ jobs:
         with:
           python-version: "3.12"
       - name: Run unit tests
-        run: pip install tox && tox -e py312-torch${{ matrix.torch }}-unit
+        run: pip install tox && tox -e py312-torch${{ matrix.torch }}-tf_latest-unit
+  multi-transformers:
+    if: github.event_name == 'pull_request'
+    needs: [linux]
+    runs-on: ubuntu-latest
+    timeout-minutes: 30
+    strategy:
+      matrix:
+        tf: [min]
+    steps:
+      - uses: actions/checkout@v4
+      - uses: actions/setup-python@v5
+        with:
+          python-version: "3.12"
+      - name: Run unit tests
+        run: pip install tox && tox -e py312-torch28-tf_${{ matrix.tf }}-unit
   partial-install:
     if: github.event_name == 'pull_request'
     needs: [linux]

.pre-commit-config.yaml

Lines changed: 0 additions & 5 deletions

@@ -1,9 +1,4 @@
 # NOTE: Make sure to update version in dev requirements (setup.py) as well!
-exclude: >
-  (?x)^(
-      experimental/.*|
-  )$
-
 repos:
   - repo: https://github.com/pre-commit/pre-commit-hooks
     rev: v5.0.0

CHANGELOG.rst

Lines changed: 12 additions & 5 deletions

@@ -4,19 +4,28 @@ Model Optimizer Changelog (Linux)
 0.35 (2025-08-xx)
 ^^^^^^^^^^^^^^^^^

-**Backward Breaking Changes**
-
 **Deprecations**

 - Deprecate ``torch<2.6`` support.

+**Bug Fixes**
+
+- Fix attention head ranking logic for pruning Megatron Core GPT models.
+
 **New Features**

+- ModelOpt now supports PTQ and QAT for GPT-OSS models. See ``examples/gpt_oss`` for an end-to-end PTQ/QAT example.
+- Add support for QAT with Hugging Face + DeepSpeed. See ``examples/gpt_oss`` for an example.
+- Add support for QAT with LoRA. The LoRA adapters can be folded into the base model after QAT and deployed just like a regular PTQ model. See ``examples/gpt_oss`` for an example.
+- ModelOpt provides convenient trainers such as :class:`QATTrainer`, :class:`QADTrainer`, :class:`KDTrainer`, and :class:`QATSFTTrainer`, which inherit from the corresponding Hugging Face trainers.
+  ModelOpt trainers can be used as drop-in replacements for those Hugging Face trainers. See usage examples in ``examples/gpt_oss``, ``examples/llm_qat``, or ``examples/llm_distill``.
 - (Experimental) Add quantization support for custom TensorRT op in ONNX models.
 - Add support for Minifinetuning (MFT; https://arxiv.org/abs/2506.15702) self-corrective distillation, which enables training on small datasets with severely mitigated catastrophic forgetting.
 - Add tree decoding support for Megatron Eagle models.
 - For most VLMs, we now explicitly disable quantization of the vision part and add those modules to ``excluded_modules`` during HF export.
-- Add support for ``hidden_size`` and ``num_layers`` pruning for Megatron Core Mamba models in ``mcore_gpt_minitron`` mode.
+- Add support for ``mamba_num_heads``, ``mamba_head_dim``, ``hidden_size`` and ``num_layers`` pruning for Megatron Core Mamba or Hybrid Transformer Mamba models in ``mcore_minitron`` (previously ``mcore_gpt_minitron``) mode.
+- Add an example for QAT/QAD training with `LLaMA Factory <https://github.com/hiyouga/LLaMA-Factory/tree/main>`_. See ``examples/llm_qat/llama_factory`` for more details.
+- Upgrade TensorRT-LLM dependency to 1.0.0rc6.

 0.33 (2025-07-14)
 ^^^^^^^^^^^^^^^^^
@@ -25,8 +34,6 @@ Model Optimizer Changelog (Linux)

 - PyTorch dependencies for ``modelopt.torch`` features are no longer optional, and ``pip install nvidia-modelopt`` is now the same as ``pip install nvidia-modelopt[torch]``.

-**Deprecations**
-
 **New Features**

 - Upgrade TensorRT-LLM dependency to 0.20.
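
The drop-in trainer entry above can be illustrated with a short sketch. This is a hedged example, not the verified ModelOpt API: the import path for `QATTrainer` and the `quant_cfg` argument are assumptions, and the toy dataset exists only to keep the snippet self-contained. The authoritative usage is in `examples/llm_qat` and `examples/gpt_oss`.

```python
# Hedged sketch: swap transformers.Trainer for a ModelOpt QAT trainer.
# The QATTrainer import path and the quant_cfg argument are assumptions.
from datasets import Dataset
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments

from my_modelopt_trainers import QATTrainer  # hypothetical import path; see examples/llm_qat

model_name = "facebook/opt-125m"  # small placeholder model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Tiny toy dataset so the example is runnable end to end
texts = ["hello world", "quantization aware training"] * 4
enc = tokenizer(texts, truncation=True, padding="max_length", max_length=32)
enc["labels"] = [ids.copy() for ids in enc["input_ids"]]
train_dataset = Dataset.from_dict(dict(enc))

trainer = QATTrainer(  # used exactly like transformers.Trainer
    model=model,
    args=TrainingArguments(output_dir="qat_out", per_device_train_batch_size=2,
                           num_train_epochs=1, report_to="none"),
    train_dataset=train_dataset,
    quant_cfg="NVFP4_DEFAULT_CFG",  # assumption: selects the quantization recipe
)
trainer.train()  # runs QAT instead of plain fine-tuning
```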

README.md

Lines changed: 5 additions & 4 deletions

@@ -74,7 +74,8 @@ For enterprise users, the 8-bit quantization with Stable Diffusion is also avail

 ## Installation / Docker

-To use Model Optimizer with full dependencies (e.g. TensorRT-LLM deployment), we recommend using the provided docker image.
+To use Model Optimizer with full dependencies (e.g. TensorRT/TensorRT-LLM deployment), we recommend using our provided docker image
+which is based on the [TensorRT-LLM](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/tensorrt-llm/containers/release/tags) docker image with additional example-specific dependencies installed.

 After installing the [NVIDIA Container Toolkit](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html),
 please run the following commands to build the Model Optimizer docker container which has all the necessary
@@ -96,7 +97,7 @@ docker run --gpus all -it --shm-size 20g --rm docker.io/library/modelopt_example
 python -c "import modelopt; print(modelopt.__version__)"
 ```

-Alternatively, you can install it from [PyPI](https://pypi.org/project/nvidia-modelopt/) without TRT-LLM etc.
+Alternatively, you can install it from [NVIDIA PyPI](https://pypi.org/project/nvidia-modelopt/) without TRT-LLM etc.

 ```bash
 pip install -U "nvidia-modelopt[all]"
@@ -105,7 +106,7 @@ pip install -U "nvidia-modelopt[all]"
 To install from source for local development, you can install it as follows:

 ```bash
-pip install -e ".[dev]"
+pip install -e ".[all]"
 ```

 When installing from source, please make sure to re-run the install command every time you pull new changes in the repository so dependencies are also updated.
@@ -128,7 +129,7 @@ Quantization is an effective model optimization technique for large models. Quan

 ### Pruning \[[examples](./examples/README.md#pruning)\] \[[docs](https://nvidia.github.io/TensorRT-Model-Optimizer/guides/2_pruning.html)\]

-Pruning is a technique to reduce the model size and accelerate the inference by removing unnecessary weights. Model Optimizer provides Python APIs to prune Linear and Conv layers, and Transformer attention heads, MLP, embedding hidden size and number of layers (depth).
+Pruning is a technique to reduce the model size and accelerate inference by removing unnecessary weights. Model Optimizer provides Python APIs to prune primitives such as Linear and Conv layers, and complex modules such as Transformer or Mamba heads, MLP, embedding hidden size and number of layers (depth).

 ### Distillation \[[examples](./examples/README.md#distillation)\] \[[docs](https://nvidia.github.io/TensorRT-Model-Optimizer/guides/4_distillation.html)\]


docker/Dockerfile

Lines changed: 7 additions & 38 deletions

@@ -1,54 +1,23 @@
-FROM nvcr.io/nvidia/pytorch:25.04-py3
+FROM nvcr.io/nvidia/tensorrt-llm/release:1.0.0rc6

 ARG PIP_EXTRA_INDEX_URL="https://pypi.nvidia.com"
-ARG TRT_LLM_COMMIT=v0.20.0
-ARG REMOVE_TRT_LLM_SRC=1
-ARG CUDA_ARCH="89-real;90-real;100-real"
-
 ENV PIP_EXTRA_INDEX_URL=$PIP_EXTRA_INDEX_URL \
     PIP_NO_CACHE_DIR=off \
     PIP_CONSTRAINT= \
     TORCH_CUDA_ARCH_LIST="8.0 8.6 8.7 8.9 9.0 10.0+PTX"

-WORKDIR /workspace
-
-# Install TensorRT-LLM from source
-RUN --mount=type=ssh,id=nvidia git clone https://github.com/NVIDIA/TensorRT-LLM.git tensorrt-llm \
-    && cd tensorrt-llm \
-    && git checkout ${TRT_LLM_COMMIT} \
-    && git submodule update --init --recursive
-
-# Install required dependencies
-RUN bash tensorrt-llm/docker/common/install_base.sh $(python --version 2>&1 | awk '{print $2}')
-RUN bash tensorrt-llm/docker/common/install_cmake.sh
-RUN bash tensorrt-llm/docker/common/install_mpi4py.sh
-RUN bash tensorrt-llm/docker/common/install_tensorrt.sh
-RUN bash tensorrt-llm/docker/common/install_cuda_toolkit.sh
+RUN apt-get update && \
+    apt-get install -y libgl1 && \
+    rm -rf /var/lib/apt/lists/*

-RUN cd tensorrt-llm && git lfs install && git lfs pull
-
-RUN cd tensorrt-llm \
-    && ./scripts/build_wheel.py --job_count $(nproc) --clean --python_bindings --benchmarks --install --cuda_architecture=${CUDA_ARCH} \
-    && git rev-parse --short HEAD > /workspace/tensorrt-llm.commit \
-    && chmod -R 777 .
-RUN pip install tensorrt-llm/build/tensorrt_llm*.whl
+WORKDIR /workspace

-# Remove TensorRT-LLM source code to reduce image size except for benchmarks and examples folders
-RUN if [ "$REMOVE_TRT_LLM_SRC" = "1" ]; then \
-        mkdir -p tensorrt-llm_keep; \
-        mv tensorrt-llm/benchmarks tensorrt-llm_keep/benchmarks; \
-        mv tensorrt-llm/examples tensorrt-llm_keep/examples; \
-        rm -rf tensorrt-llm; \
-        mv tensorrt-llm_keep tensorrt-llm; \
-    fi
+RUN ln -s /app/tensorrt_llm /workspace/tensorrt_llm

 # Update PATH and LD_LIBRARY_PATH variables for the TensorRT binaries
-ENV LD_LIBRARY_PATH="/usr/local/tensorrt/targets/x86_64-linux-gnu/lib:${LD_LIBRARY_PATH}" \
+ENV LD_LIBRARY_PATH="/usr/lib/x86_64-linux-gnu:/usr/local/tensorrt/targets/x86_64-linux-gnu/lib:${LD_LIBRARY_PATH}" \
     PATH="/usr/local/tensorrt/targets/x86_64-linux-gnu/bin:${PATH}"

-# Export the path to 'libcudnn.so.X' needed by 'libonnxruntime_providers_tensorrt.so'
-ENV LD_LIBRARY_PATH=/usr/lib/x86_64-linux-gnu:$LD_LIBRARY_PATH
-
 # Install modelopt with all optional dependencies and pre-compile CUDA extensions otherwise they take several minutes on every docker run
 RUN pip install -U "nvidia-modelopt[all,dev-test]"
 RUN python -c "import modelopt.torch.quantization.extensions as ext; ext.precompile()"

docker/build.sh

Lines changed: 1 addition & 1 deletion

@@ -16,4 +16,4 @@

 set -e

-docker build --progress=plain . -f docker/Dockerfile -t modelopt_examples:latest "$@"
+docker build --network=host --progress=plain . -f docker/Dockerfile -t modelopt_examples:latest "$@"

docs/source/deployment/3_unified_hf.rst

Lines changed: 14 additions & 10 deletions

@@ -54,9 +54,10 @@ TensorRT-LLM
 ~~~~~~~~~~~~

 Models:
-* Llama 4, 3.1, 3.3 (FP8, NVFP4)
-* Qwen 3 (FP8, NVFP4)
-* Deepseek R1 (NVFP4)
+* Llama 4, 3.x (FP8, NVFP4)
+* Qwen 3, 2.5 (FP8, NVFP4)
+* Qwen 3 MoE (FP8, NVFP4)
+* Deepseek R1/V3 (NVFP4)
 * Mixtral 8x7B (FP8, NVFP4)
 * Medusa (FP8)
 * Eagle (FP8)
@@ -67,21 +68,24 @@ vLLM
 ~~~~

 Models:
-* Llama 3.1, 3.3 (FP8, NVFP4)
+* Llama 4, 3.x (FP8, NVFP4)
+* Qwen 3, 2.5 (FP8, NVFP4)
+* Qwen 3 MoE (FP8, NVFP4)
 * Mixtral 8x7B (FP8)
-* Deepseek R1 (NVFP4)
+* Deepseek R1/V3 (NVFP4)

-Requirements: vLLM v0.9.1 or later
+Requirements: vLLM v0.10.1 or later

 SGLang
 ~~~~~~

 Models:
-* Llama 3.1, 3.3 (FP8, NVFP4)
-* Deepseek R1 (NVFP4)
-* Llama 4 (FP8)
+* Llama 4, 3.x (FP8, NVFP4)
+* Qwen 3, 2.5 (FP8, NVFP4)
+* Qwen 3 MoE (FP8, NVFP4)
+* Deepseek R1/V3 (NVFP4)

-Requirements: SGLang v0.4.7 or later
+Requirements: SGLang v0.4.10 or later

 Note: While other models and quantization formats may work, they have not been thoroughly tested and validated.
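
As a sketch of how one of these deployment paths is exercised: a ModelOpt unified HF checkpoint can usually be loaded directly by vLLM (v0.10.1 or later, per the requirement above). This is a hedged example; the checkpoint path is a placeholder, and the explicit `quantization="modelopt"` argument is an assumption, since vLLM often detects the format from the checkpoint's quantization config.

```python
# Hedged sketch: serving a ModelOpt FP8 unified HF checkpoint with vLLM.
from vllm import LLM, SamplingParams

llm = LLM(
    model="path/to/Llama-3.1-8B-Instruct-FP8",  # placeholder exported checkpoint
    quantization="modelopt",                    # assumption: often auto-detected instead
)
outputs = llm.generate(
    ["Explain NVFP4 quantization in one sentence."],
    SamplingParams(temperature=0.0, max_tokens=64),
)
print(outputs[0].outputs[0].text)
```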

docs/source/getting_started/_installation_for_Linux.rst

Lines changed: 4 additions & 2 deletions

@@ -18,7 +18,7 @@ Latest Model Optimizer (``nvidia-modelopt``) currently has the following system
 +-------------------------+-----------------------------+
 | PyTorch                 | >=2.6                       |
 +-------------------------+-----------------------------+
-| TensorRT-LLM (Optional) | 0.20                        |
+| TensorRT-LLM (Optional) | 1.0.0rc6                    |
 +-------------------------+-----------------------------+
 | ONNX Runtime (Optional) | 1.22                        |
 +-------------------------+-----------------------------+
@@ -32,7 +32,9 @@ Environment setup

 **Using ModelOpt's docker image**

-Easiest way to get started with using Model Optimizer and additional dependencies (e.g. TensorRT-LLM deployment) is to start from our docker image.
+To use Model Optimizer with full dependencies (e.g. TensorRT/TensorRT-LLM deployment), we recommend using our provided docker image
+which is based on the `TensorRT-LLM <https://catalog.ngc.nvidia.com/orgs/nvidia/teams/tensorrt-llm/containers/release/tags>`_
+docker image with additional example-specific dependencies installed.

 After installing the `NVIDIA Container Toolkit <https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html>`_,
 please run the following commands to build the Model Optimizer docker container which has all the necessary

docs/source/guides/2_pruning.rst

Lines changed: 4 additions & 3 deletions

@@ -17,9 +17,10 @@ attention heads of the model. More details on these pruning modes is as follows:

 #. ``fastnas``: A pruning method recommended for Computer Vision models. Given a pretrained model,
    FastNAS finds the subnet which maximizes the score function while meeting the given constraints.
-#. ``mcore_gpt_minitron``: A pruning method developed by NVIDIA Research for pruning GPT-style models (e.g. Llama 3)
-   in NVIDIA NeMo or Megatron-LM framework that are using Pipeline Parallelism. It uses the activation
-   magnitudes to prune the mlp, attention heads, GQA query groups, embedding hidden size and number of layers of the model.
+#. ``mcore_minitron``: A pruning method developed by NVIDIA Research for pruning GPT, Mamba and Hybrid
+   Transformer Mamba models in the NVIDIA NeMo or Megatron-LM framework. It uses the activation magnitudes to prune
+   the mlp, transformer attention heads, GQA query groups, mamba heads and head dimension, embedding hidden size
+   and number of layers of the model.
    Check out more details of the algorithm in the `paper <https://arxiv.org/abs/2408.11796>`_.
 #. ``gradnas``: A light-weight pruning method recommended for language models like Hugging Face BERT and GPT-J.
    It uses the gradient information to prune the model's linear layers and attention heads to meet the given constraints.
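
For orientation, pruning in any of these modes is driven through the ``modelopt.torch.prune`` API. The sketch below uses the ``fastnas`` mode on a torchvision model because it is the most self-contained case; it is a hedged example in which the ``config`` keys and the dummy score function are assumptions, so consult the pruning documentation and examples for the exact interface. The ``mcore_minitron`` mode uses the same ``mtp.prune`` entry point but operates on Megatron Core GPT/Mamba models inside NeMo or Megatron-LM.

```python
# Hedged sketch of FastNAS pruning via modelopt.torch.prune.
# The config keys ("score_func", "checkpoint") and the toy score function are
# assumptions for illustration; see the pruning guide for the supported arguments.
import torch
import torchvision
import modelopt.torch.prune as mtp

model = torchvision.models.resnet50()
dummy_input = torch.randn(1, 3, 224, 224)

def score_func(model):
    # Placeholder: in practice, return validation accuracy measured on a real
    # data loader so the search can rank candidate subnets.
    return 0.0

pruned_model, prune_state = mtp.prune(
    model=model,
    mode="fastnas",
    constraints={"flops": "60%"},  # target ~60% of the original FLOPs
    dummy_input=dummy_input,       # used to profile the model
    config={"score_func": score_func, "checkpoint": "fastnas_search.pth"},  # assumption
)
```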

examples/diffusers/quantization/README.md

Lines changed: 5 additions & 9 deletions

@@ -28,34 +28,31 @@ python quantize.py \
     --model {flux-dev|sdxl-1.0|sdxl-turbo|sd3-medium} \
     --format int8 --batch-size 2 \
     --calib-size 32 --collect-method min-mean \
-    --percentile 1.0 --alpha 0.8 \
-    --quant-level 3.0 --n-steps 20 \
+    --percentile 1.0 --alpha 0.8 --n-steps 20 \
     --model-dtype {Half/BFloat16} --trt-high-precision-dtype {Half|BFloat16} \
     --quantized-torch-ckpt-save-path ./{MODEL_NAME}.pt --onnx-dir {ONNX_DIR}
 ```

-### FLUX-Dev|SDXL|SDXL-Turbo FP8/FP4
+### FLUX-Dev|SDXL|SDXL-Turbo|LTX-Video FP8/FP4

 *In our example code, FP4 is only supported for Flux. However, you can modify our script to enable FP4 format support for your own model.*

 ```sh
 python quantize.py \
-    --model {flux-dev|sdxl-1.0|sdxl-turbo} --model-dtype {Half|BFloat16} --trt-high-precision-dtype {Half|BFloat16} \
-    --format {fp8|fp4} --batch-size 2 --calib-size {128|256} --quant-level {3.0|4.0} \
+    --model {flux-dev|sdxl-1.0|sdxl-turbo|ltx-video-dev} --model-dtype {Half|BFloat16} --trt-high-precision-dtype {Half|BFloat16} \
+    --format {fp8|fp4} --batch-size 2 --calib-size {128|256} --quantize-mha \
     --n-steps 20 --quantized-torch-ckpt-save-path ./{MODEL_NAME}.pt --collect-method default \
     --onnx-dir {ONNX_DIR}
 ```

-We recommend using a device with a minimum of 48GB of combined CPU and GPU memory for exporting ONNX models. Quant-level 4.0 requires additional memory.
+We recommend using a device with a minimum of 48GB of combined CPU and GPU memory for exporting ONNX models. If that is not available, please use the CPU for ONNX export.

 #### Important Parameters

 - `percentile`: Controls the collection range for the quantization scaling factors (amax): amax values are collected over `(n_steps * percentile)` steps. Recommendation: 1.0

 - `alpha`: A parameter in SmoothQuant, used for linear layers only. Recommendation: 0.8 for SDXL

-- `quant-level`: Which layers to be quantized, 1: `CNNs`, 2: `CNN + FFN`, 2.5: `CNN + FFN + QKV`, 3: `CNN + Almost all Linear (Including FFN, QKV, Proj and others)`, 4: `CNN + Almost all Linear + fMHA`. Recommendation: 2, 2.5 and 3, 4 is only for FP8, depending on the requirements for image quality & speedup. **You might notice a slight difference between FP8 quant level 3.0 and 4.0, as we are currently working to enhance the performance of FP8 fMHA.**
-
 - `calib-size`: For SDXL INT8, we recommend 32 or 64, for SDXL FP8, 128 is recommended.

 - `n_steps`: Recommendation: SD/SDXL 20 or 30, SDXL-Turbo 4.
@@ -138,7 +135,6 @@ python quantize.py \
     --format fp8 \
     --batch-size {1|2} \
     --calib-size 128 \
-    --quant-level 3.0 \
     --n-steps 20 \
     --quantized-torch-ckpt-save-path ./{MODEL}_fp8.pt \
     --collect-method default \
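
For context on what `quantize.py` drives internally: the pipeline's backbone is calibrated and quantized with `modelopt.torch.quantization`, and flags such as `--calib-size` and `--n-steps` control how much denoising work the calibration loop performs. The following is a hedged sketch, not the script itself; the model name, prompt set, and choice of the FP8 config are placeholders.

```python
# Hedged sketch of backbone PTQ roughly corresponding to what quantize.py drives.
import torch
import modelopt.torch.quantization as mtq
from diffusers import StableDiffusionXLPipeline

pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16
).to("cuda")

calib_prompts = ["a photo of a cat", "a watercolor landscape"]  # tiny placeholder set

def forward_loop(unet):
    # Run a few denoising steps so the inserted quantizers can collect amax
    # statistics; quantize.py exposes this via --calib-size and --n-steps.
    for prompt in calib_prompts:
        pipe(prompt, num_inference_steps=4)

# FP8 config shown as an example; quantize.py picks the config based on --format.
mtq.quantize(pipe.unet, mtq.FP8_DEFAULT_CFG, forward_loop)
```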
