[Quantization] Add TRT-ModelOpt as a Backend #11173
@@ -0,0 +1,134 @@
<!-- Copyright 2025 The HuggingFace Team. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License. -->

# Nvidia ModelOpt

[nvidia_modelopt](https://github.com/NVIDIA/TensorRT-Model-Optimizer) is a unified library of state-of-the-art model optimization techniques such as quantization, pruning, distillation, and speculative decoding. It compresses deep learning models for downstream deployment frameworks like TensorRT-LLM or TensorRT to optimize inference speed.

Before you begin, make sure you have nvidia_modelopt installed.

```bash
pip install -U "nvidia_modelopt[hf]"
```
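
The package installs under the `modelopt` import name (the quantizer integration in this PR imports `modelopt.torch.quantization` and `modelopt.torch.opt`). As an optional sanity check, sketched here as an assumption about your environment:

```python
# Verify the Torch-side ModelOpt modules used by diffusers are importable.
import modelopt.torch.opt as mto
import modelopt.torch.quantization as mtq

print(mto.__name__, mtq.__name__)
```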

Quantize a model by passing [`NVIDIAModelOptConfig`] to [`~ModelMixin.from_pretrained`] (you can also load pre-quantized models). This works for any model in any modality, as long as it supports loading with [Accelerate](https://hf.co/docs/accelerate/index) and contains `torch.nn.Linear` layers.

The example below only quantizes the weights to FP8.

```python
import torch
from diffusers import AutoModel, SanaPipeline, NVIDIAModelOptConfig

model_id = "Efficient-Large-Model/Sana_600M_1024px_diffusers"
dtype = torch.bfloat16

quantization_config = NVIDIAModelOptConfig(quant_type="FP8", quant_method="modelopt")
transformer = AutoModel.from_pretrained(
    model_id,
    subfolder="transformer",
    quantization_config=quantization_config,
    torch_dtype=dtype,
)
pipe = SanaPipeline.from_pretrained(
    model_id,
    transformer=transformer,
    torch_dtype=dtype,
)
pipe.to("cuda")

print(f"Pipeline memory usage: {torch.cuda.max_memory_reserved() / 1024**3:.3f} GB")

prompt = "A cat holding a sign that says hello world"
image = pipe(
    prompt, num_inference_steps=50, guidance_scale=4.5, max_sequence_length=512
).images[0]
image.save("output.png")
```

Review comment on lines +33 to +44: Let's prefer using
Reply: I have kept it similar to all the other quantization docs (quanto, torchao, etc.); can we keep it similar to them for now? In those docs they use a specific quant config.

## NVIDIAModelOptConfig

The `NVIDIAModelOptConfig` class accepts three parameters (see the example after this list):
- `quant_type`: A string specifying one of the quantization types listed below.
- `modules_to_not_convert`: A list of full or partial module names that should not be quantized. For example, to skip quantization of the [`SanaTransformer2DModel`]'s conv blocks, specify `modules_to_not_convert=["conv"]`.
- `kwargs`: A dict of keyword arguments to pass to the underlying quantization method, which is selected based on `quant_type`.
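
As a rough illustration of how these parameters fit together, here is a minimal sketch. The `modules_to_not_convert` value mirrors the conv example above; passing `channel_quantize` and `block_quantize` as extra keyword arguments follows the "Required Kwargs" column in the table below, and the specific values are illustrative assumptions rather than tuned recommendations.

```python
from diffusers import NVIDIAModelOptConfig

# FP8 weight-only quantization that leaves conv blocks untouched.
weight_only_config = NVIDIAModelOptConfig(
    quant_type="FP8",
    quant_method="modelopt",
    modules_to_not_convert=["conv"],
)

# FP8 block quantization: channel/block sizes are passed as additional kwargs
# (illustrative values only; see the table below for supported combinations).
block_config = NVIDIAModelOptConfig(
    quant_type="FP8",
    quant_method="modelopt",
    channel_quantize=-1,
    block_quantize=128,
)
```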

## Supported quantization types

ModelOpt supports weight-only, per-channel, and block quantization in int8, fp8, int4, nf4, and nvfp4. These methods reduce the memory footprint of the model weights while maintaining model quality during inference.

Weight-only quantization stores the model weights in a specific low-bit data type but performs computation with a higher-precision data type, like `bfloat16`. This lowers the memory requirements from model weights but retains the memory peaks for activation computation.
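
One quick way to see the effect after loading (a hedged sketch; what you observe depends on the chosen `quant_type` and on how ModelOpt stores the compressed weights):

```python
# Collect the dtypes of the stored weights of the quantized `transformer`
# from the FP8 example above; this may include a low-bit dtype
# (e.g. torch.float8_e4m3fn) alongside the bfloat16 used for computation.
weight_dtypes = {p.dtype for n, p in transformer.named_parameters() if n.endswith("weight")}
print(weight_dtypes)
```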

The quantization methods supported are as follows:

| **Quantization Type** | **Supported Schemes** | **Required Kwargs** | **Additional Notes** |
|-----------------------|-----------------------|---------------------|----------------------|
| **INT8** | `int8 weight only`, `int8 channel quantization`, `int8 block quantization` | `quant_type`, `quant_type + channel_quantize`, `quant_type + channel_quantize + block_quantize` | |
| **FP8** | `fp8 weight only`, `fp8 channel quantization`, `fp8 block quantization` | `quant_type`, `quant_type + channel_quantize`, `quant_type + channel_quantize + block_quantize` | |
| **INT4** | `int4 weight only`, `int4 block quantization` | `quant_type`, `quant_type + channel_quantize + block_quantize` | `channel_quantize = -1` is the only supported value for now |
| **NF4** | `nf4 weight only`, `nf4 double block quantization` | `quant_type`, `quant_type + channel_quantize + block_quantize + scale_channel_quantize + scale_block_quantize` | `channel_quantize = -1` and `scale_channel_quantize = -1` are the only supported values for now |
| **NVFP4** | `nvfp4 weight only`, `nvfp4 block quantization` | `quant_type`, `quant_type + channel_quantize + block_quantize` | `channel_quantize = -1` is the only supported value for now |
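
For instance, a hedged sketch of an NF4 double block quantization config built from the kwargs in the table (the block sizes are illustrative assumptions, not recommendations):

```python
from diffusers import NVIDIAModelOptConfig

# NF4 double block quantization: both the weights and their scales are block-quantized.
# channel_quantize and scale_channel_quantize must be -1 for now (see the notes above).
nf4_config = NVIDIAModelOptConfig(
    quant_type="NF4",
    quant_method="modelopt",
    channel_quantize=-1,
    block_quantize=64,
    scale_channel_quantize=-1,
    scale_block_quantize=256,
)
```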

Note: channel and block quantization generally don't work well with convolutional layers. Use the `modules_to_not_convert` argument to skip quantization for them.

Refer to the [official modelopt documentation](https://nvidia.github.io/TensorRT-Model-Optimizer/) for a better understanding of the available quantization methods and the exhaustive list of configuration options.

## Serializing and Deserializing quantized models

To serialize a quantized model in a given dtype, first load the model with the desired quantization dtype and then save it using the [`~ModelMixin.save_pretrained`] method.

```python
import torch
from diffusers import AutoModel, NVIDIAModelOptConfig
from modelopt.torch.opt import enable_huggingface_checkpointing

enable_huggingface_checkpointing()

model_id = "Efficient-Large-Model/Sana_600M_1024px_diffusers"
quant_config_fp8 = {"quant_type": "FP8", "quant_method": "modelopt"}
quant_config_fp8 = NVIDIAModelOptConfig(**quant_config_fp8)
model = AutoModel.from_pretrained(
    model_id,
    subfolder="transformer",
    quantization_config=quant_config_fp8,
    torch_dtype=torch.bfloat16,
)
model.save_pretrained("path/to/sana_fp8", safe_serialization=False)
```

To load a serialized quantized model, use the [`~ModelMixin.from_pretrained`] method.

```python
import torch
from diffusers import AutoModel, NVIDIAModelOptConfig, SanaPipeline
from modelopt.torch.opt import enable_huggingface_checkpointing

enable_huggingface_checkpointing()

quantization_config = NVIDIAModelOptConfig(quant_type="FP8", quant_method="modelopt")
transformer = AutoModel.from_pretrained(
    "path/to/sana_fp8",  # the directory written by save_pretrained above
    quantization_config=quantization_config,
    torch_dtype=torch.bfloat16,
)
pipe = SanaPipeline.from_pretrained(
    "Efficient-Large-Model/Sana_600M_1024px_diffusers",
    transformer=transformer,
    torch_dtype=torch.bfloat16,
)
pipe.to("cuda")
prompt = "A cat holding a sign that says hello world"
image = pipe(
    prompt, num_inference_steps=50, guidance_scale=4.5, max_sequence_length=512
).images[0]
image.save("output.png")
```

@@ -0,0 +1 @@
from .modelopt_quantizer import NVIDIAModelOptQuantizer

@@ -0,0 +1,163 @@
from typing import TYPE_CHECKING, Any, Dict, List, Union

from ...utils import (
    get_module_from_name,
    is_accelerate_available,
    is_nvidia_modelopt_available,
    is_nvidia_modelopt_version,
    is_torch_available,
    logging,
)
from ..base import DiffusersQuantizer


if TYPE_CHECKING:
    from ...models.modeling_utils import ModelMixin


if is_torch_available():
    import torch

if is_accelerate_available():
    from accelerate.utils import set_module_tensor_to_device


logger = logging.get_logger(__name__)


class NVIDIAModelOptQuantizer(DiffusersQuantizer):
    r"""
    Diffusers Quantizer for TensorRT Model Optimizer
    """

    use_keep_in_fp32_modules = True
    requires_calibration = False
    required_packages = ["modelopt"]

    def __init__(self, quantization_config, **kwargs):
        super().__init__(quantization_config, **kwargs)

    def validate_environment(self, *args, **kwargs):
        if not is_nvidia_modelopt_available():
            raise ImportError(
                "Loading an nvidia-modelopt quantized model requires the nvidia-modelopt library (`pip install nvidia-modelopt`)"
            )

        self.offload = False

        device_map = kwargs.get("device_map", None)
        if isinstance(device_map, dict):
            if "cpu" in device_map.values() or "disk" in device_map.values():
                if self.pre_quantized:
                    raise ValueError(
                        "You are attempting to perform cpu/disk offload with a pre-quantized modelopt model. "
                        "This is not supported yet. Please remove the CPU or disk device from the `device_map` argument."
                    )
                else:
                    self.offload = True

    def check_if_quantized_param(
        self,
        model: "ModelMixin",
        param_value: "torch.Tensor",
        param_name: str,
        state_dict: Dict[str, Any],
        **kwargs,
    ):
        # ModelOpt imports diffusers internally. This is here to prevent circular imports.
        from modelopt.torch.quantization.utils import is_quantized

        module, tensor_name = get_module_from_name(model, param_name)
        if self.pre_quantized:
            return True
        elif is_quantized(module) and "weight" in tensor_name:
            return True
        return False

    def create_quantized_param(
        self,
        model: "ModelMixin",
        param_value: "torch.Tensor",
        param_name: str,
        target_device: "torch.device",
        *args,
        **kwargs,
    ):
        """
        Create the quantized parameter by calling .calibrate() after setting it to the module.
        """
        # ModelOpt imports diffusers internally. This is here to prevent circular imports.
        import modelopt.torch.quantization as mtq

        dtype = kwargs.get("dtype", torch.float32)
        module, tensor_name = get_module_from_name(model, param_name)
        if self.pre_quantized:
            module._parameters[tensor_name] = torch.nn.Parameter(param_value.to(device=target_device))
        else:
            # Place the unquantized weight on the target device, then calibrate and compress it in place.
            set_module_tensor_to_device(model, param_name, target_device, param_value, dtype)
            mtq.calibrate(
                module, self.quantization_config.modelopt_config["algorithm"], self.quantization_config.forward_loop
            )
            mtq.compress(module)
            module.weight.requires_grad = False

Review comment on `mtq.compress`: mtq.compress compresses the model weights into lower-bit representations, allowing users to leverage it directly at the Torch level. However, as previously mentioned, to achieve actual speed improvements, we need to utilize the TensorRT runtime rather than the Torch runtime.
Reply: @ishan-modi could we also mention this bit in the docs?

    def adjust_max_memory(self, max_memory: Dict[str, Union[int, str]]) -> Dict[str, Union[int, str]]:
        # Reserve 10% headroom so quantization/calibration buffers fit alongside the weights.
        max_memory = {key: val * 0.90 for key, val in max_memory.items()}
        return max_memory

    def adjust_target_dtype(self, target_dtype: "torch.dtype") -> "torch.dtype":
        if self.quantization_config.quant_type == "FP8":
            target_dtype = torch.float8_e4m3fn
        return target_dtype

    def update_torch_dtype(self, torch_dtype: "torch.dtype" = None) -> "torch.dtype":
        if torch_dtype is None:
            logger.info("You did not specify `torch_dtype` in `from_pretrained`. Setting it to `torch.float32`.")
            torch_dtype = torch.float32
        return torch_dtype

    def _process_model_before_weight_loading(
        self,
        model: "ModelMixin",
        device_map,
        keep_in_fp32_modules: List[str] = [],
        **kwargs,
    ):
        # ModelOpt imports diffusers internally. This is here to prevent circular imports.
        import modelopt.torch.opt as mto

        if self.pre_quantized:
            return

        modules_to_not_convert = self.quantization_config.modules_to_not_convert

        if modules_to_not_convert is None:
            modules_to_not_convert = []
        if isinstance(modules_to_not_convert, str):
            modules_to_not_convert = [modules_to_not_convert]
        modules_to_not_convert.extend(keep_in_fp32_modules)

        # Disable quantization for any module whose name matches an excluded pattern.
        for module in modules_to_not_convert:
            self.quantization_config.modelopt_config["quant_cfg"]["*" + module + "*"] = {"enable": False}
        self.quantization_config.modules_to_not_convert = modules_to_not_convert
        mto.apply_mode(model, mode=[("quantize", self.quantization_config.modelopt_config)])
        model.config.quantization_config = self.quantization_config

    def _process_model_after_weight_loading(self, model, **kwargs):
        # ModelOpt imports diffusers internally. This is here to prevent circular imports.
        from modelopt.torch.opt import ModeloptStateManager

        if self.pre_quantized:
            return model

        # Drop per-submodule ModelOpt state so only the top-level model keeps it.
        for _, m in model.named_modules():
            if hasattr(m, ModeloptStateManager._state_key) and m is not model:
                ModeloptStateManager.remove_state(m)

        return model

    @property
    def is_trainable(self):
        return True

    @property
    def is_serializable(self):
        return True