
Commit f7425fc

cjluo-nv authored and yeyu-nvidia committed
[NVBug: 5525758] Update VLM-PTQ readme (#339)
Signed-off-by: Chenjie Luo <[email protected]>
Signed-off-by: Ye Yu <[email protected]>
1 parent 89c65b9 commit f7425fc

File tree

2 files changed: +14, -41 lines


CHANGELOG.rst

Lines changed: 1 addition & 0 deletions
@@ -15,6 +15,7 @@ Model Optimizer Changelog (Linux)

 - ``high_precision_dtype`` defaults to fp16 in ONNX quantization, i.e. quantized output model weights are now FP16 by default.
 - Upgrade TensorRT-LLM dependency to 1.1.0rc2.
+- Support Phi-4-multimodal and Qwen2.5-VL quantized HF checkpoint export in ``examples/vlm_ptq``.

 0.35 (2025-09-04)
 ^^^^^^^^^^^^^^^^^

examples/vlm_ptq/README.md

Lines changed: 13 additions & 41 deletions
@@ -36,15 +36,19 @@ Please refer to the [llm_ptq/README.md](../llm_ptq/README.md#current-out-of-the-

 ### Supported Models

-| Model | type | fp8 | int8_sq | int4_awq | w4a8_awq<sup>1</sup> | nvfp4<sup>2</sup> |
-| :---: | :---: | :---: | :---: | :---: | :---: | :---: |
-| Llava | llava | ✅ | ✅ | ✅ | ✅ | ✅ |
-| VILA | vila | ✅ | ✅ | ✅ | ✅ | ✅ |
-| Phi-3-vision | phi | ✅ | ✅ | ✅ | ✅ | ✅ |
-| Qwen2.5-VL | qwen | ✅ | ✅ | ✅ | ✅ | ✅ |
+| Model | fp8 | int8_sq<sup>1</sup> | int4_awq | w4a8_awq<sup>2</sup> | nvfp4<sup>3</sup> |
+| :---: | :---: | :---: | :---: | :---: | :---: |
+| Llava | ✅ | ✅ | ✅ | ✅ | - |
+| VILA | ✅ | ✅ | ✅ | ✅ | - |
+| Phi-3-vision, Phi-4-multimodal | ✅ | ✅ | ✅ | ✅ | ✅ |
+| Qwen2-VL, Qwen2.5-VL | ✅ | ✅ | ✅ | ✅ | ✅ |
+| Gemma3 | ✅ | - | - | - | - |

-> *<sup>1.</sup>The w4a8_awq is an experimental quantization scheme that may result in a higher accuracy penalty.* \
-> *<sup>2.</sup>A selective set of popular models is internally tested. The actual model support list may be longer. NVFP4 inference requires Blackwell GPUs and TensorRT-LLM v0.17 or later.*
+> *<sup>1.</sup>Only TensorRT-LLM checkpoint export is supported; not compatible with the TensorRT-LLM torch backend.* \
+> *<sup>2.</sup>The w4a8_awq is an experimental quantization scheme that may result in a higher accuracy penalty.* \
+> *<sup>3.</sup>A selective set of popular models is internally tested. The actual model support list may be longer. NVFP4 inference requires Blackwell GPUs and TensorRT-LLM v0.17 or later.*
+
+> *For detailed TensorRT-LLM torch backend multimodal support, please refer to [this doc](https://github.com/NVIDIA/TensorRT-LLM/blob/main/docs/source/models/supported-models.md#multimodal-feature-support-matrix-pytorch-backend).*

 > *The accuracy loss after PTQ may vary depending on the actual model and the quantization method. Different models may have different accuracy loss, and usually the accuracy loss is more significant when the base model is small. If the accuracy after PTQ does not meet the requirement, please try either modifying [hf_ptq.py](../llm_ptq/hf_ptq.py) to disable the KV cache quantization or using [QAT](./../llm_qat/README.md) instead.*
5054
@@ -56,40 +60,8 @@ The following scripts provide an all-in-one and step-by-step model quantization

 ### Hugging Face Example [Script](./scripts/huggingface_example.sh)

-For [Llava](https://huggingface.co/llava-hf/llava-1.5-7b-hf):
-
-```bash
-git clone https://huggingface.co/llava-hf/llava-1.5-7b-hf
-scripts/huggingface_example.sh --type llava --model llava-1.5-7b-hf --quant [fp8|int8_sq|int4_awq|w4a8_awq] --tp [1|2|4|8]
-```
-
-For VILA models like [VILA1.5-3b](https://huggingface.co/Efficient-Large-Model/VILA1.5-3b):
-
-```bash
-git clone https://huggingface.co/Efficient-Large-Model/VILA1.5-3b vila1.5-3b
-scripts/huggingface_example.sh --type vila --model vila1.5-3b --quant [fp8|int8_sq|int4_awq|w4a8_awq] --tp [1|2|4|8]
-```
-
-For [Phi-3-vision](https://huggingface.co/microsoft/Phi-3-vision-128k-instruct):
-
-```bash
-git clone https://huggingface.co/microsoft/Phi-3-vision-128k-instruct
-scripts/huggingface_example.sh --type phi --model Phi-3-vision-128k-instruct --quant [fp8|int8_sq|int4_awq|w4a8_awq]
-```
-
-For [Qwen2.5-VL](https://huggingface.co/Qwen/Qwen2.5-VL-7B-Instruct):
-
-```bash
-git clone https://huggingface.co/Qwen/Qwen2.5-VL-7B-Instruct
-scripts/huggingface_example.sh --type qwen --model Qwen2.5-VL-7B-Instruct --quant [fp8|nvfp4|int8_sq|int4_awq|w4a8_awq]
-```
-
-The example scripts above also have an additional flag `--tasks gqa`, which triggers evaluation of the built TensorRT engine using the GQA benchmark. Details of the evaluation are explained in this [tutorial](../vlm_eval/README.md).
-
-If you encounter Out of Memory (OOM) issues during inference or evaluation, you can try lowering the `--kv_cache_free_gpu_memory_fraction` argument (default is 0.8) to reduce GPU memory usage for the KV cache:
-
 ```bash
-scripts/huggingface_example.sh --type phi --model Phi-3-vision-128k-instruct --quant fp8 --kv_cache_free_gpu_memory_fraction 0.5
+scripts/huggingface_example.sh --model <Hugging Face model card or checkpoint> --quant [fp8|nvfp4|int8_sq|int4_awq|w4a8_awq]
 ```
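As a quick illustration of the consolidated command added above, the sketch below quantizes one of the models from the support matrix to FP8. The model card and quantization format are assumptions chosen from the table; check `scripts/huggingface_example.sh` for the flags supported by your ModelOpt version.

```bash
# Hypothetical invocation of the unified example script (flags as documented above).
# Qwen/Qwen2.5-VL-7B-Instruct and fp8 are illustrative choices from the support matrix.
scripts/huggingface_example.sh --model Qwen/Qwen2.5-VL-7B-Instruct --quant fp8
```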

## Pre-Quantized Checkpoints
