> *<sup>1.</sup>Only TensorRT-LLM checkpoint export is supported; this path is not compatible with the TensorRT-LLM torch backend.*\
> *<sup>2.</sup>w4a8_awq is an experimental quantization scheme that may incur a higher accuracy penalty.*\
> *<sup>3.</sup>A selected set of popular models is tested internally; the actual model support list may be longer. NVFP4 inference requires Blackwell GPUs and TensorRT-LLM v0.17 or later.*
> *The accuracy loss after PTQ varies with the model and the quantization method, and is usually more significant when the base model is small. If the accuracy after PTQ does not meet your requirements, try modifying [hf_ptq.py](../llm_ptq/hf_ptq.py) to disable KV cache quantization, or use [QAT](./../llm_qat/README.md) instead.*
The following scripts provide an all-in-one and step-by-step model quantization example.
### Hugging Face Example [Script](./scripts/huggingface_example.sh)
For [Llava](https://huggingface.co/llava-hf/llava-1.5-7b-hf):
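A minimal sketch of the invocation follows; the `--type llava` flag and the choice of `fp8` as the quantization format are assumptions, so check `scripts/huggingface_example.sh` for the exact interface:

```bash
# Hypothetical invocation; verify the supported flags in scripts/huggingface_example.sh
scripts/huggingface_example.sh --type llava --model llava-hf/llava-1.5-7b-hf --quant fp8
```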
The example scripts above also support an additional flag `--tasks gqa`, which triggers evaluation of the built TensorRT engine on the GQA benchmark. Details of the evaluation are explained in this [tutorial](../vlm_eval/README.md).
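For instance, extending the hypothetical invocation above:

```bash
# --tasks gqa is documented above; the remaining flags are assumptions
scripts/huggingface_example.sh --type llava --model llava-hf/llava-1.5-7b-hf --quant fp8 --tasks gqa
```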
If you encounter Out of Memory (OOM) issues during inference or evaluation, try lowering the `--kv_cache_free_gpu_memory_fraction` argument (default: 0.8) to reduce the GPU memory used by the KV cache:
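For example, a sketch that assumes the argument is accepted by the same example script:

```bash
# Lower the KV cache fraction from the default 0.8 to ease GPU memory pressure
scripts/huggingface_example.sh --type llava --model llava-hf/llava-1.5-7b-hf --quant fp8 \
    --kv_cache_free_gpu_memory_fraction 0.5
```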