1 change: 1 addition & 0 deletions CHANGELOG.rst
@@ -15,6 +15,7 @@ Model Optimizer Changelog (Linux)

- ``high_precision_dtype`` now defaults to fp16 in ONNX quantization, i.e., quantized output model weights are now FP16 by default.
- Upgrade TensorRT-LLM dependency to 1.1.0rc2.
- Support Phi-4-multimodal and Qwen2.5-VL quantized HF checkpoint export in ``examples/vlm_ptq``.

0.35 (2025-09-04)
^^^^^^^^^^^^^^^^^
54 changes: 13 additions & 41 deletions examples/vlm_ptq/README.md
@@ -36,15 +36,19 @@ Please refer to the [llm_ptq/README.md](../llm_ptq/README.md#current-out-of-the-

### Supported Models

| Model | type | fp8 | int8_sq | int4_awq | w4a8_awq<sup>1</sup> | nvfp4<sup>2</sup> |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| Llava | llava | ✅ | ✅ | ✅ | ✅ | ❌ |
| VILA | vila | ✅ | ✅ | ✅ | ✅ | ❌ |
| Phi-3-vision | phi | ✅ | ✅ | ✅ | ✅ | ❌ |
| Qwen2.5-VL | qwen | ✅ | ✅ | ✅ | ✅ | ❌ |
| Model | fp8 | int8_sq<sup>1</sup> | int4_awq | w4a8_awq<sup>2</sup> | nvfp4<sup>3</sup> |
| :---: | :---: | :---: | :---: | :---: | :---: |
| Llava | ✅ | ✅ | ✅ | ✅ | - |
| VILA | ✅ | ✅ | ✅ | ✅ | - |
| Phi-3-vision, Phi-4-multimodal | ✅ | ✅ | ✅ | ✅ | ✅ |
| Qwen2-VL, Qwen2.5-VL | ✅ | ✅ | ✅ | ✅ | ✅ |
| Gemma3 | ✅ | - | - | - | - |

> *<sup>1.</sup>The w4a8_awq is an experimental quantization scheme that may result in a higher accuracy penalty.* \
> *<sup>2.</sup>A selective set of the popular models are internally tested. The actual model support list may be longer. NVFP4 inference requires Blackwell GPUs and TensorRT-LLM v0.17 or later.*
> *<sup>1.</sup>Only TensorRT-LLM checkpoint export is supported; this scheme is not compatible with the TensorRT-LLM torch backend.* \
> *<sup>2.</sup>w4a8_awq is an experimental quantization scheme that may result in a higher accuracy penalty.* \
> *<sup>3.</sup>A selective set of popular models is internally tested; the actual model support list may be longer. NVFP4 inference requires Blackwell GPUs and TensorRT-LLM v0.17 or later.*

> *For detailed TensorRT-LLM torch backend multimodal support, please refer to [this doc](https://github.com/NVIDIA/TensorRT-LLM/blob/main/docs/source/models/supported-models.md#multimodal-feature-support-matrix-pytorch-backend)*

Comment on lines +51 to 52

🛠️ Refactor suggestion

🧩 Analysis chain

Verify external link and naming consistency.

Confirm the anchor “multimodal‑feature‑support‑matrix‑pytorch‑backend” still exists and that “PyTorch backend” casing matches upstream docs.


🌐 Web query:

Does the TensorRT-LLM repo contain the "Multimodal feature support matrix — PyTorch backend" section at docs/source/models/supported-models.md with that anchor, and what exact section title/casing do they use?

💡 Result:

Short answer: No — there is no section with the exact anchor/title "Multimodal feature support matrix — PyTorch backend" at docs/source/models/supported-models.md. The repository/site uses the heading "Models (PyTorch Backend)". [1][2]

Sources:

  • TensorRT-LLM Support Matrix (shows "Models (PyTorch Backend)"). [1]
  • GitHub copy of support-matrix.md in the repo (shows "Support Matrix" / "Models" headings). [2]

Update README link and 'PyTorch Backend' casing to match upstream docs. The anchor "multimodal‑feature‑support‑matrix‑pytorch‑backend" does not exist; upstream uses the heading "Models (PyTorch Backend)". Change examples/vlm_ptq/README.md (lines 51–52) to link to the correct section/anchor and match the "PyTorch Backend" casing.

🤖 Prompt for AI Agents
In examples/vlm_ptq/README.md around lines 51–52, the README links to a
non-existent anchor and uses incorrect casing "torch backend"; update the link
anchor to match the upstream heading "Models (PyTorch Backend)" (use
.../supported-models.md#models-pytorch-backend) and change the displayed text to
"PyTorch Backend" (and ensure the surrounding phrase matches upstream casing).

> *The accuracy loss after PTQ varies with the model and the quantization method; it is usually more significant when the base model is small. If the accuracy after PTQ does not meet your requirements, try modifying [hf_ptq.py](../llm_ptq/hf_ptq.py) to disable KV cache quantization, or use [QAT](./../llm_qat/README.md) instead.*
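
A minimal sketch of the first option, assuming `hf_ptq.py` exposes a KV-cache quantization format flag that can be set to `none`; the flag names below are assumptions, so check the script's argument parser for the exact spelling before use:

```bash
# Hypothetical sketch: quantize weights to FP8 while skipping KV cache quantization.
# --pyt_ckpt_path, --qformat, --kv_cache_qformat and --export_path are assumed flag
# names; verify them against hf_ptq.py before running.
python ../llm_ptq/hf_ptq.py \
    --pyt_ckpt_path <Hugging Face model card or checkpoint> \
    --qformat fp8 \
    --kv_cache_qformat none \
    --export_path exported_fp8_ckpt
```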

@@ -56,40 +60,8 @@ The following scripts provide an all-in-one and step-by-step model quantization

### Hugging Face Example [Script](./scripts/huggingface_example.sh)

For [Llava](https://huggingface.co/llava-hf/llava-1.5-7b-hf):

```bash
git clone https://huggingface.co/llava-hf/llava-1.5-7b-hf
scripts/huggingface_example.sh --type llava --model llava-1.5-7b-hf --quant [fp8|int8_sq|int4_awq|w4a8_awq] --tp [1|2|4|8]
```

For VILA models like [VILA1.5-3b](https://huggingface.co/Efficient-Large-Model/VILA1.5-3b):

```bash
git clone https://huggingface.co/Efficient-Large-Model/VILA1.5-3b vila1.5-3b
scripts/huggingface_example.sh --type vila --model vila1.5-3b --quant [fp8|int8_sq|int4_awq|w4a8_awq] --tp [1|2|4|8]
```

For [Phi-3-vision](https://huggingface.co/microsoft/Phi-3-vision-128k-instruct):

```bash
git clone https://huggingface.co/microsoft/Phi-3-vision-128k-instruct
scripts/huggingface_example.sh --type phi --model Phi-3-vision-128k-instruct --quant [fp8|int8_sq|int4_awq|w4a8_awq]
```

For [Qwen2.5-VL](https://huggingface.co/Qwen/Qwen2.5-VL-7B-Instruct):

```bash
git clone https://huggingface.co/Qwen/Qwen2.5-VL-7B-Instruct
scripts/huggingface_example.sh --type qwen --model Qwen2.5-VL-7B-Instruct --quant [fp8|nvfp4|int8_sq|int4_awq|w4a8_awq]
```

The example scripts above also accept an additional flag `--tasks gqa`, which triggers evaluation of the built TensorRT engine using the GQA benchmark. Details of the evaluation are explained in this [tutorial](../vlm_eval/README.md).
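
For example, assuming the consolidated script invocation below still accepts this flag, GQA evaluation can be chained onto a quantization run (the Qwen2.5-VL checkpoint is used purely for illustration):

```bash
# Quantize to FP8, then evaluate the built engine on the GQA benchmark
scripts/huggingface_example.sh --model Qwen/Qwen2.5-VL-7B-Instruct --quant fp8 --tasks gqa
```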

If you encounter Out of Memory (OOM) issues during inference or evaluation, you can try lowering the `--kv_cache_free_gpu_memory_fraction` argument (default is 0.8) to reduce GPU memory usage for kv_cache:

```bash
scripts/huggingface_example.sh --type phi --model Phi-3-vision-128k-instruct --quant fp8 --kv_cache_free_gpu_memory_fraction 0.5
scripts/huggingface_example.sh --model <Hugging Face model card or checkpoint> --quant [fp8|nvfp4|int8_sq|int4_awq|w4a8_awq]
```
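
For instance, combining the consolidated invocation above with a concrete checkpoint and a reduced KV cache fraction (assuming the script still accepts `--kv_cache_free_gpu_memory_fraction`):

```bash
# FP8 quantization of Phi-3-vision with a smaller KV cache memory fraction to avoid OOM
git clone https://huggingface.co/microsoft/Phi-3-vision-128k-instruct
scripts/huggingface_example.sh --model Phi-3-vision-128k-instruct --quant fp8 --kv_cache_free_gpu_memory_fraction 0.5
```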

## Pre-Quantized Checkpoints