
Commit 85b309f

Qwen quantize and hf export support in examples (#311)
Signed-off-by: Riyad Islam <[email protected]>
1 parent b913290 commit 85b309f

File tree: 5 files changed, +30 -8 lines changed

examples/llm_ptq/hf_ptq.py

Lines changed: 1 addition & 1 deletion
@@ -742,7 +742,7 @@ def output_decode(generated_ids, input_shape):
     )
     parser.add_argument(
         "--verbose",
-        help="Print verbose output (e.g. quantization summary). Disable by --no_verbose.",
+        help="Print verbose output (e.g. quantization summary). Disable by --no-verbose.",
         default=True,
         action=argparse.BooleanOptionalAction,
     )
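A note on the one-line fix above: `argparse.BooleanOptionalAction` derives the negative flag from the option name, so the switch that actually exists is `--no-verbose`; the old help text pointed at a nonexistent `--no_verbose`. A minimal, self-contained sketch of that stdlib behavior (Python 3.9+, independent of hf_ptq.py):

```python
import argparse

# BooleanOptionalAction registers both "--verbose" and the auto-derived
# "--no-verbose" flag; no underscore variant ("--no_verbose") is created.
parser = argparse.ArgumentParser()
parser.add_argument(
    "--verbose",
    help="Print verbose output. Disable by --no-verbose.",
    default=True,
    action=argparse.BooleanOptionalAction,
)

print(parser.parse_args([]).verbose)                # True (default)
print(parser.parse_args(["--no-verbose"]).verbose)  # False
```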

examples/vlm_ptq/README.md

Lines changed: 9 additions & 1 deletion
@@ -41,6 +41,7 @@ Please refer to the [llm_ptq/README.md](../llm_ptq/README.md#current-out-of-the-
 | Llava | llava ||||||
 | VILA | vila ||||||
 | Phi-3-vision | phi ||||||
+| Qwen2.5-VL | qwen ||||||
 
 > *<sup>1.</sup>The w4a8_awq is an experimental quantization scheme that may result in a higher accuracy penalty.* \
 > *<sup>2.</sup>A selective set of the popular models are internally tested. The actual model support list may be longer. NVFP4 inference requires Blackwell GPUs and TensorRT-LLM v0.17 or later.*
@@ -51,7 +52,7 @@ Please refer to the [llm_ptq/README.md](../llm_ptq/README.md#current-out-of-the-
 
 Please refer to the [llm_ptq/README.md](../llm_ptq/README.md) about the details of model quantization.
 
-The following scripts provide an all-in-one and step-by-step model quantization example for Llava, VILA and Phi-3-vision models. The quantization format and the number of GPUs will be supplied as inputs to these scripts. By default, we build the engine for the fp8 format and 1 GPU.
+The following scripts provide an all-in-one and step-by-step model quantization example for Llava, VILA, Phi-3-vision and Qwen2.5-VL models. The quantization format and the number of GPUs will be supplied as inputs to these scripts. By default, we build the engine for the fp8 format and 1 GPU.
 
 ### Hugging Face Example [Script](./scripts/huggingface_example.sh)
 
@@ -76,6 +77,13 @@ git clone https://huggingface.co/microsoft/Phi-3-vision-128k-instruct
 scripts/huggingface_example.sh --type phi --model Phi-3-vision-128k-instruct --quant [fp8|int8_sq|int4_awq|w4a8_awq]
 ```
 
+For [Qwen2.5-VL](https://huggingface.co/Qwen/Qwen2.5-VL-7B-Instruct):
+
+```bash
+git clone https://huggingface.co/Qwen/Qwen2.5-VL-7B-Instruct
+scripts/huggingface_example.sh --type qwen --model Qwen2.5-VL-7B-Instruct --export_fmt hf --quant [fp8|nvfp4|int8_sq|int4_awq|w4a8_awq]
+```
+
 The example scripts above also have an additional flag `--tasks gqa`, which will trigger evaluation of the built TensorRT engine using GQA benchmark. Details of the evaluation is explained in this [tutorial](../vlm_eval/README.md).
 
 If you encounter Out of Memory (OOM) issues during inference or evaluation, you can try lowering the `--kv_cache_free_gpu_memory_fraction` argument (default is 0.8) to reduce GPU memory usage for kv_cache:

examples/vlm_ptq/scripts/huggingface_example.sh

Lines changed: 13 additions & 5 deletions
@@ -30,10 +30,10 @@ for i in $(env | grep ^PMI_ | cut -d"=" -f 1); do unset -v $i; done
 for i in $(env | grep ^PMIX_ | cut -d"=" -f 1); do unset -v $i; done
 
 case $MODEL_TYPE in
-    llava|phi|vila|mllama)
+    llava|phi|vila|mllama|qwen)
         ;;
     *)
-        echo "Unsupported type argument: Expected one of: [llava, phi, vila, mllama]" >&2
+        echo "Unsupported type argument: Expected one of: [llava, phi, vila, mllama, qwen]" >&2
         exit 1
 esac
 
@@ -58,10 +58,10 @@ case $SPARSITY_FMT in
 esac
 
 case $QFORMAT in
-    fp8|int8_sq|int4_awq|w4a8_awq|fp16|bf16)
+    fp8|nvfp4|int8_sq|int4_awq|w4a8_awq|fp16|bf16)
         ;;
     *)
-        echo "Unknown quant argument: Expected one of: [fp8, int8_sq, int4_awq, w4a8_awq, fp16, bf16]" >&2
+        echo "Unknown quant argument: Expected one of: [fp8, nvfp4, int8_sq, int4_awq, w4a8_awq, fp16, bf16]" >&2
         exit 1
 esac
 
@@ -91,7 +91,7 @@ fi
 
 BUILD_MAX_OUTPUT_LEN=512
 
-if [ "$MODEL_TYPE" = "llava" ] || [ "$MODEL_TYPE" = "vila" ]; then
+if [ "$MODEL_TYPE" = "llava" ] || [ "$MODEL_TYPE" = "vila" ] || [ "$MODEL_TYPE" = "qwen" ]; then
     BUILD_MAX_BATCH_SIZE=20
 else
     BUILD_MAX_BATCH_SIZE=4
@@ -149,6 +149,9 @@ case "${MODEL_TYPE}" in
         PTQ_ARGS+=" --kv_cache_qformat none "
         VLM_ARGS=" --max_encoder_input_len=6404 --skip_run"
         ;;
+    "qwen")
+        PTQ_ARGS+=" --kv_cache_qformat none "
+        ;;
 esac
 
 if [ "${MODEL_TYPE}" = "vila" ]; then
@@ -177,6 +180,7 @@ if [[ $TASKS =~ "build" ]] || [[ ! -d "$ENGINE_DIR" ]] || [[ ! $(ls -A $ENGINE_D
     --inference_tensor_parallel=$TP \
     --inference_pipeline_parallel=$PP \
     --export_fmt=$EXPORT_FORMAT \
+    --no-verbose \
     $PTQ_ARGS
 else
     echo "Quantized model config $MODEL_CONFIG exists, skipping the quantization stage"
@@ -213,6 +217,10 @@ case "${MODEL_TYPE}" in
     "phi")
         VISUAL_MODEL_TYPE="phi-3-vision"
         ;;
+    "qwen")
+        # Map generic type to TRT-LLM multimodal model type
+        VISUAL_MODEL_TYPE="qwen2_vl"
+        ;;
 esac
 
 

modelopt/torch/export/model_config_export.py

Lines changed: 1 addition & 0 deletions
@@ -362,6 +362,7 @@ def torch_to_tensorrt_llm_checkpoint(
         "glm",
         "llama",
         "mllama",
+        "qwen",
     ], f"lm_head not available for decoder {decoder_type}"
     config.share_embedding_table = True
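For context on the one-line addition above: the assert guards the branch where the TensorRT-LLM checkpoint exporter falls back to sharing the embedding table when a decoder has no separate `lm_head`; adding `"qwen"` lets Qwen models take that branch. A rough, illustrative sketch of the pattern (the list below only contains the entries visible in this hunk, and the helper name is made up; the real logic lives in `torch_to_tensorrt_llm_checkpoint`):

```python
# Illustrative only: decoder types for which a missing lm_head is acceptable
# because the embedding table can be shared. Truncated to what the hunk shows.
TIED_EMBEDDING_DECODERS = ("glm", "llama", "mllama", "qwen")  # assumed subset

def resolve_share_embedding_table(decoder_type: str, has_lm_head: bool) -> bool:
    """Return True when the exporter should reuse the input embedding as lm_head."""
    if has_lm_head:
        return False
    assert decoder_type in TIED_EMBEDDING_DECODERS, (
        f"lm_head not available for decoder {decoder_type}"
    )
    return True

print(resolve_share_embedding_table("qwen", has_lm_head=False))  # True
```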

modelopt/torch/export/plugins/hf_spec_export.py

Lines changed: 6 additions & 1 deletion
@@ -82,7 +82,12 @@ def rename_and_prune_if_spec_decoding(model: nn.Module, post_state_dict: dict):
 
 def set_config_if_spec_decoding(model: nn.Module, config_data: dict):
     """Return the config of draft model in official format."""
-    if len(model._modelopt_state) != 1 or model._modelopt_state[0][0] != "eagle":
+    opt_modes = getattr(model, "_modelopt_state", None)
+    if (
+        not isinstance(opt_modes, (list, tuple))
+        or len(opt_modes) != 1
+        or opt_modes[0][0] != "eagle"
+    ):
         # return as is
         return config_data
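The effect of the rewritten guard above: a model that never went through ModelOpt conversion has no `_modelopt_state` attribute, so the old direct attribute access raised `AttributeError`, while the new check simply leaves `config_data` untouched. A small standalone sketch of the same logic (`PlainModel`, `EagleModel`, and `is_single_eagle_mode` are illustrative names, not part of the repo):

```python
# Sketch of the defensive check: tolerate models without _modelopt_state and
# only treat a single ("eagle", ...) entry as a speculative-decoding export.
class PlainModel:
    """Stand-in for a HF model that was never converted by ModelOpt."""

class EagleModel:
    """Stand-in for a model converted with a single 'eagle' mode."""
    _modelopt_state = [("eagle", {"config": "..."})]

def is_single_eagle_mode(model) -> bool:
    opt_modes = getattr(model, "_modelopt_state", None)
    return (
        isinstance(opt_modes, (list, tuple))
        and len(opt_modes) == 1
        and opt_modes[0][0] == "eagle"
    )

print(is_single_eagle_mode(PlainModel()))  # False -> config_data returned as is
print(is_single_eagle_mode(EagleModel()))  # True  -> draft-model config is produced
```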

0 commit comments
