Merged
1 change: 1 addition & 0 deletions examples/llm_ptq/hf_ptq.py
@@ -316,6 +316,7 @@ def main(args):
mtq.quantize(child, disabled_quant_cfg, forward_loop=None)

model = model.language_model
model_type = get_model_type(model)
Collaborator
do we still need this model_type?

@yueshen2016 (Contributor Author), Sep 23, 2025:

Only for the int8_sq format, since the model is later exported to a TensorRT-LLM checkpoint. Without this line, the model_type would be unknown, as this nvbug shows.
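A minimal standalone sketch of the behavior, assuming get_model_type keys off the decoder class name (the class names and lookup table below are hypothetical stand-ins, not ModelOpt internals):

# Hypothetical stand-ins illustrating why the refresh matters; the real
# get_model_type lives in ModelOpt and matches on actual HF model classes.
KNOWN_DECODERS = {"LlamaForCausalLM": "llama", "Qwen2ForCausalLM": "qwen"}

def get_model_type_sketch(model) -> str:
    """Return a decoder type name, or 'unknown' for unrecognized containers."""
    return KNOWN_DECODERS.get(type(model).__name__, "unknown")

class LlamaForCausalLM:                 # stands in for the language tower
    pass

class LlavaForConditionalGeneration:    # stands in for the VLM container
    def __init__(self):
        self.language_model = LlamaForCausalLM()

vlm = LlavaForConditionalGeneration()
print(get_model_type_sketch(vlm))                 # "unknown" -> breaks the int8_sq export
print(get_model_type_sketch(vlm.language_model))  # "llama"   -> what the added line provides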


if args.sparsity_fmt != "dense":
Comment on lines 318 to 321

⚠️ Potential issue

Recompute the quantized state (and device) after swapping to the submodel.

Great call to refresh model_type on the language submodel. However, gating later on model_is_already_quantized (computed before the swap) can now be wrong for VLMs whose container is quantized but whose language_model is not. Also refresh device in case the submodule lives on a different device.
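To make the concern concrete, a standalone sketch (is_quantized_sketch and the toy classes are hypothetical stand-ins for modelopt's is_quantized and the real VLM classes):

# Hypothetical stand-ins; modelopt's real is_quantized inspects quantizer modules.
def is_quantized_sketch(module) -> bool:
    return getattr(module, "_quantized", False)

class LanguageModel:
    _quantized = False                  # language tower not yet quantized

class VLMContainer:
    _quantized = True                   # e.g. vision tower already carries quantizers
    def __init__(self):
        self.language_model = LanguageModel()

model = VLMContainer()
model_is_already_quantized = is_quantized_sketch(model)   # True, computed on the container

model = model.language_model                               # swap to the submodel
print(model_is_already_quantized)          # still True -> stale gate for later logic
print(is_quantized_sketch(model))          # False      -> value after the suggested refresh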

Apply this diff:

             model = model.language_model
-            model_type = get_model_type(model)
+            model_type = get_model_type(model)
+            # Keep subsequent logic consistent with the sub‑model we actually operate on.
+            model_is_already_quantized = is_quantized(model)
+            device = getattr(model, "device", device)
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
Original:
model = model.language_model
model_type = get_model_type(model)
if args.sparsity_fmt != "dense":
Suggested:
model = model.language_model
model_type = get_model_type(model)
# Keep subsequent logic consistent with the sub‑model we actually operate on.
model_is_already_quantized = is_quantized(model)
device = getattr(model, "device", device)
if args.sparsity_fmt != "dense":
🤖 Prompt for AI Agents
In examples/llm_ptq/hf_ptq.py around lines 318 to 321, after swapping to the
language submodel (model = model.language_model) and refreshing model_type, also
recompute model_is_already_quantized and device there so they reflect the
submodel state (the container may be quantized while language_model is not, or
the submodule may be on a different device). Update the code to move or
duplicate the logic that sets model_is_already_quantized and device to
immediately follow the swap and model_type refresh, ensuring subsequent gates
(e.g., the args.sparsity_fmt != "dense" branch) use the corrected values.

if args.batch_size == 0:
2 changes: 1 addition & 1 deletion examples/vlm_ptq/scripts/huggingface_example.sh
@@ -73,7 +73,7 @@ if [ -n "$KV_CACHE_QUANT" ]; then
PTQ_ARGS+=" --kv_cache_qformat=$KV_CACHE_QUANT "
fi

if [ "${MODEL_TYPE}" = "vila" ]; then
if [[ "${MODEL_NAME,,}" == *"vila"* ]]; then
# Install required dependency for VILA
pip install -r ../vlm_ptq/requirements-vila.txt
# Clone original VILA repo