Reinstate int8_sq support for vlm_example. #333
Conversation
Signed-off-by: Chenjie Luo <[email protected]>
Walkthrough
Removed a deprecation note for int8_sq in CHANGELOG. Updated the VLM PTQ Hugging Face example script to accept int8_sq, improved the unknown-arg error message, and added inference tensor/pipeline parallel flags when invoking the quantization script.
Sequence Diagram(s)
sequenceDiagram
autonumber
actor U as User
participant SH as huggingface_example.sh
participant PTQ as hf_ptq.py
participant BE as Inference Backend
U->>SH: Run with QFORMAT and TP/PP
alt QFORMAT recognized (e.g., int8, int8_sq, w4a16, ...)
note right of SH: Includes new int8_sq option
SH->>PTQ: Invoke with --qformat QFORMAT<br/>--inference_tensor_parallel=$TP<br/>--inference_pipeline_parallel=$PP
PTQ->>BE: Configure model + quantization
BE-->>PTQ: Ready
PTQ-->>SH: Quantization artifacts
SH-->>U: Success
else Unknown QFORMAT
SH-->>U: Error listing accepted values (incl. int8_sq)
end
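In shell terms, the call the diagram describes amounts to roughly the sketch below. This is illustrative only: hf_ptq.py also takes model and output arguments that are elided here, and their names are not taken from this PR.

python hf_ptq.py \
    --qformat "$QFORMAT" \
    --inference_tensor_parallel="$TP" \
    --inference_pipeline_parallel="$PP"
# ...plus the model and output-path arguments hf_ptq.py requires (omitted here; assumed)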
Estimated code review effort: 🎯 2 (Simple) | ⏱️ ~10–15 minutes
Pre-merge checks and finishing touches: ✅ Passed checks (3 passed)
Actionable comments posted: 1
🧹 Nitpick comments (2)
examples/vlm_ptq/scripts/huggingface_example.sh (2)
92-100: Guard TP/PP flags to avoid passing empty values. If TP/PP are unset, argparse may treat the empty "=" value as invalid. Default to 1, or append the flags only when they are set.
Apply one of these:
Option A (defaults):
- --inference_tensor_parallel=$TP \
- --inference_pipeline_parallel=$PP \
+ --inference_tensor_parallel=${TP:-1} \
+ --inference_pipeline_parallel=${PP:-1} \

Option B (conditional append earlier, keeping this block unchanged):
 PTQ_ARGS=""
+if [ -n "$TP" ]; then PTQ_ARGS+=" --inference_tensor_parallel $TP "; fi
+if [ -n "$PP" ]; then PTQ_ARGS+=" --inference_pipeline_parallel $PP "; fi
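For reference, a standalone sketch of the ${VAR:-default} expansion used in Option A above (not part of the script; it just demonstrates the shell behavior):

# When TP is unset or empty, ${TP:-1} falls back to 1; otherwise it uses TP's value.
unset TP
echo "--inference_tensor_parallel=${TP:-1}"   # prints --inference_tensor_parallel=1
TP=4
echo "--inference_tensor_parallel=${TP:-1}"   # prints --inference_tensor_parallel=4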
106-118: nvfp4 GPU check is unreachable due to the earlier non-fp8 early-exit. The block at lines 111-118 never executes when QFORMAT=nvfp4. Fold it into the non-fp8 branch so users still get the Blackwell warning.
Apply:
-if [[ "$QFORMAT" != "fp8" ]]; then - echo "For quant format $QFORMAT, please refer to the TensorRT-LLM documentation for deployment. Checkpoint saved to $SAVE_PATH." - exit 0 -fi - -if [[ "$QFORMAT" == *"nvfp4"* ]] || [[ "$KV_CACHE_QUANT" == *"nvfp4"* ]]; then - cuda_major=$(nvidia-smi --query-gpu=compute_cap --format=csv,noheader -i 0 | cut -d. -f1) - - if [ "$cuda_major" -lt 10 ]; then - echo "Please deploy the NVFP4 checkpoint on a Blackwell GPU. Checkpoint export_path: $SAVE_PATH" - exit 0 - fi -fi +if [[ "$QFORMAT" != "fp8" ]]; then + if [[ "$QFORMAT" == *"nvfp4"* ]] || [[ "$KV_CACHE_QUANT" == *"nvfp4"* ]]; then + cuda_major=$(nvidia-smi --query-gpu=compute_cap --format=csv,noheader -i 0 | cut -d. -f1) + if [ "$cuda_major" -lt 10 ]; then + echo "Please deploy the NVFP4 checkpoint on a Blackwell GPU. Checkpoint export_path: $SAVE_PATH" + fi + fi + echo "For quant format $QFORMAT, please refer to the TensorRT-LLM documentation for deployment. Checkpoint saved to $SAVE_PATH." + exit 0 +fi
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
- CHANGELOG.rst (0 hunks)
- examples/vlm_ptq/scripts/huggingface_example.sh (2 hunks)
💤 Files with no reviewable changes (1)
- CHANGELOG.rst
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
- GitHub Check: linux
- GitHub Check: code-quality
- GitHub Check: build-docs
 case $QFORMAT in
-    fp8|int4_awq|w4a8_awq|nvfp4)
+    fp8|int8_sq|int4_awq|w4a8_awq|nvfp4)
         ;;
     *)
-        echo "Unknown quant argument: Expected one of: [fp8, int4_awq, w4a8_awq, nvfp4]" >&2
+        echo "Unknown quant argument: Expected one of: [fp8, int8_sq, int4_awq, w4a8_awq, nvfp4]" >&2
         exit 1
🧩 Analysis chain
Re‑enablement of int8_sq looks good.
Case arm and error message now include int8_sq; matches the PR goal.
Run to confirm downstream accepts this qformat:
🏁 Script executed:
#!/bin/bash
# Verify hf_ptq.py accepts int8_sq for --qformat and exposes the new inference flags
fd -t f hf_ptq.py
rg -nP -C2 -- 'add_argument\((["'\''])--qformat\1.*choices=.*int8_sq' $(fd -t f hf_ptq.py)
rg -nP -C2 -- 'add_argument\((["'\''])--inference_tensor_parallel\1' $(fd -t f hf_ptq.py)
rg -nP -C2 -- 'add_argument\((["'\''])--inference_pipeline_parallel\1' $(fd -t f hf_ptq.py)
# Check parser provides TP/PP defaults
fd -t f parser.sh
rg -n -C2 '\b(TP|PP)=' $(fd -t f parser.sh)
Length of output: 1012
Add int8_sq to the --qformat choices in examples/llm_ptq/hf_ptq.py: the argparse definition for --qformat does not include int8_sq (no match found). The --inference_tensor_parallel and --inference_pipeline_parallel flags are present, and examples/llm_ptq/scripts/parser.sh sets TP=1 and PP=1.
🤖 Prompt for AI Agents
In examples/vlm_ptq/scripts/huggingface_example.sh around lines 37-42, the script accepts "int8_sq" in the case switch, but examples/llm_ptq/hf_ptq.py's argparse for --qformat does not include "int8_sq", causing a mismatch. Update the argparse definition in examples/llm_ptq/hf_ptq.py to include "int8_sq" among the allowed choices for --qformat (keeping the other existing options intact), and ensure the help text reflects this addition so the CLI and the example script stay consistent; TP/PP handling already comes from parser.sh.
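Once the suggested change is applied, a quick check along the lines of the analysis-chain script above could confirm it. This is a sketch only: it assumes hf_ptq.py parses --qformat with argparse, that its import-time dependencies are installed, and that the commands run from the repository root.

# The new choice should show up in the CLI help once added to the argparse choices.
python examples/llm_ptq/hf_ptq.py --help 2>&1 | grep "int8_sq"
# A plain text search of the file should also find the new choice.
rg -n "int8_sq" examples/llm_ptq/hf_ptq.py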
Codecov Report
✅ All modified and coverable lines are covered by tests.
Additional details and impacted files
@@           Coverage Diff           @@
## main #333 +/- ##
==========================================
- Coverage 73.82% 73.82% -0.01%
==========================================
Files 172 172
Lines 17438 17438
==========================================
- Hits 12874 12873 -1
- Misses 4564 4565 +1
☔ View full report in Codecov by Sentry.
Signed-off-by: Chenjie Luo <[email protected]>
Signed-off-by: Chenjie Luo <[email protected]>
Signed-off-by: Ye Yu <[email protected]>
What does this PR do?
Revert removal of int8_sq support for vlm_ptq example.
Overview:
Note that int8_sq for vlm_ptq will still generate a TensorRT-LLM checkpoint, and for the language model only. Users need to use TensorRT-LLM's TRT backend together with a separately built image network (also via TRT) for inference; the torch runtime is not available.
Summary by CodeRabbit
New Features
Documentation