Qwen quantize and hf export support in examples #311
Conversation
Walkthrough

Adds qwen model support in VLM PTQ scripts, includes qwen in the shared-embedding export whitelist, hardens a guard in HF spec export, and corrects a CLI help string; also changes PTQ invocations to use --no-verbose for quieter quantization.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    autonumber
    actor U as User
    participant S as huggingface_example.sh
    participant C as model_config
    participant Q as hf_ptq.py
    U->>S: Run script with MODEL_TYPE=qwen
    S->>C: Set VISUAL_MODEL_TYPE = qwen2_vl
    S->>S: BUILD_MAX_BATCH_SIZE = 20
    S->>S: Append PTQ_ARGS: --kv_cache_qformat none
    S->>Q: Invoke quantization with --no-verbose
    Q-->>S: Produce quantized HF artifacts
```
```mermaid
sequenceDiagram
    autonumber
    participant E as set_config_if_spec_decoding
    participant M as model
    participant C as config_data
    E->>M: Read _modelopt_state
    alt _modelopt_state is a list/tuple of length 1 and first element starts with "eagle"
        E->>E: Apply spec-decoding config changes
        E-->>C: Return mutated config_data
    else
        E-->>C: Return config_data unchanged
    end
```
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes

Pre-merge checks: ✅ Passed checks (3 passed)
Actionable comments posted: 1
🧹 Nitpick comments (2)
examples/vlm_ptq/scripts/huggingface_example.sh (2)
94-99: Default max batch size 20 for qwen: verify memory headroom

Qwen2.5-VL-7B can be memory heavy; batch 20 may OOM on smaller GPUs during build. Consider making this model-aware or gating by available VRAM.
Run a quick dry-run on your target GPU(s) to confirm batch 20 is safe; otherwise reduce or make it configurable via an env var (e.g., BUILD_MAX_BATCH_SIZE).
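A minimal sketch of the env-var approach, assuming the batch size is currently hard-coded in the script's qwen branch (the variable name and placement are illustrative):

```bash
# Keep 20 as the default, but let callers override it from the environment.
BUILD_MAX_BATCH_SIZE="${BUILD_MAX_BATCH_SIZE:-20}"
```

Callers on smaller GPUs could then run, e.g., `BUILD_MAX_BATCH_SIZE=8 scripts/huggingface_example.sh ...` without editing the script.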
152-155: Don't force override EXPORT_FORMAT for qwen

Let users opt into TRT-LLM builds explicitly; default to HF only if unset. This preserves flexibility and reduces surprise.
Apply:
"qwen") PTQ_ARGS+=" --kv_cache_qformat none " - EXPORT_FORMAT="hf" + : "${EXPORT_FORMAT:=hf}" ;;
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (4)
- examples/llm_ptq/hf_ptq.py (1 hunks)
- examples/vlm_ptq/scripts/huggingface_example.sh (5 hunks)
- modelopt/torch/export/model_config_export.py (1 hunks)
- modelopt/torch/export/plugins/hf_spec_export.py (1 hunks)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
- GitHub Check: build-docs
- GitHub Check: code-quality
🔇 Additional comments (5)
examples/llm_ptq/hf_ptq.py (1)
744-748: Help text nit: flag spelling fix LGTM

Updated hint matches argparse.BooleanOptionalAction's --no-verbose. No functional changes.

modelopt/torch/export/model_config_export.py (1)
355-368: Enable shared-embedding path for qwen when lm_head is absent — OK

Adding "qwen" to the whitelist aligns with weight tying in many Qwen decoders and unblocks TRT-LLM export when lm_head isn't present on rank.
If you have a small Qwen decoder handy, please sanity-check that (a quick command for the second point follows this list):
- config.lm_head is None on the inspected rank, and
- export sets config.share_embedding_table = True.
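One quick way to eyeball the second point after an export run; the output directory and file name here are placeholders, so point the command at your actual export output:

```bash
# Check whether the exported config records tied embeddings.
# "exported_model" is a hypothetical output directory, not a path from this PR.
grep -n "share_embedding_table" exported_model/config.json
```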
modelopt/torch/export/plugins/hf_spec_export.py (1)
85-93: Robust guard for _modelopt_state access — good hardening

Consistent with rename_and_prune_if_spec_decoding; avoids index errors when speculative mode isn't active.
examples/vlm_ptq/scripts/huggingface_example.sh (2)
33-37: Add qwen to allowed MODEL_TYPE — OK

Validation message updated accordingly.
184-185: Suppress verbose quant logs — OK

Matches hf_ptq.py's BooleanOptionalAction flag.
Force-pushed from bc62a24 to 312280e
The changes in hf_spec_export.py look good to me. Thanks for the fix!
Actionable comments posted: 0
🧹 Nitpick comments (1)
modelopt/torch/export/model_config_export.py (1)
356-367: De-duplicate whitelist into a constant.

Minor cleanup: move the decoder whitelist to a module-level constant to keep it single-sourced and easier to extend.
Apply within this block:
```diff
-    assert decoder_type in [
-        "mpt",
-        "gpt2",
-        "gemma",
-        "gemma2",
-        "gemma3",
-        "glm",
-        "llama",
-        "mllama",
-        "qwen",
-    ], f"lm_head not available for decoder {decoder_type}"
+    assert decoder_type in SHARED_EMBEDDING_DECODERS, (
+        f"lm_head not available for decoder {decoder_type}"
+    )
```

And add near the imports (top of file):
```python
# Decoders with tied vocab_embedding and lm_head; lm_head may be absent during traversal.
SHARED_EMBEDDING_DECODERS: tuple[str, ...] = (
    "mpt",
    "gpt2",
    "gemma",
    "gemma2",
    "gemma3",
    "glm",
    "llama",
    "mllama",
    "qwen",
)
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (4)
- examples/llm_ptq/hf_ptq.py (1 hunks)
- examples/vlm_ptq/scripts/huggingface_example.sh (6 hunks)
- modelopt/torch/export/model_config_export.py (1 hunks)
- modelopt/torch/export/plugins/hf_spec_export.py (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (3)
- examples/llm_ptq/hf_ptq.py
- examples/vlm_ptq/scripts/huggingface_example.sh
- modelopt/torch/export/plugins/hf_spec_export.py
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
- GitHub Check: code-quality
- GitHub Check: build-docs
🔇 Additional comments (1)
modelopt/torch/export/model_config_export.py (1)
356-367: ERROR: Unable to execute the requested run_scripts. Please ensure the script and tag formatting is correct.
Signed-off-by: Riyad Islam <[email protected]>
Force-pushed from 312280e to 0bd0218
Actionable comments posted: 1
♻️ Duplicate comments (1)
examples/vlm_ptq/scripts/huggingface_example.sh (1)
220-223: qwen2_vl mapping: ensure downstream support in visual engine/run

Mapping qwen → qwen2_vl is correct, but confirm vlm_visual_engine.py and vlm_run.py have a branch for qwen2_vl; otherwise the visual build/run will fail.

Quick check:
```bash
#!/bin/bash
rg -nC3 '\bqwen2_vl\b' examples/vlm_ptq/vlm_visual_engine.py examples/vlm_ptq/vlm_run.py || true
```
🧹 Nitpick comments (5)
modelopt/torch/export/model_config_export.py (1)
347-368: Qwen share-embedding assertion: prefer capability check over hardcoded allowlist

Adding "qwen" to the allowlist works, but this path fires when lm_head is absent and we unconditionally force shared embeddings. For better safety across Qwen variants (incl. VL), check tie_word_embeddings from the HF config (if available) instead of relying solely on a model-name allowlist.
Apply this targeted refactor to use HF config when present:
```diff
-    elif training_pipeline_parallel == 1:
-        # Models that share weights for lm_head and vocab_embedding
-        assert decoder_type in [
-            "mpt",
-            "gpt2",
-            "gemma",
-            "gemma2",
-            "gemma3",
-            "glm",
-            "llama",
-            "mllama",
-            "qwen",
-        ], f"lm_head not available for decoder {decoder_type}"
-        config.share_embedding_table = True
+    elif training_pipeline_parallel == 1:
+        # Models that share weights for lm_head and vocab_embedding
+        tied = bool(getattr(hf_config, "tie_word_embeddings", False))
+        known_tied = [
+            "mpt",
+            "gpt2",
+            "gemma",
+            "gemma2",
+            "gemma3",
+            "glm",
+            "llama",
+            "mllama",
+            "qwen",
+        ]
+        if tied or decoder_type in known_tied:
+            config.share_embedding_table = True
+        else:
+            raise AssertionError(
+                f"lm_head not available for decoder {decoder_type} and embeddings are not tied"
+            )
```

Verification ask:
- Confirm Qwen2.5-VL’s HF config sets tie_word_embeddings=True; otherwise, this new branch will assert and help catch mismatches early.
examples/vlm_ptq/README.md (1)
55-56: Tighten wording: mention Qwen explicitly and keep list order consistent

Minor clarity nit: the sentence listing models now includes Qwen; looks good. Consider ordering models in the list the same way as in the support matrix for easy scanning.
examples/vlm_ptq/scripts/huggingface_example.sh (3)
60-66: NVFP4 option needs environment guardrails

Since NVFP4 requires specific HW/SW (e.g., Blackwell + TRT-LLM >= 0.17), add a quick preflight check to fail fast with a clear message when prerequisites aren't met.
Example guard (non-blocking; adjust as needed):
```diff
 case $QFORMAT in
     fp8|nvfp4|int8_sq|int4_awq|w4a8_awq|fp16|bf16)
         ;;
     *)
         echo "Unknown quant argument: Expected one of: [fp8, nvfp4, int8_sq, int4_awq, w4a8_awq, fp16, bf16]" >&2
         exit 1
 esac
+
+# Optional: basic NVFP4 preflight
+if [ "$QFORMAT" = "nvfp4" ]; then
+    echo "[Info] NVFP4 requires Blackwell GPUs and TensorRT-LLM >= 0.17."
+    # TODO: add environment checks here if desired (driver, GPU arch, TRT-LLM version).
+fi
```
94-98: Batch size 20 for qwen — verify memory headroom

Bumping the default build batch size to 20 can OOM on modest GPUs, especially with large vision encoders. Consider 8–16 by default or document the expected minimum GPU memory.
17-25: Add pipefail to catch subshell/pipe errors early

Minor robustness improvement.
```diff
-set -e
+set -e
+set -o pipefail
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (5)
- examples/llm_ptq/hf_ptq.py (1 hunks)
- examples/vlm_ptq/README.md (3 hunks)
- examples/vlm_ptq/scripts/huggingface_example.sh (6 hunks)
- modelopt/torch/export/model_config_export.py (1 hunks)
- modelopt/torch/export/plugins/hf_spec_export.py (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
- examples/llm_ptq/hf_ptq.py
- modelopt/torch/export/plugins/hf_spec_export.py
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
- GitHub Check: linux
- GitHub Check: build-docs
- GitHub Check: code-quality
🔇 Additional comments (3)
examples/vlm_ptq/scripts/huggingface_example.sh (3)
32-37: Add qwen to supported MODEL_TYPE — LGTM

The new type gate is correct and the error message is updated accordingly.
152-155: KV cache kept high precision for qwen — LGTM

For Qwen, forcing --kv_cache_qformat none is consistent with the PR objective to keep KV in high precision.
183-184: Remove outdated check: --no-verbose is already supported

The parser.add_argument("--verbose", …, action=argparse.BooleanOptionalAction) in examples/llm_ptq/hf_ptq.py automatically creates both --verbose and --no-verbose flags, so no changes are required.
Codecov Report

✅ All modified and coverable lines are covered by tests.

Additional details and impacted files

```
@@            Coverage Diff             @@
##             main     #311      +/-   ##
==========================================
- Coverage   73.88%   73.87%   -0.01%
==========================================
  Files         172      172
  Lines       17439    17439
==========================================
- Hits        12884    12883       -1
- Misses       4555     4556       +1
```

☔ View full report in Codecov by Sentry.
/ok to test 043d2ce
Signed-off-by: Riyad Islam <[email protected]>
Signed-off-by: Jingyu Xin <[email protected]>
Signed-off-by: Riyad Islam <[email protected]>
What does this PR do?
Type of change: new example
Overview: Qwen2.5-VL-7B-Instruct needs to be quantized with the KV cache kept in high precision. This MR adds support for Qwen in the examples, along with the necessary fixes.
Usage
# Add a code snippet demonstrating how to use this
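The snippet above is still a placeholder; as a stand-in, here is a minimal sketch of how the new path might be exercised, assuming the script follows the --model/--type/--quant convention of the existing PTQ examples (treat the exact flag names as assumptions and check the script's argument parsing before use):

```bash
# Hypothetical invocation; flag names and values are illustrative, not verified.
cd examples/vlm_ptq
scripts/huggingface_example.sh \
    --model Qwen/Qwen2.5-VL-7B-Instruct \
    --type qwen \
    --quant fp8
```

Per the PR objective, the qwen branch of the script appends --kv_cache_qformat none so the KV cache stays in high precision.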
Testing
Before your PR is "Ready for review"
Additional Information
Summary by CodeRabbit
New Features
Bug Fixes
Documentation
Style
Chores