
Conversation

@i-riyad (Contributor) commented Sep 9, 2025

What does this PR do?

Type of change: new example

Overview: Qwen2.5-VL-7B-Instruct needs to be quantized with the KV cache kept in high precision. This PR adds Qwen support to the example, along with the necessary fixes.

Usage

scripts/huggingface_example.sh --type qwen --model Qwen2.5-VL-7B-Instruct --quant fp8 --export_fmt hf

Testing

Before your PR is "Ready for review"

  • Make sure you read and follow Contributor guidelines and your commits are signed.
  • Is this change backward compatible?: Yes
  • Did you write any new necessary tests?: No
  • Did you add or update any necessary documentation?: Yes
  • Did you update Changelog?: No

Additional Information

Summary by CodeRabbit

  • New Features

    • Added support for the Qwen model type across the VLM quantization flow, including mapping, config adjustments, and increased max batch size for Qwen.
  • Bug Fixes

    • Made detection of speculative-decoding state more robust to avoid errors when state is missing or unexpected.
  • Documentation

    • Added Qwen2.5-VL details and usage examples to PTQ documentation and examples.
  • Style

    • Clarified CLI help text for the verbose option.
  • Chores

    • Added a --no-verbose option to reduce quantization log output.

@i-riyad i-riyad requested review from a team as code owners September 9, 2025 20:33
@i-riyad i-riyad requested a review from cjluo-nv September 9, 2025 20:33

copy-pr-bot bot commented Sep 9, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.


coderabbitai bot commented Sep 9, 2025

Walkthrough

Adds qwen model support in VLM PTQ scripts, includes qwen in shared-embedding export whitelist, hardens a guard in HF spec export, and corrects a CLI help string; also changes PTQ invocations to use --no-verbose for quieter quantization.

Changes

  • CLI/help text tweak (examples/llm_ptq/hf_ptq.py): Updated the --verbose help text to reference --no-verbose; no logic changes.
  • VLM PTQ, qwen support & PTQ args (examples/vlm_ptq/scripts/huggingface_example.sh): Added qwen to the allowed MODEL_TYPE values, added a qwen branch that appends --kv_cache_qformat none to PTQ_ARGS, increased BUILD_MAX_BATCH_SIZE to 20 for qwen, set VISUAL_MODEL_TYPE to qwen2_vl, and invoked quantization with --no-verbose.
  • Export config, qwen embedding share (modelopt/torch/export/model_config_export.py): Added "qwen" to the decoder types that share embedding tables and lack lm_head when training_pipeline_parallel == 1.
  • HF spec export guard hardening (modelopt/torch/export/plugins/hf_spec_export.py): set_config_if_spec_decoding now safely reads _modelopt_state with type/length/content checks (it must be a single entry whose first element is "eagle"); otherwise it returns config_data early.
  • Docs, VLM PTQ README (examples/vlm_ptq/README.md): Added Qwen2.5-VL to the supported-models matrix, updated example usage, and added a dedicated Qwen2.5-VL section with HF clone and script invocation.

Sequence Diagram(s)

sequenceDiagram
  autonumber
  actor U as User
  participant S as huggingface_example.sh
  participant C as model_config
  participant Q as hf_ptq.py

  U->>S: Run script with MODEL_TYPE=qwen
  S->>C: Set VISUAL_MODEL_TYPE = qwen2_vl
  S->>S: BUILD_MAX_BATCH_SIZE = 20
  S->>S: Append PTQ_ARGS: --kv_cache_qformat none
  S->>Q: Invoke quantization with --no-verbose
  Q-->>S: Produce quantized HF artifacts
sequenceDiagram
  autonumber
  participant E as set_config_if_spec_decoding
  participant M as model
  participant C as config_data

  E->>M: Read _modelopt_state
  alt _modelopt_state is list/tuple of length 1 and first element starts with "eagle"
    E->>E: Apply spec-decoding config changes
    E-->>C: Return mutated config_data
  else
    E-->>C: Return config_data unchanged
  end

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Pre-merge checks (3 passed)

✅ Passed checks (3 passed)
  • Title Check: ✅ Passed. The title clearly summarizes the primary change of adding Qwen quantization and Hugging Face export support within the example scripts, is concise and specific without extraneous details, and aligns with the PR's main objective of extending the examples for Qwen models.
  • Description Check: ✅ Passed. The description outlines the purpose of the PR, specifies the type of change as a new example, provides usage instructions, and directly relates to the addition of Qwen support in the examples, making it relevant and on-topic.
  • Docstring Coverage: ✅ Passed. Docstring coverage is 100.00%, which is sufficient; the required threshold is 80.00%.

Poem

I twitch my ears at "qwen" tonight,
New branches sprout in script and byte.
Quiet flags hush the PTQ light,
Guards peek first before they bite.
Hop—embeddings shared in flight. 🐇✨


coderabbitai bot left a comment

Actionable comments posted: 1

🧹 Nitpick comments (2)
examples/vlm_ptq/scripts/huggingface_example.sh (2)

94-99: Default max batch size 20 for qwen: verify memory headroom

Qwen2.5-VL-7B can be memory heavy; batch 20 may OOM on smaller GPUs during build. Consider making this model-aware or gating by available VRAM.

Run a quick dry-run on your target GPU(s) to confirm batch 20 is safe; otherwise reduce or make it configurable via an env var (e.g., BUILD_MAX_BATCH_SIZE).
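One way to do that, as a small sketch (the default of 20 and the variable name follow the script; the environment override itself is the suggestion here):

# Let callers override the build batch size from the environment; default to 20 for qwen.
BUILD_MAX_BATCH_SIZE="${BUILD_MAX_BATCH_SIZE:-20}"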


152-155: Don’t force override EXPORT_FORMAT for qwen

Let users opt into TRT-LLM builds explicitly; default to HF only if unset. This preserves flexibility and reduces surprise.

Apply:

     "qwen")
         PTQ_ARGS+=" --kv_cache_qformat none "
-        EXPORT_FORMAT="hf"
+        : "${EXPORT_FORMAT:=hf}"
         ;;
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between d6d2e75 and bc62a24.

📒 Files selected for processing (4)
  • examples/llm_ptq/hf_ptq.py (1 hunks)
  • examples/vlm_ptq/scripts/huggingface_example.sh (5 hunks)
  • modelopt/torch/export/model_config_export.py (1 hunks)
  • modelopt/torch/export/plugins/hf_spec_export.py (1 hunks)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
  • GitHub Check: build-docs
  • GitHub Check: code-quality
🔇 Additional comments (5)
examples/llm_ptq/hf_ptq.py (1)

744-748: Help text nit: flag spelling fix LGTM

Updated hint matches argparse.BooleanOptionalAction’s --no-verbose. No functional changes.

modelopt/torch/export/model_config_export.py (1)

355-368: Enable shared-embedding path for qwen when lm_head is absent — OK

Adding "qwen" to the whitelist aligns with weight tying in many Qwen decoders and unblocks TRT-LLM export when lm_head isn’t present on rank.

If you have a small Qwen decoder handy, please sanity-check that:

  • config.lm_head is None on the inspected rank, and
  • export sets config.share_embedding_table = True.
modelopt/torch/export/plugins/hf_spec_export.py (1)

85-93: Robust guard for _modelopt_state access — good hardening

Consistent with rename_and_prune_if_spec_decoding; avoids index errors when speculative mode isn’t active.
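As a rough illustration of the guard pattern being praised here (hypothetical, simplified code; the real function lives in hf_spec_export.py and carries more context, and each state entry is assumed to be a (mode_name, state) pair):

# Simplified sketch of the hardened guard (not the actual ModelOpt code).
def set_config_if_spec_decoding(model, config_data):
    opt_state = getattr(model, "_modelopt_state", None)
    # Only proceed when the state is a single entry whose mode name starts with
    # "eagle"; anything else (missing, wrong type, wrong length) leaves the
    # config untouched.
    if (
        not isinstance(opt_state, (list, tuple))
        or len(opt_state) != 1
        or not str(opt_state[0][0]).startswith("eagle")
    ):
        return config_data
    # ... apply the EAGLE speculative-decoding config changes here ...
    return config_data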

examples/vlm_ptq/scripts/huggingface_example.sh (2)

33-37: Add qwen to allowed MODEL_TYPE — OK

Validation message updated accordingly.


184-185: Suppress verbose quant logs — OK

Matches hf_ptq.py’s BooleanOptionalAction flag.

h-guo18 (Contributor) left a comment

hf_spec_export.py looks good to me. Thanks for the fix!

coderabbitai bot left a comment

Actionable comments posted: 0

🧹 Nitpick comments (1)
modelopt/torch/export/model_config_export.py (1)

356-367: De-duplicate whitelist into a constant.

Minor cleanup: move the decoder whitelist to a module-level constant to keep it single-sourced and easier to extend.

Apply within this block:

-                assert decoder_type in [
-                    "mpt",
-                    "gpt2",
-                    "gemma",
-                    "gemma2",
-                    "gemma3",
-                    "glm",
-                    "llama",
-                    "mllama",
-                    "qwen",
-                ], f"lm_head not available for decoder {decoder_type}"
+                assert decoder_type in SHARED_EMBEDDING_DECODERS, (
+                    f"lm_head not available for decoder {decoder_type}"
+                )

And add near the imports (top of file):

# Decoders with tied vocab_embedding and lm_head; lm_head may be absent during traversal.
SHARED_EMBEDDING_DECODERS: tuple[str, ...] = (
    "mpt",
    "gpt2",
    "gemma",
    "gemma2",
    "gemma3",
    "glm",
    "llama",
    "mllama",
    "qwen",
)
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between bc62a24 and 312280e.

📒 Files selected for processing (4)
  • examples/llm_ptq/hf_ptq.py (1 hunks)
  • examples/vlm_ptq/scripts/huggingface_example.sh (6 hunks)
  • modelopt/torch/export/model_config_export.py (1 hunks)
  • modelopt/torch/export/plugins/hf_spec_export.py (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (3)
  • examples/llm_ptq/hf_ptq.py
  • examples/vlm_ptq/scripts/huggingface_example.sh
  • modelopt/torch/export/plugins/hf_spec_export.py
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
  • GitHub Check: code-quality
  • GitHub Check: build-docs
🔇 Additional comments (1)
modelopt/torch/export/model_config_export.py (1)

356-367: ERROR: Unable to execute the requested run_scripts. Please ensure the script and tag formatting is correct.

coderabbitai bot left a comment

Actionable comments posted: 1

♻️ Duplicate comments (1)
examples/vlm_ptq/scripts/huggingface_example.sh (1)

220-223: qwen2_vl mapping: ensure downstream support in visual engine/run

Mapping qwen → qwen2_vl is correct, but confirm vlm_visual_engine.py and vlm_run.py have a branch for qwen2_vl; otherwise the visual build/run will fail.

Quick check:

#!/bin/bash
rg -nC3 '\bqwen2_vl\b' examples/vlm_ptq/vlm_visual_engine.py examples/vlm_ptq/vlm_run.py || true
🧹 Nitpick comments (5)
modelopt/torch/export/model_config_export.py (1)

347-368: Qwen share-embedding assertion: prefer capability check over hardcoded allowlist

Adding "qwen" to the allowlist works, but this path fires when lm_head is absent and we unconditionally force shared embeddings. For better safety across Qwen variants (incl. VL), check tie_word_embeddings from the HF config (if available) instead of relying solely on a model-name allowlist.

Apply this targeted refactor to use HF config when present:

-            elif training_pipeline_parallel == 1:
-                # Models that share weights for lm_head and vocab_embedding
-                assert decoder_type in [
-                    "mpt",
-                    "gpt2",
-                    "gemma",
-                    "gemma2",
-                    "gemma3",
-                    "glm",
-                    "llama",
-                    "mllama",
-                    "qwen",
-                ], f"lm_head not available for decoder {decoder_type}"
-                config.share_embedding_table = True
+            elif training_pipeline_parallel == 1:
+                # Models that share weights for lm_head and vocab_embedding
+                tied = bool(getattr(hf_config, "tie_word_embeddings", False))
+                known_tied = [
+                    "mpt",
+                    "gpt2",
+                    "gemma",
+                    "gemma2",
+                    "gemma3",
+                    "glm",
+                    "llama",
+                    "mllama",
+                    "qwen",
+                ]
+                if tied or decoder_type in known_tied:
+                    config.share_embedding_table = True
+                else:
+                    raise AssertionError(f"lm_head not available for decoder {decoder_type} and embeddings are not tied")

Verification ask:

  • Confirm Qwen2.5-VL’s HF config sets tie_word_embeddings=True; otherwise, this new branch will assert and help catch mismatches early.
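A quick way to run that check, assuming the transformers library (recent enough to know the qwen2_5_vl model type) is installed and the checkpoint is reachable; the text_config fallback is an assumption about where the flag may live on the nested VL config:

from transformers import AutoConfig

# Inspect whether the Qwen2.5-VL checkpoint ties its input embeddings and lm_head.
cfg = AutoConfig.from_pretrained("Qwen/Qwen2.5-VL-7B-Instruct")
tied = getattr(cfg, "tie_word_embeddings", None)
if tied is None and hasattr(cfg, "text_config"):
    tied = getattr(cfg.text_config, "tie_word_embeddings", None)
print("tie_word_embeddings:", tied)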
examples/vlm_ptq/README.md (1)

55-56: Tighten wording: mention Qwen explicitly and keep list order consistent

Minor clarity nit: the sentence listing models now includes Qwen—looks good. Consider ordering models in the list the same way as in the support matrix for easy scanning.

examples/vlm_ptq/scripts/huggingface_example.sh (3)

60-66: NVFP4 option needs environment guardrails

Since NVFP4 requires specific HW/SW (e.g., Blackwell + TRT-LLM >= 0.17), add a quick preflight check to fail fast with a clear message when prerequisites aren’t met.

Example guard (non-blocking; adjust as needed):

 case $QFORMAT in
     fp8|nvfp4|int8_sq|int4_awq|w4a8_awq|fp16|bf16)
         ;;
     *)
         echo "Unknown quant argument: Expected one of: [fp8, nvfp4, int8_sq, int4_awq, w4a8_awq, fp16, bf16]" >&2
         exit 1
 esac
+
+# Optional: basic NVFP4 preflight
+if [ "$QFORMAT" = "nvfp4" ]; then
+    echo "[Info] NVFP4 requires Blackwell GPUs and TensorRT-LLM >= 0.17."
+    # TODO: add environment checks here if desired (driver, GPU arch, TRT-LLM version).
+fi

94-98: Batch size 20 for qwen — verify memory headroom

Bumping default build batch size to 20 can OOM on modest GPUs, especially with large vision encoders. Consider 8–16 by default or document the expected minimum GPU memory.


17-25: Add pipefail to catch subshell/pipe errors early

Minor robustness improvement.

-set -e
+set -e
+set -o pipefail
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 312280e and 0bd0218.

📒 Files selected for processing (5)
  • examples/llm_ptq/hf_ptq.py (1 hunks)
  • examples/vlm_ptq/README.md (3 hunks)
  • examples/vlm_ptq/scripts/huggingface_example.sh (6 hunks)
  • modelopt/torch/export/model_config_export.py (1 hunks)
  • modelopt/torch/export/plugins/hf_spec_export.py (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
  • examples/llm_ptq/hf_ptq.py
  • modelopt/torch/export/plugins/hf_spec_export.py
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
  • GitHub Check: linux
  • GitHub Check: build-docs
  • GitHub Check: code-quality
🔇 Additional comments (3)
examples/vlm_ptq/scripts/huggingface_example.sh (3)

32-37: Add qwen to supported MODEL_TYPE — LGTM

The new type gate is correct and the error message is updated accordingly.


152-155: KV cache kept high precision for qwen — LGTM

For Qwen, forcing --kv_cache_qformat none is consistent with the PR objective to keep KV in high precision.


183-184: Remove outdated check: --no-verbose is already supported
The parser.add_argument("--verbose", …, action=argparse.BooleanOptionalAction) in examples/llm_ptq/hf_ptq.py automatically creates both --verbose and --no-verbose flags, so no changes are required.
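For reference, a self-contained sketch of how BooleanOptionalAction yields the paired flags (illustrative; not the exact hf_ptq.py parser, whose defaults and help text may differ):

import argparse

parser = argparse.ArgumentParser()
# BooleanOptionalAction (Python 3.9+) registers both --verbose and --no-verbose.
parser.add_argument(
    "--verbose",
    action=argparse.BooleanOptionalAction,
    default=True,
    help="Print verbose output (pass --no-verbose to silence it).",
)

print(parser.parse_args([]).verbose)                # True (default)
print(parser.parse_args(["--no-verbose"]).verbose)  # False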


codecov bot commented Sep 9, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 73.87%. Comparing base (b913290) to head (043d2ce).
⚠️ Report is 1 commit behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #311      +/-   ##
==========================================
- Coverage   73.88%   73.87%   -0.01%     
==========================================
  Files         172      172              
  Lines       17439    17439              
==========================================
- Hits        12884    12883       -1     
- Misses       4555     4556       +1     

☔ View full report in Codecov by Sentry.

@i-riyad i-riyad enabled auto-merge (squash) September 10, 2025 04:09
@kevalmorabia97 (Collaborator) commented:

/ok to test 043d2ce

@i-riyad i-riyad merged commit 85b309f into main Sep 10, 2025
24 checks passed
@i-riyad i-riyad deleted the rislam/qwen-fix branch September 10, 2025 07:11
jingyu-ml pushed a commit that referenced this pull request Sep 10, 2025
benchislett pushed a commit that referenced this pull request Sep 15, 2025