
Conversation

@i-riyad (Contributor) commented Sep 9, 2025

What does this PR do?

Type of change: new example

Overview: Qwen2.5-VL-7B-Instruct needs to be quantized with the KV cache kept in high precision. This PR adds Qwen support to the example, along with the necessary fixes.

Usage

scripts/huggingface_example.sh --type qwen --model Qwen2.5-VL-7B-Instruct --quant fp8 --export_fmt hf

Testing

Before your PR is "Ready for review"

  • Make sure you read and follow Contributor guidelines and your commits are signed.
  • Is this change backward compatible?: Yes
  • Did you write any new necessary tests?: No
  • Did you add or update any necessary documentation?: Yes
  • Did you update Changelog?: No

Additional Information

Summary by CodeRabbit

  • New Features

    • Added support for the Qwen model type across the VLM quantization flow, including mapping, config adjustments, and increased max batch size for Qwen.
  • Bug Fixes

    • Made detection of speculative-decoding state more robust to avoid errors when state is missing or unexpected.
  • Documentation

    • Added Qwen2.5-VL details and usage examples to PTQ documentation and examples.
  • Style

    • Clarified CLI help text for the verbose option.
  • Chores

    • Added a --no-verbose option to reduce quantization log output.

@i-riyad i-riyad requested review from a team as code owners September 9, 2025 20:33
@i-riyad i-riyad requested a review from cjluo-nv September 9, 2025 20:33

copy-pr-bot bot commented Sep 9, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.


coderabbitai bot commented Sep 9, 2025

Walkthrough

Adds qwen model support in VLM PTQ scripts, includes qwen in shared-embedding export whitelist, hardens a guard in HF spec export, and corrects a CLI help string; also changes PTQ invocations to use --no-verbose for quieter quantization.

Changes

  • CLI/help text tweak (examples/llm_ptq/hf_ptq.py): Updated the --verbose help text to reference --no-verbose; no logic changes.
  • VLM PTQ, qwen support & PTQ args (examples/vlm_ptq/scripts/huggingface_example.sh): Added qwen to the allowed MODEL_TYPE values, added a qwen branch that appends --kv_cache_qformat none to PTQ_ARGS, increased BUILD_MAX_BATCH_SIZE to 20 for qwen, set VISUAL_MODEL_TYPE to qwen2_vl, and invoked quantization with --no-verbose.
  • Export config, qwen embedding share (modelopt/torch/export/model_config_export.py): Added "qwen" to the decoder types that share embedding tables and lack lm_head when training_pipeline_parallel == 1.
  • HF spec export guard hardening (modelopt/torch/export/plugins/hf_spec_export.py): set_config_if_spec_decoding now safely reads _modelopt_state with type/length/content checks (it must be a single entry whose first element is "eagle"); otherwise it returns config_data early.
  • Docs, VLM PTQ README (examples/vlm_ptq/README.md): Added Qwen2.5-VL to the supported-models matrix, updated example usage, and added a dedicated Qwen2.5-VL section with HF clone and script invocation.

Sequence Diagram(s)

sequenceDiagram
  autonumber
  actor U as User
  participant S as huggingface_example.sh
  participant C as model_config
  participant Q as hf_ptq.py

  U->>S: Run script with MODEL_TYPE=qwen
  S->>C: Set VISUAL_MODEL_TYPE = qwen2_vl
  S->>S: BUILD_MAX_BATCH_SIZE = 20
  S->>S: Append PTQ_ARGS: --kv_cache_qformat none
  S->>Q: Invoke quantization with --no-verbose
  Q-->>S: Produce quantized HF artifacts
sequenceDiagram
  autonumber
  participant E as set_config_if_spec_decoding
  participant M as model
  participant C as config_data

  E->>M: Read _modelopt_state
  alt _modelopt_state is list/tuple of length 1 and first element starts with "eagle"
    E->>E: Apply spec-decoding config changes
    E-->>C: Return mutated config_data
  else
    E-->>C: Return config_data unchanged
  end

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Pre-merge checks (3 passed)

✅ Passed checks (3 passed)
  • Title Check: ✅ Passed. The title clearly summarizes the primary change of adding Qwen quantization and Hugging Face export support within the example scripts, is concise and specific without extraneous details, and aligns with the PR's main objective of extending the examples for Qwen models.
  • Description Check: ✅ Passed. The description outlines the purpose of the PR, specifies the type of change as a new example, provides usage instructions, and directly relates to the addition of Qwen support in the examples, making it relevant and on-topic.
  • Docstring Coverage: ✅ Passed. Docstring coverage is 100.00%, which is sufficient; the required threshold is 80.00%.

Poem

I twitch my ears at "qwen" tonight,
New branches sprout in script and byte.
Quiet flags hush the PTQ light,
Guards peek first before they bite.
Hop—embeddings shared in flight. 🐇✨


coderabbitai bot left a comment

Actionable comments posted: 1

🧹 Nitpick comments (2)
examples/vlm_ptq/scripts/huggingface_example.sh (2)

94-99: Default max batch size 20 for qwen: verify memory headroom

Qwen2.5-VL-7B can be memory heavy; batch 20 may OOM on smaller GPUs during build. Consider making this model-aware or gating by available VRAM.

Run a quick dry-run on your target GPU(s) to confirm batch 20 is safe; otherwise reduce or make it configurable via an env var (e.g., BUILD_MAX_BATCH_SIZE).
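One way to do that, as a small sketch (the default of 20 and the variable name follow the script; the environment override itself is the suggestion here):

# Let callers override the build batch size from the environment; default to 20 for qwen.
BUILD_MAX_BATCH_SIZE="${BUILD_MAX_BATCH_SIZE:-20}"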


152-155: Don’t force override EXPORT_FORMAT for qwen

Let users opt into TRT-LLM builds explicitly; default to HF only if unset. This preserves flexibility and reduces surprise.

Apply:

     "qwen")
         PTQ_ARGS+=" --kv_cache_qformat none "
-        EXPORT_FORMAT="hf"
+        : "${EXPORT_FORMAT:=hf}"
         ;;
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between d6d2e75 and bc62a24.

📒 Files selected for processing (4)
  • examples/llm_ptq/hf_ptq.py (1 hunks)
  • examples/vlm_ptq/scripts/huggingface_example.sh (5 hunks)
  • modelopt/torch/export/model_config_export.py (1 hunks)
  • modelopt/torch/export/plugins/hf_spec_export.py (1 hunks)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
  • GitHub Check: build-docs
  • GitHub Check: code-quality
🔇 Additional comments (5)
examples/llm_ptq/hf_ptq.py (1)

744-748: Help text nit: flag spelling fix LGTM

Updated hint matches argparse.BooleanOptionalAction’s --no-verbose. No functional changes.

modelopt/torch/export/model_config_export.py (1)

355-368: Enable shared-embedding path for qwen when lm_head is absent — OK

Adding "qwen" to the whitelist aligns with weight tying in many Qwen decoders and unblocks TRT-LLM export when lm_head isn’t present on rank.

If you have a small Qwen decoder handy, please sanity-check that:

  • config.lm_head is None on the inspected rank, and
  • export sets config.share_embedding_table = True.
modelopt/torch/export/plugins/hf_spec_export.py (1)

85-93: Robust guard for _modelopt_state access — good hardening

Consistent with rename_and_prune_if_spec_decoding; avoids index errors when speculative mode isn’t active.
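As a rough illustration of the guard pattern being praised here (hypothetical, simplified code; the real function lives in hf_spec_export.py and carries more context, and each state entry is assumed to be a (mode_name, state) pair):

# Simplified sketch of the hardened guard (not the actual ModelOpt code).
def set_config_if_spec_decoding(model, config_data):
    opt_state = getattr(model, "_modelopt_state", None)
    # Only proceed when the state is a single entry whose mode name starts with
    # "eagle"; anything else (missing, wrong type, wrong length) leaves the
    # config untouched.
    if (
        not isinstance(opt_state, (list, tuple))
        or len(opt_state) != 1
        or not str(opt_state[0][0]).startswith("eagle")
    ):
        return config_data
    # ... apply the EAGLE speculative-decoding config changes here ...
    return config_data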

examples/vlm_ptq/scripts/huggingface_example.sh (2)

33-37: Add qwen to allowed MODEL_TYPE — OK

Validation message updated accordingly.


184-185: Suppress verbose quant logs — OK

Matches hf_ptq.py’s BooleanOptionalAction flag.

h-guo18 (Contributor) left a comment

hf_spec_export.py looks good to me. Thanks for the fix!

coderabbitai bot left a comment

Actionable comments posted: 0

🧹 Nitpick comments (1)
modelopt/torch/export/model_config_export.py (1)

356-367: De-duplicate whitelist into a constant.

Minor cleanup: move the decoder whitelist to a module-level constant to keep it single-sourced and easier to extend.

Apply within this block:

-                assert decoder_type in [
-                    "mpt",
-                    "gpt2",
-                    "gemma",
-                    "gemma2",
-                    "gemma3",
-                    "glm",
-                    "llama",
-                    "mllama",
-                    "qwen",
-                ], f"lm_head not available for decoder {decoder_type}"
+                assert decoder_type in SHARED_EMBEDDING_DECODERS, (
+                    f"lm_head not available for decoder {decoder_type}"
+                )

And add near the imports (top of file):

# Decoders with tied vocab_embedding and lm_head; lm_head may be absent during traversal.
SHARED_EMBEDDING_DECODERS: tuple[str, ...] = (
    "mpt",
    "gpt2",
    "gemma",
    "gemma2",
    "gemma3",
    "glm",
    "llama",
    "mllama",
    "qwen",
)
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between bc62a24 and 312280e.

📒 Files selected for processing (4)
  • examples/llm_ptq/hf_ptq.py (1 hunks)
  • examples/vlm_ptq/scripts/huggingface_example.sh (6 hunks)
  • modelopt/torch/export/model_config_export.py (1 hunks)
  • modelopt/torch/export/plugins/hf_spec_export.py (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (3)
  • examples/llm_ptq/hf_ptq.py
  • examples/vlm_ptq/scripts/huggingface_example.sh
  • modelopt/torch/export/plugins/hf_spec_export.py
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
  • GitHub Check: code-quality
  • GitHub Check: build-docs
🔇 Additional comments (1)
modelopt/torch/export/model_config_export.py (1)

356-367: ERROR: Unable to execute the requested run_scripts. Please ensure the script and tag formatting is correct.

coderabbitai bot left a comment

Actionable comments posted: 1

♻️ Duplicate comments (1)
examples/vlm_ptq/scripts/huggingface_example.sh (1)

220-223: qwen2_vl mapping: ensure downstream support in visual engine/run

Mapping qwen → qwen2_vl is correct, but confirm vlm_visual_engine.py and vlm_run.py have a branch for qwen2_vl; otherwise the visual build/run will fail.

Quick check:

#!/bin/bash
rg -nC3 '\bqwen2_vl\b' examples/vlm_ptq/vlm_visual_engine.py examples/vlm_ptq/vlm_run.py || true
🧹 Nitpick comments (5)
modelopt/torch/export/model_config_export.py (1)

347-368: Qwen share-embedding assertion: prefer capability check over hardcoded allowlist

Adding "qwen" to the allowlist works, but this path fires when lm_head is absent and we unconditionally force shared embeddings. For better safety across Qwen variants (incl. VL), check tie_word_embeddings from the HF config (if available) instead of relying solely on a model-name allowlist.

Apply this targeted refactor to use HF config when present:

-            elif training_pipeline_parallel == 1:
-                # Models that share weights for lm_head and vocab_embedding
-                assert decoder_type in [
-                    "mpt",
-                    "gpt2",
-                    "gemma",
-                    "gemma2",
-                    "gemma3",
-                    "glm",
-                    "llama",
-                    "mllama",
-                    "qwen",
-                ], f"lm_head not available for decoder {decoder_type}"
-                config.share_embedding_table = True
+            elif training_pipeline_parallel == 1:
+                # Models that share weights for lm_head and vocab_embedding
+                tied = bool(getattr(hf_config, "tie_word_embeddings", False))
+                known_tied = [
+                    "mpt",
+                    "gpt2",
+                    "gemma",
+                    "gemma2",
+                    "gemma3",
+                    "glm",
+                    "llama",
+                    "mllama",
+                    "qwen",
+                ]
+                if tied or decoder_type in known_tied:
+                    config.share_embedding_table = True
+                else:
+                    raise AssertionError(f"lm_head not available for decoder {decoder_type} and embeddings are not tied")

Verification ask:

  • Confirm Qwen2.5-VL’s HF config sets tie_word_embeddings=True; otherwise, this new branch will assert and help catch mismatches early.
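A quick way to run that check, assuming the transformers library (recent enough to know the qwen2_5_vl model type) is installed and the checkpoint is reachable; the text_config fallback is an assumption about where the flag may live on the nested VL config:

from transformers import AutoConfig

# Inspect whether the Qwen2.5-VL checkpoint ties its input embeddings and lm_head.
cfg = AutoConfig.from_pretrained("Qwen/Qwen2.5-VL-7B-Instruct")
tied = getattr(cfg, "tie_word_embeddings", None)
if tied is None and hasattr(cfg, "text_config"):
    tied = getattr(cfg.text_config, "tie_word_embeddings", None)
print("tie_word_embeddings:", tied)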
examples/vlm_ptq/README.md (1)

55-56: Tighten wording: mention Qwen explicitly and keep list order consistent

Minor clarity nit: the sentence listing models now includes Qwen—looks good. Consider ordering models in the list the same way as in the support matrix for easy scanning.

examples/vlm_ptq/scripts/huggingface_example.sh (3)

60-66: NVFP4 option needs environment guardrails

Since NVFP4 requires specific HW/SW (e.g., Blackwell + TRT-LLM >= 0.17), add a quick preflight check to fail fast with a clear message when prerequisites aren’t met.

Example guard (non-blocking; adjust as needed):

 case $QFORMAT in
     fp8|nvfp4|int8_sq|int4_awq|w4a8_awq|fp16|bf16)
         ;;
     *)
         echo "Unknown quant argument: Expected one of: [fp8, nvfp4, int8_sq, int4_awq, w4a8_awq, fp16, bf16]" >&2
         exit 1
 esac
+
+# Optional: basic NVFP4 preflight
+if [ "$QFORMAT" = "nvfp4" ]; then
+    echo "[Info] NVFP4 requires Blackwell GPUs and TensorRT-LLM >= 0.17."
+    # TODO: add environment checks here if desired (driver, GPU arch, TRT-LLM version).
+fi

94-98: Batch size 20 for qwen — verify memory headroom

Bumping default build batch size to 20 can OOM on modest GPUs, especially with large vision encoders. Consider 8–16 by default or document the expected minimum GPU memory.


17-25: Add pipefail to catch subshell/pipe errors early

Minor robustness improvement.

-set -e
+set -e
+set -o pipefail
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 312280e and 0bd0218.

📒 Files selected for processing (5)
  • examples/llm_ptq/hf_ptq.py (1 hunks)
  • examples/vlm_ptq/README.md (3 hunks)
  • examples/vlm_ptq/scripts/huggingface_example.sh (6 hunks)
  • modelopt/torch/export/model_config_export.py (1 hunks)
  • modelopt/torch/export/plugins/hf_spec_export.py (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
  • examples/llm_ptq/hf_ptq.py
  • modelopt/torch/export/plugins/hf_spec_export.py
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
  • GitHub Check: linux
  • GitHub Check: build-docs
  • GitHub Check: code-quality
🔇 Additional comments (3)
examples/vlm_ptq/scripts/huggingface_example.sh (3)

32-37: Add qwen to supported MODEL_TYPE — LGTM

The new type gate is correct and the error message is updated accordingly.


152-155: KV cache kept high precision for qwen — LGTM

For Qwen, forcing --kv_cache_qformat none is consistent with the PR objective to keep KV in high precision.


183-184: Remove outdated check: --no-verbose is already supported
The parser.add_argument("--verbose", …, action=argparse.BooleanOptionalAction) in examples/llm_ptq/hf_ptq.py automatically creates both --verbose and --no-verbose flags, so no changes are required.
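For reference, a self-contained sketch of how BooleanOptionalAction yields the paired flags (illustrative; not the exact hf_ptq.py parser, whose defaults and help text may differ):

import argparse

parser = argparse.ArgumentParser()
# BooleanOptionalAction (Python 3.9+) registers both --verbose and --no-verbose.
parser.add_argument(
    "--verbose",
    action=argparse.BooleanOptionalAction,
    default=True,
    help="Print verbose output (pass --no-verbose to silence it).",
)

print(parser.parse_args([]).verbose)                # True (default)
print(parser.parse_args(["--no-verbose"]).verbose)  # False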


codecov bot commented Sep 9, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 73.87%. Comparing base (b913290) to head (043d2ce).
⚠️ Report is 1 commit behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #311      +/-   ##
==========================================
- Coverage   73.88%   73.87%   -0.01%     
==========================================
  Files         172      172              
  Lines       17439    17439              
==========================================
- Hits        12884    12883       -1     
- Misses       4555     4556       +1     

☔ View full report in Codecov by Sentry.

@i-riyad i-riyad enabled auto-merge (squash) September 10, 2025 04:09
@kevalmorabia97 (Collaborator) commented:

/ok to test 043d2ce

@i-riyad i-riyad merged commit 85b309f into main Sep 10, 2025
24 checks passed
@i-riyad i-riyad deleted the rislam/qwen-fix branch September 10, 2025 07:11
jingyu-ml pushed a commit that referenced this pull request Sep 10, 2025
benchislett pushed a commit that referenced this pull request Sep 15, 2025