Added support for qwen3-next quantization and export #323
Conversation
Walkthrough
Extends recognition and handling for Qwen3Next MoE blocks: layer utilities now detect Qwen3Next model types and MoE block names, and the HuggingFace quantization plugin optionally imports Qwen3NextSparseMoeBlock and, when the import succeeds, registers it as "hf.Qwen3NextSparseMoeBlock" with the quant module registry.
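For orientation, the guarded registration described above presumably follows the try/except pattern sketched below. This is a sketch of the assumed shape only: the exact QuantModuleRegistry.register signature and the _QuantMoeSparseMoe class live inside the plugin module (where both names are in scope) and are not reproduced verbatim here.

```python
# Sketch of the optional registration (assumed API shape, not the verbatim plugin code).
# QuantModuleRegistry and _QuantMoeSparseMoe are assumed to be in scope inside
# modelopt/torch/quantization/plugins/huggingface.py.
try:
    from transformers.models.qwen3_next.modeling_qwen3_next import Qwen3NextSparseMoeBlock

    # Assumed: register() takes a {module class: registry key} mapping and
    # returns a decorator that binds the quantized replacement class.
    QuantModuleRegistry.register({Qwen3NextSparseMoeBlock: "hf.Qwen3NextSparseMoeBlock"})(
        _QuantMoeSparseMoe
    )
except ImportError:
    # Older transformers releases without Qwen3-Next: silently skip registration.
    pass
```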
Sequence Diagram(s)
```mermaid
sequenceDiagram
autonumber
participant Loader as ModelLoader
participant LU as layer_utils
participant HF as HF_Quant_Plugin
participant QR as QuantModuleRegistry
participant Q as Quantizer
Loader->>LU: get_experts_list(model_type)
LU-->>Loader: returns gate_proj/down_proj/up_proj for qwen3next*
Loader->>LU: is_moe(block)
LU-->>Loader: true for Qwen3NextSparseMoeBlock
note over HF,QR: Optional registration at import time
HF->>HF: try import Qwen3NextSparseMoeBlock
alt import succeeds
HF->>QR: register "hf.Qwen3NextSparseMoeBlock" -> _QuantMoeSparseMoe
QR-->>HF: registered
else import fails
HF-->>HF: skip registration
end
Loader->>Q: request quantization
Q->>QR: resolve handler for block type
QR-->>Q: _QuantMoeSparseMoe (if registered)
```
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~20 minutes
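Read as loader code, the sequence diagram above corresponds roughly to the sketch below. The helper name collect_moe_experts is hypothetical; is_moe and get_experts_list are called with the signatures shown in the diagram and in the diff later in this review, and the real loader may pass additional arguments.

```python
import torch
from modelopt.torch.export import layer_utils

def collect_moe_experts(model: torch.nn.Module) -> dict:
    """Hypothetical sketch of the loader-side flow, not the actual loader code."""
    model_type = type(model).__name__.lower()  # e.g. "qwen3nextforcausallm"
    experts_by_block = {}
    for name, block in model.named_modules():
        if layer_utils.is_moe(block):
            # For qwen3next* this groups experts by gate_proj / down_proj / up_proj.
            experts_by_block[name] = layer_utils.get_experts_list(block, model_type)
    return experts_by_block
```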
Pre-merge checks and finishing touches: ✅ Passed checks (3 passed)
📜 Recent review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
🚧 Files skipped from review as they are similar to previous changes (2)
⏰ Context from checks skipped due to timeout of 90000ms (4)
Codecov Report
✅ All modified and coverable lines are covered by tests.
Additional details and impacted files
```diff
@@            Coverage Diff             @@
##             main     #323      +/-   ##
==========================================
- Coverage   73.88%   73.88%   -0.01%
==========================================
  Files         172      172
  Lines       17444    17444
==========================================
- Hits        12889    12888       -1
- Misses       4555     4556       +1
```
☔ View full report in Codecov by Sentry.
Actionable comments posted: 0
🧹 Nitpick comments (1)
modelopt/torch/export/layer_utils.py (1)
88-103: Qwen3-Next detection: good; normalize model_type and de-duplicate the variant list.
Works as intended. To harden against caller casing and avoid future copy-paste, normalize once and keep a single tuple of supported Qwen MoE variants.
Apply:
```diff
 def get_experts_list(module: torch.nn.Module, model_type: str):
     """Returns list of grouped experts by linear name for given module."""
     experts_list = []
     # Define linear layer names for different model types
-    if "mixtralforcausallm" in model_type:
+    model_type = model_type.lower()
+    if "mixtralforcausallm" in model_type:
         linear_names = ["w1", "w2", "w3"]
-    elif any(
-        qwen_variant in model_type
-        for qwen_variant in [
-            "qwenmoeforcausallm",
-            "qwen2moeforcausallm",
-            "qwen3moeforcausallm",
-            "qwen3nextforcausallm",
-        ]
-    ):
+    elif any(qv in model_type for qv in (
+        "qwenmoeforcausallm",
+        "qwen2moeforcausallm",
+        "qwen3moeforcausallm",
+        "qwen3nextforcausallm",
+    )):
         linear_names = ["gate_proj", "down_proj", "up_proj"]
     else:
         raise NotImplementedError(f" {model_type} not supported")
```
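As a quick illustration of why the .lower() normalization matters, here is a standalone spot-check of the matching logic with hypothetical model_type strings; the variant tuple mirrors the one in the diff above.

```python
# Hypothetical spot-check of the case-insensitive matching proposed above.
for mt in ("Qwen3NextForCausalLM", "qwen3nextforcausallm", "MixtralForCausalLM"):
    normalized = mt.lower()
    is_qwen_moe = any(
        qv in normalized
        for qv in (
            "qwenmoeforcausallm",
            "qwen2moeforcausallm",
            "qwen3moeforcausallm",
            "qwen3nextforcausallm",
        )
    )
    print(mt, "->", "qwen-moe" if is_qwen_moe else "other")
```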
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
modelopt/torch/export/layer_utils.py (3 hunks)
modelopt/torch/quantization/plugins/huggingface.py (1 hunk)
🧰 Additional context used
🧬 Code graph analysis (1)
modelopt/torch/quantization/plugins/huggingface.py (1)
modelopt/torch/quantization/conversion.py (1)
register (326-367)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
- GitHub Check: multi-torch (27)
- GitHub Check: windows
🔇 Additional comments (3)
modelopt/torch/export/layer_utils.py (2)
329-343: Include Qwen3NextSparseMoeBlock in is_moe — LGTM.
Adds the new block to MoE detection without altering existing behavior.
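For reference, a minimal sketch of what the detection amounts to, assuming is_moe dispatches on the block's class name; the actual check in layer_utils.py may differ (for example, it may also inspect module attributes).

```python
import torch.nn as nn

# Minimal sketch of a name-based MoE block check; the real is_moe() in
# layer_utils.py may use a different mechanism.
def is_moe(module: nn.Module) -> bool:
    return type(module).__name__ in {
        "MixtralSparseMoeBlock",
        "Qwen2MoeSparseMoeBlock",
        "Qwen3MoeSparseMoeBlock",
        "Qwen3NextSparseMoeBlock",  # newly added in this PR
    }
```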
996-1005: LGTM — verify HF Qwen3NextSparseMoeBlock exposes gate_proj / down_proj / up_proj.
Consistent with Qwen2/3 MoE. Verification here failed (ModuleNotFoundError: No module named 'transformers'). Run locally and paste output:
```bash
#!/bin/bash
python - <<'PY'
import importlib, inspect
m = importlib.import_module("transformers.models.qwen3_next.modeling_qwen3_next")
blk = getattr(m, "Qwen3NextSparseMoeBlock", None)
print("Has Qwen3NextSparseMoeBlock:", blk is not None)
if blk:
    print("Constructor signature:", inspect.signature(blk.__init__))
    print("Class attrs (proj/gate/expert):", [a for a in dir(blk) if any(k in a for k in ("proj", "gate", "expert"))])
PY
```
modelopt/torch/quantization/plugins/huggingface.py (1)
562-571: Optional registration for Qwen3NextSparseMoeBlock — keep the try/except; confirm the HF export.
modelopt/torch/quantization/plugins/huggingface.py (≈lines 562–571) contains the registration. Runtime verification failed here because 'transformers' is not installed and the modelopt package import failed — confirm that transformers.models.qwen3_next.modeling_qwen3_next exports Qwen3NextSparseMoeBlock in your target Transformers release and that QuantModuleRegistry.get(Qwen3NextSparseMoeBlock) returns the registered entry.
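A minimal local check along those lines, assuming transformers ships Qwen3-Next and that QuantModuleRegistry.get accepts the module class; the import path for QuantModuleRegistry shown here is an assumption and should be adjusted to the actual package layout.

```python
# Hedged verification sketch: confirm the block exists in transformers and
# that the quantization registry resolved it. Requires a transformers release
# that ships Qwen3-Next.
from transformers.models.qwen3_next.modeling_qwen3_next import Qwen3NextSparseMoeBlock
from modelopt.torch.quantization.nn import QuantModuleRegistry  # assumed import path

handler = QuantModuleRegistry.get(Qwen3NextSparseMoeBlock)
print("Registered handler:", handler)  # expected: _QuantMoeSparseMoe if registration succeeded
```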
This works for now.
We need to look into how the deployment framework handles expert quantization: whether the experts are quantized in isolation or all at once, and whether quantization parameters such as per-tensor scales are shared between the experts (see the toy sketch below).
Figuring out these details will be critical for QAT support. Cc @cjluo-nv @RalphMao
Please share if you have any particular thoughts.
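To make the trade-off concrete, here is a toy sketch, unrelated to any specific deployment framework, contrasting per-expert per-tensor scales with a single scale shared across all experts.

```python
import torch

# Toy illustration only (not the deployment framework's logic): three "expert"
# weight tensors with very different magnitudes, mapped to an int8 range.
experts = [torch.randn(4, 4) * s for s in (0.5, 1.0, 4.0)]

# Option A: each expert keeps its own per-tensor scale.
per_expert_scales = [(w.abs().max() / 127.0).item() for w in experts]

# Option B: one scale shared across all experts; simpler to ship, but the
# small-magnitude experts lose quantization resolution.
shared_scale = max(w.abs().max().item() for w in experts) / 127.0

print("per-expert:", [round(s, 5) for s in per_expert_scales])
print("shared:    ", round(shared_scale, 5))
```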
Force-pushed from 1ceb40e to 49c23e9
Signed-off-by: Kinjal Patel <[email protected]>
Signed-off-by: Ye Yu <[email protected]>
What does this PR do?
Support for Qwen3-Next quantization and HF export
Overview:
Added support for quantizing the new Qwen3-Next models and exporting them in a Hugging Face compatible format.
Usage
See example/llm_ptq/hf_ptq.py
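For readers who prefer the API view over the script, a rough sketch of the PTQ-plus-export path follows. Assumptions: mtq.quantize, FP8_DEFAULT_CFG, and export_hf_checkpoint are the modelopt entry points in play, the one-prompt forward_loop is only a stand-in for a real calibration dataset, and the export directory name is arbitrary; the model id matches the one used for testing below.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
import modelopt.torch.quantization as mtq
from modelopt.torch.export import export_hf_checkpoint

model_id = "Qwen/Qwen3-Next-80B-A3B-Instruct"
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(model_id)

def forward_loop(m):
    # Minimal stand-in calibration pass; the example script drives this with a real dataset.
    inputs = tokenizer("Hello, Qwen3-Next!", return_tensors="pt")
    inputs = {k: v.to(next(m.parameters()).device) for k, v in inputs.items()}
    m(**inputs)

# Quantize (FP8 config chosen for illustration), then export in HF-compatible format.
model = mtq.quantize(model, mtq.FP8_DEFAULT_CFG, forward_loop)
export_hf_checkpoint(model, export_dir="qwen3-next-fp8")
```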
Testing
Tested by quantizing and exporting the Qwen/Qwen3-Next-80B-A3B-Instruct and Qwen/Qwen3-Next-80B-A3B-Thinking models