feat: add Qwen3.5 MoE calibration module #2383

Sehyo wants to merge 12 commits into vllm-project:main
Conversation
Summary of Changes (Gemini Code Assist): This pull request introduces a specialized calibration module for Qwen3.5 Mixture-of-Experts (MoE) models, designed to facilitate efficient NVFP4 quantization of their expert weights. By dynamically restructuring the MoE block to expose individual expert layers as standard linear modules, it enables the application of fine-grained quantization techniques. A new example script demonstrates this process on a large-scale model.
Activity
Code Review
This pull request introduces a calibration module for Qwen3.5 MoE models, enabling NVFP4 quantization. The changes include the core module implementation, its registration within the modeling package, and a comprehensive example script demonstrating its usage on a large-scale model. The implementation correctly unfuses expert weights into individual nn.Linear layers, which is crucial for quantization. The approach of using disable_onloading to handle large model weights on the CPU is well-considered. I have identified one potential issue in the forward pass logic that could lead to errors for MoE models configured with top_k=1, and I have provided a suggestion to address it.
Requesting review, plus the ready tag and enhancement tag.
Quantized version with this PR:
dsikka left a comment:

This looks really good - thank you!
The quality checks have failed. Please run
I keep getting `RuntimeError: CUDA error: CUBLAS_STATUS_INVALID_VALUE` when calling

Is this an error from VLLM?
I have detected an issue in the current upstream version of VLLM which causes the Qwen3.5 NVFP4 quant to fail. In Qwen3.5's Gated DeltaNet, we have some fused/merged projections, and VLLM does fusing which assumes plain weight tensors that can be concatenated. But the NVFP4 format stores weights in `weight_packed` (4-bit packed) form, so the fused weights come out as garbage. I am currently trying to write a fix; if I get it working, I will submit a PR to the vllm repo as well.
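To make the failure mode concrete, here is a deliberately simplified, hypothetical sketch (not vLLM's actual fusion code; NVFP4 really uses packed nibbles with per-group scales, while this toy uses one per-tensor scale): concatenating raw quantized codes and dequantizing them all with one tensor's scale corrupts the other tensor's rows.

```python
# Hypothetical sketch: why fusing quantized weights like plain tensors breaks.
# Each quantized tensor carries its own scale; a plain-tensor fusion path that
# concatenates raw integer codes and reuses one scale dequantizes the rest wrong.

def dequant(codes, scale):
    """Dequantize integer codes with a single per-tensor scale."""
    return [c * scale for c in codes]

# Two projections quantized independently (think in_proj_qkv and in_proj_z).
codes_a, scale_a = [1, 2], 2.0   # real values: [2.0, 4.0]
codes_b, scale_b = [3, 4], 0.5   # real values: [1.5, 2.0]

# Correct fusion: dequantize each tensor with its own scale, then concatenate.
correct = dequant(codes_a, scale_a) + dequant(codes_b, scale_b)

# Plain-tensor fusion applied to quantized codes: concatenate the raw codes
# and dequantize everything with the first tensor's scale -> garbage for B.
fused_codes = codes_a + codes_b
naive = dequant(fused_codes, scale_a)

print(correct)  # [2.0, 4.0, 1.5, 2.0]
print(naive)    # [2.0, 4.0, 6.0, 8.0] -- B's rows are wrong
```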
If we skip quantizing the linear attn layers, won't this issue be resolved?
Do you mind adding a test similar to the tests in this folder: https://github.com/vllm-project/llm-compressor/tree/main/tests/llmcompressor/modeling |
Yes, for those layers it does not matter.
Sure, will do it! |
@dsikka Tests have been added.

Review Request

It came to my attention that the MTP modules are dropped from the quant. I am away until Sunday but can fix it then.

@Sehyo I would switch it to the W4A16 scheme; the group size is for it to work on Exllama on my RDNA3.
@Sehyo Hi, I’ve encountered a couple of issues while running a modified version of your example code. Modification to the quantization script:

```python
scheme_0 = FP8_DYNAMIC
scheme_0["targets"] = ["re:.*self_attn.o_proj", "re:.*linear_attn.in_proj_qkv", "re:.*linear_attn.in_proj_z", "re:.*linear_attn.out_proj"]
scheme_1 = NVFP4
scheme_1["targets"] = ["re:.*self_attn.(q|k|v)_proj", "re:.*mlp.experts.*.*_proj"]
ignore = ["re:.*lm_head", "re:visual.*", "re:model.visual.*", "re:.*mlp.gate$", "re:.*norm.*", "re:.*mlp.gate$", "re:.*mlp.shared_expert_gate$", "re:.*mtp.*", "re:.*conv1d.*", "re:.*in_proj_a*", "re:.*in_proj_b*", "re:.*in_proj_c*"]
recipe = QuantizationModifier(
    config_groups={"group_0": scheme_0, "group_1": scheme_1}, ignore=ignore
)
```

Expected behavior:
Result: Only

Another issue: the exported tokenizer metadata appears to use an unexpected class:
There was, and still may be, an issue using mixed precision with NVFP4 in VLLM. Be aware of that, as that may be what's occurring here. I closed my PR, as I didn't see yours, @Sehyo. Your code was very close to mine, and your MTP handling is solid for peeps who turn it on. Thanks for submitting this.
@phaelon74 Thanks for the information! I’ll open a separate issue to discuss this, since it seems unrelated to this PR. I wonder if this is specific to the new

Edit: I found the issue. Turns out the regex wasn’t matching in my script.

```python
scheme_0 = FP8_DYNAMIC
scheme_0["targets"] = [
    "re:.*self_attn.o_proj$",
    "re:.*linear_attn.in_proj_qkv$",
    "re:.*linear_attn.in_proj_z$",
    "re:.*linear_attn.out_proj$",
]
scheme_1 = NVFP4
scheme_1["targets"] = [
    "re:.*self_attn.(q|k|v)_proj$",
    "re:.*mlp.experts.*.*_proj$",
]
ignore = ["re:.*lm_head", "re:visual.*", "re:model.visual.*", "re:.*mlp.gate$", "re:.*norm.*", "re:.*mlp.shared_expert_gate$", "re:.*mtp.*", "re:.*conv1d.*", "re:.*in_proj_a+", "re:.*in_proj_b+", "re:.*in_proj_c+"]
recipe = QuantizationModifier(
    config_groups={"group_0": scheme_0, "group_1": scheme_1}, ignore=ignore
)
```
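The over-matching in the original ignore list is a glob-vs-regex confusion; a quick check with Python's `re` module (which, as far as I can tell, is how `re:`-prefixed targets are matched against module names) shows why `in_proj_a*` also swallowed `in_proj_qkv`:

```python
import re

# In a regex, "a*" means "zero or more 'a' characters", not the glob
# "a followed by anything". So the pattern ".*in_proj_a*" matches ANY
# module name containing "in_proj_", including projections that were
# supposed to be quantized rather than ignored.
glob_style = r".*in_proj_a*"   # original (buggy) ignore pattern
fixed = r".*in_proj_a+"        # corrected: requires at least one 'a'

name = "model.layers.0.linear_attn.in_proj_qkv"

print(bool(re.match(glob_style, name)))  # True  -> qkv proj wrongly ignored
print(bool(re.match(fixed, name)))       # False -> qkv proj gets quantized
```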
Overall this looks fine, but I don't quite understand why we need an updated regex pattern, `_update_config_expanded_ignore`, or `_graft_extra_weights`? I think generally, if we want to expand regex mapping, that should be done in a follow-up PR, as it is not specific to Qwen3.5.

I am able to generate quantized checkpoints without this.
```python
# by regex (e.g. MoE router modules that aren't nn.Linear).
# Store expanded names on the model so the save wrapper can ensure
# they appear in config.json.
regex_patterns = [p for p in self.ignore if p.startswith("re:")]
```
Can you explain why you need this?

I did not have this in mine, and mine quanted and loaded successfully in VLLM, so I would love to know as well.
Graft extra weights is for re-adding MTP weights back in, as they get dropped.

@Sehyo I think we want to do this at the end when we're saving the checkpoint, not in the middle of calibration, as it does not impact quantization. Do you mind also resolving the quality issues?
I noticed your code uses `from transformers import AutoProcessor, AutoTokenizer, Qwen3_5MoeForConditionalGeneration`, which requires transformers>=5.2.0. However, the `from llmcompressor import oneshot` code indicates that the latest version of llmcompressor depends on transformers>=4.56.1,<=4.57.6.
Hi @Sehyo, I am going to break this PR up and land it in smaller pieces, as some of this functionality is now out of date. Thank you for the contribution!
Apologies for this ask @dsikka, but can you map it out please? I am having to use my PR to make my Qwen3.5 quants work, so it would be nice to know which PRs you will align into the implementation, so I know when they land, etc.
Sorry, I've been busy the last few days.
Possibly a stupid question: but how does this work without also relaxing the transformers upper bound? The Qwen3.5 MoE architecture has only been supported since transformers 5.2.0, and the current upper bound compatible with llmcompressor is 4.57.6 (or similar).
At some point LLM_Compressor/VLLM will support Transformers 5.2/5.3. For now, do it in this method: it will then work.
Thanks for the quick response. So this PR "only" adds the architecture support and the full functionality will still depend on other changes. |
This PR and/or mine add Qwen3.5 MoE modeling files, which allow for activating all experts during calibration. To ensure intelligence persists, you must either use a MASSIVE sampling size for calibration (think 16,000 samples) or use a modeling file to activate all experts. Transformers 5.x is more than just Qwen3.5; it's an amalgamation of a bunch of serious changes that will take time for VLLM and LLM_Compressor to fix for/against. That work is still ongoing from those teams. At some point in the future, both will natively support transformers >5.x.
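As a rough illustration of what such a modeling file does (a hedged sketch, not the PR's actual module; all class names and dimensions below are made up), the calibration-time forward runs every expert on the full token batch so each expert's weights are observed by calibration hooks, while the output is still combined with the router's renormalized top-k weights:

```python
import torch
import torch.nn as nn


class CalibMoEBlock(nn.Module):
    """Toy MoE block whose forward activates ALL experts for calibration."""

    def __init__(self, hidden: int, num_experts: int, top_k: int):
        super().__init__()
        self.gate = nn.Linear(hidden, num_experts, bias=False)
        self.experts = nn.ModuleList(
            nn.Linear(hidden, hidden, bias=False) for _ in range(num_experts)
        )
        self.top_k = top_k

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: [tokens, hidden]
        probs = torch.softmax(self.gate(x), dim=-1)           # [T, E]
        topk_vals, topk_idx = probs.topk(self.top_k, dim=-1)  # [T, k]
        weights = torch.zeros_like(probs)
        weights.scatter_(-1, topk_idx,
                         topk_vals / topk_vals.sum(-1, keepdim=True))
        out = torch.zeros_like(x)
        # Key difference from inference: iterate over ALL experts and run
        # each on the full batch. Non-selected experts contribute weight 0,
        # but they still execute, so calibration observers see their inputs.
        for e, expert in enumerate(self.experts):
            out = out + weights[:, e : e + 1] * expert(x)
        return out


block = CalibMoEBlock(hidden=8, num_experts=4, top_k=2)
y = block(torch.randn(3, 8))
print(y.shape)  # torch.Size([3, 8])
```

The output matches dense top-k routing numerically; only the execution pattern (every expert sees every token) changes, which is what gives all experts calibration statistics without a huge sample count.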
FYI - closing in favour of: #2482 |
(#2467)

## Summary
- Add Qwen3.5-27B example for NVFP4A16 quantization (`w4a16_fp4/nvfp4`)
- Add Qwen3.5-27B example for MXFP4A16 quantization (`w4a16_fp4/mxfp4`)

Ignore list includes:
- `lm_head` — output head
- `re:.*visual.*` — vision encoder (Qwen3.5 is a VLM)
- `re:.*linear_attn.*` — Gated DeltaNet fused projections incompatible with microscale formats (ref #2383)
- `re:.*mtp.*` — multi-token prediction modules

> **Note:** Qwen3.5 (`qwen3_5` arch) requires `transformers>=5.x`, which is not yet compatible with llm-compressor. This PR is ready to land once the transformers version bump is completed.

## Test plan
- [x] Verify quantization runs on Qwen3.5-27B with NVFP4A16 (blocked on transformers compat)
- [x] Verify quantization runs on Qwen3.5-27B with MXFP4A16 (blocked on transformers compat)
- [x] Confirm sample generation produces coherent output

---------

Signed-off-by: Ziming <frankziming26@outlook.com>
Co-authored-by: Dipika Sikka <dipikasikka1@gmail.com>
Summary
- `CalibrationQwen3_5MoeSparseMoeBlock` calibration module that unfuses Qwen3.5's 3D fused expert parameters into individual `Qwen3_5MoeMLP` modules with `nn.Linear` layers, enabling NVFP4 quantization of expert weights
- Registered in `modeling/__init__.py`
- Example script for `Qwen/Qwen3.5-397B-A17B`

Details

Qwen3.5 MoE (`Qwen3_5MoeSparseMoeBlock`) stores all expert weights in fused 3D `nn.Parameter` tensors (`gate_up_proj: [num_experts, 2*intermediate, hidden]`, `down_proj: [num_experts, hidden, intermediate]`). The calibration module unfuses these into individual MLP modules so `targets="Linear"` can match and quantize them.

The implementation follows the same pattern as `CalibrateQwen3VLMoeTextSparseMoeBlock` with `is_permanent=True`, and includes `disable_onloading()` for safe CPU access to offloaded parameters on large models.
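A minimal sketch of that unfusing step (illustrative names and shapes only, not the PR's exact API; it assumes the gate and up halves are stacked contiguously along the first weight dimension, which may differ from the actual checkpoint layout):

```python
import torch
import torch.nn as nn

num_experts, hidden, intermediate = 4, 8, 16

# Fused parameters in the layout described above:
#   gate_up_proj: [num_experts, 2*intermediate, hidden]
#   down_proj:    [num_experts, hidden, intermediate]
gate_up_proj = torch.randn(num_experts, 2 * intermediate, hidden)
down_proj = torch.randn(num_experts, hidden, intermediate)

# Unfuse into per-expert modules made of ordinary nn.Linear layers,
# so a quantization recipe with targets="Linear" can match them.
experts = nn.ModuleList()
for e in range(num_experts):
    mlp = nn.Module()
    mlp.gate_proj = nn.Linear(hidden, intermediate, bias=False)
    mlp.up_proj = nn.Linear(hidden, intermediate, bias=False)
    mlp.down_proj = nn.Linear(intermediate, hidden, bias=False)
    # Split the [2*intermediate, hidden] slice into gate and up halves
    # (assumed contiguous split) and copy each expert's weights over.
    mlp.gate_proj.weight.data.copy_(gate_up_proj[e, :intermediate])
    mlp.up_proj.weight.data.copy_(gate_up_proj[e, intermediate:])
    mlp.down_proj.weight.data.copy_(down_proj[e])
    experts.append(mlp)

print(isinstance(experts[0].gate_proj, nn.Linear))  # True
```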