[Qwen3VLMoe] Add linearized definition and FP8 Quantization Example#1874
Conversation
👋 Hi! Thank you for contributing to llm-compressor. Please add the ready label when the PR is ready for review. Note: this is required to complete the testing suite; please only add the label once the PR is code complete and local testing has been performed.
Summary of Changes: Hello @dsikka, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed: this pull request extends the model compression framework by adding support for linearizing the Qwen3VLMoe model's text sparse Mixture-of-Experts blocks, allowing specialized handling and potential optimization of this model architecture.
Code Review
This pull request aims to add a linearized module definition for Qwen3VLMoe, likely to aid in model calibration. While the intent is clear, the implementation has a few critical issues that will prevent it from working correctly. There's an incorrect import path for the new module, and the new module itself is incomplete, with an incorrect function signature for its replacement function and a missing forward method in the main module class. My review includes specific suggestions to fix these issues.
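The replace-function pattern the review refers to can be sketched roughly as follows. This is a minimal illustration of the idea (a replacement function taking the model config and the original module, and returning a linearized stand-in that exposes every expert for calibration); all class and function names here are hypothetical, not the actual llm-compressor or Qwen3VLMoe code:

```python
# Hypothetical sketch of a linearized MoE replacement. A replacement function
# takes (config, original_module) and returns a new module that reuses the
# original experts so weights are shared rather than copied.

class OriginalSparseMoeBlock:
    """Stand-in for the original sparse MoE block."""
    def __init__(self, num_experts):
        self.num_experts = num_experts
        self.experts = [f"expert_{i}" for i in range(num_experts)]

class LinearizedMoeBlock:
    """Linearized stand-in that exposes each expert as a plain submodule."""
    def __init__(self, config, original):
        self.num_experts = config["num_experts"]
        # Reuse the original experts so weights are shared, not copied.
        self.experts = original.experts

    def forward(self, hidden_states):
        # A real implementation would route tokens through a gate; here we
        # only show that every expert is reachable during calibration.
        return [(expert, hidden_states) for expert in self.experts]

def replace(config, module):
    """Replacement signature as the review suggests: (config, module) -> module."""
    return LinearizedMoeBlock(config, module)

config = {"num_experts": 4}
original = OriginalSparseMoeBlock(config["num_experts"])
linearized = replace(config, original)
print(len(linearized.experts))  # -> 4
```

With a missing `forward`, as the review notes, the linearized block could be swapped in but would fail at calibration time when it is actually called.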
kylesayrs
left a comment
Can you add a test in this style? https://github.com/vllm-project/llm-compressor/blob/main/tests/llmcompressor/modeling/test_calib_qwen3.py
No, because I haven't written the forward pass yet to validate that all experts are calibrated.
kylesayrs
left a comment
Let's add a test in a follow-up.
brian-dellabetta
left a comment
LGTM! I think it's fair to say users will only hit the Qwen3VLMoe import now if they are using the model, so we don't need to wrap the import in a try/catch.
…llm-project#1874)

SUMMARY:
- Updates the MoE layer to use a linearized definition so that the model can be quantized and run in vLLM
- Wraps the gate layer so that it is properly ignored. This is a hack for now; we will need to do this properly in ct
- Does not add a forward pass yet; that will come as a follow-up, but this change is wanted in the release to enable FP8 quantization
- Note: requires the latest transformers

TEST PLAN:
Produces `/proving-grounds/engine/hub_cache/Qwen3-VL-235B-A22B-Instruct-FP8_DYNAMIC`, which generates coherent output:

```python
if __name__ == "__main__":
    from vllm import LLM, SamplingParams

    prompts = [
        "The Swiss Alps are",
        "Brad Marchand is",
        "The Toronto Maple Leafs are",
    ]

    # Temperature sampling (not greedy) with a bounded output length
    sampling_params = SamplingParams(
        temperature=0.80, top_p=0.95, max_tokens=40, min_tokens=10
    )

    llm = LLM(
        "/proving-grounds/engine/hub_cache/Qwen3-VL-235B-A22B-Instruct-FP8_DYNAMIC",
        tensor_parallel_size=2,
        max_model_len=4096,
        enforce_eager=True,
    )

    output = llm.generate(prompts, sampling_params)
    for out in output:
        print(out.outputs[0].text)
```

Generations:

```bash
a true paradise for nature lovers and outdoor enthusiasts. With their snow-capped peaks, lush green valleys, and crystal-clear lakes, the Alps offer a stunning backdrop for a wide range of activities. Whether
a prominent figure in the NHL, known for his exceptional performance and leadership. He has won the Art Ross Trophy as the NHL's leading scorer, with 110 points (32 goals and
a professional ice hockey team based in Toronto, Ontario, Canada. They are members of the Atlantic Division in the Eastern Conference of the National Hockey League (NHL). The team was established in 1
```

Signed-off-by: Cassie Jeon <cajeon@redhat.com>
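For readers unfamiliar with the FP8_DYNAMIC scheme the checkpoint name refers to: per-tensor dynamic quantization computes an activation scale on the fly from each tensor's absolute maximum, mapping values into the FP8 E4M3 range. The sketch below is plain-Python illustration of that scaling math only, not the llm-compressor implementation (it models range clamping but not FP8 mantissa rounding):

```python
# Conceptual sketch of per-tensor dynamic FP8 scaling. Illustration only;
# real FP8 quantization also rounds to the E4M3 mantissa grid.

FP8_E4M3_MAX = 448.0  # largest finite value representable in FP8 E4M3

def dynamic_fp8_scale(values):
    """Per-tensor dynamic scale: map the observed absmax onto the FP8 range."""
    absmax = max(abs(v) for v in values)
    return absmax / FP8_E4M3_MAX if absmax > 0 else 1.0

def quant_dequant(values):
    """Simulate quantize -> dequantize with the dynamically computed scale."""
    scale = dynamic_fp8_scale(values)
    # Divide into the FP8 range, clamp to representable bounds, scale back.
    return [
        max(-FP8_E4M3_MAX, min(FP8_E4M3_MAX, v / scale)) * scale
        for v in values
    ]

acts = [0.5, -3.2, 100.0, -448.0]
print(quant_dequant(acts))  # -> [0.5, -3.2, 100.0, -448.0] (scale is exactly 1.0 here)
```

Because the scale is recomputed per tensor at runtime, no activation calibration data is needed, which is why the forward pass can be deferred to a follow-up while still shipping FP8 quantization.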
SUMMARY:
- Need to update links when the following PRs land:
  1. vllm-project#1886
  2. vllm-project#1874
  3. vllm-project#1889

Signed-off-by: Cassie Jeon <cajeon@redhat.com>