[Feat][quantization] Support new version w4a8 dynamic quantization for Linear layers #3311
Conversation
👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:
If CI fails, you can run linting and testing checks locally according to Contributing and Testing.
Hi maintainers, one CI test is failing, which is unrelated to this PR. Could you please help skip this test? Or do you have any other advice? Thank you.
What this PR does / why we need it?
Problem Description:
The existing implementation for the w4a8-dynamic linear method only supports the old quantization format from msmodelslim. When attempting to load models quantized with the new version, vLLM encounters errors due to mismatched tensor shapes and unprocessed quantization parameters.
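To make the shape mismatch concrete, here is a purely illustrative sketch; the layer dimensions below are made up and do not come from any particular msmodelslim checkpoint:

```python
# Purely illustrative shapes; real checkpoints come from msmodelslim.
import torch

out_features, in_features = 4096, 11008

# Old format: one int8 element per int4 weight value.
old_weight = torch.empty(out_features, in_features, dtype=torch.int8)

# New format: two int4 values packed into each int8 element, so the packed
# dimension is halved and no longer matches what the old loader expects.
new_weight = torch.empty(out_features, in_features // 2, dtype=torch.int8)

assert old_weight.shape != new_weight.shape
```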
Relevant issues:
Proposed Changes:
- Support for the new w4a8-dynamic format (a minimal sketch of this logic follows the list):
  - The new format packs 2x int4 values into an int8 tensor, which therefore has a halved dimension; `packed_dim` and `packed_factor` tell the vLLM loader how to unpack it.
  - Adds the `scale_bias` parameter, setting its shape based on the layer type, as required by msmodelslim. For API consistency and future use, the `layer_type` parameter was also added to other quantization methods.
  - New-format weights are processed with `.view(torch.int32)` since they're pre-packed, while old-format weights are processed with `npu_convert_weight_to_int4pack`.
- New unit and E2E tests.
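The sketch below illustrates the packing metadata and the two post-load paths described above. It is illustrative only: the helper names, the `is_new_format` flag, and the exact `npu_convert_weight_to_int4pack` signature are assumptions, not the PR's actual code.

```python
# Minimal sketch; helper names, the `is_new_format` flag, and the torch_npu
# call signature are assumptions for illustration, not the PR's actual code.
import torch

PACKED_FACTOR = 2  # two int4 values per int8 element in the new msmodelslim format


def create_int4_weight(out_features: int, in_features: int) -> torch.nn.Parameter:
    """Allocate the packed int4 weight and record how the loader should unpack it."""
    weight = torch.nn.Parameter(
        torch.empty(out_features, in_features // PACKED_FACTOR, dtype=torch.int8),
        requires_grad=False,
    )
    # Metadata read by the vLLM weight loader: which dim is packed and by what factor.
    weight.packed_dim = 1
    weight.packed_factor = PACKED_FACTOR
    return weight


def postprocess_weight(weight: torch.Tensor, is_new_format: bool) -> torch.Tensor:
    """Prepare the loaded weight for the NPU int4 matmul kernels."""
    if is_new_format:
        # New-format checkpoints are already packed, so reinterpreting the int8
        # storage as int32 (4 bytes per element) is sufficient.
        return weight.view(torch.int32)
    # Old-format checkpoints still go through the NPU packing op named in the PR
    # (its exact signature here is an assumption).
    import torch_npu
    return torch_npu.npu_convert_weight_to_int4pack(weight)
```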
Does this PR introduce any user-facing change?
no
How was this patch tested?
I implemented relevant unit tests and e2e tests and tested the changes with the following commands:
I also tested Hunyuan-1.8B-Instruct quantized with the new w4a8-dynamic format:
All tests mentioned passed locally.
NOTE: I use a quantized model from my own repo in test_offline_inference_distributed.py; its description (including quantization steps) is at Anionex/Qwen3-1.7B-W4A8-V1. This should be replaced by a model in the vllm-ascend CI ModelScope repo.
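For reference, a rough offline-inference sketch with that test model; the `quantization="ascend"` flag and the ModelScope environment variable follow common vllm-ascend conventions and are assumptions here, not taken from this PR:

```python
# Rough usage sketch; the flag and env var follow common vllm-ascend
# conventions and are assumptions, not part of this PR.
import os
os.environ["VLLM_USE_MODELSCOPE"] = "true"  # if the model is hosted on ModelScope (assumption)

from vllm import LLM, SamplingParams

llm = LLM(model="Anionex/Qwen3-1.7B-W4A8-V1", quantization="ascend")
outputs = llm.generate(["Hello, my name is"], SamplingParams(max_tokens=32))
for out in outputs:
    print(out.outputs[0].text)
```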
Thanks for reading!