[Feat][quantization] Support new version w4a8 dynamic quantization for Linear layers #3311
Conversation
👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:
If CI fails, you can run linting and testing checks locally according to Contributing and Testing.
Hi maintainers, one CI test is failing, which is unrelated to this PR. Could you please help skip this test? Or do you have any other advice? Thank you.
What this PR does / why we need it?
Problem Description:
The existing implementation for the w4a8-dynamic linear method only supports the old quantization format from msmodelslim. When attempting to load models quantized with the new version, vLLM encounters errors due to mismatched tensor shapes and unprocessed quantization parameters.
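To make the shape mismatch concrete, here is a purely illustrative sketch; the layer dimensions below are made up and do not come from any particular msmodelslim checkpoint:

```python
# Purely illustrative shapes; real checkpoints come from msmodelslim.
import torch

out_features, in_features = 4096, 11008

# Old format: one int8 element per int4 weight value.
old_weight = torch.empty(out_features, in_features, dtype=torch.int8)

# New format: two int4 values packed into each int8 element, so the packed
# dimension is halved and no longer matches what the old loader expects.
new_weight = torch.empty(out_features, in_features // 2, dtype=torch.int8)

assert old_weight.shape != new_weight.shape
```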
Relevant issues:
Proposed Changes:
- Support for the new w4a8-dynamic format (a minimal sketch of this logic follows the list):
  - The new format packs 2x int4 values into an int8 tensor, which therefore has a halved dimension; `packed_dim` and `packed_factor` tell the vLLM loader how to unpack it.
  - Adds the `scale_bias` parameter, setting its shape based on the layer type, as required by msmodelslim. For API consistency and future use, the `layer_type` parameter was also added to other quantization methods.
  - New-format weights are processed with `.view(torch.int32)` since they're pre-packed, while old-format weights are processed with `npu_convert_weight_to_int4pack`.
- New unit and E2E tests.
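The sketch below illustrates the packing metadata and the two post-load paths described above. It is illustrative only: the helper names, the `is_new_format` flag, and the exact `npu_convert_weight_to_int4pack` signature are assumptions, not the PR's actual code.

```python
# Minimal sketch; helper names, the `is_new_format` flag, and the torch_npu
# call signature are assumptions for illustration, not the PR's actual code.
import torch

PACKED_FACTOR = 2  # two int4 values per int8 element in the new msmodelslim format


def create_int4_weight(out_features: int, in_features: int) -> torch.nn.Parameter:
    """Allocate the packed int4 weight and record how the loader should unpack it."""
    weight = torch.nn.Parameter(
        torch.empty(out_features, in_features // PACKED_FACTOR, dtype=torch.int8),
        requires_grad=False,
    )
    # Metadata read by the vLLM weight loader: which dim is packed and by what factor.
    weight.packed_dim = 1
    weight.packed_factor = PACKED_FACTOR
    return weight


def postprocess_weight(weight: torch.Tensor, is_new_format: bool) -> torch.Tensor:
    """Prepare the loaded weight for the NPU int4 matmul kernels."""
    if is_new_format:
        # New-format checkpoints are already packed, so reinterpreting the int8
        # storage as int32 (4 bytes per element) is sufficient.
        return weight.view(torch.int32)
    # Old-format checkpoints still go through the NPU packing op named in the PR
    # (its exact signature here is an assumption).
    import torch_npu
    return torch_npu.npu_convert_weight_to_int4pack(weight)
```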
Does this PR introduce any user-facing change?
no
How was this patch tested?
I implemented relevant unit tests and e2e tests and tested the changes with the following commands:
I also tested Hunyuan-1.8B-Instruct quantized with the new w4a8-dynamic format:
All tests mentioned passed locally.
NOTE: I use a quantized model from my own repo in test_offline_inference_distributed.py; its description (including quantization steps) is at Anionex/Qwen3-1.7B-W4A8-V1. This should be replaced by a model in the vllm-ascend CI ModelScope repo.
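For reference, a rough offline-inference sketch with that test model; the `quantization="ascend"` flag and the ModelScope environment variable follow common vllm-ascend conventions and are assumptions here, not taken from this PR:

```python
# Rough usage sketch; the flag and env var follow common vllm-ascend
# conventions and are assumptions, not part of this PR.
import os
os.environ["VLLM_USE_MODELSCOPE"] = "true"  # if the model is hosted on ModelScope (assumption)

from vllm import LLM, SamplingParams

llm = LLM(model="Anionex/Qwen3-1.7B-W4A8-V1", quantization="ascend")
outputs = llm.generate(["Hello, my name is"], SamplingParams(max_tokens=32))
for out in outputs:
    print(out.outputs[0].text)
```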
Thanks for reading!