@Anionex Anionex commented Oct 3, 2025

What this PR does / why we need it?

Problem Description:

The existing implementation of the w4a8-dynamic linear method only supports the old quantization format from msmodelslim. When attempting to load models quantized with the new msmodelslim version, vLLM fails with errors caused by mismatched tensor shapes and unprocessed quantization parameters.

Relevant issues:

Proposed Changes:

  1. Add support for w4a8 dynamic (new format) in AscendW4A8DynamicLinearMethod and TorchairAscendW4A8DynamicLinearMethod
  2. Add unit tests and e2e tests for both the new- and old-format w4a8 dynamic models
Details:
  1. Support for new w4a8-dynamic format:

    • Detects the quantization format by reading the "version" field in quant_description, preserving backward compatibility with the old format.
    • Handles the new pre-packed weight format (two int4 values packed into one int8, so the stored dimension is halved) and tells the vLLM loader how to unpack it via _packed_dim and _packed_factor.
    • Supports the new scale_bias parameter, setting its shape based on the layer type as required by msmodelslim. For API consistency and future use, the layer_type parameter was also added to the other quantization methods.
    • Updates the weight processing logic: new-format weights are already packed and only need .view(torch.int32), while old-format weights are processed with npu_convert_weight_to_int4pack (a simplified sketch follows below).
  2. New unit and E2E tests:

    • Added unit tests that verify the logic for both the old and new formats.
    • Split the distributed E2E test to confirm that both old and new format models work correctly.
In principle, these changes provide support for all common new-version w4a8 (dynamic) models exported by msmodelslim.
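
To make the detection and unpacking behaviour above concrete, here is a minimal, simplified sketch. It is not the actual vllm-ascend code: helper names such as is_new_w4a8_format, create_weight, and process_weight are hypothetical, and the real AscendW4A8DynamicLinearMethod works through vLLM's parameter and weight-loader machinery rather than standalone functions.

import torch


def is_new_w4a8_format(quant_description: dict) -> bool:
    # Assumption for illustration: new msmodelslim exports carry a "version"
    # field in quant_description, while old exports do not; absence of the
    # field falls back to the old behaviour, keeping backward compatibility.
    return "version" in quant_description


def create_weight(output_size: int, input_size: int, new_format: bool) -> torch.nn.Parameter:
    if new_format:
        # New format: two int4 values are pre-packed into one int8, so the
        # stored input dimension is halved.
        weight = torch.nn.Parameter(
            torch.empty(output_size, input_size // 2, dtype=torch.int8),
            requires_grad=False,
        )
        # _packed_dim / _packed_factor tell the vLLM loader how to unpack
        # checkpoint shards along the packed (input) dimension; set directly
        # here for illustration only.
        weight._packed_dim = 1
        weight._packed_factor = 2
        return weight
    # Old format: plain int8 weight with the full input dimension.
    return torch.nn.Parameter(
        torch.empty(output_size, input_size, dtype=torch.int8),
        requires_grad=False,
    )


def process_weight(weight: torch.nn.Parameter, new_format: bool) -> torch.Tensor:
    if new_format:
        # New-format weights are already int4-packed, so reinterpreting the
        # int8 storage as int32 is enough (4 int8 bytes hold 8 int4 values).
        return weight.data.view(torch.int32)
    # Old-format weights would instead be packed on an Ascend device via
    # torch_npu.npu_convert_weight_to_int4pack(...); omitted here because it
    # requires NPU hardware.
    return weight.data

The new scale_bias parameter follows the same pattern: when the new format is detected, an extra per-layer parameter is created whose shape depends on the layer_type passed in, which is why layer_type was also threaded through the other quantization methods.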

Does this PR introduce any user-facing change?

no

How was this patch tested?

I implemented the relevant unit tests and e2e tests and verified the changes with the following commands:

# unit tests
python -m pytest tests/ut/quantization/test_w4a8_dynamic.py tests/ut/torchair/quantization/test_torchair_w4a8_dynamic.py -v

# e2e tests
pytest tests/e2e/singlecard/test_quantization.py -v -s

pytest tests/e2e/multicard/test_offline_inference_distributed.py::test_models_distributed_Qwen3_W4A8DYNAMIC_new_version -v -s
pytest tests/e2e/multicard/test_offline_inference_distributed.py::test_models_distributed_Qwen3_W4A8DYNAMIC_old_version -v -s
pytest tests/e2e/multicard/test_offline_inference_distributed.py::test_models_distributed_DeepSeek_W4A8DYNAMIC -v -s

I also tested Hunyuan-1.8B-Instruct quantized with the new w4a8-dynamic format:

vllm serve ./models/Hunyuan-1.8B-Instruct-quantized --gpu-memory-utilization 0.96 --quantization ascend --max-model-len 9600 --seed 0 --max-num-batched-tokens 16384 

All tests mentioned passed locally.

NOTE: I use a quantization model from my own repo in test_offline_inference_distributed.py: Anionex/Qwen3-1.7B-W4A8-V1 (its model card describes the quantization steps). This should be replaced by a model in the vllm-ascend CI ModelScope repo.

Thanks for reading!


github-actions bot commented Oct 3, 2025

👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:

  • A PR should do only one thing, smaller PRs enable faster reviews.
  • Every PR should include unit tests and end-to-end tests to ensure it works and is not broken by other future PRs.
  • Write the commit message by fulfilling the PR description, to help reviewers and future developers understand.

If CI fails, you can run the linting and testing checks locally according to Contributing and Testing.

@Anionex Anionex force-pushed the w4a8_dynamic_linear_method branch from 55a7718 to c2c50fc on October 5, 2025 15:35
@Anionex Anionex force-pushed the w4a8_dynamic_linear_method branch from c2c50fc to 9e87cc3 on October 5, 2025 15:44
@Anionex Anionex force-pushed the w4a8_dynamic_linear_method branch from 9e87cc3 to 61605d4 on October 5, 2025 15:55
@Anionex Anionex changed the title [Bugfix][Kernel] Implement new w4a8 dynamic quantization for LinearMethod [Feat][quantization] Support new version w4a8 dynamic quantization for Linear layers Oct 5, 2025
@Anionex Anionex marked this pull request as ready for review October 5, 2025 17:37
@Yikun Yikun added the ready (read for review) and ready-for-test (start test by label for PR) labels Oct 6, 2025
@Anionex Anionex force-pushed the w4a8_dynamic_linear_method branch from 26bf542 to 2f11331 on October 6, 2025 07:35
@Anionex Anionex (Author) commented Oct 6, 2025

Hi maintainers,
It seems that the failing CI runs in ascend test / full / e2e-full (releases/v0.11.0) all report the same error:

TypeError: CustomDeepseekV2DecoderLayer.__init__() takes 3 positional arguments but 4 were given

which is unrelated to this PR.
(I noticed that https://github.com/vllm-project/vllm-ascend/actions/runs/18274504364/job/52023422660 also hit this error.)

Could you please help skip this test, or do you have any other advice? Thank you!
cc: @Yikun
