
[Bug]: LoRA + Ascend Quantization Fails on Qwen3-32B-W8A8 with AscendRMSNorm AttributeError #4308

@scott-sensimintel

Description

Your current environment

The output of `python collect_env.py`:

PyTorch version: 2.7.1+cpu
Is debug build: False

OS: openEuler 22.03 (LTS-SP4) (aarch64)
GCC version: (GCC) 10.3.1
Clang version: Could not collect
CMake version: version 4.1.2
Libc version: glibc-2.34

Python version: 3.10.17 (main, May 8 2025, 08:13:48) [GCC 10.3.1] (64-bit runtime)
Python platform: Linux-6.8.0-31-generic-aarch64-with-glibc2.34

CPU:
Architecture: aarch64
CPU op-mode(s): 64-bit
Byte Order: Little Endian
CPU(s): 192
On-line CPU(s) list: 0-191
Vendor ID: HiSilicon
Model name: Kunpeng-920
Model: 0
Thread(s) per core: 1
Core(s) per cluster: 48
Socket(s): -
Cluster(s): 4
Stepping: 0x1
BogoMIPS: 200.00
Flags: fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm jscvt fcma dcpop asimddp asimdfhm ssbs
L1d cache: 12 MiB (192 instances)
L1i cache: 12 MiB (192 instances)
L2 cache: 96 MiB (192 instances)
L3 cache: 192 MiB (8 instances)
NUMA node(s): 8
NUMA node0 CPU(s): 0-23
NUMA node1 CPU(s): 24-47
NUMA node2 CPU(s): 48-71
NUMA node3 CPU(s): 72-95
NUMA node4 CPU(s): 96-119
NUMA node5 CPU(s): 120-143
NUMA node6 CPU(s): 144-167
NUMA node7 CPU(s): 168-191
Vulnerability Gather data sampling: Not affected
Vulnerability Itlb multihit: Not affected
Vulnerability L1tf: Not affected
Vulnerability Mds: Not affected
Vulnerability Meltdown: Not affected
Vulnerability Mmio stale data: Not affected
Vulnerability Reg file data sampling: Not affected
Vulnerability Retbleed: Not affected
Vulnerability Spec rstack overflow: Not affected
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl
Vulnerability Spectre v1: Mitigation; __user pointer sanitization
Vulnerability Spectre v2: Not affected
Vulnerability Srbds: Not affected
Vulnerability Tsx async abort: Not affected

Versions of relevant libraries:
[pip3] numpy==1.26.4
[pip3] pyzmq==27.1.0
[pip3] torch==2.7.1+cpu
[pip3] torch_npu==2.7.1.dev20250724
[pip3] torchvision==0.22.1
[pip3] transformers==4.57.1
[conda] Could not collect
vLLM Version: 0.11.0
vLLM Ascend Version: 0.11.0rc0

ENV Variables:
ATB_OPSRUNNER_KERNEL_CACHE_TILING_SIZE=10240
ATB_OPSRUNNER_KERNEL_CACHE_LOCAL_COUNT=1
ATB_STREAM_SYNC_EVERY_RUNNER_ENABLE=0
ATB_OPSRUNNER_SETUP_CACHE_ENABLE=1
ATB_WORKSPACE_MEM_ALLOC_GLOBAL=0
ATB_DEVICE_TILING_BUFFER_BLOCK_NUM=32
ATB_STREAM_SYNC_EVERY_KERNEL_ENABLE=0
ATB_OPSRUNNER_KERNEL_CACHE_GLOABL_COUNT=5
ATB_HOME_PATH=/usr/local/Ascend/nnal/atb/latest/atb/cxx_abi_1
ASCEND_TOOLKIT_HOME=/usr/local/Ascend/ascend-toolkit/latest
ATB_COMPARE_TILING_EVERY_KERNEL=0
ASCEND_OPP_PATH=/usr/local/Ascend/ascend-toolkit/latest/opp
LD_LIBRARY_PATH=/usr/local/Ascend/nnal/atb/latest/atb/cxx_abi_1/lib:/usr/local/Ascend/nnal/atb/latest/atb/cxx_abi_1/examples:/usr/local/Ascend/nnal/atb/latest/atb/cxx_abi_1/tests/atbopstest:/usr/local/Ascend/ascend-toolkit/latest/tools/aml/lib64:/usr/local/Ascend/ascend-toolkit/latest/tools/aml/lib64/plugin:/usr/local/Ascend/ascend-toolkit/latest/lib64:/usr/local/Ascend/ascend-toolkit/latest/lib64/plugin/opskernel:/usr/local/Ascend/ascend-toolkit/latest/lib64/plugin/nnengine:/usr/local/Ascend/ascend-toolkit/latest/opp/built-in/op_impl/ai_core/tbe/op_tiling/lib/linux/aarch64:/usr/local/Ascend/nnal/atb/latest/atb/cxx_abi_0/lib:/usr/local/Ascend/nnal/atb/latest/atb/cxx_abi_0/examples:/usr/local/Ascend/nnal/atb/latest/atb/cxx_abi_0/tests/atbopstest:/usr/local/Ascend/ascend-toolkit/latest/tools/aml/lib64:/usr/local/Ascend/ascend-toolkit/latest/tools/aml/lib64/plugin:/usr/local/Ascend/ascend-toolkit/latest/lib64:/usr/local/Ascend/ascend-toolkit/latest/lib64/plugin/opskernel:/usr/local/Ascend/ascend-toolkit/latest/lib64/plugin/nnengine:/usr/local/Ascend/ascend-toolkit/latest/opp/built-in/op_impl/ai_core/tbe/op_tiling:/usr/local/Ascend/driver/lib64/common/:/usr/local/Ascend/driver/lib64/driver/:
ASCEND_AICPU_PATH=/usr/local/Ascend/ascend-toolkit/latest
ATB_OPSRUNNER_KERNEL_CACHE_TYPE=3
ATB_RUNNER_POOL_SIZE=64
ATB_STREAM_SYNC_EVERY_OPERATION_ENABLE=0
ASCEND_HOME_PATH=/usr/local/Ascend/ascend-toolkit/latest
ATB_MATMUL_SHUFFLE_K_ENABLE=1
ATB_LAUNCH_KERNEL_WITH_TILING=1
ATB_WORKSPACE_MEM_ALLOC_ALG_TYPE=1
ATB_HOST_TILING_BUFFER_BLOCK_NUM=128
ATB_SHARE_MEMORY_NAME_SUFFIX=
TORCH_DEVICE_BACKEND_AUTOLOAD=1
PYTORCH_NVML_BASED_CUDA_CHECK=1
TORCHINDUCTOR_COMPILE_THREADS=1

NPU:
+------------------------------------------------------------------------------------------------+
| npu-smi 25.0.rc1.1 Version: 25.0.rc1.1 |
+---------------------------+---------------+----------------------------------------------------+
| NPU Name | Health | Power(W) Temp(C) Hugepages-Usage(page)|
| Chip | Bus-Id | AICore(%) Memory-Usage(MB) HBM-Usage(MB) |
+===========================+===============+====================================================+
| 0 910B4-1 | OK | 103.1 40 0 / 0 |
| 0 | 0000:C1:00.0 | 0 0 / 0 3386 / 65536 |
+===========================+===============+====================================================+
| 1 910B4-1 | OK | 92.9 37 0 / 0 |
| 0 | 0000:C2:00.0 | 0 0 / 0 3379 / 65536 |
+===========================+===============+====================================================+
| 2 910B4-1 | OK | 90.4 37 0 / 0 |
| 0 | 0000:81:00.0 | 0 0 / 0 3380 / 65536 |
+===========================+===============+====================================================+
| 3 910B4-1 | OK | 90.5 37 0 / 0 |
| 0 | 0000:82:00.0 | 0 0 / 0 3379 / 65536 |
+===========================+===============+====================================================+
+---------------------------+---------------+----------------------------------------------------+
| NPU Chip | Process id | Process name | Process memory(MB) |
+===========================+===============+====================================================+
| No running processes found in NPU 0 |
+===========================+===============+====================================================+
| No running processes found in NPU 1 |
+===========================+===============+====================================================+
| No running processes found in NPU 2 |
+===========================+===============+====================================================+
| No running processes found in NPU 3 |
+===========================+===============+====================================================+

CANN:
package_name=Ascend-cann-toolkit
version=8.1.RC1
innerversion=V100R001C21SPC001B238
compatible_version=[V100R001C15],[V100R001C18],[V100R001C19],[V100R001C20],[V100R001C21]
arch=aarch64
os=linux
path=/usr/local/Ascend/ascend-toolkit/8.1.RC1/aarch64-linux

πŸ› Describe the bug

Summary

When serving Qwen3-32B-w8a8 with vLLM on Ascend, enabling LoRA together with the Ascend quantization backend crashes during engine startup with an AttributeError raised from AscendRMSNorm.

The issue only appears when combining all of the following:

  • a W8A8-quantized base model,
  • --enable-lora with --lora-modules ..., and
  • --quantization "ascend"

Both FP16 + LoRA and W8A8 without LoRA work correctly.


Commands and Behaviors

❌ Failing command

vllm serve /root/data/Qwen/Qwen3-32B-w8a8 \
  --tensor_parallel_size=4 \
  --enable-lora \
  --lora-modules icd_model=./all_adaptor \
  --quantization "ascend" \
  --port 8000

Error:

INFO 11-20 08:15:07 [parallel_state.py:1208] rank 0 in world size 4 is assigned as DP rank 0, PP rank 0, TP rank 0, EP rank 0
INFO 11-20 08:15:07 [parallel_state.py:1208] rank 1 in world size 4 is assigned as DP rank 0, PP rank 0, TP rank 1, EP rank 1
INFO 11-20 08:15:07 [parallel_state.py:1208] rank 2 in world size 4 is assigned as DP rank 0, PP rank 0, TP rank 2, EP rank 2
INFO 11-20 08:15:07 [parallel_state.py:1208] rank 3 in world size 4 is assigned as DP rank 0, PP rank 0, TP rank 3, EP rank 3
(Worker_TP0 pid=11657) INFO 11-20 08:15:29 [model_runner_v1.py:2627] Starting to load model /root/data/Qwen/Qwen3-32B-w8a8...
(Worker_TP0 pid=11657) INFO 11-20 08:15:29 [utils.py:60] Using the vLLM Ascend Quantization now!
Loading safetensors checkpoint shards:   0% Completed | 0/1 [00:00<?, ?it/s]
(Worker_TP3 pid=11916) INFO 11-20 08:15:35 [model_runner_v1.py:2627] Starting to load model /root/data/Qwen/Qwen3-32B-w8a8...
(Worker_TP3 pid=11916) INFO 11-20 08:15:35 [utils.py:60] Using the vLLM Ascend Quantization now!
(Worker_TP2 pid=11680) INFO 11-20 08:15:36 [model_runner_v1.py:2627] Starting to load model /root/data/Qwen/Qwen3-32B-w8a8...
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:07<00:00,  7.02s/it]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:07<00:00,  7.02s/it]
(Worker_TP0 pid=11657) 
(Worker_TP2 pid=11680) INFO 11-20 08:15:38 [utils.py:60] Using the vLLM Ascend Quantization now!
(Worker_TP0 pid=11657) INFO 11-20 08:15:38 [default_loader.py:267] Loading weights took 7.89 seconds
(Worker_TP0 pid=11657) INFO 11-20 08:15:38 [punica_selector.py:19] Using PunicaWrapperNPU.
(Worker_TP1 pid=11661) INFO 11-20 08:15:39 [model_runner_v1.py:2627] Starting to load model /root/data/Qwen/Qwen3-32B-w8a8...
(Worker_TP0 pid=11657) INFO 11-20 08:15:39 [model_runner_v1.py:2661] Loading model weights took 10.0903 GB
(Worker_TP1 pid=11661) INFO 11-20 08:15:40 [utils.py:60] Using the vLLM Ascend Quantization now!
(Worker_TP3 pid=11916) INFO 11-20 08:15:44 [default_loader.py:267] Loading weights took 8.23 seconds
(Worker_TP3 pid=11916) INFO 11-20 08:15:45 [punica_selector.py:19] Using PunicaWrapperNPU.
(Worker_TP3 pid=11916) INFO 11-20 08:15:46 [model_runner_v1.py:2661] Loading model weights took 10.0903 GB
(Worker_TP2 pid=11680) INFO 11-20 08:15:48 [default_loader.py:267] Loading weights took 10.01 seconds
(Worker_TP2 pid=11680) INFO 11-20 08:15:48 [punica_selector.py:19] Using PunicaWrapperNPU.
(Worker_TP2 pid=11680) INFO 11-20 08:15:50 [model_runner_v1.py:2661] Loading model weights took 10.0903 GB
(Worker_TP1 pid=11661) INFO 11-20 08:15:50 [default_loader.py:267] Loading weights took 10.28 seconds
(Worker_TP1 pid=11661) INFO 11-20 08:15:51 [punica_selector.py:19] Using PunicaWrapperNPU.
(Worker_TP1 pid=11661) INFO 11-20 08:15:52 [model_runner_v1.py:2661] Loading model weights took 10.0903 GB
(Worker_TP0 pid=11657) ERROR 11-20 08:15:55 [multiproc_executor.py:671] WorkerProc hit an exception.
(Worker_TP0 pid=11657) ERROR 11-20 08:15:55 [multiproc_executor.py:671] Traceback (most recent call last):
(Worker_TP0 pid=11657) ERROR 11-20 08:15:55 [multiproc_executor.py:671]   File "/vllm-workspace/vllm/vllm/v1/executor/multiproc_executor.py", line 666, in worker_busy_loop
(Worker_TP0 pid=11657) ERROR 11-20 08:15:55 [multiproc_executor.py:671]     output = func(*args, **kwargs)
(Worker_TP0 pid=11657) ERROR 11-20 08:15:55 [multiproc_executor.py:671]   File "/vllm-workspace/vllm-ascend/vllm_ascend/worker/worker_v1.py", line 205, in determine_available_memory
(Worker_TP0 pid=11657) ERROR 11-20 08:15:55 [multiproc_executor.py:671]     self.model_runner.profile_run()
(Worker_TP0 pid=11657) ERROR 11-20 08:15:55 [multiproc_executor.py:671]   File "/vllm-workspace/vllm-ascend/vllm_ascend/worker/model_runner_v1.py", line 2509, in profile_run
(Worker_TP0 pid=11657) ERROR 11-20 08:15:55 [multiproc_executor.py:671]     hidden_states = self._dummy_run(self.max_num_tokens,
(Worker_TP0 pid=11657) ERROR 11-20 08:15:55 [multiproc_executor.py:671]   File "/usr/local/python3.10.17/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
(Worker_TP0 pid=11657) ERROR 11-20 08:15:55 [multiproc_executor.py:671]     return func(*args, **kwargs)
(Worker_TP0 pid=11657) ERROR 11-20 08:15:55 [multiproc_executor.py:671]   File "/vllm-workspace/vllm-ascend/vllm_ascend/worker/model_runner_v1.py", line 2475, in _dummy_run
(Worker_TP0 pid=11657) ERROR 11-20 08:15:55 [multiproc_executor.py:671]     hidden_states = self._generate_dummy_run_hidden_states(
(Worker_TP0 pid=11657) ERROR 11-20 08:15:55 [multiproc_executor.py:671]   File "/vllm-workspace/vllm-ascend/vllm_ascend/worker/model_runner_v1.py", line 2320, in _generate_dummy_run_hidden_states
(Worker_TP0 pid=11657) ERROR 11-20 08:15:55 [multiproc_executor.py:671]     hidden_states = self.model(input_ids=input_ids,
(Worker_TP0 pid=11657) ERROR 11-20 08:15:55 [multiproc_executor.py:671]   File "/usr/local/python3.10.17/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl
(Worker_TP0 pid=11657) ERROR 11-20 08:15:55 [multiproc_executor.py:671]     return self._call_impl(*args, **kwargs)
(Worker_TP0 pid=11657) ERROR 11-20 08:15:55 [multiproc_executor.py:671]   File "/usr/local/python3.10.17/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1762, in _call_impl
(Worker_TP0 pid=11657) ERROR 11-20 08:15:55 [multiproc_executor.py:671]     return forward_call(*args, **kwargs)
(Worker_TP0 pid=11657) ERROR 11-20 08:15:55 [multiproc_executor.py:671]   File "/vllm-workspace/vllm/vllm/model_executor/models/qwen3.py", line 323, in forward
(Worker_TP0 pid=11657) ERROR 11-20 08:15:55 [multiproc_executor.py:671]     hidden_states = self.model(input_ids, positions, intermediate_tensors,
(Worker_TP0 pid=11657) ERROR 11-20 08:15:55 [multiproc_executor.py:671]   File "/vllm-workspace/vllm/vllm/compilation/decorators.py", line 310, in __call__
(Worker_TP0 pid=11657) ERROR 11-20 08:15:55 [multiproc_executor.py:671]     output = self.compiled_callable(*args, **kwargs)
(Worker_TP0 pid=11657) ERROR 11-20 08:15:55 [multiproc_executor.py:671]   File "/vllm-workspace/vllm/vllm/model_executor/models/qwen2.py", line 364, in forward
(Worker_TP0 pid=11657) ERROR 11-20 08:15:55 [multiproc_executor.py:671]     hidden_states, residual = layer(positions, hidden_states, residual)
(Worker_TP0 pid=11657) ERROR 11-20 08:15:55 [multiproc_executor.py:671]   File "/usr/local/python3.10.17/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl
(Worker_TP0 pid=11657) ERROR 11-20 08:15:55 [multiproc_executor.py:671]     return self._call_impl(*args, **kwargs)
(Worker_TP0 pid=11657) ERROR 11-20 08:15:55 [multiproc_executor.py:671]   File "/usr/local/python3.10.17/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1762, in _call_impl
(Worker_TP0 pid=11657) ERROR 11-20 08:15:55 [multiproc_executor.py:671]     return forward_call(*args, **kwargs)
(Worker_TP0 pid=11657) ERROR 11-20 08:15:55 [multiproc_executor.py:671]   File "/vllm-workspace/vllm/vllm/model_executor/models/qwen3.py", line 235, in forward
(Worker_TP0 pid=11657) ERROR 11-20 08:15:55 [multiproc_executor.py:671]     hidden_states, residual = self.post_attention_layernorm(
(Worker_TP0 pid=11657) ERROR 11-20 08:15:55 [multiproc_executor.py:671]   File "/usr/local/python3.10.17/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl
(Worker_TP0 pid=11657) ERROR 11-20 08:15:55 [multiproc_executor.py:671]     return self._call_impl(*args, **kwargs)
(Worker_TP0 pid=11657) ERROR 11-20 08:15:55 [multiproc_executor.py:671]   File "/usr/local/python3.10.17/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1762, in _call_impl
(Worker_TP0 pid=11657) ERROR 11-20 08:15:55 [multiproc_executor.py:671]     return forward_call(*args, **kwargs)
(Worker_TP0 pid=11657) ERROR 11-20 08:15:55 [multiproc_executor.py:671]   File "/vllm-workspace/vllm/vllm/model_executor/custom_op.py", line 44, in forward
(Worker_TP0 pid=11657) ERROR 11-20 08:15:55 [multiproc_executor.py:671]     return self._forward_method(*args, **kwargs)
(Worker_TP0 pid=11657) ERROR 11-20 08:15:55 [multiproc_executor.py:671]   File "/vllm-workspace/vllm-ascend/vllm_ascend/ops/layernorm.py", line 70, in forward_oot
(Worker_TP0 pid=11657) ERROR 11-20 08:15:55 [multiproc_executor.py:671]     self, x, residual, self.next_need_quant_fusion_linear)
(Worker_TP0 pid=11657) ERROR 11-20 08:15:55 [multiproc_executor.py:671]   File "/usr/local/python3.10.17/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1940, in __getattr__
(Worker_TP0 pid=11657) ERROR 11-20 08:15:55 [multiproc_executor.py:671]     raise AttributeError(
(Worker_TP0 pid=11657) ERROR 11-20 08:15:55 [multiproc_executor.py:671] AttributeError: 'AscendRMSNorm' object has no attribute 'next_need_quant_fusion_linear'
(EngineCore_DP0 pid=11521) ERROR 11-20 08:15:55 [core.py:708] EngineCore failed to start.
(EngineCore_DP0 pid=11521) ERROR 11-20 08:15:55 [core.py:708] Traceback (most recent call last):
(EngineCore_DP0 pid=11521) ERROR 11-20 08:15:55 [core.py:708]   File "/vllm-workspace/vllm/vllm/v1/engine/core.py", line 699, in run_engine_core
(EngineCore_DP0 pid=11521) ERROR 11-20 08:15:55 [core.py:708]     engine_core = EngineCoreProc(*args, **kwargs)
(EngineCore_DP0 pid=11521) ERROR 11-20 08:15:55 [core.py:708]   File "/vllm-workspace/vllm/vllm/v1/engine/core.py", line 498, in __init__
(EngineCore_DP0 pid=11521) ERROR 11-20 08:15:55 [core.py:708]     super().__init__(vllm_config, executor_class, log_stats,
(EngineCore_DP0 pid=11521) ERROR 11-20 08:15:55 [core.py:708]   File "/vllm-workspace/vllm/vllm/v1/engine/core.py", line 92, in __init__
(EngineCore_DP0 pid=11521) ERROR 11-20 08:15:55 [core.py:708]     self._initialize_kv_caches(vllm_config)
(EngineCore_DP0 pid=11521) ERROR 11-20 08:15:55 [core.py:708]   File "/vllm-workspace/vllm/vllm/v1/engine/core.py", line 190, in _initialize_kv_caches
(EngineCore_DP0 pid=11521) ERROR 11-20 08:15:55 [core.py:708]     self.model_executor.determine_available_memory())
(EngineCore_DP0 pid=11521) ERROR 11-20 08:15:55 [core.py:708]   File "/vllm-workspace/vllm/vllm/v1/executor/abstract.py", line 85, in determine_available_memory
(EngineCore_DP0 pid=11521) ERROR 11-20 08:15:55 [core.py:708]     return self.collective_rpc("determine_available_memory")
(EngineCore_DP0 pid=11521) ERROR 11-20 08:15:55 [core.py:708]   File "/vllm-workspace/vllm/vllm/v1/executor/multiproc_executor.py", line 264, in collective_rpc
(EngineCore_DP0 pid=11521) ERROR 11-20 08:15:55 [core.py:708]     result = get_response(w, dequeue_timeout,
(EngineCore_DP0 pid=11521) ERROR 11-20 08:15:55 [core.py:708]   File "/vllm-workspace/vllm/vllm/v1/executor/multiproc_executor.py", line 248, in get_response
(EngineCore_DP0 pid=11521) ERROR 11-20 08:15:55 [core.py:708]     raise RuntimeError(
(EngineCore_DP0 pid=11521) ERROR 11-20 08:15:55 [core.py:708] RuntimeError: Worker failed with error ''AscendRMSNorm' object has no attribute 'next_need_quant_fusion_linear'', please check the stack trace above for the root cause
(EngineCore_DP0 pid=11521) ERROR 11-20 08:16:05 [multiproc_executor.py:154] Worker proc VllmWorker-3 died unexpectedly, shutting down executor.
(EngineCore_DP0 pid=11521) Process EngineCore_DP0:
(EngineCore_DP0 pid=11521) Traceback (most recent call last):
(EngineCore_DP0 pid=11521)   File "/usr/local/python3.10.17/lib/python3.10/multiprocessing/process.py", line 314, in _bootstrap
(EngineCore_DP0 pid=11521)     self.run()
(EngineCore_DP0 pid=11521)   File "/usr/local/python3.10.17/lib/python3.10/multiprocessing/process.py", line 108, in run
(EngineCore_DP0 pid=11521)     self._target(*self._args, **self._kwargs)
(EngineCore_DP0 pid=11521)   File "/vllm-workspace/vllm/vllm/v1/engine/core.py", line 712, in run_engine_core
(EngineCore_DP0 pid=11521)     raise e
(EngineCore_DP0 pid=11521)   File "/vllm-workspace/vllm/vllm/v1/engine/core.py", line 699, in run_engine_core
(EngineCore_DP0 pid=11521)     engine_core = EngineCoreProc(*args, **kwargs)
(EngineCore_DP0 pid=11521)   File "/vllm-workspace/vllm/vllm/v1/engine/core.py", line 498, in __init__
(EngineCore_DP0 pid=11521)     super().__init__(vllm_config, executor_class, log_stats,
(EngineCore_DP0 pid=11521)   File "/vllm-workspace/vllm/vllm/v1/engine/core.py", line 92, in __init__
(EngineCore_DP0 pid=11521)     self._initialize_kv_caches(vllm_config)
(EngineCore_DP0 pid=11521)   File "/vllm-workspace/vllm/vllm/v1/engine/core.py", line 190, in _initialize_kv_caches
(EngineCore_DP0 pid=11521)     self.model_executor.determine_available_memory())
(EngineCore_DP0 pid=11521)   File "/vllm-workspace/vllm/vllm/v1/executor/abstract.py", line 85, in determine_available_memory
(EngineCore_DP0 pid=11521)     return self.collective_rpc("determine_available_memory")
(EngineCore_DP0 pid=11521)   File "/vllm-workspace/vllm/vllm/v1/executor/multiproc_executor.py", line 264, in collective_rpc
(EngineCore_DP0 pid=11521)     result = get_response(w, dequeue_timeout,
(EngineCore_DP0 pid=11521)   File "/vllm-workspace/vllm/vllm/v1/executor/multiproc_executor.py", line 248, in get_response
(EngineCore_DP0 pid=11521)     raise RuntimeError(
(EngineCore_DP0 pid=11521) RuntimeError: Worker failed with error ''AscendRMSNorm' object has no attribute 'next_need_quant_fusion_linear'', please check the stack trace above for the root cause
(APIServer pid=11383) Traceback (most recent call last):
(APIServer pid=11383)   File "/usr/local/python3.10.17/bin/vllm", line 8, in <module>
(APIServer pid=11383)     sys.exit(main())
(APIServer pid=11383)   File "/vllm-workspace/vllm/vllm/entrypoints/cli/main.py", line 54, in main
(APIServer pid=11383)     args.dispatch_function(args)
(APIServer pid=11383)   File "/vllm-workspace/vllm/vllm/entrypoints/cli/serve.py", line 57, in cmd
(APIServer pid=11383)     uvloop.run(run_server(args))
(APIServer pid=11383)   File "/usr/local/python3.10.17/lib/python3.10/site-packages/uvloop/__init__.py", line 69, in run
(APIServer pid=11383)     return loop.run_until_complete(wrapper())
(APIServer pid=11383)   File "uvloop/loop.pyx", line 1518, in uvloop.loop.Loop.run_until_complete
(APIServer pid=11383)   File "/usr/local/python3.10.17/lib/python3.10/site-packages/uvloop/__init__.py", line 48, in wrapper
(APIServer pid=11383)     return await main
(APIServer pid=11383)   File "/vllm-workspace/vllm/vllm/entrypoints/openai/api_server.py", line 1884, in run_server
(APIServer pid=11383)     await run_server_worker(listen_address, sock, args, **uvicorn_kwargs)
(APIServer pid=11383)   File "/vllm-workspace/vllm/vllm/entrypoints/openai/api_server.py", line 1902, in run_server_worker
(APIServer pid=11383)     async with build_async_engine_client(
(APIServer pid=11383)   File "/usr/local/python3.10.17/lib/python3.10/contextlib.py", line 199, in __aenter__
(APIServer pid=11383)     return await anext(self.gen)
(APIServer pid=11383)   File "/vllm-workspace/vllm/vllm/entrypoints/openai/api_server.py", line 180, in build_async_engine_client
(APIServer pid=11383)     async with build_async_engine_client_from_engine_args(
(APIServer pid=11383)   File "/usr/local/python3.10.17/lib/python3.10/contextlib.py", line 199, in __aenter__
(APIServer pid=11383)     return await anext(self.gen)
(APIServer pid=11383)   File "/vllm-workspace/vllm/vllm/entrypoints/openai/api_server.py", line 225, in build_async_engine_client_from_engine_args
(APIServer pid=11383)     async_llm = AsyncLLM.from_vllm_config(
(APIServer pid=11383)   File "/vllm-workspace/vllm/vllm/utils/__init__.py", line 1572, in inner
(APIServer pid=11383)     return fn(*args, **kwargs)
(APIServer pid=11383)   File "/vllm-workspace/vllm/vllm/v1/engine/async_llm.py", line 207, in from_vllm_config
(APIServer pid=11383)     return cls(
(APIServer pid=11383)   File "/vllm-workspace/vllm/vllm/v1/engine/async_llm.py", line 134, in __init__
(APIServer pid=11383)     self.engine_core = EngineCoreClient.make_async_mp_client(
(APIServer pid=11383)   File "/vllm-workspace/vllm/vllm/v1/engine/core_client.py", line 102, in make_async_mp_client
(APIServer pid=11383)     return AsyncMPClient(*client_args)
(APIServer pid=11383)   File "/vllm-workspace/vllm/vllm/v1/engine/core_client.py", line 769, in __init__
(APIServer pid=11383)     super().__init__(
(APIServer pid=11383)   File "/vllm-workspace/vllm/vllm/v1/engine/core_client.py", line 448, in __init__
(APIServer pid=11383)     with launch_core_engines(vllm_config, executor_class,
(APIServer pid=11383)   File "/usr/local/python3.10.17/lib/python3.10/contextlib.py", line 142, in __exit__
(APIServer pid=11383)     next(self.gen)
(APIServer pid=11383)   File "/vllm-workspace/vllm/vllm/v1/engine/utils.py", line 732, in launch_core_engines
(APIServer pid=11383)     wait_for_engine_startup(
(APIServer pid=11383)   File "/vllm-workspace/vllm/vllm/v1/engine/utils.py", line 785, in wait_for_engine_startup
(APIServer pid=11383)     raise RuntimeError("Engine core initialization failed. "
(APIServer pid=11383) RuntimeError: Engine core initialization failed. See root cause above. Failed core proc(s): {}
(APIServer pid=11383) [ERROR] 2025-11-20-08:16:12 (PID:11383, Device:-1, RankID:-1) ERR99999 UNKNOWN applicaiton exception

So the engine fails during the startup profile run (determine_available_memory) with:

'AscendRMSNorm' object has no attribute 'next_need_quant_fusion_linear'
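Per the traceback, the raise comes from a plain `self.next_need_quant_fusion_linear` access in `vllm_ascend/ops/layernorm.py` (`forward_oot`); since `torch.nn.Module.__getattr__` is strict, any layer the quant-fusion setup pass skipped (which appears to happen once LoRA rewrites the surrounding linear layers) raises instead of returning None. The snippet below is a minimal, hypothetical stdlib-only mock of that failure mode, not vllm-ascend's actual code; the class names and the defensive `getattr` fallback are illustrative assumptions.

```python
class ModuleLike:
    """Minimal stand-in mimicking torch.nn.Module's strict __getattr__:
    unknown attributes raise AttributeError rather than returning None."""

    def __getattr__(self, name):
        raise AttributeError(
            f"{type(self).__name__!r} object has no attribute {name!r}")


class AscendRMSNormSketch(ModuleLike):
    """Hypothetical mock of the reported layer state at crash time."""

    def __init__(self):
        # next_need_quant_fusion_linear is deliberately NOT set, mirroring
        # a norm layer that the quant-fusion setup pass never visited.
        pass

    def forward_oot_broken(self, x):
        # Direct attribute access, as in the traceback -> AttributeError
        # when the fusion pass did not populate the attribute.
        return x, self.next_need_quant_fusion_linear

    def forward_oot_defensive(self, x):
        # Hypothetical mitigation: fall back to None and take the unfused
        # path. A real fix would ensure the attribute is always set.
        fusion_linear = getattr(self, "next_need_quant_fusion_linear", None)
        return x, fusion_linear


norm = AscendRMSNormSketch()
try:
    norm.forward_oot_broken(1.0)
except AttributeError as e:
    print(e)  # same shape of error as in the log
print(norm.forward_oot_defensive(1.0))
```

The three-argument `getattr` form swallows the `AttributeError` from `__getattr__` and returns the default, which is why the defensive variant survives where the direct access crashes.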


βœ… Working command: FP16 + LoRA

vllm serve /root/data/Qwen/Qwen3-32B \
  --tensor_parallel_size=4 \
  --enable-lora \
  --lora-modules icd_model=./all_adaptor \
  --port 8000

  • Base model in FP16
  • LoRA adapter enabled
  • No --quantization "ascend"

This configuration serves successfully.


βœ… Working command: W8A8 without LoRA

vllm serve /root/data/Qwen/Qwen3-32B-w8a8 \
  --tensor_parallel_size=4 \
  --port 8000

  • Base model W8A8 quantized
  • No LoRA
  • No --quantization "ascend" flag

This configuration also serves successfully.

