Description
Your current environment
The output of `python collect_env.py`
PyTorch version: 2.7.1+cpu
Is debug build: False
OS: openEuler 22.03 (LTS-SP4) (aarch64)
GCC version: (GCC) 10.3.1
Clang version: Could not collect
CMake version: version 4.1.2
Libc version: glibc-2.34
Python version: 3.10.17 (main, May 8 2025, 08:13:48) [GCC 10.3.1] (64-bit runtime)
Python platform: Linux-6.8.0-31-generic-aarch64-with-glibc2.34
CPU:
Architecture: aarch64
CPU op-mode(s): 64-bit
Byte Order: Little Endian
CPU(s): 192
On-line CPU(s) list: 0-191
Vendor ID: HiSilicon
Model name: Kunpeng-920
Model: 0
Thread(s) per core: 1
Core(s) per cluster: 48
Socket(s): -
Cluster(s): 4
Stepping: 0x1
BogoMIPS: 200.00
Flags: fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm jscvt fcma dcpop asimddp asimdfhm ssbs
L1d cache: 12 MiB (192 instances)
L1i cache: 12 MiB (192 instances)
L2 cache: 96 MiB (192 instances)
L3 cache: 192 MiB (8 instances)
NUMA node(s): 8
NUMA node0 CPU(s): 0-23
NUMA node1 CPU(s): 24-47
NUMA node2 CPU(s): 48-71
NUMA node3 CPU(s): 72-95
NUMA node4 CPU(s): 96-119
NUMA node5 CPU(s): 120-143
NUMA node6 CPU(s): 144-167
NUMA node7 CPU(s): 168-191
Vulnerability Gather data sampling: Not affected
Vulnerability Itlb multihit: Not affected
Vulnerability L1tf: Not affected
Vulnerability Mds: Not affected
Vulnerability Meltdown: Not affected
Vulnerability Mmio stale data: Not affected
Vulnerability Reg file data sampling: Not affected
Vulnerability Retbleed: Not affected
Vulnerability Spec rstack overflow: Not affected
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl
Vulnerability Spectre v1: Mitigation; __user pointer sanitization
Vulnerability Spectre v2: Not affected
Vulnerability Srbds: Not affected
Vulnerability Tsx async abort: Not affected
Versions of relevant libraries:
[pip3] numpy==1.26.4
[pip3] pyzmq==27.1.0
[pip3] torch==2.7.1+cpu
[pip3] torch_npu==2.7.1.dev20250724
[pip3] torchvision==0.22.1
[pip3] transformers==4.57.1
[conda] Could not collect
vLLM Version: 0.11.0
vLLM Ascend Version: 0.11.0rc0
ENV Variables:
ATB_OPSRUNNER_KERNEL_CACHE_TILING_SIZE=10240
ATB_OPSRUNNER_KERNEL_CACHE_LOCAL_COUNT=1
ATB_STREAM_SYNC_EVERY_RUNNER_ENABLE=0
ATB_OPSRUNNER_SETUP_CACHE_ENABLE=1
ATB_WORKSPACE_MEM_ALLOC_GLOBAL=0
ATB_DEVICE_TILING_BUFFER_BLOCK_NUM=32
ATB_STREAM_SYNC_EVERY_KERNEL_ENABLE=0
ATB_OPSRUNNER_KERNEL_CACHE_GLOABL_COUNT=5
ATB_HOME_PATH=/usr/local/Ascend/nnal/atb/latest/atb/cxx_abi_1
ASCEND_TOOLKIT_HOME=/usr/local/Ascend/ascend-toolkit/latest
ATB_COMPARE_TILING_EVERY_KERNEL=0
ASCEND_OPP_PATH=/usr/local/Ascend/ascend-toolkit/latest/opp
LD_LIBRARY_PATH=/usr/local/Ascend/nnal/atb/latest/atb/cxx_abi_1/lib:/usr/local/Ascend/nnal/atb/latest/atb/cxx_abi_1/examples:/usr/local/Ascend/nnal/atb/latest/atb/cxx_abi_1/tests/atbopstest:/usr/local/Ascend/ascend-toolkit/latest/tools/aml/lib64:/usr/local/Ascend/ascend-toolkit/latest/tools/aml/lib64/plugin:/usr/local/Ascend/ascend-toolkit/latest/lib64:/usr/local/Ascend/ascend-toolkit/latest/lib64/plugin/opskernel:/usr/local/Ascend/ascend-toolkit/latest/lib64/plugin/nnengine:/usr/local/Ascend/ascend-toolkit/latest/opp/built-in/op_impl/ai_core/tbe/op_tiling/lib/linux/aarch64:/usr/local/Ascend/nnal/atb/latest/atb/cxx_abi_0/lib:/usr/local/Ascend/nnal/atb/latest/atb/cxx_abi_0/examples:/usr/local/Ascend/nnal/atb/latest/atb/cxx_abi_0/tests/atbopstest:/usr/local/Ascend/ascend-toolkit/latest/tools/aml/lib64:/usr/local/Ascend/ascend-toolkit/latest/tools/aml/lib64/plugin:/usr/local/Ascend/ascend-toolkit/latest/lib64:/usr/local/Ascend/ascend-toolkit/latest/lib64/plugin/opskernel:/usr/local/Ascend/ascend-toolkit/latest/lib64/plugin/nnengine:/usr/local/Ascend/ascend-toolkit/latest/opp/built-in/op_impl/ai_core/tbe/op_tiling:/usr/local/Ascend/driver/lib64/common/:/usr/local/Ascend/driver/lib64/driver/:
ASCEND_AICPU_PATH=/usr/local/Ascend/ascend-toolkit/latest
ATB_OPSRUNNER_KERNEL_CACHE_TYPE=3
ATB_RUNNER_POOL_SIZE=64
ATB_STREAM_SYNC_EVERY_OPERATION_ENABLE=0
ASCEND_HOME_PATH=/usr/local/Ascend/ascend-toolkit/latest
ATB_MATMUL_SHUFFLE_K_ENABLE=1
ATB_LAUNCH_KERNEL_WITH_TILING=1
ATB_WORKSPACE_MEM_ALLOC_ALG_TYPE=1
ATB_HOST_TILING_BUFFER_BLOCK_NUM=128
ATB_SHARE_MEMORY_NAME_SUFFIX=
TORCH_DEVICE_BACKEND_AUTOLOAD=1
PYTORCH_NVML_BASED_CUDA_CHECK=1
TORCHINDUCTOR_COMPILE_THREADS=1
NPU:
+------------------------------------------------------------------------------------------------+
| npu-smi 25.0.rc1.1 Version: 25.0.rc1.1 |
+---------------------------+---------------+----------------------------------------------------+
| NPU Name | Health | Power(W) Temp(C) Hugepages-Usage(page)|
| Chip | Bus-Id | AICore(%) Memory-Usage(MB) HBM-Usage(MB) |
+===========================+===============+====================================================+
| 0 910B4-1 | OK | 103.1 40 0 / 0 |
| 0 | 0000:C1:00.0 | 0 0 / 0 3386 / 65536 |
+===========================+===============+====================================================+
| 1 910B4-1 | OK | 92.9 37 0 / 0 |
| 0 | 0000:C2:00.0 | 0 0 / 0 3379 / 65536 |
+===========================+===============+====================================================+
| 2 910B4-1 | OK | 90.4 37 0 / 0 |
| 0 | 0000:81:00.0 | 0 0 / 0 3380 / 65536 |
+===========================+===============+====================================================+
| 3 910B4-1 | OK | 90.5 37 0 / 0 |
| 0 | 0000:82:00.0 | 0 0 / 0 3379 / 65536 |
+===========================+===============+====================================================+
+---------------------------+---------------+----------------------------------------------------+
| NPU Chip | Process id | Process name | Process memory(MB) |
+===========================+===============+====================================================+
| No running processes found in NPU 0 |
+===========================+===============+====================================================+
| No running processes found in NPU 1 |
+===========================+===============+====================================================+
| No running processes found in NPU 2 |
+===========================+===============+====================================================+
| No running processes found in NPU 3 |
+===========================+===============+====================================================+
CANN:
package_name=Ascend-cann-toolkit
version=8.1.RC1
innerversion=V100R001C21SPC001B238
compatible_version=[V100R001C15],[V100R001C18],[V100R001C19],[V100R001C20],[V100R001C21]
arch=aarch64
os=linux
path=/usr/local/Ascend/ascend-toolkit/8.1.RC1/aarch64-linux
🐛 Describe the bug
Summary
When serving Qwen3-32B-w8a8 with vLLM using Ascend quantization and enabling LoRA, vLLM crashes with an AscendRMSNorm attribute error.
The issue only appears when combining:
- a W8A8-quantized base model, and
- --enable-lora + --lora-modules ... and --quantization "ascend"
Both FP16 + LoRA and W8A8 without LoRA work correctly.
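For context, the traceback below suggests a general failure mode that can be sketched independently of vllm-ascend internals (the class names and setup pass here are hypothetical stand-ins, not the actual vllm-ascend code): forward() reads an attribute that is only attached by a separate quant-fusion setup pass, so any code path that skips that pass hits torch.nn.Module.__getattr__, which raises AttributeError for unknown names.

```python
import torch
import torch.nn as nn

class RMSNormLike(nn.Module):
    """Stand-in for a norm layer whose forward assumes a fusion attribute."""
    def forward(self, x):
        # nn.Module.__getattr__ raises AttributeError if this was never assigned.
        _ = self.next_need_quant_fusion_linear
        return x

def quant_fusion_setup(module):
    # Stand-in for the pass that would normally attach the attribute.
    module.next_need_quant_fusion_linear = None

norm = RMSNormLike()
quant_fusion_setup(norm)
norm(torch.ones(2))  # works: the setup pass attached the attribute

bare = RMSNormLike()  # setup pass skipped
try:
    bare(torch.ones(2))
except AttributeError as e:
    print(e)  # 'RMSNormLike' object has no attribute 'next_need_quant_fusion_linear'
```

This mirrors the error signature in the log; whether the LoRA path actually skips such a setup pass in vllm-ascend is the open question of this report.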
Commands and Behaviors
❌ Failing command
vllm serve /root/data/Qwen/Qwen3-32B-w8a8 \
--tensor_parallel_size=4 \
--enable-lora \
--lora-modules icd_model=./all_adaptor \
--quantization "ascend" \
--port 8000

Error:
INFO 11-20 08:15:07 [parallel_state.py:1208] rank 0 in world size 4 is assigned as DP rank 0, PP rank 0, TP rank 0, EP rank 0
INFO 11-20 08:15:07 [parallel_state.py:1208] rank 1 in world size 4 is assigned as DP rank 0, PP rank 0, TP rank 1, EP rank 1
INFO 11-20 08:15:07 [parallel_state.py:1208] rank 2 in world size 4 is assigned as DP rank 0, PP rank 0, TP rank 2, EP rank 2
INFO 11-20 08:15:07 [parallel_state.py:1208] rank 3 in world size 4 is assigned as DP rank 0, PP rank 0, TP rank 3, EP rank 3
(Worker_TP0 pid=11657) INFO 11-20 08:15:29 [model_runner_v1.py:2627] Starting to load model /root/data/Qwen/Qwen3-32B-w8a8...
(Worker_TP0 pid=11657) INFO 11-20 08:15:29 [utils.py:60] Using the vLLM Ascend Quantization now!
Loading safetensors checkpoint shards: 0% Completed | 0/1 [00:00<?, ?it/s]
(Worker_TP3 pid=11916) INFO 11-20 08:15:35 [model_runner_v1.py:2627] Starting to load model /root/data/Qwen/Qwen3-32B-w8a8...
(Worker_TP3 pid=11916) INFO 11-20 08:15:35 [utils.py:60] Using the vLLM Ascend Quantization now!
(Worker_TP2 pid=11680) INFO 11-20 08:15:36 [model_runner_v1.py:2627] Starting to load model /root/data/Qwen/Qwen3-32B-w8a8...
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:07<00:00, 7.02s/it]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:07<00:00, 7.02s/it]
(Worker_TP0 pid=11657)
(Worker_TP2 pid=11680) INFO 11-20 08:15:38 [utils.py:60] Using the vLLM Ascend Quantization now!
(Worker_TP0 pid=11657) INFO 11-20 08:15:38 [default_loader.py:267] Loading weights took 7.89 seconds
(Worker_TP0 pid=11657) INFO 11-20 08:15:38 [punica_selector.py:19] Using PunicaWrapperNPU.
(Worker_TP1 pid=11661) INFO 11-20 08:15:39 [model_runner_v1.py:2627] Starting to load model /root/data/Qwen/Qwen3-32B-w8a8...
(Worker_TP0 pid=11657) INFO 11-20 08:15:39 [model_runner_v1.py:2661] Loading model weights took 10.0903 GB
(Worker_TP1 pid=11661) INFO 11-20 08:15:40 [utils.py:60] Using the vLLM Ascend Quantization now!
(Worker_TP3 pid=11916) INFO 11-20 08:15:44 [default_loader.py:267] Loading weights took 8.23 seconds
(Worker_TP3 pid=11916) INFO 11-20 08:15:45 [punica_selector.py:19] Using PunicaWrapperNPU.
(Worker_TP3 pid=11916) INFO 11-20 08:15:46 [model_runner_v1.py:2661] Loading model weights took 10.0903 GB
(Worker_TP2 pid=11680) INFO 11-20 08:15:48 [default_loader.py:267] Loading weights took 10.01 seconds
(Worker_TP2 pid=11680) INFO 11-20 08:15:48 [punica_selector.py:19] Using PunicaWrapperNPU.
(Worker_TP2 pid=11680) INFO 11-20 08:15:50 [model_runner_v1.py:2661] Loading model weights took 10.0903 GB
(Worker_TP1 pid=11661) INFO 11-20 08:15:50 [default_loader.py:267] Loading weights took 10.28 seconds
(Worker_TP1 pid=11661) INFO 11-20 08:15:51 [punica_selector.py:19] Using PunicaWrapperNPU.
(Worker_TP1 pid=11661) INFO 11-20 08:15:52 [model_runner_v1.py:2661] Loading model weights took 10.0903 GB
(Worker_TP0 pid=11657) ERROR 11-20 08:15:55 [multiproc_executor.py:671] WorkerProc hit an exception.
(Worker_TP0 pid=11657) ERROR 11-20 08:15:55 [multiproc_executor.py:671] Traceback (most recent call last):
(Worker_TP0 pid=11657) ERROR 11-20 08:15:55 [multiproc_executor.py:671] File "/vllm-workspace/vllm/vllm/v1/executor/multiproc_executor.py", line 666, in worker_busy_loop
(Worker_TP0 pid=11657) ERROR 11-20 08:15:55 [multiproc_executor.py:671] output = func(*args, **kwargs)
(Worker_TP0 pid=11657) ERROR 11-20 08:15:55 [multiproc_executor.py:671] File "/vllm-workspace/vllm-ascend/vllm_ascend/worker/worker_v1.py", line 205, in determine_available_memory
(Worker_TP0 pid=11657) ERROR 11-20 08:15:55 [multiproc_executor.py:671] self.model_runner.profile_run()
(Worker_TP0 pid=11657) ERROR 11-20 08:15:55 [multiproc_executor.py:671] File "/vllm-workspace/vllm-ascend/vllm_ascend/worker/model_runner_v1.py", line 2509, in profile_run
(Worker_TP0 pid=11657) ERROR 11-20 08:15:55 [multiproc_executor.py:671] hidden_states = self._dummy_run(self.max_num_tokens,
(Worker_TP0 pid=11657) ERROR 11-20 08:15:55 [multiproc_executor.py:671] File "/usr/local/python3.10.17/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
(Worker_TP0 pid=11657) ERROR 11-20 08:15:55 [multiproc_executor.py:671] return func(*args, **kwargs)
(Worker_TP0 pid=11657) ERROR 11-20 08:15:55 [multiproc_executor.py:671] File "/vllm-workspace/vllm-ascend/vllm_ascend/worker/model_runner_v1.py", line 2475, in _dummy_run
(Worker_TP0 pid=11657) ERROR 11-20 08:15:55 [multiproc_executor.py:671] hidden_states = self._generate_dummy_run_hidden_states(
(Worker_TP0 pid=11657) ERROR 11-20 08:15:55 [multiproc_executor.py:671] File "/vllm-workspace/vllm-ascend/vllm_ascend/worker/model_runner_v1.py", line 2320, in _generate_dummy_run_hidden_states
(Worker_TP0 pid=11657) ERROR 11-20 08:15:55 [multiproc_executor.py:671] hidden_states = self.model(input_ids=input_ids,
(Worker_TP0 pid=11657) ERROR 11-20 08:15:55 [multiproc_executor.py:671] File "/usr/local/python3.10.17/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl
(Worker_TP0 pid=11657) ERROR 11-20 08:15:55 [multiproc_executor.py:671] return self._call_impl(*args, **kwargs)
(Worker_TP0 pid=11657) ERROR 11-20 08:15:55 [multiproc_executor.py:671] File "/usr/local/python3.10.17/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1762, in _call_impl
(Worker_TP0 pid=11657) ERROR 11-20 08:15:55 [multiproc_executor.py:671] return forward_call(*args, **kwargs)
(Worker_TP0 pid=11657) ERROR 11-20 08:15:55 [multiproc_executor.py:671] File "/vllm-workspace/vllm/vllm/model_executor/models/qwen3.py", line 323, in forward
(Worker_TP0 pid=11657) ERROR 11-20 08:15:55 [multiproc_executor.py:671] hidden_states = self.model(input_ids, positions, intermediate_tensors,
(Worker_TP0 pid=11657) ERROR 11-20 08:15:55 [multiproc_executor.py:671] File "/vllm-workspace/vllm/vllm/compilation/decorators.py", line 310, in __call__
(Worker_TP0 pid=11657) ERROR 11-20 08:15:55 [multiproc_executor.py:671] output = self.compiled_callable(*args, **kwargs)
(Worker_TP0 pid=11657) ERROR 11-20 08:15:55 [multiproc_executor.py:671] File "/vllm-workspace/vllm/vllm/model_executor/models/qwen2.py", line 364, in forward
(Worker_TP0 pid=11657) ERROR 11-20 08:15:55 [multiproc_executor.py:671] hidden_states, residual = layer(positions, hidden_states, residual)
(Worker_TP0 pid=11657) ERROR 11-20 08:15:55 [multiproc_executor.py:671] File "/usr/local/python3.10.17/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl
(Worker_TP0 pid=11657) ERROR 11-20 08:15:55 [multiproc_executor.py:671] return self._call_impl(*args, **kwargs)
(Worker_TP0 pid=11657) ERROR 11-20 08:15:55 [multiproc_executor.py:671] File "/usr/local/python3.10.17/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1762, in _call_impl
(Worker_TP0 pid=11657) ERROR 11-20 08:15:55 [multiproc_executor.py:671] return forward_call(*args, **kwargs)
(Worker_TP0 pid=11657) ERROR 11-20 08:15:55 [multiproc_executor.py:671] File "/vllm-workspace/vllm/vllm/model_executor/models/qwen3.py", line 235, in forward
(Worker_TP0 pid=11657) ERROR 11-20 08:15:55 [multiproc_executor.py:671] hidden_states, residual = self.post_attention_layernorm(
(Worker_TP0 pid=11657) ERROR 11-20 08:15:55 [multiproc_executor.py:671] File "/usr/local/python3.10.17/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl
(Worker_TP0 pid=11657) ERROR 11-20 08:15:55 [multiproc_executor.py:671] return self._call_impl(*args, **kwargs)
(Worker_TP0 pid=11657) ERROR 11-20 08:15:55 [multiproc_executor.py:671] File "/usr/local/python3.10.17/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1762, in _call_impl
(Worker_TP0 pid=11657) ERROR 11-20 08:15:55 [multiproc_executor.py:671] return forward_call(*args, **kwargs)
(Worker_TP0 pid=11657) ERROR 11-20 08:15:55 [multiproc_executor.py:671] File "/vllm-workspace/vllm/vllm/model_executor/custom_op.py", line 44, in forward
(Worker_TP0 pid=11657) ERROR 11-20 08:15:55 [multiproc_executor.py:671] return self._forward_method(*args, **kwargs)
(Worker_TP0 pid=11657) ERROR 11-20 08:15:55 [multiproc_executor.py:671] File "/vllm-workspace/vllm-ascend/vllm_ascend/ops/layernorm.py", line 70, in forward_oot
(Worker_TP0 pid=11657) ERROR 11-20 08:15:55 [multiproc_executor.py:671] self, x, residual, self.next_need_quant_fusion_linear)
(Worker_TP0 pid=11657) ERROR 11-20 08:15:55 [multiproc_executor.py:671] File "/usr/local/python3.10.17/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1940, in __getattr__
(Worker_TP0 pid=11657) ERROR 11-20 08:15:55 [multiproc_executor.py:671] raise AttributeError(
(Worker_TP0 pid=11657) ERROR 11-20 08:15:55 [multiproc_executor.py:671] AttributeError: 'AscendRMSNorm' object has no attribute 'next_need_quant_fusion_linear'
(EngineCore_DP0 pid=11521) ERROR 11-20 08:15:55 [core.py:708] EngineCore failed to start.
(EngineCore_DP0 pid=11521) ERROR 11-20 08:15:55 [core.py:708] Traceback (most recent call last):
(EngineCore_DP0 pid=11521) ERROR 11-20 08:15:55 [core.py:708] File "/vllm-workspace/vllm/vllm/v1/engine/core.py", line 699, in run_engine_core
(EngineCore_DP0 pid=11521) ERROR 11-20 08:15:55 [core.py:708] engine_core = EngineCoreProc(*args, **kwargs)
(EngineCore_DP0 pid=11521) ERROR 11-20 08:15:55 [core.py:708] File "/vllm-workspace/vllm/vllm/v1/engine/core.py", line 498, in __init__
(EngineCore_DP0 pid=11521) ERROR 11-20 08:15:55 [core.py:708] super().__init__(vllm_config, executor_class, log_stats,
(EngineCore_DP0 pid=11521) ERROR 11-20 08:15:55 [core.py:708] File "/vllm-workspace/vllm/vllm/v1/engine/core.py", line 92, in __init__
(EngineCore_DP0 pid=11521) ERROR 11-20 08:15:55 [core.py:708] self._initialize_kv_caches(vllm_config)
(EngineCore_DP0 pid=11521) ERROR 11-20 08:15:55 [core.py:708] File "/vllm-workspace/vllm/vllm/v1/engine/core.py", line 190, in _initialize_kv_caches
(EngineCore_DP0 pid=11521) ERROR 11-20 08:15:55 [core.py:708] self.model_executor.determine_available_memory())
(EngineCore_DP0 pid=11521) ERROR 11-20 08:15:55 [core.py:708] File "/vllm-workspace/vllm/vllm/v1/executor/abstract.py", line 85, in determine_available_memory
(EngineCore_DP0 pid=11521) ERROR 11-20 08:15:55 [core.py:708] return self.collective_rpc("determine_available_memory")
(EngineCore_DP0 pid=11521) ERROR 11-20 08:15:55 [core.py:708] File "/vllm-workspace/vllm/vllm/v1/executor/multiproc_executor.py", line 264, in collective_rpc
(EngineCore_DP0 pid=11521) ERROR 11-20 08:15:55 [core.py:708] result = get_response(w, dequeue_timeout,
(EngineCore_DP0 pid=11521) ERROR 11-20 08:15:55 [core.py:708] File "/vllm-workspace/vllm/vllm/v1/executor/multiproc_executor.py", line 248, in get_response
(EngineCore_DP0 pid=11521) ERROR 11-20 08:15:55 [core.py:708] raise RuntimeError(
(EngineCore_DP0 pid=11521) ERROR 11-20 08:15:55 [core.py:708] RuntimeError: Worker failed with error ''AscendRMSNorm' object has no attribute 'next_need_quant_fusion_linear'', please check the stack trace above for the root cause
(EngineCore_DP0 pid=11521) ERROR 11-20 08:16:05 [multiproc_executor.py:154] Worker proc VllmWorker-3 died unexpectedly, shutting down executor.
(EngineCore_DP0 pid=11521) Process EngineCore_DP0:
(EngineCore_DP0 pid=11521) Traceback (most recent call last):
(EngineCore_DP0 pid=11521) File "/usr/local/python3.10.17/lib/python3.10/multiprocessing/process.py", line 314, in _bootstrap
(EngineCore_DP0 pid=11521) self.run()
(EngineCore_DP0 pid=11521) File "/usr/local/python3.10.17/lib/python3.10/multiprocessing/process.py", line 108, in run
(EngineCore_DP0 pid=11521) self._target(*self._args, **self._kwargs)
(EngineCore_DP0 pid=11521) File "/vllm-workspace/vllm/vllm/v1/engine/core.py", line 712, in run_engine_core
(EngineCore_DP0 pid=11521) raise e
(EngineCore_DP0 pid=11521) File "/vllm-workspace/vllm/vllm/v1/engine/core.py", line 699, in run_engine_core
(EngineCore_DP0 pid=11521) engine_core = EngineCoreProc(*args, **kwargs)
(EngineCore_DP0 pid=11521) File "/vllm-workspace/vllm/vllm/v1/engine/core.py", line 498, in __init__
(EngineCore_DP0 pid=11521) super().__init__(vllm_config, executor_class, log_stats,
(EngineCore_DP0 pid=11521) File "/vllm-workspace/vllm/vllm/v1/engine/core.py", line 92, in __init__
(EngineCore_DP0 pid=11521) self._initialize_kv_caches(vllm_config)
(EngineCore_DP0 pid=11521) File "/vllm-workspace/vllm/vllm/v1/engine/core.py", line 190, in _initialize_kv_caches
(EngineCore_DP0 pid=11521) self.model_executor.determine_available_memory())
(EngineCore_DP0 pid=11521) File "/vllm-workspace/vllm/vllm/v1/executor/abstract.py", line 85, in determine_available_memory
(EngineCore_DP0 pid=11521) return self.collective_rpc("determine_available_memory")
(EngineCore_DP0 pid=11521) File "/vllm-workspace/vllm/vllm/v1/executor/multiproc_executor.py", line 264, in collective_rpc
(EngineCore_DP0 pid=11521) result = get_response(w, dequeue_timeout,
(EngineCore_DP0 pid=11521) File "/vllm-workspace/vllm/vllm/v1/executor/multiproc_executor.py", line 248, in get_response
(EngineCore_DP0 pid=11521) raise RuntimeError(
(EngineCore_DP0 pid=11521) RuntimeError: Worker failed with error ''AscendRMSNorm' object has no attribute 'next_need_quant_fusion_linear'', please check the stack trace above for the root cause
(APIServer pid=11383) Traceback (most recent call last):
(APIServer pid=11383) File "/usr/local/python3.10.17/bin/vllm", line 8, in <module>
(APIServer pid=11383) sys.exit(main())
(APIServer pid=11383) File "/vllm-workspace/vllm/vllm/entrypoints/cli/main.py", line 54, in main
(APIServer pid=11383) args.dispatch_function(args)
(APIServer pid=11383) File "/vllm-workspace/vllm/vllm/entrypoints/cli/serve.py", line 57, in cmd
(APIServer pid=11383) uvloop.run(run_server(args))
(APIServer pid=11383) File "/usr/local/python3.10.17/lib/python3.10/site-packages/uvloop/__init__.py", line 69, in run
(APIServer pid=11383) return loop.run_until_complete(wrapper())
(APIServer pid=11383) File "uvloop/loop.pyx", line 1518, in uvloop.loop.Loop.run_until_complete
(APIServer pid=11383) File "/usr/local/python3.10.17/lib/python3.10/site-packages/uvloop/__init__.py", line 48, in wrapper
(APIServer pid=11383) return await main
(APIServer pid=11383) File "/vllm-workspace/vllm/vllm/entrypoints/openai/api_server.py", line 1884, in run_server
(APIServer pid=11383) await run_server_worker(listen_address, sock, args, **uvicorn_kwargs)
(APIServer pid=11383) File "/vllm-workspace/vllm/vllm/entrypoints/openai/api_server.py", line 1902, in run_server_worker
(APIServer pid=11383) async with build_async_engine_client(
(APIServer pid=11383) File "/usr/local/python3.10.17/lib/python3.10/contextlib.py", line 199, in __aenter__
(APIServer pid=11383) return await anext(self.gen)
(APIServer pid=11383) File "/vllm-workspace/vllm/vllm/entrypoints/openai/api_server.py", line 180, in build_async_engine_client
(APIServer pid=11383) async with build_async_engine_client_from_engine_args(
(APIServer pid=11383) File "/usr/local/python3.10.17/lib/python3.10/contextlib.py", line 199, in __aenter__
(APIServer pid=11383) return await anext(self.gen)
(APIServer pid=11383) File "/vllm-workspace/vllm/vllm/entrypoints/openai/api_server.py", line 225, in build_async_engine_client_from_engine_args
(APIServer pid=11383) async_llm = AsyncLLM.from_vllm_config(
(APIServer pid=11383) File "/vllm-workspace/vllm/vllm/utils/__init__.py", line 1572, in inner
(APIServer pid=11383) return fn(*args, **kwargs)
(APIServer pid=11383) File "/vllm-workspace/vllm/vllm/v1/engine/async_llm.py", line 207, in from_vllm_config
(APIServer pid=11383) return cls(
(APIServer pid=11383) File "/vllm-workspace/vllm/vllm/v1/engine/async_llm.py", line 134, in __init__
(APIServer pid=11383) self.engine_core = EngineCoreClient.make_async_mp_client(
(APIServer pid=11383) File "/vllm-workspace/vllm/vllm/v1/engine/core_client.py", line 102, in make_async_mp_client
(APIServer pid=11383) return AsyncMPClient(*client_args)
(APIServer pid=11383) File "/vllm-workspace/vllm/vllm/v1/engine/core_client.py", line 769, in __init__
(APIServer pid=11383) super().__init__(
(APIServer pid=11383) File "/vllm-workspace/vllm/vllm/v1/engine/core_client.py", line 448, in __init__
(APIServer pid=11383) with launch_core_engines(vllm_config, executor_class,
(APIServer pid=11383) File "/usr/local/python3.10.17/lib/python3.10/contextlib.py", line 142, in __exit__
(APIServer pid=11383) next(self.gen)
(APIServer pid=11383) File "/vllm-workspace/vllm/vllm/v1/engine/utils.py", line 732, in launch_core_engines
(APIServer pid=11383) wait_for_engine_startup(
(APIServer pid=11383) File "/vllm-workspace/vllm/vllm/v1/engine/utils.py", line 785, in wait_for_engine_startup
(APIServer pid=11383) raise RuntimeError("Engine core initialization failed. "
(APIServer pid=11383) RuntimeError: Engine core initialization failed. See root cause above. Failed core proc(s): {}
(APIServer pid=11383) [ERROR] 2025-11-20-08:16:12 (PID:11383, Device:-1, RankID:-1) ERR99999 UNKNOWN applicaiton exception
So the engine fails at runtime with:
'AscendRMSNorm' object has no attribute 'next_need_quant_fusion_linear'
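A defensive pattern that would avoid this particular crash (purely illustrative, not a proposed patch to vllm_ascend/ops/layernorm.py): look the attribute up with getattr and a default, so a layer that never went through the fusion setup falls back to an unfused path instead of raising. The class and fallback below are hypothetical.

```python
import torch
import torch.nn as nn

class SafeNorm(nn.Module):
    """Illustrative only: getattr with a default sidesteps the AttributeError."""
    def forward(self, x):
        nxt = getattr(self, "next_need_quant_fusion_linear", None)
        if nxt is None:
            # Unfused fallback: plain RMSNorm-style normalization.
            return x * torch.rsqrt(x.pow(2).mean(-1, keepdim=True) + 1e-6)
        # A fused path would go here when the setup pass attached the attribute.
        return x

SafeNorm()(torch.ones(3))  # runs without AttributeError even with no setup pass
```

Whether silently falling back is correct for the quant-fusion path is for the maintainers to judge; the sketch only shows that the lookup itself need not crash.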
✅ Working command: FP16 + LoRA
vllm serve /root/data/Qwen/Qwen3-32B \
--tensor_parallel_size=4 \
--enable-lora \
--lora-modules icd_model=./all_adaptor \
--port 8000

- Base model in FP16
- LoRA adapter enabled
- No --quantization "ascend"
This configuration serves successfully.
✅ Working command: W8A8 without LoRA
vllm serve /root/data/Qwen/Qwen3-32B-w8a8 \
--tensor_parallel_size=4 \
--port 8000

- Base model W8A8 quantized
- No LoRA
- No --quantization "ascend" flag
This configuration also serves successfully.