
[Bug]: Qwen/Qwen2.5-7B-Instruct-1M fails to start due to dual_chunk_attention_config #4309

@zhangxinyuehfad

Description


Your current environment

The output of `python collect_env.py`
PyTorch version: 2.7.1+cpu
Is debug build: False

OS: Ubuntu 22.04.5 LTS (aarch64)
GCC version: (Ubuntu 11.4.0-1ubuntu1~22.04.2) 11.4.0
Clang version: Could not collect
CMake version: version 4.1.2
Libc version: glibc-2.35

Python version: 3.11.13 (main, Nov  2 2025, 10:27:27) [GCC 11.4.0] (64-bit runtime)
Python platform: Linux-5.10.0-182.0.0.95.r1941_123.hce2.aarch64-aarch64-with-glibc2.35

CPU:
Architecture:                         aarch64
CPU op-mode(s):                       64-bit
Byte Order:                           Little Endian
CPU(s):                               320
On-line CPU(s) list:                  0-319
Vendor ID:                            HiSilicon
Model:                                0
Thread(s) per core:                   1
Core(s) per cluster:                  80
Socket(s):                            -
Cluster(s):                           4
Stepping:                             0x0
Frequency boost:                      disabled
CPU max MHz:                          3000.0000
CPU min MHz:                          400.0000
BogoMIPS:                             200.00
Flags:                                fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm jscvt fcma lrcpc dcpop sha3 sm3 sm4 asimddp sha512 sve asimdfhm dit uscat ilrcpc flagm ssbs sb paca pacg dcpodp flagm2 frint svei8mm svef32mm svef64mm svebf16 i8mm bf16 dgh rng ecv
L1d cache:                            20 MiB (320 instances)
L1i cache:                            20 MiB (320 instances)
L2 cache:                             400 MiB (320 instances)
L3 cache:                             560 MiB (8 instances)
NUMA node(s):                         8
NUMA node0 CPU(s):                    0-39
NUMA node1 CPU(s):                    40-79
NUMA node2 CPU(s):                    80-119
NUMA node3 CPU(s):                    120-159
NUMA node4 CPU(s):                    160-199
NUMA node5 CPU(s):                    200-239
NUMA node6 CPU(s):                    240-279
NUMA node7 CPU(s):                    280-319
Vulnerability Gather data sampling:   Not affected
Vulnerability Itlb multihit:          Not affected
Vulnerability L1tf:                   Not affected
Vulnerability Mds:                    Not affected
Vulnerability Meltdown:               Not affected
Vulnerability Mmio stale data:        Not affected
Vulnerability Reg file data sampling: Not affected
Vulnerability Retbleed:               Not affected
Vulnerability Spec rstack overflow:   Not affected
Vulnerability Spec store bypass:      Mitigation; Speculative Store Bypass disabled via prctl
Vulnerability Spectre v1:             Mitigation; __user pointer sanitization
Vulnerability Spectre v2:             Not affected
Vulnerability Srbds:                  Not affected
Vulnerability Tsx async abort:        Not affected

Versions of relevant libraries:
[pip3] mypy==1.11.1
[pip3] mypy_extensions==1.1.0
[pip3] numpy==1.26.4
[pip3] pyzmq==27.1.0
[pip3] sentence-transformers==5.1.2
[pip3] torch==2.7.1+cpu
[pip3] torch_npu==2.7.1
[pip3] torchvision==0.22.1
[pip3] transformers==4.57.1
[pip3] zmq==0.0.0
[conda] Could not collect
vLLM Version: 0.11.0
vLLM Ascend Version: 0.11.0rc1

ENV Variables:
ATB_OPSRUNNER_KERNEL_CACHE_LOCAL_COUNT=1
ATB_STREAM_SYNC_EVERY_RUNNER_ENABLE=0
ATB_OPSRUNNER_SETUP_CACHE_ENABLE=1
ATB_WORKSPACE_MEM_ALLOC_GLOBAL=1
ATB_DEVICE_TILING_BUFFER_BLOCK_NUM=32
ASCEND_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
ATB_STREAM_SYNC_EVERY_KERNEL_ENABLE=0
ATB_OPSRUNNER_KERNEL_CACHE_GLOABL_COUNT=5
VLLM_USE_MODELSCOPE=True
PYTORCH_NPU_ALLOC_CONF=max_split_size_mb:256
ATB_HOME_PATH=/usr/local/Ascend/nnal/atb/latest/atb/cxx_abi_1
ASCEND_TOOLKIT_HOME=/usr/local/Ascend/ascend-toolkit/latest
ATB_COMPARE_TILING_EVERY_KERNEL=0
ASCEND_OPP_PATH=/usr/local/Ascend/ascend-toolkit/latest/opp
LD_LIBRARY_PATH=/usr/local/Ascend/nnal/atb/latest/atb/cxx_abi_1/lib:/usr/local/Ascend/nnal/atb/latest/atb/cxx_abi_1/examples:/usr/local/Ascend/nnal/atb/latest/atb/cxx_abi_1/tests/atbopstest:/usr/local/Ascend/ascend-toolkit/latest/tools/aml/lib64:/usr/local/Ascend/ascend-toolkit/latest/tools/aml/lib64/plugin:/usr/local/Ascend/ascend-toolkit/latest/lib64:/usr/local/Ascend/ascend-toolkit/latest/lib64/plugin/opskernel:/usr/local/Ascend/ascend-toolkit/latest/lib64/plugin/nnengine:/usr/local/Ascend/ascend-toolkit/latest/opp/built-in/op_impl/ai_core/tbe/op_tiling/lib/linux/aarch64:/usr/local/Ascend/nnal/atb/latest/atb/cxx_abi_0/lib:/usr/local/Ascend/nnal/atb/latest/atb/cxx_abi_0/examples:/usr/local/Ascend/nnal/atb/latest/atb/cxx_abi_0/tests/atbopstest:/usr/local/Ascend/ascend-toolkit/latest/tools/aml/lib64:/usr/local/Ascend/ascend-toolkit/latest/tools/aml/lib64/plugin:/usr/local/Ascend/ascend-toolkit/latest/lib64:/usr/local/Ascend/ascend-toolkit/latest/lib64/plugin/opskernel:/usr/local/Ascend/ascend-toolkit/latest/lib64/plugin/nnengine:/usr/local/Ascend/ascend-toolkit/latest/opp/built-in/op_impl/ai_core/tbe/op_tiling:/usr/local/Ascend/driver/lib64/common/:/usr/local/Ascend/driver/lib64/driver/:
ASCEND_AICPU_PATH=/usr/local/Ascend/ascend-toolkit/latest
ATB_STREAM_SYNC_EVERY_OPERATION_ENABLE=0
ASCEND_HOME_PATH=/usr/local/Ascend/ascend-toolkit/latest
ATB_MATMUL_SHUFFLE_K_ENABLE=1
ATB_WORKSPACE_MEM_ALLOC_ALG_TYPE=1
ATB_HOST_TILING_BUFFER_BLOCK_NUM=128
ATB_SHARE_MEMORY_NAME_SUFFIX=
TORCH_DEVICE_BACKEND_AUTOLOAD=1
PYTORCH_NVML_BASED_CUDA_CHECK=1
TORCHINDUCTOR_COMPILE_THREADS=1


NPU:
+------------------------------------------------------------------------------------------------+
| npu-smi 24.1.rc3.7               Version: 24.1.rc3.7                                           |
+---------------------------+---------------+----------------------------------------------------+
| NPU   Name                | Health        | Power(W)    Temp(C)           Hugepages-Usage(page)|
| Chip  Phy-ID              | Bus-Id        | AICore(%)   Memory-Usage(MB)  HBM-Usage(MB)        |
+===========================+===============+====================================================+
| 0     Ascend910           | OK            | 174.5       39                0    / 0             |
| 0     0                   | 0000:9D:00.0  | 0           0    / 0          3439 / 65536         |
+------------------------------------------------------------------------------------------------+
| 0     Ascend910           | OK            | -           37                0    / 0             |
| 1     1                   | 0000:9F:00.0  | 0           0    / 0          3206 / 65536         |
+===========================+===============+====================================================+
| 1     Ascend910           | OK            | 184.7       38                0    / 0             |
| 0     2                   | 0000:99:00.0  | 0           0    / 0          3432 / 65536         |
+------------------------------------------------------------------------------------------------+
| 1     Ascend910           | OK            | -           38                0    / 0             |
| 1     3                   | 0000:9B:00.0  | 0           0    / 0          3207 / 65536         |
+===========================+===============+====================================================+
| 2     Ascend910           | OK            | 177.7       37                0    / 0             |
| 0     4                   | 0000:95:00.0  | 0           0    / 0          3443 / 65536         |
+------------------------------------------------------------------------------------------------+
| 2     Ascend910           | OK            | -           37                0    / 0             |
| 1     5                   | 0000:97:00.0  | 0           0    / 0          3197 / 65536         |
+===========================+===============+====================================================+
| 3     Ascend910           | OK            | 188.7       37                0    / 0             |
| 0     6                   | 0000:91:00.0  | 0           0    / 0          3433 / 65536         |
+------------------------------------------------------------------------------------------------+
| 3     Ascend910           | OK            | -           37                0    / 0             |
| 1     7                   | 0000:93:00.0  | 0           0    / 0          3209 / 65536         |
+===========================+===============+====================================================+
+---------------------------+---------------+----------------------------------------------------+
| NPU     Chip              | Process id    | Process name             | Process memory(MB)      |
+===========================+===============+====================================================+
| No running processes found in NPU 0                                                            |
+===========================+===============+====================================================+
| No running processes found in NPU 1                                                            |
+===========================+===============+====================================================+
| No running processes found in NPU 2                                                            |
+===========================+===============+====================================================+
| No running processes found in NPU 3                                                            |
+===========================+===============+====================================================+

CANN:
package_name=Ascend-cann-toolkit
version=8.3.RC1
innerversion=V100R001C23SPC001B235
compatible_version=[V100R001C15],[V100R001C18],[V100R001C19],[V100R001C20],[V100R001C21],[V100R001C23]
arch=aarch64
os=linux
path=/usr/local/Ascend/ascend-toolkit/8.3.RC1/aarch64-linux

๐Ÿ› Describe the bug

command:

VLLM_USE_MODELSCOPE=True vllm serve /root/.cache/modelscope/hub/models/Qwen/Qwen2.5-7B-Instruct-1M --max-model-len 200000 --dtype float16 --trust-remote-code &

Note: the server starts fine after deleting `dual_chunk_attention_config` from the `config.json` of Qwen2.5-7B-Instruct-1M.

error log:

(EngineCore_DP0 pid=42742) INFO 11-20 07:50:39 [parallel_state.py:1208] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, TP rank 0, EP rank 0
(EngineCore_DP0 pid=42742) INFO 11-20 07:50:40 [model_runner_v1.py:2642] Starting to load model /root/.cache/modelscope/hub/models/Qwen/Qwen2.5-7B-Instruct-1M...
(EngineCore_DP0 pid=42742) ERROR 11-20 07:50:41 [core.py:708] EngineCore failed to start.
(EngineCore_DP0 pid=42742) ERROR 11-20 07:50:41 [core.py:708] Traceback (most recent call last):
(EngineCore_DP0 pid=42742) ERROR 11-20 07:50:41 [core.py:708]   File "/vllm-workspace/vllm/vllm/v1/engine/core.py", line 699, in run_engine_core
(EngineCore_DP0 pid=42742) ERROR 11-20 07:50:41 [core.py:708]     engine_core = EngineCoreProc(*args, **kwargs)
(EngineCore_DP0 pid=42742) ERROR 11-20 07:50:41 [core.py:708]                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=42742) ERROR 11-20 07:50:41 [core.py:708]   File "/vllm-workspace/vllm/vllm/v1/engine/core.py", line 498, in __init__
(EngineCore_DP0 pid=42742) ERROR 11-20 07:50:41 [core.py:708]     super().__init__(vllm_config, executor_class, log_stats,
(EngineCore_DP0 pid=42742) ERROR 11-20 07:50:41 [core.py:708]   File "/vllm-workspace/vllm/vllm/v1/engine/core.py", line 83, in __init__
(EngineCore_DP0 pid=42742) ERROR 11-20 07:50:41 [core.py:708]     self.model_executor = executor_class(vllm_config)
(EngineCore_DP0 pid=42742) ERROR 11-20 07:50:41 [core.py:708]                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=42742) ERROR 11-20 07:50:41 [core.py:708]   File "/vllm-workspace/vllm/vllm/executor/executor_base.py", line 54, in __init__
(EngineCore_DP0 pid=42742) ERROR 11-20 07:50:41 [core.py:708]     self._init_executor()
(EngineCore_DP0 pid=42742) ERROR 11-20 07:50:41 [core.py:708]   File "/vllm-workspace/vllm/vllm/executor/uniproc_executor.py", line 55, in _init_executor
(EngineCore_DP0 pid=42742) ERROR 11-20 07:50:41 [core.py:708]     self.collective_rpc("load_model")
(EngineCore_DP0 pid=42742) ERROR 11-20 07:50:41 [core.py:708]   File "/vllm-workspace/vllm/vllm/executor/uniproc_executor.py", line 83, in collective_rpc
(EngineCore_DP0 pid=42742) ERROR 11-20 07:50:41 [core.py:708]     return [run_method(self.driver_worker, method, args, kwargs)]
(EngineCore_DP0 pid=42742) ERROR 11-20 07:50:41 [core.py:708]             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=42742) ERROR 11-20 07:50:41 [core.py:708]   File "/vllm-workspace/vllm/vllm/utils/__init__.py", line 3122, in run_method
(EngineCore_DP0 pid=42742) ERROR 11-20 07:50:41 [core.py:708]     return func(*args, **kwargs)
(EngineCore_DP0 pid=42742) ERROR 11-20 07:50:41 [core.py:708]            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=42742) ERROR 11-20 07:50:41 [core.py:708]   File "/vllm-workspace/vllm-ascend/vllm_ascend/worker/worker_v1.py", line 313, in load_model
(EngineCore_DP0 pid=42742) ERROR 11-20 07:50:41 [core.py:708]     self.model_runner.load_model()
(EngineCore_DP0 pid=42742) ERROR 11-20 07:50:41 [core.py:708]   File "/vllm-workspace/vllm-ascend/vllm_ascend/worker/model_runner_v1.py", line 2645, in load_model
(EngineCore_DP0 pid=42742) ERROR 11-20 07:50:41 [core.py:708]     self.model = get_model(vllm_config=self.vllm_config)
(EngineCore_DP0 pid=42742) ERROR 11-20 07:50:41 [core.py:708]                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=42742) ERROR 11-20 07:50:41 [core.py:708]   File "/vllm-workspace/vllm/vllm/model_executor/model_loader/__init__.py", line 119, in get_model
(EngineCore_DP0 pid=42742) ERROR 11-20 07:50:41 [core.py:708]     return loader.load_model(vllm_config=vllm_config,
(EngineCore_DP0 pid=42742) ERROR 11-20 07:50:41 [core.py:708]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=42742) ERROR 11-20 07:50:41 [core.py:708]   File "/vllm-workspace/vllm/vllm/model_executor/model_loader/base_loader.py", line 45, in load_model
(EngineCore_DP0 pid=42742) ERROR 11-20 07:50:41 [core.py:708]     model = initialize_model(vllm_config=vllm_config,
(EngineCore_DP0 pid=42742) ERROR 11-20 07:50:41 [core.py:708]             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=42742) ERROR 11-20 07:50:41 [core.py:708]   File "/vllm-workspace/vllm/vllm/model_executor/model_loader/utils.py", line 63, in initialize_model
(EngineCore_DP0 pid=42742) ERROR 11-20 07:50:41 [core.py:708]     return model_class(vllm_config=vllm_config, prefix=prefix)
(EngineCore_DP0 pid=42742) ERROR 11-20 07:50:41 [core.py:708]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=42742) ERROR 11-20 07:50:41 [core.py:708]   File "/vllm-workspace/vllm/vllm/model_executor/models/qwen2.py", line 468, in __init__
(EngineCore_DP0 pid=42742) ERROR 11-20 07:50:41 [core.py:708]     self.model = Qwen2Model(vllm_config=vllm_config,
(EngineCore_DP0 pid=42742) ERROR 11-20 07:50:41 [core.py:708]                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=42742) ERROR 11-20 07:50:41 [core.py:708]   File "/vllm-workspace/vllm/vllm/compilation/decorators.py", line 201, in __init__
(EngineCore_DP0 pid=42742) ERROR 11-20 07:50:41 [core.py:708]     old_init(self, vllm_config=vllm_config, prefix=prefix, **kwargs)
(EngineCore_DP0 pid=42742) ERROR 11-20 07:50:41 [core.py:708]   File "/vllm-workspace/vllm/vllm/model_executor/models/qwen2.py", line 319, in __init__
(EngineCore_DP0 pid=42742) ERROR 11-20 07:50:41 [core.py:708]     self.start_layer, self.end_layer, self.layers = make_layers(
(EngineCore_DP0 pid=42742) ERROR 11-20 07:50:41 [core.py:708]                                                     ^^^^^^^^^^^^
(EngineCore_DP0 pid=42742) ERROR 11-20 07:50:41 [core.py:708]   File "/vllm-workspace/vllm/vllm/model_executor/models/utils.py", line 629, in make_layers
(EngineCore_DP0 pid=42742) ERROR 11-20 07:50:41 [core.py:708]     [PPMissingLayer() for _ in range(start_layer)] + [
(EngineCore_DP0 pid=42742) ERROR 11-20 07:50:41 [core.py:708]                                                      ^
(EngineCore_DP0 pid=42742) ERROR 11-20 07:50:41 [core.py:708]   File "/vllm-workspace/vllm/vllm/model_executor/models/utils.py", line 630, in <listcomp>
(EngineCore_DP0 pid=42742) ERROR 11-20 07:50:41 [core.py:708]     maybe_offload_to_cpu(layer_fn(prefix=f"{prefix}.{idx}"))
(EngineCore_DP0 pid=42742) ERROR 11-20 07:50:41 [core.py:708]                          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=42742) ERROR 11-20 07:50:41 [core.py:708]   File "/vllm-workspace/vllm/vllm/model_executor/models/qwen2.py", line 321, in <lambda>
(EngineCore_DP0 pid=42742) ERROR 11-20 07:50:41 [core.py:708]     lambda prefix: decoder_layer_type(config=config,
(EngineCore_DP0 pid=42742) ERROR 11-20 07:50:41 [core.py:708]                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=42742) ERROR 11-20 07:50:41 [core.py:708]   File "/vllm-workspace/vllm/vllm/model_executor/models/qwen2.py", line 219, in __init__
(EngineCore_DP0 pid=42742) ERROR 11-20 07:50:41 [core.py:708]     self.self_attn = Qwen2Attention(
(EngineCore_DP0 pid=42742) ERROR 11-20 07:50:41 [core.py:708]                      ^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=42742) ERROR 11-20 07:50:41 [core.py:708]   File "/vllm-workspace/vllm/vllm/model_executor/models/qwen2.py", line 155, in __init__
(EngineCore_DP0 pid=42742) ERROR 11-20 07:50:41 [core.py:708]     self.rotary_emb = get_rope(
(EngineCore_DP0 pid=42742) ERROR 11-20 07:50:41 [core.py:708]                       ^^^^^^^^^
(EngineCore_DP0 pid=42742) ERROR 11-20 07:50:41 [core.py:708]   File "/vllm-workspace/vllm/vllm/model_executor/layers/rotary_embedding/__init__.py", line 70, in get_rope
(EngineCore_DP0 pid=42742) ERROR 11-20 07:50:41 [core.py:708]     rotary_emb = DualChunkRotaryEmbedding(head_size, rotary_dim,
(EngineCore_DP0 pid=42742) ERROR 11-20 07:50:41 [core.py:708]                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=42742) ERROR 11-20 07:50:41 [core.py:708]   File "/vllm-workspace/vllm/vllm/model_executor/layers/rotary_embedding/dual_chunk_rope.py", line 37, in __init__
(EngineCore_DP0 pid=42742) ERROR 11-20 07:50:41 [core.py:708]     self.device = torch.device(f"cuda:{torch.cuda.current_device()}")
(EngineCore_DP0 pid=42742) ERROR 11-20 07:50:41 [core.py:708]                                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=42742) ERROR 11-20 07:50:41 [core.py:708]   File "/usr/local/python3.11.13/lib/python3.11/site-packages/torch/cuda/__init__.py", line 1026, in current_device
(EngineCore_DP0 pid=42742) ERROR 11-20 07:50:41 [core.py:708]     _lazy_init()
(EngineCore_DP0 pid=42742) ERROR 11-20 07:50:41 [core.py:708]   File "/usr/local/python3.11.13/lib/python3.11/site-packages/torch/cuda/__init__.py", line 363, in _lazy_init
(EngineCore_DP0 pid=42742) ERROR 11-20 07:50:41 [core.py:708]     raise AssertionError("Torch not compiled with CUDA enabled")
(EngineCore_DP0 pid=42742) ERROR 11-20 07:50:41 [core.py:708] AssertionError: Torch not compiled with CUDA enabled
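The traceback shows `DualChunkRotaryEmbedding.__init__` hard-coding `torch.cuda.current_device()`, which asserts on a CPU/NPU-only torch build. Until that is fixed upstream, the workaround noted above (deleting `dual_chunk_attention_config` from `config.json`) can be scripted. A minimal sketch (the helper name is mine; be aware that removing the key disables dual chunk attention, so quality on very long contexts may degrade):

```python
import json
from pathlib import Path


def strip_dual_chunk_config(config_path: Path) -> bool:
    """Remove dual_chunk_attention_config from a model's config.json.

    Returns True if the key was present and removed, False otherwise.
    """
    config = json.loads(config_path.read_text())
    if config.pop("dual_chunk_attention_config", None) is None:
        return False  # nothing to do
    config_path.write_text(json.dumps(config, indent=2))
    return True


# Example (path from the report above; adjust to your local cache):
# strip_dual_chunk_config(Path(
#     "/root/.cache/modelscope/hub/models/Qwen/Qwen2.5-7B-Instruct-1M/config.json"))
```

Back up the original `config.json` first if you want to restore dual chunk attention later.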
