
[Bug]: The qkv_rmsnorm_rope operator is not registered due to a conflict between the triton and triton-ascend versions in the A+X environment #6737

@lianyiibo

Description

Your current environment

The output of `python collect_env.py`
Collecting environment information...
PyTorch version: 2.9.0+cpu
Is debug build: False

OS: Ubuntu 22.04.5 LTS (x86_64)
GCC version: (Ubuntu 11.4.0-1ubuntu1~22.04.2) 11.4.0
Clang version: Could not collect
CMake version: version 4.2.1
Libc version: glibc-2.35

Python version: 3.11.13 (main, Nov 20 2025, 16:03:05) [GCC 11.4.0] (64-bit runtime)
Python platform: Linux-6.8.0-71-generic-x86_64-with-glibc2.35

CPU:
Architecture:                         x86_64
CPU op-mode(s):                       32-bit, 64-bit
Address sizes:                        46 bits physical, 57 bits virtual
Byte Order:                           Little Endian
CPU(s):                               240
On-line CPU(s) list:                  0-239
Vendor ID:                            GenuineIntel
BIOS Vendor ID:                       Intel(R) Corporation
Model name:                           Intel(R) Xeon(R) Platinum 8490H
BIOS Model name:                      Intel(R) Xeon(R) Platinum 8490H
CPU family:                           6
Model:                                143
Thread(s) per core:                   2
Core(s) per socket:                   60
Socket(s):                            2
Stepping:                             6
BogoMIPS:                             3800.00
Flags:                                fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf tsc_known_freq pni pclmulqdq dtes64 ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb cat_l3 cat_l2 cdp_l3 intel_ppin cdp_l2 ssbd mba ibrs ibpb stibp ibrs_enhanced tpr_shadow flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid cqm rdt_a avx512f avx512dq rdseed adx smap avx512ifma clflushopt clwb intel_pt avx512cd sha_ni avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local split_lock_detect user_shstk avx_vnni avx512_bf16 wbnoinvd dtherm ida arat pln pts vnmi avx512vbmi umip pku ospke waitpkg avx512_vbmi2 gfni vaes vpclmulqdq avx512_vnni avx512_bitalg tme avx512_vpopcntdq la57 rdpid bus_lock_detect cldemote movdiri movdir64b enqcmd fsrm md_clear serialize tsxldtrk pconfig arch_lbr ibt amx_bf16 avx512_fp16 amx_tile amx_int8 flush_l1d arch_capabilities
Virtualization:                       VT-x
L1d cache:                            5.6 MiB (120 instances)
L1i cache:                            3.8 MiB (120 instances)
L2 cache:                             240 MiB (120 instances)
L3 cache:                             225 MiB (2 instances)
NUMA node(s):                         2
NUMA node0 CPU(s):                    0-59,120-179
NUMA node1 CPU(s):                    60-119,180-239
Vulnerability Gather data sampling:   Not affected
Vulnerability Itlb multihit:          Not affected
Vulnerability L1tf:                   Not affected
Vulnerability Mds:                    Not affected
Vulnerability Meltdown:               Not affected
Vulnerability Mmio stale data:        Not affected
Vulnerability Reg file data sampling: Not affected
Vulnerability Retbleed:               Not affected
Vulnerability Spec rstack overflow:   Not affected
Vulnerability Spec store bypass:      Mitigation; Speculative Store Bypass disabled via prctl
Vulnerability Spectre v1:             Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2:             Mitigation; Enhanced / Automatic IBRS; IBPB conditional; RSB filling; PBRSB-eIBRS SW sequence; BHI BHI_DIS_S
Vulnerability Srbds:                  Not affected
Vulnerability Tsx async abort:        Not affected

Versions of relevant libraries:
[pip3] numpy==1.26.4
[pip3] pyzmq==27.1.0
[pip3] torch==2.9.0+cpu
[pip3] torch_npu==2.9.0
[pip3] torchvision==0.24.0+cpu
[pip3] transformers==4.57.3
[pip3] triton-ascend==3.2.0
[conda] Could not collect
vLLM Version: 0.14.1.dev1+gd68209402 (git sha: d68209402)
vLLM Ascend Version: 0.14.0rc1

ENV Variables:
ATB_OPSRUNNER_KERNEL_CACHE_LOCAL_COUNT=1
ATB_STREAM_SYNC_EVERY_RUNNER_ENABLE=0
ATB_OPSRUNNER_SETUP_CACHE_ENABLE=1
ATB_WORKSPACE_MEM_ALLOC_GLOBAL=1
ATB_DEVICE_TILING_BUFFER_BLOCK_NUM=32
ATB_STREAM_SYNC_EVERY_KERNEL_ENABLE=0
ATB_OPSRUNNER_KERNEL_CACHE_GLOABL_COUNT=5
ATB_HOME_PATH=/root/miniconda/envs/cann-env-8.5.0/Ascend/nnal/atb/latest/atb/cxx_abi_1
ASCEND_TOOLKIT_HOME=/root/miniconda/envs/cann-env-8.5.0/Ascend/cann-8.5.0
ATB_COMPARE_TILING_EVERY_KERNEL=0
ASCEND_OPP_PATH=/root/miniconda/envs/cann-env-8.5.0/Ascend/cann-8.5.0/opp
LD_LIBRARY_PATH=/root/miniconda/envs/cann-env-8.5.0/Ascend/nnal/atb/latest/atb/cxx_abi_1/lib:/root/miniconda/envs/cann-env-8.5.0/Ascend/nnal/atb/latest/atb/cxx_abi_1/examples:/root/miniconda/envs/cann-env-8.5.0/Ascend/nnal/atb/latest/atb/cxx_abi_1/tests/atbopstest:/root/miniconda/envs/cann-env-8.5.0/Ascend/cann-8.5.0/lib64:/root/miniconda/envs/cann-env-8.5.0/Ascend/cann-8.5.0/lib64/plugin/opskernel:/root/miniconda/envs/cann-env-8.5.0/Ascend/cann-8.5.0/lib64/plugin/nnengine:/root/miniconda/envs/cann-env-8.5.0/Ascend/cann-8.5.0/opp/built-in/op_impl/ai_core/tbe/op_tiling/lib/linux/x86_64:/root/miniconda/envs/cann-env-8.5.0/Ascend/cann-8.5.0/tools/aml/lib64:/root/miniconda/envs/cann-env-8.5.0/Ascend/cann-8.5.0/tools/aml/lib64/plugin:/usr/local/Ascend/driver/lib64:/usr/local/Ascend/driver/lib64/common:/usr/local/Ascend/driver/lib64/driver:/usr/local/Ascend/nnal/atb/latest/atb/cxx_abi_1/lib:/usr/local/Ascend/nnal/atb/latest/atb/cxx_abi_1/examples:/usr/local/Ascend/nnal/atb/latest/atb/cxx_abi_1/tests/atbopstest:/usr/local/Ascend/ascend-toolkit/latest/tools/aml/lib64:/usr/local/Ascend/ascend-toolkit/latest/tools/aml/lib64/plugin:/usr/local/Ascend/ascend-toolkit/latest/lib64:/usr/local/Ascend/ascend-toolkit/latest/lib64/plugin/opskernel:/usr/local/Ascend/ascend-toolkit/latest/lib64/plugin/nnengine:/usr/local/Ascend/ascend-toolkit/latest/opp/built-in/op_impl/ai_core/tbe/op_tiling/lib/linux/x86_64:/usr/local/Ascend/nnal/atb/latest/atb/cxx_abi_1/lib:/usr/local/Ascend/nnal/atb/latest/atb/cxx_abi_1/examples:/usr/local/Ascend/nnal/atb/latest/atb/cxx_abi_1/tests/atbopstest:/usr/local/Ascend/ascend-toolkit/latest/tools/aml/lib64:/usr/local/Ascend/ascend-toolkit/latest/tools/aml/lib64/plugin:/usr/local/Ascend/ascend-toolkit/latest/lib64:/usr/local/Ascend/ascend-toolkit/latest/lib64/plugin/opskernel:/usr/local/Ascend/ascend-toolkit/latest/lib64/plugin/nnengine:/usr/local/Ascend/ascend-toolkit/latest/opp/built-in/op_impl/ai_core/tbe/op_tiling/lib/linux/x86_64:/usr/local/Ascend/ascend-t
oolkit/latest/tools/aml/lib64:/usr/local/Ascend/ascend-toolkit/latest/tools/aml/lib64/plugin:/usr/local/Ascend/ascend-toolkit/latest/lib64:/usr/local/Ascend/ascend-toolkit/latest/lib64/plugin/opskernel:/usr/local/Ascend/ascend-toolkit/latest/lib64/plugin/nnengine:/usr/local/Ascend/ascend-toolkit/latest/opp/built-in/op_impl/ai_core/tbe/op_tiling/lib/linux/x86_64:/usr/local/Ascend/nnal/atb/latest/atb/cxx_abi_1/lib:/usr/local/Ascend/nnal/atb/latest/atb/cxx_abi_1/examples:/usr/local/Ascend/nnal/atb/latest/atb/cxx_abi_1/tests/atbopstest:/usr/local/Ascend/ascend-toolkit/latest/tools/aml/lib64:/usr/local/Ascend/ascend-toolkit/latest/tools/aml/lib64/plugin:/usr/local/Ascend/ascend-toolkit/latest/lib64:/usr/local/Ascend/ascend-toolkit/latest/lib64/plugin/opskernel:/usr/local/Ascend/ascend-toolkit/latest/lib64/plugin/nnengine:/usr/local/Ascend/ascend-toolkit/latest/opp/built-in/op_impl/ai_core/tbe/op_tiling/lib/linux/x86_64:/usr/local/Ascend/nnal/atb/latest/atb/cxx_abi_0/lib:/usr/local/Ascend/nnal/atb/latest/atb/cxx_abi_0/examples:/usr/local/Ascend/nnal/atb/latest/atb/cxx_abi_0/tests/atbopstest:/usr/local/Ascend/ascend-toolkit/latest/tools/aml/lib64:/usr/local/Ascend/ascend-toolkit/latest/tools/aml/lib64/plugin:/usr/local/Ascend/ascend-toolkit/latest/lib64:/usr/local/Ascend/ascend-toolkit/latest/lib64/plugin/opskernel:/usr/local/Ascend/ascend-toolkit/latest/lib64/plugin/nnengine:/usr/local/Ascend/ascend-toolkit/latest/opp/built-in/op_impl/ai_core/tbe/op_tiling:/usr/local/Ascend/driver/lib64/common/:/usr/local/Ascend/driver/lib64/driver/:
ASCEND_AICPU_PATH=/root/miniconda/envs/cann-env-8.5.0/Ascend/cann-8.5.0
ATB_STREAM_SYNC_EVERY_OPERATION_ENABLE=0
ASCEND_HOME_PATH=/root/miniconda/envs/cann-env-8.5.0/Ascend/cann-8.5.0
ATB_MATMUL_SHUFFLE_K_ENABLE=1
ATB_WORKSPACE_MEM_ALLOC_ALG_TYPE=1
ATB_HOST_TILING_BUFFER_BLOCK_NUM=128
ATB_SHARE_MEMORY_NAME_SUFFIX=
TORCH_DEVICE_BACKEND_AUTOLOAD=1
PYTORCH_NVML_BASED_CUDA_CHECK=1
TORCHINDUCTOR_COMPILE_THREADS=1


NPU:
+------------------------------------------------------------------------------------------------+
| npu-smi 25.5.0                   Version: 25.5.0                                               |
+---------------------------+---------------+----------------------------------------------------+
| NPU   Name                | Health        | Power(W)    Temp(C)           Hugepages-Usage(page)|
| Chip                      | Bus-Id        | AICore(%)   Memory-Usage(MB)  HBM-Usage(MB)        |
+===========================+===============+====================================================+
| 0     910B2C              | OK            | 92.1        47                0    / 0             |
| 0                         | 0000:5A:00.0  | 0           0    / 0          3441 / 65536         |
+===========================+===============+====================================================+
| 1     910B2C              | OK            | 92.7        51                0    / 0             |
| 0                         | 0000:19:00.0  | 0           0    / 0          3427 / 65536         |
+===========================+===============+====================================================+
| 2     910B2C              | OK            | 98.4        50                0    / 0             |
| 0                         | 0000:49:00.0  | 0           0    / 0          3426 / 65536         |
+===========================+===============+====================================================+
| 3     910B2C              | OK            | 96.7        49                0    / 0             |
| 0                         | 0000:39:00.0  | 0           0    / 0          3426 / 65536         |
+===========================+===============+====================================================+
| 4     910B2C              | OK            | 95.2        48                0    / 0             |
| 0                         | 0000:DA:00.0  | 0           0    / 0          3437 / 65536         |
+===========================+===============+====================================================+
| 5     910B2C              | OK            | 101.2       49                0    / 0             |
| 0                         | 0000:99:00.0  | 0           0    / 0          3426 / 65536         |
+===========================+===============+====================================================+
| 6     910B2C              | OK            | 95.1        49                0    / 0             |
| 0                         | 0000:B8:00.0  | 0           0    / 0          3427 / 65536         |
+===========================+===============+====================================================+
| 7     910B2C              | OK            | 97.2        49                0    / 0             |
| 0                         | 0000:C8:00.0  | 0           0    / 0          3426 / 65536         |
+===========================+===============+====================================================+
| 8     910B2C              | OK            | 97.1        51                0    / 0             |
| 0                         | 0000:59:00.0  | 0           0    / 0          3434 / 65536         |
+===========================+===============+====================================================+
| 9     910B2C              | OK            | 98.7        48                0    / 0             |
| 0                         | 0000:18:00.0  | 0           0    / 0          3403 / 65536         |
+===========================+===============+====================================================+
| 10    910B2C              | OK            | 89.3        48                0    / 0             |
| 0                         | 0000:48:00.0  | 0           0    / 0          3402 / 65536         |
+===========================+===============+====================================================+
| 11    910B2C              | OK            | 101.1       50                0    / 0             |
| 0                         | 0000:38:00.0  | 0           0    / 0          3203 / 65536         |
+===========================+===============+====================================================+
| 12    910B2C              | OK            | 96.5        49                0    / 0             |
| 0                         | 0000:D9:00.0  | 0           0    / 0          3399 / 65536         |
+===========================+===============+====================================================+
| 13    910B2C              | OK            | 99.3        48                0    / 0             |
| 0                         | 0000:98:00.0  | 0           0    / 0          3399 / 65536         |
+===========================+===============+====================================================+
| 14    910B2C              | OK            | 97.1        48                0    / 0             |
| 0                         | 0000:B9:00.0  | 0           0    / 0          3399 / 65536         |
+===========================+===============+====================================================+
| 15    910B2C              | OK            | 101.3       49                0    / 0             |
| 0                         | 0000:C9:00.0  | 0           0    / 0          34810/ 65536         |
+===========================+===============+====================================================+
+---------------------------+---------------+----------------------------------------------------+
| NPU     Chip              | Process id    | Process name             | Process memory(MB)      |
+===========================+===============+====================================================+
| No running processes found in NPU 0                                                            |
+===========================+===============+====================================================+
| No running processes found in NPU 1                                                            |
+===========================+===============+====================================================+
| No running processes found in NPU 2                                                            |
+===========================+===============+====================================================+
| No running processes found in NPU 3                                                            |
+===========================+===============+====================================================+
| No running processes found in NPU 4                                                            |
+===========================+===============+====================================================+
| No running processes found in NPU 5                                                            |
+===========================+===============+====================================================+
| No running processes found in NPU 6                                                            |
+===========================+===============+====================================================+
| No running processes found in NPU 7                                                            |
+===========================+===============+====================================================+
| No running processes found in NPU 8                                                            |
+===========================+===============+====================================================+
| No running processes found in NPU 9                                                            |
+===========================+===============+====================================================+
| No running processes found in NPU 10                                                           |
+===========================+===============+====================================================+
| No running processes found in NPU 11                                                           |
+===========================+===============+====================================================+
| No running processes found in NPU 12                                                           |
+===========================+===============+====================================================+
| No running processes found in NPU 13                                                           |
+===========================+===============+====================================================+
| No running processes found in NPU 14                                                           |
+===========================+===============+====================================================+
| 15      0                 | 2337259       |                          | 31463                   |
+===========================+===============+====================================================+

CANN:
package_name=Ascend-cann-toolkit
version=8.5.0
innerversion=V100R001C23SPC002B210
compatible_version=[V100R001C15],[V100R001C18],[V100R001C19],[V100R001C20],[V100R001C21],[V100R001C23]
arch=x86_64
os=linux
path=/root/miniconda/envs/cann-env-8.5.0/Ascend/ascend-toolkit/latest/x86_64-linux/

๐Ÿ› Describe the bug

Minimal example

import os

os.environ["VLLM_WORKER_MULTIPROC_METHOD"] = "spawn"
from vllm import LLM, SamplingParams


def main():
    prompts = [
        "Hello, my name is is",
        "The president of the United States is",
        "The capital of France is",
        "The future of AI is",
    ]

    # Create a sampling params object.
    sampling_params = SamplingParams(max_tokens=100, temperature=0.0)
    model_name = "/data2/weights/Qwen3-0.6B/"
    # Create an LLM.
    llm = LLM(model=model_name)

    # Generate texts from the prompts.
    outputs = llm.generate(prompts, sampling_params)
    for output in outputs:
        prompt = output.prompt
        generated_text = output.outputs[0].text
        print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")


if __name__ == "__main__":
    main()

Error Message

(EngineCore_DP0 pid=934497)   File "/vllm-workspace/vllm-ascend/vllm_ascend/compilation/acl_graph.py", line 110, in __call__
(EngineCore_DP0 pid=934497)     return self.runnable(*args, **kwargs)
(EngineCore_DP0 pid=934497)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=934497)   File "/vllm-workspace/vllm/vllm/compilation/piecewise_backend.py", line 190, in __call__
(EngineCore_DP0 pid=934497)     self._maybe_compile_for_range_entry(range_entry, args)
(EngineCore_DP0 pid=934497)   File "/vllm-workspace/vllm/vllm/compilation/piecewise_backend.py", line 155, in _maybe_compile_for_range_entry
(EngineCore_DP0 pid=934497)     range_entry.runnable = self.vllm_backend.compiler_manager.compile(
(EngineCore_DP0 pid=934497)                            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=934497)   File "/vllm-workspace/vllm/vllm/compilation/backends.py", line 244, in compile
(EngineCore_DP0 pid=934497)     compiled_graph, handle = self.compiler.compile(
(EngineCore_DP0 pid=934497)                              ^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=934497)   File "/vllm-workspace/vllm-ascend/vllm_ascend/compilation/compiler_interface.py", line 139, in compile
(EngineCore_DP0 pid=934497)     return fusion_pass_compile(graph, example_inputs, compiler_config, compile_range, key)
(EngineCore_DP0 pid=934497)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=934497)   File "/vllm-workspace/vllm-ascend/vllm_ascend/compilation/compiler_interface.py", line 58, in fusion_pass_compile
(EngineCore_DP0 pid=934497)     compiled_fn = compile_fx(
(EngineCore_DP0 pid=934497)                   ^^^^^^^^^^^
(EngineCore_DP0 pid=934497)   File "/vllm-workspace/vllm-ascend/vllm_ascend/compilation/compiler_interface.py", line 41, in compile_fx
(EngineCore_DP0 pid=934497)     return aot_autograd(fw_compiler=inner_compile)(graph, example_inputs)
(EngineCore_DP0 pid=934497)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=934497)   File "/usr/local/python3.11.13/lib/python3.11/site-packages/torch/_dynamo/backends/common.py", line 117, in __call__
(EngineCore_DP0 pid=934497)     cg = aot_module_simplified(gm, example_inputs, **self.kwargs)
(EngineCore_DP0 pid=934497)          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=934497)   File "/usr/local/python3.11.13/lib/python3.11/site-packages/torch/_functorch/aot_autograd.py", line 1106, in aot_module_simplified
(EngineCore_DP0 pid=934497)     compiled_fn, _ = aot_stage2_compile(aot_state, aot_graph_capture)
(EngineCore_DP0 pid=934497)                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=934497)   File "/usr/local/python3.11.13/lib/python3.11/site-packages/torch/_functorch/_aot_autograd/graph_compile.py", line 242, in aot_stage2_compile
(EngineCore_DP0 pid=934497)     return aot_stage2_inference(aot_state, aot_graph_capture)
(EngineCore_DP0 pid=934497)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=934497)   File "/usr/local/python3.11.13/lib/python3.11/site-packages/torch/_functorch/_aot_autograd/graph_compile.py", line 315, in aot_stage2_inference
(EngineCore_DP0 pid=934497)     compiled_fw = compiler(fw_module, updated_flat_args)
(EngineCore_DP0 pid=934497)                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=934497)   File "/vllm-workspace/vllm-ascend/vllm_ascend/compilation/compiler_interface.py", line 53, in compile_inner
(EngineCore_DP0 pid=934497)     graph = current_pass_manager(graph)
(EngineCore_DP0 pid=934497)             ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=934497)   File "/vllm-workspace/vllm-ascend/vllm_ascend/compilation/graph_fusion_pass_manager.py", line 41, in __call__
(EngineCore_DP0 pid=934497)     pass_(graph)
(EngineCore_DP0 pid=934497)   File "/vllm-workspace/vllm-ascend/vllm_ascend/compilation/passes/qknorm_rope_fusion_pass.py", line 214, in __call__
(EngineCore_DP0 pid=934497)     self.matched_count = self.pattern_match_passes.apply(graph)
(EngineCore_DP0 pid=934497)                          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=934497)   File "/usr/local/python3.11.13/lib/python3.11/site-packages/torch/_inductor/pattern_matcher.py", line 1982, in apply
(EngineCore_DP0 pid=934497)     if is_match(m) and guard_or_false(entry.extra_check(m)):
(EngineCore_DP0 pid=934497)                                       ^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=934497)   File "/usr/local/python3.11.13/lib/python3.11/site-packages/torch/_inductor/pattern_matcher.py", line 1520, in check_fn
(EngineCore_DP0 pid=934497)     match.replacement_graph = trace_fn(replace_fn, args)
(EngineCore_DP0 pid=934497)                               ^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=934497)   File "/usr/local/python3.11.13/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 120, in decorate_context
(EngineCore_DP0 pid=934497)     return func(*args, **kwargs)
(EngineCore_DP0 pid=934497)            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=934497)   File "/usr/local/python3.11.13/lib/python3.11/site-packages/torch/_inductor/pattern_matcher.py", line 2115, in fwd_only
(EngineCore_DP0 pid=934497)     gm = make_fx(fn, decompositions, tracing_mode="real")(*args)
(EngineCore_DP0 pid=934497)          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=934497)   File "/usr/local/python3.11.13/lib/python3.11/site-packages/torch/fx/experimental/proxy_tensor.py", line 2429, in wrapped
(EngineCore_DP0 pid=934497)     return make_fx_tracer.trace(f, *args)
(EngineCore_DP0 pid=934497)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=934497)   File "/usr/local/python3.11.13/lib/python3.11/site-packages/torch/fx/experimental/proxy_tensor.py", line 2356, in trace
(EngineCore_DP0 pid=934497)     return self._trace_inner(f, *args)
(EngineCore_DP0 pid=934497)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=934497)   File "/usr/local/python3.11.13/lib/python3.11/site-packages/torch/fx/experimental/proxy_tensor.py", line 2318, in _trace_inner
(EngineCore_DP0 pid=934497)     t = dispatch_trace(
(EngineCore_DP0 pid=934497)         ^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=934497)   File "/usr/local/python3.11.13/lib/python3.11/site-packages/torch/_compile.py", line 53, in inner
(EngineCore_DP0 pid=934497)     return disable_fn(*args, **kwargs)
(EngineCore_DP0 pid=934497)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=934497)   File "/usr/local/python3.11.13/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py", line 1044, in _fn
(EngineCore_DP0 pid=934497)     return fn(*args, **kwargs)
(EngineCore_DP0 pid=934497)            ^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=934497)   File "/usr/local/python3.11.13/lib/python3.11/site-packages/torch/fx/experimental/proxy_tensor.py", line 1303, in dispatch_trace
(EngineCore_DP0 pid=934497)     graph = tracer.trace(root, concrete_args)  # type: ignore[arg-type]
(EngineCore_DP0 pid=934497)             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=934497)   File "/usr/local/python3.11.13/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py", line 1044, in _fn
(EngineCore_DP0 pid=934497)     return fn(*args, **kwargs)
(EngineCore_DP0 pid=934497)            ^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=934497)   File "/usr/local/python3.11.13/lib/python3.11/site-packages/torch/fx/_symbolic_trace.py", line 868, in trace
(EngineCore_DP0 pid=934497)     (self.create_arg(fn(*args)),),
(EngineCore_DP0 pid=934497)                      ^^^^^^^^^
(EngineCore_DP0 pid=934497)   File "/usr/local/python3.11.13/lib/python3.11/site-packages/torch/fx/experimental/proxy_tensor.py", line 1361, in wrapped
(EngineCore_DP0 pid=934497)     out = f(*tensors)  # type:ignore[call-arg]
(EngineCore_DP0 pid=934497)           ^^^^^^^^^^^
(EngineCore_DP0 pid=934497)   File "/vllm-workspace/vllm-ascend/vllm_ascend/compilation/passes/qknorm_rope_fusion_pass.py", line 73, in replacement
(EngineCore_DP0 pid=934497)     results = torch.ops.vllm.qkv_rmsnorm_rope(
(EngineCore_DP0 pid=934497)               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=934497)   File "/usr/local/python3.11.13/lib/python3.11/site-packages/torch/_ops.py", line 1365, in __getattr__
(EngineCore_DP0 pid=934497)     raise AttributeError(
(EngineCore_DP0 pid=934497) AttributeError: '_OpNamespace' 'vllm' object has no attribute 'qkv_rmsnorm_rope'

Analysis of the cause of the problem

This issue is consistent with #6578.

Although the symptom appears to be a problem with the qkv_rmsnorm_rope operator itself, reading the source code shows that the root cause is this guard:

if HAS_TRITON:
    import vllm_ascend.ops.triton.linearnorm.split_qkv_rmsnorm_rope  # noqa

HAS_TRITON is detected as False, so the import above never runs and the operator is never registered. The fusion pass nevertheless rewrites the matched pattern into a call to the unregistered operator:
results = torch.ops.vllm.qkv_rmsnorm_rope(
    input=qkv,
    q_weight=q_weight,
    k_weight=k_weight,
    q_hidden_size=self.q_size,
    kv_hidden_size=self.kv_size,
    head_dim=self.head_dim,
    eps=self.eps,
    q_bias=None,
    k_bias=None,
    cos_sin_cache=cos_sin_cache,
    positions=positions,
)

Further analysis shows that in the x86 environment, the xgrammar package's dependencies additionally install triton 3.6 (which is not pulled in on arm). This package conflicts with the already-installed triton-ascend 3.2.0 during actual framework inference, so triton driver detection cannot find the correct triton-ascend device.
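A quick way to confirm which distribution owns the top-level triton module is a short probe like the one below. This is a diagnostic sketch, not part of the issue; the distribution names are taken from the environment dump above:

```python
import importlib.metadata as md
import importlib.util

# List the installed distributions that can claim the top-level "triton" module.
for dist in ("triton", "triton-ascend"):
    try:
        print(f"{dist}=={md.version(dist)}")
    except md.PackageNotFoundError:
        print(f"{dist} is not installed")

# Which module actually resolves when "import triton" runs?
spec = importlib.util.find_spec("triton")
print("triton resolves to:", spec.origin if spec else "nothing")
```

If both distributions are installed, whichever one most recently wrote the triton/ package into site-packages typically wins, which is how the x86-only triton 3.6 can shadow triton-ascend 3.2.0.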

Possible solutions

There are currently several possible solutions for the A+X inference environment:

  1. Push for the upgrade of triton-ascend to 3.6.x+ as soon as possible and adapt vllm-ascend to it, so that the operator can be called normally despite the current version mismatch.
  2. For the A+X environment, add a patch in vllm-ascend that registers entry-point support for triton 3.6.0, so that the triton-ascend module can be discovered during triton driver detection.
  3. Handle the HAS_TRITON=False case correctly and fall back to the generic rmsnorm operator in the A+X environment.

If you have any other solution ideas, please feel free to discuss!
