
[Bug]: [NPU] Qwen-Image-2512 with USP fails at 1328Γ—1328 #845

@zyqzhang1996

Description


Your current environment

The output of python collect_env.py
800T A2

Your code version

The commit id or version of vllm-omni
commit id: 3fc4f988eabf562572821896847dc96b69a4cf74

πŸ› Describe the bug

The same failure also occurs at 1584Γ—1056 and 1056Γ—1584.

export ASCEND_RT_VISIBLE_DEVICES=4,5
export HCCL_OP_EXPANSION_MODE="AIV"

python /vllm-workspace/vllm-omni/examples/offline_inference/text_to_image/text_to_image.py \
  --model /home/weights/Qwen-Image-2512/ \
  --prompt "a cup of coffee on the table" \
  --seed 42 \
  --cfg_scale 4.0 \
  --num_images_per_prompt 1 \
  --num_inference_steps 50 \
  --height 1328 \
  --width 1328 \
  --output outputs/coffee.png \
  --cache_backend cache_dit \
  --ulysses_degree 2

Log:

[rank0]:[E119 16:32:47.013946230 compiler_depend.ts:444] operator():build/CMakeFiles/torch_npu.dir/compiler_depend.ts:1042 NPU function error: call aclnnFlashAttentionScore failed, error code is 561103
[ERROR] 2026-01-19-16:32:47 (PID:10443, Device:0, RankID:0) ERR00100 PTA call acl api failed.
EZ9999: Inner Error!
EZ9999[PID: 10443] 2026-01-19-16:32:47.486.947 (EZ9999):  get unsupported atten_mask shape, the shape is [1, 6902]. B=[1], N=[12], Sq=[6902], Skv=[6902], supported atten_mask shape can be [B, N, Sq, Skv], [B, 1, Sq, Skv], [1, 1, Sq, Skv] and [Sq, Skv].[FUNC:AnalyzeOptionalInput][FILE:flash_attention_score_tiling_general.cpp][LINE:1621]
        TraceBack (most recent call last):
       fail to analyze context info.[FUNC:GetShapeAttrsInfo][FILE:flash_attention_score_tiling_general.cpp][LINE:866]
       Tiling failed
       Tiling Failed.
       Kernel Run failed. opType: 38, FlashAttentionScore
       launch failed for FlashAttentionScore, errno:561103.

Exception raised from operator() at build/CMakeFiles/torch_npu.dir/compiler_depend.ts:1042 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0xb0 (0xffffa92848c0 in /usr/local/python3.11.13/lib/python3.11/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 0x68 (0xffffa922c140 in /usr/local/python3.11.13/lib/python3.11/site-packages/torch/lib/libc10.so)
frame #2: <unknown function> + 0x111d6b4 (0xfffdf972d6b4 in /usr/local/python3.11.13/lib/python3.11/site-packages/torch_npu/lib/libtorch_npu.so)
frame #3: <unknown function> + 0x29f0894 (0xfffdfb000894 in /usr/local/python3.11.13/lib/python3.11/site-packages/torch_npu/lib/libtorch_npu.so)
frame #4: <unknown function> + 0x9cc700 (0xfffdf8fdc700 in /usr/local/python3.11.13/lib/python3.11/site-packages/torch_npu/lib/libtorch_npu.so)
frame #5: <unknown function> + 0x9cd2dc (0xfffdf8fdd2dc in /usr/local/python3.11.13/lib/python3.11/site-packages/torch_npu/lib/libtorch_npu.so)
frame #6: <unknown function> + 0x9cb1f8 (0xfffdf8fdb1f8 in /usr/local/python3.11.13/lib/python3.11/site-packages/torch_npu/lib/libtorch_npu.so)
frame #7: <unknown function> + 0xd29cc (0xffffb6e629cc in /lib/aarch64-linux-gnu/libstdc++.so.6)
frame #8: <unknown function> + 0x80398 (0xffffb7040398 in /lib/aarch64-linux-gnu/libc.so.6)
frame #9: <unknown function> + 0xe9e9c (0xffffb70a9e9c in /lib/aarch64-linux-gnu/libc.so.6)

[rank1]:[E119 16:32:47.014632430 compiler_depend.ts:444] operator():build/CMakeFiles/torch_npu.dir/compiler_depend.ts:1042 NPU function error: call aclnnFlashAttentionScore failed, error code is 561103
[ERROR] 2026-01-19-16:32:47 (PID:10444, Device:1, RankID:1) ERR00100 PTA call acl api failed.
EZ9999: Inner Error!
EZ9999[PID: 10444] 2026-01-19-16:32:47.487.817 (EZ9999):  get unsupported atten_mask shape, the shape is [1, 6902]. B=[1], N=[12], Sq=[6902], Skv=[6902], supported atten_mask shape can be [B, N, Sq, Skv], [B, 1, Sq, Skv], [1, 1, Sq, Skv] and [Sq, Skv].[FUNC:AnalyzeOptionalInput][FILE:flash_attention_score_tiling_general.cpp][LINE:1621]
        TraceBack (most recent call last):
       fail to analyze context info.[FUNC:GetShapeAttrsInfo][FILE:flash_attention_score_tiling_general.cpp][LINE:866]
       Tiling failed
       Tiling Failed.
       Kernel Run failed. opType: 38, FlashAttentionScore
       launch failed for FlashAttentionScore, errno:561103.

Exception raised from operator() at build/CMakeFiles/torch_npu.dir/compiler_depend.ts:1042 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0xb0 (0xffffaf0b48c0 in /usr/local/python3.11.13/lib/python3.11/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 0x68 (0xffffaf05c140 in /usr/local/python3.11.13/lib/python3.11/site-packages/torch/lib/libc10.so)
frame #2: <unknown function> + 0x111d6b4 (0xfffdff98d6b4 in /usr/local/python3.11.13/lib/python3.11/site-packages/torch_npu/lib/libtorch_npu.so)
frame #3: <unknown function> + 0x29f0894 (0xfffe01260894 in /usr/local/python3.11.13/lib/python3.11/site-packages/torch_npu/lib/libtorch_npu.so)
frame #4: <unknown function> + 0x9cc700 (0xfffdff23c700 in /usr/local/python3.11.13/lib/python3.11/site-packages/torch_npu/lib/libtorch_npu.so)
frame #5: <unknown function> + 0x9cd2dc (0xfffdff23d2dc in /usr/local/python3.11.13/lib/python3.11/site-packages/torch_npu/lib/libtorch_npu.so)
frame #6: <unknown function> + 0x9cb1f8 (0xfffdff23b1f8 in /usr/local/python3.11.13/lib/python3.11/site-packages/torch_npu/lib/libtorch_npu.so)
frame #7: <unknown function> + 0xd29cc (0xffffbcca29cc in /lib/aarch64-linux-gnu/libstdc++.so.6)
frame #8: <unknown function> + 0x80398 (0xffffbce80398 in /lib/aarch64-linux-gnu/libc.so.6)
frame #9: <unknown function> + 0xe9e9c (0xffffbcee9e9c in /lib/aarch64-linux-gnu/libc.so.6)

[Stage-0] ERROR 01-19 16:32:47 [gpu_diffusion_worker.py:270] Error executing RPC: The Inner error is reported as above. The process exits for this inner error, and the current working operator name is aclnnFlashAttentionScore.
[Stage-0] ERROR 01-19 16:32:47 [gpu_diffusion_worker.py:270] Since the operator is called asynchronously, the stacktrace may be inaccurate. If you want to get the accurate stacktrace, please set the environment variable ASCEND_LAUNCH_BLOCKING=1.
[Stage-0] ERROR 01-19 16:32:47 [gpu_diffusion_worker.py:270] Note: ASCEND_LAUNCH_BLOCKING=1 will force ops to run in synchronous mode, resulting in performance degradation. Please unset ASCEND_LAUNCH_BLOCKING in time after debugging.
[Stage-0] ERROR 01-19 16:32:47 [gpu_diffusion_worker.py:270] [ERROR] 2026-01-19-16:32:47 (PID:10443, Device:0, RankID:0) ERR00100 PTA call acl api failed.
[Stage-0] ERROR 01-19 16:32:47 [gpu_diffusion_worker.py:270] Traceback (most recent call last):
[Stage-0] ERROR 01-19 16:32:47 [gpu_diffusion_worker.py:270]   File "/workspace/vllm-omni/vllm_omni/diffusion/worker/gpu_diffusion_worker.py", line 265, in execute_rpc
[Stage-0] ERROR 01-19 16:32:47 [gpu_diffusion_worker.py:270]     result = func(*args, **kwargs)
[Stage-0] ERROR 01-19 16:32:47 [gpu_diffusion_worker.py:270]              ^^^^^^^^^^^^^^^^^^^^^
[Stage-0] ERROR 01-19 16:32:47 [gpu_diffusion_worker.py:270]   File "/workspace/vllm-omni/vllm_omni/diffusion/worker/npu/npu_worker.py", line 118, in generate
[Stage-0] ERROR 01-19 16:32:47 [gpu_diffusion_worker.py:270]     return self.execute_model(requests, self.od_config)
[Stage-0] ERROR 01-19 16:32:47 [gpu_diffusion_worker.py:270]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[Stage-0] ERROR 01-19 16:32:47 [gpu_diffusion_worker.py:270]   File "/usr/local/python3.11.13/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 120, in decorate_context
[Stage-0] ERROR 01-19 16:32:47 [gpu_diffusion_worker.py:270]     return func(*args, **kwargs)
[Stage-0] ERROR 01-19 16:32:47 [gpu_diffusion_worker.py:270]            ^^^^^^^^^^^^^^^^^^^^^
[Stage-0] ERROR 01-19 16:32:47 [gpu_diffusion_worker.py:270]   File "/workspace/vllm-omni/vllm_omni/diffusion/worker/npu/npu_worker.py", line 138, in execute_model
[Stage-0] ERROR 01-19 16:32:47 [gpu_diffusion_worker.py:270]     output = self.pipeline.forward(req)
[Stage-0] ERROR 01-19 16:32:47 [gpu_diffusion_worker.py:270]              ^^^^^^^^^^^^^^^^^^^^^^^^^^
[Stage-0] ERROR 01-19 16:32:47 [gpu_diffusion_worker.py:270]   File "/workspace/vllm-omni/vllm_omni/diffusion/models/qwen_image/pipeline_qwen_image.py", line 771, in forward
[Stage-0] ERROR 01-19 16:32:47 [gpu_diffusion_worker.py:270]     latents = self.diffuse(
[Stage-0] ERROR 01-19 16:32:47 [gpu_diffusion_worker.py:270]               ^^^^^^^^^^^^^
[Stage-0] ERROR 01-19 16:32:47 [gpu_diffusion_worker.py:270]   File "/workspace/vllm-omni/vllm_omni/diffusion/models/qwen_image/pipeline_qwen_image.py", line 610, in diffuse
[Stage-0] ERROR 01-19 16:32:47 [gpu_diffusion_worker.py:270]     noise_pred = self.transformer(
[Stage-0] ERROR 01-19 16:32:47 [gpu_diffusion_worker.py:270]                  ^^^^^^^^^^^^^^^^^
[Stage-0] ERROR 01-19 16:32:47 [gpu_diffusion_worker.py:270]   File "/usr/local/python3.11.13/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1773, in _wrapped_call_impl
[Stage-0] ERROR 01-19 16:32:47 [gpu_diffusion_worker.py:270]     return self._call_impl(*args, **kwargs)
[Stage-0] ERROR 01-19 16:32:47 [gpu_diffusion_worker.py:270]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[Stage-0] ERROR 01-19 16:32:47 [gpu_diffusion_worker.py:270]   File "/usr/local/python3.11.13/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1784, in _call_impl
[Stage-0] ERROR 01-19 16:32:47 [gpu_diffusion_worker.py:270]     return forward_call(*args, **kwargs)
[Stage-0] ERROR 01-19 16:32:47 [gpu_diffusion_worker.py:270]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[Stage-0] ERROR 01-19 16:32:47 [gpu_diffusion_worker.py:270]   File "/usr/local/python3.11.13/lib/python3.11/site-packages/cache_dit/caching/cache_adapters/cache_adapter.py", line 439, in new_forward_with_hf_hook
[Stage-0] ERROR 01-19 16:32:47 [gpu_diffusion_worker.py:270]     outputs = new_forward(self, *args, **kwargs)
[Stage-0] ERROR 01-19 16:32:47 [gpu_diffusion_worker.py:270]               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[Stage-0] ERROR 01-19 16:32:47 [gpu_diffusion_worker.py:270]   File "/usr/local/python3.11.13/lib/python3.11/site-packages/cache_dit/caching/cache_adapters/cache_adapter.py", line 427, in new_forward
[Stage-0] ERROR 01-19 16:32:47 [gpu_diffusion_worker.py:270]     outputs = original_forward(*args, **kwargs)
[Stage-0] ERROR 01-19 16:32:47 [gpu_diffusion_worker.py:270]               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[Stage-0] ERROR 01-19 16:32:47 [gpu_diffusion_worker.py:270]   File "/workspace/vllm-omni/vllm_omni/diffusion/models/qwen_image/qwen_image_transformer.py", line 906, in forward
[Stage-0] ERROR 01-19 16:32:47 [gpu_diffusion_worker.py:270]     encoder_hidden_states, hidden_states = block(
[Stage-0] ERROR 01-19 16:32:47 [gpu_diffusion_worker.py:270]                                            ^^^^^^
[Stage-0] ERROR 01-19 16:32:47 [gpu_diffusion_worker.py:270]   File "/usr/local/python3.11.13/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1773, in _wrapped_call_impl
[Stage-0] ERROR 01-19 16:32:47 [gpu_diffusion_worker.py:270]     return self._call_impl(*args, **kwargs)
[Stage-0] ERROR 01-19 16:32:47 [gpu_diffusion_worker.py:270]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[Stage-0] ERROR 01-19 16:32:47 [gpu_diffusion_worker.py:270]   File "/usr/local/python3.11.13/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1784, in _call_impl
[Stage-0] ERROR 01-19 16:32:47 [gpu_diffusion_worker.py:270]     return forward_call(*args, **kwargs)
[Stage-0] ERROR 01-19 16:32:47 [gpu_diffusion_worker.py:270]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[Stage-0] ERROR 01-19 16:32:47 [gpu_diffusion_worker.py:270]   File "/usr/local/python3.11.13/lib/python3.11/site-packages/cache_dit/caching/cache_blocks/pattern_base.py", line 321, in forward
[Stage-0] ERROR 01-19 16:32:47 [gpu_diffusion_worker.py:270]     ) = self.call_Mn_blocks(  # middle
[Stage-0] ERROR 01-19 16:32:47 [gpu_diffusion_worker.py:270]         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[Stage-0] ERROR 01-19 16:32:47 [gpu_diffusion_worker.py:270]   File "/usr/local/python3.11.13/lib/python3.11/site-packages/cache_dit/caching/cache_blocks/pattern_base.py", line 448, in call_Mn_blocks
[Stage-0] ERROR 01-19 16:32:47 [gpu_diffusion_worker.py:270]     hidden_states = block(
[Stage-0] ERROR 01-19 16:32:47 [gpu_diffusion_worker.py:270]                     ^^^^^^
[Stage-0] ERROR 01-19 16:32:47 [gpu_diffusion_worker.py:270]   File "/usr/local/python3.11.13/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1773, in _wrapped_call_impl
[Stage-0] ERROR 01-19 16:32:47 [gpu_diffusion_worker.py:270]     return self._call_impl(*args, **kwargs)
[Stage-0] ERROR 01-19 16:32:47 [gpu_diffusion_worker.py:270]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[Stage-0] ERROR 01-19 16:32:47 [gpu_diffusion_worker.py:270]   File "/usr/local/python3.11.13/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1784, in _call_impl
[Stage-0] ERROR 01-19 16:32:47 [gpu_diffusion_worker.py:270]     return forward_call(*args, **kwargs)
[Stage-0] ERROR 01-19 16:32:47 [gpu_diffusion_worker.py:270]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[Stage-0] ERROR 01-19 16:32:47 [gpu_diffusion_worker.py:270]   File "/workspace/vllm-omni/vllm_omni/diffusion/models/qwen_image/qwen_image_transformer.py", line 653, in forward
[Stage-0] ERROR 01-19 16:32:47 [gpu_diffusion_worker.py:270]     attn_output = self.attn(
[Stage-0] ERROR 01-19 16:32:47 [gpu_diffusion_worker.py:270]                   ^^^^^^^^^^
[Stage-0] ERROR 01-19 16:32:47 [gpu_diffusion_worker.py:270]   File "/usr/local/python3.11.13/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1773, in _wrapped_call_impl
[Stage-0] ERROR 01-19 16:32:47 [gpu_diffusion_worker.py:270]     return self._call_impl(*args, **kwargs)
[Stage-0] ERROR 01-19 16:32:47 [gpu_diffusion_worker.py:270]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[Stage-0] ERROR 01-19 16:32:47 [gpu_diffusion_worker.py:270]   File "/usr/local/python3.11.13/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1784, in _call_impl
[Stage-0] ERROR 01-19 16:32:47 [gpu_diffusion_worker.py:270]     return forward_call(*args, **kwargs)
[Stage-0] ERROR 01-19 16:32:47 [gpu_diffusion_worker.py:270]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[Stage-0] ERROR 01-19 16:32:47 [gpu_diffusion_worker.py:270]   File "/workspace/vllm-omni/vllm_omni/diffusion/models/qwen_image/qwen_image_transformer.py", line 439, in forward
[Stage-0] ERROR 01-19 16:32:47 [gpu_diffusion_worker.py:270]     if hidden_states_mask is not None and hidden_states_mask.all():
[Stage-0] ERROR 01-19 16:32:47 [gpu_diffusion_worker.py:270]        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[Stage-0] ERROR 01-19 16:32:47 [gpu_diffusion_worker.py:270] RuntimeError: The Inner error is reported as above. The process exits for this inner error, and the current working operator name is aclnnFlashAttentionScore.
[Stage-0] ERROR 01-19 16:32:47 [gpu_diffusion_worker.py:270] Since the operator is called asynchronously, the stacktrace may be inaccurate. If you want to get the accurate stacktrace, please set the environment variable ASCEND_LAUNCH_BLOCKING=1.
[Stage-0] ERROR 01-19 16:32:47 [gpu_diffusion_worker.py:270] Note: ASCEND_LAUNCH_BLOCKING=1 will force ops to run in synchronous mode, resulting in performance degradation. Please unset ASCEND_LAUNCH_BLOCKING in time after debugging.
[Stage-0] ERROR 01-19 16:32:47 [gpu_diffusion_worker.py:270] [ERROR] 2026-01-19-16:32:47 (PID:10443, Device:0, RankID:0) ERR00100 PTA call acl api failed.
[Stage-0] ERROR 01-19 16:32:47 [gpu_diffusion_worker.py:270]
[Stage-0] ERROR 01-19 16:32:47 [gpu_diffusion_worker.py:270] Error executing RPC: The Inner error is reported as above. The process exits for this inner error, and the current working operator name is aclnnFlashAttentionScore.
[Stage-0] ERROR 01-19 16:32:47 [gpu_diffusion_worker.py:270] Since the operator is called asynchronously, the stacktrace may be inaccurate. If you want to get the accurate stacktrace, please set the environment variable ASCEND_LAUNCH_BLOCKING=1.
[Stage-0] ERROR 01-19 16:32:47 [gpu_diffusion_worker.py:270] Note: ASCEND_LAUNCH_BLOCKING=1 will force ops to run in synchronous mode, resulting in performance degradation. Please unset ASCEND_LAUNCH_BLOCKING in time after debugging.
[Stage-0] ERROR 01-19 16:32:47 [gpu_diffusion_worker.py:270] [ERROR] 2026-01-19-16:32:47 (PID:10444, Device:1, RankID:1) ERR00100 PTA call acl api failed.
[Stage-0] ERROR 01-19 16:32:47 [gpu_diffusion_worker.py:270] Traceback (most recent call last):
[Stage-0] ERROR 01-19 16:32:47 [gpu_diffusion_worker.py:270]   File "/workspace/vllm-omni/vllm_omni/diffusion/worker/gpu_diffusion_worker.py", line 265, in execute_rpc
[Stage-0] ERROR 01-19 16:32:47 [gpu_diffusion_worker.py:270]     result = func(*args, **kwargs)
[Stage-0] ERROR 01-19 16:32:47 [gpu_diffusion_worker.py:270]              ^^^^^^^^^^^^^^^^^^^^^
[Stage-0] ERROR 01-19 16:32:47 [gpu_diffusion_worker.py:270]   File "/workspace/vllm-omni/vllm_omni/diffusion/worker/npu/npu_worker.py", line 118, in generate
[Stage-0] ERROR 01-19 16:32:47 [gpu_diffusion_worker.py:270]     return self.execute_model(requests, self.od_config)
[Stage-0] ERROR 01-19 16:32:47 [gpu_diffusion_worker.py:270]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[Stage-0] ERROR 01-19 16:32:47 [gpu_diffusion_worker.py:270]   File "/usr/local/python3.11.13/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 120, in decorate_context
[Stage-0] ERROR 01-19 16:32:47 [gpu_diffusion_worker.py:270]     return func(*args, **kwargs)
[Stage-0] ERROR 01-19 16:32:47 [gpu_diffusion_worker.py:270]            ^^^^^^^^^^^^^^^^^^^^^
[Stage-0] ERROR 01-19 16:32:47 [gpu_diffusion_worker.py:270]   File "/workspace/vllm-omni/vllm_omni/diffusion/worker/npu/npu_worker.py", line 138, in execute_model
[Stage-0] ERROR 01-19 16:32:47 [gpu_diffusion_worker.py:270]     output = self.pipeline.forward(req)
[Stage-0] ERROR 01-19 16:32:47 [gpu_diffusion_worker.py:270]              ^^^^^^^^^^^^^^^^^^^^^^^^^^
[Stage-0] ERROR 01-19 16:32:47 [gpu_diffusion_worker.py:270]   File "/workspace/vllm-omni/vllm_omni/diffusion/models/qwen_image/pipeline_qwen_image.py", line 771, in forward
[Stage-0] ERROR 01-19 16:32:47 [gpu_diffusion_worker.py:270]     latents = self.diffuse(
[Stage-0] ERROR 01-19 16:32:47 [gpu_diffusion_worker.py:270]               ^^^^^^^^^^^^^
[Stage-0] ERROR 01-19 16:32:47 [gpu_diffusion_worker.py:270]   File "/workspace/vllm-omni/vllm_omni/diffusion/models/qwen_image/pipeline_qwen_image.py", line 610, in diffuse
[Stage-0] ERROR 01-19 16:32:47 [gpu_diffusion_worker.py:270]     noise_pred = self.transformer(
[Stage-0] ERROR 01-19 16:32:47 [gpu_diffusion_worker.py:270]                  ^^^^^^^^^^^^^^^^^
[Stage-0] ERROR 01-19 16:32:47 [gpu_diffusion_worker.py:270]   File "/usr/local/python3.11.13/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1773, in _wrapped_call_impl
[Stage-0] ERROR 01-19 16:32:47 [gpu_diffusion_worker.py:270]     return self._call_impl(*args, **kwargs)
[Stage-0] ERROR 01-19 16:32:47 [gpu_diffusion_worker.py:270]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[Stage-0] ERROR 01-19 16:32:47 [gpu_diffusion_worker.py:270]   File "/usr/local/python3.11.13/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1784, in _call_impl
[Stage-0] ERROR 01-19 16:32:47 [gpu_diffusion_worker.py:270]     return forward_call(*args, **kwargs)
[Stage-0] ERROR 01-19 16:32:47 [gpu_diffusion_worker.py:270]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[Stage-0] ERROR 01-19 16:32:47 [gpu_diffusion_worker.py:270]   File "/usr/local/python3.11.13/lib/python3.11/site-packages/cache_dit/caching/cache_adapters/cache_adapter.py", line 439, in new_forward_with_hf_hook
[Stage-0] ERROR 01-19 16:32:47 [gpu_diffusion_worker.py:270]     outputs = new_forward(self, *args, **kwargs)
[Stage-0] ERROR 01-19 16:32:47 [gpu_diffusion_worker.py:270]               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[Stage-0] ERROR 01-19 16:32:47 [gpu_diffusion_worker.py:270]   File "/usr/local/python3.11.13/lib/python3.11/site-packages/cache_dit/caching/cache_adapters/cache_adapter.py", line 427, in new_forward
[Stage-0] ERROR 01-19 16:32:47 [gpu_diffusion_worker.py:270]     outputs = original_forward(*args, **kwargs)
[Stage-0] ERROR 01-19 16:32:47 [gpu_diffusion_worker.py:270]               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[Stage-0] ERROR 01-19 16:32:47 [gpu_diffusion_worker.py:270]   File "/workspace/vllm-omni/vllm_omni/diffusion/models/qwen_image/qwen_image_transformer.py", line 906, in forward
[Stage-0] ERROR 01-19 16:32:47 [gpu_diffusion_worker.py:270]     encoder_hidden_states, hidden_states = block(
[Stage-0] ERROR 01-19 16:32:47 [gpu_diffusion_worker.py:270]                                            ^^^^^^
[Stage-0] ERROR 01-19 16:32:47 [gpu_diffusion_worker.py:270]   File "/usr/local/python3.11.13/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1773, in _wrapped_call_impl
[Stage-0] ERROR 01-19 16:32:47 [gpu_diffusion_worker.py:270]     return self._call_impl(*args, **kwargs)
[Stage-0] ERROR 01-19 16:32:47 [gpu_diffusion_worker.py:270]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[Stage-0] ERROR 01-19 16:32:47 [gpu_diffusion_worker.py:270]   File "/usr/local/python3.11.13/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1784, in _call_impl
[Stage-0] ERROR 01-19 16:32:47 [gpu_diffusion_worker.py:270]     return forward_call(*args, **kwargs)
[Stage-0] ERROR 01-19 16:32:47 [gpu_diffusion_worker.py:270]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[Stage-0] ERROR 01-19 16:32:47 [gpu_diffusion_worker.py:270]   File "/usr/local/python3.11.13/lib/python3.11/site-packages/cache_dit/caching/cache_blocks/pattern_base.py", line 321, in forward
[Stage-0] ERROR 01-19 16:32:47 [gpu_diffusion_worker.py:270]     ) = self.call_Mn_blocks(  # middle
[Stage-0] ERROR 01-19 16:32:47 [gpu_diffusion_worker.py:270]         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[Stage-0] ERROR 01-19 16:32:47 [gpu_diffusion_worker.py:270]   File "/usr/local/python3.11.13/lib/python3.11/site-packages/cache_dit/caching/cache_blocks/pattern_base.py", line 448, in call_Mn_blocks
[Stage-0] ERROR 01-19 16:32:47 [gpu_diffusion_worker.py:270]     hidden_states = block(
[Stage-0] ERROR 01-19 16:32:47 [gpu_diffusion_worker.py:270]                     ^^^^^^
[Stage-0] ERROR 01-19 16:32:47 [gpu_diffusion_worker.py:270]   File "/usr/local/python3.11.13/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1773, in _wrapped_call_impl
[Stage-0] ERROR 01-19 16:32:47 [gpu_diffusion_worker.py:270]     return self._call_impl(*args, **kwargs)
[Stage-0] ERROR 01-19 16:32:47 [gpu_diffusion_worker.py:270]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[Stage-0] ERROR 01-19 16:32:47 [gpu_diffusion_worker.py:270]   File "/usr/local/python3.11.13/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1784, in _call_impl
[Stage-0] ERROR 01-19 16:32:47 [gpu_diffusion_worker.py:270]     return forward_call(*args, **kwargs)
[Stage-0] ERROR 01-19 16:32:47 [gpu_diffusion_worker.py:270]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[Stage-0] ERROR 01-19 16:32:47 [gpu_diffusion_worker.py:270]   File "/workspace/vllm-omni/vllm_omni/diffusion/models/qwen_image/qwen_image_transformer.py", line 653, in forward
[Stage-0] ERROR 01-19 16:32:47 [gpu_diffusion_worker.py:270]     attn_output = self.attn(
[Stage-0] ERROR 01-19 16:32:47 [gpu_diffusion_worker.py:270]                   ^^^^^^^^^^
[Stage-0] ERROR 01-19 16:32:47 [gpu_diffusion_worker.py:270]   File "/usr/local/python3.11.13/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1773, in _wrapped_call_impl
[Stage-0] ERROR 01-19 16:32:47 [gpu_diffusion_worker.py:270]     return self._call_impl(*args, **kwargs)
[Stage-0] ERROR 01-19 16:32:47 [gpu_diffusion_worker.py:270]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[Stage-0] ERROR 01-19 16:32:47 [gpu_diffusion_worker.py:270]   File "/usr/local/python3.11.13/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1784, in _call_impl
[Stage-0] ERROR 01-19 16:32:47 [gpu_diffusion_worker.py:270]     return forward_call(*args, **kwargs)
[Stage-0] ERROR 01-19 16:32:47 [gpu_diffusion_worker.py:270]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[Stage-0] ERROR 01-19 16:32:47 [gpu_diffusion_worker.py:270]   File "/workspace/vllm-omni/vllm_omni/diffusion/models/qwen_image/qwen_image_transformer.py", line 439, in forward
[Stage-0] ERROR 01-19 16:32:47 [gpu_diffusion_worker.py:270]     if hidden_states_mask is not None and hidden_states_mask.all():
[Stage-0] ERROR 01-19 16:32:47 [gpu_diffusion_worker.py:270]        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[Stage-0] ERROR 01-19 16:32:47 [gpu_diffusion_worker.py:270] RuntimeError: The Inner error is reported as above. The process exits for this inner error, and the current working operator name is aclnnFlashAttentionScore.
[Stage-0] ERROR 01-19 16:32:47 [gpu_diffusion_worker.py:270] Since the operator is called asynchronously, the stacktrace may be inaccurate. If you want to get the accurate stacktrace, please set the environment variable ASCEND_LAUNCH_BLOCKING=1.
[Stage-0] ERROR 01-19 16:32:47 [gpu_diffusion_worker.py:270] Note: ASCEND_LAUNCH_BLOCKING=1 will force ops to run in synchronous mode, resulting in performance degradation. Please unset ASCEND_LAUNCH_BLOCKING in time after debugging.
[Stage-0] ERROR 01-19 16:32:47 [gpu_diffusion_worker.py:270] [ERROR] 2026-01-19-16:32:47 (PID:10444, Device:1, RankID:1) ERR00100 PTA call acl api failed.
[Stage-0] ERROR 01-19 16:32:47 [gpu_diffusion_worker.py:270]
[Stage-0] ERROR 01-19 16:32:47 [diffusion_engine.py:187] Generation failed: 'dict' object has no attribute 'error'
INFO 01-19 16:32:47 [log_utils.py:550] {'type': 'request_level_metrics',
INFO 01-19 16:32:47 [log_utils.py:550]  'request_id': '0_4c569671-ae5b-4ee4-a2c3-6a661d0a4f58',
INFO 01-19 16:32:47 [log_utils.py:550]  'e2e_time_ms': 189.84675407409668,
INFO 01-19 16:32:47 [log_utils.py:550]  'e2e_tpt': 0.0,
INFO 01-19 16:32:47 [log_utils.py:550]  'e2e_total_tokens': 0,
INFO 01-19 16:32:47 [log_utils.py:550]  'transfers_total_time_ms': 0.0,
INFO 01-19 16:32:47 [log_utils.py:550]  'transfers_total_bytes': 0,
INFO 01-19 16:32:47 [log_utils.py:550]  'stages': {0: {'stage_gen_time_ms': 183.68911743164062,
INFO 01-19 16:32:47 [log_utils.py:550]                 'num_tokens_out': 0,
INFO 01-19 16:32:47 [log_utils.py:550]                 'num_tokens_in': 0}}}
Processed prompts: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 1/1 [00:00<00:00,  5.28img/s, est. speed stage-0 img/s: 0.00, avg e2e_lat: 0.0ms]
INFO 01-19 16:32:47 [omni.py:782] [Summary] {'e2e_requests': 1,
INFO 01-19 16:32:47 [omni.py:782]  'e2e_total_time_ms': 191.65873527526855,
INFO 01-19 16:32:47 [omni.py:782]  'e2e_sum_time_ms': 189.84675407409668,
INFO 01-19 16:32:47 [omni.py:782]  'e2e_total_tokens': 0,
INFO 01-19 16:32:47 [omni.py:782]  'e2e_avg_time_per_request_ms': 189.84675407409668,
INFO 01-19 16:32:47 [omni.py:782]  'e2e_avg_tokens_per_s': 0.0,
INFO 01-19 16:32:47 [omni.py:782]  'wall_time_ms': 191.65873527526855,
INFO 01-19 16:32:47 [omni.py:782]  'final_stage_id': {'0_4c569671-ae5b-4ee4-a2c3-6a661d0a4f58': 0},
INFO 01-19 16:32:47 [omni.py:782]  'stages': [{'stage_id': 0,
INFO 01-19 16:32:47 [omni.py:782]              'requests': 1,
INFO 01-19 16:32:47 [omni.py:782]              'tokens': 0,
INFO 01-19 16:32:47 [omni.py:782]              'total_time_ms': 190.59062004089355,
INFO 01-19 16:32:47 [omni.py:782]              'avg_time_per_request_ms': 190.59062004089355,
INFO 01-19 16:32:47 [omni.py:782]              'avg_tokens_per_s': 0.0}],
INFO 01-19 16:32:47 [omni.py:782]  'transfers': []}
Adding requests:   0%|                                                                                                      | 0/1 [00:00<?, ?it/s]
[Stage-0] INFO 01-19 16:32:47 [omni_stage.py:677] Received shutdown signal
[Stage-0] INFO 01-19 16:32:47 [gpu_diffusion_worker.py:304] Worker 1: Received shutdown message
[Stage-0] INFO 01-19 16:32:47 [gpu_diffusion_worker.py:304] Worker 0: Received shutdown message
[Stage-0] INFO 01-19 16:32:47 [gpu_diffusion_worker.py:325] event loop terminated.
[Stage-0] INFO 01-19 16:32:47 [gpu_diffusion_worker.py:325] event loop terminated.
[Stage-0] INFO 01-19 16:32:47 [npu_worker.py:251] Worker 0: Shutdown complete.
[Stage-0] INFO 01-19 16:32:47 [npu_worker.py:251] Worker 1: Shutdown complete.
Total generation time: 5.1984 seconds (5198.37 ms)
INFO 01-19 16:32:52 [text_to_image.py:168] Outputs: [OmniRequestOutput(request_id='', finished=True, stage_id=0, final_output_type='image', request_output=[namespace(request_id='0_4c569671-ae5b-4ee4-a2c3-6a661d0a4f58', output=None)], images=[], prompt=None, latents=None, metrics={})]
Traceback (most recent call last):
  File "/vllm-workspace/vllm-omni/examples/offline_inference/text_to_image/text_to_image.py", line 198, in <module>
    main()
  File "/vllm-workspace/vllm-omni/examples/offline_inference/text_to_image/text_to_image.py", line 177, in main
    raise ValueError("Invalid request_output structure or missing 'images' key")
ValueError: Invalid request_output structure or missing 'images' key
[ERROR] 2026-01-19-16:32:52 (PID:10169, Device:-1, RankID:-1) ERR99999 UNKNOWN applicaiton exception
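For context on the root cause: the EZ9999 message above states that aclnnFlashAttentionScore rejects a 2-D atten_mask of shape [1, 6902]; the supported layouts are [B, N, Sq, Skv], [B, 1, Sq, Skv], [1, 1, Sq, Skv], and [Sq, Skv]. A minimal sketch in plain PyTorch of how a per-token [B, S] padding mask could be broadcast to the supported [B, 1, Sq, Skv] layout (the function name, the True-means-masked-out convention, and the self-attention Sq == Skv assumption are all hypothetical, not code from vllm-omni):

```python
import torch

def expand_attn_mask(pad_mask: torch.Tensor) -> torch.Tensor:
    """Expand a [B, S] padding mask to a [B, 1, Sq, Skv] attention mask.

    pad_mask: bool tensor, True where a token is valid.
    Returns a bool mask where True means "mask this position out"
    (an assumed convention for this sketch; the kernel's actual
    convention should be checked against the CANN documentation).
    """
    b, s = pad_mask.shape
    # For self-attention (Sq == Skv), a query may attend to key j
    # only when token j is a valid (non-padding) token.
    keep = pad_mask[:, None, None, :].expand(b, 1, s, s)
    return ~keep  # True = masked out

pad = torch.tensor([[True, True, False]])  # [1, 3]; last token is padding
mask4d = expand_attn_mask(pad)
print(mask4d.shape)  # torch.Size([1, 1, 3, 3])
```

This only illustrates the shape contract from the error message; the actual fix likely belongs where the USP attention path builds hidden_states_mask before calling the NPU flash-attention kernel.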


Labels

NPU (PR related to Ascend NPU), bug (Something isn't working)