[Question] Can the Qwen3-rerank model run on NPU?

I tried to run the Qwen3-rerank/embedding models on NPU, but hit an issue with the rerank model. Can the rerank model run on NPU?
The embedding model runs successfully.

Env: 
```
root@edgeainas:/workspace/openvino.genai/tools/llm_bench# pip list | grep openvino
openvino                                 2025.4.0rc3         20398
openvino-genai                           2025.4.0.0          1899
openvino-telemetry                       2025.2.0
openvino-tokenizers                      2025.4.0.0rc3 
```

Convert Model:
`optimum-cli export openvino OpenVINO/Qwen3-Reranker-0.6B-int4-cw-ov --framework pt --task text-generation --weight-format int4 --sym --group-size -1 --ratio 1 --trust-remote-code --model Qwen/Qwen3-Reranker-0.6B`
Running:
```
cd openvino.genai/tools/llm_bench
python3 benchmark.py -m /root/models/OpenVINO/Qwen3-reranker-0.6B-int4-cw-ov -n 2 --task text_rerank --reranking_max_length 1024  -d NPU
```

Issue:
```
Multiple distributions found for package optimum. Picked distribution: optimum
[ INFO ] ==SUCCESS FOUND==: use_case: text_rerank, model_type: qwen3, model_Name: Qwen3-reranker-0.6B-int4-cw-ov
[ INFO ] OV Config={'CACHE_DIR': ''}
[ INFO ] Model path=/root/models/OpenVINO/Qwen3-reranker-0.6B-int4-cw-ov, openvino runtime version: 2025.4.0-20398-7a975177ff4-releases/2025/4, genai version: 2025.4.0.0-2674-5041b1dc4e5
[ERROR] 05:13:21.908 [IE::FrontEnd::importNetwork]   Upper bounds are not specified for node '__module.model.embed_tokens/ov_ext::embedding/Convert_1' (type 'Convert'): input '0' bounds are '[9223372036854775807, 9223372036854775807]'
[ERROR] 05:13:21.908 [IE::FrontEnd::importNetwork]   Upper bounds are not specified for node '__module.model.embed_tokens/ov_ext::embedding/Gather' (type 'Gather'): input '1' bounds are '[9223372036854775807, 9223372036854775807]'
[ERROR] 05:13:21.908 [IE::FrontEnd::importNetwork]   Upper bounds are not specified for node '__module.model.layers.0.input_layernorm/aten::pow/Power' (type 'Power'): input '0' bounds are '[9223372036854775807, 9223372036854775807, 1024]'
[ERROR] 05:13:21.908 [vpux-compiler] Got Diagnostic at loc(fused<{name = "__module.model.embed_tokens/ov_ext::embedding/Gather", type = "Gather"}>["__module.model.embed_tokens/ov_ext::embedding/Gather"]) : Got non broadcastable dimensions pair : '9223372036854775807' and -9223372036854775808'
loc(fused<{name = "__module.model.embed_tokens/ov_ext::embedding/Gather", type = "Gather"}>["__module.model.embed_tokens/ov_ext::embedding/Gather"]): error: Got non broadcastable dimensions pair : '9223372036854775807' and -9223372036854775808'
[ERROR] 05:13:21.908 [IE::FrontEnd::importNetwork]   Upper bounds are not specified for node '__module.model.layers.0.input_layernorm/aten::mul/Multiply' (type 'Multiply'): input '0' bounds are '[9223372036854775807, 9223372036854775807, 1024]'
[ERROR] 05:13:21.908 [IE::FrontEnd::importNetwork]   Upper bounds are not specified for node '__module.model.layers.0.self_attn.q_norm/aten::pow/Power' (type 'Power'): input '0' bounds are '[9223372036854775807, 9223372036854775807, 16, 128]'
[ERROR] 05:13:21.908 [vpux-compiler] Got Diagnostic at loc(fused<{name = "__module.model.layers.0.self_attn/aten::view/Reshape", type = "Reshape"}>["__module.model.layers.0.self_attn/aten::view/Reshape"]) : Got non broadcastable dimensions pair : '9223372036854775807' and -9223372036854775808'
loc(fused<{name = "__module.model.layers.0.self_attn/aten::view/Reshape", type = "Reshape"}>["__module.model.layers.0.self_attn/aten::view/Reshape"]): error: Got non broadcastable dimensions pair : '9223372036854775807' and -9223372036854775808'
[ERROR] 05:13:21.908 [IE::FrontEnd::importNetwork]   Upper bounds are not specified for node '__module.model.layers.0.self_attn.q_norm/aten::mul/Multiply' (type 'Multiply'): input '0' bounds are '[9223372036854775807, 9223372036854775807, 16, 128]'
[ WARNING ] Model is not supported by OpenVINO GenAI. GenAI pipeline loading failed with following error: Exception from src/inference/src/cpp/core.cpp:114:
Exception from src/inference/src/dev/plugin.cpp:53:
Exception from src/plugins/intel_npu/src/plugin/src/plugin.cpp:732:
Exception from src/plugins/intel_npu/src/compiler_adapter/src/ze_graph_ext_wrappers.cpp:360:
L0 pfnCreate2 result: ZE_RESULT_ERROR_INVALID_ARGUMENT, code 0x78000004 - generic error code for invalid arguments . [NPU_DRV] Driver reports a failure from vclAllocatedExecutableCreate2, return code: 2013265924
[NPU_VCL] Compiler returned msg:
Got negative shape dim bound: '-9223372036854775808'




Benchmark will be switched to Optimum Intel pipeline realization
[ INFO ] Selected Optimum Intel for benchmarking
[ERROR] 05:13:23.030 [IE::FrontEnd::importNetwork]   Upper bounds are not specified for node '__module.model.embed_tokens/ov_ext::embedding/Convert_1' (type 'Convert'): input '0' bounds are '[9223372036854775807, 9223372036854775807]'
[ERROR] 05:13:23.030 [IE::FrontEnd::importNetwork]   Upper bounds are not specified for node '__module.model.embed_tokens/ov_ext::embedding/Gather' (type 'Gather'): input '1' bounds are '[9223372036854775807, 9223372036854775807]'
[ERROR] 05:13:23.030 [IE::FrontEnd::importNetwork]   Upper bounds are not specified for node '__module.model.layers.0.input_layernorm/aten::pow/Power' (type 'Power'): input '0' bounds are '[9223372036854775807, 9223372036854775807, 1024]'
[ERROR] 05:13:23.030 [vpux-compiler] Got Diagnostic at loc(fused<{name = "__module.model.embed_tokens/ov_ext::embedding/Gather", type = "Gather"}>["__module.model.embed_tokens/ov_ext::embedding/Gather"]) : Got non broadcastable dimensions pair : '9223372036854775807' and -9223372036854775808'
loc(fused<{name = "__module.model.embed_tokens/ov_ext::embedding/Gather", type = "Gather"}>["__module.model.embed_tokens/ov_ext::embedding/Gather"]): error: Got non broadcastable dimensions pair : '9223372036854775807' and -9223372036854775808'
[ERROR] 05:13:23.030 [IE::FrontEnd::importNetwork]   Upper bounds are not specified for node '__module.model.layers.0.input_layernorm/aten::mul/Multiply' (type 'Multiply'): input '0' bounds are '[9223372036854775807, 9223372036854775807, 1024]'
[ERROR] 05:13:23.030 [IE::FrontEnd::importNetwork]   Upper bounds are not specified for node '__module.model.layers.0.self_attn.q_norm/aten::pow/Power' (type 'Power'): input '0' bounds are '[9223372036854775807, 9223372036854775807, 16, 128]'
[ERROR] 05:13:23.030 [vpux-compiler] Got Diagnostic at loc(fused<{name = "__module.model.layers.0.self_attn/aten::view/Reshape", type = "Reshape"}>["__module.model.layers.0.self_attn/aten::view/Reshape"]) : Got non broadcastable dimensions pair : '9223372036854775807' and -9223372036854775808'
loc(fused<{name = "__module.model.layers.0.self_attn/aten::view/Reshape", type = "Reshape"}>["__module.model.layers.0.self_attn/aten::view/Reshape"]): error: Got non broadcastable dimensions pair : '9223372036854775807' and -9223372036854775808'
[ERROR] 05:13:23.030 [IE::FrontEnd::importNetwork]   Upper bounds are not specified for node '__module.model.layers.0.self_attn.q_norm/aten::mul/Multiply' (type 'Multiply'): input '0' bounds are '[9223372036854775807, 9223372036854775807, 16, 128]'
[ ERROR ] An exception occurred
[ INFO ] Traceback (most recent call last):
  File "/workspace/openvino.genai/tools/llm_bench/llm_bench_utils/ov_utils.py", line 1232, in create_text_reranker_model
    ov_model = kwargs['use_case'].ov_cls.from_pretrained(
  File "/usr/local/lib/python3.10/dist-packages/optimum/intel/openvino/modeling_base.py", line 583, in from_pretrained
    return super().from_pretrained(
  File "/usr/local/lib/python3.10/dist-packages/optimum/modeling_base.py", line 407, in from_pretrained
    return from_pretrained_method(
  File "/usr/local/lib/python3.10/dist-packages/optimum/intel/openvino/modeling_decoder.py", line 896, in _from_pretrained
    causal_model = init_cls(
  File "/usr/local/lib/python3.10/dist-packages/optimum/intel/openvino/modeling_decoder.py", line 202, in __init__
    raise_error(self.use_cache, use_cache, "use_cache")
  File "/usr/local/lib/python3.10/dist-packages/optimum/intel/openvino/modeling_decoder.py", line 190, in raise_error
    raise ValueError(
ValueError: `use_cache` was set to `True` but the loaded model only supports `use_cache=False`. Please load your current model with `use_cache=False` or export the original model once again with `use_cache=True` when calling the `from_pretrained` method. To export your model, simply set `export=True`.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/workspace/openvino.genai/tools/llm_bench/benchmark.py", line 320, in main
    iter_data_list, pretrain_time, iter_timestamp = CASE_TO_BENCH[model_args['use_case'].task](
  File "/workspace/openvino.genai/tools/llm_bench/task/text_reranker.py", line 371, in run_text_reranker_benchmark
    model, tokenizer, pretrain_time, bench_hook, use_genai = FW_UTILS[framework].create_text_reranker_model(model_path, device, mem_consumption, **args)
  File "/workspace/openvino.genai/tools/llm_bench/llm_bench_utils/ov_utils.py", line 1238, in create_text_reranker_model
    ov_model = kwargs['use_case'].ov_cls.from_pretrained(
  File "/usr/local/lib/python3.10/dist-packages/optimum/intel/openvino/modeling_base.py", line 583, in from_pretrained
    return super().from_pretrained(
  File "/usr/local/lib/python3.10/dist-packages/optimum/modeling_base.py", line 407, in from_pretrained
    return from_pretrained_method(
  File "/usr/local/lib/python3.10/dist-packages/optimum/intel/openvino/modeling_decoder.py", line 896, in _from_pretrained
    causal_model = init_cls(
  File "/usr/local/lib/python3.10/dist-packages/optimum/intel/openvino/modeling_decoder.py", line 208, in __init__
    self.compile()
  File "/usr/local/lib/python3.10/dist-packages/optimum/intel/openvino/modeling_decoder.py", line 428, in compile
    super().compile()
  File "/usr/local/lib/python3.10/dist-packages/optimum/intel/openvino/modeling_base.py", line 868, in compile
    self.request = self._compile_model(self.model, self._device, ov_config, self.model_save_dir)
  File "/usr/local/lib/python3.10/dist-packages/optimum/intel/openvino/modeling_base.py", line 384, in _compile_model
    compiled_model = core.compile_model(model, device.upper() if device is not None else device, config=ov_config)
  File "/usr/local/lib/python3.10/dist-packages/openvino/_ov_api.py", line 610, in compile_model
    super().compile_model(model, device_name, {} if config is None else config),
RuntimeError: Exception from src/inference/src/cpp/core.cpp:114:
Exception from src/inference/src/dev/plugin.cpp:53:
Exception from src/plugins/intel_npu/src/plugin/src/plugin.cpp:732:
Exception from src/plugins/intel_npu/src/compiler_adapter/src/ze_graph_ext_wrappers.cpp:360:
L0 pfnCreate2 result: ZE_RESULT_ERROR_INVALID_ARGUMENT, code 0x78000004 - generic error code for invalid arguments . [NPU_DRV] Driver reports a failure from vclAllocatedExecutableCreate2, return code: 2013265924
[NPU_VCL] Compiler returned msg:
Got negative shape dim bound: '-9223372036854775808'
```
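
Note on the numbers in the log: `9223372036854775807` and `-9223372036854775808` are the signed 64-bit limits (2^63 - 1 and -2^63). My guess, which the log alone does not confirm, is that the first value marks axes that are still dynamic with no upper bound when the model reaches the NPU compiler. Below is a minimal sketch for listing which nodes and axes are reported this way. The helper name `unbounded_dims` and the regular expression are hypothetical and written only against the message format shown above.

```python
import re

INT64_MAX = 2**63 - 1   # 9223372036854775807, as printed in the log
INT64_MIN = -(2**63)    # -9223372036854775808, as printed in the log

# Assumption: the message format matches the "Upper bounds are not specified"
# lines in the log above; other OpenVINO versions may word it differently.
BOUNDS_RE = re.compile(
    r"Upper bounds are not specified for node '(?P<node>[^']+)' "
    r"\(type '(?P<type>[^']+)'\): input '(?P<input>\d+)' "
    r"bounds are '\[(?P<bounds>[^\]]*)\]'"
)


def unbounded_dims(log_text):
    """Return (node, op_type, input_index, axes) for each matching log line.

    `axes` lists the positions whose reported bound equals INT64_MAX.
    """
    results = []
    for m in BOUNDS_RE.finditer(log_text):
        bounds = [int(x) for x in m.group("bounds").split(",") if x.strip()]
        axes = [i for i, b in enumerate(bounds) if b == INT64_MAX]
        results.append(
            (m.group("node"), m.group("type"), int(m.group("input")), axes)
        )
    return results


# One line copied from the log above.
sample = (
    "[ERROR] 05:13:21.908 [IE::FrontEnd::importNetwork]   "
    "Upper bounds are not specified for node "
    "'__module.model.layers.0.input_layernorm/aten::pow/Power' "
    "(type 'Power'): input '0' bounds are "
    "'[9223372036854775807, 9223372036854775807, 1024]'"
)

for entry in unbounded_dims(sample):
    print(entry)
```

For the sample line, axes 0 and 1 are flagged and the hidden size 1024 is not, which would be consistent with the batch and sequence-length axes being left dynamic by the export.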
