[VLM][NPU] VLMPipeline on NPU ignores MAX_PROMPT_LEN — error message instructs users to set a property that immediately crashes with "Unsupported property by CPU plugin"
Environment
- OS: Windows 11 Home, Build 26200
- Hardware: Intel Core Ultra 5 226V (Lunar Lake, NPU4)
- NPU Driver: Intel AI Boost 32.0.100.4512
- OpenVINO: 2025.4.1 (pip, 2025.4.1-20426-82bbf0292c5)
- openvino-genai: 2025.4.1 (pip, 2025.4.1.0-2683-fc593653d77)
- Python: 3.12.10
- Models tested:
  - helenai/Qwen2.5-VL-3B-Instruct-ov-nf4-npu
  - 0ldev/Qwen2.5-VL-3B-Instruct-ov-nf4-npu (independently converted following Intel's NPU export instructions)
Description
VLMPipeline on NPU enforces a 1024-token limit on input embeddings. When the limit is exceeded, the pipeline throws a clear, actionable error message telling the user to set MAX_PROMPT_LEN. However, setting MAX_PROMPT_LEN via any available Python API immediately crashes with "Unsupported property MAX_PROMPT_LEN by CPU plugin".
The error message and the property validation are mutually contradictory: the API tells you exactly how to fix the problem, then rejects that fix.
Expected Behavior
pipe = ov_genai.VLMPipeline(model_path, "NPU", MAX_PROMPT_LEN=2048)
# Should compile the NPU LM subgraph with kv_desc.max_prompt_len = 2048
# and accept prompts up to that length
Actual Behavior
Two distinct failure modes, both preventing any use of MAX_PROMPT_LEN:
Error A — Limit is enforced with an actionable message (but the prescribed fix does not work):
Exception: Check 'inputs_embeds.get_shape().at(1) <= m_max_prompt_len' failed at
src\cpp\src\visual_language\pipeline.cpp:270:
VLM pipeline on NPU may only process input embeddings up to 1024 tokens. 1051 is passed.
Set the "MAX_PROMPT_LEN" config option to increase the limit.
Error B — Following the above instruction immediately crashes:
Exception from src\inference\src\cpp\core.cpp:137:
Exception from src\inference\src\dev\plugin.cpp:53:
Exception from src\plugins\intel_cpu\src\config.cpp:475:
NotFound: Unsupported property MAX_PROMPT_LEN by CPU plugin.
Reproduction Steps
import os
import numpy as np
import openvino as ov
import openvino_genai as ov_genai
from PIL import Image
model_path = "path/to/Qwen2.5-VL-3B-Instruct-ov-nf4-npu"
# Helper: 224x224 test image tensor — 4D (1, H, W, 3) per Intel's own examples
img = Image.new("RGB", (224, 224))
arr = np.array(img.getdata()).reshape(1, 224, 224, 3).astype(np.uint8)
image_tensor = ov.Tensor(arr)
# ── Step 1: Reproduce Error A ────────────────────────────────────────────────
# Limit triggered, fix suggested by the exception message itself
pipe = ov_genai.VLMPipeline(model_path, "NPU", CACHE_DIR="cache_default")
prompt = "Describe this image. " + ("The scene contains various objects. " * 120)
# ~1050 tokens: 224x224 image (~256 image tokens) + ~800 text tokens
pipe.start_chat()
pipe.generate(prompt, image=image_tensor, max_new_tokens=40)
# Raises:
# VLM pipeline on NPU may only process input embeddings up to 1024 tokens.
# 1051 is passed.
# Set the "MAX_PROMPT_LEN" config option to increase the limit.
pipe.finish_chat()
# ── Step 2: Reproduce Error B ────────────────────────────────────────────────
# Following the suggestion from Error A crashes immediately at pipeline load
pipe2 = ov_genai.VLMPipeline(model_path, "NPU",
CACHE_DIR="cache_2048",
MAX_PROMPT_LEN=2048)
# Raises:
# NotFound: Unsupported property MAX_PROMPT_LEN by CPU plugin.
Root Cause Analysis
VLMPipeline is composed of a CPU vision encoder subgraph and an NPU LM subgraph. When pipeline properties are passed as kwargs (e.g. MAX_PROMPT_LEN=2048), they are broadcast to all sub-plugins — including the CPU encoder, which correctly rejects the NPU-specific property and throws before NPU compilation ever begins.
The relevant NPU-only path in src/cpp/src/utils.cpp:
// line ~477
kv_desc.max_prompt_len = pop_int_and_cast(properties, "MAX_PROMPT_LEN").value_or(1024u);
This only executes when MAX_PROMPT_LEN survives routing to the NPU LM subgraph. For LLMPipeline (text-only, no CPU encoder), direct kwargs work correctly because no CPU sub-plugin intercepts them first. For VLMPipeline, the CPU encoder receives the property and throws unconditionally.
The alternative of passing DEVICE_PROPERTIES as a JSON string also fails silently: pop_or_default<ov::AnyMap> cannot cast std::string → ov::AnyMap, so the nested map is dropped entirely and MAX_PROMPT_LEN again never reaches pop_int_and_cast. The result is that the pipeline always compiles with the hardcoded default of 1024 tokens.
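The failed cast can be illustrated in plain Python (a hypothetical mirror of the C++ behavior described above; `json` here only stands in for the string-to-map conversion that never happens):

```python
import json

# What a user passes when DEVICE_PROPERTIES is supplied as a JSON string:
raw = '{"NPU": {"MAX_PROMPT_LEN": 2048}}'

# The C++ side expects an ov::AnyMap (a dict-like value). A string is not a
# map, so per the analysis above the cast fails and the value is dropped:
print(isinstance(raw, dict))            # False: the nested map never materializes

# What would have to reach pop_int_and_cast for the limit to take effect:
parsed = json.loads(raw)
print(parsed["NPU"]["MAX_PROMPT_LEN"])  # 2048
```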
Note: The DEVICE_PROPERTIES string→int type conversion fix from PR #2142 addressed the C API only (src/c/src/llm_pipeline.cpp) and did not fix the VLMPipeline Python/C++ path.
Suggested Fix
In VLMPipeline's NPU initialization, extract MAX_PROMPT_LEN and MIN_RESPONSE_LEN from the top-level properties map before forwarding properties to sub-components, and apply them exclusively when compiling the NPU LM subgraph. The CPU vision encoder should not receive (or need to tolerate) NPU LM-specific KV-cache properties.
This mirrors the fix already applied to LLMPipeline for the same properties.
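In Python terms, the extraction step could be sketched as follows (the names `NPU_ONLY_KEYS` and `split_properties` are illustrative, not actual openvino_genai internals):

```python
# NPU LM-specific KV-cache properties that the CPU vision encoder must not see.
NPU_ONLY_KEYS = {"MAX_PROMPT_LEN", "MIN_RESPONSE_LEN"}

def split_properties(properties: dict) -> tuple[dict, dict]:
    """Partition a flat property map into (npu_lm_props, shared_props).

    npu_lm_props would be applied only when compiling the NPU LM subgraph;
    shared_props is what gets forwarded to every sub-component.
    """
    npu_lm = {k: v for k, v in properties.items() if k in NPU_ONLY_KEYS}
    shared = {k: v for k, v in properties.items() if k not in NPU_ONLY_KEYS}
    return npu_lm, shared

npu_lm, shared = split_properties({"MAX_PROMPT_LEN": 2048, "CACHE_DIR": "cache_2048"})
print(npu_lm)   # {'MAX_PROMPT_LEN': 2048}
print(shared)   # {'CACHE_DIR': 'cache_2048'}
```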
Workaround
None available from Python. The only mitigation is to minimize image token consumption by resizing input images to ≤128×128 before tensor conversion, which reduces image tokens from ~256 to ~64 and leaves ~960 tokens for system prompt + user text + response within the fixed 1024 limit.
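The resize mitigation can be sketched as below. This is a numpy-only nearest-neighbour downscale so the snippet stays dependency-free; in practice `img.resize((128, 128))` with PIL before the `np.array` conversion does the same job. `downscale_nn` is a hypothetical helper, not part of any library:

```python
import numpy as np

def downscale_nn(arr, size=128):
    """Nearest-neighbour downscale of an (H, W, 3) uint8 image array."""
    h, w, _ = arr.shape
    ys = np.arange(size) * h // size   # source row index for each output row
    xs = np.arange(size) * w // size   # source column index for each output column
    return arr[ys[:, None], xs, :]

img = np.zeros((224, 224, 3), dtype=np.uint8)  # stand-in for the real image
small = downscale_nn(img)                      # (128, 128, 3)
image_tensor_in = small[None, ...]             # (1, 128, 128, 3), same 4D layout as the repro
print(image_tensor_in.shape)                   # (1, 128, 128, 3)
```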
Related
- #3255 — NPU LLM Pipeline produces garbled output instead of error when prompt exceeds practical context limits: LLMPipeline on NPU produces garbled output past the context limit (same symptom, different pipeline — direct kwargs do work for LLMPipeline)
- PR #2142 — [C] Implement type conversion for the property values of MAX_PROMPT_LEN and MIN_RESPONSE_LEN (fixed the C API only, not VLMPipeline)
- PR #1020 — StaticLLMPipeline: Fix MAX_PROMPT_LEN / MIN_RESPONSE_LEN access