[VLM][NPU] VLMPipeline on NPU ignores MAX_PROMPT_LEN — error message instructs users to set a property that immediately crashes with "Unsupported property by CPU plugin"
Environment
- OS: Windows 11 Home, Build 26200
- Hardware: Intel Core Ultra 5 226V (Lunar Lake, NPU4)
- NPU Driver: Intel AI Boost 32.0.100.4512
- OpenVINO: 2025.4.1 (pip, 2025.4.1-20426-82bbf0292c5)
- openvino-genai: 2025.4.1 (pip, 2025.4.1.0-2683-fc593653d77)
- Python: 3.12.10
- Models tested:
  - helenai/Qwen2.5-VL-3B-Instruct-ov-nf4-npu
  - 0ldev/Qwen2.5-VL-3B-Instruct-ov-nf4-npu (independently converted following Intel's NPU export instructions)
Description
VLMPipeline on NPU enforces a 1024-token limit on input embeddings. When the limit is exceeded, the pipeline throws a clear, actionable error message telling the user to set MAX_PROMPT_LEN. However, setting MAX_PROMPT_LEN via any available Python API immediately crashes with "Unsupported property MAX_PROMPT_LEN by CPU plugin".
The error message and the property validation are mutually contradictory: the API tells you exactly how to fix the problem, then rejects that fix.
Expected Behavior
pipe = ov_genai.VLMPipeline(model_path, "NPU", MAX_PROMPT_LEN=2048)
# Should compile the NPU LM subgraph with kv_desc.max_prompt_len = 2048
# and accept prompts up to that length
Actual Behavior
Two distinct failure modes, both preventing any use of MAX_PROMPT_LEN:
Error A — Limit is enforced with an actionable message (but the prescribed fix does not work):
Exception: Check 'inputs_embeds.get_shape().at(1) <= m_max_prompt_len' failed at
src\cpp\src\visual_language\pipeline.cpp:270:
VLM pipeline on NPU may only process input embeddings up to 1024 tokens. 1051 is passed.
Set the "MAX_PROMPT_LEN" config option to increase the limit.
Error B — Following the above instruction immediately crashes:
Exception from src\inference\src\cpp\core.cpp:137:
Exception from src\inference\src\dev\plugin.cpp:53:
Exception from src\plugins\intel_cpu\src\config.cpp:475:
NotFound: Unsupported property MAX_PROMPT_LEN by CPU plugin.
Reproduction Steps
import os
import numpy as np
import openvino as ov
import openvino_genai as ov_genai
from PIL import Image
model_path = "path/to/Qwen2.5-VL-3B-Instruct-ov-nf4-npu"
# Helper: 224x224 test image tensor — 4D (1, H, W, 3) per Intel's own examples
img = Image.new("RGB", (224, 224))
arr = np.array(img.getdata()).reshape(1, 224, 224, 3).astype(np.uint8)
image_tensor = ov.Tensor(arr)
# ── Step 1: Reproduce Error A ────────────────────────────────────────────────
# Limit triggered, fix suggested by the exception message itself
pipe = ov_genai.VLMPipeline(model_path, "NPU", CACHE_DIR="cache_default")
prompt = "Describe this image. " + ("The scene contains various objects. " * 120)
# ~1050 tokens: 224x224 image (~256 image tokens) + ~800 text tokens
pipe.start_chat()
pipe.generate(prompt, image=image_tensor, max_new_tokens=40)
# Raises:
# VLM pipeline on NPU may only process input embeddings up to 1024 tokens.
# 1051 is passed.
# Set the "MAX_PROMPT_LEN" config option to increase the limit.
pipe.finish_chat()
# ── Step 2: Reproduce Error B ────────────────────────────────────────────────
# Following the suggestion from Error A crashes immediately at pipeline load
pipe2 = ov_genai.VLMPipeline(model_path, "NPU",
CACHE_DIR="cache_2048",
MAX_PROMPT_LEN=2048)
# Raises:
# NotFound: Unsupported property MAX_PROMPT_LEN by CPU plugin.
Root Cause Analysis
VLMPipeline is composed of a CPU vision encoder subgraph and an NPU LM subgraph. When pipeline properties are passed as kwargs (e.g. MAX_PROMPT_LEN=2048), they are broadcast to all sub-plugins — including the CPU encoder, which correctly rejects the NPU-specific property and throws before NPU compilation ever begins.
The relevant NPU-only path in src/cpp/src/utils.cpp:
// line ~477
kv_desc.max_prompt_len = pop_int_and_cast(properties, "MAX_PROMPT_LEN").value_or(1024u);
This only executes when MAX_PROMPT_LEN survives routing to the NPU LM subgraph. For LLMPipeline (text-only, no CPU encoder), direct kwargs work correctly because no CPU sub-plugin intercepts them first. For VLMPipeline, the CPU encoder receives the property and throws unconditionally.
The alternative of passing DEVICE_PROPERTIES as a JSON string also fails silently: pop_or_default<ov::AnyMap> cannot cast std::string → ov::AnyMap, so the nested map is dropped entirely and MAX_PROMPT_LEN again never reaches pop_int_and_cast. The result is that the pipeline always compiles with the hardcoded default of 1024 tokens.
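The failed cast can be illustrated in plain Python (a hypothetical mirror of the C++ behavior described above; `json` here only stands in for the string-to-map conversion that never happens):

```python
import json

# What a user passes when DEVICE_PROPERTIES is supplied as a JSON string:
raw = '{"NPU": {"MAX_PROMPT_LEN": 2048}}'

# The C++ side expects an ov::AnyMap (a dict-like value). A string is not a
# map, so per the analysis above the cast fails and the value is dropped:
print(isinstance(raw, dict))            # False: the nested map never materializes

# What would have to reach pop_int_and_cast for the limit to take effect:
parsed = json.loads(raw)
print(parsed["NPU"]["MAX_PROMPT_LEN"])  # 2048
```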
Note: The DEVICE_PROPERTIES string→int type conversion fix from PR #2142 addressed the C API only (src/c/src/llm_pipeline.cpp) and did not fix the VLMPipeline Python/C++ path.
Suggested Fix
In VLMPipeline's NPU initialization, extract MAX_PROMPT_LEN and MIN_RESPONSE_LEN from the top-level properties map before forwarding properties to sub-components, and apply them exclusively when compiling the NPU LM subgraph. The CPU vision encoder should not receive (or need to tolerate) NPU LM-specific KV-cache properties.
This mirrors the fix already applied to LLMPipeline for the same properties.
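In Python terms, the extraction step could be sketched as follows (the names `NPU_ONLY_KEYS` and `split_properties` are illustrative, not actual openvino_genai internals):

```python
# NPU LM-specific KV-cache properties that the CPU vision encoder must not see.
NPU_ONLY_KEYS = {"MAX_PROMPT_LEN", "MIN_RESPONSE_LEN"}

def split_properties(properties: dict) -> tuple[dict, dict]:
    """Partition a flat property map into (npu_lm_props, shared_props).

    npu_lm_props would be applied only when compiling the NPU LM subgraph;
    shared_props is what gets forwarded to every sub-component.
    """
    npu_lm = {k: v for k, v in properties.items() if k in NPU_ONLY_KEYS}
    shared = {k: v for k, v in properties.items() if k not in NPU_ONLY_KEYS}
    return npu_lm, shared

npu_lm, shared = split_properties({"MAX_PROMPT_LEN": 2048, "CACHE_DIR": "cache_2048"})
print(npu_lm)   # {'MAX_PROMPT_LEN': 2048}
print(shared)   # {'CACHE_DIR': 'cache_2048'}
```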
Workaround
None available from Python. The only mitigation is to minimize image token consumption by resizing input images to ≤128×128 before tensor conversion, which reduces image tokens from ~256 to ~64 and leaves ~960 tokens for system prompt + user text + response within the fixed 1024 limit.
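The resize mitigation can be sketched as below. This is a numpy-only nearest-neighbour downscale so the snippet stays dependency-free; in practice `img.resize((128, 128))` with PIL before the `np.array` conversion does the same job. `downscale_nn` is a hypothetical helper, not part of any library:

```python
import numpy as np

def downscale_nn(arr, size=128):
    """Nearest-neighbour downscale of an (H, W, 3) uint8 image array."""
    h, w, _ = arr.shape
    ys = np.arange(size) * h // size   # source row index for each output row
    xs = np.arange(size) * w // size   # source column index for each output column
    return arr[ys[:, None], xs, :]

img = np.zeros((224, 224, 3), dtype=np.uint8)  # stand-in for the real image
small = downscale_nn(img)                      # (128, 128, 3)
image_tensor_in = small[None, ...]             # (1, 128, 128, 3), same 4D layout as the repro
print(image_tensor_in.shape)                   # (1, 128, 128, 3)
```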
Related
- #3255 — NPU LLM Pipeline produces garbled output instead of error when prompt exceeds practical context limits: LLMPipeline on NPU produces garbled output past the context limit (same symptom, different pipeline — direct kwargs do work for LLMPipeline)
- PR #2142 — [C] Implement type conversion for the property values of MAX_PROMPT_LEN and MIN_RESPONSE_LEN (fixed the C API only, not VLMPipeline)
- PR #1020 — StaticLLMPipeline: Fix MAX_PROMPT_LEN / MIN_RESPONSE_LEN access