
[NPUW] Fix eagle3 with chunk prefill #33975

Open

GuoliangShiIntel wants to merge 4 commits into openvinotoolkit:master from GuoliangShiIntel:sgl/fix_eagle_trunk_prefill_issue

Conversation

Contributor

@GuoliangShiIntel GuoliangShiIntel commented Feb 5, 2026

Details:

Background:

  1. The Eagle3 Target/Draft model now outputs last_hidden_state in addition to logits.
  2. While logits only needs the last token (obtained via the slice output), last_hidden_state requires the tensors for all tokens.
  3. For chunk prefill, we must accumulate last_hidden_state across chunks, unlike logits, which only needs the final chunk.

Changes in this PR:
Added logic to accumulate and concatenate last_hidden_state outputs across chunks during chunk prefill in the Eagle3 pipeline.
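
For illustration only, a minimal sketch of that accumulation flow follows. It is not the PR's actual code: the wrapper struct and helper names are hypothetical, and only m_last_hidden_state and m_chunked_seq_offset mirror members that appear in the diff.

#include <cstring>
#include <openvino/core/except.hpp>
#include <openvino/runtime/tensor.hpp>

// Hypothetical sketch: accumulate each chunk's last_hidden_state into a tensor
// pre-allocated for the whole prompt (assumed layout [batch, seq_len, hidden]).
struct ChunkedHiddenStateAccumulator {
    ov::Tensor m_last_hidden_state;   // pre-allocated for the full prompt
    size_t m_chunked_seq_offset = 0;  // next write position along the sequence axis

    void accumulate(const ov::Tensor& chunk_hidden, size_t total_prompt_len) {
        const auto shape = chunk_hidden.get_shape();
        const size_t chunk_len = shape[1];

        if (!m_last_hidden_state) {
            // First chunk of a new prompt: allocate storage for the full sequence.
            m_last_hidden_state = ov::Tensor(chunk_hidden.get_element_type(),
                                             {shape[0], total_prompt_len, shape[2]});
            m_chunked_seq_offset = 0;
        }

        OPENVINO_ASSERT(m_chunked_seq_offset + chunk_len <= total_prompt_len,
                        "Chunk write would exceed the pre-allocated size");

        // Copy the chunk into its slot along the sequence dimension.
        const size_t row_bytes = shape[2] * chunk_hidden.get_element_type().size();
        std::memcpy(static_cast<uint8_t*>(m_last_hidden_state.data()) + m_chunked_seq_offset * row_bytes,
                    chunk_hidden.data(),
                    chunk_len * row_bytes);
        m_chunked_seq_offset += chunk_len;
    }

    // Reset before starting a new chunked prefill session (see the review discussion below).
    void reset_chunked_prefill_state() {
        m_last_hidden_state = ov::Tensor();
        m_chunked_seq_offset = 0;
    }
};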

Tickets:

- CVS-180647

@GuoliangShiIntel GuoliangShiIntel self-assigned this Feb 5, 2026
@github-actions github-actions bot added the category: NPU (OpenVINO NPU plugin) and category: NPUW (NPUW plugin) labels Feb 5, 2026
@GuoliangShiIntel GuoliangShiIntel force-pushed the sgl/fix_eagle_trunk_prefill_issue branch from 5d192f4 to dfefbd5 on February 5, 2026 02:45
@GuoliangShiIntel GuoliangShiIntel marked this pull request as ready for review February 5, 2026 03:23
@GuoliangShiIntel GuoliangShiIntel requested review from a team as code owners February 5, 2026 03:23
@GuoliangShiIntel GuoliangShiIntel removed their assignment Feb 5, 2026
@dmatveev dmatveev added this to the 2026.1 milestone Feb 5, 2026
@dmatveev dmatveev self-assigned this Feb 5, 2026
Contributor

@AsyaPronina AsyaPronina left a comment


Great fix, thank you!

const uint32_t target_total_len = static_cast<uint32_t>(target_shape[1]);

OPENVINO_ASSERT(m_chunked_seq_offset + chunk_token_count <= target_total_len,
                "Chunked sequence offset exceeds pre-allocated size");
Contributor

"Can't write chunk by stored chunked sequence offset and requested number of tokens, as it will exceed pre-allocated size"

Contributor Author

Done


// Copy chunk data directly to the correct position in pre-allocated tensor
uint8_t* dst_ptr = reinterpret_cast<uint8_t*>(m_last_hidden_state->data());
dst_ptr += m_chunked_seq_offset * row_bytes; // Move to the current write position
Contributor

Could we please use ov::npuw::util::make_tensor_slice and tensor->copy_to(another_tensor) here? Some examples can be found in LLMInferRequest: https://github.com/openvinotoolkit/openvino/blob/master/src/plugins/intel_npu/src/plugin/npuw/llm_infer_request.cpp
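
For reference, a hedged sketch of what the suggested rewrite might look like, assuming the make_tensor_slice(tensor, dim, start, end) helper and the copy_to() pattern used in llm_infer_request.cpp; chunk_hidden_state and chunk_token_count are illustrative names, not the PR's:

// Sketch only: slice the pre-allocated tensor along the sequence dimension (dim 1)
// and copy the chunk's hidden state into that slot, instead of raw pointer arithmetic.
auto dst_slice = ov::npuw::util::make_tensor_slice(m_last_hidden_state,                        // [1, total_len, hidden]
                                                   1u,                                         // sequence dimension
                                                   m_chunked_seq_offset,                       // chunk start
                                                   m_chunked_seq_offset + chunk_token_count);  // chunk end
chunk_hidden_state->copy_to(dst_slice._ptr);  // chunk output -> its slot in the full tensor
m_chunked_seq_offset += chunk_token_count;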

Contributor Author

Good proposal, done

@GuoliangShiIntel GuoliangShiIntel force-pushed the sgl/fix_eagle_trunk_prefill_issue branch 3 times, most recently from f04e2c0 to 2827581 on February 6, 2026 06:42

// Reset chunked prefill state before starting a new chunked prefill session
void reset_chunked_prefill_state() {
    m_last_hidden_state = {};
Contributor

Why do we need to do this?
It means that on each prefill stage we are allocating a new tensor. Why?

Contributor Author

@GuoliangShiIntel GuoliangShiIntel Feb 6, 2026

Good question. Please consider this scenario:

If we run two prompts consecutively using infer:

For the first prompt: m_last_hidden_state is null -> pre-allocate a tensor for the full sequence -> copy each chunk's last_hidden_state into the pre-allocated memory.

After the first prefill completes, the generate phase also updates m_last_hidden_state. When the generate phase finishes, m_last_hidden_state remains non-null.

For the second prompt: Since m_last_hidden_state is still non-null, prefill will not enter the "Pre-allocate tensor on first chunk" path, causing a memory size mismatch that triggers the assertion.

Given that each prompt is prefilled only once, it's reasonable to reset the tensor here.
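
To make the two-prompt scenario concrete, a hedged call-order sketch (infer_prefill and infer_generate are illustrative names, not the PR's exact flow):

// Prompt 1
reset_chunked_prefill_state();   // m_last_hidden_state is empty
infer_prefill(prompt_1);         // first chunk pre-allocates; later chunks write at the offset
infer_generate();                // generate phase also updates m_last_hidden_state

// Prompt 2: without the reset, m_last_hidden_state would still be non-null here, the
// "pre-allocate on first chunk" path would be skipped, and the size assertion would fire.
reset_chunked_prefill_state();
infer_prefill(prompt_2);
infer_generate();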

Contributor

So m_last_hidden_state points to different tensors in the prefill and generate phases?
It would be nice to have an explanatory comment here.

Having an allocation per prefill is not a big deal, I think. But we could also keep a pre-allocated tensor for the prefill phase and not allocate it every time.

@GuoliangShiIntel GuoliangShiIntel force-pushed the sgl/fix_eagle_trunk_prefill_issue branch from 2827581 to 8babf24 on February 6, 2026 16:06
@GuoliangShiIntel GuoliangShiIntel force-pushed the sgl/fix_eagle_trunk_prefill_issue branch from 8babf24 to 2aaccdb on February 6, 2026 16:17
Contributor

dmatveev commented Feb 6, 2026

build_jenkins
