[NPUW] Fix eagle3 with chunk prefill #33975
GuoliangShiIntel wants to merge 4 commits into openvinotoolkit:master
Conversation
Force-pushed from 5d192f4 to dfefbd5
AsyaPronina left a comment:
Great fix, thank you!
const uint32_t target_total_len = static_cast<uint32_t>(target_shape[1]);

OPENVINO_ASSERT(m_chunked_seq_offset + chunk_token_count <= target_total_len,
                "Chunked sequence offset exceeds pre-allocated size");
Suggested message: "Can't write chunk by stored chunked sequence offset and requested number of tokens, as it will exceed pre-allocated size"
// Copy chunk data directly to the correct position in pre-allocated tensor
uint8_t* dst_ptr = reinterpret_cast<uint8_t*>(m_last_hidden_state->data());
dst_ptr += m_chunked_seq_offset * row_bytes;  // Move to the current write position
Could we please use ov::npuw::util::make_tensor_slice and tensor->copy_to(another_tensor) here? Some examples can be found in LLMInferRequest: https://github.com/openvinotoolkit/openvino/blob/master/src/plugins/intel_npu/src/plugin/npuw/llm_infer_request.cpp
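For illustration only, a rough sketch of what the slice-based copy could look like, assuming m_last_hidden_state and the chunk's output (called chunk_hidden_state here, a hypothetical name) are ov::SoPtr<ov::ITensor>, and that make_tensor_slice takes (tensor, dim, start, end) as it is used in llm_infer_request.cpp:

```cpp
// Take a view over the [m_chunked_seq_offset, m_chunked_seq_offset + chunk_token_count)
// range along the sequence dimension of the pre-allocated hidden-state tensor.
auto dst_slice = ov::npuw::util::make_tensor_slice(m_last_hidden_state,
                                                   1u,  // sequence dimension
                                                   m_chunked_seq_offset,
                                                   m_chunked_seq_offset + chunk_token_count);

// Copy this chunk's hidden state into its slot, replacing the raw dst_ptr arithmetic.
// (With ov::SoPtr<ov::ITensor>, the destination is passed via ._ptr; a plain ov::Tensor
// destination would be passed directly.)
chunk_hidden_state->copy_to(dst_slice._ptr);

m_chunked_seq_offset += chunk_token_count;
```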
Good proposal, done
Force-pushed from f04e2c0 to 2827581
// Reset chunked prefill state before starting a new chunked prefill session
void reset_chunked_prefill_state() {
    m_last_hidden_state = {};
Why do we need to do this?
It means we are allocating a new tensor on every prefill stage. Why?
Good question. Please consider this scenario, where we run two prompts consecutively using infer:
For the first prompt: m_last_hidden_state is null -> pre-allocate a tensor for the full sequence -> copy each chunk's last_hidden_state into the pre-allocated memory.
After the first prefill completes, the generate phase also updates m_last_hidden_state, so when the generate phase finishes, m_last_hidden_state is still non-null.
For the second prompt: since m_last_hidden_state is still non-null, prefill will not enter the "pre-allocate tensor on first chunk" path, causing a memory size mismatch that triggers the assertion.
Given that each prompt is prefilled only once, it's reasonable to reset the tensor here.
So m_last_hidden_state points to different tensors in the prefill and generate phases?
It would be nice to have an explanatory comment here.
Having an allocation per prefill is not a big deal, I think, but we could also keep a pre-allocated tensor for the prefill phase and avoid allocating it every time.
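A sketch of how the explanatory comment (and, optionally, the non-allocating variant) might look; the reset of m_chunked_seq_offset and the exact member set are assumptions, not the PR's final code:

```cpp
// m_last_hidden_state is reused (and re-pointed) by the generate phase after prefill,
// so by the time the next prompt arrives it no longer refers to the prefill-sized
// pre-allocation. Reset it before each new chunked prefill session so that the first
// chunk re-allocates a tensor sized for the new prompt.
//
// Alternative (per the review suggestion): keep a dedicated pre-allocated prefill
// tensor and only reset the write offset, avoiding a fresh allocation per prefill.
void reset_chunked_prefill_state() {
    m_last_hidden_state = {};
    m_chunked_seq_offset = 0u;  // assumed companion member tracking the write position
}
```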
Force-pushed from 2827581 to 8babf24
Force-pushed from 8babf24 to 2aaccdb
build_jenkins
Details:

Background:
- Eagle3 requires last_hidden_state in addition to logits.
- logits only needs the last token (via slice output), while last_hidden_state requires all token tensors.
- Chunk prefill therefore has to accumulate last_hidden_state across chunks, unlike logits, which only needs the final chunk.

Changes in this PR:
- Added logic to accumulate and concatenate last_hidden_state outputs across chunks during chunk prefill in the Eagle3 pipeline (a consolidated sketch of this flow follows the ticket below).

Tickets:
- CVS-180647
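To summarize the accumulation scheme described above, here is a standalone, simplified sketch; it is not the PR's actual code, and names such as ChunkedHiddenStateAccumulator, append_chunk, and hidden_dim are illustrative only:

```cpp
#include <openvino/core/except.hpp>
#include <openvino/runtime/tensor.hpp>

#include <cstdint>
#include <cstring>

// Accumulates per-chunk last_hidden_state outputs into one pre-allocated tensor
// covering the whole prompt, mirroring the chunk-prefill flow described above.
class ChunkedHiddenStateAccumulator {
public:
    // Called once per chunk with that chunk's [1, chunk_len, hidden_dim] output.
    void append_chunk(const ov::Tensor& chunk, size_t total_prompt_len) {
        const auto& shape = chunk.get_shape();  // [1, chunk_len, hidden_dim]
        const size_t chunk_len = shape[1];
        const size_t hidden_dim = shape[2];

        if (!m_full) {
            // First chunk: pre-allocate for the full prompt length.
            m_full = ov::Tensor(chunk.get_element_type(),
                                ov::Shape{1, total_prompt_len, hidden_dim});
            m_offset = 0;
        }
        OPENVINO_ASSERT(m_offset + chunk_len <= total_prompt_len,
                        "Chunk would exceed the pre-allocated hidden-state tensor");

        // Copy the chunk into its slot along the sequence dimension.
        const size_t row_bytes = hidden_dim * chunk.get_element_type().size();
        auto* dst = static_cast<uint8_t*>(m_full.data()) + m_offset * row_bytes;
        std::memcpy(dst, chunk.data(), chunk_len * row_bytes);
        m_offset += chunk_len;
    }

    // Called before prefilling a new prompt (mirrors reset_chunked_prefill_state).
    void reset() {
        m_full = ov::Tensor();
        m_offset = 0;
    }

    const ov::Tensor& full_hidden_state() const { return m_full; }

private:
    ov::Tensor m_full;    // [1, total_prompt_len, hidden_dim], filled chunk by chunk
    size_t m_offset = 0;  // next write position along the sequence dimension
};
```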