Conversation
cc @yangulei

Co-authored-by: Youlei Yang <youlei.yang@intel.com>
Signed-off-by: Bob Zhu <bob.zhu@intel.com>
The output of the APC example code is OK.

@czhu15 Thank you for raising this enhancement. We will double-check this change and ensure it does not break current usages.
linoybu left a comment
In the vLLM plugin, we are currently using FSDPA only during the prefill phase.
You can see this distinction here:
https://github.com/vllm-project/vllm-gaudi/blob/b8515d5fb8d5966768ad03e71bbbe1ad6661d7df/vllm_gaudi/attention/backends/hpu_attn.py#L262
It appears to be an attempt to separate decode and prefill operations to improve performance.
My question is: if we are not using FSDPA for decode, should we still expect any performance improvement?
Also, do you have a ticket that explains more about this issue?
Thank you for this contribution, @czhu15.
Yes, this PR applies only during the prefill phase; more specifically, the prefill phase when prefix caching is enabled. The current implementation passes a (big) atten_bias to the kernel, which can easily lead to OOM issues.
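For scale, a rough back-of-the-envelope illustration of why materializing a dense attention bias can OOM. The shapes and dtype below are illustrative assumptions, not values taken from this PR:

```python
# Rough size of a dense attention-bias tensor of shape
# [batch, heads, q_len, kv_len] -- illustrative numbers only.
batch, heads, q_len, kv_len = 8, 32, 4096, 32768
bytes_per_elem = 2  # assuming bf16
size_gib = batch * heads * q_len * kv_len * bytes_per_elem / 2**30
print(f"{size_gib:.1f} GiB")  # prints "64.0 GiB"
```

Even a single long-context batch can thus demand tens of GiB for the bias tensor alone, on top of KV cache and activations.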
Hi @xin3he This PR was targeted at aice/v122 or v3.6.post.oot for now. It’s okay to allow more flexibility in order to pursue ultimate performance. |
Hi @czhu15, please let me know once the local tests pass. I can help with the merge, or you're welcome to do it yourself.
Split fp8_fused_sdpa into two phases to decrease the TTFT.
The first phase calls the fused_sdpa kernel without a mask for the prefix-cached part.
The second phase calls the fused_sdpa kernel with a mask for the new prompt part.
Splitting fp8_fused_sdpa into two phases decreases memory consumption and also decreases TTFT with the current Synapse fused_sdpa kernel.
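The two-phase split above can be sketched in plain NumPy (this is a minimal reference model of the idea, not the HPU fused_sdpa kernel; the combination-by-shared-softmax-denominator step is my assumption about how the two phases are merged):

```python
import numpy as np

def attn_parts(q, k, v, mask=None):
    """Unnormalized attention output plus the softmax denominator.

    Returning both lets two calls over disjoint key ranges be combined
    exactly: (o1 + o2) / (z1 + z2) equals single-pass softmax attention.
    """
    s = q @ k.T / np.sqrt(q.shape[-1])
    if mask is not None:
        s = np.where(mask, s, -np.inf)  # masked scores get zero weight
    w = np.exp(s)
    return w @ v, w.sum(axis=-1)

rng = np.random.default_rng(0)
d, n_cache, n_new = 8, 6, 4  # toy sizes
q = rng.standard_normal((n_new, d))
k_cache = rng.standard_normal((n_cache, d))
v_cache = rng.standard_normal((n_cache, d))
k_new = rng.standard_normal((n_new, d))
v_new = rng.standard_normal((n_new, d))

# Phase 1: new-prompt queries attend to the prefix-cached part with no
# mask -- every new token may attend to every cached token.
o1, z1 = attn_parts(q, k_cache, v_cache)

# Phase 2: new-prompt queries attend to the new prompt part with a
# causal mask.
causal = np.tril(np.ones((n_new, n_new), dtype=bool))
o2, z2 = attn_parts(q, k_new, v_new, causal)

# Combine the phases by sharing the softmax denominator.
out = (o1 + o2) / (z1 + z2)[:, None]

# Reference: one pass over concatenated keys with the full
# (n_new x n_total) mask -- the dense bias the split avoids at scale.
full_mask = np.concatenate(
    [np.ones((n_new, n_cache), dtype=bool), causal], axis=1)
o_ref, z_ref = attn_parts(q, np.concatenate([k_cache, k_new]),
                          np.concatenate([v_cache, v_new]), full_mask)
ref = o_ref / z_ref[:, None]
```

Here `out` matches `ref`, so the split changes only how the work (and the mask memory) is laid out, not the attention result.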