Conversation

ye-NX (Contributor) commented on Nov 3, 2025

This PR adds basic Flash Attention support for the SYCL backend, enabling more efficient attention computation on Intel GPUs.

  • Implemented a Flash Attention kernel for the SYCL backend
  • Added a forward pass implementation with block-wise computation (see the sketch below)
  • Integrated with the existing GGML SYCL infrastructure
  • Support for F32 (F16 support is planned)
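
As a rough illustration of what "block-wise computation" means here, below is a minimal, self-contained C++ sketch of the online-softmax accumulation that Flash Attention forward passes are built on: K/V are streamed in fixed-size blocks and the running softmax statistics are rescaled as each block arrives. The function name, memory layout, and the omission of masking are assumptions for illustration only; the actual SYCL kernel in this PR is organized differently.

```cpp
// Hypothetical illustration only (not the PR's SYCL kernel): attention for one
// query row, with K/V processed block by block using online-softmax rescaling.
#include <algorithm>
#include <cmath>
#include <limits>
#include <vector>

// q: [d], K/V: [n_kv][d] row-major, out: [d]; no mask, F32 only.
void flash_attn_row(const float * q, const float * K, const float * V,
                    float * out, int n_kv, int d, int block_size, float scale) {
    float m = -std::numeric_limits<float>::infinity(); // running max of scores
    float l = 0.0f;                                    // running softmax denominator
    std::vector<float> acc(d, 0.0f);                   // running weighted sum of V rows

    for (int start = 0; start < n_kv; start += block_size) {
        const int end = std::min(start + block_size, n_kv);

        // scores for this block: s_j = scale * (q . k_j); track the new running max
        std::vector<float> s(end - start);
        float m_new = m;
        for (int j = start; j < end; ++j) {
            float dot = 0.0f;
            for (int i = 0; i < d; ++i) dot += q[i] * K[j*d + i];
            s[j - start] = scale * dot;
            m_new = std::max(m_new, s[j - start]);
        }

        // rescale previous accumulators so all exponents share the new max
        const float corr = std::exp(m - m_new);
        l *= corr;
        for (int i = 0; i < d; ++i) acc[i] *= corr;
        m = m_new;

        // fold this block into the running numerator/denominator
        for (int j = start; j < end; ++j) {
            const float p = std::exp(s[j - start] - m);
            l += p;
            for (int i = 0; i < d; ++i) acc[i] += p * V[j*d + i];
        }
    }

    for (int i = 0; i < d; ++i) out[i] = acc[i] / l;
}
```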

Authors:
Joint work by @safranowith and @ye-NX

Notes:

  • This is an initial implementation
  • Performance benchmarks and optimizations are planned for future iterations
  • Feedback and suggestions are welcome!

The github-actions bot added the labels ggml (changes relating to the ggml tensor library for machine learning) and SYCL (https://en.wikipedia.org/wiki/SYCL - GPU programming language) on Nov 3, 2025.
NeoZhangJianyu (Collaborator) left a comment


I get a compile error on https://github.com/ye-NX/llama.cpp/tree/saf-ye/flash-attn:

/home/xxx/hd1/llama.cpp/llama.cpp_saf-ye_flash-attn/ggml/src/ggml-sycl/ggml-sycl.cpp:3843:13: error: use of undeclared identifier 'ggml_sycl_op_flash_attn'
 3843 |             ggml_sycl_op_flash_attn(ctx, dst);
      |             ^
/home/xxx/hd1/llama.cpp/llama.cpp_saf-ye_flash-attn/ggml/src/ggml-sycl/ggml-sycl.cpp:4508:20: error: use of undeclared identifier 'ggml_sycl_flash_attn_ext_supported'
 4508 |             return ggml_sycl_flash_attn_ext_supported(op);
      |                    ^
/home/xxx/hd1/llama.cpp/llama.cpp_saf-ye_flash-attn/ggml/src/ggml-sycl/pad.cpp:64:30: warning: unused parameter 'item_ct1' [-Wunused-parameter]
   64 |         [=](sycl::nd_item<3> item_ct1) {
      |                              ^
2 errors generated.
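
(Editorial note: both errors are call sites in the dispatcher that cannot see declarations for the new functions, so the usual fix is to declare them in a header included by ggml-sycl.cpp. The header name and exact signatures below are only guesses reconstructed from the call sites in this log, not code from the PR.)

```cpp
// Hypothetical declarations, e.g. in ggml/src/ggml-sycl/flash-attn.hpp (name assumed),
// included from ggml-sycl.cpp; signatures inferred from the call sites above.
#pragma once

#include "common.hpp" // assumed location of ggml_backend_sycl_context

// forward-pass dispatch for a GGML_OP_FLASH_ATTN_EXT node
void ggml_sycl_op_flash_attn(ggml_backend_sycl_context & ctx, ggml_tensor * dst);

// reports whether the SYCL backend can run this node (used by supports_op)
bool ggml_sycl_flash_attn_ext_supported(const ggml_tensor * op);
```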

safranowith and others added 3 commits on November 4, 2025 (co-authored by safranowith, ye-NX, Neo Zhang Jianyu, and Sigbjørn Skjæret)
safranowith and others added 2 commits on November 4, 2025 (co-authored by safranowith and ye-NX)
NeoZhangJianyu (Collaborator) commented on Nov 5, 2025

The build passes now, but Flash Attention is not enabled on the GPU:

llama_context: layer 0 is assigned to device SYCL0 but the Flash Attention tensor is assigned to device CPU (usually due to missing support)
llama_context: Flash Attention was auto, set to disabled

  1. How can Flash Attention be enabled?
  2. What is the performance benefit?
  3. Is there a plan to support more data types? This PR only covers FP32.
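
(Editorial note on question 1: the "assigned to device CPU ... missing support" message means the scheduler asked the SYCL backend whether it supports the GGML_OP_FLASH_ATTN_EXT node and got "no", so llama.cpp disables Flash Attention even when it is requested, e.g. with the --flash-attn option. Below is a minimal sketch of how such a support gate commonly looks; the helper name is taken from the error log earlier in the thread, and the conditions are illustrative assumptions, not the PR's actual checks.)

```cpp
// Sketch only: a support check the backend's supports_op switch could call for
// GGML_OP_FLASH_ATTN_EXT. The conditions below are assumptions for illustration.
#include "ggml.h"

bool ggml_sycl_flash_attn_ext_supported(const ggml_tensor * op) {
    const ggml_tensor * q = op->src[0];
    const ggml_tensor * k = op->src[1];
    const ggml_tensor * v = op->src[2];

    // this PR targets F32 only; everything else falls back to the CPU backend
    if (q->type != GGML_TYPE_F32 || k->type != GGML_TYPE_F32 || v->type != GGML_TYPE_F32) {
        return false;
    }
    return op->type == GGML_TYPE_F32;
}

// ... and in the backend's supports_op handler:
//     case GGML_OP_FLASH_ATTN_EXT:
//         return ggml_sycl_flash_attn_ext_supported(op);
```

Until that check returns true for the node's types and shapes, llama_context keeps assigning the Flash Attention tensor to the CPU and prints exactly the message quoted above.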

ye-NX (Contributor, Author) commented on Nov 6, 2025

Thanks for the feedback!
We're currently investigating why Flash Attention isn't being enabled on the GPU, and we're continuing to refine the implementation.
We also plan to add support for f16. If there are other data types that are important to your use cases, we would love to hear about them.

NeoZhangJianyu (Collaborator) commented on Nov 10, 2025

> Thanks for the feedback! We're currently investigating why Flash Attention isn't being enabled on the GPU, and we're continuing to refine the implementation. We also plan to add support for f16. If there are other data types that are important to your use cases, we would love to hear about them.

Ok! It's great!

In fact, I'm implementing flash attention too. It will support more data types. It will need several weeks to finish.
I don't know how to handle my current task. :)

Should I cancel my task and depend on your implementation?
Or should I continue my task and merge it if mine turns out better than yours?

What do you think?

ye-NX (Contributor, Author) commented on Nov 10, 2025

> Ok! It's great!
>
> In fact, I'm implementing flash attention too. It will support more data types. It will need several weeks to finish. I don't know how to handle my current task. :)
>
> Should I cancel my task and depend on your implementation? Or should I continue my task and merge it if mine turns out better than yours?
>
> What do you think?

What a coincidence...
For us, this is actually our final project, which we’ll be presenting at a demo in about three weeks.
We’d really appreciate it if you could let us continue developing it under your guidance.
If we don’t manage to polish everything perfectly by our deadline, maybe you could continue improving it afterward.
Does that sound okay to you?

NeoZhangJianyu (Collaborator) replied:

> Ok! It's great!
> In fact, I'm implementing flash attention too. It will support more data types. It will need several weeks to finish. I don't know how to handle my current task. :)
> Should I cancel my task and depend on your implementation? Or should I continue my task and merge it if mine turns out better than yours?
> What do you think?
>
> What a coincidence... For us, this is actually our final project, which we’ll be presenting at a demo in about three weeks. We’d really appreciate it if you could let us continue developing it under your guidance. If we don’t manage to polish everything perfectly by our deadline, maybe you could continue improving it afterward. Does that sound okay to you?

Yes! I will support you!
There is no time limit. Please go ahead!

I want to contact you by email, but I can't see your email address.
Could you send me an email at [email protected]?
Then we could discuss more.

Thank you!
