Conversation

ye-NX (Contributor) commented on Nov 3, 2025

This PR adds basic Flash Attention support for the SYCL backend, enabling more efficient attention computation on Intel GPUs.

  • Implemented a Flash Attention kernel for the SYCL backend
  • Added a forward pass implementation with block-wise computation (see the sketch below)
  • Integrated with the existing GGML SYCL infrastructure
  • Support for F32 (F16 support is planned)
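
As a rough illustration of what "block-wise computation" means here, below is a minimal, self-contained C++ sketch of the online-softmax accumulation that Flash Attention forward passes are built on: K/V are streamed in fixed-size blocks and the running softmax statistics are rescaled as each block arrives. The function name, memory layout, and the omission of masking are assumptions for illustration only; the actual SYCL kernel in this PR is organized differently.

```cpp
// Hypothetical illustration only (not the PR's SYCL kernel): attention for one
// query row, with K/V processed block by block using online-softmax rescaling.
#include <algorithm>
#include <cmath>
#include <limits>
#include <vector>

// q: [d], K/V: [n_kv][d] row-major, out: [d]; no mask, F32 only.
void flash_attn_row(const float * q, const float * K, const float * V,
                    float * out, int n_kv, int d, int block_size, float scale) {
    float m = -std::numeric_limits<float>::infinity(); // running max of scores
    float l = 0.0f;                                    // running softmax denominator
    std::vector<float> acc(d, 0.0f);                   // running weighted sum of V rows

    for (int start = 0; start < n_kv; start += block_size) {
        const int end = std::min(start + block_size, n_kv);

        // scores for this block: s_j = scale * (q . k_j); track the new running max
        std::vector<float> s(end - start);
        float m_new = m;
        for (int j = start; j < end; ++j) {
            float dot = 0.0f;
            for (int i = 0; i < d; ++i) dot += q[i] * K[j*d + i];
            s[j - start] = scale * dot;
            m_new = std::max(m_new, s[j - start]);
        }

        // rescale previous accumulators so all exponents share the new max
        const float corr = std::exp(m - m_new);
        l *= corr;
        for (int i = 0; i < d; ++i) acc[i] *= corr;
        m = m_new;

        // fold this block into the running numerator/denominator
        for (int j = start; j < end; ++j) {
            const float p = std::exp(s[j - start] - m);
            l += p;
            for (int i = 0; i < d; ++i) acc[i] += p * V[j*d + i];
        }
    }

    for (int i = 0; i < d; ++i) out[i] = acc[i] / l;
}
```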

Authors:
Joint work by @safranowith and @ye-NX

Notes:

  • This is an initial implementation
  • Performance benchmarks and optimizations are planned for future iterations
  • Feedback and suggestions are welcome!

The github-actions bot added the labels ggml (changes relating to the ggml tensor library for machine learning) and SYCL (https://en.wikipedia.org/wiki/SYCL - GPU programming language) on Nov 3, 2025.
NeoZhangJianyu (Collaborator) left a comment


I get a compile error on https://github.com/ye-NX/llama.cpp/tree/saf-ye/flash-attn:

/home/xxx/hd1/llama.cpp/llama.cpp_saf-ye_flash-attn/ggml/src/ggml-sycl/ggml-sycl.cpp:3843:13: error: use of undeclared identifier 'ggml_sycl_op_flash_attn'
 3843 |             ggml_sycl_op_flash_attn(ctx, dst);
      |             ^
/home/xxx/hd1/llama.cpp/llama.cpp_saf-ye_flash-attn/ggml/src/ggml-sycl/ggml-sycl.cpp:4508:20: error: use of undeclared identifier 'ggml_sycl_flash_attn_ext_supported'
 4508 |             return ggml_sycl_flash_attn_ext_supported(op);
      |                    ^
/home/xxx/hd1/llama.cpp/llama.cpp_saf-ye_flash-attn/ggml/src/ggml-sycl/pad.cpp:64:30: warning: unused parameter 'item_ct1' [-Wunused-parameter]
   64 |         [=](sycl::nd_item<3> item_ct1) {
      |                              ^
2 errors generated.
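
(Editorial note: both errors are call sites in the dispatcher that cannot see declarations for the new functions, so the usual fix is to declare them in a header included by ggml-sycl.cpp. The header name and exact signatures below are only guesses reconstructed from the call sites in this log, not code from the PR.)

```cpp
// Hypothetical declarations, e.g. in ggml/src/ggml-sycl/flash-attn.hpp (name assumed),
// included from ggml-sycl.cpp; signatures inferred from the call sites above.
#pragma once

#include "common.hpp" // assumed location of ggml_backend_sycl_context

// forward-pass dispatch for a GGML_OP_FLASH_ATTN_EXT node
void ggml_sycl_op_flash_attn(ggml_backend_sycl_context & ctx, ggml_tensor * dst);

// reports whether the SYCL backend can run this node (used by supports_op)
bool ggml_sycl_flash_attn_ext_supported(const ggml_tensor * op);
```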

safranowith and others added 3 commits on November 4, 2025 (co-authored by safranowith, ye-NX, Neo Zhang Jianyu, and Sigbjørn Skjæret)
safranowith and others added 2 commits on November 4, 2025 (co-authored by safranowith and ye-NX)
NeoZhangJianyu (Collaborator) commented on Nov 5, 2025

The build passes now, but Flash Attention is not enabled on the GPU:

llama_context: layer 0 is assigned to device SYCL0 but the Flash Attention tensor is assigned to device CPU (usually due to missing support)
llama_context: Flash Attention was auto, set to disabled

  1. How can Flash Attention be enabled?
  2. What is the performance benefit?
  3. Is there a plan to support more data types? This PR only covers FP32.
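
(Editorial note on question 1: the "assigned to device CPU ... missing support" message means the scheduler asked the SYCL backend whether it supports the GGML_OP_FLASH_ATTN_EXT node and got "no", so llama.cpp disables Flash Attention even when it is requested, e.g. with the --flash-attn option. Below is a minimal sketch of how such a support gate commonly looks; the helper name is taken from the error log earlier in the thread, and the conditions are illustrative assumptions, not the PR's actual checks.)

```cpp
// Sketch only: a support check the backend's supports_op switch could call for
// GGML_OP_FLASH_ATTN_EXT. The conditions below are assumptions for illustration.
#include "ggml.h"

bool ggml_sycl_flash_attn_ext_supported(const ggml_tensor * op) {
    const ggml_tensor * q = op->src[0];
    const ggml_tensor * k = op->src[1];
    const ggml_tensor * v = op->src[2];

    // this PR targets F32 only; everything else falls back to the CPU backend
    if (q->type != GGML_TYPE_F32 || k->type != GGML_TYPE_F32 || v->type != GGML_TYPE_F32) {
        return false;
    }
    return op->type == GGML_TYPE_F32;
}

// ... and in the backend's supports_op handler:
//     case GGML_OP_FLASH_ATTN_EXT:
//         return ggml_sycl_flash_attn_ext_supported(op);
```

Until that check returns true for the node's types and shapes, llama_context keeps assigning the Flash Attention tensor to the CPU and prints exactly the message quoted above.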

ye-NX (Contributor, Author) commented on Nov 6, 2025

Thanks for the feedback!
We're currently investigating why Flash Attention isn't being enabled on the GPU, and we're continuing to refine the implementation.
We also plan to add support for f16. If there are other data types that are important to your use cases, we would love to hear about them.

NeoZhangJianyu (Collaborator) commented on Nov 10, 2025

> Thanks for the feedback! We're currently investigating why Flash Attention isn't being enabled on the GPU, and we're continuing to refine the implementation. We also plan to add support for f16. If there are other data types that are important to your use cases, we would love to hear about them.

Ok! It's great!

In fact, I'm implementing flash attention too. It will support more data types. It will need several weeks to finish.
I don't know how to handle my current task. :)

Should I cancel my task and depend on your implementation?
Or should I continue my task and merge it if mine turns out better than yours?

What do you think?

ye-NX (Contributor, Author) commented on Nov 10, 2025

> Ok! It's great!
>
> In fact, I'm implementing flash attention too. It will support more data types. It will need several weeks to finish. I don't know how to handle my current task. :)
>
> Should I cancel my task and depend on your implementation? Or should I continue my task and merge it if mine turns out better than yours?
>
> What do you think?

What a coincidence...
For us, this is actually our final project, which we’ll be presenting at a demo in about three weeks.
We’d really appreciate it if you could let us continue developing it under your guidance.
If we don’t manage to polish everything perfectly by our deadline, maybe you could continue improving it afterward.
Does that sound okay to you?

NeoZhangJianyu (Collaborator) replied:

> Ok! It's great!
> In fact, I'm implementing flash attention too. It will support more data types. It will need several weeks to finish. I don't know how to handle my current task. :)
> Should I cancel my task and depend on your implementation? Or should I continue my task and merge it if mine turns out better than yours?
> What do you think?
>
> What a coincidence... For us, this is actually our final project, which we’ll be presenting at a demo in about three weeks. We’d really appreciate it if you could let us continue developing it under your guidance. If we don’t manage to polish everything perfectly by our deadline, maybe you could continue improving it afterward. Does that sound okay to you?

Yes! I will support you!
There is no time limit. Please go ahead!

I want to contact you by email, but I can't see your email address.
Could you send me an email at [email protected]?
Then we could discuss more.

Thank you!
