
Conversation

@leafs1
Contributor

@leafs1 leafs1 commented Jul 29, 2025

Summary: Built a flash attention compute shader for the Vulkan backend delegate. The current implementation is not fully optimized, but it is functional. The shader should speed up the SDPA step in the attention block during transformer inference, since the previous implementation issued many I/O operations. It includes proper multi-query attention support for models like LLaMA, uses tiled block processing to reduce memory usage, and replaces several separate operations (matmul, softmax, masking) with a single compute shader.
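
For readers unfamiliar with the tiled approach, here is a minimal NumPy sketch of the core idea: an online softmax over K/V tiles, causal masking, and K/V head broadcasting for multi-query attention. It is only an illustration of the algorithm; the function name, block sizes, and layout are assumptions for exposition and do not reflect the actual GLSL shader in this PR.

```python
# Illustrative sketch of tiled (flash-style) SDPA with online softmax.
# Assumes query and key positions are aligned (no KV-cache offset);
# block sizes and shapes are arbitrary choices, not the shader's.
import numpy as np

def flash_sdpa(q, k, v, block_q=32, block_k=32, causal=True):
    """q: (Hq, Lq, D); k, v: (Hkv, L, D). Hq must be a multiple of Hkv."""
    Hq, Lq, D = q.shape
    Hkv, L, _ = k.shape
    group = Hq // Hkv                        # query heads sharing one K/V head (MQA/GQA)
    scale = 1.0 / np.sqrt(D)
    out = np.zeros_like(q)

    for h in range(Hq):
        kh, vh = k[h // group], v[h // group]     # broadcast the shared K/V head
        for qs in range(0, Lq, block_q):
            qe = min(qs + block_q, Lq)
            q_blk = q[h, qs:qe] * scale
            m = np.full(qe - qs, -np.inf)         # running row maxima
            l = np.zeros(qe - qs)                 # running softmax denominators
            acc = np.zeros((qe - qs, D))          # unnormalized output accumulator
            for ks in range(0, L, block_k):
                ke = min(ks + block_k, L)
                s = q_blk @ kh[ks:ke].T           # partial score tile
                if causal:
                    qi = np.arange(qs, qe)[:, None]
                    ki = np.arange(ks, ke)[None, :]
                    s = np.where(ki <= qi, s, -np.inf)
                m_new = np.maximum(m, s.max(axis=1))
                p = np.exp(s - m_new[:, None])
                corr = np.exp(m - m_new)          # rescale stats from earlier tiles
                l = l * corr + p.sum(axis=1)
                acc = acc * corr[:, None] + p @ vh[ks:ke]
                m = m_new
            out[h, qs:qe] = acc / l[:, None]
    return out
```

Conceptually this is where the savings come from: the full Lq x L score matrix is never materialized, so a single pass over K/V tiles replaces the separate matmul, mask, and softmax kernels and the intermediate reads and writes between them.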

Reviewed By: SS-JIA

Differential Revision: D78836150

cc @SS-JIA @manuelcandales @cbilgin

@leafs1 leafs1 requested a review from SS-JIA as a code owner July 29, 2025 21:18
@pytorch-bot pytorch-bot bot added the module: vulkan (Issues related to the Vulkan delegate and code under backends/vulkan/) label Jul 29, 2025
@pytorch-bot

pytorch-bot bot commented Jul 29, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/12982

Note: Links to docs will display an error until the docs builds have been completed.

❌ 2 New Failures, 1 Unrelated Failure

As of commit f3d7af7 with merge base afdbb85:

NEW FAILURES - The following jobs have failed:

BROKEN TRUNK - The following job failed but was present on the merge base:

👉 Rebase onto the `viable/strict` branch to avoid these failures

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@meta-cla meta-cla bot added the CLA Signed (This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed.) label Jul 29, 2025
@facebook-github-bot
Contributor

This pull request was exported from Phabricator. Differential Revision: D78836150

@leafs1
Contributor Author

leafs1 commented Jul 29, 2025

@pytorchbot label "release notes: none"

@pytorch-bot pytorch-bot bot added the release notes: none (Do not include this in the release notes) label Jul 29, 2025

leafs1 added a commit to leafs1/executorch that referenced this pull request Jul 29, 2025
@leafs1 leafs1 force-pushed the export-D78836150 branch from 1ac5d0e to 8d5fb64 on July 29, 2025 21:27

@leafs1 leafs1 force-pushed the export-D78836150 branch from 8d5fb64 to 011bcae on July 30, 2025 18:38
leafs1 added a commit to leafs1/executorch that referenced this pull request Jul 30, 2025

leafs1 added a commit to leafs1/executorch that referenced this pull request Jul 30, 2025
@leafs1 leafs1 force-pushed the export-D78836150 branch from 011bcae to dadf14a on July 30, 2025 18:47

leafs1 added a commit to leafs1/executorch that referenced this pull request Aug 7, 2025
@leafs1 leafs1 force-pushed the export-D78836150 branch from dadf14a to 0786839 on August 7, 2025 22:32

@leafs1 leafs1 force-pushed the export-D78836150 branch from 0786839 to f3d7af7 on August 7, 2025 23:00
@leafs1 leafs1 merged commit c99d2d5 into pytorch:main Aug 8, 2025
175 of 181 checks passed
agrima1304 pushed a commit to agrima1304/executorch that referenced this pull request Aug 26, 2025
