
Conversation

@leafs1
Contributor

@leafs1 leafs1 commented Jul 29, 2025

Summary: Built a flash attention compute shader for the Vulkan backend delegate. The current implementation is not fully optimized, but it is functional. The shader should speed up the SDPA step in the attention block during transformer inference, since the previous implementation issued many I/O operations. It includes proper multi-query attention support for models like LLaMA, uses tiled block processing to reduce memory usage, and replaces several separate operations (matmul, softmax, masking) with a single compute shader.
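
For readers unfamiliar with the tiled approach, here is a minimal NumPy sketch of the core idea: an online softmax over K/V tiles, causal masking, and K/V head broadcasting for multi-query attention. It is only an illustration of the algorithm; the function name, block sizes, and layout are assumptions for exposition and do not reflect the actual GLSL shader in this PR.

```python
# Illustrative sketch of tiled (flash-style) SDPA with online softmax.
# Assumes query and key positions are aligned (no KV-cache offset);
# block sizes and shapes are arbitrary choices, not the shader's.
import numpy as np

def flash_sdpa(q, k, v, block_q=32, block_k=32, causal=True):
    """q: (Hq, Lq, D); k, v: (Hkv, L, D). Hq must be a multiple of Hkv."""
    Hq, Lq, D = q.shape
    Hkv, L, _ = k.shape
    group = Hq // Hkv                        # query heads sharing one K/V head (MQA/GQA)
    scale = 1.0 / np.sqrt(D)
    out = np.zeros_like(q)

    for h in range(Hq):
        kh, vh = k[h // group], v[h // group]     # broadcast the shared K/V head
        for qs in range(0, Lq, block_q):
            qe = min(qs + block_q, Lq)
            q_blk = q[h, qs:qe] * scale
            m = np.full(qe - qs, -np.inf)         # running row maxima
            l = np.zeros(qe - qs)                 # running softmax denominators
            acc = np.zeros((qe - qs, D))          # unnormalized output accumulator
            for ks in range(0, L, block_k):
                ke = min(ks + block_k, L)
                s = q_blk @ kh[ks:ke].T           # partial score tile
                if causal:
                    qi = np.arange(qs, qe)[:, None]
                    ki = np.arange(ks, ke)[None, :]
                    s = np.where(ki <= qi, s, -np.inf)
                m_new = np.maximum(m, s.max(axis=1))
                p = np.exp(s - m_new[:, None])
                corr = np.exp(m - m_new)          # rescale stats from earlier tiles
                l = l * corr + p.sum(axis=1)
                acc = acc * corr[:, None] + p @ vh[ks:ke]
                m = m_new
            out[h, qs:qe] = acc / l[:, None]
    return out
```

Conceptually this is where the savings come from: the full Lq x L score matrix is never materialized, so a single pass over K/V tiles replaces the separate matmul, mask, and softmax kernels and the intermediate reads and writes between them.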

Reviewed By: SS-JIA

Differential Revision: D78836150

cc @SS-JIA @manuelcandales @cbilgin

@leafs1 leafs1 requested a review from SS-JIA as a code owner July 29, 2025 21:18
@pytorch-bot pytorch-bot bot added the module: vulkan (Issues related to the Vulkan delegate and code under backends/vulkan/) label Jul 29, 2025
@pytorch-bot

pytorch-bot bot commented Jul 29, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/12982

Note: Links to docs will display an error until the docs builds have been completed.

❌ 2 New Failures, 1 Unrelated Failure

As of commit f3d7af7 with merge base afdbb85:

NEW FAILURES - The following jobs have failed:

BROKEN TRUNK - The following job failed but was present on the merge base:

👉 Rebase onto the `viable/strict` branch to avoid these failures

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@meta-cla meta-cla bot added the CLA Signed (This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed.) label Jul 29, 2025
@facebook-github-bot
Contributor

This pull request was exported from Phabricator. Differential Revision: D78836150

@leafs1
Contributor Author

leafs1 commented Jul 29, 2025

@pytorchbot label "release notes: none"

@pytorch-bot pytorch-bot bot added the release notes: none (Do not include this in the release notes) label Jul 29, 2025

leafs1 added a commit to leafs1/executorch that referenced this pull request Jul 29, 2025
@leafs1 leafs1 force-pushed the export-D78836150 branch from 1ac5d0e to 8d5fb64 on July 29, 2025 21:27

@leafs1 leafs1 force-pushed the export-D78836150 branch from 8d5fb64 to 011bcae on July 30, 2025 18:38
leafs1 added a commit to leafs1/executorch that referenced this pull request Jul 30, 2025

leafs1 added a commit to leafs1/executorch that referenced this pull request Jul 30, 2025
@leafs1 leafs1 force-pushed the export-D78836150 branch from 011bcae to dadf14a on July 30, 2025 18:47

leafs1 added a commit to leafs1/executorch that referenced this pull request Aug 7, 2025
@leafs1 leafs1 force-pushed the export-D78836150 branch from dadf14a to 0786839 on August 7, 2025 22:32

@leafs1 leafs1 force-pushed the export-D78836150 branch from 0786839 to f3d7af7 on August 7, 2025 23:00
@leafs1 leafs1 merged commit c99d2d5 into pytorch:main Aug 8, 2025
175 of 181 checks passed
agrima1304 pushed a commit to agrima1304/executorch that referenced this pull request Aug 26, 2025
