Flash Attention Texture Compute Shader for Vulkan Backend Delegate #12982
Conversation
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/12982
Note: Links to docs will display an error until the docs builds have been completed.
❌ 2 New Failures, 1 Unrelated Failure
As of commit f3d7af7 with merge base afdbb85:
NEW FAILURES - The following jobs have failed:
BROKEN TRUNK - The following job failed but was present on the merge base:
👉 Rebase onto the `viable/strict` branch to avoid these failures.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
This pull request was exported from Phabricator. Differential Revision: D78836150
@pytorchbot label "release notes: none"
Force-pushed from dadf14a to 0786839
Force-pushed from 0786839 to f3d7af7
Summary: Built a flash attention compute shader for the Vulkan backend delegate. The current implementation is not fully optimized, but it is functional. This shader should speed up the SDPA step in the attention block during transformer inference, since the previous implementation required many I/O operations. The implementation includes proper multi-query attention support for models like LLaMA, uses tiled block processing to reduce memory usage, and replaces multiple separate operations (matmul, softmax, masking) with a single efficient compute shader.
Reviewed By: SS-JIA
Differential Revision: D78836150
cc @SS-JIA @manuelcandales @cbilgin
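The tiled block processing described above can be illustrated with a plain CPU reference of FlashAttention-style SDPA. This is a minimal sketch, assuming a single attention head and row-major `[seq_len x d]` buffers; the function name, block size `Bc`, and all other identifiers are illustrative and are not taken from the actual GLSL shader in this PR:

```cpp
// Minimal CPU reference sketch of tiled (FlashAttention-style) SDPA.
// K/V are processed in column blocks of size Bc while running softmax
// statistics (row max m, row sum l) are maintained, so the full
// seq_len x seq_len score matrix is never materialized.
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <limits>
#include <vector>

void flash_attention_ref(const std::vector<float>& Q,  // [seq_len x d]
                         const std::vector<float>& K,  // [seq_len x d]
                         const std::vector<float>& V,  // [seq_len x d]
                         std::vector<float>& O,        // [seq_len x d]
                         size_t seq_len, size_t d, size_t Bc, bool causal) {
  const float kNegInf = -std::numeric_limits<float>::infinity();
  const float scale = 1.0f / std::sqrt(static_cast<float>(d));
  std::vector<float> m(seq_len, kNegInf);  // running row max
  std::vector<float> l(seq_len, 0.0f);     // running softmax denominator
  O.assign(seq_len * d, 0.0f);

  for (size_t j0 = 0; j0 < seq_len; j0 += Bc) {          // K/V column block
    const size_t j1 = std::min(j0 + Bc, seq_len);
    for (size_t i = 0; i < seq_len; ++i) {               // query row
      // Scores of row i against the current K block (masked = -inf).
      std::vector<float> s(j1 - j0, kNegInf);
      float block_max = kNegInf;
      for (size_t j = j0; j < j1; ++j) {
        if (causal && j > i) continue;                   // causal mask
        float dot = 0.0f;
        for (size_t k = 0; k < d; ++k) dot += Q[i * d + k] * K[j * d + k];
        s[j - j0] = dot * scale;
        block_max = std::max(block_max, s[j - j0]);
      }
      if (block_max == kNegInf) continue;                // block fully masked

      // Online softmax: rescale previous accumulators to the new row max.
      const float m_new = std::max(m[i], block_max);
      const float alpha = std::exp(m[i] - m_new);
      float l_new = l[i] * alpha;
      for (size_t k = 0; k < d; ++k) O[i * d + k] *= alpha;

      for (size_t j = j0; j < j1; ++j) {
        if (s[j - j0] == kNegInf) continue;
        const float p = std::exp(s[j - j0] - m_new);
        l_new += p;
        for (size_t k = 0; k < d; ++k) O[i * d + k] += p * V[j * d + k];
      }
      m[i] = m_new;
      l[i] = l_new;
    }
  }
  // Final normalization by the accumulated softmax denominator.
  for (size_t i = 0; i < seq_len; ++i)
    if (l[i] > 0.0f)
      for (size_t k = 0; k < d; ++k) O[i * d + k] /= l[i];
}
```

For multi-query attention (as used by LLaMA-family models), several query heads read from the same K/V head; the per-block running max/sum update stays the same, which is what lets a single compute shader stand in for the separate matmul, softmax, and masking passes.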