[ET-VK] Implement SDPA with fused ops #14130
## Context

As title; optimize the SDPA operator by introducing shaders to perform the operation in 3 steps:

1. Compute attention weights by multiplying QT x K_cache and applying scale and mask
2. Compute softmax normalization of the computed attention weights
3. Compute the final output by multiplying the attention weights with the V cache

This new implementation is much more efficient than the existing one, which performed slicing, repeat_interleave, and transposition of the projected and cache tensors as separate steps. Fusing scale and mask into the computation of attention weights also allows the computation of elements within the mask region to be skipped.

## Impact

Decode throughput for LLMs is much improved. For llama 3.2 3B generating ~250 tokens, decode throughput increases from ~15 tok/s to ~21.5 tok/s (roughly 1.4x).

Differential Revision: [D82053493](https://our.internmc.facebook.com/intern/diff/D82053493/)
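For readers unfamiliar with the decomposition, the three steps listed under Context can be sketched as a plain-Python CPU reference. This is only an illustration under stated assumptions, not the Vulkan shaders from this PR: it handles a single head, assumes a causal mask where query row `i` sits at absolute position `input_pos + i`, and assumes the conventional `1 / sqrt(head_dim)` scale. All function and parameter names are hypothetical.

```python
import math

NEG_INF = float("-inf")


def attn_weights(q, k_cache, input_pos, scale):
    """Step 1: scaled Q x K^T with the causal mask fused in.

    q: [seq_len][head_dim], k_cache: [context_len][head_dim].
    Masked entries are written as -inf without computing the dot product,
    which mirrors the "skip the mask region" idea from the description.
    """
    rows = []
    for i, q_row in enumerate(q):
        last_visible = input_pos + i  # assumed causal mask boundary
        row = []
        for j, k_row in enumerate(k_cache):
            if j > last_visible:
                row.append(NEG_INF)  # masked: no multiply-accumulate needed
            else:
                row.append(scale * sum(a * b for a, b in zip(q_row, k_row)))
        rows.append(row)
    return rows


def softmax_rows(weights):
    """Step 2: numerically stable softmax over each row."""
    out = []
    for row in weights:
        row_max = max(row)
        exps = [math.exp(w - row_max) for w in row]  # exp(-inf) == 0.0
        total = sum(exps)
        out.append([e / total for e in exps])
    return out


def weighted_values(probs, v_cache):
    """Step 3: multiply normalized attention weights with the V cache."""
    head_dim = len(v_cache[0])
    return [
        [sum(p * v_row[d] for p, v_row in zip(row, v_cache)) for d in range(head_dim)]
        for row in probs
    ]


def sdpa(q, k_cache, v_cache, input_pos):
    """Run the three steps in order for a single attention head."""
    scale = 1.0 / math.sqrt(len(q[0]))  # assumed scale factor
    weights = attn_weights(q, k_cache, input_pos, scale)
    probs = softmax_rows(weights)
    return weighted_values(probs, v_cache)
```

In a decode step `q` has a single row, so step 1 reduces to one dot product per visible cache entry. A multi-head or grouped-query version would repeat this per head, which is where the existing implementation needed a separate repeat_interleave pass.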
This pull request was exported from Phabricator. Differential Revision: D82053493
711ccd8 into gh/SS-JIA/324/base