Replies: 1 comment
I would like to ask you how to save the Cumulative Attention Score.
I'm wondering what precision is used to represent the attention scores for a particular token during the attention computation. I thought it might be the same as the Q and K vectors, since the attention scores are essentially dot products of those, but it doesn't seem that straightforward once you realise that the KV cache can be quantized to a lower precision (and the Q vectors apparently can't?). My calculations suggest it's 32-bit by default; is this correct?
Also, can the attention scores themselves be quantized?
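To be clearer about what I'm picturing: the way K is stored and the precision the score is computed in seem like two separate things. Here's a toy sketch of my mental model (the int8 scheme is just something I made up for illustration, not how llama.cpp's quantized formats or kernels actually work):

```python
import numpy as np

# Toy mental model only: K is stored quantized (a crude per-vector int8
# scheme invented for this example), gets dequantized on the fly, and the
# score itself is accumulated in float32. This is NOT llama.cpp's actual
# quantization format or kernel.

def quantize_int8(v: np.ndarray) -> tuple[np.ndarray, np.float32]:
    """Store a K vector as int8 values plus one float32 scale."""
    scale = np.float32(np.abs(v).max() / 127.0)
    return np.round(v / scale).astype(np.int8), scale

def attn_score(q: np.ndarray, k_int8: np.ndarray, k_scale: np.float32) -> np.float32:
    """Dequantize K, then compute the scaled dot product in float32."""
    k = k_int8.astype(np.float32) * k_scale
    return np.dot(q.astype(np.float32), k) / np.sqrt(np.float32(q.size))

head_dim = 128
q = np.random.randn(head_dim).astype(np.float16)      # Q kept unquantized
k_int8, k_scale = quantize_int8(np.random.randn(head_dim).astype(np.float32))
score = attn_score(q, k_int8, k_scale)
print(score.dtype)  # float32 -> the score's precision, not K's storage precision
```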
Edit: I know you can use Flash Attention (the `-fa` arg), which somewhat reduces the need for "quantization" of these attention scores, but I'm asking more out of interest than anything else, as I'm currently trying to calculate total peak memory usage.
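For what it's worth, this is the kind of back-of-envelope calculation I'm doing for peak memory: if the full QK^T scores matrix for a layer is materialized (i.e. without Flash Attention), its size at 32-bit vs 16-bit matters a lot at long contexts. The example shape and the 4-byte default here are my assumptions, not values I've confirmed in the code:

```python
# Back-of-envelope size of one layer's materialized attention scores
# (shape: n_heads x n_ctx x n_ctx) when not using Flash Attention.
# The example shape (32 heads, 8k context) and the 4-byte (fp32) default
# are my own assumptions, not values read out of llama.cpp.

def scores_bytes(n_ctx: int, n_heads: int, bytes_per_score: int = 4) -> int:
    return n_heads * n_ctx * n_ctx * bytes_per_score

n_ctx, n_heads = 8192, 32
print(f"fp32: {scores_bytes(n_ctx, n_heads, 4) / 2**30:.1f} GiB per layer")  # 8.0 GiB
print(f"fp16: {scores_bytes(n_ctx, n_heads, 2) / 2**30:.1f} GiB per layer")  # 4.0 GiB
```

My understanding is that with Flash Attention the scores are processed in tiles and never fully materialized, which is why `-fa` sidesteps most of this term in the peak-memory estimate.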