Replies: 1 comment
I would like to ask you how to save the Cumulative Attention Score.
I'm wondering what precision is used to represent the attention scores for a particular token during the attention computation. I thought it might be the same as the Q and K vectors, since the attention scores are essentially dot products of those, but it doesn't seem that straightforward once you realise that the KV cache can be quantized to a lower precision (and the Q vectors apparently can't?). My calculations suggest it's 32-bit by default; is this correct?
Also, can the attention scores themselves be quantized?
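To be clearer about what I'm picturing: the way K is stored and the precision the score is computed in seem like two separate things. Here's a toy sketch of my mental model (the int8 scheme is just something I made up for illustration, not how llama.cpp's quantized formats or kernels actually work):

```python
import numpy as np

# Toy mental model only: K is stored quantized (a crude per-vector int8
# scheme invented for this example), gets dequantized on the fly, and the
# score itself is accumulated in float32. This is NOT llama.cpp's actual
# quantization format or kernel.

def quantize_int8(v: np.ndarray) -> tuple[np.ndarray, np.float32]:
    """Store a K vector as int8 values plus one float32 scale."""
    scale = np.float32(np.abs(v).max() / 127.0)
    return np.round(v / scale).astype(np.int8), scale

def attn_score(q: np.ndarray, k_int8: np.ndarray, k_scale: np.float32) -> np.float32:
    """Dequantize K, then compute the scaled dot product in float32."""
    k = k_int8.astype(np.float32) * k_scale
    return np.dot(q.astype(np.float32), k) / np.sqrt(np.float32(q.size))

head_dim = 128
q = np.random.randn(head_dim).astype(np.float16)      # Q kept unquantized
k_int8, k_scale = quantize_int8(np.random.randn(head_dim).astype(np.float32))
score = attn_score(q, k_int8, k_scale)
print(score.dtype)  # float32 -> the score's precision, not K's storage precision
```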
Edit: I know you can use Flash Attention (the `-fa` arg), which somewhat reduces the need for "quantization" of these attention scores, but I'm asking more out of interest than anything else, as I'm currently trying to calculate total peak memory usage.
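For what it's worth, this is the kind of back-of-envelope calculation I'm doing for peak memory: if the full QK^T scores matrix for a layer is materialized (i.e. without Flash Attention), its size at 32-bit vs 16-bit matters a lot at long contexts. The example shape and the 4-byte default here are my assumptions, not values I've confirmed in the code:

```python
# Back-of-envelope size of one layer's materialized attention scores
# (shape: n_heads x n_ctx x n_ctx) when not using Flash Attention.
# The example shape (32 heads, 8k context) and the 4-byte (fp32) default
# are my own assumptions, not values read out of llama.cpp.

def scores_bytes(n_ctx: int, n_heads: int, bytes_per_score: int = 4) -> int:
    return n_heads * n_ctx * n_ctx * bytes_per_score

n_ctx, n_heads = 8192, 32
print(f"fp32: {scores_bytes(n_ctx, n_heads, 4) / 2**30:.1f} GiB per layer")  # 8.0 GiB
print(f"fp16: {scores_bytes(n_ctx, n_heads, 2) / 2**30:.1f} GiB per layer")  # 4.0 GiB
```

My understanding is that with Flash Attention the scores are processed in tiles and never fully materialized, which is why `-fa` sidesteps most of this term in the peak-memory estimate.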