Commit 338353e

Refactor attention score calculation in kv-cache.md to include scaling by the square root of the key dimension. (#2893)
1 parent f548686 commit 338353e

File tree

1 file changed (+3, -1 lines changed)


kv-cache.md

Lines changed: 3 additions & 1 deletion
````diff
@@ -85,8 +85,10 @@ Here’s a minimal PyTorch equivalent using a causal mask:
 
 ```python
 import torch.nn.functional as F
+import math
 
-attention_scores = Q @ K.T
+d_k = K.shape[-1]
+attention_scores = (Q @ K.T) / math.sqrt(d_k)
 
 # Lower triangular mask to prevent future token access
 causal_mask = torch.tril(torch.ones(input_seq_length, input_seq_length))
````
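To see the patched line in context, here is a minimal runnable sketch of scaled causal attention built around the changed code. The tensor shapes, the random inputs, and the surrounding softmax/output steps are illustrative assumptions; only the scaling line and the mask line come from the diff itself.

```python
import math
import torch
import torch.nn.functional as F

torch.manual_seed(0)
input_seq_length, d_k = 4, 8  # assumed toy dimensions

# Hypothetical query/key/value tensors for illustration
Q = torch.randn(input_seq_length, d_k)
K = torch.randn(input_seq_length, d_k)
V = torch.randn(input_seq_length, d_k)

# The patched line: scale scores by sqrt(d_k) to keep softmax
# inputs in a numerically well-behaved range
attention_scores = (Q @ K.T) / math.sqrt(d_k)

# Lower triangular mask to prevent future token access
causal_mask = torch.tril(torch.ones(input_seq_length, input_seq_length))
attention_scores = attention_scores.masked_fill(causal_mask == 0, float("-inf"))

# Softmax over keys, then weight the values
attention_weights = F.softmax(attention_scores, dim=-1)
output = attention_weights @ V
```

Without the `1/sqrt(d_k)` factor, score magnitudes grow with the key dimension, pushing the softmax toward one-hot outputs and shrinking gradients, which is why the commit adds the scaling.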

0 commit comments
