Commit 338353e

Refactor attention score calculation in kv-cache.md to include scaling by the square root of the key dimension. (#2893)
1 parent f548686 commit 338353e

File tree

1 file changed (+3, -1 lines changed)


kv-cache.md

Lines changed: 3 additions & 1 deletion
````diff
@@ -85,8 +85,10 @@ Here’s a minimal PyTorch equivalent using a causal mask:
 
 ```python
 import torch.nn.functional as F
+import math
 
-attention_scores = Q @ K.T
+d_k = K.shape[-1]
+attention_scores = (Q @ K.T) / math.sqrt(d_k)
 
 # Lower triangular mask to prevent future token access
 causal_mask = torch.tril(torch.ones(input_seq_length, input_seq_length))
````
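To see the patched line in context, here is a minimal runnable sketch of scaled causal attention built around the changed code. The tensor shapes, the random inputs, and the surrounding softmax/output steps are illustrative assumptions; only the scaling line and the mask line come from the diff itself.

```python
import math
import torch
import torch.nn.functional as F

torch.manual_seed(0)
input_seq_length, d_k = 4, 8  # assumed toy dimensions

# Hypothetical query/key/value tensors for illustration
Q = torch.randn(input_seq_length, d_k)
K = torch.randn(input_seq_length, d_k)
V = torch.randn(input_seq_length, d_k)

# The patched line: scale scores by sqrt(d_k) to keep softmax
# inputs in a numerically well-behaved range
attention_scores = (Q @ K.T) / math.sqrt(d_k)

# Lower triangular mask to prevent future token access
causal_mask = torch.tril(torch.ones(input_seq_length, input_seq_length))
attention_scores = attention_scores.masked_fill(causal_mask == 0, float("-inf"))

# Softmax over keys, then weight the values
attention_weights = F.softmax(attention_scores, dim=-1)
output = attention_weights @ V
```

Without the `1/sqrt(d_k)` factor, score magnitudes grow with the key dimension, pushing the softmax toward one-hot outputs and shrinking gradients, which is why the commit adds the scaling.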

0 commit comments
