
Commit 3aff420

finish (#1504)
1 parent 1fc454e commit 3aff420

File tree: 1 file changed (+3 −3 lines changed)

optimize-llm.md

Lines changed: 3 additions & 3 deletions
@@ -530,10 +530,10 @@ Positional encodings, encode the position of each token into a numerical present

The authors of the [*Attention Is All You Need*](https://arxiv.org/abs/1706.03762) paper introduced sinusoidal positional embeddings \\( \mathbf{P} = \mathbf{p}_1, \ldots, \mathbf{p}_N \\) .
where each vector \\( \mathbf{p}_i \\) is computed as a sinusoidal function of its position \\( i \\) .
-The positional encodings are then simply added to the input sequence vectors \\( \mathbf{\hat{X}} = \mathbf{\hat{x}}_1, \ldots, \mathbf{\hat{x}}_N \\) = \ \\) .\mathbf{x}\_1 + \\mathbf{p}\_1, \\ldots, \\mathbf{x}\_N + \\mathbf{x}\_N \ \\) . thereby cueing the model to better learn sentence order.
+The positional encodings are then simply added to the input sequence vectors \\( \mathbf{\hat{X}} = \mathbf{\hat{x}}_1, \ldots, \mathbf{\hat{x}}_N \\) = \\( .\mathbf{x}\_1 + \\mathbf{p}\_1, \\ldots, \\mathbf{x}\_N + \\mathbf{x}\_N \\) thereby cueing the model to better learn sentence order.

Instead of using fixed position embeddings, others (such as [Devlin et al.](https://arxiv.org/abs/1810.04805)) used learned positional encodings for which the positional embeddings
-\\( mathbf{P} \\) are learned during training.
+\\( \mathbf{P} \\) are learned during training.

Sinusoidal and learned position embeddings used to be the predominant methods to encode sentence order into LLMs, but a couple of problems related to these positional encodings were found:

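The sentence touched in this hunk describes sinusoidal positional encodings \\( \mathbf{p}_i \\) being added element-wise to the token embeddings \\( \mathbf{x}_i \\). As an illustrative sketch only (this code is not part of the commit or of `optimize-llm.md`, and the helper name `sinusoidal_positions` is made up for the example), the scheme looks roughly like this in PyTorch:

```python
import torch

def sinusoidal_positions(num_positions: int, dim: int) -> torch.Tensor:
    """Hypothetical helper: p_{i,2j} = sin(i / 10000^(2j/dim)), p_{i,2j+1} = cos(i / 10000^(2j/dim))."""
    positions = torch.arange(num_positions, dtype=torch.float32).unsqueeze(1)       # shape (N, 1)
    freqs = torch.exp(
        torch.arange(0, dim, 2, dtype=torch.float32) * (-torch.log(torch.tensor(10000.0)) / dim)
    )                                                                               # shape (dim/2,)
    p = torch.zeros(num_positions, dim)
    p[:, 0::2] = torch.sin(positions * freqs)
    p[:, 1::2] = torch.cos(positions * freqs)
    return p

seq_len, hidden_dim = 8, 16                            # toy sizes; dim must be even here
x = torch.randn(seq_len, hidden_dim)                   # token embeddings x_1, ..., x_N
x_hat = x + sinusoidal_positions(seq_len, hidden_dim)  # \hat{x}_i = x_i + p_i
```

The learned variant mentioned in the same hunk (Devlin et al.) would instead typically draw \\( \mathbf{p}_i \\) from a trainable embedding table indexed by position rather than compute it with fixed sinusoids.
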
@@ -689,7 +689,7 @@ There is however one catch. While the required peak memory for the \\( \mathbf{Q
Let's compute the number of float values that need to be stored in the key-value cache for the LLM `bigcode/octocoder` that we used before.
The number of float values amounts to:

-$$ 2 \times (\text{seq_len} - 1) \times \text{num_attn_heads} \times \text{attn_head_dim} \times \text{num_layers} $$
+$$ 2 \times \(\text{seq_len} - 1\) \times \text{num_attn_heads} \times \text{attn_head_dim} \times \text{num_layers} $$

Computing this for our LLM at a hypothetical input sequence length of 16000 gives:

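To make the formula in this hunk concrete, here is a rough back-of-the-envelope sketch (not taken from the post); the head count, head dimension, and layer count below are placeholder assumptions for illustration, not values read from the `bigcode/octocoder` config:

```python
# Worked example of the key-value cache formula above, with assumed (not verified)
# configuration values standing in for bigcode/octocoder.
seq_len = 16_000
num_attn_heads = 48   # assumption for illustration
attn_head_dim = 128   # assumption for illustration
num_layers = 40       # assumption for illustration

num_floats = 2 * (seq_len - 1) * num_attn_heads * attn_head_dim * num_layers
print(f"{num_floats:,} float values in the key-value cache")                  # 7,863,828,480 (~7.9 billion)
print(f"~{num_floats * 2 / 1024**3:.1f} GiB at 2 bytes per value (float16)")  # ~14.6 GiB
```

With these assumed numbers the cache alone would hold on the order of 8 billion values at a 16k-token sequence length, which is why the surrounding section treats the key-value cache as a first-order memory cost.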