
Commit 4cdbf96

Fix more llm 2 (#1498)
* finish
* finish
* finish
* finish
1 parent 36d616a commit 4cdbf96

File tree

1 file changed: +1, -1 lines changed


optimize-llm.md

Lines changed: 1 addition & 1 deletion
@@ -619,7 +619,7 @@ As we can see every time we increase the text input tokens by the just sampled t
With very few exceptions, LLMs are trained using the [causal language modeling objective](https://huggingface.co/docs/transformers/tasks/language_modeling#causal-language-modeling) and therefore mask the upper triangle of the attention score matrix - this is why in the two diagrams above the attention scores are left blank (*a.k.a.* have 0 probability). For a quick recap on causal language modeling you can refer to the [*Illustrated Self Attention blog*](https://jalammar.github.io/illustrated-gpt2/#part-2-illustrated-self-attention).

-As a consequence, tokens *never* depend on future tokens; more specifically, the \\( \mathbf{q}_i \\) vector is never put in relation with any key, value vectors \\( \mathbf{k}_j, \mathbf{v}_j \\) if \\( j > i \\). Instead, \\( \mathbf{q}_i \\) only attends to previous key-value vectors \\( \mathbf{k}_{m < i}, \mathbf{v}_{m < i} \text{ , for } m \in \{0, \ldots i - 1\}\\). In order to reduce unnecessary computation, one can therefore cache each layer's key-value vectors for all previous timesteps.
+As a consequence, tokens *never* depend on future tokens; more specifically, the \\( \mathbf{q}_i \\) vector is never put in relation with any key, value vectors \\( \mathbf{k}_j, \mathbf{v}_j \\) if \\( j > i \\). Instead, \\( \mathbf{q}_i \\) only attends to previous key-value vectors \\( \mathbf{k}_{m < i}, \mathbf{v}_{m < i} \text{ , for } m \in \{0, \ldots i - 1\} \\). In order to reduce unnecessary computation, one can therefore cache each layer's key-value vectors for all previous timesteps.

In the following, we will tell the LLM to make use of the key-value cache by retrieving and forwarding it for each forward pass.
In Transformers, we can retrieve the key-value cache by passing the `use_cache` flag to the `forward` call and can then pass it with the current token.
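
To make the mechanism described in the diff concrete, here is a minimal sketch of a greedy decoding loop that retrieves the key-value cache via `use_cache=True` and forwards it again through `past_key_values`; the checkpoint (`gpt2`), the prompt, and the five decoding steps are illustrative assumptions, not taken from the commit or the blog post.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Illustrative checkpoint and prompt (assumptions, not from the commit).
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

input_ids = tokenizer("The key-value cache", return_tensors="pt").input_ids
generated = input_ids
past_key_values = None  # will hold each layer's cached key-value vectors

with torch.no_grad():
    for _ in range(5):
        # `use_cache=True` makes the forward pass return the key-value cache,
        # which is passed back in the next step instead of being recomputed.
        outputs = model(input_ids, past_key_values=past_key_values, use_cache=True)
        past_key_values = outputs.past_key_values
        next_token_id = outputs.logits[:, -1:].argmax(dim=-1)
        generated = torch.cat([generated, next_token_id], dim=-1)
        # Once a cache exists, only the newly sampled token is forwarded.
        input_ids = next_token_id

print(tokenizer.batch_decode(generated)[0])
```

Because \\( \mathbf{q}_i \\) only ever attends to the cached \\( \mathbf{k}_{m < i}, \mathbf{v}_{m < i} \\), every step after the first forwards a single token, which is exactly the saving the cached key-value vectors are meant to provide.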

0 commit comments
