
Commit 4cdbf96

Fix more llm 2 (#1498)
* finish
* finish
* finish
* finish
1 parent 36d616a commit 4cdbf96

File tree

1 file changed: +1, -1 lines changed


optimize-llm.md

Lines changed: 1 addition & 1 deletion
@@ -619,7 +619,7 @@ As we can see every time we increase the text input tokens by the just sampled t
With very few exceptions, LLMs are trained using the [causal language modeling objective](https://huggingface.co/docs/transformers/tasks/language_modeling#causal-language-modeling) and therefore mask the upper triangle of the attention score matrix - this is why in the two diagrams above the attention scores are left blank (*a.k.a.* have 0 probability). For a quick recap on causal language modeling you can refer to the [*Illustrated Self Attention blog*](https://jalammar.github.io/illustrated-gpt2/#part-2-illustrated-self-attention).

-As a consequence, tokens *never* depend on future tokens; more specifically, the \\( \mathbf{q}_i \\) vector is never put in relation with any key, value vectors \\( \mathbf{k}_j, \mathbf{v}_j \\) if \\( j > i \\). Instead, \\( \mathbf{q}_i \\) only attends to previous key-value vectors \\( \mathbf{k}_{m < i}, \mathbf{v}_{m < i} \text{ , for } m \in \{0, \ldots i - 1\}\\). In order to reduce unnecessary computation, one can therefore cache each layer's key-value vectors for all previous timesteps.
+As a consequence, tokens *never* depend on future tokens; more specifically, the \\( \mathbf{q}_i \\) vector is never put in relation with any key, value vectors \\( \mathbf{k}_j, \mathbf{v}_j \\) if \\( j > i \\). Instead, \\( \mathbf{q}_i \\) only attends to previous key-value vectors \\( \mathbf{k}_{m < i}, \mathbf{v}_{m < i} \text{ , for } m \in \{0, \ldots i - 1\} \\). In order to reduce unnecessary computation, one can therefore cache each layer's key-value vectors for all previous timesteps.

In the following, we will tell the LLM to make use of the key-value cache by retrieving and forwarding it for each forward pass.
In Transformers, we can retrieve the key-value cache by passing the `use_cache` flag to the `forward` call and can then pass it with the current token.
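
To make the mechanism described in the diff concrete, here is a minimal sketch of a greedy decoding loop that retrieves the key-value cache via `use_cache=True` and forwards it again through `past_key_values`; the checkpoint (`gpt2`), the prompt, and the five decoding steps are illustrative assumptions, not taken from the commit or the blog post.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Illustrative checkpoint and prompt (assumptions, not from the commit).
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

input_ids = tokenizer("The key-value cache", return_tensors="pt").input_ids
generated = input_ids
past_key_values = None  # will hold each layer's cached key-value vectors

with torch.no_grad():
    for _ in range(5):
        # `use_cache=True` makes the forward pass return the key-value cache,
        # which is passed back in the next step instead of being recomputed.
        outputs = model(input_ids, past_key_values=past_key_values, use_cache=True)
        past_key_values = outputs.past_key_values
        next_token_id = outputs.logits[:, -1:].argmax(dim=-1)
        generated = torch.cat([generated, next_token_id], dim=-1)
        # Once a cache exists, only the newly sampled token is forwarded.
        input_ids = next_token_id

print(tokenizer.batch_decode(generated)[0])
```

Because \\( \mathbf{q}_i \\) only ever attends to the cached \\( \mathbf{k}_{m < i}, \mathbf{v}_{m < i} \\), every step after the first forwards a single token, which is exactly the saving the cached key-value vectors are meant to provide.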

0 commit comments
