
Commit 37b3a4e

Update kv-cache.md (#2894)
* Remove extraneous “of” in phrase “a 38% speedup”
* Drop trailing comma in token list example “[What, is, in]”
* Standardise spelling to “codebase” (was “code base” in one spot)
1 parent 0f8e1f1 commit 37b3a4e

File tree

1 file changed (+2, -2 lines)


kv-cache.md

Lines changed: 2 additions & 2 deletions
@@ -13,7 +13,7 @@ authors:
 ## TL;DR

-We have implemented KV Caching from scratch in our [nanoVLM](https://github.com/huggingface/nanoVLM) repository (a small code base to train your own Vision Language Model with pure PyTorch). This gave us a **38%** of speedup in generation. In this blog post we cover KV Caching and all our experiences while implementing it. The lessons learnt are general and can be applied to all autoregressive language model generations. Implementing from scratch on a small codebase is a great learning experience, come along for the ride!
+We have implemented KV Caching from scratch in our [nanoVLM](https://github.com/huggingface/nanoVLM) repository (a small codebase to train your own Vision Language Model with pure PyTorch). This gave us a **38%** speedup in generation. In this blog post we cover KV Caching and all our experiences while implementing it. The lessons learnt are general and can be applied to all autoregressive language model generations. Implementing from scratch on a small codebase is a great learning experience, come along for the ride!

 ![bar plot showcasing improvement in generation speed](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/blog/kv-cache/speed_improved.png)

@@ -25,7 +25,7 @@ Autoregressive language models generate text by sampling *one token at a time*.
 This step-by-step generation is inherently sequential:

-- To generate token \\( t_{i+1} \\), the model must consider the entire sequence from \\( t_0 \\) to \\( t_i \\). From the first instance in the above example \\( t_{i+1} \\) would be `the`, while all the previous tokens \\( t_0 \\) to \\( t_i \\) would be `[What, is, in,]`.
+- To generate token \\( t_{i+1} \\), the model must consider the entire sequence from \\( t_0 \\) to \\( t_i \\). From the first instance in the above example \\( t_{i+1} \\) would be `the`, while all the previous tokens \\( t_0 \\) to \\( t_i \\) would be `[What, is, in]`.
 - Although transformers are internally parallel, each new prediction requires a full forward pass through all transformer layers, which incurs a quadratic memory/compute in terms of the sequence length.

 This repetition also leads to computational **redundancy**. In this post, we explore **KV Caching**, an optimisation technique that mitigates this inefficiency.
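
The commit itself only touches prose, but since the diff describes why KV caching removes redundant recomputation, here is a minimal sketch of the idea in PyTorch (the blog's language). It is not part of this commit and not nanoVLM's implementation: the single-head layer, the weights `W_q`/`W_k`/`W_v`, and `d_model` are invented for illustration. Each decode step appends the new token's key and value to a cache and attends over it, rather than re-encoding the whole prefix.

```python
# Minimal, illustrative KV-cache sketch for a toy single-head attention decoder.
# Not nanoVLM's actual code: W_q, W_k, W_v and d_model are made-up placeholders.
import torch
import torch.nn.functional as F

d_model = 64
W_q = torch.randn(d_model, d_model)
W_k = torch.randn(d_model, d_model)
W_v = torch.randn(d_model, d_model)

def decode_step(x_new, k_cache, v_cache):
    """Attend the newest token over cached keys/values instead of recomputing
    K and V for the entire prefix."""
    q = x_new @ W_q                                      # (1, d_model) query for the new token
    k_cache = torch.cat([k_cache, x_new @ W_k], dim=0)   # append this token's key to the cache
    v_cache = torch.cat([v_cache, x_new @ W_v], dim=0)   # append this token's value to the cache
    attn = F.softmax(q @ k_cache.T / d_model ** 0.5, dim=-1)
    out = attn @ v_cache                                 # (1, d_model) attended output
    return out, k_cache, v_cache

# Carry the cache across decoding steps; each step only processes the newest
# token rather than re-running the full sequence through the layer.
k_cache = torch.empty(0, d_model)
v_cache = torch.empty(0, d_model)
for _ in range(4):
    x_new = torch.randn(1, d_model)                      # stand-in for the newest token's hidden state
    out, k_cache, v_cache = decode_step(x_new, k_cache, v_cache)
```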

0 commit comments
