README.md: 3 additions & 3 deletions
@@ -39,7 +39,7 @@ We aim to tackle the three pain points of popular acceleration techniques like speculative decoding
  </picture>
  <br>
  <div align="left" width="80%">
- <em>Medusa adds extra "heads" to LLMs to predict multiple future tokens simultaneously. When augmenting a model with Medusa, the original model stays untouched, these new heads are fine-tuned during training. During generation, these heads each produce multiple likely next words. These options are then combined and sorted out using a tree-based attention mechanism. Finally, a typical acceptance scheme is employed to pick the most plausible sequence for further decoding.</em>
+ <em>Medusa adds extra "heads" to LLMs to predict multiple future tokens simultaneously. When augmenting a model with Medusa, the original model stays untouched, and only the new heads are fine-tuned during training. During generation, these heads each produce multiple likely words for the corresponding position. These options are then combined and processed using a tree-based attention mechanism. Finally, a typical acceptance scheme is employed to pick the longest plausible prefix from the candidates for further decoding.</em>
  </div>
  <br>
  </div>
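
As context for the changed caption: the extra "heads" are small prediction modules attached to the base LLM's last hidden state, with head *k* trained to guess the token *k*+1 positions ahead while the backbone stays frozen. Below is a minimal PyTorch sketch of that idea, assuming a ResBlock-plus-linear head design similar to the paper's; the names (`MedusaHeads`, `num_heads`) and sizes are illustrative, not the repo's actual API.

```python
import torch
import torch.nn as nn

class ResBlock(nn.Module):
    """Small residual block applied before the vocabulary projection
    (the Medusa paper uses a similar linear + SiLU residual design)."""
    def __init__(self, hidden_size: int):
        super().__init__()
        self.linear = nn.Linear(hidden_size, hidden_size)
        self.act = nn.SiLU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.act(self.linear(x))

class MedusaHeads(nn.Module):
    """K extra decoding heads on top of a frozen base LM. Head k is
    trained to predict the token at offset k+1, using the same last
    hidden state the original LM head uses for offset 1."""
    def __init__(self, hidden_size: int, vocab_size: int, num_heads: int = 4):
        super().__init__()
        self.heads = nn.ModuleList([
            nn.Sequential(ResBlock(hidden_size), nn.Linear(hidden_size, vocab_size))
            for _ in range(num_heads)
        ])

    def forward(self, last_hidden: torch.Tensor) -> list[torch.Tensor]:
        # last_hidden: (batch, hidden_size); one logit vector per future position.
        return [head(last_hidden) for head in self.heads]
```

Because only these heads receive gradients during fine-tuning, the trained parameter count is a tiny fraction of the backbone's, which is what "the original model stays untouched" refers to.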
@@ -48,7 +48,7 @@ In a nutshell, we solve the challenges of speculative decoding with the following

- Instead of introducing a new model, we train multiple decoding heads on the *same* model.
- The training is parameter-efficient so that even the GPU-poor can do it. And since there is no additional model, there is no need to adjust the distributed computing setup.
- - Relaxing the requirement of matching the distribution of the original model makes the generation with random sampling even faster than greedy decoding.
+ - Relaxing the requirement of matching the distribution of the original model makes the non-greedy generation even faster than greedy decoding.
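
The relaxed ("typical") acceptance mentioned in the caption and in the last bullet can be sketched as a hard-threshold rule: a candidate token is accepted when its probability under the original model exceeds min(ε, δ·exp(−H)), where H is the entropy of the model's next-token distribution, and the longest fully accepted prefix is kept. A minimal sketch of that rule follows; the `epsilon`/`delta` values are illustrative, not the repository's defaults.

```python
import torch

def typical_acceptance(candidate_ids: torch.Tensor,  # (n,) proposed tokens
                       base_logits: torch.Tensor,     # (n, vocab) base-model logits
                       epsilon: float = 0.3,
                       delta: float = 0.09) -> torch.Tensor:
    """Keep the longest prefix of candidates the base model finds 'typical':
    accept token x if p(x) > min(epsilon, delta * exp(-H(p)))."""
    probs = torch.softmax(base_logits, dim=-1)
    entropy = -(probs * probs.clamp_min(1e-9).log()).sum(dim=-1)  # H(p) per position
    threshold = torch.minimum(torch.full_like(entropy, epsilon),
                              delta * torch.exp(-entropy))
    token_probs = probs.gather(-1, candidate_ids.unsqueeze(-1)).squeeze(-1)
    accepted = token_probs > threshold
    # cumprod stops at the first rejection, giving the longest accepted prefix.
    n_accepted = int(accepted.long().cumprod(dim=0).sum().item())
    return candidate_ids[:n_accepted]
```

In the paper's scheme the first candidate token is always accepted, guaranteeing at least one token of progress per step; since candidates no longer have to match the original model's sampling distribution exactly, this cheap check is what lets non-greedy generation outpace plain greedy decoding.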