
Commit 10e7387

modify readme description
typo

1 parent 1438d1b

File tree

1 file changed

+3
-3
lines changed


README.md

Lines changed: 3 additions & 3 deletions
@@ -39,7 +39,7 @@ We aim to tackle the three pain points of popular acceleration techniques like s
 </picture>
 <br>
 <div align="left" width="80%">
-<em>Medusa adds extra "heads" to LLMs to predict multiple future tokens simultaneously. When augmenting a model with Medusa, the original model stays untouched, these new heads are fine-tuned during training. During generation, these heads each produce multiple likely next words. These options are then combined and sorted out using a tree-based attention mechanism. Finally, a typical acceptance scheme is employed to pick the most plausible sequence for further decoding.</em>
+<em>Medusa adds extra "heads" to LLMs to predict multiple future tokens simultaneously. When augmenting a model with Medusa, the original model stays untouched, and only the new heads are fine-tuned during training. During generation, these heads each produce multiple likely words for the corresponding position. These options are then combined and processed using a tree-based attention mechanism. Finally, a typical acceptance scheme is employed to pick the longest plausible prefix from the candidates for further decoding.</em>
 </div>
 <br>
 </div>
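The caption above describes the generation loop at a high level: heads guess several future tokens, the guesses are verified, and the longest prefix the base model agrees with is accepted. Below is a minimal toy sketch of that propose-then-verify idea in plain Python. The functions `base_next_token`, `head_proposals`, and `medusa_step` are hypothetical stand-ins (not the repository's API), and real Medusa verifies all candidates in a single batched forward pass with tree attention rather than the sequential loop shown here.

```python
def base_next_token(context):
    # Stand-in for the base LLM's greedy next token: a fixed toy rule.
    return (context[-1] + 1) % 100

def head_proposals(context, num_heads=3, top_k=2):
    # Stand-in for Medusa heads: head i proposes top_k candidate tokens
    # for position len(context) + i. Here each head happens to include
    # the "correct" token plus one alternative, to exercise acceptance.
    proposals = []
    tok = context[-1]
    for _ in range(num_heads):
        tok = (tok + 1) % 100
        proposals.append([tok, (tok + 50) % 100])
    return proposals

def medusa_step(context):
    """One decoding step: propose with the heads, verify against the
    base model, and accept the longest agreeing prefix of candidates."""
    proposals = head_proposals(context)
    accepted = []
    ctx = list(context)
    for candidates in proposals:
        target = base_next_token(ctx)   # what the base model would emit
        if target in candidates:        # a head guessed it: accept
            accepted.append(target)
            ctx.append(target)
        else:
            break                       # first mismatch ends acceptance
    # The verification pass always yields at least one base-model token.
    accepted.append(base_next_token(ctx))
    return accepted
```

With these toy stand-ins, `medusa_step([0])` accepts all three head guesses plus the base model's bonus token, i.e. four tokens from one step instead of one.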
@@ -48,7 +48,7 @@ In a nutshell, we solve the challenges of speculative decoding with the followin
 
 - Instead of introducing a new model, we train multiple decoding heads on the *same* model.
 - The training is parameter-efficient so that even GPU poor can do it. And since there is no additional model, there is no need to adjust the distributed computing setup.
-- Relaxing the requirement of matching the distribution of the original model makes the generation with random sampling even faster than greedy decoding.
+- Relaxing the requirement of matching the distribution of the original model makes non-greedy generation even faster than greedy decoding.
 <p align="center">
 <picture>
 <img src="assets/size_speedup.png" width="45%">
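The "parameter-efficient" claim in the bullet above can be made concrete with a back-of-the-envelope count. The sketch below assumes each head is roughly one hidden-to-hidden residual layer plus a hidden-to-vocabulary projection; the sizes (7B base, hidden size 4096, vocabulary 32000, 3 heads) are illustrative assumptions, not values taken from the repository.

```python
def head_params(hidden, vocab):
    # Assumed head shape: one hidden->hidden layer plus a
    # hidden->vocab projection (biases ignored for simplicity).
    return hidden * hidden + hidden * vocab

base_params = 7_000_000_000              # frozen base model, e.g. ~7B
hidden, vocab, num_heads = 4096, 32000, 3  # illustrative sizes

trainable = num_heads * head_params(hidden, vocab)
fraction = trainable / base_params       # share of params actually trained
```

Under these assumptions only a few percent of the total parameters are trained, which is why the base model can stay frozen and the usual distributed setup needs no changes.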
@@ -88,7 +88,7 @@ pip install -e .
 ### Model Weights
 | Size | Chat Command | Hugging Face Repo |
 | ---- | --------------------------------------------- | --------------------------------------------------------------------- |
-| 7B | `python -m medusa.inference.cli --model FasterDecoding/medusa-vicuna-7b-v1.3` | [FasterDecoding/medusa-vicuna-33b-v1.3](https://huggingface.co/FasterDecoding/medusa-vicuna-7b-v1.3) |
+| 7B | `python -m medusa.inference.cli --model FasterDecoding/medusa-vicuna-7b-v1.3` | [FasterDecoding/medusa-vicuna-7b-v1.3](https://huggingface.co/FasterDecoding/medusa-vicuna-7b-v1.3) |
 | 13B | `python -m medusa.inference.cli --model FasterDecoding/medusa-vicuna-13b-v1.3` | [FasterDecoding/medusa-vicuna-13b-v1.3](https://huggingface.co/FasterDecoding/medusa-vicuna-13b-v1.3) |
 | 33B | `python -m medusa.inference.cli --model FasterDecoding/medusa-vicuna-33b-v1.3` | [FasterDecoding/medusa-vicuna-33b-v1.3](https://huggingface.co/FasterDecoding/medusa-vicuna-33b-v1.3) |
