We aim to tackle the three pain points of popular acceleration techniques like speculative decoding:

- Requirement of a good draft model.
- System complexity.
- Inefficiency when using sampling-based generation.

<div align="center">
</picture>
<br>
<divalign="left"width="80%">
<em>Medusa adds extra "heads" to LLMs to predict multiple future tokens simultaneously. When augmenting a model with Medusa, the original model stays untouched, and only the new heads are fine-tuned during training. During generation, these heads each produce multiple likely words for the corresponding position. These options are then combined and processed using a tree-based attention mechanism. Finally, a typical acceptance scheme is employed to pick the longest plausible prefix from the candidates for further decoding.</em>
</div>
<br>
</div>
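
To make the mechanism in the caption concrete, here is a minimal PyTorch sketch of the extra decoding heads. The residual-block design follows the description above, but the layer sizes, activation, number of heads, and top-k width are illustrative assumptions rather than this repo's exact implementation.

```python
import torch
import torch.nn as nn

class MedusaHead(nn.Module):
    """One extra decoding head: a small residual block followed by an
    lm-head-style projection onto the vocabulary (sizes are illustrative)."""

    def __init__(self, hidden_size: int, vocab_size: int):
        super().__init__()
        self.proj = nn.Linear(hidden_size, hidden_size)
        self.act = nn.SiLU()
        # Only these parameters are trained; the base model stays frozen.
        self.lm_head = nn.Linear(hidden_size, vocab_size, bias=False)

    def forward(self, hidden: torch.Tensor) -> torch.Tensor:
        return self.lm_head(hidden + self.act(self.proj(hidden)))

# Hypothetical sizes for illustration (roughly a 7B-class model).
hidden_size, vocab_size, num_heads, topk = 4096, 32000, 4, 5
heads = nn.ModuleList(MedusaHead(hidden_size, vocab_size) for _ in range(num_heads))

# `last_hidden` stands in for the base LLM's final hidden state at the
# current position; head k proposes candidates for position t + k + 1.
last_hidden = torch.randn(1, hidden_size)
candidates = [head(last_hidden).topk(topk).indices for head in heads]
# The combinations of these per-position candidates form the tree that the
# tree-based attention mechanism scores in a single forward pass.
```
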
We aim to solve the challenges associated with speculative decoding by implementing the following ideas:
- Instead of introducing a new model, we train multiple decoding heads on the *same* model.
- The training is parameter-efficient so that even the "GPU-Poor" can do it. And since there is no additional model, there is no need to adjust the distributed computing setup.
- Relaxing the requirement of matching the distribution of the original model makes the non-greedy generation even faster than greedy decoding (a sketch of this typical acceptance test follows the list).
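
The typical acceptance scheme referenced above replaces speculative decoding's exact distribution matching with a plausibility test: a candidate token is kept if the original model gives it enough probability, with the bar lowered when the model's distribution is high-entropy (many continuations are plausible). A minimal sketch; the threshold form follows the Medusa paper, while the `epsilon` and `delta` values here are illustrative, not this repo's defaults:

```python
import math
import torch

def typical_accept(probs: torch.Tensor, token: int,
                   epsilon: float = 0.09, delta: float = 0.3) -> bool:
    """Accept `token` if its probability under the ORIGINAL model clears
    min(epsilon, delta * exp(-H)), where H is the entropy of the model's
    next-token distribution. epsilon/delta here are illustrative values."""
    entropy = -(probs * probs.clamp_min(1e-10).log()).sum().item()
    return probs[token].item() > min(epsilon, delta * math.exp(-entropy))

def longest_accepted_prefix(step_probs: list, candidate: list) -> list:
    """Keep the longest prefix of a candidate continuation whose tokens
    all pass the test; decoding resumes after the accepted prefix."""
    accepted = []
    for probs, tok in zip(step_probs, candidate):
        if not typical_accept(probs, tok):
            break
        accepted.append(tok)
    return accepted
```

In the paper's setup the first candidate token is always accepted, so every decoding step makes progress even when all speculative tokens are rejected.
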
We currently support single-GPU inference with a batch size of 1, which is the most common setup for local model hosting. We are actively working to extend Medusa's capabilities by integrating it into other inference frameworks; please don't hesitate to reach out if you are interested in contributing to this effort.

You can use the following command for launching a CLI interface:
```bash
CUDA_VISIBLE_DEVICES=0 python -m medusa.inference.cli --model [path of medusa model]
```
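
If you would rather script against the model than use the CLI, something along these lines should work. The import path, loader, and method names below are assumptions inferred from the CLI module path (`medusa.inference.cli`); check the repo's `medusa/model/` package for the exact API before relying on them.

```python
# Hypothetical programmatic equivalent of the CLI call above -- the names
# marked "assumed" are our guesses, not a documented interface.
import torch
from medusa.model.medusa_model import MedusaModel  # assumed import path

model = MedusaModel.from_pretrained(
    "[path of medusa model]",   # same value you would pass to --model
    torch_dtype=torch.float16,
)
tokenizer = model.get_tokenizer()  # assumed convenience accessor

input_ids = tokenizer("Hello!", return_tensors="pt").input_ids.to(model.device)
for chunk in model.medusa_generate(input_ids):  # assumed streaming generator
    print(chunk, end="", flush=True)
```
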
We also provide some illustrative notebooks in `notebooks/` to help you understand the codebase.

We welcome community contributions to Medusa. If you have an idea for how to improve it, please open an issue to discuss it with us. When submitting a pull request, please ensure that your changes are well-tested. Please split each major change into a separate pull request. We also have a [Roadmap](ROADMAP.md) summarizing our future plans for Medusa. Don't hesitate to reach out if you are interested in contributing to any of the items on the roadmap.
## Acknowledgements
This codebase is influenced by remarkable projects from the LLM community, including [FastChat](https://github.com/lm-sys/FastChat), [TinyChat](https://github.com/mit-han-lab/llm-awq/tree/main/), [vllm](https://github.com/vllm-project/vllm) and many others.