<br>
## Update
- Supports flashinfer and paged attention to further accelerate inference.
- Supports SnapKV-based drafting for higher speculation quality.
- Supports Qwen2.5-[7B,14B,32B], Yi-1.5-[6B,34B] and Mistral-7B-v0.1.
- Please make sure your PyTorch version is greater than 2.5 so that new features like custom all-reduce can be used.
Currently, we support [Llama-2-7b](https://huggingface.co/meta-llama/Llama-2-7b-hf) and its long context variant [Llama-2-7b-32k](https://huggingface.co/togethercomputer/LLaMA-2-7B-32K), [Llama-2-13b](https://huggingface.co/meta-llama/Llama-2-13b-hf), [Llama-2-70b](https://huggingface.co/meta-llama/Llama-2-70b-hf), [Llama-3-8b](https://huggingface.co/meta-llama/Meta-Llama-3-8B), [Llama-3-70b](https://huggingface.co/meta-llama/Meta-Llama-3-70B), [llama-68m](https://huggingface.co/JackFram/llama-68m), [TinyLlama](https://huggingface.co/TinyLlama/TinyLlama_v1.1), [Llama-3.1-8b](https://huggingface.co/meta-llama/Meta-Llama-3.1-8B), [Llama-3.1-70b](https://huggingface.co/meta-llama/Llama-3.1-70B), [Llama-3.2-1b](https://huggingface.co/meta-llama/Llama-3.2-1B), Qwen2.5-[7B,14B,32B], Yi-1.5-[6B,34B] and Mistral-7B-v0.1.
We can first download the checkpoints we need through `download.py`. `--repo_id` should be set to the repository ID to download from, `--hf_token` to your HuggingFace API token, and `--out_dir` to the directory where you want to save the checkpoint.
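
A minimal sketch of such a download (the token placeholder and the `checkpoints` output directory are illustrative assumptions, not prescribed by the repo):

```bash
# Download Llama-2-7b into a local checkpoint directory.
# <your_hf_token> is a placeholder for your HuggingFace API token;
# checkpoints/ is an assumed, illustrative output directory.
python download.py \
    --repo_id meta-llama/Llama-2-7b-hf \
    --hf_token <your_hf_token> \
    --out_dir checkpoints
```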
For the standalone draft experiment, we use `--target` and `--model` to set the target and draft checkpoints. `--model_name` should be set to the repo ID of the target model, which will be used to load the corresponding tokenizer. `--rank_group` should be set to the GPU IDs we want to use for tensor parallelism on the target model, and `--draft_rank_group` to the GPU IDs we want to use for TP on the draft model. `--draft_budget` should be set to the KV budget for the draft model. Set `--draft_budget` of `StreamingLLM/longspec_benchmark.py` to -1 to disable KV compression of the draft model (use the full KV cache, i.e., the original speculative decoding).
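
As a sketch, assuming the checkpoints were saved under `checkpoints/` by `download.py` and that the rank flags take space-separated lists of GPU IDs (the exact checkpoint file layout and argument format may differ):

```bash
# Illustrative only: checkpoint paths, file names, and GPU id format are assumptions.
# Target Llama-2-7b with TP over GPUs 0-3; draft llama-68m on GPU 0;
# draft KV budget of 512 (set --draft_budget to -1 for full KV, i.e., vanilla speculative decoding).
python StreamingLLM/longspec_benchmark.py \
    --target checkpoints/meta-llama/Llama-2-7b-hf/model.pth \
    --model checkpoints/JackFram/llama-68m/model.pth \
    --model_name meta-llama/Llama-2-7b-hf \
    --rank_group 0 1 2 3 \
    --draft_rank_group 0 \
    --draft_budget 512
```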