
Commit e4d18f8

Update

1 parent 299da55 commit e4d18f8

File tree

2 files changed: +16 -7 lines changed

README.md

Lines changed: 8 additions & 6 deletions
@@ -25,7 +25,10 @@
 <br>

 ## Update
-Happy to share the latest update of MagicDec. Now MagicDec integerates flashinfer and paged attention to further accelerate inference. We add support of SnapKV-based drafting for higher speculation quality. Please make sure PyTorch version greater than 2.5 to use the new features like custom all-reduce can be used.
+- Supports flashinfer and paged attention to further accelerate inference.
+- Supports SnapKV-based drafting for higher speculation quality.
+- Supports Qwen2.5-[7B,14B,32B], Yi-1.5-[6B,34B], and Mistral-7B-v0.1.
+- Please make sure your PyTorch version is greater than 2.5 so that new features like custom all-reduce can be used.

 ## Installation
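
As a quick sanity check for these new requirements, one can verify the environment before running anything (a minimal sketch; it only assumes `torch` and `flashinfer` are pip-installed as described in the Installation section):

```bash
# Custom all-reduce requires PyTorch newer than 2.5
python -c "import torch; print(torch.__version__)"
# Confirm flashinfer imports cleanly after installation
python -c "import flashinfer; print('flashinfer OK')"
```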

@@ -39,7 +42,7 @@ pip install flashinfer -i https://flashinfer.ai/whl/cu121/torch2.4/
 ```

 ### Prepare Checkpoints
-Currently, we support [Llama-2-7b](https://huggingface.co/meta-llama/Llama-2-7b-hf) and its long context variant [Llama-2-7b-32k](https://huggingface.co/togethercomputer/LLaMA-2-7B-32K), [Llama-2-13b](https://huggingface.co/meta-llama/Llama-2-13b-hf), [Llama-2-70b](https://huggingface.co/meta-llama/Llama-2-70b-hf), [Llama-3-8b](https://huggingface.co/meta-llama/Meta-Llama-3-8B), [Llama-3-70b](https://huggingface.co/meta-llama/Meta-Llama-3-70B), [llama-68m](https://huggingface.co/JackFram/llama-68m), [TinyLlama](https://huggingface.co/TinyLlama/TinyLlama_v1.1), [Llama-3.1-8b](https://huggingface.co/meta-llama/Meta-Llama-3.1-8B), [Llama-3.1-70b](https://huggingface.co/meta-llama/Llama-3.1-70B) and [Llama-3.2-1b](https://huggingface.co/meta-llama/Llama-3.2-1B).
+Currently, we support [Llama-2-7b](https://huggingface.co/meta-llama/Llama-2-7b-hf) and its long context variant [Llama-2-7b-32k](https://huggingface.co/togethercomputer/LLaMA-2-7B-32K), [Llama-2-13b](https://huggingface.co/meta-llama/Llama-2-13b-hf), [Llama-2-70b](https://huggingface.co/meta-llama/Llama-2-70b-hf), [Llama-3-8b](https://huggingface.co/meta-llama/Meta-Llama-3-8B), [Llama-3-70b](https://huggingface.co/meta-llama/Meta-Llama-3-70B), [llama-68m](https://huggingface.co/JackFram/llama-68m), [TinyLlama](https://huggingface.co/TinyLlama/TinyLlama_v1.1), [Llama-3.1-8b](https://huggingface.co/meta-llama/Meta-Llama-3.1-8B), [Llama-3.1-70b](https://huggingface.co/meta-llama/Llama-3.1-70B), [Llama-3.2-1b](https://huggingface.co/meta-llama/Llama-3.2-1B), Qwen2.5-[7B,14B,32B], Yi-1.5-[6B,34B], and Mistral-7B-v0.1.

 We can first download the checkpoints we need through `download.py`. The `--repo_id` should be set to the repository ID to download from. The `--hf_token` should be your HuggingFace API token. The `--out_dir` should be the directory where you want to save the checkpoint.
 ```bash
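
For reference, a concrete `download.py` invocation combining the flags just described (the repo ID and output path here are illustrative; `$HF_TOKEN` is a placeholder for your own token):

```bash
python download.py --repo_id meta-llama/Meta-Llama-3.1-8B --hf_token $HF_TOKEN --out_dir checkpoints/meta-llama/Meta-Llama-3.1-8B
```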
@@ -59,7 +62,7 @@ ENABLE_INTRA_NODE_COMM=1 torchrun --standalone --nproc_per_node=8 tests/baseline
 ```

 ### Standalone Draft
-For standalone draft experiment, we use `--target` and `--model` to set the target and draft checkpoint. `--model_name` should be set to the repo id of target model, which will used to load the corresponding tokenizer. `--rank_group` should be set to the GPU id we want to do tensor parallelism for the target model, and `--draft_rank_group` should be set to the GPU id we want to do TP for the draft model. `--draft_budget` should be set to the KV budget for the draft model. Set `--draft_budget` to -1 to disable KV compression of draft (Use full KV).
+For the standalone draft experiment, we use `--target` and `--model` to set the target and draft checkpoints. `--model_name` should be set to the repo ID of the target model, which will be used to load the corresponding tokenizer. `--rank_group` should be set to the GPU IDs used for tensor parallelism on the target model, and `--draft_rank_group` to the GPU IDs used for TP on the draft model. `--draft_budget` sets the KV budget for the draft model; set `--draft_budget` in `StreamingLLM/longspec_benchmark.py` to -1 to disable KV compression for the draft model (use the full KV cache, i.e., the original speculative decoding).

 #### SnapKV-based Drafting
 ```bash
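
For the standalone-draft flags described above, run.sh in this commit provides a concrete instance, pairing a Llama-3.2-1B draft with a Llama-3.1-8B target on 8 GPUs:

```bash
torchrun --standalone --nproc_per_node=8 tests/StreamingLLM/longspec_benchmark.py \
  --model checkpoints/meta-llama/Llama-3.2-1B/model.pth \
  --target checkpoints/meta-llama/Meta-Llama-3.1-8B/model.pth \
  --compile --rank_group 0 1 2 3 4 5 6 7 --draft_rank_group 0 1 2 3 4 5 6 7 \
  --B 32 --prefix_len 32769 --max_len 32896 --benchmark --gamma 2 --draft_budget 513
```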
@@ -81,7 +84,7 @@ ENABLE_INTRA_NODE_COMM=1 torchrun --standalone --nproc_per_node=8 tests/SnapKV/s

 #### StreamingLLM-based Drafting
 ```bash
-ENABLE_INTRA_NODE_COMM=1 torchrun --standalone --nproc_per_node=8 tests/SnapKV/selfspec_benchmark.py --model checkpoints/meta-llama/Meta-Llama-3.1-8B/model.pth --model_name meta-llama/Meta-Llama-3.1-8B --rank_group 0 1 2 3 4 5 6 7 --gamma 3 --B 64 --prefix_len 16032 --gen_len 16128 --draft_budget 257 --benchmark --compile
+ENABLE_INTRA_NODE_COMM=1 torchrun --standalone --nproc_per_node=8 tests/StreamingLLM/selfspec_benchmark.py --model checkpoints/meta-llama/Meta-Llama-3.1-8B/model.pth --model_name meta-llama/Meta-Llama-3.1-8B --rank_group 0 1 2 3 4 5 6 7 --gamma 3 --B 64 --prefix_len 16032 --gen_len 16128 --draft_budget 257 --benchmark --compile
 ```

 ## Citation
@@ -94,5 +97,4 @@ If you find MagicDec useful or relevant to your project and research, please kin
 journal={arXiv preprint arXiv:2408.11049},
 year={2024}
 }
-```
-
+```

run.sh

Lines changed: 8 additions & 1 deletion
@@ -1,4 +1,4 @@
-# ENABLE_INTRA_NODE_COMM=1 torchrun --standalone --nproc_per_node=8 tests/baseline_benchmark.py --compile --rank_group 0 1 2 3 4 5 6 7 --B 1 --prefix_len 257 --max_len 384
+ENABLE_INTRA_NODE_COMM=1 torchrun --standalone --nproc_per_node=8 tests/baseline_benchmark.py --compile --rank_group 0 1 2 3 4 5 6 7 --B 1 --prefix_len 257 --max_len 384
 # ENABLE_INTRA_NODE_COMM=1 torchrun --standalone --nproc_per_node=8 tests/baseline_benchmark.py --compile --rank_group 0 1 2 3 4 5 6 7 --B 16 --prefix_len 257 --max_len 384
 # ENABLE_INTRA_NODE_COMM=1 torchrun --standalone --nproc_per_node=8 tests/baseline_benchmark.py --compile --rank_group 0 1 2 3 4 5 6 7 --B 32 --prefix_len 257 --max_len 384
 # ENABLE_INTRA_NODE_COMM=1 torchrun --standalone --nproc_per_node=8 tests/baseline_benchmark.py --compile --rank_group 0 1 2 3 4 5 6 7 --B 64 --prefix_len 257 --max_len 384
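
The commented-out variants above only change `--B`; an equivalent sweep can be written as a loop (an illustrative restatement, not part of the commit):

```bash
# Run the baseline benchmark across batch sizes 1, 16, 32, 64 with identical flags
for B in 1 16 32 64; do
  ENABLE_INTRA_NODE_COMM=1 torchrun --standalone --nproc_per_node=8 tests/baseline_benchmark.py \
    --compile --rank_group 0 1 2 3 4 5 6 7 --B "$B" --prefix_len 257 --max_len 384
done
```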
@@ -14,6 +14,13 @@
 # ENABLE_INTRA_NODE_COMM=1 torchrun --standalone --nproc_per_node=8 tests/baseline_benchmark.py --compile --rank_group 0 1 2 3 4 5 6 7 --B 256 --prefix_len 8193 --max_len 8320
 # ENABLE_INTRA_NODE_COMM=1 torchrun --standalone --nproc_per_node=8 tests/baseline_benchmark.py --compile --rank_group 0 1 2 3 4 5 6 7 --B 480 --prefix_len 8193 --max_len 8320

+torchrun --standalone --nproc_per_node=8 tests/baseline_benchmark.py --compile --rank_group 0 1 2 3 4 5 6 7 --B 32 --prefix_len 32769 --max_len 32896 --model checkpoints/meta-llama/Meta-Llama-3.1-8B/model.pth
+torchrun --standalone --nproc_per_node=8 tests/StreamingLLM/longspec_benchmark.py --compile --rank_group 0 1 2 3 4 5 6 7 --draft_rank_group 0 1 2 3 4 5 6 7 --B 16 --prefix_len 32769 --max_len 32896 --benchmark --gamma 2 --draft_budget 513
+python download.py --repo_id meta-llama/Llama-3.2-1B --out_dir checkpoints/meta-llama/Llama-3.2-1B
+python convert_hf_checkpoint.py --checkpoint_dir checkpoints/meta-llama/Llama-3.2-1B
+torchrun --standalone --nproc_per_node=8 tests/StreamingLLM/longspec_benchmark.py --model checkpoints/meta-llama/Llama-3.2-1B/model.pth --target checkpoints/meta-llama/Meta-Llama-3.1-8B/model.pth --compile --rank_group 0 1 2 3 4 5 6 7 --draft_rank_group 0 1 2 3 4 5 6 7 --B 32 --prefix_len 32769 --max_len 32896 --benchmark --gamma 2 --draft_budget 513
+torchrun --standalone --nproc_per_node=8 tests/SnapKV/selfspec_benchmark.py --compile --rank_group 0 1 2 3 4 5 6 7 --B 32 --prefix_len 32032 --max_len 32128 --benchmark --gamma 6 --draft_budget 2049 --model checkpoints/meta-llama/Meta-Llama-3.1-8B/model.pth
+
 # ENABLE_INTRA_NODE_COMM=1 torchrun --standalone --nproc_per_node=8 tests/SnapKV/longspec_benchmark.py --compile --rank_group 0 1 2 3 4 5 6 7 --draft_rank_group 0 1 2 3 4 5 6 7 --B 128 --prefix_len 257 --max_len 384 --benchmark --gamma 2
 # ENABLE_INTRA_NODE_COMM=1 torchrun --standalone --nproc_per_node=8 tests/SnapKV/longspec_benchmark.py --compile --rank_group 0 1 2 3 4 5 6 7 --draft_rank_group 0 1 2 3 4 5 6 7 --B 256 --prefix_len 257 --max_len 384 --benchmark --gamma 1
 # ENABLE_INTRA_NODE_COMM=1 torchrun --standalone --nproc_per_node=8 tests/SnapKV/longspec_benchmark.py --compile --rank_group 0 1 2 3 4 5 6 7 --draft_rank_group 0 1 2 3 4 5 6 7 --B 256 --prefix_len 257 --max_len 384 --benchmark --gamma 2
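
Taken together, the newly uncommented lines sketch the intended workflow; here are the same commands, grouped and commented for readability (paths are the defaults used above):

```bash
# Prepare the Llama-3.2-1B draft checkpoint used by the standalone-draft run above
python download.py --repo_id meta-llama/Llama-3.2-1B --out_dir checkpoints/meta-llama/Llama-3.2-1B
python convert_hf_checkpoint.py --checkpoint_dir checkpoints/meta-llama/Llama-3.2-1B

# SnapKV self-speculation: the target model drafts for itself with a compressed KV cache,
# so only the Llama-3.1-8B checkpoint is needed
torchrun --standalone --nproc_per_node=8 tests/SnapKV/selfspec_benchmark.py --compile \
  --rank_group 0 1 2 3 4 5 6 7 --B 32 --prefix_len 32032 --max_len 32128 \
  --benchmark --gamma 6 --draft_budget 2049 \
  --model checkpoints/meta-llama/Meta-Llama-3.1-8B/model.pth
```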
