
Commit 3e80fd5

layout
1 parent 93778c8 commit 3e80fd5

2 files changed (+87, -6 lines)

examples/README.md

Lines changed: 82 additions & 0 deletions
@@ -0,0 +1,82 @@
# Examples of UMbreLLa

### 1 Benchmarking the decoding/verification speed

```bash
python bench.py --model hugging-quants/Meta-Llama-3.1-70B-Instruct-AWQ-INT4 --offload --D 1 --T 20
```

<h4>Key Configuration Options</h4>
<ul>
<li><strong>model</strong>: Specifies the target LLM to serve, e.g., <code>"hugging-quants/Meta-Llama-3.1-70B-Instruct-AWQ-INT4"</code>.</li>
<li><strong>offload</strong>: Enables offloading of the target model to host memory.</li>
<li><strong>cuda_graph</strong>: Toggles CUDA graph optimization for the draft model (currently unsupported for AWQ models).</li>
<li><strong>M</strong>: The maximum combined length of the input and output, in tokens.</li>
<li><strong>D</strong>: The number of tokens per decoding step (used to test verification).</li>
<li><strong>T</strong>: The number of times the benchmark is repeated.</li>
</ul>
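
For instance, a run that also caps the combined sequence length and averages over more repetitions could look like the following; the flag values are illustrative, not tuned recommendations:

```bash
# Illustrative values: cap input+output at 2048 tokens, verify 4 tokens
# per step, and repeat the measurement 50 times.
python bench.py \
    --model hugging-quants/Meta-Llama-3.1-70B-Instruct-AWQ-INT4 \
    --offload \
    --M 2048 --D 4 --T 50
```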

### 2 Benchmarking auto-regressive generation

```bash
python generate.py --model hugging-quants/Meta-Llama-3.1-70B-Instruct-AWQ-INT4 --offload
```

<h4>Key Configuration Options</h4>
<ul>
<li><strong>model</strong>: Specifies the target LLM to serve, e.g., <code>"hugging-quants/Meta-Llama-3.1-70B-Instruct-AWQ-INT4"</code>.</li>
<li><strong>offload</strong>: Enables offloading of the target model to host memory.</li>
<li><strong>cuda_graph</strong>: Toggles CUDA graph optimization for the draft model (currently unsupported for AWQ models).</li>
<li><strong>G</strong>: The maximum number of generated tokens (must be smaller than 2000).</li>
<li><strong>template</strong>: Defines the structure for input prompts. Supported values include:
<ul>
<li><code>"llama3-code"</code>: Optimized for code-related tasks.</li>
<li><code>"meta-llama3"</code>: General-purpose instruction-following template.</li>
</ul>
</li>
</ul>
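
For example, a run that bounds generation at 512 tokens and uses the code-oriented template; the values are illustrative:

```bash
# Illustrative values: generate at most 512 tokens and format the prompt
# with the code-oriented template.
python generate.py \
    --model hugging-quants/Meta-Llama-3.1-70B-Instruct-AWQ-INT4 \
    --offload \
    --G 512 --template llama3-code
```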

### 3 Speculative Decoding Example

```bash
python spec_generate.py --model hugging-quants/Meta-Llama-3.1-70B-Instruct-AWQ-INT4 --offload
```

<h4>Key Configuration Options</h4>
<ul>
<li><strong>model</strong>: Specifies the target LLM to serve, e.g., <code>"hugging-quants/Meta-Llama-3.1-70B-Instruct-AWQ-INT4"</code>.</li>
<li><strong>draft_model</strong>: The lightweight draft model, e.g., <code>"meta-llama/Llama-3.2-1B-Instruct"</code>.</li>
<li><strong>offload</strong>: Enables offloading of the target model to host memory.</li>
<li><strong>cuda_graph</strong>: Toggles CUDA graph optimization for the draft model (currently unsupported for AWQ models).</li>
<li><strong>G</strong>: The maximum number of generated tokens (must be smaller than 2000).</li>
<li><strong>template</strong>: Defines the structure for input prompts. Supported values include:
<ul>
<li><code>"llama3-code"</code>: Optimized for code-related tasks.</li>
<li><code>"meta-llama3"</code>: General-purpose instruction-following template.</li>
</ul>
</li>
</ul>
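
For example, pairing the 70B target with the 1B draft model named above; the generation settings are illustrative:

```bash
# Illustrative values: a 1B draft model speculates for the offloaded 70B
# target, generating at most 512 tokens with the general-purpose template.
python spec_generate.py \
    --model hugging-quants/Meta-Llama-3.1-70B-Instruct-AWQ-INT4 \
    --draft_model meta-llama/Llama-3.2-1B-Instruct \
    --offload \
    --G 512 --template meta-llama3
```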

### 4 Benchmarking Speculative Decoding

```bash
python spec_bench.py --configuration ../configs/chat_config_24gb.json        # MT Bench
python spec_bench_python.py --configuration ../configs/chat_config_24gb.json # Code Completion
```

### 5 Generate a Sequoia Tree

```bash
python construct_sequoia.py --w 5 --d 6
```

<h4>Key Configuration Options</h4>
<ul>
<li><strong>model</strong>: Specifies the target LLM to serve, e.g., <code>"hugging-quants/Meta-Llama-3.1-70B-Instruct-AWQ-INT4"</code>.</li>
<li><strong>draft_model</strong>: The lightweight draft model, e.g., <code>"meta-llama/Llama-3.2-1B-Instruct"</code>.</li>
<li><strong>offload</strong>: Enables offloading of the target model to host memory.</li>
<li><strong>cuda_graph</strong>: Toggles CUDA graph optimization for the draft model (currently unsupported for AWQ models).</li>
<li><strong>w</strong>: The width of the Sequoia tree.</li>
<li><strong>d</strong>: The depth of the Sequoia tree.</li>
<li><strong>dst</strong>: The JSON file to which the Sequoia tree is saved; it can later be passed as a <code>growmap_path</code> to the static speculation engine.</li>
</ul>
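
For example, rebuilding the 5x6 tree at an explicit output path; the width, depth, and file name mirror values used elsewhere in this commit:

```bash
# Build a width-5, depth-6 tree and save it where the static speculation
# engine can pick it up as a growmap_path.
python construct_sequoia.py --w 5 --d 6 \
    --dst ../umbrella/trees/8b_sequoia_tree-5x6.json
```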

examples/construct_sequoia.py

Lines changed: 5 additions & 6 deletions
```diff
@@ -12,10 +12,12 @@
 from tqdm import tqdm
 parser = argparse.ArgumentParser()
 parser.add_argument('--model', type=str, default="meta-llama/Llama-3.1-8B-Instruct",help='model')
-parser.add_argument('--draft_model', type=str, default="Zhuominc/FastCode-500M",help='model')
+parser.add_argument('--draft_model', type=str, default="InfiniAILab/CodeDrafter-500M", help='draft model')
 parser.add_argument('--offload', action='store_true', help="offload the model")
 parser.add_argument('--cuda_graph', action='store_true', help="whether use cuda graph")
-parser.add_argument('--w', type=int, default=8, help="whether use cuda graph")
+parser.add_argument('--w', type=int, default=3, help="tree width")
+parser.add_argument('--d', type=int, default=4, help="tree depth")
+parser.add_argument('--dst', type=str, default="../umbrella/trees/sequoia_tree.json", help="destination JSON file for the tree")
 args = parser.parse_args()
 
 system_prompt = SysPrompts['llama3-code']
@@ -85,7 +87,4 @@
 
 
 
-generate_sequoia_tree(width=5, depth=6, acc=acceptance_rate.tolist(), json_file="../umbrella/trees/8b_sequoia_tree-5x6.json")
-generate_sequoia_tree(width=5, depth=8, acc=acceptance_rate.tolist(), json_file="../umbrella/trees/8b_sequoia_tree-5x8.json")
-generate_sequoia_tree(width=6, depth=6, acc=acceptance_rate.tolist(), json_file="../umbrella/trees/8b_sequoia_tree-6x6.json")
-generate_sequoia_tree(width=6, depth=7, acc=acceptance_rate.tolist(), json_file="../umbrella/trees/8b_sequoia_tree-6x7.json")
+generate_sequoia_tree(width=args.w, depth=args.d, acc=acceptance_rate.tolist(), json_file=args.dst)
```
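
The four hardcoded tree builds removed above can now be reproduced from the command line. A minimal sketch, assuming the script is run from the examples/ directory with its default model settings:

```bash
# Rebuild the four trees the deleted calls produced; widths, depths, and
# output paths are taken from the removed lines above.
for wd in "5 6" "5 8" "6 6" "6 7"; do
    read -r w d <<< "$wd"
    python construct_sequoia.py --w "$w" --d "$d" \
        --dst "../umbrella/trees/8b_sequoia_tree-${w}x${d}.json"
done
```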
