# Examples of UMbreLLa

### 1 Benchmark the decoding/verification speed

```bash
python bench.py --model hugging-quants/Meta-Llama-3.1-70B-Instruct-AWQ-INT4 --offload --D 1 --T 20
```

<h4>Key Configuration Options</h4>
<ul>
  <li><strong>model</strong>: Specifies the target LLM to serve, e.g., <code>"hugging-quants/Meta-Llama-3.1-70B-Instruct-AWQ-INT4"</code>.</li>
  <li><strong>offload</strong>: Enables offloading of the target model to host memory.</li>
  <li><strong>cuda_graph</strong>: Toggles CUDA graph optimization for the draft model (currently unsupported for AWQ models).</li>
  <li><strong>M</strong>: The maximum combined length, in tokens, of the input and output.</li>
  <li><strong>D</strong>: The number of tokens verified per decoding step (used to test verification speed).</li>
  <li><strong>T</strong>: The number of times the benchmark is repeated.</li>
</ul>
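
For instance, these options can be combined in a single invocation. The command below is only illustrative: the values chosen for <code>--M</code>, <code>--D</code>, and <code>--T</code> are arbitrary, and it assumes each option is passed with the same <code>--flag value</code> convention as the example above (<code>--cuda_graph</code> is omitted because the example model is AWQ-quantized).

```bash
# Illustrative benchmark run: 2048-token context budget, 4-token verification steps, 20 repetitions
python bench.py \
    --model hugging-quants/Meta-Llama-3.1-70B-Instruct-AWQ-INT4 \
    --offload \
    --M 2048 \
    --D 4 \
    --T 20
```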

### 2 Benchmarking auto-regressive generation

```bash
python generate.py --model hugging-quants/Meta-Llama-3.1-70B-Instruct-AWQ-INT4 --offload
```

<h4>Key Configuration Options</h4>
<ul>
  <li><strong>model</strong>: Specifies the target LLM to serve, e.g., <code>"hugging-quants/Meta-Llama-3.1-70B-Instruct-AWQ-INT4"</code>.</li>
  <li><strong>offload</strong>: Enables offloading of the target model to host memory.</li>
  <li><strong>cuda_graph</strong>: Toggles CUDA graph optimization for the draft model (currently unsupported for AWQ models).</li>
  <li><strong>G</strong>: The maximum number of generated tokens (must be smaller than 2000).</li>
  <li><strong>template</strong>: Defines the structure for input prompts. Supported values include:
    <ul>
      <li><code>"llama3-code"</code>: Optimized for code-related tasks.</li>
      <li><code>"meta-llama3"</code>: General-purpose instruction-following template.</li>
    </ul>
  </li>
</ul>
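
As an illustration, a generation run that caps the output length and selects the general-purpose chat template could look like the following. The value for <code>--G</code> is arbitrary, and the command assumes the <code>--flag value</code> convention shown above.

```bash
# Illustrative auto-regressive generation run: up to 512 generated tokens with the general-purpose template
python generate.py \
    --model hugging-quants/Meta-Llama-3.1-70B-Instruct-AWQ-INT4 \
    --offload \
    --G 512 \
    --template meta-llama3
```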

### 3 Speculative Decoding Example

```bash
python spec_generate.py --model hugging-quants/Meta-Llama-3.1-70B-Instruct-AWQ-INT4 --offload
```

<h4>Key Configuration Options</h4>
<ul>
  <li><strong>model</strong>: Specifies the target LLM to serve, e.g., <code>"hugging-quants/Meta-Llama-3.1-70B-Instruct-AWQ-INT4"</code>.</li>
  <li><strong>draft_model</strong>: Lightweight draft model, e.g., <code>"meta-llama/Llama-3.2-1B-Instruct"</code>.</li>
  <li><strong>offload</strong>: Enables offloading of the target model to host memory.</li>
  <li><strong>cuda_graph</strong>: Toggles CUDA graph optimization for the draft model (currently unsupported for AWQ models).</li>
  <li><strong>G</strong>: The maximum number of generated tokens (must be smaller than 2000).</li>
  <li><strong>template</strong>: Defines the structure for input prompts. Supported values include:
    <ul>
      <li><code>"llama3-code"</code>: Optimized for code-related tasks.</li>
      <li><code>"meta-llama3"</code>: General-purpose instruction-following template.</li>
    </ul>
  </li>
</ul>
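
For example, pairing the offloaded 70B AWQ target with the lightweight draft model listed above might look like the command below. The values for <code>--G</code> and <code>--template</code> are illustrative, and the command assumes the same <code>--flag value</code> convention as the other examples.

```bash
# Illustrative speculative decoding run: 70B AWQ target verified against a 1B draft model
python spec_generate.py \
    --model hugging-quants/Meta-Llama-3.1-70B-Instruct-AWQ-INT4 \
    --draft_model meta-llama/Llama-3.2-1B-Instruct \
    --offload \
    --G 512 \
    --template meta-llama3
```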

### 4 Benchmarking Speculative Decoding

```bash
python spec_bench.py --configuration ../configs/chat_config_24gb.json # MT Bench
python spec_bench_python.py --configuration ../configs/chat_config_24gb.json # Code Completion
```

### 5 Generate Sequoia Tree

```bash
python construct_sequoia.py --w 5 --d 6
```

<h4>Key Configuration Options</h4>
<ul>
  <li><strong>model</strong>: Specifies the target LLM to serve, e.g., <code>"hugging-quants/Meta-Llama-3.1-70B-Instruct-AWQ-INT4"</code>.</li>
  <li><strong>draft_model</strong>: Lightweight draft model, e.g., <code>"meta-llama/Llama-3.2-1B-Instruct"</code>.</li>
  <li><strong>offload</strong>: Enables offloading of the target model to host memory.</li>
  <li><strong>cuda_graph</strong>: Toggles CUDA graph optimization for the draft model (currently unsupported for AWQ models).</li>
  <li><strong>w</strong>: The width of the Sequoia tree.</li>
  <li><strong>d</strong>: The depth of the Sequoia tree.</li>
  <li><strong>dst</strong>: The JSON file to which the Sequoia tree is saved; it can later be specified as the growmap_path of the static speculation engine.</li>
</ul>
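
For example, to build a width-5, depth-6 tree and keep it for later use, the tree can be written to a JSON file via <code>--dst</code>. The destination path below is only a placeholder, assuming <code>--dst</code> takes a file path; the resulting file is what would be referenced as the growmap_path in the static speculation engine.

```bash
# Illustrative run: build a width-5, depth-6 Sequoia tree and save it to a placeholder JSON path
python construct_sequoia.py --w 5 --d 6 --dst ../configs/sequoia_tree_w5_d6.json
```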