-# Exporting popular LLMs out of the box
+# Exporting LLMs

Instead of requiring you to manually write code to call `torch.export()`, use ExecuTorch's assortment of lowering APIs, or even interact with TorchAO `quantize_` APIs for quantization, we provide an out-of-the-box experience that performantly exports a selection of supported models to ExecuTorch.

@@ -42,6 +42,9 @@ We only require manually specifying a checkpoint path for the Llama model family |

For the other supported LLMs, the checkpoint will be downloaded from HuggingFace automatically, and the param files can be found in their respective directories under `executorch/examples/models`, for instance `executorch/examples/models/qwen3/config/0_6b_config.json`.

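+For instance, a minimal sketch of the `base` section of a YAML config for Qwen3 0.6B (the `model_class` value is assumed for illustration; check the directories under `executorch/examples/models` for the exact supported names):
+
+```yaml
+base:
+  model_class: qwen3_0_6b  # assumed name, for illustration
+  params: executorch/examples/models/qwen3/config/0_6b_config.json
+```
+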
+## Export settings
+[ExportConfig](https://github.com/pytorch/executorch/blob/main/extension/llm/export/config/llm_config.py) contains settings for the exported `.pte` file, such as `max_seq_length` (the maximum prompt length) and `max_context_length` (the maximum length of the model's memory/cache).
+
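+For example, a sketch of these settings, assuming they sit under the `export:` section of the YAML config (mirroring `ExportConfig`):
+
+```yaml
+export:
+  max_seq_length: 128       # maximum prompt length, in tokens
+  max_context_length: 1024  # size of the model's memory/cache, in tokens
+```
+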
## Adding optimizations
`export_llm` performs a variety of optimizations to the model before export, during export, and during lowering. Quantization and delegation to accelerator backends are the main ones and will be covered in the next two sections. All other optimizations can be found under [`ModelConfig`](https://github.com/pytorch/executorch/blob/main/extension/llm/export/config/llm_config.py#L120). We will go ahead and add a few, as sketched below.

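+For illustration, a hedged sketch of enabling a couple of `ModelConfig` optimizations in the YAML config (field names taken from `llm_config.py`; values are illustrative):
+
+```yaml
+model:
+  use_kv_cache: True            # cache attention keys/values across decode steps
+  use_sdpa_with_kv_cache: True  # custom SDPA op that reads from the KV cache
+  dtype_override: fp32          # dtype to run the model in
+```
+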
@@ -81,6 +84,9 @@ Common ones to use are: |
- `8da4w`: short for int8 dynamic activation + int4 weight quantization.
- `int8`: int8 weight-only quantization.

+Group size is specified with `group_size`, e.g. 8, 32, 64.
+
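+For example, a sketch of the `quantization` section of the YAML config (field names from `llm_config.py`; values are illustrative):
+
+```yaml
+quantization:
+  qmode: 8da4w   # int8 dynamic activation + int4 weight
+  group_size: 32 # quantize weights in groups of 32
+```
+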
For Arm CPUs, there are also [low-bit kernels](https://pytorch.org/blog/hi-po-low-bit-operators/) for int8 dynamic activation + int[1-8] weight quantization. Note that these should not be used alongside XNNPACK, and experimentally we have found that they can sometimes even outperform the equivalent `8da4w`. To use them, set `qmode` to one of the following (see the sketch after this list):
- `torchao:8da(\d+)w`: int8 dynamic activation + int[1-8] weights, for example `torchao:8da5w`
- `torchao:fpa(\d+)w`: int[1-8] weight only, for example `torchao:fpa4w`
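+
+A sketch of selecting the low-bit kernels instead, with XNNPACK left disabled per the caveat above (again assuming the `quantization`/`backend` field layout from `llm_config.py`):
+
+```yaml
+quantization:
+  qmode: "torchao:8da5w"  # int8 dynamic activation + int5 weights
+backend:
+  xnnpack:
+    enabled: False  # low-bit kernels should not be combined with XNNPACK
+```
+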
@@ -156,7 +162,10 @@ python -m extension.llm.export.export_llm \ |
  --config path/to/config.yaml
```

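+As a reference point for what delegation settings might look like in that config, a hedged sketch of the XNNPACK `backend` section (field names from `llm_config.py`):
+
+```yaml
+backend:
+  xnnpack:
+    enabled: True
+    extended_ops: True  # delegate a wider set of ops to XNNPACK
+```
+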
-In the logs, there will be a table of all ops in the graph, and which ones were and were not delegated. Here is an example:
+In the logs, there will be a table of all ops in the graph, and which ones were and were not delegated.
+
+Here is an example:
+<details>
```
Total delegated subgraphs: 368
Number of delegated nodes: 2588
@@ -251,6 +260,8 @@ Number of non-delegated nodes: 2513 |
│ 42 │ Total │ 2588 │ 2513 │
╘════╧═══════════════════════════════════════════╧═══════════════════════════════════╧═══════════════════════════════════════╛
```
+</details>
+<br/>

To do further performance analysis, you may opt to use [ExecuTorch's Inspector APIs](https://docs.pytorch.org/executorch/stable/llm/getting-started.html#performance-analysis) to do things such as trace individual operator performance back to source code, view memory planning, and debug intermediate activations. To generate the ETRecord necessary for the Inspector APIs to link back to source code, you can use:
