Commit 1a3e422

Mergen comments
1 parent 659db4f commit 1a3e422

File tree: 3 files changed (+18, -5 lines)

docs/source/index.md

Lines changed: 1 addition & 1 deletion
@@ -88,7 +88,7 @@ ExecuTorch provides support for:
 - [Selective Build](kernel-library-selective-build)
 #### Working with LLMs
 - [Getting Started](llm/getting-started.md)
-- [Exporting LLMs with export_llm](llm/export-llm.md)
+- [Exporting LLMs](llm/export-llm.md)
 - [Exporting custom LLMs](llm/export-custom-llm.md)
 - [Running with C++](llm/run-with-c-plus-plus.md)
 - [Running on Android (XNNPack)](llm/llama-demo-android.md)

docs/source/llm/export-llm.md

Lines changed: 13 additions & 2 deletions
@@ -1,4 +1,4 @@
-# Exporting popular LLMs out of the box
+# Exporting LLMs
 
 Instead of requiring you to manually write code that calls torch.export(), use ExecuTorch's assortment of lowering APIs, or even interact with TorchAO quantize_ APIs for quantization, we provide an out-of-the-box experience that performantly exports a selection of supported models to ExecuTorch.
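For orientation, the out-of-the-box flow documented here reduces to one CLI call. A minimal sketch, reusing the `--config` invocation that appears verbatim later on this page; the config path is a placeholder:

```bash
# Minimal export_llm invocation sketch; path/to/config.yaml is a placeholder
# for a user-supplied config file (see the settings sketched further down).
python -m extension.llm.export.export_llm \
    --config path/to/config.yaml
```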

@@ -42,6 +42,9 @@ We only require manually specifying a checkpoint path for the Llama model family
 
 For the other supported LLMs, the checkpoint will be downloaded from HuggingFace automatically, and the param files can be found in their respective directories under `executorch/examples/models`, for instance `executorch/examples/models/qwen3/config/0_6b_config.json`.
 
+## Export settings
+[ExportConfig](https://github.com/pytorch/executorch/blob/main/extension/llm/export/config/llm_config.py) contains settings for the exported `.pte`, such as `max_seq_length` (max length of the prompt) and `max_context_length` (max length of the model's memory/cache).
+
 ## Adding optimizations
 `export_llm` performs a variety of optimizations to the model before export, during export, and during lowering. Quantization and delegation to accelerator backends are the main ones and will be covered in the next two sections. All other optimizations can be found under [`ModelConfig`](https://github.com/pytorch/executorch/blob/main/extension/llm/export/config/llm_config.py#L120). We will go ahead and add a few optimizations.
 
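To make the new "Export settings" section concrete, here is a hedged sketch of how those fields could be supplied through the `--config` file used elsewhere on this page. The top-level `export:` section name is an assumption based on the linked llm_config.py, and the numeric values are illustrative only:

```bash
# Hypothetical config.yaml carrying the ExportConfig fields named above.
# The "export:" section name is assumed from llm_config.py; values are examples.
cat > config.yaml <<'EOF'
export:
  max_seq_length: 128        # max length of the prompt
  max_context_length: 2048   # max length of the model's memory/cache
EOF

python -m extension.llm.export.export_llm --config config.yaml
```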

@@ -81,6 +84,9 @@ Common ones to use are:
 - `8da4w`: short for int8 dynamic activation + int4 weight quantization.
 - `int8`: int8 weight-only quantization.
 
+Group size is specified with:
+- `group_size`: 8, 32, 64, etc.
+
 For Arm CPUs, there are also [low-bit kernels](https://pytorch.org/blog/hi-po-low-bit-operators/) for int8 dynamic activation + int[1-8] weight quantization. Note that these should not be used alongside XNNPACK, and experimentally we have found that performance can sometimes be better even for the equivalent `8da4w`. To use these, set `qmode` to either:
 - `torchao:8da(\d+)w`: int8 dynamic activation + int[1-8] weights, for example `torchao:8da5w`
 - `torchao:fpa(\d+)w`: int[1-8] weight only, for example `torchao:fpa4w`
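
As a companion to the quantization options above, a sketch of the corresponding config entries, appending to the config sketched earlier. The `quantization:` section name is again an assumption based on llm_config.py; the `qmode` and `group_size` values come from the lists above:

```bash
# Hypothetical quantization entries for config.yaml; the section name is assumed
# from llm_config.py, and qmode/group_size values are taken from the text above.
cat >> config.yaml <<'EOF'
quantization:
  qmode: 8da4w     # int8 dynamic activation + int4 weight (on Arm CPUs, e.g. torchao:8da5w)
  group_size: 32   # e.g. 8, 32, or 64
EOF
```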
@@ -156,7 +162,10 @@ python -m extension.llm.export.export_llm \
     --config path/to/config.yaml
 ```
 
-In the logs, there will be a table of all ops in the graph, and which ones were and were not delegated. Here is an example:
+In the logs, there will be a table of all ops in the graph, and which ones were and were not delegated.
+
+Here is an example:
+<details>
 ```
 Total delegated subgraphs: 368
 Number of delegated nodes: 2588
@@ -251,6 +260,8 @@ Number of non-delegated nodes: 2513
 │ 42 │ Total │ 2588 │ 2513 │
 ╘════╧═══════════════════════════════════════════╧═══════════════════════════════════╧═══════════════════════════════════════╛
 ```
+</details>
+<br/>
 
 To do further performance analysis, you may opt to use [ExecuTorch's Inspector APIs](https://docs.pytorch.org/executorch/stable/llm/getting-started.html#performance-analysis) to do things such as trace individual operator performance back to source code, view memory planning, and debug intermediate activations. To generate the ETRecord necessary for the Inspector APIs to link back to source code, you can use:
 
docs/source/llm/getting-started.md

Lines changed: 4 additions & 2 deletions
@@ -1,4 +1,4 @@
-# Deploying LLMs to Executorch
+# Deploying LLMs to ExecuTorch
 
 ExecuTorch is designed to support all types of machine learning models, and LLMs are no exception.
 In this section we demonstrate how to leverage ExecuTorch to performantly run state of the art
@@ -16,7 +16,9 @@ To follow this guide, you'll need to install ExecuTorch. Please see [Setting Up
 
 ## Next steps
 
-- [Exporting popular LLMs out of the box](export-llm.md)
+Deploying LLMs to ExecuTorch can be boiled down to a two-step process: (1) exporting the LLM to a `.pte` file and (2) running the `.pte` file using our C++ APIs or Swift/Java bindings.
+
+- [Exporting LLMs](export-llm.md)
 - [Exporting custom LLMs](export-custom-llm.md)
 - [Running with C++](run-with-c-plus-plus.md)
 - [Running on Android (XNNPack)](llama-demo-android.md)
