Commit 66fd5c8

Update LLM documentation
1 parent 3419b46 commit 66fd5c8

File tree

2 files changed: +150 -862 lines changed

docs/source/llm/export-llm.md

Lines changed: 137 additions & 0 deletions

# Exporting popular LLMs out of the box

Instead of needing to manually write code to call `torch.export()`, use ExecuTorch's assortment of lowering APIs, or even interact with TorchAO `quantize_` APIs for quantization, we provide an out-of-the-box experience that performantly exports a selection of supported models to ExecuTorch.

As of this writing, the list of supported LLMs includes the following:
- Llama 2/3/3.1/3.2
- Qwen 2.5/3
- Phi 3.5/4-mini
- SmolLM2

The up-to-date list of supported LLMs can be found in the code [here](https://github.com/pytorch/executorch/blob/main/extension/llm/export/config/llm_config.py#L32).

## The export_llm API

`export_llm` is ExecuTorch's high-level export API for LLMs. In this tutorial, we will focus on exporting Llama 3.2 1B using this API. `export_llm`'s arguments are specified either through CLI args or through a yaml configuration whose fields are defined in [`LlmConfig`](https://github.com/pytorch/executorch/blob/main/extension/llm/export/config/llm_config.py). To call `export_llm`:

```
python -m extension.llm.export.export_llm \
  --config <path-to-config-yaml> \
  +base.<additional-CLI-overrides>
```
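
Individual `LlmConfig` fields can be supplied as CLI overrides alongside a yaml file. A quick sketch of what that looks like (the `export.output_name` field used here is an assumption for illustration; verify field names against `LlmConfig`):

```
# Combine a yaml config with a one-off CLI override (field name illustrative)
python -m extension.llm.export.export_llm \
  --config path/to/config.yaml \
  +export.output_name="llama3_2.pte"
```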

## Basic export

To perform a basic export of Llama 3.2, we will first need to download the checkpoint file (`consolidated.00.pth`) and params file (`params.json`). You can find these on the [Llama website](https://www.llama.com/llama-downloads/) or [Hugging Face](https://huggingface.co/meta-llama/Llama-3.2-1B/tree/main/original).

Then, we specify the `model_class`, `checkpoint` (path to checkpoint file), and `params` (path to params file) as arguments. Additionally, when we later run the exported .pte with our runner APIs, the runner will need to know the bos and eos ids for this model in order to know when to terminate. These are exposed through bos and eos getter methods in our .pte, which we can add by specifying bos and eos ids in a `metadata` argument. These can usually be found in the model's `tokenizer_config.json` on Hugging Face.

```
# path/to/config.yaml
base:
  model_class: llama3_2
  checkpoint: path/to/consolidated.00.pth
  params: path/to/params.json
  metadata: '{"get_bos_id":128000, "get_eos_ids":[128009, 128001]}'

# export_llm
python -m extension.llm.export.export_llm \
  --config path/to/config.yaml
```

We only require manually specifying a checkpoint path for the Llama model family, since it is our most optimized model and we have more advanced optimizations, such as [SpinQuant](https://github.com/pytorch/executorch/blob/main/examples/models/llama/README.md#spinquant), that require custom checkpoints.

For the other supported LLMs, the checkpoint is downloaded from Hugging Face automatically, and the `params.json`s can be found in their respective directories under `examples/models`, for instance `examples/models/qwen3/config/0_6b_config.json`.
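
As an illustration, a config for one of these auto-downloaded models can omit `checkpoint` entirely. A minimal sketch for Qwen3 0.6B (the exact `model_class` string is an assumption here; check the supported-model list linked above for the exact name):

```
# path/to/qwen3_config.yaml (illustrative)
base:
  model_class: qwen3_0_6b  # assumed name; verify in llm_config.py
  params: examples/models/qwen3/config/0_6b_config.json

# export_llm
python -m extension.llm.export.export_llm \
  --config path/to/qwen3_config.yaml
```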

## Adding optimizations

`export_llm` performs a variety of optimizations on the model before export, during export, and during lowering. Quantization and delegation to accelerator backends are the main ones and will be covered in the next two sections. All other optimizations can be found under [`ModelConfig`](https://github.com/pytorch/executorch/blob/main/extension/llm/export/config/llm_config.py#L120). We will go ahead and add a few optimizations.

```
# path/to/config.yaml
base:
  model_class: llama3_2
  checkpoint: path/to/consolidated.00.pth
  params: path/to/params.json
  metadata: '{"get_bos_id":128000, "get_eos_ids":[128009, 128001]}'
model:
  use_kv_cache: True
  use_sdpa_with_kv_cache: True

# export_llm
python -m extension.llm.export.export_llm \
  --config path/to/config.yaml
```

`use_kv_cache` and `use_sdpa_with_kv_cache` are recommended when exporting any LLM, while other options are useful situationally (a config sketch follows the list). For example:
- `use_shared_embedding` can help for models with tied input/output embedding layers, provided that you quantize using TorchAO low-bit ops (`quantization.qmode: torchao:8da(\d+)w` or `quantization.qmode: torchao:fpa(\d+)w`; see more [here](https://github.com/pytorch/executorch/blob/main/examples/models/llama/source_transformation/quantize.py#L82)).
- `use_attention_sink` extends generation by evicting entries from the beginning of the KV cache once the max context length is reached.
- `quantize_kv_cache` quantizes the KV cache in int8.
- `local_global_attention` implements [Local-Global Attention](https://arxiv.org/abs/2411.09604), making specific attention layers use a much smaller localized sliding-window KV cache.
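
A minimal sketch of layering one of these situational options onto the earlier config (assuming `quantize_kv_cache` is a boolean `ModelConfig` field, per the list above):

```
# Illustrative `model` section building on the config above
model:
  use_kv_cache: True
  use_sdpa_with_kv_cache: True
  quantize_kv_cache: True  # assumed boolean flag; stores the KV cache in int8
```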

## Quantization

Quantization options are defined by [`QuantizationConfig`](https://github.com/pytorch/executorch/blob/main/extension/llm/export/config/llm_config.py#L283). ExecuTorch does quantization in two ways:
1. The TorchAO [`quantize_`](https://docs.pytorch.org/ao/stable/generated/torchao.quantization.quantize_.html) API
2. [pt2e quantization](https://docs.pytorch.org/ao/main/tutorials_source/pt2e_quant_ptq.html)

### TorchAO

TorchAO quantizes at the source-code level, swapping out Linear modules for QuantizedLinear modules. This is the recommended quantization path for running on CPU. The quantization modes are defined [here](https://github.com/pytorch/executorch/blob/main/extension/llm/export/config/llm_config.py#L306).

Common ones to use are:
- `8da4w`: short for int8 dynamic activation + int4 weight quantization.
- `int8`: int8 weight-only quantization.

For Arm CPUs, there are also [low-bit kernels](https://pytorch.org/blog/hi-po-low-bit-operators/) for int8 dynamic activation + int[1-8] weight quantization. Note that these should not be used alongside XNNPACK, although experimentally we have found that their performance can sometimes even beat the equivalent `8da4w`. To use these, set `qmode` to either of the following (a concrete example follows the list):
- `torchao:8da(\d+)w`: int8 dynamic activation + int[1-8] weights
- `torchao:fpa(\d+)w`: int[1-8] weight-only
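
For instance, int8 dynamic activation with 3-bit weights is written as follows, simply instantiating the `torchao:8da(\d+)w` pattern above:

```
quantization:
  qmode: torchao:8da3w  # int8 dynamic activations + int3 weights
```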

To quantize embeddings, specify either `embedding_quantize: <bitwidth>,<groupsize>`, or for low-bit kernels use `embedding_quantize: torchao:<bitwidth>,<groupsize>`. `bitwidth` must be either 2, 4, or 8.

```
# path/to/config.yaml
base:
  model_class: llama3_2
  checkpoint: path/to/consolidated.00.pth
  params: path/to/params.json
  metadata: '{"get_bos_id":128000, "get_eos_ids":[128009, 128001]}'
model:
  use_kv_cache: True
  use_sdpa_with_kv_cache: True
quantization:
  embedding_quantize: 4,32
  qmode: 8da4w

# export_llm
python -m extension.llm.export.export_llm \
  --config path/to/config.yaml
```

### pt2e

pt2e quantizes at the post-export graph level, swapping out nodes and injecting quant/dequant nodes.
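
pt2e quantization is typically selected together with a backend-specific quantizer via the `pt2e_quantize` field of `QuantizationConfig`. A minimal sketch (the `xnnpack_dynamic` mode name is an assumption here; verify the available options in llm_config.py):

```
# Illustrative `quantization` section for pt2e (mode name assumed)
quantization:
  pt2e_quantize: xnnpack_dynamic
```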

## Backend support

Backend options are defined by [`BackendConfig`](https://github.com/pytorch/executorch/blob/main/extension/llm/export/config/llm_config.py#L434). Each backend has its own configuration options. Here is an example for XNNPACK:

```
# path/to/config.yaml
base:
  model_class: llama3_2
  checkpoint: path/to/consolidated.00.pth
  params: path/to/params.json
  metadata: '{"get_bos_id":128000, "get_eos_ids":[128009, 128001]}'
model:
  use_kv_cache: True
  use_sdpa_with_kv_cache: True
quantization:
  embedding_quantize: 4,32
  qmode: 8da4w
backend:
  xnnpack:
    enabled: True
    extended_ops: True  # Expand the selection of ops delegated to XNNPACK.

# export_llm
python -m extension.llm.export.export_llm \
  --config path/to/config.yaml
```
