
Commit a914446

mergennachin authored and facebook-github-bot committed

Improve Llama page (pytorch#5639)

Summary: Pull Request resolved: pytorch#5639

Demotes llama2, promotes llama3.

bypass-github-export-checks bypass-github-pytorch-ci-checks bypass-github-executorch-ci-checks

Reviewed By: JacobSzwejbka

Differential Revision: D63397528

fbshipit-source-id: 3e5c66abb54421f127d0cdfcdfafd391beb4add5

1 parent 9b6d4b4, commit a914446

File tree

1 file changed: +38 −35 lines

examples/models/llama2/README.md

Lines changed: 38 additions & 35 deletions
````diff
@@ -1,7 +1,11 @@
 # Summary
-This example demonstrates how to run a [Llama 2](https://llama.meta.com/llama2/) 7B or [Llama 3](https://ai.meta.com/llama/) 8B model on mobile via ExecuTorch. We use XNNPACK to accelerate the performance and 4-bit groupwise PTQ quantization to fit the model on a phone.
+This example demonstrates how to run [Llama models](https://www.llama.com/) on mobile via ExecuTorch. We use XNNPACK to accelerate the performance and 4-bit groupwise PTQ quantization to fit the model on a phone.
 
-For more details, see [Llama 2 repo](https://github.com/facebookresearch/llama) or [Llama 3 repo](https://github.com/facebookresearch/llama3).
+Here are the supported models:
+
+- Llama 3.1 8B
+- Llama 3 8B
+- Llama 2 7B
 
 Pretrained models are not included in this repo. Users can download them [here](https://ai.meta.com/resources/models-and-libraries/llama-downloads/).
 
````
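As background for the "4-bit groupwise PTQ quantization" mentioned in the summary: each group of weights (e.g. 128 or 256 values, the group sizes benchmarked below) shares a single scale factor. The sketch below is illustrative only; it is not the `8da4w` quantization mode that the export commands in this diff actually use.

```python
import torch

def quantize_4bit_groupwise(w: torch.Tensor, group_size: int = 128):
    """Illustrative symmetric 4-bit groupwise quantization: one scale per group."""
    groups = w.reshape(-1, group_size)
    # Map each group's max magnitude onto the symmetric int4 range [-7, 7].
    scale = groups.abs().amax(dim=1, keepdim=True) / 7.0
    q = torch.clamp(torch.round(groups / scale), -7, 7).to(torch.int8)
    return q, scale  # 4-bit codes (stored in int8 here) plus per-group scales

def dequantize(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    return q.float() * scale

w = torch.randn(4096, 128)
q, s = quantize_4bit_groupwise(w.flatten())
err = (dequantize(q, s).reshape(w.shape) - w).abs().max()
print(f"max reconstruction error: {err:.4f}")  # small, bounded by scale/2 per group
```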

````diff
@@ -47,22 +51,13 @@ SpinQuant can generate quantized weights that are [compatible with ExecuTorch](h
 
 ## Enablement
 
-We have verified running Llama 2 7B [mobile applications](#step-6-build-mobile-apps) efficiently on select devices including the iPhone 15 Pro, iPhone 15 Pro Max, Samsung Galaxy S22 and S24, and OnePlus 12.
+For Llama 3 8B and Llama 3.1 8B, we have verified so far on iPhone 15 Pro, iPhone 15 Pro Max, Samsung Galaxy S24+ and OnePlus 12 (with 16GB RAM).
 
-For Llama 3 8B, we have verified so far on iPhone 15 Pro, iPhone 15 Pro Max, Samsung Galaxy S24+ and OnePlus 12 (with 16GB RAM).
+We have verified running Llama 2 7B [mobile applications](#step-6-build-mobile-apps) efficiently on select devices including the iPhone 15 Pro, iPhone 15 Pro Max, Samsung Galaxy S22 and S24, and OnePlus 12.
 
 ## Performance
 
-### Llama 2 7B
-Llama 2 7B performance was measured on the Samsung Galaxy S22, S24, and OnePlus 12 devices. The performance measurement is expressed in terms of tokens per second using an [adb binary-based approach](#step-5-run-benchmark-on).
-
-|Device | Groupwise 4-bit (128) | Groupwise 4-bit (256) |
-|--------|-----------------------|-----------------------|
-|Galaxy S22 | 8.15 tokens/second | 8.3 tokens/second |
-|Galaxy S24 | 10.66 tokens/second | 11.26 tokens/second |
-|OnePlus 12 | 11.55 tokens/second | 11.6 tokens/second |
-
-### Llama 3 8B
+### Llama 3 8B and Llama 3.1 8B
 Llama 3 8B performance was measured on the Samsung Galaxy S22, S24, and OnePlus 12 devices. The performance measurement is expressed in terms of tokens per second using an [adb binary-based approach](#step-5-run-benchmark-on).
 
 Note that since Llama 3's vocabulary size is 4x that of Llama 2, we had to quantize the embedding lookup table as well. For these results, the embedding lookup table was groupwise quantized with 4 bits and a group size of 32.
````
````diff
@@ -73,8 +68,14 @@ Note that since Llama3's vocabulary size is 4x that of Llama2, we had to quantiz
 |Galaxy S24 | 10.91 tokens/second | 11.21 tokens/second |
 |OnePlus 12 | 10.85 tokens/second | 11.02 tokens/second |
 
-### Llama 3.1
-Llama 3.1 is supported on the ExecuTorch main branch and release/0.4.
+### Llama 2 7B
+Llama 2 7B performance was measured on the Samsung Galaxy S22, S24, and OnePlus 12 devices. The performance measurement is expressed in terms of tokens per second using an [adb binary-based approach](#step-5-run-benchmark-on).
+
+|Device | Groupwise 4-bit (128) | Groupwise 4-bit (256) |
+|--------|-----------------------|-----------------------|
+|Galaxy S22 | 8.15 tokens/second | 8.3 tokens/second |
+|Galaxy S24 | 10.66 tokens/second | 11.26 tokens/second |
+|OnePlus 12 | 11.55 tokens/second | 11.6 tokens/second |
 
 # Instructions
 
````
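To make the tables above concrete, the throughput figures translate directly into response latency. A back-of-envelope example, using the OnePlus 12 groupwise 4-bit (256) figure from the Llama 3 table (the 128-token reply length is an illustrative assumption):

```python
# Rough latency intuition derived from the benchmark tables above.
tokens_per_second = 11.02  # OnePlus 12, Llama 3 8B, groupwise 4-bit (256)
reply_tokens = 128         # an illustrative response length
print(f"~{reply_tokens / tokens_per_second:.1f} s for a {reply_tokens}-token reply")
# -> ~11.6 s for a 128-token reply
```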

````diff
@@ -92,23 +93,20 @@ Llama3.1 is supported on the ExecuTorch main branch and release/0.4
 
 ## Step 2: Prepare model
 
-### Option A: Download and export Llama 2 7B model
-
-You can export and run the original Llama 2 7B model.
+### Option A: Download and export Llama 3 8B instruct model
 
-1. Llama 2 pretrained parameters can be downloaded from [Meta's official website](https://ai.meta.com/resources/models-and-libraries/llama-downloads/) or from [Hugging Face](https://huggingface.co/meta-llama/Llama-2-7b).
+You can export and run the original Llama 3 8B instruct model.
 
-2. Edit the `params.json` file. Replace `"vocab_size": -1` with `"vocab_size": 32000`. This is a short-term workaround.
+1. Llama 3 pretrained parameters can be downloaded from [Meta's official Llama 3 repository](https://github.com/meta-llama/llama3/).
 
-3. Export the model and generate a `.pte` file:
+2. Export the model and generate a `.pte` file:
 ```
-python -m examples.models.llama2.export_llama --checkpoint <checkpoint.pth> --params <params.json> -kv --use_sdpa_with_kv_cache -X -qmode 8da4w --group_size 128 -d fp32
+python -m examples.models.llama2.export_llama --checkpoint <consolidated.00.pth> -p <params.json> -kv --use_sdpa_with_kv_cache -X -qmode 8da4w --group_size 128 -d fp32 --metadata '{"get_bos_id":128000, "get_eos_ids":[128009, 128001]}' --embedding-quantize 4,32 --output_name="llama3_kv_sdpa_xnn_qe_4_32.pte"
 ```
-4. Create tokenizer.bin:
 
-```
-python -m extension.llm.tokenizer.tokenizer -t <tokenizer.model> -o tokenizer.bin
-```
+Due to the larger vocabulary size of Llama 3, we recommend quantizing the embeddings with `--embedding-quantize 4,32` as shown above to further reduce the model size.
+
+3. SpinQuant (optional). If you want to improve accuracy, you can use [SpinQuant](https://github.com/facebookresearch/SpinQuant). Namely, (1) generate a new checkpoint via the `31_optimize_rotation_executorch.sh` and `32_eval_ptq_executorch.sh` commands in the [SpinQuant repo](https://github.com/facebookresearch/SpinQuant/tree/main?tab=readme-ov-file#3-export-to-executorch), then (2) pass an extra `--use_spin_quant native` argument to the `export_llama` script above.
 
 ### Option B: Download and export stories110M model
 
````
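After exporting with Option A, it can be useful to confirm the generated `.pte` program loads before moving on to the app steps. A minimal sketch, assuming the ExecuTorch Python bindings (`portable_lib`) are built in your environment; the loader name may vary between ExecuTorch versions:

```python
# Sanity-check sketch: load the exported program with ExecuTorch's pybindings.
# Assumes the pybind extension was built; use the output path from the export step.
from executorch.extension.pybindings.portable_lib import _load_for_executorch

module = _load_for_executorch("llama3_kv_sdpa_xnn_qe_4_32.pte")
print("Program loaded:", module is not None)
```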
````diff
@@ -133,25 +131,30 @@ If you want to deploy and run a smaller model for educational purposes. From `ex
 python -m extension.llm.tokenizer.tokenizer -t <tokenizer.model> -o tokenizer.bin
 ```
 
-### Option C: Download and export Llama 3 8B instruct model
+### Option C: Download and export Llama 2 7B model
 
-You can export and run the original Llama 3 8B instruct model.
+You can export and run the original Llama 2 7B model.
 
-1. Llama 3 pretrained parameters can be downloaded from [Meta's official Llama 3 repository](https://github.com/meta-llama/llama3/).
+1. Llama 2 pretrained parameters can be downloaded from [Meta's official website](https://ai.meta.com/resources/models-and-libraries/llama-downloads/) or from [Hugging Face](https://huggingface.co/meta-llama/Llama-2-7b).
 
-2. Export the model and generate a `.pte` file:
+2. Edit the `params.json` file. Replace `"vocab_size": -1` with `"vocab_size": 32000`. This is a short-term workaround.
+
+3. Export the model and generate a `.pte` file:
 ```
-python -m examples.models.llama2.export_llama --checkpoint <consolidated.00.pth> -p <params.json> -kv --use_sdpa_with_kv_cache -X -qmode 8da4w --group_size 128 -d fp32 --metadata '{"get_bos_id":128000, "get_eos_ids":[128009, 128001]}' --embedding-quantize 4,32 --output_name="llama3_kv_sdpa_xnn_qe_4_32.pte"
+python -m examples.models.llama2.export_llama --checkpoint <checkpoint.pth> --params <params.json> -kv --use_sdpa_with_kv_cache -X -qmode 8da4w --group_size 128 -d fp32
 ```
+4. Create tokenizer.bin:
 
-Due to the larger vocabulary size of Llama 3, we recommend quantizing the embeddings with `--embedding-quantize 4,32` as shown above to further reduce the model size.
-
-3. SpinQuant (optional). If you want to improve accuracy, you can use [SpinQuant](https://github.com/facebookresearch/SpinQuant). Namely, (1) generate a new checkpoint via the `31_optimize_rotation_executorch.sh` and `32_eval_ptq_executorch.sh` commands in the [SpinQuant repo](https://github.com/facebookresearch/SpinQuant/tree/main?tab=readme-ov-file#3-export-to-executorch), then (2) pass an extra `--use_spin_quant native` argument to the `export_llama` script above.
+```
+python -m extension.llm.tokenizer.tokenizer -t <tokenizer.model> -o tokenizer.bin
+```
 
 ### Option D: Download models from Hugging Face and convert from safetensor format to state dict
+
 You can also download the above models from [Hugging Face](https://huggingface.co/). Since ExecuTorch starts from a PyTorch model, a script like the one below can be used to convert the Hugging Face safetensors format to PyTorch's state dict. It leverages the utils provided by [TorchTune](https://github.com/pytorch/torchtune).
+
 ```Python
 from torchtune.utils import FullModelHFCheckpointer
 from torchtune.models import convert_weights
````
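The hunk ends inside the conversion script. For readers following along, a sketch of how such a TorchTune-based conversion typically continues is below; the directory and shard filenames are placeholders, and `model_type` must match the model being converted:

```python
import torch
from torchtune.utils import FullModelHFCheckpointer
from torchtune.models import convert_weights

# Placeholder paths: point these at the Hugging Face download.
checkpointer = FullModelHFCheckpointer(
    checkpoint_dir="<model_dir>",
    checkpoint_files=["model-00001-of-00004.safetensors"],  # list every shard
    output_dir="<model_dir>",
    model_type="LLAMA3",  # or "LLAMA2", matching the downloaded model
)
state_dict = checkpointer.load_checkpoint()  # safetensors -> TorchTune format
# Convert to the Meta state-dict layout that export_llama consumes.
state_dict = convert_weights.tune_to_meta(state_dict["model"])
torch.save(state_dict, "<model_dir>/checkpoint.pth")
```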
