
Commit 59c8d82

mergennachin authored and facebook-github-bot committed
Refactor llama2-specific content out of Llama readme (#6359)
Summary: Pull Request resolved: #6359

Llama2 is "obsolete"; let's migrate its content to the existing llama2 readme.md page.

bypass-github-export-checks
bypass-github-pytorch-ci-checks
bypass-github-executorch-ci-checks

Reviewed By: dvorjackz

Differential Revision: D64618486

fbshipit-source-id: 82b04aa93023dc021cb162986546d737d5e9f4dd
1 parent 7493aae commit 59c8d82

File tree

examples/models/llama/README.md
examples/models/llama2/README.md

2 files changed: +57 -37 lines changed

examples/models/llama/README.md

Lines changed: 6 additions & 36 deletions
@@ -6,7 +6,7 @@ Here are supported models:
 - Llama 3.2 1B and 3B
 - Llama 3.1 8B
 - Llama 3 8B
-- Llama 2 7B
+- [Llama 2 7B](../llama2/README.md)
 
 Pretrained models are not included in this repo. Users are suggested to download them [here](https://ai.meta.com/resources/models-and-libraries/llama-downloads/).
 
@@ -22,7 +22,7 @@ Please note that the models are subject to the [Llama 2 Acceptable Use Policy](h
 
 # Results
 
-Since Llama 2 7B or Llama 3 8B model needs at least 4-bit quantization to fit even within some of the highend phones, results presented here correspond to 4-bit groupwise post-training quantized model.
+Since Llama 3 8B model needs at least 4-bit quantization to fit even within some of the highend phones, results presented here correspond to 4-bit groupwise post-training quantized model.
 
 For Llama 3.2 1B/3B, we validated the models by running them in their original bf16 datatype and unquantized on both Android and iOS phones. The 3B version required high-end phones with larger RAMs to fit the model.
 
@@ -53,7 +53,6 @@ Below are the results for two different groupsizes, with max_seq_length 2048, an
 
 |Model | Baseline (FP32) | Groupwise 4-bit (128) | Groupwise 4-bit (256)
 |--------|-----------------| ---------------------- | ---------------
-|Llama 2 7B | 9.2 | 10.2 | 10.7
 |Llama 3 8B | 7.9 | 9.4 | 9.7
 
 Note that groupsize less than 128 was not enabled, since such models were still too large. This is because our current efforts have focused on enabling FP32 and support for FP16 is under way. What this implies for model size is that 1) embedding table is in FP32 and 2) quantized weights scales are FP32.
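
As a rough complement to the note above about FP32 scales, here is a back-of-the-envelope size estimate. The figure of roughly 7.5B quantized weights is an assumption for illustration only and ignores the FP32 embedding table; it shows why shrinking the group size below 128 inflates the model further.

```python
# Illustrative size estimate only; the 7.5e9 weight count is an assumption.
quantized_params = 7.5e9

for group_size in (256, 128, 64, 32):
    weight_bytes = quantized_params * 0.5                # 4 bits per weight
    scale_bytes = (quantized_params / group_size) * 4.0  # one FP32 scale per group
    total_gb = (weight_bytes + scale_bytes) / 1e9
    print(f"group size {group_size:>3}: ~{total_gb:.2f} GB "
          f"(of which {scale_bytes / 1e9:.2f} GB are FP32 scales)")
```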
@@ -80,8 +79,6 @@ SpinQuant can generate quantized weights that are [compatible with ExecuTorch](h
 
 For Llama 3 8B and Llama3.1 8B, we have verified so far on iPhone 15 Pro, iPhone 15 Pro Max, Samsung Galaxy S24+ and OnePlus 12 (with 16GB RAM).
 
-We have verified running Llama 2 7B [mobile applications](#step-6-build-mobile-apps) efficiently on select devices including the iPhone 15 Pro, iPhone 15 Pro Max, Samsung Galaxy S22 and S24, and OnePlus 12.
-
 ## Performance
 
 ### Llama 3.2 1B and 3B
@@ -97,29 +94,21 @@ Llama 3.2 1B and 3B performance was measured on the OnePlus 12 device. The perfo
 ### Llama3 8B and Llama3.1 8B
 Llama 3 8B performance was measured on the Samsung Galaxy S22, S24, and OnePlus 12 devices. The performance measurement is expressed in terms of tokens per second using an [adb binary-based approach](#step-5-run-benchmark-on).
 
-Note that since Llama3's vocabulary size is 4x that of Llama2, we had to quantize embedding lookup table as well. For these results embedding lookup table was groupwise quantized with 4-bits and group size of 32.
+Due to Llama3's vocabulary size, we had to quantize embedding lookup table as well. For these results embedding lookup table was groupwise quantized with 4-bits and group size of 32.
 
 |Device | Groupwise 4-bit (128) | Groupwise 4-bit (256)
 |--------| ---------------------- | ---------------
 |Galaxy S22 | 7.85 tokens/second | 8.4 tokens/second |
 |Galaxy S24 | 10.91 tokens/second | 11.21 tokens/second |
 |OnePlus 12 | 10.85 tokens/second | 11.02 tokens/second |
 
-### Llama2 7B
-Llama 2 7B performance was measured on the Samsung Galaxy S22, S24, and OnePlus 12 devices. The performance measurement is expressed in terms of tokens per second using an [adb binary-based approach](#step-5-run-benchmark-on).
-
-|Device | Groupwise 4-bit (128) | Groupwise 4-bit (256)
-|--------| ---------------------- | ---------------
-|Galaxy S22 | 8.15 tokens/second | 8.3 tokens/second |
-|Galaxy S24 | 10.66 tokens/second | 11.26 tokens/second |
-|OnePlus 12 | 11.55 tokens/second | 11.6 tokens/second |
 
 # Instructions
 
 ## Tested on
 
 - MacOS M1/M2, Linux.
-- For Llama 2 7B, your device may require at least 32GB RAM. If this is a constraint for you, please try the smaller stories model.
+- For Llama 3 8B, your device may require at least 32GB RAM. If this is a constraint for you, please try the smaller stories model.
 
 ## Step 1: Setup
 > :warning: **double check your python environment**: make sure `conda activate <VENV>` is run before all the bash and python scripts.
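
The tokens-per-second numbers in the tables above come from an adb binary-based benchmark. As a generic sketch (not the ExecuTorch benchmark harness), the metric itself is just the number of generated tokens divided by the wall-clock decode time, with a warm-up run excluded:

```python
import time

def measure_tokens_per_second(generate_fn, prompt: str, max_new_tokens: int = 128) -> float:
    """Generic throughput sketch: generated tokens divided by wall-clock time.

    `generate_fn` is a placeholder for any callable that returns the list of
    generated token ids for a prompt; it stands in for the on-device runner.
    """
    generate_fn(prompt, 8)  # brief warm-up run, excluded from timing
    start = time.perf_counter()
    tokens = generate_fn(prompt, max_new_tokens)
    elapsed = time.perf_counter() - start
    return len(tokens) / elapsed
```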
@@ -208,24 +197,7 @@ If you want to deploy and run a smaller model for educational purposes. From `ex
 python -m examples.models.llama.export_llama -c stories110M.pt -p params.json -X -kv
 ```
 
-### Option D: Download and export Llama 2 7B model
-
-You can export and run the original Llama 2 7B model.
-
-1. Llama 2 pretrained parameters can be downloaded from [Meta's official website](https://ai.meta.com/resources/models-and-libraries/llama-downloads/) or from [Hugging Face](https://huggingface.co/meta-llama/Llama-2-7b).
-
-2. Edit `params.json` file. Replace `"vocab_size": -1` with `"vocab_size": 32000`. This is a short-term workaround.
-
-3. Export model and generate `.pte` file:
-```
-python -m examples.models.llama.export_llama --checkpoint <checkpoint.pth> --params <params.json> -kv --use_sdpa_with_kv_cache -X -qmode 8da4w --group_size 128 -d fp32
-```
-4. Create tokenizer.bin.
-```
-python -m extension.llm.tokenizer.tokenizer -t <tokenizer.model> -o tokenizer.bin
-```
-
-### Option E: Download models from Hugging Face and convert from safetensor format to state dict
+### Option D: Download models from Hugging Face and convert from safetensor format to state dict
 
 
 You can also download above models from [Hugging Face](https://huggingface.co/). Since ExecuTorch starts from a PyTorch model, a script like below can be used to convert the Hugging Face safetensors format to PyTorch's state dict. It leverages the utils provided by [TorchTune](https://github.com/pytorch/torchtune).
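
The Option D text above points to a TorchTune-based conversion script in the repository. As a rough standalone sketch using only the `safetensors` package (paths are placeholders, and the remapping of Hugging Face parameter names to Meta naming that the real script performs is not shown), the safetensors-to-state-dict step looks like this:

```python
import glob

import torch
from safetensors.torch import load_file

# Merge all downloaded safetensors shards into a single state dict.
# The directory name is a placeholder for wherever the HF checkpoint lives.
state_dict = {}
for shard in sorted(glob.glob("Meta-Llama-3-8B/*.safetensors")):
    state_dict.update(load_file(shard))

# NOTE: keys are still in Hugging Face naming here; the repo's TorchTune-based
# script additionally remaps them before export.
torch.save(state_dict, "llama3_hf_keys.pth")
```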
@@ -348,8 +320,6 @@ Note for Mac users: There's a known linking issue with Xcode 15.1. Refer to the
 cmake-out/examples/models/llama/llama_main --model_path=<model pte file> --tokenizer_path=<tokenizer.model> --prompt=<prompt>
 ```
 
-For Llama2 models, pass the converted `tokenizer.bin` file instead of `tokenizer.model`.
-
 To build for CoreML backend and validate on Mac, replace `-DEXECUTORCH_BUILD_XNNPACK=ON` with `-DEXECUTORCH_BUILD_COREML=ON`
 
 ## Step 5: Run benchmark on Android phone
@@ -453,7 +423,7 @@ For CoreML, there are 2 additional optional arguments:
 - Enable support for mult-modal models like LlaVa.
 ## Performance
 - Performance improvement via techniques such as speculative decoding
-- Enabling LLama2 7b and other architectures via Vulkan
+- Enabling LLama and other architectures via Vulkan
 - Enabling performant execution of widely used quantization schemes.

examples/models/llama2/README.md

Lines changed: 51 additions & 1 deletion
@@ -1,2 +1,52 @@
 # Summary
-For Llama2, please see the [Llama README page](../llama/README.md) for details.
+For Llama enablement, please see the [Llama README page](../llama/README.md) for complete details.
+
+This page contains Llama2 specific instructions and information.
+
+
+## Enablement
+
+We have verified running Llama 2 7B [mobile applications](#step-6-build-mobile-apps) efficiently on select devices including the iPhone 15 Pro, iPhone 15 Pro Max, Samsung Galaxy S22 and S24, and OnePlus 12.
+
+Since Llama 2 7B needs at least 4-bit quantization to fit even within some of the highend phones, results presented here correspond to 4-bit groupwise post-training quantized model.
+
+## Results
+
+### Llama2 7B
+Llama 2 7B performance was measured on the Samsung Galaxy S22, S24, and OnePlus 12 devices. The performance measurement is expressed in terms of tokens per second using an [adb binary-based approach](#step-5-run-benchmark-on).
+
+|Device | Groupwise 4-bit (128) | Groupwise 4-bit (256)
+|--------| ---------------------- | ---------------
+|Galaxy S22 | 8.15 tokens/second | 8.3 tokens/second |
+|Galaxy S24 | 10.66 tokens/second | 11.26 tokens/second |
+|OnePlus 12 | 11.55 tokens/second | 11.6 tokens/second |
+
+Below are the results for two different groupsizes, with max_seq_length 2048, and limit 1000, based on WikiText perplexity using [LM Eval](https://github.com/EleutherAI/lm-evaluation-harness).
+
+|Model | Baseline (FP32) | Groupwise 4-bit (128) | Groupwise 4-bit (256)
+|--------|-----------------| ---------------------- | ---------------
+|Llama 2 7B | 9.2 | 10.2 | 10.7
+
+## Prepare model
+
+You can export and run the original Llama 2 7B model.
+
+1. Llama 2 pretrained parameters can be downloaded from [Meta's official website](https://ai.meta.com/resources/models-and-libraries/llama-downloads/) or from [Hugging Face](https://huggingface.co/meta-llama/Llama-2-7b).
+
+2. Edit `params.json` file. Replace `"vocab_size": -1` with `"vocab_size": 32000`. This is a short-term workaround.
+
+3. Export model and generate `.pte` file:
+```
+python -m examples.models.llama.export_llama --checkpoint <checkpoint.pth> --params <params.json> -kv --use_sdpa_with_kv_cache -X -qmode 8da4w --group_size 128 -d fp32
+```
+4. Create tokenizer.bin.
+```
+python -m extension.llm.tokenizer.tokenizer -t <tokenizer.model> -o tokenizer.bin
+```
+
+Pass the converted `tokenizer.bin` file instead of `tokenizer.model` for subsequent steps.
+
+
+# Run
+
+Running will be the same [by following this step](../llama/README.md#step-4-run-on-your-computer-to-validate).
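
The `params.json` edit in step 2 of the new llama2 README can also be scripted; a minimal sketch (the file path is assumed) looks like this:

```python
import json
from pathlib import Path

# Short-term workaround from step 2: replace "vocab_size": -1 with 32000.
params_path = Path("params.json")  # path to the downloaded Llama 2 7B params.json
params = json.loads(params_path.read_text())
if params.get("vocab_size", -1) == -1:
    params["vocab_size"] = 32000  # Llama 2 tokenizer vocabulary size
params_path.write_text(json.dumps(params, indent=2))
```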
