Commit 82a505b

Chester Hu authored and facebook-github-bot committed

Update Android XNNPack demo app doc for Llama 3.2 and Llama Guard 3 (#5640)

Summary: Pull Request resolved: #5640. Added instructions to export the Llama 3.2 lightweight models and Llama Guard 3 1B.

Reviewed By: kirklandsign

Differential Revision: D63151037

fbshipit-source-id: 20f7c817b5dcbacc8f59fda3d3cd260c8d62f99c

1 parent d2ba238 commit 82a505b

File tree

2 files changed: +37 −1 lines changed

examples/demo-apps/android/LlamaDemo/README.md

Lines changed: 2 additions & 0 deletions

```diff
@@ -17,6 +17,8 @@ The goal is for you to see the type of support ExecuTorch provides and feel comf
 
 ## Supporting Models
 As a whole, the models that this app supports are (varies by delegate):
+* Llama 3.2 1B/3B
+* Llama Guard 3 1B
 * Llama 3.1 8B
 * Llama 3 8B
 * Llama 2 7B
```

examples/demo-apps/android/LlamaDemo/docs/delegates/xnnpack_README.md

Lines changed: 35 additions & 1 deletion
Original file line numberDiff line numberDiff line change
```diff
@@ -1,5 +1,7 @@
 # Building ExecuTorch Android Demo App for Llama running XNNPack
 
+**[UPDATE - 09/25]** We have added support for running [Llama 3.2 models](#for-llama-32-1b-and-3b-models) on the XNNPack backend. We currently support inference with their original data type (BFloat16). We have also added instructions to run [Llama Guard 1B models](#for-llama-guard-1b-models) on-device.
+
 This tutorial covers the end-to-end workflow for building an Android demo app using CPU on-device via the XNNPack framework.
 More specifically, it covers:
 1. Export and quantization of Llama and Llava models against the XNNPack backend.
```
````diff
@@ -56,10 +58,41 @@ Optional: Use the --pybind flag to install with pybindings.
 ## Prepare Models
 In this demo app, we support text-only inference with up-to-date Llama models and image reasoning inference with LLaVA 1.5.
 
-### For Llama model
+### For Llama 3.2 1B and 3B models
+We support BFloat16 as a data type on the XNNPack backend for the Llama 3.2 1B/3B models.
+* You can request and download model weights for Llama through Meta's official [website](https://llama.meta.com/).
+* For chat use cases, download the instruct models instead of the pretrained ones.
+* Run `examples/models/llama2/install_requirements.sh` to install dependencies.
+* The 1B model in BFloat16 format can run on mobile devices with 8GB RAM. The 3B model will require 12GB+ RAM.
+* Export the Llama model and generate a .pte file as below:
+
+```
+python -m examples.models.llama2.export_llama --checkpoint <checkpoint.pth> --params <params.json> -kv -X -d bf16 --metadata '{"get_bos_id":128000, "get_eos_ids":[128009, 128001]}' --output_name="llama3_2.pte"
+```
+
+* Convert the tokenizer for Llama 3.2 - rename `tokenizer.model` to `tokenizer.bin`.
+
+For more detail on using the Llama 3.2 lightweight models, including the prompt template, please go to the official [website](https://www.llama.com/docs/model-cards-and-prompt-formats/llama3_2#-llama-3.2-lightweight-models-(1b/3b)-).
+
+
+### For Llama Guard 1B models
+To safeguard your application, you can use our Llama Guard models for prompt classification or response classification as described [here](https://www.llama.com/docs/model-cards-and-prompt-formats/llama-guard-3/).
+* Llama Guard 3-1B is a fine-tuned Llama-3.2-1B pretrained model for content safety classification. It is aligned to safeguard against the [MLCommons standardized hazards taxonomy](https://arxiv.org/abs/2404.12241).
+* You can download the latest Llama Guard 1B INT4 model, which is already exported for ExecuTorch, using the instructions [here](https://github.com/meta-llama/PurpleLlama/tree/main/Llama-Guard3). This model is pruned and quantized to 4-bit weights using 8da4w mode, reducing its size to <450MB to optimize deployment on edge devices.
+* You can use the same tokenizer as Llama 3.2.
+* To try this model, choose Model Type LLAMA_GUARD_3 in the demo app below and try prompt classification for a given user prompt.
+* We prepared this model using the following command:
+
+```
+python -m examples.models.llama2.export_llama --checkpoint <pruned llama guard 1b checkpoint.pth> --params <params.json> -d fp32 -kv --use_sdpa_with_kv_cache --quantization_mode 8da4w --group_size 256 --xnnpack --max_seq_length 8193 --embedding-quantize 4,32 --metadata '{"get_bos_id":128000, "get_eos_ids":[128009, 128001]}' --output_prune_map <llama_guard pruned layers map.json> --output_name="llama_guard_3_1b_pruned_xnnpack.pte"
+```
+
+
+### For Llama 3.1 and Llama 2 models
 * You can download original model weights for Llama through Meta's official [website](https://llama.meta.com/), or via Hugging Face ([Llama 3.1 8B Instruct](https://huggingface.co/meta-llama/Meta-Llama-3.1-8B-Instruct))
 * For Llama 2 models, edit the params.json file: replace "vocab_size": -1 with "vocab_size": 32000. This is a short-term workaround.
 * Run `examples/models/llama2/install_requirements.sh` to install dependencies.
+* The Llama 3.1 and Llama 2 models (8B and 7B) can run on devices with 12GB+ RAM.
 * Export the Llama model and generate a .pte file
 
 ```
````
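The RAM guidance in the hunk above (8GB for the 1B BFloat16 model, 12GB+ for the 3B model) is consistent with a back-of-the-envelope weight-size estimate. The sketch below is ours, not part of the commit; it only counts weights, while real usage also needs KV cache, activations, and runtime overhead, which is why the device RAM requirements sit well above the raw weight sizes.

```python
# Rough weight-memory estimate: BFloat16 stores 2 bytes per parameter.
# Weights alone: the KV cache, activations, and runtime overhead come on
# top, so an 8GB device is recommended even though the 1B model's
# weights occupy only ~2GB.
def bf16_weight_gb(n_params: float) -> float:
    """Approximate weight size in GB for a model with n_params parameters."""
    bytes_per_param = 2  # BFloat16 = 16 bits
    return n_params * bytes_per_param / 1e9

print(bf16_weight_gb(1e9))  # 2.0 GB of weights for the 1B model
print(bf16_weight_gb(3e9))  # 6.0 GB of weights for the 3B model
```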
````diff
@@ -74,6 +107,7 @@ python -m extension.llm.tokenizer.tokenizer -t <tokenizer.model> -o tokenizer.bi
 ```
 * Convert the tokenizer for Llama 3 - rename `tokenizer.model` to `tokenizer.bin`.
 
+
 ### For LLaVA model
 * For the LLaVA 1.5 model, you can get it from Hugging Face [here](https://huggingface.co/llava-hf/llava-1.5-7b-hf).
 * Run `examples/models/llava/install_requirements.sh` to install dependencies.
````
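Both export commands in the diff above pass the same `--metadata` JSON string, which records the Llama 3 family's special-token ids for the runtime. As a small illustration (ours, using only the standard library), here is that payload parsed the way any JSON consumer would see it; the comments naming the tokens reflect the Llama 3 tokenizer's published special tokens:

```python
import json

# The --metadata argument used by the export commands above, verbatim.
metadata_arg = '{"get_bos_id":128000, "get_eos_ids":[128009, 128001]}'

metadata = json.loads(metadata_arg)
# Beginning-of-sequence token id (<|begin_of_text|> in Llama 3 tokenizers):
print(metadata["get_bos_id"])   # 128000
# Either id ends generation (<|eot_id|> and <|end_of_text|> respectively):
print(metadata["get_eos_ids"])  # [128009, 128001]
```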
