### Setup Llama Runner
Next we need to build and compile the Llama runner. This is similar to the requirements for running Llama with XNNPACK.
```
sh examples/models/llama2/install_requirements.sh
```

Convert tokenizer for Llama 2
```
python -m extension.llm.tokenizer.tokenizer -t tokenizer.model -o tokenizer.bin
```
Rename the tokenizer for Llama 3 with: `mv tokenizer.model tokenizer.bin`. We are updating the demo app to support the tokenizer in its original format directly.


### Export with SpinQuant (Llama 3 8B only)
# Building ExecuTorch Android Demo App for Llama/Llava running XNNPACK

**[UPDATE - 09/25]** We have added support for running [Llama 3.2 models](#for-llama-32-1b-and-3b-models) on the XNNPACK backend. We currently support inference on their original data type (BFloat16). We have also added instructions to run [Llama Guard 1B models](#for-llama-guard-1b-models) on-device.

This tutorial covers the end-to-end workflow for building an Android demo app that runs on-device on CPU via the XNNPACK framework.
More specifically, it covers:
1. Export and quantization of Llama and Llava models against the XNNPACK backend.
2. Building and linking the libraries required to run inference on-device for the Android platform.
3. Building the Android demo app itself.

## Prepare Models
In this demo app, we support text-only inference with up-to-date Llama models and image reasoning inference with LLaVA 1.5.

### For Llama 3.2 1B and 3B models
We support BFloat16 as a data type on the XNNPACK backend for Llama 3.2 1B/3B models.
* You can request and download model weights for Llama through Meta's official [website](https://llama.meta.com/).
* For chat use cases, download the instruct models instead of the pretrained ones.
* Run `examples/models/llama2/install_requirements.sh` to install dependencies.
* The 1B model in BFloat16 format can run on mobile devices with 8GB of RAM; the 3B model requires 12GB+ of RAM (see the sizing sketch at the end of this section).
* Export the Llama model and generate a .pte file as shown below:

```
python -m examples.models.llama2.export_llama --checkpoint <checkpoint.pth> --params <params.json> -kv -X -d bf16 --metadata '{"get_bos_id":128000, "get_eos_ids":[128009, 128001]}' --output_name="llama3_2.pte"
```

* Rename the tokenizer for Llama 3.2 with: `mv tokenizer.model tokenizer.bin`. We are updating the demo app to support the tokenizer in its original format directly.

For more details on using the Llama 3.2 lightweight models, including the prompt template, please see the official [website](https://www.llama.com/docs/model-cards-and-prompt-formats/llama3_2#-llama-3.2-lightweight-models-(1b/3b)-).
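To make the RAM guidance above concrete, here is a minimal back-of-envelope sketch, assuming roughly 2 bytes per parameter for BFloat16 and ignoring the KV cache, activations, and runtime overhead; the parameter counts are approximate, and this is not an official sizing tool:

```
# Rough BF16 memory estimate: ~2 bytes per parameter, ignoring the KV
# cache, activations, and runtime overhead (all assumptions).

def bf16_weight_gib(num_params: float) -> float:
    """Approximate weight storage in GiB for a BFloat16 model."""
    return num_params * 2 / (1024 ** 3)

# Approximate parameter counts for the Llama 3.2 lightweight models.
print(f"Llama 3.2 1B: ~{bf16_weight_gib(1.23e9):.1f} GiB of weights")  # ~2.3 GiB
print(f"Llama 3.2 3B: ~{bf16_weight_gib(3.21e9):.1f} GiB of weights")  # ~6.0 GiB
```

Roughly 2.3 GiB of weights leaves headroom on an 8GB device, while the ~6 GiB needed for the 3B model is why 12GB+ RAM is recommended.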


### For Llama Guard 1B models
To safeguard your application, you can use our Llama Guard models for prompt classification or response classification as mentioned [here](https://www.llama.com/docs/model-cards-and-prompt-formats/llama-guard-3/).
* Llama Guard 3-1B is a fine-tuned Llama-3.2-1B pretrained model for content safety classification. It is aligned to safeguard against the [MLCommons standardized hazards taxonomy](https://arxiv.org/abs/2404.12241).
* You can download the latest Llama Guard 1B INT4 model, which is already exported for ExecuTorch, using the instructions [here](https://github.com/meta-llama/PurpleLlama/tree/main/Llama-Guard3). This model is pruned and quantized to 4-bit weights using 8da4w mode, reducing its size to <450MB to optimize deployment on edge devices.
* You can use the same tokenizer from Llama 3.2.
* To try this model, choose LLAMA_GUARD_3 as the Model Type in the demo app below and try prompt classification for a given user prompt.
* We prepared this model using the following command:

```
python -m examples.models.llama2.export_llama --checkpoint <pruned llama guard 1b checkpoint.pth> --params <params.json> -d fp32 -kv --use_sdpa_with_kv_cache --quantization_mode 8da4w --group_size 256 --xnnpack --max_seq_length 8193 --embedding-quantize 4,32 --metadata '{"get_bos_id":128000, "get_eos_ids":[128009, 128001]}' --output_prune_map <llama_guard pruned layers map.json> --output_name="llama_guard_3_1b_pruned_xnnpack.pte"
```
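As a rough sanity check on the <450MB figure, here is a minimal sketch assuming ~0.5 bytes per weight for 4-bit (8da4w) quantization and ignoring group-wise quantization scales and the pruned layers; the parameter count is an approximation for the 1B base model:

```
# Rough 4-bit weight-size estimate (assumptions: ~0.5 bytes/param,
# ignoring quantization scales/zero-points and the pruned layers).

def int4_weight_mib(num_params: float) -> float:
    """Approximate 4-bit weight storage in MiB."""
    return num_params * 0.5 / (1024 ** 2)

print(f"~{int4_weight_mib(1.23e9):.0f} MiB of weights before pruning")  # ~587 MiB
```

Quantization alone lands near ~590 MiB here, which suggests the pruning step accounts for the remaining reduction to under 450MB.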


### For Llama 3.1 and Llama 2 models
* You can download the original model weights for Llama through Meta's official [website](https://llama.meta.com/).
* For Llama 2 models, edit the `params.json` file and replace `"vocab_size": -1` with `"vocab_size": 32000`. This is a short-term workaround; a scripted version of this edit is sketched at the end of this section.
* Run `examples/models/llama2/install_requirements.sh` to install dependencies.
* Export the Llama model and generate a .pte file.

* Convert tokenizer for Llama 2
```
python -m extension.llm.tokenizer.tokenizer -t tokenizer.model -o tokenizer.bin
```
* Rename the tokenizer for Llama 3.1 with: `mv tokenizer.model tokenizer.bin`. We are updating the demo app to support the tokenizer in its original format directly.
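As a convenience for the Llama 2 `params.json` workaround referenced above, here is a minimal scripted version of the edit; the file path is an assumption, so point it at the `params.json` next to your downloaded checkpoint:

```
# Patch Llama 2's params.json so the exporter sees a valid vocab_size.
import json

path = "params.json"  # assumed to sit next to the downloaded checkpoint

with open(path) as f:
    params = json.load(f)

# Llama 2 checkpoints ship with "vocab_size": -1; replace it with the
# actual Llama 2 tokenizer vocabulary size per the workaround above.
if params.get("vocab_size", -1) == -1:
    params["vocab_size"] = 32000

with open(path, "w") as f:
    json.dump(params, f, indent=2)
```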

### For LLaVA model
* For the LLaVA 1.5 model, you can get it from Hugging Face [here](https://huggingface.co/llava-hf/llava-1.5-7b-hf).