### Setup Llama Runner
Next we need to build and compile the Llama runner. This is similar to the requirements for running Llama with XNNPACK.
```
sh examples/models/llama2/install_requirements.sh
```

Convert tokenizer for Llama 2
```
python -m extension.llm.tokenizer.tokenizer -t tokenizer.model -o tokenizer.bin
```
Rename the tokenizer for Llama 3 with: `mv tokenizer.model tokenizer.bin`. We are updating the demo app to support the tokenizer in its original format directly.


### Export with SpinQuant (Llama 3 8B only)
# Building ExecuTorch Android Demo App for Llama/Llava running XNNPACK

**[UPDATE - 09/25]** We have added support for running [Llama 3.2 models](#for-llama-32-1b-and-3b-models) on the XNNPACK backend. We currently support inference on their original data type (BFloat16). We have also added instructions to run [Llama Guard 1B models](#for-llama-guard-1b-models) on-device.

This tutorial covers the end-to-end workflow for building an Android demo app that runs on-device on CPU via the XNNPACK framework.
More specifically, it covers:
1. Export and quantization of Llama and Llava models against the XNNPACK backend.
2. Building and linking the libraries required to run inference on-device for the Android platform.
3. Building the Android demo app itself.

## Prepare Models
In this demo app, we support text-only inference with up-to-date Llama models and image reasoning inference with LLaVA 1.5.

### For Llama 3.2 1B and 3B models
We support BFloat16 as a data type on the XNNPACK backend for Llama 3.2 1B/3B models.
* You can request and download model weights for Llama through Meta's official [website](https://llama.meta.com/).
* For chat use cases, download the instruct models instead of the pretrained ones.
* Run `examples/models/llama2/install_requirements.sh` to install dependencies.
* The 1B model in BFloat16 format can run on mobile devices with 8GB of RAM; the 3B model requires 12GB+ of RAM (see the sizing sketch at the end of this section).
* Export the Llama model and generate a .pte file as shown below:

```
python -m examples.models.llama2.export_llama --checkpoint <checkpoint.pth> --params <params.json> -kv -X -d bf16 --metadata '{"get_bos_id":128000, "get_eos_ids":[128009, 128001]}' --output_name="llama3_2.pte"
```

* Rename the tokenizer for Llama 3.2 with: `mv tokenizer.model tokenizer.bin`. We are updating the demo app to support the tokenizer in its original format directly.

For more details on using the Llama 3.2 lightweight models, including the prompt template, please see the official [website](https://www.llama.com/docs/model-cards-and-prompt-formats/llama3_2#-llama-3.2-lightweight-models-(1b/3b)-).
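To make the RAM guidance above concrete, here is a minimal back-of-envelope sketch, assuming roughly 2 bytes per parameter for BFloat16 and ignoring the KV cache, activations, and runtime overhead; the parameter counts are approximate, and this is not an official sizing tool:

```
# Rough BF16 memory estimate: ~2 bytes per parameter, ignoring the KV
# cache, activations, and runtime overhead (all assumptions).

def bf16_weight_gib(num_params: float) -> float:
    """Approximate weight storage in GiB for a BFloat16 model."""
    return num_params * 2 / (1024 ** 3)

# Approximate parameter counts for the Llama 3.2 lightweight models.
print(f"Llama 3.2 1B: ~{bf16_weight_gib(1.23e9):.1f} GiB of weights")  # ~2.3 GiB
print(f"Llama 3.2 3B: ~{bf16_weight_gib(3.21e9):.1f} GiB of weights")  # ~6.0 GiB
```

Roughly 2.3 GiB of weights leaves headroom on an 8GB device, while the ~6 GiB needed for the 3B model is why 12GB+ RAM is recommended.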


### For Llama Guard 1B models
To safeguard your application, you can use our Llama Guard models for prompt classification or response classification as mentioned [here](https://www.llama.com/docs/model-cards-and-prompt-formats/llama-guard-3/).
* Llama Guard 3-1B is a fine-tuned Llama-3.2-1B pretrained model for content safety classification. It is aligned to safeguard against the [MLCommons standardized hazards taxonomy](https://arxiv.org/abs/2404.12241).
* You can download the latest Llama Guard 1B INT4 model, which is already exported for ExecuTorch, using the instructions [here](https://github.com/meta-llama/PurpleLlama/tree/main/Llama-Guard3). This model is pruned and quantized to 4-bit weights using 8da4w mode, reducing its size to <450MB to optimize deployment on edge devices.
* You can use the same tokenizer from Llama 3.2.
* To try this model, choose LLAMA_GUARD_3 as the Model Type in the demo app below and try prompt classification for a given user prompt.
* We prepared this model using the following command:

```
python -m examples.models.llama2.export_llama --checkpoint <pruned llama guard 1b checkpoint.pth> --params <params.json> -d fp32 -kv --use_sdpa_with_kv_cache --quantization_mode 8da4w --group_size 256 --xnnpack --max_seq_length 8193 --embedding-quantize 4,32 --metadata '{"get_bos_id":128000, "get_eos_ids":[128009, 128001]}' --output_prune_map <llama_guard pruned layers map.json> --output_name="llama_guard_3_1b_pruned_xnnpack.pte"
```
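As a rough sanity check on the <450MB figure, here is a minimal sketch assuming ~0.5 bytes per weight for 4-bit (8da4w) quantization and ignoring group-wise quantization scales and the pruned layers; the parameter count is an approximation for the 1B base model:

```
# Rough 4-bit weight-size estimate (assumptions: ~0.5 bytes/param,
# ignoring quantization scales/zero-points and the pruned layers).

def int4_weight_mib(num_params: float) -> float:
    """Approximate 4-bit weight storage in MiB."""
    return num_params * 0.5 / (1024 ** 2)

print(f"~{int4_weight_mib(1.23e9):.0f} MiB of weights before pruning")  # ~587 MiB
```

Quantization alone lands near ~590 MiB here, which suggests the pruning step accounts for the remaining reduction to under 450MB.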


### For Llama 3.1 and Llama 2 models
* You can download the original model weights for Llama through Meta's official [website](https://llama.meta.com/).
* For Llama 2 models, edit the `params.json` file and replace `"vocab_size": -1` with `"vocab_size": 32000`. This is a short-term workaround; a scripted version of this edit is sketched at the end of this section.
* Run `examples/models/llama2/install_requirements.sh` to install dependencies.
* Export the Llama model and generate a .pte file.

* Convert tokenizer for Llama 2
```
python -m extension.llm.tokenizer.tokenizer -t tokenizer.model -o tokenizer.bin
```
* Rename the tokenizer for Llama 3.1 with: `mv tokenizer.model tokenizer.bin`. We are updating the demo app to support the tokenizer in its original format directly.
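As a convenience for the Llama 2 `params.json` workaround referenced above, here is a minimal scripted version of the edit; the file path is an assumption, so point it at the `params.json` next to your downloaded checkpoint:

```
# Patch Llama 2's params.json so the exporter sees a valid vocab_size.
import json

path = "params.json"  # assumed to sit next to the downloaded checkpoint

with open(path) as f:
    params = json.load(f)

# Llama 2 checkpoints ship with "vocab_size": -1; replace it with the
# actual Llama 2 tokenizer vocabulary size per the workaround above.
if params.get("vocab_size", -1) == -1:
    params["vocab_size"] = 32000

with open(path, "w") as f:
    json.dump(params, f, indent=2)
```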

### For LLaVA model
* For the LLaVA 1.5 model, you can get it from Hugging Face [here](https://huggingface.co/llava-hf/llava-1.5-7b-hf).