Commit bb86667 (parent 12dd6e1)
pareenaverma authored and committed
Updates to Executorch 0.7 Kleidi LP

3 files changed: +22 additions, -15 deletions
content/learning-paths/mobile-graphics-and-gaming/build-llama3-chat-android-app-using-executorch-and-xnnpack/2-executorch-setup.md

Lines changed: 5 additions & 5 deletions

@@ -15,8 +15,8 @@ The best practice is to generate an isolated Python environment in which to inst
 ### Option 1: Create a Python virtual environment
 
 ```bash
-python3.10 -m venv executorch
-source executorch/bin/activate
+python3.10 -m venv executorch-venv
+source executorch-venv/bin/activate
 ```
 
 The prompt of your terminal has `executorch` as a prefix to indicate the virtual environment is active.

@@ -28,8 +28,8 @@ Install Miniconda on your development machine by following the [Installing conda
 Once `conda` is installed, create the environment:
 
 ```bash
-conda create -yn executorch python=3.10.0
-conda activate executorch
+conda create -yn executorch-venv python=3.10.0
+conda activate executorch-venv
 ```
 
 ### Clone ExecuTorch and install the required dependencies

@@ -40,7 +40,7 @@ From within the conda environment, run the commands below to download the ExecuT
 git clone https://github.com/pytorch/executorch.git
 cd executorch
 git submodule sync
-git submodule update --init
+git submodule update --init --recursive
 ./install_executorch.sh
 ./examples/models/llama/install_requirements.sh
 ```
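For reference, the renamed environment from the first hunk can be exercised end to end. A minimal sketch, using plain `python3` as a stand-in for the `python3.10` the learning path pins:

```shell
# Create and activate the renamed virtual environment
# (the learning path pins python3.10; plain python3 is a stand-in here).
python3 -m venv executorch-venv
source executorch-venv/bin/activate

# Confirm the active interpreter now resolves inside the new environment.
python -c 'import sys; print(sys.prefix.endswith("executorch-venv"))'
# prints: True
```

After activation, the shell prompt carries `(executorch-venv)` as a prefix, which is how the page tells you the environment is active.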

content/learning-paths/mobile-graphics-and-gaming/build-llama3-chat-android-app-using-executorch-and-xnnpack/4-prepare-llama-models.md

Lines changed: 2 additions & 1 deletion

@@ -46,7 +46,8 @@ python3 -m examples.models.llama.export_llama \
 --metadata '{"get_bos_id":128000, "get_eos_ids":[128009, 128001, 128006, 128007]}' \
 --embedding-quantize 4,32 \
 --output_name="llama3_1B_kv_sdpa_xnn_qe_4_64_1024_embedding_4bit.pte" \
---max_seq_length 1024
+--max_seq_length 1024 \
+--max_context_length 1024
 ```
 
 Due to the larger vocabulary size of Llama 3, you should quantize the embeddings with `--embedding-quantize 4,32` to further reduce the model size.

content/learning-paths/mobile-graphics-and-gaming/build-llama3-chat-android-app-using-executorch-and-xnnpack/5-run-benchmark-on-android.md

Lines changed: 15 additions & 9 deletions
@@ -38,18 +38,23 @@ cmake -DCMAKE_TOOLCHAIN_FILE=$ANDROID_NDK/build/cmake/android.toolchain.cmake \
 -DEXECUTORCH_BUILD_EXTENSION_DATA_LOADER=ON \
 -DEXECUTORCH_BUILD_EXTENSION_MODULE=ON \
 -DEXECUTORCH_BUILD_EXTENSION_TENSOR=ON \
+-DEXECUTORCH_BUILD_EXTENSION_FLAT_TENSOR=ON \
 -DEXECUTORCH_BUILD_XNNPACK=ON \
 -DEXECUTORCH_BUILD_KERNELS_OPTIMIZED=ON \
 -DEXECUTORCH_BUILD_KERNELS_QUANTIZED=ON \
 -DEXECUTORCH_BUILD_KERNELS_CUSTOM=ON \
+-DEXECUTORCH_BUILD_KERNELS_LLM=ON \
+-DEXECUTORCH_BUILD_EXTENSION_LLM_RUNNER=ON \
+-DEXECUTORCH_BUILD_EXTENSION_RUNNER_UTIL=ON \
 -DEXECUTORCH_XNNPACK_ENABLE_KLEIDI=ON \
 -DXNNPACK_ENABLE_ARM_BF16=OFF \
+-DBUILD_TESTING=OFF \
 -Bcmake-out-android .
 
 cmake --build cmake-out-android -j7 --target install --config Release
 ```
 {{% notice Note %}}
-Make sure you add -DEXECUTORCH_XNNPACK_ENABLE_KLEIDI=ON option to enable support for KleidiAI kernels in ExecuTorch with XNNPack.
+Starting with ExecuTorch version 0.7 beta, KleidiAI is enabled by default: the -DEXECUTORCH_XNNPACK_ENABLE_KLEIDI=ON option is set by default, so KleidiAI kernels are used in ExecuTorch with XNNPACK without further configuration.
 {{% /notice %}}
 
 ### 3. Build Llama runner for Android
@@ -67,7 +72,8 @@ cmake -DCMAKE_TOOLCHAIN_FILE=$ANDROID_NDK/build/cmake/android.toolchain.cmake \
 -DEXECUTORCH_BUILD_KERNELS_OPTIMIZED=ON \
 -DEXECUTORCH_BUILD_KERNELS_QUANTIZED=ON \
 -DEXECUTORCH_BUILD_KERNELS_CUSTOM=ON \
--DEXECUTORCH_USE_TIKTOKEN=ON \
+-DSUPPORT_REGEX_LOOKAHEAD=ON \
+-DBUILD_TESTING=OFF \
 -Bcmake-out-android/examples/models/llama \
 examples/models/llama
 
@@ -144,13 +150,13 @@ Reached to the end of generation
 
 I 00:00:05.399314 executorch:runner.cpp:257] RSS after finishing text generation: 1269.445312 MiB (0 if unsupported)
 PyTorchObserver {"prompt_tokens":54,"generated_tokens":51,"model_load_start_ms":1710296339487,"model_load_end_ms":1710296343047,"inference_start_ms":1710296343370,"inference_end_ms":1710296344877,"prompt_eval_end_ms":1710296343556,"first_token_ms":1710296343556,"aggregate_sampling_time_ms":49,"SCALING_FACTOR_UNITS_PER_SECOND":1000}
-I 00:00:05.399342 executorch:stats.h:111] Prompt Tokens: 54 Generated Tokens: 51
-I 00:00:05.399344 executorch:stats.h:117] Model Load Time: 3.560000 (seconds)
-I 00:00:05.399346 executorch:stats.h:127] Total inference time: 1.507000 (seconds) Rate: 33.842070 (tokens/second)
-I 00:00:05.399348 executorch:stats.h:135] Prompt evaluation: 0.186000 (seconds) Rate: 290.322581 (tokens/second)
-I 00:00:05.399350 executorch:stats.h:146] Generated 51 tokens: 1.321000 (seconds) Rate: 38.607116 (tokens/second)
-I 00:00:05.399352 executorch:stats.h:154] Time to first generated token: 0.186000 (seconds)
-I 00:00:05.399354 executorch:stats.h:161] Sampling time over 105 tokens: 0.049000 (seconds)
+I 00:00:04.530945 executorch:stats.h:108] Prompt Tokens: 54 Generated Tokens: 69
+I 00:00:04.530947 executorch:stats.h:114] Model Load Time: 1.196000 (seconds)
+I 00:00:04.530949 executorch:stats.h:124] Total inference time: 1.934000 (seconds) Rate: 35.677353 (tokens/second)
+I 00:00:04.530952 executorch:stats.h:132] Prompt evaluation: 0.176000 (seconds) Rate: 306.818182 (tokens/second)
+I 00:00:04.530954 executorch:stats.h:143] Generated 69 tokens: 1.758000 (seconds) Rate: 39.249147 (tokens/second)
+I 00:00:04.530956 executorch:stats.h:151] Time to first generated token: 0.176000 (seconds)
+I 00:00:04.530959 executorch:stats.h:158] Sampling time over 123 tokens: 0.067000 (seconds)
 ```
 
 You have successfully run the Llama 3.1 1B Instruct model on your Android smartphone with ExecuTorch using KleidiAI kernels.
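The throughput figures in the updated log are plain ratios of token counts to elapsed seconds, so they are easy to sanity-check; for example, the decode rate follows from the 69 generated tokens and 1.758 seconds reported above:

```shell
# Recompute the generation rate reported by stats.h from the raw log values.
awk 'BEGIN { printf "%.6f tokens/second\n", 69 / 1.758 }'
# prints: 39.249147 tokens/second
```

The same division reproduces the prompt-evaluation rate (54 tokens / 0.176 s = 306.818182) and the total rate (69 / 1.934 = 35.677353).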
