---
title: Understand LiteRT, XNNPACK, KleidiAI and SME2
weight: 2

### FIXED, DO NOT MODIFY
layout: learningpathall
---

## LiteRT, XNNPACK, KleidiAI and SME2

LiteRT (short for Lite Runtime), formerly known as TensorFlow Lite, is a runtime for on-device AI.
The default CPU acceleration library used by LiteRT is XNNPACK.

XNNPACK is an open-source library that provides highly optimized implementations of neural-network operators. It continuously integrates the KleidiAI library to take advantage of new CPU features such as SME2.

KleidiAI is a library developed by Arm that offers performance-critical micro-kernels leveraging Arm architecture features, such as SME2.

Both XNNPACK and KleidiAI are external dependencies of LiteRT. LiteRT specifies the versions of these libraries to use.
When LiteRT is built with both XNNPACK and KleidiAI enabled, XNNPACK invokes KleidiAI’s micro-kernels at runtime to accelerate operators with supported data types; otherwise, it falls back to its own implementation.

The software stack for LiteRT is as follows.

![LiteRT, XNNPACK, KleidiAI and SME2#center](./litert-sw-stack.png "LiteRT, XNNPACK, KleidiAI and SME2")


## Understand how KleidiAI works in LiteRT

To understand how the KleidiAI SME2 micro-kernels work in LiteRT, a LiteRT model containing a single Fully Connected operator with the FP32 data type is used as an example.

The following sections compare the execution workflow of XNNPACK's default implementation with the workflow when KleidiAI SME2 is enabled in XNNPACK.

### LiteRT → XNNPACK workflow

![LiteRT, XNNPACK workflow#center](./litert-xnnpack-workflow.png "LiteRT, XNNPACK workflow")

A Fully Connected operator is essentially implemented as a matrix multiplication.
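
As a quick illustration, the NumPy sketch below computes a Fully Connected layer as a plain matrix multiplication plus a bias. The shapes match the single-operator example model created later in this Learning Path; the values are random and only serve the illustration.

``` python
import numpy as np

batch_size, input_size, output_size = 100, 640, 1280

x = np.random.rand(batch_size, input_size).astype(np.float32)   # activations (LHS)
w = np.random.rand(output_size, input_size).astype(np.float32)  # weights (RHS)
b = np.zeros(output_size, dtype=np.float32)                     # bias

# Fully Connected: y = x * w^T + b
y = x @ w.T + b
print(y.shape)   # (100, 1280)
```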

When LiteRT loads a model, it parses the operators and creates a computation graph. If the CPU is selected as the accelerator, LiteRT uses XNNPACK by default.

XNNPACK traverses the operators in the graph and tries to replace them with its own implementations. During this stage, XNNPACK performs the necessary packing of the weight matrix. To speed up packing, XNNPACK uses NEON instructions on Arm platforms. XNNPACK provides different implementations for different hardware platforms; at runtime, it detects the hardware capabilities and selects the appropriate micro-kernel.

During model inference, XNNPACK performs matrix multiplication on the activation matrix (the left-hand side matrix, LHS) and the repacked weight matrix (the right-hand side matrix, RHS). In this stage, XNNPACK applies tiling strategies to the matrices and performs parallel multiplication across the resulting tiles using multiple threads. To accelerate the computation, XNNPACK uses NEON instructions.
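
The sketch below is a NumPy-only illustration of the tiling idea, not XNNPACK's actual implementation: the output matrix is divided into tiles, and each tile can be computed independently, which is what allows the work to be spread across threads and mapped onto NEON or SME2 micro-kernels. The tile sizes are arbitrary values chosen for the example.

``` python
import numpy as np

M, K, N = 100, 640, 1280          # LHS is M x K, RHS is K x N
TILE_M, TILE_N = 8, 32            # illustrative tile sizes, not XNNPACK's

lhs = np.random.rand(M, K).astype(np.float32)   # activations
rhs = np.random.rand(K, N).astype(np.float32)   # repacked weights
out = np.zeros((M, N), dtype=np.float32)

# Each (row tile, column tile) of the output can be computed independently,
# so the iterations of this loop nest can run in parallel on different threads.
for m0 in range(0, M, TILE_M):
    for n0 in range(0, N, TILE_N):
        out[m0:m0 + TILE_M, n0:n0 + TILE_N] = (
            lhs[m0:m0 + TILE_M, :] @ rhs[:, n0:n0 + TILE_N]
        )

assert np.allclose(out, lhs @ rhs, rtol=1e-4)
```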


### LiteRT → XNNPACK → KleidiAI workflow

![LiteRT, XNNPACK, KleidiAI workflow#center](./litert-xnnpack-kleidiai-workflow.png "LiteRT, XNNPACK, KleidiAI workflow")

When KleidiAI and SME2 are enabled at build time, the KleidiAI SME2 micro-kernels are compiled into XNNPACK.

During the model loading stage, when XNNPACK optimizes the subgraph, it checks each operator's data type to determine whether a KleidiAI implementation is available. If KleidiAI supports it, XNNPACK bypasses its own default implementation. As a result, RHS packing is performed using the KleidiAI SME packing micro-kernel. In addition, because KleidiAI typically requires the LHS to be packed as well, a flag is set during this stage.

During model inference, the LHS packing micro-kernel is invoked. After the LHS is packed, XNNPACK performs the matrix multiplication. At this point, the KleidiAI SME micro-kernel is used to compute the matrix product.
---
title: Build the LiteRT benchmark tool
weight: 3

### FIXED, DO NOT MODIFY
layout: learningpathall
---

### Build the LiteRT benchmark tool with KleidiAI and SME2 enabled

LiteRT provides a tool called `benchmark_model` for evaluating the performance of LiteRT models. Use the following steps to build the LiteRT benchmark tool.

First, clone the LiteRT repository.

``` bash
cd $WORKSPACE
git clone https://github.com/google-ai-edge/LiteRT.git
```

Then, set up the build environment using Docker on your Linux development machine.

``` bash
wget https://raw.githubusercontent.com/tensorflow/tensorflow/master/tensorflow/lite/tools/tflite-android.Dockerfile
docker build . -t tflite-builder -f tflite-android.Dockerfile
```

Start the container, then run the `sdkmanager` command inside it to download the Android tools and libraries required to build LiteRT for Android.

``` bash
docker run -it -v $PWD:/host_dir tflite-builder bash
sdkmanager \
"build-tools;${ANDROID_BUILD_TOOLS_VERSION}" \
"platform-tools" \
"platforms;android-${ANDROID_API_LEVEL}"
```

Inside the LiteRT source directory, run the configuration script to set up the Bazel parameters.

``` bash
cd /host_dir/LiteRT
./configure
```

You can keep all options at their default values except for:

`Would you like to interactively configure ./WORKSPACE for Android builds? [y/N]`

Type `y`; the script then automatically detects the Android SDK components installed by the `sdkmanager` command and configures them accordingly.

Now, you can build the benchmark tool with the following commands.

``` bash
export BENCHMARK_TOOL_PATH="litert/tools:benchmark_model"
export XNNPACK_OPTIONS="--define tflite_with_xnnpack=true \
--define=tflite_with_xnnpack_qs8=true \
--define=tflite_with_xnnpack_qu8=true \
--define=tflite_with_xnnpack_dynamic_fully_connected=true \
--define=xnn_enable_arm_sme=true \
--define=xnn_enable_arm_sme2=true \
--define=xnn_enable_kleidiai=true"

bazel build -c opt --config=android_arm64 \
${XNNPACK_OPTIONS} "${BENCHMARK_TOOL_PATH}" \
--repo_env=HERMETIC_PYTHON_VERSION=3.12
```

This build enables the KleidiAI SME2 micro-kernels that are integrated into XNNPACK.


### Build the LiteRT benchmark tool without KleidiAI

To compare the performance of the KleidiAI SME2 implementation against XNNPACK's original implementation, you can build another version of the LiteRT benchmark tool with KleidiAI and SME2 disabled.

``` bash
export BENCHMARK_TOOL_PATH="litert/tools:benchmark_model"
export XNNPACK_OPTIONS="--define tflite_with_xnnpack=true \
--define=tflite_with_xnnpack_qs8=true \
--define=tflite_with_xnnpack_qu8=true \
--define=tflite_with_xnnpack_dynamic_fully_connected=true \
--define=xnn_enable_arm_sme=false \
--define=xnn_enable_arm_sme2=false \
--define=xnn_enable_kleidiai=false"

bazel build -c opt --config=android_arm64 \
${XNNPACK_OPTIONS} "${BENCHMARK_TOOL_PATH}" \
--repo_env=HERMETIC_PYTHON_VERSION=3.12
```

The path to the compiled benchmark tool binary will be displayed in the build output.
You can then use ADB to push the benchmark tool to your Android device.
---
title: Create LiteRT models
weight: 3

### FIXED, DO NOT MODIFY
layout: learningpathall
---

### KleidiAI SME2 support in LiteRT

Only a subset of the KleidiAI SME and SME2 micro-kernels has been integrated into XNNPACK.
These micro-kernels support operators using the following data types and quantization configurations in a LiteRT model; a short sketch for inspecting a model's tensor data types follows the tables below.
Operators with other configurations use XNNPACK's default implementations during inference.

* Fully connected

| Activations | Weights | Output |
| ---------------------------- | --------------------------------------- | ---------------------------- |
| FP32 | FP32 | FP32 |
| FP32 | FP16 | FP32 |
| FP32 | Per-channel symmetric INT8 quantization | FP32 |
| Asymmetric INT8 quantization | Per-channel symmetric INT8 quantization | Asymmetric INT8 quantization |
| FP32 | Per-channel symmetric INT4 quantization | FP32 |

* Batch Matrix Multiply

| Input A | Input B |
| ------- | --------------------------------------- |
| FP32 | FP32 |
| FP16 | FP16 |
| FP32 | Per-channel symmetric INT8 quantization |


* Conv2D

| Activations | Weights | Output |
| ---------------------------- | ----------------------------------------------------- | ---------------------------- |
| FP32                         | FP32, pointwise (kernel size is 1)                     | FP32                         |
| FP32                         | FP16, pointwise (kernel size is 1)                     | FP32                         |
| FP32 | Per-channel or per-tensor symmetric INT8 quantization | FP32 |
| Asymmetric INT8 quantization | Per-channel or per-tensor symmetric INT8 quantization | Asymmetric INT8 quantization |


* TransposeConv

| Activations | Weights | Output |
| ---------------------------- | ----------------------------------------------------- | ---------------------------- |
| Asymmetric INT8 quantization | Per-channel or per-tensor symmetric INT8 quantization | Asymmetric INT8 quantization |
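
To check which of these configurations a particular model uses, you can list its tensor data types and quantization parameters with the `tf.lite.Interpreter` API. The sketch below assumes a model file named `fc_fp32.tflite`, such as the one created in the next section.

``` python
import tensorflow as tf

# List the tensors of a LiteRT model to see which data types and
# quantization parameters its operators use.
interpreter = tf.lite.Interpreter(model_path="fc_fp32.tflite")
interpreter.allocate_tensors()

for t in interpreter.get_tensor_details():
    scales = t["quantization_parameters"]["scales"]
    print(f'{t["name"]}: dtype={t["dtype"].__name__}, '
          f'quantization scales={len(scales)}')
```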


### Create LiteRT models with Keras

To evaluate the performance of SME2 acceleration per operator, the following script is provided as an example. It uses Keras to create a simple model containing only a single Fully Connected operator and converts it into a LiteRT model.

``` python
import tensorflow as tf
import numpy as np
import os

batch_size = 100
input_size = 640
output_size = 1280

def save_litert_model(model_bytes, filename):
    if os.path.exists(filename):
        print(f"Warning: {filename} already exists and will be overwritten.")
    with open(filename, "wb") as f:
        f.write(model_bytes)

model = tf.keras.Sequential([
    tf.keras.layers.InputLayer(input_shape=(input_size,), batch_size=batch_size),
    tf.keras.layers.Dense(output_size)
])

# Convert to FP32 model
converter = tf.lite.TFLiteConverter.from_keras_model(model)
fc_fp32 = converter.convert()
save_litert_model(fc_fp32, "fc_fp32.tflite")
```

The model above is created in FP32 format. As mentioned in the previous section, this operator can invoke the KleidiAI SME2 micro-kernel for acceleration.
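
Optionally, you can sanity-check the converted FP32 model on your development machine by appending the lines below to the script. This only verifies that the model loads and runs; it does not exercise the SME2 path, which is measured later with the benchmark tool on an Android device.

``` python
# Optional: run the FP32 model once to confirm that it loads and executes
interpreter = tf.lite.Interpreter(model_content=fc_fp32)
interpreter.allocate_tensors()

inp = interpreter.get_input_details()[0]
out = interpreter.get_output_details()[0]

x = np.random.rand(*inp["shape"]).astype(np.float32)
interpreter.set_tensor(inp["index"], x)
interpreter.invoke()
print(interpreter.get_tensor(out["index"]).shape)   # (100, 1280)
```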

You can also optimize this Keras model using post-training quantization to create a LiteRT model that suits your requirements.

* Post-training FP16 quantization
``` python
# Convert to model with FP16 weights and FP32 activations
converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.target_spec.supported_types = [tf.float16]
converter.target_spec._experimental_supported_accumulation_type = tf.dtypes.float16
fc_fp16 = converter.convert()
save_litert_model(fc_fp16, "fc_fp16.tflite")
```

This method applies FP16 quantization to a model with FP32 operators. In practice, this optimization adds metadata to the model indicating that it is compatible with FP16 inference. With this hint, XNNPACK replaces the FP32 operators with their FP16 equivalents at runtime. It also inserts additional operators that convert the model inputs from FP32 to FP16 and convert the model outputs from FP16 back to FP32.

KleidiAI provides FP16 packing micro-kernels for both the activation and weight matrices, as well as FP16 matrix multiplication micro-kernels.

* Post-training INT8 dynamic range quantization
``` python
# Convert to Dynamically Quantized INT8 model (INT8 weights, FP32 activations)
converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
fc_int8_dynamic = converter.convert()
save_litert_model(fc_int8_dynamic, "fc_dynamic_int8.tflite")
```

This quantization method optimizes operators with large parameter sizes by quantizing their weights to INT8 while keeping the activations in the FP32 data format.

KleidiAI provides micro-kernels that dynamically quantize the activations to INT8 at runtime. KleidiAI also provides packing micro-kernels for the weight matrix, as well as INT8 matrix multiplication micro-kernels that produce FP32 outputs.
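
For illustration, the NumPy sketch below shows the general idea behind dynamic quantization using a simplified symmetric per-row scheme (the exact scheme used by the micro-kernels may differ): activation scales are computed from the runtime values, the INT8 product is accumulated in INT32, and the result is rescaled back to FP32.

``` python
import numpy as np

x = np.random.randn(4, 8).astype(np.float32)    # FP32 activations (LHS)
w = np.random.randn(16, 8).astype(np.float32)   # FP32 weights (RHS)

# Offline: symmetric per-channel INT8 quantization of the weights
w_scale = np.abs(w).max(axis=1, keepdims=True) / 127.0
w_q = np.clip(np.round(w / w_scale), -127, 127).astype(np.int8)

# At runtime: symmetric per-row INT8 quantization of the activations
x_scale = np.abs(x).max(axis=1, keepdims=True) / 127.0
x_q = np.clip(np.round(x / x_scale), -127, 127).astype(np.int8)

# INT8 matmul with INT32 accumulation, then rescale back to FP32
acc = x_q.astype(np.int32) @ w_q.T.astype(np.int32)
y = acc.astype(np.float32) * x_scale * w_scale.T

print(np.max(np.abs(y - x @ w.T)))   # small quantization error
```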


* Post-training INT8 static quantization
``` python
def fake_dataset():
    for _ in range(100):
        sample = np.random.rand(input_size).astype(np.float32)
        yield [sample]

# Convert to Statically Quantized INT8 model (INT8 weights and activations)
converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
converter.target_spec.supported_types = [tf.int8]
converter.inference_input_type = tf.int8
converter.inference_output_type = tf.int8
converter.representative_dataset = fake_dataset
fc_int8_static = converter.convert()
save_litert_model(fc_int8_static, "fc_static_int8.tflite")
```

This quantization method quantizes both the activations and the weights to INT8.

KleidiAI provides INT8 packing micro-kernels for both the activation and weight matrices, as well as INT8 matrix multiplication micro-kernels.
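
Because both the inputs and outputs of this model are INT8, running it directly requires quantizing the inputs and dequantizing the outputs yourself. A minimal sketch using the standard interpreter API is shown below; it assumes the script above has already produced `fc_static_int8.tflite`.

``` python
import numpy as np
import tensorflow as tf

interpreter = tf.lite.Interpreter(model_path="fc_static_int8.tflite")
interpreter.allocate_tensors()
inp = interpreter.get_input_details()[0]
out = interpreter.get_output_details()[0]

# Quantize FP32 inputs to INT8 using the scale/zero point stored in the model
scale, zero_point = inp["quantization"]
x = np.random.rand(*inp["shape"]).astype(np.float32)
x_q = np.clip(np.round(x / scale + zero_point), -128, 127).astype(np.int8)

interpreter.set_tensor(inp["index"], x_q)
interpreter.invoke()

# Dequantize the INT8 outputs back to FP32
y_q = interpreter.get_tensor(out["index"])
out_scale, out_zero_point = out["quantization"]
y = (y_q.astype(np.float32) - out_zero_point) * out_scale
print(y.shape)
```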