diff --git a/content/learning-paths/mobile-graphics-and-gaming/litert-sme/1-litert-kleidiai-sme2.md b/content/learning-paths/mobile-graphics-and-gaming/litert-sme/1-litert-kleidiai-sme2.md
new file mode 100644
index 000000000..c8865b2ae
--- /dev/null
+++ b/content/learning-paths/mobile-graphics-and-gaming/litert-sme/1-litert-kleidiai-sme2.md
@@ -0,0 +1,53 @@
+---
+title: Understand LiteRT, XNNPACK, KleidiAI and SME2
+weight: 2
+
+### FIXED, DO NOT MODIFY
+layout: learningpathall
+---
+
+## LiteRT, XNNPACK, KleidiAI and SME2
+
+LiteRT (short for Lite Runtime), formerly known as TensorFlow Lite, is a runtime for on-device AI.
+The default CPU acceleration library used by LiteRT is XNNPACK.
+
+XNNPACK is an open-source library that provides highly optimized implementations of neural-network operators. It continuously integrates the KleidiAI library to leverage new CPU features such as SME2.
+
+KleidiAI is a library developed by Arm that provides performance-critical micro-kernels built on Arm architecture features such as SME2.
+
+Both XNNPACK and KleidiAI are external dependencies of LiteRT, and LiteRT specifies the versions of these libraries to use.
+When LiteRT is built with both XNNPACK and KleidiAI enabled, XNNPACK invokes KleidiAI’s micro-kernels at runtime to accelerate operators with supported data types; otherwise, it falls back to its own implementations.
+
+The software stack for LiteRT is shown below.
+
+![LiteRT, XNNPACK, KleidiAI and SME2#center](./litert-sw-stack.png "LiteRT, XNNPACK, KleidiAI and SME2")
+
+
+## Understand how KleidiAI works in LiteRT
+
+To understand how the KleidiAI SME2 micro-kernels work in LiteRT, this section uses a LiteRT model containing a single Fully Connected operator with the FP32 data type as an example.
+
+The following compares the execution workflow of XNNPACK’s own implementation with the workflow when KleidiAI SME2 is enabled in XNNPACK.
+
+### LiteRT → XNNPACK workflow
+
+![LiteRT, XNNPACK workflow#center](./litert-xnnpack-workflow.png "LiteRT, XNNPACK workflow")
+
+A Fully Connected operator is essentially a matrix multiplication.
+
+When LiteRT loads a model, it parses the operators and creates a computation graph. If the CPU is selected as the accelerator, LiteRT uses XNNPACK by default.
+
+XNNPACK traverses the operators in the graph and tries to replace them with its own implementations. During this stage, XNNPACK performs the necessary packing of the weight matrix. To speed up packing, XNNPACK uses NEON instructions on Arm platforms. XNNPACK provides different implementations for different hardware platforms; at runtime, it detects the hardware capabilities and selects the appropriate micro-kernel.
+
+During model inference, XNNPACK performs matrix multiplication on the activation matrix (the left-hand side matrix, LHS) and the repacked weight matrix (the right-hand side matrix, RHS). In this stage, XNNPACK applies tiling strategies to the matrices and performs the multiplication across the resulting tiles in parallel using multiple threads. To accelerate the computation, XNNPACK uses NEON instructions.
+
+
+### LiteRT → XNNPACK → KleidiAI workflow
+
+![LiteRT, XNNPACK, KleidiAI workflow#center](./litert-xnnpack-kleidiai-workflow.png "LiteRT, XNNPACK, KleidiAI workflow")
+
+When KleidiAI and SME2 are enabled at build time, the KleidiAI SME2 micro-kernels are compiled into XNNPACK.
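+
+Before stepping through the loading and inference stages, it helps to make the LHS, RHS, and packing terminology concrete. The following is a minimal NumPy sketch of the matrix multiplication that a Fully Connected operator performs; it is purely illustrative, does not call the XNNPACK or KleidiAI APIs, and uses the same shapes as the example model created later in this Learning Path.
+
+``` python
+import numpy as np
+
+batch_size, input_size, output_size = 100, 640, 1280
+
+# LHS: the activation matrix, one row per input sample
+lhs = np.random.rand(batch_size, input_size).astype(np.float32)
+
+# RHS: the weight matrix of the Fully Connected operator, plus a bias vector.
+# This is the matrix that XNNPACK (or KleidiAI) repacks into a
+# hardware-friendly layout before inference starts.
+rhs = np.random.rand(input_size, output_size).astype(np.float32)
+bias = np.random.rand(output_size).astype(np.float32)
+
+# The Fully Connected operator computes LHS x RHS + bias. At inference time,
+# XNNPACK tiles this multiplication and runs the tiles in parallel.
+output = lhs @ rhs + bias
+print(output.shape)  # (100, 1280)
+```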
+
+During the model loading stage, when XNNPACK optimizes the subgraph, it checks the operator’s data type to determine whether a KleidiAI implementation is available. If KleidiAI supports it, XNNPACK bypasses its own default implementation. As a result, RHS packing is performed using the KleidiAI SME packing micro-kernel. In addition, because KleidiAI typically requires packing of the LHS, a flag is also set during this stage.
+
+During model inference, the LHS packing micro-kernel is invoked. After the LHS is packed, XNNPACK performs the matrix multiplication. At this point, the KleidiAI SME micro-kernel is used to compute the matrix product.
\ No newline at end of file
diff --git a/content/learning-paths/mobile-graphics-and-gaming/litert-sme/2-build-tool.md b/content/learning-paths/mobile-graphics-and-gaming/litert-sme/2-build-tool.md
new file mode 100644
index 000000000..d5cb154f3
--- /dev/null
+++ b/content/learning-paths/mobile-graphics-and-gaming/litert-sme/2-build-tool.md
@@ -0,0 +1,90 @@
+---
+title: Build the LiteRT benchmark tool
+weight: 3
+
+### FIXED, DO NOT MODIFY
+layout: learningpathall
+---
+
+### Build the LiteRT benchmark tool with KleidiAI and SME2 enabled
+
+LiteRT provides a tool called `benchmark_model` for evaluating the performance of LiteRT models. Use the following steps to build the LiteRT benchmark tool.
+
+First, clone the LiteRT repository.
+
+``` bash
+cd $WORKSPACE
+git clone https://github.com/google-ai-edge/LiteRT.git
+```
+
+Then, set up the build environment using Docker on your Linux development machine.
+
+``` bash
+wget https://raw.githubusercontent.com/tensorflow/tensorflow/master/tensorflow/lite/tools/tflite-android.Dockerfile
+docker build . -t tflite-builder -f tflite-android.Dockerfile
+```
+
+Start the container, then run `sdkmanager` inside it to download the Android tools and libraries needed to build LiteRT for Android.
+
+``` bash
+docker run -it -v $PWD:/host_dir tflite-builder bash
+sdkmanager \
+  "build-tools;${ANDROID_BUILD_TOOLS_VERSION}" \
+  "platform-tools" \
+  "platforms;android-${ANDROID_API_LEVEL}"
+```
+
+Inside the LiteRT source tree, run the configuration script to set up the Bazel parameters.
+
+``` bash
+cd /host_dir/LiteRT
+./configure
+```
+
+You can keep all options at their default values except for:
+
+`Would you like to interactively configure ./WORKSPACE for Android builds? [y/N]`
+
+Type `y`; the script then automatically detects the files installed by the `sdkmanager` command and configures them accordingly.
+
+Now, you can build the benchmark tool with the following commands.
+
+``` bash
+export BENCHMARK_TOOL_PATH="litert/tools:benchmark_model"
+export XNNPACK_OPTIONS="--define tflite_with_xnnpack=true \
+--define=tflite_with_xnnpack_qs8=true \
+--define=tflite_with_xnnpack_qu8=true \
+--define=tflite_with_xnnpack_dynamic_fully_connected=true \
+--define=xnn_enable_arm_sme=true \
+--define=xnn_enable_arm_sme2=true \
+--define=xnn_enable_kleidiai=true"
+
+bazel build -c opt --config=android_arm64 \
+${XNNPACK_OPTIONS} "${BENCHMARK_TOOL_PATH}" \
+--repo_env=HERMETIC_PYTHON_VERSION=3.12
+```
+
+This build enables the KleidiAI and SME2 micro-kernels integrated into XNNPACK.
+
+
+### Build the LiteRT benchmark tool without KleidiAI
+
+To compare the performance of the KleidiAI SME2 implementation against XNNPACK’s original implementation, you can build another version of the LiteRT benchmark tool with KleidiAI and SME2 disabled.
+
+``` bash
+export BENCHMARK_TOOL_PATH="litert/tools:benchmark_model"
+export XNNPACK_OPTIONS="--define tflite_with_xnnpack=true \
+--define=tflite_with_xnnpack_qs8=true \
+--define=tflite_with_xnnpack_qu8=true \
+--define=tflite_with_xnnpack_dynamic_fully_connected=true \
+--define=xnn_enable_arm_sme=false \
+--define=xnn_enable_arm_sme2=false \
+--define=xnn_enable_kleidiai=false"
+
+bazel build -c opt --config=android_arm64 \
+${XNNPACK_OPTIONS} "${BENCHMARK_TOOL_PATH}" \
+--repo_env=HERMETIC_PYTHON_VERSION=3.12
+```
+
+The path to the compiled benchmark tool binary is displayed in the build output.
+You can then use ADB to push the benchmark tool to your Android device.
diff --git a/content/learning-paths/mobile-graphics-and-gaming/litert-sme/3-buid-model.md b/content/learning-paths/mobile-graphics-and-gaming/litert-sme/3-buid-model.md
new file mode 100644
index 000000000..16dc9aec3
--- /dev/null
+++ b/content/learning-paths/mobile-graphics-and-gaming/litert-sme/3-buid-model.md
@@ -0,0 +1,130 @@
+---
+title: Create LiteRT models
+weight: 4
+
+### FIXED, DO NOT MODIFY
+layout: learningpathall
+---
+
+### KleidiAI SME2 support in LiteRT
+
+Only a subset of the KleidiAI SME and SME2 micro-kernels has been integrated into XNNPACK.
+These micro-kernels support operators that use the following data types and quantization configurations in the LiteRT model.
+All other operators use XNNPACK’s default implementations during inference.
+
+* Fully Connected
+
+| Activations | Weights | Output |
+| ---------------------------- | --------------------------------------- | ---------------------------- |
+| FP32 | FP32 | FP32 |
+| FP32 | FP16 | FP32 |
+| FP32 | Per-channel symmetric INT8 quantization | FP32 |
+| Asymmetric INT8 quantization | Per-channel symmetric INT8 quantization | Asymmetric INT8 quantization |
+| FP32 | Per-channel symmetric INT4 quantization | FP32 |
+
+* Batch Matrix Multiply
+
+| Input A | Input B |
+| ------- | --------------------------------------- |
+| FP32 | FP32 |
+| FP16 | FP16 |
+| FP32 | Per-channel symmetric INT8 quantization |
+
+
+* Conv2D
+
+| Activations | Weights | Output |
+| ---------------------------- | ----------------------------------------------------- | ---------------------------- |
+| FP32 | FP32, pointwise (kernel size is 1) | FP32 |
+| FP32 | FP16, pointwise (kernel size is 1) | FP32 |
+| FP32 | Per-channel or per-tensor symmetric INT8 quantization | FP32 |
+| Asymmetric INT8 quantization | Per-channel or per-tensor symmetric INT8 quantization | Asymmetric INT8 quantization |
+
+
+* TransposeConv
+
+| Activations | Weights | Output |
+| ---------------------------- | ----------------------------------------------------- | ---------------------------- |
+| Asymmetric INT8 quantization | Per-channel or per-tensor symmetric INT8 quantization | Asymmetric INT8 quantization |
+
+
+### Create LiteRT models with Keras
+
+To evaluate the performance of SME2 acceleration for a single operator, the following script is provided as an example. It uses Keras to create a simple model containing only one Fully Connected operator and converts it into a LiteRT model.
+
+``` python
+import tensorflow as tf
+import numpy as np
+import os
+
+batch_size = 100
+input_size = 640
+output_size = 1280
+
+def save_litert_model(model_bytes, filename):
+    if os.path.exists(filename):
+        print(f"Warning: {filename} already exists and will be overwritten.")
+    with open(filename, "wb") as f:
+        f.write(model_bytes)
+
+model = tf.keras.Sequential([
+    tf.keras.layers.InputLayer(input_shape=(input_size,), batch_size=batch_size),
+    tf.keras.layers.Dense(output_size)
+])
+
+# Convert to FP32 model
+converter = tf.lite.TFLiteConverter.from_keras_model(model)
+fc_fp32 = converter.convert()
+save_litert_model(fc_fp32, "fc_fp32.tflite")
+```
+
+The model above is created in FP32 format. As mentioned in the previous section, this operator can invoke the KleidiAI SME2 micro-kernels for acceleration.
+
+You can also optimize this Keras model using post-training quantization to create a LiteRT model that suits your requirements.
+
+* Post-training FP16 quantization
+
+``` python
+# Convert to a model with FP16 weights and FP32 activations
+converter = tf.lite.TFLiteConverter.from_keras_model(model)
+converter.optimizations = [tf.lite.Optimize.DEFAULT]
+converter.target_spec.supported_types = [tf.float16]
+converter.target_spec._experimental_supported_accumulation_type = tf.dtypes.float16
+fc_fp16 = converter.convert()
+save_litert_model(fc_fp16, "fc_fp16.tflite")
+```
+
+This method applies FP16 quantization to a model with FP32 operators. In practice, this optimization adds metadata to the model to indicate that it is compatible with FP16 inference. With this hint, at runtime, XNNPACK replaces the FP32 operators with their FP16 equivalents. It also inserts additional operators that convert the model inputs from FP32 to FP16 and convert the model outputs from FP16 back to FP32.
+
+KleidiAI provides FP16 packing micro-kernels for both the activation and weight matrices, as well as FP16 matrix multiplication micro-kernels.
+
+* Post-training INT8 dynamic range quantization
+
+``` python
+# Convert to a dynamically quantized INT8 model (INT8 weights, FP32 activations)
+converter = tf.lite.TFLiteConverter.from_keras_model(model)
+converter.optimizations = [tf.lite.Optimize.DEFAULT]
+fc_int8_dynamic = converter.convert()
+save_litert_model(fc_int8_dynamic, "fc_dynamic_int8.tflite")
+```
+
+This quantization method optimizes operators with large parameter sizes by quantizing their weights to INT8 while keeping the activations in the FP32 data format.
+
+KleidiAI provides micro-kernels that dynamically quantize the activations to INT8 at runtime. KleidiAI also provides packing micro-kernels for the weight matrix, as well as INT8 matrix multiplication micro-kernels that produce FP32 outputs.
+
+
+* Post-training INT8 static quantization
+
+``` python
+def fake_dataset():
+    for _ in range(100):
+        sample = np.random.rand(input_size).astype(np.float32)
+        yield [sample]
+
+# Convert to a statically quantized INT8 model (INT8 weights and activations)
+converter = tf.lite.TFLiteConverter.from_keras_model(model)
+converter.optimizations = [tf.lite.Optimize.DEFAULT]
+converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
+converter.target_spec.supported_types = [tf.int8]
+converter.inference_input_type = tf.int8
+converter.inference_output_type = tf.int8
+converter.representative_dataset = fake_dataset
+fc_int8_static = converter.convert()
+save_litert_model(fc_int8_static, "fc_static_int8.tflite")
+```
+
+This quantization method quantizes both the activations and the weights to INT8.
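+
+Before benchmarking, you can optionally confirm that each converted model has the data types you expect. The following is a minimal sketch, not part of the original script, that loads the converted models with the TFLite interpreter bundled with TensorFlow and prints the tensor data types; it assumes the files written by `save_litert_model` above are in the current directory.
+
+``` python
+import tensorflow as tf
+
+for path in ["fc_fp32.tflite", "fc_fp16.tflite", "fc_dynamic_int8.tflite", "fc_static_int8.tflite"]:
+    interpreter = tf.lite.Interpreter(model_path=path)
+    interpreter.allocate_tensors()
+    print(path)
+    # Print the name, data type, and shape of every tensor in the model
+    for detail in interpreter.get_tensor_details():
+        print(f"  {detail['name']}: {detail['dtype'].__name__}, shape={detail['shape']}")
+```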
+
+For this statically quantized model, KleidiAI provides INT8 packing micro-kernels for both the activation and weight matrices, as well as INT8 matrix multiplication micro-kernels.
diff --git a/content/learning-paths/mobile-graphics-and-gaming/litert-sme/4-benchmark.md b/content/learning-paths/mobile-graphics-and-gaming/litert-sme/4-benchmark.md
new file mode 100644
index 000000000..f974bc830
--- /dev/null
+++ b/content/learning-paths/mobile-graphics-and-gaming/litert-sme/4-benchmark.md
@@ -0,0 +1,197 @@
+---
+title: Benchmark the LiteRT model
+weight: 5
+
+### FIXED, DO NOT MODIFY
+layout: learningpathall
+---
+
+### Use the benchmark tool
+
+First, check whether your Android phone supports SME2. You can do this with the following command.
+
+``` bash
+cat /proc/cpuinfo
+
+...
+processor : 7
+BogoMIPS : 2000.00
+Features : fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm jscvt fcma lrcpc dcpop sha3 sm3 sm4 asimddp sha512 sve asimdfhm dit uscat ilrcpc flagm ssbs sb dcpodp sve2 sveaes svepmull svebitperm svesha3 svesm4 flagm2 frint svei8mm svebf16 i8mm bf16 dgh bti mte ecv afp mte3 sme smei8i32 smef16f32 smeb16f32 smef32f32 wfxt rprfm sme2 smei16i32 smebi32i32 hbc lrcpc3
+```
+
+As you can see from the `Features` line, CPU 7 supports SME2 (the `sme` and `sme2` flags are present).
+
+You can then run the `benchmark_model` tool on a CPU that supports SME2. An example is as follows.
+
+This example uses the `taskset` command to pin the benchmark workload to CPU 7 (the mask `80` is hexadecimal 0x80, which selects core 7). It uses a single thread (`--num_threads=1`), runs the inference at least 1000 times (`--num_runs=1000`), and selects the CPU accelerator (`--use_cpu=true`). It also passes the option `--use_profiler=true` to produce operator-level profiling during inference.
+
+``` bash
+taskset 80 ./benchmark_model --graph=./fc_fp32.tflite --num_runs=1000 --num_threads=1 --use_cpu=true --use_profiler=true
+
+...
+INFO: [litert/runtime/accelerators/auto_registration.cc:148] CPU accelerator registered.
+INFO: [litert/runtime/compiled_model.cc:415] Flatbuffer model initialized directly from incoming litert model.
+INFO: Initialized TensorFlow Lite runtime.
+INFO: Created TensorFlow Lite XNNPACK delegate for CPU.
+VERBOSE: Replacing 1 out of 1 node(s) with delegate (TfLiteXNNPackDelegate) node, yielding 1 partitions for subgraph 0.
+INFO: The input model file size (MB): 3.27774
+INFO: Initialized session in 4.478ms.
+INFO: Running benchmark for at least 1 iterations and at least 0.5 seconds but terminate if exceeding 150 seconds.
+INFO: count=1055 first=1033 curr=473 min=443 max=1033 avg=465.319 std=18 p5=459 median=463 p95=478
+
+INFO: Running benchmark for at least 1000 iterations and at least 1 seconds but terminate if exceeding 150 seconds.
+INFO: count=2112 first=463 curr=459 min=442 max=979 avg=464.545 std=13 p5=460 median=462 p95=478 + +INFO: [./litert/tools/benchmark_litert_model.h:81] +========== BENCHMARK RESULTS ========== +INFO: [./litert/tools/benchmark_litert_model.h:82] Model initialization: 4.48 ms +INFO: [./litert/tools/benchmark_litert_model.h:84] Warmup (first): 1.03 ms +INFO: [./litert/tools/benchmark_litert_model.h:86] Warmup (avg): 0.47 ms (1055 runs) +INFO: [./litert/tools/benchmark_litert_model.h:88] Inference (avg): 0.46 ms (2112 runs) +INFO: [./litert/tools/benchmark_litert_model.h:92] Inference (min): 0.44 ms +INFO: [./litert/tools/benchmark_litert_model.h:94] Inference (max): 0.98 ms +INFO: [./litert/tools/benchmark_litert_model.h:96] Inference (std): 0.01 +INFO: [./litert/tools/benchmark_litert_model.h:103] Throughput: 525.55 MB/s +INFO: [./litert/tools/benchmark_litert_model.h:112] +Memory Usage: +INFO: [./litert/tools/benchmark_litert_model.h:114] Init footprint: 8.94 MB +INFO: [./litert/tools/benchmark_litert_model.h:116] Overall footprint: 11.51 MB +INFO: [./litert/tools/benchmark_litert_model.h:123] Peak memory usage not available. (peak_mem_mb <= 0) +INFO: [./litert/tools/benchmark_litert_model.h:126] ====================================== + +INFO: [./litert/tools/benchmark_litert_model.h:179] +============================== Run Order ============================== + [node type] [first] [avg ms] [%] [cdf%] [mem KB] [times called] [Name] + LiteRT::Run[buffer registration] 0.020 0.014 3.309% 3.309% 0.000 1 LiteRT::Run[buffer registration]/0 + AllocateTensors 0.291 0.291 0.022% 3.331% 452.000 0 AllocateTensors/0 + Static Reshape (NC) 0.085 0.003 0.739% 4.070% 0.000 1 Delegate/Static Reshape (NC):0 + Fully Connected (NC, PF32) GEMM 0.538 0.382 92.948% 97.018% 0.000 1 Delegate/Fully Connected (NC, PF32) GEMM:1 + LiteRT::Run[Buffer sync] 0.013 0.012 2.982% 100.000% 0.000 1 LiteRT::Run[Buffer sync]/0 + +============================== Top by Computation Time ============================== + [node type] [first] [avg ms] [%] [cdf%] [mem KB] [times called] [Name] + Fully Connected (NC, PF32) GEMM 0.538 0.382 92.948% 92.948% 0.000 1 Delegate/Fully Connected (NC, PF32) GEMM:1 + AllocateTensors 0.291 0.291 0.022% 92.970% 452.000 0 AllocateTensors/0 + LiteRT::Run[buffer registration] 0.020 0.014 3.309% 96.279% 0.000 1 LiteRT::Run[buffer registration]/0 + LiteRT::Run[Buffer sync] 0.013 0.012 2.982% 99.261% 0.000 1 LiteRT::Run[Buffer sync]/0 + Static Reshape (NC) 0.085 0.003 0.739% 100.000% 0.000 1 Delegate/Static Reshape (NC):0 + +Number of nodes executed: 5 +============================== Summary by node type ============================== + [Node type] [count] [avg ms] [avg %] [cdf %] [mem KB] [times called] + Fully Connected (NC, PF32) GEMM 1 0.382 93.171% 93.171% 0.000 1 + LiteRT::Run[buffer registration] 1 0.013 3.171% 96.341% 0.000 1 + LiteRT::Run[Buffer sync] 1 0.012 2.927% 99.268% 0.000 1 + Static Reshape (NC) 1 0.003 0.732% 100.000% 0.000 1 + AllocateTensors 1 0.000 0.000% 100.000% 452.000 0 + +Timings (microseconds): count=3166 first=947 curr=406 min=390 max=947 avg=411.071 std=14 +Memory (bytes): count=0 +5 nodes observed +``` + +As you can see from the results above, the results include the time spent on model initialization, warm-up, and inference, as well as memory usage. Since the profiler was enabled, the output also reports the execution time of each operator. 
+
+Because the model contains only a single Fully Connected layer, the node type `Fully Connected (NC, PF32) GEMM` dominates: its average execution time is 0.382 ms, accounting for 93.171% of the total inference time.
+
+{{% notice Note %}}
+To verify that the KleidiAI SME2 micro-kernels are invoked for the Fully Connected operator during model inference, you can prefix the benchmark command with `simpleperf record -g --` to capture the call graph. For `benchmark_model`, you should also build it with the `-c dbg` option so that symbols are available.
+{{% /notice %}}
+
+## Measure the performance impact of KleidiAI SME2 micro-kernels
+
+To compare the performance of the KleidiAI SME2 implementation with the original XNNPACK implementation, you can repeat the benchmark with the `benchmark_model` binary that was built without KleidiAI, using the same parameters.
+
+An example is as follows.
+
+``` bash
+taskset 80 ./benchmark_model --graph=./fc_fp32.tflite --num_runs=1000 --num_threads=1 --use_cpu=true --use_profiler=true
+
+...
+INFO: [litert/runtime/accelerators/auto_registration.cc:148] CPU accelerator registered.
+INFO: [litert/runtime/compiled_model.cc:415] Flatbuffer model initialized directly from incoming litert model.
+INFO: Initialized TensorFlow Lite runtime.
+INFO: Created TensorFlow Lite XNNPACK delegate for CPU.
+VERBOSE: Replacing 1 out of 1 node(s) with delegate (TfLiteXNNPackDelegate) node, yielding 1 partitions for subgraph 0.
+INFO: The input model file size (MB): 3.27774
+INFO: Initialized session in 4.488ms.
+INFO: Running benchmark for at least 1 iterations and at least 0.5 seconds but terminate if exceeding 150 seconds.
+INFO: count=358 first=1927 curr=1370 min=1363 max=1927 avg=1386.31 std=38 p5=1366 median=1377 p95=1428
+
+INFO: Running benchmark for at least 1000 iterations and at least 1 seconds but terminate if exceeding 150 seconds.
+INFO: count=1000 first=1407 curr=1370 min=1362 max=1452 avg=1379.64 std=14 p5=1365 median=1373 p95=1409
+
+INFO: [./litert/tools/benchmark_litert_model.h:81]
+========== BENCHMARK RESULTS ==========
+INFO: [./litert/tools/benchmark_litert_model.h:82] Model initialization: 4.49 ms
+INFO: [./litert/tools/benchmark_litert_model.h:84] Warmup (first): 1.93 ms
+INFO: [./litert/tools/benchmark_litert_model.h:86] Warmup (avg): 1.39 ms (358 runs)
+INFO: [./litert/tools/benchmark_litert_model.h:88] Inference (avg): 1.38 ms (1000 runs)
+INFO: [./litert/tools/benchmark_litert_model.h:92] Inference (min): 1.36 ms
+INFO: [./litert/tools/benchmark_litert_model.h:94] Inference (max): 1.45 ms
+INFO: [./litert/tools/benchmark_litert_model.h:96] Inference (std): 0.01
+INFO: [./litert/tools/benchmark_litert_model.h:103] Throughput: 176.96 MB/s
+INFO: [./litert/tools/benchmark_litert_model.h:112]
+Memory Usage:
+INFO: [./litert/tools/benchmark_litert_model.h:114] Init footprint: 9.07 MB
+INFO: [./litert/tools/benchmark_litert_model.h:116] Overall footprint: 11.25 MB
+INFO: [./litert/tools/benchmark_litert_model.h:123] Peak memory usage not available.
(peak_mem_mb <= 0) +INFO: [./litert/tools/benchmark_litert_model.h:126] ====================================== + +INFO: [./litert/tools/benchmark_litert_model.h:179] +============================== Run Order ============================== + [node type] [first] [avg ms] [%] [cdf%] [mem KB] [times called] [Name] + LiteRT::Run[buffer registration] 0.026 0.018 1.392% 1.392% 0.000 1 LiteRT::Run[buffer registration]/0 + AllocateTensors 0.195 0.195 0.011% 1.403% 56.000 0 AllocateTensors/0 + Static Reshape (NC) 0.004 0.004 0.307% 1.710% 0.000 1 Delegate/Static Reshape (NC):0 + Fully Connected (NC, F32) GEMM 1.564 1.269 97.059% 98.769% 0.000 1 Delegate/Fully Connected (NC, F32) GEMM:1 + LiteRT::Run[Buffer sync] 0.018 0.016 1.231% 100.000% 0.000 1 LiteRT::Run[Buffer sync]/0 + +============================== Top by Computation Time ============================== + [node type] [first] [avg ms] [%] [cdf%] [mem KB] [times called] [Name] + Fully Connected (NC, F32) GEMM 1.564 1.269 97.059% 97.059% 0.000 1 Delegate/Fully Connected (NC, F32) GEMM:1 + AllocateTensors 0.195 0.195 0.011% 97.070% 56.000 0 AllocateTensors/0 + LiteRT::Run[buffer registration] 0.026 0.018 1.392% 98.462% 0.000 1 LiteRT::Run[buffer registration]/0 + LiteRT::Run[Buffer sync] 0.018 0.016 1.231% 99.693% 0.000 1 LiteRT::Run[Buffer sync]/0 + Static Reshape (NC) 0.004 0.004 0.307% 100.000% 0.000 1 Delegate/Static Reshape (NC):0 + +Number of nodes executed: 5 +============================== Summary by node type ============================== + [Node type] [count] [avg ms] [avg %] [cdf %] [mem KB] [times called] + Fully Connected (NC, F32) GEMM 1 1.268 97.090% 97.090% 0.000 1 + LiteRT::Run[buffer registration] 1 0.018 1.378% 98.469% 0.000 1 + LiteRT::Run[Buffer sync] 1 0.016 1.225% 99.694% 0.000 1 + Static Reshape (NC) 1 0.004 0.306% 100.000% 0.000 1 + AllocateTensors 1 0.000 0.000% 100.000% 56.000 0 + +Timings (microseconds): count=1357 first=1807 curr=1295 min=1291 max=1807 avg=1307.19 std=21 +Memory (bytes): count=0 +5 nodes observed +``` + +As you can see from the results, for the same model, the XNNPACK node type name is different. For the non-KleidiAI implementation, the node type is `Fully Connected (NC, F32) GEMM`, whereas for the KleidiAI implementation, it is `Fully Connected (NC, PF32) GEMM`. 
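+
+Comparing the two runs on this FP32 model, the KleidiAI SME2 build averages 0.46 ms per inference versus 1.38 ms for the default XNNPACK build, and the Fully Connected operator itself drops from 1.269 ms to 0.382 ms, roughly a 3x speedup on this device.
+
+If you want to confirm at the symbol level that the KleidiAI SME2 micro-kernels are on the call path, you can capture a profile with `simpleperf`, as mentioned in the note above. The sketch below shows one possible way to do this; the exact kernel symbol names depend on the KleidiAI version, but they typically carry a `kai_` prefix.
+
+``` bash
+# Record the benchmark with call graphs (use a benchmark_model binary built with -c dbg)
+simpleperf record -g -o perf.data -- taskset 80 ./benchmark_model --graph=./fc_fp32.tflite --num_runs=1000 --num_threads=1 --use_cpu=true
+
+# Report the hottest symbols and filter for KleidiAI micro-kernels
+simpleperf report -i perf.data --sort symbol | grep -i kai
+```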
+
+For other operators supported by KleidiAI, the per-operator profiling node types differ between the implementations with and without KleidiAI enabled in XNNPACK as follows:
+
+| Operator | Node Type (KleidiAI Enabled) | Node Type (KleidiAI Disabled) |
+|----------------------------------------|-------------------------------------------------------|--------------------------------------------------------|
+| Fully Connected / Conv2D (Pointwise) | Fully Connected (NC, PF32) | Fully Connected (NC, F32) |
+| Fully Connected | Dynamic Fully Connected (NC, PF32) | Dynamic Fully Connected (NC, F32) |
+| Fully Connected / Conv2D (Pointwise) | Fully Connected (NC, PF16) | Fully Connected (NC, F16) |
+| Fully Connected | Dynamic Fully Connected (NC, PF16) | Dynamic Fully Connected (NC, F16) |
+| Fully Connected | Fully Connected (NC, QP8, F32, QC4W) | Fully Connected (NC, QD8, F32, QC4W) |
+| Fully Connected / Conv2D (Pointwise) | Fully Connected (NC, QP8, F32, QC8W) | Fully Connected (NC, QD8, F32, QC8W) |
+| Fully Connected / Conv2D (Pointwise) | Fully Connected (NC, PQS8, QC8W) | Fully Connected (NC, QS8, QC8W) |
+| Batch Matrix Multiply | Batch Matrix Multiply (NC, PF32) | Batch Matrix Multiply (NC, F32) |
+| Batch Matrix Multiply | Batch Matrix Multiply (NC, PF16) | Batch Matrix Multiply (NC, F16) |
+| Batch Matrix Multiply | Batch Matrix Multiply (NC, QP8, F32, QC8W) | Batch Matrix Multiply (NC, QD8, F32, QC8W) |
+| Conv2D | Convolution (NHWC, PQS8, QS8, QC8W) | Convolution (NHWC, QC8) |
+| TransposeConv | Deconvolution (NHWC, PQS8, QS8, QC8W) | Deconvolution (NC, QS8, QC8W) |
+
+As you can see from the table, the letter "P" in the node type indicates that a KleidiAI implementation is used.
+
+For example, `Convolution (NHWC, PQS8, QS8, QC8W)` represents a Conv2D operator computed by a KleidiAI micro-kernel, where the tensors are in NHWC layout:
+
+* The input is packed INT8 quantized.
+* The weights are per-channel INT8 quantized.
+* The output is INT8 quantized.
\ No newline at end of file
diff --git a/content/learning-paths/mobile-graphics-and-gaming/litert-sme/_index.md b/content/learning-paths/mobile-graphics-and-gaming/litert-sme/_index.md
new file mode 100644
index 000000000..c036a71e9
--- /dev/null
+++ b/content/learning-paths/mobile-graphics-and-gaming/litert-sme/_index.md
@@ -0,0 +1,55 @@
+---
+title: Accelerate LiteRT Models on Android with KleidiAI and SME2
+
+minutes_to_complete: 30
+
+who_is_this_for: This is an advanced topic for developers looking to leverage the SME2 instructions to accelerate LiteRT model inference on Android.
+
+learning_objectives:
+  - Understand how KleidiAI works in LiteRT.
+  - Build the LiteRT benchmark tool and enable XNNPACK and KleidiAI with SME2 support in LiteRT.
+  - Create LiteRT models that can be accelerated by SME2 through KleidiAI.
+  - Use the benchmark tool to evaluate and validate the SME2 acceleration performance of LiteRT models.
+
+
+prerequisites:
+  - A Linux development machine.
+  - An Android device that supports the SME2 Arm architecture features.
+ +author: Jiaming Guo + +### Tags +skilllevels: Advanced +subjects: ML +armips: + - Cortex-A +tools_software_languages: + - C + - Python +operatingsystems: + - Android + + + +further_reading: + - resource: + title: LiteRT model optimization + link: https://ai.google.dev/edge/litert/models/model_optimization + type: website + - resource: + title: Convert Pytorch model to LiteRT model + link: https://ai.google.dev/edge/litert/models/pytorch_to_tflite + type: website + - resource: + title: LiteRT repository + link: https://github.com/google-ai-edge/LiteRT?tab=readme-ov-file#1--i-have-a-pytorch-model + type: website + + + +### FIXED, DO NOT MODIFY +# ================================================================================ +weight: 1 # _index.md always has weight of 1 to order correctly +layout: "learningpathall" # All files under learning paths have this same wrapper +learning_path_main_page: "yes" # This should be surfaced when looking for related content. Only set for _index.md of learning path content. +--- diff --git a/content/learning-paths/mobile-graphics-and-gaming/litert-sme/_next-steps.md b/content/learning-paths/mobile-graphics-and-gaming/litert-sme/_next-steps.md new file mode 100644 index 000000000..c3db0de5a --- /dev/null +++ b/content/learning-paths/mobile-graphics-and-gaming/litert-sme/_next-steps.md @@ -0,0 +1,8 @@ +--- +# ================================================================================ +# FIXED, DO NOT MODIFY THIS FILE +# ================================================================================ +weight: 21 # Set to always be larger than the content in this path to be at the end of the navigation. +title: "Next Steps" # Always the same, html page title. +layout: "learningpathall" # All files under learning paths have this same wrapper for Hugo processing. +--- diff --git a/content/learning-paths/mobile-graphics-and-gaming/litert-sme/litert-sw-stack.png b/content/learning-paths/mobile-graphics-and-gaming/litert-sme/litert-sw-stack.png new file mode 100644 index 000000000..eb4c2b02c Binary files /dev/null and b/content/learning-paths/mobile-graphics-and-gaming/litert-sme/litert-sw-stack.png differ diff --git a/content/learning-paths/mobile-graphics-and-gaming/litert-sme/litert-xnnpack-kleidiai-workflow.png b/content/learning-paths/mobile-graphics-and-gaming/litert-sme/litert-xnnpack-kleidiai-workflow.png new file mode 100644 index 000000000..0820a5b72 Binary files /dev/null and b/content/learning-paths/mobile-graphics-and-gaming/litert-sme/litert-xnnpack-kleidiai-workflow.png differ diff --git a/content/learning-paths/mobile-graphics-and-gaming/litert-sme/litert-xnnpack-workflow.png b/content/learning-paths/mobile-graphics-and-gaming/litert-sme/litert-xnnpack-workflow.png new file mode 100644 index 000000000..35fabf3c4 Binary files /dev/null and b/content/learning-paths/mobile-graphics-and-gaming/litert-sme/litert-xnnpack-workflow.png differ