
Commit 8079eb7

mcr229 authored and facebook-github-bot committed
Kleidi Integration (pytorch#5162)
Summary:

# Bringing KleidiAI QB4 Kernels to ExecuTorch

KleidiAI has released QB4 kernels that pack the activation while dynamically quantizing it, improving the performance of the GEMM kernel. We leverage these kernels through XNNPACK by wiring them up there. This integration is still waiting on a couple of dependent PRs in other repos to land.

## Dependent PR Tracking

* google/XNNPACK#7003
* https://gitlab.arm.com/kleidi/kleidiai/-/merge_requests/28

## Notes on the Update

When updating XNNPACK to the branch with the integrated Kleidi kernels, we have to make some changes to the CMake files because of refactoring done in XNNPACK. microkernels-prod and kleidiai are both static libraries linked into libXNNPACK.a. Since the llama runner (which links against xnnpack_backend) lives in a separate project, we need to install these new static libraries so that they can later be linked properly into the llama runner. These changes can be seen in the corresponding CMake files. The new feature is currently guarded behind the EXECUTORCH_XNNPACK_ENABLE_KLEIDI flag.

## Repro

```
git submodule sync
git submodule update --init
```

I used the following aliases to make it easier to build llama_main for Android:

```
alias build_et_android="cmake -DCMAKE_TOOLCHAIN_FILE=$ANDROID_NDK/build/cmake/android.toolchain.cmake \
    -DANDROID_ABI=arm64-v8a \
    -DANDROID_PLATFORM=android-23 \
    -DCMAKE_INSTALL_PREFIX=cmake-out-android \
    -DEXECUTORCH_ENABLE_LOGGING=1 \
    -DCMAKE_BUILD_TYPE=Release \
    -DEXECUTORCH_BUILD_EXTENSION_DATA_LOADER=ON \
    -DEXECUTORCH_BUILD_EXTENSION_MODULE=ON \
    -DEXECUTORCH_BUILD_EXTENSION_TENSOR=ON \
    -DEXECUTORCH_BUILD_XNNPACK=ON \
    -DEXECUTORCH_BUILD_KERNELS_OPTIMIZED=ON \
    -DEXECUTORCH_BUILD_KERNELS_QUANTIZED=ON \
    -DEXECUTORCH_BUILD_KERNELS_CUSTOM=ON \
    -DEXECUTORCH_XNNPACK_ENABLE_KLEIDI=ON \
    -DXNNPACK_ENABLE_ARM_BF16=OFF \
    -Bcmake-out-android . \
    && cmake --build cmake-out-android -j16 --target install --config Release"

alias build_llama_android="cmake -DCMAKE_TOOLCHAIN_FILE=$ANDROID_NDK/build/cmake/android.toolchain.cmake \
    -DANDROID_ABI=arm64-v8a \
    -DANDROID_PLATFORM=android-23 \
    -DCMAKE_INSTALL_PREFIX=cmake-out-android \
    -DCMAKE_BUILD_TYPE=Release \
    -DPYTHON_EXECUTABLE=python \
    -DEXECUTORCH_BUILD_XNNPACK=ON \
    -DEXECUTORCH_BUILD_KERNELS_OPTIMIZED=ON \
    -DEXECUTORCH_BUILD_KERNELS_QUANTIZED=ON \
    -DEXECUTORCH_BUILD_KERNELS_CUSTOM=ON \
    -DEXECUTORCH_USE_TIKTOKEN=ON \
    -Bcmake-out-android/examples/models/llama2 \
    examples/models/llama2 \
    && cmake --build cmake-out-android/examples/models/llama2 -j16 --config Release"
```

I run the following:

```
build_et_android
build_llama_android
cd cmake-out-android/examples/models/llama2
adb push llama_main /data/local/tmp/
adb push <path/to/llama3.pte> /data/local/tmp
adb push <path/to/tiktokenizer> /data/local/tmp
adb shell "cd /data/local/tmp && ./llama_main --model_path <model.pte> --tokenizer_path <tokenizer.bin> --cpu_threads=4"
```

## Benchmarks

I ran llama3.1 with:

* sdpa_w_kvcache
* quantized embeddings
* 4-bit blockwise quantized weights
* dynamic shapes
* parallel prefill

on a Samsung S22 with 4 threads.

### Baseline (QD8)

```
I 00:00:32.772974 executorch:stats.h:84] Prompt Tokens: 8 Generated Tokens: 119
I 00:00:32.772980 executorch:stats.h:90] Model Load Time: 15.273000 (seconds)
I 00:00:32.773014 executorch:stats.h:100] Total inference time: 17.488000 (seconds) Rate: 6.804666 (tokens/second)
I 00:00:32.773019 executorch:stats.h:108] Prompt evaluation: 2.971000 (seconds) Rate: 2.692696 (tokens/second)
I 00:00:32.773023 executorch:stats.h:119] Generated 119 tokens: 14.517000 (seconds) Rate: 8.197286 (tokens/second)
I 00:00:32.773027 executorch:stats.h:127] Time to first generated token: 2.971000 (seconds)
I 00:00:32.773030 executorch:stats.h:134] Sampling time over 127 tokens: 0.173000 (seconds)
```

### QP8

```
I 00:00:46.767429 executorch:stats.h:84] Prompt Tokens: 8 Generated Tokens: 119
I 00:00:46.767437 executorch:stats.h:90] Model Load Time: 28.297000 (seconds)
I 00:00:46.767475 executorch:stats.h:100] Total inference time: 18.436000 (seconds) Rate: 6.454762 (tokens/second)
I 00:00:46.767483 executorch:stats.h:108] Prompt evaluation: 1.770000 (seconds) Rate: 4.519774 (tokens/second)
I 00:00:46.767491 executorch:stats.h:119] Generated 119 tokens: 16.666000 (seconds) Rate: 7.140286 (tokens/second)
I 00:00:46.767522 executorch:stats.h:127] Time to first generated token: 1.770000 (seconds)
I 00:00:46.767527 executorch:stats.h:134] Sampling time over 127 tokens: 0.189000 (seconds)
```

We see a ~68% performance improvement on prefill and a ~13% regression on decode. See the dependent XNNPACK PR for more benchmarking details.

Pull Request resolved: pytorch#5162

Reviewed By: digantdesai

Differential Revision: D63651987

Pulled By: mcr229

fbshipit-source-id: aafc92b5006c90f3465af415acc04309851dcd8c
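The prefill and decode deltas quoted above follow directly from the logged rates. A quick sanity check in Python (the rates are copied verbatim from the two logs; the variable names are mine):

```python
# Recompute the QD8 -> QP8 deltas from the rates in the benchmark logs above.
qd8_prefill = 2.692696  # tokens/second, baseline prompt evaluation
qp8_prefill = 4.519774  # tokens/second, QP8 prompt evaluation
qd8_decode = 8.197286   # tokens/second, baseline generation
qp8_decode = 7.140286   # tokens/second, QP8 generation

prefill_delta = qp8_prefill / qd8_prefill - 1.0  # roughly +0.68
decode_delta = qp8_decode / qd8_decode - 1.0     # roughly -0.13

print(f"prefill: {prefill_delta:+.0%}, decode: {decode_delta:+.0%}")
```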
1 parent 660ef77 commit 8079eb7

File tree

482 files changed (+195, −3857 lines)



backends/xnnpack/CMakeLists.txt

Lines changed: 10 additions & 4 deletions
```diff
@@ -32,14 +32,20 @@ if(NOT PYTHON_EXECUTABLE)
   resolve_python_executable()
 endif()
 
-# NB: Enabling this will serialize execution of delegate instances.
-# This setting may have performance implications.
+# NB: Enabling this will serialize execution of delegate instances
+# Keeping this OFF by default to maintain existing behavior, to be revisited.
 option(EXECUTORCH_XNNPACK_SHARED_WORKSPACE
-  "Enable workspace sharing across different delegate instances" ON
-)
+  "Enable workspace sharing across different delegate instances" ON)
+# Keeping this OFF by default due to regressions in decode
+# and model load with kleidi kernels
+option(EXECUTORCH_XNNPACK_ENABLE_KLEIDI
+  "Enable workspace sharing across different delegate instances" OFF)
 if(EXECUTORCH_XNNPACK_SHARED_WORKSPACE)
   add_definitions(-DENABLE_XNNPACK_SHARED_WORKSPACE)
 endif()
+if(EXECUTORCH_XNNPACK_ENABLE_KLEIDI)
+  add_definitions(-DENABLE_XNNPACK_KLEIDI)
+endif()
 
 set(_common_include_directories ${EXECUTORCH_ROOT}/..)
 set(_common_compile_options -Wno-deprecated-declarations -fPIC)
```

backends/xnnpack/cmake/Dependencies.cmake

Lines changed: 27 additions & 1 deletion
```diff
@@ -36,13 +36,39 @@ set(XNNPACK_ENABLE_AVXVNNI
   OFF
   CACHE BOOL ""
 )
-set(XNNPACK_ENABLE_KLEIDIAI
+
+if(EXECUTORCH_XNNPACK_ENABLE_KLEIDI)
+  set(XNNPACK_ENABLE_KLEIDIAI
+    ON
+    CACHE BOOL ""
+  )
+else()
+  set(XNNPACK_ENABLE_KLEIDIAI
+    OFF
+    CACHE BOOL ""
+  )
+endif()
+
+
+set(XNNPACK_BUILD_ALL_MICROKERNELS
   OFF
   CACHE BOOL ""
 )
 add_subdirectory("${XNNPACK_SOURCE_DIR}")
 include_directories(SYSTEM ${XNNPACK_INCLUDE_DIR})
 list(APPEND xnnpack_third_party XNNPACK)
+install(TARGETS microkernels-prod
+        LIBRARY DESTINATION ${CMAKE_INSTALL_LIBDIR}
+        ARCHIVE DESTINATION ${CMAKE_INSTALL_LIBDIR}
+        PUBLIC_HEADER DESTINATION ${CMAKE_INSTALL_INCLUDEDIR})
+
+
+if(EXECUTORCH_XNNPACK_ENABLE_KLEIDI)
+  install(TARGETS kleidiai
+          LIBRARY DESTINATION ${CMAKE_INSTALL_LIBDIR}
+          ARCHIVE DESTINATION ${CMAKE_INSTALL_LIBDIR}
+          PUBLIC_HEADER DESTINATION ${CMAKE_INSTALL_INCLUDEDIR})
+endif()
 
 # Revert PIC Flag to what it originally was
 set(CMAKE_POSITION_INDEPENDENT_CODE
```

backends/xnnpack/runtime/XNNCompiler.cpp

Lines changed: 7 additions & 0 deletions
```diff
@@ -630,7 +630,14 @@ Error defineConvertNode(
       subgraph_ptr,
       remapped_ids.at(graph_node->input_id()),
       remapped_ids.at(graph_node->output_id()),
+#ifdef ENABLE_XNNPACK_KLEIDI
+      // This maps to XNNPACK's XNN_FLAG_MAYBE_PACK_FOR_QB4W_GEMM
+      // however this is not currently exposed at top level
+      // xnnpack.h Header
+      0x00000100);
+#else
       graph_node->flags());
+#endif
 
   ET_CHECK_OR_RETURN_ERROR(
       status == xnn_status_success,
```
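The hard-coded 0x00000100 above corresponds to XNNPACK's XNN_FLAG_MAYBE_PACK_FOR_QB4W_GEMM, which opts the convert node into the 4-bit blockwise-quantized-weight (QB4W) GEMM path. As a rough illustration of the numerics that scheme assumes (a hypothetical Python sketch, not the XNNPACK/KleidiAI packing code):

```python
# Hypothetical sketch of symmetric 4-bit blockwise (per-group) weight
# quantization, the scheme behind QB4W GEMM kernels. This is NOT the
# XNNPACK/KleidiAI implementation, only an illustration of the numerics.

def quantize_qb4w(weights, block_size=32):
    """Quantize a flat list of floats to signed int4, one scale per block."""
    quants, scales = [], []
    for start in range(0, len(weights), block_size):
        block = weights[start:start + block_size]
        # One scale per block, chosen so the largest magnitude maps to +/-7.
        amax = max(abs(w) for w in block) or 1.0
        scale = amax / 7.0
        scales.append(scale)
        # Signed int4 range is [-8, 7]; round to nearest, then clamp.
        quants.extend(max(-8, min(7, round(w / scale))) for w in block)
    return quants, scales

def dequantize_qb4w(quants, scales, block_size=32):
    return [q * scales[i // block_size] for i, q in enumerate(quants)]

weights = [0.05 * i for i in range(-16, 16)]  # 32 weights = one block
quants, scales = quantize_qb4w(weights)
recon = dequantize_qb4w(quants, scales)
# Reconstruction error is bounded by about half a quantization step.
max_err = max(abs(w, ) if False else abs(w - r) for w, r in zip(weights, recon))
```

At runtime the QD8 path quantizes activations dynamically per row, while the QP8 path additionally packs them for the KleidiAI QB4 microkernels; the weight-side numerics sketched here are the same in both cases.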

backends/xnnpack/targets.bzl

Lines changed: 2 additions & 0 deletions
```diff
@@ -49,6 +49,8 @@ def define_common_targets():
         preprocessor_flags = [
             # Uncomment to enable per operator timings
             # "-DENABLE_XNNPACK_PROFILING",
+            # Uncomment to enable using KleidiAI Kernels
+            # "-DENABLE_XNNPACK_KLEIDI"
         ] + _get_preprocessor_flags(),
         exported_deps = [
             "//executorch/runtime/backend:interface",
```

backends/xnnpack/test/CMakeLists.txt

Lines changed: 1 addition & 0 deletions
```diff
@@ -39,6 +39,7 @@ et_cxx_test(
   XNNPACK
   pthreadpool
   cpuinfo
+  microkernels-prod
 )
 target_include_directories(
   backends_xnnpack_test
```

backends/xnnpack/test/runtime/test_xnnexecutor.cpp

Lines changed: 5 additions & 5 deletions
```diff
@@ -9,7 +9,7 @@
 #include <executorch/backends/xnnpack/runtime/XNNExecutor.h>
 #include <executorch/runtime/core/exec_aten/testing_util/tensor_factory.h>
 #include <gtest/gtest.h>
-#include <xnnpack/subgraph.h>
+#include <xnnpack.h>
 
 using torch::executor::Error;
 using torch::executor::EValue;
@@ -26,7 +26,7 @@ TEST(XNNExecutorTest, ArgumentWithTooManyDimensions) {
   std::unique_ptr<xnn_subgraph, decltype(&xnn_delete_subgraph)> auto_subgraph(
       subgraph, xnn_delete_subgraph);
 
-  auto input_id = XNN_INVALID_NODE_ID;
+  auto input_id = XNN_INVALID_VALUE_ID;
   std::vector<size_t> dims = {
       1,
   };
@@ -43,9 +43,9 @@ TEST(XNNExecutorTest, ArgumentWithTooManyDimensions) {
       /*external_id=*/0,
       /*flags=*/XNN_VALUE_FLAG_EXTERNAL_INPUT,
       &input_id));
-  ASSERT_NE(input_id, XNN_INVALID_NODE_ID);
+  ASSERT_NE(input_id, XNN_INVALID_VALUE_ID);
 
-  auto output_id = XNN_INVALID_NODE_ID;
+  auto output_id = XNN_INVALID_VALUE_ID;
   ASSERT_EQ(
       xnn_status_success,
       xnn_define_quantized_tensor_value(
@@ -59,7 +59,7 @@ TEST(XNNExecutorTest, ArgumentWithTooManyDimensions) {
       /*external_id=*/0,
       /*flags=*/XNN_VALUE_FLAG_EXTERNAL_OUTPUT,
       &output_id));
-  ASSERT_NE(output_id, XNN_INVALID_NODE_ID);
+  ASSERT_NE(output_id, XNN_INVALID_VALUE_ID);
 
   ASSERT_EQ(
       xnn_status_success,
```

backends/xnnpack/test/targets.bzl

Lines changed: 1 addition & 0 deletions
```diff
@@ -24,6 +24,7 @@ def define_common_targets():
         srcs = ["runtime/test_xnnexecutor.cpp"],
         deps = [
             third_party_dep("XNNPACK"),
+            third_party_dep("FP16"),
             "//executorch/runtime/core/exec_aten/testing_util:tensor_util",
             "//executorch/runtime/core/exec_aten/util:scalar_type_util",
             "//executorch/backends/xnnpack:xnnpack_backend",
```
Submodule XNNPACK updated 9962 files

backends/xnnpack/third-party/generate-xnnpack-wrappers.py

Lines changed: 0 additions & 213 deletions
This file was deleted.
