Skip to content

Commit 7f88379

Browse files
committed
Update on "[ET-VK][qlinear] Faster weight only quantized linear gemv kernel"
## Changes * Introduce a new compute shader for int4 linear's gemv cases that performs much better than the existing shader. This shader is inspired from MNN's gemv_1x1_conv_buf.cl shader. With this compute kernel, transformer models' text generation can execute much faster than before. On Samsung Galaxy S24 for Llama 3.2 1B, generating 128 tokens: Before: ~25 tok/s After: ~49 tok/s ## Why this new shader is faster The biggest reason is due to vectorized loading of the uint4 weight buffer. This new shader loads the weight buffer as a buffer/image of `uvec4`, whereas the old shader loads the weight buffer as a buffer/image of `u8vec4`. Using the Adreno Offline Compiler, I found that in the former, only one load instruction was used to load from the weight tensor, whereas in the latter 16 load instructions were used to load from the weight tensor. It appears that the data loading was not being vectorized at the assembly level. This is potentially behaviour that can be approved in the SPIR-V shader compiler. An additional factor is better weight packing layout. The new prepacking routine results in better memory coalescing between threads in a work group. The final major factor is the use of tree based reduction to co-operatively reduce partial results into the final output. Previously, a single thread was responsible for the final reduction. ## Future Work * Introduce faster shader for int4 linear gemm cases * Update QCSNW to also use these updated shaders Differential Revision: [D78275584](https://our.internmc.facebook.com/intern/diff/D78275584/) [ghstack-poisoned]
2 parents 470032f + cbaf32f commit 7f88379

File tree

90 files changed

+6962
-708
lines changed

Some content is hidden

Large Commits have some content hidden by default. Use the searchbox below for content that may be hidden.

90 files changed

+6962
-708
lines changed

.ci/scripts/unittest-buck2.sh

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -10,7 +10,7 @@ set -eux
1010
# TODO: can't query cadence & vulkan backends
1111
# TODO: can't query //kernels/prim_ops because of non-buckified stuff in OSS.
1212
buck2 query "//backends/apple/... + //backends/example/... + \
13-
//backends/mediatek/... + //backends/test/... + //backends/transforms/... + \
13+
//backends/mediatek/... + //backends/transforms/... + \
1414
//backends/xnnpack/... + //configurations/... + //kernels/aten/... + \
1515
//kernels/optimized/... + //kernels/portable/... + //kernels/quantized/... + \
1616
//kernels/test/... + //runtime/... + //schema/... + //test/... + //util/..."

.github/workflows/android-release-artifacts.yml

Lines changed: 1 addition & 3 deletions
Original file line numberDiff line numberDiff line change
@@ -47,15 +47,13 @@ jobs:
4747
name: build-aar
4848
needs: check-if-aar-exists
4949
if: ${{ !github.event.pull_request.head.repo.fork }}
50-
uses: pytorch/test-infra/.github/workflows/linux_job_v2.yml@release/2.7
50+
uses: pytorch/test-infra/.github/workflows/linux_job_v2.yml@main
5151
secrets: inherit
5252
permissions:
5353
id-token: write
5454
contents: read
5555
with:
5656
secrets-env: EXECUTORCH_MAVEN_SIGNING_KEYID EXECUTORCH_MAVEN_SIGNING_PASSWORD EXECUTORCH_MAVEN_CENTRAL_PASSWORD EXECUTORCH_MAVEN_CENTRAL_USERNAME EXECUTORCH_MAVEN_SIGNING_GPG_KEY_CONTENTS
57-
# As this job has access to Maven credential, run this on a fresh ephemeral runner
58-
runner: ephemeral.linux.2xlarge
5957
docker-image: executorch-ubuntu-22.04-clang12-android
6058
submodules: 'recursive'
6159
ref: ${{ github.sha }}
Lines changed: 121 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,121 @@
1+
#!/usr/bin/env bash
2+
# Copyright 2025 Arm Limited and/or its affiliates.
3+
#
4+
# This source code is licensed under the BSD-style license found in the
5+
# LICENSE file in the root directory of this source tree.
6+
7+
set -euo pipefail
8+
9+
# TODO
10+
mlsdk_manifest_url=""
11+
12+
script_dir=$(cd -- "$( dirname -- "${BASH_SOURCE[0]}" )" &> /dev/null && pwd)
13+
14+
source ${script_dir}/utils.sh
15+
16+
usage() { echo "Usage: $0 [-u <mlsdk-manifest-url>]" 1>&2; exit 1; }
17+
18+
while getopts ":u:" opt; do
19+
case "${opt}" in
20+
u)
21+
mlsdk_manifest_url=${OPTARG}
22+
;;
23+
*)
24+
usage
25+
;;
26+
esac
27+
done
28+
29+
function download_ai_mlsdk_manifest() {
30+
local _dada_dir="$1"
31+
32+
if [[ -z "${_dada_dir}" ]]; then
33+
echo "Error: _dada_dir parameter missing?"
34+
return 1
35+
fi
36+
37+
if [[ -z "${mlsdk_manifest_url}" ]]; then
38+
echo "Error: mlsdk_manifest_url parameter missing?"
39+
return 1
40+
fi
41+
42+
if [[ ! -d "${_dada_dir}" ]]; then
43+
mkdir -p "$_dada_dir"
44+
pushd "$_dada_dir" || exit 1
45+
46+
curl https://storage.googleapis.com/git-repo-downloads/repo > repo
47+
chmod u+x repo
48+
./repo init --no-repo-verify --depth=1 --manifest-url ${mlsdk_manifest_url} -g model-converter,emulation-layer,vgf-library
49+
./repo sync
50+
51+
popd
52+
fi
53+
}
54+
55+
function setup_model_converter() {
56+
local work_dir="$1"
57+
local manifest_dir="$2"
58+
local enable_vgf_lib="$3"
59+
local enable_emulation_layer="$4"
60+
61+
if [[ -z "$work_dir" ]]; then
62+
echo "Error: work_dir parameter is required."
63+
return 1
64+
fi
65+
66+
if [[ -z "$manifest_dir" ]]; then
67+
echo "Error: manifest_dir parameter is required."
68+
return 1
69+
fi
70+
71+
mkdir -p "$work_dir"
72+
pushd "$work_dir" || exit 1
73+
74+
download_ai_mlsdk_manifest ${manifest_dir}
75+
76+
pushd "$manifest_dir"
77+
78+
# model-converter
79+
# TODO: Remove macOS patch after mlsdk fully supports macOS
80+
if [[ "$(uname)" == "Darwin" ]]; then
81+
sed -i '' '/^ *print(f"Unsupported host platform/ i\
82+
if system == "Darwin":\
83+
# Use default Apple toolchain (Clang) on macOS\
84+
return True\
85+
\
86+
' sw/model-converter/scripts/build.py
87+
fi
88+
python sw/model-converter/scripts/build.py -j$(nproc)
89+
90+
# libvgf
91+
if [[ "${enable_vgf_lib}" -eq 1 ]]; then
92+
# TODO: Remove macOS patch after mlsdk fully supports macOS
93+
if [[ "$(uname)" == "Darwin" ]]; then
94+
sed -i '' '/^ *print(f"ERROR: Unsupported host platform/ i\
95+
if system == "Darwin":\
96+
# Use default Apple toolchain (Clang) on macOS\
97+
return True\
98+
\
99+
' sw/vgf-lib/scripts/build.py
100+
fi
101+
python sw/vgf-lib/scripts/build.py -j$(nproc)
102+
fi
103+
104+
# emu layer
105+
if [[ "${enable_emulation_layer}" -eq 1 ]]; then
106+
pushd sw/emulation-layer
107+
cmake -B build \
108+
-DGLSLANG_PATH=../../dependencies/glslang \
109+
-DSPIRV_CROSS_PATH=../../dependencies/SPIRV-Cross \
110+
-DSPIRV_HEADERS_PATH=../../dependencies/SPIRV-Headers \
111+
-DSPIRV_TOOLS_PATH=../../dependencies/SPIRV-Tools \
112+
-DVULKAN_HEADERS_PATH=../../dependencies/Vulkan-Headers
113+
cmake --build build
114+
popd
115+
fi
116+
117+
popd
118+
}
119+
120+
#setup_model_converter() $1
121+
# `"$manifest_dir"'

backends/arm/tosa_quant_utils.py

Lines changed: 11 additions & 3 deletions
Original file line numberDiff line numberDiff line change
@@ -146,7 +146,15 @@ def quantize_value(self, x: torch.Tensor | float) -> Tensor:
146146
x = x.to(torch.float32)
147147
if self.per_channel:
148148
q_op = exir_ops.edge.quantized_decomposed.quantize_per_channel.default
149-
args = (x, self.scale, self.zp, self.axis, self.qmin, self.qmax, self.dtype)
149+
args = (
150+
x,
151+
torch.tensor(self.scale),
152+
torch.tensor(self.zp),
153+
self.axis,
154+
self.qmin,
155+
self.qmax,
156+
self.dtype,
157+
)
150158
else:
151159
q_op = exir_ops.edge.quantized_decomposed.quantize_per_tensor.default
152160
args = (x, self.scale, self.zp, self.qmin, self.qmax, self.dtype) # type: ignore[assignment]
@@ -162,8 +170,8 @@ def dequantize_value(self, qx: torch.Tensor) -> torch.Tensor:
162170
dq_op = exir_ops.edge.quantized_decomposed.dequantize_per_channel.default
163171
args = (
164172
qx,
165-
self.scale,
166-
self.zp,
173+
torch.tensor(self.scale),
174+
torch.tensor(self.zp),
167175
self.axis,
168176
self.qmin,
169177
self.qmax,

backends/cadence/aot/TARGETS

Lines changed: 16 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -184,6 +184,21 @@ python_library(
184184
],
185185
)
186186

187+
python_library(
188+
name = "program_builder",
189+
srcs = [
190+
"program_builder.py",
191+
],
192+
typing = True,
193+
deps = [
194+
":graph_builder",
195+
"fbcode//caffe2:torch",
196+
"fbcode//executorch/exir:lib",
197+
"fbcode//executorch/exir:pass_base",
198+
"fbcode//executorch/exir/verification:verifier",
199+
],
200+
)
201+
187202
python_library(
188203
name = "fuse_ops",
189204
srcs = [
@@ -508,6 +523,7 @@ python_unittest(
508523
":typing_stubs",
509524
":ops_registrations",
510525
":pass_utils",
526+
":program_builder",
511527
"//caffe2:torch",
512528
"//executorch/exir:memory",
513529
"//executorch/exir/dialects:lib",

0 commit comments

Comments
 (0)