Commit 5ac4a93

Authored and committed by morelos
Update base for Update on "[ET-VK][Ops] torchao.choose_qparams_affine vulkan impl and shader (buffer only) and cleanup"
# Changes

* Implement the `torchao.choose_qparams_affine` operator in the Vulkan backend with comprehensive buffer storage support
* Add block-wise quantization parameter computation in the `choose_qparams_buffer.glsl` shader for configurable tensor block analysis
* Extend the quantization parameter infrastructure in `ChooseQParams.cpp` to handle affine transformations with configurable block sizes and multiple mapping types
* Support three quantization mapping strategies (ASYMMETRIC, SYMMETRIC, and SYMMETRIC_NO_CLIPPING_ERR) for optimal parameter selection
* Consolidate the scale/zero-point selection logic shared between the affine and the regular `quantized_decomposed` cases
* BE: Improve the documentation in the shader logic to be more detailed and clear

# Motivation

The existing Vulkan quantization infrastructure lacked support for the `torchao.choose_qparams_affine` operator, which is essential for computing optimal quantization parameters in dynamic quantization workflows. The operator provides flexible block-wise parameter computation that analyzes statistical distributions within tensor blocks, enabling:

* **Block-wise Parameter Computation**: Analyzes configurable tensor blocks to compute optimal scale and zero-point values, improving quantization accuracy for heterogeneous data distributions
* **Multiple Mapping Types**: Supports the ASYMMETRIC, SYMMETRIC, and SYMMETRIC_NO_CLIPPING_ERR quantization strategies for different precision/performance trade-offs

# Operator Description

The `choose_qparams_affine` operator computes optimal quantization parameters (scale and zero_point) from floating-point tensor blocks using statistical analysis of data distributions. Block-wise parameter computation divides tensors into blocks and analyzes each block independently to determine the best quantization mapping for subsequent quantization operations.

The parameter calculation varies by mapping type (a sketch follows this section):

- **ASYMMETRIC**: `scale = (max - min) / (quant_max - quant_min)`, `zero_point = quant_min - round(min / scale)`
- **SYMMETRIC**: `scale = max_abs / ((quant_max - quant_min) / 2)`, `zero_point = midpoint`
- **SYMMETRIC_NO_CLIPPING_ERR**: `scale = max(abs(min) / abs(quant_min), max / quant_max)`, `zero_point = midpoint`

**Storage Requirements**: Input tensors must be floating-point (kFloat) with a width-packed layout. Output scale/zero_point tensors use buffer storage.

NOTE: A texture storage implementation is not supported due to the complexity of block-wise coordinate mapping in 3D texture space. It will likely be necessary for better efficiency in the future.
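To make the three strategies concrete, here is a minimal host-side C++ sketch of the parameter math. It only mirrors the formulas above; the function shape, the `eps` clamping, the zero-point clamp, and the midpoint convention `(quant_min + quant_max + 1) / 2` are illustrative assumptions, not the shader source.

```cpp
#include <algorithm>
#include <cmath>
#include <cstdio>

// Host-side mirror of the shader's scale/zero-point math (assumed shape;
// the real GLSL helper may clamp and round slightly differently).
void calc_scale_zp(
    float lo, float hi,           // block min / max
    int quant_min, int quant_max, // e.g. -128 / 127 for int8
    int mapping_type,             // 0=ASYMMETRIC, 1=SYMMETRIC, 2=SYMMETRIC_NO_CLIPPING_ERR
    float eps,
    float& scale, int& zp) {
  // The representable range must include zero so 0.0f quantizes exactly.
  lo = std::min(lo, 0.0f);
  hi = std::max(hi, 0.0f);
  if (mapping_type == 0) {
    // ASYMMETRIC: map the full [lo, hi] range onto [quant_min, quant_max].
    scale = std::max((hi - lo) / float(quant_max - quant_min), eps);
    zp = quant_min - int(std::round(lo / scale));
    zp = std::min(std::max(zp, quant_min), quant_max); // clamp (assumed)
  } else if (mapping_type == 1) {
    // SYMMETRIC: balance around the quantized midpoint using max_abs.
    const float max_abs = std::max(std::abs(lo), std::abs(hi));
    scale = std::max(max_abs / (float(quant_max - quant_min) / 2.0f), eps);
    zp = (quant_min + quant_max + 1) / 2; // midpoint (convention assumed)
  } else {
    // SYMMETRIC_NO_CLIPPING_ERR: separate scales for the negative and
    // positive halves; taking the max guarantees neither side clips.
    const float scale_neg = std::abs(lo) / std::abs(float(quant_min));
    const float scale_pos = hi / float(quant_max);
    scale = std::max(std::max(scale_neg, scale_pos), eps);
    zp = (quant_min + quant_max + 1) / 2;
  }
}

int main() {
  float scale = 0.0f;
  int zp = 0;
  // One block whose values span [-0.9, 2.3], quantized to int8.
  calc_scale_zp(-0.9f, 2.3f, -128, 127, /*ASYMMETRIC*/ 0, 1e-5f, scale, zp);
  std::printf("scale=%.6f zp=%d\n", scale, zp); // scale = 0.012549, zp = -56
  return 0;
}
```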
# Block-wise Parameter Computation Implementation

Block-wise parameter computation enables fine-grained quantization analysis by dividing tensors into blocks and computing separate scale/zero_point parameters for each block.

The implementation uses several key data structures computed in `ChooseQParams.cpp`:

* **`block_size_vec`**: WHCN-ordered block dimensions converted from the PyTorch NCHW layout (e.g., [3,3,2,1] for 3×3×2×1 blocks)
* **`tensor_size_whcn`**: Input tensor dimensions converted to WHCN layout using `utils::make_whcn_ivec4()`
* **`num_blocks_vec`**: Number of blocks per dimension, calculated as `ceil(tensor_size_whcn / block_size_vec)` to handle non-divisible dimensions
* **`block_stride_vec`**: Pre-computed linear strides for block grid indexing, `{1, #W, #W*#H, #W*#H*#C}`, enabling efficient block ID calculation
* **`mapping_type`**: Integer encoding of the quantization strategy (0=ASYMMETRIC, 1=SYMMETRIC, 2=SYMMETRIC_NO_CLIPPING_ERR)

The block coordinate calculation uses `block_coord = block_id_to_coord(block_id)`, which converts a linear block ID back to a 4D WHCN coordinate, then computes the element ranges `t0 = block_coord * blockSize` and `tEnd = t0 + blockSize` for nested loop iteration.

# Shader Algorithm Overview

## Buffer Storage Implementation (`choose_qparams_buffer.glsl`)

**Workgroup Configuration**:

- **Global WG Size**: `{nBlocks, 1u, 1u}`, where `nBlocks` is the total number of blocks, computed from `ceil(tensor_size / block_size)` for each dimension
- **Local WG Size**: `{1u, 1u, 1u}` (a single thread per block for simplicity, though this could be optimized for larger blocks)

**Block-wise Mode Algorithm**:

The shader processes tensor blocks using a multi-level nested approach. Each thread is assigned multiple blocks using strided access: `for (uint block_id = gl_GlobalInvocationID.x; block_id < TOTAL_BLOCKS; block_id += STRIDE)`, where `STRIDE = gl_WorkGroupSize.x * gl_NumWorkGroups.x`.

For each assigned block, the algorithm performs several key steps (a host-side sketch follows these steps):

**1. Block Coordinate Conversion**: The `block_id_to_coord(block_id)` function converts a linear block ID to a 4D WHCN coordinate using modular arithmetic.

**2. Element Range Calculation**: Computes the inclusive start coordinate `t0 = bc * blockSize` and the exclusive end coordinate `tEnd = t0 + blockSize`, which define the block's element boundaries in tensor space.

**3. Nested Loop Min/Max Scan**: Uses four nested loops to iterate over every element of the block:

```glsl
for (int n = t0.w; n < tEnd.w; ++n)
  for (int c = t0.z; c < tEnd.z; ++c)
    for (int h = t0.y; h < tEnd.y; ++h)
      for (int w = t0.x; w < tEnd.x; ++w)
```

Each element is accessed via `tidx_to_bufi(ivec4(w, h, c, n), t_in_strides)`, which converts a 4D tensor coordinate to a linear buffer index with proper stride handling.

**4. Parameter Calculation**: Calls `calc_scale_zp(lo, hi, quant_min, quant_max, mapping_type, eps, scale, zp)`, which implements the three mapping strategies:

* **ASYMMETRIC (mapping_type=0)**: Maps the full range [min, max] to [quant_min, quant_max], preserving the data distribution
* **SYMMETRIC (mapping_type=1)**: Centers around zero using `max_abs = max(abs(min), abs(max))` for balanced quantization
* **SYMMETRIC_NO_CLIPPING_ERR (mapping_type=2)**: Computes separate scales for the positive and negative ranges and uses the maximum to prevent clipping
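To make the indexing concrete, here is a self-contained host-side C++ sketch of the block bookkeeping described above: the block-grid strides, `block_id_to_coord`, the `t0`/`tEnd` ranges, and the nested min/max scan. The `Vec4i` type, the contiguous WHCN strides, and the clamping of partial blocks are illustrative assumptions, not the actual `ChooseQParams.cpp` or shader code.

```cpp
#include <algorithm>
#include <array>
#include <cstdio>
#include <limits>
#include <vector>

// Hypothetical stand-in for a WHCN ivec4; illustrative only.
using Vec4i = std::array<int, 4>; // {W, H, C, N}

int main() {
  const Vec4i tensor_size = {8, 8, 4, 1}; // tensor dims, WHCN order
  const Vec4i block_size = {4, 4, 2, 1};  // block dims, WHCN order

  // num_blocks = ceil(tensor_size / block_size) per dimension.
  Vec4i num_blocks;
  for (int d = 0; d < 4; ++d) {
    num_blocks[d] = (tensor_size[d] + block_size[d] - 1) / block_size[d];
  }

  // Block-grid strides {1, #W, #W*#H, #W*#H*#C} for linear block IDs.
  const Vec4i block_stride = {
      1,
      num_blocks[0],
      num_blocks[0] * num_blocks[1],
      num_blocks[0] * num_blocks[1] * num_blocks[2]};
  const int total_blocks =
      num_blocks[0] * num_blocks[1] * num_blocks[2] * num_blocks[3];

  // Dummy contiguous WHCN input buffer (strides assumed for illustration).
  const Vec4i t_in_strides = {
      1,
      tensor_size[0],
      tensor_size[0] * tensor_size[1],
      tensor_size[0] * tensor_size[1] * tensor_size[2]};
  std::vector<float> input(
      static_cast<std::size_t>(tensor_size[0]) * tensor_size[1] *
      tensor_size[2] * tensor_size[3]);
  for (std::size_t i = 0; i < input.size(); ++i) {
    input[i] = 0.01f * static_cast<float>(i) - 1.0f;
  }

  // One iteration per block; the shader instead strides block_id by the
  // launch size so each thread can own several blocks.
  for (int block_id = 0; block_id < total_blocks; ++block_id) {
    // block_id_to_coord: linear block ID -> 4D WHCN block coordinate.
    Vec4i bc;
    for (int d = 0; d < 4; ++d) {
      bc[d] = (block_id / block_stride[d]) % num_blocks[d];
    }

    // Inclusive start / exclusive end of the block in tensor coordinates.
    // Clamping handles partial blocks on non-divisible dims (assumed here).
    Vec4i t0, tEnd;
    for (int d = 0; d < 4; ++d) {
      t0[d] = bc[d] * block_size[d];
      tEnd[d] = std::min(t0[d] + block_size[d], tensor_size[d]);
    }

    // Nested min/max scan over every element of the block.
    float lo = std::numeric_limits<float>::infinity();
    float hi = -std::numeric_limits<float>::infinity();
    for (int n = t0[3]; n < tEnd[3]; ++n)
      for (int c = t0[2]; c < tEnd[2]; ++c)
        for (int h = t0[1]; h < tEnd[1]; ++h)
          for (int w = t0[0]; w < tEnd[0]; ++w) {
            // tidx_to_bufi: 4D tensor coordinate -> linear buffer index.
            const int bufi = w * t_in_strides[0] + h * t_in_strides[1] +
                c * t_in_strides[2] + n * t_in_strides[3];
            lo = std::min(lo, input[bufi]);
            hi = std::max(hi, input[bufi]);
          }

    // lo/hi would now feed calc_scale_zp(...) for this block.
    std::printf("block %d: min=%.3f max=%.3f\n", block_id, lo, hi);
  }
  return 0;
}
```

With the assumed 8×8×4×1 input and 4×4×2×1 blocks this walks the 2×2×2×1 block grid (eight blocks); the real shader would write each block's scale/zero_point to the output buffers rather than printing.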
**Future Improvements**: Implement a workgroup-level reduction for large blocks, optimize memory access patterns for better cache utilization, and explore a texture storage implementation with simplified block alignment constraints.

Differential Revision: [D78436638](https://our.internmc.facebook.com/intern/diff/D78436638/)

cc SS-JIA manuelcandales cbilgin

[ghstack-poisoned]

2 parents d172021 + f35de65, commit 5ac4a93

37 files changed, +2893 −69 lines changed

CMakeLists.txt

Lines changed: 4 additions & 7 deletions
```diff
@@ -48,8 +48,6 @@
 cmake_minimum_required(VERSION 3.24)
 project(executorch)
 
-# MARK: - Start EXECUTORCH_H12025_BUILD_MIGRATION
-
 include(${PROJECT_SOURCE_DIR}/tools/cmake/common/preset.cmake)
 include(${PROJECT_SOURCE_DIR}/tools/cmake/Utils.cmake)
 include(CMakeDependentOption)
@@ -82,6 +80,7 @@ announce_configured_options(BUCK2)
 
 announce_configured_options(CMAKE_CXX_COMPILER_ID)
 announce_configured_options(CMAKE_TOOLCHAIN_FILE)
+announce_configured_options(BUILD_TESTING)
 
 load_build_preset()
 include(${PROJECT_SOURCE_DIR}/tools/cmake/preset/default.cmake)
@@ -97,11 +96,6 @@ else()
 endif()
 announce_configured_options(CCACHE_PROGRAM)
 
-# Print all the configs that were called with announce_configured_options.
-print_configured_options()
-
-# MARK: - End EXECUTORCH_H12025_BUILD_MIGRATION
-
 set(CMAKE_EXPORT_COMPILE_COMMANDS ON)
 
 # Setup RPATH. See
@@ -750,3 +744,6 @@ if(EXECUTORCH_BUILD_ANDROID_JNI)
 endif()
 
 include(Test.cmake)
+
+# Print all the configs that were called with announce_configured_options.
+print_configured_options()
```

CMakePresets.json

Lines changed: 8 additions & 16 deletions
```diff
@@ -8,7 +8,7 @@
     },
     {
       "name": "macos",
-      "displayName": "Build everything buildable on macOS",
+      "displayName": "Build ExecuTorch for macOS",
       "inherits": ["common"],
       "generator": "Xcode",
       "cacheVariables": {
@@ -25,7 +25,7 @@
     },
     {
       "name": "ios",
-      "displayName": "Build everything buildable on iOS",
+      "displayName": "Build ExecuTorch for iOS",
       "inherits": ["common"],
       "generator": "Xcode",
       "cacheVariables": {
@@ -42,7 +42,7 @@
     },
     {
       "name": "ios-simulator",
-      "displayName": "Build everything buildable on iOS simulator",
+      "displayName": "Build ExecuTorch for iOS Simulator",
       "inherits": ["common"],
       "generator": "Xcode",
       "cacheVariables": {
@@ -59,7 +59,7 @@
     },
     {
       "name": "linux",
-      "displayName": "Build everything buildable on Linux",
+      "displayName": "Build ExecuTorch for Linux",
       "inherits": ["common"],
       "cacheVariables": {
         "CMAKE_SYSTEM_NAME": "Linux",
@@ -88,29 +88,21 @@
     {
       "name": "llm",
       "displayName": "Build LLM libraries",
-      "inherits": [
-        "common"
-      ],
+      "inherits": ["common"],
       "cacheVariables": {
         "EXECUTORCH_BUILD_PRESET_FILE": "${sourceDir}/tools/cmake/preset/llm.cmake",
         "CMAKE_OSX_DEPLOYMENT_TARGET": "12.0"
       },
       "condition": {
         "type": "inList",
         "string": "${hostSystemName}",
-        "list": [
-          "Darwin",
-          "Linux",
-          "Windows"
-        ]
+        "list": ["Darwin", "Linux", "Windows"]
       }
     },
     {
       "name": "zephyr",
-      "displayName": "Build everything buildable on Zephyr RTOS",
-      "inherits": [
-        "common"
-      ],
+      "displayName": "Build ExecuTorch for Zephyr RTOS",
+      "inherits": ["common"],
       "cacheVariables": {
         "EXECUTORCH_BUILD_PRESET_FILE": "${sourceDir}/tools/cmake/preset/zephyr.cmake",
         "CMAKE_TOOLCHAIN_FILE": "${sourceDir}/examples/zephyr/x86_64-linux-arm-zephyr-eabi-gcc.cmake"
```

backends/nxp/nxp_backend.py

Lines changed: 2 additions & 1 deletion
```diff
@@ -174,7 +174,8 @@ def preprocess(
         # Otherwise, we get violation that this op is not part of ATen Core ops.
         edge_program._verifiers = [
             EXIREdgeDialectVerifier(
-                class_only=True, core_aten_ops_exception_list=[torch.ops.aten.max_pool2d.default]
+                class_only=True,
+                core_aten_ops_exception_list=[torch.ops.aten.max_pool2d.default],
             )
         ]
```

backends/qualcomm/qnn_preprocess.py

Lines changed: 5 additions & 0 deletions
```diff
@@ -178,6 +178,11 @@ def preprocess_multimethod(
 
         if len(py_op_wrapper_list) == len(edge_programs.values()):
             qnn_context_binary = qnn_manager.Compile(graph_name, py_op_wrapper_list)
+            if option.saver:
+                # TODO: Currently, only the first method is saved. Update this logic if saving multiple methods becomes necessary in the future.
+                exit(
+                    f"Record all QNN API calls from saver backend at: {option.saver_output_dir}"
+                )
             assert (
                 len(qnn_context_binary) != 0
             ), "Failed to generate Qnn context binary."
```

backends/qualcomm/tests/test_qnn_delegate.py

Lines changed: 66 additions & 0 deletions
```diff
@@ -3384,6 +3384,38 @@ def test_qnn_backend_rewrite_prepared_observer(self):
         quantized_module = convert_pt2e(prepared)
         self.lower_module_and_test_output(quantized_module, sample_input)
 
+    def test_qnn_backend_saver_backend(self):
+        backend_options = generate_htp_compiler_spec(use_fp16=False)
+        TestQNN.compiler_specs = generate_qnn_executorch_compiler_spec(
+            soc_model=self.chipset_table[TestQNN.model],
+            backend_options=backend_options,
+            saver=True,
+        )
+        module = Relu()  # noqa: F405
+        sample_input = (torch.randn([2, 5, 1, 3]),)
+        module = self.get_qdq_module(module, sample_input)
+
+        from executorch.backends.qualcomm.serialization.qc_schema_serialize import (
+            flatbuffer_to_option,
+            option_to_flatbuffer,
+        )
+
+        with tempfile.TemporaryDirectory() as tmp_dir:
+            option = flatbuffer_to_option(TestQNN.compiler_specs[0].value)
+            option.saver_output_dir = f"{tmp_dir}/saver_output"
+            TestQNN.compiler_specs[0].value = option_to_flatbuffer(option)
+
+            with self.assertRaises(SystemExit):
+                self.lower_module_and_test_output(module, sample_input)
+            self.assertTrue(
+                os.path.isfile(f"{tmp_dir}/saver_output/params.bin"),
+                "failed to find params.bin",
+            )
+            self.assertTrue(
+                os.path.isfile(f"{tmp_dir}/saver_output/saver_output.c"),
+                "failed to find saver_output.c",
+            )
+
     def test_qnn_backend_skip_node_id_partitioner(self):
         module = SimpleModel()  # noqa: F405
         sample_input = (torch.ones(1, 32, 28, 28), torch.ones(1, 32, 28, 28))
@@ -5022,6 +5054,40 @@ def test_swin_transformer(self):
         self.assertGreaterEqual(msg["top_1"], 60)
         self.assertGreaterEqual(msg["top_5"], 80)
 
+    def test_t5(self):
+        if not self.required_envs([self.qa_dataset]):
+            self.skipTest("missing required envs")
+        cmds = [
+            "python",
+            f"{self.executorch_root}/examples/qualcomm/oss_scripts/t5/t5.py",
+            "--dataset",
+            self.sentence_dataset,
+            "--artifact",
+            self.artifact_dir,
+            "--build_folder",
+            self.build_folder,
+            "--device",
+            self.device,
+            "--model",
+            self.model,
+            "--ip",
+            self.ip,
+            "--port",
+            str(self.port),
+        ]
+        if self.host:
+            cmds.extend(["--host", self.host])
+
+        p = subprocess.Popen(cmds, stdout=subprocess.DEVNULL)
+        with Listener((self.ip, self.port)) as listener:
+            conn = listener.accept()
+            p.communicate()
+            msg = json.loads(conn.recv())
+            if "Error" in msg:
+                self.fail(msg["Error"])
+            else:
+                self.assertGreaterEqual(msg["f1"], 0.7)
+
     def test_whisper(self):
         if not self.required_envs():
             self.skipTest("missing required envs")
```

backends/qualcomm/tests/utils.py

Lines changed: 1 addition & 0 deletions
```diff
@@ -183,6 +183,7 @@ class TestQNN(unittest.TestCase):
     executorch_root: str = ""
     artifact_dir: str = ""
     image_dataset: str = ""
+    qa_dataset: str = ""
     sentence_dataset: str = ""
     pretrained_weight: str = ""
     enable_profile: bool = False
```

backends/vulkan/op_registry.py

Lines changed: 1 addition & 0 deletions
```diff
@@ -693,6 +693,7 @@ def register_transfer_ops(features: OpFeatures):
         exir_ops.edge.aten.full_like.default,
         exir_ops.edge.aten.ones.default,
         exir_ops.edge.aten.ones_like.default,
+        exir_ops.edge.aten.scalar_tensor.default,
         exir_ops.edge.aten.upsample_nearest2d.vec,
         exir_ops.edge.aten.upsample_bilinear2d.vec,
         exir_ops.edge.aten.zeros.default,
```

backends/vulkan/quantizer/TARGETS

Lines changed: 5 additions & 11 deletions
```diff
@@ -3,24 +3,18 @@ load("@fbcode_macros//build_defs:python_library.bzl", "python_library")
 oncall("executorch")
 
 python_library(
-    name = "vulkan_quantizer_utils",
-    srcs = [
-        "vulkan_quantizer_utils.py",
-    ],
+    name = "vulkan_quantizer",
+    srcs = ["vulkan_quantizer.py"],
     deps = [
+        ":vulkan_quantizer_utils",
         "//caffe2:torch",
-        "//pytorch/ao:torchao",  # @manual
     ],
 )
 
 python_library(
-    name = "vulkan_quantizer",
-    srcs = [
-        "vulkan_quantizer.py",
-    ],
+    name = "vulkan_quantizer_utils",
+    srcs = ["vulkan_quantizer_utils.py"],
     deps = [
-        ":vulkan_quantizer_utils",
         "//caffe2:torch",
-        "//pytorch/ao:torchao",  # @manual
     ],
 )
```

backends/vulkan/runtime/graph/ComputeGraph.cpp

Lines changed: 8 additions & 0 deletions
```diff
@@ -273,6 +273,14 @@ vkapi::ScalarType ComputeGraph::dtype_of(const ValueRef idx) const {
     return val.toConstTensor().dtype();
   } else if (val.isTensorRef()) {
     return val.toConstTensorRef().dtype;
+  } else if (val.isBool()) {
+    return vkapi::ScalarType::Bool;
+  } else if (val.isDouble()) {
+    // We downcast anyway in the shader and we want to avoid having to
+    // write special cases there.
+    return vkapi::ScalarType::Float;
+  } else if (val.isInt()) {
+    return vkapi::ScalarType::Int;
   }
   VK_THROW("Could not get dtype of value with type ", val.type());
 }
```
Lines changed: 55 additions & 0 deletions
```diff
@@ -0,0 +1,55 @@
+/*
+ * Copyright (c) Meta Platforms, Inc. and affiliates.
+ * All rights reserved.
+ *
+ * This source code is licensed under the BSD-style license found in the
+ * LICENSE file in the root directory of this source tree.
+ */
+
+#version 450 core
+
+#define PRECISION ${PRECISION}
+
+#define BUF_T ${buffer_scalar_type(DTYPE)}
+#define VEC4_T ${texel_type(DTYPE)}
+
+${define_active_storage_type(STORAGE)}
+${define_required_extensions(DTYPE)}
+${define_required_extensions(SCALAR_VALUE_TYPE)}
+
+#include "indexing_utils.h"
+
+layout(std430) buffer;
+
+${layout_declare_tensor(B, "w", "t_out", DTYPE, STORAGE)}
+${layout_declare_ubo(B, buffer_scalar_type(SCALAR_VALUE_TYPE), "scalar_value")}
+
+layout(local_size_x_id = 0, local_size_y_id = 1, local_size_z_id = 2) in;
+
+#ifdef USING_BUFFER
+
+void main() {
+  const int i = int(gl_GlobalInvocationID.x);
+
+  if (i > 0) {
+    return;
+  }
+
+  t_out[i] = BUF_T(scalar_value);
+}
+
+#else // !USING_BUFFER
+
+void main() {
+  const ivec3 pos = ivec3(gl_GlobalInvocationID);
+
+  // Scalar tensor is a special case where the packed dim is always 1.
+  if (any(greaterThanEqual(pos, ivec3(1)))) {
+    return;
+  }
+
+  VEC4_T outtex = VEC4_T(scalar_value);
+  write_texel(t_out, pos, outtex);
+}
+
+#endif // !USING_BUFFER
```
