Skip to content

Commit dff84d5

Browse files
author
morelos
committed
Update base for Update on "[ET-VK][Ops] torchao.dequantize_affine vulkan impl and shader and cleanup"
# Changes * Implement `torchao.dequantize_affine` operator in Vulkan backend with comprehensive texture and buffer storage support * Add block-wise dequantization mode in `dequantize_texture.glsl` and `dequantize_buffer.glsl` shaders for configurable tensor block dequantization * Extend dequantization infrastructure in `Dequantize.cpp` to handle affine transformations with configurable block sizes and quantization parameters * Support integer-to-floating-point conversion with precise reconstruction of original values BE: Improved the documentation in the shader logic which is more detailed and clear # Motivation The existing Vulkan quantization infrastructure lacked support for the `torchao.dequantize_affine` operator, which is essential for completing the quantization-dequantization cycle in dynamic quantization workflows. The `dequantize_affine` operator provides flexible block-wise dequantization that reconstructs floating-point values from quantized integer blocks, enabling: * **Block-wise Dequantization**: Reconstructs floating-point values from configurable tensor blocks using separate scale and zero-point parameters, enabling precise recovery of original data distributions * **Affine Transformation**: Uses the formula `value = (qvalue - zero_point) * scale` for accurate integer-to-floating-point mapping * **TorchAO Integration**: Seamless compatibility with TorchAO quantization workflows and completes the quantization-dequantization round-trip # Operator Description The `dequantize_affine` operator converts n-bit integer tensor values back to floating-point representations using pre-computed quantization parameters (scale and zero_point) applied to configurable tensor blocks. Block-wise dequantization divides tensors into blocks and applies separate dequantization parameters to each block, allowing fine-grained reconstruction of the original floating-point precision. The dequantization formula is: `value = (qvalue - zero_point) * scale` **Storage Requirements**: Scale and zero_point tensors must use buffer storage with width-packed layout. Input/output tensors support both buffer and texture storage with standard axis mapping. Input tensors must be integer types (kByte, kChar, kInt). # Block-wise Dequantization Implementation Block-wise dequantization enables fine-grained reconstruction by dividing tensors into blocks and applying separate dequantization parameters to each block. The implementation uses the same key data structures computed in `Dequantize.cpp`: * **`block_size_vec`**: WHCN-ordered block dimensions converted from PyTorch NCHW layout (e.g., [3,3,2,1] for 3×3×2×1 blocks) * **`tensor_size_whcn`**: Input tensor dimensions converted to WHCN layout using `utils::make_whcn_ivec4()` * **`num_blocks_vec`**: Number of blocks per dimension calculated as `tensor_size_whcn / block_size_vec` * **`block_stride_vec`**: Pre-computed linear strides for block grid indexing `{1, #W, #W*#H, #W*#H*#C}` to enable efficient block ID calculation The block coordinate calculation uses: `bcoord = tidx / blockSize` where `tidx` is the tensor coordinate in WHCN layout, then the linear block ID is computed as: `block_id = bcoord.x * blockStride.x + bcoord.y * blockStride.y + bcoord.z * blockStride.z + bcoord.w * blockStride.w` # Shader Algorithm Overview ## Texture Storage Implementation (`dequantize_texture.glsl`) **Workgroup Configuration**: - **Global WG Size**: Default sizing based on texture dimensions - **Local WG Size**: Default with special handling for batch dimension dequantization (Z dimension set to 1 for proper workgroup dispatching when `global_workgroup_size[2] > 1`) **Block-wise Mode Algorithm**: The shader processes 3D texture positions where each position represents a texel containing 4 width-packed integer components. For each texel at position `pos`, it calculates a base tensor index `base_tidx = ivec4(pos.x * 4, pos.y, pos.z, 0)` to account for width-packing. For each of the 4 components in the texel, it computes the actual tensor coordinate: `tidx = ivec4(base_tidx.x + i, base_tidx.y, (foldedZ % C_total), (foldedZ / C_total))` where `foldedZ = pos.z` handles batch-channel folding in 4D tensors and `C_total = numBlocks.z * blockSize.z` represents the total channel dimension. The block coordinate is calculated using integer division: `bcoord = tidx / blockSize`, then the linear block ID uses pre-computed strides: `block_id = bcoord.x * blockStride.x + bcoord.y * blockStride.y + bcoord.z * blockStride.z + bcoord.w * blockStride.w`. Each integer component is dequantized using its corresponding block's parameters: `value = dequantize_val(qvalue, t_scale[block_id], t_zero_point[block_id])` where `dequantize_val()` applies the formula `(qvalue - zero_point) * scale`. The reconstructed floating-point values are written to the output texel with proper type handling for double precision outputs. ## Buffer Storage Implementation (`dequantize_buffer.glsl`) **Workgroup Configuration**: - **Global WG Size**: Default sizing based on buffer element count - **Local WG Size**: Default sizing without special constraints **Block-wise Mode Algorithm**: The shader processes linear buffer indices using `gl_GlobalInvocationID.x` as the output buffer index. It converts this to tensor coordinates using `bufi_to_tidx(out_bufi, t_out_strides, out_dim_order)` which handles the buffer-to-tensor index mapping with proper stride calculations. For each element, it computes the block coordinate directly: `bcoord = out_tidx / blockSize` where `out_tidx` is the 4D tensor coordinate in WHCN layout. The linear block ID calculation uses the same pre-computed stride approach: `block_id = bcoord.x * blockStride.x + bcoord.y * blockStride.y + bcoord.z * blockStride.z + bcoord.w * blockStride.w`. The quantized integer value is loaded using the corresponding input buffer index: `qvalue = t_in[in_bufi]` where `in_bufi = tidx_to_bufi(out_tidx, t_in_strides)`. Dequantization applies the block-specific parameters: `value = dequantize_val(qvalue, t_scale[block_id], t_zero_point[block_id])` to reconstruct the original floating-point value. **Future Improvements**: Dynamic workgroup sizing based on block dimensions Differential Revision: [D78435552](https://our.internmc.facebook.com/intern/diff/D78435552/) cc SS-JIA manuelcandales cbilgin [ghstack-poisoned]
2 parents bd99157 + f57633b commit dff84d5

Some content is hidden

Large Commits have some content hidden by default. Use the searchbox below for content that may be hidden.

42 files changed

+1088
-105
lines changed

.github/workflows/trunk.yml

Lines changed: 31 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -302,6 +302,37 @@ jobs:
302302
exit 1
303303
fi
304304
305+
test-arm-ootb-linux:
306+
name: test-arm-ootb-linux
307+
uses: pytorch/test-infra/.github/workflows/linux_job_v2.yml@main
308+
permissions:
309+
id-token: write
310+
contents: read
311+
with:
312+
runner: linux.2xlarge
313+
docker-image: executorch-ubuntu-22.04-arm-sdk
314+
submodules: 'recursive'
315+
ref: ${{ github.event_name == 'pull_request' && github.event.pull_request.head.sha || github.sha }}
316+
timeout: 90
317+
script: |
318+
# The generic Linux job chooses to use base env, not the one setup by the image
319+
CONDA_ENV=$(conda env list --json | jq -r ".envs | .[-1]")
320+
conda activate "${CONDA_ENV}"
321+
322+
# Follow the steps required before running the notebooks
323+
# Try to mirror these as closely as possible
324+
source .ci/scripts/utils.sh
325+
install_executorch "--use-pt-pinned-commit"
326+
327+
.ci/scripts/setup-arm-baremetal-tools.sh
328+
source examples/arm/ethos-u-scratch/setup_path.sh
329+
330+
# Install requirements for converting notebooks
331+
pip install notebook
332+
333+
# Run OOTB tests
334+
backends/arm/test/test_arm_ootb.sh
335+
305336
test-coreml-delegate:
306337
name: test-coreml-delegate
307338
uses: pytorch/test-infra/.github/workflows/macos_job.yml@main

CMakePresets.json

Lines changed: 2 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -15,7 +15,8 @@
1515
"CMAKE_TOOLCHAIN_FILE": "${sourceDir}/third-party/ios-cmake/ios.toolchain.cmake",
1616
"EXECUTORCH_BUILD_PRESET_FILE": "${sourceDir}/tools/cmake/preset/macos.cmake",
1717
"PLATFORM": "MAC_ARM64",
18-
"DEPLOYMENT_TARGET": "12.0"
18+
"DEPLOYMENT_TARGET": "12.0",
19+
"CMAKE_MACOSX_BUNDLE": "OFF"
1920
},
2021
"condition": {
2122
"lhs": "${hostSystemName}",

backends/arm/_passes/decompose_asin_pass.py

Lines changed: 2 additions & 3 deletions
Original file line numberDiff line numberDiff line change
@@ -85,12 +85,11 @@ def _build_polynomial(
8585
return result
8686

8787
def call_operator(self, op, args, kwargs, meta):
88+
if op not in edge_asin_op:
89+
return super().call_operator(op, args, kwargs, meta)
8890
logging.info(
8991
f"Approximating asin. This may introduce small numerical errors. For details, see {__file__}."
9092
)
91-
if op not in edge_asin_op:
92-
return super().call_operator(op, args, kwargs, meta)
93-
9493
x = args[0]
9594
half = 0.5
9695
one = 1.0

backends/arm/test/ops/test_add.py

Lines changed: 0 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -205,6 +205,5 @@ def test_add_tensor_vgf_int(test_data: input_t1):
205205
aten_op,
206206
exir_op,
207207
tosa_version="TOSA-1.0+INT",
208-
symmetric_io_quantization=True,
209208
)
210209
pipeline.run()

backends/arm/test/test_arm_ootb.sh

Lines changed: 18 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,18 @@
1+
#!/usr/bin/env bash
2+
3+
# Copyright 2025 Arm Limited and/or its affiliates.
4+
#
5+
# This source code is licensed under the BSD-style license found in the
6+
# LICENSE file in the root directory of this source tree.
7+
8+
set -e
9+
10+
run_ootb_tests_ethos_u() {
11+
echo "$FUNCNAME: Running out-of-the-box tests for Arm Ethos-U"
12+
jupyter nbconvert \
13+
--to notebook \
14+
--execute examples/arm/ethos_u_minimal_example.ipynb
15+
echo "${FUNCNAME}: PASS"
16+
}
17+
18+
run_ootb_tests_ethos_u

backends/arm/test/tester/arm_tester.py

Lines changed: 3 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -43,6 +43,7 @@
4343
EthosUQuantizer,
4444
get_symmetric_quantization_config,
4545
TOSAQuantizer,
46+
VgfQuantizer,
4647
)
4748
from executorch.backends.arm.test.runner_utils import (
4849
dbg_tosa_fb_to_json,
@@ -332,6 +333,8 @@ def quantize(
332333
quantizer = TOSAQuantizer(tosa_spec)
333334
elif is_ethosu(self.compile_spec):
334335
quantizer = EthosUQuantizer(self.compile_spec)
336+
elif is_vgf(self.compile_spec):
337+
quantizer = VgfQuantizer(self.compile_spec)
335338
quantize_stage = tester.Quantize(
336339
quantizer,
337340
get_symmetric_quantization_config(),

backends/arm/test/tester/test_pipeline.py

Lines changed: 5 additions & 7 deletions
Original file line numberDiff line numberDiff line change
@@ -861,18 +861,15 @@ def __init__(
861861
rtol: float = 1e-03,
862862
qtol: int = 1,
863863
dynamic_shapes: Optional[Tuple[Any]] = None,
864+
transform_passes: Optional[
865+
Union[Sequence[PassType], Dict[str, Sequence[PassType]]]
866+
] = None,
864867
):
865868

866869
if (
867870
symmetric_io_quantization or per_channel_quantization
868871
) and tosa_version == "TOSA-1.0+FP":
869872
raise ValueError("Dont configure quantization with FP TOSA profile.")
870-
if (
871-
symmetric_io_quantization is False
872-
and per_channel_quantization is False
873-
and tosa_version == "TOSA-1.0+INT"
874-
):
875-
raise ValueError("Missing quantization options for INT TOSA profile.")
876873

877874
tosa_profile = TosaSpecification.create_from_string(tosa_version)
878875
compile_spec = common.get_vgf_compile_spec(
@@ -887,6 +884,7 @@ def __init__(
887884
exir_op,
888885
use_to_edge_transform_and_lower,
889886
dynamic_shapes,
887+
transform_passes=transform_passes,
890888
)
891889

892890
if symmetric_io_quantization or per_channel_quantization:
@@ -900,7 +898,7 @@ def __init__(
900898
else:
901899
quant_stage = None
902900

903-
if quant_stage:
901+
if "INT" in tosa_version:
904902
self.add_stage(self.tester.quantize, quant_stage, pos=0)
905903

906904
self.add_stage_after(

backends/cadence/aot/TARGETS

Lines changed: 2 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -539,8 +539,10 @@ python_unittest(
539539
],
540540
typing = True,
541541
deps = [
542+
":program_builder",
542543
"//executorch/backends/cadence/aot:graph_builder",
543544
"//executorch/backends/cadence/aot:ops_registrations",
545+
"//executorch/runtime:runtime",
544546
"//later:lib",
545547
],
546548
)

backends/cadence/aot/memory_planning.py

Lines changed: 10 additions & 3 deletions
Original file line numberDiff line numberDiff line change
@@ -19,7 +19,10 @@
1919
MemoryPlanningAlgo,
2020
MemoryPlanningState,
2121
)
22-
from executorch.backends.cadence.aot.utils import MemoryConfig
22+
from executorch.backends.cadence.aot.utils import (
23+
MemoryConfig,
24+
MemoryPlanningAlgoFailure,
25+
)
2326

2427
from executorch.exir import ExecutorchProgramManager
2528
from executorch.exir.memory_planning import collect_specs_from_nodes, Verifier
@@ -95,7 +98,9 @@ def plan(
9598
):
9699
self.plan_spec(spec, state, placement_constraints)
97100
if not state.is_placed(spec):
98-
raise MemoryError(f"Cannot fit {spec} in any memory hierarchy")
101+
raise MemoryPlanningAlgoFailure(
102+
f"Cannot fit {spec} {spec.allocated_memory=} in any memory hierarchy for {self.memory_config}"
103+
)
99104

100105

101106
class GreedyWithHeuristic(MemoryPlanningAlgo):
@@ -169,7 +174,9 @@ def plan(
169174
):
170175
self.plan_spec(spec, state, placement_constraints)
171176
if not state.is_placed(spec):
172-
raise MemoryError(f"Cannot fit {spec} in any memory hierarchy")
177+
raise MemoryPlanningAlgoFailure(
178+
f"Cannot fit {spec} in any memory hierarchy for {self.memory_config}"
179+
)
173180

174181
logging.debug(
175182
f"greedy by size for offset calculation with hierarchy returns bufsizes: {state.bufsizes}"

backends/cadence/aot/tests/test_fusion_ops_passes.py

Lines changed: 0 additions & 39 deletions
Original file line numberDiff line numberDiff line change
@@ -12,7 +12,6 @@
1212

1313
import executorch.backends.cadence.aot.ops_registrations # noqa
1414
import torch
15-
from executorch.backends.cadence.aot import compiler
1615
from executorch.backends.cadence.aot.fuse_ops import (
1716
FuseCascadedTransposeOrPermuteOps,
1817
FuseCascadedViewOps,
@@ -30,7 +29,6 @@
3029
from executorch.exir.dialects._ops import ops as exir_ops
3130
from executorch.exir.dialects.edge._ops import EdgeOpOverload
3231
from executorch.exir.pass_base import PassResult, ProxyValue
33-
from torch import nn
3432

3533

3634
class TestFusionPassesBase(unittest.TestCase):
@@ -178,43 +176,6 @@ def test_keep_mm_add_with_multiple_users(self) -> None:
178176
self.assertEqual(count_node(converted_graph, exir_ops.edge.aten.mm.default), 1)
179177
self.assertEqual(count_node(converted_graph, exir_ops.edge.aten.add.Tensor), 3)
180178

181-
# TODO(matthiascremon) -> None: enable that pass with new flow
182-
@torch.no_grad()
183-
@unittest.expectedFailure
184-
def test_legacy_conv_bn_fusion(self) -> None:
185-
class ModelConvBN(torch.nn.Module):
186-
def __init__(
187-
self, in_features: int, out_features: int, kernel_size: int
188-
) -> None:
189-
super().__init__()
190-
self.conv1d = nn.Conv1d(in_features, out_features, kernel_size)
191-
self.bn = nn.BatchNorm1d(out_features)
192-
193-
def forward(self, x: torch.Tensor) -> torch.Tensor:
194-
y = self.conv1d(x)
195-
return self.bn(y)
196-
197-
model = ModelConvBN(64, 1, 2)
198-
x = torch.randn(1, 64, 4)
199-
200-
graph_module = (
201-
compiler.export_to_executorch_gen_etrecord(model.eval(), (x,))
202-
.exported_program()
203-
.graph_module
204-
)
205-
# Assert that after running the fusion passes, batchnorm was fused with conv1d
206-
self.assertEqual(
207-
count_node(graph_module, torch.ops.aten.linear.out)
208-
+ count_node(graph_module, torch.ops.cadence.convolution.out),
209-
1,
210-
)
211-
self.assertEqual(
212-
count_node(
213-
graph_module, torch.ops.aten._native_batch_norm_legit_no_training.out
214-
),
215-
0,
216-
)
217-
218179
def test_permute_transpose_fusion(self) -> None:
219180
builder = GraphBuilder()
220181
x = builder.placeholder("x", torch.randn(3, 1, 3, 1, 4, dtype=torch.float32))

0 commit comments

Comments
 (0)