[ET-VK] Creating get_symmetric_quantization_config #12999
Conversation
Pull Request resolved: #12572

# Context

As the vulkan_quantizer file expands, we will eventually need to migrate to a custom utils file and stop depending on xnnpack_quantizer_utils. We migrate only the minimal set of functions necessary to keep the vulkan_quantizer working.

# Changes

We create a new file, `vulkan_quantizer_utils.py`, and migrate `vulkan_quantizer` off of `xnnpack_quantizer_utils.py`. No specific modifications are needed to work separately from the xnnpack utils, except for the addition of `bits_to_range`, which avoids having to specify the quantization ranges every time.

ghstack-source-id: 299473612
@exported-using-ghexport

Differential Revision: [D78290055](https://our.internmc.facebook.com/intern/diff/D78290055/)
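As a rough illustration of what `bits_to_range` provides, here is a minimal sketch of such a helper, assuming signed two's-complement ranges; the actual signature and defaults in `vulkan_quantizer_utils.py` may differ.

```python
# Hypothetical sketch of a bits_to_range helper: derive (quant_min, quant_max)
# from a bit width so callers don't have to spell out the ranges every time.
# The signed/unsigned handling here is an assumption, not the actual source.
def bits_to_range(bits: int, signed: bool = True) -> tuple:
    if signed:
        return (-(2 ** (bits - 1)), 2 ** (bits - 1) - 1)  # e.g. 8 -> (-128, 127)
    return (0, 2 ** bits - 1)                             # e.g. 8 -> (0, 255)

assert bits_to_range(8) == (-128, 127)
```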
Pull Request resolved: #12573

# Context

Dynamic quantization will eventually be enabled in the vulkan_quantizer (particularly 8-bit dynamic activations with 8-bit weights). To enable this functionality, we need a method similar to XNNPACK's for defining the quantization config. This diff aligns with the XNNPACK quantizer logic and migrates away from the old static quantization config logic.

# Changes

A few noticeable changes: we migrate away from `get_linear_weight_only_qcs_xnn_qconfig` and now define a symmetric config with a parameter that controls whether it is dynamically quantized. We also incorporate `bits_to_range` so that the min and max quant ranges are designated automatically rather than set during initialization. Some wording is changed away from plain "static", since we now enable dynamic quantization as well. Finally, we update other internal codebases that call the existing legacy config to use the more universal symmetric config. Since this follows the same naming scheme as XNNPACK, aliases are added where it is imported directly alongside XNNPACK.

ghstack-source-id: 299473613
@exported-using-ghexport

Differential Revision: [D78291249](https://our.internmc.facebook.com/intern/diff/D78291249/)
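For context, a hedged usage sketch of the new config follows. `get_symmetric_quantization_config` and `VulkanQuantizer` come from this stack, but the import path, keyword names (`is_dynamic`, `weight_bits`), and the `set_global` flow are assumptions modeled on the XNNPACK quantizer, not the exact API.

```python
# Usage sketch only: the import path, keyword names, and flow below are
# assumptions modeled on the XNNPACK quantizer, not the landed API.
from executorch.backends.vulkan.quantizer.vulkan_quantizer import (
    VulkanQuantizer,
    get_symmetric_quantization_config,
)

# 8-bit dynamic activations with 8-bit symmetric weights; the quant ranges
# are derived automatically via bits_to_range rather than passed explicitly.
quantizer = VulkanQuantizer().set_global(
    get_symmetric_quantization_config(is_dynamic=True, weight_bits=8)
)
```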
This PR was created by the merge bot to help merge the original PR into the main branch.
ghstack PR number: #12574 by @ahmtox
^ Please use this as the source of truth for the PR details, comments, and reviews
ghstack PR base: https://github.com/pytorch/executorch/tree/gh/ahmtox/42/base
ghstack PR head: https://github.com/pytorch/executorch/tree/gh/ahmtox/42/head
Merge bot PR base: https://github.com/pytorch/executorch/tree/gh/ahmtox/41/orig
Merge bot PR head: https://github.com/pytorch/executorch/tree/gh/ahmtox/42/orig
@diff-train-skip-merge

Co-authored-by: morelos <[email protected]>
Co-authored-by: ahmtox <[email protected]>
…ffer only) and cleanup (#13003)

# Changes

* Implement the `torchao.choose_qparams_affine` operator in the Vulkan backend with comprehensive buffer storage support
* Add block-wise quantization parameter computation in the `choose_qparams_buffer.glsl` shader for configurable tensor block analysis
* Extend the quantization parameter infrastructure in `ChooseQParams.cpp` to handle affine transformations with configurable block sizes and multiple mapping types
* Support three quantization mapping strategies, ASYMMETRIC, SYMMETRIC, and SYMMETRIC_NO_CLIPPING_ERR, for optimal parameter selection
* Consolidate the logic for choosing scale and zero point between the affine cases and the regular quantized_decomposed cases

BE: Improved the documentation in the shader logic, which is now more detailed and clear.

# Motivation

The existing Vulkan quantization infrastructure lacked support for the `torchao.choose_qparams_affine` operator, which is essential for computing optimal quantization parameters in dynamic quantization workflows. The `choose_qparams_affine` operator provides flexible block-wise parameter computation that analyzes statistical distributions within tensor blocks, enabling:

* **Block-wise Parameter Computation**: Analyzes configurable tensor blocks to compute optimal scale and zero-point values, improving quantization accuracy for heterogeneous data distributions
* **Multiple Mapping Types**: Supports ASYMMETRIC, SYMMETRIC, and SYMMETRIC_NO_CLIPPING_ERR quantization strategies for different precision/performance trade-offs

# Operator Description

The `choose_qparams_affine` operator computes optimal quantization parameters (scale and zero_point) from floating-point tensor blocks using statistical analysis of data distributions. Block-wise parameter computation divides tensors into blocks and analyzes each block independently to determine the best quantization mapping for subsequent quantization operations.

The parameter calculation varies by mapping type (see the sketch below):

- **ASYMMETRIC**: `scale = (max - min) / (quant_max - quant_min)`, `zero_point = quant_min - round(min / scale)`
- **SYMMETRIC**: `scale = max_abs / ((quant_max - quant_min) / 2)`, `zero_point = midpoint`
- **SYMMETRIC_NO_CLIPPING_ERR**: `scale = max(abs(min) / abs(quant_min), max / quant_max)`, `zero_point = midpoint`

**Storage Requirements**: Input tensors must be floating-point (kFloat) with width-packed layout. Output scale/zero_point tensors use buffer storage.

NOTE: A texture storage implementation is not supported, due to the complexity of block-wise coordinate mapping in 3D texture space. It will likely be necessary for better efficiency in the future.
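To make these strategies concrete, here is a minimal Python sketch of the per-block parameter math. It follows the formulas listed above; the function name, the `eps` clamping, the zero-inclusion step for the asymmetric case, and the exact midpoint expression are illustrative assumptions rather than the literal shader code.

```python
# Minimal sketch of per-block scale/zero_point selection, mirroring the
# mapping-type formulas above. The eps clamping, zero-inclusion, and
# midpoint expression are assumptions, not the shader source.
def choose_qparams_block(lo, hi, quant_min, quant_max, mapping_type, eps=1e-5):
    if mapping_type == 0:  # ASYMMETRIC: map [min, max] onto [quant_min, quant_max]
        lo, hi = min(lo, 0.0), max(hi, 0.0)  # assumed: range must include zero
        scale = max((hi - lo) / (quant_max - quant_min), eps)
        zero_point = quant_min - round(lo / scale)
    else:  # SYMMETRIC and SYMMETRIC_NO_CLIPPING_ERR place zero_point at the midpoint
        if mapping_type == 1:  # SYMMETRIC: one scale from the largest magnitude
            max_abs = max(abs(lo), abs(hi))
            scale = max(max_abs / ((quant_max - quant_min) / 2), eps)
        else:  # SYMMETRIC_NO_CLIPPING_ERR: separate pos/neg scales, take the max
            scale = max(abs(lo) / abs(quant_min), hi / quant_max, eps)
        zero_point = (quant_min + quant_max + 1) // 2  # assumed midpoint form
    return scale, int(zero_point)

# Example: int8 symmetric parameters for a block spanning [-0.5, 2.0]
print(choose_qparams_block(-0.5, 2.0, -128, 127, mapping_type=1))
# -> (approximately 0.0157, 0)
```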
# Block-wise Parameter Computation Implementation

Block-wise parameter computation enables fine-grained quantization analysis by dividing tensors into blocks and computing a separate scale/zero_point pair for each block.

The implementation uses several key data structures computed in `ChooseQParams.cpp`:

* **`block_size_vec`**: WHCN-ordered block dimensions converted from the PyTorch NCHW layout (e.g., [3,3,2,1] for 3×3×2×1 blocks)
* **`tensor_size_whcn`**: Input tensor dimensions converted to WHCN layout using `utils::make_whcn_ivec4()`
* **`num_blocks_vec`**: The number of blocks per dimension, calculated as `ceil(tensor_size_whcn / block_size_vec)` to handle non-divisible dimensions
* **`block_stride_vec`**: Pre-computed linear strides for block grid indexing, `{1, #W, #W*#H, #W*#H*#C}`, enabling efficient block ID calculation
* **`mapping_type`**: Integer encoding of the quantization strategy (0=ASYMMETRIC, 1=SYMMETRIC, 2=SYMMETRIC_NO_CLIPPING_ERR)

The block coordinate calculation uses `block_coord = block_id_to_coord(block_id)`, which converts linear block IDs back to 4D WHCN coordinates, then computes the element ranges `t0 = block_coord * blockSize` and `tEnd = t0 + blockSize` for nested loop iteration.

# Shader Algorithm Overview

## Buffer Storage Implementation (`choose_qparams_buffer.glsl`)

**Workgroup Configuration**:

- **Global WG Size**: `{nBlocks, 1u, 1u}`, where `nBlocks` is the total number of blocks, computed from `ceil(tensor_size / block_size)` for each dimension
- **Local WG Size**: `{1u, 1u, 1u}` (a single thread per block for simplicity, though this could be optimized for larger blocks)

**Block-wise Mode Algorithm**: The shader uses a multi-level nested approach to process tensor blocks efficiently. Each thread is assigned multiple blocks using strided access: `for (uint block_id = gl_GlobalInvocationID.x; block_id < TOTAL_BLOCKS; block_id += STRIDE)`, where `STRIDE = gl_WorkGroupSize.x * gl_NumWorkGroups.x`.

For each assigned block, the algorithm performs several key steps (a Python sketch of the indexing math follows this section):

**1. Block Coordinate Conversion**: The `block_id_to_coord(block_id)` function converts linear block IDs to 4D WHCN coordinates using modular arithmetic.

**2. Element Range Calculation**: Computes the inclusive start coordinate `t0 = bc * blockSize` and the exclusive end coordinate `tEnd = t0 + blockSize` to define the block's element boundaries in tensor space.

**3. Nested Loop Min/Max Scan**: Uses four nested loops to iterate through all elements within the block:
`for (int n = t0.w; n < tEnd.w; ++n) for (int c = t0.z; c < tEnd.z; ++c) for (int h = t0.y; h < tEnd.y; ++h) for (int w = t0.x; w < tEnd.x; ++w)`
Each element is accessed via `tidx_to_bufi(ivec4(w,h,c,n), t_in_strides)`, which converts 4D tensor coordinates to linear buffer indices with proper stride handling.

**4. Parameter Calculation**: Calls `calc_scale_zp(lo, hi, quant_min, quant_max, mapping_type, eps, scale, zp)`, which implements the three mapping strategies:

* **ASYMMETRIC (mapping_type=0)**: Maps the full range [min, max] to [quant_min, quant_max], preserving the data distribution
* **SYMMETRIC (mapping_type=1)**: Centers around zero using `max_abs = max(abs(min), abs(max))` for balanced quantization
* **SYMMETRIC_NO_CLIPPING_ERR (mapping_type=2)**: Computes separate scales for the positive and negative ranges and uses the maximum to prevent clipping

**Future Improvements**: Implement workgroup-level reduction for large blocks, optimize memory access patterns for better cache utilization, and explore a texture storage implementation with simplified block alignment constraints.
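To make the indexing math concrete, here is a small Python model of the block traversal described in steps 1–3 above. The helper names follow the shader description (`block_id_to_coord`, `tidx_to_bufi`), but the Python signatures, the contiguous stride choice, and the demo tensor are illustrative assumptions.

```python
import math

# Convert a linear block ID to 4D WHCN block coordinates using modular
# arithmetic over num_blocks = [#W, #H, #C, #N], matching the block grid
# strides {1, #W, #W*#H, #W*#H*#C} described above.
def block_id_to_coord(block_id, num_blocks):
    w = block_id % num_blocks[0]
    h = (block_id // num_blocks[0]) % num_blocks[1]
    c = (block_id // (num_blocks[0] * num_blocks[1])) % num_blocks[2]
    n = block_id // (num_blocks[0] * num_blocks[1] * num_blocks[2])
    return (w, h, c, n)

# Scan one block for its min/max, mirroring the shader's four nested loops.
# t0 is the inclusive start, t_end the exclusive end of the block.
# (Clamping t_end for non-divisible dimensions is omitted for brevity.)
def block_min_max(data, strides, block_coord, block_size):
    t0 = [bc * bs for bc, bs in zip(block_coord, block_size)]
    t_end = [start + bs for start, bs in zip(t0, block_size)]
    lo, hi = math.inf, -math.inf
    for n in range(t0[3], t_end[3]):
        for c in range(t0[2], t_end[2]):
            for h in range(t0[1], t_end[1]):
                for w in range(t0[0], t_end[0]):
                    # tidx_to_bufi: 4D WHCN coordinate -> linear buffer index
                    bufi = (w * strides[0] + h * strides[1]
                            + c * strides[2] + n * strides[3])
                    lo, hi = min(lo, data[bufi]), max(hi, data[bufi])
    return lo, hi

# Demo: a 4x4 tensor (W=4, H=4, C=1, N=1) split into 2x2x1x1 blocks.
tensor_size = (4, 4, 1, 1)                  # WHCN
block_size = (2, 2, 1, 1)
num_blocks = [math.ceil(t / b) for t, b in zip(tensor_size, block_size)]
strides = (1, 4, 16, 16)                    # contiguous WHCN buffer strides
data = list(range(16))
for block_id in range(math.prod(num_blocks)):
    bc = block_id_to_coord(block_id, num_blocks)
    print(block_id, bc, block_min_max(data, strides, bc, block_size))
```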
Differential Revision: [D78436638](https://our.internmc.facebook.com/intern/diff/D78436638/)

cc @SS-JIA @manuelcandales @cbilgin
This PR was created by the merge bot to help merge the original PR into the main branch.
ghstack PR number: #12573 by @ahmtox
^ Please use this as the source of truth for the PR details, comments, and reviews
ghstack PR base: https://github.com/pytorch/executorch/tree/gh/ahmtox/41/base
ghstack PR head: https://github.com/pytorch/executorch/tree/gh/ahmtox/41/head
Merge bot PR base: https://github.com/pytorch/executorch/tree/gh/ahmtox/40/orig
Merge bot PR head: https://github.com/pytorch/executorch/tree/gh/ahmtox/41/orig
@diff-train-skip-merge

Co-authored-by: morelos <[email protected]>
Co-authored-by: ahmtox <[email protected]>
Co-authored-by: Gasoonjia <[email protected]>