Conversation

@pytorchbot
Collaborator

This PR was created by the merge bot to help merge the original PR into the main branch.
ghstack PR number: #12573 by @ahmtox
^ Please use this as the source of truth for the PR details, comments, and reviews
ghstack PR base: https://github.com/pytorch/executorch/tree/gh/ahmtox/41/base
ghstack PR head: https://github.com/pytorch/executorch/tree/gh/ahmtox/41/head
Merge bot PR base: https://github.com/pytorch/executorch/tree/gh/ahmtox/40/orig
Merge bot PR head: https://github.com/pytorch/executorch/tree/gh/ahmtox/41/orig
@diff-train-skip-merge

morelos added 2 commits July 29, 2025 23:43
Pull Request resolved: #12572

# Context

Eventually, as the vulkan_quantizer file expands, we will need to migrate to a custom utils file and stop depending on xnnpack_quantizer_utils. We migrate only the minimal set of functions necessary to keep the vulkan_quantizer working.

# Changes

We create a new file `vulkan_quantizer_utils.py` and migrate `vulkan_quantizer` off of `xnnpack_quantizer_utils.py`. No modifications were needed to work separately from the XNNPACK utils, apart from `bits_to_range`, which removes the need to specify the quantization ranges every time.
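The PR does not show `bits_to_range` itself; under the usual two's-complement convention, a helper like it could be sketched as follows (the exact signature and the `signed` flag are assumptions for illustration, not the actual implementation):

```python
def bits_to_range(bits: int, signed: bool = True) -> tuple[int, int]:
    """Map a bit width to a (quant_min, quant_max) integer range."""
    if signed:
        # e.g. 8 bits -> (-128, 127)
        return -(2 ** (bits - 1)), 2 ** (bits - 1) - 1
    # e.g. 8 bits -> (0, 255)
    return 0, 2 ** bits - 1
```

With such a helper, a config only needs to state the bit width, and the quantization range is derived automatically.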
ghstack-source-id: 299473612
@exported-using-ghexport

Differential Revision: [D78290055](https://our.internmc.facebook.com/intern/diff/D78290055/)
Pull Request resolved: #12573

# Context

Eventually dynamic quantization will be enabled in the vulkan_quantizer (particularly 8-bit dynamic activations with 8-bit weights). To enable this functionality we need an approach similar to how XNNPACK defines its quantization config. This diff aims to align with the XNNPACK quantizer logic and to migrate away from the old static quantization config logic.

# Changes

A few noticeable changes: we migrate away from `get_linear_weight_only_qcs_xnn_qconfig` and now define a symmetric config with a parameter that controls whether it is dynamically quantized. We also incorporate `bits_to_range` so that the min and max quantization ranges are designated automatically rather than set during initialization. Finally, we adjust wording that referred only to static quantization, since dynamic quantization is now enabled as well.

Furthermore, we update other internal codebases that call the existing legacy config, moving them to the more universal symmetric config. Since this follows the same naming scheme as XNNPACK, aliases were added in cases where it is imported directly alongside XNNPACK.
ghstack-source-id: 299473613
@exported-using-ghexport

Differential Revision: [D78291249](https://our.internmc.facebook.com/intern/diff/D78291249/)
@pytorch-bot

pytorch-bot bot commented Jul 30, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/12999

Note: Links to docs will display an error until the docs builds have been completed.

⏳ No Failures, 89 Pending

As of commit b2f0fd9 with merge base da0c80a:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@meta-cla meta-cla bot added the `CLA Signed` label (managed by the Facebook bot; authors need to sign the CLA before a PR can be reviewed) Jul 30, 2025
Base automatically changed from gh/ahmtox/40/orig to main July 30, 2025 16:51
@ahmtox ahmtox self-requested a review July 30, 2025 16:54
@github-actions

This PR needs a release notes: label

If your change should be included in the release notes (i.e. would users of this library care about this change?), please use a label starting with release notes:. This helps us keep track and include your important work in the next release notes.

To add a label, you can comment to pytorchbot, for example
@pytorchbot label "release notes: none"

For more information, see
https://github.com/pytorch/pytorch/wiki/PyTorch-AutoLabel-Bot#why-categorize-for-release-notes-and-how-does-it-work.

pytorchbot and others added 3 commits July 30, 2025 10:09
This PR was created by the merge bot to help merge the original PR into
the main branch.
ghstack PR number: #12574 by
@ahmtox
^ Please use this as the source of truth for the PR details, comments,
and reviews
ghstack PR base:
https://github.com/pytorch/executorch/tree/gh/ahmtox/42/base
ghstack PR head:
https://github.com/pytorch/executorch/tree/gh/ahmtox/42/head
Merge bot PR base:
https://github.com/pytorch/executorch/tree/gh/ahmtox/41/orig
Merge bot PR head:
https://github.com/pytorch/executorch/tree/gh/ahmtox/42/orig
@diff-train-skip-merge

---------

Co-authored-by: morelos <[email protected]>
Co-authored-by: ahmtox <[email protected]>
…ffer only) and cleanup (#13003)

# Changes
* Implement `torchao.choose_qparams_affine` operator in Vulkan backend with comprehensive buffer storage support
* Add block-wise quantization parameter computation in `choose_qparams_buffer.glsl` shader for configurable tensor block analysis
* Extend quantization parameter infrastructure in `ChooseQParams.cpp` to handle affine transformations with configurable block sizes and multiple mapping types
* Support three quantization mapping strategies: ASYMMETRIC, SYMMETRIC, and SYMMETRIC_NO_CLIPPING_ERR for optimal parameter selection
* Consolidated the logic for choosing scale and zero point between affine cases and regular quantized_decomposed cases.

BE: Improved the documentation in the shader logic to be more detailed and clear

# Motivation
The existing Vulkan quantization infrastructure lacked support for the `torchao.choose_qparams_affine` operator, which is essential for computing optimal quantization parameters in dynamic quantization workflows. The `choose_qparams_affine` operator provides flexible block-wise parameter computation that analyzes statistical distributions within tensor blocks, enabling:

* **Block-wise Parameter Computation**: Analyzes configurable tensor blocks to compute optimal scale and zero-point values, improving quantization accuracy for heterogeneous data distributions
* **Multiple Mapping Types**: Supports ASYMMETRIC, SYMMETRIC, and SYMMETRIC_NO_CLIPPING_ERR quantization strategies for different precision-performance trade-offs

# Operator Description
The `choose_qparams_affine` operator computes optimal quantization parameters (scale and zero_point) from floating-point tensor blocks using statistical analysis of data distributions. Block-wise parameter computation divides tensors into blocks and analyzes each block independently to determine the best quantization mapping for subsequent quantization operations.

The parameter calculation varies by mapping type:
- **ASYMMETRIC**: `scale = (max - min) / (quant_max - quant_min)`, `zero_point = quant_min - round(min / scale)`
- **SYMMETRIC**: `scale = max_abs / ((quant_max - quant_min) / 2)`, `zero_point = midpoint`
- **SYMMETRIC_NO_CLIPPING_ERR**: `scale = max(abs(min)/abs(quant_min), max/quant_max)`, `zero_point = midpoint`
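The three formulas above can be sketched as a host-side Python reference (an illustrative translation of what the shader's `calc_scale_zp` computes, not the GLSL code itself; the `eps` floor and the zero-point clamping in the asymmetric branch are assumptions):

```python
def calc_scale_zp(lo, hi, quant_min, quant_max, mapping_type, eps=1e-9):
    """Compute (scale, zero_point) for one block's [lo, hi] value range."""
    if mapping_type == 0:  # ASYMMETRIC: map [min, max] onto [quant_min, quant_max]
        scale = max((hi - lo) / (quant_max - quant_min), eps)
        zero_point = quant_min - round(lo / scale)
        zero_point = max(quant_min, min(quant_max, zero_point))  # clamp to range
    elif mapping_type == 1:  # SYMMETRIC: center on zero using the largest magnitude
        max_abs = max(abs(lo), abs(hi))
        scale = max(max_abs / ((quant_max - quant_min) / 2), eps)
        zero_point = (quant_min + quant_max + 1) // 2  # midpoint
    else:  # SYMMETRIC_NO_CLIPPING_ERR: separate scales for neg/pos, take the max
        scale_neg = abs(lo) / abs(quant_min) if quant_min != 0 else 0.0
        scale_pos = hi / quant_max if quant_max != 0 else 0.0
        scale = max(scale_neg, scale_pos, eps)
        zero_point = (quant_min + quant_max + 1) // 2
    return scale, zero_point
```

For an int8 range of [-128, 127] the midpoint is 0, so both symmetric modes place the zero point at zero.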

**Storage Requirements**: Input tensors must be floating-point (kFloat) with width-packed layout. Output scale/zero_point tensors use buffer storage.

NOTE: A texture storage implementation is not supported due to the complexity of block-wise coordinate mapping in 3D texture space. It will likely be needed for better efficiency in the future.

# Block-wise Parameter Computation Implementation
Block-wise parameter computation enables fine-grained quantization analysis by dividing tensors into blocks and computing separate scale/zero_point parameters for each block. The implementation uses several key data structures computed in `ChooseQParams.cpp`:

* **`block_size_vec`**: WHCN-ordered block dimensions converted from PyTorch NCHW layout (e.g., [3,3,2,1] for 3×3×2×1 blocks)
* **`tensor_size_whcn`**: Input tensor dimensions converted to WHCN layout using `utils::make_whcn_ivec4()`
* **`num_blocks_vec`**: Number of blocks per dimension calculated as `ceil(tensor_size_whcn / block_size_vec)` to handle non-divisible dimensions
* **`block_stride_vec`**: Pre-computed linear strides for block grid indexing `{1, #W, #W*#H, #W*#H*#C}` to enable efficient block ID calculation
* **`mapping_type`**: Integer encoding of quantization strategy (0=ASYMMETRIC, 1=SYMMETRIC, 2=SYMMETRIC_NO_CLIPPING_ERR)

The block coordinate calculation uses: `block_coord = block_id_to_coord(block_id)` which converts linear block IDs back to 4D WHCN coordinates, then computes element ranges: `t0 = block_coord * blockSize` and `tEnd = t0 + blockSize` for nested loop iteration.
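The block-grid bookkeeping above can be mirrored in a small Python sketch (illustrative host-side equivalents of the quantities computed in `ChooseQParams.cpp` and the shader's `block_id_to_coord`; function names are assumptions):

```python
import math

def block_grid(tensor_size_whcn, block_size_whcn):
    """Compute num_blocks_vec and block_stride_vec for a WHCN tensor."""
    # ceil-divide each dimension so partial edge blocks are still counted
    num_blocks = [math.ceil(t / b) for t, b in zip(tensor_size_whcn, block_size_whcn)]
    # linear strides over the block grid: {1, #W, #W*#H, #W*#H*#C}
    strides = [1,
               num_blocks[0],
               num_blocks[0] * num_blocks[1],
               num_blocks[0] * num_blocks[1] * num_blocks[2]]
    return num_blocks, strides

def block_id_to_coord(block_id, num_blocks):
    """Convert a linear block id back to 4D WHCN coordinates."""
    w = block_id % num_blocks[0]
    h = (block_id // num_blocks[0]) % num_blocks[1]
    c = (block_id // (num_blocks[0] * num_blocks[1])) % num_blocks[2]
    n = block_id // (num_blocks[0] * num_blocks[1] * num_blocks[2])
    return (w, h, c, n)
```

Given a block coordinate `bc`, the element range is then `t0 = bc * block_size` (inclusive) and `tEnd = t0 + block_size` (exclusive), as described below.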

# Shader Algorithm Overview

## Buffer Storage Implementation (`choose_qparams_buffer.glsl`)

**Workgroup Configuration**:
- **Global WG Size**: `{nBlocks, 1u, 1u}` where `nBlocks = total number of blocks` computed from `ceil(tensor_size / block_size)` for each dimension
- **Local WG Size**: `{1u, 1u, 1u}` (single thread per block for simplicity, though could be optimized for larger blocks)

**Block-wise Mode Algorithm**:
The shader uses a multi-level nested approach to process tensor blocks efficiently. Each thread is assigned multiple blocks using strided access: `for (uint block_id = gl_GlobalInvocationID.x; block_id < TOTAL_BLOCKS; block_id += STRIDE)` where `STRIDE = gl_WorkGroupSize.x * gl_NumWorkGroups.x`.

For each assigned block, the algorithm performs several key steps:

**1. Block Coordinate Conversion**:
The `block_id_to_coord(block_id)` function converts linear block IDs to 4D WHCN coordinates using modular arithmetic.

**2. Element Range Calculation**: Computes the inclusive start coordinate `t0 = bc * blockSize` and exclusive end coordinate `tEnd = t0 + blockSize` to define the block's element boundaries in tensor space.

**3. Nested Loop Min/Max Scan**: Uses four nested loops to iterate through all elements within the block:

`for (int n = t0.w; n < tEnd.w; ++n) for (int c = t0.z; c < tEnd.z; ++c) for (int h = t0.y; h < tEnd.y; ++h) for (int w = t0.x; w < tEnd.x; ++w)`

Each element is accessed using `tidx_to_bufi(ivec4(w,h,c,n), t_in_strides)` to convert 4D tensor coordinates to linear buffer indices with proper stride handling.
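Steps 2 and 3 can be sketched as a Python reference (an illustrative translation of the shader's index math and min/max scan over one block; `buf` stands in for the flat input buffer, and the function names mirror but are not the GLSL helpers):

```python
def tidx_to_bufi(tidx_whcn, strides_whcn):
    """Convert a 4D WHCN tensor coordinate to a linear buffer index."""
    # dot product of coordinates with per-dimension buffer strides
    return sum(i * s for i, s in zip(tidx_whcn, strides_whcn))

def block_min_max(buf, t0, t_end, strides_whcn):
    """Scan one block [t0, t_end) with four nested loops, tracking min/max."""
    lo, hi = float("inf"), float("-inf")
    for n in range(t0[3], t_end[3]):
        for c in range(t0[2], t_end[2]):
            for h in range(t0[1], t_end[1]):
                for w in range(t0[0], t_end[0]):
                    v = buf[tidx_to_bufi((w, h, c, n), strides_whcn)]
                    lo, hi = min(lo, v), max(hi, v)
    return lo, hi
```

The resulting `(lo, hi)` pair for each block is what feeds the parameter calculation in the next step.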

**4. Parameter Calculation**: Calls `calc_scale_zp(lo, hi, quant_min, quant_max, mapping_type, eps, scale, zp)` which implements the three mapping strategies:

*   **ASYMMETRIC (mapping_type=0)**: Maps full range [min, max] to [quant_min, quant_max] preserving data distribution
*   **SYMMETRIC (mapping_type=1)**: Centers around zero using `max_abs = max(abs(min), abs(max))` for balanced quantization
*   **SYMMETRIC_NO_CLIPPING_ERR (mapping_type=2)**: Computes separate scales for positive/negative ranges and uses the maximum to prevent clipping

**Future Improvements**: Implement workgroup-level reduction for large blocks, optimize memory access patterns for better cache utilization, and explore texture storage implementation with simplified block alignment constraints.

Differential Revision: [D78436638](https://our.internmc.facebook.com/intern/diff/D78436638/)

cc SS-JIA manuelcandales cbilgin

[ghstack-poisoned]
@SS-JIA SS-JIA force-pushed the gh/ahmtox/41/orig branch from 3efbdba to b2f0fd9 Compare July 30, 2025 18:54
@SS-JIA SS-JIA merged commit 21ab2ec into main Jul 30, 2025
100 checks passed
@SS-JIA SS-JIA deleted the gh/ahmtox/41/orig branch July 30, 2025 19:14
agrima1304 pushed a commit to agrima1304/executorch that referenced this pull request Aug 26, 2025
This PR was created by the merge bot to help merge the original PR into
the main branch.
ghstack PR number: pytorch#12573 by
@ahmtox
^ Please use this as the source of truth for the PR details, comments,
and reviews
ghstack PR base:
https://github.com/pytorch/executorch/tree/gh/ahmtox/41/base
ghstack PR head:
https://github.com/pytorch/executorch/tree/gh/ahmtox/41/head
Merge bot PR base:
https://github.com/pytorch/executorch/tree/gh/ahmtox/40/orig
Merge bot PR head:
https://github.com/pytorch/executorch/tree/gh/ahmtox/41/orig
@diff-train-skip-merge

---------

Co-authored-by: morelos <[email protected]>
Co-authored-by: ahmtox <[email protected]>
Co-authored-by: Gasoonjia <[email protected]>