Commit 7076257

Author: morelos
Update on "[ET-VK][Ops] torchao.choose_qparams_affine vulkan impl and shader (buffer only) and cleanup"
# Changes

* Implement the `torchao.choose_qparams_affine` operator in the Vulkan backend with comprehensive buffer storage support
* Add block-wise quantization parameter computation in the `choose_qparams_buffer.glsl` shader for configurable tensor block analysis
* Extend the quantization parameter infrastructure in `ChooseQParams.cpp` to handle affine transformations with configurable block sizes and multiple mapping types
* Support three quantization mapping strategies, ASYMMETRIC, SYMMETRIC, and SYMMETRIC_NO_CLIPPING_ERR, for optimal parameter selection
* Consolidate the logic for choosing scale and zero point between the affine cases and the regular quantized_decomposed cases
* BE: Improved the documentation in the shader logic so that it is more detailed and clear

# Motivation

The existing Vulkan quantization infrastructure lacked support for the `torchao.choose_qparams_affine` operator, which is essential for computing optimal quantization parameters in dynamic quantization workflows. The `choose_qparams_affine` operator provides flexible block-wise parameter computation that analyzes statistical distributions within tensor blocks, enabling:

* **Block-wise Parameter Computation**: Analyzes configurable tensor blocks to compute optimal scale and zero-point values, improving quantization accuracy for heterogeneous data distributions
* **Multiple Mapping Types**: Supports ASYMMETRIC, SYMMETRIC, and SYMMETRIC_NO_CLIPPING_ERR quantization strategies for different precision/performance trade-offs

# Operator Description

The `choose_qparams_affine` operator computes optimal quantization parameters (scale and zero_point) from floating-point tensor blocks using statistical analysis of data distributions. Block-wise parameter computation divides tensors into blocks and analyzes each block independently to determine the best quantization mapping for subsequent quantization operations.

The parameter calculation varies by mapping type (a scalar sketch of this math follows at the end of this section):

- **ASYMMETRIC**: `scale = (max - min) / (quant_max - quant_min)`, `zero_point = quant_min - round(min / scale)`
- **SYMMETRIC**: `scale = max_abs / ((quant_max - quant_min) / 2)`, `zero_point = midpoint`
- **SYMMETRIC_NO_CLIPPING_ERR**: `scale = max(abs(min) / abs(quant_min), max / quant_max)`, `zero_point = midpoint`

**Storage Requirements**: Input tensors must be floating-point (kFloat) with width-packed layout. Output scale/zero_point tensors use buffer storage.

NOTE: A texture storage implementation is not supported due to the complexity of block-wise coordinate mapping in 3D texture space; it will likely be needed for better efficiency in the future.
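To make the three formulas concrete, here is a minimal scalar C++ sketch of the per-block parameter math. It is illustrative only: `MappingType`, `QParams`, and `choose_qparams_block` are hypothetical names rather than the actual shader or `ChooseQParams.cpp` symbols, and clamping the range to include zero plus using `eps` as a lower bound on `scale` are assumptions mirroring typical choose_qparams behavior.

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>

// Hypothetical names; a scalar sketch of the per-block parameter math.
enum class MappingType { ASYMMETRIC, SYMMETRIC, SYMMETRIC_NO_CLIPPING_ERR };

struct QParams {
  float scale;
  int32_t zero_point;
};

QParams choose_qparams_block(
    float lo, float hi,  // block min / max from the scan
    int64_t quant_min, int64_t quant_max,
    MappingType mapping, float eps) {
  // Assumption: extend the range to cover zero so that 0.0f is exactly
  // representable, as in typical choose_qparams implementations.
  lo = std::min(lo, 0.0f);
  hi = std::max(hi, 0.0f);

  QParams out{};
  if (mapping == MappingType::ASYMMETRIC) {
    // scale = (max - min) / (quant_max - quant_min)
    out.scale = (hi - lo) / static_cast<float>(quant_max - quant_min);
    out.scale = std::max(out.scale, eps);  // assumed eps floor
    // zero_point = quant_min - round(min / scale)
    out.zero_point =
        static_cast<int32_t>(quant_min - std::lround(lo / out.scale));
  } else if (mapping == MappingType::SYMMETRIC) {
    // scale = max_abs / ((quant_max - quant_min) / 2), zero_point = midpoint
    const float max_abs = std::max(std::fabs(lo), std::fabs(hi));
    out.scale = max_abs / (static_cast<float>(quant_max - quant_min) / 2.0f);
    out.scale = std::max(out.scale, eps);
    out.zero_point = static_cast<int32_t>((quant_max + quant_min + 1) / 2);
  } else {  // SYMMETRIC_NO_CLIPPING_ERR
    // Separate scales for the negative and positive halves; taking the max
    // guarantees neither end of the data range is clipped.
    const float s_neg =
        std::fabs(lo) / std::fabs(static_cast<float>(quant_min));
    const float s_pos = hi / static_cast<float>(quant_max);
    out.scale = std::max(std::max(s_neg, s_pos), eps);
    out.zero_point = static_cast<int32_t>((quant_max + quant_min + 1) / 2);
  }
  return out;
}
```

For example, `lo = -1.0f`, `hi = 3.0f`, `quant_min = -128`, `quant_max = 127` under ASYMMETRIC yields `scale = 4.0 / 255 ≈ 0.0157` and `zero_point = -64`.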
# Block-wise Parameter Computation Implementation

Block-wise parameter computation enables fine-grained quantization analysis by dividing tensors into blocks and computing a separate scale/zero_point pair for each block.

The implementation uses several key data structures computed in `ChooseQParams.cpp`:

* **`block_size_vec`**: WHCN-ordered block dimensions converted from the PyTorch NCHW layout (e.g., [3,3,2,1] for 3×3×2×1 blocks)
* **`tensor_size_whcn`**: Input tensor dimensions converted to WHCN layout using `utils::make_whcn_ivec4()`
* **`num_blocks_vec`**: Number of blocks per dimension, calculated as `ceil(tensor_size_whcn / block_size_vec)` to handle non-divisible dimensions
* **`block_stride_vec`**: Pre-computed linear strides for block grid indexing, `{1, #W, #W*#H, #W*#H*#C}`, enabling efficient block ID calculation
* **`mapping_type`**: Integer encoding of the quantization strategy (0=ASYMMETRIC, 1=SYMMETRIC, 2=SYMMETRIC_NO_CLIPPING_ERR)

The block coordinate calculation uses `block_coord = block_id_to_coord(block_id)`, which converts a linear block ID back to 4D WHCN coordinates, and then computes the element ranges `t0 = block_coord * blockSize` and `tEnd = t0 + blockSize` for nested-loop iteration.

# Shader Algorithm Overview

## Buffer Storage Implementation (`choose_qparams_buffer.glsl`)

**Workgroup Configuration**:

- **Global WG Size**: `{nBlocks, 1u, 1u}`, where `nBlocks` is the total number of blocks, computed from `ceil(tensor_size / block_size)` for each dimension
- **Local WG Size**: `{1u, 1u, 1u}` (a single thread per block for simplicity, though this could be optimized for larger blocks)

**Block-wise Mode Algorithm**:

The shader uses a multi-level nested approach to process tensor blocks efficiently. Each thread is assigned multiple blocks using strided access: `for (uint block_id = gl_GlobalInvocationID.x; block_id < TOTAL_BLOCKS; block_id += STRIDE)`, where `STRIDE = gl_WorkGroupSize.x * gl_NumWorkGroups.x`.

For each assigned block, the algorithm performs several key steps (see the sketch after these steps):

**1. Block Coordinate Conversion**: The `block_id_to_coord(block_id)` function converts a linear block ID to 4D WHCN coordinates using modular arithmetic.

**2. Element Range Calculation**: Computes the inclusive start coordinate `t0 = bc * blockSize` and the exclusive end coordinate `tEnd = t0 + blockSize` to define the block's element boundaries in tensor space.

**3. Nested Loop Min/Max Scan**: Uses four nested loops to iterate through all elements within the block: `for (int n = t0.w; n < tEnd.w; ++n) for (int c = t0.z; c < tEnd.z; ++c) for (int h = t0.y; h < tEnd.y; ++h) for (int w = t0.x; w < tEnd.x; ++w)`. Each element is accessed using `tidx_to_bufi(ivec4(w, h, c, n), t_in_strides)` to convert 4D tensor coordinates to linear buffer indices with proper stride handling.

**4. Parameter Calculation**: Calls `calc_scale_zp(lo, hi, quant_min, quant_max, mapping_type, eps, scale, zp)`, which implements the three mapping strategies:

* **ASYMMETRIC (mapping_type=0)**: Maps the full range [min, max] to [quant_min, quant_max], preserving the data distribution
* **SYMMETRIC (mapping_type=1)**: Centers around zero using `max_abs = max(abs(min), abs(max))` for balanced quantization
* **SYMMETRIC_NO_CLIPPING_ERR (mapping_type=2)**: Computes separate scales for the positive and negative ranges and uses the maximum to prevent clipping
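As a reference for steps 1-3, the following is a hypothetical CPU-side C++ emulation of the block scan. The `IVec4` helper, parameter names, and the clamping of `tEnd` to the tensor extent for non-divisible dimensions are illustrative assumptions, not the real GLSL symbols.

```cpp
#include <algorithm>
#include <cstdint>
#include <limits>
#include <vector>

// WHCN order throughout: x=W, y=H, z=C, w=N (hypothetical helper type).
struct IVec4 { int x, y, z, w; };

// Convert a linear block ID back to 4D WHCN block coordinates; the modular
// arithmetic is the inverse of the {1, #W, #W*#H, #W*#H*#C} stride layout.
IVec4 block_id_to_coord(int block_id, const IVec4& num_blocks) {
  IVec4 bc;
  bc.x = block_id % num_blocks.x;                                    // W
  bc.y = (block_id / num_blocks.x) % num_blocks.y;                   // H
  bc.z = (block_id / (num_blocks.x * num_blocks.y)) % num_blocks.z;  // C
  bc.w = block_id / (num_blocks.x * num_blocks.y * num_blocks.z);    // N
  return bc;
}

// Scan one block of `data` (a linear buffer with WHCN strides) for min/max.
// Assumption: tEnd is clamped to the tensor extent to handle edge blocks
// when a dimension is not divisible by the block size.
void scan_block(const std::vector<float>& data, const IVec4& strides,
                const IVec4& tensor_size, const IVec4& block_size,
                const IVec4& bc, float& lo, float& hi) {
  const IVec4 t0{bc.x * block_size.x, bc.y * block_size.y,
                 bc.z * block_size.z, bc.w * block_size.w};
  const IVec4 tEnd{std::min(t0.x + block_size.x, tensor_size.x),
                   std::min(t0.y + block_size.y, tensor_size.y),
                   std::min(t0.z + block_size.z, tensor_size.z),
                   std::min(t0.w + block_size.w, tensor_size.w)};
  lo = std::numeric_limits<float>::max();
  hi = std::numeric_limits<float>::lowest();
  for (int n = t0.w; n < tEnd.w; ++n)
    for (int c = t0.z; c < tEnd.z; ++c)
      for (int h = t0.y; h < tEnd.y; ++h)
        for (int w = t0.x; w < tEnd.x; ++w) {
          // Equivalent of tidx_to_bufi(ivec4(w, h, c, n), strides).
          const int bufi =
              w * strides.x + h * strides.y + c * strides.z + n * strides.w;
          lo = std::min(lo, data[bufi]);
          hi = std::max(hi, data[bufi]);
        }
}
```

The resulting `lo`/`hi` pair is what the shader then feeds into the parameter calculation of step 4.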
**Future Improvements**: Implement workgroup-level reduction for large blocks, optimize memory access patterns for better cache utilization, and explore a texture storage implementation with simplified block alignment constraints.

Differential Revision: [D78436638](https://our.internmc.facebook.com/intern/diff/D78436638/)

cc SS-JIA manuelcandales cbilgin

[ghstack-poisoned]

1 parent 4b61f7e commit 7076257

File tree: 1 file changed, +9 −32 lines


backends/vulkan/test/op_tests/affine_test.cpp

Lines changed: 9 additions & 32 deletions
```diff
@@ -279,40 +279,26 @@ at::Tensor dequantize_affine_reference_impl(
       std::string("INT"));
 }
 
-/*
- * choose_qparams_affine_reference_impl
- * -----------------------------------
- * A faithful C++ re-implementation of the Python helper
- * choose_qparams_affine (see quant_primitives.py)
- *
- * Supported input dtypes : float32 / float16 / bfloat16
- * Supported mapping types: ASYMMETRIC / SYMMETRIC / SYMMETRIC_NO_CLIPPING_ERR
- */
 std::tuple<at::Tensor, at::Tensor> choose_qparams_affine_reference_impl(
-    const at::Tensor& input_, // F32 / F16 / BF16
-    const std::string&
-        mapping_type, // ASYMMETRIC / SYMMETRIC / SYMMETRIC_NO_CLIPPING_ERR
-    const std::vector<int64_t>& block_size, // same length as input_.dim()
+    const at::Tensor& input_,
+    const std::string& mapping_type,
+    const std::vector<int64_t>& block_size,
     int64_t quant_min,
     int64_t quant_max,
     double eps) {
-  // -------------------- 1. Validations --------------------------------------
   const int64_t ndim = input_.dim();
   _check_dims("input", block_size.size(), ndim);
 
-  TORCH_CHECK(
+  VK_CHECK_COND(
       input_.scalar_type() == at::kFloat || input_.scalar_type() == at::kHalf ||
           input_.scalar_type() == at::kBFloat16,
       "Unsupported input dtype: ",
       input_.dtype());
 
-  // Ensure contiguous – view() is only well-defined on contiguous tensors
   at::Tensor input = input_.contiguous();
 
-  // -------------------- 2. Derive reduction shape ---------------------------
-  // Equivalent to python _get_reduction_params
   std::vector<int64_t> shape_for_reduction;
-  std::vector<int64_t> reduction_dims; // dims we later collapse to size-1
+  std::vector<int64_t> reduction_dims;
   int64_t cur_dim = 0;
 
   auto in_sizes = input.sizes();
@@ -321,7 +307,7 @@ std::tuple<at::Tensor, at::Tensor> choose_qparams_affine_reference_impl(
     const int64_t dim = in_sizes[i];
 
     if (blk != dim && blk > 1) {
-      TORCH_CHECK(
+      VK_CHECK_COND(
           dim % blk == 0,
           "Input size ",
           dim,
@@ -331,11 +317,11 @@ std::tuple<at::Tensor, at::Tensor> choose_qparams_affine_reference_impl(
           i);
       shape_for_reduction.push_back(dim / blk);
       shape_for_reduction.push_back(blk);
-      reduction_dims.push_back(cur_dim + 1); // the 'inside block' dim
+      reduction_dims.push_back(cur_dim + 1);
       cur_dim += 2;
     } else {
       shape_for_reduction.push_back(dim);
-      if (blk != 1) { // per-axis / per-tensor
+      if (blk != 1) {
         reduction_dims.push_back(cur_dim);
       }
       cur_dim += 1;
@@ -344,19 +330,14 @@ std::tuple<at::Tensor, at::Tensor> choose_qparams_affine_reference_impl(
 
   at::Tensor input_reshaped = input.view(shape_for_reduction);
 
-  // Shape after reduction – same rank as shape_for_reduction but
-  // all 'reduction_dims' set to 1. We'll reshape scale / zp to that.
   std::vector<int64_t> shape_after_reduction = shape_for_reduction;
   for (int64_t d : reduction_dims) {
     shape_after_reduction[d] = 1;
   }
 
-  // -------------------- 3. Find min/max values -----------------------------
-  // Reduce over the specified dimensions to get min/max values
   at::Tensor min_val = input_reshaped.amin(reduction_dims, /*keepdim=*/true);
   at::Tensor max_val = input_reshaped.amax(reduction_dims, /*keepdim=*/true);
 
-  // -------------------- 4. Calculate scale and zero_point ------------------
   at::Tensor scale, zero_point;
 
   if (mapping_type == "ASYMMETRIC") {
@@ -403,15 +384,13 @@ std::tuple<at::Tensor, at::Tensor> choose_qparams_affine_reference_impl(
     zero_point =
         at::full_like(scale, (quant_max + quant_min + 1) / 2, at::kInt);
   } else {
-    TORCH_CHECK(
+    VK_CHECK_COND(
         false,
         "Unsupported mapping_type: ",
         mapping_type,
         ". Expected ASYMMETRIC, SYMMETRIC, or SYMMETRIC_NO_CLIPPING_ERR");
   }
 
-  // -------------------- 5. Reshape back to output shape --------------------
-  // Calculate output shape (remove reduction dimensions)
   std::vector<int64_t> output_shape;
   for (size_t i = 0; i < shape_after_reduction.size(); ++i) {
     if (shape_after_reduction[i] != 1 ||
@@ -1007,7 +986,6 @@ TEST(VulkanDequantizeAffineTest, test_4d_dequantization) {
       at::kFloat); // output dtype
 }
 
-// Test function for choose_qparams_affine
 void test_vulkan_choose_qparams_affine_impl(
     const std::vector<int>& input_sizes,
     const std::vector<int64_t>& block_size,
@@ -1033,7 +1011,6 @@ void test_vulkan_choose_qparams_affine_impl(
   at::Tensor reference_scale = std::get<0>(reference_out);
   at::Tensor reference_zero_point = std::get<1>(reference_out);
 
-  // Ensure zero_point is int32 as expected by the shader
   reference_zero_point = reference_zero_point.to(at::kInt);
 
   using namespace vkcompute;
```
