
Commit 0b1b6c9

Authored and committed by morelos
Update base for Update on "[ET-VK][Ops] torchao.choose_qparams_affine vulkan impl and shader (buffer only) and cleanup"
# Changes

* Implement the `torchao.choose_qparams_affine` operator in the Vulkan backend with comprehensive buffer storage support
* Add block-wise quantization parameter computation in the `choose_qparams_buffer.glsl` shader for configurable tensor block analysis
* Extend the quantization parameter infrastructure in `ChooseQParams.cpp` to handle affine transformations with configurable block sizes and multiple mapping types
* Support three quantization mapping strategies (ASYMMETRIC, SYMMETRIC, and SYMMETRIC_NO_CLIPPING_ERR) for optimal parameter selection
* Consolidate the logic for choosing scale and zero point between the affine and regular quantized_decomposed cases
* BE: Improve the shader documentation to be more detailed and clear

# Motivation

The existing Vulkan quantization infrastructure lacked support for the `torchao.choose_qparams_affine` operator, which is essential for computing optimal quantization parameters in dynamic quantization workflows. The `choose_qparams_affine` operator provides flexible block-wise parameter computation that analyzes statistical distributions within tensor blocks, enabling:

* **Block-wise Parameter Computation**: Analyzes configurable tensor blocks to compute optimal scale and zero-point values, improving quantization accuracy for heterogeneous data distributions
* **Multiple Mapping Types**: Supports ASYMMETRIC, SYMMETRIC, and SYMMETRIC_NO_CLIPPING_ERR quantization strategies for different precision-performance trade-offs

# Operator Description

The `choose_qparams_affine` operator computes optimal quantization parameters (scale and zero_point) from floating-point tensor blocks using statistical analysis of data distributions. Block-wise parameter computation divides tensors into blocks and analyzes each block independently to determine the best quantization mapping for subsequent quantization operations.

The parameter calculation varies by mapping type:

- **ASYMMETRIC**: `scale = (max - min) / (quant_max - quant_min)`, `zero_point = quant_min - round(min / scale)`
- **SYMMETRIC**: `scale = max_abs / ((quant_max - quant_min) / 2)`, `zero_point = midpoint`
- **SYMMETRIC_NO_CLIPPING_ERR**: `scale = max(abs(min)/abs(quant_min), max/quant_max)`, `zero_point = midpoint`

**Storage Requirements**: Input tensors must be floating-point (kFloat) with width-packed layout. Output scale/zero_point tensors use buffer storage.

NOTE: A texture storage implementation is not supported due to the complexity of block-wise coordinate mapping in 3D texture space; it will likely be needed for better efficiency in the future.
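To make the three mapping types concrete, here is a minimal Python sketch of the parameter math listed above. It is a reference model, not the shader code: the function name `calc_scale_zp_reference`, the zero-inclusion clamping, the zero-point clamping, and the midpoint convention for the symmetric zero point are illustrative assumptions.

```python
def calc_scale_zp_reference(
    lo: float,
    hi: float,
    quant_min: int,
    quant_max: int,
    mapping_type: int,
    eps: float = 1e-5,
) -> tuple[float, int]:
    """Compute (scale, zero_point) for one block given its observed min/max."""
    # Include zero in the observed range so 0.0 stays exactly representable
    # (an assumption modeled on common choose_qparams behavior).
    lo = min(lo, 0.0)
    hi = max(hi, 0.0)

    if mapping_type == 0:  # ASYMMETRIC: map [lo, hi] onto [quant_min, quant_max]
        scale = max((hi - lo) / (quant_max - quant_min), eps)
        zero_point = quant_min - round(lo / scale)
        # Clamping into the quantized range is an assumption for robustness.
        zero_point = max(quant_min, min(quant_max, zero_point))
    elif mapping_type == 1:  # SYMMETRIC: balanced range around zero
        max_abs = max(abs(lo), abs(hi))
        scale = max(max_abs / ((quant_max - quant_min) / 2), eps)
        zero_point = (quant_min + quant_max + 1) // 2  # assumed midpoint
    else:  # SYMMETRIC_NO_CLIPPING_ERR: per-side scales, take the larger
        smin = abs(lo) / abs(quant_min) if quant_min != 0 else 0.0
        smax = hi / quant_max if quant_max != 0 else 0.0
        scale = max(smin, smax, eps)
        zero_point = (quant_min + quant_max + 1) // 2  # assumed midpoint
    return scale, zero_point


# Example: an int8 asymmetric block spanning [-1.0, 3.0]
print(calc_scale_zp_reference(-1.0, 3.0, -128, 127, mapping_type=0))
```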
# Block-wise Parameter Computation Implementation

Block-wise parameter computation enables fine-grained quantization analysis by dividing tensors into blocks and computing separate scale/zero_point parameters for each block.

The implementation uses several key data structures computed in `ChooseQParams.cpp`:

* **`block_size_vec`**: WHCN-ordered block dimensions converted from the PyTorch NCHW layout (e.g., [3,3,2,1] for 3×3×2×1 blocks)
* **`tensor_size_whcn`**: Input tensor dimensions converted to WHCN layout using `utils::make_whcn_ivec4()`
* **`num_blocks_vec`**: Number of blocks per dimension, calculated as `ceil(tensor_size_whcn / block_size_vec)` to handle non-divisible dimensions
* **`block_stride_vec`**: Pre-computed linear strides for block grid indexing, `{1, #W, #W*#H, #W*#H*#C}`, enabling efficient block ID calculation
* **`mapping_type`**: Integer encoding of the quantization strategy (0=ASYMMETRIC, 1=SYMMETRIC, 2=SYMMETRIC_NO_CLIPPING_ERR)

The block coordinate calculation uses `block_coord = block_id_to_coord(block_id)`, which converts linear block IDs back to 4D WHCN coordinates, then computes the element ranges `t0 = block_coord * blockSize` and `tEnd = t0 + blockSize` for nested loop iteration.

# Shader Algorithm Overview

## Buffer Storage Implementation (`choose_qparams_buffer.glsl`)

**Workgroup Configuration**:

- **Global WG Size**: `{nBlocks, 1u, 1u}`, where `nBlocks` is the total number of blocks, computed from `ceil(tensor_size / block_size)` for each dimension
- **Local WG Size**: `{1u, 1u, 1u}` (a single thread per block for simplicity, though this could be optimized for larger blocks)

**Block-wise Mode Algorithm**:

The shader processes tensor blocks with a strided, multi-level nested approach. Each thread is assigned multiple blocks using strided access: `for (uint block_id = gl_GlobalInvocationID.x; block_id < TOTAL_BLOCKS; block_id += STRIDE)`, where `STRIDE = gl_WorkGroupSize.x * gl_NumWorkGroups.x`.

For each assigned block, the algorithm performs several key steps (a host-side Python sketch mirroring these steps appears at the end of this section):

**1. Block Coordinate Conversion**: The `block_id_to_coord(block_id)` function converts linear block IDs to 4D WHCN coordinates using modular arithmetic.

**2. Element Range Calculation**: Computes the inclusive start coordinate `t0 = bc * blockSize` and the exclusive end coordinate `tEnd = t0 + blockSize` to define the block's element boundaries in tensor space.

**3. Nested Loop Min/Max Scan**: Uses four nested loops to iterate through all elements within the block: `for (int n = t0.w; n < tEnd.w; ++n) for (int c = t0.z; c < tEnd.z; ++c) for (int h = t0.y; h < tEnd.y; ++h) for (int w = t0.x; w < tEnd.x; ++w)`. Each element is accessed using `tidx_to_bufi(ivec4(w,h,c,n), t_in_strides)` to convert 4D tensor coordinates to linear buffer indices with proper stride handling.

**4. Parameter Calculation**: Calls `calc_scale_zp(lo, hi, quant_min, quant_max, mapping_type, eps, scale, zp)`, which implements the three mapping strategies:

* **ASYMMETRIC (mapping_type=0)**: Maps the full range [min, max] to [quant_min, quant_max], preserving the data distribution
* **SYMMETRIC (mapping_type=1)**: Centers around zero using `max_abs = max(abs(min), abs(max))` for balanced quantization
* **SYMMETRIC_NO_CLIPPING_ERR (mapping_type=2)**: Computes separate scales for the positive and negative ranges and uses the maximum to prevent clipping

**Future Improvements**: Implement workgroup-level reduction for large blocks, optimize memory access patterns for better cache utilization, and explore a texture storage implementation with simplified block alignment constraints.
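To make the block-wise scan easier to follow end to end, the following host-side Python sketch mirrors the block-grid bookkeeping and nested min/max loop described above. Every helper name, the (W, H, C, N) tuple convention, and the clamping of `tEnd` at the tensor edge are assumptions modeled on this description rather than the actual GLSL.

```python
import math


def ceil_div(a: int, b: int) -> int:
    return -(-a // b)


def choose_qparams_blockwise(
    data: list[float],                         # flat buffer, WHCN indexing
    tensor_size: tuple[int, int, int, int],    # (W, H, C, N)
    block_size: tuple[int, int, int, int],     # (W, H, C, N)
    strides: tuple[int, int, int, int],        # element strides for (w, h, c, n)
):
    # num_blocks_vec: ceil(tensor_size / block_size) per dimension
    num_blocks = tuple(ceil_div(t, b) for t, b in zip(tensor_size, block_size))
    # block_stride_vec: {1, #W, #W*#H, #W*#H*#C} for linear block IDs
    bs = (1, num_blocks[0], num_blocks[0] * num_blocks[1],
          num_blocks[0] * num_blocks[1] * num_blocks[2])
    total_blocks = bs[3] * num_blocks[3]

    results = []
    # On the GPU, each thread takes block_ids in strides of
    # gl_WorkGroupSize.x * gl_NumWorkGroups.x; here we simply loop serially.
    for block_id in range(total_blocks):
        # block_id_to_coord: linear ID -> 4D WHCN block coordinate
        bc = tuple((block_id // bs[d]) % num_blocks[d] for d in range(4))
        # t0 inclusive start, tEnd exclusive end; clamping to the tensor edge
        # for non-divisible dimensions is an assumption.
        t0 = tuple(bc[d] * block_size[d] for d in range(4))
        t_end = tuple(min(t0[d] + block_size[d], tensor_size[d]) for d in range(4))

        lo, hi = math.inf, -math.inf
        for n in range(t0[3], t_end[3]):
            for c in range(t0[2], t_end[2]):
                for h in range(t0[1], t_end[1]):
                    for w in range(t0[0], t_end[0]):
                        # tidx_to_bufi: 4D tensor index -> linear buffer index
                        bufi = (w * strides[0] + h * strides[1]
                                + c * strides[2] + n * strides[3])
                        lo, hi = min(lo, data[bufi]), max(hi, data[bufi])
        results.append((block_id, lo, hi))  # feed (lo, hi) into calc_scale_zp
    return results


# Example: a 4x4 single-channel tensor scanned in 2x2 blocks
W, H, C, N = 4, 4, 1, 1
data = [float(i) for i in range(W * H * C * N)]
print(choose_qparams_blockwise(data, (W, H, C, N), (2, 2, 1, 1),
                               (1, W, W * H, W * H * C)))
```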
Differential Revision: [D78436638](https://our.internmc.facebook.com/intern/diff/D78436638/)

cc SS-JIA manuelcandales cbilgin

[ghstack-poisoned]

1 parent 0840efc commit 0b1b6c9

File tree

2 files changed: +97 -58 lines changed


backends/vulkan/_passes/fuse_quantized_ops.py

Lines changed: 97 additions & 57 deletions
```diff
@@ -215,57 +215,20 @@ def fuse_into_linear_qcnw_node(
 #########################
 
 
-def matches_linear_qta8a_qga4w_pattern(
-    program: ExportedProgram, node: torch.fx.Node
-) -> Optional[Tuple[int, int]]:
-    """
-    Checks if the nodes surrounding a linear node matches the pattern for dynamic
-    activation + grouped weight quantized linear (QTA8A_QGA4W).
-
-    This pattern involves:
-    1. Dynamic quantization of input activations (8-bit)
-    2. Grouped quantization of weights (4-bit with group size)
-
-    The expected pattern from Int8DynActInt4WeightQuantizer is:
-        scale, zero_point = choose_qparams_affine(input)
-        quantized_input = quantize_affine(input, scale, zero_point)
-        dequantized_input = dequantize_affine(quantized_input, ...)
-        dequantized_weight = dequantize_affine(weight, weight_scales, weight_zeros)
-        output = linear(dequantized_input, dequantized_weight)
-
-    If the pattern matches, return (group_size, weight_bits), otherwise None.
-    """
-    if not utils.is_linear_node(node):
-        return None
-
-    input_node = node.args[0]
-    weight_node = node.args[1]
-
-    # Type checking - ensure we have torch.fx.Node objects
-    if not isinstance(weight_node, torch.fx.Node):
-        return None
-    if not isinstance(input_node, torch.fx.Node):
-        return None
-
-    # Check if input is dequantized with dequantize_affine (from dynamic quantization)
-    if not (
-        input_node.op == "call_function"
-        and input_node.target is not None
-        and hasattr(input_node.target, "__name__")
-        and "dequantize_affine" in getattr(input_node.target, "__name__", "")
-    ):
-        return None
+def _is_dequantize_affine_node(node: torch.fx.Node) -> bool:
+    """Check if a node is a dequantize_affine function call."""
+    return (
+        node.op == "call_function"
+        and node.target is not None
+        and hasattr(node.target, "__name__")
+        and "dequantize_affine" in getattr(node.target, "__name__", "")
+    )
 
-    # Check if weight is dequantized with dequantize_affine
-    if not (
-        weight_node.op == "call_function"
-        and weight_node.target is not None
-        and hasattr(weight_node.target, "__name__")
-        and "dequantize_affine" in getattr(weight_node.target, "__name__", "")
-    ):
-        return None
 
-    # Get the original quantized weight and quantization parameters
+def _validate_qta8a_qga4w_nodes(
+    program: ExportedProgram, weight_node: torch.fx.Node
+) -> Optional[Tuple[torch.fx.Node, torch.fx.Node, torch.fx.Node]]:
+    """Validate and extract weight quantization nodes for QTA8A_QGA4W pattern."""
     if len(weight_node.args) < 4:
         return None
 
@@ -287,7 +250,16 @@ def matches_linear_qta8a_qga4w_pattern(
     if not is_param_node(program, weight_zeros):
         return None
 
-    # Get tensors to analyze the quantization scheme
+    return orig_weight, weight_scales, weight_zeros
+
+
+def _validate_qta8a_qga4w_tensors(
+    program: ExportedProgram,
+    orig_weight: torch.fx.Node,
+    weight_scales: torch.fx.Node,
+    weight_zeros: torch.fx.Node,
+) -> Optional[Tuple[torch.Tensor, torch.Tensor, torch.Tensor]]:
+    """Validate and extract weight tensors for QTA8A_QGA4W pattern."""
     orig_weight_tensor = get_param_tensor(program, orig_weight)
     weight_scales_tensor = get_param_tensor(program, weight_scales)
     weight_zeros_tensor = get_param_tensor(program, weight_zeros)
@@ -299,20 +271,24 @@ def matches_linear_qta8a_qga4w_pattern(
     if not isinstance(weight_zeros_tensor, torch.Tensor):
         return None
 
-    # Check if weight is quantized to 4 bits (values should be in [-8, 7] range)
+    return orig_weight_tensor, weight_scales_tensor, weight_zeros_tensor
+
+
+def _validate_4bit_quantization(orig_weight_tensor: torch.Tensor) -> bool:
+    """Check if weight tensor is quantized to 4 bits."""
     quant_min = orig_weight_tensor.min().item()
     quant_max = orig_weight_tensor.max().item()
+    return quant_min >= -8 and quant_max <= 7
 
-    if not (quant_min >= -8 and quant_max <= 7):
-        return None
-
-    # Determine group size from the scales tensor shape
-    # For grouped quantization, scales shape should be [out_features, in_features // group_size]
-    out_features, in_features = orig_weight_tensor.shape
 
+def _calculate_group_size(
+    orig_weight_tensor: torch.Tensor, weight_scales_tensor: torch.Tensor
+) -> Optional[int]:
+    """Calculate and validate group size from tensor shapes."""
     if len(weight_scales_tensor.shape) != 2:
         return None
 
+    out_features, in_features = orig_weight_tensor.shape
     scales_out_features, num_groups = weight_scales_tensor.shape
 
     if scales_out_features != out_features:
@@ -322,6 +298,70 @@ def matches_linear_qta8a_qga4w_pattern(
     if in_features % group_size != 0:
         return None
 
+    return group_size
+
+
+def matches_linear_qta8a_qga4w_pattern(
+    program: ExportedProgram, node: torch.fx.Node
+) -> Optional[Tuple[int, int]]:
+    """
+    Checks if the nodes surrounding a linear node matches the pattern for dynamic
+    activation + grouped weight quantized linear (QTA8A_QGA4W).
+
+    This pattern involves:
+    1. Dynamic quantization of input activations (8-bit)
+    2. Grouped quantization of weights (4-bit with group size)
+
+    The expected pattern from Int8DynActInt4WeightQuantizer is:
+        scale, zero_point = choose_qparams_affine(input)
+        quantized_input = quantize_affine(input, scale, zero_point)
+        dequantized_input = dequantize_affine(quantized_input, ...)
+        dequantized_weight = dequantize_affine(weight, weight_scales, weight_zeros)
+        output = linear(dequantized_input, dequantized_weight)
+
+    If the pattern matches, return (group_size, weight_bits), otherwise None.
+    """
+    if not utils.is_linear_node(node):
+        return None
+
+    input_node = node.args[0]
+    weight_node = node.args[1]
+
+    # Type checking - ensure we have torch.fx.Node objects
+    if not isinstance(weight_node, torch.fx.Node):
+        return None
+    if not isinstance(input_node, torch.fx.Node):
+        return None
+
+    # Check if input and weight are dequantized with dequantize_affine
+    if not _is_dequantize_affine_node(input_node):
+        return None
+    if not _is_dequantize_affine_node(weight_node):
+        return None
+
+    # Validate and extract weight quantization nodes
+    weight_nodes = _validate_qta8a_qga4w_nodes(program, weight_node)
+    if weight_nodes is None:
+        return None
+    orig_weight, weight_scales, weight_zeros = weight_nodes
+
+    # Validate and extract weight tensors
+    weight_tensors = _validate_qta8a_qga4w_tensors(
+        program, orig_weight, weight_scales, weight_zeros
+    )
+    if weight_tensors is None:
+        return None
+    orig_weight_tensor, weight_scales_tensor, weight_zeros_tensor = weight_tensors
+
+    # Check if weight is quantized to 4 bits
+    if not _validate_4bit_quantization(orig_weight_tensor):
+        return None
+
+    # Calculate and validate group size
+    group_size = _calculate_group_size(orig_weight_tensor, weight_scales_tensor)
+    if group_size is None:
+        return None
+
     # Verify this is 4-bit grouped quantization
     weight_bits = 4
 
```

backends/vulkan/custom_ops_lib.py

Lines changed: 0 additions & 1 deletion
```diff
@@ -258,7 +258,6 @@ def linear_qta8a_qga4w(
         weight_zeros: Per-group zero points for weights
     """
     original_x_shape = x_quantized.shape
-    batch_size = original_x_shape[0]
    feature_dim = original_x_shape[-1]
 
     # Reshape for processing
```
