Update on "[ET-VK][Ops] torchao.quantize_affine vulkan impl and shader and cleanup"

morelos · morelos · commit 1705ce347c5e · 2025-07-17T15:18:58.000-07:00
# Changes

* Implement the `torchao.quantize_affine` operator in the Vulkan backend with support for both texture and buffer storage
* Add a block-wise quantization mode to the `quantize_texture.glsl` and `quantize_buffer.glsl` shaders for configurable tensor block quantization
* Introduce a test suite in `affine_test.cpp` with multi-dimensional tensor validation and a reference implementation
* Extend the quantization infrastructure in `Quantize.cpp` to handle affine transformations with configurable block sizes and quantization parameters

BE: Improved the documentation of the shader logic so that it is more detailed and clearer.

NOTE: I moved `quantize_affine` and future affine operators into a new custom test file, `affine_test.cpp`, because the other quantization testing framework was getting large, and it makes more sense to separate the torchao and quantized_decomposed namespaces. I believe the _decomposed namespace is being phased out in favor of this affine operator, so deprecation will be easier in the future.

# Motivation

The existing Vulkan quantization infrastructure lacked support for the `torchao.quantize_affine` operator, which is essential for efficient dynamic quantization. The `quantize_affine` operator provides flexible block-wise quantization that allows different scale and zero-point values for tensor blocks, enabling:

* **Block-wise Quantization**: Applies quantization parameters to configurable tensor blocks rather than entire tensors, improving quantization accuracy for heterogeneous data distributions
* **Affine Transformation**: Uses the formula `qvalue = clamp(round(value / scale) + zero_point, quant_min, quant_max)` for precise floating-point to integer mapping

# Operator Description

The `quantize_affine` operator converts floating-point tensor values to n-bit integer representations using pre-computed quantization parameters (scale and zero_point) applied to configurable tensor blocks.
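
As a rough illustration of the affine formula quoted above, here is a minimal Python sketch for a single value. It is not the shader code: the Python function only borrows the name `quantize_val` from the description, its signature is an assumption, and Python's `round()` rounds halves to even, which may not match the rounding used on the GPU.

```python
def quantize_val(value: float, scale: float, zero_point: int,
                 quant_min: int, quant_max: int) -> int:
    """Affine quantization of one value (illustrative sketch only).

    Implements qvalue = clamp(round(value / scale) + zero_point,
                              quant_min, quant_max).
    """
    # Assumption: Python's round() (round-half-to-even) stands in for the
    # shader's rounding, which may behave differently on exact .5 values.
    q = round(value / scale) + zero_point
    # Clamp into the representable integer range.
    return max(quant_min, min(quant_max, q))


# Example with an int8 range and a power-of-two scale.
in_range = quantize_val(1.0, 0.25, 3, -128, 127)      # 4 + 3 = 7
clamped_hi = quantize_val(100.0, 0.25, 3, -128, 127)  # 403 clamps to 127
clamped_lo = quantize_val(-100.0, 0.25, 3, -128, 127) # -397 clamps to -128
```

In block-wise mode the only change is that `scale` and `zero_point` are looked up per block rather than being shared by the whole tensor.
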
Block-wise quantization divides tensors into blocks and applies separate quantization parameters to each block, allowing fine-grained control over quantization precision. The quantization formula is:

`qvalue = clamp(round(value / scale) + zero_point, quant_min, quant_max)`

**Storage Requirements**: Scale and zero_point tensors must use buffer storage with width-packed layout. Input/output tensors support both buffer and texture storage with standard axis mapping.

# Block-wise Quantization Implementation

Block-wise quantization enables fine-grained quantization by dividing tensors into blocks and applying separate quantization parameters to each block. The implementation uses several key data structures computed in `Quantize.cpp`:

* **`block_size_vec`**: WHCN-ordered block dimensions converted from PyTorch NCHW layout (e.g., [3,3,2,1] for 3×3×2×1 blocks)
* **`tensor_size_whcn`**: Input tensor dimensions converted to WHCN layout using `utils::make_whcn_ivec4()`
* **`num_blocks_vec`**: Number of blocks per dimension calculated as `tensor_size_whcn / block_size_vec`
* **`block_stride_vec`**: Pre-computed linear strides for block grid indexing `{1, #W, #W*#H, #W*#H*#C}` to enable efficient block ID calculation

The block coordinate calculation uses `bcoord = tidx / blockSize`, where `tidx` is the tensor coordinate in WHCN layout. The linear block ID is then computed as:

`block_id = bcoord.x * blockStride.x + bcoord.y * blockStride.y + bcoord.z * blockStride.z + bcoord.w * blockStride.w`

# Shader Algorithm Overview

## Texture Storage Implementation (`quantize_texture.glsl`)

**Workgroup Configuration**:

- **Global WG Size**: Default sizing based on texture dimensions
- **Local WG Size**: Default with special handling for batch dimension quantization (Z dimension set to 1 for proper workgroup dispatching when `global_workgroup_size[2] > 1`)

**Block-wise Mode Algorithm**:

The shader processes 3D texture positions where each position represents a texel containing 4 width-packed components. For each texel at position `pos`, it calculates a base tensor index `base_tidx = ivec4(pos.x * 4, pos.y, pos.z, 0)` to account for width-packing. For each of the 4 components in the texel, it computes the actual tensor coordinate: `tidx = ivec4(base_tidx.x + i, base_tidx.y, (foldedZ % C_total), (foldedZ / C_total))`, where `foldedZ = pos.z` handles batch-channel folding in 4D tensors and `C_total = numBlocks.z * blockSize.z` represents the total channel dimension. The block coordinate is calculated using integer division: `bcoord = tidx / blockSize`. The linear block ID then uses the pre-computed strides: `block_id = bcoord.x * blockStride.x + bcoord.y * blockStride.y + bcoord.z * blockStride.z + bcoord.w * blockStride.w`. Each component is quantized using its corresponding block's parameters, `qvalue = quantize_val(value, t_scale[block_id], t_zero_point[block_id])`, and written to the output texel.

## Buffer Storage Implementation (`quantize_buffer.glsl`)

**Workgroup Configuration**:

- **Global WG Size**: Default sizing based on buffer element count
- **Local WG Size**: Default sizing without special constraints

**Block-wise Mode Algorithm**:

The shader processes linear buffer indices using `gl_GlobalInvocationID.x` as the output buffer index. It converts this to tensor coordinates using `bufi_to_tidx(out_bufi, t_out_strides, out_dim_order)`, which handles the buffer-to-tensor index mapping with proper stride calculations. For each element, it computes the block coordinate directly: `bcoord = out_tidx / blockSize`, where `out_tidx` is the 4D tensor coordinate in WHCN layout. The linear block ID calculation uses the same pre-computed stride approach: `block_id = bcoord.x * blockStride.x + bcoord.y * blockStride.y + bcoord.z * blockStride.z + bcoord.w * blockStride.w`. The element value is loaded using the corresponding input buffer index: `value = t_in[in_bufi]`, where `in_bufi = tidx_to_bufi(out_tidx, t_in_strides)`.
Quantization applies the block-specific parameters: `qvalue = quantize_val(value, t_scale[block_id], t_zero_point[block_id])`.

**Future Improvements**: Dynamic workgroup sizing based on block dimensions; the current default sizing can likely be improved.

Differential Revision: [D78302195](https://our.internmc.facebook.com/intern/diff/D78302195/)

cc SS-JIA manuelcandales cbilgin

[ghstack-poisoned]
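
The block indexing described in the message above (block strides, `bcoord = tidx / blockSize`, the linear `block_id`, and the texel-to-tensor unfolding used by the texture shader) can be sketched in Python as follows. This is an illustration under stated assumptions rather than the actual `Quantize.cpp` or GLSL code: the helper names `block_strides`, `block_id`, and `texel_tidx` are hypothetical, all tuples are in WHCN order, and `#W`, `#H`, `#C` are taken to mean the number of blocks along each dimension.

```python
from typing import List, Sequence, Tuple


def block_strides(num_blocks: Sequence[int]) -> List[int]:
    """Linear strides of the block grid in WHCN order: {1, #W, #W*#H, #W*#H*#C}."""
    w, h, c, _ = num_blocks
    return [1, w, w * h, w * h * c]


def block_id(tidx: Sequence[int], block_size: Sequence[int],
             strides: Sequence[int]) -> int:
    """Map a WHCN tensor coordinate to the linear index of its block."""
    # bcoord = tidx / blockSize (component-wise integer division)
    bcoord = [t // b for t, b in zip(tidx, block_size)]
    # Dot product of the block coordinate with the block-grid strides.
    return sum(bc * s for bc, s in zip(bcoord, strides))


def texel_tidx(pos: Tuple[int, int, int], component: int,
               c_total: int) -> Tuple[int, int, int, int]:
    """Unfold a texture position plus texel component into a WHCN coordinate."""
    x, y, folded_z = pos
    # Width is packed 4 per texel; batch and channel are folded into z.
    return (x * 4 + component, y, folded_z % c_total, folded_z // c_total)


# Hypothetical example: a 6x6x4x1 (WHCN) tensor split into 3x3x2x1 blocks.
tensor_size = (6, 6, 4, 1)
block_size = (3, 3, 2, 1)
num_blocks = [t // b for t, b in zip(tensor_size, block_size)]
strides = block_strides(num_blocks)
c_total = num_blocks[2] * block_size[2]

tidx = texel_tidx((1, 4, 3), 1, c_total)
bid = block_id(tidx, block_size, strides)
```

The buffer path would use the same `block_id` helper, with the tensor coordinate coming from the linear buffer index instead of a texture position.
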
diff --git a/backends/vulkan/_passes/fuse_quantized_ops.py b/backends/vulkan/_passes/fuse_quantized_ops.py
@@ -215,57 +215,20 @@ def fuse_into_linear_qcnw_node(
 #########################
 
 
-def matches_linear_qta8a_qga4w_pattern(
-    program: ExportedProgram, node: torch.fx.Node
-) -> Optional[Tuple[int, int]]:
-    """
-    Checks if the nodes surrounding a linear node matches the pattern for dynamic
-    activation + grouped weight quantized linear (QTA8A_QGA4W).
-
-    This pattern involves:
-    1. Dynamic quantization of input activations (8-bit)
-    2. Grouped quantization of weights (4-bit with group size)
-
-    The expected pattern from Int8DynActInt4WeightQuantizer is:
-        scale, zero_point = choose_qparams_affine(input)
-        quantized_input = quantize_affine(input, scale, zero_point)
-        dequantized_input = dequantize_affine(quantized_input, ...)
-        dequantized_weight = dequantize_affine(weight, weight_scales, weight_zeros)
-        output = linear(dequantized_input, dequantized_weight)
-
-    If the pattern matches, return (group_size, weight_bits), otherwise None.
-    """
-    if not utils.is_linear_node(node):
-        return None
-
-    input_node = node.args[0]
-    weight_node = node.args[1]
-
-    # Type checking - ensure we have torch.fx.Node objects
-    if not isinstance(weight_node, torch.fx.Node):
-        return None
-    if not isinstance(input_node, torch.fx.Node):
-        return None
-
-    # Check if input is dequantized with dequantize_affine (from dynamic quantization)
-    if not (
-        input_node.op == "call_function"
-        and input_node.target is not None
-        and hasattr(input_node.target, "__name__")
-        and "dequantize_affine" in getattr(input_node.target, "__name__", "")
-    ):
-        return None
+def _is_dequantize_affine_node(node: torch.fx.Node) -> bool:
+    """Check if a node is a dequantize_affine function call."""
+    return (
+        node.op == "call_function"
+        and node.target is not None
+        and hasattr(node.target, "__name__")
+        and "dequantize_affine" in getattr(node.target, "__name__", "")
+    )
 
-    # Check if weight is dequantized with dequantize_affine
-    if not (
-        weight_node.op == "call_function"
-        and weight_node.target is not None
-        and hasattr(weight_node.target, "__name__")
-        and "dequantize_affine" in getattr(weight_node.target, "__name__", "")
-    ):
-        return None
 
-    # Get the original quantized weight and quantization parameters
+def _validate_qta8a_qga4w_nodes(
+    program: ExportedProgram, weight_node: torch.fx.Node
+) -> Optional[Tuple[torch.fx.Node, torch.fx.Node, torch.fx.Node]]:
+    """Validate and extract weight quantization nodes for QTA8A_QGA4W pattern."""
     if len(weight_node.args) < 4:
         return None
 
@@ -287,7 +250,16 @@ def matches_linear_qta8a_qga4w_pattern(
     if not is_param_node(program, weight_zeros):
         return None
 
-    # Get tensors to analyze the quantization scheme
+    return orig_weight, weight_scales, weight_zeros
+
+
+def _validate_qta8a_qga4w_tensors(
+    program: ExportedProgram,
+    orig_weight: torch.fx.Node,
+    weight_scales: torch.fx.Node,
+    weight_zeros: torch.fx.Node,
+) -> Optional[Tuple[torch.Tensor, torch.Tensor, torch.Tensor]]:
+    """Validate and extract weight tensors for QTA8A_QGA4W pattern."""
     orig_weight_tensor = get_param_tensor(program, orig_weight)
     weight_scales_tensor = get_param_tensor(program, weight_scales)
     weight_zeros_tensor = get_param_tensor(program, weight_zeros)
@@ -299,20 +271,24 @@ def matches_linear_qta8a_qga4w_pattern(
     if not isinstance(weight_zeros_tensor, torch.Tensor):
         return None
 
-    # Check if weight is quantized to 4 bits (values should be in [-8, 7] range)
+    return orig_weight_tensor, weight_scales_tensor, weight_zeros_tensor
+
+
+def _validate_4bit_quantization(orig_weight_tensor: torch.Tensor) -> bool:
+    """Check if weight tensor is quantized to 4 bits."""
     quant_min = orig_weight_tensor.min().item()
     quant_max = orig_weight_tensor.max().item()
+    return quant_min >= -8 and quant_max <= 7
 
-    if not (quant_min >= -8 and quant_max <= 7):
-        return None
-
-    # Determine group size from the scales tensor shape
-    # For grouped quantization, scales shape should be [out_features, in_features // group_size]
-    out_features, in_features = orig_weight_tensor.shape
 
+def _calculate_group_size(
+    orig_weight_tensor: torch.Tensor, weight_scales_tensor: torch.Tensor
+) -> Optional[int]:
+    """Calculate and validate group size from tensor shapes."""
     if len(weight_scales_tensor.shape) != 2:
         return None
 
+    out_features, in_features = orig_weight_tensor.shape
     scales_out_features, num_groups = weight_scales_tensor.shape
 
     if scales_out_features != out_features:
@@ -322,6 +298,70 @@ def matches_linear_qta8a_qga4w_pattern(
     if in_features % group_size != 0:
         return None
 
+    return group_size
+
+
+def matches_linear_qta8a_qga4w_pattern(
+    program: ExportedProgram, node: torch.fx.Node
+) -> Optional[Tuple[int, int]]:
+    """
+    Checks if the nodes surrounding a linear node matches the pattern for dynamic
+    activation + grouped weight quantized linear (QTA8A_QGA4W).
+
+    This pattern involves:
+    1. Dynamic quantization of input activations (8-bit)
+    2. Grouped quantization of weights (4-bit with group size)
+
+    The expected pattern from Int8DynActInt4WeightQuantizer is:
+        scale, zero_point = choose_qparams_affine(input)
+        quantized_input = quantize_affine(input, scale, zero_point)
+        dequantized_input = dequantize_affine(quantized_input, ...)
+        dequantized_weight = dequantize_affine(weight, weight_scales, weight_zeros)
+        output = linear(dequantized_input, dequantized_weight)
+
+    If the pattern matches, return (group_size, weight_bits), otherwise None.
+    """
+    if not utils.is_linear_node(node):
+        return None
+
+    input_node = node.args[0]
+    weight_node = node.args[1]
+
+    # Type checking - ensure we have torch.fx.Node objects
+    if not isinstance(weight_node, torch.fx.Node):
+        return None
+    if not isinstance(input_node, torch.fx.Node):
+        return None
+
+    # Check if input and weight are dequantized with dequantize_affine
+    if not _is_dequantize_affine_node(input_node):
+        return None
+    if not _is_dequantize_affine_node(weight_node):
+        return None
+
+    # Validate and extract weight quantization nodes
+    weight_nodes = _validate_qta8a_qga4w_nodes(program, weight_node)
+    if weight_nodes is None:
+        return None
+    orig_weight, weight_scales, weight_zeros = weight_nodes
+
+    # Validate and extract weight tensors
+    weight_tensors = _validate_qta8a_qga4w_tensors(
+        program, orig_weight, weight_scales, weight_zeros
+    )
+    if weight_tensors is None:
+        return None
+    orig_weight_tensor, weight_scales_tensor, weight_zeros_tensor = weight_tensors
+
+    # Check if weight is quantized to 4 bits
+    if not _validate_4bit_quantization(orig_weight_tensor):
+        return None
+
+    # Calculate and validate group size
+    group_size = _calculate_group_size(orig_weight_tensor, weight_scales_tensor)
+    if group_size is None:
+        return None
+
     # Verify this is 4-bit grouped quantization
     weight_bits = 4
 
diff --git a/backends/vulkan/custom_ops_lib.py b/backends/vulkan/custom_ops_lib.py
@@ -258,7 +258,6 @@ def linear_qta8a_qga4w(
         weight_zeros: Per-group zero points for weights
     """
     original_x_shape = x_quantized.shape
-    batch_size = original_x_shape[0]
     feature_dim = original_x_shape[-1]
 
     # Reshape for processing