Commit 7076257

Author: morelos
Update on "[ET-VK][Ops] torchao.choose_qparams_affine vulkan impl and shader (buffer only) and cleanup"
# Changes

* Implement the `torchao.choose_qparams_affine` operator in the Vulkan backend with comprehensive buffer storage support
* Add block-wise quantization parameter computation in the `choose_qparams_buffer.glsl` shader for configurable tensor block analysis
* Extend the quantization parameter infrastructure in `ChooseQParams.cpp` to handle affine transformations with configurable block sizes and multiple mapping types
* Support three quantization mapping strategies, ASYMMETRIC, SYMMETRIC, and SYMMETRIC_NO_CLIPPING_ERR, for optimal parameter selection
* Consolidate the logic for choosing scale and zero point between the affine cases and the regular quantized_decomposed cases
* BE: Improved the documentation in the shader logic so that it is more detailed and clear

# Motivation

The existing Vulkan quantization infrastructure lacked support for the `torchao.choose_qparams_affine` operator, which is essential for computing optimal quantization parameters in dynamic quantization workflows. The `choose_qparams_affine` operator provides flexible block-wise parameter computation that analyzes statistical distributions within tensor blocks, enabling:

* **Block-wise Parameter Computation**: Analyzes configurable tensor blocks to compute optimal scale and zero-point values, improving quantization accuracy for heterogeneous data distributions
* **Multiple Mapping Types**: Supports ASYMMETRIC, SYMMETRIC, and SYMMETRIC_NO_CLIPPING_ERR quantization strategies for different precision/performance trade-offs

# Operator Description

The `choose_qparams_affine` operator computes optimal quantization parameters (scale and zero_point) from floating-point tensor blocks using statistical analysis of data distributions. Block-wise parameter computation divides tensors into blocks and analyzes each block independently to determine the best quantization mapping for subsequent quantization operations.

The parameter calculation varies by mapping type (a scalar sketch of this math follows at the end of this section):

- **ASYMMETRIC**: `scale = (max - min) / (quant_max - quant_min)`, `zero_point = quant_min - round(min / scale)`
- **SYMMETRIC**: `scale = max_abs / ((quant_max - quant_min) / 2)`, `zero_point = midpoint`
- **SYMMETRIC_NO_CLIPPING_ERR**: `scale = max(abs(min) / abs(quant_min), max / quant_max)`, `zero_point = midpoint`

**Storage Requirements**: Input tensors must be floating-point (kFloat) with width-packed layout. Output scale/zero_point tensors use buffer storage.

NOTE: A texture storage implementation is not supported due to the complexity of block-wise coordinate mapping in 3D texture space; it will likely be needed for better efficiency in the future.
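To make the three formulas concrete, here is a minimal scalar C++ sketch of the per-block parameter math. It is illustrative only: `MappingType`, `QParams`, and `choose_qparams_block` are hypothetical names rather than the actual shader or `ChooseQParams.cpp` symbols, and clamping the range to include zero plus using `eps` as a lower bound on `scale` are assumptions mirroring typical choose_qparams behavior.

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>

// Hypothetical names; a scalar sketch of the per-block parameter math.
enum class MappingType { ASYMMETRIC, SYMMETRIC, SYMMETRIC_NO_CLIPPING_ERR };

struct QParams {
  float scale;
  int32_t zero_point;
};

QParams choose_qparams_block(
    float lo, float hi,  // block min / max from the scan
    int64_t quant_min, int64_t quant_max,
    MappingType mapping, float eps) {
  // Assumption: extend the range to cover zero so that 0.0f is exactly
  // representable, as in typical choose_qparams implementations.
  lo = std::min(lo, 0.0f);
  hi = std::max(hi, 0.0f);

  QParams out{};
  if (mapping == MappingType::ASYMMETRIC) {
    // scale = (max - min) / (quant_max - quant_min)
    out.scale = (hi - lo) / static_cast<float>(quant_max - quant_min);
    out.scale = std::max(out.scale, eps);  // assumed eps floor
    // zero_point = quant_min - round(min / scale)
    out.zero_point =
        static_cast<int32_t>(quant_min - std::lround(lo / out.scale));
  } else if (mapping == MappingType::SYMMETRIC) {
    // scale = max_abs / ((quant_max - quant_min) / 2), zero_point = midpoint
    const float max_abs = std::max(std::fabs(lo), std::fabs(hi));
    out.scale = max_abs / (static_cast<float>(quant_max - quant_min) / 2.0f);
    out.scale = std::max(out.scale, eps);
    out.zero_point = static_cast<int32_t>((quant_max + quant_min + 1) / 2);
  } else {  // SYMMETRIC_NO_CLIPPING_ERR
    // Separate scales for the negative and positive halves; taking the max
    // guarantees neither end of the data range is clipped.
    const float s_neg =
        std::fabs(lo) / std::fabs(static_cast<float>(quant_min));
    const float s_pos = hi / static_cast<float>(quant_max);
    out.scale = std::max(std::max(s_neg, s_pos), eps);
    out.zero_point = static_cast<int32_t>((quant_max + quant_min + 1) / 2);
  }
  return out;
}
```

For example, `lo = -1.0f`, `hi = 3.0f`, `quant_min = -128`, `quant_max = 127` under ASYMMETRIC yields `scale = 4.0 / 255 ≈ 0.0157` and `zero_point = -64`.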
# Block-wise Parameter Computation Implementation

Block-wise parameter computation enables fine-grained quantization analysis by dividing tensors into blocks and computing a separate scale/zero_point pair for each block.

The implementation uses several key data structures computed in `ChooseQParams.cpp`:

* **`block_size_vec`**: WHCN-ordered block dimensions converted from the PyTorch NCHW layout (e.g., [3,3,2,1] for 3×3×2×1 blocks)
* **`tensor_size_whcn`**: Input tensor dimensions converted to WHCN layout using `utils::make_whcn_ivec4()`
* **`num_blocks_vec`**: Number of blocks per dimension, calculated as `ceil(tensor_size_whcn / block_size_vec)` to handle non-divisible dimensions
* **`block_stride_vec`**: Pre-computed linear strides for block grid indexing, `{1, #W, #W*#H, #W*#H*#C}`, enabling efficient block ID calculation
* **`mapping_type`**: Integer encoding of the quantization strategy (0=ASYMMETRIC, 1=SYMMETRIC, 2=SYMMETRIC_NO_CLIPPING_ERR)

The block coordinate calculation uses `block_coord = block_id_to_coord(block_id)`, which converts a linear block ID back to 4D WHCN coordinates, and then computes the element ranges `t0 = block_coord * blockSize` and `tEnd = t0 + blockSize` for nested-loop iteration.

# Shader Algorithm Overview

## Buffer Storage Implementation (`choose_qparams_buffer.glsl`)

**Workgroup Configuration**:

- **Global WG Size**: `{nBlocks, 1u, 1u}`, where `nBlocks` is the total number of blocks, computed from `ceil(tensor_size / block_size)` for each dimension
- **Local WG Size**: `{1u, 1u, 1u}` (a single thread per block for simplicity, though this could be optimized for larger blocks)

**Block-wise Mode Algorithm**:

The shader uses a multi-level nested approach to process tensor blocks efficiently. Each thread is assigned multiple blocks using strided access: `for (uint block_id = gl_GlobalInvocationID.x; block_id < TOTAL_BLOCKS; block_id += STRIDE)`, where `STRIDE = gl_WorkGroupSize.x * gl_NumWorkGroups.x`.

For each assigned block, the algorithm performs several key steps (see the sketch after these steps):

**1. Block Coordinate Conversion**: The `block_id_to_coord(block_id)` function converts a linear block ID to 4D WHCN coordinates using modular arithmetic.

**2. Element Range Calculation**: Computes the inclusive start coordinate `t0 = bc * blockSize` and the exclusive end coordinate `tEnd = t0 + blockSize` to define the block's element boundaries in tensor space.

**3. Nested Loop Min/Max Scan**: Uses four nested loops to iterate through all elements within the block: `for (int n = t0.w; n < tEnd.w; ++n) for (int c = t0.z; c < tEnd.z; ++c) for (int h = t0.y; h < tEnd.y; ++h) for (int w = t0.x; w < tEnd.x; ++w)`. Each element is accessed using `tidx_to_bufi(ivec4(w, h, c, n), t_in_strides)` to convert 4D tensor coordinates to linear buffer indices with proper stride handling.

**4. Parameter Calculation**: Calls `calc_scale_zp(lo, hi, quant_min, quant_max, mapping_type, eps, scale, zp)`, which implements the three mapping strategies:

* **ASYMMETRIC (mapping_type=0)**: Maps the full range [min, max] to [quant_min, quant_max], preserving the data distribution
* **SYMMETRIC (mapping_type=1)**: Centers around zero using `max_abs = max(abs(min), abs(max))` for balanced quantization
* **SYMMETRIC_NO_CLIPPING_ERR (mapping_type=2)**: Computes separate scales for the positive and negative ranges and uses the maximum to prevent clipping
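As a reference for steps 1-3, the following is a hypothetical CPU-side C++ emulation of the block scan. The `IVec4` helper, parameter names, and the clamping of `tEnd` to the tensor extent for non-divisible dimensions are illustrative assumptions, not the real GLSL symbols.

```cpp
#include <algorithm>
#include <cstdint>
#include <limits>
#include <vector>

// WHCN order throughout: x=W, y=H, z=C, w=N (hypothetical helper type).
struct IVec4 { int x, y, z, w; };

// Convert a linear block ID back to 4D WHCN block coordinates; the modular
// arithmetic is the inverse of the {1, #W, #W*#H, #W*#H*#C} stride layout.
IVec4 block_id_to_coord(int block_id, const IVec4& num_blocks) {
  IVec4 bc;
  bc.x = block_id % num_blocks.x;                                    // W
  bc.y = (block_id / num_blocks.x) % num_blocks.y;                   // H
  bc.z = (block_id / (num_blocks.x * num_blocks.y)) % num_blocks.z;  // C
  bc.w = block_id / (num_blocks.x * num_blocks.y * num_blocks.z);    // N
  return bc;
}

// Scan one block of `data` (a linear buffer with WHCN strides) for min/max.
// Assumption: tEnd is clamped to the tensor extent to handle edge blocks
// when a dimension is not divisible by the block size.
void scan_block(const std::vector<float>& data, const IVec4& strides,
                const IVec4& tensor_size, const IVec4& block_size,
                const IVec4& bc, float& lo, float& hi) {
  const IVec4 t0{bc.x * block_size.x, bc.y * block_size.y,
                 bc.z * block_size.z, bc.w * block_size.w};
  const IVec4 tEnd{std::min(t0.x + block_size.x, tensor_size.x),
                   std::min(t0.y + block_size.y, tensor_size.y),
                   std::min(t0.z + block_size.z, tensor_size.z),
                   std::min(t0.w + block_size.w, tensor_size.w)};
  lo = std::numeric_limits<float>::max();
  hi = std::numeric_limits<float>::lowest();
  for (int n = t0.w; n < tEnd.w; ++n)
    for (int c = t0.z; c < tEnd.z; ++c)
      for (int h = t0.y; h < tEnd.y; ++h)
        for (int w = t0.x; w < tEnd.x; ++w) {
          // Equivalent of tidx_to_bufi(ivec4(w, h, c, n), strides).
          const int bufi =
              w * strides.x + h * strides.y + c * strides.z + n * strides.w;
          lo = std::min(lo, data[bufi]);
          hi = std::max(hi, data[bufi]);
        }
}
```

The resulting `lo`/`hi` pair is what the shader then feeds into the parameter calculation of step 4.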
**Future Improvements**: Implement workgroup-level reduction for large blocks, optimize memory access patterns for better cache utilization, and explore a texture storage implementation with simplified block alignment constraints.

Differential Revision: [D78436638](https://our.internmc.facebook.com/intern/diff/D78436638/)

cc SS-JIA manuelcandales cbilgin

[ghstack-poisoned]

1 parent 4b61f7e commit 7076257

File tree: 1 file changed, +9 −32 lines


backends/vulkan/test/op_tests/affine_test.cpp

Lines changed: 9 additions & 32 deletions
```diff
@@ -279,40 +279,26 @@ at::Tensor dequantize_affine_reference_impl(
       std::string("INT"));
 }
 
-/*
- * choose_qparams_affine_reference_impl
- * -----------------------------------
- * A faithful C++ re-implementation of the Python helper
- * choose_qparams_affine (see quant_primitives.py)
- *
- * Supported input dtypes : float32 / float16 / bfloat16
- * Supported mapping types: ASYMMETRIC / SYMMETRIC / SYMMETRIC_NO_CLIPPING_ERR
- */
 std::tuple<at::Tensor, at::Tensor> choose_qparams_affine_reference_impl(
-    const at::Tensor& input_, // F32 / F16 / BF16
-    const std::string&
-        mapping_type, // ASYMMETRIC / SYMMETRIC / SYMMETRIC_NO_CLIPPING_ERR
-    const std::vector<int64_t>& block_size, // same length as input_.dim()
+    const at::Tensor& input_,
+    const std::string& mapping_type,
+    const std::vector<int64_t>& block_size,
     int64_t quant_min,
     int64_t quant_max,
     double eps) {
-  // -------------------- 1. Validations --------------------------------------
   const int64_t ndim = input_.dim();
   _check_dims("input", block_size.size(), ndim);
 
-  TORCH_CHECK(
+  VK_CHECK_COND(
       input_.scalar_type() == at::kFloat || input_.scalar_type() == at::kHalf ||
           input_.scalar_type() == at::kBFloat16,
       "Unsupported input dtype: ",
       input_.dtype());
 
-  // Ensure contiguous – view() is only well-defined on contiguous tensors
   at::Tensor input = input_.contiguous();
 
-  // -------------------- 2. Derive reduction shape ---------------------------
-  // Equivalent to python _get_reduction_params
   std::vector<int64_t> shape_for_reduction;
-  std::vector<int64_t> reduction_dims; // dims we later collapse to size-1
+  std::vector<int64_t> reduction_dims;
   int64_t cur_dim = 0;
 
   auto in_sizes = input.sizes();
@@ -321,7 +307,7 @@ std::tuple<at::Tensor, at::Tensor> choose_qparams_affine_reference_impl(
     const int64_t dim = in_sizes[i];
 
     if (blk != dim && blk > 1) {
-      TORCH_CHECK(
+      VK_CHECK_COND(
           dim % blk == 0,
           "Input size ",
           dim,
@@ -331,11 +317,11 @@ std::tuple<at::Tensor, at::Tensor> choose_qparams_affine_reference_impl(
           i);
       shape_for_reduction.push_back(dim / blk);
       shape_for_reduction.push_back(blk);
-      reduction_dims.push_back(cur_dim + 1); // the 'inside block' dim
+      reduction_dims.push_back(cur_dim + 1);
       cur_dim += 2;
     } else {
       shape_for_reduction.push_back(dim);
-      if (blk != 1) { // per-axis / per-tensor
+      if (blk != 1) {
         reduction_dims.push_back(cur_dim);
       }
       cur_dim += 1;
@@ -344,19 +330,14 @@ std::tuple<at::Tensor, at::Tensor> choose_qparams_affine_reference_impl(
 
   at::Tensor input_reshaped = input.view(shape_for_reduction);
 
-  // Shape after reduction – same rank as shape_for_reduction but
-  // all 'reduction_dims' set to 1. We'll reshape scale / zp to that.
   std::vector<int64_t> shape_after_reduction = shape_for_reduction;
   for (int64_t d : reduction_dims) {
     shape_after_reduction[d] = 1;
   }
 
-  // -------------------- 3. Find min/max values -----------------------------
-  // Reduce over the specified dimensions to get min/max values
   at::Tensor min_val = input_reshaped.amin(reduction_dims, /*keepdim=*/true);
   at::Tensor max_val = input_reshaped.amax(reduction_dims, /*keepdim=*/true);
 
-  // -------------------- 4. Calculate scale and zero_point ------------------
   at::Tensor scale, zero_point;
 
   if (mapping_type == "ASYMMETRIC") {
@@ -403,15 +384,13 @@ std::tuple<at::Tensor, at::Tensor> choose_qparams_affine_reference_impl(
     zero_point =
         at::full_like(scale, (quant_max + quant_min + 1) / 2, at::kInt);
   } else {
-    TORCH_CHECK(
+    VK_CHECK_COND(
         false,
         "Unsupported mapping_type: ",
         mapping_type,
         ". Expected ASYMMETRIC, SYMMETRIC, or SYMMETRIC_NO_CLIPPING_ERR");
   }
 
-  // -------------------- 5. Reshape back to output shape --------------------
-  // Calculate output shape (remove reduction dimensions)
   std::vector<int64_t> output_shape;
   for (size_t i = 0; i < shape_after_reduction.size(); ++i) {
     if (shape_after_reduction[i] != 1 ||
@@ -1007,7 +986,6 @@ TEST(VulkanDequantizeAffineTest, test_4d_dequantization) {
       at::kFloat); // output dtype
 }
 
-// Test function for choose_qparams_affine
 void test_vulkan_choose_qparams_affine_impl(
     const std::vector<int>& input_sizes,
     const std::vector<int64_t>& block_size,
@@ -1033,7 +1011,6 @@ void test_vulkan_choose_qparams_affine_impl(
   at::Tensor reference_scale = std::get<0>(reference_out);
   at::Tensor reference_zero_point = std::get<1>(reference_out);
 
-  // Ensure zero_point is int32 as expected by the shader
   reference_zero_point = reference_zero_point.to(at::kInt);
 
   using namespace vkcompute;
```
