Commit 34c41db
morelos
Update on "[ET-VK][Ops] linear_qta8a_qga4w_qta8o test framework"
# Context
This test framework establishes the foundation for validating the `linear_qta8a_qga4w_qta8o` operator implementation as part of enabling dynamic quantization. The motivation is to move beyond weight-only quantization to linear operations in which both activations and weights are quantized, so that the matrix multiplication can run in integer arithmetic throughout for improved performance on GPU hardware.
The current weight-only quantized linear implementations in ET-VK dequantize weights to floating point before computation, missing the performance benefits of integer arithmetic.
The operator name breaks down as follows:
- **qta8a**: Quantized per-token affine 8-bit activation inputs
- **qga4w**: Quantized per-group affine 4-bit weights
- **qta8o**: Quantized per-token affine 8-bit outputs
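To make the per-group weight scheme concrete, here is a minimal sketch in plain Python of dequantizing 4-bit weights with one scale and zero point per group. The function name `dequantize_per_group`, the `[N][K]` weight layout, grouping along the input-feature dimension, and unsigned codes in `[0, 15]` are all assumptions made for illustration; the actual ET-VK tensor layout and packing may differ.

```python
def dequantize_per_group(q_weights, scales, zero_points, group_size):
    """Dequantize 4-bit weight codes with per-group affine parameters.

    q_weights:   [N][K] integer codes, assumed unsigned in [0, 15]
    scales:      [N][K // group_size] floats (hypothetical layout)
    zero_points: [N][K // group_size] ints   (hypothetical layout)
    """
    out = []
    for n, row in enumerate(q_weights):
        out_row = []
        for k, q in enumerate(row):
            g = k // group_size  # group index along the input-feature axis
            out_row.append((q - zero_points[n][g]) * scales[n][g])
        out.append(out_row)
    return out


# Two output features, four input features, groups of two columns.
q_w = [[0, 15, 8, 8],
       [7, 9, 1, 14]]
scales = [[0.5, 0.25],
          [1.0, 2.0]]
zps = [[8, 8],
       [8, 8]]

w = dequantize_per_group(q_w, scales, zps, group_size=2)
print(w)
```

Each row of `w` uses a different scale for its first and second pair of columns, which is what distinguishes per-group quantization from per-tensor or per-channel schemes.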
# Changes
The reference implementation (`linear_qta8a_qga4w_qta8o_4bit_dequant_impl`) provides a baseline for validating the GPU shader implementation through a deliberately simplified computation path. The quantized int8 input tensor is dequantized using the standard affine transformation `(quantized_input.to(at::kFloat) - input_zero_point) * input_scale`. After dequantization, the implementation performs a standard floating-point linear operation, `at::linear(x_float, weights_dequantized)`, then manually quantizes the result using `at::round(linear_result / output_scale) + output_zero_point`, clamped to the int8 range [-128, 127]. This three-stage approach of dequantize → compute → quantize provides a clear reference against which the GPU's integer arithmetic implementation can be validated.
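The dequantize → compute → quantize path described above can be sketched in plain Python as follows. This is an illustrative sketch, not the C++ reference implementation: the helper names are hypothetical, the weights are assumed to be already dequantized to floats, and per-token (per-row) scales and zero points are assumed for both input and output, following the `qta8a` / `qta8o` naming.

```python
def dequantize_per_token(q_rows, scales, zero_points):
    # (q - zero_point) * scale, with one (scale, zero_point) pair per row
    return [[(q - zp) * s for q in row]
            for row, s, zp in zip(q_rows, scales, zero_points)]


def linear(x, w):
    # x: [M][K], w: [N][K] -> [M][N]; a bias-free linear layer
    return [[sum(a * b for a, b in zip(x_row, w_row)) for w_row in w]
            for x_row in x]


def quantize_per_token(rows, scales, zero_points, qmin=-128, qmax=127):
    # round(v / scale) + zero_point, clamped to the int8 range.
    # Python's round() rounds halves to even.
    return [[max(qmin, min(qmax, round(v / s) + zp)) for v in row]
            for row, s, zp in zip(rows, scales, zero_points)]


# Illustrative values only; not taken from the test suite.
q_x = [[10, -6],
       [127, 0]]
in_scales, in_zps = [0.5, 0.25], [2, -1]
w_float = [[1.0, 0.5],
           [-2.0, 3.0]]          # assumed already dequantized
out_scales, out_zps = [0.5, 0.25], [0, 10]

x_float = dequantize_per_token(q_x, in_scales, in_zps)
y_float = linear(x_float, w_float)
q_out = quantize_per_token(y_float, out_scales, out_zps)
print(q_out)  # [[4, -40], [127, -128]]
```

The second row saturates at both ends of the int8 range, which shows why the clamping step matters when comparing against an integer-arithmetic implementation.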
Differential Revision: [D77173442](https://our.internmc.facebook.com/intern/diff/D77173442/)
[ghstack-poisoned]
3 files changed (+173, -258) in backends/vulkan/test/op_tests
One file was deleted (0 additions, 251 deletions).