# Changes
* Introduce `linear_qta8a_qga4w` custom operator in `custom_ops_lib.py` to handle dynamic activation + grouped weight quantized linear operations
* Add pattern matching and fusion logic in `FuseQuantizedOpsTransform` to detect and replace dequant + dequant + linear sequences with the new fused operator
* Add tests in `test_vulkan_passes.py` that validate the QTA8A_QGA4W fusion pattern
* Add 4-bit weight packing utilities and grouped quantization support for efficient memory usage
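To illustrate what 4-bit weight packing involves, here is a minimal pure-Python sketch. It is not the utility added in this PR: the helper names `pack_int4` and `unpack_int4`, the signed range [-8, 7], and the low-nibble-first byte layout are all assumptions made for illustration, and the actual layout used by the Vulkan backend may differ.

```python
def pack_int4(values):
    """Pack signed 4-bit integers two per byte.

    Assumed layout (hypothetical): the first value of each pair goes in the
    low nibble, the second in the high nibble.
    """
    if len(values) % 2 != 0:
        raise ValueError("need an even number of values to pack")
    packed = bytearray()
    for lo, hi in zip(values[0::2], values[1::2]):
        for v in (lo, hi):
            if not -8 <= v <= 7:
                raise ValueError(f"{v} does not fit in a signed 4-bit integer")
        # Masking with 0xF keeps the two's-complement low 4 bits.
        packed.append(((hi & 0xF) << 4) | (lo & 0xF))
    return bytes(packed)


def unpack_int4(packed):
    """Inverse of pack_int4: recover the signed 4-bit values."""
    out = []
    for byte in packed:
        for nibble in (byte & 0xF, byte >> 4):
            # Nibbles 8..15 represent the negative values -8..-1.
            out.append(nibble - 16 if nibble >= 8 else nibble)
    return out
```

Packing halves the storage needed for the weights compared with one value per byte, at the cost of a shift and mask when each value is read.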
# Motivation
The existing quantization workflow in the Vulkan backend executes dynamic activation + grouped weight quantized linear operations as separate quantize, dequantize, and linear steps. This adds overhead in three ways:
* Multiple kernel dispatches instead of a single fused operation
* Intermediate tensor allocations for dequantized weights and activations
* Suboptimal memory bandwidth utilization
The new `linear_qta8a_qga4w` operator fuses the entire sequence into a single operation that:
* Directly processes 8-bit quantized activations with per-token scales/zero-points
* Handles 4-bit grouped quantized weights with configurable group sizes
* Eliminates intermediate dequantization steps by performing dequantization inline
* Reduces memory footprint through packed 4-bit weight storage
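The arithmetic that such a fused operator has to reproduce can be sketched as a slow reference in pure Python. This is an illustration under stated assumptions, not the signature or implementation of `linear_qta8a_qga4w`: the function name `fused_linear_reference`, the argument order, the nested-list tensor layout, and the use of already-unpacked 4-bit weight values are all hypothetical.

```python
def fused_linear_reference(x_q, x_scales, x_zps, w_q, w_scales, w_zps, group_size):
    """Reference for a linear layer with inline dequantization.

    Assumed shapes (hypothetical, nested Python lists):
      x_q:      [tokens][in_features]   8-bit quantized activations
      x_scales: [tokens]                per-token scale
      x_zps:    [tokens]                per-token zero-point
      w_q:      [out_features][in_features]  4-bit weight values (unpacked)
      w_scales: [out_features][in_features // group_size]
      w_zps:    [out_features][in_features // group_size]
    """
    out = []
    for t, x_row in enumerate(x_q):
        out_row = []
        for o, w_row in enumerate(w_q):
            acc = 0.0
            for k, (xq, wq) in enumerate(zip(x_row, w_row)):
                g = k // group_size  # quantization group of this input feature
                # Dequantize both operands inline; no intermediate tensors.
                x = (xq - x_zps[t]) * x_scales[t]
                w = (wq - w_zps[o][g]) * w_scales[o][g]
                acc += x * w
            out_row.append(acc)
        out.append(out_row)
    return out
```

A usage example with one token, one output feature, and a group size of 2:

```python
y = fused_linear_reference(
    x_q=[[2, 4, 6, 8]], x_scales=[0.5], x_zps=[0],
    w_q=[[1, 2, 3, 4]], w_scales=[[1.0, 0.5]], w_zps=[[0, 0]],
    group_size=2,
)
```

A reference like this is mainly useful for checking that a fused kernel matches the unfused dequant + dequant + linear sequence numerically.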
This fusion supports the broader goal of optimizing quantized model inference in the Vulkan backend: graph-level transformations reduce dispatch and memory overhead while maintaining numerical accuracy.
Differential Revision: [D78291269](https://our.internmc.facebook.com/intern/diff/D78291269/)
[ghstack-poisoned]