Qualcomm AI Engine Direct - Observer Fix and remove unused passes (#6225)
Summary:
- `ConvertToLinear()` is redundant in `qnn_preprocess.py` since this pass is already called in `executorch/backends/qualcomm/utils/utils.py`
- Some models are experiencing a significant drop in accuracy, with a few models reaching 0% accuracy. We add new conditions to perform requantization and change `ptq_per_channel_quant_config`'s IO observer from `MinMaxObserver` to `MovingAverageMinMaxObserver` to resolve the issue.
1. Why add new conditions to perform requantization? We noticed this change in the PyTorch PR (pytorch/pytorch@b8eef50#diff-976c3b0c6f85048d3db01a0c394ce8eb16e2f7541f0983d0f4ef549baa4be822L152). Before that PR, the quantization spec check only compared `dtype` and `is_dynamic` to decide whether two qspecs were the same. After the change, it also compares attributes such as `scale` and `zero_point`, which leaves some nodes with an extra pair of QDQ nodes. As shown in the image below, there are two pairs of QDQ nodes after the PyTorch PR, and the two pairs have different scale and offset. During the QNN lowering process, a node only keeps the quant info of the node immediately following its output. For example, the `cat` op below uses `quantize_per_tensor_default_18`'s scale and offset as its quant attribute, and all other quant and dequant nodes are ignored.
This causes an accuracy drop, but by inserting a requantize node we see an accuracy improvement for most models. Taking inceptionv3 as an example, the average top-1 accuracy improves from 0% to ~75%. I have checked a couple of other models and their accuracy either stays the same or improves.
I have also provided an option for users to skip this requant optimization if they prefer not to use it. A minimal sketch of the requantization idea is shown below.
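The sketch below is illustrative only and is not the PR's actual pass (the real change rewrites the exported FX graph inside the Qualcomm quantizer); the dict-based qparams and the helper names `needs_requant` and `requantize` are hypothetical. It shows the idea: when a producer's and consumer's quant params differ, dequantize with the producer's params and re-quantize with the consumer's instead of silently keeping only one QDQ pair.

```python
import torch

def needs_requant(src: dict, dst: dict) -> bool:
    # After the PyTorch PR referenced above, qspec equality also compares
    # scale/zero_point, so a producer and a consumer can legitimately carry
    # different quant params and leave two QDQ pairs in the graph.
    return src["scale"] != dst["scale"] or src["zero_point"] != dst["zero_point"]

def requantize(x_int: torch.Tensor, src: dict, dst: dict) -> torch.Tensor:
    # Dequantize with the producer's params, then re-quantize with the
    # consumer's params.
    x_fp = (x_int.to(torch.float32) - src["zero_point"]) * src["scale"]
    q = torch.round(x_fp / dst["scale"]) + dst["zero_point"]
    return q.clamp(dst["quant_min"], dst["quant_max"]).to(x_int.dtype)

# Example usage with made-up quant params:
src = {"scale": 0.02, "zero_point": 128, "quant_min": 0, "quant_max": 255}
dst = {"scale": 0.05, "zero_point": 120, "quant_min": 0, "quant_max": 255}
x = torch.randint(0, 256, (4,), dtype=torch.uint8)
if needs_requant(src, dst):
    x = requantize(x, src, dst)
```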
**Before:**

___
**After:**

2. Why change `ptq_per_channel_quant_config`'s IO observer from `MinMaxObserver` to `MovingAverageMinMaxObserver`?
After the change above, there appears to be a drop in inference speed due to requantization. By switching to `MovingAverageMinMaxObserver`, I observed an improvement in inference speed for some models, such as inceptionv3.
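As a rough illustration, the snippet below sketches what swapping the activation (IO) observer looks like with PyTorch's pt2e quantization APIs. It is not copied from the ExecuTorch Qualcomm quantizer config; the dtypes, quant ranges, and `averaging_constant` value are assumptions for the example.

```python
import torch
from torch.ao.quantization.observer import (
    MovingAverageMinMaxObserver,
    PerChannelMinMaxObserver,
)
from torch.ao.quantization.quantizer import QuantizationSpec

# Activation (IO) spec: previously built with MinMaxObserver.
# MovingAverageMinMaxObserver smooths the running min/max with an EMA,
# which can make the observed ranges (and thus scales) more stable.
act_quantization_spec = QuantizationSpec(
    dtype=torch.uint8,
    quant_min=0,
    quant_max=255,
    qscheme=torch.per_tensor_affine,
    observer_or_fake_quant_ctr=MovingAverageMinMaxObserver.with_args(
        averaging_constant=0.01  # assumed value for illustration
    ),
)

# Weight spec stays per-channel, unchanged by this part of the fix.
weight_quantization_spec = QuantizationSpec(
    dtype=torch.int8,
    quant_min=-127,
    quant_max=127,
    qscheme=torch.per_channel_symmetric,
    ch_axis=0,
    observer_or_fake_quant_ctr=PerChannelMinMaxObserver.with_args(),
)
```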
Pull Request resolved: #6225
Reviewed By: kirklandsign
Differential Revision: D64413835
Pulled By: cccclai
fbshipit-source-id: a8be66b034c69ff403f9f2985f2b584695f3798b