As for the NPU, its scalar core has only a few threads/blocks (24), whereas a GPU's SIMT model runs 22k or more threads. Will that be a limitation for de-quantization?