Could you provide inference-time data for Stable Diffusion (W4A4 vs. full-precision FP32) on GPU/CPU?

The paper says that W4A4 quantization can theoretically produce an 8x inference speedup. Could you please confirm whether this holds for SD, or share what inference-latency speedup you actually observed? Thanks.
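To make the request concrete, here is a minimal sketch of the kind of comparison I have in mind. It is not taken from the paper or the repository: `measure_latency`, `speedup`, and the two placeholder workloads are hypothetical names of my own, and the real FP32 and W4A4 Stable Diffusion calls would replace the placeholders.

```python
import statistics
import time


def measure_latency(fn, warmup=2, repeats=5):
    """Return the median wall-clock time of fn() in seconds.

    Hypothetical helper: fn is any zero-argument callable that runs one
    full inference. On GPU, fn should synchronize the device before
    returning (e.g. torch.cuda.synchronize() in PyTorch), otherwise only
    the kernel launch time is measured.
    """
    for _ in range(warmup):
        fn()  # warm-up runs are discarded
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)


def speedup(baseline_seconds, quantized_seconds):
    """Ratio of baseline latency to quantized latency (higher is better)."""
    return baseline_seconds / quantized_seconds


if __name__ == "__main__":
    # Placeholder workloads standing in for the FP32 and W4A4 pipelines.
    def fp32_run():
        sum(i * i for i in range(200_000))

    def w4a4_run():
        sum(i * i for i in range(50_000))

    t_fp32 = measure_latency(fp32_run)
    t_w4a4 = measure_latency(w4a4_run)
    print(f"FP32: {t_fp32:.4f} s, W4A4: {t_w4a4:.4f} s, "
          f"speedup: {speedup(t_fp32, t_w4a4):.2f}x")
```

Numbers in this form (median latency per image for each precision, plus the resulting ratio, on a stated GPU and CPU) would be enough to compare against the theoretical 8x.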