Thank you for your great work!
I tested the official inference code with quantization enabled, expecting faster inference and lower VRAM usage.
However, I observed the opposite: with quantization, inference is slower and VRAM usage is higher than the baseline.
Model: SDXL-Turbo
Baseline:
- Inference time: 0.5241–0.9573 seconds per step, 4.1236 seconds total
- VRAM usage: ~17.1489 GB

Quantized (w4w8g8):
- Inference time: 1.4139–1.4310 seconds per step, 7.4306 seconds total
- VRAM usage: ~31.4364 GB
BOPs and FLOPs are reduced as expected.
The performance gap is consistent across multiple runs.
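For context, per-step latency was measured roughly as sketched below (a minimal illustration, not the exact benchmarking code; `step_fn` is a hypothetical stand-in for the UNet forward call, and on GPU each timestamp should be preceded by `torch.cuda.synchronize()` so asynchronous CUDA kernels are fully accounted for):

```python
import time

def time_steps(step_fn, num_steps):
    """Time each denoising step; returns (per-step times, total).

    step_fn: callable performing one denoising step (stand-in here).
    On GPU, call torch.cuda.synchronize() before each perf_counter()
    read, and track peak VRAM via torch.cuda.max_memory_allocated().
    """
    per_step = []
    for _ in range(num_steps):
        t0 = time.perf_counter()
        step_fn()
        per_step.append(time.perf_counter() - t0)
    return per_step, sum(per_step)

# Dummy step just to show the call shape; replace with the real
# pipeline's per-step UNet invocation.
times, total = time_steps(lambda: None, 4)
```

The per-step and total numbers above were collected this way, warmed up and averaged over multiple runs.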
Is this behavior expected? Could it be caused by the specific quantization method, or by implementation overhead?