Quantization increases latency and VRAM usage instead of reducing them #6

@JEONG8652

Description

Thank you for your great work!

I tested the official inference code with quantization enabled, expecting improved inference speed and reduced VRAM usage.
However, I observed the opposite: quantized inference is slower and consumes more VRAM than the baseline.

Model: SDXL-Turbo

Baseline:
Inference time: 0.5241 - 0.9573 seconds per step, 4.1236 seconds total
VRAM usage: ~17.1489 GB

Quantized (w4w8g8):
Inference time: 1.4139 - 1.4310 seconds per step, 7.4306 seconds total
VRAM usage: ~31.4364 GB

BOPs and FLOPs are reduced as expected.

The performance gap is consistent across multiple runs.
Could this be due to the specific quantization method or implementation overhead?
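For anyone trying to reproduce the numbers above, a minimal per-step timing harness along these lines can be used (this is a stdlib-only sketch, not the actual benchmark code: `step_fn` and the `time.sleep` dummy are placeholders for one denoising step, and a real GPU run would additionally need `torch.cuda.synchronize()` before each clock read plus `torch.cuda.max_memory_allocated()` for the VRAM figure):

```python
import time

def benchmark_steps(step_fn, n_steps=4, n_warmup=1):
    """Time each denoising step individually and return (per-step times, total).

    step_fn is a stand-in for one UNet forward pass. On a real CUDA run,
    call torch.cuda.synchronize() before reading the clock, because CUDA
    kernels launch asynchronously and the Python call returns early.
    """
    for _ in range(n_warmup):          # warm-up pass to exclude compile/cache cost
        step_fn()
    per_step = []
    for _ in range(n_steps):
        t0 = time.perf_counter()
        step_fn()
        per_step.append(time.perf_counter() - t0)
    return per_step, sum(per_step)

# Dummy step for illustration; replace with the real pipeline call.
per_step, total = benchmark_steps(lambda: time.sleep(0.01), n_steps=4)
print(f"{min(per_step):.4f} - {max(per_step):.4f} s/step, {total:.4f} s total")
```

Measuring per step rather than per image makes it easier to see whether the overhead is constant (e.g. weight repacking once per call) or scales with the number of steps.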
