
[Benchmark] Add quantization quality benchmark script (LPIPS)#1575

Open
lishunyang12 wants to merge 1 commit into vllm-project:main from lishunyang12:bench/quantization-quality

Conversation

Contributor

@lishunyang12 lishunyang12 commented Mar 1, 2026

Purpose

We have several open quantization PRs (#1528, #1470, #1338, #1412, #1413, #1414) and more coming. Currently, contributors evaluate quality with visual side-by-side comparisons but no quantitative perceptual metric. This makes it hard to objectively assess quality regressions.

This PR adds a reusable benchmark script that computes LPIPS (Learned Perceptual Image Patch Similarity) between BF16 baseline and quantized outputs, covering both image and video generation. Contributors can run it and paste the Markdown table directly into their PR description.

Current state of quality evaluation across open quantization PRs

| PR | Method | Quality Evidence |
|---|---|---|
| #1528 | BitsAndBytes 4-bit (BNB NF4) | 2 images, no metric |
| #1470 | INT8 DiT (W8A8) | 2 images, no metric |
| #1338 | FP8 text encoder | Memory only, no images |
| #1412 | FP8 Wan 2.2 | Videos side-by-side, no metric |
| #1414 | FP8 VAE (FP8 weight storage) | No quality evidence |
| #1413 | FP8 KV quant | Roundtrip unit test only |

Example: what this script produces

Using this script on Qwen-Image-2512 FP8 (from PR #1034):

| Config | Avg Time | Speedup | Memory (GiB) | Mem Reduction | Mean LPIPS |
|---|---|---|---|---|---|
| BF16 baseline | 6.58s | - | 53.74 | - | (ref) |
| FP8 all layers | 5.14s | 22% | 41.20 | 23% | 0.5614 |
| FP8 skip img_mlp | 6.02s | 9% | 45.43 | 15% | 0.0086 |

As a rule of thumb: LPIPS < 0.01 is imperceptible; LPIPS > 0.1 is clearly noticeable.
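
For reference, the metric itself is cheap to compute with the `lpips` package. The sketch below shows the call pattern (helper names are illustrative, not this script's actual internals); LPIPS expects float tensors of shape (N, 3, H, W) scaled to [-1, 1]:

```python
def to_lpips_range(x):
    """Map a uint8 pixel value in [0, 255] to the [-1, 1] range LPIPS expects."""
    return x / 127.5 - 1.0


def compute_lpips(img_ref, img_quant, net="alex"):
    """Mean perceptual distance between two (N, 3, H, W) tensors in [-1, 1].

    Imports are deferred so the pure helper above works without torch/lpips.
    """
    import lpips  # pip install lpips; downloads pretrained weights on first use

    loss_fn = lpips.LPIPS(net=net)
    return loss_fn(img_ref, img_quant).mean().item()
```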

Usage

Text-to-Image

python benchmarks/diffusion/quantization_quality.py \
    --model Qwen/Qwen-Image-2512 \
    --task t2i \
    --quantization fp8 \
    --ignored-layers "img_mlp" \
    --prompts \
        "an aerial view of a coral reef with crystal clear turquoise water" \
        "a campfire in a dark forest with sparks rising into a starry sky" \
        "a gourmet dessert plate with chocolate mousse and gold leaf" \
    --height 1024 --width 1024 \
    --num-inference-steps 50 --seed 42

Text-to-Video

python benchmarks/diffusion/quantization_quality.py \
    --model Wan-AI/Wan2.2-T2V-A14B-Diffusers \
    --task t2v \
    --quantization fp8 \
    --prompts \
        "A serene lakeside sunrise with mist over the water" \
        "A cat walking across a wooden bridge in autumn" \
    --height 720 --width 1280 \
    --num-frames 81 --num-inference-steps 40 --seed 42

Ablation (test which layers are FP8-sensitive)

python benchmarks/diffusion/quantization_quality.py \
    --model Qwen/Qwen-Image-2512 \
    --task t2i \
    --quantization fp8 \
    --ablation-layers "img_mlp" "txt_mlp" "img_mlp,txt_mlp" \
    --prompts "a cup of coffee on the table" \
    --height 1024 --width 1024 \
    --num-inference-steps 50 --seed 42
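
Each `--ablation-layers` argument appears to be one configuration, with commas combining layers into a single run. A hypothetical sketch of that expansion (the script's actual parsing may differ):

```python
def expand_ablation(args):
    """Each CLI argument becomes one run; commas combine layers within a run."""
    return [tuple(arg.split(",")) for arg in args]


configs = expand_ablation(["img_mlp", "txt_mlp", "img_mlp,txt_mlp"])
# three runs, each compared against the BF16 baseline:
# ('img_mlp',), ('txt_mlp',), ('img_mlp', 'txt_mlp')
```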

What it outputs

  • quant_bench_output/baseline/ — BF16 reference images/videos
  • quant_bench_output/<config>/ — Quantized outputs
  • quant_bench_output/results.md — Markdown table ready to paste into PR
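
A results.md table like the example above can be assembled with plain string formatting; a minimal sketch (column names taken from the example table, helper name hypothetical):

```python
def render_results_md(rows):
    """rows: list of dicts with config, avg_time_s, speedup, mem_gib, mem_red, lpips."""
    header = "| Config | Avg Time | Speedup | Memory (GiB) | Mem Reduction | Mean LPIPS |"
    sep = "|---|---|---|---|---|---|"
    lines = [header, sep]
    for r in rows:
        lines.append(
            "| {config} | {avg_time_s:.2f}s | {speedup} | {mem_gib:.2f} "
            "| {mem_red} | {lpips} |".format(**r)
        )
    return "\n".join(lines)


rows = [
    {"config": "BF16 baseline", "avg_time_s": 6.58, "speedup": "-",
     "mem_gib": 53.74, "mem_red": "-", "lpips": "(ref)"},
    {"config": "FP8 all layers", "avg_time_s": 5.14, "speedup": "22%",
     "mem_gib": 41.20, "mem_red": "23%", "lpips": "0.5614"},
]
table = render_results_md(rows)
```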

Requirements

pip install lpips

Test plan

  • Run with --task t2i on Z-Image with --quantization fp8
  • Run with --task t2v on Wan2.2 with --quantization fp8
  • Run ablation mode with --ablation-layers
  • Verify results.md output is valid Markdown

Signed-off-by: lishunyang <lishunyang12@163.com>

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: a2cd05846d



```python
def _unload_omni(omni):
    """Delete Omni instance and free GPU memory."""
    del omni
```


P1 Badge Drop caller references when unloading Omni instances

_unload_omni only executes del omni on its local parameter, so the caller’s variables (for example omni_bl and the previous omni_qt) still keep the engine alive; in this benchmark that means the BF16 model can remain resident while quantized models are loaded, which inflates memory usage and can cause OOM during multi-config ablations on large models. This directly skews both runtime/memory comparisons and can prevent the script from completing in constrained GPU environments.


```python
        ),
    )
    elapsed = time.perf_counter() - start
    peak_mem = torch.cuda.max_memory_allocated() / (1024**3)
```


P1 Badge Measure GPU memory from the generation worker

The benchmark records memory with torch.cuda.max_memory_allocated() in the client process, but Omni generation runs in stage worker processes by default, so this value does not represent model execution memory and can be near-zero/noisy; as a result, the reported Memory (GiB) and Mem Reduction columns are not reliable for the default execution mode. This undermines the script’s main purpose of quantitatively comparing quantization memory impact.


```python
    inner = first.request_output[0]
    if isinstance(inner, OmniRequestOutput) and hasattr(inner, "images"):
        frames = inner.images[0] if inner.images else None
    else:
```
Collaborator


This function doesn't actually free memory because Python's del only removes the local reference. The caller's variable (e.g., omni_bl at line 243) still holds a reference, so the BF16 model stays loaded while quantized models are loaded. This causes OOM during multi-config ablations on large models.

Fix: return None from this function and assign the result back (omni_bl = _unload_omni(omni_bl)), or restructure it as a context manager.
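
The point about del is easy to demonstrate without any GPU. In the sketch below, Engine stands in for a loaded Omni instance (in the real script you would also call torch.cuda.empty_cache() after dropping the last reference):

```python
import gc
import weakref


class Engine:
    """Stand-in for a loaded Omni instance."""


def unload_broken(engine):
    del engine  # only drops the local name; the caller still holds a reference


def unload_fixed(engine):
    del engine
    gc.collect()  # in the real script, also torch.cuda.empty_cache()
    return None   # caller assigns this back: engine = unload_fixed(engine)


engine = Engine()
alive = weakref.ref(engine)

unload_broken(engine)
assert alive() is not None  # still alive: caller's name keeps it referenced

engine = unload_fixed(engine)  # rebinding drops the last reference
assert alive() is None         # now truly freed
```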

```python
    torch.cuda.reset_peak_memory_stats()
    start = time.perf_counter()
    outputs = omni.generate(
        {"prompt": prompt},
```
Collaborator


Measurement accuracy issue: torch.cuda.max_memory_allocated() measures memory in the client process, but Omni generation runs in stage worker processes by default. The reported Memory (GiB) values may be near-zero or noisy, not reflecting actual model execution memory. This undermines the script's purpose of comparing memory impact.

Consider: (1) documenting that --enforce-eager is required for accurate memory measurement, or (2) querying worker process memory via Omni's internal APIs if available.
