
[Benchmark] Add quantization quality benchmark script (LPIPS)#1575

Open
lishunyang12 wants to merge 1 commit into vllm-project:main from lishunyang12:bench/quantization-quality

Conversation

Contributor

@lishunyang12 lishunyang12 commented Mar 1, 2026

Purpose

We have several open quantization PRs (#1528, #1470, #1338, #1412, #1413, #1414) and more coming. Currently, contributors evaluate quality with visual side-by-side comparisons but no quantitative perceptual metric. This makes it hard to objectively assess quality regressions.

This PR adds a reusable benchmark script that computes LPIPS (Learned Perceptual Image Patch Similarity) between BF16 baseline and quantized outputs, covering both image and video generation. Contributors can run it and paste the Markdown table directly into their PR description.

Current state of quality evaluation across open quantization PRs

| PR | Method | Quality Evidence |
|---|---|---|
| #1528 | BitsAndBytes 4-bit (BNB NF4) | 2 images, no metric |
| #1470 | INT8 DiT (W8A8) | 2 images, no metric |
| #1338 | FP8 text encoder | Memory only, no images |
| #1412 | FP8 Wan 2.2 | Videos side-by-side, no metric |
| #1414 | FP8 VAE (FP8 weight storage) | No quality evidence |
| #1413 | FP8 KV quant | Roundtrip unit test only |

Example: what this script produces

Using this script on Qwen-Image-2512 FP8 (from PR #1034):

| Config | Avg Time | Speedup | Memory (GiB) | Mem Reduction | Mean LPIPS |
|---|---|---|---|---|---|
| BF16 baseline | 6.58s | - | 53.74 | - | (ref) |
| FP8 all layers | 5.14s | 22% | 41.20 | 23% | 0.5614 |
| FP8 skip img_mlp | 6.02s | 9% | 45.43 | 15% | 0.0086 |

As a rule of thumb: LPIPS < 0.01 is imperceptible; LPIPS > 0.1 is clearly noticeable.
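
For reference, the metric itself is cheap to compute with the `lpips` package. The sketch below shows the call pattern (helper names are illustrative, not this script's actual internals); LPIPS expects float tensors of shape (N, 3, H, W) scaled to [-1, 1]:

```python
def to_lpips_range(x):
    """Map a uint8 pixel value in [0, 255] to the [-1, 1] range LPIPS expects."""
    return x / 127.5 - 1.0


def compute_lpips(img_ref, img_quant, net="alex"):
    """Mean perceptual distance between two (N, 3, H, W) tensors in [-1, 1].

    Imports are deferred so the pure helper above works without torch/lpips.
    """
    import lpips  # pip install lpips; downloads pretrained weights on first use

    loss_fn = lpips.LPIPS(net=net)
    return loss_fn(img_ref, img_quant).mean().item()
```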

Usage

Text-to-Image

python benchmarks/diffusion/quantization_quality.py \
    --model Qwen/Qwen-Image-2512 \
    --task t2i \
    --quantization fp8 \
    --ignored-layers "img_mlp" \
    --prompts \
        "an aerial view of a coral reef with crystal clear turquoise water" \
        "a campfire in a dark forest with sparks rising into a starry sky" \
        "a gourmet dessert plate with chocolate mousse and gold leaf" \
    --height 1024 --width 1024 \
    --num-inference-steps 50 --seed 42

Text-to-Video

python benchmarks/diffusion/quantization_quality.py \
    --model Wan-AI/Wan2.2-T2V-A14B-Diffusers \
    --task t2v \
    --quantization fp8 \
    --prompts \
        "A serene lakeside sunrise with mist over the water" \
        "A cat walking across a wooden bridge in autumn" \
    --height 720 --width 1280 \
    --num-frames 81 --num-inference-steps 40 --seed 42

Ablation (test which layers are FP8-sensitive)

python benchmarks/diffusion/quantization_quality.py \
    --model Qwen/Qwen-Image-2512 \
    --task t2i \
    --quantization fp8 \
    --ablation-layers "img_mlp" "txt_mlp" "img_mlp,txt_mlp" \
    --prompts "a cup of coffee on the table" \
    --height 1024 --width 1024 \
    --num-inference-steps 50 --seed 42
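
Each `--ablation-layers` argument appears to be one configuration, with commas combining layers into a single run. A hypothetical sketch of that expansion (the script's actual parsing may differ):

```python
def expand_ablation(args):
    """Each CLI argument becomes one run; commas combine layers within a run."""
    return [tuple(arg.split(",")) for arg in args]


configs = expand_ablation(["img_mlp", "txt_mlp", "img_mlp,txt_mlp"])
# three runs, each compared against the BF16 baseline:
# ('img_mlp',), ('txt_mlp',), ('img_mlp', 'txt_mlp')
```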

What it outputs

  • quant_bench_output/baseline/ — BF16 reference images/videos
  • quant_bench_output/<config>/ — Quantized outputs
  • quant_bench_output/results.md — Markdown table ready to paste into PR
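
A results.md table like the example above can be assembled with plain string formatting; a minimal sketch (column names taken from the example table, helper name hypothetical):

```python
def render_results_md(rows):
    """rows: list of dicts with config, avg_time_s, speedup, mem_gib, mem_red, lpips."""
    header = "| Config | Avg Time | Speedup | Memory (GiB) | Mem Reduction | Mean LPIPS |"
    sep = "|---|---|---|---|---|---|"
    lines = [header, sep]
    for r in rows:
        lines.append(
            "| {config} | {avg_time_s:.2f}s | {speedup} | {mem_gib:.2f} "
            "| {mem_red} | {lpips} |".format(**r)
        )
    return "\n".join(lines)


rows = [
    {"config": "BF16 baseline", "avg_time_s": 6.58, "speedup": "-",
     "mem_gib": 53.74, "mem_red": "-", "lpips": "(ref)"},
    {"config": "FP8 all layers", "avg_time_s": 5.14, "speedup": "22%",
     "mem_gib": 41.20, "mem_red": "23%", "lpips": "0.5614"},
]
table = render_results_md(rows)
```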

Requirements

pip install lpips

Test plan

  • Run with --task t2i on Z-Image with --quantization fp8
  • Run with --task t2v on Wan2.2 with --quantization fp8
  • Run ablation mode with --ablation-layers
  • Verify results.md output is valid Markdown

Signed-off-by: lishunyang <lishunyang12@163.com>

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: a2cd05846d



```python
def _unload_omni(omni):
    """Delete Omni instance and free GPU memory."""
    del omni
```


P1 Badge Drop caller references when unloading Omni instances

_unload_omni only executes del omni on its local parameter, so the caller’s variables (for example omni_bl and the previous omni_qt) still keep the engine alive; in this benchmark that means the BF16 model can remain resident while quantized models are loaded, which inflates memory usage and can cause OOM during multi-config ablations on large models. This directly skews both runtime/memory comparisons and can prevent the script from completing in constrained GPU environments.


```python
        ),
    )
    elapsed = time.perf_counter() - start
    peak_mem = torch.cuda.max_memory_allocated() / (1024**3)
```


P1 Badge Measure GPU memory from the generation worker

The benchmark records memory with torch.cuda.max_memory_allocated() in the client process, but Omni generation runs in stage worker processes by default, so this value does not represent model execution memory and can be near-zero/noisy; as a result, the reported Memory (GiB) and Mem Reduction columns are not reliable for the default execution mode. This undermines the script’s main purpose of quantitatively comparing quantization memory impact.


```python
    inner = first.request_output[0]
    if isinstance(inner, OmniRequestOutput) and hasattr(inner, "images"):
        frames = inner.images[0] if inner.images else None
    else:
```
Collaborator


This function doesn't actually free memory because Python's del only removes the local reference. The caller's variable (e.g., omni_bl at line 243) still holds a reference, so the BF16 model stays loaded while quantized models are loaded. This causes OOM during multi-config ablations on large models.

Fix: return None from this function and assign the result back (omni_bl = _unload_omni(omni_bl)), or restructure it as a context manager.
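
The point about del is easy to demonstrate without any GPU. In the sketch below, Engine stands in for a loaded Omni instance (in the real script you would also call torch.cuda.empty_cache() after dropping the last reference):

```python
import gc
import weakref


class Engine:
    """Stand-in for a loaded Omni instance."""


def unload_broken(engine):
    del engine  # only drops the local name; the caller still holds a reference


def unload_fixed(engine):
    del engine
    gc.collect()  # in the real script, also torch.cuda.empty_cache()
    return None   # caller assigns this back: engine = unload_fixed(engine)


engine = Engine()
alive = weakref.ref(engine)

unload_broken(engine)
assert alive() is not None  # still alive: caller's name keeps it referenced

engine = unload_fixed(engine)  # rebinding drops the last reference
assert alive() is None         # now truly freed
```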

```python
    torch.cuda.reset_peak_memory_stats()
    start = time.perf_counter()
    outputs = omni.generate(
        {"prompt": prompt},
```
Collaborator


Measurement accuracy issue: torch.cuda.max_memory_allocated() measures memory in the client process, but Omni generation runs in stage worker processes by default. The reported Memory (GiB) values may be near-zero or noisy, not reflecting actual model execution memory. This undermines the script's purpose of comparing memory impact.

Consider: (1) documenting that --enforce-eager is required for accurate memory measurement, or (2) querying worker process memory via Omni's internal APIs if available.
