[Benchmark] Add quantization quality benchmark script (LPIPS)#1575
lishunyang12 wants to merge 1 commit into vllm-project:main
Conversation
Signed-off-by: lishunyang <lishunyang12@163.com>
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: a2cd05846d
```python
def _unload_omni(omni):
    """Delete Omni instance and free GPU memory."""
    del omni
```
Drop caller references when unloading Omni instances
_unload_omni only executes del omni on its local parameter, so the caller’s variables (for example omni_bl and the previous omni_qt) still keep the engine alive; in this benchmark that means the BF16 model can remain resident while quantized models are loaded, which inflates memory usage and can cause OOM during multi-config ablations on large models. This directly skews both runtime/memory comparisons and can prevent the script from completing in constrained GPU environments.
```python
    ),
)
elapsed = time.perf_counter() - start
peak_mem = torch.cuda.max_memory_allocated() / (1024**3)
```
Measure GPU memory from the generation worker
The benchmark records memory with torch.cuda.max_memory_allocated() in the client process, but Omni generation runs in stage worker processes by default, so this value does not represent model execution memory and can be near-zero/noisy; as a result, the reported Memory (GiB) and Mem Reduction columns are not reliable for the default execution mode. This undermines the script’s main purpose of quantitatively comparing quantization memory impact.
```python
inner = first.request_output[0]
if isinstance(inner, OmniRequestOutput) and hasattr(inner, "images"):
    frames = inner.images[0] if inner.images else None
else:
```
This function doesn't actually free memory because Python's del only removes the local reference. The caller's variable (e.g., omni_bl at line 243) still holds a reference, so the BF16 model stays loaded while quantized models are loaded. This causes OOM during multi-config ablations on large models.
Fix: return None from this function and assign it back (`omni_bl = _unload_omni(omni_bl)`), or restructure to use a context manager.
| torch.cuda.reset_peak_memory_stats() | ||
| start = time.perf_counter() | ||
| outputs = omni.generate( | ||
| {"prompt": prompt}, |
Measurement accuracy issue: torch.cuda.max_memory_allocated() measures memory in the client process, but Omni generation runs in stage worker processes by default. The reported Memory (GiB) values may be near-zero or noisy, not reflecting actual model execution memory. This undermines the script's purpose of comparing memory impact.
Consider: (1) documenting that --enforce-eager is required for accurate memory measurement, or (2) querying worker process memory via Omni's internal APIs if available.
Purpose
We have several open quantization PRs (#1528, #1470, #1338, #1412, #1413, #1414) and more coming. Currently, contributors evaluate quality with visual side-by-side comparisons and no quantitative perceptual metric, which makes it hard to objectively assess quality regressions.
This PR adds a reusable benchmark script that computes LPIPS (Learned Perceptual Image Patch Similarity) between BF16 baseline and quantized outputs, covering both image and video generation. Contributors can run it and paste the Markdown table directly into their PR description.
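For reference, the `lpips` package scores float CHW tensors in [-1, 1]; a minimal sketch of the input conversion (the helper name is illustrative, not from the script):

```python
import numpy as np


def to_lpips_range(img_u8: np.ndarray) -> np.ndarray:
    """Convert a uint8 HWC image (0..255) to float32 CHW in [-1, 1],
    the input range the `lpips` package expects."""
    x = img_u8.astype(np.float32) / 127.5 - 1.0
    return np.transpose(x, (2, 0, 1))


# With `pip install lpips`, a per-image distance is then roughly:
#   import torch, lpips
#   loss_fn = lpips.LPIPS(net="alex")
#   d = loss_fn(torch.from_numpy(to_lpips_range(a))[None],
#               torch.from_numpy(to_lpips_range(b))[None]).item()
```

Lower LPIPS means the quantized output is perceptually closer to the BF16 baseline.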
Current state of quality evaluation across open quantization PRs
Example: what this script produces
Using this script on Qwen-Image-2512 FP8 (from PR #1034):
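The resulting table is plain Markdown along these lines; a sketch of rendering one (column names and sample values are illustrative, not the script's exact schema):

```python
def results_markdown(rows: dict[str, tuple[float, float, float]]) -> str:
    """Render {config: (mean LPIPS vs. BF16, runtime s, peak GiB)} as a
    Markdown table ready to paste into a PR description."""
    lines = [
        "| Config | LPIPS ↓ | Time (s) | Memory (GiB) |",
        "|---|---|---|---|",
    ]
    for name, (lp, secs, gib) in rows.items():
        lines.append(f"| {name} | {lp:.4f} | {secs:.1f} | {gib:.2f} |")
    return "\n".join(lines)
```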
Usage
Text-to-Image
```shell
python benchmarks/diffusion/quantization_quality.py \
  --model Qwen/Qwen-Image-2512 \
  --task t2i \
  --quantization fp8 \
  --ignored-layers "img_mlp" \
  --prompts \
    "an aerial view of a coral reef with crystal clear turquoise water" \
    "a campfire in a dark forest with sparks rising into a starry sky" \
    "a gourmet dessert plate with chocolate mousse and gold leaf" \
  --height 1024 --width 1024 \
  --num-inference-steps 50 --seed 42
```
Text-to-Video
```shell
python benchmarks/diffusion/quantization_quality.py \
  --model Wan-AI/Wan2.2-T2V-A14B-Diffusers \
  --task t2v \
  --quantization fp8 \
  --prompts \
    "A serene lakeside sunrise with mist over the water" \
    "A cat walking across a wooden bridge in autumn" \
  --height 720 --width 1280 \
  --num-frames 81 --num-inference-steps 40 --seed 42
```
Ablation (test which layers are FP8-sensitive)
```shell
python benchmarks/diffusion/quantization_quality.py \
  --model Qwen/Qwen-Image-2512 \
  --task t2i \
  --quantization fp8 \
  --ablation-layers "img_mlp" "txt_mlp" "img_mlp,txt_mlp" \
  --prompts "a cup of coffee on the table" \
  --height 1024 --width 1024 \
  --num-inference-steps 50 --seed 42
```
What it outputs
- `quant_bench_output/baseline/` — BF16 reference images/videos
- `quant_bench_output/<config>/` — Quantized outputs
- `quant_bench_output/results.md` — Markdown table ready to paste into the PR

Requirements
Test plan
- `--task t2i` on Z-Image with `--quantization fp8`
- `--task t2v` on Wan2.2 with `--quantization fp8`
- `--ablation-layers`
- `results.md` output is valid Markdown

Related