
Benchmark torch.compile optimization for GPTQ #2320

Open

colldata79 wants to merge 3 commits into vllm-project:main from colldata79:gptq-torch-compile-v2

Conversation

@colldata79
Contributor

Summary

Benchmark and validate torch.compile optimization for the GPTQ quantization algorithm.

Addresses #1496
Supersedes #1561

What was done

  • Added enable_torch_compile flag to GPTQModifier
  • Added @torch.compile(dynamic=True) decorated _process_block function
  • Added quantize_weight_optimized compiled code path
  • Created benchmark harness for reproducibility
  • Ran benchmarks on Qwen2.5-3B (V100) and TinyLlama-1.1B
  • Validated correctness (compiled vs uncompiled outputs)
  • Validated memory (no inflation, no leaks)
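The flag-and-kernel pattern described above can be sketched roughly as follows. All bodies here are illustrative stand-ins (the real `_process_block` implements the GPTQ block update, not simple rounding); only the shape of the dispatch mirrors the PR: a kernel compiled once with `dynamic=True`, selected at runtime by an opt-in flag.

```python
# Hedged sketch of the enable_torch_compile pattern. Kernel body is a
# placeholder; only the dispatch structure mirrors the PR description.
try:
    import torch

    def _compile(fn):
        # dynamic=True avoids recompiling for every new tensor shape
        return torch.compile(fn, dynamic=True)
except ImportError:
    def _compile(fn):  # no-op fallback so the sketch runs without torch
        return fn


def _process_block_eager(block, scale):
    # stand-in for the per-block quantization arithmetic
    return [round(x / scale) * scale for x in block]


# compiled once at module import; the first call pays the compile overhead
_process_block = _compile(_process_block_eager)


def quantize_blocks(blocks, scale, enable_torch_compile=False):
    # opt-in: default False keeps the eager path (the PR's safe default)
    fn = _process_block if enable_torch_compile else _process_block_eager
    return [fn(b, scale) for b in blocks]
```

Because the compiled object is created at import time, repeated calls reuse the cached graph, which is why the ~35s overhead in the results below is one-time.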

Benchmark Results (Qwen2.5-3B on V100)

| Metric | Value |
|--------|-------|
| Baseline | 879s (14m 39s) |
| Compiled (warm) | 478s (7m 58s) |
| Speedup | 1.84x |
| Compile overhead | ~35s (one-time) |
| Memory overhead | 0% |
| Graph breaks | 0 |

Validation (TinyLlama-1.1B)

| Test | Result |
|------|--------|
| Memory Sentinel | ✅ 0.03% error |
| Storage-Size Proof | ✅ No inflation |
| Numerical Check | ✅ Functionally equivalent |

Add `enable_torch_compile` flag to GPTQModifier that enables torch.compile
optimization on the inner block processing kernel (_process_block).

Key changes:
- Add @torch.compile(dynamic=True) decorated _process_block function
- Add _quantize_core wrapper that calls compiled kernel per block
- Add quantize_weight_optimized as the compiled code path
- Wire enable_torch_compile flag in GPTQModifier (default: False)

Benchmark results (Qwen2.5-3B on V100):
- Baseline: 879s avg
- Compiled warm: 478s avg (1.84x speedup)
- Compile overhead: ~35s (one-time)
- Memory overhead: 0%
- Graph breaks: 0

Includes benchmark harness and artifacts for reproducibility.

Fixes vllm-project#1496
Supersedes vllm-project#1561
@github-actions

👋 Hi! Thank you for contributing to llm-compressor. Please add the ready label when the PR is ready for review.

Note: This is required to complete the testing suite, please only add the label once the PR is code complete and local testing has been performed.

@gemini-code-assist
Contributor

Summary of Changes

Hello @colldata79, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces a significant performance enhancement to the GPTQ quantization algorithm by integrating torch.compile. It provides a thoroughly benchmarked and validated solution that accelerates the quantization process by nearly two times, without incurring additional memory overhead or compromising the numerical accuracy of the quantized models. The changes include a new optimized code path and a comprehensive benchmarking suite to ensure the reliability and effectiveness of the optimization.

Highlights

  • torch.compile Integration: Implemented torch.compile support for GPTQ quantization by decorating the _process_block function with @torch.compile(dynamic=True) and introducing a new quantize_weight_optimized path.
  • Performance Improvement: Achieved a significant 1.84x speedup in GPTQ quantization for the Qwen2.5-3B model on a V100 GPU, with a minimal one-time compile overhead of ~35 seconds.
  • Comprehensive Benchmarking: Developed a robust benchmarking harness to measure runtime, memory usage, compile overhead, and numerical correctness, ensuring reproducibility and detailed analysis.
  • Validation & Stability: Validated the optimization for correctness (functionally equivalent outputs), memory (no inflation, no leaks), and stability (zero graph breaks and reduced run-to-run variance).
  • User Control: Added an enable_torch_compile flag to GPTQModifier, allowing users to optionally enable or disable the torch.compile optimization.



@gemini-code-assist bot (Contributor) left a comment

Code Review

This pull request introduces a comprehensive benchmark suite for evaluating torch.compile optimizations on the GPTQ algorithm, along with the necessary code changes to enable this optimization. The benchmark is well-structured and the results are promising. My review focuses on improving code maintainability by reducing duplication, enhancing the robustness of the benchmark scripts, and addressing potential security and documentation inconsistencies. The most significant issue is the large amount of duplicated code between the original and the new optimized quantization functions, which should be refactored. I've also pointed out a few issues in the benchmark scripts and documentation for clarity and correctness.


| Property | Value |
|----------|-------|
| **Project** | `our-rampart-478403-t3` |

security-high

The GCP project ID our-rampart-478403-t3 appears to be hardcoded. If this is a real project ID, it should be removed or replaced with a placeholder to avoid leaking sensitive information.
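One way to address this, sketched under the assumption that the script can take its project ID from the environment (the variable name `GCP_PROJECT_ID` is an assumption for illustration, not from the PR):

```python
import os

# Hypothetical remediation for the hardcoded project ID: resolve it from
# the environment, with a neutral placeholder as the default so no real
# project ID ships in the repository.
def get_gcp_project_id(default: str = "your-gcp-project-id") -> str:
    return os.environ.get("GCP_PROJECT_ID", default)
```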

compile_stats = compiled.get("compile_stats", {})
recompiles = compile_stats.get("graph_breaks", 0)

max_diff = results.get("numerical_check", {}).get("max_abs_diff", "N/A")

high

The summary table generation uses max_abs_diff, but the numerical_check function in benchmark_gptq_compile.py provides logit_max_diff from compare_model_outputs. The compare_model_weights function, which calculates max_abs_diff, is not used for the numerical check. This inconsistency will lead to 'N/A' in the summary table. Please update this to use the correct metric, logit_max_diff.


1. **Keep `dynamic=True`**: Works well, no excessive recompilations
2. **Default to `enable_torch_compile=False`**: Safe default, opt-in for performance
3. **Document compile overhead**: Users should expect ~11 min first-run penalty

medium

The documented compile overhead of ~11 min seems to contradict the benchmark results presented earlier in this document (e.g., in the Executive Summary and Results section), which state the overhead is ~35 seconds. Please ensure the documentation is consistent.

import torch

# Add local src to path
sys.path.insert(0, str(Path(__file__).parent / "src"))

medium

Manipulating sys.path can be fragile. Since the README.md instructs users to install the package in editable mode (pip install -e .), this line should be unnecessary. Relying on the package installation is a more robust approach.


# Reset dynamo counters before cold run
reset_dynamo_counters()
torch._dynamo.reset()

medium

The use of torch._dynamo.reset() is calling a private API. While often necessary for benchmarking, it's worth adding a comment to note that this is not a public API and may be subject to change or removal in future PyTorch versions, which could break this script.
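One way to act on this note, sketched here, is to fence the private call in a small helper with the caveat recorded in its docstring, so a future PyTorch rename breaks a single, clearly marked spot:

```python
def reset_dynamo_state() -> bool:
    """Best-effort reset of dynamo caches before a cold-start measurement.

    NOTE: torch._dynamo.reset() is a private API and may change or be
    removed in future PyTorch versions; failures are tolerated rather
    than allowed to abort the benchmark.
    """
    try:
        import torch._dynamo
        torch._dynamo.reset()
        return True
    except (ImportError, AttributeError):
        return False
```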



if __name__ == "__main__":
    exit(main())

medium

It is more conventional to use sys.exit() to exit a script. Consider changing exit(main()) to sys.exit(main()).

Suggested change:
- exit(main())
+ sys.exit(main())

Comment on lines +274 to +516
    from safetensors import safe_open

    # Find safetensor files
    def find_safetensors(path: Path) -> List[Path]:
        files = list(path.glob("*.safetensors"))
        if not files:
            files = list(path.glob("**/*.safetensors"))
        return sorted(files)

    files_a = find_safetensors(path_a)
    files_b = find_safetensors(path_b)

    if not files_a or not files_b:
        return {"error": "No safetensor files found"}

    all_max_diffs = []
    all_mean_diffs = []
    all_cosine_sims = []
    all_element_diffs = []  # For p99 calculation
    layer_stats = []

    for file_a in files_a:
        # Find corresponding file in b
        file_b = path_b / file_a.name
        if not file_b.exists():
            continue

        with safe_open(file_a, framework="pt", device=device) as fa:
            with safe_open(file_b, framework="pt", device=device) as fb:
                keys_a = set(fa.keys())
                keys_b = set(fb.keys())
                common_keys = keys_a & keys_b

                for key in common_keys:
                    tensor_a = fa.get_tensor(key).float()
                    tensor_b = fb.get_tensor(key).float()

                    if tensor_a.shape != tensor_b.shape:
                        continue

                    # Absolute differences
                    diff = (tensor_a - tensor_b).abs()
                    max_diff = diff.max().item()
                    mean_diff = diff.mean().item()

                    # Collect element-level diffs for p99 (sample to avoid memory issues)
                    flat_diff = diff.flatten()
                    if len(flat_diff) > 10000:
                        # Sample 10k elements for p99 calculation
                        indices = torch.randperm(len(flat_diff))[:10000]
                        sampled = flat_diff[indices].tolist()
                    else:
                        sampled = flat_diff.tolist()
                    all_element_diffs.extend(sampled)

                    # Cosine similarity
                    flat_a = tensor_a.flatten()
                    flat_b = tensor_b.flatten()
                    cos_sim = torch.nn.functional.cosine_similarity(
                        flat_a.unsqueeze(0), flat_b.unsqueeze(0)
                    ).item()

                    all_max_diffs.append(max_diff)
                    all_mean_diffs.append(mean_diff)
                    all_cosine_sims.append(cos_sim)

                    layer_stats.append({
                        "name": key,
                        "max_abs_diff": max_diff,
                        "mean_abs_diff": mean_diff,
                        "cosine_similarity": cos_sim,
                    })

    if not all_max_diffs:
        return {"error": "No comparable tensors found"}

    # Calculate p99 from sampled element diffs
    element_arr = np.array(all_element_diffs)
    p99_diff = float(np.percentile(element_arr, 99))

    return {
        "max_abs_diff": max(all_max_diffs),
        "mean_abs_diff": sum(all_mean_diffs) / len(all_mean_diffs),
        "p99_abs_diff": p99_diff,
        "cosine_similarity": sum(all_cosine_sims) / len(all_cosine_sims),
        "num_tensors_compared": len(all_max_diffs),
        "equivalent": max(all_max_diffs) < 1e-5,
        "layer_stats": layer_stats[:10],  # First 10 for brevity
    }


def compare_model_outputs(
    path_a: Path,
    path_b: Path,
    num_samples: int = 5,
    max_new_tokens: int = 20,
    device: str = "cuda",
) -> Dict[str, Any]:
    """
    Compare model outputs between two quantized models on fixed inputs.

    This is the recommended method for numerical correctness validation
    because it compares actual model behavior rather than internal
    representations (packed int32 tensors give meaningless diffs).

    Args:
        path_a: Path to first model
        path_b: Path to second model
        num_samples: Number of test prompts
        max_new_tokens: Tokens to generate per prompt
        device: Device for inference

    Returns dict with:
        - logit_max_diff: Maximum difference in output logits
        - logit_mean_diff: Mean difference in output logits
        - token_match_rate: Fraction of generated tokens that match
        - output_samples: Sample outputs for inspection
    """
    from transformers import AutoModelForCausalLM, AutoTokenizer

    # Fixed test prompts for reproducibility
    test_prompts = [
        "The capital of France is",
        "def fibonacci(n):",
        "In machine learning, gradient descent",
        "The quick brown fox",
        "Water boils at",
    ][:num_samples]

    results = {
        "method": "output_comparison",
        "num_samples": num_samples,
        "max_new_tokens": max_new_tokens,
        "logit_diffs": [],
        "token_matches": [],
        "output_samples": [],
    }

    try:
        # Load tokenizer (should be same for both)
        tokenizer = AutoTokenizer.from_pretrained(path_a)
        if tokenizer.pad_token is None:
            tokenizer.pad_token = tokenizer.eos_token

        # Load models
        model_a = AutoModelForCausalLM.from_pretrained(
            path_a,
            device_map=device,
            torch_dtype=torch.float16,
        )
        model_b = AutoModelForCausalLM.from_pretrained(
            path_b,
            device_map=device,
            torch_dtype=torch.float16,
        )

        model_a.eval()
        model_b.eval()

        for prompt in test_prompts:
            inputs = tokenizer(prompt, return_tensors="pt").to(device)

            with torch.no_grad():
                # Get logits for the input (forward pass only)
                outputs_a = model_a(**inputs)
                outputs_b = model_b(**inputs)

                # Compare logits
                logits_a = outputs_a.logits.float()
                logits_b = outputs_b.logits.float()

                logit_diff = (logits_a - logits_b).abs()
                max_diff = logit_diff.max().item()
                mean_diff = logit_diff.mean().item()

                results["logit_diffs"].append({
                    "prompt": prompt,
                    "max_diff": max_diff,
                    "mean_diff": mean_diff,
                })

            # Generate tokens and compare
            gen_a = model_a.generate(
                **inputs,
                max_new_tokens=max_new_tokens,
                do_sample=False,  # Greedy for reproducibility
                pad_token_id=tokenizer.pad_token_id,
            )
            gen_b = model_b.generate(
                **inputs,
                max_new_tokens=max_new_tokens,
                do_sample=False,
                pad_token_id=tokenizer.pad_token_id,
            )

            # Compare generated tokens
            tokens_a = gen_a[0].tolist()
            tokens_b = gen_b[0].tolist()
            min_len = min(len(tokens_a), len(tokens_b))
            matches = sum(1 for i in range(min_len) if tokens_a[i] == tokens_b[i])
            match_rate = matches / min_len if min_len > 0 else 0

            results["token_matches"].append({
                "prompt": prompt,
                "match_rate": match_rate,
                "matched": matches,
                "total": min_len,
            })

            # Decode for inspection
            text_a = tokenizer.decode(gen_a[0], skip_special_tokens=True)
            text_b = tokenizer.decode(gen_b[0], skip_special_tokens=True)

            results["output_samples"].append({
                "prompt": prompt,
                "output_a": text_a,
                "output_b": text_b,
                "identical": text_a == text_b,
            })

        # Cleanup models
        del model_a, model_b
        gc.collect()
        if torch.cuda.is_available():
            torch.cuda.empty_cache()

        # Aggregate results
        all_max_diffs = [d["max_diff"] for d in results["logit_diffs"]]
        all_mean_diffs = [d["mean_diff"] for d in results["logit_diffs"]]
        all_match_rates = [m["match_rate"] for m in results["token_matches"]]

        results["logit_max_diff"] = max(all_max_diffs)
        results["logit_mean_diff"] = sum(all_mean_diffs) / len(all_mean_diffs)
        results["token_match_rate"] = sum(all_match_rates) / len(all_match_rates)
        results["all_outputs_identical"] = all(s["identical"] for s in results["output_samples"])

        # Equivalence check: logit diff should be small for numerical equivalence
        # With FP16 quantized models, we expect some small differences
        results["equivalent"] = results["logit_max_diff"] < 1.0 and results["token_match_rate"] == 1.0
        results["status"] = "success"

    except Exception as e:
        import traceback

medium

Several imports (safetensors, transformers, traceback) are located inside functions. For better code organization, readability, and to avoid repeated import overhead, please move all imports to the top of the file.


# Equivalence check: logit diff should be small for numerical equivalence
# With FP16 quantized models, we expect some small differences
results["equivalent"] = results["logit_max_diff"] < 1.0 and results["token_match_rate"] == 1.0

medium

The condition for equivalent requires token_match_rate == 1.0. However, the README.md suggests that for W4A16 quantization, a perfect token match is not always expected. This discrepancy could be confusing. Consider relaxing this condition or clarifying the definition of 'equivalent' in this context. The threshold of 1.0 for logit_max_diff also seems arbitrary and could benefit from justification or being made configurable.
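One way to address both points, sketched here with placeholder thresholds (the default values below are illustrative, not validated), is to make the criteria explicit parameters rather than hardcoded constants:

```python
# Illustrative sketch: parameterize the equivalence criteria instead of
# hardcoding `< 1.0` and `== 1.0`. Threshold defaults are placeholders.
def is_equivalent(
    logit_max_diff: float,
    token_match_rate: float,
    logit_tol: float = 1.0,
    min_token_match: float = 1.0,
) -> bool:
    return logit_max_diff < logit_tol and token_match_rate >= min_token_match
```

A caller validating W4A16 models could then relax `min_token_match` explicitly rather than silently redefining what "equivalent" means.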

Comment on lines +266 to +270
quantize_fn = (
    quantize_weight_optimized
    if self.enable_torch_compile
    else quantize_weight
)

medium

The logic to select the quantization function can be simplified by moving the conditional check inside the quantize_weight function itself. This would avoid exporting two separate functions and make the call site cleaner.
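The suggested simplification might look roughly like this; the public name follows the PR, while the private helpers and their bodies are stand-ins for illustration:

```python
# Hedged sketch of folding the compiled/eager choice into one entry point,
# so call sites no longer choose between two exported functions.
def _quantize_eager(w):
    return [round(x) for x in w]  # stand-in for the eager GPTQ path

def _quantize_compiled(w):
    return [round(x) for x in w]  # stand-in for the torch.compile path

def quantize_weight(w, enable_torch_compile=False):
    impl = _quantize_compiled if enable_torch_compile else _quantize_eager
    return impl(w)
```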


# Storage-size instrumentation for memory spike analysis
# Only log once per run (via env var flag)
import os

medium

The import os statement is inside the quantize_weight_optimized function. It should be moved to the top of the file to follow standard Python conventions.

@colldata79
Contributor Author

This PR is ready for review. Could a maintainer please add the ready label?

@HDCharles HDCharles added the ready When a PR is ready for review label Feb 2, 2026
@dsikka dsikka requested review from HDCharles and kylesayrs February 6, 2026 14:23
@HDCharles
Collaborator

HDCharles commented Feb 10, 2026

Hey, this looks good!

I'd say for this to land there are a few final steps

  1. lm-eval performance of, e.g., the Qwen model you generated vs. the normal GPTQ one, or something along those lines. I know you have numerical results, but ultimately this is where we will (or won't) see regression in our lm-eval tests.
  2. Get rid of the majority of the benchmark code; hopefully you didn't spend too long on it. Adding a benchmarking folder with some dedicated tests isn't a bad idea, but what you have is complicated and much of it is not necessary. For example, checking storage size has nothing to do with the GPTQ algorithm; that's the model compressor, which wasn't modified, and many of the other features are similar. A dedicated benchmark script to make it easier to compare compiled and non-compiled GPTQ could be useful (though not necessary), but something this complicated makes those tests harder, not easier, for a random person looking to do some work. Worst case, if we really want the code again, it's logged in this commit. I often do this myself: leave a bunch of tests in my first commit so anyone who wants to repro has them, without clogging up the actual codebase.

let me know if you need help doing the evaluation

@HDCharles (Collaborator) left a comment

looks good, see comment

@colldata79
Contributor Author

Great feedback. I will look at this later in the week to close it out.

@HDCharles
Collaborator

Another point: we should probably extract some or most of the functionality into helper functions so we don't have the exact same code duplicated across quantize_weight and quantize_weight_optimized. This will be a real pain to maintain otherwise.
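The refactor proposed here could be sketched as a shared core that both entry points delegate to, differing only in the inner kernel they pass; everything below is an illustrative stand-in, not the actual GPTQ code:

```python
# Hypothetical shape of the de-duplication: shared pre/post-processing
# lives in one helper, and the two public functions differ only in the
# kernel they hand it.
def _quantize_common(weight, inner_kernel):
    prepared = [float(x) for x in weight]  # stand-in for shared setup
    return inner_kernel(prepared)          # the only point of divergence

def quantize_weight(weight):
    return _quantize_common(weight, lambda w: [round(x) for x in w])

def quantize_weight_optimized(weight):
    # the real version would pass the torch.compile-d kernel here
    return _quantize_common(weight, lambda w: [round(x) for x in w])
```

With this shape, a fix to the shared setup lands in one place instead of two.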

Bias92 added a commit to Bias92/llm-compressor that referenced this pull request Feb 19, 2026
…torch_compile flag

- Restore original _grid_search_mse with early stopping + patch_attr (non-compiled path)
- Add _grid_search_mse_compiled for torch.compile-compatible path
- Add enable_torch_compile flag to observer_kwargs (default False)
- Add _call_grid_search helper to reduce code duplication in observer classes
- Follows GPTQ PR vllm-project#2320 pattern: flag-based compiled/non-compiled path selection

Signed-off-by: Jaewoo Kim <pewpewplay315@gmail.com>
@HDCharles HDCharles changed the title Benchmark torch.compile optimization for quantization Benchmark torch.compile optimization for GPTQ Feb 19, 2026
@mergify
Contributor

mergify bot commented Feb 23, 2026

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @colldata79.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Feb 23, 2026