
Benchmark torch.compile optimization for GPTQ #2320

Open

colldata79 wants to merge 3 commits into vllm-project:main from colldata79:gptq-torch-compile-v2

Conversation

@colldata79
Contributor

Summary

Benchmark and validate torch.compile optimization for the GPTQ quantization algorithm.

Addresses #1496
Supersedes #1561

What was done

  • Added enable_torch_compile flag to GPTQModifier
  • Added @torch.compile(dynamic=True) decorated _process_block function
  • Added quantize_weight_optimized compiled code path
  • Created benchmark harness for reproducibility
  • Ran benchmarks on Qwen2.5-3B (V100) and TinyLlama-1.1B
  • Validated correctness (compiled vs uncompiled outputs)
  • Validated memory (no inflation, no leaks)
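The flag-and-kernel pattern described above can be sketched roughly as follows. All bodies here are illustrative stand-ins (the real `_process_block` implements the GPTQ block update, not simple rounding); only the shape of the dispatch mirrors the PR: a kernel compiled once with `dynamic=True`, selected at runtime by an opt-in flag.

```python
# Hedged sketch of the enable_torch_compile pattern. Kernel body is a
# placeholder; only the dispatch structure mirrors the PR description.
try:
    import torch

    def _compile(fn):
        # dynamic=True avoids recompiling for every new tensor shape
        return torch.compile(fn, dynamic=True)
except ImportError:
    def _compile(fn):  # no-op fallback so the sketch runs without torch
        return fn


def _process_block_eager(block, scale):
    # stand-in for the per-block quantization arithmetic
    return [round(x / scale) * scale for x in block]


# compiled once at module import; the first call pays the compile overhead
_process_block = _compile(_process_block_eager)


def quantize_blocks(blocks, scale, enable_torch_compile=False):
    # opt-in: default False keeps the eager path (the PR's safe default)
    fn = _process_block if enable_torch_compile else _process_block_eager
    return [fn(b, scale) for b in blocks]
```

Because the compiled object is created at import time, repeated calls reuse the cached graph, which is why the ~35s overhead in the results below is one-time.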

Benchmark Results (Qwen2.5-3B on V100)

| Metric | Value |
|--------|-------|
| Baseline | 879s (14m 39s) |
| Compiled (warm) | 478s (7m 58s) |
| Speedup | 1.84x |
| Compile overhead | ~35s (one-time) |
| Memory overhead | 0% |
| Graph breaks | 0 |

Validation (TinyLlama-1.1B)

| Test | Result |
|------|--------|
| Memory Sentinel | ✅ 0.03% error |
| Storage-Size Proof | ✅ No inflation |
| Numerical Check | ✅ Functionally equivalent |

Add `enable_torch_compile` flag to GPTQModifier that enables torch.compile
optimization on the inner block processing kernel (_process_block).

Key changes:
- Add @torch.compile(dynamic=True) decorated _process_block function
- Add _quantize_core wrapper that calls compiled kernel per block
- Add quantize_weight_optimized as the compiled code path
- Wire enable_torch_compile flag in GPTQModifier (default: False)

Benchmark results (Qwen2.5-3B on V100):
- Baseline: 879s avg
- Compiled warm: 478s avg (1.84x speedup)
- Compile overhead: ~35s (one-time)
- Memory overhead: 0%
- Graph breaks: 0

Includes benchmark harness and artifacts for reproducibility.

Fixes vllm-project#1496
Supersedes vllm-project#1561
@github-actions

👋 Hi! Thank you for contributing to llm-compressor. Please add the ready label when the PR is ready for review.

Note: This is required to complete the testing suite, please only add the label once the PR is code complete and local testing has been performed.

@gemini-code-assist
Contributor

Summary of Changes

Hello @colldata79, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces a significant performance enhancement to the GPTQ quantization algorithm by integrating torch.compile. It provides a thoroughly benchmarked and validated solution that accelerates the quantization process by nearly two times, without incurring additional memory overhead or compromising the numerical accuracy of the quantized models. The changes include a new optimized code path and a comprehensive benchmarking suite to ensure the reliability and effectiveness of the optimization.

Highlights

  • torch.compile Integration: Implemented torch.compile support for GPTQ quantization by decorating the _process_block function with @torch.compile(dynamic=True) and introducing a new quantize_weight_optimized path.
  • Performance Improvement: Achieved a significant 1.84x speedup in GPTQ quantization for the Qwen2.5-3B model on a V100 GPU, with a minimal one-time compile overhead of ~35 seconds.
  • Comprehensive Benchmarking: Developed a robust benchmarking harness to measure runtime, memory usage, compile overhead, and numerical correctness, ensuring reproducibility and detailed analysis.
  • Validation & Stability: Validated the optimization for correctness (functionally equivalent outputs), memory (no inflation, no leaks), and stability (zero graph breaks and reduced run-to-run variance).
  • User Control: Added an enable_torch_compile flag to GPTQModifier, allowing users to optionally enable or disable the torch.compile optimization.



@gemini-code-assist bot (Contributor) left a comment

Code Review

This pull request introduces a comprehensive benchmark suite for evaluating torch.compile optimizations on the GPTQ algorithm, along with the necessary code changes to enable this optimization. The benchmark is well-structured and the results are promising. My review focuses on improving code maintainability by reducing duplication, enhancing the robustness of the benchmark scripts, and addressing potential security and documentation inconsistencies. The most significant issue is the large amount of duplicated code between the original and the new optimized quantization functions, which should be refactored. I've also pointed out a few issues in the benchmark scripts and documentation for clarity and correctness.


| Property | Value |
|----------|-------|
| **Project** | `our-rampart-478403-t3` |

security-high

The GCP project ID our-rampart-478403-t3 appears to be hardcoded. If this is a real project ID, it should be removed or replaced with a placeholder to avoid leaking sensitive information.
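One way to address this, sketched under the assumption that the script can take its project ID from the environment (the variable name `GCP_PROJECT_ID` is an assumption for illustration, not from the PR):

```python
import os

# Hypothetical remediation for the hardcoded project ID: resolve it from
# the environment, with a neutral placeholder as the default so no real
# project ID ships in the repository.
def get_gcp_project_id(default: str = "your-gcp-project-id") -> str:
    return os.environ.get("GCP_PROJECT_ID", default)
```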

compile_stats = compiled.get("compile_stats", {})
recompiles = compile_stats.get("graph_breaks", 0)

max_diff = results.get("numerical_check", {}).get("max_abs_diff", "N/A")

high

The summary table generation uses max_abs_diff, but the numerical_check function in benchmark_gptq_compile.py provides logit_max_diff from compare_model_outputs. The compare_model_weights function, which calculates max_abs_diff, is not used for the numerical check. This inconsistency will lead to 'N/A' in the summary table. Please update this to use the correct metric, logit_max_diff.


1. **Keep `dynamic=True`**: Works well, no excessive recompilations
2. **Default to `enable_torch_compile=False`**: Safe default, opt-in for performance
3. **Document compile overhead**: Users should expect ~11 min first-run penalty

medium

The documented compile overhead of ~11 min seems to contradict the benchmark results presented earlier in this document (e.g., in the Executive Summary and Results section), which state the overhead is ~35 seconds. Please ensure the documentation is consistent.

import torch

# Add local src to path
sys.path.insert(0, str(Path(__file__).parent / "src"))

medium

Manipulating sys.path can be fragile. Since the README.md instructs users to install the package in editable mode (pip install -e .), this line should be unnecessary. Relying on the package installation is a more robust approach.


# Reset dynamo counters before cold run
reset_dynamo_counters()
torch._dynamo.reset()

medium

The use of torch._dynamo.reset() is calling a private API. While often necessary for benchmarking, it's worth adding a comment to note that this is not a public API and may be subject to change or removal in future PyTorch versions, which could break this script.
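One way to act on this note, sketched here, is to fence the private call in a small helper with the caveat recorded in its docstring, so a future PyTorch rename breaks a single, clearly marked spot:

```python
def reset_dynamo_state() -> bool:
    """Best-effort reset of dynamo caches before a cold-start measurement.

    NOTE: torch._dynamo.reset() is a private API and may change or be
    removed in future PyTorch versions; failures are tolerated rather
    than allowed to abort the benchmark.
    """
    try:
        import torch._dynamo
        torch._dynamo.reset()
        return True
    except (ImportError, AttributeError):
        return False
```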



if __name__ == "__main__":
    exit(main())

medium

It is more conventional to use sys.exit() to exit a script. Consider changing exit(main()) to sys.exit(main()).

Suggested change:
- exit(main())
+ sys.exit(main())

Comment on lines +274 to +516
    from safetensors import safe_open

    # Find safetensor files
    def find_safetensors(path: Path) -> List[Path]:
        files = list(path.glob("*.safetensors"))
        if not files:
            files = list(path.glob("**/*.safetensors"))
        return sorted(files)

    files_a = find_safetensors(path_a)
    files_b = find_safetensors(path_b)

    if not files_a or not files_b:
        return {"error": "No safetensor files found"}

    all_max_diffs = []
    all_mean_diffs = []
    all_cosine_sims = []
    all_element_diffs = []  # For p99 calculation
    layer_stats = []

    for file_a in files_a:
        # Find corresponding file in b
        file_b = path_b / file_a.name
        if not file_b.exists():
            continue

        with safe_open(file_a, framework="pt", device=device) as fa:
            with safe_open(file_b, framework="pt", device=device) as fb:
                keys_a = set(fa.keys())
                keys_b = set(fb.keys())
                common_keys = keys_a & keys_b

                for key in common_keys:
                    tensor_a = fa.get_tensor(key).float()
                    tensor_b = fb.get_tensor(key).float()

                    if tensor_a.shape != tensor_b.shape:
                        continue

                    # Absolute differences
                    diff = (tensor_a - tensor_b).abs()
                    max_diff = diff.max().item()
                    mean_diff = diff.mean().item()

                    # Collect element-level diffs for p99 (sample to avoid memory issues)
                    flat_diff = diff.flatten()
                    if len(flat_diff) > 10000:
                        # Sample 10k elements for p99 calculation
                        indices = torch.randperm(len(flat_diff))[:10000]
                        sampled = flat_diff[indices].tolist()
                    else:
                        sampled = flat_diff.tolist()
                    all_element_diffs.extend(sampled)

                    # Cosine similarity
                    flat_a = tensor_a.flatten()
                    flat_b = tensor_b.flatten()
                    cos_sim = torch.nn.functional.cosine_similarity(
                        flat_a.unsqueeze(0), flat_b.unsqueeze(0)
                    ).item()

                    all_max_diffs.append(max_diff)
                    all_mean_diffs.append(mean_diff)
                    all_cosine_sims.append(cos_sim)

                    layer_stats.append({
                        "name": key,
                        "max_abs_diff": max_diff,
                        "mean_abs_diff": mean_diff,
                        "cosine_similarity": cos_sim,
                    })

    if not all_max_diffs:
        return {"error": "No comparable tensors found"}

    # Calculate p99 from sampled element diffs
    element_arr = np.array(all_element_diffs)
    p99_diff = float(np.percentile(element_arr, 99))

    return {
        "max_abs_diff": max(all_max_diffs),
        "mean_abs_diff": sum(all_mean_diffs) / len(all_mean_diffs),
        "p99_abs_diff": p99_diff,
        "cosine_similarity": sum(all_cosine_sims) / len(all_cosine_sims),
        "num_tensors_compared": len(all_max_diffs),
        "equivalent": max(all_max_diffs) < 1e-5,
        "layer_stats": layer_stats[:10],  # First 10 for brevity
    }


def compare_model_outputs(
    path_a: Path,
    path_b: Path,
    num_samples: int = 5,
    max_new_tokens: int = 20,
    device: str = "cuda",
) -> Dict[str, Any]:
    """
    Compare model outputs between two quantized models on fixed inputs.

    This is the recommended method for numerical correctness validation
    because it compares actual model behavior rather than internal
    representations (packed int32 tensors give meaningless diffs).

    Args:
        path_a: Path to first model
        path_b: Path to second model
        num_samples: Number of test prompts
        max_new_tokens: Tokens to generate per prompt
        device: Device for inference

    Returns dict with:
        - logit_max_diff: Maximum difference in output logits
        - logit_mean_diff: Mean difference in output logits
        - token_match_rate: Fraction of generated tokens that match
        - output_samples: Sample outputs for inspection
    """
    from transformers import AutoModelForCausalLM, AutoTokenizer

    # Fixed test prompts for reproducibility
    test_prompts = [
        "The capital of France is",
        "def fibonacci(n):",
        "In machine learning, gradient descent",
        "The quick brown fox",
        "Water boils at",
    ][:num_samples]

    results = {
        "method": "output_comparison",
        "num_samples": num_samples,
        "max_new_tokens": max_new_tokens,
        "logit_diffs": [],
        "token_matches": [],
        "output_samples": [],
    }

    try:
        # Load tokenizer (should be same for both)
        tokenizer = AutoTokenizer.from_pretrained(path_a)
        if tokenizer.pad_token is None:
            tokenizer.pad_token = tokenizer.eos_token

        # Load models
        model_a = AutoModelForCausalLM.from_pretrained(
            path_a,
            device_map=device,
            torch_dtype=torch.float16,
        )
        model_b = AutoModelForCausalLM.from_pretrained(
            path_b,
            device_map=device,
            torch_dtype=torch.float16,
        )

        model_a.eval()
        model_b.eval()

        for prompt in test_prompts:
            inputs = tokenizer(prompt, return_tensors="pt").to(device)

            with torch.no_grad():
                # Get logits for the input (forward pass only)
                outputs_a = model_a(**inputs)
                outputs_b = model_b(**inputs)

                # Compare logits
                logits_a = outputs_a.logits.float()
                logits_b = outputs_b.logits.float()

                logit_diff = (logits_a - logits_b).abs()
                max_diff = logit_diff.max().item()
                mean_diff = logit_diff.mean().item()

                results["logit_diffs"].append({
                    "prompt": prompt,
                    "max_diff": max_diff,
                    "mean_diff": mean_diff,
                })

            # Generate tokens and compare
            gen_a = model_a.generate(
                **inputs,
                max_new_tokens=max_new_tokens,
                do_sample=False,  # Greedy for reproducibility
                pad_token_id=tokenizer.pad_token_id,
            )
            gen_b = model_b.generate(
                **inputs,
                max_new_tokens=max_new_tokens,
                do_sample=False,
                pad_token_id=tokenizer.pad_token_id,
            )

            # Compare generated tokens
            tokens_a = gen_a[0].tolist()
            tokens_b = gen_b[0].tolist()
            min_len = min(len(tokens_a), len(tokens_b))
            matches = sum(1 for i in range(min_len) if tokens_a[i] == tokens_b[i])
            match_rate = matches / min_len if min_len > 0 else 0

            results["token_matches"].append({
                "prompt": prompt,
                "match_rate": match_rate,
                "matched": matches,
                "total": min_len,
            })

            # Decode for inspection
            text_a = tokenizer.decode(gen_a[0], skip_special_tokens=True)
            text_b = tokenizer.decode(gen_b[0], skip_special_tokens=True)

            results["output_samples"].append({
                "prompt": prompt,
                "output_a": text_a,
                "output_b": text_b,
                "identical": text_a == text_b,
            })

        # Cleanup models
        del model_a, model_b
        gc.collect()
        if torch.cuda.is_available():
            torch.cuda.empty_cache()

        # Aggregate results
        all_max_diffs = [d["max_diff"] for d in results["logit_diffs"]]
        all_mean_diffs = [d["mean_diff"] for d in results["logit_diffs"]]
        all_match_rates = [m["match_rate"] for m in results["token_matches"]]

        results["logit_max_diff"] = max(all_max_diffs)
        results["logit_mean_diff"] = sum(all_mean_diffs) / len(all_mean_diffs)
        results["token_match_rate"] = sum(all_match_rates) / len(all_match_rates)
        results["all_outputs_identical"] = all(s["identical"] for s in results["output_samples"])

        # Equivalence check: logit diff should be small for numerical equivalence
        # With FP16 quantized models, we expect some small differences
        results["equivalent"] = results["logit_max_diff"] < 1.0 and results["token_match_rate"] == 1.0
        results["status"] = "success"

    except Exception as e:
        import traceback

medium

Several imports (safetensors, transformers, traceback) are located inside functions. For better code organization, readability, and to avoid repeated import overhead, please move all imports to the top of the file.


# Equivalence check: logit diff should be small for numerical equivalence
# With FP16 quantized models, we expect some small differences
results["equivalent"] = results["logit_max_diff"] < 1.0 and results["token_match_rate"] == 1.0

medium

The condition for equivalent requires token_match_rate == 1.0. However, the README.md suggests that for W4A16 quantization, a perfect token match is not always expected. This discrepancy could be confusing. Consider relaxing this condition or clarifying the definition of 'equivalent' in this context. The threshold of 1.0 for logit_max_diff also seems arbitrary and could benefit from justification or being made configurable.
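One way to address both points, sketched here with placeholder thresholds (the default values below are illustrative, not validated), is to make the criteria explicit parameters rather than hardcoded constants:

```python
# Illustrative sketch: parameterize the equivalence criteria instead of
# hardcoding `< 1.0` and `== 1.0`. Threshold defaults are placeholders.
def is_equivalent(
    logit_max_diff: float,
    token_match_rate: float,
    logit_tol: float = 1.0,
    min_token_match: float = 1.0,
) -> bool:
    return logit_max_diff < logit_tol and token_match_rate >= min_token_match
```

A caller validating W4A16 models could then relax `min_token_match` explicitly rather than silently redefining what "equivalent" means.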

Comment on lines +266 to +270
quantize_fn = (
    quantize_weight_optimized
    if self.enable_torch_compile
    else quantize_weight
)

medium

The logic to select the quantization function can be simplified by moving the conditional check inside the quantize_weight function itself. This would avoid exporting two separate functions and make the call site cleaner.
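The suggested simplification might look roughly like this; the public name follows the PR, while the private helpers and their bodies are stand-ins for illustration:

```python
# Hedged sketch of folding the compiled/eager choice into one entry point,
# so call sites no longer choose between two exported functions.
def _quantize_eager(w):
    return [round(x) for x in w]  # stand-in for the eager GPTQ path

def _quantize_compiled(w):
    return [round(x) for x in w]  # stand-in for the torch.compile path

def quantize_weight(w, enable_torch_compile=False):
    impl = _quantize_compiled if enable_torch_compile else _quantize_eager
    return impl(w)
```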


# Storage-size instrumentation for memory spike analysis
# Only log once per run (via env var flag)
import os

medium

The import os statement is inside the quantize_weight_optimized function. It should be moved to the top of the file to follow standard Python conventions.

@colldata79
Contributor Author

This PR is ready for review. Could a maintainer please add the ready label?

@HDCharles HDCharles added the ready When a PR is ready for review label Feb 2, 2026
@dsikka dsikka requested review from HDCharles and kylesayrs February 6, 2026 14:23
@HDCharles
Collaborator

HDCharles commented Feb 10, 2026

Hey, this looks good!

I'd say for this to land there are a few final steps

  1. lm-eval performance of, e.g., the Qwen model you generated vs. the normal GPTQ one, or something along those lines. I know you have numerical results, but ultimately this is where we will (or won't) see regression in our lm-eval tests.
  2. Get rid of the majority of the benchmark code; hopefully you didn't spend too long on it. Adding a benchmarking folder with some dedicated tests isn't a bad idea, but what you have is complicated and much of it is not necessary. For example, checking storage size has nothing to do with the GPTQ algorithm; that's the model compressor, which wasn't modified, and many of the other features are similar. A dedicated benchmark script to make it easier to compare compiled and non-compiled GPTQ could be useful (though not necessary), but something this complicated makes those tests harder, not easier, for a random person looking to do some work. Worst case, if we really want the code again, it's logged in this commit. I often do this myself: leave a bunch of tests in my first commit so anyone who wants to repro has them, without clogging up the actual codebase.

let me know if you need help doing the evaluation

@HDCharles (Collaborator) left a comment

looks good, see comment

@colldata79
Contributor Author

Great feedback. I will look at this later in the week to close it out.

@HDCharles
Collaborator

Another point: we should probably extract some or most of the functionality into helper functions so we don't have the exact same code duplicated across quantize_weight and quantize_weight_optimized. This will be a real pain to maintain otherwise.
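The refactor proposed here could be sketched as a shared core that both entry points delegate to, differing only in the inner kernel they pass; everything below is an illustrative stand-in, not the actual GPTQ code:

```python
# Hypothetical shape of the de-duplication: shared pre/post-processing
# lives in one helper, and the two public functions differ only in the
# kernel they hand it.
def _quantize_common(weight, inner_kernel):
    prepared = [float(x) for x in weight]  # stand-in for shared setup
    return inner_kernel(prepared)          # the only point of divergence

def quantize_weight(weight):
    return _quantize_common(weight, lambda w: [round(x) for x in w])

def quantize_weight_optimized(weight):
    # the real version would pass the torch.compile-d kernel here
    return _quantize_common(weight, lambda w: [round(x) for x in w])
```

With this shape, a fix to the shared setup lands in one place instead of two.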

Bias92 added a commit to Bias92/llm-compressor that referenced this pull request Feb 19, 2026
…torch_compile flag

- Restore original _grid_search_mse with early stopping + patch_attr (non-compiled path)
- Add _grid_search_mse_compiled for torch.compile-compatible path
- Add enable_torch_compile flag to observer_kwargs (default False)
- Add _call_grid_search helper to reduce code duplication in observer classes
- Follows GPTQ PR vllm-project#2320 pattern: flag-based compiled/non-compiled path selection

Signed-off-by: Jaewoo Kim <pewpewplay315@gmail.com>
@HDCharles HDCharles changed the title Benchmark torch.compile optimization for quantization Benchmark torch.compile optimization for GPTQ Feb 19, 2026
@mergify
Contributor

mergify bot commented Feb 23, 2026

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @colldata79.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Feb 23, 2026