Benchmark torch.compile optimization for GPTQ#2320
colldata79 wants to merge 3 commits into vllm-project:main from
Conversation
Add an `enable_torch_compile` flag to `GPTQModifier` that enables torch.compile optimization on the inner block-processing kernel (`_process_block`).

Key changes:
- Add a `@torch.compile(dynamic=True)`-decorated `_process_block` function
- Add a `_quantize_core` wrapper that calls the compiled kernel per block
- Add `quantize_weight_optimized` as the compiled code path
- Wire the `enable_torch_compile` flag into `GPTQModifier` (default: False)

Benchmark results (Qwen2.5-3B on V100):
- Baseline: 879s avg
- Compiled warm: 478s avg (1.84x speedup)
- Compile overhead: ~35s (one-time)
- Memory overhead: 0%
- Graph breaks: 0

Includes a benchmark harness and artifacts for reproducibility.

Fixes vllm-project#1496
Supersedes vllm-project#1561
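The wiring described above can be sketched as follows. This is a minimal, hypothetical illustration of the flag-based dispatch, not the PR's actual code: the per-block math is a stand-in, and `backend="eager"` is used only so the sketch runs without a full Inductor toolchain.

```python
import torch

# Hypothetical sketch of the pattern described in the PR: an inner
# block-processing kernel compiled once with dynamic shapes, selected
# at runtime by an enable_torch_compile flag.

def _process_block(block: torch.Tensor, hessian_inv: torch.Tensor) -> torch.Tensor:
    # stand-in for the real GPTQ per-block quantization math
    return block - block * hessian_inv

# dynamic=True avoids recompiling for every new block shape;
# backend="eager" keeps this sketch runnable without a C++ toolchain.
_process_block_compiled = torch.compile(_process_block, dynamic=True, backend="eager")

def quantize_blocks(weight: torch.Tensor, hessian_inv: torch.Tensor,
                    block_size: int = 128, enable_torch_compile: bool = False) -> torch.Tensor:
    # pick the compiled or eager kernel once, then reuse it per block
    fn = _process_block_compiled if enable_torch_compile else _process_block
    out = torch.empty_like(weight)
    for start in range(0, weight.shape[1], block_size):
        end = min(start + block_size, weight.shape[1])
        out[:, start:end] = fn(weight[:, start:end], hessian_inv[:, start:end])
    return out
```

Both paths compute identical results; the compiled path pays a one-time compile cost that is amortized across all blocks, which matches the "compile overhead (one-time)" line in the results above.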
👋 Hi! Thank you for contributing to llm-compressor. Please add the ready label when the PR is ready for review. Note: this is required to complete the testing suite; please only add the label once the PR is code complete and local testing has been performed.
Summary of Changes (Gemini Code Assist): This pull request introduces a significant performance enhancement to the GPTQ quantization algorithm by integrating `torch.compile`.
Code Review
This pull request introduces a comprehensive benchmark suite for evaluating torch.compile optimizations on the GPTQ algorithm, along with the necessary code changes to enable this optimization. The benchmark is well-structured and the results are promising. My review focuses on improving code maintainability by reducing duplication, enhancing the robustness of the benchmark scripts, and addressing potential security and documentation inconsistencies. The most significant issue is the large amount of duplicated code between the original and the new optimized quantization functions, which should be refactored. I've also pointed out a few issues in the benchmark scripts and documentation for clarity and correctness.
| Property | Value |
|----------|-------|
| **Project** | `our-rampart-478403-t3` |

```python
compile_stats = compiled.get("compile_stats", {})
recompiles = compile_stats.get("graph_breaks", 0)
```
```python
max_diff = results.get("numerical_check", {}).get("max_abs_diff", "N/A")
```
The summary table generation uses `max_abs_diff`, but the numerical check in `benchmark_gptq_compile.py` produces `logit_max_diff` from `compare_model_outputs`. The `compare_model_weights` function, which calculates `max_abs_diff`, is not used for the numerical check, so this lookup will always fall back to 'N/A' in the summary table. Please update it to use the correct metric, `logit_max_diff`.
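A minimal sketch of the fix this comment suggests (the key names follow the snippet above; the fallback chain to `max_abs_diff` is an assumption for the case where only a weight-level comparison was run):

```python
# Read the metric that the numerical check actually produces
# (logit_max_diff from compare_model_outputs), falling back to
# max_abs_diff only if a weight-level comparison ran instead.
def summary_max_diff(results: dict) -> object:
    check = results.get("numerical_check", {})
    return check.get("logit_max_diff", check.get("max_abs_diff", "N/A"))
```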
1. **Keep `dynamic=True`**: Works well, no excessive recompilations
2. **Default to `enable_torch_compile=False`**: Safe default, opt-in for performance
3. **Document compile overhead**: Users should expect ~11 min first-run penalty
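For the opt-in recommended above, a user-facing recipe might look like the following. This is a hypothetical llm-compressor recipe fragment; the fields other than `enable_torch_compile` follow the usual `GPTQModifier` recipe schema and are illustrative, not taken from this PR.

```yaml
quant_stage:
  quant_modifiers:
    GPTQModifier:
      targets: ["Linear"]
      ignore: ["lm_head"]
      scheme: W4A16
      enable_torch_compile: true   # opt-in; default is false
```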
```python
import torch

# Add local src to path
sys.path.insert(0, str(Path(__file__).parent / "src"))

# Reset dynamo counters before cold run
reset_dynamo_counters()
torch._dynamo.reset()
```
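`reset_dynamo_counters` is presumably a small helper in the benchmark harness; a sketch of what it might do, using torch's dynamo utilities (these are private APIs and may change across versions):

```python
import torch
import torch._dynamo
import torch._dynamo.utils

def reset_dynamo_counters() -> None:
    # torch._dynamo.utils.counters accumulates per-category stats
    # (graph breaks, recompiles, ...); clear them so the cold run
    # starts from zero, then drop all cached compiled graphs.
    torch._dynamo.utils.counters.clear()
    torch._dynamo.reset()
```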
```python
if __name__ == "__main__":
    exit(main())
```
```python
from safetensors import safe_open

# Find safetensor files
def find_safetensors(path: Path) -> List[Path]:
    files = list(path.glob("*.safetensors"))
    if not files:
        files = list(path.glob("**/*.safetensors"))
    return sorted(files)

files_a = find_safetensors(path_a)
files_b = find_safetensors(path_b)

if not files_a or not files_b:
    return {"error": "No safetensor files found"}

all_max_diffs = []
all_mean_diffs = []
all_cosine_sims = []
all_element_diffs = []  # For p99 calculation
layer_stats = []

for file_a in files_a:
    # Find corresponding file in b
    file_b = path_b / file_a.name
    if not file_b.exists():
        continue

    with safe_open(file_a, framework="pt", device=device) as fa:
        with safe_open(file_b, framework="pt", device=device) as fb:
            keys_a = set(fa.keys())
            keys_b = set(fb.keys())
            common_keys = keys_a & keys_b

            for key in common_keys:
                tensor_a = fa.get_tensor(key).float()
                tensor_b = fb.get_tensor(key).float()

                if tensor_a.shape != tensor_b.shape:
                    continue

                # Absolute differences
                diff = (tensor_a - tensor_b).abs()
                max_diff = diff.max().item()
                mean_diff = diff.mean().item()

                # Collect element-level diffs for p99 (sample to avoid memory issues)
                flat_diff = diff.flatten()
                if len(flat_diff) > 10000:
                    # Sample 10k elements for p99 calculation
                    indices = torch.randperm(len(flat_diff))[:10000]
                    sampled = flat_diff[indices].tolist()
                else:
                    sampled = flat_diff.tolist()
                all_element_diffs.extend(sampled)

                # Cosine similarity
                flat_a = tensor_a.flatten()
                flat_b = tensor_b.flatten()
                cos_sim = torch.nn.functional.cosine_similarity(
                    flat_a.unsqueeze(0), flat_b.unsqueeze(0)
                ).item()

                all_max_diffs.append(max_diff)
                all_mean_diffs.append(mean_diff)
                all_cosine_sims.append(cos_sim)

                layer_stats.append({
                    "name": key,
                    "max_abs_diff": max_diff,
                    "mean_abs_diff": mean_diff,
                    "cosine_similarity": cos_sim,
                })

if not all_max_diffs:
    return {"error": "No comparable tensors found"}

# Calculate p99 from sampled element diffs
element_arr = np.array(all_element_diffs)
p99_diff = float(np.percentile(element_arr, 99))

return {
    "max_abs_diff": max(all_max_diffs),
    "mean_abs_diff": sum(all_mean_diffs) / len(all_mean_diffs),
    "p99_abs_diff": p99_diff,
    "cosine_similarity": sum(all_cosine_sims) / len(all_cosine_sims),
    "num_tensors_compared": len(all_max_diffs),
    "equivalent": max(all_max_diffs) < 1e-5,
    "layer_stats": layer_stats[:10],  # First 10 for brevity
}


def compare_model_outputs(
    path_a: Path,
    path_b: Path,
    num_samples: int = 5,
    max_new_tokens: int = 20,
    device: str = "cuda",
) -> Dict[str, Any]:
    """
    Compare model outputs between two quantized models on fixed inputs.

    This is the recommended method for numerical correctness validation
    because it compares actual model behavior rather than internal
    representations (packed int32 tensors give meaningless diffs).

    Args:
        path_a: Path to first model
        path_b: Path to second model
        num_samples: Number of test prompts
        max_new_tokens: Tokens to generate per prompt
        device: Device for inference

    Returns dict with:
        - logit_max_diff: Maximum difference in output logits
        - logit_mean_diff: Mean difference in output logits
        - token_match_rate: Fraction of generated tokens that match
        - output_samples: Sample outputs for inspection
    """
    from transformers import AutoModelForCausalLM, AutoTokenizer

    # Fixed test prompts for reproducibility
    test_prompts = [
        "The capital of France is",
        "def fibonacci(n):",
        "In machine learning, gradient descent",
        "The quick brown fox",
        "Water boils at",
    ][:num_samples]

    results = {
        "method": "output_comparison",
        "num_samples": num_samples,
        "max_new_tokens": max_new_tokens,
        "logit_diffs": [],
        "token_matches": [],
        "output_samples": [],
    }

    try:
        # Load tokenizer (should be same for both)
        tokenizer = AutoTokenizer.from_pretrained(path_a)
        if tokenizer.pad_token is None:
            tokenizer.pad_token = tokenizer.eos_token

        # Load models
        model_a = AutoModelForCausalLM.from_pretrained(
            path_a,
            device_map=device,
            torch_dtype=torch.float16,
        )
        model_b = AutoModelForCausalLM.from_pretrained(
            path_b,
            device_map=device,
            torch_dtype=torch.float16,
        )

        model_a.eval()
        model_b.eval()

        for prompt in test_prompts:
            inputs = tokenizer(prompt, return_tensors="pt").to(device)

            with torch.no_grad():
                # Get logits for the input (forward pass only)
                outputs_a = model_a(**inputs)
                outputs_b = model_b(**inputs)

            # Compare logits
            logits_a = outputs_a.logits.float()
            logits_b = outputs_b.logits.float()

            logit_diff = (logits_a - logits_b).abs()
            max_diff = logit_diff.max().item()
            mean_diff = logit_diff.mean().item()

            results["logit_diffs"].append({
                "prompt": prompt,
                "max_diff": max_diff,
                "mean_diff": mean_diff,
            })

            # Generate tokens and compare
            gen_a = model_a.generate(
                **inputs,
                max_new_tokens=max_new_tokens,
                do_sample=False,  # Greedy for reproducibility
                pad_token_id=tokenizer.pad_token_id,
            )
            gen_b = model_b.generate(
                **inputs,
                max_new_tokens=max_new_tokens,
                do_sample=False,
                pad_token_id=tokenizer.pad_token_id,
            )

            # Compare generated tokens
            tokens_a = gen_a[0].tolist()
            tokens_b = gen_b[0].tolist()
            min_len = min(len(tokens_a), len(tokens_b))
            matches = sum(1 for i in range(min_len) if tokens_a[i] == tokens_b[i])
            match_rate = matches / min_len if min_len > 0 else 0

            results["token_matches"].append({
                "prompt": prompt,
                "match_rate": match_rate,
                "matched": matches,
                "total": min_len,
            })

            # Decode for inspection
            text_a = tokenizer.decode(gen_a[0], skip_special_tokens=True)
            text_b = tokenizer.decode(gen_b[0], skip_special_tokens=True)

            results["output_samples"].append({
                "prompt": prompt,
                "output_a": text_a,
                "output_b": text_b,
                "identical": text_a == text_b,
            })

        # Cleanup models
        del model_a, model_b
        gc.collect()
        if torch.cuda.is_available():
            torch.cuda.empty_cache()

        # Aggregate results
        all_max_diffs = [d["max_diff"] for d in results["logit_diffs"]]
        all_mean_diffs = [d["mean_diff"] for d in results["logit_diffs"]]
        all_match_rates = [m["match_rate"] for m in results["token_matches"]]

        results["logit_max_diff"] = max(all_max_diffs)
        results["logit_mean_diff"] = sum(all_mean_diffs) / len(all_mean_diffs)
        results["token_match_rate"] = sum(all_match_rates) / len(all_match_rates)
        results["all_outputs_identical"] = all(s["identical"] for s in results["output_samples"])

        # Equivalence check: logit diff should be small for numerical equivalence
        # With FP16 quantized models, we expect some small differences
        results["equivalent"] = results["logit_max_diff"] < 1.0 and results["token_match_rate"] == 1.0
        results["status"] = "success"

    except Exception as e:
        import traceback
```
```python
# Equivalence check: logit diff should be small for numerical equivalence
# With FP16 quantized models, we expect some small differences
results["equivalent"] = results["logit_max_diff"] < 1.0 and results["token_match_rate"] == 1.0
```

The condition for `equivalent` requires `token_match_rate == 1.0`. However, the README.md suggests that for W4A16 quantization a perfect token match is not always expected, which could be confusing. Consider relaxing this condition or clarifying the definition of 'equivalent' in this context. The threshold of 1.0 for `logit_max_diff` also seems arbitrary and could benefit from justification or being made configurable.
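A sketch of the configurable check this comment suggests (the function name, default thresholds, and parameter names are hypothetical):

```python
def is_equivalent(logit_max_diff: float, token_match_rate: float,
                  logit_tol: float = 1.0, min_match_rate: float = 1.0) -> bool:
    # Relax min_match_rate below 1.0 for schemes like W4A16 where a
    # perfect greedy-token match is not always expected; keep the
    # logit tolerance explicit so it can be justified per benchmark.
    return logit_max_diff < logit_tol and token_match_rate >= min_match_rate
```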
```python
quantize_fn = (
    quantize_weight_optimized
    if self.enable_torch_compile
    else quantize_weight
)
```

```python
# Storage-size instrumentation for memory spike analysis
# Only log once per run (via env var flag)
import os
```
This PR is ready for review. Could a maintainer please add the ready label?
Hey, this looks good! I'd say for this to land there are a few final steps.
Let me know if you need help doing the evaluation.
HDCharles left a comment

Looks good, see comment.
Great feedback. I will look at this later in the week to close it out.
Another point: we should probably extract some/most of the functionality into helper functions so we don't have the exact same code duplicated across `quantize_weight` and `quantize_weight_optimized`. This will be a real pain to maintain otherwise.
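One way to structure that refactor, sketched with toy stand-ins: the shared per-block math lives in a single helper, and both entry points stay thin. The names `_process_block_core` and `compile_fn`, and the quantization math itself, are illustrative, not the PR's actual code.

```python
def _process_block_core(block, scale):
    # single source of truth for the per-block quantization math
    return [round(x / scale) * scale for x in block]

def quantize_weight(blocks, scale):
    # eager path: just call the shared core per block
    return [_process_block_core(b, scale) for b in blocks]

def quantize_weight_optimized(blocks, scale, compile_fn=None):
    # compiled path: wrap the same core once (e.g. with torch.compile),
    # so there is no duplicated math to keep in sync between paths
    core = compile_fn(_process_block_core) if compile_fn else _process_block_core
    return [core(b, scale) for b in blocks]
```

With this shape, a fix to the core math automatically applies to both paths, which addresses the maintenance concern above.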
…torch_compile flag

- Restore original `_grid_search_mse` with early stopping + `patch_attr` (non-compiled path)
- Add `_grid_search_mse_compiled` for the torch.compile-compatible path
- Add `enable_torch_compile` flag to `observer_kwargs` (default False)
- Add `_call_grid_search` helper to reduce code duplication in observer classes
- Follows GPTQ PR vllm-project#2320 pattern: flag-based compiled/non-compiled path selection

Signed-off-by: Jaewoo Kim <pewpewplay315@gmail.com>
This pull request has merge conflicts that must be resolved before it can be merged.
For the API, let's expand https://github.com/vllm-project/llm-compressor/pull/2384/changes#diff-8b25e0e7dfb3c229926f751ac7d9dff4784e3e6692cc1038b050e0c1cb21792eR25 and just make this `enable_torch_compile`.
Summary

Benchmark and validate `torch.compile` optimization for the GPTQ quantization algorithm.

Addresses #1496
Supersedes #1561

What was done

- Add `enable_torch_compile` flag to `GPTQModifier`
- Add `@torch.compile(dynamic=True)`-decorated `_process_block` function
- Add `quantize_weight_optimized` compiled code path

Benchmark Results (Qwen2.5-3B on V100)
Validation (TinyLlama-1.1B)