feat: add distributed weight-parallel support to AWQ modifier #2442
NJX-njx wants to merge 1 commit into vllm-project:main from
Conversation
Add data-parallel distributed support to AWQModifier so that multi-GPU calibration produces identical results on every rank without needing an explicit broadcast step.

Changes:
- Import `is_distributed()` from compressed_tensors and `torch.distributed`
- Add `_reduce_activation_means()`: all-reduces cached activation mean/count pairs across DP ranks before the grid search, ensuring every rank uses globally consistent activation statistics
- Update `_compute_loss()`: all-reduces MSE loss and element counts across ranks so the grid search converges to the same `best_scales` on every rank
- Call `_reduce_activation_means()` at the start of `_apply_smoothing()` when running in a distributed context

Design follows RFC vllm-project#2180: since the AWQ grid search is lightweight, duplicating it on every rank (with synchronized inputs) is cheaper than sharding work and broadcasting results (as done in GPTQ).

Closes vllm-project#2219
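The correctness of the mean/count approach rests on a simple identity: per-rank running means can be merged exactly into the global mean by weighting each with its sample count. A minimal illustration (helper name and values are made up, not from the PR):

```python
def merge_means(pairs):
    """pairs: list of (local_mean, local_count) per rank.

    Returns (global_mean, total_count). The weighted combination
    sum(mean_r * count_r) / sum(count_r) recovers the exact mean over
    the union of all ranks' samples.
    """
    total = sum(count for _, count in pairs)
    if total == 0:
        return 0.0, 0
    global_mean = sum(mean * count for mean, count in pairs) / total
    return global_mean, total
```

This is why the PR all-reduces `mean * count` (the local sum) and `count` separately rather than averaging the means directly, which would be wrong when ranks see different numbers of samples.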
👋 Hi! Thank you for contributing to llm-compressor. Please add the ready label when the PR is ready for review. Note: this is required to complete the testing suite; please only add the label once the PR is code complete and local testing has been performed.
Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request enhances the AWQModifier by integrating distributed weight-parallel support. The core idea is to enable multi-GPU calibration to produce identical smoothing results on every rank by synchronizing activation statistics and loss calculations across distributed processes. This design avoids the overhead of explicit weight broadcasting, making the quantization process more efficient in distributed environments.
Code Review
This pull request introduces distributed support for the AWQ modifier, enabling consistent multi-GPU calibration. However, the implementation lacks proper synchronization of the mappings being processed across different ranks, which can lead to inconsistent calls to collective operations and potential deadlocks in distributed data-parallel settings, especially with sparse models like MoE. Additionally, there are potential division-by-zero or NaN issues if certain modules are not activated during calibration or when loss masks are used extensively, which could lead to errors or model corruption.
```python
for name, (mean, count) in self._smooth_activation_means.items():
    device = mean.device
    # Recover the local sum from the running mean
    local_sum = mean * count
    count_tensor = torch.tensor(
        [count], dtype=torch.int64, device=device
    )

    dist.all_reduce(local_sum, op=dist.ReduceOp.SUM)
    dist.all_reduce(count_tensor, op=dist.ReduceOp.SUM)
```
The _reduce_activation_means method iterates over the keys of self._smooth_activation_means and performs collective all_reduce operations. However, self._smooth_activation_means is populated dynamically during the calibration phase based on which layers are activated by the input data. In a distributed data-parallel setting, different ranks process different batches of data. In sparse models like Mixture of Experts (MoE), it is highly likely that certain experts or layers are activated on some ranks but not others. This results in inconsistent sets of keys in self._smooth_activation_means across ranks. When dist.all_reduce is called inconsistently (i.e., some ranks call it while others do not), it leads to a permanent hang (deadlock) of the distributed process.
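One way to avoid the deadlock (a hedged sketch, not the PR's code; the helper name `reduce_activation_means_safe` and the dict layout are illustrative) is to first agree on the union of keys and tensor shapes across ranks, so every rank issues the same collectives in the same order, contributing zeros for modules it never saw activated:

```python
import torch
import torch.distributed as dist

def reduce_activation_means_safe(local_means: dict) -> dict:
    """local_means maps module name -> (running_mean_tensor, sample_count)."""
    world = dist.get_world_size()
    # 1) Gather key -> shape metadata so all ranks agree on the
    #    set and order of reductions
    local_meta = {k: tuple(v[0].shape) for k, v in local_means.items()}
    gathered = [None] * world
    dist.all_gather_object(gathered, local_meta)
    union: dict = {}
    for meta in gathered:
        union.update(meta)

    # 2) Every rank participates in every reduction; ranks that never
    #    activated a module contribute a zero sum and a zero count
    reduced = {}
    for name in sorted(union):
        if name in local_means:
            mean, count = local_means[name]
            local_sum = (mean * count).to(torch.float64)
        else:
            local_sum = torch.zeros(union[name], dtype=torch.float64)
            count = 0
        count_t = torch.tensor([count], dtype=torch.int64)
        dist.all_reduce(local_sum, op=dist.ReduceOp.SUM)
        dist.all_reduce(count_t, op=dist.ReduceOp.SUM)
        total = count_t.item()
        if total > 0:  # skip modules no rank ever activated
            reduced[name] = (local_sum / total, total)
    return reduced
```

The `all_gather_object` step adds one extra collective per reduction pass, but it guarantees the subsequent `all_reduce` calls are symmetric even for sparsely-activated MoE experts.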
```python
    dist.all_reduce(count_tensor, op=dist.ReduceOp.SUM)

    total_count = count_tensor.item()
    global_mean = local_sum / total_count
```
There's a potential for a division-by-zero error if total_count is 0. This can happen if a module (like an expert in a MoE model) is not activated by any calibration samples across all distributed ranks. This would lead to NaN values for global_mean, which will cause issues later.
To prevent this, you should handle the case where total_count is zero. When total_count is 0, the global sum of activations (local_sum after all-reduce) will also be 0, so setting global_mean to a tensor of zeros is a safe approach.
```diff
-global_mean = local_sum / total_count
+global_mean = local_sum / total_count if total_count > 0 else torch.zeros_like(local_sum)
```
```python
)
dist.all_reduce(loss_t, op=dist.ReduceOp.SUM)
dist.all_reduce(count_t, op=dist.ReduceOp.SUM)
return (loss_t.item() / count_t.item())
```
This line is susceptible to a division-by-zero error if `count_t.item()` is 0, which can occur if a particular mapping or expert is not activated by any sample across all ranks in the distributed group, or if all tokens are masked out by `loss_mask`. A `ZeroDivisionError` here crashes the run, while propagating `NaN` values can silently corrupt the model. To prevent this, handle the zero-count case explicitly and return 0.0, since the total loss is also zero in that situation.
```diff
-return (loss_t.item() / count_t.item())
+return (loss_t.item() / count_t.item()) if count_t.item() > 0 else 0.0
```
Pull request overview
Adds data-parallel distributed support to AWQModifier so that multi-GPU calibration yields identical smoothing scales on every rank (without an explicit post-step broadcast), aligning with the distributed calibration/optimization approach described in RFC #2180.
Changes:
- Add distributed detection (`is_distributed`) and `torch.distributed` usage in AWQ.
- Introduce `_reduce_activation_means()` to all-reduce cached activation statistics across ranks before grid search.
- All-reduce MSE loss (and element counts) during the AWQ grid search so each rank sees the same loss landscape.
Comments suppressed due to low confidence (2)
src/llmcompressor/modifiers/awq/base.py:552
- New distributed behavior is introduced here (global reduction of activation means before the AWQ grid search), but the existing AWQ unit tests don't cover distributed execution. Adding a small test that initializes a 2-rank process group (e.g., gloo) and asserts `_reduce_activation_means()` produces identical means on both ranks would help prevent regressions.
```python
# ── Distributed: all-reduce activation means across DP ranks ──
# Each rank has computed activation means from its local data
# partition. We average them so that every rank uses identical
# statistics (and therefore computes the same best scales).
if is_distributed():
    self._reduce_activation_means()
```
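The suggested 2-rank gloo regression test could look roughly like this (a sketch under stated assumptions: the worker reproduces the PR's sum/count reduction inline rather than importing the modifier, and the address/port are arbitrary):

```python
import torch
import torch.distributed as dist
import torch.multiprocessing as mp

def _rank_worker(rank: int, world_size: int, results):
    dist.init_process_group(
        backend="gloo",
        init_method="tcp://127.0.0.1:29612",
        rank=rank,
        world_size=world_size,
    )
    # Each rank starts from different local statistics
    mean = torch.full((4,), float(rank + 1))
    count = 10 * (rank + 1)
    # Same sum/count reduction the PR performs
    local_sum = mean * count
    count_t = torch.tensor([count], dtype=torch.int64)
    dist.all_reduce(local_sum, op=dist.ReduceOp.SUM)
    dist.all_reduce(count_t, op=dist.ReduceOp.SUM)
    results[rank] = local_sum / count_t.item()
    dist.destroy_process_group()

def run_two_rank_demo() -> dict:
    manager = mp.Manager()
    results = manager.dict()
    mp.spawn(_rank_worker, args=(2, results), nprocs=2, join=True)
    return dict(results)
```

Both ranks should end up with the same global mean (here `(1*10 + 2*20) / (10 + 20)` per element), which is exactly the invariant the PR relies on.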
src/llmcompressor/modifiers/awq/base.py:884

- `loss` is already a scalar tensor after the first `mse_loss` accumulation; wrapping it with `torch.tensor([loss], ...)` forces an extra copy/conversion (and may emit the "copy construct from a tensor" warning) inside the grid-search loop. Use the existing tensor directly (e.g., detach/cast to the desired dtype) rather than constructing a new tensor each call to reduce overhead in distributed runs.
```python
device = fp16_outputs[0].device if fp16_outputs else "cpu"
loss_t = torch.tensor([loss], dtype=torch.float64, device=device)
count_t = torch.tensor(
```
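A hedged sketch of what the reviewer is suggesting (the helper name `reduce_mean_loss` is illustrative, and the zero-count guard from the other review comment is folded in):

```python
import torch
import torch.distributed as dist

def reduce_mean_loss(loss: torch.Tensor, count: int) -> float:
    """`loss` is the 0-dim tensor accumulated from mse_loss calls."""
    # detach + cast reuses the existing tensor's storage path and avoids
    # the copy-construct warning from torch.tensor([loss], ...)
    loss_t = loss.detach().to(torch.float64)
    count_t = torch.tensor(count, dtype=torch.int64, device=loss_t.device)
    if dist.is_available() and dist.is_initialized():
        dist.all_reduce(loss_t, op=dist.ReduceOp.SUM)
        dist.all_reduce(count_t, op=dist.ReduceOp.SUM)
    total = count_t.item()
    return loss_t.item() / total if total > 0 else 0.0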
```python
    device = mean.device
    # Recover the local sum from the running mean
    local_sum = mean * count
    count_tensor = torch.tensor(
        [count], dtype=torch.int64, device=device
    )

    dist.all_reduce(local_sum, op=dist.ReduceOp.SUM)
    dist.all_reduce(count_tensor, op=dist.ReduceOp.SUM)

    total_count = count_tensor.item()
    global_mean = local_sum / total_count
```
_smooth_activation_means entries are stored on CPU (see the hook storing act_mean.cpu()), so mean.device will typically be cpu here. dist.all_reduce on CPU tensors only works with a CPU-capable backend (e.g., gloo); if the distributed context is initialized with NCCL (common for multi-GPU), this will raise at runtime. Consider moving local_sum/count_tensor to a CUDA device for the reduction (or using a dedicated gloo process group for CPU reductions), and ensure the dtype is supported for the chosen backend (e.g., cast sums to fp32/fp64 before all-reduce).
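The "dedicated gloo process group" option mentioned above can be sketched as follows (names are illustrative; in practice the group should be created once, e.g. in the modifier's initialization, not per reduction):

```python
import torch
import torch.distributed as dist

def all_reduce_cpu_sum(t: torch.Tensor, cpu_group=None) -> torch.Tensor:
    """All-reduce a CPU tensor through a gloo group, independent of the
    default (possibly NCCL) backend of the process group."""
    if cpu_group is None:
        # new_group accepts an explicit backend, so CPU reductions can
        # use gloo even when the default group was initialized with NCCL
        cpu_group = dist.new_group(backend="gloo")
    dist.all_reduce(t, op=dist.ReduceOp.SUM, group=cpu_group)
    return t
```

This keeps the cached statistics on CPU (as the hooks store them) and avoids shuttling every mean tensor to a GPU just for the reduction.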
```diff
-device = mean.device
-# Recover the local sum from the running mean
-local_sum = mean * count
-count_tensor = torch.tensor(
-    [count], dtype=torch.int64, device=device
-)
-dist.all_reduce(local_sum, op=dist.ReduceOp.SUM)
-dist.all_reduce(count_tensor, op=dist.ReduceOp.SUM)
-total_count = count_tensor.item()
-global_mean = local_sum / total_count
+orig_device = mean.device
+orig_dtype = mean.dtype
+# Recover the local sum from the running mean
+local_sum = mean * count
+count_tensor = torch.tensor(
+    [count], dtype=torch.int64, device=orig_device
+)
+backend = dist.get_backend() if dist.is_initialized() else None
+# For NCCL, tensors must be on CUDA; optionally upcast sums to fp32
+if backend in (dist.Backend.NCCL, "nccl"):
+    if not local_sum.is_cuda:
+        if not torch.cuda.is_available():
+            raise RuntimeError(
+                "NCCL backend requires CUDA tensors for all_reduce, "
+                "but CUDA is not available."
+            )
+        reduce_device = torch.device("cuda", torch.cuda.current_device())
+        local_sum = local_sum.to(reduce_device, dtype=torch.float32)
+        count_tensor = count_tensor.to(reduce_device)
+else:
+    # Ensure a supported dtype for reduction
+    local_sum = local_sum.to(dtype=torch.float32)
+# Perform distributed reduction
+dist.all_reduce(local_sum, op=dist.ReduceOp.SUM)
+dist.all_reduce(count_tensor, op=dist.ReduceOp.SUM)
+total_count = int(count_tensor.item())
+# Compute global mean and move back to original device/dtype
+global_mean = (local_sum / total_count).to(dtype=orig_dtype, device=orig_device)
```
Summary
Add data-parallel distributed support to AWQModifier so that multi-GPU calibration produces identical smoothing results on every rank without needing an explicit weight broadcast step.
Closes #2219
Design (per RFC #2180)
AWQ's grid search is computationally lightweight compared to GPTQ's Hessian-based quantization. Rather than sharding work across ranks and broadcasting results (as GPTQ does), it is cheaper to duplicate the grid search on every rank while synchronizing its inputs (activation statistics and reduced losses), so each rank independently arrives at identical scales.
Changes
Testing
This change is designed to be a no-op in single-GPU / non-distributed contexts: `is_distributed()` short-circuits all new code paths. Multi-GPU correctness can be verified by comparing per-layer `best_scales` between ranks (they should be identical).
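The cross-rank comparison described above could be sketched like this (the helper name `assert_scales_identical` is hypothetical; it gathers each rank's per-layer scales and checks them against rank 0's):

```python
import torch
import torch.distributed as dist

def assert_scales_identical(local_scales: dict) -> None:
    """local_scales maps layer name -> best_scales tensor on this rank."""
    world = dist.get_world_size()
    gathered = [None] * world
    dist.all_gather_object(gathered, local_scales)
    ref = gathered[0]
    for rank, scales in enumerate(gathered[1:], start=1):
        assert scales.keys() == ref.keys(), f"layer set differs on rank {rank}"
        for name, t in scales.items():
            assert torch.equal(t.cpu(), ref[name].cpu()), (
                f"best_scales for {name} differ on rank {rank}"
            )
```

`torch.equal` (exact equality) rather than `allclose` is deliberate here: the design claims bitwise-identical grid-search results on every rank, not merely close ones.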