
feat: add distributed weight-parallel support to AWQ modifier #2442

Closed

NJX-njx wants to merge 1 commit into vllm-project:main from NJX-njx:feat/awq-distributed-weight-parallel


Conversation


@NJX-njx NJX-njx commented Mar 4, 2026

Summary

Add data-parallel distributed support to AWQModifier so that multi-GPU calibration produces identical smoothing results on every rank without needing an explicit weight broadcast step.

Closes #2219

Design (per RFC #2180)

AWQ's grid search is computationally lightweight compared to GPTQ's Hessian-based quantization. Rather than sharding work across ranks and broadcasting results (as GPTQ does), it is cheaper to:

  1. All-reduce activation means across data-parallel ranks before the grid search begins, so every rank starts with globally-consistent activation statistics.
  2. All-reduce MSE loss (and element counts) during the grid search, so every rank's loss landscape is identical and all ranks independently converge to the same best_scales.
  3. Skip the broadcast: since every rank computes the same scales, there is no need to designate a single rank and broadcast weights.
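The reduction in step 1 can be illustrated without a process group: summing each rank's mean * count alongside its count, then dividing, reproduces exactly what the all-reduce computes. A minimal sketch (merge_running_means is a hypothetical helper, not part of the PR):

```python
import torch

def merge_running_means(rank_stats):
    # Combine per-rank (mean, count) pairs into one global mean.
    # All-reducing sum = mean * count and the counts, then dividing,
    # yields the same value on every rank.
    total_sum = sum(mean * count for mean, count in rank_stats)
    total_count = sum(count for _, count in rank_stats)
    return total_sum / total_count

# Two hypothetical ranks with different local statistics:
rank0 = (torch.tensor([1.0, 3.0]), 4)  # mean over 4 samples
rank1 = (torch.tensor([2.0, 5.0]), 6)  # mean over 6 samples
global_mean = merge_running_means([rank0, rank1])
```

Because every rank observes the same reduced sums and counts, each derives the same global mean independently, which is what makes the broadcast in step 3 unnecessary.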

Changes

Location                      What
Imports                       Added is_distributed() from compressed_tensors and torch.distributed
_reduce_activation_means()    New method: converts running (mean, count) pairs back to sums, all-reduces across ranks, re-derives global means
_apply_smoothing()            Calls _reduce_activation_means() when is_distributed(), before entering the grid-search loop
_compute_loss()               All-reduces local MSE sums and element counts across ranks; returns the global average loss

Testing

This change is designed to be a no-op in single-GPU / non-distributed contexts: is_distributed() short-circuits all new code paths. Multi-GPU correctness can be verified by comparing per-layer best_scales between ranks (they should be identical).
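The short-circuit behavior can be sketched as follows. reduce_running_mean is a hypothetical stand-in for the PR's helper, guarded so that without an initialized process group it simply returns the local statistics unchanged:

```python
import torch
import torch.distributed as dist

def reduce_running_mean(mean: torch.Tensor, count: int):
    # Recover the local sum, all-reduce it when a process group exists,
    # then re-derive the mean. Without an initialized process group the
    # function returns the local statistics unchanged (the single-GPU
    # no-op path).
    local_sum = mean * count
    count_t = torch.tensor([count], dtype=torch.int64)
    if dist.is_available() and dist.is_initialized():
        dist.all_reduce(local_sum, op=dist.ReduceOp.SUM)
        dist.all_reduce(count_t, op=dist.ReduceOp.SUM)
    total = int(count_t.item())
    if total == 0:
        return torch.zeros_like(mean), 0
    return local_sum / total, total

# In a non-distributed run the statistics pass through untouched:
mean, count = reduce_running_mean(torch.tensor([2.0, 4.0]), 3)
```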

Add data-parallel distributed support to AWQModifier so that multi-GPU
calibration produces identical results on every rank without needing an
explicit broadcast step.

Changes:
- Import is_distributed() from compressed_tensors and torch.distributed
- Add _reduce_activation_means(): all-reduces cached activation mean/count
  pairs across DP ranks before the grid search, ensuring every rank uses
  globally-consistent activation statistics
- Update _compute_loss(): all-reduces MSE loss and element counts across
  ranks so the grid search converges to the same best_scales on every rank
- Call _reduce_activation_means() at the start of _apply_smoothing() when
  running in a distributed context

Design follows RFC vllm-project#2180: since the AWQ grid search is lightweight,
duplicating it on every rank (with synchronized inputs) is cheaper than
sharding work and broadcasting results (as done in GPTQ).

Closes vllm-project#2219
Copilot AI review requested due to automatic review settings March 4, 2026 11:43


github-actions bot commented Mar 4, 2026

👋 Hi! Thank you for contributing to llm-compressor. Please add the ready label when the PR is ready for review.

Note: This is required to complete the testing suite, please only add the label once the PR is code complete and local testing has been performed.

@gemini-code-assist

Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request enhances the AWQModifier by integrating distributed weight-parallel support. The core idea is to enable multi-GPU calibration to produce identical smoothing results on every rank by synchronizing activation statistics and loss calculations across distributed processes. This design avoids the overhead of explicit weight broadcasting, making the quantization process more efficient in distributed environments.

Highlights

  • Distributed Support for AWQModifier: Implemented data-parallel distributed support for the AWQModifier to ensure consistent smoothing results across multi-GPU calibration without explicit weight broadcasting.
  • Activation Mean Reduction: Introduced an all-reduce mechanism for activation means across data-parallel ranks, ensuring globally-consistent activation statistics before the grid search.
  • MSE Loss All-Reduction: Added all-reduction for MSE loss and element counts during the grid search, allowing each rank to independently converge to the same optimal scales.
  • Broadcast Elimination: Eliminated the need for an explicit weight broadcast step, as all ranks compute identical scales due to the distributed synchronization of activation means and loss.


Changelog
  • src/llmcompressor/modifiers/awq/base.py
    • Imported is_distributed from compressed_tensors.offload.dist_utils.
    • Imported torch.distributed as dist.
    • Added a new private method _reduce_activation_means to perform all-reduction of cached activation means across data-parallel ranks.
    • Modified _apply_smoothing to call _reduce_activation_means conditionally when distributed.
    • Updated _compute_loss to all-reduce local MSE sums and element counts across ranks, returning the global average loss in a distributed setting.


@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces distributed support for the AWQ modifier, enabling consistent multi-GPU calibration. However, the implementation lacks proper synchronization of the mappings being processed across different ranks, which can lead to inconsistent calls to collective operations and potential deadlocks in distributed data-parallel settings, especially with sparse models like MoE. Additionally, there are potential division-by-zero or NaN issues if certain modules are not activated during calibration or when loss masks are used extensively, which could lead to errors or model corruption.

Comment on lines +522 to +531
for name, (mean, count) in self._smooth_activation_means.items():
    device = mean.device
    # Recover the local sum from the running mean
    local_sum = mean * count
    count_tensor = torch.tensor(
        [count], dtype=torch.int64, device=device
    )

    dist.all_reduce(local_sum, op=dist.ReduceOp.SUM)
    dist.all_reduce(count_tensor, op=dist.ReduceOp.SUM)

Severity: high

The _reduce_activation_means method iterates over the keys of self._smooth_activation_means and performs collective all_reduce operations. However, self._smooth_activation_means is populated dynamically during the calibration phase based on which layers are activated by the input data. In a distributed data-parallel setting, different ranks process different batches of data. In sparse models like Mixture of Experts (MoE), it is highly likely that certain experts or layers are activated on some ranks but not others. This results in inconsistent sets of keys in self._smooth_activation_means across ranks. When dist.all_reduce is called inconsistently (i.e., some ranks call it while others do not), it leads to a permanent hang (deadlock) of the distributed process.

dist.all_reduce(count_tensor, op=dist.ReduceOp.SUM)

total_count = count_tensor.item()
global_mean = local_sum / total_count

Severity: high

There's a potential for a division-by-zero error if total_count is 0. This can happen if a module (like an expert in a MoE model) is not activated by any calibration samples across all distributed ranks. This would lead to NaN values for global_mean, which will cause issues later.

To prevent this, you should handle the case where total_count is zero. When total_count is 0, the global sum of activations (local_sum after all-reduce) will also be 0, so setting global_mean to a tensor of zeros is a safe approach.

Suggested change
global_mean = local_sum / total_count
global_mean = local_sum / total_count if total_count > 0 else torch.zeros_like(local_sum)

)
dist.all_reduce(loss_t, op=dist.ReduceOp.SUM)
dist.all_reduce(count_t, op=dist.ReduceOp.SUM)
return (loss_t.item() / count_t.item())

Severity: medium

This line is susceptible to a division-by-zero error if count_t.item() is 0, which can occur if a particular mapping or expert is not activated by any sample across all ranks in the distributed group, or if all tokens are masked out by loss_mask. This vulnerability can lead to a NaN result in _reduce_activation_means or a ZeroDivisionError in _compute_loss. A ZeroDivisionError causes a crash (DoS), while NaN weights can corrupt the model. To prevent this, a check should be added to handle the zero count case, returning 0.0 as the total loss would also be zero.

Suggested change
return (loss_t.item() / count_t.item())
return (loss_t.item() / count_t.item()) if count_t.item() > 0 else 0.0
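The guarded reduction amounts to the following arithmetic, sketched here with plain sums standing in for the dist.all_reduce calls so it runs without a process group (global_average_loss is a hypothetical name):

```python
def global_average_loss(local_losses, local_counts):
    # Equivalent arithmetic to all-reducing the per-rank MSE sums and
    # element counts with ReduceOp.SUM and then dividing. Returns 0.0
    # when every element was masked out on every rank, avoiding the
    # ZeroDivisionError flagged in the review.
    loss_sum = sum(local_losses)
    count_sum = sum(local_counts)
    return loss_sum / count_sum if count_sum > 0 else 0.0
```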

Copilot AI left a comment


Pull request overview

Adds data-parallel distributed support to AWQModifier so that multi-GPU calibration yields identical smoothing scales on every rank (without an explicit post-step broadcast), aligning with the distributed calibration/optimization approach described in RFC #2180.

Changes:

  • Add distributed detection (is_distributed) and torch.distributed usage in AWQ.
  • Introduce _reduce_activation_means() to all-reduce cached activation statistics across ranks before grid search.
  • All-reduce MSE loss (and element counts) during the AWQ grid search so each rank sees the same loss landscape.
Comments suppressed due to low confidence (2)

src/llmcompressor/modifiers/awq/base.py:552

  • New distributed behavior is introduced here (global reduction of activation means before the AWQ grid search), but the existing AWQ unit tests don't cover distributed execution. Adding a small test that initializes a 2-rank process group (e.g., gloo) and asserts _reduce_activation_means() produces identical means on both ranks would help prevent regressions.
        # ── Distributed: all-reduce activation means across DP ranks ──
        # Each rank has computed activation means from its local data
        # partition. We average them so that every rank uses identical
        # statistics (and therefore computes the same best scales).
        if is_distributed():
            self._reduce_activation_means()
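Such a 2-rank gloo test might look like the sketch below. All names, the file-store init, and the fork start method are assumptions for illustration; reduce_mean mirrors the sum/count all-reduce described in the PR, and both ranks should derive the identical global mean:

```python
import os
import tempfile

import torch
import torch.distributed as dist
import torch.multiprocessing as mp

def reduce_mean(rank, world_size, init_file, means, counts, out):
    # Each rank joins a CPU-only gloo group, then all-reduces its local
    # (sum, count) pair, mirroring the arithmetic described in the PR.
    dist.init_process_group(
        "gloo",
        init_method=f"file://{init_file}",
        rank=rank,
        world_size=world_size,
    )
    local_sum = means[rank] * counts[rank]
    count_t = torch.tensor([counts[rank]], dtype=torch.int64)
    dist.all_reduce(local_sum, op=dist.ReduceOp.SUM)
    dist.all_reduce(count_t, op=dist.ReduceOp.SUM)
    out[rank] = local_sum / count_t.item()
    dist.destroy_process_group()

init_file = os.path.join(tempfile.mkdtemp(), "store")
means = [torch.tensor([1.0, 2.0]), torch.tensor([3.0, 6.0])]
counts = [1, 1]
out = mp.Manager().dict()
mp.start_processes(
    reduce_mean,
    args=(2, init_file, means, counts, out),
    nprocs=2,
    join=True,
    start_method="fork",
)
```

A passing run would leave out[0] and out[1] holding the same tensor, which is the invariant the PR relies on to skip the weight broadcast.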

src/llmcompressor/modifiers/awq/base.py:884

  • loss is already a scalar tensor after the first mse_loss accumulation; wrapping it with torch.tensor([loss], ...) forces an extra copy/conversion (and may emit the "copy construct from a tensor" warning) inside the grid-search loop. Use the existing tensor directly (e.g., detach/cast to the desired dtype) rather than constructing a new tensor each call to reduce overhead in distributed runs.
            device = fp16_outputs[0].device if fp16_outputs else "cpu"
            loss_t = torch.tensor([loss], dtype=torch.float64, device=device)
            count_t = torch.tensor(


Comment on lines +523 to +534
device = mean.device
# Recover the local sum from the running mean
local_sum = mean * count
count_tensor = torch.tensor(
    [count], dtype=torch.int64, device=device
)

dist.all_reduce(local_sum, op=dist.ReduceOp.SUM)
dist.all_reduce(count_tensor, op=dist.ReduceOp.SUM)

total_count = count_tensor.item()
global_mean = local_sum / total_count
Copilot AI Mar 4, 2026


_smooth_activation_means entries are stored on CPU (see the hook storing act_mean.cpu()), so mean.device will typically be cpu here. dist.all_reduce on CPU tensors only works with a CPU-capable backend (e.g., gloo); if the distributed context is initialized with NCCL (common for multi-GPU), this will raise at runtime. Consider moving local_sum/count_tensor to a CUDA device for the reduction (or using a dedicated gloo process group for CPU reductions), and ensure the dtype is supported for the chosen backend (e.g., cast sums to fp32/fp64 before all-reduce).

Suggested change

Replace:

    device = mean.device
    # Recover the local sum from the running mean
    local_sum = mean * count
    count_tensor = torch.tensor(
        [count], dtype=torch.int64, device=device
    )
    dist.all_reduce(local_sum, op=dist.ReduceOp.SUM)
    dist.all_reduce(count_tensor, op=dist.ReduceOp.SUM)
    total_count = count_tensor.item()
    global_mean = local_sum / total_count

with:

    orig_device = mean.device
    orig_dtype = mean.dtype
    # Recover the local sum from the running mean
    local_sum = mean * count
    count_tensor = torch.tensor(
        [count], dtype=torch.int64, device=orig_device
    )
    backend = dist.get_backend() if dist.is_initialized() else None
    # For NCCL, tensors must be on CUDA; optionally upcast sums to fp32
    if backend in (dist.Backend.NCCL, "nccl"):
        if not local_sum.is_cuda:
            if not torch.cuda.is_available():
                raise RuntimeError(
                    "NCCL backend requires CUDA tensors for all_reduce, "
                    "but CUDA is not available."
                )
            reduce_device = torch.device("cuda", torch.cuda.current_device())
            local_sum = local_sum.to(reduce_device, dtype=torch.float32)
            count_tensor = count_tensor.to(reduce_device)
    else:
        # Ensure a supported dtype for reduction
        local_sum = local_sum.to(dtype=torch.float32)
    # Perform distributed reduction
    dist.all_reduce(local_sum, op=dist.ReduceOp.SUM)
    dist.all_reduce(count_tensor, op=dist.ReduceOp.SUM)
    total_count = int(count_tensor.item())
    # Compute global mean and move back to original device/dtype
    global_mean = (local_sum / total_count).to(dtype=orig_dtype, device=orig_device)

@dsikka dsikka closed this Mar 4, 2026

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Performance Refactor] Extend modifiers to support weight-parallel optimization - AWQModifier

3 participants