feat: add distributed weight-parallel support to AWQ modifier #2442
NJX-njx wants to merge 1 commit into vllm-project:main from
Conversation
Add data-parallel distributed support to AWQModifier so that multi-GPU calibration produces identical results on every rank without needing an explicit broadcast step.

Changes:
- Import `is_distributed()` from compressed_tensors and `torch.distributed`
- Add `_reduce_activation_means()`: all-reduces cached activation mean/count pairs across DP ranks before the grid search, ensuring every rank uses globally consistent activation statistics
- Update `_compute_loss()`: all-reduces MSE loss and element counts across ranks so the grid search converges to the same `best_scales` on every rank
- Call `_reduce_activation_means()` at the start of `_apply_smoothing()` when running in a distributed context

Design follows RFC vllm-project#2180: since the AWQ grid search is lightweight, duplicating it on every rank (with synchronized inputs) is cheaper than sharding work and broadcasting results (as done in GPTQ).

Closes vllm-project#2219
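The correctness of the mean/count approach rests on a simple identity: per-rank running means can be merged exactly into the global mean by weighting each with its sample count. A minimal illustration (helper name and values are made up, not from the PR):

```python
def merge_means(pairs):
    """pairs: list of (local_mean, local_count) per rank.

    Returns (global_mean, total_count). The weighted combination
    sum(mean_r * count_r) / sum(count_r) recovers the exact mean over
    the union of all ranks' samples.
    """
    total = sum(count for _, count in pairs)
    if total == 0:
        return 0.0, 0
    global_mean = sum(mean * count for mean, count in pairs) / total
    return global_mean, total
```

This is why the PR all-reduces `mean * count` (the local sum) and `count` separately rather than averaging the means directly, which would be wrong when ranks see different numbers of samples.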
👋 Hi! Thank you for contributing to llm-compressor. Please add the ready label when the PR is ready for review. Note: this is required to complete the testing suite; please only add the label once the PR is code complete and local testing has been performed.
Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request enhances the AWQModifier by integrating distributed weight-parallel support. The core idea is to enable multi-GPU calibration to produce identical smoothing results on every rank by synchronizing activation statistics and loss calculations across distributed processes. This design avoids the overhead of explicit weight broadcasting, making the quantization process more efficient in distributed environments.
Code Review
This pull request introduces distributed support for the AWQ modifier, enabling consistent multi-GPU calibration. However, the implementation lacks proper synchronization of the mappings being processed across different ranks, which can lead to inconsistent calls to collective operations and potential deadlocks in distributed data-parallel settings, especially with sparse models like MoE. Additionally, there are potential division-by-zero or NaN issues if certain modules are not activated during calibration or when loss masks are used extensively, which could lead to errors or model corruption.
```python
for name, (mean, count) in self._smooth_activation_means.items():
    device = mean.device
    # Recover the local sum from the running mean
    local_sum = mean * count
    count_tensor = torch.tensor(
        [count], dtype=torch.int64, device=device
    )

    dist.all_reduce(local_sum, op=dist.ReduceOp.SUM)
    dist.all_reduce(count_tensor, op=dist.ReduceOp.SUM)
```
The _reduce_activation_means method iterates over the keys of self._smooth_activation_means and performs collective all_reduce operations. However, self._smooth_activation_means is populated dynamically during the calibration phase based on which layers are activated by the input data. In a distributed data-parallel setting, different ranks process different batches of data. In sparse models like Mixture of Experts (MoE), it is highly likely that certain experts or layers are activated on some ranks but not others. This results in inconsistent sets of keys in self._smooth_activation_means across ranks. When dist.all_reduce is called inconsistently (i.e., some ranks call it while others do not), it leads to a permanent hang (deadlock) of the distributed process.
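One way to avoid the deadlock (a hedged sketch, not the PR's code; the helper name `reduce_activation_means_safe` and the dict layout are illustrative) is to first agree on the union of keys and tensor shapes across ranks, so every rank issues the same collectives in the same order, contributing zeros for modules it never saw activated:

```python
import torch
import torch.distributed as dist

def reduce_activation_means_safe(local_means: dict) -> dict:
    """local_means maps module name -> (running_mean_tensor, sample_count)."""
    world = dist.get_world_size()
    # 1) Gather key -> shape metadata so all ranks agree on the
    #    set and order of reductions
    local_meta = {k: tuple(v[0].shape) for k, v in local_means.items()}
    gathered = [None] * world
    dist.all_gather_object(gathered, local_meta)
    union: dict = {}
    for meta in gathered:
        union.update(meta)

    # 2) Every rank participates in every reduction; ranks that never
    #    activated a module contribute a zero sum and a zero count
    reduced = {}
    for name in sorted(union):
        if name in local_means:
            mean, count = local_means[name]
            local_sum = (mean * count).to(torch.float64)
        else:
            local_sum = torch.zeros(union[name], dtype=torch.float64)
            count = 0
        count_t = torch.tensor([count], dtype=torch.int64)
        dist.all_reduce(local_sum, op=dist.ReduceOp.SUM)
        dist.all_reduce(count_t, op=dist.ReduceOp.SUM)
        total = count_t.item()
        if total > 0:  # skip modules no rank ever activated
            reduced[name] = (local_sum / total, total)
    return reduced
```

The `all_gather_object` step adds one extra collective per reduction pass, but it guarantees the subsequent `all_reduce` calls are symmetric even for sparsely-activated MoE experts.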
```python
    dist.all_reduce(count_tensor, op=dist.ReduceOp.SUM)

    total_count = count_tensor.item()
    global_mean = local_sum / total_count
```
There's a potential for a division-by-zero error if total_count is 0. This can happen if a module (like an expert in a MoE model) is not activated by any calibration samples across all distributed ranks. This would lead to NaN values for global_mean, which will cause issues later.
To prevent this, you should handle the case where total_count is zero. When total_count is 0, the global sum of activations (local_sum after all-reduce) will also be 0, so setting global_mean to a tensor of zeros is a safe approach.
```diff
-global_mean = local_sum / total_count
+global_mean = local_sum / total_count if total_count > 0 else torch.zeros_like(local_sum)
```
```python
)
dist.all_reduce(loss_t, op=dist.ReduceOp.SUM)
dist.all_reduce(count_t, op=dist.ReduceOp.SUM)
return (loss_t.item() / count_t.item())
```
This line is susceptible to a division-by-zero error if `count_t.item()` is 0, which can occur if a particular mapping or expert is not activated by any sample across all ranks in the distributed group, or if all tokens are masked out by `loss_mask`. A `ZeroDivisionError` here crashes the run, while propagating `NaN` values can silently corrupt the model. To prevent this, handle the zero-count case explicitly and return 0.0, since the total loss is also zero in that situation.
```diff
-return (loss_t.item() / count_t.item())
+return (loss_t.item() / count_t.item()) if count_t.item() > 0 else 0.0
```
Pull request overview
Adds data-parallel distributed support to AWQModifier so that multi-GPU calibration yields identical smoothing scales on every rank (without an explicit post-step broadcast), aligning with the distributed calibration/optimization approach described in RFC #2180.
Changes:
- Add distributed detection (`is_distributed`) and `torch.distributed` usage in AWQ.
- Introduce `_reduce_activation_means()` to all-reduce cached activation statistics across ranks before grid search.
- All-reduce MSE loss (and element counts) during the AWQ grid search so each rank sees the same loss landscape.
Comments suppressed due to low confidence (2)
src/llmcompressor/modifiers/awq/base.py:552
- New distributed behavior is introduced here (global reduction of activation means before the AWQ grid search), but the existing AWQ unit tests don't cover distributed execution. Adding a small test that initializes a 2-rank process group (e.g., gloo) and asserts `_reduce_activation_means()` produces identical means on both ranks would help prevent regressions.
```python
# ── Distributed: all-reduce activation means across DP ranks ──
# Each rank has computed activation means from its local data
# partition. We average them so that every rank uses identical
# statistics (and therefore computes the same best scales).
if is_distributed():
    self._reduce_activation_means()
```
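The suggested 2-rank gloo regression test could look roughly like this (a sketch under stated assumptions: the worker reproduces the PR's sum/count reduction inline rather than importing the modifier, and the address/port are arbitrary):

```python
import torch
import torch.distributed as dist
import torch.multiprocessing as mp

def _rank_worker(rank: int, world_size: int, results):
    dist.init_process_group(
        backend="gloo",
        init_method="tcp://127.0.0.1:29612",
        rank=rank,
        world_size=world_size,
    )
    # Each rank starts from different local statistics
    mean = torch.full((4,), float(rank + 1))
    count = 10 * (rank + 1)
    # Same sum/count reduction the PR performs
    local_sum = mean * count
    count_t = torch.tensor([count], dtype=torch.int64)
    dist.all_reduce(local_sum, op=dist.ReduceOp.SUM)
    dist.all_reduce(count_t, op=dist.ReduceOp.SUM)
    results[rank] = local_sum / count_t.item()
    dist.destroy_process_group()

def run_two_rank_demo() -> dict:
    manager = mp.Manager()
    results = manager.dict()
    mp.spawn(_rank_worker, args=(2, results), nprocs=2, join=True)
    return dict(results)
```

Both ranks should end up with the same global mean (here `(1*10 + 2*20) / (10 + 20)` per element), which is exactly the invariant the PR relies on.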
src/llmcompressor/modifiers/awq/base.py:884

- `loss` is already a scalar tensor after the first `mse_loss` accumulation; wrapping it with `torch.tensor([loss], ...)` forces an extra copy/conversion (and may emit the "copy construct from a tensor" warning) inside the grid-search loop. Use the existing tensor directly (e.g., detach/cast to the desired dtype) rather than constructing a new tensor each call to reduce overhead in distributed runs.
```python
device = fp16_outputs[0].device if fp16_outputs else "cpu"
loss_t = torch.tensor([loss], dtype=torch.float64, device=device)
count_t = torch.tensor(
```
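A hedged sketch of what the reviewer is suggesting (the helper name `reduce_mean_loss` is illustrative, and the zero-count guard from the other review comment is folded in):

```python
import torch
import torch.distributed as dist

def reduce_mean_loss(loss: torch.Tensor, count: int) -> float:
    """`loss` is the 0-dim tensor accumulated from mse_loss calls."""
    # detach + cast reuses the existing tensor's storage path and avoids
    # the copy-construct warning from torch.tensor([loss], ...)
    loss_t = loss.detach().to(torch.float64)
    count_t = torch.tensor(count, dtype=torch.int64, device=loss_t.device)
    if dist.is_available() and dist.is_initialized():
        dist.all_reduce(loss_t, op=dist.ReduceOp.SUM)
        dist.all_reduce(count_t, op=dist.ReduceOp.SUM)
    total = count_t.item()
    return loss_t.item() / total if total > 0 else 0.0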
```python
    device = mean.device
    # Recover the local sum from the running mean
    local_sum = mean * count
    count_tensor = torch.tensor(
        [count], dtype=torch.int64, device=device
    )

    dist.all_reduce(local_sum, op=dist.ReduceOp.SUM)
    dist.all_reduce(count_tensor, op=dist.ReduceOp.SUM)

    total_count = count_tensor.item()
    global_mean = local_sum / total_count
```
_smooth_activation_means entries are stored on CPU (see the hook storing act_mean.cpu()), so mean.device will typically be cpu here. dist.all_reduce on CPU tensors only works with a CPU-capable backend (e.g., gloo); if the distributed context is initialized with NCCL (common for multi-GPU), this will raise at runtime. Consider moving local_sum/count_tensor to a CUDA device for the reduction (or using a dedicated gloo process group for CPU reductions), and ensure the dtype is supported for the chosen backend (e.g., cast sums to fp32/fp64 before all-reduce).
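The "dedicated gloo process group" option mentioned above can be sketched as follows (names are illustrative; in practice the group should be created once, e.g. in the modifier's initialization, not per reduction):

```python
import torch
import torch.distributed as dist

def all_reduce_cpu_sum(t: torch.Tensor, cpu_group=None) -> torch.Tensor:
    """All-reduce a CPU tensor through a gloo group, independent of the
    default (possibly NCCL) backend of the process group."""
    if cpu_group is None:
        # new_group accepts an explicit backend, so CPU reductions can
        # use gloo even when the default group was initialized with NCCL
        cpu_group = dist.new_group(backend="gloo")
    dist.all_reduce(t, op=dist.ReduceOp.SUM, group=cpu_group)
    return t
```

This keeps the cached statistics on CPU (as the hooks store them) and avoids shuttling every mean tensor to a GPU just for the reduction.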
```diff
-device = mean.device
-# Recover the local sum from the running mean
-local_sum = mean * count
-count_tensor = torch.tensor(
-    [count], dtype=torch.int64, device=device
-)
-dist.all_reduce(local_sum, op=dist.ReduceOp.SUM)
-dist.all_reduce(count_tensor, op=dist.ReduceOp.SUM)
-total_count = count_tensor.item()
-global_mean = local_sum / total_count
+orig_device = mean.device
+orig_dtype = mean.dtype
+# Recover the local sum from the running mean
+local_sum = mean * count
+count_tensor = torch.tensor(
+    [count], dtype=torch.int64, device=orig_device
+)
+backend = dist.get_backend() if dist.is_initialized() else None
+# For NCCL, tensors must be on CUDA; optionally upcast sums to fp32
+if backend in (dist.Backend.NCCL, "nccl"):
+    if not local_sum.is_cuda:
+        if not torch.cuda.is_available():
+            raise RuntimeError(
+                "NCCL backend requires CUDA tensors for all_reduce, "
+                "but CUDA is not available."
+            )
+        reduce_device = torch.device("cuda", torch.cuda.current_device())
+        local_sum = local_sum.to(reduce_device, dtype=torch.float32)
+        count_tensor = count_tensor.to(reduce_device)
+else:
+    # Ensure a supported dtype for reduction
+    local_sum = local_sum.to(dtype=torch.float32)
+# Perform distributed reduction
+dist.all_reduce(local_sum, op=dist.ReduceOp.SUM)
+dist.all_reduce(count_tensor, op=dist.ReduceOp.SUM)
+total_count = int(count_tensor.item())
+# Compute global mean and move back to original device/dtype
+global_mean = (local_sum / total_count).to(dtype=orig_dtype, device=orig_device)
```
Summary
Add data-parallel distributed support to AWQModifier so that multi-GPU calibration produces identical smoothing results on every rank without needing an explicit weight broadcast step.
Closes #2219
Design (per RFC #2180)
AWQ's grid search is computationally lightweight compared to GPTQ's Hessian-based quantization. Rather than sharding work across ranks and broadcasting results (as GPTQ does), it is cheaper to duplicate the grid search on every rank while synchronizing its inputs (activation statistics and reduced losses), so each rank independently arrives at identical scales.
Changes
Testing
This change is designed to be a no-op in single-GPU / non-distributed contexts: `is_distributed()` short-circuits all new code paths. Multi-GPU correctness can be verified by comparing per-layer `best_scales` between ranks (they should be identical).
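The cross-rank comparison described above could be sketched like this (the helper name `assert_scales_identical` is hypothetical; it gathers each rank's per-layer scales and checks them against rank 0's):

```python
import torch
import torch.distributed as dist

def assert_scales_identical(local_scales: dict) -> None:
    """local_scales maps layer name -> best_scales tensor on this rank."""
    world = dist.get_world_size()
    gathered = [None] * world
    dist.all_gather_object(gathered, local_scales)
    ref = gathered[0]
    for rank, scales in enumerate(gathered[1:], start=1):
        assert scales.keys() == ref.keys(), f"layer set differs on rank {rank}"
        for name, t in scales.items():
            assert torch.equal(t.cpu(), ref[name].cpu()), (
                f"best_scales for {name} differ on rank {rank}"
            )
```

`torch.equal` (exact equality) rather than `allclose` is deliberate here: the design claims bitwise-identical grid-search results on every rank, not merely close ones.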