[Distributed] Extend QuantizationModifier to support distributed activation calibration by Etelis · Pull Request #2391 · vllm-project/llm-compressor

Etelis · 2026-02-22T12:11:01Z

Adds DDP support to QuantizationModifier for activation observer synchronization across multiple GPUs during calibration.

At SEQUENTIAL_EPOCH_END and CALIBRATION_EPOCH_END, activation observer min/max values are all-reduced across ranks. Scale/zp are then recomputed from the global statistics so all ranks have identical quantization parameters.

Changes

Add synchronize(), recompute_qparams(), recompute_global_scale() to Observer base class
Add sync_activation_observers() to QuantizationMixin (shared by QuantizationModifier and GPTQModifier)
Batch all async dist.all_reduce operations and wait once, matching GPTQ DDP pattern
Remove recompute_qparams_from_observer from calibration.py (now encapsulated in Observer methods)
Align distributed example with existing DDP patterns (init_dist, get_rank_partition)

Runtime & Evaluation Results

Model: Meta-Llama-3-8B-Instruct, W8A8 (static input activations), 256 calibration samples

Setup	Quantization Time	Speedup	Perplexity (wikitext)
1x A100-80GB (baseline)	270.1s	1.0x	37.51
2x A100-80GB (DDP)	138.0s	1.96x	37.64
4x A100-80GB (DDP)	72.1s	3.75x	36.40

Near-linear scaling with GPU count
Perplexity matches across all configurations, confirming activation observer sync produces correct quantization parameters

Test plan

Unit tests: pytest tests/llmcompressor/utils/test_distributed.py (8 tests)
Multi-GPU tests: torchrun --nproc_per_node=2 -m pytest tests/llmcompressor/modifiers/quantization/test_quantization_ddp.py (2 tests)
End-to-end quantization + eval on 1/2/4 GPUs (see results table above)

github-actions · 2026-02-22T12:11:10Z

👋 Hi! Thank you for contributing to llm-compressor. Please add the ready label when the PR is ready for review.

Note: This is required to complete the testing suite, please only add the label once the PR is code complete and local testing has been performed.

gemini-code-assist · 2026-02-22T12:11:24Z

Summary of Changes

Hello @Etelis, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly enhances the QuantizationModifier by integrating robust distributed training support. It introduces mechanisms to efficiently partition weight calibration tasks and synchronize activation observer statistics across multiple GPUs, thereby enabling large language model quantization in distributed environments. The changes are designed to be backward compatible, maintaining existing functionality for single-device operations.

Highlights

Distributed Quantization Support: The QuantizationModifier now supports Distributed Data Parallel (DDP) environments, enabling quantization workflows to scale across multiple GPUs.
Partitioned Weight Calibration: Weight calibration is distributed across ranks using a greedy bin-packing algorithm based on weight size. Each rank calibrates its assigned subset of modules and broadcasts the resulting quantization parameters (global_scale, scale, zero_point) to all other ranks.
Synchronized Activation Observers: Activation observer statistics (min/max values) are now synchronized across all DDP ranks at SEQUENTIAL_EPOCH_END and CALIBRATION_EPOCH_END events using all-reduce operations. This ensures all ranks compute identical quantization parameters for activations.
Backward Compatibility: All distributed operations are guarded by is_distributed() checks, ensuring that the modifier's behavior remains unchanged in single-GPU or non-distributed setups.

🧠 New Feature in Public Preview: You can now enable Memory to help Gemini Code Assist learn from your team's feedback. This makes future code reviews more consistent and personalized to your project's style. Click here to enable Memory in your admin console.

Changelog

examples/big_models_with_sequential_onloading/llama3_8b_w8a8_distributed.py
- Added a new example script demonstrating distributed W8A8 quantization for the Llama-3-8B-Instruct model.
src/llmcompressor/modifiers/quantization/calibration.py
- Exported the new recompute_qparams_from_observer function.
- Implemented recompute_qparams_from_observer to recalculate quantization parameters from observer statistics, particularly useful after distributed synchronization.
src/llmcompressor/modifiers/quantization/quantization/base.py
- Imported new distributed utility functions to facilitate DDP operations.
- Updated the QuantizationModifier docstring to reflect its new DDP capabilities.
- Refactored the on_start method to conditionally execute single-process or distributed weight calibration.
- Added _calibrate_weights_distributed method to manage DDP-specific weight calibration, including module partitioning, global scale computation, and broadcasting of results.
- Introduced _sync_activation_observers to perform all-reduce on activation observer min/max values and recompute quantization parameters.
- Integrated _sync_activation_observers into the on_event method, triggering synchronization at SEQUENTIAL_EPOCH_END and CALIBRATION_EPOCH_END.
src/llmcompressor/utils/init.py
- Imported the newly added distributed utility module.
src/llmcompressor/utils/distributed.py
- Added a new utility module for distributed processing.
- Implemented is_distributed, get_rank, and get_world_size functions for DDP environment detection and information retrieval.
- Developed _compute_rank_assignments for greedy bin-packing of modules based on weight size to balance workload across ranks.
- Provided partition_modules_by_weight_size to return the subset of modules assigned to the current rank.
- Created build_module_to_rank_map to establish a consistent mapping of modules to ranks.
- Implemented broadcast_module_parameter to broadcast module parameters from a source rank to all other ranks, supporting CPU-offloaded parameters.
- Added all_reduce_min and all_reduce_max functions for distributed minimum and maximum aggregation.
tests/llmcompressor/modifiers/quantization/test_quantization_ddp.py
- Added new multi-GPU tests to verify the correctness of all_reduce_min and all_reduce_max operations.
- Included tests to confirm that synchronized quantization parameters are identical across all DDP ranks.
tests/llmcompressor/utils/test_distributed.py
- Added unit tests for the new distributed utility functions, covering non-distributed behavior, module partitioning logic, and all-reduce operations.

Activity

Unit tests passed locally (5/5).
Multi-GPU tests passed on 2x A100 (2/2).
An end-to-end oneshot() run with nm-testing/tinysmokellama-3.2 on 2x A100 successfully quantized 42 modules.

Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

Feature	Command	Description
Code Review	`/gemini review`	Performs a code review for the current pull request in its current state.
Pull Request Summary	`/gemini summary`	Provides a summary of the current pull request in its current state.
Comment	@gemini-code-assist	Responds in comments when explicitly tagged, both in pull request comments and review comments.
Help	`/gemini help`	Displays a list of available commands.

Customization

To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution. ↩

mergify · 2026-02-22T12:11:46Z

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @Etelis.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

gemini-code-assist

Code Review

The pull request introduces distributed support for the QuantizationModifier, enabling weight calibration and activation observer synchronization across multiple GPUs. This is a significant improvement for scaling quantization to large models. The implementation uses a greedy bin-packing algorithm for load balancing weight calibration, which is a solid choice. However, the current approach to synchronization involves a large number of individual collective communication calls (all-reduces and broadcasts) within loops, which will likely become a performance bottleneck due to network latency. Additionally, there are a few issues with device indexing in multi-node environments that should be addressed to ensure robustness.

src/llmcompressor/modifiers/quantization/quantization/base.py

src/llmcompressor/utils/distributed.py

examples/big_models_with_sequential_onloading/llama3_8b_w8a8_distributed.py

Add shared utility functions for multi-GPU weight calibration and activation observer synchronization. All functions are no-ops when torch.distributed is not initialized. Signed-off-by: Itay Etelis <itay.etelis@ibm.com>

Add a helper function to recompute scale and zero_point from an observer's accumulated min/max after DDP all-reduce synchronization. Signed-off-by: Itay Etelis <itay.etelis@ibm.com>

Refactor QuantizationModifier.on_start to support distributed weight calibration. Each rank calibrates a subset of modules (assigned by greedy bin-packing on weight size) and broadcasts results to all ranks. Signed-off-by: Itay Etelis <itay.etelis@ibm.com>

Signed-off-by: Itay Etelis <itay.etelis@ibm.com>

Use each rank's own GPU device for NCCL broadcast instead of the module's execution device, which may be CPU or shared across ranks when the model is not GPU-resident. Signed-off-by: Itay Etelis <itay.etelis@ibm.com>

kylesayrs

I recommend letting @GOavi101 handle parallelized weight quantization, and instead focusing on parallelized activation quantization.

Once the requested changes have been made, please add a table of runtime and eval results, similar to GPTQ

src/llmcompressor/utils/distributed.py

src/llmcompressor/modifiers/quantization/quantization/base.py

src/llmcompressor/utils/distributed.py

src/llmcompressor/modifiers/quantization/quantization/base.py

src/llmcompressor/modifiers/quantization/calibration.py

examples/big_models_with_sequential_onloading/llama3_8b_w8a8_distributed.py

Remove distributed weight calibration (partition, broadcast, rank assignment) and focus exclusively on activation observer synchronization. Key changes: - Add synchronize(), recompute_qparams(), recompute_global_scale() to Observer base class for clean DDP interface - Move sync_activation_observers() to QuantizationMixin for reuse by both QuantizationModifier and GPTQModifier - Batch all async all_reduce ops and wait once, matching GPTQ pattern - Delete distributed.py (consolidated into Observer methods + dist.py) - Remove recompute_qparams_from_observer from calibration.py - Align example with existing DDP patterns (init_dist, get_rank_partition) - Update unit and multi-GPU tests for new observer-based sync Signed-off-by: Itay Etelis <itay.etelis@ibm.com>

Etelis · 2026-02-24T08:25:29Z

All review comments have been addressed:

Removed distributed weight calibration — weight calibration runs identically on each rank
Focus is exclusively on activation observer synchronization
synchronize(), recompute_qparams(), recompute_global_scale() added to Observer base class
sync_activation_observers() moved to QuantizationMixin (shared with GPTQModifier)
All async all_reduce ops batched + waited once via wait_for_comms() (GPTQ pattern)
Deleted distributed.py, removed all_reduce_min/all_reduce_max wrappers, removed CPU-to-GPU tensor movement
Removed recompute_qparams_from_observer from calibration.py
Example uses init_dist(), get_rank_partition(), load_offloaded_model()

Runtime & eval results added to PR description (Llama-3-8B, W8A8 static activations, 256 samples, A100-80GB):

Setup	Time	Speedup	Perplexity (wikitext)
1x GPU	270.1s	1.0x	37.51
2x GPU (DDP)	138.0s	1.96x	37.64
4x GPU (DDP)	72.1s	3.75x	36.40

kylesayrs

I think this code looks really great, the benchmarks look great as well. A couple notes from me:

The fact that the 4x DDP setup does not increase perplexity gives me confidence that syncing once per epoch (rather than once per batch) is good enough, nice work.
From your speedup benchmarks, it seems like repeating work (calculate_q/gparams) across ranks is not too much of a cost. That seems to match expectations as well, nice work.

I'll make sure this code gets merged as part of the next LLM Compressor release.

src/llmcompressor/observers/base.py

src/llmcompressor/modifiers/quantization/quantization/base.py

src/llmcompressor/observers/base.py

HDCharles

See comments

src/llmcompressor/observers/base.py

Etelis · 2026-03-20T13:32:10Z

Thanks for the latest round of feedback! Addressed all three points:

- Replace custom _all_reduce_fp8_safe with as_broadcastable from compressed_tensors, matching the GPTQ pattern - Divide by world_size before SUM all-reduce in moving-average observers, removing the need for finalize_synchronize - Remove _FP8_DTYPES set and _all_reduce_fp8_safe helper Signed-off-by: Itay Etelis <itay.etelis@ibm.com>

…m/Etelis/llm-compressor into feature/quantization-modifier-ddp

src/llmcompressor/observers/moving_base.py

kylesayrs · 2026-03-23T11:49:21Z

Looks good to me! Please use an average when synchronizing moving average observers across ranks

mergify · 2026-03-23T11:50:36Z

The quality checks have failed. Please run make style and make quality under
the root directory to adddress the lint failures. You will need to install the
dev optional install to get the required linting packages:
https://github.com/vllm-project/llm-compressor/blob/main/CONTRIBUTING.md

Co-authored-by: Kyle Sayers <kylesayrs@gmail.com> Signed-off-by: Itay Etelis <92247226+Etelis@users.noreply.github.com>

HDCharles

looks good

mergify · 2026-03-24T15:06:02Z

The quality checks have failed. Please run make style and make quality under
the root directory to adddress the lint failures. You will need to install the
dev optional install to get the required linting packages:
https://github.com/vllm-project/llm-compressor/blob/main/CONTRIBUTING.md

mergify · 2026-03-24T18:31:39Z

The quality checks have failed. Please run make style and make quality under
the root directory to adddress the lint failures. You will need to install the
dev optional install to get the required linting packages:
https://github.com/vllm-project/llm-compressor/blob/main/CONTRIBUTING.md

- Run ruff format on observers/base.py - Fix test_moving_avg_synchronize_issues_all_reduce to mock moving_base.dist (not base.dist) and use SUM/get_world_size Signed-off-by: Itay Etelis <itay.etelis@ibm.com>

…m/Etelis/llm-compressor into feature/quantization-modifier-ddp

Etelis requested review from HDCharles, brian-dellabetta, dsikka and kylesayrs as code owners February 22, 2026 12:11

mergify bot added the documentation Improvements or additions to documentation label Feb 22, 2026

mergify bot added the needs-rebase label Feb 22, 2026

gemini-code-assist bot reviewed Feb 22, 2026

View reviewed changes

src/llmcompressor/modifiers/quantization/quantization/base.py Outdated Show resolved Hide resolved

src/llmcompressor/utils/distributed.py Outdated Show resolved Hide resolved

examples/big_models_with_sequential_onloading/llama3_8b_w8a8_distributed.py Outdated Show resolved Hide resolved

EtelisIBM added 7 commits February 22, 2026 14:12

[Distributed] Add distributed utilities for DDP calibration

f60200a

Add shared utility functions for multi-GPU weight calibration and activation observer synchronization. All functions are no-ops when torch.distributed is not initialized. Signed-off-by: Itay Etelis <itay.etelis@ibm.com>

[Distributed] Add recompute_qparams_from_observer helper

c4d630d

Add a helper function to recompute scale and zero_point from an observer's accumulated min/max after DDP all-reduce synchronization. Signed-off-by: Itay Etelis <itay.etelis@ibm.com>

[Tests] Add unit tests for distributed utilities

ac0cc2a

Signed-off-by: Itay Etelis <itay.etelis@ibm.com>

[Tests] Add multi-GPU integration tests for DDP quantization

76cf40f

Signed-off-by: Itay Etelis <itay.etelis@ibm.com>

[Examples] Add distributed W8A8 quantization example

0f3e1f9

Signed-off-by: Itay Etelis <itay.etelis@ibm.com>

[Distributed] Fix broadcast_module_parameter for CPU-resident models

9975edc

Use each rank's own GPU device for NCCL broadcast instead of the module's execution device, which may be CPU or shared across ranks when the model is not GPU-resident. Signed-off-by: Itay Etelis <itay.etelis@ibm.com>

Etelis force-pushed the feature/quantization-modifier-ddp branch from 61c255a to 72ed4b2 Compare February 22, 2026 12:13

mergify bot removed the needs-rebase label Feb 22, 2026

Etelis force-pushed the feature/quantization-modifier-ddp branch from 72ed4b2 to 9975edc Compare February 22, 2026 12:15

kylesayrs requested changes Feb 23, 2026

View reviewed changes

kylesayrs reviewed Feb 23, 2026

View reviewed changes

examples/big_models_with_sequential_onloading/llama3_8b_w8a8_distributed.py Outdated Show resolved Hide resolved

Merge branch 'main' into feature/quantization-modifier-ddp

87f4b0d

kylesayrs reviewed Feb 24, 2026

View reviewed changes

src/llmcompressor/observers/base.py Show resolved Hide resolved

src/llmcompressor/modifiers/quantization/quantization/base.py Show resolved Hide resolved

kylesayrs self-requested a review February 24, 2026 20:15

kylesayrs changed the title ~~[Distributed] Extend QuantizationModifier to support weight-parallel optimization~~ [Distributed] Extend QuantizationModifier to support distributed activation calibration Feb 26, 2026

HDCharles reviewed Feb 28, 2026

View reviewed changes

src/llmcompressor/observers/base.py Outdated Show resolved Hide resolved

HDCharles requested changes Mar 4, 2026

View reviewed changes

HDCharles requested changes Mar 19, 2026

View reviewed changes

HDCharles reviewed Mar 19, 2026

View reviewed changes

src/llmcompressor/observers/base.py Outdated Show resolved Hide resolved

Etelis requested a review from HDCharles March 20, 2026 13:32

Merge branch 'feature/quantization-modifier-ddp' of https://github.co…

82d808c

…m/Etelis/llm-compressor into feature/quantization-modifier-ddp

kylesayrs reviewed Mar 23, 2026

View reviewed changes

src/llmcompressor/observers/moving_base.py Outdated Show resolved Hide resolved

mergify bot added the quality-failed label Mar 23, 2026

Update src/llmcompressor/observers/moving_base.py

d6b3575

Co-authored-by: Kyle Sayers <kylesayrs@gmail.com> Signed-off-by: Itay Etelis <92247226+Etelis@users.noreply.github.com>

mergify bot removed the quality-failed label Mar 23, 2026

Merge branch 'main' into feature/quantization-modifier-ddp

9680d9e

kylesayrs approved these changes Mar 23, 2026

View reviewed changes

kylesayrs self-assigned this Mar 23, 2026

HDCharles approved these changes Mar 23, 2026

View reviewed changes

kylesayrs enabled auto-merge (squash) March 23, 2026 23:21

Merge branch 'main' into feature/quantization-modifier-ddp

688b309

mergify bot added the quality-failed label Mar 24, 2026

Merge branch 'main' into feature/quantization-modifier-ddp

7f31744

mergify bot removed the quality-failed label Mar 24, 2026

Merge branch 'main' into feature/quantization-modifier-ddp

4f80617

mergify bot added the quality-failed label Mar 24, 2026

EtelisIBM added 2 commits March 25, 2026 09:20

fix formatting and moving-average test mock path

f959d4b

- Run ruff format on observers/base.py - Fix test_moving_avg_synchronize_issues_all_reduce to mock moving_base.dist (not base.dist) and use SUM/get_world_size Signed-off-by: Itay Etelis <itay.etelis@ibm.com>

Merge branch 'feature/quantization-modifier-ddp' of https://github.co…

7baf545

…m/Etelis/llm-compressor into feature/quantization-modifier-ddp

auto-merge was automatically disabled March 25, 2026 07:20
Head branch was pushed to by a user without write access

mergify bot removed the quality-failed label Mar 25, 2026

Merge branch 'main' into feature/quantization-modifier-ddp

09e817b

Conversation

Etelis commented Feb 22, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Changes

Runtime & Evaluation Results

Test plan

Uh oh!

github-actions bot commented Feb 22, 2026

Uh oh!

gemini-code-assist bot commented Feb 22, 2026

Summary of Changes

Highlights

Footnotes

Uh oh!

mergify bot commented Feb 22, 2026

Uh oh!

gemini-code-assist bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

Uh oh!

Uh oh!

Uh oh!

kylesayrs left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Etelis commented Feb 24, 2026

Uh oh!

kylesayrs left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

HDCharles left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Etelis commented Mar 20, 2026

Uh oh!

Uh oh!

kylesayrs commented Mar 23, 2026

Uh oh!

mergify bot commented Mar 23, 2026

Uh oh!

HDCharles left a comment

Choose a reason for hiding this comment

Uh oh!

mergify bot commented Mar 24, 2026

Uh oh!

mergify bot commented Mar 24, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

Etelis commented Feb 22, 2026 •

edited

Loading