Refactor logging, CompressionLogger, support distributed#2408

Merged
kylesayrs merged 5 commits into main from kylesayrs/better-compression-logger
Mar 5, 2026

Conversation


@kylesayrs kylesayrs commented Feb 25, 2026

Purpose

  • Remove misleading information about module size after compression
  • Support loguru logging, which reports which rank each log message comes from
  • Support compression logging that is specific to distributed workloads

Changes

  • Refactor CompressionLogger
    • Remove the NVIDIA/AMD-specific logic and use the torch.cuda interface instead
      • torch.cuda already accounts for "CUDA/AMD_VISIBLE_DEVICES", so there is no need to hard-code these environment variables
    • Remove the "module size" log, which is misleading: the module size does not actually change as optimization occurs (quantize-dequantize, QDQ)
    • Limit devices to just the current device in distributed cases
  • Refactor loguru logger configuration
    • configure_logger can now be called multiple times
    • When oneshot occurs, configure_logger is called again with the rank set
    • Logger now prints rank if applicable
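The rank-aware format described above can be sketched as follows. This is a hypothetical helper, not the PR's actual code: the function name `build_log_format` and the exact format string are illustrative, assuming a loguru-style format with `{time}`, `{function}`, `{level}`, and `{message}` fields matching the log lines shown under Testing.

```python
def build_log_format(rank=None):
    """Return a loguru-style format string, with a rank prefix when set.

    Hypothetical sketch: the real configure_logger presumably rebuilds the
    loguru sink with a format like this once the distributed rank is known.
    """
    base = "{time:YYYY-MM-DDTHH:mm:ss.SSSS} | {function} | {level} - {message}"
    if rank is not None:
        # Distributed runs prefix each line with the originating rank.
        return f"[Rank {rank}] " + base
    return base

# Single-process format has no rank prefix; distributed format does.
print(build_log_format())
print(build_log_format(1))
```

Rebuilding the format string on each call is what lets configure_logger be invoked again safely once oneshot knows the rank.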

Testing

Single-thread

2026-02-25T17:04:36.8189 | compress_module_list | INFO - Quantizing model.layers.0.mlp.gate_proj using 512 samples
2026-02-25T17:04:38.5924 | GPTQ | METRIC - time 1.77s
2026-02-25T17:04:38.5926 | GPTQ | METRIC - error 663.60
2026-02-25T17:04:38.5932 | GPTQ | METRIC - GPU 0 | usage: 4.45% | total memory: 85.1 GB
2026-02-25T17:04:38.5933 | GPTQ | METRIC - GPU 1 | usage: 0.00% | total memory: 85.1 GB

Distributed

[Rank 1] 2026-02-25T17:10:18.8569 | compress_module_list | INFO - Quantizing model.layers.2.self_attn.o_proj using 512 samples
[Rank 1] 2026-02-25T17:10:20.4585 | GPTQ | METRIC - time 1.60s
[Rank 1] 2026-02-25T17:10:20.4586 | GPTQ | METRIC - error 1.27
[Rank 1] 2026-02-25T17:10:20.4593 | GPTQ | METRIC - GPU 1 | usage: 4.45% | total memory: 85.1 Gb
[Rank 1] 2026-02-25T17:10:20.4637 | compress_module_list | INFO - Quantizing model.layers.2.mlp.up_proj using 512 samples
[Rank 0] 2026-02-25T17:10:20.7379 | GPTQ | METRIC - time 6.59s
[Rank 0] 2026-02-25T17:10:20.7381 | GPTQ | METRIC - error 7.45
[Rank 0] 2026-02-25T17:10:20.7401 | GPTQ | METRIC - GPU 0 | usage: 5.98% | total memory: 85.1 Gb
[Rank 0] 2026-02-25T17:10:20.7590 | compress_module_list | INFO - Quantizing model.layers.2.mlp.gate_proj using 512 samples

@github-actions

👋 Hi! Thank you for contributing to llm-compressor. Please add the ready label when the PR is ready for review.

Note: This is required to complete the testing suite, please only add the label once the PR is code complete and local testing has been performed.

@gemini-code-assist
Contributor

Summary of Changes

Hello @kylesayrs, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request refactors the CompressionLogger utility to streamline GPU memory monitoring and enhance its compatibility with distributed workloads. It removes hardware-specific GPU monitoring logic in favor of a unified torch.cuda interface and ensures that only relevant device metrics are logged in distributed settings. Additionally, a potentially misleading metric related to compressed module size has been removed for clarity.

Highlights

  • Simplified GPU Memory Logging: The CompressionLogger now directly utilizes torch.cuda functions for GPU memory metrics, eliminating the need for separate NVIDIA (pynvml) and AMD (amdsmi) specific implementations.
  • Distributed Workload Support: The logging mechanism has been enhanced to correctly identify and monitor only the current device when operating in a distributed environment, preventing redundant logging across multiple processes.
  • Removed Misleading Module Size Metric: The logging of 'Compressed module size' was removed as it provided potentially misleading information regarding the module's actual size post-compression.
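The unified memory query highlighted above reduces to comparing allocated against total bytes per device. A minimal sketch of that computation follows; the helper name `gpu_usage_percent` is hypothetical, and in the real logger the two inputs would presumably come from torch.cuda.memory_allocated(i) and torch.cuda.get_device_properties(i).total_memory. A zero-total guard is included because the review below flags a possible ZeroDivisionError.

```python
def gpu_usage_percent(allocated_bytes, total_bytes):
    """Percentage of a device's memory currently allocated.

    Hypothetical sketch; real values would come from the torch.cuda API.
    """
    if total_bytes == 0:
        # Guard against the ZeroDivisionError edge case noted in review.
        return 0.0
    return 100.0 * allocated_bytes / total_bytes
```

For example, 4.45 GB allocated on an 85.1 GB device would log as roughly the "usage: 4.45%" seen in the Testing output when total memory is normalized accordingly.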


Changelog
  • src/llmcompressor/utils/metric_logging.py
    • Removed GPUMemory namedtuple, GPUType enum, and get_layer_size_mb function.
    • Removed os import and added Iterable and is_distributed imports.
    • Refactored CompressionLogger initialization to remove GPU type detection and visible device parsing.
    • Updated CompressionLogger.__exit__ to directly query torch.cuda for memory usage.
    • Removed get_GPU_memory_usage, _get_GPU_usage_nv, and _get_GPU_usage_amd methods.
    • Introduced _get_visible_devices helper function to determine devices to monitor based on distributed status.
    • Removed logging of 'Compressed module size'.
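The _get_visible_devices behavior listed in the changelog can be sketched as below. The signature is hypothetical: the real helper presumably derives these three inputs from is_distributed and the torch.cuda API rather than taking them as parameters.

```python
def get_visible_devices(is_distributed, current_device, device_count):
    """Devices this process should monitor (hypothetical sketch).

    Distributed: each rank monitors only its own device, so the same GPU
    is not reported redundantly by every process.
    Single-process: monitor every visible CUDA device.
    """
    if is_distributed:
        return [current_device]
    return list(range(device_count))
```

This matches the Testing output above: the single-process run reports GPU 0 and GPU 1, while each distributed rank reports only its own GPU.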
Activity
  • No specific activity (comments, reviews, progress) was provided in the context.


@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request simplifies the CompressionLogger by removing hardware-specific GPU monitoring code in favor of the torch.cuda interface and adds support for distributed environments. These changes are a good step towards simplification and better maintainability. However, I've identified a critical issue in the new memory usage logging logic which causes incorrect metrics to be reported and can lead to a ZeroDivisionError. I've provided a detailed comment with a suggested fix.

@kylesayrs kylesayrs marked this pull request as ready for review February 25, 2026 22:17
@kylesayrs kylesayrs changed the title from "Simplify CompressionLogger, support distributed" to "Refactor logging, CompressionLogger, support distributed" Feb 25, 2026

@brian-dellabetta brian-dellabetta left a comment


So much cleaner! I didn't know you could do this all through the torch.cuda API.


@HDCharles HDCharles left a comment


See comment; otherwise looks good.


brian-dellabetta commented Feb 27, 2026

Tried running GPTQ on this branch on an AMD device; the output looks good.

Preparing cache: 100%|███████████████████████████████████████████████████████████| 512/512 [00:00<00:00, 1222.47it/s]
(1/33): Calibrating: 100%|█████████████████████████████████████████████████████████| 512/512 [00:13<00:00, 37.81it/s]
2026-02-27T22:02:21.4799 | compress_module_list | INFO - Quantizing model.layers.0.self_attn.q_proj using 512 samples
2026-02-27T22:02:29.4214 | GPTQ | METRIC - time 7.94s
2026-02-27T22:02:29.4216 | GPTQ | METRIC - error 1121.90
2026-02-27T22:02:29.4218 | GPTQ | METRIC - GPU 0 | usage: 1.87% | total memory: 206.1 Gb
2026-02-27T22:02:29.4230 | compress_module_list | INFO - Quantizing model.layers.0.self_attn.k_proj using 512 samples
2026-02-27T22:02:30.5821 | GPTQ | METRIC - time 1.16s
2026-02-27T22:02:30.5823 | GPTQ | METRIC - error 593.86
2026-02-27T22:02:30.5825 | GPTQ | METRIC - GPU 0 | usage: 1.87% | total memory: 206.1 Gb
2026-02-27T22:02:30.5830 | compress_module_list | INFO - Quantizing model.layers.0.self_attn.v_proj using 512 samples
2026-02-27T22:02:31.7442 | GPTQ | METRIC - time 1.16s
2026-02-27T22:02:31.7443 | GPTQ | METRIC - error 17.22
2026-02-27T22:02:31.7445 | GPTQ | METRIC - GPU 0 | usage: 1.87% | total memory: 206.1 Gb
2026-02-27T22:02:31.7450 | compress_module_list | INFO - Quantizing model.layers.0.self_attn.o_proj using 512 samples
2026-02-27T22:02:32.9174 | GPTQ | METRIC - time 1.17s
2026-02-27T22:02:32.9175 | GPTQ | METRIC - error 0.31
2026-02-27T22:02:32.9177 | GPTQ | METRIC - GPU 0 | usage: 1.87% | total memory: 206.1 Gb
2026-02-27T22:02:32.9187 | compress_module_list | INFO - Quantizing model.layers.0.mlp.gate_proj using 512 samples
2026-02-27T22:02:34.2249 | GPTQ | METRIC - time 1.31s
2026-02-27T22:02:34.2251 | GPTQ | METRIC - error 663.50
2026-02-27T22:02:34.2253 | GPTQ | METRIC - GPU 0 | usage: 1.87% | total memory: 206.1 Gb
2026-02-27T22:02:34.2282 | compress_module_list | INFO - Quantizing model.layers.0.mlp.up_proj using 512 samples
2026-02-27T22:02:35.5349 | GPTQ | METRIC - time 1.31s
2026-02-27T22:02:35.5351 | GPTQ | METRIC - error 526.16
2026-02-27T22:02:35.5353 | GPTQ | METRIC - GPU 0 | usage: 1.87% | total memory: 206.1 Gb
2026-02-27T22:02:35.5382 | compress_module_list | INFO - Quantizing model.layers.0.mlp.down_proj using 512 samples
2026-02-27T22:02:40.8984 | GPTQ | METRIC - time 5.36s
2026-02-27T22:02:40.8985 | GPTQ | METRIC - error 1.87
2026-02-27T22:02:40.8988 | GPTQ | METRIC - GPU 0 | usage: 2.53% | total memory: 206.1 Gb
(1/33): Propagating: 100%|████████████████████████████████████████████████████████| 512/512 [00:02<00:00, 182.29it/s]


mergify bot commented Mar 2, 2026

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @kylesayrs.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Mar 2, 2026
Signed-off-by: Kyle Sayers <kylesayrs@gmail.com>
@kylesayrs kylesayrs force-pushed the kylesayrs/better-compression-logger branch from 04ca9bc to 0bcef4f on March 5, 2026 15:06
@kylesayrs kylesayrs added the ready When a PR is ready for review label Mar 5, 2026
@mergify mergify bot removed the needs-rebase label Mar 5, 2026
@kylesayrs kylesayrs merged commit 6d73ce6 into main Mar 5, 2026
13 of 18 checks passed
@kylesayrs kylesayrs deleted the kylesayrs/better-compression-logger branch March 5, 2026 19:25

Labels

ready When a PR is ready for review