Signed-off-by: Dan Huang <dahuang@redhat.com>
Code Review
This pull request updates the version of the compressed-tensors dependency. While the update to version 0.14.0 for release builds is correct, I've identified a potential issue with the version specified for development builds, which could lead to discrepancies between development and release environments. A suggestion has been made to align these versions to mitigate this risk.
SUMMARY:
Compressed-tensors 0.14.0 has been released. Bump up its version in llmcompressor.
TEST PLAN:
All tests.
Signed-off-by: Dan Huang <dahuang@redhat.com>
Signed-off-by: yiliu30 <yi4.liu@intel.com>
commit a2433a9b0128fb5113a362d553d7984de6246053
Author: Yi Liu <yi4.liu@intel.com>
Date: Sat Mar 7 07:24:20 2026 +0800
[AutoRound] Add DDP Support and Example (#2411)
SUMMARY:
Add DDP support for Autoround and use Qwen as example.
Depends on https://github.com/vllm-project/llm-compressor/pull/2410
TEST PLAN:
"please outline how the changes were tested"
cc @hshen14 @thuang6
---------
Signed-off-by: HDCharles <charlesdavidhernandez@gmail.com>
Signed-off-by: yiliu30 <yi4.liu@intel.com>
Co-authored-by: HDCharles <charlesdavidhernandez@gmail.com>
Co-authored-by: Dipika Sikka <dipikasikka1@gmail.com>
commit a88ebbd2e6e5fa02d9f33bc86b7118149dac3cb4
Author: Gilles Turpin <turpingilles15@gmail.com>
Date: Fri Mar 6 16:52:20 2026 +0100
Add MoE calibration module for GlmMoeDsa (GLM-5) (#2434)
SUMMARY:
GlmMoeDsaNaiveMoe uses packed 3D nn.Parameter tensors instead of
nn.Linear modules, causing targets=["Linear"] to match nothing in MoE
experts during AWQ/GPTQ quantization.
This PR permanently unpacks the fused expert weights into individual
nn.Linear layers, following the same calibration pattern as glm4_moe
with dtype handling aligned.
Key differences from glm4_moe: is_permanent=True (experts must be
unpacked for quantization targets to match), DeepSeek-style routing with
groups/topk_group/norm, and SequentialGlmMoeDsaExperts for 3D->2D weight
unpacking.
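The 3D-to-2D unpacking can be sketched with numpy (shapes and variable names below are illustrative, not the actual GlmMoeDsa layout): each expert's slice of the fused 3D tensor becomes a standalone 2D weight matrix that a per-module matcher like `targets=["Linear"]` can see.

```python
import numpy as np

# Illustrative only: the fused MoE parameter packs every expert into one
# 3D tensor of shape (num_experts, out_features, in_features).
num_experts, out_features, in_features = 4, 8, 16
fused = np.arange(num_experts * out_features * in_features, dtype=np.float32)
fused = fused.reshape(num_experts, out_features, in_features)

# "Unpacking" slices out each expert's 2D weight matrix, which is what a
# per-module quantization target needs to match.
expert_weights = [fused[i] for i in range(num_experts)]

assert len(expert_weights) == num_experts
assert all(w.shape == (out_features, in_features) for w in expert_weights)
```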
Closes #2430
TEST PLAN:
pytest.importorskip: tests skip gracefully on transformers < 5.x
3 unit tests: all experts triggered, output matches original, experts
converted to nn.Linear
Full e2e validation pending transformers 5.x compatibility
No smaller GLM-5 checkpoint available for e2e testing (744B only)
Signed-off-by: Gilles Turpin <turpingilles15@gmail.com>
commit 47ec10e84d659719f1ff9959df0effb3e6f2d95d
Author: Yi Liu <yi4.liu@intel.com>
Date: Fri Mar 6 05:58:04 2026 +0800
Upgrade autoround 0.10.2 (#2410)
Signed-off-by: yiliu30 <yi4.liu@intel.com>
SUMMARY:
"please provide a brief summary"
TEST PLAN:
"please outline how the changes were tested"
cc @hshen14 @thuang6 @chensuyue
---------
Signed-off-by: yiliu30 <yi4.liu@intel.com>
Co-authored-by: HDCharles <39544797+HDCharles@users.noreply.github.com>
commit 04dea55db919c1e8783a1f9a4c26977aff89fdfc
Author: Brian Dellabetta <brian-dellabetta@users.noreply.github.com>
Date: Thu Mar 5 16:16:59 2026 -0500
[Hotfix] _match_name hotfix (#2447)
SUMMARY:
To account for exposing `match_name` in compressed-tensors PR in
* https://github.com/vllm-project/compressed-tensors/pull/607
TEST PLAN:
tests pass
Signed-off-by: Brian Dellabetta <bdellabe@redhat.com>
commit 6d73ce60fac726496365f5144b98091f74876528
Author: Kyle Sayers <kylesayrs@gmail.com>
Date: Thu Mar 5 14:25:15 2026 -0500
Refactor logging, `CompressionLogger`, support distributed (#2408)
* Remove misleading information about module size after compression
* Support loguru logging which logs which rank logs come from
* Support compression logging that is specific to distributed workloads
* Refactor `CompressionLogger`
* Remove nvidia/amd logic, instead just use cuda interface
* This already accounts for "CUDA/AMD_VISIBLE_DEVICES", no need to hard
code these env variables
* Remove "module size" log, which is misleading, as the module size does
not actually change as optimization occurs (qdq)
* Limit devices to just the current device in distributed cases
* Refactor loguru logger configuration
* `configure_logger` can now be called multiple times
* When oneshot occurs, `configure_logger` is called again with the rank
set
* Logger now prints rank if applicable
Single-thread
```
2026-02-25T17:04:36.8189 | compress_module_list | INFO - Quantizing model.layers.0.mlp.gate_proj using 512 samples
2026-02-25T17:04:38.5924 | GPTQ | METRIC - time 1.77s
2026-02-25T17:04:38.5926 | GPTQ | METRIC - error 663.60
2026-02-25T17:04:38.5932 | GPTQ | METRIC - GPU 0 | usage: 4.45% | total memory: 85.1 GB
2026-02-25T17:04:38.5933 | GPTQ | METRIC - GPU 1 | usage: 0.00% | total memory: 85.1 GB
```
Distributed
```
[Rank 1] 2026-02-25T17:10:18.8569 | compress_module_list | INFO - Quantizing model.layers.2.self_attn.o_proj using 512 samples
[Rank 1] 2026-02-25T17:10:20.4585 | GPTQ | METRIC - time 1.60s
[Rank 1] 2026-02-25T17:10:20.4586 | GPTQ | METRIC - error 1.27
[Rank 1] 2026-02-25T17:10:20.4593 | GPTQ | METRIC - GPU 1 | usage: 4.45% | total memory: 85.1 Gb
[Rank 1] 2026-02-25T17:10:20.4637 | compress_module_list | INFO - Quantizing model.layers.2.mlp.up_proj using 512 samples
[Rank 0] 2026-02-25T17:10:20.7379 | GPTQ | METRIC - time 6.59s
[Rank 0] 2026-02-25T17:10:20.7381 | GPTQ | METRIC - error 7.45
[Rank 0] 2026-02-25T17:10:20.7401 | GPTQ | METRIC - GPU 0 | usage: 5.98% | total memory: 85.1 Gb
[Rank 0] 2026-02-25T17:10:20.7590 | compress_module_list | INFO - Quantizing model.layers.2.mlp.gate_proj using 512 samples
```
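The rank-prefixed format shown above could be produced along these lines. This is a conceptual sketch using Python's stdlib `logging` (llm-compressor itself uses loguru); the function and logger names are illustrative.

```python
import logging
import os

def configure_logger(rank=None):
    """Configure a logger whose format includes the rank, if given.
    Safe to call multiple times (e.g. again after distributed init)."""
    prefix = f"[Rank {rank}] " if rank is not None else ""
    handler = logging.StreamHandler()
    handler.setFormatter(logging.Formatter(
        prefix + "%(asctime)s | %(name)s | %(levelname)s - %(message)s"
    ))
    logger = logging.getLogger("compression")
    logger.handlers.clear()  # re-configuration replaces old handlers
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    return logger

log = configure_logger(rank=int(os.environ.get("RANK", 0)))
log.info("Quantizing model.layers.0.mlp.gate_proj using 512 samples")
```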
---------
Signed-off-by: Kyle Sayers <kylesayrs@gmail.com>
commit d6eb2be988706e46cefb03ab6acf1bbd104d35af
Author: Gilles Turpin <turpingilles@orange.fr>
Date: Thu Mar 5 01:45:04 2026 +0100
fix: handle packed weights in granite4 to_3d_expert (W4A16 support) (#2425)
SUMMARY:
Fix the W4A16 shape mismatch in to_3d_expert() reported in #2338 (first
error). The original code hardcoded shapes for FP8 quantization only.
The fix calculates all shapes up front (packed weights, grouped scales,
packed zero points) then asserts and reshapes. This supports FP8
per-channel, FP8 block quantization, W4A16 symmetric, and W4A16
asymmetric (with packed zero_point on dim0).
Companion to #2426 (FX tracing fix) and compressed-tensors #609 (3D
pack/unpack). Together they resolve #2338.
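The up-front shape bookkeeping can be illustrated with simple arithmetic. The helper below is a hypothetical sketch, not the actual `to_3d_expert()` code, and the group size, packing axis, and dimensions are assumptions: int4 weights pack two 4-bit values per byte, and grouped scales shrink the input dimension by the group size.

```python
# Hypothetical helper illustrating the up-front shape calculation;
# not the actual to_3d_expert() implementation.
def expected_shapes(out_features, in_features, group_size=128, pack_factor=2):
    # int4: two 4-bit values packed per byte along the input dimension
    packed_weight = (out_features, in_features // pack_factor)
    # one scale per (row, group) for grouped quantization
    grouped_scale = (out_features, in_features // group_size)
    return packed_weight, grouped_scale

w_shape, s_shape = expected_shapes(out_features=1024, in_features=4096)
assert w_shape == (1024, 2048)
assert s_shape == (1024, 32)
```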
TEST PLAN:
4 unit tests covering all quantization configurations:
- int4 symmetric (packed weights, per-channel scale)
- int4 asymmetric (packed weights + packed zero_point on dim0)
- fp8 block (grouped scale)
- fp8 per-channel (no packing)
All passing.
Signed-off-by: Gilles Turpin <turpingilles15@gmail.com>
Co-authored-by: HDCharles <39544797+HDCharles@users.noreply.github.com>
commit 4c522137771b2223dcbfec2001658a744b37a3d5
Author: Gilles Turpin <turpingilles@orange.fr>
Date: Thu Mar 5 00:50:57 2026 +0100
fix: use topological ordering in FX graph cleanup to fix erase_node crash (Granite4 GPTQ) (#2426)
Fix the FX tracing crash reported as the second error in #2338. The BFS
cleanup of concrete args did not maintain topological ordering — if a
node was visited multiple times, its position in the deletion dict was
not updated, causing dependents to be deleted before their dependencies
(`RuntimeError: Tried to erase Node getitem_169`).
The fix uses `move_to_end` in the BFS traversal so that revisited nodes
are moved to the end of the deletion dict, ensuring topological order.
Companion to #2425 (shape fix) and compressed-tensors #609 (3D
pack/unpack). Together they resolve #2338.
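The `move_to_end` fix can be illustrated on a toy DAG (a conceptual sketch, not the actual FX cleanup code): when BFS reaches a node again via another path, pushing it to the end of the ordered deletion dict keeps the final ordering topological.

```python
from collections import OrderedDict, deque

# Toy DAG, not an actual FX graph: edges map a node to its users.
users = {"a": ["b", "c"], "b": ["d"], "c": ["d"], "d": []}

def bfs_topological(root):
    order = OrderedDict()
    queue = deque([root])
    while queue:
        node = queue.popleft()
        if node in order:
            # Revisited via another path: push it to the end so it
            # stays after every node that can reach it.
            order.move_to_end(node)
        else:
            order[node] = None
        queue.extend(users[node])
    return list(order)

order = bfs_topological("a")
assert order == ["a", "b", "c", "d"]  # "d" after both "b" and "c"
```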
Tested on Granite 4.0-h-small with a single layer, using all three fixes
(#2425, #2426, compressed-tensors #609).
Script based on `test_gptq_no_exclusion.py` from #2338 with
`model.model.layers = model.model.layers[:1]` added after model loading.
Command: `python test_gptq_no_exclusion.py --model-name
ibm-granite/granite-4.0-h-small --output /workspace/test-output
--calibration-samples 16`
Results:
- FX tracing completed — no `erase_node` crash
- 3D→2D conversion OK
- Cache preparation OK (16/16 samples)
- Calibration started but hit OOM on the Mamba layer (unrelated to the
fix — naive Mamba path without `causal_conv1d` on a 31GB GPU)
Signed-off-by: gillesturpin <turpingilles@orange.fr>
Co-authored-by: HDCharles <39544797+HDCharles@users.noreply.github.com>
commit 7461d02b9bf9edc35f3be9effdaa97d6639baf1f
Author: JinRiYao2001 <jinriyao@qq.com>
Date: Thu Mar 5 02:19:54 2026 +0800
fix(examples): correct W8A16 -> W4A16 in Qwen3-VL AWQ example save dir (#2443)
SUMMARY:
The AWQ recipe in this example uses num_bits=4 for weights (W4A16).
However the save directory name incorrectly uses "W8A16":
-AWQ-W8A16-mse-seq
This PR updates it to:
-AWQ-W4A16-mse-seq
to match the actual quantization configuration and the comment above the
recipe.
TEST PLAN:
Not applicable. This PR only fixes an incorrect save directory string in
the example script.
No functional code paths are changed.
commit e6fdd066c785b11453875e777c229a954a9c438e
Author: Kyle Sayers <kylesayrs@gmail.com>
Date: Tue Mar 3 16:25:18 2026 -0500
Remove dead code (#2435)
* Remove dead code
* Remove `save_checkpoint` (this is now done by
[post_process](https://github.com/vllm-project/llm-compressor/blob/main/src/llmcompressor/entrypoints/utils.py#L95))
* Remove `get_completed_stages`, `save_completed_stages` (stages no
longer exist)
* Remove `load_safetensors_state_dict` (we now either load with the
transformers model definition or `model_free_ptq`)
* Remove `set_deterministic_seeds` (not used)
* Remove `is_package_available`
Signed-off-by: Kyle Sayers <kylesayrs@gmail.com>
commit 7b7d1a5dc1fbca660acc04ff993fcb0c9d15acbb
Author: Dipika Sikka <dipikasikka1@gmail.com>
Date: Tue Mar 3 14:04:56 2026 -0500
Enable merge queue support in GitHub workflows (#2433)
- Configures Mergify merge queue with automatic DCO sign-off to resolve
DCO check failures on merge commits
- Removes GitHub native merge queue triggers from all workflows
- Adds auto-merge rule for PRs with `ready` label and required approvals
The DCO (Developer Certificate of Origin) GitHub App was failing on
merge commits created by GitHub's native merge queue, as those commits
lacked the required `Signed-off-by:` trailer.
Switch to Mergify's merge queue which automatically adds DCO sign-off to
all merge commits it creates.
- Added `queue_rules` with automatic DCO sign-off in commit messages
- Added auto-merge rule that queues PRs when:
- Label `ready` is applied
- 2+ approvals received
- All required checks pass (DCO, tests, quality, etc.)
- `.github/workflows/ready-label-check.yaml`: Removed merge_group
trigger
- `.github/workflows/test-check-transformers.yaml`: Removed merge_group
trigger and condition
- `.github/workflows/test-check.yaml`: Removed merge_group trigger
- `.github/workflows/quality-check.yaml`: Removed merge_group trigger
- `.github/workflows/linkcheck.yml`: Removed merge_group trigger
After merging, GitHub's native merge queue should be disabled in
repository settings and Mergify will handle all merge queue operations.
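A minimal sketch of what the `.mergify.yml` described above might contain. The field names follow Mergify's documented schema, but the values, the check names, and the exact sign-off template here are illustrative guesses, not the repository's actual configuration.

```yaml
# Illustrative sketch only -- consult Mergify's docs for the exact schema.
queue_rules:
  - name: default
    queue_conditions:
      - label=ready
      - "#approved-reviews-by>=2"
    merge_conditions:
      - check-success=DCO
    commit_message_template: |
      {{ title }} (#{{ number }})

      {{ body }}

      Signed-off-by: Mergify <merge-queue@example.com>

pull_request_rules:
  - name: auto-queue ready PRs
    conditions:
      - label=ready
    actions:
      queue:
        name: default
```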
🤖 Generated with [Claude Code](https://claude.com/claude-code)
---------
Co-authored-by: Claude Sonnet 4.5 <noreply@anthropic.com>
commit f15296fb966bebd2652e1a31ae106e70eff8b5e2
Author: Itay Etelis <92247226+Etelis@users.noreply.github.com>
Date: Tue Mar 3 17:07:20 2026 +0200
Refactor Matching Logic to Use compressed-tensors Utilities (#2284)
Consolidates 17 redundant matching functions into standardized
compressed-tensors APIs.
Fixes #1686
- **Deleted 15 functions** from `module.py`: `get_layers`, `get_params`,
`get_prunable_layers`, `get_quantizable_layers`, `match_targets`, etc.
- **Added 2 helpers**: `expand_special_targets()` (backward
compatibility) and `build_parameterized_layers()`
- **Updated modifiers**: SparseGPT, magnitude pruning, constant pruning
to use new APIs
- **Bug fix**: Added missing `self.targets` parameter in magnitude
pruning
---------
Signed-off-by: Itay Etelis <itay.etelis@ibm.com>
Co-authored-by: Itay Etelis <itay.etelis@ibm.com>
Co-authored-by: Brian Dellabetta <brian-dellabetta@users.noreply.github.com>
Co-authored-by: HDCharles <39544797+HDCharles@users.noreply.github.com>
commit a956d688892c2cff3757c598e4c870c796a42f78
Author: Xin He <xin3.he@intel.com>
Date: Tue Mar 3 07:45:34 2026 +0800
add qwen3 vl autoround example (#2357)
SUMMARY:
AutoRound quantization example: qwen3-vl nvfp4
TEST PLAN:
python qwen3_vl_example.py
Output:
```
Hello my name is Mihai, I am a 30 year old male, and I am currently a software engineer working in a company that develops software for the financial sector. I am a very passionate person, and I am always eager to learn new things. I have a strong interest in AI, machine learning, and data science. I am also very interested in the intersection of these fields with finance. I am currently working on a project that involves building a machine learning model to predict stock prices. I am
```
---------
Signed-off-by: Xin He <xin3.he@intel.com>
Co-authored-by: HDCharles <39544797+HDCharles@users.noreply.github.com>
Co-authored-by: Brian Dellabetta <brian-dellabetta@users.noreply.github.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
commit 2b0684c132d130b84b1b8ec9cce9f29a3239debc
Author: Omkar Kabde <omkarkabde@gmail.com>
Date: Tue Mar 3 05:06:16 2026 +0530
Remove training loggers and all related code (#2414)
SUMMARY:
Fixes #2409.
cc @kylesayrs
This PR removes training loggers and all related code, replacing their
functionality with `loguru`. It also removes other helper functions and
`FrequencyManager`.
TEST PLAN:
most tests are passing, but getting stuck at gptq test
---------
Signed-off-by: Dan Huang <dahuang@redhat.com>
Signed-off-by: Omkar Kabde <omkarkabde@gmail.com>
Co-authored-by: dhuangnm <74931910+dhuangnm@users.noreply.github.com>
Co-authored-by: Dipika Sikka <dipikasikka1@gmail.com>
Co-authored-by: Brian Dellabetta <brian-dellabetta@users.noreply.github.com>
commit 795198790668807f0c90c9f9df9842ad0cc6cc25
Author: Gilles Turpin <turpingilles15@gmail.com>
Date: Tue Mar 3 00:13:15 2026 +0100
Add SmoothQuant mapping for GlmMoeDsaForCausalLM (GLM-5) (#2419)
Part of #1442
GLM-5 (GlmMoeDsaForCausalLM) uses MLA identical to DeepSeek V2/V3 — same
projection names (q_a_proj, kv_a_proj_with_mqa). Reuses
DEEPSEEK_V2_SMOOTHQUANT_MAPPINGS which smooths input_layernorm only,
conservative choice for MoE models with fused expert parameters
(gate_up_proj 3D tensor).
Also adds Glm4MoeForCausalLM with DEFAULT_SMOOTHQUANT_MAPPINGS.
SUMMARY:
Add GLM-5 and GLM-4-MoE to SmoothQuant MAPPINGS_REGISTRY.
TEST PLAN:
Registry-only change. Verified GLM-5 layer names match DeepSeek V2
patterns by inspecting GlmMoeDsaForCausalLM in transformers.
Signed-off-by: gillesturpin <turpingilles@orange.fr>
Co-authored-by: Brian Dellabetta <brian-dellabetta@users.noreply.github.com>
Co-authored-by: HDCharles <39544797+HDCharles@users.noreply.github.com>
commit 69af79c4f2016f090deaf6e06faf73e3403e5d1d
Author: Gilles Turpin <turpingilles@orange.fr>
Date: Mon Mar 2 22:55:15 2026 +0100
Fix SmoothQuant regex to match q_a_proj in DeepSeek/GLM-5 (#2421)
Fixes #2420
The balance_layers pattern re:.*q_proj in
DEEPSEEK_V2_SMOOTHQUANT_MAPPINGS does not match q_a_proj (used by
DeepSeek V2/V3 and GLM-5). Changed to re:.*q(_a)?_proj$ as suggested by
@brian-dellabetta.
SUMMARY:
Fix regex pattern in DEEPSEEK_V2_SMOOTHQUANT_MAPPINGS to cover both
q_proj and q_a_proj.
TEST PLAN:
Verified with Python regex that the new pattern matches both layer
names:
re.fullmatch(".*q(_a)?_proj$", "model.layers.0.self_attn.q_proj") ->
match
re.fullmatch(".*q(_a)?_proj$", "model.layers.0.self_attn.q_a_proj") ->
match
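The checks from the test plan can be run directly as a small script:

```python
import re

# Updated balance_layers pattern from DEEPSEEK_V2_SMOOTHQUANT_MAPPINGS
pattern = r".*q(_a)?_proj$"

# Both DeepSeek/GLM-5 projection names should match...
assert re.fullmatch(pattern, "model.layers.0.self_attn.q_proj")
assert re.fullmatch(pattern, "model.layers.0.self_attn.q_a_proj")
# ...while unrelated projections should not.
assert re.fullmatch(pattern, "model.layers.0.self_attn.kv_b_proj") is None
```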
Signed-off-by: gillesturpin <turpingilles@orange.fr>
Co-authored-by: Brian Dellabetta <brian-dellabetta@users.noreply.github.com>
Co-authored-by: HDCharles <39544797+HDCharles@users.noreply.github.com>
commit 9e9ae3dbb2b239bc22cac1a9fc463f4895e87250
Author: Gilles Turpin <turpingilles@orange.fr>
Date: Mon Mar 2 20:05:46 2026 +0100
Add AWQ mapping for GlmMoeDsaForCausalLM (GLM-5) (#2418)
Closes #2412 (part of #1442)
GLM-5 (`GlmMoeDsaForCausalLM`) uses Multi-head Latent Attention
identical to DeepSeek V3 — same projection layer names (`q_a_proj`,
`kv_a_proj_with_mqa`, etc.) and same MoE structure. Reuses
`_deepseek_mappings`.
Also moves `Glm4MoeForCausalLM` to its correct alphabetical position in
the registry.
SUMMARY:
Add GLM-5 (GlmMoeDsaForCausalLM) to AWQ_MAPPING_REGISTRY using
_deepseek_mappings. GLM-5's MLA layer names are identical to DeepSeek
V3. Also fixes alphabetical ordering of Glm4MoeForCausalLM.
TEST PLAN:
Registry-only change (no logic modified). Verified that GLM-5 layer
names (q_a_proj, kv_a_proj_with_mqa, kv_a_layernorm, kv_b_proj, o_proj)
match the patterns in _deepseek_mappings by inspecting the
GlmMoeDsaForCausalLM source in transformers.
Signed-off-by: gillesturpin <turpingilles@orange.fr>
Co-authored-by: Brian Dellabetta <brian-dellabetta@users.noreply.github.com>
Co-authored-by: HDCharles <39544797+HDCharles@users.noreply.github.com>
commit a27d9e2e318fdb81254285c2ed5987b97897d973
Author: 김대익 <33992354+dik654@users.noreply.github.com>
Date: Tue Mar 3 03:41:47 2026 +0900
[GPTQ] Move modifier to top-level for consistent folder structure (#2368)
Move GPTQModifier from `modifiers/quantization/gptq/` to `modifiers/gptq/`
for consistent folder structure with AWQ and AutoRound (related: #2306).
- Add deprecation wrapper at old import path for backward compatibility
- Exclude old GPTQ paths from ModifierFactory to prevent duplicate
registration
- Update test and example imports to new canonical path
Import verification (all passed):
- from llmcompressor.modifiers.gptq import GPTQModifier (new path, no
warning)
- from llmcompressor.modifiers.quantization import GPTQModifier (BC, no
warning)
- from llmcompressor.modifiers.quantization.gptq import GPTQModifier
(BC, DeprecationWarning)
- ModifierFactory.refresh() registers GPTQModifier from new location
pytest (11 passed, 3 skipped for GPU):
- tests/llmcompressor/transformers/gptq/test_gptq_oneshot.py
- tests/llmcompressor/pytorch/modifiers/pruning/sparsegpt/test_pytorch.py
- tests/llmcompressor/transformers/compression/test_recipe_parsing.py
(requires GPU)
ruff check + ruff format passed
---------
Signed-off-by: 김대익 <33992354+dik654@users.noreply.github.com>
Co-authored-by: Brian Dellabetta <brian-dellabetta@users.noreply.github.com>
commit a99f159abe94dd119f6a13e5ae4004505fcd8355
Author: HDCharles <39544797+HDCharles@users.noreply.github.com>
Date: Mon Mar 2 10:44:26 2026 -0500
Smoothquant bugfixes (#2422)
Summary:
SmoothQuant wasn't actually doing anything, since it was only updating
the onloaded weights. This PR fixes that and adds a test to check the
behavior of SmoothQuant going forward.
TEST PLAN:
pytest
/home/HDCharles/repos/llm-compressor/tests/llmcompressor/modifiers/transform/smoothquant/test_base.py
-k "e2e"
Signed-off-by: HDCharles <charlesdavidhernandez@gmail.com>
commit 732316c8980913d173d3a202eeab95eda39af230
Author: Sören Dréano <71752785+SorenDreano@users.noreply.github.com>
Date: Sat Feb 28 16:25:35 2026 +0100
Add support for passing a custom DataLoader to oneshot() (#2390)
SUMMARY:
Adds a `dataloader` argument to the `oneshot` entrypoint.
Allow users to pass a pre-built PyTorch DataLoader directly via the
`dataloader` parameter, bypassing the internal dataset-to-dataloader
conversion. This is useful for custom data pipelines where users already
have a prepared DataLoader and don't need get_calibration_dataloader().
Rather than using `self.dataloader = kwargs.pop("dataloader", None)`, we
could also add a `dataloader` argument/attribute to `DatasetArguments`
if you prefer.
TEST PLAN:
This change is fairly trivial. I made sure
[https://github.com/vllm-project/llm-compressor/blob/main/examples/quantization_w4a16/README.md](https://github.com/vllm-project/llm-compressor/blob/main/examples/quantization_w4a16/README.md)
could still run, and that passing a pre-built DataLoader also works:
```python
from transformers import AutoTokenizer, AutoModelForCausalLM
MODEL_ID = "meta-llama/Meta-Llama-3-8B-Instruct"
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
from datasets import load_dataset
NUM_CALIBRATION_SAMPLES=512
MAX_SEQUENCE_LENGTH=2048
ds = load_dataset("HuggingFaceH4/ultrachat_200k", split=f"train_sft[:{NUM_CALIBRATION_SAMPLES}]")
ds = ds.shuffle(seed=42)
def preprocess(example):
    return {"text": tokenizer.apply_chat_template(example["messages"], tokenize=False)}
ds = ds.map(preprocess)
def tokenize(sample):
    return tokenizer(sample["text"], padding=False, max_length=MAX_SEQUENCE_LENGTH, truncation=True, add_special_tokens=False)
ds = ds.map(tokenize, remove_columns=ds.column_names)
from llmcompressor.datasets import get_calibration_dataloader
from llmcompressor.args import DatasetArguments
dataset_args = DatasetArguments(
dataset=ds,
max_seq_length=MAX_SEQUENCE_LENGTH,
num_calibration_samples=NUM_CALIBRATION_SAMPLES,
)
dataloader = get_calibration_dataloader(dataset_args, tokenizer)
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import GPTQModifier
recipe = GPTQModifier(targets="Linear", scheme="W4A16", ignore=["lm_head"])
oneshot(
model=model,
recipe=recipe,
dataloader=dataloader,
)
SAVE_DIR = MODEL_ID.rstrip("/").split("/")[-1] + "-W4A16-G128"
model.save_pretrained(SAVE_DIR, save_compressed=True)
tokenizer.save_pretrained(SAVE_DIR)
```
This is the exact same code from the documentation, with the DataLoader
built outside of the `oneshot` call (`dataloader =
get_calibration_dataloader(dataset_args, tokenizer)`) and passed
directly to `oneshot`.
---------
Signed-off-by: Sören Dréano <71752785+SorenDreano@users.noreply.github.com>
Signed-off-by: Brian Dellabetta <bdellabe@redhat.com>
Co-authored-by: Soren Dreano <soren@numind.ai>
Co-authored-by: Brian Dellabetta <brian-dellabetta@users.noreply.github.com>
Co-authored-by: Brian Dellabetta <bdellabe@redhat.com>
Co-authored-by: HDCharles <39544797+HDCharles@users.noreply.github.com>
commit bdb65473ba21ca6aaaf726ffe66c695f5608c953
Author: dhuangnm <74931910+dhuangnm@users.noreply.github.com>
Date: Fri Feb 27 18:06:58 2026 -0500
Bump compressed-tensors version (#2423)
SUMMARY:
Compressed-tensors 0.14.0 has been released. Bump up its version in
llmcompressor.
TEST PLAN:
All tests.
Signed-off-by: Dan Huang <dahuang@redhat.com>
commit 0c0ead359a355ea443df50f3f6c91de7d1df255d
Author: Dipika Sikka <dipikasikka1@gmail.com>
Date: Thu Feb 26 18:04:11 2026 -0500
[ReadMe] Update whats new (#2417)
SUMMARY:
Sample Build:
https://app.readthedocs.org/projects/vllm-llm-compressor/builds/31579228/
commit a9847e04a92f75d64416b133991b868ed4564bf6
Author: Dipika Sikka <dipikasikka1@gmail.com>
Date: Thu Feb 26 17:39:55 2026 -0500
[Docs] Updates (#2416)
SUMMARY:
- Fix torchrun command
- Add reference to guides in compress.md
- Update model loading table
commit fe512727a4584c79f62dba984f004e1b4f6f9277
Author: dhuangnm <74931910+dhuangnm@users.noreply.github.com>
Date: Thu Feb 26 15:09:59 2026 -0500
Improve how we identify and run e2e smoke tests (#2336)
SUMMARY:
Currently we use the file `tests/e2e/vLLM/rhaiis-e2e-smoke.list` to mark
the configs for smoke tests that we use to run for the RHAIIS image.
This is vulnerable as we need to keep the list in this file up-to-date
to any changes in the config yaml files and this is error-prone.
This PR removes the `tests/e2e/vLLM/rhaiis-e2e-smoke.list` file and uses
the config yaml files directly to mark the smoke tests. We added a new
field `test_group` to the yaml files and updated the `run_tests_in_*.sh`
scripts to parse this field and filter tests when a test group (-g) is
specified. This allows both the python and RHAIIS image testing to run
smoke and full tests for the configs.
To be more specific:
```
# to run e2e tests for all configs (default)
bash tests/e2e/vLLM/run_tests_in_python.sh -c tests/e2e/vLLM/configs -t tests/e2e/vLLM/test_vllm.py

# to run e2e tests for configs with smoke only
bash tests/e2e/vLLM/run_tests_in_python.sh -c tests/e2e/vLLM/configs -t tests/e2e/vLLM/test_vllm.py -g rhaiis
```
Similar commands for the `run_tests_in_rhaiis.sh` script.
Going forward, for any newly added e2e test configs that we want to
include in the smoke tests for the RHAIIS image, we need to remember to
add `test_group: "smoke"` to their yaml file under configs/ so they are
automatically picked up for RHAIIS image testing.
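The filtering idea could be sketched along these lines. This is not the actual `run_tests_in_*.sh` logic, and the temp directory and file names below are made up for illustration.

```shell
#!/bin/sh
# Sketch of selecting only configs whose yaml declares the smoke group.
mkdir -p /tmp/e2e-configs-demo
printf 'model: a\ntest_group: "smoke"\n' > /tmp/e2e-configs-demo/a.yaml
printf 'model: b\n' > /tmp/e2e-configs-demo/b.yaml

selected=""
for cfg in /tmp/e2e-configs-demo/*.yaml; do
  # keep only configs that declare the requested test group
  if grep -q 'test_group: "smoke"' "$cfg"; then
    selected="$selected $cfg"
  fi
done
echo "selected:$selected"
```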
TEST PLAN:
A successful run of the smoke tests is here:
https://github.com/neuralmagic/llm-compressor-testing/actions/runs/21727920814
---------
Signed-off-by: Dan Huang <dahuang@redhat.com>
Signed-off-by: Dipika Sikka <dipikasikka1@gmail.com>
Co-authored-by: Dipika Sikka <dipikasikka1@gmail.com>
commit d0228407111ad6a70fa74c933cd138ab0404a9f6
Author: Dipika Sikka <dipikasikka1@gmail.com>
Date: Thu Feb 26 11:41:09 2026 -0500
[Example Testing] Remove and update example test cases (#2406)
SUMMARY:
- Remove out-dated cases
- Add more up-to-date cases (e.g. disk offload, DDP, model_free_ptq),
examples, and models
- Ensure all cases are verified for correct compression format
- Add an optional `qwen` install to enable qwen VL examples which
leverage `qwen_vl_utils`
- Will require
https://github.com/neuralmagic/llm-compressor-testing/pull/219 for
example testing
With these changes, all examples pass:
https://github.com/neuralmagic/llm-compressor-testing/actions/runs/22450404023
---------
Signed-off-by: Dipika Sikka <dipikasikka1@gmail.com>
commit 12aa5639a3276bb7fe493a0a2158e9846c63f3ff
Author: dhuangnm <74931910+dhuangnm@users.noreply.github.com>
Date: Wed Feb 25 17:17:21 2026 -0500
[WIP] Update dependency bounds for new release (#2407)
SUMMARY:
Update llmcompressor dependency bounds except for compressed-tensors,
which will be updated after the compressed-tensors 0.14.0 is released.
TEST PLAN:
All tests
---------
Signed-off-by: Dan Huang <dahuang@redhat.com>
Co-authored-by: Dipika Sikka <dipikasikka1@gmail.com>
commit 81ec39c1c36c7f1d092dbd518591ca1bfb171c18
Author: Kyle Sayers <kylesayrs@gmail.com>
Date: Wed Feb 25 14:45:45 2026 -0500
[Offload] Convert model back to CT offloading for testing (#2403)
* Fix testing which requires access to the model after the model has
been saved
* https://github.com/vllm-project/compressed-tensors/pull/601
* Convert back to CT offloading after converting to accelerate
offloading for saving
* Previously we just "removed dispatch", but this is bad practice as it
won't work for disk offloading
Signed-off-by: Kyle Sayers <kylesayrs@gmail.com>
Co-authored-by: Dipika Sikka <dipikasikka1@gmail.com>
commit c6e4d38dde4471874e4a3100f928cd3fef473cd5
Author: HDCharles <39544797+HDCharles@users.noreply.github.com>
Date: Wed Feb 25 14:20:44 2026 -0500
[dist][moe] fix add moe_context for big models (#2405)
Summary:
For large models like Qwen/Qwen3-VL-235B-A22B-Instruct, adding the MoE
calibration context can take different lengths of time on different
threads, and for larger models this difference can exceed the NCCL
timeout.
Fix: add a sync point at each module, since we're rate-limited to the
slowest thread as is. At some point this should be changed to add the
MoE calibration context in parallel and broadcast the updated modules.
TEST PLAN:
tested e2e
<details>
```
# qwen3_vl_235b_moe_gptq_int4_ddp_example.py
# currently supported for Qwen3-VL-MoE
from compressed_tensors.offload import init_dist, load_offloaded_model
from transformers import AutoProcessor, Qwen3VLMoeForConditionalGeneration

from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import GPTQModifier

MODEL_ID = "Qwen/Qwen3-VL-235B-A22B-Instruct"

init_dist()
with load_offloaded_model():
    model = Qwen3VLMoeForConditionalGeneration.from_pretrained(
        MODEL_ID, dtype="auto", device_map="auto_offload"
    )
processor = AutoProcessor.from_pretrained(MODEL_ID)

recipe = GPTQModifier(
    targets="Linear",
    scheme="W4A16",
    ignore=[
        "re:.*lm_head",
        "re:visual.*",
        "re:model.visual.*",
        "re:.*mlp.gate$",
    ],
)
oneshot(model=model, recipe=recipe)

import torch

SAVE_DIR = (
    MODEL_ID.rstrip("/").split("/")[-1]
    + "-GPTQ-W4A16-G128-DDP"
    + str(torch.distributed.get_world_size())
)
model.save_pretrained(SAVE_DIR, save_compressed=True)
processor.save_pretrained(SAVE_DIR)
```
</details>
Signed-off-by: HDCharles <charlesdavidhernandez@gmail.com>
commit f18d6e384fb9244c82eeb5ce715c3c54b4a91313
Author: HDCharles <39544797+HDCharles@users.noreply.github.com>
Date: Wed Feb 25 13:09:01 2026 -0500
fix ddp for nvfp4 on A100 (#2404)
depends on https://github.com/vllm-project/compressed-tensors/pull/603
Summary:
NCCL does not allow broadcasting fp8 on A100, but we can work around it
with this util.
Test Plan:
<details>
Test Script
</details>
Signed-off-by: HDCharles <charlesdavidhernandez@gmail.com>
commit ff526d72e41b3e13ae9df4f0d0524764751cd2ec
Author: Kyle Sayers <kylesayrs@gmail.com>
Date: Wed Feb 25 11:27:12 2026 -0500
[Docs] Add Sequential Onloading, Disk Offloading, and Distributed Oneshot Docs (#2396)
* Add documentation for new features in v0.10.0
* Add up-to-date documentation on sequential onloading
* Add docs page for Sequential Onloading
* Add docs page for Model Loading
* Add docs page for Distributed Oneshot
* Fix the path of observers.md
* Slightly change wording on docs home page
* Add redirect to model loading docs in disk offloading examples folder
---------
Signed-off-by: Kyle Sayers <kylesayrs@gmail.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
commit 1e4d3c5bca95ac75fc301005d1fe5b2adca9a955
Author: Brian Dellabetta <brian-dellabetta@users.noreply.github.com>
Date: Wed Feb 25 11:09:57 2026 -0500
[Examples] Remove diagnostic `model.generate` calls for models with 40B+ parameters (#2401)
SUMMARY:
Remove all calls to `model.generate` in examples involving models with
~40B+ parameters. Anything smaller should run on a single 80GB GPU.
TEST PLAN:
n/a
---------
Signed-off-by: Brian Dellabetta <bdellabe@redhat.com>
Co-authored-by: Dipika Sikka <dipikasikka1@gmail.com>
commit f0a1824bc5440597d071bcc21bd8ad01bd8b0038
Author: Dipika Sikka <dipikasikka1@gmail.com>
Date: Wed Feb 25 11:03:02 2026 -0500
[Tests][LM Eval] Fix test seeding for consistent results (#2395)
SUMMARY:
- Enables consistent test results before runs
Test Run:
https://github.com/neuralmagic/llm-compressor-testing/actions/runs/22371360237
---------
Signed-off-by: Dipika Sikka <dipikasikka1@gmail.com>
commit b0cc7a05f7f6916d5757f452f7147e066f318451
Author: Dipika Sikka <dipikasikka1@gmail.com>
Date: Wed Feb 25 10:54:34 2026 -0500
[Docs] Clean-up + Example ReadMe updates (#2399)
SUMMARY:
- Remove marlin24 examples
- Clean-up existing README docs
- Add examples/README.md file explaining repo structure
- Update MoE README.md
commit 778abe815c226669753308ea9ee76ee91186db26
Author: Dipika Sikka <dipikasikka1@gmail.com>
Date: Tue Feb 24 13:43:53 2026 -0500
[Docs] Remove finetune examples (#2398)
SUMMARY:
- Remove old finetune examples
- Remove old maintainers file as redundant with CODEOWNERS
commit 9b7fb9f77159967f90b66b37be5ea7bc21532504
Author: Bartowski <3266127+bartowski1182@users.noreply.github.com>
Date: Tue Feb 24 12:21:09 2026 -0500
Add AFMOE mappings for awq and smoothquant (#2316)
SUMMARY:
These mappings are needed to properly apply AWQ and smoothquant to the
Trinity series of models, AfmoeForCausalLM
TEST PLAN:
Quality was tested with benchmarks, without these changes the benchmark
results were extremely low, with these changes it was close to margin of
error compared to bf16/FP8 dynamic
Can test on Trinity-Large-Preview
https://huggingface.co/arcee-ai/Trinity-Large-Preview
Test code for quantization:
https://gist.github.com/bartowski1182/b7e05f6c96735ec5d03f234d37e11e4d
---------
Signed-off-by: Colin Kealty <3266127+bartowski1182@users.noreply.github.com>
Signed-off-by: Bartowski <3266127+bartowski1182@users.noreply.github.com>
Co-authored-by: Dipika Sikka <dipikasikka1@gmail.com>
Co-authored-by: Brian Dellabetta <brian-dellabetta@users.noreply.github.com>
Co-authored-by: HDCharles <39544797+HDCharles@users.noreply.github.com>
commit be12cc6d70f3fb3fdd6b0bbe0a8ba35f19b549d9
Author: Dipika Sikka <dipikasikka1@gmail.com>
Date: Tue Feb 24 11:25:46 2026 -0500
[Docs] Reorganize + Additional Guides (#2379)
SUMMARY:
- Add choosing a model
- Add choosing a dataset
- Re-organize to set-up a step-by-step compression guide
- Additional clean-up and organization
Sample Doc Generation:
https://vllm--2379.org.readthedocs.build/projects/llm-compressor/en/2379/
---------
Signed-off-by: Dipika Sikka <ds3822@columbia.edu>
Co-authored-by: Brian Dellabetta <brian-dellabetta@users.noreply.github.com>
commit 986ac236f3bbdc95c8e47072fb33474511aee962
Author: Kyle Sayers <kylesayrs@gmail.com>
Date: Mon Feb 23 13:03:00 2026 -0500
[Misc] Remove usages of `update_parameter_data` (#2393)
* Begin deprecation of `update_parameter_data` in favor of
`update_offload_parameter`
---------
Signed-off-by: Kyle Sayers <kylesayrs@gmail.com>
commit 5c757a6985d32ee74b7a2c30349c852624cf4100
Author: Kyle Sayers <kylesayrs@gmail.com>
Date: Mon Feb 23 11:59:36 2026 -0500
[Offloading] Support Disk Offloading (#2373)
* Support disk offloading for very large models
* [[Offload] Convert accelerate for
loading/saving](https://github.com/vllm-project/compressed-tensors/pull/572/)
* Add `examples/disk_offloading/qwen3_example.py`
* Add `examples/disk_offloading/kimi_k2_example.py`
* Remove post-processing step where `remove_dispatch` is called
* Previously, this was used to avoid conflicts between
`dispatch_for_sequential` and `dispatch_for_generation`.
* Now, the two functions are directly compatible: you don't need to
remove the dispatch of one to use the other
* Add `to_accelerate` to `save_pretrained_wrapper`
* This ensures that the model is converted to `accelerate` offloading
before saving
* This ensures the best compatibility with `save_pretrained`, and
reduces excess memory usage which would cause gpu/cpu ooms
* During oneshot preprocessing, convert `from_accelerate` if possible.
This guards against users who load their model outside of the
`load_offloaded_model` context
* Remove `offload_device` argument from `dispatch_for_sequential` to
avoid deprecation warning
* `dispatch_for_sequential` now always respects the device the model was
loaded on
* Ran `Qwen/Qwen3-0.6B` example to completion
* [IN PROGRESS] Run `unsloth/Kimi-K2-Instruct-0905-BF16` example to
completion
---------
Signed-off-by: Kyle Sayers <kylesayrs@gmail.com>
commit 5f63d7a9a6ae0f9944e69ff87ea5cce31f923ae2
Author: HDCharles <39544797+HDCharles@users.noreply.github.com>
Date: Thu Feb 19 14:17:53 2026 -0500
[GPTQ][ddp] enabling DDP for GPTQ (#2333)
After the changes in
https://github.com/vllm-project/compressed-tensors/pull/572
https://github.com/vllm-project/compressed-tensors/pull/534
https://github.com/vllm-project/llm-compressor/pull/2340 we're ready to
start rolling out DDP implementations of various modifiers
The Api we've landed on attempts to maintain the normal flow with
minimal changes necessary to enable DDP:
1) the user will call torchrun --nproc_per_node=<num_threads> script.py
to start the script
2) the user will initialize the distributed context, (they can use the
helper init_dist to do this)
3) the user will load the model using the new context manager, setting
the device map as outlined
[here](https://github.com/vllm-project/compressed-tensors/pull/572).
(For most users this will be "auto_offload")
4) (optional) the user can partition the dataset at load time using
get_rank_partition or just load as normal and oneshot will partition the
data later (will load 1 copy of dataset into cpu memory for each rank
which may be onerous)
```python
from compressed_tensors.offload import load_offloaded_model, init_dist
init_dist()
with load_offloaded_model():
    model = AutoModelForCausalLM.from_pretrained(MODEL_ID, device_map="auto_offload")
...
ds = load_dataset(
    DATASET_ID, split=get_rank_partition(DATASET_SPLIT, NUM_CALIBRATION_SAMPLES)
)
```
Adding the DDP process to GPTQ was relatively straightforward, though
optimizing it for speed was a bit trickier. There are 4 steps:
1) assigning each module to a rank which it will be compressed by
2) for each module assigned to a rank, having all hessian information
sent by other ranks to the assigned rank
3) each rank compresses the modules that it was assigned
4) broadcast the final quantized values to all ranks
Step 1 required the largest optimization: without any load balancing, we
ran into situations where one rank could be doing twice as much work as
another. Thus we implemented basic load balancing and time estimation
that seems to be working well in practice. The other major optimization
was using asynchronous ops for thread to thread communication. Before
these optimizations, 2 thread GPTQ was as fast as 1 thread GPTQ for
llama3-8B, afterward it results in a 27% speedup despite being a
relatively small model.
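The rank-assignment step (step 1) can be sketched as a greedy longest-processing-time heuristic; the function and argument names below are illustrative, not llm-compressor's actual API:

```python
import heapq

def assign_modules_to_ranks(module_costs, world_size):
    """Greedy LPT load balancing: hand each module (by estimated
    compression time, largest first) to the currently least-loaded rank."""
    # min-heap of (total_assigned_cost, rank), one entry per rank
    heap = [(0.0, rank) for rank in range(world_size)]
    heapq.heapify(heap)
    assignment = {}
    for name, cost in sorted(module_costs.items(), key=lambda kv: -kv[1]):
        total, rank = heapq.heappop(heap)
        assignment[name] = rank
        heapq.heappush(heap, (total + cost, rank))
    return assignment
```

With reasonable per-module time estimates, this keeps the per-rank totals close to balanced, avoiding the case where one rank does twice the work of another.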
| model_id | world_size | max_time | max_memory | save_time | flex_extract | eval_time |
|----------|------------|----------|------------|-----------|--------------|-----------|
| Meta-Llama-3-8B-Instruct | 1 | 745.03 | 5.82 | 19.57 | 0.7066 | 95.28 |
| Meta-Llama-3-8B-Instruct | 2 | 372.20 | 5.57 | 49.10 | 0.7089 | 95.24 |
| Meta-Llama-3-8B-Instruct | 4 | 264.07 | 5.82 | 52.50 | 0.7180 | 96.74 |
| Qwen3-30B-A3B | 1 | 14207.53 | 6.56 | 748.23 | 0.8704 | 209.93 |
| Qwen3-30B-A3B | 2 | 7018.25 | 6.36 | 696.65 | 0.8810 | 205.89 |
| Qwen3-30B-A3B | 4 | 3694.46 | 6.36 | 723.05 | 0.8832 | 217.62 |
While validating the numerical accuracy of the DDP technique, we noticed
that accuracy improved significantly with each thread added. After some
debugging we realized this was because the existing [hessian
calculation](https://github.com/vllm-project/llm-compressor/pull/2333/changes#diff-18d1319f01629ca65cc54f955dc6177f6dd025f057013932b2ed29842854f3ecL61-L65)
was causing an accumulation of floating point errors. By rewriting the
hessian calculation to sum the intermediate hessians and only divide by
num_samples at the end, we improved the GSM8K evaluation from (.67, .66)
to (.71, .71). You can repro these results
[here](https://github.com/vllm-project/llm-compressor/pull/2333/changes#diff-d31ce0453051853c17ba2a5225b3d1bfab548e095bab0967d6acfd1b3ce1b35d)
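The rewritten hessian accumulation can be sketched as follows (NumPy stands in for torch here, and the factor of 2 follows common GPTQ reference implementations; this is an illustration, not the PR's exact code):

```python
import numpy as np

def hessian_sum_then_divide(batches):
    """Accumulate X^T X per batch and divide by the total sample count
    only once at the end, instead of rescaling the running average at
    every step, which lets floating point error accumulate."""
    h_sum = None
    num_samples = 0
    for x in batches:  # x: (batch, hidden)
        x = x.astype(np.float64)
        h_sum = x.T @ x if h_sum is None else h_sum + x.T @ x
        num_samples += x.shape[0]
    return 2.0 * h_sum / num_samples
```

Deferring the division is what removed the per-step rounding that degraded GSM8K accuracy in the old implementation.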
---------
Signed-off-by: HDCharles <charlesdavidhernandez@gmail.com>
commit 881dd462975a92551685b5507dfa1272f8c40bb8
Author: Kyle Sayers <kylesayrs@gmail.com>
Date: Thu Feb 19 12:55:13 2026 -0500
[Bugfix] Reduce device movement while checking layer divisibility (#2385)
* Improve runtime and memory usage by checking the shape of the
offloaded weight, not the onloaded weight
* Wrap all calls to `_layer_indivisible` with the `disable_onloading`
context
Signed-off-by: Kyle Sayers <kylesayrs@gmail.com>
commit 70b610acb234e095f664d361412f6a4e9ef2ff09
Author: Brian Dellabetta <brian-dellabetta@users.noreply.github.com>
Date: Thu Feb 19 11:41:32 2026 -0500
[Observers] Allow for case when weight shape and block size are not evenly divisible (#2283)
SUMMARY:
Update observer logic for block strategy when weight shape is not
divisible by block size
Prerequisite:
- [x] https://github.com/vllm-project/compressed-tensors/pull/547
TEST PLAN:
- [x] Quantized checkpoint made with this branch (and above CT branch)
runs on vllm main for flashinfer, deepgemm and default kernels --
https://huggingface.co/bdellabe/DeepSeek-V2-Lite-FP8-BLOCK
Run script below with
- `VLLM_BLOCKSCALE_FP8_GEMM_FLASHINFER=1 VLLM_USE_DEEP_GEMM=1` for
flashinfer
- `VLLM_BLOCKSCALE_FP8_GEMM_FLASHINFER=0 VLLM_USE_DEEP_GEMM=1` for
deepgemm
- `VLLM_BLOCKSCALE_FP8_GEMM_FLASHINFER=0 VLLM_USE_DEEP_GEMM=0` for
default
```python
if __name__ == "__main__":
from vllm import LLM, SamplingParams
prompts = ["The Swiss Alps are", "Brad Marchand is", "The Toronto Maple Leafs are"]
sampling_params = SamplingParams(
temperature=0.80, top_p=0.95, max_tokens=40, min_tokens=10
)
llm = LLM(
"bdellabe/DeepSeek-V2-Lite-FP8-BLOCK",
max_model_len=4096,
enforce_eager=True,
)
output = llm.generate(prompts, sampling_params)
for out in output:
print(out.outputs[0].text)
print("COMPLETE")
```
---------
Signed-off-by: Brian Dellabetta <bdellabe@redhat.com>
commit d0ce1d827ce3981eaf173f47452f66793d8d1d78
Author: Itay Etelis <92247226+Etelis@users.noreply.github.com>
Date: Thu Feb 19 11:35:03 2026 +0200
move smoothquant to transforms (#2314)
Moves `SmoothQuantModifier` from `modifiers/smoothquant/` to
`modifiers/transform/smoothquant/` to correctly categorize it as a
transform rather than a modifier.
Closes #2306
- Moved SmoothQuant source files to `modifiers/transform/smoothquant/`
- Moved corresponding test files
- Updated all imports across examples, docs, and dependent code
- Exported `SmoothQuantModifier` from `modifiers.transform`
```python
from llmcompressor.modifiers.transform.smoothquant import SmoothQuantModifier
```
---------
Signed-off-by: Itay Etelis <itayetelis@gmail.com>
Signed-off-by: Itay Etelis <itay.etelis@ibm.com>
Co-authored-by: Itay Etelis <itay.etelis@ibm.com>
commit 936e0a701e55e8d9f9b9145b64673510bfe2a79c
Author: Dipika Sikka <dipikasikka1@gmail.com>
Date: Thu Feb 19 01:15:24 2026 -0500
[Tests][e2e] Release memory before running vLLM (#2375)
Co-authored-by: Kyle Sayers <kylesayrs@gmail.com>
commit 2e469aa913c41d0c832d3a0a5785751b48e065ed
Author: Kyle Sayers <kylesayrs@gmail.com>
Date: Thu Feb 19 00:59:14 2026 -0500
[Bugfix] Fix circular references when activation offload device is cuda (#2387)
tensors. This is a good approach, but has an edge case where, if the
value of entry is identical to the key of the entry, then the key will
never be garbage collected.
This can occur if the user specifies `sequential_offload_device="cuda"`,
or if the AWQ offload device is "cuda" (default true in most cases).
* Fix memory leak in AWQ which led to very high CUDA memory usage
* Guard against entries into the `WeakKeyDictionary` where the key and
value are identical
* Misc
* Move `OverrideEqMode` to the bottom of the `pipelines/cache.py`
* Remove `_fp16_baseline_cache`, which was not being used
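The `WeakKeyDictionary` edge case can be demonstrated with a small guard sketch (plain Python objects stand in for tensors; `safe_put` is a hypothetical name, not the actual fix):

```python
import gc
import weakref

class Tensor:
    """Hypothetical stand-in for a torch tensor (any weakref-able object)."""

def safe_put(cache, key, value):
    """Store value under key, skipping self-referential entries.

    A WeakKeyDictionary drops an entry once its key is garbage
    collected, but it holds a *strong* reference to the value. If the
    value is the key itself, the entry keeps its own key alive forever,
    which is the leak described above.
    """
    if value is key:
        return  # guard: storing would pin the key via its own value
    cache[key] = value

cache = weakref.WeakKeyDictionary()
t = Tensor()
safe_put(cache, t, t)         # skipped by the guard
safe_put(cache, t, Tensor())  # stored normally
del t
gc.collect()                  # the entry disappears with its key
```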
| Before Changes | After Changes |
| - | - |
| <img width="640" height="480" alt="awq_before"
src="https://github.com/user-attachments/assets/07714321-4b2f-49b7-aa2b-5c745a60d2f4"
/> | <img width="640" height="480" alt="awq_after"
src="https://github.com/user-attachments/assets/336b0e98-c24c-4e0c-a873-3166effc32b7"
/> |
---------
Signed-off-by: Kyle Sayers <kylesayrs@gmail.com>
Co-authored-by: HDCharles <39544797+HDCharles@users.noreply.github.com>
commit 9979e9829ba034ee323b24a841e2288572c594df
Author: Kyle Sayers <kylesayrs@gmail.com>
Date: Wed Feb 18 18:20:15 2026 -0500
[`model_free_ptq`] Earlier Shape Validation (#2372)
* Add earlier shape validation, at the cost of loading tensors twice
* Add a validation step which loads tensors and validates their shapes
* Misc
* Add `iter_quantizable_tensors` to reduce code reuse
* Added `tests/llmcompressor/pipelines/test_model_free_validation.py`
------
[Codex
Task](https://chatgpt.com/codex/tasks/task_e_69936b53d28c8327aa0b784040c34734):
I had to do significant cleanup to make this multithreaded/fix
duplicated code.
---------
Signed-off-by: Kyle Sayers <kylesayrs@gmail.com>
Co-authored-by: HDCharles <39544797+HDCharles@users.noreply.github.com>
commit 7ac94483be892f1e522079065c5b6ad4ff683c79
Author: D!NE$H <67671800+gDINESH13@users.noreply.github.com>
Date: Thu Feb 19 03:15:07 2026 +0530
input_id not required for Step3-VL-10B (#2370)
SUMMARY:
Closes #2272
Adds TypeError exception handling to `get_embeddings()` to support
models with non-standard `get_input_embeddings()` implementations,
specifically Step3-VL-10B.
As @kylesayrs mentioned, Step3-VL-10B has a non-standard implementation
of get_input_embeddings() that requires an input_ids parameter. This fix
gracefully handles the TypeError that occurs when calling this method
without the required parameter, allowing quantization to proceed.
TEST PLAN:
I do not have system specs to run this model locally. But testing would
be running
`examples/quantization_w8a8_int8/llama3_example.py` just changing
model_id to "stepfun-ai/Step3-VL-10B"
It should gracefully handle the TypeError raised when `input_ids` is required.
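The described fallback can be sketched roughly like this (a hypothetical `get_embeddings` shape; the actual fix in llm-compressor may differ):

```python
def get_embeddings(model, input_ids=None):
    """Fetch the input embeddings, tolerating models like Step3-VL-10B
    whose get_input_embeddings() takes a required input_ids argument."""
    try:
        return model.get_input_embeddings()
    except TypeError:
        # non-standard signature: retry with input_ids when available
        if input_ids is not None:
            return model.get_input_embeddings(input_ids)
        raise
```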
Signed-off-by: gDINESH13 <dinesh13g@gmail.com>
Co-authored-by: HDCharles <39544797+HDCharles@users.noreply.github.com>
commit f4a1490d09b65bb125ecbd855f0866f4ef80cd1a
Author: Avishek Goswami <86944690+GOavi101@users.noreply.github.com>
Date: Thu Feb 19 00:49:10 2026 +0530
DataLoader options, single-pass weight calibration, optional sequential prefetch (#2349)
Performance-oriented changes for calibration and weight quantization:
DataLoader tuning when workers are used, a single-pass weight
calibration in `QuantizationModifier`, and an optional sequential
prefetch to overlap onload with forward. Defaults stay safe for low
RAM/GPU.
- **`pin_memory`**: Set to `True` only when CUDA is available **and**
`dataloader_num_workers > 0` (avoids extra pinned memory when
`num_workers=0`).
- When `num_workers > 0`: set `persistent_workers=True` and
`prefetch_factor=2` for faster calibration.
(in `args/dataset_arguments.py`, `entrypoints/oneshot.py`)
- New argument **`sequential_prefetch`** (default **`False`**).
- When `False`: same as before — one batch on GPU at a time (low peak
memory).
- When `True`: prefetch next batch in a background thread to overlap
onload with forward (faster when GPU memory allows two batches).
- `dataloader_num_workers` default remains **0** (low-memory safe); help
text updated.
- `sequential_prefetch` added to `DatasetArguments` and `oneshot()` with
default `False`.
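The DataLoader defaults above can be sketched as a small helper (a hypothetical name; `cuda_available` is passed in rather than probed so the sketch stays self-contained):

```python
def calibration_loader_kwargs(num_workers: int, cuda_available: bool) -> dict:
    """Choose DataLoader kwargs following the policy described above."""
    kwargs = {"num_workers": num_workers}
    # pinned host memory only pays off for CUDA copies, and pinning with
    # num_workers=0 just wastes page-locked RAM in the main process
    kwargs["pin_memory"] = cuda_available and num_workers > 0
    if num_workers > 0:
        kwargs["persistent_workers"] = True  # keep workers alive across epochs
        kwargs["prefetch_factor"] = 2        # batches prefetched per worker
    return kwargs
```

The resulting dict can be splatted into `torch.utils.data.DataLoader(dataset, **kwargs)`.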
---------
Signed-off-by: Avishek Goswami <avishek.goswami@ibm.com>
Signed-off-by: HDCharles <charlesdavidhernandez@gmail.com>
Co-authored-by: Avishek Goswami <avishek.goswami@ibm.com>
Co-authored-by: HDCharles <39544797+HDCharles@users.noreply.github.com>
Co-authored-by: HDCharles <charlesdavidhernandez@gmail.com>
commit 36c30ee5848427046d006c7fc9cb46113c7ac5ba
Author: Kyle Sayers <kylesayrs@gmail.com>
Date: Tue Feb 17 17:12:59 2026 -0500
[Examples] Deprecate `dispatch_for_generation` in favor of `dispatch_model` (#2376)
* Start using `dispatch_model` as a primitive instead of
`dispatch_for_generation`, which doesn't add anything but indirection
* Find and replace `dispatch_for_generation` -> `dispatch_model`
* Add deprecation warning to `dispatch_for_generation`
---------
Signed-off-by: Kyle Sayers <kylesayrs@gmail.com>
commit ef70f436188e919ae572b8ece384942d00f09d4d
Author: Avishek Goswami <86944690+GOavi101@users.noreply.github.com>
Date: Tue Feb 17 23:12:12 2026 +0530
feat: early group-size divisibility check with layer FQNs (#2353)
Add an early check so users hit a clear error at `initialize()` (before
long calibration e.g. GPTQ) when using group/tensor-group quantization
on layers whose weight columns are not divisible by `group_size`,
instead of failing at save with an opaque message.
- **Policy:** Only GROUP and TENSOR_GROUP require strict divisibility
(those kernels don’t support non-divisible shapes). BLOCK is
intentionally not checked (block kernels support non-divisible). This is
centralized in `group_size_validation.py`.
- **Early error:** We fail during `initialize_quantization()` and raise
with:
- The exact layer FQNs and `(columns, group_size)` for each problematic
layer
- Instructions to add those names to the modifier’s `ignore` list
- **Tests:** Added tests for the validation helper and for the modifier
(raises with expected message, succeeds when layers are ignored or all
divisible).
Fixes #1983
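A minimal sketch of the divisibility check, assuming weight shapes are available as `(rows, columns)` tuples; names are illustrative, not the actual `group_size_validation.py` helpers:

```python
def find_group_size_violations(named_weight_shapes, group_size, strategy):
    """Return (fqn, columns, group_size) for every layer whose weight
    column count is not divisible by group_size. Only the GROUP and
    TENSOR_GROUP strategies need strict divisibility; BLOCK kernels
    tolerate non-divisible shapes and are intentionally skipped."""
    if strategy not in ("group", "tensor_group"):
        return []
    return [
        (fqn, shape[1], group_size)
        for fqn, shape in named_weight_shapes.items()
        if shape[1] % group_size != 0
    ]
```

Raising at `initialize()` with this list lets the error name the exact layer FQNs to add to the modifier's `ignore` list.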
---------
Signed-off-by: Avishek Goswami <avishek.goswami@ibm.com>
Co-authored-by: Avishek Goswami <avishek.goswami@ibm.com>
Co-authored-by: HDCharles <39544797+HDCharles@users.noreply.github.com>
Co-authored-by: Dipika Sikka <dipikasikka1@gmail.com>
commit ccc26f1c7f01ba8256efffe28549ad6775044fb7
Author: Cassie Jeon <cajeon@redhat.com>
Date: Tue Feb 17 11:11:09 2026 -0500
First draft for INFERENG-2666 (#2251)
SUMMARY:
This is a first draft for INFERENG-2666. This draft covers Llama4,
Qwen3, Kimi K2, and Mistral models for FP8 quantization.
TEST PLAN:
N/A. Documentation and code examples will need to be verified and
reviewed by developers.
Additional questions for reviewers:
1. Should all the examples be in one page? Or should I separate the
examples into separate pages for each model? This is for FP8, but I know
FP4 will also need documentation so wanted to get your thoughts if FP4
examples should also be one document or separated by model.
2. Are there any specific wording or content that should be called out
before the examples for each model?
3. I modeled the draft from [this Example
page](https://docs.vllm.ai/projects/llm-compressor/en/latest/examples/quantization_w8a8_fp8/)
that Dipika had initially pointed out. Let me know if you think I should
organize the information differently.
Signed-off-by: Cassie Jeon <cajeon@redhat.com>
commit cc3eed27da218662c629451ecdc7bac558873d30
Author: Kyle Sayers <kylesayrs@gmail.com>
Date: Tue Feb 17 10:15:19 2026 -0500
[Bugfix] Guard against MLA (#2337)
* Support INT4 quantization of models with MLA attention
* As of https://github.com/vllm-project/compressed-tensors/pull/533, MLA
attention is considered an attention module
* However, checking for submodule.q_proj fails for MLA, since MLA does
not have a q_proj
* Guard against layers without q_proj
* Able to quantize MLA model
Signed-off-by: Kyle Sayers <kylesayrs@gmail.com>
Co-authored-by: HDCharles <39544797+HDCharles@users.noreply.github.com>
commit 0d556a7da6c047b583a24b5e702ba2bfa647e05a
Author: Kyle Sayers <kylesayrs@gmail.com>
Date: Tue Feb 17 09:37:24 2026 -0500
[Sequential Pipeline] only cache unique offloaded values (#2366)
Updated by @brian-dellabetta
SUMMARY:
The SequentialPipeline offloads subgraph outputs as part of normal
usage. Occasionally these outputs share duplicates in kwargs that point
to the same memory location on the onloaded device. When offloading is
enabled, there was previously no check to see if any tensors to be
offloaded had already previously been offloaded, which can cause a huge
increase in memory requirements in some models, as reported in #2363.
This PR
- [x] adds an offload map to IntermediatesCache to ensure tensors are
not redundantly offloaded
- [x] wraps the map in an override to ensure `torch.equal` is used
rather than `torch.eq` (which is the one used with `==` checks).
`torch.eq` can return multiple boolean values depending on the tensors
being compared, resulting in an error. This override, which should only
be used when the tensors are immutable (the case here), allows us to
retain the original hashing function and have an `O(1)` lookup. Our
other attempts to circumvent the issue added to runtime or required
`O(N)` lookup.
Resolves #2363
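The override idea can be sketched with a wrapper key (NumPy's `array_equal` stands in for `torch.equal`; this is a hypothetical illustration, not the actual `OverrideEqMode`):

```python
import numpy as np

class TensorKey:
    """Hashes by object identity, keeping the original O(1) lookup, but
    compares by a single boolean value check rather than elementwise ==.
    Safe only when the wrapped tensors are immutable, as in the cache."""
    def __init__(self, tensor):
        self.tensor = tensor
    def __hash__(self):
        return id(self.tensor)  # same storage object -> same hash bucket
    def __eq__(self, other):
        # array_equal returns one bool, unlike elementwise `==`
        return np.array_equal(self.tensor, other.tensor)

def offload_once(offload_map, tensor, offload_fn):
    """Offload a tensor only if it was not already offloaded."""
    key = TensorKey(tensor)
    if key not in offload_map:
        offload_map[key] = offload_fn(tensor)
    return offload_map[key]
```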
TEST PLAN:
- [x] Unit test added for `OverrideEqMode`
- [x] Script from #2363 runs with ~81GB CPU RAM after first layer
propagation, increased to ~88GB CPU RAM used by layer 11/49, and then
stays consistently <89GB CPU RAM used by layer 25/49. On current main,
this script would hit ~750GB CPU RAM usage during first layer
propagation
---------
Signed-off-by: Kyle Sayers <kylesayrs@gmail.com>
Signed-off-by: Brian Dellabetta <bdellabe@redhat.com>
Signed-off-by: Brian Dellabetta <brian-dellabetta@users.noreply.github.com>
Co-authored-by: Brian Dellabetta <brian-dellabetta@users.noreply.github.com>
Co-authored-by: Brian Dellabetta <bdellabe@redhat.com>
commit 556b50306657186c7ca21b99d578491edc0f0a43
Author: Kyle Sayers <kylesayrs@gmail.com>
Date: Mon Feb 16 11:36:12 2026 -0500
[Misc] Reword warning message to make log grepping easier (#2312)
* Make it easier to find failures in logs by removing the word "failed"
from this very common warning
Signed-off-by: Kyle Sayers <kylesayrs@a100-08.nemg-001.lab.rdu2.dc.redhat.com>
Co-authored-by: Kyle Sayers <kylesayrs@a100-08.nemg-001.lab.rdu2.dc.redhat.com>
Co-authored-by: HDCharles <39544797+HDCharles@users.noreply.github.com>
commit b6c331e2fa8faabf851a48b5458ccd9632e6206b
Author: HDCharles <39544797+HDCharles@users.noreply.github.com>
Date: Mon Feb 16 10:18:30 2026 -0500
[ddp] fixing data slice bug (#2361)
Summary:
that's not how you slice a dataset, previously not tested with
world_size==1
Test Plan:
[script](https://gist.github.com/HDCharles/282950166fd0c95a7a2594fe922bcb53)
(world_size==1)
---------
Signed-off-by: HDCharles <charlesdavidhernandez@gmail.com>
Signed-off-by: HDCharles <39544797+HDCharles@users.noreply.github.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
commit 6ddd0361e41f65d86a08889efd58c7dec00282e3
Author: ZewenShen-Cohere <zewen.shen@cohere.com>
Date: Fri Feb 13 14:17:41 2026 -0500
[AWQ] Add activation_hook_target field for custom activation cache hooking (#2346)
- Adds an optional `activation_hook_target` field to `AWQMapping` that
lets users specify which submodule (relative to the parent/LCA) to hook
for activation caching, replacing the hardcoded `hasattr(parent, 'mlp')`
workaround for MoE models with parallel transformer blocks.
- When `activation_hook_target` is `None` (default), behavior is
unchanged: the hook is placed on `balance_layers[0]`. When set (e.g.
`"mlp"`), it resolves to the corresponding submodule on the parent via
`getattr_chain`.
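The `getattr_chain` resolution amounts to a dotted-attribute walk; a minimal sketch (the real helper may handle defaults and errors differently):

```python
def getattr_chain(obj, path):
    """Resolve a dotted attribute path like "mlp.gate_proj" relative to
    a parent module, as used to locate the activation hook target."""
    for attr in path.split("."):
        obj = getattr(obj, attr)
    return obj
```

With `activation_hook_target="mlp"`, the hook lands on `getattr_chain(parent, "mlp")` instead of `balance_layers[0]`.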
In parallel transformer architectures, attention and MLP run in parallel
from the same input. The existing code always hooks `balance_layers[0]`
for activation caching, which captures the wrong activations when
balance layers span both attention and MLP branches. There was a
commented-out `hasattr(parent, 'mlp')` workaround, but it was brittle
and not generalizable. This change makes the hook target explicitly
configurable per mapping.
I've tested this change with our internal models, and it aligns with
previous results.
---------
Co-authored-by: Dipika Sikka <dipikasikka1@gmail.com>
Co-authored-by: HDCharles <39544797+HDCharles@users.noreply.github.com>
commit b0463d101350e04b40268596c71531622b26ad20
Author: HDCharles <39544797+HDCharles@users.noreply.github.com>
Date: Fri Feb 13 11:16:34 2026 -0500
[bug][awq] fix inf handling (#2332)
Must have been a bad merge or rebase at some point, scalesview was being
set before the inf/nan check
TEST PLAN:
CI
---------
Signed-off-by: HDCharles <39544797+HDCharles@users.noreply.github.com>
commit 6d600d4b91e8fb8991cc2c5c6e4f8cd911c36815
Author: Dipika Sikka <dipikasikka1@gmail.com>
Date: Fri Feb 13 10:05:07 2026 -0500
Fix CI/CD failures (#2359)
SUMMARY:
autoround:
- Test was previously using int weights with float activations which
silently fails with torch 2.9 but results in a failure for 2.10
- Fix the args to appropriately use a valid scheme where weights are
also float
quant_reload:
- Remove old unused argument
- Set tie_word_embeddings to false to account for what the test is
targeting - I believe we’re seeing this now from recent
compressed-tensors changes cc @kylesayrs
commit 302c2c7a190f1b6c6151afb0fbc5bf63b75f240e
Author: ZewenShen-Cohere <zewen.shen@cohere.com>
Date: Fri Feb 13 08:06:15 2026 -0500
AWQ: orig_layer_weights should save all balance layer weights (#2344)
Currently, orig_layer_weights only clones weights for layers that have a
quantization scheme and are listed in `mapping.balance_layers`. This
becomes a problem when we disable quantization for a layer that is still
in `mapping.balance_layers`: all balance layers still need to be
smoothed at the end, but orig_layer_weights does not store the original
weights for all of them. As a result, the smoothing step fails (see
where the error is triggered:
https://github.com/ZewenShen-Cohere/llm-compressor-fork/blob/e9e3d3191f7598198f070c5f8269f08ec89e0b2f/src/llmcompressor/modifiers/awq/base.py#L554
).
Co-authored-by: Brian Dellabetta <brian-dellabetta@users.noreply.github.com>
Co-authored-by: HDCharles <39544797+HDCharles@users.noreply.github.com>
commit d2b67d15139f7a55699f5378cb477c945eb9ed5e
Author: Dipika Sikka <dipikasikka1@gmail.com>
Date: Thu Feb 12 14:33:00 2026 -0500
Update CI/CD Logs (#2358)
SUMMARY:
- Provide summary for why a test was skipped
commit 05a13f35711e12bed4771aea7755f27d248fdaeb
Author: Dipika Sikka <dipikasikka1@gmail.com>
Date: Thu Feb 12 10:11:37 2026 -0500
Support torch 2.10 (#2356)
SUMMARY:
- Requires: https://github.com/vllm-project/compressed-tensors/pull/583
- Transformers tests currently already failing on main
commit c37fcfa081daa024f865a3f4798db029a0a67d43
Author: Fynn Schmitt-Ulms <fynnsu@outlook.com>
Date: Wed Feb 11 19:52:19 2026 -0500
Add synchronize trigger to ready label check (#2354)
SUMMARY:
Triggers the ready label check each time new commits are pushed to a pr.
Looking at https://github.com/vllm-project/llm-compressor/pull/2350 it
seems like there is still an issue with our ready check system.
1. The first commit was added
2. The ready label was added and second commit
(7d7ebd2247142dfb75cbe631aa37859092654f71) was pushed, this caused the
ready check to run and pass
3. Further commits were added but the ready check was never retriggered
4. "ready-label-check Expected — Waiting for status to be reported" is
blocking merge, despite the most recent run of the ready check passing.
It seems like required checks may need to run and pass on the most
recent commit for github to allow the merge. This pr causes subsequent
commits to re-trigger the ready check workflow.
TEST PLAN:
Merge and see if this fixes the problem. It can't make it worse since
this just causes the check to run more often.
Signed-off-by: Fynn Schmitt-Ulms <fschmitt@redhat.com>
Co-authored-by: Dipika Sikka <dipikasikka1@gmail.com>
commit fcd7fdbda73b88168095e728dfdc6d3ce7cf004f
Author: Dipika Sikka <dipikasikka1@gmail.com>
Date: Wed Feb 11 16:03:54 2026 -0500
Swap to use CPU runners (#2350)
SUMMARY:
- Swap ubuntu runners to use our cpu runner
- Remove 2 year old docker build workflow that we never use
---------
Signed-off-by: Dipika Sikka <dipikasikka1@gmail.com>
commit d316e27e6aaafbb17480519ad15fc0d2b723353f
Author: Fynn Schmitt-Ulms <fynnsu@outlook.com>
Date: Wed Feb 11 13:46:46 2026 -0500
Add concurrency check to all pr workflows (#2348)
SUMMARY:
We typically only care about the test results for the final commit in a
pr. This pr will reduce the load on github actions runners by cancelling
all jobs except for the one on the latest commit.
For example, if the following commits are all uploaded one at a time
quickly in a row:
Commit A1 uploaded, job A1 starts
Commit B1 (separate pr) uploaded, job B1 queued
Commit A2 uploaded, job A1 cancelled, job A2 queued, job B1 started
Job B1 finishes, job A2 starts
Note: this is the same concurrency logic we already have on
`test-check-transformers.yaml`
Signed-off-by: Fynn Schmitt-Ulms <fschmitt@redhat.com>
commit 22fc354d25248f1ef9d990a9a20c6aeca8a94d6d
Author: Dipika Sikka <dipikasikka1@gmail.com>
Date: Wed Feb 11 13:06:15 2026 -0500
Revert "add qwen3 vl autoround example (#2334)" (#2351)
This reverts commit 7b366711cba3982bbac99abdf6bf2c3572395f1a.
commit 7b366711cba3982bbac99abdf6bf2c3572395f1a
Author: Xin He <xin3.he@intel.com>
Date: Thu Feb 12 01:34:04 2026 +0800
add qwen3 vl autoround example (#2334)
SUMMARY:
AutoRound quantization example: qwen3-vl nvfp4
TEST PLAN:
python qwen3_vl_example.py
Output:
```
Hello my name is Mihai, I am a 30 year old male, and I am currently a software engineer working in a company that develops software for the financial sector. I am a very passionate person, and I am always eager to learn new things. I have a strong interest in AI, machine learning, and data science. I am also very interested in the intersection of these fields with finance. I am currently working on a project that involves building a machine learning model to predict stock prices. I am
```
---------
Signed-off-by: Xin He <xin3.he@intel.com>
Co-authored-by: HDCharles <39544797+HDCharles@users.noreply.github.com>
Co-authored-by: Brian Dellabetta <brian-dellabetta@users.noreply.github.com>
commit b49fbfda933e168f7b58f10ff45e019b3f24baee
Author: Brian Dellabetta <brian-dellabetta@users.noreply.github.com>
Date: Tue Feb 10 14:11:18 2026 -0500
[cicd] move check ready action to on pull_request_target (#2342)
SUMMARY:
Change the check ready label ci/cd action to run on
`pull_request_target` so that it runs more robustly for community user
PRs. From
[docs](https://docs.github.com/en/actions/reference/workflows-and-actions/events-that-trigger-workflows#pull_request_target):
> This event runs in the context of the default branch of the base
repository, rather than in the context of the merge commit, as the
pull_request event does. This prevents execution of unsafe code from the
head of the pull request that could alter your repository or steal any
secrets you use in your workflow. This event allows your workflow to do
things like label or comment on pull requests from forks. Avoid using
this event if you need to build or run code from the pull request.
…
SUMMARY:
Compressed-tensors 0.14.0 has been released. Bump up its version in llmcompressor.
TEST PLAN:
All tests.