
Revert "add qwen3 vl autoround example (#2334)"#2351

Merged
dsikka merged 1 commit into main from revert_example on Feb 11, 2026

Conversation


@dsikka dsikka commented Feb 11, 2026

This reverts commit 7b36671.

@dsikka dsikka added the ready When a PR is ready for review label Feb 11, 2026
@gemini-code-assist
Contributor

Summary of Changes

Hello @dsikka, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request is a complete revert of a previous commit that introduced a Qwen3-VL AutoRound quantization example. It removes the example script and all associated documentation from the repository.

Highlights

  • Revert Qwen3-VL AutoRound Example: This pull request fully reverts the addition of the Qwen3-VL AutoRound example, removing its dedicated script and all related documentation from the README.md.
  • File Removal: The qwen3_vl_example.py script, which provided an end-to-end example for applying AutoRound quantization to the Qwen3-VL model, has been entirely removed.
  • Documentation Update: The README.md file was updated to reflect the removal of the Qwen3-VL example, including its quickstart instructions and accuracy evaluation tables. Minor heading adjustments were also made for the remaining Llama 3.1 example.


Changelog
  • examples/autoround/quantization_w4a4_fp4/README.md
    • Removed the 'Qwen3-VL Example' section, including its code block and accuracy evaluation tables.
    • Adjusted the heading level for 'Evaluate Accuracy' under the Llama 3.1 example from '####' to '###'.
    • Updated the introductory sentence in the 'Quickstart' section from 'end-to-end scripts' to 'an end-to-end script' to reflect the singular remaining example.
  • examples/autoround/quantization_w4a4_fp4/qwen3_vl_example.py
    • Removed the entire Python script that implemented the Qwen3-VL AutoRound quantization example.
Activity
  • The pull request is a revert of a previous commit, indicating a decision to undo a recently introduced feature or change.

@mergify mergify bot added the documentation Improvements or additions to documentation label Feb 11, 2026

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request reverts the addition of the Qwen3-VL autoround example. The changes involve deleting the qwen3_vl_example.py script and updating the README.md to remove all associated documentation and benchmark results. The adjustments in the README are correct and consistent with the removal of the example. The revert is clean and I have no issues to report.

@github-actions

👋 Hi! Thank you for contributing to llm-compressor. Please add the ready label when the PR is ready for review.

Note: This is required to complete the testing suite, please only add the label once the PR is code complete and local testing has been performed.

@dsikka dsikka merged commit 22fc354 into main Feb 11, 2026
13 of 15 checks passed
@dsikka dsikka deleted the revert_example branch February 11, 2026 18:06
zhanglei1172 added a commit to zhanglei1172/llm-compressor that referenced this pull request Mar 9, 2026
commit a2433a9b0128fb5113a362d553d7984de6246053
Author: Yi Liu <yi4.liu@intel.com>
Date:   Sat Mar 7 07:24:20 2026 +0800

    [AutoRound] Add DDP Support and Example (#2411)

    SUMMARY:
    Add DDP support for Autoround and use Qwen as example.

    Depends on https://github.com/vllm-project/llm-compressor/pull/2410

    TEST PLAN:
    "please outline how the changes were tested"

    cc @hshen14 @thuang6

    ---------

    Signed-off-by: HDCharles <charlesdavidhernandez@gmail.com>
    Signed-off-by: yiliu30 <yi4.liu@intel.com>
    Co-authored-by: HDCharles <charlesdavidhernandez@gmail.com>
    Co-authored-by: Dipika Sikka <dipikasikka1@gmail.com>

commit a88ebbd2e6e5fa02d9f33bc86b7118149dac3cb4
Author: Gilles Turpin <turpingilles15@gmail.com>
Date:   Fri Mar 6 16:52:20 2026 +0100

    Add MoE calibration module for GlmMoeDsa (GLM-5) (#2434)

    SUMMARY:
    GlmMoeDsaNaiveMoe uses packed 3D nn.Parameter tensors instead of
    nn.Linear modules, causing targets=["Linear"] to match nothing in MoE
    experts during AWQ/GPTQ quantization.

    This PR permanently unpacks the fused expert weights into individual
    nn.Linear layers, following the same calibration pattern as glm4_moe
    with dtype handling aligned.

    Key differences from glm4_moe: is_permanent=True (experts must be
    unpacked for quantization targets to match), DeepSeek-style routing with
    groups/topk_group/norm, and SequentialGlmMoeDsaExperts for 3D->2D weight
    unpacking.

    Closes #2430

    TEST PLAN:
    pytest.importorskip: tests skip gracefully on transformers < 5.x
    3 unit tests: all experts triggered, output matches original, experts
    converted to nn.Linear
    Full e2e validation pending transformers 5.x compatibility
    No smaller GLM-5 checkpoint available for e2e testing (744B only)

    Signed-off-by: Gilles Turpin <turpingilles15@gmail.com>
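The 3D-to-2D unpacking idea behind this commit can be illustrated with a small numpy sketch (the real `SequentialGlmMoeDsaExperts` code builds torch `nn.Linear` modules from an `nn.Parameter`; numpy is used here only to show the slicing and to check numerical equivalence, so all names below are illustrative):

```python
import numpy as np

# A fused expert tensor of shape (num_experts, out_features, in_features)
# is sliced into one 2D weight matrix per expert, so that quantization
# schemes targeting "Linear" can match each expert individually.
rng = np.random.default_rng(0)
num_experts, d_out, d_in = 4, 8, 16
fused = rng.normal(size=(num_experts, d_out, d_in))
x = rng.normal(size=(d_in,))

# Fused computation: all experts applied via the packed 3D tensor
fused_out = np.einsum("eoi,i->eo", fused, x)

# Unpacked computation: one standalone 2D matrix per expert,
# as an nn.Linear would hold it
experts = [fused[e] for e in range(num_experts)]
unpacked_out = np.stack([w @ x for w in experts])

# Unpacking must be numerically equivalent to the fused path
assert np.allclose(fused_out, unpacked_out)
```

Because the unpacking is permanent (`is_permanent=True`), the per-expert matrices remain individual modules after calibration, which is what lets `targets=["Linear"]` match them.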

commit 47ec10e84d659719f1ff9959df0effb3e6f2d95d
Author: Yi Liu <yi4.liu@intel.com>
Date:   Fri Mar 6 05:58:04 2026 +0800

    Upgrade autoround 0.10.2 (#2410)

    Signed-off-by: yiliu30 <yi4.liu@intel.com>

    SUMMARY:
    "please provide a brief summary"

    TEST PLAN:
    "please outline how the changes were tested"

    cc @hshen14 @thuang6 @chensuyue

    ---------

    Signed-off-by: yiliu30 <yi4.liu@intel.com>
    Co-authored-by: HDCharles <39544797+HDCharles@users.noreply.github.com>

commit 04dea55db919c1e8783a1f9a4c26977aff89fdfc
Author: Brian Dellabetta <brian-dellabetta@users.noreply.github.com>
Date:   Thu Mar 5 16:16:59 2026 -0500

    [Hotfix] _match_name hotfix (#2447)

    SUMMARY:
    To account for exposing `match_name` in compressed-tensors PR in
    * https://github.com/vllm-project/compressed-tensors/pull/607

    TEST PLAN:
    tests pass

    Signed-off-by: Brian Dellabetta <bdellabe@redhat.com>

commit 6d73ce60fac726496365f5144b98091f74876528
Author: Kyle Sayers <kylesayrs@gmail.com>
Date:   Thu Mar 5 14:25:15 2026 -0500

    Refactor logging, `CompressionLogger`, support distributed (#2408)
    * Remove misleading information about module size after compression
    * Support loguru logging which logs which rank logs come from
    * Support compression logging that is specific to distributed workloads
    * Refactor `CompressionLogger`
      * Remove nvidia/amd logic, instead just use the cuda interface; this
        already accounts for "CUDA/AMD_VISIBLE_DEVICES", no need to hard-code
        these env variables
      * Remove the "module size" log, which is misleading, as the module size
        does not actually change as optimization occurs (qdq)
      * Limit devices to just the current device in distributed cases
    * Refactor loguru logger configuration
      * `configure_logger` can now be called multiple times
      * When oneshot occurs, `configure_logger` is called again with the rank set
      * Logger now prints rank if applicable
    Single-thread
    ```
    2026-02-25T17:04:36.8189 | compress_module_list | INFO - Quantizing model.layers.0.mlp.gate_proj using 512 samples
    2026-02-25T17:04:38.5924 | GPTQ | METRIC - time 1.77s
    2026-02-25T17:04:38.5926 | GPTQ | METRIC - error 663.60
    2026-02-25T17:04:38.5932 | GPTQ | METRIC - GPU 0 | usage: 4.45% | total memory: 85.1 GB
    2026-02-25T17:04:38.5933 | GPTQ | METRIC - GPU 1 | usage: 0.00% | total memory: 85.1 GB
    ```

    Distributed
    ```
    [Rank 1] 2026-02-25T17:10:18.8569 | compress_module_list | INFO - Quantizing model.layers.2.self_attn.o_proj using 512 samples
    [Rank 1] 2026-02-25T17:10:20.4585 | GPTQ | METRIC - time 1.60s
    [Rank 1] 2026-02-25T17:10:20.4586 | GPTQ | METRIC - error 1.27
    [Rank 1] 2026-02-25T17:10:20.4593 | GPTQ | METRIC - GPU 1 | usage: 4.45% | total memory: 85.1 Gb
    [Rank 1] 2026-02-25T17:10:20.4637 | compress_module_list | INFO - Quantizing model.layers.2.mlp.up_proj using 512 samples
    [Rank 0] 2026-02-25T17:10:20.7379 | GPTQ | METRIC - time 6.59s
    [Rank 0] 2026-02-25T17:10:20.7381 | GPTQ | METRIC - error 7.45
    [Rank 0] 2026-02-25T17:10:20.7401 | GPTQ | METRIC - GPU 0 | usage: 5.98% | total memory: 85.1 Gb
    [Rank 0] 2026-02-25T17:10:20.7590 | compress_module_list | INFO - Quantizing model.layers.2.mlp.gate_proj using 512 samples
    ```

    ---------

    Signed-off-by: Kyle Sayers <kylesayrs@gmail.com>

commit d6eb2be988706e46cefb03ab6acf1bbd104d35af
Author: Gilles Turpin <turpingilles@orange.fr>
Date:   Thu Mar 5 01:45:04 2026 +0100

    fix: handle packed weights in granite4 to_3d_expert (W4A16 support) (#2425)

    SUMMARY:
    Fix the W4A16 shape mismatch in to_3d_expert() reported in #2338 (first
    error). The original code hardcoded shapes for FP8 quantization only.

    The fix calculates all shapes up front (packed weights, grouped scales,
    packed zero points) then asserts and reshapes. This supports FP8
    per-channel, FP8 block quantization, W4A16 symmetric, and W4A16
    asymmetric (with packed zero_point on dim0).

    Companion to #2426 (FX tracing fix) and compressed-tensors #609 (3D
    pack/unpack). Together they resolve #2338.

    TEST PLAN:
    4 unit tests covering all quantization configurations:
    - int4 symmetric (packed weights, per-channel scale)
    - int4 asymmetric (packed weights + packed zero_point on dim0)
    - fp8 block (grouped scale)
    - fp8 per-channel (no packing)

    All passing.

    Signed-off-by: Gilles Turpin <turpingilles15@gmail.com>
    Co-authored-by: HDCharles <39544797+HDCharles@users.noreply.github.com>
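The shape bookkeeping the fix performs can be sketched as plain arithmetic. The conventions assumed here (compressed-tensors-style int32 packing with a pack factor of `32 // num_bits`, per-group scales, zero points packed along dim 0 in the asymmetric case) follow the commit description; the actual `to_3d_expert()` implementation may differ in detail:

```python
def expected_2d_shapes(out_features, in_features, num_bits=4, group_size=128):
    """Illustrative per-expert tensor shapes under the assumptions above.

    Not the actual llm-compressor implementation; a sketch of the shape
    calculations the fix must get right for each quantization config.
    """
    pack_factor = 32 // num_bits  # int4 -> 8 values per int32 word
    return {
        "weight_packed": (out_features, in_features // pack_factor),
        "weight_scale": (out_features, in_features // group_size),
        # W4A16 asymmetric: zero points are themselves packed, along dim 0
        "weight_zero_point": (out_features // pack_factor,
                              in_features // group_size),
    }

shapes = expected_2d_shapes(out_features=512, in_features=1024)
assert shapes["weight_packed"] == (512, 128)
assert shapes["weight_scale"] == (512, 8)
assert shapes["weight_zero_point"] == (64, 8)
```

Hardcoding only the FP8 case (no packing, per-channel scale) is what broke W4A16, where both the weight and the zero point change shape under packing.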

commit 4c522137771b2223dcbfec2001658a744b37a3d5
Author: Gilles Turpin <turpingilles@orange.fr>
Date:   Thu Mar 5 00:50:57 2026 +0100

    fix: use topological ordering in FX graph cleanup to fix erase_node crash (Granite4 GPTQ) (#2426)
    Fix the FX tracing crash reported as the second error in #2338. The BFS
    cleanup of concrete args did not maintain topological ordering — if a
    node was visited multiple times, its position in the deletion dict was
    not updated, causing dependents to be deleted before their dependencies
    (`RuntimeError: Tried to erase Node getitem_169`).

    The fix uses `move_to_end` in the BFS traversal so that revisited nodes
    are moved to the end of the deletion dict, ensuring topological order.

    Companion to #2425 (shape fix) and compressed-tensors #609 (3D
    pack/unpack). Together they resolve #2338.
    Tested on Granite 4.0-h-small with a single layer, using all three fixes
    (#2425, #2426, compressed-tensors #609).

    Script based on `test_gptq_no_exclusion.py` from #2338 with
    `model.model.layers = model.model.layers[:1]` added after model loading.

    Command: `python test_gptq_no_exclusion.py --model-name
    ibm-granite/granite-4.0-h-small --output /workspace/test-output
    --calibration-samples 16`

    Results:
    - FX tracing completed — no `erase_node` crash
    - 3D→2D conversion OK
    - Cache preparation OK (16/16 samples)
    - Calibration started but hit OOM on the Mamba layer (unrelated to the
    fix — naive Mamba path without `causal_conv1d` on a 31GB GPU)

    Signed-off-by: gillesturpin <turpingilles@orange.fr>
    Co-authored-by: HDCharles <39544797+HDCharles@users.noreply.github.com>
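The ordering bug can be illustrated with a stdlib-only sketch over a hypothetical three-node graph (this is not the actual torch.fx cleanup code): without `move_to_end`, a node first discovered via a short path keeps its early position, so reverse-order deletion tries to erase it while a dependent still uses it.

```python
from collections import OrderedDict, deque

def deletion_order(users, start):
    """BFS over a DAG where users[n] lists the nodes consuming n's output.

    Returns an ordering such that iterating it in reverse always erases a
    consumer before any of its producers (mirroring fx's erase_node rule,
    which refuses to erase a node that still has users).
    """
    order = OrderedDict({start: None})
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for user in users.get(node, []):
            if user in order:
                # Revisited node: move it after its newly-found producer,
                # restoring topological order (this is the fix)
                order.move_to_end(user)
            else:
                order[user] = None
            queue.append(user)
    return list(order)

# a feeds both b and c; c also feeds b, so b must be erased before c
users = {"a": ["b", "c"], "c": ["b"]}
order = deletion_order(users, "a")
assert order == ["a", "c", "b"]  # reversed deletion: b, then c, then a
```

Without the `move_to_end` branch, the order would stay `["a", "b", "c"]`, and deleting in reverse would attempt to erase `c` while `b` still consumes it, which is the `Tried to erase Node` failure mode described above.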

commit 7461d02b9bf9edc35f3be9effdaa97d6639baf1f
Author: JinRiYao2001 <jinriyao@qq.com>
Date:   Thu Mar 5 02:19:54 2026 +0800

    fix(examples): correct W8A16 -> W4A16 in Qwen3-VL AWQ example save dir (#2443)

    SUMMARY:
    The AWQ recipe in this example uses num_bits=4 for weights (W4A16).

    However the save directory name incorrectly uses "W8A16":

        -AWQ-W8A16-mse-seq

    This PR updates it to:

        -AWQ-W4A16-mse-seq

    to match the actual quantization configuration and the comment above the
    recipe.

    TEST PLAN:
    Not applicable. This PR only fixes an incorrect save directory string in
    the example script.
    No functional code paths are changed.

commit e6fdd066c785b11453875e777c229a954a9c438e
Author: Kyle Sayers <kylesayrs@gmail.com>
Date:   Tue Mar 3 16:25:18 2026 -0500

    Remove dead code (#2435)
    * Remove dead code
    * Remove `save_checkpoint` (this is now done by
    [post_process](https://github.com/vllm-project/llm-compressor/blob/main/src/llmcompressor/entrypoints/utils.py#L95))
    * Remove `get_completed_stages`, `save_completed_stages` (stages no
    longer exist)
    * Remove `load_safetensors_state_dict` (we now either load with the
    transformers model definition or `model_free_ptq`)
    * Remove `set_deterministic_seeds` (not used)
    * Remove `is_package_available`

    Signed-off-by: Kyle Sayers <kylesayrs@gmail.com>

commit 7b7d1a5dc1fbca660acc04ff993fcb0c9d15acbb
Author: Dipika Sikka <dipikasikka1@gmail.com>
Date:   Tue Mar 3 14:04:56 2026 -0500

    Enable merge queue support in GitHub workflows (#2433)
    - Configures Mergify merge queue with automatic DCO sign-off to resolve
    DCO check failures on merge commits
    - Removes GitHub native merge queue triggers from all workflows
    - Adds auto-merge rule for PRs with `ready` label and required approvals
    The DCO (Developer Certificate of Origin) GitHub App was failing on
    merge commits created by GitHub's native merge queue, as those commits
    lacked the required `Signed-off-by:` trailer.
    Switch to Mergify's merge queue which automatically adds DCO sign-off to
    all merge commits it creates.
    - Added `queue_rules` with automatic DCO sign-off in commit messages
    - Added auto-merge rule that queues PRs when:
      - Label `ready` is applied
      - 2+ approvals received
      - All required checks pass (DCO, tests, quality, etc.)
    - `.github/workflows/ready-label-check.yaml`: Removed merge_group
    trigger
    - `.github/workflows/test-check-transformers.yaml`: Removed merge_group
    trigger and condition
    - `.github/workflows/test-check.yaml`: Removed merge_group trigger
    - `.github/workflows/quality-check.yaml`: Removed merge_group trigger
    - `.github/workflows/linkcheck.yml`: Removed merge_group trigger
    After merging, GitHub's native merge queue should be disabled in
    repository settings and Mergify will handle all merge queue operations.

    🤖 Generated with [Claude Code](https://claude.com/claude-code)

    ---------

    Co-authored-by: Claude Sonnet 4.5 <noreply@anthropic.com>

commit f15296fb966bebd2652e1a31ae106e70eff8b5e2
Author: Itay Etelis <92247226+Etelis@users.noreply.github.com>
Date:   Tue Mar 3 17:07:20 2026 +0200

    Refactor Matching Logic to Use compressed-tensors Utilities (#2284)

    Consolidates 17 redundant matching functions into standardized
    compressed-tensors APIs.

    Fixes #1686

    - **Deleted 15 functions** from `module.py`: `get_layers`, `get_params`,
    `get_prunable_layers`, `get_quantizable_layers`, `match_targets`, etc.
    - **Added 2 helpers**: `expand_special_targets()` (backward
    compatibility) and `build_parameterized_layers()`
    - **Updated modifiers**: SparseGPT, magnitude pruning, constant pruning
    to use new APIs
    - **Bug fix**: Added missing `self.targets` parameter in magnitude
    pruning

    ---------

    Signed-off-by: Itay Etelis <itay.etelis@ibm.com>
    Co-authored-by: Itay Etelis <itay.etelis@ibm.com>
    Co-authored-by: Brian Dellabetta <brian-dellabetta@users.noreply.github.com>
    Co-authored-by: HDCharles <39544797+HDCharles@users.noreply.github.com>

commit a956d688892c2cff3757c598e4c870c796a42f78
Author: Xin He <xin3.he@intel.com>
Date:   Tue Mar 3 07:45:34 2026 +0800

    add qwen3 vl autoround example (#2357)

    SUMMARY:
    AutoRound quantization example: qwen3-vl nvfp4

    TEST PLAN:
    python qwen3_vl_example.py
    Output:
    ```
    Hello my name is Mihai, I am a 30 year old male, and I am currently a software engineer working in a company that develops software for the financial sector. I am a very passionate person, and I am always eager to learn new things. I have a strong interest in AI, machine learning, and data science. I am also very interested in the intersection of these fields with finance. I am currently working on a project that involves building a machine learning model to predict stock prices. I am
    ```

    ---------

    Signed-off-by: Xin He <xin3.he@intel.com>
    Co-authored-by: HDCharles <39544797+HDCharles@users.noreply.github.com>
    Co-authored-by: Brian Dellabetta <brian-dellabetta@users.noreply.github.com>
    Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>

commit 2b0684c132d130b84b1b8ec9cce9f29a3239debc
Author: Omkar Kabde <omkarkabde@gmail.com>
Date:   Tue Mar 3 05:06:16 2026 +0530

    Remove training loggers and all related code (#2414)

    SUMMARY:
    Fixes #2409.
    cc @kylesayrs

    This PR removes training loggers and all related code, replacing their
    functionality with `loguru`. It also removes other helper functions and
    `FrequencyManager`.

    TEST PLAN:
    most tests are passing, but getting stuck at gptq test

    ---------

    Signed-off-by: Dan Huang <dahuang@redhat.com>
    Signed-off-by: Omkar Kabde <omkarkabde@gmail.com>
    Co-authored-by: dhuangnm <74931910+dhuangnm@users.noreply.github.com>
    Co-authored-by: Dipika Sikka <dipikasikka1@gmail.com>
    Co-authored-by: Brian Dellabetta <brian-dellabetta@users.noreply.github.com>

commit 795198790668807f0c90c9f9df9842ad0cc6cc25
Author: Gilles Turpin <turpingilles15@gmail.com>
Date:   Tue Mar 3 00:13:15 2026 +0100

    Add SmoothQuant mapping for GlmMoeDsaForCausalLM (GLM-5) (#2419)

    Part of #1442

    GLM-5 (GlmMoeDsaForCausalLM) uses MLA identical to DeepSeek V2/V3 — same
    projection names (q_a_proj, kv_a_proj_with_mqa). Reuses
    DEEPSEEK_V2_SMOOTHQUANT_MAPPINGS which smooths input_layernorm only,
    conservative choice for MoE models with fused expert parameters
    (gate_up_proj 3D tensor).

    Also adds Glm4MoeForCausalLM with DEFAULT_SMOOTHQUANT_MAPPINGS.

    SUMMARY:
    Add GLM-5 and GLM-4-MoE to SmoothQuant MAPPINGS_REGISTRY.

    TEST PLAN:
    Registry-only change. Verified GLM-5 layer names match DeepSeek V2
    patterns by inspecting GlmMoeDsaForCausalLM in transformers.

    Signed-off-by: gillesturpin <turpingilles@orange.fr>
    Co-authored-by: Brian Dellabetta <brian-dellabetta@users.noreply.github.com>
    Co-authored-by: HDCharles <39544797+HDCharles@users.noreply.github.com>

commit 69af79c4f2016f090deaf6e06faf73e3403e5d1d
Author: Gilles Turpin <turpingilles@orange.fr>
Date:   Mon Mar 2 22:55:15 2026 +0100

    Fix SmoothQuant regex to match q_a_proj in DeepSeek/GLM-5 (#2421)

    Fixes #2420
    The balance_layers pattern re:.*q_proj in
    DEEPSEEK_V2_SMOOTHQUANT_MAPPINGS does not match q_a_proj (used by
    DeepSeek V2/V3 and GLM-5). Changed to re:.*q(_a)?_proj$ as suggested by
    @brian-dellabetta.

    SUMMARY:
    Fix regex pattern in DEEPSEEK_V2_SMOOTHQUANT_MAPPINGS to cover both
    q_proj and q_a_proj.

    TEST PLAN:
    Verified with Python regex that the new pattern matches both layer
    names:
    re.fullmatch(".*q(_a)?_proj$", "model.layers.0.self_attn.q_proj") ->
    match
    re.fullmatch(".*q(_a)?_proj$", "model.layers.0.self_attn.q_a_proj") ->
    match

    Signed-off-by: gillesturpin <turpingilles@orange.fr>
    Co-authored-by: Brian Dellabetta <brian-dellabetta@users.noreply.github.com>
    Co-authored-by: HDCharles <39544797+HDCharles@users.noreply.github.com>
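As a quick standalone sanity check of the pattern change (layer names taken from the test plan above; this snippet is not part of the PR):

```python
import re

# Old and new balance_layers patterns from DEEPSEEK_V2_SMOOTHQUANT_MAPPINGS
OLD = r".*q_proj"
NEW = r".*q(_a)?_proj$"

q_proj = "model.layers.0.self_attn.q_proj"
q_a_proj = "model.layers.0.self_attn.q_a_proj"  # DeepSeek V2/V3 and GLM-5 MLA

# The old pattern matches q_proj but misses q_a_proj entirely
assert re.fullmatch(OLD, q_proj)
assert re.fullmatch(OLD, q_a_proj) is None

# The new pattern covers both layer names
assert re.fullmatch(NEW, q_proj)
assert re.fullmatch(NEW, q_a_proj)
```

The optional group `(_a)?` is what admits the extra `_a` segment in the MLA projection name while still anchoring on the `_proj` suffix.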

commit 9e9ae3dbb2b239bc22cac1a9fc463f4895e87250
Author: Gilles Turpin <turpingilles@orange.fr>
Date:   Mon Mar 2 20:05:46 2026 +0100

    Add AWQ mapping for GlmMoeDsaForCausalLM (GLM-5) (#2418)

    Closes #2412 (part of #1442)

    GLM-5 (`GlmMoeDsaForCausalLM`) uses Multi-head Latent Attention
    identical to DeepSeek V3 — same projection layer names (`q_a_proj`,
    `kv_a_proj_with_mqa`, etc.) and same MoE structure. Reuses
    `_deepseek_mappings`.

    Also moves `Glm4MoeForCausalLM` to its correct alphabetical position in
    the registry.

    SUMMARY:
    Add GLM-5 (GlmMoeDsaForCausalLM) to AWQ_MAPPING_REGISTRY using
    _deepseek_mappings. GLM-5's MLA layer names are identical to DeepSeek
    V3. Also fixes alphabetical ordering of Glm4MoeForCausalLM.

    TEST PLAN:
    Registry-only change (no logic modified). Verified that GLM-5 layer
    names (q_a_proj, kv_a_proj_with_mqa, kv_a_layernorm, kv_b_proj, o_proj)
    match the patterns in _deepseek_mappings by inspecting the
    GlmMoeDsaForCausalLM source in transformers.

    Signed-off-by: gillesturpin <turpingilles@orange.fr>
    Co-authored-by: Brian Dellabetta <brian-dellabetta@users.noreply.github.com>
    Co-authored-by: HDCharles <39544797+HDCharles@users.noreply.github.com>

commit a27d9e2e318fdb81254285c2ed5987b97897d973
Author: 김대익 <33992354+dik654@users.noreply.github.com>
Date:   Tue Mar 3 03:41:47 2026 +0900

    [GPTQ] Move modifier to top-level for consistent folder structure (#2368)
    Move GPTQModifier from `modifiers/quantization/gptq/` to
    `modifiers/gptq/`
    for consistent folder structure with AWQ and AutoRound (related: #2306).

    - Add deprecation wrapper at old import path for backward compatibility
    - Exclude old GPTQ paths from ModifierFactory to prevent duplicate
    registration
    - Update test and example imports to new canonical path
    Import verification (all passed):
    - from llmcompressor.modifiers.gptq import GPTQModifier (new path, no
    warning)
    - from llmcompressor.modifiers.quantization import GPTQModifier (BC, no
    warning)
    - from llmcompressor.modifiers.quantization.gptq import GPTQModifier
    (BC, DeprecationWarning)
    - ModifierFactory.refresh() registers GPTQModifier from new location

    pytest (11 passed, 3 skipped for GPU):
    - tests/llmcompressor/transformers/gptq/test_gptq_oneshot.py
    -
    tests/llmcompressor/pytorch/modifiers/pruning/sparsegpt/test_pytorch.py
    - tests/llmcompressor/transformers/compression/test_recipe_parsing.py
    (requires GPU)

    ruff check + ruff format passed

    ---------

    Signed-off-by: 김대익 <33992354+dik654@users.noreply.github.com>
    Co-authored-by: Brian Dellabetta <brian-dellabetta@users.noreply.github.com>
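A deprecation wrapper at an old import path, as described above, can be sketched with a PEP 562-style module `__getattr__`. The names below are illustrative stand-ins, not the actual llmcompressor shim:

```python
import types
import warnings

# Stand-in for the class at its new canonical location
class GPTQModifier:
    pass

_new_home = {"GPTQModifier": GPTQModifier}

# Legacy module whose attribute access warns and forwards (PEP 562 style)
legacy = types.ModuleType("modifiers.quantization.gptq")

def _deprecated_getattr(name):
    if name in _new_home:
        warnings.warn(
            f"{name} moved to modifiers.gptq; the old import path is deprecated",
            DeprecationWarning,
            stacklevel=2,
        )
        return _new_home[name]
    raise AttributeError(name)

legacy.__getattr__ = _deprecated_getattr

# Old-path access still resolves, but emits a DeprecationWarning
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    cls = legacy.GPTQModifier

assert cls is GPTQModifier
assert caught and caught[0].category is DeprecationWarning
```

This matches the behavior verified in the test plan: the new path imports cleanly, while the old path still works but raises a `DeprecationWarning`.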

commit a99f159abe94dd119f6a13e5ae4004505fcd8355
Author: HDCharles <39544797+HDCharles@users.noreply.github.com>
Date:   Mon Mar 2 10:44:26 2026 -0500

    Smoothquant bugfixes (#2422)

    Summary:

    SmoothQuant wasn't actually doing anything, since it was only updating
    the onloaded copy of the weights. This PR fixes that and adds a test to
    check the behavior of SmoothQuant in the future.

    TEST PLAN:
    pytest
    /home/HDCharles/repos/llm-compressor/tests/llmcompressor/modifiers/transform/smoothquant/test_base.py
    -k "e2e"

    Signed-off-by: HDCharles <charlesdavidhernandez@gmail.com>

commit 732316c8980913d173d3a202eeab95eda39af230
Author: Sören Dréano <71752785+SorenDreano@users.noreply.github.com>
Date:   Sat Feb 28 16:25:35 2026 +0100

    Add support for passing a custom DataLoader to oneshot() (#2390)

    SUMMARY:
    Adds a `dataloader` argument to the `oneshot` entrypoint.

    Allow users to pass a pre-built PyTorch DataLoader directly via the
    `dataloader` parameter, bypassing the internal dataset-to-dataloader
    conversion. This is useful for custom data pipelines where users already
    have a prepared DataLoader and don't need get_calibration_dataloader().

    Rather than using `self.dataloader = kwargs.pop("dataloader", None)`, we
    could also add a `dataloader` argument/attribute to `DatasetArguments`
    if you prefer.

    TEST PLAN:
    This change is fairly trivial. I made sure
    [the W4A16 example](https://github.com/vllm-project/llm-compressor/blob/main/examples/quantization_w4a16/README.md)
    could still run, and that passing the DataLoader directly also works:

    ```python
    from transformers import AutoTokenizer, AutoModelForCausalLM

    MODEL_ID = "meta-llama/Meta-Llama-3-8B-Instruct"
    model = AutoModelForCausalLM.from_pretrained(MODEL_ID, dtype="auto")
    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

    from datasets import load_dataset

    NUM_CALIBRATION_SAMPLES=512
    MAX_SEQUENCE_LENGTH=2048
    ds = load_dataset("HuggingFaceH4/ultrachat_200k", split=f"train_sft[:{NUM_CALIBRATION_SAMPLES}]")
    ds = ds.shuffle(seed=42)
    def preprocess(example):
        return {"text": tokenizer.apply_chat_template(example["messages"], tokenize=False,)}
    ds = ds.map(preprocess)
    def tokenize(sample):
        return tokenizer(sample["text"], padding=False, max_length=MAX_SEQUENCE_LENGTH, truncation=True, add_special_tokens=False)
    ds = ds.map(tokenize, remove_columns=ds.column_names)

    from llmcompressor.datasets import get_calibration_dataloader
    from llmcompressor.args import DatasetArguments

    dataset_args = DatasetArguments(
        dataset=ds,
        max_seq_length=MAX_SEQUENCE_LENGTH,
        num_calibration_samples=NUM_CALIBRATION_SAMPLES,
    )

    dataloader = get_calibration_dataloader(dataset_args, tokenizer)

    from llmcompressor import oneshot
    from llmcompressor.modifiers.quantization import GPTQModifier
    recipe = GPTQModifier(targets="Linear", scheme="W4A16", ignore=["lm_head"])
    oneshot(
        model=model,
        recipe=recipe,
        dataloader=dataloader,
    )
    SAVE_DIR = MODEL_ID.rstrip("/").split("/")[-1] + "-W4A16-G128"
    model.save_pretrained(SAVE_DIR, save_compressed=True)
    tokenizer.save_pretrained(SAVE_DIR)
    ```

    This is the exact same code from the documentation, with the DataLoader
    built outside of the `oneshot` call (`dataloader =
    get_calibration_dataloader(dataset_args, tokenizer)`) and passed
    directly to `oneshot`.

    ---------

    Signed-off-by: Sören Dréano <71752785+SorenDreano@users.noreply.github.com>
    Signed-off-by: Brian Dellabetta <bdellabe@redhat.com>
    Co-authored-by: Soren Dreano <soren@numind.ai>
    Co-authored-by: Brian Dellabetta <brian-dellabetta@users.noreply.github.com>
    Co-authored-by: Brian Dellabetta <bdellabe@redhat.com>
    Co-authored-by: HDCharles <39544797+HDCharles@users.noreply.github.com>

commit bdb65473ba21ca6aaaf726ffe66c695f5608c953
Author: dhuangnm <74931910+dhuangnm@users.noreply.github.com>
Date:   Fri Feb 27 18:06:58 2026 -0500

    Bump compressed-tensors version (#2423)

    SUMMARY:
    Compressed-tensors 0.14.0 has been released. Bump up its version in
    llmcompressor.

    TEST PLAN:
    All tests.

    Signed-off-by: Dan Huang <dahuang@redhat.com>

commit 0c0ead359a355ea443df50f3f6c91de7d1df255d
Author: Dipika Sikka <dipikasikka1@gmail.com>
Date:   Thu Feb 26 18:04:11 2026 -0500

    [ReadMe] Update whats new (#2417)

    SUMMARY:
    Sample Build:
    https://app.readthedocs.org/projects/vllm-llm-compressor/builds/31579228/

commit a9847e04a92f75d64416b133991b868ed4564bf6
Author: Dipika Sikka <dipikasikka1@gmail.com>
Date:   Thu Feb 26 17:39:55 2026 -0500

    [Docs] Updates (#2416)

    SUMMARY:
    - Fix torchrun command
    - Add reference to guides in compress.md
    - Update model loading table

commit fe512727a4584c79f62dba984f004e1b4f6f9277
Author: dhuangnm <74931910+dhuangnm@users.noreply.github.com>
Date:   Thu Feb 26 15:09:59 2026 -0500

    Improve how we identify and run e2e smoke tests (#2336)

    SUMMARY:
    Currently we use the file `tests/e2e/vLLM/rhaiis-e2e-smoke.list` to mark
    the configs for smoke tests that we use to run for the RHAIIS image.
    This is vulnerable as we need to keep the list in this file up-to-date
    to any changes in the config yaml files and this is error-prone.

    This PR removes the `tests/e2e/vLLM/rhaiis-e2e-smoke.list` file and uses
    the config yaml files directly to mark the smoke tests. We added a new
    field `test_group` to the yaml files and updated the `run_tests_in_*.sh`
    scripts to parse this field and filter out tests when a test group (`-g`)
    is specified. This allows both the python and RHAIIS image testing to run
    smoke and full tests from the configs.

    To be more specific:

    ```
    # run e2e tests for all configs (default)
    bash tests/e2e/vLLM/run_tests_in_python.sh -c tests/e2e/vLLM/configs -t tests/e2e/vLLM/test_vllm.py

    # run e2e tests for smoke-only configs
    bash tests/e2e/vLLM/run_tests_in_python.sh -c tests/e2e/vLLM/configs -t tests/e2e/vLLM/test_vllm.py -g rhaiis
    ```

    Similar commands for the `run_tests_in_rhaiis.sh` script.

    Going forward, for any newly added e2e test configs that we want to
    include in the smoke tests for the RHAIIS image, we need to remember to
    add `test_group: "smoke"` to their yaml file under configs/ so the
    RHAIIS image testing picks them up automatically.

    TEST PLAN:
    A successful run of the smoke tests is here:

    https://github.com/neuralmagic/llm-compressor-testing/actions/runs/21727920814

    ---------

    Signed-off-by: Dan Huang <dahuang@redhat.com>
    Signed-off-by: Dipika Sikka <dipikasikka1@gmail.com>
    Co-authored-by: Dipika Sikka <dipikasikka1@gmail.com>

commit d0228407111ad6a70fa74c933cd138ab0404a9f6
Author: Dipika Sikka <dipikasikka1@gmail.com>
Date:   Thu Feb 26 11:41:09 2026 -0500

    [Example Testing] Remove and update example test cases (#2406)

    SUMMARY:
    - Remove out-dated cases
    - Add more up-to-date cases (e.g disk offload, ddp, model free ptq),
    examples, and models
    - Ensure all cases are verified for correct compression format
    - Add an optional `qwen` install to enable qwen VL examples which
    leverage `qwen_vl_utils`
    - Will require
    https://github.com/neuralmagic/llm-compressor-testing/pull/219 for
    example testing

    With these changes, all examples pass:
    https://github.com/neuralmagic/llm-compressor-testing/actions/runs/22450404023

    ---------

    Signed-off-by: Dipika Sikka <dipikasikka1@gmail.com>

commit 12aa5639a3276bb7fe493a0a2158e9846c63f3ff
Author: dhuangnm <74931910+dhuangnm@users.noreply.github.com>
Date:   Wed Feb 25 17:17:21 2026 -0500

    [WIP] Update dependency bounds for new release (#2407)

    SUMMARY:
    Update llmcompressor dependency bounds except for compressed-tensors,
    which will be updated after the compressed-tensors 0.14.0 is released.

    TEST PLAN:
    All tests

    ---------

    Signed-off-by: Dan Huang <dahuang@redhat.com>
    Co-authored-by: Dipika Sikka <dipikasikka1@gmail.com>

commit 81ec39c1c36c7f1d092dbd518591ca1bfb171c18
Author: Kyle Sayers <kylesayrs@gmail.com>
Date:   Wed Feb 25 14:45:45 2026 -0500

    [Offload] Convert model back to CT offloading for testing (#2403)
    * Fix testing which requires access to the model after the model has
    been saved
    * https://github.com/vllm-project/compressed-tensors/pull/601
    * Convert back to CT offloading after converting to accelerate
    offloading for saving
    * Previously we just "removed dispatch", but this is bad practice as it
    won't work for disk offloading

    Signed-off-by: Kyle Sayers <kylesayrs@gmail.com>
    Co-authored-by: Dipika Sikka <dipikasikka1@gmail.com>

commit c6e4d38dde4471874e4a3100f928cd3fef473cd5
Author: HDCharles <39544797+HDCharles@users.noreply.github.com>
Date:   Wed Feb 25 14:20:44 2026 -0500

    [dist][moe] fix add moe_context for big models (#2405)

    Summary:

    For large models like Qwen/Qwen3-VL-235B-A22B-Instruct, when adding the
    moe calibration context, different threads can take different lengths of
    time; for larger models this difference can be longer than the nccl
    timeout.

    Fix: add a sync point at each module, since we're rate-limited to the
    slowest thread as-is. At some point this should be changed to add the moe
    calibration context in parallel and broadcast the updated modules.
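    A minimal sketch of that fix (a recording callable stands in for `torch.distributed.barrier` so the sketch runs without a distributed setup; module names are made up):

```python
# Synchronize all ranks after each module's MoE-calibration setup so no
# rank outruns the others long enough to trip the NCCL timeout.
def apply_moe_context(modules, setup, barrier):
    for module in modules:
        setup(module)   # per-module work; duration varies across ranks
        barrier()       # sync point: wait for the slowest rank

calls = []
apply_moe_context(
    modules=["expert_block_0", "expert_block_1"],
    setup=lambda m: calls.append(("setup", m)),
    barrier=lambda: calls.append(("barrier",)),
)
print(calls)  # setup/barrier alternate, one barrier per module
```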

    TEST PLAN:

    tested e2e

    <details>

    ```python
    # qwen3_vl_235b_moe_gptq_int4_ddp_example.py
    # currently supported for Qwen3-VL-MoE
    from compressed_tensors.offload import init_dist, load_offloaded_model
    from transformers import AutoProcessor, Qwen3VLMoeForConditionalGeneration

    from llmcompressor import oneshot
    from llmcompressor.modifiers.quantization import GPTQModifier

    MODEL_ID = "Qwen/Qwen3-VL-235B-A22B-Instruct"
    init_dist()
    with load_offloaded_model():
        model = Qwen3VLMoeForConditionalGeneration.from_pretrained(
            MODEL_ID, dtype="auto", device_map="auto_offload"
        )

    processor = AutoProcessor.from_pretrained(MODEL_ID)

    recipe = GPTQModifier(
        targets="Linear",
        scheme="W4A16",
        ignore=[
            "re:.*lm_head",
            "re:visual.*",
            "re:model.visual.*",
            "re:.*mlp.gate$",
        ],
    )
    oneshot(model=model, recipe=recipe)

    import torch
    SAVE_DIR = (
        MODEL_ID.rstrip("/").split("/")[-1]
        + "-GPTQ-W4A16-G128-DDP"
        + str(torch.distributed.get_world_size())
    )
    model.save_pretrained(SAVE_DIR, save_compressed=True)
    processor.save_pretrained(SAVE_DIR)
    ```
    </details>

    Signed-off-by: HDCharles <charlesdavidhernandez@gmail.com>

commit f18d6e384fb9244c82eeb5ce715c3c54b4a91313
Author: HDCharles <39544797+HDCharles@users.noreply.github.com>
Date:   Wed Feb 25 13:09:01 2026 -0500

    fix ddp for nvfp4 on A100 (#2404)

    depends on https://github.com/vllm-project/compressed-tensors/pull/603

    Summary:

    nccl does not allow broadcasting fp8 on A100, but we can work around it
    with this util

    Test Plan:

    <details>
    Test Script

    </details>

    Signed-off-by: HDCharles <charlesdavidhernandez@gmail.com>

commit ff526d72e41b3e13ae9df4f0d0524764751cd2ec
Author: Kyle Sayers <kylesayrs@gmail.com>
Date:   Wed Feb 25 11:27:12 2026 -0500

    [Docs] Add Sequential Onloading, Disk Offloading, and Distributed Oneshot Docs (#2396)
    * Add documentation for new features in v0.10.0
    * Add up-to-date documentation on sequential onloading
    * Add docs page for Sequential Onloading
    * Add docs page for Model Loading
    * Add docs page for Distributed Oneshot
    * Fix the path of observers.md
    * Slightly change wording on docs home page
    * Add redirect to model loading docs in disk offloading examples folder

    ---------

    Signed-off-by: Kyle Sayers <kylesayrs@gmail.com>
    Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>

commit 1e4d3c5bca95ac75fc301005d1fe5b2adca9a955
Author: Brian Dellabetta <brian-dellabetta@users.noreply.github.com>
Date:   Wed Feb 25 11:09:57 2026 -0500

    [Examples] Remove diagnostic `model.generate` calls for models with 40B+ parameters (#2401)

    SUMMARY:
    Remove all calls to `model.generate` in examples involving models with
    ~40B+ parameters. Anything smaller should run on a single 80GB GPU.

    TEST PLAN:
    n/a

    ---------

    Signed-off-by: Brian Dellabetta <bdellabe@redhat.com>
    Co-authored-by: Dipika Sikka <dipikasikka1@gmail.com>

commit f0a1824bc5440597d071bcc21bd8ad01bd8b0038
Author: Dipika Sikka <dipikasikka1@gmail.com>
Date:   Wed Feb 25 11:03:02 2026 -0500

    [Tests][LM Eval] Fix test seeding for consistent results (#2395)

    SUMMARY:
    - Enables consistent test results before runs

    Test Run:
    https://github.com/neuralmagic/llm-compressor-testing/actions/runs/22371360237

    ---------

    Signed-off-by: Dipika Sikka <dipikasikka1@gmail.com>

commit b0cc7a05f7f6916d5757f452f7147e066f318451
Author: Dipika Sikka <dipikasikka1@gmail.com>
Date:   Wed Feb 25 10:54:34 2026 -0500

    [Docs] Clean-up + Example ReadMe updates (#2399)

    SUMMARY:
    - Remove marlin24 examples
    - Clean-up existing README docs
    - Add examples/README.md file explaining repo structure
    - Update MoE README.md

commit 778abe815c226669753308ea9ee76ee91186db26
Author: Dipika Sikka <dipikasikka1@gmail.com>
Date:   Tue Feb 24 13:43:53 2026 -0500

    [Docs] Remove finetune examples (#2398)

    SUMMARY:

    - Remove old finetune examples
    - Remove old maintainers file as redundant with CODEOWNERS

commit 9b7fb9f77159967f90b66b37be5ea7bc21532504
Author: Bartowski <3266127+bartowski1182@users.noreply.github.com>
Date:   Tue Feb 24 12:21:09 2026 -0500

    Add AFMOE mappings for awq and smoothquant (#2316)

    SUMMARY:
    These mappings are needed to properly apply AWQ and smoothquant to the
    Trinity series of models, AfmoeForCausalLM

    TEST PLAN:
    Quality was tested with benchmarks, without these changes the benchmark
    results were extremely low, with these changes it was close to margin of
    error compared to bf16/FP8 dynamic

    Can test on Trinity-Large-Preview

    https://huggingface.co/arcee-ai/Trinity-Large-Preview

    Test code for quantization:

    https://gist.github.com/bartowski1182/b7e05f6c96735ec5d03f234d37e11e4d

    ---------

    Signed-off-by: Colin Kealty <3266127+bartowski1182@users.noreply.github.com>
    Signed-off-by: Bartowski <3266127+bartowski1182@users.noreply.github.com>
    Co-authored-by: Dipika Sikka <dipikasikka1@gmail.com>
    Co-authored-by: Brian Dellabetta <brian-dellabetta@users.noreply.github.com>
    Co-authored-by: HDCharles <39544797+HDCharles@users.noreply.github.com>

commit be12cc6d70f3fb3fdd6b0bbe0a8ba35f19b549d9
Author: Dipika Sikka <dipikasikka1@gmail.com>
Date:   Tue Feb 24 11:25:46 2026 -0500

    [Docs] Reorganize + Additional Guides (#2379)

    SUMMARY:
    - Add choosing a model
    - Add choosing a dataset
    - Re-organize to set-up a step-by-step compression guide
    - Additional clean-up and organization

    Sample Doc Generation:
    https://vllm--2379.org.readthedocs.build/projects/llm-compressor/en/2379/

    ---------

    Signed-off-by: Dipika Sikka <ds3822@columbia.edu>
    Co-authored-by: Brian Dellabetta <brian-dellabetta@users.noreply.github.com>

commit 986ac236f3bbdc95c8e47072fb33474511aee962
Author: Kyle Sayers <kylesayrs@gmail.com>
Date:   Mon Feb 23 13:03:00 2026 -0500

    [Misc] Remove usages of `update_parameter_data` (#2393)
    * Begin deprecation of `update_parameter_data` in favor of
    `update_offload_parameter`

    ---------

    Signed-off-by: Kyle Sayers <kylesayrs@gmail.com>

commit 5c757a6985d32ee74b7a2c30349c852624cf4100
Author: Kyle Sayers <kylesayrs@gmail.com>
Date:   Mon Feb 23 11:59:36 2026 -0500

    [Offloading] Support Disk Offloading (#2373)
    * Support disk offloading for very large models
    * [[Offload] Convert accelerate for
    loading/saving](https://github.com/vllm-project/compressed-tensors/pull/572/)
    * Add `examples/disk_offloading/qwen3_example.py`
    * Add `examples/disk_offloading/kimi_k2_example.py`
    * Remove post-processing step where `remove_dispatch` is called
    * Previously, this was used to avoid conflicts between
    `dispatch_for_sequential` and `dispatch_for_generation`.
    * Now, the two functions are directly compatible: you don't need to
    remove the dispatch of one to use the other
    * Add `to_accelerate` to `save_pretrained_wrapper`
    * This ensures that the model is converted to `accelerate` offloading
    before saving
    * This ensures the best compatibility with `save_pretrained`, and
    reduces excess memory usage which would cause gpu/cpu ooms
    * During oneshot preprocessing, convert `from_accelerate` if possible.
    This guards against users who load their model outside of the
    `load_offloaded_model` context
    * Remove `offload_device` argument from `dispatch_for_sequential` to
    avoid deprecation warning
    * `dispatch_for_sequential` now always respects the device the model was
    loaded on
    * Ran `Qwen/Qwen3-0.6B` example to completion
    * [IN PROGRESS] Run `unsloth/Kimi-K2-Instruct-0905-BF16` example to
    completion

    ---------

    Signed-off-by: Kyle Sayers <kylesayrs@gmail.com>

commit 5f63d7a9a6ae0f9944e69ff87ea5cce31f923ae2
Author: HDCharles <39544797+HDCharles@users.noreply.github.com>
Date:   Thu Feb 19 14:17:53 2026 -0500

    [GPTQ][ddp] enabling DDP for GPTQ (#2333)

    After the changes in
    https://github.com/vllm-project/compressed-tensors/pull/572
    https://github.com/vllm-project/compressed-tensors/pull/534
    https://github.com/vllm-project/llm-compressor/pull/2340 we're ready to
    start rolling out DDP implementations of various modifiers.
    The API we've landed on attempts to maintain the normal flow with
    minimal changes necessary to enable DDP:

    1) the user will call `torchrun --nproc_per_node=<num_threads> script.py`
    to start the script
    2) the user will initialize the distributed context, (they can use the
    helper init_dist to do this)
    3) the user will load the model using the new context manager, setting
    the device map as outlined
    [here](https://github.com/vllm-project/compressed-tensors/pull/572).
    (For most users this will be "auto_offload")
    4) (optional) the user can partition the dataset at load time using
    get_rank_partition or just load as normal and oneshot will partition the
    data later (will load 1 copy of dataset into cpu memory for each rank
    which may be onerous)
    ```python
    from compressed_tensors.offload import load_offloaded_model, init_dist
    init_dist()
    with load_offloaded_model():
        model = AutoModelForCausalLM.from_pretrained(MODEL_ID, device_map="auto_offload")
    ...
    ds = load_dataset(
        DATASET_ID, split=get_rank_partition(DATASET_SPLIT, NUM_CALIBRATION_SAMPLES)
    )
    ```
    Adding the DDP process to GPTQ was relatively straightforward, though
    optimizing it for speed was a bit trickier. There are 4 steps:

    1) assigning each module to a rank which it will be compressed by
    2) for each module assigned to a rank, having all hessian information
    sent by other ranks to the assigned rank
    3) each rank compresses the modules that it was assigned
    4) broadcast the final quantized values to all ranks
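    A hedged sketch of the load balancing in step 1 (greedy assignment of each module to the currently least-loaded rank by estimated time; the real cost model and API differ):

```python
import heapq

def assign_modules(costs, world_size):
    """costs maps module name -> estimated compression time (s)."""
    heap = [(0.0, rank) for rank in range(world_size)]  # (load, rank)
    heapq.heapify(heap)
    assignment = {}
    # placing the largest jobs first gives a tighter greedy bound
    for name, cost in sorted(costs.items(), key=lambda kv: -kv[1]):
        load, rank = heapq.heappop(heap)
        assignment[name] = rank
        heapq.heappush(heap, (load + cost, rank))
    return assignment

costs = {"q_proj": 4.0, "k_proj": 1.0, "v_proj": 1.0, "o_proj": 2.0}
print(assign_modules(costs, world_size=2))
# both ranks end up evenly loaded (4.0s of estimated work each)
```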

    Step 1 required the largest optimization, without any load balancing, we
    ran into situations where 1 rank could be doing twice as much work as
    another. Thus we implemented basic load balancing and time estimation
    that seems to be working well in practice. The other major optimization
    was using asynchronous ops for thread to thread communication. Before
    these optimizations, 2 thread GPTQ was as fast as 1 thread GPTQ for
    llama3-8B, afterward it results in a 27% speedup despite being a
    relatively small model.

    | model_id | world_size | max_time | max_memory | save_time | flex_extract | eval_time |
    |----------|------------|----------|------------|-----------|--------------|-----------|
    | Meta-Llama-3-8B-Instruct | 1 | 745.03 | 5.82 | 19.57 | 0.7066 | 95.28 |
    | Meta-Llama-3-8B-Instruct | 2 | 372.20 | 5.57 | 49.10 | 0.7089 | 95.24 |
    | Meta-Llama-3-8B-Instruct | 4 | 264.07 | 5.82 | 52.50 | 0.7180 | 96.74 |
    | Qwen3-30B-A3B | 1 | 14207.53 | 6.56 | 748.23 | 0.8704 | 209.93 |
    | Qwen3-30B-A3B | 2 | 7018.25 | 6.36 | 696.65 | 0.8810 | 205.89 |
    | Qwen3-30B-A3B | 4 | 3694.46 | 6.36 | 723.05 | 0.8832 | 217.62 |

    While validating numerical accuracy of the DDP technique, we noticed
    that accuracy improved significantly for each thread added. After some
    debugging we realized this was because the existing [hessian
    calculation](https://github.com/vllm-project/llm-compressor/pull/2333/changes#diff-18d1319f01629ca65cc54f955dc6177f6dd025f057013932b2ed29842854f3ecL61-L65)
    was causing an accumulation of floating point errors. By rewriting the
    hessian calculation to sum the intermediate hessians and only divide by
    num_samples at the end, we improved the GSM8K evaluation from (.67, .66)
    to (.71, .71). You can repro these results
    [here](https://github.com/vllm-project/llm-compressor/pull/2333/changes#diff-d31ce0453051853c17ba2a5225b3d1bfab548e095bab0967d6acfd1b3ce1b35d)
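    The numerical point can be sketched with scalars standing in for the per-batch `X @ X.T` updates. Both forms agree algebraically; the difference is that the running average rescales the accumulator on every batch, which is where the floating point error accumulated:

```python
def running_average(batches):
    # old style: rescale the accumulated hessian at every batch
    h, n = 0.0, 0
    for x, b in batches:               # b = samples in this batch
        h = h * (n / (n + b)) + x * (b / (n + b))
        n += b
    return h

def sum_then_divide(batches):
    # new style: sum intermediate terms, divide by num_samples once
    total = sum(x * b for x, b in batches)
    n = sum(b for _, b in batches)
    return total / n

batches = [(2.0, 4), (6.0, 4), (4.0, 8)]
print(running_average(batches), sum_then_divide(batches))  # both 4.0
```

    In float64 on this toy input both paths land on the same value; the rewrite matters in lower precision over many batches, where the repeated rescaling compounds rounding error.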

    ---------

    Signed-off-by: HDCharles <charlesdavidhernandez@gmail.com>

commit 881dd462975a92551685b5507dfa1272f8c40bb8
Author: Kyle Sayers <kylesayrs@gmail.com>
Date:   Thu Feb 19 12:55:13 2026 -0500

    [Bugfix] Reduce device movement while checking layer divisibility (#2385)
    * Improve runtime and memory usage by checking the shape of the
    offloaded weight, not the onloaded weight
    * Wrap all calls to `_layer_indivisible` with the `disable_onloading`
    context

    Signed-off-by: Kyle Sayers <kylesayrs@gmail.com>

commit 70b610acb234e095f664d361412f6a4e9ef2ff09
Author: Brian Dellabetta <brian-dellabetta@users.noreply.github.com>
Date:   Thu Feb 19 11:41:32 2026 -0500

    [Observers] Allow for case when weight shape and block size are not evenly divisble (#2283)

    SUMMARY:
    Update observer logic for block strategy when weight shape is not
    divisible by block size

    Prerequisite:
    - [x] https://github.com/vllm-project/compressed-tensors/pull/547
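    The shape handling can be illustrated with a small helper (a sketch, not the observer's actual code): for BLOCK quantization, the scale grid is the ceiling division of each weight dimension by its block size, so a ragged final block still gets its own scale.

```python
import math

def scale_grid(weight_shape, block_shape):
    """Number of block scales along each weight dimension."""
    return tuple(
        math.ceil(dim / block) for dim, block in zip(weight_shape, block_shape)
    )

# 2048 columns divide evenly by 128, but 1400 rows leave a ragged final
# block of 120 rows: 11 row-blocks rather than 10.
print(scale_grid((1400, 2048), (128, 128)))  # -> (11, 16)
```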

    TEST PLAN:
    - [x] Quantized checkpoint made with this branch (and above CT branch)
    runs on vllm main for flashinfer, deepgemm and default kernels --
    https://huggingface.co/bdellabe/DeepSeek-V2-Lite-FP8-BLOCK

    Run script below with
    - `VLLM_BLOCKSCALE_FP8_GEMM_FLASHINFER=1 VLLM_USE_DEEP_GEMM=1` for
    flashinfer
    - `VLLM_BLOCKSCALE_FP8_GEMM_FLASHINFER=0 VLLM_USE_DEEP_GEMM=1` for
    deepgemm
    - `VLLM_BLOCKSCALE_FP8_GEMM_FLASHINFER=0 VLLM_USE_DEEP_GEMM=0` for
    default

    ```python
    if __name__ == "__main__":
        from vllm import LLM, SamplingParams

        prompts = ["The Swiss Alps are", "Brad Marchand is", "The Toronto Maple Leafs are"]
        sampling_params = SamplingParams(
            temperature=0.80, top_p=0.95, max_tokens=40, min_tokens=10
        )
        llm = LLM(
            "bdellabe/DeepSeek-V2-Lite-FP8-BLOCK",
            max_model_len=4096,
            enforce_eager=True,
        )
        output = llm.generate(prompts, sampling_params)
        for out in output:
            print(out.outputs[0].text)

        print("COMPLETE")
    ```

    ---------

    Signed-off-by: Brian Dellabetta <bdellabe@redhat.com>

commit d0ce1d827ce3981eaf173f47452f66793d8d1d78
Author: Itay Etelis <92247226+Etelis@users.noreply.github.com>
Date:   Thu Feb 19 11:35:03 2026 +0200

    move smoothquant to transforms (#2314)

    Moves `SmoothQuantModifier` from `modifiers/smoothquant/` to
    `modifiers/transform/smoothquant/` to correctly categorize it as a
    transform rather than a modifier.

    Closes #2306

    - Moved SmoothQuant source files to `modifiers/transform/smoothquant/`
    - Moved corresponding test files
    - Updated all imports across examples, docs, and dependent code
    - Exported `SmoothQuantModifier` from `modifiers.transform`

    ```python
    from llmcompressor.modifiers.transform.smoothquant import SmoothQuantModifier
    ```

    ---------

    Signed-off-by: Itay Etelis <itayetelis@gmail.com>
    Signed-off-by: Itay Etelis <itay.etelis@ibm.com>
    Co-authored-by: Itay Etelis <itay.etelis@ibm.com>

commit 936e0a701e55e8d9f9b9145b64673510bfe2a79c
Author: Dipika Sikka <dipikasikka1@gmail.com>
Date:   Thu Feb 19 01:15:24 2026 -0500

    [Tests][e2e] Release memory before running vLLM (#2375)

    Co-authored-by: Kyle Sayers <kylesayrs@gmail.com>

commit 2e469aa913c41d0c832d3a0a5785751b48e065ed
Author: Kyle Sayers <kylesayrs@gmail.com>
Date:   Thu Feb 19 00:59:14 2026 -0500

    [Bugfix] Fix circular references when activation offload device is cuda (#2387)
    tensors. This is a good approach, but has an edge case where, if the
    value of entry is identical to the key of the entry, then the key will
    never be garbage collected.

    This can occur if the user specifies `sequential_offload_device="cuda"`,
    or if the AWQ offload device is "cuda" (default true in most cases).
    * Fix memory leak in AWQ which led to very high CUDA memory usage
    * Guard against entries into the `WeakKeyDictionary` where the key and
    value are identical
    * Misc
      * Move `OverrideEqMode` to the bottom of the `pipelines/cache.py`
      * Remove `_fp16_baseline_cache`, which was not being used
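    The edge case is easy to reproduce with a plain `WeakKeyDictionary` (the class name below is illustrative): the dictionary holds its values strongly, so an entry whose value is the key itself keeps the key alive forever.

```python
import gc
import weakref

class Activation:  # stand-in for a cached tensor
    pass

d = weakref.WeakKeyDictionary()
leaked = Activation()
d[leaked] = leaked          # value is the key: strong self-reference
del leaked
gc.collect()
print(len(d))               # entry survives: 1

guarded = weakref.WeakKeyDictionary()
k = Activation()
v = k
if v is not k:              # guard: never store identical key/value
    guarded[k] = v
del k, v
gc.collect()
print(len(guarded))         # nothing retained: 0
```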
    | Before Changes | After Changes |
    | - | - |
    | <img width="640" height="480" alt="awq_before"
    src="https://github.com/user-attachments/assets/07714321-4b2f-49b7-aa2b-5c745a60d2f4"
    /> | <img width="640" height="480" alt="awq_after"
    src="https://github.com/user-attachments/assets/336b0e98-c24c-4e0c-a873-3166effc32b7"
    /> |

    ---------

    Signed-off-by: Kyle Sayers <kylesayrs@gmail.com>
    Co-authored-by: HDCharles <39544797+HDCharles@users.noreply.github.com>

commit 9979e9829ba034ee323b24a841e2288572c594df
Author: Kyle Sayers <kylesayrs@gmail.com>
Date:   Wed Feb 18 18:20:15 2026 -0500

    [`model_free_ptq`] Earlier Shape Validation (#2372)
    * Add earlier shape validation, at the cost of loading tensors twice
    * Add a validation step which loads tensors and validates their shapes
    * Misc
      * Add `iter_quantizable_tensors` to reduce code reuse
    * Added `tests/llmcompressor/pipelines/test_model_free_validation.py`

    ------
    [Codex
    Task](https://chatgpt.com/codex/tasks/task_e_69936b53d28c8327aa0b784040c34734):
    I had to do significant cleanup to make this multithreaded/fix
    duplicated code.

    ---------

    Signed-off-by: Kyle Sayers <kylesayrs@gmail.com>
    Co-authored-by: HDCharles <39544797+HDCharles@users.noreply.github.com>

commit 7ac94483be892f1e522079065c5b6ad4ff683c79
Author: D!NE$H <67671800+gDINESH13@users.noreply.github.com>
Date:   Thu Feb 19 03:15:07 2026 +0530

    input_id not required for Step3-VL-10B (#2370)

    SUMMARY:

    Closes #2272

    Adds TypeError exception handling to `get_embeddings()` to support
    models with non-standard `get_input_embeddings()` implementations,
    specifically Step3-VL-10B.

    As @kylesayrs mentioned, Step3-VL-10B has a non-standard implementation
    of get_input_embeddings() that requires an input_ids parameter. This fix
    gracefully handles the TypeError that occurs when calling this method
    without the required parameter, allowing quantization to proceed.

    TEST PLAN:

    I do not have system specs to run this model locally. But testing would
    be running
    `examples/quantization_w8a8_int8/llama3_example.py` just changing
    model_id to "stepfun-ai/Step3-VL-10B"

    It should gracefully handle the TypeError raised when `input_ids` is required.

    Signed-off-by: gDINESH13 <dinesh13g@gmail.com>
    Co-authored-by: HDCharles <39544797+HDCharles@users.noreply.github.com>

commit f4a1490d09b65bb125ecbd855f0866f4ef80cd1a
Author: Avishek Goswami <86944690+GOavi101@users.noreply.github.com>
Date:   Thu Feb 19 00:49:10 2026 +0530

    DataLoader options, single-pass weight calibration, optional sequential prefetch (#2349)
    Performance-oriented changes for calibration and weight quantization:
    DataLoader tuning when workers are used, a single-pass weight
    calibration in `QuantizationModifier`, and an optional sequential
    prefetch to overlap onload with forward. Defaults stay safe for low
    RAM/GPU.
    - **`pin_memory`**: Set to `True` only when CUDA is available **and**
    `dataloader_num_workers > 0` (avoids extra pinned memory when
    `num_workers=0`).
    - When `num_workers > 0`: set `persistent_workers=True` and
    `prefetch_factor=2` for faster calibration.
    - Changed files: `args/dataset_arguments.py`, `entrypoints/oneshot.py`
    - New argument **`sequential_prefetch`** (default **`False`**).
    - When `False`: same as before — one batch on GPU at a time (low peak
    memory).
    - When `True`: prefetch next batch in a background thread to overlap
    onload with forward (faster when GPU memory allows two batches).
    - `dataloader_num_workers` default remains **0** (low-memory safe); help
    text updated.
    - `sequential_prefetch` added to `DatasetArguments` and `oneshot()` with
    default `False`.
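    The selection logic above can be sketched as follows (`cuda_available` stands in for `torch.cuda.is_available()`; this is an illustration of the rules, not the exact code):

```python
def dataloader_kwargs(num_workers, cuda_available):
    kwargs = {"num_workers": num_workers}
    # pinned memory only pays off with worker processes and a CUDA device
    kwargs["pin_memory"] = cuda_available and num_workers > 0
    if num_workers > 0:
        kwargs["persistent_workers"] = True   # keep workers across epochs
        kwargs["prefetch_factor"] = 2         # batches pre-loaded per worker
    return kwargs

print(dataloader_kwargs(0, cuda_available=True))   # low-memory default
print(dataloader_kwargs(4, cuda_available=True))   # tuned for throughput
```

    Passing the resulting dict to `torch.utils.data.DataLoader(dataset, **kwargs)` keeps the `num_workers=0` default exactly as safe as before.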

    ---------

    Signed-off-by: Avishek Goswami <avishek.goswami@ibm.com>
    Signed-off-by: HDCharles <charlesdavidhernandez@gmail.com>
    Co-authored-by: Avishek Goswami <avishek.goswami@ibm.com>
    Co-authored-by: HDCharles <39544797+HDCharles@users.noreply.github.com>
    Co-authored-by: HDCharles <charlesdavidhernandez@gmail.com>

commit 36c30ee5848427046d006c7fc9cb46113c7ac5ba
Author: Kyle Sayers <kylesayrs@gmail.com>
Date:   Tue Feb 17 17:12:59 2026 -0500

    [Examples] Deprecate `dispatch_for_generation` in favor of `dispatch_model` (#2376)
    * Start using `dispatch_model` as a primitive instead of
    `dispatch_for_generation`, which doesn't add anything but indirection
    * Find and replace `dispatch_for_generation` -> `dispatch_model`
    * Add deprecation warning to `dispatch_for_generation`

    ---------

    Signed-off-by: Kyle Sayers <kylesayrs@gmail.com>

commit ef70f436188e919ae572b8ece384942d00f09d4d
Author: Avishek Goswami <86944690+GOavi101@users.noreply.github.com>
Date:   Tue Feb 17 23:12:12 2026 +0530

    feat: early group-size divisibility check with layer FQNs (#2353)
    Add an early check so users hit a clear error at `initialize()` (before
    long calibration e.g. GPTQ) when using group/tensor-group quantization
    on layers whose weight columns are not divisible by `group_size`,
    instead of failing at save with an opaque message.
    - **Policy:** Only GROUP and TENSOR_GROUP require strict divisibility
    (those kernels don’t support non-divisible shapes). BLOCK is
    intentionally not checked (block kernels support non-divisible). This is
    centralized in `group_size_validation.py`.
    - **Early error:** We fail during `initialize_quantization()` and raise
    with:
    - The exact layer FQNs and `(columns, group_size)` for each problematic
    layer
      - Instructions to add those names to the modifier’s `ignore` list
    - **Tests:** Added tests for the validation helper and for the modifier
    (raises with expected message, succeeds when layers are ignored or all
    divisible).
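    A minimal sketch of the check (layer names and the helper signature are illustrative, not the actual `group_size_validation.py` API):

```python
def validate_group_size(layer_columns, group_size):
    """layer_columns maps layer FQN -> number of weight columns."""
    bad = [
        (fqn, cols) for fqn, cols in layer_columns.items()
        if cols % group_size != 0
    ]
    if bad:
        detail = ", ".join(f"{fqn} ({cols} cols)" for fqn, cols in bad)
        raise ValueError(
            f"group_size={group_size} does not divide: {detail}. "
            "Add these layers to the modifier's `ignore` list."
        )

validate_group_size({"model.layers.0.mlp.up_proj": 4096}, group_size=128)  # ok
try:
    validate_group_size({"model.layers.0.mlp.gate": 100}, group_size=128)
except ValueError as e:
    print(e)   # names the offending FQN before calibration starts
```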

     Fixes #1983

    ---------

    Signed-off-by: Avishek Goswami <avishek.goswami@ibm.com>
    Co-authored-by: Avishek Goswami <avishek.goswami@ibm.com>
    Co-authored-by: HDCharles <39544797+HDCharles@users.noreply.github.com>
    Co-authored-by: Dipika Sikka <dipikasikka1@gmail.com>

commit ccc26f1c7f01ba8256efffe28549ad6775044fb7
Author: Cassie Jeon <cajeon@redhat.com>
Date:   Tue Feb 17 11:11:09 2026 -0500

    First draft for INFERENG-2666 (#2251)

    SUMMARY:
    This is a first draft for INFERENG-2666. This draft covers Llama4,
    Qwen3, Kimi K2, and Mistral models for FP8 quantization.

    TEST PLAN:
    N/A. Documentation and code examples will need to be verified and
    reviewed by developers.

    Additional questions for reviewers:
    1. Should all the examples be in one page? Or should I separate the
    examples into separate pages for each model? This is for FP8, but I know
    FP4 will also need documentation so wanted to get your thoughts if FP4
    examples should also be one document or separated by model.

    2. Are there any specific wording or content that should be called out
    before the examples for each model?

    3. I modeled the draft from [this Example
    page](https://docs.vllm.ai/projects/llm-compressor/en/latest/examples/quantization_w8a8_fp8/)
    that Dipika had initially pointed out. Let me know if you think I should
    organize the information differently.

    Signed-off-by: Cassie Jeon <cajeon@redhat.com>

commit cc3eed27da218662c629451ecdc7bac558873d30
Author: Kyle Sayers <kylesayrs@gmail.com>
Date:   Tue Feb 17 10:15:19 2026 -0500

    [Bugfix] Guard against MLA (#2337)
    * Support INT4 quantization of models with MLA attention
    * As of https://github.com/vllm-project/compressed-tensors/pull/533, MLA
    attention is considered an attention module
    * However, checking for submodule.q_proj fails for MLA, since MLA does
    not have a q_proj
    * Guard against layers without q_proj
    * Able to quantize MLA model

    Signed-off-by: Kyle Sayers <kylesayrs@gmail.com>
    Co-authored-by: HDCharles <39544797+HDCharles@users.noreply.github.com>

commit 0d556a7da6c047b583a24b5e702ba2bfa647e05a
Author: Kyle Sayers <kylesayrs@gmail.com>
Date:   Tue Feb 17 09:37:24 2026 -0500

    [Sequential Pipeline] only cache unique offloaded values (#2366)

    Updated by @brian-dellabetta

    SUMMARY:
    The SequentialPipeline offloads subgraph outputs as part of normal
    usage. Occasionally these outputs share duplicates in kwargs that point
    to the same memory location on the onloaded device. When offloading is
    enabled, there was previously no check to see if any tensors to be
    offloaded had already previously been offloaded, which can cause a huge
    increase in memory requirements in some models, as reported in #2363.
    This PR
    - [x] adds an offload map to IntermediatesCache to ensure tensors are
    not redundantly offloaded
    - [x] wraps the map in an override to ensure `torch.equal` is used
    rather than `torch.eq` (which is the one used with `==` checks).
    `torch.eq` can return multiple boolean values depending on the tensors
    being compared, resulting in an error. This override, which should only
    be used when the tensors are immutable (the case here), allows us to
    retain the original hashing function and have an `O(1)` lookup. Our
    other attempts to circumvent the issue added to runtime or required
    `O(N)` lookup.
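    The idea behind the override can be sketched without torch: a tensor-like object whose `==` is elementwise (like `torch.eq`) breaks dict lookups, so the offload map's keys wrap it with whole-tensor equality (the role `torch.equal` plays in the real override) while retaining the original O(1) hash. Class names here are illustrative.

```python
class FakeTensor:
    def __init__(self, data):
        self.data = list(data)
    def __eq__(self, other):                    # elementwise, like torch.eq
        return [a == b for a, b in zip(self.data, other.data)]
    __hash__ = object.__hash__

class EqKey:
    """Wrap a tensor so dict lookup uses whole-tensor equality."""
    def __init__(self, t):
        self.t = t
    def __hash__(self):
        return hash(self.t)                     # keep the original hash
    def __eq__(self, other):
        return self.t.data == other.t.data      # whole-tensor comparison

offload_map = {}
x = FakeTensor([1, 2, 3])
for ref in (x, x):          # duplicate kwargs pointing at the same memory
    key = EqKey(ref)
    if key not in offload_map:
        offload_map[key] = "offloaded-once"
print(len(offload_map))     # 1: the tensor is not redundantly offloaded
```

    This is safe only because the cached tensors are immutable, matching the caveat above.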

    Resolves #2363

    TEST PLAN:
    - [x] Unit test added for `OverrideEqMode`
    - [x] Script from #2363 runs with ~81GB CPU RAM after first layer
    propagation, increased to ~88GB CPU RAM used by layer 11/49, and then
    stays consistently <89GB CPU RAM used by layer 25/49. On current main,
    this script would hit ~750GB CPU RAM usage during first layer
    propagation

    ---------

    Signed-off-by: Kyle Sayers <kylesayrs@gmail.com>
    Signed-off-by: Brian Dellabetta <bdellabe@redhat.com>
    Signed-off-by: Brian Dellabetta <brian-dellabetta@users.noreply.github.com>
    Co-authored-by: Brian Dellabetta <brian-dellabetta@users.noreply.github.com>
    Co-authored-by: Brian Dellabetta <bdellabe@redhat.com>

commit 556b50306657186c7ca21b99d578491edc0f0a43
Author: Kyle Sayers <kylesayrs@gmail.com>
Date:   Mon Feb 16 11:36:12 2026 -0500

    [Misc] Reword warning message to make log grepping easier (#2312)
    * Make it easier to find failures in logs by removing the word "failed"
    from this very common warning

    Signed-off-by: Kyle Sayers <kylesayrs@a100-08.nemg-001.lab.rdu2.dc.redhat.com>
    Co-authored-by: Kyle Sayers <kylesayrs@a100-08.nemg-001.lab.rdu2.dc.redhat.com>
    Co-authored-by: HDCharles <39544797+HDCharles@users.noreply.github.com>

commit b6c331e2fa8faabf851a48b5458ccd9632e6206b
Author: HDCharles <39544797+HDCharles@users.noreply.github.com>
Date:   Mon Feb 16 10:18:30 2026 -0500

    [ddp] fixing data slice bug (#2361)

    Summary:

    That's not how you slice a dataset; previously not tested with
    world_size==1

    Test Plan:

    [script](https://gist.github.com/HDCharles/282950166fd0c95a7a2594fe922bcb53)

    (world_size==1)

    ---------

    Signed-off-by: HDCharles <charlesdavidhernandez@gmail.com>
    Signed-off-by: HDCharles <39544797+HDCharles@users.noreply.github.com>
    Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>

commit 6ddd0361e41f65d86a08889efd58c7dec00282e3
Author: ZewenShen-Cohere <zewen.shen@cohere.com>
Date:   Fri Feb 13 14:17:41 2026 -0500

    [AWQ] Add activation_hook_target field for custom activation cache hooking (#2346)

    - Adds an optional `activation_hook_target` field to `AWQMapping` that
    lets users specify which submodule (relative to the parent/LCA) to hook
    for activation caching, replacing the hardcoded `hasattr(parent, 'mlp')`
    workaround for MoE models with parallel transformer blocks.
    - When `activation_hook_target` is `None` (default), behavior is
    unchanged: the hook is placed on `balance_layers[0]`. When set (e.g.
    `"mlp"`), it resolves to the corresponding submodule on the parent via
    `getattr_chain`.

    In parallel transformer architectures, attention and MLP run in parallel
    from the same input. The existing code always hooks `balance_layers[0]`
    for activation caching, which captures the wrong activations when
    balance layers span both attention and MLP branches. There was a
    commented-out `hasattr(parent, 'mlp')` workaround, but it was brittle
    and not generalizable. This change makes the hook target explicitly
    configurable per mapping.
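    A sketch of the resolution logic (helper and attribute names are illustrative; `getattr_chain` below is a minimal stand-in for the real utility):

```python
from types import SimpleNamespace

def getattr_chain(obj, chain):
    """Resolve a dotted attribute path like "mlp.gate" on obj."""
    for attr in chain.split("."):
        obj = getattr(obj, attr)
    return obj

def resolve_hook_module(parent, balance_layers, activation_hook_target=None):
    if activation_hook_target is None:
        return balance_layers[0]                 # unchanged default
    return getattr_chain(parent, activation_hook_target)

# parallel transformer block: attention and MLP share the same input
mlp = SimpleNamespace(name="mlp")
attn = SimpleNamespace(name="self_attn")
parent = SimpleNamespace(mlp=mlp, self_attn=attn)

assert resolve_hook_module(parent, [attn]) is attn          # old behavior
assert resolve_hook_module(parent, [attn], "mlp") is mlp    # explicit target
```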

    I've tested this change with our internal models, and it aligns with
    previous results.

    ---------

    Co-authored-by: Dipika Sikka <dipikasikka1@gmail.com>
    Co-authored-by: HDCharles <39544797+HDCharles@users.noreply.github.com>

commit b0463d101350e04b40268596c71531622b26ad20
Author: HDCharles <39544797+HDCharles@users.noreply.github.com>
Date:   Fri Feb 13 11:16:34 2026 -0500

    [bug][awq] fix inf handling (#2332)

    Must have been a bad merge or rebase at some point, scalesview was being
    set before the inf/nan check

    TEST PLAN:
    CI

    ---------

    Signed-off-by: HDCharles <39544797+HDCharles@users.noreply.github.com>

commit 6d600d4b91e8fb8991cc2c5c6e4f8cd911c36815
Author: Dipika Sikka <dipikasikka1@gmail.com>
Date:   Fri Feb 13 10:05:07 2026 -0500

    Fix CI/CD failures (#2359)

    SUMMARY:

    autoround:
    - The test was previously using int weights with float activations,
    which fails silently with torch 2.9 but raises an error with 2.10
    - Fix the args to appropriately use a valid scheme where weights are
    also float

    quant_reload:
    - Remove old unused argument
    - Set tie_word_embeddings to false to account for what the test is
    targeting - I believe we’re seeing this now from recent
    compressed-tensors changes cc @kylesayrs

commit 302c2c7a190f1b6c6151afb0fbc5bf63b75f240e
Author: ZewenShen-Cohere <zewen.shen@cohere.com>
Date:   Fri Feb 13 08:06:15 2026 -0500

    AWQ: orig_layer_weights should save all balance layer weights (#2344)

    Currently, orig_layer_weights only clones weights for layers that have a
    quantization scheme and are listed in `mapping.balance_layers`. This
    becomes a problem when we disable quantization for a layer that is still
    in `mapping.balance_layers`: all balance layers still need to be
    smoothed at the end, but orig_layer_weights does not store the original
    weights for all of them. As a result, the smoothing step fails (see
    where the error is triggered:
    https://github.com/ZewenShen-Cohere/llm-compressor-fork/blob/e9e3d3191f7598198f070c5f8269f08ec89e0b2f/src/llmcompressor/modifiers/awq/base.py#L554
    ).
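A minimal sketch of the fix (hypothetical names; the real code clones tensors per layer rather than copying lists):

```python
def snapshot_balance_weights(balance_layers, module_weights):
    # Snapshot originals for ALL balance layers, not only those with a
    # quantization scheme, so the final smoothing step always has the
    # pristine weights it needs.
    return {name: list(module_weights[name]) for name in balance_layers}

weights = {"up_proj": [1.0, 2.0], "gate_proj": [3.0]}
snapshot = snapshot_balance_weights(["up_proj", "gate_proj"], weights)
weights["up_proj"][0] = 99.0  # later steps mutate the live weights
```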

    Co-authored-by: Brian Dellabetta <brian-dellabetta@users.noreply.github.com>
    Co-authored-by: HDCharles <39544797+HDCharles@users.noreply.github.com>

commit d2b67d15139f7a55699f5378cb477c945eb9ed5e
Author: Dipika Sikka <dipikasikka1@gmail.com>
Date:   Thu Feb 12 14:33:00 2026 -0500

    Update CI/CD Logs (#2358)

    SUMMARY:
    - Provide summary for why a test was skipped

commit 05a13f35711e12bed4771aea7755f27d248fdaeb
Author: Dipika Sikka <dipikasikka1@gmail.com>
Date:   Thu Feb 12 10:11:37 2026 -0500

    Support torch 2.10 (#2356)

    SUMMARY:
    - Requires: https://github.com/vllm-project/compressed-tensors/pull/583
    - Transformers tests currently already failing on main

commit c37fcfa081daa024f865a3f4798db029a0a67d43
Author: Fynn Schmitt-Ulms <fynnsu@outlook.com>
Date:   Wed Feb 11 19:52:19 2026 -0500

    Add synchronize trigger to ready label check (#2354)

    SUMMARY:
    Triggers the ready label check each time new commits are pushed to a pr.

    Looking at https://github.com/vllm-project/llm-compressor/pull/2350 it
    seems like there is still an issue with our ready check system.
    1. The first commit was added
    2. The ready label was added and a second commit
    (7d7ebd2247142dfb75cbe631aa37859092654f71) was pushed, which caused
    the ready check to run and pass
    3. Further commits were added but the ready check was never retriggered
    4. "ready-label-check Expected — Waiting for status to be reported" is
    blocking merge, despite the most recent run of the ready check passing.

    It seems like required checks may need to run and pass on the most
    recent commit for github to allow the merge. This pr causes subsequent
    commits to re-trigger the ready check workflow.
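In workflow terms, the fix amounts to adding `synchronize` to the trigger's event types, roughly (a sketch; the actual workflow may list other types as well):

```yaml
on:
  pull_request_target:
    types: [labeled, unlabeled, synchronize]
```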

    TEST PLAN:
    Merge and see if this fixes the problem. It can't make it worse since
    this just causes the check to run more often.

    Signed-off-by: Fynn Schmitt-Ulms <fschmitt@redhat.com>
    Co-authored-by: Dipika Sikka <dipikasikka1@gmail.com>

commit fcd7fdbda73b88168095e728dfdc6d3ce7cf004f
Author: Dipika Sikka <dipikasikka1@gmail.com>
Date:   Wed Feb 11 16:03:54 2026 -0500

    Swap to use CPU runners (#2350)

    SUMMARY:
    - Swap ubuntu runners to use our cpu runner
    - Remove 2 year old docker build workflow that we never use

    ---------

    Signed-off-by: Dipika Sikka <dipikasikka1@gmail.com>

commit d316e27e6aaafbb17480519ad15fc0d2b723353f
Author: Fynn Schmitt-Ulms <fynnsu@outlook.com>
Date:   Wed Feb 11 13:46:46 2026 -0500

    Add concurrency check to all pr workflows (#2348)

    SUMMARY:
    We typically only care about the test results for the final commit in a
    pr. This pr will reduce the load on github actions runners by cancelling
    all jobs except for the one on the latest commit.

    For example, if the following commits are all pushed in quick
    succession:
    Commit A1 uploaded, job A1 starts
    Commit B1 (separate pr) uploaded, job B1 queued
    Commit A2 uploaded, job A1 cancelled, job A2 queued, job B1 started
    Job B1 finishes, job A2 starts

    Note: this is the same concurrency logic we already have on
    `test-check-transformers.yaml`
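The concurrency logic referred to above follows the standard GitHub Actions pattern (shown as a generic sketch; the exact group key may differ per workflow):

```yaml
concurrency:
  group: ${{ github.workflow }}-${{ github.ref }}
  cancel-in-progress: true
```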

    Signed-off-by: Fynn Schmitt-Ulms <fschmitt@redhat.com>

commit 22fc354d25248f1ef9d990a9a20c6aeca8a94d6d
Author: Dipika Sikka <dipikasikka1@gmail.com>
Date:   Wed Feb 11 13:06:15 2026 -0500

    Revert "add qwen3 vl autoround example (#2334)" (#2351)

    This reverts commit 7b366711cba3982bbac99abdf6bf2c3572395f1a.

commit 7b366711cba3982bbac99abdf6bf2c3572395f1a
Author: Xin He <xin3.he@intel.com>
Date:   Thu Feb 12 01:34:04 2026 +0800

    add qwen3 vl autoround example (#2334)

    SUMMARY:
    AutoRound quantization example: qwen3-vl nvfp4

    TEST PLAN:
    python qwen3_vl_example.py
    Output:
    ```
    Hello my name is Mihai, I am a 30 year old male, and I am currently a software engineer working in a company that develops software for the financial sector. I am a very passionate person, and I am always eager to learn new things. I have a strong interest in AI, machine learning, and data science. I am also very interested in the intersection of these fields with finance. I am currently working on a project that involves building a machine learning model to predict stock prices. I am
    ```

    ---------

    Signed-off-by: Xin He <xin3.he@intel.com>
    Co-authored-by: HDCharles <39544797+HDCharles@users.noreply.github.com>
    Co-authored-by: Brian Dellabetta <brian-dellabetta@users.noreply.github.com>

commit b49fbfda933e168f7b58f10ff45e019b3f24baee
Author: Brian Dellabetta <brian-dellabetta@users.noreply.github.com>
Date:   Tue Feb 10 14:11:18 2026 -0500

    [cicd] move check ready action to on pull_request_target (#2342)

    SUMMARY:
    Change the check ready label ci/cd action to run on
    `pull_request_target` so that it runs more robustly for community user
    PRs. From
    [docs](https://docs.github.com/en/actions/reference/workflows-and-actions/events-that-trigger-workflows#pull_request_target):

    > This event runs in the context of the default branch of the base
    repository, rather than in the context of the merge commit, as the
    pull_request event does. This prevents execution of unsafe code from the
    head of the pull request that could alter your repository or steal any
    secrets you use in your workflow. This event allows your workflow to do
    things like label or comment on pull requests from forks. Avoid using
    this event if you need to build or run code from the pull request.
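The resulting trigger is roughly of this shape (a sketch with assumed event types; the actual workflow file may differ):

```yaml
on:
  pull_request_target:
    types: [labeled, unlabeled]
```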

  …