
Revert "add qwen3 vl autoround example (#2334)"#2351

Merged
dsikka merged 1 commit into main from revert_example on Feb 11, 2026

Conversation


@dsikka dsikka commented Feb 11, 2026

This reverts commit 7b36671.

@dsikka dsikka added the ready When a PR is ready for review label Feb 11, 2026
@gemini-code-assist
Contributor

Summary of Changes

Hello @dsikka, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request is a complete revert of a previous commit that introduced a Qwen3-VL AutoRound quantization example. It removes the example script and all associated documentation from the repository.

Highlights

  • Revert Qwen3-VL AutoRound Example: This pull request fully reverts the addition of the Qwen3-VL AutoRound example, removing its dedicated script and all related documentation from the README.md.
  • File Removal: The qwen3_vl_example.py script, which provided an end-to-end example for applying AutoRound quantization to the Qwen3-VL model, has been entirely removed.
  • Documentation Update: The README.md file was updated to reflect the removal of the Qwen3-VL example, including its quickstart instructions and accuracy evaluation tables. Minor heading adjustments were also made for the remaining Llama 3.1 example.


Changelog
  • examples/autoround/quantization_w4a4_fp4/README.md
    • Removed the 'Qwen3-VL Example' section, including its code block and accuracy evaluation tables.
    • Adjusted the heading level for 'Evaluate Accuracy' under the Llama 3.1 example from '####' to '###'.
    • Updated the introductory sentence in the 'Quickstart' section from 'end-to-end scripts' to 'an end-to-end script' to reflect the singular remaining example.
  • examples/autoround/quantization_w4a4_fp4/qwen3_vl_example.py
    • Removed the entire Python script that implemented the Qwen3-VL AutoRound quantization example.
Activity
  • The pull request is a revert of a previous commit, indicating a decision to undo a recently introduced feature or change.

@mergify mergify bot added the documentation Improvements or additions to documentation label Feb 11, 2026

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request reverts the addition of the Qwen3-VL autoround example. The changes involve deleting the qwen3_vl_example.py script and updating the README.md to remove all associated documentation and benchmark results. The adjustments in the README are correct and consistent with the removal of the example. The revert is clean and I have no issues to report.

@github-actions

👋 Hi! Thank you for contributing to llm-compressor. Please add the ready label when the PR is ready for review.

Note: This is required to complete the testing suite, please only add the label once the PR is code complete and local testing has been performed.

@dsikka dsikka merged commit 22fc354 into main Feb 11, 2026
13 of 15 checks passed
@dsikka dsikka deleted the revert_example branch February 11, 2026 18:06
zhanglei1172 added a commit to zhanglei1172/llm-compressor that referenced this pull request Mar 9, 2026
commit a2433a9b0128fb5113a362d553d7984de6246053
Author: Yi Liu <yi4.liu@intel.com>
Date:   Sat Mar 7 07:24:20 2026 +0800

    [AutoRound] Add DDP Support and Example (#2411)

    SUMMARY:
    Add DDP support for Autoround and use Qwen as example.

    Depends on https://github.com/vllm-project/llm-compressor/pull/2410

    TEST PLAN:
    "please outline how the changes were tested"

    cc @hshen14 @thuang6

    ---------

    Signed-off-by: HDCharles <charlesdavidhernandez@gmail.com>
    Signed-off-by: yiliu30 <yi4.liu@intel.com>
    Co-authored-by: HDCharles <charlesdavidhernandez@gmail.com>
    Co-authored-by: Dipika Sikka <dipikasikka1@gmail.com>

commit a88ebbd2e6e5fa02d9f33bc86b7118149dac3cb4
Author: Gilles Turpin <turpingilles15@gmail.com>
Date:   Fri Mar 6 16:52:20 2026 +0100

    Add MoE calibration module for GlmMoeDsa (GLM-5) (#2434)

    SUMMARY:
    GlmMoeDsaNaiveMoe uses packed 3D nn.Parameter tensors instead of
    nn.Linear modules, causing targets=["Linear"] to match nothing in MoE
    experts during AWQ/GPTQ quantization.

    This PR permanently unpacks the fused expert weights into individual
    nn.Linear layers, following the same calibration pattern as glm4_moe
    with dtype handling aligned.

    Key differences from glm4_moe: is_permanent=True (experts must be
    unpacked for quantization targets to match), DeepSeek-style routing with
    groups/topk_group/norm, and SequentialGlmMoeDsaExperts for 3D->2D weight
    unpacking.

    Closes #2430

    TEST PLAN:
    pytest.importorskip: tests skip gracefully on transformers < 5.x
    3 unit tests: all experts triggered, output matches original, experts
    converted to nn.Linear
    Full e2e validation pending transformers 5.x compatibility
    No smaller GLM-5 checkpoint available for e2e testing (744B only)

    Signed-off-by: Gilles Turpin <turpingilles15@gmail.com>
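The 3D-to-2D unpacking idea behind this commit can be illustrated with a small numpy sketch (the real `SequentialGlmMoeDsaExperts` code builds torch `nn.Linear` modules from an `nn.Parameter`; numpy is used here only to show the slicing and to check numerical equivalence, so all names below are illustrative):

```python
import numpy as np

# A fused expert tensor of shape (num_experts, out_features, in_features)
# is sliced into one 2D weight matrix per expert, so that quantization
# schemes targeting "Linear" can match each expert individually.
rng = np.random.default_rng(0)
num_experts, d_out, d_in = 4, 8, 16
fused = rng.normal(size=(num_experts, d_out, d_in))
x = rng.normal(size=(d_in,))

# Fused computation: all experts applied via the packed 3D tensor
fused_out = np.einsum("eoi,i->eo", fused, x)

# Unpacked computation: one standalone 2D matrix per expert,
# as an nn.Linear would hold it
experts = [fused[e] for e in range(num_experts)]
unpacked_out = np.stack([w @ x for w in experts])

# Unpacking must be numerically equivalent to the fused path
assert np.allclose(fused_out, unpacked_out)
```

Because the unpacking is permanent (`is_permanent=True`), the per-expert matrices remain individual modules after calibration, which is what lets `targets=["Linear"]` match them.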

commit 47ec10e84d659719f1ff9959df0effb3e6f2d95d
Author: Yi Liu <yi4.liu@intel.com>
Date:   Fri Mar 6 05:58:04 2026 +0800

    Upgrade autoround 0.10.2 (#2410)

    Signed-off-by: yiliu30 <yi4.liu@intel.com>

    SUMMARY:
    "please provide a brief summary"

    TEST PLAN:
    "please outline how the changes were tested"

    cc @hshen14 @thuang6 @chensuyue

    ---------

    Signed-off-by: yiliu30 <yi4.liu@intel.com>
    Co-authored-by: HDCharles <39544797+HDCharles@users.noreply.github.com>

commit 04dea55db919c1e8783a1f9a4c26977aff89fdfc
Author: Brian Dellabetta <brian-dellabetta@users.noreply.github.com>
Date:   Thu Mar 5 16:16:59 2026 -0500

    [Hotfix] _match_name hotfix (#2447)

    SUMMARY:
    To account for exposing `match_name` in compressed-tensors PR in
    * https://github.com/vllm-project/compressed-tensors/pull/607

    TEST PLAN:
    tests pass

    Signed-off-by: Brian Dellabetta <bdellabe@redhat.com>

commit 6d73ce60fac726496365f5144b98091f74876528
Author: Kyle Sayers <kylesayrs@gmail.com>
Date:   Thu Mar 5 14:25:15 2026 -0500

    Refactor logging, `CompressionLogger`, support distributed (#2408)
    * Remove misleading information about module size after compression
    * Support loguru logging which logs which rank logs come from
    * Support compression logging that is specific to distributed workloads
    * Refactor `CompressionLogger`
      * Remove nvidia/amd logic, instead just use the cuda interface; this
        already accounts for "CUDA/AMD_VISIBLE_DEVICES", no need to hard-code
        these env variables
      * Remove the "module size" log, which is misleading, as the module size
        does not actually change as optimization occurs (qdq)
      * Limit devices to just the current device in distributed cases
    * Refactor loguru logger configuration
      * `configure_logger` can now be called multiple times
      * When oneshot occurs, `configure_logger` is called again with the rank set
      * Logger now prints rank if applicable
    Single-thread
    ```
    2026-02-25T17:04:36.8189 | compress_module_list | INFO - Quantizing model.layers.0.mlp.gate_proj using 512 samples
    2026-02-25T17:04:38.5924 | GPTQ | METRIC - time 1.77s
    2026-02-25T17:04:38.5926 | GPTQ | METRIC - error 663.60
    2026-02-25T17:04:38.5932 | GPTQ | METRIC - GPU 0 | usage: 4.45% | total memory: 85.1 GB
    2026-02-25T17:04:38.5933 | GPTQ | METRIC - GPU 1 | usage: 0.00% | total memory: 85.1 GB
    ```

    Distributed
    ```
    [Rank 1] 2026-02-25T17:10:18.8569 | compress_module_list | INFO - Quantizing model.layers.2.self_attn.o_proj using 512 samples
    [Rank 1] 2026-02-25T17:10:20.4585 | GPTQ | METRIC - time 1.60s
    [Rank 1] 2026-02-25T17:10:20.4586 | GPTQ | METRIC - error 1.27
    [Rank 1] 2026-02-25T17:10:20.4593 | GPTQ | METRIC - GPU 1 | usage: 4.45% | total memory: 85.1 Gb
    [Rank 1] 2026-02-25T17:10:20.4637 | compress_module_list | INFO - Quantizing model.layers.2.mlp.up_proj using 512 samples
    [Rank 0] 2026-02-25T17:10:20.7379 | GPTQ | METRIC - time 6.59s
    [Rank 0] 2026-02-25T17:10:20.7381 | GPTQ | METRIC - error 7.45
    [Rank 0] 2026-02-25T17:10:20.7401 | GPTQ | METRIC - GPU 0 | usage: 5.98% | total memory: 85.1 Gb
    [Rank 0] 2026-02-25T17:10:20.7590 | compress_module_list | INFO - Quantizing model.layers.2.mlp.gate_proj using 512 samples
    ```

    ---------

    Signed-off-by: Kyle Sayers <kylesayrs@gmail.com>

commit d6eb2be988706e46cefb03ab6acf1bbd104d35af
Author: Gilles Turpin <turpingilles@orange.fr>
Date:   Thu Mar 5 01:45:04 2026 +0100

    fix: handle packed weights in granite4 to_3d_expert (W4A16 support) (#2425)

    SUMMARY:
    Fix the W4A16 shape mismatch in to_3d_expert() reported in #2338 (first
    error). The original code hardcoded shapes for FP8 quantization only.

    The fix calculates all shapes up front (packed weights, grouped scales,
    packed zero points) then asserts and reshapes. This supports FP8
    per-channel, FP8 block quantization, W4A16 symmetric, and W4A16
    asymmetric (with packed zero_point on dim0).

    Companion to #2426 (FX tracing fix) and compressed-tensors #609 (3D
    pack/unpack). Together they resolve #2338.

    TEST PLAN:
    4 unit tests covering all quantization configurations:
    - int4 symmetric (packed weights, per-channel scale)
    - int4 asymmetric (packed weights + packed zero_point on dim0)
    - fp8 block (grouped scale)
    - fp8 per-channel (no packing)

    All passing.

    Signed-off-by: Gilles Turpin <turpingilles15@gmail.com>
    Co-authored-by: HDCharles <39544797+HDCharles@users.noreply.github.com>
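The shape bookkeeping the fix performs can be sketched as plain arithmetic. The conventions assumed here (compressed-tensors-style int32 packing with a pack factor of `32 // num_bits`, per-group scales, zero points packed along dim 0 in the asymmetric case) follow the commit description; the actual `to_3d_expert()` implementation may differ in detail:

```python
def expected_2d_shapes(out_features, in_features, num_bits=4, group_size=128):
    """Illustrative per-expert tensor shapes under the assumptions above.

    Not the actual llm-compressor implementation; a sketch of the shape
    calculations the fix must get right for each quantization config.
    """
    pack_factor = 32 // num_bits  # int4 -> 8 values per int32 word
    return {
        "weight_packed": (out_features, in_features // pack_factor),
        "weight_scale": (out_features, in_features // group_size),
        # W4A16 asymmetric: zero points are themselves packed, along dim 0
        "weight_zero_point": (out_features // pack_factor,
                              in_features // group_size),
    }

shapes = expected_2d_shapes(out_features=512, in_features=1024)
assert shapes["weight_packed"] == (512, 128)
assert shapes["weight_scale"] == (512, 8)
assert shapes["weight_zero_point"] == (64, 8)
```

Hardcoding only the FP8 case (no packing, per-channel scale) is what broke W4A16, where both the weight and the zero point change shape under packing.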

commit 4c522137771b2223dcbfec2001658a744b37a3d5
Author: Gilles Turpin <turpingilles@orange.fr>
Date:   Thu Mar 5 00:50:57 2026 +0100

    fix: use topological ordering in FX graph cleanup to fix erase_node crash (Granite4 GPTQ) (#2426)
    Fix the FX tracing crash reported as the second error in #2338. The BFS
    cleanup of concrete args did not maintain topological ordering — if a
    node was visited multiple times, its position in the deletion dict was
    not updated, causing dependents to be deleted before their dependencies
    (`RuntimeError: Tried to erase Node getitem_169`).

    The fix uses `move_to_end` in the BFS traversal so that revisited nodes
    are moved to the end of the deletion dict, ensuring topological order.

    Companion to #2425 (shape fix) and compressed-tensors #609 (3D
    pack/unpack). Together they resolve #2338.
    Tested on Granite 4.0-h-small with a single layer, using all three fixes
    (#2425, #2426, compressed-tensors #609).

    Script based on `test_gptq_no_exclusion.py` from #2338 with
    `model.model.layers = model.model.layers[:1]` added after model loading.

    Command: `python test_gptq_no_exclusion.py --model-name
    ibm-granite/granite-4.0-h-small --output /workspace/test-output
    --calibration-samples 16`

    Results:
    - FX tracing completed — no `erase_node` crash
    - 3D→2D conversion OK
    - Cache preparation OK (16/16 samples)
    - Calibration started but hit OOM on the Mamba layer (unrelated to the
    fix — naive Mamba path without `causal_conv1d` on a 31GB GPU)

    Signed-off-by: gillesturpin <turpingilles@orange.fr>
    Co-authored-by: HDCharles <39544797+HDCharles@users.noreply.github.com>
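The ordering bug can be illustrated with a stdlib-only sketch over a hypothetical three-node graph (this is not the actual torch.fx cleanup code): without `move_to_end`, a node first discovered via a short path keeps its early position, so reverse-order deletion tries to erase it while a dependent still uses it.

```python
from collections import OrderedDict, deque

def deletion_order(users, start):
    """BFS over a DAG where users[n] lists the nodes consuming n's output.

    Returns an ordering such that iterating it in reverse always erases a
    consumer before any of its producers (mirroring fx's erase_node rule,
    which refuses to erase a node that still has users).
    """
    order = OrderedDict({start: None})
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for user in users.get(node, []):
            if user in order:
                # Revisited node: move it after its newly-found producer,
                # restoring topological order (this is the fix)
                order.move_to_end(user)
            else:
                order[user] = None
            queue.append(user)
    return list(order)

# a feeds both b and c; c also feeds b, so b must be erased before c
users = {"a": ["b", "c"], "c": ["b"]}
order = deletion_order(users, "a")
assert order == ["a", "c", "b"]  # reversed deletion: b, then c, then a
```

Without the `move_to_end` branch, the order would stay `["a", "b", "c"]`, and deleting in reverse would attempt to erase `c` while `b` still consumes it, which is the `Tried to erase Node` failure mode described above.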

commit 7461d02b9bf9edc35f3be9effdaa97d6639baf1f
Author: JinRiYao2001 <jinriyao@qq.com>
Date:   Thu Mar 5 02:19:54 2026 +0800

    fix(examples): correct W8A16 -> W4A16 in Qwen3-VL AWQ example save dir (#2443)

    SUMMARY:
    The AWQ recipe in this example uses num_bits=4 for weights (W4A16).

    However the save directory name incorrectly uses "W8A16":

        -AWQ-W8A16-mse-seq

    This PR updates it to:

        -AWQ-W4A16-mse-seq

    to match the actual quantization configuration and the comment above the
    recipe.

    TEST PLAN:
    Not applicable. This PR only fixes an incorrect save directory string in
    the example script.
    No functional code paths are changed.

commit e6fdd066c785b11453875e777c229a954a9c438e
Author: Kyle Sayers <kylesayrs@gmail.com>
Date:   Tue Mar 3 16:25:18 2026 -0500

    Remove dead code (#2435)
    * Remove dead code
    * Remove `save_checkpoint` (this is now done by
    [post_process](https://github.com/vllm-project/llm-compressor/blob/main/src/llmcompressor/entrypoints/utils.py#L95))
    * Remove `get_completed_stages`, `save_completed_stages` (stages no
    longer exist)
    * Remove `load_safetensors_state_dict` (we now either load with the
    transformers model definition or `model_free_ptq`)
    * Remove `set_deterministic_seeds` (not used)
    * Remove `is_package_available`

    Signed-off-by: Kyle Sayers <kylesayrs@gmail.com>

commit 7b7d1a5dc1fbca660acc04ff993fcb0c9d15acbb
Author: Dipika Sikka <dipikasikka1@gmail.com>
Date:   Tue Mar 3 14:04:56 2026 -0500

    Enable merge queue support in GitHub workflows (#2433)
    - Configures Mergify merge queue with automatic DCO sign-off to resolve
    DCO check failures on merge commits
    - Removes GitHub native merge queue triggers from all workflows
    - Adds auto-merge rule for PRs with `ready` label and required approvals
    The DCO (Developer Certificate of Origin) GitHub App was failing on
    merge commits created by GitHub's native merge queue, as those commits
    lacked the required `Signed-off-by:` trailer.
    Switch to Mergify's merge queue which automatically adds DCO sign-off to
    all merge commits it creates.
    - Added `queue_rules` with automatic DCO sign-off in commit messages
    - Added auto-merge rule that queues PRs when:
      - Label `ready` is applied
      - 2+ approvals received
      - All required checks pass (DCO, tests, quality, etc.)
    - `.github/workflows/ready-label-check.yaml`: Removed merge_group
    trigger
    - `.github/workflows/test-check-transformers.yaml`: Removed merge_group
    trigger and condition
    - `.github/workflows/test-check.yaml`: Removed merge_group trigger
    - `.github/workflows/quality-check.yaml`: Removed merge_group trigger
    - `.github/workflows/linkcheck.yml`: Removed merge_group trigger
    After merging, GitHub's native merge queue should be disabled in
    repository settings and Mergify will handle all merge queue operations.

    🤖 Generated with [Claude Code](https://claude.com/claude-code)

    ---------

    Co-authored-by: Claude Sonnet 4.5 <noreply@anthropic.com>

commit f15296fb966bebd2652e1a31ae106e70eff8b5e2
Author: Itay Etelis <92247226+Etelis@users.noreply.github.com>
Date:   Tue Mar 3 17:07:20 2026 +0200

    Refactor Matching Logic to Use compressed-tensors Utilities (#2284)

    Consolidates 17 redundant matching functions into standardized
    compressed-tensors APIs.

    Fixes #1686

    - **Deleted 15 functions** from `module.py`: `get_layers`, `get_params`,
    `get_prunable_layers`, `get_quantizable_layers`, `match_targets`, etc.
    - **Added 2 helpers**: `expand_special_targets()` (backward
    compatibility) and `build_parameterized_layers()`
    - **Updated modifiers**: SparseGPT, magnitude pruning, constant pruning
    to use new APIs
    - **Bug fix**: Added missing `self.targets` parameter in magnitude
    pruning

    ---------

    Signed-off-by: Itay Etelis <itay.etelis@ibm.com>
    Co-authored-by: Itay Etelis <itay.etelis@ibm.com>
    Co-authored-by: Brian Dellabetta <brian-dellabetta@users.noreply.github.com>
    Co-authored-by: HDCharles <39544797+HDCharles@users.noreply.github.com>

commit a956d688892c2cff3757c598e4c870c796a42f78
Author: Xin He <xin3.he@intel.com>
Date:   Tue Mar 3 07:45:34 2026 +0800

    add qwen3 vl autoround example (#2357)

    SUMMARY:
    AutoRound quantization example: qwen3-vl nvfp4

    TEST PLAN:
    python qwen3_vl_example.py
    Output:
    ```
    Hello my name is Mihai, I am a 30 year old male, and I am currently a software engineer working in a company that develops software for the financial sector. I am a very passionate person, and I am always eager to learn new things. I have a strong interest in AI, machine learning, and data science. I am also very interested in the intersection of these fields with finance. I am currently working on a project that involves building a machine learning model to predict stock prices. I am
    ```

    ---------

    Signed-off-by: Xin He <xin3.he@intel.com>
    Co-authored-by: HDCharles <39544797+HDCharles@users.noreply.github.com>
    Co-authored-by: Brian Dellabetta <brian-dellabetta@users.noreply.github.com>
    Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>

commit 2b0684c132d130b84b1b8ec9cce9f29a3239debc
Author: Omkar Kabde <omkarkabde@gmail.com>
Date:   Tue Mar 3 05:06:16 2026 +0530

    Remove training loggers and all related code (#2414)

    SUMMARY:
    Fixes #2409.
    cc @kylesayrs

    This PR removes training loggers and all related code, replacing their
    functionality with `loguru`. It also removes other helper functions and
    `FrequencyManager`.

    TEST PLAN:
    most tests are passing, but getting stuck at gptq test

    ---------

    Signed-off-by: Dan Huang <dahuang@redhat.com>
    Signed-off-by: Omkar Kabde <omkarkabde@gmail.com>
    Co-authored-by: dhuangnm <74931910+dhuangnm@users.noreply.github.com>
    Co-authored-by: Dipika Sikka <dipikasikka1@gmail.com>
    Co-authored-by: Brian Dellabetta <brian-dellabetta@users.noreply.github.com>

commit 795198790668807f0c90c9f9df9842ad0cc6cc25
Author: Gilles Turpin <turpingilles15@gmail.com>
Date:   Tue Mar 3 00:13:15 2026 +0100

    Add SmoothQuant mapping for GlmMoeDsaForCausalLM (GLM-5) (#2419)

    Part of #1442

    GLM-5 (GlmMoeDsaForCausalLM) uses MLA identical to DeepSeek V2/V3 — same
    projection names (q_a_proj, kv_a_proj_with_mqa). Reuses
    DEEPSEEK_V2_SMOOTHQUANT_MAPPINGS which smooths input_layernorm only,
    conservative choice for MoE models with fused expert parameters
    (gate_up_proj 3D tensor).

    Also adds Glm4MoeForCausalLM with DEFAULT_SMOOTHQUANT_MAPPINGS.

    SUMMARY:
    Add GLM-5 and GLM-4-MoE to SmoothQuant MAPPINGS_REGISTRY.

    TEST PLAN:
    Registry-only change. Verified GLM-5 layer names match DeepSeek V2
    patterns by inspecting GlmMoeDsaForCausalLM in transformers.

    Signed-off-by: gillesturpin <turpingilles@orange.fr>
    Co-authored-by: Brian Dellabetta <brian-dellabetta@users.noreply.github.com>
    Co-authored-by: HDCharles <39544797+HDCharles@users.noreply.github.com>

commit 69af79c4f2016f090deaf6e06faf73e3403e5d1d
Author: Gilles Turpin <turpingilles@orange.fr>
Date:   Mon Mar 2 22:55:15 2026 +0100

    Fix SmoothQuant regex to match q_a_proj in DeepSeek/GLM-5 (#2421)

    Fixes #2420
    The balance_layers pattern re:.*q_proj in
    DEEPSEEK_V2_SMOOTHQUANT_MAPPINGS does not match q_a_proj (used by
    DeepSeek V2/V3 and GLM-5). Changed to re:.*q(_a)?_proj$ as suggested by
    @brian-dellabetta.

    SUMMARY:
    Fix regex pattern in DEEPSEEK_V2_SMOOTHQUANT_MAPPINGS to cover both
    q_proj and q_a_proj.

    TEST PLAN:
    Verified with Python regex that the new pattern matches both layer
    names:
    re.fullmatch(".*q(_a)?_proj$", "model.layers.0.self_attn.q_proj") ->
    match
    re.fullmatch(".*q(_a)?_proj$", "model.layers.0.self_attn.q_a_proj") ->
    match

    Signed-off-by: gillesturpin <turpingilles@orange.fr>
    Co-authored-by: Brian Dellabetta <brian-dellabetta@users.noreply.github.com>
    Co-authored-by: HDCharles <39544797+HDCharles@users.noreply.github.com>
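As a quick standalone sanity check of the pattern change (layer names taken from the test plan above; this snippet is not part of the PR):

```python
import re

# Old and new balance_layers patterns from DEEPSEEK_V2_SMOOTHQUANT_MAPPINGS
OLD = r".*q_proj"
NEW = r".*q(_a)?_proj$"

q_proj = "model.layers.0.self_attn.q_proj"
q_a_proj = "model.layers.0.self_attn.q_a_proj"  # DeepSeek V2/V3 and GLM-5 MLA

# The old pattern matches q_proj but misses q_a_proj entirely
assert re.fullmatch(OLD, q_proj)
assert re.fullmatch(OLD, q_a_proj) is None

# The new pattern covers both layer names
assert re.fullmatch(NEW, q_proj)
assert re.fullmatch(NEW, q_a_proj)
```

The optional group `(_a)?` is what admits the extra `_a` segment in the MLA projection name while still anchoring on the `_proj` suffix.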

commit 9e9ae3dbb2b239bc22cac1a9fc463f4895e87250
Author: Gilles Turpin <turpingilles@orange.fr>
Date:   Mon Mar 2 20:05:46 2026 +0100

    Add AWQ mapping for GlmMoeDsaForCausalLM (GLM-5) (#2418)

    Closes #2412 (part of #1442)

    GLM-5 (`GlmMoeDsaForCausalLM`) uses Multi-head Latent Attention
    identical to DeepSeek V3 — same projection layer names (`q_a_proj`,
    `kv_a_proj_with_mqa`, etc.) and same MoE structure. Reuses
    `_deepseek_mappings`.

    Also moves `Glm4MoeForCausalLM` to its correct alphabetical position in
    the registry.

    SUMMARY:
    Add GLM-5 (GlmMoeDsaForCausalLM) to AWQ_MAPPING_REGISTRY using
    _deepseek_mappings. GLM-5's MLA layer names are identical to DeepSeek
    V3. Also fixes alphabetical ordering of Glm4MoeForCausalLM.

    TEST PLAN:
    Registry-only change (no logic modified). Verified that GLM-5 layer
    names (q_a_proj, kv_a_proj_with_mqa, kv_a_layernorm, kv_b_proj, o_proj)
    match the patterns in _deepseek_mappings by inspecting the
    GlmMoeDsaForCausalLM source in transformers.

    Signed-off-by: gillesturpin <turpingilles@orange.fr>
    Co-authored-by: Brian Dellabetta <brian-dellabetta@users.noreply.github.com>
    Co-authored-by: HDCharles <39544797+HDCharles@users.noreply.github.com>

commit a27d9e2e318fdb81254285c2ed5987b97897d973
Author: 김대익 <33992354+dik654@users.noreply.github.com>
Date:   Tue Mar 3 03:41:47 2026 +0900

    [GPTQ] Move modifier to top-level for consistent folder structure (#2368)
    Move GPTQModifier from `modifiers/quantization/gptq/` to
    `modifiers/gptq/`
    for consistent folder structure with AWQ and AutoRound (related: #2306).

    - Add deprecation wrapper at old import path for backward compatibility
    - Exclude old GPTQ paths from ModifierFactory to prevent duplicate
    registration
    - Update test and example imports to new canonical path
    Import verification (all passed):
    - from llmcompressor.modifiers.gptq import GPTQModifier (new path, no
    warning)
    - from llmcompressor.modifiers.quantization import GPTQModifier (BC, no
    warning)
    - from llmcompressor.modifiers.quantization.gptq import GPTQModifier
    (BC, DeprecationWarning)
    - ModifierFactory.refresh() registers GPTQModifier from new location

    pytest (11 passed, 3 skipped for GPU):
    - tests/llmcompressor/transformers/gptq/test_gptq_oneshot.py
    -
    tests/llmcompressor/pytorch/modifiers/pruning/sparsegpt/test_pytorch.py
    - tests/llmcompressor/transformers/compression/test_recipe_parsing.py
    (requires GPU)

    ruff check + ruff format passed

    ---------

    Signed-off-by: 김대익 <33992354+dik654@users.noreply.github.com>
    Co-authored-by: Brian Dellabetta <brian-dellabetta@users.noreply.github.com>
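A deprecation wrapper at an old import path, as described above, can be sketched with a PEP 562-style module `__getattr__`. The names below are illustrative stand-ins, not the actual llmcompressor shim:

```python
import types
import warnings

# Stand-in for the class at its new canonical location
class GPTQModifier:
    pass

_new_home = {"GPTQModifier": GPTQModifier}

# Legacy module whose attribute access warns and forwards (PEP 562 style)
legacy = types.ModuleType("modifiers.quantization.gptq")

def _deprecated_getattr(name):
    if name in _new_home:
        warnings.warn(
            f"{name} moved to modifiers.gptq; the old import path is deprecated",
            DeprecationWarning,
            stacklevel=2,
        )
        return _new_home[name]
    raise AttributeError(name)

legacy.__getattr__ = _deprecated_getattr

# Old-path access still resolves, but emits a DeprecationWarning
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    cls = legacy.GPTQModifier

assert cls is GPTQModifier
assert caught and caught[0].category is DeprecationWarning
```

This matches the behavior verified in the test plan: the new path imports cleanly, while the old path still works but raises a `DeprecationWarning`.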

commit a99f159abe94dd119f6a13e5ae4004505fcd8355
Author: HDCharles <39544797+HDCharles@users.noreply.github.com>
Date:   Mon Mar 2 10:44:26 2026 -0500

    Smoothquant bugfixes (#2422)

    Summary:

    SmoothQuant wasn't actually doing anything, since it was only updating
    the onloaded copy of the weights. This PR fixes that and adds a test to
    check the behavior of SmoothQuant in the future.

    TEST PLAN:
    pytest
    /home/HDCharles/repos/llm-compressor/tests/llmcompressor/modifiers/transform/smoothquant/test_base.py
    -k "e2e"

    Signed-off-by: HDCharles <charlesdavidhernandez@gmail.com>

commit 732316c8980913d173d3a202eeab95eda39af230
Author: Sören Dréano <71752785+SorenDreano@users.noreply.github.com>
Date:   Sat Feb 28 16:25:35 2026 +0100

    Add support for passing a custom DataLoader to oneshot() (#2390)

    SUMMARY:
    Adds a `dataloader` argument to the `oneshot` entrypoint.

    Allow users to pass a pre-built PyTorch DataLoader directly via the
    `dataloader` parameter, bypassing the internal dataset-to-dataloader
    conversion. This is useful for custom data pipelines where users already
    have a prepared DataLoader and don't need get_calibration_dataloader().

    Rather than using `self.dataloader = kwargs.pop("dataloader", None)`, we
    could also add a `dataloader` argument/attribute to `DatasetArguments`
    if you prefer.

    TEST PLAN:
    This change is fairly trivial. I made sure
    [the W4A16 example](https://github.com/vllm-project/llm-compressor/blob/main/examples/quantization_w4a16/README.md)
    could still run, and that passing the DataLoader directly also works:

    ```python
    from transformers import AutoTokenizer, AutoModelForCausalLM

    MODEL_ID = "meta-llama/Meta-Llama-3-8B-Instruct"
    model = AutoModelForCausalLM.from_pretrained(MODEL_ID, dtype="auto")
    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

    from datasets import load_dataset

    NUM_CALIBRATION_SAMPLES=512
    MAX_SEQUENCE_LENGTH=2048
    ds = load_dataset("HuggingFaceH4/ultrachat_200k", split=f"train_sft[:{NUM_CALIBRATION_SAMPLES}]")
    ds = ds.shuffle(seed=42)
    def preprocess(example):
        return {"text": tokenizer.apply_chat_template(example["messages"], tokenize=False,)}
    ds = ds.map(preprocess)
    def tokenize(sample):
        return tokenizer(sample["text"], padding=False, max_length=MAX_SEQUENCE_LENGTH, truncation=True, add_special_tokens=False)
    ds = ds.map(tokenize, remove_columns=ds.column_names)

    from llmcompressor.datasets import get_calibration_dataloader
    from llmcompressor.args import DatasetArguments

    dataset_args = DatasetArguments(
        dataset=ds,
        max_seq_length=MAX_SEQUENCE_LENGTH,
        num_calibration_samples=NUM_CALIBRATION_SAMPLES,
    )

    dataloader = get_calibration_dataloader(dataset_args, tokenizer)

    from llmcompressor import oneshot
    from llmcompressor.modifiers.quantization import GPTQModifier
    recipe = GPTQModifier(targets="Linear", scheme="W4A16", ignore=["lm_head"])
    oneshot(
        model=model,
        recipe=recipe,
        dataloader=dataloader,
    )
    SAVE_DIR = MODEL_ID.rstrip("/").split("/")[-1] + "-W4A16-G128"
    model.save_pretrained(SAVE_DIR, save_compressed=True)
    tokenizer.save_pretrained(SAVE_DIR)
    ```

    This is the exact same code from the documentation, with the DataLoader
    built outside of the `oneshot` call (`dataloader =
    get_calibration_dataloader(dataset_args, tokenizer)`) and passed
    directly to `oneshot`.

    ---------

    Signed-off-by: Sören Dréano <71752785+SorenDreano@users.noreply.github.com>
    Signed-off-by: Brian Dellabetta <bdellabe@redhat.com>
    Co-authored-by: Soren Dreano <soren@numind.ai>
    Co-authored-by: Brian Dellabetta <brian-dellabetta@users.noreply.github.com>
    Co-authored-by: Brian Dellabetta <bdellabe@redhat.com>
    Co-authored-by: HDCharles <39544797+HDCharles@users.noreply.github.com>

commit bdb65473ba21ca6aaaf726ffe66c695f5608c953
Author: dhuangnm <74931910+dhuangnm@users.noreply.github.com>
Date:   Fri Feb 27 18:06:58 2026 -0500

    Bump compressed-tensors version (#2423)

    SUMMARY:
    Compressed-tensors 0.14.0 has been released. Bump up its version in
    llmcompressor.

    TEST PLAN:
    All tests.

    Signed-off-by: Dan Huang <dahuang@redhat.com>

commit 0c0ead359a355ea443df50f3f6c91de7d1df255d
Author: Dipika Sikka <dipikasikka1@gmail.com>
Date:   Thu Feb 26 18:04:11 2026 -0500

    [ReadMe] Update whats new (#2417)

    SUMMARY:
    Sample Build:
    https://app.readthedocs.org/projects/vllm-llm-compressor/builds/31579228/

commit a9847e04a92f75d64416b133991b868ed4564bf6
Author: Dipika Sikka <dipikasikka1@gmail.com>
Date:   Thu Feb 26 17:39:55 2026 -0500

    [Docs] Updates (#2416)

    SUMMARY:
    - Fix torchrun command
    - Add reference to guides in compress.md
    - Update model loading table

commit fe512727a4584c79f62dba984f004e1b4f6f9277
Author: dhuangnm <74931910+dhuangnm@users.noreply.github.com>
Date:   Thu Feb 26 15:09:59 2026 -0500

    Improve how we identify and run e2e smoke tests (#2336)

    SUMMARY:
    Currently we use the file `tests/e2e/vLLM/rhaiis-e2e-smoke.list` to mark
    the configs for smoke tests that we use to run for the RHAIIS image.
    This is vulnerable as we need to keep the list in this file up-to-date
    to any changes in the config yaml files and this is error-prone.

    This PR removes the `tests/e2e/vLLM/rhaiis-e2e-smoke.list` file and uses
    the config yaml files directly to mark the smoke tests. We added a new
    field `test_group` to the yaml files and updated the `run_tests_in_*.sh`
    scripts to parse this field and filter out tests when a test group (`-g`)
    is specified. This allows both the python and RHAIIS image testing to run
    smoke and full tests from the configs.

    To be more specific:

    ```
    # run e2e tests for all configs (default)
    bash tests/e2e/vLLM/run_tests_in_python.sh -c tests/e2e/vLLM/configs -t tests/e2e/vLLM/test_vllm.py

    # run e2e tests for smoke-only configs
    bash tests/e2e/vLLM/run_tests_in_python.sh -c tests/e2e/vLLM/configs -t tests/e2e/vLLM/test_vllm.py -g rhaiis
    ```

    Similar commands for the `run_tests_in_rhaiis.sh` script.

    Going forward, for any newly added e2e test configs that we want to
    include in the smoke tests for the RHAIIS image, we need to remember to
    add `test_group: "smoke"` to their yaml file under configs/ so the
    RHAIIS image testing picks them up automatically.

    TEST PLAN:
    A successful run of the smoke tests is here:

    https://github.com/neuralmagic/llm-compressor-testing/actions/runs/21727920814

    ---------

    Signed-off-by: Dan Huang <dahuang@redhat.com>
    Signed-off-by: Dipika Sikka <dipikasikka1@gmail.com>
    Co-authored-by: Dipika Sikka <dipikasikka1@gmail.com>

commit d0228407111ad6a70fa74c933cd138ab0404a9f6
Author: Dipika Sikka <dipikasikka1@gmail.com>
Date:   Thu Feb 26 11:41:09 2026 -0500

    [Example Testing] Remove and update example test cases (#2406)

    SUMMARY:
    - Remove out-dated cases
    - Add more up-to-date cases (e.g disk offload, ddp, model free ptq),
    examples, and models
    - Ensure all cases are verified for correct compression format
    - Add an optional `qwen` install to enable qwen VL examples which
    leverage `qwen_vl_utils`
    - Will require
    https://github.com/neuralmagic/llm-compressor-testing/pull/219 for
    example testing

    With these changes, all examples pass:
    https://github.com/neuralmagic/llm-compressor-testing/actions/runs/22450404023

    ---------

    Signed-off-by: Dipika Sikka <dipikasikka1@gmail.com>

commit 12aa5639a3276bb7fe493a0a2158e9846c63f3ff
Author: dhuangnm <74931910+dhuangnm@users.noreply.github.com>
Date:   Wed Feb 25 17:17:21 2026 -0500

    [WIP] Update dependency bounds for new release (#2407)

    SUMMARY:
    Update llmcompressor dependency bounds except for compressed-tensors,
    which will be updated after the compressed-tensors 0.14.0 is released.

    TEST PLAN:
    All tests

    ---------

    Signed-off-by: Dan Huang <dahuang@redhat.com>
    Co-authored-by: Dipika Sikka <dipikasikka1@gmail.com>

commit 81ec39c1c36c7f1d092dbd518591ca1bfb171c18
Author: Kyle Sayers <kylesayrs@gmail.com>
Date:   Wed Feb 25 14:45:45 2026 -0500

    [Offload] Convert model back to CT offloading for testing (#2403)
    * Fix testing which requires access to the model after the model has
    been saved
    * https://github.com/vllm-project/compressed-tensors/pull/601
    * Convert back to CT offloading after converting to accelerate
    offloading for saving
    * Previously we just "removed dispatch", but this is bad practice as it
    won't work for disk offloading

    Signed-off-by: Kyle Sayers <kylesayrs@gmail.com>
    Co-authored-by: Dipika Sikka <dipikasikka1@gmail.com>

commit c6e4d38dde4471874e4a3100f928cd3fef473cd5
Author: HDCharles <39544797+HDCharles@users.noreply.github.com>
Date:   Wed Feb 25 14:20:44 2026 -0500

    [dist][moe] fix add moe_context for big models (#2405)

    Summary:

    For large models like Qwen/Qwen3-VL-235B-A22B-Instruct, when adding the
    moe calibration context, different threads can take different lengths of
    time; for larger models this difference can be longer than the nccl
    timeout.

    Fix: add a sync point at each module, since we're rate-limited to the
    slowest thread as-is. At some point this should be changed to add the moe
    calibration context in parallel and broadcast the updated modules.
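    A minimal sketch of that fix (a recording callable stands in for `torch.distributed.barrier` so the sketch runs without a distributed setup; module names are made up):

```python
# Synchronize all ranks after each module's MoE-calibration setup so no
# rank outruns the others long enough to trip the NCCL timeout.
def apply_moe_context(modules, setup, barrier):
    for module in modules:
        setup(module)   # per-module work; duration varies across ranks
        barrier()       # sync point: wait for the slowest rank

calls = []
apply_moe_context(
    modules=["expert_block_0", "expert_block_1"],
    setup=lambda m: calls.append(("setup", m)),
    barrier=lambda: calls.append(("barrier",)),
)
print(calls)  # setup/barrier alternate, one barrier per module
```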

    TEST PLAN:

    tested e2e

    <details>

    ```python
    # qwen3_vl_235b_moe_gptq_int4_ddp_example.py
    # currently supported for Qwen3-VL-MoE
    from compressed_tensors.offload import init_dist, load_offloaded_model
    from transformers import AutoProcessor, Qwen3VLMoeForConditionalGeneration

    from llmcompressor import oneshot
    from llmcompressor.modifiers.quantization import GPTQModifier

    MODEL_ID = "Qwen/Qwen3-VL-235B-A22B-Instruct"
    init_dist()
    with load_offloaded_model():
        model = Qwen3VLMoeForConditionalGeneration.from_pretrained(
            MODEL_ID, dtype="auto", device_map="auto_offload"
        )

    processor = AutoProcessor.from_pretrained(MODEL_ID)

    recipe = GPTQModifier(
        targets="Linear",
        scheme="W4A16",
        ignore=[
            "re:.*lm_head",
            "re:visual.*",
            "re:model.visual.*",
            "re:.*mlp.gate$",
        ],
    )
    oneshot(model=model, recipe=recipe)

    import torch
    SAVE_DIR = (
        MODEL_ID.rstrip("/").split("/")[-1]
        + "-GPTQ-W4A16-G128-DDP"
        + str(torch.distributed.get_world_size())
    )
    model.save_pretrained(SAVE_DIR, save_compressed=True)
    processor.save_pretrained(SAVE_DIR)
    ```
    </details>

    Signed-off-by: HDCharles <charlesdavidhernandez@gmail.com>

commit f18d6e384fb9244c82eeb5ce715c3c54b4a91313
Author: HDCharles <39544797+HDCharles@users.noreply.github.com>
Date:   Wed Feb 25 13:09:01 2026 -0500

    fix ddp for nvfp4 on A100 (#2404)

    depends on https://github.com/vllm-project/compressed-tensors/pull/603

    Summary:

    nccl does not allow broadcasting fp8 on A100, but we can work around it
    with this util

    Test Plan:

    <details>
    Test Script

    </details>

    Signed-off-by: HDCharles <charlesdavidhernandez@gmail.com>

commit ff526d72e41b3e13ae9df4f0d0524764751cd2ec
Author: Kyle Sayers <kylesayrs@gmail.com>
Date:   Wed Feb 25 11:27:12 2026 -0500

    [Docs] Add Sequential Onloading, Disk Offloading, and Distributed Oneshot Docs (#2396)
    * Add documentation for new features in v0.10.0
    * Add up-to-date documentation on sequential onloading
    * Add docs page for Sequential Onloading
    * Add docs page for Model Loading
    * Add docs page for Distributed Oneshot
    * Fix the path of observers.md
    * Slightly change wording on docs home page
    * Add redirect to model loading docs in disk offloading examples folder

    ---------

    Signed-off-by: Kyle Sayers <kylesayrs@gmail.com>
    Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>

commit 1e4d3c5bca95ac75fc301005d1fe5b2adca9a955
Author: Brian Dellabetta <brian-dellabetta@users.noreply.github.com>
Date:   Wed Feb 25 11:09:57 2026 -0500

    [Examples] Remove diagnostic `model.generate` calls for models with 40B+ parameters (#2401)

    SUMMARY:
    Remove all calls to `model.generate` in examples involving models with
    ~40B+ parameters. Anything smaller should run on a single 80GB GPU.

    TEST PLAN:
    n/a

    ---------

    Signed-off-by: Brian Dellabetta <bdellabe@redhat.com>
    Co-authored-by: Dipika Sikka <dipikasikka1@gmail.com>

commit f0a1824bc5440597d071bcc21bd8ad01bd8b0038
Author: Dipika Sikka <dipikasikka1@gmail.com>
Date:   Wed Feb 25 11:03:02 2026 -0500

    [Tests][LM Eval] Fix test seeding for consistent results (#2395)

    SUMMARY:
    - Enables consistent test results before runs

    Test Run:
    https://github.com/neuralmagic/llm-compressor-testing/actions/runs/22371360237

    ---------

    Signed-off-by: Dipika Sikka <dipikasikka1@gmail.com>

commit b0cc7a05f7f6916d5757f452f7147e066f318451
Author: Dipika Sikka <dipikasikka1@gmail.com>
Date:   Wed Feb 25 10:54:34 2026 -0500

    [Docs] Clean-up + Example ReadMe updates (#2399)

    SUMMARY:
    - Remove marlin24 examples
    - Clean-up existing README docs
    - Add examples/README.md file explaining repo structure
    - Update MoE README.md

commit 778abe815c226669753308ea9ee76ee91186db26
Author: Dipika Sikka <dipikasikka1@gmail.com>
Date:   Tue Feb 24 13:43:53 2026 -0500

    [Docs] Remove finetune examples (#2398)

    SUMMARY:

    - Remove old finetune examples
    - Remove old maintainers file as redundant with CODEOWNERS

commit 9b7fb9f77159967f90b66b37be5ea7bc21532504
Author: Bartowski <3266127+bartowski1182@users.noreply.github.com>
Date:   Tue Feb 24 12:21:09 2026 -0500

    Add AFMOE mappings for awq and smoothquant (#2316)

    SUMMARY:
    These mappings are needed to properly apply AWQ and smoothquant to the
    Trinity series of models, AfmoeForCausalLM

    TEST PLAN:
    Quality was tested with benchmarks, without these changes the benchmark
    results were extremely low, with these changes it was close to margin of
    error compared to bf16/FP8 dynamic

    Can test on Trinity-Large-Preview

    https://huggingface.co/arcee-ai/Trinity-Large-Preview

    Test code for quantization:

    https://gist.github.com/bartowski1182/b7e05f6c96735ec5d03f234d37e11e4d

    ---------

    Signed-off-by: Colin Kealty <3266127+bartowski1182@users.noreply.github.com>
    Signed-off-by: Bartowski <3266127+bartowski1182@users.noreply.github.com>
    Co-authored-by: Dipika Sikka <dipikasikka1@gmail.com>
    Co-authored-by: Brian Dellabetta <brian-dellabetta@users.noreply.github.com>
    Co-authored-by: HDCharles <39544797+HDCharles@users.noreply.github.com>

commit be12cc6d70f3fb3fdd6b0bbe0a8ba35f19b549d9
Author: Dipika Sikka <dipikasikka1@gmail.com>
Date:   Tue Feb 24 11:25:46 2026 -0500

    [Docs] Reorganize + Additional Guides (#2379)

    SUMMARY:
    - Add choosing a model
    - Add choosing a dataset
    - Re-organize to set-up a step-by-step compression guide
    - Additional clean-up and organization

    Sample Doc Generation:
    https://vllm--2379.org.readthedocs.build/projects/llm-compressor/en/2379/

    ---------

    Signed-off-by: Dipika Sikka <ds3822@columbia.edu>
    Co-authored-by: Brian Dellabetta <brian-dellabetta@users.noreply.github.com>

commit 986ac236f3bbdc95c8e47072fb33474511aee962
Author: Kyle Sayers <kylesayrs@gmail.com>
Date:   Mon Feb 23 13:03:00 2026 -0500

    [Misc] Remove usages of `update_parameter_data` (#2393)
    * Begin deprecation of `update_parameter_data` in favor of
    `update_offload_parameter`

    ---------

    Signed-off-by: Kyle Sayers <kylesayrs@gmail.com>

commit 5c757a6985d32ee74b7a2c30349c852624cf4100
Author: Kyle Sayers <kylesayrs@gmail.com>
Date:   Mon Feb 23 11:59:36 2026 -0500

    [Offloading] Support Disk Offloading (#2373)
    * Support disk offloading for very large models
    * [[Offload] Convert accelerate for
    loading/saving](https://github.com/vllm-project/compressed-tensors/pull/572/)
    * Add `examples/disk_offloading/qwen3_example.py`
    * Add `examples/disk_offloading/kimi_k2_example.py`
    * Remove post-processing step where `remove_dispatch` is called
    * Previously, this was used to avoid conflicts between
    `dispatch_for_sequential` and `dispatch_for_generation`.
    * Now, the two functions are directly compatible: you don't need to
    remove the dispatch of one to use the other
    * Add `to_accelerate` to `save_pretrained_wrapper`
    * This ensures that the model is converted to `accelerate` offloading
    before saving
    * This ensures the best compatibility with `save_pretrained`, and
    reduces excess memory usage which would cause gpu/cpu ooms
    * During oneshot preprocessing, convert `from_accelerate` if possible.
    This guards against users who load their model outside of the
    `load_offloaded_model` context
    * Remove `offload_device` argument from `dispatch_for_sequential` to
    avoid deprecation warning
    * `dispatch_for_sequential` now always respects the device the model was
    loaded on
    * Ran `Qwen/Qwen3-0.6B` example to completion
    * [IN PROGRESS] Run `unsloth/Kimi-K2-Instruct-0905-BF16` example to
    completion

    ---------

    Signed-off-by: Kyle Sayers <kylesayrs@gmail.com>

commit 5f63d7a9a6ae0f9944e69ff87ea5cce31f923ae2
Author: HDCharles <39544797+HDCharles@users.noreply.github.com>
Date:   Thu Feb 19 14:17:53 2026 -0500

    [GPTQ][ddp] enabling DDP for GPTQ (#2333)

    After the changes in
    https://github.com/vllm-project/compressed-tensors/pull/572
    https://github.com/vllm-project/compressed-tensors/pull/534
    https://github.com/vllm-project/llm-compressor/pull/2340 we're ready to
    start rolling out DDP implementations of various modifiers.
    The API we've landed on attempts to maintain the normal flow with
    minimal changes necessary to enable DDP:

    1) the user will call `torchrun --nproc_per_node=<num_threads> script.py`
    to start the script
    2) the user will initialize the distributed context, (they can use the
    helper init_dist to do this)
    3) the user will load the model using the new context manager, setting
    the device map as outlined
    [here](https://github.com/vllm-project/compressed-tensors/pull/572).
    (For most users this will be "auto_offload")
    4) (optional) the user can partition the dataset at load time using
    get_rank_partition or just load as normal and oneshot will partition the
    data later (will load 1 copy of dataset into cpu memory for each rank
    which may be onerous)
    ```python
    from compressed_tensors.offload import load_offloaded_model, init_dist
    init_dist()
    with load_offloaded_model():
        model = AutoModelForCausalLM.from_pretrained(MODEL_ID, device_map="auto_offload")
    ...
    ds = load_dataset(
        DATASET_ID, split=get_rank_partition(DATASET_SPLIT, NUM_CALIBRATION_SAMPLES)
    )
    ```
    Adding the DDP process to GPTQ was relatively straightforward, though
    optimizing it for speed was a bit trickier. There are 4 steps:

    1) assigning each module to a rank which it will be compressed by
    2) for each module assigned to a rank, having all hessian information
    sent by other ranks to the assigned rank
    3) each rank compresses the modules that it was assigned
    4) broadcast the final quantized values to all ranks
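    A hedged sketch of the load balancing in step 1 (greedy assignment of each module to the currently least-loaded rank by estimated time; the real cost model and API differ):

```python
import heapq

def assign_modules(costs, world_size):
    """costs maps module name -> estimated compression time (s)."""
    heap = [(0.0, rank) for rank in range(world_size)]  # (load, rank)
    heapq.heapify(heap)
    assignment = {}
    # placing the largest jobs first gives a tighter greedy bound
    for name, cost in sorted(costs.items(), key=lambda kv: -kv[1]):
        load, rank = heapq.heappop(heap)
        assignment[name] = rank
        heapq.heappush(heap, (load + cost, rank))
    return assignment

costs = {"q_proj": 4.0, "k_proj": 1.0, "v_proj": 1.0, "o_proj": 2.0}
print(assign_modules(costs, world_size=2))
# both ranks end up evenly loaded (4.0s of estimated work each)
```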

    Step 1 required the largest optimization, without any load balancing, we
    ran into situations where 1 rank could be doing twice as much work as
    another. Thus we implemented basic load balancing and time estimation
    that seems to be working well in practice. The other major optimization
    was using asynchronous ops for thread to thread communication. Before
    these optimizations, 2 thread GPTQ was as fast as 1 thread GPTQ for
    llama3-8B, afterward it results in a 27% speedup despite being a
    relatively small model.

    | model_id | world_size | max_time | max_memory | save_time | flex_extract | eval_time |
    |----------|------------|----------|------------|-----------|--------------|-----------|
    | Meta-Llama-3-8B-Instruct | 1 | 745.03 | 5.82 | 19.57 | 0.7066 | 95.28 |
    | Meta-Llama-3-8B-Instruct | 2 | 372.20 | 5.57 | 49.10 | 0.7089 | 95.24 |
    | Meta-Llama-3-8B-Instruct | 4 | 264.07 | 5.82 | 52.50 | 0.7180 | 96.74 |
    | Qwen3-30B-A3B | 1 | 14207.53 | 6.56 | 748.23 | 0.8704 | 209.93 |
    | Qwen3-30B-A3B | 2 | 7018.25 | 6.36 | 696.65 | 0.8810 | 205.89 |
    | Qwen3-30B-A3B | 4 | 3694.46 | 6.36 | 723.05 | 0.8832 | 217.62 |

    While validating numerical accuracy of the DDP technique, we noticed
    that accuracy improved significantly for each thread added. After some
    debugging we realized this was because the existing [hessian
    calculation](https://github.com/vllm-project/llm-compressor/pull/2333/changes#diff-18d1319f01629ca65cc54f955dc6177f6dd025f057013932b2ed29842854f3ecL61-L65)
    was causing an accumulation of floating point errors. By rewriting the
    hessian calculation to sum the intermediate hessians and only divide by
    num_samples at the end, we improved the GSM8K evaluation from (.67, .66)
    to (.71, .71). You can repro these results
    [here](https://github.com/vllm-project/llm-compressor/pull/2333/changes#diff-d31ce0453051853c17ba2a5225b3d1bfab548e095bab0967d6acfd1b3ce1b35d)
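    The numerical point can be sketched with scalars standing in for the per-batch `X @ X.T` updates. Both forms agree algebraically; the difference is that the running average rescales the accumulator on every batch, which is where the floating point error accumulated:

```python
def running_average(batches):
    # old style: rescale the accumulated hessian at every batch
    h, n = 0.0, 0
    for x, b in batches:               # b = samples in this batch
        h = h * (n / (n + b)) + x * (b / (n + b))
        n += b
    return h

def sum_then_divide(batches):
    # new style: sum intermediate terms, divide by num_samples once
    total = sum(x * b for x, b in batches)
    n = sum(b for _, b in batches)
    return total / n

batches = [(2.0, 4), (6.0, 4), (4.0, 8)]
print(running_average(batches), sum_then_divide(batches))  # both 4.0
```

    In float64 on this toy input both paths land on the same value; the rewrite matters in lower precision over many batches, where the repeated rescaling compounds rounding error.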

    ---------

    Signed-off-by: HDCharles <charlesdavidhernandez@gmail.com>

commit 881dd462975a92551685b5507dfa1272f8c40bb8
Author: Kyle Sayers <kylesayrs@gmail.com>
Date:   Thu Feb 19 12:55:13 2026 -0500

    [Bugfix] Reduce device movement while checking layer divisibility (#2385)
    * Improve runtime and memory usage by checking the shape of the
    offloaded weight, not the onloaded weight
    * Wrap all calls to `_layer_indivisible` with the `disable_onloading`
    context

    Signed-off-by: Kyle Sayers <kylesayrs@gmail.com>

commit 70b610acb234e095f664d361412f6a4e9ef2ff09
Author: Brian Dellabetta <brian-dellabetta@users.noreply.github.com>
Date:   Thu Feb 19 11:41:32 2026 -0500

    [Observers] Allow for case when weight shape and block size are not evenly divisble (#2283)

    SUMMARY:
    Update observer logic for block strategy when weight shape is not
    divisible by block size

    Prerequisite:
    - [x] https://github.com/vllm-project/compressed-tensors/pull/547
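    The shape handling can be illustrated with a small helper (a sketch, not the observer's actual code): for BLOCK quantization, the scale grid is the ceiling division of each weight dimension by its block size, so a ragged final block still gets its own scale.

```python
import math

def scale_grid(weight_shape, block_shape):
    """Number of block scales along each weight dimension."""
    return tuple(
        math.ceil(dim / block) for dim, block in zip(weight_shape, block_shape)
    )

# 2048 columns divide evenly by 128, but 1400 rows leave a ragged final
# block of 120 rows: 11 row-blocks rather than 10.
print(scale_grid((1400, 2048), (128, 128)))  # -> (11, 16)
```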

    TEST PLAN:
    - [x] Quantized checkpoint made with this branch (and above CT branch)
    runs on vllm main for flashinfer, deepgemm and default kernels --
    https://huggingface.co/bdellabe/DeepSeek-V2-Lite-FP8-BLOCK

    Run script below with
    - `VLLM_BLOCKSCALE_FP8_GEMM_FLASHINFER=1 VLLM_USE_DEEP_GEMM=1` for
    flashinfer
    - `VLLM_BLOCKSCALE_FP8_GEMM_FLASHINFER=0 VLLM_USE_DEEP_GEMM=1` for
    deepgemm
    - `VLLM_BLOCKSCALE_FP8_GEMM_FLASHINFER=0 VLLM_USE_DEEP_GEMM=0` for
    default

    ```python
    if __name__ == "__main__":
        from vllm import LLM, SamplingParams

        prompts = ["The Swiss Alps are", "Brad Marchand is", "The Toronto Maple Leafs are"]
        sampling_params = SamplingParams(
            temperature=0.80, top_p=0.95, max_tokens=40, min_tokens=10
        )
        llm = LLM(
            "bdellabe/DeepSeek-V2-Lite-FP8-BLOCK",
            max_model_len=4096,
            enforce_eager=True,
        )
        output = llm.generate(prompts, sampling_params)
        for out in output:
            print(out.outputs[0].text)

        print("COMPLETE")
    ```

    ---------

    Signed-off-by: Brian Dellabetta <bdellabe@redhat.com>

commit d0ce1d827ce3981eaf173f47452f66793d8d1d78
Author: Itay Etelis <92247226+Etelis@users.noreply.github.com>
Date:   Thu Feb 19 11:35:03 2026 +0200

    move smoothquant to transforms (#2314)

    Moves `SmoothQuantModifier` from `modifiers/smoothquant/` to
    `modifiers/transform/smoothquant/` to correctly categorize it as a
    transform rather than a modifier.

    Closes #2306

    - Moved SmoothQuant source files to `modifiers/transform/smoothquant/`
    - Moved corresponding test files
    - Updated all imports across examples, docs, and dependent code
    - Exported `SmoothQuantModifier` from `modifiers.transform`

    ```python
    from llmcompressor.modifiers.transform.smoothquant import SmoothQuantModifier
    ```

    ---------

    Signed-off-by: Itay Etelis <itayetelis@gmail.com>
    Signed-off-by: Itay Etelis <itay.etelis@ibm.com>
    Co-authored-by: Itay Etelis <itay.etelis@ibm.com>

commit 936e0a701e55e8d9f9b9145b64673510bfe2a79c
Author: Dipika Sikka <dipikasikka1@gmail.com>
Date:   Thu Feb 19 01:15:24 2026 -0500

    [Tests][e2e] Release memory before running vLLM (#2375)

    Co-authored-by: Kyle Sayers <kylesayrs@gmail.com>

commit 2e469aa913c41d0c832d3a0a5785751b48e065ed
Author: Kyle Sayers <kylesayrs@gmail.com>
Date:   Thu Feb 19 00:59:14 2026 -0500

    [Bugfix] Fix circular references when activation offload device is cuda (#2387)
    tensors. This is a good approach, but has an edge case where, if the
    value of entry is identical to the key of the entry, then the key will
    never be garbage collected.

    This can occur if the user specifies `sequential_offload_device="cuda"`,
    or if the AWQ offload device is "cuda" (default true in most cases).
    * Fix memory leak in AWQ which led to very high CUDA memory usage
    * Guard against entries into the `WeakKeyDictionary` where the key and
    value are identical
    * Misc
      * Move `OverrideEqMode` to the bottom of the `pipelines/cache.py`
      * Remove `_fp16_baseline_cache`, which was not being used
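    The edge case is easy to reproduce with a plain `WeakKeyDictionary` (the class name below is illustrative): the dictionary holds its values strongly, so an entry whose value is the key itself keeps the key alive forever.

```python
import gc
import weakref

class Activation:  # stand-in for a cached tensor
    pass

d = weakref.WeakKeyDictionary()
leaked = Activation()
d[leaked] = leaked          # value is the key: strong self-reference
del leaked
gc.collect()
print(len(d))               # entry survives: 1

guarded = weakref.WeakKeyDictionary()
k = Activation()
v = k
if v is not k:              # guard: never store identical key/value
    guarded[k] = v
del k, v
gc.collect()
print(len(guarded))         # nothing retained: 0
```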
    | Before Changes | After Changes |
    | - | - |
    | <img width="640" height="480" alt="awq_before"
    src="https://github.com/user-attachments/assets/07714321-4b2f-49b7-aa2b-5c745a60d2f4"
    /> | <img width="640" height="480" alt="awq_after"
    src="https://github.com/user-attachments/assets/336b0e98-c24c-4e0c-a873-3166effc32b7"
    /> |

    ---------

    Signed-off-by: Kyle Sayers <kylesayrs@gmail.com>
    Co-authored-by: HDCharles <39544797+HDCharles@users.noreply.github.com>

commit 9979e9829ba034ee323b24a841e2288572c594df
Author: Kyle Sayers <kylesayrs@gmail.com>
Date:   Wed Feb 18 18:20:15 2026 -0500

    [`model_free_ptq`] Earlier Shape Validation (#2372)
    * Add earlier shape validation, at the cost of loading tensors twice
    * Add a validation step which loads tensors and validates their shapes
    * Misc
      * Add `iter_quantizable_tensors` to reduce code reuse
    * Added `tests/llmcompressor/pipelines/test_model_free_validation.py`

    ------
    [Codex
    Task](https://chatgpt.com/codex/tasks/task_e_69936b53d28c8327aa0b784040c34734):
    I had to do significant cleanup to make this multithreaded/fix
    duplicated code.

    ---------

    Signed-off-by: Kyle Sayers <kylesayrs@gmail.com>
    Co-authored-by: HDCharles <39544797+HDCharles@users.noreply.github.com>

commit 7ac94483be892f1e522079065c5b6ad4ff683c79
Author: D!NE$H <67671800+gDINESH13@users.noreply.github.com>
Date:   Thu Feb 19 03:15:07 2026 +0530

    input_id not required for Step3-VL-10B (#2370)

    SUMMARY:

    Closes #2272

    Adds TypeError exception handling to `get_embeddings()` to support
    models with non-standard `get_input_embeddings()` implementations,
    specifically Step3-VL-10B.

    As @kylesayrs mentioned, Step3-VL-10B has a non-standard implementation
    of get_input_embeddings() that requires an input_ids parameter. This fix
    gracefully handles the TypeError that occurs when calling this method
    without the required parameter, allowing quantization to proceed.

    TEST PLAN:

    I do not have system specs to run this model locally. But testing would
    be running
    `examples/quantization_w8a8_int8/llama3_example.py` just changing
    model_id to "stepfun-ai/Step3-VL-10B"

    It should gracefully handle the TypeError raised when `input_ids` is required.

    Signed-off-by: gDINESH13 <dinesh13g@gmail.com>
    Co-authored-by: HDCharles <39544797+HDCharles@users.noreply.github.com>

commit f4a1490d09b65bb125ecbd855f0866f4ef80cd1a
Author: Avishek Goswami <86944690+GOavi101@users.noreply.github.com>
Date:   Thu Feb 19 00:49:10 2026 +0530

    DataLoader options, single-pass weight calibration, optional sequential prefetch (#2349)
    Performance-oriented changes for calibration and weight quantization:
    DataLoader tuning when workers are used, a single-pass weight
    calibration in `QuantizationModifier`, and an optional sequential
    prefetch to overlap onload with forward. Defaults stay safe for low
    RAM/GPU.
    - **`pin_memory`**: Set to `True` only when CUDA is available **and**
    `dataloader_num_workers > 0` (avoids extra pinned memory when
    `num_workers=0`).
    - When `num_workers > 0`: set `persistent_workers=True` and
    `prefetch_factor=2` for faster calibration.
    - Changed files: `args/dataset_arguments.py`, `entrypoints/oneshot.py`
    - New argument **`sequential_prefetch`** (default **`False`**).
    - When `False`: same as before — one batch on GPU at a time (low peak
    memory).
    - When `True`: prefetch next batch in a background thread to overlap
    onload with forward (faster when GPU memory allows two batches).
    - `dataloader_num_workers` default remains **0** (low-memory safe); help
    text updated.
    - `sequential_prefetch` added to `DatasetArguments` and `oneshot()` with
    default `False`.
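    The selection logic above can be sketched as follows (`cuda_available` stands in for `torch.cuda.is_available()`; this is an illustration of the rules, not the exact code):

```python
def dataloader_kwargs(num_workers, cuda_available):
    kwargs = {"num_workers": num_workers}
    # pinned memory only pays off with worker processes and a CUDA device
    kwargs["pin_memory"] = cuda_available and num_workers > 0
    if num_workers > 0:
        kwargs["persistent_workers"] = True   # keep workers across epochs
        kwargs["prefetch_factor"] = 2         # batches pre-loaded per worker
    return kwargs

print(dataloader_kwargs(0, cuda_available=True))   # low-memory default
print(dataloader_kwargs(4, cuda_available=True))   # tuned for throughput
```

    Passing the resulting dict to `torch.utils.data.DataLoader(dataset, **kwargs)` keeps the `num_workers=0` default exactly as safe as before.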

    ---------

    Signed-off-by: Avishek Goswami <avishek.goswami@ibm.com>
    Signed-off-by: HDCharles <charlesdavidhernandez@gmail.com>
    Co-authored-by: Avishek Goswami <avishek.goswami@ibm.com>
    Co-authored-by: HDCharles <39544797+HDCharles@users.noreply.github.com>
    Co-authored-by: HDCharles <charlesdavidhernandez@gmail.com>

commit 36c30ee5848427046d006c7fc9cb46113c7ac5ba
Author: Kyle Sayers <kylesayrs@gmail.com>
Date:   Tue Feb 17 17:12:59 2026 -0500

    [Examples] Deprecate `dispatch_for_generation` in favor of `dispatch_model` (#2376)
    * Start using `dispatch_model` as a primitive instead of
    `dispatch_for_generation`, which doesn't add anything but indirection
    * Find and replace `dispatch_for_generation` -> `dispatch_model`
    * Add deprecation warning to `dispatch_for_generation`

    ---------

    Signed-off-by: Kyle Sayers <kylesayrs@gmail.com>

commit ef70f436188e919ae572b8ece384942d00f09d4d
Author: Avishek Goswami <86944690+GOavi101@users.noreply.github.com>
Date:   Tue Feb 17 23:12:12 2026 +0530

    feat: early group-size divisibility check with layer FQNs (#2353)
    Add an early check so users hit a clear error at `initialize()` (before
    long calibration e.g. GPTQ) when using group/tensor-group quantization
    on layers whose weight columns are not divisible by `group_size`,
    instead of failing at save with an opaque message.
    - **Policy:** Only GROUP and TENSOR_GROUP require strict divisibility
    (those kernels don’t support non-divisible shapes). BLOCK is
    intentionally not checked (block kernels support non-divisible). This is
    centralized in `group_size_validation.py`.
    - **Early error:** We fail during `initialize_quantization()` and raise
    with:
    - The exact layer FQNs and `(columns, group_size)` for each problematic
    layer
      - Instructions to add those names to the modifier’s `ignore` list
    - **Tests:** Added tests for the validation helper and for the modifier
    (raises with expected message, succeeds when layers are ignored or all
    divisible).
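    A minimal sketch of the check (layer names and the helper signature are illustrative, not the actual `group_size_validation.py` API):

```python
def validate_group_size(layer_columns, group_size):
    """layer_columns maps layer FQN -> number of weight columns."""
    bad = [
        (fqn, cols) for fqn, cols in layer_columns.items()
        if cols % group_size != 0
    ]
    if bad:
        detail = ", ".join(f"{fqn} ({cols} cols)" for fqn, cols in bad)
        raise ValueError(
            f"group_size={group_size} does not divide: {detail}. "
            "Add these layers to the modifier's `ignore` list."
        )

validate_group_size({"model.layers.0.mlp.up_proj": 4096}, group_size=128)  # ok
try:
    validate_group_size({"model.layers.0.mlp.gate": 100}, group_size=128)
except ValueError as e:
    print(e)   # names the offending FQN before calibration starts
```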

     Fixes #1983

    ---------

    Signed-off-by: Avishek Goswami <avishek.goswami@ibm.com>
    Co-authored-by: Avishek Goswami <avishek.goswami@ibm.com>
    Co-authored-by: HDCharles <39544797+HDCharles@users.noreply.github.com>
    Co-authored-by: Dipika Sikka <dipikasikka1@gmail.com>

commit ccc26f1c7f01ba8256efffe28549ad6775044fb7
Author: Cassie Jeon <cajeon@redhat.com>
Date:   Tue Feb 17 11:11:09 2026 -0500

    First draft for INFERENG-2666 (#2251)

    SUMMARY:
    This is a first draft for INFERENG-2666. This draft covers Llama4,
    Qwen3, Kimi K2, and Mistral models for FP8 quantization.

    TEST PLAN:
    N/A. Documentation and code examples will need to be verified and
    reviewed by developers.

    Additional questions for reviewers:
    1. Should all the examples be in one page? Or should I separate the
    examples into separate pages for each model? This is for FP8, but I know
    FP4 will also need documentation so wanted to get your thoughts if FP4
    examples should also be one document or separated by model.

    2. Are there any specific wording or content that should be called out
    before the examples for each model?

    3. I modeled the draft from [this Example
    page](https://docs.vllm.ai/projects/llm-compressor/en/latest/examples/quantization_w8a8_fp8/)
    that Dipika had initially pointed out. Let me know if you think I should
    organize the information differently.

    Signed-off-by: Cassie Jeon <cajeon@redhat.com>

commit cc3eed27da218662c629451ecdc7bac558873d30
Author: Kyle Sayers <kylesayrs@gmail.com>
Date:   Tue Feb 17 10:15:19 2026 -0500

    [Bugfix] Guard against MLA (#2337)
    * Support INT4 quantization of models with MLA attention
    * As of https://github.com/vllm-project/compressed-tensors/pull/533, MLA
    attention is considered an attention module
    * However, checking for submodule.q_proj fails for MLA, since MLA does
    not have a q_proj
    * Guard against layers without q_proj
    * Able to quantize MLA model

    Signed-off-by: Kyle Sayers <kylesayrs@gmail.com>
    Co-authored-by: HDCharles <39544797+HDCharles@users.noreply.github.com>

commit 0d556a7da6c047b583a24b5e702ba2bfa647e05a
Author: Kyle Sayers <kylesayrs@gmail.com>
Date:   Tue Feb 17 09:37:24 2026 -0500

    [Sequential Pipeline] only cache unique offloaded values (#2366)

    Updated by @brian-dellabetta

    SUMMARY:
    The SequentialPipeline offloads subgraph outputs as part of normal
    usage. Occasionally these outputs share duplicates in kwargs that point
    to the same memory location on the onloaded device. When offloading is
    enabled, there was previously no check to see if any tensors to be
    offloaded had already previously been offloaded, which can cause a huge
    increase in memory requirements in some models, as reported in #2363.
    This PR
    - [x] adds an offload map to IntermediatesCache to ensure tensors are
    not redundantly offloaded
    - [x] wraps the map in an override to ensure `torch.equal` is used
    rather than `torch.eq` (which is the one used with `==` checks).
    `torch.eq` can return multiple boolean values depending on the tensors
    being compared, resulting in an error. This override, which should only
    be used when the tensors are immutable (the case here), allows us to
    retain the original hashing function and have an `O(1)` lookup. Our
    other attempts to circumvent the issue added to runtime or required
    `O(N)` lookup.
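    The idea behind the override can be sketched without torch: a tensor-like object whose `==` is elementwise (like `torch.eq`) breaks dict lookups, so the offload map's keys wrap it with whole-tensor equality (the role `torch.equal` plays in the real override) while retaining the original O(1) hash. Class names here are illustrative.

```python
class FakeTensor:
    def __init__(self, data):
        self.data = list(data)
    def __eq__(self, other):                    # elementwise, like torch.eq
        return [a == b for a, b in zip(self.data, other.data)]
    __hash__ = object.__hash__

class EqKey:
    """Wrap a tensor so dict lookup uses whole-tensor equality."""
    def __init__(self, t):
        self.t = t
    def __hash__(self):
        return hash(self.t)                     # keep the original hash
    def __eq__(self, other):
        return self.t.data == other.t.data      # whole-tensor comparison

offload_map = {}
x = FakeTensor([1, 2, 3])
for ref in (x, x):          # duplicate kwargs pointing at the same memory
    key = EqKey(ref)
    if key not in offload_map:
        offload_map[key] = "offloaded-once"
print(len(offload_map))     # 1: the tensor is not redundantly offloaded
```

    This is safe only because the cached tensors are immutable, matching the caveat above.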

    Resolves #2363

    TEST PLAN:
    - [x] Unit test added for `OverrideEqMode`
    - [x] Script from #2363 runs with ~81GB CPU RAM after first layer
    propagation, increased to ~88GB CPU RAM used by layer 11/49, and then
    stays consistently <89GB CPU RAM used by layer 25/49. On current main,
    this script would hit ~750GB CPU RAM usage during first layer
    propagation

    ---------

    Signed-off-by: Kyle Sayers <kylesayrs@gmail.com>
    Signed-off-by: Brian Dellabetta <bdellabe@redhat.com>
    Signed-off-by: Brian Dellabetta <brian-dellabetta@users.noreply.github.com>
    Co-authored-by: Brian Dellabetta <brian-dellabetta@users.noreply.github.com>
    Co-authored-by: Brian Dellabetta <bdellabe@redhat.com>

commit 556b50306657186c7ca21b99d578491edc0f0a43
Author: Kyle Sayers <kylesayrs@gmail.com>
Date:   Mon Feb 16 11:36:12 2026 -0500

    [Misc] Reword warning message to make log grepping easier (#2312)
    * Make it easier to find failures in logs by removing the word "failed"
    from this very common warning

    Signed-off-by: Kyle Sayers <kylesayrs@a100-08.nemg-001.lab.rdu2.dc.redhat.com>
    Co-authored-by: Kyle Sayers <kylesayrs@a100-08.nemg-001.lab.rdu2.dc.redhat.com>
    Co-authored-by: HDCharles <39544797+HDCharles@users.noreply.github.com>

commit b6c331e2fa8faabf851a48b5458ccd9632e6206b
Author: HDCharles <39544797+HDCharles@users.noreply.github.com>
Date:   Mon Feb 16 10:18:30 2026 -0500

    [ddp] fixing data slice bug (#2361)

    Summary:

    That's not how you slice a dataset; previously not tested with
    world_size==1

    Test Plan:

    [script](https://gist.github.com/HDCharles/282950166fd0c95a7a2594fe922bcb53)

    (world_size==1)

    ---------

    Signed-off-by: HDCharles <charlesdavidhernandez@gmail.com>
    Signed-off-by: HDCharles <39544797+HDCharles@users.noreply.github.com>
    Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>

commit 6ddd0361e41f65d86a08889efd58c7dec00282e3
Author: ZewenShen-Cohere <zewen.shen@cohere.com>
Date:   Fri Feb 13 14:17:41 2026 -0500

    [AWQ] Add activation_hook_target field for custom activation cache hooking (#2346)

    - Adds an optional `activation_hook_target` field to `AWQMapping` that
    lets users specify which submodule (relative to the parent/LCA) to hook
    for activation caching, replacing the hardcoded `hasattr(parent, 'mlp')`
    workaround for MoE models with parallel transformer blocks.
    - When `activation_hook_target` is `None` (default), behavior is
    unchanged: the hook is placed on `balance_layers[0]`. When set (e.g.
    `"mlp"`), it resolves to the corresponding submodule on the parent via
    `getattr_chain`.

    In parallel transformer architectures, attention and MLP run in parallel
    from the same input. The existing code always hooks `balance_layers[0]`
    for activation caching, which captures the wrong activations when
    balance layers span both attention and MLP branches. There was a
    commented-out `hasattr(parent, 'mlp')` workaround, but it was brittle
    and not generalizable. This change makes the hook target explicitly
    configurable per mapping.
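    A sketch of the resolution logic (helper and attribute names are illustrative; `getattr_chain` below is a minimal stand-in for the real utility):

```python
from types import SimpleNamespace

def getattr_chain(obj, chain):
    """Resolve a dotted attribute path like "mlp.gate" on obj."""
    for attr in chain.split("."):
        obj = getattr(obj, attr)
    return obj

def resolve_hook_module(parent, balance_layers, activation_hook_target=None):
    if activation_hook_target is None:
        return balance_layers[0]                 # unchanged default
    return getattr_chain(parent, activation_hook_target)

# parallel transformer block: attention and MLP share the same input
mlp = SimpleNamespace(name="mlp")
attn = SimpleNamespace(name="self_attn")
parent = SimpleNamespace(mlp=mlp, self_attn=attn)

assert resolve_hook_module(parent, [attn]) is attn          # old behavior
assert resolve_hook_module(parent, [attn], "mlp") is mlp    # explicit target
```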

    I've tested this change with our internal models, and it aligns with
    previous results.

    ---------

    Co-authored-by: Dipika Sikka <dipikasikka1@gmail.com>
    Co-authored-by: HDCharles <39544797+HDCharles@users.noreply.github.com>

commit b0463d101350e04b40268596c71531622b26ad20
Author: HDCharles <39544797+HDCharles@users.noreply.github.com>
Date:   Fri Feb 13 11:16:34 2026 -0500

    [bug][awq] fix inf handling (#2332)

    Must have been a bad merge or rebase at some point, scalesview was being
    set before the inf/nan check

    TEST PLAN:
    CI

    ---------

    Signed-off-by: HDCharles <39544797+HDCharles@users.noreply.github.com>

commit 6d600d4b91e8fb8991cc2c5c6e4f8cd911c36815
Author: Dipika Sikka <dipikasikka1@gmail.com>
Date:   Fri Feb 13 10:05:07 2026 -0500

    Fix CI/CD failures (#2359)

    SUMMARY:

    autoround:
    - The test was previously using int weights with float activations,
    which fails silently with torch 2.9 but raises an error with 2.10
    - Fix the args to appropriately use a valid scheme where weights are
    also float

    quant_reload:
    - Remove old unused argument
    - Set tie_word_embeddings to false to account for what the test is
    targeting - I believe we’re seeing this now from recent
    compressed-tensors changes cc @kylesayrs

commit 302c2c7a190f1b6c6151afb0fbc5bf63b75f240e
Author: ZewenShen-Cohere <zewen.shen@cohere.com>
Date:   Fri Feb 13 08:06:15 2026 -0500

    AWQ: orig_layer_weights should save all balance layer weights (#2344)

    Currently, orig_layer_weights only clones weights for layers that have a
    quantization scheme and are listed in `mapping.balance_layers`. This
    becomes a problem when we disable quantization for a layer that is still
    in `mapping.balance_layers`: all balance layers still need to be
    smoothed at the end, but orig_layer_weights does not store the original
    weights for all of them. As a result, the smoothing step fails (see
    where the error is triggered:
    https://github.com/ZewenShen-Cohere/llm-compressor-fork/blob/e9e3d3191f7598198f070c5f8269f08ec89e0b2f/src/llmcompressor/modifiers/awq/base.py#L554
    ).
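A minimal sketch of the fix (hypothetical names; the real code clones tensors per layer rather than copying lists):

```python
def snapshot_balance_weights(balance_layers, module_weights):
    # Snapshot originals for ALL balance layers, not only those with a
    # quantization scheme, so the final smoothing step always has the
    # pristine weights it needs.
    return {name: list(module_weights[name]) for name in balance_layers}

weights = {"up_proj": [1.0, 2.0], "gate_proj": [3.0]}
snapshot = snapshot_balance_weights(["up_proj", "gate_proj"], weights)
weights["up_proj"][0] = 99.0  # later steps mutate the live weights
```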

    Co-authored-by: Brian Dellabetta <brian-dellabetta@users.noreply.github.com>
    Co-authored-by: HDCharles <39544797+HDCharles@users.noreply.github.com>

commit d2b67d15139f7a55699f5378cb477c945eb9ed5e
Author: Dipika Sikka <dipikasikka1@gmail.com>
Date:   Thu Feb 12 14:33:00 2026 -0500

    Update CI/CD Logs (#2358)

    SUMMARY:
    - Provide summary for why a test was skipped

commit 05a13f35711e12bed4771aea7755f27d248fdaeb
Author: Dipika Sikka <dipikasikka1@gmail.com>
Date:   Thu Feb 12 10:11:37 2026 -0500

    Support torch 2.10 (#2356)

    SUMMARY:
    - Requires: https://github.com/vllm-project/compressed-tensors/pull/583
    - Transformers tests currently already failing on main

commit c37fcfa081daa024f865a3f4798db029a0a67d43
Author: Fynn Schmitt-Ulms <fynnsu@outlook.com>
Date:   Wed Feb 11 19:52:19 2026 -0500

    Add synchronize trigger to ready label check (#2354)

    SUMMARY:
    Triggers the ready label check each time new commits are pushed to a pr.

    Looking at https://github.com/vllm-project/llm-compressor/pull/2350 it
    seems like there is still an issue with our ready check system.
    1. The first commit was added
    2. The ready label was added and a second commit
    (7d7ebd2247142dfb75cbe631aa37859092654f71) was pushed, which caused
    the ready check to run and pass
    3. Further commits were added but the ready check was never retriggered
    4. "ready-label-check Expected — Waiting for status to be reported" is
    blocking merge, despite the most recent run of the ready check passing.

    It seems like required checks may need to run and pass on the most
    recent commit for github to allow the merge. This pr causes subsequent
    commits to re-trigger the ready check workflow.
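In workflow terms, the fix amounts to adding `synchronize` to the trigger's event types, roughly (a sketch; the actual workflow may list other types as well):

```yaml
on:
  pull_request_target:
    types: [labeled, unlabeled, synchronize]
```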

    TEST PLAN:
    Merge and see if this fixes the problem. It can't make it worse since
    this just causes the check to run more often.

    Signed-off-by: Fynn Schmitt-Ulms <fschmitt@redhat.com>
    Co-authored-by: Dipika Sikka <dipikasikka1@gmail.com>

commit fcd7fdbda73b88168095e728dfdc6d3ce7cf004f
Author: Dipika Sikka <dipikasikka1@gmail.com>
Date:   Wed Feb 11 16:03:54 2026 -0500

    Swap to use CPU runners (#2350)

    SUMMARY:
    - Swap ubuntu runners to use our cpu runner
    - Remove 2 year old docker build workflow that we never use

    ---------

    Signed-off-by: Dipika Sikka <dipikasikka1@gmail.com>

commit d316e27e6aaafbb17480519ad15fc0d2b723353f
Author: Fynn Schmitt-Ulms <fynnsu@outlook.com>
Date:   Wed Feb 11 13:46:46 2026 -0500

    Add concurrency check to all pr workflows (#2348)

    SUMMARY:
    We typically only care about the test results for the final commit in a
    pr. This pr will reduce the load on github actions runners by cancelling
    all jobs except for the one on the latest commit.

    For example, if the following commits are all pushed in quick
    succession:
    Commit A1 uploaded, job A1 starts
    Commit B1 (separate pr) uploaded, job B1 queued
    Commit A2 uploaded, job A1 cancelled, job A2 queued, job B1 started
    Job B1 finishes, job A2 starts

    Note: this is the same concurrency logic we already have on
    `test-check-transformers.yaml`
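The concurrency logic referred to above follows the standard GitHub Actions pattern (shown as a generic sketch; the exact group key may differ per workflow):

```yaml
concurrency:
  group: ${{ github.workflow }}-${{ github.ref }}
  cancel-in-progress: true
```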

    Signed-off-by: Fynn Schmitt-Ulms <fschmitt@redhat.com>

commit 22fc354d25248f1ef9d990a9a20c6aeca8a94d6d
Author: Dipika Sikka <dipikasikka1@gmail.com>
Date:   Wed Feb 11 13:06:15 2026 -0500

    Revert "add qwen3 vl autoround example (#2334)" (#2351)

    This reverts commit 7b366711cba3982bbac99abdf6bf2c3572395f1a.

commit 7b366711cba3982bbac99abdf6bf2c3572395f1a
Author: Xin He <xin3.he@intel.com>
Date:   Thu Feb 12 01:34:04 2026 +0800

    add qwen3 vl autoround example (#2334)

    SUMMARY:
    AutoRound quantization example: qwen3-vl nvfp4

    TEST PLAN:
    python qwen3_vl_example.py
    Output:
    ```
    Hello my name is Mihai, I am a 30 year old male, and I am currently a software engineer working in a company that develops software for the financial sector. I am a very passionate person, and I am always eager to learn new things. I have a strong interest in AI, machine learning, and data science. I am also very interested in the intersection of these fields with finance. I am currently working on a project that involves building a machine learning model to predict stock prices. I am
    ```

    ---------

    Signed-off-by: Xin He <xin3.he@intel.com>
    Co-authored-by: HDCharles <39544797+HDCharles@users.noreply.github.com>
    Co-authored-by: Brian Dellabetta <brian-dellabetta@users.noreply.github.com>

commit b49fbfda933e168f7b58f10ff45e019b3f24baee
Author: Brian Dellabetta <brian-dellabetta@users.noreply.github.com>
Date:   Tue Feb 10 14:11:18 2026 -0500

    [cicd] move check ready action to on pull_request_target (#2342)

    SUMMARY:
    Change the check ready label ci/cd action to run on
    `pull_request_target` so that it runs more robustly for community user
    PRs. From
    [docs](https://docs.github.com/en/actions/reference/workflows-and-actions/events-that-trigger-workflows#pull_request_target):

    > This event runs in the context of the default branch of the base
    repository, rather than in the context of the merge commit, as the
    pull_request event does. This prevents execution of unsafe code from the
    head of the pull request that could alter your repository or steal any
    secrets you use in your workflow. This event allows your workflow to do
    things like label or comment on pull requests from forks. Avoid using
    this event if you need to build or run code from the pull request.
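The resulting trigger is roughly of this shape (a sketch with assumed event types; the actual workflow file may differ):

```yaml
on:
  pull_request_target:
    types: [labeled, unlabeled]
```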

  …