
[Self-Review] Enhance Autoround to support multiple cards tuning#12

Closed
yiliu30 wants to merge 35 commits into main from auto-device

Conversation

@yiliu30
Owner

@yiliu30 yiliu30 commented Dec 15, 2025

AutoRound uses a block-level reconstruction loss to fine-tune quantization parameters, which requires running a backward pass on each block. For a large model such as Qwen3-235B, a single GPU often doesn't have enough memory to hold an entire block during the backward computation. To address this, we use Accelerate to dispatch the module across multiple devices.
In this PR, we enable this feature on LLMC side:

  • Add device_map for tuning with multiple cards
  • Map ignore to AutoRound's layer-skipping mechanism
  • Add Qwen3-235B as an example for multiple cards
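For intuition, the device_map string accepts a comma-separated list of card indices such as "0,1". The helper below is a minimal sketch of how such a string could be resolved into torch-style device identifiers; the function name and "auto" pass-through behavior are assumptions for illustration, not the actual LLMC or AutoRound implementation.

```python
def parse_device_map(device_map: str) -> list[str]:
    """Hypothetical helper: turn a device_map string such as "0,1"
    into a list of torch-style device identifiers.

    "auto" is passed through unchanged so a framework like Accelerate
    can decide placement on its own (an assumption for this sketch).
    """
    device_map = device_map.strip()
    if device_map == "auto":
        return ["auto"]
    return [f"cuda:{idx.strip()}" for idx in device_map.split(",") if idx.strip()]


print(parse_device_map("0,1"))  # ['cuda:0', 'cuda:1']
```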

Test plan

pytest -svv ./llmcompressor/transformers/autoround/test_autoround_oneshot.py -k test_oneshot_with_device_map

Example results

# vllm (pretrained=INC4AI/Qwen3-235B-A22B-W4A16-G128-AutoRound-ITERS1-LLMC-TEST-ONLY,tensor_parallel_size=2,max_model_len=8192,max_num_batched_tokens=32768,max_num_seqs=128,add_bos_token=True,gpu_memory_utilization=0.8,dtype=bfloat16,max_gen_toks=2048,enable_prefix_caching=False), gen_kwargs: (None), limit: 1000.0, num_fewshot: None, batch_size: 128
# |Tasks|Version|     Filter     |n-shot|  Metric   |   |Value|   |Stderr|
# |-----|------:|----------------|-----:|-----------|---|----:|---|-----:|
# |gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.928|±  |0.0082|
# |     |       |strict-match    |     5|exact_match|↑  |0.930|±  |0.0081|
  

# vllm (pretrained=INC4AI/Qwen3-235B-A22B-W4A16-G128-AutoRound-ITERS200-LLMC-TEST-ONLY,tensor_parallel_size=2,max_model_len=8192,max_num_batched_tokens=32768,max_num_seqs=128,add_bos_token=True,gpu_memory_utilization=0.8,dtype=bfloat16,max_gen_toks=2048,enable_prefix_caching=False), gen_kwargs: (None), limit: 1000.0, num_fewshot: None, batch_size: 128
# |Tasks|Version|     Filter     |n-shot|  Metric   |   |Value|   |Stderr|
# |-----|------:|----------------|-----:|-----------|---|----:|---|-----:|
# |gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.934|±  |0.0079|
# |     |       |strict-match    |     5|exact_match|↑  |0.915|±  |0.0088|

Signed-off-by: yiliu30 <yi4.liu@intel.com>
@github-actions

👋 Hi! Thank you for contributing to llm-compressor. Please add the ready label when the PR is ready for review.

Note: This is required to complete the testing suite, please only add the label once the PR is code complete and local testing has been performed.

@gemini-code-assist

Summary of Changes

Hello @yiliu30, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly enhances the AutoRoundModifier by improving its integration with models that leverage Hugging Face Accelerate for device management and offloading. The changes introduce robust handling of Accelerate hooks, provide explicit control over device mapping during quantization, and allow for the designation of layers to be kept in full precision. These updates aim to make the AutoRound quantization process more flexible and compatible with a wider range of large language models, as demonstrated by the new DeepSeek and Qwen examples.

Highlights

  • New AutoRound Examples: Added two new example scripts, deepseek_example.py and qwen3_example.py, demonstrating how to apply AutoRound quantization to DeepSeek and Qwen models, respectively. These examples showcase model loading, dataset preparation, and the application of the AutoRoundModifier.
  • Accelerate Hook Management: Introduced a suspend_accelerate_hooks context manager to temporarily detach and re-attach Hugging Face Accelerate hooks during the AutoRound quantization process. This ensures proper handling of models that use Accelerate for device management or offloading, preventing conflicts during quantization.
  • Device Mapping for AutoRound: The AutoRoundModifier now includes a device_map parameter, allowing users to explicitly specify the devices (e.g., '0,1') on which the quantization should be performed. This parameter is passed directly to the underlying AutoRound library.
  • Full Precision Layer Specification: Added a new method get_unquantized_layer_names to identify layers that should remain in full precision. These layers are then passed to the AutoRound algorithm via the fp_layers parameter, ensuring that specified modules are not quantized.
  • Improved Auto-Offloading: Updated the auto_offload parameter in the AutoRound initialization from False to True, enabling AutoRound to manage model offloading more effectively during the quantization process.
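The hook suspension described in the highlights can be sketched as a generic context manager. This is a minimal illustration only: it assumes each module stores its hook in a _hf_hook attribute, mirroring how Hugging Face Accelerate attaches AlignDevicesHook, and is not the PR's actual suspend_accelerate_hooks implementation.

```python
from contextlib import contextmanager


@contextmanager
def suspend_hooks(modules):
    """Temporarily detach per-module hooks and re-attach them on exit.

    Assumes hooks live in a `_hf_hook` attribute (as with Accelerate's
    AlignDevicesHook); illustrative sketch, not LLMC's code.
    """
    saved = {}
    try:
        for module in modules:
            hook = getattr(module, "_hf_hook", None)
            if hook is not None:
                saved[module] = hook
                delattr(module, "_hf_hook")  # detach for the duration
        yield
    finally:
        for module, hook in saved.items():
            module._hf_hook = hook  # re-attach on exit


class Dummy:  # stand-in for an nn.Module
    pass


m = Dummy()
m._hf_hook = "hook"
with suspend_hooks([m]):
    inside = hasattr(m, "_hf_hook")  # hook is detached here
after = hasattr(m, "_hf_hook")       # hook is restored here
print(inside, after)  # False True
```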
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

|Feature|Command|Description|
|---|---|---|
|Code Review|/gemini review|Performs a code review for the current pull request in its current state.|
|Pull Request Summary|/gemini summary|Provides a summary of the current pull request in its current state.|
|Comment|@gemini-code-assist|Responds in comments when explicitly tagged, both in pull request comments and review comments.|
|Help|/gemini help|Displays a list of available commands.|

Customization

To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

@yiliu30
Owner Author

yiliu30 commented Dec 15, 2025

/gemini review


@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request adds support for multi-device execution in the AutoRound modifier and includes new example scripts. The core changes in the modifier logic seem solid, particularly the introduction of suspend_accelerate_hooks to manage device offloading. My feedback primarily focuses on improving the portability and clarity of the new example scripts by removing hardcoded local file paths. I've also included a couple of minor suggestions to improve code style and comment clarity in the modifier itself.

Comment on lines +9 to +11
model_id = "meta-llama/Meta-Llama-3-8B-Instruct"
model_id = "/storage/yiliu7/unsloth/DeepSeek-R1-BF16"
# model_id = "/storage/yiliu7/deepseek-ai/DeepSeek-V2-Lite-Chat/"


high

The model_id is set to a hardcoded local path, which makes this example not portable. Please use a single, public model identifier from the Hugging Face Hub and remove the other commented or overwritten assignments for clarity.

Suggested change
model_id = "meta-llama/Meta-Llama-3-8B-Instruct"
model_id = "/storage/yiliu7/unsloth/DeepSeek-R1-BF16"
# model_id = "/storage/yiliu7/deepseek-ai/DeepSeek-V2-Lite-Chat/"
model_id = "deepseek-ai/DeepSeek-V2-Lite-Chat"

Comment on lines +9 to +12
model_id = "Qwen/Qwen3-30B-A3B"
# model_id = "/storage/yiliu7/Qwen/Qwen3-30B-A3B"
# model_id = "/storage/yiliu7/Qwen/Qwen2.5-0.5B/"
model_id = "/storage/yiliu7/Qwen/Qwen3-235B-A22B/"


high

The model_id is set to a hardcoded local path, which makes this example not portable. Please use a single, public model identifier from the Hugging Face Hub and remove the other commented or overwritten assignments for clarity. Using a smaller model might also be more suitable for an example script.

Suggested change
model_id = "Qwen/Qwen3-30B-A3B"
# model_id = "/storage/yiliu7/Qwen/Qwen3-30B-A3B"
# model_id = "/storage/yiliu7/Qwen/Qwen2.5-0.5B/"
model_id = "/storage/yiliu7/Qwen/Qwen3-235B-A22B/"
model_id = "Qwen/Qwen1.5-0.5B-Chat"

Comment on lines +56 to +58
SAVE_DIR = (
"/storage/yiliu7/" + model_id.rstrip("/").split("/")[-1] + "-W4A16-G128-AutoRound"
)


high

The SAVE_DIR is constructed with a hardcoded absolute path prefix, which will cause the script to fail on other machines. It should be a relative path so the model is saved in the current working directory.

SAVE_DIR = model_id.rstrip("/").split("/")[-1] + "-W4A16-G128-AutoRound"

@@ -239,14 +277,12 @@ def apply_autoround(self, state, subgraph):
q_input=self._q_input,
device=str(device),
# Leave offload for LLMC


medium

The comment # Leave offload for LLMC is now misleading since auto_offload is set to True. With the addition of suspend_accelerate_hooks, it seems the intention is now to use auto_round's internal offloading. The comment should be updated to reflect this change in behavior.

Suggested change
# Leave offload for LLMC
# Use auto_round's internal offloading

Signed-off-by: yiliu30 <yi4.liu@intel.com>

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request adds support for device mapping in the AutoRound modifier, including two new example scripts. My review focuses on improving the portability and security of the new examples and enhancing the robustness of the AutoRoundModifier. The main suggestions are to remove hardcoded local paths from the example scripts in favor of public models from the Hugging Face Hub, warn users about the security implications of trust_remote_code=True, and use a more flexible default for device_map.

model_id = "/storage/yiliu7/unsloth/DeepSeek-R1-BF16"
# model_id = "/storage/yiliu7/deepseek-ai/DeepSeek-V2-Lite-Chat/"
model = AutoModelForCausalLM.from_pretrained(
model_id, torch_dtype="auto", trust_remote_code=True


high

Using trust_remote_code=True can introduce a security vulnerability if the model repository contains malicious code. It is crucial to warn users about this risk, especially in an example script that they might copy and run. Please add a comment explaining that users should only enable this if they trust the source of the model.

Comment on lines +9 to +10
model_id = "meta-llama/Meta-Llama-3-8B-Instruct"
model_id = "/storage/yiliu7/unsloth/DeepSeek-R1-BF16"


medium

The model_id is hardcoded to a local path, which makes this example not portable or runnable for other users. It's better to default to a model identifier from the Hugging Face Hub and provide the local path as a commented-out alternative for users who wish to use a local model.

Suggested change
model_id = "meta-llama/Meta-Llama-3-8B-Instruct"
model_id = "/storage/yiliu7/unsloth/DeepSeek-R1-BF16"
model_id = "meta-llama/Meta-Llama-3-8B-Instruct"
# model_id = "/storage/yiliu7/unsloth/DeepSeek-R1-BF16"

Comment on lines +9 to +12
model_id = "Qwen/Qwen3-30B-A3B"
# model_id = "/storage/yiliu7/Qwen/Qwen3-30B-A3B"
# model_id = "/storage/yiliu7/Qwen/Qwen2.5-0.5B/"
model_id = "/storage/yiliu7/Qwen/Qwen3-235B-A22B/"


medium

The model_id is hardcoded to a local path, which makes this example not runnable for other users. It's better to default to a model from the Hugging Face Hub and provide the local path as a commented-out alternative.

Suggested change
model_id = "Qwen/Qwen3-30B-A3B"
# model_id = "/storage/yiliu7/Qwen/Qwen3-30B-A3B"
# model_id = "/storage/yiliu7/Qwen/Qwen2.5-0.5B/"
model_id = "/storage/yiliu7/Qwen/Qwen3-235B-A22B/"
model_id = "Qwen/Qwen3-30B-A3B"
# model_id = "/storage/yiliu7/Qwen/Qwen3-30B-A3B"
# model_id = "/storage/yiliu7/Qwen/Qwen2.5-0.5B/"
# model_id = "/storage/yiliu7/Qwen/Qwen3-235B-A22B/"

Comment on lines +56 to +58
SAVE_DIR = (
"/storage/yiliu7/" + model_id.rstrip("/").split("/")[-1] + "-W4A16-G128-AutoRound"
)


medium

The SAVE_DIR is constructed using a hardcoded absolute path. This will cause the script to fail for any user who does not have the /storage/yiliu7/ directory. The output directory should be a relative path to make the example portable.

SAVE_DIR = model_id.rstrip("/").split("/")[-1] + "-W4A16-G128-AutoRound"

iters: int = 200
enable_torch_compile: bool = True
batch_size: int = 8
device_map: str = "0"


medium

The default value for device_map is hardcoded to '0'. This assumes a GPU is available at index 0 and will cause a crash on CPU-only systems or systems with a different GPU configuration. Consider changing the default to 'auto' to let accelerate handle device placement automatically, which is more robust and user-friendly.

Suggested change
device_map: str = "0"
device_map: str = "auto"

yiliu30 and others added 8 commits December 16, 2025 22:37
Signed-off-by: yiliu30 <yi4.liu@intel.com>
…ct#2121)

SUMMARY:
Part of vllm-project#1927

- Updated type hints to Python 3.10+ built-in generics
- Replaced List[] with list[]
- No functional changes



TEST PLAN:
- Ran `make quality` (ruff format and lint checks)
- Verified no functional code changes were introduced

---------

Co-authored-by: Brian Dellabetta <brian-dellabetta@users.noreply.github.com>
Signed-off-by: yiliu30 <yi4.liu@intel.com>
@yiliu30
Owner Author

yiliu30 commented Dec 18, 2025

/gemini review


@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces support for multi-GPU tuning with device_map in the AutoRoundModifier. It also adds a new example for quantizing a large model (Qwen3-235B) using multiple GPUs. Additionally, it maps the ignore parameter to AutoRound's layer skipping mechanism and includes a bug fix related to accessing quantization scale attributes. The changes are well-implemented and include a new test case for the multi-GPU functionality. My main feedback is to clean up the new example file by removing development artifacts for better clarity.

Comment on lines +9 to +15
# FIXME: clean up model_id settings
model_id = "Qwen/Qwen3-30B-A3B"
model_id = "/storage/yiliu7/Qwen/Qwen3-30B-A3B"
# model_id = "/storage/yiliu7/Qwen/Qwen2.5-0.5B/"
model_id = "/storage/yiliu7/Qwen/Qwen3-235B-A22B/"
model_id = "/models/Qwen3-30B-A3B"
model_id = "Qwen/Qwen3-235B-A22B/"


medium

This section for selecting the model ID contains multiple overwritten assignments and a FIXME comment. This looks like leftover development code and can be confusing for users of this example. Please clean this up to have a single, clear model_id assignment. If you want to show other possible models, it's better to list them in comments.

Suggested change
# FIXME: clean up model_id settings
model_id = "Qwen/Qwen3-30B-A3B"
model_id = "/storage/yiliu7/Qwen/Qwen3-30B-A3B"
# model_id = "/storage/yiliu7/Qwen/Qwen2.5-0.5B/"
model_id = "/storage/yiliu7/Qwen/Qwen3-235B-A22B/"
model_id = "/models/Qwen3-30B-A3B"
model_id = "Qwen/Qwen3-235B-A22B/"
# You can try other models here, for example:
# model_id = "Qwen/Qwen3-30B-A3B"
model_id = "Qwen/Qwen3-235B-A22B/"

@yiliu30
Copy link
Owner Author

yiliu30 commented Dec 18, 2025

/gemini review


@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces support for multi-GPU tuning with device_map in the AutoRound modifier. It also adds a new example script for a large model (Qwen3-235B) to demonstrate this feature, along with a corresponding test case. The changes to the AutoRoundModifier correctly handle device mapping and suspend accelerate hooks during the process. My feedback focuses on improving the new example script for better clarity and robustness.

Comment on lines +9 to +15
# FIXME: clean up model_id settings
model_id = "Qwen/Qwen3-30B-A3B"
model_id = "/storage/yiliu7/Qwen/Qwen3-30B-A3B"
# model_id = "/storage/yiliu7/Qwen/Qwen2.5-0.5B/"
model_id = "/storage/yiliu7/Qwen/Qwen3-235B-A22B/"
model_id = "/models/Qwen3-30B-A3B"
model_id = "Qwen/Qwen3-235B-A22B/"


medium

The model_id is reassigned multiple times, with some lines commented out, and there's a FIXME comment. This seems like leftover code from development. It would be cleaner to remove the unused assignments and the FIXME comment, leaving only the desired model ID. For an example script, it's best to have a single, clear configuration.

Suggested change
# FIXME: clean up model_id settings
model_id = "Qwen/Qwen3-30B-A3B"
model_id = "/storage/yiliu7/Qwen/Qwen3-30B-A3B"
# model_id = "/storage/yiliu7/Qwen/Qwen2.5-0.5B/"
model_id = "/storage/yiliu7/Qwen/Qwen3-235B-A22B/"
model_id = "/models/Qwen3-30B-A3B"
model_id = "Qwen/Qwen3-235B-A22B/"
model_id = "Qwen/Qwen3-235B-A22B/"

print("==========================================\n\n")

# Save to disk compressed.
SAVE_DIR = model_id.rstrip("/").split("/")[-1] + "-W4A16-G128-AutoRound"


medium

Using split('/')[-1] to get the model name from the path is not very robust as it assumes a Unix-like path separator. A more cross-platform and readable approach is to use os.path.basename. You will need to add import os at the top of the file.

Suggested change
SAVE_DIR = model_id.rstrip("/").split("/")[-1] + "-W4A16-G128-AutoRound"
SAVE_DIR = os.path.basename(model_id.rstrip("/")) + "-W4A16-G128-AutoRound"
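The reviewer's suggestion can be checked with a quick, self-contained comparison (plain Python, using an example path in the style of those in the review; on the standard library's path modules, both forms agree once the trailing slash is stripped):

```python
import os

model_id = "/storage/yiliu7/Qwen/Qwen3-235B-A22B/"  # example path from the review

# split-based form used in the original script
name_split = model_id.rstrip("/").split("/")[-1]

# os.path.basename form suggested by the reviewer
name_basename = os.path.basename(model_id.rstrip("/"))

save_dir = name_basename + "-W4A16-G128-AutoRound"
print(name_split, name_basename, save_dir)
```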

Signed-off-by: yiliu30 <yi4.liu@intel.com>
@yiliu30 yiliu30 changed the title [Self-Review]Auto device [Self-Review] Enhance Autoround to support multiple cards tuning Dec 18, 2025

@hshen14 hshen14 left a comment


LGTM
You may prepare a README to show the instructions and results for the example.

sugatmahanti and others added 11 commits December 18, 2025 10:56
…-project#2034)

SUMMARY:
This is part of
vllm-project#1927

Modernize type annotations using | operator and built-in generics in the
transformer module as part of codebase modernization effort.

TEST PLAN:
```
make style
make quality
make tests
```
Notes:
Happy to address any comments! Thank you!

---------

Co-authored-by: Brian Dellabetta <brian-dellabetta@users.noreply.github.com>
SUMMARY:
Added examples for fp8 awq which now work after the AWQ generalization


TEST PLAN:
python $REPOS/llm-compressor/examples/awq/fp8_dynamic_llama_example.py
2>&1 | tee fp8_dynamic.log
python $REPOS/llm-compressor/examples/awq/fp8_block_llama_example.py
2>&1 | tee fp8_block.log

<details>
  <summary>fp8_dynamic.log</summary>
  

/home/HDCharles/rhdev/lib/python3.11/site-packages/transformers/utils/hub.py:110:
FutureWarning: Using `TRANSFORMERS_CACHE` is deprecated and will be
removed in v5 of Transformers. Use `HF_HOME` instead.
  warnings.warn(
`torch_dtype` is deprecated! Use `dtype` instead!

Loading checkpoint shards: 100%|██████████| 4/4 [00:00<00:00,  8.95it/s]
2025-12-17T20:56:18.271169+0000 | reset | INFO - Compression lifecycle
reset
2025-12-17T20:56:18.271896+0000 | from_modifiers | INFO - Creating
recipe from modifiers
2025-12-17T20:56:18.292591+0000 | initialize | INFO - Compression
lifecycle initialized for 1 modifiers
2025-12-17T20:56:18.292874+0000 | IndependentPipeline | INFO - Inferred
`DataFreePipeline` for `QuantizationModifier`

Updating global scales: 100%|██████████| 224/224 [00:00<00:00, 648394.82it/s]

Fusing global scales: 647it [00:00, 511346.28it/s]

Calibrating weights: 100%|██████████| 224/224 [00:00<00:00, 1596.33it/s]
2025-12-17T20:56:53.594142+0000 | finalize | INFO - Compression
lifecycle finalized for 1 modifiers
2025-12-17T20:56:57.580914+0000 | post_process | WARNING - Optimized
model is not saved. To save, please provide`output_dir` as input arg.Ex.
`oneshot(..., output_dir=...)`
The attention mask and the pad token id were not set. As a consequence,
you may observe unexpected behavior. Please pass your input's
`attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.
The attention mask is not set and cannot be inferred from input because
pad token is same as eos token. As a consequence, you may observe
unexpected behavior. Please pass your input's `attention_mask` to obtain
reliable results.



========== SAMPLE GENERATION ==============
<|begin_of_text|>Hello my name is Sarah and I am a 30-year-old woman who
has been diagnosed with multiple sclerosis (MS). I am here to share my
story and to help raise awareness about this chronic and often
debilitating disease.
I was diagnosed with MS in 2010, when I was 25 years old. At the time, I
was working as a teacher and living a normal life. But suddenly, I
started experiencing strange symptoms such as numbness in my hands and
feet, blurred vision, and fatigue. I went
==========================================


2025-12-17T20:57:24.962901+0000 | get_model_compressor | INFO -
skip_sparsity_compression_stats set to True. Skipping sparsity
compression statistic calculations. No sparsity compressor will be
applied.

Compressing model: 224it [00:12, 18.36it/s]
</details>
<details>
  <summary>fp8_block.log</summary>


/home/HDCharles/rhdev/lib/python3.11/site-packages/transformers/utils/hub.py:110:
FutureWarning: Using `TRANSFORMERS_CACHE` is deprecated and will be
removed in v5 of Transformers. Use `HF_HOME` instead.
  warnings.warn(
`torch_dtype` is deprecated! Use `dtype` instead!

Loading checkpoint shards: 100%|██████████| 4/4 [00:00<00:00, 136.99it/s]
2025-12-17T20:57:53.946116+0000 | reset | INFO - Compression lifecycle
reset
2025-12-17T20:57:53.946848+0000 | from_modifiers | INFO - Creating
recipe from modifiers
2025-12-17T20:57:53.966319+0000 | initialize | INFO - Compression
lifecycle initialized for 1 modifiers
2025-12-17T20:57:53.966658+0000 | IndependentPipeline | INFO - Inferred
`DataFreePipeline` for `QuantizationModifier`

Updating global scales: 100%|██████████| 224/224 [00:00<00:00, 637397.62it/s]

Fusing global scales: 647it [00:00, 486415.97it/s]

Calibrating weights: 100%|██████████| 224/224 [00:00<00:00, 943.96it/s]
2025-12-17T20:58:00.043737+0000 | finalize | INFO - Compression
lifecycle finalized for 1 modifiers
2025-12-17T20:58:03.951940+0000 | post_process | WARNING - Optimized
model is not saved. To save, please provide`output_dir` as input arg.Ex.
`oneshot(..., output_dir=...)`
The attention mask and the pad token id were not set. As a consequence,
you may observe unexpected behavior. Please pass your input's
`attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.
The attention mask is not set and cannot be inferred from input because
pad token is same as eos token. As a consequence, you may observe
unexpected behavior. Please pass your input's `attention_mask` to obtain
reliable results.



========== SAMPLE GENERATION ==============
<|begin_of_text|>Hello my name is Kaitlyn and I am a 24-year-old
freelance writer and editor. I have a passion for storytelling and a
knack for crafting compelling narratives. I have a degree in English
Literature and have been writing professionally for over 5 years. I have
experience writing articles, blog posts, and website content for a
variety of clients, including businesses, non-profits, and individuals.
I am also skilled in editing and proofreading, and have worked with
clients to refine their writing and ensure it is error
==========================================


2025-12-17T20:58:34.036482+0000 | get_model_compressor | INFO -
skip_sparsity_compression_stats set to True. Skipping sparsity
compression statistic calculations. No sparsity compressor will be
applied.

Compressing model: 224it [00:11, 19.04it/s]

</details>

Signed-off-by: HDCharles <charlesdavidhernandez@gmail.com>
Signed-off-by: yiliu30 <yi4.liu@intel.com>
SUMMARY:
- Switching the collator from truncation to default + shuffling seems to
address the regression we're seeing in lm-eval
- Given that the recovery values in these tests were determined using
these settings, I think they should be how we evaluate our lm-eval tests
for the time being
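As a hedged illustration of the collator difference above, here is a pure-Python toy sketch (these are not the actual HF/llm-compressor collators; the function names and pad-token behavior are assumptions made for illustration):

```python
import random


def truncate_collate(batch):
    # old behavior: truncate every sequence to the shortest in the batch,
    # discarding tokens from the longer sequences
    n = min(len(x) for x in batch)
    return [x[:n] for x in batch]


def default_collate(batch, pad_id=0):
    # default behavior: pad every sequence to the longest in the batch,
    # so no calibration tokens are dropped
    n = max(len(x) for x in batch)
    return [x + [pad_id] * (n - len(x)) for x in batch]


def make_batches(dataset, batch_size, shuffle=True, seed=42):
    # shuffling with a fixed seed keeps runs reproducible while avoiding
    # any ordering bias in the calibration set
    data = list(dataset)
    if shuffle:
        random.Random(seed).shuffle(data)
    return [data[i:i + batch_size] for i in range(0, len(data), batch_size)]
```

The key point is that truncation silently drops calibration tokens, while padding preserves them, which plausibly explains the lm-eval recovery difference.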

---------

Signed-off-by: Dipika Sikka <ds3822@columbia.edu>
Co-authored-by: Brian Dellabetta <brian-dellabetta@users.noreply.github.com>
Summary
- Many tests get skipped in our base tests because the workflow was
missing the HF_TOKEN and did not have access to a GPU
- With this change, only the weekly and nightly tests outside of the
transformers folder are skipped; everything else is running

---------

Signed-off-by: Dipika Sikka <dipikasikka1@gmail.com>
SUMMARY:
Previously I added _orient and _reorient functions to handle this, and
there was a TODO to use flatten_for_calibration instead. That turned out
to be a little tricky since:
A) it's difficult to use it to reverse the transformation (see initial
commit)
B) it changes the shape to [1, num_H, num_W, block_H*block_W], and while
we can reshape that directly into the shape needed for the normalization
easily enough, we'd like to record the 4D shape [num_H, num_W, block_H,
block_W], which requires yet more steps to get and record.

Thus it seems simpler to do the unflatten and transpose manually,
exactly how we need it, then record the shapes, normalize, and lastly
use the recorded shapes to undo it all.
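The manual unflatten/transpose/record/undo flow described above can be sketched in NumPy roughly as follows (a hypothetical illustration, not the actual llm-compressor code; the function names and the per-block max-abs normalization are assumptions):

```python
import numpy as np


def normalize_blocks(w, block_h, block_w):
    # unflatten [H, W] -> [num_H, block_H, num_W, block_W],
    # then transpose to [num_H, num_W, block_H, block_W]
    num_h, num_w = w.shape[0] // block_h, w.shape[1] // block_w
    blocks = w.reshape(num_h, block_h, num_w, block_w).transpose(0, 2, 1, 3)
    shape4d = blocks.shape  # record the 4D shape for the undo step
    flat = blocks.reshape(num_h, num_w, -1)
    # normalize each block by its max absolute value (illustrative choice)
    scale = np.maximum(np.abs(flat).max(axis=-1, keepdims=True), 1e-12)
    return flat / scale, scale, shape4d


def denormalize_blocks(normed, scale, shape4d):
    # use the recorded 4D shape to undo the normalization and transposes
    num_h, num_w, block_h, block_w = shape4d
    blocks = (normed * scale).reshape(shape4d).transpose(0, 2, 1, 3)
    return blocks.reshape(num_h * block_h, num_w * block_w)
```

Because the axis permutation (0, 2, 1, 3) is its own inverse, the round trip recovers the original weight exactly, which is the property the _orient/_reorient pair needed.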

TEST PLAN:
pytest
/home/HDCharles/repos/llm-compressor/src/llmcompressor/modifiers/awq/base.py

---------

Signed-off-by: HDCharles <charlesdavidhernandez@gmail.com>
…llm-project#2113)

SUMMARY:
Add code to convert the gpt_oss model and then quantize the model to w8a8

TEST PLAN:
Tested with the gpt_oss model and ran it using vLLM
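For context, W8A8 quantizes both weights and activations to int8. A minimal per-output-channel symmetric weight-quantization sketch in NumPy (illustrative only; the actual conversion goes through llm-compressor's quantization modifiers, and these helper names are made up):

```python
import numpy as np


def quantize_w8(w):
    # symmetric per-output-channel int8: scale each row so its
    # max-abs value maps to 127
    scale = np.maximum(np.abs(w).max(axis=1, keepdims=True), 1e-12) / 127.0
    q = np.clip(np.round(w / scale), -128, 127).astype(np.int8)
    return q, scale


def dequantize(q, scale):
    # reconstruct an approximation of the original weights
    return q.astype(np.float32) * scale
```

With symmetric rounding, the reconstruction error per element is bounded by half the channel's scale, which is what keeps w8a8 accuracy close to the bf16 baseline.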

Sample output

=== Prompt 0 ===
<|start|>system<|message|>You are ChatGPT, a large language model
trained by OpenAI.
Knowledge cutoff: 2024-06
Current date: 2025-12-04

Reasoning: medium

Valid channels: analysis, commentary, final. Channel must be included
for every message.<|end|><|start|>developer<|message|># Instructions

You are a helpful assistant.

<|end|><|start|>user<|message|>Give 3 reasons to use
AI.<|end|><|start|>assistant

--- Candidate 0 ---
analysisWe need to answer with three reasons to use AI. Should be short.
Provide 3 reasons. Let's do it.assistantcommentaryHere are three reasons
to use AI:

Personalized learning – AI can adapt to your learning style and pace,
providing customized lessons that suit your learning style and progress.
Skill development – AI can help you develop skills that are in demand,
such as coding, data analysis, and data science.
Supportive design
We should provide a short answer.assistantfinalHere are three reasons to
use AI:

Personalized learning – AI adapts to your learning style and pace,
giving you personalized lessons that fit your needs.
Skill development – AI helps you learn and practice skills that are in
demand, like coding, data analysis, and data‑science skills.
Real‑world impact – AI helps you practice real-world skills in real
life, such as learning new skills and applying them in real life.
finish_reason: stop
num_tokens: 226

---------

Signed-off-by: Sharif Inamdar <sharif.inamdar@arm.com>
Co-authored-by: Dipika Sikka <dipikasikka1@gmail.com>
Co-authored-by: shanjiaz <43143795+shanjiaz@users.noreply.github.com>
@yiliu30 yiliu30 closed this Dec 22, 2025