
[Self-Review] Enhance Autoround to support multiple cards tuning#12

Closed
yiliu30 wants to merge 35 commits into main from auto-device

Conversation

@yiliu30
Owner

@yiliu30 yiliu30 commented Dec 15, 2025

AutoRound uses a block-level reconstruction loss to fine-tune quantization parameters, which requires running a backward pass on each block. For a large model such as Qwen3-235B, a single GPU often doesn't have enough memory to hold an entire block during the backward computation. To address this, we use Accelerate to dispatch the module across multiple devices.
In this PR, we enable this feature on LLMC side:

  • Add device_map for tuning with multiple cards
  • Map ignore to AutoRound's layer-skipping mechanism
  • Add Qwen3-235B as an example for multiple cards
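For intuition, the device_map string accepts a comma-separated list of card indices such as "0,1". The helper below is a minimal sketch of how such a string could be resolved into torch-style device identifiers; the function name and "auto" pass-through behavior are assumptions for illustration, not the actual LLMC or AutoRound implementation.

```python
def parse_device_map(device_map: str) -> list[str]:
    """Hypothetical helper: turn a device_map string such as "0,1"
    into a list of torch-style device identifiers.

    "auto" is passed through unchanged so a framework like Accelerate
    can decide placement on its own (an assumption for this sketch).
    """
    device_map = device_map.strip()
    if device_map == "auto":
        return ["auto"]
    return [f"cuda:{idx.strip()}" for idx in device_map.split(",") if idx.strip()]


print(parse_device_map("0,1"))  # ['cuda:0', 'cuda:1']
```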

Test plan

pytest -svv ./llmcompressor/transformers/autoround/test_autoround_oneshot.py -k test_oneshot_with_device_map

Example results

# vllm (pretrained=INC4AI/Qwen3-235B-A22B-W4A16-G128-AutoRound-ITERS1-LLMC-TEST-ONLY,tensor_parallel_size=2,max_model_len=8192,max_num_batched_tokens=32768,max_num_seqs=128,add_bos_token=True,gpu_memory_utilization=0.8,dtype=bfloat16,max_gen_toks=2048,enable_prefix_caching=False), gen_kwargs: (None), limit: 1000.0, num_fewshot: None, batch_size: 128
# |Tasks|Version|     Filter     |n-shot|  Metric   |   |Value|   |Stderr|
# |-----|------:|----------------|-----:|-----------|---|----:|---|-----:|
# |gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.928|±  |0.0082|
# |     |       |strict-match    |     5|exact_match|↑  |0.930|±  |0.0081|
  

# vllm (pretrained=INC4AI/Qwen3-235B-A22B-W4A16-G128-AutoRound-ITERS200-LLMC-TEST-ONLY,tensor_parallel_size=2,max_model_len=8192,max_num_batched_tokens=32768,max_num_seqs=128,add_bos_token=True,gpu_memory_utilization=0.8,dtype=bfloat16,max_gen_toks=2048,enable_prefix_caching=False), gen_kwargs: (None), limit: 1000.0, num_fewshot: None, batch_size: 128
# |Tasks|Version|     Filter     |n-shot|  Metric   |   |Value|   |Stderr|
# |-----|------:|----------------|-----:|-----------|---|----:|---|-----:|
# |gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.934|±  |0.0079|
# |     |       |strict-match    |     5|exact_match|↑  |0.915|±  |0.0088|

Signed-off-by: yiliu30 <yi4.liu@intel.com>
@github-actions

👋 Hi! Thank you for contributing to llm-compressor. Please add the ready label when the PR is ready for review.

Note: This is required to complete the testing suite, please only add the label once the PR is code complete and local testing has been performed.

@gemini-code-assist

Summary of Changes

Hello @yiliu30, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly enhances the AutoRoundModifier by improving its integration with models that leverage Hugging Face Accelerate for device management and offloading. The changes introduce robust handling of Accelerate hooks, provide explicit control over device mapping during quantization, and allow for the designation of layers to be kept in full precision. These updates aim to make the AutoRound quantization process more flexible and compatible with a wider range of large language models, as demonstrated by the new DeepSeek and Qwen examples.

Highlights

  • New AutoRound Examples: Added two new example scripts, deepseek_example.py and qwen3_example.py, demonstrating how to apply AutoRound quantization to DeepSeek and Qwen models, respectively. These examples showcase model loading, dataset preparation, and the application of the AutoRoundModifier.
  • Accelerate Hook Management: Introduced a suspend_accelerate_hooks context manager to temporarily detach and re-attach Hugging Face Accelerate hooks during the AutoRound quantization process. This ensures proper handling of models that use Accelerate for device management or offloading, preventing conflicts during quantization.
  • Device Mapping for AutoRound: The AutoRoundModifier now includes a device_map parameter, allowing users to explicitly specify the devices (e.g., '0,1') on which the quantization should be performed. This parameter is passed directly to the underlying AutoRound library.
  • Full Precision Layer Specification: Added a new method get_unquantized_layer_names to identify layers that should remain in full precision. These layers are then passed to the AutoRound algorithm via the fp_layers parameter, ensuring that specified modules are not quantized.
  • Improved Auto-Offloading: Updated the auto_offload parameter in the AutoRound initialization from False to True, enabling AutoRound to manage model offloading more effectively during the quantization process.
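The hook suspension described in the highlights can be sketched as a generic context manager. This is a minimal illustration only: it assumes each module stores its hook in a _hf_hook attribute, mirroring how Hugging Face Accelerate attaches AlignDevicesHook, and is not the PR's actual suspend_accelerate_hooks implementation.

```python
from contextlib import contextmanager


@contextmanager
def suspend_hooks(modules):
    """Temporarily detach per-module hooks and re-attach them on exit.

    Assumes hooks live in a `_hf_hook` attribute (as with Accelerate's
    AlignDevicesHook); illustrative sketch, not LLMC's code.
    """
    saved = {}
    try:
        for module in modules:
            hook = getattr(module, "_hf_hook", None)
            if hook is not None:
                saved[module] = hook
                delattr(module, "_hf_hook")  # detach for the duration
        yield
    finally:
        for module, hook in saved.items():
            module._hf_hook = hook  # re-attach on exit


class Dummy:  # stand-in for an nn.Module
    pass


m = Dummy()
m._hf_hook = "hook"
with suspend_hooks([m]):
    inside = hasattr(m, "_hf_hook")  # hook is detached here
after = hasattr(m, "_hf_hook")       # hook is restored here
print(inside, after)  # False True
```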
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

|Feature|Command|Description|
|---|---|---|
|Code Review|/gemini review|Performs a code review for the current pull request in its current state.|
|Pull Request Summary|/gemini summary|Provides a summary of the current pull request in its current state.|
|Comment|@gemini-code-assist|Responds in comments when explicitly tagged, both in pull request comments and review comments.|
|Help|/gemini help|Displays a list of available commands.|

Customization

To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

@yiliu30
Owner Author

yiliu30 commented Dec 15, 2025

/gemini review


@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request adds support for multi-device execution in the AutoRound modifier and includes new example scripts. The core changes in the modifier logic seem solid, particularly the introduction of suspend_accelerate_hooks to manage device offloading. My feedback primarily focuses on improving the portability and clarity of the new example scripts by removing hardcoded local file paths. I've also included a couple of minor suggestions to improve code style and comment clarity in the modifier itself.

Comment on lines +9 to +11
model_id = "meta-llama/Meta-Llama-3-8B-Instruct"
model_id = "/storage/yiliu7/unsloth/DeepSeek-R1-BF16"
# model_id = "/storage/yiliu7/deepseek-ai/DeepSeek-V2-Lite-Chat/"


high

The model_id is set to a hardcoded local path, which makes this example not portable. Please use a single, public model identifier from the Hugging Face Hub and remove the other commented or overwritten assignments for clarity.

Suggested change
model_id = "meta-llama/Meta-Llama-3-8B-Instruct"
model_id = "/storage/yiliu7/unsloth/DeepSeek-R1-BF16"
# model_id = "/storage/yiliu7/deepseek-ai/DeepSeek-V2-Lite-Chat/"
model_id = "deepseek-ai/DeepSeek-V2-Lite-Chat"

Comment on lines +9 to +12
model_id = "Qwen/Qwen3-30B-A3B"
# model_id = "/storage/yiliu7/Qwen/Qwen3-30B-A3B"
# model_id = "/storage/yiliu7/Qwen/Qwen2.5-0.5B/"
model_id = "/storage/yiliu7/Qwen/Qwen3-235B-A22B/"


high

The model_id is set to a hardcoded local path, which makes this example not portable. Please use a single, public model identifier from the Hugging Face Hub and remove the other commented or overwritten assignments for clarity. Using a smaller model might also be more suitable for an example script.

Suggested change
model_id = "Qwen/Qwen3-30B-A3B"
# model_id = "/storage/yiliu7/Qwen/Qwen3-30B-A3B"
# model_id = "/storage/yiliu7/Qwen/Qwen2.5-0.5B/"
model_id = "/storage/yiliu7/Qwen/Qwen3-235B-A22B/"
model_id = "Qwen/Qwen1.5-0.5B-Chat"

Comment on lines +56 to +58
SAVE_DIR = (
"/storage/yiliu7/" + model_id.rstrip("/").split("/")[-1] + "-W4A16-G128-AutoRound"
)


high

The SAVE_DIR is constructed with a hardcoded absolute path prefix, which will cause the script to fail on other machines. It should be a relative path so the model is saved in the current working directory.

SAVE_DIR = model_id.rstrip("/").split("/")[-1] + "-W4A16-G128-AutoRound"

@@ -239,14 +277,12 @@ def apply_autoround(self, state, subgraph):
q_input=self._q_input,
device=str(device),
# Leave offload for LLMC


medium

The comment # Leave offload for LLMC is now misleading since auto_offload is set to True. With the addition of suspend_accelerate_hooks, it seems the intention is now to use auto_round's internal offloading. The comment should be updated to reflect this change in behavior.

Suggested change
# Leave offload for LLMC
# Use auto_round's internal offloading

Signed-off-by: yiliu30 <yi4.liu@intel.com>

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request adds support for device mapping in the AutoRound modifier, including two new example scripts. My review focuses on improving the portability and security of the new examples and enhancing the robustness of the AutoRoundModifier. The main suggestions are to remove hardcoded local paths from the example scripts in favor of public models from the Hugging Face Hub, warn users about the security implications of trust_remote_code=True, and use a more flexible default for device_map.

model_id = "/storage/yiliu7/unsloth/DeepSeek-R1-BF16"
# model_id = "/storage/yiliu7/deepseek-ai/DeepSeek-V2-Lite-Chat/"
model = AutoModelForCausalLM.from_pretrained(
model_id, torch_dtype="auto", trust_remote_code=True


high

Using trust_remote_code=True can introduce a security vulnerability if the model repository contains malicious code. It is crucial to warn users about this risk, especially in an example script that they might copy and run. Please add a comment explaining that users should only enable this if they trust the source of the model.

Comment on lines +9 to +10
model_id = "meta-llama/Meta-Llama-3-8B-Instruct"
model_id = "/storage/yiliu7/unsloth/DeepSeek-R1-BF16"


medium

The model_id is hardcoded to a local path, which makes this example not portable or runnable for other users. It's better to default to a model identifier from the Hugging Face Hub and provide the local path as a commented-out alternative for users who wish to use a local model.

Suggested change
model_id = "meta-llama/Meta-Llama-3-8B-Instruct"
model_id = "/storage/yiliu7/unsloth/DeepSeek-R1-BF16"
model_id = "meta-llama/Meta-Llama-3-8B-Instruct"
# model_id = "/storage/yiliu7/unsloth/DeepSeek-R1-BF16"

Comment on lines +9 to +12
model_id = "Qwen/Qwen3-30B-A3B"
# model_id = "/storage/yiliu7/Qwen/Qwen3-30B-A3B"
# model_id = "/storage/yiliu7/Qwen/Qwen2.5-0.5B/"
model_id = "/storage/yiliu7/Qwen/Qwen3-235B-A22B/"


medium

The model_id is hardcoded to a local path, which makes this example not runnable for other users. It's better to default to a model from the Hugging Face Hub and provide the local path as a commented-out alternative.

Suggested change
model_id = "Qwen/Qwen3-30B-A3B"
# model_id = "/storage/yiliu7/Qwen/Qwen3-30B-A3B"
# model_id = "/storage/yiliu7/Qwen/Qwen2.5-0.5B/"
model_id = "/storage/yiliu7/Qwen/Qwen3-235B-A22B/"
model_id = "Qwen/Qwen3-30B-A3B"
# model_id = "/storage/yiliu7/Qwen/Qwen3-30B-A3B"
# model_id = "/storage/yiliu7/Qwen/Qwen2.5-0.5B/"
# model_id = "/storage/yiliu7/Qwen/Qwen3-235B-A22B/"

Comment on lines +56 to +58
SAVE_DIR = (
"/storage/yiliu7/" + model_id.rstrip("/").split("/")[-1] + "-W4A16-G128-AutoRound"
)


medium

The SAVE_DIR is constructed using a hardcoded absolute path. This will cause the script to fail for any user who does not have the /storage/yiliu7/ directory. The output directory should be a relative path to make the example portable.

SAVE_DIR = model_id.rstrip("/").split("/")[-1] + "-W4A16-G128-AutoRound"

iters: int = 200
enable_torch_compile: bool = True
batch_size: int = 8
device_map: str = "0"


medium

The default value for device_map is hardcoded to '0'. This assumes a GPU is available at index 0 and will cause a crash on CPU-only systems or systems with a different GPU configuration. Consider changing the default to 'auto' to let accelerate handle device placement automatically, which is more robust and user-friendly.

Suggested change
device_map: str = "0"
device_map: str = "auto"

yiliu30 and others added 8 commits December 16, 2025 22:37
Signed-off-by: yiliu30 <yi4.liu@intel.com>
…ct#2121)

SUMMARY:
Part of vllm-project#1927

- Updated type hints to Python 3.10+ built-in generics
- Replaced List[] with list[]
- No functional changes



TEST PLAN:
- Ran `make quality` (ruff format and lint checks)
- Verified no functional code changes were introduced

---------

Co-authored-by: Brian Dellabetta <brian-dellabetta@users.noreply.github.com>
Signed-off-by: yiliu30 <yi4.liu@intel.com>
@yiliu30
Owner Author

yiliu30 commented Dec 18, 2025

/gemini review


@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces support for multi-GPU tuning with device_map in the AutoRoundModifier. It also adds a new example for quantizing a large model (Qwen3-235B) using multiple GPUs. Additionally, it maps the ignore parameter to AutoRound's layer skipping mechanism and includes a bug fix related to accessing quantization scale attributes. The changes are well-implemented and include a new test case for the multi-GPU functionality. My main feedback is to clean up the new example file by removing development artifacts for better clarity.

Comment on lines +9 to +15
# FIXME: clean up model_id settings
model_id = "Qwen/Qwen3-30B-A3B"
model_id = "/storage/yiliu7/Qwen/Qwen3-30B-A3B"
# model_id = "/storage/yiliu7/Qwen/Qwen2.5-0.5B/"
model_id = "/storage/yiliu7/Qwen/Qwen3-235B-A22B/"
model_id = "/models/Qwen3-30B-A3B"
model_id = "Qwen/Qwen3-235B-A22B/"


medium

This section for selecting the model ID contains multiple overwritten assignments and a FIXME comment. This looks like leftover development code and can be confusing for users of this example. Please clean this up to have a single, clear model_id assignment. If you want to show other possible models, it's better to list them in comments.

Suggested change
# FIXME: clean up model_id settings
model_id = "Qwen/Qwen3-30B-A3B"
model_id = "/storage/yiliu7/Qwen/Qwen3-30B-A3B"
# model_id = "/storage/yiliu7/Qwen/Qwen2.5-0.5B/"
model_id = "/storage/yiliu7/Qwen/Qwen3-235B-A22B/"
model_id = "/models/Qwen3-30B-A3B"
model_id = "Qwen/Qwen3-235B-A22B/"
# You can try other models here, for example:
# model_id = "Qwen/Qwen3-30B-A3B"
model_id = "Qwen/Qwen3-235B-A22B/"

@yiliu30
Copy link
Owner Author

yiliu30 commented Dec 18, 2025

/gemini review


@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces support for multi-GPU tuning with device_map in the AutoRound modifier. It also adds a new example script for a large model (Qwen3-235B) to demonstrate this feature, along with a corresponding test case. The changes to the AutoRoundModifier correctly handle device mapping and suspend accelerate hooks during the process. My feedback focuses on improving the new example script for better clarity and robustness.

Comment on lines +9 to +15
# FIXME: clean up model_id settings
model_id = "Qwen/Qwen3-30B-A3B"
model_id = "/storage/yiliu7/Qwen/Qwen3-30B-A3B"
# model_id = "/storage/yiliu7/Qwen/Qwen2.5-0.5B/"
model_id = "/storage/yiliu7/Qwen/Qwen3-235B-A22B/"
model_id = "/models/Qwen3-30B-A3B"
model_id = "Qwen/Qwen3-235B-A22B/"


medium

The model_id is reassigned multiple times, with some lines commented out, and there's a FIXME comment. This seems like leftover code from development. It would be cleaner to remove the unused assignments and the FIXME comment, leaving only the desired model ID. For an example script, it's best to have a single, clear configuration.

Suggested change
# FIXME: clean up model_id settings
model_id = "Qwen/Qwen3-30B-A3B"
model_id = "/storage/yiliu7/Qwen/Qwen3-30B-A3B"
# model_id = "/storage/yiliu7/Qwen/Qwen2.5-0.5B/"
model_id = "/storage/yiliu7/Qwen/Qwen3-235B-A22B/"
model_id = "/models/Qwen3-30B-A3B"
model_id = "Qwen/Qwen3-235B-A22B/"
model_id = "Qwen/Qwen3-235B-A22B/"

print("==========================================\n\n")

# Save to disk compressed.
SAVE_DIR = model_id.rstrip("/").split("/")[-1] + "-W4A16-G128-AutoRound"


medium

Using split('/')[-1] to get the model name from the path is not very robust as it assumes a Unix-like path separator. A more cross-platform and readable approach is to use os.path.basename. You will need to add import os at the top of the file.

Suggested change
SAVE_DIR = model_id.rstrip("/").split("/")[-1] + "-W4A16-G128-AutoRound"
SAVE_DIR = os.path.basename(model_id.rstrip("/")) + "-W4A16-G128-AutoRound"
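The reviewer's suggestion can be checked with a quick, self-contained comparison (plain Python, using an example path in the style of those in the review; on the standard library's path modules, both forms agree once the trailing slash is stripped):

```python
import os

model_id = "/storage/yiliu7/Qwen/Qwen3-235B-A22B/"  # example path from the review

# split-based form used in the original script
name_split = model_id.rstrip("/").split("/")[-1]

# os.path.basename form suggested by the reviewer
name_basename = os.path.basename(model_id.rstrip("/"))

save_dir = name_basename + "-W4A16-G128-AutoRound"
print(name_split, name_basename, save_dir)
```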

Signed-off-by: yiliu30 <yi4.liu@intel.com>
@yiliu30 yiliu30 changed the title [Self-Review]Auto device [Self-Review] Enhance Autoround to support multiple cards tuning Dec 18, 2025

@hshen14 hshen14 left a comment


LGTM
You may prepare a README to show the instructions and results for the example.

sugatmahanti and others added 11 commits December 18, 2025 10:56
…-project#2034)

SUMMARY:
This is part of
vllm-project#1927

Modernize type annotations using | operator and built-in generics in the
transformer module as part of codebase modernization effort.

TEST PLAN:
```
make style
make quality
make tests
```
Notes:
Happy to address any comments! Thank you!

---------

Co-authored-by: Brian Dellabetta <brian-dellabetta@users.noreply.github.com>
SUMMARY:
Added examples for fp8 awq which now work after the AWQ generalization


TEST PLAN:
python $REPOS/llm-compressor/examples/awq/fp8_dynamic_llama_example.py
2>&1 | tee fp8_dynamic.log
python $REPOS/llm-compressor/examples/awq/fp8_block_llama_example.py
2>&1 | tee fp8_block.log

<details>
  <summary>fp8_dynamic.log</summary>
  

/home/HDCharles/rhdev/lib/python3.11/site-packages/transformers/utils/hub.py:110:
FutureWarning: Using `TRANSFORMERS_CACHE` is deprecated and will be
removed in v5 of Transformers. Use `HF_HOME` instead.
  warnings.warn(
`torch_dtype` is deprecated! Use `dtype` instead!

Loading checkpoint shards: 100%|██████████| 4/4 [00:00<00:00,  8.95it/s]
2025-12-17T20:56:18.271169+0000 | reset | INFO - Compression lifecycle
reset
2025-12-17T20:56:18.271896+0000 | from_modifiers | INFO - Creating
recipe from modifiers
2025-12-17T20:56:18.292591+0000 | initialize | INFO - Compression
lifecycle initialized for 1 modifiers
2025-12-17T20:56:18.292874+0000 | IndependentPipeline | INFO - Inferred
`DataFreePipeline` for `QuantizationModifier`

Updating global scales: 100%|██████████| 224/224 [00:00<00:00, 648394.82it/s]

Fusing global scales: 647it [00:00, 511346.28it/s]

Calibrating weights: 100%|██████████| 224/224 [00:00<00:00, 1596.33it/s]
2025-12-17T20:56:53.594142+0000 | finalize | INFO - Compression
lifecycle finalized for 1 modifiers
2025-12-17T20:56:57.580914+0000 | post_process | WARNING - Optimized
model is not saved. To save, please provide`output_dir` as input arg.Ex.
`oneshot(..., output_dir=...)`
The attention mask and the pad token id were not set. As a consequence,
you may observe unexpected behavior. Please pass your input's
`attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.
The attention mask is not set and cannot be inferred from input because
pad token is same as eos token. As a consequence, you may observe
unexpected behavior. Please pass your input's `attention_mask` to obtain
reliable results.



========== SAMPLE GENERATION ==============
<|begin_of_text|>Hello my name is Sarah and I am a 30-year-old woman who
has been diagnosed with multiple sclerosis (MS). I am here to share my
story and to help raise awareness about this chronic and often
debilitating disease.
I was diagnosed with MS in 2010, when I was 25 years old. At the time, I
was working as a teacher and living a normal life. But suddenly, I
started experiencing strange symptoms such as numbness in my hands and
feet, blurred vision, and fatigue. I went
==========================================


2025-12-17T20:57:24.962901+0000 | get_model_compressor | INFO -
skip_sparsity_compression_stats set to True. Skipping sparsity
compression statistic calculations. No sparsity compressor will be
applied.

Compressing model: 224it [00:12, 18.36it/s]
</details>
<details>
  <summary>fp8_block.log</summary>


/home/HDCharles/rhdev/lib/python3.11/site-packages/transformers/utils/hub.py:110:
FutureWarning: Using `TRANSFORMERS_CACHE` is deprecated and will be
removed in v5 of Transformers. Use `HF_HOME` instead.
  warnings.warn(
`torch_dtype` is deprecated! Use `dtype` instead!

Loading checkpoint shards: 100%|██████████| 4/4 [00:00<00:00, 136.99it/s]
2025-12-17T20:57:53.946116+0000 | reset | INFO - Compression lifecycle
reset
2025-12-17T20:57:53.946848+0000 | from_modifiers | INFO - Creating
recipe from modifiers
2025-12-17T20:57:53.966319+0000 | initialize | INFO - Compression
lifecycle initialized for 1 modifiers
2025-12-17T20:57:53.966658+0000 | IndependentPipeline | INFO - Inferred
`DataFreePipeline` for `QuantizationModifier`

Updating global scales: 100%|██████████| 224/224 [00:00<00:00, 637397.62it/s]

Fusing global scales: 647it [00:00, 486415.97it/s]

Calibrating weights: 100%|██████████| 224/224 [00:00<00:00, 943.96it/s]
2025-12-17T20:58:00.043737+0000 | finalize | INFO - Compression
lifecycle finalized for 1 modifiers
2025-12-17T20:58:03.951940+0000 | post_process | WARNING - Optimized
model is not saved. To save, please provide`output_dir` as input arg.Ex.
`oneshot(..., output_dir=...)`
The attention mask and the pad token id were not set. As a consequence,
you may observe unexpected behavior. Please pass your input's
`attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.
The attention mask is not set and cannot be inferred from input because
pad token is same as eos token. As a consequence, you may observe
unexpected behavior. Please pass your input's `attention_mask` to obtain
reliable results.



========== SAMPLE GENERATION ==============
<|begin_of_text|>Hello my name is Kaitlyn and I am a 24-year-old
freelance writer and editor. I have a passion for storytelling and a
knack for crafting compelling narratives. I have a degree in English
Literature and have been writing professionally for over 5 years. I have
experience writing articles, blog posts, and website content for a
variety of clients, including businesses, non-profits, and individuals.
I am also skilled in editing and proofreading, and have worked with
clients to refine their writing and ensure it is error
==========================================


2025-12-17T20:58:34.036482+0000 | get_model_compressor | INFO -
skip_sparsity_compression_stats set to True. Skipping sparsity
compression statistic calculations. No sparsity compressor will be
applied.

Compressing model: 224it [00:11, 19.04it/s]

</details>

Signed-off-by: HDCharles <charlesdavidhernandez@gmail.com>
Signed-off-by: yiliu30 <yi4.liu@intel.com>
SUMMARY:
- Switching the collator from truncation to default + shuffling seems to
address the regression we're seeing in lm-eval
- Given that the recovery values in these tests were determined using
these settings, I think they should be how we evaluate our lm-eval tests
for the time being
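As a hedged illustration of the collator difference above, here is a pure-Python toy sketch (these are not the actual HF/llm-compressor collators; the function names and pad-token behavior are assumptions made for illustration):

```python
import random


def truncate_collate(batch):
    # old behavior: truncate every sequence to the shortest in the batch,
    # discarding tokens from the longer sequences
    n = min(len(x) for x in batch)
    return [x[:n] for x in batch]


def default_collate(batch, pad_id=0):
    # default behavior: pad every sequence to the longest in the batch,
    # so no calibration tokens are dropped
    n = max(len(x) for x in batch)
    return [x + [pad_id] * (n - len(x)) for x in batch]


def make_batches(dataset, batch_size, shuffle=True, seed=42):
    # shuffling with a fixed seed keeps runs reproducible while avoiding
    # any ordering bias in the calibration set
    data = list(dataset)
    if shuffle:
        random.Random(seed).shuffle(data)
    return [data[i:i + batch_size] for i in range(0, len(data), batch_size)]
```

The key point is that truncation silently drops calibration tokens, while padding preserves them, which plausibly explains the lm-eval recovery difference.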

---------

Signed-off-by: Dipika Sikka <ds3822@columbia.edu>
Co-authored-by: Brian Dellabetta <brian-dellabetta@users.noreply.github.com>
Summary
- Many tests get skipped in our base tests because the workflow was
missing the HF_TOKEN and did not have access to a GPU
- With this change, only the weekly and nightly tests outside of the
transformers folder are skipped; everything else is running

---------

Signed-off-by: Dipika Sikka <dipikasikka1@gmail.com>
SUMMARY:
Previously I added _orient and _reorient functions to handle this, and
there was a TODO to use flatten_for_calibration instead. That turned out
to be a little tricky since:
A) it's difficult to use it to reverse the transformation (see initial
commit)
B) it changes the shape to [1, num_H, num_W, block_H*block_W], and while
we can reshape that directly into the shape needed for the normalization
easily enough, we'd like to record the 4D shape [num_H, num_W, block_H,
block_W], which requires yet more steps to get and record.

Thus it seems simpler to do the unflatten and transpose manually,
exactly how we need it, then record the shapes, normalize, and lastly
use the recorded shapes to undo it all.
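The manual unflatten/transpose/record/undo flow described above can be sketched in NumPy roughly as follows (a hypothetical illustration, not the actual llm-compressor code; the function names and the per-block max-abs normalization are assumptions):

```python
import numpy as np


def normalize_blocks(w, block_h, block_w):
    # unflatten [H, W] -> [num_H, block_H, num_W, block_W],
    # then transpose to [num_H, num_W, block_H, block_W]
    num_h, num_w = w.shape[0] // block_h, w.shape[1] // block_w
    blocks = w.reshape(num_h, block_h, num_w, block_w).transpose(0, 2, 1, 3)
    shape4d = blocks.shape  # record the 4D shape for the undo step
    flat = blocks.reshape(num_h, num_w, -1)
    # normalize each block by its max absolute value (illustrative choice)
    scale = np.maximum(np.abs(flat).max(axis=-1, keepdims=True), 1e-12)
    return flat / scale, scale, shape4d


def denormalize_blocks(normed, scale, shape4d):
    # use the recorded 4D shape to undo the normalization and transposes
    num_h, num_w, block_h, block_w = shape4d
    blocks = (normed * scale).reshape(shape4d).transpose(0, 2, 1, 3)
    return blocks.reshape(num_h * block_h, num_w * block_w)
```

Because the axis permutation (0, 2, 1, 3) is its own inverse, the round trip recovers the original weight exactly, which is the property the _orient/_reorient pair needed.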

TEST PLAN:
pytest
/home/HDCharles/repos/llm-compressor/src/llmcompressor/modifiers/awq/base.py

---------

Signed-off-by: HDCharles <charlesdavidhernandez@gmail.com>
…llm-project#2113)

SUMMARY:
Add code to convert the gpt_oss model and then quantize the model to w8a8

TEST PLAN:
Tested with the gpt_oss model and ran it using vLLM
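For context, W8A8 quantizes both weights and activations to int8. A minimal per-output-channel symmetric weight-quantization sketch in NumPy (illustrative only; the actual conversion goes through llm-compressor's quantization modifiers, and these helper names are made up):

```python
import numpy as np


def quantize_w8(w):
    # symmetric per-output-channel int8: scale each row so its
    # max-abs value maps to 127
    scale = np.maximum(np.abs(w).max(axis=1, keepdims=True), 1e-12) / 127.0
    q = np.clip(np.round(w / scale), -128, 127).astype(np.int8)
    return q, scale


def dequantize(q, scale):
    # reconstruct an approximation of the original weights
    return q.astype(np.float32) * scale
```

With symmetric rounding, the reconstruction error per element is bounded by half the channel's scale, which is what keeps w8a8 accuracy close to the bf16 baseline.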

Sample output

=== Prompt 0 ===
<|start|>system<|message|>You are ChatGPT, a large language model
trained by OpenAI.
Knowledge cutoff: 2024-06
Current date: 2025-12-04

Reasoning: medium

Valid channels: analysis, commentary, final. Channel must be included
for every message.<|end|><|start|>developer<|message|># Instructions

You are a helpful assistant.

<|end|><|start|>user<|message|>Give 3 reasons to use
AI.<|end|><|start|>assistant

--- Candidate 0 ---
analysisWe need to answer with three reasons to use AI. Should be short.
Provide 3 reasons. Let's do it.assistantcommentaryHere are three reasons
to use AI:

Personalized learning – AI can adapt to your learning style and pace,
providing customized lessons that suit your learning style and progress.
Skill development – AI can help you develop skills that are in demand,
such as coding, data analysis, and data science.
Supportive design
We should provide a short answer.assistantfinalHere are three reasons to
use AI:

Personalized learning – AI adapts to your learning style and pace,
giving you personalized lessons that fit your needs.
Skill development – AI helps you learn and practice skills that are in
demand, like coding, data analysis, and data‑science skills.
Real‑world impact – AI helps you practice real-world skills in real
life, such as learning new skills and applying them in real life.
finish_reason: stop
num_tokens: 226

---------

Signed-off-by: Sharif Inamdar <sharif.inamdar@arm.com>
Co-authored-by: Dipika Sikka <dipikasikka1@gmail.com>
Co-authored-by: shanjiaz <43143795+shanjiaz@users.noreply.github.com>
@yiliu30 yiliu30 closed this Dec 22, 2025