[Self-Review] Enhance Autoround to support multiple cards tuning #12
Conversation
Signed-off-by: yiliu30 <yi4.liu@intel.com>
…mpressor-fork into autoround-version
👋 Hi! Thank you for contributing to llm-compressor. Please add the ready label when the PR is ready for review. Note: this is required to complete the testing suite; please only add the label once the PR is code complete and local testing has been performed.
Summary of Changes
Hello @yiliu30, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request enhances the AutoRound modifier to support tuning across multiple cards.
Highlights
/gemini review
Code Review
This pull request adds support for multi-device execution in the AutoRound modifier and includes new example scripts. The core changes in the modifier logic seem solid, particularly the introduction of suspend_accelerate_hooks to manage device offloading. My feedback primarily focuses on improving the portability and clarity of the new example scripts by removing hardcoded local file paths. I've also included a couple of minor suggestions to improve code style and comment clarity in the modifier itself.
    model_id = "meta-llama/Meta-Llama-3-8B-Instruct"
    model_id = "/storage/yiliu7/unsloth/DeepSeek-R1-BF16"
    # model_id = "/storage/yiliu7/deepseek-ai/DeepSeek-V2-Lite-Chat/"

The model_id is set to a hardcoded local path, which makes this example not portable. Please use a single, public model identifier from the Hugging Face Hub and remove the other commented or overwritten assignments for clarity.

Suggested change:

    model_id = "deepseek-ai/DeepSeek-V2-Lite-Chat"
examples/autoround/qwen3_example.py
    model_id = "Qwen/Qwen3-30B-A3B"
    # model_id = "/storage/yiliu7/Qwen/Qwen3-30B-A3B"
    # model_id = "/storage/yiliu7/Qwen/Qwen2.5-0.5B/"
    model_id = "/storage/yiliu7/Qwen/Qwen3-235B-A22B/"

The model_id is set to a hardcoded local path, which makes this example not portable. Please use a single, public model identifier from the Hugging Face Hub and remove the other commented or overwritten assignments for clarity. Using a smaller model might also be more suitable for an example script.

Suggested change:

    model_id = "Qwen/Qwen1.5-0.5B-Chat"
examples/autoround/qwen3_example.py
    SAVE_DIR = (
        "/storage/yiliu7/" + model_id.rstrip("/").split("/")[-1] + "-W4A16-G128-AutoRound"
    )
    @@ -239,14 +277,12 @@ def apply_autoround(self, state, subgraph):
        q_input=self._q_input,
        device=str(device),
        # Leave offload for LLMC

The comment # Leave offload for LLMC is now misleading since auto_offload is set to True. With the addition of suspend_accelerate_hooks, it seems the intention is now to use auto_round's internal offloading. The comment should be updated to reflect this change in behavior.

Suggested change:

    # Use auto_round's internal offloading
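For reviewers unfamiliar with the pattern being discussed, a minimal sketch of what a hook-suspending context manager could look like (this is a hypothetical illustration, not the PR's actual implementation; it assumes accelerate attaches its dispatch hook to a module under the `_hf_hook` attribute):

```python
from contextlib import contextmanager

@contextmanager
def suspend_accelerate_hooks(module):
    """Temporarily detach an accelerate dispatch hook so auto_round can
    manage device placement itself; restore the hook on exit."""
    hook = getattr(module, "_hf_hook", None)
    if hook is not None:
        delattr(module, "_hf_hook")
    try:
        yield module
    finally:
        if hook is not None:
            module._hf_hook = hook

# Demonstration with a stand-in object instead of a real torch module:
class Dummy:
    pass

m = Dummy()
m._hf_hook = "dispatch-hook"
with suspend_accelerate_hooks(m):
    inside = getattr(m, "_hf_hook", None)  # None while suspended
after = m._hf_hook  # restored after the context exits
```

The try/finally around the yield guarantees the hook is re-attached even if the tuning code inside the context raises.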
Code Review
This pull request adds support for device mapping in the AutoRound modifier, including two new example scripts. My review focuses on improving the portability and security of the new examples and enhancing the robustness of the AutoRoundModifier. The main suggestions are to remove hardcoded local paths from the example scripts in favor of public models from the Hugging Face Hub, warn users about the security implications of trust_remote_code=True, and use a more flexible default for device_map.
    model_id = "/storage/yiliu7/unsloth/DeepSeek-R1-BF16"
    # model_id = "/storage/yiliu7/deepseek-ai/DeepSeek-V2-Lite-Chat/"
    model = AutoModelForCausalLM.from_pretrained(
        model_id, torch_dtype="auto", trust_remote_code=True

Using trust_remote_code=True can introduce a security vulnerability if the model repository contains malicious code. It is crucial to warn users about this risk, especially in an example script that they might copy and run. Please add a comment explaining that users should only enable this if they trust the source of the model.
    model_id = "meta-llama/Meta-Llama-3-8B-Instruct"
    model_id = "/storage/yiliu7/unsloth/DeepSeek-R1-BF16"

The model_id is hardcoded to a local path, which makes this example not portable or runnable for other users. It's better to default to a model identifier from the Hugging Face Hub and provide the local path as a commented-out alternative for users who wish to use a local model.

Suggested change:

    model_id = "meta-llama/Meta-Llama-3-8B-Instruct"
    # model_id = "/storage/yiliu7/unsloth/DeepSeek-R1-BF16"
examples/autoround/qwen3_example.py
    model_id = "Qwen/Qwen3-30B-A3B"
    # model_id = "/storage/yiliu7/Qwen/Qwen3-30B-A3B"
    # model_id = "/storage/yiliu7/Qwen/Qwen2.5-0.5B/"
    model_id = "/storage/yiliu7/Qwen/Qwen3-235B-A22B/"

The model_id is hardcoded to a local path, which makes this example not runnable for other users. It's better to default to a model from the Hugging Face Hub and provide the local path as a commented-out alternative.

Suggested change:

    model_id = "Qwen/Qwen3-30B-A3B"
    # model_id = "/storage/yiliu7/Qwen/Qwen3-30B-A3B"
    # model_id = "/storage/yiliu7/Qwen/Qwen2.5-0.5B/"
    # model_id = "/storage/yiliu7/Qwen/Qwen3-235B-A22B/"
examples/autoround/qwen3_example.py
    SAVE_DIR = (
        "/storage/yiliu7/" + model_id.rstrip("/").split("/")[-1] + "-W4A16-G128-AutoRound"
    )

The SAVE_DIR is constructed using a hardcoded absolute path. This will cause the script to fail for any user who does not have the /storage/yiliu7/ directory. The output directory should be a relative path to make the example portable.

Suggested change:

    SAVE_DIR = model_id.rstrip("/").split("/")[-1] + "-W4A16-G128-AutoRound"

    iters: int = 200
    enable_torch_compile: bool = True
    batch_size: int = 8
    device_map: str = "0"

The default value for device_map is hardcoded to '0'. This assumes a GPU is available at index 0 and will cause a crash on CPU-only systems or systems with a different GPU configuration. Consider changing the default to 'auto' to let accelerate handle device placement automatically, which is more robust and user-friendly.

Suggested change:

    device_map: str = "auto"
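To make the suggestion concrete, here is a minimal sketch of the fields quoted above as a plain dataclass (hypothetical; the modifier in the PR may use pydantic or another base class), showing how an "auto" default avoids assuming a GPU at index 0:

```python
from dataclasses import dataclass

@dataclass
class AutoRoundSettings:
    """Hypothetical mirror of the modifier fields under review."""
    iters: int = 200
    enable_torch_compile: bool = True
    batch_size: int = 8
    # "auto" lets accelerate resolve placement, so the default also
    # works on CPU-only hosts or machines with non-standard GPU layouts.
    device_map: str = "auto"

cfg = AutoRoundSettings()
```

A user with a specific topology can still pass an explicit value, e.g. `AutoRoundSettings(device_map="0")`.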
…ct#2121) SUMMARY: Part of vllm-project#1927 - Updated type hints to Python 3.10+ built-in generics - Replaced List[] with list[] - No functional changes TEST PLAN: - Ran `make quality` (ruff format and lint checks) - Verified no functional code changes were introduced --------- Co-authored-by: Brian Dellabetta <brian-dellabetta@users.noreply.github.com>
/gemini review
Code Review
This pull request introduces support for multi-GPU tuning with device_map in the AutoRoundModifier. It also adds a new example for quantizing a large model (Qwen3-235B) using multiple GPUs. Additionally, it maps the ignore parameter to AutoRound's layer skipping mechanism and includes a bug fix related to accessing quantization scale attributes. The changes are well-implemented and include a new test case for the multi-GPU functionality. My main feedback is to clean up the new example file by removing development artifacts for better clarity.
examples/autoround/qwen3_example.py
    # FIXME: clean up model_id settings
    model_id = "Qwen/Qwen3-30B-A3B"
    model_id = "/storage/yiliu7/Qwen/Qwen3-30B-A3B"
    # model_id = "/storage/yiliu7/Qwen/Qwen2.5-0.5B/"
    model_id = "/storage/yiliu7/Qwen/Qwen3-235B-A22B/"
    model_id = "/models/Qwen3-30B-A3B"
    model_id = "Qwen/Qwen3-235B-A22B/"

This section for selecting the model ID contains multiple overwritten assignments and a FIXME comment. This looks like leftover development code and can be confusing for users of this example. Please clean this up to have a single, clear model_id assignment. If you want to show other possible models, it's better to list them in comments.

Suggested change:

    # You can try other models here, for example:
    # model_id = "Qwen/Qwen3-30B-A3B"
    model_id = "Qwen/Qwen3-235B-A22B/"
/gemini review
Code Review
This pull request introduces support for multi-GPU tuning with device_map in the AutoRound modifier. It also adds a new example script for a large model (Qwen3-235B) to demonstrate this feature, along with a corresponding test case. The changes to the AutoRoundModifier correctly handle device mapping and suspend accelerate hooks during the process. My feedback focuses on improving the new example script for better clarity and robustness.
examples/autoround/qwen3_example.py
    # FIXME: clean up model_id settings
    model_id = "Qwen/Qwen3-30B-A3B"
    model_id = "/storage/yiliu7/Qwen/Qwen3-30B-A3B"
    # model_id = "/storage/yiliu7/Qwen/Qwen2.5-0.5B/"
    model_id = "/storage/yiliu7/Qwen/Qwen3-235B-A22B/"
    model_id = "/models/Qwen3-30B-A3B"
    model_id = "Qwen/Qwen3-235B-A22B/"

The model_id is reassigned multiple times, with some lines commented out, and there's a FIXME comment. This seems like leftover code from development. It would be cleaner to remove the unused assignments and the FIXME comment, leaving only the desired model ID. For an example script, it's best to have a single, clear configuration.

Suggested change:

    model_id = "Qwen/Qwen3-235B-A22B/"
examples/autoround/qwen3_example.py
    print("==========================================\n\n")

    # Save to disk compressed.
    SAVE_DIR = model_id.rstrip("/").split("/")[-1] + "-W4A16-G128-AutoRound"

Using split('/')[-1] to get the model name from the path is not very robust as it assumes a Unix-like path separator. A more cross-platform and readable approach is to use os.path.basename. You will need to add import os at the top of the file.

Suggested change:

    SAVE_DIR = os.path.basename(model_id.rstrip("/")) + "-W4A16-G128-AutoRound"
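To illustrate the suggestion, `os.path.basename` recovers the final path component without manual splitting, and it also works for Hub-style IDs (the local path below is only illustrative):

```python
import os

# Illustrative local path, as used in the example under review:
model_id = "/storage/yiliu7/Qwen/Qwen3-235B-A22B/"
save_dir = os.path.basename(model_id.rstrip("/")) + "-W4A16-G128-AutoRound"
print(save_dir)  # Qwen3-235B-A22B-W4A16-G128-AutoRound

# Hub-style IDs yield the repo name the same way:
hub_id = "Qwen/Qwen3-30B-A3B"
print(os.path.basename(hub_id))  # Qwen3-30B-A3B
```

The `rstrip("/")` is still needed because `basename` of a path ending in a separator is an empty string.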
hshen14 left a comment:
LGTM. You may want to prepare a README showing the instructions and results for the example.
…-project#2034) SUMMARY: This is part of vllm-project#1927 Modernize type annotations using | operator and built-in generics in the transformer module as part of codebase modernization effort. TEST PLAN: ``` make style make quality make tests ``` Notes: Happy to address any comments! Thank you! --------- Co-authored-by: Brian Dellabetta <brian-dellabetta@users.noreply.github.com>
SUMMARY: Added examples for fp8 awq which now work after the AWQ generalization. TEST PLAN:
python $REPOS/llm-compressor/examples/awq/fp8_dynamic_llama_example.py 2>&1 | tee fp8_dynamic.log
python $REPOS/llm-compressor/examples/awq/fp8_block_llama_example.py 2>&1 | tee fp8_block.log
Both runs loaded checkpoints, completed the compression lifecycle, and produced coherent sample generations (full logs with checkpoint-loading and compression progress bars omitted). Signed-off-by: HDCharles <charlesdavidhernandez@gmail.com>
SUMMARY: - Seems like setting the collator from tuncation to default + shuffling addresses the regression we're seeing in lm-eval - Given the recovery values you see in these tests were determined using these settings, I think they should be how we evaluate our lm-eval tests for the time being --------- Signed-off-by: Dipika Sikka <ds3822@columbia.edu> Co-authored-by: Brian Dellabetta <brian-dellabetta@users.noreply.github.com>
Summary - Many tests get skipped in our base tests because the workflow was missing the HF_TOKEN and did not have access to a gpu - With this change, only the weekly and nightly tests outside of the transformers folder are skipped and everything else is running --------- Signed-off-by: Dipika Sikka <dipikasikka1@gmail.com>
SUMMARY: Previously I added _orient and _reorient functions to handle this, there was a TODO to use flatten_for_calibration, this was a little tricky since A) its difficult to use to reverse the transformation (see initial commit) B) it changes the shape to [1, num_H, num_W, block_H*block_W] and while we can reshape that directly into the shape needed for the normalization easily enough, we'd like to record the 4D shape [num_H, num_W, block_H, block_W] which requires yet more steps to get and record. Thus it seems simpler to do the unflatten and transpose manually, exactly how we need it, then record shapes, normalize and lastly use the recorded shapes to undo it all. TEST PLAN: pytest /home/HDCharles/repos/llm-compressor/src/llmcompressor/modifiers/awq/base.py --------- Signed-off-by: HDCharles <charlesdavidhernandez@gmail.com>
…llm-project#2113) SUMMARY: Add code to convert the gpt_oss model and then quantize model to w8a8 TEST PLAN: Tested with gpt_oss model and run using vllm Sample output === Prompt 0 === <|start|>system<|message|>You are ChatGPT, a large language model trained by OpenAI. Knowledge cutoff: 2024-06 Current date: 2025-12-04 Reasoning: medium Valid channels: analysis, commentary, final. Channel must be included for every message.<|end|><|start|>developer<|message|># Instructions You are a helpful assistant. <|end|><|start|>user<|message|>Give 3 reasons to use AI.<|end|><|start|>assistant --- Candidate 0 --- analysisWe need to answer with three reasons to use AI. Should be short. Provide 3 reasons. Let's do it.assistantcommentaryHere are three reasons to use AI: Personalized learning – AI can adapt to your learning style and pace, providing customized lessons that suit your learning style and progress. Skill development – AI can help you develop skills that are in demand, such as coding, data analysis, and data science. Supportive design We should provide a short answer.assistantfinalHere are three reasons to use AI: Personalized learning – AI adapts to your learning style and pace, giving you personalized lessons that fit your needs. Skill development – AI helps you learn and practice skills that are in demand, like coding, data analysis, and data‑science skills. Real‑world impact – AI helps you practice real-world skills in real life, such as learning new skills and applying them in real life. finish_reason: stop num_tokens: 226 --------- Signed-off-by: Sharif Inamdar <sharif.inamdar@arm.com> Co-authored-by: Dipika Sikka <dipikasikka1@gmail.com> Co-authored-by: shanjiaz <43143795+shanjiaz@users.noreply.github.com>
AutoRound uses a block-level reconstruction loss to fine-tune quantization parameters, which requires running backward passes on each block. For large models, such as Qwen3-235B, a single GPU often doesn't have enough memory to hold an entire block during backward computation. To address this, we use the accelerator to dispatch the module across multiple devices.

In this PR, we enable this feature on the LLMC side:
- Support `device_map` for tuning with multiple cards
- Map `ignore` to AutoRound's layer skipping

Test plan

Example results
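As a rough illustration of how a multi-card `device_map` string could be interpreted, the sketch below resolves values such as `"0,1"` into CUDA device identifiers (a hypothetical helper for exposition only, not the PR's actual code):

```python
def resolve_device_map(device_map: str) -> list[str]:
    """Hypothetical helper: interpret a device_map string.

    "auto" -> defer to accelerate's automatic placement
    "0,1"  -> explicit CUDA device indices for multi-card tuning
    """
    if device_map == "auto":
        return ["auto"]
    return [f"cuda:{idx.strip()}" for idx in device_map.split(",")]

print(resolve_device_map("0,1"))   # ['cuda:0', 'cuda:1']
print(resolve_device_map("auto"))  # ['auto']
```

With an explicit list of target devices, the block under tuning can then be dispatched across those cards (e.g. via accelerate), which is what allows the backward pass over a whole block to fit in aggregate GPU memory.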