Conversation
Signed-off-by: yiliu30 <yi4.liu@intel.com>
👋 Hi! Thank you for contributing to llm-compressor. Please add the ready label when the PR is ready for review. Note: This is required to complete the testing suite; please only add the label once the PR is code complete and local testing has been performed.
Summary of Changes

Hello @yiliu30, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed. This pull request significantly improves the flexibility of the AutoRound quantization modifier by enabling it to work effectively with custom calibration datasets. The core enhancement involves modifying the AutoRoundModifier to correctly process and incorporate attention masks from the input data during the quantization procedure. This ensures that the algorithm can accurately handle padded sequences, which is crucial for diverse datasets. To illustrate this new capability, a comprehensive example script has been added, showcasing the use of AutoRound with a Llama 3 model and a Hugging Face dataset.
Code Review
This pull request enables the use of custom datasets with attention masks for AutoRound quantization by collecting attention masks from the calibration data. It also adds a new example demonstrating this functionality. The core logic change in AutoRoundModifier has a critical flaw where it consumes the calibration dataloader prematurely, which will cause the calibration process to fail. I've provided a detailed explanation of the issue and a recommended fix. Additionally, I've included some suggestions to improve the clarity and robustness of the new example file.
```python
for batch in state.data.calib:
    if "attention_mask" in batch:
        self._attention_mask_list.append(batch["attention_mask"])
```
This loop over state.data.calib will consume the DataLoader, leaving it empty for the actual calibration process that follows. This will likely cause calibration to fail or produce incorrect results as it will run on no data.
A more robust approach would be to collect the attention_mask during the calibration forward passes using a hook. This would avoid consuming the dataloader prematurely. You could add a forward pre-hook to state.model in the on_start method to capture the attention_mask from the keyword arguments of each batch as it's processed. This is the recommended fix.
Example of a hook-based implementation in on_start:
```python
def attention_mask_capture_hook(module, *args, **kwargs):
    if "attention_mask" in kwargs and kwargs["attention_mask"] is not None:
        self._attention_mask_list.append(kwargs["attention_mask"])

self.register_hook(
    state.model, attention_mask_capture_hook, "forward_pre", with_kwargs=True
)
```

This would require removing the added loop from `on_initialize`.
```python
ds = ds.map(tokenize, remove_columns=ds.column_names)


# Configure the quantization algorithm to run.
# * quantize the weights to 4 bit with GPTQ with a group size 128
```
The comment is misleading as it refers to GPTQ, but the AutoRoundModifier uses the AutoRound algorithm. This could cause confusion for users of this example.
```diff
- # * quantize the weights to 4 bit with GPTQ with a group size 128
+ # * quantize the weights to 4 bit with AutoRound with a group size 128
```
```python
print("==========================================\n\n")


# Save to disk compressed.
SAVE_DIR = model_id.rstrip("/").split("/")[-1] + "-W4A16-G128"
```
The save directory name is constructed with a hardcoded "-W4A16-G128". The G128 part, representing a group size of 128, is an implicit default of the W4A16 scheme and is not explicitly defined in the recipe. This makes the code brittle. If the default for the scheme changes in the future, this directory name will become misleading. For an example this might be acceptable, but it's worth noting.
/gemini review
Code Review
This pull request enables the use of custom datasets for AutoRound quantization, adding a new example script and modifying the core AutoRoundModifier. The changes correctly gather attention masks from calibration data. My review includes suggestions to improve the new example's clarity and robustness, such as removing redundant code and making file paths programmatic. I've also pointed out development artifacts like FIXME comments and breakpoint() calls in the core logic that should be cleaned up before merging.
```python
# Use batched=True for efficiency
ds = ds.map(fix_batch_if_needed)
```
The comment on line 57 suggests using batched=True for efficiency, but it's missing in this .map() call. The fix_batch_if_needed function is designed to work with batches. Please add batched=True for correctness and performance.
```diff
- ds = ds.map(fix_batch_if_needed)
+ ds = ds.map(fix_batch_if_needed, batched=True)
```
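For readers unfamiliar with the `datasets` batched contract, the following dependency-free sketch shows why `batched=True` matters: `.map()` then passes a dict of column lists rather than a single example, so the function must handle `List[List[int]]` masks. The body of `fix_batch_if_needed` below is a hypothetical stand-in, not the PR's actual implementation.

```python
# Hypothetical stand-in for the example's fix_batch_if_needed: under
# batched=True, batch["attention_mask"] is a List[List[int]] (one inner
# list per example), so the fix must iterate over rows.
def fix_batch_if_needed(batch):
    # clamp any stray mask values into {0, 1}, row by row
    return {
        "attention_mask": [
            [int(bool(v)) for v in row] for row in batch["attention_mask"]
        ]
    }

# Batched shape: a dict of column lists, two examples at once.
batch = {"attention_mask": [[1, 1, 0], [1, 1, 1]]}
print(fix_batch_if_needed(batch)["attention_mask"])  # [[1, 1, 0], [1, 1, 1]]
```

Without `batched=True`, the same function would receive one example at a time (e.g. `{"attention_mask": [1, 1, 0]}`), and the row iteration above would silently operate on individual ints.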
```python
model_id = "meta-llama/Meta-Llama-3-8B-Instruct"
model_id = "/models/Qwen3-0.6B-Base"
```
The variable model_id is assigned on line 10 and then immediately reassigned on line 11. The first assignment is redundant and can be removed to avoid confusion.
```diff
- model_id = "meta-llama/Meta-Llama-3-8B-Instruct"
  model_id = "/models/Qwen3-0.6B-Base"
```
```python
print("==========================================\n\n")


# Save to disk compressed.
SAVE_DIR = model_id.rstrip("/").split("/")[-1] + "-W4A16-G128-AutoRound-Untrachat200k"
```
The SAVE_DIR is constructed with several hardcoded strings (-W4A16-G128-AutoRound-Untrachat200k). This makes the example brittle if parameters like the quantization scheme or dataset change. Consider constructing this string programmatically from the relevant variables (recipe, DATASET_ID, etc.) to improve maintainability and make the example more robust.
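One way the programmatic construction could look, as a sketch; the variable names `scheme`, `group_size`, and the `DATASET_ID` value below are illustrative assumptions, not taken from the PR:

```python
# Derive the save directory from configuration variables instead of a
# hardcoded suffix; all names below are illustrative assumptions.
model_id = "/models/Qwen3-0.6B-Base"
DATASET_ID = "HuggingFaceH4/ultrachat_200k"
scheme = "W4A16"
group_size = 128

model_name = model_id.rstrip("/").split("/")[-1]
dataset_name = DATASET_ID.split("/")[-1]
SAVE_DIR = f"{model_name}-{scheme}-G{group_size}-AutoRound-{dataset_name}"
print(SAVE_DIR)  # Qwen3-0.6B-Base-W4A16-G128-AutoRound-ultrachat_200k
```

If the quantization scheme or dataset is later changed in one place, the directory name stays in sync automatically.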
```python
return ar_quant_scheme


# FIXME: refine code
```
```python
if m.ndim == 2:
    all_ones_rows = torch.all(m == 1, dim=1)  # [batch]
    if torch.any(all_ones_rows):
        # breakpoint()
```
```python
batch['attention_mask'] is List[List[int]] when batched=True.
Convert to tensor, fix, then convert back to list to stay compatible with datasets.
"""
# breakpoint()
```
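For reference, a dependency-free equivalent of the diff's `torch.all(m == 1, dim=1)` detection, operating directly on the `List[List[int]]` form the docstring describes (the helper name is a hypothetical illustration, not from the PR):

```python
def find_all_ones_rows(mask_rows):
    """Return indices of attention-mask rows consisting entirely of ones."""
    return [
        i for i, row in enumerate(mask_rows)
        if row and all(v == 1 for v in row)
    ]

masks = [[1, 1, 1], [1, 1, 0], [1, 1, 1]]
print(find_all_ones_rows(masks))  # [0, 2]
```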