Conversation
Signed-off-by: yiliu30 <yi4.liu@intel.com>
👋 Hi! Thank you for contributing to llm-compressor. Please add the `ready` label when the PR is ready for review. Note: the label is required to trigger the full testing suite, so please add it only once the PR is code complete and local testing has been performed.
Signed-off-by: yiliu30 <yi4.liu@intel.com>
/gemini review
Code Review
This pull request introduces an experimental integration with the auto_round library for quantization. While this is a significant feature, the current implementation appears to be in an early, experimental stage and contains several issues that should be addressed before merging. The example scripts are cluttered with hardcoded paths and commented-out code, making them difficult to use. The core logic in gptq/base.py relies on internal methods of auto_round, which is a fragile approach. There are also FIXME comments and incomplete logic, indicating that the feature is not fully implemented. Additionally, debugging code has been added to production paths. I recommend cleaning up the example scripts, completing the auto_round integration using its public API, and removing the debugging code.
    --dtype bfloat16 \
    --port 8099 \
    --no-enable-prefix-caching \
    --trust-remote-code 2>&1 | tee $log_file
for name, module in decoding_layer.named_modules():
    if hasattr(module, "weight_scale") and hasattr(module, "weight_zero_point"):
        logger.info(f"Updating offload parameters for module {getattr(module, '_tmp_name', '')} || {name}")
        # weight = module.weight
        scale = module.scale
        del module.scale
        # zero_point = module.weight_zero_point
        # offload_device = torch.device("cpu")
        # module.weight.data.copy_(weight.data)
        # module.weight_scale.data.copy_(scale.data)
        # module.weight_zero_point.data.copy_(zero_point.data)
        # update_offload_parameter(module, "weight", weight)
        # FIXME: yi quantize the weight based on the scale and zero_point from AutoRound
        update_offload_parameter(module, "weight_scale", scale)
        # update_offload_parameter(module, "weight_zero_point", zero_point)
The logic for applying the quantization results seems incomplete. The FIXME comment, the commented-out update for weight_zero_point, and the deletion of module.scale suggest that the quantization is not being correctly applied to the model's weights. This is a critical issue that needs to be resolved for the feature to work correctly.
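To make the concern concrete: the step the FIXME defers is ordinary round-to-nearest quantization of the weight against the returned scale and zero point. Below is a minimal, self-contained sketch of that math; the group size, bit width, and asymmetric scheme are illustrative assumptions, not AutoRound's exact layout.

```python
# Illustrative sketch of group-wise asymmetric quantization -- the math the
# FIXME leaves unimplemented. Group size and bit width are assumptions here.

def quantize_group(weights, scale, zero_point, num_bits=4):
    """Round-to-nearest: q = clamp(round(w / scale) + zp, 0, 2^bits - 1)."""
    qmax = (1 << num_bits) - 1
    return [min(max(round(w / scale) + zero_point, 0), qmax) for w in weights]

def dequantize_group(qweights, scale, zero_point):
    """Recover an approximation of the original weights: w ~= (q - zp) * scale."""
    return [(q - zero_point) * scale for q in qweights]

def quantize_per_group(weights, group_size=4, num_bits=4):
    """Split a flat weight row into groups, each with its own scale/zero-point."""
    qmax = (1 << num_bits) - 1
    out, scales, zps = [], [], []
    for start in range(0, len(weights), group_size):
        group = weights[start:start + group_size]
        w_min, w_max = min(group), max(group)
        scale = (w_max - w_min) / qmax or 1.0  # guard against flat groups
        zp = round(-w_min / scale)
        out.extend(quantize_group(group, scale, zp, num_bits))
        scales.append(scale)
        zps.append(zp)
    return out, scales, zps
```

In the PR, the quantized (or dequantized) result of this step is what the weight update alongside `weight_scale` and `weight_zero_point` would need to carry; simply storing the scale while leaving the weight untouched is not sufficient.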
def autoround(self, state):
    cur_layer_idx = self._cur_layer_idx
    self._cur_layer_idx += 1
    logger.info(f">>||>> AutoRound for decoding layer index {cur_layer_idx}")
    if cur_layer_idx >= len(state.model.model.layers):
        logger.info(">>||>> All decoding layers have been processed for AutoRound.")
        self.compress_modules(return_directly=False)
        return
    decoding_layer = state.model.model.layers[cur_layer_idx]
    new_model = _PretrainModelWrapper()
    new_model.model.layers.append(decoding_layer)
    logger.warning(f">>||>> Starting AutoRound for decoding layer {getattr(decoding_layer, '_tmp_name', '')}")
    param = next(decoding_layer.parameters())
    new_model.dtype = param.dtype
    with align_module_device(decoding_layer):
        if _DEBUG:
            iters = 4
        else:
            iters = 200
        import auto_round
        rounder = auto_round.AutoRound(
            model=new_model,
            tokenizer="",
            scheme="W4A16",
            # iters = 128,
            iters=iters,
            enable_quanted_input=False,
            # FIXME: batch size 1 causes error, looks like related to the input_others prepare
            # batch_size=1
            enable_torch_compile=True,
            # enable_deterministic_algorithms=True,
        )
        # block: torch.nn.Module,
        # input_ids: list[torch.Tensor],
        # input_others: dict,
        # q_input: Union[None, torch.Tensor] = None,
        # device: Union[str, torch.device] = "cpu",
from auto_round.compressors.utils import set_layer_config

def _preprocess(self):
    self.layer_config, self.has_qlayer_outside_block, self.regex_config = set_layer_config(
        self.model,
        self.layer_config,
        self.scheme,
        self.scale_dtype,
        self.supported_types,
        self.inner_supported_types,
        self.quant_block_list,
        self.fp_layers,
        self.quant_lm_head,
        # enable_gguf_official_mixed=enable_gguf_official_mixed,
        is_mllm=self.mllm,
    )
    self.batch_dim = 0

_preprocess(rounder)
input_name = f"model.layers.{cur_layer_idx}"
cur_inputs = all_module_input[input_name]
input_ids = []
input_others = {}
positional_inputs = []
attention_mask = None
position_ids = None
cache_position = None
position_embeddings = (None, None)
for cur_inp in cur_inputs:
    _input_ids = cur_inp[0][0][0]
    input_ids.append(cur_inp[0][0][0])
    for key, val in cur_inp[0][1].items():
        if key == "position_ids":
            position_ids = val
        elif key == "position_embeddings":
            position_embeddings = val
        elif key == "cache_position":
            cache_position = val
input_others["position_ids"] = position_ids
input_others["positional_inputs"] = positional_inputs
input_others["attention_mask"] = attention_mask
input_others["position_embeddings"] = position_embeddings
input_others["cache_position"] = cache_position
decoding_layer.tuning_device = torch.device("cuda")
with torch.enable_grad():
    rounder._quantize_block(
        block=decoding_layer,
        input_ids=input_ids,
        input_others=input_others,
        q_input=None,
        device="cuda",
    )
logger.info(f"AutoRound completed for decoding layer {getattr(decoding_layer, '_tmp_name', '')}")
# decoding_layer.mlp.up_proj.weight_scale.shape
# decoding_layer.mlp.up_proj.weight_zero_point.shape
logger.info(f"Weight scale shape: {decoding_layer.mlp.up_proj.weight_scale.shape}, zero point shape: {decoding_layer.mlp.up_proj.weight_zero_point.shape}")
logger.info(f"Weight scale shape: {decoding_layer.mlp.down_proj.weight_scale.shape}, zero point shape: {decoding_layer.mlp.down_proj.weight_zero_point.shape}")
# decoding_layer.to("cuda")
print(f" decoding_layer.mlp.down_proj{ decoding_layer.mlp.down_proj}")
for name, module in decoding_layer.named_modules():
    if hasattr(module, "weight_scale") and hasattr(module, "weight_zero_point"):
        logger.info(f"Updating offload parameters for module {getattr(module, '_tmp_name', '')} || {name}")
        # weight = module.weight
        scale = module.scale
        del module.scale
        # zero_point = module.weight_zero_point
        # offload_device = torch.device("cpu")
        # module.weight.data.copy_(weight.data)
        # module.weight_scale.data.copy_(scale.data)
        # module.weight_zero_point.data.copy_(zero_point.data)
        # update_offload_parameter(module, "weight", weight)
        # FIXME: yi quantize the weight based on the scale and zero_point from AutoRound
        update_offload_parameter(module, "weight_scale", scale)
        # update_offload_parameter(module, "weight_zero_point", zero_point)
# state.model.model.layers[0].mlp.gate_proj.weightbt
for module in list(self._num_samples.keys()):
    name = self._module_names[module]
    del self._num_samples[module]
decoding_layer.eval()
all_module_input.clear()
all_module_output.clear()
# logger.info(f"state.model.model.layers[0].mlp.gate_proj.weight: {state.model.model.layers[0].mlp.gate_proj.weight.device}")
The autoround method's integration with the auto_round library appears to be fragile as it relies on its internal methods like _preprocess and _quantize_block. This can easily break with new versions of auto_round. It is recommended to use the public API of the library for a more robust integration.
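For comparison, the same whole-model flow can be driven through auto_round's documented top-level entry points rather than `_preprocess`/`_quantize_block`. The sketch below is a hypothetical outline only: the constructor keywords and the `quantize()`/`save_quantized()` methods follow auto_round's README-style usage and should be verified against the version pinned by this PR.

```python
# Hypothetical sketch of using auto_round's public API instead of internal
# helpers; names follow README-style usage and must be checked against the
# installed auto_round version.

def quantize_with_public_api(model, tokenizer, output_dir, iters=200):
    from auto_round import AutoRound  # imported lazily, as in the PR

    rounder = AutoRound(
        model,
        tokenizer,
        bits=4,          # W4A16, matching the scheme used in this PR
        group_size=128,  # assumed group size; not taken from the diff
        iters=iters,
    )
    rounder.quantize()                  # calibration + rounding in one call
    rounder.save_quantized(output_dir)  # persist the quantized checkpoint
    return rounder
```

This trades the per-block sequential control the modifier currently implements for API stability; if per-block control is truly required, pinning the auto_round version and failing fast when the private attributes are missing would at least contain the breakage.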
model_id = "meta-llama/Meta-Llama-3-8B-Instruct"
model_id = "/data5/yliu7/HF_HOME/meta-llama/Llama-3.2-1B-Instruct"
model_id = "Qwen/Qwen2.5-0.5B"
model_id = "/data5/yliu7/HF_HOME/Qwen/Qwen2.5-0.5B"
model_id = "/data5/yliu7/meta-llama/meta-llama/Meta-Llama-3.1-8B-Instruct"
# model_id = "facebook/opt-125m"
# ds = load_dataset(DATASET_ID, split=f"{DATASET_SPLIT}[:{NUM_CALIBRATION_SAMPLES}]")
# ds = ds.shuffle(seed=42)

def preprocess(example):
    return {
        "text": tokenizer.apply_chat_template(
            example["messages"],
            tokenize=False,
        )
    }
# def preprocess(example):
#     return {
#         "text": tokenizer.apply_chat_template(
#             example["messages"],
#             tokenize=False,
#         )
#     }

ds = ds.map(preprocess)
# ds = ds.map(preprocess)

# Tokenize inputs.
def tokenize(sample):
    return tokenizer(
        sample["text"],
        padding=False,
        max_length=MAX_SEQUENCE_LENGTH,
        truncation=True,
        add_special_tokens=False,
    )
# # Tokenize inputs.
# def tokenize(sample):
#     return tokenizer(
#         sample["text"],
#         padding=False,
#         max_length=MAX_SEQUENCE_LENGTH,
#         truncation=True,
#         add_special_tokens=False,
#     )

ds = ds.map(tokenize, remove_columns=ds.column_names)
# ds = ds.map(tokenize, remove_columns=ds.column_names)
model_path="/data5/yliu7/tmp/Qwen2.5-0.5B-W4A16-G128"
model_path="/data5/yliu7/tmp/Meta-Llama-3.1-8B-Instruct-W4A16-G128"
model_path="/data5/yliu7/tmp/Meta-Llama-3.1-8B-Instruct-W4A16-G128-with-shuffule"
model_path="/data5/yliu7/tmp/Meta-Llama-3.1-8B-Instruct-W4A16-G128-disbale-shuffule/"
model_path=/home/yliu7/workspace/inc/3rd-party/llm-compressor/examples/quantization_non_uniform/Llama-3.2-1B-Instruct-NVFP4-FP8-Dynamic
model_path="/data5/yliu7/HF_HOME/qwen_moe_skip_lm_head"
model_path=/data5/yliu7/HF_HOME/ByteDance-Seed/Seed-OSS-36B-Instruct
model_path=/data5/yliu7/HF_HOME/GLM-4.5-Air-w8afp8-llmc/GLM-4.5-Air-w8afp8
# model_path=/data5/yliu7/HF_HOME/Llama-3.2-1B-Instruct-NVFPP_B16/
# model_path=/data5/yliu7/HF_HOME/meta-llama/Llama-3.2-1B-Instruct/
model_path=/data5/yliu7/HF_HOME/Yi30/gpt-oss-20b-BF16-MXFP8/
model_path=/data5/yliu7/HF_HOME/Yi30/gpt-oss-120b-BF16-unsloth-MXFP8
model_path=/data6/yiliu4/unsloth-gpt-oss-120b-BF16-ar-MXFP4/
model_path="/data5/yliu7/HF_HOME/unsloth-gpt-oss-20b-BF16-ar-MXFP4/"
model_path="/data5/yliu7/HF_HOME/unsloth-gpt-oss-20b-BF16-ar-MXFP4/"
model_path="/data5/yliu7/tmp/Qwen2.5-0.5B-W4A16-G128"
model_path="/data5/yliu7/tmp/Meta-Llama-3.1-8B-Instruct-W4A16-G128"
model_path="/data5/yliu7/tmp/Meta-Llama-3.1-8B-Instruct-W4A16-G128-with-shuffule"
model_path="/data5/yliu7/tmp/Meta-Llama-3.1-8B-Instruct-W4A16-G128-disbale-shuffule/"
# model_path="/data5/yliu7/meta-llama/meta-llama/Meta-Llama-3.1-8B-Instruct-AR-W4G128"
# model_path="/data5/yliu7/meta-llama/meta-llama/Meta-Llama-3.1-8B-Instruct-AR-W4G128"
# model_path=/data5/yliu7/HF_HOME/unsloth-gpt-oss-20b-BF16-ar-MXFP4/
# model_path=/data6/yiliu4/unsloth-gpt-oss-120b-BF16-ar-MXFP4/
# model_path=/data5/yliu7/HF_HOME/Yi30/unsloth-gpt-oss-20b-BF16-MXFP4
# model_path=/models/DeepSeek-V2-Lite-Chat/
MODEL_ID = "meta-llama/Meta-Llama-3-8B-Instruct"

MODEL_ID = "/data5/yliu7/HF_HOME/meta-llama/Llama-3.2-1B-Instruct"
all_module_input = defaultdict(list)
all_module_output = defaultdict(list)
The use of module-level global dictionaries all_module_input and all_module_output to pass data between hooks and the autoround method introduces global state, which can be hard to manage and debug. Consider using instance attributes on the modifier to store this state, making the data flow more explicit and encapsulated.
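One way to realize this suggestion: hold the captured activations on an object owned by the modifier and register forward pre-hooks that write into it. A minimal sketch of that pattern follows; the class and method names are illustrative, an `nn.Linear` stands in for a decoder layer, and `register_forward_pre_hook(..., with_kwargs=True)` requires torch >= 2.0.

```python
# Illustrative sketch: per-module input capture as instance state, replacing
# the module-level all_module_input/all_module_output globals.
from collections import defaultdict

import torch

class InputCapture:
    """Holds per-module calibration inputs on the instance."""

    def __init__(self):
        self.module_inputs = defaultdict(list)
        self._handles = []

    def attach(self, name, module):
        def pre_hook(mod, args, kwargs):
            # Record positional and keyword inputs for later replay.
            self.module_inputs[name].append((args, kwargs))
            return None  # leave the forward call unchanged

        self._handles.append(
            module.register_forward_pre_hook(pre_hook, with_kwargs=True)
        )

    def detach(self):
        for handle in self._handles:
            handle.remove()
        self._handles.clear()

    def clear(self):
        self.module_inputs.clear()

# Stand-in usage: capture one forward pass through a linear layer.
layer = torch.nn.Linear(4, 4)
capture = InputCapture()
capture.attach("model.layers.0", layer)
layer(torch.randn(2, 4))
capture.detach()
```

Because the capture object is owned by the modifier, its lifetime (and the `clear()` call at the end of each layer) is explicit, instead of relying on every code path remembering to reset shared globals.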
class _LLModelWrapper(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.layers = torch.nn.ModuleList()

    def forward(self, *args, **kwargs):
        for layer in self.layers:
            res = layer(*args, **kwargs)
        return res


class _PretrainModelWrapper(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.model = _LLModelWrapper()

    def forward(self, *args, **kwargs):
        return self.model(*args, **kwargs)
The wrapper classes _LLModelWrapper and _PretrainModelWrapper have very generic names. To improve code clarity, consider renaming them to be more descriptive of their specific purpose, such as SingleLayerWrapper. Additionally, _PretrainModelWrapper appears to be a redundant wrapper around _LLModelWrapper and could potentially be removed.
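Following the naming suggestion, a sketch of what that could look like: a single descriptively named wrapper that takes the layer at construction time while still exposing the nested `model.layers` attribute path the PR relies on (which is why an inner container survives here). `SingleLayerWrapper` is the reviewer's proposed name; the rest is illustrative, not the PR's code.

```python
# Illustrative sketch of the suggested rename; an nn.Linear stands in for a
# decoder block in the usage below.
import torch

class _LayerList(torch.nn.Module):
    """Inner container providing the `.layers` attribute AutoRound traverses."""

    def __init__(self, layer):
        super().__init__()
        self.layers = torch.nn.ModuleList([layer])

    def forward(self, *args, **kwargs):
        res = None  # defined even if the list were empty
        for layer in self.layers:
            res = layer(*args, **kwargs)
        return res

class SingleLayerWrapper(torch.nn.Module):
    """Presents one decoder block as a `model.layers[...]`-shaped causal LM."""

    def __init__(self, layer):
        super().__init__()
        self.model = _LayerList(layer)
        # Mirror the dtype bookkeeping done manually in the PR's autoround().
        self.dtype = next(layer.parameters()).dtype

    def forward(self, *args, **kwargs):
        return self.model(*args, **kwargs)
```

Constructing the wrapper from the layer also removes the append-after-construction step and the manual `new_model.dtype` assignment in `autoround()`.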
SUMMARY:
"please provide a brief summary"
TEST PLAN:
"please outline how the changes were tested"