Ar by yiliu30 · Pull Request #7 · yiliu30/llm-compressor-fork

yiliu30 · 2025-10-31T12:41:17Z

SUMMARY:
"please provide a brief summary"

TEST PLAN:
"please outline how the changes were tested"

Signed-off-by: yiliu30 <yi4.liu@intel.com>

github-actions · 2025-10-31T12:41:33Z

👋 Hi! Thank you for contributing to llm-compressor. Please add the ready label when the PR is ready for review.

Note: This is required to complete the testing suite, please only add the label once the PR is code complete and local testing has been performed.

Signed-off-by: yiliu30 <yi4.liu@intel.com>

yiliu30 · 2025-11-03T11:22:27Z

/gemini review

gemini-code-assist

Code Review

This pull request introduces an experimental integration with the auto_round library for quantization. While this is a significant feature, the current implementation appears to be in an early, experimental stage and contains several issues that should be addressed before merging. The example scripts are cluttered with hardcoded paths and commented-out code, making them difficult to use. The core logic in gptq/base.py relies on internal methods of auto_round, which is a fragile approach. There are also FIXME comments and incomplete logic, indicating that the feature is not fully implemented. Additionally, debugging code has been added to production paths. I recommend cleaning up the example scripts, completing the auto_round integration using its public API, and removing the debugging code.

gemini-code-assist · 2025-11-03T11:23:50Z

examples/quantization_w4a16/run_vllm.sh

+    --dtype bfloat16 \
+    --port 8099 \
+    --no-enable-prefix-caching \
+    --trust-remote-code  2>&1 | tee $log_file


The script uses the variable $log_file but it is never defined. This will cause the script to fail.

gemini-code-assist · 2025-11-03T11:23:50Z

src/llmcompressor/modifiers/quantization/gptq/base.py

+            for name, module in decoding_layer.named_modules():
+                if hasattr(module, "weight_scale") and hasattr(module, "weight_zero_point"):
+                    logger.info(f"Updating offload parameters for module {getattr(module, '_tmp_name', '')} || {name}")
+                    # weight = module.weight
+                    scale = module.scale
+                    del module.scale
+                    # zero_point = module.weight_zero_point
+                    # offload_device = torch.device("cpu")
+                    # module.weight.data.copy_(weight.data)
+                    # module.weight_scale.data.copy_(scale.data)
+                    # module.weight_zero_point.data.copy_(zero_point.data)
+                    # update_offload_parameter(module, "weight", weight)
+                    # FIXME: yi quantize the weight based on the scale and zero_point from AutoRound
+                    update_offload_parameter(module, "weight_scale", scale)
+                    # update_offload_parameter(module, "weight_zero_point", zero_point)


The logic for applying the quantization results seems incomplete. The FIXME comment, the commented-out update for weight_zero_point, and the deletion of module.scale suggest that the quantization is not being correctly applied to the model's weights. This is a critical issue that needs to be resolved for the feature to work correctly.

gemini-code-assist · 2025-11-03T11:23:50Z

src/llmcompressor/modifiers/quantization/gptq/base.py

+    def autoround(self, state):
+        cur_layer_idx = self._cur_layer_idx
+        self._cur_layer_idx += 1
+        logger.info(f">>||>> AutoRound for decoding layer index {cur_layer_idx}")
+        if cur_layer_idx >= len(state.model.model.layers):
+            logger.info(f">>||>> All decoding layers have been processed for AutoRound.")
+            self.compress_modules(return_directly=False)
+            return
+        decoding_layer = state.model.model.layers[cur_layer_idx]
+        new_model = _PretrainModelWrapper()
+        new_model.model.layers.append(decoding_layer)
+        logger.warning(f">>||>> Strating AutoRound for decoding layer {getattr(decoding_layer, '_tmp_name', '')}")
+        param = next(decoding_layer.parameters())
+        new_model.dtype = param.dtype
+        with align_module_device(decoding_layer):
+            if _DEBUG:
+                iters = 4
+            else:
+                iters = 200
+            import auto_round
+            rounder = auto_round.AutoRound(
+                model=new_model,
+                tokenizer="",
+                scheme="W4A16",
+                # iters = 128,
+                iters = iters,
+                enable_quanted_input=False,
+                # FIXME: batch size 1 causes error, looks like related to the input_others prepare
+                # batch_size=1 
+                enable_torch_compile=True,
+                # enable_deterministic_algorithms=True,
+            )
+            # block: torch.nn.Module,
+            # input_ids: list[torch.Tensor],
+            # input_others: dict,
+            # q_input: Union[None, torch.Tensor] = None,
+            # device: Union[str, torch.device] = "cpu",
+
+
+            from auto_round.compressors.utils import set_layer_config
+            def _preprocess(self):
+                self.layer_config, self.has_qlayer_outside_block, self.regex_config = set_layer_config(
+                    self.model,
+                    self.layer_config,
+                    self.scheme,
+                    self.scale_dtype,
+                    self.supported_types,
+                    self.inner_supported_types,
+                    self.quant_block_list,
+                    self.fp_layers,
+                    self.quant_lm_head,
+                    # enable_gguf_official_mixed=enable_gguf_official_mixed,
+                    is_mllm=self.mllm,
+                )
+                self.batch_dim = 0
+            _preprocess(rounder)
+
+            input_name = f"model.layers.{cur_layer_idx}"
+            cur_inputs = all_module_input[input_name]
+            input_ids = []
+            input_others = {}
+            positional_inputs = []
+            attention_mask = None
+            position_ids = None
+            cache_position = None
+            position_embeddings = (None, None)
+            for cur_inp in cur_inputs:
+                _input_ids = cur_inp[0][0][0]
+                input_ids.append(cur_inp[0][0][0])
+                for key, val in cur_inp[0][1].items():
+                    if key == "position_ids":
+                        position_ids = val
+                    elif key == "position_embeddings":
+                        position_embeddings = val
+                    elif key == "cache_position":
+                        cache_position = val
+            input_others["position_ids"] = position_ids
+            input_others["positional_inputs"] = positional_inputs
+            input_others["attention_mask"] = attention_mask
+            input_others["position_embeddings"] = position_embeddings
+            input_others["cache_position"] = cache_position
+            decoding_layer.tuning_device = torch.device("cuda")
+            with torch.enable_grad():
+                rounder._quantize_block(
+                    block=decoding_layer,
+                    input_ids=input_ids,
+                    input_others=input_others,
+                    q_input=None,
+                    device="cuda",
+                )
+            logger.info(f"AutoRound completed for decoding layer {getattr(decoding_layer, '_tmp_name', '')}")
+            # decoding_layer.mlp.up_proj.weight_scale.shape
+            # decoding_layer.mlp.up_proj.weight_zero_point.shape
+            logger.info(f"Weight scale shape: {decoding_layer.mlp.up_proj.weight_scale.shape}, zero point shape: {decoding_layer.mlp.up_proj.weight_zero_point.shape}")
+            logger.info(f"Weight scale shape: {decoding_layer.mlp.down_proj.weight_scale.shape}, zero point shape: {decoding_layer.mlp.down_proj.weight_zero_point.shape}")
+            # decoding_layer.to("cuda")
+            print(f" decoding_layer.mlp.down_proj{ decoding_layer.mlp.down_proj}")
+
+            for name, module in decoding_layer.named_modules():
+                if hasattr(module, "weight_scale") and hasattr(module, "weight_zero_point"):
+                    logger.info(f"Updating offload parameters for module {getattr(module, '_tmp_name', '')} || {name}")
+                    # weight = module.weight
+                    scale = module.scale
+                    del module.scale
+                    # zero_point = module.weight_zero_point
+                    # offload_device = torch.device("cpu")
+                    # module.weight.data.copy_(weight.data)
+                    # module.weight_scale.data.copy_(scale.data)
+                    # module.weight_zero_point.data.copy_(zero_point.data)
+                    # update_offload_parameter(module, "weight", weight)
+                    # FIXME: yi quantize the weight based on the scale and zero_point from AutoRound
+                    update_offload_parameter(module, "weight_scale", scale)
+                    # update_offload_parameter(module, "weight_zero_point", zero_point)
+            # state.model.model.layers[0].mlp.gate_proj.weightbt
+            for module in list(self._num_samples.keys()):
+                name = self._module_names[module]
+                del self._num_samples[module]
+            decoding_layer.eval()
+            all_module_input.clear()
+            all_module_output.clear()
+        # logger.info(f"state.model.model.layers[0].mlp.gate_proj.weight: {state.model.model.layers[0].mlp.gate_proj.weight.device}")


The autoround method's integration with the auto_round library appears to be fragile as it relies on its internal methods like _preprocess and _quantize_block. This can easily break with new versions of auto_round. It is recommended to use the public API of the library for a more robust integration.

gemini-code-assist · 2025-11-03T11:23:50Z

examples/quantization_w4a16/llama3_example.py

 model_id = "meta-llama/Meta-Llama-3-8B-Instruct"
+model_id = "/data5/yliu7/HF_HOME/meta-llama/Llama-3.2-1B-Instruct"
+model_id = "Qwen/Qwen2.5-0.5B"
+model_id = "/data5/yliu7/HF_HOME/Qwen/Qwen2.5-0.5B"
+model_id = "/data5/yliu7/meta-llama/meta-llama/Meta-Llama-3.1-8B-Instruct"
+# model_id = "facebook/opt-125m"


These multiple assignments to model_id, with most being commented out, make the configuration confusing. Please retain only the intended one and consider making it a script argument for better reusability.

gemini-code-assist · 2025-11-03T11:23:50Z

examples/quantization_w4a16/llama3_example.py

+# ds = load_dataset(DATASET_ID, split=f"{DATASET_SPLIT}[:{NUM_CALIBRATION_SAMPLES}]")
+# ds = ds.shuffle(seed=42)


-def preprocess(example):
-    return {
-        "text": tokenizer.apply_chat_template(
-            example["messages"],
-            tokenize=False,
-        )
-    }
+# def preprocess(example):
+#     return {
+#         "text": tokenizer.apply_chat_template(
+#             example["messages"],
+#             tokenize=False,
+#         )
+#     }


-ds = ds.map(preprocess)
+# ds = ds.map(preprocess)


-# Tokenize inputs.
-def tokenize(sample):
-    return tokenizer(
-        sample["text"],
-        padding=False,
-        max_length=MAX_SEQUENCE_LENGTH,
-        truncation=True,
-        add_special_tokens=False,
-    )
+# # Tokenize inputs.
+# def tokenize(sample):
+#     return tokenizer(
+#         sample["text"],
+#         padding=False,
+#         max_length=MAX_SEQUENCE_LENGTH,
+#         truncation=True,
+#         add_special_tokens=False,
+#     )


-ds = ds.map(tokenize, remove_columns=ds.column_names)
+# ds = ds.map(tokenize, remove_columns=ds.column_names)


This large block of commented-out code should be removed to improve readability if it's no longer needed.

gemini-code-assist · 2025-11-03T11:23:51Z

examples/quantization_w4a16/run_vllm.sh

+model_path="/data5/yliu7/tmp/Qwen2.5-0.5B-W4A16-G128"
+model_path="/data5/yliu7/tmp/Meta-Llama-3.1-8B-Instruct-W4A16-G128"
+model_path="/data5/yliu7/tmp/Meta-Llama-3.1-8B-Instruct-W4A16-G128-with-shuffule"
+model_path="/data5/yliu7/tmp/Meta-Llama-3.1-8B-Instruct-W4A16-G128-disbale-shuffule/"


The model_path is configured through multiple, mostly commented-out, hardcoded paths. This is confusing and error-prone. It would be better to accept the model path as a command-line argument. Also, note the typo 'disbale-shuffule' in the active path.

gemini-code-assist · 2025-11-03T11:23:51Z

examples/quantization_w4a16/start_client.sh

+model_path=/home/yliu7/workspace/inc/3rd-party/llm-compressor/examples/quantization_non_uniform/Llama-3.2-1B-Instruct-NVFP4-FP8-Dynamic
+model_path="/data5/yliu7/HF_HOME/qwen_moe_skip_lm_head"
+model_path=/data5/yliu7/HF_HOME/ByteDance-Seed/Seed-OSS-36B-Instruct
+model_path=/data5/yliu7/HF_HOME/GLM-4.5-Air-w8afp8-llmc/GLM-4.5-Air-w8afp8
+# model_path=/data5/yliu7/HF_HOME/Llama-3.2-1B-Instruct-NVFPP_B16/
+# model_path=/data5/yliu7/HF_HOME/meta-llama/Llama-3.2-1B-Instruct/
+model_path=/data5/yliu7/HF_HOME/Yi30/gpt-oss-20b-BF16-MXFP8/
+model_path=/data5/yliu7/HF_HOME/Yi30/gpt-oss-120b-BF16-unsloth-MXFP8
+model_path=/data6/yiliu4/unsloth-gpt-oss-120b-BF16-ar-MXFP4/
+model_path="/data5/yliu7/HF_HOME/unsloth-gpt-oss-20b-BF16-ar-MXFP4/"
+model_path="/data5/yliu7/HF_HOME/unsloth-gpt-oss-20b-BF16-ar-MXFP4/"
+model_path="/data5/yliu7/tmp/Qwen2.5-0.5B-W4A16-G128"
+model_path="/data5/yliu7/tmp/Meta-Llama-3.1-8B-Instruct-W4A16-G128"
+model_path="/data5/yliu7/tmp/Meta-Llama-3.1-8B-Instruct-W4A16-G128-with-shuffule"
+model_path="/data5/yliu7/tmp/Meta-Llama-3.1-8B-Instruct-W4A16-G128-disbale-shuffule/"
+# model_path="/data5/yliu7/meta-llama/meta-llama/Meta-Llama-3.1-8B-Instruct-AR-W4G128"
+# model_path="/data5/yliu7/meta-llama/meta-llama/Meta-Llama-3.1-8B-Instruct-AR-W4G128"
+# model_path=/data5/yliu7/HF_HOME/unsloth-gpt-oss-20b-BF16-ar-MXFP4/
+# model_path=/data6/yiliu4/unsloth-gpt-oss-120b-BF16-ar-MXFP4/
+# model_path=/data5/yliu7/HF_HOME/Yi30/unsloth-gpt-oss-20b-BF16-MXFP4
+# model_path=/models/DeepSeek-V2-Lite-Chat/


This script contains a large number of hardcoded and commented-out model_path assignments. This makes the script difficult to understand and maintain. Please refactor this to take model_path as a command-line argument and remove the unused assignments.

gemini-code-assist · 2025-11-03T11:23:51Z

examples/quantization_w4a4_fp4/llama3_example.py


 MODEL_ID = "meta-llama/Meta-Llama-3-8B-Instruct"
-
+MODEL_ID = "/data5/yliu7/HF_HOME/meta-llama/Llama-3.2-1B-Instruct"


The MODEL_ID is hardcoded to a local path. For better portability of this example, please consider making this a command-line argument or using a model from Hugging Face Hub.

gemini-code-assist · 2025-11-03T11:23:51Z

src/llmcompressor/modifiers/quantization/gptq/base.py

+all_module_input = defaultdict(list)
+all_module_output = defaultdict(list)


The use of module-level global dictionaries all_module_input and all_module_output to pass data between hooks and the autoround method introduces global state, which can be hard to manage and debug. Consider using instance attributes on the modifier to store this state, making the data flow more explicit and encapsulated.

gemini-code-assist · 2025-11-03T11:23:51Z

src/llmcompressor/modifiers/quantization/gptq/base.py

+class _LLModelWrapper(torch.nn.Module):
+    def __init__(self):
+        super().__init__()
+        self.layers = torch.nn.ModuleList()
+
+    def forward(self, *args, **kwargs):
+        for layer in self.layers:
+            res = layer(*args, **kwargs)
+        return res
+
+
+class _PretrainModelWrapper(torch.nn.Module):
+    def __init__(self):
+        super().__init__()
+        self.model = _LLModelWrapper()
+
+    def forward(self, *args, **kwargs):
+        return self.model(*args, **kwargs)
+


The wrapper classes _LLModelWrapper and _PretrainModelWrapper have very generic names. To improve code clarity, consider renaming them to be more descriptive of their specific purpose, such as SingleLayerWrapper. Additionally, _PretrainModelWrapper appears to be a redundant wrapper around _LLModelWrapper and could potentially be removed.

yiliu30 added 2 commits October 31, 2025 08:38

tmp ar

46b38b0

Signed-off-by: yiliu30 <yi4.liu@intel.com>

add eval code

3120321

Signed-off-by: yiliu30 <yi4.liu@intel.com>

yiliu30 added 2 commits November 2, 2025 01:54

fix acc issue

1c5f0eb

Signed-off-by: yiliu30 <yi4.liu@intel.com>

fix more

b302d19

Signed-off-by: yiliu30 <yi4.liu@intel.com>

gemini-code-assist bot reviewed Nov 3, 2025

View reviewed changes


		MODEL_ID = "meta-llama/Meta-Llama-3-8B-Instruct"

		MODEL_ID = "/data5/yliu7/HF_HOME/meta-llama/Llama-3.2-1B-Instruct"

		all_module_input = defaultdict(list)
		all_module_output = defaultdict(list)

Conversation

yiliu30 commented Oct 31, 2025

Uh oh!

github-actions bot commented Oct 31, 2025

Uh oh!

yiliu30 commented Nov 3, 2025

Uh oh!

gemini-code-assist bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

gemini-code-assist bot Nov 3, 2025

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist bot Nov 3, 2025

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist bot Nov 3, 2025

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist bot Nov 3, 2025

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist bot Nov 3, 2025

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist bot Nov 3, 2025

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist bot Nov 3, 2025

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist bot Nov 3, 2025

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist bot Nov 3, 2025

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist bot Nov 3, 2025

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant