
Ar #7 (Draft)

yiliu30 wants to merge 4 commits into main from ar

Conversation

@yiliu30 (Owner) commented Oct 31, 2025

SUMMARY:
"please provide a brief summary"

TEST PLAN:
"please outline how the changes were tested"

Signed-off-by: yiliu30 <yi4.liu@intel.com>
@github-actions

👋 Hi! Thank you for contributing to llm-compressor. Please add the ready label when the PR is ready for review.

Note: This is required to complete the testing suite, please only add the label once the PR is code complete and local testing has been performed.

@yiliu30 (Owner, Author) commented Nov 3, 2025

/gemini review

@gemini-code-assist (bot) left a comment

Code Review

This pull request introduces an experimental integration with the auto_round library for quantization. While this is a significant feature, the current implementation appears to be in an early, experimental stage and contains several issues that should be addressed before merging. The example scripts are cluttered with hardcoded paths and commented-out code, making them difficult to use. The core logic in gptq/base.py relies on internal methods of auto_round, which is a fragile approach. There are also FIXME comments and incomplete logic, indicating that the feature is not fully implemented. Additionally, debugging code has been added to production paths. I recommend cleaning up the example scripts, completing the auto_round integration using its public API, and removing the debugging code.

```bash
    --dtype bfloat16 \
    --port 8099 \
    --no-enable-prefix-caching \
    --trust-remote-code 2>&1 | tee $log_file
```
critical

The script uses the variable $log_file but it is never defined. This will cause the script to fail.
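A minimal sketch of the fix is to define `log_file` before the serve command runs; the naming scheme below is an assumption, not taken from the PR:

```shell
# Hypothetical fix: define log_file before it is used by `tee`.
# The filename pattern is an assumption.
log_file="serve_$(date +%Y%m%d).log"
echo "logging to ${log_file}"
```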

Comment on lines +436 to +450
```python
for name, module in decoding_layer.named_modules():
    if hasattr(module, "weight_scale") and hasattr(module, "weight_zero_point"):
        logger.info(f"Updating offload parameters for module {getattr(module, '_tmp_name', '')} || {name}")
        # weight = module.weight
        scale = module.scale
        del module.scale
        # zero_point = module.weight_zero_point
        # offload_device = torch.device("cpu")
        # module.weight.data.copy_(weight.data)
        # module.weight_scale.data.copy_(scale.data)
        # module.weight_zero_point.data.copy_(zero_point.data)
        # update_offload_parameter(module, "weight", weight)
        # FIXME: yi quantize the weight based on the scale and zero_point from AutoRound
        update_offload_parameter(module, "weight_scale", scale)
        # update_offload_parameter(module, "weight_zero_point", zero_point)
```

critical

The logic for applying the quantization results seems incomplete. The FIXME comment, the commented-out update for weight_zero_point, and the deletion of module.scale suggest that the quantization is not being correctly applied to the model's weights. This is a critical issue that needs to be resolved for the feature to work correctly.
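For illustration, here is a minimal sketch of what completing this step could look like: fake-quantizing a weight row from a given scale and zero point. The function name, the 0..15 asymmetric W4A16 range, and the use of plain Python lists are assumptions for clarity; the real code operates on torch tensors.

```python
# Hedged sketch: apply AutoRound's scale/zero_point to a weight row.
# All names and the 4-bit asymmetric range are assumptions.
def fake_quantize_row(weights, scale, zero_point, num_bits=4):
    qmin, qmax = 0, (1 << num_bits) - 1  # 0..15 for 4-bit weights
    quantized = [min(qmax, max(qmin, round(w / scale) + zero_point)) for w in weights]
    # Dequantize so the stored weight reflects the quantization error
    return [(q - zero_point) * scale for q in quantized]
```

The commented-out `update_offload_parameter(module, "weight", weight)` line in the diff suggests this is where such a result would be stored back onto the module.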

Comment on lines +338 to +458
```python
def autoround(self, state):
    cur_layer_idx = self._cur_layer_idx
    self._cur_layer_idx += 1
    logger.info(f">>||>> AutoRound for decoding layer index {cur_layer_idx}")
    if cur_layer_idx >= len(state.model.model.layers):
        logger.info(">>||>> All decoding layers have been processed for AutoRound.")
        self.compress_modules(return_directly=False)
        return
    decoding_layer = state.model.model.layers[cur_layer_idx]
    new_model = _PretrainModelWrapper()
    new_model.model.layers.append(decoding_layer)
    logger.warning(f">>||>> Starting AutoRound for decoding layer {getattr(decoding_layer, '_tmp_name', '')}")
    param = next(decoding_layer.parameters())
    new_model.dtype = param.dtype
    with align_module_device(decoding_layer):
        if _DEBUG:
            iters = 4
        else:
            iters = 200
        import auto_round
        rounder = auto_round.AutoRound(
            model=new_model,
            tokenizer="",
            scheme="W4A16",
            # iters=128,
            iters=iters,
            enable_quanted_input=False,
            # FIXME: batch size 1 causes error, looks like related to the input_others prepare
            # batch_size=1
            enable_torch_compile=True,
            # enable_deterministic_algorithms=True,
        )
        # block: torch.nn.Module,
        # input_ids: list[torch.Tensor],
        # input_others: dict,
        # q_input: Union[None, torch.Tensor] = None,
        # device: Union[str, torch.device] = "cpu",

        from auto_round.compressors.utils import set_layer_config

        def _preprocess(self):
            self.layer_config, self.has_qlayer_outside_block, self.regex_config = set_layer_config(
                self.model,
                self.layer_config,
                self.scheme,
                self.scale_dtype,
                self.supported_types,
                self.inner_supported_types,
                self.quant_block_list,
                self.fp_layers,
                self.quant_lm_head,
                # enable_gguf_official_mixed=enable_gguf_official_mixed,
                is_mllm=self.mllm,
            )
            self.batch_dim = 0

        _preprocess(rounder)

        input_name = f"model.layers.{cur_layer_idx}"
        cur_inputs = all_module_input[input_name]
        input_ids = []
        input_others = {}
        positional_inputs = []
        attention_mask = None
        position_ids = None
        cache_position = None
        position_embeddings = (None, None)
        for cur_inp in cur_inputs:
            _input_ids = cur_inp[0][0][0]
            input_ids.append(cur_inp[0][0][0])
            for key, val in cur_inp[0][1].items():
                if key == "position_ids":
                    position_ids = val
                elif key == "position_embeddings":
                    position_embeddings = val
                elif key == "cache_position":
                    cache_position = val
        input_others["position_ids"] = position_ids
        input_others["positional_inputs"] = positional_inputs
        input_others["attention_mask"] = attention_mask
        input_others["position_embeddings"] = position_embeddings
        input_others["cache_position"] = cache_position
        decoding_layer.tuning_device = torch.device("cuda")
        with torch.enable_grad():
            rounder._quantize_block(
                block=decoding_layer,
                input_ids=input_ids,
                input_others=input_others,
                q_input=None,
                device="cuda",
            )
        logger.info(f"AutoRound completed for decoding layer {getattr(decoding_layer, '_tmp_name', '')}")
        # decoding_layer.mlp.up_proj.weight_scale.shape
        # decoding_layer.mlp.up_proj.weight_zero_point.shape
        logger.info(f"Weight scale shape: {decoding_layer.mlp.up_proj.weight_scale.shape}, zero point shape: {decoding_layer.mlp.up_proj.weight_zero_point.shape}")
        logger.info(f"Weight scale shape: {decoding_layer.mlp.down_proj.weight_scale.shape}, zero point shape: {decoding_layer.mlp.down_proj.weight_zero_point.shape}")
        # decoding_layer.to("cuda")
        print(f" decoding_layer.mlp.down_proj{decoding_layer.mlp.down_proj}")

        for name, module in decoding_layer.named_modules():
            if hasattr(module, "weight_scale") and hasattr(module, "weight_zero_point"):
                logger.info(f"Updating offload parameters for module {getattr(module, '_tmp_name', '')} || {name}")
                # weight = module.weight
                scale = module.scale
                del module.scale
                # zero_point = module.weight_zero_point
                # offload_device = torch.device("cpu")
                # module.weight.data.copy_(weight.data)
                # module.weight_scale.data.copy_(scale.data)
                # module.weight_zero_point.data.copy_(zero_point.data)
                # update_offload_parameter(module, "weight", weight)
                # FIXME: yi quantize the weight based on the scale and zero_point from AutoRound
                update_offload_parameter(module, "weight_scale", scale)
                # update_offload_parameter(module, "weight_zero_point", zero_point)
        # state.model.model.layers[0].mlp.gate_proj.weight
        for module in list(self._num_samples.keys()):
            name = self._module_names[module]
            del self._num_samples[module]
        decoding_layer.eval()
        all_module_input.clear()
        all_module_output.clear()
        # logger.info(f"state.model.model.layers[0].mlp.gate_proj.weight: {state.model.model.layers[0].mlp.gate_proj.weight.device}")
```

high

The autoround method's integration with the auto_round library appears to be fragile as it relies on its internal methods like _preprocess and _quantize_block. This can easily break with new versions of auto_round. It is recommended to use the public API of the library for a more robust integration.
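As a rough sketch of the reviewer's suggestion, auto_round's documented top-level flow looks approximately like the pseudocode below. The `quantize` and `save_quantized` calls follow the library's README at the time of writing and should be verified against the installed version; `model`, `tokenizer`, and `output_dir` are placeholders.

```python
# Pseudocode sketch of auto_round's documented public API; verify the
# method names against the installed auto_round version before relying on them.
from auto_round import AutoRound

rounder = AutoRound(model, tokenizer, scheme="W4A16", iters=200)
rounder.quantize()
rounder.save_quantized(output_dir)
```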

Comment on lines 12 to +17
```python
model_id = "meta-llama/Meta-Llama-3-8B-Instruct"
model_id = "/data5/yliu7/HF_HOME/meta-llama/Llama-3.2-1B-Instruct"
model_id = "Qwen/Qwen2.5-0.5B"
model_id = "/data5/yliu7/HF_HOME/Qwen/Qwen2.5-0.5B"
model_id = "/data5/yliu7/meta-llama/meta-llama/Meta-Llama-3.1-8B-Instruct"
# model_id = "facebook/opt-125m"
```

medium

These multiple assignments to model_id, with most being commented out, make the configuration confusing. Please retain only the intended one and consider making it a script argument for better reusability.
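A minimal sketch of the suggested refactor, assuming a `--model-id` flag (the flag name and default value are assumptions):

```python
import argparse

# Hedged sketch: make model_id a script argument instead of editing the file.
# The flag name and default are assumptions.
def parse_args(argv=None):
    parser = argparse.ArgumentParser(description="Quantization example")
    parser.add_argument(
        "--model-id",
        default="Qwen/Qwen2.5-0.5B",
        help="Hugging Face model id or local path",
    )
    return parser.parse_args(argv)
```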

Comment on lines +67 to +94
```python
# ds = load_dataset(DATASET_ID, split=f"{DATASET_SPLIT}[:{NUM_CALIBRATION_SAMPLES}]")
# ds = ds.shuffle(seed=42)


def preprocess(example):
    return {
        "text": tokenizer.apply_chat_template(
            example["messages"],
            tokenize=False,
        )
    }
# def preprocess(example):
#     return {
#         "text": tokenizer.apply_chat_template(
#             example["messages"],
#             tokenize=False,
#         )
#     }


ds = ds.map(preprocess)
# ds = ds.map(preprocess)


# Tokenize inputs.
def tokenize(sample):
    return tokenizer(
        sample["text"],
        padding=False,
        max_length=MAX_SEQUENCE_LENGTH,
        truncation=True,
        add_special_tokens=False,
    )
# # Tokenize inputs.
# def tokenize(sample):
#     return tokenizer(
#         sample["text"],
#         padding=False,
#         max_length=MAX_SEQUENCE_LENGTH,
#         truncation=True,
#         add_special_tokens=False,
#     )


ds = ds.map(tokenize, remove_columns=ds.column_names)
# ds = ds.map(tokenize, remove_columns=ds.column_names)
```

medium

This large block of commented-out code should be removed to improve readability if it's no longer needed.

Comment on lines +1 to +4
```bash
model_path="/data5/yliu7/tmp/Qwen2.5-0.5B-W4A16-G128"
model_path="/data5/yliu7/tmp/Meta-Llama-3.1-8B-Instruct-W4A16-G128"
model_path="/data5/yliu7/tmp/Meta-Llama-3.1-8B-Instruct-W4A16-G128-with-shuffule"
model_path="/data5/yliu7/tmp/Meta-Llama-3.1-8B-Instruct-W4A16-G128-disbale-shuffule/"
```

medium

The model_path is configured through multiple, mostly commented-out, hardcoded paths. This is confusing and error-prone. It would be better to accept the model path as a command-line argument. Also, note the typo 'disbale-shuffule' in the active path.

Comment on lines +20 to +40
```bash
model_path=/home/yliu7/workspace/inc/3rd-party/llm-compressor/examples/quantization_non_uniform/Llama-3.2-1B-Instruct-NVFP4-FP8-Dynamic
model_path="/data5/yliu7/HF_HOME/qwen_moe_skip_lm_head"
model_path=/data5/yliu7/HF_HOME/ByteDance-Seed/Seed-OSS-36B-Instruct
model_path=/data5/yliu7/HF_HOME/GLM-4.5-Air-w8afp8-llmc/GLM-4.5-Air-w8afp8
# model_path=/data5/yliu7/HF_HOME/Llama-3.2-1B-Instruct-NVFPP_B16/
# model_path=/data5/yliu7/HF_HOME/meta-llama/Llama-3.2-1B-Instruct/
model_path=/data5/yliu7/HF_HOME/Yi30/gpt-oss-20b-BF16-MXFP8/
model_path=/data5/yliu7/HF_HOME/Yi30/gpt-oss-120b-BF16-unsloth-MXFP8
model_path=/data6/yiliu4/unsloth-gpt-oss-120b-BF16-ar-MXFP4/
model_path="/data5/yliu7/HF_HOME/unsloth-gpt-oss-20b-BF16-ar-MXFP4/"
model_path="/data5/yliu7/HF_HOME/unsloth-gpt-oss-20b-BF16-ar-MXFP4/"
model_path="/data5/yliu7/tmp/Qwen2.5-0.5B-W4A16-G128"
model_path="/data5/yliu7/tmp/Meta-Llama-3.1-8B-Instruct-W4A16-G128"
model_path="/data5/yliu7/tmp/Meta-Llama-3.1-8B-Instruct-W4A16-G128-with-shuffule"
model_path="/data5/yliu7/tmp/Meta-Llama-3.1-8B-Instruct-W4A16-G128-disbale-shuffule/"
# model_path="/data5/yliu7/meta-llama/meta-llama/Meta-Llama-3.1-8B-Instruct-AR-W4G128"
# model_path="/data5/yliu7/meta-llama/meta-llama/Meta-Llama-3.1-8B-Instruct-AR-W4G128"
# model_path=/data5/yliu7/HF_HOME/unsloth-gpt-oss-20b-BF16-ar-MXFP4/
# model_path=/data6/yiliu4/unsloth-gpt-oss-120b-BF16-ar-MXFP4/
# model_path=/data5/yliu7/HF_HOME/Yi30/unsloth-gpt-oss-20b-BF16-MXFP4
# model_path=/models/DeepSeek-V2-Lite-Chat/
```

medium

This script contains a large number of hardcoded and commented-out model_path assignments. This makes the script difficult to understand and maintain. Please refactor this to take model_path as a command-line argument and remove the unused assignments.
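One way to sketch that refactor in the script itself, assuming a positional argument with a fallback default (both the default model and the echoed message are assumptions):

```shell
# Hedged sketch: accept model_path as the first positional argument
# instead of a wall of hardcoded paths. The default is an assumption.
model_path="${1:-Qwen/Qwen2.5-0.5B}"
echo "serving model from ${model_path}"
```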


```python
MODEL_ID = "meta-llama/Meta-Llama-3-8B-Instruct"

MODEL_ID = "/data5/yliu7/HF_HOME/meta-llama/Llama-3.2-1B-Instruct"
```

medium

The MODEL_ID is hardcoded to a local path. For better portability of this example, please consider making this a command-line argument or using a model from Hugging Face Hub.

Comment on lines +41 to +42
```python
all_module_input = defaultdict(list)
all_module_output = defaultdict(list)
```

medium

The use of module-level global dictionaries all_module_input and all_module_output to pass data between hooks and the autoround method introduces global state, which can be hard to manage and debug. Consider using instance attributes on the modifier to store this state, making the data flow more explicit and encapsulated.
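A minimal sketch of the suggested encapsulation, keeping the captures on the modifier instance (the class and method names here are assumptions):

```python
from collections import defaultdict

# Hedged sketch: store hook captures on the modifier instance instead of
# module-level globals. Class and attribute names are assumptions.
class CaptureState:
    def __init__(self):
        self._module_inputs = defaultdict(list)
        self._module_outputs = defaultdict(list)

    def record_input(self, module_name, args):
        # Called from a forward pre-hook to remember the layer's inputs
        self._module_inputs[module_name].append(args)

    def clear(self):
        # Replaces the global all_module_input.clear() / all_module_output.clear()
        self._module_inputs.clear()
        self._module_outputs.clear()
```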

Comment on lines +59 to +77
```python
class _LLModelWrapper(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.layers = torch.nn.ModuleList()

    def forward(self, *args, **kwargs):
        for layer in self.layers:
            res = layer(*args, **kwargs)
        return res


class _PretrainModelWrapper(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.model = _LLModelWrapper()

    def forward(self, *args, **kwargs):
        return self.model(*args, **kwargs)
```


medium

The wrapper classes _LLModelWrapper and _PretrainModelWrapper have very generic names. To improve code clarity, consider renaming them to be more descriptive of their specific purpose, such as SingleLayerWrapper. Additionally, _PretrainModelWrapper appears to be a redundant wrapper around _LLModelWrapper and could potentially be removed.
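The merge the reviewer suggests could look roughly like this torch-free sketch. The name `SingleLayerWrapper` comes from the review itself; dropping `torch.nn.Module` and `ModuleList` here is purely to keep the example self-contained, and the real code would keep them.

```python
# Hedged sketch of collapsing the two wrappers into one. In the real code
# this would subclass torch.nn.Module and hold a torch.nn.ModuleList.
class SingleLayerWrapper:
    def __init__(self, layer):
        # Exposed as `layers` so callers that iterate a layer list still work
        self.layers = [layer]

    def __call__(self, *args, **kwargs):
        result = None
        for layer in self.layers:
            result = layer(*args, **kwargs)
        return result
```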
