Description
⚙️ Current environment
- NVIDIA-SMI 580.95.05, Driver Version: 580.95.05, CUDA Version: 13.0
- GPU: NVIDIA GeForce RTX 4070 Super (12 GB VRAM)
- CUDA version in the Python environment: 12.8
- RAM: 16 GB DDR4
- CPU: AMD Ryzen 7 5700G with Radeon Graphics
The output of python3 src/quantization/awq_test.py
The following generation flags are not valid and may be ignored: ['pad_token_id']. Set `TRANSFORMERS_VERBOSITY=info` for more details.
Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:00<00:00, 8.55it/s]
Using a slow image processor as `use_fast` is unset and a slow processor was saved with this model. `use_fast=True` will be the default behavior in v4.52, even if the model was saved with a slow processor. This will result in minor differences in outputs. You'll still be able to use a slow processor with `use_fast=False`.
Tokenizing: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 512/512 [00:00<00:00, 1405.89 examples/s]
2025-12-02T02:56:17.665846+0530 | reset | INFO - Compression lifecycle reset
2025-12-02T02:56:17.671479+0530 | _create_default_logger | INFO - Logging all LLM Compressor modifier-level logs to sparse_logs/02-12-2025_02.56.17.log
2025-12-02T02:56:17.671860+0530 | from_modifiers | INFO - Creating recipe from modifiers
2025-12-02T02:56:17.728215+0530 | on_initialize | INFO - No AWQModifier.mappings provided, inferring from model...
2025-12-02T02:56:17.728384+0530 | get_layer_mappings_from_architecture | INFO - Architecture HunYuanVLForConditionalGeneration not found in mappings. Using default mappings: [AWQMapping(smooth_layer='re:.*input_layernorm$', balance_layers=['re:.*q_proj$', 're:.*k_proj$', 're:.*v_proj$']), AWQMapping(smooth_layer='re:.*v_proj$', balance_layers=['re:.*o_proj$']), AWQMapping(smooth_layer='re:.*post_attention_layernorm$', balance_layers=['re:.*gate_proj$', 're:.*up_proj$']), AWQMapping(smooth_layer='re:.*up_proj$', balance_layers=['re:.*down_proj$'])]
0it [00:00, ?it/s]
Resolving mapping 2/4 (24 skipped): : 51it [00:00, 8421.97it/s]
0it [00:00, ?it/s]
Resolving mapping 4/4 (0 skipped): : 24it [00:00, 6069.54it/s]
2025-12-02T02:56:17.744290+0530 | initialize | INFO - Compression lifecycle initialized for 1 modifiers
2025-12-02T02:56:17.744422+0530 | IndependentPipeline | INFO - Inferred `SequentialPipeline` for `AWQModifier`
Preparing cache: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 512/512 [00:01<00:00, 404.09it/s]
(1/25): Calibrating: 0%| | 0/512 [00:00<?, ?it/s]
Traceback (most recent call last):
File "/home/al0nkr/visual-search-rag/visragvenv/lib/python3.12/site-packages/llmcompressor/pipelines/sequential/helpers.py", line 73, in forward
outputs = forward_fn(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "<string>", line 18, in forward
File "/home/al0nkr/visual-search-rag/visragvenv/lib/python3.12/site-packages/transformers/modeling_layers.py", line 94, in __call__
return super().__call__(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/al0nkr/visual-search-rag/visragvenv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1775, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/al0nkr/visual-search-rag/visragvenv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1786, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/al0nkr/visual-search-rag/visragvenv/lib/python3.12/site-packages/transformers/utils/deprecation.py", line 172, in wrapped_func
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/home/al0nkr/visual-search-rag/visragvenv/lib/python3.12/site-packages/transformers/models/hunyuan_vl/modeling_hunyuan_vl.py", line 675, in forward
hidden_states, _ = self.self_attn(
^^^^^^^^^^^^^^^
File "/home/al0nkr/visual-search-rag/visragvenv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1775, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/al0nkr/visual-search-rag/visragvenv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1786, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/al0nkr/visual-search-rag/visragvenv/lib/python3.12/site-packages/transformers/utils/deprecation.py", line 172, in wrapped_func
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/home/al0nkr/visual-search-rag/visragvenv/lib/python3.12/site-packages/transformers/models/hunyuan_vl/modeling_hunyuan_vl.py", line 595, in forward
query_states, key_states = apply_rotary_pos_emb_xdrope(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/al0nkr/visual-search-rag/visragvenv/lib/python3.12/site-packages/transformers/models/hunyuan_vl/modeling_hunyuan_vl.py", line 474, in apply_rotary_pos_emb_xdrope
cos = cos[position_ids, ...].permute(0, 2, 1, 3).reshape(output_size[0], output_size[2], x_dim, -1).contiguous()
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: permute(sparse_coo): number of dimensions in the tensor input does not match the length of the desired ordering of dimensions i.e. input.dim() = 3 is not equal to len(dims) = 4
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/home/al0nkr/visual-search-rag/src/quantization/awq_test.py", line 77, in <module>
oneshot(
File "/home/al0nkr/visual-search-rag/visragvenv/lib/python3.12/site-packages/llmcompressor/entrypoints/oneshot.py", line 330, in oneshot
one_shot()
File "/home/al0nkr/visual-search-rag/visragvenv/lib/python3.12/site-packages/llmcompressor/entrypoints/oneshot.py", line 158, in __call__
self.apply_recipe_modifiers(
File "/home/al0nkr/visual-search-rag/visragvenv/lib/python3.12/site-packages/llmcompressor/entrypoints/oneshot.py", line 201, in apply_recipe_modifiers
pipeline(
File "/home/al0nkr/visual-search-rag/visragvenv/lib/python3.12/site-packages/llmcompressor/pipelines/independent/pipeline.py", line 45, in __call__
pipeline(model, dataloader, dataset_args)
File "/home/al0nkr/visual-search-rag/visragvenv/lib/python3.12/site-packages/llmcompressor/pipelines/sequential/pipeline.py", line 104, in __call__
subgraph.forward(model, **inputs)
File "/home/al0nkr/visual-search-rag/visragvenv/lib/python3.12/site-packages/llmcompressor/pipelines/sequential/helpers.py", line 75, in forward
raise RuntimeError(
RuntimeError: Raised an exception during execution of the following code:
1
2 torch.fx._symbolic_trace.wrap("transformers_models_hunyuan_vl_modeling_hunyuan_vl_wrapped_0")
3 torch.fx._symbolic_trace.wrap("transformers_models_hunyuan_vl_modeling_hunyuan_vl_wrapped_1")
4 torch.fx._symbolic_trace.wrap("transformers_models_hunyuan_vl_modeling_hunyuan_vl_wrapped_3")
5 torch.fx._symbolic_trace.wrap("transformers_models_hunyuan_vl_modeling_hunyuan_vl_wrapped_4")
6 torch.fx._symbolic_trace.wrap("transformers_models_hunyuan_vl_modeling_hunyuan_vl_wrapped_5")
7
8 def forward(self, input_ids : torch.Tensor, attention_mask : torch.Tensor):
9 wrapped_0 = transformers_models_hunyuan_vl_modeling_hunyuan_vl_wrapped_0(input_ids, None); wrapped_0 = None
10 wrapped_1 = transformers_models_hunyuan_vl_modeling_hunyuan_vl_wrapped_1(input_ids, None); input_ids = None
11 getitem = wrapped_1[0]; wrapped_1 = None
12 wrapped_3 = transformers_models_hunyuan_vl_modeling_hunyuan_vl_wrapped_3(None, getitem, None)
13 getitem_1 = wrapped_3[0]
14 getitem_2 = wrapped_3[1]; wrapped_3 = getitem_2 = None
15 wrapped_4 = transformers_models_hunyuan_vl_modeling_hunyuan_vl_wrapped_4(getitem_1, None)
16 getitem_3 = wrapped_4[0]; wrapped_4 = None
17 wrapped_5 = transformers_models_hunyuan_vl_modeling_hunyuan_vl_wrapped_5(attention_mask, getitem_1, getitem, None, getitem_3); attention_mask = None
18 model_layers_0 = getattr(self.model.layers, "0")(getitem, attention_mask = wrapped_5, position_ids = getitem_3, past_key_values = None, cache_position = getitem_1); getitem = None
19 return {'getitem_1': getitem_1, 'getitem_3': getitem_3, 'wrapped_5': wrapped_5, 'model_layers_0': model_layers_0}
20
🐛 Describe the bug
- Tried to quantize the HunyuanOCR model with a small-coco dataset formatted into a dataloader, using AWQModifier and oneshot; the run fails with the permute/dimension error shown above (a minimal sketch of the mismatch follows this list).
- New to using llm-compressor and vLLM, and given that HunyuanOCR launched recently, I assume support for it hasn't been fully added yet?
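For reference, the failure inside apply_rotary_pos_emb_xdrope reads like a rank mismatch: indexing the cos table with position_ids yields a 3-D tensor, while the subsequent permute asks for 4 dimensions. A minimal, standalone sketch of that class of error, using hypothetical shapes rather than the model's real ones:

import torch

# Hypothetical shapes, chosen only to illustrate the rank mismatch from the traceback above.
cos = torch.randn(2048, 128)                   # assumed 2-D cos table: (seq_len, head_dim)
position_ids = torch.arange(512).unsqueeze(0)  # assumed 2-D ids: (batch, seq_len)

indexed = cos[position_ids, ...]               # advanced indexing -> shape (1, 512, 128), i.e. 3 dims
try:
    indexed.permute(0, 2, 1, 3)                # asks for 4 dims -> RuntimeError, as in the log
except RuntimeError as e:
    print(e)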
🛠️ Steps to reproduce
awq_test.py
import torch, json, os
from datasets import Dataset
from PIL import Image
from transformers import AutoModelForVision2Seq, AutoProcessor
from llmcompressor.modifiers.awq import AWQModifier
from llmcompressor import oneshot
MODEL_ID = "tencent/HunyuanOCR"
SAVE_DIR = "./HunyuanOCR-AWQ-W4A16"
CALIB_FILE = "calib_512.json"
# 1. load
model = AutoModelForVision2Seq.from_pretrained(
    MODEL_ID,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
    offload_buffers=True,
    attn_implementation="sdpa")
processor = AutoProcessor.from_pretrained(MODEL_ID, trust_remote_code=True)
# 2. raw HF dataset (text=str, image=PIL)
def build_ds():
    data = json.load(open(CALIB_FILE))
    return Dataset.from_list([
        {"image": Image.open(d["image"]).convert("RGB"),
         "text": f"<image>{d['text']}"}  # keep <image> for safety
        for d in data
    ])
calib_ds = build_ds()
# 3. tell llm-compressor how to turn samples → tensors
def collate(sample):
    # 1. build the prompt exactly like the model card
    text = f"<image>\n{sample['text']}"  # MUST contain <image>
    # 2. single processor call – do NOT tokenise beforehand
    batch = processor(text=text,
                      images=sample["image"],
                      return_tensors="pt",
                      truncation=True,
                      max_length=2048)
    return batch
# 4. recipe (ignore tiny vision & head layers)
recipe = AWQModifier(
    ignore=[
        "re:.*embed_tokens",
        "re:.*input_layernorm$",
        "re:.*mlp[.]gate$",
        "re:.*post_attention_layernorm$",
        "re:.*norm$",
        "re:model[.]visual.*",
        "re:visual.*",
        "lm_head",
    ],
    duo_scaling=True,
    config_groups={
        "group_0": {
            "targets": ["Linear"],
            "weights": {
                "num_bits": 4,
                "type": "int",
                "symmetric": True,
                "group_size": 32,
                "strategy": "group",
                "dynamic": False,
                "actorder": None,
                "observer": "mse",
            },
        }
    },
)
# 5. one-shot compression
oneshot(
    model=model,
    tokenizer=processor.tokenizer,  # llm-compressor needs a tokenizer
    dataset=calib_ds,
    num_calibration_samples=512,
    recipe=recipe,
    output_dir=SAVE_DIR,
    max_seq_length=2048)
# 6. export in compressed-tensors format (vLLM native)
model.save_pretrained(SAVE_DIR, save_compressed=True)
processor.save_pretrained(SAVE_DIR)
print("AWQ W4A16 finished →", SAVE_DIR)