
[Bug] : Oneshot error while trying to quantize HunyuanOCR with coco json format #2082

@al0nkr

Description

⚙️ Current environment

NVIDIA-SMI: 580.95.05
Driver version: 580.95.05
CUDA version (driver): 13.0
CUDA version (environment): 12.8
GPU: NVIDIA GeForce RTX 4070 Super, 12 GB VRAM
RAM: 16 GB DDR4
CPU: AMD Ryzen 7 5700G with Radeon Graphics

The output of python3 src/quantization/awq_test.py:
The following generation flags are not valid and may be ignored: ['pad_token_id']. Set `TRANSFORMERS_VERBOSITY=info` for more details.
Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:00<00:00,  8.55it/s]
Using a slow image processor as `use_fast` is unset and a slow processor was saved with this model. `use_fast=True` will be the default behavior in v4.52, even if the model was saved with a slow processor. This will result in minor differences in outputs. You'll still be able to use a slow processor with `use_fast=False`.
Tokenizing: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 512/512 [00:00<00:00, 1405.89 examples/s]
2025-12-02T02:56:17.665846+0530 | reset | INFO - Compression lifecycle reset
2025-12-02T02:56:17.671479+0530 | _create_default_logger | INFO - Logging all LLM Compressor modifier-level logs to sparse_logs/02-12-2025_02.56.17.log
2025-12-02T02:56:17.671860+0530 | from_modifiers | INFO - Creating recipe from modifiers
2025-12-02T02:56:17.728215+0530 | on_initialize | INFO - No AWQModifier.mappings provided, inferring from model...
2025-12-02T02:56:17.728384+0530 | get_layer_mappings_from_architecture | INFO - Architecture HunYuanVLForConditionalGeneration not found in mappings. Using default mappings: [AWQMapping(smooth_layer='re:.*input_layernorm$', balance_layers=['re:.*q_proj$', 're:.*k_proj$', 're:.*v_proj$']), AWQMapping(smooth_layer='re:.*v_proj$', balance_layers=['re:.*o_proj$']), AWQMapping(smooth_layer='re:.*post_attention_layernorm$', balance_layers=['re:.*gate_proj$', 're:.*up_proj$']), AWQMapping(smooth_layer='re:.*up_proj$', balance_layers=['re:.*down_proj$'])]
0it [00:00, ?it/s]
Resolving mapping 2/4 (24 skipped): : 51it [00:00, 8421.97it/s]
0it [00:00, ?it/s]
Resolving mapping 4/4 (0 skipped): : 24it [00:00, 6069.54it/s]
2025-12-02T02:56:17.744290+0530 | initialize | INFO - Compression lifecycle initialized for 1 modifiers
2025-12-02T02:56:17.744422+0530 | IndependentPipeline | INFO - Inferred `SequentialPipeline` for `AWQModifier`
Preparing cache: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 512/512 [00:01<00:00, 404.09it/s]
(1/25): Calibrating:   0%|                                                                                                                                                                                                                                          | 0/512 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "/home/al0nkr/visual-search-rag/visragvenv/lib/python3.12/site-packages/llmcompressor/pipelines/sequential/helpers.py", line 73, in forward
    outputs = forward_fn(*args, **kwargs)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "<string>", line 18, in forward
  File "/home/al0nkr/visual-search-rag/visragvenv/lib/python3.12/site-packages/transformers/modeling_layers.py", line 94, in __call__
    return super().__call__(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/al0nkr/visual-search-rag/visragvenv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1775, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/al0nkr/visual-search-rag/visragvenv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1786, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/al0nkr/visual-search-rag/visragvenv/lib/python3.12/site-packages/transformers/utils/deprecation.py", line 172, in wrapped_func
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/home/al0nkr/visual-search-rag/visragvenv/lib/python3.12/site-packages/transformers/models/hunyuan_vl/modeling_hunyuan_vl.py", line 675, in forward
    hidden_states, _ = self.self_attn(
                       ^^^^^^^^^^^^^^^
  File "/home/al0nkr/visual-search-rag/visragvenv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1775, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/al0nkr/visual-search-rag/visragvenv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1786, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/al0nkr/visual-search-rag/visragvenv/lib/python3.12/site-packages/transformers/utils/deprecation.py", line 172, in wrapped_func
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/home/al0nkr/visual-search-rag/visragvenv/lib/python3.12/site-packages/transformers/models/hunyuan_vl/modeling_hunyuan_vl.py", line 595, in forward
    query_states, key_states = apply_rotary_pos_emb_xdrope(
                               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/al0nkr/visual-search-rag/visragvenv/lib/python3.12/site-packages/transformers/models/hunyuan_vl/modeling_hunyuan_vl.py", line 474, in apply_rotary_pos_emb_xdrope
    cos = cos[position_ids, ...].permute(0, 2, 1, 3).reshape(output_size[0], output_size[2], x_dim, -1).contiguous()
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: permute(sparse_coo): number of dimensions in the tensor input does not match the length of the desired ordering of dimensions i.e. input.dim() = 3 is not equal to len(dims) = 4

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/al0nkr/visual-search-rag/src/quantization/awq_test.py", line 77, in <module>
    oneshot(
  File "/home/al0nkr/visual-search-rag/visragvenv/lib/python3.12/site-packages/llmcompressor/entrypoints/oneshot.py", line 330, in oneshot
    one_shot()
  File "/home/al0nkr/visual-search-rag/visragvenv/lib/python3.12/site-packages/llmcompressor/entrypoints/oneshot.py", line 158, in __call__
    self.apply_recipe_modifiers(
  File "/home/al0nkr/visual-search-rag/visragvenv/lib/python3.12/site-packages/llmcompressor/entrypoints/oneshot.py", line 201, in apply_recipe_modifiers
    pipeline(
  File "/home/al0nkr/visual-search-rag/visragvenv/lib/python3.12/site-packages/llmcompressor/pipelines/independent/pipeline.py", line 45, in __call__
    pipeline(model, dataloader, dataset_args)
  File "/home/al0nkr/visual-search-rag/visragvenv/lib/python3.12/site-packages/llmcompressor/pipelines/sequential/pipeline.py", line 104, in __call__
    subgraph.forward(model, **inputs)
  File "/home/al0nkr/visual-search-rag/visragvenv/lib/python3.12/site-packages/llmcompressor/pipelines/sequential/helpers.py", line 75, in forward
    raise RuntimeError(
RuntimeError: Raised an exception during execution of the following code:

1
2 torch.fx._symbolic_trace.wrap("transformers_models_hunyuan_vl_modeling_hunyuan_vl_wrapped_0")
3 torch.fx._symbolic_trace.wrap("transformers_models_hunyuan_vl_modeling_hunyuan_vl_wrapped_1")
4 torch.fx._symbolic_trace.wrap("transformers_models_hunyuan_vl_modeling_hunyuan_vl_wrapped_3")
5 torch.fx._symbolic_trace.wrap("transformers_models_hunyuan_vl_modeling_hunyuan_vl_wrapped_4")
6 torch.fx._symbolic_trace.wrap("transformers_models_hunyuan_vl_modeling_hunyuan_vl_wrapped_5")
7
8 def forward(self, input_ids : torch.Tensor, attention_mask : torch.Tensor):
9 wrapped_0 = transformers_models_hunyuan_vl_modeling_hunyuan_vl_wrapped_0(input_ids, None); wrapped_0 = None
10 wrapped_1 = transformers_models_hunyuan_vl_modeling_hunyuan_vl_wrapped_1(input_ids, None); input_ids = None
11 getitem = wrapped_1[0]; wrapped_1 = None
12 wrapped_3 = transformers_models_hunyuan_vl_modeling_hunyuan_vl_wrapped_3(None, getitem, None)
13 getitem_1 = wrapped_3[0]
14 getitem_2 = wrapped_3[1]; wrapped_3 = getitem_2 = None
15 wrapped_4 = transformers_models_hunyuan_vl_modeling_hunyuan_vl_wrapped_4(getitem_1, None)
16 getitem_3 = wrapped_4[0]; wrapped_4 = None
17 wrapped_5 = transformers_models_hunyuan_vl_modeling_hunyuan_vl_wrapped_5(attention_mask, getitem_1, getitem, None, getitem_3); attention_mask = None
18 model_layers_0 = getattr(self.model.layers, "0")(getitem, attention_mask = wrapped_5, position_ids = getitem_3, past_key_values = None, cache_position = getitem_1); getitem = None
19 return {'getitem_1': getitem_1, 'getitem_3': getitem_3, 'wrapped_5': wrapped_5, 'model_layers_0': model_layers_0}
20

🐛 Describe the bug

  • Tried to quantize the HunyuanOCR model with a small COCO-JSON dataset formatted into a dataloader, using AWQModifier and oneshot; the run fails inside apply_rotary_pos_emb_xdrope with the permute dimension mismatch shown above (minimal shape sketch below).
  • I'm new to llm-compressor and vLLM, and since HunyuanOCR launched only recently, I assume support for it hasn't been fully added yet?
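
For context only: the error says permute(0, 2, 1, 3) was applied to a 3-D tensor. A minimal sketch with hypothetical shapes (not the model's actual code) that reproduces the same mismatch, assuming the rotary cos table is 2-D and position_ids arrives as a plain (batch, seq_len) tensor:

import torch

# Hypothetical shapes, for illustration only: indexing a 2-D rotary table with a
# 2-D position_ids tensor yields a 3-D result, so a 4-axis permute must fail.
cos = torch.randn(2048, 128)                    # (max_seq_len, head_dim)
position_ids = torch.arange(16).unsqueeze(0)    # (batch=1, seq_len=16), only 2-D

indexed = cos[position_ids, ...]                # shape (1, 16, 128) -> 3 dims
indexed.permute(0, 2, 1, 3)                     # RuntimeError: 3 dims vs the 4 requested

If the xdrope path expects position_ids with an extra leading axis, that would explain the mismatch, but I haven't checked what shape the sequential pipeline actually feeds in.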

🛠️ Steps to reproduce

awq_test.py

import torch, json, os
from datasets import Dataset
from PIL import Image
from transformers import AutoModelForVision2Seq, AutoProcessor
from llmcompressor.modifiers.awq import AWQModifier
from llmcompressor import oneshot

MODEL_ID   = "tencent/HunyuanOCR"
SAVE_DIR   = "./HunyuanOCR-AWQ-W4A16"
CALIB_FILE = "calib_512.json"

# 1.  load
model = AutoModelForVision2Seq.from_pretrained(
            MODEL_ID,
            torch_dtype=torch.bfloat16,
            device_map="auto",
            trust_remote_code=True,
            offload_buffers=True,
            attn_implementation="sdpa")
processor = AutoProcessor.from_pretrained(MODEL_ID, trust_remote_code=True)
# 2.  raw HF dataset  (text=str, image=PIL)
def build_ds():
    data = json.load(open(CALIB_FILE))
    return Dataset.from_list([
        {"image": Image.open(d["image"]).convert("RGB"),
         "text":  f"<image>{d['text']}"}      #  keep <image> for safety
        for d in data
    ])

calib_ds = build_ds()

# 3.  tell llm-compressor how to turn samples → tensors
def collate(sample):
    # 1.  build the prompt exactly like the model card
    text = f"<image>\n{sample['text']}"          #  MUST contain <image>
    # 2.  single processor call – do NOT tokenise beforehand
    batch = processor(text=text,
                      images=sample["image"],
                      return_tensors="pt",
                      truncation=True,
                      max_length=2048)
    return batch
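# NOTE: this collate helper is never actually passed to oneshot() below, so I am
# not sure the image tensors reach the calibration forward pass.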


# 4.  recipe  (ignore tiny vision & head layers)
recipe = AWQModifier(
    ignore=[
        "re:.*embed_tokens",
        "re:.*input_layernorm$",
        "re:.*mlp[.]gate$",
        "re:.*post_attention_layernorm$",
        "re:.*norm$",
        "re:model[.]visual.*",
        "re:visual.*",
        "lm_head",
    ],
    duo_scaling=True,
    config_groups={
        "group_0": {
            "targets": ["Linear"],
            "weights": {
                "num_bits": 4,
                "type": "int",
                "symmetric": True,
                "group_size": 32,
                "strategy": "group",
                "dynamic": False,
                "actorder": None,
                "observer": "mse",
            },
        }
    },
)


# 5.  one-shot compression
oneshot(
    model=model,
    tokenizer=processor.tokenizer,   # llm-compressor needs a tokenizer
    dataset=calib_ds,
    num_calibration_samples=512,
    recipe=recipe,
    output_dir=SAVE_DIR,
    max_seq_length=2048)

# 6.  export in compressed-tensors format (vLLM native)
model.save_pretrained(SAVE_DIR,  save_compressed=True)
processor.save_pretrained(SAVE_DIR)
print("AWQ W4A16 finished →", SAVE_DIR)
