Conversation

@akacmazz commented Aug 11, 2025

Skip weight initialization for int8/uint8 quantized weights in the _init_weights method.
The normal_() function only works with floating-point tensors, but quantized
models contain int8/uint8 weights, which should preserve their loaded values.

Fixes #39366

  • Add dtype check before calling normal_() on weights
  • Skip initialization for int8/uint8 weights and biases
  • Add debug logging when skipping quantized weights
  • Add comprehensive tests for quantized weight handling
  • Maintain backward compatibility with existing models

## What does this PR do?

Fixes a RuntimeError that occurs when loading llmcompressor W8A8 quantized models. The issue happens because the _init_weights method attempts to apply normal_() initialization to int8 tensors, which PyTorch doesn't support.

## Before & After

### Before (❌)

model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "RedHatAI/Qwen2.5-VL-7B-Instruct-quantized.w8a8"
)
# RuntimeError: expected a floating-point or complex dtype, but got dtype=torch.int8

### After (✅)

model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "RedHatAI/Qwen2.5-VL-7B-Instruct-quantized.w8a8"
)
# Model loads successfully

## Root Cause

The PreTrainedModel._init_weights() method in modeling_utils.py calls module.weight.data.normal_() on all linear and embedding layers. However, quantized models have int8/uint8 weights that:

1. Cannot use normal_() (a PyTorch limitation; a minimal reproduction is sketched after this list)
2. Should preserve their quantized values anyway
3. Don't require re-initialization
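
The failure is reproducible outside of transformers as well. Below is a minimal sketch of the underlying PyTorch behavior; the Linear layer and the int8 weight assignment are illustrative, not code from this PR.

import torch
import torch.nn as nn

# Simulate a linear layer whose weights came from a W8A8 checkpoint.
layer = nn.Linear(4, 4)
layer.weight.data = torch.zeros(4, 4, dtype=torch.int8)

try:
    # This is effectively what _init_weights does for linear layers.
    layer.weight.data.normal_(mean=0.0, std=0.02)
except RuntimeError as err:
    # PyTorch cannot sample a normal distribution into an integer tensor.
    print(f"RuntimeError: {err}")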

## Changes

- Add dtype checking before calling normal_() (see the sketch after this list)
- Skip initialization for int8/uint8 weights and biases
- Preserve quantized values as loaded from model files
- Add debug logging when skipping quantized layers
- Maintain full backward compatibility
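
A simplified sketch of the guard described above, written as a standalone helper rather than the exact _init_weights diff (the helper and constant names are illustrative):

import torch
from torch import nn

# dtypes that mark already-quantized weights, which must not be re-initialized
_QUANTIZED_INT_DTYPES = (torch.int8, torch.uint8)


def init_linear_weights(module: nn.Linear, std: float) -> None:
    if module.weight.dtype in _QUANTIZED_INT_DTYPES:
        # Quantized values were loaded from the checkpoint; leave them untouched.
        pass
    else:
        module.weight.data.normal_(mean=0.0, std=std)
    # The same dtype check guards the bias before it is zeroed.
    if module.bias is not None and module.bias.dtype not in _QUANTIZED_INT_DTYPES:
        module.bias.data.zero_()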

## Testing

- Reproduced original error with unmodified code
- Verified fix works with real quantized model
- Confirmed 196 quantized layers load correctly
- Added comprehensive tests for both int8 and float32 scenarios
- Validated backward compatibility with existing models

## Impact

This fix enables loading of:
- llmcompressor W8A8 quantized models
- Other int8/uint8 quantization formats
- Future compressed-tensors quantized models

Affects: all models inheriting from PreTrainedModel that carry int8/uint8 quantized weights
Benefits: these quantized checkpoints can now be loaded without errors

## Before submitting
- [ ] This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
- [ ] Did you read the [contributor guideline](https://github.com/huggingface/transformers/blob/main/CONTRIBUTING.md#create-a-pull-request),
    Pull Request section?
- [ ] Was this discussed/approved via a Github issue or the [forum](https://discuss.huggingface.co/)? Please add a link
    to it if that's the case.
- [ ] Did you make sure to update the documentation with your changes? Here are the
    [documentation guidelines](https://github.com/huggingface/transformers/tree/main/docs), and
    [here are tips on formatting docstrings](https://github.com/huggingface/transformers/tree/main/docs#writing-source-documentation).
- [ ] Did you write any new necessary tests?


## Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

<!-- Your PR will be replied to more quickly if you can figure out the right person to tag with @

If you know how to use git blame, that is the easiest way, otherwise, here is a rough guide of **who to tag**.
Please tag fewer than 3 people.

Models:

- text models: @ArthurZucker
- vision models: @amyeroberts, @qubvel
- speech models: @eustlb
- graph models: @clefourrier

Library:

- flax: @gante and @Rocketknight1
- generate: @zucchini-nlp (visual-language models) or @gante (all others)
- pipelines: @Rocketknight1
- tensorflow: @gante and @Rocketknight1
- tokenizers: @ArthurZucker
- trainer: @zach-huggingface, @SunMarc and @qgallouedec
- chat templates: @Rocketknight1

Integrations:

- deepspeed: HF Trainer/Accelerate: @SunMarc @zach-huggingface
- ray/raytune: @richardliaw, @amogkam
- Big Model Inference: @SunMarc
- quantization (bitsandbytes, autogpt): @SunMarc @MekkCyber

Documentation: @stevhliu

HF projects:

- accelerate: [different repo](https://github.com/huggingface/accelerate)
- datasets: [different repo](https://github.com/huggingface/datasets)
- diffusers: [different repo](https://github.com/huggingface/diffusers)
- rust tokenizers: [different repo](https://github.com/huggingface/tokenizers)

Maintained examples (not research project or legacy):

- Flax: @Rocketknight1
- PyTorch: See Models above and tag the person corresponding to the modality of the example.
- TensorFlow: @Rocketknight1

-->

@Rocketknight1 (Member)

cc @MekkCyber for quantization

@jubueche

I am also encountering this issue :)

@MekkCyber (Contributor) left a comment

Thanks for fixing this 🤗! Left some comments below.

logger.debug(f"Skipping weight initialization for quantized module {module.__class__.__name__} with dtype {module.weight.dtype}")
else:
module.weight.data.normal_(mean=0.0, std=std)
if module.bias is not None and module.bias.dtype not in (torch.int8, torch.uint8):

@MekkCyber (Contributor)

Do we need this for .zero_() too, or only .normal_()?

@akacmazz (Author)

Thanks for the review, @MekkCyber. Yes, we should apply the same dtype check for .zero_() as well; good point. Looking at the code, there are several places where .zero_() is called on weights and biases:

1. Line 2938: module.bias.data.zero_()
2. Line 2946: module.weight.data[module.padding_idx].zero_() (for embeddings)
3. Line 2961: module.bias.data.zero_() (for normalization layers)

The current fix already handles the bias at line 2938 with the dtype check, and line 2961 is for normalization layers, which typically don't have quantized biases. However, line 2946 for the embedding padding_idx could potentially fail with quantized embeddings.

Should I update the fix to also check the dtype before calling .zero_() on the padding index, for consistency?
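
For reference, a standalone sketch of the extra padding_idx guard discussed here; the helper name is illustrative, and the real change would sit next to the existing zero_() call in modeling_utils.py:

import torch
from torch import nn


def zero_embedding_padding(module: nn.Embedding) -> None:
    if module.padding_idx is None:
        return
    if module.weight.dtype in (torch.int8, torch.uint8):
        # Quantized embedding table: preserve the loaded values, same rationale as for weights.
        return
    module.weight.data[module.padding_idx].zero_()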

Comment on lines +1 to +16

import unittest
import torch
import torch.nn as nn
from transformers import PreTrainedModel, PretrainedConfig


class TestQuantizedWeightInitialization(unittest.TestCase):
    """Test that quantized weights are not re-initialized during model loading."""

    def test_int8_weights_skipped(self):
        """Test that int8 weights are skipped during initialization."""

        class TestConfig(PretrainedConfig):
            def __init__(self, **kwargs):
                super().__init__(**kwargs)
                self.initializer_range = 0.02

@MekkCyber (Contributor)

We can just add a simple test in tests/quantization/compressed_tensors_integration with the failing model instead.

@akacmazz (Author)

Sure, that makes much more sense. I can move the test to tests/quantization/compressed_tensors_integration/, since that's where compressed-tensors related tests belong, and add a simple test there that reproduces the original failing scenario with the quantized model "nm-testing/tinyllama-w8a8-compressed-hf-quantizer" that's already used in the existing tests. Thank you.
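
A minimal sketch of what such an integration test could look like, assuming it lives under tests/quantization/compressed_tensors_integration/ and loads the checkpoint mentioned above; the class and method names are illustrative, and a real test would also carry the usual require_* decorators:

import unittest

from transformers import AutoModelForCausalLM


class CompressedTensorsW8A8LoadingTest(unittest.TestCase):
    def test_w8a8_checkpoint_loads_without_init_error(self):
        # Before this fix, from_pretrained raised a RuntimeError because
        # _init_weights called normal_() on int8 tensors.
        model = AutoModelForCausalLM.from_pretrained(
            "nm-testing/tinyllama-w8a8-compressed-hf-quantizer"
        )
        self.assertIsNotNone(model)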

@MekkCyber (Contributor)

Thanks @akacmazz! Still, I'm a bit confused about why we are trying to initialize weights at all for quantized models; the weights should already be there, since we are just loading them. I will try to take a deeper look, because I think the issue is deeper than this.

@jubueche

BTW, I also encountered a scenario where I was loading an llm-compressor compressed model and it did not have the weights in int8; rather, it did not have a weight attribute at all, because the weight was still packed and was called weight_packed (or something similar). So I had to adjust this one block to:

if isinstance(module, (nn.Linear, nn.Conv1d, nn.Conv2d, nn.Conv3d, nn.ConvTranspose1d, nn.ConvTranspose2d)):
    # Skip initialization for quantized weights (int8, uint8)
    if hasattr(module, "weight") and module.weight.dtype in (torch.int8, torch.uint8):
        logger.debug(f"Skipping weight initialization for quantized module {module.__class__.__name__} with dtype {module.weight.dtype}")
    elif not hasattr(module, "weight"):
        logger.debug(f"Skipping weight initialization because {module} does not have attribute `weight`.")
    else:
        module.weight.data.normal_(mean=0.0, std=std)
    if module.bias is not None and module.bias.dtype not in (torch.int8, torch.uint8):
        module.bias.data.zero_()

With this, my model skipped initialization for these layers and ran fine. I can give more details if you want.

Comment on lines +46 to +56

class TestConfig(PretrainedConfig):
    def __init__(self, **kwargs):
        super().__init__(**kwargs)
        self.initializer_range = 0.02

class TestModel(PreTrainedModel):
    config_class = TestConfig

    def __init__(self, config):
        super().__init__(config)
        self.linear = nn.Linear(10, 10)


Hey, awesome work on this @akacmazz! Small suggestion, perhaps we could move TestConfig and TestModel to the module level as helper classes. This would make the test suite cleaner and prevent the same code from being defined twice.
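
A compact sketch of that suggestion, with the two helpers from the excerpt above lifted to module level so every test method can reuse them (only the structure changes; the class bodies are taken from the diff):

import torch.nn as nn
from transformers import PretrainedConfig, PreTrainedModel


class TestConfig(PretrainedConfig):
    def __init__(self, **kwargs):
        super().__init__(**kwargs)
        self.initializer_range = 0.02


class TestModel(PreTrainedModel):
    config_class = TestConfig

    def __init__(self, config):
        super().__init__(config)
        self.linear = nn.Linear(10, 10)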

Successfully merging this pull request may close these issues:

RuntimeError when loading llmcompressor W8A8 quantized model: int8 dtype in weight initialization (#39366)