Conversation

@akacmazz commented Aug 11, 2025

Skip weight initialization for int8/uint8 quantized weights in the _init_weights method.
The normal_() function only works with floating-point tensors, but quantized
models contain int8/uint8 weights, which should preserve their loaded values.

Fixes #39366

  • Add dtype check before calling normal_() on weights
  • Skip initialization for int8/uint8 weights and biases
  • Add debug logging when skipping quantized weights
  • Add comprehensive tests for quantized weight handling
  • Maintain backward compatibility with existing models

## What does this PR do?

Fixes a RuntimeError that occurs when loading llmcompressor W8A8 quantized models. The issue happens because the _init_weights method attempts to apply normal_() initialization to int8 tensors, which PyTorch doesn't support.

## Before & After

### Before (❌)

model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "RedHatAI/Qwen2.5-VL-7B-Instruct-quantized.w8a8"
)
# RuntimeError: expected a floating-point or complex dtype, but got dtype=torch.int8

### After (✅)

model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "RedHatAI/Qwen2.5-VL-7B-Instruct-quantized.w8a8"
)
# Model loads successfully

## Root Cause

The PreTrainedModel._init_weights() method in modeling_utils.py calls module.weight.data.normal_() on all linear and embedding layers. However, quantized models have int8/uint8 weights that:

1. Cannot use normal_() (a PyTorch limitation; a minimal reproduction is sketched after this list)
2. Should preserve their quantized values anyway
3. Don't require re-initialization
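
The failure is reproducible outside of transformers as well. Below is a minimal sketch of the underlying PyTorch behavior; the Linear layer and the int8 weight assignment are illustrative, not code from this PR.

import torch
import torch.nn as nn

# Simulate a linear layer whose weights came from a W8A8 checkpoint.
layer = nn.Linear(4, 4)
layer.weight.data = torch.zeros(4, 4, dtype=torch.int8)

try:
    # This is effectively what _init_weights does for linear layers.
    layer.weight.data.normal_(mean=0.0, std=0.02)
except RuntimeError as err:
    # PyTorch cannot sample a normal distribution into an integer tensor.
    print(f"RuntimeError: {err}")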

## Changes

- Add dtype checking before calling normal_() (see the sketch after this list)
- Skip initialization for int8/uint8 weights and biases
- Preserve quantized values as loaded from model files
- Add debug logging when skipping quantized layers
- Maintain full backward compatibility
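
A simplified sketch of the guard described above, written as a standalone helper rather than the exact _init_weights diff (the helper and constant names are illustrative):

import torch
from torch import nn

# dtypes that mark already-quantized weights, which must not be re-initialized
_QUANTIZED_INT_DTYPES = (torch.int8, torch.uint8)


def init_linear_weights(module: nn.Linear, std: float) -> None:
    if module.weight.dtype in _QUANTIZED_INT_DTYPES:
        # Quantized values were loaded from the checkpoint; leave them untouched.
        pass
    else:
        module.weight.data.normal_(mean=0.0, std=std)
    # The same dtype check guards the bias before it is zeroed.
    if module.bias is not None and module.bias.dtype not in _QUANTIZED_INT_DTYPES:
        module.bias.data.zero_()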

## Testing

- Reproduced original error with unmodified code
- Verified fix works with real quantized model
- Confirmed 196 quantized layers load correctly
- Added comprehensive tests for both int8 and float32 scenarios
- Validated backward compatibility with existing models

## Impact

This fix enables loading of:
- llmcompressor W8A8 quantized models
- Other int8/uint8 quantization formats
- Future compressed-tensors quantized models

Affects: all models inheriting from PreTrainedModel that carry int8/uint8 quantized weights
Benefits: these quantized checkpoints can now be loaded without errors

## Before submitting
- [ ] This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
- [ ] Did you read the [contributor guideline](https://github.com/huggingface/transformers/blob/main/CONTRIBUTING.md#create-a-pull-request),
    Pull Request section?
- [ ] Was this discussed/approved via a Github issue or the [forum](https://discuss.huggingface.co/)? Please add a link
    to it if that's the case.
- [ ] Did you make sure to update the documentation with your changes? Here are the
    [documentation guidelines](https://github.com/huggingface/transformers/tree/main/docs), and
    [here are tips on formatting docstrings](https://github.com/huggingface/transformers/tree/main/docs#writing-source-documentation).
- [ ] Did you write any new necessary tests?


## Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

<!-- Your PR will be replied to more quickly if you can figure out the right person to tag with @

If you know how to use git blame, that is the easiest way, otherwise, here is a rough guide of **who to tag**.
Please tag fewer than 3 people.

Models:

- text models: @ArthurZucker
- vision models: @amyeroberts, @qubvel
- speech models: @eustlb
- graph models: @clefourrier

Library:

- flax: @gante and @Rocketknight1
- generate: @zucchini-nlp (visual-language models) or @gante (all others)
- pipelines: @Rocketknight1
- tensorflow: @gante and @Rocketknight1
- tokenizers: @ArthurZucker
- trainer: @zach-huggingface, @SunMarc and @qgallouedec
- chat templates: @Rocketknight1

Integrations:

- deepspeed: HF Trainer/Accelerate: @SunMarc @zach-huggingface
- ray/raytune: @richardliaw, @amogkam
- Big Model Inference: @SunMarc
- quantization (bitsandbytes, autogpt): @SunMarc @MekkCyber

Documentation: @stevhliu

HF projects:

- accelerate: [different repo](https://github.com/huggingface/accelerate)
- datasets: [different repo](https://github.com/huggingface/datasets)
- diffusers: [different repo](https://github.com/huggingface/diffusers)
- rust tokenizers: [different repo](https://github.com/huggingface/tokenizers)

Maintained examples (not research project or legacy):

- Flax: @Rocketknight1
- PyTorch: See Models above and tag the person corresponding to the modality of the example.
- TensorFlow: @Rocketknight1

-->

@Rocketknight1 (Member)

cc @MekkCyber for quantization

@jubueche

I am also encountering this issue :)

@MekkCyber (Contributor) left a comment

Thanks for fixing this 🤗! Left some comments below.

logger.debug(f"Skipping weight initialization for quantized module {module.__class__.__name__} with dtype {module.weight.dtype}")
else:
module.weight.data.normal_(mean=0.0, std=std)
if module.bias is not None and module.bias.dtype not in (torch.int8, torch.uint8):

@MekkCyber (Contributor)

Do we need this for .zero_() too, or only .normal_()?

@akacmazz (Author)

Thanks for the review, @MekkCyber. Yes, we should apply the same dtype check for .zero_() as well; good point. Looking at the code, there are several places where .zero_() is called on weights and biases:

1. Line 2938: module.bias.data.zero_()
2. Line 2946: module.weight.data[module.padding_idx].zero_() (for embeddings)
3. Line 2961: module.bias.data.zero_() (for normalization layers)

The current fix already handles the bias at line 2938 with the dtype check, and line 2961 is for normalization layers, which typically don't have quantized biases. However, line 2946 for the embedding padding_idx could potentially fail with quantized embeddings.

Should I update the fix to also check the dtype before calling .zero_() on the padding index, for consistency?
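
For reference, a standalone sketch of the extra padding_idx guard discussed here; the helper name is illustrative, and the real change would sit next to the existing zero_() call in modeling_utils.py:

import torch
from torch import nn


def zero_embedding_padding(module: nn.Embedding) -> None:
    if module.padding_idx is None:
        return
    if module.weight.dtype in (torch.int8, torch.uint8):
        # Quantized embedding table: preserve the loaded values, same rationale as for weights.
        return
    module.weight.data[module.padding_idx].zero_()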

Comment on lines +1 to +16

import unittest
import torch
import torch.nn as nn
from transformers import PreTrainedModel, PretrainedConfig


class TestQuantizedWeightInitialization(unittest.TestCase):
    """Test that quantized weights are not re-initialized during model loading."""

    def test_int8_weights_skipped(self):
        """Test that int8 weights are skipped during initialization."""

        class TestConfig(PretrainedConfig):
            def __init__(self, **kwargs):
                super().__init__(**kwargs)
                self.initializer_range = 0.02

@MekkCyber (Contributor)

We can just add a simple test in tests/quantization/compressed_tensors_integration with the failing model instead.

@akacmazz (Author)

Sure, that makes much more sense. I can move the test to tests/quantization/compressed_tensors_integration/, since that's where compressed-tensors related tests belong, and add a simple test there that reproduces the original failing scenario with the quantized model "nm-testing/tinyllama-w8a8-compressed-hf-quantizer" that's already used in the existing tests. Thank you.
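
A minimal sketch of what such an integration test could look like, assuming it lives under tests/quantization/compressed_tensors_integration/ and loads the checkpoint mentioned above; the class and method names are illustrative, and a real test would also carry the usual require_* decorators:

import unittest

from transformers import AutoModelForCausalLM


class CompressedTensorsW8A8LoadingTest(unittest.TestCase):
    def test_w8a8_checkpoint_loads_without_init_error(self):
        # Before this fix, from_pretrained raised a RuntimeError because
        # _init_weights called normal_() on int8 tensors.
        model = AutoModelForCausalLM.from_pretrained(
            "nm-testing/tinyllama-w8a8-compressed-hf-quantizer"
        )
        self.assertIsNotNone(model)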

@MekkCyber (Contributor)

Thanks @akacmazz! Still, I'm a bit confused about why we are trying to initialize weights at all for quantized models; the weights should already be there, since we are just loading them. I will try to take a deeper look, because I think the issue is deeper than this.

@jubueche

BTW, I also encountered a scenario where I was loading an llm-compressor compressed model and it did not have the weights in int8; rather, it did not have a weight attribute at all, because the weight was still packed and was called weight_packed (or something similar). So I had to adjust this one block to:

if isinstance(module, (nn.Linear, nn.Conv1d, nn.Conv2d, nn.Conv3d, nn.ConvTranspose1d, nn.ConvTranspose2d)):
    # Skip initialization for quantized weights (int8, uint8)
    if hasattr(module, "weight") and module.weight.dtype in (torch.int8, torch.uint8):
        logger.debug(f"Skipping weight initialization for quantized module {module.__class__.__name__} with dtype {module.weight.dtype}")
    elif not hasattr(module, "weight"):
        logger.debug(f"Skipping weight initialization because {module} does not have attribute `weight`.")
    else:
        module.weight.data.normal_(mean=0.0, std=std)
    if module.bias is not None and module.bias.dtype not in (torch.int8, torch.uint8):
        module.bias.data.zero_()

With this, my model skipped initialization for these layers and ran fine. I can give more details if you want.

Comment on lines +46 to +56

class TestConfig(PretrainedConfig):
    def __init__(self, **kwargs):
        super().__init__(**kwargs)
        self.initializer_range = 0.02

class TestModel(PreTrainedModel):
    config_class = TestConfig

    def __init__(self, config):
        super().__init__(config)
        self.linear = nn.Linear(10, 10)


Hey, awesome work on this @akacmazz! Small suggestion, perhaps we could move TestConfig and TestModel to the module level as helper classes. This would make the test suite cleaner and prevent the same code from being defined twice.
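
A compact sketch of that suggestion, with the two helpers from the excerpt above lifted to module level so every test method can reuse them (only the structure changes; the class bodies are taken from the diff):

import torch.nn as nn
from transformers import PretrainedConfig, PreTrainedModel


class TestConfig(PretrainedConfig):
    def __init__(self, **kwargs):
        super().__init__(**kwargs)
        self.initializer_range = 0.02


class TestModel(PreTrainedModel):
    config_class = TestConfig

    def __init__(self, config):
        super().__init__(config)
        self.linear = nn.Linear(10, 10)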

Successfully merging this pull request may close these issues:

RuntimeError when loading llmcompressor W8A8 quantized model: int8 dtype in weight initialization (#39366)