Description
System Info
This has been an issue since v4.55.1.
Who can help?
Information
- The official example scripts
- My own modified scripts
Tasks
- An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
- My own task or dataset (give details below)
Reproduction
Save the script below as reproducer.py and run it with uv run reproducer.py:
#!/usr/bin/env -S uv run --script
# /// script
# dependencies = [
#     "num2words",
#     "pillow",
#     "torch",
#     "torchvision",
#     "transformers==4.54.1",
# ]
# ///
import transformers

print(transformers.__version__)

import torch
from transformers import AutoProcessor, AutoModelForImageTextToText

model_path = "HuggingFaceTB/SmolVLM2-2.2B-Instruct"
processor = AutoProcessor.from_pretrained(model_path)
model = AutoModelForImageTextToText.from_pretrained(
    model_path,
    torch_dtype=torch.bfloat16,
).to("cuda")

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/bee.jpg"},
            {"type": "text", "text": "Can you describe this image?"},
        ],
    },
]
inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device, dtype=torch.bfloat16)

# Flatten the image batch and build an all-ones patch attention mask, then
# call the vision embeddings module directly, so that any difference between
# versions is isolated to the embedding computation.
patch_size = model.model.vision_model.patch_size
pixel_values = inputs["pixel_values"]
batch_size, num_images, num_channels, height, width = pixel_values.shape
pixel_values = pixel_values.view(batch_size * num_images, *pixel_values.shape[2:])
patch_attention_mask = torch.ones(
    (
        batch_size,
        pixel_values.size(2) // patch_size,
        pixel_values.size(3) // patch_size,
    )
)
patch_attention_mask = patch_attention_mask.to(dtype=torch.bool, device=pixel_values.device)
embeddings_model = model.model.vision_model.embeddings
embeddings = embeddings_model(pixel_values=pixel_values, patch_attention_mask=patch_attention_mask)
print(embeddings[0][-1][:10])
For transformers==4.54.1, this will produce:
tensor([-0.5938, -0.5117, -0.7305, -1.1797, -0.5977, -0.7305, -0.7070, -0.6484,
-0.5547, -0.6758], device='cuda:0', dtype=torch.bfloat16,
grad_fn=<SliceBackward0>)
Changing the dependencies entry to "transformers==4.55.1" and rerunning the script will produce:
tensor([-0.5977, -0.5742, -0.6875, -1.0625, -0.5977, -0.8398, -0.7031, -0.6406,
-0.5078, -0.6445], device='cuda:0', dtype=torch.bfloat16,
grad_fn=<SliceBackward0>)
The issue is that the logic for calculating position_ids changed slightly between 4.54.1 (https://github.com/huggingface/transformers/blob/4.54.1/src/transformers/models/smolvlm/modeling_smolvlm.py#L131-L156) and 4.55.1 (https://github.com/huggingface/transformers/blob/v4.55.1/src/transformers/models/smolvlm/modeling_smolvlm.py#L131-L162). This results in different position_embeddings, which affect the final embeddings.
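To make the mechanism concrete, here is a minimal, self-contained sketch of the fractional-coordinate bucketization this code path performs (paraphrased for illustration, not verbatim library code; num_patches_per_side and the patch grid are assumed values, the real ones come from the model config):

import torch

# Fractional patch coordinates are bucketized against a fixed boundary grid,
# mapping each patch of the actual image onto an id in the full
# position-embedding table.
num_patches_per_side = 32            # assumption: e.g. 512 px image, 16 px patches
nb_patches_h, nb_patches_w = 32, 32  # assumed patch grid of the actual image

boundaries = torch.arange(1 / num_patches_per_side, 1.0, 1 / num_patches_per_side)
fractional_coords_h = torch.arange(0, 1 - 1e-6, 1 / nb_patches_h)
fractional_coords_w = torch.arange(0, 1 - 1e-6, 1 / nb_patches_w)

bucket_coords_h = torch.bucketize(fractional_coords_h, boundaries, right=True)
bucket_coords_w = torch.bucketize(fractional_coords_w, boundaries, right=True)

# id = row * num_patches_per_side + col; any change in how the buckets are
# computed shifts these ids and therefore the looked-up position embeddings.
pos_ids = (bucket_coords_h[:, None] * num_patches_per_side + bucket_coords_w).flatten()
print(pos_ids[:10])

Even a small shift in these ids changes which rows of the position-embedding table are added to the patch embeddings, which is consistent with the small numeric drift shown above.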
Expected behavior
We expect the embeddings to be identical across versions. Furthermore, the current implementation differs from vLLM's (which is aligned with 4.54.1). This causes issues in RL, where we expect the inference and training implementations to be aligned.
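A quick way to quantify the drift is to save the embeddings tensor from the reproducer under each version and compare the two. This is a sketch under the assumption that something like torch.save(embeddings[0].cpu(), "emb_4541.pt") was added at the end of the script for each run; the file names are hypothetical:

import torch

# Load the embeddings saved under each transformers version and compare.
emb_old = torch.load("emb_4541.pt", map_location="cpu").float()  # 4.54.1
emb_new = torch.load("emb_4551.pt", map_location="cpu").float()  # 4.55.1

print(torch.equal(emb_old, emb_new))    # expected: True; currently False
print((emb_old - emb_new).abs().max())  # magnitude of the divergence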