
Regression in SmolVLM results in different vision embeddings #41190

@yfw

Description


System Info

The regression has been present since transformers v4.55.1; v4.54.1 is the last version with the previous behavior.

Who can help?

@zucchini-nlp

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

Save the script below as reproducer.py and run it with uv run reproducer.py; the inline script metadata pins the dependencies.

#!/usr/bin/env -S uv run --script
# /// script
# dependencies = [
#   "num2words",
#   "pillow",
#   "torch",
#   "torchvision",
#   "transformers==4.54.1",
# ]
# ///

import transformers
print(transformers.__version__)

from transformers import AutoProcessor, AutoModelForImageTextToText
import torch

# Load the SmolVLM2 checkpoint and its processor; the model runs in bfloat16 on CUDA.
model_path = "HuggingFaceTB/SmolVLM2-2.2B-Instruct"
processor = AutoProcessor.from_pretrained(model_path)
model = AutoModelForImageTextToText.from_pretrained(
    model_path,
    torch_dtype=torch.bfloat16,
).to("cuda")

# A single-image chat request.
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/bee.jpg"},
            {"type": "text", "text": "Can you describe this image?"},
        ]
    },
]

# Tokenize the chat and preprocess the image.
inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device, dtype=torch.bfloat16)

patch_size = model.model.vision_model.patch_size
# Flatten the batch and per-image dimensions into the shape the vision tower expects:
# (batch_size * num_images, num_channels, height, width).
pixel_values = inputs['pixel_values']
batch_size, num_images, num_channels, height, width = pixel_values.shape
pixel_values = pixel_values.view(batch_size * num_images, *pixel_values.shape[2:])

# Build an all-ones patch attention mask (no padded patches).
patch_attention_mask = torch.ones(
    (
        batch_size,
        pixel_values.size(2) // patch_size,
        pixel_values.size(3) // patch_size,
    )
)
patch_attention_mask = patch_attention_mask.to(dtype=torch.bool, device=pixel_values.device)

# Run the vision embeddings module directly and print a slice of the last patch embedding.
embeddings_model = model.model.vision_model.embeddings
embeddings = embeddings_model(pixel_values=pixel_values, patch_attention_mask=patch_attention_mask)
print(embeddings[0][-1][:10])

For transformers==4.54.1, this will produce:

tensor([-0.5938, -0.5117, -0.7305, -1.1797, -0.5977, -0.7305, -0.7070, -0.6484,
        -0.5547, -0.6758], device='cuda:0', dtype=torch.bfloat16,
       grad_fn=<SliceBackward0>)

Changing the dependencies to "transformers==4.55.1" and rerunning the script will produce:

tensor([-0.5977, -0.5742, -0.6875, -1.0625, -0.5977, -0.8398, -0.7031, -0.6406,
        -0.5078, -0.6445], device='cuda:0', dtype=torch.bfloat16,
       grad_fn=<SliceBackward0>)
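
For scale, the element-wise difference between the two printed slices is far larger than bfloat16 rounding noise. A quick check, using the values copied from the outputs above:

import torch

v4_54_1 = torch.tensor([-0.5938, -0.5117, -0.7305, -1.1797, -0.5977, -0.7305, -0.7070, -0.6484, -0.5547, -0.6758])
v4_55_1 = torch.tensor([-0.5977, -0.5742, -0.6875, -1.0625, -0.5977, -0.8398, -0.7031, -0.6406, -0.5078, -0.6445])
print((v4_54_1 - v4_55_1).abs().max())  # ~0.117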

The cause is that the logic for computing position_ids changed slightly between 4.54.1 (https://github.com/huggingface/transformers/blob/4.54.1/src/transformers/models/smolvlm/modeling_smolvlm.py#L131-L156) and 4.55.1 (https://github.com/huggingface/transformers/blob/v4.55.1/src/transformers/models/smolvlm/modeling_smolvlm.py#L131-L162). The new logic assigns different position_ids for the same inputs, which produces different position embeddings and therefore different final vision embeddings.
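
If it helps with triage, the following minimal sketch (appended to the reproducer; it assumes the SmolVLM vision embeddings module exposes patch_embedding as a Conv2d and position_embedding as an nn.Embedding) isolates the divergence: the raw patch embeddings should print identically under both versions, so the difference has to come from the position_ids fed to the position embedding.

# Diagnostic sketch: the conv patch embeddings should match under 4.54.1 and 4.55.1;
# only the position-embedding contribution differs.
patch_embeds = embeddings_model.patch_embedding(pixel_values)  # (batch * num_images, dim, h / patch, w / patch)
patch_embeds = patch_embeds.flatten(2).transpose(1, 2)         # (batch * num_images, num_patches, dim)
print(patch_embeds[0][-1][:10])  # expected to be identical for both transformers versions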

Expected behavior

We expect the vision embeddings to be identical across versions. In addition, the current implementation now differs from vLLM's, which is aligned with 4.54.1. That causes problems in RL setups, where the inference (vLLM) and training (transformers) implementations are expected to produce matching outputs.
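
For context, this is the kind of training/inference parity check that the change breaks. A minimal sketch, assuming the vision embeddings have been dumped with torch.save from both the training stack (transformers) and the inference stack (vLLM); the file names are hypothetical:

import torch

train_emb = torch.load("embeddings_transformers.pt")  # dumped from the transformers (training) path
infer_emb = torch.load("embeddings_vllm.pt")          # dumped from the vLLM (inference) path

max_diff = (train_emb.float() - infer_emb.float()).abs().max().item()
print(f"max abs diff: {max_diff}")
# With the position_ids change described above, a check like this starts failing.
assert torch.allclose(train_emb.float(), infer_emb.float(), atol=1e-3)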
