
Regression in SmolVLM results in different vision embeddings #41190

@yfw

Description


System Info

The regression has been present since transformers v4.55.1; v4.54.1 is the last version with the previous behavior.

Who can help?

@zucchini-nlp

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

Save the script below as reproducer.py and run it with uv run reproducer.py; the inline script metadata pins the dependencies.

#!/usr/bin/env -S uv run --script
# /// script
# dependencies = [
#   "num2words",
#   "pillow",
#   "torch",
#   "torchvision",
#   "transformers==4.54.1",
# ]
# ///

import transformers
print(transformers.__version__)

from transformers import AutoProcessor, AutoModelForImageTextToText
import torch

# Load the SmolVLM2 checkpoint and its processor; the model runs in bfloat16 on CUDA.
model_path = "HuggingFaceTB/SmolVLM2-2.2B-Instruct"
processor = AutoProcessor.from_pretrained(model_path)
model = AutoModelForImageTextToText.from_pretrained(
    model_path,
    torch_dtype=torch.bfloat16,
).to("cuda")

# A single-image chat request.
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/bee.jpg"},
            {"type": "text", "text": "Can you describe this image?"},
        ]
    },
]

# Tokenize the chat and preprocess the image.
inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device, dtype=torch.bfloat16)

patch_size = model.model.vision_model.patch_size
# Flatten the batch and per-image dimensions into the shape the vision tower expects:
# (batch_size * num_images, num_channels, height, width).
pixel_values = inputs['pixel_values']
batch_size, num_images, num_channels, height, width = pixel_values.shape
pixel_values = pixel_values.view(batch_size * num_images, *pixel_values.shape[2:])

# Build an all-ones patch attention mask (no padded patches).
patch_attention_mask = torch.ones(
    (
        batch_size,
        pixel_values.size(2) // patch_size,
        pixel_values.size(3) // patch_size,
    )
)
patch_attention_mask = patch_attention_mask.to(dtype=torch.bool, device=pixel_values.device)

# Run the vision embeddings module directly and print a slice of the last patch embedding.
embeddings_model = model.model.vision_model.embeddings
embeddings = embeddings_model(pixel_values=pixel_values, patch_attention_mask=patch_attention_mask)
print(embeddings[0][-1][:10])

For transformers==4.54.1, this will produce:

tensor([-0.5938, -0.5117, -0.7305, -1.1797, -0.5977, -0.7305, -0.7070, -0.6484,
        -0.5547, -0.6758], device='cuda:0', dtype=torch.bfloat16,
       grad_fn=<SliceBackward0>)

Changing the dependencies to "transformers==4.55.1" and rerunning the script will produce:

tensor([-0.5977, -0.5742, -0.6875, -1.0625, -0.5977, -0.8398, -0.7031, -0.6406,
        -0.5078, -0.6445], device='cuda:0', dtype=torch.bfloat16,
       grad_fn=<SliceBackward0>)
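
For scale, the element-wise difference between the two printed slices is far larger than bfloat16 rounding noise. A quick check, using the values copied from the outputs above:

import torch

v4_54_1 = torch.tensor([-0.5938, -0.5117, -0.7305, -1.1797, -0.5977, -0.7305, -0.7070, -0.6484, -0.5547, -0.6758])
v4_55_1 = torch.tensor([-0.5977, -0.5742, -0.6875, -1.0625, -0.5977, -0.8398, -0.7031, -0.6406, -0.5078, -0.6445])
print((v4_54_1 - v4_55_1).abs().max())  # ~0.117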

The cause is that the logic for computing position_ids changed slightly between 4.54.1 (https://github.com/huggingface/transformers/blob/4.54.1/src/transformers/models/smolvlm/modeling_smolvlm.py#L131-L156) and 4.55.1 (https://github.com/huggingface/transformers/blob/v4.55.1/src/transformers/models/smolvlm/modeling_smolvlm.py#L131-L162). The new logic assigns different position_ids for the same inputs, which produces different position embeddings and therefore different final vision embeddings.
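
If it helps with triage, the following minimal sketch (appended to the reproducer; it assumes the SmolVLM vision embeddings module exposes patch_embedding as a Conv2d and position_embedding as an nn.Embedding) isolates the divergence: the raw patch embeddings should print identically under both versions, so the difference has to come from the position_ids fed to the position embedding.

# Diagnostic sketch: the conv patch embeddings should match under 4.54.1 and 4.55.1;
# only the position-embedding contribution differs.
patch_embeds = embeddings_model.patch_embedding(pixel_values)  # (batch * num_images, dim, h / patch, w / patch)
patch_embeds = patch_embeds.flatten(2).transpose(1, 2)         # (batch * num_images, num_patches, dim)
print(patch_embeds[0][-1][:10])  # expected to be identical for both transformers versions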

Expected behavior

We expect the vision embeddings to be identical across versions. In addition, the current implementation now differs from vLLM's, which is aligned with 4.54.1. That causes problems in RL setups, where the inference (vLLM) and training (transformers) implementations are expected to produce matching outputs.
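
For context, this is the kind of training/inference parity check that the change breaks. A minimal sketch, assuming the vision embeddings have been dumped with torch.save from both the training stack (transformers) and the inference stack (vLLM); the file names are hypothetical:

import torch

train_emb = torch.load("embeddings_transformers.pt")  # dumped from the transformers (training) path
infer_emb = torch.load("embeddings_vllm.pt")          # dumped from the vLLM (inference) path

max_diff = (train_emb.float() - infer_emb.float()).abs().max().item()
print(f"max abs diff: {max_diff}")
# With the position_ids change described above, a check like this starts failing.
assert torch.allclose(train_emb.float(), infer_emb.float(), atol=1e-3)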
