[Bugfix] Standardize getting number of image patches/tokens #34358

Open
DarkLight1337 wants to merge 9 commits into vllm-project:main from DarkLight1337:fix-image-patches
@DarkLight1337
Member

@DarkLight1337 DarkLight1337 commented Feb 11, 2026

Purpose

  • Consider mm_kwargs when determining number of image tokens.
  • Disallow passing processor=None to simplify the code.
  • Fix Idefics3 and SmolVLM tests not passing mm_kwargs to the reference processor call.
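The first two points can be illustrated with a minimal sketch. Everything here is hypothetical: `ToyProcessor`, the keyword-only `get_num_image_tokens` signature, and the resize-then-count rule are stand-ins chosen for illustration, not the actual vLLM or Transformers code. The sketch only shows the idea that the processor is a required argument and that per-request `mm_kwargs` override the processor defaults before tokens are counted.

```python
from dataclasses import dataclass
from typing import Any, Mapping


@dataclass
class ToyProcessor:
    """Hypothetical stand-in for a HF processor (not a real class)."""

    patch_size: int = 14
    longest_edge: int = 364


def get_num_image_tokens(
    *,
    image_width: int,
    image_height: int,
    processor: ToyProcessor,       # required: callers may not pass None
    mm_kwargs: Mapping[str, Any],  # per-request overrides, e.g. {"size": {...}}
) -> int:
    if processor is None:
        raise TypeError("processor must be provided")

    # Per-request kwargs take precedence over the processor defaults.
    size = mm_kwargs.get("size", {})
    longest_edge = size.get("longest_edge", processor.longest_edge)

    # Toy rule (an assumption): scale the longer side to `longest_edge`,
    # then count one token per patch_size x patch_size patch.
    scale = longest_edge / max(image_width, image_height)
    width = max(1, round(image_width * scale))
    height = max(1, round(image_height * scale))
    cols = -(-width // processor.patch_size)   # ceiling division
    rows = -(-height // processor.patch_size)  # ceiling division
    return rows * cols


default_count = get_num_image_tokens(
    image_width=1456, image_height=1456,
    processor=ToyProcessor(), mm_kwargs={},
)
override_count = get_num_image_tokens(
    image_width=1456, image_height=1456,
    processor=ToyProcessor(), mm_kwargs={"size": {"longest_edge": 728}},
)
print(default_count, override_count)
```

If `mm_kwargs` were ignored, both calls would report the same count even though the second request asks for a larger resize target, which is the kind of mismatch the first bullet addresses.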

FIX Idefics3 test in #34334

Test Plan

Test Result


Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
  • (Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.

@DarkLight1337 DarkLight1337 added the ready (ONLY add when PR is ready to merge/full CI is needed) and multi-modality (Related to multi-modality (#4194)) labels Feb 11, 2026
@mergify mergify bot added the bug (Something isn't working) label Feb 11, 2026
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request standardizes the methods for calculating the number of image tokens across various multimodal models. The changes correctly enforce that a processor must be passed and that mm_kwargs are considered when applicable. This simplifies the code, improves consistency, and fixes bugs where these arguments were previously ignored. The refactoring is well-executed across multiple files. I have found one critical issue that could lead to a runtime error.

Signed-off-by: DarkLight1337 <[email protected]>
@DarkLight1337
Member Author

/gemini review

Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request introduces a significant and valuable refactoring across multiple multimodal models to standardize how the number of image patches and tokens are calculated. By making the processor argument non-optional and consistently passing mm_kwargs, the changes eliminate boilerplate code, improve clarity, and enhance correctness. The bug fixes in the Idefics3 and SmolVLM tests, as well as the fix for SmolVLMProcessingInfo._get_image_token, are also important improvements. The code is now more robust and easier to maintain. Overall, this is a well-executed and beneficial change.

Signed-off-by: DarkLight1337 <[email protected]>
Signed-off-by: DarkLight1337 <[email protected]>
Signed-off-by: DarkLight1337 <[email protected]>
Signed-off-by: DarkLight1337 <[email protected]>
@DarkLight1337 DarkLight1337 marked this pull request as draft February 12, 2026 04:51
@DarkLight1337
Member Author

DarkLight1337 commented Feb 12, 2026

I think there is some bug with get_number_of_image_patches in the idefics3 image processor (possibly same with smolvlm as well).

images
>>> [[<PIL.Image.Image image mode=RGB size=1456x1456 at 0x7FF451210A90>]]
>>> output_kwargs["images_kwargs"]
{'return_row_col_info': True, 'size': {'longest_edge': 364}, 'return_tensors': 'pt', 'input_data_format': 'channels_last'}
image_inputs = self.image_processor(images, **output_kwargs["images_kwargs"])
{k: image_inputs[k] for k in ("rows", "cols")}
>>> {'rows': [[0]], 'cols': [[0]]}
self.image_processor.get_number_of_image_patches(1456, 1456, output_kwargs["images_kwargs"])
>>> (1, 1, 1)

get_number_of_image_patches should be returning (0, 0, 0).

cc @hmellor @ArthurZucker @zucchini-nlp
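The mismatch reported above can be checked mechanically. The sketch below is a hypothetical helper, not part of vLLM or Transformers: it takes the `rows` and `cols` returned by the processor call and the tuples returned by `get_number_of_image_patches`, and lists the images where the two disagree. It assumes the last two entries of each tuple are the row and column counts, which the thread does not confirm.

```python
def find_grid_mismatches(rows, cols, reported):
    """List images whose grid info disagrees with the utility's answer.

    rows, cols: nested lists from the processor call
                (one inner list per sample, one entry per image).
    reported:   one tuple per image, in the same flattened order.
                Assumption: the last two entries are (rows, cols).
    """
    flat_rows = [r for sample in rows for r in sample]
    flat_cols = [c for sample in cols for c in sample]

    mismatches = []
    for idx, (r, c, rep) in enumerate(zip(flat_rows, flat_cols, reported)):
        rep_rows, rep_cols = rep[-2], rep[-1]
        if (r, c) != (rep_rows, rep_cols):
            mismatches.append((idx, (r, c), (rep_rows, rep_cols)))
    return mismatches


# Values copied from the comment above.
mismatches = find_grid_mismatches(
    rows=[[0]], cols=[[0]], reported=[(1, 1, 1)]
)
print(mismatches)  # → [(0, (0, 0), (1, 1))]
```

With the return value the comment expects, `(0, 0, 0)`, the same call yields an empty list.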

@zucchini-nlp
Contributor

By default it was meant to be "one", meaning no cropping into patches, iirc. But it is indeed confusing if the number doesn't match the rows/cols we get from calling the processor. That seems to have introduced the bug.
Have you already tested with various numbers of cols and rows to see whether the final number of placeholders is different? Or I can test it myself and fix it.

@DarkLight1337
Member Author

I will be afk for much of the day, would be much appreciated if you could help test this!

@zucchini-nlp
Contributor

Will do, no prob!

@zucchini-nlp
Contributor

zucchini-nlp commented Feb 12, 2026

@DarkLight1337 can you check if huggingface/transformers#43948 solves your problem? I found issues with a few other models and fixed them as well

@DarkLight1337
Member Author

The processor can at least run without errors but the test still fails due to incorrect output.

@zucchini-nlp
Contributor

Is that the test I pointed out yesterday? Also, can you make sure that the height/width passed to the utility is correct?

Signed-off-by: DarkLight1337 <[email protected]>
Signed-off-by: DarkLight1337 <[email protected]>
@DarkLight1337
Member Author

DarkLight1337 commented Feb 12, 2026

Ok the test has been fixed by f3705cd, I was just passing the kwargs incorrectly.

@DarkLight1337
Member Author

For now let's disable the tests until your patch has been merged.

@DarkLight1337
Member Author

Can we assume that your patch will land in v5.2?

@zucchini-nlp
Contributor

Yes, it will

@DarkLight1337
Member Author

Alright, then this PR should be good to go, thanks!

@DarkLight1337 DarkLight1337 enabled auto-merge (squash) February 12, 2026 15:37
Signed-off-by: DarkLight1337 <[email protected]>
Labels

bug (Something isn't working), multi-modality (Related to multi-modality (#4194)), ready (ONLY add when PR is ready to merge/full CI is needed)

3 participants