[Bugfix] Standardize getting number of image patches/tokens#34358
[Bugfix] Standardize getting number of image patches/tokens#34358DarkLight1337 wants to merge 9 commits intovllm-project:mainfrom
Conversation
Signed-off-by: DarkLight1337 <[email protected]>
There was a problem hiding this comment.
Code Review
This pull request standardizes the methods for calculating the number of image tokens across various multimodal models. The changes correctly enforce that a processor must be passed and that mm_kwargs are considered when applicable. This simplifies the code, improves consistency, and fixes bugs where these arguments were previously ignored. The refactoring is well-executed across multiple files. I have found one critical issue that could lead to a runtime error.
Signed-off-by: DarkLight1337 <[email protected]>
|
/gemini review |
There was a problem hiding this comment.
Code Review
This pull request introduces a significant and valuable refactoring across multiple multimodal models to standardize how the number of image patches and tokens are calculated. By making the processor argument non-optional and consistently passing mm_kwargs, the changes eliminate boilerplate code, improve clarity, and enhance correctness. The bug fixes in the Idefics3 and SmolVLM tests, as well as the fix for SmolVLMProcessingInfo._get_image_token, are also important improvements. The code is now more robust and easier to maintain. Overall, this is a well-executed and beneficial change.
Signed-off-by: DarkLight1337 <[email protected]>
Signed-off-by: DarkLight1337 <[email protected]>
Signed-off-by: DarkLight1337 <[email protected]>
Signed-off-by: DarkLight1337 <[email protected]>
|
I think there is some bug with get_number_of_image_patches in the idefics3 image processor (possibly same with smolvlm as well). images
>>> [[<PIL.Image.Image image mode=RGB size=1456x1456 at 0x7FF451210A90>]]
>>> output_kwargs["images_kwargs"]
{'return_row_col_info': True, 'size': {'longest_edge': 364}, 'return_tensors': 'pt', 'input_data_format': 'channels_last'}
image_inputs = self.image_processor(images, **output_kwargs["images_kwargs"])
{k: image_inputs[k] for k in ("rows", "cols")
>>> {'rows': [[0]], 'cols': [[0]]}self.image_processor.get_number_of_image_patches(1456, 1456, output_kwargs["images_kwargs"])
>>> (1, 1, 1)
|
|
By default it was meant to be "one" meaning no cropping to patches iirc. But it indeed is confusing if the number doesn't match the |
|
I will be afk for much of the day, would be much appreciated if you could help test this! |
|
Will do, no prob! |
|
@DarkLight1337 can you check if huggingface/transformers#43948 solves your problem? I found issues with a few other models and fixed them as well |
|
The processor can at least run without errors but the test still fails due to incorrect output. |
|
Is that the test I've pointed out yesterday and can you make sure that the |
Signed-off-by: DarkLight1337 <[email protected]>
bb09dd8 to
f3705cd
Compare
Signed-off-by: DarkLight1337 <[email protected]>
|
Ok the test has been fixed by f3705cd, I was just passing the kwargs incorrectly. |
|
For now let's disable the tests until your patch has been merged. |
|
Can we assume that your patch will land in v5.2? |
|
Yes, it will |
|
Alright, then this PR should be good to go, thanks! |
Signed-off-by: DarkLight1337 <[email protected]>
Purpose
mm_kwargswhen determining number of image tokens.processor=Noneto simplify the codemm_kwargsto the reference processor call.FIX Idefics3 test in #34334
Test Plan
Test Result
Essential Elements of an Effective PR Description Checklist
supported_models.mdandexamplesfor a new model.