[multimodal] Allow float32 image input #14359
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/14359
Note: Links to docs will display an error until the docs builds have been completed.
⏳ No Failures, 4 Pending
As of commit 4d6d3be with merge base d25c35a.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
jackzhxng
left a comment
nice. this would be a good opportunity to get started on a multimodal_prefiller test file too
executorch::runtime::Result<executorch::extension::TensorPtr> toTensor(
    bool with_batch = false) const {
  // Note: This creates a 3D tensor (CHW). The model might expect a 4D
seems like you already batch using with_batch so can rm this comment?
I wish it were easy to test. My plan is to set up some Python tests in the pybind PR.
4f68739 to 4d6d3be
Cherry-pick candidate?
Could be. I have a few other PRs, so let me land everything and see.
@pytorchbot cherry-pick --onto release/1.0 -c critical
Let the `Image` class support both `uint8_t` and `float` data types, and change the `MultimodalPrefiller` class to support text, image, and audio modalities with error checking and modularity.

**Image Data Handling and Type Safety:**

* Refactored the `Image` class in `image.h` from a simple struct to a class that uses a `std::variant` to support both `uint8_t` and `float` image data, providing type-safe accessors and a `toTensor` method for conversion to tensors.
* Updated `load_image` in Llava `main.cpp` to construct `Image` objects using the new class interface and move semantics, ensuring correct data layout and encapsulation.
* Added a runtime check in `LlavaImagePrefiller` to ensure only `uint8_t` images are processed, using the new type-checking methods.

**Multimodal Prefill Logic and Flexibility:**

* Updated the `MultimodalPrefiller` class in `multimodal_prefiller.h` to dynamically check input types, validate tensor types against model expectations, and handle encoder/decoder execution with improved error handling and modularity.

(cherry picked from commit bc18834)
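For readers unfamiliar with the pattern, the `std::variant`-backed image described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the code from this PR: `ImageSketch`, its accessor names, `sizes`, and `element_count` are all hypothetical, and the real `Image::toTensor` returns an ExecuTorch `Result<TensorPtr>`, which is replaced here by a plain shape helper so the example needs only the standard library.

```cpp
#include <cstddef>
#include <cstdint>
#include <utility>
#include <variant>
#include <vector>

// Hypothetical sketch; NOT the actual ExecuTorch `Image` class.
// Pixel data is held as either uint8_t or float, never both.
class ImageSketch {
 public:
  ImageSketch(std::vector<uint8_t> data, int32_t width, int32_t height,
              int32_t channels)
      : data_(std::move(data)),
        width_(width),
        height_(height),
        channels_(channels) {}

  ImageSketch(std::vector<float> data, int32_t width, int32_t height,
              int32_t channels)
      : data_(std::move(data)),
        width_(width),
        height_(height),
        channels_(channels) {}

  // Type checks a caller can use before picking an accessor.
  bool is_uint8() const {
    return std::holds_alternative<std::vector<uint8_t>>(data_);
  }
  bool is_float() const {
    return std::holds_alternative<std::vector<float>>(data_);
  }

  // Typed accessors; std::get throws std::bad_variant_access on a mismatch.
  const std::vector<uint8_t>& get_uint8_data() const {
    return std::get<std::vector<uint8_t>>(data_);
  }
  const std::vector<float>& get_float_data() const {
    return std::get<std::vector<float>>(data_);
  }

  // Shape in CHW order, with an optional leading batch dimension of 1
  // (assumed here to be what a `with_batch` flag would control).
  std::vector<int32_t> sizes(bool with_batch = false) const {
    std::vector<int32_t> shape;
    if (with_batch) {
      shape.push_back(1);
    }
    shape.push_back(channels_);
    shape.push_back(height_);
    shape.push_back(width_);
    return shape;
  }

  // Number of stored elements, regardless of the active element type.
  std::size_t element_count() const {
    return std::visit([](const auto& v) { return v.size(); }, data_);
  }

 private:
  std::variant<std::vector<uint8_t>, std::vector<float>> data_;
  int32_t width_;
  int32_t height_;
  int32_t channels_;
};
```

A prefiller that only handles `uint8_t` input could call `is_uint8()` up front and return an error otherwise, rather than relying on the exception thrown by the typed accessor.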
Cherry picking #14359
The cherry pick PR is at #14490 and it is recommended to link a critical cherry pick PR with an issue. The following tracker issues are updated:
Details for Dev Infra team: Raised by workflow job