@larryliu0820
Contributor

Lets the Image class support both uint8_t and float data types, and changes the MultimodalPrefiller class to support text, image, and audio modalities with error checking and modularity.

Image Data Handling and Type Safety:

  • Refactored the Image class in image.h from a simple struct to a class that uses a std::variant to support both uint8_t and float image data, providing type-safe accessors and a toTensor method for conversion to tensors.
  • Updated load_image in Llava main.cpp to construct Image objects using the new class interface and move semantics, ensuring correct data layout and encapsulation.
  • Added a runtime check in LlavaImagePrefiller to ensure only uint8_t images are processed, using the new type-checking methods.
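
A minimal sketch of the variant-based design described in the bullets above. This is not the actual `image.h`: the member names (`is_uint8`, `get_float_data`, `sizes`) are assumptions for illustration, and the hypothetical `sizes` helper stands in for `toTensor` so the example compiles without ExecuTorch.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <variant>
#include <vector>

// Hypothetical sketch of an image that owns either uint8_t or float pixels.
class Image {
 public:
  // Takes ownership of the pixel buffer via move semantics.
  Image(std::vector<uint8_t>&& data, int32_t width, int32_t height,
        int32_t channels)
      : data_(std::move(data)),
        width_(width),
        height_(height),
        channels_(channels) {
    assert(element_count() == expected_count());
  }

  Image(std::vector<float>&& data, int32_t width, int32_t height,
        int32_t channels)
      : data_(std::move(data)),
        width_(width),
        height_(height),
        channels_(channels) {
    assert(element_count() == expected_count());
  }

  // Type checks: callers ask which alternative is stored before reading it.
  bool is_uint8() const {
    return std::holds_alternative<std::vector<uint8_t>>(data_);
  }
  bool is_float() const {
    return std::holds_alternative<std::vector<float>>(data_);
  }

  // Accessors: std::get throws std::bad_variant_access on a type mismatch.
  const std::vector<uint8_t>& get_uint8_data() const {
    return std::get<std::vector<uint8_t>>(data_);
  }
  const std::vector<float>& get_float_data() const {
    return std::get<std::vector<float>>(data_);
  }

  // Shape a tensor conversion would use: CHW, or 1xCxHxW when batched.
  std::vector<int32_t> sizes(bool with_batch = false) const {
    if (with_batch) {
      return {1, channels_, height_, width_};
    }
    return {channels_, height_, width_};
  }

 private:
  std::size_t expected_count() const {
    return static_cast<std::size_t>(width_) *
        static_cast<std::size_t>(height_) *
        static_cast<std::size_t>(channels_);
  }
  std::size_t element_count() const {
    return std::visit([](const auto& v) { return v.size(); }, data_);
  }

  std::variant<std::vector<uint8_t>, std::vector<float>> data_;
  int32_t width_;
  int32_t height_;
  int32_t channels_;
};
```

With this shape, a caller that only supports uint8_t input (as the `LlavaImagePrefiller` check does) can test `is_uint8()` and return an error before touching the data, rather than relying on the exception from `std::get`.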

Multimodal Prefill Logic and Flexibility:

  • Updated the MultimodalPrefiller class in multimodal_prefiller.h to dynamically check input types, validate tensor types against model expectations, and handle encoder/decoder execution with improved error handling and modularity.
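
The input-type dispatch and dtype validation described above could look roughly like the following. This is a sketch under stated assumptions, not the code in `multimodal_prefiller.h`: `ScalarType`, `Error`, `TextInput`, `ImageInput`, `AudioInput`, and `check_input` are all hypothetical stand-ins, and the real prefiller reads the expected dtype from the model rather than taking it as a parameter.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <variant>
#include <vector>

// Hypothetical stand-ins for the runtime's dtype and error enums.
enum class ScalarType { Byte, Float };
enum class Error { Ok, InvalidArgument };

using PixelData = std::variant<std::vector<uint8_t>, std::vector<float>>;

// Hypothetical input wrappers, one per modality.
struct TextInput {
  std::string text;
};
struct ImageInput {
  PixelData pixels;
};
struct AudioInput {
  std::vector<float> samples;
};
using MultimodalInput = std::variant<TextInput, ImageInput, AudioInput>;

// Maps the stored pixel alternative to the dtype a tensor built from it has.
ScalarType dtype_of(const PixelData& data) {
  assert(!data.valueless_by_exception());
  return std::holds_alternative<std::vector<uint8_t>>(data)
      ? ScalarType::Byte
      : ScalarType::Float;
}

// Returns Ok when the input's dtype matches what the encoder expects.
Error check_input(const MultimodalInput& input, ScalarType encoder_dtype) {
  if (std::holds_alternative<TextInput>(input)) {
    // Text is tokenized, so there is no encoder dtype to validate here.
    return Error::Ok;
  }
  if (const auto* image = std::get_if<ImageInput>(&input)) {
    return dtype_of(image->pixels) == encoder_dtype ? Error::Ok
                                                    : Error::InvalidArgument;
  }
  // Audio samples in this sketch are always float.
  return encoder_dtype == ScalarType::Float ? Error::Ok
                                            : Error::InvalidArgument;
}
```

Returning an error code instead of throwing keeps the mismatch recoverable for the caller, which matches the error-handling emphasis in the description.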

@pytorch-bot
pytorch-bot bot commented Sep 16, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/14359

Note: Links to docs will display an error until the docs builds have been completed.

⏳ No Failures, 4 Pending

As of commit 4d6d3be with merge base d25c35a:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@meta-cla meta-cla bot added the CLA Signed label Sep 16, 2025
@larryliu0820 larryliu0820 added the release notes: llm label Sep 16, 2025
Contributor

@jackzhxng jackzhxng left a comment

nice. this would be a good opportunity to get started on a multimodal_prefiller test file too


executorch::runtime::Result<executorch::extension::TensorPtr> toTensor(
bool with_batch = false) const {
// Note: This creates a 3D tensor (CHW). The model might expect a 4D

seems like you already batch using with_batch so can rm this comment?

Contributor Author

I wish it were easy to test. My plan is to set up some Python tests in the pybind PR.

@larryliu0820 larryliu0820 merged commit bc18834 into main Sep 17, 2025
126 checks passed
@larryliu0820 larryliu0820 deleted the image_takes_float branch September 17, 2025 08:10
@mergennachin
Contributor

Cherry-pick candidate?

@larryliu0820
Contributor Author

> Cherry-pick candidate?

could be. I have a few other PRs so let me land everything and see

@larryliu0820
Contributor Author

@pytorchbot cherry-pick --onto release/1.0 -c critical

pytorchbot pushed a commit that referenced this pull request Sep 22, 2025
(cherry picked from commit bc18834)
@pytorchbot
Collaborator

Cherry picking #14359

The cherry pick PR is at #14490 and it is recommended to link a critical cherry pick PR with an issue. The following tracker issues are updated:

Details for Dev Infra team Raised by workflow job

StrycekSimon pushed a commit to nxp-upstream/executorch that referenced this pull request Sep 23, 2025