[multimodal] Allow float32 image input #14359

larryliu0820 · 2025-09-16T23:44:58Z

Lets the Image class support both uint8_t and float data types, and changes the MultimodalPrefiller class to support text, image, and audio modalities with error checking and modularity.

Image Data Handling and Type Safety:

Refactored the Image class in image.h from a simple struct to a class that uses a std::variant to support both uint8_t and float image data, providing type-safe accessors and a toTensor method for conversion to tensors.
Updated load_image in Llava main.cpp to construct Image objects using the new class interface and move semantics, ensuring correct data layout and encapsulation.
Added a runtime check in LlavaImagePrefiller to ensure only uint8_t images are processed, using the new type-checking methods.
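
The variant-backed design described above can be sketched roughly as follows. This is a minimal illustration, not the code in image.h: `SketchImage`, its accessor names, and `is_supported_by_uint8_only_prefiller` are hypothetical, and the real class also provides a `toTensor` method that is omitted here.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <variant>
#include <vector>

// Hypothetical stand-in for the refactored Image class. Member names and
// layout are assumptions for illustration, not the contents of image.h.
class SketchImage {
 public:
  SketchImage(std::vector<uint8_t> data, int32_t width, int32_t height,
              int32_t channels)
      : data_(std::move(data)),
        width_(width),
        height_(height),
        channels_(channels) {
    assert(element_count() == expected_count());
  }

  SketchImage(std::vector<float> data, int32_t width, int32_t height,
              int32_t channels)
      : data_(std::move(data)),
        width_(width),
        height_(height),
        channels_(channels) {
    assert(element_count() == expected_count());
  }

  // Type checks: report which alternative the variant currently holds.
  bool is_uint8() const {
    return std::holds_alternative<std::vector<uint8_t>>(data_);
  }
  bool is_float() const {
    return std::holds_alternative<std::vector<float>>(data_);
  }

  // Typed accessors: std::get throws std::bad_variant_access on a mismatch,
  // so callers are expected to check the type first.
  const std::vector<uint8_t>& uint8_data() const {
    return std::get<std::vector<uint8_t>>(data_);
  }
  const std::vector<float>& float_data() const {
    return std::get<std::vector<float>>(data_);
  }

  int32_t width() const { return width_; }
  int32_t height() const { return height_; }
  int32_t channels() const { return channels_; }

 private:
  std::size_t element_count() const {
    return std::visit([](const auto& v) { return v.size(); }, data_);
  }
  std::size_t expected_count() const {
    return static_cast<std::size_t>(width_) * height_ * channels_;
  }

  std::variant<std::vector<uint8_t>, std::vector<float>> data_;
  int32_t width_;
  int32_t height_;
  int32_t channels_;
};

// Sketch of the kind of guard described for LlavaImagePrefiller: anything
// that is not uint8 is rejected before it reaches the encoder.
bool is_supported_by_uint8_only_prefiller(const SketchImage& image) {
  return image.is_uint8();
}
```

Taking the pixel buffer by value and moving it into the variant matches the move semantics mentioned for load_image: the caller hands over ownership without an extra copy.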

Multimodal Prefill Logic and Flexibility:

Updated the MultimodalPrefiller class in multimodal_prefiller.h to dynamically check input types, validate tensor types against model expectations, and handle encoder/decoder execution with improved error handling and modularity.
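
One way to picture the input-type dispatch and dtype validation is the sketch below. Everything in it is an assumption made for illustration: `SketchInput`, `SketchModelSpec`, `Dtype`, `Status`, and `validate_input` are hypothetical names, and the actual prefiller works with ExecuTorch tensors and method metadata rather than these toy types.

```cpp
#include <cassert>
#include <string>
#include <variant>

// Hypothetical types; none of these names come from multimodal_prefiller.h.
enum class Dtype { UInt8, Float32 };
enum class Status { Ok, InvalidArgument };

struct SketchText { std::string text; };
struct SketchImageInput { Dtype dtype; };
struct SketchAudioInput { Dtype dtype; };

// One input is exactly one modality.
using SketchInput =
    std::variant<SketchText, SketchImageInput, SketchAudioInput>;

// What the (assumed) model encoders expect for each tensor input.
struct SketchModelSpec {
  Dtype image_dtype;
  Dtype audio_dtype;
};

// Check the modality at runtime and compare the input dtype with what the
// model expects, returning an error code instead of running the encoder on
// mismatched data.
Status validate_input(const SketchInput& input, const SketchModelSpec& spec) {
  if (std::holds_alternative<SketchText>(input)) {
    return Status::Ok;  // Text carries no tensor dtype to validate here.
  }
  if (const auto* image = std::get_if<SketchImageInput>(&input)) {
    return image->dtype == spec.image_dtype ? Status::Ok
                                            : Status::InvalidArgument;
  }
  const auto& audio = std::get<SketchAudioInput>(input);
  return audio.dtype == spec.audio_dtype ? Status::Ok
                                         : Status::InvalidArgument;
}
```

Failing early with an error code keeps a float image from being silently fed to an encoder that was exported for uint8 input, and vice versa.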

pytorch-bot · 2025-09-16T23:45:02Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/14359

📄 Preview Python docs built from this PR

Note: Links to docs will display an error until the docs builds have been completed.

⏳ No Failures, 4 Pending

As of commit 4d6d3be with merge base d25c35a:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

jackzhxng

nice. this would be a good opportunity to get started on a multimodal_prefiller test file too

jackzhxng · 2025-09-17T00:39:10Z

extension/llm/runner/image.h

+
+  executorch::runtime::Result<executorch::extension::TensorPtr> toTensor(
+      bool with_batch = false) const {
+    // Note: This creates a 3D tensor (CHW). The model might expect a 4D


seems like you already batch using with_batch so can rm this comment?
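
For context on the point above, the effect of a `with_batch` flag on the tensor shape could look like the following sketch. It assumes the flag prepends a batch dimension of 1 to a CHW shape; `chw_sizes` is a hypothetical helper and is not taken from image.h.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical helper: builds the sizes for a CHW image tensor, optionally
// with a leading batch dimension of 1 (an assumption about with_batch).
std::vector<int32_t> chw_sizes(int32_t channels, int32_t height,
                               int32_t width, bool with_batch = false) {
  assert(channels > 0 && height > 0 && width > 0);
  std::vector<int32_t> sizes = {channels, height, width};
  if (with_batch) {
    sizes.insert(sizes.begin(), 1);  // {1, C, H, W}
  }
  return sizes;
}
```

With the flag available, a caller whose model expects 4D input can request the batched shape directly, which is why the note about a possible 4D expectation reads as redundant.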

I wish it were easy to test. My plan is to set up some Python tests in the pybind PR.

mergennachin · 2025-09-17T14:09:55Z

Cherry-pick candidate?

larryliu0820 · 2025-09-17T16:35:36Z

> Cherry-pick candidate?

could be. I have a few other PRs so let me land everything and see

larryliu0820 · 2025-09-22T21:44:51Z

@pytorchbot cherry-pick --onto release/1.0 -c critical

Lets the `Image` class support both `uint8_t` and `float` data types, and changes the `MultimodalPrefiller` class to support text, image, and audio modalities with error checking and modularity.

**Image Data Handling and Type Safety:**

* Refactored the `Image` class in `image.h` from a simple struct to a class that uses a `std::variant` to support both `uint8_t` and `float` image data, providing type-safe accessors and a `toTensor` method for conversion to tensors.
* Updated `load_image` in Llava `main.cpp` to construct `Image` objects using the new class interface and move semantics, ensuring correct data layout and encapsulation.
* Added a runtime check in `LlavaImagePrefiller` to ensure only `uint8_t` images are processed, using the new type-checking methods.

**Multimodal Prefill Logic and Flexibility:**

* Updated the `MultimodalPrefiller` class in `multimodal_prefiller.h` to dynamically check input types, validate tensor types against model expectations, and handle encoder/decoder execution with improved error handling and modularity.

(cherry picked from commit bc18834)

pytorchbot · 2025-09-22T21:47:14Z

Cherry picking #14359

The cherry pick PR is at #14490 and it is recommended to link a critical cherry pick PR with an issue. The following tracker issues are updated:

[v1.0.0] Release Tracker #14288 (comment)

Details for Dev Infra team

Raised by workflow job

larryliu0820 requested review from jackzhxng, lucylq, mergennachin and swolchok as code owners September 16, 2025 23:44

meta-cla bot added the CLA Signed label Sep 16, 2025

larryliu0820 added the release notes: llm label Sep 16, 2025

kirklandsign approved these changes Sep 16, 2025

jackzhxng reviewed Sep 17, 2025

larryliu0820 added 3 commits September 16, 2025 23:51

[multimodal] Allow float32 image input

154850e

Summary: Test Plan: Reviewers: Subscribers: Tasks: Tags:

Fix

78fa105

Fix android and ios usages of image

4d6d3be

larryliu0820 force-pushed the image_takes_float branch from 4f68739 to 4d6d3be September 17, 2025 06:52

larryliu0820 merged commit bc18834 into main Sep 17, 2025
126 checks passed

larryliu0820 deleted the image_takes_float branch September 17, 2025 08:10

pytorchbot mentioned this pull request Sep 22, 2025

[v1.0.0] Release Tracker #14288

Closed

6 participants
