@larryliu0820 (Collaborator) commented Jul 23, 2025

This PR has some code adapted from transformers. We put it in optimum-executorch so that we can iterate quickly on the stack. Eventually we want to upstream the changes to transformers. See details below.

Exportable Modules

TorchExportableModuleWithHybridCache

A wrapper module that makes decoder-only language models exportable with torch.export using HybridCache. This is a forked version of TorchExportableModuleForDecoderOnlyLM with some modifications to support inputs_embeds.

Note: This class should be upstreamed to transformers. We keep it here so that we can iterate quickly.

TorchExportableModuleForImageTextLM

A wrapper for the text decoder model in a vision-language model. It is very similar to TorchExportableModuleForDecoderOnlyLM, but instead of taking input_ids this module takes inputs_embeds. This is because we want to be able to take both token embeddings and image embeddings as inputs.

Note: This class should be upstreamed to transformers. See huggingface/transformers#39836 for more details; once that lands, we can clean up the class here.
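
To make the interface change concrete, here is a minimal sketch of the forward contract such a wrapper exposes; the class name, constructor, and exact argument list below are illustrative assumptions, not the code in this PR:

import torch

class ImageTextLMWrapperSketch(torch.nn.Module):
    """Illustrative only: a decoder wrapper that consumes pre-computed embeddings."""

    def __init__(self, decoder: torch.nn.Module):
        super().__init__()
        self.decoder = decoder  # decoder-only LM carrying a static/hybrid KV cache

    def forward(self, inputs_embeds: torch.Tensor, cache_position: torch.Tensor) -> torch.Tensor:
        # Unlike a wrapper that takes input_ids, the token-embedding lookup is
        # skipped here: the caller supplies embeddings directly, so image
        # embeddings can be mixed in with token embeddings before the call.
        outputs = self.decoder(
            inputs_embeds=inputs_embeds,
            cache_position=cache_position,
            use_cache=True,
        )
        return outputs.logits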

ImageEncoderExportableModule

A wrapper for vision encoder models that projects vision features into the language model's embedding space. This is commonly implemented as get_image_features() in HuggingFace transformers, for example Gemma3Model.get_image_features().
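
For reference, the wrapper roughly follows the get_image_features() pattern (vision tower output projected into the LM embedding space); a minimal sketch, assuming Gemma3-style attribute names (vision_tower, multi_modal_projector), which may differ for other models:

import torch

class ImageEncoderSketch(torch.nn.Module):
    """Illustrative only: project vision features into the LM embedding space."""

    def __init__(self, model: torch.nn.Module):
        super().__init__()
        self.model = model  # full vision-language model, e.g. Gemma3Model

    def forward(self, pixel_values: torch.Tensor) -> torch.Tensor:
        # Run the vision encoder, then project its hidden states so they can
        # later be merged into the text decoder's inputs_embeds.
        vision_outputs = self.model.vision_tower(pixel_values=pixel_values).last_hidden_state
        image_features = self.model.multi_modal_projector(vision_outputs)
        return image_features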

ImageTextToTextExportableModule

A torch.nn.Module wrapper for the image-text-to-text task. It provides an export() API that generates ExportedPrograms, which are consumed by the xnnpack.py recipe to produce the ExecuTorch program.
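
For orientation, the lowering an XNNPACK recipe typically performs on each ExportedProgram looks roughly like the sketch below; it uses the public ExecuTorch APIs as an illustration and is not the recipe's actual code:

import torch
from executorch.backends.xnnpack.partition.xnnpack_partitioner import XnnpackPartitioner
from executorch.exir import to_edge

def lower_to_executorch(exported_program: torch.export.ExportedProgram) -> bytes:
    # Convert the ATen-level ExportedProgram to the Edge dialect, delegate the
    # supported subgraphs to the XNNPACK backend, then serialize a .pte blob.
    edge = to_edge(exported_program)
    edge = edge.to_backend(XnnpackPartitioner())
    return edge.to_executorch().buffer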

Usage

from optimum.executorch import ExecuTorchModelForMultimodalCausalLM

model_id = "google/gemma-3-4b-it"

model = ExecuTorchModelForMultimodalCausalLM.from_pretrained(
    model_id,
    recipe="xnnpack",
    task="image-text-to-text",
    export=True,
    use_custom_sdpa=True,
    use_custom_kv_cache=True,
    qlinear=True,
    qembedding_config=True,
)

Testing

Run tests with:

RUN_SLOW=1 pytest tests/models/test_modeling_gemma3.py::ExecuTorchModelIntegrationTest::test_gemma3_image_text_to_text_generation_with_custom_sdpa_kv_cache_8da4w_8we

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@larryliu0820 changed the title from "DRAFT - Support image-text-to-text task" to "Support image-text-to-text task" on Aug 1, 2025
# TODO: Should switch to `AOPerModuleConfig` once fix for tied weights is available.
embedding_config = IntxWeightOnlyConfig(
weight_dtype=torch.int8,
granularity=PerAxis(0),

Collaborator:
why not groupwise?

Collaborator Author:
Let's iterate on this later, currently this quantization config works fine.

)

if qlinear_config:
logging.info("Quantizing linear layers.")

Collaborator:
could try PerAxis here for the encoder

Collaborator Author:
See comment above
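
For context on the two quantization threads above, below is a sketch of the kind of torchao configs being discussed: per-axis int8 weight-only for embeddings versus groupwise 4-bit weights with 8-bit dynamic activations (8da4w) for linear layers. The config class names and the filter_fn usage are assumptions based on recent torchao, not necessarily what this PR ships:

import torch
from torchao.quantization import quantize_
from torchao.quantization.granularity import PerAxis, PerGroup
from torchao.quantization.quant_api import (
    Int8DynamicActivationIntxWeightConfig,
    IntxWeightOnlyConfig,
)

def quantize_sketch(model: torch.nn.Module) -> None:
    # 8-bit weight-only quantization, per output channel, for embedding tables.
    embedding_config = IntxWeightOnlyConfig(
        weight_dtype=torch.int8,
        granularity=PerAxis(0),
    )
    quantize_(model, embedding_config, filter_fn=lambda m, fqn: isinstance(m, torch.nn.Embedding))

    # 8-bit dynamic activations with 4-bit groupwise weights (8da4w) for linear layers.
    linear_config = Int8DynamicActivationIntxWeightConfig(
        weight_dtype=torch.int4,
        weight_granularity=PerGroup(32),
    )
    quantize_(model, linear_config)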

logger.warning(f"task was provided and set to {task} but not used, will be ignored")
inferred_task = TasksManager.infer_task_from_model(cls.auto_model_class)
logging.info(f"Inferred task from model class: {inferred_task}")
logger.warning(f"task was provided and set to {task}")

Collaborator:
seems ok, I had to do this too but @guangy10 thoughts on this?

return exported_program


class ImageEncoderExportableModule(torch.nn.Module):

Collaborator:
Should look into whether the vision embeddings -> multimodal projector is gemma-specific or generally applicable across the board for encoders. It's possible that other vision models have a few extra steps in here. In that case maybe it makes sense to just call it GemmaImageEncoderExportableModule, and maybe create a new dir and put it into exporters/executorch/models/gemma for per-model exportable modules

Collaborator:
what is hf transformers pattern here?

Collaborator:
The similar pattern is that they just write model-specific code for new models in modular_.py, so I think it's fine that we have some model-specific code

return image_features


class ImageTextToTextExportableModule(torch.nn.Module):

Collaborator:
I remember you had code for verifying the ExportedProgram E2E in the original draft PR; you can add that to def generate() here and add a test for it too

Collaborator Author:
Is it common to have generate() implemented here?
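
As a concrete example of the E2E verification being suggested, a minimal sketch that re-runs an ExportedProgram against the eager module on the same inputs (names are illustrative; this is not the code from the original draft PR):

import torch

def verify_exported_program(
    eager_module: torch.nn.Module,
    exported_program: torch.export.ExportedProgram,
    example_inputs: tuple,
) -> None:
    # Run the exported graph module and the eager module on identical inputs
    # and check that the outputs agree within tolerance.
    eager_out = eager_module(*example_inputs)
    exported_out = exported_program.module()(*example_inputs)
    torch.testing.assert_close(exported_out, eager_out, rtol=1e-3, atol=1e-3)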

setup.py Outdated
"optimum~=1.24",
"executorch>=0.6.0",
"transformers==4.51.3",
"transformers==4.53.2",

Collaborator:
Should upgrade transformers in a separate PR since it could be problematic

# Branch out for float vs bool mask
# assert attention_mask.dim() == 2, f"attention_mask must be a 2D matrix."
attention_mask = attention_mask.reshape(-1, max_seq_len)
attention_mask = attention_mask.reshape(-1, attention_mask.shape[-1])

Collaborator:
why this change?

Collaborator Author:
This was giving a weird issue when verifying the e2e workflow using the ExportedProgram. I forgot what exactly, though

Collaborator:
if not needed, please undo

Collaborator Author (@larryliu0820, Aug 8, 2025):
No we definitely need this, otherwise e2e won’t work.

special_image_mask = (input_ids == self.config.image_token_id).unsqueeze(-1)
special_image_mask = special_image_mask.expand_as(inputs_embeds).to(inputs_embeds.device)
image_features = image_features.to(inputs_embeds.device, inputs_embeds.dtype)
inputs_embeds = inputs_embeds.masked_scatter(special_image_mask, image_features)

Collaborator:
@larryliu0820 so not doing this at runtime means we make assumptions about where the image tokens go in the prompt, right?

Collaborator Author:
Yeah the runner will have to take in a vector of inputs, then prefill sequentially.

Collaborator:
I find the sequential prefilling a bit strange in the runner, mainly because it assumes a format for the chat template, i.e. that image tokens come last. You really need to do masked scatter, no?

Collaborator Author:
The runner knows nothing about the chat template. It only sees [image, text, image..]

Collaborator:
I understand that, but where image tokens (or, in the future, speech tokens) go is a property of the model's chat template, isn't it? So whether it is managed in the runner or the layer above doesn't matter, but it has to be accounted for somewhere
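
To make this discussion concrete, here is a rough sketch of the sequential-prefill flow being described, not the actual runner implementation; the chunk format and the callables are assumptions:

from typing import Callable, List, Tuple

import torch

def sequential_prefill(
    chunks: List[Tuple[str, torch.Tensor]],
    embed_tokens: Callable[[torch.Tensor], torch.Tensor],
    encode_image: Callable[[torch.Tensor], torch.Tensor],
    decode_step: Callable[[torch.Tensor, torch.Tensor], torch.Tensor],
) -> torch.Tensor:
    # chunks is an ordered list like [("text", input_ids), ("image", pixel_values), ...].
    # Each chunk is embedded (text) or encoded (image), and prefill advances
    # cache_position chunk by chunk, so the KV cache ends up in the same state
    # as a single masked_scatter-merged prefill, provided the chunk order
    # already reflects where the image tokens sit in the prompt.
    pos = 0
    logits = None
    for kind, value in chunks:
        embeds = encode_image(value) if kind == "image" else embed_tokens(value)
        cache_position = torch.arange(pos, pos + embeds.shape[1])
        logits = decode_step(embeds, cache_position)
        pos += embeds.shape[1]
    return logits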


if (
hasattr(model.config.text_config, "layer_types")
and getattr(model.config.text_config, "sliding_window", None) is not None

Collaborator:
so this only works for gemma3?

Collaborator Author:
not 100% sure haha. Will use it to enable a few more models.

Returns:
image_features (`torch.Tensor`): Image feature tensor of shape `(num_images, image_length, embed_dim)`).
"""
vision_outputs = self.model.vision_tower(

Collaborator:
so this relies on the fact that there is a vision_tower attr on the model that is the vision encoder

Collaborator Author:
Yeah this should work for llava as well.

Collaborator:
yeah but is this something we can rely on? Like, upstreaming this change might be difficult? Mainly the question is, how much of the model structure information are you exploiting?

"""
vision_outputs = self.model.vision_tower(
pixel_values=pixel_values
).last_hidden_state

Collaborator:
same here and the next line

Comment on lines +300 to +301
sliding_window = self.metadata.get("sliding_window", float("inf"))
max_dim = min(max_seq_len, sliding_window) - 1

Collaborator:
does a similar constraint exist for sliding window in the decoder-only LM?

RemoveRedundantTransposes,
)

mutated_gm = RemoveRedundantTransposes()(exported_program.module())[0]

Collaborator:
we should run this pass for other exported models as well

)

token_embeddings_exported_program = torch.export.export(
exportable_module.model.model.language_model.get_input_embeddings(),

Collaborator:
I don't follow this. We already export exportable_module.model, so I would have expected that get_input_embeddings() is traced as part of that? I guess that's not the case when inputs_embeds != None

Collaborator Author:
get_input_embeddings() has not been traced because in the language model we specialized on inputs_embeds != None and skipped the token embedding layer.

Collaborator:
yeah, I don't quite like the fact that we are exploiting information from the model code though.

Collaborator Author:
I guess that’s inevitable and I hope transformers folks can give some guarantees lol. Like model.vision_model and model.get_image_features()
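
Relatedly, a minimal sketch of the separate token-embedding export discussed in this thread, using a stand-in nn.Embedding in place of get_input_embeddings(); the shapes and the dynamic dimension are illustrative:

import torch

# Stand-in for language_model.get_input_embeddings(): since the decoder program
# was specialized on inputs_embeds, the embedding lookup has to be exported as
# its own program so the runner can turn input_ids into embeddings first.
embed_layer = torch.nn.Embedding(num_embeddings=1000, embedding_dim=64)
example_ids = torch.zeros((1, 8), dtype=torch.long)
seq_len = torch.export.Dim("seq_len", min=2, max=128)
token_embeddings_ep = torch.export.export(
    embed_layer,
    (example_ids,),
    dynamic_shapes=({1: seq_len},),
)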

weight_dtype=torch.int4,
weight_granularity=PerGroup(32),
)
quantize_(

Collaborator:
Not quantizing vision model?
