[Gemma3] Add VLM support (need help) #50
Conversation
Force-pushed from dcdbae1 to 709fff4
Hi! This API, for example, was added in the optimum-intel project, and you can take inspiration from it: https://github.com/huggingface/optimum-intel/blob/cd0028599d65e841d6fe16d4d9ccd5528f89f3cb/optimum/exporters/openvino/model_configs.py#L3917
You don't have to follow it exactly; let me know if you have better suggestions.
@register_tasks_manager_onnx("gemma3", *[*COMMON_TEXT_GENERATION_TASKS, "text-classification"])
class Gemma3OnnxConfig(LlamaOnnxConfig):
    DUMMY_INPUT_GENERATOR_CLASSES = (
        DummyTextInputGenerator,
        DummyVisionInputGenerator,
    )
    DUMMY_PKV_GENERATOR_CLASS = GemmaDummyPastKeyValuesGenerator
    NORMALIZED_CONFIG_CLASS = NormalizedConfigManager.get_normalized_config_class("gemma3")
    MIN_TRANSFORMERS_VERSION = version.parse("4.52.0.dev0")
For example, here the model will only support text inputs, since it inherits its inputs property from llama. It also only supports classic decoder (LLM) tasks, like text-generation.
If your intention is to support text-only generation with gemma3, then yes, this would be acceptable; in that case there's no need for DummyVisionInputGenerator. Also, for NORMALIZED_CONFIG_CLASS, you can define it directly instead of using the config manager.
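Concretely, a minimal sketch of the text-only variant with both suggestions applied could look like this (register_tasks_manager_onnx, COMMON_TEXT_GENERATION_TASKS, LlamaOnnxConfig and GemmaDummyPastKeyValuesGenerator are taken from the diff above; using NormalizedTextConfig for the normalized config is an assumption, not verified against the gemma3 text config):

```python
from packaging import version

from optimum.utils import DummyTextInputGenerator, NormalizedTextConfig

# register_tasks_manager_onnx, COMMON_TEXT_GENERATION_TASKS, LlamaOnnxConfig and
# GemmaDummyPastKeyValuesGenerator come from the exporter modules shown in the diff above.


@register_tasks_manager_onnx("gemma3", *COMMON_TEXT_GENERATION_TASKS)
class Gemma3OnnxConfig(LlamaOnnxConfig):
    # text-only export: no DummyVisionInputGenerator needed
    DUMMY_INPUT_GENERATOR_CLASSES = (DummyTextInputGenerator,)
    DUMMY_PKV_GENERATOR_CLASS = GemmaDummyPastKeyValuesGenerator
    # define the normalized config directly instead of going through the config manager
    NORMALIZED_CONFIG_CLASS = NormalizedTextConfig
    MIN_TRANSFORMERS_VERSION = version.parse("4.52.0.dev0")
```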
Makes total sense, thanks for the detailed comments! Will have another go at this :)
Some questions to clarify before I continue implementing:

Potentially it could make sense to merge these into a single […]

Thanks for bearing with me :) I have lots of time to spend on this, so I'm happy to learn and contribute a lot in the coming month if all goes well.
I don't think there's a need for another "model type"; it would be complicated to add one because the exporter relies on the model type in the model config to infer which onnx config it should use.

Yes, this makes more sense. Btw, you can see how for seq2seq models like bart we support text-generation (causal LM) in the same onnx config.

If you mean the case of text-only generation, where only the language model should be exported, then yes.
tests/exporters/onnx/test_export.py (Outdated)
if monolith:
    expected_models = {"model.onnx"}
elif task in {"text-generation", "text-generation-with-past"}:
    expected_models = {"language_model.onnx", "language_model_head.onnx"}
I think if the model is exported with the task text-generation, we should have something similar to what we get with other decoder models (i.e. it should be one model.onnx, exported with past key values if with-past).
Now I added

    expected_models = {"text_encoder.onnx", "language_model_with_head.onnx"}

Would you prefer to call the second one model.onnx?
tests/exporters/onnx/test_export.py (Outdated)
      | "language_model_head.onnx", | ||
| } | ||
| elif task in {"feature-extraction", "feature-extraction-with-past"}: | ||
| expected_models = {"vision_encoder.onnx", "language_model.onnx"} | 
In feature-extraction, the model should be decomposed into the same elements as image-text-to-text.
feature-extraction is the task that maps to Gemma3Model (the model without any task head): https://github.com/huggingface/transformers/blob/de01a22aff21d16532d8dd68806589ca6c73dd5c/src/transformers/models/gemma3/modeling_gemma3.py#L771
Implemented it as follows (see the sketch after this list) - LMK if not correct:
image-text-to-text:
- vision_encoder
- multimodal_projector
- text_encoder
- language_model_with_head <--- note head
feature-extraction:
- vision_encoder
- multimodal_projector
- text_encoder
- language_model
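To make that concrete, the per-task expectation in the test could look roughly like this; the component file names are still up for debate (e.g. model.onnx vs language_model_with_head.onnx), so treat this as a sketch rather than the final layout:

```python
# Hypothetical expected component files per task, mirroring the list above.
if task in {"image-text-to-text", "image-text-to-text-with-past"}:
    expected_models = {
        "vision_encoder.onnx",
        "multimodal_projector.onnx",
        "text_encoder.onnx",
        "language_model_with_head.onnx",  # lm_head fused into the decoder export
    }
elif task in {"feature-extraction", "feature-extraction-with-past"}:
    expected_models = {
        "vision_encoder.onnx",
        "multimodal_projector.onnx",
        "text_encoder.onnx",
        "language_model.onnx",  # no head: maps to Gemma3Model
    }
```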
tests/exporters/onnx/test_export.py (Outdated)
      | "vision_encoder.onnx", | ||
| "multimodal_projector.onnx", | ||
| "language_model.onnx", | ||
| "language_model_head.onnx", | 
Is it necessary to decompose the model into a language model and an lm_head? I think they can be exported together as language_model_with_lm_head. You can see in the gemma3 modeling code that, in the class for conditional generation, the lm_head is always applied after the language model (which is not the case for feature extraction).
I think there's also a missing component (we can call it the "text_encoder") https://github.com/huggingface/transformers/blob/de01a22aff21d16532d8dd68806589ca6c73dd5c/src/transformers/models/gemma3/modeling_gemma3.py#L902 which is necessary to embed the text input ids before fusing them with the embedded image features https://github.com/huggingface/transformers/blob/de01a22aff21d16532d8dd68806589ca6c73dd5c/src/transformers/models/gemma3/modeling_gemma3.py#L917
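As a rough illustration (the class name and export granularity are assumptions, not a final design), this component could be a thin wrapper around the model's input embeddings, exported on its own:

```python
import torch
from torch import nn


class Gemma3TextEncoder(nn.Module):
    """Hypothetical wrapper exposing only the token-embedding step, so it can be
    exported as text_encoder.onnx; the image features are fused with these
    embeddings downstream, outside this component."""

    def __init__(self, model):  # e.g. a Gemma3ForConditionalGeneration instance
        super().__init__()
        # get_input_embeddings() is the standard transformers accessor
        self.embed_tokens = model.get_input_embeddings()

    def forward(self, input_ids: torch.LongTensor) -> torch.Tensor:
        return self.embed_tokens(input_ids)
```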
- Added the text encoder as well - great point!
- Agree it's probably not necessary to decompose the LM and LM head. However, to export just the LM and LM head as one unit (ignoring the vision encoder etc.) I would have to implement a separate nn.Module (sketched further below), because Gemma3ForConditionalGeneration does not expose any way of getting just the LM and the head. Do you think it's fine to export it as a monolith in this case? This is how it's done in optimum-intel.
See implementation above
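For reference, a rough sketch of the separate nn.Module idea discussed above; the attribute names (language_model, lm_head) are assumptions about Gemma3ForConditionalGeneration's layout rather than verified accessors:

```python
import torch
from torch import nn


class Gemma3LanguageModelWithHead(nn.Module):
    """Hypothetical wrapper exporting the decoder and lm_head together as
    language_model_with_head.onnx."""

    def __init__(self, model):  # e.g. a Gemma3ForConditionalGeneration instance
        super().__init__()
        self.language_model = model.language_model  # assumed decoder attribute
        self.lm_head = model.lm_head  # assumed head attribute

    def forward(self, inputs_embeds: torch.Tensor, attention_mask: torch.Tensor):
        # the decoder consumes pre-computed embeddings (text + fused image features)
        outputs = self.language_model(
            inputs_embeds=inputs_embeds, attention_mask=attention_mask
        )
        return self.lm_head(outputs.last_hidden_state)
```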
# TODO: to be implemented
class ORTVisionEncoder(ORTSessionMixin):
Note: this is mostly stubbed and still a bit WIP.
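Since the ORTSessionMixin interface is still in flux here, this sketch uses plain onnxruntime just to illustrate what the vision-encoder component would do at inference time; the input/output names ("pixel_values", a single image-features output) are assumptions:

```python
import numpy as np
import onnxruntime as ort


class VisionEncoderRunner:
    """Rough stand-in (not the ORTSessionMixin API) for running the exported
    vision_encoder.onnx and producing image features for the projector."""

    def __init__(self, onnx_path: str):
        self.session = ort.InferenceSession(onnx_path)

    def __call__(self, pixel_values: np.ndarray) -> np.ndarray:
        # assumes a single input named "pixel_values" and a single output
        (image_features,) = self.session.run(None, {"pixel_values": pixel_values})
        return image_features
```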
@IlyasMoutawwakil I updated the test structure to align more with your refactoring efforts. Note that parts of it are still heavily stubbed. Unfortunately I will not have much more capacity to work on this (it required a bit more time than I anticipated) - just an hour or so tomorrow. How can I best assist/clean this up for someone else to pick up easily, or alternatively put it into a minimal state where it can be merged and followed up later by someone else in another PR? Thanks for the amazing support through this, I've learnt a lot.
Thanks for the effort @simondanielsson. Tbh I knew it required more work than the classic model addition workflow, because the entire recipe for VLM export and inference is required for a correct implementation. I will try to build upon your work in another PR 🤗
Hey, thanks for your effort!

    git clone https://github.com/huggingface/optimum-onnx
    gh pr checkout https://github.com/huggingface/optimum-onnx/pull/50
    uv sync
    uv run optimum-cli export -h

Error:

uv run optimum-cli export -h
Traceback (most recent call last):
  File "/Users/yqbqwlny/Documents/audio/optimum-onnx/.venv/bin/optimum-cli", line 10, in <module>
    sys.exit(main())
             ^^^^^^
  File "/Users/yqbqwlny/Documents/audio/optimum-onnx/.venv/lib/python3.12/site-packages/optimum/commands/optimum_cli.py", line 196, in main
    commands_to_register = _OPTIMUM_CLI_SUBCOMMANDS + load_optimum_namespace_cli_commands()
                                                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/yqbqwlny/Documents/audio/optimum-onnx/.venv/lib/python3.12/site-packages/optimum/commands/optimum_cli.py", line 140, in load_optimum_namespace_cli_commands
    register_module = importlib.import_module(f"{commands_register_namespace}.{register_file.stem}")
                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/yqbqwlny/.local/share/uv/python/cpython-3.12.11-macos-aarch64-none/lib/python3.12/importlib/__init__.py", line 90, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "<frozen importlib._bootstrap>", line 1387, in _gcd_import
  File "<frozen importlib._bootstrap>", line 1360, in _find_and_load
  File "<frozen importlib._bootstrap>", line 1331, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 935, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 999, in exec_module
  File "<frozen importlib._bootstrap>", line 488, in _call_with_frames_removed
  File "/Users/yqbqwlny/Documents/audio/optimum-onnx/optimum/commands/register/register_export.py", line 15, in <module>
    from optimum.commands.export import ExportCommand
ImportError: cannot import name 'ExportCommand' from 'optimum.commands.export' (unknown location)

I had to modify things in […]. Curious to try it. How can I convert […]?
Why not add initial text-only support first? It's better to have something working early than to wait for the perfect PR.
@thewh1teagle Yes, you need to rebase the branch on the latest changes in main. We just merged 2 PRs, one in optimum and another in optimum-onnx, to use native Python namespaces and give a better developer experience when working on the project in editable mode. Also, I agree that having initial text-only support is what I had in mind at first. I guess @simondanielsson was interested in the full development of a new export API, which is very generous of them! Don't hesitate to open a PR with only text-gen support!
gemma3n multimodal: I wrote this a while back, hope it helps.
Added support for gemma3-text following the code in:
- #50

Also added a working example with `gemma3-270m-instruct`; will update and improve as needed.

Related:
- #69
- #49
- #45
- huggingface/optimum#1724
- #56

Co-authored-by: Ilyas Moutawwakil <[email protected]>
Co-authored-by: IlyasMoutawwakil <[email protected]>
Hi @IlyasMoutawwakil, why did this get closed?
@geraldstanje1 Support for text-only gemma3 was added in #70 (and extended to loading from a VLM checkpoint in #90).
What does this PR do?
This PR adds support for exporting Gemma3 models.
Fixes #49 (issue)
Note: we might need this PR in the main repo as well, but there might be a better way of dealing with this :)
Note to self before merging:
Before submitting
Who can review?
@echarlaix, @JingyaHuang, @michaelbenayoun, @IlyasMoutawwakil