Conversation

@simondanielsson commented Sep 3, 2025

What does this PR do?

This PR adds support for exporting Gemma3 models.

Fixes #49 (issue)

Note: we might need this PR in the main repo as well, but there might be a better way of dealing with this :)

Note to self before merging:

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you make sure to update the documentation with your changes?
  • Did you write any new necessary tests?

Who can review?

@echarlaix, @JingyaHuang, @michaelbenayoun, @IlyasMoutawwakil

@simondanielsson force-pushed the feature/add-gemma3-export branch from dcdbae1 to 709fff4 on September 3, 2025 14:46
@IlyasMoutawwakil (Member)

Hi!
Very cool starting point, but I think it's going to be a bit more complicated depending on what exactly you want to add support for. Gemma3 works both as a VLM (conditional generation) and as a normal LM (causal LM).
The first takes both text and vision as inputs, while the second only takes text.
For text-only generation, it's as simple as inheriting from the right decoder onnx config (the llama onnx config) and customizing a couple of things.
However, for Gemma3 with vision, you will need to add an API similar to what we do for seq2seq models (a bit more complicated than that), so that we can export the model parts (vision encoder, multimodal projector, language model).
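
To make the vision path concrete, the split could look roughly like this (a hedged sketch, not the final implementation; `get_gemma3_submodels` is a hypothetical helper, and the attribute paths on `Gemma3ForConditionalGeneration` are assumptions that differ across transformers versions):

    # hedged sketch: split Gemma3ForConditionalGeneration into separately
    # exportable parts, mirroring the seq2seq-style multi-part export;
    # attribute paths are assumptions and vary across transformers versions
    def get_gemma3_submodels(model):
        return {
            "vision_encoder": model.model.vision_tower,
            "multimodal_projector": model.model.multi_modal_projector,
            "language_model": model.model.language_model,
        }

Each part would then get its own onnx config and its own exported file, the way encoder and decoder parts do for seq2seq models.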

@IlyasMoutawwakil (Member)

This API, for example, was added in the optimum-intel project; you can take inspiration from it here: https://github.com/huggingface/optimum-intel/blob/cd0028599d65e841d6fe16d4d9ccd5528f89f3cb/optimum/exporters/openvino/model_configs.py#L3917

@IlyasMoutawwakil (Member)

You don't have to follow it exactly; let me know if you have better suggestions.

Comment on lines 510 to 518
    @register_tasks_manager_onnx("gemma3", *[*COMMON_TEXT_GENERATION_TASKS, "text-classification"])
    class Gemma3OnnxConfig(LlamaOnnxConfig):
        DUMMY_INPUT_GENERATOR_CLASSES = (
            DummyTextInputGenerator,
            DummyVisionInputGenerator,
        )
        DUMMY_PKV_GENERATOR_CLASS = GemmaDummyPastKeyValuesGenerator
        NORMALIZED_CONFIG_CLASS = NormalizedConfigManager.get_normalized_config_class("gemma3")
        MIN_TRANSFORMERS_VERSION = version.parse("4.52.0.dev0")
@IlyasMoutawwakil (Member) commented Sep 3, 2025

For example, here the model will only support text inputs, as it inherits its inputs property from llama. It also only supports classic decoder-model (LLM) tasks, like text-generation.
If your intention is to support text-only generation with gemma3, then yes, this would be acceptable; in that case there's no need for DummyVisionInputGenerator. Also, you can define NORMALIZED_CONFIG_CLASS directly instead of going through the config manager.
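
Defining it directly could look like this (a sketch; `NormalizedTextConfig` is optimum's normalized-config helper, and the exact attribute mapping gemma3 needs is an assumption):

    from optimum.utils import NormalizedTextConfig

    # sketch: bind a normalized config directly on the onnx config class
    # instead of looking it up through NormalizedConfigManager
    NORMALIZED_CONFIG_CLASS = NormalizedTextConfig.with_args(
        num_key_value_heads="num_key_value_heads",  # assumption: gemma3 exposes GQA heads
        allow_new=True,
    )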

@simondanielsson (Author)

Makes total sense, thanks for the detailed comments! Will have another go at this :)

@simondanielsson (Author)

Some questions to clarify before I continue implementing:

  1. Would we like something like the following?
  • a gemma3_text type, allowing export of the causal LM for e.g. text-generation tasks.
  • a gemma3 type, allowing for image-text-to-text tasks.

Potentially it could make sense to merge these into a single gemma3 and distinguish between them only on the task (conditionally creating the right behaviour, VLM or LM). Let's see if that's possible.

  2. If exporting gemma3 for image-text-to-text, we by default export the full model. Do we want to add support for exporting only one subcomponent ("behavior") of the model as well?

Thanks for bearing with me :) I have lots of time to spend on this, so I'm happy to learn and contribute a lot in the coming month if all goes well.

@IlyasMoutawwakil (Member)

I don't think there's a need for another "model type"; it would be complicated to add one, because the exporter relies on the model type in the model config to infer which onnx config it should use.

> Potentially it could make sense to merge these into a single gemma3 and distinguish between them only on the task (conditionally creating the right behaviour, VLM or LM). Let's see if that's possible.

Yes, this makes more sense. BTW, you can see how for seq2seq models like bart we support text-generation (causal LM) in the same onnx config.

> Do we want to add support for exporting only one subcomponent ("behavior") of the model as well?

If you mean the case of text-only generation, where only the language model should be exported, then yes.
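
A sketch of what that task-conditional behaviour could look like inside a single onnx config (the dynamic-axis names are illustrative, not the final implementation):

    # hedged sketch: one onnx config whose inputs depend on self.task,
    # loosely following how seq2seq configs also serve text-generation
    @property
    def inputs(self):
        text_axes = {0: "batch_size", 1: "sequence_length"}
        common_inputs = {"input_ids": dict(text_axes), "attention_mask": dict(text_axes)}
        if self.task == "image-text-to-text":
            # vision inputs only exist for the VLM path
            common_inputs["pixel_values"] = {0: "batch_size", 2: "height", 3: "width"}
        return common_inputs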

    if monolith:
        expected_models = {"model.onnx"}
    elif task in {"text-generation", "text-generation-with-past"}:
        expected_models = {"language_model.onnx", "language_model_head.onnx"}
Member

I think if the model is exported with the task text-generation, we should have something similar to what we get with other decoder models (i.e. a single model.onnx, exported with past key values for the with-past variant).
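
In the test above, that suggestion would be something like (a sketch adjusting the snippet being reviewed):

    # sketch: text-generation behaves like any other decoder export and
    # produces a single model.onnx (with KV cache for the -with-past task)
    if monolith or task in {"text-generation", "text-generation-with-past"}:
        expected_models = {"model.onnx"}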

@simondanielsson (Author) commented Sep 11, 2025

Now I added:

    expected_models = {"text_encoder.onnx", "language_model_with_head.onnx"}

  • Would you prefer to call the second one model.onnx?

"language_model_head.onnx",
}
elif task in {"feature-extraction", "feature-extraction-with-past"}:
expected_models = {"vision_encoder.onnx", "language_model.onnx"}
Member

In feature extraction, the model should be decomposed into the same elements as image-text-to-text.
feature-extraction is the task that maps to Gemma3Model (the model without any task head): https://github.com/huggingface/transformers/blob/de01a22aff21d16532d8dd68806589ca6c73dd5c/src/transformers/models/gemma3/modeling_gemma3.py#L771

@simondanielsson (Author) commented Sep 11, 2025

Implemented it as follows; LMK if it's not correct:

image-text-to-text:

  • vision_encoder
  • multimodal_projector
  • text_encoder
  • language_model_with_head <--- note head

feature-extraction:

  • vision_encoder
  • multimodal_projector
  • text_encoder
  • language_model
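
In test code, that decomposition could be captured as a per-task mapping; a sketch with hypothetical names mirroring the lists above:

    # hypothetical per-task expectations for the exported submodels
    EXPECTED_SUBMODELS = {
        "image-text-to-text": {
            "vision_encoder.onnx", "multimodal_projector.onnx",
            "text_encoder.onnx", "language_model_with_head.onnx",
        },
        "feature-extraction": {
            "vision_encoder.onnx", "multimodal_projector.onnx",
            "text_encoder.onnx", "language_model.onnx",
        },
    }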

"vision_encoder.onnx",
"multimodal_projector.onnx",
"language_model.onnx",
"language_model_head.onnx",
Member

Is it necessary to decompose the model into a language model and an lm_head? I think they can be exported together as language_model_with_lm_head; you can see in the gemma3 modeling code that in the class for conditional generation, the lm_head is always applied after the language model (which is not the case for feature extraction).

@simondanielsson (Author)

  • Added a text encoder as well, great point!
  • Agree it's probably not necessary to decompose the LM and LM head. However, to export just the LM and LM head as one unit (ignoring the vision encoder etc.), I would have to implement a separate nn.Module, because Gemma3ForConditionalGeneration does not expose any way of getting just the LM plus the head. Do you think it's fine to export it as a monolith in this case? This is how it's done in optimum-intel here. (A sketch of such a wrapper follows below.)
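
For reference, such a wrapper could be as small as this (a sketch; `Gemma3LanguageModelWithHead` is hypothetical, and the attribute paths are assumptions that depend on the transformers version):

    import torch

    class Gemma3LanguageModelWithHead(torch.nn.Module):
        # hypothetical wrapper bundling the language model and lm_head so
        # they can be exported as one unit, without the vision components
        def __init__(self, model):
            super().__init__()
            self.language_model = model.model.language_model  # assumption: attribute path
            self.lm_head = model.lm_head

        def forward(self, inputs_embeds, attention_mask, position_ids=None):
            outputs = self.language_model(
                inputs_embeds=inputs_embeds,
                attention_mask=attention_mask,
                position_ids=position_ids,
            )
            return self.lm_head(outputs.last_hidden_state)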

@simondanielsson (Author)

See implementation above



    # TODO: to be implemented
    class ORTVisionEncoder(ORTSessionMixin):
@simondanielsson (Author)

Note: this is mostly stubbed and a bit WIP.
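
For whoever picks this up, the stub might eventually grow into something like this (a rough sketch; how ORTSessionMixin exposes the underlying InferenceSession is an assumption and would need to follow the mixin's actual API):

    import torch

    class ORTVisionEncoder(ORTSessionMixin):
        # rough sketch: run the exported vision encoder and return image embeddings
        def forward(self, pixel_values):
            onnx_inputs = {"pixel_values": pixel_values.cpu().numpy()}
            onnx_outputs = self.session.run(None, onnx_inputs)  # assumption: self.session
            # assumption: a single output tensor holding the image embeddings
            return torch.from_numpy(onnx_outputs[0])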

@simondanielsson (Author)

@IlyasMoutawwakil I updated the test structure to align more with your refactoring efforts. Note that parts of it are still heavily stubbed.

Unfortunately I will not have much more capacity to work on this (it required a bit more time than I anticipated), just an hour or so tomorrow. How can I best clean this up so that someone else can pick it up easily, or alternatively put it into a minimal state where it can be merged and followed up later in another PR?

Thanks for the amazing support through this, I've learnt a lot.

@IlyasMoutawwakil (Member)

Thanks for the effort @simondanielsson. TBH I knew it required more work than the classic model-addition workflow, because the entire recipe for VLM export and inference is needed for a correct implementation. I will try to build upon your work in another PR 🤗

@thewh1teagle (Contributor)

Hey, thanks for your effort!
Looks like the following command fails on this branch:

    git clone https://github.com/huggingface/optimum-onnx
    gh pr checkout https://github.com/huggingface/optimum-onnx/pull/50
    uv sync
    uv run optimum-cli export -h

Error:

 uv run optimum-cli export -h
Traceback (most recent call last):
  File "/Users/yqbqwlny/Documents/audio/optimum-onnx/.venv/bin/optimum-cli", line 10, in <module>
    sys.exit(main())
             ^^^^^^
  File "/Users/yqbqwlny/Documents/audio/optimum-onnx/.venv/lib/python3.12/site-packages/optimum/commands/optimum_cli.py", line 196, in main
    commands_to_register = _OPTIMUM_CLI_SUBCOMMANDS + load_optimum_namespace_cli_commands()
                                                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/yqbqwlny/Documents/audio/optimum-onnx/.venv/lib/python3.12/site-packages/optimum/commands/optimum_cli.py", line 140, in load_optimum_namespace_cli_commands
    register_module = importlib.import_module(f"{commands_register_namespace}.{register_file.stem}")
                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/yqbqwlny/.local/share/uv/python/cpython-3.12.11-macos-aarch64-none/lib/python3.12/importlib/__init__.py", line 90, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "<frozen importlib._bootstrap>", line 1387, in _gcd_import
  File "<frozen importlib._bootstrap>", line 1360, in _find_and_load
  File "<frozen importlib._bootstrap>", line 1331, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 935, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 999, in exec_module
  File "<frozen importlib._bootstrap>", line 488, in _call_with_frames_removed
  File "/Users/yqbqwlny/Documents/audio/optimum-onnx/optimum/commands/register/register_export.py", line 15, in <module>
    from optimum.commands.export import ExportCommand
ImportError: cannot import name 'ExportCommand' from 'optimum.commands.export' (unknown location)

I had to modify things in onnx.py and register_export.py

Curious to try it. How can I convert gemma3-270m from a local checkpoint directory? I'm trying to convert this gemma3-g2p model.
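
For a local checkpoint, the export CLI accepts a directory path wherever a hub id is accepted, so (assuming this branch registers gemma3 with the exporter) something like this should work:

    optimum-cli export onnx --model /path/to/gemma3-270m --task text-generation-with-past gemma3_onnx/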

@thewh1teagle (Contributor)

Why not add initial text-only support first? It's better to have something working early than to wait for the perfect PR.

@IlyasMoutawwakil (Member)

@thewh1teagle yes, you need to rebase the branch onto the latest changes in main. We just merged 2 PRs, one in optimum and another in optimum-onnx, to use native Python namespaces, for a better developer experience when working on the project in editable mode.

Also agreed that having initial text-only support is what I had in mind at first. I guess @simondanielsson was interested in the full development of a new export API, which is very generous of them! Don't hesitate to open a PR with only text-gen support!

@surfiniaburger

gemma3n multimodal
Gemma3n for text only

I wrote this a while back, hope it helps.

IlyasMoutawwakil added a commit that referenced this pull request Oct 22, 2025
Added support for gemma3-text following the code in:

- #50

also added a working example with `gemma3-270m-instruct`  

will update and improve as needed.


Related

- #69
- #49
- #45
- huggingface/optimum#1724
- #56

---------

Co-authored-by: Ilyas Moutawwakil <[email protected]>
Co-authored-by: IlyasMoutawwakil <[email protected]>
@geraldstanje1

Hi @IlyasMoutawwakil, why did this get closed?

@IlyasMoutawwakil (Member)

@geraldstanje1 support for text-only gemma3 was added in #70 (and extended to loading from a VLM checkpoint in #90).
This PR is out of date and the author has stated they're unavailable to finish it 😕
I can keep it open, but I don't want it to stop others from attempting this feature.

@IlyasMoutawwakil changed the title from "[Gemma3] Add export support" to "[Gemma3] Add VLM support (need help)" on Oct 22, 2025