[Gemma3] Add VLM support (need help) #50
Conversation
Force-pushed from dcdbae1 to 709fff4
Hi! This API, for example, was added in the optimum-intel project, and you can take inspiration from it: https://github.com/huggingface/optimum-intel/blob/cd0028599d65e841d6fe16d4d9ccd5528f89f3cb/optimum/exporters/openvino/model_configs.py#L3917
You don't have to follow it exactly; let me know if you have better suggestions.
@register_tasks_manager_onnx("gemma3", *[*COMMON_TEXT_GENERATION_TASKS, "text-classification"])
class Gemma3OnnxConfig(LlamaOnnxConfig):
    DUMMY_INPUT_GENERATOR_CLASSES = (
        DummyTextInputGenerator,
        DummyVisionInputGenerator,
    )
    DUMMY_PKV_GENERATOR_CLASS = GemmaDummyPastKeyValuesGenerator
    NORMALIZED_CONFIG_CLASS = NormalizedConfigManager.get_normalized_config_class("gemma3")
    MIN_TRANSFORMERS_VERSION = version.parse("4.52.0.dev0")
For example, here the model will only support text inputs, since it inherits its inputs property from llama. It also only supports classic decoder (LLM) tasks, like text-generation.
If your intention is to support text-only generation with gemma3, then yes, this would be acceptable; in that case there's no need for DummyVisionInputGenerator. Also, for NORMALIZED_CONFIG_CLASS, you can define it directly instead of using the config manager.
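Concretely, a minimal sketch of the text-only variant with both suggestions applied could look like this (register_tasks_manager_onnx, COMMON_TEXT_GENERATION_TASKS, LlamaOnnxConfig and GemmaDummyPastKeyValuesGenerator are taken from the diff above; using NormalizedTextConfig for the normalized config is an assumption, not verified against the gemma3 text config):

```python
from packaging import version

from optimum.utils import DummyTextInputGenerator, NormalizedTextConfig

# register_tasks_manager_onnx, COMMON_TEXT_GENERATION_TASKS, LlamaOnnxConfig and
# GemmaDummyPastKeyValuesGenerator come from the exporter modules shown in the diff above.


@register_tasks_manager_onnx("gemma3", *COMMON_TEXT_GENERATION_TASKS)
class Gemma3OnnxConfig(LlamaOnnxConfig):
    # text-only export: no DummyVisionInputGenerator needed
    DUMMY_INPUT_GENERATOR_CLASSES = (DummyTextInputGenerator,)
    DUMMY_PKV_GENERATOR_CLASS = GemmaDummyPastKeyValuesGenerator
    # define the normalized config directly instead of going through the config manager
    NORMALIZED_CONFIG_CLASS = NormalizedTextConfig
    MIN_TRANSFORMERS_VERSION = version.parse("4.52.0.dev0")
```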
Makes total sense, thanks for the detailed comments! Will have another go at this :)
Some questions to clarify before I continue implementing:

Potentially it could make sense to merge these into a single […]

Thanks for bearing with me :) I have lots of time to spend on this, so I'm happy to learn and contribute a lot in the coming month if all goes well.
I don't think there's a need for another "model type"; it would be complicated to add one because the exporter relies on the model type in the model config to infer which onnx config it should use.

Yes, this makes more sense. Btw, you can see how for seq2seq models like bart we support text-generation (causal LM) in the same onnx config.

If you mean the case of text-only generation, where only the language model should be exported, then yes.
tests/exporters/onnx/test_export.py (Outdated)
if monolith:
    expected_models = {"model.onnx"}
elif task in {"text-generation", "text-generation-with-past"}:
    expected_models = {"language_model.onnx", "language_model_head.onnx"}
I think if the model is exported with the task text-generation, we should have something similar to what we get with other decoder models (i.e. it should be one model.onnx, exported with past key values if with-past).
Now I added

    expected_models = {"text_encoder.onnx", "language_model_with_head.onnx"}

Would you prefer to call the second one model.onnx?
tests/exporters/onnx/test_export.py (Outdated)
      | "language_model_head.onnx", | ||
| } | ||
| elif task in {"feature-extraction", "feature-extraction-with-past"}: | ||
| expected_models = {"vision_encoder.onnx", "language_model.onnx"} | 
In feature-extraction, the model should be decomposed into the same elements as image-text-to-text.
feature-extraction is the task that maps to Gemma3Model (the model without any task head): https://github.com/huggingface/transformers/blob/de01a22aff21d16532d8dd68806589ca6c73dd5c/src/transformers/models/gemma3/modeling_gemma3.py#L771
Implemented it as follows (see the sketch after this list) - LMK if not correct:
image-text-to-text:
- vision_encoder
- multimodal_projector
- text_encoder
- language_model_with_head <--- note head
feature-extraction:
- vision_encoder
- multimodal_projector
- text_encoder
- language_model
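To make that concrete, the per-task expectation in the test could look roughly like this; the component file names are still up for debate (e.g. model.onnx vs language_model_with_head.onnx), so treat this as a sketch rather than the final layout:

```python
# Hypothetical expected component files per task, mirroring the list above.
if task in {"image-text-to-text", "image-text-to-text-with-past"}:
    expected_models = {
        "vision_encoder.onnx",
        "multimodal_projector.onnx",
        "text_encoder.onnx",
        "language_model_with_head.onnx",  # lm_head fused into the decoder export
    }
elif task in {"feature-extraction", "feature-extraction-with-past"}:
    expected_models = {
        "vision_encoder.onnx",
        "multimodal_projector.onnx",
        "text_encoder.onnx",
        "language_model.onnx",  # no head: maps to Gemma3Model
    }
```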
tests/exporters/onnx/test_export.py (Outdated)
      | "vision_encoder.onnx", | ||
| "multimodal_projector.onnx", | ||
| "language_model.onnx", | ||
| "language_model_head.onnx", | 
Is it necessary to decompose the model into a language model and an lm_head? I think they can be exported together as language_model_with_lm_head. You can see in the gemma3 modeling code that, in the class for conditional generation, the lm_head is always applied after the language model (which is not the case for feature extraction).
I think there's also a missing component (we can call it the "text_encoder") https://github.com/huggingface/transformers/blob/de01a22aff21d16532d8dd68806589ca6c73dd5c/src/transformers/models/gemma3/modeling_gemma3.py#L902 which is necessary to embed the text input ids before fusing them with the embedded image features https://github.com/huggingface/transformers/blob/de01a22aff21d16532d8dd68806589ca6c73dd5c/src/transformers/models/gemma3/modeling_gemma3.py#L917
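As a rough illustration (the class name and export granularity are assumptions, not a final design), this component could be a thin wrapper around the model's input embeddings, exported on its own:

```python
import torch
from torch import nn


class Gemma3TextEncoder(nn.Module):
    """Hypothetical wrapper exposing only the token-embedding step, so it can be
    exported as text_encoder.onnx; the image features are fused with these
    embeddings downstream, outside this component."""

    def __init__(self, model):  # e.g. a Gemma3ForConditionalGeneration instance
        super().__init__()
        # get_input_embeddings() is the standard transformers accessor
        self.embed_tokens = model.get_input_embeddings()

    def forward(self, input_ids: torch.LongTensor) -> torch.Tensor:
        return self.embed_tokens(input_ids)
```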
- Added the text encoder as well - great point!
- Agree it's probably not necessary to decompose the LM and LM head. However, to export just the LM and LM head as one unit (ignoring the vision encoder etc.) I would have to implement a separate nn.Module (sketched further below), because Gemma3ForConditionalGeneration does not expose any way of getting just the LM and the head. Do you think it's fine to export it as a monolith in this case? This is how it's done in optimum-intel.
See implementation above
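For reference, a rough sketch of the separate nn.Module idea discussed above; the attribute names (language_model, lm_head) are assumptions about Gemma3ForConditionalGeneration's layout rather than verified accessors:

```python
import torch
from torch import nn


class Gemma3LanguageModelWithHead(nn.Module):
    """Hypothetical wrapper exporting the decoder and lm_head together as
    language_model_with_head.onnx."""

    def __init__(self, model):  # e.g. a Gemma3ForConditionalGeneration instance
        super().__init__()
        self.language_model = model.language_model  # assumed decoder attribute
        self.lm_head = model.lm_head  # assumed head attribute

    def forward(self, inputs_embeds: torch.Tensor, attention_mask: torch.Tensor):
        # the decoder consumes pre-computed embeddings (text + fused image features)
        outputs = self.language_model(
            inputs_embeds=inputs_embeds, attention_mask=attention_mask
        )
        return self.lm_head(outputs.last_hidden_state)
```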
# TODO: to be implemented
class ORTVisionEncoder(ORTSessionMixin):
Note: this is mostly stubbed and still a bit WIP.
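Since the ORTSessionMixin interface is still in flux here, this sketch uses plain onnxruntime just to illustrate what the vision-encoder component would do at inference time; the input/output names ("pixel_values", a single image-features output) are assumptions:

```python
import numpy as np
import onnxruntime as ort


class VisionEncoderRunner:
    """Rough stand-in (not the ORTSessionMixin API) for running the exported
    vision_encoder.onnx and producing image features for the projector."""

    def __init__(self, onnx_path: str):
        self.session = ort.InferenceSession(onnx_path)

    def __call__(self, pixel_values: np.ndarray) -> np.ndarray:
        # assumes a single input named "pixel_values" and a single output
        (image_features,) = self.session.run(None, {"pixel_values": pixel_values})
        return image_features
```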
@IlyasMoutawwakil I updated the test structure to align more with your refactoring efforts. Note that parts of it are still heavily stubbed. Unfortunately I will not have much more capacity to work on this (it required a bit more time than I anticipated) - just an hour or so tomorrow. How can I best assist/clean this up for someone else to pick up easily, or alternatively put it into a minimal state where it can be merged and followed up later by someone else in another PR? Thanks for the amazing support through this, I've learnt a lot.
Thanks for the effort @simondanielsson. Tbh I knew it required more work than the classic model addition workflow, because the entire recipe for VLM export and inference is required for a correct implementation. I will try to build upon your work in another PR 🤗
Hey, thanks for your effort!

    git clone https://github.com/huggingface/optimum-onnx
    gh pr checkout https://github.com/huggingface/optimum-onnx/pull/50
    uv sync
    uv run optimum-cli export -h

Error:

uv run optimum-cli export -h
Traceback (most recent call last):
  File "/Users/yqbqwlny/Documents/audio/optimum-onnx/.venv/bin/optimum-cli", line 10, in <module>
    sys.exit(main())
             ^^^^^^
  File "/Users/yqbqwlny/Documents/audio/optimum-onnx/.venv/lib/python3.12/site-packages/optimum/commands/optimum_cli.py", line 196, in main
    commands_to_register = _OPTIMUM_CLI_SUBCOMMANDS + load_optimum_namespace_cli_commands()
                                                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/yqbqwlny/Documents/audio/optimum-onnx/.venv/lib/python3.12/site-packages/optimum/commands/optimum_cli.py", line 140, in load_optimum_namespace_cli_commands
    register_module = importlib.import_module(f"{commands_register_namespace}.{register_file.stem}")
                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/yqbqwlny/.local/share/uv/python/cpython-3.12.11-macos-aarch64-none/lib/python3.12/importlib/__init__.py", line 90, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "<frozen importlib._bootstrap>", line 1387, in _gcd_import
  File "<frozen importlib._bootstrap>", line 1360, in _find_and_load
  File "<frozen importlib._bootstrap>", line 1331, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 935, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 999, in exec_module
  File "<frozen importlib._bootstrap>", line 488, in _call_with_frames_removed
  File "/Users/yqbqwlny/Documents/audio/optimum-onnx/optimum/commands/register/register_export.py", line 15, in <module>
    from optimum.commands.export import ExportCommand
ImportError: cannot import name 'ExportCommand' from 'optimum.commands.export' (unknown location)

I had to modify things in […]. Curious to try it. How can I convert […]?
Why not add initial text-only support first? It's better to have something working early than to wait for the perfect PR.
@thewh1teagle Yes, you need to rebase the branch on the latest changes in main. We just merged 2 PRs, one in optimum and another in optimum-onnx, to use native Python namespaces and give a better developer experience when working on the project in editable mode. Also, I agree that having initial text-only support is what I had in mind at first. I guess @simondanielsson was interested in the full development of a new export API, which is very generous of them! Don't hesitate to open a PR with only text-gen support!
gemma3n multimodal: I wrote this a while back, hope it helps.
Added support for gemma3-text following the code in:
- #50

Also added a working example with `gemma3-270m-instruct`; will update and improve as needed.

Related:
- #69
- #49
- #45
- huggingface/optimum#1724
- #56

Co-authored-by: Ilyas Moutawwakil <[email protected]>
Co-authored-by: IlyasMoutawwakil <[email protected]>
Hi @IlyasMoutawwakil, why did this get closed?
@geraldstanje1 Support for text-only gemma3 was added in #70 (and extended to loading from a VLM checkpoint in #90).
What does this PR do?
This PR adds support for exporting Gemma3 models.
Fixes #49 (issue)
Note: we might need this PR in the main repo as well, but there might be a better way of dealing with this :)
Note to self before merging:
Before submitting
Who can review?
@echarlaix, @JingyaHuang, @michaelbenayoun, @IlyasMoutawwakil