
Support for VLM Conversion with Encoder-Decoder Architecture #946

@sagnik-charlie


Description of the bug:

Currently, the litert_torch library only supports decoder-only model conversion, even for VLMs such as qwen_vl, paligemma, and smolvlm2 that are listed in litert_torch.generative.examples. The export entry point in litert_torch.generative.export_hf.export does accept a 'task' argument, and passing task='image_text_to_text' loads the VLM as a PyTorch model. However, the exportable_module in litert_torch.generative.export_hf.core then still routes through the decoder-only conversion script. So whether you go through the generative examples or through export_hf, litert_torch produces a decoder-only conversion, which only covers tasks like 'llm_chat' and 'llm_prompt_lab' in the litert-lm engine runtime. For image and audio support, the litert-lm engine expects 'TF_LITE_VISION_ENCODER' and 'TF_LITE_AUDIO_ENCODER_HW' components alongside the decoder, and the code to export those is still missing. Can you please look into this problem and provide a solution?

Actual vs expected behavior:

VLM conversion is expected to export both the encoder and the decoder. However, for all of the VLMs mentioned above, conversion returns only a decoder with valid signatures. As a result, only the 'llm_chat' and 'llm_prompt_lab' tasks are supported in the Google AI Edge Gallery app, whereas tasks like 'llm_ask_image' and 'llm_ask_audio' expect a 'TF_LITE_VISION_ENCODER' and a 'TF_LITE_AUDIO_ENCODER_HW' respectively inside the .litertlm bundle. Converting those encoder components is still not supported by the litert_torch library.
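To make the gap concrete, here is a small illustrative sketch in plain Python (not litert_torch or litert-lm code). The task names and the two encoder component names come from this issue; the helper function and the decoder component name are hypothetical, used only to show which tasks a decoder-only bundle can and cannot serve.

```python
from typing import Iterable

# Extra .litertlm components each litert-lm task expects beyond the decoder,
# per the behavior described in this issue. Task and encoder component names
# are from the issue text.
EXTRA_REQUIRED: dict[str, set[str]] = {
    "llm_chat": set(),                              # decoder-only task
    "llm_prompt_lab": set(),                        # decoder-only task
    "llm_ask_image": {"TF_LITE_VISION_ENCODER"},    # needs a vision encoder
    "llm_ask_audio": {"TF_LITE_AUDIO_ENCODER_HW"},  # needs an audio encoder
}

def missing_components(task: str, bundle: Iterable[str]) -> set[str]:
    """Return the components the bundle lacks for the given task."""
    return EXTRA_REQUIRED.get(task, set()) - set(bundle)

# A decoder-only bundle, as litert_torch currently produces
# ("DECODER" is a placeholder name, not the real component id):
decoder_only = ["DECODER"]

print(missing_components("llm_chat", decoder_only))       # nothing missing
print(missing_components("llm_ask_image", decoder_only))  # vision encoder missing
```

Running this shows that the decoder-only bundle satisfies 'llm_chat' but leaves 'llm_ask_image' (and likewise 'llm_ask_audio') unusable until the encoder export path exists.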

Any other information you'd like to share?

No response
