Description of the bug:
Currently, the litert_torch library only supports decoder-only model conversion, even for the VLMs listed in litert_torch.generative.examples such as qwen_vl, paligemma, and smolvlm2. The export entry point in litert_torch.generative.export_hf.export accepts a 'task' argument; passing 'image_text_to_text' loads the VLM as a PyTorch model, but the exportable_module in litert_torch.generative.export_hf.core then still runs the decoder-only conversion script. So whether conversion goes through the generative examples or through export_hf, the result is a decoder-only model, which only covers tasks like 'llm_chat' and 'llm_prompt_lab' in the litert-lm engine runtime. For image and audio support, the litert-lm engine additionally expects a 'TF_LITE_VISION_ENCODER' and a 'TF_LITE_AUDIO_ENCODER_HW' alongside the decoder, and the conversion code for these is still missing. Could you please look into this problem and provide a solution?
Actual vs expected behavior:
VLM conversion is expected to export both the encoder and the decoder. However, for all of the VLMs mentioned above, the conversion returns only a decoder with valid signatures. As a result, only the 'llm_chat' and 'llm_prompt_lab' tasks are supported in the Google AI Edge Gallery app, whereas the 'llm_ask_image' and 'llm_ask_audio' tasks expect a 'TF_LITE_VISION_ENCODER' and a 'TF_LITE_AUDIO_ENCODER_HW', respectively, inside the .litertlm bundle. Converting these encoders is still not supported by the litert_torch library.
Any other information you'd like to share?
No response