
Question on MobiLlama-V #13


Description

@g-h-chen

Thanks for your great work! In the Multimodal MobiLlama part of the Results section, you briefly describe how you developed MobiLlama-V. The model appears to have a LLaVA-like architecture but is trained only on the visual instruction tuning data, which may be why MobiLlama-V exhibits mediocre performance. Hence, my questions are the following:

  1. Can you release more details about the architecture and training process of MobiLlama-V?
  2. Did you (or will you) perform two-stage training instead of only the second stage? (A rough sketch of what I mean follows this list.)
  3. Would you consider using ALLaVA-4V, a high-quality multimodal dataset for vision-language training that was proposed to improve the performance of small VLMs?
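For clarity, here is a minimal, hypothetical sketch of the LLaVA-style setup and the two training stages I have in mind (frozen vision encoder, MLP projector, small LM; stage 1 trains only the projector on alignment data, stage 2 also unfreezes the LM for visual instruction tuning). All module and parameter names below are my own stand-ins, not MobiLlama-V's actual code.

```python
import torch
import torch.nn as nn


class TinyLlavaLike(nn.Module):
    """Hypothetical LLaVA-style model: vision features -> MLP projector -> LM."""

    def __init__(self, vision_dim=1024, hidden_dim=2048, vocab_size=32000):
        super().__init__()
        self.vision_tower = nn.Linear(vision_dim, vision_dim)  # placeholder for a CLIP-style encoder
        self.projector = nn.Sequential(                         # MLP connector into the LM space
            nn.Linear(vision_dim, hidden_dim),
            nn.GELU(),
            nn.Linear(hidden_dim, hidden_dim),
        )
        self.lm = nn.TransformerEncoder(                        # placeholder for a ~0.5B LM backbone
            nn.TransformerEncoderLayer(hidden_dim, nhead=8, batch_first=True),
            num_layers=2,
        )
        self.text_embed = nn.Embedding(vocab_size, hidden_dim)
        self.lm_head = nn.Linear(hidden_dim, vocab_size)

    def forward(self, image_feats, input_ids):
        # Project image features into the LM embedding space and prepend them to text tokens.
        img_tokens = self.projector(self.vision_tower(image_feats))
        txt_tokens = self.text_embed(input_ids)
        seq = torch.cat([img_tokens, txt_tokens], dim=1)
        return self.lm_head(self.lm(seq))


def set_stage(model: TinyLlavaLike, stage: int):
    """Stage 1: train only the projector (feature alignment on caption data).
    Stage 2: also unfreeze the LM (visual instruction tuning)."""
    for p in model.parameters():
        p.requires_grad = False
    for p in model.projector.parameters():
        p.requires_grad = True
    if stage == 2:
        for p in model.lm.parameters():
            p.requires_grad = True
        for p in model.lm_head.parameters():
            p.requires_grad = True


model = TinyLlavaLike()
set_stage(model, stage=1)  # projector pretraining first, then switch to stage 2
```

My question 2 is essentially whether MobiLlama-V skips the stage-1 alignment step above and goes directly to stage 2.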

Thanks!
