
Question on MobiLlama-V #13


Description

@g-h-chen

Thanks for your great work! In the Multimodal MobiLlama part of the Results section, you briefly describe how you developed MobiLlama-V. The model appears to have a LLaVA-like architecture but is trained only on the visual instruction tuning data, which may be why MobiLlama-V exhibits mediocre performance. Hence, my questions are the following:

  1. Can you release more details about the architecture and training process of MobiLlama-V?
  2. Did you (or will you) perform two-stage training instead of only the second stage? (A rough sketch of what I mean follows this list.)
  3. Would you consider using ALLaVA-4V, a high-quality multimodal dataset for vision-language training that was proposed to improve the performance of small VLMs?
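For clarity, here is a minimal, hypothetical sketch of the LLaVA-style setup and the two training stages I have in mind (frozen vision encoder, MLP projector, small LM; stage 1 trains only the projector on alignment data, stage 2 also unfreezes the LM for visual instruction tuning). All module and parameter names below are my own stand-ins, not MobiLlama-V's actual code.

```python
import torch
import torch.nn as nn


class TinyLlavaLike(nn.Module):
    """Hypothetical LLaVA-style model: vision features -> MLP projector -> LM."""

    def __init__(self, vision_dim=1024, hidden_dim=2048, vocab_size=32000):
        super().__init__()
        self.vision_tower = nn.Linear(vision_dim, vision_dim)  # placeholder for a CLIP-style encoder
        self.projector = nn.Sequential(                         # MLP connector into the LM space
            nn.Linear(vision_dim, hidden_dim),
            nn.GELU(),
            nn.Linear(hidden_dim, hidden_dim),
        )
        self.lm = nn.TransformerEncoder(                        # placeholder for a ~0.5B LM backbone
            nn.TransformerEncoderLayer(hidden_dim, nhead=8, batch_first=True),
            num_layers=2,
        )
        self.text_embed = nn.Embedding(vocab_size, hidden_dim)
        self.lm_head = nn.Linear(hidden_dim, vocab_size)

    def forward(self, image_feats, input_ids):
        # Project image features into the LM embedding space and prepend them to text tokens.
        img_tokens = self.projector(self.vision_tower(image_feats))
        txt_tokens = self.text_embed(input_ids)
        seq = torch.cat([img_tokens, txt_tokens], dim=1)
        return self.lm_head(self.lm(seq))


def set_stage(model: TinyLlavaLike, stage: int):
    """Stage 1: train only the projector (feature alignment on caption data).
    Stage 2: also unfreeze the LM (visual instruction tuning)."""
    for p in model.parameters():
        p.requires_grad = False
    for p in model.projector.parameters():
        p.requires_grad = True
    if stage == 2:
        for p in model.lm.parameters():
            p.requires_grad = True
        for p in model.lm_head.parameters():
            p.requires_grad = True


model = TinyLlavaLike()
set_stage(model, stage=1)  # projector pretraining first, then switch to stage 2
```

My question 2 is essentially whether MobiLlama-V skips the stage-1 alignment step above and goes directly to stage 2.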

Thanks!
