According to the paper, this VLM is trained in two stages, yet only one checkpoint is available on Hugging Face. In 1_vlm_demo.py, this single checkpoint appears to be used for both Stage 1 and Stage 2.
I ran 1_vlm_demo.py and inspected the Stage 1 outputs in the /demo folder. The results seem suboptimal: many outputs fail to follow the prompt templates.
Is this result expected? Also, would it be possible to release the Stage 1 fine-tuned checkpoint separately?