According to the paper, this VLM is trained in two stages, yet only one checkpoint is available on Hugging Face. In 1_vlm_demo.py, this single checkpoint appears to be used for both Stage 1 and Stage 2.
I ran 1_vlm_demo.py and inspected the Stage 1 outputs in the /demo folder. The results seem suboptimal: many outputs fail to follow the prompt templates.
Is this result expected? Also, would it be possible to release the Stage 1 fine-tuned checkpoint separately?