Hello, thanks for the great work on VILA-U.
I want to feed the first frame of a video into the autoregressive image generation process so that the generated frames stay consistent and continuous, but I can't find where the following method is actually implemented. Could you point me to the code? Or do you have a better suggestion?
outputs = self.llm.generate(
    input_ids=input_ids,
    attention_mask=attention_mask,
    vision_tower=self.vision_tower,
    mm_projector=self.mm_projector,
    image_ids=image_ids,
    cfg=cfg,
    **generation_kwargs
)
(vila_u_arch.py, lines 580-588)
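
For context, what I have in mind is roughly prefix conditioning: tokenize the first frame, append its tokens to the prompt, and let the model continue generating from there. Below is a minimal sketch in plain Python. The names `build_prefix`, `first_frame_ids`, and `image_start_id` are hypothetical and not taken from the VILA-U codebase, and the integer ids are toy values standing in for real tokenizer and vision-tower outputs.

```python
from typing import List, Tuple


def build_prefix(
    prompt_ids: List[int],
    first_frame_ids: List[int],
    image_start_id: int,
) -> Tuple[List[int], List[int]]:
    """Concatenate prompt tokens, an image-start marker, and first-frame tokens.

    Hypothetical helper, not part of VILA-U. Returns (input_ids, attention_mask),
    where the mask is 1 at every position because nothing is padded.
    """
    input_ids = prompt_ids + [image_start_id] + first_frame_ids
    attention_mask = [1] * len(input_ids)
    return input_ids, attention_mask


# Toy ids: three prompt tokens, one marker, two first-frame tokens.
input_ids, attention_mask = build_prefix([11, 12, 13], [501, 502], image_start_id=9)
```

Whether `self.llm.generate` can accept such a prefix, for example through `image_ids`, is exactly the part I am unsure about.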