I was wondering if you could explain why you tokenize the image twice and then append both copies for multimodal understanding.
For instance, in this script https://github.com/Gen-Verse/MMaDA/blob/main/training/train_mmada_stage4.py, at L721 and L722, the same image input IDs appear to be concatenated twice: one copy is noised, with some tokens replaced by random tokens, while the other uses the standard image tokens.
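To make sure I'm reading the code correctly, here is a minimal sketch of the pattern I think those two lines implement. The function name, the ordering of the two copies, and the masking probability are all my assumptions, not taken from the repo:

```python
import torch

def build_duplicated_image_sequence(image_ids, vocab_size, mask_prob=0.15, seed=0):
    """Hypothetical sketch (not the repo's actual code): concatenate a
    noised copy of the image token ids with the clean copy.

    - noised copy: a random fraction (mask_prob) of positions is replaced
      by uniformly random token ids
    - clean copy: the standard image tokens, left untouched
    The order (noised first, clean second) is an assumption.
    """
    g = torch.Generator().manual_seed(seed)
    noised = image_ids.clone()
    replace = torch.rand(image_ids.shape, generator=g) < mask_prob
    random_ids = torch.randint(0, vocab_size, image_ids.shape, generator=g)
    noised[replace] = random_ids[replace]
    # result is twice the length of the original image token sequence
    return torch.cat([noised, image_ids], dim=-1)
```

If this sketch matches what L721-L722 do, my question is about the purpose of keeping both copies in the training sequence.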
Could you clarify the reasoning behind tokenizing and appending the image twice for multimodal understanding? In particular, during inference we seem to pass the image only once, so I was wondering what role the duplicated image tokens play during training.