I was wondering if you could explain why you tokenize the image twice and then append both copies for multimodal understanding.
For instance, in this script https://github.com/Gen-Verse/MMaDA/blob/main/training/train_mmada_stage4.py, at L721 and L722, the same image input IDs appear to be concatenated twice: one copy is noised, with some tokens replaced by random tokens, while the other uses the standard image tokens.
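To make sure I'm reading the code correctly, here is a minimal sketch of the pattern I think those two lines implement. The function name, the ordering of the two copies, and the masking probability are all my assumptions, not taken from the repo:

```python
import torch

def build_duplicated_image_sequence(image_ids, vocab_size, mask_prob=0.15, seed=0):
    """Hypothetical sketch (not the repo's actual code): concatenate a
    noised copy of the image token ids with the clean copy.

    - noised copy: a random fraction (mask_prob) of positions is replaced
      by uniformly random token ids
    - clean copy: the standard image tokens, left untouched
    The order (noised first, clean second) is an assumption.
    """
    g = torch.Generator().manual_seed(seed)
    noised = image_ids.clone()
    replace = torch.rand(image_ids.shape, generator=g) < mask_prob
    random_ids = torch.randint(0, vocab_size, image_ids.shape, generator=g)
    noised[replace] = random_ids[replace]
    # result is twice the length of the original image token sequence
    return torch.cat([noised, image_ids], dim=-1)
```

If this sketch matches what L721-L722 do, my question is about the purpose of keeping both copies in the training sequence.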
Could you clarify the reasoning behind tokenizing and appending the image twice for multimodal understanding? In particular, during inference we seem to pass the image only once, so I was wondering what role the duplicated image tokens play during training.