Concatenating input images twice while training does not seem intuitive. #66

@kingston-aditya

Description

@kingston-aditya

I was wondering if you could explain why the image is tokenized twice and then appended in the multimodal-understanding case.

For instance, in this script https://github.com/Gen-Verse/MMaDA/blob/main/training/train_mmada_stage4.py, at L721 and L722, the same image input IDs are concatenated twice: one copy appears to be noised, with some tokens replaced by mask/random tokens, while the other keeps the standard image tokens.
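To make the question concrete, here is a minimal sketch of the pattern I am describing. This is a hypothetical reconstruction, not the actual code from the script: `duplicate_with_noise`, `mask_token_id`, and `mask_prob` are names I made up for illustration.

```python
import torch

def duplicate_with_noise(image_ids: torch.Tensor,
                         mask_token_id: int,
                         mask_prob: float = 0.5) -> torch.Tensor:
    """Hypothetical sketch of the pattern in question: build a noised
    copy of the image token ids (some positions replaced by the mask
    token), then concatenate it with the clean copy of the same ids."""
    noisy = image_ids.clone()
    noise_positions = torch.rand(image_ids.shape) < mask_prob
    noisy[noise_positions] = mask_token_id
    # Result contains the same image twice: [noised copy | clean copy].
    return torch.cat([noisy, image_ids], dim=-1)

# Example: an 8-token "image" becomes a 16-token sequence.
ids = torch.arange(8)
out = duplicate_with_noise(ids, mask_token_id=999)
```

If something like this is what the training code does, my question is what objective the clean second copy serves, given that inference passes the image only once.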

Could you clarify the reasoning behind tokenizing and appending the image twice for multimodal understanding? In particular, during inference we seem to pass the image only once, so I was wondering what role the duplicated image tokens play during training.
