Thank you very much for your work exploring semantic segmentation with the MM-DiT generative model.
While reading the code, I had a question. I noticed that only the first token of the concept embedding is kept and used as the final token.
Is this because the first token of the T5 encoder output already contains all the semantic information?
Could this lead to a loss of information?
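
To make the question concrete, here is a minimal sketch of what I mean. It is not the repository's code: the array `concept_embedding`, its shape, and the mean-pooling alternative are my own assumptions, and the values are random placeholders standing in for a real T5 encoder output.

```python
import numpy as np

# Hypothetical stand-in for the T5 encoder output of one concept prompt,
# with shape (num_tokens, hidden_dim). Random values, for illustration only.
rng = np.random.default_rng(0)
num_tokens, hidden_dim = 4, 8
concept_embedding = rng.standard_normal((num_tokens, hidden_dim))

# What I understand the code to do: keep only the first token.
first_token = concept_embedding[0]

# One possible alternative that uses every token: mean pooling.
mean_pooled = concept_embedding.mean(axis=0)

# Both yield a single vector per concept, but they generally differ.
print(first_token.shape, mean_pooled.shape)
print(np.allclose(first_token, mean_pooled))
```

If a concept is tokenized into several pieces, the first-token choice would seem to ignore the remaining ones, which is the case I am asking about.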