Could you provide more detailed information about CLIP fine-tuning for multimodal retrieval?
Specifically, I'm interested in understanding how to handle composite inputs during training, where both queries and database entries may contain combinations of image and text modalities.
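To make concrete what I mean by "composite inputs", here is a minimal sketch of the kind of setup I have in mind, assuming the Hugging Face `transformers` CLIP API and a stock `openai/clip-vit-base-patch32` checkpoint; the simple embedding-average fusion and the file paths are just placeholders I made up for illustration, not a proposed solution:

```python
# Illustration only: a composite (image + text) query embedded with off-the-shelf CLIP.
# The averaging fusion below is a naive placeholder; my question is how to fine-tune
# for this setting properly.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def embed_composite(image: Image.Image, text: str) -> torch.Tensor:
    """Embed an (image, text) pair as a single retrieval vector."""
    inputs = processor(text=[text], images=[image], return_tensors="pt", padding=True)
    with torch.no_grad():
        img_emb = model.get_image_features(pixel_values=inputs["pixel_values"])
        txt_emb = model.get_text_features(
            input_ids=inputs["input_ids"], attention_mask=inputs["attention_mask"]
        )
    # L2-normalize each tower's output, then average as a crude fusion.
    img_emb = img_emb / img_emb.norm(dim=-1, keepdim=True)
    txt_emb = txt_emb / txt_emb.norm(dim=-1, keepdim=True)
    fused = (img_emb + txt_emb) / 2
    return fused / fused.norm(dim=-1, keepdim=True)

# Hypothetical usage: a composite query (reference image + modifying text) scored
# against a composite database entry, both embedded the same way.
# query_vec = embed_composite(Image.open("query.jpg"), "same dress but in red")
# db_vec = embed_composite(Image.open("product.jpg"), "red evening dress")
# similarity = (query_vec @ db_vec.T).item()
```

My question is how the training recipe should change when both sides of the retrieval pair can be composite like this, rather than the standard image-to-text contrastive setup.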