Could you provide more detailed information about CLIP fine-tuning for multimodal retrieval?
Specifically, I'm interested in understanding how to handle composite inputs during training, where both queries and database entries may contain combinations of image and text modalities.
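To make concrete what I mean by "composite inputs", here is a minimal sketch of the kind of setup I have in mind, assuming the Hugging Face `transformers` CLIP API and a stock `openai/clip-vit-base-patch32` checkpoint; the simple embedding-average fusion and the file paths are just placeholders I made up for illustration, not a proposed solution:

```python
# Illustration only: a composite (image + text) query embedded with off-the-shelf CLIP.
# The averaging fusion below is a naive placeholder; my question is how to fine-tune
# for this setting properly.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def embed_composite(image: Image.Image, text: str) -> torch.Tensor:
    """Embed an (image, text) pair as a single retrieval vector."""
    inputs = processor(text=[text], images=[image], return_tensors="pt", padding=True)
    with torch.no_grad():
        img_emb = model.get_image_features(pixel_values=inputs["pixel_values"])
        txt_emb = model.get_text_features(
            input_ids=inputs["input_ids"], attention_mask=inputs["attention_mask"]
        )
    # L2-normalize each tower's output, then average as a crude fusion.
    img_emb = img_emb / img_emb.norm(dim=-1, keepdim=True)
    txt_emb = txt_emb / txt_emb.norm(dim=-1, keepdim=True)
    fused = (img_emb + txt_emb) / 2
    return fused / fused.norm(dim=-1, keepdim=True)

# Hypothetical usage: a composite query (reference image + modifying text) scored
# against a composite database entry, both embedded the same way.
# query_vec = embed_composite(Image.open("query.jpg"), "same dress but in red")
# db_vec = embed_composite(Image.open("product.jpg"), "red evening dress")
# similarity = (query_vec @ db_vec.T).item()
```

My question is how the training recipe should change when both sides of the retrieval pair can be composite like this, rather than the standard image-to-text contrastive setup.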