Hi! Thank you for your work! I have a question regarding cross-modal retrieval on COCO, I am struggling to reproduce the results reported in the paper. Could you provide more details on your training protocol? Which coefficients are you using for VICReg/what kind of expander architecture/are you doing any further downstream training or do you directly use the encoder embeddings obtained via ssl?
Thank you!