It's quite easy to fine-tune one of the OpenAI CLIP checkpoints with this codebase: https://github.com/Zasder3/train-CLIP-FT (it uses PyTorch Lightning). May be worth pursuing.