Hi!
Thank you for the really cool research and available code. I was wondering, would it be possible / feasible / interesting to train LLM2CLIP's vision encoder from scratch using the CC-LLM as the text encoder?
I noticed in the paper you only fine-tuned existing vision encoders with the CC-LLM, but I don't see why we couldn't train a randomly initialized vision encoder from scratch instead. Is it because generating so many embeddings with the CC-LLM would cost too much?
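To make the question concrete, here's a rough PyTorch sketch of the setup I'm imagining (all names, dimensions, and hyperparameters here are hypothetical placeholders, not taken from your codebase): the CC-LLM text tower stays frozen and its caption embeddings are precomputed once offline, while a from-scratch vision tower is trained against those cached embeddings with the usual CLIP-style contrastive loss.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ScratchVisionEncoder(nn.Module):
    """Randomly initialized vision tower (stand-in for a ViT trained from scratch)."""

    def __init__(self, embed_dim=1280, img_size=224, patch=16, width=768, layers=12, heads=12):
        super().__init__()
        self.patch_embed = nn.Conv2d(3, width, kernel_size=patch, stride=patch)
        self.pos_embed = nn.Parameter(torch.randn(1, (img_size // patch) ** 2, width) * 0.02)
        block = nn.TransformerEncoderLayer(width, heads, 4 * width, batch_first=True)
        self.blocks = nn.TransformerEncoder(block, layers)
        self.proj = nn.Linear(width, embed_dim)  # project into the (assumed) CC-LLM embedding space

    def forward(self, images):
        x = self.patch_embed(images).flatten(2).transpose(1, 2)  # (B, N, width)
        x = self.blocks(x + self.pos_embed)
        return self.proj(x.mean(dim=1))  # mean-pool patch tokens -> (B, embed_dim)


def clip_loss(img_emb, txt_emb, logit_scale):
    """Symmetric InfoNCE loss over image/text pairs, as in CLIP."""
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    logits = logit_scale.exp() * img_emb @ txt_emb.t()
    labels = torch.arange(logits.size(0), device=logits.device)
    return 0.5 * (F.cross_entropy(logits, labels) + F.cross_entropy(logits.t(), labels))


# Hypothetical training step: `cached_text_emb` would be CC-LLM caption embeddings
# computed once offline, so the LLM itself never runs inside the training loop.
vision = ScratchVisionEncoder()
logit_scale = nn.Parameter(torch.tensor(2.659))  # ln(1/0.07), CLIP's usual init
optim = torch.optim.AdamW(list(vision.parameters()) + [logit_scale], lr=5e-4)

images = torch.randn(8, 3, 224, 224)       # dummy image batch
cached_text_emb = torch.randn(8, 1280)     # dummy precomputed CC-LLM embeddings

loss = clip_loss(vision(images), cached_text_emb, logit_scale)
optim.zero_grad()
loss.backward()
optim.step()
```

If the caption embeddings can be cached like this, the CC-LLM cost would be a one-time preprocessing pass rather than a per-step expense, which is why I'm curious whether the limitation is compute, data scale, or something else I'm missing.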