Hi, has anyone tried adapting a CLIP-style contrastive pretraining model to a signal-text multimodal model? Thank you!