Thanks for your excellent work. I am wondering how I can use this model to deal with multi-modal data.