ODISE uses a prompt-embedding technique based on CLIP image embeddings. How do I implement that in this code, given that the conditions passed in "encoder_hidden_states" have to be of a specific shape and type? I would appreciate some help, as I'm not entirely sure how to implement it.
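
To make the question concrete, here is a minimal PyTorch sketch of the kind of thing I have in mind. It is not ODISE's actual implementation. It assumes the UNet expects "encoder_hidden_states" of shape (batch, seq_len, cross_attn_dim), as diffusers-style conditional UNets usually do, and that a CLIP image encoder has already produced one embedding per image. The class name `ImagePromptProjector`, the 768-dimensional sizes, and the choice of 4 tokens are all hypothetical placeholders.

```python
import torch
import torch.nn as nn


class ImagePromptProjector(nn.Module):
    """Hypothetical helper: turns one CLIP image embedding per image into a
    short sequence of prompt tokens sized for the UNet's cross-attention."""

    def __init__(self, clip_dim=768, cross_attn_dim=768, num_tokens=4):
        super().__init__()
        self.num_tokens = num_tokens
        self.cross_attn_dim = cross_attn_dim
        # Small MLP that expands the single embedding into num_tokens vectors.
        self.mlp = nn.Sequential(
            nn.Linear(clip_dim, cross_attn_dim),
            nn.GELU(),
            nn.Linear(cross_attn_dim, num_tokens * cross_attn_dim),
        )
        self.norm = nn.LayerNorm(cross_attn_dim)

    def forward(self, image_embeds):
        # image_embeds: (batch, clip_dim)
        tokens = self.mlp(image_embeds)
        # Reshape to (batch, num_tokens, cross_attn_dim).
        tokens = tokens.view(-1, self.num_tokens, self.cross_attn_dim)
        return self.norm(tokens)


projector = ImagePromptProjector()
# Random stand-in for real CLIP image embeddings (assumed shape: batch x 768).
image_embeds = torch.randn(2, 768)
encoder_hidden_states = projector(image_embeds)
print(encoder_hidden_states.shape, encoder_hidden_states.dtype)
```

The resulting tensor would then be cast to the UNet's dtype and device (for example with `.to(dtype=..., device=...)`) and passed in place of the text-encoder output. Whether the projected tokens should replace the text embeddings or be combined with them is something I am unsure about.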