Hi! Thanks for your work.
I've read your paper, and I'm a bit confused about the claimed contribution of relying solely on the self-attention mechanism. From my understanding, this architecture appears to be just the backbone used in CogVideoX. Additionally, using a VAE and concatenating image latents seems to be common practice.
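For context, the "common practice" I mean is the usual image-to-video conditioning pattern: encode the conditioning frame with the VAE, broadcast it across time, and concatenate it channel-wise with the noisy video latent before the transformer. A minimal sketch (all shapes and names here are hypothetical, not taken from either paper):

```python
import numpy as np

# Hypothetical shapes, for illustration only: a generic sketch of the common
# image-to-video conditioning pattern, not any specific model's actual code.
B, C, T, H, W = 1, 16, 8, 30, 45  # batch, latent channels, frames, height, width

noisy_video_latent = np.random.randn(B, C, T, H, W).astype(np.float32)
image_latent = np.random.randn(B, C, 1, H, W).astype(np.float32)  # VAE-encoded first frame

# Broadcast the image latent across time (zero-padding the other frames is
# another common choice), then concatenate along the channel axis so the
# diffusion transformer receives 2*C input channels.
image_cond = np.repeat(image_latent, T, axis=2)
model_input = np.concatenate([noisy_video_latent, image_cond], axis=1)

print(model_input.shape)  # (1, 32, 8, 30, 45)
```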
I might be missing something, but could you clarify what the main improvements or novelties are compared to existing approaches like CogVideoX?
Thanks in advance!