Hi! Thanks for your work.
I've read your paper, and I'm a bit confused about the claimed contribution of relying solely on the self-attention mechanism. From my understanding, this architecture appears to be just the backbone used in CogVideoX. Additionally, using a VAE and concatenating image latents seems to be common practice.
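For context, the "common practice" I mean is the usual image-to-video conditioning pattern: encode the conditioning frame with the VAE, broadcast it across time, and concatenate it channel-wise with the noisy video latent before the transformer. A minimal sketch (all shapes and names here are hypothetical, not taken from either paper):

```python
import numpy as np

# Hypothetical shapes, for illustration only: a generic sketch of the common
# image-to-video conditioning pattern, not any specific model's actual code.
B, C, T, H, W = 1, 16, 8, 30, 45  # batch, latent channels, frames, height, width

noisy_video_latent = np.random.randn(B, C, T, H, W).astype(np.float32)
image_latent = np.random.randn(B, C, 1, H, W).astype(np.float32)  # VAE-encoded first frame

# Broadcast the image latent across time (zero-padding the other frames is
# another common choice), then concatenate along the channel axis so the
# diffusion transformer receives 2*C input channels.
image_cond = np.repeat(image_latent, T, axis=2)
model_input = np.concatenate([noisy_video_latent, image_cond], axis=1)

print(model_input.shape)  # (1, 32, 8, 30, 45)
```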
I might be missing something, but could you clarify what the main improvements or novelties are compared to existing approaches like CogVideoX?
Thanks in advance!