
What's the improvement in model #9

@Hickey8


Hi! Thanks for your work.
I've read your paper and I'm a bit confused about the claimed contribution of relying solely on the self-attention mechanism. From my understanding, this architecture appears to be just the backbone used in CogVideoX. Additionally, using a VAE and concatenating image latents seems to be common practice.

I might be missing something, but could you clarify what the main improvements or novelties are compared to existing approaches like CogVideoX?

Thanks in advance!
