
Convolutional attention layers #35

@shahbuland

Description


DCAE and SANA use attention layers with a Mix-FFN architecture once the latent becomes smaller than 16x16. This might be a good idea for us to try in order to cram more information into the latent. If you are implementing this, feel free to follow along with the SANA paper or the EfficientViT codebase, but make sure you don't actually use linear attention. As far as I can tell, flash attention is still faster in practice without custom kernels.
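A minimal sketch of what such a block could look like, assuming a standard PyTorch setup. The names `SanaAttention`, `MixFFN`, and `SanaBlock` are placeholders, not existing owl_vaes code. Attention goes through `F.scaled_dot_product_attention` so PyTorch can dispatch to flash attention kernels, and there is no linear attention:

```python
# Hedged sketch only: layer names and shapes are assumptions, not existing owl_vaes code.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SanaAttention(nn.Module):
    """Softmax multi-head self-attention over the flattened latent grid."""
    def __init__(self, dim: int, n_heads: int = 8):
        super().__init__()
        self.n_heads = n_heads
        self.qkv = nn.Linear(dim, dim * 3)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: [b, n, d] where n = h * w of the latent grid
        b, n, d = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        q, k, v = (t.view(b, n, self.n_heads, d // self.n_heads).transpose(1, 2) for t in (q, k, v))
        out = F.scaled_dot_product_attention(q, k, v)  # dispatches to flash/mem-efficient kernels when available
        out = out.transpose(1, 2).reshape(b, n, d)
        return self.proj(out)

class MixFFN(nn.Module):
    """Mix-FFN: pointwise conv -> depthwise 3x3 conv -> pointwise conv, as in SANA/EfficientViT."""
    def __init__(self, dim: int, mult: int = 4):
        super().__init__()
        hidden = dim * mult
        self.pw1 = nn.Conv2d(dim, hidden, 1)
        self.dw = nn.Conv2d(hidden, hidden, 3, padding=1, groups=hidden)
        self.pw2 = nn.Conv2d(hidden, dim, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: [b, d, h, w]
        return self.pw2(F.gelu(self.dw(F.gelu(self.pw1(x)))))

class SanaBlock(nn.Module):
    """Attention + Mix-FFN block intended for latents at or below 16x16."""
    def __init__(self, dim: int, n_heads: int = 8):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = SanaAttention(dim, n_heads)
        self.norm2 = nn.LayerNorm(dim)
        self.ffn = MixFFN(dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: [b, d, h, w]
        b, d, h, w = x.shape
        tokens = x.flatten(2).transpose(1, 2)            # [b, h*w, d]
        tokens = tokens + self.attn(self.norm1(tokens))  # softmax attention, not linear attention
        x = tokens.transpose(1, 2).view(b, d, h, w)
        # Mix-FFN stays on the 2D grid; LayerNorm is applied channel-wise via a flatten/transpose
        normed = self.norm2(x.flatten(2).transpose(1, 2)).transpose(1, 2).view(b, d, h, w)
        return x + self.ffn(normed)
```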

This would entail adding a new architecture/model, maybe called SANAVAE, as well as custom layers that could live under owl_vaes/nn/sana.py.
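Continuing the sketch above, a hypothetical SANAVAE could switch from conv residual blocks to these blocks once the feature map reaches 16x16 or smaller; the wiring below is illustrative only:

```python
# Hypothetical wiring, reusing SanaBlock from the sketch above; SANAVAE does not exist yet.
import torch
import torch.nn as nn

# Once the downsampling path reaches 16x16 or below, stack SanaBlocks in the bottleneck
# instead of (or in addition to) further conv residual blocks.
bottleneck = nn.Sequential(*[SanaBlock(dim=512, n_heads=8) for _ in range(4)])
latent = bottleneck(torch.randn(1, 512, 16, 16))  # -> [1, 512, 16, 16]
```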
