DCAE and SANA use attention layers with a Mix-FFN architecture once the latent becomes smaller than 16x16. This might be a good idea for us to try in order to cram more information into the latent. If you are implementing this, feel free to follow along with the SANA paper or the EfficientViT codebase, but make sure you don't actually use linear attention: as far as I can tell, flash attention is still faster in practice without custom kernels.

This would entail adding a new architecture/model, maybe called SANAVAE, as well as custom layers that might fit under owl_vaes/nn/sana.py.
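For reference, here is a minimal sketch of what such a block could look like (not the actual SANA/DC-AE implementation): full softmax self-attention via PyTorch's `F.scaled_dot_product_attention`, which can dispatch to the flash attention kernel without custom code, followed by a Mix-FFN (an MLP with a depthwise 3x3 conv between the two projections). All class names, hyperparameters, and the placement under `owl_vaes/nn/sana.py` are placeholders, not existing repo API.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class SelfAttention(nn.Module):
    """Standard multi-head softmax attention over flattened spatial tokens."""

    def __init__(self, dim: int, n_heads: int = 8):
        super().__init__()
        self.n_heads = n_heads
        self.qkv = nn.Linear(dim, dim * 3)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: [b, n, c] where n = h * w
        b, n, c = x.shape
        qkv = self.qkv(x).reshape(b, n, 3, self.n_heads, c // self.n_heads)
        q, k, v = qkv.permute(2, 0, 3, 1, 4)  # each [b, heads, n, head_dim]
        # Dispatches to the flash attention kernel when available; no custom kernels needed.
        out = F.scaled_dot_product_attention(q, k, v)
        out = out.transpose(1, 2).reshape(b, n, c)
        return self.proj(out)


class MixFFN(nn.Module):
    """MLP with a depthwise 3x3 conv between the two projections (SegFormer/SANA style)."""

    def __init__(self, dim: int, mult: int = 4):
        super().__init__()
        hidden = dim * mult
        self.fc1 = nn.Linear(dim, hidden)
        self.dwconv = nn.Conv2d(hidden, hidden, 3, padding=1, groups=hidden)
        self.fc2 = nn.Linear(hidden, dim)

    def forward(self, x: torch.Tensor, h: int, w: int) -> torch.Tensor:
        b, n, c = x.shape
        x = self.fc1(x)                              # [b, n, hidden]
        x = x.transpose(1, 2).reshape(b, -1, h, w)   # back to spatial for the depthwise conv
        x = self.dwconv(x)
        x = x.flatten(2).transpose(1, 2)             # [b, n, hidden]
        return self.fc2(F.gelu(x))


class SanaBlock(nn.Module):
    """Pre-norm attention + Mix-FFN block intended for small (<=16x16) feature maps."""

    def __init__(self, dim: int, n_heads: int = 8):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = SelfAttention(dim, n_heads)
        self.norm2 = nn.LayerNorm(dim)
        self.ffn = MixFFN(dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: [b, c, h, w] -> tokens -> back to [b, c, h, w]
        b, c, h, w = x.shape
        tokens = x.flatten(2).transpose(1, 2)
        tokens = tokens + self.attn(self.norm1(tokens))
        tokens = tokens + self.ffn(self.norm2(tokens), h, w)
        return tokens.transpose(1, 2).reshape(b, c, h, w)


if __name__ == "__main__":
    # e.g. an 8x8 feature map with 256 channels, as a stand-in for a small latent stage
    x = torch.randn(1, 256, 8, 8)
    print(SanaBlock(256)(x).shape)  # torch.Size([1, 256, 8, 8])
```

A SANAVAE encoder/decoder could keep plain conv stages at larger resolutions and only swap in blocks like this once the feature map reaches 16x16 or smaller, which keeps the quadratic attention cost negligible while still mixing information globally in the latent.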