https://arxiv.org/abs/2502.03444
This paper is very promising. They find that semantically rich autoencoders are actually better tokenizers for diffusion: they get near-lossless reconstructions on 512x512 images using 128 latent tokens. We should be able to get somewhat-lossy-but-better-than-what-we-currently-have results using 64 tokens, or maybe even 32. The goal is a lossy VAE with higher compression and 1D latents, then a finetuned diffusion decoder to bring reconstruction quality back to baseline levels. For the purpose of this issue:
- Implement MAETok on a separate branch (under `owl_vaes/models/maetok.py`; be sure to register it in `owl_vaes/models/__init__.py`)
- Test it with a basic dataset like MNIST
- Scale with S3 video game data on [360, 640] frames (using landscape-to-square projections as in DC-AE)
For the first item, you might need to implement a new trainer. For reference, there is a GAN-enabled trainer under `owl_vaes/trainers/distill_dec.py`.
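For the registration step, a minimal sketch of a register-on-import pattern. The names here (`MODEL_REGISTRY`, `register_model`, `build_model`, the `"maetok"` key, and the `n_latent_tokens` argument) are hypothetical illustrations — the actual registry API in `owl_vaes/models/__init__.py` may differ, so match whatever the existing models use:

```python
# Hypothetical registry sketch; the real owl_vaes registry may differ.
MODEL_REGISTRY = {}

def register_model(name):
    """Class decorator that records a model class under `name`."""
    def wrap(cls):
        MODEL_REGISTRY[name] = cls
        return cls
    return wrap

@register_model("maetok")
class MAETok:
    """Placeholder for the MAETok autoencoder (not the real model)."""
    def __init__(self, n_latent_tokens=64):
        self.n_latent_tokens = n_latent_tokens

def build_model(name, **kwargs):
    """Look up a registered model class by name and instantiate it."""
    return MODEL_REGISTRY[name](**kwargs)

model = build_model("maetok", n_latent_tokens=64)
```

The point of routing registration through `__init__.py` is that importing the package is enough to populate the registry, so configs can refer to models by string name.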
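Back-of-the-envelope compression math from the numbers above, to make the token-budget targets concrete (pure arithmetic, no claims about actual reconstruction quality):

```python
def pixels_per_token(h, w, n_tokens):
    """Spatial compression ratio: pixels represented per latent token."""
    return (h * w) / n_tokens

# Paper's setting: 512x512 images, 128 latent tokens.
print(pixels_per_token(512, 512, 128))  # 2048.0

# Proposed lossier settings at the same resolution.
print(pixels_per_token(512, 512, 64))   # 4096.0
print(pixels_per_token(512, 512, 32))   # 8192.0

# The [360, 640] game frames at a 64-token budget.
print(pixels_per_token(360, 640, 64))   # 3600.0
```

So 64 tokens at 512x512 doubles the paper's compression ratio and 32 tokens quadruples it, which is why some reconstruction loss is expected and a diffusion decoder is proposed to recover it.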