Re-Implement MAEToK #40

@shahbuland

Description

https://arxiv.org/abs/2502.03444

This paper is very promising. They find that semantically rich autoencoders actually make better tokenizers for diffusion. They get near-lossless reconstructions of 512x512 images using 128 latent tokens. We should be able to get somewhat-lossy-but-better-than-what-we-currently-have results with 64 tokens, or maybe even 32. The goal is a lossy VAE with higher compression and 1D latents, followed by finetuning a diffusion decoder to bring reconstruction quality back to baseline. For the purposes of this issue:
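For intuition on the token counts above, here is the rough spatial-compression arithmetic: how many pixels each latent token has to summarize at 128, 64, and 32 tokens for a 512x512 image (a back-of-the-envelope sketch, not a statement about the model's actual information rate):

```python
# How many pixels each latent token must summarize, per token count.
def pixels_per_token(height: int, width: int, n_tokens: int) -> int:
    return (height * width) // n_tokens

for n in (128, 64, 32):
    print(n, pixels_per_token(512, 512, n))
# 128 -> 2048 pixels/token, 64 -> 4096, 32 -> 8192
```

So halving the token count from the paper's 128 doubles the load per token, which is why some reconstruction loss is expected before the diffusion-decoder finetune.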

  1. Implement MAEToK on separate branch (under owl_vaes/models/maetok.py, be sure to register in owl_vaes/models/__init__.py)
  2. Test it with a basic dataset like MNIST
  3. Scale with s3 video game data on [360,640] frames (using landscape-to-square projections as in DCAE)
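For step 3, one simple reading of "landscape-to-square projection" is resampling the full [360,640] frame down to 360x360; the sketch below does this with nearest-neighbor indexing in NumPy. This is a hypothetical placeholder, since the exact projection used in DCAE may differ (e.g. it might crop rather than squeeze):

```python
import numpy as np

def landscape_to_square(frame: np.ndarray, size: int = 360) -> np.ndarray:
    """Map an (H, W, C) landscape frame to a (size, size, C) square.

    Hypothetical sketch: nearest-neighbor resample of the whole frame,
    which squeezes the wide axis instead of cropping it.
    """
    h, w, _ = frame.shape
    ys = np.arange(size) * h // size  # source row per output row
    xs = np.arange(size) * w // size  # source col per output col
    return frame[np.ix_(ys, xs)]

frame = np.zeros((360, 640, 3), dtype=np.uint8)
square = landscape_to_square(frame)
print(square.shape)  # (360, 360, 3)
```

Whatever projection is chosen, it should be invertible enough (or at least consistent) that the diffusion decoder finetune in the later stage sees the same geometry.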

For 1, you might need to implement a new trainer. For reference, there is a GAN-enabled trainer under owl_vaes/trainers/distill_dec.py
