Re-Implement MAEToK #40

@shahbuland

Description

https://arxiv.org/abs/2502.03444

This paper is very promising. They find that semantically rich autoencoders actually make better tokenizers for diffusion. They get near-lossless reconstructions of 512x512 images using 128 latent tokens. We should be able to get somewhat-lossy-but-better-than-what-we-currently-have results with 64 tokens, or maybe even 32. The goal is a lossy VAE with higher compression and 1D latents, followed by finetuning a diffusion decoder to bring reconstruction quality back to baseline. For the purposes of this issue:
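For intuition on the token counts above, here is the rough spatial-compression arithmetic: how many pixels each latent token has to summarize at 128, 64, and 32 tokens for a 512x512 image (a back-of-the-envelope sketch, not a statement about the model's actual information rate):

```python
# How many pixels each latent token must summarize, per token count.
def pixels_per_token(height: int, width: int, n_tokens: int) -> int:
    return (height * width) // n_tokens

for n in (128, 64, 32):
    print(n, pixels_per_token(512, 512, n))
# 128 -> 2048 pixels/token, 64 -> 4096, 32 -> 8192
```

So halving the token count from the paper's 128 doubles the load per token, which is why some reconstruction loss is expected before the diffusion-decoder finetune.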

  1. Implement MAEToK on separate branch (under owl_vaes/models/maetok.py, be sure to register in owl_vaes/models/__init__.py)
  2. Test it with a basic dataset like MNIST
  3. Scale with s3 video game data on [360,640] frames (using landscape-to-square projections as in DCAE)
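For step 3, one simple reading of "landscape-to-square projection" is resampling the full [360,640] frame down to 360x360; the sketch below does this with nearest-neighbor indexing in NumPy. This is a hypothetical placeholder, since the exact projection used in DCAE may differ (e.g. it might crop rather than squeeze):

```python
import numpy as np

def landscape_to_square(frame: np.ndarray, size: int = 360) -> np.ndarray:
    """Map an (H, W, C) landscape frame to a (size, size, C) square.

    Hypothetical sketch: nearest-neighbor resample of the whole frame,
    which squeezes the wide axis instead of cropping it.
    """
    h, w, _ = frame.shape
    ys = np.arange(size) * h // size  # source row per output row
    xs = np.arange(size) * w // size  # source col per output col
    return frame[np.ix_(ys, xs)]

frame = np.zeros((360, 640, 3), dtype=np.uint8)
square = landscape_to_square(frame)
print(square.shape)  # (360, 360, 3)
```

Whatever projection is chosen, it should be invertible enough (or at least consistent) that the diffusion decoder finetune in the later stage sees the same geometry.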

For 1, you might need to implement a new trainer. For reference, there is a GAN-enabled trainer under owl_vaes/trainers/distill_dec.py
