Great work!
Since the model learns to predict visual tokens autoregressively during training, I was wondering: have you tried using it for actual visual generation tasks? For example, generating images from text, or reconstructing them from partial visual context?
Thanks for sharing this interesting work.