Question about vision generation under the "vision-as-target" paradigm

Great work!

Since the model learns to predict visual tokens autoregressively during training, I was wondering: have you tried using it for actual vision generation tasks, such as generating or reconstructing images conditioned on text or partial visual context?

Thanks for sharing this interesting work.