BeingBeyond/Being-VL-0

From Pixels to Tokens: Byte-Pair Encoding on Quantized Visual Modalities

We propose a novel training paradigm for multimodal learning through our BPE Image Tokenizer, which applies the principles of text tokenization to visual data. Unlike conventional approaches, our method directly incorporates structural prior information into image tokens, enabling Transformer models to better learn across modalities. Our Being-VL-0 model demonstrates how this paradigm effectively bridges the gap between visual and textual representation learning.

For more details, please refer to our paper: From Pixels to Tokens: Byte-Pair Encoding on Quantized Visual Modalities (ICLR'25).
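
As a rough illustration of the idea described above, the sketch below applies text-style BPE merges to sequences of quantized image token IDs. It is not the Being-VL-0 implementation. It assumes each image has already been quantized (for example, by a VQ codebook) and flattened into a 1D list of integer IDs, and it ignores any 2D neighborhood structure. The names `count_pairs`, `merge_pair`, `train_bpe`, and `codebook_size` are hypothetical and chosen only for this example.

```python
from collections import Counter


def count_pairs(sequences):
    """Count adjacent token-ID pairs across all sequences."""
    counts = Counter()
    for seq in sequences:
        counts.update(zip(seq, seq[1:]))
    return counts


def merge_pair(seq, pair, new_id):
    """Replace each non-overlapping occurrence of `pair` in `seq` with `new_id`."""
    out = []
    i = 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(seq[i])
            i += 1
    return out


def train_bpe(sequences, codebook_size, num_merges):
    """Learn up to `num_merges` merges over quantized token sequences.

    Assumption: base IDs lie in [0, codebook_size), so merged tokens
    receive new IDs starting at `codebook_size`.
    """
    sequences = [list(s) for s in sequences]
    merges = {}
    next_id = codebook_size
    for _ in range(num_merges):
        counts = count_pairs(sequences)
        if not counts:
            break
        pair, freq = counts.most_common(1)[0]
        if freq < 2:
            # No pair repeats, so further merges would not compress anything.
            break
        merges[pair] = next_id
        sequences = [merge_pair(s, pair, next_id) for s in sequences]
        next_id += 1
    return merges, sequences


# Toy input: two "images", each already flattened into quantized token IDs.
toy_sequences = [[1, 2, 1, 2, 3], [1, 2, 3, 3]]
merges, encoded = train_bpe(toy_sequences, codebook_size=4, num_merges=2)
print(merges)
print(encoded)
```

Each learned merge extends the vocabulary with one token that stands for a frequently co-occurring pair of visual tokens, so the encoded sequences become shorter while carrying some co-occurrence structure in the token IDs themselves.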

News

  • [2025-06-26]: 🔥🔥 We have released Being-VL-0.5, which has been accepted to ICCV 2025! Check our paper here. The code has been migrated to the new repository: BeingBeyond/Being-VL-0.5.
  • [2025-01-23]: 🎉🎉 Being-VL-0 has been accepted to ICLR 2025! Check our paper here.
  • [2024-10-03]: We released Being-VL-0, the first version of the Being-VL series.

Code

We have refactored the codebase to be more modular and user-friendly. The new repository is available at BeingBeyond/Being-VL-0.5.

Citation

If you find our work useful, please consider citing us and giving our repository a star! 🌟🌟🌟

Being-VL-0.5

@inproceedings{zhang2025beingvl05,
  title={Unified Multimodal Understanding via Byte-Pair Visual Encoding},
  author={Zhang, Wanpeng and Feng, Yicheng and Luo, Hao and Li, Yijiang and Yue, Zihao and Zheng, Sipeng and Lu, Zongqing},
  booktitle={Proceedings of the IEEE/CVF International Conference on Computer Vision},
  year={2025}
}

Being-VL-0

@inproceedings{zhang2025beingvl0,
  title={From Pixels to Tokens: Byte-Pair Encoding on Quantized Visual Modalities},
  author={Zhang, Wanpeng and Xie, Zilong and Feng, Yicheng and Li, Yijiang and Xing, Xingrun and Zheng, Sipeng and Lu, Zongqing},
  booktitle={The Thirteenth International Conference on Learning Representations},
  year={2025},
  url={https://openreview.net/forum?id=3TnLGGHhNx}
}
