MiniCPM-o is the latest series of end-side multimodal LLMs (MLLMs) upgraded from MiniCPM-V. The models can now take images, video, text, and audio as inputs and provide high-quality text and speech outputs in an end-to-end fashion. MiniCPM-o 4.0 is the latest and most capable model in the MiniCPM-o series. With a total of 4B parameters, this end-to-end model achieves performance comparable to GPT-4o-202405 in vision, speech, and multimodal live streaming, making it one of the most versatile and performant models in the open-source community. In the new voice mode, MiniCPM-o 4.0 supports bilingual real-time speech conversation with customizable voices, and also allows end-to-end voice cloning, role play, and more. Compared to MiniCPM-o-2.6, we enhanced the stability and naturalness of speech conversation through architecture improvements and improved data pipelines. It also advances MiniCPM-V-2.6's visual capabilities, such as strong OCR, trustworthy behavior, multilingual support, and video understanding.