Skip to content

cpp-qn/MiniCPM-V-CookBook

 
 

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

37 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

🍳 MiniCPM-V & o Cookbook

🏠 Main Repository | 📚 Full Documentation

Cook up amazing multimodal AI applications effortlessly with MiniCPM-o, bringing vision, speech, and live-streaming capabilities right to your fingertips!

✨ What Makes Our Recipes Special?

Easy Usage Documentation

Our comprehensive documentation website presents every recipe in a clear, well-organized manner. All features are displayed at a glance, making it easy for you to quickly find exactly what you need.

Broad User Spectrum

We support a wide range of users, from individuals to enterprises and researchers.

  • Individuals: Enjoy effortless inference using Ollama and Llama.cpp with minimal setup.
  • Enterprises: Achieve high-throughput, scalable performance with vLLM and SGLang.
  • Researchers: Leverage advanced frameworks including Transformers , LLaMA-Factory, SWIFT, and Align-anything to enable flexible model development and cutting-edge experimentation.

Versatile Deployment Scenarios

Our ecosystem delivers optimal solution for a variety of hardware environments and deployment demands.

  • Web demo: Launch interactive multimodal AI web demo with FastAPI.
  • Quantized deployment: Maximize efficiency and minimize resource consumption using GGUF, BNB, and AWQ.
  • Edge devices: Bring powerful AI experiences to iPhone and iPad, supporting offline and privacy-sensitive applications.

⭐️ Live Demonstrations

Explore real-world examples of MiniCPM-V deployed on edge devices using our curated recipes. These demos highlight the model’s high efficiency and robust performance in practical scenarios.

    

  • Run locally on iPad with iOS demo, observing the process of drawing a rabbit.

ipad_case.mp4

🔥 Inference Recipes

Ready-to-run examples

Recipe Description
Vision Capabilities
🖼️ Single-image QA Question answering on a single image
🧩 Multi-image QA Question answering with multiple images
🎬 Video QA Video-based question answering
📄 Document Parser Parse and extract content from PDFs and webpages
📝 Text Recognition Reliable OCR for photos and screenshots
Audio Capabilities
🎤 Speech-to-Text Multilingual speech recognition
🗣️ Text-to-Speech Instruction-following speech synthesis
🎭 Voice Cloning Realistic voice cloning and role-play

🏋️ Fine-tuning Recipes

Customize your model with your own ingredients

Data preparation

Follow the guidance to set up your training datasets.

Training

We provide training methods serving different needs as following:

Framework Description
Transformers Most flexible for customization
LLaMA-Factory Modular fine-tuning toolkit
SWIFT Lightweight and fast parameter-efficient tuning
Align-anything Visual instruction alignment for multimodal models

📦 Serving Recipes

Deploy your model efficiently

Method Description
vLLM High-throughput GPU inference
SGLang High-throughput GPU inference
Llama.cpp Fast CPU inference on PC, iPhone and iPad
Ollama User-friendly setup
OpenWebUI Interactive Web demo with Open WebUI
FastAPI Interactive Omni Streaming demo with FastAPI
iOS Interactive iOS demo with llama.cpp

🥄 Quantization Recipes

Compress your model to improve efficiency

Format Key Feature
GGUF Simplest and most portable format
BNB Simple and easy-to-use quantization method
AWQ High-performance quantization for efficient inference

Awesome Works using MiniCPM-V & o

  • text-extract-api: Document extraction API using OCRs and Ollama supported models GitHub Repo stars
  • comfyui_LLM_party: Build LLM workflows and integrate into existing image workflows GitHub Repo stars
  • Ollama-OCR: OCR package uses vlms through Ollama to extract text from images and PDF GitHub Repo stars
  • comfyui-mixlab-nodes: ComfyUI node suite supports Workflow-to-APP、GPT&3D and more GitHub Repo stars
  • OpenAvatarChat: Interactive digital human conversation implementation on single PC GitHub Repo stars
  • pensieve: A privacy-focused passive recording project by recording screen content GitHub Repo stars
  • paperless-gpt: Use LLMs to handle paperless-ngx, AI-powered titles, tags and OCR GitHub Repo stars
  • Neuro: A recreation of Neuro-Sama, but running on local models on consumer hardware GitHub Repo stars

👥 Community

Contributing

We love new recipes! Please share your creative dishes:

  1. Fork the repository
  2. Create your recipe
  3. Submit a pull request

Issues & Support

Institutions

This cookbook is developed by OpenBMB and OpenSQZ.

📜 License

This cookbook is served under the Apache-2.0 License - cook freely, share generously! 🍳

About

Cook up amazing multimodal AI applications effortlessly with MiniCPM-o

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages

  • Python 91.7%
  • Shell 8.3%