# Multimodal Support in llama.cpp

This directory provides multimodal capabilities for `llama.cpp`. Initially intended as a showcase for running LLaVA models, its scope has expanded significantly over time to include various other vision-capable models. As a result, LLaVA is no longer the only multimodal architecture supported.

> [!IMPORTANT]
>
> Multimodal support can be viewed as a sub-project within `llama.cpp`. It is under **very heavy development**, and **breaking changes are expected**.

The naming and structure related to multimodal support have evolved, which might cause some confusion. Here's a brief timeline to clarify:

- [#3436](https://github.com/ggml-org/llama.cpp/pull/3436): Initial support for LLaVA 1.5 was added, introducing `llava.cpp` and `clip.cpp`. The `llava-cli` binary was created for model interaction.
- [#4954](https://github.com/ggml-org/llama.cpp/pull/4954): Support for MobileVLM was added, making it the second supported vision model. This built upon the existing `llava.cpp`, `clip.cpp`, and `llava-cli` infrastructure.
- **Expansion & Fragmentation:** Many new models were subsequently added (e.g., [#7599](https://github.com/ggml-org/llama.cpp/pull/7599), [#10361](https://github.com/ggml-org/llama.cpp/pull/10361), [#12344](https://github.com/ggml-org/llama.cpp/pull/12344), and others). However, `llava-cli` lacked support for the increasingly complex chat templates required by these models. This led to the creation of model-specific binaries like `qwen2vl-cli`, `minicpmv-cli`, and `gemma3-cli`. While functional, this growing collection of command-line tools became confusing for users.
- [#12849](https://github.com/ggml-org/llama.cpp/pull/12849): `libmtmd` was introduced as a replacement for `llava.cpp`. Its goals include providing a single, unified command-line interface, improving the user/developer experience (UX/DX), and supporting both audio and image inputs.
- [#13012](https://github.com/ggml-org/llama.cpp/pull/13012): `mtmd-cli` was added, consolidating the various model-specific CLIs into a single tool powered by `libmtmd`.

## Pre-quantized models

See the list of pre-quantized models [here](../../docs/multimodal.md).

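As an illustration, a pre-quantized model can usually be run directly from Hugging Face with the `-hf` flag; recent builds will also fetch the matching `mmproj` file automatically. The repository name below is only an example, and the exact flag behavior may differ between versions:

```sh
# Fetch a pre-quantized model (and its mmproj) from Hugging Face, then describe an image.
# The repository name is illustrative; pick one from the pre-quantized model list.
llama-mtmd-cli -hf ggml-org/gemma-3-4b-it-GGUF \
    --image path/to/image.jpg \
    -p "Describe this image in detail."
```
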
## How it works and what is `mmproj`?

Multimodal support in `llama.cpp` works by encoding images into embeddings using a separate model component, and then feeding these embeddings into the language model.

This approach keeps the multimodal components distinct from the core `libllama` library. Separating these allows for faster, independent development cycles. While many modern vision models are based on Vision Transformers (ViTs), their specific pre-processing and projection steps can vary significantly. Integrating this diverse complexity directly into `libllama` is currently challenging.

Consequently, running a multimodal model typically requires two GGUF files:
1. The standard language model file.
2. A corresponding **multimodal projector (`mmproj`)** file, which handles the image encoding and projection.

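As a sketch of what this looks like in practice (the file names below are placeholders), the language model is passed with `-m` and the projector with `--mmproj`:

```sh
# Run a multimodal model from two local GGUF files:
#   model-f16.gguf        - the language model
#   mmproj-model-f16.gguf - the matching multimodal projector
llama-mtmd-cli -m model-f16.gguf --mmproj mmproj-model-f16.gguf \
    --image path/to/image.jpg \
    -p "What is in this picture?"
```

The same pair of files can also be passed to `llama-server` via the same `-m`/`--mmproj` flags.
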
## What is `libmtmd`?

As outlined in the history, `libmtmd` is the modern library designed to replace the original `llava.cpp` implementation for handling multimodal inputs.

Built upon `clip.cpp` (similar to `llava.cpp`), `libmtmd` offers several advantages:
- **Unified Interface:** Aims to consolidate interaction for various multimodal models.
- **Improved UX/DX:** Features a more intuitive API, inspired by the `Processor` class in the Hugging Face `transformers` library.
- **Flexibility:** Designed to support multiple input types (text, audio, images) while respecting the wide variety of chat templates used by different models.

## How to obtain `mmproj`

Multimodal projector (`mmproj`) files are specific to each model architecture.

For the following models, you can use `convert_hf_to_gguf.py` with the `--mmproj` flag to get the `mmproj` file (see the conversion sketch after this list):
- [Gemma 3](https://huggingface.co/collections/google/gemma-3-release-67c6c6f89c4f76621268bb6d); see the guide [here](../../docs/multimodal/gemma3.md). Note: the 1B variant does not have vision support.
- SmolVLM (from [HuggingFaceTB](https://huggingface.co/HuggingFaceTB))
- SmolVLM2 (from [HuggingFaceTB](https://huggingface.co/HuggingFaceTB))
- [Pixtral 12B](https://huggingface.co/mistral-community/pixtral-12b) - only works with a `transformers`-compatible checkpoint
- Qwen 2 VL and Qwen 2.5 VL (from [Qwen](https://huggingface.co/Qwen))
- [Mistral Small 3.1 24B](https://huggingface.co/mistralai/Mistral-Small-3.1-24B-Instruct-2503)
- InternVL 2.5 and InternVL 3 from [OpenGVLab](https://huggingface.co/OpenGVLab) (note: conversion of `InternVL3-*-hf` models is not supported, only the non-HF version; the `InternLM2Model` **text** model is not supported)

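A minimal conversion sketch (paths and output file names are placeholders) for one of the models listed above; the script is run once for the language model and once more with `--mmproj` for the projector:

```sh
# 1) Convert the language model to GGUF as usual
python convert_hf_to_gguf.py /path/to/hf-model --outfile model-f16.gguf --outtype f16

# 2) Run the script again with --mmproj to produce the projector file
python convert_hf_to_gguf.py /path/to/hf-model --mmproj --outfile mmproj-model-f16.gguf
```
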
For older models, please refer to the relevant guide for instructions on how to obtain or create them:

NOTE: conversion scripts are located under `tools/mtmd/legacy-models`

- [LLaVA](../../docs/multimodal/llava.md)
- [MobileVLM](../../docs/multimodal/MobileVLM.md)
- [GLM-Edge](../../docs/multimodal/glmedge.md)
- [MiniCPM-V 2.5](../../docs/multimodal/minicpmv2.5.md)
- [MiniCPM-V 2.6](../../docs/multimodal/minicpmv2.6.md)
- [MiniCPM-o 2.6](../../docs/multimodal/minicpmo2.6.md)
- [IBM Granite Vision](../../docs/multimodal/granitevision.md)