Description
Prerequisites
- I am running the latest code. Mention the version if possible as well.
- I carefully followed the README.md.
- I searched using keywords relevant to my issue to make sure that I am creating a new issue that is not already open (or closed).
- I reviewed the Discussions, and have a new and useful enhancement to share.
Feature Description
Add support in `ik_llama.cpp` for vision and multi-modality, allowing models that process both images and text (vision-language models, VLMs) to run natively. This would involve:
- Loading and handling image inputs (e.g. file paths, image tensors, base64-encoded data, or URLs) alongside textual prompts (a rough sketch of one possible input shape follows this list).
- Converting image inputs into embeddings (via a vision encoder / projector) to feed into the language model.
- Supporting `.mmproj` (or equivalent) multimodal projector metadata / files where necessary, to enable models trained with separate projector components.
- Enabling multimodal inference in both CLI mode and server mode (or whatever interfaces `ik_llama.cpp` offers), so users can mix text + images in prompts.
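To make the first bullet concrete, here is a minimal sketch of what an image-input structure could look like on the `ik_llama.cpp` side. This is purely illustrative: none of these types exist today, the marker token is arbitrary, and the real design would follow whatever `libmtmd` (or the existing prompt pipeline) dictates.

```cpp
// Hypothetical sketch only -- none of these types exist in ik_llama.cpp today.
// It illustrates one way image inputs could be attached to a text prompt.
#include <cstdint>
#include <string>
#include <variant>
#include <vector>

// An image may arrive as a file path, a base64 string, or raw RGB pixels.
struct image_file   { std::string path; };
struct image_base64 { std::string data; };   // e.g. taken from a data: URL
struct image_pixels { int width; int height; std::vector<uint8_t> rgb; };

using image_input = std::variant<image_file, image_base64, image_pixels>;

// A multimodal prompt: text plus zero or more images. The text could reference
// each image with a placeholder marker such as "<__image__>" (marker is arbitrary).
struct mm_prompt {
    std::string              text;
    std::vector<image_input> images;
};
```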
Motivation
Several reasons why this would be valuable:
- Parity with recent advances: the `llama.cpp` library has recently added vision / multi-modality support via @ngxson's work (especially the `libmtmd` library), `.mmproj` support, a unified multimodal CLI tool (`llama-mtmd-cli`), and experimental integration into `llama-server`.
- New model families demand support:
  - The Qwen series has released VLM / vision-language variants (e.g. Qwen2-VL, Qwen2.5-VL, Qwen3-VL) that are increasingly used and requested.
  - Other released models such as Gemma3 Vision or InternVL3_5 are prominent; users are already asking for `ik_llama.cpp` to support these architectures (Bug: Gemma3 Vision not working #615, Feature Request: add support for vision model InternVL3_5 #730).
- Ecosystem growth: without image capability, `ik_llama.cpp` misses out on many use cases (image captioning, visual question answering, OCR, etc.) that people increasingly expect from VLM frameworks.
- Strategic value: supporting vision will broaden the user base, help keep `ik_llama.cpp` competitive, and reduce fragmentation (i.e. people would otherwise need to switch to `llama.cpp` or other tools for vision tasks).
Possible Implementation
Here are possible ways this could be done, drawing on existing work (especially @ngxson's llama.cpp PR #12898) as a reference, along with suggestions specific to `ik_llama.cpp`.
| Step | Description |
|---|---|
| Investigate the existing llama.cpp implementation | Review how `libmtmd` works: how image-projector / `.mmproj` files are structured, how image input is parsed, how vision embedding is done, and how multimodal tokens are merged with text tokens. Study the `llama-mtmd-cli` and server changes in llama.cpp. |
| Define an image input API for ik_llama.cpp | Decide how users will provide images (file paths, base64 data, URLs, etc.) and integrate that into the existing prompt format (cf. the input sketch above). Possibly extend the CLI and server interfaces. |
| Implement vision encoder / projector support | Either embed a compatible projector (e.g. via a shared library, or by porting `libmtmd` or an equivalent) or support reading `.mmproj` files so that VLMs trained with separate projectors can run. Must handle image resizing, patch embedding, positional encoding (e.g. techniques like M-RoPE used in some models), etc. |
| Modify model loading / metadata | Extend the model loader to recognize when a model is multimodal (e.g. via a flag or the presence of projector metadata) so that the model is configured appropriately; see the loader sketch after this table. Possibly include conversion or checking tools. |
| Inference integration | Once image embeddings are ready, integrate them into the transformer stack alongside text: handle batching, merging modalities, maintaining context, etc. Add support to `ik_llama.cpp`'s inference engine. |
| Interface / server support | Extend `ik_llama.cpp`'s server mode to accept HTTP / API requests that include image payloads, as llama.cpp's server does with `--mmproj` + model + appropriate request schemas; see the request-parsing sketch after this table. |
| Backward compatibility & fallbacks | Ensure that text-only models still work with minimal overhead. Possibly make the vision dependency optional. Handle missing projector files gracefully. |
| Testing & examples | Provide example VLMs (e.g. Qwen2-VL, Qwen2.5-VL, Qwen3-VL, Gemma3 Vision) and example prompts + images. Write unit/integration tests to ensure correctness on vision tasks (e.g. captioning). |
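For the "Modify model loading / metadata" step, a loader-side check might look roughly like the sketch below. `load_gguf_metadata()` and the `clip.has_vision_encoder` key are assumptions used for illustration, not existing `ik_llama.cpp` APIs; the point is only that the presence and contents of an `.mmproj` file decide whether the multimodal path is enabled, with a graceful text-only fallback.

```cpp
// Hypothetical sketch: decide whether a companion .mmproj file enables the
// multimodal path. load_gguf_metadata() and the metadata key name below are
// assumptions for illustration, not existing ik_llama.cpp or GGUF APIs.
#include <map>
#include <optional>
#include <string>

// Assumed helper that returns GGUF key/value metadata as strings.
// Stubbed out here so the sketch is self-contained.
static std::optional<std::map<std::string, std::string>>
load_gguf_metadata(const std::string & /*path*/) {
    return std::nullopt; // a real implementation would parse the GGUF header
}

struct mm_model_info {
    bool        has_vision = false;
    std::string projector_path; // path to the .mmproj file, if any
};

static mm_model_info detect_multimodal(const std::string & mmproj_path) {
    mm_model_info info;
    if (mmproj_path.empty()) {
        return info; // text-only model: nothing else to do
    }
    auto meta = load_gguf_metadata(mmproj_path);
    if (!meta) {
        return info; // missing/corrupt projector file: fall back to text-only
    }
    // Key name is an assumption; real .mmproj files expose a similar flag.
    auto it = meta->find("clip.has_vision_encoder");
    info.has_vision     = (it != meta->end() && it->second == "true");
    info.projector_path = mmproj_path;
    return info;
}
```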
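For the "Interface / server support" step, one option is to mirror the OpenAI-style chat schema that llama.cpp's server exposes, where image content parts carry a URL or `data:` URI. The sketch below (using nlohmann::json, as llama.cpp's server does) shows how such a request body could be split into text and image inputs; the exact schema and field names here are assumptions, not a description of any existing `ik_llama.cpp` endpoint.

```cpp
// Hypothetical server-side sketch: extract text and image parts from an
// OpenAI-style chat request body. The schema is an assumption modelled on
// llama.cpp's server, not an existing ik_llama.cpp API.
#include <string>
#include <vector>
#include <nlohmann/json.hpp>

using json = nlohmann::json;

struct parsed_request {
    std::string              text;        // concatenated text parts
    std::vector<std::string> image_urls;  // data: URIs or http(s) URLs
};

static parsed_request parse_multimodal_request(const std::string & body) {
    parsed_request out;
    const json req = json::parse(body);
    for (const auto & msg : req.at("messages")) {
        const auto & content = msg.at("content");
        if (content.is_string()) {           // plain text-only message
            out.text += content.get<std::string>();
            continue;
        }
        for (const auto & part : content) {  // mixed text + image parts
            const std::string type = part.at("type").get<std::string>();
            if (type == "text") {
                out.text += part.at("text").get<std::string>();
            } else if (type == "image_url") {
                out.image_urls.push_back(
                    part.at("image_url").at("url").get<std::string>());
            }
        }
    }
    return out;
}
```

Reusing an OpenAI-compatible shape would let existing clients send images to `ik_llama.cpp` without modification, which is one reason llama.cpp's server took that route.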