
Commit 5fe8d72

update docs and hot topics
1 parent f5fbc03 commit 5fe8d72

File tree

4 files changed: +84 additions, 33 deletions

- README.md
- docs/multimodal.md
- tools/mtmd/README.md
- tools/server/README.md

README.md

Lines changed: 2 additions & 1 deletion
@@ -16,8 +16,9 @@ Inference of Meta's [LLaMA](https://arxiv.org/abs/2302.13971) model (and others)

 ## Hot topics

+- 🔥 Multimodal support arrived in `llama-server`: [#12898](https://github.com/ggml-org/llama.cpp/pull/12898) | [documentation](./docs/multimodal.md)
 - **GGML developer experience survey (organized and reviewed by NVIDIA):** [link](https://forms.gle/Gasw3cRgyhNEnrwK9)
-- A new binary `llama-mtmd-cli` is introduced to replace `llava-cli`, `minicpmv-cli`, `gemma3-cli` ([#13012](https://github.com/ggml-org/llama.cpp/pull/13012)) and `qwen2vl-cli` ([#13141]((https://github.com/ggml-org/llama.cpp/pull/13141))), `libllava` will be deprecated
+- A new binary `llama-mtmd-cli` is introduced to replace `llava-cli`, `minicpmv-cli`, `gemma3-cli` ([#13012](https://github.com/ggml-org/llama.cpp/pull/13012)) and `qwen2vl-cli` ([#13141](https://github.com/ggml-org/llama.cpp/pull/13141)), `libllava` will be deprecated
 - VS Code extension for FIM completions: https://github.com/ggml-org/llama.vscode
 - Universal [tool call support](./docs/function-calling.md) in `llama-server` https://github.com/ggml-org/llama.cpp/pull/9639
 - Vim/Neovim plugin for FIM completions: https://github.com/ggml-org/llama.vim

docs/multimodal.md

Lines changed: 69 additions & 0 deletions
@@ -0,0 +1,69 @@
+# Multimodal
+
+llama.cpp supports multimodal input via `libmtmd`. Currently, there are 2 tools that support this feature:
+- [llama-mtmd-cli](../tools/mtmd/README.md)
+- [llama-server](../tools/server/README.md) via the OpenAI-compatible `/chat/completions` API
+
+To enable it, use one of the 2 methods below:
+
+- Use the `-hf` option with a [supported model](../../docs/multimodal.md)
+  - To load a model using `-hf` while disabling multimodal, use `--no-mmproj`
+  - To load a model using `-hf` while using a custom mmproj file, use `--mmproj local_file.gguf`
+- Use the `-m model.gguf` option with `--mmproj file.gguf` to specify the text model and the multimodal projector, respectively
+
+By default, the multimodal projector will be offloaded to the GPU. To disable this, add `--no-mmproj-offload`
+
+For example:
+
+```sh
+# simple usage with CLI
+llama-mtmd-cli -hf ggml-org/gemma-3-4b-it-GGUF
+
+# simple usage with server
+llama-server -hf ggml-org/gemma-3-4b-it-GGUF
+
+# using local file
+llama-server -m gemma-3-4b-it-Q4_K_M.gguf --mmproj mmproj-gemma-3-4b-it-Q4_K_M.gguf
+
+# no GPU offload
+llama-server -hf ggml-org/gemma-3-4b-it-GGUF --no-mmproj-offload
+```
+
+## Pre-quantized models
+
+These are ready-to-use models; most of them come with `Q4_K_M` quantization by default.
+
+Replace `(tool_name)` with the name of the binary you want to use, for example `llama-mtmd-cli` or `llama-server`.
+
+NOTE: some models may require a large context window, for example: `-c 8192`
+
+```sh
+# Gemma 3
+(tool_name) -hf ggml-org/gemma-3-4b-it-GGUF
+(tool_name) -hf ggml-org/gemma-3-12b-it-GGUF
+(tool_name) -hf ggml-org/gemma-3-27b-it-GGUF
+
+# SmolVLM
+(tool_name) -hf ggml-org/SmolVLM-Instruct-GGUF
+(tool_name) -hf ggml-org/SmolVLM-256M-Instruct-GGUF
+(tool_name) -hf ggml-org/SmolVLM-500M-Instruct-GGUF
+(tool_name) -hf ggml-org/SmolVLM2-2.2B-Instruct-GGUF
+(tool_name) -hf ggml-org/SmolVLM2-256M-Video-Instruct-GGUF
+(tool_name) -hf ggml-org/SmolVLM2-500M-Video-Instruct-GGUF
+
+# Pixtral 12B
+(tool_name) -hf ggml-org/pixtral-12b-GGUF
+
+# Qwen 2 VL
+(tool_name) -hf ggml-org/Qwen2-VL-2B-Instruct-GGUF
+(tool_name) -hf ggml-org/Qwen2-VL-7B-Instruct-GGUF
+
+# Qwen 2.5 VL
+(tool_name) -hf ggml-org/Qwen2.5-VL-3B-Instruct-GGUF
+(tool_name) -hf ggml-org/Qwen2.5-VL-7B-Instruct-GGUF
+(tool_name) -hf ggml-org/Qwen2.5-VL-32B-Instruct-GGUF
+(tool_name) -hf ggml-org/Qwen2.5-VL-72B-Instruct-GGUF
+
+# Mistral Small 3.1 24B (IQ2_M quantization)
+(tool_name) -hf ggml-org/Mistral-Small-3.1-24B-Instruct-2503-GGUF
+```
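
Once `llama-server` is running with one of the models above, images can be sent through its OpenAI-compatible chat API. The following is a minimal illustrative sketch, assuming the server listens on the default `http://localhost:8080` and using a placeholder image URL:

```sh
# in one terminal: start the server with a multimodal model
llama-server -hf ggml-org/gemma-3-4b-it-GGUF

# in another terminal: ask the model about a remote image
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [
      {
        "role": "user",
        "content": [
          {"type": "text", "text": "Describe this image in one sentence."},
          {"type": "image_url", "image_url": {"url": "https://example.com/cat.jpg"}}
        ]
      }
    ]
  }'
```

The response follows the usual OpenAI chat completion format; see the `llama-server` README for endpoint details.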

tools/mtmd/README.md

Lines changed: 1 addition & 32 deletions
@@ -16,38 +16,7 @@ The naming and structure related to multimodal support have evolved, which might

 ## Pre-quantized models

-These are ready-to-use models, most of them come with `Q4_K_M` quantization by default:
-
-```sh
-# Gemma 3
-llama-mtmd-cli -hf ggml-org/gemma-3-4b-it-GGUF
-llama-mtmd-cli -hf ggml-org/gemma-3-12b-it-GGUF
-llama-mtmd-cli -hf ggml-org/gemma-3-27b-it-GGUF
-
-# SmolVLM
-llama-mtmd-cli -hf ggml-org/SmolVLM-Instruct-GGUF
-llama-mtmd-cli -hf ggml-org/SmolVLM-256M-Instruct-GGUF
-llama-mtmd-cli -hf ggml-org/SmolVLM-500M-Instruct-GGUF
-llama-mtmd-cli -hf ggml-org/SmolVLM2-2.2B-Instruct-GGUF
-llama-mtmd-cli -hf ggml-org/SmolVLM2-256M-Video-Instruct-GGUF
-llama-mtmd-cli -hf ggml-org/SmolVLM2-500M-Video-Instruct-GGUF
-
-# Pixtral 12B
-llama-mtmd-cli -hf ggml-org/pixtral-12b-GGUF
-
-# Qwen 2 VL
-llama-mtmd-cli -hf ggml-org/Qwen2-VL-2B-Instruct-GGUF
-llama-mtmd-cli -hf ggml-org/Qwen2-VL-7B-Instruct-GGUF
-
-# Qwen 2.5 VL
-llama-mtmd-cli -hf ggml-org/Qwen2.5-VL-3B-Instruct-GGUF
-llama-mtmd-cli -hf ggml-org/Qwen2.5-VL-7B-Instruct-GGUF
-llama-mtmd-cli -hf ggml-org/Qwen2.5-VL-32B-Instruct-GGUF
-llama-mtmd-cli -hf ggml-org/Qwen2.5-VL-72B-Instruct-GGUF
-
-# Mistral Small 3.1 24B (IQ2_M quantization)
-llama-mtmd-cli -hf ggml-org/Mistral-Small-3.1-24B-Instruct-2503-GGUF
-```
+See the list of pre-quantized models [here](../../docs/multimodal.md)

 ## How it works and what is `mmproj`?

tools/server/README.md

Lines changed: 12 additions & 0 deletions
@@ -193,6 +193,12 @@ services:
       LLAMA_ARG_PORT: 8080
 ```

+### Multimodal support
+
+Multimodal support was added in [#12898](https://github.com/ggml-org/llama.cpp/pull/12898) and is currently an experimental feature.
+
+For more details, please refer to the [multimodal documentation](../../docs/multimodal.md)
+
 ## Build

 `llama-server` is built alongside everything else from the root of the project
@@ -749,6 +755,9 @@ This endpoint is public (no API key check). By default, it is read-only. To make
 "total_slots": 1,
 "model_path": "../models/Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf",
 "chat_template": "...",
+"modalities": {
+  "vision": false
+},
 "build_info": "b(build number)-(build commit hash)"
 }
 ```
@@ -757,6 +766,7 @@ This endpoint is public (no API key check). By default, it is read-only. To make
 - `total_slots` - the total number of slots for process requests (defined by `--parallel` option)
 - `model_path` - the path to model file (same with `-m` argument)
 - `chat_template` - the model's original Jinja2 prompt template
+- `modalities` - the list of supported modalities
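
To check from a client whether the loaded model actually has vision enabled, the `modalities` object returned by this endpoint can be inspected. A minimal sketch, assuming the server listens on the default `http://localhost:8080` and `jq` is installed:

```sh
# print the supported modalities reported by the server, e.g. { "vision": true }
curl -s http://localhost:8080/props | jq '.modalities'
```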

 ### POST `/props`: Change server global properties.

@@ -1069,6 +1079,8 @@ print(completion.choices[0].text)

 Given a ChatML-formatted json description in `messages`, it returns the predicted completion. Both synchronous and streaming mode are supported, so scripted and interactive applications work fine. While no strong claims of compatibility with OpenAI API spec is being made, in our experience it suffices to support many apps. Only models with a [supported chat template](https://github.com/ggml-org/llama.cpp/wiki/Templates-supported-by-llama_chat_apply_template) can be used optimally with this endpoint. By default, the ChatML template will be used.

+If the model supports multimodal input, you can pass the media file via the `image_url` content part. We support both base64 and remote URLs as input. See the OpenAI documentation for more details.
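
For a local image file, a common pattern with OpenAI-compatible servers is to embed the file as a base64 data URI inside the `image_url` part. A minimal sketch, assuming a server on the default `http://localhost:8080` and a hypothetical `photo.jpg`; the exact base64 handling may differ, so treat this as illustrative:

```sh
# encode a local image and send it as a data URI in the image_url content part
IMG_B64=$(base64 -w0 photo.jpg)   # GNU coreutils; on macOS use: base64 -i photo.jpg
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d "{
    \"messages\": [{
      \"role\": \"user\",
      \"content\": [
        {\"type\": \"text\", \"text\": \"What is in this picture?\"},
        {\"type\": \"image_url\", \"image_url\": {\"url\": \"data:image/jpeg;base64,${IMG_B64}\"}}
      ]
    }]
  }"
```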
+
 *Options:*

 See [OpenAI Chat Completions API documentation](https://platform.openai.com/docs/api-reference/chat). llama.cpp `/completion`-specific features such as `mirostat` are also supported.

0 commit comments
