
Commit 8f1edcb

Merge commit 'dc39a5e7a84815a90fa0c515ed8927870cf858c9' into concedo_experimental
# Conflicts:
#   README.md
#   SECURITY.md
#   docs/multimodal/MobileVLM.md
#   examples/llava/CMakeLists.txt
#   examples/llava/README.md
#   examples/llava/android/adb_run.sh
#   ggml/CMakeLists.txt
#   ggml/src/CMakeLists.txt
#   ggml/src/ggml-cpu/CMakeLists.txt
#   ggml/src/ggml-sycl/ggml-sycl.cpp
#   ggml/src/ggml-sycl/rope.cpp
#   ggml/src/ggml-sycl/rope.hpp
2 parents: 3e8b84b + dc39a5e

26 files changed (+1787 −1450 lines)

common/arg.cpp

Lines changed: 2 additions & 3 deletions
@@ -977,14 +977,13 @@ static void common_params_print_completion(common_params_context & ctx_arg) {
         "llama-gritlm",
         "llama-imatrix",
         "llama-infill",
-        "llama-llava-cli",
+        "llama-mtmd-cli",
         "llama-llava-clip-quantize-cli",
         "llama-lookahead",
         "llama-lookup",
         "llama-lookup-create",
         "llama-lookup-merge",
         "llama-lookup-stats",
-        "llama-minicpmv-cli",
         "llama-parallel",
         "llama-passkey",
         "llama-perplexity",
@@ -2727,7 +2726,7 @@ common_params_context common_params_parser_init(common_params & params, llama_ex
         [](common_params & params, const std::string & value) {
             params.chat_template = value;
         }
-    ).set_examples({LLAMA_EXAMPLE_MAIN, LLAMA_EXAMPLE_SERVER}).set_env("LLAMA_ARG_CHAT_TEMPLATE"));
+    ).set_examples({LLAMA_EXAMPLE_MAIN, LLAMA_EXAMPLE_SERVER, LLAMA_EXAMPLE_LLAVA}).set_env("LLAMA_ARG_CHAT_TEMPLATE"));
     add_opt(common_arg(
         {"--chat-template-file"}, "JINJA_TEMPLATE_FILE",
         string_format(
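
The second hunk makes `--chat-template` available to the multimodal (LLAVA) example group as well. A minimal usage sketch, with hypothetical model paths (the flag combination mirrors the `llama-mtmd-cli` examples added elsewhere in this commit):

```bash
# Sketch only: model/projector paths are placeholders.
# After this change the multimodal CLI accepts --chat-template directly;
# it should also be settable via the LLAMA_ARG_CHAT_TEMPLATE environment
# variable registered by set_env() in the hunk above.
./build/bin/llama-mtmd-cli \
    -m ../llava-v1.5-7b/ggml-model-f16.gguf \
    --mmproj ../llava-v1.5-7b/mmproj-model-f16.gguf \
    --chat-template vicuna
```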

convert_hf_to_gguf.py

Lines changed: 425 additions & 271 deletions
Large diffs are not rendered by default.

convert_lora_to_gguf.py

Lines changed: 3 additions & 3 deletions
@@ -24,7 +24,7 @@
 import gguf

 # reuse model definitions from convert_hf_to_gguf.py
-from convert_hf_to_gguf import LazyTorchTensor, Model
+from convert_hf_to_gguf import LazyTorchTensor, ModelBase

 logger = logging.getLogger("lora-to-gguf")

@@ -340,11 +340,11 @@ def load_hparams_from_hf(hf_model_id: str) -> dict[str, Any]:
         sys.exit(1)
     else:
         logger.info(f"Loading base model: {dir_base_model.name}")
-        hparams = Model.load_hparams(dir_base_model)
+        hparams = ModelBase.load_hparams(dir_base_model)

     with torch.inference_mode():
         try:
-            model_class = Model.from_model_architecture(hparams["architectures"][0])
+            model_class = ModelBase.from_model_architecture(hparams["architectures"][0])
         except NotImplementedError:
             logger.error(f"Model {hparams['architectures'][0]} is not supported")
             sys.exit(1)
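
For orientation, a minimal sketch of the renamed call sites, assuming `ModelBase` keeps the same classmethods that `Model` exposed (inferred from this hunk only; the full `convert_hf_to_gguf.py` diff is not rendered above):

```python
# Sketch of the updated call-site pattern; the directory path is a placeholder.
from pathlib import Path

from convert_hf_to_gguf import ModelBase  # formerly `Model`

dir_base_model = Path("../base-model")                   # hypothetical base model directory
hparams = ModelBase.load_hparams(dir_base_model)         # read the base model's hparams
arch = hparams["architectures"][0]
model_class = ModelBase.from_model_architecture(arch)    # raises NotImplementedError if unsupported
```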

examples/llava/README-gemma3.md renamed to docs/multimodal/gemma3.md

Lines changed: 4 additions & 3 deletions
@@ -26,11 +26,12 @@ llama-gemma3-cli -hf ggml-org/gemma-3-27b-it-GGUF

 ## How to get mmproj.gguf?

+Simply to add `--mmproj` in when converting model via `convert_hf_to_gguf.py`:
+
 ```bash
 cd gemma-3-4b-it
-python ../llama.cpp/examples/llava/gemma3_convert_encoder_to_gguf.py .
-
-# output file is mmproj.gguf
+python ../llama.cpp/convert_hf_to_gguf.py --outfile model.gguf --outtype f16 --mmproj .
+# output file: mmproj-model.gguf
 ```

 ## How to run it?

examples/llava/README-glmedge.md renamed to docs/multimodal/glmedge.md

Lines changed: 3 additions & 3 deletions
@@ -3,12 +3,12 @@
 Currently this implementation supports [glm-edge-v-2b](https://huggingface.co/THUDM/glm-edge-v-2b) and [glm-edge-v-5b](https://huggingface.co/THUDM/glm-edge-v-5b).

 ## Usage
-Build with cmake or run `make llama-llava-cli` to build it.
+Build the `llama-mtmd-cli` binary.

-After building, run: `./llama-llava-cli` to see the usage. For example:
+After building, run: `./llama-mtmd-cli` to see the usage. For example:

 ```sh
-./llama-llava-cli -m model_path/ggml-model-f16.gguf --mmproj model_path/mmproj-model-f16.gguf --image img_path/image.jpg -p "<|system|>\n system prompt <image><|user|>\n prompt <|assistant|>\n"
+./llama-mtmd-cli -m model_path/ggml-model-f16.gguf --mmproj model_path/mmproj-model-f16.gguf
 ```

 **note**: A lower temperature like 0.1 is recommended for better quality. add `--temp 0.1` to the command to do so.

examples/llava/README-granitevision.md renamed to docs/multimodal/granitevision.md

Lines changed: 2 additions & 6 deletions
@@ -176,15 +176,11 @@ Note that currently you cannot quantize the visual encoder because granite visio


 ### 5. Running the Model in Llama cpp
-Build llama cpp normally; you should have a target binary named `llama-llava-cli`, which you can pass two binaries to. As an example, we pass the the llama.cpp banner.
+Build llama cpp normally; you should have a target binary named `llama-mtmd-cli`, which you can pass two binaries to. As an example, we pass the the llama.cpp banner.

 ```bash
-$ ./build/bin/llama-llava-cli -m $LLM_GGUF_PATH \
+$ ./build/bin/llama-mtmd-cli -m $LLM_GGUF_PATH \
     --mmproj $VISUAL_GGUF_PATH \
-    --image ./media/llama0-banner.png \
     -c 16384 \
-    -p "<|system|>\nA chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions.\n<|user|>\n\<image>\nWhat does the text in this image say?\n<|assistant|>\n" \
     --temp 0
 ```
-
-Sample output: `The text in the image reads "LLAMA C++ Can it run DOOM Llama?"`

docs/multimodal/llava.md

Lines changed: 143 additions & 0 deletions
# LLaVA

Currently this implementation supports [llava-v1.5](https://huggingface.co/liuhaotian/llava-v1.5-7b) variants,
as well as llava-1.6 [llava-v1.6](https://huggingface.co/collections/liuhaotian/llava-16-65b9e40155f60fd046a5ccf2) variants.

The pre-converted [7b](https://huggingface.co/mys/ggml_llava-v1.5-7b)
and [13b](https://huggingface.co/mys/ggml_llava-v1.5-13b)
models are available.
For llava-1.6 a variety of prepared gguf models are available as well [7b-34b](https://huggingface.co/cmp-nct/llava-1.6-gguf)

After API is confirmed, more models will be supported / uploaded.

## Usage
Build the `llama-mtmd-cli` binary.

After building, run: `./llama-mtmd-cli` to see the usage. For example:

```sh
./llama-mtmd-cli -m ../llava-v1.5-7b/ggml-model-f16.gguf \
    --mmproj ../llava-v1.5-7b/mmproj-model-f16.gguf \
    --chat-template vicuna
```

**note**: A lower temperature like 0.1 is recommended for better quality. add `--temp 0.1` to the command to do so.
**note**: For GPU offloading ensure to use the `-ngl` flag just like usual

## LLaVA 1.5

1. Clone a LLaVA and a CLIP model ([available options](https://github.com/haotian-liu/LLaVA/blob/main/docs/MODEL_ZOO.md)). For example:

```sh
git clone https://huggingface.co/liuhaotian/llava-v1.5-7b

git clone https://huggingface.co/openai/clip-vit-large-patch14-336
```

2. Install the required Python packages:

```sh
pip install -r examples/llava/requirements.txt
```

3. Use `llava_surgery.py` to split the LLaVA model to LLaMA and multimodel projector constituents:

```sh
python ./examples/llava/llava_surgery.py -m ../llava-v1.5-7b
```

4. Use `convert_image_encoder_to_gguf.py` to convert the LLaVA image encoder to GGUF:

```sh
python ./examples/llava/convert_image_encoder_to_gguf.py -m ../clip-vit-large-patch14-336 --llava-projector ../llava-v1.5-7b/llava.projector --output-dir ../llava-v1.5-7b
```

5. Use `examples/convert_legacy_llama.py` to convert the LLaMA part of LLaVA to GGUF:

```sh
python ./examples/convert_legacy_llama.py ../llava-v1.5-7b --skip-unknown
```

Now both the LLaMA part and the image encoder are in the `llava-v1.5-7b` directory.

## LLaVA 1.6 gguf conversion
1) First clone a LLaVA 1.6 model:
```console
git clone https://huggingface.co/liuhaotian/llava-v1.6-vicuna-7b
```

2) Install the required Python packages:

```sh
pip install -r examples/llava/requirements.txt
```

3) Use `llava_surgery_v2.py` which also supports llava-1.5 variants pytorch as well as safetensor models:
```console
python examples/llava/llava_surgery_v2.py -C -m ../llava-v1.6-vicuna-7b/
```
- you will find a llava.projector and a llava.clip file in your model directory

4) Copy the llava.clip file into a subdirectory (like vit), rename it to pytorch_model.bin and add a fitting vit configuration to the directory:
```console
mkdir vit
cp ../llava-v1.6-vicuna-7b/llava.clip vit/pytorch_model.bin
cp ../llava-v1.6-vicuna-7b/llava.projector vit/
curl -s -q https://huggingface.co/cmp-nct/llava-1.6-gguf/raw/main/config_vit.json -o vit/config.json
```

5) Create the visual gguf model:
```console
python ./examples/llava/convert_image_encoder_to_gguf.py -m vit --llava-projector vit/llava.projector --output-dir vit --clip-model-is-vision
```
- This is similar to llava-1.5, the difference is that we tell the encoder that we are working with the pure vision model part of CLIP

6) Then convert the model to gguf format:
```console
python ./examples/convert_legacy_llama.py ../llava-v1.6-vicuna-7b/ --skip-unknown
```

7) And finally we can run the llava cli using the 1.6 model version:
```console
./llama-mtmd-cli -m ../llava-v1.6-vicuna-7b/ggml-model-f16.gguf --mmproj vit/mmproj-model-f16.gguf
```

**note** llava-1.6 needs more context than llava-1.5, at least 3000 is needed (just run it at -c 4096)

**note** llava-1.6 greatly benefits from batched prompt processing (defaults work)

**note** if the language model in step `6)` is incompatible with the legacy conversion script, the easiest way handle the LLM model conversion is to load the model in transformers, and export only the LLM from the llava next model.

```python
import os
import transformers

model_path = ...
llm_export_path = ...

tokenizer = transformers.AutoTokenizer.from_pretrained(model_path)
model = transformers.AutoModelForImageTextToText.from_pretrained(model_path)

tokenizer.save_pretrained(llm_export_path)
model.language_model.save_pretrained(llm_export_path)
```

Then, you can convert the LLM using the `convert_hf_to_gguf.py` script, which handles more LLM architectures.

## Chat template

For llava-1.5 and llava-1.6, you need to use `vicuna` chat template. Simply add `--chat-template vicuna` to activate this template.

## How to know if you are running in llava-1.5 or llava-1.6 mode

When running llava-cli you will see a visual information right before the prompt is being processed:

**Llava-1.5:**
`encode_image_with_clip: image embedding created: 576 tokens`

**Llava-1.6 (anything above 576):**
`encode_image_with_clip: image embedding created: 2880 tokens`

Alternatively just pay notice to how many "tokens" have been used for your prompt, it will also show 1000+ tokens for llava-1.6

examples/llava/README-minicpmo2.6.md renamed to docs/multimodal/minicpmo2.6.md

Lines changed: 4 additions & 4 deletions
@@ -40,9 +40,9 @@ python ./convert_hf_to_gguf.py ../MiniCPM-o-2_6/model

 Inference on Linux or Mac
 ```bash
-# run f16 version
-./build/bin/llama-minicpmv-cli -m ../MiniCPM-o-2_6/model/ggml-model-f16.gguf --mmproj ../MiniCPM-o-2_6/mmproj-model-f16.gguf -c 4096 --temp 0.7 --top-p 0.8 --top-k 100 --repeat-penalty 1.05 --image xx.jpg -p "What is in the image?"
+# run in single-turn mode
+./build/bin/llama-mtmd-cli -m ../MiniCPM-o-2_6/model/ggml-model-f16.gguf --mmproj ../MiniCPM-o-2_6/mmproj-model-f16.gguf -c 4096 --temp 0.7 --top-p 0.8 --top-k 100 --repeat-penalty 1.05 --image xx.jpg -p "What is in the image?"

-# run quantized int4 version
-./build/bin/llama-minicpmv-cli -m ../MiniCPM-o-2_6/model/ggml-model-Q4_K_M.gguf --mmproj ../MiniCPM-o-2_6/mmproj-model-f16.gguf -c 4096 --temp 0.7 --top-p 0.8 --top-k 100 --repeat-penalty 1.05 --image xx.jpg -p "What is in the image?"
+# run in conversation mode
+./build/bin/llama-mtmd-cli -m ../MiniCPM-o-2_6/model/ggml-model-Q4_K_M.gguf --mmproj ../MiniCPM-o-2_6/mmproj-model-f16.gguf
 ```

examples/llava/README-minicpmv2.5.md renamed to docs/multimodal/minicpmv2.5.md

Lines changed: 4 additions & 4 deletions
@@ -39,9 +39,9 @@ python ./convert_hf_to_gguf.py ../MiniCPM-Llama3-V-2_5/model

 Inference on Linux or Mac
 ```bash
-# run f16 version
-./build/bin/llama-minicpmv-cli -m ../MiniCPM-Llama3-V-2_5/model/model-8B-F16.gguf --mmproj ../MiniCPM-Llama3-V-2_5/mmproj-model-f16.gguf -c 4096 --temp 0.7 --top-p 0.8 --top-k 100 --repeat-penalty 1.05 --image xx.jpg -p "What is in the image?"
+# run in single-turn mode
+./build/bin/llama-mtmd-cli -m ../MiniCPM-Llama3-V-2_5/model/model-8B-F16.gguf --mmproj ../MiniCPM-Llama3-V-2_5/mmproj-model-f16.gguf -c 4096 --temp 0.7 --top-p 0.8 --top-k 100 --repeat-penalty 1.05 --image xx.jpg -p "What is in the image?"

-# run quantized int4 version
-./build/bin/llama-minicpmv-cli -m ../MiniCPM-Llama3-V-2_5/model/ggml-model-Q4_K_M.gguf --mmproj ../MiniCPM-Llama3-V-2_5/mmproj-model-f16.gguf -c 4096 --temp 0.7 --top-p 0.8 --top-k 100 --repeat-penalty 1.05 --image xx.jpg -p "What is in the image?"
+# run in conversation mode
+./build/bin/llama-mtmd-cli -m ../MiniCPM-Llama3-V-2_5/model/ggml-model-Q4_K_M.gguf --mmproj ../MiniCPM-Llama3-V-2_5/mmproj-model-f16.gguf
 ```

examples/llava/README-minicpmv2.6.md renamed to docs/multimodal/minicpmv2.6.md

Lines changed: 4 additions & 4 deletions
@@ -39,9 +39,9 @@ python ./convert_hf_to_gguf.py ../MiniCPM-V-2_6/model

 Inference on Linux or Mac
 ```bash
-# run f16 version
-./build/bin/llama-minicpmv-cli -m ../MiniCPM-V-2_6/model/ggml-model-f16.gguf --mmproj ../MiniCPM-V-2_6/mmproj-model-f16.gguf -c 4096 --temp 0.7 --top-p 0.8 --top-k 100 --repeat-penalty 1.05 --image xx.jpg -p "What is in the image?"
+# run in single-turn mode
+./build/bin/llama-mtmd-cli -m ../MiniCPM-V-2_6/model/ggml-model-f16.gguf --mmproj ../MiniCPM-V-2_6/mmproj-model-f16.gguf -c 4096 --temp 0.7 --top-p 0.8 --top-k 100 --repeat-penalty 1.05 --image xx.jpg -p "What is in the image?"

-# run quantized int4 version
-./build/bin/llama-minicpmv-cli -m ../MiniCPM-V-2_6/model/ggml-model-Q4_K_M.gguf --mmproj ../MiniCPM-V-2_6/mmproj-model-f16.gguf -c 4096 --temp 0.7 --top-p 0.8 --top-k 100 --repeat-penalty 1.05 --image xx.jpg -p "What is in the image?"
+# run in conversation mode
+./build/bin/llama-mtmd-cli -m ../MiniCPM-V-2_6/model/ggml-model-Q4_K_M.gguf --mmproj ../MiniCPM-V-2_6/mmproj-model-f16.gguf
 ```
