If you actually inspect the `.keys()` of the loaded tensors, you should see a lot of `vision_model` tensors in the `encoder_tensors`, and 5 tensors (`'multi_modal_projector.linear_1.bias'`, `'multi_modal_projector.linear_1.weight'`, `'multi_modal_projector.linear_2.bias'`, `'multi_modal_projector.linear_2.weight'`, `'image_newline'`) in the multimodal `projector_tensors`.
### 2. Creating the Visual Component GGUF
To create the GGUF for the visual components, we need to write a config for the visual encoder. Make sure the config contains the correct `image_grid_pinpoints`.

Note: we refer to this file as `$VISION_CONFIG` later on.

```json
{
    "_name_or_path": "siglip-model",
    "architectures": [
        "SiglipVisionModel"
    ],
    "image_grid_pinpoints": [
        [384,768],
        [384,1152],
        [384,1536],
        [384,1920],
        [384,2304],
        [384,2688],
        [384,3072],
        [384,3456],
        [384,3840],
        [768,384],
        [768,768],
        [768,1152],
        [768,1536],
        [768,1920],
        [1152,384],
        [1152,768],
        [1152,1152],
        [1536,384],
        [1536,768],
        [1920,384],
        [1920,768],
        [2304,384],
        [2688,384],
        [3072,384],
        [3456,384],
        [3840,384]
    ],
    "mm_patch_merge_type": "spatial_unpad",
    "hidden_size": 1152,
    "image_size": 384,
    "intermediate_size": 4304,
    "model_type": "siglip_vision_model",
    "num_attention_heads": 16,
    "num_hidden_layers": 27,
    "patch_size": 14,
    "layer_norm_eps": 1e-6,
    "hidden_act": "gelu_pytorch_tanh",
    "projection_dim": 0,
    "vision_feature_layer": [-24, -20, -12, -1]
}
```
Create a new directory to hold the visual components, and copy the llava.clip/projector files, as well as the vision config, into it; a sketch of the copy commands is shown below.
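
Something like the following should work. This is only a sketch: the source locations of `llava.clip` and `llava.projector` are assumptions about where the earlier extraction step wrote them (here taken to be `$GRANITE_MODEL`), and `visual_encoder` is just an example directory name.

```bash
$ export ENCODER_PATH=$PWD/visual_encoder
$ mkdir -p $ENCODER_PATH

# The encoder weights are expected under the name pytorch_model.bin
$ cp $GRANITE_MODEL/llava.clip $ENCODER_PATH/pytorch_model.bin
$ cp $GRANITE_MODEL/llava.projector $ENCODER_PATH/
$ cp $VISION_CONFIG $ENCODER_PATH/config.json
```
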
At which point you should have something like this:
```bash
$ ls $ENCODER_PATH
config.json  llava.projector  pytorch_model.bin
```
Now convert the components to GGUF. Note that we also override the image mean/std dev to `[.5,.5,.5]`, since we use the SigLIP visual encoder; in the transformers model, you can find these numbers in the [preprocessor_config.json](https://huggingface.co/ibm-granite/granite-vision-3.1-2b-preview/blob/main/preprocessor_config.json).

```bash
$ python convert_image_encoder_to_gguf.py \
    -m $ENCODER_PATH \
    --llava-projector $ENCODER_PATH/llava.projector \
    --output-dir $ENCODER_PATH \
    --clip-model-is-vision \
    --clip-model-is-siglip \
    --image-mean 0.5 0.5 0.5 --image-std 0.5 0.5 0.5
```

This will create the first GGUF file at `$ENCODER_PATH/mmproj-model-f16.gguf`; we will refer to the absolute path of this file as `$VISUAL_GGUF_PATH`.
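
If you like, export that path now so it can be reused in the final step (a convenience only; any absolute path works):

```bash
$ export VISUAL_GGUF_PATH=$ENCODER_PATH/mmproj-model-f16.gguf
```
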
### 3. Creating the LLM GGUF
The Granite Vision model contains a Granite LLM as its language model. For now, the easiest way to get a GGUF for the LLM is to load the composite model in `transformers` and export the LLM so that it can be converted directly with the normal conversion path.
First, set `LLM_EXPORT_PATH` to the path that the `transformers` LLM should be exported to.
```bash
$ export LLM_EXPORT_PATH=$PWD/granite_vision_llm
```
```python
import os
import transformers

MODEL_PATH = os.getenv("GRANITE_MODEL")
if not MODEL_PATH:
    raise ValueError("env var GRANITE_MODEL is unset!")

LLM_EXPORT_PATH = os.getenv("LLM_EXPORT_PATH")
if not LLM_EXPORT_PATH:
    raise ValueError("env var LLM_EXPORT_PATH is unset!")
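
# The remainder of the export is sketched below; the exact model class and the
# .language_model attribute are assumptions about the transformers API for this
# checkpoint rather than something this guide prescribes.
tokenizer = transformers.AutoTokenizer.from_pretrained(MODEL_PATH)
model = transformers.AutoModelForImageTextToText.from_pretrained(MODEL_PATH)

# Save only the tokenizer and the language model so that the result can be
# converted with the normal llama.cpp conversion path.
tokenizer.save_pretrained(LLM_EXPORT_PATH)
model.language_model.save_pretrained(LLM_EXPORT_PATH)
```

The exported LLM can then be converted with the usual `convert_hf_to_gguf.py` script; the sketch below assumes you run it from the llama.cpp repo root, and the output filename is only an example. We refer to the resulting file as `$LLM_GGUF_PATH`.

```bash
$ python convert_hf_to_gguf.py --outfile $PWD/granite_llm.gguf $LLM_EXPORT_PATH
$ export LLM_GGUF_PATH=$PWD/granite_llm.gguf
```
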
Build llama.cpp normally; you should end up with a target binary named `llama-llava-cli`, to which you can pass the two GGUF files. Sample usage:
Note - the test image shown below can be found [here](https://github-production-user-asset-6210df.s3.amazonaws.com/10740300/415512792-d90d5562-8844-4f34-a0a5-77f62d5a58b5.jpg?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=AKIAVCODYLSA53PQK4ZA%2F20250221%2Fus-east-1%2Fs3%2Faws4_request&X-Amz-Date=20250221T054145Z&X-Amz-Expires=300&X-Amz-Signature=86c60be490aa49ef7d53f25d6c973580a8273904fed11ed2453d0a38240ee40a&X-Amz-SignedHeaders=host).

```bash
$ ./build/bin/llama-llava-cli -m $LLM_GGUF_PATH \
    --mmproj $VISUAL_GGUF_PATH \
    --image cherry_blossom.jpg \
    -c 16384 \
    -p "<|system|>\nA chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions.\n<|user|>\n<image>\nWhat type of flowers are in this picture?\n<|assistant|>\n" \
    --temp 0
```
Sample response: `The flowers in the picture are cherry blossoms, which are known for their delicate pink petals and are often associated with the beauty of spring.`