
Commit 823ad4c

Author: poulphunter
Merge remote-tracking branch 'upstream/master'
2 parents: 5e40b3e + 3567ee3


66 files changed: +3071 / -844 lines

Makefile

Lines changed: 12 additions & 0 deletions
```diff
@@ -680,6 +680,10 @@ ifdef GGML_CUDA_CCBIN
 MK_NVCCFLAGS += -ccbin $(GGML_CUDA_CCBIN)
 endif # GGML_CUDA_CCBIN
 
+ifdef GGML_CUDA_NO_FA
+MK_NVCCFLAGS += -DGGML_CUDA_NO_FA
+endif # GGML_CUDA_NO_FA
+
 ifdef GGML_CUDA_FA_ALL_QUANTS
 MK_NVCCFLAGS += -DGGML_CUDA_FA_ALL_QUANTS
 endif # GGML_CUDA_FA_ALL_QUANTS
@@ -800,6 +804,10 @@ ifdef GGML_CUDA_NO_PEER_COPY
 HIPFLAGS += -DGGML_CUDA_NO_PEER_COPY
 endif # GGML_CUDA_NO_PEER_COPY
 
+ifdef GGML_CUDA_NO_FA
+HIPFLAGS += -DGGML_CUDA_NO_FA
+endif # GGML_CUDA_NO_FA
+
 OBJ_GGML_EXT += ggml/src/ggml-cuda/ggml-cuda.o
 OBJ_GGML_EXT += $(patsubst %.cu,%.o,$(wildcard ggml/src/ggml-cuda/*.cu))
 OBJ_GGML_EXT += $(OBJ_CUDA_TMPL)
@@ -876,6 +884,10 @@ ifdef GGML_CUDA_NO_PEER_COPY
 MUSAFLAGS += -DGGML_CUDA_NO_PEER_COPY
 endif # GGML_CUDA_NO_PEER_COPY
 
+ifdef GGML_CUDA_NO_FA
+MUSAFLAGS += -DGGML_CUDA_NO_FA
+endif # GGML_CUDA_NO_FA
+
 ifdef GGML_CUDA_FA_ALL_QUANTS
 MUSAFLAGS += -DGGML_CUDA_FA_ALL_QUANTS
 endif # GGML_CUDA_FA_ALL_QUANTS
```
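
The new switch mirrors the existing toggles such as GGML_CUDA_FA_ALL_QUANTS. A minimal sketch of how a CUDA build with flash attention compiled out might be invoked; GGML_CUDA=1 as the CUDA toggle and the -j flag are assumptions based on the surrounding Makefile, not part of this diff:

```bash
# Hypothetical build invocation: GGML_CUDA=1 selects the CUDA path, and the new
# GGML_CUDA_NO_FA=1 switch adds -DGGML_CUDA_NO_FA to the NVCC flags as in the hunk above.
# The HIP and MUSA builds pick up the same define via HIPFLAGS / MUSAFLAGS.
make GGML_CUDA=1 GGML_CUDA_NO_FA=1 -j
```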

README.md

Lines changed: 1 addition & 1 deletion
```diff
@@ -219,7 +219,7 @@ Instructions for adding support for new models: [HOWTO-add-model.md](docs/develo
 - [llama_cpp_canister](https://github.com/onicai/llama_cpp_canister) - llama.cpp as a smart contract on the Internet Computer, using WebAssembly
 - [llama-swap](https://github.com/mostlygeek/llama-swap) - transparent proxy that adds automatic model switching with llama-server
 - [Kalavai](https://github.com/kalavai-net/kalavai-client) - Crowdsource end to end LLM deployment at any scale
+- [llmaz](https://github.com/InftyAI/llmaz) - ☸️ Easy, advanced inference platform for large language models on Kubernetes.
 
 </details>
 
 <details>
```

docs/backend/SYCL.md

Lines changed: 14 additions & 2 deletions
```diff
@@ -42,6 +42,16 @@ The following release is verified with good quality:
 
 ## News
 
+- 2025.2
+  - Optimize MUL_MAT Q4_0 on Intel GPU for all dGPUs and built-in GPUs since MTL. Increase the performance of LLM (llama-2-7b.Q4_0.gguf) 21%-87% on Intel GPUs (MTL, ARL-H, Arc, Flex, PVC).
+    |GPU|Base tokens/s|Increased tokens/s|Percent|
+    |-|-|-|-|
+    |PVC 1550|39|73|+87%|
+    |Flex 170|39|50|+28%|
+    |Arc770|42|55|+30%|
+    |MTL|13|16|+23%|
+    |ARL-H|14|17|+21%|
+
 - 2024.11
   - Use syclcompat to improve the performance on some platforms. This requires to use oneAPI 2025.0 or newer.
 
@@ -97,8 +107,8 @@ SYCL backend supports Intel GPU Family:
 | Intel Data Center Max Series | Support | Max 1550, 1100 |
 | Intel Data Center Flex Series | Support | Flex 170 |
 | Intel Arc Series | Support | Arc 770, 730M, Arc A750 |
-| Intel built-in Arc GPU | Support | built-in Arc GPU in Meteor Lake |
-| Intel iGPU | Support | iGPU in 13700k, i5-1250P, i7-1260P, i7-1165G7 |
+| Intel built-in Arc GPU | Support | built-in Arc GPU in Meteor Lake, Arrow Lake |
+| Intel iGPU | Support | iGPU in 13700k,iGPU in 13400, i5-1250P, i7-1260P, i7-1165G7 |
 
 *Notes:*
 
@@ -660,8 +670,10 @@ use 1 SYCL GPUs: [0] with Max compute units:512
 | Name | Value | Function |
 |-------------------|------------------|---------------------------------------------------------------------------|
 | GGML_SYCL_DEBUG | 0 (default) or 1 | Enable log function by macro: GGML_SYCL_DEBUG |
+| GGML_SYCL_DISABLE_OPT | 0 (default) or 1 | Disable optimize features based on Intel GPU type, to compare the performance increase |
 | ZES_ENABLE_SYSMAN | 0 (default) or 1 | Support to get free memory of GPU by sycl::aspect::ext_intel_free_memory.<br>Recommended to use when --split-mode = layer |
 
+
 ## Known Issues
 
 - `Split-mode:[row]` is not supported.
```
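
The `GGML_SYCL_DISABLE_OPT` variable added to the table above is intended for comparing the optimized and baseline paths. A sketch of such an A/B run, assuming a SYCL build under `./build/bin` and the llama-2-7b Q4_0 model named in the News entry (both are assumptions, not part of this diff):

```bash
# Run the same benchmark with the Intel-GPU-specific optimizations on and off.
GGML_SYCL_DISABLE_OPT=0 ./build/bin/llama-bench -m llama-2-7b.Q4_0.gguf
GGML_SYCL_DISABLE_OPT=1 ./build/bin/llama-bench -m llama-2-7b.Q4_0.gguf
```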

docs/function-calling.md

Lines changed: 390 additions & 0 deletions (large diff not rendered)

Lines changed: 183 additions & 0 deletions
# Granite Vision

Download the model and point your `GRANITE_MODEL` environment variable to the path.

```bash
$ git clone https://huggingface.co/ibm-granite/granite-vision-3.1-2b-preview
$ export GRANITE_MODEL=./granite-vision-3.1-2b-preview
```


### 1. Running llava surgery v2.
First, we need to run the llava surgery script as shown below:

`python llava_surgery_v2.py -C -m $GRANITE_MODEL`

You should see two new files (`llava.clip` and `llava.projector`) written into your model's directory, as shown below.

```bash
$ ls $GRANITE_MODEL | grep -i llava
llava.clip
llava.projector
```

We should see that the projector and visual encoder get split out into the llava files. Quick check to make sure they aren't empty:
```python
import os
import torch

MODEL_PATH = os.getenv("GRANITE_MODEL")
if not MODEL_PATH:
    raise ValueError("env var GRANITE_MODEL is unset!")

encoder_tensors = torch.load(os.path.join(MODEL_PATH, "llava.clip"))
projector_tensors = torch.load(os.path.join(MODEL_PATH, "llava.projector"))

assert len(encoder_tensors) > 0
assert len(projector_tensors) > 0
```

If you actually inspect the `.keys()` of the loaded tensors, you should see a lot of `vision_model` tensors in the `encoder_tensors`, and 5 tensors (`'multi_modal_projector.linear_1.bias'`, `'multi_modal_projector.linear_1.weight'`, `'multi_modal_projector.linear_2.bias'`, `'multi_modal_projector.linear_2.weight'`, `'image_newline'`) in the multimodal `projector_tensors`.
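If you prefer a shell-level sanity check in addition to the Python snippet above, the two files written by the surgery script can also be inspected by size; a small sketch using the paths created above (exact sizes depend on the checkpoint):

```bash
# Both files should be non-empty; exact sizes depend on the checkpoint.
$ du -h $GRANITE_MODEL/llava.clip $GRANITE_MODEL/llava.projector
```
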
### 2. Creating the Visual Component GGUF
To create the GGUF for the visual components, we need to write a config for the visual encoder; make sure the config contains the correct `image_grid_pinpoints`.

Note: we refer to this file as `$VISION_CONFIG` later on.
```json
{
    "_name_or_path": "siglip-model",
    "architectures": [
        "SiglipVisionModel"
    ],
    "image_grid_pinpoints": [
        [384,768],
        [384,1152],
        [384,1536],
        [384,1920],
        [384,2304],
        [384,2688],
        [384,3072],
        [384,3456],
        [384,3840],
        [768,384],
        [768,768],
        [768,1152],
        [768,1536],
        [768,1920],
        [1152,384],
        [1152,768],
        [1152,1152],
        [1536,384],
        [1536,768],
        [1920,384],
        [1920,768],
        [2304,384],
        [2688,384],
        [3072,384],
        [3456,384],
        [3840,384]
    ],
    "mm_patch_merge_type": "spatial_unpad",
    "hidden_size": 1152,
    "image_size": 384,
    "intermediate_size": 4304,
    "model_type": "siglip_vision_model",
    "num_attention_heads": 16,
    "num_hidden_layers": 27,
    "patch_size": 14,
    "layer_norm_eps": 1e-6,
    "hidden_act": "gelu_pytorch_tanh",
    "projection_dim": 0,
    "vision_feature_layer": [-24, -20, -12, -1]
}
```

Create a new directory to hold the visual components, and copy the llava.clip/projector files, as well as the vision config, into it.

```bash
$ ENCODER_PATH=$PWD/visual_encoder
$ mkdir $ENCODER_PATH

$ cp $GRANITE_MODEL/llava.clip $ENCODER_PATH/pytorch_model.bin
$ cp $GRANITE_MODEL/llava.projector $ENCODER_PATH/
$ cp $VISION_CONFIG $ENCODER_PATH/config.json
```

At which point you should have something like this:
```bash
$ ls $ENCODER_PATH
config.json llava.projector pytorch_model.bin
```

Now convert the components to GGUF; note that we also override the image mean/std dev to `[.5,.5,.5]` since we use the SigLIP visual encoder; in the transformers model, you can find these numbers in the [preprocessor_config.json](https://huggingface.co/ibm-granite/granite-vision-3.1-2b-preview/blob/main/preprocessor_config.json).
```bash
$ python convert_image_encoder_to_gguf.py \
    -m $ENCODER_PATH \
    --llava-projector $ENCODER_PATH/llava.projector \
    --output-dir $ENCODER_PATH \
    --clip-model-is-vision \
    --clip-model-is-siglip \
    --image-mean 0.5 0.5 0.5 --image-std 0.5 0.5 0.5
```

This will create the first GGUF file at `$ENCODER_PATH/mmproj-model-f16.gguf`; we will refer to the absolute path of this file as `$VISUAL_GGUF_PATH`.
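For convenience in step 4, you can record that absolute path in the variable this guide refers to; a minimal sketch (using `realpath` is just one way to obtain an absolute path):

```bash
# Capture the absolute path of the projector GGUF produced by the converter above.
$ export VISUAL_GGUF_PATH=$(realpath $ENCODER_PATH/mmproj-model-f16.gguf)
$ ls -lh $VISUAL_GGUF_PATH
```
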
### 3. Creating the LLM GGUF.
The granite vision model contains a granite LLM as its language model. For now, the easiest way to get the GGUF for the LLM is by loading the composite model in `transformers` and exporting the LLM so that it can be directly converted with the normal conversion path.

First, set `LLM_EXPORT_PATH` to the path that the `transformers` LLM should be exported to.
```bash
$ export LLM_EXPORT_PATH=$PWD/granite_vision_llm
```

```python
import os
import transformers

MODEL_PATH = os.getenv("GRANITE_MODEL")
if not MODEL_PATH:
    raise ValueError("env var GRANITE_MODEL is unset!")

LLM_EXPORT_PATH = os.getenv("LLM_EXPORT_PATH")
if not LLM_EXPORT_PATH:
    raise ValueError("env var LLM_EXPORT_PATH is unset!")

tokenizer = transformers.AutoTokenizer.from_pretrained(MODEL_PATH)

# NOTE: granite vision support was added to transformers very recently (4.49);
# if you get size mismatches, your version is too old.
# If you are running with an older version, set `ignore_mismatched_sizes=True`
# as shown below; it won't be loaded correctly, but the LLM part of the model that
# we are exporting will be loaded correctly.
model = transformers.AutoModelForImageTextToText.from_pretrained(MODEL_PATH, ignore_mismatched_sizes=True)

tokenizer.save_pretrained(LLM_EXPORT_PATH)
model.language_model.save_pretrained(LLM_EXPORT_PATH)
```

Now you can convert the exported LLM to GGUF with the normal converter in the root of the llama.cpp project.
```bash
$ LLM_GGUF_PATH=$LLM_EXPORT_PATH/granite_llm.gguf
...
$ python convert_hf_to_gguf.py --outfile $LLM_GGUF_PATH $LLM_EXPORT_PATH
```
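Before wiring in the visual projector, it can help to confirm that the exported LLM GGUF loads on its own; an optional smoke test (the prompt is arbitrary, and `llama-cli` is the standard example binary rather than anything this guide requires):

```bash
# Optional: text-only sanity check of the converted LLM (no image input involved).
$ ./build/bin/llama-cli -m $LLM_GGUF_PATH -p "Tell me about cherry blossoms." -n 32
```
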
### 4. Running the Model in llama.cpp
Build llama.cpp normally; you should have a target binary named `llama-llava-cli`, to which you can pass the two GGUF files built above. Sample usage:

Note: the test image shown below can be found [here](https://github-production-user-asset-6210df.s3.amazonaws.com/10740300/415512792-d90d5562-8844-4f34-a0a5-77f62d5a58b5.jpg?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=AKIAVCODYLSA53PQK4ZA%2F20250221%2Fus-east-1%2Fs3%2Faws4_request&X-Amz-Date=20250221T054145Z&X-Amz-Expires=300&X-Amz-Signature=86c60be490aa49ef7d53f25d6c973580a8273904fed11ed2453d0a38240ee40a&X-Amz-SignedHeaders=host).

```bash
$ ./build/bin/llama-llava-cli -m $LLM_GGUF_PATH \
    --mmproj $VISUAL_GGUF_PATH \
    --image cherry_blossom.jpg \
    -c 16384 \
    -p "<|system|>\nA chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions.\n<|user|>\n\<image>\nWhat type of flowers are in this picture?\n<|assistant|>\n" \
    --temp 0
```

Sample response: `The flowers in the picture are cherry blossoms, which are known for their delicate pink petals and are often associated with the beauty of spring.`

examples/llava/README.md

Lines changed: 19 additions & 0 deletions
````diff
@@ -101,8 +101,27 @@ python ./examples/convert_legacy_llama.py ../llava-v1.6-vicuna-7b/ --skip-unknow
 ```
 
 **note** llava-1.6 needs more context than llava-1.5, at least 3000 is needed (just run it at -c 4096)
+
 **note** llava-1.6 greatly benefits from batched prompt processing (defaults work)
 
+**note** if the language model in step `6)` is incompatible with the legacy conversion script, the easiest way to handle the LLM model conversion is to load the model in transformers, and export only the LLM from the llava next model.
+
+```python
+import os
+import transformers
+
+model_path = ...
+llm_export_path = ...
+
+tokenizer = transformers.AutoTokenizer.from_pretrained(model_path)
+model = transformers.AutoModelForImageTextToText.from_pretrained(model_path)
+
+tokenizer.save_pretrained(llm_export_path)
+model.language_model.save_pretrained(llm_export_path)
+```
+
+Then, you can convert the LLM using the `convert_hf_to_gguf.py` script, which handles more LLM architectures.
+
 ## llava-cli templating and llava-1.6 prompting
 
 llava-1.5 models all use the same vicuna prompt, here you can just add your image question like `-p "Provide a full description."`
````
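
The added note ends at "convert the LLM using the `convert_hf_to_gguf.py` script"; a minimal sketch of that final step, assuming the `llm_export_path` placeholder from the snippet pointed at a real directory (the paths below are illustrative only):

```bash
# Hypothetical paths; point these at the directory the LLM was exported into.
$ python convert_hf_to_gguf.py --outfile ./llava_llm_export/llava_llm.gguf ./llava_llm_export
```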
