Commit 7b076ee

Merge branch 'master' into llamacli-tools

2 parents: 66eff76 + 401af80

102 files changed: 4779 additions, 901 deletions


.github/workflows/build.yml

Lines changed: 12 additions & 4 deletions

```diff
@@ -173,7 +173,15 @@ jobs:
           name: llama-bin-macos-x64.zip
 
   ubuntu-cpu-cmake:
-    runs-on: ubuntu-22.04
+    strategy:
+      matrix:
+        include:
+          - build: 'x64'
+            os: ubuntu-22.04
+          - build: 'arm64'
+            os: ubuntu-22.04-arm
+
+    runs-on: ${{ matrix.os }}
 
     steps:
       - name: Clone
@@ -239,14 +247,14 @@ jobs:
         run: |
           cp LICENSE ./build/bin/
           cp examples/run/linenoise.cpp/LICENSE ./build/bin/LICENSE.linenoise.cpp
-          zip -r llama-${{ steps.tag.outputs.name }}-bin-ubuntu-x64.zip ./build/bin/*
+          zip -r llama-${{ steps.tag.outputs.name }}-bin-ubuntu-${{ matrix.build }}.zip ./build/bin/*
 
       - name: Upload artifacts
         if: ${{ ( github.event_name == 'push' && github.ref == 'refs/heads/master' ) || github.event.inputs.create_release == 'true' }}
         uses: actions/upload-artifact@v4
         with:
-          path: llama-${{ steps.tag.outputs.name }}-bin-ubuntu-x64.zip
-          name: llama-bin-ubuntu-x64.zip
+          path: llama-${{ steps.tag.outputs.name }}-bin-ubuntu-${{ matrix.build }}.zip
+          name: llama-bin-ubuntu-${{ matrix.build }}.zip
 
   ubuntu-latest-cmake-sanitizer:
     runs-on: ubuntu-latest
```

CONTRIBUTING.md

Lines changed: 2 additions & 0 deletions

```diff
@@ -1,10 +1,12 @@
 # Pull requests (for contributors)
 
+- llama.cpp uses the ggml tensor library for model evaluation. If you are unfamiliar with ggml, consider taking a look at the [examples in the ggml repository](https://github.com/ggml-org/ggml/tree/master/examples/). [simple](https://github.com/ggml-org/ggml/tree/master/examples/simple) shows the bare minimum for using ggml. [gpt-2](https://github.com/ggml-org/ggml/tree/master/examples/gpt-2) has minimal implementations for language model inference using GPT-2. [mnist](https://github.com/ggml-org/ggml/tree/master/examples/mnist) demonstrates how to train and evaluate a simple image classifier
 - Test your changes:
   - Execute [the full CI locally on your machine](ci/README.md) before publishing
   - Verify that the perplexity and the performance are not affected negatively by your changes (use `llama-perplexity` and `llama-bench`)
   - If you modified the `ggml` source, run the `test-backend-ops` tool to check whether different backend implementations of the `ggml` operators produce consistent results (this requires access to at least two different `ggml` backends)
   - If you modified a `ggml` operator or added a new one, add the corresponding test cases to `test-backend-ops`
+- Create separate PRs for each feature or fix. Avoid combining unrelated changes in a single PR
 - Consider allowing write access to your branch for faster reviews, as reviewers can push commits directly
 - If your PR becomes stale, don't hesitate to ping the maintainers in the comments
```
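As a rough illustration of the `test-backend-ops` guideline above (not part of the commit): assuming a CMake build with a second backend such as CUDA enabled alongside the CPU backend, the consistency check could be run along these lines.

```bash
# Build with an extra backend so there is something to compare the CPU results against
# (GGML_CUDA=ON is just one possible choice of second backend)
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release

# Run the ggml operator consistency tests across the available backends
./build/bin/test-backend-ops
```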

Makefile

Lines changed: 13 additions & 1 deletion

```diff
@@ -680,6 +680,10 @@ ifdef GGML_CUDA_CCBIN
     MK_NVCCFLAGS += -ccbin $(GGML_CUDA_CCBIN)
 endif # GGML_CUDA_CCBIN
 
+ifdef GGML_CUDA_NO_FA
+    MK_NVCCFLAGS += -DGGML_CUDA_NO_FA
+endif # GGML_CUDA_NO_FA
+
 ifdef GGML_CUDA_FA_ALL_QUANTS
     MK_NVCCFLAGS += -DGGML_CUDA_FA_ALL_QUANTS
 endif # GGML_CUDA_FA_ALL_QUANTS
@@ -800,6 +804,10 @@ ifdef GGML_CUDA_NO_PEER_COPY
     HIPFLAGS += -DGGML_CUDA_NO_PEER_COPY
 endif # GGML_CUDA_NO_PEER_COPY
 
+ifdef GGML_CUDA_NO_FA
+    HIPFLAGS += -DGGML_CUDA_NO_FA
+endif # GGML_CUDA_NO_FA
+
 OBJ_GGML_EXT += ggml/src/ggml-cuda/ggml-cuda.o
 OBJ_GGML_EXT += $(patsubst %.cu,%.o,$(wildcard ggml/src/ggml-cuda/*.cu))
 OBJ_GGML_EXT += $(OBJ_CUDA_TMPL)
@@ -847,7 +855,7 @@ ifdef GGML_MUSA
     CXX := $(MUSA_PATH)/bin/clang++
     MCC := $(CCACHE) $(MUSA_PATH)/bin/mcc
 
-    MUSAFLAGS = -x musa -mtgpu
+    MUSAFLAGS = -fsigned-char -x musa -mtgpu
     MUSAFLAGS += $(foreach arch,$(subst ;, ,$(MUSA_ARCHITECTURES)),--cuda-gpu-arch=mp_$(arch))
 
     ifdef GGML_CUDA_FORCE_MMQ
@@ -876,6 +884,10 @@ ifdef GGML_CUDA_NO_PEER_COPY
     MUSAFLAGS += -DGGML_CUDA_NO_PEER_COPY
 endif # GGML_CUDA_NO_PEER_COPY
 
+ifdef GGML_CUDA_NO_FA
+    MUSAFLAGS += -DGGML_CUDA_NO_FA
+endif # GGML_CUDA_NO_FA
+
 ifdef GGML_CUDA_FA_ALL_QUANTS
     MUSAFLAGS += -DGGML_CUDA_FA_ALL_QUANTS
 endif # GGML_CUDA_FA_ALL_QUANTS
```
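As a usage illustration (not part of the commit), the new `GGML_CUDA_NO_FA` option is passed on the `make` command line like the other `GGML_CUDA_*` switches; a minimal sketch, assuming a CUDA toolchain and the legacy Makefile build:

```bash
# Build the CLI with CUDA enabled but the FlashAttention kernels compiled out
make GGML_CUDA=1 GGML_CUDA_NO_FA=1 -j llama-cli
```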

docs/backend/SYCL.md

Lines changed: 14 additions & 2 deletions

```diff
@@ -42,6 +42,16 @@ The following release is verified with good quality:
 
 ## News
 
+- 2025.2
+  - Optimize MUL_MAT Q4_0 on Intel GPU for all dGPUs and built-in GPUs since MTL. Increase the performance of LLM (llama-2-7b.Q4_0.gguf) 21%-87% on Intel GPUs (MTL, ARL-H, Arc, Flex, PVC).
+    |GPU|Base tokens/s|Increased tokens/s|Percent|
+    |-|-|-|-|
+    |PVC 1550|39|73|+87%|
+    |Flex 170|39|50|+28%|
+    |Arc770|42|55|+30%|
+    |MTL|13|16|+23%|
+    |ARL-H|14|17|+21%|
+
 - 2024.11
   - Use syclcompat to improve the performance on some platforms. This requires to use oneAPI 2025.0 or newer.
 
@@ -97,8 +107,8 @@ SYCL backend supports Intel GPU Family:
 | Intel Data Center Max Series | Support | Max 1550, 1100 |
 | Intel Data Center Flex Series | Support | Flex 170 |
 | Intel Arc Series | Support | Arc 770, 730M, Arc A750 |
-| Intel built-in Arc GPU | Support | built-in Arc GPU in Meteor Lake |
-| Intel iGPU | Support | iGPU in 13700k, i5-1250P, i7-1260P, i7-1165G7 |
+| Intel built-in Arc GPU | Support | built-in Arc GPU in Meteor Lake, Arrow Lake |
+| Intel iGPU | Support | iGPU in 13700k, iGPU in 13400, i5-1250P, i7-1260P, i7-1165G7 |
 
 *Notes:*
 
@@ -660,8 +670,10 @@ use 1 SYCL GPUs: [0] with Max compute units:512
 | Name | Value | Function |
 |-------------------|------------------|---------------------------------------------------------------------------------------------------------------------------|
 | GGML_SYCL_DEBUG | 0 (default) or 1 | Enable log function by macro: GGML_SYCL_DEBUG |
+| GGML_SYCL_DISABLE_OPT | 0 (default) or 1 | Disable optimize features based on Intel GPU type, to compare the performance increase |
 | ZES_ENABLE_SYSMAN | 0 (default) or 1 | Support to get free memory of GPU by sycl::aspect::ext_intel_free_memory.<br>Recommended to use when --split-mode = layer |
 
+
 ## Known Issues
 
 - `Split-mode:[row]` is not supported.
```
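As an illustration of the new `GGML_SYCL_DISABLE_OPT` variable (not part of the commit), a before/after comparison could look roughly like this; the model path is a placeholder:

```bash
# Baseline: Intel-GPU-specific optimizations enabled (default)
./build/bin/llama-bench -m llama-2-7b.Q4_0.gguf

# Comparison run: optimizations disabled
GGML_SYCL_DISABLE_OPT=1 ./build/bin/llama-bench -m llama-2-7b.Q4_0.gguf
```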

docs/build.md

Lines changed: 8 additions & 0 deletions

````diff
@@ -206,6 +206,14 @@ This provides GPU acceleration using the MUSA cores of your Moore Threads MTT GPU.
 cmake --build build --config Release
 ```
 
+For static build:
+
+```bash
+cmake -B build -DGGML_MUSA=ON \
+    -DBUILD_SHARED_LIBS=OFF -DCMAKE_POSITION_INDEPENDENT_CODE=ON
+cmake --build build --config Release
+```
+
 The environment variable [`MUSA_VISIBLE_DEVICES`](https://docs.mthreads.com/musa-sdk/musa-sdk-doc-online/programming_guide/Z%E9%99%84%E5%BD%95/) can be used to specify which GPU(s) will be used.
 
 The environment variable `GGML_CUDA_ENABLE_UNIFIED_MEMORY=1` can be used to enable unified memory in Linux. This allows swapping to system RAM instead of crashing when the GPU VRAM is exhausted.
````
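For illustration only (not part of the commit), the two environment variables mentioned above would typically be combined with an ordinary run; the model path is a placeholder:

```bash
# Run on the first MUSA device and allow spilling to system RAM when VRAM is exhausted
MUSA_VISIBLE_DEVICES=0 GGML_CUDA_ENABLE_UNIFIED_MEMORY=1 \
    ./build/bin/llama-cli -m ./models/model.gguf -p "Hello"
```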

examples/llama.swiftui/llama.swiftui/UI/ContentView.swift

Lines changed: 18 additions & 7 deletions

```diff
@@ -124,15 +124,26 @@ struct ContentView: View {
             }
         }
         }.sheet(isPresented: $showingHelp) { // Sheet for help modal
-            VStack(alignment: .leading) {
+            NavigationView {
                 VStack(alignment: .leading) {
-                    Text("1. Make sure the model is in GGUF Format")
-                        .padding()
-                    Text("2. Copy the download link of the quantized model")
-                        .padding()
+                    VStack(alignment: .leading) {
+                        Text("1. Make sure the model is in GGUF Format")
+                            .padding()
+                        Text("2. Copy the download link of the quantized model")
+                            .padding()
+                    }
+                    Spacer()
+                }
+                .navigationTitle("Help")
+                .navigationBarTitleDisplayMode(.inline)
+                .toolbar {
+                    ToolbarItem(placement: .navigationBarTrailing) {
+                        Button("Done") {
+                            showingHelp = false
+                        }
+                    }
                 }
-                Spacer()
-            }
+            }
         }
     }
 }
```

Lines changed: 183 additions & 0 deletions

````diff
@@ -0,0 +1,183 @@
+# Granite Vision
+
+Download the model and point your `GRANITE_MODEL` environment variable to the path.
+
+```bash
+$ git clone https://huggingface.co/ibm-granite/granite-vision-3.1-2b-preview
+$ export GRANITE_MODEL=./granite-vision-3.1-2b-preview
+```
+
+
+### 1. Running llava surgery v2.
+First, we need to run the llava surgery script as shown below:
+
+`python llava_surgery_v2.py -C -m $GRANITE_MODEL`
+
+You should see two new files (`llava.clip` and `llava.projector`) written into your model's directory, as shown below.
+
+```bash
+$ ls $GRANITE_MODEL | grep -i llava
+llava.clip
+llava.projector
+```
+
+We should see that the projector and visual encoder get split out into the llava files. Quick check to make sure they aren't empty:
+```python
+import os
+import torch
+
+MODEL_PATH = os.getenv("GRANITE_MODEL")
+if not MODEL_PATH:
+    raise ValueError("env var GRANITE_MODEL is unset!")
+
+encoder_tensors = torch.load(os.path.join(MODEL_PATH, "llava.clip"))
+projector_tensors = torch.load(os.path.join(MODEL_PATH, "llava.projector"))
+
+assert len(encoder_tensors) > 0
+assert len(projector_tensors) > 0
+```
+
+If you actually inspect the `.keys()` of the loaded tensors, you should see a lot of `vision_model` tensors in the `encoder_tensors`, and 5 tensors (`'multi_modal_projector.linear_1.bias'`, `'multi_modal_projector.linear_1.weight'`, `'multi_modal_projector.linear_2.bias'`, `'multi_modal_projector.linear_2.weight'`, `'image_newline'`) in the multimodal `projector_tensors`.
+
+
+### 2. Creating the Visual Component GGUF
+To create the GGUF for the visual components, we need to write a config for the visual encoder; make sure the config contains the correct `image_grid_pinpoints`
+
+
+Note: we refer to this file as `$VISION_CONFIG` later on.
+```json
+{
+    "_name_or_path": "siglip-model",
+    "architectures": [
+      "SiglipVisionModel"
+    ],
+    "image_grid_pinpoints": [
+        [384,768],
+        [384,1152],
+        [384,1536],
+        [384,1920],
+        [384,2304],
+        [384,2688],
+        [384,3072],
+        [384,3456],
+        [384,3840],
+        [768,384],
+        [768,768],
+        [768,1152],
+        [768,1536],
+        [768,1920],
+        [1152,384],
+        [1152,768],
+        [1152,1152],
+        [1536,384],
+        [1536,768],
+        [1920,384],
+        [1920,768],
+        [2304,384],
+        [2688,384],
+        [3072,384],
+        [3456,384],
+        [3840,384]
+    ],
+    "mm_patch_merge_type": "spatial_unpad",
+    "hidden_size": 1152,
+    "image_size": 384,
+    "intermediate_size": 4304,
+    "model_type": "siglip_vision_model",
+    "num_attention_heads": 16,
+    "num_hidden_layers": 27,
+    "patch_size": 14,
+    "layer_norm_eps": 1e-6,
+    "hidden_act": "gelu_pytorch_tanh",
+    "projection_dim": 0,
+    "vision_feature_layer": [-24, -20, -12, -1]
+}
+```
+
+Create a new directory to hold the visual components, and copy the llava.clip/projector files, as well as the vision config into it.
+
+```bash
+$ ENCODER_PATH=$PWD/visual_encoder
+$ mkdir $ENCODER_PATH
+
+$ cp $GRANITE_MODEL/llava.clip $ENCODER_PATH/pytorch_model.bin
+$ cp $GRANITE_MODEL/llava.projector $ENCODER_PATH/
+$ cp $VISION_CONFIG $ENCODER_PATH/config.json
+```
+
+At which point you should have something like this:
+```bash
+$ ls $ENCODER_PATH
+config.json llava.projector pytorch_model.bin
+```
+
+Now convert the components to GGUF; Note that we also override the image mean/std dev to `[.5,.5,.5]` since we use the siglip visual encoder - in the transformers model, you can find these numbers in the [preprocessor_config.json](https://huggingface.co/ibm-granite/granite-vision-3.1-2b-preview/blob/main/preprocessor_config.json).
+```bash
+$ python convert_image_encoder_to_gguf.py \
+    -m $ENCODER_PATH \
+    --llava-projector $ENCODER_PATH/llava.projector \
+    --output-dir $ENCODER_PATH \
+    --clip-model-is-vision \
+    --clip-model-is-siglip \
+    --image-mean 0.5 0.5 0.5 --image-std 0.5 0.5 0.5
+```
+
+this will create the first GGUF file at `$ENCODER_PATH/mmproj-model-f16.gguf`; we will refer to the abs path of this file as the `$VISUAL_GGUF_PATH.`
+
+
+### 3. Creating the LLM GGUF.
+The granite vision model contains a granite LLM as its language model. For now, the easiest way to get the GGUF for LLM is by loading the composite model in `transformers` and exporting the LLM so that it can be directly converted with the normal conversion path.
+
+First, set the `LLM_EXPORT_PATH` to the path to export the `transformers` LLM to.
+```
+$ export LLM_EXPORT_PATH=$PWD/granite_vision_llm
+```
+
+```python
+import os
+import transformers
+
+MODEL_PATH = os.getenv("GRANITE_MODEL")
+if not MODEL_PATH:
+    raise ValueError("env var GRANITE_MODEL is unset!")
+
+LLM_EXPORT_PATH = os.getenv("LLM_EXPORT_PATH")
+if not LLM_EXPORT_PATH:
+    raise ValueError("env var LLM_EXPORT_PATH is unset!")
+
+tokenizer = transformers.AutoTokenizer.from_pretrained(MODEL_PATH)
+
+# NOTE: granite vision support was added to transformers very recently (4.49);
+# if you get size mismatches, your version is too old.
+# If you are running with an older version, set `ignore_mismatched_sizes=True`
+# as shown below; it won't be loaded correctly, but the LLM part of the model that
+# we are exporting will be loaded correctly.
+model = transformers.AutoModelForImageTextToText.from_pretrained(MODEL_PATH, ignore_mismatched_sizes=True)
+
+tokenizer.save_pretrained(LLM_EXPORT_PATH)
+model.language_model.save_pretrained(LLM_EXPORT_PATH)
+```
+
+Now you can convert the exported LLM to GGUF with the normal converter in the root of the llama cpp project.
+```bash
+$ LLM_GGUF_PATH=$LLM_EXPORT_PATH/granite_llm.gguf
+...
+$ python convert_hf_to_gguf.py --outfile $LLM_GGUF_PATH $LLM_EXPORT_PATH
+```
+
+
+### 4. Running the Model in Llama cpp
+Build llama cpp normally; you should have a target binary named `llama-llava-cli`, which you can pass two binaries to. Sample usage:
+
+Note - the test image shown below can be found [here](https://github-production-user-asset-6210df.s3.amazonaws.com/10740300/415512792-d90d5562-8844-4f34-a0a5-77f62d5a58b5.jpg?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=AKIAVCODYLSA53PQK4ZA%2F20250221%2Fus-east-1%2Fs3%2Faws4_request&X-Amz-Date=20250221T054145Z&X-Amz-Expires=300&X-Amz-Signature=86c60be490aa49ef7d53f25d6c973580a8273904fed11ed2453d0a38240ee40a&X-Amz-SignedHeaders=host).
+
+```bash
+$ ./build/bin/llama-llava-cli -m $LLM_GGUF_PATH \
+    --mmproj $VISUAL_GGUF_PATH \
+    --image cherry_blossom.jpg \
+    -c 16384 \
+    -p "<|system|>\nA chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions.\n<|user|>\n\<image>\nWhat type of flowers are in this picture?\n<|assistant|>\n" \
+    --temp 0
+```
+
+Sample response: `The flowers in the picture are cherry blossoms, which are known for their delicate pink petals and are often associated with the beauty of spring.`
````
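One detail worth noting (not part of the commit): the run command in step 4 reads `$VISUAL_GGUF_PATH`, which the walkthrough names but never sets explicitly. A minimal sketch, using the paths defined earlier in the walkthrough, in case they are not already set in your shell:

```bash
# Set the variables the step 4 command expects (paths come from steps 2 and 3)
export VISUAL_GGUF_PATH=$ENCODER_PATH/mmproj-model-f16.gguf
export LLM_GGUF_PATH=$LLM_EXPORT_PATH/granite_llm.gguf
```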

examples/llava/README.md

Lines changed: 19 additions & 0 deletions

````diff
@@ -101,8 +101,27 @@ python ./examples/convert_legacy_llama.py ../llava-v1.6-vicuna-7b/ --skip-unknow
 ```
 
 **note** llava-1.6 needs more context than llava-1.5, at least 3000 is needed (just run it at -c 4096)
+
 **note** llava-1.6 greatly benefits from batched prompt processing (defaults work)
 
+**note** if the language model in step `6)` is incompatible with the legacy conversion script, the easiest way to handle the LLM model conversion is to load the model in transformers, and export only the LLM from the llava next model.
+
+```python
+import os
+import transformers
+
+model_path = ...
+llm_export_path = ...
+
+tokenizer = transformers.AutoTokenizer.from_pretrained(model_path)
+model = transformers.AutoModelForImageTextToText.from_pretrained(model_path)
+
+tokenizer.save_pretrained(llm_export_path)
+model.language_model.save_pretrained(llm_export_path)
+```
+
+Then, you can convert the LLM using the `convert_hf_to_gguf.py` script, which handles more LLM architectures.
+
 ## llava-cli templating and llava-1.6 prompting
 
 llava-1.5 models all use the same vicuna prompt, here you can just add your image question like `-p "Provide a full description."`
````
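For reference (not part of the commit), the final conversion step mentioned in the added note would then look roughly like this, with placeholder paths standing in for the `model_path`/`llm_export_path` values from the snippet above:

```bash
# Placeholder paths: <llm_export_path> is wherever the LLM was exported by the python snippet
python convert_hf_to_gguf.py --outfile <llm_export_path>/llava_llm.gguf <llm_export_path>
```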
