Building llama.cpp with BLAS support is highly recommended, as it has been shown to improve performance.
19
+
Building llama.cpp with BLAS support is highly recommended, as it has been shown to improve performance. Make sure OpenBLAS is installed in your environment.
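A quick way to confirm that OpenBLAS is visible before configuring the build is sketched below. It assumes the distribution ships an `openblas` pkg-config module or registers `libopenblas` in the dynamic linker cache; both names vary between distributions, and `have_openblas` is a hypothetical helper, not part of llama.cpp.

```shell
#!/usr/bin/env bash
# Hypothetical helper: returns 0 if OpenBLAS looks installed, 1 otherwise.
have_openblas() {
    # Assumption: the pkg-config module is named "openblas".
    if command -v pkg-config >/dev/null 2>&1 && pkg-config --exists openblas; then
        return 0
    fi
    # Fall back to searching the dynamic linker cache for libopenblas.
    if command -v ldconfig >/dev/null 2>&1 && ldconfig -p 2>/dev/null | grep -q 'libopenblas'; then
        return 0
    fi
    return 1
}

if have_openblas; then
    echo "OpenBLAS found"
else
    echo "OpenBLAS not found"
fi
```

A negative result does not prove OpenBLAS is missing (it may live in a non-standard prefix), so treat it as a hint rather than a hard check.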
- By default, NNPA is enabled when available. To disable it (not recommended):
46
+
47
+
```bash
cmake -S . -B build \
    -DCMAKE_BUILD_TYPE=Release \
    -DGGML_BLAS=ON \
    -DGGML_BLAS_VENDOR=OpenBLAS \
    -DGGML_NNPA=OFF

cmake --build build --config Release -j $(nproc)
```
56
+
57
+
- For debug builds:
45
58
46
59
```bash
cmake -S . -B build \
    -DCMAKE_BUILD_TYPE=Debug \
    -DGGML_BLAS=ON \
    -DGGML_BLAS_VENDOR=OpenBLAS

cmake --build build --config Debug -j $(nproc)
```
54
66
55
-
- For static builds, add `-DBUILD_SHARED_LIBS=OFF`:
67
+
- For static builds, add `-DBUILD_SHARED_LIBS=OFF`:
56
68
57
69
```bash
cmake -S . -B build \
```
@@ -70,12 +82,18 @@ All models need to be converted to Big-Endian. You can achieve this in three cas
70
82
71
83
1. **Use pre-converted models verified for use on IBM Z & LinuxONE (easiest)**
72
84
73
-
You can find popular models pre-converted and verified at [s390x Ready Models](hf.co/collections/taronaeo/s390x-ready-models-672765393af438d0ccb72a08).
85
+

74
86
75
-
These models and their respective tokenizers are verified to run correctly on IBM Z & LinuxONE.
87
+
You can find popular models pre-converted and verified at [s390x Ready Models](https://huggingface.co/collections/taronaeo/s390x-ready-models-672765393af438d0ccb72a08).
88
+
89
+
These models have already been converted from `safetensors` to `GGUF Big-Endian`, and their respective tokenizers have been verified to run correctly on IBM z15 and later systems.
76
90
77
91
2. **Convert safetensors model to GGUF Big-Endian directly (recommended)**
78
92
93
+

94
+
95
+
The model you are trying to convert must be in `safetensors` file format (for example [IBM Granite 3.3 2B](https://huggingface.co/ibm-granite/granite-3.3-2b-instruct)). Make sure you have downloaded the model repository for this case.
96
+
79
97
```bash
python3 convert_hf_to_gguf.py \
    --outfile model-name-be.f16.gguf \
```
@@ -96,32 +114,42 @@ All models need to be converted to Big-Endian. You can achieve this in three cas
96
114
97
115
3. **Convert existing GGUF Little-Endian model to Big-Endian**
98
116
117
+

118
+
119
+
The model you are trying to convert must be in `gguf` file format (for example [IBM Granite 3.3 2B](https://huggingface.co/ibm-granite/granite-3.3-2b-instruct-GGUF)). Make sure you have downloaded the model file for this case.
120
+
99
121
```bash
python3 gguf-py/gguf/scripts/gguf_convert_endian.py model-name.f16.gguf BIG
```
102
124
103
125
For example,
126
+
104
127
```bash
python3 gguf-py/gguf/scripts/gguf_convert_endian.py granite-3.3-2b-instruct-le.f16.gguf BIG
```
- The GGUF endian conversion script may not support all data types at the moment and may fail for some models/quantizations. When that happens, please try manually converting the safetensors model to GGUF Big-Endian via Step 2.
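
To check which byte order a GGUF file uses before or after conversion, the file header can be inspected directly. This is a minimal sketch, assuming a GGUFv3 file whose first four bytes are the magic `GGUF`, followed by a 32-bit version field. `gguf_endianness` is a hypothetical helper and is not part of llama.cpp.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print "little", "big", "unknown", or "not-gguf"
# based on the first 8 bytes of the file (assumes GGUF version 3).
gguf_endianness() {
    local file="$1" magic version
    magic=$(head -c 4 "$file")
    if [ "$magic" != "GGUF" ]; then
        echo "not-gguf"
        return 1
    fi
    # Bytes 4..7 hold the uint32 version; dump them as hex without spaces.
    version=$(od -An -tx1 -j4 -N4 "$file" | tr -d ' \n')
    case "$version" in
        03000000) echo "little" ;;   # version 3 stored least significant byte first
        00000003) echo "big" ;;      # version 3 stored most significant byte first
        *)        echo "unknown" ;;
    esac
}
```

Calling `gguf_endianness model-name.f16.gguf` on a file that reports `little` suggests it still needs one of the conversions above before it can be loaded on IBM Z.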
111
135
112
136
## IBM Accelerators
113
137
114
138
### 1. SIMD Acceleration
115
139
116
-
Only available in IBM z15 or later system with the `-DGGML_VXE=ON` (turned on by default) compile flag. No hardware acceleration is possible with llama.cpp with older systems, such as IBM z14 or EC13. In such systems, the APIs can still run but will use a scalar implementation.
140
+
Only available on IBM z15 or later systems with the `-DGGML_VXE=ON` compile flag (turned on by default). No hardware acceleration is possible with llama.cpp on older systems, such as IBM z14/arch12. On such systems, the APIs can still run but will use a scalar implementation.
117
141
118
-
### 2. zDNN Accelerator
142
+
### 2. NNPA Vector Intrinsics Acceleration
119
143
120
-
*Only available in IBM z16 or later system. No direction at the moment.*
144
+
Only available on IBM z16 or later systems with the `-DGGML_NNPA=ON` compile flag (turned on when available). No hardware acceleration is possible with llama.cpp on older systems, such as IBM z15/arch13. On such systems, the APIs can still run but will use a scalar implementation.
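
To see which of these facilities a given machine reports, the `features` line of `/proc/cpuinfo` can be inspected. The sketch below assumes the Linux s390x kernel exposes the flags `vxe`, `vxe2` and `nnpa` there. Mapping those flags to the `GGML_VXE` and `GGML_NNPA` build options is an assumption made for illustration, and `has_feature` and `report_accel` are hypothetical helpers.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeed if feature $1 appears as a whole word in list $2.
has_feature() {
    case " $2 " in
        *" $1 "*) return 0 ;;
        *)        return 1 ;;
    esac
}

# Hypothetical helper: print one "flag: yes|no" line per facility of interest.
report_accel() {
    local features="$1" flag
    for flag in vxe vxe2 nnpa; do
        if has_feature "$flag" "$features"; then
            echo "$flag: yes"
        else
            echo "$flag: no"
        fi
    done
}

# On a live system, take the list from the first "features" line, if any.
if [ -r /proc/cpuinfo ]; then
    report_accel "$(sed -n 's/^features[[:space:]]*:[[:space:]]*//p' /proc/cpuinfo | head -n 1)"
fi
```

Matching on whole words matters here, since `vx` is a prefix of both `vxe` and `vxe2`.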
121
145
122
-
### 3. Spyre Accelerator
146
+
### 3. zDNN Accelerator
123
147
124
-
*No direction at the moment.*
148
+
_Only available on IBM z16 or later systems. No direction at the moment._
149
+
150
+
### 4. Spyre Accelerator
151
+
152
+
_No direction at the moment._
125
153
126
154
## Performance Tuning
127
155
@@ -145,6 +173,22 @@ It is strongly recommended to disable SMT via the kernel boot parameters as it n
145
173
146
174
IBM VXE/VXE2 SIMD acceleration depends on the BLAS implementation. It is strongly recommended to use BLAS.
147
175
176
+
## Frequently Asked Questions (FAQ)
177
+
178
+
1. I'm getting the following error message while trying to load a model: `gguf_init_from_file_impl: failed to load model: this GGUF file version 50331648 is extremely large, is there a mismatch between the host and model endianness?`
179
+
180
+
Answer: Please ensure that the model you have downloaded/converted is GGUFv3 Big-Endian. These models are usually denoted with the `-be` suffix, e.g., `granite-3.3-2b-instruct-be.F16.gguf`.
181
+
182
+
You may refer to the [Getting GGUF Models](#getting-gguf-models) section to manually convert a `safetensors` model to GGUF Big-Endian.
183
+
184
+
2. I'm getting extremely poor performance when running inference on a model
185
+
186
+
Answer: Please refer to [Appendix B: SIMD Support Matrix](#appendix-b-simd-support-matrix) to check if your model quantization is supported by SIMD acceleration.
187
+
188
+
3. I'm building on IBM z17 and getting the following error message: `invalid switch -march=z17`
189
+
190
+
Answer: Please ensure that you are using GCC 15.1.0 or later and that `binutils` is updated to the latest version. If this does not fix the problem, please open an issue.
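
The version number in the error message from the first question can be checked by hand: 50331648 is `0x03000000`, which is what version 3 looks like when its four bytes are read in the opposite byte order. The arithmetic is sketched below; `swap32` is a hypothetical helper written only for this illustration.

```shell
#!/usr/bin/env bash
# Hypothetical helper: reverse the byte order of a 32-bit unsigned value.
swap32() {
    local v=$1
    echo $(( ((v & 0xFF) << 24) | ((v & 0xFF00) << 8) | ((v >> 8) & 0xFF00) | ((v >> 24) & 0xFF) ))
}

swap32 50331648   # prints 3
```

A reported version that byte-swaps to 3 is therefore a strong hint that the file is a valid GGUFv3 model in the wrong endianness for the host.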
191
+
148
192
## Getting Help on IBM Z & LinuxONE
149
193
150
194
1. **Bugs, Feature Requests**
@@ -155,3 +199,48 @@ IBM VXE/VXE2 SIMD acceleration depends on the BLAS implementation. It is strongl