This provides acceleration using the IBM zAIU co-processor located in the Telum I and Telum II processors. Make sure to have the [IBM zDNN library](https://github.com/IBM/zDNN) installed.
#### Compile zDNN from source
You may find the official build instructions here: [Building and Installing zDNN](https://github.com/IBM/zDNN?tab=readme-ov-file#building-and-installing-zdnn)
### Compilation
```bash
cmake -S . -B build \
-DCMAKE_BUILD_TYPE=Release \
-DGGML_ZDNN=ON
cmake --build build --config Release -j$(nproc)
```
## Getting GGUF Models
All models need to be converted to Big-Endian. You can achieve this in three cases:
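
The conversion is needed because s390x is a big-endian architecture, while GGUF files are typically produced little-endian. A minimal illustration of the byte-order difference (plain Python, not llama.cpp code):

```python
import struct

# The same 32-bit value serialized in both byte orders:
value = 0x01020304
little = struct.pack("<I", value)  # b'\x04\x03\x02\x01'
big = struct.pack(">I", value)     # b'\x01\x02\x03\x04'

# Identical bytes, opposite order -- this is what an endianness
# conversion of a GGUF file has to fix for every multi-byte field.
assert little == big[::-1]
```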
You can find popular models pre-converted and verified at [s390x Verified Models](https://huggingface.co/collections/taronaeo/s390x-verified-models-672765393af438d0ccb72a08) or [s390x Runnable Models](https://huggingface.co/collections/taronaeo/s390x-runnable-models-686e951824198df12416017e).
These models have already been converted from `safetensors` to `GGUF` Big-Endian and their respective tokenizers verified to run correctly on IBM z15 and later systems.
2. **Convert safetensors model to GGUF Big-Endian directly (recommended)**
The model you are trying to convert must be in `safetensors` file format (for example [IBM Granite 3.3 2B](https://huggingface.co/ibm-granite/granite-3.3-2b-instruct)). Make sure you have downloaded the model repository for this case.
Ensure that you have installed the required packages in advance:
```bash
pip3 install -r requirements.txt
```
Convert the `safetensors` model to `GGUF`:
```bash
python3 convert_hf_to_gguf.py \
--outfile model-name-be.f16.gguf \
--outtype f16 \
--bigendian \
model-directory/
```

3. **Convert existing GGUF Little-Endian model to Big-Endian**

The model you are trying to convert must be in `gguf` file format (for example [IBM Granite 3.3 2B GGUF](https://huggingface.co/ibm-granite/granite-3.3-2b-instruct-GGUF)). Make sure you have downloaded the model file for this case.
```bash
python3 gguf-py/gguf/scripts/gguf_convert_endian.py model-name.f16.gguf BIG
```
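
To sanity-check which byte order a GGUF file is in before or after conversion, the header can be inspected directly. A rough sketch, not part of llama.cpp, relying only on the GGUF layout of a 4-byte `GGUF` magic followed by a `uint32` version:

```python
import struct

def gguf_byteorder(path: str) -> str:
    """Guess a GGUF file's byte order from its header."""
    with open(path, "rb") as f:
        magic = f.read(4)
        if magic != b"GGUF":
            raise ValueError("not a GGUF file")
        (version_le,) = struct.unpack("<I", f.read(4))
    # The version is a small integer; if it decodes to a huge number
    # when read little-endian, the file is almost certainly big-endian.
    return "little" if version_le < 0x10000 else "big"
```

For example, a freshly converted `model-name-be.f16.gguf` should report `big`.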

## IBM Accelerators

### 1. SIMD Acceleration
Only available on IBM z15/LinuxONE 3 or later systems with the `-DGGML_VXE=ON` compile flag (turned on by default). No hardware acceleration is possible with llama.cpp on older systems such as IBM z14/arch12; on those systems the APIs can still run but will use a scalar implementation.
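
To check whether the host actually reports the relevant vector facilities, the `features` line of `/proc/cpuinfo` can be inspected. A small sketch; the flag names (`vx`, `vxe`, `vxe2`, `nnpa`) are assumptions based on common s390x kernel output:

```python
def has_cpu_feature(cpuinfo: str, flag: str) -> bool:
    # Look for `flag` in the space-separated "features" line of /proc/cpuinfo.
    for line in cpuinfo.splitlines():
        key, _, value = line.partition(":")
        if key.strip() == "features" and flag in value.split():
            return True
    return False

# On an s390x Linux host you could run:
# with open("/proc/cpuinfo") as f:
#     print(has_cpu_feature(f.read(), "vxe2"))
```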
### 2. NNPA Vector Intrinsics Acceleration
Only available on IBM z16/LinuxONE 4 or later systems with the `-DGGML_NNPA=ON` compile flag (turned off by default). No hardware acceleration is possible with llama.cpp on older systems such as IBM z15/arch13; on those systems the APIs can still run but will use a scalar implementation.
### 3. zDNN Accelerator (WIP)
Only available on IBM z17/LinuxONE 5 or later systems with the `-DGGML_ZDNN=ON` compile flag. No hardware acceleration is possible with llama.cpp on older systems such as IBM z15/arch13; on those systems the APIs will fall back to CPU routines.
### 4. Spyre Accelerator
_Only available on IBM z17/LinuxONE 5 or later systems. No support is currently available._
## Performance Tuning
Answer: Please ensure that your GCC compiler is at least version 15.1.0 and that `binutils` is updated to the latest version. If this does not fix the problem, kindly open an issue.
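
A quick way to verify the toolchain meets that requirement is to parse the compiler's reported version. A sketch; the command name `gcc` is assumed to be the compiler in use:

```python
import re
import subprocess

def parse_version(text: str) -> tuple:
    # Extract the first major.minor.patch triple from a version string.
    m = re.search(r"(\d+)\.(\d+)\.(\d+)", text)
    if not m:
        raise ValueError(f"unrecognized version string: {text!r}")
    return tuple(int(g) for g in m.groups())

def gcc_version() -> tuple:
    # `gcc -dumpfullversion` prints just the version, e.g. "15.1.0".
    out = subprocess.run(["gcc", "-dumpfullversion"],
                         capture_output=True, text=True, check=True).stdout
    return parse_version(out)

# gcc_version() >= (15, 1, 0) would satisfy the requirement above.
```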
4. Failing to install the `sentencepiece` package using GCC 15+
Answer: The `sentencepiece` team is aware of this as seen in [this issue](https://github.com/google/sentencepiece/issues/1108).
As a temporary workaround, please run the installation command with the following environment variables.
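
For example, if the failure is the GCC 15 `<cstdint>` regression described in that issue, forcing the header in via `CXXFLAGS` has been reported to help. This is a sketch, not an officially documented fix; adjust to whatever the issue thread currently recommends:

```shell
# Inject the <cstdint> include that newer GCC no longer pulls in transitively
CXXFLAGS="-include cstdint" pip3 install sentencepiece
```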

Answer: We are aware of this as detailed in [this issue](https://github.com/ggml-org/llama.cpp/issues/14877). Please either try reducing the number of threads or disable the compile option using `-DGGML_NNPA=OFF`.
## Getting Help on IBM Z & LinuxONE
1. **Bugs, Feature Requests**
## Appendix A: Hardware Support Matrix
|          | Support | Minimum Compiler Version |
| -------- | ------- | ------------------------ |
| IBM z15  | ✅      |                          |
| IBM z16  | ✅      |                          |
| IBM z17  | ✅      | GCC 15.1.0               |
| IBM zAIU | ✅      |                          |
- ✅ - supported and verified to run as intended
- 🚫 - unsupported, we are unlikely able to provide support