This provides acceleration using the IBM zAIU co-processor located in the Telum I and Telum II processors. Make sure to have the [IBM zDNN library](https://github.com/IBM/zDNN) installed.
#### Compile zDNN from source
You may find the official build instructions here: [Building and Installing zDNN](https://github.com/IBM/zDNN?tab=readme-ov-file#building-and-installing-zdnn)
### Compilation
```bash
cmake -S . -B build \
-DCMAKE_BUILD_TYPE=Release \
-DGGML_ZDNN=ON
cmake --build build --config Release -j$(nproc)
```
## Getting GGUF Models
All models need to be converted to Big-Endian. You can achieve this in three cases:
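
The conversion is needed because s390x is a big-endian architecture, while GGUF files are typically produced little-endian. A minimal illustration of the byte-order difference (plain Python, not llama.cpp code):

```python
import struct

# The same 32-bit value serialized in both byte orders:
value = 0x01020304
little = struct.pack("<I", value)  # b'\x04\x03\x02\x01'
big = struct.pack(">I", value)     # b'\x01\x02\x03\x04'

# Identical bytes, opposite order -- this is what an endianness
# conversion of a GGUF file has to fix for every multi-byte field.
assert little == big[::-1]
```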
You can find popular models pre-converted and verified at [s390x Verified Models](https://huggingface.co/collections/taronaeo/s390x-verified-models-672765393af438d0ccb72a08) or [s390x Runnable Models](https://huggingface.co/collections/taronaeo/s390x-runnable-models-686e951824198df12416017e).
These models have already been converted from `safetensors` to `GGUF` Big-Endian and their respective tokenizers verified to run correctly on IBM z15 and later systems.
2. **Convert safetensors model to GGUF Big-Endian directly (recommended)**
The model you are trying to convert must be in `safetensors` file format (for example [IBM Granite 3.3 2B](https://huggingface.co/ibm-granite/granite-3.3-2b-instruct)). Make sure you have downloaded the model repository for this case.
Ensure that you have installed the required packages in advance:
```bash
pip3 install -r requirements.txt
```
Convert the `safetensors` model to `GGUF`:
```bash
python3 convert_hf_to_gguf.py \
--outfile model-name-be.f16.gguf \
--outtype f16 \
--bigendian \
model-directory/
```

3. **Convert existing GGUF Little-Endian model to Big-Endian**

The model you are trying to convert must be in `gguf` file format (for example [IBM Granite 3.3 2B GGUF](https://huggingface.co/ibm-granite/granite-3.3-2b-instruct-GGUF)). Make sure you have downloaded the model file for this case.
```bash
python3 gguf-py/gguf/scripts/gguf_convert_endian.py model-name.f16.gguf BIG
```
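
To sanity-check which byte order a GGUF file is in before or after conversion, the header can be inspected directly. A rough sketch, not part of llama.cpp, relying only on the GGUF layout of a 4-byte `GGUF` magic followed by a `uint32` version:

```python
import struct

def gguf_byteorder(path: str) -> str:
    """Guess a GGUF file's byte order from its header."""
    with open(path, "rb") as f:
        magic = f.read(4)
        if magic != b"GGUF":
            raise ValueError("not a GGUF file")
        (version_le,) = struct.unpack("<I", f.read(4))
    # The version is a small integer; if it decodes to a huge number
    # when read little-endian, the file is almost certainly big-endian.
    return "little" if version_le < 0x10000 else "big"
```

For example, a freshly converted `model-name-be.f16.gguf` should report `big`.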

## IBM Accelerators

### 1. SIMD Acceleration
Only available on IBM z15/LinuxONE 3 or later systems with the `-DGGML_VXE=ON` compile flag (turned on by default). No hardware acceleration is possible with llama.cpp on older systems such as IBM z14/arch12; on those systems the APIs can still run but will use a scalar implementation.
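
To check whether the host actually reports the relevant vector facilities, the `features` line of `/proc/cpuinfo` can be inspected. A small sketch; the flag names (`vx`, `vxe`, `vxe2`, `nnpa`) are assumptions based on common s390x kernel output:

```python
def has_cpu_feature(cpuinfo: str, flag: str) -> bool:
    # Look for `flag` in the space-separated "features" line of /proc/cpuinfo.
    for line in cpuinfo.splitlines():
        key, _, value = line.partition(":")
        if key.strip() == "features" and flag in value.split():
            return True
    return False

# On an s390x Linux host you could run:
# with open("/proc/cpuinfo") as f:
#     print(has_cpu_feature(f.read(), "vxe2"))
```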
### 2. NNPA Vector Intrinsics Acceleration
Only available on IBM z16/LinuxONE 4 or later systems with the `-DGGML_NNPA=ON` compile flag (turned off by default). No hardware acceleration is possible with llama.cpp on older systems such as IBM z15/arch13; on those systems the APIs can still run but will use a scalar implementation.
### 3. zDNN Accelerator (WIP)
Only available on IBM z17/LinuxONE 5 or later systems with the `-DGGML_ZDNN=ON` compile flag. No hardware acceleration is possible with llama.cpp on older systems such as IBM z15/arch13; on those systems the APIs will fall back to CPU routines.
### 4. Spyre Accelerator
_Only available on IBM z17/LinuxONE 5 or later systems. No support is currently available._
## Performance Tuning
Answer: Please ensure that your GCC compiler is at least version 15.1.0 and that `binutils` is updated to the latest version. If this does not fix the problem, kindly open an issue.
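
A quick way to verify the toolchain meets that requirement is to parse the compiler's reported version. A sketch; the command name `gcc` is assumed to be the compiler in use:

```python
import re
import subprocess

def parse_version(text: str) -> tuple:
    # Extract the first major.minor.patch triple from a version string.
    m = re.search(r"(\d+)\.(\d+)\.(\d+)", text)
    if not m:
        raise ValueError(f"unrecognized version string: {text!r}")
    return tuple(int(g) for g in m.groups())

def gcc_version() -> tuple:
    # `gcc -dumpfullversion` prints just the version, e.g. "15.1.0".
    out = subprocess.run(["gcc", "-dumpfullversion"],
                         capture_output=True, text=True, check=True).stdout
    return parse_version(out)

# gcc_version() >= (15, 1, 0) would satisfy the requirement above.
```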
4. Failing to install the `sentencepiece` package using GCC 15+
Answer: The `sentencepiece` team is aware of this as seen in [this issue](https://github.com/google/sentencepiece/issues/1108).
As a temporary workaround, please run the installation command with the following environment variables.
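
For example, if the failure is the GCC 15 `<cstdint>` regression described in that issue, forcing the header in via `CXXFLAGS` has been reported to help. This is a sketch, not an officially documented fix; adjust to whatever the issue thread currently recommends:

```shell
# Inject the <cstdint> include that newer GCC no longer pulls in transitively
CXXFLAGS="-include cstdint" pip3 install sentencepiece
```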

Answer: We are aware of this as detailed in [this issue](https://github.com/ggml-org/llama.cpp/issues/14877). Please either try reducing the number of threads or disable the compile option using `-DGGML_NNPA=OFF`.
## Getting Help on IBM Z & LinuxONE
1. **Bugs, Feature Requests**
## Appendix A: Hardware Support Matrix
|          | Support | Minimum Compiler Version |
| -------- | ------- | ------------------------ |
| IBM z15  | ✅      |                          |
| IBM z16  | ✅      |                          |
| IBM z17  | ✅      | GCC 15.1.0               |
| IBM zAIU | ✅      |                          |
- ✅ - supported and verified to run as intended
- 🚫 - unsupported, we are unlikely able to provide support