
Commit 017761d

z-vishal and Manoj Kumar authored
ggml-zendnn : add ZenDNN backend for AMD CPUs (ggml-org#17690)
* ggml-zendnn : add ZenDNN backend support
* ggml-zendnn : address ZenDNN backend review fixes and suggestions
* docs : apply blockquote syntax to ZenDNN docs

---------

Co-authored-by: Manoj Kumar <[email protected]>
1 parent c42712b commit 017761d


13 files changed: +19740 / -109 lines


README.md

Lines changed: 1 addition & 0 deletions
@@ -276,6 +276,7 @@ Instructions for adding support for new models: [HOWTO-add-model.md](docs/develo
 | [MUSA](docs/build.md#musa) | Moore Threads GPU |
 | [CUDA](docs/build.md#cuda) | Nvidia GPU |
 | [HIP](docs/build.md#hip) | AMD GPU |
+| [ZenDNN](docs/build.md#zendnn) | AMD CPU |
 | [Vulkan](docs/build.md#vulkan) | GPU |
 | [CANN](docs/build.md#cann) | Ascend NPU |
 | [OpenCL](docs/backend/OPENCL.md) | Adreno GPU |

docs/backend/ZenDNN.md

Lines changed: 258 additions & 0 deletions
@@ -0,0 +1,258 @@
# llama.cpp for AMD ZenDNN

> [!WARNING]
> **Note:** ZenDNN is **not** the same as zDNN.
> - **ZenDNN** (this page): AMD's deep learning library for AMD EPYC CPUs
> - **zDNN**: IBM's Deep Neural Network acceleration library for IBM Z & LinuxONE Mainframes ([see zDNN documentation](zDNN.md))

- [Background](#background)
- [OS](#os)
- [Hardware](#hardware)
- [Supported Operations](#supported-operations)
- [Supported Data Types](#supported-data-types)
- [Linux](#linux)
- [Environment Variables](#environment-variables)
- [Performance Optimization](#performance-optimization)
- [Known Issues](#known-issues)
- [TODO](#todo)

## Background

**ZenDNN** (Zen Deep Neural Network Library) is AMD's high-performance deep learning inference library optimized for AMD EPYC™ CPUs. It provides optimized implementations of key deep learning primitives and operations, delivering significant performance improvements for neural network workloads on AMD Zen-based processor architectures.

**Llama.cpp + ZenDNN**

The llama.cpp ZenDNN backend leverages AMD's optimized matrix multiplication primitives to accelerate inference on AMD CPUs. It utilizes ZenDNN's **LowOHA (Low Overhead Hardware Accelerated)** MatMul operator for efficient GEMM operations with minimal execution overhead, built-in weight caching, and direct access to backend libraries (AOCL BLIS, LibXSMM, OneDNN).

For more information about ZenDNN, visit: https://www.amd.com/en/developer/zendnn.html

## OS

| OS    | Status  | Verified                   |
|:-----:|:-------:|:--------------------------:|
| Linux | Support | Ubuntu 20.04, 22.04, 24.04 |

For the latest list of supported operating systems, see the [ZenDNN Supported OS](https://github.com/amd/ZenDNN/blob/zendnnl/README.md#15-supported-os).

## Hardware

### AMD CPUs

**Recommended Processors**

ZenDNN is optimized for AMD EPYC™ and AMD Ryzen™ processors based on the "Zen" microarchitecture and newer.

| CPU Family                     | Status  | Notes                              |
|:------------------------------:|:-------:|:----------------------------------:|
| AMD EPYC™ 9005 Series (Turin)  | Support | 5th Gen - Zen 5 architecture       |
| AMD EPYC™ 9004 Series (Genoa)  | Support | 4th Gen - Zen 4 architecture       |
| AMD EPYC™ 7003 Series (Milan)  | Support | 3rd Gen - Zen 3 architecture       |
| AMD Ryzen™ AI MAX (Strix Halo) | Support | High-performance mobile processors |

*Notes:*

- Best performance is achieved on AMD EPYC™ processors with high core counts (e.g., the EPYC 9005 series).
- ZenDNN leverages AMD's advanced CPU features, including the AVX2 and AVX-512 instruction sets.
- For optimal performance, ensure your system has sufficient memory bandwidth.

## Supported Operations

The ZenDNN backend currently accelerates **matrix multiplication (MUL_MAT)** operations only. Other operations are handled by the standard CPU backend.

| Operation | Status  | Notes                                |
|:----------|:-------:|:------------------------------------:|
| MUL_MAT   | Support | Accelerated via ZenDNN LowOHA MatMul |

*Note:* Since only MUL_MAT is accelerated, models benefit most from ZenDNN when matrix multiplications dominate the computational workload, which is typical for transformer-based LLMs.

## Supported Data Types

| Data Type | Status  | Notes                                      |
|:---------:|:-------:|:------------------------------------------:|
| FP32      | Support | Full-precision floating point              |
| BF16      | Support | BFloat16 (best performance on Zen 4/Zen 5) |

*Notes:*

- **BF16** provides the best performance on Zen 4 and Zen 5 EPYC™ processors (Genoa, Turin).

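If your model is not already available as a BF16 GGUF, one way to produce one is with the `convert_hf_to_gguf.py` script that ships in the llama.cpp repository root. This is a sketch only; the input and output paths are placeholders you should adjust for your setup:

```sh
# Convert a Hugging Face checkpoint to a BF16 GGUF (paths are placeholders)
python3 convert_hf_to_gguf.py /path/to/hf-model \
    --outtype bf16 \
    --outfile models/my-model.BF16.gguf
```
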
## Linux

### I. Setup Environment

You have two options to set up ZenDNN:

#### Option 1: Automatic Download and Build (Recommended)

CMake will automatically download and build ZenDNN for you:

```sh
# Build llama.cpp - ZenDNN will be automatically downloaded and built
cmake -B build -DGGML_ZENDNN=ON -DCMAKE_BUILD_TYPE=Release
cmake --build build --config Release -j $(nproc)
```

No manual ZenDNN installation is required; CMake handles everything automatically.

#### Option 2: Use Custom ZenDNN Installation

If you want to build ZenDNN yourself or use a specific version:

**Step 1: Build ZenDNN from source**

```sh
# Clone ZenDNN repository
git clone https://github.com/amd/ZenDNN.git
cd ZenDNN
git checkout zendnnl

# Build and install (requires CMake >= 3.25)
mkdir build && cd build
cmake ..
cmake --build . --target all
```

Default installation path: `ZenDNN/build/install`

**For detailed build instructions**, refer to the [ZenDNN README](https://github.com/amd/ZenDNN/blob/zendnnl/README.md).

**Step 2: Build llama.cpp with custom ZenDNN path**

```sh
# Using environment variable
export ZENDNN_ROOT=/path/to/ZenDNN/build/install
cmake -B build -DGGML_ZENDNN=ON -DCMAKE_BUILD_TYPE=Release
cmake --build build --config Release -j $(nproc)

# OR specify path directly in CMake
cmake -B build -DGGML_ZENDNN=ON -DZENDNN_ROOT=/path/to/ZenDNN/build/install -DCMAKE_BUILD_TYPE=Release
cmake --build build --config Release -j $(nproc)
```

### II. Run the Server

#### 1. Download Model

Download the LLaMA 3.1 8B Instruct BF16 model:

```sh
# Download from Hugging Face
huggingface-cli download meta-llama/Llama-3.1-8B-Instruct-GGUF --local-dir models/
```

#### 2. Start Server

Run the llama.cpp server with ZenDNN acceleration:

```sh
# Set optimal configuration
export OMP_NUM_THREADS=64        # Adjust to your CPU core count
export ZENDNNL_MATMUL_ALGO=2     # Blocked AOCL BLIS for best performance

# Start server
./build/bin/llama-server \
    -m models/Llama-3.1-8B-Instruct.BF16.gguf \
    --host 0.0.0.0 \
    --port 8080 \
    -t 64
```

Access the server at `http://localhost:8080`.

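Once the server is up, a quick way to exercise it is through the OpenAI-compatible chat endpoint exposed by `llama-server`. This is a minimal smoke-test sketch; the prompt and token limit are arbitrary:

```sh
# Minimal smoke test against the running server
curl http://localhost:8080/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{"messages": [{"role": "user", "content": "Say hello in one sentence."}], "max_tokens": 32}'
```
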
**Performance tips**:

- Set `OMP_NUM_THREADS` to match your physical core count (one way to determine it is shown in the sketch below)
- Use `ZENDNNL_MATMUL_ALGO=2` for optimal performance
- For NUMA systems: `numactl --cpunodebind=0 --membind=0 ./build/bin/llama-server ...`

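If you are unsure how many physical cores the machine has, the following sketch uses standard Linux tooling (nothing ZenDNN-specific) to derive the count from `lscpu`:

```sh
# Physical core count = number of unique (core, socket) pairs reported by lscpu
PHYS_CORES=$(lscpu -p=CORE,SOCKET | grep -v '^#' | sort -u | wc -l)
export OMP_NUM_THREADS=$PHYS_CORES
echo "OMP_NUM_THREADS=$OMP_NUM_THREADS"
```
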
## Environment Variables

### Build Time

| Name        | Value                       | Function                           |
|-------------|-----------------------------|------------------------------------|
| GGML_ZENDNN | ON/OFF                      | Enable ZenDNN backend support      |
| ZENDNN_ROOT | Path to ZenDNN installation | Set ZenDNN installation directory  |
| GGML_OPENMP | ON/OFF (recommended: ON)    | Enable OpenMP for multi-threading  |

### Runtime

| Name                      | Value             | Function                                                         |
|---------------------------|-------------------|------------------------------------------------------------------|
| OMP_NUM_THREADS           | Number (e.g., 64) | Set number of OpenMP threads (recommended: physical core count)  |
| ZENDNNL_MATMUL_ALGO       | 0-5               | Select MatMul backend algorithm (see Performance Optimization)   |
| ZENDNNL_PROFILE_LOG_LEVEL | 0-4               | Profiling log level (0 = disabled, 4 = verbose)                  |
| ZENDNNL_ENABLE_PROFILER   | 0 or 1            | Enable detailed profiling (1 = enabled)                          |
| ZENDNNL_API_LOG_LEVEL     | 0-4               | API log level (0 = disabled, 4 = verbose)                        |

**Example**:

```sh
export OMP_NUM_THREADS=64
export ZENDNNL_MATMUL_ALGO=2    # Use Blocked AOCL BLIS for best performance
./build/bin/llama-cli -m models/llama-2-7b.Q4_0.gguf -p "Test" -n 100
```

## Performance Optimization

### MatMul Algorithm Selection

ZenDNN's LowOHA MatMul supports multiple backend algorithms. For **best performance**, use the **Blocked AOCL BLIS** algorithm:

```sh
export ZENDNNL_MATMUL_ALGO=2    # Blocked AOCL BLIS (recommended)
```

**Available algorithms**:

| Value | Algorithm         | Description                            |
|:-----:|:------------------|:---------------------------------------|
| 0     | Dynamic Dispatch  | Automatic backend selection (default)  |
| 1     | AOCL BLIS         | AOCL BLIS backend                      |
| 2     | AOCL BLIS Blocked | **Blocked AOCL BLIS (recommended)**    |
| 3     | OneDNN            | OneDNN backend                         |
| 4     | OneDNN Blocked    | Blocked OneDNN                         |
| 5     | LibXSMM           | LibXSMM backend                        |

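The best choice can vary with model size and CPU generation, so it is worth measuring on your own machine. A simple sweep with `llama-bench` is sketched below; the model path and thread count are taken from the earlier server example and should be adjusted to your setup:

```sh
# Benchmark each MatMul backend and compare the reported tokens/s
for algo in 0 1 2 3 4 5; do
    echo "=== ZENDNNL_MATMUL_ALGO=$algo ==="
    ZENDNNL_MATMUL_ALGO=$algo ./build/bin/llama-bench \
        -m models/Llama-3.1-8B-Instruct.BF16.gguf -t 64
done
```
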
### Profiling and Debugging

For detailed profiling and logging options, refer to the [ZenDNN Logging Documentation](https://github.com/amd/ZenDNN/blob/zendnnl/docs/logging.md).

## Known Issues

- **Limited operation support**: Currently only matrix multiplication (MUL_MAT) is accelerated via ZenDNN. Other operations fall back to the standard CPU backend.
- **BF16 support**: BF16 operations require the AMD Zen 4 or Zen 5 architecture (EPYC 9004/9005 series). On older CPUs, operations will use FP32 (a quick CPU check is shown below).
- **NUMA awareness**: For multi-socket systems, manual NUMA binding may be required for optimal performance.

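To check whether your CPU advertises native BF16 support, you can look for the `avx512_bf16` flag, which is present on Zen 4/Zen 5 parts. This is a generic Linux check, not ZenDNN-specific:

```sh
# Prints "avx512_bf16" if the CPU exposes the BF16 instruction set extension
grep -om1 'avx512_bf16' /proc/cpuinfo || echo "no avx512_bf16 flag found"
```
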
## Q&A

**Q: How do I verify that the ZenDNN backend is being used?**

A: Check the log output when running llama.cpp. You should see messages indicating that the ZenDNN backend is initialized. You can also check the backend name in the output.

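One simple approach (a sketch; it just greps the log of a short run for the backend name) is:

```sh
# Run a tiny generation and search the startup log for ZenDNN mentions
./build/bin/llama-cli -m models/Llama-3.1-8B-Instruct.BF16.gguf -p "Hello" -n 8 2>&1 | grep -i zendnn
```
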
**Q: What performance improvement can I expect?**

A: Performance gains vary depending on the model size, batch size, and CPU architecture. On AMD EPYC processors, you can typically expect a 1.1x-2x speedup over standard CPU inference for matrix multiplication operations.

**Q: Can I use ZenDNN on non-AMD processors?**

A: ZenDNN is optimized specifically for AMD processors. While it may work on other x86-64 CPUs, performance benefits are only guaranteed on AMD Zen-based architectures.

**Q: Does ZenDNN support quantized models?**

A: Currently, ZenDNN primarily supports the FP32 and BF16 data types. Quantized model support is not available at this time.

**Q: Why is my inference not faster with ZenDNN?**

A: Ensure that:

1. You're using an AMD EPYC or Ryzen processor (Zen 2 or newer)
2. `OMP_NUM_THREADS` is set appropriately (physical core count)
3. `ZENDNNL_MATMUL_ALGO=2` is set for best performance (Blocked AOCL BLIS)
4. You're using a sufficiently large model (small models may not benefit as much)
5. Profiling is enabled to verify the ZenDNN MatMul is being called (see the example below)

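The profiling variables from the Runtime table above can confirm whether the ZenDNN MatMul is actually invoked. For example:

```sh
# Enable ZenDNN profiling for a short run
export ZENDNNL_ENABLE_PROFILER=1
export ZENDNNL_PROFILE_LOG_LEVEL=3
./build/bin/llama-cli -m models/Llama-3.1-8B-Instruct.BF16.gguf -p "Test" -n 32
```
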
### GitHub Contribution

Please add the **[ZenDNN]** prefix/tag to issue and PR titles so the ZenDNN team can triage and address them without delay.

## TODO

- Expand operation support beyond MUL_MAT (attention operations, activations, etc.)

docs/backend/zDNN.md

Lines changed: 5 additions & 0 deletions
@@ -1,5 +1,10 @@
 # llama.cpp for IBM zDNN Accelerator

+> [!WARNING]
+> **Note:** zDNN is **not** the same as ZenDNN.
+> - **zDNN** (this page): IBM's Deep Neural Network acceleration library for IBM Z & LinuxONE Mainframes
+> - **ZenDNN**: AMD's deep learning library for AMD EPYC CPUs ([see ZenDNN documentation](ZenDNN.md))
+
 ## Background

 IBM zDNN (Z Deep Neural Network) is a hardware acceleration library designed specifically to leverage the IBM NNPA (Neural Network Processor Assist) accelerator located within IBM Telum I and II processors. It provides significant performance improvements for neural network inference operations.

docs/build.md

Lines changed: 32 additions & 0 deletions
@@ -495,6 +495,38 @@ llama_new_context_with_model: CANN compute buffer size = 1260.81 MiB
For detailed info, such as model/device supports, CANN install, please refer to [llama.cpp for CANN](./backend/CANN.md).

## ZenDNN

ZenDNN provides optimized deep learning primitives for AMD EPYC™ CPUs. It accelerates matrix multiplication operations for inference workloads.

### Compilation

- Using `CMake` on Linux (automatic build):

```bash
cmake -B build -DGGML_ZENDNN=ON
cmake --build build --config Release
```

The first build will automatically download and build ZenDNN, which may take 5-10 minutes. Subsequent builds will be much faster.

- Using `CMake` with a custom ZenDNN installation:

```bash
cmake -B build -DGGML_ZENDNN=ON -DZENDNN_ROOT=/path/to/zendnn/install
cmake --build build --config Release
```

### Testing

You can test with:

```bash
./build/bin/llama-cli -m PATH_TO_MODEL -p "Building a website can be done in 10 steps:" -n 50
```

For detailed information about hardware support, setup instructions, and performance optimization, refer to [llama.cpp for ZenDNN](./backend/ZenDNN.md).

## Arm® KleidiAI™

KleidiAI is a library of optimized microkernels for AI workloads, specifically designed for Arm CPUs. These microkernels enhance performance and can be enabled for use by the CPU backend.
