# llama.cpp for AMD ZenDNN

> [!WARNING]
> **Note:** ZenDNN is **not** the same as zDNN.
> - **ZenDNN** (this page): AMD's deep learning library for AMD EPYC CPUs
> - **zDNN**: IBM's Deep Neural Network acceleration library for IBM Z & LinuxONE Mainframes ([see zDNN documentation](zDNN.md))

- [Background](#background)
- [OS](#os)
- [Hardware](#hardware)
- [Supported Operations](#supported-operations)
- [DataType Supports](#datatype-supports)
- [Linux](#linux)
- [Environment Variable](#environment-variable)
- [Performance Optimization](#performance-optimization)
- [Known Issues](#known-issues)
- [Q&A](#qa)
- [TODO](#todo)

## Background

**ZenDNN** (Zen Deep Neural Network Library) is AMD's high-performance deep learning inference library optimized for AMD EPYC™ CPUs. It provides optimized implementations of key deep learning primitives and operations, delivering significant performance improvements for neural network workloads on AMD Zen-based processor architectures.

**Llama.cpp + ZenDNN**

The llama.cpp ZenDNN backend leverages AMD's optimized matrix multiplication primitives to accelerate inference on AMD CPUs. It utilizes ZenDNN's **LowOHA (Low Overhead Hardware Accelerated)** MatMul operator for efficient GEMM operations with minimal execution overhead, built-in weight caching, and direct access to backend libraries (AOCL BLIS, LibXSMM, OneDNN).

For more information about ZenDNN, visit: https://www.amd.com/en/developer/zendnn.html

## OS

|   OS    | Status  |          Verified          |
|:-------:|:-------:|:--------------------------:|
|  Linux  | Support | Ubuntu 20.04, 22.04, 24.04 |

For the latest list of supported operating systems, see the [ZenDNN Supported OS](https://github.com/amd/ZenDNN/blob/zendnnl/README.md#15-supported-os).

## Hardware

### AMD CPUs

**Recommended Processors**

ZenDNN is optimized for AMD EPYC™ processors and AMD Ryzen™ processors based on "Zen" microarchitecture and newer.

|           CPU Family           | Status  |               Notes                |
|:------------------------------:|:-------:|:----------------------------------:|
| AMD EPYC™ 9005 Series (Turin)  | Support | 5th Gen - Zen 5 architecture       |
| AMD EPYC™ 9004 Series (Genoa)  | Support | 4th Gen - Zen 4 architecture       |
| AMD EPYC™ 7003 Series (Milan)  | Support | 3rd Gen - Zen 3 architecture       |
| AMD Ryzen™ AI MAX (Strix Halo) | Support | High-performance mobile processors |

*Notes:*

- Best performance is achieved on AMD EPYC™ processors with high core counts (e.g., EPYC 9005 series).
- ZenDNN leverages AMD's advanced CPU features, including the AVX2 and AVX-512 instruction sets (a quick way to check which flags your CPU exposes is shown below).
- For optimal performance, ensure your system has sufficient memory bandwidth.
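
A minimal way to check which of these instruction sets your CPU reports, using only standard Linux tools (no ZenDNN-specific command is assumed):

```sh
# List the CPU feature flags and keep only the relevant vector extensions
lscpu | grep -oE 'avx2|avx512f|avx512_bf16' | sort -u
```

Seeing `avx512_bf16` in the output indicates native BF16 support (Zen 4 / Zen 5).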

## Supported Operations

The ZenDNN backend currently accelerates **matrix multiplication (MUL_MAT)** operations only. Other operations are handled by the standard CPU backend.

| Operation | Status  |                Notes                 |
|:----------|:-------:|:------------------------------------:|
| MUL_MAT   |    ✓    | Accelerated via ZenDNN LowOHA MatMul |

*Note:* Since only MUL_MAT is accelerated, models will benefit most from ZenDNN when matrix multiplications dominate the computational workload (which is typical for transformer-based LLMs).
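
If the llama.cpp tests are built, the `test-backend-ops` tool gives a quick way to check that MUL_MAT runs correctly (and how fast) on the available backends; treat the invocation below as a sketch, since the tool's options can change between versions:

```sh
# Correctness check restricted to MUL_MAT
./build/bin/test-backend-ops test -o MUL_MAT

# Throughput measurement restricted to MUL_MAT
./build/bin/test-backend-ops perf -o MUL_MAT
```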

## DataType Supports

| DataType | Status  |                   Notes                    |
|:--------:|:-------:|:------------------------------------------:|
|   FP32   | Support |       Full precision floating point        |
|   BF16   | Support | BFloat16 (best performance on Zen 4/Zen 5) |

*Notes:*

- **BF16** provides best performance on Zen 4 and Zen 5 EPYC™ processors (Genoa, Turin).
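
If your model is not already a BF16 GGUF, llama.cpp's conversion script can emit one directly from the original Hugging Face checkpoint (the paths below are placeholders):

```sh
# Convert a Hugging Face checkpoint to a BF16 GGUF (example paths)
python convert_hf_to_gguf.py /path/to/hf-model --outtype bf16 --outfile models/model-bf16.gguf
```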

## Linux

### I. Setup Environment

You have two options to set up ZenDNN:

#### Option 1: Automatic Download and Build (Recommended)

CMake will automatically download and build ZenDNN for you:

```sh
# Build llama.cpp - ZenDNN will be automatically downloaded and built
cmake -B build -DGGML_ZENDNN=ON -DCMAKE_BUILD_TYPE=Release
cmake --build build --config Release -j $(nproc)
```

No manual ZenDNN installation is required; CMake handles the download and build automatically.

#### Option 2: Use Custom ZenDNN Installation

If you want to build ZenDNN yourself or use a specific version:

**Step 1: Build ZenDNN from source**

```sh
# Clone ZenDNN repository
git clone https://github.com/amd/ZenDNN.git
cd ZenDNN
git checkout zendnnl

# Build and install (requires CMake >= 3.25)
mkdir build && cd build
cmake ..
cmake --build . --target all
```

Default installation path: `ZenDNN/build/install`

**For detailed build instructions**, refer to the [ZenDNN README](https://github.com/amd/ZenDNN/blob/zendnnl/README.md).

**Step 2: Build llama.cpp with custom ZenDNN path**

```sh
# Using environment variable
export ZENDNN_ROOT=/path/to/ZenDNN/build/install
cmake -B build -DGGML_ZENDNN=ON -DCMAKE_BUILD_TYPE=Release
cmake --build build --config Release -j $(nproc)

# OR specify path directly in CMake
cmake -B build -DGGML_ZENDNN=ON -DZENDNN_ROOT=/path/to/ZenDNN/build/install -DCMAKE_BUILD_TYPE=Release
cmake --build build --config Release -j $(nproc)
```

### II. Run the Server

#### 1. Download Model

Download the Llama 3.1 8B Instruct BF16 model:

```sh
# Download from Hugging Face
huggingface-cli download meta-llama/Llama-3.1-8B-Instruct-GGUF --local-dir models/
```

#### 2. Start Server

Run the llama.cpp server with ZenDNN acceleration:

```sh
# Set optimal configuration
export OMP_NUM_THREADS=64     # Adjust to your CPU core count
export ZENDNNL_MATMUL_ALGO=2  # Blocked AOCL BLIS for best performance

# Start server
./build/bin/llama-server \
  -m models/Llama-3.1-8B-Instruct.BF16.gguf \
  --host 0.0.0.0 \
  --port 8080 \
  -t 64
```

Access the server at `http://localhost:8080`.
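
Once the server is up, a quick request against its health and OpenAI-compatible chat endpoints confirms end-to-end inference (the prompt and token limit are arbitrary examples):

```sh
# Health check
curl http://localhost:8080/health

# Chat completion via the OpenAI-compatible API
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"messages": [{"role": "user", "content": "Hello!"}], "max_tokens": 64}'
```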

**Performance tips**:
- Set `OMP_NUM_THREADS` to match your physical core count
- Use `ZENDNNL_MATMUL_ALGO=2` for optimal performance
- For NUMA systems: `numactl --cpunodebind=0 --membind=0 ./build/bin/llama-server ...`
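
On a multi-socket system, pinning both execution and memory allocation to a single NUMA node avoids cross-socket memory traffic. A sketch for a dual-socket machine (node number and thread count are examples; adjust them to your topology):

```sh
# Inspect the NUMA topology
numactl --hardware

# Bind the server to NUMA node 0 (example values)
export OMP_NUM_THREADS=64    # physical cores available in node 0
numactl --cpunodebind=0 --membind=0 \
  ./build/bin/llama-server -m models/Llama-3.1-8B-Instruct.BF16.gguf --host 0.0.0.0 --port 8080 -t 64
```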

## Environment Variable

### Build Time

| Name        | Value                       | Function                          |
|-------------|-----------------------------|-----------------------------------|
| GGML_ZENDNN | ON/OFF                      | Enable ZenDNN backend support     |
| ZENDNN_ROOT | Path to ZenDNN installation | Set ZenDNN installation directory |
| GGML_OPENMP | ON/OFF (recommended: ON)    | Enable OpenMP for multi-threading |

*Note:* `GGML_ZENDNN` and `GGML_OPENMP` are passed to CMake as `-D` options; `ZENDNN_ROOT` can be set either as a CMake option or as an environment variable (see the build instructions above).

### Runtime

| Name                      | Value             | Function                                                         |
|---------------------------|-------------------|------------------------------------------------------------------|
| OMP_NUM_THREADS           | Number (e.g., 64) | Set number of OpenMP threads (recommended: physical core count)  |
| ZENDNNL_MATMUL_ALGO       | 0-5               | Select MatMul backend algorithm (see Performance Optimization)   |
| ZENDNNL_PROFILE_LOG_LEVEL | 0-4               | Profiling log level (0=disabled, 4=verbose)                      |
| ZENDNNL_ENABLE_PROFILER   | 0 or 1            | Enable detailed profiling (1=enabled)                            |
| ZENDNNL_API_LOG_LEVEL     | 0-4               | API log level (0=disabled, 4=verbose)                            |

**Example**:

```sh
export OMP_NUM_THREADS=64
export ZENDNNL_MATMUL_ALGO=2  # Use Blocked AOCL BLIS for best performance
./build/bin/llama-cli -m models/Llama-3.1-8B-Instruct.BF16.gguf -p "Test" -n 100
```

## Performance Optimization

### MatMul Algorithm Selection

ZenDNN's LowOHA MatMul supports multiple backend algorithms. For **best performance**, use the **Blocked AOCL BLIS** algorithm:

```sh
export ZENDNNL_MATMUL_ALGO=2  # Blocked AOCL BLIS (recommended)
```

**Available algorithms**:

| Value | Algorithm         | Description                           |
|:-----:|:------------------|:--------------------------------------|
|   0   | Dynamic Dispatch  | Automatic backend selection (default) |
|   1   | AOCL BLIS         | AOCL BLIS backend                     |
|   2   | AOCL BLIS Blocked | **Blocked AOCL BLIS (recommended)**   |
|   3   | OneDNN            | OneDNN backend                        |
|   4   | OneDNN Blocked    | Blocked OneDNN                        |
|   5   | LibXSMM           | LibXSMM backend                       |
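
The best-performing algorithm can depend on the model, batch size, and core count, so it is worth sweeping the options on your own hardware. A simple sweep using `llama-bench` (model path and thread count are examples):

```sh
# Benchmark each MatMul backend algorithm
for algo in 0 1 2 3 4 5; do
  echo "=== ZENDNNL_MATMUL_ALGO=$algo ==="
  ZENDNNL_MATMUL_ALGO=$algo ./build/bin/llama-bench -m models/Llama-3.1-8B-Instruct.BF16.gguf -t 64
done
```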

### Profiling and Debugging

For detailed profiling and logging options, refer to the [ZenDNN Logging Documentation](https://github.com/amd/ZenDNN/blob/zendnnl/docs/logging.md).

## Known Issues

- **Limited operation support**: Currently only matrix multiplication (MUL_MAT) is accelerated via ZenDNN. Other operations fall back to the standard CPU backend.
- **BF16 support**: BF16 operations require AMD Zen 4 or Zen 5 architecture (EPYC 9004/9005 series). On older CPUs, operations will use FP32.
- **NUMA awareness**: For multi-socket systems, manual NUMA binding may be required for optimal performance.

## Q&A

**Q: How do I verify that the ZenDNN backend is being used?**

A: Check the log output when running llama.cpp. You should see messages indicating that the ZenDNN backend was initialized. You can also check the backend name in the output.
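
One lightweight check is to filter the startup output for the backend name. This simply greps the logs and assumes the backend mentions "ZenDNN" when it initializes, as described above:

```sh
# Run a very short generation and filter the logs for ZenDNN messages
./build/bin/llama-cli -m models/Llama-3.1-8B-Instruct.BF16.gguf -p "Hi" -n 8 2>&1 | grep -i zendnn
```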

**Q: What performance improvement can I expect?**

A: Performance gains vary depending on the model size, batch size, and CPU architecture. On AMD EPYC processors, you can typically expect 1.1x-2x speedup compared to standard CPU inference for matrix multiplication operations.

**Q: Can I use ZenDNN on non-AMD processors?**

A: ZenDNN is optimized specifically for AMD processors. While it may work on other x86-64 CPUs, performance benefits are only guaranteed on AMD Zen-based architectures.

**Q: Does ZenDNN support quantized models?**

A: Currently, ZenDNN primarily supports FP32 and BF16 data types. Quantized model support is not available at this time.

**Q: Why is my inference not faster with ZenDNN?**

A: Check that:
1. You're using an AMD EPYC or Ryzen processor (Zen 2 or newer)
2. `OMP_NUM_THREADS` is set appropriately (physical core count)
3. `ZENDNNL_MATMUL_ALGO=2` is set for best performance (Blocked AOCL BLIS)
4. You're using a sufficiently large model (small models may not benefit as much)
5. Profiling is enabled so you can verify that the ZenDNN MatMul is actually being called (see the example below)
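
For item 5, the profiling switches from the [Environment Variable](#environment-variable) section can be combined with a short run to confirm that ZenDNN's MatMul is actually invoked (model path and log level are examples):

```sh
# Enable ZenDNN profiling for a short run
export ZENDNNL_ENABLE_PROFILER=1
export ZENDNNL_PROFILE_LOG_LEVEL=3
./build/bin/llama-cli -m models/Llama-3.1-8B-Instruct.BF16.gguf -p "Test" -n 32
```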

### **GitHub Contribution**:
Please add the **[ZenDNN]** prefix/tag to issue and PR titles to help the ZenDNN team check and address them without delay.

## TODO

- Expand operation support beyond MUL_MAT (attention operations, activations, etc.)