- For faster compilation, add the `-j` argument to run multiple jobs in parallel, or use a generator that does this automatically such as Ninja. For example, `cmake --build build --config Release -j 8` will run 8 jobs in parallel.
- For faster repeated compilation, install [ccache](https://ccache.dev/).
- For debug builds, there are two cases:
1. Single-config generators (e.g. default = `Unix Makefiles`; note that they just ignore the `--config` flag):
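   A minimal sketch for this case, using standard CMake options (multi-config generators, the other case, take `--config Debug` at build time instead):

   ```bash
   # Single-config generators bake the build type in at configure time
   cmake -B build -DCMAKE_BUILD_TYPE=Debug
   cmake --build build
   ```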
For more details and a list of supported generators, see the [CMake documentation](https://cmake.org/cmake/help/latest/manual/cmake-generators.7.html).
- Building for Windows (x86, x64 and arm64) with MSVC or clang as compilers:
  - Install Visual Studio 2022, e.g. via the [Community Edition](https://visualstudio.microsoft.com/de/vs/community/). In the installer, select at least the following options (this also automatically installs the required additional tools like CMake,...):
Building for arm64 can also be done with the MSVC compiler with the `build-arm64-windows-MSVC` preset, or the standard CMake build instructions. However, note that the MSVC compiler does not support inline ARM assembly code, used e.g. for the accelerated Q4_0_4_8 CPU kernels.
## Metal Build
On MacOS, Metal is enabled by default. Using Metal makes the computation run on the GPU.
To disable the Metal build at compile time use the `-DGGML_METAL=OFF` cmake option.
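For example:

```bash
# Build with the Metal backend disabled (CPU-only on macOS)
cmake -B build -DGGML_METAL=OFF
cmake --build build --config Release
```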
When built with Metal support, you can explicitly disable GPU inference with the `--n-gpu-layers 0` command-line argument.
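For example (the binary name and model path below are placeholders for whichever example program you run):

```bash
# Keep the Metal build but run inference entirely on the CPU
./build/bin/llama-cli -m /path/to/model.gguf --n-gpu-layers 0
```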
## BLAS Build
Building the program with BLAS support may lead to some performance improvements in prompt processing using batch sizes higher than 32 (the default is 512). Using BLAS doesn't affect the generation performance. There are currently several different BLAS implementations available for build and use:
### Accelerate Framework:
### OpenBLAS

This provides BLAS acceleration using only the CPU. Make sure to have OpenBLAS installed on your machine.
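A possible configuration, assuming the `GGML_BLAS` and `GGML_BLAS_VENDOR` CMake options used by the ggml BLAS backend:

```bash
# Select the BLAS backend and point it at OpenBLAS
cmake -B build -DGGML_BLAS=ON -DGGML_BLAS_VENDOR=OpenBLAS
cmake --build build --config Release
```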
### BLIS

Check [BLIS.md](./backend/BLIS.md) for more information.
## SYCL
SYCL is a higher-level programming model to improve programming productivity on various hardware accelerators.
The SYCL backend of llama.cpp is used to **support Intel GPUs** (Data Center Max series, Flex series, Arc series, built-in GPUs and iGPUs).
For detailed info, please refer to [llama.cpp for SYCL](./backend/SYCL.md).
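As a rough sketch only (the `GGML_SYCL` option, the oneAPI compilers and the default `setvars.sh` path below are assumptions; the SYCL guide linked above is the authoritative reference):

```bash
# Load the oneAPI environment, then configure with the ICX/DPC++ compilers
source /opt/intel/oneapi/setvars.sh
cmake -B build -DGGML_SYCL=ON -DCMAKE_C_COMPILER=icx -DCMAKE_CXX_COMPILER=icpx
cmake --build build --config Release
```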
## Intel oneMKL
Building through oneAPI compilers will make the avx_vnni instruction set available for Intel processors that do not support avx512 and avx512_vnni. Please note that this build config **does not support Intel GPU**. For Intel GPU support, please refer to [llama.cpp for SYCL](./backend/SYCL.md).
Check [Optimizing and Running LLaMA2 on Intel® CPU](https://www.intel.com/content/www/us/en/content-details/791610/optimizing-and-running-llama2-on-intel-cpu.html) for more information.
## CUDA
This provides GPU acceleration using an NVIDIA GPU. Make sure to have the CUDA toolkit installed. You can download it from your Linux distro's package manager (e.g. `apt install nvidia-cuda-toolkit`) or from here: [CUDA Toolkit](https://developer.nvidia.com/cuda-downloads).

For Jetson users: if you have a Jetson Orin, see the [official support tutorial](https://www.jetson-ai-lab.com/tutorial_text-generation.html). If you are using an older model (Nano/TX2), some additional steps are needed before compiling.
- Using `CMake`:
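  A sketch, assuming the `GGML_CUDA` switch implied by the `GGML_CUDA_*` options listed below:

  ```bash
  cmake -B build -DGGML_CUDA=ON
  cmake --build build --config Release
  ```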
The following compilation options are also available to tweak performance:
| Option                        | Legal values     | Default | Description |
|-------------------------------|------------------|---------|-------------|
| GGML_CUDA_PEER_MAX_BATCH_SIZE | Positive integer | 128     | Maximum batch size for which to enable peer access between multiple GPUs. Peer access requires either Linux or NVLink. When using NVLink enabling peer access for larger batch sizes is potentially beneficial. |
| GGML_CUDA_FA_ALL_QUANTS       | Boolean          | false   | Compile support for all KV cache quantization type (combinations) for the FlashAttention CUDA kernels. More fine-grained control over KV cache size but compilation takes much longer. |
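These options are ordinary CMake cache variables, so they are passed at configure time alongside the main backend switch, for example:

```bash
# Enable all FlashAttention KV-cache quantization combinations and raise the peer-access batch limit
cmake -B build -DGGML_CUDA=ON -DGGML_CUDA_FA_ALL_QUANTS=ON -DGGML_CUDA_PEER_MAX_BATCH_SIZE=256
cmake --build build --config Release
```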
## MUSA
This provides GPU acceleration using the MUSA cores of your Moore Threads MTT GPU. Make sure to have the MUSA SDK installed. You can download it from here: [MUSA SDK](https://developer.mthreads.com/sdk/download/musa).
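A minimal sketch, assuming a `GGML_MUSA` CMake option named analogously to the CUDA one:

```bash
cmake -B build -DGGML_MUSA=ON
cmake --build build --config Release
```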
The environment variable `GGML_CUDA_ENABLE_UNIFIED_MEMORY=1` can be used to enable unified memory.
Most of the compilation options available for CUDA should also be available for MUSA, though they haven't been thoroughly tested yet.
## HIP
This provides GPU acceleration on HIP-supported AMD GPUs.
Make sure to have ROCm installed.
You can download it from your Linux distro's package manager or from the ROCm website.
The environment variable [`HIP_VISIBLE_DEVICES`](https://rocm.docs.amd.com/en/latest/understand/gpu_isolation.html#hip-visible-devices) can be used to specify which GPU(s) will be used.
If your GPU is not officially supported, you can set the environment variable `HSA_OVERRIDE_GFX_VERSION` to a similar supported GPU, for example 10.3.0 on RDNA2 (e.g. gfx1030, gfx1031, or gfx1035) or 11.0.0 on RDNA3.
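For example (binary name and model path are placeholders):

```bash
# Treat an unsupported RDNA2 card (e.g. gfx1031/gfx1035) as gfx1030 and use only the first GPU
HSA_OVERRIDE_GFX_VERSION=10.3.0 HIP_VISIBLE_DEVICES=0 ./build/bin/llama-cli -m /path/to/model.gguf
```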
## Vulkan
**Windows**
### w64devkit
Download and extract [`w64devkit`](https://github.com/skeeto/w64devkit/releases).
## CANN

This provides NPU acceleration using the AI cores of your Ascend NPU. [CANN](https://www.hiascend.com/en/software/cann) is a hierarchical API that helps you quickly build AI applications and services on top of the Ascend NPU.
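A rough build sketch, assuming a `GGML_CANN` CMake option analogous to the other backends (see the CANN backend documentation for the authoritative steps):

```bash
cmake -B build -DGGML_CANN=ON
cmake --build build --config Release
```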
For more information about the Ascend NPU, visit the [Ascend Community](https://www.hiascend.com/en/).