
Commit 37a4bb2

Author: Olivier Chafik
Merge remote-tracking branch 'origin/master' into r1-toolcall

2 parents: 01db429 + c3d6af7

42 files changed: +2158 −889 lines

README.md

Lines changed: 1 addition & 0 deletions
@@ -235,6 +235,7 @@ Instructions for adding support for new models: [HOWTO-add-model.md](docs/develo
 | [HIP](docs/build.md#hip) | AMD GPU |
 | [Vulkan](docs/build.md#vulkan) | GPU |
 | [CANN](docs/build.md#cann) | Ascend NPU |
+| [OpenCL](docs/backend/OPENCL.md) | Adreno GPU |
 
 ## Building the project

common/arg.cpp

Lines changed: 1 addition & 1 deletion
@@ -674,7 +674,7 @@ common_params_context common_params_parser_init(common_params & params, llama_ex
 ));
 add_opt(common_arg(
     {"--no-context-shift"},
-    string_format("disables context shift on inifinite text generation (default: %s)", params.ctx_shift ? "disabled" : "enabled"),
+    string_format("disables context shift on infinite text generation (default: %s)", params.ctx_shift ? "disabled" : "enabled"),
     [](common_params & params) {
         params.ctx_shift = false;
     }
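
Since this flag is registered in the common argument parser, any example binary built on it accepts the option. A quick usage sketch (the model path is a placeholder):

```sh
# Stop generation when the context window fills instead of shifting it
# (model path is hypothetical).
./llama-cli -m models/my-model.Q4_0.gguf --no-context-shift
```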

common/log.h

Lines changed: 1 addition & 0 deletions
@@ -2,6 +2,7 @@
 
 #include "ggml.h" // for ggml_log_level
 
+#define LOG_CLR_TO_EOL "\033[K\r"
 #define LOG_COL_DEFAULT "\033[0m"
 #define LOG_COL_BOLD "\033[1m"
 #define LOG_COL_RED "\033[31m"
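
For reference, `\033[K` is the ANSI "erase from cursor to end of line" sequence and `\r` is a carriage return to column 0, so printing `"\r" LOG_CLR_TO_EOL` wipes the current terminal line in place. A quick shell illustration of the same escape codes:

```sh
# \r returns the cursor to column 0; \033[K erases to end of line.
printf 'downloading... 42%%'
sleep 1
printf '\r\033[K\rdone\n'
```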

common/speculative.h

Lines changed: 1 addition & 1 deletion
@@ -9,7 +9,7 @@ struct common_speculative_params {
     int n_draft = 16; // max drafted tokens
     int n_reuse = 256;
 
-    float p_min = 0.9f; // min probabiliy required to accept a token in the draft
+    float p_min = 0.9f; // min probability required to accept a token in the draft
 };
 
 struct common_speculative * common_speculative_init(struct llama_context * ctx_dft);

docs/backend/OPENCL.md

Lines changed: 205 additions & 0 deletions
@@ -0,0 +1,205 @@
# llama.cpp for OpenCL

- [Background](#background)
- [OS](#os)
- [Hardware](#hardware)
- [DataType Supports](#datatype-supports)
- [Model Preparation](#model-preparation)
- [CMake Options](#cmake-options)
- [Android](#android)
- [Windows 11 Arm64](#windows-11-arm64)
- [Known Issues](#known-issues)
- [TODO](#todo)

## Background

OpenCL (Open Computing Language) is an open, royalty-free standard for cross-platform, parallel programming of the diverse accelerators found in supercomputers, cloud servers, personal computers, mobile devices and embedded platforms. OpenCL specifies a programming language (based on C99) for programming these devices, as well as application programming interfaces (APIs) to control the platform and execute programs on the compute devices. Similar to CUDA, OpenCL has been widely used to program GPUs and is supported by most GPU vendors.

### Llama.cpp + OpenCL

The llama.cpp OpenCL backend is designed first and foremost to enable llama.cpp on **Qualcomm Adreno GPUs**. Thanks to the portability of OpenCL, the backend can also run on certain Intel GPUs, although the performance is not optimal.

## OS

| OS      | Status  | Verified                                 |
|---------|---------|------------------------------------------|
| Android | Support | Snapdragon 8 Gen 3, Snapdragon 8 Elite   |
| Windows | Support | Windows 11 Arm64 with Snapdragon X Elite |
| Linux   | Support | Ubuntu 22.04 WSL2 with Intel 12700H      |

## Hardware

### Adreno GPU

**Verified devices**

| Adreno GPU                      | Status  |
|:-------------------------------:|:-------:|
| Adreno 750 (Snapdragon 8 Gen 3) | Support |
| Adreno 830 (Snapdragon 8 Elite) | Support |
| Adreno X85 (Snapdragon X Elite) | Support |

## DataType Supports

| DataType | Status                     |
|:--------:|:--------------------------:|
| Q4_0     | Support                    |
| Q6_K     | Support, but not optimized |

## Model Preparation

You can refer to the general [*Prepare and Quantize*](README.md#prepare-and-quantize) guide for model preparation.

Currently we support `Q4_0` quantization and have optimized for it. To achieve the best performance on Adreno GPUs, add `--pure` to `llama-quantize`. For example,

```sh
./llama-quantize --pure ggml-model-qwen2.5-3b-f16.gguf ggml-model-qwen-3b-Q4_0.gguf Q4_0
```

Since `Q6_K` is also supported, `Q4_0` quantization without `--pure` will also work (the default `Q4_0` recipe keeps a few tensors, such as the output weights, in `Q6_K`). However, the performance will be worse than with pure `Q4_0` quantization.
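
For comparison, the default mixed quantization is simply the command above without `--pure` (file names as in the example above):

```sh
# Default Q4_0 quantization; a few tensors stay in Q6_K.
./llama-quantize ggml-model-qwen2.5-3b-f16.gguf ggml-model-qwen-3b-Q4_0.gguf Q4_0
```
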
## CMake Options

The OpenCL backend has the following CMake options that control its behavior.

| CMake options                    | Default value | Description                                |
|:--------------------------------:|:-------------:|:-------------------------------------------|
| `GGML_OPENCL_EMBED_KERNELS`      | `ON`          | Embed OpenCL kernels into the executable.  |
| `GGML_OPENCL_USE_ADRENO_KERNELS` | `ON`          | Use kernels optimized for Adreno.          |
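
For instance, a configure step that enables the backend and sets both options explicitly might look like the following (these values match the defaults, so the two extra flags are illustrative):

```sh
cmake .. -G Ninja \
    -DGGML_OPENCL=ON \
    -DGGML_OPENCL_EMBED_KERNELS=ON \
    -DGGML_OPENCL_USE_ADRENO_KERNELS=ON
```
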
## Android

Ubuntu 22.04 is used for targeting Android. Make sure the following tools are accessible from the command line:

* Git
* CMake 3.29
* Ninja
* Python3

### I. Setup Environment

1. **Install NDK**

```sh
cd ~
wget https://dl.google.com/android/repository/commandlinetools-linux-8512546_latest.zip && \
unzip commandlinetools-linux-8512546_latest.zip && \
mkdir -p ~/android-sdk/cmdline-tools && \
mv cmdline-tools latest && \
mv latest ~/android-sdk/cmdline-tools/ && \
rm -rf commandlinetools-linux-8512546_latest.zip

yes | ~/android-sdk/cmdline-tools/latest/bin/sdkmanager "ndk;26.3.11579264"
```

2. **Install OpenCL Headers and Library**

```sh
mkdir -p ~/dev/llm
cd ~/dev/llm

git clone https://github.com/KhronosGroup/OpenCL-Headers && \
cd OpenCL-Headers && \
cp -r CL ~/android-sdk/ndk/26.3.11579264/toolchains/llvm/prebuilt/linux-x86_64/sysroot/usr/include

cd ~/dev/llm

git clone https://github.com/KhronosGroup/OpenCL-ICD-Loader && \
cd OpenCL-ICD-Loader && \
mkdir build_ndk26 && cd build_ndk26 && \
cmake .. -G Ninja -DCMAKE_BUILD_TYPE=Release \
  -DCMAKE_TOOLCHAIN_FILE=$HOME/android-sdk/ndk/26.3.11579264/build/cmake/android.toolchain.cmake \
  -DOPENCL_ICD_LOADER_HEADERS_DIR=$HOME/android-sdk/ndk/26.3.11579264/toolchains/llvm/prebuilt/linux-x86_64/sysroot/usr/include \
  -DANDROID_ABI=arm64-v8a \
  -DANDROID_PLATFORM=24 \
  -DANDROID_STL=c++_shared && \
ninja && \
cp libOpenCL.so ~/android-sdk/ndk/26.3.11579264/toolchains/llvm/prebuilt/linux-x86_64/sysroot/usr/lib/aarch64-linux-android
```

### II. Build llama.cpp

```sh
cd ~/dev/llm

git clone https://github.com/ggerganov/llama.cpp && \
cd llama.cpp && \
mkdir build-android && cd build-android

cmake .. -G Ninja \
  -DCMAKE_TOOLCHAIN_FILE=$HOME/android-sdk/ndk/26.3.11579264/build/cmake/android.toolchain.cmake \
  -DANDROID_ABI=arm64-v8a \
  -DANDROID_PLATFORM=android-28 \
  -DBUILD_SHARED_LIBS=OFF \
  -DGGML_OPENCL=ON

ninja
```
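
Once the build completes, a common way to try it on a device is to push the static binary and a model over `adb` and run them from `/data/local/tmp`. A sketch, assuming a connected device and the hypothetical model file from the quantization example:

```sh
# Paths are illustrative; adjust the model name to whatever you quantized.
adb push build-android/bin/llama-cli /data/local/tmp/
adb push ggml-model-qwen-3b-Q4_0.gguf /data/local/tmp/
adb shell 'cd /data/local/tmp && ./llama-cli -m ggml-model-qwen-3b-Q4_0.gguf -p "Hello"'
```
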
## Windows 11 Arm64

A Snapdragon X Elite device with Windows 11 Arm64 is used. Make sure the following tools are accessible from the command line:

* Git
* CMake 3.29
* Clang 19
* Ninja
* Visual Studio 2022

PowerShell is used for the following instructions.

### I. Setup Environment

1. **Install OpenCL Headers and Library**

```powershell
mkdir -p ~/dev/llm

cd ~/dev/llm
git clone https://github.com/KhronosGroup/OpenCL-Headers && cd OpenCL-Headers
mkdir build && cd build
cmake .. -G Ninja `
  -DBUILD_TESTING=OFF `
  -DOPENCL_HEADERS_BUILD_TESTING=OFF `
  -DOPENCL_HEADERS_BUILD_CXX_TESTS=OFF `
  -DCMAKE_INSTALL_PREFIX="$HOME/dev/llm/opencl"
cmake --build . --target install

cd ~/dev/llm
git clone https://github.com/KhronosGroup/OpenCL-ICD-Loader && cd OpenCL-ICD-Loader
mkdir build && cd build
cmake .. -G Ninja `
  -DCMAKE_BUILD_TYPE=Release `
  -DCMAKE_PREFIX_PATH="$HOME/dev/llm/opencl" `
  -DCMAKE_INSTALL_PREFIX="$HOME/dev/llm/opencl"
cmake --build . --target install
```

### II. Build llama.cpp

```powershell
mkdir -p ~/dev/llm
cd ~/dev/llm

git clone https://github.com/ggerganov/llama.cpp && cd llama.cpp
mkdir build && cd build

cmake .. -G Ninja `
  -DCMAKE_TOOLCHAIN_FILE="$HOME/dev/llm/llama.cpp/cmake/arm64-windows-llvm.cmake" `
  -DCMAKE_BUILD_TYPE=Release `
  -DCMAKE_PREFIX_PATH="$HOME/dev/llm/opencl" `
  -DBUILD_SHARED_LIBS=OFF `
  -DGGML_OPENCL=ON
ninja
```

## Known Issues

- The Qwen2.5 0.5B model produces gibberish output with Adreno kernels.

## TODO

- Fix Qwen2.5 0.5B
- Optimization for Q6_K
- Support and optimization for Q4_K

examples/main/README.md

Lines changed: 1 addition & 1 deletion
@@ -37,7 +37,7 @@ Once downloaded, place your model in the models folder in llama.cpp.
 
 ##### Infinite text from a starting prompt (you can use `Ctrl-C` to stop it):
 ```bash
-./llama-cli -m models\gemma-1.1-7b-it.Q4_K_M.gguf --ignore-eos -n -1
+./llama-cli -m models/gemma-1.1-7b-it.Q4_K_M.gguf --ignore-eos -n -1
 ```
 
 ### Windows:

examples/run/run.cpp

Lines changed: 4 additions & 11 deletions
@@ -535,8 +535,7 @@ class HttpClient {
 
     static void print_progress(const std::string & progress_prefix, const std::string & progress_bar,
                                const std::string & progress_suffix) {
-        printe("\r%*s\r%s%s| %s", get_terminal_width(), " ", progress_prefix.c_str(), progress_bar.c_str(),
-               progress_suffix.c_str());
+        printe("\r" LOG_CLR_TO_EOL "%s%s| %s", progress_prefix.c_str(), progress_bar.c_str(), progress_suffix.c_str());
     }
     // Function to write data to a file
     static size_t write_data(void * ptr, size_t size, size_t nmemb, void * stream) {
@@ -797,16 +796,13 @@ class LlamaData {
     llama_model_ptr initialize_model(Opt & opt) {
         ggml_backend_load_all();
         resolve_model(opt.model_);
-        printe(
-            "\r%*s"
-            "\rLoading model",
-            get_terminal_width(), " ");
+        printe("\r" LOG_CLR_TO_EOL "Loading model");
         llama_model_ptr model(llama_model_load_from_file(opt.model_.c_str(), opt.model_params));
         if (!model) {
             printe("%s: error: unable to load model from file: %s\n", __func__, opt.model_.c_str());
         }
 
-        printe("\r%*s\r", static_cast<int>(sizeof("Loading model")), " ");
+        printe("\r" LOG_CLR_TO_EOL);
         return model;
     }
@@ -969,10 +965,7 @@ static int generate(LlamaData & llama_data, const std::string & prompt, std::str
 static int read_user_input(std::string & user_input) {
     static const char * prompt_prefix = "> ";
 #ifdef WIN32
-    printf(
-        "\r%*s"
-        "\r" LOG_COL_DEFAULT "%s",
-        get_terminal_width(), " ", prompt_prefix);
+    printf("\r" LOG_CLR_TO_EOL LOG_COL_DEFAULT "%s", prompt_prefix);
 
     std::getline(std::cin, user_input);
     if (std::cin.eof()) {
(38.8 KB binary file not shown)
