
Commit 82ab059
opencl: update docs
1 parent c5023da

File tree: 1 file changed (+16, -3 lines)

docs/backend/OPENCL.md

Lines changed: 16 additions & 3 deletions
@@ -45,19 +45,27 @@ The llama.cpp OpenCL backend is designed to enable llama.cpp on **Qualcomm Adren
|:----------------------:|:--------------------------:|
| Q4_0 | Support |
| Q6_K | Support, but not optimized |
+| Q8_0 | Support |
+| MXFP4 | Support |

## Model Preparation

-You can refer to the general [*Prepare and Quantize*](README.md#prepare-and-quantize) guide for model prepration.
+You can refer to the general [llama-quantize tool](tools/quantize/README.md) for steps to convert a model in Hugging Face safetensors format to GGUF with quantization.

-Currently we support `Q4_0` quantization and have optimize for it. To achieve best performance on Adreno GPU, add `--pure` to `llama-quantize`. For example,
+Currently we support `Q4_0` quantization and have optimized the backend for it. To achieve the best performance on Adreno GPUs, add `--pure` to `llama-quantize`. For example,

```sh
./llama-quantize --pure ggml-model-qwen2.5-3b-f16.gguf ggml-model-qwen-3b-Q4_0.gguf Q4_0
```

Since `Q6_K` is also supported, `Q4_0` quantization without `--pure` will also work. However, the performance will be worse compared to pure `Q4_0` quantization.

+### MXFP4 Models
+
+OpenAI gpt-oss models are published in MXFP4. The quantized model will be in MXFP4_MOE, a mixture of MXFP4 and Q8_0.
+For this quantization, there is no need to specify `--pure`.
+For the gpt-oss-20b model, you can directly download a quantized GGUF file in MXFP4 from Hugging Face.

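As an illustration of the download route mentioned above (the repository and file names here are assumptions for the sketch; check Hugging Face for the actual ones), a prequantized MXFP4 GGUF could be fetched with the Hugging Face CLI:

```sh
# Hypothetical repo and file names -- verify on Hugging Face before use.
huggingface-cli download ggml-org/gpt-oss-20b-GGUF \
  --local-dir ./models
```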
## CMake Options
The OpenCL backend has the following CMake options that control the behavior of the backend.
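As a minimal sketch of how such options are passed at configure time (assuming the standard `GGML_OPENCL` switch; the exact option names should be taken from the full table in the backend docs):

```sh
# Sketch: enable the OpenCL backend when configuring llama.cpp.
# Other backend options follow the same -D<OPTION>=<value> pattern.
mkdir -p build && cd build
cmake .. -G Ninja -DGGML_OPENCL=ON
ninja
```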
@@ -146,10 +154,13 @@ A Snapdragon X Elite device with Windows 11 Arm64 is used. Make sure the followi
* Ninja
* Visual Studio 2022
* Powershell 7
+* Python

Visual Studio provides necessary headers and libraries although it is not directly used for building.
Alternatively, Visual Studio Build Tools can be installed instead of the full Visual Studio.

+> Note that building with Visual Studio's cl compiler is not supported; Clang must be used. Clang depends on libraries provided by Visual Studio, so either the full Visual Studio or the Visual Studio Build Tools must be installed.
+
Powershell 7 is used for the following commands.
If an older version of Powershell is used, these commands may not work as they are.

@@ -201,7 +212,9 @@ ninja

## Known Issues

-- Currently OpenCL backend does not work on Adreno 6xx GPUs.
+- Flash attention does not always improve performance. Disable it for models above 3B.
+- Currently the OpenCL backend works on A6xx GPUs with recent drivers and compilers (usually found in IoT platforms).
+  However, it does not work on A6xx GPUs found in phones with old drivers and compilers.

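Given the flash-attention note above, a hedged example of turning it off at run time (the flag spelling varies across llama.cpp versions; `--flash-attn` is assumed here, so check `llama-cli --help` for your build):

```sh
# Assumed flag name; recent llama.cpp builds accept --flash-attn on|off|auto.
./llama-cli -m ggml-model-qwen-3b-Q4_0.gguf --flash-attn off -p "Hello"
```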
## TODO