
Commit 40d2095

opencl: update doc
1 parent d90c174 commit 40d2095

File tree

1 file changed: +9, -5 lines


docs/backend/OPENCL.md

Lines changed: 9 additions & 5 deletions
@@ -39,6 +39,9 @@ The llama.cpp OpenCL backend is designed to enable llama.cpp on **Qualcomm Adren
 | Adreno 830 (Snapdragon 8 Elite) | Support |
 | Adreno X85 (Snapdragon X Elite) | Support |
 
+> A6x GPUs with a recent driver and compiler are supported; they are usually found in IoT platforms.
+However, A6x GPUs in phones are likely not supported due to the outdated driver and compiler.
+
 ## DataType Supports
 
 | DataType | Status |
@@ -52,7 +55,7 @@ The llama.cpp OpenCL backend is designed to enable llama.cpp on **Qualcomm Adren
 
 You can refer to the general [llama-quantize tool](/tools/quantize/README.md) for steps to convert a model in Hugging Face safetensor format to GGUF with quantization.
 
-Currently we support `Q4_0` quantization and have optimized for it. To achieve best performance on Adreno GPU, add `--pure` to `llama-quantize`. For example,
+Currently we support `Q4_0` quantization and have optimized for it. To achieve best performance on Adreno GPU, add `--pure` to `llama-quantize` (i.e., make all weights in `Q4_0`). For example,
 
 ```sh
 ./llama-quantize --pure ggml-model-qwen2.5-3b-f16.gguf ggml-model-qwen-3b-Q4_0.gguf Q4_0
@@ -66,10 +69,10 @@ OpenAI gpt-oss models are MoE models in `MXFP4`. The quantized model will be in
 For this quantization, there is no need to specify `--pure`.
 For gpt-oss-20b model, you can directly [download](https://huggingface.co/ggml-org/gpt-oss-20b-GGUF) the quantized GGUF file in `MXFP4_MOE` from Hugging Face.
 
-Although it is possible to quantize gpt-oss-20b model in pure `Q4_0`, it is not recommendedsince `MXFP4` has been optimized for MoE while `Q4_0` is not.
-Hence, using the default `MXFP4_MOE` quantization will give better performance compared to pure `Q4_0` quantization for this model.
+Although it is possible to quantize gpt-oss-20b model in pure `Q4_0` (all weights in `Q4_0`), it is not recommended since `MXFP4` has been optimized for MoE while `Q4_0` is not. In addition, accuracy should degrade with such pure `Q4_0` quantization.
+Hence, using the default `MXFP4_MOE` quantization (see the link above) is recommended for this model.
 
-However, note that the `Q4_0` model found [here](https://huggingface.co/unsloth/gpt-oss-20b-GGUF/blob/main/gpt-oss-20b-Q4_0.gguf) is a mixture of `Q4_0`, `Q8_0` and `MXFP4` and gives better performance than `MXFP4_MOE` quantization.
+> Note that the `Q4_0` model found [here](https://huggingface.co/unsloth/gpt-oss-20b-GGUF/blob/main/gpt-oss-20b-Q4_0.gguf) is a mixture of `Q4_0`, `Q8_0` and `MXFP4` and gives better performance than `MXFP4_MOE` quantization.
 
 ## CMake Options
 
@@ -217,11 +220,12 @@ ninja
 
 ## Known Issues
 
-- Flash attention does not always improve performance. Disable it for models above 3B.
+- Flash attention does not always improve performance.
 - Currently OpenCL backend works on A6xx GPUs with recent drivers and compilers (usually found in IoT platforms).
 However, it does not work on A6xx GPUs found in phones with old drivers and compilers.
 
 ## TODO
 
 - Optimization for Q6_K
 - Support and optimization for Q4_K
+- Improve flash attention
