Commit 501891c

Author: Alberto Cabrera (committed)

Commit message: Addressing comments in the PR
1 parent 230279d commit 501891c

File tree: 5 files changed (+49, −14 lines)

docs/backend/SYCL.md (9 additions, 9 deletions)

```diff
@@ -26,16 +26,16 @@
 
 ### Llama.cpp + SYCL
 
-The llama.cpp SYCL backend is mainly designed to support **Intel GPUs**.
-Based on the cross-platform feature of SYCL, it also supports Nvidia GPUs, with very limited support for AMD.
+The llama.cpp SYCL backend is primarily designed for **Intel GPUs**.
+SYCL cross-platform capabilities enable support for Nvidia GPUs as well, with limited support for AMD.
 
 ## Recommended Release
 
-The following releases are verified:
+The following releases are verified and recommended:
 
 |Commit ID|Tag|Release|Verified Platform| Update date|
 |-|-|-|-|-|
-|24e86cae7219b0f3ede1d5abdf5bf3ad515cccb8|b5377 |[llama-b5377-bin-win-sycl-x64.zip](https://github.com/ggml-org/llama.cpp/releases/download/b5377/llama-b5377-bin-win-sycl-x64.zip) |ArcB580/Linux/oneAPI 2025.1<br>LNL Arc GPU/Windows 11/oneAPI 2025.1|2025-05-15|
+|24e86cae7219b0f3ede1d5abdf5bf3ad515cccb8|b5377 |[llama-b5377-bin-win-sycl-x64.zip](https://github.com/ggml-org/llama.cpp/releases/download/b5377/llama-b5377-bin-win-sycl-x64.zip) |ArcB580/Linux/oneAPI 2025.1<br>LNL Arc GPU/Windows 11/oneAPI 2025.1.1|2025-05-15|
 |3bcd40b3c593d14261fb2abfabad3c0fb5b9e318|b4040 |[llama-b4040-bin-win-sycl-x64.zip](https://github.com/ggml-org/llama.cpp/releases/download/b4040/llama-b4040-bin-win-sycl-x64.zip) |Arc770/Linux/oneAPI 2024.1<br>MTL Arc GPU/Windows 11/oneAPI 2024.1| 2024-11-19|
 |fb76ec31a9914b7761c1727303ab30380fd4f05c|b3038 |[llama-b3038-bin-win-sycl-x64.zip](https://github.com/ggml-org/llama.cpp/releases/download/b3038/llama-b3038-bin-win-sycl-x64.zip) |Arc770/Linux/oneAPI 2024.1<br>MTL Arc GPU/Windows 11/oneAPI 2024.1||
 
@@ -107,8 +107,8 @@ SYCL backend supports Intel GPU Family:
 | Intel Data Center Max Series | Support | Max 1550, 1100 |
 | Intel Data Center Flex Series | Support | Flex 170 |
 | Intel Arc Series | Support | Arc 770, 730M, Arc A750, B580 |
-| Intel built-in Arc GPU | Support | built-in Arc GPU in Meteor Lake, Arrow Lake |
-| Intel iGPU | Support | iGPU in 13700k, 13400, i5-1250P, i7-1260P, i7-1165G7, Ultra 7 268V |
+| Intel built-in Arc GPU | Support | built-in Arc GPU in Meteor Lake, Arrow Lake, Lunar Lake |
+| Intel iGPU | Support | iGPU in 13700k, 13400, i5-1250P, i7-1260P, i7-1165G7 |
 
 *Notes:*
 
@@ -734,12 +734,12 @@ use 1 SYCL GPUs: [0] with Max compute units:512
 | GGML_SYCL | ON (mandatory) | Enable build with SYCL code path. |
 | GGML_SYCL_TARGET | INTEL *(default)* \| NVIDIA \| AMD | Set the SYCL target device type. |
 | GGML_SYCL_DEVICE_ARCH | Optional (except for AMD) | Set the SYCL device architecture, optional except for AMD. Setting the device architecture can improve the performance. See the table [--offload-arch](https://github.com/intel/llvm/blob/sycl/sycl/doc/design/OffloadDesign.md#--offload-arch) for a list of valid architectures. |
-| GGML_SYCL_F16 | OFF *(default)* \|ON *(optional)* | Enable FP16 build with SYCL code path.\* |
+| GGML_SYCL_F16 | OFF *(default)* \|ON *(optional)* | Enable FP16 build with SYCL code path. (1.) |
 | GGML_SYCL_GRAPH | ON *(default)* \|OFF *(Optional)* | Enable build with [SYCL Graph extension](https://github.com/intel/llvm/blob/sycl/sycl/doc/extensions/experimental/sycl_ext_oneapi_graph.asciidoc). |
 | CMAKE_C_COMPILER | `icx` *(Linux)*, `icx/cl` *(Windows)* | Set `icx` compiler for SYCL code path. |
 | CMAKE_CXX_COMPILER | `icpx` *(Linux)*, `icx` *(Windows)* | Set `icpx/icx` compiler for SYCL code path. |
 
-* The FP32 codepath used to have better on quantized models but latest results show similar performance in text generation. Check both `GGML_SYCL_F16` ON and OFF to check in your system, but take into accound that FP32 reduces Prompt processing performance.
+1. FP16 is recommended for better prompt processing performance on quantized models. Performance is equivalent in text generation but set `GGML_SYCL_F16=OFF` if you are experiencing issues with FP16 builds.
 
 #### Runtime
 
@@ -800,4 +800,4 @@ Please add the `SYCL :` prefix/tag in issues/PRs titles to help the SYCL contrib
 
 ## TODO
 
-- NA
+- Review ZES_ENABLE_SYSMAN: https://github.com/intel/compute-runtime/blob/master/programmers-guide/SYSMAN.md#support-and-limitations
```
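Not part of the commit, but for context: the build options in the table above combine into a configure-and-build step roughly like the following sketch. The oneAPI install path and the choice to turn `GGML_SYCL_F16` on are assumptions, not something this diff specifies.

```shell
# Illustrative SYCL build using the CMake options documented above.
# Assumes oneAPI is installed in its default location; flip
# GGML_SYCL_F16 to OFF if FP16 builds cause issues on your system.
source /opt/intel/oneapi/setvars.sh
cmake -B build -DGGML_SYCL=ON \
      -DCMAKE_C_COMPILER=icx \
      -DCMAKE_CXX_COMPILER=icpx \
      -DGGML_SYCL_F16=ON
cmake --build build --config Release -j
```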

examples/sycl/run-llama.sh renamed to examples/sycl/run-llama2.sh (4 additions, 4 deletions)

```diff
@@ -12,16 +12,16 @@ source /opt/intel/oneapi/setvars.sh
 
 INPUT_PROMPT="Building a website can be done in 10 simple steps:\nStep 1:"
 MODEL_FILE=models/llama-2-7b.Q4_0.gguf
-NGL=33
-CONEXT=4096
+NGL=99
+CONTEXT=4096
 
 if [ $# -gt 0 ]; then
     GGML_SYCL_DEVICE=$1
     echo "use $GGML_SYCL_DEVICE as main GPU"
     #use signle GPU only
-    ZES_ENABLE_SYSMAN=1 ./build/bin/llama-cli -m ${MODEL_FILE} -p "${INPUT_PROMPT}" -n 400 -e -ngl ${NGL} -c ${CONEXT} -mg $GGML_SYCL_DEVICE -sm none
+    ZES_ENABLE_SYSMAN=1 ./build/bin/llama-cli -m ${MODEL_FILE} -p "${INPUT_PROMPT}" -n 400 -e -ngl ${NGL} -s 0 -c ${CONTEXT} -mg $GGML_SYCL_DEVICE -sm none
 
 else
     #use multiple GPUs with same max compute units
-    ZES_ENABLE_SYSMAN=1 ./build/bin/llama-cli -m ${MODEL_FILE} -p "${INPUT_PROMPT}" -n 400 -e -ngl ${NGL} -c ${CONEXT}
+    ZES_ENABLE_SYSMAN=1 ./build/bin/llama-cli -m ${MODEL_FILE} -p "${INPUT_PROMPT}" -n 400 -e -ngl ${NGL} -s 0 -c ${CONTEXT}
 fi
```
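The script's single-GPU vs multi-GPU branch is a plain argument-count check: with a device index as `$1` it pins the run to that GPU (`-mg <id> -sm none`), otherwise it lets llama-cli split layers across all GPUs. A minimal sketch of that pattern (the `select_flags` function name is illustrative, not from the script):

```shell
#!/bin/sh
# Sketch of run-llama2.sh's device-selection logic: print the extra
# llama-cli flags to use depending on whether a GPU index was passed.
select_flags() {
  if [ $# -gt 0 ]; then
    printf '%s\n' "-mg $1 -sm none"   # pin the run to a single GPU
  else
    printf '%s\n' ""                  # default: split layers across all GPUs
  fi
}

select_flags 0   # prints "-mg 0 -sm none"
select_flags     # prints an empty line
```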

examples/sycl/run-llama3.sh (new file, 26 additions)

```diff
@@ -0,0 +1,26 @@
+#!/bin/bash
+
+# MIT license
+# Copyright (C) 2025 Intel Corporation
+# SPDX-License-Identifier: MIT
+
+export ONEAPI_DEVICE_SELECTOR="level_zero:0"
+source /opt/intel/oneapi/setvars.sh
+
+#export GGML_SYCL_DEBUG=1
+
+#ZES_ENABLE_SYSMAN=1, Support to get free memory of GPU by sycl::aspect::ext_intel_free_memory. Recommended to use when --split-mode = layer.
+
+INPUT_PROMPT="Building a website can be done in 10 simple steps:\nStep 1:"
+MODEL_FILE=models/Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf
+NGL=99 # Layers offloaded to the GPU. If the device runs out of memory, reduce this value according to the model you are using.
+CONTEXT=4096
+
+if [ $# -gt 0 ]; then
+    GGML_SYCL_DEVICE=$1
+    echo "Using $GGML_SYCL_DEVICE as the main GPU"
+    ZES_ENABLE_SYSMAN=1 ./build/bin/llama-cli -m ${MODEL_FILE} -p "${INPUT_PROMPT}" -n 400 -e -ngl ${NGL} -c ${CONTEXT} -mg $GGML_SYCL_DEVICE -sm none
+else
+    #use multiple GPUs with same max compute units
+    ZES_ENABLE_SYSMAN=1 ./build/bin/llama-cli -m ${MODEL_FILE} -p "${INPUT_PROMPT}" -n 400 -e -ngl ${NGL} -c ${CONTEXT}
+fi
```
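The environment setup this new script performs before launching llama-cli can be isolated as a small sketch: restrict SYCL to one Level Zero device and enable the sysman interface so the runtime can report free GPU memory (useful when `--split-mode` is layer). Device index 0 is carried over from the script's default; the `echo` is only for illustration.

```shell
#!/bin/sh
# Environment mirrored from run-llama3.sh before the llama-cli launch.
export ONEAPI_DEVICE_SELECTOR="level_zero:0"   # use Level Zero device 0 only
export ZES_ENABLE_SYSMAN=1                     # allow free-GPU-memory queries

echo "selector=${ONEAPI_DEVICE_SELECTOR} sysman=${ZES_ENABLE_SYSMAN}"
# prints: selector=level_zero:0 sysman=1
```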

examples/sycl/win-run-llama.bat renamed to examples/sycl/win-run-llama2.bat (1 addition, 1 deletion)

```diff
@@ -6,4 +6,4 @@ set INPUT2="Building a website can be done in 10 simple steps:\nStep 1:"
 @call "C:\Program Files (x86)\Intel\oneAPI\setvars.bat" intel64 --force
 
 
-.\build\bin\llama-cli.exe -m models\llama-2-7b.Q4_0.gguf -p %INPUT2% -n 400 -e -ngl 33
+.\build\bin\llama-cli.exe -m models\llama-2-7b.Q4_0.gguf -p %INPUT2% -n 400 -e -ngl 99 -s 0
```

examples/sycl/win-run-llama3.bat (new file, 9 additions)

```diff
@@ -0,0 +1,9 @@
+:: MIT license
+:: Copyright (C) 2024 Intel Corporation
+:: SPDX-License-Identifier: MIT
+
+set INPUT2="Building a website can be done in 10 simple steps:\nStep 1:"
+@call "C:\Program Files (x86)\Intel\oneAPI\setvars.bat" intel64 --force
+
+
+.\build\bin\llama-cli.exe -m models\Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf -p %INPUT2% -n 400 -e -ngl 33
```
