
Commit 420e3b3

NeoZhangJianyu and arthw committed
[SYCL] Optimize mul_mat for Q4_0 on Intel GPU (ggml-org#12035)
* opt performance by reorder for Intel GPU
* detect hw type and save opt feature, and print opt feature
* correct name
* support optimize graph once when compute graph, record the opt status in tensor->extra, make CI passed
* add env variable GGML_SYCL_DISABLE_OPT for debug
* use syclex::architecture replace the custom hw define, update the guide for GGML_SYCL_DISABLE_OPT
* add performance data
* mv getrows functions to separeted files
* fix global variables

---------

Co-authored-by: arthw <[email protected]>
1 parent 89b48a8 commit 420e3b3
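The first bullet of the commit message is the key change: Q4_0 weights are reordered so the SYCL kernels read them with friendlier memory access on Intel GPUs. The sketch below is only an illustration of that idea, not the layout used by this commit (the real reorder lives in backend sources outside this excerpt): interleaved block_q4_0 blocks are split into one contiguous run of packed 4-bit values followed by one contiguous run of scales.

    // Illustrative sketch only -- not the exact layout from this commit. It shows the
    // general idea of "reordering" interleaved Q4_0 blocks (one fp16 scale + 16 packed
    // nibbles each) into all quantized bytes first, then all scales, so neighbouring
    // work-items in a sub-group read neighbouring bytes.
    #include <cstdint>
    #include <cstring>

    constexpr int QK4_0 = 32;            // values per Q4_0 block (ggml constant)

    struct block_q4_0_ref {              // stand-in for ggml's interleaved block_q4_0
        uint16_t d;                      // fp16 scale, kept as raw bits here
        uint8_t  qs[QK4_0 / 2];          // 32 x 4-bit values packed into 16 bytes
    };

    // Hypothetical reorder: the qs regions of all blocks first, then all scales.
    void reorder_q4_0(const block_q4_0_ref * src, uint8_t * dst, size_t nblocks) {
        uint8_t  * qs_out = dst;
        uint16_t * d_out  = reinterpret_cast<uint16_t *>(dst + nblocks * (QK4_0 / 2));
        for (size_t b = 0; b < nblocks; ++b) {
            std::memcpy(qs_out + b * (QK4_0 / 2), src[b].qs, QK4_0 / 2);
            d_out[b] = src[b].d;
        }
    }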

File tree

16 files changed: +835, -315 lines


docs/backend/SYCL.md

Lines changed: 14 additions & 2 deletions
@@ -43,6 +43,16 @@ For CI and performance test summary, please refer to [llama.cpp CI for SYCL Back
 
 ## News
 
+- 2025.2
+  - Optimize MUL_MAT Q4_0 on Intel GPU for all dGPUs and built-in GPUs since MTL. Increase the performance of LLM (llama-2-7b.Q4_0.gguf) 21%-87% on Intel GPUs (MTL, ARL-H, Arc, Flex, PVC).
+    |GPU|Base tokens/s|Increased tokens/s|Percent|
+    |-|-|-|-|
+    |PVC 1550|39|73|+87%|
+    |Flex 170|39|50|+28%|
+    |Arc770|42|55|+30%|
+    |MTL|13|16|+23%|
+    |ARL-H|14|17|+21%|
+
 - 2024.11
   - Use syclcompat to improve the performance on some platforms. This requires to use oneAPI 2025.0 or newer.
 
@@ -101,8 +111,8 @@ SYCL backend supports Intel GPU Family:
 | Intel Data Center Max Series | Support | Max 1550, 1100 |
 | Intel Data Center Flex Series | Support | Flex 170 |
 | Intel Arc Series | Support | Arc 770, 730M, Arc A750 |
-| Intel built-in Arc GPU | Support | built-in Arc GPU in Meteor Lake |
-| Intel iGPU | Support | iGPU in 13700k, i5-1250P, i7-1260P, i7-1165G7 |
+| Intel built-in Arc GPU | Support | built-in Arc GPU in Meteor Lake, Arrow Lake |
+| Intel iGPU | Support | iGPU in 13700k,iGPU in 13400, i5-1250P, i7-1260P, i7-1165G7 |
 
 *Notes:*
 
@@ -697,6 +707,7 @@ use 1 SYCL GPUs: [0] with Max compute units:512
 | Name | Value | Function |
 |-------------------|------------------|---------------------------------------------------------------------------------------------------------------------------|
 | GGML_SYCL_DEBUG | 0 (default) or 1 | Enable log function by macro: GGML_SYCL_DEBUG |
+| GGML_SYCL_DISABLE_OPT | 0 (default) or 1 | Disable optimize features based on Intel GPU type, to compare the performance increase |
 | ZES_ENABLE_SYSMAN | 0 (default) or 1 | Support to get free memory of GPU by sycl::aspect::ext_intel_free_memory.<br>Recommended to use when --split-mode = layer |
 | GGML_SYCL_VISIBLE_DEVICES|id1,id2,...|It's like `CUDA_VISIBLE_DEVICES`, define the SYCL device ID list to visible. Like "0", "0,2", "2,1" |
 | ONEAPI_DEVICE_SELECTOR|Refer to [oneapi-device-selector](https://intel.github.io/llvm-docs/EnvironmentVariables.html#oneapi-device-selector)|be used to limit the choice of devices available when the SYCL-using application is run|
@@ -725,6 +736,7 @@ The parameters about device choose of llama.cpp works with SYCL backend rule to
 |Multiple Device|`--split-mode=layer`|Default|
 
 
+
 ## Known Issues
 
 - `Split-mode:[row]` is not supported.
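For reference, the Percent column of the news table above follows directly from the Base and Increased token rates; truncating to whole percent reproduces the published numbers:

    // Recomputes the Percent column of the table above from its Base/Increased tokens/s.
    // Truncating to whole percent reproduces the published values (+87, +28, +30, +23, +21).
    #include <cstdio>

    int main() {
        struct { const char * gpu; double base; double opt; } rows[] = {
            { "PVC 1550", 39, 73 }, { "Flex 170", 39, 50 }, { "Arc770", 42, 55 },
            { "MTL", 13, 16 },      { "ARL-H", 14, 17 },
        };
        for (const auto & r : rows) {
            int pct = static_cast<int>((r.opt - r.base) / r.base * 100.0);  // truncation
            std::printf("%-8s +%d%%\n", r.gpu, pct);
        }
        return 0;
    }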

examples/sycl/run-llama2.sh

Lines changed: 1 addition & 1 deletion
@@ -3,7 +3,7 @@
 # MIT license
 # Copyright (C) 2024 Intel Corporation
 # SPDX-License-Identifier: MIT
-
+export ONEAPI_DEVICE_SELECTOR="level_zero:0"
 source /opt/intel/oneapi/setvars.sh
 
 #export GGML_SYCL_DEBUG=1

ggml/src/ggml-sycl/CMakeLists.txt

Lines changed: 2 additions & 0 deletions
@@ -1,3 +1,5 @@
+message(STATUS "GGML_SYCL_TARGET=${GGML_SYCL_TARGET}")
+
 if (NOT GGML_SYCL_TARGET MATCHES "^(INTEL|NVIDIA|AMD)$")
     message(FATAL_ERROR "Invalid backend chosen, supported options are INTEL, NVIDIA, or AMD")
 endif()

ggml/src/ggml-sycl/common.cpp

Lines changed: 71 additions & 29 deletions
@@ -91,7 +91,7 @@ void print_device_detail_part1(int id, sycl::device &device, std::string device_
            name.c_str(), global_mem_size);
 }
 
-void print_device_detail_part2(int id, sycl::device &device, std::string device_type) {
+void print_device_detail_part2(int id, sycl::device &device) {
 
     dpct::device_info prop;
     SYCL_CHECK(CHECK_TRY_ERROR(
@@ -103,6 +103,30 @@ void print_device_detail_part2(int id, sycl::device &device, std::string device_
            device.get_info<sycl::info::device::driver_version>().c_str());
 }
 
+void print_device_opt_feature(ggml_sycl_device_info &info) {
+    GGML_LOG_INFO("SYCL Optimization Feature:\n");
+    GGML_LOG_INFO(
+        "|ID| Device Type|Reorder|\n");
+    GGML_LOG_INFO(
+        "|--|-------------------|-------|\n");
+    std::map<std::string, size_t> DeviceNums;
+    int device_count = info.device_count;
+
+    for (int id = 0; id < device_count; ++id) {
+        printf("zjy id=%d\n", id);
+        sycl::device device = dpct::dev_mgr::instance().get_device(id);
+        std::string backend_type = get_device_backend_and_type(device);
+        int type_id = DeviceNums[backend_type]++;
+        std::stringstream device_type;
+        device_type << "[" << backend_type << ":" << std::to_string(type_id)
+                    << "]";
+        std::string device_type_s = device_type.str();
+        device_type_s = std::regex_replace(device_type_s, std::regex("ext_oneapi_"), "");
+        GGML_LOG_INFO("|%2d|%19s|%7s|\n", id, device_type_s.c_str(),
+                      info.infos[id].opt_feature.reorder ? "Y": "N");
+    }
+}
+
 void ggml_backend_sycl_print_sycl_devices() {
     GGML_SYCL_DEBUG("[SYCL] call ggml_backend_sycl_print_sycl_devices\n");
     int device_count = dpct::dev_mgr::instance().device_count();
@@ -113,7 +137,6 @@ void ggml_backend_sycl_print_sycl_devices() {
     fprintf(stderr, "|--|-------------------|-----|---------------------------------------|---------------|\n");
     for (int id = 0; id < device_count; ++id) {
         sycl::device device = dpct::dev_mgr::instance().get_device(id);
-        sycl::backend backend = device.get_backend();
         std::string backend_type = get_device_backend_and_type(device);
         int type_id=DeviceNums[backend_type]++;
         std::stringstream device_type;
@@ -127,64 +150,66 @@ void ggml_backend_sycl_print_sycl_devices() {
     fprintf(stderr, "|--|-----------------|--------------|------------|----------------------------------|\n");
     for (int id = 0; id < device_count; ++id) {
         sycl::device device = dpct::dev_mgr::instance().get_device(id);
-        sycl::backend backend = device.get_backend();
         std::string backend_type = get_device_backend_and_type(device);
         int type_id=DeviceNums2[backend_type]++;
         std::stringstream device_type;
         device_type << "[" << backend_type << ":" << std::to_string(type_id) << "]";
-        print_device_detail_part2(id, device, device_type.str());
+        print_device_detail_part2(id, device);
     }
 }
 
 static ggml_sycl_device_info ggml_sycl_init(int main_gpu_id) try {
     static bool initialized = false;
-
+    static ggml_sycl_device_info info(main_gpu_id);
     if (!initialized) {
-        fprintf(stderr, "[SYCL] call ggml_init_sycl\n");
-
         g_ggml_sycl_debug = get_sycl_env("GGML_SYCL_DEBUG", 0);
-        fprintf(stderr, "%s: GGML_SYCL_DEBUG: %d\n", __func__,
-                g_ggml_sycl_debug);
-
-#if defined(GGML_SYCL_F16)
-        fprintf(stderr, "%s: GGML_SYCL_F16: yes\n", __func__);
+        g_ggml_sycl_disable_optimize= get_sycl_env("GGML_SYCL_DISABLE_OPT", 0);
+        GGML_SYCL_DEBUG("[SYCL] call ggml_check_sycl\n");
+        GGML_LOG_INFO("Running with Environment Variables:\n");
+        GGML_LOG_INFO("  GGML_SYCL_DEBUG: %d\n", g_ggml_sycl_debug);
+        GGML_LOG_INFO("  GGML_SYCL_DISABLE_OPT: %d\n", g_ggml_sycl_disable_optimize);
+        GGML_LOG_INFO("Build with Macros:\n");
+#if defined(GGML_SYCL_FORCE_MMQ)
+        GGML_LOG_INFO("  GGML_SYCL_FORCE_MMQ: yes\n");
 #else
-        fprintf(stderr, "%s: GGML_SYCL_F16: no\n", __func__);
+        GGML_LOG_INFO("  GGML_SYCL_FORCE_MMQ: no\n");
 #endif
-
-#if defined(GGML_SYCL_FORCE_MMQ)
-        fprintf(stderr, "%s: GGML_SYCL_FORCE_MMQ: yes\n", __func__);
+#if defined(GGML_SYCL_F16)
+        GGML_LOG_INFO("  GGML_SYCL_F16: yes\n");
 #else
-        fprintf(stderr, "%s: GGML_SYCL_FORCE_MMQ: no\n", __func__);
+        GGML_LOG_INFO("  GGML_SYCL_F16: no\n");
 #endif
 
 #if defined(SYCL_USE_XMX)
-        fprintf(stderr, "%s: SYCL_USE_XMX: yes\n", __func__);
+        GGML_LOG_INFO("%s: SYCL_USE_XMX: yes\n", __func__);
 #else
-        fprintf(stderr, "%s: SYCL_USE_XMX: no\n", __func__);
+        GGML_LOG_INFO("%s: SYCL_USE_XMX: no\n", __func__);
 #endif
 
         if (CHECK_TRY_ERROR(g_all_sycl_device_count =
                                 dpct::dev_mgr::instance().device_count()) !=
             0) {
            initialized = true;
-            return;
+            GGML_LOG_INFO("  g_all_sycl_device_count is wrong:%d\n",
+                          g_all_sycl_device_count);
+            return info;
         }
         GGML_ASSERT(g_all_sycl_device_count <= GGML_SYCL_MAX_DEVICES);
-        ggml_backend_sycl_print_sycl_devices();
-        initialized = true;
-    }
 
-    static ggml_sycl_device_info info(main_gpu_id);
+        if (info.device_count == 0) {
+            GGML_LOG_INFO("%s: failed to initialize " GGML_SYCL_NAME ": no available device found\n",
+                          __func__);
+            return info;
+        }
+        GGML_ASSERT(info.device_count <= GGML_SYCL_MAX_DEVICES);
 
-    if (info.device_count == 0) {
-        fprintf(stderr, "%s: failed to initialize " GGML_SYCL_NAME ": no available device found\n",
-                __func__);
-        return info;
+        ggml_backend_sycl_print_sycl_devices();
+        print_device_opt_feature(info);
+        initialized = true;
     }
-    GGML_ASSERT(info.device_count <= GGML_SYCL_MAX_DEVICES);
 
     return info;
+
 } catch (sycl::exception const &exc) {
     std::cerr << exc.what() << "Exception caught at file:" << __FILE__
               << ", line:" << __LINE__ << std::endl;
@@ -245,3 +270,20 @@ catch (sycl::exception const &exc) {
               << ", line:" << __LINE__ << std::endl;
     std::exit(1);
 }
+
+
+void release_extra_gpu(ggml_tensor_extra_gpu * extra, std::vector<queue_ptr> streams) {
+    for (int i = 0; i < ggml_sycl_info().device_count; ++i) {
+        for (int64_t is = 0; is < GGML_SYCL_MAX_STREAMS; ++is) {
+            if (extra->events[i][is] != nullptr) {
+                SYCL_CHECK(CHECK_TRY_ERROR(dpct::destroy_event(extra->events[i][is])));
+            }
+        }
+        if (extra->data_device[i] != nullptr && streams.size()>0) {
+            ggml_sycl_set_device(i);
+            SYCL_CHECK(
+                CHECK_TRY_ERROR(sycl::free(extra->data_device[i], *(streams[i]))));
+        }
+    }
+    delete extra;
+}
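One small technique worth noting in print_device_opt_feature() above is the std::map<std::string, size_t> counter, which numbers devices separately per backend type. A standalone illustration follows; the backend strings here are made-up examples, the real ones come from get_device_backend_and_type():

    // Standalone illustration of the per-backend numbering used in print_device_opt_feature():
    // operator[] value-initializes missing keys to 0, so each backend string gets its own
    // 0,1,2,... sequence and devices print as [level_zero:gpu:0], [level_zero:gpu:1], ...
    #include <cstdio>
    #include <map>
    #include <string>

    int main() {
        std::map<std::string, size_t> DeviceNums;
        const char * backends[] = { "level_zero:gpu", "level_zero:gpu", "opencl:cpu" };  // example strings
        for (const char * b : backends) {
            size_t type_id = DeviceNums[b]++;
            std::printf("[%s:%zu]\n", b, type_id);
        }
        return 0;
    }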

ggml/src/ggml-sycl/common.hpp

Lines changed: 12 additions & 4 deletions
@@ -38,7 +38,10 @@
 
 void ggml_sycl_host_free(void* ptr);
 
+
 extern int g_ggml_sycl_debug;
+extern int g_ggml_sycl_disable_optimize;
+
 #define GGML_SYCL_DEBUG(...) \
     do { \
         if (g_ggml_sycl_debug) \
@@ -237,20 +240,26 @@ struct ggml_tensor_extra_gpu {
     // tensors
     dpct::event_ptr events[GGML_SYCL_MAX_DEVICES]
                           [GGML_SYCL_MAX_STREAMS];  // events for synchronizing multiple GPUs
+    optimize_feature optimized_feature;
 };
 
+void release_extra_gpu(ggml_tensor_extra_gpu * extra, std::vector<queue_ptr> streams={});
+
 struct ggml_backend_sycl_context {
     int device;
     std::string name;
+    optimize_feature opt_feature;
+    bool optimized_graph=false;
 
     queue_ptr qptrs[GGML_SYCL_MAX_DEVICES][GGML_SYCL_MAX_STREAMS] = { { nullptr } };
 
     explicit ggml_backend_sycl_context(struct ggml_sycl_device_info &sycl_device_info, int id) :
         device(id),
         name(GGML_SYCL_NAME + std::to_string(device)) {
-        for (int i=0;i<GGML_SYCL_MAX_STREAMS; i++){
-            qptrs[id][i] = sycl_device_info.infos[id].qptrs[i];
-        }
+        for (int i=0;i<GGML_SYCL_MAX_STREAMS; i++){
+            qptrs[id][i] = sycl_device_info.infos[id].qptrs[i];
+        }
+        opt_feature = sycl_device_info.infos[id].opt_feature;
     }
 
     queue_ptr stream(int id, int stream) {
@@ -672,5 +681,4 @@ bool gpu_has_xmx(sycl::device &dev);
 void ggml_sycl_op_flatten(ggml_backend_sycl_context & ctx, const ggml_tensor *src0,
                           const ggml_tensor *src1, ggml_tensor *dst,
                           const ggml_sycl_op_flatten_t op);
-
 #endif // GGML_SYCL_COMMON_HPP
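The new opt_feature and optimized_graph members back the "optimize graph once when compute graph" step from the commit message. The graph-walking code itself is in ggml-sycl.cpp, which is not part of this excerpt; the sketch below is only a hypothetical illustration of the gating these fields and GGML_SYCL_DISABLE_OPT enable:

    // Hypothetical sketch, not the code from ggml-sycl.cpp: gate the one-time reorder on
    // the detected device feature, the GGML_SYCL_DISABLE_OPT override, and a per-context flag.
    static void maybe_optimize_graph(ggml_backend_sycl_context & ctx /*, ggml_cgraph * graph */) {
        if (ctx.optimized_graph) {
            return;                                            // already handled for this context
        }
        ctx.optimized_graph = true;
        if (g_ggml_sycl_disable_optimize || !ctx.opt_feature.reorder) {
            return;                                            // user opt-out or unsupported device
        }
        // ... walk the graph, reorder Q4_0 weight tensors, and record the decision in each
        //     tensor's extra (ggml_tensor_extra_gpu::optimized_feature.reorder) ...
    }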

ggml/src/ggml-sycl/convert.cpp

Lines changed: 33 additions & 4 deletions
@@ -125,6 +125,25 @@ static void dequantize_row_q4_0_sycl(const void *vx, dst_t *y, const int64_t k,
     }
 }
 
+template <typename dst_t>
+static void dequantize_row_q4_0_sycl_reorder(const void *vx, dst_t *y, const int64_t k,
+                                             dpct::queue_ptr stream) {
+
+    dpct::has_capability_or_fail(stream->get_device(),
+                                 {sycl::aspect::fp16});
+
+    int constexpr WARP_K = WARP_SIZE * QK4_0;
+    const int n_warp = (k + WARP_K - 1) / WARP_K;
+    GGML_ASSERT(k % 2 == 0);
+    stream->parallel_for(sycl::nd_range<3>(sycl::range<3>(1, 1, n_warp) *
+                                               sycl::range<3>(1, 1, WARP_SIZE),
+                                           sycl::range<3>(1, 1, WARP_SIZE)),
+                         [=](sycl::nd_item<3> item_ct1) [[intel::reqd_sub_group_size(WARP_SIZE)]]{
+                             dequantize_block_q4_0_reorder(vx, y, k, item_ct1);
+                         });
+
+}
+
 template <typename dst_t>
 static void dequantize_row_q4_1_sycl(const void *vx, dst_t *y, const int64_t k,
                                      dpct::queue_ptr stream) {
@@ -452,10 +471,15 @@ static void convert_unary_sycl(const void *__restrict__ vx,
     }
 }
 
-to_fp16_sycl_t ggml_get_to_fp16_sycl(ggml_type type) {
+to_fp16_sycl_t ggml_get_to_fp16_sycl(ggml_type type, ggml_tensor *dst) {
     switch (type) {
         case GGML_TYPE_Q4_0:
-            return dequantize_block_sycl<QK4_0, QR4_0, dequantize_q4_0>;
+            if (dst->src[0]->extra &&
+                ((ggml_tensor_extra_gpu*)dst->src[0]->extra)->optimized_feature.reorder) {
+                return dequantize_row_q4_0_sycl_reorder;
+            } else {
+                return dequantize_block_sycl<QK4_0, QR4_0, dequantize_q4_0>;
+            }
         case GGML_TYPE_Q4_1:
             return dequantize_block_sycl<QK4_1, QR4_1, dequantize_q4_1>;
         case GGML_TYPE_Q5_0:
@@ -499,10 +523,15 @@ to_fp16_sycl_t ggml_get_to_fp16_sycl(ggml_type type) {
     }
 }
 
-to_fp32_sycl_t ggml_get_to_fp32_sycl(ggml_type type) {
+to_fp32_sycl_t ggml_get_to_fp32_sycl(ggml_type type, ggml_tensor *dst) {
     switch (type) {
         case GGML_TYPE_Q4_0:
-            return dequantize_row_q4_0_sycl;
+            if (dst->src[0]->extra &&
+                ((ggml_tensor_extra_gpu*)dst->src[0]->extra)->optimized_feature.reorder) {
+                return dequantize_row_q4_0_sycl_reorder;
+            } else {
+                return dequantize_row_q4_0_sycl;
+            }
         case GGML_TYPE_Q4_1:
             return dequantize_row_q4_1_sycl;
         case GGML_TYPE_Q5_0:
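The reorder dequantizer added above launches one sub-group of WARP_SIZE work-items per WARP_K = WARP_SIZE * QK4_0 quantized values. The standalone arithmetic below reproduces that launch sizing for a concrete k; WARP_SIZE = 32 is assumed here for illustration (the real value comes from the backend headers), and QK4_0 = 32 is the ggml Q4_0 block size:

    // Reproduces the launch-size computation used by dequantize_row_q4_0_sycl_reorder.
    #include <cstdio>

    int main() {
        constexpr int WARP_SIZE = 32;   // assumed sub-group size for this example
        constexpr int QK4_0     = 32;   // ggml Q4_0 block size
        constexpr int WARP_K    = WARP_SIZE * QK4_0;           // 1024 values per warp

        const long k = 4096;                                    // e.g. one row of a 4096-wide weight
        const long n_warp = (k + WARP_K - 1) / WARP_K;          // ceil-div: 4 warps
        const long global_items = n_warp * WARP_SIZE;           // 128 work-items total

        std::printf("k=%ld -> n_warp=%ld, global work-items=%ld, local=%d\n",
                    k, n_warp, global_items, WARP_SIZE);
        return 0;
    }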

ggml/src/ggml-sycl/convert.hpp

Lines changed: 2 additions & 2 deletions
@@ -21,7 +21,7 @@ using to_t_sycl_t = void (*)(const void *__restrict__ x, T *__restrict__ y,
 typedef to_t_sycl_t<float> to_fp32_sycl_t;
 typedef to_t_sycl_t<sycl::half> to_fp16_sycl_t;
 
-to_fp16_sycl_t ggml_get_to_fp16_sycl(ggml_type type);
-to_fp32_sycl_t ggml_get_to_fp32_sycl(ggml_type type);
+to_fp16_sycl_t ggml_get_to_fp16_sycl(ggml_type type, ggml_tensor *dst);
+to_fp32_sycl_t ggml_get_to_fp32_sycl(ggml_type type, ggml_tensor *dst);
 
 #endif // GGML_SYCL_CONVERT_HPP
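With the extra ggml_tensor * parameter, call sites hand the destination tensor to the getter so it can inspect dst->src[0]->extra and select the reorder-aware Q4_0 path. A hedged call-site sketch (src0, dst, dst_f16 and stream are placeholders, not code from this commit, and the converter's argument list is assumed unchanged):

    // Hedged call-site sketch for the new two-argument getters; every identifier except the
    // two getter names and the tensor fields is a placeholder.
    ggml_tensor * src0 = dst->src[0];
    to_fp16_sycl_t to_fp16 = ggml_get_to_fp16_sycl(src0->type, dst);   // may pick the reorder kernel for Q4_0
    if (to_fp16 != nullptr) {
        // Assumed converter argument order: (src data, dst fp16 buffer, element count, queue).
        to_fp16(src0->data, dst_f16, ggml_nelements(src0), stream);
    }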

0 commit comments
