sync : ggml #3193

ggerganov · 2025-05-27T14:09:15Z

No description provided.


…13482)

* Remove mmap workaround on Windows

  After some testing I found that mmap is supported on Windows and for many GPUs on Linux. Therefore I remove the workaround for Windows since it is not necessary.

* Update llama-bench README

  The SYCL backend introduced a workaround that allows execution of llama-bench also without specifying the `--mmp 0` flag.


…ITY op to accelerate D2D memory copy (llama/13647)

* musa: fix build warning (unused parameter)
* musa: upgrade MUSA SDK version to rc4.0.1
* musa: use mudnn::Unary::IDENTITY op to accelerate D2D memory copy
* Update ggml/src/ggml-cuda/cpy.cu
* musa: remove MUDNN_CHECK_GEN and use CUDA_CHECK_GEN instead in MUDNN_CHECK

Signed-off-by: Xiaodong Ye <[email protected]>
Co-authored-by: Johannes Gäßler <[email protected]>


* opencl: fix a couple of crashes
* fix kernel launches that failed on devices which do not support non-uniform work-groups. When non-uniform work-groups are not supported, set `local_work_size` to NULL (= let the driver choose the work-group sizes). This patch does not cover everything, just the cases tested by test-backend-ops.
* fix sub-buffer creation that failed due to `cl_buffer_region::origin` not being aligned to `CL_DEVICE_MEM_BASE_ADDR_ALIGN`.
* OpenCL: query non-uniform WG sizes only on OpenCL 3.0+
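The two fixes above can be sketched in plain C++ without the OpenCL headers. This is only an illustration under stated assumptions, not the code from the patch: `pick_local_work_size`, `align_sub_buffer_origin` and `sub_buffer_region` are hypothetical names, a return value of 0 stands in for passing NULL as `local_work_size`, and rounding the origin down is just one possible way to satisfy the alignment rule. `CL_DEVICE_MEM_BASE_ADDR_ALIGN` is reported in bits, so it is converted to bytes first.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper: returns the local work size to request, or 0 to mean
// "pass NULL as local_work_size and let the driver choose".
std::size_t pick_local_work_size(std::size_t global_size,
                                 std::size_t preferred_local,
                                 bool non_uniform_supported) {
    assert(preferred_local > 0);
    if (global_size % preferred_local == 0) {
        return preferred_local;   // uniform work-groups, always allowed
    }
    if (non_uniform_supported) {
        return preferred_local;   // the last work-group may be smaller
    }
    return 0;                     // stands in for NULL
}

// Hypothetical result type: an aligned sub-buffer origin plus the remaining
// byte offset that still has to be applied when the sub-buffer is used.
struct sub_buffer_region {
    std::size_t origin;
    std::size_t delta;
};

// Round the requested byte offset down to the device base address alignment.
// base_addr_align_bits is the value of CL_DEVICE_MEM_BASE_ADDR_ALIGN (bits).
sub_buffer_region align_sub_buffer_origin(std::size_t offset_bytes,
                                          std::size_t base_addr_align_bits) {
    assert(base_addr_align_bits >= 8 && base_addr_align_bits % 8 == 0);
    const std::size_t align_bytes = base_addr_align_bits / 8;
    const std::size_t origin = offset_bytes - offset_bytes % align_bytes;
    return { origin, offset_bytes - origin };
}
```

With a 1024-bit alignment (128 bytes), a requested offset of 1000 bytes would give an origin of 896 and a remaining delta of 104.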

* opencl: Add support for multiple devices, but limited to one platform. A platform with a GPU will be preferred. Additionally:
  * Filter out devices that lack capabilities needed by the backend implementation (half support, OpenCL 2.0+, etc).
  * Make ggml_backend_opencl_reg() thread-safe.
* fixup: fix an error in sync_with_other_backends when there is only one OpenCL device available.

Currently, with SYCL targeting a CUDA backend, running `GGML_SYCL_DISABLE_GRAPH=0 ./bin/test-backend-ops -b SYCL0` hits two operations that throw an exception from blocking waits during queue recording:

* `-o CONCAT`: use of blocking waits on a queue that is being recorded https://github.com/ggml-org/llama.cpp/blob/master/ggml/src/ggml-sycl/concat.cpp#L185-L187
* `-o MUL_MAT_ID`: blocking wait on a recording queue for a copy to host memory https://github.com/ggml-org/llama.cpp/blob/master/ggml/src/ggml-sycl/ggml-sycl.cpp#L3072-L3074

We've noticed that `ggml-cuda.cu` has the [check_node_graph_compatibility_and_refresh_copy_ops](https://github.com/ggml-org/llama.cpp/blob/39e73ae0d69f882d7e29cecc6dd8f5052fca6731/ggml/src/ggml-cuda/ggml-cuda.cu#L2458-L2458) method for checking whether a graph can be used, even if graphs are enabled. I've taken a similar approach in this PR by adding a method to `ggml-sycl.cpp` that checks whether a graph can be used for the operations, even if the user has asked for it to be enabled.
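The idea of such a compatibility check can be sketched as follows. This is a simplified stand-in, not the method added to `ggml-sycl.cpp`: `op_kind`, `node`, `op_blocks_graph_recording` and `should_use_graph` are hypothetical names, and the list of blocking operations is taken only from the two cases named above.

```cpp
#include <vector>

// Hypothetical, reduced set of operations for illustration.
enum class op_kind { ADD, MUL_MAT, CONCAT, MUL_MAT_ID };

// Hypothetical stand-in for a compute graph node.
struct node {
    op_kind op;
};

// Operations assumed to issue blocking waits, which cannot be recorded.
bool op_blocks_graph_recording(op_kind op) {
    switch (op) {
        case op_kind::CONCAT:
        case op_kind::MUL_MAT_ID:
            return true;
        default:
            return false;
    }
}

// A graph is used only if the user enabled it AND no node is incompatible.
bool should_use_graph(bool user_enabled, const std::vector<node> & nodes) {
    if (!user_enabled) {
        return false;
    }
    for (const node & n : nodes) {
        if (op_blocks_graph_recording(n.op)) {
            return false;   // fall back to eager submission for this graph
        }
    }
    return true;
}
```

The design choice is to decide once per compute graph, before recording starts, so that an incompatible operation never reaches a queue that is in the recording state.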


* cann: add the basic FA support
* cann: update the readme
* cann: update the FlashAttention with PSEShift
* cann: update the input parameters in FA
* cann: update the alibi with max_bias
* cann: add the constraints of softcap
* cann: update the docs CANN.md
* cann: update the docs CANN.md
* cann: fix typo of CANN.md
* cann: add some comments and update the CANN.md
* cann: update the CANN.md
* cann: update the inner precise for fusedInferAttention
* cann: update the constraints of flash_attn_ext on ggml-cann.cpp
* cann: clean the whitespace
* cann: clean the whitespace
* cann: add a new endline

…13611)

* SYCL: Add non contiguous input support to norm kernel
* refactor and add RMS_NORM non contiguous input support

  ggml-ci

* restore subgroup reduction for multi-subgroup thread blocks in norm kernels
* Swap grid dims of nsamples and nrows

  ggml-ci

* Revert "Swap grid dims of nsamples and nrows"

  This reverts commit 43be2d657fec7f7fba54e2cd154106bc0fc45adf.

* restore not required changes

  ggml-ci

* address review comments: change it to more like SYCL
* Use a common function to calculate offset
* remove wrap around logic for handling broadcasts
* remove static from calculate_offset fn and use ceil_div
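For readers unfamiliar with non-contiguous inputs, the sketch below shows what an offset helper and `ceil_div` typically compute. The names `calculate_offset` and `ceil_div` appear in the commit list above, but the signatures here are assumptions made for illustration and are not copied from the SYCL backend; strides are taken to be in elements.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// Number of blocks of size b needed to cover a items (a >= 0, b > 0).
constexpr std::int64_t ceil_div(std::int64_t a, std::int64_t b) {
    return (a + b - 1) / b;
}

// Assumed signature: element offset of (row, channel, sample) in a tensor
// whose rows, channels and samples are separated by arbitrary strides.
std::int64_t calculate_offset(const std::array<std::int64_t, 3> & strides,
                              const std::array<std::int64_t, 3> & indices) {
    std::int64_t offset = 0;
    for (std::size_t i = 0; i < strides.size(); ++i) {
        offset += strides[i] * indices[i];
    }
    return offset;
}
```

A norm kernel written this way reads each row starting at `calculate_offset(...)` instead of assuming `row * ncols`, and `ceil_div(ncols, block_size)` gives the number of loop iterations a work-group needs to cover one row.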


ggerganov and others added 30 commits May 27, 2025 17:06

sync : ggml

8263e97

Vulkan: Add f32 accumulator support to quantized mul mat to fix GLM4 32B incoherence (llama/13607)

525d51b

metal : fix typo in FA kernel comments (llama/13651)

2192394

sycl: disable reorder for sycl mulmat (llama/13536)

1e7f745

CUDA: skip fully masked-out KV in FA vec kernel (llama/13584)

5e90842

* CUDA: skip fully masked-out KV in FA vec kernel

vulkan: fix warnings (llama/13626)

e5f0301

* small fixes
* remove ifdef

ggml : add ggml_gelu_erf() (llama/13667)

39a2783

* ggml : add ggml_gelu_na (not approximated)
* fix naming order
* rename na --> erf
* apply review suggestions
* revert naming order
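For context on "not approximated": the erf form of GELU is x · Φ(x) with Φ the standard normal CDF, while the commonly used approximation replaces Φ with a tanh expression. The reference functions below are a standalone sketch of the two formulas; `gelu_erf_ref` and `gelu_tanh_ref` are hypothetical names and are not the ggml implementations.

```cpp
#include <cmath>

// Exact GELU: 0.5 * x * (1 + erf(x / sqrt(2))).
inline float gelu_erf_ref(float x) {
    const float inv_sqrt2 = 0.70710678f;   // 1 / sqrt(2)
    return 0.5f * x * (1.0f + std::erf(x * inv_sqrt2));
}

// tanh approximation: 0.5 * x * (1 + tanh(sqrt(2/pi) * (x + 0.044715 x^3))).
inline float gelu_tanh_ref(float x) {
    const float sqrt_2_over_pi = 0.79788456f;
    return 0.5f * x * (1.0f + std::tanh(sqrt_2_over_pi * (x + 0.044715f * x * x * x)));
}
```

The two agree closely for typical activations (the difference at x = 1 is on the order of 1e-4), so the erf variant mainly matters for models that were trained with the exact form.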

sycl : Remove waits from function calls (llama/13702)

28c7ab8

* removes the waits in async memcpy functions

use LOG_WARN to replace std::cerr (llama/13657)

c10deb2

vulkan: Disable coopmat/coopmat2/bfloat extensions if glslc doesn't support it (llama/13696)

8e2a08b

vulkan: support CPY from any type to itself (llama/13695)

1fecf05

Reuse the f16/f32 copy shaders, and just scale the number of elements according to the type size.
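The scaling idea can be illustrated on the host side: a same-type copy is a raw byte copy, so it can be expressed as a copy of 16-bit or 32-bit words with an adjusted element count. The helper below is a sketch under assumptions, not the Vulkan backend's logic: `copy_plan` and `plan_same_type_copy` are hypothetical names, and the rule for choosing between the two word widths is invented for illustration.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical description of a same-type copy expressed in fixed-size words.
struct copy_plan {
    std::size_t word_bytes;   // 4 -> use the f32 copy path, 2 -> use the f16 path
    std::size_t n_words;      // element count handed to that copy path
};

// type_size_bytes is the size of one element, or of one block for quantized
// types (in which case n_elements counts blocks).
copy_plan plan_same_type_copy(std::size_t n_elements, std::size_t type_size_bytes) {
    const std::size_t total_bytes = n_elements * type_size_bytes;
    assert(total_bytes % 2 == 0);   // assumption: sizes are multiples of 2 bytes
    const std::size_t word_bytes = (total_bytes % 4 == 0) ? 4 : 2;
    return { word_bytes, total_bytes / word_bytes };
}
```

For example, five blocks of an 18-byte quantized type occupy 90 bytes, which is not a multiple of 4, so this sketch would fall back to 45 two-byte words.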

ggml : fix the order of ggml_unary_op (llama/13718)

39dc9dd

CANN: Support MUL_MAT_ID for q8_0 and q4_0 (llama/13705)

85c583d

* [CANN]Support MUL_MAT_ID Q8 && Q4
* codestyle adjustment

Signed-off-by: noemotiovon <[email protected]>

CUDA: fix race condition in FA vector kernels (llama/13742)

d3b5380

ggml : add ggml_gelu_erf() CUDA kernel (llama/13719)

093dfaa

* ggml : add ggml_gelu_erf() CUDA kernel
* missing semicolon

ggml-cpu : set openmp wait time if not set (llama/13758)

3df6086

SYCL: revert "sycl: simplify bin_bcast_kernel (ggml/13383)" (llama/13752)

4d4a5d7

Temporarily reverted due to failing fp16 DIV operation.

This reverts commit 02cdd2d8b092b5a4bb18e013c6887ce49ba20ac5.

ggml-ci

vulkan: mark IM2COL as supporting non-contig (llama/13783)

6370037

sycl: Add more debug prints (llama/13640)

02ed80e

cuda : avoid cuGetErrorString (llama/13791)

45f2e0f

ggml-ci

ggml : allow CUDA graphs when using pipeline parallelism (llama/13814)

5eabeb7

ggml-cpu: x86 feature detection is specific to x86 (llama/13811)

4e61025

ggml : riscv: add xtheadvector support (llama/13720)

65206c2

* ggml : riscv: add xtheadvector support
* ggml : clean up some macro usage

ggerganov added 2 commits May 27, 2025 17:07

sync : ggml

4575811

ggml-ci

talk-llama : sync llama.cpp

e2ac490

ggml-ci

danbev approved these changes May 27, 2025


sync : fix builds - musa, ruby

255eac6

ggerganov merged commit 527fe6a into master May 27, 2025
3 checks passed

ggerganov deleted the sync-ggml-25-05-27 branch May 27, 2025 15:03

18 participants
