
Conversation

ggerganov
Member

No description provided.

danbev and others added 30 commits July 28, 2025 08:43
This commit removes the inclusion of `<cstdlib>`.

The motivation for this change is that this source file does not seem to
use any functions from this header and the comment about `qsort` is a
little misleading/confusing.
* CMake config: Create target only once

Fix error on repeated find_package(ggml).
For simplicity, check only for the top-level ggml::ggml.

* CMake config: Add CUDA link libs

* CMake config: Add OpenCL link libs

* CMake config: Use canonical find_dependency

Use set and append to control link lib variables.
Apply more $<LINK_ONLY...>.

* CMake config: Wire OpenMP dependency
…316)

* ggml/ggml-vulkan/test-backend-ops: adds CONV_2D for Vulkan

* ggml-vulkan: adds f32 scalar shader to compute 2D convolution directly
with GEMM (no need for im2col)
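
A minimal CUDA sketch of the same implicit-GEMM idea (the actual change is a Vulkan f32 shader; the NCHW layout, the stride-1/no-padding simplification, and all names here are illustrative):

```cuda
// Direct conv2d as implicit GEMM: each output element is the dot product of
// one kernel row (Cin*KH*KW values) with an input column gathered on the fly,
// so no im2col buffer is ever materialized.
__global__ void conv2d_implicit_gemm(const float *src, const float *knl, float *dst,
                                     int Cin, int H, int W, int KH, int KW,
                                     int OH, int OW, int Cout) {
    int out = blockIdx.x * blockDim.x + threadIdx.x; // one thread per output element
    if (out >= Cout * OH * OW) return;
    int ow = out % OW, oh = (out / OW) % OH, oc = out / (OW * OH);

    float acc = 0.0f;
    for (int ic = 0; ic < Cin; ++ic) {
        for (int ky = 0; ky < KH; ++ky) {
            for (int kx = 0; kx < KW; ++kx) {
                acc += src[(ic * H + oh + ky) * W + ow + kx] *
                       knl[((oc * Cin + ic) * KH + ky) * KW + kx];
            }
        }
    }
    dst[(oc * OH + oh) * OW + ow] = acc;
}
```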

* test-backend-ops: adds test_case_ref to check the validity/performance of ops
against reference implementations that use different graphs; adds tests

* Performance fixes: minimized branch divergence, uses collectives to
  eliminate redundant calculation, macros removed.
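
A hedged CUDA analogue of the collectives point (the real code is a Vulkan shader using subgroup operations; here a warp shuffle broadcasts one lane's result so the other lanes skip the redundant work):

```cuda
// One lane per warp computes a value that is identical across the warp,
// then a collective broadcast shares it, instead of 32 lanes recomputing it.
__device__ float warp_shared_scale(const float *tile, int n) {
    float scale = 0.0f;
    if ((threadIdx.x & 31) == 0) {
        for (int i = 0; i < n; ++i) {
            scale = fmaxf(scale, fabsf(tile[i]));
        }
    }
    return __shfl_sync(0xffffffffu, scale, 0); // broadcast lane 0's result
}
```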

* Kernel shared memory size check

* Updates test-backend-ops to support graphs for performance
  measurement.

* Apple/Win32 compile errors fixed

* Subgroup size used to determine tile size -> fixes llvmpipe errors.

* Collectives disabled by default.

* Intel support is disabled as the performance is poor.

* Conv2d enabled for Intel with disabled collectives, disabled for Apple

* test-backend-ops modifications are reverted

* Trailing spaces and missing override fixed.

* Triggering pipeline relaunch.

* Code formatted with .clang-format.
The tid is decomposed into "ow + ky*OW + kx*OW*KH". Change "ksize" to match.
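
Given that layout, the inverse decomposition looks like this sketch (CUDA-style; names follow the commit message):

```cuda
// Decompose tid = ow + ky*OW + kx*OW*KH; ksize must equal OW*KH*KW
// for the loop bounds to match this layout.
__device__ void decompose_tid(int tid, int OW, int KH,
                              int *ow, int *ky, int *kx) {
    *ow = tid % OW;
    *ky = (tid / OW) % KH;
    *kx = tid / (OW * KH);
}
```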
* kleidiai: add support for get_rows

* apply fixes based on code review

* apply more fixes based on code review
* add conv2d kernel

* fix trailing whitespace

* whitespace fix

* handle f16 input and f16 kernel, more opt

* resolve conflicts

* use enqueue_ndrange_kernel
* implement bf16 cpy ops and enable bf16 cont

* deduplicate copy functions

* deduplicate checks
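
A sketch of what such deduplication typically looks like, in CUDA for concreteness (the template and names are illustrative, not the actual ggml functions):

```cuda
#include <cuda_bf16.h>

// One templated element-wise copy covers f32->bf16, bf16->f32, bf16->bf16,
// etc., replacing a family of near-identical per-type copy functions.
template <typename src_t, typename dst_t>
__global__ void cpy_elementwise(const src_t *src, dst_t *dst, int64_t n) {
    int64_t i = (int64_t)blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        dst[i] = (dst_t)(float)src[i]; // route conversions through float
    }
}

// usage sketch: cpy_elementwise<float, __nv_bfloat16><<<nblocks, 256>>>(src, dst, n);
```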
* weight format to nz for 310p

* remove quant weight format to nz

* clean code

* fix

* make the conditions for converting weights to NZ format consistent

* clean code
* CUDA: fix quantized KV cache + multiple sequences

* Update ggml/src/ggml-cuda/fattn-common.cuh

Co-authored-by: Georgi Gerganov <[email protected]>

---------

Co-authored-by: Georgi Gerganov <[email protected]>
* metal : fix fusion across different encoders

ggml-ci

* cont : add assertion

ggml-ci
* musa: apply mublas API changes

Signed-off-by: Xiaodong Ye <[email protected]>

* musa: update musa version to 4.2.0

Signed-off-by: Xiaodong Ye <[email protected]>

* musa: restore MUSA graph settings in CMakeLists.txt

Signed-off-by: Xiaodong Ye <[email protected]>

* musa: disable mudnnMemcpyAsync by default

Signed-off-by: Xiaodong Ye <[email protected]>

* musa: switch back to non-mudnn images

Signed-off-by: Xiaodong Ye <[email protected]>

* minor changes

Signed-off-by: Xiaodong Ye <[email protected]>

* musa: restore rc in docker image tag

Signed-off-by: Xiaodong Ye <[email protected]>

---------

Signed-off-by: Xiaodong Ye <[email protected]>
Neither "g" nor "x" are valid portPos specifiers per the official
[graphviz documents](https://graphviz.org/docs/attr-types/portPos/):

> If a compass point is used, it must have the form "n","ne","e","se","s","sw","w","nw","c","_".

I tested locally for it to fall back to default portPos specifier if an
invalid portPos is specified. As a consequence, we can remove associated
code.
* opencl: add fused `rms_norm` + `mul`

* opencl: improve workgroup size for `rms_norm_mul`
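
The fusion saves a full round trip through global memory for the normalized intermediate; a hedged CUDA sketch of the pattern (the actual kernel is OpenCL, and this version assumes a single 32-thread group per row for brevity):

```cuda
// Fused rms_norm(x) * w: the mean of squares is reduced once, then the
// normalized value is multiplied by w in-register instead of being written
// out and re-read by a separate mul kernel. Assumes blockDim.x == 32.
__global__ void rms_norm_mul(const float *x, const float *w, float *dst,
                             int ncols, float eps) {
    const float *row = x + (size_t)blockIdx.x * ncols;
    float *out = dst + (size_t)blockIdx.x * ncols;

    float sumsq = 0.0f;
    for (int i = threadIdx.x; i < ncols; i += blockDim.x) {
        sumsq += row[i] * row[i];
    }
    for (int off = 16; off > 0; off >>= 1) {      // warp-wide reduction
        sumsq += __shfl_xor_sync(0xffffffffu, sumsq, off);
    }
    const float scale = rsqrtf(sumsq / ncols + eps);
    for (int i = threadIdx.x; i < ncols; i += blockDim.x) {
        out[i] = row[i] * scale * w[i];           // mul fused into the same pass
    }
}
```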
* feat: Add s_off as a parameter in the args struct

This may not be necessary, but it more closely mirrors the CUDA kernel

Branch: GraniteFourPerf

Signed-off-by: Gabe Goodhart <[email protected]>

* perf: Parallelize mamba2 SSM_SCAN metal kernel over d_state

This is a first attempt at optimizing the metal kernel. The changes here
are:

- Launch the kernel with a thread group of size d_state
- Use simd groups and shared memory to do the summation for the y
  computation

When tested with G4 tiny preview, this shows roughly a 3x speedup on
prefill and 15% speedup on decode.

Signed-off-by: Gabe Goodhart <[email protected]>
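
A CUDA analogue of the described launch/reduction shape (the real kernel is Metal; this sketch assumes blockDim.x == d_state and d_state is a multiple of 32):

```cuda
// One thread per state element; each warp (simd group) sums its slice with
// shuffles, partials go through shared memory, and thread 0 finishes the sum.
__global__ void ssm_scan_y(const float *state, const float *C, float *y) {
    __shared__ float partial[32];
    const int lane = threadIdx.x & 31, warp = threadIdx.x >> 5;

    float v = state[threadIdx.x] * C[threadIdx.x]; // per-element contribution
    for (int off = 16; off > 0; off >>= 1) {
        v += __shfl_xor_sync(0xffffffffu, v, off); // intra-warp sum
    }
    if (lane == 0) partial[warp] = v;
    __syncthreads();

    if (threadIdx.x == 0) { // serial final stage (see the two-stage version further below)
        float sum = 0.0f;
        for (int w = 0; w < (int)(blockDim.x >> 5); ++w) sum += partial[w];
        *y = sum;
    }
}
```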

* fix: Update logic to correctly do the multi-layer parallel sum

Signed-off-by: Gabe Goodhart <[email protected]>

* fix: Correctly size the shared memory buffer and assert expected size relationships

Branch: GraniteFourPerf

Signed-off-by: Gabe Goodhart <[email protected]>

* refactor: Compute block offsets once rather than once per token

Branch: GraniteFourPerf

Signed-off-by: Gabe Goodhart <[email protected]>
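
Sketch of the hoisting (hypothetical names; the point is only that the base offset moves out of the token loop):

```cuda
// Before: the base offset was re-derived for every token.
// After: it is computed once per thread and reused across the loop.
__device__ void scan_tokens(const float *state, float *y,
                            int head, int d_head, int d_state,
                            int tid, int n_tokens) {
    const float *base = state + ((size_t)head * d_head + tid) * (size_t)d_state;
    for (int t = 0; t < n_tokens; ++t) {
        y[t] = base[0]; // placeholder for the real per-token recurrence
    }
}
```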

* feat: Use local variable for state recursion

Branch: GraniteFourPerf

Signed-off-by: Gabe Goodhart <[email protected]>

* feat: Use a secondary simd_sum instead of a for loop

Branch: GraniteFourPerf

Signed-off-by: Gabe Goodhart <[email protected]>
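
This replaces the serial final stage in the sketch above: the first simd group loads the partials and reduces them with one more shuffle sum. A self-contained CUDA analogue (requires that the number of simd groups not exceed the simd width, which the assertion mentioned below enforces):

```cuda
// Two-stage reduction: stage 1 sums within each warp, stage 2 sums the
// per-warp partials with a second shuffle reduction instead of a for loop.
// Requires blockDim.x/32 <= 32.
__global__ void block_sum(const float *x, float *out) {
    __shared__ float partial[32];
    const int lane = threadIdx.x & 31, warp = threadIdx.x >> 5;
    const int nwarps = blockDim.x >> 5;

    float v = x[blockIdx.x * blockDim.x + threadIdx.x];
    for (int off = 16; off > 0; off >>= 1)
        v += __shfl_xor_sync(0xffffffffu, v, off);      // stage 1
    if (lane == 0) partial[warp] = v;
    __syncthreads();

    if (warp == 0) {
        v = lane < nwarps ? partial[lane] : 0.0f;
        for (int off = 16; off > 0; off >>= 1)
            v += __shfl_xor_sync(0xffffffffu, v, off);  // stage 2
        if (lane == 0) out[blockIdx.x] = v;
    }
}
```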

* feat: Add assertion and comment about relationship between simd size and num simd groups

Branch: GraniteFourPerf

Signed-off-by: Gabe Goodhart <[email protected]>

* feat: Parallelize over d_state for mamba-1

Branch: GraniteFourPerf

Signed-off-by: Gabe Goodhart <[email protected]>

* feat: Parallel sum in SSM_CONV

Branch: GraniteFourPerf

Signed-off-by: Gabe Goodhart <[email protected]>

* Revert "feat: Parallel sum in SSM_CONV"

After discussion with @compilade, the size of the parallelism here is
not worth the cost in complexity or overhead of the parallel for.

ggml-org/llama.cpp#14743 (comment)

This reverts commit 16bc059660c1c59e566628201c0ca2c20c9f4bc3.

Signed-off-by: Gabe Goodhart <[email protected]>

* refactor: Simplify shared memory sizing

Branch: GraniteFourPerf

Signed-off-by: Gabe Goodhart <[email protected]>
Co-Authored-By: Georgi Gerganov <[email protected]>

---------

Signed-off-by: Gabe Goodhart <[email protected]>
Co-authored-by: Georgi Gerganov <[email protected]>
taronaeo and others added 7 commits July 28, 2025 08:43
* docs: update s390x document for sentencepiece

Signed-off-by: Aaron Teo <[email protected]>
(cherry picked from commit e086c5e3a7ab3463d8e0906efcfa39352db0a48d)

* docs: update huggingface links + reword

Signed-off-by: Aaron Teo <[email protected]>
(cherry picked from commit 8410b085ea8c46e22be38266147a1e94757ef108)

* ggml-cpu: disable ggml-nnpa compile flag by default

fixes #14877

Signed-off-by: Aaron Teo <[email protected]>
(cherry picked from commit 412f4c7c88894b8f55846b4719c76892a23cfe09)

* docs: update s390x build docs to reflect nnpa disable

Signed-off-by: Aaron Teo <[email protected]>
(cherry picked from commit c1eeae1d0c2edc74ab9fbeff2707b0d357cf0b4d)

---------

Signed-off-by: Aaron Teo <[email protected]>
Implement REGLU, GEGLU, SWIGLU ops according to #14158
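
All three ops share one shape: split the input into halves a and b, then compute act(a) * b with act in {relu, gelu, silu}. A hedged CUDA sketch of that common structure (not necessarily the exact ggml kernels):

```cuda
__device__ float op_relu(float x) { return fmaxf(x, 0.0f); }
__device__ float op_silu(float x) { return x / (1.0f + expf(-x)); }
__device__ float op_gelu(float x) { // tanh approximation
    return 0.5f * x * (1.0f + tanhf(0.7978845608f * (x + 0.044715f * x * x * x)));
}

// GLU family: out[i] = act(a[i]) * b[i]; only the activation differs.
template <float (*act)(float)>
__global__ void glu(const float *a, const float *b, float *out, int n) {
    const int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = act(a[i]) * b[i];
}

// reglu:  glu<op_relu><<<nblk, 256>>>(a, b, out, n);
// geglu:  glu<op_gelu><<<nblk, 256>>>(a, b, out, n);
// swiglu: glu<op_silu><<<nblk, 256>>>(a, b, out, n);
```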
…(llama/14624)

This commit adds support for MFMA instructions to MMQ. CDNA1/GFX908, CDNA2/GFX90a, and CDNA3/GFX942 are supported by the MFMA-enabled code path added by this commit. The code path and stream-k are enabled only on CDNA3 for now, as they fail to outperform BLAS in all cases on the other devices.
BLAS is currently only consistently outperformed on CDNA3, due to issues in the AMD-provided BLAS libraries.
This commit also makes MMQ more aware of different warp sizes and, as a side effect, improves the performance of all quant formats on GCN GPUs, except q4_0 and q4_1, which regress slightly.
* add f16 to conv_2d testing
* weaken conv2d test error threshold
ggml-ci
ggerganov merged commit d0a9d8c into master on Jul 28, 2025 (55 checks passed)
ggerganov deleted the sync-ggml-25-07-28 branch on July 28, 2025 at 10:02