sync : ggml #3428

ggerganov · 2025-09-20T10:49:33Z

Huge sync incoming.

It would be a miracle if everything works correctly after this 🤞

ggml-ci

…ama/15385) * Added VSX intrinsics for Power9+ systems Signed-off-by: mgiessing <[email protected]> * Manual unrolling for minor perf improvement Signed-off-by: mgiessing <[email protected]> * Update ggml/src/ggml-cpu/arch/powerpc/quants.c Co-authored-by: Georgi Gerganov <[email protected]> --------- Signed-off-by: mgiessing <[email protected]> Co-authored-by: Georgi Gerganov <[email protected]>

…lama/15413) Signed-off-by: Xiaodong Ye <[email protected]>

* optimize rope ops * amendment * delete trailing whitespace * change the variable name

…ama/15375)

* musa: fix build warnings Signed-off-by: Xiaodong Ye <[email protected]> * fix warning: comparison of integers of different signs: 'const int' and 'unsigned int' [-Wsign-compare] Signed-off-by: Xiaodong Ye <[email protected]> --------- Signed-off-by: Xiaodong Ye <[email protected]>

These detailed strings were causing increased build time on gcc.

…(llama/15346)

Signed-off-by: Xiaodong Ye <[email protected]>

* vulkan: Reuse conversion results in prealloc_y Cache the pipeline and tensor that were most recently used to fill prealloc_y, and skip the conversion if the current pipeline/tensor match. * don't use shared pointer for prealloc_y_last_pipeline_used

Co-authored-by: aeseulgi <[email protected]>

…pt processing (llama/15488)

* [CANN] Optimize RMS_NORM using cache Signed-off-by: noemotiovon <[email protected]> * fix typo Signed-off-by: noemotiovon <[email protected]> * fix review comment Signed-off-by: noemotiovon <[email protected]> * codestyle adjustment Signed-off-by: noemotiovon <[email protected]> --------- Signed-off-by: noemotiovon <[email protected]>

* ggml-cpu: initial q5_0 impl for s390x Signed-off-by: Aaron Teo <[email protected]> * ggml-cpu: updated q5_0 code for better performance Signed-off-by: Aaron Teo <[email protected]> * ggml-cpu: use optimised hsum for better performance Signed-off-by: Aaron Teo <[email protected]> * ggml-cpu: introduce q5_1 simd + refactor q5_0 Signed-off-by: Aaron Teo <[email protected]> * ggml-cpu: fix incorrect return type vec_hsum Signed-off-by: Aaron Teo <[email protected]> * ggml-cpu: q5_0 incomplete refactor + table_b2b_0 activation Signed-off-by: Aaron Teo <[email protected]> * ggml-cpu: refactor q5_1 Signed-off-by: Aaron Teo <[email protected]> * ggml-cpu: q5_1 update loop unroll to 4 Signed-off-by: Aaron Teo <[email protected]> * ggml-cpu: update q5_0 unroll to 4 Signed-off-by: Aaron Teo <[email protected]> * ggml-cpu: update build-s390x docs Signed-off-by: Aaron Teo <[email protected]> * ggml-cpu: update unused variables q5_0 Signed-off-by: Aaron Teo <[email protected]> * docs: update the last update date Signed-off-by: Aaron Teo <[email protected]> --------- Signed-off-by: Aaron Teo <[email protected]>

* Add Pad Reflect 1D CUDA support * Update ggml/src/ggml-cuda/pad_reflect_1d.cu Co-authored-by: Johannes Gäßler <[email protected]> --------- Co-authored-by: Johannes Gäßler <[email protected]>

* add conv3d * bump GGML_OP_COUNT

* Begin work on set_rows * Work on set rows * Add error buffers for reporting unsupported SET_ROWS indices * Remove extra comments * Work on templating for different types in shaders * Work on shader type generation * Working q4_0 mul_mat and some templating for different types * Add q4_0_f16 matmul and fix device init * Add matmul support for basic quantization types * Add q2_k and q3_k quantization * Add rest of k-quants * Get firt i-quant working * Closer to supporting all i-quants * Support rest of i-quants * Cleanup code * Fix python formatting * debug * Bugfix for memset * Add padding to end of buffers on creation * Simplify bit-shifting * Update usage of StringView

…/15427) - Spread the work across the whole workgroup. Using more threads seems to far outweigh the synchronization overhead. - Specialize the code for when the division is by a power of two.

* vulkan : support ggml_mean * vulkan : support sum, sum_rows and mean with non-contiguous tensors * vulkan : fix subbuffer size not accounting for misalign offset * tests : add backend-op tests for non-contiguous sum_rows * cuda : require contiguous src for SUM_ROWS, MEAN support * sycl : require contiguous src for SUM, SUM_ROWS, ARGSORT support * require ggml_contiguous_rows in supports_op and expect nb00=1 in the shader

…llama/15489) Track a list of nodes that need synchronization, and only sync if the new node depends on them (or overwrites them). This allows some overlap which can improve performance, and centralizes a big chunk of the synchronization logic. The remaining synchronization logic involves writes to memory other than the nodes, e.g. for dequantization or split_k. Each of these allocations has a bool indicating whether they were in use and need to be synced. This should be checked before they are written to, and set to true after they are done being consumed.

…le SMs (llama/15281) * vulkan: optimize rms_norm, and allow the work to spread across multiple SMs There are really two parts to this change: (1) Some optimizations similar to what we have in soft_max, to unroll with different numbers of iterations. (2) A fusion optimization where we detect add followed by rms_norm, and make the add shader atomically accumulate the values^2 into memory. Then the rms_norm shader can just load that sum. This allows the rms_norm to be parallelized across multiple workgroups, it just becomes a simple per-element multiply. The fusion optimization is currently only applied when the rms_norm is on a single vector. This previously always ran on a single SM. It could apply more broadly, but when there are other dimensions the work can already spread across SMs, and there would be some complexity to tracking multiple atomic sums. * Change add+rms_norm optimization to write out an array of partial sums rather than using atomic add, to make it deterministic. The rms_norm shader fetches a subgroup's worth in parallel and uses subgroupAdd to add them up. * complete rebase against fused adds - multi_add shader can also compute partial sums * fix validation errors * disable add_rms_fusion for Intel due to possible driver bug * resolve against #15489, sync after clearing partial sums

* vulkan: workaround MoltenVK compile failure in multi_add * Update ggml/src/ggml-vulkan/vulkan-shaders/multi_add.comp Co-authored-by: 0cc4m <[email protected]>

…5526)

The scalar FA shader already handled multiples of 8. The coopmat1 FA shader assumed 16x16x16 and the shared memory allocations need the HSK dimensions padded to a multiple of 16. NVIDIA's coopmat2 implementation requires multiples of 16 for N and K, and needs the matrix dimensions padded and loads clamped. Store the FA pipelines in a map, indexed by the pipeline state.

… (llama/15524) * vulkan: use subgroup function for mul_mat_id shader even without coopmat * vulkan: fix compile warnings * vulkan: properly check for subgroup size control and require full subgroups for subgroup mul_mat_id * vulkan: disable subgroup mul_mat_id on devices with subgroups < 16

This commit removes the `-dev` suffix from the version string in CMakeLists.txt and the release script. The version will now be just be formatted as `MAJOR.MINOR.PATCH`.

* CUDA: add a fused top-K MoE kernel This kernel does the following: 1. softmax over the logits per token [n_experts, n_tokens] 2. argmax reduce over the top-k (n_experts_used) logits 3. write weights + ids to global memory It is intended as fusion of softmax->top-k->get_rows pipeline for MoE models * Refactor into ggml_cuda_should_use_topk_moe * Review: Use better coalescing pattern, use WARP_SIZE, store logits into registers before * Review: format + micro-optimizations * Fix bug: fix tie breakers * Add optional norm + clean-up code * Use smem for final write * Add bounds check * Use better memory pattern for writeback

Signed-off-by: Xiaodong Ye <[email protected]>

* ggml-cpu: impl mxfp4 s390x Signed-off-by: Aaron Teo <[email protected]> * ggml-cpu: missing s = sumf Signed-off-by: Aaron Teo <[email protected]> * ggml-cpu: fix incorrect kval_mxfp4 type Signed-off-by: Aaron Teo <[email protected]> * ggml-cpu: rework mxfp4 Signed-off-by: Aaron Teo <[email protected]> * ggml-cpu: missing delta calc Signed-off-by: Aaron Teo <[email protected]> * ggml-cpu: fix typo Signed-off-by: Aaron Teo <[email protected]> * ggml-cpu: fix typo for vec_splats Signed-off-by: Aaron Teo <[email protected]> * ggml-cpu: expand to 2 blocks per loop Signed-off-by: Aaron Teo <[email protected]> * ggml-cpu: add unroll to boost perf Signed-off-by: Aaron Teo <[email protected]> * ggml-cpu: back to 1 block per loop to test perf Signed-off-by: Aaron Teo <[email protected]> * Revert "ggml-cpu: back to 1 block per loop to test perf" This reverts commit 1fe55724e2dc295701101bf838bdd4a512237492. Signed-off-by: Aaron Teo <[email protected]> * ggml-cpu: rm unroll from single block Signed-off-by: Aaron Teo <[email protected]> --------- Signed-off-by: Aaron Teo <[email protected]>

…6185) * vendor : update httplib Signed-off-by: Adrien Gallouët <[email protected]> * common : use cpp-httplib as a cURL alternative for downloads The existing cURL implementation is intentionally left untouched to prevent any regressions and to allow for safe, side-by-side testing by toggling the `LLAMA_CURL` CMake option. Signed-off-by: Adrien Gallouët <[email protected]> * ggml : Bump to Windows 10 Signed-off-by: Adrien Gallouët <[email protected]> --------- Signed-off-by: Adrien Gallouët <[email protected]>

* devops: move s390x and ppc64le ci build we have access to ubuntu-24.04-s390x and ppc64le images now Signed-off-by: Aaron Teo <[email protected]> * devops: disable ppc64le for now since they have compiler errors Signed-off-by: Aaron Teo <[email protected]> * devops: stop warnings as errors Signed-off-by: Aaron Teo <[email protected]> * devops: switch to non-macro flag Signed-off-by: Aaron Teo <[email protected]> * devops: going the llama macro route Signed-off-by: Aaron Teo <[email protected]> * devops: add big-endian gguf test models Signed-off-by: Aaron Teo <[email protected]> * devops: disable ppc64le to test s390x, check test build Signed-off-by: Aaron Teo <[email protected]> * devops: dup .gguf.inp files for big-endian tests Signed-off-by: Aaron Teo <[email protected]> * devops: dup .gguf.out files for big-endian too Signed-off-by: Aaron Teo <[email protected]> * devops: add python setup and endian byteswap Signed-off-by: Aaron Teo <[email protected]> * devops: pooring thing does not have s390x python3 Signed-off-by: Aaron Teo <[email protected]> * devops: add missing rust compiler for s390x Signed-off-by: Aaron Teo <[email protected]> * devops: try rust actions runner Signed-off-by: Aaron Teo <[email protected]> * Revert "devops: try rust actions runner" This reverts commit 3f8db04356033d6c1d7eccc75ca396bc5298250c. Signed-off-by: Aaron Teo <[email protected]> * devops: try a different path for rust Signed-off-by: Aaron Teo <[email protected]> * devops: dump home directory and user info Signed-off-by: Aaron Teo <[email protected]> * devops: install gguf-py only Signed-off-by: Aaron Teo <[email protected]> * devops: missed relative path Signed-off-by: Aaron Teo <[email protected]> * devops: remove big-endian files since local swapping is working Signed-off-by: Aaron Teo <[email protected]> * devops: revert test-tokenizer-0 cmakelists Signed-off-by: Aaron Teo <[email protected]> * Fix unicode flags conversion from and to uint16_t Bitfields are allocated in different order on s390x Signed-off-by: Aaron Teo <[email protected]> * Simplify byteswap command Signed-off-by: Aaron Teo <[email protected]> * Add byteswapping and git-lfs for test-tokenizers-ggml-vocabs Signed-off-by: Aaron Teo <[email protected]> * Fix endianness detection in vocab loader Signed-off-by: Aaron Teo <[email protected]> * Disable test-thread-safety on s390x In this test a model is downloaded, then immediately loaded to check if more downloads are needed, and then used for test. There is no clean way to separate all those steps to add byteswapping between them, so just skip this test. Signed-off-by: Aaron Teo <[email protected]> * Fix q8_0 test in test-quantize-fns vec_signed uses unexpected rounding mode. Explicitly use different rounding function. Signed-off-by: Aaron Teo <[email protected]> * devops: add big-endian stories260K Signed-off-by: Aaron Teo <[email protected]> * devops: add s390x test-eval-callback Signed-off-by: Aaron Teo <[email protected]> * devops: fix test does not exist Signed-off-by: Aaron Teo <[email protected]> * devops: fix model not found llama-eval-callback Signed-off-by: Aaron Teo <[email protected]> * Fix q3_K dot product error in test-quantize-fns on s390x Array q8bytes had only 4 elements allocated, but 8 elements accessed. This lead to write out of bounds and later read of overwritten values out of bounds and incorrect result. Signed-off-by: Aaron Teo <[email protected]> * devops: re-enable ppc64le for testing Signed-off-by: Aaron Teo <[email protected]> * devops: activate test-thread-safety for s390x Signed-off-by: Aaron Teo <[email protected]> * devops: disable ppc64le tests for some reason it keeps failing test-thread-safety tests and I do not have a machine that is able to replicate the tests. Signed-off-by: Aaron Teo <[email protected]> * devops: LLAMA_FATAL_WARNINGS=ON Signed-off-by: Aaron Teo <[email protected]> * Correct repository URL for s390x for test-thread-safety model Signed-off-by: Aaron Teo <[email protected]> * Fix fs_get_cache_directory Ensure it works even if both XDG_CACHE_HOME and HOME are unset. This might happen in containers. Signed-off-by: Aaron Teo <[email protected]> * Re-enable CI for ppc64le Signed-off-by: Aaron Teo <[email protected]> * Fortify ggml_rope_impl Only memcpy data from sections argument if it's non-NULL. Signed-off-by: Aaron Teo <[email protected]> * Add TODO in struct unicode_cpt_flags to reimplement it in endian-independent way * Update URL for big-endian model * Update .github/workflows/build.yml Co-authored-by: Sigbjørn Skjæret <[email protected]> * Update remaining mentions of BE models to ggml-org/models repo --------- Signed-off-by: Aaron Teo <[email protected]> Co-authored-by: Aleksei Nikiforov <[email protected]> Co-authored-by: Aleksei Nikiforov <[email protected]> Co-authored-by: Sigbjørn Skjæret <[email protected]>

The dequantize functions are copy/pasted from mul_mm_funcs.comp with very few changes - add a_offset and divide iqs by 2. It's probably possible to call these functions from mul_mm_funcs and avoid the duplication, but I didn't go that far in this change.

…vices (llama/16156) * Throw system error on old Vulkan driver rather than SIGABRT * Optionally handle any potential error in vulkan init

* CUDA: refactor and deduplicate vector FA kernels

…lama/16277) * CUDA: mul_mat_id for mmf for bs <= 64 for f16 and bs <= 32 for f32 This commit adds mul_mat_id support for ncols_dst >= 16. It does this by packing ncols_dst tiles into the blockDim.y. My tests on a RTX 3090 show that this is faster than the cuBLAS fallback for f16 till bs=64, and for f32 till bs=32 * Review: refactor if statement

…ma/16224) * don't use VULKAN_HPP_DEFAULT_DISPATCH_LOADER_DYNAMIC_STORAGE which can cause conflicts if application or other libraries do the same

The "Clamp" spec constant is already based on whether KV is a multiple of Bc, so use that to control whether bounds checking is performed. Add bounds checking to the scalar and coopmat1 paths. Coopmat2 didn't need any changes (the K/V tensors are already optionally clamped, nothing else needed to be changed).

* vulkan: handle mat_mul with A matrix > 4GB This change splits mat_mul operations with huge A matrix into chunks in the M dimension. This works well for stable-diffusion use cases where the im2col matrix has very large M. Fix the order of setting the stride in mul_mm_cm2 - setting the dimension clobbers the stride, so stride should be set after. * build fixes

* metal : fuse non-sequential nodes * cont : add comment * cont : simplify bounds checks

* metal : support mul_mm with src1->type == GGML_TYPE_F16 * metal : support mul_mm_id with src1->type == GGML_TYPE_F16 [no ci] * metal : mul_mm support ne00 % 32 != 0 * metal : support mul_mm_id with ne00 % 32 != 0 * cont : remove unnecessary unrolls * cont : simplify data loading * metal : optimize mul_mm when output bounds checks are not needed

* vulkan: 64-bit im2col Add variants of the im2col shaders that use buffer_device_address/buffer_reference, and use 64-bit address calculations. This is needed for large convolutions used in stable-diffusion.cpp. * fix validation error for large im2col

…a/16307) * fix GGML_F32_VEC_FMA argument order in ggml_vec_mad1_f32 * add test that fails on simd

…292)

* check cuda argsort limits and add test * add metal check

ggerganov and others added 30 commits September 20, 2025 13:42

cuda : remove obsolete sources (ggml/1332)

80447f7

ggml-ci

musa: handle __hgt2_mask, available starting from MUSA SDK rc4.3.0 (l…

02b49af

…lama/15413) Signed-off-by: Xiaodong Ye <[email protected]>

CANN: optimize rope operator (llama/15335)

2572322

* optimize rope ops * amendment * delete trailing whitespace * change the variable name

opencl: mark argsort unsupported if cols exceed workgroup limit (ll…

db1d238

…ama/15375)

vulkan: shorten pipeline name strings (llama/15431)

5907ab3

These detailed strings were causing increased build time on gcc.

CUDA: replace GGML_CUDA_F16 with CUDA arch checks (llama/15433)

316ed78

CUDA: refactor FA support/selection code (llama/15454)

8f0579a

sched : copy only the used experts when offloading prompt processing …

622dec5

…(llama/15346)

musa: add GGML_UNUSED_VARS (llama/15446)

7c07784

Signed-off-by: Xiaodong Ye <[email protected]>

ggml : fix condition of im2col on Metal backend (llama/15460)

c5874bc

vulkan: add exp operation (llama/15456)

7eebd49

Co-authored-by: aeseulgi <[email protected]>

vulkan : support conv_2d_dw with f16 weights (llama/15392)

9dd5039

sched : fix possible use of wrong ids tensor when offloading moe prom…

554f96f

…pt processing (llama/15488)

cuda : add Pad Reflect 1D support (llama/14659)

18ca4e8

* Add Pad Reflect 1D CUDA support * Update ggml/src/ggml-cuda/pad_reflect_1d.cu Co-authored-by: Johannes Gäßler <[email protected]> --------- Co-authored-by: Johannes Gäßler <[email protected]>

ggml: add conv3d op (llama/15182)

d7b7498

* add conv3d * bump GGML_OP_COUNT

vulkan: optimize mul_mat_id loading row ids into shared memory (llama…

485c5c3

…/15427) - Spread the work across the whole workgroup. Using more threads seems to far outweigh the synchronization overhead. - Specialize the code for when the division is by a power of two.

CUDA: fix half2 -> half conversion for HIP (llama/15529)

b0d15e1

vulkan: workaround MoltenVK compile failure in multi_add (llama/15506)

2781786

* vulkan: workaround MoltenVK compile failure in multi_add * Update ggml/src/ggml-vulkan/vulkan-shaders/multi_add.comp Co-authored-by: 0cc4m <[email protected]>

vulkan: enable Conv2D for Apple after MoltenVK fixed the bug (llama/1…

8c7872d

…5526)

danbev and others added 26 commits September 29, 2025 15:18

ggml : bump version to 0.9.3 (ggml/1353)

06d7b3d

ggml : remove -dev suffix from release version (ggml/1355)

611ff19

This commit removes the `-dev` suffix from the version string in CMakeLists.txt and the release script. The version will now be just be formatted as `MAJOR.MINOR.PATCH`.

ggml : fix loongarch lsx compilation error (llama/15864)

24ea547

musa: fix build warnings (llama/15611)

98ac209

Signed-off-by: Xiaodong Ye <[email protected]>

metal : report OOM errors (llama/16274)

670d54e

vulkan: throw system error instead of SIGABRT during init on older de…

88dd9e0

…vices (llama/16156) * Throw system error on old Vulkan driver rather than SIGABRT * Optionally handle any potential error in vulkan init

CUDA: refactor and deduplicate vector FA kernels (llama/16208)

e856483

* CUDA: refactor and deduplicate vector FA kernels

vulkan : make the vulkan.hpp dynamic dispatcher instance private (lla…

bc1ac13

…ma/16224) * don't use VULKAN_HPP_DEFAULT_DISPATCH_LOADER_DYNAMIC_STORAGE which can cause conflicts if application or other libraries do the same

metal : fuse non-sequential nodes (llama/16102)

45976f2

* metal : fuse non-sequential nodes * cont : add comment * cont : simplify bounds checks

ggml : fix GGML_F32_VEC_FMA argument order in ggml_vec_mad1_f32 (llam…

5c6e795

…a/16307) * fix GGML_F32_VEC_FMA argument order in ggml_vec_mad1_f32 * add test that fails on simd

vulkan: Fix validation failure in quantized flash attention (llama/16…

a375e4c

…292)

ggml : fix dependencies for ggml_set_rows (llama/16318)

7ce0a7b

ggml : check cuda and metal argsort limits and add test (llama/16323)

112e10f

* check cuda argsort limits and add test * add metal check

cmake : remove metal flag (llama/0)

3201382

sync : ggml

404a931

talk-llama : sync llama.cpp

fcf0181

ggerganov force-pushed the sync-ggml-25-09-20 branch from aa10f77 to fcf0181 Compare September 29, 2025 12:18

ggerganov merged commit fcf0181 into master Sep 29, 2025
54 of 55 checks passed

ggerganov temporarily deployed to github-pages September 29, 2025 13:41 — with GitHub Actions Inactive

ggerganov deleted the sync-ggml-25-09-20 branch September 29, 2025 13:41

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

sync : ggml #3428

sync : ggml #3428

Uh oh!

ggerganov commented Sep 20, 2025

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

55 participants

sync : ggml #3428

sync : ggml #3428

Uh oh!

Conversation

ggerganov commented Sep 20, 2025

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

55 participants