sync : llama.cpp #1327

ggerganov · 2025-08-13T16:05:54Z

waiting for fix: ggml-org/llama.cpp#15132 (comment)

* oai moe
* compat with new checkpoint
* add attn sink impl
* add rope scaling yarn
* logits match with latest transformers code
* wip chat template
* rm trailing space
* use ggml_scale_bias
* rm redundant is_swa_all
* convert interleaved gate_up
* graph : fix activation function to match reference (llama/7)
* vocab : handle o200k_harmony special tokens
* ggml : add attention sinks support (llama/1)
* llama : add attn sinks
* ggml : add attn sinks
* cuda : add attn sinks
* vulkan : add support for sinks in softmax
  remove unnecessary return
* ggml : add fused swiglu_oai op (llama/11)
* ggml : add fused swiglu_oai op
* Update src/ggml-cpu/ops.cpp
  Co-authored-by: Georgi Gerganov
* update CUDA impl
* cont : metal impl
* add vulkan impl
* test-backend-ops : more test cases, clean up
* llama : remove unfused impl
* remove extra lines
  Co-authored-by: Georgi Gerganov
  Co-authored-by: slaren
* repack mxfp4 upon conversion
* clean up a bit
* enable thinking
* add quick hack to render only some special tokens
* fix bf16 conversion
* remove vocab hack
* webui ok
* support chat parsing for gpt-oss
* fix webui
* direct mapping mxfp4, FINALLY
* force using mxfp4
* properly use lazy tensor
* ggml : add mxfp4
  ggml : use e8m0 conversion instead of powf
  Co-authored-by: Diego Devesa
  change kvalues_mxfp4 table to match e2m1 (llama/6)
  metal : remove quantization for now (not used)
  cuda : fix disabled CUDA graphs due to ffn moe bias
  vulkan : add support for mxfp4
  cont : add cm2 dequant
* ggml : add ggml_add_id (llama/13)
* ggml : add ggml_add_id
* add cuda impl
* llama : add weight support check for add_id
* perf opt
* add vulkan impl
* rename cuda files
* add metal impl
* allow in-place ggml_add_id
* llama : keep biases on CPU with --cpu-moe
* llama : fix compile error
  ggml-ci
* cuda : add fallback for __nv_cvt_e8m0_to_bf16raw
  ggml-ci
* cleanup
  ggml-ci
* sycl : fix supports_op for MXFP4
  ggml-ci
* fix Unknown reasoning format
* ggml-cpu : fix AVX build
  ggml-ci
* fix hip build
  ggml-ci
* cuda : add mxfp4 dequantization support for cuBLAS
  ggml-ci
* ggml-cpu : fix mxfp4 fallback definitions for some architectures
  ggml-ci
* cuda : fix version required for __nv_cvt_e8m0_to_bf16raw

Co-authored-by: Xuan Son Nguyen
Co-authored-by: slaren
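
The attention-sink items above ("add attn sink impl", "add support for sinks in softmax") can be pictured with a small sketch. It assumes the common formulation in which one extra sink logit per head joins the softmax normalization but produces no output weight; the function name `softmax_with_sink` is hypothetical and is not a ggml API.

```python
import math

def softmax_with_sink(scores, sink):
    """Softmax over `scores` with one extra sink logit in the denominator.

    Assumption: the sink only absorbs probability mass; it has no value
    vector, so no weight is returned for it.
    """
    m = max(max(scores), sink)              # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    denom = sum(exps) + math.exp(sink - m)  # the sink enlarges the normalizer
    return [e / denom for e in exps]

weights = softmax_with_sink([1.0, 2.0, 3.0], sink=0.5)
print(sum(weights))  # less than 1: the remainder went to the sink
```

With `sink=float("-inf")` the extra term vanishes and the function reduces to an ordinary softmax.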

* feat(cann): add optional support for ACL Graph execution

  This commit adds support for executing ggml computational graphs using Huawei's ACL graph mode via the USE_CANN_GRAPH flag. The support can be enabled at compile time using the CMake option: -DUSE_CANN_GRAPH=ON

  By default, ACL graph execution is **disabled**, and the fallback path uses node-by-node execution.

  Key additions:
  - CMake option to toggle graph mode
  - Graph capture and execution logic using
  - Tensor property matching to determine whether graph update is required
  - Safe fallback and logging if the environment variable LLAMA_SET_ROWS is unset or invalid

  This prepares the backend for performance improvements in repetitive graph execution scenarios on Ascend devices.

  Signed-off-by: noemotiovon
* Fix review comments
  Signed-off-by: noemotiovon
* rename USE_CANN_GRAPH to USE_ACL_GRAPH
  Signed-off-by: noemotiovon
* fix typo
  Signed-off-by: noemotiovon

Signed-off-by: noemotiovon
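
The "tensor property matching" idea in this commit can be sketched as a cache that replays a captured graph only while the node properties stay the same. Everything here is an assumption for illustration: `NodeProps`, `GraphCache`, and the choice of compared fields are hypothetical, and the real CANN backend may compare different properties.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class NodeProps:
    """Hypothetical per-node properties that decide whether a capture is reusable."""
    op: str
    shape: tuple
    strides: tuple
    data_addr: int

class GraphCache:
    def __init__(self):
        self.captured = None  # properties of the last captured graph
        self.captures = 0
        self.replays = 0

    def execute(self, nodes):
        key = tuple(nodes)
        if key == self.captured:
            self.replays += 1   # stand-in for replaying the captured device graph
            return "replay"
        self.captured = key     # properties changed: capture again
        self.captures += 1
        return "capture"

a = NodeProps("MUL_MAT", (4, 4), (4, 16), 0x1000)
b = NodeProps("ADD", (4, 4), (4, 16), 0x2000)
cache = GraphCache()
print(cache.execute([a, b]), cache.execute([a, b]))
```

The design point is that the comparison is cheap relative to a capture, so repetitive workloads such as token-by-token decoding mostly take the replay path.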

* cuda: refactored ssm_scan to use CUB
* fixed compilation error when not using CUB
* assign L to constant and use size_t instead of int
* deduplicated functions
* change min blocks per mp to 1
* Use cub load and store warp transpose
* suppress clang warning

* musa: fix failures in test-backend-ops for mul_mat_id op
  Signed-off-by: Xiaodong Ye
* Address review comments
  Signed-off-by: Xiaodong Ye

Signed-off-by: Xiaodong Ye

…over RPC (macOS & others) (llama/15188)

* ggml-rpc: chunk send()/recv() to avoid EINVAL for very large tensors over RPC (macOS & others). Fixes #15055
* ggml-rpc: rename RPC_IO_CHUNK->MAX_CHUNK_SIZE, use std::min() for cap, switch to GGML_LOG_ERROR, handle 0-length send/recv
* rpc: drop n==0 special case in send_data(); retry in loop per review
* rpc: remove trailing whitespace in send_data()

Co-authored-by: Shinnosuke Takagi
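
The chunking described above can be sketched as a loop that never hands more than a capped number of bytes to a single send call and keeps going after partial writes. This is an illustration, not the ggml-rpc code: `send_all` is a hypothetical helper, the `send` argument stands for any callable that returns the number of bytes accepted (such as `socket.send`), and the 1 MiB value of `MAX_CHUNK_SIZE` is an assumption since the commit does not state the cap.

```python
MAX_CHUNK_SIZE = 1 << 20  # assumed cap; the real value is not given in the commit

def send_all(send, data, max_chunk=MAX_CHUNK_SIZE):
    """Send `data` in pieces of at most `max_chunk` bytes.

    `send(buf)` must return how many bytes it accepted, which may be
    fewer than len(buf); the loop simply continues from that offset.
    """
    view = memoryview(data)
    sent = 0
    while sent < len(view):
        chunk = view[sent:sent + max_chunk]  # cap the size of each call
        sent += send(chunk)                  # partial sends are retried by the loop
    return sent

# Usage with a stand-in transport that accepts at most 3 bytes per call.
received = []
def fake_send(buf):
    n = min(len(buf), 3)
    received.append(bytes(buf[:n]))
    return n

print(send_all(fake_send, b"abcdefghij", max_chunk=4))
```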

@JohannesGaessler

…vement on kernel-level and 10% perf increase for Gemma3n (llama/15132)

* Factor out `reduce_rows_f32` from common.cuh
  This increases iteration cycle speed by not having to recompile every kernel all the time
* Hide memory-latency by loop unrolling in reduce_rows_f32
* Further optimizations to `reduce_rows_f32`
  1. Increase threadblock size to better hide latency of memory requests. As a consequence of bigger threadblocks, do 2-step summation, using shared memory to communicate results between invocations
  2. Use sum_temp array to reduce waits on sum
  3. Adjust num_unroll to reflect bigger threadblock
  4. Improve default block_dims, increase support for more block_dims
* Add perf tests for `reduce_rows_f32` kernel
* Add heuristic to toggle 128/512 threads based on sm count
  Break even point was the minimum of the following multiples.

  | GPU Model | Nrow SM Count Multiple |
  | ----------- | ----------- |
  | RTX 4000 SFF ADA | 2.0x |
  | RTX 6000 ADA | 2.5x |
  | RTX PRO 6000 Blackwell Max-Q | 3.04x |
  | RTX PRO 4500 Blackwell | 3.15x |
* Ensure perf gains also for small ncols and large nrows
  Alternative to this, one could have also made the number of unrollings template-able, but that would require compiling the kernel multiple times, increasing binary size unnecessarily
* Modify perf and unit-tests
* Apply auto-formatting by clang
* Fix CI build failure
  See https://github.com/ggml-org/llama.cpp/actions/runs/16798370266/job/47573716079?pr=15132#step:7:486
  Building with VS generator worked though.
* Remove sm_count property from `ggml_backend_cuda_context`
  Requested by @JohannesGaessler, and should fix remaining CI issues as a side-effect
* Add CUB-based implementation for GGML_OP_MEAN
  Currently this branch is only executed for nrows==1
* Add heuristics to execute CUB branch only when it brings perf
  Heuristics were determined on the following HW:
  * RTX 4000 SFF ADA
  * RTX 6000 ADA
  * RTX PRO 6000 Blackwell Max-Q
  * RTX PRO 4500 Blackwell
* Add unit-test for CUB-based mean
  Tests should run with CUDA Graphs enabled per default on NVGPUs
* Rename `USE_CUB` to `GGML_CUDA_USE_CUB`
  Suggested by @JohannesGaessler
* Unindent Preprocessor directives
  See ggml-org/llama.cpp#15132 (comment)
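
The "2-step summation" mentioned for `reduce_rows_f32` can be illustrated by simulating the order of additions on the host. This is only a sketch of the reduction pattern under assumed parameters (32-thread warps, one block per row); it is not the CUDA kernel, and `reduce_row_two_step` is a hypothetical name.

```python
WARP_SIZE = 32  # assumed warp width

def reduce_row_two_step(row, block_size=128):
    """Sum one row the way a thread block might: per thread, per warp, then across warps."""
    # Each simulated thread accumulates a strided slice of the row.
    per_thread = [sum(row[t::block_size]) for t in range(block_size)]
    # Step 1: every warp reduces its own thread partials (warp shuffles on a GPU).
    per_warp = [sum(per_thread[w:w + WARP_SIZE])
                for w in range(0, block_size, WARP_SIZE)]
    # Step 2: warp results are exchanged (shared memory on a GPU) and reduced once more.
    return sum(per_warp)

row = [float(i) for i in range(1000)]
print(reduce_row_two_step(row, block_size=128))
print(reduce_row_two_step(row, block_size=512))
```

A larger block gives each row more threads in flight, which is the latency-hiding effect the commit describes, at the cost of the second reduction step.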

ggml-ci

* update `rope_multi`:
  1. add `ggml_rope_multi_inplace`;
  1. use `GGML_MROPE_SECTIONS` instead of 4.
* Apply suggestions from code review
  Co-authored-by: Georgi Gerganov

Co-authored-by: Georgi Gerganov

ggml-ci

* examples/finetune -opt SGD (stochastic gradient descent) memory opt

  add unit tested GGML_OPT_OPTIMIZER_SGD to ggml - avoids allocating m, v tensors.

  support finetune.cpp arg -opt SGD (or sgd). (default adamw as before)

  llama 3.2-1b-F32 result: observed 11gb gpu ram (41 sec/epoch) when using SGD instead of 19gb (55 sec/epoch) using adamw. (wikipedia 100 lines finetune)

  (using the same GPU memory, adamw can only do 512 batch/context before OOM, reaching:

  train: [███████▉] data=0000140/0000140 loss=0.02575±0.00099 acc=99.52±0.03% t=00:00:47 ETA=00:00:00
  val: [███████▉] data=0000008/0000008 loss=4.76565±0.28810 acc=41.46±0.77% t=00:00:00 ETA=00:00:00

  SGD is superior, though it converges slower, with max before OOM 1728 batch/context (esp. see the better validation perf):

  train: [███████▉] data=0000039/0000039 loss=0.00371±0.00010 acc=99.96±0.01% t=00:00:41 ETA=00:00:00
  val: [███████▉] data=0000003/0000003 loss=5.11406±0.76034 acc=48.01±0.69% t=00:00:01 ETA=00:00:00
  )

  note: when finetuning long enough (or w/ enough -lr), validation accuracy *eventually* drops ('catastrophic forgetting')

  -lr-half (halflife) option useful for SGD to avoid oscillation or super slow underdamped learning (makes setting -lr more forgiving). terminal -lr for now is set by lr-halvings i.e. if you want at most 1/8 the initial -lr you set -lr-halvings 3.

  note: objective loss not directly comparable between adamw, sgd? - check perplexity or accuracy or consider relative improvements for convergence

  new finetune args -wd 1e-9 to enable weight decay in sgd or adamw, and max -epochs N (default 2 as before)

  cache (1 - wd*alpha) in 'adamw' opt struct - no noticeable perf benefit, disabled (still done for new SGD though)

  since opt. memory is pre-allocated, the ggml_opt_get_optimizer_params would probably be able to change between SGD and AdamW with each epoch but would need to use adamw for the first (unconfirmed - no cmdline arg to set such a policy yet)

  test-opt checks adamw as before and now sgd (except for a few disabled tests for sgd only; probably just needs logging values and adding alternate reference values); tolerance on the 'regression' test is broader for sgd (so we don't need many more epochs)
* Vulkan: Implement GGML_OP_OPT_STEP_SGD
* tests: Fix OPT_STEP_SGD test-backend-ops
* SGD op param store weight-decay and not 1-alpha*wd
* minor + cosmetic changes
* fix vulkan sgd
* try CI fix

Co-authored-by: 0cc4m
Co-authored-by: Johannes Gäßler
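
The memory saving comes from SGD keeping no per-parameter state, whereas AdamW stores the m and v tensors. A minimal sketch of one step with decoupled weight decay, using the `(1 - wd*alpha)` factor the message mentions, follows. The halving schedule in `decayed_lr` is an assumed reading of `-lr-half` and `-lr-halvings` (halve once per half-life, stop after the given number of halvings); both function names are hypothetical and the actual finetune schedule may differ.

```python
def sgd_step(weights, grads, alpha, wd=0.0):
    """One SGD step with decoupled weight decay: w <- w * (1 - alpha*wd) - alpha*g."""
    keep = 1.0 - alpha * wd  # the cached (1 - wd*alpha) factor
    return [w * keep - alpha * g for w, g in zip(weights, grads)]

def decayed_lr(lr0, epoch, half_life, max_halvings):
    """Assumed schedule: halve lr every `half_life` epochs, at most `max_halvings` times."""
    halvings = min(epoch / half_life, max_halvings)
    return lr0 * 0.5 ** halvings

w = sgd_step([1.0, -2.0], [0.5, 0.5], alpha=0.1, wd=0.01)
lr_floor = decayed_lr(1.0, epoch=100, half_life=1, max_halvings=3)
print(w, lr_floor)  # lr_floor is 1/8 of the initial rate, matching the -lr-halvings 3 example
```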

ggml-ci

JohannesGaessler and others added 30 commits August 13, 2025 19:04

CUDA: use mma FA kernel for gqa > 4 on RTX 4000 (llama/15035)

9c60398

opencl: fix adreno compiler detection logic (llama/15029)

2651cb7

vulkan: Use coopmat2 for conv2d (llama/14982)

949131a

vulkan: fix build when using glslang that does not support coopmat2 (…

8b87017

…llama/15062)

cmake: Add GGML_BACKEND_DIR option (llama/15074)

63b6d5e

* cmake: Add GGML_BACKEND_DIR option
  This can be used by distributions to specify where to look for backends when ggml is built with GGML_BACKEND_DL=ON.
* Fix phrasing

sycl: fix mul_mat selection (llama/15092)

1c18a3a

ggml : fix fallback to CPU for ununsupported ops (llama/15118)

1667d77

opencl: add swiglu_oai and add_id (llama/15121)

0be191f

* opencl: add `swiglu-oai`
* opencl: add `add_id`
* opencl: add missing `add_id.cl`

fix profiling crash (llama/15072)

06e36b3

CUDA: GEMM for FP32/FP16/BF16 and ne11 <= 16 (llama/15131)

63806ec

* CUDA: GEMM for FP32/FP16/BF16 and ne11 <= 16

ggml: Skip backend library linking code when GGML_BACKEND_DL=ON (llam…

10cf8ea

…a/15094) Any available libraries are found and loaded dynamically at runtime.

HIP: add cmake option to enable compiler output of kernel resource us…

2d056bc

…age metrics (llama/15103)

vulkan: Add env var to disable host visible vidmem (llama/15109)

412c6db

vulkan: support fattn sinks (llama/15126)

2d26b6d

opencl: support sink in soft_max (attn sinks) (llama/15152)

54f6875

CUDA: attention sinks for mma FlashAttention (llama/15157)

d654e38

ggml : fix field name when new ggml_backend (llama/14944)

8122b79

gguf-py : add Numpy MXFP4 de/quantization support (llama/15111)

eab860c

* gguf-py : add MXFP4 de/quantization support
* ggml-quants : handle zero amax for MXFP4
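
For orientation, here is a rough sketch of block-wise MXFP4 de/quantization, including the all-zero block case ("zero amax"). It assumes the OCP Microscaling conventions: a shared E8M0 scale equal to 2^(e-127) and 4-bit E2M1 elements with magnitudes 0, 0.5, 1, 1.5, 2, 3, 4, 6 plus a sign bit. The function names are hypothetical, nibble packing is left out, and the gguf-py and ggml implementations may use a different table scaling, rounding rule, or zero handling.

```python
import math

E2M1_VALUES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]  # magnitudes for codes 0..7

def dequantize_block(scale_code, codes):
    """Decode 4-bit codes (bit 3 = sign) with a shared E8M0 scale of 2^(scale_code - 127)."""
    d = 2.0 ** (scale_code - 127)
    return [(-1.0 if c & 8 else 1.0) * E2M1_VALUES[c & 7] * d for c in codes]

def quantize_block(xs):
    """Pick a power-of-two scale from the block's amax, then round to the nearest code."""
    amax = max(abs(x) for x in xs)
    if amax == 0.0:
        # log2(0) is undefined, so an all-zero block needs its own path;
        # choosing scale code 0 here is an assumption.
        return 0, [0] * len(xs)
    # Largest E2M1 magnitude is 6 = 1.5 * 2^2, hence the "- 2".
    scale_code = min(max(math.floor(math.log2(amax)) - 2 + 127, 0), 254)
    d = 2.0 ** (scale_code - 127)
    codes = []
    for x in xs:
        a = abs(x) / d
        idx = min(range(8), key=lambda i: abs(E2M1_VALUES[i] - a))  # nearest magnitude
        codes.append(idx | (8 if x < 0 else 0))
    return scale_code, codes

scale, codes = quantize_block([12.0, 1.0, -3.0, 0.0])
print(scale, codes, dequantize_block(scale, codes))
```

Values above the largest representable magnitude saturate to it in this sketch, since the nearest-code search never leaves the table.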

CUDA: add attention sinks for tile and wmma (llama/15178)

44ad8cc

* CUDA: add attention sinks for tile and wmma
* Review: formatting changes + remove syncthreads from tile + remove warp_reduce_max from wmma

kleidiai: fix unsigned overflow bug (llama/15150)

f9070d6

* kleidiai: fix unsigned overflow bug
* address review comments

CANN: Add broadcast for softmax and FA (llama/15208)

b11d639

* refactor softmax
* fix fa
* fix mask shape
* format
* add comments
* Remove whitespace

CANN: GGML_OP_CPY optimization (llama/15070)

3a904fd

Signed-off-by: noemotiovon

CUDA cmake: add -lineinfo for easier debug (llama/15260)

8e1c682

opencl: allow mixed f16/f32 add (llama/15140)

6f0c19c

sycl: Fix and disable more configurations of mul_mat (llama/15151)

9969cfc

* sycl: Fix and disable more configurations of mul_mat
* Disable more configurations

HIP: disable sync warp shuffel operators from clr amd_warp_sync_funct…

cf811e3

…ions.h (llama/15273)

Tak-RS and others added 11 commits August 13, 2025 19:04

ggml : repack block_iq4_nlx8 (llama/14904)

f381597

ggml-ci

sync : llama.cpp

5606bd1

ggml-ci

HIP: bump requirement to rocm 6.1 (llama/15296)

78d736a

cuda : fix GGML_CUDA_GRAPHS=OFF (llama/15300)

fc8640d

* fix USE_CUDA_GRAPH=OFF
  ggml-ci
* check capture status
* completely disable capturing check instead

sync : llama.cpp

d65b660

ggml-ci

tests : remove unused includes (#0)

28d9223

mnist : adapt to opt changes

c75e2b1

ggml-ci

ggerganov force-pushed the sync-llama.cpp-25-08-13 branch from 888092a to c75e2b1 Compare August 14, 2025 10:51

ggerganov merged commit c765c8f into master Aug 14, 2025
8 of 16 checks passed

ggerganov deleted the sync-llama.cpp-25-08-13 branch August 14, 2025 11:17
