merge from upstream #80

l3utterfly · 2025-08-24T13:21:05Z

No description provided.

* kleidiai: fix unsigned overflow bug * address review comments

* Improve Mistral models integration with llama.cpp * Revert changes and fix gguf * Revert change * refactor convert_mistral_to_gguf.py in convert_hf_to_gguf.py * Revert collateral * Rename model name * refactor * revert * remove duplicate * Remove duplication code * Fixes * Fix flake issues * Apply comments * Apply comments * Apply comments * Fix remote * add default chat template * Revert * nit

…g#15227) This commit updates comments and error messages to use "decode" instead of "eval" in perplexity.cpp. The motivation for this is that `llama_eval` was renamed to `llama_decode` a while ago, but the comments and error messages still referred to "eval". This change ensures consistency and clarity.

This commit updates `llama_kv_cache_unified::find_slot` to log information for all streams when debug is enabled. The motivation for this change is that currently, if a non-unified kv-cache is used, only one stream is logged because the code uses `seq_to_stream[1]`.

* kv-cache : fix seq_rm with seq_id == -1 ggml-ci * cont : iterate over streams ggml-ci

…15238)

* chat : hotfix gpt-oss jinja raising an exception * fix

…ggml-org#14750) * Fix MinicpmV model converter and clip to avoid using hardcode. * Code update for pr/14750 * Remove unused field, update script path in docs. * Add version 5 for fallback code. --------- Co-authored-by: lzhang <[email protected]>

* refactor softmax * fix fa * fix mask shape * format * add comments * Remove whitespace

) * musa: fix failures in test-backend-ops for mul_mat_id op Signed-off-by: Xiaodong Ye <[email protected]> * Address review comments Signed-off-by: Xiaodong Ye <[email protected]> --------- Signed-off-by: Xiaodong Ye <[email protected]>

Signed-off-by: noemotiovon <[email protected]>

* sycl: Fix and disable more configurations of mul_mat * Disable more configurations

…ions.h (ggml-org#15273)

…over RPC (macOS & others) (ggml-org#15188) * ggml-rpc: chunk send()/recv() to avoid EINVAL for very large tensors over RPC (macOS & others). Fixes ggml-org#15055 * ggml-rpc: rename RPC_IO_CHUNK->MAX_CHUNK_SIZE, use std::min() for cap, switch to GGML_LOG_ERROR, handle 0-length send/recv * rpc: drop n==0 special case in send_data(); retry in loop per review * rpc: remove trailing whitespace in send_data() --------- Co-authored-by: Shinnosuke Takagi <[email protected]>
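
The chunking described in this commit can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the ggml-rpc implementation: the socket is replaced by a hypothetical `send_fn` callable so the example is self-contained, `send_all_chunked` is an illustrative name, and the value chosen for `MAX_CHUNK_SIZE` is arbitrary. Only the constant's name and the use of `std::min()` for the cap come from the commit message.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <vector>

// Hypothetical cap on bytes per call; the real value in ggml-rpc may differ.
static const size_t MAX_CHUNK_SIZE = 1024;

// Stand-in for a socket send(): returns the number of bytes written, or -1 on error.
using send_fn = std::function<long(const uint8_t * data, size_t len)>;

// Send `size` bytes using calls of at most MAX_CHUNK_SIZE bytes each,
// looping until everything is written or the transport reports an error.
static bool send_all_chunked(const send_fn & do_send, const uint8_t * data, size_t size) {
    size_t sent = 0;
    while (sent < size) {
        const size_t chunk = std::min(size - sent, MAX_CHUNK_SIZE);
        const long n = do_send(data + sent, chunk);
        if (n < 0) {
            return false; // transport error
        }
        sent += (size_t) n; // a short (or zero-length) write is simply retried by the loop
    }
    return true;
}
```

Capping each call keeps the requested length small regardless of tensor size, and the loop also absorbs short writes, so no special case is needed for a partial transfer.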

@JohannesGaessler

…vement on kernel-level and 10% perf increase for Gemma3n (ggml-org#15132)

* Factor out `reduce_rows_f32` from common.cuh

  This increases iteration cycle speed by not having to recompile every kernel all the time

* Hide memory-latency by loop unrolling in reduce_rows_f32

* Further optimizations to `reduce_rows_f32`

  1. Increase threadblock size to better hide latency of memory requests. As a consequence of bigger threadblocks, do 2-step summation, using shared memory to communicate results between invocations
  2. Use sum_temp array to reduce waits on sum
  3. Adjust num_unroll to reflect bigger threadblock
  4. Improve default block_dims, increase support for more block_dims

* Add perf tests for `reduce_rows_f32` kernel

* Add heuristic to toggle 128/512 threads based on sm count

  Break even point was the minimum of the following multiples.

  | GPU Model | Nrow SM Count Multiple |
  | ----------- | ----------- |
  | RTX 4000 SFF ADA | 2.0x |
  | RTX 6000 ADA | 2.5x |
  | RTX PRO 6000 Blackwell Max-Q | 3.04x |
  | RTX PRO 4500 Blackwell | 3.15x |

* Ensure perf gains also for small ncols and large nrows

  Alternative to this, one could have also made the number of unrollings template-able, but that would require compiling the kernel multiple times, increasing binary size unnecessarily

* Modify perf and unit-tests

* Apply auto-formatting by clang

* Fix CI build failure

  See https://github.com/ggml-org/llama.cpp/actions/runs/16798370266/job/47573716079?pr=15132#step:7:486 Building with VS generator worked though.

* Remove sm_count property from `ggml_backend_cuda_context`

  Requested by @JohannesGaessler, and should fix remaining CI issues as a side-effect

* Add CUB-based implementation for GGML_OP_MEAN

  Currently this branch is only executed for nrows==1

* Add heuristics to execute CUB branch only when it brings perf

  Heuristics were determined on the following HW:
  * RTX 4000 SFF ADA
  * RTX 6000 ADA
  * RTX PRO 6000 Blackwell Max-Q
  * RTX PRO 4500 Blackwell

* Add unit-test for CUB-based mean

  Tests should run with CUDA Graphs enabled per default on NVGPUs

* Rename `USE_CUB` to `GGML_CUDA_USE_CUB`

  Suggested by @JohannesGaessler

* Unindent Preprocessor directives

  See ggml-org#15132 (comment)
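
The unrolling idea behind `reduce_rows_f32` can be illustrated on the CPU. The sketch below is not the CUDA kernel: `reduce_row_unrolled` and the unroll factor of 8 are assumptions made for illustration, and only the idea of a `sum_temp` array of independent accumulators is taken from the commit message. Keeping several independent partial sums means each addition does not have to wait on the previous one.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Sum one row of `ncols` floats using several independent accumulators.
static float reduce_row_unrolled(const float * x, size_t ncols) {
    constexpr size_t NUM_UNROLL = 8; // hypothetical unroll factor
    float sum_temp[NUM_UNROLL] = {0.0f};

    // Main loop: each accumulator only depends on its own previous value.
    size_t i = 0;
    for (; i + NUM_UNROLL <= ncols; i += NUM_UNROLL) {
        for (size_t j = 0; j < NUM_UNROLL; ++j) {
            sum_temp[j] += x[i + j];
        }
    }

    // Tail: columns left over when ncols is not a multiple of NUM_UNROLL.
    float sum = 0.0f;
    for (; i < ncols; ++i) {
        sum += x[i];
    }

    // Final step: combine the partial sums.
    for (size_t j = 0; j < NUM_UNROLL; ++j) {
        sum += sum_temp[j];
    }
    return sum;
}
```

On a GPU the combine step would additionally reduce across threads (the commit mentions a 2-step summation through shared memory); that part is omitted here.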

ggml-ci

) * ci : add flake8 and pyright to copilot-setup-steps.yml * add tools/server/tests/requirements.txt

* Changed the CI file to hw * Changed the CI file to hw * Added to sudoers for apt * Removed the clone command and used checkout * Added libcurl * Added gcc-14 * Checking gcc --version * added gcc-14 symlink * added CC and C++ variables * Added the gguf weight * Changed the weights path * Added system specification * Removed white spaces * ci: Replace Jenkins riscv native build Cloud-V pipeline with GitHub Actions workflow Removed the legacy .devops/cloud-v-pipeline Jenkins CI configuration and introduced .github/workflows/build-riscv-native.yml for native RISC-V builds using GitHub Actions. * removed trailing whitespaces --------- Co-authored-by: Akif Ejaz <[email protected]>

@slaren

…e-draft parameters (ggml-org#15191) * Checkpoint from VS Code for coding agent session * Initial plan * Fix typo in --override-tensor-draft flag implementation * Add null termination for speculative tensor buffer overrides * Apply suggestions from code review * Apply suggestions from code review * Extract tensor override parsing logic to common function (addresses @slaren's feedback) * Apply suggestions from code review * Apply suggestions --------- Co-authored-by: Sigbjørn Skjæret <[email protected]> Co-authored-by: Georgi Gerganov <[email protected]> Co-authored-by: Diego Devesa <[email protected]>

* update `rope_multi`: 1. add `ggml_rope_multi_inplace`; 2. use `GGML_MROPE_SECTIONS` instead of 4. * Apply suggestions from code review Co-authored-by: Georgi Gerganov <[email protected]> --------- Co-authored-by: Georgi Gerganov <[email protected]>

…ggml-org#15295)

The flake.nix included references to llama-cpp.cachix.org cache with a comment claiming it's 'Populated by the CI in ggml-org/llama.cpp', but:

1. No visible CI workflow populates this cache
2. The cache is empty for recent builds (tested b6150, etc.)
3. This misleads users into expecting pre-built binaries that don't exist

This change removes the non-functional cache references entirely, leaving only the working cuda-maintainers cache that actually provides CUDA dependencies. Users can still manually add the llama-cpp cache if it becomes functional in the future.

…org#15303) * perplexity: give more information about constraints on failure This checks whether -np is insufficient vs context, and provides clues as to how much is needed for each. * log formatting * log error and return instead of storing max_seq_exceeded int * check if s0 is zero for -np check

* vulkan: Reuse conversion results in prealloc_y Cache the pipeline and tensor that were most recently used to fill prealloc_y, and skip the conversion if the current pipeline/tensor match. * don't use shared pointer for prealloc_y_last_pipeline_used
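
The caching described here amounts to remembering the last (pipeline, tensor) pair and skipping the work on a repeat. A minimal sketch, assuming opaque pointers as identities: `conversion_cache` and `needs_conversion` are hypothetical names, not the Vulkan backend's own.

```cpp
#include <cassert>

// Remembers which pipeline and tensor last filled the scratch buffer.
struct conversion_cache {
    const void * last_pipeline = nullptr;
    const void * last_tensor   = nullptr;

    // Returns true when the conversion has to run, and records the new pair.
    // Returns false when the buffer already holds the result for this pair.
    bool needs_conversion(const void * pipeline, const void * tensor) {
        if (pipeline == last_pipeline && tensor == last_tensor) {
            return false;
        }
        last_pipeline = pipeline;
        last_tensor   = tensor;
        return true;
    }

    // Must be called whenever something else writes to the scratch buffer.
    void invalidate() {
        last_pipeline = nullptr;
        last_tensor   = nullptr;
    }
};
```

A scheme like this is only safe if the cache is invalidated whenever the buffer or the source tensor's contents change; how the real backend handles that is not stated in the commit message.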

Co-authored-by: aeseulgi <[email protected]>

ggml-ci

…pt processing (ggml-org#15488)

* [CANN] Optimize RMS_NORM using cache Signed-off-by: noemotiovon <[email protected]> * fix typo Signed-off-by: noemotiovon <[email protected]> * fix review comment Signed-off-by: noemotiovon <[email protected]> * codestyle adjustment Signed-off-by: noemotiovon <[email protected]> --------- Signed-off-by: noemotiovon <[email protected]>

* Support untied embeddings * Increase number of image tokens to 1024 * Add LFM2-VL to readme * Actually use untied embeddings

… format (ggml-org#15108) - Use server_tokens in more places in server and util.cpp - Convert most functions that used llama_tokens to server_tokens - Modify input tokenizer to handle JSON objects as subprompts - Break out MTMD prompt parsing into utility function - Support JSON objects with multimodal_data arrays for MTMD prompts along with other existing types - Add capability to model endpoint to indicate if client can send multimodal data - Add tests.

* ggml-cpu: initial q5_0 impl for s390x Signed-off-by: Aaron Teo <[email protected]> * ggml-cpu: updated q5_0 code for better performance Signed-off-by: Aaron Teo <[email protected]> * ggml-cpu: use optimised hsum for better performance Signed-off-by: Aaron Teo <[email protected]> * ggml-cpu: introduce q5_1 simd + refactor q5_0 Signed-off-by: Aaron Teo <[email protected]> * ggml-cpu: fix incorrect return type vec_hsum Signed-off-by: Aaron Teo <[email protected]> * ggml-cpu: q5_0 incomplete refactor + table_b2b_0 activation Signed-off-by: Aaron Teo <[email protected]> * ggml-cpu: refactor q5_1 Signed-off-by: Aaron Teo <[email protected]> * ggml-cpu: q5_1 update loop unroll to 4 Signed-off-by: Aaron Teo <[email protected]> * ggml-cpu: update q5_0 unroll to 4 Signed-off-by: Aaron Teo <[email protected]> * ggml-cpu: update build-s390x docs Signed-off-by: Aaron Teo <[email protected]> * ggml-cpu: update unused variables q5_0 Signed-off-by: Aaron Teo <[email protected]> * docs: update the last update date Signed-off-by: Aaron Teo <[email protected]> --------- Signed-off-by: Aaron Teo <[email protected]>

ggml-ci

* Add Pad Reflect 1D CUDA support * Update ggml/src/ggml-cuda/pad_reflect_1d.cu Co-authored-by: Johannes Gäßler <[email protected]> --------- Co-authored-by: Johannes Gäßler <[email protected]>

* add conv3d * bump GGML_OP_COUNT

* Begin work on set_rows * Work on set rows * Add error buffers for reporting unsupported SET_ROWS indices * Remove extra comments * Work on templating for different types in shaders * Work on shader type generation * Working q4_0 mul_mat and some templating for different types * Add q4_0_f16 matmul and fix device init * Add matmul support for basic quantization types * Add q2_k and q3_k quantization * Add rest of k-quants * Get first i-quant working * Closer to supporting all i-quants * Support rest of i-quants * Cleanup code * Fix python formatting * debug * Bugfix for memset * Add padding to end of buffers on creation * Simplify bit-shifting * Update usage of StringView

…org#15427) - Spread the work across the whole workgroup. Using more threads seems to far outweigh the synchronization overhead. - Specialize the code for when the division is by a power of two.
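
Specializing a division for power-of-two divisors typically means replacing the divide with a shift and the remainder with a mask. The sketch below shows that general technique in C++; it is not the shader code, and `fast_divider` is a hypothetical helper that assumes a non-zero divisor.

```cpp
#include <cassert>
#include <cstdint>

static bool is_pow2(uint32_t d) {
    return d != 0 && (d & (d - 1)) == 0;
}

// log2 of a power of two, found by counting shifts.
static uint32_t log2_pow2(uint32_t d) {
    uint32_t s = 0;
    while ((d >> s) != 1) {
        ++s;
    }
    return s;
}

// Picks the cheap path once, then reuses it for every element.
struct fast_divider {
    uint32_t divisor;
    bool     pow2;
    uint32_t shift;

    explicit fast_divider(uint32_t d)
        : divisor(d), pow2(is_pow2(d)), shift(pow2 ? log2_pow2(d) : 0) {}

    uint32_t div(uint32_t x) const { return pow2 ? (x >> shift) : (x / divisor); }
    uint32_t mod(uint32_t x) const { return pow2 ? (x & (divisor - 1)) : (x % divisor); }
};
```

In a shader the choice would more likely be made at pipeline-creation time so the hot loop contains no branch; the runtime branch here just keeps the example short.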

* vulkan : support ggml_mean * vulkan : support sum, sum_rows and mean with non-contiguous tensors * vulkan : fix subbuffer size not accounting for misalign offset * tests : add backend-op tests for non-contiguous sum_rows * cuda : require contiguous src for SUM_ROWS, MEAN support * sycl : require contiguous src for SUM, SUM_ROWS, ARGSORT support * require ggml_contiguous_rows in supports_op and expect nb00=1 in the shader

Signed-off-by: Xiaodong Ye <[email protected]>

…gml-org#15489) Track a list of nodes that need synchronization, and only sync if the new node depends on them (or overwrites them). This allows some overlap which can improve performance, and centralizes a big chunk of the synchronization logic. The remaining synchronization logic involves writes to memory other than the nodes, e.g. for dequantization or split_k. Each of these allocations has a bool indicating whether they were in use and need to be synced. This should be checked before they are written to, and set to true after they are done being consumed.
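
The dependency tracking described above can be sketched as follows. This is an illustration of the idea only: `node`, `sync_tracker`, and the integer ids are hypothetical, and a counter stands in for the actual barrier.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Hypothetical graph node: an id for its output and the ids it reads.
struct node {
    int              id;
    std::vector<int> srcs;
};

struct sync_tracker {
    std::vector<int> unsynced;     // outputs written since the last barrier
    int              barriers = 0; // stand-in for issuing a real barrier

    void submit(const node & n) {
        bool hazard = false;
        for (int id : unsynced) {
            const bool reads  = std::find(n.srcs.begin(), n.srcs.end(), id) != n.srcs.end();
            const bool writes = (id == n.id);
            if (reads || writes) {
                hazard = true;
                break;
            }
        }
        if (hazard) {
            ++barriers;       // sync only when the new node depends on or overwrites pending work
            unsynced.clear(); // everything before the barrier is now visible
        }
        unsynced.push_back(n.id);
    }
};
```

Independent nodes submitted back to back never trigger a barrier in this scheme, which is where the overlap mentioned in the commit message comes from.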

* First draft * Fix linter errors * Added missing sinks nullptr * Don't forget the llama-arch! * We're through to the generation stage. * Fix post-attention norm * Apply suggestions from code review Co-authored-by: Sigbjørn Skjæret <[email protected]> * Fix RoPE type * Fix tensor name and reorder llm_types * Update gguf-py/gguf/constants.py Remove nonexistent FFN_POST_NORM tensor Co-authored-by: Sigbjørn Skjæret <[email protected]> * Update src/llama-model.h Co-authored-by: Sigbjørn Skjæret <[email protected]> * Add basic chat template * Add chat template tests * Remake chat template test * Apply suggestions from code review Co-authored-by: Sigbjørn Skjæret <[email protected]> * Update src/llama-chat.cpp Co-authored-by: Sigbjørn Skjæret <[email protected]> * Reorder llm type descriptions * Update src/llama-model.cpp Co-authored-by: Sigbjørn Skjæret <[email protected]> --------- Co-authored-by: Sigbjørn Skjæret <[email protected]>

…le SMs (ggml-org#15281)

* vulkan: optimize rms_norm, and allow the work to spread across multiple SMs

  There are really two parts to this change:

  (1) Some optimizations similar to what we have in soft_max, to unroll with different numbers of iterations.

  (2) A fusion optimization where we detect add followed by rms_norm, and make the add shader atomically accumulate the values^2 into memory. Then the rms_norm shader can just load that sum. This allows the rms_norm to be parallelized across multiple workgroups, it just becomes a simple per-element multiply.

  The fusion optimization is currently only applied when the rms_norm is on a single vector. This previously always ran on a single SM. It could apply more broadly, but when there are other dimensions the work can already spread across SMs, and there would be some complexity to tracking multiple atomic sums.

* Change add+rms_norm optimization to write out an array of partial sums rather than using atomic add, to make it deterministic. The rms_norm shader fetches a subgroup's worth in parallel and uses subgroupAdd to add them up.
* complete rebase against fused adds - multi_add shader can also compute partial sums
* fix validation errors
* disable add_rms_fusion for Intel due to possible driver bug
* resolve against ggml-org#15489, sync after clearing partial sums
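
The partial-sum variant of the add + rms_norm fusion can be sketched on the CPU as two passes. This is a simplified model under stated assumptions, not the shaders: the function names and the `chunk` parameter (standing in for one workgroup's share of the vector) are hypothetical, and the input is assumed to be a single non-empty vector.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Pass 1: element-wise add that also writes one partial sum of squares per chunk.
static void add_with_partial_sums(const std::vector<float> & a,
                                  const std::vector<float> & b,
                                  size_t                     chunk,
                                  std::vector<float> &       out,
                                  std::vector<float> &       partial) {
    out.resize(a.size());
    partial.assign((a.size() + chunk - 1) / chunk, 0.0f);
    for (size_t i = 0; i < a.size(); ++i) {
        out[i] = a[i] + b[i];
        partial[i / chunk] += out[i] * out[i];
    }
}

// Pass 2: rms_norm only adds up the partial sums, then scales each element.
static void rms_norm_from_partials(std::vector<float> &       x,
                                   const std::vector<float> & partial,
                                   float                      eps) {
    float sum = 0.0f;
    for (float p : partial) {
        sum += p;
    }
    const float scale = 1.0f / std::sqrt(sum / (float) x.size() + eps);
    for (float & v : x) {
        v *= scale; // independent per element, so it can be split across workgroups
    }
}
```

Because each chunk writes its own slot and the slots are summed in a fixed order, the result does not depend on the order in which chunks finish, which is the determinism argument made in the commit message.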

) * vulkan: workaround MoltenVK compile failure in multi_add * Update ggml/src/ggml-vulkan/vulkan-shaders/multi_add.comp Co-authored-by: 0cc4m <[email protected]>

…g#15526)

The scalar FA shader already handled multiples of 8. The coopmat1 FA shader assumed 16x16x16 and the shared memory allocations need the HSK dimensions padded to a multiple of 16. NVIDIA's coopmat2 implementation requires multiples of 16 for N and K, and needs the matrix dimensions padded and loads clamped. Store the FA pipelines in a map, indexed by the pipeline state.
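
Two pieces of this description lend themselves to a short sketch: rounding a head size up to a multiple of 16, and keeping pipelines in a map keyed by their state. The code below is illustrative only. The fields of `fa_pipeline_state`, the `fa_pipeline` placeholder, and `fa_pipeline_cache` are assumptions, not the structures used in the Vulkan backend.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <tuple>

// Round v up to the next multiple of m (m > 0).
static uint32_t round_up(uint32_t v, uint32_t m) {
    return (v + m - 1) / m * m;
}

// Hypothetical pipeline state: whatever distinguishes one compiled variant from another.
struct fa_pipeline_state {
    uint32_t hsk;
    uint32_t hsv;
    bool     f32acc;

    bool operator<(const fa_pipeline_state & o) const {
        return std::tie(hsk, hsv, f32acc) < std::tie(o.hsk, o.hsv, o.f32acc);
    }
};

// Placeholder for a compiled pipeline object.
struct fa_pipeline {
    int id;
};

// Creates a pipeline the first time a state is seen and reuses it afterwards.
struct fa_pipeline_cache {
    std::map<fa_pipeline_state, fa_pipeline> pipelines;
    int                                      created = 0;

    fa_pipeline & get(const fa_pipeline_state & s) {
        auto it = pipelines.find(s);
        if (it == pipelines.end()) {
            it = pipelines.emplace(s, fa_pipeline{created++}).first;
        }
        return it->second;
    }
};
```

With a head size of 72 (a multiple of 8 but not of 16), `round_up(72, 16)` gives 80, which is the kind of padded dimension the shared memory allocation would need.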

* kv-cache : support layer reuse ggml-ci * cont : update comments [no ci]

chaxu01 and others added 30 commits August 11, 2025 09:59

kleidiai: fix unsigned overflow bug (ggml-org#15150)

002cb1b

* kleidiai: fix unsigned overflow bug * address review comments

convert : fix merge conflicts (ggml-org#15229)

50e81bd

kv-cache : fix seq_rm with seq_id == -1 (ggml-org#15226)

228f724

* kv-cache : fix seq_rm with seq_id == -1 ggml-ci * cont : iterate over streams ggml-ci

readme : update infra list (ggml-org#15234)

27093af

server : allow specifying reasoning_format in HTTP request (ggml-org#…

53d0a12

…15238)

chat : hotfix gpt-oss jinja raising an exception (ggml-org#15243)

fba5c0d

* chat : hotfix gpt-oss jinja raising an exception * fix

CANN: Add broadcast for softmax and FA (ggml-org#15208)

be48528

* refactor softmax * fix fa * fix mask shape * format * add comments * Remove whitespace

CANN: GGML_OP_CPY optimization (ggml-org#15070)

bbd57b7

Signed-off-by: noemotiovon <[email protected]>

CUDA cmake: add -lineinfo for easier debug (ggml-org#15260)

efe3a90

opencl: allow mixed f16/f32 add (ggml-org#15140)

60a7658

sycl: Fix and disable more configurations of mul_mat (ggml-org#15151)

f4586ee

* sycl: Fix and disable more configurations of mul_mat * Disable more configurations

HIP: disable sync warp shuffel operators from clr amd_warp_sync_funct…

b049315

…ions.h (ggml-org#15273)

ci : add copilot-setup-steps.yml (ggml-org#15214)

bc51822

ggml : repack block_iq4_nlx8 (ggml-org#14904)

00f35d5

ggml-ci

ci : add more python requirements to copilot-setup-steps (ggml-org#15289

07aa869

) * ci : add flake8 and pyright to copilot-setup-steps.yml * add tools/server/tests/requirements.txt

server : filter out harmony thought messages (ggml-org#15278)

e885445

server : enable -td and -tbd parameters (ggml-org#15172)

b3e1666

HIP: bump requirement to rocm 6.1 (ggml-org#15296)

29c8fbe

jeffbolznv and others added 29 commits August 21, 2025 16:55

vulkan: add exp operation (ggml-org#15456)

20c2dac

Co-authored-by: aeseulgi <[email protected]>

vulkan : support conv_2d_dw with f16 weights (ggml-org#15392)

97ae596

graph : remove build_attn_with_sinks overload (ggml-org#15469)

3f196be

ggml-ci

llama : remove deprecated llama_kv_self API (ggml-org#15472)

cd36b5e

ggml-ci

sched : fix possible use of wrong ids tensor when offloading moe prom…

54a241f

…pt processing (ggml-org#15488)

readme : model : mtdm : lfm2 improvements (ggml-org#15476)

e288693

* Support untied embeddings * Increase number of image tokens to 1024 * Add LFM2-VL to readme * Actually use untied embeddings

llama : remove KV cache defragmentation logic (ggml-org#15473)

9ebebef

ggml-ci

cuda : add Pad Reflect 1D support (ggml-org#14659)

b1ab918

* Add Pad Reflect 1D CUDA support * Update ggml/src/ggml-cuda/pad_reflect_1d.cu Co-authored-by: Johannes Gäßler <[email protected]> --------- Co-authored-by: Johannes Gäßler <[email protected]>

ggml: add conv3d op (ggml-org#15182)

92f7f0a

* add conv3d * bump GGML_OP_COUNT

model : gpt-oss add response_format support (ggml-org#15494)

32732f2

test-opt: allow slight imprecision (ggml-org#15503)

e92734d

vulkan: optimize mul_mat_id loading row ids into shared memory (ggml-…

330c3d2

…org#15427) - Spread the work across the whole workgroup. Using more threads seems to far outweigh the synchronization overhead. - Specialize the code for when the division is by a power of two.

vulkan.Dockerfile: install vulkan SDK using tarball (ggml-org#15282)

b55f06e

Signed-off-by: Xiaodong Ye <[email protected]>

chat : fix debug build assertion in trim function (ggml-org#15520)

21dc4dd

scripts: fix compare-llama-bench.py (ggml-org#15521)

9ef5369

CUDA: fix half2 -> half conversion for HIP (ggml-org#15529)

710dfc4

vulkan: workaround MoltenVK compile failure in multi_add (ggml-org#15506

e78cf0d

) * vulkan: workaround MoltenVK compile failure in multi_add * Update ggml/src/ggml-vulkan/vulkan-shaders/multi_add.comp Co-authored-by: 0cc4m <[email protected]>

vulkan: enable Conv2D for Apple after MoltenVK fixed the bug (ggml-or…

a9c6ffc

…g#15526)

kv-cache : support layer reuse (ggml-org#15504)

b730706

* kv-cache : support layer reuse ggml-ci * cont : update comments [no ci]

l3utterfly merged commit 7b37ef1 into layla-build Aug 24, 2025
1 check failed

54 participants
