forked from ggml-org/llama.cpp
merge from upstream #84
Merged
Conversation
* memory : remove KV cache size padding
* cont : restore padding for n_kv tensor shape
* server : use slot context size instead of training context size
* server : simplify context limit logic
* feat(cuda): add GGML_OP_SET support

  Implement CUDA kernel for SET operation with f32 support. All tests passing (14598/14598).
* cuda(set): add I32 support; keep F32
* refactor(cuda): use ggml_cuda_cpy to unify SET operator logic and remove code duplication (the op's semantics are sketched after this list)
* Update ggml/src/ggml-cuda/ggml-cuda.cu

  Co-authored-by: Sigbjørn Skjæret <[email protected]>
* Update ggml/src/ggml-cuda/set.cu

  Co-authored-by: Sigbjørn Skjæret <[email protected]>

---------

Co-authored-by: Sigbjørn Skjæret <[email protected]>
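For readers unfamiliar with the op, here is a rough sketch of GGML_OP_SET's semantics for the contiguous f32 case (hypothetical names, plain host code rather than the actual CUDA kernel): the destination starts as a copy of src0 and src1 is then written into it at a byte offset, so both steps are ordinary copies, which is why a shared copy routine such as ggml_cuda_cpy can cover them.

```cpp
#include <cstddef>
#include <cstring>

// dst starts as a copy of src0 (or is src0 itself when inplace), then src1 is
// written into it at a byte offset -- both steps are plain copies.
void set_f32_sketch(float * dst, const float * src0, const float * src1,
                    size_t n_dst, size_t n_src1, size_t offset_bytes, bool inplace) {
    if (!inplace) {
        std::memcpy(dst, src0, n_dst * sizeof(float));                        // copy src0 -> dst
    }
    std::memcpy((char *) dst + offset_bytes, src1, n_src1 * sizeof(float));   // the "set" step
}
```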
* sycl: add RMS_NORM_BACK operation support
* sycl: rms_norm_back: add dual reduction paths (FP64 and FP32) and savepoint before further changes
* sycl: add RMS_NORM_BACK support

  Implement RMS_NORM_BACK for the SYCL backend using FP32 compensated parallel reduction (see the sketch after this list). Minimal docs updates (ops.md / SYCL.csv).
* revert: restore .gitignore and tools/run/CMakeLists.txt to upstream
* revert: restore tests/CMakeLists.txt to upstream
* sycl: optimize rms_norm_back
* fix: restore SYCL.csv to correct state with RMS_NORM_BACK support
* Update ggml/src/ggml-sycl/norm.cpp

  Co-authored-by: Neo Zhang Jianyu <[email protected]>
* fix: remove trailing whitespace and add missing newline (EditorConfig)

---------

Co-authored-by: Neo Zhang Jianyu <[email protected]>
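The "FP32 compensated parallel reduction" mentioned above refers to compensated (Kahan) summation; below is a minimal scalar sketch of the idea with illustrative names, not the SYCL kernel itself.

```cpp
// Compensated (Kahan) summation: a second float carries the rounding error of
// each addition, so a long FP32 reduction loses far less precision.
float compensated_sum(const float * x, int n) {
    float sum = 0.0f;
    float c   = 0.0f;                 // running compensation for lost low-order bits
    for (int i = 0; i < n; ++i) {
        const float y = x[i] - c;     // re-inject the error from the previous step
        const float t = sum + y;      // low-order bits of y may be lost here
        c   = (t - sum) - y;          // recover exactly what was lost
        sum = t;
    }
    return sum;
}
```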
* CUDA: Fix bug in topk-moe for gpt-oss

  When using ggml_can_fuse_subgraph, the output nodes being passed were wrong. This caused `test-backend-ops` to still fuse the nodes (because the nodes are not used elsewhere in the graph), while the fusion did not actually happen in the real gpt-oss graph.
* fix for qwen3 too
* change ifndef to ifdef
…rg#16793)

This lets the copy to the destination device use the host-visible vidmem optimization.
* sync minja.hpp

  Adds Call/EndCall support, used in MiniCPM3 and MiniCPM4-MCP.
* remove spurious semicolon
* sync from ochafik/minja
* CUDA: use fastdiv in set-rows (the general trick is sketched below)
* add assert about value fitting in u32
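For context on the two items above: the general "fastdiv" trick replaces an integer division by a runtime-invariant divisor with a multiply-high and a shift, which is also why the operands have to fit in 32 bits. The following is an illustrative sketch of the technique with made-up names, not the exact ggml-cuda helper.

```cpp
#include <cassert>
#include <cstdint>

struct fastdiv_t { uint32_t mp; uint32_t L; };

// Precompute a magic multiplier mp and shift L for dividing by d (d != 0).
static fastdiv_t fastdiv_init(uint32_t d) {
    assert(d != 0);
    uint32_t L = 0;
    while (L < 32 && (uint64_t{1} << L) < d) ++L;   // L = ceil(log2(d))
    const uint32_t mp = (uint32_t) (((uint64_t{1} << 32) * ((uint64_t{1} << L) - d)) / d + 1);
    return {mp, L};
}

// n / d computed as a multiply-high plus shift; n and d must fit in 32 bits.
static uint32_t fastdiv(uint32_t n, fastdiv_t f) {
    const uint64_t hi = ((uint64_t) n * f.mp) >> 32;   // multiply-high
    return (uint32_t) ((hi + n) >> f.L);               // 64-bit sum avoids overflow
}
```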
* hexagon: remove dspqueue callbacks and do all read processing inplace
* hexagon: there is no need to ref/deref the buffers at this point

  We're not going to release the buffers without flushing the session queue, so there is no need to inc/dec the refcounts for every request. We also don't need to include those bufs in the response.
* hexagon: bump the thread count in the adb wrapper scripts

  We can use more CPU cores now that the dedicated dspqueue polling threads are not used (i.e. no contention). Also enable more aggressive polling for now, since we still map Flash Attention (and a few other kernels) to the CPU, and those dspqueue threads were keeping the CPU cores at higher clock freqs.
* hexagon: add lhez as the second code owner
* vulkan: add mmq q2_k integer dot support
* Refactor mmq caching
* Reduce mmq register use
* Load 4 quant blocks into shared memory in one step
* Pack q2_k blocks into caches of 32
* Use 32-bit accumulators for integer dot matmul (see the sketch after this list)
* Add q4_k mmq
* Add q3_k mmq
* Add q5_k mmq
* Add q6_k mmq
* Add mxfp4 mmq, enable MMQ MUL_MAT_ID
* Fix mmv dm loads
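One item above switches the integer-dot matmul to 32-bit accumulators; stripped of the Vulkan and K-quant packing details, the underlying principle looks like this (illustrative names, scalar code):

```cpp
#include <cstdint>

// Quantized dot product: int8 products are summed in an int32 accumulator so
// nothing overflows, and the result is scaled back to float only once at the end.
float int8_dot(const int8_t * a, const int8_t * b, int n, float scale_a, float scale_b) {
    int32_t acc = 0;
    for (int i = 0; i < n; ++i) {
        acc += (int32_t) a[i] * (int32_t) b[i];
    }
    return (float) acc * scale_a * scale_b;
}
```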
…#16656)

* vulkan: Update topk_moe fusion to handle gpt's late softmax

  Based on ggml-org#16649.
* Add ggml_check_edges
* Add sync logging to show fusion effects
* handle clamp added in ggml-org#16655
* Update ggml/src/ggml-impl.h

  Co-authored-by: Diego Devesa <[email protected]>
* llama: store mrope data in KV cell
* correct x,y ordering
* address review comments
* add consistency checks
* Update src/llama-kv-cache.cpp

  Co-authored-by: Georgi Gerganov <[email protected]>
* add TODO
* fix asan error
* kv-cells : improve ext handling
* cont : fix headers

---------

Co-authored-by: Georgi Gerganov <[email protected]>
This pattern appears in a lot of models: the rope operation is applied right before storing into the KV cache (usually on the K tensor).

Add a path to some of the rope shaders that computes the destination address based on the set_rows tensor. Compile variants of the shader with D_TYPE of f16 (the usual KV cache type). Add a src3 operand to ggml_vk_op_f32 - sometimes rope uses three srcs and needs the fourth for the row indices.

Add fused_ops_write_mask to indicate which intermediate tensors need to write their results to memory. Skipping writing the roped K value helps to allow more nodes to run concurrently.

Add logic to ggml_vk_graph_optimize to make ROPE+VIEW+SET_ROWS consecutive. It rarely starts out that way in the graph.

Add new backend tests.
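As a rough illustration of what the fused path computes, here is scalar CPU code with hypothetical names (the real change lives in the Vulkan rope shaders): the rope rotation writes each K row directly to the KV-cache row named by the SET_ROWS index tensor, so the intermediate roped tensor never needs to be written out.

```cpp
#include <cmath>
#include <cstdint>

// Rotate each K row (adjacent-pair rope) and store it straight into the KV
// cache at the row given by the set_rows index tensor.
void rope_into_kv_cache(const float * k, const int64_t * row_ids, const int * pos,
                        float * kv_cache, int n_rows, int head_dim, float theta_base) {
    for (int r = 0; r < n_rows; ++r) {
        const float * src = k        + (int64_t) r  * head_dim;
        float       * dst = kv_cache + row_ids[r]   * head_dim;   // address from set_rows
        for (int i = 0; i < head_dim; i += 2) {
            const float theta = pos[r] * std::pow(theta_base, -(float) i / head_dim);
            const float c = std::cos(theta);
            const float s = std::sin(theta);
            dst[i    ] = src[i] * c - src[i + 1] * s;   // rotated pair lands in the cache,
            dst[i + 1] = src[i] * s + src[i + 1] * c;   // no temporary roped K tensor
        }
    }
}
```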
This is realised by loading them into registers before computation of the dot-product, effectively batching them together with said dot-product. As a lot of threads are alive here, the warp scheduler has enough threads available to effectively hide the cost of additionally loading those two floats.
* Added GGUF mappings for CogVLM model
* Add tensor mapping for CogVLM visual encoder
* Add CogVLM to conversion script, no vision part yet
* Added CogVLM vision model to conversion script
* Add graph for CogVLM CLIP model
* Add graph for CogVLM
* Fixes for CogVLM. Now compiles.
* Model now runs
* Fixes for cogvlm graph
* Account for graph context change after rebase
* Changes for whitespace
* Changes in convert script according to comments
* Switch CogVLM LLM graph to merged QKV tensor
* Use rope_type variable instead of direct definition
* Change CogVLM CLIP encoder to use SWIGLU
* Switch CogVLM CLIP to use merged QKV
* Apply rebase edits and remove ggml_cont call that is now unnecessary
* clean up

---------

Co-authored-by: Xuan Son Nguyen <[email protected]>
Factor out the core FA loop into flash_atten_f16_one_chunk and add an outer loop on top that handles the chunks.
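For context, one way such a chunked split composes: the per-chunk routine carries a running maximum and softmax denominator (online softmax), and the outer loop simply feeds it consecutive KV slices. A simplified scalar sketch with made-up names, not the actual ggml code:

```cpp
#include <algorithm>
#include <cmath>
#include <vector>

// Running state carried across chunks: max score m, softmax denominator l,
// and the (un-normalized) weighted sum of V rows.
struct fa_state {
    float m = -INFINITY;
    float l = 0.0f;
    std::vector<float> o;
};

// Process one contiguous slice [start, end) of KV positions.
void fa_one_chunk(const std::vector<float> & s, const std::vector<std::vector<float>> & v,
                  int start, int end, fa_state & st) {
    for (int j = start; j < end; ++j) {
        const float m_new = std::max(st.m, s[j]);
        const float scale = std::exp(st.m - m_new);   // rescale what was accumulated so far
        const float p     = std::exp(s[j] - m_new);
        for (size_t d = 0; d < st.o.size(); ++d) {
            st.o[d] = st.o[d] * scale + p * v[j][d];
        }
        st.l = st.l * scale + p;
        st.m = m_new;
    }
}

// Outer loop: walk the KV range in fixed-size chunks, then normalize once.
std::vector<float> fa_chunked(const std::vector<float> & s,
                              const std::vector<std::vector<float>> & v, int chunk) {
    fa_state st;
    st.o.assign(v.empty() ? 0 : v[0].size(), 0.0f);
    for (int start = 0; start < (int) s.size(); start += chunk) {
        fa_one_chunk(s, v, start, std::min(start + chunk, (int) s.size()), st);
    }
    for (float & x : st.o) {
        x /= st.l;
    }
    return st.o;
}
```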
* support qwen3vl series.

  Co-authored-by: Thireus ☠ <[email protected]>
  Co-authored-by: yairpatch <[email protected]>
  Co-authored-by: LETS-BEE <[email protected]>
* bugfix: fix the arch check for qwen3vl-moe.
* use build_ffn
* optimize deepstack structure
* optimize deepstack feature saving
* Revert "optimize deepstack feature saving" as a temporary fix

  This reverts commit f321b9f.
* code clean
* use fused qkv in clip
* clean up / rm is_deepstack_layers for simplification
* add test model
* move test model to "big" section
* fix imrope check
* remove trailing whitespace
* fix rope fail
* metal : add imrope support
* add imrope support for sycl
* vulkan: add imrope w/o check
* fix vulkan
* webgpu: add imrope w/o check
* Update gguf-py/gguf/tensor_mapping.py

  Co-authored-by: Sigbjørn Skjæret <[email protected]>
* fix tensor mapping

---------

Co-authored-by: Thireus ☠ <[email protected]>
Co-authored-by: yairpatch <[email protected]>
Co-authored-by: LETS-BEE <[email protected]>
Co-authored-by: Xuan Son Nguyen <[email protected]>
Co-authored-by: Georgi Gerganov <[email protected]>
Co-authored-by: Sigbjørn Skjæret <[email protected]>
…ing on ARM64 (ggml-org#16833)

Very similar implementation to the flash-attention chunking, with similar benefits.
* server : remove n_past
* server : replace slot.n_prompt_tokens() with slot.task->n_tokens()
* server : fixes + clean-up
* cont : fix context shift
* server : add server_tokens::pos_next()

  Co-authored-by: Xuan-Son Nguyen <[email protected]>
* server : fix pos_next() usage

  Co-authored-by: Xuan-Son Nguyen <[email protected]>

---------

Co-authored-by: Xuan-Son Nguyen <[email protected]>
* Update requirements-convert_legacy_llama.txt

  Updated requirements to support Qwen3-VL with transformers version 4.57.1.
* Update requirements/requirements-convert_legacy_llama.txt

  Co-authored-by: Sigbjørn Skjæret <[email protected]>

---------

Co-authored-by: Sigbjørn Skjæret <[email protected]>
…ml-org#16836)

* respect input size when getting/setting tensor data

  Allows partial repacking/copying when the requested size is smaller than the actual tensor (a minimal sketch follows this list).
* Removed duplicate repack_mxfp4_mxfp4x4x2 function
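A minimal sketch of the "respect input size" part (hypothetical names, plain C++ rather than the actual backend code): the copy is bounded by the size the caller asked for instead of the full tensor size, which is what makes partial get/set and partial repacking possible.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstring>

// Copy out at most `size` bytes starting at `offset`, clamped to what the
// tensor actually holds, instead of always assuming the full tensor.
void tensor_get_partial(const char * tensor_data, size_t tensor_size,
                        void * out, size_t offset, size_t size) {
    if (offset >= tensor_size) {
        return;                                          // nothing to copy
    }
    const size_t n = std::min(size, tensor_size - offset);
    std::memcpy(out, tensor_data + offset, n);
}
```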
* vulkan: fix shmem overrun in mmq id shader
* metal : fix mul_mm_id

---------

Co-authored-by: Georgi Gerganov <[email protected]>
Co-authored-by: Zhang Jianyu <[email protected]>
* CUDA: add fused rope
* move k forward_expand up
* create helper function instead of re-using params
* make assert statement more in line with comment
* rope_norm: coalesced writes to global mem
* update L2_NORM op support
* update L2_NORM op support
* remove extra whitespace
* cann: update cross_entropy_loss op support
* remove trailing whitespaces
* rebase onto the latest code in the main repository and remove the l2_norm operator that already exists in another pull request
* undo the l2_norm operator deletion
…" (ggml-org#17233) This reverts commit 1c398dc.
* metal: accelerated conv2d
* cont : cleanup

---------

Co-authored-by: bghira <[email protected]>
Co-authored-by: Georgi Gerganov <[email protected]>
…ations (ggml-org#17227)

Signed-off-by: Wang Yang <[email protected]>
…heck (ggml-org#17219)

* vulkan: remove shell call from vulkan-shaders-gen tool
* use string vector for command execution (the idea is sketched after this list)
* Fix condition
* use string, remove const_cast
* Fix dependency file quotation on Windows

---------

Co-authored-by: Jeff Bolz <[email protected]>
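A sketch of the "string vector instead of a shell call" idea on POSIX (illustrative names; the actual tool also has to work on Windows): the command is kept as an argv-style vector and executed directly, so no shell parsing or quoting is involved.

```cpp
#include <string>
#include <vector>
#include <sys/wait.h>
#include <unistd.h>

// Run a command given as an argv-style vector, without going through a shell.
int run_command(std::vector<std::string> args) {        // take copies: execvp wants mutable char*
    std::vector<char *> argv;
    for (auto & a : args) {
        argv.push_back(a.data());                        // C++17 non-const data(), NUL-terminated
    }
    argv.push_back(nullptr);                             // execvp expects a null-terminated array

    const pid_t pid = fork();
    if (pid == 0) {
        execvp(argv[0], argv.data());                    // child: replace image, no shell involved
        _exit(127);                                      // only reached if exec failed
    }
    int status = 0;
    waitpid(pid, &status, 0);
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```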
Labels
Apple Metal
Ascend NPU
build
devops
documentation
examples
ggml
model
nix
Nvidia GPU
OpenCL
python
script
server
SYCL
testing
Vulkan