sync : llama.cpp by ggerganov · Pull Request #1424 · ggml-org/ggml

ggerganov · 2026-02-14T15:55:40Z

No description provided.

* Rename variables + fix rope_neox

  Seems memory layout is shared with Vulkan so we can port fix from ggml-org/llama.cpp#19299
* Fix rope_multi
* Fix rope_vision
* Fix rope_norm
* Rename ne* to ne0* for consistent variable naming
* cont : consistent stride names

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

Implement ggml_cann_mul_mat_id_quant function to support quantized matrix multiplication for Mixture of Experts (MoE) architectures on CANN backend.

Key features:
- Support Q4_0 and Q8_0 quantized weight formats
- Use IndexSelect to dynamically route expert-specific weights based on indices
- Leverage WeightQuantBatchMatmulV2 for efficient quantized computation
- Handle automatic F16 type conversion for hardware compatibility
- Support both per-expert and broadcast input modes

Implementation details:
- Extract expert weights and scales using CANN IndexSelect operation
- Process each batch and expert combination independently
- Create proper tensor views with correct stride for matmul operations
- Automatic input/output type casting to/from F16 as needed

Testing: All test cases passed for supported types (F32, F16, Q4_0, Q8_0).

…ion (llama/19452)

Using the noexcept std::filesystem::directory_entry::is_regular_file overload prevents abnormal termination upon throwing an error (as caused by symlinks to non-existent folders on Linux).

Resolves: #18560
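The difference between the throwing and the noexcept overload can be sketched as follows. This is a minimal illustration, not the code from the commit: the helper name `list_regular_files` is hypothetical, and only the standard `std::filesystem` calls (`directory_iterator`, `increment(ec)`, `is_regular_file(ec)`) are real.

```cpp
#include <cassert>
#include <filesystem>
#include <string>
#include <system_error>
#include <vector>

namespace fs = std::filesystem;

// Hypothetical helper (not from the PR): collect the names of regular files
// in `dir` using only the non-throwing std::filesystem overloads.
std::vector<std::string> list_regular_files(const fs::path & dir) {
    std::vector<std::string> out;

    std::error_code ec;
    fs::directory_iterator it(dir, ec);
    if (ec) {
        return out; // directory missing or unreadable: report nothing
    }

    const fs::directory_iterator end;
    while (it != end) {
        std::error_code entry_ec;
        // noexcept overload: returns false on error (for example a dangling
        // symlink) instead of throwing std::filesystem::filesystem_error
        if (it->is_regular_file(entry_ec)) {
            out.push_back(it->path().filename().string());
        }
        it.increment(ec); // non-throwing advance
        if (ec) {
            break;
        }
    }
    return out;
}
```

Calling `list_regular_files` on a directory that contains a symlink to a missing target simply skips that entry, whereas an uncaught exception from the throwing overload would terminate the program.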

…ons (dotprod) (llama/19360)

* First working version of GEMM and GEMV
* interleave loads and compute
* Clang-format
* Added missing fallback. Removed tested TODO.
* Swap M and N to be consistent with the repack template convention

* Fix memory leaks in shader lib, backend, backend_context, buffer_context, and webgpu_buf_pool
* Free pools
* Cleanup
* More cleanup
* Run clang-format
* Fix arg-parser and tokenizer test errors that free an unallocated buffer
* Fix device lost callback to not print on device teardown
* Fix include and run clang-format
* remove unused unused
* Update binary ops

---------

Co-authored-by: Reese Levine <reeselevine1@gmail.com>

* hexagon: add ARGSORT op

  Co-authored-by: Yarden Tal <yardent@qti.qualcomm.com>
* hexagon: argsort reject tensors with huge rows for now
* Adding support for DIV,SQR,SQRT,SUM_ROWS ops in hexagon backend
* hexagon : Add GEGLU op
* hexagon: fix editor config check
* hexagon: rewrite and optimize binary ops ADD/SUB/MUL/DIV/ADD_ID to use DMA

---------

Co-authored-by: Yarden Tal <yardent@qti.qualcomm.com>
Co-authored-by: Manohara Hosakoppa Krishnamurthy <mhosakop@qti.qualcomm.com>

…lama/19407)

* ggml-hexagon: implement 2x2 matmul kernel
* hexmm: implement vec_dot_rx2x2 for Q8_0 and MXFP4
* hexagon: fix editor config failures
* hexagon: refactor matmul ops to use context struct and remove wrappers

  Also implement vec_dot_f16 2x2
* hexagon: refactor dyn quantizers to use mmctx
* hexagon: remove mm fastdiv from op_ctx
* hexagon: refactor matmul entry point to reduce code duplication

---------

Co-authored-by: Trivikram Reddy <tamarnat@qti.qualcomm.com>

…9461)

There is an upstream problem [1] with AMD's LLVM 22 fork and rocWMMA 2.2.0 causing compilation issues on devices without native fp16 support (CDNA devices). The specialized types aren't resolved properly:

```
/opt/rocm/include/rocwmma/internal/mfma_impl.hpp:2549:37: error: ambiguous partial specializations of 'amdgcn_mfma<__half, __half, __half, 16, 16, 16>'
 2549 |     using ARegsT = typename Impl::ARegsT;
```

Add a workaround to explicitly declare the types and cast when compiling with HIP and ROCWMMA_FATTN [2]. When this is actually fixed upstream some guards can be used to detect and wrap the version that has the fix to only apply when necessary.

Link: ROCm/rocm-libraries#4398 [1]
Link: ggml-org/llama.cpp#19269 [2]
Signed-off-by: Mario Limonciello <mario.limonciello@amd.com>

* Do not mutate cgraph for fused ADDs

  1. We should try to minimize in-place changes to the incoming ggml_cgraph where possible (those should happen in graph_optimize)
  2. Modifying in-place leads to an additional, unnecessary graph capture step as we store the properties before modifying the graph in-place in the cuda-backend
* Assert ggml_tensor is trivially copyable
* Update ggml/src/ggml-cuda/ggml-cuda.cu

  Co-authored-by: Aman Gupta <amangupta052@gmail.com>

---------

Co-authored-by: Aman Gupta <amangupta052@gmail.com>
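The idea of working on a copy of a node rather than editing the graph in place, guarded by a trivially-copyable assertion, might look roughly like this. It is only a sketch under assumptions: `tensor_like` is a hypothetical stand-in and not the real `ggml_tensor` layout, and `with_second_source` is an invented helper, not a function from the CUDA backend.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <type_traits>

// Hypothetical stand-in for a graph node; NOT the real ggml_tensor layout.
struct tensor_like {
    std::int64_t  ne[4];   // number of elements per dimension
    std::size_t   nb[4];   // stride in bytes per dimension
    tensor_like * src[2];  // operand nodes
    void *        data;
};

// Copying the struct member-wise is only safe if it has no custom copy logic.
static_assert(std::is_trivially_copyable<tensor_like>::value,
              "tensor_like must be trivially copyable to be duplicated by value");

// Return a modified copy of `node`; the node owned by the graph stays untouched.
tensor_like with_second_source(const tensor_like & node, tensor_like * new_src1) {
    tensor_like copy = node;
    copy.src[1] = new_src1;
    return copy;
}
```

Because the original node is never written to, any properties recorded from the incoming graph still match it on the next evaluation.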

* fix vulkan ggml_acc only works in 3d but not 4d
* removed clamp in test_acc_block
* use the correct stride and its test case
* cuda : fix "supports op" condition
* change src0 to src1 in ggml_vk_acc. Update acc.comp with jeffbolznv's suggestion except to keep the boundary check
* version without boundary check
* revert back to boundary check version

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

…lama/19583)

* ggml-hexagon: fa improvements

  ggml-hexagon: optimize flash attention calculations with improved variable handling

  ggml-hexagon: streamline flash attention operations by removing redundant checks for FP32

  ggml-hexagon: optimize hvx_dot_f16_f16_aa_rx2 by simplifying variable handling for unused elements

  ggml-hexagon: optimize flash attention by changing slope vector type to F16
* hexfa: fixed test-backend-ops failures due to leftover element handling
* hexagon: refactor and optimize fa to use local context struct
* ggml-hexagon: optimize flash-attention using hvx_vec_expf

  Use HVX for online softmax.

---------

Co-authored-by: chraac <chraac@gmail.com>

* models : optimizing qwen3next graph
* cont
* wip
* wip
* wip
* wip
* wip
* wip
* wip
* wip
* wip
* wip
* cont : remove redundant q, g chunking
* minor
* minor
* avoid passing masks around
* avoid concats during chunking
* naming + shapes
* update names and use prefix to disable CUDA graphs

ORippler and others added 30 commits February 14, 2026 17:53

cuda : extend GGML_OP_PAD to work with non-cont src0 (llama/19429)

c954c9c

* cuda : extend GGML_OP_PAD to work with non-cont src0
* tests : add permuted pad
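What padding a non-contiguous source means can be illustrated on the CPU with a 2D case. This is a simplified sketch and not the CUDA kernel from the commit: `pad_2d` is a hypothetical helper, and it assumes float data, zero padding at the end of each dimension, and byte strides named `nb0`/`nb1`.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical helper: zero-pad a 2D float tensor at the end of each dimension.
// The source is addressed through byte strides (nb0, nb1), so it may be a
// permuted view and does not have to be contiguous. The result is contiguous.
std::vector<float> pad_2d(const void * src,
                          std::int64_t ne0, std::int64_t ne1,
                          std::size_t nb0, std::size_t nb1,
                          std::int64_t pad0, std::int64_t pad1) {
    assert(ne0 > 0 && ne1 > 0 && pad0 >= 0 && pad1 >= 0);

    const std::int64_t d0 = ne0 + pad0;
    const std::int64_t d1 = ne1 + pad1;
    std::vector<float> dst(static_cast<std::size_t>(d0 * d1), 0.0f);

    const char * base = static_cast<const char *>(src);
    for (std::int64_t i1 = 0; i1 < ne1; ++i1) {
        for (std::int64_t i0 = 0; i0 < ne0; ++i0) {
            // Element address comes from the strides, not from assuming that
            // consecutive i0 values are adjacent in memory.
            const std::size_t off = static_cast<std::size_t>(i0) * nb0
                                  + static_cast<std::size_t>(i1) * nb1;
            float v;
            std::memcpy(&v, base + off, sizeof(v));
            dst[static_cast<std::size_t>(i1 * d0 + i0)] = v;
        }
    }
    return dst;
}
```

Passing swapped strides for a row-major buffer pads its transposed view without first making a contiguous copy, which is the kind of input a permuted pad test exercises.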

CANN: Remove unnecessary wrapper for gml_backend_buft_is_cann (llam…

41fa5fa

…a/18968)

test: fix IMROPE perf test case (llama/19465)

90da869

CUDA : Update CCCL-tag for 3.2 to final release from RC (llama/19486)

592be70

CCCL 3.2 has been released since it was added to llama.cpp as part of the backend-sampling PR, and it makes sense to update from RC to final released version. https://github.com/NVIDIA/cccl/releases/tag/v3.2.0

metal : consolidate unary ops (llama/19490)

075f6f2

ggml : extend bin bcast for permuted src1 (llama/19484)

c48715e

* tests : extend bin bcast for permuted src1
* cont : extend bin support
* cont : s0 is always 1
* tests : simplify

metal : extend l2_norm support for non-cont src0 (llama/19502)

22726b8

ggml : unary ops support non-cont src0 + metal F16 unary ops (llama/1…

caf316a

…9511)

* ggml : unary ops support non-cont src0
* metal : support F16 unary ops + fix ELU

opencl: add general Q6_K mm and Q4_K mv (llama/19347)

052edc3

* opencl: add general q6_k mm
* opencl: refine condition for q6_K mm
* opencl: add general q4_K mv
* opencl: fix whitespace

metal : update sum_rows kernel to support float4 (llama/19524)

50121bf

opencl: add basic support for q4_1 (llama/19534)

10caa03

* opencl: add q4_1 mv
* opencl: clean up
* opencl: add flattened q4_1 mv
* opencl: clean up
* opencl: add basic q4_1 mm
* opencl: fix whitespace
* opencl: add general q4_0 mm

hexagon: fix typo in vtcm_needs_release (llama/19545)

38d8fca

metal : support GGML_OP_SET (llama/19548)

ac24afe

metal : improve concurrency (llama/19555)

474c31d

CUDA: loop over ne2*ne3 in case it overflows (llama/19538)

34cf376

* CUDA: loop over ne2*ne3 in case it overflows
* use fastdiv
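The looping pattern can be simulated on the host. CUDA limits `gridDim.y` and `gridDim.z` to 65535 blocks, so when `ne2*ne3` exceeds that, a kernel can launch a clamped number of blocks and let each block stride over the flattened index. The sketch below is an assumption about the general technique, not the kernel touched by the commit: `run_block` and `launch` are hypothetical names, and plain integer division stands in for fastdiv.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// CUDA limits gridDim.y and gridDim.z to 65535 blocks.
constexpr std::int64_t MAX_GRID_DIM_YZ = 65535;

// Work of one simulated block: stride over the flattened (i2, i3) index so
// that every pair is covered even when ne2*ne3 exceeds the grid limit.
void run_block(std::int64_t block_idx, std::int64_t n_blocks,
               std::int64_t ne2, std::int64_t ne3, std::vector<int> & visits) {
    for (std::int64_t i23 = block_idx; i23 < ne2 * ne3; i23 += n_blocks) {
        const std::int64_t i3 = i23 / ne2;      // plain division, no fastdiv
        const std::int64_t i2 = i23 - i3 * ne2;
        visits[static_cast<std::size_t>(i3 * ne2 + i2)] += 1;
    }
}

// Hypothetical host-side "launch": clamp the block count, then run each block.
std::vector<int> launch(std::int64_t ne2, std::int64_t ne3) {
    assert(ne2 > 0 && ne3 > 0);
    const std::int64_t n_blocks = std::min(ne2 * ne3, MAX_GRID_DIM_YZ);
    std::vector<int> visits(static_cast<std::size_t>(ne2 * ne3), 0);
    for (std::int64_t b = 0; b < n_blocks; ++b) {
        run_block(b, n_blocks, ne2, ne3, visits);
    }
    return visits;
}
```

With `ne2 = 70000` and `ne3 = 2` the flattened range has 140000 entries, more than one grid dimension can hold, and each entry is still visited exactly once.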

Fix wrong memcpy length for block_interleave == 4 (llama/19575)

30117e5

vulkan: restore -inf check in FA shaders (llama/19582)

d474df5

vulkan: Add vendor id for Qualcomm drivers (llama/19569)

6f59de9

This commit allows Qualcomm native vulkan driver to be used on Windows instead of Mesa Dozen.

vulkan: support GGML_OP_SET (llama/19584)

6970c64

jeffbolznv and others added 5 commits February 14, 2026 17:53

vulkan: support L2_NORM with contiguous rows (llama/19604)

2c8d7dc

metal : fix ACC op (llama/19427)

b6bb56a

ggml : fix GGML_DEBUG with OpenMP (llama/19599)

36f4f06

last_graph is only available without OpenMP, but ggml_graph_compute_thread() is called in both cases.

Signed-off-by: Adrien Gallouët <angt@huggingface.co>

sync : llama.cpp

bf6bc10

danbev approved these changes Feb 14, 2026

ggerganov merged commit 68fee72 into master Feb 14, 2026
17 checks passed

ggerganov deleted the sync-llama.cpp-26-02-14 branch February 14, 2026 22:20

ggerganov merged 35 commits into master from sync-llama.cpp-26-02-14

18 participants
