forked from ggml-org/llama.cpp
llama.cpp SYNC #45
Open
akapoor3518 wants to merge 1,611 commits into tsisw:llama.cpp-syn-sept2 from ggml-org:master
+496,624
−106,518
Conversation
…16699)
* run one test per backend/device (even if it's the same device)
…ec_set_f32 for faster fills (#16522)
* Leverage the existing GGML_F32_VEC helpers to broadcast the fill value across SIMD registers and store in vector-sized chunks, while retaining the scalar tail for leftover elements and non-SIMD builds.
* Vectorize additional f32 helper loops
* Normalize f32 helper tails for ggml vec ops

Co-authored-by: Aaron <[email protected]>
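For context on the pattern this entry describes, here is a minimal sketch of a broadcast fill with a scalar tail, assuming a fixed chunk width; it does not use the actual GGML_F32_VEC macros, and the function name and width are illustrative only.

```cpp
#include <cstddef>

// Hedged sketch of the "vector-sized chunks + scalar tail" fill pattern.
// VEC_WIDTH and the function name are illustrative; the real code uses the
// GGML_F32_VEC helpers and the platform's actual SIMD register width.
static void vec_set_f32_sketch(float * x, std::size_t n, float v) {
    constexpr std::size_t VEC_WIDTH = 8;   // assumed chunk size, not the real GGML constant
    std::size_t i = 0;
    for (; i + VEC_WIDTH <= n; i += VEC_WIDTH) {
        for (std::size_t j = 0; j < VEC_WIDTH; ++j) {
            x[i + j] = v;                  // one whole chunk per iteration
        }
    }
    for (; i < n; ++i) {                   // scalar tail for leftover elements
        x[i] = v;
    }
}
```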
…6562)
* webui: introduce OpenAI-compatible model selector in JSON payload
* webui: restore OpenAI-Compatible model source of truth and unify metadata capture
  This change re-establishes a single, reliable source of truth for the active model, fully aligned with the OpenAI-Compat API behavior. It introduces a unified metadata flow that captures the model field from both streaming and non-streaming responses, wiring a new onModel callback through ChatService. The model name is now resolved directly from the API payload rather than relying on server /props or UI assumptions. ChatStore records and persists the resolved model for each assistant message during streaming, ensuring consistency across the UI and database. Type definitions for API and settings were also extended to include model metadata and the onModel callback, completing the alignment with OpenAI-Compat semantics.
* webui: address review feedback from allozaur
* webui: move model selector into ChatForm (idea by @allozaur)
* webui: make model selector more subtle and integrated into ChatForm
* webui: replaced the Flowbite selector with a native Svelte dropdown
* webui: add developer setting to toggle the chat model selector
* webui: address review feedback from allozaur
  Normalized streamed model names during chat updates by trimming input and removing directory components before saving or persisting them, so the conversation UI shows only the filename. Forced model names within the chat form selector dropdown to render as a single-line, truncated entry with a tooltip revealing the full name.
* webui: toggle displayed model source for legacy vs OpenAI-Compat modes
  When the selector is disabled, it falls back to the active server model name from /props. When the model selector is enabled, the displayed model comes from the message metadata (the one explicitly selected and sent in the request).
* Update tools/server/webui/src/lib/components/app/chat/ChatForm/ChatFormActions.svelte Co-authored-by: Aleksander Grygier <[email protected]>
* Update tools/server/webui/src/lib/constants/localstorage-keys.ts Co-authored-by: Aleksander Grygier <[email protected]>
* Update tools/server/webui/src/lib/components/app/chat/ChatForm/ChatFormModelSelector.svelte Co-authored-by: Aleksander Grygier <[email protected]>
* Update tools/server/webui/src/lib/components/app/chat/ChatMessages/ChatMessageAssistant.svelte Co-authored-by: Aleksander Grygier <[email protected]>
* Update tools/server/webui/src/lib/services/chat.ts Co-authored-by: Aleksander Grygier <[email protected]>
* Update tools/server/webui/src/lib/services/chat.ts Co-authored-by: Aleksander Grygier <[email protected]>
* webui: refactor model selector and persistence helpers
  - Replace inline portal and event listeners with proper Svelte bindings
  - Introduce 'persisted' store helper for localStorage sync without runes
  - Extract 'normalizeModelName' utils + Vitest coverage
  - Simplify ChatFormModelSelector structure and cleanup logic
  Replaced the persisted store helper's use of '$state/$effect' runes with a plain TS implementation to prevent orphaned effect runtime errors outside component context.
  Co-authored-by: Aleksander Grygier <[email protected]>
* webui: document normalizeModelName usage with inline examples
* Update tools/server/webui/src/lib/components/app/chat/ChatForm/ChatFormModelSelector.svelte Co-authored-by: Aleksander Grygier <[email protected]>
* Update tools/server/webui/src/lib/stores/models.svelte.ts Co-authored-by: Aleksander Grygier <[email protected]>
* Update tools/server/webui/src/lib/stores/models.svelte.ts Co-authored-by: Aleksander Grygier <[email protected]>
* webui: extract ModelOption type into dedicated models.d.ts Co-authored-by: Aleksander Grygier <[email protected]>
* webui: refine ChatMessageAssistant displayedModel source logic
* webui: stabilize dropdown, simplify model extraction, and init assistant model field
* chore: update webui static build
* Update tools/server/webui/src/lib/components/app/chat/ChatMessages/ChatMessageAssistant.svelte Co-authored-by: Aleksander Grygier <[email protected]>
* chore: npm format, update webui static build
* webui: align sidebar trigger position, remove z-index glitch
* chore: update webui build output

Co-authored-by: Aleksander Grygier <[email protected]>
* model: add support for extra bufs for all devices
* hexagon: add experimental ggml-hexagon backend for the Hexagon NPU
  This commit introduces a new experimental backend `ggml-hexagon` with support for the Hexagon NPU.
  Highlights:
  - Supports Hexagon versions: v73, v75, v79, and v81
  - Targets Android devices based on Snapdragon SoCs: Gen3, 8-Elite, and 8-Elite Gen5
  - Supports Q4_0, Q8_0, MXFP4, and FP32 data types
  - Implements core LLM ops: MUL_MAT/MUL_MAT_ID, ADD/SUB/MUL/ADD_ID, RMS_NORM, ROPE, GLU/SWIGLU, SOFTMAX
  **Note:** This backend is experimental and may exhibit instability or limited performance across supported devices. It is intended for early testing and feedback from the llama.cpp/ggml developer and user community.
  Co-Authored-By: Rajdeep Ganguly <[email protected]>
  Co-Authored-By: Todor Boinovski <[email protected]>
* hexagon: fix format checker errors
* hexagon: update readme and cmake presets
* ci: add android-ndk-build jobs that build plain ARM64 and Snapdragon versions
* hexagon: add simple graph optimizer for stacking MUL_MAT ops with the same input
* hexagon: move ADB helper scripts into scripts/snapdragon/adb
* hexagon: replace all f/printfs with GGML_LOG_...
* readme: add hexagon to the list of supported backends
* hexagon: stack matmuls with quantized inputs only
* hexagon: add TODO for fixing issues in hexagon_graph_optimize
* hexagon: update to hex-sdk 6.4.0 and add scripts for running on QDC
* scripts: fix lint errors
* scripts: update qdc pytest script to make linter happy
* hexagon: add reduce sum in fp32
* hexagon: reduce number of vector stores in matmul output
* hexagon: remove the need for vdelta in reduce-multiply-x8
* hexagon: consistent use of reduce_sum_fp32 for row_sums
* hexagon: some more matmul optimizations and comments
  Optimize cases where tensor dims are not a multiple of 1024 (e.g. in Qwen models). We've handled those cases already, but at a higher overhead.
* hexagon: update cmake presets
* hexagon: add OPMASK support for run-bench.sh wrapper
* hexagon: update to use GGML_BACKEND_API
* hexagon: remove unused logic for setting tensor flags for the views
* hexagon: add asserts to set/get_tensor to make sure we handle complete tensors
  Same asserts as the CPU backend.
* hexagon: use cpy_tensor slow path for non-host buffers
* hexagon: error checks in the buffer allocator
* cmake: move include(extProj) under ggml-hexagon
* hexagon: don't forget to delete the backend on free
* hexagon: set/get_tensor size assert applies only to quantized tensors
* hexagon: reintroduce HEX_VERBOSE wrapper for GGML_LOG_DEBUG for now
  GGML_LOG_DEBUG is always enabled for test-backend-ops and the output gets in the way. Ideally we need somewhat finer log levels.
* docs: typos in hexagon developer docs (libggm-...)
* hexagon: overhaul error handling in the session/device allocation
  This should handle all failure paths in the session allocation.
* hexagon: update cmake presets to enable fp16 vectors
* hexagon: remove unused time_usec function
* hexagon: don't forget to release buffer contexts
* hexagon: fixed indents in hvx-utils (missed clang-format auto-format failure)
* hexagon: remove custom can_repeat function and use ggml_can_repeat

Co-authored-by: Rajdeep Ganguly <[email protected]>
Co-authored-by: Todor Boinovski <[email protected]>
…ng (#16644)
* sycl: use async memory allocation to fix graph recording failures
  GGML_SYCL_DISABLE_GRAPHS=0 causes crashes because:
  - Host waits are currently unsupported in graph recording mode.
  - SYCL malloc / free calls are unsupported in graph recording mode.
  The following changes are made to fix SYCL graph functionality:
  - When graphs are enabled, use the SYCL async memory extension for temp buffers, which is supported with SYCL graphs.
  - For compiler versions that do not support this extension, skip graphs with the affected op.
  - Switch from USM shared to device memory, as the async extension currently just supports device allocations.
* Address reviewer feedback
* Use global async variable to decide path in sycl_ext_[malloc_device|free]
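A rough sketch of the allocation-path switch described above follows; the flag and helper names are stand-ins, and the experimental async extension call is only indicated in comments because its exact API is compiler-specific.

```cpp
#include <cstddef>
#include <sycl/sycl.hpp>

// Stand-in for the global flag mentioned above that selects the allocation path.
static bool g_sycl_use_async_alloc = false;

// Hedged sketch of a sycl_ext_malloc_device-style helper: while recording a
// SYCL graph, temp buffers must come from the async memory-allocation
// extension (plain malloc/free cannot be recorded); otherwise ordinary device
// USM is used.
static void * sycl_temp_alloc_sketch(sycl::queue & q, std::size_t size) {
    if (g_sycl_use_async_alloc) {
        // Graph path: would return device memory obtained from the async
        // memory-allocation extension here (device allocations only; USM
        // shared is not covered). Elided: experimental, compiler-dependent API.
        return nullptr;
    }
    // Non-graph path: ordinary device USM allocation.
    return sycl::malloc_device(size, q);
}
```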
* mtmd-cli : allow using --jinja
* support -sys
* implement chat_history
* fix clear memory
* rm -sys support, added TODO
* Make mistral-common dependency optional
* Fix typing
* chore: Update packages + upgrade Storybook to v10
* fix: Increase timeout for UI tests
* ggml : use std::sort in ggml_argsort CPU implementation
* cont : add missing header
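A minimal sketch of an argsort built on std::sort, with an illustrative (non-ggml) name and signature: the destination holds indices 0..n-1, which are sorted by comparing the values they point at.

```cpp
#include <algorithm>
#include <cstdint>

// Hedged sketch: sort indices of one row of floats by value using std::sort.
static void argsort_f32_sketch(const float * src, int32_t * dst, int n, bool ascending) {
    for (int i = 0; i < n; ++i) {
        dst[i] = i;                         // start with the identity permutation
    }
    std::sort(dst, dst + n, [&](int32_t a, int32_t b) {
        return ascending ? src[a] < src[b] : src[a] > src[b];
    });
}
```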
Co-authored-by: Zhang Jianyu <[email protected]>
* CUDA: add fused rope
* move k forward_expand up
* create helper function instead of re-using params
* make assert statement more in line with comment
* rope_norm: coalesced writes to global mem
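For reference, the rotation a RoPE kernel applies to each adjacent pair of elements ("norm" layout) looks roughly like the scalar sketch below; frequency scaling and extrapolation factors are omitted, and the function name is illustrative rather than the fused CUDA kernel itself.

```cpp
#include <cmath>

// Simplified RoPE rotation for one row: each adjacent pair is rotated by a
// position-dependent angle. theta_scale is typically pow(freq_base, -2.0/n_dims).
static void rope_norm_row_sketch(float * x, int n_dims, int pos, float theta_scale) {
    float theta = (float) pos;
    for (int i = 0; i < n_dims; i += 2) {
        const float c  = std::cos(theta);
        const float s  = std::sin(theta);
        const float x0 = x[i];
        const float x1 = x[i + 1];
        x[i]     = x0 * c - x1 * s;
        x[i + 1] = x0 * s + x1 * c;
        theta *= theta_scale;   // advance to the next pair's frequency
    }
}
```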
* update L2_NORM op support
* update L2_NORM op support
* remove extra whitespace
* cann: update cross_entropy_loss op support
* remove trailing whitespaces
* rebase the latest code in the main repository and remove the l2_norm operator that already exists in another pull request.
* undo the l2_norm operator deletion
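As a reminder of what the cross_entropy_loss op computes, a textbook softmax cross-entropy for one row is sketched below; ggml's exact reduction and scaling across rows may differ, so treat this as an illustration only.

```cpp
#include <cmath>
#include <cstddef>

// loss = -sum_i labels[i] * log_softmax(logits)[i], computed with a
// numerically stable log-sum-exp (max subtraction).
static float cross_entropy_row_ref(const float * logits, const float * labels, std::size_t n) {
    float max_logit = logits[0];
    for (std::size_t i = 1; i < n; ++i) {
        if (logits[i] > max_logit) max_logit = logits[i];
    }
    float sum_exp = 0.0f;
    for (std::size_t i = 0; i < n; ++i) {
        sum_exp += std::exp(logits[i] - max_logit);
    }
    const float log_z = max_logit + std::log(sum_exp);
    float loss = 0.0f;
    for (std::size_t i = 0; i < n; ++i) {
        loss -= labels[i] * (logits[i] - log_z);
    }
    return loss;
}
```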
* metal: accelerated conv2d
* cont : cleanup

Co-authored-by: bghira <[email protected]>
Co-authored-by: Georgi Gerganov <[email protected]>
…ations (#17227) Signed-off-by: Wang Yang <[email protected]>
…heck (#17219)
* vulkan: remove shell call from vulkan-shaders-gen tool
* use string vector for command execution
* Fix condition
* use string, remove const_cast
* Fix dependency file quotation on Windows

Co-authored-by: Jeff Bolz <[email protected]>
* Add ops needed for new hybrid models: SOFTPLUS, EXPM1, TRI, SOLVE_TRI, CUMSUM
* Update ggml/include/ggml.h Co-authored-by: Georgi Gerganov <[email protected]>
* Update tests/test-backend-ops.cpp Co-authored-by: Georgi Gerganov <[email protected]>
* Code review
* Whitespace
* Update tests/test-backend-ops.cpp Co-authored-by: Diego Devesa <[email protected]>
* This is actually sigmoid, duh.
* Add CONST, remove TRI_KEEP, other changes from review
* Update tests/test-backend-ops.cpp Co-authored-by: Georgi Gerganov <[email protected]>
* Update ggml/src/ggml.c Co-authored-by: Georgi Gerganov <[email protected]>
* Update ggml/src/ggml.c Co-authored-by: Georgi Gerganov <[email protected]>
* Update ggml/src/ggml-cuda/unary.cu Co-authored-by: Aman Gupta <[email protected]>
* Remove extra script
* Update ggml/src/ggml.c Co-authored-by: Diego Devesa <[email protected]>
* Update tests/test-backend-ops.cpp Co-authored-by: Diego Devesa <[email protected]>
* moving changes from laptop [no ci]
* pre-rebase
* Update tests/test-backend-ops.cpp Co-authored-by: Sigbjørn Skjæret <[email protected]>
* Update tests/test-backend-ops.cpp Co-authored-by: Sigbjørn Skjæret <[email protected]>
* Refactor tests
* ggml : cleanup
* cont : fix ggml_fill srcs
* tests : add note
* ggml : add ggml_fill_inplace
* ggml : add asserts
* ggml : fix ggml_fill constant cast
* cont : ggml_tri minor
* Use TENSOR_LOCALS
* Fix regression from #14596, regenerate
* Don't make commits at night...

Co-authored-by: Georgi Gerganov <[email protected]>
Co-authored-by: Diego Devesa <[email protected]>
Co-authored-by: Aman Gupta <[email protected]>
Co-authored-by: Sigbjørn Skjæret <[email protected]>
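Hedged scalar references for three of the new ops (SOFTPLUS, EXPM1, CUMSUM) are sketched below; the names and signatures are illustrative, not the ggml API.

```cpp
#include <cmath>
#include <cstddef>

// softplus(x) = log(1 + exp(x)), written in the numerically stable form
// max(x, 0) + log1p(exp(-|x|)) to avoid overflow for large x.
static inline float softplus_ref(float x) {
    return std::fmax(x, 0.0f) + std::log1p(std::exp(-std::fabs(x)));
}

// expm1(x) = exp(x) - 1, accurate for small x.
static inline float expm1_ref(float x) {
    return std::expm1(x);
}

// cumsum: prefix sum along a row, dst[i] = src[0] + ... + src[i].
static void cumsum_ref(const float * src, float * dst, std::size_t n) {
    float acc = 0.0f;
    for (std::size_t i = 0; i < n; ++i) {
        acc += src[i];
        dst[i] = acc;
    }
}
```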
* ggml-cpu: handle 3d tensors in repack mul_mat
* Removed unnecessary branch, removed need for <algorithm>
* Fixed dst_ptr pointer in chunk + clang_format
* GGML_ASSERT to check wdata within bounds
* Accidental ggml.h inclusion
* Improved GGML_ASSERT on wdata boundaries
* Address performance regression in Qwen and llama.cpp due to chunking
Signed-off-by: Wang Yang <[email protected]>
* metal : refactor argsort
* cont : sort chunks
* cont : merge sorted buckets
* cont : cleanup
…nstruction (#17048)
* fix : Dangling pointer for non-empty trigger words in llama_sampler_init_grammar_impl (#17047)
* Replace 'static' workaround with keeping the variable in scope for longer
* Create std::array directly and pass into llama_grammar_init_impl
* Add back the trigger pattern
* Missed array include
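The lifetime rule behind this fix, sketched with illustrative names: the container that owns the C strings must stay in scope for the entire call, rather than being hidden behind a 'static' local.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Stand-in for the callee (the real code passes trigger words into
// llama_grammar_init_impl); the body is a stub for illustration.
static void init_with_trigger_words(const char ** words, std::size_t n_words) {
    (void) words; (void) n_words;
}

// The owning strings (trigger_words) and the pointer view (c_words) both stay
// alive until after the call returns, so the callee never sees dangling pointers.
// Marking a local 'static' only hid the original bug.
static void call_site(const std::vector<std::string> & trigger_words) {
    std::vector<const char *> c_words;
    c_words.reserve(trigger_words.size());
    for (const auto & w : trigger_words) {
        c_words.push_back(w.c_str());
    }
    init_with_trigger_words(c_words.data(), c_words.size());
}
```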
* Add AFMOE model support
* Update to vocab
* Add model sizing
* Undo Rope change for ARCEE model
* Address review comments
* Update modeling code is_sliding -> use_rope, replace hard-coded logic
* Fix AFMOE tokenizer
* Update convert_hf_to_gguf.py Co-authored-by: Sigbjørn Skjæret <[email protected]>
* Update convert_hf_to_gguf.py Co-authored-by: Sigbjørn Skjæret <[email protected]>
* Update AFMoE tokenizer class identification to be more unique

Co-authored-by: Sigbjørn Skjæret <[email protected]>
…17158)
* vulkan: change graph_compute to be async and enable get_tensor_async
  This allows some additional CPU/GPU overlap for large pp workloads. Also seems to help a bit for token gen, maybe getting rid of a small bubble between graph_compute and get_tensor. Async set and copy functions seem to be very rarely used, so I didn't enable them because I didn't have a good way to test them. The async commands need to be ordered against each other, so put them all on the compute queue. The non-async commands still use the transfer queue. The fence for graph_compute/get_tensor_async is submitted and waited on in ggml_vk_synchronize.
* fix thread safety errors
* teardown context cleanly
* Handle async read to non-pinned dst
* vulkan: Use ggml_vk_tensor_subbuffer in mul_mat_vec(id) paths
* set allow_misalign
* docs: update Vulkan ops
* vulkan: add NEG op
* vulkan: add ABS op

Signed-off-by: Giuseppe Scrivano <[email protected]>
Labels
android
Apple Metal
Ascend NPU
build
devops
documentation
examples
ggml
IBM zDNN
model
nix
Nvidia GPU
OpenCL
python
script
server
SYCL
testing
Vulkan