forked from ggml-org/llama.cpp
merge from upstream #71
Closed
Conversation
* mtmd : refactor llava-uhd preprocessing logic
* fix editorconfig
Signed-off-by: Aaron Teo <[email protected]>
* docs: add s390x-specific build docs
* docs: add s390x model conversion steps
* docs: s390x build indent
* docs: update hyperlinks for s390x docs
* docs: update llama.h docs
* docs: s390x add accelerator and perf optimizations
* docs: s390x indent blocks
* docs: revert block indentation
* docs: add support information for s390x
* docs: s390x reword
* docs: remove indentation for accelerator section s390x
* docs: remove redundant words s390x
* docs: reword for s390x
* docs: s390x reword simd
* docs: fix trailing whitespace for s390x

Signed-off-by: Aaron Teo <[email protected]>
* metal : add mean kernel ggml-ci
* cont : dedup implementation ggml-ci
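For context, a rough caller-side sketch of what the new Metal mean kernel accelerates: the `ggml_mean` op averages along the first dimension, and the snippet below builds and runs it on the CPU backend (the Metal path executes the same graph on the GPU). The `ggml-cpu.h` include and `ggml_graph_compute_with_ctx` entry point are assumptions about recent ggml versions and may differ between releases.

```cpp
// Minimal sketch: build a graph containing ggml_mean and compute it on the CPU.
// Backend selection (e.g. dispatching to the Metal kernel) is omitted here.
#include "ggml.h"
#include "ggml-cpu.h"   // compute entry point in recent ggml versions (assumption)
#include <cstdio>

int main() {
    ggml_init_params params = { /*.mem_size =*/ 16*1024*1024, /*.mem_buffer =*/ nullptr, /*.no_alloc =*/ false };
    ggml_context * ctx = ggml_init(params);

    ggml_tensor * x = ggml_new_tensor_1d(ctx, GGML_TYPE_F32, 4);
    float * xd = (float *) x->data;
    xd[0] = 1.0f; xd[1] = 2.0f; xd[2] = 3.0f; xd[3] = 6.0f;

    ggml_tensor * m = ggml_mean(ctx, x);            // mean over the row -> single value

    ggml_cgraph * gf = ggml_new_graph(ctx);
    ggml_build_forward_expand(gf, m);
    ggml_graph_compute_with_ctx(ctx, gf, /*n_threads=*/1);

    std::printf("mean = %f\n", ((float *) m->data)[0]); // expect 3.0
    ggml_free(ctx);
    return 0;
}
```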
* feat: Add llama_model_is_hybrid API call

  Also, split llama_model_is_recurrent into llm_arch_is_recurrent in llama-arch, with llama_model_is_recurrent delegating to llm_arch_is_recurrent. The same split is done for hybrid. This is needed because there are places where the llama_model has not yet been initialized but we need to check if the model is recurrent (specifically for the per-layer recurrent check array in hparams).

* feat: Add c++ side constants for attention layer indices hparam
* feat: Add support for distinguishing recurrent vs non-recurrent layers in hparams
* feat: Auto-fill hparams.recurrent_layer_arr based on whether the model is recurrent
* refactor: rename *_is_hybrid -> *_is_hybrid_recurrent

  The implementation of the hybrid cache intentionally does not specify the types of the child caches, so there was a naming mismatch with these predicate functions that used "hybrid" to imply "hybrid recurrent."

* feat: Add layer filter to recurrent cache
* fix: Use per-layer sizing everywhere in kv caches
* feat: First pass at llama_kv_cache_hybrid_recurrent

  This follows the pattern in iswa where the two child caches are held explicitly, to support the case where a model requires a single attention cache and a single recurrent cache and each layer uses exactly one of the caches. This is a rewrite of the more generic approach in the original hybrid cache PR: ggml-org#13276

* feat: Construct hybrid recurrent cache for hybrid recurrent models

  This includes a refactor of the create_memory logic to avoid needing to use the arch enum explicitly unless a model needs explicit cache instantiation logic beyond the standard logic for recurrent, hybrid, unified, and iswa.

* fix: Fix wrong bool condition for split equal in hybrid cache
* fix: Fix shift logic to defer to unified cache
* feat: Support hybrid recurrent in llama-graph

  NOTE: I intentionally did not add support for s_mask since it will be going away soon.

* fix: Fix logic for initializing inputs and attn layers for hybrid caches
* fix: Update recurrent cache for changes to remove intermediate kv_cache interface
* fix: Fix status for init_update sig for recurrent cache state
* fix: Add missing padding to n_ctx for hybrid cache construction
* fix: Update clear signature for data argument after rebase
* fix: Remove errant virtual destructor leftover from previous impl attempt
* fix: Use per-layer n_embd_k/v_s calls for mamba (1) layers
* refactor: Remove n_embd_k/v_s from unified cache

  No longer needed now that unified isn't also supporting recurrent. ggml-org#13979 (comment)

* refactor: Remove layer index from n_embd_k/v_s

  Now that it's not used at all in the unified cache, we don't need to use the layer index to zero it out for attention layers.

* refactor: Remove n_embd_k/v_gqa from recurrent cache

  This is no longer needed now that there are separate implementations. ggml-org#13979 (comment)

* feat: Allow custom layer filters for hybrid recurrent

  This should help support architectures like Falcon H1 where there is overlap between layers that need attention and recurrent caches. ggml-org#13979 (comment) A small illustrative sketch of this per-layer routing follows after this list.

* fix: Remove logits_all after rebase
* fix: Remove llama_model_is_hybrid_recurrent public API (ggml-org#13979 (comment))
* refactor: Use llama_memory_state_ptr for child states in hybrid memory state
* feat: Overhaul build_recurrent_state / build_inp_s_copy to match attention pattern (https://github.com/ggml-org/llama.cpp/pull/13979/files#r2141701738)

  This is a big overhaul to bring consistency between how inputs and per-layer components are created for attention layers and recurrent layers. The main changes are:

  - Rename class llm_graph_input_s_copy -> llm_graph_input_rs
  - Add a corresponding llm_graph_input_rs_hybrid_recurrent
  - Rename build_inp_s_copy -> build_rs_inp_recurrent
  - Add a corresponding build_rs_inp_hybrid_recurrent
  - Rename build_recurrent_state -> build_rs to match build_attn, with llm_graph_input_rs as the first input
  - Add a corresponding overload of build_rs with llm_graph_input_rs_hybrid_recurrent as the first input
  - Add a llm_graph_input_attn_kv_hybrid_recurrent analogous to llm_graph_input_attn_kv_unified
  - Add a build_attn override that takes llm_graph_input_attn_kv_hybrid_recurrent as the first input

  This makes the two paradigms fully consistent. The main drawback is the code duplication in the build_attn and build_rs implementations, where the only difference between implementations is how they cast the memory state.

* fix: Fix resize vs reserve and skip null tensors in size computation (https://github.com/ggml-org/llama.cpp/pull/13979/files#r2149469788) Co-Authored-By: @younesbelkada
* fix: Fix initialization of child states

  Since initially writing this PR, the logic in the child state types changed such that using the "init full" signature and keeping the ubatches on the parent struct no longer worked.

* refactor: Use a common build_recurrent_state method that is cache-agnostic

  This reduces the code duplication between the different build_rs impls and also retains a similar signature to the previous build_recurrent_state method, while standardizing on the input-dispatched build_rs implementation.

* recurrent : rework graph inputs + add TODOs ggml-ci
* refactor: Make status and child states const in hybrid and iswa
* refactor: Rename llama_kv_cache_[recurrent|hybrid_recurrent] to remove kv cache

  This removes the notion of "kv" from the interface names for these memory types. There are still many references to kv in the implementation of the recurrent memory which will need further adjustment.

* refactor!: Rename all k/v related values for recurrent/hybrid to r/s

  Anywhere that "kv_<state|cell|size|etc>" is used, I've used the more generic "mem_" prefix. The specifics of "k" (key) translate to "r" (recurrent state) and "v" (value) translate to "s" (state-space embedding states).

* refactor: _recurrent -> _recr for brevity

  It just _happens_ to have the same number of letters as _attn!

* style: Fix spacing for ref
* refactor: recurrent_layer() -> is_recurrent()
* style: Fix spacing for size_s_bytes declaration

Signed-off-by: Gabe Goodhart <[email protected]>
Co-authored-by: Georgi Gerganov <[email protected]>
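To make the per-layer routing described above concrete, here is a small illustrative sketch (hypothetical names, not the actual llama.cpp classes): a hybrid-recurrent cache holds one attention child and one recurrent child, and hands each child a layer filter so it only allocates the layers it owns.

```cpp
// Illustrative sketch only (hypothetical names, not the actual llama.cpp API).
#include <cstdint>
#include <functional>
#include <utility>
#include <vector>

using layer_filter_t = std::function<bool(int32_t il)>;  // true -> layer belongs to this child cache

struct hybrid_cache_sketch {
    std::vector<bool> recurrent_layer_arr;  // analogous to hparams.recurrent_layer_arr

    explicit hybrid_cache_sketch(std::vector<bool> recurrent_layers)
        : recurrent_layer_arr(std::move(recurrent_layers)) {}

    // Filters handed to the two child caches so each only sizes/allocates its own layers.
    layer_filter_t attn_filter() const {
        return [this](int32_t il) { return !recurrent_layer_arr[il]; };
    }
    layer_filter_t recr_filter() const {
        return [this](int32_t il) { return  recurrent_layer_arr[il]; };
    }
};

int main() {
    // e.g. a 4-layer model where layers 1 and 3 are recurrent (Mamba-style)
    hybrid_cache_sketch cache({false, true, false, true});
    bool layer2_uses_attn = cache.attn_filter()(2);  // true
    return layer2_uses_attn ? 0 : 1;
}
```

Custom filters of this shape would also cover overlapping cases such as Falcon H1, where a layer may need both caches.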
…fallback to CPU buffer (ggml-org#14249)
Add no_warmup parameter to cmd_params struct and command-line parsing to allow users to skip warmup runs before benchmarking.

- Add no_warmup boolean field to cmd_params struct
- Add --no-warmup command-line argument parsing
- Add help text documentation for the new flag
- Wrap existing warmup logic in conditional check
- Maintain full backward compatibility (warmup enabled by default)

Addresses ggml-org#14224
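A minimal sketch of the wiring described above (simplified and not the exact llama-bench code): a `no_warmup` field, a `--no-warmup` flag, and the warmup block made conditional.

```cpp
// Sketch of the pattern: flag defaults to false, so warmup stays enabled by default.
#include <cstdio>
#include <cstring>

struct cmd_params_sketch {
    bool no_warmup = false;   // warmup enabled unless --no-warmup is passed
};

static cmd_params_sketch parse_args(int argc, char ** argv) {
    cmd_params_sketch p;
    for (int i = 1; i < argc; i++) {
        if (std::strcmp(argv[i], "--no-warmup") == 0) {
            p.no_warmup = true;
        }
    }
    return p;
}

int main(int argc, char ** argv) {
    const cmd_params_sketch params = parse_args(argc, argv);
    if (!params.no_warmup) {
        // existing warmup run would go here
        std::printf("running warmup pass\n");
    }
    std::printf("running timed benchmark passes\n");
    return 0;
}
```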
Addresses unused reorder path
* Change _contains_any() substrs to std::string_view and fix the find comparison logic.
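An illustrative sketch of the change described above (not the actual implementation): a contains-any helper that takes `std::string_view` substrings and compares the result of `find()` against `npos`.

```cpp
// Sketch only: check whether any of the given substrings occurs in the haystack.
#include <string>
#include <string_view>
#include <vector>

static bool contains_any(const std::string & haystack, const std::vector<std::string_view> & substrs) {
    for (std::string_view s : substrs) {
        // find() returns npos when the substring is absent; that is the comparison being fixed
        if (haystack.find(s) != std::string::npos) {
            return true;
        }
    }
    return false;
}

// usage sketch: contains_any(device_name, {"Adreno", "Mali"})
```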
…3782) Co-authored-by: aa956 <[email protected]>
* Make sentencepiece optional
* Bump to 0.18.0
* Bump patch instead of minor

Co-authored-by: compilade <[email protected]>
Support for Arm runtime feature detection has now been added to GGML_CPU_ALL_VARIANTS, replacing the old, largely non-functional code.
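For background, a hedged Linux/aarch64 sketch of runtime feature detection via `getauxval`; the feature bits and scoring below are illustrative assumptions, not ggml's actual variant-selection code.

```cpp
// Linux/aarch64 sketch (assumption: not ggml's actual scoring code): read hwcaps at
// load time so a multi-variant CPU backend can score and pick the best build.
#include <sys/auxv.h>    // getauxval
#include <asm/hwcap.h>   // HWCAP_* bits (aarch64 Linux)
#include <cstdio>

struct cpu_features {
    bool dotprod = false;
    bool sve     = false;
};

static cpu_features detect_features() {
    cpu_features f;
    const unsigned long hwcap = getauxval(AT_HWCAP);
#ifdef HWCAP_ASIMDDP
    f.dotprod = (hwcap & HWCAP_ASIMDDP) != 0;
#endif
#ifdef HWCAP_SVE
    f.sve = (hwcap & HWCAP_SVE) != 0;
#endif
    return f;
}

int main() {
    const cpu_features f = detect_features();
    // a higher score would select a more specialized backend variant at load time
    const int score = (f.dotprod ? 1 : 0) + (f.sve ? 2 : 0);
    std::printf("dotprod=%d sve=%d score=%d\n", f.dotprod, f.sve, score);
    return 0;
}
```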
* CUDA: add conv_2d_dw
* better naming
* simplify using template
* Review: fix operation ordering in ggml-cuda, use __forceinline__, use more const
* model : more uniform output id handling ggml-ci
* cont : revert n_outputs < n_tokens optimization ggml-ci
* cont : fix out_ids initialization ggml-ci
…org#14288) Works around an issue that may cause CUDA graph capture to fail when a cuBLAS handle is destroyed in a different thread
* Add PowerPC feature detection and scoring
* ggml-cpu: Implement GGML_CPU_ALL_VARIANTS for PowerPC
* ggml-cpu: Delay some initializations until function is called

  When using GGML_BACKEND_DL=ON, these initializations might use instructions that are not supported by the current CPU.

Co-authored-by: Diego Devesa <[email protected]>
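A generic sketch of the "delay initialization until the function is called" pattern mentioned above (not the actual PowerPC backend code): work is moved from a namespace-scope initializer, which would run when the shared backend is loaded, into a function-local static that runs only on first use, after the loader has picked a supported variant.

```cpp
// Generic deferred-initialization sketch. A namespace-scope initializer would run at
// dlopen() time and could execute unsupported ISA-specific instructions; the
// function-local static below is only initialized when the variant is actually used.
#include <array>

static std::array<float, 256> build_lookup_table() {
    std::array<float, 256> t{};
    for (int i = 0; i < 256; i++) {
        // imagine this loop being compiled with ISA-specific vector instructions
        t[i] = static_cast<float>(i) * 0.5f;
    }
    return t;
}

static const std::array<float, 256> & lookup_table() {
    // initialized on first call, after the backend variant has been chosen
    static const std::array<float, 256> table = build_lookup_table();
    return table;
}

float dequantize_sketch(unsigned char v) {
    return lookup_table()[v];
}
```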
* Add header and namespace to use enqueue_functions extension
* Convert submit and parallel_for to use new extension in convert.cpp
* Convert submit and parallel_for to use extension in ggml-sycl.cpp
* Convert submit and parallel_for to use extension in gla.cpp
* Convert submit and parallel_for in mmq.cpp
* Convert submit and parallel_for in mmvq.cpp
* Convert submit and parallel_for in remaining files
* Convert all simple parallel_for to nd_launch from enqueue_functions extension
* Wrapping extension in general function

  Create a general function that uses the enqueue_functions extension if it is enabled in the compiler, otherwise calls the general SYCL function to launch kernels.

Signed-off-by: nscipione <[email protected]>
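A hedged sketch of the wrapper idea in the last bullet (the names here are illustrative, not the ones added to ggml-sycl): use the enqueue_functions extension when the compiler advertises it, otherwise fall back to plain `queue::parallel_for`. The `SYCL_EXT_ONEAPI_ENQUEUE_FUNCTIONS` macro and the `sycl::ext::oneapi::experimental::nd_launch` signature are assumptions about the DPC++ extension and may vary by compiler version.

```cpp
// Sketch: dispatch to the enqueue_functions extension when available (assumption),
// otherwise use the standard SYCL launch path.
#include <sycl/sycl.hpp>
#include <utility>

template <typename Kernel>
void launch_nd(sycl::queue & q, sycl::nd_range<1> range, Kernel && k) {
#ifdef SYCL_EXT_ONEAPI_ENQUEUE_FUNCTIONS
    namespace syclex = sycl::ext::oneapi::experimental;
    syclex::nd_launch(q, range, std::forward<Kernel>(k));   // extension path
#else
    q.parallel_for(range, std::forward<Kernel>(k));          // standard SYCL path
#endif
}

// usage sketch:
// launch_nd(queue, sycl::nd_range<1>{global, local},
//           [=](sycl::nd_item<1> it) { /* kernel body */ });
```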
* vocab : prevent stack overflow in tokenize
* vocab : return error instead of aborting on oversized token count
* vocab : INT32_MIN from llama_tokenize on overflow
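A caller-side sketch of handling the new error return (the `llama_tokenize` signature shown here matches recent llama.h but may differ between versions): a negative return means the buffer was too small, with the magnitude being the required count, and an oversized token count now reports `INT32_MIN` instead of aborting.

```cpp
// Sketch: two-pass tokenization that checks for the INT32_MIN overflow sentinel.
#include "llama.h"
#include <cstdint>
#include <cstring>
#include <vector>

static std::vector<llama_token> tokenize_checked(const llama_vocab * vocab, const char * text) {
    const int32_t text_len = (int32_t) std::strlen(text);

    // first pass: ask for the required size (negative return = -required count)
    int32_t n = llama_tokenize(vocab, text, text_len, nullptr, 0,
                               /*add_special=*/true, /*parse_special=*/false);
    if (n == INT32_MIN) {
        return {};              // token count would overflow int32_t
    }

    std::vector<llama_token> tokens(-n);
    n = llama_tokenize(vocab, text, text_len, tokens.data(), (int32_t) tokens.size(),
                       /*add_special=*/true, /*parse_special=*/false);
    if (n < 0) {
        return {};              // unexpected failure
    }
    tokens.resize(n);
    return tokens;
}
```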
* CUDA: add conv_2d_transpose
* remove direct include of cuda_fp16
* Review: add brackets for readability, remove ggml_set_param and add asserts
…enchmark works fine on Linux
…ween QNN-CPU,QNN-GPU,QNN-NPU,cDSP,ggml
…ression issue in upstream
…e performance between QNN-CPU,QNN-GPU,QNN-NPU,cDSP,ggml
…regression issue in upstream