forked from ggml-org/llama.cpp
Sync master with upstream release b5350 #84
Merged

Conversation

Updates dev branch with latest release (b5350) from ggml-org/llama.cpp
Signed-off-by: xiaofei <[email protected]>
* docker : do not build tests
* include "ggml-cpu.h"
* arg : allow using -hf offline
* add more comments in code [no ci]
z17 compilation requires GCC 15.1.0 and onwards.

Signed-off-by: Aaron Teo <[email protected]>
Build fails with a compilation error on PowerPC; this patch fixes it. Tested with unit tests run via cmake --build <build_dir> && cd <build_dir> && make test

Signed-off-by: Shalini Salomi Bodapati <[email protected]>
* convert : improve model arch handling
* use AutoConfig
* rm trust_remote_code
* Update convert_hf_to_gguf.py
* fix self.block_count for vision
* fix NomicBertModel
* arg : -hf do not fail if url mismatch
* do not return if cannot parse metadata json
* convert ok
* load ok, missing patch merger
* ah sheet it works
* update llava/readme
* add test
* fix test
* whisper: suppress Windows compiler warnings

This commit disables compiler warnings on Windows when using MSVC. The motivation for these changes is that some compilers, for example Windows MSVC, generate warnings for these conversions, and there are quite a few of them. This makes it a little difficult to spot new warnings that may be introduced, and can also be difficult for users/embedders of ggml, where these warnings are hard to separate from their own.

* squash! whisper: suppress Windows compiler warnings

Move ggml-related warnings into ggml. This commit also fixes the indentation and adds a missing whitespace to the if statement.
This commit adds a check to make sure that the target exists before trying to add compile options to ignore warnings when using MSVC. The motivation for this is that the build is currently broken depending on the cmake options provided; with this fix it should be possible to build even if the targets are not actually available.

Refs: ggml-org/whisper.cpp#3090 (comment)
ggml-ci
* vulkan: Handle src1 batch dimension in non-contiguous mat-vec-mul shader (ggml-org#13191)
* vulkan: Add bfloat16 support

This adds bfloat16 matrix multiply support based on VK_KHR_shader_bfloat16. The extension is required for coopmat multiply support, but matrix-vector multiply trivially promotes bf16 to fp32 and doesn't require the extension. The copy/get_rows shaders also don't require the extension. It's probably possible to fall back to non-coopmat and promote to fp32 when the extension isn't supported, but this change doesn't do that. The coopmat support also requires a glslc that supports the extension, which currently requires a custom build.

* vulkan: Support bf16 tensors without the bf16 extension or coopmat support

Compile a variant of the scalar mul_mm shader that will promote the bf16 values to float, and use that when either the bf16 extension or the coopmat extensions aren't available.

* vulkan: bfloat16 fixes (really works without bfloat16 support now)

* vulkan: fix spirv-val failure and reenable -O
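For context on why the matrix-vector path can promote bf16 "trivially": a bf16 value is just the upper half of an IEEE-754 binary32 float. A minimal host-side sketch of the promotion (plain C++, not the actual GLSL shader code):

```cpp
#include <cstdint>
#include <cstring>

// bf16 keeps the sign, the 8 exponent bits, and the top 7 mantissa bits of
// a binary32 float, so widening is a shift into the high half of a 32-bit
// word followed by a bit-cast; the dropped mantissa bits become zero.
static float bf16_to_f32(uint16_t h) {
    uint32_t bits = static_cast<uint32_t>(h) << 16;
    float f;
    std::memcpy(&f, &bits, sizeof(f));
    return f;
}
```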
* update GLM4 chat template
* Update chat template

Co-authored-by: Xuan-Son Nguyen <[email protected]>
* build : fix build info on windows
* fix cuda host compiler msg
The following scenario will cause an assertion failure in the graph allocator:
- Build and allocate a graph containing a tensor with a non-NULL data pointer
- Build and allocate a new graph where that data is NULL

Result: ggml-alloc.c:819: GGML_ASSERT(talloc->buffer_id >= 0) failed

This happens during revalidation because we think that memory should have been previously allocated based on the current graph, but in reality the previous graph was different. In this situation, we should do a full reallocation pass.
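A heavily simplified sketch of the fix's logic (the struct and function names here are illustrative stand-ins, not the actual ggml-alloc internals):

```cpp
// Illustrative stand-in for a graph tensor: `data` is non-NULL when memory
// was assigned outside the allocator.
struct tensor_view {
    void * data;
};

// If the previous graph saw this tensor with pre-assigned memory but the new
// graph presents it without, the cached allocation plan no longer matches
// reality, and the allocator must fall back to a full reallocation pass
// instead of asserting that an allocation already exists.
static bool plan_is_stale(const tensor_view & prev, const tensor_view & cur) {
    return prev.data != nullptr && cur.data == nullptr;
}
```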
* CUDA: FA support for Deepseek (Ampere or newer)
* do loop unrolling via C++ template
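The second bullet refers to the standard trick of making the trip count a template parameter so the compiler can fully unroll the loop; a hedged host-side sketch (illustrative names, not the actual CUDA kernel):

```cpp
// With N known at compile time, the loop has a constant trip count, so the
// compiler (or a CUDA-style `#pragma unroll`) can unroll it completely.
template <int N>
static float dot_unrolled(const float * a, const float * b) {
    float acc = 0.0f;
    for (int i = 0; i < N; ++i) {
        acc += a[i] * b[i];
    }
    return acc;
}

// Usage: a compile-time size selects the instantiation, e.g. dot_unrolled<64>(q, k).
```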
* sycl : Implemented reorder Q4_0 mmvq (ggml-org#12858)
* sycl : Fixed mmvq being called when reorder is disabled
* sycl : Improved comments in the quants header
* Use static_assert
* safe_div -> ceil_div
* Clarify qi comment
* change the reorder tensor from init to execute OP
* dbg
* Undo changes to test-backend-ops
* Refactor changes on top of q4_0 reorder fix
* Missing Reverts
* Refactored opt_for_reorder logic to simplify code path
* Explicit inlining and unroll
* Renamed mul_mat_algo enum for consistency

Signed-off-by: Alberto Cabrera <[email protected]>
Co-authored-by: romain.biessy <[email protected]>
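The safe_div -> ceil_div rename refers to integer division rounded up, the usual way to compute how many fixed-size blocks cover a given element count; the conventional definition is:

```cpp
// Integer ceiling division; assumes a >= 0 and b > 0.
static constexpr int ceil_div(int a, int b) {
    return (a + b - 1) / b;
}
// e.g. ceil_div(10, 4) == 3: ten elements need three blocks of four.
```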
* server : (experimental) vision support via libmtmd
* mtmd : add more api around mtmd_image_tokens
* mtmd : ability to calc image hash
* shared_ptr for mtmd_image_tokens
* move hash to user-defined ID (fixed)
* abstract out the batch management
* small fix
* refactor logic adding tokens to batch
* implement hashing image
* use FNV hash, now hash bitmap instead of file data
* allow decoding image embedding to be split into batches
* rm whitespace
* disable some features when mtmd is on
* fix --no-mmproj-offload
* mtmd_context_params no timings
* refactor server_inp to server_tokens
* fix the failing test case
* init
* wip
* working version
* add mtmd::bitmaps
* add test target
* rm redundant define
* test: mtmd_input_chunks_free
* rm outdated comment
* fix merging issue
* explicitly create mtmd::input_chunks
* mtmd_input_chunk_copy
* add clone()
* improve server_input struct
* clip : fix confused naming ffn_up and ffn_down
* rm ffn_i/o/g naming
* rename n_embd, n_ff
* small fix
* no check n_ff
* fix detokenize
* add const to various places
* add warning about breaking changes
* add c api
* helper: use mtmd_image_tokens_get_n_pos
* fix ctx_shift
* fix name shadowing
* more strict condition
* support remote image_url
* remote image_url log
* add CI test
* do not log base64
* add "has_multimodal" to /props
* remove dangling image
* speculative: use slot.cache_tokens.insert
* Apply suggestions from code review
* rm can_be_detokenized
* on prompt processing done, assert cache_tokens.size
* handle_completions_impl returns void
* adapt the new web ui
* update docs and hot topics
* rm assert
* small fix (2)

Co-authored-by: Georgi Gerganov <[email protected]>
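The "use FNV hash" bullet refers to the Fowler-Noll-Vo hash family; a minimal 64-bit FNV-1a sketch over raw bitmap bytes (illustrative, not the exact server code):

```cpp
#include <cstddef>
#include <cstdint>

// FNV-1a: XOR each byte into the state, then multiply by the FNV prime.
// The constants are the standard 64-bit offset basis and prime.
static uint64_t fnv1a_64(const uint8_t * data, size_t n) {
    uint64_t h = 0xcbf29ce484222325ULL;
    for (size_t i = 0; i < n; ++i) {
        h ^= data[i];
        h *= 0x100000001b3ULL;
    }
    return h;
}
```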
* vulkan: scalar flash attention implementation
* vulkan: always use fp32 for scalar flash attention
* vulkan: use vector loads in scalar flash attention shader
* vulkan: remove PV matrix, helps with register usage
* vulkan: reduce register usage in scalar FA, but perf may be slightly worse
* vulkan: load each Q value once. optimize O reduction. more tuning
* vulkan: support q4_0/q8_0 KV in scalar FA
* CI: increase timeout to accommodate newly-supported tests
* vulkan: for scalar FA, select between 1 and 8 rows
* vulkan: avoid using Float16 capability in scalar FA
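For readers unfamiliar with what a scalar flash-attention shader computes per query row, here is a hedged host-side sketch of the online-softmax recurrence it is built around (plain C++, not the Vulkan shader; real shaders tile K/V and vectorize the loads):

```cpp
#include <cmath>
#include <vector>

// Online softmax: track the running max m, normalizer l, and output
// accumulator o, rescaling the old terms whenever a new maximum appears.
std::vector<float> attn_row(const std::vector<float> & q,
                            const std::vector<std::vector<float>> & K,
                            const std::vector<std::vector<float>> & V,
                            float scale) {
    std::vector<float> o(V[0].size(), 0.0f);
    float m = -INFINITY, l = 0.0f;
    for (size_t j = 0; j < K.size(); ++j) {
        float s = 0.0f;
        for (size_t t = 0; t < q.size(); ++t) s += q[t] * K[j][t];
        s *= scale;
        const float m_new = std::fmax(m, s);
        const float c = std::exp(m - m_new);  // rescales previous terms
        const float p = std::exp(s - m_new);  // weight of the new key
        for (size_t t = 0; t < o.size(); ++t) o[t] = o[t] * c + p * V[j][t];
        l = l * c + p;
        m = m_new;
    }
    for (float & x : o) x /= l;  // final normalization
    return o;
}
```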
* arg : add env var to control mmproj
* small note about -hf --mmproj
* convert : internvl support
* InternVL3-1B working
* fix regression
* rm mobilevlm from test
* fix conversion
* add test for internvl
* add to list of pre-quant
* restore boi/eoi check
* add clarifying comment for norm eps
before cleanup: 20G
after cleanup: 44G
after all built and pushed: 24G

https://github.com/Thammachart/llama.cpp/actions/runs/14945093573/job/41987371245
* mtmd : add hard limit on image resolution for qwen2vl / qwen2.5vl (ggml-org#13434)
* fix typo
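A sketch of how such a hard cap is typically enforced (the actual limit and rounding in mtmd may differ): scale both sides by sqrt(max_pixels / (w*h)), so the aspect ratio is preserved while the total area drops under the cap.

```cpp
#include <algorithm>
#include <cmath>

// Clamp (w, h) so that w*h <= max_pixels, preserving aspect ratio.
static void clamp_resolution(int & w, int & h, long long max_pixels) {
    const long long px = 1LL * w * h;
    if (px <= max_pixels) return;
    const double s = std::sqrt(double(max_pixels) / double(px));
    w = std::max(1, int(w * s));
    h = std::max(1, int(h * s));
}
```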
* mtmd : move helpers to dedicated file
* fix windows build
* rm redundant include
* Support InternVL 3 38B and 78B mmproj
* Swap norms in clip.cpp
* Group variables together
* add constructor to initialize server_context::batch, preventing destructor's call to llama_batch_free from causing an invalid free()
* Update tools/server/server.cpp
* use C++11 initializer syntax
* switch from copy-list-initialization to direct-list-initialization

Co-authored-by: Xuan-Son Nguyen <[email protected]>
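A hypothetical reduction of the bug being fixed (llama_batch_like stands in for the real llama_batch; names are illustrative): without an initializer, the member holds indeterminate pointers, and a destructor that frees them performs an invalid free().

```cpp
#include <cstdlib>

struct llama_batch_like {
    int   n_tokens;
    int * token;  // heap-allocated once the batch is actually initialized
};

struct server_context_like {
    llama_batch_like batch {};  // direct-list-initialization zeroes the members

    ~server_context_like() {
        std::free(batch.token);  // safe: free(nullptr) is a no-op
    }
};
```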
Minh141120 approved these changes on May 12, 2025.
LGTM!