Sync master with upstream release b6293 #219

jan-service-account · 2025-08-27T00:12:01Z

Updates dev branch with latest release (b6293) from ggml-org/llama.cpp


…ml-org#15557) * model-conversion: add model card template for embeddings [no ci] This commit adds a separate model card template (model repository README.md template) for embedding models. The motivation for this is that the server command for an embedding model is a little different, and some additional information that is not directly relevant for causal models can be useful in the model card for embedding models. * squash! model-conversion: add model card template for embeddings [no ci] Fix pyright lint error. * remove --pooling override and clarify embd_normalize usage

…5564) This commit explicitly sets the pooling type to 'none' in logits.cpp to support models that have a pooling type specified. The motivation for this is that some models may have a pooling type set in the model file (.gguf file), and for this specific case, where we only want to extract logits, we need to ensure that no pooling is used so that we are comparing raw logits and not pooled embeddings.


* CUDA: optimize get_int_from_table_16 * CUDA: use v_perm_b32 to replace byte_perm on AMD GPUs * revise documentation --------- Co-authored-by: xix <[email protected]> Co-authored-by: Johannes Gäßler <[email protected]>


* convert : fix tensor naming conflict for llama 4 vision * convert ok * support kimi vision model * clean up * fix style * fix calc number of output tokens * refactor resize_position_embeddings * add test case * rename build fn * correct a small bug


This commit adds two targets to the Makefile for quantizing Quantization Aware Trained (QAT) models to Q4_0 format. The motivation for this is that these targets set the token embedding and output tensor data types to Q8_0 instead of the default Q6_K. This is something that we wish to enforce for QAT Q4_0 models that are to be uploaded to ggml-org on Hugging Face to guarantee the best quality.


This patch improves GEMM for the FP32 data type on PowerPC.

It implements GEMM on large blocks with configurable block sizes mc, nc, kc (default: 256, 256, 256). The packing function is optimized to access blocks according to their memory layout, GEMM is optimized to work on larger blocks, and packing is isolated from the GEMM operations for better MMA utilization.

Functionality and correctness were verified using llama-cli and a standalone test case (performs a matmul and compares the final matrix C result with the base).

Minor code refactoring changes:

* Replace macro with inline function
* Code indent made consistent with 4 spaces

Performance testing: observed a 50% ~ 70% improvement in prompt processing speed, measured using llama-bench with the Meta-Llama3-8B FP32 model. Similar gains were observed with the Mistral-7b-Instruct-v0.3 model.

| Model | Size | Params | Backend | Threads | Test | Patch | Base |
| --- | --- | --- | --- | --- | --- | --- | --- |
| llama 8B all F32 | 29.92 GiB | 8.03 B | CPU | 20 | pp512 | 98.58 | 60.3 |
| llama 8B all F32 | 29.92 GiB | 8.03 B | CPU | 20 | pp1024 | 95.88 | 57.36 |
| llama 8B all F32 | 29.92 GiB | 8.03 B | CPU | 20 | pp2048 | 85.46 | 53.26 |
| llama 8B all F32 | 29.92 GiB | 8.03 B | CPU | 20 | pp4096 | 68.66 | 45.78 |
| llama 8B all F32 | 29.92 GiB | 8.03 B | CPU | 20 | pp6144 | 57.35 | 40.44 |

Observed a 25 ~ 30% improvement in prompt processing speed in llama-batched-bench with Meta-Llama3-8B for large prompts (256, 512, 1024, 2048, 4096 tokens) with various batch sizes (1, 2, 4, 8, 16).

Signed-off-by: Shalini Salomi Bodapati <[email protected]>
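The blocking and packing idea described in this commit message can be sketched roughly as follows. This is a minimal, portable illustration and not the PowerPC MMA kernel from the patch: the function name `gemm_blocked`, the row-major layout, and the loop order are assumptions made for the example, and only the block-size names mc, nc, kc and their defaults come from the commit message.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical sketch: C (M x N) += A (M x K) * B (K x N), all row-major.
// Blocks of A and B are first copied ("packed") into contiguous buffers,
// and the multiply then runs only on the packed data.
void gemm_blocked(const float * A, const float * B, float * C,
                  std::size_t M, std::size_t N, std::size_t K,
                  std::size_t mc = 256, std::size_t nc = 256, std::size_t kc = 256) {
    std::vector<float> packA(mc * kc);
    std::vector<float> packB(kc * nc);

    for (std::size_t jc = 0; jc < N; jc += nc) {
        const std::size_t nb = std::min(nc, N - jc);
        for (std::size_t pc = 0; pc < K; pc += kc) {
            const std::size_t kb = std::min(kc, K - pc);

            // Pack the kb x nb block of B into a contiguous buffer.
            for (std::size_t p = 0; p < kb; ++p) {
                for (std::size_t j = 0; j < nb; ++j) {
                    packB[p * nb + j] = B[(pc + p) * N + (jc + j)];
                }
            }

            for (std::size_t ic = 0; ic < M; ic += mc) {
                const std::size_t mb = std::min(mc, M - ic);

                // Pack the mb x kb block of A into a contiguous buffer.
                for (std::size_t i = 0; i < mb; ++i) {
                    for (std::size_t p = 0; p < kb; ++p) {
                        packA[i * kb + p] = A[(ic + i) * K + (pc + p)];
                    }
                }

                // Multiply the packed blocks and accumulate into C.
                for (std::size_t i = 0; i < mb; ++i) {
                    for (std::size_t p = 0; p < kb; ++p) {
                        const float a = packA[i * kb + p];
                        for (std::size_t j = 0; j < nb; ++j) {
                            C[(ic + i) * N + (jc + j)] += a * packB[p * nb + j];
                        }
                    }
                }
            }
        }
    }
}
```

Keeping the packing loops separate from the multiply loops mirrors the "isolated packing from GEMM operations" point above: the inner kernel only ever reads contiguous memory, which is where a hardware-specific kernel would be substituted.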

…gml-org#15592) The original implementation unconditionally returned true for this operation, leading to a failure when the tensor's first dimension (ne[0]) was not a multiple of WARP_SIZE. This caused a GGML_ASSERT(ncols % WARP_SIZE == 0) failure in ggml-sycl/norm.cpp. This change updates the ggml_backend_sycl_device_supports_op check to return true for GGML_OP_RMS_NORM only when the first dimension of the tensor is a multiple of WARP_SIZE, ensuring the operation can be performed without error.
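The shape of the check described above might look like the sketch below. It is a standalone illustration under stated assumptions, not the code in ggml-sycl: `TensorShape`, `kWarpSize` (set to 32 here purely for the example), and `supports_rms_norm` are hypothetical stand-ins for the real tensor type, the backend's WARP_SIZE, and the relevant branch of the supports_op function.

```cpp
#include <cstdint>

// Hypothetical stand-in for the backend's WARP_SIZE; the value is assumed.
constexpr std::int64_t kWarpSize = 32;

// Hypothetical stand-in for a tensor: ne[0] is the length of the first dimension.
struct TensorShape {
    std::int64_t ne[4];
};

// Report RMS_NORM as supported only when the first dimension is a multiple
// of the warp size, so the kernel's assertion on the column count cannot fire.
inline bool supports_rms_norm(const TensorShape & t) {
    return t.ne[0] % kWarpSize == 0;
}
```

With a check of this kind, a tensor whose first dimension is 48 would be reported as unsupported, leaving the scheduler free to place the operation on another backend instead of hitting the assertion.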

noemotiovon and others added 29 commits August 27, 2025 09:49

CANN: ROPE cache sin/cos repeat (ggml-org#15501)

5f32e9e

Signed-off-by: noemotiovon <[email protected]>

convert : support interns1-mini (ggml-org#15412)

794deb5

* support interns1-mini * fix comment * update

metal : add FA kernels for HS=40 (ggml-org#15559)

5636357

ggml-ci

convert : update Ernie 4.5 dense architecture name (ggml-org#15555)

aad6558

Signed-off-by: Weizhao Ouyang <[email protected]>

batched-bench : fix unified KV cache handling + pp timing (ggml-org#1…

dd7bb1b

…5562) * batched-bench : fix unified KV cache handling + pp timing * cont : run dummy token only with split KV cache

CUDA: MoE helper in device code, better tile sizes (ggml-org#15525)

6e3f6c1

* CUDA: MoE helper in device code, better tile sizes * reduce superfluous CUDA blocks

metal: fix regression when no metal devices are present (ggml-org#15531)

83a1a80

tests: Generate unique input values for count_equal (ggml-org#15487)

7addd18

This avoids backend-dependent behavior for argmax that leads to intermittent failures.
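One way to see why unique inputs help: when a row contains ties, different backends may legitimately return different argmax indices, so a comparison against a reference result can fail intermittently. The sketch below is not the upstream test code; `make_unique_values` and `argmax_index` are hypothetical helpers that only illustrate generating distinct values so the argmax is unambiguous.

```cpp
#include <algorithm>
#include <cstddef>
#include <iterator>
#include <numeric>
#include <random>
#include <vector>

// Hypothetical helper: returns n distinct values (0, 1, ..., n-1) in random order.
std::vector<float> make_unique_values(std::size_t n, unsigned seed) {
    std::vector<float> v(n);
    std::iota(v.begin(), v.end(), 0.0f);
    std::mt19937 rng(seed);
    std::shuffle(v.begin(), v.end(), rng);
    return v;
}

// Hypothetical helper: index of the largest element. With distinct values
// there is exactly one correct answer, whatever tie-breaking rule is used.
std::size_t argmax_index(const std::vector<float> & v) {
    return static_cast<std::size_t>(
        std::distance(v.begin(), std::max_element(v.begin(), v.end())));
}
```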

vulkan: fix min subgroup 16 condition for mmid subgroup optimization (g…

363b0fc

…gml-org#15565)

opencl: fix support ops condition for rms_norm (ggml-org#15560)

f2cf572

vulkan: Remove splitting for mul_mat_id (ggml-org#15568)

2e9f3f6

row_ids only needs to hold the BN rows for the current tile.

Add a warning for special devices (ggml-org#15563)

830e2f4

* Add warning * Print the devices names * Add newlines * Apply suggestions from code review Co-authored-by: Johannes Gäßler <[email protected]> * Fix vector names --------- Co-authored-by: Johannes Gäßler <[email protected]>

metal : remove contiguous assertion for src0 in IM2COL (ggml-org#15577)

ac05165

* remove contiguous assertion for src0 in IM2COL * add contiguous check in supports_op

gguf-py : remove erroneous FFN_GATE entry (ggml-org#15583)

384c7df

model : support MiniCPM-V 4.5 (ggml-org#15575)

b2b0b30

metal : improve MUL_MAT_ID (ggml-org#15541)

454e379

* metal : mul_mm_id remove hdst * metal : remove mul_mm_id hsrc1 * metal : mul_mm_id simplify + add test * metal : opt mul_mm_id map0 * metal : optimize mul_mm_id id gathering * metal : mul/div opt * metal : optimize mul_mm_id_map0 ggml-ci

context : print graph stats for memory-less contexts (ggml-org#15586)

88e1081

ggml-ci

metal : optimize FA vec for large sequences and BS <= 8 (ggml-org#15566)

7068477

* metal : optimize FA vec for large heads and sequences * metal : adjust small-batch mul mv kernels ggml-ci * batched-bench : fix total speed computation ggml-ci * cont : add comments ggml-ci

CUDA: return -1 for nonexistent compiled arch (ggml-org#15587)

5707c65

graph : fix assert in memory-less build_attn (ggml-org#15590)

de85c30

ggml-ci

tests: add performance test for mul mat id (ggml-org#15543)

6eebd96

mtmd : fix mtmd ios build (ggml-org#15579)

524254e

Minh141120 force-pushed the update-dev-from-master-2025-08-27-00-11 branch from 8b69686 to ed349df Compare August 27, 2025 02:50

Minh141120 requested a review from qnixsynapse August 27, 2025 02:51

qnixsynapse approved these changes Aug 27, 2025


Minh141120 added this pull request to the merge queue Aug 27, 2025

Merged via the queue into dev with commit 335ad88 Aug 27, 2025
11 checks passed

Minh141120 deleted the update-dev-from-master-2025-08-27-00-11 branch August 27, 2025 03:20


22 participants
