forked from ggml-org/llama.cpp
merge from upstream #81
Merged
Conversation
…ggml-org#15524)

* vulkan: use subgroup function for mul_mat_id shader even without coopmat
* vulkan: fix compile warnings
* vulkan: properly check for subgroup size control and require full subgroups for subgroup mul_mat_id
* vulkan: disable subgroup mul_mat_id on devices with subgroups < 16
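As a rough illustration of the device gating described above, here is a minimal C++ sketch (not the actual ggml-vulkan code; the helper name and exact predicate are assumptions) of how a backend can query subgroup size control and rule out devices with subgroups smaller than 16:

```cpp
// Sketch only: gate the subgroup mul_mat_id path on subgroup size control
// with full subgroups, and skip devices whose subgroups are under 16 lanes.
#include <vulkan/vulkan.h>

static bool can_use_subgroup_mul_mat_id(VkPhysicalDevice dev) {
    VkPhysicalDeviceSubgroupSizeControlFeatures sg_feats {};
    sg_feats.sType = VK_STRUCTURE_TYPE_PHYSICAL_DEVICE_SUBGROUP_SIZE_CONTROL_FEATURES;
    VkPhysicalDeviceFeatures2 feats2 {};
    feats2.sType = VK_STRUCTURE_TYPE_PHYSICAL_DEVICE_FEATURES_2;
    feats2.pNext = &sg_feats;
    vkGetPhysicalDeviceFeatures2(dev, &feats2);

    VkPhysicalDeviceSubgroupSizeControlProperties sg_props {};
    sg_props.sType = VK_STRUCTURE_TYPE_PHYSICAL_DEVICE_SUBGROUP_SIZE_CONTROL_PROPERTIES;
    VkPhysicalDeviceProperties2 props2 {};
    props2.sType = VK_STRUCTURE_TYPE_PHYSICAL_DEVICE_PROPERTIES_2;
    props2.pNext = &sg_props;
    vkGetPhysicalDeviceProperties2(dev, &props2);

    return sg_feats.subgroupSizeControl &&    // can pin a required subgroup size
           sg_feats.computeFullSubgroups &&   // can require full subgroups in compute
           sg_props.maxSubgroupSize >= 16;    // disable on devices with subgroups < 16
}
```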
Signed-off-by: noemotiovon <[email protected]>
* support interns1-mini
* fix comment
* update
Signed-off-by: Weizhao Ouyang <[email protected]>
…5562)

* batched-bench : fix unified KV cache handling + pp timing
* cont : run dummy token only with split KV cache
…ml-org#15557)

* model-conversion: add model card template for embeddings [no ci]

This commit adds a separate model card template (model repository README.md template) for embedding models. The motivation for this is that the server command for the embedding model is a little different, and some additional information can be useful in the model card for embedding models which might not be directly relevant for causal models.

* squash! model-conversion: add model card template for embeddings [no ci]

Fix pyright lint error.

* remove --pooling override and clarify embd_normalize usage
…5564)

This commit explicitly sets the pooling type to 'none' in logits.cpp to support models that have a pooling type specified. The motivation for this is that some models may have a pooling type set in the model file (.gguf file), and for this specific case where we only want to extract logits, we need to ensure that no pooling is used so that we are comparing raw logits and not pooled embeddings.
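For reference, a minimal C++ sketch of this override using the public llama.h API (the helper name is hypothetical; the actual logits.cpp code may differ):

```cpp
// Sketch only: force pooling off so the context yields raw per-token logits,
// regardless of any pooling type stored in the .gguf metadata.
#include "llama.h"

static llama_context * make_logits_context(llama_model * model) {
    llama_context_params cparams = llama_context_default_params();
    cparams.pooling_type = LLAMA_POOLING_TYPE_NONE; // override model metadata
    return llama_init_from_model(model, cparams);
}
```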
* CUDA: MoE helper in device code, better tile sizes
* reduce superfluous CUDA blocks
This avoids backend-dependent behavior for argmax that leads to intermittent failures.
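To see why ties make argmax backend-dependent, consider this small self-contained C++ example (illustrative only, not from the repo): a sequential scan is specified to return the first maximum, while a parallel tree reduction may legally pick a different tied index.

```cpp
// Illustrative only: argmax over tied values depends on reduction order.
#include <algorithm>
#include <cstdio>
#include <vector>

int main() {
    std::vector<float> logits = {1.0f, 3.0f, 3.0f, 2.0f}; // tie at indices 1 and 2

    // std::max_element (a sequential scan) returns the FIRST maximum:
    size_t idx = std::max_element(logits.begin(), logits.end()) - logits.begin();
    std::printf("sequential argmax: %zu\n", idx); // prints 1

    // A parallel tree reduction may legally return index 2 for the same data,
    // so cross-backend tests should use inputs with a unique maximum.
    return 0;
}
```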
* CUDA: optimize get_int_from_table_16
* CUDA: use v_perm_b32 to replace byte_perm on AMD GPUs
* revise documentation

---------

Co-authored-by: xix <[email protected]>
Co-authored-by: Johannes Gäßler <[email protected]>
row_ids only needs to hold the BN rows for the current tile.
* Add warning
* Print the device names
* Add newlines
* Apply suggestions from code review

Co-authored-by: Johannes Gäßler <[email protected]>

* Fix vector names

---------

Co-authored-by: Johannes Gäßler <[email protected]>
* remove contiguous assertion for src0 in IM2COL
* add contiguous check in supports_op
* metal : mul_mm_id remove hdst
* metal : remove mul_mm_id hsrc1
* metal : mul_mm_id simplify + add test
* metal : opt mul_mm_id map0
* metal : optimize mul_mm_id id gathering
* metal : mul/div opt
* metal : optimize mul_mm_id_map0

ggml-ci
* convert : fix tensor naming conflict for llama 4 vision
* convert ok
* support kimi vision model
* clean up
* fix style
* fix calc number of output tokens
* refactor resize_position_embeddings
* add test case
* rename build fn
* correct a small bug
* metal : optimize FA vec for large heads and sequences
* metal : adjust small-batch mul mv kernels

ggml-ci

* batched-bench : fix total speed computation

ggml-ci

* cont : add comments

ggml-ci
This commit adds two targets to the Makefile for quantizing Quantization Aware Trained (QAT) models to Q4_0 format. The motivation for this is that this sets the token embedding and the output tensor data types to Q8_0 instead of the default Q6_K. This is something that we wish to enforce for QAT Q4_0 models that are to be uploaded to ggml-org on Hugging Face to guarantee the best quality.
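A minimal C++ sketch of what these targets request, expressed through the public llama.h quantization API (the wrapper function is hypothetical; the Makefile targets invoke the llama-quantize tool instead):

```cpp
// Sketch only: Q4_0 overall, with the token embedding and output tensors
// forced to Q8_0 instead of the default Q6_K.
#include "llama.h"

static int quantize_qat_q4_0(const char * fname_in, const char * fname_out) {
    llama_model_quantize_params qparams = llama_model_quantize_default_params();
    qparams.ftype                = LLAMA_FTYPE_MOSTLY_Q4_0;
    qparams.token_embedding_type = GGML_TYPE_Q8_0; // instead of default Q6_K
    qparams.output_tensor_type   = GGML_TYPE_Q8_0; // instead of default Q6_K
    return (int) llama_model_quantize(fname_in, fname_out, &qparams);
}
```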
This patch improves GEMM for the FP32 data type on PowerPC.

* Implements GEMM on large blocks with configurable block sizes mc, nc, kc (default: 256, 256, 256).
* Packing function optimized to access blocks as per memory layout.
* GEMM optimized to work on larger blocks.
* Isolated packing from GEMM operations for better MMA utilization.

Verified functionality and correctness using llama-cli and a standalone test case (performs matmul and compares the final matrix C result with the base).

Minor code refactoring changes:

* Replace macro with inline function
* Code indent made consistent with 4 spaces

Performance testing: observed 50% ~ 70% improvement in prompt processing speed measured using llama-bench with the Meta-Llama3-8B FP32 model. Similar gains observed with the Mistral-7b-Instruct-v0.3 model.

| model            | size      | params | backend | threads | test   | patch (t/s) | base (t/s) |
| ---------------- | --------: | -----: | ------- | ------: | ------ | ----------: | ---------: |
| llama 8B all F32 | 29.92 GiB | 8.03 B | CPU     | 20      | pp512  | 98.58       | 60.3       |
| llama 8B all F32 | 29.92 GiB | 8.03 B | CPU     | 20      | pp1024 | 95.88       | 57.36      |
| llama 8B all F32 | 29.92 GiB | 8.03 B | CPU     | 20      | pp2048 | 85.46       | 53.26      |
| llama 8B all F32 | 29.92 GiB | 8.03 B | CPU     | 20      | pp4096 | 68.66       | 45.78      |
| llama 8B all F32 | 29.92 GiB | 8.03 B | CPU     | 20      | pp6144 | 57.35       | 40.44      |

25 ~ 30% improvement in llama-batched-bench with Meta-Llama3-8B in prompt processing speed for large prompts (256, 512, 1024, 2048, 4096 tokens) with various batch sizes (1, 2, 4, 8, 16).

Signed-off-by: Shalini Salomi Bodapati <[email protected]>
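For orientation, here is a minimal cache-blocked SGEMM sketch in plain C++ showing the mc/nc/kc blocking structure described above (illustrative only; the actual patch packs blocks into an MMA-friendly layout and uses PowerPC MMA intrinsics in the inner kernel):

```cpp
// C (MxN) += A (MxK) * B (KxN), row-major, blocked so each sub-block stays
// cache-resident; mirrors the mc = nc = kc = 256 defaults.
enum { MC = 256, NC = 256, KC = 256 };

static void gemm_blocked(int M, int N, int K,
                         const float * A, const float * B, float * C) {
    for (int jc = 0; jc < N; jc += NC)
    for (int pc = 0; pc < K; pc += KC)
    for (int ic = 0; ic < M; ic += MC) {
        const int nb = (N - jc < NC) ? N - jc : NC;
        const int kb = (K - pc < KC) ? K - pc : KC;
        const int mb = (M - ic < MC) ? M - ic : MC;
        // The real kernel packs the A/B sub-blocks here, then runs an
        // MMA microkernel; this naive loop only shows the blocking shape.
        for (int i = 0; i < mb; ++i)
        for (int j = 0; j < nb; ++j) {
            float acc = 0.0f;
            for (int p = 0; p < kb; ++p)
                acc += A[(ic + i)*K + pc + p] * B[(pc + p)*N + jc + j];
            C[(ic + i)*N + jc + j] += acc;
        }
    }
}
```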
…gml-org#15592)

The original implementation unconditionally returned true for this operation, leading to a failure when the tensor's first dimension (ne[0]) was not a multiple of WARP_SIZE. This caused a GGML_ASSERT(ncols % WARP_SIZE == 0) failure in ggml-sycl/norm.cpp. This change updates the ggml_backend_sycl_device_supports_op check to correctly return true for GGML_OP_RMS_NORM only when the first dimension of the tensor is a multiple of WARP_SIZE, ensuring the operation can be performed without error.
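A minimal C++ sketch of the tightened check (names approximate; the real code lives in the SYCL backend's supports_op switch):

```cpp
// Sketch only: the supports_op hook must mirror the kernel's
// GGML_ASSERT(ncols % WARP_SIZE == 0) instead of returning true blindly.
#include "ggml.h"

static bool sycl_supports_rms_norm(const struct ggml_tensor * op, int warp_size) {
    // ne[0] is the row length (ncols) the RMS norm kernel reduces over
    return op->src[0]->ne[0] % warp_size == 0;
}
```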
Previously, the slope tensor was set to fp16 to improve efficiency. While this worked correctly in FA, it caused precision issues in soft_max. This change applies different data types for different operators to balance both accuracy and performance.
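A minimal C++ sketch of the idea (illustrative only; the actual CANN code selects types elsewhere):

```cpp
// Sketch only: pick the slope tensor's type per consuming operator, since
// fp16 slopes were accurate enough for flash attention but lost precision
// inside soft_max.
#include "ggml.h"

static enum ggml_type slope_type_for(enum ggml_op consumer) {
    return consumer == GGML_OP_FLASH_ATTN_EXT
        ? GGML_TYPE_F16   // fp16 is sufficient (and faster) for FA
        : GGML_TYPE_F32;  // soft_max needs full fp32 precision
}
```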
…g#15744) This seems to correspond with what we want to do, see [here](ggml-org#15715 (comment)) and [clang-format docs](https://clang.llvm.org/docs/ClangFormatStyleOptions.html#binpackarguments)
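For reference, a small C++ illustration (the function and arguments are hypothetical, only the clang-format option is real) of the two behaviors BinPackArguments selects between when a call does not fit on one line:

```cpp
// Hypothetical declaration, only to show the formatting effect.
void do_something(int a, int b, int c, int d);

void caller(int aaaaaaaaaaaaaaaaaaaaaaaa, int bbbbbbbbbbbbbbbbbbbbbbbb,
            int cccccccccccccccccccccccc, int dddddddddddddddddddddddd) {
    // BinPackArguments: true -- fill each line with as many arguments as fit:
    do_something(aaaaaaaaaaaaaaaaaaaaaaaa, bbbbbbbbbbbbbbbbbbbbbbbb,
                 cccccccccccccccccccccccc, dddddddddddddddddddddddd);

    // BinPackArguments: false -- once the call breaks, one argument per line:
    do_something(aaaaaaaaaaaaaaaaaaaaaaaa,
                 bbbbbbbbbbbbbbbbbbbbbbbb,
                 cccccccccccccccccccccccc,
                 dddddddddddddddddddddddd);
}
```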
Signed-off-by: noemotiovon <[email protected]>
CANN currently does not support kernels larger than 255. This change disables such cases.
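A minimal C++ sketch of such a guard (names are illustrative, not the actual CANN code):

```cpp
// Sketch only: reject convolution kernels whose spatial extent exceeds 255,
// since the CANN kernels cannot handle larger sizes.
#include "ggml.h"

static bool cann_kernel_size_supported(const struct ggml_tensor * kernel) {
    // ne[0] and ne[1] are the kernel's spatial dimensions in ggml's layout
    return kernel->ne[0] <= 255 && kernel->ne[1] <= 255;
}
```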
This commit adds a curl script to the model-conversion examples which is currently missing. This script is required for running the embedding server targets to test llama-server embeddings functionality.
* ggml-cpu : optimize rvv ggml_vec_dot_f32
* ggml-cpu : optimize 128-bit rvv ggml_vec_dot_q4_K_q8_K
* ggml-cpu : fix riscv arch flags
* ggml-cpu : add more rvv ops
* ggml-cpu : optimize rvv ggml_vec_dot_q4_K_q8_K
* ggml-cpu : optimize rvv ggml_vec_dot_q6_K_q8_K
* ggml-cpu : minor rvv adjustments
* ggml-cpu : fix riscv include
…org#15765)

* model-conversion : remove hardcoded /bin/bash shebangs [no ci]

This commit updates the bash scripts to use env instead of using hardcoded /bin/bash in the shebang line. The motivation for this is that some systems may have bash installed in a different location, and using /usr/bin/env bash ensures that the script will use the first bash interpreter found in the user's PATH, making the scripts more portable across different environments.

* model-conversion : rename script to .py [no ci]

This commit renames run-casual-gen-embeddings-org.sh to run-casual-gen-embeddings-org.py to reflect its Python nature.
This commit fixes the model type for the Gemma 270M model in llama_model.cpp which should be LLM_TYPE_270M. I incorrectly added this previously as LLM_TYPE_537M which was wrong. The motivation for this is that it causes the model to not be identified properly when using tools like llama-bench. For example:

```console
$ ./build/bin/llama-bench -m models/gemma-3-270m-Q8_0.gguf
| model                          |       size | ...
| ------------------------------ | ---------: | ...
| gemma3 ?B Q8_0                 | 271.81 MiB | ...
| gemma3 ?B Q8_0                 | 271.81 MiB | ...
```

With the changes in this commit the output will be:

```console
$ ./build/bin/llama-bench -m models/gemma-3-270m-Q8_0.gguf
| model                          |       size | ...
| ------------------------------ | ---------: | ...
| gemma3 270M Q8_0               | 271.81 MiB | ...
| gemma3 270M Q8_0               | 271.81 MiB | ...
```
Labels
Apple Metal
Ascend NPU
devops
documentation
examples
ggml
Nvidia GPU
OpenCL
python
script
server
SYCL
testing
Vulkan