@jan-service-account
Updates dev branch with latest release (b6337) from ggml-org/llama.cpp

jeffbolznv and others added 10 commits August 31, 2025 08:27
…#15652)

* vulkan: clamp matmul and FA results to the max finite value

* only clamp for fp16
…#15649)

* vulkan: Allow fallback to sysmem memory when vidmem is full

* vulkan: Add env var GGML_VK_ALLOW_SYSMEM_FALLBACK
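A usage sketch for the new environment variable (shell; the actual llama.cpp binary invocation is omitted, and the assumption here is that the variable acts as a boolean opt-in, per the commit title):

```shell
# Opt in to falling back to system memory when device-local video
# memory is exhausted (GGML_VK_ALLOW_SYSMEM_FALLBACK is the env var
# added by this commit):
export GGML_VK_ALLOW_SYSMEM_FALLBACK=1
echo "$GGML_VK_ALLOW_SYSMEM_FALLBACK"
```

Any llama.cpp binary launched from this shell then sees the flag; without it, the previous behavior (failing when vidmem is full) applies.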
…#15679)

This commit removes the portability_enumeration_ext variable from the
ggml_vk_instance_portability_enumeration_ext_available function, as it
is initialized to false and never modified, making it redundant.
* vulkan: mul_mat_id coopmat2 optimizations

Add a path for when the tile fits in BN/2, similar to what we have for mul_mat.

Only call fetch_scales/store_scales once per QUANT_K block, and once at the
beginning in case start_k is not aligned.

* Also add a path for BN/4 - worth a couple more percent

Exposes ggml_backend_sched_split_graph() to allow splitting the graph without allocating compute buffers, and uses it to split the graph for the automatic Flash Attention check.
* metal : fix checks for available FA kernels

ggml-ci

* cont : fix comment [no ci]
* server : enable /slots by default and make it secure

ggml-ci

* server : fix tests to pass `--no-slots` when necessary

* server : extend /props with info about enabled endpoints
jan-service-account merged commit 3832f54 into dev on Sep 1, 2025
13 checks passed
jan-service-account deleted the update-dev-from-master-2025-09-01-00-42 branch on September 1, 2025 at 00:56

7 participants