@jan-service-account

Updates dev branch with latest release (b6189) from ggml-org/llama.cpp

jeffbolznv and others added 6 commits August 17, 2025 10:41
- Launch an appropriate number of invocations (the next larger power of two).
  32 invocations is common, and the barrier is much cheaper there.
- Specialize for "needs bounds checking" vs. not.
- Make the code less branchy and [[unroll]] the loops. In the final code,
  I see no branches inside the main loop (only predicated stores) when
  needs_bounds_check is false.
- Always sort ascending, then apply the ascending vs. descending option when
  doing the final stores to memory.
- Copy the values into shared memory, which makes them slightly cheaper to access.
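
A minimal CPU-side sketch of two of these ideas, the power-of-two invocation count and the sort-ascending-then-flip-at-store trick (names here are hypothetical; the actual change is in the Vulkan GLSL argsort shader):

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Round the row length up to the next power of two, mirroring how the shader
// would pick its invocation count; padded slots would get sentinel values so
// they sort past the real data and are never stored.
static uint32_t next_pow2(uint32_t n) {
    uint32_t p = 1;
    while (p < n) p <<= 1;
    return p;
}

// "Always sort ascending, apply the direction at the final store":
// the sort itself has a single code path; only the write-out differs.
static void argsort_row(const float * vals, int32_t * dst, uint32_t n, bool ascending) {
    std::vector<uint32_t> idx(n);
    for (uint32_t i = 0; i < n; ++i) idx[i] = i;
    std::stable_sort(idx.begin(), idx.end(),
        [&](uint32_t a, uint32_t b) { return vals[a] < vals[b]; }); // always ascending
    for (uint32_t i = 0; i < n; ++i) {
        dst[i] = (int32_t)(ascending ? idx[i] : idx[n - 1 - i]);    // flip at store time
    }
}
```
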
* fix hang in windows-latest-cmake-hip

* apply fix to release as well
ggml-org#15367)

* force patch_embd weights to f32

* use MmprojModel base tensor_force_quant instead
…rg#15355)
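
The actual change lives in the Python conversion script (MmprojModel's tensor_force_quant hook); purely as a conceptual C++ sketch of the idea, with hypothetical names:

```cpp
#include <string>

// Conceptual sketch only (hypothetical types/names): precision-sensitive
// tensors such as patch_embd are pinned to F32 regardless of the
// quantization type requested for the rest of the model.
enum class tensor_type { F32, F16, QUANTIZED };

static tensor_type choose_type(const std::string & name, tensor_type requested) {
    if (name.find("patch_embd") != std::string::npos) {
        return tensor_type::F32; // force full precision for patch embeddings
    }
    return requested;
}
```
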

* vulkan: Use larger workgroups for mul_mat_vec when M is small

Also use subgroup instructions for (part of) the reduction when supported.
Without this, the more expensive reductions would eat into the benefits of
the larger workgroups.

* update heuristic for amd/intel
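
For intuition, a CPU-side model of the two-stage reduction described above (hypothetical names; the real code uses Vulkan subgroup intrinsics inside the shader):

```cpp
#include <algorithm>
#include <cstddef>
#include <numeric>
#include <vector>

// Model of a workgroup reduction: each subgroup first reduces its own slice
// (cheap on the GPU, no barrier needed), then a single pass combines the
// per-subgroup partials through shared memory. Keeping that second stage
// small is what lets the larger workgroups pay off.
float workgroup_reduce(const std::vector<float> & partials, size_t subgroup_size) {
    std::vector<float> per_subgroup;
    for (size_t base = 0; base < partials.size(); base += subgroup_size) {
        const size_t end = std::min(base + subgroup_size, partials.size());
        // On the GPU this inner sum would be a single subgroup instruction.
        per_subgroup.push_back(std::accumulate(partials.begin() + base,
                                               partials.begin() + end, 0.0f));
    }
    // Final combine across subgroups (shared memory plus one barrier on the GPU).
    return std::accumulate(per_subgroup.begin(), per_subgroup.end(), 0.0f);
}
```
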

---------

Co-authored-by: 0cc4m <[email protected]>

Add tracking for high watermark cache usage and make it available in the /metrics endpoint.

Use case: tracking the largest cache usage needed under a realistic workload,
to better understand memory requirements and to be able to adjust
cache size/quantization for the model/cache accordingly.
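
A minimal sketch of such a high-watermark counter (hypothetical names; the actual server code differs), updated wherever cache occupancy changes and exported as a gauge from /metrics:

```cpp
#include <atomic>
#include <cstddef>

// Track current usage plus a monotonic high watermark; on_usage() is called
// whenever cache occupancy changes, and /metrics reports both values.
struct cache_stats {
    std::atomic<size_t> used{0};
    std::atomic<size_t> high_watermark{0};

    void on_usage(size_t n_used) {
        used.store(n_used, std::memory_order_relaxed);
        size_t hw = high_watermark.load(std::memory_order_relaxed);
        // Lock-free max update: retry while another thread hasn't already
        // published an even larger value.
        while (n_used > hw &&
               !high_watermark.compare_exchange_weak(hw, n_used,
                                                     std::memory_order_relaxed)) {
        }
    }
};
```
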
@jan-service-account merged commit 4b8975c into dev on Aug 18, 2025
17 checks passed
@jan-service-account deleted the update-dev-from-master-2025-08-18-00-13 branch on August 18, 2025 00:26