Releases: ngxson/llama.cpp

b6201

19 Aug 06:01
9d262f4
server : remove swa_full warning (#15399)

b6199

18 Aug 21:16
f08c4c0
mtmd : clean up clip_n_output_tokens (#15391)

b6195

18 Aug 17:57
baa9255
llama : merge conts and reshapes and remove unnecessary cont (#15380)

* remove unnecessary conts and merge reshapes

* restore necessary conts

* merge more conts and reshapes

* merge even more conts and reshapes

b6193

18 Aug 15:07
d1d8241
server : fix incoming tasks not processed in order (#15395)

b6190

18 Aug 06:15
ae532ea
vulkan: disable spirv-opt for bfloat16 shaders (#15352)

b6189

17 Aug 22:48
e5155e6
server : export max observed n_past value (#15361)

Add tracking for high-watermark cache usage and expose it through the /metrics endpoint.

Use case: track the largest cache usage actually needed under a realistic
workload, to better understand memory requirements and adjust the cache
size or quantization accordingly.
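
The high-watermark idea above can be sketched in a few lines of C++. This is a minimal illustration, not the server's actual code: the struct name `n_past_watermark`, its members, and the metric name `example_n_past_max` are all hypothetical, and the output line only assumes the common "name value" text form used by Prometheus-style /metrics endpoints.

```cpp
#include <algorithm>
#include <cstdint>
#include <string>

// Hypothetical tracker: remembers the largest n_past value seen so far.
struct n_past_watermark {
    int32_t n_past_max = 0;

    // Call with the current n_past of a slot; the stored maximum only grows.
    void observe(int32_t n_past) {
        n_past_max = std::max(n_past_max, n_past);
    }

    // Render the value as a "name value" text line.
    // The metric name here is an assumption made for this example.
    std::string to_metric_line() const {
        return "example_n_past_max " + std::to_string(n_past_max) + "\n";
    }
};
```

Because the value never decreases, reading it after a representative workload gives the peak cache usage that the workload required, which is the number needed when sizing the cache.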

b6188

17 Aug 16:32
21c17b5
vulkan: Use larger workgroups for mul_mat_vec when M is small (#15355)

* vulkan: Use larger workgroups for mul_mat_vec when M is small

Also use subgroup instructions for (part of) the reduction when supported.
Without this, the more expensive reductions would eat into the benefits of
the larger workgroups.

* update heuristic for amd/intel

Co-authored-by: 0cc4m <[email protected]>
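
The two-stage reduction described in this commit can be illustrated on the CPU. The sketch below is an assumption-laden model, not the Vulkan shader: `subgroup_add` stands in for a hardware subgroup add instruction, `workgroup_reduce_sketch` is a hypothetical name, and the code assumes the number of partial sums is a multiple of the subgroup size and that the number of subgroups is a power of two.

```cpp
#include <cstddef>
#include <vector>

// Stand-in for a subgroup add: on a GPU this is a single instruction
// across the lanes of one subgroup, with no shared-memory barrier.
static float subgroup_add(const float* lane_values, std::size_t lanes) {
    float sum = 0.0f;
    for (std::size_t i = 0; i < lanes; ++i) {
        sum += lane_values[i];
    }
    return sum;
}

// Reduce one workgroup's per-thread partial sums to a single value.
float workgroup_reduce_sketch(const std::vector<float>& partials,
                              std::size_t subgroup_size) {
    const std::size_t num_subgroups = partials.size() / subgroup_size;
    if (num_subgroups == 0) {
        return 0.0f;
    }

    // Stage 1: each subgroup collapses its lanes into one value.
    std::vector<float> shared(num_subgroups);
    for (std::size_t sg = 0; sg < num_subgroups; ++sg) {
        shared[sg] = subgroup_add(&partials[sg * subgroup_size], subgroup_size);
    }

    // Stage 2: tree reduction over the per-subgroup results.
    // On a GPU each step of this loop would need a barrier.
    for (std::size_t stride = num_subgroups / 2; stride > 0; stride >>= 1) {
        for (std::size_t i = 0; i < stride; ++i) {
            shared[i] += shared[i + stride];
        }
    }
    return shared[0];
}
```

With 256 threads and a subgroup size of 32, stage 2 needs only three barrier-separated steps instead of the eight a pure shared-memory tree would need, which is why the subgroup stage matters more as workgroups grow.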

b6187

17 Aug 14:27
19f4dec
vulkan: support sqrt (#15370)

b6185

17 Aug 11:48
b143fbc
ci : fix hang in windows-hip build/release (#15365)

* fix hang in windows-latest-cmake-hip

* apply fix to release as well

b6184

17 Aug 09:08
de56279
vulkan: Optimize argsort (#15354)

- Launch an appropriate number of invocations (next larger power of two).
32 invocations is common and the barrier is much cheaper there.
- Specialize for "needs bounds checking" vs not.
- Make the code less branchy and [[unroll]] the loops. In the final code,
I see no branches inside the main loop (only predicated stores) when
needs_bounds_check is false.
- Always sort ascending, then apply the ascending vs descending option when
doing the final stores to memory.
- Copy the values into shared memory, which makes them slightly cheaper to access.
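
Several of the points above (padding to the next power of two, always sorting ascending, and applying the requested order only at the final store) can be shown with a CPU sketch of a bitonic argsort. This is an illustration under assumptions, not the shader from the commit: the function and enum names are hypothetical, padding slots are filled with +infinity rather than handled by a bounds check, ties are broken by index, and NaN inputs are not handled.

```cpp
#include <cstddef>
#include <cstdint>
#include <limits>
#include <utility>
#include <vector>

enum class sort_order { asc, desc };

// Smallest power of two that is >= n (returns 1 for n == 0).
static std::size_t next_pow2(std::size_t n) {
    std::size_t p = 1;
    while (p < n) p <<= 1;
    return p;
}

std::vector<int32_t> argsort_sketch(const std::vector<float>& values,
                                    sort_order order) {
    const std::size_t n = values.size();
    const std::size_t padded = next_pow2(n);

    // Local copies play the role of shared memory; padding sorts last.
    std::vector<float> keys(padded, std::numeric_limits<float>::infinity());
    std::vector<int32_t> idx(padded);
    for (std::size_t i = 0; i < padded; ++i) {
        idx[i] = static_cast<int32_t>(i);
        if (i < n) keys[i] = values[i];
    }

    // True if slot a must come after slot b in ascending order.
    auto greater = [&](std::size_t a, std::size_t b) {
        return keys[a] > keys[b] || (keys[a] == keys[b] && idx[a] > idx[b]);
    };

    // Bitonic network: always sorts ascending.
    for (std::size_t k = 2; k <= padded; k <<= 1) {
        for (std::size_t j = k >> 1; j > 0; j >>= 1) {
            for (std::size_t i = 0; i < padded; ++i) {
                const std::size_t l = i ^ j;
                if (l <= i) continue;
                const bool up = (i & k) == 0;
                if (up ? greater(i, l) : greater(l, i)) {
                    std::swap(keys[i], keys[l]);
                    std::swap(idx[i], idx[l]);
                }
            }
        }
    }

    // The requested order is applied only when writing the result.
    std::vector<int32_t> out(n);
    for (std::size_t i = 0; i < n; ++i) {
        out[i] = (order == sort_order::asc) ? idx[i] : idx[n - 1 - i];
    }
    return out;
}
```

For the input `{3, 1, 2}` the sketch pads to four slots and sorts once. Ascending and descending results differ only in which end of the sorted index array the final loop reads from.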