
Releases: ggml-org/llama.cpp

b6195 (18 Aug 17:58, commit baa9255)
llama : merge conts and reshapes and remove unnecessary cont (#15380)

* remove unnecessary conts and merge reshapes

* restore necessary conts

* merge more conts and reshapes

* merge even more conts and reshapes

b6193 (18 Aug 15:18, commit d1d8241)
server : fix incoming tasks not processed in order (#15395)

b6191 (18 Aug 07:55, commit f44f793)
ggml-quants : fix make_qp_quants NANs and IQ1 assertion errors (#15379)

* ggml-quants : fix make_qp_quants NANs and IQ1 assertion errors

* ggml-quants : avoid division by zero in make_q3_quants
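To illustrate the kind of fix above: a quantization scale computed as a ratio of weighted sums becomes NaN when the denominator is zero (e.g. an all-zero block). The sketch below shows the general guard pattern only; the function and variable names are hypothetical, not the actual `make_qp_quants`/`make_q3_quants` code.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical scale computation: num/den yields 0/0 -> NaN on an
// all-zero block unless the division is guarded.
static float safe_scale(const std::vector<float>& x, const std::vector<float>& w) {
    float num = 0.0f, den = 0.0f;
    for (size_t i = 0; i < x.size(); ++i) {
        num += w[i] * x[i];         // weighted sum of values
        den += w[i] * x[i] * x[i];  // weighted sum of squares
    }
    // Guard: return a neutral scale instead of dividing by zero.
    return den > 0.0f ? num / den : 0.0f;
}
```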

b6190 (18 Aug 06:21, commit ae532ea)
vulkan: disable spirv-opt for bfloat16 shaders (#15352)

b6189 (17 Aug 22:55, commit e5155e6)
server : export max observed n_past value (#15361)

Add tracking of high-watermark cache usage and expose it via the /metrics endpoint.

Use case: track the largest cache usage needed under a realistic workload, to better understand memory requirements and adjust the cache size/quantization for the model accordingly.
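The high-watermark idea itself is simple: remember the largest value observed so far and report that as a gauge. A minimal sketch, with an illustrative struct and field names (not the server's actual code):

```cpp
#include <algorithm>
#include <cassert>

// Hypothetical tracker: records the maximum n_past seen across all
// processed tasks, suitable for export as a gauge on /metrics.
struct n_past_watermark {
    int max_seen = 0;
    void observe(int n_past) { max_seen = std::max(max_seen, n_past); }
};
```

A gauge like this only ever grows, so sampling it at any point gives the peak cache usage since server start.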

b6188 (17 Aug 16:33, commit 21c17b5)
vulkan: Use larger workgroups for mul_mat_vec when M is small (#15355)

* vulkan: Use larger workgroups for mul_mat_vec when M is small

Also use subgroup instructions for (part of) the reduction when supported.
Without this, the more expensive reductions would eat into the benefits of
the larger workgroups.

* update heuristic for amd/intel

Co-authored-by: 0cc4m <[email protected]>

---------

Co-authored-by: 0cc4m <[email protected]>
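The subgroup reduction mentioned above can be pictured as a tree: each step halves the number of active lanes, so a 32-wide partial sum finishes in 5 steps rather than 31 serial adds. The CPU sketch below mirrors that shuffle-style reduction conceptually; it is not the actual shader code.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Tree reduction over a power-of-two number of "lanes": at each step,
// lane i accumulates the value of lane i+stride, halving the active
// count until lane 0 holds the full sum (log2(N) steps total).
static float tree_reduce(std::vector<float> lanes) {
    for (size_t stride = lanes.size() / 2; stride > 0; stride /= 2) {
        for (size_t i = 0; i < stride; ++i) {
            lanes[i] += lanes[i + stride];
        }
    }
    return lanes[0];
}
```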

b6187 (17 Aug 14:23, commit 19f4dec)
vulkan: support sqrt (#15370)

b6185 (17 Aug 12:01, commit b143fbc)
ci : fix hang in windows-hip build/release (#15365)

* fix hang in windows-latest-cmake-hip

* apply fix to release as well

b6184 (17 Aug 09:01, commit de56279)
vulkan: Optimize argsort (#15354)

- Launch an appropriate number of invocations (next larger power of two).
32 invocations is common and the barrier is much cheaper there.
- Specialize for "needs bounds checking" vs not.
- Make the code less branchy and [[unroll]] the loops. In the final code,
I see no branches inside the main loop (only predicated stores) when
needs_bounds_check is false.
- Always sort ascending, then apply the ascending vs descending option when
doing the final stores to memory.
- Copy the values into shared memory, makes them slightly cheaper to access.
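Two of the ideas above (rounding the invocation count up to a power of two, and always sorting ascending with the direction applied only at the final store) can be sketched on the CPU. The helpers below are illustrative, using `std::stable_sort` in place of the shader's in-shared-memory sorting network:

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Round n up to the next power of two, matching the idea of launching
// the next-larger-power-of-two number of invocations.
static uint32_t next_pow2(uint32_t n) {
    uint32_t p = 1;
    while (p < n) p <<= 1;
    return p;
}

// Always sort indices ascending by value; a descending result is
// produced only when writing out, not via a second sort path.
static std::vector<uint32_t> argsort(const std::vector<float>& v, bool ascending) {
    std::vector<uint32_t> idx(v.size());
    for (uint32_t i = 0; i < idx.size(); ++i) idx[i] = i;
    std::stable_sort(idx.begin(), idx.end(),
                     [&](uint32_t a, uint32_t b) { return v[a] < v[b]; });
    if (!ascending) std::reverse(idx.begin(), idx.end());
    return idx;
}
```

Keeping a single ascending sort path and flipping at the store is what removes the direction-dependent branching from the main loop.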

b6183 (16 Aug 21:48, commit 65349f2)
model : support vision LiquidAI LFM2-VL family (#15347)

* wip lfm2 vision model

* Fix conv weight

* Implement dynamic resolution

* Fix cuda

* support LFM2-VL-450M

* happy CI

* Remove extra `ggml_conv` and put others into the right place

Co-authored-by: Sigbjørn Skjæret <[email protected]>

---------

Co-authored-by: Xuan Son Nguyen <[email protected]>
Co-authored-by: Sigbjørn Skjæret <[email protected]>