
Releases: ggml-org/llama.cpp

b6195 (18 Aug 17:58, commit baa9255)
llama : merge conts and reshapes and remove unnecessary cont (#15380)

* remove unnecessary conts and merge reshapes

* restore necessary conts

* merge more conts and reshapes

* merge even more conts and reshapes

b6193 (18 Aug 15:18, commit d1d8241)
server : fix incoming tasks not processed in order (#15395)

b6191 (18 Aug 07:55, commit f44f793)
ggml-quants : fix make_qp_quants NANs and IQ1 assertion errors (#15379)

* ggml-quants : fix make_qp_quants NANs and IQ1 assertion errors

* ggml-quants : avoid division by zero in make_q3_quants
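To illustrate the kind of fix above: a quantization scale computed as a ratio of weighted sums becomes NaN when the denominator is zero (e.g. an all-zero block). The sketch below shows the general guard pattern only; the function and variable names are hypothetical, not the actual `make_qp_quants`/`make_q3_quants` code.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical scale computation: num/den yields 0/0 -> NaN on an
// all-zero block unless the division is guarded.
static float safe_scale(const std::vector<float>& x, const std::vector<float>& w) {
    float num = 0.0f, den = 0.0f;
    for (size_t i = 0; i < x.size(); ++i) {
        num += w[i] * x[i];         // weighted sum of values
        den += w[i] * x[i] * x[i];  // weighted sum of squares
    }
    // Guard: return a neutral scale instead of dividing by zero.
    return den > 0.0f ? num / den : 0.0f;
}
```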

b6190 (18 Aug 06:21, commit ae532ea)
vulkan: disable spirv-opt for bfloat16 shaders (#15352)

b6189 (17 Aug 22:55, commit e5155e6)
server : export max observed n_past value (#15361)

Add tracking of high-watermark cache usage and expose it via the /metrics endpoint.

Use case: track the largest cache usage needed under a realistic workload, to better understand memory requirements and adjust the cache size/quantization for the model accordingly.
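The high-watermark idea itself is simple: remember the largest value observed so far and report that as a gauge. A minimal sketch, with an illustrative struct and field names (not the server's actual code):

```cpp
#include <algorithm>
#include <cassert>

// Hypothetical tracker: records the maximum n_past seen across all
// processed tasks, suitable for export as a gauge on /metrics.
struct n_past_watermark {
    int max_seen = 0;
    void observe(int n_past) { max_seen = std::max(max_seen, n_past); }
};
```

A gauge like this only ever grows, so sampling it at any point gives the peak cache usage since server start.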

b6188 (17 Aug 16:33, commit 21c17b5)
vulkan: Use larger workgroups for mul_mat_vec when M is small (#15355)

* vulkan: Use larger workgroups for mul_mat_vec when M is small

Also use subgroup instructions for (part of) the reduction when supported.
Without this, the more expensive reductions would eat into the benefits of
the larger workgroups.

* update heuristic for amd/intel

Co-authored-by: 0cc4m <[email protected]>

---------

Co-authored-by: 0cc4m <[email protected]>
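The subgroup reduction mentioned above can be pictured as a tree: each step halves the number of active lanes, so a 32-wide partial sum finishes in 5 steps rather than 31 serial adds. The CPU sketch below mirrors that shuffle-style reduction conceptually; it is not the actual shader code.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Tree reduction over a power-of-two number of "lanes": at each step,
// lane i accumulates the value of lane i+stride, halving the active
// count until lane 0 holds the full sum (log2(N) steps total).
static float tree_reduce(std::vector<float> lanes) {
    for (size_t stride = lanes.size() / 2; stride > 0; stride /= 2) {
        for (size_t i = 0; i < stride; ++i) {
            lanes[i] += lanes[i + stride];
        }
    }
    return lanes[0];
}
```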

b6187 (17 Aug 14:23, commit 19f4dec)
vulkan: support sqrt (#15370)

b6185 (17 Aug 12:01, commit b143fbc)
ci : fix hang in windows-hip build/release (#15365)

* fix hang in windows-latest-cmake-hip

* apply fix to release as well

b6184 (17 Aug 09:01, commit de56279)
vulkan: Optimize argsort (#15354)

- Launch an appropriate number of invocations (next larger power of two).
32 invocations is common and the barrier is much cheaper there.
- Specialize for "needs bounds checking" vs not.
- Make the code less branchy and [[unroll]] the loops. In the final code,
I see no branches inside the main loop (only predicated stores) when
needs_bounds_check is false.
- Always sort ascending, then apply the ascending vs descending option when
doing the final stores to memory.
- Copy the values into shared memory, makes them slightly cheaper to access.
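Two of the ideas above (rounding the invocation count up to a power of two, and always sorting ascending with the direction applied only at the final store) can be sketched on the CPU. The helpers below are illustrative, using `std::stable_sort` in place of the shader's in-shared-memory sorting network:

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Round n up to the next power of two, matching the idea of launching
// the next-larger-power-of-two number of invocations.
static uint32_t next_pow2(uint32_t n) {
    uint32_t p = 1;
    while (p < n) p <<= 1;
    return p;
}

// Always sort indices ascending by value; a descending result is
// produced only when writing out, not via a second sort path.
static std::vector<uint32_t> argsort(const std::vector<float>& v, bool ascending) {
    std::vector<uint32_t> idx(v.size());
    for (uint32_t i = 0; i < idx.size(); ++i) idx[i] = i;
    std::stable_sort(idx.begin(), idx.end(),
                     [&](uint32_t a, uint32_t b) { return v[a] < v[b]; });
    if (!ascending) std::reverse(idx.begin(), idx.end());
    return idx;
}
```

Keeping a single ascending sort path and flipping at the store is what removes the direction-dependent branching from the main loop.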

b6183 (16 Aug 21:48, commit 65349f2)
model : support vision LiquidAI LFM2-VL family (#15347)

* wip lfm2 vision model

* Fix conv weight

* Implement dynamic resolution

* Fix cuda

* support LFM2-VL-450M

* happy CI

* Remove extra `ggml_conv` and put others into the right place

Co-authored-by: Sigbjørn Skjæret <[email protected]>

---------

Co-authored-by: Xuan Son Nguyen <[email protected]>
Co-authored-by: Sigbjørn Skjæret <[email protected]>