Releases · ngxson/llama.cpp

19 Aug 06:01

9d262f4

b6201

server : remove swa_full warning (#15399)

18 Aug 21:16

github-actions

b6199

f08c4c0

mtmd : clean up clip_n_output_tokens (#15391)

18 Aug 17:57

github-actions

b6195

baa9255

llama : merge conts and reshapes and remove unnecessary cont (#15380)

* remove unnecessary conts and merge reshapes

* restore necessary conts

* merge more conts and reshapes

* merge even more conts and reshapes

18 Aug 15:07

github-actions

b6193

d1d8241

server : fix incoming tasks not processed in order (#15395)

18 Aug 06:15

github-actions

b6190

ae532ea

vulkan: disable spirv-opt for bfloat16 shaders (#15352)

17 Aug 22:48

github-actions

b6189

e5155e6

server : export max observed n_past value (#15361)

Track the high-watermark cache usage (the largest n_past value observed) and expose it through the /metrics endpoint.

Use case: measure the largest cache usage actually needed under a realistic workload,
to better understand memory requirements and to adjust the cache size or
quantization for the model and cache accordingly.
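
The high-watermark idea described above can be sketched in a few lines. This is a minimal illustration, not the server's implementation: the class name, the field name, and the metric name `example_n_past_max` are all hypothetical, and the output only assumes the Prometheus text exposition format that a /metrics endpoint typically serves.

```python
from dataclasses import dataclass


@dataclass
class CacheUsageTracker:
    """Remembers the largest n_past value seen so far (a high watermark)."""

    n_past_max: int = 0  # hypothetical field name, for illustration only

    def observe(self, n_past: int) -> None:
        # The watermark only moves up; smaller values leave it unchanged.
        if n_past > self.n_past_max:
            self.n_past_max = n_past

    def render_metric(self, name: str = "example_n_past_max") -> str:
        # Prometheus text format: HELP line, TYPE line, then the sample.
        # The metric name and type are assumptions, not the server's.
        return (
            f"# HELP {name} Largest observed n_past.\n"
            f"# TYPE {name} gauge\n"
            f"{name} {self.n_past_max}\n"
        )


tracker = CacheUsageTracker()
for n_past in (128, 4096, 512):
    tracker.observe(n_past)
print(tracker.render_metric())
```

Because the value never decreases, reading it after a representative workload gives the peak demand directly, which is the number needed when sizing the cache.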

17 Aug 16:32

github-actions

b6188

21c17b5

vulkan: Use larger workgroups for mul_mat_vec when M is small (#15355)

* vulkan: Use larger workgroups for mul_mat_vec when M is small

Also use subgroup instructions for (part of) the reduction when supported.
Without this, the more expensive reductions would eat into the benefits of
the larger workgroups.

* update heuristic for amd/intel

Co-authored-by: 0cc4m <[email protected]>
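
To illustrate why the reduction cost matters once the workgroup grows, here is a rough CPU-side sketch of a two-stage reduction for one output row of a matrix-vector product. It is not the shader code: the function name, the workgroup size of 128, and the subgroup size of 32 are assumptions chosen for the example, and the release notes do not state the actual heuristic.

```python
def dot_two_stage(row, vec, workgroup_size=128, subgroup_size=32):
    """Simulate one workgroup computing dot(row, vec) with a two-stage reduction."""
    assert len(row) == len(vec)
    assert workgroup_size % subgroup_size == 0

    # Stage 0: each simulated invocation accumulates a strided slice of the row.
    partials = [0.0] * workgroup_size
    for i, (a, b) in enumerate(zip(row, vec)):
        partials[i % workgroup_size] += a * b

    # Stage 1: reduce within each subgroup. On a GPU this would be a single
    # subgroup add, with no shared-memory traffic or barrier.
    subgroup_sums = [
        sum(partials[start:start + subgroup_size])
        for start in range(0, workgroup_size, subgroup_size)
    ]

    # Stage 2: combine the few per-subgroup results. On a GPU this is the
    # part that still goes through shared memory.
    return sum(subgroup_sums)


row = [float(i) for i in range(1000)]
vec = [1.0] * 1000
print(dot_two_stage(row, vec))
```

With these assumed sizes, only 4 values reach the second stage instead of 128, which is the kind of saving the subgroup step is meant to provide.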

17 Aug 14:27

github-actions

b6187

19f4dec

vulkan: support sqrt (#15370)

17 Aug 11:48

github-actions

b6185

b143fbc

ci : fix hang in windows-hip build/release (#15365)

* fix hang in windows-latest-cmake-hip

* apply fix to release as well

17 Aug 09:08

github-actions

b6184

de56279

vulkan: Optimize argsort (#15354)

- Launch an appropriate number of invocations (next larger power of two).
32 invocations is common and the barrier is much cheaper there.
- Specialize for "needs bounds checking" vs not.
- Make the code less branchy and [[unroll]] the loops. In the final code,
I see no branches inside the main loop (only predicated stores) when
needs_bounds_check is false.
- Always sort ascending, then apply the ascending vs descending option when
doing the final stores to memory.
- Copy the values into shared memory, which makes them slightly cheaper to access.
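
The points above can be sketched on the CPU as follows. This assumes the sort is a bitonic sorting network, which the notes do not name explicitly, and all function and variable names are hypothetical. Each loop index `i` stands in for one GPU invocation, and a plain list stands in for shared memory.

```python
def next_pow2(n: int) -> int:
    """Smallest power of two that is >= n (and at least 1)."""
    p = 1
    while p < n:
        p *= 2
    return p


def argsort_bitonic(values, descending=False):
    n = len(values)
    size = next_pow2(n)      # simulated invocation count
    shared = list(values)    # stands in for the shared-memory copy
    idx = list(range(size))  # slots >= n are padding

    def comes_after(a, b):
        # Bounds check: padding slots sort after every real element.
        if a >= n:
            return b < n
        if b >= n:
            return False
        return shared[a] > shared[b]

    k = 2
    while k <= size:
        j = k // 2
        while j > 0:
            for i in range(size):  # each i models one invocation
                partner = i ^ j
                if partner > i:
                    a, b = idx[i], idx[partner]
                    if (i & k) == 0:
                        out_of_order = comes_after(a, b)
                    else:
                        out_of_order = comes_after(b, a)
                    if out_of_order:
                        idx[i], idx[partner] = b, a
            j //= 2
        k *= 2

    # The network always sorts ascending; the requested order is applied
    # only when writing the result out.
    return [idx[n - 1 - i] if descending else idx[i] for i in range(n)]


print(argsort_bitonic([3.0, -1.0, 2.5, 0.0, 7.0]))
print(argsort_bitonic([3.0, -1.0, 2.5, 0.0, 7.0], descending=True))
```

When the input length is already a power of two there are no padding slots, so the bounds checks in `comes_after` never fire. That is the case a "no bounds checking" specialization could skip entirely.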
