Releases · ngxson/llama.cpp
b6195
llama : merge conts and reshapes and remove unnecessary cont (#15380)
* remove unnecessary conts and merge reshapes
* restore necessary conts
* merge more conts and reshapes
* merge even more conts and reshapes
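As an illustration of the kind of simplification this covers, here is a minimal before/after sketch using the public ggml API (`ggml_cont`, `ggml_reshape_2d`, `ggml_is_contiguous`); it is not the actual diff from the PR:

```cpp
#include "ggml.h"

// before: an unconditional ggml_cont copies the tensor even when it is
// already contiguous, so the copy is wasted work
struct ggml_tensor * build_before(struct ggml_context * ctx, struct ggml_tensor * x) {
    struct ggml_tensor * c = ggml_cont(ctx, x);
    return ggml_reshape_2d(ctx, c, x->ne[0]*x->ne[1], x->ne[2]);
}

// after: reshape directly and only insert ggml_cont when the layout
// actually requires it (ggml_reshape_* expects a contiguous input)
struct ggml_tensor * build_after(struct ggml_context * ctx, struct ggml_tensor * x) {
    if (!ggml_is_contiguous(x)) {
        x = ggml_cont(ctx, x);
    }
    return ggml_reshape_2d(ctx, x, x->ne[0]*x->ne[1], x->ne[2]);
}
```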
b6193
server : fix incoming tasks not being processed in order (#15395)
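The gist of such a fix, sketched with hypothetical names (the real server uses its own queue type), is to always append new tasks and always pop the oldest one under a single lock:

```cpp
#include <deque>
#include <mutex>
#include <utility>

struct task { int id; };

struct task_queue {
    std::deque<task> tasks;
    std::mutex mtx;

    void push(task t) {
        std::lock_guard<std::mutex> lock(mtx);
        tasks.push_back(std::move(t));   // append, never reorder
    }

    bool pop(task & out) {
        std::lock_guard<std::mutex> lock(mtx);
        if (tasks.empty()) return false;
        out = std::move(tasks.front());  // always take the oldest task
        tasks.pop_front();
        return true;
    }
};
```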
b6190
vulkan: disable spirv-opt for bfloat16 shaders (#15352)
b6189
server : export max observed n_past value (#15361)
Add tracking for the high-watermark cache usage and expose it in the /metrics endpoint. Use case: tracking the largest cache usage needed under a realistic workload, to better understand memory requirements and be able to adjust the cache size/quantization for the model accordingly.
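A minimal sketch of high-watermark tracking, using hypothetical names rather than the server's actual internals:

```cpp
#include <algorithm>
#include <cstdint>

// track the largest n_past seen so far; the value can then be reported
// via the /metrics endpoint alongside the existing counters
struct server_metrics {
    int32_t n_past_max = 0;

    void on_decode(int32_t n_past) {
        n_past_max = std::max(n_past_max, n_past);
    }
};
```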
b6188
vulkan: Use larger workgroups for mul_mat_vec when M is small (#15355)
* vulkan: Use larger workgroups for mul_mat_vec when M is small. Also use subgroup instructions for (part of) the reduction when supported; without this, the more expensive reductions would eat into the benefits of the larger workgroups.
* update heuristic for amd/intel
Co-authored-by: 0cc4m <[email protected]>
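For intuition, a host-side heuristic of this shape might look like the following; the threshold and multiplier are assumptions for illustration, not the values used by the shaders:

```cpp
#include <cstdint>

// assumed heuristic: when the number of rows M is small there are few
// workgroups, so make each one larger to keep the GPU occupied; with
// many rows there is already enough parallelism
uint32_t pick_workgroup_size(uint32_t m_rows, uint32_t subgroup_size) {
    if (m_rows <= 8) {
        return 4 * subgroup_size;  // e.g. 128 threads with subgroup size 32
    }
    return subgroup_size;
}
```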
b6187
vulkan: support sqrt (#15370)
b6185
ci : fix hang in windows-hip build/release (#15365)
* fix hang in windows-latest-cmake-hip
* apply fix to release as well
b6184
vulkan: Optimize argsort (#15354)
- Launch an appropriate number of invocations (the next larger power of two). 32 invocations is common and the barrier is much cheaper there.
- Specialize for "needs bounds checking" vs not.
- Make the code less branchy and [[unroll]] the loops. In the final code, I see no branches inside the main loop (only predicated stores) when needs_bounds_check is false.
- Always sort ascending, then apply the ascending vs descending option when doing the final stores to memory.
- Copy the values into shared memory, which makes them slightly cheaper to access.
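Two of these ideas, sketched in plain C++ as an illustration (the real implementation is a Vulkan shader):

```cpp
#include <cstdint>
#include <vector>

// round the invocation count up to the next power of two, as bitonic
// sorting networks require a power-of-two element count
uint32_t next_pow2(uint32_t n) {
    uint32_t p = 1;
    while (p < n) p <<= 1;
    return p;
}

// sort ascending only, and apply the descending option at the final
// store: descending is just the ascending result written in reverse
void store_sorted(const std::vector<int> & sorted_idx,
                  std::vector<int> & dst, bool descending) {
    const size_t n = sorted_idx.size();
    for (size_t i = 0; i < n; ++i) {
        dst[i] = descending ? sorted_idx[n - 1 - i] : sorted_idx[i];
    }
}
```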
b6183
model : support LiquidAI LFM2-VL vision family (#15347)
* wip lfm2 vision model
* Fix conv weight
* Implement dynamic resolution
* Fix cuda
* support LFM2-VL-450M
* happy CI
* Remove extra `ggml_conv` and put others into the right place
Co-authored-by: Xuan Son Nguyen <[email protected]>
Co-authored-by: Sigbjørn Skjæret <[email protected]>
b6182
vulkan: fuse adds (#15252)
* vulkan: fuse adds. Fuse adds that have the same shape, which are common in MoE models. It will currently fuse up to 6 adds, because we assume no more than 8 descriptors per dispatch, but this could be changed.
* check runtimeDescriptorArray feature
* disable multi_add for Intel due to a likely driver bug
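A hypothetical sketch of the fusion rule described above, using real ggml node fields (`op`, `src`, `ggml_are_same_shape`) but invented helper logic rather than the backend's actual fusion pass:

```cpp
#include "ggml.h"

// walk a chain of ADD nodes and count how many same-shape adds could be
// folded into one dispatch; capped at 6 on the assumption that a fused
// dispatch uses no more than 8 descriptors
static int count_fusable_adds(const struct ggml_tensor * node) {
    const int MAX_FUSED_ADDS = 6;
    int n = 0;
    while (node && node->op == GGML_OP_ADD && n < MAX_FUSED_ADDS) {
        // both operands must have the same shape for the fused kernel
        if (!ggml_are_same_shape(node->src[0], node->src[1])) {
            break;
        }
        n++;
        node = node->src[0];  // follow the chain of accumulated adds
    }
    return n;
}
```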