Releases: JamePeng/llama-cpp-python

v0.3.16-cu124-AVX2-linux-20251103

03 Nov 16:33


feat: Sync llama/mtmd API changes, support clip flash-attn
feat: Update submodule vendor/llama.cpp 76af40a..48bd265
feat: Optimize the Windows CUDA wheel build workflow
feat: Implement LlamaTrieCache in llama_cache.py: optimize LlamaCache lookup from O(N) to O(K) using a trie

Update: The CLIP model can now run with CUDA flash attention, including the warmup pass.

This brings faster speeds and more efficient use of GPU memory.
[Image: clip flash-attn]
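A minimal usage sketch, assuming a LLaVA-style multimodal model; the file paths are placeholders, and whether the CLIP encoder picks up flash attention automatically when `flash_attn=True` is set depends on this build:

```python
# Sketch only: model/mmproj paths are placeholders, not shipped with this release.
from llama_cpp import Llama
from llama_cpp.llama_chat_format import Llava15ChatHandler

# The CLIP vision encoder is loaded through the chat handler's mmproj file.
chat_handler = Llava15ChatHandler(clip_model_path="mmproj-model-f16.gguf")

llm = Llama(
    model_path="llava-v1.5-7b.Q4_K_M.gguf",
    chat_handler=chat_handler,
    n_ctx=4096,
    n_gpu_layers=-1,  # offload all layers to the GPU
    flash_attn=True,  # enable flash attention; per this release the CLIP warmup can also use it (assumption)
)

# Exercising the vision encoder, which is where the clip flash-attn change applies.
out = llm.create_chat_completion(
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": "file:///path/to/image.png"}},
                {"type": "text", "text": "Describe this image."},
            ],
        }
    ]
)
print(out["choices"][0]["message"]["content"])
```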

Update: Implement LlamaTrieCache (the default remains LLAMA_RAM_CACHE)

The cache is implemented with a trie, which significantly improves retrieval of the longest matching prefix compared with the original linear scan, thereby improving service responsiveness (see the sketch after the benchmark below).
Below is a comparison of tests; the advantage becomes more pronounced as the context length increases.
[Image: LlamaTrieCache vs. linear-scan lookup benchmark]
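For illustration, a minimal sketch of the idea with hypothetical names (TokenTrie, _TrieNode), not the actual llama_cache.py implementation: cached states are keyed by their token sequences in a trie, so finding the longest cached prefix walks the query tokens once (O(K) in prompt length) instead of comparing against every stored key (O(N) in the number of entries).

```python
# Illustrative sketch of trie-based longest-prefix lookup over token ids;
# names and structure are hypothetical, not the actual llama_cache.py code.
from typing import Optional, Sequence, Tuple


class _TrieNode:
    __slots__ = ("children", "key")

    def __init__(self) -> None:
        self.children: dict = {}  # token id -> child node
        self.key: Optional[Tuple[int, ...]] = None  # set when a cached entry ends here


class TokenTrie:
    """Maps token sequences to cached states, with longest-prefix lookup in O(K)."""

    def __init__(self) -> None:
        self.root = _TrieNode()
        self.states: dict = {}  # full key tuple -> cached state

    def insert(self, tokens: Sequence[int], state: object) -> None:
        node = self.root
        for tok in tokens:
            node = node.children.setdefault(tok, _TrieNode())
        node.key = tuple(tokens)
        self.states[node.key] = state

    def longest_prefix(self, tokens: Sequence[int]) -> Optional[object]:
        """Walk the query tokens once and return the state of the longest cached prefix."""
        node, best = self.root, None
        for tok in tokens:
            node = node.children.get(tok)
            if node is None:
                break
            if node.key is not None:
                best = node.key
        return self.states[best] if best is not None else None


# Example: the second prompt reuses the state cached for the shared prefix.
cache = TokenTrie()
cache.insert([1, 15, 27, 99], state="kv-state-A")
assert cache.longest_prefix([1, 15, 27, 99, 42, 7]) == "kv-state-A"
```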

v0.3.16-cu128-AVX2-win-20251031

31 Oct 18:37


v0.3.16-cu128-AVX2-linux-20251031

31 Oct 16:12


v0.3.16-cu126-AVX2-win-20251031

31 Oct 19:37


v0.3.16-cu126-AVX2-linux-20251031

31 Oct 15:47


v0.3.16-cu124-AVX2-win-20251031

31 Oct 19:34


v0.3.16-cu124-AVX2-linux-20251031

31 Oct 15:47


v0.3.16-cu128-AVX2-win-20251024

v0.3.16-cu128-AVX2-linux-20251024

v0.3.16-cu126-AVX2-win-20251024