
v0.3.16-cu124-AVX2-win-20251103


@github-actions released this 03 Nov 20:20 · 70 commits to main since this release

- feat: Sync llama/mtmd API change, support clip flash-attn
- feat: Update submodule vendor/llama.cpp 76af40a..48bd265
- feat: Optimize Windows CUDA wheel build workflow
- feat: Implement LlamaTrieCache in llama_cache.py: optimize LlamaCache lookup from O(N) to O(K) using a trie

Update: The CLIP model can now run with CUDA flash attention (it now participates in the flash-attention warmup).

This brings faster inference and more efficient use of GPU memory.
*(screenshot: clip flash-attn)*
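A minimal usage sketch of enabling flash attention for a multimodal model with the standard llama-cpp-python API; the model and mmproj paths below are placeholders, and per these notes the `flash_attn` flag now also covers the CLIP vision encoder:

```python
from llama_cpp import Llama
from llama_cpp.llama_chat_format import Llava15ChatHandler

# Placeholder paths: substitute your own LLaVA weights and CLIP mmproj file.
chat_handler = Llava15ChatHandler(clip_model_path="./mmproj-model-f16.gguf")

llm = Llama(
    model_path="./llava-v1.5-7b.Q4_K_M.gguf",
    chat_handler=chat_handler,
    n_ctx=4096,
    n_gpu_layers=-1,  # offload all layers to the GPU
    flash_attn=True,  # flash attention now also applies to the CLIP encoder
)
```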

Update: Implement LlamaTrieCache (the default is still LLAMA_RAM_CACHE)

LlamaTrieCache is implemented with a trie, which significantly improves retrieval speed over the original linear scan for the longest cached prefix, making the service more responsive.
Below is a test comparison (a sketch of the idea follows the chart); the advantage becomes more pronounced as the context length increases.
*(benchmark comparison chart)*
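A simplified sketch of the trie idea, not the actual llama_cache.py implementation (the class and method names below are hypothetical): each token extends a path in the trie, so finding the longest cached prefix of a prompt costs O(K) node hops for a prefix of length K, instead of comparing the prompt against every stored key as the linear scan does.

```python
class _TrieNode:
    __slots__ = ("children", "state")

    def __init__(self):
        self.children: dict[int, "_TrieNode"] = {}
        self.state = None  # saved llama state for the prefix ending at this node


class TokenPrefixTrie:
    """Hypothetical trie keyed by token ids, illustrating O(K) prefix lookup."""

    def __init__(self):
        self.root = _TrieNode()

    def insert(self, tokens: list[int], state) -> None:
        # Walk/create one node per token, then attach the state at the end.
        node = self.root
        for tok in tokens:
            node = node.children.setdefault(tok, _TrieNode())
        node.state = state

    def longest_prefix(self, tokens: list[int]):
        # Follow the prompt token by token, remembering the deepest stored
        # state seen so far; stop at the first mismatch.
        node, best_len, best_state = self.root, 0, None
        for depth, tok in enumerate(tokens, start=1):
            node = node.children.get(tok)
            if node is None:
                break
            if node.state is not None:
                best_len, best_state = depth, node.state
        return best_len, best_state
```

To opt in (the RAM cache remains the default), one would presumably swap the cache on a `Llama` instance via the existing `llm.set_cache(...)` hook, assuming LlamaTrieCache follows the same cache interface as LlamaRAMCache.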