
v0.3.16-cu124-AVX2-win-20251103


@github-actions released this 03 Nov 20:20 · 70 commits to main since this release

- feat: Sync llama/mtmd API change, support clip flash-attn
- feat: Update submodule vendor/llama.cpp 76af40a..48bd265
- feat: Optimize Windows CUDA wheel build workflow
- feat: Implement LlamaTrieCache in llama_cache.py: optimize LlamaCache lookup from O(N) to O(K) using a trie

Update: The CLIP model can now run with CUDA flash attention (it now participates in the flash-attention warmup).

This brings faster inference and more efficient use of GPU memory.
*(screenshot: clip flash-attn)*
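A minimal usage sketch of enabling flash attention for a multimodal model with the standard llama-cpp-python API; the model and mmproj paths below are placeholders, and per these notes the `flash_attn` flag now also covers the CLIP vision encoder:

```python
from llama_cpp import Llama
from llama_cpp.llama_chat_format import Llava15ChatHandler

# Placeholder paths: substitute your own LLaVA weights and CLIP mmproj file.
chat_handler = Llava15ChatHandler(clip_model_path="./mmproj-model-f16.gguf")

llm = Llama(
    model_path="./llava-v1.5-7b.Q4_K_M.gguf",
    chat_handler=chat_handler,
    n_ctx=4096,
    n_gpu_layers=-1,  # offload all layers to the GPU
    flash_attn=True,  # flash attention now also applies to the CLIP encoder
)
```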

Update: Implement LlamaTrieCache (the default is still LLAMA_RAM_CACHE)

LlamaTrieCache is implemented with a trie, which significantly improves retrieval speed over the original linear scan for the longest cached prefix, making the service more responsive.
Below is a test comparison (a sketch of the idea follows the chart); the advantage becomes more pronounced as the context length increases.
*(benchmark comparison chart)*
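A simplified sketch of the trie idea, not the actual llama_cache.py implementation (the class and method names below are hypothetical): each token extends a path in the trie, so finding the longest cached prefix of a prompt costs O(K) node hops for a prefix of length K, instead of comparing the prompt against every stored key as the linear scan does.

```python
class _TrieNode:
    __slots__ = ("children", "state")

    def __init__(self):
        self.children: dict[int, "_TrieNode"] = {}
        self.state = None  # saved llama state for the prefix ending at this node


class TokenPrefixTrie:
    """Hypothetical trie keyed by token ids, illustrating O(K) prefix lookup."""

    def __init__(self):
        self.root = _TrieNode()

    def insert(self, tokens: list[int], state) -> None:
        # Walk/create one node per token, then attach the state at the end.
        node = self.root
        for tok in tokens:
            node = node.children.setdefault(tok, _TrieNode())
        node.state = state

    def longest_prefix(self, tokens: list[int]):
        # Follow the prompt token by token, remembering the deepest stored
        # state seen so far; stop at the first mismatch.
        node, best_len, best_state = self.root, 0, None
        for depth, tok in enumerate(tokens, start=1):
            node = node.children.get(tok)
            if node is None:
                break
            if node.state is not None:
                best_len, best_state = depth, node.state
        return best_len, best_state
```

To opt in (the RAM cache remains the default), one would presumably swap the cache on a `Llama` instance via the existing `llm.set_cache(...)` hook, assuming LlamaTrieCache follows the same cache interface as LlamaRAMCache.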