v0.3.16-cu128-AVX2-win-20251103
feat: Sync llama/mtmd API change, support clip flash-attn
feat: Update Submodule vendor/llama.cpp 76af40a..48bd265
feat: Optimize Windows CUDA Wheel Build Workflow
feat: Implement LlamaTrieCache in llama_cache.py: optimize LlamaCache lookup from O(N) to O(K) using a trie
Update: The CLIP model can now run with CUDA flash attention (it is included in the flash-attention warmup), giving faster speeds and more efficient use of GPU memory; a minimal usage sketch follows.
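A minimal usage sketch, assuming the upstream llama-cpp-python multimodal chat-handler API and the existing `flash_attn` flag on `Llama`; whether that flag also drives the new CLIP/mtmd flash-attention path in this build is an assumption, and the file paths are placeholders.

```python
from llama_cpp import Llama
from llama_cpp.llama_chat_format import Llava15ChatHandler

# Paths below are placeholders; point them at your local GGUF files.
chat_handler = Llava15ChatHandler(clip_model_path="mmproj-model-f16.gguf")

llm = Llama(
    model_path="llava-v1.5-7b-Q4_K_M.gguf",
    chat_handler=chat_handler,
    n_gpu_layers=-1,   # offload all layers to the CUDA device
    flash_attn=True,   # request flash attention (assumed to cover the CLIP warmup here)
    n_ctx=4096,
)
```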

Update: Implement LlamaTrieCache (the default remains LLAMA_RAM_CACHE).
The cache is implemented as a trie, which significantly speeds up finding the longest cached prefix compared to the original linear-scan lookup, improving service responsiveness; a sketch of the lookup idea is shown below.
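A minimal sketch of the trie-lookup idea (not the project's actual LlamaTrieCache code; the class and method names are illustrative): cached states are keyed by token-ID sequences, and a single walk down the trie finds the longest cached prefix of the prompt.

```python
from typing import Any, Dict, List, Optional, Tuple


class _TrieNode:
    __slots__ = ("children", "value")

    def __init__(self) -> None:
        self.children: Dict[int, "_TrieNode"] = {}
        self.value: Optional[Any] = None  # cached llama state stored at this prefix


class TokenPrefixTrie:
    """Longest-prefix lookup over token-ID sequences in O(K), K = len(tokens)."""

    def __init__(self) -> None:
        self.root = _TrieNode()

    def insert(self, tokens: List[int], state: Any) -> None:
        node = self.root
        for tok in tokens:
            node = node.children.setdefault(tok, _TrieNode())
        node.value = state

    def longest_prefix(self, tokens: List[int]) -> Tuple[int, Optional[Any]]:
        """Return (matched_length, state) for the deepest cached prefix of tokens."""
        node = self.root
        best_len, best_state = 0, None
        for i, tok in enumerate(tokens):
            node = node.children.get(tok)
            if node is None:
                break
            if node.value is not None:
                best_len, best_state = i + 1, node.value
        return best_len, best_state


# Example: the longest cached prefix of the new prompt is reused.
cache = TokenPrefixTrie()
cache.insert([1, 15, 7], "state_a")
cache.insert([1, 15, 7, 42, 9], "state_b")
print(cache.longest_prefix([1, 15, 7, 42, 99]))  # -> (3, 'state_a')
```

With N cached entries, a linear scan has to compare the prompt against every key, while the trie walk touches at most K nodes, matching the O(N) to O(K) improvement noted above.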
Below is a comparison of test results; the advantage becomes more pronounced as the context length increases.
