Releases: JamePeng/llama-cpp-python
v0.3.16-cu124-AVX2-linux-20251103
feat: Sync llama/mtmd API changes; support CLIP flash-attn
feat: Update Submodule vendor/llama.cpp 76af40a..48bd265
feat: Optimize Windows CUDA Wheel Build Workflow
feat: Add LlamaTrieCache to llama_cache.py: optimize LlamaCache lookup from O(N) to O(K) using a trie
Update: The CLIP model can now run with CUDA flash attention during warmup, giving faster speeds and more efficient use of GPU memory (sketch below).
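A minimal sketch of turning this on from Python, assuming the standard llama-cpp-python flash_attn constructor flag; per this release, the CLIP/mtmd warmup pass picks up flash attention internally once the flag is set, with no separate option:

```python
# Hedged sketch: enable flash attention when constructing Llama.
# With this release, the CLIP vision encoder's warmup can also run
# under CUDA flash attention; that happens internally once enabled.
from llama_cpp import Llama

llm = Llama(
    model_path="model.gguf",  # placeholder path
    n_gpu_layers=-1,          # offload all layers to CUDA
    flash_attn=True,          # flash attention for the LLM (and CLIP warmup)
)
```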

Update: Implement LlamaTrieCache (the default is still LLAMA_RAM_CACHE)
Implemented with a trie, it significantly improves longest-prefix retrieval speed over the original linear scan, enhancing service responsiveness. In comparison tests, the advantage becomes more pronounced as the context length increases; the lookup idea is sketched below.
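A minimal sketch of the trie idea, assuming cache keys are token-id sequences as in LlamaCache; the class and method names below are illustrative, not the fork's exact code:

```python
# Sketch: longest-prefix lookup in O(K) (K = key length) via a trie,
# instead of comparing the query against every cached key (O(N) scan).
from typing import Any, Dict, Optional, Sequence, Tuple


class _TrieNode:
    __slots__ = ("children", "state")

    def __init__(self) -> None:
        self.children: Dict[int, "_TrieNode"] = {}
        self.state: Optional[Any] = None  # cached llama state, if a key ends here


class TrieCacheSketch:
    def __init__(self) -> None:
        self._root = _TrieNode()

    def insert(self, key: Sequence[int], state: Any) -> None:
        node = self._root
        for tok in key:
            node = node.children.setdefault(tok, _TrieNode())
        node.state = state

    def longest_prefix(self, key: Sequence[int]) -> Tuple[int, Optional[Any]]:
        # Single walk down the trie, remembering the deepest stored state.
        node, best = self._root, (0, None)
        for i, tok in enumerate(key):
            node = node.children.get(tok)
            if node is None:
                break
            if node.state is not None:
                best = (i + 1, node.state)
        return best
```

The longer the shared context prefix, the more a single O(K) walk wins over rescanning every cached key, which matches the reported scaling with context length.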

v0.3.16-cu128-AVX2-win-20251031
New Update: Support for Qwen3VL GGUF (usage sketch below)
feat: Update README.md for Qwen3VL example (Thinking/No Thinking)
feat: Add Qwen3VLChatHandler into llama_chat_format.py
feat: Update llama.cpp api 20251031
update: Update Submodule vendor/llama.cpp 16724b5..8da3c0e
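A hedged usage sketch for the new handler, following the multimodal chat-handler pattern llama-cpp-python already uses (e.g. Llava15ChatHandler); the file names are placeholders, and the /no_think toggle follows the README's Thinking/No Thinking example:

```python
# Sketch: run a Qwen3VL GGUF with the Qwen3VLChatHandler added in this release.
# Paths below are placeholders; see the fork's README for the exact example.
from llama_cpp import Llama
from llama_cpp.llama_chat_format import Qwen3VLChatHandler

chat_handler = Qwen3VLChatHandler(clip_model_path="qwen3vl-mmproj.gguf")
llm = Llama(
    model_path="qwen3vl.gguf",
    chat_handler=chat_handler,
    n_ctx=8192,       # leave room for image tokens
    n_gpu_layers=-1,  # offload to CUDA
)
response = llm.create_chat_completion(
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": "https://example.com/cat.png"}},
                {"type": "text", "text": "Describe this image. /no_think"},
            ],
        }
    ],
)
print(response["choices"][0]["message"]["content"])
```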
v0.3.16-cu128-AVX2-linux-20251031
New Update: Support for Qwen3VL GGUF
feat: Update README.md for Qwen3VL example (Thinking/No Thinking)
feat: Add Qwen3VLChatHandler into llama_chat_format.py
feat: Update llama.cpp api 20251031
update: Update Submodule vendor/llama.cpp 16724b5..8da3c0e
v0.3.16-cu126-AVX2-win-20251031
New Update: Support for Qwen3VL GGUF
feat: Update README.md for Qwen3VL example (Thinking/No Thinking)
feat: Add Qwen3VLChatHandler into llama_chat_format.py
feat: Update llama.cpp api 20251031
update: Update Submodule vendor/llama.cpp 16724b5..8da3c0e
v0.3.16-cu126-AVX2-linux-20251031
New Update: Support for Qwen3VL GGUF
feat: Update README.md for Qwen3VL example (Thinking/No Thinking)
feat: Add Qwen3VLChatHandler into llama_chat_format.py
feat: Update llama.cpp api 20251031
update: Update Submodule vendor/llama.cpp 16724b5..8da3c0e
v0.3.16-cu124-AVX2-win-20251031
New Update: Support for Qwen3VL GGUF
feat: Update README.md for Qwen3VL example (Thinking/No Thinking)
feat: Add Qwen3VLChatHandler into llama_chat_format.py
feat: Update llama.cpp api 20251031
update: Update Submodule vendor/llama.cpp 16724b5..8da3c0e
v0.3.16-cu124-AVX2-linux-20251031
New Update: Support for Qwen3VL GGUF
feat: Update README.md for Qwen3VL example (Thinking/No Thinking)
feat: Add Qwen3VLChatHandler into llama_chat_format.py
feat: Update llama.cpp api 20251031
update: Update Submodule vendor/llama.cpp 16724b5..8da3c0e
v0.3.16-cu128-AVX2-win-20251024
feat: Add sm_87 and sm_101 compilation targets
feat: Update Submodule vendor/llama.cpp df1b612..dd62dcf
feat: Update some llama model parameters (check_tensors, use_extra_bufts, no_host)
feat: Sync model : Granite docling + Idefics3 preprocessing (SmolVLM)
feat: Sync server : context checkpointing for hybrid and recurrent models
feat: Sync llama: print memory breakdown on exit
feat: Synchronize some enum variable values
feat: Introduce image index numbers to reduce, as much as possible, hallucinations when multiple images enter the MiniCPM multimodal model series (idea sketched below)
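Illustrative only: the idea behind the index numbers is to give each image an ordinal marker in the prompt so the model can keep multiple images apart; the helper below is hypothetical, and the actual markers the handler emits may differ:

```python
# Hypothetical helper: number each image before it enters the prompt,
# so "Image 1" and "Image 2" stay distinguishable to the model.
def build_indexed_content(image_urls, question):
    content = []
    for i, url in enumerate(image_urls, start=1):
        content.append({"type": "text", "text": f"Image {i}:"})
        content.append({"type": "image_url", "image_url": {"url": url}})
    content.append({"type": "text", "text": question})
    return content
```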
v0.3.16-cu128-AVX2-linux-20251024
feat: Add sm_87 and sm_101 compilation targets
feat: Update Submodule vendor/llama.cpp df1b612..dd62dcf
feat: Update some llama model parameters (check_tensors, use_extra_bufts, no_host)
feat: Sync model : Granite docling + Idefics3 preprocessing (SmolVLM)
feat: Sync server : context checkpointing for hybrid and recurrent models
feat: Sync llama: print memory breakdown on exit
feat: Synchronize some enum variable values
feat: Introduce image index numbers to reduce, as much as possible, hallucinations when multiple images enter the MiniCPM multimodal model series
v0.3.16-cu126-AVX2-win-20251024
feat: Add sm_87 compilation target
feat: Update Submodule vendor/llama.cpp df1b612..dd62dcf
feat: Update some llama model parameters (check_tensors, use_extra_bufts, no_host)
feat: Sync model : Granite docling + Idefics3 preprocessing (SmolVLM)
feat: Sync server : context checkpointing for hybrid and recurrent models
feat: Sync llama: print memory breakdown on exit
feat: Synchronize some enum variable values
feat: Introduce image index numbers to reduce, as much as possible, hallucinations when multiple images enter the MiniCPM multimodal model series