Releases: JamePeng/llama-cpp-python

v0.3.16-cu124-AVX2-linux-20251103

03 Nov 16:33


feat: Sync llama/mtmd API changes, support clip flash-attn
feat: Update submodule vendor/llama.cpp 76af40a..48bd265
feat: Optimize the Windows CUDA wheel build workflow
feat: Implement LlamaTrieCache in llama_cache.py: optimize LlamaCache lookup from O(N) to O(K) using a trie

Update: The CLIP model can now run with CUDA flash attention, including the warmup pass.

This brings faster speeds and more efficient use of GPU memory.
[Image: clip flash-attn]
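A minimal usage sketch, assuming a LLaVA-style multimodal model; the file paths are placeholders, and whether the CLIP encoder picks up flash attention automatically when `flash_attn=True` is set depends on this build:

```python
# Sketch only: model/mmproj paths are placeholders, not shipped with this release.
from llama_cpp import Llama
from llama_cpp.llama_chat_format import Llava15ChatHandler

# The CLIP vision encoder is loaded through the chat handler's mmproj file.
chat_handler = Llava15ChatHandler(clip_model_path="mmproj-model-f16.gguf")

llm = Llama(
    model_path="llava-v1.5-7b.Q4_K_M.gguf",
    chat_handler=chat_handler,
    n_ctx=4096,
    n_gpu_layers=-1,  # offload all layers to the GPU
    flash_attn=True,  # enable flash attention; per this release the CLIP warmup can also use it (assumption)
)

# Exercising the vision encoder, which is where the clip flash-attn change applies.
out = llm.create_chat_completion(
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": "file:///path/to/image.png"}},
                {"type": "text", "text": "Describe this image."},
            ],
        }
    ]
)
print(out["choices"][0]["message"]["content"])
```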

Update: Implement LlamaTrieCache (the default remains LLAMA_RAM_CACHE)

The cache is implemented with a trie, which significantly improves retrieval of the longest matching prefix compared with the original linear scan, thereby improving service responsiveness (see the sketch after the benchmark below).
Below is a comparison of tests; the advantage becomes more pronounced as the context length increases.
[Image: LlamaTrieCache vs. linear-scan lookup benchmark]
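For illustration, a minimal sketch of the idea with hypothetical names (TokenTrie, _TrieNode), not the actual llama_cache.py implementation: cached states are keyed by their token sequences in a trie, so finding the longest cached prefix walks the query tokens once (O(K) in prompt length) instead of comparing against every stored key (O(N) in the number of entries).

```python
# Illustrative sketch of trie-based longest-prefix lookup over token ids;
# names and structure are hypothetical, not the actual llama_cache.py code.
from typing import Optional, Sequence, Tuple


class _TrieNode:
    __slots__ = ("children", "key")

    def __init__(self) -> None:
        self.children: dict = {}  # token id -> child node
        self.key: Optional[Tuple[int, ...]] = None  # set when a cached entry ends here


class TokenTrie:
    """Maps token sequences to cached states, with longest-prefix lookup in O(K)."""

    def __init__(self) -> None:
        self.root = _TrieNode()
        self.states: dict = {}  # full key tuple -> cached state

    def insert(self, tokens: Sequence[int], state: object) -> None:
        node = self.root
        for tok in tokens:
            node = node.children.setdefault(tok, _TrieNode())
        node.key = tuple(tokens)
        self.states[node.key] = state

    def longest_prefix(self, tokens: Sequence[int]) -> Optional[object]:
        """Walk the query tokens once and return the state of the longest cached prefix."""
        node, best = self.root, None
        for tok in tokens:
            node = node.children.get(tok)
            if node is None:
                break
            if node.key is not None:
                best = node.key
        return self.states[best] if best is not None else None


# Example: the second prompt reuses the state cached for the shared prefix.
cache = TokenTrie()
cache.insert([1, 15, 27, 99], state="kv-state-A")
assert cache.longest_prefix([1, 15, 27, 99, 42, 7]) == "kv-state-A"
```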

v0.3.16-cu128-AVX2-win-20251031

31 Oct 18:37


v0.3.16-cu128-AVX2-linux-20251031

31 Oct 16:12


v0.3.16-cu126-AVX2-win-20251031

31 Oct 19:37


v0.3.16-cu126-AVX2-linux-20251031

31 Oct 15:47


v0.3.16-cu124-AVX2-win-20251031

31 Oct 19:34


v0.3.16-cu124-AVX2-linux-20251031

31 Oct 15:47


v0.3.16-cu128-AVX2-win-20251024

v0.3.16-cu128-AVX2-linux-20251024

v0.3.16-cu126-AVX2-win-20251024