
Releases: JamePeng/llama-cpp-python

v0.3.16-cu128-AVX2-linux-20251108

08 Nov 05:15
3d96053


feat: Update Submodule vendor/llama.cpp 48bd265..299f5d7
feat: Update llama.cpp API and supplement the State/sessions API
feat: Better Qwen3VL chat template (thanks to @alcoftTAO)

Note: llama_chat_template now allows more flexible input of the parameters a model requires and supports applying more complex Jinja formats.
The initialization parameters of Qwen3VLChatHandler have changed: "use_think_prompt" has been renamed to "force_reasoning".
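For illustration only, a minimal sketch of the renamed flag is shown below; the import path, file names, and the other constructor arguments are assumptions, and only the use_think_prompt → force_reasoning rename comes from these notes.

```python
# Hypothetical usage sketch: paths and extra arguments are placeholders.
from llama_cpp import Llama
from llama_cpp.llama_chat_format import Qwen3VLChatHandler  # assumed import path

chat_handler = Qwen3VLChatHandler(
    clip_model_path="mmproj-qwen3-vl.gguf",  # placeholder mmproj file
    force_reasoning=True,                    # previously: use_think_prompt=True
)

llm = Llama(
    model_path="qwen3-vl-instruct.gguf",     # placeholder model file
    chat_handler=chat_handler,
    n_ctx=8192,
)
```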

v0.3.16-cu126-AVX2-win-20251108

08 Nov 08:47
3d96053


feat: Update Submodule vendor/llama.cpp 48bd265..299f5d7
feat: Update llama.cpp API and supplement the State/sessions API
feat: Better Qwen3VL chat template (thanks to @alcoftTAO)

Note: llama_chat_template now allows more flexible input of the parameters a model requires and supports applying more complex Jinja formats.
The initialization parameters of Qwen3VLChatHandler have changed: "use_think_prompt" has been renamed to "force_reasoning".

v0.3.16-cu126-AVX2-linux-20251108

08 Nov 04:55
3d96053


feat: Update Submodule vendor/llama.cpp 48bd265..299f5d7
feat: Update llama.cpp API and supplement the State/sessions API
feat: Better Qwen3VL chat template (thanks to @alcoftTAO)

Note: llama_chat_template now allows more flexible input of the parameters a model requires and supports applying more complex Jinja formats.
The initialization parameters of Qwen3VLChatHandler have changed: "use_think_prompt" has been renamed to "force_reasoning".

v0.3.16-cu124-AVX2-win-20251108

08 Nov 08:33
3d96053


feat: Update Submodule vendor/llama.cpp 48bd265..299f5d7
feat: Update llama.cpp API and supplement the State/sessions API
feat: Better Qwen3VL chat template (thanks to @alcoftTAO)

Note: llama_chat_template now allows more flexible input of the parameters a model requires and supports applying more complex Jinja formats.
The initialization parameters of Qwen3VLChatHandler have changed: "use_think_prompt" has been renamed to "force_reasoning".

v0.3.16-cu124-AVX2-linux-20251108

08 Nov 04:53
3d96053


feat: Update Submodule vendor/llama.cpp 48bd265..299f5d7
feat: Update llama.cpp API and supplement the State/sessions API
feat: Better Qwen3VL chat template (thanks to @alcoftTAO)

Note: llama_chat_template now allows more flexible input of the parameters a model requires and supports applying more complex Jinja formats.
The initialization parameters of Qwen3VLChatHandler have changed: "use_think_prompt" has been renamed to "force_reasoning".

v0.3.16-cu128-AVX2-win-20251103

03 Nov 20:18


feat: Sync llama/mtmd API change, support clip flash-attn
feat: Update Submodule vendor/llama.cpp 76af40a..48bd265
feat: Optimize Windows CUDA Wheel Build Workflow
feat: Implement LlamaTrieCache in llama_cache.py: optimize LlamaCache lookup from O(N) to O(K) using a trie

Update: the CLIP model can now run with CUDA flash attention (clip flash-attn), warmed up at load time.

Faster speeds and more efficient use of GPU memory.
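A hedged configuration sketch follows. It assumes the existing flash_attn flag on the Llama constructor is also what enables flash attention for the CLIP/mmproj path; whether that flag is required or the behavior is automatic is not stated in these notes, and the handler and file names are placeholders.

```python
# Hypothetical sketch: assumes the flash_attn constructor flag also covers
# the CLIP/mmproj path in this release; file names are placeholders.
from llama_cpp import Llama
from llama_cpp.llama_chat_format import Llava15ChatHandler

llm = Llama(
    model_path="vision-model.gguf",  # placeholder model file
    chat_handler=Llava15ChatHandler(clip_model_path="mmproj.gguf"),  # placeholder
    n_gpu_layers=-1,   # offload all layers to the GPU
    flash_attn=True,   # flash attention for the text model; assumed to extend
                       # to CLIP image encoding as of this release
)
```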

Update: Implemented LlamaTrieCache (the default is still LLAMA_RAM_CACHE).

Implemented with a trie, it finds the longest cached prefix significantly faster than the original linear-scan lookup, improving service responsiveness.
A test comparison is shown below; the advantage becomes more pronounced as the context length increases.
[Benchmark chart: lookup time, LlamaTrieCache vs. linear scan]
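To illustrate the idea behind the trie-based lookup (this is a sketch, not the actual LlamaTrieCache code), cached states can be keyed by token prefixes and the longest matching prefix found by walking a trie in O(K) for a query of K tokens, instead of scanning every cached key:

```python
# Minimal illustration of a trie-based prefix cache; names are hypothetical.
from typing import Any, Optional, Sequence, Tuple


class _TrieNode:
    __slots__ = ("children", "state")

    def __init__(self) -> None:
        self.children: dict = {}  # token id -> child node
        self.state: Any = None    # cached state stored at this prefix, if any


class TokenPrefixTrie:
    def __init__(self) -> None:
        self.root = _TrieNode()

    def insert(self, tokens: Sequence[int], state: Any) -> None:
        node = self.root
        for tok in tokens:
            node = node.children.setdefault(tok, _TrieNode())
        node.state = state

    def longest_prefix(self, tokens: Sequence[int]) -> Tuple[int, Optional[Any]]:
        # Walk the trie once: O(K) in the query length K, independent of how
        # many prefixes are cached, unlike a linear scan over every key.
        node, best_len, best_state = self.root, 0, None
        for i, tok in enumerate(tokens):
            node = node.children.get(tok)
            if node is None:
                break
            if node.state is not None:
                best_len, best_state = i + 1, node.state
        return best_len, best_state
```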

v0.3.16-cu128-AVX2-linux-20251103

03 Nov 16:55


feat: Sync llama/mtmd API change, support clip flash-attn
feat: Update Submodule vendor/llama.cpp 76af40a..48bd265
feat: Optimize Windows CUDA Wheel Build Workflow
feat: Implement LlamaTrieCache in llama_cache.py: optimize LlamaCache lookup from O(N) to O(K) using a trie

Update: the CLIP model can now run with CUDA flash attention (clip flash-attn), warmed up at load time.

Faster speeds and more efficient use of GPU memory.

Update: Implemented LlamaTrieCache (the default is still LLAMA_RAM_CACHE).

Implemented with a trie, it finds the longest cached prefix significantly faster than the original linear-scan lookup, improving service responsiveness.
A test comparison is shown below; the advantage becomes more pronounced as the context length increases.
[Benchmark chart: lookup time, LlamaTrieCache vs. linear scan]

v0.3.16-cu126-AVX2-win-20251103

03 Nov 20:24


feat: Sync llama/mtmd API change, support clip flash-attn
feat: Update Submodule vendor/llama.cpp 76af40a..48bd265
feat: Optimize Windows CUDA Wheel Build Workflow
feat: Implement LlamaTrieCache in llama_cache.py: optimize LlamaCache lookup from O(N) to O(K) using a trie

Update: the CLIP model can now run with CUDA flash attention (clip flash-attn), warmed up at load time.

Faster speeds and more efficient use of GPU memory.

Update: Implemented LlamaTrieCache (the default is still LLAMA_RAM_CACHE).

Implemented with a trie, it finds the longest cached prefix significantly faster than the original linear-scan lookup, improving service responsiveness.
A test comparison is shown below; the advantage becomes more pronounced as the context length increases.
[Benchmark chart: lookup time, LlamaTrieCache vs. linear scan]

v0.3.16-cu126-AVX2-linux-20251103

03 Nov 16:32


feat: Sync llama/mtmd API change, support clip flash-attn
feat: Update Submodule vendor/llama.cpp 76af40a..48bd265
feat: Optimize Windows CUDA Wheel Build Workflow
feat: Implement LlamaTrieCache in llama_cache.py: optimize LlamaCache lookup from O(N) to O(K) using a trie

Update: the CLIP model can now run with CUDA flash attention (clip flash-attn), warmed up at load time.

Faster speeds and more efficient use of GPU memory.

Update: Implemented LlamaTrieCache (the default is still LLAMA_RAM_CACHE).

Implemented with a trie, it finds the longest cached prefix significantly faster than the original linear-scan lookup, improving service responsiveness.
A test comparison is shown below; the advantage becomes more pronounced as the context length increases.
[Benchmark chart: lookup time, LlamaTrieCache vs. linear scan]

v0.3.16-cu124-AVX2-win-20251103

03 Nov 20:20


feat: Sync llama/mtmd API change, support clip flash-attn
feat: Update Submodule vendor/llama.cpp 76af40a..48bd265
feat: Optimize Windows CUDA Wheel Build Workflow
feat: Implement LlamaTrieCache in llama_cache.py: optimize LlamaCache lookup from O(N) to O(K) using a trie

Update: the CLIP model can now run with CUDA flash attention (clip flash-attn), warmed up at load time.

Faster speeds and more efficient use of GPU memory.

Update: Implemented LlamaTrieCache (the default is still LLAMA_RAM_CACHE).

Implemented with a trie, it finds the longest cached prefix significantly faster than the original linear-scan lookup, improving service responsiveness.
A test comparison is shown below; the advantage becomes more pronounced as the context length increases.
[Benchmark chart: lookup time, LlamaTrieCache vs. linear scan]