Releases · mofosyne/llama.cpp
b4622
server : (webui) Fix Shift+Enter handling (#11609)
- Fix Shift+Enter handling: `exact` on the Enter handler means the message is not sent when Shift+Enter is pressed anyway
- build index.html.gz
Co-authored-by: Xuan Son Nguyen <[email redacted]>
b3563
Add support for encoder-only T5 models (#8900)
- gguf-py : add T5ENCODER model architecture (see the sketch after this entry)
- common : call llama_decode() during warmup only if the model has a decoder
- convert-hf : add T5EncoderModel
- llama : add llama_model_has_decoder() API function
- llama : split build_t5() into build_t5_encoder() and build_t5_decoder()
- llama : add support for LLM_ARCH_T5ENCODER
- llama-embedding : add support for LLAMA_POOLING_TYPE_NONE
- llama-embedding : add support for encoder-only models
Co-authored-by: Stanisław Szymczyk <[email redacted]>
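The new architecture is recorded in the GGUF metadata, so downstream tooling can spot encoder-only models before loading them. Below is a minimal sketch, assuming the gguf-py `GGUFReader` API and an illustrative file path; at runtime, the new `llama_model_has_decoder()` answers the same question on the C API side:

```python
# Sketch: check whether a GGUF file uses the encoder-only T5 architecture.
# Assumes the gguf-py GGUFReader API; "t5-encoder.gguf" is an illustrative path.
from gguf import GGUFReader

reader = GGUFReader("t5-encoder.gguf")
field = reader.get_field("general.architecture")

# String KV values are stored as raw bytes in the field's parts.
arch = bytes(field.parts[field.data[-1]]).decode("utf-8")

if arch == "t5encoder":
    # Encoder-only model: run llama_encode() only;
    # llama_model_has_decoder() reports false for these.
    print("encoder-only T5 model")
else:
    print(f"architecture: {arch}")
```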
b3524
cann: fix buffer_num error and slow runtime speed (#8865)
b3521
py: Add more authorship metadata from model card (#8810)
b3499
cann: support q8_0 for Ascend backend (#8805)
b3491
flake.lock: Update (#8729)
b3445
Vulkan IQ4_NL Support (#8613)
- Fix Vulkan matmul tests compile errors
- Add Vulkan IQ4_NL support
- Fix Vulkan DeepSeek-Coder-V2-Lite MoE support
b3431
examples : Rewrite pydantic_models_to_grammar_examples.py (#8493)
Changes:
- Move each example into its own function. This makes the code much
easier to read and understand.
- Make it easy to run only one test by commenting out function calls
  in main().
- Make the output easy to parse by indenting the output for each example.
- Add shebang and +x bit to make it clear it's an executable.
- Make the host configurable via --host with a default 127.0.0.1:8080.
- Make the code look up the registered tool in the tools list instead of
  hardcoding the returned values. This makes the code more copy-pastable
  (see the sketch after this entry).
- Add error checking, so that the program exits with 1 if the LLM didn't
  return the expected values. This is very useful for checking correctness.
Testing:
- Tested with Mistral-7B-Instruct-v0.3 in F16 and Q5_K_M and
Meta-Llama-3-8B-Instruct in F16 and Q5_K_M.
- I did not observe a failure even once in Mistral-7B-Instruct-v0.3.
- Llama-3 failed about a third of the time in example_concurrent: it
  only returned one call instead of three, even for F16.
Potential follow-ups:
- Do not fix the prompt encoding yet; surprisingly, it mostly works even
  when the prompt encoding is not model-optimized.
- Add chained answer and response.
Test-only change.
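The rewrite's two main mechanics, the `--host` flag and registry-based tool dispatch, can be illustrated with a short sketch. The registry and helper names below are hypothetical, not the identifiers used in the actual script:

```python
#!/usr/bin/env python3
# Hypothetical sketch of the --host flag and registry-based tool dispatch;
# get_current_weather and TOOLS are illustrative names, not from the script.
import argparse


def get_current_weather(location: str) -> str:
    return f"sunny in {location}"  # stand-in tool implementation


# Look the tool up in a registry instead of hardcoding returned values,
# so the dispatch code stays copy-pastable.
TOOLS = {"get_current_weather": get_current_weather}


def call_tool(name: str, arguments: dict) -> str:
    if name not in TOOLS:
        raise SystemExit(1)  # exit 1 when the LLM returned an unexpected call
    return TOOLS[name](**arguments)


def main() -> None:
    parser = argparse.ArgumentParser()
    parser.add_argument("--host", default="127.0.0.1:8080",
                        help="llama-server address as host:port")
    args = parser.parse_args()
    print(f"targeting http://{args.host}")
    print(call_tool("get_current_weather", {"location": "Paris"}))


if __name__ == "__main__":
    main()
```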
b3424
gguf_dump.py: fix markdown kv array print (#8588)
- Update gguf-py/scripts/gguf_dump.py
- gguf_dump.py: refactor kv array string handling
- gguf_dump.py: escape backticks inside of strings
- gguf_dump.py: add inline code markdown escape handler (see the sketch below)
  >>> escape_markdown_inline_code("hello world")
  '`hello world`'
  >>> escape_markdown_inline_code("hello ` world")
  '``hello ` world``'
- gguf_dump.py: handle the edge case of backticks at the start or end of a string
Co-authored-by: compilade <[email redacted]>
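The doctests above pin down the escape handler's behavior. Here is a minimal reimplementation consistent with them, offered as a sketch; the shipped function lives in gguf-py/scripts/gguf_dump.py and may differ in detail:

```python
def escape_markdown_inline_code(value: str) -> str:
    # Use a double-backtick delimiter when the value itself contains a
    # backtick (a fuller version would widen the delimiter past the
    # longest backtick run in the value).
    delimiter = "``" if "`" in value else "`"
    # Edge case: a leading or trailing backtick would merge with the
    # delimiter, so pad the content with spaces.
    if value.startswith("`") or value.endswith("`"):
        value = f" {value} "
    return f"{delimiter}{value}{delimiter}"


# Matches the doctests above:
assert escape_markdown_inline_code("hello world") == "`hello world`"
assert escape_markdown_inline_code("hello ` world") == "``hello ` world``"
```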
b3418
fix: typo in chatglm4 chat template (#8586) Signed-off-by: thxCode <[email redacted]>