Conversation

@apicalshark
Owner

Make sure to read the contributing guidelines before submitting a PR

ngxson and others added 30 commits January 2, 2025 15:05
* slot.can_batch_with

* lora per request

* test: force disable cache prompt

* move can_batch_with check

* fix condition

* add slow test with llama 8b

* update docs

* move lora change task to queue

* Apply suggestions from code review

Co-authored-by: Georgi Gerganov <[email protected]>

* lora_base

* remove redundant check

---------

Co-authored-by: Georgi Gerganov <[email protected]>
* server/bench:
- support the OpenAI streaming standard output terminated with `[DONE]\n\n` (a consumer-side sketch follows these bench notes)
- export k6 raw results in CSV
- fix too many idle TCP connections in tcp_wait
- add a metric for time to emit the first token

* server/bench:
- fix the case where Prometheus is not started
- wait for the server to be ready before starting the bench
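The `[DONE]\n\n` terminator mentioned above follows the OpenAI server-sent-events streaming convention: each chunk arrives as a `data: {...}` line and the stream ends with a literal `data: [DONE]` line. A minimal consumer-side sketch in C++ (a hypothetical helper for illustration; the actual bench client is a k6 script):

```cpp
#include <iostream>
#include <string>

// Returns true when the SSE line signals end-of-stream per the OpenAI
// streaming convention; otherwise prints the JSON payload of the chunk.
static bool handle_sse_line(const std::string & line) {
    const std::string prefix = "data: ";
    if (line.rfind(prefix, 0) != 0) {
        return false; // not a data line (e.g. an empty keep-alive line)
    }
    const std::string payload = line.substr(prefix.size());
    if (payload == "[DONE]") {
        return true;  // terminator: "data: [DONE]\n\n"
    }
    std::cout << "chunk: " << payload << "\n"; // streamed JSON delta
    return false;
}

int main() {
    std::string line;
    while (std::getline(std::cin, line)) {
        if (handle_sse_line(line)) break;
    }
    return 0;
}
```

The bench only needs to detect the terminator to stop the clock for the streaming metrics; per-chunk JSON parsing is left out of this sketch.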
* llama : scatter llama.cpp into multiple modules (wip)

* llama : control-vector -> adapter

* llama : arch

* llama : mmap

ggml-ci

* ci : remove BUILD_SHARED_LIBS=OFF

ggml-ci

* llama : arch (cont)

ggml-ci

* llama : chat

ggml-ci

* llama : model

ggml-ci

* llama : hparams

ggml-ci

* llama : adapter

ggml-ci

* examples : fix

ggml-ci

* rebase

ggml-ci

* minor

* llama : kv cache

ggml-ci

* llama : impl

ggml-ci

* llama : batch

ggml-ci

* cont

ggml-ci

* llama : context

ggml-ci

* minor

* llama : context (cont)

ggml-ci

* llama : model loader

ggml-ci

* common : update lora

ggml-ci

* llama : quant

ggml-ci

* llama : quant (cont)

ggml-ci

* minor [no ci]
…ls (ggml-org#11053)

* Disable KV cache shifting automatically for unsupported models

instead of exiting directly

Signed-off-by: Molly Sophia <[email protected]>

* Update common/common.cpp

Co-authored-by: Georgi Gerganov <[email protected]>

---------

Signed-off-by: Molly Sophia <[email protected]>
Co-authored-by: Georgi Gerganov <[email protected]>
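The commit above turns an abort into a graceful fallback: when the loaded model cannot shift its KV cache, context shifting is switched off with a warning instead of exiting. A minimal sketch of that check-and-disable pattern, with hypothetical flag and helper names (the real change touches common/common.cpp):

```cpp
#include <cstdio>

// Hypothetical parameter struct, modelled loosely on the common parameters.
struct kv_params {
    bool ctx_shift = true; // user-requested KV cache (context) shifting
};

// Hypothetical capability probe; the real check queries the loaded model,
// since some architectures cannot shift their KV cache.
static bool model_supports_kv_shift() {
    return false;
}

// Instead of aborting when shifting is unsupported, log a warning and fall
// back to running without context shifting.
static void apply_ctx_shift_policy(kv_params & params) {
    if (params.ctx_shift && !model_supports_kv_shift()) {
        fprintf(stderr, "warning: KV cache shifting is not supported by this model, disabling it\n");
        params.ctx_shift = false; // disable instead of exiting
    }
}

int main() {
    kv_params params;
    apply_ctx_shift_policy(params);
    printf("ctx_shift = %s\n", params.ctx_shift ? "enabled" : "disabled");
    return 0;
}
```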
This commit attempts to improve the log message for the inputs of the
splits in the sched_print_assignments function.

The motivation for this change is that currently a colon is displayed at
the end of the line even if there are no inputs, which can be a little
confusing when reading the output, as the lines below could be mistaken
for inputs when they are in fact nodes. With this change the colon is only
printed if there actually are inputs.
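A minimal sketch of that conditional formatting, with hypothetical names for the split structure (the real code lives in ggml's scheduler and is not reproduced here):

```cpp
#include <cstdio>

struct split_t {
    int n_inputs;            // number of input tensors feeding this split
    const char * backend;    // backend the split is assigned to
};

// Print the split header; the trailing "inputs:" part is emitted only when
// the split actually has inputs, so a bare colon no longer appears and the
// following node lines cannot be mistaken for inputs.
static void print_split(int i, const split_t & split) {
    if (split.n_inputs > 0) {
        printf("split %d (%s), %d inputs:\n", i, split.backend, split.n_inputs);
    } else {
        printf("split %d (%s)\n", i, split.backend);
    }
}

int main() {
    const split_t splits[] = { { 2, "CUDA0" }, { 0, "CPU" } };
    for (int i = 0; i < 2; ++i) {
        print_split(i, splits[i]);
    }
    return 0;
}
```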
…ggml-org#11047)

* Added init tensor calling code

* Added get_alloc_size forwarding

* Cleaned up and improved type/error handling.

* fix: remove trailing whitespaces.

* Cleanup and use GGML error logging functions.

* Handle potentially dangerous edge cases.

* Apply suggestions from code review

Co-authored-by: Diego Devesa <[email protected]>

---------

Co-authored-by: Diego Devesa <[email protected]>
* convert : extend DEEPSEEK2 model architecture to support DeepseekV3ForCausalLM by adding EXPERT_WEIGHTS_NORM and EXPERT_GATING_FUNC model parameters and FFN_EXP_PROBS_B tensor type

* vocab : add DeepSeek V3 pre-tokenizer regexes

* unicode : handle ACCENT_MARK and SYMBOL categories in regex

* llama : add DeepSeek V3 chat template, handle new model parameters and tensor types

---------

Co-authored-by: Stanisław Szymczyk <[email protected]>
…tary driver (ggml-org#11074)

* Vulkan: Add device-specific blacklist for coopmat for the AMD proprietary driver

* Add (TM) to AMD name check
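A minimal sketch of the kind of device-name check described above, with hypothetical helper and parameter names (the real blacklist inspects the Vulkan device and driver properties):

```cpp
#include <iostream>
#include <string>

// Hypothetical helper: decide whether cooperative matrix (coopmat) support
// should be disabled for a device. The "(TM)" substring covers device names
// such as "AMD Radeon(TM) ..." reported by the AMD proprietary driver.
static bool coopmat_blacklisted(const std::string & device_name, bool proprietary_amd_driver) {
    const bool amd_name = device_name.find("AMD") != std::string::npos ||
                          device_name.find("Radeon(TM)") != std::string::npos;
    return amd_name && proprietary_amd_driver;
}

int main() {
    std::cout << coopmat_blacklisted("AMD Radeon(TM) RX 7900 XTX", true)  << "\n"; // 1: disable coopmat
    std::cout << coopmat_blacklisted("NVIDIA GeForce RTX 4090",   false) << "\n"; // 0: keep coopmat
    return 0;
}
```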
* mmap : fix fileno macro clash

ggml-ci

* cont

ggml-ci
* tokenize : escape the prompt

* tokenize : update help
* llama : deprecate llama_free_model, add llama_model_free

ggml-ci

* llama : change `llama_load_model_from_file` -> `llama_model_load_from_file`

ggml-ci
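A minimal before/after sketch of the renamed entry points (the model path is a placeholder; `llama_model_default_params()` is the existing helper for default model parameters):

```cpp
#include "llama.h"

int main() {
    llama_model_params mparams = llama_model_default_params();

    // before this change (now deprecated):
    //   llama_model * model = llama_load_model_from_file("model.gguf", mparams);
    //   ...
    //   llama_free_model(model);

    // after this change:
    llama_model * model = llama_model_load_from_file("model.gguf", mparams);
    if (model == NULL) {
        return 1;
    }

    // ... create a context, run inference, etc. ...

    llama_model_free(model);
    return 0;
}
```

Since the commit says "deprecate" rather than "remove", the old names presumably remain available as deprecated aliases while callers migrate.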
This commit renames the `batch` parameter to `ubatch` in the
`llama_kv_cache_find_slot`, `llm_build_inp_embd`, and
`llm_build_mamba` functions.

The motivation is that this should have been done as part of commit
19d900a ("llama : rename batch to ubatch (ggml-org#9950)"), but for some
reason I missed these functions in that commit and only noticed them now
(sorry).
* server : fix extra BOS in infill endpoint

ggml-ci

* server : update infill tests
* github : cmd line to bug report

* codeowners : (@ngxson) only watch dockerfile

* Apply suggestions from code review [no ci]

Co-authored-by: Johannes Gäßler <[email protected]>

* rm cmd in log output [no ci]

* rm 2 [no ci]

* no need backticks [no ci]

---------

Co-authored-by: Johannes Gäßler <[email protected]>
Set `n_ctx` equal to `n_batch` in the `Opt` class. The context size is now
a more reasonable 2048.

Signed-off-by: Eric Curtin <[email protected]>
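A minimal sketch of that default, with a hypothetical `Opt`-like struct (the real class and its other fields are not reproduced here):

```cpp
#include <cstdio>

// Hypothetical Opt-like defaults: the context size follows the batch size
// instead of a smaller hard-coded value, giving a 2048-token context.
struct Opt {
    int n_batch = 2048;
    int n_ctx   = n_batch; // set n_ctx equal to n_batch
};

int main() {
    Opt opt;
    printf("n_ctx = %d, n_batch = %d\n", opt.n_ctx, opt.n_batch);
    return 0;
}
```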
…ml-org#11087)

* SYCL: Use get_multi_ptr instead of deprecated get_pointer in wkv6

* Revert "SYCL: Use get_multi_ptr instead of deprecated get_pointer in wkv6"

This reverts commit f62dc45.

* Reland: Use get_multi_ptr instead of deprecated get_pointer in wkv6
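SYCL 2020 deprecates `accessor::get_pointer()` in favour of `get_multi_ptr()`. A minimal standalone sketch of the accessor migration (a toy kernel for illustration, not the actual wkv6 kernel):

```cpp
#include <sycl/sycl.hpp>

int main() {
    sycl::queue q;
    float data[16] = {0};
    {
        sycl::buffer<float, 1> buf(data, sycl::range<1>(16));
        q.submit([&](sycl::handler & cgh) {
            sycl::local_accessor<float, 1> lmem(sycl::range<1>(16), cgh);
            sycl::accessor acc(buf, cgh, sycl::read_write);
            cgh.parallel_for(sycl::nd_range<1>(sycl::range<1>(16), sycl::range<1>(16)),
                             [=](sycl::nd_item<1> it) {
                // deprecated in SYCL 2020:
                //   float * p = lmem.get_pointer();
                // replacement, as used by the commit above:
                float * p = lmem.get_multi_ptr<sycl::access::decorated::no>().get();
                const size_t i = it.get_local_id(0);
                p[i]   = static_cast<float>(i); // stage values in local memory
                acc[i] = p[i];                  // write back to the buffer
            });
        });
        q.wait();
    }
    return 0;
}
```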
Remove duplicated macros, use GGML_LOG_ERROR for errors
samkoesnadi and others added 13 commits February 5, 2025 10:45
…-org#11644)

* Added quantization for visual projector

* Added README

* Fixed the clip quantize implementation in the file

* Fixed a minor gcc lint warning

* Removed trailing whitespace
Autopen (https://github.com/blackhole89/autopen) is a graphical text editor that uses llama.cpp to tokenize the buffer on the fly, score the buffer, visualise token logits, and let you switch back and forth between different possible completions at any point. It hopefully meets the criteria for inclusion, as the dependency on llama.cpp is stated prominently.
* vulkan: optimize coopmat2 iq2/iq3 callbacks

* build: trigger CI on GLSL compute shader changes
SYCL does not support non-contiguous tensors for norm operations
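A sketch of the kind of guard this implies, using ggml's public contiguity check (the surrounding SYCL kernel code is omitted and the function name is hypothetical):

```cpp
#include "ggml.h"

// Hypothetical guard: norm-style ops in the SYCL backend assume contiguous
// inputs, so reject non-contiguous tensors up front rather than computing
// wrong results.
static void sycl_norm_check_contiguous(const struct ggml_tensor * src) {
    GGML_ASSERT(ggml_is_contiguous(src) && "SYCL norm ops require contiguous input");
}
```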
* docs: update fedora cuda guide for 12.8 release

* docs: build cuda update
@apicalshark apicalshark merged commit 11d8995 into master Feb 6, 2025
2 of 9 checks passed
@apicalshark apicalshark deleted the main branch February 6, 2025 16:30