Conversation

l3utterfly (Owner)

No description provided.

ggerganov and others added 30 commits March 18, 2025 13:05
… (ggml-org#12447)

* context : always use non-causal attention for encoder graphs

ggml-ci

* context : move the change to llama_context::encode()

ggml-ci

* graph : normalize Q, K, V shapes and add comments

ggml-ci

* context : synchronize before getting cross attention data

* model : fix command-r attention norm check

* opencl: more profiling timing

* opencl: generate trace for profiling

* opencl: reduce profiling overhead

* Populate profiling timing info at the end rather than after each kernel run

* opencl: fix for chrome tracing

I've been seeing significantly worse performance for tg (token generation) with flash attention
enabled vs. disabled, and it seems to be related to the submit heuristic.
Change the heuristic to check how many bytes' worth of weight matrices have been
used and flush every 100 MB, ramping up after the first few submits.
This seems to resolve the issue, and also increases perf for non-FA a bit.
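
A minimal C++ sketch of this kind of heuristic, under the assumption that the backend can estimate the weight-matrix bytes each node touches (the structure, names, and thresholds are illustrative, not the actual ggml Vulkan backend code):

```cpp
#include <algorithm>
#include <cstddef>

// Hypothetical byte-count submit heuristic: track how many bytes of weight
// matrices the recorded nodes consume, and submit the command buffer once a
// threshold is crossed. The threshold ramps up over the first few submits so
// that early work reaches the GPU quickly.
struct submit_tracker {
    size_t bytes_since_submit = 0;
    int    num_submits        = 0;

    // returns true when the caller should submit the pending work
    bool record(size_t weight_bytes) {
        bytes_since_submit += weight_bytes;

        const size_t MB = 1024u * 1024u;
        // ramp: 10 MB, 20 MB, 40 MB, 80 MB, then capped at 100 MB
        const size_t ramp      = (size_t) 10 * MB << std::min(num_submits, 4);
        const size_t threshold = std::min<size_t>(ramp, 100 * MB);

        if (bytes_since_submit >= threshold) {
            bytes_since_submit = 0;
            num_submits++;
            return true;
        }
        return false;
    }
};
```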

… (ggml-org#12456)

* Add support for GPT2, Bloom and CodeShell tied word embeddings

* Deduplicate tied word embeddings weights

* Workaround for incorrect weight map

It appears transformer.wte.weight is present in the weight map even though the actual weights are not there, so remove it if the output weights are encountered first.

* check++

* fatfingers--

* ci: add visionOS build workflow

Add a new GitHub Actions workflow for building on visionOS with CMake and Xcode.

* ggml: Define _DARWIN_C_SOURCE for visionOS to fix missing u_xxx typedefs (see the snippet after this commit message)

* ci: remove define hacks for u_xxx system types

---------

Co-authored-by: Giovanni Petrantoni <[email protected]>
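
For background: on Darwin platforms, sys/types.h only exposes the BSD u_xxx typedefs (u_char, u_int, …) when no strict POSIX feature-test macro is in effect or _DARWIN_C_SOURCE is defined. A minimal illustration (in a real build this is a compile definition rather than a source-level define):

```cpp
// Define _DARWIN_C_SOURCE before any system header so that sys/types.h
// provides the BSD typedefs even under strict feature-test settings.
#define _DARWIN_C_SOURCE
#include <sys/types.h>

u_char byte_val = 0; // these typedefs are hidden when a strict POSIX
u_int  count    = 0; // mode is requested without _DARWIN_C_SOURCE
```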

… (ggml-org#12183)

- Determine the number of active blocks per SM using the cudaOccupancyMaxActiveBlocksPerMultiprocessor API, and use this value to pick the optimal parallel_blocks value (see the sketch after this commit message).
- Prefer the vector flash attention kernels over the MMA kernel for BS=1.

Fixes Issue: ggml-org#12182

---------

Co-authored-by: Johannes Gäßler <[email protected]>
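
A small CUDA C++ sketch of how the occupancy API can drive such a choice (the placeholder kernel and the policy for deriving parallel_blocks are illustrative, not the actual flash attention code):

```cpp
#include <cuda_runtime.h>
#include <cstdio>

// Placeholder kernel standing in for a flash attention kernel.
__global__ void fattn_kernel() {}

int main() {
    int blocks_per_sm = 0;
    const int    block_size = 256; // threads per block (assumed)
    const size_t dyn_smem   = 0;   // dynamic shared memory per block

    // Ask the runtime how many blocks of this kernel can be resident on one SM.
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(
        &blocks_per_sm, fattn_kernel, block_size, dyn_smem);

    int num_sms = 0;
    cudaDeviceGetAttribute(&num_sms, cudaDevAttrMultiProcessorCount, /*device=*/0);

    // Illustrative policy: fill one full wave of the GPU.
    const int parallel_blocks = blocks_per_sm * num_sms;
    printf("blocks/SM = %d, SMs = %d -> parallel_blocks = %d\n",
           blocks_per_sm, num_sms, parallel_blocks);
    return 0;
}
```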

tokenizer.added_tokens_decoder returns a fresh dict every time, and relatively slowly (~0.04 s on average), which results in massive slowdowns when there is a huge number of added tokens.

…architecture (ggml-org#12332)

* Add block interleaving support for Q4_K quantization (see the sketch after this commit message)

* Remove whitespace and fix CI/CD issues

* Update pointer of bsums from int16_t to const int16_t

* Add vector version of quantize_q8_K_4x8 function

* Update code formatting based on review comments
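
A rough C++ illustration of the block-interleaving idea (a hypothetical 4x8 layout, not the actual Q4_K repacking code): quant bytes from four row blocks are rearranged chunk by chunk so that one SIMD load covers the same chunk position of all four rows.

```cpp
#include <cstdint>
#include <cstring>

// Hypothetical 4-row x 8-byte interleave: after repacking, dst holds chunk 0
// of rows 0..3, then chunk 1 of rows 0..3, and so on, so a vector kernel can
// process four rows per load sequence instead of one.
void interleave_rows_4x8(const uint8_t src[4][32], uint8_t dst[4 * 32]) {
    const int chunk_size = 8;
    const int n_chunks   = 32 / chunk_size; // 4 chunks per row block
    for (int c = 0; c < n_chunks; ++c) {
        for (int r = 0; r < 4; ++r) {
            std::memcpy(dst + (c*4 + r)*chunk_size, src[r] + c*chunk_size, chunk_size);
        }
    }
}
```
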
* webui: Make textarea uncontrolled to eliminate devastating lag

* Update index.html.gz

* use signal-style implementation

* rm console log

* no duplicated savedInitValue set

---------

Co-authored-by: Xuan Son Nguyen <[email protected]>

… (ggml-org#9976)

* [SYCL] Fix build on Windows when ccache enabled (ggml-org#9954)

* take effect only on Windows and force it to icl

---------

Co-authored-by: Romain Biessy <[email protected]>

* Vulkan: RTE rounding for cpy to quant (see the illustration after this commit message)

Co-Authored-By: Jeff Bolz <[email protected]>

* remove trailing whitespace

* avoid duplicating pipeline_cpy_f32_quant

* fix copypasting issue

* remove duplicated code

---------

Co-authored-by: Jeff Bolz <[email protected]>
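
RTE here means round-to-nearest-even. A generic C++ illustration of why the rounding mode matters when copying floats into a quantized integer type (not the Vulkan shader code):

```cpp
#include <cmath>
#include <cstdio>

int main() {
    // std::nearbyintf follows the current FP rounding mode, which defaults
    // to round-to-nearest-even; a plain integer cast truncates toward zero,
    // introducing a systematic bias across many quantized values.
    const float vals[] = {2.9f, -2.9f, 2.5f, 3.5f};
    for (float v : vals) {
        printf("%5.2f  rte=%3d  trunc=%3d\n", v, (int) std::nearbyintf(v), (int) v);
    }
    return 0;
}
```
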
* tests: add mul_mat perf/functional tests for p021/nc vulkan shaders

* vulkan: Optimize mul_mat_vec p021 and nc shaders.

These shaders are used in attention calculations, and when the KV cache grows
large they start to dominate the run time. For the nc shader (which is called
with large 'k' dimension), use unrolling and vector loads. For the p021 shader
(which is called with large 'm' and small 'k' dimensions), take advantage of
grouped query attention to reuse loads from the A matrix for the whole group,
and reduce the number of workgroups (too much overhead from tiny dispatches).

Using subgroupAdd in the p021 shader also helps, so it is used conditionally.

* musa: refine compute capability

Signed-off-by: Xiaodong Ye <[email protected]>

* Address review comments

Signed-off-by: Xiaodong Ye <[email protected]>

---------

Signed-off-by: Xiaodong Ye <[email protected]>

* ggml : fix quantized cpy op (see the example after this commit message)

ggml-ci

* tests : add cpy tests for all types

ggml-ci

* tests : add BF16 copy tests

ggml-ci

* tests : fix loop for same-type copy

ggml-ci

* tests : add option to permute the dst tensor

ggml-ci
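
A standalone sketch of the kind of copy these tests exercise, assuming the current layout where ggml_graph_compute_with_ctx is declared in ggml-cpu.h (a minimal example, not the actual test-backend-ops code):

```cpp
#include "ggml.h"
#include "ggml-cpu.h"
#include <cstdio>

int main() {
    struct ggml_init_params params = {
        /*.mem_size   =*/ 16*1024*1024,
        /*.mem_buffer =*/ NULL,
        /*.no_alloc   =*/ false,
    };
    struct ggml_context * ctx = ggml_init(params);

    // 256 elements: a multiple of the Q8_0 block size (32)
    struct ggml_tensor * src = ggml_new_tensor_1d(ctx, GGML_TYPE_F32,  256);
    struct ggml_tensor * dst = ggml_new_tensor_1d(ctx, GGML_TYPE_Q8_0, 256);

    float * d = (float *) src->data;
    for (int i = 0; i < 256; ++i) {
        d[i] = 0.1f * i;
    }

    // ggml_cpy quantizes on the fly when the destination type differs
    struct ggml_tensor * out = ggml_cpy(ctx, src, dst);

    struct ggml_cgraph * gf = ggml_new_graph(ctx);
    ggml_build_forward_expand(gf, out);
    ggml_graph_compute_with_ctx(ctx, gf, /*n_threads=*/1);

    printf("copied %s -> %s\n", ggml_type_name(src->type), ggml_type_name(dst->type));
    ggml_free(ctx);
    return 0;
}
```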

… (ggml-org#12506)

* llama : gemma3 : use output tensor if it exists in model weight

* also add to the llm_tensor_names

mglambda and others added 16 commits March 23, 2025 19:30

… (ggml-org#12246)

Add verbose output to server_task_result_cmpl_final::to_json_oaicompat_chat_stream, making it conform with server_task_result_cmpl_final::to_json_oaicompat_chat, as well as the other to_json methods.
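
A hedged C++ sketch of the shape of this change, assuming nlohmann::json and the "__verbose" convention of the non-streaming path (the key name and fields here are assumptions, not verified against server.cpp):

```cpp
#include <nlohmann/json.hpp>
#include <iostream>

using json = nlohmann::json;

// Assumed pattern: when verbosity is on, attach the raw internal result
// under a "__verbose" key alongside the OAI-compatible streaming fields.
json to_json_oaicompat_chat_stream(bool verbose, const json & raw_result) {
    json res = {
        {"object",  "chat.completion.chunk"},
        {"choices", json::array()},
    };
    if (verbose) {
        res["__verbose"] = raw_result; // raw result, for debugging clients
    }
    return res;
}

int main() {
    json raw = {{"tokens_predicted", 12}};
    std::cout << to_json_oaicompat_chat_stream(true, raw).dump(2) << "\n";
}
```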

The OOB calculation could be wrong if the last iteration occurred inside one of
the unrolled loops. Adjust the unrolling counts to avoid this, and add a couple
of new backend tests that hit this failure on NVIDIA GPUs.
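
A generic C++ sketch of the failure class being fixed (illustrative only, not the shader code): when a loop is unrolled, the bounds handling has to account for the boundary falling inside an unrolled group, not just in the scalar tail.

```cpp
#include <cstddef>

// Correct pattern: the unrolled body only runs while a full group of 4 is
// in bounds, and the scalar tail handles the remainder. The bug class
// described above arises when the unroll count lets i+3 index past n.
float sum_unrolled(const float * x, size_t n) {
    float acc = 0.0f;
    size_t i = 0;
    for (; i + 4 <= n; i += 4) { // fully in-bounds unrolled group
        acc += x[i] + x[i+1] + x[i+2] + x[i+3];
    }
    for (; i < n; ++i) {         // tail: the last (n % 4) elements
        acc += x[i];
    }
    return acc;
}
```
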
* docs: update fedora-cuda guide

- Rename and place into Backend Folder.
- Update Host-Supplied Packages.
- Expand Recommended Users Section.

* docs: improve the flow of CUDA-FEDORA.md

ggml-cpu : bug fix related to KleidiAI LHS packing

Signed-off-by: Dan Johansson <[email protected]>

* Fix Mistral3/Gemma3 model hparams init

* set positional args correctly

* use existing hparams if passed

l3utterfly merged commit 8a04972 into layla-build on Mar 26, 2025
81 of 100 checks passed