Conversation

@l3utterfly
Owner

No description provided.

ggerganov and others added 30 commits July 30, 2025 13:52
* test-thread-safety : each context uses a single sequence

* embedding : handle --parallel argument

ggml-ci

* save-load : handle -np 1

ggml-ci

* thread-safety : avoid overriding threads, reduce test case arg

ggml-ci
The pipeline member can be cast to VkPipeline.
This is a VkPipeline_T* on 64-bit platforms but a uint64_t on 32-bit platforms.
Cf. the VK_DEFINE_NON_DISPATCHABLE_HANDLE documentation.
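
A minimal sketch of why the distinction matters (illustrative only, not the actual patch; `pipeline_id` is a hypothetical helper):

```cpp
#include <cstdint>
#include <vulkan/vulkan.h>

// Per VK_DEFINE_NON_DISPATCHABLE_HANDLE, VkPipeline is VkPipeline_T* on
// 64-bit targets but uint64_t on 32-bit targets, so code must cast via
// the VkPipeline typedef rather than assuming a pointer type.
uint64_t pipeline_id(VkPipeline p) {
    return (uint64_t)p; // valid on both 32- and 64-bit builds
}
```
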
ggml-ci
This commit adds support for the `embd_normalize` parameter in the
server code.

The motivation for this is that currently, if the server is started with
a pooling type other than `none`, Euclidean/L2 normalization is always
used for embeddings. This is not always the desired behavior: users may
want a different normalization, or none at all, and this commit allows
that.

Example usage:
```console
curl --request POST \
    --url http://localhost:8080/embedding \
    --header "Content-Type: application/json" \
    --data '{"input": "Hello world today", "embd_normalize": -1}
```
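
Here `-1` requests no normalization. Assuming the server reuses `common_embd_normalize` from `common/common.cpp` (an assumption, not stated in this commit), a negative value means no normalization, `0` max-absolute scaling, and `2` Euclidean/L2; verify the mapping against the current source.
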
* graph : avoid creating redundant s_copy views

* graph : comment the s_copy views
…t. (ggml-org#14985)

* CANN: Improve loading efficiency after converting weights to NZ format.

* CANN: fix typo
* Add support for Llada-8b: diffusion model

* Add README

* Fix README and convert_hf_to_gguf

* convert_hf_to_gguf.py: address review comments

* Make everything in a single example

* Remove model-specific sampling

* Remove unused argmax

* Remove braced initializers, improve README.md a bit

* Add diffusion specific gguf params in set_vocab, remove setting rope_theta and rms_norm_eps

* Remove adding the mask token

* Move add_add_bos_token to set_vocab

* use add_bool in gguf_writer.py
* llama-server : implement universal assisted decoding

* Erase prompt tail for kv-cache

* set vocab_dft_compatible in common_speculative

* rename ctx_main to ctx_tgt

* move vocab_dft_compatible to spec struct

* clear mem_dft, remove mem

* detokenize id_last for incompatible models

* update comment

* add --spec-replace flag

* accept special tokens when translating between draft/main models

* Escape spec-replace

* clamp draft result size to params.n_draft

* fix comment

* clean up code

* restore old example

* log common_speculative_are_compatible in speculative example

* fix

* Update common/speculative.cpp

Co-authored-by: Georgi Gerganov <[email protected]>

* Update common/speculative.cpp

Co-authored-by: Georgi Gerganov <[email protected]>

* Update common/speculative.cpp

Co-authored-by: Georgi Gerganov <[email protected]>

---------

Co-authored-by: Georgi Gerganov <[email protected]>
* MODEL_TENSOR.SSM_DT_NORM was defined twice, and the second definition overwrote the Jamba model's layer name

* correct order
* support minicpm-v 4

* add md

* support MiniCPM-o 4.0

* add default location

* temp rm MiniCPM-o 4.0

* fix code

* fix "minicpmv_projector" default path
* vulkan: fix debug mode issues

* vulkan: remove broken check_results GGML_OP_SET_ROWS support
…ml-org#14392)

* compare-commits.sh: support both llama-bench and test-backend-ops

Signed-off-by: Xiaodong Ye <[email protected]>

* Speed up the build by specifying -j 12

Signed-off-by: Xiaodong Ye <[email protected]>

* Remove build_number from test-backend-ops db

Signed-off-by: Xiaodong Ye <[email protected]>

* Apply suggestion from @JohannesGaessler

Co-authored-by: Johannes Gäßler <[email protected]>

* Refine tool selection logic

Signed-off-by: Xiaodong Ye <[email protected]>

* Address review comments

Signed-off-by: Xiaodong Ye <[email protected]>

---------

Signed-off-by: Xiaodong Ye <[email protected]>
Signed-off-by: Xiaodong Ye <[email protected]>
Co-authored-by: Johannes Gäßler <[email protected]>
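
A hedged usage sketch of the updated script (the tool-selection argument shown here is hypothetical; consult scripts/compare-commits.sh for the authoritative interface):

```console
# Compare two commits with llama-bench (model benchmarking);
# trailing arguments are passed through to the benchmark tool:
./scripts/compare-commits.sh master pr-branch -m model.gguf

# Hypothetical selection of test-backend-ops instead:
./scripts/compare-commits.sh master pr-branch --tool test-backend-ops
```
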
* docker: add cann build pipeline

* docker: add cann build pipeline

* docker: fix cann devops

* cann : fix multi card hccl

* Update ggml/src/ggml-cann/ggml-cann.cpp

Co-authored-by: Xuan-Son Nguyen <[email protected]>

* Update ggml-cann.cpp

---------

Co-authored-by: Georgi Gerganov <[email protected]>
Co-authored-by: Xuan-Son Nguyen <[email protected]>
* Initial Q2_K Block Interleaving Implementation

* Addressed review comments and clean up of the code

* Post rebase fixes

* Initial CI/CD fixes

* Update declarations in arch-fallback.h

* Changes for GEMV Q2_K in arch-fallback.h

* Enable repacking only on AVX-512 machines

* Update comments in repack.cpp

* Address q2k comments

---------

Co-authored-by: Manogna-Sree <[email protected]>
* support hunyuan_v1_dense

Signed-off-by: stevenkuang <[email protected]>

* update hunyuan_moe to hunyuan_v1_moe

Signed-off-by: stevenkuang <[email protected]>

* fix rope alpha assert and bos token

Signed-off-by: stevenkuang <[email protected]>

* add blank line

Signed-off-by: stevenkuang <[email protected]>

* Revert "update hunyuan_moe to hunyuan_v1_moe"

This reverts commit aa973ca.

* use hunyuan_dense instead of hunyuan_v1_dense

Signed-off-by: stevenkuang <[email protected]>

* fix hunyuan_moe chat template

Signed-off-by: stevenkuang <[email protected]>

* remove leftover code

Signed-off-by: stevenkuang <[email protected]>

* update hunyuan dense chat template

Signed-off-by: stevenkuang <[email protected]>

* fix hunyuan dense vocab and chat template

Signed-off-by: stevenkuang <[email protected]>

---------

Signed-off-by: stevenkuang <[email protected]>
* vendor : update vendored copy of google/minja

Signed-off-by: Lennart Austenfeld <[email protected]>

* Re-remove trailing whitespace

Signed-off-by: Lennart Austenfeld <[email protected]>

* Remove another trailing whitespace

Signed-off-by: Lennart Austenfeld <[email protected]>

---------

Signed-off-by: Lennart Austenfeld <[email protected]>
reeselevine and others added 28 commits August 5, 2025 16:26
* Add parameter buffer pool, batching of submissions, refactor command building/submission

* Add header for linux builds

* Free staged parameter buffers at once

* Format with clang-format

* Fix thread-safe implementation

* Use device implicit synchronization

* Update workflow to use custom release

* Remove testing branch workflow

* Disable set_rows until it's implemented

* Fix potential issue around empty queue submission

* Try synchronous submission

* Try waiting on all futures explicitly

* Add debug

* Add more debug messages

* Work on getting ssh access for debugging

* Debug on failure

* Disable other tests

* Remove extra if

* Try more locking

* maybe passes?

* test

* Some cleanups

* Restore build file

* Remove extra testing branch ci
* feat(cann): add optional support for ACL Graph execution

This commit adds support for executing ggml computational graphs using
Huawei's ACL graph mode via the USE_CANN_GRAPH flag. The support can be
enabled at compile time using the CMake option:

    -DUSE_CANN_GRAPH=ON

By default, ACL graph execution is **disabled**, and the fallback path
uses node-by-node execution.

Key additions:
- CMake option `USE_CANN_GRAPH` to toggle graph mode
- Graph capture and execution logic for the ACL graph path
- Tensor property matching to determine whether graph update is required
- Safe fallback and logging if the environment variable LLAMA_SET_ROWS
  is unset or invalid

This prepares the backend for performance improvements in repetitive graph
execution scenarios on Ascend devices.
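
A hedged configure-time sketch (flag name as introduced in this commit; a later commit in this PR renames it to USE_ACL_GRAPH):

```console
cmake -B build -DGGML_CANN=ON -DUSE_CANN_GRAPH=ON
cmake --build build --config Release
```
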

Signed-off-by: noemotiovon <[email protected]>

* Fix review comments

Signed-off-by: noemotiovon <[email protected]>

* rename USE_CANN_GRAPH to USE_ACL_GRAPH

Signed-off-by: noemotiovon <[email protected]>

* fix typo

Signed-off-by: noemotiovon <[email protected]>

---------

Signed-off-by: noemotiovon <[email protected]>
* opencl: add `swiglu-oai`

* opencl: add `add_id`

* opencl: add missing `add_id.cl`
* Begin work on set_rows

* Work on set rows

* Add error buffers for reporting unsupported SET_ROWS indices

* Remove extra comments
This commit addresses an issue with the convert_hf_to_gguf script
which is currently failing with:
```console
AttributeError: module 'torch' has no attribute 'uint64'
```

This occurred because safetensors expects torch.uint64 to be available
in the public API, but PyTorch 2.2.x provides only limited support for
unsigned types beyond uint8. The torch.uint64 dtype exists but is not
exposed in the standard torch namespace
(see pytorch/pytorch#58734).

PyTorch 2.4.0 properly exposes torch.uint64 in the public API, resolving
the compatibility issue with safetensors. This also required torchvision
to be updated to 0.19.0 for compatibility.

Refs: https://huggingface.co/spaces/ggml-org/gguf-my-repo/discussions/186#68938de803e47d990aa087fb
Refs: pytorch/pytorch#58734
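
A minimal reproduction sketch of the failure mode (illustrative; the actual attribute access happens inside safetensors' torch loader):

```python
import torch

# On torch 2.2.x this raises AttributeError because torch.uint64 is not
# exposed in the public namespace; on torch >= 2.4.0 it succeeds.
try:
    dtype = torch.uint64
    print("torch.uint64 available:", dtype)
except AttributeError:
    print("torch.uint64 missing; upgrade to torch >= 2.4.0 "
          "(and torchvision 0.19.0)")
```
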
* CUDA: GEMM for FP32/FP16/BF16 and ne11 <= 16
…-org#15094)

Any available libraries are found and loaded dynamically at runtime.
* support internvl

* support interns1

* resolve comments

* put interns1 in tensor mapping

* resolve comment

* move tokenizer changes to sub class
* convert : support non-mxfp4 HF model

* rm redundant check

* disable debug check
* vendor: sync minja

* Update minja.hpp

* Apply suggestions from code review

Co-authored-by: Sigbjørn Skjæret <[email protected]>

---------

Co-authored-by: Sigbjørn Skjæret <[email protected]>
* server-bench: external OAI servers, sqlite

* Update scripts/server-bench.py

Co-authored-by: Sigbjørn Skjæret <[email protected]>

* Update scripts/server-bench.py

Co-authored-by: Sigbjørn Skjæret <[email protected]>

* Update scripts/server-bench.py

Co-authored-by: Sigbjørn Skjæret <[email protected]>

* raise_for_status

---------

Co-authored-by: Sigbjørn Skjæret <[email protected]>
* gguf-py : add MXFP4 de/quantization support

* ggml-quants : handle zero amax for MXFP4
* CUDA: add attention sinks for tile and wmma

* Review: formatting changes + remove syncthreads from tile + remove warp_reduce_max from wmma
* cuda: refactored ssm_scan to use CUB

* fixed compilation error when when not using CUB

* assign L to constant and use size_t instead of int

* deduplicated functions

* change min blocks per mp to 1

* Use cub load and store warp transpose

* suppress clang warning
@l3utterfly merged commit 20c9590 into layla-build on Aug 11, 2025 (9 of 54 checks passed).
@l3utterfly deleted the merge branch on August 11, 2025 at 04:22.