forked from ggml-org/llama.cpp
Sync master with upstream release b6445 #248
Merged: jan-service-account merged 13 commits into dev from update-dev-from-master-2025-09-11-00-33 on Sep 11, 2025.
Conversation
ci : cache ROCm installation in windows-latest-cmake-hip (ggml-org#15887)

This commit adds caching of the ROCm installation for the windows-latest-cmake-hip job. The motivation is that the installation can sometimes hang and/or not complete properly, leaving an invalid installation that later fails the build. By caching the installation, a known-good installation stays available in the cache and the installation step can be skipped.

Refs: ggml-org#15365
… (ggml-org#15893)

This commit adds checks for two function pointers returned from ggml_backend_reg_get_proc_address. The motivation is that a returned function pointer could be nullptr if the get-proc-address function changes in the future. This is also consistent with all the other calls to ggml_backend_reg_get_proc_address in the code base.
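For reference, a minimal C++ sketch of the guarded pattern described above; the proc name "example_proc" and the function-pointer type are hypothetical, since the commit message does not name the two pointers being checked:

```cpp
#include <cstdio>

#include "ggml-backend.h"

// Hypothetical function-pointer type; the real commit checks two specific
// proc addresses that are not named in the message above.
typedef void (*example_fn_t)(void);

static bool call_example_proc(ggml_backend_reg_t reg) {
    // ggml_backend_reg_get_proc_address may return nullptr if the registry
    // does not expose the requested symbol, so check before calling.
    void * addr = ggml_backend_reg_get_proc_address(reg, "example_proc");
    if (addr == nullptr) {
        fprintf(stderr, "proc address not available\n");
        return false;
    }
    example_fn_t fn = reinterpret_cast<example_fn_t>(addr);
    fn();
    return true;
}
```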
* CANN: implement LRU cache for ACL graphs in CANN backend
  - Introduce ggml_cann_graph_lru_cache to store multiple ggml_cann_graph objects.
  - Graphs are loaded on demand and evicted using an LRU policy when capacity is exceeded.
  - Updated push, move_to_front, and clear methods to manage cached graphs efficiently.
  - Ensures reuse of graphs, reducing graph reconstruction overhead in the CANN backend.
* fix typo
* The LRU cache capacity can be configured via an env variable
* refactor ACL graph
* refactor && fix review comments

Signed-off-by: noemotiovon <[email protected]>
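As an illustration of the interface named above, a minimal C++ LRU-cache sketch with push/move_to_front/clear; the type names and eviction details are assumptions, not the actual CANN backend code:

```cpp
#include <cstddef>
#include <list>
#include <memory>

// Stand-in for the real ggml_cann_graph; illustration only.
struct cann_graph_stub { /* captured graph state */ };

struct graph_lru_cache {
    size_t capacity = 12; // the real capacity is configurable via an env variable
    std::list<std::unique_ptr<cann_graph_stub>> cache_list; // front = most recently used

    void push(std::unique_ptr<cann_graph_stub> graph) {
        if (cache_list.size() >= capacity) {
            cache_list.pop_back(); // evict the least-recently-used graph
        }
        cache_list.push_front(std::move(graph));
    }

    // mark an existing entry as most recently used
    void move_to_front(std::list<std::unique_ptr<cann_graph_stub>>::iterator it) {
        cache_list.splice(cache_list.begin(), cache_list, it);
    }

    void clear() {
        cache_list.clear();
    }
};
```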
* CANN: Add ROPE sin/cos cache for reuse
  Introduce a sin/cos caching mechanism in ROPE to avoid redundant computation across layers. The cache is built on the first layer per device and reused by subsequent layers if parameters match.
  - Added sin_cache / cos_cache pointers and position_length tracking.
  - Introduced cache validity flags and properties: ext_factor, theta_scale, freq_scale, attn_factor, is_neox.
  - Accelerates ROPE by eliminating repeated sin/cos generation.
  This change reduces overhead in multi-layer scenarios while preserving correctness by verifying parameter consistency.
* fix typo

Signed-off-by: noemotiovon <[email protected]>
Co-authored-by: hipudding <[email protected]>
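A short C++ sketch of the parameter check that gates cache reuse; the struct layout and method name are assumptions based on the fields listed above:

```cpp
#include <cstdint>

// Sketch of the validity check described above; not the actual CANN code.
struct rope_cache {
    float * sin_cache = nullptr;
    float * cos_cache = nullptr;
    int64_t position_length = 0;
    bool    valid = false;
    // parameters the cached tables were built with
    float ext_factor = 0.0f, theta_scale = 0.0f, freq_scale = 0.0f, attn_factor = 0.0f;
    bool  is_neox = false;

    // exact float comparison is intentional: reuse only when the layer was
    // configured with identical parameters
    bool matches(float ext, float theta, float freq, float attn, bool neox, int64_t pos_len) const {
        return valid && pos_len <= position_length &&
               ext == ext_factor && theta == theta_scale &&
               freq == freq_scale && attn == attn_factor && neox == is_neox;
    }
};
```

The first layer on a device builds the tables; subsequent layers call matches(...) and skip sin/cos generation on a hit.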
* tests : filter out no-ops from coverage report

This commit is a follow-up to ggml-org#15745, addressing the feedback on how no-op operations should be filtered out from the coverage report.

Regarding the feedback that the UNARY and GLU sub-operations are not handled, I am not exactly sure what should be done. They are already included in the coverage; for example ABS, ELU, EXP, GELU, GEGLU, GEGLU_ERF, etc. are in the list of covered operations:

```console
$ ./build/bin/test-backend-ops --show-coverage
Operations covered by tests (89):
  ✓ ABS ✓ ACC ✓ ADD ✓ ADD1 ✓ ADD_ID ✓ ARANGE ✓ ARGMAX ✓ ARGSORT
  ✓ CLAMP ✓ CONCAT ✓ CONV_2D ✓ CONV_2D_DW ✓ CONV_3D ✓ CONV_TRANSPOSE_1D
  ✓ CONV_TRANSPOSE_2D ✓ COS ✓ COUNT_EQUAL ✓ CPY ✓ CROSS_ENTROPY_LOSS
  ✓ CROSS_ENTROPY_LOSS_BACK ✓ DIAG_MASK_INF ✓ DIV ✓ DUP ✓ ELU ✓ EXP
  ✓ FLASH_ATTN_EXT ✓ GATED_LINEAR_ATTN ✓ GEGLU ✓ GEGLU_ERF ✓ GEGLU_QUICK
  ✓ GELU ✓ GELU_ERF ✓ GELU_QUICK ✓ GET_ROWS ✓ GET_ROWS_BACK ✓ GROUP_NORM
  ✓ HARDSIGMOID ✓ HARDSWISH ✓ IM2COL ✓ IM2COL_3D ✓ L2_NORM ✓ LEAKY_RELU
  ✓ LOG ✓ MEAN ✓ MUL ✓ MUL_MAT ✓ MUL_MAT_ID ✓ NEG ✓ NORM ✓ OPT_STEP_ADAMW
  ✓ OPT_STEP_SGD ✓ OUT_PROD ✓ PAD ✓ PAD_REFLECT_1D ✓ POOL_2D ✓ REGLU ✓ RELU
  ✓ REPEAT ✓ REPEAT_BACK ✓ RMS_NORM ✓ RMS_NORM_BACK ✓ ROLL ✓ ROPE ✓ ROPE_BACK
  ✓ RWKV_WKV6 ✓ RWKV_WKV7 ✓ SCALE ✓ SET ✓ SET_ROWS ✓ SGN ✓ SIGMOID ✓ SILU
  ✓ SILU_BACK ✓ SIN ✓ SOFT_MAX ✓ SOFT_MAX_BACK ✓ SQR ✓ SQRT ✓ SSM_CONV
  ✓ SSM_SCAN ✓ STEP ✓ SUB ✓ SUM ✓ SUM_ROWS ✓ SWIGLU ✓ SWIGLU_OAI ✓ TANH
  ✓ TIMESTEP_EMBEDDING ✓ UPSCALE

Operations without tests (14):
  ✗ ADD_REL_POS ✗ CUSTOM ✗ DIAG ✗ DIAG_MASK_ZERO ✗ FLASH_ATTN_BACK
  ✗ GET_REL_POS ✗ IM2COL_BACK ✗ MAP_CUSTOM1 ✗ MAP_CUSTOM2 ✗ MAP_CUSTOM3
  ✗ POOL_1D ✗ POOL_2D_BACK ✗ WIN_PART ✗ WIN_UNPART

Coverage Summary:
  Total operations: 103
  Tested operations: 89
  Untested operations: 14
  Coverage: 86.4%
```

Refs: ggml-org#15745

* use ggml_op enum values instead of strcmp
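The follow-up bullet swaps string comparison for enum comparison. A rough C++ sketch of that filtering; the exact set of excluded no-ops is an assumption here, using ggml's metadata-only ops:

```cpp
#include "ggml.h"

// Return true if an op should count toward coverage; no-ops that only
// manipulate tensor metadata are filtered out by enum value instead of
// by strcmp on the op name.
static bool op_needs_coverage(enum ggml_op op) {
    switch (op) {
        case GGML_OP_NONE:
        case GGML_OP_VIEW:
        case GGML_OP_RESHAPE:
        case GGML_OP_PERMUTE:
        case GGML_OP_TRANSPOSE:
            return false;
        default:
            return true;
    }
}
```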
… (ggml-org#15924)

This commit applies to the release workflow the same caching that currently exists for the main CI workflow, introduced in commit ff02caf ("ci : cache ROCm installation in windows-latest-cmake-hip (ggml-org#15887)").
* metal : make the backend async
* cont : add comments, extend op offload, clean up
* metal : fix batch size for MUL_MAT_ID
* metal : remove deprecated ggml_backend_metal_buffer_from_ptr
* metal : create only Metal buffers, no wrapping of host memory
* metal : restore .alloc_buffer for buffer_from_ptr_type
* metal : remove broken implementation of GGML_OP_SET
* metal : clean up loose ends, ready for tests
* metal : support both private and shared buffers
* metal : enable private buffers + add global device queue
* metal : disable host buffer to prevent races
* metal : avoid extra copy during set_tensor
* metal : use separate buffer types for shared and private Metal buffers
* metal : simplify synchronization logic
* metal : fix build
* metal : do not implement cpy_tensor
* metal : separate implementations for shared and private buffers
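Conceptually, the shared/private split means tensor data can be written directly into a shared (host-visible) buffer, while a private (GPU-only) buffer needs a staged upload. A hedged C++ sketch of that idea; the real backend is Objective-C and every name here is illustrative:

```cpp
#include <cstddef>
#include <cstring>

struct metal_buffer_stub {
    void * host_ptr;  // valid only for shared (host-visible) buffers
    bool   is_shared; // shared: CPU-visible; private: GPU-only
};

static void set_tensor_data(metal_buffer_stub & buf, size_t offset, const void * data, size_t size) {
    if (buf.is_shared) {
        // shared buffers can be written directly from the CPU
        memcpy((char *) buf.host_ptr + offset, data, size);
    } else {
        // private buffers require a staging upload through the device queue
        // (e.g. a blit command in Metal); elided in this sketch
    }
}
```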
This commit fixes the zero padding for odd dimensions in ggml_compute_forward_timestep_embedding_f32. The motivation is that if an odd dimension is used, the padding check incorrectly uses the dimension value for indexing. For example, with dim=15:

- Elements 0-6 are set to cosine values.
- Elements 7-13 are set to sine values.
- Element 14 is left uninitialized (contains garbage).
- Element 15 is correctly set to zero.

The fix changes embed_data[dim] to embed_data[2 * half] so that element 14 (the first unused element) is properly set to zero, as well as the last element.

Resolves: ggml-org/ggml#1324
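A simplified single-row C++ sketch of the corrected behavior; parameter names and the row-width handling are assumptions, not a verbatim copy of the ggml kernel:

```cpp
#include <cmath>

static void timestep_embedding_row(float * embed_data, int row_width, int dim,
                                   float timestep, float max_period) {
    const int half = dim / 2;
    for (int i = 0; i < half; i++) {
        const float freq = std::exp(-std::log(max_period) * (float) i / (float) half);
        const float arg  = timestep * freq;
        embed_data[i]        = std::cos(arg); // elements [0, half)
        embed_data[i + half] = std::sin(arg); // elements [half, 2*half)
    }
    // zero everything from the first unused element (index 2*half) to the end
    // of the row; the buggy version started at index dim, which skips index
    // 2*half when dim is odd
    for (int j = 2 * half; j < row_width; j++) {
        embed_data[j] = 0.0f;
    }
}
```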
* support non-contiguous Q in build_attn_mha
* Update src/llama-graph.cpp

Co-authored-by: Georgi Gerganov <[email protected]>
Extend the support of T5 models with different encoder-decoder layers (ggml-org#15909)

* Extend the support of T5 models with different encoder-decoder layers
* Apply review suggestions across convert_hf_to_gguf.py, gguf-py/gguf/constants.py, gguf-py/gguf/gguf_writer.py, src/llama-arch.cpp, src/llama-arch.h, src/llama-hparams.h, and src/llama-model.cpp
* Rename n_dec_layer --> dec_n_layer
* Adapt to cases when dec_n_layer > n_layer

Signed-off-by: Jie Fu <[email protected]>
Co-authored-by: Sigbjørn Skjæret <[email protected]>
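To make the rename concrete, a small C++ sketch of keeping separate encoder and decoder layer counts; the struct and field names mirror the idea, not the actual llama.cpp hparams code:

```cpp
#include <cstdint>
#include <vector>

struct t5_hparams_stub {
    uint32_t n_layer     = 12; // encoder layers
    uint32_t dec_n_layer = 24; // decoder layers; may differ from n_layer
};

struct layer_stub { /* per-layer weights */ };

static void build_layers(const t5_hparams_stub & hp,
                         std::vector<layer_stub> & enc_layers,
                         std::vector<layer_stub> & dec_layers) {
    enc_layers.resize(hp.n_layer);
    // dec_n_layer may exceed n_layer, so the decoder loop must use its own
    // count rather than reusing the encoder's
    dec_layers.resize(hp.dec_n_layer);
}
```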
… (ggml-org#15872)

* Add fastdiv and fastmodulo to k_bin_bcast kernel
* Address review comments
* `prod_` instead of `prod` suffix
* Add test case for `k_bin_bcast_unravel` in CUDA backend
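The fastdiv/fastmodulo trick replaces per-element integer division by a fixed divisor with a multiply-high and a shift. A portable C++ sketch of the standard round-up variant; the CUDA kernel would use the __umulhi intrinsic, and the structure here is illustrative rather than copied from the PR:

```cpp
#include <cassert>
#include <cstdint>

struct fastdiv_vals {
    uint32_t mp; // magic multiplier
    uint32_t L;  // shift amount, L = ceil(log2(d))
    uint32_t d;  // original divisor, kept for fastmodulo
};

static fastdiv_vals init_fastdiv(uint32_t d) {
    assert(d != 0);
    uint32_t L = 0;
    while (L < 32 && (uint32_t(1) << L) < d) {
        L++;
    }
    assert(L < 32); // divisors up to 2^31 are plenty for tensor indexing
    const uint32_t mp = (uint32_t) (((uint64_t(1) << 32) * ((uint64_t(1) << L) - d)) / d + 1);
    return { mp, L, d };
}

static uint32_t umulhi(uint32_t a, uint32_t b) { // __umulhi on the GPU
    return (uint32_t) (((uint64_t) a * b) >> 32);
}

// n / d without a hardware divide; exact in this simple form for n < 2^31
static uint32_t fastdiv(uint32_t n, fastdiv_vals f) {
    return (umulhi(n, f.mp) + n) >> f.L;
}

// n % d reuses the quotient
static uint32_t fastmodulo(uint32_t n, fastdiv_vals f) {
    return n - fastdiv(n, f) * f.d;
}

int main() {
    const fastdiv_vals f = init_fastdiv(3);
    assert(fastdiv(9, f) == 3);
    assert(fastmodulo(10, f) == 1);
    return 0;
}
```

Precomputing mp and L once on the host and passing them to the kernel moves the expensive divide off the per-element hot path, which is the point of applying it to the broadcast kernel's index math.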
Updates dev branch with latest release (b6445) from ggml-org/llama.cpp