forked from ggml-org/llama.cpp
Sync master with upstream release b5833 #152
Closed
jan-service-account wants to merge 378 commits into archive-dev from update-dev-from-master-2025-07-05-15-07
ggml-org#14326) Mistral Small 2506 models using Pixtral vision encoder were running out of GPU memory when processing images larger than 1024x1024 pixels due to exponential memory growth from unlimited image size. This fix applies the same 1024x1024 limit used by Qwen2VL models to prevent OOM issues while maintaining compatibility with existing models.
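The size cap described above can be sketched as a small helper. This is only an illustration: the name `clamp_image_size`, the `img_size` struct, and the choice to scale both sides proportionally are assumptions made here, not the code from the referenced fix.

```c
#include <assert.h>

/* Hypothetical helper (not taken from llama.cpp): shrink an image so that its
 * longest side does not exceed max_side, keeping the aspect ratio. */
typedef struct {
    int width;
    int height;
} img_size;

static img_size clamp_image_size(int width, int height, int max_side) {
    assert(width > 0 && height > 0 && max_side > 0);

    img_size out = { width, height };
    int longest = width > height ? width : height;
    if (longest <= max_side) {
        return out; /* already within the limit, leave untouched */
    }

    /* Scale both sides by max_side / longest; the 64-bit intermediate avoids
     * overflow of the product for very large inputs. */
    out.width  = (int)(((long long)width  * max_side) / longest);
    out.height = (int)(((long long)height * max_side) / longest);

    /* Extremely thin images could round down to 0; keep at least one pixel. */
    if (out.width  < 1) out.width  = 1;
    if (out.height < 1) out.height = 1;
    return out;
}
```

With a limit of 1024, a 2048x1024 input would come out as 1024x512, while an 800x600 input is returned unchanged.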
* run : avoid double tokenization by adopting common_tokenize heuristic
* build : fix windows gcc and clang warnings
* lint : fixed trailing whitespace
* run : fix is_first flag
* kv-cells : fix tracking of seq_pos during cache reuse
ggml-ci
* cont : improve error message
ggml-ci
* cont : add more comments
* CUDA: mul_mat_v support for batch sizes > 1
* use 64 bit math for initial offset calculation
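Why 64-bit math matters for an offset calculation can be shown with a plain C sketch. The function names and the row-major layout are assumptions for illustration; the actual CUDA kernel is not reproduced here.

```c
#include <assert.h>
#include <stdint.h>

/* Element offset of (row, col) in a row-major buffer, computed in 64 bits.
 * Widening one operand before the multiply keeps the full product. */
static int64_t offset_64(int32_t row, int32_t row_stride, int32_t col) {
    assert(row >= 0 && row_stride >= 0 && col >= 0);
    return (int64_t)row * row_stride + col;
}

/* The same expression evaluated in 32-bit unsigned arithmetic, which wraps
 * modulo 2^32 once the product no longer fits. */
static uint32_t offset_32(int32_t row, int32_t row_stride, int32_t col) {
    assert(row >= 0 && row_stride >= 0 && col >= 0);
    return (uint32_t)row * (uint32_t)row_stride + (uint32_t)col;
}
```

For `row = 70000` and `row_stride = 70000` the true offset is 4900000000, which does not fit in 32 bits; the 32-bit variant wraps to 605032704 and would address the wrong element.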
…setting (ggml-org#14336)
* llama-cli : add missing `inputs.use_jinja` setting
Signed-off-by: Molly Sophia <[email protected]>
* llama : better legacy chat template for rwkv
Signed-off-by: Molly Sophia <[email protected]>
---------
Signed-off-by: Molly Sophia <[email protected]>
Co-authored-by: Johannes Gäßler <[email protected]>
This will allow the use of tools on the llama-server
* batch : fix check for empty sequences in memory
ggml-ci
* cont : reuse the var
ggml-ci
ggml-org#14254)
* Move profiling info into `ggml_backend_opencl_context`
* Add `enqueue_ndrange_kernel` to launch kernel
* ggml-cpu: add nnpa compile flag Signed-off-by: Aaron Teo <[email protected]> (cherry picked from commit 4a9f60c) * ggml-cpu: add fp16->fp32 nnpa first Signed-off-by: Aaron Teo <[email protected]> (cherry picked from commit 8d4a798) * ggml-cpu: add fp32->fp16 Signed-off-by: Aaron Teo <[email protected]> (cherry picked from commit 0ff0d65) * ggml-cpu: better variable names Signed-off-by: Aaron Teo <[email protected]> (cherry picked from commit 2f58bbc) * docs: update s390x docs Signed-off-by: Aaron Teo <[email protected]> (cherry picked from commit 01b9294) * ggml-cpu: add debugging prints to see if dlf16 is correct Signed-off-by: Aaron Teo <[email protected]> * ggml-cpu: fix print vs printf Signed-off-by: Aaron Teo <[email protected]> * ggml-cpu: fix float placeholder Signed-off-by: Aaron Teo <[email protected]> * ggml-cpu: ensure fp16 and fp32 load and stores are called Signed-off-by: Aaron Teo <[email protected]> * ggml-cpu: fp16 load ensured to hit Signed-off-by: Aaron Teo <[email protected]> * ggml-cpu: remove sigint from fp16 store for some reason, the function is not getting a hit when debugged with gdb. 
we will need to investigate further Signed-off-by: Aaron Teo <[email protected]> * ggml-cpu: activate nnpa for ggml_cpu_fp16_to_fp32 Signed-off-by: Aaron Teo <[email protected]> * ggml-cpu: nnpa activate ggml_cpu_fp16_to_fp32 for 8 elements Signed-off-by: Aaron Teo <[email protected]> * ggml-cpu: nnpa switch to vec_xst test Signed-off-by: Aaron Teo <[email protected]> * ggml-cpu: switch to vec_xst for 4 element loops also Signed-off-by: Aaron Teo <[email protected]> * ggml-cpu: rework noop Signed-off-by: Aaron Teo <[email protected]> * ggml-cpu: remove noop, general code cleanup Signed-off-by: Aaron Teo <[email protected]> * ggml-cpu: clarify variable naming Signed-off-by: Aaron Teo <[email protected]> * ggml-cpu: activate nnpa for ggml_cpu_fp32_to_fp16 Signed-off-by: Aaron Teo <[email protected]> * ggml-cpu: add breakpoint for debugging Signed-off-by: Aaron Teo <[email protected]> * ggml-cpu: test fix for conversion failure Signed-off-by: Aaron Teo <[email protected]> * ggml-cpu: disable fp32->fp16 nnpa conversions for now there are some conversion failures in nnpa that requires the eyes of an ibm stsm. will create a separate pr to introduce the fp32->fp16 change. 
Signed-off-by: Aaron Teo <[email protected]> * ggml-cpu: switch to elif macro Signed-off-by: Aaron Teo <[email protected]> * ggml-cpu: reattempt fp32->fp16 Signed-off-by: Aaron Teo <[email protected]> * ggml-cpu: fix typo Signed-off-by: Aaron Teo <[email protected]> * ggml-cpu: reattempt fp32->fp16 Signed-off-by: Aaron Teo <[email protected]> * ggml-cpu: fix compiler types Signed-off-by: Aaron Teo <[email protected]> * ggml-cpu: change to typedef vector types Signed-off-by: Aaron Teo <[email protected]> * ggml-cpu: add 4 element loops for fp32->fp16 Signed-off-by: Aaron Teo <[email protected]> * ggml-cpu: clarified vector naming Signed-off-by: Aaron Teo <[email protected]> * ggml-cpu: bring back fp32->fp16 store nnpa Signed-off-by: Aaron Teo <[email protected]> * ggml-cpu: activate nnpa fp32->fp16 or fp16->fp32 compute Signed-off-by: Aaron Teo <[email protected]> * ggml-cpu: add nnpa macro check in ggml-impl Signed-off-by: Aaron Teo <[email protected]> * ggml-cpu: add missing __func__ Signed-off-by: Aaron Teo <[email protected]> * ggml-cpu: diagnose why __NNPA__ macro is not being defined Signed-off-by: Aaron Teo <[email protected]> * ggml-cpu: import vecintrin.h to fix compiler errors Signed-off-by: Aaron Teo <[email protected]> * ggml-cpu: update macro tests Signed-off-by: Aaron Teo <[email protected]> * ggml-cpu: move s390x typedef to own header file Signed-off-by: Aaron Teo <[email protected]> * Revert "ggml-cpu: move s390x typedef to own header file" This reverts commit 157f856. 
Signed-off-by: Aaron Teo <[email protected]> * ggml-cpu: switch to importing ggml-cpu-impl instead Signed-off-by: Aaron Teo <[email protected]> * ggml-cpu: fix macro declaration Signed-off-by: Aaron Teo <[email protected]> * ggml-cpu: test more macros Signed-off-by: Aaron Teo <[email protected]> * ggml-cpu: add debug prints Signed-off-by: Aaron Teo <[email protected]> * ggml-cpu: bruteforce macro definitions Signed-off-by: Aaron Teo <[email protected]> * ggml-cpu: move macro definitions Signed-off-by: Aaron Teo <[email protected]> * ggml-cpu: add ggml-impl.h to cmakelists Signed-off-by: Aaron Teo <[email protected]> * ggml-cpu: switch to private macros Signed-off-by: Aaron Teo <[email protected]> * ggml-cpu: move s390x typedef to own header file Signed-off-by: Aaron Teo <[email protected]> (cherry picked from commit 157f856) * ggml-cpu: move things around Signed-off-by: Aaron Teo <[email protected]> * ggml-cpu: bring back compile macros Signed-off-by: Aaron Teo <[email protected]> * ggml-cpu: switch to quotes for import Signed-off-by: Aaron Teo <[email protected]> * ggml-cpu: add compiler error macro Signed-off-by: Aaron Teo <[email protected]> * ggml-cpu: add s390x detection in ggml-src Signed-off-by: Aaron Teo <[email protected]> * ggml-cpu: bring back compile definitions Signed-off-by: Aaron Teo <[email protected]> * ggml-cpu: undo cmakelists work Signed-off-by: Aaron Teo <[email protected]> * Revert "ggml-cpu: move s390x typedef to own header file" This reverts commit 18d79e1. 
Signed-off-by: Aaron Teo <[email protected]> * ggml-cpu: remove typedefs.h Signed-off-by: Aaron Teo <[email protected]> * ggml-cpu: remove typedef from cmakelists Signed-off-by: Aaron Teo <[email protected]> * ggml-cpu: add ggml-impl.h future notes Signed-off-by: Aaron Teo <[email protected]> * ggml-cpu: add todo comment for future reference Signed-off-by: Aaron Teo <[email protected]> * ggml-cpu: clarify naming of dlf16 Signed-off-by: Aaron Teo <[email protected]> * ggml-cpu: remove unnecessary target compile definitions Signed-off-by: Aaron Teo <[email protected]> * ggml-cpu: move nnpa fp16->fp32 and fp32->fp16 to simd-mappings Signed-off-by: Aaron Teo <[email protected]> * ggml: refactor fp32->fp16 and fp16->fp32 simd to ggml-cpu Signed-off-by: Aaron Teo <[email protected]> * docs: update broken huggingface link for s390x Signed-off-by: Aaron Teo <[email protected]> * ggml-cpu: fix duplicate func names during compile Signed-off-by: Aaron Teo <[email protected]> * Revert "ggml-cpu: fix duplicate func names during compile" This reverts commit fbb7334. Signed-off-by: Aaron Teo <[email protected]> * Revert "ggml: refactor fp32->fp16 and fp16->fp32 simd to ggml-cpu" This reverts commit bd288e8. 
Signed-off-by: Aaron Teo <[email protected]> * ggml: refactor fp16<->fp32 simd to ggml-cpu Signed-off-by: Aaron Teo <[email protected]> * ggml-cpu: fix missing simd-mappings.h import in quants.c Signed-off-by: Aaron Teo <[email protected]> * ggml-cpu: fix missing simd-mappings.h within repack Signed-off-by: Aaron Teo <[email protected]> * ggml-cpu: fix amx mmq missing simd-mappings.h Signed-off-by: Aaron Teo <[email protected]> * ggml-cpu: attempt at fixing loongarch failing build Signed-off-by: Aaron Teo <[email protected]> * ggml-cpu: move nnpa together with other fp16<->fp32 simd Signed-off-by: Aaron Teo <[email protected]> * ggml-cpu: fix wrong refactor of ggml-base ref: ggml-org#14317 (comment) Signed-off-by: Aaron Teo <[email protected]> * ggml: remove dependency on ggml-cpu from ggml-base Signed-off-by: Aaron Teo <[email protected]> * ggml-cpu: rename all fp16<->fp32 macros to prefix with ggml_cpu ref: ggml-org#14317 (comment) Signed-off-by: Aaron Teo <[email protected]> * ggml-cpu: remove mistaken fallback macro fallback logic was already implemented but i was too sleepy to realise Signed-off-by: Aaron Teo <[email protected]> * ggml: move ggml_table_f32_f16 to ggml-cpu ref: ggml-org#14317 (comment) Signed-off-by: Aaron Teo <[email protected]> * ggml-cpu: move ggml_table_f32_f16 back to ggml-base due to ci failures Signed-off-by: Aaron Teo <[email protected]> * Revert "ggml-cpu: move ggml_table_f32_f16 back to ggml-base due to ci failures" This reverts commit 32a3533. Signed-off-by: Aaron Teo <[email protected]> * Revert "ggml: move ggml_table_f32_f16 to ggml-cpu" This reverts commit 9e40d98. 
Signed-off-by: Aaron Teo <[email protected]> * ggml: move ggml_table_f32_f16 to ggml-cpu ref: ggml-org#14317 (comment) Signed-off-by: Aaron Teo <[email protected]> (cherry picked from commit 9e40d98) * ggml: move ggml_table_f32_f16 to ggml-cpu.c Signed-off-by: Aaron Teo <[email protected]> * ggml-cpu: extern c ggml_table_f32_f16 + chore docs Signed-off-by: Aaron Teo <[email protected]> * ggml-cpu: dedup ggml_table_f32_f16 from simd-mappings.h we rely on the variable declaration in ggml-cpu.c instead Signed-off-by: Aaron Teo <[email protected]> * Revert "ggml-cpu: dedup ggml_table_f32_f16 from simd-mappings.h" This reverts commit f71b21d. Signed-off-by: Aaron Teo <[email protected]> * ggml-cpu: bring back ggml_table_f32_f16 Signed-off-by: Aaron Teo <[email protected]> * Revert "ggml-cpu: bring back ggml_table_f32_f16" This reverts commit 2dce119. Signed-off-by: Aaron Teo <[email protected]> * fix ggml time initialization * fix f32_f16 table init * remove extra line --------- Signed-off-by: Aaron Teo <[email protected]> Co-authored-by: slaren <[email protected]>
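The commits above add hardware-accelerated (NNPA) fp16<->fp32 conversion on s390x. As a point of reference for what such a conversion computes, here is a portable scalar sketch of the fp16 to fp32 direction. It assumes IEEE 754 binary16 input and a 32-bit IEEE `float`; the function name is hypothetical, and this is not the NNPA vector code nor ggml's own implementation.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical scalar reference: expand an IEEE 754 binary16 bit pattern
 * into a binary32 float by re-biasing the exponent and widening the mantissa. */
static float fp16_bits_to_fp32(uint16_t h) {
    uint32_t sign = (uint32_t)(h & 0x8000u) << 16; /* move sign to bit 31 */
    uint32_t exp  = (h >> 10) & 0x1Fu;             /* 5-bit exponent      */
    uint32_t mant = h & 0x3FFu;                    /* 10-bit mantissa     */
    uint32_t bits;

    if (exp == 0) {
        if (mant == 0) {
            bits = sign;                           /* signed zero */
        } else {
            /* Subnormal half: shift until the implicit leading 1 appears,
             * lowering the exponent once per shift. */
            uint32_t shift = 0;
            while ((mant & 0x400u) == 0) {
                mant <<= 1;
                shift++;
            }
            mant &= 0x3FFu;
            bits = sign | ((127u - 15u + 1u - shift) << 23) | (mant << 13);
        }
    } else if (exp == 0x1Fu) {
        bits = sign | 0x7F800000u | (mant << 13);  /* infinity or NaN */
    } else {
        bits = sign | ((exp + (127u - 15u)) << 23) | (mant << 13);
    }

    float f;
    assert(sizeof f == sizeof bits);
    memcpy(&f, &bits, sizeof f);                   /* bit cast without UB */
    return f;
}
```

Every binary16 value is exactly representable in binary32, so this direction is lossless; the reverse direction needs rounding, which may be part of why the fp32->fp16 path took several attempts in the commit history above.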
* musa: enable fp16 mma (all) and cublas on qy2
Signed-off-by: Xiaodong Ye <[email protected]>
* Update ggml/src/ggml-cuda/ggml-cuda.cu
Co-authored-by: Johannes Gäßler <[email protected]>
* Address review comments
Signed-off-by: Xiaodong Ye <[email protected]>
* Address review comments
Signed-off-by: Xiaodong Ye <[email protected]>
* musa: disable MUL_MAT_ID (q2_k × f32) due to precision issues
Signed-off-by: Xiaodong Ye <[email protected]>
---------
Signed-off-by: Xiaodong Ye <[email protected]>
Co-authored-by: Johannes Gäßler <[email protected]>
* docs: update s390x documentation + add faq
Signed-off-by: Aaron Teo <[email protected]>
* docs: add s390x z17 build q&a
Signed-off-by: Aaron Teo <[email protected]>
---------
Signed-off-by: Aaron Teo <[email protected]>
* metal : batch rows copy in a single threadgroup
ggml-ci
* metal : handle some edge cases when threadgroup size is not a power of 2
ggml-ci
…#14398)
* Add shaders-gen sources as target deps
* gemma3n
* add llm_graph_input_one
* ggml : add version function to get lib version
This commit adds a function `ggml_version()` to the ggml library that
returns the version of the library as a string.
The motivation for this is that it can be useful to be able to
programmatically check the version of the ggml library being used.
Usage:
```c
printf("GGML version: %s\n", ggml_version());
```
Output:
```console
GGML version: 0.0.2219
```
* ggml : add ggml_commit()
---------
Co-authored-by: Georgi Gerganov <[email protected]>
ggml-ci
* llama : initial Mamba-2 support
* ggml : SIMD ggml_ssm_scan for Mamba-2
* ggml : improve ggml_mul speed when masking recurrent states
* llama : support running Mamba-Codestral-7B-v0.1
* llama : fix Mamba-2 conv state saving
* ggml : make the ggml_mul fast broadcast path more consistently formatted
* llama : remove unused variable
* llama : add missing break
* convert_hf : prefer SentencePiece tokenizer for Mamba-2 when present
  The tokenizer.json of Mamba-Codestral-7B-v0.1 otherwise requires workarounds to work correctly.
* llama : avoid redundant state copy for Mamba 1 and 2
* metal : attempt to adapt SSM_SCAN for Mamba-2
* metal : fix SSM_SCAN pipeline scope
* metal : use log and exp instead of log1pf and expf in SSM_SCAN
* metal : remove unused arguments for SSM_SCAN
  The max index is 31, so trimming the arguments is necessary.
* metal : add back n_seqs to SSM_SCAN args
  Whoops, this is needed for the offset in the concatenated output.
* metal : fix SSM_SCAN state head offset
* metal : fix wrong number of tokens per sequence in SSM_SCAN
* ggml : remove unused fast broadcast path in GGML_MUL
  This was initially added because states were masked with ggml_mul, but this is no longer done and so this "optimisation" is no longer necessary, or at least not worth the additional code complexity.
* ggml : avoid multiply by D in GGML_OP_SSM_SCAN
  This makes the weight buft detection in src/llama.cpp simpler.
* convert : transpose Mamba-2 A, D and reshape SSM_NORM
  This breaks existing conversions of Mamba-2 models to avoid some reshapes. Not sure if it's a good idea, but it makes the graph slightly cleaner.
* llama : more appropriate SSM_SCAN and SSM_CONV buft support checks
* convert : fix flake8 lint
* metal : fix confusion between ; and ,
* metal : add missing args for nb references in ssm_scan_f32_group
* metal : single-user mamba2 inference works
* kv-cache : remove const_cast when setting inputs for s_copy
  And also fix multi-user inference for recurrent models by using cell_id instead of i as the kv cell index when populating s_copy.
* convert : avoid AutoConfig for Mamba and Mamba2 hparams
* kv-cache : allow context shift for recurrent models
* graph : fix recurrent state copies when avoiding copies
  Works, but using lambda functions might not be that clean.
* ggml : fix mamba2 ssm scan when compiled with SVE
* ggml-cpu : reorder SVE FMA for consistency with other SIMD arches
* cuda : implement ssm scan for Mamba2
  There is still room for improvement, but it works!
* cuda : adapt Mamba1 ssm scan to shape changes from Mamba2
* mamba : fix mismatched new and delete size for llm_build_mamba
  Subclasses of llm_graph_context cannot have extra fields, because the called destructor is not the one from the subclass. This would otherwise cause problems when running Mamba-(1|2) inference when compiled with -DGGML_SANITIZE_ADDRESS=ON
* cuda : graceful fallback for Mamba-1 models with weird embd size
* add support for chat template jinja files
* remove gemma3n hack
* ggml : fix FA mask dim 2 and 3
ggml-ci
* backends : unsupport batched FA in CUDA and Vulkan
ggml-ci
* vulkan : disable FA for mask->ne[2] != 1
* kv-cache : use ggml_set_rows
ggml-ci
* graph : separate k and v indices
ggml-ci
* cont : remove redundant ifs
ggml-ci
* kv-cache : improve find_slot impl
* kv-cache : bounds-check when accessing slot_info indices
* kv-cache : add comments
ggml-ci
* ggml : add TODOs for adding GGML_OP_SET_ROWS support in the backends
ggml-ci
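The idea behind a "set rows" operation with bounds-checked indices can be sketched in plain C: rows of a source buffer are scattered into a destination buffer at positions given by an index array. The function below is a hypothetical illustration of that idea only; its name, signature, and error convention are assumptions and do not mirror `ggml_set_rows` or the kv-cache code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical sketch: copy n_rows rows of row_len floats from src into dst,
 * placing source row i at destination row idx[i].
 * Returns 0 on success, or -1 (leaving dst untouched) if any index falls
 * outside [0, dst_rows). */
static int set_rows_f32(float *dst, int64_t dst_rows,
                        const float *src, const int64_t *idx,
                        int64_t n_rows, int64_t row_len) {
    assert(dst != NULL && src != NULL && idx != NULL);

    /* Validate every index first so a bad one cannot cause a partial write. */
    for (int64_t i = 0; i < n_rows; i++) {
        if (idx[i] < 0 || idx[i] >= dst_rows) {
            return -1;
        }
    }

    /* Scatter the rows. */
    for (int64_t i = 0; i < n_rows; i++) {
        memcpy(dst + idx[i] * row_len,
               src + i * row_len,
               (size_t)row_len * sizeof(float));
    }
    return 0;
}
```

Checking all indices before the first copy is a design choice of this sketch: it keeps the destination consistent when a slot index is invalid, at the cost of a second pass over the index array.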
* convert : correct gemma 3n conversion
* rm redundant code
…g#14504) Signed-off-by: nscipione <[email protected]>
* vulkan: better parameterize FA by head sizes
* vulkan: support mixed/deepseekR1 FA head sizes
…g#14002) Co-authored-by: luyuhong <[email protected]>
…14368)
* test-backend-ops: add support for specifying output format
Signed-off-by: Xiaodong Ye <[email protected]>
* Address review comments
Signed-off-by: Xiaodong Ye <[email protected]>
* Add build_commit and build_number in test_result
Signed-off-by: Xiaodong Ye <[email protected]>
* Address review comments
Signed-off-by: Xiaodong Ye <[email protected]>
* refactor
Signed-off-by: Xiaodong Ye <[email protected]>
* Get build commit from ggml_commit()
Signed-off-by: Xiaodong Ye <[email protected]>
* Merge errors into test_operation_info && address review comments
Signed-off-by: Xiaodong Ye <[email protected]>
* Address review comments
Signed-off-by: Xiaodong Ye <[email protected]>
* Address review comments
Signed-off-by: Xiaodong Ye <[email protected]>
* remove visitor nonsense
* remove visitor comment
Signed-off-by: Xiaodong Ye <[email protected]>
* Address review comments
Signed-off-by: Xiaodong Ye <[email protected]>
---------
Signed-off-by: Xiaodong Ye <[email protected]>
Co-authored-by: slaren <[email protected]>
* vulkan: Handle updated FA dim2/3 definition
  Pack mask boolean and n_head_log2 into a single dword to keep the push constant block under the 128B limit.
* handle null mask for gqa
* allow gqa with dim3>1
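Packing a boolean and a small integer into one 32-bit word, as the commit above describes, might look like the following sketch. The bit layout (flag in bit 16, value in the low 16 bits) and the function names are assumptions chosen for illustration; the commit message does not state the actual layout.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed layout for this sketch: bits 0..15 hold n_head_log2,
 * bit 16 holds the "mask present" flag. */
static uint32_t pack_mask_n_head_log2(int has_mask, uint32_t n_head_log2) {
    assert(n_head_log2 <= 0xFFFFu); /* value must fit in the low 16 bits */
    return ((uint32_t)(has_mask != 0) << 16) | n_head_log2;
}

/* Recover the flag from the packed word. */
static int unpack_has_mask(uint32_t packed) {
    return (int)((packed >> 16) & 1u);
}

/* Recover n_head_log2 from the packed word. */
static uint32_t unpack_n_head_log2(uint32_t packed) {
    return packed & 0xFFFFu;
}
```

Folding two fields into one dword saves 4 bytes in the parameter block, which matters when the block has to stay within a fixed budget such as the 128-byte limit mentioned in the commit.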
- Removed duplicate enum entries in ggml-metal and ggml.c files.
- Enhanced kernel functions in ggml-metal.metal for better performance and clarity.
- Streamlined SYCL element-wise operations in element_wise.cpp, consolidating redundant code.
- Cleaned up Vulkan CMake configuration and source files, eliminating unnecessary lines.
- Improved llama batch processing logic in llama-batch.cpp and llama-batch.h for better efficiency.
- Simplified memory management in llama-memory.h and llama-kv-cache-unified.cpp.
- Removed outdated comments and redundant code across multiple files for clarity.
- Adjusted server task handling in server.cpp to improve batch processing and error handling.
Updates dev branch with latest release (b5833) from ggml-org/llama.cpp