Merged the latest changes from ggml-org/llama.cpp into our fork(master) #52

dineshReddy6381 · 2025-09-17T11:46:45Z

Attached is the FPGA test log:
ggml_log.txt

Note: the SIGMOID test is failing on both POSIX and FPGA. It was also failing in the master branch code.

… format (ggml-org#15108) - Use server_tokens in more places in server and util.cpp - Convert most functions that used llama_tokens to server_tokens - Modify input tokenizer to handle JSON objects as subprompts - Break out MTMD prompt parsing into utility function - Support JSON objects with multimodal_data arrays for MTMD prompts along with other existing types - Add capability to model endpoint to indicate if client can send multimodal data - Add tests.

* ggml-cpu: initial q5_0 impl for s390x Signed-off-by: Aaron Teo <[email protected]> * ggml-cpu: updated q5_0 code for better performance Signed-off-by: Aaron Teo <[email protected]> * ggml-cpu: use optimised hsum for better performance Signed-off-by: Aaron Teo <[email protected]> * ggml-cpu: introduce q5_1 simd + refactor q5_0 Signed-off-by: Aaron Teo <[email protected]> * ggml-cpu: fix incorrect return type vec_hsum Signed-off-by: Aaron Teo <[email protected]> * ggml-cpu: q5_0 incomplete refactor + table_b2b_0 activation Signed-off-by: Aaron Teo <[email protected]> * ggml-cpu: refactor q5_1 Signed-off-by: Aaron Teo <[email protected]> * ggml-cpu: q5_1 update loop unroll to 4 Signed-off-by: Aaron Teo <[email protected]> * ggml-cpu: update q5_0 unroll to 4 Signed-off-by: Aaron Teo <[email protected]> * ggml-cpu: update build-s390x docs Signed-off-by: Aaron Teo <[email protected]> * ggml-cpu: update unused variables q5_0 Signed-off-by: Aaron Teo <[email protected]> * docs: update the last update date Signed-off-by: Aaron Teo <[email protected]> --------- Signed-off-by: Aaron Teo <[email protected]>

ggml-ci

* Add Pad Reflect 1D CUDA support * Update ggml/src/ggml-cuda/pad_reflect_1d.cu Co-authored-by: Johannes Gäßler <[email protected]> --------- Co-authored-by: Johannes Gäßler <[email protected]>

* add conv3d * bump GGML_OP_COUNT

* Begin work on set_rows * Work on set rows * Add error buffers for reporting unsupported SET_ROWS indices * Remove extra comments * Work on templating for different types in shaders * Work on shader type generation * Working q4_0 mul_mat and some templating for different types * Add q4_0_f16 matmul and fix device init * Add matmul support for basic quantization types * Add q2_k and q3_k quantization * Add rest of k-quants * Get firt i-quant working * Closer to supporting all i-quants * Support rest of i-quants * Cleanup code * Fix python formatting * debug * Bugfix for memset * Add padding to end of buffers on creation * Simplify bit-shifting * Update usage of StringView

…org#15427) - Spread the work across the whole workgroup. Using more threads seems to far outweigh the synchronization overhead. - Specialize the code for when the division is by a power of two.

* vulkan : support ggml_mean * vulkan : support sum, sum_rows and mean with non-contiguous tensors * vulkan : fix subbuffer size not accounting for misalign offset * tests : add backend-op tests for non-contiguous sum_rows * cuda : require contiguous src for SUM_ROWS, MEAN support * sycl : require contiguous src for SUM, SUM_ROWS, ARGSORT support * require ggml_contiguous_rows in supports_op and expect nb00=1 in the shader

Signed-off-by: Xiaodong Ye <[email protected]>

…gml-org#15489) Track a list of nodes that need synchronization, and only sync if the new node depends on them (or overwrites them). This allows some overlap which can improve performance, and centralizes a big chunk of the synchronization logic. The remaining synchronization logic involves writes to memory other than the nodes, e.g. for dequantization or split_k. Each of these allocations has a bool indicating whether they were in use and need to be synced. This should be checked before they are written to, and set to true after they are done being consumed.

* First draft * Fix linter errors * Added missing sinks nullptr * Don't forget the llama-arch! * We're through to the generation stage. * Fix post-attention norm * Apply suggestions from code review Co-authored-by: Sigbjørn Skjæret <[email protected]> * Fix RoPE type * Fix tensor name and reorder llm_types * Update gguf-py/gguf/constants.py Remove nonexistent FFN_POST_NORM tensor Co-authored-by: Sigbjørn Skjæret <[email protected]> * Update src/llama-model.h Co-authored-by: Sigbjørn Skjæret <[email protected]> * Add basic chat template * Add chat template tests * Remake chat template test * Apply suggestions from code review Co-authored-by: Sigbjørn Skjæret <[email protected]> * Update src/llama-chat.cpp Co-authored-by: Sigbjørn Skjæret <[email protected]> * Reorder llm type descriptions * Update src/llama-model.cpp Co-authored-by: Sigbjørn Skjæret <[email protected]> --------- Co-authored-by: Sigbjørn Skjæret <[email protected]>

…le SMs (ggml-org#15281) * vulkan: optimize rms_norm, and allow the work to spread across multiple SMs There are really two parts to this change: (1) Some optimizations similar to what we have in soft_max, to unroll with different numbers of iterations. (2) A fusion optimization where we detect add followed by rms_norm, and make the add shader atomically accumulate the values^2 into memory. Then the rms_norm shader can just load that sum. This allows the rms_norm to be parallelized across multiple workgroups, it just becomes a simple per-element multiply. The fusion optimization is currently only applied when the rms_norm is on a single vector. This previously always ran on a single SM. It could apply more broadly, but when there are other dimensions the work can already spread across SMs, and there would be some complexity to tracking multiple atomic sums. * Change add+rms_norm optimization to write out an array of partial sums rather than using atomic add, to make it deterministic. The rms_norm shader fetches a subgroup's worth in parallel and uses subgroupAdd to add them up. * complete rebase against fused adds - multi_add shader can also compute partial sums * fix validation errors * disable add_rms_fusion for Intel due to possible driver bug * resolve against ggml-org#15489, sync after clearing partial sums

) * vulkan: workaround MoltenVK compile failure in multi_add * Update ggml/src/ggml-vulkan/vulkan-shaders/multi_add.comp Co-authored-by: 0cc4m <[email protected]>

…g#15526)

The scalar FA shader already handled multiples of 8. The coopmat1 FA shader assumed 16x16x16 and the shared memory allocations need the HSK dimensions padded to a multiple of 16. NVIDIA's coopmat2 implementation requires multiples of 16 for N and K, and needs the matrix dimensions padded and loads clamped. Store the FA pipelines in a map, indexed by the pipeline state.

* kv-cache : support layer reuse ggml-ci * cont : update comments [no ci]

…ggml-org#15524) * vulkan: use subgroup function for mul_mat_id shader even without coopmat * vulkan: fix compile warnings * vulkan: properly check for subgroup size control and require full subgroups for subgroup mul_mat_id * vulkan: disable subgroup mul_mat_id on devices with subgroups < 16

Signed-off-by: noemotiovon <[email protected]>

* support interns1-mini * fix comment * update

ggml-ci

Signed-off-by: Weizhao Ouyang <[email protected]>

…5562) * batched-bench : fix unified KV cache handling + pp timing * cont : run dummy token only with split KV cache

…ml-org#15557) * model-conversion: add model card template for embeddings [no ci] This commit adds a separate model card template (model repository README.md template) for embedding models. The motivation for this is that there server command for the embedding model is a little different and some addition information can be useful in the model card for embedding models which might not be directly relevant for causal models. * squash! model-conversion: add model card template for embeddings [no ci] Fix pyright lint error. * remove --pooling override and clarify embd_normalize usage

…5564) This commit explicitly sets the pooling type to 'none' in the logits.cpp to support models that have a pooling type specified. The motivation for this is that some models may have a pooling type set in the model file (.gguf file) and for this specific case where we only want to extract logits, we need to ensure that no pooling is used to so that we are comparing raw logits and not pooled embeddings.

* CUDA: MoE helper in device code, better tile sizes * reduce superfluous CUDA blocks

atrivedi-tsavoritesi

Hi Dinesh,

Did you do a merge or a rebase?

Ideally, these files should show only the differences that we made, right?

Also, I looked at the results, and it looks like run_llama_cli.sh works OK even though the sigmoid test fails. But sigmoid is still CPU-only, right?

Thanks,
Ashish

dineshReddy6381 · 2025-09-17T15:51:20Z

I did a merge.
I think the last update happened sometime in May. There were a lot of new files and changes in many old files, so the PR is showing many changes and new files.

Sigmoid was not working in the earlier code (master branch) either. It's not CPU, it's OPU now. I noticed the sigmoid failure today while doing final testing; I have to debug the sigmoid code and its input/output parameters.
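One way to start debugging the sigmoid inputs and outputs is to compare the backend's results against a plain CPU reference. The sketch below is only an illustration: `find_mismatches`, the tolerance of 1e-5, and the sample values are assumptions, not part of ggml or of the fork's test suite.

```python
import math
from typing import List, Sequence, Tuple


def sigmoid(x: float) -> float:
    """Numerically stable logistic function 1 / (1 + exp(-x))."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)  # x < 0 here, so exp(x) cannot overflow
    return e / (1.0 + e)


def find_mismatches(
    inputs: Sequence[float],
    outputs: Sequence[float],
    tol: float = 1e-5,  # assumed tolerance, not taken from any real test
) -> List[Tuple[int, float, float]]:
    """Return (index, got, expected) for every output outside the tolerance."""
    if len(inputs) != len(outputs):
        raise ValueError("inputs and outputs must have the same length")
    mismatches = []
    for i, (x, got) in enumerate(zip(inputs, outputs)):
        expected = sigmoid(x)
        if abs(got - expected) > tol:
            mismatches.append((i, got, expected))
    return mismatches


# Made-up sample: the last "backend" value is deliberately wrong.
sample_inputs = [-2.0, 0.0, 2.0]
sample_outputs = [sigmoid(-2.0), 0.5, 0.0]
for index, got, expected in find_mismatches(sample_inputs, sample_outputs):
    print(f"index {index}: got {got}, expected {expected}")
```

Reporting the index of each mismatch, rather than a single pass/fail, can help tell a wrong kernel apart from wrongly passed input/output parameters (for example, an offset or length error that only corrupts part of the tensor).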

atrivedi-tsavoritesi · 2025-09-17T16:16:55Z

OK, I am a bit confused. Do you mean sigmoid fails in our fork or upstream?

dineshReddy6381 · 2025-09-17T16:19:49Z

Yes, it's failing in our fork also.

akapoor3518

Dinesh,

There are 804 files here. I believe I had asked for a list of only the files we’ve modified or are relevant to our work. I had already sent you a list, and I expected that you would manually verify each file against it.

Below is the list I shared in chat. I hope you’ve reviewed each file manually—not just relied on testing—since automated checks might miss some cases. I’ll also go through the files listed below to ensure coverage.

Snapshots are provided below for manual merge.

https://github.com/tsisw/llama.cpp/pull/1/files
1 CMakeLists.txt
2 common/CMakeLists.txt
3 examples/gguf-hash/CMakeLists.txt
4 examples/gguf/CMakeLists.txt
5 examples/lookup/CMakeLists.txt
6 examples/simple-chat/CMakeLists.txt
7 examples/simple/CMakeLists.txt
8 examples/simple/simple-backend-tsi.cpp ---New file
9 ggml/CMakeLists.txt
10 ggml/include/ggml-tsavorite.h ---New file
11 ggml/src/CMakeLists.txt
12 ggml/src/ggml-backend-reg.cpp
13 ggml/src/ggml-tsavorite ---New Dir
14 tests/CMakeLists.txt
15 tsi-pkg-build.sh ---New

https://github.com/tsisw/llama.cpp/pull/2/files
16 ggml/src/ggml-cpu/CMakeLists.txt

17 README.md
18 docs/build.md
19 ggml/src/ggml-backend.cpp

Disable unnecessary logs
https://github.com/tsisw/llama.cpp/pull/13/files
20 common/log.h
21 ggml/include/ggml.h
22 ggml/src/ggml-impl.h
23 ggml/src/ggml.c
24 src/llama-context.cpp
25 src/llama-impl.h
26 src/llama-sampling.cpp
27 tools/main/main.cpp

Perf Status
https://github.com/tsisw/llama.cpp/pull/26/files
28 ggml/src/ggml-cpu/ggml-cpu.c
29 src/llama-context.h
https://github.com/tsisw/llama.cpp/pull/40/files
python script to run model with different prompt to measure performance
30 model-rerun.py
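The manual check against this list can be backed up with a small script. The sketch below is a hedged illustration, not the team's actual process: it assumes the changed paths come from something like `git diff --name-only upstream/master HEAD` (the remote name `upstream` is an assumption), and the helper names and sample diff are hypothetical.

```python
from typing import Iterable, List, Set


def parse_name_only(diff_output: str) -> Set[str]:
    """Turn `git diff --name-only` style output into a set of paths."""
    return {line.strip() for line in diff_output.splitlines() if line.strip()}


def missing_from_diff(expected: Iterable[str], changed: Set[str]) -> List[str]:
    """Return the expected paths that do not appear among the changed paths.

    An expected entry ending in '/' is treated as a directory and counts as
    covered if any changed path lies underneath it.
    """
    missing = []
    for path in expected:
        if path.endswith("/"):
            covered = any(c.startswith(path) for c in changed)
        else:
            covered = path in changed
        if not covered:
            missing.append(path)
    return missing


# A few entries from the list above; the diff text is made up for illustration.
expected_files = [
    "ggml/include/ggml-tsavorite.h",
    "ggml/src/ggml-tsavorite/",
    "tsi-pkg-build.sh",
]
sample_diff = """ggml/include/ggml-tsavorite.h
ggml/src/ggml-tsavorite/example.cpp
"""
print(missing_from_diff(expected_files, parse_name_only(sample_diff)))
```

A check like this only confirms that each listed file still differs from upstream; it does not confirm that the content of each change survived the merge, so it complements a manual review rather than replacing it.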

dineshReddy6381 · 2025-09-19T10:44:48Z

@akapoor3518 I have manually verified the commits above. All of their changes are accounted for and present in the merged files.

akapoor3518

I also looked at the files of concern; they look fine. Approving.

atrivedi-tsavoritesi

@dineshReddy6381 Approving. I am assuming you have validated POSIX and FPGA. Can you make sure you attach the relevant logs to this PR?

…tch to GCC 13.3.0 to ensure compatibility with the target GLIBC version. This addresses runtime linking errors when running tests on POSIX without the export path. - Cleaned up unused .yml files so they stop running after every commit. Signed-off-by: Dinesh Reddy <[email protected]>

dineshReddy6381 · 2025-09-22T11:03:45Z

@akapoor3518 @atrivedi-tsavoritesi: I have pushed a few more changes to the PR. I noticed runtime linking errors when I did a fresh clone and ran without the export path. I also removed a few .yml files that are not needed.
Attached are both the POSIX and FPGA results:
ggml_log_fpga.txt
ggml_log_posix.txt

akapoor3518

Please open a new terminal and test on multiple AWS machines, just to make sure that all the ENV variables you are setting with the .sh script are taken care of. I am approving now.
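A fresh-terminal test can be complemented by an explicit check that the expected variables are actually set. This is a minimal sketch under stated assumptions: the variable names in `REQUIRED` are placeholders (the thread does not name the variables the .sh script exports), and `missing_env` is a hypothetical helper.

```python
import os
from typing import Iterable, List, Mapping, Optional


def missing_env(
    required: Iterable[str],
    environ: Optional[Mapping[str, str]] = None,
) -> List[str]:
    """Return the names in `required` that are unset or empty."""
    env = os.environ if environ is None else environ
    return [name for name in required if not env.get(name)]


# Placeholder names only; replace with whatever the setup script exports.
REQUIRED = ["EXAMPLE_LIB_PATH", "EXAMPLE_TOOLCHAIN_DIR"]

if __name__ == "__main__":
    missing = missing_env(REQUIRED)
    if missing:
        print("Missing environment variables:", ", ".join(missing))
    else:
        print("All required environment variables are set.")
```

Passing the environment as a parameter keeps the check testable without touching the real shell state; run it in a newly opened terminal so that variables left over from an earlier session do not hide a missing export.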

65a and others added 30 commits August 22, 2025 10:10

llama : remove KV cache defragmentation logic (ggml-org#15473)

9ebebef

ggml-ci

cuda : add Pad Reflect 1D support (ggml-org#14659)

b1ab918

* Add Pad Reflect 1D CUDA support * Update ggml/src/ggml-cuda/pad_reflect_1d.cu Co-authored-by: Johannes Gäßler <[email protected]> --------- Co-authored-by: Johannes Gäßler <[email protected]>

ggml: add conv3d op (ggml-org#15182)

92f7f0a

* add conv3d * bump GGML_OP_COUNT

model : gpt-oss add response_format support (ggml-org#15494)

32732f2

test-opt: allow slight inprecision (ggml-org#15503)

e92734d

vulkan: optimize mul_mat_id loading row ids into shared memory (ggml-…

330c3d2

…org#15427) - Spread the work across the whole workgroup. Using more threads seems to far outweigh the synchronization overhead. - Specialize the code for when the division is by a power of two.

vulkan.Dockerfile: install vulkan SDK using tarball (ggml-org#15282)

b55f06e

Signed-off-by: Xiaodong Ye <[email protected]>

chat : fix debug build assertion in trim function (ggml-org#15520)

21dc4dd

scripts: fix compare-llama-bench.py (ggml-org#15521)

9ef5369

CUDA: fix half2 -> half conversion for HIP (ggml-org#15529)

710dfc4

vulkan: workaround MoltenVK compile failure in multi_add (ggml-org#15506

e78cf0d

) * vulkan: workaround MoltenVK compile failure in multi_add * Update ggml/src/ggml-vulkan/vulkan-shaders/multi_add.comp Co-authored-by: 0cc4m <[email protected]>

vulkan: enable Conv2D for Apple after MoltenVK fixed the bug (ggml-or…

a9c6ffc

…g#15526)

kv-cache : support layer reuse (ggml-org#15504)

b730706

* kv-cache : support layer reuse ggml-ci * cont : update comments [no ci]

CANN: ROPE cache sin/cos repeat (ggml-org#15501)

c247d06

Signed-off-by: noemotiovon <[email protected]>

convert : support interns1-mini (ggml-org#15412)

7da9fed

* support interns1-mini * fix comment * update

metal : add FA kernels for HS=40 (ggml-org#15559)

b0ba31f

ggml-ci

convert : update Ernie 4.5 dense architecture name (ggml-org#15555)

0d5a470

Signed-off-by: Weizhao Ouyang <[email protected]>

batched-bench : fix unified KV cache handling + pp timing (ggml-org#1…

6b64f74

…5562) * batched-bench : fix unified KV cache handling + pp timing * cont : run dummy token only with split KV cache

CUDA: MoE helper in device code, better tile sizes (ggml-org#15525)

5eff6ec

* CUDA: MoE helper in device code, better tile sizes * reduce superfluous CUDA blocks

github-actions bot added examples devops python script android server ggml nix Ascend NPU OpenCL labels Sep 17, 2025

atrivedi-tsavoritesi reviewed Sep 17, 2025

akapoor3518 reviewed Sep 18, 2025

Dinesh Reddy added 4 commits September 18, 2025 22:11

-Disabled CI, flake8 Lint ,editor config, python lint workflow defaul…

7a6ce92

…t run Signed-off-by: Dinesh Reddy <[email protected]>

-Disabled riscv-native, editorconfig, python-type-check, server check…

1432366

…list run for every commit Signed-off-by: Dinesh Reddy <[email protected]>

-disabled python-lint. CI check

b91626c

Signed-off-by: Dinesh Reddy <[email protected]>

-Disabled all automatic checks for commits

c9a365e

Signed-off-by: Dinesh Reddy <[email protected]>

akapoor3518 approved these changes Sep 19, 2025

atrivedi-tsavoritesi approved these changes Sep 19, 2025

dineshReddy6381 requested review from akapoor3518 and atrivedi-tsavoritesi September 22, 2025 11:04

atrivedi-tsavoritesi approved these changes Sep 22, 2025

akapoor3518 approved these changes Sep 22, 2025

dineshReddy6381 merged commit 0e6f8a7 into master Sep 22, 2025
1 check passed

dineshReddy6381 mentioned this pull request Sep 23, 2025

Revert "Merged the latest changes from ggml-org/llama.cpp into our fork(master)" #53

Closed

71 participants
