Conversation

Contributor

@DajanaV DajanaV commented Nov 3, 2025

Mirrored from ggml-org/llama.cpp#16969

This PR adds basic Flash Attention support for the SYCL backend, enabling more efficient attention computation on Intel GPUs.

  • Implemented Flash Attention kernel for SYCL backend
  • Added forward pass implementation with block-wise computation
  • Integrated with existing GGML SYCL infrastructure
  • Support for both F32 and F16
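
The forward pass listed above uses block-wise computation. Below is a minimal CPU reference of that idea (the online-softmax formulation of flash attention), kept deliberately simple; the function name, layout, and block size are illustrative assumptions, not the actual SYCL kernel.

```cpp
// Block-wise (online-softmax) attention reference: O = softmax(Q K^T * scale) V,
// processing K/V in blocks of size B while keeping running max/sum per query row.
#include <algorithm>
#include <cmath>
#include <cstdio>
#include <vector>

static void flash_attn_ref(const float * Q, const float * K, const float * V,
                           float * O, int M, int N, int D, int B) {
    const float scale = 1.0f / std::sqrt((float) D);
    for (int i = 0; i < M; ++i) {
        float m = -INFINITY;             // running max of attention scores
        float l = 0.0f;                  // running softmax denominator
        std::vector<float> acc(D, 0.0f); // running un-normalized output
        for (int j0 = 0; j0 < N; j0 += B) {
            const int j1 = std::min(j0 + B, N);
            for (int j = j0; j < j1; ++j) {
                float s = 0.0f;
                for (int d = 0; d < D; ++d) s += Q[i*D + d] * K[j*D + d];
                s *= scale;
                const float m_new = std::max(m, s);
                const float corr  = std::exp(m - m_new); // rescale previous partials
                const float p     = std::exp(s - m_new);
                for (int d = 0; d < D; ++d) acc[d] = acc[d]*corr + p*V[j*D + d];
                l = l*corr + p;
                m = m_new;
            }
        }
        for (int d = 0; d < D; ++d) O[i*D + d] = acc[d] / l;
    }
}

int main() {
    const int M = 2, N = 8, D = 4, B = 4;
    std::vector<float> Q(M*D, 0.1f), K(N*D, 0.2f), V(N*D, 0.3f), O(M*D);
    flash_attn_ref(Q.data(), K.data(), V.data(), O.data(), M, N, D, B);
    printf("O[0][0] = %f\n", O[0]);
    return 0;
}
```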

Authors:
Joint work by @safranowith and @ye-NX

Notes:

  • This is an initial implementation
  • Performance benchmarks and optimizations are planned for future iterations
  • Feedback and suggestions are welcome!

DamonFool and others added 30 commits September 24, 2025 08:46
* model : add label for LiquidAI LFM2-2.6B model

HF link: [LiquidAI/LFM2-2.6B](https://huggingface.co/LiquidAI/LFM2-2.6B).

Support for GGUF conversion and inference is added in #14620.

However, due to a similar `n_embd`, it is identified as a 1.2B model.
Fix the label by using `n_ff` to identify the model instead.

Output of `llama-bench`:
```
| model                          |       size |     params | backend    | threads |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ------: | --------------: | -------------------: |
| lfm2 1.2B F16                  |   2.18 GiB |     1.17 B | CPU        |      10 |           pp512 |        223.97 ± 5.32 |
| lfm2 2.6B F16                  |   4.79 GiB |     2.57 B | CPU        |      10 |           pp512 |         92.53 ± 4.14 |
| lfm2 350M F16                  | 676.25 MiB |   354.48 M | CPU        |      10 |           pp512 |       725.52 ± 11.70 |
| lfm2 700M F16                  |   1.38 GiB |   742.49 M | CPU        |      10 |           pp512 |       336.22 ± 12.93 |
```
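
The fix described above switches the size label lookup from `n_embd` to `n_ff`. A minimal sketch of that idea follows; the `n_ff` values are placeholders for illustration only, not the actual LFM2 hyperparameters.

```cpp
// Sketch: pick the size label from n_ff instead of n_embd, so models that
// share an embedding width still get distinct labels.
#include <cstdint>
#include <cstdio>

static const char * lfm2_size_label(uint32_t n_ff) {
    switch (n_ff) {
        case  4608: return "350M";  // placeholder value
        case  8192: return "1.2B";  // placeholder value
        case 12288: return "2.6B";  // placeholder value
        default:    return "?B";
    }
}

int main() {
    printf("%s\n", lfm2_size_label(12288));
    return 0;
}
```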

* Update src/llama-model.cpp

Co-authored-by: Sigbjørn Skjæret <[email protected]>

---------

Co-authored-by: Sigbjørn Skjæret <[email protected]>
…#15815)

* ggml : make gallocr respect the backend's max buffer size

* if the graph requires more memory than can fit into a single allocation, split it into multiple backend buffers
* vulkan: report the actual max allocation size in the buffer type interface

* fix missing newline, apple-clang warning

* track size of individual chunks in ggml_dyn_tallocr and raise max chunks.
revert to use suballocation_block_size as max chunk size for vulkan.

* track (chunk, offset) pairs instead of "global" offsets through gallocr.

* simpler, don't need loops to map between local/global offsets
* touches more code

* fix dyn_tallocr_max_size and initialization

* fix memory leak when buffers are reused due to same buffer type appearing multiple times

* make vbuffer allocation follow the same logic as backend_buffer did before

* continue to use leftover unallocated space of previous chunks after a new one has been created

* treat free blocks of each chunk as separate list
* they're still allocated together, but start/end of each chunk is tracked, and allocate/free iterate over sub-ranges
* exhaust freed blocks of all chunks before considering their last blocks with unallocated space
* start with 0 chunks/blocks and create chunks as needed
* allow the last chunk to grow beyond max size

* refactor: move adding new free block and new chunk into separate functions

* allocate chunks individually with a separate free-blocks list for each one

* needs a bit more memory/allocations/indirections, but code is simpler

* fix warnings (missing static) & debug checks
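
The commit notes above describe chunked allocation with a separate free-block list per chunk, chunks created on demand, and the last chunk allowed to grow beyond the max size. The sketch below illustrates that structure only (freeing is omitted); names and layout are assumptions, not the actual gallocr code.

```cpp
// Chunked dynamic allocator sketch: each chunk keeps its own free-block list,
// allocations return (chunk index, offset) pairs, and a new chunk is opened
// only when no existing chunk can satisfy the request.
#include <algorithm>
#include <cstddef>
#include <cstdio>
#include <utility>
#include <vector>

struct free_block { size_t offset, size; };

struct chunk {
    size_t size = 0;                   // total bytes reserved for this chunk
    std::vector<free_block> free_list; // per-chunk free ranges
};

struct tallocr {
    size_t max_chunk_size = 0;
    std::vector<chunk> chunks;         // start with 0 chunks, grow as needed

    std::pair<size_t, size_t> alloc(size_t size) {
        // exhaust free space of existing chunks before opening a new one
        for (size_t c = 0; c < chunks.size(); ++c) {
            auto & fl = chunks[c].free_list;
            for (auto & b : fl) {
                if (b.size >= size) {
                    const size_t offset = b.offset;
                    b.offset += size;
                    b.size   -= size;
                    return {c, offset};
                }
            }
        }
        chunk ck;
        ck.size = std::max(size, max_chunk_size);       // last chunk may exceed the max
        ck.free_list.push_back({size, ck.size - size}); // leftover space stays usable
        chunks.push_back(ck);
        return {chunks.size() - 1, 0};
    }
};

int main() {
    tallocr a;
    a.max_chunk_size = 1024;
    auto [c0, o0] = a.alloc(512);
    auto [c1, o1] = a.alloc(768);  // does not fit in chunk 0 -> new chunk
    printf("alloc0: chunk %zu off %zu, alloc1: chunk %zu off %zu\n", c0, o0, c1, o1);
    return 0;
}
```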
* llama: print memory breakdown on exit
* run the x64 ci on regular machines

* set up the same thing for arm

fix test-quantize-perf just like #12306

* try to disable sve

* add another sve run
Use RPC_DEBUG environment variable to enable debug messages.
Add helper macro LOG_DBG() which does an early
check of the env var before calling GGML_LOG_DEBUG().
Make sure we log a debug message for every server function.
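
A minimal sketch of the env-var-gated macro described above: the environment variable is checked once and cached, so disabled debug logging stays cheap. The helper name and the use of `fprintf` are illustrative, not the exact RPC backend code.

```cpp
// Debug logging gated on the RPC_DEBUG environment variable.
#include <cstdio>
#include <cstdlib>

static bool rpc_debug_enabled() {
    static const bool enabled = std::getenv("RPC_DEBUG") != nullptr; // checked once
    return enabled;
}

#define LOG_DBG(...)                      \
    do {                                  \
        if (rpc_debug_enabled()) {        \
            fprintf(stderr, __VA_ARGS__); \
        }                                 \
    } while (0)

int main() {
    LOG_DBG("rpc: connecting to %s\n", "localhost:50052");
    return 0;
}
```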
* metal : fuse NORM + MUL + ADD

* metal : support norms of non-multiple of 4

* cont : fix comment [no ci]
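
For reference, the fused NORM + MUL + ADD pattern mentioned above amounts to normalizing a row and applying the elementwise scale and shift in the same pass rather than as three separate ops. A small CPU sketch, illustrative only (the actual fusion happens in the Metal kernel):

```cpp
// Fused norm + mul + add reference: y = norm(x) * w + b in one final loop.
#include <cmath>
#include <cstdio>
#include <vector>

static void norm_mul_add(const float * x, const float * w, const float * b,
                         float * y, int n, float eps = 1e-5f) {
    float mean = 0.0f;
    for (int i = 0; i < n; ++i) mean += x[i];
    mean /= n;
    float var = 0.0f;
    for (int i = 0; i < n; ++i) var += (x[i] - mean) * (x[i] - mean);
    var /= n;
    const float inv_std = 1.0f / std::sqrt(var + eps);
    // the three logical ops (normalize, mul, add) collapse into one loop
    for (int i = 0; i < n; ++i) y[i] = (x[i] - mean) * inv_std * w[i] + b[i];
}

int main() {
    std::vector<float> x = {1, 2, 3, 4, 5}, w(5, 2.0f), b(5, 0.5f), y(5);
    norm_mul_add(x.data(), w.data(), b.data(), y.data(), 5);
    printf("y[0] = %f\n", y[0]);
    return 0;
}
```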
This commit adds support for using an externally started llama-server
instance for the server tests. This can be enabled by setting the
DEBUG_EXTERNAL environment variable.

The motivation for this is to allow debugging of the server itself
when investigating a test failure. Instructions for how to do this are
added to the README.md file in the tests directory.
This commit adds support for passing a prompt file to the model
conversion targets/scripts. It also updates the logits.cpp to print out
embedding information in the same format as when running the original
embedding model.

The motivation for this is that it allows us to pass files of different
sizes when running the converted models and validating the logits.

This can be particularly important when testing the sliding window
functionality of models where the sequence length needs to exceed a
certain number of tokens to trigger the sliding window logic.
* CUDA: add a fused top-K MoE kernel

This kernel does the following:
1. softmax over the logits per token [n_experts, n_tokens]
2. argmax reduce over the top-k (n_experts_used) logits
3. write weights + ids to global memory

It is intended as a fusion of the softmax->top-k->get_rows pipeline for MoE models

* Refactor into ggml_cuda_should_use_topk_moe

* Review: Use better coalescing pattern, use WARP_SIZE, store logits into registers before

* Review: format + micro-optimizations

* Fix bug: fix tie breakers

* Add optional norm + clean-up code

* Use smem for final write

* Add bounds check

* Use better memory pattern for writeback
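
The three steps listed at the top of this commit (softmax over the expert logits, top-k selection, write weights + ids, with an optional renormalization) can be summarized with a scalar CPU reference. This is only a sketch of the routing math, not the CUDA kernel:

```cpp
// MoE routing reference: softmax, top-k by repeated argmax (ties -> lowest id),
// optional renormalization of the selected weights.
#include <algorithm>
#include <cmath>
#include <cstdio>
#include <vector>

static void topk_moe_ref(const float * logits, int n_experts, int k, bool norm,
                         std::vector<float> & weights, std::vector<int> & ids) {
    // 1. softmax over the logits
    float max_l = logits[0];
    for (int e = 1; e < n_experts; ++e) max_l = std::max(max_l, logits[e]);
    std::vector<float> p(n_experts);
    float sum = 0.0f;
    for (int e = 0; e < n_experts; ++e) { p[e] = std::exp(logits[e] - max_l); sum += p[e]; }
    for (int e = 0; e < n_experts; ++e) p[e] /= sum;

    // 2. top-k by repeated argmax
    weights.clear(); ids.clear();
    std::vector<bool> taken(n_experts, false);
    for (int i = 0; i < k; ++i) {
        int best = -1;
        for (int e = 0; e < n_experts; ++e)
            if (!taken[e] && (best < 0 || p[e] > p[best])) best = e;
        taken[best] = true;
        weights.push_back(p[best]);
        ids.push_back(best);
    }

    // 3. optional renormalization of the selected weights
    if (norm) {
        float w_sum = 0.0f;
        for (float w : weights) w_sum += w;
        for (float & w : weights) w /= w_sum;
    }
}

int main() {
    const float logits[] = {0.1f, 2.0f, -1.0f, 1.5f};
    std::vector<float> w; std::vector<int> ids;
    topk_moe_ref(logits, 4, 2, /*norm=*/true, w, ids);
    printf("expert %d: %.3f, expert %d: %.3f\n", ids[0], w[0], ids[1], w[1]);
    return 0;
}
```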
Link to Java JNA bindings to llama.cpp native libraries
* vendor: update miniaudio.h

Signed-off-by: Aaron Teo <[email protected]>

* vendor: update miniaudio.h

Signed-off-by: Aaron Teo <[email protected]>

---------

Signed-off-by: Aaron Teo <[email protected]>
* add GroveMoE support

* remove constexpr that fails on certain compilers

* revert crude scalar div implementation, use cast

* build_attn_inp_kv_unified -> build_attn_inp_kv

* fix build_attn

* re-apply ffn_exps regex changes
* ci : create git tags for released docker images

When releasing a docker image for build number X, we should also create
the corresponding git tag. This allows users to easily check out the
source tree corresponding to a given docker image.

* Update .github/workflows/docker.yml

Co-authored-by: Sigbjørn Skjæret <[email protected]>

* Update .github/workflows/docker.yml

Co-authored-by: Sigbjørn Skjæret <[email protected]>

* Apply suggestion from @CISC

Co-authored-by: Sigbjørn Skjæret <[email protected]>

---------

Co-authored-by: Sigbjørn Skjæret <[email protected]>
* ggml-cpu: impl mxfp4 s390x

Signed-off-by: Aaron Teo <[email protected]>

* ggml-cpu: missing s = sumf

Signed-off-by: Aaron Teo <[email protected]>

* ggml-cpu: fix incorrect kval_mxfp4 type

Signed-off-by: Aaron Teo <[email protected]>

* ggml-cpu: rework mxfp4

Signed-off-by: Aaron Teo <[email protected]>

* ggml-cpu: missing delta calc

Signed-off-by: Aaron Teo <[email protected]>

* ggml-cpu: fix typo

Signed-off-by: Aaron Teo <[email protected]>

* ggml-cpu: fix typo for vec_splats

Signed-off-by: Aaron Teo <[email protected]>

* ggml-cpu: expand to 2 blocks per loop

Signed-off-by: Aaron Teo <[email protected]>

* ggml-cpu: add unroll to boost perf

Signed-off-by: Aaron Teo <[email protected]>

* ggml-cpu: back to 1 block per loop to test perf

Signed-off-by: Aaron Teo <[email protected]>

* Revert "ggml-cpu: back to 1 block per loop to test perf"

This reverts commit 1fe55724e2dc295701101bf838bdd4a512237492.

Signed-off-by: Aaron Teo <[email protected]>

* ggml-cpu: rm unroll from single block

Signed-off-by: Aaron Teo <[email protected]>

---------

Signed-off-by: Aaron Teo <[email protected]>
am17an and others added 20 commits October 18, 2025 11:52
Uses the technique used in the vulkan PR #16641. Neat trick!
This is similar to the CUDA shader from #16130, but doesn't use shared memory
and handles different subgroup sizes.
…t) (#16664)

* devops: initial patch

Signed-off-by: Aaron Teo <[email protected]>

* devops: forgot the z15 suffix

Signed-off-by: Aaron Teo <[email protected]>

* devops: attempt at impl GGML_CPU_ALL_VARIANTS for s390x

Signed-off-by: Aaron Teo <[email protected]>

* devops: rm baseline version

Signed-off-by: Aaron Teo <[email protected]>

---------

Signed-off-by: Aaron Teo <[email protected]>
Add Granite 4 models, mapping their embedding dimensions to the number of
parameters.

Information taken from https://huggingface.co/ibm-granite/granite-4.0-h-tiny

Signed-off-by: Giuseppe Scrivano <[email protected]>
The unexpected pooling_type warning was incorrectly shown when users did not
specify the --pooling-type parameter. In this case, the parameter
defaults to `LLAMA_POOLING_TYPE_UNSPECIFIED (-1)`, and the code
automatically applies the model's default pooling type.

Example of spurious warning:
```
$ llama-embedding -hf ggml-org/bge-m3-Q8_0-GGUF -p "hello"
...
llama_init_from_model: model default pooling_type is [2], but [-1] was specified
...
```

This fix ensures the warning only appears when users explicitly specify
a pooling type that differs from the model's default (e.g., using
--pooling-type mean on a model that expects CLS pooling).
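
A minimal sketch of the guard described above: warn only when the user explicitly requested a pooling type that differs from the model default, and stay silent when the parameter is left unspecified. Function names are illustrative, not the exact llama.cpp code.

```cpp
// Only warn on an explicit pooling-type override that disagrees with the model.
#include <cstdio>

enum llama_pooling_type {
    LLAMA_POOLING_TYPE_UNSPECIFIED = -1,
    LLAMA_POOLING_TYPE_NONE       = 0,
    LLAMA_POOLING_TYPE_MEAN       = 1,
    LLAMA_POOLING_TYPE_CLS        = 2,
};

static void check_pooling(int requested, int model_default) {
    if (requested != LLAMA_POOLING_TYPE_UNSPECIFIED && requested != model_default) {
        // user explicitly overrode the model default -> worth warning about
        fprintf(stderr, "model default pooling_type is [%d], but [%d] was specified\n",
                model_default, requested);
    }
    // UNSPECIFIED silently falls back to the model default, no warning
}

int main() {
    check_pooling(LLAMA_POOLING_TYPE_UNSPECIFIED, LLAMA_POOLING_TYPE_CLS); // silent
    check_pooling(LLAMA_POOLING_TYPE_MEAN,        LLAMA_POOLING_TYPE_CLS); // warns
    return 0;
}
```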
…613)

* SYCL: Add support for FLOOR,CEIL,ROUND and TRUNC unary operators

Clean up unrelated changes from previous commit

* Chore: remove empty lines and fix indentation

* Clean up: remove leftover blank lines and fix spacing

* chore: fix trailing whitespace and ensure final newline

* Cleanup: remove redundant declarations already defined in header

* Sync docs/ops.md with updated backend operation support

* docs: update ops.md after rebase

* docs: update ops.md - Vulkan supports SSM_CONV and SSM_SCAN
## Why it failed

When compiling with strict compiler flags (-Wmissing-braces -Werror=missing-braces),
the build fails with the following error:

```
cmake \
  -S . \
  -B ../llama.cpp.build \
  --preset=x64-linux-gcc-debug \
  -DCMAKE_INSTALL_PREFIX=/tmp/local \
  -DCMAKE_CXX_FLAGS="-Wmissing-braces -Werror=missing-braces" && \
cmake --build ../llama.cpp.build/
...
In file included from /home/otegami/work/cpp/llama.cpp/src/llama-graph.h:4,
                 from /home/otegami/work/cpp/llama.cpp/src/llama-model.h:5,
                 from /home/otegami/work/cpp/llama.cpp/src/llama.cpp:8:
/home/otegami/work/cpp/llama.cpp/src/llama-batch.h:126:48: error: missing braces around initializer for 'std::__array_traits<int, 1>::_Type' {aka 'int [1]'} [-Werror=missing-braces]
  126 |     std::array<llama_seq_id, 1> seq_id_0 = { 0 }; // default sequence id
      |                                                ^
cc1plus: some warnings being treated as errors
```

The issue is that std::array initialization requires double braces.

## How to fix

This PR changes `{ 0 }` to `{{ 0 }}` for std::array initialization.

This is part of a series of commits to fix missing braces warnings across the codebase.
- src/llama-batch.h <- This PR is here.
- src/llama-context.cpp
- tests/test-backend-ops.cpp
- tests/test-gguf.cpp
- tools/mtmd/clip.cpp

Benefits:
- std::array is a struct containing a C-style array, requiring nested braces
- Enables stricter compiler warnings to catch potential issues
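
To make the brace issue concrete, here is a tiny self-contained example of the two spellings; `llama_seq_id` stands in for the real typedef:

```cpp
// std::array is an aggregate wrapping a C-style array, so the outer braces
// initialize the struct and the inner braces initialize the array member.
#include <array>

using llama_seq_id = int;  // stand-in for the real typedef

std::array<llama_seq_id, 1> seq_id_single = {  0  };  // warns under -Wmissing-braces
std::array<llama_seq_id, 1> seq_id_0      = {{ 0 }};  // what this PR switches to

int main() { return seq_id_0[0] + seq_id_single[0]; }
```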
…rsations (#16327)

* feat: Per-conversation loading states and tracking streaming stats

* chore: update webui build output

* refactor: Chat state management

Consolidates loading state management by using a global `isLoading` store synchronized with individual conversation states.

This change ensures proper reactivity and avoids potential race conditions when updating the UI based on the loading status of different conversations. It also improves the accuracy of statistics displayed.

Additionally, slots service methods are updated to use conversation IDs for per-conversation state management, avoiding global state pollution.

* feat: Adds loading indicator to conversation items

* chore: update webui build output

* fix: Fix aborting chat streaming

Improves the chat stream abortion process by ensuring that partial responses are saved before the abort signal is sent.

This avoids a race condition where the onError callback could clear the streaming state before the partial response is saved. Additionally, the stream reading loop and callbacks are now checked for abort signals to prevent further processing after abortion.

* refactor: Remove redundant comments

* chore: build webui static output

* refactor: Cleanup

* chore: update webui build output

* chore: update webui build output

* fix: Conversation loading indicator for regenerating messages

* chore: update webui static build

* feat: Improve configuration

* feat: Install `http-server` as a dev dependency to avoid relying on `npx` in CI
* webui : added download action (#13552)

* webui : import and export (for all conversations)

* webui : fixed download-format, import of one conversation

* webui : add ExportedConversations type for chat import/export

* feat: Update naming & order

* chore: Linting

* feat: Import/Export UX improvements

* chore: update webui build output

* feat: Update UI placement of Import/Export tab in Chat Settings Dialog

* refactor: Cleanup

chore: update webui build output

* feat: Enable shift-click multiple conversation items selection

* chore: update webui static build

* chore: update webui static build

---------

Co-authored-by: Sascha Rogmann <[email protected]>
* fix: Prevent premature submission on IME input

* chore: update webui static build

* refactor: Put the IME completion checker in a helper function and add a check for `KeyboardEvent.keyCode === 229`

* chore: update webui static build

* chore: update webui static build

* chore: update webui static build
* add BailingMoeV2 support

* update llm types

* undo

* undo

* update llm types

* add model collection link

* update

* almost working

* correct group selection and rename n_group_exp

* avoid large top_k and use argmax instead for now

if we had something like argmax2 that would be equivalent, but this works fine until then

* poke

* skip group selection when there are no tokens

* fix 1T conversion

* hopefully fixed expert group selection

third time's the charm?

* make expert group selection generally available

The new LLaDA2Moe model uses this method too, so make it generally available regardless of architecture.

* allow n_expert_groups to be 1 (Kimi K2)

* address review suggestions
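
One plausible sketch of the expert-group selection flow described in these commits: score each group of experts, keep the best groups, then pick the top experts only from the kept groups (with `n_expert_groups == 1` degenerating to plain top-k). The group score used here (max expert score per group) is an illustrative assumption, not necessarily what the model defines.

```cpp
// Group-then-expert routing sketch.
#include <algorithm>
#include <cstdio>
#include <utility>
#include <vector>

static std::vector<int> select_experts(const std::vector<float> & scores,
                                       int n_groups, int n_group_used, int n_expert_used) {
    const int n_expert = (int) scores.size();
    const int group_sz = n_expert / n_groups;

    // score each group by its best expert (illustrative heuristic)
    std::vector<std::pair<float, int>> group_scores(n_groups);
    for (int g = 0; g < n_groups; ++g) {
        float best = scores[g*group_sz];
        for (int i = 1; i < group_sz; ++i) best = std::max(best, scores[g*group_sz + i]);
        group_scores[g] = {best, g};
    }
    std::sort(group_scores.rbegin(), group_scores.rend()); // best groups first

    // gather candidate experts from the kept groups, then take the top-k of those
    std::vector<std::pair<float, int>> candidates;
    for (int k = 0; k < n_group_used; ++k) {
        const int g = group_scores[k].second;
        for (int i = 0; i < group_sz; ++i)
            candidates.push_back({scores[g*group_sz + i], g*group_sz + i});
    }
    std::sort(candidates.rbegin(), candidates.rend());

    std::vector<int> ids;
    for (int k = 0; k < n_expert_used && k < (int) candidates.size(); ++k)
        ids.push_back(candidates[k].second);
    return ids;
}

int main() {
    const std::vector<float> scores = {0.1f, 0.9f, 0.2f, 0.3f, 0.8f, 0.05f, 0.4f, 0.7f};
    const auto ids = select_experts(scores, /*n_groups=*/4, /*n_group_used=*/2, /*n_expert_used=*/2);
    for (int id : ids) printf("expert %d\n", id);
    return 0;
}
```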
* sycl: add PAD_REFLECT_D1 operator support

* docs(ops): regenerate docs/ops.md

* remove trailing whitespaces

* style: fix editorconfig issues — trim trailing spaces and normalize EOLs

* fix: move PAD_REFLECT_1D case outside of fall-through block
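
For readers unfamiliar with the operator, 1D reflect padding mirrors indices past the edges back into the row without repeating the edge element. A CPU reference sketch (not the SYCL kernel itself):

```cpp
// Reflect padding: [1 2 3 4] with pad 2/2 -> [3 2 1 2 3 4 3 2].
#include <cstdio>
#include <vector>

static std::vector<float> pad_reflect_1d(const std::vector<float> & src, int p0, int p1) {
    const int n = (int) src.size();
    std::vector<float> dst(n + p0 + p1);
    for (int i = 0; i < (int) dst.size(); ++i) {
        int j = i - p0;                    // position relative to the source row
        if (j < 0)  j = -j;                // reflect off the left edge
        if (j >= n) j = 2*(n - 1) - j;     // reflect off the right edge
        dst[i] = src[j];
    }
    return dst;
}

int main() {
    const std::vector<float> x = {1, 2, 3, 4};
    const auto y = pad_reflect_1d(x, 2, 2);
    for (float v : y) printf("%g ", v);    // 3 2 1 2 3 4 3 2
    printf("\n");
    return 0;
}
```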
@loci-agentic-ai

Access the complete analysis in the LOCI Dashboard

Performance Analysis Summary: LLaMA.cpp Project

Critical Function Performance Changes

Based on the analysis of version 4728565c-fc50-4956-9b0d-8a1586354530 compared to baseline 6e2eaebb-4d5e-4576-a5d2-5cc3b9172699, the following critical functions show performance changes:

Response Time Degradations

  • llama_context_default_params: +31 ns (+0.057%)
  • _ZNKSt8__detail12_CharMatcherINSt7__cxx1112regex_traitsIwEELb1ELb1EEclEw: +15 ns (+0.057%)

Throughput Degradations

  • _ZNKSt8__detail12_CharMatcherINSt7__cxx1112regex_traitsIwEELb1ELb1EEclEw: +15 ns (+0.057%)
  • _ZSt10_ConstructI12llama_ubatchJRKS0_EEvPT_DpOT0_: +15 ns (+0.053%)

Bottleneck Degradations

  • _ZSt10_ConstructI12llama_ubatchJRKS0_EEvPT_DpOT0_: +15 ns (+0.107%)

KPI Impact Analysis

1. Tokens Per Second Impact

No Direct Impact Identified: The core inference functions show no performance degradation:

  • llama_decode(): No changes detected
  • llama_encode(): No changes detected
  • llama_tokenize(): No changes detected

Assessment: Using the reference point that a 2 ms slowdown in llama_decode corresponds to roughly 7% fewer tokens per second, the current changes do not directly affect tokenization or inference performance.

2. Power Consumption Impact

Binary-Level Analysis:

  • build.bin.libllama.so: -0.0% change (306.895 µJ vs 306.895 µJ baseline)
  • build.bin.libggml-base.so: 0.0% change
  • build.bin.libggml-cpu.so: 0.0% change
  • build.bin.libggml.so: 0.0% change

Assessment: Power consumption remains stable across all binaries with negligible changes.

3. Quantization Efficiency

No Impact Detected:

  • llama_model_quantize(): No performance changes identified
  • Quantization-related functions show no degradation
  • GGML quantization operations remain stable

4. Memory Usage Impact

Affected Functions:

  • Memory Management: _ZSt10_ConstructI12llama_ubatchJRKS0_EEvPT_DpOT0_ shows +15 ns bottleneck increase
  • KV Cache Operations: Multiple cache-related functions show minor improvements (-0.07 to -0.11 ns)

Key Changes:

  • _ZNK27llama_kv_cache_iswa_context8get_baseEv: -0.07 ns throughput improvement
  • _ZNK22llama_kv_cache_context10get_ubatchEv: -0.07 ns throughput improvement

5. Batch Processing Impact

Affected Functions:

  • Batch Construction: _ZSt10_ConstructI12llama_ubatchJRKS0_EEvPT_DpOT0_ shows bottleneck degradation
  • Batch Operations: Core batch processing functions (llama_batch_init, llama_batch_get_one) show no changes

Root Cause Analysis

Control Flow Investigation

The CFG analysis of llama_context_default_params reveals:

  • Identical Assembly Code: No instruction-level changes between versions
  • Same Control Flow: Linear execution path unchanged
  • External Factors: Performance degradation likely due to binary layout changes from SYCL Flash Attention integration

SYCL Integration Impact

The addition of Flash Attention support introduces:

  • Binary Layout Changes: New object files alter memory layout affecting cache behavior
  • Compilation Effects: Additional template instantiations and link-time optimization changes
  • System Resource Competition: SYCL runtime initialization affects system performance

Action Items for Performance Improvement

Immediate Code Optimizations

  1. Batch Construction Optimization:

    • Investigate _ZSt10_ConstructI12llama_ubatchJRKS0_EEvPT_DpOT0_ bottleneck increase
    • Review memory allocation patterns in batch initialization
    • Consider pre-allocated batch pools to reduce construction overhead
  2. Binary Layout Optimization:

    • Implement section ordering to improve instruction cache locality
    • Consider profile-guided optimization (PGO) for better code placement
    • Review linker flags for optimal memory layout
  3. Template Instantiation Control:

    • Minimize regex template instantiations affecting character matching performance
    • Use explicit template instantiation to control compilation overhead
    • Consider template specialization for common use cases
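
As a sketch of the "pre-allocated batch pools" suggestion in item 1 above: reuse a fixed set of batch objects instead of constructing and destroying them per call. Names are illustrative, not llama.cpp internals.

```cpp
// Minimal object pool: acquire() hands out a pre-allocated batch, release()
// returns it; the vector storage keeps its capacity across reuses.
#include <cstddef>
#include <cstdio>
#include <vector>

struct ubatch {
    std::vector<int> tokens;
};

class ubatch_pool {
public:
    explicit ubatch_pool(size_t count) : pool_(count), free_(count) {
        for (size_t i = 0; i < count; ++i) free_[i] = &pool_[i];
    }
    ubatch * acquire() {
        if (free_.empty()) return nullptr;  // caller falls back to a fresh allocation
        ubatch * b = free_.back();
        free_.pop_back();
        b->tokens.clear();                  // reset contents, keep capacity
        return b;
    }
    void release(ubatch * b) { free_.push_back(b); }
private:
    std::vector<ubatch>   pool_;
    std::vector<ubatch *> free_;
};

int main() {
    ubatch_pool pool(4);
    ubatch * b = pool.acquire();
    b->tokens = {1, 2, 3};
    printf("batch has %zu tokens\n", b->tokens.size());
    pool.release(b);
    return 0;
}
```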

Build System Improvements

  1. Compilation Flags:

    • Enable link-time optimization (LTO) for better cross-module optimization
    • Use -fprofile-use with representative workloads
    • Consider -march=native for target-specific optimizations
  2. Memory Layout Control:

    • Implement custom section placement for critical functions
    • Use __attribute__((hot)) for frequently called functions
    • Consider memory prefetching hints in critical paths
  3. SYCL Integration Isolation:

    • Implement lazy SYCL initialization to reduce startup overhead
    • Isolate SYCL-specific code to minimize binary footprint impact
    • Use conditional compilation to reduce template instantiation overhead
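
A small sketch combining two of the suggestions above: marking a frequently called function hot so the compiler places it favorably, and deferring heavy runtime setup behind a function-local static so it only runs on first use. Names are illustrative and the attribute is GCC/Clang-specific.

```cpp
#include <cstdio>

__attribute__((hot))
static int score_token(int x) {
    return x * 31 + 7;   // stand-in for a hot inner-loop function
}

struct sycl_runtime {
    sycl_runtime() { printf("initializing SYCL runtime...\n"); } // expensive setup
};

static sycl_runtime & get_sycl_runtime() {
    static sycl_runtime rt;  // constructed on first call, not at program startup
    return rt;
}

int main() {
    printf("%d\n", score_token(42));  // no SYCL cost paid here
    get_sycl_runtime();               // runtime comes up only when first needed
    return 0;
}
```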

Performance Monitoring Focus Areas

The analysis shows that while individual function changes are minimal (0.05-0.11%), the cumulative effect of binary layout changes requires attention to:

  • Cache Performance: Monitor instruction and data cache miss rates
  • Memory Allocation Patterns: Track batch construction and KV cache efficiency
  • Template Instantiation Overhead: Control regex and SYCL template expansion
  • Binary Size Growth: Monitor object file size increases from new features

The current changes maintain stable inference performance while introducing new SYCL capabilities, with optimization opportunities focused on build-time and memory layout improvements rather than algorithmic changes.
