Conversation

Contributor

@DajanaV DajanaV commented Nov 3, 2025

Mirrored from ggml-org/llama.cpp#16969

This PR adds basic Flash Attention support for the SYCL backend, enabling more efficient attention computation on Intel GPUs.

  • Implemented Flash Attention kernel for SYCL backend
  • Added forward pass implementation with block-wise computation
  • Integrated with existing GGML SYCL infrastructure
  • Support for both F32 and F16
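
The forward pass listed above uses block-wise computation. Below is a minimal CPU reference of that idea (the online-softmax formulation of flash attention), kept deliberately simple; the function name, layout, and block size are illustrative assumptions, not the actual SYCL kernel.

```cpp
// Block-wise (online-softmax) attention reference: O = softmax(Q K^T * scale) V,
// processing K/V in blocks of size B while keeping running max/sum per query row.
#include <algorithm>
#include <cmath>
#include <cstdio>
#include <vector>

static void flash_attn_ref(const float * Q, const float * K, const float * V,
                           float * O, int M, int N, int D, int B) {
    const float scale = 1.0f / std::sqrt((float) D);
    for (int i = 0; i < M; ++i) {
        float m = -INFINITY;             // running max of attention scores
        float l = 0.0f;                  // running softmax denominator
        std::vector<float> acc(D, 0.0f); // running un-normalized output
        for (int j0 = 0; j0 < N; j0 += B) {
            const int j1 = std::min(j0 + B, N);
            for (int j = j0; j < j1; ++j) {
                float s = 0.0f;
                for (int d = 0; d < D; ++d) s += Q[i*D + d] * K[j*D + d];
                s *= scale;
                const float m_new = std::max(m, s);
                const float corr  = std::exp(m - m_new); // rescale previous partials
                const float p     = std::exp(s - m_new);
                for (int d = 0; d < D; ++d) acc[d] = acc[d]*corr + p*V[j*D + d];
                l = l*corr + p;
                m = m_new;
            }
        }
        for (int d = 0; d < D; ++d) O[i*D + d] = acc[d] / l;
    }
}

int main() {
    const int M = 2, N = 8, D = 4, B = 4;
    std::vector<float> Q(M*D, 0.1f), K(N*D, 0.2f), V(N*D, 0.3f), O(M*D);
    flash_attn_ref(Q.data(), K.data(), V.data(), O.data(), M, N, D, B);
    printf("O[0][0] = %f\n", O[0]);
    return 0;
}
```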

Authors:
Joint work by @safranowith and @ye-NX

Notes:

  • This is an initial implementation
  • Performance benchmarks and optimizations are planned for future iterations
  • Feedback and suggestions are welcome!

DamonFool and others added 30 commits September 24, 2025 08:46
* model : add label for LiquidAI LFM2-2.6B model

HF link: [LiquidAI/LFM2-2.6B](https://huggingface.co/LiquidAI/LFM2-2.6B).

Support for GGUF conversion and inference is added in #14620.

However, due to a similar `n_embd`, it is identified as a 1.2B model.
Fix the label by using `n_ff` to identify the model instead.

Output of `llama-bench`:
```
| model                          |       size |     params | backend    | threads |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ------: | --------------: | -------------------: |
| lfm2 1.2B F16                  |   2.18 GiB |     1.17 B | CPU        |      10 |           pp512 |        223.97 ± 5.32 |
| lfm2 2.6B F16                  |   4.79 GiB |     2.57 B | CPU        |      10 |           pp512 |         92.53 ± 4.14 |
| lfm2 350M F16                  | 676.25 MiB |   354.48 M | CPU        |      10 |           pp512 |       725.52 ± 11.70 |
| lfm2 700M F16                  |   1.38 GiB |   742.49 M | CPU        |      10 |           pp512 |       336.22 ± 12.93 |
```
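
The fix described above switches the size label lookup from `n_embd` to `n_ff`. A minimal sketch of that idea follows; the `n_ff` values are placeholders for illustration only, not the actual LFM2 hyperparameters.

```cpp
// Sketch: pick the size label from n_ff instead of n_embd, so models that
// share an embedding width still get distinct labels.
#include <cstdint>
#include <cstdio>

static const char * lfm2_size_label(uint32_t n_ff) {
    switch (n_ff) {
        case  4608: return "350M";  // placeholder value
        case  8192: return "1.2B";  // placeholder value
        case 12288: return "2.6B";  // placeholder value
        default:    return "?B";
    }
}

int main() {
    printf("%s\n", lfm2_size_label(12288));
    return 0;
}
```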

* Update src/llama-model.cpp

Co-authored-by: Sigbjørn Skjæret <[email protected]>

---------

Co-authored-by: Sigbjørn Skjæret <[email protected]>
…#15815)

* ggml : make gallocr respect the backend's max buffer size

* if the graph requires more memory than can fit into a single allocation, split it into multiple backend buffers
* vulkan: report the actual max allocation size in the buffer type interface

* fix missing newline, apple-clang warning

* track size of individual chunks in ggml_dyn_tallocr and raise max chunks.
revert to use suballocation_block_size as max chunk size for vulkan.

* track (chunk, offset) pairs instead of "global" offsets through gallocr.

* simpler, don't need loops to map between local/global offsets
* touches more code

* fix dyn_tallocr_max_size and initialization

* fix memory leak when buffers are reused due to same buffer type appearing multiple times

* make vbuffer allocation follow the same logic as backend_buffer did before

* continue to use leftover unallocated space of previous chunks after a new one has been created

* treat free blocks of each chunk as separate list
* they're still allocated together, but start/end of each chunk is tracked, and allocate/free iterate over sub-ranges
* exhaust freed blocks of all chunks before considering their last blocks with unallocated space
* start with 0 chunks/blocks and create chunks as needed
* allow the last chunk to grow beyond max size

* refactor: move adding new free block and new chunk into separate functions

* allocate chunks individually with a separate free-blocks list for each one

* needs a bit more memory/allocations/indirections, but code is simpler

* fix warnings (missing static) & debug checks
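
The commit notes above describe chunked allocation with a separate free-block list per chunk, chunks created on demand, and the last chunk allowed to grow beyond the max size. The sketch below illustrates that structure only (freeing is omitted); names and layout are assumptions, not the actual gallocr code.

```cpp
// Chunked dynamic allocator sketch: each chunk keeps its own free-block list,
// allocations return (chunk index, offset) pairs, and a new chunk is opened
// only when no existing chunk can satisfy the request.
#include <algorithm>
#include <cstddef>
#include <cstdio>
#include <utility>
#include <vector>

struct free_block { size_t offset, size; };

struct chunk {
    size_t size = 0;                   // total bytes reserved for this chunk
    std::vector<free_block> free_list; // per-chunk free ranges
};

struct tallocr {
    size_t max_chunk_size = 0;
    std::vector<chunk> chunks;         // start with 0 chunks, grow as needed

    std::pair<size_t, size_t> alloc(size_t size) {
        // exhaust free space of existing chunks before opening a new one
        for (size_t c = 0; c < chunks.size(); ++c) {
            auto & fl = chunks[c].free_list;
            for (auto & b : fl) {
                if (b.size >= size) {
                    const size_t offset = b.offset;
                    b.offset += size;
                    b.size   -= size;
                    return {c, offset};
                }
            }
        }
        chunk ck;
        ck.size = std::max(size, max_chunk_size);       // last chunk may exceed the max
        ck.free_list.push_back({size, ck.size - size}); // leftover space stays usable
        chunks.push_back(ck);
        return {chunks.size() - 1, 0};
    }
};

int main() {
    tallocr a;
    a.max_chunk_size = 1024;
    auto [c0, o0] = a.alloc(512);
    auto [c1, o1] = a.alloc(768);  // does not fit in chunk 0 -> new chunk
    printf("alloc0: chunk %zu off %zu, alloc1: chunk %zu off %zu\n", c0, o0, c1, o1);
    return 0;
}
```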
* llama: print memory breakdown on exit
* run the x64 ci on regular machines

* set up the same thing for arm

fix test-quantize-perf just like #12306

* try to disable sve

* add another sve run
Use RPC_DEBUG environment variable to enable debug messages.
Add helper macro LOG_DBG() which does an early
check of the env var before calling GGML_LOG_DEBUG().
Make sure we log a debug message for every server function.
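
A minimal sketch of the env-var-gated macro described above: the environment variable is checked once and cached, so disabled debug logging stays cheap. The helper name and the use of `fprintf` are illustrative, not the exact RPC backend code.

```cpp
// Debug logging gated on the RPC_DEBUG environment variable.
#include <cstdio>
#include <cstdlib>

static bool rpc_debug_enabled() {
    static const bool enabled = std::getenv("RPC_DEBUG") != nullptr; // checked once
    return enabled;
}

#define LOG_DBG(...)                      \
    do {                                  \
        if (rpc_debug_enabled()) {        \
            fprintf(stderr, __VA_ARGS__); \
        }                                 \
    } while (0)

int main() {
    LOG_DBG("rpc: connecting to %s\n", "localhost:50052");
    return 0;
}
```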
* metal : fuse NORM + MUL + ADD

* metal : support norms of non-multiple of 4

* cont : fix comment [no ci]
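
For reference, the fused NORM + MUL + ADD pattern mentioned above amounts to normalizing a row and applying the elementwise scale and shift in the same pass rather than as three separate ops. A small CPU sketch, illustrative only (the actual fusion happens in the Metal kernel):

```cpp
// Fused norm + mul + add reference: y = norm(x) * w + b in one final loop.
#include <cmath>
#include <cstdio>
#include <vector>

static void norm_mul_add(const float * x, const float * w, const float * b,
                         float * y, int n, float eps = 1e-5f) {
    float mean = 0.0f;
    for (int i = 0; i < n; ++i) mean += x[i];
    mean /= n;
    float var = 0.0f;
    for (int i = 0; i < n; ++i) var += (x[i] - mean) * (x[i] - mean);
    var /= n;
    const float inv_std = 1.0f / std::sqrt(var + eps);
    // the three logical ops (normalize, mul, add) collapse into one loop
    for (int i = 0; i < n; ++i) y[i] = (x[i] - mean) * inv_std * w[i] + b[i];
}

int main() {
    std::vector<float> x = {1, 2, 3, 4, 5}, w(5, 2.0f), b(5, 0.5f), y(5);
    norm_mul_add(x.data(), w.data(), b.data(), y.data(), 5);
    printf("y[0] = %f\n", y[0]);
    return 0;
}
```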
This commit adds support for using an externally started llama-server
instance for the server tests. This can be enabled by setting the
DEBUG_EXTERNAL environment variable.

The motivation for this is to allow debugging of the server itself
when investigating a test failure. Instructions for how to do this are
added to the README.md file in the tests directory.
This commit adds support for passing a prompt file to the model
conversion targets/scripts. It also updates the logits.cpp to print out
embedding information in the same format as when running the original
embedding model.

The motivation for this is that it allows us to pass files of different
sizes when running the converted models and validating the logits.

This can be particularly important when testing the sliding window
functionality of models where the sequence length needs to exceed a
certain number of tokens to trigger the sliding window logic.
* CUDA: add a fused top-K MoE kernel

This kernel does the following:
1. softmax over the logits per token [n_experts, n_tokens]
2. argmax reduce over the top-k (n_experts_used) logits
3. write weights + ids to global memory

It is intended as a fusion of the softmax->top-k->get_rows pipeline for MoE models

* Refactor into ggml_cuda_should_use_topk_moe

* Review: Use better coalescing pattern, use WARP_SIZE, store logits into registers before

* Review: format + micro-optimizations

* Fix bug: fix tie breakers

* Add optional norm + clean-up code

* Use smem for final write

* Add bounds check

* Use better memory pattern for writeback
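
The three steps listed at the top of this commit (softmax over the expert logits, top-k selection, write weights + ids, with an optional renormalization) can be summarized with a scalar CPU reference. This is only a sketch of the routing math, not the CUDA kernel:

```cpp
// MoE routing reference: softmax, top-k by repeated argmax (ties -> lowest id),
// optional renormalization of the selected weights.
#include <algorithm>
#include <cmath>
#include <cstdio>
#include <vector>

static void topk_moe_ref(const float * logits, int n_experts, int k, bool norm,
                         std::vector<float> & weights, std::vector<int> & ids) {
    // 1. softmax over the logits
    float max_l = logits[0];
    for (int e = 1; e < n_experts; ++e) max_l = std::max(max_l, logits[e]);
    std::vector<float> p(n_experts);
    float sum = 0.0f;
    for (int e = 0; e < n_experts; ++e) { p[e] = std::exp(logits[e] - max_l); sum += p[e]; }
    for (int e = 0; e < n_experts; ++e) p[e] /= sum;

    // 2. top-k by repeated argmax
    weights.clear(); ids.clear();
    std::vector<bool> taken(n_experts, false);
    for (int i = 0; i < k; ++i) {
        int best = -1;
        for (int e = 0; e < n_experts; ++e)
            if (!taken[e] && (best < 0 || p[e] > p[best])) best = e;
        taken[best] = true;
        weights.push_back(p[best]);
        ids.push_back(best);
    }

    // 3. optional renormalization of the selected weights
    if (norm) {
        float w_sum = 0.0f;
        for (float w : weights) w_sum += w;
        for (float & w : weights) w /= w_sum;
    }
}

int main() {
    const float logits[] = {0.1f, 2.0f, -1.0f, 1.5f};
    std::vector<float> w; std::vector<int> ids;
    topk_moe_ref(logits, 4, 2, /*norm=*/true, w, ids);
    printf("expert %d: %.3f, expert %d: %.3f\n", ids[0], w[0], ids[1], w[1]);
    return 0;
}
```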
Link to Java JNA bindings to llama.cpp native libraries
* vendor: update miniaudio.h

Signed-off-by: Aaron Teo <[email protected]>

* vendor: update miniaudio.h

Signed-off-by: Aaron Teo <[email protected]>

---------

Signed-off-by: Aaron Teo <[email protected]>
* add GroveMoE support

* remove constexpr that fails on certain compilers

* revert crude scalar div implementation, use cast

* build_attn_inp_kv_unified -> build_attn_inp_kv

* fix build_attn

* re-apply ffn_exps regex changes
* ci : create git tags for released docker images

When releasing a docker image for build number X, we should also create
the corresponding git tag. This allows users to easily check out the
source tree corresponding to a given docker image.

* Update .github/workflows/docker.yml

Co-authored-by: Sigbjørn Skjæret <[email protected]>

* Update .github/workflows/docker.yml

Co-authored-by: Sigbjørn Skjæret <[email protected]>

* Apply suggestion from @CISC

Co-authored-by: Sigbjørn Skjæret <[email protected]>

---------

Co-authored-by: Sigbjørn Skjæret <[email protected]>
* ggml-cpu: impl mxfp4 s390x

Signed-off-by: Aaron Teo <[email protected]>

* ggml-cpu: missing s = sumf

Signed-off-by: Aaron Teo <[email protected]>

* ggml-cpu: fix incorrect kval_mxfp4 type

Signed-off-by: Aaron Teo <[email protected]>

* ggml-cpu: rework mxfp4

Signed-off-by: Aaron Teo <[email protected]>

* ggml-cpu: missing delta calc

Signed-off-by: Aaron Teo <[email protected]>

* ggml-cpu: fix typo

Signed-off-by: Aaron Teo <[email protected]>

* ggml-cpu: fix typo for vec_splats

Signed-off-by: Aaron Teo <[email protected]>

* ggml-cpu: expand to 2 blocks per loop

Signed-off-by: Aaron Teo <[email protected]>

* ggml-cpu: add unroll to boost perf

Signed-off-by: Aaron Teo <[email protected]>

* ggml-cpu: back to 1 block per loop to test perf

Signed-off-by: Aaron Teo <[email protected]>

* Revert "ggml-cpu: back to 1 block per loop to test perf"

This reverts commit 1fe55724e2dc295701101bf838bdd4a512237492.

Signed-off-by: Aaron Teo <[email protected]>

* ggml-cpu: rm unroll from single block

Signed-off-by: Aaron Teo <[email protected]>

---------

Signed-off-by: Aaron Teo <[email protected]>
am17an and others added 20 commits October 18, 2025 11:52
Uses the technique used in the vulkan PR #16641. Neat trick!
This is similar to the CUDA shader from #16130, but doesn't use shared memory
and handles different subgroup sizes.
…t) (#16664)

* devops: initial patch

Signed-off-by: Aaron Teo <[email protected]>

* devops: forgot the z15 suffix

Signed-off-by: Aaron Teo <[email protected]>

* devops: attempt at impl GGML_CPU_ALL_VARIANTS for s390x

Signed-off-by: Aaron Teo <[email protected]>

* devops: rm baseline version

Signed-off-by: Aaron Teo <[email protected]>

---------

Signed-off-by: Aaron Teo <[email protected]>
Add Granite 4 models, mapping their embedding dimensions to the number of
parameters.

Information taken from https://huggingface.co/ibm-granite/granite-4.0-h-tiny

Signed-off-by: Giuseppe Scrivano <[email protected]>
The unexpected pooling_type warning was incorrectly shown when users did not
specify the --pooling-type parameter. In this case, the parameter
defaults to `LLAMA_POOLING_TYPE_UNSPECIFIED (-1)`, and the code
automatically applies the model's default pooling type.

Example of spurious warning:
```
$ llama-embedding -hf ggml-org/bge-m3-Q8_0-GGUF -p "hello"
...
llama_init_from_model: model default pooling_type is [2], but [-1] was specified
...
```

This fix ensures the warning only appears when users explicitly specify
a pooling type that differs from the model's default (e.g., using
--pooling-type mean on a model that expects CLS pooling).
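
A minimal sketch of the guard described above: warn only when the user explicitly requested a pooling type that differs from the model default, and stay silent when the parameter is left unspecified. Function names are illustrative, not the exact llama.cpp code.

```cpp
// Only warn on an explicit pooling-type override that disagrees with the model.
#include <cstdio>

enum llama_pooling_type {
    LLAMA_POOLING_TYPE_UNSPECIFIED = -1,
    LLAMA_POOLING_TYPE_NONE       = 0,
    LLAMA_POOLING_TYPE_MEAN       = 1,
    LLAMA_POOLING_TYPE_CLS        = 2,
};

static void check_pooling(int requested, int model_default) {
    if (requested != LLAMA_POOLING_TYPE_UNSPECIFIED && requested != model_default) {
        // user explicitly overrode the model default -> worth warning about
        fprintf(stderr, "model default pooling_type is [%d], but [%d] was specified\n",
                model_default, requested);
    }
    // UNSPECIFIED silently falls back to the model default, no warning
}

int main() {
    check_pooling(LLAMA_POOLING_TYPE_UNSPECIFIED, LLAMA_POOLING_TYPE_CLS); // silent
    check_pooling(LLAMA_POOLING_TYPE_MEAN,        LLAMA_POOLING_TYPE_CLS); // warns
    return 0;
}
```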
…613)

* SYCL: Add support for FLOOR,CEIL,ROUND and TRUNC unary operators

Clean up unrelated changes from previous commit

* Chore: remove empty lines and fix indentation

* Clean up: remove leftover blank lines and fix spacing

* chore: fix trailing whitespace and ensure final newline

* Cleanup: remove redundant declarations already defined in header

* Sync docs/ops.md with updated backend operation support

* docs: update ops.md after rebase

* docs: update ops.md - Vulkan supports SSM_CONV and SSM_SCAN
## Why it failed

When compiling with strict compiler flags (-Wmissing-braces -Werror=missing-braces),
the build fails with the following error:

```
cmake \
  -S . \
  -B ../llama.cpp.build \
  --preset=x64-linux-gcc-debug \
  -DCMAKE_INSTALL_PREFIX=/tmp/local \
  -DCMAKE_CXX_FLAGS="-Wmissing-braces -Werror=missing-braces" && \
cmake --build ../llama.cpp.build/
...
In file included from /home/otegami/work/cpp/llama.cpp/src/llama-graph.h:4,
                 from /home/otegami/work/cpp/llama.cpp/src/llama-model.h:5,
                 from /home/otegami/work/cpp/llama.cpp/src/llama.cpp:8:
/home/otegami/work/cpp/llama.cpp/src/llama-batch.h:126:48: error: missing braces around initializer for 'std::__array_traits<int, 1>::_Type' {aka 'int [1]'} [-Werror=missing-braces]
  126 |     std::array<llama_seq_id, 1> seq_id_0 = { 0 }; // default sequence id
      |                                                ^
cc1plus: some warnings being treated as errors
```

The issue is that std::array initialization requires double braces.

## How to fix

This PR changes `{ 0 }` to `{{ 0 }}` for std::array initialization.

This is part of a series of commits to fix missing braces warnings across the codebase.
- src/llama-batch.h <- This PR is here.
- src/llama-context.cpp
- tests/test-backend-ops.cpp
- tests/test-gguf.cpp
- tools/mtmd/clip.cpp

Benefits:
- std::array is a struct containing a C-style array, requiring nested braces
- Enables stricter compiler warnings to catch potential issues
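
To make the brace issue concrete, here is a tiny self-contained example of the two spellings; `llama_seq_id` stands in for the real typedef:

```cpp
// std::array is an aggregate wrapping a C-style array, so the outer braces
// initialize the struct and the inner braces initialize the array member.
#include <array>

using llama_seq_id = int;  // stand-in for the real typedef

std::array<llama_seq_id, 1> seq_id_single = {  0  };  // warns under -Wmissing-braces
std::array<llama_seq_id, 1> seq_id_0      = {{ 0 }};  // what this PR switches to

int main() { return seq_id_0[0] + seq_id_single[0]; }
```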
…rsations (#16327)

* feat: Per-conversation loading states and tracking streaming stats

* chore: update webui build output

* refactor: Chat state management

Consolidates loading state management by using a global `isLoading` store synchronized with individual conversation states.

This change ensures proper reactivity and avoids potential race conditions when updating the UI based on the loading status of different conversations. It also improves the accuracy of statistics displayed.

Additionally, slots service methods are updated to use conversation IDs for per-conversation state management, avoiding global state pollution.

* feat: Adds loading indicator to conversation items

* chore: update webui build output

* fix: Fix aborting chat streaming

Improves the chat stream abortion process by ensuring that partial responses are saved before the abort signal is sent.

This avoids a race condition where the onError callback could clear the streaming state before the partial response is saved. Additionally, the stream reading loop and callbacks are now checked for abort signals to prevent further processing after abortion.

* refactor: Remove redundant comments

* chore: build webui static output

* refactor: Cleanup

* chore: update webui build output

* chore: update webui build output

* fix: Conversation loading indicator for regenerating messages

* chore: update webui static build

* feat: Improve configuration

* feat: Install `http-server` as a dev dependency to avoid relying on `npx` in CI
* webui : added download action (#13552)

* webui : import and export (for all conversations)

* webui : fixed download-format, import of one conversation

* webui : add ExportedConversations type for chat import/export

* feat: Update naming & order

* chore: Linting

* feat: Import/Export UX improvements

* chore: update webui build output

* feat: Update UI placement of Import/Export tab in Chat Settings Dialog

* refactor: Cleanup

chore: update webui build output

* feat: Enable shift-click multiple conversation items selection

* chore: update webui static build

* chore: update webui static build

---------

Co-authored-by: Sascha Rogmann <[email protected]>
* fix: Prevent premature submission on IME input

* chore: update webui static build

* refactor: Put the IME completion checker in a helper function and add a check for `KeyboardEvent.keyCode === 229`

* chore: update webui static build

* chore: update webui static build

* chore: update webui static build
* add BailingMoeV2 support

* update llm types

* undo

* undo

* update llm types

* add model collection link

* update

* almost working

* correct group selection and rename n_group_exp

* avoid large top_k and use argmax instead for now

if we had something like argmax2 that would be equivalent, but this works fine until then

* poke

* skip group selection when there are no tokens

* fix 1T conversion

* hopefully fixed expert group selection

third time's the charm?

* make expert group selection generally available

The new LLaDA2Moe model uses this method too, so make it generally available regardless of architecture.

* allow n_expert_groups to be 1 (Kimi K2)

* address review suggestions
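
One plausible sketch of the expert-group selection flow described in these commits: score each group of experts, keep the best groups, then pick the top experts only from the kept groups (with `n_expert_groups == 1` degenerating to plain top-k). The group score used here (max expert score per group) is an illustrative assumption, not necessarily what the model defines.

```cpp
// Group-then-expert routing sketch.
#include <algorithm>
#include <cstdio>
#include <utility>
#include <vector>

static std::vector<int> select_experts(const std::vector<float> & scores,
                                       int n_groups, int n_group_used, int n_expert_used) {
    const int n_expert = (int) scores.size();
    const int group_sz = n_expert / n_groups;

    // score each group by its best expert (illustrative heuristic)
    std::vector<std::pair<float, int>> group_scores(n_groups);
    for (int g = 0; g < n_groups; ++g) {
        float best = scores[g*group_sz];
        for (int i = 1; i < group_sz; ++i) best = std::max(best, scores[g*group_sz + i]);
        group_scores[g] = {best, g};
    }
    std::sort(group_scores.rbegin(), group_scores.rend()); // best groups first

    // gather candidate experts from the kept groups, then take the top-k of those
    std::vector<std::pair<float, int>> candidates;
    for (int k = 0; k < n_group_used; ++k) {
        const int g = group_scores[k].second;
        for (int i = 0; i < group_sz; ++i)
            candidates.push_back({scores[g*group_sz + i], g*group_sz + i});
    }
    std::sort(candidates.rbegin(), candidates.rend());

    std::vector<int> ids;
    for (int k = 0; k < n_expert_used && k < (int) candidates.size(); ++k)
        ids.push_back(candidates[k].second);
    return ids;
}

int main() {
    const std::vector<float> scores = {0.1f, 0.9f, 0.2f, 0.3f, 0.8f, 0.05f, 0.4f, 0.7f};
    const auto ids = select_experts(scores, /*n_groups=*/4, /*n_group_used=*/2, /*n_expert_used=*/2);
    for (int id : ids) printf("expert %d\n", id);
    return 0;
}
```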
* sycl: add PAD_REFLECT_D1 operator support

* docs(ops): regenerate docs/ops.md

* remove trailing whitespaces

* style: fix editorconfig issues — trim trailing spaces and normalize EOLs

* fix: move PAD_REFLECT_1D case outside of fall-through block
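
For readers unfamiliar with the operator, 1D reflect padding mirrors indices past the edges back into the row without repeating the edge element. A CPU reference sketch (not the SYCL kernel itself):

```cpp
// Reflect padding: [1 2 3 4] with pad 2/2 -> [3 2 1 2 3 4 3 2].
#include <cstdio>
#include <vector>

static std::vector<float> pad_reflect_1d(const std::vector<float> & src, int p0, int p1) {
    const int n = (int) src.size();
    std::vector<float> dst(n + p0 + p1);
    for (int i = 0; i < (int) dst.size(); ++i) {
        int j = i - p0;                    // position relative to the source row
        if (j < 0)  j = -j;                // reflect off the left edge
        if (j >= n) j = 2*(n - 1) - j;     // reflect off the right edge
        dst[i] = src[j];
    }
    return dst;
}

int main() {
    const std::vector<float> x = {1, 2, 3, 4};
    const auto y = pad_reflect_1d(x, 2, 2);
    for (float v : y) printf("%g ", v);    // 3 2 1 2 3 4 3 2
    printf("\n");
    return 0;
}
```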
@loci-agentic-ai

Access the complete analysis in the LOCI Dashboard

Performance Analysis Summary: LLaMA.cpp Project

Critical Function Performance Changes

Based on the analysis of version 4728565c-fc50-4956-9b0d-8a1586354530 compared to baseline 6e2eaebb-4d5e-4576-a5d2-5cc3b9172699, the following critical functions show performance changes:

Response Time Degradations

  • llama_context_default_params: +31 ns (+0.057%)
  • _ZNKSt8__detail12_CharMatcherINSt7__cxx1112regex_traitsIwEELb1ELb1EEclEw: +15 ns (+0.057%)

Throughput Degradations

  • _ZNKSt8__detail12_CharMatcherINSt7__cxx1112regex_traitsIwEELb1ELb1EEclEw: +15 ns (+0.057%)
  • _ZSt10_ConstructI12llama_ubatchJRKS0_EEvPT_DpOT0_: +15 ns (+0.053%)

Bottleneck Degradations

  • _ZSt10_ConstructI12llama_ubatchJRKS0_EEvPT_DpOT0_: +15 ns (+0.107%)

KPI Impact Analysis

1. Tokens Per Second Impact

No Direct Impact Identified: The core inference functions show no performance degradation:

  • llama_decode(): No changes detected
  • llama_encode(): No changes detected
  • llama_tokenize(): No changes detected

Assessment: Using the reference point that a 2 ms slowdown in llama_decode corresponds to roughly 7% fewer tokens per second, the current changes do not directly affect tokenization or inference performance.

2. Power Consumption Impact

Binary-Level Analysis:

  • build.bin.libllama.so: -0.0% change (306.895 µJ vs 306.895 µJ baseline)
  • build.bin.libggml-base.so: 0.0% change
  • build.bin.libggml-cpu.so: 0.0% change
  • build.bin.libggml.so: 0.0% change

Assessment: Power consumption remains stable across all binaries with negligible changes.

3. Quantization Efficiency

No Impact Detected:

  • llama_model_quantize(): No performance changes identified
  • Quantization-related functions show no degradation
  • GGML quantization operations remain stable

4. Memory Usage Impact

Affected Functions:

  • Memory Management: _ZSt10_ConstructI12llama_ubatchJRKS0_EEvPT_DpOT0_ shows +15 ns bottleneck increase
  • KV Cache Operations: Multiple cache-related functions show minor improvements (-0.07 to -0.11 ns)

Key Changes:

  • _ZNK27llama_kv_cache_iswa_context8get_baseEv: -0.07 ns throughput improvement
  • _ZNK22llama_kv_cache_context10get_ubatchEv: -0.07 ns throughput improvement

5. Batch Processing Impact

Affected Functions:

  • Batch Construction: _ZSt10_ConstructI12llama_ubatchJRKS0_EEvPT_DpOT0_ shows bottleneck degradation
  • Batch Operations: Core batch processing functions (llama_batch_init, llama_batch_get_one) show no changes

Root Cause Analysis

Control Flow Investigation

The CFG analysis of llama_context_default_params reveals:

  • Identical Assembly Code: No instruction-level changes between versions
  • Same Control Flow: Linear execution path unchanged
  • External Factors: Performance degradation likely due to binary layout changes from SYCL Flash Attention integration

SYCL Integration Impact

The addition of Flash Attention support introduces:

  • Binary Layout Changes: New object files alter memory layout affecting cache behavior
  • Compilation Effects: Additional template instantiations and link-time optimization changes
  • System Resource Competition: SYCL runtime initialization affects system performance

Action Items for Performance Improvement

Immediate Code Optimizations

  1. Batch Construction Optimization:

    • Investigate _ZSt10_ConstructI12llama_ubatchJRKS0_EEvPT_DpOT0_ bottleneck increase
    • Review memory allocation patterns in batch initialization
    • Consider pre-allocated batch pools to reduce construction overhead
  2. Binary Layout Optimization:

    • Implement section ordering to improve instruction cache locality
    • Consider profile-guided optimization (PGO) for better code placement
    • Review linker flags for optimal memory layout
  3. Template Instantiation Control:

    • Minimize regex template instantiations affecting character matching performance
    • Use explicit template instantiation to control compilation overhead
    • Consider template specialization for common use cases
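
As a sketch of the "pre-allocated batch pools" suggestion in item 1 above: reuse a fixed set of batch objects instead of constructing and destroying them per call. Names are illustrative, not llama.cpp internals.

```cpp
// Minimal object pool: acquire() hands out a pre-allocated batch, release()
// returns it; the vector storage keeps its capacity across reuses.
#include <cstddef>
#include <cstdio>
#include <vector>

struct ubatch {
    std::vector<int> tokens;
};

class ubatch_pool {
public:
    explicit ubatch_pool(size_t count) : pool_(count), free_(count) {
        for (size_t i = 0; i < count; ++i) free_[i] = &pool_[i];
    }
    ubatch * acquire() {
        if (free_.empty()) return nullptr;  // caller falls back to a fresh allocation
        ubatch * b = free_.back();
        free_.pop_back();
        b->tokens.clear();                  // reset contents, keep capacity
        return b;
    }
    void release(ubatch * b) { free_.push_back(b); }
private:
    std::vector<ubatch>   pool_;
    std::vector<ubatch *> free_;
};

int main() {
    ubatch_pool pool(4);
    ubatch * b = pool.acquire();
    b->tokens = {1, 2, 3};
    printf("batch has %zu tokens\n", b->tokens.size());
    pool.release(b);
    return 0;
}
```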

Build System Improvements

  1. Compilation Flags:

    • Enable link-time optimization (LTO) for better cross-module optimization
    • Use -fprofile-use with representative workloads
    • Consider -march=native for target-specific optimizations
  2. Memory Layout Control:

    • Implement custom section placement for critical functions
    • Use __attribute__((hot)) for frequently called functions
    • Consider memory prefetching hints in critical paths
  3. SYCL Integration Isolation:

    • Implement lazy SYCL initialization to reduce startup overhead
    • Isolate SYCL-specific code to minimize binary footprint impact
    • Use conditional compilation to reduce template instantiation overhead
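
A small sketch combining two of the suggestions above: marking a frequently called function hot so the compiler places it favorably, and deferring heavy runtime setup behind a function-local static so it only runs on first use. Names are illustrative and the attribute is GCC/Clang-specific.

```cpp
#include <cstdio>

__attribute__((hot))
static int score_token(int x) {
    return x * 31 + 7;   // stand-in for a hot inner-loop function
}

struct sycl_runtime {
    sycl_runtime() { printf("initializing SYCL runtime...\n"); } // expensive setup
};

static sycl_runtime & get_sycl_runtime() {
    static sycl_runtime rt;  // constructed on first call, not at program startup
    return rt;
}

int main() {
    printf("%d\n", score_token(42));  // no SYCL cost paid here
    get_sycl_runtime();               // runtime comes up only when first needed
    return 0;
}
```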

Performance Monitoring Focus Areas

The analysis shows that while individual function changes are minimal (0.05-0.11%), the cumulative effect of binary layout changes requires attention to:

  • Cache Performance: Monitor instruction and data cache miss rates
  • Memory Allocation Patterns: Track batch construction and KV cache efficiency
  • Template Instantiation Overhead: Control regex and SYCL template expansion
  • Binary Size Growth: Monitor object file size increases from new features

The current changes maintain stable inference performance while introducing new SYCL capabilities, with optimization opportunities focused on build-time and memory layout improvements rather than algorithmic changes.
