UPSTREAM PR #16969: sycl: flash-attention implementation #51
Conversation
Signed-off-by: Jie Fu <[email protected]>
Signed-off-by: Uilian Ries <[email protected]>
…ontaining "." (#16215) Signed-off-by: Jie Fu <[email protected]>
* model : add label for LiquidAI LFM2-2.6B model

  HF link: [LiquidAI/LFM2-2.6B](https://huggingface.co/LiquidAI/LFM2-2.6B). Support for GGUF conversion and inference is added in #14620. However, due to similar `n_embd`, it identifies as a 1.2B model. Fix the label by using `n_ff` to identify the model instead.

  Output of `llama-bench`:
```
| model                          |       size |     params | backend    | threads |             test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ------: | ---------------: | -------------------: |
| lfm2 1.2B F16                  |   2.18 GiB |     1.17 B | CPU        |      10 |            pp512 |        223.97 ± 5.32 |
| lfm2 2.6B F16                  |   4.79 GiB |     2.57 B | CPU        |      10 |            pp512 |         92.53 ± 4.14 |
| lfm2 350M F16                  | 676.25 MiB |   354.48 M | CPU        |      10 |            pp512 |       725.52 ± 11.70 |
| lfm2 700M F16                  |   1.38 GiB |   742.49 M | CPU        |      10 |            pp512 |       336.22 ± 12.93 |
```

* Update src/llama-model.cpp

Co-authored-by: Sigbjørn Skjæret <[email protected]>

---------

Co-authored-by: Sigbjørn Skjæret <[email protected]>
…#15815)

* ggml : make gallocr respect the backend's max buffer size
* if the graph requires more memory than can fit into a single allocation, split it into multiple backend buffers
* vulkan: report the actual max allocation size in buffer type interface
* fix missing newline, apple-clang warning
* track size of individual chunks in ggml_dyn_tallocr and raise max chunks. revert to use suballocation_block_size as max chunk size for vulkan.
* track (chunk, offset) pairs instead of "global" offsets through gallocr.
* simpler, don't need loops to map between local/global offsets
* touches more code
* fix dyn_tallocr_max_size and initialization
* fix memory leak when buffers are reused due to same buffer type appearing multiple times
* make vbuffer allocation follow the same logic as backend_buffer did before
* continue to use leftover unallocated space of previous chunks after a new one has been created
* treat free blocks of each chunk as separate list
* they're still allocated together, but start/end of each chunk is tracked, and allocate/free iterate over sub-ranges
* exhaust freed blocks of all chunks before considering their last blocks with unallocated space
* start with 0 chunks/blocks and create chunks as needed
* allow the last chunk to grow beyond max size
* refactor: move adding new free block and new chunk into separate functions
* allocate chunks individually with a separate free-blocks list for each one
* needs a bit more memory/allocations/indirections, but code is simpler
* fix warnings (missing static) & debug checks
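For orientation, a hedged sketch of the data layout the bullets above describe: per-chunk free-block lists addressed by (chunk, offset) pairs. The names are illustrative only; the real `ggml_dyn_tallocr` in ggml-alloc differs in detail.

```cpp
#include <cstddef>
#include <vector>

// Illustrative layout only (not the actual ggml-alloc code): each chunk is a
// separate backend buffer with its own free-block list, and allocations are
// identified by a (chunk, offset) pair instead of a single global offset.
struct free_block { size_t offset, size; };

struct tallocr_chunk {
    size_t max_size;                     // capped by the backend's max buffer size
    size_t used;                         // end of the allocated region in this chunk
    std::vector<free_block> free_blocks; // holes left by freed tensors
};

struct tallocr_addr { int chunk; size_t offset; }; // replaces a "global" offset

struct dyn_tallocr_sketch {
    size_t max_chunk_size;               // backend's max single allocation size
    std::vector<tallocr_chunk> chunks;   // starts empty, chunks added on demand
};
```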
* llama: print memory breakdown on exit
add @AlekseiNikiforovIBM to owners of zDNN backend Signed-off-by: Aaron Teo <[email protected]>
* run the x64 ci on regular machines * set up the same thing for arm fix test-quantize-perf just like #12306 * try to disable sve * add another sve run
add @Andreas-Krebbel to owners of zDNN backend Signed-off-by: Aaron Teo <[email protected]>
Use RPC_DEBUG environment variable to enable debug messages. Add helper macro LOG_DBG() which does an early check of the env var before calling GGML_LOG_DEBUG(). Make sure we log a debug message for every server function.
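A minimal sketch of how such an env-var-gated macro might look; `GGML_LOG_DEBUG` is the existing ggml logging macro, while the include and the caching detail are assumptions, not the exact ggml-rpc code.

```cpp
#include <cstdlib>
#include "ggml-impl.h" // assumed: provides the printf-style GGML_LOG_DEBUG macro

// Check RPC_DEBUG once and cache the result so the hot path stays cheap.
static bool rpc_debug_enabled(void) {
    static const bool enabled = std::getenv("RPC_DEBUG") != nullptr;
    return enabled;
}

// Early-out before any formatting happens when debug logging is disabled.
#define LOG_DBG(...) \
    do { if (rpc_debug_enabled()) GGML_LOG_DEBUG(__VA_ARGS__); } while (0)
```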
* metal : fuse NORM + MUL + ADD * metal : support norms of non-multiple of 4 * cont : fix comment [no ci]
This commit adds support for using an externally started llama-server instance for the server tests. This can be enabled by setting the DEBUG_EXTERNAL environment variable. The motivation for this is to allow debugging of the server itself when investigating a test failure. Instructions for how to do this are added to the README.md file in the tests directory.
This commit adds support for passing a prompt file to the model conversion targets/scripts. It also updates the logits.cpp to print out embedding information in the same format as when running the original embedding model. The motivation for this is that it allows us to pass files of different sizes when running the converted models and validating the logits. This can be particularly important when testing the sliding window functionality of models where the sequence length needs to exceed a certain number of tokens to trigger the sliding window logic.
* CUDA: add a fused top-K MoE kernel

  This kernel does the following:
  1. softmax over the logits per token [n_experts, n_tokens]
  2. argmax reduce over the top-k (n_experts_used) logits
  3. write weights + ids to global memory

  It is intended as fusion of softmax->top-k->get_rows pipeline for MoE models

* Refactor into ggml_cuda_should_use_topk_moe
* Review: Use better coalescing pattern, use WARP_SIZE, store logits into registers before
* Review: format + micro-optimizations
* Fix bug: fix tie breakers
* Add optional norm + clean-up code
* Use smem for final write
* Add bounds check
* Use better memory pattern for writeback
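For reference, a plain CPU sketch of the three fused steps (per-token softmax, top-k selection, weight/id writeback). The function name and memory layout are illustrative and not the CUDA kernel's interface.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <numeric>
#include <vector>

// Illustrative CPU reference of the softmax -> top-k -> gather pipeline.
void topk_moe_reference(const float * logits,   // [n_tokens * n_experts]
                        float * weights,        // [n_tokens * n_experts_used]
                        int   * ids,            // [n_tokens * n_experts_used]
                        int n_tokens, int n_experts, int n_experts_used) {
    std::vector<float> probs(n_experts);
    std::vector<int>   order(n_experts);
    for (int t = 0; t < n_tokens; ++t) {
        const float * row = logits + (size_t) t * n_experts;
        // 1. numerically stable softmax over this token's expert logits
        const float maxl = *std::max_element(row, row + n_experts);
        float sum = 0.0f;
        for (int e = 0; e < n_experts; ++e) {
            probs[e] = std::exp(row[e] - maxl);
            sum += probs[e];
        }
        for (int e = 0; e < n_experts; ++e) probs[e] /= sum;
        // 2. select the n_experts_used experts with the largest probabilities
        std::iota(order.begin(), order.end(), 0);
        std::partial_sort(order.begin(), order.begin() + n_experts_used, order.end(),
                          [&](int a, int b) { return probs[a] > probs[b]; });
        // 3. write the selected weights and expert ids for this token
        for (int k = 0; k < n_experts_used; ++k) {
            weights[t * n_experts_used + k] = probs[order[k]];
            ids    [t * n_experts_used + k] = order[k];
        }
    }
}
```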
Link to Java JNA bindings to llama.cpp native libraries
* vendor: update miniaudio.h Signed-off-by: Aaron Teo <[email protected]> * vendor: update miniaudio.h Signed-off-by: Aaron Teo <[email protected]> --------- Signed-off-by: Aaron Teo <[email protected]>
* add GroveMoE support * remove constexpr that fails on certain compilers * revert crude scalar div implementation, use cast * build_attn_inp_kv_unified -> build_attn_inp_kv * fix build_attn * re-apply ffn_exps regex changes
Signed-off-by: Xiaodong Ye <[email protected]>
Signed-off-by: Xiaodong Ye <[email protected]>
* ci : create git tags for released docker images When releasing a docker image for build number X, we should also create the corresponding git tag. This allows users to easily checkout the corresponding source tree for given docker image. * Update .github/workflows/docker.yml Co-authored-by: Sigbjørn Skjæret <[email protected]> * Update .github/workflows/docker.yml Co-authored-by: Sigbjørn Skjæret <[email protected]> * Apply suggestion from @CISC Co-authored-by: Sigbjørn Skjæret <[email protected]> --------- Co-authored-by: Sigbjørn Skjæret <[email protected]>
* ggml-cpu: impl mxfp4 s390x
* ggml-cpu: missing s = sumf
* ggml-cpu: fix incorrect kval_mxfp4 type
* ggml-cpu: rework mxfp4
* ggml-cpu: missing delta calc
* ggml-cpu: fix typo
* ggml-cpu: fix typo for vec_splats
* ggml-cpu: expand to 2 blocks per loop
* ggml-cpu: add unroll to boost perf
* ggml-cpu: back to 1 block per loop to test perf
* Revert "ggml-cpu: back to 1 block per loop to test perf"
  This reverts commit 1fe55724e2dc295701101bf838bdd4a512237492.
* ggml-cpu: rm unroll from single block

---------

Signed-off-by: Aaron Teo <[email protected]>
Signed-off-by: Adrien Gallouët <[email protected]>
Uses the technique used in the vulkan PR #16641. Neat trick!
This is similar to the CUDA shader from #16130, but doesn't use shared memory and handles different subgroup sizes.
Signed-off-by: Aaron Teo <[email protected]>
…t) (#16664) * devops: initial patch Signed-off-by: Aaron Teo <[email protected]> * devops: forgot the z15 suffix Signed-off-by: Aaron Teo <[email protected]> * devops: attempt at impl GGML_CPU_ALL_VARIANTS for s390x Signed-off-by: Aaron Teo <[email protected]> * devops: rm baseline version Signed-off-by: Aaron Teo <[email protected]> --------- Signed-off-by: Aaron Teo <[email protected]>
add Granite 4 models mapping their embedding dimensions to the # of parameters. Information taken from https://huggingface.co/ibm-granite/granite-4.0-h-tiny Signed-off-by: Giuseppe Scrivano <[email protected]>
The unexpected pooling_type warning was incorrectly shown when users did not specify the --pooling-type parameter. In this case, the parameter defaults to `LLAMA_POOLING_TYPE_UNSPECIFIED` (-1), and the code automatically applies the model's default pooling type.

Example of the spurious warning:
```
$ llama-embedding -hf ggml-org/bge-m3-Q8_0-GGUF -p "hello"
...
llama_init_from_model: model default pooling_type is [2], but [-1] was specified
...
```
This fix ensures the warning only appears when users explicitly specify a pooling type that differs from the model's default (e.g., using --pooling-type mean on a model that expects CLS pooling).
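A hedged sketch of the corrected condition, using the `llama_pooling_type` enum from llama.h; the actual warning site and variable names in the codebase may differ.

```cpp
#include <cstdio>
#include "llama.h"

// Warn only when the user explicitly requested a pooling type (i.e. not
// LLAMA_POOLING_TYPE_UNSPECIFIED) and it differs from the model's default.
static void warn_if_pooling_overridden(enum llama_pooling_type requested,
                                       enum llama_pooling_type model_default) {
    if (requested != LLAMA_POOLING_TYPE_UNSPECIFIED && requested != model_default) {
        std::fprintf(stderr,
                     "model default pooling_type is [%d], but [%d] was specified\n",
                     (int) model_default, (int) requested);
    }
}
```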
…613)

* SYCL: Add support for FLOOR, CEIL, ROUND and TRUNC unary operators
  Clean up unrelated changes from previous commit
* Chore: remove empty lines and fix indentation
* Clean up: remove leftover blank lines and fix spacing
* chore: fix trailing whitespace and ensure final newline
* Cleanup: remove redundant declarations already defined in header
* Sync docs/ops.md with updated backend operation support
* docs: update ops.md after rebase
* docs: update ops.md - Vulkan supports SSM_CONV and SSM_SCAN
Signed-off-by: deadprogram <[email protected]>
## Why it failed
When compiling with strict compiler flags (-Wmissing-braces -Werror=missing-braces),
the build fails with the following error:
```
cmake \
-S . \
-B ../llama.cpp.build \
--preset=x64-linux-gcc-debug \
-DCMAKE_INSTALL_PREFIX=/tmp/local \
-DCMAKE_CXX_FLAGS="-Wmissing-braces -Werror=missing-braces" && \
cmake --build ../llama.cpp.build/
...
In file included from /home/otegami/work/cpp/llama.cpp/src/llama-graph.h:4,
from /home/otegami/work/cpp/llama.cpp/src/llama-model.h:5,
from /home/otegami/work/cpp/llama.cpp/src/llama.cpp:8:
/home/otegami/work/cpp/llama.cpp/src/llama-batch.h:126:48: error: missing braces around initializer for 'std::__array_traits<int, 1>::_Type' {aka 'int [1]'} [-Werror=missing-braces]
126 | std::array<llama_seq_id, 1> seq_id_0 = { 0 }; // default sequence id
| ^
cc1plus: some warnings being treated as errors
```
The issue is that warning-free std::array initialization requires double braces: one set for the struct and one for the inner C-style array.
## How to fix
This PR changes `{ 0 }` to `{{ 0 }}` for std::array initialization.
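For illustration, a standalone snippet (not the actual llama-batch.h code) showing the two initialization forms:

```cpp
#include <array>

// std::array is an aggregate wrapping a C-style array, so the fully braced
// form needs an outer brace for the struct and an inner brace for the array.
// The single-brace form is valid C++ but trips -Wmissing-braces on GCC/Clang.
static std::array<int, 1> warns_with_flag   = { 0 };   // -Wmissing-braces fires here
static std::array<int, 1> fully_braced_form = {{ 0 }}; // no warning
```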
This is part of a series of commits to fix missing braces warnings across the codebase.
- src/llama-batch.h <- This PR is here.
- src/llama-context.cpp
- tests/test-backend-ops.cpp
- tests/test-gguf.cpp
- tools/mtmd/clip.cpp
Benefits:
- std::array is a struct containing a C-style array, requiring nested braces
- Enables stricter compiler warnings to catch potential issues
…rsations (#16327)

* feat: Per-conversation loading states and tracking streaming stats
* chore: update webui build output
* refactor: Chat state management
  Consolidates loading state management by using a global `isLoading` store synchronized with individual conversation states. This change ensures proper reactivity and avoids potential race conditions when updating the UI based on the loading status of different conversations. It also improves the accuracy of statistics displayed. Additionally, slots service methods are updated to use conversation IDs for per-conversation state management, avoiding global state pollution.
* feat: Adds loading indicator to conversation items
* chore: update webui build output
* fix: Fix aborting chat streaming
  Improves the chat stream abortion process by ensuring that partial responses are saved before the abort signal is sent. This avoids a race condition where the onError callback could clear the streaming state before the partial response is saved. Additionally, the stream reading loop and callbacks are now checked for abort signals to prevent further processing after abortion.
* refactor: Remove redundant comments
* chore: build webui static output
* refactor: Cleanup
* chore: update webui build output
* chore: update webui build output
* fix: Conversation loading indicator for regenerating messages
* chore: update webui static build
* feat: Improve configuration
* feat: Install `http-server` as dev dependency to not need to rely on `npx` in CI
* webui : added download action (#13552)
* webui : import and export (for all conversations)
* webui : fixed download-format, import of one conversation
* webui : add ExportedConversations type for chat import/export
* feat: Update naming & order
* chore: Linting
* feat: Import/Export UX improvements
* chore: update webui build output
* feat: Update UI placement of Import/Export tab in Chat Settings Dialog
* refactor: Cleanup
  chore: update webui build output
* feat: Enable shift-click multiple conversation items selection
* chore: update webui static build
* chore: update webui static build

---------

Co-authored-by: Sascha Rogmann <[email protected]>
* fix: Prevent premature submission on IME input * chore: update webui static build * refactor: Put IME completion checker in a helper function and add checking for `KeyboardEvent.eventKey === 229` * chore: update webui static build * chore: update webui static build * chore: update webui static build
* add BailingMoeV2 support
* update llm types
* undo
* undo
* update llm types
* add model collection link
* update
* almost working
* correct group selection and rename n_group_exp
* avoid large top_k and use argmax instead for now
  if we had something like argmax2 that would be equivalent, but this works fine until then
* poke
* skip group selection when there are no tokens
* fix 1T conversion
* hopefully fixed expert group selection
  third time's the charm?
* make expert group selection generally available
  The new LLaDA2Moe model uses this method too, make it generally available regardless of architecture.
* allow n_expert_groups to be 1 (Kimi K2)
* address review suggestions
* sycl: add PAD_REFLECT_D1 operator support * docs(ops): regenerate docs/ops.md * remove trailing whitespaces * style: fix editorconfig issues — trim trailing spaces and normalize EOLs * fix: move PAD_REFLECT_1D case outside of fall-through block
Co-authored-by: safranowith <[email protected]> Co-authored-by: ye-NX <[email protected]>
Access the complete analysis in the LOCI Dashboard.

Performance Analysis Summary: LLaMA.cpp Project

Critical Function Performance Changes

Based on the analysis of version …

- Response Time Degradations
- Throughput Degradations
- Bottleneck Degradations

KPI Impact Analysis

1. Tokens Per Second Impact

No direct impact identified: the core inference functions show no performance degradation.

Assessment: based on the reference that 2 ms slower …

2. Power Consumption Impact

Binary-level analysis: …

Assessment: power consumption remains stable across all binaries with negligible changes.

3. Quantization Efficiency

No impact detected.

4. Memory Usage Impact

Affected functions: …

Key changes: …

5. Batch Processing Impact

Affected functions: …

Root Cause Analysis

- Control flow investigation: the CFG analysis of …
- SYCL integration impact: the addition of Flash Attention support introduces …

Action Items for Performance Improvement

- Immediate code optimizations
- Build system improvements
- Performance monitoring focus areas: the analysis shows that while individual function changes are minimal (0.05-0.11%), the cumulative effect of binary layout changes requires attention to …

The current changes maintain stable inference performance while introducing new SYCL capabilities, with optimization opportunities focused on build-time and memory layout improvements rather than algorithmic changes.
Force-pushed from b655780 to 94ec54d
Mirrored from ggml-org/llama.cpp#16969
This PR adds basic Flash Attention support for the SYCL backend, enabling more efficient attention computation on Intel GPUs.
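For readers new to the technique, a single-query CPU sketch of the streaming-softmax formulation that flash attention builds on; the actual SYCL kernels tile over blocks of K/V, handle masks and multiple heads, and differ substantially from this illustration.

```cpp
#include <algorithm>
#include <cmath>
#include <vector>

// Attention output for one query row computed in a single streaming pass over
// the keys/values, keeping a running max (m) and running normalizer (l)
// instead of materializing the full n_kv-wide score row.
std::vector<float> flash_attn_row(const std::vector<float> & q,              // [d]
                                  const std::vector<std::vector<float>> & K, // [n_kv][d]
                                  const std::vector<std::vector<float>> & V, // [n_kv][d]
                                  float scale) {
    const size_t d = q.size();
    std::vector<float> o(d, 0.0f);
    float m = -INFINITY; // running max of the scores seen so far
    float l = 0.0f;      // running sum of exp(score - m)
    for (size_t j = 0; j < K.size(); ++j) {
        float s = 0.0f;
        for (size_t i = 0; i < d; ++i) s += q[i] * K[j][i];
        s *= scale;
        const float m_new = std::max(m, s);
        const float corr  = std::exp(m - m_new); // rescale previous accumulators
        const float p     = std::exp(s - m_new);
        for (size_t i = 0; i < d; ++i) o[i] = o[i] * corr + p * V[j][i];
        l = l * corr + p;
        m = m_new;
    }
    for (size_t i = 0; i < d; ++i) o[i] /= l; // final softmax normalization
    return o;
}
```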
Authors:
Joint work by @safranowith and @ye-NX
Notes: