UPSTREAM PR #16817: Implement SparseK Attention mechanism — new GGML operator with CPU backend (GPU planned next) #4
Conversation
…or definition and tensor creation, backend implementation pending to ggml.c/h
Co-authored-by: Yael Shuker <[email protected]>
Co-authored-by: Gitty Burstein <[email protected]>
Access the complete analysis in the LOCI Dashboard.
Performance Analysis Summary: SparseK Attention Implementation (PR #4)
Key Findings
Performance Degradations Identified
Core Function Impact Assessment
The performance degradations do not affect core llama.cpp functions. All degraded functions are C++ standard library components:
Root Cause Analysis
Environmental Degradation: All affected functions remain byte-for-byte identical between versions, confirming that the performance changes stem from:
Flame Graph & CFG Analysis Insights
Critical Code Review Issues
High-Priority Algorithmic Bug:
// BROKEN: Incorrect top-k implementation
if (row[j] < row[k_top]) row[j] = -INFINITY; // Uses k_top as index, not threshold
Missing Core Features:
Actionable Steps
Immediate Critical Fixes (Priority 1)
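Where the Priority 1 fix would apply, here is a minimal sketch of a threshold-based top-k mask, assuming the attention scores for one query row sit in a float array; the names sparsek_topk_mask, row, and n_cols are illustrative, not the PR's actual identifiers, and a production version would avoid the per-row allocation and full sort.

```c
// Sketch only: mask everything below the value of the k-th largest score,
// instead of comparing against row[k_top] as if k_top were an index.
#include <math.h>
#include <stdlib.h>
#include <string.h>

static int cmp_float_desc(const void * a, const void * b) {
    const float fa = *(const float *) a;
    const float fb = *(const float *) b;
    return (fa < fb) - (fa > fb);               // sort in descending order
}

static void sparsek_topk_mask(float * row, int n_cols, int k_top) {
    if (k_top <= 0 || k_top >= n_cols) {
        return;                                 // nothing to mask
    }
    float * tmp = malloc(n_cols * sizeof(float));
    if (!tmp) {
        return;
    }
    memcpy(tmp, row, n_cols * sizeof(float));
    qsort(tmp, n_cols, sizeof(float), cmp_float_desc);
    const float thresh = tmp[k_top - 1];        // value of the k-th largest score
    for (int j = 0; j < n_cols; ++j) {
        if (row[j] < thresh) {
            row[j] = -INFINITY;                 // drop scores below the threshold
        }
    }
    free(tmp);
}
```

Note that ties at the threshold keep more than k_top entries in this sketch; an exact selection would need a tie-breaking rule.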
Performance Optimization (Priority 2)
Code Quality Improvements (Priority 3)
Overall AssessmentChange Impact Evaluation
Maintainability Considerations
Future Performance Outlook
Recommendation: The PR introduces valuable functionality but requires immediate algorithmic fixes before merge. The environmental performance impact on existing functions is acceptable and expected to resolve naturally. Focus should be on correctness and feature completeness rather than the minimal standard library performance variations.
Force-pushed from 1983956 to 326a60a
* Add buffer label and enable dawn-specific toggles to turn off some checks
* Minor set_rows optimization (#4)
* updated optimization, fixed errors
* non vectorized version now dispatches one thread per element
* Simplify
* Change logic for set_rows pipelines
---------
Co-authored-by: Neha Abbas <[email protected]>
Co-authored-by: Neha Abbas <[email protected]>
Co-authored-by: Reese Levine <[email protected]>
* Comment on dawn toggles
* Remove some comments
* Implement overlap binary operators
* Revert "Implement overlap binary operators"
This reverts commit ed710b36f51ab3f53fa13db15c1685dc8678a32a.
* Disable support for non-contiguous binary_op tensors and leave note for future support
---------
Co-authored-by: neha-ha <[email protected]>
Co-authored-by: Neha Abbas <[email protected]>
Co-authored-by: Neha Abbas <[email protected]>
Mirrored from ggml-org/llama.cpp#16817
New Attention Mechanism: SparseK Attention (CPU Backend)
This PR introduces a new attention mechanism called SparseK Attention, implemented from scratch as a new operator within the GGML framework, currently with CPU backend support.
Overview
SparseK Attention is a selective and efficient attention mechanism inspired by Flash Attention, but introduces additional sparsity through top-k selection of attention scores (k_top), a local attention window (win_local), and strided global attention (stride_global), as sketched below.
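As a rough illustration of the structural part of that sparsity (the top-k filter additionally depends on the computed scores), here is a small predicate sketch; win_local and stride_global mirror the parameter names listed under Implementation Details, while sparsek_is_allowed and the exact window/stride semantics are assumptions made for illustration.

```c
// Sketch only: a key position j passes the structural mask for query
// position i if it falls inside the local window around the diagonal or
// lies on the global stride grid; top-k selection is applied afterwards
// to the scores of the surviving positions.
#include <stdbool.h>
#include <stdlib.h>

static bool sparsek_is_allowed(int i, int j, int win_local, int stride_global) {
    const bool in_local_window = abs(i - j) <= win_local;
    const bool on_global_grid  = stride_global > 0 && (j % stride_global) == 0;
    return in_local_window || on_global_grid;
}
```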
Implementation Details
A new operator GGML_OP_SPARSEK_ATTN is defined in ggml.h and ggml.c.
An API function ggml_sparsek_attn() creates the computation node with parameters (k_top, win_local, stride_global); see the usage sketch at the end of this description.
The CPU backend is implemented in ggml-cpu/ops.h, ggml-cpu/ops.cpp, and ggml-cpu.c.
The CPU version includes the scaled attention scores QKᵀ / √d.
Next Steps
Our next goal is to extend SparseK Attention to the SYCL (GPU) backend.
We are submitting this initial CPU implementation first to ensure review, integration, and baseline correctness before introducing GPU acceleration.
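For orientation, here is a hedged sketch of how the new operator might be wired into a ggml graph from user code; the exact signature of ggml_sparsek_attn() (argument order and types) is assumed from the description above rather than copied from the diff, and the tensor shapes and parameter values are arbitrary.

```c
// Hypothetical usage sketch: the ggml_sparsek_attn() call shape shown here
// is assumed from the PR description, not taken from the actual diff.
#include "ggml.h"
#include "ggml-cpu.h"   // ggml_graph_compute_with_ctx lives here on recent trees

int main(void) {
    struct ggml_init_params params = {
        /*.mem_size   =*/ 64 * 1024 * 1024,
        /*.mem_buffer =*/ NULL,
        /*.no_alloc   =*/ false,
    };
    struct ggml_context * ctx = ggml_init(params);

    // Toy shapes: head dimension 64, sequence length 128.
    const int64_t d_head = 64, n_tokens = 128;
    struct ggml_tensor * Q = ggml_new_tensor_2d(ctx, GGML_TYPE_F32, d_head, n_tokens);
    struct ggml_tensor * K = ggml_new_tensor_2d(ctx, GGML_TYPE_F32, d_head, n_tokens);
    struct ggml_tensor * V = ggml_new_tensor_2d(ctx, GGML_TYPE_F32, d_head, n_tokens);

    // Assumed call: keep the top 16 keys, local window of 32, global stride of 64.
    struct ggml_tensor * out = ggml_sparsek_attn(ctx, Q, K, V,
                                                 /*k_top=*/16,
                                                 /*win_local=*/32,
                                                 /*stride_global=*/64);

    struct ggml_cgraph * gf = ggml_new_graph(ctx);
    ggml_build_forward_expand(gf, out);
    ggml_graph_compute_with_ctx(ctx, gf, /*n_threads=*/4);

    ggml_free(ctx);
    return 0;
}
```

If the merged signature differs (for example, taking a params struct instead of three scalars), only the single call site above changes.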
Co-Authors
Co-authored-by: Yael Shuker ([email protected])
Co-authored-by: Gitty Burstein ([email protected])