
Conversation

DajanaV (Contributor) commented Oct 28, 2025

Mirrored from ggml-org/llama.cpp#16817

New Attention Mechanism: SparseK Attention (CPU Backend)

This PR introduces a new attention mechanism called SparseK Attention, implemented from scratch as a new operator within the GGML framework, currently with CPU backend support.


Overview

SparseK Attention is a selective, efficient attention mechanism inspired by Flash Attention; it introduces additional sparsity through:

  • Top-K filtering – keeps only the strongest attention weights.
  • Local windowing – limits attention to a configurable local context.
  • Global stride – adds periodic global connections between tokens.
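
As a rough illustration, the window and stride rules can be read as a per-pair admissibility test applied before Top-K filtering. The sketch below is a minimal interpretation of the parameters named in this PR, not the actual kernel logic; the precise window and stride semantics may differ in the implementation.

```cpp
// Minimal sketch (assumed semantics): may query position i attend to key position j?
// win_local and stride_global are the PR's parameters; the real operator may define them differently.
#include <cstdint>

static bool sparsek_allowed(int64_t i, int64_t j,
                            int32_t win_local, int32_t stride_global) {
    const bool in_window = (j <= i) && (i - j) < (int64_t) win_local;       // causal local band
    const bool on_stride = stride_global > 0 && (j % stride_global) == 0;   // periodic global token
    return in_window || on_stride;
}
```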

Implementation Details

  • Added new operator: GGML_OP_SPARSEK_ATTN defined in ggml.h and ggml.c.
  • Implemented the construction function ggml_sparsek_attn(), which creates a computation node with parameters (k_top, win_local, stride_global); a usage sketch follows this list.
  • Added full CPU backend implementation in:
    • ggml-cpu/ops.h
    • ggml-cpu/ops.cpp
    • ggml-cpu.c
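
For orientation, a hedged usage sketch of the construction function is shown below. The exact argument order and types are assumptions based on the parameters listed above and on typical GGML attention operators; only the parameter names come from the PR description.

```cpp
// Assumed signature: ggml_sparsek_attn(ctx, q, k, v, k_top, win_local, stride_global).
// Everything apart from the parameter names is illustrative and may not match the actual header.
#include "ggml.h"

struct ggml_tensor * build_sparsek_node(struct ggml_context * ctx,
                                        struct ggml_tensor * q,
                                        struct ggml_tensor * k,
                                        struct ggml_tensor * v) {
    const int32_t k_top         = 64;   // keep the 64 strongest scores per query
    const int32_t win_local     = 128;  // local attention window size
    const int32_t stride_global = 256;  // every 256th token is globally visible
    return ggml_sparsek_attn(ctx, q, k, v, k_top, win_local, stride_global);
}
```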

The CPU version includes:

  • Scaled dot-product computation QKᵀ / √d
  • Dynamic Top-K filtering
  • Softmax normalization
  • Multiplication with V
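
For a single query row, the four steps above can be summarized roughly as follows. This is a plain C++ reference sketch rather than the actual ops.cpp kernel, which operates on ggml tensors and strides; the row-major [T][d] layout is an assumption.

```cpp
// Reference sketch of one query row: scores = q·Kᵀ/√d, Top-K masking, softmax, weighted sum with V.
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <functional>
#include <vector>

void sparsek_row(const float * q, const float * K, const float * V,
                 float * out, int64_t T, int64_t d, int32_t k_top) {
    const int64_t kk = std::min<int64_t>(std::max<int64_t>(k_top, 1), T);   // clamp k_top to [1, T]
    std::vector<float> s(T);
    const float scale = 1.0f / std::sqrt((float) d);
    for (int64_t j = 0; j < T; ++j) {                        // QKᵀ / √d
        float dot = 0.0f;
        for (int64_t c = 0; c < d; ++c) dot += q[c] * K[j*d + c];
        s[j] = dot * scale;
    }
    std::vector<float> tmp(s);                               // Top-K: k-th largest score as threshold
    std::nth_element(tmp.begin(), tmp.begin() + (kk - 1), tmp.end(), std::greater<float>());
    const float thresh = tmp[kk - 1];
    for (int64_t j = 0; j < T; ++j) if (s[j] < thresh) s[j] = -INFINITY;
    const float mx = *std::max_element(s.begin(), s.end());  // numerically stable softmax
    float sum = 0.0f;
    for (int64_t j = 0; j < T; ++j) { s[j] = std::exp(s[j] - mx); sum += s[j]; }
    std::fill(out, out + d, 0.0f);                           // weighted sum of V rows
    for (int64_t j = 0; j < T; ++j) {
        const float w = s[j] / sum;
        for (int64_t c = 0; c < d; ++c) out[c] += w * V[j*d + c];
    }
}
```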

Next Steps

Our next goal is to extend SparseK Attention to the SYCL (GPU) backend in order to:

  • Measure and compare performance between CPU and GPU implementations.
  • Optimize kernel execution for sparse attention patterns.
  • Validate correctness and scaling on Intel GPUs.

We are submitting this initial CPU implementation first so that it can be reviewed, integrated, and validated for baseline correctness before GPU acceleration is introduced.


Co-Authors

Co-authored-by: Yael Shuker ([email protected])
Co-authored-by: Gitty Burstein ([email protected])

…or definition and tensor creation, backend implementation pending to ggml.c/h

Co-authored-by: Yael Shuker <[email protected]>
Co-authored-by: Gitty Burstein <[email protected]>
loci-advisor bot commented Oct 28, 2025

Access the complete analysis in the LOCI Dashboard

Performance Analysis Summary: SparseK Attention Implementation (PR #4)

Key Findings

Performance Degradations Identified

  • Response Time: std::pow template function shows 0.066% degradation (108.11 ns vs 108.04 ns)
  • Throughput: std::regex _M_match_multiline char variant shows 0.110% degradation (39.49 ns vs 39.44 ns)
  • Bottleneck: std::regex _M_match_multiline wchar variant shows 0.173% degradation (25.05 ns vs 25.01 ns)
  • Power Consumption: Negligible increase of 0.0001% in libllama.so (0.42 nJ increase)

Core Function Impact Assessment

The performance degradations do not affect core llama.cpp functions. All degraded functions are C++ standard library components:

  • Template instantiation overhead in mathematical operations
  • Regex processing in standard library utilities
  • No impact on critical inference functions (model loading, tokenization, attention mechanisms, sampling)

Root Cause Analysis

Environmental Degradation: All affected functions remain byte-for-byte identical between versions, confirming performance changes stem from:

  • Memory layout modifications due to new SparseK attention code addition
  • Instruction cache pressure from increased binary size (+209 lines)
  • Altered branch prediction patterns in surrounding code

Flame Graph & CFG Analysis Insights

  • Template Overhead Dominance: 92.6% of std::pow execution time spent in template wrapper (100 ns) vs actual computation (8 ns)
  • Inefficient Memory Operations: Redundant stack store/load operations in argument processing
  • Identical Control Flow: No structural changes in degraded functions, confirming environmental impact

Critical Code Review Issues

High-Priority Algorithmic Bug:

```cpp
// BROKEN: Incorrect top-k implementation
if (row[j] < row[k_top]) row[j] = -INFINITY;  // Uses k_top as index, not threshold
```
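
A possible correction, in the spirit of the fix recommended below, is to select the k-th largest score as a threshold rather than indexing the row with k_top. This fragment mirrors the variable names above and assumes the surrounding loop provides row, T, and k_top (plus <algorithm>, <functional>, and <vector>); it uses the same threshold-selection idea as the reference sketch earlier in this thread.

```cpp
// Sketch of a fix: use the k-th largest value of row[0..T) as the cut-off.
const int64_t kk = std::min<int64_t>(std::max<int64_t>(k_top, 1), T);   // bounds-check k_top
std::vector<float> tmp(row, row + T);
std::nth_element(tmp.begin(), tmp.begin() + (kk - 1), tmp.end(), std::greater<float>());
const float thresh = tmp[kk - 1];
for (int64_t j = 0; j < T; ++j) {
    if (row[j] < thresh) row[j] = -INFINITY;
}
```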

Missing Core Features:

  • Local windowing logic not implemented (parameters ignored)
  • Global stride mechanism not implemented
  • No SIMD optimizations for O(T²D) complexity operations

Actionable Steps

Immediate Critical Fixes (Priority 1)

  1. Fix Top-K Algorithm:

    • Implement proper k-th element selection using std::nth_element or sorting
    • Add bounds validation: k_top ≤ sequence_length
    • Add comprehensive unit tests for edge cases
  2. Implement Missing Features:

    • Add local windowing logic using win_local parameter
    • Implement global stride connections using stride_global parameter
    • Validate algorithm correctness against reference implementation
  3. Add Safety Measures:

    • Implement tensor dimension bounds checking
    • Add parameter validation in ggml_sparsek_attn()
    • Prevent buffer overflows in nested loops
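
As an illustration of item 3, the checks inside ggml_sparsek_attn() might look like the following. GGML_ASSERT and the ne[] dimension array are standard GGML conventions, but the specific dimension layout assumed here (ne[0] = head dimension, ne[1] = number of tokens) is a guess and should be matched to the operator's actual tensor shapes.

```cpp
// Hypothetical validation inside ggml_sparsek_attn(); dimension indices are assumptions.
GGML_ASSERT(q != NULL && k != NULL && v != NULL);
GGML_ASSERT(k_top > 0 && (int64_t) k_top <= k->ne[1]);   // cannot keep more scores than keys
GGML_ASSERT(win_local >= 0 && stride_global >= 0);
GGML_ASSERT(q->ne[0] == k->ne[0]);                       // head dimension must match
GGML_ASSERT(k->ne[1] == v->ne[1]);                       // same number of key/value tokens
```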

Performance Optimization (Priority 2)

  1. Optimize Template Overhead:

    • Consider template specializations for common float-integer power operations
    • Eliminate redundant stack operations in std::pow wrapper
    • Evaluate constexpr evaluation for compile-time constants
  2. SparseK Attention Optimization:

    • Implement SIMD vectorization for dot product computations
    • Use cache-friendly memory access patterns
    • Add OpenMP parallelization for batch processing
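
A minimal sketch of the suggested parallelization, assuming the per-row routine sketched earlier in this thread, is shown below. The real GGML CPU backend normally distributes work across its own thread pool rather than OpenMP, so this conveys only the shape of the idea.

```cpp
// OpenMP sketch: one query row per iteration; inner dot products are left to
// compiler auto-vectorization or explicit SIMD intrinsics.
#include <cstdint>
#include <omp.h>

void sparsek_row(const float * q, const float * K, const float * V,
                 float * out, int64_t T, int64_t d, int32_t k_top);  // sketched above

void sparsek_forward(const float * Q, const float * K, const float * V,
                     float * Out, int64_t T, int64_t d, int32_t k_top) {
    #pragma omp parallel for schedule(static)
    for (int64_t i = 0; i < T; ++i) {
        sparsek_row(Q + i*d, K, V, Out + i*d, T, d, k_top);
    }
}
```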

Code Quality Improvements (Priority 3)

  1. Documentation & Testing:

    • Add algorithm complexity analysis and usage documentation
    • Expand test coverage for various tensor dimensions and parameter combinations
    • Implement performance benchmarks against standard attention
  2. Build Optimization:

    • Monitor instruction cache impact of binary size growth
    • Consider function placement optimization to minimize cache pressure

Overall Assessment

Change Impact Evaluation

  • Functionality: Successfully adds new SparseK attention operator to GGML framework
  • Integration Quality: Clean integration following established GGML patterns
  • Performance Impact: Minimal environmental degradation (< 0.2%) with no core function impact
  • Correctness Risk: High due to broken top-k implementation requiring immediate fix

Maintainability Considerations

  • Positive: Follows GGML architectural patterns for operator extension
  • Positive: Comprehensive test infrastructure provides good foundation
  • Concern: Complex algorithm requires better documentation and validation
  • Concern: Missing core features may lead to confusion about operator capabilities

Future Performance Outlook

  • Short-term: Environmental performance impact should stabilize with future builds
  • Medium-term: Proper SIMD optimization will be critical for production performance
  • Long-term: GPU backend implementation will determine practical utility

Recommendation: The PR introduces valuable functionality but requires immediate algorithmic fixes before merge. The environmental performance impact on existing functions is acceptable and expected to resolve naturally. Focus should be on correctness and feature completeness rather than the minimal standard library performance variations.

DajanaV force-pushed the main branch 3 times, most recently from 1983956 to 326a60a on October 29, 2025 at 12:13
DajanaV added the dev-stale label (Stale dev environment — dashboard not accessible) on Oct 30, 2025
DajanaV deleted the branch main on October 30, 2025 at 15:25
DajanaV closed this on Oct 30, 2025
DajanaV deleted the upstream-PR16817-branch_yael-works-feature/sparsek-attn-sycl branch on October 30, 2025 at 15:26