
Conversation

@DajanaV (Collaborator) commented Nov 4, 2025

Mirrored from ggml-org/llama.cpp#16969

This PR adds basic Flash Attention support for the SYCL backend, enabling more efficient attention computation on Intel GPUs.

  • Implemented Flash Attention kernel for SYCL backend
  • Added forward pass implementation with block-wise computation (see the reference sketch after this list)
  • Integrated with existing GGML SYCL infrastructure
  • Support for F32
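
For context on the "block-wise computation" mentioned in the list: the Flash Attention forward pass never materializes the full attention matrix; it walks the KV data in blocks while keeping a running maximum and a running softmax denominator per query row. The snippet below is a minimal CPU reference sketch of that algorithm in plain C++ for F32 tensors; the function name, signature, and single-head layout are illustrative assumptions, not the actual GGML SYCL kernel added here.

```cpp
// Minimal CPU reference sketch of the block-wise Flash Attention forward pass
// (online softmax). Illustrative only; names and layout do not correspond to
// the GGML SYCL kernel in this PR.
#include <algorithm>
#include <cmath>
#include <cstdio>
#include <vector>

// q: [n_q x d], k: [n_kv x d], v: [n_kv x d], out: [n_q x d], row-major F32
static void flash_attn_forward_ref(const float * q, const float * k, const float * v,
                                   float * out, int n_q, int n_kv, int d, int block) {
    const float scale = 1.0f / std::sqrt((float) d);

    for (int iq = 0; iq < n_q; ++iq) {
        float m = -INFINITY;              // running maximum of the logits
        float l = 0.0f;                   // running softmax denominator
        std::vector<float> acc(d, 0.0f);  // running weighted sum of V rows

        // iterate over the KV data block by block; on a GPU this is where
        // K/V tiles would be staged in local memory
        for (int kv0 = 0; kv0 < n_kv; kv0 += block) {
            const int kv1 = std::min(kv0 + block, n_kv);

            for (int ikv = kv0; ikv < kv1; ++ikv) {
                // logit = scale * dot(q_row, k_row)
                float s = 0.0f;
                for (int t = 0; t < d; ++t) {
                    s += q[iq*d + t] * k[ikv*d + t];
                }
                s *= scale;

                // online softmax update: rescale the previous state whenever
                // a new maximum appears, so no full row of logits is stored
                const float m_new = std::max(m, s);
                const float alpha = std::exp(m - m_new);  // rescale factor for old state
                const float p     = std::exp(s - m_new);  // weight of the current key

                l = l * alpha + p;
                for (int t = 0; t < d; ++t) {
                    acc[t] = acc[t] * alpha + p * v[ikv*d + t];
                }
                m = m_new;
            }
        }

        // final normalization of the accumulated output row
        for (int t = 0; t < d; ++t) {
            out[iq*d + t] = acc[t] / l;
        }
    }
}

int main() {
    // tiny smoke test: 2 queries, 4 keys/values, head dim 4, KV block size 2
    const int n_q = 2, n_kv = 4, d = 4, block = 2;
    std::vector<float> q(n_q*d, 0.1f), k(n_kv*d, 0.2f), v(n_kv*d, 0.3f), out(n_q*d);
    flash_attn_forward_ref(q.data(), k.data(), v.data(), out.data(), n_q, n_kv, d, block);
    std::printf("out[0] = %f\n", out[0]);  // uniform inputs -> every output equals 0.3
    return 0;
}
```

In a real SYCL kernel the per-row state (m, l, acc) would typically live in registers and each KV block would be loaded cooperatively into local memory before the inner loop; the sketch above only captures the numerics that make the single-pass, constant-memory softmax possible.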

Authors:
Joint work by @safranowith and @ye-NX

Notes:

  • This is an initial implementation
  • Performance benchmarks and optimizations are planned for future iterations
  • Feedback and suggestions are welcome!

@loci-agentic-ai

Access the complete analysis in the LOCI Dashboard

Performance Analysis Summary

Based on the comprehensive analysis of project 2621b8c0-b5ce-11f0-b333-453f42058aa1, comparing versions bfc046e0-d1d3-46b0-a140-013b4fa1c317 and c451ecc0-5ce3-48ba-866a-585eed18e5f2, the findings indicate that Condition 1 applies: no meaningful performance changes were detected.

Analysis Results

Performance Metrics: The highest percentage changes identified were minimal:

  • Response Time: llm_graph_input_out_ids::can_reuse() showed a 0.096% improvement (0.06 ns)
  • Throughput: std::make_unique<llm_graph_input_pos_bucket>() showed a 0.117% degradation (0.12 ns)
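
As a rough consistency check (assuming the absolute deltas are measured against the same baselines as the percentages), the implied per-call baselines are:

$$
\frac{0.06\ \text{ns}}{0.00096} \approx 62.5\ \text{ns}
\qquad\qquad
\frac{0.12\ \text{ns}}{0.00117} \approx 102.6\ \text{ns}
$$

The first value is consistent with the ~65 ns self time reported for can_reuse() in the flame graph section below, underlining that both deltas are fractions of a nanosecond on sub-microsecond functions.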

Core Function Impact: Neither affected function is part of the critical inference pipeline. The core functions (llama_decode, llama_encode, llama_tokenize) that directly impact tokens per second performance showed no modifications or measurable changes.

Power Consumption: Analysis across all binaries revealed negligible changes (0.0% across all components), with total estimated power consumption remaining stable at approximately 1.31 millijoules.

Flame Graph Analysis: The can_reuse() function exhibits a simple, flat execution profile with no call-stack depth, indicating a leaf function with a self-contained execution time of 65 ns.

CFG Comparison: Both versions contain identical control flow graphs and assembly code for the analyzed function, confirming that performance differences stem from measurement noise rather than code changes.

GitHub Code Review: PR #73 introduces the SYCL Flash Attention implementation but does not modify the functions identified in the performance analysis. The PR adds new functionality without affecting existing performance-critical paths.

Conclusion

The detected performance variations are within measurement precision limits and represent environmental factors rather than functional changes. No core inference functions were modified, ensuring no impact on tokens per second performance. The codebase maintains stable performance characteristics between versions.

@auroralabs-loci auroralabs-loci deleted a comment from loci-agentic-ai bot Nov 4, 2025
@DajanaV DajanaV force-pushed the main branch 25 times, most recently from 5fc2eb6 to 7480137 on November 8, 2025 04:09
@loci-dev loci-dev force-pushed the main branch 30 times, most recently from 9d00b69 to c481809 on December 10, 2025 10:10