
Conversation

@DajanaV (Collaborator) commented Nov 4, 2025

Mirrored from ggml-org/llama.cpp#16969

This PR adds basic Flash Attention support for the SYCL backend, enabling more efficient attention computation on Intel GPUs.

  • Implemented Flash Attention kernel for SYCL backend
  • Added forward pass implementation with block-wise computation (see the reference sketch after this list)
  • Integrated with existing GGML SYCL infrastructure
  • Support for F32
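
For context on the "block-wise computation" mentioned in the list: the Flash Attention forward pass never materializes the full attention matrix; it walks the KV data in blocks while keeping a running maximum and a running softmax denominator per query row. The snippet below is a minimal CPU reference sketch of that algorithm in plain C++ for F32 tensors; the function name, signature, and single-head layout are illustrative assumptions, not the actual GGML SYCL kernel added here.

```cpp
// Minimal CPU reference sketch of the block-wise Flash Attention forward pass
// (online softmax). Illustrative only; names and layout do not correspond to
// the GGML SYCL kernel in this PR.
#include <algorithm>
#include <cmath>
#include <cstdio>
#include <vector>

// q: [n_q x d], k: [n_kv x d], v: [n_kv x d], out: [n_q x d], row-major F32
static void flash_attn_forward_ref(const float * q, const float * k, const float * v,
                                   float * out, int n_q, int n_kv, int d, int block) {
    const float scale = 1.0f / std::sqrt((float) d);

    for (int iq = 0; iq < n_q; ++iq) {
        float m = -INFINITY;              // running maximum of the logits
        float l = 0.0f;                   // running softmax denominator
        std::vector<float> acc(d, 0.0f);  // running weighted sum of V rows

        // iterate over the KV data block by block; on a GPU this is where
        // K/V tiles would be staged in local memory
        for (int kv0 = 0; kv0 < n_kv; kv0 += block) {
            const int kv1 = std::min(kv0 + block, n_kv);

            for (int ikv = kv0; ikv < kv1; ++ikv) {
                // logit = scale * dot(q_row, k_row)
                float s = 0.0f;
                for (int t = 0; t < d; ++t) {
                    s += q[iq*d + t] * k[ikv*d + t];
                }
                s *= scale;

                // online softmax update: rescale the previous state whenever
                // a new maximum appears, so no full row of logits is stored
                const float m_new = std::max(m, s);
                const float alpha = std::exp(m - m_new);  // rescale factor for old state
                const float p     = std::exp(s - m_new);  // weight of the current key

                l = l * alpha + p;
                for (int t = 0; t < d; ++t) {
                    acc[t] = acc[t] * alpha + p * v[ikv*d + t];
                }
                m = m_new;
            }
        }

        // final normalization of the accumulated output row
        for (int t = 0; t < d; ++t) {
            out[iq*d + t] = acc[t] / l;
        }
    }
}

int main() {
    // tiny smoke test: 2 queries, 4 keys/values, head dim 4, KV block size 2
    const int n_q = 2, n_kv = 4, d = 4, block = 2;
    std::vector<float> q(n_q*d, 0.1f), k(n_kv*d, 0.2f), v(n_kv*d, 0.3f), out(n_q*d);
    flash_attn_forward_ref(q.data(), k.data(), v.data(), out.data(), n_q, n_kv, d, block);
    std::printf("out[0] = %f\n", out[0]);  // uniform inputs -> every output equals 0.3
    return 0;
}
```

In a real SYCL kernel the per-row state (m, l, acc) would typically live in registers and each KV block would be loaded cooperatively into local memory before the inner loop; the sketch above only captures the numerics that make the single-pass, constant-memory softmax possible.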

Authors:
Joint work by @safranowith and @ye-NX

Notes:

  • This is an initial implementation
  • Performance benchmarks and optimizations are planned for future iterations
  • Feedback and suggestions are welcome!

@loci-agentic-ai

Access the complete analysis in the LOCI Dashboard

Performance Analysis Summary

Based on the comprehensive analysis of project 2621b8c0-b5ce-11f0-b333-453f42058aa1, comparing versions bfc046e0-d1d3-46b0-a140-013b4fa1c317 and c451ecc0-5ce3-48ba-866a-585eed18e5f2, the findings indicate that Condition 1 applies: no meaningful performance changes were detected.

Analysis Results

Performance Metrics: The highest percentage changes identified were minimal:

  • Response Time: llm_graph_input_out_ids::can_reuse() showed a 0.096% improvement (0.06 ns)
  • Throughput: std::make_unique<llm_graph_input_pos_bucket>() showed a 0.117% degradation (0.12 ns)
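
As a rough consistency check (assuming the absolute deltas are measured against the same baselines as the percentages), the implied per-call baselines are:

$$
\frac{0.06\ \text{ns}}{0.00096} \approx 62.5\ \text{ns}
\qquad\qquad
\frac{0.12\ \text{ns}}{0.00117} \approx 102.6\ \text{ns}
$$

The first value is consistent with the ~65 ns self time reported for can_reuse() in the flame graph section below, underlining that both deltas are fractions of a nanosecond on sub-microsecond functions.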

Core Function Impact: Neither affected function is part of the critical inference pipeline. The core functions (llama_decode, llama_encode, llama_tokenize) that directly impact tokens per second performance showed no modifications or measurable changes.

Power Consumption: Analysis across all binaries revealed negligible changes (0.0% across all components), with total estimated power consumption remaining stable at approximately 1.31 millijoules.

Flame Graph Analysis: The can_reuse() function exhibits a simple, flat execution profile with no call-stack depth, indicating a leaf function with a self-contained execution time of 65 ns.

CFG Comparison: Both versions contain identical control flow graphs and assembly code for the analyzed function, confirming that performance differences stem from measurement noise rather than code changes.

GitHub Code Review: PR #73 introduces the SYCL Flash Attention implementation but does not modify the functions identified in the performance analysis. The PR adds new functionality without affecting existing performance-critical paths.

Conclusion

The detected performance variations are within measurement precision limits and represent environmental factors rather than functional changes. No core inference functions were modified, ensuring no impact on tokens per second performance. The codebase maintains stable performance characteristics between versions.

@auroralabs-loci auroralabs-loci deleted a comment from loci-agentic-ai bot Nov 4, 2025
@DajanaV DajanaV force-pushed the main branch 25 times, most recently from 5fc2eb6 to 7480137 on November 8, 2025 04:09
@loci-dev loci-dev force-pushed the main branch 30 times, most recently from 9d00b69 to c481809 on December 10, 2025 10:10