Commit 3d3b42e

Bug fixes.
1 parent 0dd1865 commit 3d3b42e

File tree

1 file changed: +8 -14 lines changed

_collections/_portal_posts/2025-09-02-improving-triton-flashattention-performance-on-intel-gpu.md

Lines changed: 8 additions & 14 deletions
@@ -63,12 +63,10 @@ in Intel GPUs) using registers.
 Registers are a kind of small and fast memory bank (called Register File) located just beside the compute engine, as
 this can be seen on the following diagrams showing selected parts of an Intel GPU architecture.
 
-![Xe2 GPU Vector engine Illustration]({{ '/assets/images/portal/article-images/2025-09-02-intel-gpu/ComputeUnit.jpg' |
-relative_url }})<br>
+![Xe2 GPU Vector engine Illustration]({{ '/assets/images/portal/article-images/2025-09-02-intel-gpu/ComputeUnit.jpg' | relative_url }})<br>
 *Illustration of an Intel Xe2 GPU Vector engine architecture (simplified)*
 
-![XeCore GPU Illustration]({{ '/assets/images/portal/article-images/2025-09-02-intel-gpu/XeCore.jpg' |
-relative_url }})<br>
+![XeCore GPU Illustration]({{ '/assets/images/portal/article-images/2025-09-02-intel-gpu/XeCore.jpg' | relative_url }})<br>
 *Illustration of an Intel XeCore architecture (simplified)*
 
 Basically, the tensor core reads operands A and B from the *Register File* and then writes the accumulated output C
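
For readers less familiar with Triton, a minimal hypothetical kernel sketches the register usage the hunk above describes: the tiles returned by `tl.load` are materialised in the Register File, and `tl.dot` reads its A and B operands from there. Names and the block size are illustrative, not taken from the post.

```python
import triton
import triton.language as tl

@triton.jit
def dot_tile_kernel(a_ptr, b_ptr, c_ptr, BLOCK: tl.constexpr):
    offs = tl.arange(0, BLOCK)
    ptrs = offs[:, None] * BLOCK + offs[None, :]
    a = tl.load(a_ptr + ptrs)   # the A tile lives in registers from here on
    b = tl.load(b_ptr + ptrs)   # same for the B tile
    c = tl.dot(a, b)            # the tensor core reads A and B from the
                                # Register File and accumulates C in registers
    tl.store(c_ptr + ptrs, c)   # C is written back out from registers
```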
@@ -162,8 +160,7 @@ from Global Memory to the L1 Cache, then the second step is carried out by the `
 Registers, hopefully from the L1 cache if the data is still available in cache (cache hit).
 The diagram below illustrates this process:
 
-![Intel Backend Memory Semantic]({{ '/assets/images/portal/article-images/2025-09-02-intel-gpu/IntelMemory.jpg' |
-relative_url }})<br>
+![Intel Backend Memory Semantic]({{ '/assets/images/portal/article-images/2025-09-02-intel-gpu/IntelMemory.jpg' | relative_url }})<br>
 *Intel Backend Memory Semantic (synchronous)*
 
 Nvidia has chosen to leverage the Shared Local Memory (SMEM) instead of the cache. SMEM is indeed a scratch pad memory
@@ -173,8 +170,7 @@ a memory buffer in SMEM, but also `TritonGPU::LocalLoadOp` and `TritonGPU::Local
 between SMEM and Registers.
 Consequently, the Triton process for loading and storing data (synchronously) in the Nvidia architecture is as follows:
 
-![Nvidia Backend Memory Semantic]({{ '/assets/images/portal/article-images/2025-09-02-intel-gpu/NvidiaMemory.jpg' |
-relative_url }})<br>
+![Nvidia Backend Memory Semantic]({{ '/assets/images/portal/article-images/2025-09-02-intel-gpu/NvidiaMemory.jpg' | relative_url }})<br>
 *Nvidia Backend Memory Semantic (synchronous)*
 
 ---
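
As a rough sketch of the contrast drawn in the two hunks above, the same Triton-level load takes two different paths depending on the backend. The op names are the ones quoted in the post; the lowering shown is schematic, not the literal compiler output.

```python
# The same Triton-level load ...
a = tl.load(a_ptrs)
# ... is lowered differently by each backend (schematic, following the post):
#
# Intel backend:  prefetch   Global Memory -> L1 cache
#                 load       L1 cache      -> Registers  (ideally a cache hit)
#
# Nvidia backend: LocalAllocOp reserves a staging buffer in SMEM, with
#                 LocalStoreOp / LocalLoadOp moving the data between
#                 SMEM and Registers
```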
@@ -199,8 +195,7 @@ So, in our example, if A needs $NumReg_A$ registers to be stored, this means tha
 for A across the loop, and thus the compiler needs to fit the variables used between line 1 and 7 in $N - NumReg_A$
 registers, with $N$ being the total number of registers available.
 
-![variable liveness simple example]({{ '
-/assets/images/portal/article-images/2025-09-02-intel-gpu/liveness_example_annotated.jpg' | relative_url }})<br>
+![variable liveness simple example]({{ '/assets/images/portal/article-images/2025-09-02-intel-gpu/liveness_example_annotated.jpg' | relative_url }})<br>
 *Variable liveness simple example*
 
 It is therefore easy to understand that in such a kernel, if the variable A is large and the kernel processing between
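
The annotated example in the image roughly corresponds to a kernel of the following shape (a hypothetical sketch; the names and "line" markers are illustrative): A becomes live at its load and stays live until its last use, so its registers are unavailable to everything in between.

```python
@triton.jit
def liveness_sketch(a_ptr, b_ptr, out_ptr, BLOCK: tl.constexpr):
    offs = tl.arange(0, BLOCK)
    a = tl.load(a_ptr + offs)        # "line 1": A becomes live (NumReg_A registers)
    b = tl.load(b_ptr + offs)        # everything from here to the last use of A
    x = b * 2.0                      # must fit in the remaining N - NumReg_A
    x = x + tl.sqrt(b)               # registers
    tl.store(out_ptr + offs, a + x)  # "line 7": last use of A; its registers free up
```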
@@ -392,8 +387,7 @@ an [optimization pass](https://github.com/intel/intel-xpu-backend-for-triton/blo
 which aims to reduce variable liveness where possible.
 To this end, the pass attempts to bring load operations closer to the actual uses of the loaded data.
 
-![Reduce Variable Liveness pass diagram]({{ '
-/assets/images/portal/article-images/2025-09-02-intel-gpu/liveness-pass-diagram.jpg' | relative_url }})<br>
+![Reduce Variable Liveness pass diagram]({{ '/assets/images/portal/article-images/2025-09-02-intel-gpu/liveness-pass-diagram.jpg' | relative_url }})<br>
 *Reduce Variable Liveness pass diagram*
 
 The diagram above shows how the compiler pass works to reduce the liveness of `DotOp` operands.
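
Schematically, bringing a `DotOp` operand's load closer to the dot shortens the span over which its registers are occupied. This is an illustrative before/after sketch, not the pass's actual IR rewrite:

```python
# Before: A is loaded well ahead of the dot and stays live across
# everything in between.
a = tl.load(a_ptrs)
# ... long stretch of unrelated work, squeezed into N - NumReg_A registers ...
acc += tl.dot(a, b)

# After: the load is sunk next to its use, so A's liveness (and its
# register pressure) is confined to the dot itself.
# ... the same stretch of work, now with N registers available ...
a = tl.load(a_ptrs)
acc += tl.dot(a, b)
```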
@@ -442,10 +436,10 @@ We have evaluated the performance of Triton FlashAttention v2 on Intel GPU Max P
 The following plots show the normalised performance of the FlashAttention kernel with the *reduce-liveness-pass* enabled
 for different input configurations.
 
-![Normalized performance PVC1100]({{ '/assets/images/portal/article-images/2025-09-02-intel-gpu/pvc1100_new.jpg' | relative_url }})
+![Normalized performance PVC1100]({{ '/assets/images/portal/article-images/2025-09-02-intel-gpu/pvc1100_new.png' | relative_url }})
 *FlashAttention v2 Normalized performance PVC1100*
 
-![Normalized performance PVC1550]({{ '/assets/images/portal/article-images/2025-09-02-intel-gpu/pvc1550_new.jpg' | relative_url }})
+![Normalized performance PVC1550]({{ '/assets/images/portal/article-images/2025-09-02-intel-gpu/pvc1550_new.png' | relative_url }})
 *FlashAttention v2 Normalized performance PVC1550*
 
 The testbed used for these evaluations and a disclaimer can be found [at the bottom](#disclaimer) of this blog post.
